Dev Documentation
Annotation
Core annotation engine for building STR-annotated VCF records.
- strvcf_annotator.core.annotation.make_modified_header(vcf_in: VariantFile) → VariantHeader
Create VCF header with STR-specific INFO and FORMAT fields.
Creates a modified VCF header that includes all original header information plus STR-specific annotations. Replaces existing RU, PERIOD, REF, PERFECT INFO fields and REPCN FORMAT field with STR-specific definitions.
- Parameters:
vcf_in (pysam.VariantFile) – Input VCF file object
- Returns:
New header with STR-specific fields
- Return type:
pysam.VariantHeader
Notes
- INFO fields added/replaced:
RU: Repeat unit
PERIOD: Repeat period (length of unit)
REF: Reference copy number
PERFECT: Indicates perfect repeats in REF and ALT
- FORMAT field added/replaced:
REPCN: Genotype as number of repeat motif copies
- strvcf_annotator.core.annotation.build_new_record(record: VariantRecord, str_row: Dict | Series, header: VariantHeader, parser: BaseVCFParser, ignore_mismatch_warnings: bool = False, mismatch_truth: str = 'panel') → VariantRecord | None
Build annotated VCF record with STR alleles and metadata.
Constructs a new VCF record where alleles represent full repeat sequences (before and after mutation) and adds STR-specific annotations to INFO and FORMAT fields.
- Parameters:
record (pysam.VariantRecord) – Original VCF record with mutation
str_row (Dict or pd.Series) – STR metadata (CHROM, START, END, RU, PERIOD)
header (pysam.VariantHeader) – Modified header with STR fields
parser (BaseVCFParser) – Parser for extracting genotype information
ignore_mismatch_warnings (bool) – If True, do not log mismatch warnings between VCF REF and panel repeat sequence. Annotation continues regardless.
mismatch_truth (str) – Which source to treat as correct when there is a mismatch:
- "panel": trust the panel repeat sequence (default behavior)
- "vcf": trust VCF REF and patch the overlapping part of the panel repeat sequence to match it
- "skip": skip the record with the mismatch
- Returns:
New record with STR alleles and annotations, or None if the record is skipped
- Return type:
Optional[pysam.VariantRecord]
Notes
Logs warning if reference mismatch detected
Calculates repeat copy numbers for REF and ALT
Marks PERFECT=TRUE only if both alleles are perfect repeats
Preserves all original FORMAT fields
Returns None if there is a reference mismatch and mismatch_truth is “skip”
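The three mismatch_truth modes can be pictured with a small stand-alone sketch. The helper resolve_mismatch and its arguments are hypothetical and not part of strvcf_annotator. How the library itself patches the panel sequence is an assumption here; the sketch only compares and patches the part of REF that overlaps the repeat.

```python
from typing import Optional


def resolve_mismatch(repeat_seq: str, repeat_start: int, pos: int, ref: str,
                     mismatch_truth: str = "panel") -> Optional[str]:
    """Hypothetical helper: return the repeat sequence to use, or None to skip."""
    repeat_end = repeat_start + len(repeat_seq) - 1
    ov_start = max(pos, repeat_start)
    ov_end = min(pos + len(ref) - 1, repeat_end)
    if ov_start > ov_end:
        return repeat_seq  # REF does not touch the repeat, nothing to compare

    panel_part = repeat_seq[ov_start - repeat_start:ov_end - repeat_start + 1]
    ref_part = ref[ov_start - pos:ov_end - pos + 1]
    if panel_part.upper() == ref_part.upper():
        return repeat_seq  # no mismatch, all modes behave the same

    if mismatch_truth == "panel":
        return repeat_seq  # keep the panel sequence as-is
    if mismatch_truth == "vcf":
        i = ov_start - repeat_start
        # Overwrite the overlapping slice with the bases from VCF REF
        return repeat_seq[:i] + ref_part + repeat_seq[i + len(ref_part):]
    if mismatch_truth == "skip":
        return None  # caller drops the record
    raise ValueError(f"unknown mismatch_truth: {mismatch_truth!r}")
```

With a panel sequence "CAGCAGCAG" starting at position 100 and a VCF REF of "GTAG" at position 102, "panel" keeps the panel sequence, "vcf" yields "CAGTAGCAG", and "skip" yields None.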
- strvcf_annotator.core.annotation.should_skip_genotype(record: VariantRecord, parser: BaseVCFParser) → bool
Determine if record should be skipped based on genotype filtering.
Skips records where:
- Not exactly 2 samples are present
- Genotypes are invalid or missing
- Both samples have identical genotypes
- Parameters:
record (pysam.VariantRecord) – VCF record to check
parser (BaseVCFParser) – Parser for extracting genotypes
- Returns:
True if record should be skipped, False otherwise
- Return type:
bool
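The three skip rules can be sketched on plain tuples, without pysam. The function should_skip below is a hypothetical stand-in that takes genotypes already extracted by a parser. Treating "identical" as exact tuple equality (so (0, 1) and (1, 0) differ) is an assumption.

```python
from typing import List, Optional, Tuple

Genotype = Optional[Tuple[Optional[int], Optional[int]]]


def should_skip(genotypes: List[Genotype]) -> bool:
    """Hypothetical stand-in operating on one genotype per sample."""
    # Rule 1: exactly two samples are required
    if len(genotypes) != 2:
        return True
    # Rule 2: every genotype must be a complete pair of allele indices
    for gt in genotypes:
        if gt is None or len(gt) != 2 or any(a is None for a in gt):
            return True
    # Rule 3: identical genotypes in both samples carry no difference
    return genotypes[0] == genotypes[1]
```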
Process vcf
VCF file processing and workflow management.
- strvcf_annotator.core.vcf_processor.check_vcf_sorted(vcf_in: VariantFile) → bool
Validate VCF sorting by chromosome and position.
Checks if VCF records are sorted by chromosome and position. Rewinds the file after checking.
- Parameters:
vcf_in (pysam.VariantFile) – Input VCF file
- Returns:
True if VCF is sorted, False otherwise
- Return type:
bool
- strvcf_annotator.core.vcf_processor.reset_and_sort_vcf(vcf_in: VariantFile) → List[VariantRecord]
Sort VCF records in memory when needed.
Loads all VCF records into memory and sorts them by chromosome and position according to the contig order in the header.
- Parameters:
vcf_in (pysam.VariantFile) – Input VCF file
- Returns:
Sorted list of VCF records
- Return type:
List[pysam.VariantRecord]
Notes
This loads the entire VCF into memory, so use with caution for large files.
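Sorting by header contig order, and the matching sortedness check, can be sketched with (chrom, pos) tuples standing in for pysam records. The helper names are hypothetical. Placing contigs that are missing from the header after all known contigs is an assumption.

```python
from typing import Dict, List, Sequence, Tuple

Record = Tuple[str, int]  # (chrom, pos) stand-in for pysam.VariantRecord


def _key(rec: Record, rank: Dict[str, int]) -> Tuple[int, int]:
    chrom, pos = rec
    # Contigs absent from the header sort after all known contigs (assumption)
    return (rank.get(chrom, len(rank)), pos)


def is_sorted(records: Sequence[Record], contigs: Sequence[str]) -> bool:
    """True if records follow header contig order, then position."""
    rank = {name: i for i, name in enumerate(contigs)}
    keys = [_key(r, rank) for r in records]
    return all(a <= b for a, b in zip(keys, keys[1:]))


def sort_records(records: Sequence[Record], contigs: Sequence[str]) -> List[Record]:
    """Return a new list sorted in memory by contig order, then position."""
    rank = {name: i for i, name in enumerate(contigs)}
    return sorted(records, key=lambda r: _key(r, rank))
```

Like the documented function, sort_records holds every record in memory at once.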
- strvcf_annotator.core.vcf_processor.generate_annotated_records(vcf_in: VariantFile, str_df: DataFrame, parser: BaseVCFParser = None, somatic_mode: bool = False, ignore_mismatch_warnings: bool = False, mismatch_truth: str = 'panel') → Iterator[VariantRecord]
Generator yielding annotated VCF records.
Processes VCF records and yields annotated records for variants that overlap with STR regions. Handles sorting if needed and optionally filters records based on genotype criteria. When multiple STR regions overlap the same POS, the generator tries all overlapping STR candidates and picks the first one that produces a meaningful STR allele change.
- Parameters:
vcf_in (pysam.VariantFile) – Input VCF file
str_df (pd.DataFrame) – DataFrame with STR regions (from load_str_reference)
parser (BaseVCFParser, optional) – Parser for genotype extraction. Uses GenericParser if None.
somatic_mode (bool, optional) – Enable somatic filtering. When True, skips variants where both samples have identical genotypes. Default is False.
ignore_mismatch_warnings (bool, optional) – If True, suppresses warnings about reference mismatches between the STR panel and VCF REF alleles. Default is False.
mismatch_truth (str, optional) – Specifies which source to consider as ground truth for mismatches. Options are “panel”, “vcf”, or “skip”. Default is “panel”.
- Yields:
pysam.VariantRecord – Annotated VCF records
Notes
Automatically sorts VCF if not sorted
Skips records without STR overlap
If somatic_mode=True, filters records with identical genotypes
ignore_mismatch_warnings controls logging of reference mismatches
mismatch_truth controls which source is considered ground truth for mismatches
- strvcf_annotator.core.vcf_processor.annotate_vcf_to_file(vcf_path: str, str_df: DataFrame, output_path: str, parser: BaseVCFParser = None, somatic_mode: bool = False, ignore_mismatch_warnings: bool = False, mismatch_truth: str = 'panel') → None
Process VCF file and write annotated output.
Reads a VCF file, annotates variants that overlap with STR regions, and writes the annotated records to an output file.
- Parameters:
vcf_path (str) – Path to input VCF file
str_df (pd.DataFrame) – DataFrame with STR regions (from load_str_reference)
output_path (str) – Path to output VCF file
parser (BaseVCFParser, optional) – Parser for genotype extraction. Uses GenericParser if None.
somatic_mode (bool, optional) – Enable somatic filtering. When True, skips variants where both samples have identical genotypes. Default is False.
ignore_mismatch_warnings (bool) – If True, do not log mismatch warnings between VCF REF and panel repeat sequence. Annotation continues regardless.
mismatch_truth (str) – Which source to treat as correct when there is a mismatch:
- "panel": trust the panel repeat sequence (default behavior)
- "vcf": trust VCF REF and patch the overlapping part of the panel repeat sequence to match it
- "skip": skip the record with the mismatch
Notes
Prints summary statistics after processing.
- strvcf_annotator.core.vcf_processor.process_directory(input_dir: str, str_bed_path: str, output_dir: str, parser: BaseVCFParser = None, somatic_mode: bool = False, ignore_mismatch_warnings: bool = False, mismatch_truth: str = 'panel') → None
Batch process directory of VCF files.
Processes all VCF files in a directory and writes annotated versions to the output directory.
- Parameters:
input_dir (str) – Directory containing input VCF files
str_bed_path (str) – Path to BED file with STR regions
output_dir (str) – Directory for output VCF files
parser (BaseVCFParser, optional) – Parser for genotype extraction. Uses GenericParser if None.
somatic_mode (bool, optional) – Enable somatic filtering. When True, skips variants where both samples have identical genotypes. Default is False.
ignore_mismatch_warnings (bool) – If True, do not log mismatch warnings between VCF REF and panel repeat sequence. Annotation continues regardless.
mismatch_truth (str) – Which source to treat as correct when there is a mismatch:
- "panel": trust the panel repeat sequence (default behavior)
- "vcf": trust VCF REF and patch the overlapping part of the panel repeat sequence to match it
- "skip": skip the record with the mismatch
VCF validation
Input validation functions.
- exception strvcf_annotator.utils.validation.ValidationError
Bases: Exception
Exception raised for validation errors.
- strvcf_annotator.utils.validation.validate_file_path(file_path: str, must_exist: bool = True) → Path
Validate file path.
- Parameters:
file_path (str) – Path to validate
must_exist (bool, optional) – If True, file must exist. Default is True.
- Returns:
Validated Path object
- Return type:
Path
- Raises:
ValidationError – If file path is invalid or doesn’t exist when required
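A minimal sketch of this kind of path check, using only pathlib. Both check_file_path and the local ValidationError class are hypothetical stand-ins. Which inputs the real function counts as "invalid" is not documented here, so rejecting empty strings is an assumption.

```python
from pathlib import Path


class ValidationError(Exception):
    """Local stand-in for strvcf_annotator.utils.validation.ValidationError."""


def check_file_path(file_path: str, must_exist: bool = True) -> Path:
    """Hypothetical sketch: return a Path or raise ValidationError."""
    # Assumption: an empty or non-string path counts as invalid
    if not isinstance(file_path, str) or not file_path.strip():
        raise ValidationError("file path must be a non-empty string")
    path = Path(file_path)
    if must_exist and not path.is_file():
        raise ValidationError(f"file does not exist: {path}")
    return path
```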
- strvcf_annotator.utils.validation.validate_directory_path(dir_path: str, must_exist: bool = True, create: bool = False) → Path
Validate directory path.
- Parameters:
dir_path (str) – Directory path to validate
must_exist (bool, optional) – If True, directory must exist. Default is True.
create (bool, optional) – If True, create directory if it doesn’t exist. Default is False.
- Returns:
Validated Path object
- Return type:
Path
- Raises:
ValidationError – If directory path is invalid
- strvcf_annotator.utils.validation.validate_vcf_file(vcf_path: str) → bool
Validate VCF file format.
- Parameters:
vcf_path (str) – Path to VCF file
- Returns:
True if VCF is valid
- Return type:
bool
- Raises:
ValidationError – If VCF file is invalid or cannot be opened
- strvcf_annotator.utils.validation.validate_bed_file(bed_path: str) → bool
Validate BED file format.
- Parameters:
bed_path (str) – Path to BED file
- Returns:
True if BED is valid
- Return type:
bool
- Raises:
ValidationError – If BED file is invalid or cannot be opened
- strvcf_annotator.utils.validation.validate_str_bed_file(bed_path: str) → bool
Validate STR BED file format with required columns.
- Parameters:
bed_path (str) – Path to STR BED file
- Returns:
True if STR BED is valid
- Return type:
bool
- Raises:
ValidationError – If STR BED file is invalid or missing required columns
VCF utils
Utility functions for VCF processing.
- strvcf_annotator.utils.vcf_utils.chrom_to_order(chrom: str) → int
Map chromosome names like ‘chr1’, ‘chrX’, ‘1’ to an integer order so that sorting is: 1,2,…,22,X,Y,M/MT,others.
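One way to obtain this ordering is sketched below. The function chrom_order is a hypothetical stand-in, and the concrete integers it returns (23 for X, 24 for Y, 25 for M/MT, 26 for everything else) are assumptions. Only the resulting sort order is taken from the description above.

```python
def chrom_order(chrom: str) -> int:
    """Hypothetical sketch: map a chromosome name to a sortable integer."""
    # Strip an optional 'chr' prefix so 'chr1' and '1' rank the same
    name = chrom[3:] if chrom.lower().startswith("chr") else chrom
    if name.isdigit() and 1 <= int(name) <= 22:
        return int(name)
    special = {"X": 23, "Y": 24, "M": 25, "MT": 25}
    # Anything unrecognised sorts after the mitochondrial contig
    return special.get(name.upper(), 26)
```

Used as a sort key, for example sorted(names, key=chrom_order), this places autosomes numerically rather than lexically, so chr2 comes before chr10.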
- strvcf_annotator.utils.vcf_utils.normalize_info_fields(record: VariantRecord, header: VariantHeader) → Dict[str, Any]
Normalize INFO fields for proper VCF serialization.
Handles various INFO field types and ensures they are properly formatted for writing to VCF files. Handles Flags, Strings, and R-type fields specially.
- Parameters:
record (pysam.VariantRecord) – VCF record with INFO fields to normalize
header (pysam.VariantHeader) – VCF header with field definitions
- Returns:
Normalized INFO fields ready for VCF writing
- Return type:
Dict[str, Any]
Notes
Flag fields are included only if True
String fields with multiple values are joined with “|”
R-type fields (REF + ALT) are clipped to first 2 values
Unknown fields are skipped
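The four rules in the notes can be sketched on plain dictionaries. Here normalize_info is a hypothetical stand-in, and the header is modelled as a mapping from field name to a (Number, Type) pair rather than a pysam.VariantHeader. Checking the R-type rule before the String rule is an assumption.

```python
from typing import Any, Dict, Tuple

# field name -> (Number, Type), a simplified model of header INFO definitions
FieldDefs = Dict[str, Tuple[str, str]]


def normalize_info(info: Dict[str, Any], defs: FieldDefs) -> Dict[str, Any]:
    """Hypothetical sketch of the normalization rules on plain dicts."""
    out: Dict[str, Any] = {}
    for key, value in info.items():
        if key not in defs:
            continue  # unknown fields are skipped
        number, ftype = defs[key]
        if ftype == "Flag":
            if value:
                out[key] = True  # flags are kept only when set
        elif number == "R" and isinstance(value, (list, tuple)):
            out[key] = tuple(value[:2])  # keep REF plus first ALT only
        elif ftype == "String" and isinstance(value, (list, tuple)):
            out[key] = "|".join(str(v) for v in value)
        else:
            out[key] = value
    return out
```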
- strvcf_annotator.utils.vcf_utils.get_sample_by_name(record: VariantRecord, sample_name: str) → Any
Get sample data by name from VCF record.
- Parameters:
record (pysam.VariantRecord) – VCF record
sample_name (str) – Name of sample to retrieve
- Returns:
Sample data object
- Return type:
Any
- Raises:
KeyError – If sample name not found in record
- strvcf_annotator.utils.vcf_utils.get_sample_by_index(record: VariantRecord, sample_idx: int) → Any
Get sample data by index from VCF record.
- Parameters:
record (pysam.VariantRecord) – VCF record
sample_idx (int) – Index of sample to retrieve
- Returns:
Sample data object
- Return type:
Any
- Raises:
IndexError – If sample index out of range
- strvcf_annotator.utils.vcf_utils.has_format_field(record: VariantRecord, field_name: str) → bool
Check if FORMAT field exists in any sample.
- Parameters:
record (pysam.VariantRecord) – VCF record
field_name (str) – Name of FORMAT field to check
- Returns:
True if field exists in at least one sample
- Return type:
bool
STR reference processing
STR reference management for BED file loading and region lookups.
- strvcf_annotator.core.str_reference.load_str_reference(str_path: str) → DataFrame
Load STR reference data from BED file.
Loads a BED file containing STR (Short Tandem Repeat) regions and converts coordinates from 0-based BED format to 1-based VCF format. Calculates the number of repeat units for each region.
- Parameters:
str_path (str) – Path to BED file with STR regions
- Returns:
DataFrame with columns: CHROM, START, END, PERIOD, RU, COUNT
- CHROM: Chromosome name
- START: 1-based start position (converted from BED 0-based)
- END: 1-based end position
- PERIOD: Length of repeat unit
- RU: Repeat unit sequence
- COUNT: Number of repeat units in the region
- Return type:
pd.DataFrame
Notes
BED files use 0-based coordinates, but VCF files use 1-based coordinates. This function converts START positions by adding 1. END positions are kept as-is since BED END is exclusive and VCF END is inclusive.
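The coordinate conversion can be illustrated on a single BED line in plain Python. The helper bed_line_to_region is hypothetical. The column order (chrom, start, end, period, repeat unit) and the COUNT formula (region length divided by PERIOD, rounded down) are assumptions, since the text above does not specify them.

```python
from typing import Dict, Union


def bed_line_to_region(line: str) -> Dict[str, Union[str, int]]:
    """Hypothetical sketch: convert one tab-separated BED line to a 1-based region."""
    # Assumed column order: chrom, start, end, period, repeat unit
    chrom, start, end, period, ru = line.rstrip("\n").split("\t")[:5]
    start_1 = int(start) + 1  # BED start is 0-based, VCF is 1-based
    end_1 = int(end)          # BED end is exclusive, so it equals the 1-based inclusive end
    period_i = int(period)
    length = end_1 - start_1 + 1
    return {
        "CHROM": chrom,
        "START": start_1,
        "END": end_1,
        "PERIOD": period_i,
        "RU": ru,
        "COUNT": length // period_i,  # assumed formula for the repeat count
    }
```

For example, the BED interval 99 to 114 with period 3 becomes positions 100 through 114 (15 bases), which gives COUNT = 5 under the assumed formula.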
- strvcf_annotator.core.str_reference.find_overlapping_str(str_df: DataFrame, chrom: str, pos: int, end: int) → Dict | None
Find STR region overlapping with variant coordinates.
Searches for an STR region that overlaps with the given variant position. Uses efficient binary search on sorted DataFrame.
- Parameters:
str_df (pd.DataFrame) – DataFrame with STR regions (from load_str_reference)
chrom (str) – Chromosome name
pos (int) – Variant start position (1-based)
end (int) – Variant end position (1-based)
- Returns:
Dictionary with STR region data if overlap found, None otherwise. Contains keys: CHROM, START, END, PERIOD, RU, COUNT
- Return type:
Optional[Dict]
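A binary-search lookup of this kind can be sketched with the standard bisect module. The function find_overlap is a hypothetical stand-in that works on a list of region dictionaries for a single chromosome instead of a DataFrame. It assumes the regions are sorted by START and do not overlap each other.

```python
from bisect import bisect_right
from typing import Dict, List, Optional


def find_overlap(regions: List[Dict], starts: List[int],
                 pos: int, end: int) -> Optional[Dict]:
    """Hypothetical sketch: return a region overlapping [pos, end], or None.

    `regions` must be sorted by START and non-overlapping (assumption);
    `starts` is the precomputed list of their START values.
    """
    # Index of the last region that starts at or before the variant end
    i = bisect_right(starts, end) - 1
    # It overlaps only if it also reaches the variant start
    if i >= 0 and regions[i]["END"] >= pos:
        return regions[i]
    return None
```

Precomputing starts once keeps each lookup logarithmic in the number of regions.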
- strvcf_annotator.core.str_reference.get_str_at_position(str_df: DataFrame, chrom: str, pos: int) → Dict | None
Get STR region containing a specific position.
- Parameters:
str_df (pd.DataFrame) – DataFrame with STR regions (from load_str_reference)
chrom (str) – Chromosome name
pos (int) – Position to query (1-based)
- Returns:
Dictionary with STR region data if position is within an STR, None otherwise
- Return type:
Optional[Dict]
Repeat utils module
Utilities for repeat sequence operations.
- strvcf_annotator.core.repeat_utils.extract_repeat_sequence(str_row: Dict | Series) → str
Reconstruct repeat sequence from STR metadata.
Generates the full repeat sequence by repeating the repeat unit (RU) the calculated number of times (COUNT).
- Parameters:
str_row (Dict or pd.Series) – STR region data containing ‘RU’ (repeat unit) and ‘COUNT’ (number of repeats)
- Returns:
Full repeat sequence
- Return type:
str
Examples
>>> str_row = {'RU': 'CAG', 'COUNT': 5}
>>> extract_repeat_sequence(str_row)
'CAGCAGCAGCAGCAG'
- strvcf_annotator.core.repeat_utils.count_repeat_units(sequence: str, motif: str) → int
Return the longest contiguous run of motif in sequence.
The function looks for exact, non-overlapping copies of motif that occur consecutively and returns the maximum number of such copies in any run.
This corresponds to how STR repeat counts are typically defined: the length of the longest perfect contiguous block of the repeat unit.
- Parameters:
sequence (str) – DNA sequence to search.
motif (str) – Repeat unit motif to count (e.g. ‘A’, ‘CAG’).
- Returns:
Length of the longest contiguous run of motif in sequence.
- Return type:
int
- Raises:
ValueError – If motif is empty or if either argument is not a string.
Examples
Perfect repeats
>>> count_repeat_units("CAGCAGCAG", "CAG")
3
Imperfect tail
>>> count_repeat_units("CAGCAGCA", "CAG")
2
No repeats
>>> count_repeat_units("ATCG", "CAG")
0
Homopolymer runs
>>> count_repeat_units("ATAAAAA", "A")
5
>>> count_repeat_units("AAAATAAA", "A")  # longest contiguous run is 'AAAA'
4
Overlapping motifs
‘AAAA’ with motif ‘AA’ contains two non-overlapping copies: ‘AA’ ‘AA’
>>> count_repeat_units("AAAA", "AA")
2
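The "longest contiguous run" definition can be sketched as a brute-force scan. The function longest_motif_run is a hypothetical stand-in, not the library's implementation. It tries every start offset, because a longer run may begin inside a shorter one.

```python
def longest_motif_run(sequence: str, motif: str) -> int:
    """Hypothetical sketch: longest run of consecutive exact copies of motif."""
    if not isinstance(sequence, str) or not isinstance(motif, str):
        raise ValueError("sequence and motif must be strings")
    if not motif:
        raise ValueError("motif must not be empty")

    m = len(motif)
    best = 0
    # Try every start offset and count back-to-back copies from there
    for i in range(len(sequence) - m + 1):
        run = 0
        j = i
        while sequence.startswith(motif, j):
            run += 1
            j += m
        best = max(best, run)
    return best
```

The comparison is case-sensitive, matching the "exact copies" wording above.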
- strvcf_annotator.core.repeat_utils.normalize_variant(pos: int, ref: str, alt: str) → Tuple[int, str, str]
Locally normalize (pos, ref, alt) by trimming shared prefix/suffix.
pos is 1-based VCF coordinate.
Trimming is case-insensitive.
We always keep at least 1 base in ref and alt if they differ.
- Returns:
new_pos, new_ref, new_alt
- Return type:
Tuple[int, str, str]
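Local trimming of this kind can be sketched as two loops. The function trim_variant is a hypothetical stand-in. Trimming the shared suffix before the shared prefix is an assumption, because the order is not stated above.

```python
from typing import Tuple


def trim_variant(pos: int, ref: str, alt: str) -> Tuple[int, str, str]:
    """Hypothetical sketch: trim shared suffix, then shared prefix."""
    # Shared suffix: drop matching trailing bases, keep at least one base each
    while len(ref) > 1 and len(alt) > 1 and ref[-1].upper() == alt[-1].upper():
        ref, alt = ref[:-1], alt[:-1]
    # Shared prefix: drop matching leading bases and shift the position
    while len(ref) > 1 and len(alt) > 1 and ref[0].upper() == alt[0].upper():
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt
```

For instance, (100, "GCAGT", "GCT") trims to (101, "CAG", "C"): the shared trailing T is removed first, then the shared leading G, which shifts the position by one.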
- strvcf_annotator.core.repeat_utils.apply_variant_to_repeat(pos: int, ref: str, alt: str, repeat_start: int, repeat_seq: str) → str
Apply a variant to an STR repeat sequence.
The function applies a VCF variant to a reference STR sequence while respecting VCF normalization rules and STR boundaries.
The algorithm works as follows:
- Normalize pos, ref, and alt by trimming shared prefixes and suffixes.
- If the normalized variant lies fully inside the STR region, apply the full ALT allele.
- If the variant partially overlaps the STR region:
  - SNP-like variants (len(ref) == len(alt)) are aligned positionally.
  - Indel-like variants (len(ref) != len(alt)) use the suffix of ALT that overlaps the STR.
- Any parts of the variant outside the STR window are ignored.
Notes
The genomic reference is conceptually treated as:
repeat_seq + UNKNOWN_SUFFIX
Differences outside the STR window do not affect the resulting mutated STR sequence.
Case handling rules:
All matching and overlap logic is case-insensitive.
Output case follows the overlapping STR segment:
lowercase STR slice → lowercase ALT
uppercase STR slice → uppercase ALT
mixed case → ALT is left unchanged
- Parameters:
pos (int) – Variant position (1-based VCF coordinate).
ref (str) – Reference allele from the VCF record.
alt (str) – Alternate allele from the VCF record.
repeat_start (int) – Start position of the STR region (1-based).
repeat_seq (str) – Reference STR sequence from the panel.
- Returns:
The mutated STR sequence after applying the variant. If the variant does not overlap the STR region, the original repeat_seq is returned unchanged.
- Return type:
str
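The simplest case, a variant lying fully inside the STR region, can be sketched as a string splice. The function apply_inside_variant is a hypothetical stand-in. It deliberately omits normalization, partial overlaps, and the case-handling rules, and it raises instead of handling anything outside the STR window.

```python
def apply_inside_variant(pos: int, ref: str, alt: str,
                         repeat_start: int, repeat_seq: str) -> str:
    """Hypothetical sketch: splice ALT over REF for a fully contained variant."""
    # Convert the 1-based genomic position to a 0-based offset in repeat_seq
    offset = pos - repeat_start
    if offset < 0 or offset + len(ref) > len(repeat_seq):
        raise ValueError("this sketch only handles variants fully inside the STR")
    # Replace the REF-length slice with the ALT allele
    return repeat_seq[:offset] + alt + repeat_seq[offset + len(ref):]
```

With repeat_seq "CAGCAGCAG" starting at position 100, the deletion GCAG>G at position 102 gives "CAGCAG" (one unit lost), and the insertion G>GCAG at the same position gives "CAGCAGCAGCAG" (one unit gained).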
- strvcf_annotator.core.repeat_utils.is_perfect_repeat(sequence: str, motif: str) → bool
Check if sequence is a perfect repeat of the motif.
A perfect repeat means the sequence consists entirely of exact copies of the motif with no interruptions or variations.
- Parameters:
sequence (str) – DNA sequence to check
motif (str) – Repeat unit motif
- Returns:
True if sequence is a perfect repeat, False otherwise
- Return type:
bool
Examples
>>> is_perfect_repeat('CAGCAGCAG', 'CAG')
True
>>> is_perfect_repeat('CAGCAGCA', 'CAG')
False
Parser class
Abstract base class for VCF parsers.
- class strvcf_annotator.parsers.base.BaseVCFParser
Bases: ABC
Abstract base class for VCF parsers.
Defines the interface for extracting genotype and variant information from VCF records in a standardized way. All parser implementations must inherit from this class and implement all abstract methods.
- abstractmethod get_genotype(record: VariantRecord, sample_idx: int) → Tuple[int, int] | None
Extract genotype as (allele1, allele2) or None if unknown.
- Parameters:
record (pysam.VariantRecord) – The VCF record to extract genotype from
sample_idx (int) – Index of the sample in the record
- Returns:
Genotype as tuple of allele indices (0=REF, 1=ALT), or None if unknown
- Return type:
Optional[Tuple[int, int]]
- abstractmethod has_variant(record: VariantRecord, sample_idx: int) → bool
Check if sample has variant even when GT is unknown.
- Parameters:
record (pysam.VariantRecord) – The VCF record to check
sample_idx (int) – Index of the sample in the record
- Returns:
True if variant is present, False otherwise
- Return type:
bool
- abstractmethod extract_info(record: VariantRecord, sample_idx: int) → Dict[str, Any]
Extract additional fields (AD, DP, etc.) as dictionary.
- Parameters:
record (pysam.VariantRecord) – The VCF record to extract information from
sample_idx (int) – Index of the sample in the record
- Returns:
Dictionary of additional FORMAT fields
- Return type:
Dict[str, Any]
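The shape of this interface can be sketched without pysam. ParserInterface below is a local stand-in that mirrors the three abstract methods, and DictRecordParser is a toy subclass. Both names are hypothetical. The toy models a record as a list of per-sample dictionaries, and its fallback to AD when GT is unknown is an assumption about what "alternative evidence" could mean.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional, Tuple


class ParserInterface(ABC):
    """Local stand-in mirroring the three documented abstract methods."""

    @abstractmethod
    def get_genotype(self, record: Any, sample_idx: int) -> Optional[Tuple[int, int]]: ...

    @abstractmethod
    def has_variant(self, record: Any, sample_idx: int) -> bool: ...

    @abstractmethod
    def extract_info(self, record: Any, sample_idx: int) -> Dict[str, Any]: ...


class DictRecordParser(ParserInterface):
    """Toy parser: a record is a list of dicts such as {'GT': (0, 1), 'DP': 30}."""

    def get_genotype(self, record, sample_idx):
        gt = record[sample_idx].get("GT")
        if gt is None or len(gt) != 2 or any(a is None for a in gt):
            return None  # missing or incomplete genotype
        return (gt[0], gt[1])

    def has_variant(self, record, sample_idx):
        gt = self.get_genotype(record, sample_idx)
        if gt is not None:
            return any(a > 0 for a in gt)
        # Assumption: with GT unknown, non-zero ALT depth counts as evidence
        ad = record[sample_idx].get("AD")
        return bool(ad) and len(ad) > 1 and any(bool(d) for d in ad[1:])

    def extract_info(self, record, sample_idx):
        # Everything except GT is returned as additional FORMAT data
        return {k: v for k, v in record[sample_idx].items() if k != "GT"}
```

Because all three methods are abstract, instantiating ParserInterface directly raises TypeError. A concrete parser has to implement every one of them.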
Generic parser
Generic parser for standard VCF format fields.
- class strvcf_annotator.parsers.generic.GenericParser
Bases: BaseVCFParser
Generic parser for standard VCF format fields.
Handles standard VCF FORMAT fields including GT (genotype), AD (allelic depth), and DP (total depth). Provides robust error handling for missing or invalid data.
- get_genotype(record: VariantRecord, sample_idx: int) → Tuple[int, int] | None
Extract GT field, return None for missing/invalid genotypes.
- Parameters:
record (pysam.VariantRecord) – The VCF record to extract genotype from
sample_idx (int) – Index of the sample in the record
- Returns:
Genotype as tuple of allele indices, or None if missing/invalid
- Return type:
Optional[Tuple[int, int]]
- has_variant(record: VariantRecord, sample_idx: int) → bool
Check variant presence using GT or alternative evidence.
- Parameters:
record (pysam.VariantRecord) – The VCF record to check
sample_idx (int) – Index of the sample in the record
- Returns:
True if variant is present, False otherwise
- Return type:
bool
- extract_info(record: VariantRecord, sample_idx: int) → Dict[str, Any]
Extract AD, DP and other standard FORMAT fields.
- Parameters:
record (pysam.VariantRecord) – The VCF record to extract information from
sample_idx (int) – Index of the sample in the record
- Returns:
Dictionary of FORMAT fields (AD, DP, etc.)
- Return type:
Dict[str, Any]