Dev Documentation

Annotation

Core annotation engine for building STR-annotated VCF records.

strvcf_annotator.core.annotation.make_modified_header(vcf_in: VariantFile) → VariantHeader

Create VCF header with STR-specific INFO and FORMAT fields.

Creates a modified VCF header that includes all original header information plus STR-specific annotations. Replaces existing RU, PERIOD, REF, PERFECT INFO fields and REPCN FORMAT field with STR-specific definitions.

Parameters:

vcf_in (pysam.VariantFile) – Input VCF file object

Returns:

New header with STR-specific fields

Return type:

pysam.VariantHeader

Notes

INFO fields added/replaced:
  • RU: Repeat unit

  • PERIOD: Repeat period (length of unit)

  • REF: Reference copy number

  • PERFECT: Indicates perfect repeats in REF and ALT

FORMAT field added/replaced:
  • REPCN: Genotype as number of repeat motif copies

strvcf_annotator.core.annotation.build_new_record(record: VariantRecord, str_row: Dict | Series, header: VariantHeader, parser: BaseVCFParser, ignore_mismatch_warnings: bool = False, mismatch_truth: str = 'panel') → VariantRecord | None

Build annotated VCF record with STR alleles and metadata.

Constructs a new VCF record where alleles represent full repeat sequences (before and after mutation) and adds STR-specific annotations to INFO and FORMAT fields.

Parameters:
  • record (pysam.VariantRecord) – Original VCF record with mutation

  • str_row (Dict or pd.Series) – STR metadata (CHROM, START, END, RU, PERIOD)

  • header (pysam.VariantHeader) – Modified header with STR fields

  • parser (BaseVCFParser) – Parser for extracting genotype information

  • ignore_mismatch_warnings (bool) – If True, do not log mismatch warnings between VCF REF and panel repeat sequence. Annotation continues regardless.

  • mismatch_truth (str) –

    Which source to treat as correct when there is a mismatch:
    • “panel”: trust panel repeat sequence (default behavior)

    • “vcf”: trust VCF REF and patch the overlapping part of the panel repeat sequence to match it

    • “skip”: skip record with mismatch

Returns:

New record with STR alleles and annotations, or None if the record is skipped

Return type:

Optional[pysam.VariantRecord]

Notes

  • Logs warning if reference mismatch detected

  • Calculates repeat copy numbers for REF and ALT

  • Marks PERFECT=TRUE only if both alleles are perfect repeats

  • Preserves all original FORMAT fields

  • Returns None if there is a reference mismatch and mismatch_truth is “skip”
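The three mismatch_truth policies can be pictured with a small standalone sketch. This is not the library's code: the function name resolve_reference_mismatch and the offset parameter (0-based index of the VCF REF inside the panel sequence) are hypothetical and exist only for illustration.

```python
from typing import Optional


def resolve_reference_mismatch(
    panel_seq: str,
    vcf_ref: str,
    offset: int,
    mismatch_truth: str = "panel",
) -> Optional[str]:
    """Return the repeat sequence to annotate against, or None to skip.

    offset is a hypothetical 0-based index in panel_seq where vcf_ref starts.
    """
    # Only the part of REF that falls inside the repeat is compared.
    overlap = panel_seq[offset:offset + len(vcf_ref)]
    if overlap.upper() == vcf_ref[:len(overlap)].upper():
        return panel_seq  # no mismatch, nothing to resolve

    if mismatch_truth == "panel":
        return panel_seq  # keep the panel sequence untouched
    if mismatch_truth == "vcf":
        # Patch the overlapping slice so the panel agrees with VCF REF.
        patched = vcf_ref[:len(overlap)]
        return panel_seq[:offset] + patched + panel_seq[offset + len(overlap):]
    if mismatch_truth == "skip":
        return None  # caller drops the record
    raise ValueError(f"unknown mismatch_truth: {mismatch_truth!r}")
```

With panel_seq "CAGCAGCAG", vcf_ref "CAT" and offset 3, the “vcf” policy in this sketch yields "CAGCATCAG", while “skip” yields None.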

strvcf_annotator.core.annotation.should_skip_genotype(record: VariantRecord, parser: BaseVCFParser) → bool

Determine if record should be skipped based on genotype filtering.

Skips records where:
  • Not exactly 2 samples present

  • Genotypes are invalid or missing

  • Both samples have identical genotypes

Parameters:
  • record (pysam.VariantRecord) – VCF record to check

  • parser (BaseVCFParser) – Parser for extracting genotypes

Returns:

True if record should be skipped, False otherwise

Return type:

bool
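The skip criteria can be sketched on plain genotype tuples, without pysam. The helper name should_skip is hypothetical, and treating "identical" as plain tuple equality (so (0, 1) and (1, 0) count as different) is an assumption of this sketch, not a documented behavior.

```python
from typing import List, Optional, Tuple

Genotype = Optional[Tuple[int, int]]


def should_skip(genotypes: List[Genotype]) -> bool:
    """Mirror the three documented skip criteria on plain tuples."""
    # Criterion 1: exactly two samples are required.
    if len(genotypes) != 2:
        return True
    first, second = genotypes
    # Criterion 2: a missing or invalid genotype is represented as None.
    if first is None or second is None:
        return True
    # Criterion 3: identical genotypes carry no difference between samples.
    return first == second
```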

Process VCF

VCF file processing and workflow management.

strvcf_annotator.core.vcf_processor.check_vcf_sorted(vcf_in: VariantFile) → bool

Validate VCF sorting by chromosome and position.

Checks if VCF records are sorted by chromosome and position. Rewinds the file after checking.

Parameters:

vcf_in (pysam.VariantFile) – Input VCF file

Returns:

True if VCF is sorted, False otherwise

Return type:

bool
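A sortedness check of this kind can be sketched over plain (chrom, pos) pairs. The name is_sorted and the contig_rank mapping are hypothetical; how the library actually ranks chromosomes is not stated here, so the ranking is passed in explicitly.

```python
from typing import Dict, Iterable, Tuple


def is_sorted(records: Iterable[Tuple[str, int]], contig_rank: Dict[str, int]) -> bool:
    """Return True if records are ordered by (contig rank, position)."""
    previous = None
    for chrom, pos in records:
        key = (contig_rank[chrom], pos)
        # A record that sorts before its predecessor breaks the order.
        if previous is not None and key < previous:
            return False
        previous = key
    return True
```

Because the check consumes the iterable, a file-backed reader has to be rewound afterwards, which matches the rewind step described above.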

strvcf_annotator.core.vcf_processor.reset_and_sort_vcf(vcf_in: VariantFile) → List[VariantRecord]

Sort VCF records in memory when needed.

Loads all VCF records into memory and sorts them by chromosome and position according to the contig order in the header.

Parameters:

vcf_in (pysam.VariantFile) – Input VCF file

Returns:

Sorted list of VCF records

Return type:

List[pysam.VariantRecord]

Notes

This loads the entire VCF into memory, so use with caution for large files.
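Sorting by header contig order, rather than alphabetically, can be sketched as follows. The function name sort_by_contig_order and the tuple-based records are stand-ins for illustration; the real function operates on pysam records.

```python
from typing import List, Sequence, Tuple


def sort_by_contig_order(
    records: Sequence[Tuple[str, int]],
    contig_order: Sequence[str],
) -> List[Tuple[str, int]]:
    """Sort (chrom, pos) pairs by the position of chrom in contig_order."""
    # Rank each contig by where it appears in the header-like list.
    rank = {name: index for index, name in enumerate(contig_order)}
    return sorted(records, key=lambda rec: (rank[rec[0]], rec[1]))
```

A plain string sort would place "chr10" before "chr2"; ranking by header order avoids that.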

strvcf_annotator.core.vcf_processor.generate_annotated_records(vcf_in: VariantFile, str_df: DataFrame, parser: BaseVCFParser = None, somatic_mode: bool = False, ignore_mismatch_warnings: bool = False, mismatch_truth: str = 'panel') → Iterator[VariantRecord]

Generator yielding annotated VCF records.

Processes VCF records and yields annotated records for variants that overlap with STR regions. Handles sorting if needed and optionally filters records based on genotype criteria. When multiple STR regions overlap the same POS, all overlapping STR candidates are tried and the first one that produces a meaningful STR allele change is used.

Parameters:
  • vcf_in (pysam.VariantFile) – Input VCF file

  • str_df (pd.DataFrame) – DataFrame with STR regions (from load_str_reference)

  • parser (BaseVCFParser, optional) – Parser for genotype extraction. Uses GenericParser if None.

  • somatic_mode (bool, optional) – Enable somatic filtering. When True, skips variants where both samples have identical genotypes. Default is False.

  • ignore_mismatch_warnings (bool, optional) – If True, suppresses warnings about reference mismatches between the STR panel and VCF REF alleles. Default is False.

  • mismatch_truth (str, optional) – Specifies which source to consider as ground truth for mismatches. Options are “panel”, “vcf”, or “skip”. Default is “panel”.

Yields:

pysam.VariantRecord – Annotated VCF records

Notes

  • Automatically sorts VCF if not sorted

  • Skips records without STR overlap

  • If somatic_mode=True, filters records with identical genotypes

  • ignore_mismatch_warnings controls logging of reference mismatches

  • mismatch_truth controls which source is considered ground truth for mismatches
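The candidate selection for overlapping STR regions might look roughly like the sketch below. It assumes that a "meaningful" change means the mutated sequence differs from the candidate's reference sequence; that definition, the SEQ key, and the names first_changed_candidate and mutate are all hypothetical.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple


def first_changed_candidate(
    candidates: Iterable[Dict[str, str]],
    mutate: Callable[[Dict[str, str]], str],
) -> Optional[Tuple[Dict[str, str], str]]:
    """Return the first candidate whose mutated sequence differs from its reference."""
    for candidate in candidates:
        mutated = mutate(candidate)
        # Assumption: "meaningful" means the allele sequence actually changed.
        if mutated != candidate["SEQ"]:
            return candidate, mutated
    return None  # no candidate was affected by the variant
```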

strvcf_annotator.core.vcf_processor.annotate_vcf_to_file(vcf_path: str, str_df: DataFrame, output_path: str, parser: BaseVCFParser = None, somatic_mode: bool = False, ignore_mismatch_warnings: bool = False, mismatch_truth: str = 'panel') → None

Process VCF file and write annotated output.

Reads a VCF file, annotates variants that overlap with STR regions, and writes the annotated records to an output file.

Parameters:
  • vcf_path (str) – Path to input VCF file

  • str_df (pd.DataFrame) – DataFrame with STR regions (from load_str_reference)

  • output_path (str) – Path to output VCF file

  • parser (BaseVCFParser, optional) – Parser for genotype extraction. Uses GenericParser if None.

  • somatic_mode (bool, optional) – Enable somatic filtering. When True, skips variants where both samples have identical genotypes. Default is False.

  • ignore_mismatch_warnings (bool) – If True, do not log mismatch warnings between VCF REF and panel repeat sequence. Annotation continues regardless.

  • mismatch_truth (str) –

    Which source to treat as correct when there is a mismatch:
    • “panel”: trust panel repeat sequence (default behavior)

    • “vcf”: trust VCF REF and patch the overlapping part of the panel repeat sequence to match it

    • “skip”: skip record with mismatch

Notes

Prints summary statistics after processing.

strvcf_annotator.core.vcf_processor.process_directory(input_dir: str, str_bed_path: str, output_dir: str, parser: BaseVCFParser = None, somatic_mode: bool = False, ignore_mismatch_warnings: bool = False, mismatch_truth: str = 'panel') → None

Batch process directory of VCF files.

Processes all VCF files in a directory and writes annotated versions to the output directory.

Parameters:
  • input_dir (str) – Directory containing input VCF files

  • str_bed_path (str) – Path to BED file with STR regions

  • output_dir (str) – Directory for output VCF files

  • parser (BaseVCFParser, optional) – Parser for genotype extraction. Uses GenericParser if None.

  • somatic_mode (bool, optional) – Enable somatic filtering. When True, skips variants where both samples have identical genotypes. Default is False.

  • ignore_mismatch_warnings (bool) – If True, do not log mismatch warnings between VCF REF and panel repeat sequence. Annotation continues regardless.

  • mismatch_truth (str) –

    Which source to treat as correct when there is a mismatch:
    • “panel”: trust panel repeat sequence (default behavior)

    • “vcf”: trust VCF REF and patch the overlapping part of the panel repeat sequence to match it

    • “skip”: skip record with mismatch

VCF validation

Input validation functions.

exception strvcf_annotator.utils.validation.ValidationError

Bases: Exception

Exception raised for validation errors.

strvcf_annotator.utils.validation.validate_file_path(file_path: str, must_exist: bool = True) → Path

Validate file path.

Parameters:
  • file_path (str) – Path to validate

  • must_exist (bool, optional) – If True, file must exist. Default is True.

Returns:

Validated Path object

Return type:

Path

Raises:

ValidationError – If file path is invalid or doesn’t exist when required

strvcf_annotator.utils.validation.validate_directory_path(dir_path: str, must_exist: bool = True, create: bool = False) → Path

Validate directory path.

Parameters:
  • dir_path (str) – Directory path to validate

  • must_exist (bool, optional) – If True, directory must exist. Default is True.

  • create (bool, optional) – If True, create directory if it doesn’t exist. Default is False.

Returns:

Validated Path object

Return type:

Path

Raises:

ValidationError – If directory path is invalid

strvcf_annotator.utils.validation.validate_vcf_file(vcf_path: str) → bool

Validate VCF file format.

Parameters:

vcf_path (str) – Path to VCF file

Returns:

True if VCF is valid

Return type:

bool

Raises:

ValidationError – If VCF file is invalid or cannot be opened

strvcf_annotator.utils.validation.validate_bed_file(bed_path: str) → bool

Validate BED file format.

Parameters:

bed_path (str) – Path to BED file

Returns:

True if BED is valid

Return type:

bool

Raises:

ValidationError – If BED file is invalid or cannot be opened

strvcf_annotator.utils.validation.validate_str_bed_file(bed_path: str) → bool

Validate STR BED file format with required columns.

Parameters:

bed_path (str) – Path to STR BED file

Returns:

True if STR BED is valid

Return type:

bool

Raises:

ValidationError – If STR BED file is invalid or missing required columns

VCF utils

Utility functions for VCF processing.

strvcf_annotator.utils.vcf_utils.chrom_to_order(chrom: str) → int

Map chromosome names like ‘chr1’, ‘chrX’, ‘1’ to an integer order so that sorting is: 1,2,…,22,X,Y,M/MT,others.
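One way to obtain this ordering is sketched below. The function name chrom_order and the specific integers (23 for X, 24 for Y, 25 for M/MT, 26 for everything else) are assumptions chosen for illustration; only the resulting sort order is taken from the description above.

```python
def chrom_order(chrom: str) -> int:
    """Map a chromosome name to a sortable integer (illustrative values)."""
    # Drop an optional 'chr' prefix so 'chr1' and '1' rank the same.
    name = chrom[3:] if chrom.lower().startswith("chr") else chrom
    # Autosomes sort numerically, not alphabetically.
    if name.isdecimal() and 1 <= int(name) <= 22:
        return int(name)
    special = {"X": 23, "Y": 24, "M": 25, "MT": 25}
    # Anything unrecognized sorts after the known chromosomes.
    return special.get(name.upper(), 26)
```

Used as a sort key, for example sorted(names, key=chrom_order), this places "chr2" before "chr10", which a plain string sort would not.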

strvcf_annotator.utils.vcf_utils.normalize_info_fields(record: VariantRecord, header: VariantHeader) → Dict[str, Any]

Normalize INFO fields for proper VCF serialization.

Handles various INFO field types and ensures they are properly formatted for writing to VCF files. Handles Flags, Strings, and R-type fields specially.

Parameters:
  • record (pysam.VariantRecord) – VCF record with INFO fields to normalize

  • header (pysam.VariantHeader) – VCF header with field definitions

Returns:

Normalized INFO fields ready for VCF writing

Return type:

Dict[str, Any]

Notes

  • Flag fields are included only if True

  • String fields with multiple values are joined with “|”

  • R-type fields (REF + ALT) are clipped to first 2 values

  • Unknown fields are skipped
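The four rules above can be sketched on plain dictionaries. Here schema is a hypothetical stand-in for the header, mapping each INFO key to a (Type, Number) pair; the name normalize_info and the order in which the rules are checked are assumptions of this sketch.

```python
from typing import Any, Dict, Tuple


def normalize_info(info: Dict[str, Any], schema: Dict[str, Tuple[str, str]]) -> Dict[str, Any]:
    """Apply the documented normalization rules to a plain INFO dict."""
    normalized: Dict[str, Any] = {}
    for key, value in info.items():
        if key not in schema:
            continue  # unknown fields are skipped
        field_type, number = schema[key]
        if field_type == "Flag":
            if value:
                normalized[key] = True  # flags are kept only when set
        elif field_type == "String" and isinstance(value, (list, tuple)):
            normalized[key] = "|".join(str(v) for v in value)
        elif number == "R" and isinstance(value, (list, tuple)):
            normalized[key] = tuple(value[:2])  # keep REF plus first ALT
        else:
            normalized[key] = value
    return normalized
```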

strvcf_annotator.utils.vcf_utils.get_sample_by_name(record: VariantRecord, sample_name: str) → Any

Get sample data by name from VCF record.

Parameters:
  • record (pysam.VariantRecord) – VCF record

  • sample_name (str) – Name of sample to retrieve

Returns:

Sample data object

Return type:

Any

Raises:

KeyError – If sample name not found in record

strvcf_annotator.utils.vcf_utils.get_sample_by_index(record: VariantRecord, sample_idx: int) → Any

Get sample data by index from VCF record.

Parameters:
  • record (pysam.VariantRecord) – VCF record

  • sample_idx (int) – Index of sample to retrieve

Returns:

Sample data object

Return type:

Any

Raises:

IndexError – If sample index out of range

strvcf_annotator.utils.vcf_utils.has_format_field(record: VariantRecord, field_name: str) → bool

Check if FORMAT field exists in any sample.

Parameters:
  • record (pysam.VariantRecord) – VCF record

  • field_name (str) – Name of FORMAT field to check

Returns:

True if field exists in at least one sample

Return type:

bool

STR reference processing

STR reference management for BED file loading and region lookups.

strvcf_annotator.core.str_reference.load_str_reference(str_path: str) → DataFrame

Load STR reference data from BED file.

Loads a BED file containing STR (Short Tandem Repeat) regions and converts coordinates from 0-based BED format to 1-based VCF format. Calculates the number of repeat units for each region.

Parameters:

str_path (str) – Path to BED file with STR regions

Returns:

DataFrame with columns: CHROM, START, END, PERIOD, RU, COUNT
  • CHROM: Chromosome name

  • START: 1-based start position (converted from BED 0-based)

  • END: 1-based end position

  • PERIOD: Length of repeat unit

  • RU: Repeat unit sequence

  • COUNT: Number of repeat units in the region

Return type:

pd.DataFrame

Notes

BED files use 0-based coordinates, but VCF files use 1-based coordinates. This function converts START positions by adding 1. END positions are kept as-is since BED END is exclusive and VCF END is inclusive.
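The coordinate conversion can be checked with a short sketch. The helper names are hypothetical, and the COUNT formula shown (region length divided by PERIOD, rounded down) is an assumption; the text above only says that the number of repeat units is calculated.

```python
from typing import Tuple


def bed_to_vcf_interval(bed_start: int, bed_end: int) -> Tuple[int, int]:
    """Convert a 0-based, half-open BED interval to 1-based, inclusive."""
    # Only START shifts; an exclusive 0-based END equals an inclusive 1-based END.
    return bed_start + 1, bed_end


def repeat_count(start: int, end: int, period: int) -> int:
    """Number of whole repeat units in a 1-based inclusive region (assumed formula)."""
    return (end - start + 1) // period
```

For a BED interval [10, 25) the converted region is 11 to 25 inclusive, which still covers 15 bases, so a 3-base unit fits 5 times under this formula.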

strvcf_annotator.core.str_reference.find_overlapping_str(str_df: DataFrame, chrom: str, pos: int, end: int) → Dict | None

Find STR region overlapping with variant coordinates.

Searches for an STR region that overlaps with the given variant position. Uses efficient binary search on sorted DataFrame.

Parameters:
  • str_df (pd.DataFrame) – DataFrame with STR regions (from load_str_reference)

  • chrom (str) – Chromosome name

  • pos (int) – Variant start position (1-based)

  • end (int) – Variant end position (1-based)

Returns:

Dictionary with STR region data if an overlap is found, None otherwise. Contains keys: CHROM, START, END, PERIOD, RU, COUNT

Return type:

Optional[Dict]
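A binary-search lookup of this kind can be sketched with the standard bisect module. The sketch uses per-chromosome lists of tuples instead of a DataFrame and assumes the regions on each chromosome are sorted by START and do not overlap each other; the name find_overlap and the data layout are hypothetical.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple

Region = Tuple[int, int, str]  # (START, END, RU), 1-based inclusive


def find_overlap(
    regions_by_chrom: Dict[str, List[Region]],
    chrom: str,
    pos: int,
    end: int,
) -> Optional[Dict[str, object]]:
    """Return the region overlapping [pos, end], or None."""
    regions = regions_by_chrom.get(chrom, [])
    # In real use this list would be built once, not on every call.
    starts = [region[0] for region in regions]
    # Last region that starts at or before the variant end.
    idx = bisect_right(starts, end) - 1
    if idx < 0:
        return None
    start, region_end, ru = regions[idx]
    # It overlaps only if it also reaches the variant start.
    if region_end >= pos:
        return {"CHROM": chrom, "START": start, "END": region_end, "RU": ru}
    return None
```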

strvcf_annotator.core.str_reference.get_str_at_position(str_df: DataFrame, chrom: str, pos: int) → Dict | None

Get STR region containing a specific position.

Parameters:
  • str_df (pd.DataFrame) – DataFrame with STR regions (from load_str_reference)

  • chrom (str) – Chromosome name

  • pos (int) – Position to query (1-based)

Returns:

Dictionary with STR region data if position is within an STR, None otherwise

Return type:

Optional[Dict]

Repeat utils module

Utilities for repeat sequence operations.

strvcf_annotator.core.repeat_utils.extract_repeat_sequence(str_row: Dict | Series) → str

Reconstruct repeat sequence from STR metadata.

Generates the full repeat sequence by repeating the repeat unit (RU) the calculated number of times (COUNT).

Parameters:

str_row (Dict or pd.Series) – STR region data containing ‘RU’ (repeat unit) and ‘COUNT’ (number of repeats)

Returns:

Full repeat sequence

Return type:

str

Examples

>>> str_row = {'RU': 'CAG', 'COUNT': 5}
>>> extract_repeat_sequence(str_row)
'CAGCAGCAGCAGCAG'

strvcf_annotator.core.repeat_utils.count_repeat_units(sequence: str, motif: str) → int

Return the longest contiguous run of motif in sequence.

The function looks for exact, non-overlapping copies of motif that occur consecutively and returns the maximum number of such copies in any run.

This corresponds to how STR repeat counts are typically defined: the length of the longest perfect contiguous block of the repeat unit.

Parameters:
  • sequence (str) – DNA sequence to search.

  • motif (str) – Repeat unit motif to count (e.g. ‘A’, ‘CAG’).

Returns:

Length of the longest contiguous run of motif in sequence.

Return type:

int

Raises:

ValueError – If motif is empty or if either argument is not a string.

Examples

Perfect repeats

>>> count_repeat_units("CAGCAGCAG", "CAG")
3

Imperfect tail

>>> count_repeat_units("CAGCAGCA", "CAG")
2

No repeats

>>> count_repeat_units("ATCG", "CAG")
0

Homopolymer runs

>>> count_repeat_units("ATAAAAA", "A")
5
>>> count_repeat_units("AAAATAAA", "A")  # longest contiguous run is 'AAAA'
4

Overlapping motifs

‘AAAA’ with motif ‘AA’ contains two non-overlapping copies: ‘AA’ ‘AA’

>>> count_repeat_units("AAAA", "AA")
2
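The documented behavior can be reproduced with a direct scan, shown here as a sketch rather than the library's implementation. The name longest_motif_run is hypothetical, and matching is case-sensitive in this version, which is an assumption.

```python
def longest_motif_run(sequence: str, motif: str) -> int:
    """Length (in motif copies) of the longest contiguous run of motif."""
    if not isinstance(sequence, str) or not isinstance(motif, str):
        raise ValueError("sequence and motif must be strings")
    if not motif:
        raise ValueError("motif must not be empty")

    step = len(motif)
    best = 0
    # Try every start offset so runs in any phase are found.
    for start in range(len(sequence)):
        run = 0
        pos = start
        # Count back-to-back, non-overlapping copies from this offset.
        while sequence.startswith(motif, pos):
            run += 1
            pos += step
        best = max(best, run)
    return best
```

Scanning every offset is quadratic in the worst case but keeps the logic easy to follow; it also finds runs that a single left-to-right pattern match could miss, such as the two copies of "ABA" starting at index 2 in "ABABAABA".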

strvcf_annotator.core.repeat_utils.normalize_variant(pos: int, ref: str, alt: str) → Tuple[int, str, str]

Locally normalize (pos, ref, alt) by trimming shared prefix/suffix.

  • pos is 1-based VCF coordinate.

  • Trimming is case-insensitive.

  • We always keep at least 1 base in ref and alt if they differ.

Returns:

new_pos, new_ref, new_alt

Return type:

Tuple[int, str, str]
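Prefix and suffix trimming of this kind can be sketched as follows. The name trim_shared_bases is hypothetical, and trimming the suffix before the prefix is an assumption; the order matters for some indels and is not specified above.

```python
from typing import Tuple


def trim_shared_bases(pos: int, ref: str, alt: str) -> Tuple[int, str, str]:
    """Trim bases shared by ref and alt, keeping at least one base in each."""
    # Shared suffix: trimming on the right leaves pos unchanged.
    while len(ref) > 1 and len(alt) > 1 and ref[-1].upper() == alt[-1].upper():
        ref, alt = ref[:-1], alt[:-1]
    # Shared prefix: each trimmed base moves pos one to the right.
    while len(ref) > 1 and len(alt) > 1 and ref[0].upper() == alt[0].upper():
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt
```

For example, (100, "GCAT", "GTAT") reduces to (101, "C", "T") in this sketch, while a deletion such as (10, "ACAG", "A") is already minimal and is returned unchanged.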

strvcf_annotator.core.repeat_utils.apply_variant_to_repeat(pos: int, ref: str, alt: str, repeat_start: int, repeat_seq: str) → str

Apply a variant to an STR repeat sequence.

The function applies a VCF variant to a reference STR sequence while respecting VCF normalization rules and STR boundaries.

The algorithm works as follows:

  1. Normalize pos, ref, and alt by trimming shared prefixes and suffixes.

  2. If the normalized variant lies fully inside the STR region, apply the full ALT allele.

  3. If the variant partially overlaps the STR region:

    • SNP-like variants (len(ref) == len(alt)) are aligned positionally.

    • Indel-like variants (len(ref) != len(alt)) use the suffix of ALT that overlaps the STR.

Any parts of the variant outside the STR window are ignored.

Notes

The genomic reference is conceptually treated as:

repeat_seq + UNKNOWN_SUFFIX

Differences outside the STR window do not affect the resulting mutated STR sequence.

Case handling rules:

  • All matching and overlap logic is case-insensitive.

  • Output case follows the overlapping STR segment:

    • lowercase STR slice → lowercase ALT

    • uppercase STR slice → uppercase ALT

    • mixed case → ALT is left unchanged

Parameters:
  • pos (int) – Variant position (1-based VCF coordinate).

  • ref (str) – Reference allele from the VCF record.

  • alt (str) – Alternate allele from the VCF record.

  • repeat_start (int) – Start position of the STR region (1-based).

  • repeat_seq (str) – Reference STR sequence from the panel.

Returns:

The mutated STR sequence after applying the variant. If the variant does not overlap the STR region, the original repeat_seq is returned unchanged.

Return type:

str
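The simplest case, a variant lying fully inside the STR region, can be sketched as a string splice. This covers only that one case: normalization, partial overlaps, and the case-handling rules are deliberately left out, and the name apply_inside_variant is hypothetical.

```python
def apply_inside_variant(
    pos: int, ref: str, alt: str, repeat_start: int, repeat_seq: str
) -> str:
    """Replace ref with alt inside repeat_seq (fully-inside case only)."""
    # Convert the 1-based genomic position to a 0-based string offset.
    offset = pos - repeat_start
    end = offset + len(ref)
    if offset < 0 or end > len(repeat_seq):
        raise ValueError("variant is not fully inside the repeat")
    # Splice: bases before the variant + ALT + bases after the REF span.
    return repeat_seq[:offset] + alt + repeat_seq[end:]
```

With repeat_start 100 and repeat_seq "CAGCAGCAG", the deletion (102, "GCAG", "G") gives "CAGCAG" and the insertion (100, "C", "CAGC") gives "CAGCAGCAGCAG".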

strvcf_annotator.core.repeat_utils.is_perfect_repeat(sequence: str, motif: str) → bool

Check if sequence is a perfect repeat of the motif.

A perfect repeat means the sequence consists entirely of exact copies of the motif with no interruptions or variations.

Parameters:
  • sequence (str) – DNA sequence to check

  • motif (str) – Repeat unit motif

Returns:

True if sequence is a perfect repeat, False otherwise

Return type:

bool

Examples

>>> is_perfect_repeat('CAGCAGCAG', 'CAG')
True
>>> is_perfect_repeat('CAGCAGCA', 'CAG')
False
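A perfect-repeat test can be written as a length check followed by a comparison against the motif repeated the right number of times. This is a sketch under assumptions: the name is_exact_tandem is hypothetical, and treating an empty sequence or empty motif as "not a repeat" is a choice made here, not documented behavior.

```python
def is_exact_tandem(sequence: str, motif: str) -> bool:
    """True if sequence is motif repeated a whole number of times."""
    # Assumption: empty inputs are never considered perfect repeats.
    if not sequence or not motif:
        return False
    # A perfect repeat must have a length that is a multiple of the motif.
    copies, remainder = divmod(len(sequence), len(motif))
    if remainder != 0:
        return False
    return sequence == motif * copies
```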

Parser class

Abstract base class for VCF parsers.

class strvcf_annotator.parsers.base.BaseVCFParser

Bases: ABC

Abstract base class for VCF parsers.

Defines the interface for extracting genotype and variant information from VCF records in a standardized way. All parser implementations must inherit from this class and implement all abstract methods.

abstractmethod get_genotype(record: VariantRecord, sample_idx: int) → Tuple[int, int] | None

Extract genotype as (allele1, allele2) or None if unknown.

Parameters:
  • record (pysam.VariantRecord) – The VCF record to extract genotype from

  • sample_idx (int) – Index of the sample in the record

Returns:

Genotype as tuple of allele indices (0=REF, 1=ALT), or None if unknown

Return type:

Optional[Tuple[int, int]]

abstractmethod has_variant(record: VariantRecord, sample_idx: int) → bool

Check if sample has variant even when GT is unknown.

Parameters:
  • record (pysam.VariantRecord) – The VCF record to check

  • sample_idx (int) – Index of the sample in the record

Returns:

True if variant is present, False otherwise

Return type:

bool

abstractmethod extract_info(record: VariantRecord, sample_idx: int) → Dict[str, Any]

Extract additional fields (AD, DP, etc.) as dictionary.

Parameters:
  • record (pysam.VariantRecord) – The VCF record to extract information from

  • sample_idx (int) – Index of the sample in the record

Returns:

Dictionary of additional FORMAT fields

Return type:

Dict[str, Any]

abstractmethod validate_record(record: VariantRecord) → bool

Validate that record is compatible with this parser.

Parameters:

record (pysam.VariantRecord) – The VCF record to validate

Returns:

True if record is valid for this parser, False otherwise

Return type:

bool

Generic parser

Generic parser for standard VCF format fields.

class strvcf_annotator.parsers.generic.GenericParser

Bases: BaseVCFParser

Generic parser for standard VCF format fields.

Handles standard VCF FORMAT fields including GT (genotype), AD (allelic depth), and DP (total depth). Provides robust error handling for missing or invalid data.

get_genotype(record: VariantRecord, sample_idx: int) → Tuple[int, int] | None

Extract GT field, return None for missing/invalid genotypes.

Parameters:
  • record (pysam.VariantRecord) – The VCF record to extract genotype from

  • sample_idx (int) – Index of the sample in the record

Returns:

Genotype as tuple of allele indices, or None if missing/invalid

Return type:

Optional[Tuple[int, int]]

has_variant(record: VariantRecord, sample_idx: int) → bool

Check variant presence using GT or alternative evidence.

Parameters:
  • record (pysam.VariantRecord) – The VCF record to check

  • sample_idx (int) – Index of the sample in the record

Returns:

True if variant is present, False otherwise

Return type:

bool

extract_info(record: VariantRecord, sample_idx: int) → Dict[str, Any]

Extract AD, DP and other standard FORMAT fields.

Parameters:
  • record (pysam.VariantRecord) – The VCF record to extract information from

  • sample_idx (int) – Index of the sample in the record

Returns:

Dictionary of FORMAT fields (AD, DP, etc.)

Return type:

Dict[str, Any]

validate_record(record: VariantRecord) → bool

Validate record has required fields for generic parsing.

Parameters:

record (pysam.VariantRecord) – The VCF record to validate

Returns:

True if record is valid, False otherwise

Return type:

bool