Main functionality

Library API for programmatic access to STR annotation functionality.

class strvcf_annotator.api.STRAnnotator(str_bed_path: str, parser: BaseVCFParser | None = None, somatic_mode: bool = False, ignore_mismatch_warnings: bool = False, mismatch_truth: str = 'panel')

Bases: object

Main class for STR annotation functionality.

Provides a high-level interface for annotating VCF files with STR (Short Tandem Repeat) information. Supports single-file, batch-directory, and streaming workflows.

Parameters:
  • str_bed_path (str) – Path to BED file containing STR regions

  • parser (BaseVCFParser, optional) – Custom parser for genotype extraction. Uses GenericParser if None.

  • somatic_mode (bool, optional) – Enable somatic filtering. When True, variants where all samples have identical genotypes are skipped. Defaults to False.

  • ignore_mismatch_warnings (bool, optional) – If True, suppress warnings about reference mismatches between the STR panel sequence and the VCF REF allele. Defaults to False.

  • mismatch_truth (str, optional) –

    Specifies which source is treated as ground truth when a mismatch between the STR panel and VCF REF allele is detected.

    Allowed values are:

    • "panel": trust the STR panel repeat sequence (default)

    • "vcf": trust the VCF REF allele and patch the overlapping panel sequence

    • "skip": skip variants with mismatches entirely

    Defaults to "panel".

str_bed_path

Path to STR BED file

Type:

str

str_df

Loaded STR reference data

Type:

pd.DataFrame

parser

Parser for genotype extraction

Type:

BaseVCFParser

somatic_mode

Whether somatic filtering is enabled

Type:

bool

Examples

>>> import pysam
>>> from strvcf_annotator.api import STRAnnotator
>>> annotator = STRAnnotator('str_regions.bed')
>>> annotator.annotate_vcf_file('input.vcf', 'output.vcf')
>>> # Batch process directory
>>> annotator.process_directory('input_dir/', 'output_dir/')
>>> # Stream processing
>>> vcf_in = pysam.VariantFile('input.vcf')
>>> for record in annotator.annotate_vcf_stream(vcf_in):
...     print(record.info['RU'])
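
The somatic_mode rule described above (skip a variant when all samples have identical genotypes) can be illustrated with a minimal sketch. This is not the library's implementation: the function names and the tuple representation of genotypes are assumptions made for the example, and genotypes are compared as ordered tuples, so (0, 1) and (1, 0) count as different here.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical genotype representation: allele indices per sample,
# e.g. (0, 1) for 0/1; None marks a missing allele.
Genotype = Tuple[Optional[int], ...]


def all_genotypes_identical(genotypes: Iterable[Genotype]) -> bool:
    """Return True when every sample carries the same genotype."""
    distinct = {tuple(gt) for gt in genotypes}
    return len(distinct) <= 1


def keep_in_somatic_mode(genotypes: Iterable[Genotype]) -> bool:
    """Keep a variant only if at least two samples differ."""
    return not all_genotypes_identical(genotypes)


# Both samples are 0/1, so the variant would be skipped.
print(keep_in_somatic_mode([(0, 1), (0, 1)]))  # False
# One sample is 0/1 and the other 0/0, so the variant would be kept.
print(keep_in_somatic_mode([(0, 1), (0, 0)]))  # True
```

How the library itself compares genotypes (phasing, missing calls, allele order) is not stated in this reference, so treat the comparison above as one plausible reading.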

annotate_vcf_file(input_path: str, output_path: str, *, somatic_mode: bool | None = None, ignore_mismatch_warnings: bool | None = None, mismatch_truth: str | None = None) -> None

Annotate a single VCF file with STR information.

This method reads a VCF file, annotates variants that overlap STR regions, and writes the annotated records to an output VCF file.

Parameters:
  • input_path (str) – Path to the input VCF file.

  • output_path (str) – Path to the output annotated VCF file.

  • somatic_mode (bool, optional) – Enable somatic filtering. When True, variants where all samples have identical genotypes are skipped. If None, the value set on the annotator instance is used.

  • ignore_mismatch_warnings (bool, optional) – If True, suppress warnings about reference mismatches between the STR panel sequence and the VCF REF allele. If None, the value set on the annotator instance is used.

  • mismatch_truth (str, optional) –

    Specifies which source is treated as ground truth when a mismatch between the STR panel and VCF REF allele is detected.

    Allowed values are:

    • "panel": trust the STR panel repeat sequence (default)

    • "vcf": trust the VCF REF allele and patch the overlapping panel sequence

    • "skip": skip variants with mismatches entirely

    If None, the value set on the annotator instance is used.

Raises:

ValidationError – If the input VCF file is invalid.

Examples

>>> from strvcf_annotator.api import STRAnnotator
>>> annotator = STRAnnotator("str_regions.bed")
>>> annotator.annotate_vcf_file("input.vcf", "output.vcf")
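
The keyword-only options follow a "None means use the instance value" convention. The sketch below shows one way such a fallback could be resolved. The Options class and effective_options function are hypothetical names introduced for illustration and are not part of strvcf_annotator.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Options:
    """Hypothetical stand-in for the settings stored on an annotator instance."""
    somatic_mode: bool = False
    ignore_mismatch_warnings: bool = False
    mismatch_truth: str = "panel"


def effective_options(
    instance: Options,
    *,
    somatic_mode: Optional[bool] = None,
    ignore_mismatch_warnings: Optional[bool] = None,
    mismatch_truth: Optional[str] = None,
) -> Options:
    """Merge per-call overrides with the instance settings."""

    def pick(override, fallback):
        # Test against None explicitly so that an explicit False
        # still overrides an instance value of True.
        return fallback if override is None else override

    return Options(
        somatic_mode=pick(somatic_mode, instance.somatic_mode),
        ignore_mismatch_warnings=pick(
            ignore_mismatch_warnings, instance.ignore_mismatch_warnings
        ),
        mismatch_truth=pick(mismatch_truth, instance.mismatch_truth),
    )


instance = Options(somatic_mode=True, mismatch_truth="vcf")
print(effective_options(instance))  # no overrides: instance values are used
print(effective_options(instance, somatic_mode=False, mismatch_truth="skip"))
```

The explicit None check matters for the boolean options: a truthiness test such as `override or fallback` would silently ignore an override of False.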

annotate_vcf_stream(vcf_in: VariantFile, *, somatic_mode: bool | None = None, ignore_mismatch_warnings: bool | None = None, mismatch_truth: str | None = None) -> Iterator[VariantRecord]

Annotate VCF records from an open stream.

This generator yields annotated VCF records from an already opened pysam.VariantFile object. It is useful for streaming workflows or custom processing pipelines.

Parameters:
  • vcf_in (pysam.VariantFile) – Open VCF file object to read variants from.

  • somatic_mode (bool, optional) – Enable somatic filtering. When True, variants where all samples have identical genotypes are skipped. If None, the value set on the annotator instance is used.

  • ignore_mismatch_warnings (bool, optional) – If True, suppress warnings about reference mismatches between the STR panel sequence and the VCF REF allele. If None, the value set on the annotator instance is used.

  • mismatch_truth (str, optional) –

    Specifies which source is treated as ground truth when a mismatch between the STR panel and VCF REF allele is detected.

    Allowed values are:

    • "panel": trust the STR panel repeat sequence (default)

    • "vcf": trust the VCF REF allele and patch the overlapping panel sequence

    • "skip": skip variants with mismatches entirely

    If None, the value set on the annotator instance is used.

Yields:

pysam.VariantRecord – Annotated VCF records.

Examples

>>> import pysam
>>> from strvcf_annotator.api import STRAnnotator
>>> annotator = STRAnnotator("str_regions.bed")
>>> vcf_in = pysam.VariantFile("input.vcf")
>>> for record in annotator.annotate_vcf_stream(vcf_in):
...     print(record.info["RU"])
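
To make the three mismatch_truth values concrete, here is a small sketch of how a mismatch between the panel sequence and the REF allele might be reconciled. The function reconcile, its parameters, and the assumption that REF lies entirely inside the panel sequence are illustrative choices, not the library's actual logic.

```python
from typing import Optional

ALLOWED_MODES = ("panel", "vcf", "skip")


def reconcile(
    panel_seq: str, ref: str, offset: int, mismatch_truth: str = "panel"
) -> Optional[str]:
    """Return the repeat sequence to annotate with, or None to skip.

    panel_seq -- repeat sequence taken from the STR panel
    ref       -- VCF REF allele
    offset    -- 0-based index in panel_seq where ref starts
                 (assumed: ref lies fully inside panel_seq)
    """
    if mismatch_truth not in ALLOWED_MODES:
        raise ValueError(f"mismatch_truth must be one of {ALLOWED_MODES}")

    overlap = panel_seq[offset:offset + len(ref)]
    if overlap == ref:
        return panel_seq  # no mismatch, nothing to decide

    if mismatch_truth == "panel":
        return panel_seq  # keep the panel sequence unchanged
    if mismatch_truth == "vcf":
        # Patch the overlapping stretch of the panel with the REF allele.
        return panel_seq[:offset] + ref + panel_seq[offset + len(ref):]
    return None  # "skip": drop the variant


panel = "CAGCAGCAG"
print(reconcile(panel, "CAT", 3, "panel"))  # CAGCAGCAG
print(reconcile(panel, "CAT", 3, "vcf"))    # CAGCATCAG
print(reconcile(panel, "CAT", 3, "skip"))   # None
```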

process_directory(input_dir: str, output_dir: str, *, somatic_mode: bool | None = None, ignore_mismatch_warnings: bool | None = None, mismatch_truth: str | None = None) -> None

Batch process a directory of VCF files.

This method processes all VCF files in the input directory and writes annotated versions to the output directory. Files that have already been processed are skipped automatically.

Parameters:
  • input_dir (str) – Directory containing input VCF files.

  • output_dir (str) – Directory where annotated VCF files will be written. The directory is created if it does not already exist.

  • somatic_mode (bool, optional) – Enable somatic filtering. When True, variants where all samples have identical genotypes are skipped. If None, the value set on the annotator instance is used.

  • ignore_mismatch_warnings (bool, optional) – If True, suppress warnings about reference mismatches between the STR panel sequence and the VCF REF allele. If None, the value set on the annotator instance is used.

  • mismatch_truth (str, optional) –

    Specifies which source is treated as ground truth when a mismatch between the STR panel and VCF REF allele is detected.

    Allowed values are:

    • "panel": trust the STR panel repeat sequence (default)

    • "vcf": trust the VCF REF allele and patch the overlapping panel sequence

    • "skip": skip variants with mismatches entirely

    If None, the value set on the annotator instance is used.

Raises:

ValidationError – If the input directory is invalid.

Examples

>>> from strvcf_annotator.api import STRAnnotator
>>> annotator = STRAnnotator("str_regions.bed")
>>> annotator.process_directory("vcf_files/", "annotated_vcfs/")
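
The batch behaviour described above (create the output directory, annotate each VCF, skip files already processed) can be sketched roughly as follows. Several details are assumptions made for the example: only `*.vcf` files are matched, the output keeps the input file name, an existing output file counts as "already processed", and ValueError stands in for the library's ValidationError. The annotate_one callback is a hypothetical placeholder for the per-file annotation step.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, List


def process_directory_sketch(
    input_dir: str,
    output_dir: str,
    annotate_one: Callable[[Path, Path], object],
) -> List[Path]:
    """Annotate every *.vcf in input_dir and return the outputs written."""
    src = Path(input_dir)
    if not src.is_dir():
        raise ValueError(f"Not a directory: {input_dir}")

    dst = Path(output_dir)
    dst.mkdir(parents=True, exist_ok=True)  # create the output dir if missing

    written = []
    for vcf_path in sorted(src.glob("*.vcf")):
        out_path = dst / vcf_path.name  # assumed naming scheme
        if out_path.exists():
            continue  # assumed meaning of "already processed"
        annotate_one(vcf_path, out_path)
        written.append(out_path)
    return written


# Demo with a stand-in "annotator" that simply copies each file.
with tempfile.TemporaryDirectory() as tmp:
    in_dir = Path(tmp) / "vcf_files"
    in_dir.mkdir()
    for name in ("a.vcf", "b.vcf"):
        (in_dir / name).write_text("##fileformat=VCFv4.2\n")
    out_dir = Path(tmp) / "annotated_vcfs"

    first_run = [p.name for p in
                 process_directory_sketch(str(in_dir), str(out_dir), shutil.copyfile)]
    second_run = [p.name for p in
                  process_directory_sketch(str(in_dir), str(out_dir), shutil.copyfile)]

print(first_run)   # ['a.vcf', 'b.vcf']
print(second_run)  # []
```

The second run writes nothing because every output already exists, which mirrors the skip-on-rerun behaviour.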

get_str_at_position(chrom: str, pos: int) -> dict | None

Get the STR region at a specific genomic position.

Parameters:
  • chrom (str) – Chromosome name

  • pos (int) – Genomic position (1-based)

Returns:

STR region data if position is within an STR, None otherwise

Return type:

Optional[dict]

Examples

>>> from strvcf_annotator.api import STRAnnotator
>>> annotator = STRAnnotator('str_regions.bed')
>>> str_region = annotator.get_str_at_position('chr1', 1000000)
>>> if str_region:
...     print(f"Repeat unit: {str_region['RU']}")
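
Because pos is 1-based while BED intervals are conventionally 0-based and half-open, a lookup has to be careful at the region boundaries. The sketch below shows the comparison under the assumption that regions are kept in raw BED coordinates. The CHROM, START, and END keys and the find_region function are hypothetical; only the RU key appears in the example above, and the library may normalise coordinates differently when loading the panel.

```python
from typing import Dict, List, Optional

# Hypothetical in-memory panel in BED coordinates (0-based, half-open).
REGIONS: List[Dict] = [
    {"CHROM": "chr1", "START": 100, "END": 112, "RU": "CAG"},
    {"CHROM": "chr1", "START": 200, "END": 208, "RU": "AT"},
]


def find_region(regions: List[Dict], chrom: str, pos: int) -> Optional[Dict]:
    """Return the region containing the 1-based position, or None.

    A BED interval [START, END) covers the 1-based positions
    START + 1 through END, hence the asymmetric comparison.
    """
    for region in regions:
        if region["CHROM"] == chrom and region["START"] < pos <= region["END"]:
            return region
    return None


print(find_region(REGIONS, "chr1", 101))  # first base of the CAG region
print(find_region(REGIONS, "chr1", 100))  # None: one base before the region
print(find_region(REGIONS, "chr1", 112))  # last base of the CAG region
```

A linear scan is enough for illustration; a real panel with many regions would more likely use a sorted index or an interval tree.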

get_statistics() -> dict

Get statistics about loaded STR regions.

Returns:

Statistics including total regions, chromosomes, repeat units

Return type:

dict

Examples

>>> from strvcf_annotator.api import STRAnnotator
>>> annotator = STRAnnotator('str_regions.bed')
>>> stats = annotator.get_statistics()
>>> print(f"Total STR regions: {stats['total_regions']}")
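
As a rough idea of what such a summary might contain, the sketch below computes region, chromosome, and repeat-unit counts from a small in-memory panel. Only the total_regions key is taken from the example above. The other key names, the CHROM and RU field names, and the summarize function are assumptions for illustration, and the real return value may be shaped differently.

```python
from collections import Counter
from typing import Dict, List


def summarize(regions: List[Dict]) -> Dict:
    """Summarise a list of STR regions (hypothetical structure)."""
    return {
        "total_regions": len(regions),
        # Assumed key: sorted list of chromosomes present in the panel.
        "chromosomes": sorted({r["CHROM"] for r in regions}),
        # Assumed key: number of regions per repeat unit.
        "repeat_units": dict(Counter(r["RU"] for r in regions)),
    }


panel = [
    {"CHROM": "chr1", "RU": "CAG"},
    {"CHROM": "chr1", "RU": "AT"},
    {"CHROM": "chr2", "RU": "CAG"},
]
stats = summarize(panel)
print(f"Total STR regions: {stats['total_regions']}")  # Total STR regions: 3
```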

strvcf_annotator.api.annotate_vcf(input_vcf: str, str_bed: str, output_vcf: str, parser: BaseVCFParser | None = None, *, somatic_mode: bool | None = None, ignore_mismatch_warnings: bool | None = None, mismatch_truth: str | None = None) -> None

Convenience function for single VCF annotation.

This is a simplified functional interface for annotating a single VCF file with STR information, without explicitly creating an STRAnnotator instance.

Parameters:
  • input_vcf (str) – Path to the input VCF file.

  • str_bed (str) – Path to the STR BED file.

  • output_vcf (str) – Path to the output annotated VCF file.

  • parser (BaseVCFParser, optional) – Custom parser for genotype extraction. If not provided, GenericParser is used.

  • somatic_mode (bool, optional) – Enable somatic filtering. When True, variants where all samples have identical genotypes are skipped. If None, the default behavior is used.

  • ignore_mismatch_warnings (bool, optional) – If True, suppress warnings about reference mismatches between the STR panel sequence and the VCF REF allele. If None, the default behavior is used.

  • mismatch_truth (str, optional) –

    Specifies which source is treated as ground truth when a mismatch between the STR panel and the VCF REF allele is detected.

    Allowed values are:

    • "panel": trust the STR panel repeat sequence (default)

    • "vcf": trust the VCF REF allele and patch the overlapping panel sequence

    • "skip": skip variants with mismatches entirely

    If None, the default behavior is used.

Examples

>>> from strvcf_annotator import annotate_vcf
>>> annotate_vcf("input.vcf", "str_regions.bed", "output.vcf")