Main functionality
Library API for programmatic access to STR annotation functionality.
- class strvcf_annotator.api.STRAnnotator(str_bed_path: str, parser: BaseVCFParser | None = None, somatic_mode: bool = False, ignore_mismatch_warnings: bool = False, mismatch_truth: str = 'panel')[source]
Bases: object
Main class for STR annotation functionality.
Provides a high-level interface for annotating VCF files with STR (Short Tandem Repeat) information. Supports both single file and batch directory processing.
- Parameters:
str_bed_path (str) – Path to BED file containing STR regions
parser (BaseVCFParser, optional) – Custom parser for genotype extraction. Uses GenericParser if None.
somatic_mode (bool, optional) – Enable somatic filtering. When True, variants where all samples have identical genotypes are skipped. If None, the value set on the annotator instance is used.
ignore_mismatch_warnings (bool, optional) – If True, suppress warnings about reference mismatches between the STR panel sequence and the VCF REF allele. If None, the value set on the annotator instance is used.
mismatch_truth (str, optional) – Specifies which source is treated as ground truth when a mismatch between the STR panel and VCF REF allele is detected. Allowed values are:
"panel": trust the STR panel repeat sequence (default)
"vcf": trust the VCF REF allele and patch the overlapping panel sequence
"skip": skip variants with mismatches entirely
If None, the value set on the annotator instance is used.
- str_bed_path
Path to STR BED file
- Type:
str
- str_df
Loaded STR reference data
- Type:
pd.DataFrame
- parser
Parser for genotype extraction
- Type:
BaseVCFParser
- somatic_mode
Whether somatic filtering is enabled
- Type:
bool
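Somatic filtering, as described for somatic_mode, skips variants where all samples have identical genotypes. The check can be sketched as follows, assuming genotypes arrive as tuples of allele indices (the shape pysam returns for a sample's GT field). The helper name all_genotypes_identical is hypothetical and not part of strvcf_annotator, and ignoring allele order is an assumption about how genotypes are compared.

```python
from typing import Optional, Sequence, Tuple

Genotype = Tuple[Optional[int], ...]  # e.g. (0, 1); None marks a missing allele


def all_genotypes_identical(genotypes: Sequence[Genotype]) -> bool:
    """Return True when every sample carries the same genotype.

    Hypothetical helper: allele order is ignored, so (0, 1) and (1, 0) match.
    """
    # Sort alleles within each genotype so order does not matter; None sorts last.
    normalized = {
        tuple(sorted(gt, key=lambda a: (a is None, a or 0)))
        for gt in genotypes
    }
    return len(normalized) <= 1


# Tumor and normal share the same call: skipped when somatic filtering is on.
print(all_genotypes_identical([(0, 1), (1, 0)]))
# Tumor differs from normal: kept.
print(all_genotypes_identical([(0, 1), (0, 0)]))
```

Collecting the normalized genotypes into a set keeps the comparison independent of the number of samples.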
Examples
>>> from strvcf_annotator.api import STRAnnotator
>>> annotator = STRAnnotator('str_regions.bed')
>>> annotator.annotate_vcf_file('input.vcf', 'output.vcf')
>>> # Batch process directory
>>> annotator.process_directory('input_dir/', 'output_dir/')
>>> # Stream processing
>>> import pysam
>>> vcf_in = pysam.VariantFile('input.vcf')
>>> for record in annotator.annotate_vcf_stream(vcf_in):
...     print(record.info['RU'])
- annotate_vcf_file(input_path: str, output_path: str, *, somatic_mode: bool | None = None, ignore_mismatch_warnings: bool | None = None, mismatch_truth: str | None = None) None[source]
Annotate a single VCF file with STR information.
This method reads a VCF file, annotates variants that overlap STR regions, and writes the annotated records to an output VCF file.
- Parameters:
input_path (str) – Path to the input VCF file.
output_path (str) – Path to the output annotated VCF file.
somatic_mode (bool, optional) – Enable somatic filtering. When True, variants where all samples have identical genotypes are skipped. If None, the value set on the annotator instance is used.
ignore_mismatch_warnings (bool, optional) – If True, suppress warnings about reference mismatches between the STR panel sequence and the VCF REF allele. If None, the value set on the annotator instance is used.
mismatch_truth (str, optional) – Specifies which source is treated as ground truth when a mismatch between the STR panel and VCF REF allele is detected. Allowed values are:
"panel": trust the STR panel repeat sequence (default)
"vcf": trust the VCF REF allele and patch the overlapping panel sequence
"skip": skip variants with mismatches entirely
If None, the value set on the annotator instance is used.
- Raises:
ValidationError – If the input VCF file is invalid.
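The keyword-only options above fall back to the value stored on the annotator instance when left as None. A minimal sketch of that resolution pattern follows; the AnnotatorSettings class and its resolve method are hypothetical stand-ins for illustration, not the library's internals.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AnnotatorSettings:
    """Hypothetical stand-in for the options stored on an annotator instance."""

    somatic_mode: bool = False
    ignore_mismatch_warnings: bool = False
    mismatch_truth: str = "panel"

    def resolve(
        self,
        *,
        somatic_mode: Optional[bool] = None,
        ignore_mismatch_warnings: Optional[bool] = None,
        mismatch_truth: Optional[str] = None,
    ) -> "AnnotatorSettings":
        """Return the effective settings for one call; None means 'use the instance value'."""
        return AnnotatorSettings(
            somatic_mode=self.somatic_mode if somatic_mode is None else somatic_mode,
            ignore_mismatch_warnings=(
                self.ignore_mismatch_warnings
                if ignore_mismatch_warnings is None
                else ignore_mismatch_warnings
            ),
            mismatch_truth=self.mismatch_truth if mismatch_truth is None else mismatch_truth,
        )


defaults = AnnotatorSettings(somatic_mode=True)
# Only mismatch_truth is overridden; somatic_mode is inherited from the instance.
effective = defaults.resolve(mismatch_truth="skip")
print(effective)
```

Testing with `is None` rather than `or` matters here: an explicit False passed for a single call must still override an instance value of True.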
Examples
>>> annotator = STRAnnotator("str_regions.bed")
>>> annotator.annotate_vcf_file("input.vcf", "output.vcf")
- annotate_vcf_stream(vcf_in: VariantFile, *, somatic_mode: bool | None = None, ignore_mismatch_warnings: bool | None = None, mismatch_truth: str | None = None) Iterator[VariantRecord][source]
Annotate VCF records from an open stream.
This generator yields annotated VCF records from an already opened pysam.VariantFile object. It is useful for streaming workflows or custom processing pipelines.
- Parameters:
vcf_in (pysam.VariantFile) – Open VCF file object to read variants from.
somatic_mode (bool, optional) – Enable somatic filtering. When True, variants where all samples have identical genotypes are skipped. If None, the value set on the annotator instance is used.
ignore_mismatch_warnings (bool, optional) – If True, suppress warnings about reference mismatches between the STR panel sequence and the VCF REF allele. If None, the value set on the annotator instance is used.
mismatch_truth (str, optional) – Specifies which source is treated as ground truth when a mismatch between the STR panel and VCF REF allele is detected. Allowed values are:
"panel": trust the STR panel repeat sequence (default)
"vcf": trust the VCF REF allele and patch the overlapping panel sequence
"skip": skip variants with mismatches entirely
If None, the value set on the annotator instance is used.
- Yields:
pysam.VariantRecord – Annotated VCF records.
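The three mismatch_truth values describe what happens when the panel sequence and the VCF REF allele disagree. One way to picture them is the sketch below. The function reconcile_reference is hypothetical, and it assumes the REF allele lies entirely inside the panel sequence at a known 0-based offset; the library's actual patching logic may differ.

```python
from typing import Optional

ALLOWED = ("panel", "vcf", "skip")


def reconcile_reference(
    panel_seq: str, vcf_ref: str, offset: int, mismatch_truth: str = "panel"
) -> Optional[str]:
    """Return the panel sequence to use, or None if the variant should be skipped.

    Hypothetical helper. `offset` is the 0-based index in `panel_seq` where the
    VCF REF allele starts; REF is assumed to lie fully inside the panel sequence.
    """
    if mismatch_truth not in ALLOWED:
        raise ValueError(f"mismatch_truth must be one of {ALLOWED}, got {mismatch_truth!r}")

    end = offset + len(vcf_ref)
    if panel_seq[offset:end] == vcf_ref:
        return panel_seq  # sequences agree, nothing to decide
    if mismatch_truth == "panel":
        return panel_seq  # trust the panel and leave it untouched
    if mismatch_truth == "vcf":
        # Patch the overlapping stretch of the panel with the VCF REF allele.
        return panel_seq[:offset] + vcf_ref + panel_seq[end:]
    return None  # "skip": drop the variant


panel = "CAGCAGCAG"
for mode in ALLOWED:
    # REF "CAT" at offset 3 disagrees with the panel's "CAG".
    print(mode, reconcile_reference(panel, "CAT", 3, mode))
```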
Examples
>>> import pysam
>>> annotator = STRAnnotator("str_regions.bed")
>>> vcf_in = pysam.VariantFile("input.vcf")
>>> for record in annotator.annotate_vcf_stream(vcf_in):
...     print(record.info["RU"])
- process_directory(input_dir: str, output_dir: str, *, somatic_mode: bool | None = None, ignore_mismatch_warnings: bool | None = None, mismatch_truth: str | None = None) None[source]
Batch process a directory of VCF files.
This method processes all VCF files in the input directory and writes annotated versions to the output directory. Files that have already been processed are skipped automatically.
- Parameters:
input_dir (str) – Directory containing input VCF files.
output_dir (str) – Directory where annotated VCF files will be written. The directory is created if it does not already exist.
somatic_mode (bool, optional) – Enable somatic filtering. When True, variants where all samples have identical genotypes are skipped. If None, the value set on the annotator instance is used.
ignore_mismatch_warnings (bool, optional) – If True, suppress warnings about reference mismatches between the STR panel sequence and the VCF REF allele. If None, the value set on the annotator instance is used.
mismatch_truth (str, optional) – Specifies which source is treated as ground truth when a mismatch between the STR panel and VCF REF allele is detected. Allowed values are:
"panel": trust the STR panel repeat sequence (default)
"vcf": trust the VCF REF allele and patch the overlapping panel sequence
"skip": skip variants with mismatches entirely
If None, the value set on the annotator instance is used.
- Raises:
ValidationError – If the input directory is invalid.
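The batch behavior described above (create the output directory, skip files already processed) can be sketched with pathlib. Everything specific here is an assumption made for illustration: the helper plan_batch is hypothetical, inputs are matched by a .vcf or .vcf.gz suffix, outputs keep the input file name, and "already processed" means the output file exists. NotADirectoryError stands in for the library's ValidationError.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def plan_batch(input_dir: str, output_dir: str) -> List[Tuple[Path, Path]]:
    """List (input, output) pairs that still need processing.

    Hypothetical helper; the naming scheme and the "already processed" test
    (output file exists) are assumptions, not the library's documented rules.
    """
    in_dir, out_dir = Path(input_dir), Path(output_dir)
    if not in_dir.is_dir():
        raise NotADirectoryError(f"Not a directory: {in_dir}")
    out_dir.mkdir(parents=True, exist_ok=True)  # create the output directory if missing

    pending = []
    for vcf in sorted(in_dir.iterdir()):
        if not vcf.name.endswith((".vcf", ".vcf.gz")):
            continue  # ignore non-VCF files
        target = out_dir / vcf.name
        if target.exists():
            continue  # treat an existing output as "already processed"
        pending.append((vcf, target))
    return pending


with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp) / "in", Path(tmp) / "out"
    src.mkdir()
    for name in ("a.vcf", "b.vcf.gz", "notes.txt"):
        (src / name).touch()
    dst.mkdir()
    (dst / "a.vcf").touch()  # pretend a.vcf was annotated in an earlier run
    demo_pending = [p.name for p, _ in plan_batch(str(src), str(dst))]
    print(demo_pending)
```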
Examples
>>> annotator = STRAnnotator("str_regions.bed")
>>> annotator.process_directory("vcf_files/", "annotated_vcfs/")
- get_str_at_position(chrom: str, pos: int) dict | None[source]
Get the STR region at a specific genomic position.
- Parameters:
chrom (str) – Chromosome name
pos (int) – Genomic position (1-based)
- Returns:
STR region data if position is within an STR, None otherwise
- Return type:
Optional[dict]
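Because pos is 1-based while BED intervals are conventionally 0-based and end-exclusive, a lookup has to shift coordinates. A minimal sketch of such a lookup follows, assuming the panel uses standard BED coordinates. The find_str helper and all field names other than RU are hypothetical, and a linear scan is used only for brevity.

```python
from typing import Dict, List, Optional

# Toy regions in BED-style coordinates (0-based start, end-exclusive).
# Field names other than 'RU' are hypothetical.
REGIONS: List[Dict] = [
    {"CHROM": "chr1", "START": 999_990, "END": 1_000_010, "RU": "CAG"},
    {"CHROM": "chr2", "START": 500, "END": 520, "RU": "AT"},
]


def find_str(regions: List[Dict], chrom: str, pos: int) -> Optional[Dict]:
    """Return the first region containing the 1-based position, else None."""
    for region in regions:
        # A 1-based position p falls inside a BED interval when START < p <= END.
        if region["CHROM"] == chrom and region["START"] < pos <= region["END"]:
            return region
    return None


hit = find_str(REGIONS, "chr1", 1_000_000)
print(hit["RU"] if hit else None)
# 1-based position 500 is the base just before the chr2 interval.
print(find_str(REGIONS, "chr2", 500))
```

For large panels, a sorted index or interval tree per chromosome would replace the linear scan.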
Examples
>>> annotator = STRAnnotator('str_regions.bed')
>>> str_region = annotator.get_str_at_position('chr1', 1000000)
>>> if str_region:
...     print(f"Repeat unit: {str_region['RU']}")
- get_statistics() dict[source]
Get statistics about loaded STR regions.
- Returns:
Statistics including total regions, chromosomes, repeat units
- Return type:
dict
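As a rough picture of the kind of summary described, the sketch below computes comparable statistics from a toy list of regions. Only the total_regions key appears in this documentation; the other keys, the CHROM field name, and the summarize_regions helper are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def summarize_regions(regions: Iterable[Dict]) -> Dict:
    """Summarize STR regions; key names other than 'total_regions' are assumptions."""
    regions = list(regions)
    ru_counts = Counter(r["RU"] for r in regions)
    return {
        "total_regions": len(regions),
        "chromosomes": sorted({r["CHROM"] for r in regions}),
        "repeat_units": dict(ru_counts),  # repeat unit -> number of regions
    }


toy = [
    {"CHROM": "chr1", "RU": "CAG"},
    {"CHROM": "chr1", "RU": "AT"},
    {"CHROM": "chr2", "RU": "CAG"},
]
stats = summarize_regions(toy)
print(f"Total STR regions: {stats['total_regions']}")
```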
Examples
>>> annotator = STRAnnotator('str_regions.bed')
>>> stats = annotator.get_statistics()
>>> print(f"Total STR regions: {stats['total_regions']}")
- strvcf_annotator.api.annotate_vcf(input_vcf: str, str_bed: str, output_vcf: str, parser: BaseVCFParser | None = None, *, somatic_mode: bool | None = None, ignore_mismatch_warnings: bool | None = None, mismatch_truth: str | None = None) None[source]
Convenience function for single VCF annotation.
This is a simplified functional interface for annotating a single VCF file with STR information, without explicitly creating an STRAnnotator instance.
- Parameters:
input_vcf (str) – Path to the input VCF file.
str_bed (str) – Path to the STR BED file.
output_vcf (str) – Path to the output annotated VCF file.
parser (BaseVCFParser, optional) – Custom parser for genotype extraction. If not provided, GenericParser is used.
somatic_mode (bool, optional) – Enable somatic filtering. When True, variants where all samples have identical genotypes are skipped. If None, the default behavior is used.
ignore_mismatch_warnings (bool, optional) – If True, suppress warnings about reference mismatches between the STR panel sequence and the VCF REF allele. If None, the default behavior is used.
mismatch_truth (str, optional) – Specifies which source is treated as ground truth when a mismatch between the STR panel and the VCF REF allele is detected. Allowed values are:
"panel": trust the STR panel repeat sequence (default)
"vcf": trust the VCF REF allele and patch the overlapping panel sequence
"skip": skip variants with mismatches entirely
If None, the default behavior is used.
Examples
>>> from strvcf_annotator import annotate_vcf
>>> annotate_vcf("input.vcf", "str_regions.bed", "output.vcf")