Dev Documentation
Annotation
Core annotation engine for building STR-annotated VCF records.
- strvcf_annotator.core.annotation.make_modified_header(vcf_in: VariantFile) VariantHeader[source]
Create VCF header with STR-specific INFO and FORMAT fields.
Creates a modified VCF header that includes all original header information plus STR-specific annotations. Replaces existing RU, PERIOD, REF, PERFECT INFO fields and REPCN FORMAT field with STR-specific definitions.
- Parameters:
vcf_in (pysam.VariantFile) – Input VCF file object
- Returns:
New header with STR-specific fields
- Return type:
pysam.VariantHeader
Notes
- INFO fields added/replaced:
RU: Repeat unit
PERIOD: Repeat period (length of unit)
REF: Reference copy number
PERFECT: Indicates perfect repeats in REF and ALT
- FORMAT field added/replaced:
REPCN: Genotype as number of repeat motif copies
- strvcf_annotator.core.annotation.build_new_record(record: VariantRecord, str_row: Dict | Series, header: VariantHeader, parser: BaseVCFParser, ignore_mismatch_warnings: bool = False, mismatch_truth: str = 'panel') VariantRecord | None[source]
Build annotated VCF record with STR alleles and metadata.
Constructs a new VCF record where alleles represent full repeat sequences (before and after mutation) and adds STR-specific annotations to INFO and FORMAT fields.
- Parameters:
record (pysam.VariantRecord) – Original VCF record with mutation
str_row (Dict or pd.Series) – STR metadata (CHROM, START, END, RU, PERIOD)
header (pysam.VariantHeader) – Modified header with STR fields
parser (BaseVCFParser) – Parser for extracting genotype information
ignore_mismatch_warnings (bool) – If True, do not log mismatch warnings between VCF REF and panel repeat sequence. Annotation continues regardless.
mismatch_truth (str) –
- Which source to treat as correct when there is a mismatch:
“panel”: trust the panel repeat sequence (default behavior)
“vcf”: trust VCF REF and patch the overlapping part of the panel repeat sequence to match it
“skip”: skip the record with the mismatch
- Returns:
New record with STR alleles and annotations
- Return type:
pysam.VariantRecord
Notes
Logs warning if reference mismatch detected
Calculates repeat copy numbers for REF and ALT
Marks PERFECT=TRUE only if both alleles are perfect repeats
Preserves all original FORMAT fields
Returns None if there is a reference mismatch and mismatch_truth is “skip”
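The three mismatch_truth modes can be pictured with a small standalone sketch. The helper name resolve_reference_mismatch and its arguments are hypothetical and not part of strvcf_annotator; the sketch assumes 1-based coordinates and a VCF REF that lies fully inside the repeat window.

```python
from typing import Optional


def resolve_reference_mismatch(
    repeat_seq: str,
    repeat_start: int,
    pos: int,
    ref: str,
    mismatch_truth: str = "panel",
) -> Optional[str]:
    """Return the repeat sequence to use, or None when the record is skipped.

    Hypothetical helper: assumes REF lies fully inside the repeat window.
    """
    offset = pos - repeat_start
    panel_slice = repeat_seq[offset:offset + len(ref)]
    if panel_slice.upper() == ref.upper():
        return repeat_seq  # no mismatch, nothing to resolve
    if mismatch_truth == "panel":
        return repeat_seq  # keep the panel sequence as-is
    if mismatch_truth == "vcf":
        # Patch the overlapping part of the panel sequence with VCF REF.
        return repeat_seq[:offset] + ref + repeat_seq[offset + len(ref):]
    if mismatch_truth == "skip":
        return None  # caller drops the record
    raise ValueError(f"unknown mismatch_truth: {mismatch_truth!r}")
```

With a panel sequence of CAGCAGCAG starting at position 100 and a VCF REF of CTG at position 103, “panel” keeps CAGCAGCAG, “vcf” yields CAGCTGCAG, and “skip” returns None.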
- strvcf_annotator.core.annotation.should_skip_genotype(record: VariantRecord, parser: BaseVCFParser) bool[source]
Determine if record should be skipped based on genotype filtering.
Skips records where:
- Not exactly 2 samples are present
- Genotypes are invalid or missing
- Both samples have identical genotypes
- Parameters:
record (pysam.VariantRecord) – VCF record to check
parser (BaseVCFParser) – Parser for extracting genotypes
- Returns:
True if record should be skipped, False otherwise
- Return type:
bool
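The skip rule can be sketched on plain genotype tuples, without pysam. The function name should_skip and the list-of-genotypes input are illustrative assumptions; the real function receives a record and a parser. Genotypes are compared as given, so (0, 1) and (1, 0) count as different here, which may not match the package.

```python
from typing import List, Optional, Tuple

Genotype = Optional[Tuple[int, int]]


def should_skip(genotypes: List[Genotype]) -> bool:
    """Apply the three documented skip conditions to per-sample genotypes."""
    if len(genotypes) != 2:
        return True  # not exactly 2 samples
    first, second = genotypes
    if first is None or second is None:
        return True  # missing or invalid genotype
    return first == second  # identical genotypes are skipped
```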
Process vcf
VCF file processing and workflow management.
- class strvcf_annotator.core.vcf_processor.WorkerConfig(str_panel_gz: str, somatic_mode: bool, ignore_mismatch_warnings: bool, mismatch_truth: str, parser: BaseVCFParser)[source]
Bases: object
Configuration container for worker processes.
Stores parameters required by each worker to annotate VCF files. The configuration is passed once during worker initialization and reused for all tasks processed by that worker.
- str_panel_gz
Path to the BGZF-compressed, tabix-indexed STR reference file.
- Type:
str
- somatic_mode
Enable somatic filtering. When True, skips variants where both samples have identical genotypes. Default is False.
- Type:
bool, optional
- ignore_mismatch_warnings
If True, suppresses warnings about reference mismatches between the STR panel and VCF REF alleles. Default is False.
- Type:
bool, optional
- mismatch_truth
Specifies which source to consider as ground truth for mismatches. Options are “panel”, “vcf”, or “skip”. Default is “panel”.
- Type:
str, optional
Notes
The dataclass is frozen to ensure the configuration remains immutable once workers are initialized.
Instances of this class are passed to worker_init, which loads the STR reference and exposes these settings to worker tasks.
- str_panel_gz: str
- somatic_mode: bool
- ignore_mismatch_warnings: bool
- mismatch_truth: str
- parser: BaseVCFParser
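What a frozen dataclass means in practice can be seen in a minimal sketch. ExampleConfig is a hypothetical stand-in that carries only a subset of the fields listed above; it is not the package's WorkerConfig.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class ExampleConfig:
    """Hypothetical stand-in with a subset of WorkerConfig's fields."""

    str_panel_gz: str
    somatic_mode: bool
    mismatch_truth: str


config = ExampleConfig("panel.bed.gz", somatic_mode=False, mismatch_truth="panel")

try:
    config.somatic_mode = True  # frozen dataclasses reject attribute assignment
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Because assignment raises FrozenInstanceError, a worker cannot change the settings it was initialized with.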
- strvcf_annotator.core.vcf_processor.check_vcf_sorted(vcf_in: VariantFile) bool[source]
Validate VCF sorting by chromosome and position.
Checks if VCF records are sorted by chromosome and position. Rewinds the file after checking.
- Parameters:
vcf_in (pysam.VariantFile) – Input VCF file
- Returns:
True if VCF is sorted, False otherwise
- Return type:
bool
- strvcf_annotator.core.vcf_processor.reset_and_sort_vcf(vcf_in: VariantFile) List[VariantRecord][source]
Sort VCF records in memory when needed.
Loads all VCF records into memory and sorts them by chromosome and position according to the contig order in the header.
- Parameters:
vcf_in (pysam.VariantFile) – Input VCF file
- Returns:
Sorted list of VCF records
- Return type:
List[pysam.VariantRecord]
Notes
This loads the entire VCF into memory, so use with caution for large files.
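The check-then-sort flow of check_vcf_sorted and reset_and_sort_vcf can be sketched on plain (chrom, pos) tuples. The helper names and the contig_order mapping (contig name to its rank in the header) are assumptions for illustration; the real functions operate on pysam objects.

```python
from typing import Dict, Iterable, List, Tuple

Site = Tuple[str, int]


def is_sorted_by_contig(records: Iterable[Site], contig_order: Dict[str, int]) -> bool:
    """Return True if records are ordered by contig rank, then position."""
    previous = None
    for chrom, pos in records:
        key = (contig_order[chrom], pos)
        if previous is not None and key < previous:
            return False
        previous = key
    return True


def sort_by_contig(records: Iterable[Site], contig_order: Dict[str, int]) -> List[Site]:
    """Load all records into a list and sort by contig rank, then position."""
    return sorted(records, key=lambda rec: (contig_order[rec[0]], rec[1]))
```

Sorting by header rank rather than by name keeps chr10 after chr2 when the header lists the contigs in that order.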
- strvcf_annotator.core.vcf_processor.generate_annotated_records(vcf_in: VariantFile, str_panel_gz: str, parser: BaseVCFParser = None, somatic_mode: bool = False, ignore_mismatch_warnings: bool = False, mismatch_truth: str = 'panel') Iterator[VariantRecord][source]
Generator yielding annotated VCF records.
Processes VCF records and yields annotated records for variants that overlap with STR regions. Handles sorting if needed and optionally filters records based on genotype criteria. When multiple STR regions overlap the same POS, all overlapping STR candidates are tried and the first one that produces a meaningful STR allele change is used.
- Parameters:
vcf_in (pysam.VariantFile) – Input VCF file
str_panel_gz (str) – Path to BGZF-compressed, tabix-indexed STR reference file.
parser (BaseVCFParser, optional) – Parser for genotype extraction. Uses GenericParser if None.
somatic_mode (bool, optional) – Enable somatic filtering. When True, skips variants where both samples have identical genotypes. Default is False.
ignore_mismatch_warnings (bool, optional) – If True, suppresses warnings about reference mismatches between the STR panel and VCF REF alleles. Default is False.
mismatch_truth (str, optional) – Specifies which source to consider as ground truth for mismatches. Options are “panel”, “vcf”, or “skip”. Default is “panel”.
- Yields:
pysam.VariantRecord – Annotated VCF records
Notes
Automatically sorts VCF if not sorted
Skips records without STR overlap
If somatic_mode=True, filters records with identical genotypes
ignore_mismatch_warnings controls logging of reference mismatches
mismatch_truth controls which source is considered ground truth for mismatches
- strvcf_annotator.core.vcf_processor.annotate_vcf_to_file(vcf_path: str, str_panel_gz: str, output_path: str, parser: BaseVCFParser = None, somatic_mode: bool = False, ignore_mismatch_warnings: bool = False, mismatch_truth: str = 'panel') None[source]
Process VCF file and write annotated output.
Reads a VCF file, annotates variants that overlap with STR regions, and writes the annotated records to an output file.
- Parameters:
vcf_path (str) – Path to input VCF file
str_panel_gz (str) – Path to BGZF-compressed, tabix-indexed STR reference file.
output_path (str) – Path to output VCF file
parser (BaseVCFParser, optional) – Parser for genotype extraction. Uses GenericParser if None.
somatic_mode (bool, optional) – Enable somatic filtering. When True, skips variants where both samples have identical genotypes. Default is False.
ignore_mismatch_warnings (bool) – If True, do not log mismatch warnings between VCF REF and panel repeat sequence. Annotation continues regardless.
mismatch_truth (str) –
- Which source to treat as correct when there is a mismatch:
“panel”: trust the panel repeat sequence (default behavior)
“vcf”: trust VCF REF and patch the overlapping part of the panel repeat sequence to match it
“skip”: skip the record with the mismatch
Notes
Prints summary statistics after processing.
- strvcf_annotator.core.vcf_processor.annotate_one_vcf(task: Tuple[str, str]) str[source]
Annotate a single VCF file in a worker process.
Runs annotate_vcf_to_file for one input VCF and writes the annotated VCF to the given output path. Intended to be executed inside a process pool.
- Parameters:
task (Tuple[str, str]) –
- (vcf_path, output_path) pair, where:
vcf_path is the input VCF (optionally gzipped)
output_path is the target annotated VCF path
- Returns:
Path to the produced output VCF file.
- Return type:
str
Notes
Expects STR reference (STR_DF) and worker configuration (WORKER_CONFIG) to be initialized once per worker via worker_init.
- strvcf_annotator.core.vcf_processor.get_available_ram_bytes() int[source]
Get available system RAM.
- Returns:
Available RAM in bytes.
- Return type:
int
- strvcf_annotator.core.vcf_processor.estimate_ram_per_worker_bytes(vcf_paths: List[str]) int[source]
Estimate RAM usage per worker for VCF annotation.
Provides an estimate of how much memory a single worker process might require while annotating one VCF. This estimate is used to cap the number of concurrent workers to reduce the risk of out-of-memory (OOM) crashes.
- Parameters:
vcf_paths (list[str]) – List of input VCF paths that will be processed.
- Returns:
Estimated RAM usage per worker in bytes.
- Return type:
int
Notes
If a VCF is not sorted, the current pipeline may load all records into memory for sorting, which can drastically increase memory usage.
Even for sorted VCFs, pysam/htslib buffers plus Python object overhead can be substantial.
Each worker loads the STR reference once. The STR DataFrame and derived Python objects often consume several times the BED file size on disk.
Heuristic
Identify the largest input file size on disk.
If the largest file is gzipped, assume a higher expansion factor for the working set (e.g., decompression + object overhead).
Add a fixed overhead to account for Python/pysam allocations.
Add STR panel RAM estimate as: str_panel_factor * BED_size_on_disk.
This is intentionally conservative to avoid OOM.
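The heuristic steps can be sketched as follows. All four constants are assumed values chosen for illustration; the documentation does not state the factors the package uses. File sizes are passed in as a mapping so the sketch runs without touching the filesystem.

```python
from typing import Dict

GZ_EXPANSION = 20                # assumed working-set multiplier for gzipped input
PLAIN_EXPANSION = 4              # assumed multiplier for uncompressed input
FIXED_OVERHEAD = 256 * 1024**2   # assumed Python/pysam baseline, in bytes
STR_PANEL_FACTOR = 5             # assumed multiplier for the STR panel BED size


def estimate_ram_per_worker(file_sizes: Dict[str, int], bed_size: int) -> int:
    """Estimate per-worker RAM in bytes from on-disk sizes (path -> bytes)."""
    largest_path = max(file_sizes, key=file_sizes.get)
    factor = GZ_EXPANSION if largest_path.endswith(".gz") else PLAIN_EXPANSION
    working_set = file_sizes[largest_path] * factor
    return working_set + FIXED_OVERHEAD + STR_PANEL_FACTOR * bed_size
```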
- strvcf_annotator.core.vcf_processor.compute_jobs_auto(n_files: int, vcf_paths: List[str]) int[source]
Compute an automatic number of concurrent workers.
Chooses a default number of parallel jobs for processing a directory of VCF files, balancing CPU capacity and memory constraints.
- Parameters:
n_files (int) – Number of VCF files that will be processed (after skipping outputs that already exist).
vcf_paths (list[str]) – List of VCF paths used to estimate per-worker memory usage.
- Returns:
Recommended number of concurrent worker processes (at least 1).
- Return type:
int
Notes
The selection follows:
- jobs_auto = min(cpu_cores, n_files)
- jobs_auto = min(jobs_auto, floor(available_ram / ram_per_worker_estimate))
If available RAM cannot be determined, the CPU-based limit is used.
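The selection rule can be written as a short standalone function. The name jobs_auto and the explicit arguments are illustrative; the real function derives CPU count, available RAM, and the per-worker estimate itself.

```python
from typing import Optional


def jobs_auto(
    cpu_cores: int,
    n_files: int,
    available_ram: Optional[int],
    ram_per_worker: int,
) -> int:
    """Pick a worker count bounded by CPU cores, file count, and RAM."""
    jobs = min(cpu_cores, n_files)
    if available_ram is not None:
        # Cap by how many workers fit in memory (floor division).
        jobs = min(jobs, available_ram // ram_per_worker)
    return max(1, jobs)  # always run at least one worker
```

For example, 8 cores, 20 files, 4 GiB of available RAM, and a 1 GiB per-worker estimate give 4 jobs.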
- strvcf_annotator.core.vcf_processor.worker_init(config: WorkerConfig) None[source]
Initialize worker process state.
Called once when a worker process starts. Stores configuration values so they can be reused for all VCF files processed by that worker.
- Parameters:
config (WorkerConfig) –
- Configuration object containing:
str_panel_gz : path to STR panel BGZF-compressed, tabix-indexed reference file
somatic_mode : whether somatic filtering is enabled
ignore_mismatch_warnings : whether to suppress mismatch warnings
mismatch_truth : rule for handling panel/VCF mismatches
Notes
This function is used as the initializer for ProcessPoolExecutor.
- strvcf_annotator.core.vcf_processor.process_directory(input_dir: str, str_panel_gz: str, output_dir: str, parser: BaseVCFParser = None, somatic_mode: bool = False, ignore_mismatch_warnings: bool = False, mismatch_truth: str = 'panel', jobs: int = None) None[source]
Batch process directory of VCF files.
Processes all VCF files in a directory and writes annotated versions to the output directory.
- Parameters:
input_dir (str) – Directory containing input VCF files
str_panel_gz (str) – Path to BGZF-compressed, tabix-indexed STR panel reference file
output_dir (str) – Directory for output VCF files
parser (BaseVCFParser, optional) – Parser for genotype extraction. Uses GenericParser if None.
somatic_mode (bool, optional) – Enable somatic filtering. When True, skips variants where both samples have identical genotypes. Default is False.
ignore_mismatch_warnings (bool) – If True, do not log mismatch warnings between VCF REF and panel repeat sequence. Annotation continues regardless.
mismatch_truth (str) –
- Which source to treat as correct when there is a mismatch:
“panel”: trust the panel repeat sequence (default behavior)
“vcf”: trust VCF REF and patch the overlapping part of the panel repeat sequence to match it
“skip”: skip the record with the mismatch
jobs (int, optional) –
- If jobs is None: compute jobs automatically:
jobs_auto = min(cpu_cores, n_files)
jobs_auto = min(jobs_auto, floor(available_ram / ram_per_worker_estimate))
- If jobs is provided: use it exactly.
VCF validation
Input validation functions.
- exception strvcf_annotator.utils.validation.ValidationError[source]
Bases: Exception
Exception raised for validation errors.
- strvcf_annotator.utils.validation.validate_file_path(file_path: str, must_exist: bool = True) Path[source]
Validate file path.
- Parameters:
file_path (str) – Path to validate
must_exist (bool, optional) – If True, file must exist. Default is True.
- Returns:
Validated Path object
- Return type:
Path
- Raises:
ValidationError – If file path is invalid or doesn’t exist when required
- strvcf_annotator.utils.validation.validate_directory_path(dir_path: str, must_exist: bool = True, create: bool = False) Path[source]
Validate directory path.
- Parameters:
dir_path (str) – Directory path to validate
must_exist (bool, optional) – If True, directory must exist. Default is True.
create (bool, optional) – If True, create directory if it doesn’t exist. Default is False.
- Returns:
Validated Path object
- Return type:
Path
- Raises:
ValidationError – If directory path is invalid
- strvcf_annotator.utils.validation.validate_vcf_file(vcf_path: str) bool[source]
Validate VCF file format.
- Parameters:
vcf_path (str) – Path to VCF file
- Returns:
True if VCF is valid
- Return type:
bool
- Raises:
ValidationError – If VCF file is invalid or cannot be opened
- strvcf_annotator.utils.validation.validate_bed_file(bed_path: str) bool[source]
Validate BED file format.
- Parameters:
bed_path (str) – Path to BED file
- Returns:
True if BED is valid
- Return type:
bool
- Raises:
ValidationError – If BED file is invalid or cannot be opened
- strvcf_annotator.utils.validation.validate_str_bed_file(bed_path: str) bool[source]
Validate STR BED file format with required columns.
- Parameters:
bed_path (str) – Path to STR BED file
- Returns:
True if STR BED is valid
- Return type:
bool
- Raises:
ValidationError – If STR BED file is invalid or missing required columns
VCF utils
Utility functions for VCF processing.
- strvcf_annotator.utils.vcf_utils.chrom_to_order(chrom: str) int[source]
Map chromosome names like ‘chr1’, ‘chrX’, ‘1’ to an integer order so that sorting is: 1,2,…,22,X,Y,M/MT,others.
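One possible mapping that produces this ordering is sketched below. The function name chrom_order and the concrete rank values (23 for X, 24 for Y, 25 for M/MT, 26 for everything else) are assumptions; only the resulting order is taken from the description above.

```python
def chrom_order(chrom: str) -> int:
    """Map a chromosome name to a sortable rank: 1..22, X, Y, M/MT, others."""
    name = chrom[3:] if chrom.lower().startswith("chr") else chrom
    if name.isdecimal() and 1 <= int(name) <= 22:
        return int(name)
    special = {"X": 23, "Y": 24, "M": 25, "MT": 25}
    return special.get(name.upper(), 26)  # unknown contigs sort last


names = ["chrX", "chr10", "chr2", "chrM", "chr1", "chrUn_gl000220", "chrY"]
ordered = sorted(names, key=chrom_order)
```

Here ordered is chr1, chr2, chr10, chrX, chrY, chrM, chrUn_gl000220.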
- strvcf_annotator.utils.vcf_utils.normalize_info_fields(record: VariantRecord, header: VariantHeader) Dict[str, Any][source]
Normalize INFO fields for proper VCF serialization.
Handles various INFO field types and ensures they are properly formatted for writing to VCF files. Handles Flags, Strings, and R-type fields specially.
- Parameters:
record (pysam.VariantRecord) – VCF record with INFO fields to normalize
header (pysam.VariantHeader) – VCF header with field definitions
- Returns:
Normalized INFO fields ready for VCF writing
- Return type:
Dict[str, Any]
Notes
Flag fields are included only if True
String fields with multiple values are joined with “|”
R-type fields (REF + ALT) are clipped to first 2 values
Unknown fields are skipped
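The four rules in the notes can be sketched on plain dictionaries. The name normalize_info and the definitions mapping (field name to a (Number, Type) pair) are stand-ins for the pysam record and header; when a field is both R-type and String, this sketch applies the R-type rule first, which is an assumption.

```python
from typing import Any, Dict, Tuple


def normalize_info(
    info: Dict[str, Any],
    definitions: Dict[str, Tuple[str, str]],
) -> Dict[str, Any]:
    """Normalize INFO values using (Number, Type) header definitions."""
    normalized: Dict[str, Any] = {}
    for key, value in info.items():
        if key not in definitions:
            continue  # unknown fields are skipped
        number, field_type = definitions[key]
        if field_type == "Flag":
            if value:
                normalized[key] = True  # flags are kept only when set
        elif number == "R" and isinstance(value, (tuple, list)):
            normalized[key] = tuple(value[:2])  # clip to REF + first ALT
        elif field_type == "String" and isinstance(value, (tuple, list)):
            normalized[key] = "|".join(str(v) for v in value)
        else:
            normalized[key] = value
    return normalized
```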
- strvcf_annotator.utils.vcf_utils.get_sample_by_name(record: VariantRecord, sample_name: str) Any[source]
Get sample data by name from VCF record.
- Parameters:
record (pysam.VariantRecord) – VCF record
sample_name (str) – Name of sample to retrieve
- Returns:
Sample data object
- Return type:
Any
- Raises:
KeyError – If sample name not found in record
- strvcf_annotator.utils.vcf_utils.get_sample_by_index(record: VariantRecord, sample_idx: int) Any[source]
Get sample data by index from VCF record.
- Parameters:
record (pysam.VariantRecord) – VCF record
sample_idx (int) – Index of sample to retrieve
- Returns:
Sample data object
- Return type:
Any
- Raises:
IndexError – If sample index out of range
- strvcf_annotator.utils.vcf_utils.has_format_field(record: VariantRecord, field_name: str) bool[source]
Check if FORMAT field exists in any sample.
- Parameters:
record (pysam.VariantRecord) – VCF record
field_name (str) – Name of FORMAT field to check
- Returns:
True if field exists in at least one sample
- Return type:
bool
STR reference processing
STR reference management for BED file loading and region lookups.
- strvcf_annotator.core.str_reference.is_valid_tabix(gz_path: str) bool[source]
Check that a BGZF file has a valid tabix index.
Returns True only if:
- .tbi exists
- pysam can open the file
- index is readable
- strvcf_annotator.core.str_reference.sort_bed_file(bed_path: str, output_path: str, chrom_col: int = 0, start_col: int = 1) str[source]
Sort a BED-like file by chromosome and start coordinate.
- Parameters:
bed_path (str) – Path to input BED file (tab-delimited).
output_path (str) – Path to write the sorted BED file.
chrom_col (int, optional) – Zero-based column index for chromosome. Default is 0.
start_col (int, optional) – Zero-based column index for start coordinate. Default is 1.
- Returns:
Path to the sorted BED file.
- Return type:
str
Notes
This function loads the BED into memory via pandas. For extremely large BED files, consider an external sort.
Sorting is lexicographic by chromosome, then numeric by start.
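The sort order in the notes (lexicographic chromosome, numeric start) can be shown on in-memory rows. This sketch uses plain lists of strings rather than pandas, and sort_bed_rows is a hypothetical name.

```python
from typing import List


def sort_bed_rows(
    rows: List[List[str]],
    chrom_col: int = 0,
    start_col: int = 1,
) -> List[List[str]]:
    """Sort BED-like rows by chromosome (as text), then start (as integer)."""
    return sorted(rows, key=lambda row: (row[chrom_col], int(row[start_col])))


rows = [
    ["chr2", "50", "60"],
    ["chr10", "5", "9"],
    ["chr1", "200", "210"],
    ["chr1", "30", "40"],
]
sorted_rows = sort_bed_rows(rows)
```

Lexicographic chromosome order places chr10 before chr2, while the numeric start places 30 before 200 within chr1.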
- strvcf_annotator.core.str_reference.load_str_reference(str_path: str) str[source]
Ensure a BED file is BGZF-compressed and tabix-indexed.
This function:
- Accepts a BED path (.bed or .bed.gz).
- If the input is already .gz and has a .tbi index, returns it.
- Otherwise, creates a sorted BED (if needed), BGZF-compresses it, and creates a tabix index (preset="bed").
- Parameters:
str_path (str) – Path to input BED file (.bed or .bed.gz).
- Returns:
Path to the BGZF-compressed, tabix-indexed BED file (*.gz).
- Return type:
str
Notes
Tabix indexing requires the BED to be sorted by chromosome and start.
This function uses pysam.tabix_compress and pysam.tabix_index.
- strvcf_annotator.core.str_reference.find_overlapping_str(str_panel_gz: str, chrom: str, pos: int, end: int) Dict | None[source]
Find STR region overlapping with variant coordinates using tabix index.
- Parameters:
str_panel_gz (str) – Path to BGZF-compressed, tabix-indexed STR reference file.
chrom (str) – Chromosome name.
pos (int) – Variant start position (1-based).
end (int) – Variant end position (1-based).
- Returns:
Dictionary with STR region data if overlap found, None otherwise. Keys: CHROM, START, END, PERIOD, RU, COUNT
- Return type:
Optional[Dict]
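The overlap test between a 1-based variant interval and a panel row can be sketched as below. This assumes the panel follows the standard BED convention (0-based start, end-exclusive), which this reference does not confirm; the function name overlaps is hypothetical.

```python
def overlaps(pos: int, end: int, bed_start: int, bed_end: int) -> bool:
    """Check whether 1-based inclusive [pos, end] overlaps BED [bed_start, bed_end).

    Assumption: BED interval [bed_start, bed_end) covers 1-based positions
    bed_start + 1 through bed_end.
    """
    return pos <= bed_end and end > bed_start
```

Under that assumption, a BED row (100, 110) covers 1-based positions 101 to 110, so a variant at position 100 does not overlap while one at position 110 does.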
- strvcf_annotator.core.str_reference.get_str_at_position(str_panel_gz: str, chrom: str, pos: int) Dict | None[source]
Get STR region containing a specific position using tabix index.
- Parameters:
str_panel_gz (str) – Path to BGZF-compressed, tabix-indexed STR reference file.
chrom (str) – Chromosome name.
pos (int) – Position to query (1-based).
- Returns:
Dictionary with STR region data if position is within an STR, None otherwise.
- Return type:
Optional[Dict]
Repeat utils module
Utilities for repeat sequence operations.
- strvcf_annotator.core.repeat_utils.extract_repeat_sequence(str_row: Dict | Series) str[source]
Reconstruct repeat sequence from STR metadata.
Generates the full repeat sequence by repeating the repeat unit (RU) the calculated number of times (COUNT).
- Parameters:
str_row (Dict or pd.Series) – STR region data containing ‘RU’ (repeat unit) and ‘COUNT’ (number of repeats)
- Returns:
Full repeat sequence
- Return type:
str
Examples
>>> str_row = {'RU': 'CAG', 'COUNT': 5}
>>> extract_repeat_sequence(str_row)
'CAGCAGCAGCAGCAG'
- strvcf_annotator.core.repeat_utils.count_repeat_units(sequence: str, motif: str) int[source]
Return the length of the longest contiguous run of motif in sequence.
The function looks for exact, non-overlapping copies of motif that occur consecutively and returns the maximum number of such copies in any run.
This corresponds to how STR repeat counts are typically defined: the length of the longest perfect contiguous block of the repeat unit.
- Parameters:
sequence (str) – DNA sequence to search.
motif (str) – Repeat unit motif to count (e.g. ‘A’, ‘CAG’).
- Returns:
Length of the longest contiguous run of motif in sequence.
- Return type:
int
- Raises:
ValueError – If motif is empty or if either argument is not a string.
Examples
Perfect repeats
>>> count_repeat_units("CAGCAGCAG", "CAG")
3
Imperfect tail
>>> count_repeat_units("CAGCAGCA", "CAG")
2
No repeats
>>> count_repeat_units("ATCG", "CAG")
0
Homopolymer runs
>>> count_repeat_units("ATAAAAA", "A")
5
>>> count_repeat_units("AAAATAAA", "A")  # longest contiguous run is 'AAAA'
4
Overlapping motifs
‘AAAA’ with motif ‘AA’ contains two non-overlapping copies: ‘AA’ ‘AA’
>>> count_repeat_units("AAAA", "AA")
2
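A straightforward way to compute such a count is to try every start offset and extend the run in motif-sized steps. The sketch below is one possible implementation, not necessarily the package's; longest_run is a hypothetical name, and matching is exact (case-sensitive) here.

```python
def longest_run(sequence: str, motif: str) -> int:
    """Return the number of motif copies in the longest contiguous run."""
    if not isinstance(sequence, str) or not isinstance(motif, str):
        raise ValueError("sequence and motif must be strings")
    if not motif:
        raise ValueError("motif must not be empty")
    best = 0
    step = len(motif)
    for start in range(len(sequence)):
        run = 0
        pos = start
        # Extend the run while another exact copy follows immediately.
        while sequence.startswith(motif, pos):
            run += 1
            pos += step
        best = max(best, run)
    return best
```

Trying every start offset avoids missing a longer run that begins out of phase with an earlier, shorter one.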
- strvcf_annotator.core.repeat_utils.normalize_variant(pos: int, ref: str, alt: str) Tuple[int, str, str][source]
Locally normalize (pos, ref, alt) by trimming shared prefix/suffix.
pos is 1-based VCF coordinate.
Trimming is case-insensitive.
We always keep at least 1 base in ref and alt if they differ.
- Return type:
new_pos, new_ref, new_alt
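Prefix/suffix trimming of this kind can be sketched as follows. The order used here (suffix first, then prefix) follows common VCF normalization practice and is an assumption, as is the name trim_variant; the package may trim in a different order.

```python
from typing import Tuple


def trim_variant(pos: int, ref: str, alt: str) -> Tuple[int, str, str]:
    """Trim shared suffix, then shared prefix, keeping >= 1 base per allele."""
    # Trim the shared suffix; pos is unaffected.
    while len(ref) > 1 and len(alt) > 1 and ref[-1].upper() == alt[-1].upper():
        ref, alt = ref[:-1], alt[:-1]
    # Trim the shared prefix, advancing pos for every base removed.
    while len(ref) > 1 and len(alt) > 1 and ref[0].upper() == alt[0].upper():
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt
```

For example, (100, "GCAT", "GTAT") trims to (101, "C", "T"), and the deletion (100, "ACAGCAG", "ACAG") trims to (100, "ACAG", "A").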
- strvcf_annotator.core.repeat_utils.apply_variant_to_repeat(pos: int, ref: str, alt: str, repeat_start: int, repeat_seq: str) str[source]
Apply a variant to an STR repeat sequence.
The function applies a VCF variant to a reference STR sequence while respecting VCF normalization rules and STR boundaries.
The algorithm works as follows:
Normalize pos, ref, and alt by trimming shared prefixes and suffixes.
If the normalized variant lies fully inside the STR region, apply the full ALT allele.
If the variant partially overlaps the STR region:
- SNP-like variants (len(ref) == len(alt)) are aligned positionally.
- Indel-like variants (len(ref) != len(alt)) use the suffix of ALT that overlaps the STR.
Any parts of the variant outside the STR window are ignored.
Notes
The genomic reference is conceptually treated as:
repeat_seq + UNKNOWN_SUFFIX
Differences outside the STR window do not affect the resulting mutated STR sequence.
Case handling rules:
All matching and overlap logic is case-insensitive.
Output case follows the overlapping STR segment:
lowercase STR slice → lowercase ALT
uppercase STR slice → uppercase ALT
mixed case → ALT is left unchanged
- Parameters:
pos (int) – Variant position (1-based VCF coordinate).
ref (str) – Reference allele from the VCF record.
alt (str) – Alternate allele from the VCF record.
repeat_start (int) – Start position of the STR region (1-based).
repeat_seq (str) – Reference STR sequence from the panel.
- Returns:
The mutated STR sequence after applying the variant. If the variant does not overlap the STR region, the original repeat_seq is returned unchanged.
- Return type:
str
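The simplest branch, a variant lying fully inside the STR window, together with the case-handling rules, can be sketched as below. Both helper names are hypothetical, and the sketch deliberately omits the normalization step and the partial-overlap branches.

```python
def match_case(alt: str, str_slice: str) -> str:
    """Make ALT follow the case of the STR segment it replaces."""
    if str_slice.islower():
        return alt.lower()
    if str_slice.isupper():
        return alt.upper()
    return alt  # mixed case: leave ALT unchanged


def apply_inside(pos: int, ref: str, alt: str, repeat_start: int, repeat_seq: str) -> str:
    """Apply a variant that lies fully inside the STR window (1-based coords)."""
    offset = pos - repeat_start
    if offset < 0 or offset + len(ref) > len(repeat_seq):
        raise ValueError("variant is not fully inside the STR window")
    str_slice = repeat_seq[offset:offset + len(ref)]
    # Replace the REF-sized slice with the case-adjusted ALT allele.
    return repeat_seq[:offset] + match_case(alt, str_slice) + repeat_seq[offset + len(ref):]
```

With a lowercase panel sequence cagcag starting at position 100, an uppercase ALT of G at position 101 is inserted as g, giving cggcag.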
- strvcf_annotator.core.repeat_utils.is_perfect_repeat(sequence: str, motif: str) bool[source]
Check if sequence is a perfect repeat of the motif.
A perfect repeat means the sequence consists entirely of exact copies of the motif with no interruptions or variations.
- Parameters:
sequence (str) – DNA sequence to check
motif (str) – Repeat unit motif
- Returns:
True if sequence is a perfect repeat, False otherwise
- Return type:
bool
Examples
>>> is_perfect_repeat('CAGCAGCAG', 'CAG')
True
>>> is_perfect_repeat('CAGCAGCA', 'CAG')
False
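One compact way to express this check is to rebuild the expected sequence from the motif and compare. The name is_perfect is hypothetical, and treating an empty sequence or empty motif as not perfect is an assumption of this sketch.

```python
def is_perfect(sequence: str, motif: str) -> bool:
    """Return True if sequence is one or more exact copies of motif."""
    if not sequence or not motif:
        return False  # assumption: empty inputs are not perfect repeats
    copies, remainder = divmod(len(sequence), len(motif))
    return remainder == 0 and sequence == motif * copies
```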
Parser class
Abstract base class for VCF parsers.
- class strvcf_annotator.parsers.base.BaseVCFParser[source]
Bases: ABC
Abstract base class for VCF parsers.
Defines the interface for extracting genotype and variant information from VCF records in a standardized way. All parser implementations must inherit from this class and implement all abstract methods.
- abstractmethod get_genotype(record: VariantRecord, sample_idx: int) Tuple[int, int] | None[source]
Extract genotype as (allele1, allele2) or None if unknown.
- Parameters:
record (pysam.VariantRecord) – The VCF record to extract genotype from
sample_idx (int) – Index of the sample in the record
- Returns:
Genotype as tuple of allele indices (0=REF, 1=ALT), or None if unknown
- Return type:
Optional[Tuple[int, int]]
- abstractmethod has_variant(record: VariantRecord, sample_idx: int) bool[source]
Check if sample has variant even when GT is unknown.
- Parameters:
record (pysam.VariantRecord) – The VCF record to check
sample_idx (int) – Index of the sample in the record
- Returns:
True if variant is present, False otherwise
- Return type:
bool
- abstractmethod extract_info(record: VariantRecord, sample_idx: int) Dict[str, Any][source]
Extract additional fields (AD, DP, etc.) as dictionary.
- Parameters:
record (pysam.VariantRecord) – The VCF record to extract information from
sample_idx (int) – Index of the sample in the record
- Returns:
Dictionary of additional FORMAT fields
- Return type:
Dict[str, Any]
Generic parser
Generic parser for standard VCF format fields.
- class strvcf_annotator.parsers.generic.GenericParser[source]
Bases: BaseVCFParser
Generic parser for standard VCF format fields.
Handles standard VCF FORMAT fields including GT (genotype), AD (allelic depth), and DP (total depth). Provides robust error handling for missing or invalid data.
- get_genotype(record: VariantRecord, sample_idx: int) Tuple[int, int] | None[source]
Extract GT field, return None for missing/invalid genotypes.
- Parameters:
record (pysam.VariantRecord) – The VCF record to extract genotype from
sample_idx (int) – Index of the sample in the record
- Returns:
Genotype as tuple of allele indices, or None if missing/invalid
- Return type:
Optional[Tuple[int, int]]
- has_variant(record: VariantRecord, sample_idx: int) bool[source]
Check variant presence using GT or alternative evidence.
- Parameters:
record (pysam.VariantRecord) – The VCF record to check
sample_idx (int) – Index of the sample in the record
- Returns:
True if variant is present, False otherwise
- Return type:
bool
- extract_info(record: VariantRecord, sample_idx: int) Dict[str, Any][source]
Extract AD, DP and other standard FORMAT fields.
- Parameters:
record (pysam.VariantRecord) – The VCF record to extract information from
sample_idx (int) – Index of the sample in the record
- Returns:
Dictionary of FORMAT fields (AD, DP, etc.)
- Return type:
Dict[str, Any]