Dev Documentation

Annotation

Core annotation engine for building STR-annotated VCF records.

strvcf_annotator.core.annotation.make_modified_header(vcf_in: VariantFile) → VariantHeader[source]

Create VCF header with STR-specific INFO and FORMAT fields.

Creates a modified VCF header that includes all original header information plus STR-specific annotations. Replaces existing RU, PERIOD, REF, PERFECT INFO fields and REPCN FORMAT field with STR-specific definitions.

Parameters: vcf_in (pysam.VariantFile) – Input VCF file object
Returns: New header with STR-specific fields
Return type: pysam.VariantHeader

Notes

INFO fields added/replaced:

RU: Repeat unit
PERIOD: Repeat period (length of unit)
REF: Reference copy number
PERFECT: Indicates perfect repeats in REF and ALT

FORMAT field added/replaced:

REPCN: Genotype as number of repeat motif copies

strvcf_annotator.core.annotation.build_new_record(record: VariantRecord, str_row: Dict | Series, header: VariantHeader, parser: BaseVCFParser, ignore_mismatch_warnings: bool = False, mismatch_truth: str = 'panel') → VariantRecord | None[source]

Build annotated VCF record with STR alleles and metadata.

Constructs a new VCF record where alleles represent full repeat sequences (before and after mutation) and adds STR-specific annotations to INFO and FORMAT fields.

Parameters:

record (pysam.VariantRecord) – Original VCF record with mutation
str_row (Dict or pd.Series) – STR metadata (CHROM, START, END, RU, PERIOD)
header (pysam.VariantHeader) – Modified header with STR fields
parser (BaseVCFParser) – Parser for extracting genotype information
ignore_mismatch_warnings (bool) – If True, do not log mismatch warnings between VCF REF and panel repeat sequence. Annotation continues regardless.
mismatch_truth (str) – Which source to treat as correct when the VCF REF and the panel repeat sequence disagree:
- “panel”: trust the panel repeat sequence (default behavior)
- “vcf”: trust the VCF REF and patch the overlapping part of the panel repeat sequence to match it
- “skip”: skip the record with the mismatch

Returns:

New record with STR alleles and annotations, or None if the record is skipped (reference mismatch with mismatch_truth set to “skip”)

Return type:

Optional[pysam.VariantRecord]

Notes

Logs warning if reference mismatch detected
Calculates repeat copy numbers for REF and ALT
Marks PERFECT=TRUE only if both alleles are perfect repeats
Preserves all original FORMAT fields
Returns None if there is a reference mismatch and mismatch_truth is “skip”
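
The three mismatch_truth modes can be illustrated on plain strings. This is a minimal sketch, not the package's implementation: resolve_reference is a hypothetical helper, coordinates are assumed to be 1-based, and the comparison is assumed to be case-insensitive.

```python
from typing import Optional


def resolve_reference(panel_seq: str, repeat_start: int, pos: int,
                      vcf_ref: str, mismatch_truth: str = "panel") -> Optional[str]:
    """Return the repeat sequence to use as reference, or None to skip.

    Hypothetical helper: repeat_start and pos are assumed 1-based.
    """
    offset = pos - repeat_start
    lo = max(offset, 0)                               # overlap start within panel_seq
    hi = min(offset + len(vcf_ref), len(panel_seq))   # overlap end (exclusive)
    if lo >= hi:
        return panel_seq                              # REF does not touch the repeat
    panel_part = panel_seq[lo:hi]
    vcf_part = vcf_ref[lo - offset:hi - offset]
    if panel_part.upper() == vcf_part.upper():
        return panel_seq                              # no mismatch, nothing to decide
    if mismatch_truth == "panel":
        return panel_seq                              # keep the panel sequence as is
    if mismatch_truth == "vcf":
        # Patch only the overlapping slice so it matches the VCF REF.
        return panel_seq[:lo] + vcf_part + panel_seq[hi:]
    if mismatch_truth == "skip":
        return None                                   # caller drops the record
    raise ValueError(f"unknown mismatch_truth: {mismatch_truth!r}")
```

With a panel sequence of CAGCAGCAG starting at position 100 and a record at position 103 with REF CTG, “panel” keeps the panel sequence, “vcf” yields CAGCTGCAG, and “skip” yields None.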

strvcf_annotator.core.annotation.should_skip_genotype(record: VariantRecord, parser: BaseVCFParser) → bool[source]

Determine if record should be skipped based on genotype filtering.

Skips records where:
- Not exactly 2 samples are present
- Genotypes are invalid or missing
- Both samples have identical genotypes

Parameters:

record (pysam.VariantRecord) – VCF record to check
parser (BaseVCFParser) – Parser for extracting genotypes

Returns:

True if record should be skipped, False otherwise

Return type:

bool
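
The three skip conditions can be sketched on plain tuples. This is an illustration under assumptions, not the package's code: should_skip is a hypothetical helper that takes genotypes already extracted by a parser, and “identical” is assumed to mean plain tuple equality.

```python
from typing import Optional, Sequence, Tuple

Genotype = Optional[Tuple[Optional[int], ...]]


def should_skip(genotypes: Sequence[Genotype]) -> bool:
    """Apply the documented skip rules to already-extracted genotypes."""
    if len(genotypes) != 2:
        return True                      # not exactly two samples
    for gt in genotypes:
        if gt is None or len(gt) == 0 or any(a is None for a in gt):
            return True                  # missing or invalid genotype
    # Assumption: identical means equal tuples (allele order matters here).
    return genotypes[0] == genotypes[1]
```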

Process vcf

VCF file processing and workflow management.

class strvcf_annotator.core.vcf_processor.WorkerConfig(str_panel_gz: str, somatic_mode: bool, ignore_mismatch_warnings: bool, mismatch_truth: str, parser: BaseVCFParser)[source]

Bases: object

Configuration container for worker processes.

Stores parameters required by each worker to annotate VCF files. The configuration is passed once during worker initialization and reused for all tasks processed by that worker.

str_panel_gz

Path to the BGZF-compressed, tabix-indexed STR reference file.

Type: str

somatic_mode

Enable somatic filtering. When True, skips variants where both samples have identical genotypes. Default is False.

Type: bool, optional

ignore_mismatch_warnings

If True, suppresses warnings about reference mismatches between the STR panel and VCF REF alleles. Default is False.

Type: bool, optional

mismatch_truth

Specifies which source to consider as ground truth for mismatches. Options are “panel”, “vcf”, or “skip”. Default is “panel”.

Type: str, optional

Notes

The dataclass is frozen to ensure the configuration remains immutable once workers are initialized.
Instances of this class are passed to worker_init, which loads the STR reference and exposes these settings to worker tasks.
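
What “frozen” means in practice can be shown with a stand-in class. DemoWorkerConfig below is hypothetical and omits the parser field; it only mirrors the documented attribute names to show that assignment after construction fails.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class DemoWorkerConfig:
    """Stand-in for WorkerConfig (hypothetical; parser field omitted)."""
    str_panel_gz: str
    somatic_mode: bool
    ignore_mismatch_warnings: bool
    mismatch_truth: str


def try_mutate(cfg: DemoWorkerConfig) -> bool:
    """Return True if the config could be modified, False if it is frozen."""
    try:
        cfg.somatic_mode = True          # frozen dataclasses reject assignment
    except FrozenInstanceError:
        return False
    return True


cfg = DemoWorkerConfig("panel.bed.gz", False, False, "panel")
```

Calling try_mutate(cfg) returns False, and cfg.somatic_mode keeps its original value.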

str_panel_gz: str

somatic_mode: bool

ignore_mismatch_warnings: bool

mismatch_truth: str

parser: BaseVCFParser

strvcf_annotator.core.vcf_processor.check_vcf_sorted(vcf_in: VariantFile) → bool[source]

Validate VCF sorting by chromosome and position.

Checks if VCF records are sorted by chromosome and position. Rewinds the file after checking.

Parameters: vcf_in (pysam.VariantFile) – Input VCF file
Returns: True if VCF is sorted, False otherwise
Return type: bool
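
A sortedness check of this kind can be sketched on (chrom, pos) pairs. This is a sketch under assumptions: is_sorted is a hypothetical helper, chromosomes are ranked by a caller-supplied contig order, and the file rewind that the real function performs is not modelled.

```python
from typing import Iterable, Optional, Sequence, Tuple


def is_sorted(records: Iterable[Tuple[str, int]],
              contig_order: Sequence[str]) -> bool:
    """Check that (chrom, pos) pairs never go backwards."""
    rank = {name: i for i, name in enumerate(contig_order)}
    previous: Optional[Tuple[int, int]] = None
    for chrom, pos in records:
        # Contigs missing from contig_order are ranked after all known ones.
        key = (rank.get(chrom, len(rank)), pos)
        if previous is not None and key < previous:
            return False
        previous = key
    return True
```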

strvcf_annotator.core.vcf_processor.reset_and_sort_vcf(vcf_in: VariantFile) → List[VariantRecord][source]

Sort VCF records in memory when needed.

Loads all VCF records into memory and sorts them by chromosome and position according to the contig order in the header.

Parameters: vcf_in (pysam.VariantFile) – Input VCF file
Returns: Sorted list of VCF records
Return type: List[pysam.VariantRecord]

Notes

This loads the entire VCF into memory, so use with caution for large files.

strvcf_annotator.core.vcf_processor.generate_annotated_records(vcf_in: VariantFile, str_panel_gz: str, parser: BaseVCFParser = None, somatic_mode: bool = False, ignore_mismatch_warnings: bool = False, mismatch_truth: str = 'panel') → Iterator[VariantRecord][source]

Generator yielding annotated VCF records.

Processes VCF records and yields annotated records for variants that overlap with STR regions. Handles sorting if needed and optionally filters records based on genotype criteria. When multiple STR regions overlap the same POS, all overlapping STR candidates are tried and the first one that produces a meaningful STR allele change is used.

Parameters:

vcf_in (pysam.VariantFile) – Input VCF file
str_panel_gz (str) – Path to BGZF-compressed, tabix-indexed STR reference file.
parser (BaseVCFParser, optional) – Parser for genotype extraction. Uses GenericParser if None.
somatic_mode (bool, optional) – Enable somatic filtering. When True, skips variants where both samples have identical genotypes. Default is False.
ignore_mismatch_warnings (bool, optional) – If True, suppresses warnings about reference mismatches between the STR panel and VCF REF alleles. Default is False.
mismatch_truth (str, optional) – Specifies which source to consider as ground truth for mismatches. Options are “panel”, “vcf”, or “skip”. Default is “panel”.

Yields:

pysam.VariantRecord – Annotated VCF records

Notes

Automatically sorts VCF if not sorted
Skips records without STR overlap
If somatic_mode=True, filters records with identical genotypes
ignore_mismatch_warnings controls logging of reference mismatches
mismatch_truth controls which source is considered ground truth for mismatches

strvcf_annotator.core.vcf_processor.annotate_vcf_to_file(vcf_path: str, str_panel_gz: str, output_path: str, parser: BaseVCFParser = None, somatic_mode: bool = False, ignore_mismatch_warnings: bool = False, mismatch_truth: str = 'panel') → None[source]

Process VCF file and write annotated output.

Reads a VCF file, annotates variants that overlap with STR regions, and writes the annotated records to an output file.

Parameters:

vcf_path (str) – Path to input VCF file
str_panel_gz (str) – Path to BGZF-compressed, tabix-indexed STR reference file.
output_path (str) – Path to output VCF file
parser (BaseVCFParser, optional) – Parser for genotype extraction. Uses GenericParser if None.
somatic_mode (bool, optional) – Enable somatic filtering. When True, skips variants where both samples have identical genotypes. Default is False.
ignore_mismatch_warnings (bool) – If True, do not log mismatch warnings between VCF REF and panel repeat sequence. Annotation continues regardless.
mismatch_truth (str) – Which source to treat as correct when the VCF REF and the panel repeat sequence disagree:
- “panel”: trust the panel repeat sequence (default behavior)
- “vcf”: trust the VCF REF and patch the overlapping part of the panel repeat sequence to match it
- “skip”: skip the record with the mismatch

Notes

Prints summary statistics after processing.

strvcf_annotator.core.vcf_processor.annotate_one_vcf(task: Tuple[str, str]) → str[source]

Annotate a single VCF file in a worker process.

Runs annotate_vcf_to_file for one input VCF and writes the annotated VCF to the given output path. Intended to be executed inside a process pool.

Parameters:

task (Tuple[str, str]) –

(vcf_path, output_path) pair, where:

vcf_path is the input VCF (optionally gzipped)
output_path is the target annotated VCF path

Returns:

Path to the produced output VCF file.

Return type:

str

Notes

Expects STR reference (STR_DF) and worker configuration (WORKER_CONFIG) to be initialized once per worker via worker_init.

strvcf_annotator.core.vcf_processor.get_available_ram_bytes() → int[source]

Get available system RAM.

Returns: Available RAM in bytes.
Return type: int

strvcf_annotator.core.vcf_processor.estimate_ram_per_worker_bytes(vcf_paths: List[str]) → int[source]

Estimate RAM usage per worker for VCF annotation.

Provides an estimate of how much memory a single worker process might require while annotating one VCF. This estimate is used to cap the number of concurrent workers to reduce the risk of out-of-memory (OOM) crashes.

Parameters: vcf_paths (list[str]) – List of input VCF paths that will be processed.
Returns: Estimated RAM usage per worker in bytes.
Return type: int

Notes

If a VCF is not sorted, the current pipeline may load all records into memory for sorting, which can drastically increase memory usage.
Even for sorted VCFs, pysam/htslib buffers plus Python object overhead can be substantial.
Each worker loads the STR reference once. The STR DataFrame and derived Python objects often consume several times the BED file size on disk.

Heuristic

Identify the largest input file size on disk.
If the largest file is gzipped, assume a higher expansion factor for the working set (e.g., decompression + object overhead).
Add a fixed overhead to account for Python/pysam allocations.
Add STR panel RAM estimate as: str_panel_factor * BED_size_on_disk.

This is intentionally conservative to avoid OOM.
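
The heuristic steps above can be sketched as a single function. Every constant here (gz_factor, plain_factor, fixed_overhead, str_panel_factor) is a placeholder chosen for illustration, not the value the package uses, and file sizes are passed in directly so that the sketch does not touch the filesystem.

```python
from typing import Dict


def estimate_ram_per_worker(vcf_sizes: Dict[str, int], bed_size: int, *,
                            gz_factor: int = 20, plain_factor: int = 5,
                            fixed_overhead: int = 512 * 1024 ** 2,
                            str_panel_factor: int = 8) -> int:
    """Estimate bytes of RAM per worker from on-disk sizes (placeholder factors)."""
    if vcf_sizes:
        # Step 1: the largest input file drives the estimate.
        largest_path, largest_size = max(vcf_sizes.items(), key=lambda kv: kv[1])
    else:
        largest_path, largest_size = "", 0
    # Step 2: gzipped inputs get a higher expansion factor.
    factor = gz_factor if largest_path.endswith(".gz") else plain_factor
    # Steps 3 and 4: add a fixed overhead and the STR panel estimate.
    return largest_size * factor + fixed_overhead + str_panel_factor * bed_size
```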

strvcf_annotator.core.vcf_processor.compute_jobs_auto(n_files: int, vcf_paths: List[str]) → int[source]

Compute an automatic number of concurrent workers.

Chooses a default number of parallel jobs for processing a directory of VCF files, balancing CPU capacity and memory constraints.

Parameters:

n_files (int) – Number of VCF files that will be processed (after skipping outputs that already exist).
vcf_paths (list[str]) – List of VCF paths used to estimate per-worker memory usage.

Returns:

Recommended number of concurrent worker processes (at least 1).

Return type:

int

Notes

The selection follows:
- jobs_auto = min(cpu_cores, n_files)
- jobs_auto = min(jobs_auto, floor(available_ram / ram_per_worker_estimate))

If available RAM cannot be determined, the CPU-based limit is used.
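
The selection rule can be written out directly. In this sketch, jobs_auto is a hypothetical stand-alone function: the CPU count, available RAM and per-worker estimate are passed in as arguments, and None stands for “available RAM could not be determined”.

```python
from typing import Optional


def jobs_auto(cpu_cores: int, n_files: int,
              available_ram: Optional[int], ram_per_worker: int) -> int:
    """Pick a worker count bounded by CPUs, files, and memory (at least 1)."""
    jobs = min(cpu_cores, n_files)
    if available_ram is not None and ram_per_worker > 0:
        # Integer division is floor(available_ram / ram_per_worker).
        jobs = min(jobs, available_ram // ram_per_worker)
    return max(1, jobs)
```

For example, with 8 cores, 20 files, 16 GiB available and a 4 GiB estimate per worker, the memory bound wins and the result is 4.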

strvcf_annotator.core.vcf_processor.worker_init(config: WorkerConfig) → None[source]

Initialize worker process state.

Called once when a worker process starts. Stores configuration values so they can be reused for all VCF files processed by that worker.

Parameters:

config (WorkerConfig) –

Configuration object containing:

str_panel_gz : path to STR panel BGZF-compressed, tabix-indexed reference file
somatic_mode : whether somatic filtering is enabled
ignore_mismatch_warnings : whether to suppress mismatch warnings
mismatch_truth : rule for handling panel/VCF mismatches

Notes

This function is used as the initializer for ProcessPoolExecutor.

strvcf_annotator.core.vcf_processor.process_directory(input_dir: str, str_panel_gz: str, output_dir: str, parser: BaseVCFParser = None, somatic_mode: bool = False, ignore_mismatch_warnings: bool = False, mismatch_truth: str = 'panel', jobs: int = None) → None[source]

Batch process directory of VCF files.

Processes all VCF files in a directory and writes annotated versions to the output directory.

Parameters:

input_dir (str) – Directory containing input VCF files
str_panel_gz (str) – Path to BGZF-compressed, tabix-indexed STR panel reference file
output_dir (str) – Directory for output VCF files
parser (BaseVCFParser, optional) – Parser for genotype extraction. Uses GenericParser if None.
somatic_mode (bool, optional) – Enable somatic filtering. When True, skips variants where both samples have identical genotypes. Default is False.
ignore_mismatch_warnings (bool) – If True, do not log mismatch warnings between VCF REF and panel repeat sequence. Annotation continues regardless.
mismatch_truth (str) – Which source to treat as correct when the VCF REF and the panel repeat sequence disagree:
- “panel”: trust the panel repeat sequence (default behavior)
- “vcf”: trust the VCF REF and patch the overlapping part of the panel repeat sequence to match it
- “skip”: skip the record with the mismatch
jobs (int, optional) – Number of concurrent worker processes:
- If jobs is None, it is computed automatically: jobs_auto = min(cpu_cores, n_files), then jobs_auto = min(jobs_auto, floor(available_ram / ram_per_worker_estimate)).
- If jobs is provided, it is used exactly.

VCF validation

Input validation functions.

exception strvcf_annotator.utils.validation.ValidationError[source]

Bases: Exception

Exception raised for validation errors.

strvcf_annotator.utils.validation.validate_file_path(file_path: str, must_exist: bool = True) → Path[source]

Validate file path.

Parameters:

file_path (str) – Path to validate
must_exist (bool, optional) – If True, file must exist. Default is True.

Returns:

Validated Path object

Return type:

Path

Raises:

ValidationError – If file path is invalid or doesn’t exist when required

strvcf_annotator.utils.validation.validate_directory_path(dir_path: str, must_exist: bool = True, create: bool = False) → Path[source]

Validate directory path.

Parameters:

dir_path (str) – Directory path to validate
must_exist (bool, optional) – If True, directory must exist. Default is True.
create (bool, optional) – If True, create directory if it doesn’t exist. Default is False.

Returns:

Validated Path object

Return type:

Path

Raises:

ValidationError – If directory path is invalid

strvcf_annotator.utils.validation.validate_vcf_file(vcf_path: str) → bool[source]

Validate VCF file format.

Parameters: vcf_path (str) – Path to VCF file
Returns: True if VCF is valid
Return type: bool
Raises: ValidationError – If VCF file is invalid or cannot be opened

strvcf_annotator.utils.validation.validate_bed_file(bed_path: str) → bool[source]

Validate BED file format.

Parameters: bed_path (str) – Path to BED file
Returns: True if BED is valid
Return type: bool
Raises: ValidationError – If BED file is invalid or cannot be opened

strvcf_annotator.utils.validation.validate_str_bed_file(bed_path: str) → bool[source]

Validate STR BED file format with required columns.

Parameters: bed_path (str) – Path to STR BED file
Returns: True if STR BED is valid
Return type: bool
Raises: ValidationError – If STR BED file is invalid or missing required columns

VCF utils

Utility functions for VCF processing.

strvcf_annotator.utils.vcf_utils.chrom_to_order(chrom: str) → int[source]

Map chromosome names like ‘chr1’, ‘chrX’, ‘1’ to an integer order so that sorting is: 1, 2, …, 22, X, Y, M/MT, others.
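
One way to obtain this ordering is sketched below. chrom_order is a hypothetical stand-in: the specific integers (23 for X, 24 for Y, 25 for M/MT, 26 for everything else) are assumptions, and only the resulting sort order follows the description.

```python
def chrom_order(chrom: str) -> int:
    """Map a chromosome name to a sort key: 1..22, X, Y, M/MT, then others."""
    name = chrom[3:] if chrom.lower().startswith("chr") else chrom
    if name.isascii() and name.isdigit() and 1 <= int(name) <= 22:
        return int(name)                 # autosomes sort numerically
    special = {"X": 23, "Y": 24, "M": 25, "MT": 25}
    return special.get(name.upper(), 26)  # unknown contigs sort last
```

Used as a sort key, this places chr10 after chr2, which plain string sorting does not.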

strvcf_annotator.utils.vcf_utils.normalize_info_fields(record: VariantRecord, header: VariantHeader) → Dict[str, Any][source]

Normalize INFO fields for proper VCF serialization.

Handles various INFO field types and ensures they are properly formatted for writing to VCF files. Handles Flags, Strings, and R-type fields specially.

Parameters:

record (pysam.VariantRecord) – VCF record with INFO fields to normalize
header (pysam.VariantHeader) – VCF header with field definitions

Returns:

Normalized INFO fields ready for VCF writing

Return type:

Dict[str, Any]

Notes

Flag fields are included only if True
String fields with multiple values are joined with “|”
R-type fields (REF + ALT) are clipped to first 2 values
Unknown fields are skipped
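
The four rules can be sketched with plain dictionaries in place of pysam objects. normalize_info and the definitions mapping (INFO ID to a (Number, Type) pair) are hypothetical stand-ins for the record and header, so this shows the rules rather than the package's code.

```python
from typing import Any, Dict, Tuple


def normalize_info(info: Dict[str, Any],
                   definitions: Dict[str, Tuple[str, str]]) -> Dict[str, Any]:
    """Apply the documented normalization rules to a plain INFO dict."""
    out: Dict[str, Any] = {}
    for key, value in info.items():
        if key not in definitions:
            continue                                   # unknown fields are skipped
        number, kind = definitions[key]
        if kind == "Flag":
            if value:
                out[key] = True                        # flags kept only if True
        elif kind == "String" and isinstance(value, (list, tuple)):
            out[key] = "|".join(str(v) for v in value)  # multi-value strings joined
        elif number == "R" and isinstance(value, (list, tuple)):
            out[key] = tuple(value[:2])                # keep REF plus first ALT
        else:
            out[key] = value
    return out
```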

strvcf_annotator.utils.vcf_utils.get_sample_by_name(record: VariantRecord, sample_name: str) → Any[source]

Get sample data by name from VCF record.

Parameters:

record (pysam.VariantRecord) – VCF record
sample_name (str) – Name of sample to retrieve

Returns:

Sample data object

Return type:

Any

Raises:

KeyError – If sample name not found in record

strvcf_annotator.utils.vcf_utils.get_sample_by_index(record: VariantRecord, sample_idx: int) → Any[source]

Get sample data by index from VCF record.

Parameters:

record (pysam.VariantRecord) – VCF record
sample_idx (int) – Index of sample to retrieve

Returns:

Sample data object

Return type:

Any

Raises:

IndexError – If sample index out of range

strvcf_annotator.utils.vcf_utils.has_format_field(record: VariantRecord, field_name: str) → bool[source]

Check if FORMAT field exists in any sample.

Parameters:

record (pysam.VariantRecord) – VCF record
field_name (str) – Name of FORMAT field to check

Returns:

True if field exists in at least one sample

Return type:

bool

STR reference processing

STR reference management for BED file loading and region lookups.

strvcf_annotator.core.str_reference.is_valid_tabix(gz_path: str) → bool[source]

Check that a BGZF file has a valid tabix index.

Returns True only if:
- the .tbi index exists
- pysam can open the file
- the index is readable

strvcf_annotator.core.str_reference.sort_bed_file(bed_path: str, output_path: str, chrom_col: int = 0, start_col: int = 1) → str[source]

Sort a BED-like file by chromosome and start coordinate.

Parameters:

bed_path (str) – Path to input BED file (tab-delimited).
output_path (str) – Path to write the sorted BED file.
chrom_col (int, optional) – Zero-based column index for chromosome. Default is 0.
start_col (int, optional) – Zero-based column index for start coordinate. Default is 1.

Returns:

Path to the sorted BED file.

Return type:

str

Notes

This function loads the BED into memory via pandas. For extremely large BED files, consider an external sort.
Sorting is lexicographic by chromosome, then numeric by start.
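
The ordering described in the notes can be shown on rows already split into fields. sort_bed_rows is a hypothetical in-memory stand-in that uses plain lists instead of pandas. Note that lexicographic chromosome order puts chr10 before chr2.

```python
from typing import List


def sort_bed_rows(rows: List[List[str]], chrom_col: int = 0,
                  start_col: int = 1) -> List[List[str]]:
    """Sort rows by chromosome (as text) and then by start (as a number)."""
    return sorted(rows, key=lambda row: (row[chrom_col], int(row[start_col])))
```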

strvcf_annotator.core.str_reference.load_str_reference(str_path: str) → str[source]

Ensure a BED file is BGZF-compressed and tabix-indexed.

This function:
- Accepts a BED path (.bed or .bed.gz).
- If the input is already .gz and has a .tbi index, returns it.
- Otherwise, creates a sorted BED (if needed), BGZF-compresses it, and creates a tabix index (preset=“bed”).

Parameters: str_path (str) – Path to input BED file (.bed or .bed.gz).
Returns: Path to the BGZF-compressed, tabix-indexed BED file (*.gz).
Return type: str

Notes

Tabix indexing requires the BED to be sorted by chromosome and start.
This function uses pysam.tabix_compress and pysam.tabix_index.

strvcf_annotator.core.str_reference.find_overlapping_str(str_panel_gz: str, chrom: str, pos: int, end: int) → Dict | None[source]

Find STR region overlapping with variant coordinates using tabix index.

Parameters:

str_panel_gz (str) – Path to BGZF-compressed, tabix-indexed STR reference file.
chrom (str) – Chromosome name.
pos (int) – Variant start position (1-based).
end (int) – Variant end position (1-based).

Returns:

Dictionary with STR region data if overlap found, None otherwise. Keys: CHROM, START, END, PERIOD, RU, COUNT

Return type:

Optional[Dict]
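
Because the query coordinates are 1-based while BED intervals are conventionally 0-based and half-open, the overlap test is easy to get off by one. The sketch below assumes the panel follows that standard BED convention; the package may convert coordinates differently. overlaps and find_first_overlap are hypothetical helpers that work on in-memory rows rather than a tabix index.

```python
from typing import Dict, List, Optional


def overlaps(pos: int, end: int, bed_start: int, bed_end: int) -> bool:
    """pos/end are 1-based inclusive; bed_start/bed_end are 0-based, half-open."""
    # The 1-based closed interval [pos, end] equals the 0-based half-open [pos - 1, end).
    return pos - 1 < bed_end and end > bed_start


def find_first_overlap(rows: List[Dict], chrom: str,
                       pos: int, end: int) -> Optional[Dict]:
    """Return the first row on chrom overlapping [pos, end], else None."""
    for row in rows:
        if row["CHROM"] == chrom and overlaps(pos, end, row["START"], row["END"]):
            return row
    return None
```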

strvcf_annotator.core.str_reference.get_str_at_position(str_panel_gz: str, chrom: str, pos: int) → Dict | None[source]

Get STR region containing a specific position using tabix index.

Parameters:

str_panel_gz (str) – Path to BGZF-compressed, tabix-indexed STR reference file.
chrom (str) – Chromosome name.
pos (int) – Position to query (1-based).

Returns:

Dictionary with STR region data if position is within an STR, None otherwise.

Return type:

Optional[Dict]

Repeat utils module

Utilities for repeat sequence operations.

strvcf_annotator.core.repeat_utils.extract_repeat_sequence(str_row: Dict | Series) → str[source]

Reconstruct repeat sequence from STR metadata.

Generates the full repeat sequence by repeating the repeat unit (RU) the calculated number of times (COUNT).

Parameters: str_row (Dict or pd.Series) – STR region data containing ‘RU’ (repeat unit) and ‘COUNT’ (number of repeats)
Returns: Full repeat sequence
Return type: str

Examples

>>> str_row = {'RU': 'CAG', 'COUNT': 5}
>>> extract_repeat_sequence(str_row)
'CAGCAGCAGCAGCAG'

strvcf_annotator.core.repeat_utils.count_repeat_units(sequence: str, motif: str) → int[source]

Return the longest contiguous run of motif in sequence.

The function looks for exact, non-overlapping copies of motif that occur consecutively and returns the maximum number of such copies in any run.

This corresponds to how STR repeat counts are typically defined: the length of the longest perfect contiguous block of the repeat unit.

Parameters:

sequence (str) – DNA sequence to search.
motif (str) – Repeat unit motif to count (e.g. ‘A’, ‘CAG’).

Returns:

Length of the longest contiguous run of motif in sequence.

Return type:

int

Raises:

ValueError – If motif is empty or if either argument is not a string.

Examples

Perfect repeats

>>> count_repeat_units("CAGCAGCAG", "CAG")
3

Imperfect tail

>>> count_repeat_units("CAGCAGCA", "CAG")
2

No repeats

>>> count_repeat_units("ATCG", "CAG")
0

Homopolymer runs

>>> count_repeat_units("ATAAAAA", "A")
5
>>> count_repeat_units("AAAATAAA", "A")  # longest contiguous run is 'AAAA'
4

Overlapping motifs

‘AAAA’ with motif ‘AA’ contains two non-overlapping copies: ‘AA’ ‘AA’

>>> count_repeat_units("AAAA", "AA")
2
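
A straightforward way to compute the longest contiguous run is to try every start position and count consecutive copies from there. longest_run below is a hypothetical re-implementation for illustration, not the package's code; it is exact-match and case-sensitive.

```python
def longest_run(sequence: str, motif: str) -> int:
    """Return the maximum number of back-to-back motif copies in sequence."""
    if not isinstance(sequence, str) or not isinstance(motif, str):
        raise ValueError("sequence and motif must be strings")
    if not motif:
        raise ValueError("motif must not be empty")
    best = 0
    for start in range(len(sequence)):
        copies, pos = 0, start
        # Count consecutive, non-overlapping copies beginning at this start.
        while sequence.startswith(motif, pos):
            copies += 1
            pos += len(motif)
        best = max(best, copies)
    return best
```

Trying every start matters for motifs that can overlap themselves: a single left-to-right scan that consumes each match would miss the two copies of ABA at the end of ABABAABA.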

strvcf_annotator.core.repeat_utils.normalize_variant(pos: int, ref: str, alt: str) → Tuple[int, str, str][source]

Locally normalize (pos, ref, alt) by trimming shared prefix/suffix.

pos is 1-based VCF coordinate.
Trimming is case-insensitive.
We always keep at least 1 base in ref and alt if they differ.

Returns: Tuple (new_pos, new_ref, new_alt)
Return type: Tuple[int, str, str]
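
The trimming rules can be sketched as follows. trim_variant is a hypothetical stand-in; in particular, trimming the shared suffix before the shared prefix is an assumption, since the order is not stated above.

```python
from typing import Tuple


def trim_variant(pos: int, ref: str, alt: str) -> Tuple[int, str, str]:
    """Trim shared suffix, then shared prefix, keeping at least one base each."""
    # Shared suffix: drop trailing bases without moving pos.
    while len(ref) > 1 and len(alt) > 1 and ref[-1].upper() == alt[-1].upper():
        ref, alt = ref[:-1], alt[:-1]
    # Shared prefix: drop leading bases and advance the 1-based pos.
    while len(ref) > 1 and len(alt) > 1 and ref[0].upper() == alt[0].upper():
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt
```

For example, (100, ACAGCAG, ACAG) becomes (100, ACAG, A), and the padded SNP (10, TCA, TGA) becomes (11, C, G).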

strvcf_annotator.core.repeat_utils.apply_variant_to_repeat(pos: int, ref: str, alt: str, repeat_start: int, repeat_seq: str) → str[source]

Apply a variant to an STR repeat sequence.

The function applies a VCF variant to a reference STR sequence while respecting VCF normalization rules and STR boundaries.

The algorithm works as follows:

Normalize pos, ref, and alt by trimming shared prefixes and suffixes.
If the normalized variant lies fully inside the STR region, apply the full ALT allele.
If the variant partially overlaps the STR region:
- SNP-like variants (len(ref) == len(alt)) are aligned positionally.
- Indel-like variants (len(ref) != len(alt)) use the suffix of ALT that overlaps the STR.

Any parts of the variant outside the STR window are ignored.

Notes

The genomic reference is conceptually treated as:

repeat_seq + UNKNOWN_SUFFIX

Differences outside the STR window do not affect the resulting mutated STR sequence.

Case handling rules:

All matching and overlap logic is case-insensitive.
Output case follows the overlapping STR segment:
- lowercase STR slice → lowercase ALT
- uppercase STR slice → uppercase ALT
- mixed case → ALT is left unchanged

Parameters:

pos (int) – Variant position (1-based VCF coordinate).
ref (str) – Reference allele from the VCF record.
alt (str) – Alternate allele from the VCF record.
repeat_start (int) – Start position of the STR region (1-based).
repeat_seq (str) – Reference STR sequence from the panel.

Returns:

The mutated STR sequence after applying the variant. If the variant does not overlap the STR region, the original repeat_seq is returned unchanged.

Return type:

str
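
The simplest branch of the algorithm, a variant lying fully inside the STR, together with the case handling rules, can be sketched as below. match_case and apply_inside are hypothetical helpers; normalization and the partial-overlap branches are deliberately left out.

```python
def match_case(alt: str, str_slice: str) -> str:
    """Make alt follow the case of the STR slice it replaces."""
    if str_slice.islower():
        return alt.lower()
    if str_slice.isupper():
        return alt.upper()
    return alt                            # mixed case (or empty): leave unchanged


def apply_inside(pos: int, ref: str, alt: str,
                 repeat_start: int, repeat_seq: str) -> str:
    """Replace ref with alt for a variant fully inside the repeat (1-based)."""
    offset = pos - repeat_start
    end = offset + len(ref)
    if offset < 0 or end > len(repeat_seq):
        raise ValueError("variant is not fully inside the STR region")
    replaced = repeat_seq[offset:end]
    return repeat_seq[:offset] + match_case(alt, replaced) + repeat_seq[end:]
```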

strvcf_annotator.core.repeat_utils.is_perfect_repeat(sequence: str, motif: str) → bool[source]

Check if sequence is a perfect repeat of the motif.

A perfect repeat means the sequence consists entirely of exact copies of the motif with no interruptions or variations.

Parameters:

sequence (str) – DNA sequence to check
motif (str) – Repeat unit motif

Returns:

True if sequence is a perfect repeat, False otherwise

Return type:

bool

Examples

>>> is_perfect_repeat('CAGCAGCAG', 'CAG')
True
>>> is_perfect_repeat('CAGCAGCA', 'CAG')
False

Parser class

Abstract base class for VCF parsers.

class strvcf_annotator.parsers.base.BaseVCFParser[source]

Bases: ABC

Abstract base class for VCF parsers.

Defines the interface for extracting genotype and variant information from VCF records in a standardized way. All parser implementations must inherit from this class and implement all abstract methods.

abstractmethod get_genotype(record: VariantRecord, sample_idx: int) → Tuple[int, int] | None[source]

Extract genotype as (allele1, allele2) or None if unknown.

Parameters:

record (pysam.VariantRecord) – The VCF record to extract genotype from
sample_idx (int) – Index of the sample in the record

Returns:

Genotype as tuple of allele indices (0=REF, 1=ALT), or None if unknown

Return type:

Optional[Tuple[int, int]]

abstractmethod has_variant(record: VariantRecord, sample_idx: int) → bool[source]

Check if sample has variant even when GT is unknown.

Parameters:

record (pysam.VariantRecord) – The VCF record to check
sample_idx (int) – Index of the sample in the record

Returns:

True if variant is present, False otherwise

Return type:

bool

abstractmethod extract_info(record: VariantRecord, sample_idx: int) → Dict[str, Any][source]

Extract additional fields (AD, DP, etc.) as dictionary.

Parameters:

record (pysam.VariantRecord) – The VCF record to extract information from
sample_idx (int) – Index of the sample in the record

Returns:

Dictionary of additional FORMAT fields

Return type:

Dict[str, Any]

abstractmethod validate_record(record: VariantRecord) → bool[source]

Validate that record is compatible with this parser.

Parameters: record (pysam.VariantRecord) – The VCF record to validate
Returns: True if record is valid for this parser, False otherwise
Return type: bool
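
The shape of a custom parser can be sketched without importing the package. ParserInterface below is a stand-in that copies the four abstract method names from BaseVCFParser, and DictParser is a toy implementation over plain dictionaries rather than pysam records. Both names, the record layout, and the AD fallback in has_variant are assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional, Tuple


class ParserInterface(ABC):
    """Stand-in mirroring the abstract methods of BaseVCFParser."""

    @abstractmethod
    def get_genotype(self, record, sample_idx) -> Optional[Tuple[int, int]]: ...

    @abstractmethod
    def has_variant(self, record, sample_idx) -> bool: ...

    @abstractmethod
    def extract_info(self, record, sample_idx) -> Dict[str, Any]: ...

    @abstractmethod
    def validate_record(self, record) -> bool: ...


class DictParser(ParserInterface):
    """Toy parser for records like {"samples": [{"GT": (0, 1), "DP": 30}]}."""

    def get_genotype(self, record, sample_idx):
        gt = record["samples"][sample_idx].get("GT")
        if gt is None or len(gt) != 2 or any(a is None for a in gt):
            return None                   # missing or invalid genotype
        return (gt[0], gt[1])

    def has_variant(self, record, sample_idx):
        gt = self.get_genotype(record, sample_idx)
        if gt is not None:
            return any(allele > 0 for allele in gt)
        # Assumed fallback: a non-zero alternate depth counts as evidence.
        ad = record["samples"][sample_idx].get("AD") or ()
        return any(depth and depth > 0 for depth in ad[1:])

    def extract_info(self, record, sample_idx):
        sample = record["samples"][sample_idx]
        return {k: v for k, v in sample.items() if k != "GT"}

    def validate_record(self, record):
        return bool(record.get("samples"))
```

Leaving out any of the four methods makes the subclass impossible to instantiate, which is how the abstract base class enforces the interface.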

Generic parser

Generic parser for standard VCF format fields.

class strvcf_annotator.parsers.generic.GenericParser[source]

Bases: BaseVCFParser

Generic parser for standard VCF format fields.

Handles standard VCF FORMAT fields including GT (genotype), AD (allelic depth), and DP (total depth). Provides robust error handling for missing or invalid data.

get_genotype(record: VariantRecord, sample_idx: int) → Tuple[int, int] | None[source]

Extract GT field, return None for missing/invalid genotypes.

Parameters:

record (pysam.VariantRecord) – The VCF record to extract genotype from
sample_idx (int) – Index of the sample in the record

Returns:

Genotype as tuple of allele indices, or None if missing/invalid

Return type:

Optional[Tuple[int, int]]

has_variant(record: VariantRecord, sample_idx: int) → bool[source]

Check variant presence using GT or alternative evidence.

Parameters:

record (pysam.VariantRecord) – The VCF record to check
sample_idx (int) – Index of the sample in the record

Returns:

True if variant is present, False otherwise

Return type:

bool

extract_info(record: VariantRecord, sample_idx: int) → Dict[str, Any][source]

Extract AD, DP and other standard FORMAT fields.

Parameters:

record (pysam.VariantRecord) – The VCF record to extract information from
sample_idx (int) – Index of the sample in the record

Returns:

Dictionary of FORMAT fields (AD, DP, etc.)

Return type:

Dict[str, Any]

validate_record(record: VariantRecord) → bool[source]

Validate record has required fields for generic parsing.

Parameters: record (pysam.VariantRecord) – The VCF record to validate
Returns: True if record is valid, False otherwise
Return type: bool