Dev Documentation

Annotation

Core annotation engine for building STR-annotated VCF records.

strvcf_annotator.core.annotation.make_modified_header(vcf_in: VariantFile) → VariantHeader

Create VCF header with STR-specific INFO and FORMAT fields.

Creates a modified VCF header that includes all original header information plus STR-specific annotations. Replaces existing RU, PERIOD, REF, PERFECT INFO fields and REPCN FORMAT field with STR-specific definitions.

Parameters:

vcf_in (pysam.VariantFile) – Input VCF file object

Returns:

New header with STR-specific fields

Return type:

pysam.VariantHeader

Notes

INFO fields added/replaced:
  • RU: Repeat unit

  • PERIOD: Repeat period (length of unit)

  • REF: Reference copy number

  • PERFECT: Indicates perfect repeats in REF and ALT

FORMAT field added/replaced:
  • REPCN: Genotype as number of repeat motif copies

strvcf_annotator.core.annotation.build_new_record(record: VariantRecord, str_row: Dict | Series, header: VariantHeader, parser: BaseVCFParser, ignore_mismatch_warnings: bool = False, mismatch_truth: str = 'panel') → VariantRecord | None

Build annotated VCF record with STR alleles and metadata.

Constructs a new VCF record where alleles represent full repeat sequences (before and after mutation) and adds STR-specific annotations to INFO and FORMAT fields.

Parameters:
  • record (pysam.VariantRecord) – Original VCF record with mutation

  • str_row (Dict or pd.Series) – STR metadata (CHROM, START, END, RU, PERIOD)

  • header (pysam.VariantHeader) – Modified header with STR fields

  • parser (BaseVCFParser) – Parser for extracting genotype information

  • ignore_mismatch_warnings (bool) – If True, do not log mismatch warnings between VCF REF and panel repeat sequence. Annotation continues regardless.

  • mismatch_truth (str) –

    Which source to treat as correct when there is a mismatch:
    • “panel”: trust the panel repeat sequence (default)

    • “vcf”: trust the VCF REF and patch the overlapping part of the panel repeat sequence to match it

    • “skip”: skip the record

Returns:

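New record with STR alleles and annotations, or None if the record is skipped because of a reference mismatch with mismatch_truth set to “skip”

Return type:

pysam.VariantRecord or None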

Notes

  • Logs a warning if a reference mismatch is detected, unless ignore_mismatch_warnings is True

  • Calculates repeat copy numbers for REF and ALT

  • Marks PERFECT=TRUE only if both alleles are perfect repeats

  • Preserves all original FORMAT fields

  • Returns None if there is a reference mismatch and mismatch_truth is “skip”
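
The three mismatch_truth policies can be pictured with a small stand-alone sketch. This is not the library's implementation: resolve_mismatch, its arguments, and the offset convention (0-based index of the VCF REF inside the panel repeat sequence, assumed non-negative) are hypothetical names chosen for illustration.

```python
from typing import Optional


def resolve_mismatch(panel_seq: str, vcf_ref: str, offset: int,
                     mismatch_truth: str = "panel") -> Optional[str]:
    """Return the repeat sequence to use as reference, or None to skip.

    Hypothetical helper: `offset` is the 0-based index in `panel_seq`
    where `vcf_ref` is expected to start (assumed to be >= 0).
    """
    # Only the part of REF that lies inside the panel sequence is compared.
    overlap = vcf_ref[: max(0, len(panel_seq) - offset)]
    panel_slice = panel_seq[offset: offset + len(overlap)]

    if panel_slice.upper() == overlap.upper():
        return panel_seq  # no mismatch, nothing to resolve

    if mismatch_truth == "panel":
        return panel_seq  # keep the panel sequence as-is
    if mismatch_truth == "vcf":
        # Patch the overlapping slice so it matches the VCF REF.
        return panel_seq[:offset] + overlap + panel_seq[offset + len(overlap):]
    if mismatch_truth == "skip":
        return None  # caller drops the record
    raise ValueError(f"unknown mismatch_truth: {mismatch_truth!r}")
```

With a panel sequence of CAGCAGCAG and a VCF REF of CAT at offset 3, the “vcf” policy would yield CAGCATCAG, while “skip” would return None.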

strvcf_annotator.core.annotation.should_skip_genotype(record: VariantRecord, parser: BaseVCFParser) → bool

Determine if record should be skipped based on genotype filtering.

Skips records where:
  • Not exactly 2 samples are present

  • Genotypes are invalid or missing

  • Both samples have identical genotypes

Parameters:
  • record (pysam.VariantRecord) – VCF record to check

  • parser (BaseVCFParser) – Parser for extracting genotypes

Returns:

True if record should be skipped, False otherwise

Return type:

bool
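
The skip rules can be illustrated without pysam. The sketch below takes a plain list of per-sample genotypes (tuples of allele indices, or None when missing) instead of a record and a parser; the function name and the use of exact tuple equality for "identical genotypes" are assumptions of this example.

```python
from typing import Optional, Sequence, Tuple

Genotype = Optional[Tuple[int, int]]


def should_skip(genotypes: Sequence[Genotype]) -> bool:
    """Apply the three skip rules to a list of per-sample genotypes."""
    if len(genotypes) != 2:
        return True  # rule 1: exactly two samples are required
    first, second = genotypes
    if first is None or second is None:
        return True  # rule 2: missing or invalid genotype
    # rule 3: identical genotypes carry no difference between the samples
    return first == second
```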

Process vcf

VCF file processing and workflow management.

class strvcf_annotator.core.vcf_processor.WorkerConfig(str_panel_gz: str, somatic_mode: bool, ignore_mismatch_warnings: bool, mismatch_truth: str, parser: BaseVCFParser)

Bases: object

Configuration container for worker processes.

Stores parameters required by each worker to annotate VCF files. The configuration is passed once during worker initialization and reused for all tasks processed by that worker.

str_panel_gz

Path to the BGZF-compressed, tabix-indexed STR reference file.

Type:

str

somatic_mode

Enable somatic filtering. When True, skips variants where both samples have identical genotypes. Default is False.

Type:

bool, optional

ignore_mismatch_warnings

If True, suppresses warnings about reference mismatches between the STR panel and VCF REF alleles. Default is False.

Type:

bool, optional

mismatch_truth

Specifies which source to consider as ground truth for mismatches. Options are “panel”, “vcf”, or “skip”. Default is “panel”.

Type:

str, optional

Notes

  • The dataclass is frozen to ensure the configuration remains immutable once workers are initialized.

  • Instances of this class are passed to worker_init, which loads the STR reference and exposes these settings to worker tasks.

str_panel_gz: str
somatic_mode: bool
ignore_mismatch_warnings: bool
mismatch_truth: str
parser: BaseVCFParser

strvcf_annotator.core.vcf_processor.check_vcf_sorted(vcf_in: VariantFile) → bool

Validate VCF sorting by chromosome and position.

Checks if VCF records are sorted by chromosome and position. Rewinds the file after checking.

Parameters:

vcf_in (pysam.VariantFile) – Input VCF file

Returns:

True if VCF is sorted, False otherwise

Return type:

bool
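
A minimal sketch of the sortedness check, working on (chrom, pos) pairs rather than on a pysam file. The function name, the contig_order mapping (contig name to rank, for example derived from the header), and the placement of unknown contigs after all known ones are assumptions of this example.

```python
from typing import Dict, Iterable, Tuple


def is_sorted(records: Iterable[Tuple[str, int]],
              contig_order: Dict[str, int]) -> bool:
    """Check that (chrom, pos) pairs are non-decreasing in contig order."""
    previous = None
    for chrom, pos in records:
        # Unknown contigs rank after all known ones (assumption of this sketch).
        key = (contig_order.get(chrom, len(contig_order)), pos)
        if previous is not None and key < previous:
            return False
        previous = key
    return True
```

The same key, passed to sorted(), would give the in-memory ordering described for reset_and_sort_vcf.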

strvcf_annotator.core.vcf_processor.reset_and_sort_vcf(vcf_in: VariantFile) → List[VariantRecord]

Sort VCF records in memory when needed.

Loads all VCF records into memory and sorts them by chromosome and position according to the contig order in the header.

Parameters:

vcf_in (pysam.VariantFile) – Input VCF file

Returns:

Sorted list of VCF records

Return type:

List[pysam.VariantRecord]

Notes

This loads the entire VCF into memory, so use with caution for large files.

strvcf_annotator.core.vcf_processor.generate_annotated_records(vcf_in: VariantFile, str_panel_gz: str, parser: BaseVCFParser = None, somatic_mode: bool = False, ignore_mismatch_warnings: bool = False, mismatch_truth: str = 'panel') → Iterator[VariantRecord]

Generator yielding annotated VCF records.

Processes VCF records and yields annotated records for variants that overlap STR regions. Sorts the input if needed and optionally filters records by genotype criteria. When multiple STR regions overlap the same POS, all overlapping STR candidates are tried and the first one that produces a meaningful STR allele change is used.

Parameters:
  • vcf_in (pysam.VariantFile) – Input VCF file

  • str_panel_gz (str) – Path to BGZF-compressed, tabix-indexed STR reference file.

  • parser (BaseVCFParser, optional) – Parser for genotype extraction. Uses GenericParser if None.

  • somatic_mode (bool, optional) – Enable somatic filtering. When True, skips variants where both samples have identical genotypes. Default is False.

  • ignore_mismatch_warnings (bool, optional) – If True, suppresses warnings about reference mismatches between the STR panel and VCF REF alleles. Default is False.

  • mismatch_truth (str, optional) – Specifies which source to consider as ground truth for mismatches. Options are “panel”, “vcf”, or “skip”. Default is “panel”.

Yields:

pysam.VariantRecord – Annotated VCF records

Notes

  • Automatically sorts VCF if not sorted

  • Skips records without STR overlap

  • If somatic_mode=True, filters records with identical genotypes

  • ignore_mismatch_warnings controls logging of reference mismatches

  • mismatch_truth controls which source is considered ground truth for mismatches

strvcf_annotator.core.vcf_processor.annotate_vcf_to_file(vcf_path: str, str_panel_gz: str, output_path: str, parser: BaseVCFParser = None, somatic_mode: bool = False, ignore_mismatch_warnings: bool = False, mismatch_truth: str = 'panel') → None

Process VCF file and write annotated output.

Reads a VCF file, annotates variants that overlap with STR regions, and writes the annotated records to an output file.

Parameters:
  • vcf_path (str) – Path to input VCF file

  • str_panel_gz (str) – Path to BGZF-compressed, tabix-indexed STR reference file.

  • output_path (str) – Path to output VCF file

  • parser (BaseVCFParser, optional) – Parser for genotype extraction. Uses GenericParser if None.

  • somatic_mode (bool, optional) – Enable somatic filtering. When True, skips variants where both samples have identical genotypes. Default is False.

  • ignore_mismatch_warnings (bool) – If True, do not log mismatch warnings between VCF REF and panel repeat sequence. Annotation continues regardless.

  • mismatch_truth (str) –

    Which source to treat as correct when there is a mismatch:
    • “panel”: trust the panel repeat sequence (default)

    • “vcf”: trust the VCF REF and patch the overlapping part of the panel repeat sequence to match it

    • “skip”: skip the record

Notes

Prints summary statistics after processing.

strvcf_annotator.core.vcf_processor.annotate_one_vcf(task: Tuple[str, str]) → str

Annotate a single VCF file in a worker process.

Runs annotate_vcf_to_file for one input VCF and writes the annotated VCF to the given output path. Intended to be executed inside a process pool.

Parameters:

task (Tuple[str, str]) –

(vcf_path, output_path) pair, where:
  • vcf_path is the input VCF (optionally gzipped)

  • output_path is the target annotated VCF path

Returns:

Path to the produced output VCF file.

Return type:

str

Notes

  • Expects STR reference (STR_DF) and worker configuration (WORKER_CONFIG) to be initialized once per worker via worker_init.

strvcf_annotator.core.vcf_processor.get_available_ram_bytes() → int

Get available system RAM.

Returns:

Available RAM in bytes.

Return type:

int

strvcf_annotator.core.vcf_processor.estimate_ram_per_worker_bytes(vcf_paths: List[str]) → int

Estimate RAM usage per worker for VCF annotation.

Provides an estimate of how much memory a single worker process might require while annotating one VCF. This estimate is used to cap the number of concurrent workers to reduce the risk of out-of-memory (OOM) crashes.

Parameters:

vcf_paths (list[str]) – List of input VCF paths that will be processed.

Returns:

Estimated RAM usage per worker in bytes.

Return type:

int

Notes

  • If a VCF is not sorted, the current pipeline may load all records into memory for sorting, which can drastically increase memory usage.

  • Even for sorted VCFs, pysam/htslib buffers plus Python object overhead can be substantial.

  • Each worker loads the STR reference once. The STR DataFrame and derived Python objects often consume several times the BED file size on disk.

Heuristic

  • Identify the largest input file size on disk.

  • If the largest file is gzipped, assume a higher expansion factor for the working set (e.g., decompression + object overhead).

  • Add a fixed overhead to account for Python/pysam allocations.

  • Add STR panel RAM estimate as: str_panel_factor * BED_size_on_disk.

This is intentionally conservative to avoid OOM.
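
The heuristic can be sketched as follows. All four constants are illustrative assumptions, not the values used by the library, and the extra panel_path argument is hypothetical (the documented function takes only vcf_paths).

```python
import os
from typing import List

# Illustrative assumptions only; the library's actual factors are not documented here.
PLAIN_FACTOR = 3                  # working-set multiplier for uncompressed VCFs
GZIP_FACTOR = 10                  # higher multiplier for gzipped VCFs
FIXED_OVERHEAD = 200 * 1024 ** 2  # flat allowance for Python/pysam allocations
STR_PANEL_FACTOR = 5              # panel RAM as a multiple of its size on disk


def estimate_ram_bytes(vcf_paths: List[str], panel_path: str) -> int:
    """Estimate per-worker RAM from the largest VCF and the STR panel size."""
    # Step 1: identify the largest input file on disk.
    largest = max(vcf_paths, key=os.path.getsize, default=None)
    vcf_bytes = 0
    if largest is not None:
        # Step 2: gzipped input gets a higher expansion factor.
        factor = GZIP_FACTOR if largest.endswith(".gz") else PLAIN_FACTOR
        vcf_bytes = factor * os.path.getsize(largest)
    # Steps 3 and 4: fixed overhead plus the STR panel estimate.
    panel_bytes = STR_PANEL_FACTOR * os.path.getsize(panel_path)
    return vcf_bytes + FIXED_OVERHEAD + panel_bytes
```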

strvcf_annotator.core.vcf_processor.compute_jobs_auto(n_files: int, vcf_paths: List[str]) → int

Compute an automatic number of concurrent workers.

Chooses a default number of parallel jobs for processing a directory of VCF files, balancing CPU capacity and memory constraints.

Parameters:
  • n_files (int) – Number of VCF files that will be processed (after skipping outputs that already exist).

  • vcf_paths (list[str]) – List of VCF paths used to estimate per-worker memory usage.

Returns:

Recommended number of concurrent worker processes (at least 1).

Return type:

int

Notes

The selection follows:
  • jobs_auto = min(cpu_cores, n_files)

  • jobs_auto = min(jobs_auto, floor(available_ram / ram_per_worker_estimate))

If available RAM cannot be determined, the CPU-based limit is used.
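
The selection rule amounts to a few lines of arithmetic. In this sketch the RAM figures are passed in directly rather than measured, and the cpu_cores override exists only to make the example deterministic; both are assumptions of the example, not part of the documented signature.

```python
import os
from typing import Optional


def jobs_auto(n_files: int, available_ram: Optional[int],
              ram_per_worker: int, cpu_cores: Optional[int] = None) -> int:
    """Pick a worker count bounded by CPU cores, file count, and RAM."""
    if cpu_cores is None:
        cpu_cores = os.cpu_count() or 1
    jobs = min(cpu_cores, n_files)
    # Apply the RAM cap only when available RAM is known.
    if available_ram is not None and ram_per_worker > 0:
        jobs = min(jobs, available_ram // ram_per_worker)
    return max(1, jobs)  # never fewer than one worker
```

For example, with 4 cores, 10 files, 8 GiB available and an estimate of 3 GiB per worker, the CPU limit gives 4 and the RAM limit gives floor(8 / 3) = 2, so 2 workers are used.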

strvcf_annotator.core.vcf_processor.worker_init(config: WorkerConfig) → None

Initialize worker process state.

Called once when a worker process starts. Stores configuration values so they can be reused for all VCF files processed by that worker.

Parameters:

config (WorkerConfig) –

Configuration object containing:
  • str_panel_gz : path to STR panel BGZF-compressed, tabix-indexed reference file

  • somatic_mode : whether somatic filtering is enabled

  • ignore_mismatch_warnings : whether to suppress mismatch warnings

  • mismatch_truth : rule for handling panel/VCF mismatches

  • parser : parser used for extracting genotype information

Notes

  • This function is used as the initializer for ProcessPoolExecutor.

strvcf_annotator.core.vcf_processor.process_directory(input_dir: str, str_panel_gz: str, output_dir: str, parser: BaseVCFParser = None, somatic_mode: bool = False, ignore_mismatch_warnings: bool = False, mismatch_truth: str = 'panel', jobs: int = None) → None

Batch process directory of VCF files.

Processes all VCF files in a directory and writes annotated versions to the output directory.

Parameters:
  • input_dir (str) – Directory containing input VCF files

  • str_panel_gz (str) – Path to BGZF-compressed, tabix-indexed STR panel reference file

  • output_dir (str) – Directory for output VCF files

  • parser (BaseVCFParser, optional) – Parser for genotype extraction. Uses GenericParser if None.

  • somatic_mode (bool, optional) – Enable somatic filtering. When True, skips variants where both samples have identical genotypes. Default is False.

  • ignore_mismatch_warnings (bool) – If True, do not log mismatch warnings between VCF REF and panel repeat sequence. Annotation continues regardless.

  • mismatch_truth (str) –

    Which source to treat as correct when there is a mismatch:
    • “panel”: trust the panel repeat sequence (default)

    • “vcf”: trust the VCF REF and patch the overlapping part of the panel repeat sequence to match it

    • “skip”: skip the record

  • jobs (int, optional) –

    • If jobs is None: compute jobs automatically:

      jobs_auto = min(cpu_cores, n_files)
      jobs_auto = min(jobs_auto, floor(available_ram / ram_per_worker_estimate))

    • If jobs is provided: use it exactly.

VCF validation

Input validation functions.

exception strvcf_annotator.utils.validation.ValidationError

Bases: Exception

Exception raised for validation errors.

strvcf_annotator.utils.validation.validate_file_path(file_path: str, must_exist: bool = True) → Path

Validate file path.

Parameters:
  • file_path (str) – Path to validate

  • must_exist (bool, optional) – If True, file must exist. Default is True.

Returns:

Validated Path object

Return type:

Path

Raises:

ValidationError – If file path is invalid or doesn’t exist when required

strvcf_annotator.utils.validation.validate_directory_path(dir_path: str, must_exist: bool = True, create: bool = False) → Path

Validate directory path.

Parameters:
  • dir_path (str) – Directory path to validate

  • must_exist (bool, optional) – If True, directory must exist. Default is True.

  • create (bool, optional) – If True, create directory if it doesn’t exist. Default is False.

Returns:

Validated Path object

Return type:

Path

Raises:

ValidationError – If directory path is invalid

strvcf_annotator.utils.validation.validate_vcf_file(vcf_path: str) → bool

Validate VCF file format.

Parameters:

vcf_path (str) – Path to VCF file

Returns:

True if VCF is valid

Return type:

bool

Raises:

ValidationError – If VCF file is invalid or cannot be opened

strvcf_annotator.utils.validation.validate_bed_file(bed_path: str) → bool

Validate BED file format.

Parameters:

bed_path (str) – Path to BED file

Returns:

True if BED is valid

Return type:

bool

Raises:

ValidationError – If BED file is invalid or cannot be opened

strvcf_annotator.utils.validation.validate_str_bed_file(bed_path: str) → bool

Validate STR BED file format with required columns.

Parameters:

bed_path (str) – Path to STR BED file

Returns:

True if STR BED is valid

Return type:

bool

Raises:

ValidationError – If STR BED file is invalid or missing required columns

VCF utils

Utility functions for VCF processing.

strvcf_annotator.utils.vcf_utils.chrom_to_order(chrom: str) → int

Map chromosome names like ‘chr1’, ‘chrX’, ‘1’ to an integer order so that sorting is: 1,2,…,22,X,Y,M/MT,others.
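
One way such a mapping could look is sketched below. The concrete ranks (23 for X, 24 for Y, 25 for M/MT, 1000 for everything else) and the function name are assumptions of this example; only the resulting order is taken from the description above.

```python
def chrom_order(chrom: str) -> int:
    """Map a chromosome name to a sortable integer rank."""
    # Strip an optional 'chr' prefix, case-insensitively.
    name = chrom[3:] if chrom.lower().startswith("chr") else chrom
    if name.isdigit():
        return int(name)  # autosomes sort numerically
    special = {"X": 23, "Y": 24, "M": 25, "MT": 25}
    # Anything unrecognized (decoys, unplaced contigs) goes last.
    return special.get(name.upper(), 1000)
```

Used as a sort key, it orders a mixed list such as chrX, 2, chr10, chrM, chr1 as chr1, 2, chr10, chrX, chrM.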

strvcf_annotator.utils.vcf_utils.normalize_info_fields(record: VariantRecord, header: VariantHeader) → Dict[str, Any]

Normalize INFO fields for proper VCF serialization.

Handles various INFO field types and ensures they are properly formatted for writing to VCF files. Handles Flags, Strings, and R-type fields specially.

Parameters:
  • record (pysam.VariantRecord) – VCF record with INFO fields to normalize

  • header (pysam.VariantHeader) – VCF header with field definitions

Returns:

Normalized INFO fields ready for VCF writing

Return type:

Dict[str, Any]

Notes

  • Flag fields are included only if True

  • String fields with multiple values are joined with “|”

  • R-type fields (REF + ALT) are clipped to first 2 values

  • Unknown fields are skipped
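
The four rules can be illustrated on plain dictionaries. In this sketch, defs maps a field ID to its (Number, Type) pair as it would be declared in the header; the function name, the dictionary-based inputs, and the choice to test the R-type rule before the String rule are assumptions of the example.

```python
from typing import Any, Dict, Tuple


def normalize_info(info: Dict[str, Any],
                   defs: Dict[str, Tuple[str, str]]) -> Dict[str, Any]:
    """Normalize INFO values using (Number, Type) header definitions."""
    out: Dict[str, Any] = {}
    for key, value in info.items():
        if key not in defs:
            continue  # unknown fields are skipped
        number, vtype = defs[key]
        if vtype == "Flag":
            if value:
                out[key] = True  # flags are kept only when set
        elif number == "R" and isinstance(value, (list, tuple)):
            out[key] = tuple(value[:2])  # keep REF plus the first ALT value
        elif vtype == "String" and isinstance(value, (list, tuple)):
            out[key] = "|".join(str(v) for v in value)
        else:
            out[key] = value
    return out
```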

strvcf_annotator.utils.vcf_utils.get_sample_by_name(record: VariantRecord, sample_name: str) → Any

Get sample data by name from VCF record.

Parameters:
  • record (pysam.VariantRecord) – VCF record

  • sample_name (str) – Name of sample to retrieve

Returns:

Sample data object

Return type:

Any

Raises:

KeyError – If sample name not found in record

strvcf_annotator.utils.vcf_utils.get_sample_by_index(record: VariantRecord, sample_idx: int) → Any

Get sample data by index from VCF record.

Parameters:
  • record (pysam.VariantRecord) – VCF record

  • sample_idx (int) – Index of sample to retrieve

Returns:

Sample data object

Return type:

Any

Raises:

IndexError – If sample index out of range

strvcf_annotator.utils.vcf_utils.has_format_field(record: VariantRecord, field_name: str) → bool

Check if FORMAT field exists in any sample.

Parameters:
  • record (pysam.VariantRecord) – VCF record

  • field_name (str) – Name of FORMAT field to check

Returns:

True if field exists in at least one sample

Return type:

bool

STR reference processing

STR reference management for BED file loading and region lookups.

strvcf_annotator.core.str_reference.is_valid_tabix(gz_path: str) → bool

Check that a BGZF file has a valid tabix index.

Returns True only if:
  • the .tbi index file exists

  • pysam can open the file

  • the index is readable

strvcf_annotator.core.str_reference.sort_bed_file(bed_path: str, output_path: str, chrom_col: int = 0, start_col: int = 1) → str

Sort a BED-like file by chromosome and start coordinate.

Parameters:
  • bed_path (str) – Path to input BED file (tab-delimited).

  • output_path (str) – Path to write the sorted BED file.

  • chrom_col (int, optional) – Zero-based column index for chromosome. Default is 0.

  • start_col (int, optional) – Zero-based column index for start coordinate. Default is 1.

Returns:

Path to the sorted BED file.

Return type:

str

Notes

  • This function loads the BED into memory via pandas. For extremely large BED files, consider an external sort.

  • Sorting is lexicographic by chromosome, then numeric by start.

strvcf_annotator.core.str_reference.load_str_reference(str_path: str) → str

Ensure a BED file is BGZF-compressed and tabix-indexed.

This function:
  • Accepts a BED path (.bed or .bed.gz).

  • If the input is already .gz and has a .tbi index, returns it.

  • Otherwise, creates a sorted BED (if needed), BGZF-compresses it, and creates a tabix index (preset="bed").

Parameters:

str_path (str) – Path to input BED file (.bed or .bed.gz).

Returns:

Path to the BGZF-compressed, tabix-indexed BED file (*.gz).

Return type:

str

Notes

  • Tabix indexing requires the BED to be sorted by chromosome and start.

  • This function uses pysam.tabix_compress and pysam.tabix_index.

strvcf_annotator.core.str_reference.find_overlapping_str(str_panel_gz: str, chrom: str, pos: int, end: int) → Dict | None

Find STR region overlapping with variant coordinates using tabix index.

Parameters:
  • str_panel_gz (str) – Path to BGZF-compressed, tabix-indexed STR reference file.

  • chrom (str) – Chromosome name.

  • pos (int) – Variant start position (1-based).

  • end (int) – Variant end position (1-based).

Returns:

Dictionary with STR region data if overlap found, None otherwise. Keys: CHROM, START, END, PERIOD, RU, COUNT

Return type:

Optional[Dict]
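
Because the query coordinates are 1-based while BED intervals are conventionally 0-based and half-open, the overlap test needs a small conversion. The sketch below assumes the panel follows that BED convention and scans an in-memory list of rows instead of a tabix index; overlaps and first_overlap are hypothetical names.

```python
from typing import Dict, Iterable, Optional


def overlaps(bed_start: int, bed_end: int, pos: int, end: int) -> bool:
    """True if 1-based inclusive [pos, end] overlaps 0-based half-open [bed_start, bed_end)."""
    query_start = pos - 1  # 1-based inclusive start -> 0-based start
    query_end = end        # 1-based inclusive end == 0-based exclusive end
    return query_start < bed_end and bed_start < query_end


def first_overlap(rows: Iterable[Dict], chrom: str,
                  pos: int, end: int) -> Optional[Dict]:
    """Return the first row on `chrom` that overlaps the query, else None."""
    for row in rows:
        if row["CHROM"] == chrom and overlaps(row["START"], row["END"], pos, end):
            return row
    return None
```

A BED interval with START=100 and END=110 covers 1-based positions 101 through 110, so a query at position 100 or 111 does not overlap it.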

strvcf_annotator.core.str_reference.get_str_at_position(str_panel_gz: str, chrom: str, pos: int) → Dict | None

Get STR region containing a specific position using tabix index.

Parameters:
  • str_panel_gz (str) – Path to BGZF-compressed, tabix-indexed STR reference file.

  • chrom (str) – Chromosome name.

  • pos (int) – Position to query (1-based).

Returns:

Dictionary with STR region data if position is within an STR, None otherwise.

Return type:

Optional[Dict]

Repeat utils module

Utilities for repeat sequence operations.

strvcf_annotator.core.repeat_utils.extract_repeat_sequence(str_row: Dict | Series) → str

Reconstruct repeat sequence from STR metadata.

Generates the full repeat sequence by repeating the repeat unit (RU) the calculated number of times (COUNT).

Parameters:

str_row (Dict or pd.Series) – STR region data containing ‘RU’ (repeat unit) and ‘COUNT’ (number of repeats)

Returns:

Full repeat sequence

Return type:

str

Examples

>>> str_row = {'RU': 'CAG', 'COUNT': 5}
>>> extract_repeat_sequence(str_row)
'CAGCAGCAGCAGCAG'

strvcf_annotator.core.repeat_utils.count_repeat_units(sequence: str, motif: str) → int

Return the longest contiguous run of motif in sequence.

The function looks for exact, non-overlapping copies of motif that occur consecutively and returns the maximum number of such copies in any run.

This corresponds to how STR repeat counts are typically defined: the length of the longest perfect contiguous block of the repeat unit.

Parameters:
  • sequence (str) – DNA sequence to search.

  • motif (str) – Repeat unit motif to count (e.g. ‘A’, ‘CAG’).

Returns:

Length of the longest contiguous run of motif in sequence.

Return type:

int

Raises:

ValueError – If motif is empty or if either argument is not a string.

Examples

Perfect repeats

>>> count_repeat_units("CAGCAGCAG", "CAG")
3

Imperfect tail

>>> count_repeat_units("CAGCAGCA", "CAG")
2

No repeats

>>> count_repeat_units("ATCG", "CAG")
0

Homopolymer runs

>>> count_repeat_units("ATAAAAA", "A")
5
>>> count_repeat_units("AAAATAAA", "A")  # longest contiguous run is 'AAAA'
4

Overlapping motifs

‘AAAA’ with motif ‘AA’ contains two non-overlapping copies: ‘AA’ ‘AA’

>>> count_repeat_units("AAAA", "AA")
2
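
The definition above (longest run of exact, consecutive, non-overlapping copies) can be reproduced with a regular expression. This is a sketch under that definition, not the library's code; longest_run is a hypothetical name and matching is case-sensitive here. A lookahead is used so that a run starting inside an earlier, shorter match is still considered.

```python
import re


def longest_run(sequence: str, motif: str) -> int:
    """Return the maximum number of consecutive motif copies in sequence."""
    if not isinstance(sequence, str) or not isinstance(motif, str):
        raise ValueError("sequence and motif must be strings")
    if not motif:
        raise ValueError("motif must not be empty")
    # The lookahead tries every start position; group 1 is the greedy
    # run of whole motif copies beginning at that position.
    pattern = re.compile(f"(?=((?:{re.escape(motif)})+))")
    runs = (len(m.group(1)) // len(motif) for m in pattern.finditer(sequence))
    return max(runs, default=0)
```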

strvcf_annotator.core.repeat_utils.normalize_variant(pos: int, ref: str, alt: str) → Tuple[int, str, str]

Locally normalize (pos, ref, alt) by trimming shared prefix/suffix.

  • pos is 1-based VCF coordinate.

  • Trimming is case-insensitive.

  • We always keep at least 1 base in ref and alt if they differ.

Returns:

(new_pos, new_ref, new_alt)

Return type:

Tuple[int, str, str]
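
A minimal sketch of this kind of trimming is shown below. The order (shared suffix first, then shared prefix) is an assumption borrowed from common VCF normalization practice; the documentation above does not state which side is trimmed first, and trim_variant is a hypothetical name.

```python
from typing import Tuple


def trim_variant(pos: int, ref: str, alt: str) -> Tuple[int, str, str]:
    """Trim bases shared by ref and alt, keeping at least one base in each."""
    # Trim the shared suffix first; pos is unaffected by suffix trimming.
    while len(ref) > 1 and len(alt) > 1 and ref[-1].upper() == alt[-1].upper():
        ref, alt = ref[:-1], alt[:-1]
    # Then trim the shared prefix; each trimmed base shifts pos right by one.
    while len(ref) > 1 and len(alt) > 1 and ref[0].upper() == alt[0].upper():
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt
```

Under these assumptions, (10, "ACGT", "ACTT") reduces to the SNP (12, "G", "T"), and (100, "CAGCAG", "CAG") reduces to the deletion (100, "CAGC", "C").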

strvcf_annotator.core.repeat_utils.apply_variant_to_repeat(pos: int, ref: str, alt: str, repeat_start: int, repeat_seq: str) → str

Apply a variant to an STR repeat sequence.

The function applies a VCF variant to a reference STR sequence while respecting VCF normalization rules and STR boundaries.

The algorithm works as follows:

  1. Normalize pos, ref, and alt by trimming shared prefixes and suffixes.

  2. If the normalized variant lies fully inside the STR region, apply the full ALT allele.

  3. If the variant partially overlaps the STR region:

    • SNP-like variants (len(ref) == len(alt)) are aligned positionally.

    • Indel-like variants (len(ref) != len(alt)) use the suffix of ALT that overlaps the STR.

Any parts of the variant outside the STR window are ignored.

Notes

The genomic reference is conceptually treated as:

repeat_seq + UNKNOWN_SUFFIX

Differences outside the STR window do not affect the resulting mutated STR sequence.

Case handling rules:

  • All matching and overlap logic is case-insensitive.

  • Output case follows the overlapping STR segment:

    • lowercase STR slice → lowercase ALT

    • uppercase STR slice → uppercase ALT

    • mixed case → ALT is left unchanged

Parameters:
  • pos (int) – Variant position (1-based VCF coordinate).

  • ref (str) – Reference allele from the VCF record.

  • alt (str) – Alternate allele from the VCF record.

  • repeat_start (int) – Start position of the STR region (1-based).

  • repeat_seq (str) – Reference STR sequence from the panel.

Returns:

The mutated STR sequence after applying the variant. If the variant does not overlap the STR region, the original repeat_seq is returned unchanged.

Return type:

str

strvcf_annotator.core.repeat_utils.is_perfect_repeat(sequence: str, motif: str) → bool

Check if sequence is a perfect repeat of the motif.

A perfect repeat means the sequence consists entirely of exact copies of the motif with no interruptions or variations.

Parameters:
  • sequence (str) – DNA sequence to check

  • motif (str) – Repeat unit motif

Returns:

True if sequence is a perfect repeat, False otherwise

Return type:

bool

Examples

>>> is_perfect_repeat('CAGCAGCAG', 'CAG')
True
>>> is_perfect_repeat('CAGCAGCA', 'CAG')
False
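
The check itself reduces to comparing the sequence with the motif repeated a whole number of times. The sketch below is case-sensitive and treats an empty sequence or motif as not a perfect repeat; both choices, and the name perfect_repeat, are assumptions of this example.

```python
def perfect_repeat(sequence: str, motif: str) -> bool:
    """True if sequence is exactly motif repeated a whole number of times."""
    if not motif or not sequence:
        return False  # assumption: empty inputs are never perfect repeats
    copies, remainder = divmod(len(sequence), len(motif))
    return remainder == 0 and sequence == motif * copies
```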

Parser class

Abstract base class for VCF parsers.

class strvcf_annotator.parsers.base.BaseVCFParser

Bases: ABC

Abstract base class for VCF parsers.

Defines the interface for extracting genotype and variant information from VCF records in a standardized way. All parser implementations must inherit from this class and implement all abstract methods.

abstractmethod get_genotype(record: VariantRecord, sample_idx: int) → Tuple[int, int] | None

Extract genotype as (allele1, allele2) or None if unknown.

Parameters:
  • record (pysam.VariantRecord) – The VCF record to extract genotype from

  • sample_idx (int) – Index of the sample in the record

Returns:

Genotype as tuple of allele indices (0=REF, 1=ALT), or None if unknown

Return type:

Optional[Tuple[int, int]]

abstractmethod has_variant(record: VariantRecord, sample_idx: int) → bool

Check if sample has variant even when GT is unknown.

Parameters:
  • record (pysam.VariantRecord) – The VCF record to check

  • sample_idx (int) – Index of the sample in the record

Returns:

True if variant is present, False otherwise

Return type:

bool

abstractmethod extract_info(record: VariantRecord, sample_idx: int) → Dict[str, Any]

Extract additional fields (AD, DP, etc.) as dictionary.

Parameters:
  • record (pysam.VariantRecord) – The VCF record to extract information from

  • sample_idx (int) – Index of the sample in the record

Returns:

Dictionary of additional FORMAT fields

Return type:

Dict[str, Any]

abstractmethod validate_record(record: VariantRecord) → bool

Validate that record is compatible with this parser.

Parameters:

record (pysam.VariantRecord) – The VCF record to validate

Returns:

True if record is valid for this parser, False otherwise

Return type:

bool
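
To show the shape of a custom parser, here is a toy class that provides the four methods. It is written against duck-typed records whose samples behave like dictionaries so that it runs without pysam or this package; in real use it would inherit from BaseVCFParser. The class name and the AD-based fallback in has_variant are assumptions of the example.

```python
from typing import Any, Dict, Optional, Tuple


class ToyParser:  # in real use: class ToyParser(BaseVCFParser)
    """Toy parser for records whose samples behave like dicts."""

    def get_genotype(self, record, sample_idx: int) -> Optional[Tuple[int, int]]:
        gt = record.samples[sample_idx].get("GT")
        # Treat absent, non-diploid, or partially missing GT as unknown.
        if gt is None or len(gt) != 2 or None in gt:
            return None
        return (gt[0], gt[1])

    def has_variant(self, record, sample_idx: int) -> bool:
        gt = self.get_genotype(record, sample_idx)
        if gt is not None:
            return any(allele > 0 for allele in gt)
        # GT unknown: fall back to ALT read support in AD (assumed rule).
        ad = record.samples[sample_idx].get("AD") or ()
        return any(depth and depth > 0 for depth in ad[1:])

    def extract_info(self, record, sample_idx: int) -> Dict[str, Any]:
        sample = record.samples[sample_idx]
        return {key: sample[key] for key in ("AD", "DP") if key in sample}

    def validate_record(self, record) -> bool:
        return len(record.samples) > 0
```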

Generic parser

Generic parser for standard VCF format fields.

class strvcf_annotator.parsers.generic.GenericParser

Bases: BaseVCFParser

Generic parser for standard VCF format fields.

Handles standard VCF FORMAT fields including GT (genotype), AD (allelic depth), and DP (total depth). Provides robust error handling for missing or invalid data.

get_genotype(record: VariantRecord, sample_idx: int) → Tuple[int, int] | None

Extract GT field, return None for missing/invalid genotypes.

Parameters:
  • record (pysam.VariantRecord) – The VCF record to extract genotype from

  • sample_idx (int) – Index of the sample in the record

Returns:

Genotype as tuple of allele indices, or None if missing/invalid

Return type:

Optional[Tuple[int, int]]

has_variant(record: VariantRecord, sample_idx: int) → bool

Check variant presence using GT or alternative evidence.

Parameters:
  • record (pysam.VariantRecord) – The VCF record to check

  • sample_idx (int) – Index of the sample in the record

Returns:

True if variant is present, False otherwise

Return type:

bool

extract_info(record: VariantRecord, sample_idx: int) → Dict[str, Any]

Extract AD, DP and other standard FORMAT fields.

Parameters:
  • record (pysam.VariantRecord) – The VCF record to extract information from

  • sample_idx (int) – Index of the sample in the record

Returns:

Dictionary of FORMAT fields (AD, DP, etc.)

Return type:

Dict[str, Any]

validate_record(record: VariantRecord) → bool

Validate record has required fields for generic parsing.

Parameters:

record (pysam.VariantRecord) – The VCF record to validate

Returns:

True if record is valid, False otherwise

Return type:

bool