Main functionality

Top-level package for str_mut_signatures.

Analysis of Short Tandem Repeat (STR) mutation signatures from VCF files. Provides both library and CLI interfaces.

str_mut_signatures.parse_vcf_files(input_dir: str | Path, *, filter_by_pass: bool = True, filter_by_perfect: bool = True) → DataFrame

Process all VCF (.vcf / .vcf.gz) files in a directory into a DataFrame.

Supports GangSTR and conSTRain STR-annotated VCFs, as well as VCFs annotated with strvcf_annotator.

If a file causes an error, it is skipped and a message is printed.

Parameters:
  • input_dir (str or pathlib.Path) – Directory containing .vcf or .vcf.gz files.

  • filter_by_pass (bool, optional) – If True, keep only records with FILTER == "PASS".

  • filter_by_perfect (bool, optional) – If True, keep only records with INFO/PERFECT != "FALSE" when present.

Returns:

Parsed STR records concatenated across all input files.

Columns:

  • sample

  • tmp_id

  • tumor_allele_a

  • tumor_allele_b

  • normal_allele_a

  • normal_allele_b

  • end

  • period

  • ref

  • motif

  • genotype_separator

Return type:

pandas.DataFrame
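
The skip-on-error behaviour described above can be illustrated with a minimal sketch. It is not the package's implementation: collect_rows and fake_parse are hypothetical names, and the real function returns a DataFrame rather than a list.

```python
import tempfile
from pathlib import Path


def collect_rows(input_dir, parse_one):
    """Gather rows from every .vcf / .vcf.gz file, skipping files that fail."""
    rows = []
    for path in sorted(Path(input_dir).iterdir()):
        if not path.name.endswith((".vcf", ".vcf.gz")):
            continue  # ignore files with other extensions
        try:
            rows.extend(parse_one(path))
        except Exception as exc:  # a broken file must not stop the batch
            print(f"Skipping {path.name}: {exc}")
    return rows


def fake_parse(path):
    """Hypothetical stand-in parser that fails on one file."""
    if path.name.startswith("broken"):
        raise ValueError("malformed header")
    return [{"sample": path.name}]


with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.vcf", "broken.vcf.gz", "notes.txt"):
        (Path(tmp) / name).write_text("placeholder")
    rows = collect_rows(tmp, fake_parse)
```

Here only a.vcf contributes rows: notes.txt is ignored by extension and broken.vcf.gz is reported and skipped.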

str_mut_signatures.save_counts_matrix(mutations_data: DataFrame, output_csv: str | Path)

Save a mutation counts matrix to a CSV file.

Parameters:
  • mutations_data (pandas.DataFrame) – DataFrame containing mutation count data to be written to disk.

  • output_csv (str or pathlib.Path) – Path to the output CSV file.

Return type:

None

str_mut_signatures.process_vcf_to_rows(path: str | Path, *, filter_by_pass: bool = True, filter_by_perfect: bool = True)

Parse a single STR-annotated VCF into row dictionaries.

Supports:

  • GangSTR: uses FORMAT/REPCN as copy number

  • conSTRain: uses FORMAT/REPLEN as copy number

  • VCF annotated with strvcf_annotator (INFO/RU, INFO/REF, FORMAT/REPCN)

Filtering options

filter_by_pass

If True (default), keep only records with FILTER == "PASS". If False, ignore the FILTER field.

filter_by_perfect

If True (default), and INFO/PERFECT is present, keep only records where PERFECT != "FALSE" (i.e. skip variants where PERFECT == "FALSE"). If False, ignore the PERFECT flag completely.

Assumptions after validation

  • First sample column after FORMAT is NORMAL (index 9 in standard VCF).

  • Second sample column after FORMAT is TUMOR (index 10).

  • STR annotations are present in INFO / FORMAT.

Parameters:
  • path (str or pathlib.Path) – Path to the STR-annotated VCF file.

  • filter_by_pass (bool, optional) – Whether to keep only records with FILTER == "PASS".

  • filter_by_perfect (bool, optional) – Whether to filter by INFO/PERFECT when present.

Returns:

List of dictionaries, one per parsed STR record.

Return type:

list[dict]
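
The two filtering options can be sketched for a single tab-separated VCF data line. This is only an illustration of the rules stated above: keep_record and parse_info are hypothetical helpers, and the real parser may read records differently.

```python
def parse_info(info_field):
    """Turn 'KEY=VAL;FLAG' into a dict; bare flags map to True."""
    info = {}
    for item in info_field.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        info[key] = value if sep else True
    return info


def keep_record(vcf_line, *, filter_by_pass=True, filter_by_perfect=True):
    """Apply the FILTER and INFO/PERFECT rules to one VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    filter_value = fields[6]        # FILTER column
    info = parse_info(fields[7])    # INFO column
    if filter_by_pass and filter_value != "PASS":
        return False
    # Records without INFO/PERFECT are kept; only PERFECT == "FALSE" is dropped.
    if filter_by_perfect and info.get("PERFECT") == "FALSE":
        return False
    return True


# Columns: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL TUMOR
template = "chr1\t100\t.\tATATAT\tATAT\t.\t{flt}\tRU=AT;{perfect}\tGT:REPCN\t0/0:3,3\t0/1:3,2"
perfect_pass = template.format(flt="PASS", perfect="PERFECT=TRUE")
imperfect_pass = template.format(flt="PASS", perfect="PERFECT=FALSE")
perfect_lowqual = template.format(flt="LowQual", perfect="PERFECT=TRUE")
```

With the defaults, only perfect_pass is kept; passing filter_by_perfect=False or filter_by_pass=False re-admits the other two lines respectively.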

str_mut_signatures.build_mutation_matrix(mutations_data: DataFrame, *, ru_length: bool = True, ru: Literal[None, 'class', 'ru'] = None, ref_length: bool = True, change: bool = True) → DataFrame

Build a somatic STR mutation count matrix from paired tumor–normal data.

This function converts per-locus STR mutation calls into a sample-by-feature count matrix. Feature definitions are controlled by repeat-unit length, repeat-unit content, reference length, and somatic change options.

Parameters:
  • mutations_data (pandas.DataFrame) –

    Parsed STR mutation data, typically returned by parse_vcf_files().

    Required columns include:

    • sample

    • normal_allele_a, normal_allele_b

    • tumor_allele_a, tumor_allele_b

    • motif or RU (repeat unit sequence)

    • genotype_separator ('|', '/', or missing)

  • ru_length (bool, default True) – If True, include the repeat-unit length as LEN{len(motif)} in the feature key.

  • ru ({None, "class", "ru"}, default None) –

    Controls how repeat-unit content is represented in the feature key.

    • None : Do not include repeat-unit content.

    • "ru" : Use the full repeat-unit sequence (e.g. A, AT, AAT).

    • "class" : Use base-composition class of the repeat unit:

      • AT_only : motif contains only A/T

      • GC_only : motif contains only G/C

      • mixed : mixed A/T and G/C

  • ref_length (bool, default True) –

    If True, include a reference-length component derived from the normal allele repeat counts.

    • Phased genotypes: per-allele normal repeat count

    • Unphased genotypes: combined normal repeat count

  • change (bool, default True) –

    If True, include the tumor–normal repeat count change (delta) in the feature key and retain only non-zero changes (somatic events).

    If False, ignore delta and retain all loci that pass basic numeric checks, producing presence/absence-style summaries.

Returns:

STR mutation count matrix with:

  • rows: samples

  • columns: STR mutation feature categories

  • values: counts of allele-level or combined STR mutation events

Return type:

pandas.DataFrame

Notes

Phasing behavior is determined by genotype_separator:

  • '|' : Genotypes are treated as phased, producing two allele-level events per locus.

  • '/' or missing : Genotypes are treated as unphased, producing a single combined event per locus based on total tumor vs. normal repeat counts.
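
The interaction between phasing and the feature options can be sketched as follows. Only the LEN{len(motif)} component and the class names come from the description above; the REF prefix, the signed delta format, the "_" joiner, and all function names are assumptions made for illustration.

```python
def ru_class(motif):
    """Base-composition class of a repeat unit."""
    bases = set(motif.upper())
    if bases <= {"A", "T"}:
        return "AT_only"
    if bases <= {"G", "C"}:
        return "GC_only"
    return "mixed"


def locus_events(normal, tumor, separator):
    """Return (reference_length, delta) pairs for one locus."""
    if separator == "|":
        # Phased: one event per allele.
        return [(n, t - n) for n, t in zip(normal, tumor)]
    # Unphased ('/' or missing): one combined event from total repeat counts.
    return [(sum(normal), sum(tumor) - sum(normal))]


def feature_keys(motif, normal, tumor, separator, *,
                 ru_length=True, ru=None, ref_length=True, change=True):
    """Build hypothetical feature keys for one locus."""
    keys = []
    for ref_len, delta in locus_events(normal, tumor, separator):
        if change and delta == 0:
            continue  # keep only somatic (non-zero) changes
        parts = []
        if ru_length:
            parts.append(f"LEN{len(motif)}")
        if ru == "ru":
            parts.append(motif)
        elif ru == "class":
            parts.append(ru_class(motif))
        if ref_length:
            parts.append(f"REF{ref_len}")      # assumed format
        if change:
            parts.append(f"{delta:+d}")        # assumed format
        keys.append("_".join(parts))
    return keys


phased = feature_keys("AT", (10, 12), (10, 11), "|")
unphased = feature_keys("AT", (10, 12), (10, 11), "/")
```

For this locus the phased call yields one allele-level event (the unchanged allele is dropped), while the unphased call yields one combined event with a reference length of 22.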

str_mut_signatures.filter_mutation_matrix(matrix: DataFrame, *, feature_method: Literal['manual', 'elbow', 'percentile'] = 'manual', min_feature_total: int | None = 10, min_samples_with_feature: int | None = 3, feature_percentile: float = 0.95, min_sample_total: int | None = 0) → tuple[DataFrame, FilterSummary]

Filter a mutation count matrix (samples × features) based on simple metrics.

Parameters:
  • matrix (pandas.DataFrame) – Mutation count matrix with samples as rows and mutation features as columns.

  • feature_method ({"manual", "elbow", "percentile"}, optional) –

    Strategy for choosing a feature-level total-count threshold.

    • "manual":

      Use min_feature_total directly.

    • "elbow":

      Ignore min_feature_total and use an elbow heuristic based on the distribution of feature total counts.

    • "percentile":

      Ignore min_feature_total and keep features whose total count is >= the given percentile of the distribution.

  • min_feature_total (int or None, optional) – Minimal total count across all samples for a feature to be kept (only used when feature_method="manual"). If None, no total-count threshold is applied.

  • min_samples_with_feature (int or None, optional) – Minimal number of samples in which a feature must be non-zero. If None, no prevalence threshold is applied.

  • feature_percentile (float, optional) – When feature_method="percentile", features with total_count >= the feature_percentile quantile of the distribution are kept. Must be between 0 and 1.

  • min_sample_total (int or None, optional) – Minimal total count per sample to be kept. If None, no sample-level filter is applied.

Returns:

  • filtered_matrix (pandas.DataFrame) – Matrix with filtered samples/features.

  • summary (FilterSummary) – Structured summary containing:

    • feature_stats: DataFrame of per-feature metrics

    • sample_stats: DataFrame of per-sample metrics

    • feature_threshold_used: int or None

    • sample_threshold_used: int or None

Raises:
  • ValueError – If feature_method is not one of "manual", "elbow", or "percentile".

  • ValueError – If feature_method="percentile" and feature_percentile is not in the interval [0, 1].
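
A rough sketch of the "manual" and "percentile" feature thresholds, written with plain pandas. The helper name filter_features is hypothetical, the elbow heuristic and the sample-level filter are left out, and the package's own implementation may differ in details such as quantile interpolation.

```python
import pandas as pd


def filter_features(matrix, *, feature_method="manual", min_feature_total=10,
                    min_samples_with_feature=3, feature_percentile=0.95):
    """Keep features that pass a total-count and a prevalence threshold."""
    totals = matrix.sum(axis=0)               # total count per feature
    prevalence = (matrix > 0).sum(axis=0)     # samples with a non-zero count

    if feature_method == "manual":
        threshold = min_feature_total
    elif feature_method == "percentile":
        if not 0 <= feature_percentile <= 1:
            raise ValueError("feature_percentile must be in [0, 1]")
        threshold = totals.quantile(feature_percentile)
    elif feature_method == "elbow":
        raise NotImplementedError("the elbow heuristic is not part of this sketch")
    else:
        raise ValueError(f"unknown feature_method: {feature_method!r}")

    keep = pd.Series(True, index=matrix.columns)
    if threshold is not None:
        keep &= totals >= threshold
    if min_samples_with_feature is not None:
        keep &= prevalence >= min_samples_with_feature
    return matrix.loc[:, keep], threshold


counts = pd.DataFrame(
    {"F1": [5, 6, 4], "F2": [1, 0, 0], "F3": [20, 0, 0]},
    index=["S1", "S2", "S3"],
)
manual, _ = filter_features(counts)
no_prevalence, _ = filter_features(counts, min_samples_with_feature=None)
by_percentile, used = filter_features(
    counts, feature_method="percentile", min_samples_with_feature=None
)
```

In this toy matrix F3 has a high total but occurs in a single sample, so it survives the total-count threshold and is removed only by the prevalence threshold.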

class str_mut_signatures.NMFResult(signatures: DataFrame, exposures: DataFrame, groups: DataFrame, model_params: dict[str, Any])

Bases: object

Container for NMF-based STR mutation signature decomposition.

signatures

Matrix of signature profiles.

  • index : features (same as input matrix.columns)

  • columns : signatures (Signature_1, Signature_2, …)

Type:

pandas.DataFrame

exposures

Matrix of sample exposures to each signature.

  • index : samples (same as input matrix.index)

  • columns : signatures (Signature_1, Signature_2, …)

Type:

pandas.DataFrame

groups

Sample-level grouping or annotation table aligned to exposures. Typically indexed by sample (same as input matrix.index).

Type:

pandas.DataFrame

model_params

Hyperparameters and metadata used to fit the model (e.g. n_signatures, init, max_iter, random_state).

Type:

dict[str, Any]

str_mut_signatures.run_nmf(matrix: DataFrame, n_signatures: int, init: str = 'nndsvd', max_iter: int = 200, random_state: int | None = 0, alpha_W: float = 0.0, alpha_H: float = 0.0, l1_ratio: float = 0.0, max_clusters: int = 1) → NMFResult

Run NMF decomposition on an STR mutation count matrix.

This function factorizes a non-negative mutation count matrix into:

  • signature profiles (feature-by-signature)

  • sample exposures (sample-by-signature)

Optionally, samples can be clustered based on their exposure profiles.

Parameters:
  • matrix (pandas.DataFrame) –

    Non-negative count matrix.

    • rows : samples

    • columns : mutation feature categories

  • n_signatures (int) – Number of signatures (components) to extract.

  • init (str, optional) – Initialization method for NMF (passed to the underlying estimator). Default is "nndsvd".

  • max_iter (int, optional) – Maximum number of iterations. Default is 200.

  • random_state (int or None, optional) – Random seed for reproducibility. If None, the estimator is not seeded. Default is 0.

  • alpha_W (float, optional) – Regularization parameter for the W matrix (exposures), if supported by the chosen NMF implementation. Default is 0.0.

  • alpha_H (float, optional) – Regularization parameter for the H matrix (signatures), if supported by the chosen NMF implementation. Default is 0.0.

  • l1_ratio (float, optional) – The Elastic-Net mixing parameter, with 0 <= l1_ratio <= 1. Default is 0.0.

  • max_clusters (int, optional) – Maximum number of clusters to consider for optional exposure-based clustering. Values <= 1 disable clustering. Default is 1.

Returns:

Container with signature profiles, exposures, optional grouping information, and model parameters.

Return type:

NMFResult

Raises:
  • ValueError – If matrix is empty, contains non-numeric values, or contains negative entries.

  • ValueError – If n_signatures is not a positive integer.

str_mut_signatures.save_nmf_result(result: NMFResult, outdir: str | Path) → None

Save an NMFResult to a directory on disk.

This function writes the main components of the NMF decomposition to tabular and JSON files in the specified output directory.

The following files are created:

  • signatures.tsv:

    Signature profiles (features × K).

  • exposures.tsv:

    Sample exposures (samples × K).

  • metadata.json:

    JSON file containing model_params together with basic shape information and a format_version field.

Parameters:
  • result (NMFResult) – Result object containing signatures, exposures, groups, and model parameters to be saved.

  • outdir (str or pathlib.Path) – Output directory where result files will be written.

Return type:

None

str_mut_signatures.load_nmf_result(outdir: str | Path) → NMFResult

Load an NMFResult previously saved with save_nmf_result().

This function reads the files created by save_nmf_result from the given directory and reconstructs the corresponding NMFResult object.

The following files are expected in outdir:

  • signatures.tsv:

    Signature profiles (features × K).

  • exposures.tsv:

    Sample exposures (samples × K).

  • metadata.json:

    JSON file containing model parameters and basic shape information.

Parameters:

outdir (str or pathlib.Path) – Directory containing the saved NMF result files.

Returns:

Reconstructed NMF result with signatures, exposures, groups (if present), and model parameters.

Return type:

NMFResult
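
A round trip through the documented directory layout might look like the sketch below. The file names come from the description above; the metadata keys other than format_version, the format_version value, and the helper names save_tables and load_tables are assumptions, and groups are not handled here.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def save_tables(signatures, exposures, model_params, outdir):
    """Write signatures.tsv, exposures.tsv and metadata.json to outdir."""
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    signatures.to_csv(outdir / "signatures.tsv", sep="\t")
    exposures.to_csv(outdir / "exposures.tsv", sep="\t")
    metadata = {
        "format_version": 1,                    # assumed value
        "model_params": model_params,
        "n_features": int(signatures.shape[0]),  # assumed key names
        "n_samples": int(exposures.shape[0]),
        "n_signatures": int(signatures.shape[1]),
    }
    (outdir / "metadata.json").write_text(json.dumps(metadata, indent=2))


def load_tables(outdir):
    """Read back the three files written by save_tables."""
    outdir = Path(outdir)
    signatures = pd.read_csv(outdir / "signatures.tsv", sep="\t", index_col=0)
    exposures = pd.read_csv(outdir / "exposures.tsv", sep="\t", index_col=0)
    metadata = json.loads((outdir / "metadata.json").read_text())
    return signatures, exposures, metadata


names = ["Signature_1", "Signature_2"]
signatures = pd.DataFrame([[0.5, 0.0], [0.25, 0.75], [0.25, 0.25]],
                          index=["F1", "F2", "F3"], columns=names)
exposures = pd.DataFrame([[4.0, 1.0], [0.5, 6.0]],
                         index=["S1", "S2"], columns=names)

with tempfile.TemporaryDirectory() as tmp:
    save_tables(signatures, exposures, {"n_signatures": 2, "init": "nndsvd"}, tmp)
    loaded_sig, loaded_exp, metadata = load_tables(tmp)
```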

str_mut_signatures.project_onto_signatures(new_matrix: DataFrame, signatures: DataFrame, method: str = 'nnls') → DataFrame

Project new samples onto existing signatures to obtain exposures.

This function computes exposure weights for each new sample given a fixed set of signature profiles.

Parameters:
  • new_matrix (pandas.DataFrame) –

    Matrix of new samples.

    • rows : samples

    • columns : features (must overlap signatures.index)

  • signatures (pandas.DataFrame) –

    Signature profile matrix.

    • index : features (same feature space as new_matrix.columns)

    • columns : signatures (e.g., "Signature_1", "Signature_2", …)

  • method ({"nnls"}, optional) – Projection method. Currently only non-negative least squares ("nnls") is implemented.

Returns:

Exposure matrix for the new samples.

  • rows : samples (same as new_matrix.index)

  • columns : signatures (same as signatures.columns)

Return type:

pandas.DataFrame

Notes

For method="nnls", the exposures e (length K) of each sample vector x (length F) are obtained by solving:

minimize || x - A e ||_2   subject to e >= 0

where A is the feature-by-signature matrix (F × K).

Raises:

ValueError – If method is not supported or if there is no overlap between new_matrix.columns and signatures.index.
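
The per-sample NNLS problem can be sketched with scipy.optimize.nnls. Restricting both tables to their shared features is an assumption about how overlap is handled, and project_nnls is a hypothetical name.

```python
import numpy as np
import pandas as pd
from scipy.optimize import nnls


def project_nnls(new_matrix, signatures):
    """Solve min ||x - A e||_2 subject to e >= 0 for every sample row x."""
    shared = signatures.index.intersection(new_matrix.columns)
    if shared.empty:
        raise ValueError("no overlapping features between matrix and signatures")
    A = signatures.loc[shared].to_numpy(dtype=float)   # F × K
    rows = []
    for _, x in new_matrix[shared].iterrows():
        e, _residual_norm = nnls(A, x.to_numpy(dtype=float))
        rows.append(e)
    return pd.DataFrame(rows, index=new_matrix.index, columns=signatures.columns)


signatures = pd.DataFrame(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    index=["F1", "F2", "F3"],
    columns=["Signature_1", "Signature_2"],
)
true_exposures = np.array([[2.0, 3.0], [0.0, 4.0]])
new_samples = pd.DataFrame(
    true_exposures @ signatures.to_numpy().T,   # exact mixtures of the signatures
    index=["N1", "N2"],
    columns=signatures.index,
)
recovered = project_nnls(new_samples, signatures)
```

Because the toy samples are exact non-negative mixtures of the two signatures, the recovered exposures match the ones used to build them up to numerical precision.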

str_mut_signatures.compute_pca(matrix: DataFrame, n_components: int = 2) → tuple[DataFrame, ndarray]

Compute PCA on a samples × features matrix (e.g. exposures).

Parameters:
  • matrix (pandas.DataFrame) –

    Numeric matrix.

    • rows : samples

    • columns : features or signatures

  • n_components (int, default 2) – Number of principal components to compute.

Returns:

  • coords (pandas.DataFrame) –

    PCA coordinates with:

    • index : same as matrix.index

    • columns : PC1, PC2, …, PC{n_components}

  • explained_variance_ratio_ (numpy.ndarray) – 1D array of length n_components with the fraction of variance explained by each component.
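
The two return values can be reproduced with a small NumPy sketch based on the singular value decomposition of the centered matrix. This is a stand-in for illustration, not necessarily how the package computes PCA; pca_sketch is a hypothetical name.

```python
import numpy as np
import pandas as pd


def pca_sketch(matrix, n_components=2):
    """PCA via SVD of the column-centered samples × features matrix."""
    X = matrix.to_numpy(dtype=float)
    Xc = X - X.mean(axis=0)                           # center each feature
    U, S, _Vt = np.linalg.svd(Xc, full_matrices=False)
    coords = U[:, :n_components] * S[:n_components]   # sample scores
    variance = S ** 2                                 # proportional to explained variance
    ratio = (variance / variance.sum())[:n_components]
    columns = [f"PC{i + 1}" for i in range(n_components)]
    return pd.DataFrame(coords, index=matrix.index, columns=columns), ratio


exposures = pd.DataFrame(
    [[2.0, 0.5, 1.0], [1.5, 0.7, 0.8], [0.2, 3.0, 0.1], [0.1, 2.5, 0.4]],
    index=["S1", "S2", "S3", "S4"],
    columns=["Signature_1", "Signature_2", "Signature_3"],
)
coords, explained = pca_sketch(exposures, n_components=2)
```

Component signs are arbitrary in any PCA, so coordinates from another implementation may be mirrored along an axis.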

str_mut_signatures.plot_exposures(result: NMFResult, *, stacked: bool = True, figsize: tuple[float, float] | None = None, max_samples_per_fig: int | None = None, plot: Literal['both', 'absolute', 'proportion'] = 'both') → dict[str, list[Figure]]

Plot sample exposures from an NMFResult.

This function generates exposure visualizations while enforcing:

  • Sample order is identical across all panels/plots (absolute, proportion, and per-signature views use the same ordering rule).

  • Signature stacking/order is consistent (uses result.exposures.columns).

Expected inputs:

  • result.exposures: pandas.DataFrame with samples as index and signatures as columns.

  • result.groups: pandas.DataFrame with a "group" column (optional), used to determine sample ordering and/or grouping.

Parameters:
  • result (NMFResult) – Output of run_nmf() containing exposures and optional grouping information.

  • stacked (bool, optional) – If True, plot stacked bar charts. If False, plot grouped (side-by-side) bars where applicable. Default is True.

  • figsize (tuple[float, float] or None, optional) – Figure size passed to matplotlib. If None, a default size is chosen based on the number of samples and signatures.

  • max_samples_per_fig (int or None, optional) – Maximum number of samples to show per figure. If provided, samples are split across multiple figures. If None, all samples are plotted in a single figure (may be large).

  • plot ({"both", "absolute", "proportion"}, optional) –

    Which exposure views to generate:

    • "absolute": plot raw exposure values.

    • "proportion": plot exposures normalized to sum to 1 per sample.

    • "both": generate both absolute and proportional plots.

    Default is "both".

Returns:

Dictionary mapping plot type to a list of created figures. Keys depend on plot (e.g. "absolute", "proportion").

Return type:

dict[str, list[matplotlib.figure.Figure]]
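
The data preparation behind the "proportion" view and the max_samples_per_fig split can be sketched without any plotting. Both helper names are hypothetical, and leaving all-zero samples at 0 is an assumption about how such rows are treated.

```python
import pandas as pd


def to_proportions(exposures):
    """Normalize each sample's exposures to sum to 1 (all-zero rows stay 0)."""
    totals = exposures.sum(axis=1)
    # Mask zero totals to avoid division by zero, then fill the gaps with 0.
    return exposures.div(totals.where(totals > 0), axis=0).fillna(0.0)


def chunk_samples(samples, max_samples_per_fig=None):
    """Split an ordered sample list into one chunk per figure."""
    samples = list(samples)
    if max_samples_per_fig is None:
        return [samples]
    return [samples[i:i + max_samples_per_fig]
            for i in range(0, len(samples), max_samples_per_fig)]


exposures = pd.DataFrame(
    {"Signature_1": [2.0, 0.0, 1.0], "Signature_2": [6.0, 0.0, 1.0]},
    index=["S1", "S2", "S3"],
)
proportions = to_proportions(exposures)
chunks = chunk_samples(exposures.index, max_samples_per_fig=2)
```

Chunking the already ordered sample list keeps the sample order identical across the absolute and proportional views.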

str_mut_signatures.plot_pca_samples(result: NMFResult, *, matrix: DataFrame | None = None, n_components: int = 2, ax: Axes | None = None, title: str | None = None, alpha: float = 0.8, cmap: str | None = None, s: float = 30.0) → tuple[DataFrame, ndarray, Axes]

Run PCA on an NMF result (typically exposures) and plot PC1 vs PC2.

This is the main entry point for PCA visualization:

  • Extract a samples × features matrix from result (by default result.exposures).

  • Compute PCA coordinates.

  • Color samples by result.groups.

  • Plot the first two principal components.

Parameters:
  • result (NMFResult) – Output of run_nmf(). Must provide .exposures as a pandas.DataFrame unless matrix is explicitly provided.

  • matrix (pandas.DataFrame or None, optional) – Optional matrix to use instead of result.exposures. Must be samples × features. If None, uses result.exposures.

  • n_components (int, optional) – Number of principal components to compute. Must be >= 2. Default is 2.

  • ax (matplotlib.axes.Axes or None, optional) – Existing axes to plot on. If None, a new figure and axes are created.

  • title (str or None, optional) – Plot title. If None, a default title is generated.

  • alpha (float, optional) – Point transparency. Default is 0.8.

  • cmap (str or None, optional) –

    Matplotlib colormap name used for coloring samples.

    • If group is detected as categorical, the default is "tab20".

    • If group is detected as continuous, the default is "viridis".

    If provided explicitly, this colormap is used in both cases.

  • s (float, optional) – Point size. Default is 30.0.

Returns:

  • coords (pandas.DataFrame) – PCA coordinates for each sample (PC1, PC2, …), indexed by sample.

  • explained_variance_ratio_ (numpy.ndarray) – Fraction of variance explained by each principal component.

  • ax (matplotlib.axes.Axes) – Axes containing the PCA scatter plot.

Raises:

ValueError – If n_components < 2 or if the chosen matrix is empty or non-numeric.

str_mut_signatures.plot_signatures(result: NMFResult, top_n: int = 20, signatures: list[int] | list[str] | None = None, figsize: tuple[float, float] | None = None, sharey: bool = False) → Figure

Plot per-signature bar plots of feature loadings.

This function visualizes the strongest features for each selected signature as bar plots, using the signature profiles stored in result.signatures.

Parameters:
  • result (NMFResult) – Output of run_nmf() containing the signature matrix.

  • top_n (int, optional) – Number of top features (by absolute loading) to display per signature. Default is 20.

  • signatures (list[int] or list[str] or None, optional) –

    Which signatures to plot.

    • If None, all signatures are plotted.

    • If list[int], values are interpreted as 1-based indices (1 .. K).

    • If list[str], values must match column names in result.signatures.

  • figsize (tuple[float, float] or None, optional) – Figure size passed to matplotlib. If None, a default size is chosen based on the number of signatures.

  • sharey (bool, optional) – If True, all subplots share the same y-axis. Default is False.

Returns:

The created matplotlib figure containing the signature bar plots.

Return type:

matplotlib.figure.Figure
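
The selection logic described for signatures and top_n can be sketched independently of the plotting code. resolve_signatures and top_features are hypothetical helpers that follow the rules stated above (1-based integer indices, names matching result.signatures columns, ranking by absolute loading).

```python
import pandas as pd


def resolve_signatures(signatures, selection=None):
    """Map None, 1-based indices, or names to signature column names."""
    columns = list(signatures.columns)
    if selection is None:
        return columns
    names = []
    for item in selection:
        if isinstance(item, int):
            if not 1 <= item <= len(columns):
                raise IndexError(f"signature index {item} out of range 1..{len(columns)}")
            names.append(columns[item - 1])
        else:
            if item not in columns:
                raise KeyError(f"unknown signature: {item!r}")
            names.append(item)
    return names


def top_features(signatures, name, top_n=20):
    """Return the top_n loadings of one signature, ranked by absolute value."""
    column = signatures[name]
    order = column.abs().sort_values(ascending=False).index[:top_n]
    return column.loc[order]


signatures = pd.DataFrame(
    {"Signature_1": [0.1, 0.7, 0.2, 0.0], "Signature_2": [0.4, 0.0, 0.1, 0.5]},
    index=["F1", "F2", "F3", "F4"],
)
selected = resolve_signatures(signatures, [2])
strongest = top_features(signatures, "Signature_1", top_n=2)
```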