Main functionality

Top-level package for str_mut_signatures.

Analysis of Short Tandem Repeat (STR) mutation signatures from VCF files. Provides both library and CLI interfaces.

str_mut_signatures.parse_vcf_files(input_dir: str | Path, *, filter_by_pass: bool = True, filter_by_perfect: bool = True) → DataFrame

Process all VCF (.vcf / .vcf.gz) files in a directory into a DataFrame.

Supports GangSTR and conSTRain STR-annotated VCFs, as well as VCFs annotated with strvcf_annotator.

If a file causes an error, it is skipped and a message is printed.

Parameters:

input_dir (str or pathlib.Path) – Directory containing .vcf or .vcf.gz files.
filter_by_pass (bool, optional) – If True, keep only records with FILTER == "PASS".
filter_by_perfect (bool, optional) – If True, skip records where INFO/PERFECT == "FALSE"; records without INFO/PERFECT are kept.

Returns:

Parsed STR records concatenated across all input files.

Columns:

sample
tmp_id
tumor_allele_a
tumor_allele_b
normal_allele_a
normal_allele_b
end
period
ref
motif
genotype_separator

Return type:

pandas.DataFrame
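
The directory scan and skip-on-error behavior described above can be sketched with the standard library alone. This is an illustration, not the package's implementation: iter_vcf_paths, count_records, and scan_directory are hypothetical names, and count_records is only a stand-in for a real per-file parser.

```python
import gzip
from pathlib import Path


def iter_vcf_paths(input_dir):
    """Yield .vcf and .vcf.gz files in input_dir, sorted by name."""
    for path in sorted(Path(input_dir).iterdir()):
        if path.name.endswith((".vcf", ".vcf.gz")):
            yield path


def count_records(path):
    """Hypothetical stand-in for a per-file parser: count non-header lines."""
    opener = gzip.open if path.name.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for line in handle if not line.startswith("#"))


def scan_directory(input_dir):
    """Process every VCF; a file that raises is skipped with a printed message."""
    results = {}
    for path in iter_vcf_paths(input_dir):
        try:
            results[path.name] = count_records(path)
        except Exception as exc:  # broad on purpose: any failure skips the file
            print(f"Skipping {path.name}: {exc}")
    return results
```

One unreadable file therefore does not abort the whole batch; it is simply absent from the result.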

str_mut_signatures.save_counts_matrix(mutations_data: DataFrame, output_csv: str | Path)

Save a mutation counts matrix to a CSV file.

Parameters:

mutations_data (pandas.DataFrame) – DataFrame containing mutation count data to be written to disk.
output_csv (str or pathlib.Path) – Path to the output CSV file.

Return type:

None

str_mut_signatures.process_vcf_to_rows(path: str | Path, *, filter_by_pass: bool = True, filter_by_perfect: bool = True)

Parse a single STR-annotated VCF into row dictionaries.

Supports:

GangSTR: uses FORMAT/REPCN as copy number
conSTRain: uses FORMAT/REPLEN as copy number
VCF annotated with strvcf_annotator (INFO/RU, INFO/REF, FORMAT/REPCN)

Filtering options

filter_by_pass: If True (default), keep only records with FILTER == "PASS". If False, ignore the FILTER field.
filter_by_perfect: If True (default), and INFO/PERFECT is present, keep only records where PERFECT != "FALSE" (i.e. skip variants where PERFECT == "FALSE"). If False, ignore the PERFECT flag completely.
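
The two filtering options combine as in the following minimal sketch. keep_record is a hypothetical helper, and it assumes the INFO field has already been parsed into a dictionary; the package's internal representation may differ.

```python
def keep_record(filter_field, info, *, filter_by_pass=True, filter_by_perfect=True):
    """Return True if a record survives both optional filters.

    filter_field: value of the FILTER column, e.g. "PASS".
    info: dict of INFO key -> value (assumed to be parsed already).
    """
    if filter_by_pass and filter_field != "PASS":
        return False
    # PERFECT is consulted only when the key is present in INFO.
    if filter_by_perfect and info.get("PERFECT") == "FALSE":
        return False
    return True
```

A record with no PERFECT key passes the second check regardless of the flag, which matches the "when present" wording above.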

Assumptions after validation

First sample column after FORMAT is NORMAL (0-based column index 9 in a standard VCF).
Second sample column after FORMAT is TUMOR (0-based column index 10).
STR annotations are present in INFO / FORMAT.
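
The column assumptions above can be illustrated on a single tab-separated record. This is a sketch only: split_samples and copy_numbers are hypothetical helpers, and the comma-separated "a,b" layout of the copy-number field is an assumption about the input, not something this page specifies.

```python
def split_samples(vcf_line):
    """Return (normal, tumor) FORMAT dicts from one tab-separated VCF record."""
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("expected at least two sample columns")
    keys = fields[8].split(":")                      # FORMAT column (0-based index 8)
    normal = dict(zip(keys, fields[9].split(":")))   # first sample column: NORMAL
    tumor = dict(zip(keys, fields[10].split(":")))   # second sample column: TUMOR
    return normal, tumor


def copy_numbers(sample, key="REPCN"):
    """Parse an assumed comma-separated pair of repeat counts, e.g. "3,4"."""
    allele_a, allele_b = sample[key].split(",")
    return int(allele_a), int(allele_b)
```

For conSTRain-style input the same helper would be called with key="REPLEN", again assuming the two-value layout.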

Parameters:

path (str or pathlib.Path) – Path to the STR-annotated VCF file.
filter_by_pass (bool, optional) – Whether to keep only records with FILTER == "PASS".
filter_by_perfect (bool, optional) – Whether to filter by INFO/PERFECT when present.

Returns:

List of dictionaries, one per parsed STR record.

Return type:

list[dict]

str_mut_signatures.build_mutation_matrix(mutations_data: DataFrame, *, ru_length: bool = True, ru: Literal[None, 'class', 'ru'] = None, ref_length: bool = True, change: bool = True) → DataFrame

Build a somatic STR mutation count matrix from paired tumor–normal data.

This function converts per-locus STR mutation calls into a sample-by-feature count matrix. Feature definitions are controlled by repeat-unit length, repeat-unit content, reference length, and somatic change options.

Parameters:

mutations_data (pandas.DataFrame) –
Parsed STR mutation data, typically returned by parse_vcf_files().

Required columns include:
- sample
- normal_allele_a, normal_allele_b
- tumor_allele_a, tumor_allele_b
- motif or RU (repeat unit sequence)
- genotype_separator ('|', '/', or missing)
ru_length (bool, default True) – If True, include the repeat-unit length as LEN{len(motif)} in the feature key.
ru ({None, "class", "ru"}, default None) –
Controls how repeat-unit content is represented in the feature key.
- None : Do not include repeat-unit content.
- "ru" : Use the full repeat-unit sequence (e.g. A, AT, AAT).
- "class" : Use base-composition class of the repeat unit:
  - AT_only : motif contains only A/T
  - GC_only : motif contains only G/C
  - mixed : mixed A/T and G/C
ref_length (bool, default True) –
If True, include a reference-length component derived from the normal allele repeat counts.
- Phased genotypes: per-allele normal repeat count
- Unphased genotypes: combined normal repeat count
change (bool, default True) –
If True, include the tumor–normal repeat count change (delta) in the feature key and retain only non-zero changes (somatic events).

If False, ignore delta and retain all loci that pass basic numeric checks, producing presence/absence-style summaries.
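
How the four options shape a feature key can be sketched as follows. Only the LEN{len(motif)} component and the three class names come from the description above; the REF prefix, the signed delta, and the underscore separator are hypothetical label formats chosen for illustration, as are the helper names.

```python
def motif_class(motif):
    """Classify a repeat unit by base composition."""
    bases = set(motif.upper())
    if not bases:
        raise ValueError("motif must not be empty")
    if bases <= {"A", "T"}:
        return "AT_only"
    if bases <= {"G", "C"}:
        return "GC_only"
    return "mixed"


def feature_key(motif, ref_count, delta, *, ru_length=True, ru=None,
                ref_length=True, change=True):
    """Join the enabled components into one feature label."""
    parts = []
    if ru_length:
        parts.append(f"LEN{len(motif)}")
    if ru == "ru":
        parts.append(motif)
    elif ru == "class":
        parts.append(motif_class(motif))
    if ref_length:
        parts.append(f"REF{ref_count}")   # hypothetical label format
    if change:
        parts.append(f"{delta:+d}")       # hypothetical label format
    return "_".join(parts)
```

Disabling an option removes its component from the key, so loci that differ only in that component collapse into the same matrix column.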

Returns:

STR mutation count matrix with:

rows: samples
columns: STR mutation feature categories
values: counts of allele-level or combined STR mutation events

Return type:

pandas.DataFrame

Notes

Phasing behavior is determined by genotype_separator:

'|' : Genotypes are treated as phased, producing two allele-level events per locus.
'/' or missing : Genotypes are treated as unphased, producing a single combined event per locus based on total tumor vs. normal repeat counts.
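
The phasing rule can be sketched in a few lines. This assumes that phased alleles are paired in order (a with a, b with b) and that "combined" means the sum of both allele counts; neither detail is confirmed here, and the function names are hypothetical.

```python
def locus_events(normal, tumor, separator):
    """Return (reference_count, delta) events for one locus.

    normal, tumor: (allele_a, allele_b) repeat counts.
    separator: "|" for phased genotypes; "/" or None for unphased.
    """
    if separator == "|":
        # Phased: one event per allele, comparing matching alleles.
        return [(n, t - n) for n, t in zip(normal, tumor)]
    # Unphased: a single event from the total repeat counts.
    return [(sum(normal), sum(tumor) - sum(normal))]


def somatic_events(normal, tumor, separator):
    """Keep only non-zero changes, mirroring change=True."""
    return [(ref, delta)
            for ref, delta in locus_events(normal, tumor, separator)
            if delta != 0]
```

Under these assumptions, opposite changes on the two alleles of an unphased locus (for example 10,12 becoming 11,11) cancel in the total and produce no event.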

str_mut_signatures.filter_mutation_matrix(matrix: DataFrame, *, feature_method: Literal['manual', 'elbow', 'percentile'] = 'manual', min_feature_total: int | None = 10, min_samples_with_feature: int | None = 3, feature_percentile: float = 0.95, min_sample_total: int | None = 0) → tuple[DataFrame, FilterSummary]

Filter a mutation count matrix (samples × features) based on simple metrics.

Parameters:

matrix (pandas.DataFrame) – Mutation count matrix with samples as rows and mutation features as columns.
feature_method ({"manual", "elbow", "percentile"}, optional) –
Strategy for choosing a feature-level total-count threshold.
- "manual":
  Use min_feature_total directly.
- "elbow":
  Ignore min_feature_total and use an elbow heuristic based on the distribution of feature total counts.
- "percentile":
  Ignore min_feature_total and keep features whose total count is >= the given percentile of the distribution.
min_feature_total (int or None, optional) – Minimal total count across all samples for a feature to be kept (only used when feature_method="manual"). If None, no total-count threshold is applied.
min_samples_with_feature (int or None, optional) – Minimal number of samples in which a feature must be non-zero. If None, no prevalence threshold is applied.
feature_percentile (float, optional) – When feature_method="percentile", features with total_count >= the feature_percentile quantile of the distribution are kept. Must be between 0 and 1.
min_sample_total (int or None, optional) – Minimal total count per sample to be kept. If None, no sample-level filter is applied.

Returns:

filtered_matrix (pandas.DataFrame) – Matrix with filtered samples/features.
summary (FilterSummary) – Structured summary containing:
- feature_stats: DataFrame of per-feature metrics
- sample_stats: DataFrame of per-sample metrics
- feature_threshold_used: int or None
- sample_threshold_used: int or None

Raises:

ValueError – If feature_method is not one of "manual", "elbow", or "percentile".
ValueError – If feature_method="percentile" and feature_percentile is not in the interval [0, 1].
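
A rough sketch of the "percentile" strategy, using plain dictionaries in place of a DataFrame. The linear-interpolation quantile and the AND combination of the total-count and prevalence criteria are assumptions; quantile and select_features are hypothetical names.

```python
def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, with 0 <= q <= 1."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be in the interval [0, 1]")
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def select_features(matrix, feature_percentile=0.95, min_samples_with_feature=3):
    """matrix: dict of feature -> list of per-sample counts."""
    totals = {name: sum(counts) for name, counts in matrix.items()}
    threshold = quantile(list(totals.values()), feature_percentile)
    kept = []
    for name, counts in matrix.items():
        prevalence = sum(1 for count in counts if count > 0)
        # Assumption: a feature must satisfy both criteria to be kept.
        if totals[name] >= threshold and prevalence >= min_samples_with_feature:
            kept.append(name)
    return kept, threshold
```

A feature with a high total concentrated in a single sample passes the percentile check but fails the prevalence check, which is the case the second threshold is meant to catch.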

class str_mut_signatures.NMFResult(signatures: DataFrame, exposures: DataFrame, groups: DataFrame, model_params: dict[str, Any])

Bases: object

Container for NMF-based STR mutation signature decomposition.

signatures

Matrix of signature profiles.

index : features (same as input matrix.columns)
columns : signatures (Signature_1, Signature_2, …)

Type: pandas.DataFrame

exposures

Matrix of sample exposures to each signature.

index : samples (same as input matrix.index)
columns : signatures (Signature_1, Signature_2, …)

Type: pandas.DataFrame

groups

Sample-level grouping or annotation table aligned to exposures. Typically indexed by sample (same as input matrix.index).

Type: pandas.DataFrame

model_params

Hyperparameters and metadata used to fit the model (e.g. n_signatures, init, max_iter, random_state).

Type: dict[str, Any]

str_mut_signatures.run_nmf(matrix: DataFrame, n_signatures: int, init: str = 'nndsvd', max_iter: int = 200, random_state: int | None = 0, alpha_W: float = 0.0, alpha_H: float = 0.0, l1_ratio: float = 0.0, max_clusters: int = 1) → NMFResult

Run NMF decomposition on an STR mutation count matrix.

This function factorizes a non-negative mutation count matrix into:

signature profiles (feature-by-signature)
sample exposures (sample-by-signature)

Optionally, samples can be clustered based on their exposure profiles.

Parameters:

matrix (pandas.DataFrame) –
Non-negative count matrix.
- rows : samples
- columns : mutation feature categories
n_signatures (int) – Number of signatures (components) to extract.
init (str, optional) – Initialization method for NMF (passed to the underlying estimator). Default is "nndsvd".
max_iter (int, optional) – Maximum number of iterations. Default is 200.
random_state (int or None, optional) – Random seed for reproducibility. If None, the estimator is not seeded. Default is 0.
alpha_W (float, optional) – Regularization parameter for the W matrix (exposures), if supported by the chosen NMF implementation. Default is 0.0.
alpha_H (float, optional) – Regularization parameter for the H matrix (signatures), if supported by the chosen NMF implementation. Default is 0.0.
l1_ratio (float, optional) – The Elastic-Net mixing parameter, with 0 <= l1_ratio <= 1. Default is 0.0.
max_clusters (int, optional) – Maximum number of clusters to consider for optional exposure-based clustering. Values <= 1 disable clustering. Default is 1.

Returns:

Container with signature profiles, exposures, optional grouping information, and model parameters.

Return type:

NMFResult

Raises:

ValueError – If matrix is empty, contains non-numeric values, or contains negative entries.
ValueError – If n_signatures is not a positive integer.
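
To make the factorization concrete, here is a small pure-Python sketch using Lee-Seung multiplicative updates with random initialization. It is a swapped-in textbook method, not the estimator that run_nmf wraps: it has no nndsvd initialization and no regularization, and every function name is hypothetical. W plays the role of exposures (samples × K) and H the role of signatures (K × features).

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Frobenius norm of V - W H."""
    wh = matmul(w, h)
    return sum((x - y) ** 2
               for row_v, row_wh in zip(v, wh)
               for x, y in zip(row_v, row_wh)) ** 0.5


def nmf_multiplicative(v, n_signatures, max_iter=200, seed=0, eps=1e-12):
    """Factorize a non-negative matrix V (samples x features) as W H."""
    rng = random.Random(seed)
    n_samples, n_features = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(n_signatures)] for _ in range(n_samples)]
    h = [[rng.random() + eps for _ in range(n_features)] for _ in range(n_signatures)]
    for _ in range(max_iter):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_features)]
             for i in range(n_signatures)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_signatures)]
             for i in range(n_samples)]
    return w, h
```

Because every update multiplies by a non-negative ratio, W and H stay non-negative throughout, which is why the input matrix must be non-negative as well.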

str_mut_signatures.save_nmf_result(result: NMFResult, outdir: str | Path) → None

Save an NMFResult to a directory on disk.

This function writes the main components of the NMF decomposition to tabular and JSON files in the specified output directory.

The following files are created:

signatures.tsv:
Signature profiles (features × K).
exposures.tsv:
Sample exposures (samples × K).
metadata.json:
JSON file containing model_params together with basic shape information and a format_version field.

Parameters:

result (NMFResult) – Result object containing signatures, exposures, groups, and model parameters to be saved.
outdir (str or pathlib.Path) – Output directory where result files will be written.

Return type:

None

str_mut_signatures.load_nmf_result(outdir: str | Path) → NMFResult

Load an NMFResult previously saved with save_nmf_result().

This function reads the files created by save_nmf_result from the given directory and reconstructs the corresponding NMFResult object.

The following files are expected in outdir:

signatures.tsv:
Signature profiles (features × K).
exposures.tsv:
Sample exposures (samples × K).
metadata.json:
JSON file containing model parameters and basic shape information.

Parameters:

outdir (str or pathlib.Path) – Directory containing the saved NMF result files.

Returns:

Reconstructed NMF result with signatures, exposures, groups (if present), and model parameters.

Return type:

NMFResult

str_mut_signatures.project_onto_signatures(new_matrix: DataFrame, signatures: DataFrame, method: str = 'nnls') → DataFrame

Project new samples onto existing signatures to obtain exposures.

This function computes exposure weights for each new sample given a fixed set of signature profiles.

Parameters:

new_matrix (pandas.DataFrame) –
Matrix of new samples.
- rows : samples
- columns : features (must overlap signatures.index)
signatures (pandas.DataFrame) –
Signature profile matrix.
- index : features (same feature space as new_matrix.columns)
- columns : signatures (e.g., "Signature_1", "Signature_2", …)
method ({"nnls"}, optional) – Projection method. Currently only non-negative least squares ("nnls") is implemented.

Returns:

Exposure matrix for the new samples.

rows : samples (same as new_matrix.index)
columns : signatures (same as signatures.columns)

Return type:

pandas.DataFrame

Notes

For method="nnls", the exposures e (length K) for each sample vector x (length F) are obtained by solving:

minimize || x - A e ||_2   subject to e >= 0

where A is the feature-by-signature matrix (F × K).
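
The optimization problem above can be solved for a single sample with the following sketch. It uses projected gradient descent, a simple substitute chosen so the example needs no dependencies; it is not necessarily the solver the package calls, and nnls_projected_gradient is a hypothetical name.

```python
def nnls_projected_gradient(a, x, n_iter=2000):
    """Minimize ||x - A e||_2 subject to e >= 0.

    a: F x K matrix as a list of rows (must contain a non-zero entry).
    x: sample vector of length F.
    """
    n_features, n_signatures = len(a), len(a[0])
    # 1 / ||A||_F^2 is a safe step size: ||A||_F^2 bounds the largest
    # eigenvalue of A^T A, the Lipschitz constant of the gradient.
    step = 1.0 / sum(value * value for row in a for value in row)
    e = [0.0] * n_signatures
    for _ in range(n_iter):
        residual = [sum(a[f][k] * e[k] for k in range(n_signatures)) - x[f]
                    for f in range(n_features)]
        gradient = [sum(a[f][k] * residual[f] for f in range(n_features))
                    for k in range(n_signatures)]
        # Gradient step, then projection onto the constraint e >= 0.
        e = [max(0.0, e[k] - step * gradient[k]) for k in range(n_signatures)]
    return e
```

When the unconstrained least-squares solution would assign a negative weight to a signature, the projection pins that exposure at zero and the remaining exposures are refit around it.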

Raises:

ValueError – If method is not supported or if there is no overlap between new_matrix.columns and signatures.index.

str_mut_signatures.compute_pca(matrix: DataFrame, n_components: int = 2) → tuple[DataFrame, ndarray]

Compute PCA on a samples × features matrix (e.g. exposures).

Parameters:

matrix (pandas.DataFrame) –
Numeric matrix.
- rows : samples
- columns : features or signatures
n_components (int, default 2) – Number of principal components to compute.

Returns:

coords (pandas.DataFrame) –
PCA coordinates with:
- index : same as matrix.index
- columns : PC1, PC2, …, PC{n_components}
explained_variance_ratio_ (numpy.ndarray) – 1D array of length n_components with the fraction of variance explained by each component.

str_mut_signatures.plot_exposures(result: NMFResult, *, stacked: bool = True, figsize: tuple[float, float] | None = None, max_samples_per_fig: int | None = None, plot: Literal['both', 'absolute', 'proportion'] = 'both') → dict[str, list[Figure]]

Plot sample exposures from an NMFResult.

This function generates exposure visualizations while enforcing:

Sample order is identical across all panels/plots (absolute, proportion, and per-signature views use the same ordering rule).
Signature stacking/order is consistent (uses result.exposures.columns).

Expected inputs:

result.exposures: pandas.DataFrame with samples as index and signatures as columns.
result.groups: pandas.DataFrame with a "group" column (optional), used to determine sample ordering and/or grouping.

Parameters:

result (NMFResult) – Output of run_nmf() containing exposures and optional grouping information.
stacked (bool, optional) – If True, plot stacked bar charts. If False, plot grouped (side-by-side) bars where applicable. Default is True.
figsize (tuple[float, float] or None, optional) – Figure size passed to matplotlib. If None, a default size is chosen based on the number of samples and signatures.
max_samples_per_fig (int or None, optional) – Maximum number of samples to show per figure. If provided, samples are split across multiple figures. If None, all samples are plotted in a single figure (may be large).
plot ({"both", "absolute", "proportion"}, optional) –
Which exposure views to generate:
- "absolute": plot raw exposure values.
- "proportion": plot exposures normalized to sum to 1 per sample.
- "both": generate both absolute and proportional plots.
Default is "both".
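
The "proportion" view and the max_samples_per_fig split can be sketched without any plotting code. Both helpers are hypothetical, and leaving an all-zero sample at zero (rather than raising or producing NaN) is an assumption.

```python
def to_proportions(exposures):
    """exposures: dict of sample -> list of per-signature exposure values."""
    proportions = {}
    for sample, values in exposures.items():
        total = sum(values)
        # Assumption: an all-zero sample stays all-zero instead of dividing by zero.
        proportions[sample] = [v / total if total > 0 else 0.0 for v in values]
    return proportions


def chunk_samples(samples, max_samples_per_fig=None):
    """Split an ordered list of samples into one chunk per figure."""
    if max_samples_per_fig is None:
        return [list(samples)]
    return [list(samples[i:i + max_samples_per_fig])
            for i in range(0, len(samples), max_samples_per_fig)]
```

Chunking an already ordered list preserves the sample order across figures, in line with the ordering guarantee stated above.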

Returns:

Dictionary mapping plot type to a list of created figures. Keys depend on plot (e.g. "absolute", "proportion").

Return type:

dict[str, list[matplotlib.figure.Figure]]

str_mut_signatures.plot_pca_samples(result: NMFResult, *, matrix: DataFrame | None = None, n_components: int = 2, ax: Axes | None = None, title: str | None = None, alpha: float = 0.8, cmap: str | None = None, s: float = 30.0) → tuple[DataFrame, ndarray, Axes]

Run PCA on an NMF result (typically exposures) and plot PC1 vs PC2.

This is the main entry point for PCA visualization:

Extract a samples × features matrix from result (by default result.exposures).
Compute PCA coordinates.
Color samples by result.groups.
Plot the first two principal components.

Parameters:

result (NMFResult) – Output of run_nmf(). Must provide .exposures as a pandas.DataFrame unless matrix is explicitly provided.
matrix (pandas.DataFrame or None, optional) – Optional matrix to use instead of result.exposures. Must be samples × features. If None, uses result.exposures.
n_components (int, optional) – Number of principal components to compute. Must be >= 2. Default is 2.
ax (matplotlib.axes.Axes or None, optional) – Existing axes to plot on. If None, a new figure and axes are created.
title (str or None, optional) – Plot title. If None, a default title is generated.
alpha (float, optional) – Point transparency. Default is 0.8.
cmap (str or None, optional) –
Matplotlib colormap name used for coloring samples.
- If group is detected as categorical, the default is "tab20".
- If group is detected as continuous, the default is "viridis".
If provided explicitly, this colormap is used in both cases.
s (float, optional) – Point size. Default is 30.0.

Returns:

coords (pandas.DataFrame) – PCA coordinates for each sample (PC1, PC2, …), indexed by sample.
explained_variance_ratio_ (numpy.ndarray) – Fraction of variance explained by each principal component.
ax (matplotlib.axes.Axes) – Axes containing the PCA scatter plot.

Raises:

ValueError – If n_components < 2 or if the chosen matrix is empty or non-numeric.

str_mut_signatures.plot_signatures(result: NMFResult, top_n: int = 20, signatures: list[int] | list[str] | None = None, figsize: tuple[float, float] | None = None, sharey: bool = False) → Figure

Plot per-signature bar plots of feature loadings.

This function visualizes the strongest features for each selected signature as bar plots, using the signature profiles stored in result.signatures.

Parameters:

result (NMFResult) – Output of run_nmf() containing the signature matrix.
top_n (int, optional) – Number of top features (by absolute loading) to display per signature. Default is 20.
signatures (list[int] or list[str] or None, optional) –
Which signatures to plot.
- If None, all signatures are plotted.
- If list[int], values are interpreted as 1-based indices (1 .. K).
- If list[str], values must match column names in result.signatures.
figsize (tuple[float, float] or None, optional) – Figure size passed to matplotlib. If None, a default size is chosen based on the number of signatures.
sharey (bool, optional) – If True, all subplots share the same y-axis. Default is False.
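
The selection rules for signatures and top_n can be sketched as below. resolve_signatures and top_features are hypothetical helpers, and raising ValueError for an unknown index or name is an assumption, since the error behavior is not documented here.

```python
def resolve_signatures(columns, selection=None):
    """Map a selection of 1-based indices or names to signature column names."""
    if selection is None:
        return list(columns)
    resolved = []
    for item in selection:
        if isinstance(item, int):
            if not 1 <= item <= len(columns):
                raise ValueError(f"signature index {item} out of range")
            resolved.append(columns[item - 1])  # 1-based index -> 0-based position
        else:
            if item not in columns:
                raise ValueError(f"unknown signature {item!r}")
            resolved.append(item)
    return resolved


def top_features(loadings, top_n=20):
    """loadings: dict of feature -> loading; return top_n pairs by absolute loading."""
    ranked = sorted(loadings.items(), key=lambda pair: abs(pair[1]), reverse=True)
    return ranked[:top_n]
```

With this convention, signatures=[1] selects the first column (Signature_1 under the default naming), not the second.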

Returns:

The created matplotlib figure containing the signature bar plots.

Return type:

matplotlib.figure.Figure