Main functionality
Top-level package for str_mut_signatures.
Analysis of Short Tandem Repeat (STR) mutation signatures from VCF files. Provides both library and CLI interfaces.
- str_mut_signatures.parse_vcf_files(input_dir: str | Path, *, filter_by_pass: bool = True, filter_by_perfect: bool = True) -> DataFrame
Process all VCF (.vcf / .vcf.gz) files in a directory into a DataFrame.
Supports GangSTR and conSTRain STR-annotated VCFs, as well as VCFs annotated with strvcf_annotator.
If a file causes an error, it is skipped and a message is printed.
- Parameters:
input_dir (str or pathlib.Path) – Directory containing .vcf or .vcf.gz files.
filter_by_pass (bool, optional) – If True, keep only records with FILTER == "PASS".
filter_by_perfect (bool, optional) – If True, keep only records with INFO/PERFECT != "FALSE" when present.
- Returns:
Parsed STR records concatenated across all input files.
Columns:
sample, tmp_id, tumor_allele_a, tumor_allele_b, normal_allele_a, normal_allele_b, end, period, ref, motif, genotype_separator
- Return type:
pandas.DataFrame
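The directory scan and skip-on-error behaviour described above can be sketched with the standard library alone. This is only an illustration under stated assumptions, not the package's implementation: `find_vcf_files`, `collect_rows`, and the `parse_one` callback are hypothetical names, and the real function returns a pandas DataFrame rather than a list of rows.

```python
from pathlib import Path
from typing import Callable, Iterable, Union


def find_vcf_files(input_dir: Union[str, Path]) -> list:
    """Return the .vcf and .vcf.gz files directly inside input_dir, sorted by name."""
    directory = Path(input_dir)
    return sorted(
        p for p in directory.iterdir()
        if p.is_file() and (p.name.endswith(".vcf") or p.name.endswith(".vcf.gz"))
    )


def collect_rows(input_dir: Union[str, Path],
                 parse_one: Callable[[Path], Iterable[dict]]) -> list:
    """Parse every VCF in the directory; skip files whose parsing raises."""
    rows = []
    for path in find_vcf_files(input_dir):
        try:
            rows.extend(parse_one(path))  # parse_one is a hypothetical per-file parser
        except Exception as exc:
            # Mirror the documented behaviour: report the problem and move on.
            print(f"Skipping {path.name}: {exc}")
    return rows
```

In the package itself, the per-file step corresponds to process_vcf_to_rows(), and the collected rows are concatenated into one DataFrame.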
- str_mut_signatures.save_counts_matrix(mutations_data: DataFrame, output_csv: str | Path)
Save a mutation counts matrix to a CSV file.
- Parameters:
mutations_data (pandas.DataFrame) – DataFrame containing mutation count data to be written to disk.
output_csv (str or pathlib.Path) – Path to the output CSV file.
- Return type:
None
- str_mut_signatures.process_vcf_to_rows(path: str | Path, *, filter_by_pass: bool = True, filter_by_perfect: bool = True)
Parse a single STR-annotated VCF into row dictionaries.
Supports:
GangSTR: uses FORMAT/REPCN as copy number
conSTRain: uses FORMAT/REPLEN as copy number
VCF annotated with strvcf_annotator (INFO/RU, INFO/REF, FORMAT/REPCN)
Filtering options
- filter_by_pass
If True (default), keep only records with FILTER == "PASS". If False, ignore the FILTER field.
- filter_by_perfect
If True (default), and INFO/PERFECT is present, keep only records where PERFECT != "FALSE" (i.e. skip variants where PERFECT == "FALSE"). If False, ignore the PERFECT flag completely.
Assumptions after validation
First sample column after FORMAT is NORMAL (index 9 in standard VCF).
Second sample column after FORMAT is TUMOR (index 10).
STR annotations are present in INFO/FORMAT.
- Parameters:
path (str or pathlib.Path) – Path to the STR-annotated VCF file.
filter_by_pass (bool, optional) – Whether to keep only records with FILTER == "PASS".
filter_by_perfect (bool, optional) – Whether to filter by INFO/PERFECT when present.
- Returns:
List of dictionaries, one per parsed STR record.
- Return type:
list[dict]
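The two filtering options can be illustrated on a single tab-separated VCF data line. The helper below is a hypothetical sketch of the documented rules only (`keep_record` is not part of the package), and it assumes the standard VCF column layout in which FILTER is column 6 and INFO is column 7, counting from 0.

```python
def keep_record(vcf_line: str, filter_by_pass: bool = True,
                filter_by_perfect: bool = True) -> bool:
    """Decide whether one tab-separated VCF data line passes both filters."""
    fields = vcf_line.rstrip("\n").split("\t")
    filter_value, info = fields[6], fields[7]  # standard FILTER and INFO columns

    if filter_by_pass and filter_value != "PASS":
        return False

    if filter_by_perfect:
        # INFO is a ';'-separated list of KEY=VALUE pairs or bare flags.
        info_map = dict(
            item.split("=", 1) if "=" in item else (item, "")
            for item in info.split(";")
        )
        # Only an explicit PERFECT=FALSE rejects the record; a missing key passes.
        if info_map.get("PERFECT") == "FALSE":
            return False
    return True
```

Under the documented assumptions, the NORMAL and TUMOR sample fields of the same line would then be read from `fields[9]` and `fields[10]`.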
- str_mut_signatures.build_mutation_matrix(mutations_data: DataFrame, *, ru_length: bool = True, ru: Literal[None, 'class', 'ru'] = None, ref_length: bool = True, change: bool = True) -> DataFrame
Build a somatic STR mutation count matrix from paired tumor–normal data.
This function converts per-locus STR mutation calls into a sample-by-feature count matrix. Feature definitions are controlled by repeat-unit length, repeat-unit content, reference length, and somatic change options.
- Parameters:
mutations_data (pandas.DataFrame) –
Parsed STR mutation data, typically returned by parse_vcf_files().
Required columns include:
sample
normal_allele_a, normal_allele_b
tumor_allele_a, tumor_allele_b
motif or RU (repeat unit sequence)
genotype_separator ('|', '/', or missing)
ru_length (bool, default True) – If True, include the repeat-unit length as LEN{len(motif)} in the feature key.
ru ({None, "class", "ru"}, default None) –
Controls how repeat-unit content is represented in the feature key.
None: Do not include repeat-unit content.
"ru": Use the full repeat-unit sequence (e.g. A, AT, AAT).
"class": Use base-composition class of the repeat unit:
AT_only: motif contains only A/T
GC_only: motif contains only G/C
mixed: mixed A/T and G/C
ref_length (bool, default True) –
If True, include a reference-length component derived from the normal allele repeat counts.
Phased genotypes: per-allele normal repeat count
Unphased genotypes: combined normal repeat count
change (bool, default True) –
If True, include the tumor–normal repeat count change (delta) in the feature key and retain only non-zero changes (somatic events).
If False, ignore delta and retain all loci that pass basic numeric checks, producing presence/absence-style summaries.
- Returns:
STR mutation count matrix with:
rows: samples
columns: STR mutation feature categories
values: counts of allele-level or combined STR mutation events
- Return type:
pandas.DataFrame
Notes
Phasing behavior is determined by genotype_separator:
'|': Genotypes are treated as phased, producing two allele-level events per locus.
'/' or missing: Genotypes are treated as unphased, producing a single combined event per locus based on total tumor vs. normal repeat counts.
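The phasing rule and the feature-key options can be sketched in plain Python. Everything below is illustrative: the helper names are hypothetical, the "combined" count is assumed to be the sum of both alleles, and the key layout (underscore-joined parts, a `REF` prefix, a signed delta) is an assumption. Only the `LEN{len(motif)}` component and the class names `AT_only`, `GC_only`, and `mixed` come from the description above.

```python
from collections import Counter


def motif_class(motif: str) -> str:
    """Classify a repeat unit by base composition."""
    bases = set(motif.upper())
    if bases <= {"A", "T"}:
        return "AT_only"
    if bases <= {"G", "C"}:
        return "GC_only"
    return "mixed"


def locus_events(normal_a, normal_b, tumor_a, tumor_b, separator):
    """Return (reference_length, delta) pairs for one locus."""
    if separator == "|":  # phased: one event per allele
        return [(normal_a, tumor_a - normal_a), (normal_b, tumor_b - normal_b)]
    # '/' or missing: one combined event (assumed here to be the allele sum)
    normal_total = normal_a + normal_b
    return [(normal_total, tumor_a + tumor_b - normal_total)]


def feature_key(motif, ref_len, delta, *, ru_length=True, ru=None,
                ref_length=True, change=True):
    """Build a feature label; the exact format is hypothetical."""
    parts = []
    if ru_length:
        parts.append(f"LEN{len(motif)}")
    if ru == "ru":
        parts.append(motif)
    elif ru == "class":
        parts.append(motif_class(motif))
    if ref_length:
        parts.append(f"REF{ref_len}")
    if change:
        parts.append(f"{delta:+d}")  # integer repeat-count change, with sign
    return "_".join(parts)


# One phased locus: allele a gains one repeat, allele b is unchanged.
counts = Counter()
for ref_len, delta in locus_events(10, 12, 11, 12, "|"):
    if delta != 0:  # change=True keeps only non-zero (somatic) events
        counts[feature_key("AT", ref_len, delta, ru="class")] += 1
```

Repeating the loop over every locus of every sample, with one counter per sample, yields the rows of a sample-by-feature count matrix.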
- str_mut_signatures.filter_mutation_matrix(matrix: DataFrame, *, feature_method: Literal['manual', 'elbow', 'percentile'] = 'manual', min_feature_total: int | None = 10, min_samples_with_feature: int | None = 3, feature_percentile: float = 0.95, min_sample_total: int | None = 0) -> tuple[DataFrame, FilterSummary]
Filter a mutation count matrix (samples × features) based on simple metrics.
- Parameters:
matrix (pandas.DataFrame) – Mutation count matrix with samples as rows and mutation features as columns.
feature_method ({"manual", "elbow", "percentile"}, optional) –
Strategy for choosing a feature-level total-count threshold.
"manual": Use min_feature_total directly.
"elbow": Ignore min_feature_total and use an elbow heuristic based on the distribution of feature total counts.
"percentile": Ignore min_feature_total and keep features whose total count is >= the given percentile of the distribution.
min_feature_total (int or None, optional) – Minimum total count across all samples for a feature to be kept (only used when feature_method="manual"). If None, no total-count threshold is applied.
min_samples_with_feature (int or None, optional) – Minimum number of samples in which a feature must be non-zero. If None, no prevalence threshold is applied.
feature_percentile (float, optional) – When feature_method="percentile", features with total_count >= the feature_percentile quantile of the distribution are kept. Must be between 0 and 1.
min_sample_total (int or None, optional) – Minimum total count a sample must have to be kept. If None, no sample-level filter is applied.
- Returns:
filtered_matrix (pandas.DataFrame) – Matrix with filtered samples/features.
summary (FilterSummary) – Structured summary containing:
feature_stats: DataFrame of per-feature metrics
sample_stats: DataFrame of per-sample metrics
feature_threshold_used: int or None
sample_threshold_used: int or None
- Raises:
ValueError – If feature_method is not one of "manual", "elbow", or "percentile".
ValueError – If feature_method="percentile" and feature_percentile is not in the interval [0, 1].
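The "manual" strategy amounts to three threshold checks. The sketch below works on a plain `{sample: {feature: count}}` mapping instead of a DataFrame and uses a hypothetical helper name. It assumes that thresholds are inclusive (`>=`) and that sample totals are computed after feature filtering; the documentation above does not confirm either detail.

```python
def filter_counts(matrix, min_feature_total=10, min_samples_with_feature=3,
                  min_sample_total=0):
    """Filter a {sample: {feature: count}} mapping; None disables a threshold."""
    features = sorted({f for row in matrix.values() for f in row})

    kept_features = []
    for feature in features:
        column = [row.get(feature, 0) for row in matrix.values()]
        total = sum(column)                            # total count across samples
        prevalence = sum(1 for v in column if v > 0)   # samples with a non-zero count
        if min_feature_total is not None and total < min_feature_total:
            continue
        if min_samples_with_feature is not None and prevalence < min_samples_with_feature:
            continue
        kept_features.append(feature)

    filtered = {}
    for sample, row in matrix.items():
        kept_row = {f: row.get(f, 0) for f in kept_features}
        # Assumption: the sample total is taken over the surviving features only.
        if min_sample_total is None or sum(kept_row.values()) >= min_sample_total:
            filtered[sample] = kept_row
    return filtered
```

The "percentile" and "elbow" strategies would differ only in how the feature-total threshold is chosen before the same checks are applied.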
- class str_mut_signatures.NMFResult(signatures: DataFrame, exposures: DataFrame, groups: DataFrame, model_params: dict[str, Any])
Bases: object
Container for NMF-based STR mutation signature decomposition.
- signatures
Matrix of signature profiles.
index: features (same as input matrix.columns)
columns: signatures (Signature_1, Signature_2, …)
- Type:
pandas.DataFrame
- exposures
Matrix of sample exposures to each signature.
index: samples (same as input matrix.index)
columns: signatures (Signature_1, Signature_2, …)
- Type:
pandas.DataFrame
- groups
Sample-level grouping or annotation table aligned to exposures. Typically indexed by sample (same as input matrix.index).
- Type:
pandas.DataFrame
- model_params
Hyperparameters and metadata used to fit the model (e.g. n_signatures, init, max_iter, random_state).
- Type:
dict[str, Any]
- str_mut_signatures.run_nmf(matrix: DataFrame, n_signatures: int, init: str = 'nndsvd', max_iter: int = 200, random_state: int | None = 0, alpha_W: float = 0.0, alpha_H: float = 0.0, l1_ratio: float = 0.0, max_clusters: int = 1) -> NMFResult
Run NMF decomposition on an STR mutation count matrix.
This function factorizes a non-negative mutation count matrix into:
signature profiles (feature-by-signature)
sample exposures (sample-by-signature)
Optionally, samples can be clustered based on their exposure profiles.
- Parameters:
matrix (pandas.DataFrame) –
Non-negative count matrix.
rows : samples
columns : mutation feature categories
n_signatures (int) – Number of signatures (components) to extract.
init (str, optional) – Initialization method for NMF (passed to the underlying estimator). Default is "nndsvd".
max_iter (int, optional) – Maximum number of iterations. Default is 200.
random_state (int or None, optional) – Random seed for reproducibility. If None, the estimator is not seeded. Default is 0.
alpha_W (float, optional) – Regularization parameter for the W matrix (exposures), if supported by the chosen NMF implementation. Default is 0.0.
alpha_H (float, optional) – Regularization parameter for the H matrix (signatures), if supported by the chosen NMF implementation. Default is 0.0.
l1_ratio (float, optional) – The Elastic-Net mixing parameter, with 0 <= l1_ratio <= 1. Default is 0.0.
max_clusters (int, optional) – Maximum number of clusters to consider for optional exposure-based clustering. Values <= 1 disable clustering. Default is 1.
- Returns:
Container with signature profiles, exposures, optional grouping information, and model parameters.
- Return type:
NMFResult
- Raises:
ValueError – If matrix is empty, contains non-numeric values, or contains negative entries.
ValueError – If n_signatures is not a positive integer.
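To make the factorization concrete, here is a dependency-free sketch of NMF with the classic multiplicative update rules, which approximates a samples × features matrix V as W·H. It is not the estimator the package uses: it swaps in random initialization instead of "nndsvd", ignores regularization and clustering, and all function names are hypothetical. W plays the role of the exposures (samples × K), and H holds one signature per row (K × features), the transpose of the features × signatures layout described for NMFResult.signatures.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def nmf_multiplicative(v, n_signatures, max_iter=200, random_state=0, eps=1e-9):
    """Factorize v (samples x features) into w (samples x K) and h (K x features)."""
    rng = random.Random(random_state)
    n_samples, n_features = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(n_signatures)] for _ in range(n_samples)]
    h = [[rng.random() + eps for _ in range(n_features)] for _ in range(n_signatures)]
    for _ in range(max_iter):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_features)]
             for k in range(n_signatures)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_signatures)]
             for i in range(n_samples)]
    return w, h


def frobenius_error(v, w, h):
    """Frobenius norm of the reconstruction residual v - w h."""
    approx = matmul(w, h)
    return sum((x - y) ** 2 for rv, ra in zip(v, approx) for x, y in zip(rv, ra)) ** 0.5
```

Because the updates only multiply by non-negative ratios, W and H stay non-negative throughout, which is why the input matrix must not contain negative entries.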
- str_mut_signatures.save_nmf_result(result: NMFResult, outdir: str | Path) -> None
Save an NMFResult to a directory on disk.
This function writes the main components of the NMF decomposition to tabular and JSON files in the specified output directory.
The following files are created:
signatures.tsv: Signature profiles (features × K).
exposures.tsv: Sample exposures (samples × K).
metadata.json: JSON file containing model_params together with basic shape information and a format_version field.
- Parameters:
result (NMFResult) – Result object containing signatures, exposures, groups, and model parameters to be saved.
outdir (str or pathlib.Path) – Output directory where result files will be written.
- Return type:
None
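The on-disk layout can be sketched with the standard csv and json modules. The three file names follow the list above, but everything else is an assumption made for illustration: the helper names, the TSV header convention, the metadata key names other than `model_params` and `format_version`, and the value written for `format_version`.

```python
import csv
import json
from pathlib import Path


def write_table(path, row_labels, col_labels, values):
    """Write a labelled matrix as a tab-separated file with a header row."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow([""] + list(col_labels))
        for label, row in zip(row_labels, values):
            writer.writerow([label] + list(row))


def save_result_sketch(outdir, features, samples, signature_names,
                       signatures, exposures, model_params):
    """Write signatures.tsv, exposures.tsv, and metadata.json into outdir."""
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    write_table(out / "signatures.tsv", features, signature_names, signatures)
    write_table(out / "exposures.tsv", samples, signature_names, exposures)
    metadata = {
        "format_version": 1,            # assumed value
        "model_params": model_params,
        "n_features": len(features),    # hypothetical key names for shape info
        "n_samples": len(samples),
        "n_signatures": len(signature_names),
    }
    (out / "metadata.json").write_text(json.dumps(metadata, indent=2))
```

Keeping the parameters in a separate JSON file means the two tables stay readable by any TSV-aware tool.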
- str_mut_signatures.load_nmf_result(outdir: str | Path) -> NMFResult
Load an NMFResult previously saved with save_nmf_result().
This function reads the files created by save_nmf_result from the given directory and reconstructs the corresponding NMFResult object.
The following files are expected in outdir:
signatures.tsv: Signature profiles (features × K).
exposures.tsv: Sample exposures (samples × K).
metadata.json: JSON file containing model parameters and basic shape information.
- Parameters:
outdir (str or pathlib.Path) – Directory containing the saved NMF result files.
- Returns:
Reconstructed NMF result with signatures, exposures, groups (if present), and model parameters.
- Return type:
NMFResult
- str_mut_signatures.project_onto_signatures(new_matrix: DataFrame, signatures: DataFrame, method: str = 'nnls') -> DataFrame
Project new samples onto existing signatures to obtain exposures.
This function computes exposure weights for each new sample given a fixed set of signature profiles.
- Parameters:
new_matrix (pandas.DataFrame) –
Matrix of new samples.
rows: samples
columns: features (must overlap signatures.index)
signatures (pandas.DataFrame) –
Signature profile matrix.
index: features (same feature space as new_matrix.columns)
columns: signatures (e.g., "Signature_1", "Signature_2", …)
method ({"nnls"}, optional) – Projection method. Currently only non-negative least squares ("nnls") is implemented.
- Returns:
Exposure matrix for the new samples.
rows: samples (same as new_matrix.index)
columns: signatures (same as signatures.columns)
- Return type:
pandas.DataFrame
Notes
For method="nnls", for each sample vector x (1 × F) the exposures e are obtained by solving:
minimize || x - A e ||_2 subject to e >= 0
where A is the feature-by-signature matrix (F × K).
- Raises:
ValueError – If method is not supported or if there is no overlap between new_matrix.columns and signatures.index.
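The non-negative least-squares problem in the Notes can be approximated with a few lines of projected gradient descent. This sketch swaps in that simple iterative method for illustration and is not claimed to be what the package does; an exact solver such as scipy.optimize.nnls would normally be preferred. The function name is hypothetical, and A is assumed to contain at least one non-zero entry.

```python
def nnls_projected_gradient(a, x, n_iter=2000):
    """Approximately solve: minimize ||x - A e||_2 subject to e >= 0.

    a: F x K list of rows (features x signatures); x: list of F counts.
    """
    n_features, n_signatures = len(a), len(a[0])
    # 1 / ||A||_F^2 is a safe step size because ||A||_F^2 >= lambda_max(A^T A).
    step = 1.0 / sum(v * v for row in a for v in row)
    e = [0.0] * n_signatures
    for _ in range(n_iter):
        residual = [sum(a[f][k] * e[k] for k in range(n_signatures)) - x[f]
                    for f in range(n_features)]
        gradient = [sum(a[f][k] * residual[f] for f in range(n_features))
                    for k in range(n_signatures)]
        # Gradient step followed by projection onto the non-negative orthant.
        e = [max(0.0, e[k] - step * gradient[k]) for k in range(n_signatures)]
    return e
```

Applying the function to each row of the new sample matrix, after aligning its columns to the signature features, gives one exposure vector per sample.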
- str_mut_signatures.compute_pca(matrix: DataFrame, n_components: int = 2) -> tuple[DataFrame, ndarray]
Compute PCA on a samples × features matrix (e.g. exposures).
- Parameters:
matrix (pandas.DataFrame) – Numeric matrix.
rows: samples
columns: features or signatures
n_components (int, default 2) – Number of principal components to compute.
- Returns:
coords (pandas.DataFrame) – PCA coordinates.
index: same as matrix.index
columns: PC1, PC2, …, PC{n_components}
explained_variance_ratio_ (numpy.ndarray) – 1D array of length n_components with the fraction of variance explained by each component.
- str_mut_signatures.plot_exposures(result: NMFResult, *, stacked: bool = True, figsize: tuple[float, float] | None = None, max_samples_per_fig: int | None = None, plot: Literal['both', 'absolute', 'proportion'] = 'both') -> dict[str, list[Figure]]
Plot sample exposures from an NMFResult.
This function generates exposure visualizations while enforcing:
Sample order is identical across all panels/plots (absolute, proportion, and per-signature views use the same ordering rule).
Signature stacking/order is consistent (uses result.exposures.columns).
Expected inputs:
result.exposures: pandas.DataFrame with samples as index and signatures as columns.
result.groups: pandas.DataFrame with a "group" column (optional), used to determine sample ordering and/or grouping.
- Parameters:
result (NMFResult) – Output of run_nmf() containing exposures and optional grouping information.
stacked (bool, optional) – If True, plot stacked bar charts. If False, plot grouped (side-by-side) bars where applicable. Default is True.
figsize (tuple[float, float] or None, optional) – Figure size passed to matplotlib. If None, a default size is chosen based on the number of samples and signatures.
max_samples_per_fig (int or None, optional) – Maximum number of samples to show per figure. If provided, samples are split across multiple figures. If None, all samples are plotted in a single figure (may be large).
plot ({"both", "absolute", "proportion"}, optional) –
Which exposure views to generate:
"absolute": plot raw exposure values.
"proportion": plot exposures normalized to sum to 1 per sample.
"both": generate both absolute and proportional plots.
Default is "both".
- Returns:
Dictionary mapping plot type to a list of created figures. Keys depend on plot (e.g. "absolute", "proportion").
- Return type:
dict[str, list[matplotlib.figure.Figure]]
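Two of the behaviours described above are simple data transformations: the "proportion" view normalizes each sample's exposures to sum to 1, and max_samples_per_fig splits the ordered samples across figures. The sketch below shows both on plain Python containers. The helper names are hypothetical, and the handling of samples with zero total exposure is an assumption.

```python
def to_proportions(exposures):
    """Normalize each sample's {signature: value} row so that it sums to 1."""
    result = {}
    for sample, row in exposures.items():
        total = sum(row.values())
        # Assumption: a sample with zero total exposure is left as all zeros.
        result[sample] = {sig: (value / total if total else 0.0)
                          for sig, value in row.items()}
    return result


def chunk_samples(samples, max_samples_per_fig=None):
    """Split an ordered sample list into per-figure chunks, preserving order."""
    samples = list(samples)
    if max_samples_per_fig is None:
        return [samples]  # a single figure holds every sample
    return [samples[i:i + max_samples_per_fig]
            for i in range(0, len(samples), max_samples_per_fig)]
```

Chunking a single ordered list, rather than ordering each view separately, is one way to keep the sample order identical across the absolute and proportional plots.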
- str_mut_signatures.plot_pca_samples(result: NMFResult, *, matrix: DataFrame | None = None, n_components: int = 2, ax: Axes | None = None, title: str | None = None, alpha: float = 0.8, cmap: str | None = None, s: float = 30.0) -> tuple[DataFrame, ndarray, Axes]
Run PCA on an NMF result (typically exposures) and plot PC1 vs PC2.
This is the main entry point for PCA visualization:
Extract a samples × features matrix from result (by default result.exposures).
Compute PCA coordinates.
Color samples by result.groups.
Plot the first two principal components.
- Parameters:
result (NMFResult) – Output of run_nmf(). Must provide .exposures as a pandas.DataFrame unless matrix is explicitly provided.
matrix (pandas.DataFrame or None, optional) – Optional matrix to use instead of result.exposures. Must be samples × features. If None, uses result.exposures.
n_components (int, optional) – Number of principal components to compute. Must be >= 2. Default is 2.
ax (matplotlib.axes.Axes or None, optional) – Existing axes to plot on. If None, a new figure and axes are created.
title (str or None, optional) – Plot title. If None, a default title is generated.
alpha (float, optional) – Point transparency. Default is 0.8.
cmap (str or None, optional) –
Matplotlib colormap name used for coloring samples.
If group is detected as categorical, the default is "tab20".
If group is detected as continuous, the default is "viridis".
If provided explicitly, this colormap is used in both cases.
s (float, optional) – Point size. Default is 30.0.
- Returns:
coords (pandas.DataFrame) – PCA coordinates for each sample (PC1, PC2, …), indexed by sample.
explained_variance_ratio_ (numpy.ndarray) – Fraction of variance explained by each principal component.
ax (matplotlib.axes.Axes) – Axes containing the PCA scatter plot.
- Raises:
ValueError – If n_components < 2 or if the chosen matrix is empty or non-numeric.
- str_mut_signatures.plot_signatures(result: NMFResult, top_n: int = 20, signatures: list[int] | list[str] | None = None, figsize: tuple[float, float] | None = None, sharey: bool = False) -> Figure
Plot per-signature bar plots of feature loadings.
This function visualizes the strongest features for each selected signature as bar plots, using the signature profiles stored in result.signatures.
- Parameters:
result (NMFResult) – Output of run_nmf() containing the signature matrix.
top_n (int, optional) – Number of top features (by absolute loading) to display per signature. Default is 20.
signatures (list[int] or list[str] or None, optional) –
Which signatures to plot.
If None, all signatures are plotted.
If list[int], values are interpreted as 1-based indices (1 .. K).
If list[str], values must match column names in result.signatures.
figsize (tuple[float, float] or None, optional) – Figure size passed to matplotlib. If None, a default size is chosen based on the number of signatures.
sharey (bool, optional) – If True, all subplots share the same y-axis. Default is False.
- Returns:
The created matplotlib figure containing the signature bar plots.
- Return type:
matplotlib.figure.Figure
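The selection rules for signatures and top_n can be illustrated without any plotting. Both helpers below are hypothetical names, and raising ValueError for an out-of-range index or an unknown name is an assumption, since the entry above does not document error handling.

```python
def resolve_signatures(columns, signatures=None):
    """Map a user selection (None, 1-based ints, or names) to column names."""
    columns = list(columns)
    if signatures is None:
        return columns
    resolved = []
    for item in signatures:
        if isinstance(item, int):
            if not 1 <= item <= len(columns):
                raise ValueError(f"signature index {item} outside 1..{len(columns)}")
            resolved.append(columns[item - 1])  # convert the 1-based index
        else:
            if item not in columns:
                raise ValueError(f"unknown signature {item!r}")
            resolved.append(item)
    return resolved


def top_features(loadings, top_n=20):
    """Return the top_n (feature, loading) pairs ranked by absolute loading."""
    ranked = sorted(loadings.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:top_n]
```

Each resolved signature would then get its own bar plot of the pairs returned by `top_features`.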