Searching for particular repeats¶
Overview¶
The search package provides python modules and command line tools for identifying
tandem repeats in sequence databases. Typically processing consists of first
running search_hmm
to identify instances of a query repeat
in a sequence database. This is relatively slow (hours to days
against Uniprot) but can be trivially parallelized to run on a cluster.
After scoring the hmm against every sequence in the database, filter_hmm
is
used to identify significant results.
search_hmm¶
The search module can be run directly:
python -m tral.search.search_hmm -h
The query HMM should be in HMMR format. The hmmbuild
tool may be helpful for
generating HMM files from multiple alignments. The HMM should represent a single
repeat. Circularly permuted edges are automatically inserted to permit multiple
sequential repeats.
The database should be in FASTA format. It may be gzip compressed.
Results are produced in TSV format, and include
Sequence identifier (header line from FASTA file)
Viterbi probability (probability that this sequence was emitted by the query HMM)
Log-odds ratio (viterbi probability divided by the expected probability for a sequence of this length
HMM states (Maximum likelihood sequence of states through the HMM, as determined by the Viiterbi algorithm
Example:
python -m tral.search.search_hmm query.hmm database.faa.gz results.tsv
When running on a cluster, it may be useful to search just a subset of the
database. For instance, here’s a simplified submission script using the -n
and -s
parameters in an array job:
#!/bin/bash
#BSUB -J tral[1-10]
n=1000
start=$(( (${LSB_JOBINDEX:-1}-1) * $n ))
python -m tral.search.tral_search \
-n $n -s $start \
query.hmm database.fasta.gz results.${LSB_JOBINDEX:-1}.tsv
Choosing an appropriate log-odds ratio depends on the database size, amino
acid content, HMM length, and HMM composition. To help calibrate this, the -r
option is available to automatically shuffle the input sequences. The results
then represent all negative hits, allowing the False Discovery Rate to be
estimated for various thresholds in the filtering step.
Full usage¶
usage: search_hmm.py [-h] [-v] [-s START] [-n DBSIZE] [-r]
hmm database results
Align a cpHMM against a database using TRAL
positional arguments:
hmm cpHMM filename
database database to search, in fasta format (may be gzip
compressed)
results results file
optional arguments:
-h, --help show this help message and exit
-v, --verbose Long messages
-s START, --start START
Index of database sequence to start with (default: 0)
-n DBSIZE, --dbsize DBSIZE
Number of database sequences to include (default: 0 to
include all)
-r, --shuffle Shuffle database sequences
filter_hmm¶
The results from search_hmm
include both significant and insignificant hits.
Use filter_hmm
to filter the results based on length and log-odds criteria.
python -m tral.search.filter_hmm -h
Output. The tool can produce three kinds of output:
-f
FASTA: Subset of the input database corresponding to significant hits-o
TSV: Subset of thesearch_hmm
results corresponding to significant hits-x
TREKS: All significant repeats in T-REKS format.
The TREKS format is most useful for exporting results to other programs.
Identifiers. By default, the identifier is extracted from the database
header lines (e.g. >tr|id|description...
). For compatibility with other
programs, the --preserve-header
option can also be used to match sequences
based on the full header.
Thresholds. Thresholds can be set for either:
-r
Number of repeats (float). Repeats are computed based on the HMM match states. This means that partial repeats are counted as a fraction of the query length.-t
Log-odds threshold (float). Results with log-odds less than this value are filtered out.
Memory usage. Memory usage should be low if a single output is used. If
several output options are selected then the significant hits are stored in
memory. If this is problematic, filter_hmm
can be run separately for
each desired output.
Example:
python -m tral.search.filter_hmm \
-f filtered.faa \
-o filtered.tsv \
-x filtered.tdr \
-r 2 -t 8.0 \
results.tsv database.faa.gz
Full usage¶
usage: filter_hmm.py [-h] [-f FILTERED_FASTA] [-o FILTERED_TSV]
[-x FILTERED_TREKS] [-r MIN_REPEATS] [-t LOG_ODDS]
[--preserve-header] [-v]
hits database
Filter hits from search_hmm
positional arguments:
hits TSV file, as produced by search_hmm, containing HMM
hits
database database to filter, in fasta format (may be gzip
compressed)
optional arguments:
-h, --help show this help message and exit
--preserve-header Include the full header in FASTA output. Otherwise,
just the identifier is used to match TSV and TREKS.
-v, --verbose Long messages
outputs:
-f FILTERED_FASTA, --filtered-fasta FILTERED_FASTA
output fasta file, filtered by hits
-o FILTERED_TSV, --filtered-tsv FILTERED_TSV
Filtered TSV file
-x FILTERED_TREKS, --filtered-treks FILTERED_TREKS
Filtered T-Reks file
Thresholds:
-r MIN_REPEATS, --min-repeats MIN_REPEATS
Minimum number of repeats
-t LOG_ODDS, --log-odds LOG_ODDS
Threshold for minimum log-odds ratio
Python API¶
See also the Search code documentation