.. _search_hmm:

Searching for particular repeats
********************************

Overview
========

The search package provides python modules and command line tools for identifying
tandem repeats in sequence databases. Typically processing consists of first
running ``search_hmm`` to identify instances of a query repeat
in a sequence database. This is relatively slow (hours to days
against Uniprot) but can be trivially parallelized to run on a cluster.

After scoring the hmm against every sequence in the database, ``filter_hmm`` is
used to identify significant results.

search_hmm
==========

The search module can be run directly:
::

    python -m tral.search.search_hmm -h

The query HMM should be in HMMR format. The ``hmmbuild`` tool may be helpful for
generating HMM files from multiple alignments. The HMM should represent a single
repeat. Circularly permuted edges are automatically inserted to permit multiple
sequential repeats.

The database should be in FASTA format. It may be gzip compressed.

Results are produced in TSV format, and include

* Sequence identifier (header line from FASTA file)
* Viterbi probability (probability that this sequence was emitted by the query
  HMM)
* Log-odds ratio (viterbi probability divided by the expected probability for
  a sequence of this length
* HMM states (Maximum likelihood sequence of states through the HMM, as
  determined by the Viiterbi algorithm

Example:
::

    python -m tral.search.search_hmm query.hmm database.faa.gz results.tsv

When running on a cluster, it may be useful to search just a subset of the
database. For instance, here's a simplified submission script using the ``-n``
and ``-s`` parameters in an array job:
::

    #!/bin/bash
    #BSUB -J tral[1-10]
    n=1000
    start=$(( (${LSB_JOBINDEX:-1}-1) * $n ))
    python -m tral.search.tral_search \
        -n $n -s $start \
        query.hmm database.fasta.gz results.${LSB_JOBINDEX:-1}.tsv

Choosing an appropriate log-odds ratio depends on the database size, amino
acid content, HMM length, and HMM composition. To help calibrate this, the ``-r``
option is available to automatically shuffle the input sequences. The results
then represent all negative hits, allowing the False Discovery Rate to be
estimated for various thresholds in the filtering step.

Full usage
----------
::

    usage: search_hmm.py [-h] [-v] [-s START] [-n DBSIZE] [-r]
                        hmm database results

    Align a cpHMM against a database using TRAL

    positional arguments:
    hmm                   cpHMM filename
    database              database to search, in fasta format (may be gzip
                            compressed)
    results               results file

    optional arguments:
    -h, --help            show this help message and exit
    -v, --verbose         Long messages
    -s START, --start START
                            Index of database sequence to start with (default: 0)
    -n DBSIZE, --dbsize DBSIZE
                            Number of database sequences to include (default: 0 to
                            include all)
    -r, --shuffle         Shuffle database sequences


filter_hmm
==========

The results from ``search_hmm`` include both significant and insignificant hits.
Use ``filter_hmm`` to filter the results based on length and log-odds criteria.
::

    python -m tral.search.filter_hmm -h

**Output**. The tool can produce three kinds of output:

* ``-f`` **FASTA**: Subset of the input database corresponding to significant hits
* ``-o`` **TSV**: Subset of the ``search_hmm`` results corresponding to
  significant hits
* ``-x`` **TREKS**: All significant repeats in T-REKS format.

The TREKS format is most useful for exporting results to other programs.

**Identifiers**. By default, the identifier is extracted from the database
header lines (e.g. ``>tr|id|description...``). For compatibility with other
programs, the ``--preserve-header`` option can also be used to match sequences
based on the full header.

**Thresholds**. Thresholds can be set for either:

* ``-r`` **Number of repeats** (float). Repeats are computed based on the HMM match
  states. This means that partial repeats are counted as a fraction of the
  query length.
* ``-t`` **Log-odds threshold** (float). Results with log-odds less than this value
  are filtered out.

**Memory usage**. Memory usage should be low if a single output is used. If
several output options are selected then the significant hits are stored in
memory. If this is problematic, ``filter_hmm`` can be run separately for
each desired output.


**Example:**
::

    python -m tral.search.filter_hmm \
        -f filtered.faa \
        -o filtered.tsv \
        -x filtered.tdr \
        -r 2 -t 8.0 \
        results.tsv database.faa.gz


Full usage
----------
::

    usage: filter_hmm.py [-h] [-f FILTERED_FASTA] [-o FILTERED_TSV]
                        [-x FILTERED_TREKS] [-r MIN_REPEATS] [-t LOG_ODDS]
                        [--preserve-header] [-v]
                        hits database

    Filter hits from search_hmm

    positional arguments:
    hits                  TSV file, as produced by search_hmm, containing HMM
                            hits
    database              database to filter, in fasta format (may be gzip
                            compressed)

    optional arguments:
    -h, --help            show this help message and exit
    --preserve-header     Include the full header in FASTA output. Otherwise,
                            just the identifier is used to match TSV and TREKS.
    -v, --verbose         Long messages

    outputs:
    -f FILTERED_FASTA, --filtered-fasta FILTERED_FASTA
                            output fasta file, filtered by hits
    -o FILTERED_TSV, --filtered-tsv FILTERED_TSV
                            Filtered TSV file
    -x FILTERED_TREKS, --filtered-treks FILTERED_TREKS
                            Filtered T-Reks file

    Thresholds:
    -r MIN_REPEATS, --min-repeats MIN_REPEATS
                            Minimum number of repeats
    -t LOG_ODDS, --log-odds LOG_ODDS
                            Threshold for minimum log-odds ratio


Python API
==========

See also the :doc:`Search code documentation <tral/hmm>`