HMM code documentation

The hmm submodule documentation.

hmm

synopsis

A cyclic hidden Markov model that describes sequence tandem repeats.

class tral.hmm.hmm.HMM(hmmer_probabilities, sequence_type='AA')[source]

Sequence profile hidden Markov models (HMMs) are used to model genomic sequence containing a tandem repeat.

Note

The HMM implemented here is described in detail in

Schaper, E., Gascuel, O. & Anisimova, M. Deep conservation of human protein tandem repeats within the eukaryotes. Molecular Biology and Evolution (2014).

All notations used here are based on the notations introduced in

Eddy, S. R. Profile hidden Markov models. Bioinformatics 14, 755–763 (1998).

The HMM contains flanking states (“N”, “C”), match states (“M1”, “M2”, …) and insertion states (“I1”, “I2”, ..). As deletion states are non-emitting, they are merged with the other states.

The HMM state “N” emits the sequence before the tandem repeat. The HMM state “C” emits the residual sequence after the tandem repeat. The HMM states “M1”, “M2”, … can emit sequence within the tandem repeat. “M1” describes the first amino acid/nucleotide in the repeat unit, “M2” the second, and so on]. The HMM states “I1”, “I2”, can emit sequence representing insertions within the tandem repeat.

The transition probabilities between states are stored in p_t. The emission probabilities of every state are stored in p_e. The probabilities for any state to be the null state are stored in p_0.

The structure and probabilities defining standard sequence profile HMMs can be

  • taken from HMM databases such as PFAM,

  • calculated from sequence alignments (here: an alignment of TR units) using for example the HMMER suite,

  • defined in house.

All three ways are implemented here.

The HMM is used to detect the maximum likelihood occurrence of a TR modeled by the HMM on a sequence. The Viterbi algorithm is implemented to do this detection.

id

The id describes the origin of the HMM. E.g., PFAMID “PF00069”.

Type

str

states

The names of all states that the HMM contains. E.g. [“N”, “M1”, “I1”, “M2”, “I2”, “C”]

Type

list of str

insertion_states

The names of all insertion states that the HMM contains. E.g. [“I1”, “I2”]

Type

list of str

match_states

The names of all match states that the HMM contains. E.g. [“M1”, “M2”]

Type

list of str

l_effective

The number of states in the HMM. E.g. l_effective = 4

Type

int

alphabet

The letters that the HMM states can emit. E.g. all amino acids in the LG matrix: [“A”, “C”, …]

Type

list of str

p_t

A dictionary {state0: {state1: transition probability, …}, …} between all states as log10.

Type

dict of dict of float

p_0

A dictionary of the null probabilities to be in any of the states: {states[0]: 0.1, states[1]: 0.2, …}

Type

dict of dict of float

p_e

A dictionary of emission probabilities for every state and every letter in alphabet: {states[0]: {alphabet[0]: emission probability, …}, …}

Type

dict of dict of float

hmmer

A dictionary of all information extracted from a Hmmer model.

Type

dict of dict of float

static create(input_format, file=None, repeat=None, **kwargs)[source]

Creates a HMM instance from 2 possible input formats.

A HMM instance is created from one of the two possible inputs:

  • HMMER3 model

  • Repeat instance

Parameters
  • input_format (str) – The file format of the input file

  • file (str) – Path to the file containing the HMM parameters encoded in the HMMER3 format.

  • repeat (Repeat) – A Repeat object with an MSA that can be transformed into an HMM.

Returns

An initialized instance of the HMM class.

Return type

HMM

Raises
  • Exception – if the HMMER3 file provided does not exist.

  • Exception – if the repeat value provided is not an instance of the Repeat class.

  • Exception – if no parameters are provided to create the HMM from.

Warning

Will initialize all HMM’s in a file but will only return the first one.

Todo

Fix the previous warning (agree on the way how it should be done first)

static create_from_repeat(tandem_repeat, hmm_copy_path=None, hmm_copy_id=None)[source]

Get HMM parameters (including the alphabet, emission probabilities and transition probabilities) from a tandem repeat.

An HMM is created from tandem_repeat using hmmbuild and is saven in the HMMER3 file format. Next, HMM parameters are retrieved from the HMMER3 file and returned.

Parameters
  • tandem_repeat (TR) – A Repeat class instance of the TR to be transformed into an HMM.

  • hmm_copy_path (str) – Path to where a copy of the created HMM will be stored. If None, no copies are saved.

  • hmm_copy_id (str) – HMM id which will serve as the filename in case a copy should be saved. If None and hmm_copy_path is specified the name is generated from current date and time combination.

Returns

For a sample structure please refer to :py:hmm_io.read:.

Return type

HMM parameters (dict)

Todo

Make sure the id is correctly read in.

get_direct_transition_probabilities_for_deletions(state_index, state)[source]
Calculate updated transition probabilities for HMMs without

deletion states.

Calculate updated transition probabilities for HMMs without deletion states. Deletion states are non-emitting states. Therefore, they can be merged into the remaining emitting states. E.g. M -> D -> I would be equivalent to M -> I. The total transition probability from M to I would then be the product of the transition probabilities from M to D and from D to I (equivalent to the sum of the log10 probabilities.).

This method translates all deletions following state with state_index into direct match state to match state transitions, and returns the new transition probabilities.

Parameters
  • state_index (int) – The first parameter.

  • state (str) – The second parameter.

Returns

float}: The transition probability from state with state_index to state1 is calculated for all (?) possible state1.

Return type

{state1

hmm_example()[source]

Todo

Is this method needed?

initialise_HMM_structure(l_effective)[source]
Initialise the HMM with all states and standard values for

transition and emission probabilities.

Initialise all states Set transition probabilities to None/0 Set initial probability for all states equal.

Parameters

l_effective (int) – The number of match states in the HMM.

static read(hmm_filename, id=None)[source]

Read HMM file in HMMER3 format.

HMMER3 file format is described in detail in ftp://selab.janelia.org/pub/software/hmmer3/3.0/Userguide.pdf, section 8.

The general file format is:

HMMER3/b [3.0b2 | June 2009]
NAME fn3
ACC PF00041.12
DESC Fibronectin type III domain
LENG 86
ALPH amino
(...)

HMM          A       C       D       E       F       G       H       I    (...)    Y
            m->m    m->i    m->d    i->m    i->i    d->m    d->d
  COMPO   2.70271 4.89246 3.02314 2.64362 3.59817 2.82566 3.74147 3.08574 (...) 3.22607
          2.68618 4.42225 2.77519 2.73123 3.46354 2.40513 3.72494 3.29354 (...) 3.61503
          0.00338 6.08833 6.81068 0.61958 0.77255 0.00000       *
  1       3.16986 5.21447 4.52134 3.29953 4.34285 4.18764 4.30886 3.35801 (...) 3.93889 1 - -
          2.68629 4.42236 2.77530 2.73088 3.46365 2.40512 3.72505 3.29365 (...) 3.61514
          0.09796 2.38361 6.81068 0.10064 2.34607 0.48576 0.95510
  2       2.70230 5.97353 2.24744 2.62947 5.31433 2.60356 4.43584 4.79731 (...) 4.25623 3 - -
          2.68618 4.42225 2.77519 2.73123 3.46354 2.40513 3.72494 3.29354 (...) 3.61503
          0.00338 6.08833 6.81068 0.61958 0.77255 0.48576 0.95510
  (...)
  86      3.03720 5.94099 3.75455 2.96917 5.26587 2.91682 3.66571 4.11840 (...) 4.99111 121 - E
          2.68618 4.42225 2.77519 2.73123 3.46354 2.40513 3.72494 3.29354 (...) 3.61503
          0.00227 6.08723       * 0.61958 0.77255 0.00000       *
//

Important

The probabilities in this format are stored as negative natural log probabilities, e.g. -ln(0.25) = 1.38629. The special case of 0 probability is stored as * which in fact means -∞ (minus infinity).

Parameters
  • hmm_filename (str) – Path to the file with model data in the HMMER3 file format.

  • id (str, optional) – The identifier for the model to be returned. E.g. for Pfam the id can look like this: ‘PF00560’. If defined, the function returns only the HMM with an identifier that matches the provided id. If a model in the file does not have an identifier to check against, it is skipped.

Returns

A dictionary of parameters required to initialise the HMM including the id, alphabet, emission probabilities and transition probabilities.

Output format:

{
    'id': 'PF08261.7',
    'letters': ['A', 'C', 'D', ..., 'Y'],
    'COMPO':
        {'insertion_emissions': [2.68618, 4.42225, ..., 3.61503],
          'emissions': [2.28205, 5.14899, ..., 1.92022],
          'transitions': [0.01467, 4.62483, ..., -inf]},
    '1':
        {'insertion_emissions': [2.68618, 4.42225, ..., 3.61503],
         'emissions': [1.00089, 4.54999, ..., 5.23581],
         'transitions': [0.01467, 4.62483, ..., 0.95510]},
    ...
    '8':
        {'insertion_emissions': [2.68618, 4.42225, ..., 3.61503],
         'emissions': [4.12723, 5.39816, ..., 4.58094],
         'transitions': [0.00990, 4.62006, ..., -inf]}
}

Return type

dict

See also

hmm_io.read()

set_circle_transition_probability_hmmer3(l_transition_probabilities, final_state)[source]
Calculate transition probabilites between states that exist due to

the circularity of the tandem repeat HMM.

In HMMER3 models, no transition in circle from the last match state to the first match state are included.

Values need to be found for the following transitions: m->m m->i m->d i->m i->i d->m d->d

Compare ftp://selab.janelia.org/pub/software/hmmer3/3.0/Userguide.pdf pp.84-85.

Input: -ln(p) Output: Also -ln(p’)

When the HMM only has a single match state, i.e. l_effective == 1, the final match state equals the only match state, and no transition probabilities can be calculated via averaging over all other states. Therefore, the its transition probabilities remain unchanged.

Parameters
  • l_transition_probabilities (list of float) – The first parameter.

  • final_state (str) – The second parameter.

Returns

transition probabilities

Return type

list of float

set_emission_probability_hmmer3(state, l_emission_probabilities)[source]
Convert and set emission probabilities from HMMER3 file to class

attribute.

Set p_e of states to l_emission_probabilities for state given self.alphabet. In HMMER3 data, emission probabilities are -ln(p).

It is e.g.:

self.alphabet =
['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q',
 'R', 'S', 'T', 'V', 'W', 'Y']

Return log10(p), that is convert between the two. Conversion: p_Local = - p_HMM * log10(e)

Parameters
  • state (str) – The state of the HMM. E.g. the fourth match state ‘M4’.

  • l_emission_probabilities (list of float) – E.g. [‘3.27687’, ‘2.31397’, ‘3.32252’, ‘3.12746’, ‘2.89175’, ‘3.34719’, ‘2.28730’, ‘3.54139’, ‘2.53154’, ‘2.64774’, ‘3.75733’, ‘3.23860’, ‘3.57894’, ‘3.26290’, ‘2.66343’, ‘2.61544’, ‘2.91770’, ‘3.26739’, ‘5.09378’, ‘3.16816’]

set_transition_probability_hmmer3(state_index, l_transition_probabilities)[source]
Convert and assign transition probabilities from a HMMER3 file to

HMM attribute p_t.

Convert and assign transition probabilities from a HMMER3 file to HMM attribute p_t. Set p_t of states to l_transition_probabilities for the state_index th state.

In HMMER3 data, transition probabilities are -ln(p). Return log10(p), i.d. convert between the two. Conversion: p_Local = - p_HMM * log10(e)

Parameters
  • state_index (int) – E.g. 4

  • l_transition_probabilities (list of float) – A list of transition probabilities as found in a single row for a certain state in HMMER3 files. E.g. [‘0.00021’, ‘8.85487’, ‘9.57722’, ‘0.61958’, ‘0.77255’, ‘0.48576’, ‘0.95510’]

write(file, output_format, *args)[source]

Write HMM to file.

Write HMM to file. Currently, only pickle is implemented.

Parameters
  • file (str) – Path to input file

  • output_format (str) – Either “fasta”, “pickle” or “stockholm”

Todo

Write checks for format and file.

tral.hmm.hmm.hmmer3_emission_probabilities(hmmer_probabilities, letters, l_match)[source]

Get emission probabilities from hmmer3 hmm file. In hmm file, emission probabilities are -ln(p). Return log10(p), i.d. convert between the two. Conversion: p_Local = - p_HMM * log10(e)

Parameters (e.g.):

letters = ['A', 'C', 'E', 'D', 'G', 'F', 'I', 'H', 'K', 'M', 'L', 'N',
           'Q', 'P', 'S', 'R', 'T', 'W', 'V', 'Y']
l_match = ['M'+str(i) for i in range(24)]

Return format (pseudo code): [{iA: np.log10(p(iA,iM)) for iA in alphabet.keys()} for iM in l_match]

hmm_io

synopsis

Input/output for hmms.

tral.hmm.hmm_io.read(hmm_filename, id=None)[source]

Read HMM file in HMMER3 format.

HMMER3 file format is described in detail in ftp://selab.janelia.org/pub/software/hmmer3/3.0/Userguide.pdf, section 8.

The general file format is:

HMMER3/b [3.0b2 | June 2009]
NAME fn3
ACC PF00041.12
DESC Fibronectin type III domain
LENG 86
ALPH amino
(...)

HMM          A       C       D       E       F       G       H       I    (...)    Y
            m->m    m->i    m->d    i->m    i->i    d->m    d->d
  COMPO   2.70271 4.89246 3.02314 2.64362 3.59817 2.82566 3.74147 3.08574 (...) 3.22607
          2.68618 4.42225 2.77519 2.73123 3.46354 2.40513 3.72494 3.29354 (...) 3.61503
          0.00338 6.08833 6.81068 0.61958 0.77255 0.00000       *
  1       3.16986 5.21447 4.52134 3.29953 4.34285 4.18764 4.30886 3.35801 (...) 3.93889 1 - -
          2.68629 4.42236 2.77530 2.73088 3.46365 2.40512 3.72505 3.29365 (...) 3.61514
          0.09796 2.38361 6.81068 0.10064 2.34607 0.48576 0.95510
  2       2.70230 5.97353 2.24744 2.62947 5.31433 2.60356 4.43584 4.79731 (...) 4.25623 3 - -
          2.68618 4.42225 2.77519 2.73123 3.46354 2.40513 3.72494 3.29354 (...) 3.61503
          0.00338 6.08833 6.81068 0.61958 0.77255 0.48576 0.95510
  (...)
  86      3.03720 5.94099 3.75455 2.96917 5.26587 2.91682 3.66571 4.11840 (...) 4.99111 121 - E
          2.68618 4.42225 2.77519 2.73123 3.46354 2.40513 3.72494 3.29354 (...) 3.61503
          0.00227 6.08723       * 0.61958 0.77255 0.00000       *
//

Important

The probabilities in this format are stored as negative natural log probabilities, e.g. -ln(0.25) = 1.38629. The special case of 0 probability is stored as * which in fact means -∞ (minus infinity).

Parameters
  • hmm_filename (str) – Path to the file with model data in the HMMER3 file format.

  • id (str, optional) – The identifier for the model to be returned. E.g. for Pfam the id can look like this: ‘PF00560’. If defined, the function returns only the HMM with an identifier that matches the provided id. If a model in the file does not have an identifier to check against, it is skipped.

Returns

A dictionary of parameters required to initialise the HMM including the id, alphabet, emission probabilities and transition probabilities.

Output format:

{
    'id': 'PF08261.7',
    'letters': ['A', 'C', 'D', ..., 'Y'],
    'COMPO':
        {'insertion_emissions': [2.68618, 4.42225, ..., 3.61503],
          'emissions': [2.28205, 5.14899, ..., 1.92022],
          'transitions': [0.01467, 4.62483, ..., -inf]},
    '1':
        {'insertion_emissions': [2.68618, 4.42225, ..., 3.61503],
         'emissions': [1.00089, 4.54999, ..., 5.23581],
         'transitions': [0.01467, 4.62483, ..., 0.95510]},
    ...
    '8':
        {'insertion_emissions': [2.68618, 4.42225, ..., 3.61503],
         'emissions': [4.12723, 5.39816, ..., 4.58094],
         'transitions': [0.00990, 4.62006, ..., -inf]}
}

Return type

dict

tral.hmm.hmm_io.read_HMMER_acc_lengths(hmm_filename)[source]

Read HMM file in HMMER3 format. Return the PFAM ID and the lengths of each model.

Read HMM file in HMMER3 format. Return the PFAM ID and the lengths of each model.

Parameters

hmm_filename (str) – Path to HMMER3 file.

Returns

int): The number of match states for all HMM models in hmm_filename.

Return type

(dict of str

Todo

Decide whether this function is needed.

tral.hmm.hmm_io.split_HMMER3_file(hmm_filename, resultdir)[source]

Split HMMER3 models from a single file hmm_filename into many files in resultdir.

Helper function: split HMMER3 models from a single file hmm_filename into many files in resultdir. The models are named after the HMM accession.

Parameters
  • hmm_filename (str) – Path to HMMER3 file.

  • resultdir (str) – Path to directory where result files are stored.

tral.hmm.hmm_io.to_fixed_width(n, max_width, allow_overflow=True, do_round=True)[source]

Format a float to a fixed-width string.

Parameters
  • n (-) – number to format

  • max_width (-) – width of output string

  • allow_overflow (-) – allow the result to be larger than max_width for large n? Raises OverflowError if False

  • do_round (-) – Round final digit?

Returns

(str) string representation of n, either padded or rounded to have

max_width characters

tral.hmm.hmm_io.write_HMMER(hmm, hmm_file)[source]

Write <hmm> too <hmm_filename> in HMMER3 format.

See Section 8 (File Formats) of ftp://selab.janelia.org/pub/software/hmmer3/3.0/Userguide.pdf for the file format specification.

Parameters
  • hmm (hmm.HMM) – HMM object to write

  • hmm_file (str or file) – filename or file-like object to write to

hmm_own_model

class tral.hmm.hmm_own_model.HMM(tandem_repeat=None, prior_divergence=None, prior_indel_insertion=None, parameters=None)[source]

A cyclic HMM applicable to describe sequence Tandem Repeats

HMM_from_TR(tandem_repeat, prior_divergence=None, prior_indel_insertion=None, emission_file=None)[source]

Build a HMM from a TR <tandem_repeat>. Two step approach: First, calc the posterior of the ancestral sequence Then, calculate the likelihood of the homologous repeat unit sequence.

Parameters: <prior_divergence> is {‘type’: ‘alpha’/’fixed_value’, ‘value’: 4} <prior_indel_insertion> is {‘mu’: 0.5, ‘sigma_squared’: 0.81}

Store probabilities as log10arithms.

HMM_from_TR_One_step(tandem_repeat, t)[source]

Build a HMM from a TR alignment given the ML divergence t. One step approach: Calc the posterior of the ancestral sequence and use it as the likelihood of the homologous repeat unit sequence.

HMM_from_file(hmm_file, accession, prior_indel_insertion)[source]

Load a HMM from hmm_file Store probabilities as log10arithms.

Parameters: <hmm_file> = ‘path//to/file.hmm’ <accession> = ‘PF00560’ <prior_indel_insertion> = {‘mu’: 0.5, ‘sigma_squared’: 0.81}

initialise_HMM_structure(l_effective)[source]

Initialise all states Set transition probabilities to None/0 Set initial probability for all states equal.

class tral.hmm.hmm_own_model.TR[source]

Fake interim TR class

tral.hmm.hmm_own_model.calculate_MAP_Indel_length_Zipfian_factor(indel_lengths, prior=None)[source]

calculate the MAP Zipfian constant <indel_zipf> that determines the distribution of indel lengths. Assume a Gaussian prior. Input is a list of <indel_lengths> for a particular column. The probability distribution of indel lengths is assumed to follow the Zipfian distribution (Compare

Fletcher,W. and Yang,Z. (2009) INDELible: a flexible simulator of biological sequence evolution. Mol Biol Evol, 26, 1879–1888.

In their publication, <indel_zipf> is denoted as <a>, the gap lengths as <u>.)

The MAP is calculated numerically, as there did not seem to be a nice analytical solution.

tral.hmm.hmm_own_model.calculate_MAP_Indel_length_exponential_factor(indel_lengths, prior=None)[source]

Calculate the MAP Exponential decay constant <alpha> that determines the distribution of indel lengths. Assume a Gaussian prior. Input is a list of <indel_lengths> for a particular column.

The MAP is calculated numerically, as no nice analytical solution has been found so far. A power function of grade 3 needs to be solved:

0 == prior[‘sigma_squared’]*(sum(indel_lengths) - len(indel_lengths)) -

alpha*(prior[‘sigma_squared’]*sum(indel_lengths) + prior[‘mu’]) + (alpha**2)*(1 + prior[‘mu’]) - alpha**3

tral.hmm.hmm_own_model.calculate_MAP_indel_rate(nIndels, n, prior=None)[source]

calculate the MAP indel_rate per repeat unit and site given - an assumed prior distribution of the indel rate described by the dict <prior> - the likelihood of the the observed number of indels <nIndels>.

posterior ~ likelihood * prior Derive the posterior with respect to the indel_rate Set the result equal to 0 Solve with respect to the MAP of the indel_rate.

For nIndels in {0,1} it is possible to derive the MAP indel_rate analytically. For all other cases, the maximum of the posterior distribution needs to be found in good approximation algorithmically.

ln(posterior) ~ ln(likelihood) + ln(prior)

tral.hmm.hmm_own_model.calculate_log10_indel_probability(nIndels, n, prior=None)[source]

calculate the probabilty of an indel per site and per repeat unit from the MAP estimate of the indel rate, given

  • the <divergence> of the repeat units

  • a <prior> on the indel rates (normal distributed)

  • the number of indels <nIndels> in this column

  • the total length <n> of this column

tral.hmm.hmm_own_model.calculate_log10_probability_indel_lengths(indel_lengths, indel_length_max, type='zipf', prior=None)[source]

Calculate the probability of indels of different length until a maximum length <indel_length_max>, assuming that either

type == ‘zipf’ indel lengths are distributed following a Zipfian distribution with parameter <indel_zipf>. (Compare:

Fletcher,W. and Yang,Z. (2009) INDELible: a flexible simulator of biological sequence evolution. Mol Biol Evol, 26, 1879–1888.

In their publication, <indel_zipf> is denoted as <a>, the gap lengths as <u>.)

or type == ‘exponential’ indels lengths are distributed following a (1-alpha)*(alpha^(l-1)) distribution. [Is this model already used? Probably yes. How can i cite it?]

A good choice for <indel_length_max> when building HMMs for tandem repeats is <l_effective-1>, as a deletion, that includes a complete repeat unit, will not be visible in the tandem repeat unit alignment, and could therefore be ignored.

Return the probabilities as a list.

tral.hmm.hmm_own_model.derivative_log_posterior(indel_rate, prior, n, nIndels)[source]

assuming that on each site on each repeat unit, an in/del event has occured, or not. Thus, the likelihood of a certain number of in/del events follows the binomial distribution. The prior is assumed to be a gaussian described by prior.

Return the derivative with respect to <indel_rate> of the natural logarithm of the posterior probability of <nIndels>, calculated as the product of likelihood and prior. The normalisation factor is ignored for the moment.

tral.hmm.hmm_own_model.divergence_from_FP_simulations(l, alpha=0.1)[source]

Which HMM divergence sets the average FP-rate (i.e. the number of falsely positively assigend amino acid on either flanking side to a perfect TR) to in total alpha*l

tral.hmm.hmm_own_model.hmmer3_emission_probabilities(hmmer_probabilities, letters, lMatch)[source]

Get emission probabilities from hmmer3 hmm file. In hmm file, emission probabilities are -ln(p). Return log10(p), i.d. convert between the two. Conversion: p_Local = - p_HMM * log10(e)

Parameters (e.g.): letters = [‘A’, ‘C’, ‘E’, ‘D’, ‘G’, ‘F’, ‘I’, ‘H’, ‘K’, ‘M’, ‘L’, ‘N’, ‘Q’, ‘P’, ‘S’, ‘R’, ‘T’, ‘W’, ‘V’, ‘Y’] lMatch = [‘M’+str(i) for i in range(24)]

Return format (pseudo code): [{iA: np.log10(p(iA,iM)) for iA in alphabet.keys()} for iM in lMatch]

tral.hmm.hmm_own_model.loglikelihood_substitution(t, Q, eqFreq, alphabet, tandem_repeat)[source]

calculate the likelihood of a repeat assuming a star tree and a sequence model defined by Q, t, eqFreq and alphabet.

tral.hmm.hmm_own_model.test()[source]

To be implemented…

hmm_viterbi

synopsis

Viterbi algorithm for circular hidden Markov models.

tral.hmm.hmm_viterbi.distance_index(i, j, length)[source]

Helper function to calculate the distance between two indices in a circular HMM.

Parameters
  • i (int) – first index

  • j (int) – second index

  • length (int) – length of the HMM

Returns

The distance between the indices i and j: As the HMM is circular j may have a smaller value than i, even though j is ahead of i.

Return type

int

Todo

May be replaced by a simple mod, as distance_index is used only once in the code at current.

tral.hmm.hmm_viterbi.hmm_path_to_aligned_tandem_repeat_units(sequence, most_likely_path, l_effective, translate=False)[source]

WARNING: Currently do not rely on this function!! The output seems to have missing characters and too much gaps. The function needs review in order to be used. todo: fix hidden bug!

Convert a viterbi path in an hmm of length l_effective on the

sequence into a corresponding tandem repeat.

Extract the tandem repeat alignment from a sequence given a Viterbi path. Use alignment information in the Viterbi path. For example, all emissions labelled with M1 (match state 1) align according to the HMM. Insert gaps for insertions and deletions accordingly. Thus, for example the first characters in the repeat units do necessarily align, albeit some of them may be gaps. Also, all repeat units are necessarily of same length.

Assume that all states are counted starting on 0, unless the translate flag is set.

Parameters
  • sequence (str) – A sequence as a string.

  • most_likely_path (list of str) – a Viterbi path.

  • l_effective (int) – length of the HMM used to create the Viterbi paths.

  • translate (bool) – This function assumes that HMM states are enumerated starting on 0. If the HMM states are enumerated starting on 1, set this flag for transformation.

Returns

The function returns a tuple consisting of 3 values. The tuple contains:

  • msa: A repeat instance.

  • begin: The start index of the repeat on the sequence.

  • shift: The index of the HMM where the cut between repeat units is set.

Warning

[None].

Todo

Use sequence instance instead of just a string?

Todo

Check: How is the returned begin defined? Starting counting on 0 or 1? Is it the index of the last flanking character, or the first repeat character?

Todo

Can we update this function, e.g. to not assume that HMM states start on 0?

Todo

Check the docstring, reformat returns.

tral.hmm.hmm_viterbi.hmm_path_to_maximal_complete_tandem_repeat_units(sequences, paths, l_effective, alpha=None)[source]

Convert several viterbi paths of a hmm on several sequences into the corresponding hmm units.

Be ungreedy: Start from the last index in the cluster of all start state and end state indices.

Only integrate repeat units that are at least alpha complete (be it before of after) If you prefer absolute number of characters to filter which repeat units are used, and which not, change this in two occurrences of alpha.

Assume that all states are counted starting on 1.

Parameters
  • sequences (list of str) – A list of sequences.

  • paths (list of list of str) – A list of Viterbi paths

  • l_effective (int) – length of the HMM used to create the Viterbi paths.

  • alpha (Float) – alpha element [0,1].

Returns

A list of multiple sequence alignments (MSAs) in the form of a list of list of str, e.g. [['ATAILC', 'ATAILC', 'ACALKG'], ...].

Raises

Exception – If alpha is not in [0,1].

Todo

Use sequence instances instead of just strings for sequences?

Todo

Check example for returns.

tral.hmm.hmm_viterbi.hmm_path_to_non_aligned_tandem_repeat_units(sequence, path, l_effective)[source]

Convert a viterbi <path> of a hmm of length <l_effective> on <sequence> into the corresponding tandem repeat

Extract the tandem repeat alignment from a sequence given a Viterbi path. Ignore the alignment information in the Viterbi path. For example, all emissions labelled with M1 (match state 1) align according to the HMM. However, this method does not use this information. Therefore, for example the first characters in the repeat units do not necessarily align, and the repeat units are not necessarily of same length.

Assume that all states are counted starting on 1.

Parameters
  • sequence (str) – A sequence as a string.

  • paths (list of list of str) – A list of Viterbi paths

  • l_effective (int) – length of the HMM used to create the Viterbi paths.

Returns

A multiple sequence alignment (MSA) created from the most likely path along the hmm in the form of a list of str.

Todo

Use sequence instance instead of string for input?

tral.hmm.hmm_viterbi.logodds(hmm, emission, logprob)[source]

Compute the log-odds score of the given emission

Log odds is defined as:

LO = log10 Pr[viterbi probability]/Pr[null]

where the null model is given by a constant background frequency f(x) for each amino acid:

Pr[null] = sum(f(x) for x in emission)

Parameters
  • hmm (-) – hmm used to calculate logprob

  • emission (-) – emitted residues

  • logprob (-) – log10 of the match probability, as returned by viterby_with_prob

Returns

(float) LO

tral.hmm.hmm_viterbi.probability_of_the_former_state(i_former, iS, iE, p_e, p_t, path)[source]

Calculate the probability of i_former, given iS and iE together with the dicts of emission and transition probabilities.

tral.hmm.hmm_viterbi.viterbi(hmm, emission)[source]

Calculate the most probable sequence of states given a sequence of emissions and a HMM using the Viterbi algorithm

Get local copies of all variables. All probabilities must be given as logarithms Replace Selenocysteine (U) with Cysteine (C), replace Pyrrolysine (O) with Lysine (K) [Reason: Seldom AAs that are not part of standard amino acid substitution models.]

Parameters
  • hmm (hmm) – An instance of the HMM class.

  • emission (sequence) – An instance of the Sequence class.

Returns

The most likely sequence of hmm states to emit the sequence in the form of a list of str, or None if it does not meet the configured thresholds

Configuration:

filter.basic.dict.n_effective.threshold: minimum (sequence length/hmm l_effective) hmm.l_effective_max: Max HMM size [AA|DNA].ambiguous_chars: Combine probabilities of ambiguous residues

tral.hmm.hmm_viterbi.viterbi_with_prob(hmm, emission)[source]

Calculate the most probable sequence of states given a sequence of emissions and a HMM using the Viterbi algorithm, as well as the probability of that path

Get local copies of all variables. All probabilities must be given as logarithms Replace Selenocysteine (U) with Cysteine (C), replace Pyrrolysine (O) with Lysine (K) [Reason: Seldom AAs that are not part of standard amino acid substitution models.]

Parameters
  • hmm (hmm) – An instance of the HMM class.

  • emission (sequence) – An instance of the Sequence class.

Returns

Tuple (path, probability) - The most likely sequence of hmm states to emit the sequence in the form of a list of str, or None if it does not meet the configured thresholds - The log10 probability of that path (-inf if path is None)

Configuration:

filter.basic.dict.n_effective.threshold: minimum (sequence length/hmm l_effective) hmm.l_effective_max: Max HMM size [AA|DNA].ambiguous_chars: Combine probabilities of ambiguous residues

Todo

Adapt docstrings to refactored Viterbi -> Viterbi_path classes.

Todo

Check: Do you need local copies of all variables?

Todo

Do the functions related to viterbi need to be summarized (e.g. in one class?) How do they relate to the Sequence class, or the HMM class?