Repeat code documentation

Initial version of the repeat submodule documentation.

repeat

synopsis:The Repeat Class.
class tral.repeat.repeat.Repeat(msa, begin=None, sequence_type='AA', scoreslist=['phylo_gap01'], calc_score=True, calc_pvalue=True)[source]

Tandem repeats are consecutive copies of genomic sequence.

A tandem repeat is stored in terms of a multiple sequence alignment of its tandem repeat units. For example, the amino acid tandem repeat KLMKLMKLKLM can be displayed as a multiple sequence alignment (here: “multiple repeat unit alignment”) KLM KLM KL- KLM and can be stored as [“KLM”, “KLM”, “KL-”, “KLM”]. The - char stands for a gap: It indicates that either, there used to be a char at the same position, but it got deleted. Or, all other chars in the same column got inserted at some point.

For research on tandem repeats, many different statistics need to be calculated on the multiple sequence alignment of the tandem repeat.

[Description of these statistics]

msa_original

list of str

The original alignment of repeat units.

msa

list of str

The alignment of repeat units without rare chars.

msaT

list of str

The transposed alignment of repeat units. E.g. [“KKKK”, “LLLL”, “MM-M”].

sequence_type

str

“AA” or “DNA”

n

int

The number of repeat units in the original msa. E.g. 4.

l_msa

int

The length of the repeat units in the original msa. E.g. 3.

l_effective

int

The number of columns in the original msa not counting insertion columns (i.e. columns with more gaps than chars).

n_effective

float

The number of chars in the MSA without insertion columns divided by n_effective.

text

str

The repeat region of the tandem repeats (e.g. all chars in the MSA without gaps concatenated.)

n_gaps

int

The number of gaps in msa

repeat_region_length

int

The lengths of text

__str__()[source]

Create string for Repeat instance.

calc_index_msa()[source]

formerly named: set_begin

calc_n_effective()[source]

Calculate the effective number of repeat units n_effective

Calculate the number of letters in in all non-insertion columns, divided by <l_effective>

Parameters:self (Repeat instance) –
calculate_pvalues(scoreslist=['phylo_gap01'])[source]

Calculate pvalues on a Repeat instance.

Calculate pvalues on a Repeat instance for all scores in scoreslist.

Parameters:self (Repeat) – An instance of the repeat class.
Kwargs:
scoreslist(list of Str): A list of the names of the scores (e.g. [“phylo_gap01”]) for which p-Values are calculated on the Repeat instance self
calculate_scores(scoreslist=['phylo_gap01'])[source]

Calculate scores on a Repeat instance.

Calculate scores on a Repeat instance for all scores in scoreslist.

Parameters:self (Repeat) – An instance of the repeat class.
Kwargs:
scoreslist(list of str): A list of the names of the scores (e.g.
[“phylo_gap01”]) that are calculated on the Repeat instance self
create(file, input_format, **kwargs)[source]

Read tandem repeat from file.

Read tandem repeat from file (currently, only pickle is supported)

Parameters:
  • format (str) – Currently only “pickle”
  • file (str) – Path to output file

Todo

Write checks for format and file.

Todo

Write tests.

Todo

Can repeat_io.read_fasta be replaced by e.g. BioPython seqIO?

delete_insertion_columns()[source]
Create the tandem repeat attributes msa, masT, l and n
without insertion columns.

We define insertion columns as columns where there are more or equal gaps compared to chars. These columns are removed from both msa to create msaD and msaT to create msaTD.

gap_structure()[source]

Calculate the number and length of insertions and deletions for this tandem repeat.

gap_structure_HMM()[source]
Calculate for each column in the transposed multiple repeat unit

alignment msaTD

  • how many deletions of what length begin in this column?:

    deletion_lengths`[`iColumn] = list of deletion lengths

  • how many insertions of what length begin in potential insertion columns before this column?: insertion_lengths`[`iColumn] = list of insertion lengths

Parameters:self (Repeat) – An instance of the repeat class.
realign_TR(realignment='proPIP', rate_distribution='constant')[source]
Realign multiple sequence alignment of tandem repeats.
Update of tandem repeat MSA.
Parameters:
  • self (Repeat instance) –
  • realignment (str) – either “proPIP” or “mafft”
  • rate_distribution (str) – either “constant” or “gamma” (per default value from REPEAT_CONFIG[‘castor_parameter’][‘rate_distribution’])
save_original_msa(sequence)[source]

recalculate the original msa, assuming that begin is defined correctly (index starting counting on 1.) This function might be obsolete when the original MSA is saved from the beginning.

write(file, file_format)[source]

Write tandem repeat to file.

Write tandem repeat to file using one of three formats.

Parameters:
  • file (str) – Path to input file
  • file_format (str) – Either “fasta”, “pickle” or “stockholm”

Todo

Write checks for format and file.

tral.repeat.repeat.calculate_position_in_alignment(begin, length, alignment)[source]
Calculate the index of the begin and the end of a TR within an alignment

returns

Calculate the index of the begin and the end of a TR within an alignment returns

Parameters:
  • begin (int) – (needs explaining)
  • length (int) – (needs explaining)
  • alignment – (needs explaining)
Returns:

Dict

repeat_align

synopsis:Alignment of tandem repeat units.
tral.repeat.repeat_align.realign_repeat(my_msa, realignment='mafft', sequence_type='AA', rate_distribution='constant', user_path=None)[source]

Realignment of a repeat MSA using mafft or proPIP

A good multiple sequence alignment (MSA) is an important description of a tandem repeat (TR) and can be crucial to distinguish a true positive from a false positive TR. This can be realised using a proper MSA estimation algorithm. Currently, Mafft and proPIP can be used to realign a TR.

For proPIP the rate distribution for indels can be either constant or gamma distributed
  • REPEAT_CONFIG[‘castor_parameter’][‘rate_distribution’] == ‘constant’: no gamma distribution is used for the proPIP algorithm
  • REPEAT_CONFIG[‘castor_parameter’][‘rate_distribution’] == ‘gamma’: gamma distribution (n=6, alpha=0.5)
Parameters:
  • my_msa (list of strings) – List of sequences (str)
  • realignment (str) – Either “mafft” or “proPIP”
  • sequence_type (str) – Either “AA” or “DNA”
  • user_path (str) – copy alignment files to user defined path
Returns:

my_msa realigned (list of strings)

Todo

  • decide what happens if branch length of tree is getting zero
tral.repeat.repeat_align.remove_char(str, n)[source]

helper function to remove a character in a sequence

tral.repeat.repeat_align.remove_gaps(msa)[source]

Remove columns in a MSA which contain only gaps. Do not use MSA longer than 3000 characters.

Returns the MSA as a list without only-gap-containing columns.

Parameters:msa (list of strings) – List of sequences (str)
Returns:msa (list of strings) with removed colums in case they only contain gaps

repeat_exon

tral.repeat.repeat_exon.get_exon_measures(my_TR, exon_structure)[source]

Calculate diverse exon measures

Exon structure parameters: # exon_structure = [{‘Exon Chr End (bp)’: ‘43478424’, ‘Exon Chr Start (bp)’: ‘43477252’, ‘CDS End’: ‘272’, ‘CDS Start’: ‘1’}, {‘Exon Chr End (bp)’: ‘43476658’, ‘Exon Chr Start (bp)’: ‘43476498’, ‘CDS End’: ‘433’, ‘CDS Start’: ‘273’}, {‘Exon Chr End (bp)’: ‘43476158’, ‘Exon Chr Start (bp)’: ‘43476036’, ‘CDS End’: ‘556’, ‘CDS Start’: ‘434’}, {‘Exon Chr End (bp)’: ‘43475664’, ‘Exon Chr Start (bp)’: ‘43475564’, ‘CDS End’: ‘657’, ‘CDS Start’: ‘557’}, {‘Exon Chr End (bp)’: ‘43475416’, ‘Exon Chr Start (bp)’: ‘43475194’, ‘CDS End’: ‘880’, ‘CDS Start’: ‘658’}, {‘Exon Chr End (bp)’: ‘43475046’, ‘Exon Chr Start (bp)’: ‘43474707’, ‘CDS End’: ‘951’, ‘CDS Start’: ‘881’}]

tral.repeat.repeat_exon.get_exon_structure(my_TR, exon_structure)[source]

Exon structure parameters: # exon_structure = [{‘Exon Chr End (bp)’: ‘43478424’, ‘Exon Chr Start (bp)’: ‘43477252’, ‘CDS End’: ‘272’, ‘CDS Start’: ‘1’}, {‘Exon Chr End (bp)’: ‘43476658’, ‘Exon Chr Start (bp)’: ‘43476498’, ‘CDS End’: ‘433’, ‘CDS Start’: ‘273’}, {‘Exon Chr End (bp)’: ‘43476158’, ‘Exon Chr Start (bp)’: ‘43476036’, ‘CDS End’: ‘556’, ‘CDS Start’: ‘434’}, {‘Exon Chr End (bp)’: ‘43475664’, ‘Exon Chr Start (bp)’: ‘43475564’, ‘CDS End’: ‘657’, ‘CDS Start’: ‘557’}, {‘Exon Chr End (bp)’: ‘43475416’, ‘Exon Chr Start (bp)’: ‘43475194’, ‘CDS End’: ‘880’, ‘CDS Start’: ‘658’}, {‘Exon Chr End (bp)’: ‘43475046’, ‘Exon Chr Start (bp)’: ‘43474707’, ‘CDS End’: ‘951’, ‘CDS Start’: ‘881’}]

tral.repeat.repeat_exon.logger = <logging.Logger object>

Functions to check whether the exon structure and the repeat unit structure are correlated.

tral.repeat.repeat_exon.print_exon_structure(my_TR, exon_structure)[source]

Exon structure parameters: # exon_structure = [{‘Exon Chr End (bp)’: ‘43478424’, ‘Exon Chr Start (bp)’: ‘43477252’, ‘CDS End’: ‘272’, ‘CDS Start’: ‘1’}, {‘Exon Chr End (bp)’: ‘43476658’, ‘Exon Chr Start (bp)’: ‘43476498’, ‘CDS End’: ‘433’, ‘CDS Start’: ‘273’}, {‘Exon Chr End (bp)’: ‘43476158’, ‘Exon Chr Start (bp)’: ‘43476036’, ‘CDS End’: ‘556’, ‘CDS Start’: ‘434’}, {‘Exon Chr End (bp)’: ‘43475664’, ‘Exon Chr Start (bp)’: ‘43475564’, ‘CDS End’: ‘657’, ‘CDS Start’: ‘557’}, {‘Exon Chr End (bp)’: ‘43475416’, ‘Exon Chr Start (bp)’: ‘43475194’, ‘CDS End’: ‘880’, ‘CDS Start’: ‘658’}, {‘Exon Chr End (bp)’: ‘43475046’, ‘Exon Chr Start (bp)’: ‘43474707’, ‘CDS End’: ‘951’, ‘CDS Start’: ‘881’}]

repeat_io

synopsis:Input/output for tandem repeats.
tral.repeat.repeat_io.evolved_tandem_repeats(l, n, n_samples, sequence_type, job_id='job_id', mutation_rate=50, tree='star', indel_rate_per_site=False, return_type='repeat')[source]

Simulate evolved sequences with ALF.

Simulate evolved sequences with ALF: Dalquen, D. A., Anisimova, M., Gonnet, G. H. & Dessimoz, C. ALF–a simulation framework for genome evolution. Molecular Biology and Evolution 29, 1115–1123 (2012).

Parameters:
  • l (int) – The length of the repeat unit
  • n (int) – The number of repeat units in the tandem repeat
  • n_samples (int) – The number of samples
  • sequence_type (str) – Either “AA” or “DNA”
  • job_id (str) – A tag for files produces with ALF, and result files.
  • mutation_rate (float) – The mutation rate.
  • tree (str) – The type of tree, e.g. “star” or “birthdeath”
  • indel_rate_per_site (int or False) – The indel rate per site.
  • return_type (str) – Either “repeat” or “list”
  • sequence_length (int) – The total length of the simulated sequence
Returns:

Return type depends on return_type: Repeat or Bio.Seq.Seq instance.

tral.repeat.repeat_io.random_sequence(n_samples, sequence_type='AA', return_type='repeat', equilibrium_frequencies='human', l=0, n=0, sequence_length=0)[source]

Simulate random sequence locally.

Simulate random sequence locally.

Parameters:
  • n_samples (int) – The number of samples
  • sequence_type (str) – Either “AA” or “DNA”
  • return_type (str) – Either “repeat” or “list”
  • equilibrium_frequencies (str) – Only “human” option available at current
  • l (int) – The length of the repeat unit
  • n (int) – The number of repeat units in the tandem repeat
  • sequence_length (int) – The total length of the simulated sequence
Returns:

Return type depends on return_type.

tral.repeat.repeat_io.read_fasta(seq_filename, sequence_type='AA')[source]

Read repeat from file in fasta format.

Read repeat from file in fasta format.

Parameters:
  • seq_filename (str) – Path to the repeats containing fasta file
  • sequence_type (str) – Either “AA” or “DNA”
Returns:

A list of Repeat instances.

Return type:

(list of Repeat)

tral.repeat.repeat_io.save_repeat_fasta(tandem_repeats, file)[source]

save multiple <tandem_repeats> in Fasta format in specified <file>

At current, only one TR per sequence can be defined, as the identifiers in the dict <tandem_repeats> must be unique.

Parameters: Dict of tandem repeats and identifiers.

e.g. {‘ENSP00012’: msa1, ‘ENSP00013’: msa2}

>ID GHKI GHKI GH–

tral.repeat.repeat_io.save_repeat_stockholm(tandem_repeat, file)[source]

save <tandem_repeat> in STOCKHOLM format in specified <file>

Parameters: Tandem repeat MSA.

e.g. [“ACDEF-”, “ACCDEF”]

More information in http://eddylab.org/software/hmmer3/3.1b2/Userguide.pdf STOCKHOLM Format Example:

# STOCKHOLM 1.0 seq1 ACDEF...GHIKL seq2 ACDEF...GHIKL seq3 ...EFMNRGHIKL

seq1 MNPQTVWY seq2 MNPQTVWY seq3 MNPQT... //

tral.repeat.repeat_io.save_repeat_treks(tandem_repeats, file)[source]

Save multiple tandem_repeats in T-REKS format in specified file

At current, only one TR per sequence can be defined, as the identifiers in the dict <tandem_repeats> must be unique.

Parameters: Dict of tandem repeats and identifiers.

e.g. {‘ENSP00012’: [msa1, begin1], ‘ENSP00013’: [msa2, begin2]}

T-REKS format example:

>a Length: 3 residues - nb: 3 from 1 to 10 - Psim:0.8076923076923077 region Length:42 GHKI GHKI GH– ******************

Length: 3 residues - nb: 2 from 21 to 27 - Psim:0.7857142857142857 region Length:22 GHKI GH– ******************

>b Length: 3 residues - nb: 3 from 1 to 10 - Psim:0.8095238095238095 region Length:20 UIR UIR UIR

repeat_pvalue

synopsis:The repeat pvalue module.
tral.repeat.repeat_pvalue.calc_pvalues(repeats, result_file_path, file_name, scoretypes=['phylo', 'phylo_gap'], gappy_data=False)[source]

Create and save a null distribution for repeats scores for p-Value calculation.

You can use a different null distribution for each column of different length for both the parsimony and the pSim score. Therefore, calculate the structure of the TR before you calculate the null distribution of tandem repeat scores of tandem repeats with the same gap distribution.

Parameters:
  • repeats (list of Repeat) – A list of Repeat instances.
  • result_file_path (str) – Path to result folder.
  • file_name (str) – Name of result file.
  • scoretypes (list of str) – List of scores. E.g. [‘phylo’,’phylo_gap’].
  • gappy_data (bool) – True if any of repeats contain gaps, else False.
tral.repeat.repeat_pvalue.calculate_repeat_structure(tandemrepeat)[source]

Calculate the number of columns with a certain number of gaps for each column in a Repeat instance.

You can use a different null distribution for each column of different length for both the parsimony and the pSim score. Therefore, calculate the structure of the TR before you calculate the null distribution of tandem repeat scores of tandem repeats with the same gap distribution.

Parameters:
  • n (int) – The number of repeat units.
  • score – The heuristic model used. E.g. “psim”.
  • sequence_type (str) – The type of the sequence: either “AA” or “DNA”.
Returns:

l_repeat_structure (list) n_repeat_structure (int)

Todo

Include exact definition of returned values.

tral.repeat.repeat_pvalue.column_pdf(n, score_type='psim', sequence_type='AA')[source]
Load and return the probability density function of the score on random
sequence_type data of length n.

Load and return the probability density function of the score on random sequence_type data of length n. This method handels scores with distributions independent of l. This includes all parsimony scores, and excludes model based scores.

The order of the values in the probability density function is WORST FIRST: The first values describe the probability for “bad” scores, the last values the probabilities for “top” scores.

Currently, for all models values are available until

  • n_max = 150

For n <= n_max all values are available. If no distribution is available for a given n the closest values for n is used.

For values of n above n_max, the n_max pdf is returned.

Parameters:
  • n (int) – The number of repeat units.
  • score_type – The heuristic model used. E.g. “psim”.
  • sequence_type (str) – The type of the sequence: either “AA” or “DNA”.
Returns:

A numpy list of length 10.000

Todo

Check return type.

Todo

Check how this method concurs with the empirical_list method.

tral.repeat.repeat_pvalue.d_average_multinom(l, n, sequence_type, score_type)[source]

Helper function for the analytic calculation of parsimony or pSim scores.

Called by d_average_multiple_max_multinom or d_average_multiple_pars_multinom.

Return the l-times self-convoluted pdf of the score on random sequence_type data of length n. The order of the pdf is kept from column_pdf().

For details see:

Schaper, E., Kajava, A., Hauser, A., & Anisimova, M. Repeat or not repeat? –Statistical validation of tandem repeat prediction in genomic sequences. Nucleic Acids Research (2012).

Parameters:
  • l (int) – The length of the repeat unit.
  • n (int) – The number of repeat units.
  • sequence_type (str) – The type of the sequence: either “AA” or “DNA”.
  • score_type – The model of repeat evolution used. Either ‘psim’ or ‘parsimony’.
Returns:

(description missing)

Warning

if precision higher than max(uint32) use uint64 instead. CHECK: https://numpy.org/doc/stable/user/basics.types.html

Todo

Describe return value.

tral.repeat.repeat_pvalue.d_average_multiple_max_multinom(tandemrepeat, precision=10000.0)[source]
Calculate null distribution for e.g. the pSim score for repeats of type tandemrepeat
as a probability density function.

Analytically calculate the p-Value distribution for the pSim score for repeats of type tandemrepeat. The derivation is described in:

Schaper, E., Kajava, A., Hauser, A., & Anisimova, M. Repeat or not repeat? –Statistical validation of tandem repeat prediction in genomic sequences. Nucleic Acids Research (2012).

Parameters:
  • tandemrepeat (Repeat) – A Repeat instance.
  • precision (float) – The precision of the returned probability density function in
  • of the length of the resulting list. (terms) –
Returns:

p – cumulated probabilities from 0 to 1. unnamed (list of float): scores corresponding to the probabilities in p.

Return type:

list of float

Warning

if precision higher than max(uint32) use uint64 instead. CHECK: https://numpy.org/doc/stable/user/basics.types.html

tral.repeat.repeat_pvalue.d_average_multiple_pars_multinom(tandemrepeat, precision=10000.0)[source]
Calculate null distribution for the parsimony score for repeats of type tandemrepeat
as a probability density function.

Analytically calculate the p-Value distribution for the parsimony score for repeats of type tandemrepeat. The derivation is described in:

Schaper, E., Kajava, A., Hauser, A., & Anisimova, M. Repeat or not repeat? –Statistical validation of tandem repeat prediction in genomic sequences. Nucleic Acids Research (2012).

Parameters:
  • tandemrepeat (Repeat) – A Repeat instance.
  • precision (float) – The precision of the returned probability density function in
  • of the length of the resulting list. (terms) –
Returns:

p – cumulated probabilities from 0 to 1. unnamed (list of float): scores corresponding to the probabilities in p.

Return type:

list of float

Warning

if precision higher than max(uint32) use uint64 instead. CHECK: https://numpy.org/doc/stable/user/basics.types.html

tral.repeat.repeat_pvalue.empirical_list(l, n, sequence_type='AA', score_type='phylo')[source]
Load and return a numpy list with 10,000 empirical values of the user defined
distribution from an external file.

If no distribution is available for a given l and n, the closest values for l and n are chosen instead.

For the standard model score “phylo”, currently values are available until

  • l_max = 99
  • n_max = 49
  • total_repeat_length_max = 1000

For l <= l_max and n <= n_max and n*l <= total_repeat_length_max all values are available.

For other values, we use the closest distribution, assuming that values change little (heuristic assumption!).

Parameters:
  • l (int) – The length of the repeat unit.
  • n (int) – The number of repeat units.
  • sequence_type (str) – The type of the sequence: either “AA” or “DNA”.
  • score_type – The model of repeat evolution used. E.g. “phylo”.
Returns:

A numpy list of length 10.000

Todo

Define “phylo” model.

Todo

Perhaps add distributions beyond l_max, n_max.

tral.repeat.repeat_pvalue.gap_penalty(tandemrepeat, mu)[source]

Calculate the gap penalty for a Repeat given mutation rate mu.

Parameters:
  • tandemrepeat (Repeat) – A Repeat instance.
  • mu (float) – The mutation rate.
Returns:

Gap penalty (float)

Todo

Define mu more precisely.

Todo

Is this function called from anywhere? In case, consider refactoring.

tral.repeat.repeat_pvalue.pvalue_from_empirical_list(tandemrepeat, score_type='phylo', score_value=None, empirical=[])[source]

Calculates the p-Value of a score_type for the given tandemrepeat.

The p-Value is the number of scores for comparable tandem repeats, i.e. of same repeat unit length and repeat unit copy number that are as good or better.

Parameters:
  • tandemrepeat (Repeat) – An instance of the Repeat class.
  • score_type – The model of repeat evolution used. E.g. “phylo”.
  • score_value (float) – The value of the score of the Repeat instance.
  • empirical – The null distribution of the score for repeats of the same type.
Returns:

A float.

Return type:

pvalue

tral.repeat.repeat_pvalue.pvalue_pars(tandemrepeat)[source]

Calculate the p-Value of the parsimony score for a Repeat.

Retrieve the probability density function for repeats of the same type as tandemrepeat. Then, calculate the p-Value given this probability density function, and the parsimony score of tandemrepeat.

Parameters:tandemrepeat (Repeat) – A Repeat instance.
Returns:p-Value (float)

Todo

Check the method’s behaviour if tandemrepeat s parsimony score has not been calculated before.

Todo

Check exception: if pdf == False: return 1.

tral.repeat.repeat_pvalue.pvalue_psim(tandemrepeat)[source]

Calculate the p-Value of the pSim score for a Repeat.

Retrieve the probability density function for repeats of the same type as tandemrepeat. Then, calculate the p-Value given this probability density function, and the pSim score of tandemrepeat.

Parameters:tandemrepeat (Repeat) – A Repeat instance.
Returns:p-Value (float)

Todo

Check the method’s behaviour if tandemrepeat s parsimony score has not been calculated before.

Todo

Check exception: if pdf == False: return 1.

Todo

Describe precision.

repeat_score

synopsis:The repeat score module.
tral.repeat.repeat_score.K80(kappa=2.59)[source]
tral.repeat.repeat_score.LG()[source]
tral.repeat.repeat_score.TN93(alpha_1=0.3, alpha_2=0.4, beta=0.7)[source]
tral.repeat.repeat_score.calculate_pdf_scores(repeats, result_file_path, file_name, scoreslist=['phylo_gap01'])[source]

CALCULATE THE PROBABILITY DISTRIBUTION OF SCORES

tral.repeat.repeat_score.entropy(column)[source]

Calculate the entropy of a list.

Calculate the entropy of list. Assume that the frequency of elements in the list reflects a probability density estimator.

For an exact description see:

Schaper, E., Kajava, A., Hauser, A., & Anisimova, M. Repeat or not repeat? –Statistical validation of tandem repeat prediction in genomic sequences. Nucleic Acids Research (2012).

Parameters:column (list of str) – List of elements that can be ordered in a set, e.g. str or int.
Returns:The entropy of column.
Return type:float
tral.repeat.repeat_score.load_equilibrium_freq(file_name)[source]

Load equilibrium frequencies from a substitution matrix file.

Load equilibrium frequencies from a substitution matrix file. Note that with minor changes, the whole amino acid substitution matrix could be returned.

Parameters:file_name (str) – Path to substitution matrix file, E.g. LG.dat.
Returns:Dictionary of one letter amino acid codes and float value
Return type:(Dict of str, float)
tral.repeat.repeat_score.loglikelihood_gaps_starphylogeny_zipfian(t, tandem_repeat, indel_rate_per_site=0.01, gaps='row_wise', indel_zipf=1.821)[source]
Calculate the log-likelihood of the gap structure in a
tandem repeat assuming a star tree.

Calculate the log-likelihood of the gap structure in a tandem repeat assuming a star tree (i.e. repeat unit correlations are not modeled). The gap length model is a zipfian distribution with parameter indel_zipf. The gap generation model is a simple poisson model with rate parameter indel_rate_per_site.

Parameters:
  • t (float) – The divergence of the tandem_repeat units.
  • tandem_repeat (Repeat) – An instance of the Repeat class.
  • indel_rate_per_site (float) – The mutation rate of insertions and deletions, as opposed to t.
  • gaps (str) – The mode of gap counting. Options are “row_wise”,
  • "ignore_trailing_gaps" and ("ignore_coherent_deletions",) –
  • "ignore_trailing_gaps_and_coherent_deletions".
  • indel_zipf (float) – The parameter of the Zipfian distribution.
Returns:

The loglikelihood value for the gap structure of the tandem repeat.

Return type:

float

Todo

USE LOG EARLIER, NOT JUST IN RETURN STATEMENT

tral.repeat.repeat_score.loglikelihood_substitution(t, Q, eq_freq, alphabet, tandem_repeat)[source]
Calculate the likelihood of a repeat assuming a star tree and a
sequence model defined by Q, t, eq_freq and alphabet.
Parameters:
  • t (float) – The divergence of the tandem_repeat units.
  • Q (dict ?) – A substitution matrix.
  • eq_freq (dict ?) – Equilibrium frequencies of all chars.
  • alphabet (dict of str,?) – An alphabet from a substitution matrix.
  • tandem_repeat (Repeat) – An instance of the Repeat class.
Returns:

The loglikelihood value for the tandem repeat.

Return type:

float

tral.repeat.repeat_score.mean_similarity(repeat, measure_of_similarity, ignore_gaps=True)[source]
Calculate the mean similarity of all columns in the multiple sequence
alignment in repeat according to a measure_of_similarity.
Parameters:
  • repeat (Repeat) – An instance of the Repeat class.
  • measure_of_similarity (str) – Either “entropy”, “parsimony”, or “pSim”.
  • ignore_gaps (boolean) – Are gaps, “-” treated as chars (False) or ignored (True).
Returns:

A similarity measure for the repeat units in repeat.

Return type:

float

tral.repeat.repeat_score.optimisation(function, args, start_min=0.5, start_max=1.5, n_iteration=14)[source]

Binary one-dimensional optimisation of function.

Perform a binary one dimensional optimisation of parameter t on function with the additional list of arguments/parameters args. Start on a given range of start, and iterate n_iteration times. Return the maximum of the function, together with the maximum parameter t as a tuple.

In addition to start_min and start_max, additional hard boundaries on the optimisation parameter t are hard-coded: t E [0.0000000001, 100]

Parameters:
  • function (method) – Method with input (float, args) and output float.
  • args (list of items) – List of arguments for function.
Kwargs:
start_min (float): Minimum value of function parameter t. start_max (float): Maximum value of function parameter t. n_iteration (int): Number of iterations until optimisation results are returned.
Returns:float, Corresponding function parameter: float)
Return type:tuple (Maximum of function

Todo

Check why region_bounded is hardcoded at current.

Todo

Check why stepsize is hardcoded at current.

Todo

Generalise hard-coded optimisation parameter boundaries.

tral.repeat.repeat_score.pSim(column)[source]

Calculate the pSim score of a list.

Calculate the pSim score on column: Count the frequency of the most frequent element in a column of a multiple sequence alignment.

For an exact description see:

Schaper, E., Kajava, A., Hauser, A., & Anisimova, M. Repeat or not repeat? –Statistical validation of tandem repeat prediction in genomic sequences. Nucleic Acids Research (2012).

Parameters:column (list of str) – A list of elements.
Returns:The pSim score of column.
Return type:float
tral.repeat.repeat_score.parsimony(column)[source]

Calculate the parsimony score of a list.

Calculate the parsimony score on column: Count the number of changes in a column of a multiple sequence alignment and divide by the maximally possible number of changes.

For an exact description see:

Schaper, E., Kajava, A., Hauser, A., & Anisimova, M. Repeat or not repeat? –Statistical validation of tandem repeat prediction in genomic sequences. Nucleic Acids Research (2012).

Parameters:
  • column (list of str) – A list of two or more elements that can be
  • in a set, e.g. str or int (ordered) –
Returns:

The parsimony score of column.

Return type:

float

Todo

At which step shall we assert that the length of column > 1?