Repeat code documentation

The repeat submodule documentation.

repeat

synopsis

The Repeat Class.

class tral.repeat.repeat.Repeat(msa, begin=None, sequence_type='AA', scoreslist=['phylo_gap01'], calc_score=False, calc_pvalue=False)[source]

Tandem repeats are consecutive copies of genomic sequence.

A tandem repeat is stored in terms of a multiple sequence alignment of its tandem repeat units. For example, the amino acid tandem repeat KLMKLMKLKLM can be displayed as a multiple sequence alignment (here: “multiple repeat unit alignment”) KLM KLM KL- KLM and can be stored as [“KLM”, “KLM”, “KL-“, “KLM”]. The - char stands for a gap: It indicates that either, there used to be a char at the same position, but it got deleted. Or, all other chars in the same column got inserted at some point.

For research on tandem repeats, many different statistics need to be calculated on the multiple sequence alignment of the tandem repeat.

[Description of these statistics]

msa_original

The original alignment of repeat units.

Type

list of str

msa

The alignment of repeat units without rare chars.

Type

list of str

msaT

The transposed alignment of repeat units. E.g. [“KKKK”, “LLLL”, “MM-M”].

Type

list of str

sequence_type

“AA” or “DNA”

Type

str

n

The number of repeat units in the original msa. E.g. 4.

Type

int

l_msa

The length of the repeat units in the original msa. E.g. 3.

Type

int

l_effective

The number of columns in the original msa not counting insertion columns (i.e. columns with more gaps than chars).

Type

int

n_effective

The number of chars in the MSA without insertion columns divided by n_effective.

Type

float

text

The repeat region of the tandem repeats (e.g. all chars in the MSA without gaps concatenated.)

Type

str

n_gaps

The number of gaps in msa

Type

int

repeat_region_length

The lengths of text

Type

int

calc_index_msa()[source]

formerly named: set_begin

calc_n_effective()[source]

Calculate the effective number of repeat units n_effective

Calculate the number of letters in in all non-insertion columns, divided by <l_effective>

Parameters

self (Repeat instance) –

calculate_pvalues(scoreslist=['phylo_gap01'])[source]

Calculate pvalues on a Repeat instance.

Calculate pvalues on a Repeat instance for all scores in scoreslist.

Parameters

self (Repeat) – An instance of the repeat class.

Kwargs:

scoreslist(list of Str): A list of the names of the scores (e.g. [“phylo_gap01”]) for which p-Values are calculated on the Repeat instance self

calculate_scores(scoreslist=['phylo_gap01'])[source]

Calculate scores on a Repeat instance.

Calculate scores on a Repeat instance for all scores in scoreslist.

Parameters

self (Repeat) – An instance of the repeat class.

Kwargs:
scoreslist(list of str): A list of the names of the scores (e.g.

[“phylo_gap01”]) that are calculated on the Repeat instance self

create(input_format, **kwargs)[source]

Read tandem repeat from file.

Read tandem repeat from file (currently, only pickle is supported)

Parameters
  • format (str) – Currently only “pickle”

  • file (str) – Path to output file

Todo

Write checks for format and file.

Todo

Write tests.

Todo

Can repeat_io.read_fasta be replaced by e.g. BioPython seqIO?

delete_insertion_columns()[source]
Create the tandem repeat attributes msa, masT, l and n

without insertion columns.

We define insertion columns as columns where there are more or equal gaps compared to chars. These columns are removed from both msa to create msaD and msaT to create msaTD.

gap_structure()[source]

Calculate the number and length of insertions and deletions for this tandem repeat.

gap_structure_HMM()[source]
Calculate for each column in the transposed multiple repeat unit

alignment msaTD

  • how many deletions of what length begin in this column?:

    deletion_lengths`[`iColumn] = list of deletion lengths

  • how many insertions of what length begin in potential insertion columns before this column?: insertion_lengths`[`iColumn] = list of insertion lengths

Parameters

self (Repeat) – An instance of the repeat class.

realign_TR(realignment='proPIP', rate_distribution='constant')[source]
Realign multiple sequence alignment of tandem repeats.

Update of tandem repeat MSA.

Parameters
  • self (Repeat instance) –

  • realignment (str) – either “proPIP” or “mafft”

  • rate_distribution (str) – either “constant” or “gamma” (per default value from REPEAT_CONFIG[‘castor_parameter’][‘rate_distribution’])

save_original_msa(sequence)[source]

recalculate the original msa, assuming that begin is defined correctly (index starting counting on 1.) This function might be obsolete when the original MSA is saved from the beginning.

write(file, file_format)[source]

Write tandem repeat to file.

Write tandem repeat to file using one of three formats.

Parameters
  • file (str) – Path to input file

  • file_format (str) – Either “fasta”, “pickle” or “stockholm”

Todo

Write checks for format and file.

tral.repeat.repeat.calculate_position_in_alignment(begin, length, alignment)[source]
Calculate the index of the begin and the end of a TR within an alignment

returns

Calculate the index of the begin and the end of a TR within an alignment returns

Parameters
  • begin (int) – (needs explaining)

  • length (int) – (needs explaining)

  • alignment – (needs explaining)

Returns

Dict

repeat_align

synopsis

Alignment of tandem repeat units.

tral.repeat.repeat_align.realign_repeat(my_msa, realignment='mafft', sequence_type='AA', rate_distribution='constant', user_path=None)[source]

Realignment of a repeat MSA using mafft or proPIP

A good multiple sequence alignment (MSA) is an important description of a tandem repeat (TR) and can be crucial to distinguish a true positive from a false positive TR. This can be realised using a proper MSA estimation algorithm. Currently, Mafft and proPIP can be used to realign a TR.

For proPIP the rate distribution for indels can be either constant or gamma distributed
  • REPEAT_CONFIG[‘castor_parameter’][‘rate_distribution’] == ‘constant’: no gamma distribution is used for the proPIP algorithm

  • REPEAT_CONFIG[‘castor_parameter’][‘rate_distribution’] == ‘gamma’: gamma distribution (n=6, alpha=0.5)

Parameters
  • my_msa (list of strings) – List of sequences (str)

  • realignment (str) – Either “mafft” or “proPIP”

  • sequence_type (str) – Either “AA” or “DNA”

  • user_path (str) – copy alignment files to user defined path

Returns

my_msa realigned (list of strings)

Todo

  • decide what happens if branch length of tree is getting zero

tral.repeat.repeat_align.remove_char(str, n)[source]

helper function to remove a character in a sequence

tral.repeat.repeat_align.remove_gaps(msa)[source]

Remove columns in a MSA which contain only gaps. Do not use MSA longer than 3000 characters.

Returns the MSA as a list without only-gap-containing columns.

Parameters

msa (list of strings) – List of sequences (str)

Returns

msa (list of strings) with removed colums in case they only contain gaps

repeat_exon

tral.repeat.repeat_exon.get_exon_measures(my_TR, exon_structure)[source]

Calculate diverse exon measures

Exon structure parameters: # exon_structure = [{‘Exon Chr End (bp)’: ‘43478424’, ‘Exon Chr Start (bp)’: ‘43477252’, ‘CDS End’: ‘272’, ‘CDS Start’: ‘1’}, {‘Exon Chr End (bp)’: ‘43476658’, ‘Exon Chr Start (bp)’: ‘43476498’, ‘CDS End’: ‘433’, ‘CDS Start’: ‘273’}, {‘Exon Chr End (bp)’: ‘43476158’, ‘Exon Chr Start (bp)’: ‘43476036’, ‘CDS End’: ‘556’, ‘CDS Start’: ‘434’}, {‘Exon Chr End (bp)’: ‘43475664’, ‘Exon Chr Start (bp)’: ‘43475564’, ‘CDS End’: ‘657’, ‘CDS Start’: ‘557’}, {‘Exon Chr End (bp)’: ‘43475416’, ‘Exon Chr Start (bp)’: ‘43475194’, ‘CDS End’: ‘880’, ‘CDS Start’: ‘658’}, {‘Exon Chr End (bp)’: ‘43475046’, ‘Exon Chr Start (bp)’: ‘43474707’, ‘CDS End’: ‘951’, ‘CDS Start’: ‘881’}]

tral.repeat.repeat_exon.get_exon_structure(my_TR, exon_structure)[source]

Exon structure parameters: # exon_structure = [{‘Exon Chr End (bp)’: ‘43478424’, ‘Exon Chr Start (bp)’: ‘43477252’, ‘CDS End’: ‘272’, ‘CDS Start’: ‘1’}, {‘Exon Chr End (bp)’: ‘43476658’, ‘Exon Chr Start (bp)’: ‘43476498’, ‘CDS End’: ‘433’, ‘CDS Start’: ‘273’}, {‘Exon Chr End (bp)’: ‘43476158’, ‘Exon Chr Start (bp)’: ‘43476036’, ‘CDS End’: ‘556’, ‘CDS Start’: ‘434’}, {‘Exon Chr End (bp)’: ‘43475664’, ‘Exon Chr Start (bp)’: ‘43475564’, ‘CDS End’: ‘657’, ‘CDS Start’: ‘557’}, {‘Exon Chr End (bp)’: ‘43475416’, ‘Exon Chr Start (bp)’: ‘43475194’, ‘CDS End’: ‘880’, ‘CDS Start’: ‘658’}, {‘Exon Chr End (bp)’: ‘43475046’, ‘Exon Chr Start (bp)’: ‘43474707’, ‘CDS End’: ‘951’, ‘CDS Start’: ‘881’}]

tral.repeat.repeat_exon.logger = <Logger tral.repeat.repeat_exon (ERROR)>

Functions to check whether the exon structure and the repeat unit structure are correlated.

tral.repeat.repeat_exon.print_exon_structure(my_TR, exon_structure)[source]

Exon structure parameters: # exon_structure = [{‘Exon Chr End (bp)’: ‘43478424’, ‘Exon Chr Start (bp)’: ‘43477252’, ‘CDS End’: ‘272’, ‘CDS Start’: ‘1’}, {‘Exon Chr End (bp)’: ‘43476658’, ‘Exon Chr Start (bp)’: ‘43476498’, ‘CDS End’: ‘433’, ‘CDS Start’: ‘273’}, {‘Exon Chr End (bp)’: ‘43476158’, ‘Exon Chr Start (bp)’: ‘43476036’, ‘CDS End’: ‘556’, ‘CDS Start’: ‘434’}, {‘Exon Chr End (bp)’: ‘43475664’, ‘Exon Chr Start (bp)’: ‘43475564’, ‘CDS End’: ‘657’, ‘CDS Start’: ‘557’}, {‘Exon Chr End (bp)’: ‘43475416’, ‘Exon Chr Start (bp)’: ‘43475194’, ‘CDS End’: ‘880’, ‘CDS Start’: ‘658’}, {‘Exon Chr End (bp)’: ‘43475046’, ‘Exon Chr Start (bp)’: ‘43474707’, ‘CDS End’: ‘951’, ‘CDS Start’: ‘881’}]

repeat_io

synopsis

Input/output for tandem repeats.

tral.repeat.repeat_io.evolved_tandem_repeats(l, n, n_samples, sequence_type, job_id='job_id', mutation_rate=50, tree='star', indel_rate_per_site=False, return_type='repeat')[source]

Simulate evolved sequences with ALF.

Simulate evolved sequences with ALF: Dalquen, D. A., Anisimova, M., Gonnet, G. H. & Dessimoz, C. ALF–a simulation framework for genome evolution. Molecular Biology and Evolution 29, 1115–1123 (2012).

Parameters
  • l (int) – The length of the repeat unit

  • n (int) – The number of repeat units in the tandem repeat

  • n_samples (int) – The number of samples

  • sequence_type (str) – Either “AA” or “DNA”

  • job_id (str) – A tag for files produces with ALF, and result files.

  • mutation_rate (float) – The mutation rate.

  • tree (str) – The type of tree, e.g. “star” or “birthdeath”

  • indel_rate_per_site (int or False) – The indel rate per site.

  • return_type (str) – Either “repeat” or “list”

  • sequence_length (int) – The total length of the simulated sequence

Returns

Return type depends on return_type: Repeat or Bio.Seq.Seq instance.

tral.repeat.repeat_io.random_sequence(n_samples, sequence_type='AA', return_type='repeat', equilibrium_frequencies='human', l=0, n=0, sequence_length=0)[source]

Simulate random sequence locally.

Simulate random sequence locally.

Parameters
  • n_samples (int) – The number of samples

  • sequence_type (str) – Either “AA” or “DNA”

  • return_type (str) – Either “repeat” or “list”

  • equilibrium_frequencies (str) – Only “human” option available at current

  • l (int) – The length of the repeat unit

  • n (int) – The number of repeat units in the tandem repeat

  • sequence_length (int) – The total length of the simulated sequence

Returns

Return type depends on return_type.

tral.repeat.repeat_io.read_fasta(seq_filename, sequence_type='AA')[source]

Read repeat from file in fasta format.

Read repeat from file in fasta format.

Parameters
  • seq_filename (str) – Path to the repeats containing fasta file

  • sequence_type (str) – Either “AA” or “DNA”

Returns

A list of Repeat instances.

Return type

(list of Repeat)

tral.repeat.repeat_io.save_repeat_fasta(tandem_repeats, file)[source]

save multiple <tandem_repeats> in Fasta format in specified <file>

At current, only one TR per sequence can be defined, as the identifiers in the dict <tandem_repeats> must be unique.

Parameters: Dict of tandem repeats and identifiers.

e.g. {‘ENSP00012’: msa1, ‘ENSP00013’: msa2}

>ID GHKI GHKI GH–

tral.repeat.repeat_io.save_repeat_stockholm(tandem_repeat, file)[source]

save <tandem_repeat> in STOCKHOLM format in specified <file>

Parameters: Tandem repeat MSA.

e.g. [“ACDEF-“, “ACCDEF”]

More information in ftp://selab.janelia.org/pub/software/hmmer3/3.0/Userguide.pdf STOCKHOLM Format Example:

# STOCKHOLM 1.0 seq1 ACDEF…GHIKL seq2 ACDEF…GHIKL seq3 …EFMNRGHIKL

seq1 MNPQTVWY seq2 MNPQTVWY seq3 MNPQT… //

tral.repeat.repeat_io.save_repeat_treks(tandem_repeats, file)[source]

Save multiple tandem_repeats in T-REKS format in specified file

At current, only one TR per sequence can be defined, as the identifiers in the dict <tandem_repeats> must be unique.

Parameters: Dict of tandem repeats and identifiers.

e.g. {‘ENSP00012’: [msa1, begin1], ‘ENSP00013’: [msa2, begin2]}

T-REKS format example:

>a Length: 3 residues - nb: 3 from 1 to 10 - Psim:0.8076923076923077 region Length:42 GHKI GHKI GH– ******************

Length: 3 residues - nb: 2 from 21 to 27 - Psim:0.7857142857142857 region Length:22 GHKI GH– ******************

>b Length: 3 residues - nb: 3 from 1 to 10 - Psim:0.8095238095238095 region Length:20 UIR UIR UIR

repeat_pvalue

synopsis

The repeat pvalue module.

tral.repeat.repeat_pvalue.calc_pvalues(repeats, result_file_path, file_name, scoretypes=['phylo', 'phylo_gap'], gappy_data=False)[source]

Create and save a null distribution for repeats scores for p-Value calculation.

You can use a different null distribution for each column of different length for both the parsimony and the pSim score. Therefore, calculate the structure of the TR before you calculate the null distribution of tandem repeat scores of tandem repeats with the same gap distribution.

Parameters
  • repeats (list of Repeat) – A list of Repeat instances.

  • result_file_path (str) – Path to result folder.

  • file_name (str) – Name of result file.

  • scoretypes (list of str) – List of scores. E.g. [‘phylo’,’phylo_gap’].

  • gappy_data (bool) – True if any of repeats contain gaps, else False.

tral.repeat.repeat_pvalue.calculate_repeat_structure(tandemrepeat)[source]

Calculate the number of columns with a certain number of gaps for each column in a Repeat instance.

You can use a different null distribution for each column of different length for both the parsimony and the pSim score. Therefore, calculate the structure of the TR before you calculate the null distribution of tandem repeat scores of tandem repeats with the same gap distribution.

Parameters
  • n (int) – The number of repeat units.

  • score – The heuristic model used. E.g. “psim”.

  • sequence_type (str) – The type of the sequence: either “AA” or “DNA”.

Returns

l_repeat_structure (list) n_repeat_structure (int)

Todo

Include exact definition of returned values.

tral.repeat.repeat_pvalue.column_pdf(n, score_type='psim', sequence_type='AA')[source]
Load and return the probability density function of the score on random

sequence_type data of length n.

Load and return the probability density function of the score on random sequence_type data of length n. This method handels scores with distributions independent of l. This includes all parsimony scores, and excludes model based scores.

The order of the values in the probability density function is WORST FIRST: The first values describe the probability for “bad” scores, the last values the probabilities for “top” scores.

Currently, for all models values are available until

  • n_max = 150

For n <= n_max all values are available. If no distribution is available for a given n the closest values for n is used.

For values of n above n_max, the n_max pdf is returned.

Parameters
  • n (int) – The number of repeat units.

  • score_type – The heuristic model used. E.g. “psim”.

  • sequence_type (str) – The type of the sequence: either “AA” or “DNA”.

Returns

A numpy list of length 10.000

Todo

Check return type.

Todo

Check how this method concurs with the empirical_list method.

tral.repeat.repeat_pvalue.d_average_multinom(l, n, sequence_type, score_type)[source]

Helper function for the analytic calculation of parsimony or pSim scores.

Called by d_average_multiple_max_multinom or d_average_multiple_pars_multinom.

Return the l-times self-convoluted pdf of the score on random sequence_type data of length n. The order of the pdf is kept from column_pdf().

For details see:

Schaper, E., Kajava, A., Hauser, A., & Anisimova, M. Repeat or not repeat? –Statistical validation of tandem repeat prediction in genomic sequences. Nucleic Acids Research (2012).

Parameters
  • l (int) – The length of the repeat unit.

  • n (int) – The number of repeat units.

  • sequence_type (str) – The type of the sequence: either “AA” or “DNA”.

  • score_type – The model of repeat evolution used. Either ‘psim’ or ‘parsimony’.

Returns

(description missing)

Warning

if precision higher than max(uint32) use uint64 instead. CHECK: http://docs.scipy.org/doc/numpy/user/basics.types.html

Todo

Describe return value.

tral.repeat.repeat_pvalue.d_average_multiple_max_multinom(tandemrepeat, precision=10000.0)[source]
Calculate null distribution for e.g. the pSim score for repeats of type tandemrepeat

as a probability density function.

Analytically calculate the p-Value distribution for the pSim score for repeats of type tandemrepeat. The derivation is described in:

Schaper, E., Kajava, A., Hauser, A., & Anisimova, M. Repeat or not repeat? –Statistical validation of tandem repeat prediction in genomic sequences. Nucleic Acids Research (2012).

Parameters
  • tandemrepeat (Repeat) – A Repeat instance.

  • precision (float) – The precision of the returned probability density function in

  • of the length of the resulting list. (terms) –

Returns

cumulated probabilities from 0 to 1. unnamed (list of float): scores corresponding to the probabilities in p.

Return type

p (list of float)

Warning

if precision higher than max(uint32) use uint64 instead. CHECK: http://docs.scipy.org/doc/numpy/user/basics.types.html

tral.repeat.repeat_pvalue.d_average_multiple_pars_multinom(tandemrepeat, precision=10000.0)[source]
Calculate null distribution for the parsimony score for repeats of type tandemrepeat

as a probability density function.

Analytically calculate the p-Value distribution for the parsimony score for repeats of type tandemrepeat. The derivation is described in:

Schaper, E., Kajava, A., Hauser, A., & Anisimova, M. Repeat or not repeat? –Statistical validation of tandem repeat prediction in genomic sequences. Nucleic Acids Research (2012).

Parameters
  • tandemrepeat (Repeat) – A Repeat instance.

  • precision (float) – The precision of the returned probability density function in

  • of the length of the resulting list. (terms) –

Returns

cumulated probabilities from 0 to 1. unnamed (list of float): scores corresponding to the probabilities in p.

Return type

p (list of float)

Warning

if precision higher than max(uint32) use uint64 instead. CHECK: http://docs.scipy.org/doc/numpy/user/basics.types.html

tral.repeat.repeat_pvalue.empirical_list(l, n, sequence_type='AA', score_type='phylo')[source]
Load and return a numpy list with 10,000 empirical values of the user defined

distribution from an external file.

If no distribution is available for a given l and n, the closest values for l and n are chosen instead.

For the standard model score “phylo”, currently values are available until

  • l_max = 99

  • n_max = 49

  • total_repeat_length_max = 1000

For l <= l_max and n <= n_max and n*l <= total_repeat_length_max all values are available.

For other values, we use the closest distribution, assuming that values change little (heuristic assumption!).

Parameters
  • l (int) – The length of the repeat unit.

  • n (int) – The number of repeat units.

  • sequence_type (str) – The type of the sequence: either “AA” or “DNA”.

  • score_type – The model of repeat evolution used. E.g. “phylo”.

Returns

A numpy list of length 10.000

Todo

Define “phylo” model.

Todo

Perhaps add distributions beyond l_max, n_max.

tral.repeat.repeat_pvalue.gap_penalty(tandemrepeat, mu)[source]

Calculate the gap penalty for a Repeat given mutation rate mu.

Parameters
  • tandemrepeat (Repeat) – A Repeat instance.

  • mu (float) – The mutation rate.

Returns

Gap penalty (float)

Todo

Define mu more precisely.

Todo

Is this function called from anywhere? In case, consider refactoring.

tral.repeat.repeat_pvalue.pvalue_from_empirical_list(tandemrepeat, score_type='phylo', score_value=None, empirical=[])[source]

Calculates the p-Value of a score_type for the given tandemrepeat.

The p-Value is the number of scores for comparable tandem repeats, i.e. of same repeat unit length and repeat unit copy number that are as good or better.

Parameters
  • tandemrepeat (Repeat) – An instance of the Repeat class.

  • score_type – The model of repeat evolution used. E.g. “phylo”.

  • score_value (float) – The value of the score of the Repeat instance.

  • empirical – The null distribution of the score for repeats of the same type.

Returns

A float.

Return type

pvalue

tral.repeat.repeat_pvalue.pvalue_pars(tandemrepeat)[source]

Calculate the p-Value of the parsimony score for a Repeat.

Retrieve the probability density function for repeats of the same type as tandemrepeat. Then, calculate the p-Value given this probability density function, and the parsimony score of tandemrepeat.

Parameters

tandemrepeat (Repeat) – A Repeat instance.

Returns

p-Value (float)

Todo

Check the method’s behaviour if tandemrepeat s parsimony score has not been calculated before.

Todo

Check exception: if pdf == False: return 1.

tral.repeat.repeat_pvalue.pvalue_psim(tandemrepeat)[source]

Calculate the p-Value of the pSim score for a Repeat.

Retrieve the probability density function for repeats of the same type as tandemrepeat. Then, calculate the p-Value given this probability density function, and the pSim score of tandemrepeat.

Parameters

tandemrepeat (Repeat) – A Repeat instance.

Returns

p-Value (float)

Todo

Check the method’s behaviour if tandemrepeat s parsimony score has not been calculated before.

Todo

Check exception: if pdf == False: return 1.

Todo

Describe precision.

repeat_score

synopsis

The repeat score module.

tral.repeat.repeat_score.K80(kappa=2.59)[source]
tral.repeat.repeat_score.LG()[source]
tral.repeat.repeat_score.TN93(alpha_1=0.3, alpha_2=0.4, beta=0.7)[source]
tral.repeat.repeat_score.calculate_pdf_scores(repeats, result_file_path, file_name, scoreslist=['phylo_gap01'])[source]

CALCULATE THE PROBABILITY DISTRIBUTION OF SCORES

tral.repeat.repeat_score.entropy(column)[source]

Calculate the entropy of a list.

Calculate the entropy of list. Assume that the frequency of elements in the list reflects a probability density estimator.

For an exact description see:

Schaper, E., Kajava, A., Hauser, A., & Anisimova, M. Repeat or not repeat? –Statistical validation of tandem repeat prediction in genomic sequences. Nucleic Acids Research (2012).

Parameters

column (list of str) – List of elements that can be ordered in a set, e.g. str or int.

Returns

The entropy of column.

Return type

float

tral.repeat.repeat_score.load_equilibrium_freq(file_name)[source]

Load equilibrium frequencies from a substitution matrix file.

Load equilibrium frequencies from a substitution matrix file. Note that with minor changes, the whole amino acid substitution matrix could be returned.

Parameters

file_name (str) – Path to substitution matrix file, E.g. LG.dat.

Returns

Dictionary of one letter amino acid codes and

float value

Return type

(Dict of str, float)

tral.repeat.repeat_score.loglikelihood_gaps_starphylogeny_zipfian(t, tandem_repeat, indel_rate_per_site=0.01, gaps='row_wise', indel_zipf=1.821)[source]
Calculate the log-likelihood of the gap structure in a

tandem repeat assuming a star tree.

Calculate the log-likelihood of the gap structure in a tandem repeat assuming a star tree (i.e. repeat unit correlations are not modeled). The gap length model is a zipfian distribution with parameter indel_zipf. The gap generation model is a simple poisson model with rate parameter indel_rate_per_site.

Parameters
  • t (float) – The divergence of the tandem_repeat units.

  • tandem_repeat (Repeat) – An instance of the Repeat class.

  • indel_rate_per_site (float) – The mutation rate of insertions and deletions, as opposed to t.

  • gaps (str) – The mode of gap counting. Options are “row_wise”,

  • "ignore_trailing_gaps" and ("ignore_coherent_deletions",) –

  • "ignore_trailing_gaps_and_coherent_deletions".

  • indel_zipf (float) – The parameter of the Zipfian distribution.

Returns

The loglikelihood value for the gap structure of the tandem

repeat.

Return type

float

Todo

USE LOG EARLIER, NOT JUST IN RETURN STATEMENT

tral.repeat.repeat_score.loglikelihood_substitution(t, Q, eq_freq, alphabet, tandem_repeat)[source]
Calculate the likelihood of a repeat assuming a star tree and a

sequence model defined by Q, t, eq_freq and alphabet.

Parameters
  • t (float) – The divergence of the tandem_repeat units.

  • Q (dict ?) – A substitution matrix.

  • eq_freq (dict ?) – Equilibrium frequencies of all chars.

  • alphabet (dict of str,?) – An alphabet from a substitution matrix.

  • tandem_repeat (Repeat) – An instance of the Repeat class.

Returns

The loglikelihood value for the tandem repeat.

Return type

float

tral.repeat.repeat_score.mean_similarity(repeat, measure_of_similarity, ignore_gaps=True)[source]
Calculate the mean similarity of all columns in the multiple sequence

alignment in repeat according to a measure_of_similarity.

Parameters
  • repeat (Repeat) – An instance of the Repeat class.

  • measure_of_similarity (str) – Either “entropy”, “parsimony”, or “pSim”.

  • ignore_gaps (boolean) – Are gaps, “-” treated as chars (False) or ignored (True).

Returns

A similarity measure for the repeat units in repeat.

Return type

float

tral.repeat.repeat_score.optimisation(function, args, start_min=0.5, start_max=1.5, n_iteration=14)[source]

Binary one-dimensional optimisation of function.

Perform a binary one dimensional optimisation of parameter t on function with the additional list of arguments/parameters args. Start on a given range of start, and iterate n_iteration times. Return the maximum of the function, together with the maximum parameter t as a tuple.

In addition to start_min and start_max, additional hard boundaries on the optimisation parameter t are hard-coded: t E [0.0000000001, 100]

Parameters
  • function (method) – Method with input (float, args) and output float.

  • args (list of items) – List of arguments for function.

Kwargs:

start_min (float): Minimum value of function parameter t. start_max (float): Maximum value of function parameter t. n_iteration (int): Number of iterations until optimisation results are returned.

Returns

float,

Corresponding function parameter: float)

Return type

tuple (Maximum of function

Todo

Check why region_bounded is hardcoded at current.

Todo

Check why stepsize is hardcoded at current.

Todo

Generalise hard-coded optimisation parameter boundaries.

tral.repeat.repeat_score.pSim(column)[source]

Calculate the pSim score of a list.

Calculate the pSim score on column: Count the frequency of the most frequent element in a column of a multiple sequence alignment.

For an exact description see:

Schaper, E., Kajava, A., Hauser, A., & Anisimova, M. Repeat or not repeat? –Statistical validation of tandem repeat prediction in genomic sequences. Nucleic Acids Research (2012).

Parameters

column (list of str) – A list of elements.

Returns

The pSim score of column.

Return type

float

tral.repeat.repeat_score.parsimony(column)[source]

Calculate the parsimony score of a list.

Calculate the parsimony score on column: Count the number of changes in a column of a multiple sequence alignment and divide by the maximally possible number of changes.

For an exact description see:

Schaper, E., Kajava, A., Hauser, A., & Anisimova, M. Repeat or not repeat? –Statistical validation of tandem repeat prediction in genomic sequences. Nucleic Acids Research (2012).

Parameters
  • column (list of str) – A list of two or more elements that can be

  • in a set, e.g. str or int (ordered) –

Returns

The parsimony score of column.

Return type

float

Todo

At which step shall we assert that the length of column > 1?