Repeat code documentation¶
Initial version of the repeat submodule documentation.
repeat¶
synopsis: | The Repeat Class. |
---|
-
class
tral.repeat.repeat.
Repeat
(msa, begin=None, sequence_type='AA', scoreslist=['phylo_gap01'], calc_score=True, calc_pvalue=True)[source]¶ Tandem repeats are consecutive copies of genomic sequence.
A tandem repeat is stored in terms of a multiple sequence alignment of its tandem repeat units. For example, the amino acid tandem repeat KLMKLMKLKLM can be displayed as a multiple sequence alignment (here: “multiple repeat unit alignment”) KLM KLM KL- KLM and can be stored as [“KLM”, “KLM”, “KL-”, “KLM”]. The - char stands for a gap: It indicates that either, there used to be a char at the same position, but it got deleted. Or, all other chars in the same column got inserted at some point.
For research on tandem repeats, many different statistics need to be calculated on the multiple sequence alignment of the tandem repeat.
[Description of these statistics]
-
msa_original
¶ list of str
The original alignment of repeat units.
-
msa
¶ list of str
The alignment of repeat units without rare chars.
-
msaT
¶ list of str
The transposed alignment of repeat units. E.g. [“KKKK”, “LLLL”, “MM-M”].
-
sequence_type
¶ str
“AA” or “DNA”
-
n
¶ int
The number of repeat units in the original msa. E.g. 4.
-
l_msa
¶ int
The length of the repeat units in the original msa. E.g. 3.
-
l_effective
¶ int
The number of columns in the original msa not counting insertion columns (i.e. columns with more gaps than chars).
-
n_effective
¶ float
The number of chars in the MSA without insertion columns divided by n_effective.
-
text
¶ str
The repeat region of the tandem repeats (e.g. all chars in the MSA without gaps concatenated.)
-
n_gaps
¶ int
The number of gaps in
msa
-
repeat_region_length
¶ int
The lengths of text
-
calc_n_effective
()[source]¶ Calculate the effective number of repeat units n_effective
Calculate the number of letters in in all non-insertion columns, divided by <l_effective>
Parameters: self (Repeat instance) –
-
calculate_pvalues
(scoreslist=['phylo_gap01'])[source]¶ Calculate pvalues on a Repeat instance.
Calculate pvalues on a Repeat instance for all scores in scoreslist.
Parameters: self (Repeat) – An instance of the repeat class. - Kwargs:
- scoreslist(list of Str): A list of the names of the scores (e.g. [“phylo_gap01”]) for which p-Values are calculated on the Repeat instance self
-
calculate_scores
(scoreslist=['phylo_gap01'])[source]¶ Calculate scores on a Repeat instance.
Calculate scores on a Repeat instance for all scores in scoreslist.
Parameters: self (Repeat) – An instance of the repeat class. - Kwargs:
- scoreslist(list of str): A list of the names of the scores (e.g.
- [“phylo_gap01”]) that are calculated on the Repeat instance self
-
create
(file, input_format, **kwargs)[source]¶ Read tandem repeat from file.
Read tandem repeat from file (currently, only pickle is supported)
Parameters: - format (str) – Currently only “pickle”
- file (str) – Path to output file
Todo
Write checks for
format
andfile
.Todo
Write tests.
Todo
Can repeat_io.read_fasta be replaced by e.g. BioPython seqIO?
-
delete_insertion_columns
()[source]¶ - Create the tandem repeat attributes msa, masT, l and n
- without insertion columns.
We define insertion columns as columns where there are more or equal gaps compared to chars. These columns are removed from both msa to create msaD and msaT to create msaTD.
-
gap_structure
()[source]¶ Calculate the number and length of insertions and deletions for this tandem repeat.
-
gap_structure_HMM
()[source]¶ - Calculate for each column in the transposed multiple repeat unit
alignment msaTD
- how many deletions of what length begin in this column?:
deletion_lengths`[`iColumn] = list of deletion lengths
how many insertions of what length begin in potential insertion columns before this column?: insertion_lengths`[`iColumn] = list of insertion lengths
Parameters: self (Repeat) – An instance of the repeat class.
-
realign_TR
(realignment='proPIP', rate_distribution='constant')[source]¶ - Realign multiple sequence alignment of tandem repeats.
- Update of tandem repeat MSA.
Parameters: - self (Repeat instance) –
- realignment (str) – either “proPIP” or “mafft”
- rate_distribution (str) – either “constant” or “gamma” (per default value from REPEAT_CONFIG[‘castor_parameter’][‘rate_distribution’])
-
-
tral.repeat.repeat.
calculate_position_in_alignment
(begin, length, alignment)[source]¶ - Calculate the index of the begin and the end of a TR within an alignment
returns
Calculate the index of the begin and the end of a TR within an alignment returns
Parameters: - begin (int) – (needs explaining)
- length (int) – (needs explaining)
- alignment – (needs explaining)
Returns: Dict
repeat_align¶
synopsis: | Alignment of tandem repeat units. |
---|
-
tral.repeat.repeat_align.
realign_repeat
(my_msa, realignment='mafft', sequence_type='AA', rate_distribution='constant', user_path=None)[source]¶ Realignment of a repeat MSA using mafft or proPIP
A good multiple sequence alignment (MSA) is an important description of a tandem repeat (TR) and can be crucial to distinguish a true positive from a false positive TR. This can be realised using a proper MSA estimation algorithm. Currently, Mafft and proPIP can be used to realign a TR.
- For proPIP the rate distribution for indels can be either constant or gamma distributed
- REPEAT_CONFIG[‘castor_parameter’][‘rate_distribution’] == ‘constant’: no gamma distribution is used for the proPIP algorithm
- REPEAT_CONFIG[‘castor_parameter’][‘rate_distribution’] == ‘gamma’: gamma distribution (n=6, alpha=0.5)
Parameters: - my_msa (list of strings) – List of sequences (str)
- realignment (str) – Either “mafft” or “proPIP”
- sequence_type (str) – Either “AA” or “DNA”
- user_path (str) – copy alignment files to user defined path
Returns: my_msa realigned (list of strings)
Todo
- decide what happens if branch length of tree is getting zero
-
tral.repeat.repeat_align.
remove_char
(str, n)[source]¶ helper function to remove a character in a sequence
-
tral.repeat.repeat_align.
remove_gaps
(msa)[source]¶ Remove columns in a MSA which contain only gaps. Do not use MSA longer than 3000 characters.
Returns the MSA as a list without only-gap-containing columns.
Parameters: msa (list of strings) – List of sequences (str) Returns: msa (list of strings) with removed colums in case they only contain gaps
repeat_exon¶
-
tral.repeat.repeat_exon.
get_exon_measures
(my_TR, exon_structure)[source]¶ Calculate diverse exon measures
Exon structure parameters: # exon_structure = [{‘Exon Chr End (bp)’: ‘43478424’, ‘Exon Chr Start (bp)’: ‘43477252’, ‘CDS End’: ‘272’, ‘CDS Start’: ‘1’}, {‘Exon Chr End (bp)’: ‘43476658’, ‘Exon Chr Start (bp)’: ‘43476498’, ‘CDS End’: ‘433’, ‘CDS Start’: ‘273’}, {‘Exon Chr End (bp)’: ‘43476158’, ‘Exon Chr Start (bp)’: ‘43476036’, ‘CDS End’: ‘556’, ‘CDS Start’: ‘434’}, {‘Exon Chr End (bp)’: ‘43475664’, ‘Exon Chr Start (bp)’: ‘43475564’, ‘CDS End’: ‘657’, ‘CDS Start’: ‘557’}, {‘Exon Chr End (bp)’: ‘43475416’, ‘Exon Chr Start (bp)’: ‘43475194’, ‘CDS End’: ‘880’, ‘CDS Start’: ‘658’}, {‘Exon Chr End (bp)’: ‘43475046’, ‘Exon Chr Start (bp)’: ‘43474707’, ‘CDS End’: ‘951’, ‘CDS Start’: ‘881’}]
-
tral.repeat.repeat_exon.
get_exon_structure
(my_TR, exon_structure)[source]¶ Exon structure parameters: # exon_structure = [{‘Exon Chr End (bp)’: ‘43478424’, ‘Exon Chr Start (bp)’: ‘43477252’, ‘CDS End’: ‘272’, ‘CDS Start’: ‘1’}, {‘Exon Chr End (bp)’: ‘43476658’, ‘Exon Chr Start (bp)’: ‘43476498’, ‘CDS End’: ‘433’, ‘CDS Start’: ‘273’}, {‘Exon Chr End (bp)’: ‘43476158’, ‘Exon Chr Start (bp)’: ‘43476036’, ‘CDS End’: ‘556’, ‘CDS Start’: ‘434’}, {‘Exon Chr End (bp)’: ‘43475664’, ‘Exon Chr Start (bp)’: ‘43475564’, ‘CDS End’: ‘657’, ‘CDS Start’: ‘557’}, {‘Exon Chr End (bp)’: ‘43475416’, ‘Exon Chr Start (bp)’: ‘43475194’, ‘CDS End’: ‘880’, ‘CDS Start’: ‘658’}, {‘Exon Chr End (bp)’: ‘43475046’, ‘Exon Chr Start (bp)’: ‘43474707’, ‘CDS End’: ‘951’, ‘CDS Start’: ‘881’}]
-
tral.repeat.repeat_exon.
logger
= <logging.Logger object>¶ Functions to check whether the exon structure and the repeat unit structure are correlated.
-
tral.repeat.repeat_exon.
print_exon_structure
(my_TR, exon_structure)[source]¶ Exon structure parameters: # exon_structure = [{‘Exon Chr End (bp)’: ‘43478424’, ‘Exon Chr Start (bp)’: ‘43477252’, ‘CDS End’: ‘272’, ‘CDS Start’: ‘1’}, {‘Exon Chr End (bp)’: ‘43476658’, ‘Exon Chr Start (bp)’: ‘43476498’, ‘CDS End’: ‘433’, ‘CDS Start’: ‘273’}, {‘Exon Chr End (bp)’: ‘43476158’, ‘Exon Chr Start (bp)’: ‘43476036’, ‘CDS End’: ‘556’, ‘CDS Start’: ‘434’}, {‘Exon Chr End (bp)’: ‘43475664’, ‘Exon Chr Start (bp)’: ‘43475564’, ‘CDS End’: ‘657’, ‘CDS Start’: ‘557’}, {‘Exon Chr End (bp)’: ‘43475416’, ‘Exon Chr Start (bp)’: ‘43475194’, ‘CDS End’: ‘880’, ‘CDS Start’: ‘658’}, {‘Exon Chr End (bp)’: ‘43475046’, ‘Exon Chr Start (bp)’: ‘43474707’, ‘CDS End’: ‘951’, ‘CDS Start’: ‘881’}]
repeat_io¶
synopsis: | Input/output for tandem repeats. |
---|
-
tral.repeat.repeat_io.
evolved_tandem_repeats
(l, n, n_samples, sequence_type, job_id='job_id', mutation_rate=50, tree='star', indel_rate_per_site=False, return_type='repeat')[source]¶ Simulate evolved sequences with ALF.
Simulate evolved sequences with ALF: Dalquen, D. A., Anisimova, M., Gonnet, G. H. & Dessimoz, C. ALF–a simulation framework for genome evolution. Molecular Biology and Evolution 29, 1115–1123 (2012).
Parameters: - l (int) – The length of the repeat unit
- n (int) – The number of repeat units in the tandem repeat
- n_samples (int) – The number of samples
- sequence_type (str) – Either “AA” or “DNA”
- job_id (str) – A tag for files produces with ALF, and result files.
- mutation_rate (float) – The mutation rate.
- tree (str) – The type of tree, e.g. “star” or “birthdeath”
- indel_rate_per_site (int or False) – The indel rate per site.
- return_type (str) – Either “repeat” or “list”
- sequence_length (int) – The total length of the simulated sequence
Returns: Return type depends on
return_type
:Repeat
orBio.Seq.Seq
instance.
-
tral.repeat.repeat_io.
random_sequence
(n_samples, sequence_type='AA', return_type='repeat', equilibrium_frequencies='human', l=0, n=0, sequence_length=0)[source]¶ Simulate random sequence locally.
Simulate random sequence locally.
Parameters: - n_samples (int) – The number of samples
- sequence_type (str) – Either “AA” or “DNA”
- return_type (str) – Either “repeat” or “list”
- equilibrium_frequencies (str) – Only “human” option available at current
- l (int) – The length of the repeat unit
- n (int) – The number of repeat units in the tandem repeat
- sequence_length (int) – The total length of the simulated sequence
Returns: Return type depends on
return_type
.
-
tral.repeat.repeat_io.
read_fasta
(seq_filename, sequence_type='AA')[source]¶ Read repeat from file in fasta format.
Read repeat from file in fasta format.
Parameters: - seq_filename (str) – Path to the repeats containing fasta file
- sequence_type (str) – Either “AA” or “DNA”
Returns: A list of Repeat instances.
Return type: (list of Repeat)
-
tral.repeat.repeat_io.
save_repeat_fasta
(tandem_repeats, file)[source]¶ save multiple <tandem_repeats> in Fasta format in specified <file>
At current, only one TR per sequence can be defined, as the identifiers in the dict <tandem_repeats> must be unique.
- Parameters: Dict of tandem repeats and identifiers.
e.g. {‘ENSP00012’: msa1, ‘ENSP00013’: msa2}
>ID GHKI GHKI GH–
-
tral.repeat.repeat_io.
save_repeat_stockholm
(tandem_repeat, file)[source]¶ save <tandem_repeat> in STOCKHOLM format in specified <file>
- Parameters: Tandem repeat MSA.
e.g. [“ACDEF-”, “ACCDEF”]
More information in http://eddylab.org/software/hmmer3/3.1b2/Userguide.pdf STOCKHOLM Format Example:
# STOCKHOLM 1.0 seq1 ACDEF...GHIKL seq2 ACDEF...GHIKL seq3 ...EFMNRGHIKL
seq1 MNPQTVWY seq2 MNPQTVWY seq3 MNPQT... //
-
tral.repeat.repeat_io.
save_repeat_treks
(tandem_repeats, file)[source]¶ Save multiple tandem_repeats in T-REKS format in specified file
At current, only one TR per sequence can be defined, as the identifiers in the dict <tandem_repeats> must be unique.
- Parameters: Dict of tandem repeats and identifiers.
e.g. {‘ENSP00012’: [msa1, begin1], ‘ENSP00013’: [msa2, begin2]}
T-REKS format example:
>a Length: 3 residues - nb: 3 from 1 to 10 - Psim:0.8076923076923077 region Length:42 GHKI GHKI GH– ******************
Length: 3 residues - nb: 2 from 21 to 27 - Psim:0.7857142857142857 region Length:22 GHKI GH– ******************
>b Length: 3 residues - nb: 3 from 1 to 10 - Psim:0.8095238095238095 region Length:20 UIR UIR UIR
repeat_pvalue¶
synopsis: | The repeat pvalue module. |
---|
-
tral.repeat.repeat_pvalue.
calc_pvalues
(repeats, result_file_path, file_name, scoretypes=['phylo', 'phylo_gap'], gappy_data=False)[source]¶ Create and save a null distribution for
repeats
scores for p-Value calculation.You can use a different null distribution for each column of different length for both the parsimony and the pSim score. Therefore, calculate the structure of the TR before you calculate the null distribution of tandem repeat scores of tandem repeats with the same gap distribution.
Parameters: - repeats (list of Repeat) – A list of
Repeat
instances. - result_file_path (str) – Path to result folder.
- file_name (str) – Name of result file.
- scoretypes (list of str) – List of scores. E.g. [‘phylo’,’phylo_gap’].
- gappy_data (bool) – True if any of
repeats
contain gaps, else False.
- repeats (list of Repeat) – A list of
-
tral.repeat.repeat_pvalue.
calculate_repeat_structure
(tandemrepeat)[source]¶ Calculate the number of columns with a certain number of gaps for each column in a
Repeat
instance.You can use a different null distribution for each column of different length for both the parsimony and the pSim score. Therefore, calculate the structure of the TR before you calculate the null distribution of tandem repeat scores of tandem repeats with the same gap distribution.
Parameters: - n (int) – The number of repeat units.
- score – The heuristic model used. E.g. “psim”.
- sequence_type (str) – The type of the sequence: either “AA” or “DNA”.
Returns: l_repeat_structure (list) n_repeat_structure (int)
Todo
Include exact definition of returned values.
-
tral.repeat.repeat_pvalue.
column_pdf
(n, score_type='psim', sequence_type='AA')[source]¶ - Load and return the probability density function of the score on random
sequence_type
data of lengthn
.
Load and return the probability density function of the score on random
sequence_type
data of lengthn
. This method handels scores with distributions independent ofl
. This includes all parsimony scores, and excludes model based scores.The order of the values in the probability density function is WORST FIRST: The first values describe the probability for “bad” scores, the last values the probabilities for “top” scores.
Currently, for all models values are available until
n_max
= 150
For
n
<=n_max
all values are available. If no distribution is available for a givenn
the closest values forn
is used.For values of
n
aboven_max
, then_max
pdf is returned.Parameters: - n (int) – The number of repeat units.
- score_type – The heuristic model used. E.g. “psim”.
- sequence_type (str) – The type of the sequence: either “AA” or “DNA”.
Returns: A numpy list of length 10.000
Todo
Check return type.
Todo
Check how this method concurs with the
empirical_list
method.
-
tral.repeat.repeat_pvalue.
d_average_multinom
(l, n, sequence_type, score_type)[source]¶ Helper function for the analytic calculation of parsimony or pSim scores.
Called by
d_average_multiple_max_multinom
ord_average_multiple_pars_multinom
.Return the
l
-times self-convoluted pdf of the score on random sequence_type data of lengthn
. The order of the pdf is kept fromcolumn_pdf()
.For details see:
Schaper, E., Kajava, A., Hauser, A., & Anisimova, M. Repeat or not repeat? –Statistical validation of tandem repeat prediction in genomic sequences. Nucleic Acids Research (2012).
Parameters: - l (int) – The length of the repeat unit.
- n (int) – The number of repeat units.
- sequence_type (str) – The type of the sequence: either “AA” or “DNA”.
- score_type – The model of repeat evolution used. Either ‘psim’ or ‘parsimony’.
Returns: (description missing)
Warning
if precision higher than max(uint32) use uint64 instead. CHECK: https://numpy.org/doc/stable/user/basics.types.html
Todo
Describe return value.
-
tral.repeat.repeat_pvalue.
d_average_multiple_max_multinom
(tandemrepeat, precision=10000.0)[source]¶ - Calculate null distribution for e.g. the pSim score for repeats of type
tandemrepeat
- as a probability density function.
Analytically calculate the p-Value distribution for the pSim score for repeats of type
tandemrepeat
. The derivation is described in:Schaper, E., Kajava, A., Hauser, A., & Anisimova, M. Repeat or not repeat? –Statistical validation of tandem repeat prediction in genomic sequences. Nucleic Acids Research (2012).
Parameters: - tandemrepeat (Repeat) – A
Repeat
instance. - precision (float) – The precision of the returned probability density function in
- of the length of the resulting list. (terms) –
Returns: p – cumulated probabilities from 0 to 1. unnamed (list of float): scores corresponding to the probabilities in
p
.Return type: list of float
Warning
if precision higher than max(uint32) use uint64 instead. CHECK: https://numpy.org/doc/stable/user/basics.types.html
- Calculate null distribution for e.g. the pSim score for repeats of type
-
tral.repeat.repeat_pvalue.
d_average_multiple_pars_multinom
(tandemrepeat, precision=10000.0)[source]¶ - Calculate null distribution for the parsimony score for repeats of type
tandemrepeat
- as a probability density function.
Analytically calculate the p-Value distribution for the parsimony score for repeats of type
tandemrepeat
. The derivation is described in:Schaper, E., Kajava, A., Hauser, A., & Anisimova, M. Repeat or not repeat? –Statistical validation of tandem repeat prediction in genomic sequences. Nucleic Acids Research (2012).
Parameters: - tandemrepeat (Repeat) – A
Repeat
instance. - precision (float) – The precision of the returned probability density function in
- of the length of the resulting list. (terms) –
Returns: p – cumulated probabilities from 0 to 1. unnamed (list of float): scores corresponding to the probabilities in
p
.Return type: list of float
Warning
if precision higher than max(uint32) use uint64 instead. CHECK: https://numpy.org/doc/stable/user/basics.types.html
- Calculate null distribution for the parsimony score for repeats of type
-
tral.repeat.repeat_pvalue.
empirical_list
(l, n, sequence_type='AA', score_type='phylo')[source]¶ - Load and return a numpy list with 10,000 empirical values of the user defined
- distribution from an external file.
If no distribution is available for a given
l
andn
, the closest values forl
andn
are chosen instead.For the standard model score “phylo”, currently values are available until
l_max
= 99n_max
= 49total_repeat_length_max
= 1000
For
l
<=l_max
andn
<=n_max
andn*l
<=total_repeat_length_max
all values are available.For other values, we use the closest distribution, assuming that values change little (heuristic assumption!).
Parameters: - l (int) – The length of the repeat unit.
- n (int) – The number of repeat units.
- sequence_type (str) – The type of the sequence: either “AA” or “DNA”.
- score_type – The model of repeat evolution used. E.g. “phylo”.
Returns: A numpy list of length 10.000
Todo
Define “phylo” model.
Todo
Perhaps add distributions beyond l_max, n_max.
-
tral.repeat.repeat_pvalue.
gap_penalty
(tandemrepeat, mu)[source]¶ Calculate the gap penalty for a
Repeat
given mutation ratemu
.Parameters: - tandemrepeat (Repeat) – A
Repeat
instance. - mu (float) – The mutation rate.
Returns: Gap penalty (float)
Todo
Define
mu
more precisely.Todo
Is this function called from anywhere? In case, consider refactoring.
- tandemrepeat (Repeat) – A
-
tral.repeat.repeat_pvalue.
pvalue_from_empirical_list
(tandemrepeat, score_type='phylo', score_value=None, empirical=[])[source]¶ Calculates the p-Value of a score_type for the given tandemrepeat.
The p-Value is the number of scores for comparable tandem repeats, i.e. of same repeat unit length and repeat unit copy number that are as good or better.
Parameters: - tandemrepeat (Repeat) – An instance of the Repeat class.
- score_type – The model of repeat evolution used. E.g. “phylo”.
- score_value (float) – The value of the score of the Repeat instance.
- empirical – The null distribution of the score for repeats of the same type.
Returns: A float.
Return type:
-
tral.repeat.repeat_pvalue.
pvalue_pars
(tandemrepeat)[source]¶ Calculate the p-Value of the parsimony score for a Repeat.
Retrieve the probability density function for repeats of the same type as
tandemrepeat
. Then, calculate the p-Value given this probability density function, and the parsimony score oftandemrepeat
.Parameters: tandemrepeat (Repeat) – A Repeat
instance.Returns: p-Value (float) Todo
Check the method’s behaviour if
tandemrepeat
s parsimony score has not been calculated before.Todo
Check exception: if pdf == False: return 1.
-
tral.repeat.repeat_pvalue.
pvalue_psim
(tandemrepeat)[source]¶ Calculate the p-Value of the pSim score for a Repeat.
Retrieve the probability density function for repeats of the same type as
tandemrepeat
. Then, calculate the p-Value given this probability density function, and the pSim score oftandemrepeat
.Parameters: tandemrepeat (Repeat) – A Repeat
instance.Returns: p-Value (float) Todo
Check the method’s behaviour if
tandemrepeat
s parsimony score has not been calculated before.Todo
Check exception: if pdf == False: return 1.
Todo
Describe
precision
.
repeat_score¶
synopsis: | The repeat score module. |
---|
-
tral.repeat.repeat_score.
calculate_pdf_scores
(repeats, result_file_path, file_name, scoreslist=['phylo_gap01'])[source]¶ CALCULATE THE PROBABILITY DISTRIBUTION OF SCORES
-
tral.repeat.repeat_score.
entropy
(column)[source]¶ Calculate the entropy of a list.
Calculate the entropy of list. Assume that the frequency of elements in the list reflects a probability density estimator.
For an exact description see:
Schaper, E., Kajava, A., Hauser, A., & Anisimova, M. Repeat or not repeat? –Statistical validation of tandem repeat prediction in genomic sequences. Nucleic Acids Research (2012).
Parameters: column (list of str) – List of elements that can be ordered in a set, e.g. str or int. Returns: The entropy of column
.Return type: float
-
tral.repeat.repeat_score.
load_equilibrium_freq
(file_name)[source]¶ Load equilibrium frequencies from a substitution matrix file.
Load equilibrium frequencies from a substitution matrix file. Note that with minor changes, the whole amino acid substitution matrix could be returned.
Parameters: file_name (str) – Path to substitution matrix file, E.g. LG.dat. Returns: Dictionary of one letter amino acid codes and float value Return type: (Dict of str, float)
-
tral.repeat.repeat_score.
loglikelihood_gaps_starphylogeny_zipfian
(t, tandem_repeat, indel_rate_per_site=0.01, gaps='row_wise', indel_zipf=1.821)[source]¶ - Calculate the log-likelihood of the gap structure in a
tandem repeat
assuming a star tree.
Calculate the log-likelihood of the gap structure in a
tandem repeat
assuming a star tree (i.e. repeat unit correlations are not modeled). The gap length model is a zipfian distribution with parameterindel_zipf
. The gap generation model is a simple poisson model with rate parameterindel_rate_per_site
.Parameters: - t (float) – The divergence of the
tandem_repeat
units. - tandem_repeat (Repeat) – An instance of the
Repeat
class. - indel_rate_per_site (float) – The mutation rate of insertions and
deletions, as opposed to
t
. - gaps (str) – The mode of gap counting. Options are “row_wise”,
- "ignore_trailing_gaps" and ("ignore_coherent_deletions",) –
- "ignore_trailing_gaps_and_coherent_deletions". –
- indel_zipf (float) – The parameter of the Zipfian distribution.
Returns: The loglikelihood value for the gap structure of the tandem repeat.
Return type: float
Todo
USE LOG EARLIER, NOT JUST IN RETURN STATEMENT
-
tral.repeat.repeat_score.
loglikelihood_substitution
(t, Q, eq_freq, alphabet, tandem_repeat)[source]¶ - Calculate the likelihood of a repeat assuming a star tree and a
- sequence model defined by
Q
,t
,eq_freq
andalphabet
.
Parameters: - t (float) – The divergence of the
tandem_repeat
units. - Q (dict ?) – A substitution matrix.
- eq_freq (dict ?) – Equilibrium frequencies of all chars.
- alphabet (dict of str,?) – An alphabet from a substitution matrix.
- tandem_repeat (Repeat) – An instance of the
Repeat
class.
Returns: The loglikelihood value for the tandem repeat.
Return type: float
-
tral.repeat.repeat_score.
mean_similarity
(repeat, measure_of_similarity, ignore_gaps=True)[source]¶ - Calculate the mean similarity of all columns in the multiple sequence
- alignment in
repeat
according to ameasure_of_similarity
.
Parameters: - repeat (Repeat) – An instance of the Repeat class.
- measure_of_similarity (str) – Either “entropy”, “parsimony”, or “pSim”.
- ignore_gaps (boolean) – Are gaps, “-” treated as chars (False) or ignored (True).
Returns: A similarity measure for the repeat units in
repeat
.Return type: float
-
tral.repeat.repeat_score.
optimisation
(function, args, start_min=0.5, start_max=1.5, n_iteration=14)[source]¶ Binary one-dimensional optimisation of
function
.Perform a binary one dimensional optimisation of parameter
t
onfunction
with the additional list of arguments/parametersargs
. Start on a given range of start, and iteraten_iteration
times. Return the maximum of the function, together with the maximum parametert
as a tuple.In addition to
start_min
andstart_max
, additional hard boundaries on the optimisation parametert
are hard-coded: t E [0.0000000001, 100]Parameters: - function (method) – Method with input (float,
args
) and output float. - args (list of items) – List of arguments for
function
.
- Kwargs:
- start_min (float): Minimum value of function parameter
t
. start_max (float): Maximum value of function parametert
. n_iteration (int): Number of iterations until optimisation results are returned.
Returns: float, Corresponding function parameter: float) Return type: tuple (Maximum of function Todo
Check why
region_bounded
is hardcoded at current.Todo
Check why
stepsize
is hardcoded at current.Todo
Generalise hard-coded optimisation parameter boundaries.
- function (method) – Method with input (float,
-
tral.repeat.repeat_score.
pSim
(column)[source]¶ Calculate the pSim score of a list.
Calculate the pSim score on
column
: Count the frequency of the most frequent element in acolumn
of a multiple sequence alignment.For an exact description see:
Schaper, E., Kajava, A., Hauser, A., & Anisimova, M. Repeat or not repeat? –Statistical validation of tandem repeat prediction in genomic sequences. Nucleic Acids Research (2012).
Parameters: column (list of str) – A list of elements. Returns: The pSim score of column
.Return type: float
-
tral.repeat.repeat_score.
parsimony
(column)[source]¶ Calculate the parsimony score of a list.
Calculate the parsimony score on
column
: Count the number of changes in acolumn
of a multiple sequence alignment and divide by the maximally possible number of changes.For an exact description see:
Schaper, E., Kajava, A., Hauser, A., & Anisimova, M. Repeat or not repeat? –Statistical validation of tandem repeat prediction in genomic sequences. Nucleic Acids Research (2012).
Parameters: - column (list of str) – A
list
of two or more elements that can be - in a set, e.g. str or int (ordered) –
Returns: The parsimony score of
column
.Return type: float
Todo
At which step shall we assert that the length of
column
> 1?- column (list of str) – A