Background Information¶
When printing a tandem repeat, you will see a lot of information. Here is what it means. (You get this example from the significance test tutorials)
>>> print(lHIV_Sequence[4].dRepeat_list['denovo'].repeats[1])
> begin:38 l_effective:2 n:6 pvalue:0.0 divergence:0.46649169922545164 type:phylo_gap01
RRN
RR-
RRW
RA-
RQ-
RQI
The tandem repeat is displayed as an alignment of tandem repeat units, similar to multiple sequence alignments
RRN
RR-
RRW
RA-
RQ-
RQI
Within the sequence, this tandem repeat would like as follows
RRNRRRRWRARQRQI
These are the other characteristics of the tandem repeat, that might be shown (if available):
begin: Index where the tandem repeat starts in the sequence
l_effective: ength of the consensus tandem repeat unit, not including insertions.
n: Number of tandem repeat units.
pvalue: What is the probability that this repeat has occurred by random chance? (This value is only a good estimate. So even if pvalue=0.0, of course it is possible that the sequence shows similarity by random chance, and not because the repeat units have evolved by duplications).
divergence: tModel-based measure of the similarity of the repeat units. (mathematically, it is the maximum likelihood estimate of the branch length on the phylogeny connecting all tandem repeat units. In the above example, every site has mutated 0.47 times on average. A value of 0 would mean no mutations have occurred, and the sequence of the tandem repeat units is very conserved).
type: The model which was used to calculate the pvalue and the divergence.
More tandem repeat characteristics¶
You have access to more tandem repeat characteristics. The dir() command will provide you with a list of all attributes and functions connected to the object:
>>> dir(lHIV_Sequence[4].dRepeat_list['denovo'].repeats[1])
['TRD', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'begin', 'calc_index_msa', 'calc_calc_n_effective', 'calculate_pvalues', 'calculate_scores', 'create', 'dDivergence', 'dPValue', 'dScore', 'deleteInsertionColumns', 'deletions', 'divergence', 'gapStructure', 'gap_structure_HMM', 'gaps', 'insertions', 'l', 'l_effective', 'msa', 'msaD', 'msaT', 'msaTD', 'msaTDN', 'msaTD_standard_aa', 'msa_original', 'msa_standard_aa', 'n', 'calc_n_effective', 'nGap', 'pvalue', 'save_original_msa', 'score', 'sequence_length', 'sequence_type', 'text', 'textD', 'textD_standard_aa', 'totD', 'write']
For a more explanation, check out the Class documentation, or contact us.
Writing a tandem repeat to .csv¶
When you write a tandem repeat to .csv with TRAL, the result may look as follows:
begin msa_original l_effective n_effective repeat_region_length divergence pvalue
316 GDII,GDIR 4 2.0 8 None None
507 FLG,FLG 3 2.0 6 None None
Additional to the tandem repeat characteristics explained above, here you can find:
msa_original: The tandem repeat unit alignment, with units separated by commata.
repeat_region_length: The number of characters covered by the tandem repeat region.
None values indicated that the required characteristics had not been calculated previously in the code.