Perform overlap filtering of redundant tandem repeat annotations.

Here you learn how to check tandem repeat annotations from multiple sources for different types of overlap, and how to filter them accordingly.

Requirements for this tutorial:

Read in tandem repeat annotations.

import os, pickle
from tral import sequence
from tral.paths import PACKAGE_DIRECTORY

fRepeat_Pickle = os.path.join(PACKAGE_DIRECTORY, "examples", "data", "HIV-1_388796.pickle")

with open(fRepeat_Pickle, 'rb') as fh:
    lHIV_Sequence = pickle.load(fh)

Filter overlapping tandem repeats.

Tandem repeats can be clustered according to different types of overlap:

overlap_type = "shared_char"
for iSequence in lHIV_Sequence:
    iSequence.get_repeatlist("denovo").cluster(overlap_type)

In the first sequence, no repeats share any chars. In the second sequence, however, repeats 0, 2 and 3 overlap and form one cluster, and repeats 4 and 5 form another:

>>> print(lHIV_Sequence[0].get_repeatlist("denovo").d_cluster[overlap_type])
[{5}, {4}, {3}, {2}, {1}, {0}]
>>> print(lHIV_Sequence[1].get_repeatlist("denovo").d_cluster[overlap_type])
[{4, 5}, {0, 2, 3}, {1}]
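
To pick out only the clusters that contain more than one repeat, the cluster list can be filtered by size. The following is a minimal sketch in plain Python that uses the list printed above as literal input. It assumes the set members are indices into the "denovo" repeat list, and the variable names are illustrative rather than part of TRAL:

```python
# Cluster list as printed above for the second sequence. Each set is
# assumed to hold indices of repeats in the "denovo" repeat list.
clusters = [{4, 5}, {0, 2, 3}, {1}]

# Clusters with more than one member group repeats that overlap.
overlapping = [sorted(c) for c in clusters if len(c) > 1]

# Singleton clusters are repeats that overlap with no other repeat.
singletons = [next(iter(c)) for c in clusters if len(c) == 1]

print(overlapping)  # [[4, 5], [0, 2, 3]]
print(singletons)   # [1]
```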

Overlapping tandem repeats can also be filtered out directly. In this case, we need to choose which of the overlapping tandem repeats to retain. For example, the following usage first retains the tandem repeat with the lowest p-value under the given score and, if there is still a tie, retains the tandem repeat with the lowest divergence:

overlap_type = "common_ancestry"
score = "phylo_gap001"
for iSequence in lHIV_Sequence:
    repeat_list_filtered = iSequence.get_repeatlist("denovo").filter(
        "none_overlapping",
        (overlap_type, None),
        [("pvalue", score), ("divergence", score)],
    )
    iSequence.set_repeatlist(repeat_list_filtered, "denovo_non_overlapping")
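
The ordered list of criteria behaves like a lexicographic sort key: candidates are compared on the first criterion, and the second is consulted only to break ties. The sketch below illustrates that idea in plain Python and is not TRAL's implementation. The `Candidate` tuple, the `pick_representative` helper, and all values are made up for illustration:

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    # Hypothetical stand-in for a tandem repeat; not a TRAL class.
    name: str
    pvalue: float
    divergence: float


def pick_representative(cluster):
    """Return the candidate with the lowest p-value, ties broken by lowest divergence."""
    # Tuples compare element by element, so divergence only matters
    # when two candidates have the same p-value.
    return min(cluster, key=lambda c: (c.pvalue, c.divergence))


cluster = [
    Candidate("A", pvalue=0.01, divergence=0.30),
    Candidate("B", pvalue=0.01, divergence=0.10),
    Candidate("C", pvalue=0.05, divergence=0.05),
]
best = pick_representative(cluster)
print(best.name)  # B
```

Here A and B tie on p-value, so B is chosen for its lower divergence. C is not chosen despite having the lowest divergence, because p-value is the first criterion.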

The resulting repeat list "denovo_non_overlapping" now only contains tandem repeats that do not overlap according to the common ancestry overlap:

>>> len([iR for iS in lHIV_Sequence for iR in iS.get_repeatlist("denovo").repeats])
28
>>> len([iR for iS in lHIV_Sequence for iR in iS.get_repeatlist("denovo_non_overlapping").repeats])
23

Available types of overlap/redundancy detection

There are currently two types of overlap implemented:

  • “shared_char”: Do two tandem repeats contain any two chars in common?

  • “common_ancestry”: Do two tandem repeats have at least two chars in the same column of their tandem repeat unit alignments? This approach is more conservative than “shared_char” and was used in the MBE and New Phytologist publications, 2014.

For clustering, overlap is assumed to be a transitive attribute. That is, if tandem repeats A and B overlap, and B and C overlap, then A, B and C are all placed in the same cluster, regardless of whether A and C overlap directly.
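
Transitive clustering of this kind can be sketched with a union-find (disjoint-set) structure: every pairwise overlap merges two groups, and the final groups are the clusters. This is a generic illustration under that assumption, not TRAL's actual code. The function `cluster_transitively` and its inputs are hypothetical:

```python
def cluster_transitively(n_items, overlaps):
    """Group indices 0..n_items-1 into clusters, given pairs of overlapping indices."""
    parent = list(range(n_items))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Each overlapping pair merges the two groups the items belong to.
    for a, b in overlaps:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect the members of each group under its root.
    clusters = {}
    for i in range(n_items):
        clusters.setdefault(find(i), set()).add(i)
    return sorted(clusters.values(), key=min)


# Indices 0, 1, 2 stand for A, B, C; index 3 is an unrelated repeat.
# A overlaps B and B overlaps C, but A and C are not listed as overlapping.
result = cluster_transitively(4, [(0, 1), (1, 2)])
print(result)
```

A, B and C end up in one cluster even though no direct A and C overlap was supplied, while the unrelated repeat stays in a cluster of its own.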