.. _significance_test: Perform statistical significance test of tandem repeats. ======================================================== Here you learn how to perform statistical significance tests on your candidate set of tandem repeats. TRAL currently provides a variety of model-based and none-model based tests. They are detailed in a :ref:`NAR publication (2012) `. Requirements for this tutorial: - :ref:`Install TRAL `. TRAL ships with the sequence data needed for this tutorial. - :ref:`Download p-Value distribution files `. Read in tandem repeat annotations. ---------------------------------- :: import os, pickle from tral import sequence from tral.paths import PACKAGE_DIRECTORY fRepeat_Pickle = os.path.join(PACKAGE_DIRECTORY,"examples", "data", "HIV-1_388796.pickle") with open(fRepeat_Pickle, 'rb') as fh: lHIV_Sequence = pickle.load(fh) There are nine proteins in the HIV proteome:: >>> len(lHIV_Sequence) 9 Perform a statistical significance test. ---------------------------------------- Calculate the statistical score on the 'denovo' *repeat_list*, and the corresponding p-Value with parameters defined in the :ref:`configuration file `: :: for iSequence in lHIV_Sequence: for iTandemRepeat in iSequence.get_repeatlist('denovo').repeats: iTandemRepeat.calculate_pvalues() For example, the following putative tandem repeat is found to be non-significant with the used model of tandem repeat evolution :ref:`(interpretation) `: :: >>> print(lHIV_Sequence[2].get_repeatlist('denovo').repeats[0]) > begin:94 l_effective:7 n:2 pvalue:1.0 divergence:1.656005859375 type:phylo_gap01 EKGGLEGLIYSKKRQE ---ILDLWVY------ Whereas this putative tandem repeat is considered significant: :: >>> print(lHIV_Sequence[4].get_repeatlist('denovo').repeats[0]) > begin:38 l_effective:2 n:6 pvalue:0.0 divergence:0.46649169922545164 type:phylo_gap01 RRN RR- RRW RA- RQ- RQI Besides, parameters can also be directly supplied to *calculate_pvalues*: :: for iSequence in lHIV_Sequence: for iTandemRepeat in iSequence.get_repeatlist('denovo').repeats: iTandemRepeat.calculate_pvalues(scoreslist=['phylo_gap001']) Filter tandem repeat below a significance threshold. ---------------------------------------------------- :: for iSequence in lHIV_Sequence: repeat_list = iSequence.get_repeatlist('denovo') if repeat_list: repeat_list_filtered = repeat_list.filter(func_name = "pvalue", score = "phylo_gap01", threshold = 0.05) iSequence.set_repeatlist(repeat_list_filtered, "denovo_filtered") The resulting *result_list* now only contains tandem repeats with a p-Value below 0.05, and is shorter than the original tandem repeat list: :: >>> len([iR for iS in lHIV_Sequence for iR in iS.get_repeatlist('denovo').repeats]) 28 >>> len([iR for iS in lHIV_Sequence for iR in iS.get_repeatlist('denovo_filtered').repeats]) 20 Available significance tests ---------------------------- The model based scores are generally described in our :ref:`NAR publication, 2012 `. Briefly, the sequences evolution is assumed to be described by the LG matrix. Further, the models can be distinguished by how gaps are handled: - 'phylo': Gaps are treated as characters. Gaps can also assumed to be inserted with exponentially distributed waiting times and Zipfian distributed indel lengths: - 'phylo_gap01', 'phylo_gap01_ignore_trailing_gaps', 'phylo_gap01_ignore_coherent_deletions', 'phylo_gap01_ignore_trailing_gaps_and_coherent_deletions': Gaps are modelled, and assumed to mutation at at 0.1 times lower rate compared to substitutions. - 'phylo_gap001', 'phylo_gap001_ignore_trailing_gaps', 'phylo_gap001_ignore_coherent_deletions', 'phylo_gap001_ignore_trailing_gaps_and_coherent_deletions': Gaps are modelled, and assumed to mutation at at 0.01 times lower rate compared to substitutions. - *ignore_trailing_gaps* signifies that gaps before the first tandem repeat unit and after the last tandem repeat unit are not penalised. - *coherent_deletions* signifies that gaps of the same length and position within the alignment of tandem repeat units are penalised only once. Among the three *ad hoc* scores are 'entropy', 'parsimony', and 'pSim' scores (see our :ref:`NAR publication, 2012 `).