.. _denovo: Run and parse de novo repeat detection software. ================================================ Here you learn how to run external repeat detection software on your favourite sequence file and output the results. Requirements for this tutorial: - :ref:`Install TRAL `. TRAL ships with the data needed for this tutorial. - :ref:`Install XSTREAM `. You can also install one or more other :ref:`tandem repeat detectors` instead. - `Install Castor `_. If you wish to realign your repeat MSA you need to have proPIP installed beforehand. .. todo:: write section to install castor with aligner proPIP Read in your sequences. ----------------------- :: import os from tral.sequence import sequence from tral.paths import PACKAGE_DIRECTORY proteome_HIV = os.path.join(PACKAGE_DIRECTORY, "examples", "data", "HIV-1_388796.faa") sequences_HIV = sequence.Sequence.create(file = proteome_HIV, input_format = 'fasta') Run the external repeat detector. --------------------------------- Detect tandem repeats on a single sequence with :ref:`XSTREAM `. If you did not install XSTREAM, you can use any of the other *de novo* detection algorithms, but will see different results. :: tandem_repeats = sequences_HIV[0].detect(denovo = True, detection = {"detectors": ["XSTREAM"]}) As an example, the first detected putative tandem repeat looks as follows :ref:`(interpretation) `:: >>> print(tandem_repeats.repeats[0]) > begin:316 l_effective:4 n:2 GDII GDIR Detect tandem repeats on all sequences with all *de novo* tandem repeat detection algorithms defined in the :ref:`configuration file `. Here "denovo" is used as a tag, you can choose your own tag.:: for iSequence in sequences_HIV: iTandem_repeats = iSequence.detect(denovo = True) iSequence.set_repeatlist(iTandem_repeats, "denovo") Different different algorithms usually detect different tandem repeats. This the the absolute number of detections in the HIV proteome for a couple of algorithms:: >>> len([i for j in sequences_HIV for i in j.get_repeatlist('denovo').repeats if i.TRD == "HHrepID"]) 9 >>> len([i for j in sequences_HIV for i in j.get_repeatlist('denovo').repeats if i.TRD == "T-REKS"]) 4 >>> len([i for j in sequences_HIV for i in j.get_repeatlist('denovo').repeats if i.TRD == "TRUST"]) 6 >>> len([i for j in sequences_HIV for i in j.get_repeatlist('denovo').repeats if i.TRD == "XSTREAM"]) 9 As an example, T-REKS detects the following repeat in the second HIV sequence :ref:`(interpretation) `:: >>> print([i for i in sequences_HIV[1].get_repeatlist('denovo').repeats if i.TRD == "T-REKS"][0]) > begin:449 l_effective:10 n:2 RPEPTAPP-ESL RPEPTAPPPES- Realign a repeat with proPIP. ----------------------------- Realigning a tandem repeat with an indel aware algortithm such as proPIP may improve its representation. :: test_sequence = sequences_HIV[3] tandem_repeats = test_sequence.detect(denovo = True) Print one of the previous multiple sequence alignments (MSA):: >>> print(tandem_repeats.repeats[6]) > begin:828 l_effective:13 n:3 ----DKWTVQPIQLPE ---KDSWTVNDIQ--K LVGKLNWASQIY--PG Realignment with proPIP (default realignment='proPIP') of the tandem repeat MSAs with a constant rate distribution:: for TR in tandem_repeats.repeats: TR.realign_TR(realignment='proPIP', rate_distribution='constant') :: >>> print(tandem_repeats.repeats[6]) > begin:828 l_effective:13 n:3 ---DK--WTVQPIQLPE --KDS--WTVNDIQ--K LV-GKLNWASQ-I-YPG Set rate_distribution = 'gamma' for a gamma distributed indel placement or modify it in :ref:`config.ini `. Set realignment = 'mafft' to realign with mafft instead of proPIP. Output the detected tandem repeats. ----------------------------------- Write a singe repeat_list to .tsv format:: path_to_output_tsv_file = "outputfile.tsv" # Choose your path and filename tandem_repeats.write(output_format = "tsv", file = path_to_output_tsv_file) The created .tsv looks as follows :ref:`(interpretation) `:: $ cat outputfile.tsv begin msa_original l_effective nD sequence_length divergence pvalue 316 GDII,GDIR 4 2.0 8 None None 507 FLG,FLG 3 2.0 6 None None Write a singe repeat_list to .pickle format:: path_to_output_pickle_file = "outputfile.pickle" # Choose your path and filename tandem_repeats.write(output_format = "pickle", file = path_to_output_pickle_file) A repeat_list in pickle format can easily be read in again:: from tral.repeat_list import repeat_list tandem_repeats = repeat_list.RepeatList.create(input_format = "pickle", file = path_to_output_pickle_file) Save multiple sequence together with tandem repeat annotations:: import pickle path_to_output_pickle_file = "outputfile.pickle" # Choose your path and filename with open(path_to_output_pickle_file, 'wb') as fh: pickle.dump(sequences_HIV, fh)