Sequence code documentation
Initial version of the sequence submodule documentation.
sequence
Synopsis: The Sequence Class.
class tral.sequence.sequence.Sequence(seq, name=None, sequence_type='AA')
A Sequence describes either a protein or a DNA sequence. Sequence contains methods that act on single sequences, for example:
- tandem repeats
Attributes:
- seq (str) – The sequence.
- seq_standard_aa (str) – The sequence with standard amino acids only.
create(file, input_format)
Create sequence(s) from file.
Parameters:
- file (str) – Path to input file
- input_format (str) – Either “fasta” or “pickle”
Todo
Write checks for input_format and file.
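The Todo above asks for argument checks. A minimal sketch of what such checks might look like follows; `check_create_args` is a hypothetical helper and not part of TRAL, and the only grounded facts are the two format names listed in the parameter description.

```python
import os

# The two formats named in the parameter description above.
SUPPORTED_FORMATS = ("fasta", "pickle")


def check_create_args(file, input_format):
    """Hypothetical helper: validate arguments before a file is read."""
    if input_format not in SUPPORTED_FORMATS:
        raise ValueError(
            "input_format must be one of {}, got {!r}".format(
                SUPPORTED_FORMATS, input_format))
    if not os.path.isfile(file):
        raise FileNotFoundError("No such input file: {}".format(file))
```

Raising early keeps a misspelled format or path from surfacing later as a less readable parser error.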
detect(lHMM=None, denovo=None, realignment='mafft', sequence_type='AA', rate_distribution='constant', user_path=None, **kwargs)
Detects tandem repeats on self.seq from two possible sources.
A list of Repeat instances is created for tandem repeat detections on the sequence from two possible sources:
- Sequence profile hidden Markov models (HMM)
- de novo detection algorithms
Parameters:
- lHMM (list of HMM) – A list of HMM instances.
- denovo (bool) – Whether to run de novo detection algorithms.
- realignment (str) – either “mafft”, “proPIP” or None
- **kwargs – Parameters fed to denovo TR prediction and/or Repeat instantiation. E.g. repeat = {"calc_score": True}
Returns: A RepeatList instance
get_repeatlist(tag)
Retrieve repeatlist from this sequence instance. Access repeatlist as self.d_repeatlist[tag].
Parameters: tag (str) – An identifier for the repeat_list
Returns: A repeat_list instance.
Return type: RepeatList
repeat_in_sequence(my_repeat)
Sanity check whether the repeat is part of this sequence. If it is, calculate the position of the repeat within the sequence.
If yes: Return True, and set repeat.begin to the corrected value if necessary. If no: Return False. The sanity check is performed on sequences in which all amino acids are, or have been converted to, standard amino acids.
Parameters: my_repeat (Repeat) – A Repeat instance.
Returns: True if repeat is part of sequence, else False
Return type: bool
Todo
Decide whether save_original_msa is needed here.
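The idea behind this check can be sketched with plain strings. The snippet below is an illustration only, not TRAL's implementation: `find_repeat` is a hypothetical function, and it assumes that the repeat is given as a list of MSA rows (one repeat unit per row, read in order) with "-" as the gap character. It returns a 0-based Python index, which may differ from the convention used for repeat.begin.

```python
def find_repeat(sequence, msa, gap_chars="-"):
    """Return the 0-based start of the ungapped repeat in sequence, or None."""
    # Concatenate the repeat units row by row, then drop gap characters.
    ungapped = "".join(msa)
    for gap in gap_chars:
        ungapped = ungapped.replace(gap, "")
    position = sequence.find(ungapped)
    return position if position != -1 else None


start = find_repeat("MKAAGAAGTT", ["AAG", "AAG"])
```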
set_repeatlist(repeatlist, tag)
Add repeatlist as attribute to this sequence instance. Access repeatlist as self.d_repeatlist[tag].
Parameters:
- repeatlist (RepeatList) – A repeat_list instance.
- tag (str) – An identifier for the repeat_list
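Both get_repeatlist and set_repeatlist refer to self.d_repeatlist[tag], which suggests a dictionary keyed by tag. A minimal sketch of that pattern, using a hypothetical stand-in class rather than the real Sequence and plain lists in place of RepeatList instances:

```python
class TaggedRepeatLists:
    """Hypothetical stand-in for the repeat list storage of a sequence."""

    def __init__(self):
        # Maps a tag (str) to a repeat list.
        self.d_repeatlist = {}

    def set_repeatlist(self, repeatlist, tag):
        self.d_repeatlist[tag] = repeatlist

    def get_repeatlist(self, tag):
        return self.d_repeatlist[tag]


store = TaggedRepeatLists()
store.set_repeatlist(["repeat_a", "repeat_b"], "denovo")
denovo_repeats = store.get_repeatlist("denovo")
```

With a plain dictionary, asking for a tag that was never set raises KeyError; whether TRAL handles that case differently is not stated here.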
sequence_io
Synopsis: Input/output for sequences
tral.sequence.sequence_io.read_fasta(file, indices=None)
Read all sequences from a fasta file. Currently, the Biopython SeqIO parser is used.
Parameters:
- file (str) – Path to input file
- indices ([int, int]) – Index of the first returned sequence, and index of the first sequence that is not returned.
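To illustrate how the `indices` pair selects a half-open range of records, here is a self-contained sketch. It deliberately uses a small hand-written parser instead of Biopython so that it runs without dependencies; `read_fasta_records` is a hypothetical name, and 0-based indexing is an assumption.

```python
def read_fasta_records(path, indices=None):
    """Return (identifier, sequence) tuples from a FASTA file.

    indices is [first, stop]: records first .. stop - 1 are returned
    (0-based, an assumption of this sketch).
    """
    records = []
    name, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record.
                if name is not None:
                    records.append((name, "".join(chunks)))
                name, chunks = line[1:], []
            elif line and name is not None:
                chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    if indices is not None:
        first, stop = indices
        records = records[first:stop]
    return records
```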
tral.sequence.sequence_io.write(sequence, sequence_file, sequence_id='sequence_id_not_defined')
Write a sequence str to fasta format in specified <sequence_file>.
Parameters:
- sequence (str) – Sequence
- sequence_file (str) – Path to the output file
- sequence_id (str) – ID of the sequence in the output file.
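A minimal sketch of writing a single sequence in FASTA format, mirroring the parameters above. `write_fasta` is a hypothetical function; it writes the sequence on one line, and whether TRAL wraps long sequences is not stated here.

```python
def write_fasta(sequence, sequence_file,
                sequence_id="sequence_id_not_defined"):
    """Write one header line and one sequence line to sequence_file."""
    with open(sequence_file, "w") as handle:
        handle.write(">{}\n{}\n".format(sequence_id, sequence))
```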
repeat_detection_io
Synopsis: Parsing repeat detection algorithm output
tral.sequence.repeat_detection_io.getMSA(sequenceMSA, consensusMSA)
Derive the MSA from the unusual combination of consensusMSA and sequenceMSA found in TRF (Benson) txt.html output files.
Parameters:
- sequenceMSA –
- consensusMSA –
Returns: msa – The multiple sequence alignment predicted by TRF.
Return type: list of str
tral.sequence.repeat_detection_io.hhpredid_get_repeats(infile)
Read repeats from a HHREPID standard output (stdout) file stream successively. Postcondition: infile points to EOF.
Layout of HHREPID standard output:
protein ::= begin"-"\d "+"\d repeatUnit ( \d"-"\d "+"\d repeatUnit )+
Parameters: infile (file stream) – File stream from HHREPID standard output. (Generated by e.g. ./hhrepid_32 -i FASTAFILE -v 0 -d cal.hhm -o INFILE)
Returns: A generator that yields one repeat per iteration.
Return type: (Repeat)
Todo
Layout HHREPID output syntax.
tral.sequence.repeat_detection_io.phobos_get_repeats(infile)
Read repeats from a PHOBOS output file stream successively. Postcondition: infile points to EOF.
Parameters: infile (file stream) – File stream from PHOBOS output.
Returns: A generator that yields one repeat per iteration.
Return type: (Repeat)
Todo
Show PHOBOS output syntax.
tral.sequence.repeat_detection_io.tred_get_repeats(infile)
Read repeats from a TRED standard output (stdout) file stream successively. Postcondition: infile points to EOF.
Layout of TRED output file:
Start: start End: \d+ Length: \d+ ( \d repeat_unit \d ( alignment_indicator )?)*
Parameters: infile (file stream) – File stream of output1 from:
tred1 myDNA.faa intermediate_output
tred2 myDNA.faa intermediate_output output1 output2
Returns: A generator that yields one repeat per iteration.
Return type: (Repeat)
Todo
Layout TRED output syntax.
tral.sequence.repeat_detection_io.tred_msa_from_pairwise(repeat_units)
Construct a MSA from pairwise alignments. At the moment, gaps following the repeat are not added. However, these gaps are added automatically when a Repeat instance is created.
Parameters: repeat_units (list of str) – Read in from TRED output files
Returns: (list of str)
Todo
Is the Args format mentioned correctly?
tral.sequence.repeat_detection_io.treks_get_repeats(infile)
Read repeats from a T-REKS standard output (stdout) file stream successively. Postcondition: infile points to EOF.
Layout of T-REKS output file:
protein ::= ">" identifier repeat*
repeat ::= repeat_header sequence* "*"+
repeat_header ::= "Length:" integer "residues - nb:" integer "from" integer "to" integer "- Psim:" float "region Length:" integer
Parameters: infile (file stream) – File stream to the file of the standard output of T-REKS
Returns: A generator that yields one repeat per iteration.
Return type: (Repeat)
Todo
Layout T-REKS output syntax.
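The generator pattern shared by the `*_get_repeats` readers can be sketched against the T-REKS layout given above. This is an illustration under stated assumptions, not TRAL's parser: `iter_treks_repeats` is a hypothetical name, it yields plain dictionaries instead of Repeat instances, it takes the integer after "from" as the repeat start, and the sample input is hand-written to match the grammar rather than copied from real T-REKS output.

```python
import io
import re

# Matches the repeat_header production; whitespace is treated as flexible.
HEADER = re.compile(
    r"Length:\s*(\d+)\s+residues\s+-\s+nb:\s*(\d+)"
    r"\s+from\s+(\d+)\s+to\s+(\d+)"
    r"\s+-\s+Psim:\s*([\d.]+)\s+region\s+Length:\s*(\d+)"
)


def iter_treks_repeats(infile):
    """Yield one dict per repeat; infile is read through to EOF."""
    current = None
    for line in infile:
        line = line.strip()
        match = HEADER.match(line)
        if match:
            current = {"begin": int(match.group(3)), "msa": []}
        elif current is not None and line.startswith("*"):
            # A row of asterisks closes the current repeat.
            yield current
            current = None
        elif current is not None and line:
            current["msa"].append(line)


sample = io.StringIO(
    ">prot1\n"
    "Length: 3 residues - nb: 2  from  3 to 8 - Psim:1.0 region Length:6\n"
    "AAG\n"
    "AAG\n"
    "**********************\n"
)
parsed = list(iter_treks_repeats(sample))
```

Because the generator iterates over the stream line by line until it is exhausted, consuming all repeats leaves the stream at EOF, which matches the postcondition stated above.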
tral.sequence.repeat_detection_io.trf_get_repeats(infile)
Read repeats from a TRF txt.html file stream successively. Postcondition: infile points to EOF.
TRF output file syntax:
Sequence: ``identifier`` Indices: ``begin``--``end`` \d [a-zA-Z]+ # begin (repeat)* 1 (consensus)* # (( \d (repeat)* \d (consensus)* )? \d (repeat)* 1 (consensus)* )+ \d [a-zA-Z]+ # ``Statistics``
Parameters: infile (file stream) – File stream from TRF output txt.html. (Generated via e.g. ./trf404.linux64.exe FASTAFILE 2 7 7 80 10 50 500 -d > /dev/null. If the -h flag is set, no .txt.html output is produced.)
Returns: A generator that yields one repeat per iteration.
Return type: (Repeat)
Todo
Layout TRF output syntax.
Todo
Does not search for the sequence identifier at present!
tral.sequence.repeat_detection_io.trust_fill_repeats(msa, begin, sequence, maximal_gap_length=20)
Return a TRUST MSA that has no indels longer than maximal_gap_length and that contains the indel characters even when they are not part of the TRUST output file. Background: TRUST returns tandem repeats, but also distant repeats.
tral.sequence.repeat_detection_io.trust_get_repeats(infile)
Read repeats from a TRUST standard output (stdout) file stream successively. Postcondition: infile points to EOF.
Layout of TRUST standard output:
protein ::= ">" identifier (repeat_types)* "//"
repeat_types ::= "REPEAT_TYPE" integer "REPEAT_LENGTH" integer (repeat_info)* (">Repeat " integer sequence)*
repeat_info ::= integer integer [integer] [integer]
Parameters: infile (file stream) – File stream from TRUST standard output.
Returns: A generator that yields one repeat per iteration.
Return type: (Repeat)
Todo
Layout TRUST output syntax.
tral.sequence.repeat_detection_io.xstream_get_repeats(infile)
Read repeats from an XSTREAM output xls chart. Postcondition: infile points to EOF.
Parameters: infile (file stream) – File stream to read XSTREAM output from an xls chart.
Returns: A generator that yields one repeat per iteration.
Return type: (Repeat)
repeat_detection_run
Synopsis: Execution of repeat detection algorithms
class tral.sequence.repeat_detection_run.BinaryExecutable(binary=None, name=None)
Contains the executable, and combines executable with parameters.
Attributes:
- binary (str) – Path to binary
tral.sequence.repeat_detection_run.Detectors(lDetector=None, sequence_type=None)
Define a global dictionary of all used detector functions.
Parameters:
- lDetector (list of str) – A list of repeat detection algorithm names.
- sequence_type (str) – Either “AA” or “DNA”.
Raises: Exception – if at least one of the provided detectors in lDetector does not exist.
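The documented failure mode (an Exception when a requested detector does not exist) can be sketched as a lookup against a registry. Everything here is illustrative: `check_detectors` and `KNOWN_DETECTORS` are hypothetical, the names are simply the detector spellings that appear on this page, and the keys TRAL actually uses (and how they depend on sequence_type) may differ.

```python
# Hypothetical registry built from detector names mentioned on this page.
KNOWN_DETECTORS = {
    "HHREPID", "PHOBOS", "TRED", "T-REKS", "TRF", "TRUST", "XSTREAM",
}


def check_detectors(lDetector):
    """Return lDetector as a list, or raise if any name is unknown."""
    unknown = [name for name in lDetector if name not in KNOWN_DETECTORS]
    if unknown:
        raise Exception(
            "Unknown detector(s): {}".format(", ".join(unknown)))
    return list(lDetector)
```

Collecting all unknown names before raising gives the caller one complete error message instead of one failure per typo.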
tral.sequence.repeat_detection_run.check_java_errors(outfile, errfile, log=<logging.Logger object>, procname=None)
Check for java problems. Return True if there were problems, else False.
Check for these java errors:
- Stdout file is empty but stderr file is not.
- Java Exception string is indicated in the errfile
Parameters:
- outfile (file handle) – Redirected standard output channel file.
- errfile (file handle) – Redirected standard error channel file. If None, no copies are saved.
- log – Name of the log to issue log messages to. If None, no log messages will be issued.
Todo
Complete docstring
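The two documented conditions can be sketched as follows. `has_java_errors` is a hypothetical function, not the TRAL implementation; it assumes both handles are readable text streams positioned at the start, and it treats any occurrence of the substring "Exception" in stderr as the "Java Exception string", which is a simplification.

```python
import io


def has_java_errors(outfile, errfile):
    """Return True if either documented error condition holds."""
    out_text = outfile.read()
    err_text = errfile.read()
    # Condition 1: nothing on stdout, but something on stderr.
    if not out_text.strip() and err_text.strip():
        return True
    # Condition 2: stderr mentions a Java exception.
    return "Exception" in err_text


clean_run = has_java_errors(io.StringIO("repeat found\n"), io.StringIO(""))
```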
tral.sequence.repeat_detection_run.run_detector(seq_records, detectors=None, sequence_type='AA', default=True, local_working_dir=None, num_threads=1)
Run tandem repeat detection (TRD) on seq_records and return the predicted repeats for each record in seq_records and for each tandem repeat detector. Example of return type:
    [
        # record 1
        {
            'T-REKS': [Repeat(), Repeat(), ...],
            'XSTREAM': [Repeat(), Repeat(), ...],
            ...
        },
        # record 2
        ...
    ]
Parameters:
- seq_records (list of Sequence) – A list of Sequence instances
- detectors (list of str) – A list of tandem repeat detector names
- sequence_type (str) – Either “AA” or “DNA”
- default (bool) – If True, default values for the detection algorithms are used.
- local_working_dir (str) – Directory where data and results are stored. If provided, (temporary) files are not deleted.
- num_threads (int) – Run num_threads detectors on parallel threads.
Returns: A list with a dictionary for each record in seq_records. The dictionary contains a list of repeats for each detector that was used.
Return type: list of dictionary
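The return structure (one dictionary per record, mapping detector name to a list of repeats) can be traversed as in the sketch below. It does not call run_detector; placeholder strings stand in for Repeat instances, and `count_repeats` is a hypothetical helper.

```python
def count_repeats(results):
    """Return, per record, a dict mapping detector name to repeat count."""
    return [
        {detector: len(repeats) for detector, repeats in record.items()}
        for record in results
    ]


# Placeholder strings stand in for Repeat instances.
results = [
    {"T-REKS": ["r1", "r2"], "XSTREAM": ["r3"]},
    {"T-REKS": [], "XSTREAM": ["r4"]},
]
counts = count_repeats(results)
```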
tral.sequence.repeat_detection_run.split_sequence(seq_records, working_dir)
Split a FASTA file with multiple entries into several FASTA files with one entry each.
Parameters:
- seq_records – List of Sequence instances.
- working_dir – Output directory for the split files
Returns: A list of tuples containing the protein identifier and the file name
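A minimal sketch of the splitting step, assuming records expose `name` and `seq` (the two attributes that appear in the Sequence constructor above). `Record` and `split_records` are hypothetical stand-ins, and the numbered `.faa` file naming is an assumption of this sketch.

```python
import os
from collections import namedtuple

# Hypothetical stand-in for a Sequence instance with name and seq.
Record = namedtuple("Record", ["name", "seq"])


def split_records(seq_records, working_dir):
    """Write one FASTA file per record; return (identifier, file name) tuples."""
    result = []
    for index, record in enumerate(seq_records):
        file_name = os.path.join(working_dir, "{}.faa".format(index))
        with open(file_name, "w") as handle:
            handle.write(">{}\n{}\n".format(record.name, record.seq))
        result.append((record.name, file_name))
    return result
```

Numbering the files by position avoids problems with identifiers that contain characters not allowed in file names.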