Sequence code documentation¶
The sequence
submodule documentation.
sequence¶
- synopsis
The Sequence Class.
-
class
tral.sequence.sequence.
Sequence
(seq, name=None, sequence_type='AA')[source]¶ A
Sequence
describes either a protein or a DNA sequence. Sequence
contains methods that act on single sequences, for example: tandem repeats
-
seq
¶ The sequence.
- Type
str
-
seq_standard_aa
¶ The sequence with standard amino acids only
- Type
str
-
create
(input_format)[source]¶ Create sequence(s) from file.
- Parameters
file (str) – Path to input file
format (str) – Either “fasta” or “pickle”
Todo
Write checks for format and file.
-
detect
(lHMM=None, denovo=None, realignment='mafft', sequence_type='AA', rate_distribution='constant', user_path=None, **kwargs)[source]¶ Detects tandem repeats on
self.seq
from two possible sources. A list of
Repeat
instances is created for tandem repeat detections on the sequence from two possible sources: sequence profile hidden Markov models (HMM) and de novo detection algorithms.
- Parameters
lHMM (list of hmm.HMM) – A list of
HMM
instances.
denovo (bool) – Whether to run de novo detection algorithms.
realignment (str) – Either “mafft”, “proPIP” or None.
**kwargs – Parameters fed to denovo TR prediction and/or Repeat instantiation. E.g.
repeat = {"calc_score": True}
- Returns
A
RepeatList
instance
-
get_repeatlist
(tag)[source]¶ Retrieve repeatlist from this sequence instance.
Access repeatlist as self.d_repeatlist[tag].
- Parameters
tag (str) – An identifier for the repeat_list
- Returns
A repeat_list instance.
- Return type
RepeatList
-
repeat_in_sequence
(my_repeat)[source]¶ Sanity check whether the repeat is part of this sequence. If so, calculate the position of the repeat within the sequence.
If yes: Return True and set repeat.begin to the corrected value if necessary. If no: Return False. The sanity check is performed on sequences in which all amino acids are, or have been converted to, standard amino acids.
- Parameters
my_repeat (Repeat) – A Repeat instance.
- Returns
True if the repeat is part of the sequence, else False
- Return type
bool
Todo
Decide whether save_original_msa is needed here.
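The check described for repeat_in_sequence can be sketched roughly as follows. This is a minimal stand-in and not TRAL's implementation: it assumes a repeat is given as a list of aligned units with "-" as the gap character and a 1-based begin position. The function name find_repeat_begin is hypothetical, and the conversion to standard amino acids is omitted.

```python
def find_repeat_begin(sequence, msa, begin=None):
    """Return the 1-based start of the repeat in `sequence`, or None if absent.

    Hypothetical helper: `msa` is a list of aligned repeat units, "-" marks gaps.
    """
    # Join the units and drop gap characters to get the ungapped repeat text.
    repeat_text = "".join(msa).replace("-", "")
    if not repeat_text:
        return None
    # First check the claimed position, if one was supplied.
    if begin is not None and begin >= 1 and sequence.startswith(repeat_text, begin - 1):
        return begin
    # Otherwise search the whole sequence and report the corrected position.
    index = sequence.find(repeat_text)
    return index + 1 if index >= 0 else None


seq = "MKTAYAAYAAYAGLV"
print(find_repeat_begin(seq, ["AYA", "AYA", "AYA"], begin=2))
print(find_repeat_begin(seq, ["AY-A", "AYWA"]))
```

In this sketch a wrong begin value is silently corrected, and a repeat that does not occur in the sequence yields None instead of False.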
-
set_repeatlist
(repeatlist, tag)[source]¶ Add repeatlist as attribute to this sequence instance.
Access repeatlist as self.d_repeatlist[tag].
- Parameters
repeatlist (RepeatList) – A repeat_list instance.
tag (str) – An identifier for the repeat_list
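The tag-based storage described for set_repeatlist and get_repeatlist can be illustrated with a small stand-in class. MiniSequence is hypothetical and only mirrors the documented d_repeatlist[tag] access pattern. Plain strings stand in for Repeat instances, and a plain list stands in for a RepeatList.

```python
class MiniSequence:
    """Hypothetical stand-in that stores repeat lists by tag, as documented."""

    def __init__(self, seq):
        self.seq = seq
        self.d_repeatlist = {}  # maps tag -> repeat list

    def set_repeatlist(self, repeatlist, tag):
        # Store (or overwrite) the repeat list under the given tag.
        self.d_repeatlist[tag] = repeatlist

    def get_repeatlist(self, tag):
        # In this sketch, an unknown tag raises KeyError.
        return self.d_repeatlist[tag]


sequence = MiniSequence("MKTAYAAYAAYAGLV")
sequence.set_repeatlist(["repeat_1", "repeat_2"], tag="denovo")
print(sequence.get_repeatlist("denovo"))
```

Keeping one list per tag lets results from different detection runs sit side by side on the same sequence object. How TRAL itself handles an unknown tag is not stated here.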
sequence_io¶
- synopsis
Input/output for sequences
-
tral.sequence.sequence_io.
read_fasta
(file, indices=None)[source]¶ Read all sequences from a fasta file.
Currently, the Biopython SeqIO parser is used.
- Parameters
file (str) – Path to input file
indices ([int, int]) – Index of the first returned sequence, and index of the first sequence not returned.
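The index pair behaves like a half-open range over the records in the file. The sketch below shows that idea with a standard-library parser in place of Biopython, so it is not what read_fasta does internally. The name read_fasta_records is hypothetical, and 0-based indexing is an assumption.

```python
import io


def read_fasta_records(handle, indices=None):
    """Return (identifier, sequence) pairs from a FASTA stream.

    `indices` is treated as [start, stop): records start..stop-1 (0-based, assumed).
    """
    records = []
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    if indices is not None:
        start, stop = indices
        records = records[start:stop]
    return records


fasta_text = ">seq1\nMKT\nAYA\n>seq2\nGLV\n>seq3\nAAA\n"
selected = read_fasta_records(io.StringIO(fasta_text), indices=[1, 3])
print(selected)
```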
-
tral.sequence.sequence_io.
write
(sequence, sequence_file, sequence_id='sequence_id_not_defined')[source]¶ Write a sequence str to fasta format in specified <sequence_file>
- Parameters
sequence (str) – Sequence
sequence_file (str) – Path to the output file
sequence_id (str) – ID of the sequence in the output file.
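A minimal sketch of what writing a single sequence string in FASTA format amounts to. The function name write_fasta and the single-line sequence layout are assumptions made for illustration; the actual output layout of write is not specified above.

```python
import os
import tempfile


def write_fasta(sequence, sequence_file, sequence_id="sequence_id_not_defined"):
    """Write one sequence string to `sequence_file` in FASTA format (sketch)."""
    with open(sequence_file, "w") as handle:
        # Header line with the identifier, then the sequence on one line.
        handle.write(">{}\n{}\n".format(sequence_id, sequence))


path = os.path.join(tempfile.mkdtemp(), "example.faa")
write_fasta("MKTAYAAYAAYAGLV", path, sequence_id="example")
with open(path) as handle:
    content = handle.read()
print(content)
```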
repeat_detection_io¶
- synopsis
Parsing repeat detection algorithm output
-
tral.sequence.repeat_detection_io.
getMSA
(sequenceMSA, consensusMSA)[source]¶ Derive the MSA from a strange combination of consensusMSA and sequenceMSA in TRF (Benson) txt.html output files
- Parameters
sequenceMSA –
consensusMSA –
- Returns
The multiple sequence alignment predicted by TRF.
- Return type
msa (list of str)
-
tral.sequence.repeat_detection_io.
hhpredid_get_repeats
(infile)[source]¶ Read repeats from a HHREPID standard output (stdout) file stream successively.
Postcondition: infile points to EOF.
Layout of HHREPID standard output:
protein ::= begin"-"\d "+"\d repeatUnit ( \d"-"\d "+"\d repeatUnit )+
- Parameters
infile (file stream) – File stream from HHREPID standard output (generated by e.g. ./hhrepid_32 -i FASTAFILE -v 0 -d cal.hhm -o INFILE).
- Returns
A generator function is returned that, when called in a loop, yields one repeat per iteration.
- Return type
(Repeat)
Todo
Layout HHREPID output syntax.
-
tral.sequence.repeat_detection_io.
phobos_get_repeats
(infile)[source]¶ Read repeats from a PHOBOS output file stream successively.
Postcondition: infile points to EOF.
- Parameters
infile (file stream) – File stream from PHOBOS output.
- Returns
A generator function is returned that, when called in a loop, yields one repeat per iteration.
- Return type
(Repeat)
Todo
Show PHOBOS output syntax.
-
tral.sequence.repeat_detection_io.
tred_get_repeats
(infile)[source]¶ Read repeats from a TRED standard output (stdout) file stream successively.
Postcondition: infile points to EOF.
Layout of TRED output file:
Start: start End: \d+ Length: \d+ ( \d repeat_unit \d ( alignment_indicator )?)*
- Parameters
infile (file stream) – File stream of output1, generated by running: tred1 myDNA.faa intermediate_output; tred2 myDNA.faa intermediate_output output1 output2
- Returns
A generator function is returned that, when called in a loop, yields one repeat per iteration.
- Return type
(Repeat)
Todo
Layout TRED output syntax.
-
tral.sequence.repeat_detection_io.
tred_msa_from_pairwise
(repeat_units)[source]¶ Construct a MSA from pairwise alignments.
At the moment, gaps following the repeat are not added. However, these gaps are added automatically when a
Repeat
instance is created.
- Parameters
repeat_units (list of str) – Read in from TRED output files
- Returns
(list of str)
Todo
Is the Args format mentioned correctly?
-
tral.sequence.repeat_detection_io.
treks_get_repeats
(infile)[source]¶ Read repeats from a T-REKS standard output (stdout) file stream successively.
Postcondition: infile points to EOF.
Layout of T-REKS output file:
protein ::= ">" identifier repeat*
repeat ::= repeat_header sequence* "*"+
repeat_header ::= "Length:" integer "residues - nb:" integer "from" integer "to" integer "- Psim:" float "region Length:" integer
- Parameters
infile (file stream) – File stream to the file of the standard output of T-Reks
- Returns
A generator function is returned that, when called in a loop, yields one repeat per iteration.
- Return type
(Repeat)
Todo
Layout T-REKS output syntax.
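The repeat_header rule in the T-REKS layout maps naturally onto a regular expression. The sketch below is one possible reading of that rule and is not taken from treks_get_repeats. The field names, the tolerant whitespace handling, and the example line (constructed from the grammar, not copied from real T-REKS output) are all assumptions.

```python
import re

# Mirrors the repeat_header rule; whitespace between tokens is matched leniently.
REPEAT_HEADER = re.compile(
    r"Length:\s*(?P<length>\d+)\s+residues\s+-\s+nb:\s*(?P<nb>\d+)"
    r"\s+from\s+(?P<start>\d+)\s+to\s+(?P<end>\d+)"
    r"\s+-\s+Psim:\s*(?P<psim>\d+(?:\.\d+)?)"
    r"\s+region\s+Length:\s*(?P<region_length>\d+)"
)


def parse_repeat_header(line):
    """Return the header fields as a dict, or None if the line is no header."""
    match = REPEAT_HEADER.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    psim = float(fields.pop("psim"))  # the only non-integer field
    header = {key: int(value) for key, value in fields.items()}
    header["psim"] = psim
    return header


example = "Length: 3 residues - nb: 3  from  4 to 12 - Psim:0.9 region Length:9"
print(parse_repeat_header(example))
```

A parser built this way can skip any line for which parse_repeat_header returns None and treat the lines that follow a header as the sequence rows of that repeat.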
-
tral.sequence.repeat_detection_io.
trf_get_repeats
(infile)[source]¶ Read repeats from a TRF txt.html file stream successively.
Postcondition: infile points to EOF.
TRF output file syntax:
Sequence: ``identifier``
Indices: ``begin``--``end``
\d [a-zA-Z]+ # begin
(repeat)* 1 (consensus)* #
(( \d (repeat)* \d (consensus)* )? \d (repeat)* 1 (consensus)* )+
\d [a-zA-Z]+ # ``Statistics``
- Parameters
infile (file stream) – File stream from TRF output txt.html.
(generated via e.g. ./trf404.linux64.exe FASTAFILE 2 7 7 80 10 50 500 -d > /dev/null). If the -h flag is set, no .txt.html output is produced.
- Returns
A generator function is returned that, when called in a loop, yields one repeat per iteration.
- Return type
(Repeat)
Todo
Layout TRF output syntax.
Todo
Does not currently search for the sequence identifier.
-
tral.sequence.repeat_detection_io.
trust_fill_repeats
(msa, begin, sequence, maximal_gap_length=20)[source]¶ Return a TRUST MSA that has no indels longer than maximal_gap_length and that contains the indel characters even when they are not part of the TRUST output file. Background: TRUST returns tandem repeats, but also distant repeats.
-
tral.sequence.repeat_detection_io.
trust_get_repeats
(infile)[source]¶ Read repeats from a TRUST standard output (stdout) file stream successively.
Postcondition: infile points to EOF.
Layout of TRUST standard output:
protein ::= ">" identifier (repeat_types)* "//"
repeat_types ::= "REPEAT_TYPE" integer "REPEAT_LENGTH" integer (repeat_info)* (">Repeat " integer sequence)*
repeat_info ::= integer integer [integer] [integer]
- Parameters
infile (file stream) – File stream from TRUST standard output.
- Returns
A generator function is returned that, when called in a loop, yields one repeat per iteration.
- Return type
(Repeat)
Todo
Layout TRUST output syntax.
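The generator behaviour and the EOF postcondition shared by the *_get_repeats functions can be sketched against the TRUST layout above. This is an illustrative reading of the grammar, not the code of trust_get_repeats. The name iter_trust_repeats is hypothetical, it yields plain tuples instead of Repeat instances, it ignores the repeat_info lines, and the sample input is constructed from the grammar, not taken from a real TRUST run.

```python
import io


def iter_trust_repeats(infile):
    """Yield (repeat_type, units) for each REPEAT_TYPE block in the stream."""
    repeat_type, units = None, []
    for line in infile:
        line = line.strip()
        if line.startswith("REPEAT_TYPE"):
            # A new repeat type closes the previous one, if any.
            if repeat_type is not None:
                yield repeat_type, units
            repeat_type, units = int(line.split()[1]), []
        elif line.startswith(">Repeat "):
            units.append("")  # a new unit starts; its sequence follows
        elif line == "//" or line.startswith(">"):
            # End of a protein or start of the next one: flush the pending repeat.
            if repeat_type is not None:
                yield repeat_type, units
            repeat_type, units = None, []
        elif units and line and not line[0].isdigit():
            units[-1] += line  # sequence line belonging to the current unit
    if repeat_type is not None:
        yield repeat_type, units


sample = """>protein1
REPEAT_TYPE 1
REPEAT_LENGTH 3
4 6
7 9
>Repeat 1
AYA
>Repeat 2
AYA
//
"""
stream = io.StringIO(sample)
repeats = list(iter_trust_repeats(stream))
print(repeats)
```

Because the generator consumes the stream line by line, the stream is at EOF once the loop over it has finished, which matches the stated postcondition.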
-
tral.sequence.repeat_detection_io.
xstream_get_repeats
(infile)[source]¶ Read repeats from a XSTREAM output xls chart
Postcondition: infile points to EOF.
- Parameters
infile (file stream) – File stream to read XSTREAM output from a xls chart.
- Returns
A generator function is returned that, when called in a loop, yields one repeat per iteration.
- Return type
(Repeat)
repeat_detection_run¶
- synopsis
Execution of repeat detection algorithms
-
class
tral.sequence.repeat_detection_run.
BinaryExecutable
(binary=None, name=None)[source]¶ Contains the executable, and combines executable with parameters.
- Attributes
binary (str) – Path to binary
-
tral.sequence.repeat_detection_run.
Detectors
(lDetector=None, sequence_type=None)[source]¶ Define a global dictionary of all used detector functions.
- Parameters
lDetector (list of str) – A list of repeat detection algorithm names.
sequence_type (str) – Either “AA” or “DNA”.
- Raises
Exception – if at least one of the provided detectors in
lDetector
does not exist.
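The lookup-and-validate behaviour described for Detectors can be sketched with a plain dictionary. Everything named here is hypothetical: the detector names (detector_a, detector_b, detector_c), the registry layout, and the function select_detectors are placeholders, and the placeholder detectors find nothing.

```python
def _placeholder_detector(name):
    """Build a dummy detector; a real one would call an external program."""
    def run(sequence):
        return []  # no repeats found by the placeholder
    run.__name__ = name
    return run


# Hypothetical registry: detector functions grouped by sequence type.
AVAILABLE_DETECTORS = {
    "AA": {
        "detector_a": _placeholder_detector("detector_a"),
        "detector_b": _placeholder_detector("detector_b"),
    },
    "DNA": {
        "detector_c": _placeholder_detector("detector_c"),
    },
}


def select_detectors(names=None, sequence_type="AA"):
    """Return {name: function} for the requested detectors, or all if names is None."""
    available = AVAILABLE_DETECTORS[sequence_type]
    if names is None:
        return dict(available)
    unknown = [name for name in names if name not in available]
    if unknown:
        # Mirrors the documented behaviour: fail if any requested detector is missing.
        raise Exception("Unknown detector(s): {}".format(", ".join(unknown)))
    return {name: available[name] for name in names}


print(sorted(select_detectors(sequence_type="AA")))
```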
-
tral.sequence.repeat_detection_run.
check_java_errors
(outfile, errfile, log=<Logger tral.sequence.repeat_detection_run (ERROR)>, procname=None)[source]¶ Check for Java problems. Return True if there were problems, else False.
Check for these Java errors:
The stdout file is empty but the stderr file is not.
A Java exception string is indicated in the errfile.
- Parameters
outfile (file handle) – Redirected standard output channel file.
errfile (file handle) – Redirected standard error channel file. If None, no copies are saved.
log – Name of the log to issue log messages to. If None, no log messages will be issued.
Todo
Complete docstring
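The two checks listed for check_java_errors can be sketched as below. This is an assumption-laden illustration and not the library's code: the function name has_java_errors is hypothetical, logging is left out, and searching for the substring "Exception" is a guess at what "Java exception string" means.

```python
import io


def has_java_errors(outfile, errfile):
    """Return True if the redirected output suggests a Java failure (sketch)."""
    stdout_text = outfile.read()
    stderr_text = errfile.read()
    # Check 1: nothing on stdout, but something on stderr.
    if not stdout_text.strip() and stderr_text.strip():
        return True
    # Check 2: stderr mentions an exception (assumed marker string).
    return "Exception" in stderr_text


print(has_java_errors(io.StringIO(""), io.StringIO("Could not find main class")))
print(has_java_errors(io.StringIO("repeat found"), io.StringIO("")))
```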
-
tral.sequence.repeat_detection_run.
run_detector
(seq_records, detectors=None, sequence_type='AA', default=True, local_working_dir=None, num_threads=1)[source]¶ Run TRD on seq_records and return the predicted repeats for each record in
seq_records
and for each tandem repeat detector.
Example of return type:
[
  # record 1
  {
    'T-REKS' : [ Repeat(), Repeat(), ...],
    'XSTREAM' : [ Repeat(), Repeat(), ...],
    ...
  },
  # record 2
  ...
]
- Parameters
seq_records (list of Sequence) – A list of Sequence instances
detectors (list of str) – A list of tandem repeat detector names
sequence_type (str) – Either “AA” or “DNA”
default (bool) – If True, default values for the detection algorithms are used.
local_working_dir (str) – Directory where data and results are stored. If provided, (temporary) files are not deleted.
num_threads (int) – Run
num_threads
detectors on parallel threads.
- Returns
A list with a dictionary for each record in seq_records. The dictionary contains a list of repeats for each detector that was used.
- Return type
list of dictionary
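The returned structure is a list with one dictionary per input record, keyed by detector name. The sketch below walks such a structure. The data is a hand-written stand-in in which strings replace Repeat instances, and the helper names are hypothetical.

```python
# Stand-in for the documented return value; strings replace Repeat instances.
predicted = [
    {"T-REKS": ["repeat_1", "repeat_2"], "XSTREAM": ["repeat_3"]},  # record 1
    {"T-REKS": [], "XSTREAM": ["repeat_4"]},                        # record 2
]


def count_repeats(results):
    """Return, per record, the number of predicted repeats for each detector."""
    return [
        {detector: len(repeats) for detector, repeats in record_result.items()}
        for record_result in results
    ]


def repeats_for_detector(results, detector):
    """Collect one detector's predictions across all records, in record order."""
    return [record_result.get(detector, []) for record_result in results]


print(count_repeats(predicted))
print(repeats_for_detector(predicted, "XSTREAM"))
```

The list order follows the order of seq_records, so results can be paired with their sequences by position.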
-
tral.sequence.repeat_detection_run.
split_sequence
(seq_records, working_dir)[source]¶ Split a FASTA file with multiple entries into several FASTA files with one entry each
- Parameters
seq_records – List of Sequence instances.
working_dir – Output directory for the split files.
- Returns
A list of tuples containing the protein identifier and the file name
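A rough sketch of the splitting step, under stated assumptions: records are given as (identifier, sequence) tuples instead of Sequence instances, each file is named after its identifier with a .faa extension, and the function name split_records is hypothetical. None of these details are confirmed by the description of split_sequence.

```python
import os
import tempfile


def split_records(records, working_dir):
    """Write each (identifier, sequence) pair to its own FASTA file (sketch)."""
    written = []
    for identifier, sequence in records:
        file_name = "{}.faa".format(identifier)  # assumed naming scheme
        with open(os.path.join(working_dir, file_name), "w") as handle:
            handle.write(">{}\n{}\n".format(identifier, sequence))
        written.append((identifier, file_name))
    return written


working_dir = tempfile.mkdtemp()
result = split_records([("protein_1", "MKTAYA"), ("protein_2", "GLVAAA")], working_dir)
print(result)
```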