Sequence code documentation

Initial version of the sequence submodule documentation.

sequence

synopsis:The Sequence Class.
class tral.sequence.sequence.Sequence(seq, name=None, sequence_type='AA')[source]

A Sequence describes either a protein or a DNA sequence.

Sequence contains methods that act on single sequences, for example:

  • tandem repeats
seq

str

The sequence.

seq_standard_aa

str

The sequence with standard amino acids only.

create(file, input_format)[source]

Create sequence(s) from file.

Parameters:
  • file (str) – Path to input file
  • input_format (str) – Format of the input file, either “fasta” or “pickle”

Todo

Write checks for format and file.
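
The checks named in this Todo could look roughly like the sketch below. The helper name check_file_and_format and the exact exceptions raised are assumptions made for illustration; they are not part of TRAL.

```python
import os

# The two formats named in the parameter description above.
SUPPORTED_FORMATS = ("fasta", "pickle")


def check_file_and_format(file, input_format):
    """Hypothetical helper: validate the arguments before reading a file."""
    # Reject unknown formats first, so the message names the real problem.
    if input_format not in SUPPORTED_FORMATS:
        raise ValueError(
            "input_format must be one of {}, got {!r}".format(
                SUPPORTED_FORMATS, input_format))
    # Then make sure the path points to an existing regular file.
    if not os.path.isfile(file):
        raise FileNotFoundError("No such input file: {}".format(file))
```

Checking the format before the path means a caller with both a wrong format and a wrong path is told about the format first. The same kind of check would apply to write(file, file_format).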

detect(lHMM=None, denovo=None, realignment='mafft', sequence_type='AA', rate_distribution='constant', user_path=None, **kwargs)[source]

Detects tandem repeats on self.seq from 2 possible sources.

A list of Repeat instances is created for tandem repeat detections on the sequence from two possible sources:

  • Sequence profile hidden Markov models (HMMs)
  • de novo detection algorithms.
Parameters:
  • lHMM (list of HMM) – A list of HMM instances.
  • denovo (bool) – Whether to use de novo detection algorithms.
  • realignment (str) – Either “mafft”, “proPIP” or None
  • **kwargs – Parameters fed to denovo TR prediction and/or Repeat instantiation. E.g. repeat = {"calc_score": True}
Returns:

A RepeatList instance

get_repeatlist(tag)[source]

Retrieve repeatlist from this sequence instance. Access repeatlist as self.d_repeatlist[tag]

Parameters:tag (str) – An identifier for the repeat_list
Returns:A repeat_list instance.
Return type:RepeatList
repeat_in_sequence(my_repeat)[source]

Check whether the repeat is part of this sequence. If it is, calculate the position of the repeat within the sequence.

If the repeat is found, return True and set repeat.begin to the corrected value if necessary; otherwise return False. The check is performed on sequences in which all amino acids are standard or have been converted to standard amino acids.

Parameters:my_repeat (Repeat) – A Repeat instance.
Returns:True if repeat is part of sequence, else False
Return type:bool

Todo

Decide whether save_original_msa is needed here.

set_repeatlist(repeatlist, tag)[source]

Add repeatlist as attribute to this sequence instance. Access repeatlist as self.d_repeatlist[tag]

Parameters:
  • repeatlist (RepeatList) – A repeat_list instance.
  • tag (str) – An identifier for the repeat_list
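
The tag-based access described for set_repeatlist and get_repeatlist amounts to a dictionary keyed by tag. The class below is a stand-in written for illustration only; that d_repeatlist is created empty in the constructor is an assumption, and plain lists take the place of RepeatList instances.

```python
class SequenceSketch:
    """Illustrative stand-in for Sequence, not the TRAL class."""

    def __init__(self, seq):
        self.seq = seq
        # Assumption: the dictionary starts out empty.
        self.d_repeatlist = {}

    def set_repeatlist(self, repeatlist, tag):
        # Store the repeat list under its tag, replacing any earlier entry.
        self.d_repeatlist[tag] = repeatlist

    def get_repeatlist(self, tag):
        # Raises KeyError if nothing was stored under this tag.
        return self.d_repeatlist[tag]


sequence = SequenceSketch("MKVGGSGGSGGS")
sequence.set_repeatlist(["GGS", "GGS", "GGS"], tag="denovo")
stored = sequence.get_repeatlist("denovo")
```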
write(file, file_format)[source]

Write sequence to file.

Write sequence to file using one of two formats.

Parameters:
  • file (str) – Path to output file
  • file_format (str) – Format of the output file, either “fasta” or “pickle”

Todo

Write checks for format and file.

sequence_io

synopsis:Input/output for sequences
tral.sequence.sequence_io.read_fasta(file, indices=None)[source]

Read all sequences from a fasta file. Currently, the Biopython SeqIO parser is used.

Parameters:
  • file (str) – Path to input file
  • indices ([int, int]) – Index of the first returned sequence and index of the first sequence that is not returned.
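
The effect of the index pair can be sketched as a half-open slice over the records of the file. The code below does not use Biopython and is not the TRAL implementation; iter_fasta and read_fasta_slice are hypothetical names, and zero-based counting is an assumption.

```python
import io
from itertools import islice


def iter_fasta(handle):
    """Yield (identifier, sequence) tuples from a FASTA file handle."""
    identifier, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if identifier is not None:
                yield identifier, "".join(chunks)
            identifier, chunks = line[1:].strip(), []
        elif line and identifier is not None:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


def read_fasta_slice(handle, indices=None):
    """Return all records, or only those in the half-open range given."""
    records = iter_fasta(handle)
    if indices is not None:
        first_returned, first_not_returned = indices
        records = islice(records, first_returned, first_not_returned)
    return list(records)


example = io.StringIO(">a\nMKV\n>b\nGGS\nGGS\n>c\nPLT\n")
selected = read_fasta_slice(example, indices=[1, 3])
# selected == [("b", "GGSGGS"), ("c", "PLT")]
```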
tral.sequence.sequence_io.write(sequence, sequence_file, sequence_id='sequence_id_not_defined')[source]

Write a sequence str in fasta format to the specified <sequence_file>.

Parameters:
  • sequence (str) – Sequence
  • sequence_file (str) – Path to the output file
  • sequence_id (str) – ID of the sequence in the output file.
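
As a rough illustration of the output this function is described as producing, the sketch below writes one FASTA record. It takes an open handle instead of a path, and the width parameter for line wrapping is hypothetical; neither is taken from the TRAL source.

```python
import io


def write_fasta(sequence, handle, sequence_id="sequence_id_not_defined", width=60):
    """Write a single sequence str as one FASTA record to an open handle."""
    # Header line: ">" followed by the identifier.
    handle.write(">{}\n".format(sequence_id))
    # Sequence lines, wrapped at a fixed width.
    for start in range(0, len(sequence), width):
        handle.write(sequence[start:start + width] + "\n")


buffer = io.StringIO()
write_fasta("MKVLAAGIVGL", buffer, sequence_id="example_protein", width=5)
fasta_text = buffer.getvalue()
# fasta_text == ">example_protein\nMKVLA\nAGIVG\nL\n"
```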

repeat_detection_io

synopsis:Parsing repeat detection algorithm output
tral.sequence.repeat_detection_io.getMSA(sequenceMSA, consensusMSA)[source]

Derive the MSA from a strange combination of consensusMSA and sequenceMSA in TRF (Benson) txt.html output files

Parameters:
  • sequenceMSA
  • consensusMSA
Returns:

msa – The multiple sequence alignment predicted by TRF.

Return type:

list of str

tral.sequence.repeat_detection_io.hhpredid_get_repeats(infile)[source]

Read repeats from a HHREPID standard output (stdout) file stream successively. Postcondition: infile points to EOF.

Layout of HHREPID standard output:

protein ::=
     begin"-"\d    "+"\d repeatUnit
   ( \d"-"\d    "+"\d repeatUnit )+
Parameters:
  • infile (file stream) – File stream from HHREPID standard output, generated by e.g. ./hhrepid_32 -i FASTAFILE -v 0 -d cal.hhm -o INFILE
Returns:

A generator function is returned that, when called in a loop, yields one repeat per iteration.

Return type:

(Repeat)

Todo

Layout HHREPID output syntax.

tral.sequence.repeat_detection_io.phobos_get_repeats(infile)[source]

Read repeats from a PHOBOS output file stream successively. Postcondition: infile points to EOF.

Parameters:infile (file stream) – File stream from PHOBOS output.
Returns:A generator function is returned that, when called in a loop, yields one repeat per iteration.
Return type:(Repeat)

Todo

Show PHOBOS output syntax.

tral.sequence.repeat_detection_io.tred_get_repeats(infile)[source]

Read repeats from a TRED standard output (stdout) file stream successively. Postcondition: infile points to EOF.

Layout of TRED output file:

 Start: start End: \d+ Length: \d+

( \d repeat_unit \d
( alignment_indicator )?)*
Parameters:infile (file stream) – File stream of output1 from the commands: tred1 myDNA.faa intermediate_output; tred2 myDNA.faa intermediate_output output1 output2
Returns:A generator function is returned that, when called in a loop, yields one repeat per iteration.
Return type:(Repeat)

Todo

Layout TRED output syntax.

tral.sequence.repeat_detection_io.tred_msa_from_pairwise(repeat_units)[source]

Construct an MSA from pairwise alignments. At the moment, gaps following the repeat are not added. However, these gaps are added automatically when a Repeat instance is created.

Parameters:repeat_units (list of str) – Read in from TRED output files
Returns:(list of str)

Todo

Is the Args format mentioned correctly?

tral.sequence.repeat_detection_io.treks_get_repeats(infile)[source]

Read repeats from a T-REKS standard output (stdout) file stream successively. Postcondition: infile points to EOF.

Layout of T-REKS output file:

protein ::=
    ">" identifier
    repeat*
#
repeat ::=
    repeat_header
    sequence*
    "*"+
#
repeat_header ::= "Length:" integer "residues - nb:" integer  "from"  integer "to" integer "- Psim:"float "region Length:"integer
Parameters:infile (file stream) – File stream from T-REKS standard output.
Returns:A generator function is returned that, when called in a loop, yields one repeat per iteration.
Return type:(Repeat)

Todo

Layout T-REKS output syntax.
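
The repeat_header rule in the layout above can be matched with a regular expression. This is a sketch derived only from the grammar shown here, not the parser used in TRAL; the field names and the sample line are made up for illustration.

```python
import re

# One named group per value in the repeat_header rule.
REPEAT_HEADER = re.compile(
    r"Length:\s*(?P<length>\d+)\s+residues\s+-\s+nb:\s*(?P<nb>\d+)\s+"
    r"from\s+(?P<begin>\d+)\s+to\s+(?P<end>\d+)\s+-\s+Psim:\s*"
    r"(?P<psim>\d*\.?\d+)\s+region\s+Length:\s*(?P<region_length>\d+)"
)


def parse_repeat_header(line):
    """Return the header fields as a dict, or None if the line does not match."""
    match = REPEAT_HEADER.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    psim = float(fields.pop("psim"))
    header = {key: int(value) for key, value in fields.items()}
    header["psim"] = psim
    return header


# Sample line constructed from the grammar, not taken from real T-REKS output.
header = parse_repeat_header(
    "Length: 4 residues - nb: 3  from  10 to 21 - Psim:0.85 region Length:12")
```

Lines that are not repeat headers, such as the sequence lines and the "*" separators, return None and can be handled by the calling loop.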

tral.sequence.repeat_detection_io.trf_get_repeats(infile)[source]

Read repeats from a TRF txt.html file stream successively. Postcondition: infile points to EOF.

TRF output file syntax:

Sequence: ``identifier``
     Indices: ``begin``--``end``
     \d [a-zA-Z]+
#
     begin (repeat)*
     1  (consensus)*
#
  (( \d (repeat)*
     \d  (consensus)*
  )?
     \d (repeat)*
     1  (consensus)*
  )+
     \d [a-zA-Z]+
#
    ``Statistics``
Parameters:infile (file stream) – File stream from TRF output txt.html.

(Generated via e.g. ./trf404.linux64.exe FASTAFILE 2 7 7 80 10 50 500 -d > /dev/null. If the -h flag is set, no .txt.html output is produced.)

Returns:A generator function is returned that, when called in a loop, yields one repeat per iteration.
Return type:(Repeat)

Todo

Layout TRF output syntax.

Todo

Currently does not search for the sequence identifier.

tral.sequence.repeat_detection_io.trust_fill_repeats(msa, begin, sequence, maximal_gap_length=20)[source]

Return a TRUST MSA that has no indels longer than maximal_gap_length and that contains the indel characters even when they are not part of the TRUST output file. Background: TRUST returns tandem repeats, but also distant repeats.

tral.sequence.repeat_detection_io.trust_get_repeats(infile)[source]

Read repeats from a TRUST standard output (stdout) file stream successively. Postcondition: infile points to EOF.

Layout of TRUST standard output:

protein ::=
    ">" identifier
    (repeat_types)*
    "//"
#
repeat_types ::=
    "REPEAT_TYPE" integer
    "REPEAT_LENGTH" integer
    (repeat_info)*
    (">Repeat " integer
    sequence)*
#
repeat_info ::=
    integer integer [integer] [integer]
Parameters:infile (file stream) – File stream from TRUST standard output.
Returns:A generator function is returned that, when called in a loop, yields one repeat per iteration.
Return type:(Repeat)

Todo

Layout TRUST output syntax.

tral.sequence.repeat_detection_io.xstream_get_repeats(infile)[source]

Read repeats from an XSTREAM output xls chart.

Postcondition: infile points to EOF.

Parameters:infile (file stream) – File stream to read XSTREAM output from an xls chart.
Returns:A generator function is returned that, when called in a loop, yields one repeat per iteration.
Return type:(Repeat)

repeat_detection_run

synopsis:Execution of repeat detection algorithms
class tral.sequence.repeat_detection_run.BinaryExecutable(binary=None, name=None)[source]

Contains the executable, and combines executable with parameters.

Attributes:
  • binary (str) – Path to binary
get_execute_line(*args)[source]

Return the command line to invoke the program with the arguments args

get_execute_tokens(*args)[source]

Return the tokens to invoke the program with the arguments args
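
The relationship between the two methods can be sketched as follows: the tokens are the binary followed by its arguments, and the command line is those tokens joined into one string. The class below is an illustrative stand-in, not the TRAL implementation; the shell quoting and the binary path are assumptions.

```python
import shlex


class BinaryExecutableSketch:
    """Illustrative stand-in for BinaryExecutable."""

    def __init__(self, binary=None, name=None):
        self.binary = binary
        self.name = name

    def get_execute_tokens(self, *args):
        # One token per argument, suitable for subprocess-style calls.
        return [self.binary] + [str(arg) for arg in args]

    def get_execute_line(self, *args):
        # Quote each token so the line survives a shell unchanged.
        return " ".join(
            shlex.quote(token) for token in self.get_execute_tokens(*args))


# Hypothetical path to a detector binary.
trf = BinaryExecutableSketch(binary="/usr/local/bin/trf", name="TRF")
tokens = trf.get_execute_tokens("input file.fasta", 2, 7)
line = trf.get_execute_line("input file.fasta", 2, 7)
# tokens == ["/usr/local/bin/trf", "input file.fasta", "2", "7"]
# line == "/usr/local/bin/trf 'input file.fasta' 2 7"
```

A token list avoids quoting problems when no shell is involved, while a single line is convenient for logging.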

tral.sequence.repeat_detection_run.Detectors(lDetector=None, sequence_type=None)[source]

Define a global dictionary of all used detector functions.

Parameters:
  • lDetector (list of str) – A list of repeat detection algorithm names.
  • sequence_type (str) – Either “AA” or “DNA”.
Raises:

Exception – if at least one of the provided detectors in lDetector does not exist.

tral.sequence.repeat_detection_run.check_java_errors(outfile, errfile, log=<logging.Logger object>, procname=None)[source]

Check for Java problems. Return True if there were problems, else False.

Check for these Java errors:

  • Stdout file is empty but stderr file is not.
  • A Java exception string is present in the errfile.

Parameters:
  • outfile (file handle) – Redirected standard output channel file.
  • errfile (file handle) – Redirected standard error channel file. If None, no copies are saved.
  • log – Name of the log to issue log messages to. If None, no log messages will be issued.

Todo

Complete docstring
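
The two conditions listed for check_java_errors can be sketched on in-memory file handles. The function name has_java_errors is hypothetical, logging is left out, and matching on the substring "Exception" is an assumption about what counts as a Java exception string.

```python
import io


def has_java_errors(outfile, errfile):
    """Return True if either of the two documented error conditions holds."""
    out_text = outfile.read()
    err_text = errfile.read() if errfile is not None else ""
    # Condition 1: stdout is empty but stderr is not.
    if not out_text.strip() and err_text.strip():
        return True
    # Condition 2: stderr mentions a Java exception (assumed marker string).
    return "Exception" in err_text


silent_failure = has_java_errors(io.StringIO(""), io.StringIO("fatal error\n"))
clean_run = has_java_errors(io.StringIO("repeat found\n"), io.StringIO(""))
exception_run = has_java_errors(
    io.StringIO("partial output\n"),
    io.StringIO("java.lang.NullPointerException\n"))
```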

tral.sequence.repeat_detection_run.run_detector(seq_records, detectors=None, sequence_type='AA', default=True, local_working_dir=None, num_threads=1)[source]
Run tandem repeat detection (TRD) on seq_records and return the predicted repeats for each sequence record and for each tandem repeat detector. Example of return type:

[
    # record 1
    {
    'T-REKS' : [ Repeat(), Repeat(), ...],
    'XSTREAM' : [ Repeat(), Repeat(), ...],
    ...
    },
    # record 2
    ...
]
Parameters:
  • seq_records (list of Sequence) – A list of Sequence instances
  • detectors (list of str) – A list of tandem repeat detector names
  • sequence_type (str) – Either “AA” or “DNA”
  • default (bool) – If True, default values for the detection algorithms are used.
  • local_working_dir (str) – Directory where data and results are stored. If provided, temporary files are not deleted.
  • num_threads (int) – Run num_threads detectors on parallel threads.
Returns:

A list with a dictionary for each record in seq_records. The dictionary contains a list of repeats for each detector that was used.

Return type:

list of dictionary
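
A result of this shape can be traversed with two nested loops, one over records and one over detectors. The sketch below counts predictions per detector; the function name is hypothetical, and plain strings stand in for Repeat instances.

```python
def count_repeats_per_detector(predictions):
    """Sum the number of predicted repeats per detector over all records."""
    counts = {}
    for record_result in predictions:                    # one dict per record
        for detector, repeats in record_result.items():  # one list per detector
            counts[detector] = counts.get(detector, 0) + len(repeats)
    return counts


# Placeholder data in the documented shape.
predictions = [
    {"T-REKS": ["repeat_1", "repeat_2"], "XSTREAM": ["repeat_3"]},
    {"T-REKS": [], "XSTREAM": ["repeat_4"]},
]
counts = count_repeats_per_detector(predictions)
# counts == {"T-REKS": 2, "XSTREAM": 2}
```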

tral.sequence.repeat_detection_run.split_sequence(seq_records, working_dir)[source]

Split a FASTA file with multiple entries into several FASTA files with one entry each.

Parameters:
  • seq_records – List of Sequence instances.
  • working_dir – Output directory for the split files.
Returns:

A list of tuples containing the protein identifier and the file name.
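
The splitting step can be sketched as writing one file per record and collecting the identifier and file name pairs. This is not the TRAL implementation: (identifier, sequence) tuples stand in for Sequence instances, and naming each file after its identifier with a .faa suffix is an assumption that only works for identifiers that are valid file names.

```python
import os
import tempfile


def split_records(records, working_dir):
    """Write one single-entry FASTA file per record into working_dir."""
    result = []
    for identifier, sequence in records:
        # Assumed naming scheme: <identifier>.faa inside working_dir.
        file_name = os.path.join(working_dir, identifier + ".faa")
        with open(file_name, "w") as handle:
            handle.write(">{}\n{}\n".format(identifier, sequence))
        result.append((identifier, file_name))
    return result


with tempfile.TemporaryDirectory() as working_dir:
    files = split_records([("P1", "MKV"), ("P2", "GGS")], working_dir)
    identifiers = [identifier for identifier, _ in files]
```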