Sequence code documentation

The sequence submodule documentation.

sequence

synopsis

The Sequence Class.

class tral.sequence.sequence.Sequence(seq, name=None, sequence_type='AA')[source]

A Sequence describes either a protein or a DNA sequence.

Sequence contains methods that act on single sequences, for example:

  • tandem repeat detection

seq

The sequence.

Type

str

seq_standard_aa

The sequence with standard amino acids only.

Type

str

create(input_format)[source]

Create sequence(s) from file.

Parameters
  • file (str) – Path to input file

  • input_format (str) – Either “fasta” or “pickle”

Todo

Write checks for format and file.

detect(lHMM=None, denovo=None, realignment='mafft', sequence_type='AA', rate_distribution='constant', user_path=None, **kwargs)[source]

Detects tandem repeats on self.seq from 2 possible sources.

A list of Repeat instances is created for tandem repeat detections on the sequence from two possible sources:

  • Sequence profile hidden Markov models (HMMs)

  • de novo detection algorithms.

Parameters
  • lHMM (list of hmm.HMM) – A list of HMM instances.

  • denovo (bool) – Whether to run de novo detection algorithms.

  • realignment (str) – Either “mafft”, “proPIP” or None.

  • **kwargs – Parameters fed to de novo TR prediction and/or Repeat instantiation, e.g. repeat = {"calc_score": True}

Returns

A RepeatList instance

get_repeatlist(tag)[source]

Retrieve repeatlist from this sequence instance.

Access the repeat list as self.d_repeatlist[tag].

Parameters

tag (str) – An identifier for the repeat_list

Returns

A repeat_list instance.

Return type

RepeatList

repeat_in_sequence(my_repeat)[source]

Sanity check whether the repeat is part of this sequence. If so, calculate the position of the repeat within the sequence.

If yes: return True, and set repeat.begin to the corrected value if necessary. If no: return False. The sanity check is performed on sequences in which all amino acids are, or have been converted to, standard amino acids.

Parameters

my_repeat (Repeat) – A Repeat instance.

Returns

True if the repeat is part of the sequence, else False.

Return type

bool

Todo

Decide whether save_original_msa is needed here.
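The check described above can be illustrated with a minimal sketch. It is not the library's implementation: the function name, the 1-based begin convention, and the idea that the ungapped repeat text is the concatenation of the MSA rows with "-" removed are all assumptions made for illustration, and the conversion to standard amino acids is left out.

```python
def repeat_in_sequence_sketch(seq, msa, begin):
    """Return (found, corrected_begin) for a repeat given as MSA rows.

    Hypothetical helper: `begin` is assumed to be 1-based, and the repeat
    text is assumed to be the MSA rows joined with gap characters removed.
    """
    repeat_text = "".join(msa).replace("-", "")
    start = begin - 1
    # Case 1: the repeat sits exactly where `begin` says it does.
    if seq[start:start + len(repeat_text)] == repeat_text:
        return True, begin
    # Case 2: the repeat occurs elsewhere; report the corrected position.
    position = seq.find(repeat_text)
    if position == -1:
        return False, None
    return True, position + 1
```

For example, with seq "MKAAGAAGT" and MSA rows ["AAG", "AAG"], a stated begin of 1 would be corrected to 3.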

set_repeatlist(repeatlist, tag)[source]

Add repeatlist as attribute to this sequence instance.

Access the repeat list as self.d_repeatlist[tag].

Parameters
  • repeatlist (RepeatList) – A repeat_list instance.

  • tag (str) – An identifier for the repeat_list
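The tag-keyed storage behind set_repeatlist and get_repeatlist can be pictured with a small stand-in class. This is a sketch only: SequenceSketch is a hypothetical name, the repeat lists are plain Python lists here rather than RepeatList instances, and raising KeyError for an unknown tag is an assumption rather than documented behaviour.

```python
class SequenceSketch:
    """Hypothetical stand-in that mimics the d_repeatlist attribute."""

    def __init__(self, seq, name=None, sequence_type="AA"):
        self.seq = seq
        self.name = name
        self.sequence_type = sequence_type
        self.d_repeatlist = {}  # maps tag -> repeat list

    def set_repeatlist(self, repeatlist, tag):
        # Store the repeat list under its tag, replacing any earlier entry.
        self.d_repeatlist[tag] = repeatlist

    def get_repeatlist(self, tag):
        # Plain dictionary lookup; an unknown tag raises KeyError in this sketch.
        return self.d_repeatlist[tag]


sequence = SequenceSketch("MKAAGAAGT")
sequence.set_repeatlist(["repeat_1", "repeat_2"], "denovo")
denovo_repeats = sequence.get_repeatlist("denovo")
```

Keeping one entry per tag makes it possible to store, for example, unfiltered and filtered predictions side by side on the same sequence.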

write(file, file_format)[source]

Write sequence to file.

The sequence is written in one of two formats.

Parameters
  • file (str) – Path to output file

  • file_format (str) – Either “fasta” or “pickle”

Todo

Write checks for format and file.

sequence_io

synopsis

Input/output for sequences

tral.sequence.sequence_io.read_fasta(file, indices=None)[source]

Read all sequences from a fasta file.

Currently, the Biopython SeqIO parser is used.

Parameters
  • file (str) – Path to input file

  • indices ([int, int]) – Index of the first returned sequence, and index of the first sequence that is not returned.
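The half-open meaning of the index pair (first returned, first not returned) may be easier to see in code. The sketch below is an illustration, not the library's code: it uses a small hand-written FASTA parser instead of Biopython so that it stays self-contained, it takes an open handle rather than a path, it returns (name, sequence) tuples, and it assumes 0-based indices.

```python
import io


def read_fasta_sketch(handle, indices=None):
    """Parse FASTA text from a handle into (name, sequence) tuples."""
    records, name, chunks = [], None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    if indices is not None:
        first, stop = indices  # `stop` is the first record NOT returned
        records = records[first:stop]
    return records


fasta_text = ">a\nMKA\nAG\n>b\nTTT\n>c\nGGG\n"
subset = read_fasta_sketch(io.StringIO(fasta_text), indices=[1, 3])
```

With these assumptions, indices=[1, 3] selects the second and third records, exactly like the Python slice records[1:3].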

tral.sequence.sequence_io.write(sequence, sequence_file, sequence_id='sequence_id_not_defined')[source]

Write a sequence str in fasta format to the specified <sequence_file>.

Parameters
  • sequence (str) – Sequence

  • sequence_file (str) – Path to the output file

  • sequence_id (str) – ID of the sequence in the output file.
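As a rough illustration of what writing a sequence string in fasta format involves, here is a minimal sketch. The function name write_sketch is hypothetical, and writing the whole sequence on a single line without wrapping is an assumption; the library may format its output differently.

```python
def write_sketch(sequence, sequence_file, sequence_id="sequence_id_not_defined"):
    """Write one sequence string as a single FASTA record."""
    with open(sequence_file, "w") as handle:
        # A FASTA record is a ">" header line followed by the sequence.
        handle.write(">{}\n{}\n".format(sequence_id, sequence))
```

Calling write_sketch("MKAAGAAGT", "out.fasta", sequence_id="my_protein") would produce a two-line file with the header ">my_protein".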

repeat_detection_io

synopsis

Parsing repeat detection algorithm output

tral.sequence.repeat_detection_io.getMSA(sequenceMSA, consensusMSA)[source]

Derive the MSA from a strange combination of consensusMSA and sequenceMSA in TRF (Benson) txt.html output files

Parameters
  • sequenceMSA

  • consensusMSA

Returns

The multiple sequence alignment predicted by TRF.

Return type

msa (list of str)

tral.sequence.repeat_detection_io.hhpredid_get_repeats(infile)[source]

Read repeats from a HHREPID standard output (stdout) file stream successively.

Postcondition: infile points to EOF.

Layout of HHREPID standard output:

protein ::=
     begin"-"\d    "+"\d repeatUnit
   ( \d"-"\d    "+"\d repeatUnit )+
Parameters
  • infile (file stream) – File stream from HHREPID standard output. [Generated by e.g.: ./hhrepid_32 -i FASTAFILE -v 0 -d cal.hhm -o INFILE]

Returns

A generator function is returned that, when called in a loop, yields one repeat per iteration.

Return type

(Repeat)

Todo

Layout HHREPID output syntax.

tral.sequence.repeat_detection_io.phobos_get_repeats(infile)[source]

Read repeats from a PHOBOS output file stream successively.

Postcondition: infile points to EOF.

Parameters

infile (file stream) – File stream from PHOBOS output.

Returns

A generator function is returned that, when called in a loop, yields one repeat per iteration.

Return type

(Repeat)

Todo

Show PHOBOS output syntax.

tral.sequence.repeat_detection_io.tred_get_repeats(infile)[source]

Read repeats from a TRED standard output (stdout) file stream successively.

Postcondition: infile points to EOF.

Layout of TRED output file:

 Start: start End: \d+ Length: \d+

( \d repeat_unit \d
( alignment_indicator )?)*
Parameters

infile (file stream) – File stream of output1, generated by e.g.: tred1 myDNA.faa intermediate_output; tred2 myDNA.faa intermediate_output output1 output2

Returns

A generator function is returned that, when called in a loop, yields one repeat per iteration.

Return type

(Repeat)

Todo

Layout TRED output syntax.

tral.sequence.repeat_detection_io.tred_msa_from_pairwise(repeat_units)[source]

Construct an MSA from pairwise alignments.

At the moment, gaps following the repeat are not added. However, these gaps are added automatically when a Repeat instance is created.

Parameters

repeat_units (list of str) – Read in from TRED output files

Returns

(list of str)

Todo

Is the Args format mentioned correctly?

tral.sequence.repeat_detection_io.treks_get_repeats(infile)[source]

Read repeats from a T-REKS standard output (stdout) file stream successively.

Postcondition: infile points to EOF.

Layout of T-REKS output file:

protein ::=
    ">" identifier
    repeat*
#
repeat ::=
    repeat_header
    sequence*
    "*"+
#
repeat_header ::= "Length:" integer "residues - nb:" integer  "from"  integer "to" integer "- Psim:"float "region Length:"integer
Parameters

infile (file stream) – File stream to the file containing the standard output of T-REKS

Returns

A generator function is returned that, when called in a loop, yields one repeat per iteration.

Return type

(Repeat)

Todo

Layout T-REKS output syntax.
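The T-REKS layout given above, together with the "one repeat per iteration" generator contract, can be sketched as a small parser. This is an illustration under assumptions, not the library's parser: iter_treks_repeats is a hypothetical name, it yields plain dictionaries instead of Repeat instances, and the whitespace handling in the header pattern is a guess based on the repeat_header rule.

```python
import io
import re

# Pattern for the repeat_header rule; whitespace tolerance is an assumption.
HEADER = re.compile(
    r"Length:\s*(?P<length>\d+)\s+residues - nb:\s*(?P<n_units>\d+)"
    r"\s+from\s+(?P<begin>\d+)\s+to\s+(?P<end>\d+)"
    r"\s*- Psim:\s*(?P<psim>[\d.]+)\s+region Length:\s*(?P<region_length>\d+)"
)


def iter_treks_repeats(infile):
    """Yield one dictionary per repeat block; consumes infile up to EOF."""
    header, units = None, []
    for line in infile:
        line = line.strip()
        match = HEADER.match(line)
        if match:
            header, units = match.groupdict(), []
        elif header is not None and line and set(line) == {"*"}:
            # A line made only of "*" closes the current repeat block.
            yield {
                "begin": int(header["begin"]),
                "psim": float(header["psim"]),
                "msa": units,
            }
            header = None
        elif header is not None and line:
            units.append(line)


sample = (
    ">seq1\n"
    "Length: 3 residues - nb: 2  from  5 to 10 - Psim:0.75 region Length:6\n"
    "ABC\n"
    "AB-\n"
    "**********************\n"
)
stream = io.StringIO(sample)
repeats = list(iter_treks_repeats(stream))
```

Because the generator reads the stream line by line until it is exhausted, the stream is at EOF once the loop over the generator has finished, which matches the stated postcondition.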

tral.sequence.repeat_detection_io.trf_get_repeats(infile)[source]

Read repeats from a TRF txt.html file stream successively.

Postcondition: infile points to EOF.

TRF output file syntax:

Sequence: ``identifier``
     Indices: ``begin``--``end``
     \d [a-zA-Z]+
#
     begin (repeat)*
     1  (consensus)*
#
  (( \d (repeat)*
     \d  (consensus)*
  )?
     \d (repeat)*
     1  (consensus)*
  )+
     \d [a-zA-Z]+
#
    ``Statistics``
Parameters

infile (file stream) – File stream from TRF output txt.html (generated via e.g. ./trf404.linux64.exe FASTAFILE 2 7 7 80 10 50 500 -d > /dev/null). If the -h flag is set, no .txt.html output is produced.

Returns

A generator function is returned that, when called in a loop, yields one repeat per iteration.

Return type

(Repeat)

Todo

Layout TRF output syntax.

Todo

Currently does not search for the sequence identifier.

tral.sequence.repeat_detection_io.trust_fill_repeats(msa, begin, sequence, maximal_gap_length=20)[source]

Return a TRUST MSA that contains no indels longer than maximal_gap_length, and that includes the indel characters even when they are not part of the TRUST output file. Background: TRUST returns tandem repeats, but also distant repeats.

tral.sequence.repeat_detection_io.trust_get_repeats(infile)[source]

Read repeats from a TRUST standard output (stdout) file stream successively.

Postcondition: infile points to EOF.

Layout of TRUST standard output:

protein ::=
    ">" identifier
    (repeat_types)*
    "//"
#
repeat_types ::=
    "REPEAT_TYPE" integer
    "REPEAT_LENGTH" integer
    (repeat_info)*
    (">Repeat " integer
    sequence)*
#
repeat_info ::=
    integer integer [integer] [integer]
Parameters

infile (file stream) – File stream from TRUST standard output.

Returns

A generator function is returned that, when called in a loop, yields one repeat per iteration.

Return type

(Repeat)

Todo

Layout TRUST output syntax.

tral.sequence.repeat_detection_io.xstream_get_repeats(infile)[source]

Read repeats from an XSTREAM output xls chart.

Postcondition: infile points to EOF.

Parameters

infile (file stream) – File stream to read XSTREAM output from an xls chart.

Returns

A generator function is returned that, when called in a loop, yields one repeat per iteration.

Return type

(Repeat)

repeat_detection_run

synopsis

Execution of repeat detection algorithms

class tral.sequence.repeat_detection_run.BinaryExecutable(binary=None, name=None)[source]

Contains the executable, and combines executable with parameters.

Attributes:

binary(str): Path to binary

get_execute_line(*args)[source]

Return the command line to invoke the program with the arguments args

get_execute_tokens(*args)[source]

Return the tokens to invoke the program with the arguments args
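The difference between the token list and the command line can be shown with a short sketch. BinaryExecutableSketch is a hypothetical stand-in, not the library class; converting arguments with str() and quoting them with shlex.quote are assumptions about reasonable behaviour.

```python
import shlex


class BinaryExecutableSketch:
    """Hypothetical stand-in combining a binary path with its arguments."""

    def __init__(self, binary=None, name=None):
        self.binary = binary
        self.name = name

    def get_execute_tokens(self, *args):
        # Token list, suitable for subprocess calls without a shell.
        return [self.binary] + [str(arg) for arg in args]

    def get_execute_line(self, *args):
        # Single shell-style string; tokens with spaces are quoted.
        return " ".join(shlex.quote(token) for token in self.get_execute_tokens(*args))


executable = BinaryExecutableSketch(binary="/opt/trf", name="TRF")
tokens = executable.get_execute_tokens("seqs.fasta", 2, 7)
line = executable.get_execute_line("seqs.fasta", 2, 7)
```

A token list is usually the safer form to pass to subprocess.run, since it avoids shell parsing altogether; the single line is convenient for logging.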

tral.sequence.repeat_detection_run.Detectors(lDetector=None, sequence_type=None)[source]

Define a global dictionary of all used detector functions.

Parameters
  • lDetector (list of str) – A list of repeat detection algorithm names.

  • sequence_type (str) – Either “AA” or “DNA”.

Raises

Exception – if at least one of the provided detectors in lDetector does not exist.

tral.sequence.repeat_detection_run.check_java_errors(outfile, errfile, log=<Logger tral.sequence.repeat_detection_run (ERROR)>, procname=None)[source]

Check for Java problems. Return True if there were problems, else False.

Check for these Java errors:

  • The stdout file is empty but the stderr file is not.

  • A Java exception string is present in the errfile.

Parameters
  • outfile (file handle) – Redirected standard output channel file.

  • errfile (file handle) – Redirected standard error channel file. If None, no copies are saved.

  • log (logging.Logger) – Logger to issue log messages to. If None, no log messages will be issued.

Todo

Complete docstring
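The two error conditions listed for check_java_errors can be sketched as follows. This is an illustration only: the function name is hypothetical, logging is omitted, both handles are assumed to be readable text streams, and the marker string "Exception" is an assumption about what counts as a Java exception string.

```python
import io


def check_java_errors_sketch(outfile, errfile, marker="Exception"):
    """Return True if the redirected output suggests a Java failure."""
    out_text = outfile.read()
    err_text = errfile.read()
    # Condition 1: nothing on stdout, but something on stderr.
    if not out_text.strip() and err_text.strip():
        return True
    # Condition 2: stderr mentions a Java exception (marker is assumed).
    return marker in err_text


failed = check_java_errors_sketch(
    io.StringIO("partial output"),
    io.StringIO("java.lang.NullPointerException at Main.run"),
)
```

In this sketch a plain warning on stderr does not count as a problem as long as stdout is non-empty and no exception marker appears.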

tral.sequence.repeat_detection_run.run_detector(seq_records, detectors=None, sequence_type='AA', default=True, local_working_dir=None, num_threads=1)[source]

Run tandem repeat detection (TRD) on seq_records and return the predicted repeats for each record in seq_records and for each tandem repeat detector.

Example of return type:

[
    # record 1
    {
    'T-REKS' : [ Repeat(), Repeat(), ...],
    'XSTREAM' : [ Repeat(), Repeat(), ...],
    ...
    },
    # record 2
    ...
]
Parameters
  • seq_records (list of Sequence) – A list of Sequence instances

  • detectors (list of str) – A list of tandem repeat detector names

  • sequence_type (str) – Either “AA” or “DNA”

  • default (bool) – If True, default values for the detection algorithms are used.

  • local_working_dir (str) – Directory where data and results are stored. If provided, temporary files are not deleted.

  • num_threads (int) – Run num_threads detectors on parallel threads.

Returns

A list with a dictionary for each record in seq_records. The dictionary contains a list of repeats for each detector that was used.

Return type

list of dictionary
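A result of the shape described above (one dictionary per record, one list of repeats per detector) can be traversed as in the following sketch. The data is made up for illustration: plain strings stand in for Repeat instances, and count_repeats_per_detector is a hypothetical helper, not part of the module.

```python
from collections import Counter


def count_repeats_per_detector(predictions):
    """Sum the number of predicted repeats per detector over all records."""
    counts = Counter()
    for record_predictions in predictions:          # one dict per sequence record
        for detector, repeats in record_predictions.items():
            counts[detector] += len(repeats)        # one list per detector
    return dict(counts)


# Placeholder data in the documented shape; strings stand in for Repeat objects.
example_predictions = [
    {"T-REKS": ["repeat_a", "repeat_b"], "XSTREAM": ["repeat_c"]},
    {"T-REKS": [], "XSTREAM": ["repeat_d"]},
]
totals = count_repeats_per_detector(example_predictions)
```

The outer list is ordered like seq_records, so the predictions for the i-th input sequence are found at index i.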

tral.sequence.repeat_detection_run.split_sequence(seq_records, working_dir)[source]

Split a FASTA file with multiple entries into several FASTA files with one entry each.

Parameters
  • seq_records – List of Sequence instances.

  • working_dir – Output directory for the split files.

Returns

A list of tuples containing the protein identifier and the file name.
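The splitting step and its return value (identifier and file name per entry) can be sketched like this. It is not the library's implementation: records are passed as hypothetical (identifier, sequence) tuples instead of Sequence instances, and the numbered file naming scheme is an assumption.

```python
import os


def split_sequence_sketch(records, working_dir):
    """Write one FASTA file per record and return (identifier, file name) tuples."""
    os.makedirs(working_dir, exist_ok=True)
    result = []
    for index, (identifier, seq) in enumerate(records):
        # Assumed naming scheme: 0.fasta, 1.fasta, ... inside working_dir.
        file_name = os.path.join(working_dir, "{}.fasta".format(index))
        with open(file_name, "w") as handle:
            handle.write(">{}\n{}\n".format(identifier, seq))
        result.append((identifier, file_name))
    return result
```

Single-entry files of this kind are convenient because many external repeat detectors are run on one sequence at a time.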