Orphelia [Readme]

0. Prerequisites for the command line tool

Step-by-step download and installation

In the following, $ORPHELIA_HOME denotes the path where the program is located, e.g. /home/me/orphelia/.

  1. Download the tarball http://orphelia.gobics.de/download/orphelia.tgz
  2. Extract the file: tar -xzf orphelia.tgz (may take a while)
  3. Call the program via the script $ORPHELIA_HOME/orphelia (see II)

I. General

The program Orphelia comprises two parts: the ORF finder, which extracts all ORFs from the input sequences, and the prediction tool, which selects those ORFs with a high probability of encoding a protein. The ORF finder is a command line tool written in Java that requires a file of input sequences as an argument (see II and IV). After the ORFs have been extracted, the prediction tool is triggered automatically. The prediction tool is a program generated with the MATLAB compiler; it requires the MATLAB libraries to be available in the LD_LIBRARY_PATH. (A full version of MATLAB is not required!) The complete library is located in $ORPHELIA_HOME/v74. If you call the program via the start script $ORPHELIA_HOME/orphelia, the LD_LIBRARY_PATH is set automatically for the duration of the run.

II. Command line arguments

The easiest way to use Orphelia is via the BASH start script orphelia. The only obligatory argument is a file containing the input sequences. All other arguments are optional; their default values are preconfigured in the config file $ORPHELIA_HOME/.frag and may be changed there.

Syntax

orphelia -s sequences.fna [-mfna y|n] [-m model] [-slots N] [-o outdir] [-maxoverlap M]

Arguments

-s : Sequence file (multiple FASTA or line-by-line sequences). This parameter is obligatory.

Optional arguments

-mfna y|n : Default is '-mfna y'. If the sequence file is NOT FASTA, pass '-mfna n'.
-m model : In the current version, model may be Net300 or Net700. Default is Net700 (defined in the config file).
-slots N : Number of CPU slots you would like to use in parallel. Default is 1 (defined in the config file).
-o outdir : By default, a directory name is generated from the system time in /var/tmp (defined in the config file). If you give a directory name via the parameter -o, the output files are saved directly to this path, e.g. with -o /var/tmp/my_result/ you will find the predictions in /var/tmp/my_result/gene.pred.
-maxoverlap M : Maximum overlap of genes. Default is 60 (defined in the config file).
The value is interpreted as "greater than", meaning that '59' results in the shortest possible ORF being 60 bp long.
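
If Orphelia is driven from another program, the argument list described above can be assembled programmatically. The following Python sketch is only an illustration: the helper name build_orphelia_command and its keyword names are hypothetical, and only the flags documented in this section are used. Options left at None are omitted, so the defaults from the config file apply.

```python
from typing import List, Optional


def build_orphelia_command(
    script: str,
    sequences: str,
    mfna: Optional[str] = None,
    model: Optional[str] = None,
    slots: Optional[int] = None,
    outdir: Optional[str] = None,
    maxoverlap: Optional[int] = None,
) -> List[str]:
    """Assemble an argument list for the orphelia start script (hypothetical helper)."""
    # Validate the two options whose allowed values are listed in the README.
    if mfna is not None and mfna not in ("y", "n"):
        raise ValueError("-mfna accepts only 'y' or 'n'")
    if model is not None and model not in ("Net300", "Net700"):
        raise ValueError("model must be Net300 or Net700")

    # -s is the only obligatory argument.
    cmd = [script, "-s", sequences]

    # Optional flags are appended only when a value was given.
    optional = [
        ("-mfna", mfna),
        ("-m", model),
        ("-slots", slots),
        ("-o", outdir),
        ("-maxoverlap", maxoverlap),
    ]
    for flag, value in optional:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd
```

The resulting list can be handed to subprocess.run(cmd, check=True); the script path, e.g. /home/me/orphelia/orphelia, depends on where the tarball was extracted.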

III. Calling the prediction tool separately

If you would like to call the prediction tool separately from the ORF extraction tool, you have to set the PATH and LD_LIBRARY_PATH manually (see the start script $ORPHELIA_HOME/orphelia).

Syntax:

GeneFinderTool input-fasta orf.coords orf.seq tis.seq outfile model MaxOverlap [header.frag]

Example:

GeneFinderTool /path/to/input/frag.fasta /var/tmp/result/orf.coords /var/tmp/result/orf.seq /var/tmp/result/tis.seq \
               /var/tmp/result/gene.pred /home/me/orphelia/Net700.mat 60 /var/tmp/result/header.frag
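
The positional order of the arguments and the library path handling can also be sketched in Python. This is an illustration under assumptions: the helper names genefinder_args and env_with_library_path are hypothetical, the file names inside the result directory are taken from the example above, and the library directories that actually belong in LD_LIBRARY_PATH must be copied from the start script, since this README does not list them.

```python
import os
import posixpath
from typing import Dict, List, Mapping, Optional, Sequence


def genefinder_args(
    input_fasta: str,
    result_dir: str,
    model_file: str,
    max_overlap: int = 60,
    with_header: bool = True,
) -> List[str]:
    """Return the GeneFinderTool arguments in the order given by the syntax line."""

    def out(name: str) -> str:
        return posixpath.join(result_dir, name)

    args = [
        "GeneFinderTool",
        input_fasta,
        out("orf.coords"),
        out("orf.seq"),
        out("tis.seq"),
        out("gene.pred"),   # outfile
        model_file,
        str(max_overlap),
    ]
    if with_header:
        # The header file is the optional last argument.
        args.append(out("header.frag"))
    return args


def env_with_library_path(
    lib_dirs: Sequence[str],
    base_env: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Copy an environment and prepend lib_dirs to LD_LIBRARY_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    parts = list(lib_dirs)
    if env.get("LD_LIBRARY_PATH"):
        parts.append(env["LD_LIBRARY_PATH"])
    env["LD_LIBRARY_PATH"] = ":".join(parts)
    return env
```

Passing the returned environment to subprocess.run(args, env=env) changes the library path only for the child process, which mirrors the temporary setting done by the start script.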

IV. Input and output files

By default, an input file in multiple FASTA format is expected:

> Seq No 1
CCTCCTCCTGTTTTTCCCTCAATACAACCTCATTGGATTATTCAATTCAC
CATCCTGCCCTTGTTCCTTCCATTATACAGCTGTCTTTGCCCTCTCCTTC
TCTCGCTGGACTGTTCACCAACTCTCAGCCCGCGATCCCAATTTCCAGAC
AACCCATCTTATCAGCTTGGCCACGGCCTCGACCCGAACAGACCGGCGTC
CAGCGAGAAGAGCGTCGCCTCGACGCCTCTGCTTGACCGCACCTTGATGC
TCAAGACTTATCGCGATGCCAAGAAGCGTCTCATCATGTTCGACTACGA
> Seq No 2
CGAAACGGGCACCTATACAACGATTGAAACCATTATTCAAGCTCAGCAAG 
CGTCTATGCTAGCGGTTATTGCGAGCACTTCAGCGGTTGCTACTACGACT
ACTACTTGATAAATGAAACGGCTATAAAAGAGGCTGGGGCAAAAGTATGT
TAGTTGAAGGGTGACCTGAACGATGAATCGGTCGAATTTTTTATTGGCAG
AGGGAAGGTAGGTTTACTCAATTTAGTTACTTCTAGCCGTTGATTGGAGG
AGCGCAAGCGACGAGGAGGCTCATCGGCCGCCCGCGGAAAGCGTAGTCT
TACACGGAAATCAACGGCGGTGTCATAAGCGAG
> Seq No 3
.....

The input may also be given in "line-by-line format" without headers, where each line is treated as a separate sequence.

Example for two sequences:
CCTCCTCCTGTTTTTCCCTCAATACAACCTCATTGGATTA
CGTCTATGCTAGCGGTTATTGCGAGCACTTCAGCGGTTGCTACTACGAC
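
To make the difference between the two input formats concrete, here is a minimal Python reader for both. It is a sketch for checking your own input files, not the parser used by Orphelia; the function name read_sequences and the choice to return (header, sequence) pairs are assumptions made for this example.

```python
from typing import Iterable, List, Tuple


def read_sequences(lines: Iterable[str], fasta: bool = True) -> List[Tuple[str, str]]:
    """Return (header, sequence) pairs; the header is empty for line-by-line input."""
    records: List[Tuple[str, str]] = []

    if not fasta:
        # Line-by-line format: every non-empty line is one sequence.
        for line in lines:
            seq = line.strip()
            if seq:
                records.append(("", seq))
        return records

    # Multiple FASTA: a '>' line starts a record, following lines are joined.
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)
        else:
            raise ValueError("sequence data before the first FASTA header")
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

A file is read with open(path) as the lines argument; fasta=False corresponds to calling Orphelia with '-mfna n'.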

Output files

(The output file names can be reconfigured in the config file.)

  1. gene.pred : The final prediction in coord format:
     >FragmentNo,ORFinFragNo_posleft_posRight_+|-_Frame_C|I_FragHeader
  2. orf.coords : Coordinates of all ORFs considered for the prediction
  3. orf.seq : Respective ORF sequences in line-by-line format
  4. tis.seq : +/- 30 bp window around the TIS considered for the prediction (if available)
  5. frag.header : FASTA headers (if input was FASTA)
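
For downstream processing, a prediction line in the coord format shown for gene.pred can be split into its fields. The Python sketch below is based only on the field template given in this README; the names Prediction and parse_prediction are hypothetical, the test values are made up, and the assumptions are that the two positions are integers and that only the trailing FragHeader may itself contain underscores. The C|I field is stored as-is.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    fragment_no: str
    orf_in_frag_no: str
    pos_left: int
    pos_right: int
    strand: str
    frame: str
    c_or_i: str
    frag_header: str


def parse_prediction(line: str) -> Prediction:
    """Split one '>FragmentNo,ORFinFragNo_posleft_posRight_+|-_Frame_C|I_FragHeader' line."""
    if not line.startswith(">"):
        raise ValueError("expected a line starting with '>'")
    # At most 6 splits, so underscores inside the FASTA header are preserved.
    fields = line[1:].rstrip("\n").split("_", 6)
    if len(fields) < 6:
        raise ValueError("too few fields in prediction line")
    ids, left, right, strand, frame, flag = fields[:6]
    header = fields[6] if len(fields) > 6 else ""

    fragment_no, sep, orf_no = ids.partition(",")
    if not sep:
        raise ValueError("expected 'FragmentNo,ORFinFragNo' as first field")
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if flag not in ("C", "I"):
        raise ValueError("expected 'C' or 'I'")
    return Prediction(fragment_no, orf_no, int(left), int(right),
                      strand, frame, flag, header)
```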

V. Config file

Default values are set in the configuration file ($ORPHELIA_HOME/.frag). A value in the configuration file is ignored during execution if the respective argument is passed via the command line.
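
The precedence rule can be illustrated with a few lines of Python. This sketch does not read the .frag file, whose syntax is not described here; the dictionary README_DEFAULTS merely restates the defaults listed in section II, and the function name effective_settings is hypothetical.

```python
from typing import Any, Dict, Mapping

# Defaults as documented in section II (not parsed from the config file).
README_DEFAULTS: Dict[str, Any] = {
    "mfna": "y",
    "model": "Net700",
    "slots": 1,
    "maxoverlap": 60,
}


def effective_settings(config: Mapping[str, Any], cli: Mapping[str, Any]) -> Dict[str, Any]:
    """Command line values override config values; None means 'not passed'."""
    settings = dict(config)
    settings.update({key: value for key, value in cli.items() if value is not None})
    return settings
```

For example, effective_settings(README_DEFAULTS, {"model": "Net300"}) keeps every default except the model.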

Logging may be configured via the file $ORPHELIA_HOME/.log4j. The preconfigured logging level is WARN; if you prefer more verbose logging, use the level INFO. By default, log messages are written to the file defined as "appender R", but stdout may easily be added (see the config file for more information).


Please direct your questions and comments to orphelia-info@gobics.de.