0. Prerequisites for the command line tool
- Linux, 64-Bit architecture.
- The program needs Java 1.6.
- The start script is written for Bash. If you use another shell, you may need to adapt the script.
I. Step-by-step download and installation
In the following, $ORPHELIA_HOME denotes the path where the program is located, e.g. /home/me/orphelia/.
- Download the tarball http://orphelia.gobics.de/download/orphelia.tgz
- Extract the file: tar -xzf orphelia.tgz (may take a while)
- Call the program via the script $ORPHELIA_HOME/orphelia (see II)
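The download and extraction steps above could be scripted roughly as follows. This is only a sketch: it assumes that wget and tar are installed, and the function names fetch_orphelia and unpack_orphelia are made up for illustration and are not part of the Orphelia distribution.

```shell
#!/bin/sh
# Sketch of the download and extraction steps (helper names are hypothetical).
ORPHELIA_URL="http://orphelia.gobics.de/download/orphelia.tgz"

# Download the tarball into the current directory (assumes wget is available).
fetch_orphelia() {
    wget -O orphelia.tgz "$ORPHELIA_URL"
}

# Extract a downloaded tarball into a target directory (may take a while).
# $1 = path to the tarball, $2 = directory to extract into
unpack_orphelia() {
    tarball=$1
    target=$2
    mkdir -p "$target" && tar -xzf "$tarball" -C "$target"
}
```

A possible call would be: fetch_orphelia && unpack_orphelia orphelia.tgz /home/me. Whether the archive unpacks into a subdirectory named orphelia (so that $ORPHELIA_HOME becomes /home/me/orphelia) is an assumption based on the example path above.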
The program Orphelia comprises two parts: the ORF finder, which extracts all ORFs from the input sequences, and the prediction tool, which selects those ORFs with a high probability of encoding a protein. The ORF finder is a command line tool written in Java that requires a file of input sequences as an argument (see II and IV). After the ORFs have been extracted, the prediction tool is triggered automatically. The prediction tool is a program generated with the MATLAB Compiler; it requires the MATLAB libraries to be available in LD_LIBRARY_PATH. (A full version of MATLAB is not required!) The complete library is located in $ORPHELIA_HOME/v74. If you call the program via the start script $ORPHELIA_HOME/orphelia, LD_LIBRARY_PATH is set automatically and only temporarily.
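To illustrate what "set temporarily" means, the following sketch runs a command with the bundled library directory prepended to LD_LIBRARY_PATH, without changing the environment of the calling shell. It is not the actual start script: the wrapper name with_matlab_libs is hypothetical, and the real script may add specific subdirectories of $ORPHELIA_HOME/v74 rather than the directory itself.

```shell
#!/bin/sh
# Hypothetical wrapper, not the real start script.
ORPHELIA_HOME="${ORPHELIA_HOME:-/home/me/orphelia}"

# Run the given command with the bundled MATLAB libraries on LD_LIBRARY_PATH.
with_matlab_libs() {
    (
        # The subshell keeps the change local to this one call.
        LD_LIBRARY_PATH="$ORPHELIA_HOME/v74${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
        export LD_LIBRARY_PATH
        "$@"
    )
}
```

After with_matlab_libs returns, LD_LIBRARY_PATH in the calling shell has its previous value again.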
II. Command line arguments
The easiest way to use Orphelia is via the Bash start script orphelia. The only obligatory argument is a file containing the input sequences. The other arguments are optional and may alternatively be defined in the config file $ORPHELIA_HOME/.frag, where default values are preconfigured.
orphelia -s sequences.fna [-mfna y|n] [-m model] [-slots N] [-o outdir] [-maxoverlap M]
-s sequences.fna : Sequence file (multiple FASTA or line-by-line sequences). This parameter is obligatory.
-mfna y|n : Default is '-mfna y'. If the sequence file is NOT FASTA, pass '-mfna n'.
-m model : In the current version, model may be Net300 or Net700. Default is Net700 (defined in the config file).
-slots N : Number of CPU slots you would like to use in parallel. Default is 1 (defined in the config file).
-o outdir : By default a directory name is generated from system time in /var/tmp (defined in the config file). If you give a directory name via the parameter -o, the output files are saved directly to this path, e.g. with -o /var/tmp/my_result/ you will find the predictions in /var/tmp/my_result/gene.pred.
-maxoverlap M : Maximum overlap of genes. Default is 60 (defined in the config file).
The value is interpreted as "greater than", meaning that a value of 59 results in the shortest possible ORF being 60 bp long.
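A call that sets every option explicitly might look like the sketch below. The wrapper function run_orphelia and the chosen values (Net300, 4 slots) are only illustrative; the flags themselves are the ones listed above.

```shell
#!/bin/sh
ORPHELIA_HOME="${ORPHELIA_HOME:-/home/me/orphelia}"

# Run Orphelia on one multiple-FASTA file with explicit options.
# $1 = input sequence file, $2 = output directory
run_orphelia() {
    "$ORPHELIA_HOME/orphelia" -s "$1" -mfna y -m Net300 -slots 4 \
        -o "$2" -maxoverlap 60
}
```

For example, run_orphelia sequences.fna /var/tmp/my_result/ would be expected to write the predictions to /var/tmp/my_result/gene.pred.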
III. Calling the prediction tool separately
If you would like to call the prediction tool separately from the ORF extraction tool, you should set PATH and LD_LIBRARY_PATH manually (see the start script $ORPHELIA_HOME/orphelia).
GeneFinderTool input-fasta orf.coords orf.seq tis.seq outfile model MaxOverlap [header.frag]
GeneFinderTool /path/to/input/frag.fasta /var/tmp/result/orf.coords /var/tmp/result/orf.seq /var/tmp/result/tis.seq /var/tmp/result/gene.pred /home/me/orphelia/Net700.mat 60 /var/tmp/result/header.frag
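Since most arguments in the example call are files in the same result directory, the command line can be assembled from a few values. The helper below is a hypothetical sketch that only prints the command instead of running it; the file names are taken from the example call above. Paths containing spaces are not quoted in the printed output.

```shell
#!/bin/sh
# Print a GeneFinderTool command line for one result directory.
# $1 = input FASTA, $2 = result directory, $3 = model file, $4 = MaxOverlap
genefinder_cmd() {
    input=$1
    dir=${2%/}        # drop one trailing slash, if present
    model=$3
    overlap=$4
    printf 'GeneFinderTool %s %s %s %s %s %s %s %s\n' \
        "$input" "$dir/orf.coords" "$dir/orf.seq" "$dir/tis.seq" \
        "$dir/gene.pred" "$model" "$overlap" "$dir/header.frag"
}
```

Calling genefinder_cmd /path/to/input/frag.fasta /var/tmp/result /home/me/orphelia/Net700.mat 60 prints the example call shown above.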
IV. Input and output files
By default, an input file in multiple FASTA format is expected:
> Seq No 1
CCTCCTCCTGTTTTTCCCTCAATACAACCTCATTGGATTATTCAATTCAC
CATCCTGCCCTTGTTCCTTCCATTATACAGCTGTCTTTGCCCTCTCCTTC
TCTCGCTGGACTGTTCACCAACTCTCAGCCCGCGATCCCAATTTCCAGAC
AACCCATCTTATCAGCTTGGCCACGGCCTCGACCCGAACAGACCGGCGTC
CAGCGAGAAGAGCGTCGCCTCGACGCCTCTGCTTGACCGCACCTTGATGC
TCAAGACTTATCGCGATGCCAAGAAGCGTCTCATCATGTTCGACTACGA
> Seq No 2
CGAAACGGGCACCTATACAACGATTGAAACCATTATTCAAGCTCAGCAAG
CGTCTATGCTAGCGGTTATTGCGAGCACTTCAGCGGTTGCTACTACGACT
ACTACTTGATAAATGAAACGGCTATAAAAGAGGCTGGGGCAAAAGTATGT
TAGTTGAAGGGTGACCTGAACGATGAATCGGTCGAATTTTTTATTGGCAG
AGGGAAGGTAGGTTTACTCAATTTAGTTACTTCTAGCCGTTGATTGGAGG
AGCGCAAGCGACGAGGAGGCTCATCGGCCGCCCGCGGAAAGCGTAGTCT
TACACGGAAATCAACGGCGGTGTCATAAGCGAG
> Seq No 3
.....
The input may also be given in "line-by-line" format without headers, where each line is considered to be a separate sequence.
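A multiple FASTA file can be turned into line-by-line format with a short awk filter. The sketch below assumes a POSIX awk; the function name fasta_to_lines is hypothetical. Header lines are dropped and the wrapped lines of each record are joined into one line.

```shell
#!/bin/sh
# Convert multiple FASTA (stdin) to line-by-line format (stdout).
fasta_to_lines() {
    awk '
        # A header line ends the previous record: print it and start over.
        /^>/ { if (seq != "") print seq; seq = ""; next }
        # Sequence line: strip whitespace and append to the current record.
        { gsub(/[ \t\r]/, ""); seq = seq $0 }
        # Print the last record, which has no header after it.
        END { if (seq != "") print seq }
    '
}
```

For a made-up input with two records (">a" followed by the lines ACGT and AC, then ">b" followed by GG), fasta_to_lines prints the two lines ACGTAC and GG. A file converted this way would then be passed to Orphelia with '-mfna n'.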
(The output file names can be reconfigured in the config file.)
- gene.pred : The final prediction in coord format, with the following fields:
- FragmentNo : Number of the fragment (resp. input order)
- ORFinFragNo : Counter of the predicted ORF in this fragment (simply incremented)
- posLeft : Left coordinate of the ORF in the fragment sequence
- posRight : Right coordinate of the ORF in the fragment sequence
- +|- : Strand
- Frame : Reading frame of a predicted gene, counted from the 5'-end of the input sequence. Reading frame 1 begins at the 1st position of the sequence, reading frame 2 at the 2nd, frame 3 at the third position.
Examples for Frame:
---------------------- DNA Fragment
   ATG------TAG        Gene, begins on position 4 -> Fr. 1
---------------------- DNA Fragment
  ATG------TAG         Gene, begins on pos. 3 -> Fr. 3
---------------------- DNA Fragment
    ATG------TAG       Gene, begins on pos. 5 -> Fr. 2
- C|I : Is the predicted gene complete (C) or incomplete (I)
- FragHeader : Header of the fragment from the FASTA file
>1,1_197_841_-_2_C_Fragment 1
>2,1_292_822_+_1_C_Fragment 2
>2,2_906_1103_+_3_C_Fragment 2
>2,3_1_273_+_1_I_Fragment 2
>3,1_1_1064_+_3_I_Fragment 3
...
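The fields of such a line can be split with plain shell parameter expansion, as in the sketch below. It assumes that every line follows the layout of the example lines above; the function and variable names are hypothetical. The second helper applies the frame rule from the Frame examples to a start position. It covers only the forward-strand case drawn there, not reverse-strand or incomplete genes.

```shell
#!/bin/sh
# Split one gene.pred line into its fields (sets shell variables).
parse_pred_line() {
    rest=${1#>}
    frag_no=${rest%%,*};   rest=${rest#*,}
    orf_no=${rest%%_*};    rest=${rest#*_}
    pos_left=${rest%%_*};  rest=${rest#*_}
    pos_right=${rest%%_*}; rest=${rest#*_}
    strand=${rest%%_*};    rest=${rest#*_}
    frame=${rest%%_*};     rest=${rest#*_}
    complete=${rest%%_*}
    frag_header=${rest#*_}   # remainder; may itself contain underscores
}

# Reading frame (1, 2 or 3) of a start position on the forward strand.
frame_of_start() {
    echo $(( ($1 - 1) % 3 + 1 ))
}
```

For the third example line, parse_pred_line sets pos_left=906, pos_right=1103, strand=+, frame=3 and frag_header to "Fragment 2", and frame_of_start 906 also yields 3.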
V. Config file
Default values are set in the configuration file ($ORPHELIA_HOME/.frag). A value in the configuration file is ignored during execution if the respective argument is passed on the command line.
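The precedence rule can be pictured with a tiny shell helper. This is only an illustration of the rule, not Orphelia's own code, and it does not read the config file, whose format is not described here. The name effective_value is hypothetical.

```shell
#!/bin/sh
# Pick the effective value of one setting: a non-empty command-line
# value wins, otherwise the config-file default is used.
# $1 = value from the command line (may be empty), $2 = config default
effective_value() {
    cli_value=$1
    config_default=$2
    printf '%s\n' "${cli_value:-$config_default}"
}
```

With the defaults named above, effective_value "" Net700 prints Net700, while effective_value Net300 Net700 prints Net300.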
Logging may be configured via the file $ORPHELIA_HOME/.log4j. Logging is preconfigured to level WARN; if you prefer more verbose logging, use the level INFO. By default, log output is written to the file defined as "appender R", but stdout may easily be added (see the config file for more information).
Please direct your questions and comments to firstname.lastname@example.org.