Orphelia [Introduction]

General description

Orphelia is a metagenomic ORF-finding tool for the prediction of protein-coding genes in short, environmental DNA sequences of unknown phylogenetic origin [1]. Orphelia is based on a two-stage machine learning approach that was recently introduced by our group. After the initial extraction of open reading frames (ORFs), linear discriminants are used to extract features from those ORFs. Subsequently, an artificial neural network combines the features and computes a gene probability for each ORF in a fragment. Finally, a greedy strategy selects a likely combination of high-scoring ORFs subject to an overlap constraint.
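
The final selection step can be illustrated with a minimal Python sketch. It is not Orphelia's actual code: the Orf class, the greedy_select function, and the min_prob and max_overlap parameters (with their default values) are assumptions chosen only for illustration. It assumes each candidate ORF already carries a gene probability from the neural network, and it accepts ORFs in order of decreasing probability as long as they do not overlap an already accepted ORF by too much.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Orf:
    """Hypothetical container for a candidate ORF on one fragment."""
    start: int    # 0-based, inclusive
    end: int      # exclusive
    prob: float   # gene probability assigned by the scoring model


def overlap(a: Orf, b: Orf) -> int:
    """Number of positions shared by two ORFs (0 if they are disjoint)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def greedy_select(orfs, min_prob=0.5, max_overlap=60):
    """Greedily pick high-scoring ORFs under an overlap constraint.

    min_prob and max_overlap are illustrative assumptions, not values
    taken from this page.
    """
    selected = []
    # Visit candidates from highest to lowest gene probability.
    for orf in sorted(orfs, key=lambda o: o.prob, reverse=True):
        if orf.prob < min_prob:
            break  # all remaining candidates score even lower
        # Accept only if the overlap with every accepted ORF is small enough.
        if all(overlap(orf, s) <= max_overlap for s in selected):
            selected.append(orf)
    # Report the accepted ORFs in fragment order.
    return sorted(selected, key=lambda o: o.start)


candidates = [
    Orf(0, 300, 0.9),
    Orf(200, 500, 0.8),   # overlaps the first ORF by 100 positions
    Orf(280, 600, 0.7),   # overlaps the first ORF by 20 positions
    Orf(550, 700, 0.3),   # below the probability cutoff
]
genes = greedy_select(candidates)
```

In this example the second candidate is rejected despite its high score because it overlaps the best-scoring ORF too strongly, while the third candidate is kept.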

The linear discriminants in the scoring model were built from 131 complete genomes. The neural network was trained with DNA fragments of a certain length that were randomly excised from the complete genomes. We provide two models with neural networks that were trained on the fragment lengths 300 bp and 700 bp, respectively. The 700 bp model is suitable for fragments longer than ~300 bp, while the 300 bp model is intended for predicting genes in DNA fragments of ≤ 300 bp.
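
The choice between the two models can be written as a simple rule. The following Python sketch is only an illustration: the function name choose_model and the labels "300bp" and "700bp" are hypothetical and are not Orphelia's actual option names, and the cutoff of exactly 300 bp stands in for the approximate boundary given above.

```python
def choose_model(fragment_length_bp: int) -> str:
    """Return an illustrative label for the model matching a fragment length.

    The labels are hypothetical; the 300 bp cutoff approximates the
    boundary described in the text.
    """
    if fragment_length_bp <= 0:
        raise ValueError("fragment length must be positive")
    # Fragments of at most 300 bp use the model trained on 300 bp fragments;
    # longer fragments use the model trained on 700 bp fragments.
    return "300bp" if fragment_length_bp <= 300 else "700bp"


short_read_model = choose_model(250)
long_read_model = choose_model(800)
```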

Orphelia predicts genes in DNA fragments from species that were not included in the training set with good accuracy and, in particular, with high specificity.

References

[1] K. J. Hoff, M. Tech, T. Lingner, R. Daniel, B. Morgenstern, P. Meinicke (2008) Gene prediction in metagenomic fragments: a large scale machine learning approach. BMC Bioinformatics, 9:217.
[2] K. J. Hoff, T. Lingner, P. Meinicke, M. Tech (2009) Orphelia: predicting genes in metagenomic sequencing reads. Nucleic Acids Research, 37(Web Server issue):W101-W105.


Please direct your questions and comments to orphelia-info@gobics.de.