Document Type

Technical Report

Publication Date






Technical Report Number



A complete mapping from genome to proteome would constitute a foundation for genome-based biology and provide targets for pharmaceutical and therapeutic intervention. This is one reason gene structure prediction has been a major subfield of computational biology for over 20 years. Many of the widely used gene prediction systems were developed in the 1990s and are unable to take advantage of the revolution in comparative genomics brought on by the sequencing of the entire genomes of an increasing number of vertebrates. Twinscan is a new system for high-throughput gene-structure prediction that exploits the patterns of conservation observed in alignments between a target genomic sequence and its homologous sequences in other organisms. The approach employs a symbolic conservation sequence that effectively combines many local alignments into a single global alignment. This representation has several important properties that make Twinscan particularly useful for high-throughput gene prediction. For mammals, Twinscan has been shown to be significantly more accurate and reliable by all measures than any non-comparative genomic method. Twinscan is based on, and includes as a component, the same hidden Markov model topology as Genscan, a popular non-homology-based gene prediction program. Twinscan has an object-oriented design and is implemented in the C++ programming language. Its three major components are a probabilistic model of the DNA sequence, a probabilistic model of the conservation sequence, and a dynamic programming framework. Both the models and the computational structure are complicated aggregate classes. In this report, the design and implementation of Twinscan are described at the source-code level for the first time.
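
The idea of collapsing many local alignments into one symbolic conservation sequence can be illustrated with a minimal C++ sketch. It is not taken from the Twinscan source code: the `LocalAlignment` record, the `buildConservationSequence` function, the three-symbol alphabet ('.' for unaligned, '|' for match, ':' for mismatch), and the rule that the highest-scoring alignment decides each target position are all assumptions made for illustration, and alignment gaps are ignored for brevity.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record for one gapless local alignment between the target
// and an informant sequence. Field names are illustrative assumptions.
struct LocalAlignment {
    std::size_t targetStart;     // 0-based start position in the target
    std::string targetBases;     // aligned bases from the target
    std::string informantBases;  // aligned bases from the informant
    double score;                // alignment score, higher is better
};

// Builds one symbol per target base from many local alignments.
// Assumed alphabet: '.' = unaligned, '|' = match, ':' = mismatch.
std::string buildConservationSequence(std::size_t targetLength,
                                      std::vector<LocalAlignment> alignments) {
    // Every position starts out as unaligned.
    std::string conservation(targetLength, '.');

    // Visit alignments from highest to lowest score so that the best
    // alignment covering a position is the one that sets its symbol.
    std::stable_sort(alignments.begin(), alignments.end(),
                     [](const LocalAlignment& a, const LocalAlignment& b) {
                         return a.score > b.score;
                     });

    for (const LocalAlignment& aln : alignments) {
        const std::size_t n =
            std::min(aln.targetBases.size(), aln.informantBases.size());
        for (std::size_t i = 0; i < n; ++i) {
            const std::size_t pos = aln.targetStart + i;
            if (pos >= targetLength) break;          // alignment runs off the target
            if (conservation[pos] != '.') continue;  // set by a better alignment
            conservation[pos] =
                (aln.targetBases[i] == aln.informantBases[i]) ? '|' : ':';
        }
    }
    return conservation;
}
```

Under these assumptions, overlapping local alignments never conflict: each target base receives exactly one symbol, so the output has the same length as the target and can be modeled alongside the DNA sequence position by position.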


Permanent URL: