Discriminative Models for Comparative Gene Prediction
Axel E. Bernal
Department of Computer and Information Science
University of Pennsylvania, Philadelphia, PA 19104, USA
June 24, 2008
Abstract
Eukaryotic gene prediction is an important, long-standing problem in computational
biology. It involves identifying relevant genomic features and combining them
using a structure model, typically a generative Hidden Markov Model (HMM). HMMs
can easily exploit small training sets and incomplete annotations, and can be
trained very efficiently. These attributes made HMMs ideal for handling early
genomic data, in spite of their known drawbacks. However, with the ever-increasing
availability of computing power and the growing quality and quantity of genomic
annotations, research on discriminative learning for gene prediction has finally
yielded improvements over the best generative models.
On the other hand, comparative features derived from genomic alignments have
been shown to be effective in predicting gene structures in recent years. Most of
these features do not have a clear probabilistic meaning and might have dependencies
carried over from evolutionary and functional conservation. Theoretically, a
discriminatively trained model, rather than a generative HMM, would be better suited
to integrate such features.
In this survey, we study the current state of the art for discriminatively trained
comparative gene predictors. First, we introduce general notation and discuss relevant
traits of discriminative genefinding using mSplicer as an example. mSplicer, a
large-margin trained ab initio predictor, is the first discriminative system to improve
systematically upon existing gene annotation. We then review two discriminative models
for comparative gene prediction, Conrad and Contrast. Both of these predictors are
based on Conditional Random Fields (CRFs) but take different approaches to
the problem.
We compare these three systems regarding their design goals and scope, learning
algorithms, feature definitions and the significance of their results for biologists.
1 INTRODUCTION
Prediction of protein-coding genes involves correctly identifying splice and translation
signals in DNA sequences. Traditionally, gene predictors have been divided into two classes.
Ab initio genefinders rely exclusively on intrinsic structural features of genes, such as
motif enrichment in splice sites and content statistics in coding regions. Homology-based
genefinders exploit extrinsic features derived from alignments to genomic DNA from
related informant organisms, expressed sequence tags (ESTs), complementary DNAs
(cDNAs) and protein sequences. Homology-based gene predictors which only use alignments
to informant genomes are also known as de novo or comparative.
Discriminatively trained gene predictors such as mSplicer [1] and HMMGene [2] are ab
initio, whereas Conrad [3] and Contrast [4] are comparative. Comparative gene predictors
may also use EST evidence when available but only to supplement information from
genomic alignments. Other types of predictors not covered in this review are a) homology-
based gene predictors which base their predictions primarily on cDNAs, protein and EST
alignments, and b) ensemble methods, which combine the output of other predictors.
Training discriminative genefinders is computationally very costly and often
requires a relatively large training corpus to obtain good predictions on unseen test data.
However, discriminative models offer advantages over generative ones in their ability to
combine a rich variety of (possibly overlapping) features without making the problem
intractable, and to find a global tradeoff among feature contributions that maximizes
annotation accuracy. In practice, these advantages usually result in improved accuracy, as it is
rare for a real problem to fulfill all independence assumptions made by generative models.
HMMGene pioneered the use of discriminative learning for ab initio gene prediction
in human and C. elegans more than a decade ago. It uses a Hidden Markov Model
(HMM) to model gene structures and conditional maximum likelihood (CML) [5] as
the discriminative training objective. HMMGene's prediction accuracy on human was lower
than GenScan's [6], which was considered the best ab initio genefinder for many years.
Accuracy was hurt primarily by the limited amount of high-quality annotated data and,
more importantly, by the absence of features associated with biological signals and state
lengths. We hypothesize that the final model did not include these features because of
the increased complexity.
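For reference, the CML objective mentioned above can be written in its standard textbook form. This is not necessarily HMMGene's exact formulation; the symbols $x$ (DNA sequence), $y$ (annotated label sequence), $\theta$ (model parameters) and $D$ (training set) are notation assumed here for illustration.

```latex
\hat{\theta}_{\mathrm{CML}}
  = \arg\max_{\theta} \sum_{(x,y) \in D} \log P(y \mid x; \theta)
  = \arg\max_{\theta} \sum_{(x,y) \in D}
      \Bigl[ \log P(x, y; \theta) - \log \sum_{y'} P(x, y'; \theta) \Bigr]
```

The first term is the usual joint (generative) log-likelihood. The second term normalizes over all labelings $y'$ of the same sequence, which is what makes the objective discriminative.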
Since then, advances in discriminative learning techniques, better gene annotation
quality and faster computers have made it possible to train discriminative models
efficiently with larger amounts of data and to incorporate many more informative features. An example
of these advances is mSplicer [1], an ab initio predictor for improving gene annotation in C.
elegans. mSplicer uses Support Vector Machines (SVMs) to learn individual splice signal
and content sensor submodels. The structure model is semi-Markov, to explicitly model
lengths of introns and exons. The length scores are then combined with the output from
the SVM classifiers into a linear model for ranking gene structures. The parameters of
this model are estimated so that annotated gene structures are ranked highest,
with a large margin over all other possible structures. However, it should
be noted that mSplicer is not a full-blown gene predictor, as it cannot predict translated
regions or genes on chromosomal DNA, only splicing forms on transcriptional units.
Section 3 describes this system in detail.
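The large-margin ranking idea can be illustrated with a toy example. This is a minimal sketch, not mSplicer's actual implementation: the three-component feature vectors, the weights and the function names `score` and `margin_violations` are all hypothetical.

```python
from typing import List, Sequence


def score(weights: Sequence[float], features: Sequence[float]) -> float:
    """Linear score of one candidate gene structure."""
    return sum(w * f for w, f in zip(weights, features))


def margin_violations(
    weights: Sequence[float],
    annotated: Sequence[float],
    alternatives: Sequence[Sequence[float]],
    margin: float = 1.0,
) -> List[float]:
    """Hinge loss of each alternative structure against the annotated one.

    A loss of 0 means the annotated structure outscores the alternative
    by at least `margin`; a positive loss is a margin violation.
    """
    s_true = score(weights, annotated)
    return [max(0.0, margin - (s_true - score(weights, alt)))
            for alt in alternatives]


# Hypothetical features per structure:
# [sum of splice-site SVM outputs, sum of content SVM outputs, sum of length scores]
weights = [1.0, 0.5, 0.25]
annotated = [2.0, 3.0, 4.0]          # score 4.5
alternatives = [
    [1.0, 3.0, 4.0],                 # score 3.5, separated by the full margin
    [2.0, 2.0, 4.0],                 # score 4.0, inside the margin
]
losses = margin_violations(weights, annotated, alternatives)
print(losses)  # [0.0, 0.5]
```

Training would adjust `weights` to drive all such losses toward zero. In practice the set of alternative structures is exponentially large, so it has to be searched with dynamic programming rather than enumerated as done here.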
The first attempt at using modern discriminative learning techniques for homology-based
genefinding is described in [7]. The system uses a CRF model [8] and a CML training
objective to predict genes in human, and it combines signal, content and protein alignment
features. The authors report better results than Genie [9] on H. sapiens when both
systems use intrinsic and protein alignment features. However, Genie gives better results
for ab initio prediction, and no results for de novo prediction were reported. Furthermore,
test experiments did not include real chromosomal DNA, and comparisons against more
accurate gene predictors such as N-SCAN [10] were not given. Two relatively recent and
more successful comparative gene predictors are Conrad and Contrast.
Conrad uses a semi-Markov Conditional Random Field (SMCRF) as the underlying
structure model. Complexity in SMCRFs is still linear in the input size, like CRFs,
but SMCRFs have a much larger constant, bounded from above by the longest possible
decoded segment. The training objective is Maximum Expected Accuracy (MEA),
which maximizes the expected accuracy over all possible segmentations defined by the
SMCRF. MEA is not a concave function, so CML is used to initialize model parameters
to avoid convergence to poor local maxima. Conrad currently supports fungal genomes only;
larger genomes may pose a problem due to the increased time and memory complexity of
SMCRFs. More on Conrad is given in Section 4.
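The complexity remark can be made concrete with a generic semi-Markov Viterbi decoder. This is a sketch under stated assumptions, not Conrad's code: the function `semi_markov_viterbi`, the toy scoring function and the two-label alphabet are all hypothetical. The loop over segment start positions, capped at `max_len`, is the extra constant factor relative to an ordinary CRF.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Segment = Tuple[str, int, int]  # (label, start, end), end exclusive


def semi_markov_viterbi(
    n: int,
    labels: Sequence[str],
    max_len: int,
    seg_score: Callable[[Optional[str], str, int, int], float],
) -> Tuple[float, List[Segment]]:
    """Highest-scoring segmentation of positions 0..n-1.

    seg_score(prev_label, label, start, end) scores one segment;
    prev_label is None for the first segment.
    """
    neg_inf = float("-inf")
    # best[i][y]: best score for prefix [0, i) whose last segment has label y
    best = [{y: neg_inf for y in labels} for _ in range(n + 1)]
    back = [{y: None for y in labels} for _ in range(n + 1)]
    for end in range(1, n + 1):
        for y in labels:
            # Only segments of length <= max_len are considered.
            for start in range(max(0, end - max_len), end):
                prevs = [None] if start == 0 else list(labels)
                for prev in prevs:
                    base = 0.0 if prev is None else best[start][prev]
                    cand = base + seg_score(prev, y, start, end)
                    if cand > best[end][y]:
                        best[end][y] = cand
                        back[end][y] = (start, prev)
    # Trace back from the best final label.
    label = max(labels, key=lambda y: best[n][y])
    total = best[n][label]
    segments: List[Segment] = []
    end = n
    while end > 0:
        start, prev = back[end][label]
        segments.append((label, start, end))
        end, label = start, prev
    segments.reverse()
    return total, segments


# Toy input: each character is the "true" label of its position.
seq = "EEIIIE"


def toy_score(prev: Optional[str], label: str, start: int, end: int) -> float:
    # +1 per agreeing position, -1 per disagreeing one, -0.5 per segment.
    match = sum(1.0 if c == label else -1.0 for c in seq[start:end])
    return match - 0.5


total, segments = semi_markov_viterbi(len(seq), ["E", "I"], 4, toy_score)
print(total, segments)  # 4.5 [('E', 0, 2), ('I', 2, 5), ('E', 5, 6)]
```

The decoder makes on the order of n × max_len × |labels|² calls to `seg_score`, so it is linear in n for a fixed `max_len`. Because genomic segments such as introns can be very long, `max_len` and hence the constant can become large.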
Finally, Contrast fixes most of the scalability limitations of Conrad, as it is able to handle
larger genomes like H. sapiens and D. melanogaster while still being able to integrate
comparative features. Contrast offsets the complexity costs by using simpler ab initio
feature definitions and a CRF as the underlying structure model. Contrast uses a training
objective called Maximum Expected Boundary Accuracy (MEBA), which focuses on
predicting the coding boundaries correctly, even if in doing so segment length features
are penalized. This function, like MEA, is not concave, and hence parameter initialization
is required prior to entering the learning phase. This system is described in Section 5.
In the next section, we introduce some basic concepts on protein synthesis and gene
organization and some formal definitions on feature-based gene structure prediction. These
concepts and definitions will be used at length throughout the remaining sections, wherein
we evaluate the papers in detail.
2 BACKGROUND
2.1 From Genes to Proteins
In genomic DNA, gene structures, or simply genes, are often found flanked by long
stretches of DNA called intergenic regions. Genes sometimes overlap other genes, creating
a single transcriptional unit, but these events are rare and complex enough that, so far, no
gene predictor has been able to model them explicitly.
Genes can contain multiple transcripts. In the context of gene prediction, transcripts
consist of either a single exon or a succession of exons with introns in between. As
such, transcripts include information from both precursor messenger RNA (pre-mRNA)
and mature RNA (mRNA) molecules. Pre-mRNAs are synthesized from genes through a
process called transcription, by which the genomic DNA containing the gene (the
transcriptional unit) is copied to said molecule. mRNAs are synthesized from pre-mRNAs
through a process called splicing, which removes introns from the pre-mRNA as guided by
the spliceosome.
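The two steps just described can be sketched on a toy sequence. This is an illustration only: the sequence, the exon coordinates and the helper names `transcribe`, `introns_between` and `splice` are invented for the example, and the input is assumed to be the coding strand of the transcriptional unit.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), end exclusive


def transcribe(coding_strand_dna: str) -> str:
    """Copy the transcriptional unit into pre-mRNA (T becomes U)."""
    return coding_strand_dna.upper().replace("T", "U")


def introns_between(exons: List[Interval]) -> List[Interval]:
    """Introns are the gaps between consecutive exons."""
    return [(exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1)]


def splice(pre_mrna: str, exons: List[Interval]) -> str:
    """Remove introns by concatenating the exon intervals in order."""
    return "".join(pre_mrna[start:end] for start, end in exons)


# Toy transcriptional unit: exon, intron (GT...AG), exon.
dna = "ATGGCC" + "GTAAGTTTTCAG" + "TGGTAA"
exons = [(0, 6), (18, 24)]

pre_mrna = transcribe(dna)
mrna = splice(pre_mrna, exons)
print(introns_between(exons))  # [(6, 18)]
print(mrna)                    # AUGGCCUGGUAA
```

A transcript in the sense used here is fully described by its list of exon intervals, which is the structure a gene predictor has to output.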
Genes that are protein-coding, the focus of this survey, can have their spliced
transcript(s) translated into protein(s).
Splice signals are located at the boundaries of exons and introns. Acceptors are located
at the beginning of exons and donors are located at th