BMC Biology
, 250 14th Street NW, Atlanta, GA 30318, USA
Email: Sitao Wu - stwu@ku.edu; Jeffrey Skolnick - skolnick@gatech.edu; Yang Zhang* - yzhang@ku.edu
* Corresponding author
Abstract
Background: Predicting 3-dimensional protein structures from amino-acid sequences is an
important unsolved problem in computational structural biology. The problem becomes relatively
easier if close homologous proteins have been solved, as high-resolution models can be built by
aligning target sequences to the solved homologous structures. However, for sequences without
similar folds in the Protein Data Bank (PDB) library, the models have to be predicted from scratch.
Progress in ab initio structure modeling is slow. The aim of this study was to extend the TASSER
(threading/assembly/refinement) method to ab initio modeling and to examine systematically its
ability to fold small single-domain proteins.
Results: We developed I-TASSER by iteratively implementing the TASSER method and used it
in folding tests on three benchmarks of small proteins. First, data on 16 small proteins (< 90
residues) were used to generate I-TASSER models, which had an average Cα-root mean square
deviation (RMSD) of 3.8Å, with 6 of them having a Cα-RMSD < 2.5Å. The overall result was
comparable with the all-atomic ROSETTA simulation, but the central processing unit (CPU) time
required by I-TASSER was much shorter (5 CPU hours vs. 150 CPU days for ROSETTA). Second, data
on 20 small proteins (< 120 residues) were used. I-TASSER folded four of them with a
Cα-RMSD < 2.5Å. The average Cα-RMSD of the I-TASSER models was 3.9Å, whereas it was 5.9Å
using TOUCHSTONE-II software. Finally, 20 non-homologous small proteins (< 120 residues) were
taken from the PDB library. An average Cα-RMSD of 3.9Å was obtained for the third benchmark,
with seven cases having a Cα-RMSD < 2.5Å.
Conclusion: Our simulation results show that I-TASSER can consistently predict the correct folds
and sometimes high-resolution models for small single-domain proteins. Compared with other ab
initio modeling methods such as ROSETTA and TOUCHSTONE-II, the average performance of
I-TASSER is either much better or similar at a lower computational cost. These data, together
with the significant performance of the automated I-TASSER server (the Zhang-Server) in the 'free
modeling' section of the recent Critical Assessment of Structure Prediction (CASP)7 experiment,
demonstrate new progress in automated ab initio model generation. The I-TASSER server is
freely available for academic users at http://zhang.bioinformatics.ku.edu/I-TASSER.
Published: 8 May 2007
BMC Biology 2007, 5:17
doi:10.1186/1741-7007-5-17
Received: 9 December 2006
Accepted: 8 May 2007
This article is available from: http://www.biomedcentral.com/1741-7007/5/17
© 2007 Wu et al; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Background
Prediction of protein structure from amino-acid
sequences has been one of the most challenging problems
in computational structural biology for many years [1,2].
Historically, protein structure prediction was classified
into three categories: (i) comparative modeling [3,4], (ii)
threading [5-9], and (iii) ab initio folding [10-15]. The first
two approaches build protein models by aligning query
sequences onto solved template structures. When close
templates are identified, high-resolution models can be
built by template-based methods. If templates are
absent from the Protein Data Bank (PDB) library, the
models need to be built from scratch, i.e. by ab initio folding.
This is the most difficult category of protein-structure pre-
diction [16,17].
As protein size increases, the conformational phase
space to be sampled grows sharply, which makes the
ab initio modeling of larger proteins extremely difficult
[18]. Current ab initio predictions are therefore mainly
focused on small proteins. Several successful examples have
been reported in the literature. For example, based on an ab
initio approach designed to globally optimize their potential
energy function, Liwo et al were able to build models with a
Cα root mean square deviation (RMSD) to native of < 6Å for
protein fragments of up to 61 residues [10]. Using the
ROSETTA program [11], Simons et al reported 73 successful
structure predictions out of 172 target proteins with
lengths of < 150 residues, with Cα-RMSD < 7Å in the top
five models [19]. Using TOUCHSTONE-II software,
Zhang et al reported 83 foldable cases from 125 target
proteins (up to 174 residues) with Cα-RMSD < 6.5Å in the
top five models [12]. Recently, Bradley et al demonstrated
an exciting achievement by building several high-resolution
models for proteins of < 90 residues [13]. By combining
low-resolution and high-resolution sampling, the
authors used the all-atomic ROSETTA to predict high-resolution
models with Cα-RMSD < 1.5Å for 5 of 16 small
proteins. The average Cα-RMSD for all the 16 proteins was
3.8Å in the best of the top five clusters. The CPU cost,
however, is high: ~150 CPU days are required for
the all-atom sampling of each target.
In this work, we aimed to investigate the possibility of
generating high-resolution models of small proteins in an
automated and fast simulation. We developed a new
method, I-TASSER, which implements TASSER [18,20] in
an iterative mode and also exploits new force-field opti-
mization and fragment identification. We tested the I-
TASSER method on three independent benchmark sets.
The results show that I-TASSER has an overall performance
comparable to that of the all-atomic ROSETTA but at a far
lower CPU cost. They also demonstrate that I-TASSER clearly
outperforms the TOUCHSTONE-II method.
Results and discussion
We tested the folding performance of I-TASSER on small
proteins. To avoid contamination with homologous pro-
teins, any template with > 20% sequence identity to the
target sequence was removed from our template library.
Moreover, any template that could be detected by the Position
Specific Iterative (PSI)-BLAST program with an E-value <
0.05 was also excluded. We note that the homology
exclusion cutoff used here is more stringent than that
used by Bradley et al [13], who only excluded templates
with a PSI-BLAST E-value < 0.05 but without sequence
identity cutoff, and that used by Zhang et al [12], who
only excluded the templates with sequence identity > 30%
but without PSI-BLAST checking. Because all
homologous templates were excluded, we
term the corresponding simulations "ab initio" modeling,
following the notation used by others [10,12,13,21].
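The exclusion rule described above can be sketched in a few lines. This is only an illustration of the stated cutoffs (sequence identity > 20% or PSI-BLAST E-value < 0.05), not the authors' pipeline: the `Template` record, the function names, and the example library entries are hypothetical, and the identity and E-value figures are made up for demonstration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Template:
    """Hypothetical record for one template in the threading library."""
    name: str
    seq_identity: float     # fractional sequence identity to the target (0..1)
    psiblast_evalue: float  # PSI-BLAST E-value of the template against the target


def filter_templates(
    templates: List[Template],
    identity_cutoff: Optional[float] = 0.20,
    evalue_cutoff: Optional[float] = 0.05,
) -> List[Template]:
    """Keep only templates that pass every enabled homology check.

    A template is excluded if its identity exceeds identity_cutoff
    or its E-value is below evalue_cutoff. Passing None disables a check.
    """
    kept = []
    for t in templates:
        too_similar = identity_cutoff is not None and t.seq_identity > identity_cutoff
        detectable = evalue_cutoff is not None and t.psiblast_evalue < evalue_cutoff
        if not (too_similar or detectable):
            kept.append(t)
    return kept


# Made-up library entries, for illustration only.
library = [
    Template("tmplA", 0.35, 1e-20),  # fails both checks
    Template("tmplB", 0.12, 0.001),  # low identity, but detectable by PSI-BLAST
    Template("tmplC", 0.25, 3.0),    # not detectable, but identity above 20%
    Template("tmplD", 0.10, 8.0),    # passes both checks
]

both_checks = filter_templates(library)
evalue_only = filter_templates(library, identity_cutoff=None)
identity_30_only = filter_templates(library, identity_cutoff=0.30, evalue_cutoff=None)

print([t.name for t in both_checks])
print([t.name for t in evalue_only])
print([t.name for t in identity_30_only])
```

On this toy library, the combined rule retains fewer templates than either the E-value-only rule or the 30%-identity-only rule, which is the sense in which the cutoff used here is the more stringent one.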
For the evaluation of the predicted models, we used both
the RMSD and TM-score [22]. Although RMSD can give an
explicit concept of modeling errors, in some cases, a local
error (e.g., tail misorientation) can cause a large RMSD
value even though the global topology is correct. TM-score
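is defined as [22]

$$\text{TM-score} = \frac{1}{N} \sum_{i=1}^{N_{ali}} \frac{1}{1 + \left( d_i / d_0 \right)^2}, \qquad (1)$$

where $N$ is the number of residues of the query sequence
and $N_{ali}$ is the number of aligned residues in a threading
alignment. For a full-length model, $N$ and $N_{ali}$ are identical.
$d_i$ is the distance of the ith Cα pair between model and
native after superposition, and $d_0 = 1.24 \sqrt[3]{N - 15} - 1.8$.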
As TM-score weights small distances more strongly than large
distances, it is more sensitive to global topology than is
RMSD. According to Zhang and Skolnick [22], TM-score =
1 indicates two identical structures and TM-score < 0.17
indicates random structure pairs. A TM-score of > 0.5
indicates two structures with the same fold.
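To make the contrast between the two measures concrete, the sketch below evaluates TM-score = (1/N) Σ 1/(1 + (d_i/d_0)²) with d_0 = 1.24 (N − 15)^(1/3) − 1.8, alongside RMSD, from a list of per-residue Cα distances. It assumes the model has already been superimposed on the native structure, so the superposition step itself is not shown. The function names and the 80-residue example with a misoriented 10-residue tail are hypothetical.

```python
import math
from typing import Sequence


def d0(n: int) -> float:
    """Length-dependent distance scale d0 = 1.24 * (N - 15)^(1/3) - 1.8, in Å."""
    if n <= 15:
        # Assumption of this sketch: only chains longer than 15 residues are scored.
        raise ValueError("d0 is only evaluated here for N > 15")
    return 1.24 * (n - 15) ** (1.0 / 3.0) - 1.8


def tm_score(distances: Sequence[float], n: int) -> float:
    """TM-score from Cα distances of aligned pairs; n is the query length."""
    scale = d0(n)
    return sum(1.0 / (1.0 + (d / scale) ** 2) for d in distances) / n


def rmsd(distances: Sequence[float]) -> float:
    """Root mean square deviation over the same Cα pair distances."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))


# Hypothetical full-length model of an 80-residue protein:
# a well-modeled 70-residue core (1 Å per residue) and a misoriented
# 10-residue tail (15 Å per residue).
n_residues = 80
distances = [1.0] * 70 + [15.0] * 10

print(f"d0       = {d0(n_residues):.2f} Å")
print(f"RMSD     = {rmsd(distances):.2f} Å")
print(f"TM-score = {tm_score(distances, n_residues):.2f}")
```

In this made-up case the tail alone pushes the RMSD above 5Å, while the TM-score stays near 0.8, well above the 0.5 level associated with a shared fold.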
Benchmark I: 16 proteins from the data of Bradley et al
Table 1 shows the modeling result of I-TASSER on 16
small proteins that were used by Bradley et al [13]. This
benchmark set includes 3 α proteins, 2 β proteins, and 11
αβ proteins with pairwise sequence identity < 30%. If we
define a high-resolution model as one with a Cα-RMSD to
native ≤ 1.5Å, I-TASSER predicts a high-resolution model
for one target, '1ogwA' (see Figure 2A for the model
superimposed on the native structure). For the best of the top
five clusters, most of the targets (12/16) had a medium
resolution, with a Cα-RMSD of 1.5–5Å. For the remaining
three targets, I-TASSER could not correctly fold the
proteins. One of them (1tif_) has a long swinging tail at the
C-terminus. For the other two (1dcjA_ and 1o2fB_), both
having a topology of four parallel β-strands flanked by
two α-helices, the imperfection of the I-TASSER force field
is obviously responsible for the failure because the energy
of the native structures is higher than that of the largest
clusters.
For the first predicted model of the highest cluster density,
the overall average Cα-RMSD for the 16 target proteins
was 4.3Å, with an average TM-score of 0.59. If we consider the
best model in the top five predictions, the average
Cα-RMSD to the native was 3.8Å and the average TM-score
was 0.61. Figure 2(b,c) shows typical examples of both
medium-resolution and low-resolution predicted models.
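The two averages quoted above (first model versus best of the top five) differ only in which model is picked per target before averaging. A minimal sketch of that bookkeeping follows. The target names and RMSD values are invented for illustration and are not taken from Table 1, and `summarize` is a hypothetical helper. Picking the best of five requires knowing the native structure, so it is an evaluation device rather than a prediction step.

```python
from typing import Dict, List, Tuple


def summarize(rmsd_by_target: Dict[str, List[float]], top_n: int = 5) -> Tuple[float, float]:
    """Return (average RMSD of first-ranked models, average RMSD of best-in-top-N models).

    Each list holds the Cα-RMSD to native of a target's models,
    ordered by cluster rank (index 0 is the first predicted model).
    """
    first = [ranked[0] for ranked in rmsd_by_target.values()]
    best = [min(ranked[:top_n]) for ranked in rmsd_by_target.values()]
    n_targets = len(rmsd_by_target)
    return sum(first) / n_targets, sum(best) / n_targets


# Invented example values (Å), for illustration only.
example = {
    "targetA": [2.1, 3.0, 1.8, 4.2, 5.0],
    "targetB": [6.5, 4.9, 7.1, 5.2, 8.0],
    "targetC": [3.3, 3.9, 4.4, 3.6, 5.1],
}

avg_first, avg_best = summarize(example)
print(f"average RMSD, first model: {avg_first:.2f} Å")
print(f"average RMSD, best of top five: {avg_best:.2f} Å")
```

By construction the best-of-top-five average can never exceed the first-model average, which is consistent with the ordering of the 4.3Å and 3.8Å figures reported above.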
As a comparison, the table also lists the all-atomic ROSETTA pre