13
Fold Recognition With Minimal Gaps William Chen, 1,3 Leonid Mirny, 2 and Eugene I. Shakhnovich 3 * 1 Department of Biophysics, Harvard University, Boston, Massachusetts 2 Division of Health Science and Technology, Massachusetts Institute of Technology, Cambridge, Massachusetts 3 Department of Chemistry and Chemical Biology, Harvard University, Cambridge, Massachusetts ABSTRACT Here we present a simplified form of threading that uses only a 20 20 two-body residue-based potential and restricted number of gaps. Despite its simplicity and transparency the Monte Carlo-based threading algorithm performs very well in a rigorous test of fold recognition. The results suggest that by simplifying and constraining the decoy space, one can achieve better fold recogni- tion. Fold recognition results are compared with and supplemented by a PSI-BLAST search. The statistical significance of threading results is rigor- ously evaluated from statistics of extremes by com- parison with optimal alignments of a large set of randomly shuffled sequences. The statistical theory, based on the Random Energy Model, yields a cumu- lative statistical parameter, , that attests to the likelihood of correct fold recognition. A large indicates a significant energy gap between the opti- mal alignment and decoy alignments and, conse- quently, a high probability that the fold is correctly recognized. For a particular number of gaps, the parameter reaches its maximal value, and the fold is recognized. As the number of gaps further in- creases, the likelihood of correct fold recognition drops off. This is because the decoy space is small when gaps are restricted to a small number, but the native alignment is still well approximated, whereas unrestricted increase of the number of gaps leads to rapid growth of the number of decoys and their statistical dominance over the correct alignment. It is shown that best results are obtained when a combination of one-, two-, and three-gap threading is used. To this end, use of the parameter is crucial for rigorous comparison of results across the differ- ent decoy spaces belonging to a different number of gaps. Proteins 2003;51:531–543. © 2003 Wiley-Liss, Inc. Key words: fold recognition; PSI-BLAST; Random Energy Model; cutoff parameter INTRODUCTION It is difficult to overstate the importance of a fast and reliable method for predicting protein structures from sequence. Such a method would allow a proper character- ization of hitherto unknown regions of sequenced ge- nomes. The ability to accurately predict structures would also guide us in designing novel proteins for industrial, therapeutic, or other commercial purposes. Modern struc- ture prediction methods are varied and include the follow- ing: (i) ab initio methods, 1,2 (ii) sequence alignment meth- ods, 3–5 and (iii) threading methods. 6–9 Threading methods use physical principles that involve mounting the query sequence in question to some template structure and determining the best fit evaluated by statistically adjusted physicochemical scores. This is the focus of our study. The advantage of threading over sequence alignments is that threading does not rely explicitly on sequence similarity for success. There have been some successes in mixing approaches (ii) and (iii) by combining the physicochemical scores with information from large sequence databases. 10,11 The fold recognition problem, in its simplest incarna- tion, may be framed in the following way: given a query protein sequence and a template protein structure, do the two share the same fold? A corollary problem to this is the alignment recognition problem: how well can the query sequence be modeled with use of the template? Here we address the fold recognition problem only, recognizing the fact that these methods may not produce accurate sequence- structure alignments needed for homology modeling. Fold recognition by threading is, in general, a two-step process. First, an optimally scored alignment is found for a particular query sequence and a particular template struc- ture. 9,12,13 Second, a significance score is calculated. 14 –16 If the score is statistically significant, then one can often make an accurate statement that the query sequence adopts the template fold. If the score is not statistically significant, then no claims are made, or one rejects the template fold as the correct one for the query sequence. The scoring function for the optimal alignment is typically a two-body interaction potential. 17,18 It has been shown that the approximation scheme of a two-body potential is not good enough for certain coarse-grained models to stabilize a native fold against misfolds 19 (although a more detailed atom-based, one- and two-body potential was shown to be successful in folding a small -helical protein at the atomic level of detail 20 ). The hope with threading is that one already has fixed the folds to a small selection of possibilities and thereby compensates for the low-resolu- tion quality of the two-body potential. However, the prob- lem still demands one to select the optimal alignment. If gaps and insertions are allowed in threading alignments, *Correspondence to: Eugene I. Shakhnovich, Department of Chem- istry and Chemical Biology, Harvard University, 12 Oxford Street, Cambridge, MA 02138. E-mail: [email protected] Received 5 September 2002; Accepted 30 December 2002 PROTEINS: Structure, Function, and Genetics 51:531–543 (2003) © 2003 WILEY-LISS, INC.

Fold recognition with minimal gaps

Embed Size (px)

Citation preview

Fold Recognition With Minimal GapsWilliam Chen,1,3 Leonid Mirny,2 and Eugene I. Shakhnovich3*1Department of Biophysics, Harvard University, Boston, Massachusetts2Division of Health Science and Technology, Massachusetts Institute of Technology, Cambridge, Massachusetts3Department of Chemistry and Chemical Biology, Harvard University, Cambridge, Massachusetts

ABSTRACT Here we present a simplified formof threading that uses only a 20 � 20 two-bodyresidue-based potential and restricted number ofgaps. Despite its simplicity and transparency theMonte Carlo-based threading algorithm performsvery well in a rigorous test of fold recognition. Theresults suggest that by simplifying and constrainingthe decoy space, one can achieve better fold recogni-tion. Fold recognition results are compared withand supplemented by a PSI-BLAST search. Thestatistical significance of threading results is rigor-ously evaluated from statistics of extremes by com-parison with optimal alignments of a large set ofrandomly shuffled sequences. The statistical theory,based on the Random Energy Model, yields a cumu-lative statistical parameter, �, that attests to thelikelihood of correct fold recognition. A large �indicates a significant energy gap between the opti-mal alignment and decoy alignments and, conse-quently, a high probability that the fold is correctlyrecognized. For a particular number of gaps, the �parameter reaches its maximal value, and the fold isrecognized. As the number of gaps further in-creases, the likelihood of correct fold recognitiondrops off. This is because the decoy space is smallwhen gaps are restricted to a small number, but thenative alignment is still well approximated, whereasunrestricted increase of the number of gaps leads torapid growth of the number of decoys and theirstatistical dominance over the correct alignment. Itis shown that best results are obtained when acombination of one-, two-, and three-gap threadingis used. To this end, use of the � parameter is crucialfor rigorous comparison of results across the differ-ent decoy spaces belonging to a different number ofgaps. Proteins 2003;51:531–543. © 2003 Wiley-Liss, Inc.

Key words: fold recognition; PSI-BLAST; RandomEnergy Model; � cutoff parameter

INTRODUCTION

It is difficult to overstate the importance of a fast andreliable method for predicting protein structures fromsequence. Such a method would allow a proper character-ization of hitherto unknown regions of sequenced ge-nomes. The ability to accurately predict structures wouldalso guide us in designing novel proteins for industrial,therapeutic, or other commercial purposes. Modern struc-ture prediction methods are varied and include the follow-

ing: (i) ab initio methods,1,2 (ii) sequence alignment meth-ods,3–5 and (iii) threading methods.6–9 Threading methodsuse physical principles that involve mounting the querysequence in question to some template structure anddetermining the best fit evaluated by statistically adjustedphysicochemical scores. This is the focus of our study. Theadvantage of threading over sequence alignments is thatthreading does not rely explicitly on sequence similarityfor success. There have been some successes in mixingapproaches (ii) and (iii) by combining the physicochemicalscores with information from large sequence databases.10,11

The fold recognition problem, in its simplest incarna-tion, may be framed in the following way: given a queryprotein sequence and a template protein structure, do thetwo share the same fold? A corollary problem to this is thealignment recognition problem: how well can the querysequence be modeled with use of the template? Here weaddress the fold recognition problem only, recognizing thefact that these methods may not produce accurate sequence-structure alignments needed for homology modeling.

Fold recognition by threading is, in general, a two-stepprocess. First, an optimally scored alignment is found for aparticular query sequence and a particular template struc-ture.9,12,13 Second, a significance score is calculated.14–16

If the score is statistically significant, then one can oftenmake an accurate statement that the query sequenceadopts the template fold. If the score is not statisticallysignificant, then no claims are made, or one rejects thetemplate fold as the correct one for the query sequence.The scoring function for the optimal alignment is typicallya two-body interaction potential.17,18 It has been shownthat the approximation scheme of a two-body potential isnot good enough for certain coarse-grained models tostabilize a native fold against misfolds19 (although a moredetailed atom-based, one- and two-body potential wasshown to be successful in folding a small �-helical proteinat the atomic level of detail20). The hope with threading isthat one already has fixed the folds to a small selection ofpossibilities and thereby compensates for the low-resolu-tion quality of the two-body potential. However, the prob-lem still demands one to select the optimal alignment. Ifgaps and insertions are allowed in threading alignments,

*Correspondence to: Eugene I. Shakhnovich, Department of Chem-istry and Chemical Biology, Harvard University, 12 Oxford Street,Cambridge, MA 02138. E-mail: [email protected]

Received 5 September 2002; Accepted 30 December 2002

PROTEINS: Structure, Function, and Genetics 51:531–543 (2003)

© 2003 WILEY-LISS, INC.

as they are in sequence alignments, the number of thread-ing alignments is also very large. In effect, one reduces theoverwhelming number of decoy folds to a fewer, thoughstill large, number of decoy alignments. It has been shown,by a reduction of a known NP-complete problem (1-3SAT),that the complexity of threading is exponential in length ofsequence (i.e., it is an NP-complete problem21 if the searchis exhaustive in alignments). This finding suggests thatthreading is a difficult optimization problem, perhaps stilllimited by errors in the potential, despite the restriction ofattention to only fixed structural motifs.

Several attempts have been made to improve threadingby incorporating additional structural or sequence informa-tion. One possibility to improve threading is to imposeconstraints by penalizing or eliminating alignments thatare inconsistent with additional physical or evolutionaryinput information. The constraints may take the form of areduced “core” template, which emphasizes the well-formed secondary structure elements of the struc-ture.10,13,22 Another type of constraint used is to combine asequence alignment score with the threading score.11 Stillanother method of improving fold recognition is to averagethe energies of a query sequence and its homologs, attempt-ing to reduce the random errors that undoubtedly plaguethe potential.23,24

These methods introduce more information at the cost ofadding complexity to the threading simulation. The con-straint considered in this study, on the other hand, is anattempt at simplification. Available potential functionsare always nonideal, and their ability to discriminatebetween the correct (native) alignment and decoys islimited to the number of decoys.14 Thus, it is necessary tolimit the space of decoys to ensure correct fold recognition.In this work, we consider the effect of a systematicrestriction of the number of gaps in the alignment. Exami-nation of structural alignments in FSSP database25 pro-vides further justification to that strategy. Analysis of asample of structural alignments of the FSSP databasesuggests that many good alignments have very few gapsand, hence, few fragments separated by these gaps. Figure1 shows the distribution of the number of fragments in arandom sample of FSSP alignments. Many alignmentshave two or three fragments, whereas very few havegreater than six fragments. A full 58% of alignmentsshown have fragment numbers ranging from 1 to 5.Furthermore, it is known that even gapless threading canrecognize protein folds in some cases.14,23 Therefore, re-stricting the number of gaps in the alignment may stillallow a good alignment while reducing the decoy space. Itis not necessary that the absolute “best” alignment bepreserved under the restricted fragment conditions, butrather, just a near “best” alignment be retained amongconsidered alignments. The compensatory disadvantage ofusing few fragments is that no “good” alignment can bebuilt with few fragments. The extreme case of this sort isgapless threading, in which almost certainly a “good”alignment cannot be found. Threading a query sequencethrough a �-helical hairpin, for example, is best done withat least one gap if the real structure and the templatestructure differ in the size of the turn that connects the

hairpin. Therefore, the problem becomes that of finding asignificant energy gap (indicating that fold is recognized)while keeping the number of fragments to a minimum.

Another important aspect of the present study is inthorough evaluation of statistical significance of observedalignments. As was pointed out by several authors,14,26,27

a consistent approach to threading requires a reliableestimate of statistical significance of obtained alignments.(The estimate of statistical significance is an integral partof sequence alignment methods that report their results interms of E-values.28)

In this article we first present a theory based on theRandom Energy Model for decoy energy distribution andderive a statistical parameter that can be universally usedas an estimate of statistical significance of obtained thread-ing alignment scores. To test our statistical parameter inthreading fold recognition, we use a test set of 150structurally similar proteins. The set of sequence-struc-ture pairs can be found in Table I. We also perform asimilar test of fold recognition using PSI-BLAST. (Valuesfollowing the test set pairs in Table I are e-values from ourPSI-BLAST comparisons. See Comparison to PSI-BLAST.)

THEORYThreading and the Random Energy Model

It is generally believed that the thermodynamic energyspectrum of the different conformations of a protein is welldescribed by the Random Energy Model (REM).29–33 Analo-gously, the REM has been applied to threading.9,14,18

Different alignments are analogous to different folds, andas such, are described by REM. In this article, we assumethe formalism of the REM for the energetics of threading.All the various decoy alignments have independent ener-gies that fall under a Gaussian distribution, except for the“native” optimal alignment (if it exists), which falls below

Fig. 1. The distribution of number of fragments in FSSP alignments.Many alignments have only two or three fragments, and much fewer havealignments with more than six fragments. The abundance of alignmentswith only a few fragments suggests that gains in fold prediction may comemostly from the signal in a few fragments.

532 W. CHEN ET AL.

almost all other alignments in energy. Assuming self-averaging, the mean and width of the alignment energydistribution is determined completely by the sequencecomposition of the query sequence, the two-body potential,and the number of contacts in the template. In addition,because the native alignment is special, its energy shouldform a gap from the rest of the spectrum, in the same waythe energy of the native fold is well separated from theenergies of all the misfolded conformations. The REMtheory is summarized in the following relations. Theenergy distribution is given by

P�E� �1

�2��2 exp���E � Eav�

2

2�2 � (1)

where the average energy, Eav and variance, � can be wellapproximated as in Ref. 14 by

Eav �CN

Ctotal�

i,j � 1

N1

U��i, �j� (2)

� � � CN

Ctotal�

i,j � 1

N1

U2��i, �j� �CN

Ctotal2 � �

i,j � 1

N1

U��i, �j��2�1/2

(3)

U(�i, �j) is the self-interaction matrix of the query se-quence, using values from a knowledge-based potential.CN is the number of contacts in the structure, and Ctotal isthe total number of contacts that can be. The density ofstates � then follows simply from the number of alterna-tive alignments M.

��E� � M � P�E� (4)

The qualitative features of this view of the thermodynam-ics of threading is briefly encapsulated in the schematic ofFigure 2(a).

Fold Recognition and Extreme Value Statistics

The theoretical ideas of REM can be applied to foldrecognition. It is not enough to thread the native sequence

TABLE I. Test Set Pairs Used in Fold Prediction Evaluation

1ten_ 1jrhI 1f9fA 1urnA 2ccyA 1bbhA 1qa9A 1qfoA1c1lA 1lcl_ (2e-15) 1qauA 1kwaA (8e-13) 1kp6A 1pytA 1bm9A 2hfh_1b8wA 1fd3A 1c28A 2tnfA 1cpq_ 1vls_ 1dz1A 1azpA1c9fA 1vcbA 256bA 1bbhA 1qk7A 1eit_ 1pdr_ 1kwaA (7e-13)1f02T 1a91_ 1dun_ 1dupA (4e-16) 1fxkA 1fxkC (1e-48) 1cewI 1molA2occE 1fjfT 1ba5_ 1tc3C 1a9v_ 1cmoA 1hce_ 2ila_2fxb_ 1fxd_ (1e-04) 1r2aA 1qp6A 1rlw_ 1rsy_ (6e-24) 1fc2C 1ytfB1qgkB 1lghB 1d7mA 1sfcB 1dx7A 1fbmC 1joyA 1cxzB1opy_ 1cewI 1c52_ 451c_ (9e-08) 1qckA 1cokA 1ci6B 1a02F1psm_ 1lghB 1lghA 1junA 1qcrK 1ce9C 1fyc_ 1ghj_ (3e-17)2cpb_ 1ifi_ (3e-14) 1cw5A 1kzuB 1qb2A 2ezk_ 3ncmA 1fltX1cd8_ 1neu_ 1bdyA 1rsy_ (2e-18) 1di0A 5nul_ 1mdyA 1a92A1e2aA 1qsdA 1qcrH 1be3H (2e-16) 1a8o_ 1dnyA 1fipA 1aoy_1qleD 1aq5A 1hstA 1qbjA 1peh_ 1ce0A 1gpt_ 1sco_1kum_ 1tvdA 1pfiA 1eq7A 1b4rA 1c8pA 1fjfT 1ecmA1hucA 1nkd_ 2tnfA 1aly_ (4e-41) 1ffkY 2occM 1scjB 1b64_1ql1A 1ifi_ 1bba_ 1ec5A 1d2zA 3ygsP 1kigI 1bpi_1e0bA 1azpA 1thx_ 1erv_ (3e-20) 1qyp_ 1vie_ 1tiiD 1bcpD9rnt_ 1rgeA 1a5r_ 1vcbA 2ezh_ 1cf7B 1bak_ 1btn_ (6e-08)1a3k_ 1lcl_ (8e-32) 1qu1A 1qbzA 1a32_ 1ail_ 1bb9_ 1bbzA (1e-06)1nkl_ 1qbjA 2occM 1bb1A 1c53_ 1cc5_ 1aub_ 1mbe_1tafB 1tafA (5e-25) 2mcm_ 1jrhI 1txb_ 1tgxA 1cwpA 1a34A1b4fA 1qckA 1bm4A 2sivB 1brz_ 1aho_ 2acy_ 1dt4A1fhoA 1rrpB 1pdo_ 1iibA 1kte_ 1bjx_ 1ctdA 1ytfB1b34B 1d3bB (8e-11) 1fjfL 3ullA 4icb_ 1eh2_ 1ihvA 1vie_1psrA 1eh2_ 1vtx_ 1eit_ 1sfcD 1sfcE (6e-24) 1bl8A 1fxkA1be3H 1qcrH (2e-16) 1as4B 1hleB 1cc8A 1b64_ 1fltX 1tlk_1bteA 1f94A 1dbwA 3chy_ 1quqA 1quqB (2e-55) 1qcrD 1a02F2afpA 1byfA (2e-14) 1prtF 3chbD 1pls_ 1dynA (8e-19) 1aa0_ 1sfcC1ixxB 1qddA (1e-27) 1xxaA 1aw0_ 1eteA 1rcb_ 1e68A 1lfb_1qo3C 1qddA (4e-18) 1ycc_ 451c_ (1e-13) 1bh9A 1afoA 1pfsA 1mjc_1qgvA 1thx_ 1rtm1 1byfA (1e-18) 1do6A 1cqkA 1qfnA 1aba_ (4e-07)1awd_ 1c1yB 2hts_ 2irfG 1bqk_ 2cuaA (4e-08) 1ctf_ 1bxyA1awcA 2hts_ 1bh9B 1aoiA 1avoB 1qbzA 1cf7A 1lea_1kdxA 1bg8A 2ifo_ 1a02F 1hp8_ 1nsgB 1abz_ 1ytfB1bazA 2hddA 1hssA 1rzl_ 3inkC 2gmfA 1dxzA 1a93B1ddf_ 1ngr_ 1a2xB 1dp5B 3crd_ 3ygsP (2e-22)2occK 1aikN 1eggA 1b6e_ (3e-17) 1sfcE 1sfcB (1e-23)

The number in parentheses following some pairs indicates the e-value under which PSI-BLAST recognizedthe test pair to be structurally similar. A pair is recognized if, by our definition, PSI-BLAST assigns it ane-value of less than 1e-2 for at least one round.

FOLD RECOGNITION WITH MINIMAL GAPS 533

into the template and obtain the best alignment (seethreading algorithm in Materials and Methods). To judgehow “real” or how good the alignment is, one must evaluatestatistical significance.14 To this end, one should formulatea zero hypothesis against which the results of actualthreading experiments should be judged. A meaningfulzero hypothesis is that the observed best alignment is justof a random sequence-template pair. Although such a zerohypothesis may seem too permissive, in reality it is not.Even for a random sequence-template pair, the alignmentis optimized (i.e., its score is low enough). More specifi-cally, optimized random sequence-template alignmentswould score close to Ec, a quantity in the REM thatcharacterizes the probable lower boundary of the densityof states and represents the most probable value of “en-ergy” (alignment score) of an optimal random sequence-template alignment. Without resorting to statistics of thetail of the energy distribution, one may mistakenly believethat the Z-score from comparing the single native align-ment against a random sampling of energies of nonopti-mized alignments is sufficient to lend significance to aparticular threading run. Figure 3 shows such a distribu-tion of Z-scores calculated this way, without shuffling andoptimizing. The solid curve is the distribution of REMZ-scores for a set of sequence-structure pairs which exhibitstructural similarity. The dashed curve is the distributionof REM Z-scores for a (negative control) set of pairs that do

not exhibit structural similarity. There seems to be noremarkable difference between the two. Thus, the distribu-tion of optimal random sequence-template alignment scoresrepresents an extreme value distribution (EVD) and thestatistical significance of the observed score should beevaluated as the probability that it is just a score of someoptimal random sequence-template pair (i.e., the probabil-ity given by the EVD).

These considerations can be made quantitative by defin-ing Ec as

� � ��Ec� � 1 (5)

Within the context of the REM, the Ec can be expressed asa function of the number of alignments M and the proper-ties of the REM distribution.

Ec � Eav � ��2 log� M

�2�� (6)

The physical meaning of this condition is that Ec specifieswhere the distribution of decoys falls below unity. The goalof threading is to find an optimal (“native”) alignment thathas energy En considerably lower than Ec. As we notedabove, if one were given a random sequence (with the samecomposition and length as the query sequence), and opti-mized its alignment to the template, its optimized energywill likely be close to Ec. Repeating this many times, onecan generate an extreme value distribution of optimized(rank 1) energies of random sequence-template pairs. InFigure 2(b), the resulting extreme value distribution,generated from an underlying Gaussian distribution, is

Fig. 2. a: Schematic of the REM shows qualitative locations of Eav andEc. Vertical mark at left edge of chart shows location of En. En is muchlower than Ec, suggesting that the optimal alignment is not spurious. Othervertical marks indicate that there can be other low-lying energetic statesthat are separated from the continuum distribution. b: Schematic of theREM shows an EVD derived from the REM. Notice the most probableextreme value lies on Ec. This is true when the number of decoyalignments is large (M � 1).

Fig. 3. REM Z-score derived by comparing the energy of optimalalignment of query sequence on the template structure against randomalignment energies given by the REM background energy distribution.The solid line is a set of sequence-structure pairs that exhibit structuralsimilarity. The dashed line is a set of sequence-structure pairs that do notexhibit structural similarity, which serves as negative control for foldrecognition. The two distributions seem indistinguishable near the lowerZ-score edge, suggesting that such a naive measure of significance is notuseful. In this test, the number of fragments was limited to two.

534 W. CHEN ET AL.

shown. Note that maximum of EVD coincides with Ec, inaccord with our earlier conjecture that most probablevalue of optimal score for a random sequence template pairis close to Ec. To assess significance of the best alignment,one locates En on this extreme value distribution and asksfor the probability that this native query sequence align-ment ranks first. Note that the total number of decoys M,depends on the template and query sequence length andthe number of fragments, whereas Eav and Ec depend onthe query sequence composition and number of interac-tions in the template. The probability that out of M decoys,a particular energy En ranks first for a given querysequence and template, is given by

P�MinEi � En� � exp��exp����n � u��� (7)

� � 2 log� M

�2�� (8)

u � 1 �

log�2 log� M

�2���4� log� M

�2��(9)

Here the quantity � � (E � Eav)/(Ec � Eav), and �n ��(En). By asking for the probability of the alignmentenergy ranking first with the given template and querysequence, the explicit dependence of the probability onquery sequence composition and length of template iseffectively removed and absorbed into the parameter �.One only has to calculate � to assess significance. It isconvenient to work with the probability density distribu-tion of extreme values En because this density distributiongives the probability that the optimized score falls withinsome specified range of scores. Such a distribution can beconveniently generated by a histogram of optimized andreshuffled sequences (see Materials and Methods).

P�En� � exp����n � u� � exp��exp����n � u���� (10)

It is clear from Eq. 7 that a query sequence with low En isless likely to be randomly matched to the template. Morerigorously, a very high � means the query sequencematched to the template in a nontrivial way, so thatprobably the fold is a correct match. Note at Ec, the typicaloptimized energy of a shuffled sequence displays an � of 1.The theory tells us then, it suffices to calculate � for eachquery sequence and template and use it to probe whetherthe threading score is significant (see Materials andMethods).

The formalism of the REM is also suggestive as to howrestricting fragments can enable improvements in foldprediction. From Figure 2(a), it is clear Ec is determined bythe two numbers � and M. � is a property of the querysequence that cannot be manipulated. However, M is aproperty of the decoy space, which, in some sense, isarbitrary. The best decoy space will have minimal ele-ments yet always include the optimal alignment. By fixingthe number of fragments to some number, one substan-tially reduces the number of alignments M. For example,the number of k-fragment alignments between a sequence

and a template of 100 amino acids is Mk�2 � 4 � 109,Mk�3 � 9 � 1012, Mk�4 � 5 � 1015, whereas the total numberof unconstrained alignments is M � 9 � 1058. At the sametime, there is a good chance that an optimal or nearoptimal alignment is still possible, with this fragmentrestriction.

RESULTSFold Recognition

We determined � parameter for both a test set of 150structurally similar pairs and a test set of 150 structurallydissimilar pairs (this is the negative control test set). Thenegative control test set is also a set of 150 protein pairsthat do not exhibit significant structural similarity. Toensure that the sequences and structures used in both testsets were comparable, the negative control test set wasgenerated from the original test set by shuffling the pairssuch that they no longer matched as structurally similarpairs. En was determined first by threading the nativequery sequence into its structural pair while holding thenumber of threading fragments fixed at some number. Ec

was determined by examining the distribution of optimalrandom sequence-template alignments. They were gener-ated by reshuffling the query sequence 1000 times andfinding an optimal alignment of the reshuffled randomsequence to the same template using the same number offragments. These energies form the spectrum from whichEc is calculated (see Materials and Methods). Figure 4shows the statistics of optimal alignment scores for many(shuffled) sequences threaded into one template. Note inthe figure the distinct extreme value shape of the curveand the position of Ec as the maximum of EVD. For thispair, En lies far lower than Ec, suggesting that it is anexcellent positive match for this particular sequence-structure pair. The � value, an indicator of significance ofthe query sequence match to the template, was calculatedfor all 150 pairs and binned. The procedure was repeated

Fig. 4. An extreme value distribution generated from optimization ofreshuffled sequences for one sequence-structure pair. This decoy spec-trum is fitted with an extreme value curve (R2 � 0.96). En is wellseparated from Ec, suggesting a recognized structural similarity.

FOLD RECOGNITION WITH MINIMAL GAPS 535

with fragment number fixed at 1, 2, 3, and 4. Foldprediction was also performed with unconstrained thread-ing, in which the number of fragments is variable andlimited only by the minimum size of the fragment. Theresulting � distributions for the various fragment numbersare shown in Figure 5.

A very important issue is what value of � to choose as acutoff above which recognition is deemed successful. InFigure 5, a vertical line is drawn to emphasize howshifting the cutoff changes the fold recognition results.One can always increase the signal (i.e., the number ofcorrectly recognized sequence-structure pairs) by shift-ing the cutoff � to a lower value. However, the number of“recognized” false positives will also increase. Con-versely, by shifting the cutoff � to a higher value willdecrease the number of false positives but also reducethe signal of the recognized folds. This trade-off betweenrecognition rate and false-positive rate is shown in

Table II. At a cutoff of � 1.2, the recognized rate is stillrelatively high (cumulative 30%), whereas the false-positive noise rate is relatively low (cumulative 7%). Thesalient point here is that as the number of fragments isincreased, the fold recognition does not necessarilyimprove. In fact, fold recognition tends to become worse,which is in accord with the idea that the decoy spacegrows dramatically and overwhelms any gain in align-ment accuracy due to increasing fragments. We observethis worsening effect from two to four fragments foralmost all cutoffs, especially at 1.25, where the signaldrops from 21 to 14 folds recognized.

Figure 5 shows the � distribution for both the test setand the negative control test set. Each epsilon value in thedistribution is an optimal epsilon (�N) from a differentsequence-structure pair. The vertical line indicates thecritical value of � � 1.20. Pairs beyond the cutoff point areconsidered recognized.

Fig. 5. The distribution of � for the 150 pairs of similar proteins in green (and corresponding negative controls in blue). The vertical line indicates � �1.2. Pairs to the right of this line are ones “recognized” to be similar folds by the algorithm.

536 W. CHEN ET AL.

Because each epsilon is from a different sequence-structure pair, we expect that for the negative control set,most optimal epsilons should fall on 1, an indication thatthese sequence-structures are not “special.” Note in Figure5, that the distribution of optimal epsilon values from thenegative control is more or less clustered around a peak of� � 1. On the other hand, the 150 similar pairs display an �distribution clearly skewed to the right, with more pairsspilling past the critical � value. This is true for tests underdifferent fragment restrictions.

The results of the fold recognition at different fragmentrestrictions are summarized in Figure 6. There appears anoptimal number of fragments for fold recognition at two,meaning that here the � is significant for the most pro-teins. Beyond this number, the average � and the numberof significant �s declines, in line with the expectations thatthe decoy space overwhelms the optimal native alignment.As a consequence, the number of recognized folds declinesalso. The optimal number of fragments for a particularsequence-structure pair does not seem to correlate witheither sequence similarity, nor structural similarity, be-

tween the query sequence and template structure (datanot shown). Perhaps a more important measure of successto consider is the cumulative recognition rate from one tofour fragments. We exclude the single fragment results,because of the rate of high false positives, which isexpected because these single fragment alignments aredominated by local contacts and are not representative ofprotein folds. The cumulative recognition rate for allfragments is the total number of elements in the union ofthe individual sets of recognized folds, m( � i

4 Fi), where Fi

is the set of recognized folds with i fragment restriction.There is an overlap between the sets of recognized folds ofdifferent fragment restrictions, as shown in Figure 7(a).The total recognized number of folds is the union of thedifferent sets of recognized folds. The same is true for thefalse positives [Fig. 7(b)]. The cumulative recognitionresults are shown in Figure 8.

For comparison, Table II also shows the fold recognitionachieved by unconstrained threading. In unconstrainedthreading, at the high and relevant values of � 1.15, 1.20,and 1.25, the signal and noise for unconstrained threadingare worse than those obtained under restricted fragmentconditions. Comparing only the two-fragment and theunconstrained threading results gives an indication of howmuch signal is being contributed solely from the two-fragment results: at � � 1.20 two-fragment results yield 21folds recognized, whereas the unconstrained results yield

TABLE II. Number of Folds Recognized at Various �cutoff andFragment Number Restrictions

�cutoff 2 3 4 Cumulative Unconstrained

1.00 119 (87) 124 (89) 123 (89) 129 (106) 121 (106)1.05 108 (48) 102 (51) 100 (52) 110 (71) 96 (70)1.10 74 (25) 71 (28) 67 (26) 87 (41) 77 (41)1.15 47 (10) 46 (13) 47 (16) 62 (21) 59 (28)1.20 36 (6) 32 (8) 25 (10) 45 (11) 41 (16)1.25 21 (4) 16 (5) 14 (3) 26 (7) 24 (9)

Numbers in parentheses give false positives. Beyond this number the fold is deemedrecognized. Cumulative column indicates the combined two- to four-fragment restrictionresults (combined false positives) under the corresponding � cutoff. Unconstrained columnindicates fold recognition when the number of fragments is unlimited.

Fig. 6. The recognition rate for different fragment restrictions. False-positive rates are indicated for comparison. The most folds are recognizedat two fragments. [Color figure can be viewed in the online issue, which isavailable at www.interscience.wiley.com.]

Fig. 7. a: Recognized folds at different fragment restrictions showsome overlap. Note some structural pairs are recognized for all threefragments. b: False positives display high overlap. Many false positivesrecognized under one fragment restriction are also recognized under allother fragment restrictions. This suggests some nonrandom feature offalse positives that threading identifies.

FOLD RECOGNITION WITH MINIMAL GAPS 537

24 folds recognized, just slightly higher. The performanceof the unconstrained threading is another indication thatusing large numbers of fragments to capture the specificsof the alignment can be harmful to fold recognition.Although En decreases with more numbers of fragments inthreading alignments, Ec will also decrease, and mostlikely, faster than En.

False Positives

By definition, false positives are improperly “recognized”because they are pairs that do not share structural similar-ity. Naively, one expects false positives to be randomevents, occurring when the unshuffled query sequence hasa deep optimal energy in an unrelated template. If falsepositives were random events, one would expect the cumu-lative false-positive rate to be proportional to the numberof combined sets. For instance, at two plus three fragmentcumulative recognition, one would expect twice the num-ber of false positives than found in two fragment recogni-tion alone. But this is not the case. From Figure 8, the twoplus three fragment false-positive rate is 8, compared to afalse-positive rate of 6 at two fragment recognition. Thismeans there is considerable overlap of false positives.Given this fact, we come to two conclusions. One, becausethe false-positive rate does not increase linearly withincreasing fragment number, it is beneficial to combine theresults of threading under different conditions, as long asthe different conditions generate different, uniquely recog-nized folds. In other words, the signal can increase whilethe noise remains relatively constant as results fromdifferent fragment restrictions are concatenated. Two,these false-positive events do not appear to be random, forprecisely the reason that they are correlated from set to

set. Rather, they possess some property that the threadingalgorithm recognizes. By inspection, 6 of 11 of the falsepositives identified show substantial helical content (e.g.,exemplified by �-helical coiled coils), with few long dis-tance contacts. It is entirely possible that these dominantshort-range interactions are responsible for subsequentrecognition by threading. For this reason, it is unclearwhether the term “false positive” is appropriate for thesepairs because they exhibit some regular feature that canbe picked out by threading.

We reasoned that these false positives can be addressedby eliminating local contacts that contribute to the contactmatrix. To test this, contacts between residues i and i � 3or closer were removed. Because a large number of thefalse positives were long coiled-coil proteins, or noncom-pact helical structures, we also eliminated templates thathad a low density of contacts. At 8.6 Å cutoff for defining aC�-C� contact, compact structures had a range of 2.5–3.5contacts per residue, whereas the less compact helicaltemplates had less than two contacts per residue. Theresulting fold recognition is shown in Table III. Aftereliminating some templates, the number of proteins in thetest set drop to 108. We observe that using mostly long-range contacts almost completely eliminates the falsepositives at highest � � 1.25. For the cumulative foldrecognition, the total number of false positives drops to 6(from 11 in Table II). The recognition rates on this reducedtest set is unchanged (e.g., 21 of 150 in two fragmentthreading of original test set, 15 of 108 recognized in twofragment threading with reduced test set, both approxi-mately at 14%). Our previous observation that the uncon-strained threading fold recognition is worse than thecumulative fold recognition with different fragment num-bers is borne out again in Table III. The effect is evenstronger here, perhaps indicating robustness with respectto the parameters of threading (because the results areconsistent across Tables II and Table III.) At � � 1.20, thenumber of folds recognized with two fragments is 15,whereas the number of folds recognized with uncon-strained threading is dismally poor at seven. In short,analysis of these false positives may prove fruitful in thatone may be able to identify which templates are the mostuseful and lead to the least noise. In the event that a querysequence might adopt such highly helical and low-densitystructures (e.g., �-helical coiled coils), we suggest othermethods (perhaps by secondary structure prediction) willprove useful.

Comparison to PSI-BLAST

For comparison, the same 150 pair test set (and controltest set) was subjected to fold recognition by PSI-BLAST.28

The only limitation of fold recognition by PSI-BLAST andother sequence alignment methods is that one requires thestructurally similar pair to possess some lower thresholdof sequence similarity. Although in threading, there is notheoretical lower bound on sequence similarity for foldrecognition of a query sequence. A fold recognition algo-rithm was developed, modeled on an algorithm developedby Friedberg et al.27 The results of PSI-BLAST foldrecognition are summarized in Table I. In the PSI-BLAST

Fig. 8. The cumulative fold recognition rate. At each number, thecumulative results up to that number of fragments are combined. Thereare overlapping recognized folds between different fragment restrictions.False-positive cumulative rates are indicated for comparison. The falsepositives do not increase substantially from one fragment number to thenext, meaning that the same false positives are recognized under differentfragment restrictions. [Color figure can be viewed in the online issue,which is available at www.interscience.wiley.com.]

538 W. CHEN ET AL.

experiment, we defined “recognition” so that, if for anyPSI-BLAST round the pair is assigned an e-value of lessthan 1e-2, that pair is considered to be recognized byPSI-BLAST as structurally similar. A BLAST e-valuedenotes the number of expected high-scoring alignments(or high-scoring segment pairs, abbreviated HSP) oncegiven the query sequence and template sequence lengths.If the threshhold for “high-scoring” is stringent, then thenumber of HSPs is poisson distributed, and e-values can beconverted into p values. For our choice of e-value cutoff setat 1e-2, the corresponding p value is also approximately1e-2 (i.e., the chance of finding at least one randomalignment with the threshold score, given the two se-quences, is just slightly less than 1e-2). In the table, eachpair that is recognized is accompanied by the highestPSI-BLAST-assigned e-value. The overall rate of recogni-tion, with no false positives (because PSI-BLAST has solidsignificance statistics), is 33 of 150 pairs (22%). Theseresults give some quantative indication of the difficulty ofthe test set. PSI-BLAST is unable to pick up most of thepairs.

Combining Results of Restricted FragmentThreading and PSI-BLAST

The ultimate goal of fold recognition is to recognize asmany folds as possible, without regard to methods. There-fore, it makes sense to tackle the problem with bothsequence alignment and threading methods. One methodis to directly combine the scoring potentials into someunified scoring potential, so the total score is the sum of asequence alignment score and a threading score.10,11 An-other method, conceptually simpler, is to simply applysequence alignment and threading in succession and com-bine the results. Although it is possible that this lattermethod misses the synergistic components of a combinedthreading sequence alignment prediction, the orthogonal-ity of the two methods makes it a reasonable way to tacklethe problem. The results of the restricted fragment thread-ing and PSI-BLAST were combined in the same way theresults of the different fragment restrictions were com-bined. The sequence alignment set of recognized folds andthe threading set of recognized folds form overlapping sets(cf. Fig. 7). The union of the two sets is the set of totalrecognized folds. The total fold recognition rate was foundto be 63 of 150 pairs (42%). We can quantify the relation-

ship between the two results in the following way. Eachsequence-structure pair gives a data point of either 1 if it isrecognized or 0 if it is not. For threading fold recognition,one can generate a set of binary numbers in the form 0, 1,1, 0, 1. . . . Correspondingly, a set of binary numbers can begenerated for PSI-BLAST fold recognition. The overlapbetween the two results can be quantified by calculatingthe Pearson correlation coefficient for the two sets. This isa measure of how strong a linear relationship existsbetween the two binary vectors.

r �

�xy ��x�y

N

���x2 ���x�2

N ���y2 ���y�2

N � (11)

For the threading and PSI-BLAST recognition data, thePearson correlation coefficient was calculated to be 0.28.The interpretation of this measure is easily made: forvalues close to 1, successes of one set will closely correlatewith the other. For values close to �1, the successesanticorrelate. The value we have calculated, 0.28, is quitelow. Interpreting this in the standard way, one may saythat there is little overlap between the two sets of recog-nized folds. We conclude that threading and sequencealignment methods seem quite independent of one an-other.

DISCUSSION

We applied a simple threading algorithm for fold recog-nition, by constraining the fragment number and therebyreducing the complexity of the decoy space. The � parame-ter from extreme value statistics was shown to be invalu-able in extracting a signal out of the simplified threading.We showed that this recognition signal is low at onefragment, peaks at two fragments, and slightly declinesthereafter. And significantly, � and fold recognition resultsfrom unconstrained threading (i.e., no limitation on num-ber of fragments) are consistently worse than threadingdone under conditions of fragment number restriction.These results suggest the physical picture that the depthof the optimal energy gain comes mostly from only twofragments, and then the decoy space begins to overwhelmthis energy gap as the number of fragments are increased.Our results suggest that the simple residue-residue con-

TABLE III. Number of Folds Recognized at Various andFragment Number Restrictions

�cutoff 2 3 4 Cumulative Unconstrained

1.00 94 (71) 96 (75) 99 (72) 102 (89) 63 (68)1.05 80 (49) 82 (46) 79 (47) 91 (64) 49 (42)1.10 55 (21) 56 (24) 57 (27) 68 (40) 35 (20)1.15 38 (8) 37 (12) 34 (2) 48 (23) 20 (15)1.20 23 (1) 23 (2) 24 (4) 34 (6) 13 (7)1.25 15 (1) 11 (0) 12 (1) 24 (2) 7 (4)

Numbers in parentheses give false positives when local contacts (i, i � 3) are omitted andlow-density templates are discarded. (See previous table.) The result is fold recognition ofsequences on compact structures using mostly long-range contacts. Cumulative columnindicates the combined two- to four-fragment restriction results. Unconstrained columnindicates fold recognition when the number of fragments is unlimited.

FOLD RECOGNITION WITH MINIMAL GAPS 539

tact potential does not have sufficient discriminatingpower to select the correct alignment among the decoysgenerated by full unconstrained threading. Furthermore,the results suggest that it may be possible to improvethreading by further simplification of the decoy space, byapplying more constraints that filter out alignments thatdo not contribute significantly to the optimal alignment.We combined threading and PSI-BLAST results in asimple way to achieve a total fold recognition rate of 42% ofthe 150 structural pairs.

The test set and test we have chosen is a realistic anddifficult benchmark for fold recognition. Very broadly, atest set can be characterized by a few measures: thesequence identity between structural pairs and the root-mean-square deviation (RMSD) between structural pairs.The test set used in this study (see Materials and Methods)is a series of structurally similar pairs that do not possesssignificant sequence identity (all pairs have �30% iden-tity). The PSI-BLAST fold recognition results yielded arecognition rate of 22%, a reflection of how sequencealignment methods do not perform spectacularly on thistest set. The set is also a mixture of low to high RMSD (1 to5 Å) between query sequence and structure, reflectingwhat one might expect in a real fold prediction run. It hasbeen shown that threading predictions deteriorate rapidlyas the RMSD between query sequence and templateincreases.9 In a real fold prediction run, there is noguarantee that there is a template with low RMSD to thequery sequence. A test set with a range of RMSD is mostappropriate for evaluating the fold recognition rate of aprediction method. For these reasons, comparison of theresults here to other fold recognition results is not trivial.

For example, high fold recognition rates of 80–90% havebeen reported for self-recognition with an algorithm opti-mized by linear programming.34 However, self-recognitionmeans the template and the query sequence have 0 RMSD,an unrealistic expectation for real fold prediction runs.When applied to real, homologous pairs, the recognitionrate is more reasonable (8 of 25) and comparable to therates we have obtained (30%). Other algorithms have beenreported to show fold recognition in the ranges of 70% ontest sets of homologous proteins.35,36 These results reliedon the Fischer–Eisenberg test set, which was selected forpairs with low RMS difference (�3 Å). This test set is also aless difficult benchmark than the one we selected becausethreading results depend strongly on how close in struc-ture template and query sequence match.9 An appropriatecomparison can be made with results reported recently byPanchenko et al.,10 in which a set of 163 structural pairswas subjected to a combined PSI-BLAST profile andthreading fold recognition algorithm. The Panchenko–Bryant test set was selected with respect to low-sequenceidentity, with no regard to the structural similarity of thepairs. For a particular choice of weighting parameter,Panchenko et al. reported recognition rates of 46% forpairs with sequence identity between 10 and 15%. Thisweighting parameter indicated a rather even consider-ation of sequence similarity scoring terms and threadingscoring terms. For comparison, the test set used in ourstudy was split: two thirds with pairs below 15% identity

and one third with pairs above 15% identity. Analogously,our results from combining the total PSI-BLAST andthreading algorithm recognition was 42%. These resultsare highly comparable.

It is worth mentioning that this benchmark does notcompletely mimic a methodology for fold recognition, al-though it certainly reflects and lends itself immediately tosuch a methodology. The reason being that fold recognitiondemands one to select one “proper” fold from a set oftemplate folds. In this test set and the negative control testset, one is asking whether the query sequence will adoptthe fold of the template structure being probed. Thenatural way to turn this into a fold recognition protocol isto pair up the single query sequence of interest against aset of templates and ask the same question of each pair:does the query sequence adopt template fold in question?

The hope is that by investigating a large set of such pairsas we have done here, one gains insights to a real foldrecognition protocol (which might also involve 150 pairs ofsequence structures, except the same query sequencewould be used in every pair). Following this, the implica-tions are also clear with respect to the negative control testset. If the negative control test set displays a high level offalse positives, then in a real fold recognition experiment,it is unlikely that the right fold is the only fold picked out.Although BLAST may find the right fold for only 25% ofsequences, it does so with almost no false positives becauseof strict E-values.

It is also worth mentioning that although this bench-mark does not mimic exactly a fold recognition experi-ment, in fact, the results are more encouraging in light ofthis. The test set involved probing whether a querysequence matched one particular template. The success ofthis is known to correlate with the structural similaritybetween query sequence and template. In a minimalistthreading experiment, one would thread the query se-quence through a very small number of templates. In thiscase, perhaps the success rate will be on the low end andwill follow the rates of our test (20–30%). However, athreading experiment need not be minimalist and caninclude many templates from the same fold family, butdiffering in RMSD from each other. If so, then the chancesof success are increased, because each template offers anadditional chance that it is a low RMSD template relativeto the query sequence.

Even with this limited survey, it appears the simplifiedform of threading with restricted fragment number andrigorous � calculations shows considerable success in foldrecognition. A more rigorous and proper comparison of foldrecognition will ideally involve a standardized test setwith appropriate RMSD and sequence identity conditions,or a blind prediction test such as CASP.37 Work on boththese objective measures is in progress.

FUTURE DIRECTIONSDecoy Space

The idea of restricting fragment number is to reduce thedecoy space so that En stands out. This is under theassumption that En does not change much under therestriction. Most of current techniques to improve thread-

540 W. CHEN ET AL.

ing10,11,34 aim to increase the En gap by either pushing upthe energies of decoys by applying profiles or improve-ments in the potential to push down energies of the nativealignment. A possible problem with applying statisticaland structural profiles is the paucity of structures avail-able for particular folds (PDB size is 5,000–15,000 struc-tures). Improvements in potentials face similar problems,because design or knowledge-based procedures have only afinite number of structures with which to work. Themethod we have presented is a departure from thesephilosophies, in that we attempt to eliminate decoys toincrease the � parameter. In theory, there are variousdirections in which decoy space reduction can take. Futurework may be directed toward constraining alignments inother ways that conform to the objective of reducing decoyspace but retaining a native-like alignment. Such addi-tional constraints would have additive effects, systemati-cally eliminating decoys and increasing En. This wouldallow rescue of native alignments that would otherwisehave energy gaps below the limit of detection.

Potentials

The potential used in this study was one designed tostabilize a small selection of folds.18 Although it is knownthat the specific potential can be important for differenttasks in the protein folding field, we did not address thisproblem in this study. It not difficult to imagine that apotential specifically designed under threading conditionswill improve the results presented here. The MS potentialwhich we have used (see Materials and Methods) wasdesigned by optimizing stability of one fold against decoyconformations. In threading, the case is that one align-ment must be stabilized against alternative alignments.Because the decoy space is quite different in both of thesecases, it is expected that the details of the potentials underthese two cases will be somewhat different. Therefore, afruitful direction to take for improving fold predictionresults in threading, in this study or other studies, is toreplace the generic potentials with ones constructed specifi-cally for the purpose of threading.

MATERIALS AND METHODSThe Threading Algorithm

We use a Monte Carlo threading algorithm. The algo-rithm is described in detail elsewhere.9 Here we summa-rize the most important aspects of the algorithm. Thealgorithm attempts to maximize the score of an alignmentby trying different alignments. A particular alignment isakin to a sequence alignment, in which fragments of thequery sequence are aligned to fragments of the structure.In the threading case, fragments of the query sequence aremounted onto the C�-atom coordinates of the structure.Fragments are constrained to be no less than six residuesin length. An alignment can be represented by a two-dimensional (2D) array, along the axes of which are theresidue identities of the query sequence and the templatestructure. Lines drawn against this 2D array indicate theone-to-one mapping of the threading alignment. The MonteCarlo move set, in this picture, is the following:

1. Shift: A fragment is shifted in one of four directions by nresidues.

2. Extension/contraction: A fragment is lengthened by nresidues; a fragment is shortened by n residues.

3. Split/Join: A fragment is split into two fragments atsome midpoint; two neighboring fragments are joined.

4. Jump: A piece of a fragment is allowed to split off andjoin another nearby fragment. This is equivalent to aseries of split, shift, and join.

A typical run is slowly annealed over 106 moves, startingat a temperature far above Topt and ending at a tempera-ture far below Topt. Convergence is verified by checking asubset of alignments obtained from very long runs (50 �106 moves).

The score of an alignment is given by the sum of atwo-body contact MS potential,18 which was designed tostabilize folds, and resembles the MJ knowledge-basedpotential. No attempt was made to optimize the potentialfor threading. A cutoff contact distance of 8 Å between theC� atoms was used. C� contacts that are nearest andnext-nearest neighbors were not used, because inclusionemphasized local contacts and decreased both alignmentaccuracy and fold recognition rates.

The Fold Prediction Algorithm

The fold prediction algorithm is a simple iterativealgorithm. The outline of the algorithm is summarized inFigure 9. The native query sequence is optimally threadedwith a fixed number of fragments (e.g. 1, 2, 3, or 4) [Fig.9(a)]. To obtain the decoy spectrum, assuming the REM isvalid, we shuffle the query sequence a thousand times andrecord the optimal energy [Fig. 9(b)]. This proceduresimilar to procedures used in other fold prediction algo-rithms,13 except here we rigorously use extreme valuestatistics because the calculations involve only optimized(extremal) energies. From this distribution of decoys, wecan obtain Ec and the parameter � [Fig. 9(c)]. Then nativeenergy is compared to the Ec of decoys. If there is an energygap, as defined by � 1.2, then the fold is considered to berecognized. At this point, the number of allowed fragmentsmay be increased, and determination of � of the nativequery sequence is repeated. Sometimes the fold is recog-nized regardless of how many fragments is used. Some-times the fold is only recognized at one fragment number.Calculation of Ec of the native spectrum is described below.

Picking a Test Set

From the FSSP database, we selected 150 sequence-structure pairs for threading tests. Each query sequenceand template structure was permitted to be no more than150 residues in length, to eliminate the difficulty of dealingwith domains. The query sequence and the templatestructure were permitted to be not more different than50% in length from the longer of the two. The test set pairsdisplayed no sequence identity 30%, to eliminate “easy”fold recognition targets that can be picked out by sequencealignment algorithms. Because it’s possible that threadingmay have systematic biases toward certain folds or se-quence families, an unbiased or nonredundant test set was

FOLD RECOGNITION WITH MINIMAL GAPS 541

desired. To pick such a test set, we picked pairs byensuring that none or few of its structural neighbors hadalready been picked. This is the basis of constructing thenonredundant test set from all FSSP classes. The list ofsequence-structure pairs of this test set can be found inTable I. The test set of protein pairs is available fordownload at http://paradox.harvard.edu/will/test_set.html.

Calculating �

To calculate � of a prediction attempt, we require En, Ec,and Eavg. The underlying assumptions of the threadingenergy landscape follow from the Random Energy Mo-del29–31 (see Theory also). The simplest method to compu-tationally derive Ec of a particular sequence-structure pairis to repeat the following procedure many times: shufflethe query sequence and then optimize the alignment ofthis shuffled query sequence onto the structure. Theoptimal energies of these decoy sequences should form anextreme value distribution, because they are optimizedeach time the query sequence is shuffled. The statistics ofextremes allows us to calculate Ec of the decoy distributioneasily. If one assumes that the decoy energy distribution is

Gaussian (according to the REM), with average energyEavg and M possible alignments, then the distributionfunction of extremal values take on the following form:

p�Eopt� � A exp�A � Eopt � B � exp�A � Eopt � B�� (12)

B � A � �1 �

log�2 log � M

�2���4�log� M

�2�� (13)

1 � dEoptp�Eopt� (14)

With the collection of shuffled and optimized energies, onecan reconstruct the extreme value distribution with acurve fit using the two parameters A and B. Although Ec isstill unknown, in the limit when the number of possiblealignments is large (M � 100,000 � 1), one can find Bsince B 3 A � Ec. But we know this is also the peak of thedistribution. Another way to view Ec is that it’s the most

Fig. 9. Summary of the fold prediction algorithm. a: Monte Carlo threading program produces an optimalalignment with native query sequence. b: Native query sequence is shuffled 1000� and optimally aligned. c:The shuffled and optimally aligned decoys form an extreme value distribution, from which � can be calculated.The cycle is repeated for different fragment number restrictions.

542 W. CHEN ET AL.

probable extreme value. The correlation of the simulationEc and the analytical values of Ec is 0.69 (data not shown.)We use the simulation values because they reflect the truesystem better. Thus, comparisons of En to Ec naturallyallow one to gauge whether En is significant.

dp�Eopt�

dEopt�

Ec

� 0 (15)

Ec �BA (16)

In addition to Ec, we require Eavg, or the average energyof all alignments. Eavg is acquired by long sampling ofdifferent alignments of the fixed query sequence.

These values allow us to calculate �, where � is defined as(Eopt � Eavg)/(Ec � Eavg). The � parameter then has avery simple interpretation: it’s the ratio of the depth of thenative energy, to the depth of the “typical” optimal energyfor some random, aligned sequence. The random se-quences considered here have their composition fixed tothat of the native query sequence. Thus, using � tomeasure how “good” the native query sequence fits into thetemplate corrects for both the query sequence compositionand the size of the template and the size of the decoy space.Therefore, the use of � is important because we arecomparing results across different decoy spaces (i.e., classesof decoys that differ by number of fragments.)

ACKNOWLEDGMENTS

We thank Edo Kussell, Gabriel Berriz, and Jun Shimadafor stimulating discussion.

REFERENCES

1. Ortiz AR, Kolinski A, Rotkiewicz P, Ilkowski B, Skolnick J. Abinitio folding of proteins using restraints derived from evolution-ary information. Proteins 1999;S3:177–185.

2. Simons KT, Kooperberg C, Huang E, Baker D. Assembly of proteintertiary structures from fragments with similar local sequencesusing simulated annealing and bayesian scoring functions. J MolBiol 1997;268:209–225.

3. Rychlewski L, Jaroszewski L, Li W, Godzik A. Comparison ofsequence profiles. strategies for structural predictions using se-quence information. Protein Sci 2000;9:232–241.

4. Teichmann S, Chothia C, Church G, Park J. Fast assignment ofprotein structures to sequences using the intermediate sequencelibrary pdb-isl. Bioinformatics 2000;16:117–124.

5. Dunbrack R. Comparative modeling of casp3 targets using psi-blast and scwrl. Proteins 1999;S3:81–87.

6. Bowie JU, Luthy R, Eisenberg D. A method to identify proteinsequences that fold into a known three-dimensional structure.Science 1991;253:164–170.

7. Bryant SH. Evaluation of threading specificity and accuracy.Proteins 1996;26:172–185.

8. Finkelstein A. Protein structure: what is it possible to predictnow? Curr Opin Struct Biol 1997;7:60–71.

9. Mirny LA, Shakhnovich EI. Protein structure prediction by thread-ing: why it works and why it does not. J Mol Biol 1998;283:507–526.

10. Panchenko AR, Marchler-Bauer A, Bryant SH. Combination ofthreading potentials and sequence profiles improves recognition. JMol Biol 2000;296:1319–1331.

11. Shan Y, Wang G, Zhou H. Fold recognition and accurate query-

template alignment by a combination of psi-blast and threading.Proteins 2001;42:23–37.

12. Lathrop R, Smith T. Global optimum protein threading withgapped alignment and empirical pair scoring function. J Mol Biol1996;40:641–665.

13. Panchenko A, Marchler-Bauer A, Bryant S. Threading withexplicit models for evolutionary conservation fo structure andsequence. Proteins 1999;S3:133–140.

14. Mirny LA, Finkelstein A, Shakhnovich EI. Statistical significanceof protein structure prediction by threading. Proc Natl Acad SciUSA 2000;97:9978–9983.

15. Bryant S, Altschul S. Statistics of sequence-structure threading.Curr Biol 1995;5:236–244.

16. Levitt M, Gerstein M. A unified statistical framework for sequencecomprison and structure comparison. Proc Natl Acad Sci USA1998;95:5913–5920.

17. Miyazawa S, Jernigan RL. Residue-residue potentials with a favor-able contact pair term and an unfavorable high packing density term,for simulation and threading. J Mol Biol 1996;256:623–644.

18. Mirny LA, Shakhnovich EI. How to derive a protein foldingpotential? A new approach to an old problem. J Mol Biol 1996;264:1164–1179.

19. Vendruscolo M, Najmanovich R, Domany E. Protein folding incontact map space. Phys Rev Lett 1999;82:656–659.

20. Kussell E, Shimada J, Shakhnovich E. A structure-based methodfor derivation of all-atom potentials for protein folding. Proc NatlAcad Sci USA 2002;99:5343–5348.

21. Lathrop R. The protein threading problem with sequence aminoacid interaction preferences is np-complete. Protein Eng 1994;7:1059–1068.

22. Matsuo Y, Bryant S. Identification of homologous core structures.Proteins 1999;35:70–79.

23. Reva B, Skolnick J, Finkelstein A. Averaging interaction energiesover homologs improves protein fold recognition in gapless thread-ing. Proteins 1999;35:353–359.

24. Finkelstein A. 3d protein folds: homologs against errors—a simpleestimate based on the random energy model. Phys Rev Lett1998;80:4823–4825.

25. Holm L, Sander C. Protein structure comparison by alignment ofdistance matrices. J Mol Biol 1993;233:123–138.

26. Marchler-Bauer A, Bryant S. A measure of success in foldrecognition. Trends Biochem Sci 1997;22:236–240.

27. Friedberg I, Kaplan T, Margalit H. Evaluation of psi-blast align-ment accuracy in comparison to structural alignments. Protein Sci2000;9:2278–2284.

28. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, MillerW, Lipman DJ. Gapped blast and psi-blast: a new generation ofprotein database search programs. Nucleic Acids Res 1997;25:3389–3402.

29. Derrida B. Random-energy model: limit of a family of disorderedmodels. Phys Rev Lett 1980;45:79–82.

30. Bryngelson JD, Wolynes PG. Spin glasses and the statisticalmechanics of protein folding. Proc Natl Acad Sci USA 1987;84:7524–7528.

31. Shakhnovich EI, Gutin AM. Formation of unique structure inpolypeptide chains: theoretical investigation with the aid of areplica approach. Biophys Chem 1989;34:187–199.

32. Pande V, Yu, Grosberg A, Joerg C, Tanaka T. Is heteropolymerfreezing well described by the random energy model? Phys RevLett 1996;76:3987–3990.

33. Chuang J, Grosberg AY, Kardar M. Free energy self-averaging inprotein-sized random heteropolymers. Phys Rev Lett 2001;87:078104.

34. Meller J, Elber R. Linear programming optimization and a doublestatistical filter for protein threading protocols. Proteins 2001;45:241–261.

35. Jones D. Genthreader: an efficient and reliable protein fold recogni-tion method for genomic sequences. J Mol Biol 1999;287:797–815.

36. Fischer D, Eisenberg D. Protein fold recognition using sequence-derived predictions. Protein Sci 1996;5:947–955.

37. Moult J, Fidelis K, Zemla A, Hubbard T. Critical assessment ofmethods of protein structure prediction (casp): round iv. Proteins2001;Suppl. 5:2–7.

FOLD RECOGNITION WITH MINIMAL GAPS 543