

doi:10.1006/jmbi.2000.3837 available online at http://www.idealibrary.com · J. Mol. Biol. (2000) 301, 173–190

HMMSTR: a Hidden Markov Model for Local Sequence-Structure Correlations in Proteins

Christopher Bystroff†1,2*, Vesteinn Thorsson†2,3* and David Baker2*

1 Department of Biology, Rensselaer Polytechnic Institute, Troy, NY 12180-3590, USA
2 Department of Biochemistry, University of Washington, Seattle, WA 98195-7350, USA
3 Department of Molecular Biotechnology, University of Washington, Seattle, WA 98195-7730, USA

E-mail addresses of the corresponding authors: [email protected], {thorsson,dabaker}@u.washington.edu

†These authors contributed equally to this work.

Abbreviations used: I-sites, invariant or initiation sites; HMM, hidden Markov model; EM, expectation maximization.


We describe a hidden Markov model, HMMSTR, for general protein sequence based on the I-sites library of sequence-structure motifs. Unlike the linear hidden Markov models used to model individual protein families, HMMSTR has a highly branched topology and captures recurrent local features of protein sequences and structures that transcend protein family boundaries. The model extends the I-sites library by describing the adjacencies of different sequence-structure motifs as observed in the protein database and, by representing overlapping motifs in a much more compact form, achieves a great reduction in parameters. The HMM attributes a considerably higher probability to coding sequence than does an equivalent dipeptide model, predicts secondary structure with an accuracy of 74.3 %, backbone torsion angles better than any previously reported method, and the structural context of β-strands and turns with an accuracy that should be useful for tertiary structure prediction.

© 2000 Academic Press

Keywords: hidden Markov models; I-sites library; sequence patterns; motifs; clustering

*Corresponding authors

Introduction

Proteins have recurrent local sequence patterns that reflect evolutionary selective pressures to fold into stable three-dimensional structures. Many of these local sequence patterns correlate with common structural motifs such as helix caps and beta hairpins. A general model of protein sequence that captures these local features could lead to improved methods for gene finding, protein structure prediction, remote homolog detection and other applications relating to the interpretation of genomic sequence information. Here, we describe the development of such a model, based on the I-sites library of sequence-structure motifs.

The I-sites (invariant or initiation sites) library consists of an extensive set of short sequence motifs, of lengths 3 to 19, obtained by exhaustive clustering of sequence segments from a non-redundant database of known structures (Han & Baker, 1996; Bystroff & Baker, 1998). Each sequence pattern correlates strongly with a recurrent local structural motif in proteins. Approximately one-third of all residues in the database are found in an I-sites motif that can be predicted with a high degree of confidence (>70 %). The library is non-redundant in that no motif is completely contained within another, longer motif. However, many of the motifs overlap. For example, the helix cap position may occur in the fourth position of one motif and the eighth position of another. Furthermore, the isolated motif model does not capture higher order relationships, such as the distinctly non-random transition frequencies between the different motifs. For example, sequences characteristic of amphipathic helices are frequently bracketed by N and C-terminal helix-capping motifs (Doig & Baldwin, 1995; Aurora & Rose, 1998).

The redundancy inherent in the I-sites model suggests a better representation that would model the diversity of the motifs and their higher order relationships while condensing features they have in common. A hidden Markov model (HMM; Rabiner, 1989) is well suited to this purpose. An HMM consists of a set of states, each of which is associated with a probability distribution for generating a symbol, such as an amino acid residue or a secondary structure type, and a set of transition probabilities between the states.




Previous applications of HMMs to biological sequence data include the problems of finding coding regions in DNA and splice junctions (Kulp et al., 1996; Burge & Karlin, 1998; Lukashin & Borodovsky, 1998; Henderson et al., 1997), finding transmembrane regions in proteins (Sonnhammer et al., 1998), and secondary structure prediction (Asai et al., 1993; Stultz et al., 1993; Di Francesco et al., 1997; Lio et al., 1998). Left-right, or feed-forward, HMMs have proven to be very powerful representations of protein sequence families (Krogh et al., 1994; Eddy, 1996; Sonnhammer et al., 1997), able to pick out distant homologs that were undetectable by other sequence alignment methods (Karplus et al., 1999), and to align their sequences correctly.

The HMM we discuss here, HMMSTR, is fundamentally different from previous Markov models of proteins. HMMSTR does not have a pre-defined ``left-right'' topology, as ``profile'' HMMs do (Eddy, 1998), but instead has a highly branched and multi-cyclic topology, discovered from the protein database through a motif clustering process. A single state in our model always represents a single position, i.e. there are no gap or insertion states. Unlike earlier HMMs that model sequence features specific to a single protein family, HMMSTR models local motifs common to all protein families. HMMSTR was trained simultaneously on sequence and structure databases, and shows considerable promise for a variety of applications, including gene finding and prediction of local structure and secondary structural context.

Results

Novel local structure (I-sites) motifs

We first briefly describe novel I-sites motifs found in preparation for the construction of HMMSTR. Weak or rare sequence-structure correlations were found for 180 sequence-structure clusters beyond those published (Bystroff & Baker, 1998), including a few previously uncharacterized local structure motifs, such as the α-α corner and the Type-I′ β-hairpin. Additional new clusters contain alternative sequence patterns for previously characterized motifs. Others are new sequence patterns for rare and novel structural motifs, such as a glycine-rich α-helix N-cap, which differs in structure from the well-known N-capping box. Figure 1 shows some of the more interesting new motifs.

The α-α corner has been described by Efimov (1995), but was not fully characterized. Detailed profiles of the α-α corner and the other new motifs are available in the latest version of the I-sites library, along with a mapping of the position-specific scoring matrix score to the confidence of the motif prediction. This prediction service is provided by the I-sites server (isites.bio.rpi.edu).

Description of the HMM

As a starting point for the work described here, each I-sites motif was represented as a chain of Markov states, each of which contains information about the sequence and structure attributes of a single position in the motif (see Figure 2, discussed in more detail below). Adjacent positions were represented by transitions from one state to the next. Merging of these linear chains of states, based on sequence and structure similarity (see Methods), resulted in graphs such as the one shown in Figure 3, in this case representing two ways of building a hairpin. The graphs were hierarchically merged until almost all motifs were contained in a single graph. Branches, bulges (as in Figure 3) and long cycles, but not short cycles, were allowed to form during the merging process.

The merged graph of I-sites motifs comprises a network of states connected by probabilistic transitions, or more specifically, an HMM. Each state can produce, or ``emit'', amino acid residues and structure symbols according to a probability distribution specific to that state. There are four probability distributions defined for the states in our models, b, d, r, and c (see Figure 2), which describe the probability of observing a particular amino acid, secondary structure, backbone angle region (see Figure 5), or structural context descriptor, respectively. A context descriptor represents the classification of a secondary structure type according to its context. For example, a hairpin turn is distinguished from a diverging turn, and a β-strand in the middle of a sheet is distinguished from one at the end of a sheet. The conversion of the I-sites library to an HMM is advantageous, as training and application of HMMs is greatly facilitated by existing, powerful algorithms.
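By way of illustration, a state of this kind can be held in a small data structure. The sketch below is ours, not the authors' implementation (the Discussion notes that the actual training code is in C++); all names are hypothetical, and the vector sizes follow the text (20 amino acids, 3 secondary structure classes, 11 backbone angle regions, 10 context symbols).

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class MarkovState:
        """One HMMSTR-style state: four emission distributions."""
        b: np.ndarray  # P(amino acid | state), length 20
        d: np.ndarray  # P(helix/strand/turn | state), length 3
        r: np.ndarray  # P(backbone angle region | state), length 11
        c: np.ndarray  # P(structural context symbol | state), length 10

        def __post_init__(self):
            # Each emission vector must be a normalized probability distribution.
            for v in (self.b, self.d, self.r, self.c):
                assert np.isclose(v.sum(), 1.0)

    # Example: a mostly-helical state with otherwise uniform emissions.
    state = MarkovState(b=np.full(20, 0.05), d=np.array([0.8, 0.1, 0.1]),
                        r=np.full(11, 1 / 11), c=np.full(10, 0.1))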

Three models are discussed here, denoted λ with a superscript. Two (λD, λC) were created by clustering the I-sites motifs using observed adjacencies in the database, then by training the models on sequence plus secondary structure data (λD), and on sequence plus structural context data (λC). A third model was produced by clustering the I-sites motifs based on hierarchical pairwise alignments, followed by training on sequence plus backbone angle data (λR). Due to space considerations, only λR is illustrated here, in Figure 4. The differences in model topology were not thoroughly investigated, and are difficult to quantify, though strong similarities are clearly present. The models were improved by modifying the topology of the HMMs and optimizing parameters using a training set of 618 proteins. A considerable reduction in the number of initial parameters was found to improve the predictive power of the model in the applications described below. Thus, the final models are smaller than the initial merged graphs. The name HMMSTR refers to the three models collectively; the specific model to be used depends on the application.


Figure 1. Rare or weak I-sites motifs. Three of the new weak or rare sequence-structure correlations found in the growing sequence-structure database since the publication of the I-sites library (Bystroff & Baker, 1998). (a) The α-α corner, first described by Efimov (1993). (b) A shorter version of the diverging β-turn described previously. (c) The glycine-rich helix N-cap, which appears to be novel. Shown are (i) backbone angles; (ii) sequence profiles, using a color scale of red (greater than three-times background amino acid frequency) to blue (less than one-third of background amino acid frequency), for representative structures (iii): (a) 1phg 163-174, (b) 1eny 5-12, and (c) 1pbe 5-15. Blue side-chains are conserved non-polar positions; green, conserved polar; magenta, conserved proline residue; red dots, conserved glycine residue. (Motifs were found by a clustering method, prior to development of the HMM.)



Figure 2. A Markov state. A hidden Markov model consists of Markov states connected by directed transitions. Each state emits an output symbol, representing sequence or structure. There are four categories of emission symbols in our model: b, d, r, and c, corresponding to amino acid residues, three-state secondary structure, backbone angles (discretized into regions of φ/ψ space) and structural context (e.g. hairpin versus diverging turn, middle versus end-strand), respectively.


Clear statistical interactions between local structure motifs are encoded in HMMSTR. For example, as seen in Figure 4, different α-helix C-caps branch from amphipathic helices at different points relative to the amphipathic periodicity and are followed by different loop or strand motifs. The type-3 α-C-cap is often followed by a hydrophobic β-strand, while the α-C-caps of types 1 and 2 are followed by amphipathic β-strands or a coil, and then another helix. The reason for the statistical interactions between motifs lies in the chemical interactions between conserved side-chains and in the geometric constraints imposed by the local structure. Favorable energetic interactions between adjacent motifs lead to greater abundance of those adjacencies in the database. Because this is an overview of what is clearly a complex system of interactions, no more will be said here about the underlying chemistry. Instead, we concentrate on establishing the validity of the model.

Applications of the HMM

The HMM captures recurrent features of both protein sequences and protein structures and thus has application to a wide range of problems of interest. Potential applications are summarized in Table 1 in terms of the input, output, and the probability evaluated for each application. The input is a sequence of known attributes of a protein for each position, such as an amino acid sequence or profile (O), a sequence of secondary structure symbols (D), symbols denoting the context in which the secondary structure symbols are found (C), or dihedral angle regions (R, see Figure 5).

Gene finding

Presenting the model with a potential coding sequence O, we may determine the likelihood that it is a genuine coding sequence by evaluating the probability of the sequence according to the model, P(O). P(O) is the sum, over all possible paths through the network, of the probability of that path multiplied by the probability that the path emits the sequence O (equation (1), in Methods).
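In outline, P(O) is computed with the standard forward recursion rather than by enumerating paths. The sketch below is a minimal illustration under simplifying assumptions (dense NumPy parameters, no non-emitting ``nought'' state, and no log-space scaling, which a real implementation would need to avoid underflow on realistic sequence lengths); the names pi, a and b are ours.

    import numpy as np

    def forward_probability(obs, pi, a, b):
        """P(O | model) by the forward algorithm (the path sum done efficiently).
        obs: amino acid indices (0..19); pi: (N,) initial state probabilities;
        a: (N, N) transitions a[i, j] = P(j | i); b: (N, 20) residue emissions."""
        alpha = pi * b[:, obs[0]]          # initialize with the first residue
        for o in obs[1:]:
            alpha = (alpha @ a) * b[:, o]  # propagate one position, then emit
        return alpha.sum()                 # sum over all possible final states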

Secondary structure prediction

A prediction of the three-state secondary structure symbol for each position in the sequence O may be obtained by summing the state-specific probability distribution d (see Figure 2), multiplied by the position-specific state probability, over all states (see Methods). The predicted secondary structure state is the one with the highest value in the summed distribution.
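A sketch of this posterior-weighted vote, under the same simplifying assumptions as the forward sketch above (per-position rescaling is again omitted for clarity):

    import numpy as np

    def predict_secondary_structure(obs, pi, a, b, d):
        """Three-state prediction: argmax over m of sum_i gamma[t, i] * d_i(m),
        where gamma[t, i] is the posterior probability of state i at position t.
        d: (N, 3) per-state helix/strand/turn emission probabilities."""
        T, N = len(obs), len(pi)
        alpha, beta = np.zeros((T, N)), np.zeros((T, N))
        alpha[0] = pi * b[:, obs[0]]
        for t in range(1, T):                      # forward pass
            alpha[t] = (alpha[t - 1] @ a) * b[:, obs[t]]
        beta[-1] = 1.0
        for t in range(T - 2, -1, -1):             # backward pass
            beta[t] = a @ (b[:, obs[t + 1]] * beta[t + 1])
        gamma = alpha * beta
        gamma /= gamma.sum(axis=1, keepdims=True)  # state posteriors per position
        return (gamma @ d).argmax(axis=1)          # 0 = H, 1 = S, 2 = T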

Local and super-secondary structure prediction

Structural context symbols and dihedral angle regions are predicted much the same way as secondary structure, but using the probability distributions c and r, respectively.

Sequence design

Since sequence and structure are formulated in a dual and symmetric fashion in our model, design proceeds analogously to structure prediction, i.e. a sequence of structure descriptors (D, R or C) leads to a sequence of position-specific probabilities for all states, which may be used to predict preferred residues.

Sequence alignment

The final entries in Table 1 concern the alignment of two amino acid sequences O1 and O2, as is commonly done in homology searches. One method to increase the sensitivity of such alignments is to derive from the HMM amino acid substitution matrices specific to the local environment at each position. An alternative is to compute the probability that the two sequences take the same path through the HMM, by the sum over all paths of the product of single-path contributions to P(O1) and P(O2).†

†An efficient algorithm is needed to reduce the computational cost of this direct sum, as the standard forward-backward dynamic programming algorithm does for P(O).
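Because the sum over paths factorizes position by position exactly as it does for P(O), the same-path probability for two equal-length sequences can be computed with a forward-style recursion in which each state emits both residues. A sketch (same hypothetical parameter names as above; NumPy arrays assumed):

    def joint_path_probability(obs1, obs2, pi, a, b):
        """Sum over all paths Q of P(Q) * P(O1 | Q) * P(O2 | Q) for two
        equal-length sequences. Note this differs from P(O1) * P(O2), which
        would allow the two sequences to take different paths."""
        alpha = pi * b[:, obs1[0]] * b[:, obs2[0]]
        for o1, o2 in zip(obs1[1:], obs2[1:]):
            alpha = (alpha @ a) * b[:, o1] * b[:, o2]
        return alpha.sum()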


Figure 3. Merging of two I-sites motifs to form an HMM. The extended Type-I hairpin motif and the Serine hairpin align in 3D space and in sequence with a short mismatch in the turn. The color scheme is as described in the legend to Figure 1. The resulting HMM topology is shown. The shaped icons represent Markov states, with probabilistic emission properties (see Figure 2). Rectangles are predominantly β-strand states, and diamonds are predominantly turns. The color of the icon indicates a sequence preference as follows: blue, hydrophobic; green, polar; yellow, glycine residue. Numbers in icons are arbitrary (unique) state identifiers.


Performance summary

Table 2 gives a summary of the overall performance of the model for a number of possible applications, and below we discuss the results in more detail. The performance is summarized for the training set and for a randomly selected, independent test set of 61 proteins. In nearly all cases, the scores of the test and training sets are comparable, indicating that the process of refining the model has not resulted in overfitting of the data. Performance is evaluated on models λD, λC, and λR.


Figure 4. The topology of the model λR and the locations of selected I-sites motifs. Icon colors and shapes as in Figure 3, and with circles predominantly helix; magenta, proline; white, no sequence preference. All forward transitions with probability greater than 0.001 and all Markov states with posterior probabilities (obtained by summing over all database positions) greater than 0.0001 are shown. Icons with bold borders represent sinks and sources. All sinks are connected to all sources through transitions to the non-emitting ``nought'' state (not shown). Disjoint graphs (lower right) connect to the main graph only through the nought state. Backshaded pathways are mapped to I-sites motifs, whose structures are drawn in the margins. In some cases, there are multiple pathways that map to a single structural motif (e.g. helix N-cap) and in other cases, multiple local structure motifs share a common segment of the model (e.g. states 14 and 15).



Table 1. Applications of the HMM

Application                        Input                               Output                       Measure
Gene finding                       Single sequences (O)                Likelihood for coding        P(O)
Secondary structure prediction     Sequence profile (O)                Secondary structure (D)      P(D|O)
Structural context prediction      Sequence profile (O)                Context (C)                  P(C|O)
Dihedral angle region prediction   Sequence profile (O)                Dihedral angle region (R)    P(R|O)
Protein design                     Structure (D, C, R)                 Sequence (O)                 P(O|D, C, R)
Sequence comparison                Sequence 1 (O1), Sequence 2 (O2)    Likelihood for alignment     P(O1 ∘ O2)

Applications of the model are given in terms of the input for the application, the output, and the corresponding measure or statistic that may be evaluated with the HMM. The measures are probabilities P, where a vertical bar specifies a conditional probability. O denotes a sequence of amino acid residues, D a sequence of secondary structure symbols, C a sequence of structural context symbols, and R a sequence of dihedral angle region symbols. In the final entry, ∘ denotes a similarity measure.


The sequence score L(S; λD − λbg) is a measure of the probability of sequences occurring under the model λD, relative to the probability under a simpler model which takes into account only amino acid composition (λbg). In the first row of Table 2 this score is evaluated on single sequences, as is relevant for the case of novel sequences with no known homologs. The value on the test set, 0.024, indicates that a protein of length 350 residues generally has a probability 2^(0.024 × 350) ≈ 338 times greater under λD than under the simpler model. For profile-based evaluation, our sequence score is 0.226 on the test set. The Q3 score is the fraction of positions correctly assigned to one of the three states: helix, strand, or turn. The value on the test set, Q3 = 74.3 %, is an indication that local structure at the level of three-state prediction is accurately reproduced by λD. The next two lines in Table 2 show the accuracy of high-confidence (>70 %) predictions of structural ``context'': β-hairpin versus β-diverging turn on the one hand, and middle β-strand versus end β-strand on the other, for model λC. The MDA score is the fraction of residues that are found in correctly predicted eight-residue segments, i.e. segments in which no predicted backbone torsion angle differs by more than 120° from the true structure. The overall MDA score, 59 % on the test set for model λR, is substantially higher than the 48 % overall accuracy reported in our earlier work (Bystroff & Baker, 1998). The success of the secondary structure, context and torsion angle predictions indicates that these predictions may give useful constraints for tertiary structure prediction (Simons et al., 1998). For the applications above, scores evaluated from the two models other than the one quoted were only slightly lower.


Identification of coding regions

The desired test for the potential of our HMM to aid in the identification of coding regions involves integration into one of the several excellent gene identification systems now in use (see e.g. Li, linkage.rockefeller.edu/wli/gene/), and evaluation by comparison with existing methods. In lieu of a comprehensive test of this nature, we compare our sequence scores to those obtained from simple but relevant models of protein sequence. We also evaluate scores on randomized sequences.

The sequence score L(S; λ − λbg) can be used to gauge whether or not a given string of amino acid residues or a profile is likely to be a protein sequence, as modeled by the HMM λ. The score is given by the logarithm of the probability of the sequence given λ, minus the corresponding quantity given the very simple ``background'' model λbg. This is equivalent to a likelihood ratio test, which assesses which of the two models is best supported by observed protein sequences. In the background model, the probability of a sequence is a product of independent contributions from each position, where the probability of a residue is given by the frequency of that residue in the training set as a whole. To facilitate comparisons in the context of information theory, we use the base 2 logarithm and average over all positions to obtain L(S; λ − λbg) in bits per position. In terms of information content, L(S; λ − λbg) is the reduction of uncertainty, or information gain, in using λ as a model of sequence rather than λbg. (See Schneider, T., http://www.lecb.ncifcrf.gov/~toms/paper/primer/ for a primer on information theory in biology.)
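Concretely, the score can be accumulated alongside a scaled forward recursion; a sketch (same hypothetical parameter names as the earlier forward sketch, with background a length-20 residue frequency vector):

    import numpy as np

    def sequence_score_bits(obs, pi, a, b, background):
        """L(S; lambda - lambda_bg) in bits per position: log2 P(O | lambda)
        minus log2 P(O | lambda_bg), averaged over positions. Rescaling alpha
        at each position prevents underflow; the sum of log2 scale factors
        recovers log2 P(O | lambda) exactly."""
        log_p_model = 0.0
        alpha = pi * b[:, obs[0]]
        for t in range(len(obs)):
            if t > 0:
                alpha = (alpha @ a) * b[:, obs[t]]
            scale = alpha.sum()
            log_p_model += np.log2(scale)
            alpha = alpha / scale
        log_p_bg = sum(np.log2(background[o]) for o in obs)
        return (log_p_model - log_p_bg) / len(obs)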

HMMSTR significantly improved in general performance when profiles were used for training instead of single sequences. For gene identification, scores on single sequences are relevant. The distribution of the single sequence scores for individual proteins in the training set (0.026 bits/residue overall) and the test set (0.024 bits/residue) is given in Figure 6. There is no clear correlation between the per-residue score and the length of the sequence (data not shown). A preliminary study indicates that training on a properly weighted set of single sequences rather than profiles can give rise to a per-position score of 0.05 or more, but further evaluation is required (see Discussion).

Figure 5. Backbone angle regions. The color scale indicates relative frequency of occurrence in the database. Frequently occurring φ/ψ angles are shown in red and infrequent ones in blue. Not shown is the cis-peptide region (c).

Table 2. Summary of performance measures

Statistic (model)                              Training set    Test set
L(S; λD − λbg), single sequences (bits)        0.026           0.024
L(S; λD − λbg), profiles (bits)                0.236           0.226
Q3 secondary structure (λD) (%)                74.9            74.3
Hairpin vs. diverging turn (λC) (%)            75.1            85.2
End vs. middle strand (λC) (%)                 77.3            78.4
MDA ratio (λR) (%)                             57.1            59.1

Performance is evaluated on models λD, λC, and λR, optimized for prediction of secondary structure, structural context, and dihedral angle regions, respectively. L(S; λD − λbg) is the sequence score per position, in bits, evaluated on single sequences or on profiles. Q3 is the fraction of positions correctly predicted to be helix, strand, or turn. For turns separating two strands, we give the fraction of positions that are correctly predicted in a hairpin or diverging configuration. For strands, we give the fraction of positions correctly predicted as middle strands (e.g. interior of a β-sheet) or end strands (e.g. boundary of a β-sheet). The latter two measures are shown for predictions with greater than 70 % confidence. The MDA ratio is the fraction of residues in segments of length 8 with no greater than 120° maximum deviation between observed and predicted backbone torsion angles.

Table 3 summarizes results for scores on single sequences for one of our models, λD, and a ``dipeptide model'', λdip, evaluated on natural protein sequences and on two randomly generated sequence sets. The dipeptide model, λdip, is a 20-state Markov chain model that encodes observed dipeptide frequencies in its transition probabilities. The sequence sets, Sbg and Sdip, were generated stochastically using the overall amino acid frequencies and the dipeptide frequencies, respectively. Dipeptide content is related to dicodon content, used successfully as a component in gene recognition methods (Fickett & Tung, 1992).

We first note that the score for the model λD on the database, 0.0260, is considerably higher than the corresponding score for the dipeptide model, 0.0072. To evaluate the significance of this score difference, we compared the distributions of these scores over all proteins. Since the scores are paired (one score from λD and one score from the dipeptide model for each protein), we carried out a paired t-test with the null hypothesis being that the mean of the distribution of the differences between the scores of λD and the dipeptide model, taken for each protein, is zero. For the test set, the paired t-test rejects the null hypothesis (P-value P = 2 × 10^−5, t-statistic t = 4.5, 95 % confidence interval CI = (0.009, 0.022)). For the training set, the hypothesis is rejected with P = 2 × 10^−71, t = 17.8, CI = (0.020, 0.025). This suggests that the score L(S; λD − λbg) is likely to be a useful addition to gene identification methods. Secondly, there is a trend toward smaller scores going from right to left in Table 3. For λD, this is consistent with the fact that λD is optimized on Sdb and would be expected to fare less well on sequences lacking correlations found in natural sequences. For scores using λdip, the downward trend and the interesting relations among the magnitudes of the scores may be confirmed independently by relatively simple calculations (see Appendix).
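The test itself is routine; with the two per-protein score arrays in hand (toy values and variable names below are ours), it can be reproduced with SciPy:

    import numpy as np
    from scipy import stats

    # Per-residue bit scores for each protein, paired by protein.
    scores_hmm = np.array([0.031, 0.019, 0.027, 0.022])  # HMM scores (toy data)
    scores_dip = np.array([0.009, 0.006, 0.011, 0.005])  # dipeptide model (toy data)

    t_stat, p_value = stats.ttest_rel(scores_hmm, scores_dip)

    # 95 % confidence interval for the mean paired difference
    diffs = scores_hmm - scores_dip
    ci = stats.t.interval(0.95, len(diffs) - 1,
                          loc=diffs.mean(), scale=stats.sem(diffs))
    print(t_stat, p_value, ci)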

Table 3. Single sequence scores

Model    Sbg        Sdip       Sdb
λdip     −0.0073    0.0073     0.0072
λD       −0.0105    −0.0071    0.0260

L(S; λ − λbg) in bits per position, for selected models and sequence sets. Models: λdip, dipeptide model; λD, HMMSTR. Sequence sets: Sbg, random amino acids with background distribution; Sdip, random with dipeptide distribution; Sdb, training set.

Figure 6. Distribution of single sequence scores. The number of proteins with single sequence scores within intervals of width 0.01 bits per position is shown. White bars, training set; gray bars, test set.

Three-state secondary structure prediction

The Q3 score is the fraction of positions correctly assigned to one of the three states: helix, strand, or turn. Evaluating the prediction of λD on the test set, we obtain Q3 = 74.3 %, an indication that local structure at the level of three-state prediction is accurately reproduced. Our initial models gave Q3 = 55 %, and the score was greatly increased by factors such as training, pruning, and the use of profiles instead of single sequences. Significant improvements were also obtained through the use of a voting scheme, as a means of combining contributions of multiple paths through the HMM, instead of basing predictions on the single most likely (Viterbi) path. The early statistical prediction methods of Chou & Fasman (1978) gave Q3 = 60 %, while recent work on neural net models reports Q3 = 74 % (Rost, 1997) and Q3 = 77 % (Jones, insulin.brunel.ac.uk/psipred/).

As shown in Table 4, there are some imbalances in the secondary structure prediction. For example, the model overpredicts turns, and underpredicts helices and strands. However, as indicated in Figure 7, the length distributions of secondary structures are reproduced fairly well. The true structure of segments incorrectly predicted to be short helical segments of length 1-3 is highly variable. Naive conversion of such segments to a different secondary structure simply tends to introduce new types of errors, such that there is no net gain in Q3 score. However, some improvement in prediction quality might be produced by a post-processing step analogous to the second layer in the PHD secondary structure prediction method, which reassigns the predicted secondary structure at a position based on the predictions at surrounding positions (Rost & Sander, 1994). The complex topology of HMMSTR clearly provides the flexibility to accurately model secondary structure length distributions.

Table 4. Three-state secondary structure prediction

                   Hpred     Spred     Tpred     Total
A. Training set
Hobs               35,943    1794      8983      46,720
Sobs               1484      18,154    10,454    30,092
Tobs               7115      6406      54,306    67,827
Total              44,542    26,354    73,743    144,639

B. Test set
Hobs               4252      276       1160      5688
Sobs               178       1972      1129      3279
Tobs               706       619       5547      6872
Total              5136      2867      7836      15,839

Training and test set positions are categorized by observed and predicted three-state secondary structure for the model λD. The right margin shows the summed observations, and the bottom margin shows the summed predictions for each secondary structure. H, helix; S, strand; T, turn.

Backbone angle prediction

Backbone angle predictions were assessed using the MDA score (Bystroff & Baker, 1998), which is an appropriate measure for low-resolution backbone angle prediction, and is more precise than the Q3 score. The MDA score is the percentage of all residues that are found in correctly predicted eight-residue segments, meaning 8-mers in which no backbone angle deviates by more than 120° from the true value (Bystroff & Baker, 1998). Pairs of 8-mers with a maximum backbone angle deviation of 120° or less tend to have backbone rmsds (root-mean-square deviations of superimposed backbone atom positions) of less than 1.4 Å.
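A sketch of the MDA computation as just defined (our code, not the authors'; angle wraparound is handled explicitly):

    import numpy as np

    def angle_deviation(pred, true):
        """Absolute angular difference in degrees, mapped into [0, 180]."""
        d = np.abs(pred - true) % 360.0
        return np.minimum(d, 360.0 - d)

    def mda_score(pred, true, window=8, cutoff=120.0):
        """Fraction of residues covered by at least one 8-residue segment in
        which every backbone angle deviation is <= cutoff degrees.
        pred, true: (T, 2) arrays of (phi, psi) in degrees."""
        dev = angle_deviation(pred, true).max(axis=1)  # worst angle per residue
        covered = np.zeros(len(dev), dtype=bool)
        for s in range(len(dev) - window + 1):
            if dev[s:s + window].max() <= cutoff:
                covered[s:s + window] = True
        return covered.mean()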

Backbone angles were predicted using a voting procedure to combine contributions from multiple Markov states. A hierarchical voting procedure gave an overall MDA score of 58.8 % on training data, and outperformed a non-hierarchical one, which gave 56.7 %. The hierarchical voting method overpredicts the most heavily populated angle regions, H, E and B (Figure 5), and underpredicts most of the sparsely populated regions, especially regions b, d, and e (Table 5), although the distribution is even more skewed using the non-hierarchical procedure. States representing b, d, and e often represent them weakly, having diffuse backbone angle emission probabilities (r), while most states representing regions H, E and B have sharply peaked r-vectors, sometimes to the exclusion of all but one region. It may be that the rarer backbone conformations, corresponding to the sparsely populated r regions, are determined more by non-local strains and stresses than by strong local sequence signals, in which case they would not be accurately modeled by the HMM.


Figure 7. Length distributions for three-state secondary structures. For a given three-state secondary structure, the ordinate represents the number of occurrences of contiguous regions of that state, of the length given on the abscissa. Solid line, prediction by the model λD; broken line, true distribution. Uppermost panel, helix; middle panel, strand; lowest panel, turn.



The prediction accuracy of the I-sites library was MDA = 48 %, and 54 % when combined with PHD secondary structure predictions (Rost, 1997). The current results, MDA = 59 %, represent a considerable improvement in accuracy for the prediction of backbone angles. A paired t-test establishes the significance of this difference at P = 10^−4 (t-statistic t = 4.1, 95 % confidence interval CI = (0.04, 0.11)) for the test set, and P = 2 × 10^−44 (t = 14.1, CI = (0.06, 0.08)) for the training set.

Context prediction

We examined the ability of the HMM to predict local structural context, in addition to secondary structure and backbone angles. Short turns (<8 residues) between secondary structure elements were classified according to the types of secondary structure that bracket them: helix-turn-helix, helix-turn-strand, strand-turn-strand, etc. Turns between two strands were designated ``hairpins'' (h) if the strands were paired, and ``diverging turns'' (d) if they were not. Table 6 shows results of predictions from model λC. Short turns that were correctly predicted to be between strands were binned according to the confidence of the prediction that the turn is diverging, rather than hairpin, Pd/(Pd + Ph). Results for the training set and test set are quite comparable. Note that hairpin and diverging turn predictions of high confidence are generally correct.

The context of a strand is defined as ``middle'' (m) if there are hydrogen-bonded β-strands on both sides of it, and otherwise as ``end'' (n) (we use the DSSP definition of β-strand residues, in which a residue is part of a strand if it participates in strand-strand hydrogen bonds). The context is defined residue by residue; thus β-strands may be partly ``middle'' and partly ``end''. In Table 7, positions that were correctly predicted to be strand positions using λC were binned according to the value of Pn/(Pn + Pm), the confidence that the position is found in an end strand. Evaluation on both training and test sets shows the desired behavior for high-confidence predictions.

Table 2 summarizes results for confidence above 70 %. The high scores indicate that there is in some cases a strong and reliable sequence signal for structural context. We are unaware of any precedent for this type of prediction, which may be considered super-secondary structure prediction, and therefore cannot compare our results with existing methods. We are presently investigating the utility of these context predictions for tertiary structure prediction algorithms.

Discussion

The HMM developed in this paper captures simultaneously the recurrent local features of protein sequences and protein structures in a single compact form. In contrast to the more familiar family-specific HMMs (Sonnhammer et al., 1997; Eddy, 1998), which have shown considerable power in remote homolog detection, HMMSTR models features common to protein sequences and structures generally. The models λD, λC, and λR, and C++ source code for model training and evaluation, are available at isites.bio.rpi.edu/hmmstr/.

Comparison to the I-sites library model

HMMSTR extends and generalizes the I-sites library of sequence-structure motifs. First, the non-random transitions between different sequence motifs are described. For example, a helix followed by a certain type of helix cap tends to transition into a strand further downstream, whereas a helix followed by a different type of cap preferentially transitions into another helix. These regularities in the order of occurrence of motifs are not treated in the original I-sites model, but are compactly represented by transitions between motifs in the HMM. In this sense, the model captures the grammatical structure of protein sequence. Second, overlapping motifs are condensed, resulting in a more complete description with fewer parameters. The I-sites library, from which the HMM was built, may be viewed as a mixture model (Bailey & Elkan, 1994) where the I-sites motifs are the mixture components. Many of the mixture components contain overlapping portions of the same sub-segment pattern, both in sequence and in structure (Figure 3); the overlapping regions are represented by a single set of states in the HMM. The added information and condensed representation are likely to account for the considerably improved local structure prediction capabilities of the HMM compared to the original I-sites model, which predicted secondary structure with only 64 % accuracy and local structure (MDA) with 48 % accuracy.


Table 5. Backbone angle region prediction

        Hpred    Gpred   Bpred   Epred   dpred   bpred   epred   Lpred   lpred   xpred   cpred   Total
Hobs    6909     42      593     589     2       0       13      0       15      37      0       8200
Gobs    887      55      266     198     0       0       3       0       10      6       0       1425
Bobs    904      19      1429    747     1       2       19      0       6       31      0       3158
Eobs    677      8       598     1737    1       0       13      1       6       10      0       3051
dobs    107      3       129     127     10      0       0       0       1       2       0       379
bobs    174      4       165     174     0       1       1       0       1       0       0       520
eobs    188      5       162     201     0       0       35      0       16      34      0       641
Lobs    214      14      83      52      0       0       12      0       21      19      0       415
lobs    140      2       35      29      0       0       33      2       107     74      0       422
xobs    62       3       38      16      0       0       17      0       18      45      0       199
cobs    6        0       33      5       0       0       1       0       1       1       0       47
Total   10,269   154     3531    3875    14      3       147     3       202     259     0       18,640

Position-wise predicted backbone angle regions are tabulated according to their true values. Letters in the margins refer to the regions of φ/ψ space mapped out in Figure 5. c denotes residues N-terminal to a cis-peptide bond. The right margin shows the summed observations, and the bottom margin shows the summed predictions in each region.



Table 6. Hairpin versus diverging turn

Pd/(Ph + Pd)    Hairpin    Diverging
A. Training set
0.0-0.1         74         36
0.1-0.2         235        68
0.2-0.3         318        108
0.3-0.4         232        178
0.4-0.5         252        136
0.5-0.6         278        175
0.6-0.7         116        195
0.7-0.8         101        138
0.8-0.9         49         117
0.9-1.0         18         262
Total           1673       1413

B. Test set
0.0-0.1         2          0
0.1-0.2         16         4
0.2-0.3         15         4
0.3-0.4         18         18
0.4-0.5         39         50
0.5-0.6         15         30
0.6-0.7         17         11
0.7-0.8         5          29
0.8-0.9         4          12
0.9-1.0         0          24
Total           131        182

Each position in turn regions between strands is categorized by the confidence that the turn is a diverging turn rather than a hairpin turn, Pd/(Pd + Ph) (column 1), and by its true context (columns 2 and 3).


One drawback of the model, as with the library model, is the assumption of positional independence. When multiply aligned sequences are condensed to form a profile, the information contained in pairwise statistical interactions (sequence covariance) is lost. One place where covariance is significant is in the region of the helix N-capping box (Doig et al., 1997); another is in polar residues in helices (Lacroix et al., 1998). Covariance can result from conserved salt bridges or from packing interactions. The problem is partially overcome by allowing multiple paths for each motif, each path containing a different covariant pair. To a certain extent this has occurred in the case of the helix N-cap, for which there are several independent paths in the HMM (Figure 4). Use of the full multiple sequence alignment rather than profiles in the training process could improve the retention of this type of covariance between nearby positions.

What is the origin of the non-random connectivities between the different I-sites motifs in HMMSTR? The presence of a motif pattern has a selective effect on the allowable sequence-structure patterns upstream and downstream. This is one manner in which structure may propagate through the chain during the very early stages of folding. Alternatively, the connectivities between motifs may arise from global structural constraints on protein structures and recurring super-secondary structure motifs, for example the β-α-β motif of many α/β proteins.


Table 7. Middle versus end strand

Pn/(Pm + Pn)    Middle     End
A. Training set
0.0-0.1         596        92
0.1-0.2         2346       533
0.2-0.3         2545       873
0.3-0.4         2103       1104
0.4-0.5         1285       988
0.5-0.6         724        735
0.6-0.7         418        572
0.7-0.8         131        333
0.8-0.9         94         156
0.9-1.0         38         25
Total           10,280     5411

B. Test set
0.0-0.1         79         5
0.1-0.2         266        69
0.2-0.3         236        93
0.3-0.4         194        128
0.4-0.5         171        117
0.5-0.6         80         68
0.6-0.7         52         62
0.7-0.8         9          24
0.8-0.9         6          19
0.9-1.0         1          1
Total           1094       586

Each position in a correctly predicted strand is categorized by the confidence that the strand is an end strand rather than a middle strand, Pn/(Pn + Pm) (column 1), and by its true context (columns 2 and 3).


As with the I-sites model, low-complexity regions and particular types of proteins (such as membrane-associated proteins; see Methods) were excluded. Our HMM is therefore less suitable for gene or structure prediction for such proteins. Treatment of such sequences would necessitate the addition to the HMM of modules for membrane-spanning segments and low-complexity/disordered regions.

Applications

The local structure predictions made by the HMM may be useful for both ab initio structure prediction and homology modeling. In the case of ab initio structure prediction, discrimination between internal and external strands and between hairpin and diverging turns should facilitate identification of native-like topologies of β-sheet proteins, which tend to be particularly challenging for ab initio folding methods. In the case of homology modeling, the challenging loop-building problem may be facilitated by the backbone angle predictions produced by the HMM, which can serve as prior distributions to guide loop closing/building procedures.

The ``inverse'' folding problem is the design of protein sequences that have a desired structure. In our model, the dual manner in which sequence and structural information are incorporated enables the prediction of sequence from structure to be performed analogously to the prediction of structure from sequence. A sequence profile consistent with a given structure can be obtained from the HMM simply by threading the desired secondary structure, backbone angle, and structural context strings simultaneously through the HMM and recording the amino acid residues emitted from the visited states. Such sequence profiles could be useful as prior distributions for sequence searching methods based on explicit side-chain modeling (Dahiyat et al., 1996), particularly in turns and on the surface, where steric repulsion plays a less important role. Such sequence predictions could potentially be made more effective by incorporating additional structural attributes, such as the number of residues within a certain distance from the position of interest.
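In outline, the threading step is the forward-backward computation of the prediction sketches above with the roles of the emission tables exchanged. The sketch below simplifies by threading a secondary structure string alone (the text threads D, R and C simultaneously); names are ours.

    import numpy as np

    def design_profile(struct, pi, a, d, b):
        """Thread a three-state secondary structure string (indices into H/S/T)
        through the model and return a (T, 20) position-specific amino acid
        profile: sum over states i of gamma[t, i] * b_i(m)."""
        T, N = len(struct), len(pi)
        alpha, beta = np.zeros((T, N)), np.zeros((T, N))
        alpha[0] = pi * d[:, struct[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ a) * d[:, struct[t]]
        beta[-1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = a @ (d[:, struct[t + 1]] * beta[t + 1])
        gamma = alpha * beta
        gamma /= gamma.sum(axis=1, keepdims=True)
        return gamma @ b  # preferred residues at each position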

Sequence comparison methods are among the most widely used computational methods in biology and can provide considerable insight into both the structure and function of a novel sequence. Scoring any two positions in sequence alignments of proteins typically involves substitution matrices, such as those of the BLOSUM series (Henikoff & Henikoff, 1992). Most scoring matrices in current use are position independent. Context-dependent substitution tables can be computed for each of the states in the HMM, and subsequently computed for any query sequence from the paths it takes through the HMM, in a manner analogous to that described here for secondary structure prediction, etc. These context-dependent substitution tables could then be used to increase the sensitivity and specificity of standard sequence searching methods, perhaps achieving some of the improved performance of methods such as PSI-BLAST (Altschul et al., 1997), which use multiple sequence information, but using only a single sequence.

Methods

Identification of weak sequence-structure correlations

A procedure was described previously for identifying increasingly weak sequence-structure correlations by ``peak removal'' and re-clustering (Bystroff & Baker, 1998). In short, the I-sites motifs were used to mask locations in the database that had strong correlations with structure, and we then clustered the remaining sequence segments. Segments were clustered at each length from 3 to 19 residues, and cluster correlations were refined by supervised learning, as before. Confidence curves were calculated for the new clusters, mapping sequence score to the probability of being within approximately 1.4 Å rmsd of the cluster's paradigm (most common) three-dimensional structure. Weak motifs are classified as those sequence clusters whose highest-scoring 30 members were at least 40 % but less than 70 % correctly predicted. Previously, the I-sites motifs had been defined as clusters where at least 70 % of the 30 highest-scoring members were correctly predicted. By masking out all sequence segments that would be more confidently predicted by one of the strong I-sites motifs, we obtained only new sequence patterns. The entire set of I-sites clusters can be viewed at http://isites.bio.rpi.edu/Isites/by_motif.html.



Formulation of the hidden Markov model

Here, we describe the parameters of the model and the general equations for training and prediction. We refer the reader to the tutorial by Rabiner (1989) for a full treatment of the methods (expectation-maximization, the forward-backward algorithm, and the Viterbi algorithm), which is not included here.

A hidden Markov model is a network of Markov states connected by directed transitions. Each state emits symbols representing sequence and structure. More specifically, the components of the model, collectively denoted by the symbol λ, are as follows. The HMM consists of a network of N states denoted qi (1 ≤ i ≤ N). The probability of initiating a sequence at state j is given by πj. The probability of a transition from state i to state j is given by aij. For a given state i, there is a set of emission probabilities collectively called Bi. Here, we use four in this collection, denoted b, d, r, and c (see Figure 2). The values bi(m) (1 ≤ m ≤ 20) are the probabilities for the emission of amino acid residues. The values di(m) (1 ≤ m ≤ 3) are the probabilities of emitting helix (H), strand (S) or loop (T), respectively. (In Bayesian notation: di(m) = P(m ∈ {H,S,T} | qi).) The values ri(m) (1 ≤ m ≤ 11) are the probabilities of emitting one of the 11 dihedral angle symbols (Figure 5). Finally, ci(m) (1 ≤ m ≤ 10) is the probability of emitting one of ten structural context symbols (described below).

The database is encoded as a linear sequence of amino acid and structural observables. The amino acid sequence data consist of a ``parent'' amino acid sequence of known three-dimensional structure, and an amino acid profile obtained by alignments to the parent sequence (Bystroff & Baker, 1998). The amino acid of the parent sequence is denoted by Ot, and the profile by {Ot^m} (1 ≤ m ≤ 20). For the structural identifiers at each position t, the following nomenclature is used: three-state secondary structure, Dt; discrete backbone angle region, Rt; and the context symbol, Ct. A sequence s of length T is given by the values of the attributes at all positions, st = {Ot, {Ot^m}, Dt, Rt, Ct} (1 ≤ t ≤ T).

The utility of the HMM for modeling protein sequences is based on the notion of a path. A path is a sequence of states through the HMM, denoted Q = q1 q2 ... qT. Thus, the probability of a sequence s given the model λ, P(s|λ), is obtained by summing the relevant contributions from all possible paths Q:

    P(s|λ) = Σ_allQ πq1 Bq1(s1) aq1q2 Bq2(s2) ... aqT−1qT BqT(sT)    (1)

Here, Bi(st) is the probability of observing st at state i, which for observation of a single sequence is given by:

    Bi(st) = [di(Dt) ri(Rt) ci(Ct)] bi(Ot)    (2)

Usually, only one of the structural emission symbols d, r, or c is included in Bi in any given training run; however, in principle, any combination could be used. Our HMMs showed significant improvements in performance when we used amino acid profiles instead of single amino acid sequences for training and for subsequent predictions. For the probability of observing a given profile {Ot^m} at position t in a sequence, we use the multinomial distribution, and the expression for B becomes:

    Bi(st) = [di(Dt) ri(Rt) ci(Ct)] Π_{m=1..20} [bi(m)]^(Ncount · Ot^m)    (3)

To give equal weight to the information in sequence families of different depths, Ncount was taken to be a global parameter. To each term in Bi we added a small pseudocount, ε (not shown), to prevent machine errors and to allow state paths Q to be introduced in the training process which are otherwise excluded when terms in Bi are zero.
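For concreteness, the per-state, per-position emission term of equation (3) might be evaluated as below. This is our sketch, reusing the MarkovState fields from the earlier sketch; the default Ncount value is arbitrary, and where exactly the pseudocount enters is one reasonable reading of the text, not a statement of the authors' code.

    import numpy as np

    def emission_term(state, profile, Dt=None, Rt=None, Ct=None,
                      ncount=5.0, eps=1e-6):
        """B_i(s_t) per equation (3). state: emission vectors b/d/r/c;
        profile: (20,) amino acid frequencies at position t; Dt, Rt, Ct:
        optional structural symbol indices (usually only one is used per
        training run); ncount: global profile weight; eps: pseudocount."""
        B = np.prod((state.b + eps) ** (ncount * profile))  # multinomial term
        if Dt is not None:
            B *= state.d[Dt] + eps
        if Rt is not None:
            B *= state.r[Rt] + eps
        if Ct is not None:
            B *= state.c[Ct] + eps
        return B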

Protein database

For training, evaluation and testing of the HMM we used a non-redundant database of proteins of known structure, PDBselect (December 1998; Hobohm & Sander, 1994), containing 691 proteins and their sequence families. Entries in the database were selectively removed if the structure was solved by NMR, had a large number of disulfide bridges or cis-peptide bonds, or if it was a membrane-associated protein according to the header records. Disordered or missing coordinates in the middle of a sequence were addressed by dividing the sequence at that point. Contiguous segments of length less than 20 were ignored. Multiple sequence alignments were generated from each sequence using PSI-BLAST (Altschul et al., 1997) after filtering the query sequences for low-complexity regions using SEG (Wootton & Federhen, 1996). Data for training the HMM included the sequence profile, computed from the multiple sequence alignment as described (Bystroff & Baker, 1998), the DSSP secondary structure assignments (Kabsch & Sander, 1983), the backbone angles, and a structural ``context'' symbol.

All trans φ/ψ pairs from the protein database were partitioned using k-means clustering (k = 10). Boundaries enclosing the cluster centroids were found using a Voronoi method, yielding ten regions in φ/ψ space (Figure 5). All cis-peptides form an 11th region.
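A sketch of such a partitioning. The text does not say how angle periodicity was treated, so this version (an assumption of ours) clusters on the (sin, cos) embedding of each angle with scikit-learn's k-means; assigning a new point to its nearest centroid then reproduces a Voronoi partition in the embedded space.

    import numpy as np
    from sklearn.cluster import KMeans

    def cluster_dihedrals(phi_psi_deg, k=10, seed=0):
        """Partition (phi, psi) pairs into k regions.
        phi_psi_deg: (M, 2) array of trans-residue angles in degrees."""
        rad = np.deg2rad(phi_psi_deg)
        X = np.concatenate([np.sin(rad), np.cos(rad)], axis=1)  # (M, 4); wraps 360
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        return km.labels_, km.cluster_centers_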

A randomly selected set of 73 of the 691 proteins (19,000 positions) was then set aside and not used for training, but only for the final cross-validation. Before cross-validation, a test for true independence was applied to each member of the test set as described below, and 12 members were removed. The final test set thus contained 61 proteins and 16,000 positions.

The remaining set of 618 parent sequences (145,000 positions) was used for training, and divided into a large set of 564 sequences (133,000 positions), used for optimization via the EM algorithm, and a small set of 54 sequences (12,000 positions), used to evaluate the predictive ability of the model during training. Note that the small set of 54 sequences is used only for evaluation of the performance of a model and may thus appear to be a test set. However, decisions regarding the modification of the model are based on the results of those evaluations. The set of 54 sequences is therefore not a test set, but a training set. For the final round of training we recombined the large and small training sets, for a total of 618 sequence families. After the final round of training, the models were frozen.

The final evaluations (the only ones reported here) were done on the independent test set of 61 proteins. This set of proteins was not used in any way prior to the final evaluations (in particular, the set was not used to evaluate any earlier models).


Cross-validation

Here, we are looking for local sequence-structure correlations that transcend protein sequence families. We must be careful that the information contained in the model is not biased toward any one molecular ancestry and that the test set is completely unrelated to the training set. While the PDBselect list used in this work already has most homologous pairs of proteins removed, for this study it was important to take further precautions.

To eliminate redundancy in the multiple sequence alignments used in the training process, we used PSI-BLAST (Altschul et al., 1997) to detect remote homologs. First, a file was created of pointers from each sequence in the "nr" database to each homologous protein on the PDBselect list. If a sequence was found in a PSI-BLAST search using one of the PDBselect parent sequences, then that parent, with annotations specifying the overlap region and the percent identity, was recorded in the file. In a second pass through the PDBselect list, an "nr" sequence was omitted from the multiple alignment if among its pointers there was another parent with a higher percent identity in an overlapping region. This guaranteed that no sequence was used more than once in the resulting database.
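A simplified sketch of this filtering pass (our illustration; the pointer-file format and the overlap-region bookkeeping are abstracted away):

    def filter_alignments(pointers):
        """Keep each 'nr' sequence only in the alignment of its best-matching
        parent. `pointers` maps an nr sequence id to a list of
        (parent_id, percent_identity) records for overlapping regions; this
        is a stand-in for the pointer file described in the text."""
        keep = {}
        for nr_id, hits in pointers.items():
            best_parent, _ = max(hits, key=lambda h: h[1])
            keep[nr_id] = best_parent  # drop nr_id from all other parents' alignments
        return keep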

To remove sequences from the test set that could be related to proteins in the training set, a list was made of sets of PDBselect sequences that mutually matched any target sequence in "nr" in the same overlapping region of that target with a BLAST E-value of less than 0.1. Parent sequences that both matched the same target were considered remote homologs, even if they did not recognize each other directly. Test set members that were found to be remote homologs of training set members were removed from the test set. For example, 1alvA (calcium binding domain VI) and 1wdcB (scallop myosin) both hit the target sequence caltractin (GI 729051) with E-values of less than 0.1 (and in fact the two parent proteins share a superimposable calcium binding domain). 1wdcB is a member of the training set; therefore, 1alvA was removed from the test set.
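The grouping step can be sketched as follows (our illustration; `blast_hits` and the omission of the same-region check are simplifications):

    from collections import defaultdict

    def remote_homolog_pairs(blast_hits, e_cut=0.1):
        """Flag two parent sequences as remote homologs when both hit the same
        'nr' target with an E-value below e_cut. `blast_hits` is a list of
        (parent_id, target_id, e_value) records; the check that both matches
        cover the same region of the target is omitted here."""
        by_target = defaultdict(set)
        for parent, target, e in blast_hits:
            if e < e_cut:
                by_target[target].add(parent)
        pairs = set()
        for parents in by_target.values():
            for a in parents:
                for b in parents:
                    if a < b:
                        pairs.add((a, b))
        return pairs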

Can training set bias originate from the I-sites library?

One might be concerned that care was not taken to exclude the test set proteins during the discovery of the I-sites motifs and the subsequent initialization of the HMM topology based on these motifs. Even after training the parameters and topology of the HMM on the training set, there is a remote chance that one or more members of the test set could be favored (i.e. given artificially high scores) purely as an artifact of this initialization. The existence of such a bias would require the "survival" of long, non-local, sequence-structure correlations, or paths, specific to "test-set" protein families (a sort of profile-HMM (Eddy, 1998) within our HMM) through the processes of initialization and training of the HMM. Below, we argue that such paths are unlikely to exist, even at HMM initialization. Subsequent training cannot enhance any such bias, since the training and test sets are truly independent. Ultimately, only bona fide blind predictions for new structures solved since the publication of this work can remove residual doubts concerning this potential source of bias.

Method 1 of initializing the HMM (based on co-occurrence, see below) measured the adjacencies of I-sites motifs in the training dataset. However, method 1 produced a model with no long pathways. The length of a Markov state pathway over which there is significant "memory" of past states can be computed as the product of N adjacent transition probabilities p. The longest sequence of states with p > 0.1 was N = 12 in λD and λC. Method 2 of initializing the HMM (based on motif alignment, see below) did not use the I-sites training set, but instead used only local profile similarity between I-sites motifs. Incorporation of a non-local bias toward specific proteins used to test the I-sites library is therefore not likely for λR.
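The "memory length" computation can be sketched as follows (our illustration; it assumes the thresholded transition graph is acyclic, as it is for chains of linear motif fragments):

    from functools import lru_cache

    def longest_memory_path(trans, p_min=0.1):
        """Number of states in the longest chain whose transitions all exceed
        p_min, a rough 'memory length' of the topology. `trans` maps state ->
        {next_state: probability}. Assumes the thresholded graph is acyclic."""
        @lru_cache(maxsize=None)
        def depth(state):
            nxt = [s for s, p in trans.get(state, {}).items() if p > p_min]
            return 1 + max((depth(s) for s in nxt), default=0)
        return max(depth(s) for s in trans)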

Initialization of HMM state profiles and connectivity

An I-sites motif may be represented as a simple HMM in which each position defines a state and each state, except the last, has a single transition to the next state. To generate the initial HMM, we treat the I-sites motifs (262 in number, including strong and weak motifs) as individual linear HMMs, then anneal them together using a measure of similarity. In this case, "annealing" refers to the merging of two or more single-residue positions to make one, retaining all of the connectivity. The resulting HMM has a single state for each set of merged single-residue positions, and a forward transition for each position merged, unless that position was the last one in the motif (Figure 3). Two measures of similarity were used to define which states to merge.

Initialization based on co-occurrence of I-sites motifs

The first measure of state similarity was derived by scoring a large sequence database for matches to each of the I-sites patterns, and then assessing the correlations in the scores (over all positions in the database) of each pair of positions in the I-sites library. S, a similarity matrix with indices corresponding to individual positions in the 262 I-sites motifs, was initialized to

S_{ij} = \frac{\sum_t cf(t, i) \, cf(t, j)}{\min\!\left( \sum_t cf(t, i), \ \sum_t cf(t, j) \right)}    (4)

where cf(t, i) is the confidence associated with the prediction of I-sites motif i at position t in the database. If the predicted structure was wrong according to the criteria described (Bystroff & Baker, 1998), cf was set to zero.
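With the confidences arranged as a matrix, equation (4) can be computed in one pass (our sketch; array names are illustrative):

    import numpy as np

    def similarity_matrix(cf):
        """Equation (4). `cf` is an (n_db_positions, n_motif_positions) array
        of prediction confidences cf(t, i), zeroed wherever the predicted
        structure was wrong."""
        num = cf.T @ cf                    # sum_t cf(t, i) * cf(t, j)
        totals = cf.sum(axis=0)            # sum_t cf(t, i)
        denom = np.minimum.outer(totals, totals)
        return np.divide(num, denom, out=np.zeros_like(num), where=denom > 0)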

A greedy algorithm was employed to choose which motif positions to merge: the highest value of S_ij was identified, and the "states" i and j were merged. The matrix elements in columns i and j were reset to the value min(S_ik, S_jk) for all rows k, and S_ij was set to zero. The process was repeated until all elements of S were equal to zero. By this process, the 2169 initial states in the I-sites library were condensed to 208 states. To initialize the forward transition probabilities, each position in the database was assigned a state according to the motif at which it scored the highest. The probability of a transition from state i to state j was initialized to the frequency of ij pairs in the database divided by the frequency of state i. The total number of states in the starting model, generated by this method, was 209 (including the unknown state) and the total number of non-zero forward transitions was 7329. This initial model led to models λD and λC.
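A sketch of the greedy merging loop (ours; the bookkeeping that actually fuses HMM states is left out):

    import numpy as np

    def greedy_merge(S):
        """Greedy merging driven by the similarity matrix of equation (4):
        repeatedly merge the most similar pair, reset their rows and columns
        to the elementwise minimum, and zero the merged entry, until S is all
        zero. Returns the list of merge operations."""
        S = S.copy()
        np.fill_diagonal(S, 0.0)
        merges = []
        while S.max() > 0:
            i, j = np.unravel_index(np.argmax(S), S.shape)
            merges.append((i, j))
            low = np.minimum(S[i], S[j])
            S[i] = low; S[j] = low
            S[:, i] = low; S[:, j] = low
            S[i, j] = S[j, i] = 0.0
            np.fill_diagonal(S, 0.0)
        return merges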


Initialization based on pairwise alignment of I-sites motifs

The second approach to initializing the HMM topology employed pairwise all-against-all dynamic programming alignment of the 262 I-sites motifs using the correlation between the amino acid frequency profiles as the similarity measure. To force aligned positions to have the same backbone conformation, a large negative value was added to the pairwise score if there was a backbone angle deviation of more than 90° between aligned positions. In addition, the average distance matrix error (DME) over the aligned positions was computed during the traceback step, and the part of the DME exceeding 0.3 Å was scaled and subtracted from the score. Each pair of aligned positions was merged as described above, starting with the highest-scoring alignment and proceeding down the sorted list. However, a motif-motif alignment was ignored if: (i) the merging of the current pair of motifs would create a self-loop in the HMM; (ii) the merging of the current pair of motifs would create a cyclic graph and the alignment score was below a cut-off; or (iii) any ungapped subsegment of the alignment did not contain at least three contiguous positions.
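A rough sketch of such an alignment (ours; it uses a single torsion angle per position and omits the DME term, both simplifications of the procedure described above; the gap and angle penalties are illustrative):

    import numpy as np

    def align_motifs(prof_a, prof_b, ang_a, ang_b, gap=-1.0, ang_pen=-10.0):
        """Needleman-Wunsch-style global alignment of two motif profiles.
        Position similarity is the correlation of the 20-component frequency
        columns; a large negative value (ang_pen) is added when backbone
        angles deviate by more than 90 degrees."""
        def sim(i, j):
            s = np.corrcoef(prof_a[i], prof_b[j])[0, 1]
            d = abs(ang_a[i] - ang_b[j])
            if min(d, 360.0 - d) > 90.0:
                s += ang_pen
            return s
        n, m = len(prof_a), len(prof_b)
        F = np.zeros((n + 1, m + 1))
        F[:, 0] = np.arange(n + 1) * gap
        F[0, :] = np.arange(m + 1) * gap
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                F[i, j] = max(F[i - 1, j - 1] + sim(i - 1, j - 1),
                              F[i - 1, j] + gap,
                              F[i, j - 1] + gap)
        return F[n, m]   # traceback (to recover the merged pairs) omitted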

Some subsets of the I-sites motifs did not merge with other subsets under these constraints, resulting in a fragmented HMM. Disjoint graphs within the HMM were connected using a non-emitting state (see below). The three rules above improved the prediction of local and secondary structure over models made without them.

The forward transition probabilities a_ij were set to one if states i and j were adjacent in at least one I-sites motif, and zero otherwise. The a_ij matrix was then normalized by rows. This relatively crude initialization was acceptable because of the robustness of the expectation-maximization (EM) algorithm. EM optimization of the model initialized in this manner converged in about five cycles when the a_ij values were the only parameters allowed to vary. The total number of states in the starting model, initialized by this method, was 281, and the total number of non-zero transitions was 371. Of these transition probabilities, 254 were exactly one and not variable, leaving 117 variable transitions. This initial model led to model λR.
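This initialization amounts to the following (our sketch; names are illustrative):

    import numpy as np

    def init_transitions(adjacent, n_states):
        """Set a_ij = 1 wherever states i and j are adjacent in at least one
        I-sites motif, then normalize each row to sum to one. `adjacent` is
        an iterable of (i, j) pairs."""
        A = np.zeros((n_states, n_states))
        for i, j in adjacent:
            A[i, j] = 1.0
        rows = A.sum(axis=1, keepdims=True)
        return np.divide(A, rows, out=np.zeros_like(A), where=rows > 0)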

An example of how the two methods differ is the gapped alignment of the two β-hairpins shown in Figure 3. The two hairpins never co-occur in the database, since they have different turn lengths; therefore, they would not be merged by the first method. However, their sequences and structures can be aligned with a one-residue gap, so the second method merges them.

The non-emitting state ("nought" state)

A single, non-emitting (nought) state was used to connect all "sink" states (states with no forward transitions) with all "source" states (states with no backward transitions). All transitions to and from the nought state were initially set equal and normalized to one, and were optimized along with the other a_ij values. The formalism for the use of a non-emitting state was found to be a simple extension of that for a standard HMM. By replacing a_ij with (a_ij + a_i0 a_0j) in equation (3), where "0" represents the nought state, we obtain the expression for the probability of a sequence given two alternative paths between each pair of states. In practice, the second term in the parentheses is non-zero only if the first term is zero. The new equation satisfies the mathematical requirement that the sum of P(s|λ) over all sequences s is equal to one.

New update rules for each of the model parameterswere derived from the new equation.

The alternative model for connecting sinks and sources would be to provide an explicit transition from each sink to each source. In the initial model there were 53 sinks and 47 sources, so this would require 53 × 47 = 2491 independent variable parameters. The use of a centralized, non-emitting connector requires only 53 + 47 = 100 variable parameters, with some loss of generality.
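The effective transition used in place of a_ij can be sketched as follows (our illustration):

    def effective_transition(A, i, j, nought=0):
        """Effective i -> j transition with a single non-emitting 'nought'
        state: a_ij + a_i0 * a_0j, where (as noted above) the second term
        matters only when the direct transition is zero. A is the transition
        matrix including the nought state at index `nought`."""
        direct = A[i, j]
        return direct if direct > 0 else A[i, nought] * A[nought, j]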

Expectation maximization

The parameters of the model were iteratively optimized to best reflect correlated sequence-structure patterns of the database. This involved maximizing the probability of observing the training set given the model, equation (1). Expectation maximization (Lawrence & Reilly, 1990) was performed with a generalization of the algorithm described by Rabiner (1989). The expectation step involves computing γt(i), the probability of being in state i at time t, and ξt(i, j), the probability of being in state i at time t and state j at time t + 1. The maximization step involves re-evaluating the parameters of λ based on γt(i) and ξt(i, j). For any given training run, we included in Bi the bi term and only one of the symbols di, ri, or ci. Choosing two or more of these symbols typically resulted in a loss of performance. This is to be expected, since there is a high degree of interdependence among these sets of variables. Whether or not they were used in the expectation step, all variables in λ were re-evaluated (updated) in the maximization step.
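A minimal, unscaled forward-backward E-step for one sequence might look like this (our sketch; a uniform start distribution is assumed, and real implementations rescale or work in log space to avoid underflow):

    import numpy as np

    def e_step(A, emit):
        """Unscaled forward-backward E-step for one sequence. A is (N, N);
        emit[t, i] is the emission probability B_i(s_t). Returns
        gamma[t, i] = P(q_t = i | s) and
        xi[t, i, j] = P(q_t = i, q_{t+1} = j | s)."""
        T, N = emit.shape
        alpha = np.zeros((T, N))
        beta = np.zeros((T, N))
        alpha[0] = emit[0] / N
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * emit[t]
        beta[-1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (emit[t + 1] * beta[t + 1])
        p_seq = alpha[-1].sum()            # P(s | lambda)
        gamma = alpha * beta / p_seq
        xi = (alpha[:-1, :, None] * A[None, :, :]
              * (emit[1:] * beta[1:])[:, None, :]) / p_seq
        return gamma, xi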

Modifications of HMM topology

Here, we describe a number of auxiliary functions that were developed to allow flexible exploration of different topologies. We also briefly describe the role of these functions in creating the models λD, λC and λR, which were optimized for the prediction of three-state secondary structure, context, and backbone angles, respectively. Expectation maximization can modify the topology of an HMM in the sense that transitions that are not consistent with the training set may vanish, but the topology is otherwise limited to that of the initial HMM. Three auxiliary functions were created to reduce the complexity of models: one to remove weak transitions (both forward and backward), one to remove seldom-visited states, and one to merge states. We also developed three routines that add complexity to the model: one that redistributes a portion of the transition probability associated with selected strong transitions among the transitions a_ij that are currently frozen at zero, a function that splits states into two (nearly identical) copies, and a function to link two states by a new transition.

In general, changes in topology were followed by running EM to convergence, followed by evaluation of sequence, Q3 and MDA scores on the small set of 54 sequences. This smaller set was not used during EM optimization, but only as a tool for accepting or rejecting modifications of the model. If the scores on the smaller set improved, we kept the changes; otherwise, we scaled back or undid the modification.

The common ancestor of models λD and λC was a model based on motif co-occurrences (see above). The initial graph was highly complex, with 209 states and 7329 transitions, whereas the final model λD has 107 states and 263 variable transitions. This model underwent an initial smearing, followed by alternate cycles of removing weak transitions and states. The resulting model underwent final rounds of training using emission vectors d and c in equation (3), separately, creating models λD and λC, respectively.

Model λR originated from a model initialized based on pairwise motif alignments (see above). This initial graph was less complex, having 281 states and only 117 variable transitions. In the final model λR there are 247 states and 149 variable transitions. The model underwent automatic removal of weak transitions and states. Subsequently, new transitions were added to mend the fragmented nature of the initial model, and states with high priors (frequently visited states) or high connectivity were targeted for splitting. Splitting a state usually resulted in the two copies taking on different properties and connections. If the two copies remained nearly identical to each other, we undid the change. Occasionally one of the two states would subsequently disappear during EM. Unconnected state pairs ij that exhibited high values of:

x_{ij} = \sum_t \alpha(i, t) \, \beta(j, t + 1)    (5)

were targeted for new transitions (splicing). Here, α and β are the forward and backward variables (Rabiner, 1989). New transitions were initialized to very small values, which either went back to zero or increased during subsequent cycles of EM. In all, 80 new transitions were added and 48 transitions were removed.
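Candidate pairs for splicing can be scored directly from the forward and backward arrays (our sketch, for a single sequence):

    import numpy as np

    def splice_candidates(alpha, beta, A, top_k=10):
        """Score unconnected state pairs by equation (5):
        x_ij = sum_t alpha(i, t) * beta(j, t + 1). Pairs with high x_ij but
        a_ij = 0 are candidates for a new transition. alpha and beta are the
        (T, N) forward/backward arrays for one sequence."""
        x = alpha[:-1].T @ beta[1:]        # x[i, j] = sum_t alpha[t, i] * beta[t+1, j]
        x[A > 0] = -np.inf                 # consider only unconnected pairs
        order = np.argsort(x, axis=None)[::-1][:top_k]
        return [np.unravel_index(f, x.shape) for f in order]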

A trend we observed for all three models was that the probability distributions for structural attributes often sharpened during training, to the point that most states can be identified with a single structural attribute. If the probability of any attribute went to zero during training, it reduced the number of variable parameters in our model by one. In this way, EM served as an unbiased pruning device.

Identification of coding regions

Given a set S of K sequences of variable lengths Tk, the sequence score L is defined as the base-2 logarithm of the probability of all sequences in S given the model λ, relative to the probability given the background distribution:

L(S; \lambda - \lambda_{bg}) = \frac{1}{T} \sum_{k=1}^{K} \left[ \log_2 P(s_k \mid \lambda) - \log_2 P(s_k \mid \lambda_{bg}) \right]    (6)

Here, T is the total number of positions in the set, obtained by summing all lengths Tk. The score is thus expressed in bits per position, which is convenient for assessing the potential of our model to discriminate actual protein sequences from non-coding sequences through the use of information theory. In this sense, L(S; λ − λbg) is the information gain, or entropy loss, in using λ in comparison to the simple sequence model λbg. In scoring models on sets of amino acid sequences or profiles, no structural attributes were used to evaluate P(sk|λ).

Our HMM models showed significant improvements in performance when we used profiles in place of single sequences for training and scoring. However, for the purpose of comparing our models to simpler models of protein sequence, more direct comparisons can be made by considering only the parent sequences of the training set. It is also worth bearing in mind that the parameters bi(m) in our model are parameters of a multinomial distribution, whereas in this test they are used as "emission" probabilities for single amino acids. Two randomized sets of amino acid sequences were constructed for assessing the ability of our models to discriminate protein sequence from non-coding sequence. The first set, Sbg, contains sequences with amino acid frequencies given by the training set, in random order. The second set, Sdip, contains sequences having the same dipeptide frequencies as the training set, but otherwise unconstrained. The random sequences consist of 1000 "proteins" of length 900. Results are insensitive to the length and number of sequences as long as sufficiently many positions are generated to accurately reflect the relevant statistical distributions. For single sequences the reference score P(sk|λbg) is particularly simple, being the product of bbg(st) over all positions t in the sequence sk.
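Equation (6) reduces to a few lines once log2 P(sk|λ) is available from the forward algorithm (our sketch; input names are illustrative):

    import numpy as np

    def sequence_score(log2_p_model, sequences, bg_freqs):
        """Equation (6): mean bits per position relative to the background
        model. log2_p_model[k] is log2 P(s_k | lambda) from the HMM forward
        algorithm (computed elsewhere); log2 P(s_k | lambda_bg) is simply
        the sum of log2 residue frequencies over the sequence."""
        T = sum(len(s) for s in sequences)
        log2_p_bg = [sum(np.log2(bg_freqs[aa]) for aa in s) for s in sequences]
        return sum(m - b for m, b in zip(log2_p_model, log2_p_bg)) / T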

Structure prediction by voting

The predictions of three-state (helix/strand/turn), context, and backbone angles, given a sequence {O_t^m}, were made using a voting procedure. For three-state secondary structure prediction, the predicted state, H, S or T, is given by the largest of the sums of the prior-weighted emission symbols over all states:

D_t = \arg\max_{i \in \{H,S,T\}} P_t(i)
    = \arg\max_{i \in \{H,S,T\}} \left[ \sum_{n=1}^{N} P(d_i \mid q_n) \, P(q_n \mid s_t) \right]
    = \arg\max_{i \in \{H,S,T\}} \left[ \sum_{n=1}^{N} d_n(i) \, \gamma_t(n) \right]    (7)
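The flat vote of equation (7) is a single matrix product over the state posteriors (our sketch; array names are illustrative):

    import numpy as np

    def vote_secondary_structure(gamma, d_emit, labels=("H", "S", "T")):
        """Equation (7): prior-weighted vote over all states. gamma is the
        (T, N) matrix of state posteriors gamma_t(n) from forward-backward;
        d_emit is (N, 3), the per-state helix/strand/turn probabilities
        d_n(i). Returns one predicted label per position."""
        votes = gamma @ d_emit             # P_t(i) = sum_n gamma_t(n) * d_n(i)
        return [labels[i] for i in votes.argmax(axis=1)]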

A two-tiered voting scheme was found to be better when choosing backbone angles. We first vote for a broadly defined region (regions BEbde, GH, Ll and ex in Figure 5), and then choose a specific region within it. As in parliamentary elections, under-populated regions are better represented in the elected distribution when their votes are consolidated. In our case, the four β-sheet backbone angle regions were first consolidated. If the consolidated region "won", then "run-off" elections were held between the sub-regions. The most highly populated backbone angle region is the one corresponding to α-helix; these angles also occur in about one-fourth of all loop positions. Hierarchical voting decreased the overprediction of these angles.

The prediction of context was based on the secondary structure prediction. Sequence segments predicted to be strand were assigned a probability of being either an end strand or a middle strand based on the position-specific probabilities Pt(n) and Pt(m), respectively, obtained by the voting scheme described above. Similarly, Pt(d) and Pt(h) were computed for all turns flanked by strands to assess the likelihood of finding diverging versus hairpin turns.

The summand in equation (1), P(s, Q|λ), represents the joint probability of the observation sequence s and the state sequence Q given the model λ. Structural attributes can also be predicted by determining the state sequence that maximizes P(s, Q|λ) for a given sequence of amino acids or profile, known as the Viterbi path (see Rabiner, 1989). Structure prediction based on voting consistently outperformed prediction based on the Viterbi path in our models.

Acknowledgments

We wish to thank Phil Green, Richard M. Karp, Anders Krogh, Chip Lawrence, Ingo Ruczinski, and Ed Thayer for helpful discussions, and a referee for pointing out an error in an earlier version of this manuscript. This work was supported by a University of Washington Training Grant in Interdisciplinary Genome Sciences (V.T.), by a Sloan Foundation/Department of Energy Fellowship in Computational Molecular Biology (V.T.), by the NSF (STC cooperative agreement BIR-9214821, D.B.), by a Packard Fellowship in Science and Engineering (D.B.), and by a grant from the Howard Hughes Medical Institute to the Rensselaer Bioinformatics Program (C.B.).

References

Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl. Acids Res. 25, 3389-3402.

Asai, K., Hayamizu, S. & Handa, K. (1993). Prediction of protein secondary structure by the hidden Markov model. Comput. Appl. Biosci. 9, 141-146.

Aurora, R. & Rose, G. D. (1998). Helix capping. Protein Sci. 7, 21-38.

Bailey, T. L. & Elkan, C. (1994). Fitting a mixture model by expectation maximization to discover motifs in biopolymers. In Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology (Altman, R., Brutlag, D., Karp, P., Lathrop, R. & Searls, D., eds), pp. 28-36, AAAI Press, Menlo Park, USA.

Burge, C. B. & Karlin, S. (1998). Finding the genes in genomic DNA. Curr. Opin. Struct. Biol. 8, 346-354.

Bystroff, C. & Baker, D. (1998). Prediction of local structure in proteins using a library of sequence-structure motifs. J. Mol. Biol. 281, 565-577.

Chou, P. Y. & Fasman, G. D. (1978). Prediction of the secondary structure of proteins from their amino acid sequence. Advan. Enzymol. 47, 45-148.

Dahiyat, B. I. & Mayo, S. L. (1996). Protein design automation. Protein Sci. 5, 895-903.

Di Francesco, V., McQueen, P., Garnier, J. & Munson, P. J. (1997). Incorporating global information into secondary structure prediction with hidden Markov models of protein folds. In Proceedings of the Fifth International Conference on Intelligent Systems for Molecular Biology (Gaasterland, T., Karp, P., Karplus, K., Ouzounis, C., Sander, C. & Valencia, A., eds), pp. 100-103, AAAI Press, Menlo Park, USA.

Doig, A. J. & Baldwin, R. L. (1995). N- and C-capping preferences for all 20 amino acids in alpha-helical peptides. Protein Sci. 4, 1325-1336.

Doig, A. J., MacArthur, M. W., Stapley, B. J. & Thornton, J. M. (1997). Structures of N-termini of helices in proteins. Protein Sci. 6, 147-155.

Eddy, S. R. (1996). Hidden Markov models. Curr. Opin. Struct. Biol. 6, 361-365.

Eddy, S. R. (1998). Profile hidden Markov models. Bioinformatics, 14, 755-763.

Efimov, A. V. (1993). Standard structures in proteins. Prog. Biophys. Mol. Biol. 60, 201-239.

Fickett, J. W. & Tung, C. S. (1992). Assessment of protein coding measures. Nucl. Acids Res. 20, 6441-6450.

Han, K. F. & Baker, D. (1996). Global properties of the mapping between local amino acid sequence and local structure in proteins. Proc. Natl Acad. Sci. USA, 93, 5814-5818.

Henderson, J., Salzberg, S. & Fasman, K. H. (1997). Finding genes in DNA with a hidden Markov model. J. Comput. Biol. 4, 127-141.

Henikoff, S. & Henikoff, J. G. (1992). Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA, 89, 10915-10919.

Hobohm, U. & Sander, C. (1994). Enlarged representative set of protein structures. Protein Sci. 3, 522-524.

Kabsch, W. & Sander, C. (1983). Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22, 2577-2637.

Karplus, K., Barrett, C. & Hughey, R. (1998). Hidden Markov models for detecting remote protein homologies. Bioinformatics, 14, 846-856.

Krogh, A., Brown, M., Mian, I. S., Sjolander, K. & Haussler, D. (1994). Hidden Markov models in computational biology: applications to protein modeling. J. Mol. Biol. 235, 1501-1531.

Kulp, D., Haussler, D., Reese, M. G. & Eeckman, F. H. (1996). A generalized hidden Markov model for the recognition of human genes in DNA. In Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology (States, D. J., Agarwal, P., Gaasterland, T., Hunter, L. & Smith, R. F., eds), pp. 134-142, AAAI Press, Menlo Park, USA.

Lacroix, E., Viguera, A. R. & Serrano, L. (1998). Elucidating the folding problem of alpha-helices: local motifs, long-range electrostatics, ionic-strength dependence and prediction of NMR parameters. J. Mol. Biol. 284, 173-191.

Lawrence, C. E. & Reilly, A. A. (1990). An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins: Struct. Funct. Genet. 7, 41-51.

Lio, P., Goldman, N., Thorne, J. L. & Jones, D. T. (1998). PASSML: combining evolutionary inference and protein secondary structure prediction. Bioinformatics, 14, 726-733.

Lukashin, A. V. & Borodovsky, M. (1998). GeneMark.hmm: new solutions for gene finding. Nucl. Acids Res. 26, 1107-1115.

Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 77, 257-286.

Rost, B. & Sander, C. (1994). Combining evolutionary information and neural networks to predict protein secondary structure. Proteins: Struct. Funct. Genet. 19, 55-72.

Rost, B. (1997). Better 1D predictions by experts with machines. Proteins: Struct. Funct. Genet. Suppl. 1, 192-197.


Simons, K. T., Kooperberg, C., Huang, E. & Baker, D. (1999). Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. Proteins: Struct. Funct. Genet. 34, 82-95.

Sonnhammer, E. L. L., Eddy, S. R. & Durbin, R. (1997). Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins: Struct. Funct. Genet. 28, 405-420.

Sonnhammer, E. L. L., von Heijne, G. & Krogh, A. (1998). A hidden Markov model for predicting transmembrane helices in protein sequences. In Proceedings of the Sixth International Conference on Intelligent Systems for Molecular Biology (Glasgow, J., Littlejohn, T., Major, F., Lathrop, R., Sankoff, D. & Sensen, C., eds), pp. 175-182, AAAI Press, Menlo Park, USA.

Stultz, C. M., White, J. V. & Smith, T. F. (1993). Structural analysis based on state-space modeling. Protein Sci. 2, 305-314.

Wootton, J. C. & Federhen, S. (1996). Analysis of compositionally biased regions in sequence databases. Methods Enzymol. 266, 554-571.

Appendix

There are several noteworthy relations among the results in Table 3. For the sequence scores evaluated with λdip we find that: (i) the score on the random sequences Sdip is equal to that of the database sequences Sdb; and (ii) the score on the random sequences Sbg is the negative of the score on Sdb, up to the precision with which these values were computed. Here, we show how relations (i) and (ii) may be confirmed independently by relatively simple calculations.

We first show that the sequence score of λdip on the database Sdb is equal to the mutual information for the probability distribution of adjacent amino acids relative to the background amino acid frequencies. Let N_i be the number of positions in the database Sdb with amino acid i, and N_ij be the number of positions where amino acid j is preceded by amino acid i. The amino acid frequency in the database is p_i = N_i/T, and the dipeptide frequency is p_ij = N_ij/(T − K), where K is the number of sequences. The mutual information M(p_ij), which reflects the information in the joint distribution p_ij in comparison to that of the marginal distribution p_i, is defined as:

M(p_{ij}) = \sum_{ij} p_{ij} \log_2 \frac{p_{ij}}{p_i p_j}    (A1)

The dipeptide model λdip is a Markov chain with 20 states, one corresponding to each amino acid (a state in a Markov chain, unlike that in a hidden Markov model, can "emit" only one symbol). The transition probabilities are a_ij = p_ij/p_i. The probability of a sequence under λdip is given by the product of the a_ij for adjacent pairs of amino acids. The background model reference term in equation (6) is the same for all sequence sets considered here and is given by Σ_i (p_i log2 p_i). To evaluate L(Sdb; λdip − λbg), we collect adjacent ij positions in the database to obtain

\frac{1}{T} \sum_{k=1}^{K} \log_2 P(s_k \in S_{db} \mid \lambda_{dip}) = \sum_{ij} p_{ij} \log_2 a_{ij}    (A2)

Subtracting the reference term, one finds L(Sdb; λdip − λbg) = M(p_ij). We are now ready to address statements (i) and (ii) above.

Statement (i) is straightforward, as the dipeptide model is sensitive only to adjacent residues, which obey identical distributions in the sequence sets Sdip and Sdb. Therefore, L(Sdip; λdip − λbg) = L(Sdb; λdip − λbg) = M(p_ij).

(ii) For sequences in Sbg, the joint probability on the right-hand side of equation (A2) is replaced by the product of the marginal probabilities p_i p_j, resulting in:

L(S_{bg}; \lambda_{dip} - \lambda_{bg}) = \sum_{ij} p_i p_j \log_2 \frac{p_{ij}}{p_i p_j}    (A3)

Surprisingly, this differs from the negative of equation (A1) by only 0.4%, for the p_ij of Sdb, which is consistent with the results of Table 3. To verify this, we define the small deviations ε_ij by p_ij = p_i p_j (1 + ε_ij) and perform expansions in ε_ij. Expanding the logarithm gives log(1 + ε_ij) = ε_ij − ε_ij²/2 + O(ε_ij³), where O(ε_ij³) denotes terms of order ε_ij³. The sum of equations (A1) and (A3) is (we omit the constant log 2):

\sum_{ij} (p_{ij} + p_i p_j) \log \frac{p_{ij}}{p_i p_j} = \sum_{ij} p_i p_j (2 + \epsilon_{ij}) \left[ \epsilon_{ij} - \tfrac{1}{2} \epsilon_{ij}^2 + O(\epsilon_{ij}^3) \right] = \sum_{ij} p_i p_j \, O(\epsilon_{ij}^3)    (A4)

The leading, O(ε_ij), term vanishes due to the constraint Σ_j (ε_ij p_j) = 0 for each i, which follows from p_i = Σ_j p_ij. Exact cancellation also takes place for the sub-leading, O(ε_ij²), terms. Repeating this expansion for M(p_ij) itself, there is no such cancellation, and M(p_ij) is of order (ε_ij)². This accounts for relation (ii).
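Relations (i) and (ii) are easy to check numerically from a dipeptide frequency table (our sketch; it assumes all p_ij are non-zero, as they are in the database):

    import numpy as np

    def dipeptide_relations(p, p2):
        """Numerical check of relations (i) and (ii). p is the length-20
        background frequency vector; p2 is the (20, 20) joint dipeptide
        distribution p_ij (all entries assumed non-zero). Returns M(p_ij)
        from (A1) and the S_bg score from (A3); relation (ii) says the
        latter is approximately -M(p_ij)."""
        outer = np.outer(p, p)
        ratio = p2 / outer
        M = np.sum(p2 * np.log2(ratio))        # equation (A1)
        L_bg = np.sum(outer * np.log2(ratio))  # equation (A3)
        return M, L_bg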

Edited by J. Thornton

(Received 27 March 2000; accepted 2 May 2000)