a strategy for the rapid multiple alignment of protein sequences · 2005. 12. 9. · third sequence...

11
J. Mol. Biol. (1987'1 198,327-337 A Strategy for the Rapid Multiple Alignment of Protein Sequences ConfidenceLevels from Tertiary Structure Comparisons Geoffrey J. Bartont and Michael J. E. Sternberg Laboratory of Molecular Biology D epartment of C ry stallography Birkbeck College, Malet Street, Lonilon WClE 7HX, U.K. ( Receiued, 28 April 1987, anil in reuised, form 18 July 1987 ) An algorithm is presented for the multiple alignment of protein sequences that is both accurate and rapid computationally. The approach is based on the conventional dynamic- programming method of pairwise alignment. Initially, two sequences are aligned, then the third sequence is aligned against the alignment of both sequences one and two. Similarly, the fourth sequenceis aligned against one, two and three. This is repeated until all sequences have been aligned. Iteration is then performed to yield a final alignment. The accuracy of sequencealignment is evaluated from alignment of the secondary structures in a family of proteins. For the globins, the multiple alignment was on average 99/o accuratecompared Lo g}o/o for pairwise comparison of sequences. For the alignment of immunoglobulin constant and variable domains, the use of many sequences yielded an alignment of 630/o average accuracy compared to 4lfo average for individual variable/ constant alignments. The multiple alignment algorithm yields an assignment of disulphide connectivity in mammalian serotransferrin that is consistent with crystallographic data, whereas pairwise alignments give an alternative assignment. 1. Introduction The advent of fast techniques for DNA sequencing has led to a rapid expansion in the number of known protein sequences (currently -4000 in the PIR databank: George et al., 1986). Access to these primary structures leads to the alignment of two or more protein sequences that can identify conserved regions of functional and/or structural importance. Furthermore, if homology can be shown with a biochemically or crystallo- graphically well-characterized protein, many properties or aspects of three-dimensional structure may be predicted (e.g.see Browne et al., 1969). Since the early work of Fitch (1966) and Needleman & Wunsch (1970), techniques for the comparisonand alignment of two protein or DNA sequences have been developed for speed (e.g. see Gotoh, 1982; Taylor, 1984; Fickett, 1984; Wilbur & Lipman, 1983),the identification of local similarities (e.g. see Sellers, 1979; Goad & Kanehisa, 1982; Boswell & Mclachlan, 1984) and increased sensitivity (Argos, 1987). Although the alignment of f Author to whom reprint requests should be addressed. $22-2836 I 87 | 220327 -r | $03.00/0 three sequences has been used to confirm weak homology betweentwo sequences (e.g.seeDoolittle, l98l), the practical limitations of computer memory and central processing unit (CPU) time restrict the rigorous extension of two sequence methods (e.g. see Needleman & Wunsch, 1970) to short sequences (Murata et a1.,1985). The multiple alignment of four or more sequences cannot in practice be solved by a rigorous method, sincethe number of segment comparisons that must be carried out is of the order of the product of the sequence lengths (many more if gaps are explicitly considered). Thus, multiple alignment algorithms in common with fast pairwise methods (e.g.see Wilbur & Lipman) seek to identify an optimum alignment by considering only a small number of the total possible residueor segmentcomparisons. Sankoff and co-workers (Sankoff & Cedergren, 1976) provided a workable multiple alignment algorithm applicable to nucleic acid sequences, which requires the sequences to be linked by a predeterminedevolutionary tree. Sobel & Martinez (1986) described an algorithm that bases the alignment of DNA sequences on the identification of common subsequences of a specified minimum length. Waterman (1986) elaborated a similar 327 O 1987 Academic Press Limited

Upload: others

Post on 31-Jan-2021

0 views

Category:

Documents


0 download

TRANSCRIPT

  • J. Mol. Biol. (1987'1 198,327-337

    A Strategy for the Rapid Multiple Alignmentof Protein Sequences

    Confidence Levels from Tertiary Structure Comparisons

    Geoffrey J. Bartont and Michael J. E. Sternberg

    Laboratory of Molecular BiologyD epartment of C ry stallography

    Birkbeck College, Malet Street, Lonilon WClE 7HX, U.K.

    ( Receiued, 28 April 1987, anil in reuised, form 18 July 1987 )

    An algorithm is presented for the multiple alignment of protein sequences that is bothaccurate and rapid computationally. The approach is based on the conventional dynamic-programming method of pairwise alignment. Initially, two sequences are aligned, then thethird sequence is aligned against the alignment of both sequences one and two. Similarly,the fourth sequence is aligned against one, two and three. This is repeated until allsequences have been aligned. Iteration is then performed to yield a final alignment.

    The accuracy of sequence alignment is evaluated from alignment of the secondarystructures in a family of proteins. For the globins, the multiple alignment was on average99/o accurate compared Lo g}o/o for pairwise comparison of sequences. For the alignment ofimmunoglobulin constant and variable domains, the use of many sequences yielded analignment of 630/o average accuracy compared to 4lfo average for individual variable/constant alignments. The multiple alignment algorithm yields an assignment of disulphideconnectivity in mammalian serotransferrin that is consistent with crystallographic data,whereas pairwise alignments give an alternative assignment.

    1. Introduction

    The advent of fast techniques for DNAsequencing has led to a rapid expansion in thenumber of known protein sequences (currently-4000 in the PIR databank: George et al., 1986).Access to these primary structures leads to thealignment of two or more protein sequences thatcan identify conserved regions of functional and/orstructural importance. Furthermore, if homologycan be shown with a biochemically or crystallo-graphically well-characterized protein, manyproperties or aspects of three-dimensional structuremay be predicted (e.g. see Browne et al., 1969).

    Since the early work of Fitch (1966) andNeedleman & Wunsch (1970), techniques for thecomparison and alignment of two protein or DNAsequences have been developed for speed (e.g. seeGotoh, 1982; Taylor, 1984; Fickett, 1984;Wilbur & Lipman, 1983), the identification of localsimilarities (e.g. see Sellers, 1979; Goad & Kanehisa,1982; Boswell & Mclachlan, 1984) and increasedsensitivity (Argos, 1987). Although the alignment of

    f Author to whom reprint requests should beaddressed.

    $22-2836 I 87 | 220327 -r | $03.00/0

    three sequences has been used to confirm weakhomology between two sequences (e.g. see Doolittle,l98l), the practical limitations of computermemory and central processing unit (CPU) timerestrict the rigorous extension of two sequencemethods (e.g. see Needleman & Wunsch, 1970) toshort sequences (Murata et a1.,1985).

    The multiple alignment of four or more sequencescannot in practice be solved by a rigorous method,since the number of segment comparisons that mustbe carried out is of the order of the product of thesequence lengths (many more if gaps are explicitlyconsidered). Thus, multiple alignment algorithms incommon with fast pairwise methods (e.g. see Wilbur& Lipman) seek to identify an optimum alignmentby considering only a small number of the totalpossible residue or segment comparisons.

    Sankoff and co-workers (Sankoff & Cedergren,1976) provided a workable multiple alignmentalgorithm applicable to nucleic acid sequences,which requires the sequences to be linked by apredetermined evolutionary tree. Sobel & Martinez(1986) described an algorithm that bases thealignment of DNA sequences on the identificationof common subsequences of a specified minimumlength. Waterman (1986) elaborated a similar

    327O 1987 Academic Press Limited

  • 328 G. J. Barton and, M. J. E. Sternberg

    technique but also allowed for mismatches, whilstBains (1986) described a related algorithm thatworks well for some families of closely similarnucleic acid sequences. None of these methods hasbeen applied directly to protein sequences, althoughWaterman (1986) described how his algorithmcould be so applied.

    The algorithm of Taylor (1986) allows largenumbers of protein sequences to be alignedbut for maximum effect requires that three-dimensional structures are known for some of thesequences in order to provide a "seed" alignment.Johnson & Doolittle (1986) desuibed a moregeneral multiple alignment algorithm for proteinsequences whereby a small subset of all possiblesegment comparisons is considered. Although theiralgorithm can cope with the three-way alignment oflong sequences, alignment of more than foursequences is restricted to short proteins, due toexcessive CPU demands. Bacon & Anderson (1986)reduced the number of segment comparisonsperformed by considering the sequences in anarbitrary order and maintain only the best-scoringsegments as each new sequence is added. Theiralgorithm does not explicitly cater for gaps, nordoes it produce a complete alignment of thesequences; however, it presents a sensitivetechnique for the identification of significant shorthomologies.

    This paper reports an algorithm that cangenerate a multiple alignment including theconsideration of gaps for a large number of proteinsequences without the need to introduce additionalnon-sequence information. Performing all pairwisecomparisons for the sequences suggests confidencelevels for the multiple alignment of particularsequence groups.

    2. Procedures

    (a) Needlemnn & Wunsch algorithm for two sequences

    (l) A matrix of amino acid pair scores, D, is chosen.Throughout this study the MDM^ matrix was used(Dayhoff, 1978) with a constant of 8 added to remove allnegative terms.

    (2) The protein sequences are defined as Al., A2n,where m and z are the number of residues in sequence Al ,A2, respectively.

    (3) A matrix R..n is generated with reference to D,where each element .B;,; represents the score for Al, uersusA2i'

    (4) n^," is acted on to generate 8..,, where eachelement Sr,.; holds the maximum score for a comparison ofAlr,. with A2;,,.

    (5) Either suitable pointers are recorded in (4), or atraceback procedure through ,B.,n is performed to enablean alignment with the maximum score for Al^uersus A2nto be generated.

    In order to limit the total number of residues alignedwith blanks, a gap-penalty, G, is subtracted during theprocess of generating Br,, whenever a gap is introduced.In our earlier work (Barton & Sternberg, 1987) westudied the effect upon the accuracy of pair-wisealignment of varying both length-dependent and length-independent gap penalties. The results indicated that a

    length-dependent penalty is unnecessary. Furtherunpublished results suggest that for the given D matrix, alength-independent penalty in the range 6 to l0 oftenyields a reasonable alignment. Ideally a range of penaltiesshould be investigated. However, on the basis of thesefindings and to provide a consistent benchmark we chosethe penalty of 8 (which is not necegsarily optimal) for usethroughout the current study.

    (b) Multiple alignment

    Let the sequences to be aligned by Al .. . A/y', then:(l) Align A2 with Al using Needleman & Wunsch

    algorithm. Let the length of the aligned sequences bedenoted L1 ,2 .

    (2) Align A3 with the alignment of A2 and Al obtainedin step ( l) .

    (3) Align sequence A,4 with the alignment of Al, .{2and A3 length Ll, 3.

    (4) Similarly align the sequences A5 to A1[.In general, for the &th sequence align with the

    previously obtained alignment for Al through A(ft- l).When generating the matrix .B in order to align the ftth

    sequence a scoring scheme is adopted that includes acontribution from all previously aligned sequences, thushighlighting congerved regions in the alignment

    Let i be the position of an aligned residue in sequencesA l . . . A ( f - l ) s u c h t h a t I < i < L l , ( l - l ) . L e t j b e t h eposition of a residue in sequence Ak then:

    o,., : J; of t

    Dno,.n*r. ( l)' k - t o = 1

    For example, if 3 sequences have been aligned so far andwe are considering the comparison of the ith position inthe alignment (Ala-Val-Leu) with thejth amino acid in the4th sequence (Ala), then the score (fi,,r) is given by thescore for (Ala uersus Ala) * (Ala aersus Val) + (Alaaersus Leu,) x l /3 : (10+8+6)/3:8. The value ofD2.p,,e*t when Ap, is a gap is set to the minimum value forany residue to residue score (0 in this work).

    The multiple alignment obtained in (4) may be refinedby realigning each sequence with the completedalignment less that sequence. Accordingly, sequence Al isaligned with the alignment of sequences A,2 . . . A1[(having first removed any gaps that are common toA2 . . . Anf ). A.2 is then realigned with the alignment ofAl, A3...Alf . This process is repeated unti l A1{ hasbeen realigned with Al .. . A(nf-l). The complete cyclemay then be repeated.

    (c) Criteria for assessing the quality of alignment

    The unique conformation that a globular proteinadopts and the resultant disposition of key catalytic orbinding residues ultimately determines its biologicalactivity. Therefore, when 2 or more protein sequencesare aligned it is of crucial importance that residuesdefining a common tertiary fold are correctlyequivalenced. Although the general fold may beconserved, there can be considerable variation in3-dimensional structure within a protein family. Inparticular, the presence of insertions or deletions makes itimpossible to assign structurally equivalent residues overthe full length and common to o/l members of a family.Even when insertions and deletions are absent, it can bedifficult to justify a particular structural alignmentespecially in the loop regions that connect elements ofsecondary structure. In the light of these observations weuse those regions that are common and also found in the

  • Multiple Sequence Alignmenl 329

    :

    core f-strands or a-helices of the proteins as test zones.An automatic alignment method should at least be ableto align these zones correctly, although sequencealignments based on structure may also be justifiedoutside these regions.

    We define the accura,cy of an alignment of 2 sequencesas the percentage of residues within these zones that arealigned in the same wa,y as in the reference alignment.

    (d) Ord,er o! alignment

    The order in which the sequences are added may beexpected to have an effect on the final multiplealignment. However, for N sequences there are N!alternative orders, so it is important to have a systematicapproach to selecting the single order. Our previousfindings suggested that a pairwise sequence comparisonthat gives a significance score of >6.0 s.o. may be alignedto >751o accuracy within regions of secondary structure(Barton & Sternberg, 1987). Further results presented inthis paper for 49 pairwise comparisons of immunoglobulinand globin sequences (Fig. l) support this observationand indicate that the alignment eccuracy is correlatedwith the significance score. Accordingly, our strategy fordetermining the alignment order first identifies the pair ofsequences that have the highest pairwise significancescore. Having established Al and A2, A3 is identified asthe sequence having the highest significance score witheither Al or 42. Similarly, A.4 is the sequence thatexhibits the highest significance score with Al, A,2 or AB.This procedure is continued until all sequences in thegroup have been entered in the order.

    o c peo o o

    B m

    (e) Red,ucing calculation time

    Before applying the ordering algorithm to a group ofsequences it is necessary to perform all pairwisecomparisons for the group. The calculation of significancescores for all pairwise alignments of N sequences is anexpensive procedure since if M randomizations per pairare performed then 1[ x (lf - l) x M 12 comparisons mustbe carried out. If no pairwise comparisons havepreviously been made then it is necessary to determinethe cheapest (in CPU time) approach to determining anorder.

    Feng el al. (1985) considered how many randomizationsneed to be performed on a pair of sequences beforeconsistent results are obtained and suggested on the basisof 4 pairs of sequences that as few as 25 could produce agenuinely reflective score. We have repeated this analysiswith a larger data set by carrying out all pairwisecomparisons for I members of the immunoglobulinsuperfamily and 6 serine proteinase sequences (47 pairs inall), using from l0 to 100 randomizations in steps of 10.The results indicate that instabilities in significance scoredo not damp out until at least 60 randomizations havebeen performed. It would seem impractical therefore touse this approach routinely for establishing an alignmentorder when more than a small number of sequences areinvolved.

    Doolittle (1981) has demonstrated the usefulness ofscoring systems based upon a single alignment score andnot involving randomization of the sequences. Fig. 2illustrates the relationship between one such scheme, thematch score divided by the length of the shortestsequence (NASs) and the significance score for 2 groups of

    @ o^ o

    u o

    o

    8a 6 0foooo--

    lg 40

    . +i +

    +++

    +++

    D Vorioble vcrsas vorioble* Vorioble vel'sus constontO Constont r€lsl/s consionlO Globins

    Significonce score (soJ

    Figure 1. The relationship between alignment accuracy as measured by reference to alignments obtained from3-dimensional structure superposition and the significance score for 2l pairwise alignments of 7 globin sequences and 28alignments of 8 immunoglobulin domains (Variable refers to immunoglobulin variable domains,- Constant toimmunoglobulin constant domains).

  • 330 G. J. Barton anil M. J. E. Sternberg

    A

    oo

    A

    oo

    aa

    {o,o

    A ,

    /o

    O Serine oroleinosesA lmmunoglobulin superfomily

    5 ro 15 20 25 30 35 40 45Signif iconce score (s.a)

    Figure 2. Relationship between the normalizedalignment score (NASs) calculated from the match scoredivided by the length of the shortest sequence and thesignificance score for eaeh of 47 pairwise comparisonswithin the serine proteinases and immunoglobulinsuperfamily. The data are correlated at 0.gb?. NASa(match score divided by the number of residues notaligned with a gap) for the same sequences ga,ve acorrelation value of 0.958.

    r2.o

    I t .5

    sequences that contain some very closely relatedmembers and many with tenuous similarities. Althoughthese data are not sufficiently representative of proteinsin general to allow a conclusion to be drawn on thequantitative relationship between NASs and significance,the qualitative relationship is clear. F'or groups ofproteins where randomization procedures would be tooexpensive, NASs values or the slightly more expensiveNASa (match score divided by the number of residues notaligned with gaps), may be used to generatesystematically a rational order for multiple alignment.

    (f) Test sequences and reference alignments

    (i) Globins

    Globins belong to the ala class of proteins and have acommon fold that is highly conserved in proteins fromorganisms distantly related in evolution. The sequences,however, show considerable variation and provide aninteresting test for the alignment method. Seven globinsequences and their reference alignment were taken fromLesk & Chothia (1980) without modification (humanhaemoglobin d-strand (HAHU), human haemoglobinf-strand (HBHU), horse haemoglobin a-strand (HAHO),horse haemoglobin B-strand (HBHO), sperm whalemyoglobin (MYWHP), sea lamprey cyanohaemoglobin(PILHB) and root nodule leghaemoglobin (LGHB)).Seven zones totalling 95 residues and corresponding tothe A, B, C, E, F, G and H a-helices were defined for eachsequence as illustrated in Fig. 3(a).

    (ii) Immunogl,obulin d,omains

    The immunoglobulin domains belong to the plp class ofproteins and have been studied in detail in terms of bothsequence and tertiary structure (e.g. see Amzel & Poljak,1976; Beale & Feinstein, 1976; Lesk & Chothia, 1982).Although the overall fold of the domains is conserved,there is considerable sequence variation, particularlybetween the variable and constant domains. Thesesequences thus provide a particularly stringent test for analignment method.

    Eight domains were selected (Brookhaven data bank

    EK A H O K K V L G R F

    K K V L H S F O E C

    KGHOKKVRDNL TNR

    KKVODRLTLR

    RI I I 'NV

    D L K K H O V T V L T R L G R

    aaz

    I t .o

    ro .5

    ro.o

    q . 5

    9.0

    8.5

    8.0

    H E H U

    H B H O

    H N H U

    HRHO

    P l L H B

    I,IYIIHF

    L O H B

    TIBHU

    H B H O

    TIRHU

    HNHO

    F l L H B

    IIYIIHP

    L O H B

    h I d n

    h l d n

    h v d d

    h t d d

    r r d d

    t t g h

    v h l

    v q l

    v l

    v l

    p l v d t g r v a p l

    v l

    9 9

    f .

    a o

    n og o l

    f r r l g d l r t p d o v . g

    ? d r l g d l r n p g o v r g

    l p h f d l r h 9

    l p h f d l r h g

    f p t l k g l t t o d r l k l

    l d r f k h l k t r o r r k o

    f r r f l l t g g t r r v p q n q l . f r ! Y v a r l D n T L

    FTLSELHCDI I h v

    F f lLSELHCDk I h v

    Sf ,LSDLHf lHk I r v

    SNLSDLHht& I r v

    KDLSOKHf lK IT f r v

    K F L R O S H N

    g k . l t p p v

    g k d f t p r l

    p o . t t p o v

    p n d l t p o v

    p g d l g o d o

    g o t r r r r l

    f I h

    k t h

    t t .

    h t .

    o f

    k y l r l g y q g

    r r d d o o

    h k r

    ( o )

    Fig. 3.

    EEKSNV TRLIJG

    T5 G VI I IL VKF F

    P

    P

    CN V L V C

    R L L O N V L V V

    K L L S H C L L V

    LLSI ICLLs

    Y F K V L N R V I R D

    L E F I S E R I I H

    E N I L K T

    D K F L S S V S T V L

    K N L 0 5 V H V S N g v v

  • Multiple Sequence Alignmed 331

    FnEcL qproop l i l ru -F i lp . . . . rqonro l i lG i i f r rypgo yF R B C H I c r t l g p l S V F F L { p r r t r r r g g t o l a t O C t - v f a y r p . p vF C c H 3 q p . r p l O V Y t L { p . . . . r t t n q v l S t t C L V I g t y l r d rF c c H 2 p l s v F L F 4 p r p r d t t r t r r r p l E v T c v v l v d v r h r d p q v

    cF

    d r r p v l o l t p r l q r n n l y c

    g o l t . o v l q r r g l y r

    n g q P . n p r l d r d g r l l

    d g r g v h n c q q t h . t t .

    q l p g t o p t l l t l

    q l p g r o p k l l t ; r d o . r p . g Y p t r

    q o p g k g l r r r o I I r d d g r d q h y o d r r t g c f l l

    d g v q v h n

    Bl"'*+rrscT0lrlRLscssflsLTcTvFlEvTcvvrlTLvcLrlt|nLccLvrlsLTcLvF

    FnBcL q' t. t, t.pEEi{o. e r ri iEllr y. p !. o.F n E c H l r r l g t q t l v t c r v $ h p r n t k l V D $ r r p k r oFccns nrqqgn' l rscsvnl r r .nt r rnr ,y 'Tol l . lFcc ! t 2 n r t dg t r c l yKcKv* r ko lp rp [ ! ! ] t r r r o rg

    A B C D E, . l i l ioe- f l . , .sopgq. , l i iEEiTl . . .nrsoenf t i i f r lqq lpsrcarnrr r ,nnn.{5vsra[ lssaru. . l vu rone f . . s tpsq . ' l r r sc ro l t . r " re r r l r vHu$qrpg rcp r l l r y rd . . . p . s rp r r r l sosx$ lqas r -. y l O L v 0 S C l g g v r q p g r r l l R L S C S l r g f t r r r y l 8 n Y U V l e q o p g l g l r r v c l l r d d g r r l q h y r r l r r r f r r l T l S R { d r r l N T L

    x r l o L E 0 s c b g l r f p . q t l l s l T c T v l . g r . l d d t l Y 3 T g + q p p g r g l r r l g t r t t h g t r d t d r p l . . r ' l I H L Y N l t r t l N 0 F

    d r r l r g t l l l r l r

    r r d l r y t g t l r l Y l g q

    g h g l o r r c r o t g p d l q g t p Y t r . rI I o g o I d l q g r l v ! v r r

    C"ll'F1.rlr vNtr Ylc

    4nHYuT4YsTrtiKFN"TvlTVf,UKlo

    "lrvsuuf,hvE1g|

    ( d )

    Figure 3. Test multiple alignments. Regions of the sequences written in capitals a,nd boxed correspond to test zonesselected from homologous secondary structures. (a) Alignment (l) 7 globin sequences: A, B, C, E, X', G, H are a-helixes;(b) alignment (2), 4 immunoglobulin constant domains; (c) alignment (3), 4 immunoglobulin variable domains;(d) alignment (4), 8 immunoglobulin domains including variable and constant. In (b) to (d), A, B, C, D, E, F and G aref-strands.

    ( b )

    FAEVL

    F B 4 V L

    F 8 4 V H

    F N E V H

    FNEVL

    F E + V L

    F D 4 V H

    FABVH

    ( c )

    FABVL

    F B T V L

    F B 4 V H

    FNBVH

    FCCH2

    FADCL

    FRBCHI

    F C C H 3

    FREVL

    F B 4 V L

    F B 4 V H

    FRBVH

    FCCH2

    FNBCL

    F N B C H I

    F C C H 3

    P

    q p k o o p

    o r ! k g p

    q P f . P

    . . n t g o g

    . r n l g .

    g l l l r r

    g t r f d d

    d v r h r d p q

    d l y p g o

    d y f p r p

    g l y p r d

    q q t

    p r k q r

    p o v l q r

    p v l d r

    . r . 9 . P g q . v

    . o . g ! P g q . Y

    g v r q p g r r l

    p g l r r p r q t l

    l p l p t r l l l r l r r t pr r r r l q c n k o

    r r L r l r g g l o

    r r a . r l t n q v

    t g l q o r d r o d

    g l r o r d r r d

    d r l r p r d t g v

    . Y ! ! . d l o r

    l h q n r l d g k r

    d r r p v l

    g o l t

    n 9 q P .

    g l k l t r l n

    l g t h v l v l g q

    g l p v l v . r

    g r l Y t r r t

    t t r k o l g

    t v r p l a o .

    Y . p h . o

    l r l

    p r q r l r h k r

    I t . l g t q t

    l r r r q g g n r

    d n r l r

    n r r d n r y

    l l o g o l r l

    k o l p o p

    . g . t

    k p r n t l

  • 332 G. J. Barton and, M. J. E. Sternberg

    codes). Four from 3FAB: (l) light chain constant regionCl (FABCL); (2) light chain variable region V,l (FABVL);(3) heavy chain constant region I Cyl (I,ABCHl);(4) heavy chain variable region V7 (FABVH). Three fromIFCI: (l) heavy chain constant region 2 C72 (FCCH2);(2) heavy chain constant region 3 C73 (X'CCH3). Twofrom lFB4: (l) light chain variable region V,t (FBaVL);(2) heavy chain variable region Vy (FB VH). Thereference alignment for the 8 domains was takenprincipally from Cohen et al. (1981) and Lesk & Chothia(1982). However, the most recent version of the co-ordinates deposited in the Brookhaven data bank(Bernstein et al., 19771 for SFAB shows a modifiedsequence in the B-C loop and C strand of FABVL. Froma consideration of hydrogen bonding, and least-squaresfitting of the domains, the alignment was revised in ttrisarea to take account of these changes. In addition itshould be noted that the sequence of FB4VL used here(taken from the Brookhaven data bank structure lFB4)differs slightly from the earlier version used bv Lesk &Chothia: threonine (T) has been substituted foiserine atresidue numbers 23 and.33, whilst alanine (A) substitutesfor glycine (G) at 75. Seven test zones comprising 38residue positions in total and corresponding to t[e 7homologous p-strands A to G were defined aJ illustratedin Fig. 3(b) to (d)).

    3. Results(a) Pairwise canq)arisons

    For each of the 28 unique pairwise comparisonsfor the immunoglobulins and 2l comparisons for

    Pouwise occurocy (7o)

    Figure 4. Comparison of accuracy for multiplealignments (l) to (a) with the same sequences alignedpairwise. Points above the diagonal line indicate animprovement in accuracy on multiple alignment.

    the globins, the percentage agreement with thereference alignment was calculated. In addition, aconventional test for significance was carried out byrandomizing each pair of sequences 100 times andcalculating the mean (m) and standard deviation(s.o.) of the distribution. The significance score isquoted as (V -m)ls.D., where 7 is the alignmentscore for the two native sequences.

    Figure I illustrates the result of these compari-sons. Alignments that score > 15.0 s.n. (7 examples)give at or near 100/o agreement with the referencealignment. Those scoring between 5.0 and lb.0 s.o.(25 examples) give better thanT0o/o agreement withthe reference alignment, whilst scores below b.0 s.n.(17 examples) show a sharp rise in alignmentaccuracy correlated with significance score andranging from 0o/o (0.57 s.o.; FABCHI aersusFB4VH) to 84o/o (2.4 s.r.; FABVL aerslrs FABCL).Above 5.0 s.l. there are no poor alignments;however, in the lower s.n. range small changes inobserved significance score can indicate a, consider-able difference in alignment accuracy.

    When aligning two sequences it is useful to bearin mind these findings since they can suggest thelikely quality of the alignment obtained. As anapproximate guide we consider an s.D. score above5.0 to indicate a "good" alignment, with theconfidence in alignment increasing with alignmentscore. An alignment giving a score below 5.0 s.D. weregard with a caution that becomes more stringentas the score decreases.

    (b) Test multiple alignments:comparison with pairwise

    In each of four test alignments performed, thesequences were ordered by similarity on the basis of100 randomizations as shown.

    (l) The seven globin sequences HBHU, HBHO,HAHU, HAHO, MYWHP, PILHB, LGHB.

    (2) The four constant domains FABCL,F'ABCHI, F'CCH3, FCCH2.

    (3) The four variable immunoqlobulin domainsFABVL, FB4VL, FB4VH, FABVII.

    (a) The eight immunoglobulin domains FABVL,FB4VL, FB4VH, FABVH, FCCH2, FABCL,FABCHI. FCCH3.

    Figure 4 shows the accuracy of alignmentobtained for pairs of sequences within multiplealignments (l) to (a) (Fig. 3, (a) to (d)) comparedwith the same sequences aligned pairwise. Pointsabove the diagonal line represent a,n improvementin alignment when the multiple alignmentalgorithm is applied. Multiple alignment of theglobins (1) results in an improvement from g0lo to99/o overall with the most dramatic improvementsfor the more distantly related sequences PILHB,MYWHB and LGHB and the largest changeoccurring for the comparison of LGHB with HBHO(77 to SS"/").The final alignment shown in Figure3(a) has 94 out of 95 defined residues correctlyaligned for all seven sequences. Furthermore, the D

    Ioo:o

    g.9=

    601T

    I40l-

    O Constont domoins (olone) (2)tr Vorioble domoins (olone) (3)X Conslont domoins (with vorioble) (4)A Vorioble domoins (wifh consfont) (4)t Vorioble yersas constonf (4)

  • Multiple Sequerwe Alignment 333

    a-helix, which is only present in HBHU, HBHO,MYWHP and PILHB, is also correctly aligned.The single error occurs at the beginning of the Fhelix and may in part be caused by the choice ofscore used for a residue aersus a pre-introduced gap(discussed further below). Analysis of the globinsignificance scores by the technique of single linkagecluster analysis (Dayhoff et al., 1972; Sokol &Sneath, 1973) in the light of these results suggeststhat sequences that cluster above 5.0 s.n. align atleast as well by the multiple algorithm as they dopairwise. The immunoglobulin constant domains (2)that cluster at 8.5 s.l. and align to 86/o accuracypairwise and 90/o by the multiple algorithm with30/38 positions correctly equivalenced across thecomplete four-sequence alignment (Fig. 3(b)) andthe variable domains, which also cluster at 8.5 s.o.with mean accuracies of 83o/o (pairwise), 84/o(multiple) and 29/38 positions correctly alignedacross all four sequences (Fig. 3(c)), lend furthersupport to this hypothesis.

    When all eight immunoglobulins are aligned (4),the accuracies of variable aersus variable andconstant uersus constant domain comparisonsmarginally deteriorate (84+ l0% to 8lt7o/);however, there is a striking improvement in thealignment accuracy for variable aersus constantdomains from a me&n value of 4l +28o/o to63+3yo. This is most noticeable for the alignmentof FB4VH aersus FABCHI and FB4VH uersusFCCH3, which were completely misaligned whencompared pairwise. FABVH uersus FABCL, whichcould only be aligned pairwise to

  • 334 G. J. Barton and, M. J. E. Sternberg

    Table IEffect of iteration on the mean alignment accuracy of pairs of sequences within the

    multiple alignments

    Iteration (mean accuracy* I s.o.)

    Alignmentf

    ( l )(2)(3 )(4)

    98 .91 l88.2 + 683.3 + I62.6+t4

    99.5 + 0'588.2 + 683.3 + I70.5+ t2

    99.5+0.589.5 + 784.2+g70.8+ l0

    99.5 + 0.589.5 + 785.5 + I70.5 + l0

    99.5 + 0.589'5+ 787.3 + 771 .6+ l 0

    f (l) 7 globins; (2) 4 immunoglobulin constant domains; (3) 4 immunoglobulin variable domains; (4)8 immunoglobulin domains (0during iteration can help to alleviate this problembut may also introduce errors at other points in thealignment (results not shown). Averaging also hasan effect on the non-gap regions of the alignment,so that the effect of iteration becomes less apparentas the number of sequences is increased. Thisproperty can be an advantage for very largealignments, since a single alignment pass with noiterations may be sufficient to yield a finalalignment.

    The alignment of sequences in one specific orderis the main route by which this algorithm reducesthe number of comparisons to manageable propor-tions. To investigate the importance of order on thefinal result, ten unique alternative orders weregenerated for alignments (l) and (4). The meanalignment accuracy for the ten globin orders waslittle different, at 98.7 o/o, from that obtained foralignments ordered by s.o. score (gg.b/o) or NASa(98.9%). However, the ten immunoglobulin ordersgave a mean value of 57.60/o compared to 70.8o/ofor the alignment ordered by s.o. score. This resultis hardly surprising, since many of the orders startwith the poor alignment of a variable and constantdomain that is not subsequently corrected. lndeed,the order that performed least well of the ten wasone in which variable and constant domainsalternated (FABVH, FABCHI, FB4VH, FCCH3,FABVL, FABCL, FB4VL, FCCH2). This gave only38/o accuracy before iteration with no correctlyaligned positions across all eight sequences. Aftertwo iterations, the accuracy had improved to 460/oand 4/38 residues (the first four of the F-strandwere in complete alignment). The order defined byNASa scores (FABVL, FB4VL, FB4VH, FABVH,FABCHI, FABCL, FCCH3, FCCH2) is verv similarto the s.D. score order and gave an aiignmentaccuracy of 67'2o/o.

    Although it might appear from the variable andconstant domain example that a bad initialalignment will always lead to a generally pooroverall alignment, this is not necessarily true. Forexample, it is possible for a multiple alignment of20 sequences to be poor for the first ten, yet goodfor the second ten provided that the second ten areclosely related. This feature is another consequence

    of the scoring scheme shown by equation (l), sinceone good comparison can be identified against abackground of average scores.

    4. Applications and Conclusions

    One advantage of the algorithm described here isits speed. For example, the complete seven-sequence globin alignment (Fig. 3(a)) (2 iterations)required 65 seconds, whilst the same operation forthe eight-sequence immunoglobulin alignment(Fig. 3(d)) took 50 seconds CPU time on aVAX ll/750. Pairwise comparisons to establish anorder without randomization required 44 secondsfor the globins and 34 seconds for the immuno-globulins. For comparison, the Johnson & Doolittle(1986) algorithm requires 60 minutes CPU time toalign five sequences of less than S0 residues inlength.

    The determination of an alignment order uiapairwise comparisons is the most time-consurningpart of the procedure, particularly if random-izations are performed. However, such an analysisis often carried out as part of the characterizationof a newly determined sequence and would not needto be repeated to permit multiple alignment. If thetime required to perform all pairwise comparisons isprohibitive then the seven-globin alignmentsuggests that an arbitrary order may performalmost as well for an alignment of similarsequences. Each iterative pass of our algorithmrequires time approximately proportional Lo N M2,where N is the number of sequences and tll is thelength of the sequences when aligned. Although it isexpensive to align long sequences, the task is notimpossible: however, the longest alignment thatmay be produced is limited by the need to store onearray of dimensions M x M.

    Aligning large numbers of medium-length proteinsequences (150 to 300 residues) is therefore a matterof routine. For example, the alignment of 128globin sequences, including haemoglobin-a and, B,myoglobin and leghaemoglobin from a wide rangeof species, required 25 minutes of CPU timeincluding two iterations (Fig. 5). The sequencesused in the previous section to test the method areindicated on the Figure together with the test

  • Multiple Sequence Alignment 335

    E

    es{c3GPSYC2GPSYS@sYclclmGPPiI6m

    Figure 5. Multiple alignment of 128 globin sequences taken from the PIR databank (databank codes shown). The 7sequences of known 3-dimensional structure illustrated in Fig. 3(a) are indicated (4 MYWHP, sperm whale myoglobin;40 HAHU, human a-haemoglobin; 43HAHO, horse a-haemoglobin; 79 HBHU, human B-haemoglobin; 82 HBHO, horsep-haemoglobin; l l9 PILHB, sea lamprey cyanohaemoglobin; 120 LGHB, root nodule leghaemoglobin).

    rB0rxw0XYBDflwMBffi

    @ IMEIYSHTYBO

    wcHryft2mmwLzxwr0

    MYmxJMruUCHPE

    NXucH2sccllNllM

    NYWA2NOGuo|llm

    ilTSlN@ucPilonUE

    l:|lmSJSBum|w2ucYsccllM|{Mffi1l&lfuffiBt|ftl2ffiQPrcwtHM!hBns

    4 5

    a a

    5 2

    5a

    5A5960

    626 3

    6 556

    7 t7 2

    6a tl@J65 HBLL65 HBBOGa7 H8808AA BEYA269 HMN90 HHA91 Hm92 HEES93 HBOL9a Hm95 HilS96 Src97 HEEOF9 a | f f i99 Hmt

    IOO H88Y10r HwIO2 HBOElo3 '{m104 @l105 mxG2C106 HrcH107 ffiro8 t]MIO9 HEP110 llwr t l BQrr2 ffiE113 l;&TItla |:ml115 |.@115 ffiI I ' 6R{J

    zones, whilst the alignment order was determinedby similarity, on the basis of NASa scores for allpairwise comparisons. The seven test sequences arealigned correctly except for the beginning of theF-strand (as before), and the last two residues ofthe G-strand in Pl LHB. Furthermore, theremaining l2l sequences also appear to be aligned

    in a consistent manner. For example, the proximaland distal histidine residues are aligned in all buttwo sequences. The exceptions are the distalhistidine of MYELI (sequence 16, Indian elephant)where Gln is substituted, and HACCI (sequence 63,desert sucker a-chain) in which the His is displacedby one residue. In a recent detailed study of residue

  • 336 G. J. Barton and, M. J. E. Sternberg

    Figure 6. Connectivity of disulphide bridges in thetransferrins. Bridges I to 4, 3 to b and 2 to 6 correspondto bridges 4, 5 and I I in the nomenclature of Williams(1982). TI'HUL, human lactotransferrin; TFHUP,human serotransferrin; TFCHE, chicken ovotransferrin;P97, human melanoma antigen. (a) Connectivity forTFCHE; (b) connectivity suggested by Metz-Boutigue etal. (1984\, and Rose et al. (1986); (c) conne"ii,ritysuggested by the rabbit serotransferrin crystal srructure;3nd (d) multiple alignment that supports t"he connectivityin (c) .

    conservation in the globins based upon pairwisealignment with ma,nual corrections (Bashford,Chothia & Lesk, personal communication), thesecond example was identified as a sequencingerror, since the order of residues &s shown(H G K K) would lead to a shift in the E helix by aquarter turn; the sequence should be (K H G K).

    We stress that the alignment shown in Figure 5was produced entirely automatically, withoul anymanual intervention or pre-alignment of k"yregions. To our knowledge there is no otheralgorithm that will permit an objective globalalignment of so many protein sequences and to sucha high level of accura,cy.

    Application of the multiple alignment algorithmhas proved valuable during the crystallographicdetermination of a mammalian serotransferrin(rabbit) in this laboratory (Gorinsky et al., tgTg;P. Lindley et al., perconal communication). Humantransferrin (TFHUP) shows strong sequencehomology with chicken transferrin (TFCHE).human lactotransferrin (TFHUL)

    "nd human

    melanoma antigen (P97). TFCHE has been studiedbiochemically and the connectivity of thedisulphide links determined (Williams et al., lg82\.Figure 6(a) illustrates the topology of disulphidesI to 4 and 3 to 5, which are unambiguous in thealignment of TFCHE, TFHUL and PgZ. However,TFHUP in common with rabbit serotransferrin hasan additional disulphide in this region, and the

    lopolggy as indicated by the published alignmenrsfor TFHUP with TFCHE and TFHUT (Metz-Boutigue et al., 1984) as well as TFHUP with pg7(Rose ef ol., 1986) is shown in Figure 6(b). However,inspection of the electron density map for serotrans-ferrin at 3.3 A (l A:0.1 nm) iesolulion suggestedthat this topology could not be accommodated, butthat the arrangement shown in Figure 6(c) wasmore likely. To provide independent-evidence, themultiple alignment algorithm was applied to thefour sequences. Pairwise comparisonJ ihowed thatthe seq_uences clustered at 4l s.n., suggesting a highlevel of confidence in the alignment, whilst multiplealignment of the complete sequences (- 800residues) performed with two iterations lead to thealignment pa,rtly shown in Figure O(d). Thisalignment supports the crystallographic interpreta-tion of topology shown in Figure 6(c).

    The algorithm presented in this paper representsa practical solution to the problem of automaticallyaligning more than two protein sequences whenonly sequence information is available. It appearsmost valuable when there are weaker similarities(e.g. immunoglobulin constant uersus variabledomains). However, the great sensitivity of align-ment_accuracy below 5.0 s.n. (Fig. l) to changes insignificance score and the sensitivity to alternativealignment orders make the level of success in analignment that includes weakly similar sequencesdifficult to predict.

    ,The test systems presented here suggest thatwhen a group of sequences cluster at >5.0s.o.multiple alignment by our algorithm will provide aconvenient representation, which is likely to be>70o/o correct within secondary structures andmore &ccur&te than individual pairwise alignments.

    -.Our results suggest the overall accuracy ofalignment that might be expected for a particularsignificance score; however, the problem stillremains of identifying which regions of thealignment are correct. Argos (lg8?) has described asensitive procedure for identifying significant localhomologies between two sequences or pre-alignedfamilies of sequences and shown that these oft"r,correspond to regions of similarity in three-dimensional structure. Thus, when there are two ormore distinct clusters of sequences to be aligned,the Argos method may be applied to alignmentsobtained automatically by our algoritlim andprovide an indication of which regions are correctlyequivalenced.

    Although the incorporation of properties inaddition to the Dayhoff matrix can improve thesensitivity of sequence comparison methods byreducing background noise (Argos, lg87), without

    ( o )

    ( b )

    ( c )

    ( d )

    I I F C I E

    F I E l , l f lF F F s h s l c l v F c F 0 x c 0 F p N L l c l n L l c l F c r c E d hn F F s c s l c l a F E R o o r o F r o r l c l o L l c l e o{ F F sn s lc lv F c F r r e o KL lc ln o lc l r o 0F( r (0 y F c c s l q v p c h 0 E I s y s E s L l q i L E l i c o 3 s 0 e c

  • Multiple S equen ce Alignment 337

    a more complete understanding of the relationshipbetween sequence and three-dimensional structureit is difficult to envisage a scoring scheme thatwould, for example, lead to the correct alignment ofthe A p-strands of immunoglobulin variable andconstant domains (Fig. 3(d)). When sequencesimilarity is weak, sequence alignment becomes anexercise in structure prediction and correctalignment is constrained by the fact that the coderelating amino acid sequence to three-dimensionalstructure is degenerate.

    Alignments of clearly similar sequences generatedby our algorithm have been used as the basis of animproved secondary structure and active siteprediction algorithm (Zvelebil et al., 1987) and alsoto align four different strains of human immuno-deficiency virus (HIV) enu (800 residues), gag (500residues) and pol (1000 residues) polyproteins withthe aim of predicting potential T andB-lymphocyte-defined epitopes (Coates et al., 1987Sternberg et aL,1987).

    It has been suggested that a few key residues canbe sufficient to define a tertiary fold (e.g. seeWierenga et al., 1986). The multiple alignmentalgorithm provides a useful tool for identifying suchpatterns from closely related sequences. We arecurrently developing and calibrating techniques foridentifying these patterns, and rapidly scanning theprotein sequence databank to identify proteins ofpotentially similar tertiary folds.

    We .thank Professor T. Blundell for his continuedsupport, M. Zvelebil and I. Haneef for helpfuldiscussions, and R,. Garrett, B. Gorinsky and P. Lindleyfor presenting the transferrin problem. This work wasfunded by the Science and Engineering Research Council.

    References

    Amzel, L. M. & Poljak, R. (1979). Annu. Reu. Biochem.48, 961-997.

    Argos, P. (1987). J. Mol. Biol.193,385-396.Bacon, D. J. & Anderson, W. F. (1986). J. Mol. Biol. l9l,

    153- l6 l .Bains, W. (1986). Nu.cI. Acids Res. 14,159-177.Barton, G. J. & Sternberg, M. J. E. (1987). Protein Erry.

    1,89-94.Beale, D. & Feinstein, A. (1976). Quart. Reu. Biophys.9,

    135-180.Bernstein, F. C., Koetzle, T. F., Will iams, G. J. B..

    Meyer, D. F., Jr, Brice, M. D., Rodgers, J. R,.,Kennard, O., Shimanouchi, T. & Tasumi, M. (1977).J. MoI. Biol. ll2.535-542.

    Boswell, D. R. & Mclachlan, A. D. (1984) . Nu,cI. AcitlsRes. 12,457-464.

    Browne, W. J., North, A. C. T., Phil l ips, D. C., Brew, K.,Vanaman, T. C. & Hill, R,. C. (1969). J. Mol. Biol.120,97-120.

    Coates, A. R. M., Cookson, J., Barton, G. J., Zvel6bil,M. J. & Sternberg, M. J. E. (1987). Nature ( Lond,on),326.549-550.

    Cohen, F. E., Novotny, J., Sternberg, M. J. E., Campbell,D. G. & Will iams, A. F. (1981). Biochem. J. 195,3l-40.

    Dayhoff, M. O. (1972) .ln Atlas of Protein Sequence and'Strunture (Dayhoff, M. O., ed.), vol. 5, pp. 89-110,National Biomedical Research Foundation,Washington, DC.

    Dayhoff, M. O. (1978) .In Atlas of Protein Sequence and,Structure (Dayhoff, M. O., ed.), vol. 5, suppl. 3,pp. 345-358, National Biomedical ResearchFoundation, Washington, DC.

    Dayhoff, M. O., Park, C. M. & Mclaughlin, P. J. (1972).ln Atlas of Protein Sequence and, Structure (Dayhoff,M. O., ed.), p. 12. National Biomedical ResearchFoundation, Washington, DC. -

    Doolittle, R. F. ( 1981 ). Scierwe, 2l4, l4g-159.Edelman, G. M. (1970). Biochemistry,9, 3197-3205.Feng, D. F., Johnson, M. S. & Doolittle, R,. F. (1985).

    J. Mol. Eool. 21. ll2-125.Fickett, J. W. (1984). NwL Acitls Res. 12,175-179.Fitch, W. M. (1966). J. Mol. BioI. 16,9-16.George, D. G., Barker, W. C. & Hunt, L. T. (1986). Nuct.

    Acitls Res. 14, I l-15.Goad, W. B. & Kanehisa, M. I. (1982). NucI. Acids Res.

    10,247-263.Gorinsky, B., Horsbaugh, C., Lindley, P. F., Moss, D. S.,

    Parkar, M. & Watson, J. L. (1979). Nature (Lond,on),281 ,157 -158 .

    Gotoh, O. (1982). J. MoI. BioI.162,705-708.Johnson, M. S. & Doolitt le, R. F. (1986). J. Mol. Eool.23,

    267-278.Lesk, A. M. & Chothia, C. (1980). J. Mol. Biol. 136,

    225-270.Lesk, A. M. & Chothia, C. (1982). J. Mol. BioI. 160,

    325-342.Metz-Boutigue, M. H., Jollds, J., Mazurier, J.,

    Schoentgen, F., Legrand, D., Spik, G., Montreuil, J.& Jollds, P. (1984). Eur. J. Biochem. 145, 659-6?6.

    Moore, G. W. & Goodman, M. (1977). J. Mol. Euol. 9,l2 l -130.

    Murata, M., R,ichardson, J. S. & Sussman, J. L. (1985).Proc. Nat. Aca.d. Sci.. U.5.A.82. 3073-3077.

    Needleman, S. B. & Wunsch, C. D. (1970). J. Mol. Biol.48,443-453.

    Rose, T. M., Plowman, G. D., Teplow, D. B., Dreyer,W. J., Hellstrom, K. E. & Brown, J. P. (1986). Proc.Nat. Acatl. Sci., U.5.A.83, l26l-1265.

    Sankoff, R,. J. & Cedergren, G. L. (1976). J. MoI. Eaol.7,133-149.

    Sellers, P. (1979). Proc. Nat. Acad'. Sci., U.5.A.76,304I.Sobel, E. & Martinez, H. M. (1986). NucI. Acitls Res. 14,

    363-374.Sokol, R. R. & Sneath, P. A. H. (1973). In Numerical

    Tarornmy, p. 201, Freeman, San Francisco.Sternberg, M. J. E., Barton G. J., Zvel6bil, M. J. J.,

    Cookson, J. & Coates, A. R. M. (1987). IEBSLettere,2lE, 231-237.

    Taylor, P. (1984). Nucl. Aciils Res. 12, 447-455.Taylor, W. R. (1986). J. Mol. Biol. 188,233-258.Waterman, M. S. (1986). Nu.cl. Acids Res. 14,9095-9102.Wilbur, W. J. & Lipman, D. J. (1983). Proc. Nat. Acad'.

    Sci., U.S.A. E0, 726-730.Williams, J. (1982). Trend's Biochem. Sci.7,894-397.Williams, J., Elleman, T. C., Kingston, I. 8., Wilkins,

    A. G. & Kuhn, K. A. (1982). Eur. J. Biochem.122,297-303.

    Wierenga, R,. K., Terpstra, P. & Hol. W. G. J. (1986).J. Mol. BioI. 187, l0l-107.

    Zvelebil, M. J., Barton, G. J., Taylor, W. R. & Sternberg,M. J. E. (1987). J. MoI. 8io1.195,957-961.

    Eiliteil by R. Huber