fast multiple alignment of protein structures using conformational letter blocks

Upload: rian-triana

Post on 30-May-2018

217 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/9/2019 Fast Multiple Alignment of Protein Structures Using Conformational Letter Blocks

    1/15

    The Open Bioinformatics Journal, 2009, 3, 69-83 69

    1875-0362/09 2009 Bentham Open

    Open Access

    Fast Multiple Alignment of Protein Structures Using ConformationalLetter Blocks

    Sheng Wang and Wei-Mou Zheng*

    Institute of Theoretical Physics, Academia Sinica, Beijing 100190, China

    Abstract: Most approaches for protein structure alignment start from a search for similar fragments since this localsimilarity is necessary to the alignment even though is insufficient. In contrary to the sequence alignment, anyinsignificant trial alignment for structures can be detected by structure superposition and then excluded. It is then

    practicable to select from locally similar fragments those responsible for alignment and build up it. An efficient way forlocal similarity search is to use a conformational alphabet, which is a discretized description of protein chain localgeometry. Using our conformational alphabet and its substitution matrix CLESUM, we propose a tool called BLOMAPSfor fast multiple structure alignment.

    By means of the conformational alphabet, a structural fragment is mapped to a string, and two strings with their CLESUMscore being higher than a preset threshold form a similar fragment pair (SFP). A string from one protein as a seed and itshighly similar fragments from other proteins form a similar fragment block. Taking one protein as the pivot, BLOMAPS

    uses the rigid transformation for SFPs in a block to superimpose proteins and initiate an anchor-based alignment.BLOMAPS is greedy in nature, guided by CLESUM similarity scores. It consists of several steps including findingsimilar fragment blocks based on a pivot protein, removing block redundancy, constructing scaffold by checkingconsistency in spatial arrangement among fragments from different blocks, dealing with unanchored structures, and thefinal step of refinement where the average template for alignment is obtained and motifs missing from the pivot proteinare found and added. The utility of BLOMAPS is tested on various protein structure ensembles including large scale ones,and compared with several other tools including MATT.

    BLOMAPS is available at: www.itp.ac.cn/zheng/blomaps.rar

    Keywords: Protein structure, Multiple structural alignment, Protein conformational alphabet.

    1. INTRODUCTION

    Protein structures are better conserved than amino acid

    sequences. Remote homology is detectable more reliably bycomparing structures. Multiple alignment carriessignificantly more information than pairwise alignment, andhence is a much more powerful tool for classifying proteins,detecting evolutionary relationship and common structuralmotifs, and assisting structure/function prediction. Mostrecent reviews on protein structure alignment include Refs.[1, 2]. Being able to provide very useful information andinsights for proteomics, structural bioinformatics and drugdevelopment, the comparison and alignment of proteinstructures has come to be a fundamental and widely usedtask in computational structure biology. However,developing accurate and fast methods for multiple structurealignment is still regarded as an open challenge. Here, we

    propose a tool called BLOMAPS for fast multiple structurealignment.

    The common goal of all multiple structure alignmentmethods is to identify a set of residue columns from eachrow protein that are structurally similar, or to find anoptimal correspondence among the atoms of these molecularstructures. For a given multiple alignment, several aligned

    *Address correspondence to this author at the Institute of TheoreticalPhysics, Academia Sinica, Beijing 100190, China; E-mail: [email protected]

    blocks, which correspond to contiguous columns, usuallycan be identified. Each block is again composed of locally

    similar fragments. This local similarity within a block maybe phrased as vertical equivalency. The local similarity inecessary to the alignment, but is insufficient. For any twostructures in the multiple alignment, the transformation tosuperimpose a fragment duad in an aligned block should alsobring the fragment duads in other blocks spatially close. Thiis the horizontal consistency. In contrary to the sequencealignment, any insignificant trial alignment for structures can be detected by structure superposition and then excludedThus, it is practicable to select from locally similarfragments those responsible for the global alignment andbuild up it.

    Structure alignment involves the geometricrepresentation for structures. In most cases, only the

    backbone of pseudobonds formed by!

    C atoms are

    considered. Atom coordinates, which change undetranslation and rotation in 3D space, are not geometricinvariants. Distances used by DALI [3, 4] or CE [5] are theintrinsic property of a geometric object. The peptide chaindihedral angles or the bending and torsion angles o pseudobonds [6] are also geometric invariants. The unitvector used by MAMMOTH [7] implies pseudobond anglesbut is not invariant. MASS [8, 9] replace secondary structureelements by the vectors of their axes. This vector

  • 8/9/2019 Fast Multiple Alignment of Protein Structures Using Conformational Letter Blocks

    2/15

    70 The Open Bioinformatics Journal, 2009, Volume 3 Wang and Zheng

    representation speeds up the computation, but has a low precision for structural elements. BLOMAPS uses pseudobond angles, but goes one step further. The threepseudobond angles formed by four contiguous residues have been coarse-grained into discrete conformational letters,and a substitution matrix called CLESUM has beenconstructed for these letters based on a database of alignedprotein structures [10, 11]. By means of this conformational

    alphabet, a structural fragment is mapped to a string.BLOMAPS measures the similarity of two fragments withthe CLESUM score of their strings.

    The horizontal consistency described above is based onsuperposition. In the language of distances, the consistencyis expressed as the distance constraint that any correspondingdistances from two structures in alignment should be nearlyequal. A few methods like DALI or CE construct analignment by joining fragment duads satisfying the distanceconstraint without superposition. Many methods start with aninitial trial correspondence, and then iteratively update theoptimal transformation and correspondence in turn until the best correspondence is finally found. Often a dynamicprogramming algorithm is employed in piecing up a globalalignment; it enforces the alignment to be collinear. MATT(Multiple Alignment with Translations and Twists),including the difference between transformations forsuperposing fragment duads as a part of penalty for dynamicprogramming, is able to deal with local flexibility [12].

    The pairwise structure alignment forms the basis for themultiple structure alignment. Most existing methods ofmultiple structural alignment combine a pairwise alignmentand some heuristic with a progressive-type layout to mergepairwise alignments into a multiple alignment [6, 7, 13-16].Such pairwise-based methods have the limitation thatalignments which are optimal for the whole input set might be missed. Another limitation is in speed. There are ahandful of truly multiple methods [8, 9, 17, 18]. BLOMAPSalso belongs to the category as well as CLEMAPS, anothertool developed in our group [19].

    BLOMAPS is developed from our CLEPAPS, a fast toolfor pairwise protein structure alignment [20]. CLEPAPSsearches for similar fragment pairs (SFPs) by stringcomparison based on CLESUM scores. Like ProSup [21],CLEPAPS is anchor-based. It takes a single good SFP as aninitial correspondence. The optimal transformation for thisseed SFP is used to superimpose the pair proteins and then toupdate the correspondence. The procedure of progressively

    building up larger correspondence is iterated until the bestcorrespondence is finally found. CLEPAPS adopts a greedystrategy guided by CLESUM scores. In fact, the usage ofconformational letters may be integrated into any approachesengaging SFPs. However, the number of operations involvedin consistency checking for a CE-type method is, roughlyspeaking, quadratic in the number of relevant residues, whilefor a ProSup-type it is linear. Our CLEMAPS, beingdeveloped before CLEPAPS, is a multiple version of the CE-type, which is less superior in speed than the ProSup-type.Encouraged by the performance of CLEPAPS, we decided toextend it to BLOMAPS, a multiple version. The design of

    BLOMAPS architecture has been briefly reported in [22and [11]. Recently, appeared a new tool MATT, whichoutperforms other programs in alignment quality on distanstructures, and sets a high standard for other tools to reachWe present here a full version of BLOMAPS, whichincludes not only a detailed description of the method andcareful analysis of its implementation, but also some tests onlarge scale ensembles and case study. BLOMAPS i

    compared with several other tools including MATT.

    2. METHODS

    BLOMAPS is greedy in nature. Its several steps includefinding similar fragment blocks based on a pivot proteinremoving block redundancy, constructing scaffold bychecking consistency in spatial arrangement amongfragments from different blocks, dealing with unanchoredstructures, and the final step of refinement where the averagetemplate for alignment is obtained and motifs missing fromthe pivot protein are found and added.

    2.1. Abbreviations

    Five main abbreviations SFP, AFP, SFB, HSFB andMAB are frequently used and listed as follows. Their moreprecise definitions will be given in the text.

    SFP = Similar fragment pair. As the name suggests, iis a pair of segments, each from each of twogiven proteins. SFPs are based only on locageometry.

    AFP = Aligned fragment pair. We save word alignedonly for global features such as orientation osecondary structure elements and overaltopology. An AFP must be an SFP, not the otheway around. AFPs are those SFPs which finally

    appear in the global alignment of two wholestructures.

    SFB = Similar fragment block. SFB is the counterparof SFP for multiple proteins. We use asimplified version of SFB, for which there is aseed fragment member, and all other memberare compared only with this seed. SFBs arebased only on local geometry.

    HSFB = Highly similar fragment block. For a givenfragment as the seed, there are usually manySFBs. The HSFB has its members most similarto the seed. That is, among these SFBs, for anyprotein in comparison, the member in the HSFB

    is more similar to the seed than those in anyother SFBs.

    MAB = Multi-aligned block. MAB is the counterpart oAFP for multiple proteins. MABs form the coreof a multiple structure alignment, and then canbe extracted from the alignment. Thus, an MABmust be an SFB, not the other way around.

    2.2. Conformational Alphabet

    To represent protein structures with conformationaalphabets, which are discretized conformational states o

  • 8/9/2019 Fast Multiple Alignment of Protein Structures Using Conformational Letter Blocks

    3/15

    Fast Multiple Alignment of Protein Structures The Open Bioinformatics Journal, 2009, Volume 3 71

    certain fragment units of backbones, is an old idea [23-29].BLOMAPS uses a 3D structure coding of protein backbonesconsisting of

    !C pseudobonds. Three contiguous

    !C atoms

    determine two pseudobonds and a bending angle betweenthem. Four contiguous

    !C atoms, say a , b , c and d ,

    determine two such bending angles ),( !! " and a torsion

    angle )(! which is the dihedral angle between the two

    planes of triangles abc and bcd . Since the length of pseudobones in the dominant trans configuration is almostconstant these bending and torsion angles, as the chaincounterparts of curvature and torsion of a smooth curve,maintain the three dimensional information. The smallestunit possessing one-to-one correspondence between anglesand coordinates is the quadrupeptide unit. By using amixture model for the density distribution of the threeangles, the local structural states have been clustered as 17discrete states or letters (A to Q) of a protein conformationalalphabet. When using structural codes for the structuralcomparison, a score matrix similar to the BLOSUM foramino acids is desired. Based on the alignments forrepresentative structures in the database FSSP of Holm andSander, we have constructed a substitution matrix calledCLESUM for the conformational letters. The matrix (in anupdated version) is shown in Table 1, where a scaling factorof 20 instead of 2 is used to show more details [10]. To the best of our knowledge, CLESUM is the first substitutionmatrix directly derived from structure alignments for aconformational alphabet. Among the 17 letters, H representsthe prototype of helices, while E represents that of extendedstrands. In Table 1 similar conformational letters have beengrouped closely. CLESUM reflects not only a geometricalsimilarity, but also an evolutionary similarity. For example,

    the diagonal entry for the frequent H is relatively smaldespite the high geometrical similarity between two helices.

    An essential parameter of BLOSUM is its validevolutionary distance, which is controlled by the identity rateof sequences for training. Similarly, we use the structurefamily indices of FSSP to carefully control the similaritybetween structures in our training set. A training set of too

    high similarity will make most non-diagonal entries of thesubstitution matrix negative.

    2.3. Finding Similar Fragment Blocks

    Suppose that protein P is one from the structures to bealigned. The coordinates }{

    ir of

    !C atoms of the protein are

    converted to a sequence S of conformational letters. Sinceeach letter corresponds to a quadrupeptide unit, the length oS is shorter than that of P by three. The first letter iassigned to the third residue by convention, the second to thefourth and so on, until finally the last letter is assigned to thelast residue but one. (This assignment is supported by ananalysis on the position dependence of the mutua

    information between the letters and amino acids.)For given two fragments of the same length l, one start

    at residue i of P and the other at j of another protein P

    (with conformational sequence S! ), their local structurasimilarity may be measured by

    ),,(=1

    0=

    kjki

    l

    k

    ssM++

    !

    "#$ (1

    where ),( baM is the ),( ba -entry of the CLESUM, and

    kis

    +and kjs +! are the conformational letters of the

    Table 1. CLESUM: The Conformation Letter Substitution Matrix (in Units of 0.05 Bit)

    J 37

    H 13 23

    I 16 18 23

    K 13 5 21 49

    N -2 -34 -11 28 90

    Q -44 -87 -62 -24 32 90

    L -32 -62 -41 -1 8 26 74

    G -21 -51 -34 -13 -8 8 29 69

    M 16 -4 1 12 7 -7 5 21 61

    B -57 -96 -74 -50 -11 12 -12 13 -13 51P -34 -60 -49 -36 -3 7 -12 5 8 42 66

    A -23 -45 -31 -19 10 16 -11 -6 -2 20 35 73

    O -24 -55 -34 5 15 -13 -4 -1 5 -12 4 25 104

    C -43 -77 -56 -33 -5 29 0 -4 -12 7 4 13 3 53

    E -93 -127 -108 -84 -43 -6 -21 -22 -47 15 -5 -25 -48 3 36

    F -73 -107 -88 -69 -32 3 -16 -5 -33 7 0 -20 -30 20 26 50

    D -88 -124 -105 -81 -44 14 -22 -31 -49 13 -10 -17 -42 21 22 21 52

    J H I K N Q L G M B P A O C E F D

  • 8/9/2019 Fast Multiple Alignment of Protein Structures Using Conformational Letter Blocks

    4/15

    72 The Open Bioinformatics Journal, 2009, Volume 3 Wang and Zheng

    corresponding residues. Here the same index is kept for aresidue and its conformational letter. If the pair score ! isgreater than a preset threshold T , the pair is called a similarfragment pair (SFP), which defines a correspondence of lresidue duads. Comparing each string of S with length lagainst S! , we can find all SFPs. Usually, a small l and alow T will result in a long list of SFPs.

    A rigid transformation can be found to superimpose thetwo members of a given long enough SFP and make thespatial deviation of its duad

    !C atoms very small [30, 31].

    Since an SFP is determined only by local similarity, asuperposition valid for one SFP need not be valid foranother. We define the spatial separation between twomembers of a certain SFP under a given transformation by

    |},||,||,{|max=S)',(

    jijijiFP

    jr

    ir

    zzyyxx !"!"!"#

    $ (2)

    where )',(ji rr is a residue duad of the SFP after

    transformation, and ),,( zyx denotes the 3D coordinates of

    r . A small separation ! implies a good superposition of thetwo SFP members.

    Selecting a pivot protein from a set of structures to bealigned and taking an l long fragment of the pivot as a seed,we search the structures other than the pivot for fragmentssimilar to the seed. A fragment which forms an SFP with theseed is called a neighbor of the seed. For a given seed,several neighbors may be found in a single structure. Ablock of similar fragments (SFB) may be formed by takingone neighbor from each structure which has neighbors of theseed. There is a block which consists of the seed and thoseneighbors which score the highest amongst neighbors in eachstructure possessing neighbors. Such a block is called a

    highly similar fragment block (HSFB). Of course, a seedmight have no neighbor in some structures. The total numberof fragments in an HSFB will be called the depth of theHSFB. The total score ! of an HSFB is defined as the sumof ! scores between the seed and each of its neighbors inthe HSFB. Another characteristic of an HSFB is itsconsensus, which is defined as follows. For a given set ofconformational letters, their consensus letter is defined as theletter which belongs to the set and has the highest sum ofCLESUM scores between itself and all letters in the set. Theconsensus of an HSFB is then defined as the string whichconsists of the consensus letters of columns when aligningthe row strings of the HSFB. Thus, besides the positions of

    its member fragments, an HSFB has a depth, score andconsensus.

    To examine each l long string of every conformationalsequence and find all possible SFBs is inefficient.BLOMAPS simply takes the shortest protein as the pivot tocreate all HSFBs. That is, all seeds are extracted from this protein. (When seeking SFBs, CLEMAPS conducts an all-against-all search for the best centers and finding SFBs asso-called center-stars. The center of such a block is alwaysthe consensus. However, besides the high computation cost,it is hard to efficiently remove redundancy from such blocks.) Bearing only local similarity, an SFB need not

    correspond exactly to a multi-aligned block (MAB), whichappears in the multiple structural alignment. Obviously, anMAB must be an SFB in the sense of the CLESUM scoreFor a set of closely related structures, we expect that there ia good chance of finding certain members of MABs in someHSFBs.

    For an HSFB with a large depth and score, shifts of its

    seed would plausibly also generate HSFBs with a large depthand score. To remove this redundancy, the HSFBs are sortedfirst in descending order of depth and then in that of score. A2D grid of atom indices is created with its rowscorresponding to individual proteins. The atom indices of thefirst HSFB in the grid are marked, and then the secondHSFB is examined. If the overlapping positions between thetwo HSFBs are less than lm!=" , where l is the width o

    the block, m the depth of the second HSFB and ! a

    parameter for redundancy, we fill in the grid the seconHSFB. Otherwise, we skip it, and examine the third. Whenexamining a new HSFB, the number of its positions thaoverlap with the marked indices of the grid is counted. Only

    when its overlapping proportion is less than ! do we fill inits place in the grid. This procedure is continued until the lasHSFB is examined.

    2.4. Building up a Scaffold

    Multiple structure alignment requires both the verticaequivalency and horizontal consistency. This increases thedifficulty of alignment, but also reduces the chance ofmaking a wrong alignment. That is, by superposition thewrong alignment can be detected and then discarded. Themultiple alignment algorithms which progressively merge pairwise alignments may be classified as horizontal-firsOur BLOMAPS may be regarded as vertical-first. To speed

    up, all structures are aligned against some template, and athe beginning the shortest protein is taken as the template tocreate HSFBs. BLOMAPS then starts with an HSFB takenfrom the top K HSFBs as an anchor. To choose the mosrepresentative structure, the template is updated to theprotein whose member in the anchor HSFB has the highessimilarity score with the consensus. The updating otemplate may result in that the new pivot protein does nohave a fragment in an HSFB. Since structures are alignedagainst the template, it is desired for the template to have asmany HSFB members as possible. When an HSFB lacks afragment of the pivot protein, the consensus of the HSFB isused to search the pivot for possible neighboring fragments

    with a lower threshold !T

    , and add the optimal neighbor tothe HSFB.

    The transformations to superimpose the fragments of theanchor HSFB are also used to superimpose the wholestructures where the fragments are. After havingsuperimposed all the structures which possess a fragment inthe anchor HSFB, the horizontal consistency is examined afollows (see a schematic in Fig. 1). We mark all the HSFBsand their fragments uncolored, and set up a counter foeach HSFB. If an HSFB contains a fragment of the pivo protein, the fragment will be regarded as the center of thHSFB, and separations between the center and othe

  • 8/9/2019 Fast Multiple Alignment of Protein Structures Using Conformational Letter Blocks

    5/15

    Fast Multiple Alignment of Protein Structures The Open Bioinformatics Journal, 2009, Volume 3 73

    members of the HSFB are calculated. When a separation isfound below a preset cutoff

    1d the count of the counter for

    the HSFB is advanced by one, the HSFB is marked coloredand both the fragment and the center colored. At the firsttime when such an HSFB is found, the anchor HSFB, and itstwo fragments taken for creating the transform forsuperposition are also colored. The total number of coloredHSFBs and that of colored fragments are assigned to theanchor HSFB. Having calculated the total counts for all thetop K anchor HSFBs, we select the optimal anchor HSFBto go on to the next step by inspecting first the total numberof colored HSFBs, and then the total number of coloredfragments if necessary.

    For an optimal anchor HSFB, its colored HSFBs arespatially consistent with the anchor. Thus, the coloredfragments of the pivot protein are supported bothhorizontally and vertically, and form a scaffold for multiplealignment. It may happen that the anchor HSFB has noneighbors in some structure (even after an extra search withthe block consensus), which is then regarded as an

    unanchored protein. No transformation can be found basedon the anchor HSFB to superimpose such an unanchoredprotein on the pivot protein.

    If an protein has a colored fragment in an anchor HSFB, but has no fragment in a colored HSFB, we search theprotein for fragments which do not overlap with any coloredfragments and are similar to the center either by checking thesimilarity score first, or directly examining the separationwith respect to the center fragment. If a separation is foundbeing smaller than the cutoff

    1d the corresponding fragment

    is colored and added to the HSFB. For a structure which hasa fragment in the anchor HSFB, if the number of its coloredfragments is smaller than a lower bound

    hn the protein is

    also regarded as unanchored, and otherwise as anchored (seeFig. 1).

    2.5. Improving the Scaffold

    Improving of the scaffold greatly follows the pairwiseCLEPAPS [20]. By using the colored fragments, thetransformation to superimpose two structures, which so far is based only on a single pair of fragments, can be updatedbased on more fragment pairs satisfying consistency. That is,

    for an anchored protein, the transformation optimal forsuperimposing all its colored fragments on the pivot proteinis the updated transformation for the two proteins. With thetransformation updated, we use a width l! smaller than land a more stringent cutoff

    2d to examine the consistency o

    colored fragments. If an l long colored fragment has at leasl! residues whose coordinates deviate from thei

    counterparts on the pivot protein within 2d , the fragmenremains colored, and otherwise it is changed to uncolored.

    Having examined all l long colored fragments, werecruit aligned fragment pairs (AFPs) for every anchoredprotein as follows. For a given anchored protein, the coloredfragments and their partner fragments on the pivot define theprimary AFPs. They are masked from the anchored and pivo proteins. With each fragment of length l! from theunmasked region of the pivot, the unmasked region of theanchored protein is searched for SFPs at a lenient thresholdT! . After sorting the found SFPs in descending order oscores, we examine the separations of the SFPs insuccession. Whenever a separation is found to be smallerthan

    2d and the positions of the corresponding SFP do no

    conflict with the existing AFPs, the SFP is recorded as anAFP, and its fragment on the pivot is assigned as a part othe scaffold. An extended scaffold is then obtained. TheAFPs map residues of proteins other than the pivot to thoseof the pivot protein, and define columns of residuecorrespondence. We construct the first average template byaveraging transformed coordinates of atoms over individuacolumns, and use it for dealing with unanchored structures.

    2.6. Dealing with Unanchored Structures

    The optimal anchor HSFB does not provide any guidancefor aligning an unanchored protein. However, the proteinmay still have members in other colored HSFBs. Any ofsuch members can be used to generate a transformation forsuperimposing the protein on the template for examining theconsistency of other fragments. For efficiency, we may sorthe members of colored HSFBs according to their depths orsimilarity scores, and then examine only the top K . If anunanchored protein does not have enough number omembers which can be associated with colored HSFBs, apairwise alignment is necessary.

    Fig. (1). Schematic of horizontal consistency examination. Selected by the consensus of the anchor HSFB, structure 2 is the pivot, on whichall other structures are superimposed. The anchor HSFB has 4 colored HSFBs and 13 colored fragments. For 3=

    hn structures 1 and 3 are

    anchored.

    3 anchored

    2 pivot

    1 anchored

    4 unanchored

    5 unanchored

    uncoloredHSFB

    anchorHSFB

    coloredHSFB

  • 8/9/2019 Fast Multiple Alignment of Protein Structures Using Conformational Letter Blocks

    6/15

    74 The Open Bioinformatics Journal, 2009, Volume 3 Wang and Zheng

    There is a difference between the pairwise alignmenthere and the ordinary one. The scaffold on the pivot proteinnow provides a guidance. Each fragment of length l is takenfrom the scaffold as a seed, and the unanchored protein issearched for SFPs at threshold T . After sorting the foundSFPs in descending order of scores, the top K SFPs are usedto generate transformations and the top )( KJ ! , say K10 ,

    are used for consistency checking. (The algorithm speed isinsensitive to the choice of J ). The transformation derivedfrom the fragment pair of an SFP in the top K is used tosuperimpose the unanchored protein on the template.Separations of the top J SFPs are then successively

    examined. We count the total numberp

    n of non-

    overlapping SFPs whose separations are less than2

    d . The

    SFP which has the largestp

    n among the top K and its

    consistent SFPs are used to update the transformation forsuperimposing the unanchored protein. The procedure ofrecruiting AFPs is then applied to extend the portion alignedto the scaffold. Every unanchored protein can be treated in

    this way.2.7. Refinement

    2.7.1. Updating the Average Template

    The average template can be updated once unanchoredproteins are aligned on the template and missing motifs arediscovered. A cutoff

    3d , even more stringent than

    2d , is

    applied to examine the deviation between residue duads ofAFPs and their flanking sites. Fragments are elongated orshrunken according to the deviation cutoff

    3d , and hence the

    AFPs updated. The modified AFPs lead to an updatedaverage template. This is an iteration, and its convergence is

    usually rather fast. At the final iteration step, the distancecutoff

    0d is used to control residue duad distances (instead

    of separations which are just maximal coordinatedifferences).

    2.7.2. Seeking for Missing Motifs

    The patterns or motifs considered here are ungappedfragments which are structurally similar locally, and arearranged in space very alike globally. It is doomed by thegreedy nature of the above scheme that only patterns sharedby the pivot protein have a chance to be discovered. Some patterns could be shared by a subset of structures, but beabsent from the pivot protein. They are missing motifs to

    the pivot protein. Their information has to be extracted fromthe structures sharing them.

    A motif in a set of structures must also be a motif of theirconformational sequences. Although the reverse need not betrue, the latter provides a candidate for the former. There aremany methods for discovering motifs in a set of sequences,e.g. a simple center-star approach in the sense of stringsinstead of sequences [32]. An exceptionally large proteinwill have a large proportion of blank regions, most ofwhich have no contribution to missing motifs, and hence aprocedure to directly examine all fragments would be rather

    inefficient. The inefficiency occurs also when the number ostructures is large.

    Considering the fact that the structures have beensuperposed in the space, we propose a way to rescue missingmotifs by cell registration. The space occupied by thestructures after superimposition is divided into uniform cubiccells of a finite size, say 6 . The number of different

    proteins which have their residues falling in a given cell ithe depth of the cell. The cells with their depths below apreset cutoff are discarded. The remaining cells are sorted indescending order of depth. Picking a cell of a large depth asa base, in each dimension we select from its two nearesneighbor cells the one with the higher depth to expand the base cell and double its size into an octad. The residuefalling into the octad are marked. In this octad, if a proteinwhich has at least three marked residues with their indiceswithin l! is found, we take the segment of the protein whichcovers the most marked residues as a seed, and examine althe segments which are from other proteins and have somemarked residues by checking their separation from the seedIf AFBs are found, we discard the eight cells of the octadafter finding the AFBs, and continue with the cell of the nexhighest depth until all cells are examined.

    So far it is implicitly assumed that there is a commoncore shared by the whole set of structures. In the case thestructure set is actually divided into subsets according tocommon cores, the above scheme extracts a subset based onthe pivot. The algorithm first accomplishes the alignment forthis subset, and then treats the remainders as a new input set.

    2.8. Evaluation

    The final alignment may be described with the involvedrigid transformations and viewed visually. Anothe

    convenient way is to give the complete residuecorrespondence. A full column of the correspondence hasresidues from every protein of the structure set. Usually, thecommon core of alignment is rigorously defined by all fulcolumns. A partial core may also be defined by introducinga parameter of proportion. For example, core-60 is given bycolumns covering sixty percent or more proteins of thestructure set. Assume that a new motif is found on a proteinother than the pivot. With the help of the common core, wecan map the protein together with the missing motif on thetemplate, and update the template by averaging coordinatesover residues in individual new columns. After havingsuperimposed structures on the final average template, wecan calculate the total squared deviation of aligned residue

    with respect to the template for full or partial core of thealignment, and then the RMSD (root mean square deviationfor evaluation.

    The default parameters of BLOMAPS are listed in Table 2Finally, we summarize the overall algorithm of BLOMAPS1) creating HSFBs using the shortest protein as a templatesorting HSFBs, and deriving redundancy-removed HFSBs2) for each HSFB in top K , selecting the pivot protein basedon the HSFB consensus, superimposing other proteins on the pivot, finding consistent HSFBs, and selecting the beHSFB according to the number of consistent HSFBs; 3)

  • 8/9/2019 Fast Multiple Alignment of Protein Structures Using Conformational Letter Blocks

    7/15

    Fast Multiple Alignment of Protein Structures The Open Bioinformatics Journal, 2009, Volume 3 75

    building a primary scaffold from the consistent HSFBs,updating the transformation using the consistent HSFBs,recruiting aligned fragments, and creating an averagetemplate; 4) dealing with unanchored proteins, findingmissing motifs by cell registration, and refining thealignment; and finally 5) evaluating the alignment.

    3. RESULTS

    BLOMAPS has been tested on 16 protein structureensembles as well as some large scale ensembles. The 16ensembles, covering various challenging cases of structuralalignment, are taken from several references: five from Ref.[17], three from [8], four from [9], and four from [6]. Someensembles contain structural homologies at different levels,some exhibit submotifs not shared by all members ordifferent topologies, while others contain a large number of proteins, or exhibit symmetry or repetition. The ensemblesare briefly summarized in Table 3. The meaning of theabbreviated names and original references for the ensembleslisted in the table are as follows. MicrRib: Microbialribonucleases, Subtil: Subtilisins, TIM61: a set of 61 TIMbarrels, Serin5: a set of 5 Serine proteinases, Serpin: Serpins,Thior: Thioredoxins, Beta: All beta immunoglobulins,Glob10: a set of 10 Globins, Glob16: another set of 16

    Globins, Serin68: another set of 68 Serine proteinases,CaBind: Calcium-binding proteins, CL-GL: Cofilin-like/Gelsolin-like proteins, PLP: PLP-dependenttransferases, C2: C2-domains, TIM7: another set of 7 TIM barrels, and HelBun: Helix-Bundles (with the number ofproteins less by one due to the fail in tracing PDB-ID in theupdated PDB version). DBNWa: MASS [8], DBNWb:MASS [9], SNW: MultiProt [17], YJ: Ye & Janardan [6],CK: Chew & Kedem [13].

    To conduct large scale comparisons between BLOMAPSand other tools, they should be downloadable to run locally,

    and provide readable alignments easy for further handlingFortunately, MAMMOTH-mult (abbreviated aMAMMOTH later on) [7], Mustang [16] and MATT [12] aresuch softwares available from the web or authors. They aluse dynamic programming, and build multi-alignmen progressively from pairwise alignments. Occasionally, walso manually make comparisons with some other tools. (Asmentioned before, CLEMAPS is of the CE-type, involvingmany pair comparisons. The use of conformational lettersmakes it still fast, but its alignment quality is lesssatisfactory at least for the published version. The paper of

    CLEMAPS reported that the running time for Serin68 wa27s. The same set takes BLOMAPS 1.84s. Since CLEMAPShas been compared with MAMMOTH, we shall not makefurther comparison with CLEMAPS.)

    3.1. Implementation of BLOMAPS on 16 Test Sets

    BLOMAPS uses several greedy strategies. It starts bytaking the shortest protein as a pivot for finding HSFBs. Icould happen that the shortest one poorly represents the setThe found HSFBs will then have a low quality, and evencannot pass the examination of vertical equivalency andhorizontal consistency. In this case, BLOMAPS will gewarned at the very beginning, and a second protein has to be

    taken as a new pivot. However, taking the shortest as a pivoworks for all the 16 ensembles. Furthermore, for fiveensembles (MicrRib, Subtil, TIM61, Serin5, Serpin), three owhich have members over 60, the similarity betweenmembers in a set is so high that the shortest always welrepresents the set, and the optimal anchor HSFBs, whichalways have members from all proteins in the set, directlycorrespond to a MAB (except for one member in 63 ofMicrRib). They are then trivial cases. The top few HSFBof local similarity also agree with the global alignment. Merestring comparison for conformational codes with theCLESUM matrix can lead to the right answer.

    Table 2. Default Parameters of BLOMAPS

    Symbol Value Meaning

    l 12 width of wide SFBs

    T 200 similarity threshold for wide HSFBs

    K 5 number of wide HSFBs tested as anchors

    l! 8 width of narrow SFPs for scaffold improvement

    T! 100 similarity threshold for narrow SFPs

    ! 0.5 overlap proportion for removing redundancy of HSFBs

    ! 4 minimum length of aligned fragments

    0d 5 distance cutoff for evaluating overall alignment

    1d 10 separation threshold for consistency

    2d 7.5 separation bound for recruiting AFPs based on the pivot

    3d 5 separation cutoff for recruiting AFPs based on the template

    hn

    3 minimal number of colored fragments for an anchored protein

  • 8/9/2019 Fast Multiple Alignment of Protein Structures Using Conformational Letter Blocks

    8/15

    76 The Open Bioinformatics Journal, 2009, Volume 3 Wang and Zheng

    Although these trivial cases are not the best examples toshow the power of the conformational letter description wemention a pair of structures, 1gci: 1gnsA, in the set Subtil(see Table 3), whose optimal anchor HSFB is supported by12 other colored HSFBs. The amino acid identity rate for thepair in these 13 HSFBs as the aligned proportion is 88/156while 120/156 of their conformational letters are identical. Asimple elongation of colored HSFBs with duads of positive

    CLESUM scores with respect to the pivot protein leads tototal of 216 aligned positions for the set.

    The first notable ensemble is the set of Thioredoxins(Thior in Table 3). The shortest protein 1fo5A is of length85, rather shorter than all the rest nine (between 105 and112). The HSFB of rank 1 is selected as the optimal anchorHSFB, having six colored HSFBs. (Here, rank 1 means thatthe HSFB is ordered first according to its depth and then toits total score among HSFBs.) The seed of the HSFB hascodes FEEENOGCEDEQ from 1fo5A, while the consensusof the HSFB selects EEEENOGCPLDE of 2tir as arepresentative, then 2tir becomes the pivot protein. The shiftof center member results in a score change of the HSFBfrom 439 to 688. This HSFB happens to be an MAB.Interestingly, being supported by only two coloredfragments, 1fo5A turns to be unanchored. Its member in theanchor HSFB may be regarded as a weak member. That is,although it finally appears in the global alignment, it is notstrong enough for inferring the alignment. However, SFPs ofother colored HSFBs can still be used to superimpose 1fo5Aagainst the pivot for checking horizontal consistency, andthen align it. If the members of a protein in all the coloredHSFB are wrong or week, which should be rare, we have to

    conduct a direct pairwise alignment to align the proteinagainst the scaffold. However, this does not happen here.

    The next notable ensemble is the set of 16 globins(Glob16), which was first studied by Chew and Kedem [613]. The shortest protein is 1eca. The optimal anchor HSFBwith 1bdbA being the pivot selected by the consensus of theHSFB, has 6 colored HSFBs, all with depth 16. The tota

    number of colored fragments is 38. Since an HSFB reflectsonly the local similarity it need not be an MAB. For this sethe anchor HSFB has two wrong members from 1eca and2hbg, i.e., it is not the fragment appearing in the finaalignment. No supported SFPs are found for these two wrongfragments in the examination of horizontal consistency, andthey are then rejected. In fact, there are two more memberof the HSFB which fail in the horizontal consistencyexamination. The scaffold after recruiting aligned fragmentconsists of 17 pieces. By using the colored HSFBs, the fourstructures unanchored so far are easily aligned against thiscaffold.

    Ensemble C2 consists of ten proteins, fou

    Synaptotagmin-like proteins and six PLC-like proteinstaken from two families of the C2 domain superfamily [8]The two families are related by a circular permutation whileeach forms a topological group. The lengths spread over awide range from 123 (1bdyA) to 841 (1qasA). No dynamicprogramming is conducted in MASS, so the non-topologicaalignment of this ensemble was detected by MASS [8]. Thecore of the MASS alignment forms a sandwich of eight !

    stands, which will be indexed from a to h . The modealignment for the four Synaptotagmin-like proteins is

    Table 3. The 16 Ensembles Used to Test BLOMAPS. The Last Five Columns are the Rank, Depth, Number of Wrong Members o

    the Optimal Anchor HSFB and Numbers of its Colored HSFBs and Fragments, respectively

    Name Ref Size L Rank Depthrong

    nw

    b

    N fN

    MicrRib DBNWb 63 100:104 5 63 1 5 309

    Subtil DBNWb 60 263:281 2 60 0 13 770

    TIM61 DBNWb 61 385:443 4 61 0 22 1217

    Serin68 DBNWa 68 181:396 1 68 8 9 410

    Serin5 SNW 5 274:279 2 5 0 14 58

    Serpin SNW 13 337:420 1 13 0 16 150

    Thior YJ 10 85:112 1 10 0 6 38

    Beta YJ,CK 6 95:115 1 6 0 4 16

    Glob10 YJ 10 136:158 3 10 3 7 32

    Glob16 YJ,CK 16 136:158 2 16 2 6 38

    CL-GL DBNWb 12 96:174 2 12 0 5 32

    PLP DBNWa 11 361:730 1 11 4 15 54

    C2 DBNWa 10 123:841 1 10 4 6 26

    CaBind SNW 6 75:185 3 6 4 3 7

    TIM7 SNW 7 247:491 5 7 5 4 10

    HelBun SNW 9 106:159 3 9 5 4 11

    Size: Ensemble size, L: Length range.

  • 8/9/2019 Fast Multiple Alignment of Protein Structures Using Conformational Letter Blocks

    9/15

    Fast Multiple Alignment of Protein Structures The Open Bioinformatics Journal, 2009, Volume 3 77

    abcdefgh in the element indices while that for the six

    PLC-like proteins is bcdefgha . In the MASS alignment

    element d was absent in 3rpbA and 1bdyA. Any multiplealignment tool which uses dynamic programming is able todiscover only collinear alignment. For example, in theMATT alignment element a of PLC-like (near C-terminus) was missing, but element d exists in all the 10

    structures. The missing element a counts 8 columns, andmakes the MATT alignment shorter than BLOMAPS.According to PDB files the segments corresponding toelement d of 3rpbA and 1bdyA are not annotated as sheets,which explains the missing of the element in the alignmentof MASS, a secondary structure based tool. In fact, theconformational codes of element d for the whole set aretypical of strands. BLOMAPS is able to detect all the eightelements.

    Compared with the other members in ensemble C2, protein 1qasA is tremendously large (about seven timeslarger than the smallest). This ensemble is a good example todemonstrate the cell registration technique for missingmotifs. Besides the above mentioned core of eight ! -

    strands, there are five submotifs, which are not shared by all proteins. After the 10 structures have been superimposedtogether, they occupy a volume of 1980=111512 !! (in

    units of 6.0 for sides). Among the 1980 cells, only 101cells contain points from at least three proteins. By sortingthese cells in descending order of depth, The five subpatternsare discovered using the first, second, fifth, sixth and eighthcells. The third, fourth and seventh cells are removed duringthe octa formation. (A cell size of 5.0 has been tested, andalso works.)

    Another ensemble showing subset alignment is CL-GL,

    which consists of 12 structures belonging to the fold Actindepolymerizing proteins, four from the Cofilin-like(CL)family and eight from the Gelsolin-like (GL) family. Thetwo families share five ! -strands (indexed from a to e )

    and two ! -helices (indexed as 3 and 4), and family CL hastwo additional helices (indexed as 1 and 2). Written in theseindices, the structurally conserved common core is eacd3 ;the CL family is characterized by ecdab 321 while GLfamily by 43eabcd [9]. The shortest structure, being 1d0n,and the pivot protein selected by the consensus of theoptimal anchor HSFB are the same one. It belongs to the GLfamily. The anchor HSFB is supported by 5 colored HSFBsand 32 colored fragments. Although the HSFB is an MAB,i.e., all its members appear in the final alignment, there arestill three members from CL family which fail in theconsistency examination. The first scaffold is then built based on the rest nine structures. After recruiting alignedfragment pairs the improved scaffold consists of 10 pieces.There is no difficulty to align the three unanchored structuresagainst the scaffold by using colored HSFBs. Since the pivotstructure belongs to the GL family, the two additional helicesspecific to the CL family have to be detected as missingmotifs. By cell registration we find not only these two CLhelices, but also many other subpatterns. For example, thehelix 4 is split into two submotifs according to CL and GL,

    and it is missing in CL protein 1f7sA. However, 1f7sAshares with the two other CL members some submotifswhich are missing in CL member 1ak6. If looking only at thecore of full aligned columns, the BLOMAPS alignmenagrees with those of other tools, but is a little longer.

    An example of large ensemble is Serin68, whichcomprises 68 molecules of the SCOP family Prokaryotic

    trypsin-like serine protease [33]. Aligning this ensemble isnot acceptable to the MAMMOTH-mult web server due toits large size (although it is not an actual limit ofMAMMOTH). The optimal anchor HSFB has 68 membersand is top-1 HSFB in the sorted list, with 8 members beingwrong. Protein 1csoE is selected as the pivot by theconsensus of this HSFB. It is supported by eight otherHSFBs, seven of which have their depths 68 (full). Thetransformations based on single SFPs of the anchor HSFBare able to align 60 structures against all the nine fragmentsof the primary scaffold. There is no difficulty in aligning the8 unanchored proteins on the scaffold by using coloredHSFBs other than the anchor. The lengths of proteins varyover a large range in this ensemble, which requires a step formissing motifs. It should be pointed out that this ensemble ishighly redundant. At least 40 members have lengths between195 and 198; their members in the top five HSFBs are highlyidentical in both amino acids and conformational letters. Thecore of alignment for these 40 structures has 195 residueswith RMSD 0.139 .

    Ensemble CaBind has six proteins of EF hand-likesuperfamily extracted from three families [17]. According tothe pairwise relation, the ensemble can be split into twogroups: non-repetitive (3icb, and 4cpv) and repetitive(2scpA, 1scmB, 1top and 2sas). For the latter, a pairwisealignment of a single structure against itself admits multiple

    solutions. This set is rather tricky. With the shortest 3icb, 10HSFBs are found. The consensus of the optimal HSFB picksup 2sas as the pivot. However, due to a lack of enoughcolored fragments no structures are anchored, and a pairwisealignment becomes necessary. After aligning the fivestructures against 2sas, the common core of BLOMAPSfinally has 56 residues. The BLOMAPS alignment is ratheclose to that of MATT.

    Since various criteria are used it is difficult to define ageneral comparison between different aligning methods. Acommon goal for structural alignment is to minimize thedeviation of the conserved core while maximizing the size othe core. We set up a cutoff 3.0 to standardize the

    deviation criterion for the core: only when the root meansquared deviation (RMSD) of the residues in a given alignedcolumn with respect to their corresponding site in theaverage template is smaller than 3.0 will the column bekept in the core. Note that the RMSD here is in the sense of asingle aligned column. The core size obtained with thiscutoff is denoted by

    cN . Such

    cN can be compared directly

    Besides, an important measure for comparison is someidentity rate between the compatible cores of alignments otwo methods. Here we take the BLOMAPS alignment as areference. For example, when we compare BLOMAPS with

  • 8/9/2019 Fast Multiple Alignment of Protein Structures Using Conformational Letter Blocks

    10/15

    78 The Open Bioinformatics Journal, 2009, Volume 3 Wang and Zheng

    MATT, usually corresponding columns from two cores ofalignment can be easily recognized by counting the identicalsite indices, which should not be less than the half of theensemble size. The identical indices are summed, afterdivided by the ensemble size, to give the effective number ofidentical columns

    0N . To include also small shifts, we

    count nonidentical site indices which deviate at most foursites (one turn of a helix) in the related columns. The number

    of site indices is converted to another effective number ofcolumns

    N . The number

    + NN

    0may be then taken as a

    measure of accordance between two alignments. Thecomparison of BLOMAPS with MUSTANG, MAMMOTHand MATT on 12 of the 16 sets is summarized in Table 4.For the 4 sets with set size over 60, either the size is beyondthe limit of a tool, or the running time is too long except forBLOMAPS, so these sets are not included in the table.BLOMAPSs

    cN for these four large sets MicrRib, Subtil,

    Tim61 and Serin68 are 99, 257, 364 and 120, respectively. Arunning time comparison conducted on subsets of varioussizes extracted from Tim61 will be presented later. One caseof discrepancy between BLOMAPS and other toolsmentioned above is ensemble CaBind due to repetition.Ensembles CaBind, TIM7 and HelBun contain membersfrom different superfamilies or even different folds, besidesthe symmetry and repetition. Thus, observing discrepancyamong different methods in these ensembles is not sosurprising. It is seen that, in absence of symmetry andrepetition, alignments of BLOMAPS usually agree withthose of other three tools; BLOMAPS is most close toMATT. As a rule, the aligned length obtained by a tool usingdynamic programming is longer and scrappier than by a toolwithout it. BLOMAPS superadds a minimum length ofaligned fragments which is set to four as default. To make a

    close comparison, the aligned lengths !c

    N with this same

    restriction (applied before the cutoff 3.0 for fairness) arealso given in the table.

    3.2. Large Scale Comparison of BLOMAPS with other

    Tools

    Large scale tests are conducted on two sets: ASTRAL40-

    fam and SABmark-sup. ASTRAL40 is extracted from %40percentage identity filtered ASTRAL SCOP genetic domainsequence subsets [34]. By excluding families with memberless than 3 or over 25, and three more families containingmember proteins which cannot be traced by their PDBindices in the updated version of PDB, ASTRAL40-famcovers 852 SCOP families. Taking the BLOMAPSalignment as reference, we divide the set into two subsetsalike and unlike. For a family in subset alike,

    cN o

    BLOMAPS is never less than 85% of the largest (bestamong the three

    cN values from MUSTANG, MAMMOTH

    and MATT. At the same time,

    + NN0

    between

    BLOMAPS and the best tool is never less than 85% of thesmaller in the twoc

    N values. (When a tool reports no

    alignment for a family, thec

    N is assigned as zero.) Thi

    subset alike has 783 families, and the rest 69 families formsubset unlike. The comparison of BLOMAPSMUSTANG, MAMMOTH and MATT is summarized inTable 5.

    The original SABmark, designed as a sequencealignment benchmark, provides structurally alignable groupsof SCOP superfamily level, and consists of two subsetsTwilight Zone (-twi) and Superfamilies (-sup) [35]. The

    Table 4. Comparison of BLOMAPS Alignments with those of MUSTANG, MAMMOTH and MATT

    BLOMAPS MUSTANG MAMMOTH MATT

    Name Sizeean

    Lm

    !

    cN

    !

    ccNN :

    + NN

    0

    !

    ccNN :

    + NN

    0

    !

    ccNN :

    + NN

    0

    Serin5 5 277 238 236 230 224+ 6 241 239 227+ 7 243 232 224+11

    Serpin 13 369 305 301 299 289+ 9 303 297 285+14 307 301 293+10

    Thior 10 105 75 74 67 66+ 4 80 80 71+ 2 76 73 66+ 4

    Beta 6 107 76 77 76 71+ 2 79 76 70+ 1 78 69 69+ 3

    Glob10 10 147 116 116 112 110+ 2 117 116 110+ 1 118 118 111+ 3

    Glob16 16 146 99 98 93 89+ 3 98 95 90+ 2 105 98 87+10

    CL-GL 12 126 66 62 62 49+11 62 59 45+12 64 58 49+12

    PLP 11 443 198 143 119 118+15 167 154 124+19 194 131 124+23

    C2 10 258 84 4 3 0+ 0 32 30 26+ 1 73 62 67+ 3

    CaBind 6 140 56 41 41 28+ 3 0 0 0+ 0 59 52 50+ 5

    TIM7 7 390 91 1 0 0+ 0 0 0 0+ 0 69 50 11+ 4

    HelBun 9 131 60 0 0 0+ 0 18 18 4+ 1 33 29 13+ 1

    Lmean

    : the mean length.c

    N : the number of fully aligned columns at cutoff 3.0 without the restriction on minimum length of aligned fragments; !c

    N : the number of the ful

    columns with the restriction set to 4 (applied before cutoff 3.0 ).0

    N (or

    N ): effective numbers of fully aligned columns (inc

    N ) which coincide with those of BLOMAPS (or

    shift at most four sites).

  • 8/9/2019 Fast Multiple Alignment of Protein Structures Using Conformational Letter Blocks

    11/15

    Fast Multiple Alignment of Protein Structures The Open Bioinformatics Journal, 2009, Volume 3 79

    former is more divergent in sequences than the latter. Thus,SABmark-twi might be better for the case study, and for thelarge scale comparison we use only SABmark-sup of 425

    groups, each of which contains 3 to 25 members. (Note thatthe false positive sequence counterparts for the purpose ofdiscriminant analysis are never used here.) Ten groupscontain member proteins which cannot be traced by theirPDB indices. The remain 415 groups are then used forcomparison. The set can also be divided into alike andunlike subsets with the same criteria for ASTRAL40-fam.The averaged values of

    cN ,

    0N and

    N for the two subsets

    are summarized in Table 6.

    Finally, we give three examples of case study in moredetails. The first example, taken from SABmark-sup (Group323), is the SCOP superfamily Glutaminesynthetase/guanido kinase of three domains: d1m15a2,d1qh4a2 of family Guanido kinase and d1f52a2 of familyGlutamine synthetase with a fold description of the commoncore being two beta-alpha-beta2-alpha repeats. Domaind1f52a2 is longer than the other two by about 100. Written inhelical elements (indexed from 1 to 5) and beta strands(indexed from a to g), the BLOMAPS alignment is

    1a2bc3de4fg5 (from the N- to C-terminus), which welrepresents the fold. Values of

    cN (at cutoff 3.0 ) are 147

    21, 34 and 74 for BLOMAPS, MUSTANG, MAMMOTH

    and MATT, respectively. MATT aligns 1a2bc3 of domaind1f52a2 to 3e4fg5 of the other two. Alignments reported byMUSTANG and MAMMOTH are not better than that oMATT. Comparative alignments are illustrated in Fig. (2).

    The second example, taken from ASTRAL40-famconsists of 12 proteins of SCOP family Thiolase-relatedThis is an example of submotifs. The 12 proteins form threegroups: I=(d1m3ka2, d2bywa2, d1ox0a2, d1tqyb2, d1tqya2)II=(d1tqyb1, d1tqya1, d2bywa1, d1ox0a1), andIII=(d1wdkc1, d1afwa1, d1m3ka1). With helical element being indexed by arabic digits and strands by letters, thcommon core for the whole family can be characterized as

    dcba 321 , while group I as 4321 !!!! dcccba , group II as433222111 !!!!!!!!!!!! ddcbaaa , and group III as

    7645321 hfgdcba . The segment cc !!! , consisting of a long

    loop intervened between two very short strands c! and c !! , imissing in d1m3ka2 of I; elements d! and 4! are missing ind1tqyb1 of II. The BLOMAPS alignment of the family ishown in Fig. (3). The aligned cores of BLOMAPS

    Table 5. Comparison Between BLOMAPS, MUSTANG, MAMMOTH and MATT on Test Set ASTRAL40-fam

    Alike(783) BLOMAPS MUSTANG MAMMOTH MATT

    BLOMAPS 123.3 5.3 5.7 7.2

    MUSTANG 103.7 114.0(109.8) 4.9 5.4

    MAMMOTH 106.8 101.2 117.4(115.3) 6.1

    MATT 110.2 103.8 106.1 124.2(116.5)

    Unlike(69) BLOMAPS MUSTANG MAMMOTH MATT

    BLOMAPS 57.7 1.9 2.9 3.2

    MUSTANG 23.2 43.4 2.9 2.2

    MAMMOTH 21.4 22.6 44.6 2.8

    MATT 29.5 29.4 28.5 65.0

    Average )( !cc

    NN (diagonal),0

    N (lower-left) and

    N (upper-right) are listed for two subsets of alike and unlike.

    Table 6. Comparison Between BLOMAPS, MUSTANG, MAMMOTH and MATT on Test Set SABmark-Sup

    Alike(345) BLOMAPS MUSTANG MAMMOTH MATT

    BLOMAPS 105.0 5.6 6.0 7.8

    MUSTANG 80.6 91.4(87.6) 5.0 5.7

    MAMMOTH 82.5 74.9 94.0(92.0) 6.8

    MATT 89.0 79.8 80.7 104.6(96.4)

    Unlike(70) BLOMAPS MUSTANG MAMMOTH MATT

    BLOMAPS 53.0 2.5 3.4 3.5

    MUSTANG 16.1 34.3 2.3 2.6

    MAMMOTH 15.5 13.4 36.1 4.1

    MATT 21.5 18.7 20.5 57.1

    Average )( !cc

    NN (diagonal),0

    N (lower-left) and

    N (upper-right) are listed for two subsets of alike and unlike.

  • 8/9/2019 Fast Multiple Alignment of Protein Structures Using Conformational Letter Blocks

    12/15

    80 The Open Bioinformatics Journal, 2009, Volume 3 Wang and Zheng

    MUSTANG and MATT are very similar, while that ofMAMMOTH is much shorter.

    The third example, taken also from ASTRAL40-fam, isthe SCOP family Legume lectins of 4 proteins in class All beta. Domains d1gzca represents itself and other twostructures (d1v6ia and d1dhkb), while d1nlsa forms anothergroup of the family. Each structure consists of two segmentsA and B of eight strands. They can also be identified in thealigned core of BLOMAPS; the lengths corresponding to A

    and B are of 102 and 80 residues. Structure d1gzca ischaracterized as AB while d1nlsa as BA . The two segmentscannot be related by a simple symmetry of rotation. TheBLOMAPS alignment of the family is shown in Fig. (4). Analigning tool using dynamic programming can at most findthe longer segment A as the core. Indeed, while

    cN of

    BLOMAPS is 182, those of MUSTANG, MAMMOTH andMATT are 102, 101 and 103, respectively.

    A comparison on the running time is summarized inTable 7. BLOMAPS is the fastest. Furthermore, the running

    Fig. (2). Comparative alignments of SABmark-sup Group 323. (a) The core of BLOMAPS alignment is shown with bold lines for d1f52a2

    (in blue), d1m15a2 (in green) and d1qh4a2 (in red). A more stringent cutoff2.5

    is used for a perceptible visualization. The fragments ofthis core will be also shown in the following alignments of MUSTANG, MAMMOTH and MATT as insets. (b) The core of MUSTANGalignment, (c) The core of MATT alignment, and (d) The core of MAMMOTH alignment. (Pictures were drawn with RasMol [36]).

    Fig. (3). BLOMAPS alignment of SCOP family Thiolase-relatedThe full core is shown with bold lines, and partial core with thicklines. Structures in groups I, II and III are colored in blue, greenand red tones, respectively. (Pictures were drawn with RasMo[36]).

  • 8/9/2019 Fast Multiple Alignment of Protein Structures Using Conformational Letter Blocks

    13/15

    Fast Multiple Alignment of Protein Structures The Open Bioinformatics Journal, 2009, Volume 3 81

    time for BLOMAPS grows linearly with the set size incontrast to the quadratic growth for the other three tools.

    Fig. (4). BLOMAPS alignment of two representative structures1gzca (brown) and 1nlsa (cyan) in SCOP family Legume lectins.The common core is shown with bold lines. The N - and C-termini of the two structures are indicated, showing a topologicalshift. (Pictures were drawn with RasMol [36]).

    4. DISCUSSION

    BLOMAPS distinguishes itself from most other existingalgorithms for multiple structure alignment by its use of

    conformational letters. The description of 3D segmentalstructural states by a few discrete conformational lettersgives a compromise between precision and simplicity. Thesubstitution matrix CLESUM provides us with a propermeasure of the similarity between these discrete states orletters. Such a description fits ! -congruent problems verywell [17]. Furthermore, extracted from the database FSSP ofstructure alignments, CLESUM contains information of thestructure database statistics. For example, scores betweentwo frequent helical states are relatively low, which reduces

    the chance of accidental matching of two irrelevant helicesThe conversion of coordinates of a 3D structure to itsconformational codes requires little computation. Once wehave transformed the 3D structures to 1D sequences oletters, tools for analyzing ordinary sequences can be directlyapplied. The use of conformational letters for a fast locasimilarity search can be integrated in many existing tools toimprove their efficiency.

    BLOMAPS is developed from our CLEPAPS, a tool for pairwise structure alignment, which is based on similafragment pairs (SFPs) defined by CLESUM scores fromstring comparison. CLEPAPS takes a single good SFP as aninitial correspondence, and iteratively builds up largecorrespondence with ever stringent thresholds of zoom-inCLEPAPS adopts a greedy strategy guided by CLESUMscores. When dealing with the multiple alignment, weinevitably encounter the problem of combinatorial explosionThe concept of similar fragment pairs is extended to that oaligned fragment blocks. A multiple alignment has to satisfythe equivalency among the fragments of differen proteinsinside an aligned fragment block, and, at the same time, theconsistency among different aligned fragment blocks of thealignment. The strong restriction of both equivalency andconsistency reduces the chance of making an insignificanalignment. Our heuristic to avoid the combinatoriaexplosion includes the using of a single pivot protein andHSFBs (instead of SFBs). Wrong assignment ocorrespondence can be detected, then removed, and replaced by a correct one in a later stage. In contrast with MATTBLOMAPS uses a zoom-in technique to go from local toglobal alignment. Thanks to the operation reduction tomerely string comparison, greedy strategy guided byCLESUM scores and the cell-registration technique, theimplement of BLOMAPS is considerably fast. At the same

    time, its alignment quality is competitive with otherprograms.

    Contrary to CLEPAPS, for multi-alignment BLOMAPSexcludes multiple solutions. Thus, it is encouraged to have asurvey of the symmetry and repetition on the structures to bealigned, for example, by running a pairwise alignment on arandomly picked structure with itself. An inspection onconformational sequences can also be informative. Inexistence of a symmetry, the possibility of a cycle permutation should be examined. For repetition, we ma

    Table 7. Comparison of Running Time (in Units of s) Among BLOMAPS, MUSTANG, MAMMOTH and MATT on Different Sets

    Test Set BLOMAPS MUSTANG MAMMOTH MATT

    C2 (10 structures) 0.16 508.53 6.50 151.58

    ASTRAL40-fam (averagedover 852 groups)

    0.15 73.77 2.70 63.68

    SABmark-sup (averagedover 415 groups)

    0.26 167.87 3.90 97.88

    First 5 structures of TIM61 0.20 46.20 4.31 137.42

    First 10 structures of TIM61 0.40 181.19 16.97 718.05

    First 15 structures of TIM61 0.61 450.25 36.97 1744.16

    First 20 structures of TIM61 0.81 807.83 67.06 3193.61

  • 8/9/2019 Fast Multiple Alignment of Protein Structures Using Conformational Letter Blocks

    14/15

    82 The Open Bioinformatics Journal, 2009, Volume 3 Wang and Zheng

    mask the already aligned portion of a structure, and thenhave a second run of alignment. This technique of maskingis also useful for detecting components with a domain moveor translations and twists of MATT.

    It should be emphasized that all the results presentedabove are derived with the same set of the default parameterslisted in Table 2. The tuning of parameters is generally

    crucial to an optimal performance of an algorithm. Forexample, for BLOMAPS a large value of fragment width lor similarity threshold T would reduce search times, but atthe price of sensitivity. A too long l would leads to too fewHSFBs. Similarly, a too high T would result in too sparseHSFBs. Without an enough number of dense HSFBs anefficient checking of vertical equivalency and horizontalconsistency becomes impossible. Our strategy is to usemoderately stringent parameters first for building a reliableprimary scaffold for alignment, and then fill in the missing blanks for later compensation of the sensitivity loss withrelaxed parameters. Under the assumption that structures to be aligned are independent of each other, a relatively low

    thresholdT

    , which might be weak for a pairwise alignment,can still be significant for a multiple alignment.

    The large ensembles we have tested are taken from theliterature and all highly redundant. Extremely closestructures can be detected rather reliably by merely aligningtheir conformational letter sequences. To conduct analignment only for representative structures is more efficientand less biased.

    CLESUM only considers information of conformation.However, the FSSP alignments from which CLESUM wasderived also contain the amino acid information. The use ofmodified CLESUM matrices that also include suchinformation would illuminate the biochemical role in

    alignment [10]. Recently, we have clustered amino acids intotwo groups AVCFIWLMY (type h ) and

    TDEGHKNPQRS (type p ), and then obtained CLESUM

    matrices of types p - p , h -h and h - p [11]. We expect

    that such matrices would further improve the efficiency ofour tools for structure alignments.

    ACKNOWLEDGEMENTS

    We are grateful to the authors of MAMMOTH-mult for providing their source code. This work is supported byNational Natural Science Foundation of China and NationalBasic Research Program of China (2007CB814800).

    REFERENCES[1] H. Hasegawa and L. Holm, Advances and pitfalls of protein

    structural alignment, Curr. Opin. Struct. Biol., vol. 19, pp. 341-348, 2009.

    [2] M.A. Marti-Renom, E. Capriotti, I.N. Shindyalov and P.E. Bourne,Structure Comparison and Alignment, In: Structural

    Bioinformatics, 2nd ed. J. Gu and P.E. Bourne, Eds., John Wiley &Sons: USA 2009, pp. 397-417.

    [3] L. Holm and C. Sander, The FSSP database of structurally alignedprotein fold families, Nucleic Acid Res., vol. 22, pp. 3600-3609,1994.

    [4] L. Holm and C. Sander, Dali/FSSP classification of three-dimensional protein folds,Nucleic Acid Res., vol. 25, pp. 231-234,1997.

    [5] I.N. Shindyalov and P.E. Bourne, Protein structure alignment byincremental combinatorial extension (CE) of the optimal path

    Protein Eng., vol. 11, pp. 739-747, 1998.[6] J. Ye and R. Janardan, Approximate multiple protein structur

    alignment using the sum-of-pairs distance, J. Comput. Biol., vol11, pp. 986-1000, 2004.

    [7] D. Lupyan, A. Leo-Macias and A. R. Ortiz, A new progressiveiterative algorithm for multiple structure alignment

    Bioinformatics, vol. 21, pp. 3255-3263, 2005.[8] O. Dror, H. Benyamini, R. Nussinov and H. Wolfson, MASS

    Multiple structural alignment by secondary structuresBioinformatics, vol. 19, pp. i95-i104, 2003.[9] O. Dror, H. Benyamini, R. Nussinov and H. Wolfson, Multipl

    structural alignment by secondary structures: Algorithm andapplications,Protein Sci., vol. 12, pp. 2492-2507, 2003.

    [10] W.M. Zheng and X. Liu, A protein structural alphabet and itsubstitution matrix CLESUM, In: Lecture Notes

    Bioinformatics, vol. 3680, C. Priami and A. Zelikovsky, EdsBerlin: Springer Verlag, 2005; pp. 59-67.

    [11] W.M. Zheng, Protein Conformational Alphabets, In: ProteinConformations: New Research, L.B. Roswell, Ed. Nova SciencPublishers, 2008, pp. 1-49.

    [12] M. Menke, B. Berger and L. Cowen, Matt: Local flexibility aidprotein multiple structure alignment, PLoS Comput. Biol., vol. 4no. e10, pp. 88-99, 2008.

    [13] L.P. Chew and K. Kedem, Finding the consensus shape of protein family, In: 18th Annual ACM Symposium on

    Computational Geometry, ACM, 2002, pp. 64-73.[14] C. Guda, S. Lu, E.D. Sheeff, P. E. Bourne and I. N. ShindyalovCE-MC: A multiple protein structure alignment server, Nuclei

    Acids Res., vol. 32, pp. W100-W103, 2004.[15] Y. Ye and A. Godzik, Multiple flexible structure alignment usin

    partial order graphs,Bioinformatics, vol. 21, pp. 2362-2369, 2005[16] A. Konagurthu, J. Whisstock, P. Stuckey and A. Lesk

    MUSTANG: A multiple structural alignment algorithmProteins, vol. 64, pp. 559-574, 2006.

    [17] M. Shatsky, R. Nussinov and H. Wolfson, MultiProt - A multipl protein structural alignment algorithm, In: Lecture Notes Computer Science, vol. 2452, R. Guigo and D. Gusfield, Eds.Springer Verlag: Rome 2002, pp. 235-250.

    [18] J. Ebert and D. Brutlag, Development and validation of consistency based multiple structure alignment algorithm

    Bioinformatics, vol. 22, pp. 1080-1087, 2006.[19] X. Liu, Y. Zhao and W.M. Zheng, CLEMAPS: Multipl

    alignment of protein structures based on conformational lettersProteins, vol. 71, pp. 728-736, 2008.

    [20] S. Wang and W.M. Zheng, CLEPAPS: Fast pair alignment o protein structures based on conformational letters, J. BioinformComput. Biol., vol. 6, pp. 347-366, 2008.

    [21] P. Lackner, W.A. Koppensteiner, M.J. Sippl and F. S. DominguesProSup: A refined tool for protein structure alignment, Protein

    Eng., vol. 13, pp. 745-752, 2000.[22] W.M. Zheng, The use of a conformational alphabet for fas

    alignment of protein structures, In: Lecture Notes in ComputeScience, Springer: USA, 2008, vol. 4983, pp. 331-342.

    [23] M.J. Rooman, J.P. Kocher and S.J. Wodak, Prediction of protein backbone conformation based on seven structure assignmentInflunce of local interactions,J. Mol. Biol., vol. 221, pp. 961-9791991.

    [24] B.H. Park and M. Levitt, The complexity and accuracy of discretstate models of protein structure, J. Mol. Biol., vol. 249, pp. 493

    507, 1995.[25] T. Edgoose, L. Allison and D.L. Dowe, An MML classification oprotein structure that knows about angles and sequences, In: 3rdPacific Symposium on Biocomputing, 1998, pp. 585-596.

    [26] A.C. Camproux, P. Tuffery, J.P. Chevrolat, J.F. Boisvieux and SHazout, Hidden markov model approach for identifying thmodular framework of the protein backbone, Protein Eng., vol12, pp. 1063-1073, 1999.

    [27] B. Offmann, M. Tyagi and A.G. de Brevern, Local proteinstructures, Curr. Bioinform., vol. 2, pp. 165-202, 2007.

    [28] R.W. Montalvao, R.E. Smith, S.C. Lovell and T.L. BlundellCHORAL: A differential geometry approach to the prediction othe cores of protein structures, Bioinformatics, vol. 21, pp. 37193725, 2005.

  • 8/9/2019 Fast Multiple Alignment of Protein Structures Using Conformational Letter Blocks

    15/15

    Fast Multiple Alignment of Protein Structures The Open Bioinformatics Journal, 2009, Volume 3 83

    [29] P.L. Chang, A.W. Rinne and T.G. Dewey, Structure alignment based on coding of local geometric measures, BMC Bioinform.,vol. 7, pp. 346-356, 2006.

    [30] W. Kabsch, A discussion of the solution for the best rotation torelated two sets of vectors, Acta Crystallogr., vol. 34A, pp. 827-828, 1978.

    [31] S. Umeyama, Least-squares estimation of transformationparameters between 2-point patterns, IEEE Trans. Pattern Anal.Machine Intel., vol. 13, pp. 376-380, 1991.

    [32] W.M. Zheng, Relation between weight matrix and substitution

    matrix: motif search by similarity, Bioinformatics, vol. 21, pp.938-943, 2005.

    [33] A.G. Murzin, S.E. Brenner, T. Hubbard and C. Chothia, SCOP: Astructural classification of proteins database for the investigation osequences and structures, J. Mol. Biol., vol. 247, pp. 536-5401995.

    [34] J.M. Chandonia, G. Hon, N.S. Walker, L.L. Conte, P. Koehl, MLevitt and S.E. Brenner, The ASTRAL compendium in 2004

    Nucleic Acids Res., vol. 32, pp. D189-D192, 2004.[35] I.V. Walle, I. Lasters and L. Wyns, SABmark --- a benchmark fo

    sequence alignment that covers the entire known fold spaceBioinformatics, vol. 21, pp. 1267-1268, 2005.

    [36] R.A. Sayle and E.J. Milner-White, RasMol: Biomoleculagraphics for all, Trends Biochem. Sci., vol. 20, pp. 374-376, 1995

    Received: September 09, 2009 Revised: October 09, 2009 Accepted: October 09, 2009

    Wang and Zheng; LicenseeBentham Open.

    This is an open access article licensed under the terms of the Creative Commons Attribution Non-Commercial License(http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted, non-commercial use, distribution and reproduction in any medium, provided thework is properly cited.