A. Gelbukh and C.A. Reyes-Garcia (Eds.): MICAI 2006, LNAI 4293, pp. 1179 – 1189, 2006. © Springer-Verlag Berlin Heidelberg 2006

Fast Protein Structure Alignment Algorithm Based on Local Geometric Similarity

Chan-Yong Park¹, Sung-Hee Park¹, Dae-Hee Kim¹, Soo-Jun Park¹, Man-Kyu Sung¹, Hong-Ro Lee², Jung-Sub Shin², and Chi-Jung Hwang²

¹ Electronics and Telecommunications Research Institute, 161 Gajung, Yusung, Daejeon, Korea {cypark, sunghee, dhkim98, psj, mksung}@etri.re.kr

² Dept. of Computer Science, Chung Nam University (CNU), Daejon, Korea {hrlee, iplsub, cjhwang}@ipl.cnu.ac.kr

Abstract. This paper proposes a novel, fast protein structure alignment algorithm and its application. Because the function of a protein is known to derive from its structure, a measure of the structural similarity between two proteins can be used to infer their functional closeness. We propose a 3D chain code representation for quickly measuring the local geometric similarity of proteins, and introduce a backtracking algorithm for efficiently joining similar local substructures. A 3D chain code is a sequence of directional vectors between the atoms of a protein and captures the local geometry of the protein. After constructing pairs of similar substructures from this local similarity, we perform the protein alignment by joining the similar substructure pairs with a backtracking algorithm. Compared with previous approaches, our 3D chain code representation is more intuitive, and our experiments show that the backtracking algorithm is faster than dynamic programming in the general case. We have designed and implemented a protein structure alignment system based on our protein visualization software (MoleView). The experiments show rapid alignment with precise results.

1 Introduction

Since the function of a protein is known to derive largely from its structure, functional closeness can be inferred by measuring the structural similarity between two proteins [1]. Fast structural comparison methods are therefore crucial for dealing with the growing volume of protein structure data. This paper proposes a fast and efficient method of protein structure alignment.

Many structural alignment methods for proteins have been proposed in recent years [2, 3, 4, 5, 6], of which distance matrices and vector representations are the most commonly used. Distance matrices, also known as distance plots or distance maps, contain all the pairwise distances between alpha-carbon atoms, i.e. the Cα atoms of each residue [3]. This method has critical weaknesses in its computational complexity and its sensitivity to errors in the global optimization of the alignment. Another approach represents a protein structure as vectors of the protein's secondary structural elements (SSEs; namely α-helices and β-strands). In this method, a protein structure is simplified to a set of vectors for efficient searching and substructure matching [5]. However, this approach suffers from the relatively low accuracy of SSE alignments, and in some cases it fails to produce an SSE alignment at all because the input structures lack SSEs.

A major drawback of these approaches is that they require an exhaustive sequential scan of a structure database to find structures similar to a target protein, which makes them infeasible for large structure databases such as the PDB [4].

This paper is organized as follows. Section 2 proposes the new alignment algorithm: we introduce the 3D chain code and apply a backtracking algorithm to protein alignment. Section 3 reports the alignment results in terms of RMSD and computation time.

2 The Proposed Protein Structure Alignment Algorithm

The algorithm comprises four steps (Fig. 1).

Step 1: We build a 3D chain code from the 3-dimensional information of the protein. Since a protein chain resembles a thread, we treat it as one: we convert the thread into a sequence of progressive direction vectors and use the angles of these direction vectors as local features. This step exploits the local similarity of the two proteins.

Step 2: For local alignment of the two proteins, we compare every pair of 3D chain code elements. If the two elements of a pair are similar, we plot a dot on the similarity map. After all pairs have been compared, we build a set of similar substructure pairs (SSPs).

Step 3: For fast calculation, we merge SSPs using the secondary structure information of the proteins.

Step 4: We apply a backtracking algorithm to join SSPs by combining the gaps between two consecutive SSPs, each with its own score.

Fig. 1. Overall algorithm steps. [Flowchart: Protein A and Protein B; Step 1: Generate 3D Chain Code (for each protein); Step 2: Compare 3D Chain Code, Construct Similarity Map, Find SSPs; Step 3: Merge SSPs; Step 4: Join SSPs; Final Alignment.]

In this section, we provide a detailed description of the new algorithm and its implementation.

2.1 3D Chain Code

The protein structure data are obtained from the Protein Data Bank (PDB) [4]. For each residue of the protein, we read the 3D coordinates of its Cα atom from the PDB file. As a result, each protein is represented by approximately equidistant sampling points in 3D space. To build a 3D chain code, we treat four consecutive Cα atoms, Cαi to Cαi+3 with coordinates (x1, y1, z1) to (x4, y4, z4), as a set (Fig. 2) [18]. We calculate a homogeneous coordinate transform that creates a new coordinate system (u, v, n) whose axes are built from an up vector (Cαi, Cαi+1) and a directional vector (Cαi+1, Cαi+2).

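This method uses the following equation:

$$
\text{Homogeneous Coordinates Transform } T =
\begin{Bmatrix}
R_{11} & R_{12} & R_{13} & 0 \\
R_{21} & R_{22} & R_{23} & 0 \\
R_{31} & R_{32} & R_{33} & 0 \\
T_{1} & T_{2} & T_{3} & 1
\end{Bmatrix}
\tag{1}
$$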

The directional vector Dir (R31, R32, R33) is:

$$
R_{31} = \frac{x_3 - x_2}{v}, \quad
R_{32} = \frac{y_3 - y_2}{v}, \quad
R_{33} = \frac{z_3 - z_2}{v}, \qquad
v = \sqrt{(x_3 - x_2)^2 + (y_3 - y_2)^2 + (z_3 - z_2)^2}
\tag{2}
$$

The up vector Up (R21, R22, R23) is:

$$
Up = Up_w - (Up_w \cdot Dir)\,Dir, \qquad Up_w = (x_1 - x_2,\; y_1 - y_2,\; z_1 - z_2)
\tag{3}
$$

The right vector R (R11, R12, R13) is:

$$
R = Up \times Dir
\tag{4}
$$

The translation vector T (T1, T2, T3) is:

$$
T = (-x_3,\; -y_3,\; -z_3)
\tag{5}
$$

The transform T is applied to Cαi+3 to obtain the transformed atom Cα'i+3 (xt, yt, zt). We then convert Cα'i+3 to spherical coordinates. The conversion from Cartesian to spherical coordinates is as follows:

$$
r = \sqrt{x_t^2 + y_t^2 + z_t^2}, \qquad
\theta = \tan^{-1}\!\left(\frac{y_t}{x_t}\right) \;\; (-\pi < \theta < \pi), \qquad
\phi = \cos^{-1}\!\left(\frac{z_t}{r}\right) \;\; (-\pi < \phi < \pi)
\tag{6}
$$

For protein structure matching, the Cα atoms along the backbone can be considered equally spaced because of the consistency of chemical bond formation. Since the segment length between consecutive Cα atoms is therefore constant, we set r = 1 in the spherical coordinates and keep only the two angles.

Following this procedure along a protein chain, we obtain the 3D chain code (CC_A) of protein A:

$$
CC_A = \{\{\phi_1, \theta_1\}, \{\phi_2, \theta_2\}, \ldots, \{\phi_n, \theta_n\}\}
\tag{7}
$$

where n is the total number of amino acids of protein A minus 3.
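The construction of Eqs. (1) to (7) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation, and it rests on several assumptions: the transform is read as a change of basis into a frame centred at Cαi+2 with axes Right, Up and Dir (the translation is applied before the rotation); the up vector is normalized, which Eq. (3) does not state; θ is computed with `atan2` and both angles are returned in degrees. The function names `chain_code_element` and `chain_code` are hypothetical.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _unit(a):
    n = math.sqrt(_dot(a, a))  # raises ZeroDivisionError below if a is the zero vector
    return (a[0] / n, a[1] / n, a[2] / n)


def chain_code_element(p1, p2, p3, p4):
    """Return (phi, theta) in degrees for four consecutive C-alpha positions."""
    direction = _unit(_sub(p3, p2))                    # Eq. (2)
    up_w = _sub(p1, p2)                                # Eq. (3), Up_w
    k = _dot(up_w, direction)
    up = _unit(_sub(up_w, (k * direction[0], k * direction[1], k * direction[2])))
    right = _cross(up, direction)                      # Eq. (4)
    q = _sub(p4, p3)                                   # Eq. (5): move the origin to p3
    # Assumption: local coordinates are projections onto (right, up, direction).
    xt, yt, zt = _dot(q, right), _dot(q, up), _dot(q, direction)
    r = math.sqrt(xt * xt + yt * yt + zt * zt)         # Eq. (6)
    theta = math.degrees(math.atan2(yt, xt))
    phi = math.degrees(math.acos(max(-1.0, min(1.0, zt / r))))
    return phi, theta


def chain_code(ca_coords):
    """Eq. (7): one (phi, theta) pair per window of four C-alpha atoms (n = N - 3)."""
    return [chain_code_element(*ca_coords[i:i + 4])
            for i in range(len(ca_coords) - 3)]
```

The sketch fails with a division by zero if three consecutive atoms are collinear or two coincide; real Cα traces should not trigger this, but a production version would need to guard against it.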

Fig. 2. The 3D chain code. [Figure: four consecutive Cα atoms, Cαi (x1, y1, z1), Cαi+1 (x2, y2, z2), Cαi+2 (x3, y3, z3) and Cαi+3 (x4, y4, z4), drawn with the local axes u, v, n and the spherical coordinates r, θ, φ.]

2.2 Finding Similar Substructure Pair Set

Because the 3D chain code represents the relative direction among four consecutive atoms of a protein, we can compare the local similarity of two proteins by comparing their 3D chain codes.

Given two proteins, we construct a similarity map, which shows how well the two proteins align with each other. The entry D(i,j) of the similarity map denotes the similarity between the 3D chain code values of the i-th residue of protein A ({φi, θi}) and the j-th residue of protein B ({φj, θj}), and is defined by the following equation:

$$
D(i,j) = \sqrt{(\phi_i - \phi_j)^2 + (\theta_i - \theta_j)^2}
\tag{8}
$$

This measure is the Euclidean distance between the two angle pairs. After calculating D(i,j) for all i and j, we keep the entries of the similarity map whose value is below an angular threshold Td, given in degrees. We use Td = 10 in our experiments. Figure 3 shows an example of a similarity map for the 3D chain codes of two particular proteins, 1HCL and 1JSU:A.
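A minimal Python sketch of the thresholded similarity map follows. It applies Eq. (8) literally to two chain codes given as lists of (φ, θ) pairs in degrees and records a dot wherever D(i,j) falls below Td. The function name `similarity_map` and the choice of a set of index pairs as the data structure are assumptions made for illustration; the sketch also does not handle angle wrap-around near ±180°, which the text does not discuss.

```python
import math


def similarity_map(cc_a, cc_b, td=10.0):
    """Return the set of (i, j) index pairs whose distance D(i, j) is below td."""
    dots = set()
    for i, (phi_a, theta_a) in enumerate(cc_a):
        for j, (phi_b, theta_b) in enumerate(cc_b):
            # Eq. (8): Euclidean distance between the two angle pairs
            d = math.hypot(phi_a - phi_b, theta_a - theta_b)
            if d < td:
                dots.add((i, j))
    return dots


# Hypothetical chain codes, for illustration only
cc_a = [(90.0, 0.0), (100.0, 20.0), (50.0, -30.0)]
cc_b = [(92.0, 3.0), (140.0, 20.0), (55.0, -28.0)]
print(sorted(similarity_map(cc_a, cc_b)))
```

The double loop makes the cost proportional to the product of the two chain lengths, which is the price of building the full map before searching it for diagonals.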

Using this similarity map, our goal is to find all SSPs in the map. An SSP appears as a diagonal line in the map. To find an SSP, we first locate an element D(i,j) whose value is below Td, then examine the neighbouring diagonal elements D(i+1, j+1) and D(i-1, j-1), and repeat the procedure for as long as the next element remains below Td. This process amounts to finding diagonal lines in the similarity map. Once an SSP is found, we denote it $SSP_k^l(i,j)$ (Fig. 4):

$$
SSP_k^l(i,j) = \Big\{ \{\{\phi_{i+0}, \theta_{i+0}\}, \{\phi_{i+1}, \theta_{i+1}\}, \ldots, \{\phi_{i+l}, \theta_{i+l}\}\},\;
\{\{\phi_{j+0}, \theta_{j+0}\}, \{\phi_{j+1}, \theta_{j+1}\}, \ldots, \{\phi_{j+l}, \theta_{j+l}\}\} \Big\}
\tag{9}
$$

where k is the index of the SSP, l is the length of the SSP, i is the index in protein A, and j is the index in protein B.
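The diagonal search can be sketched as follows, assuming the thresholded similarity map is available as a set of (i, j) index pairs. Each maximal diagonal run is reported as a tuple (i, j, length), where `length` counts the dots in the run, so it corresponds to l + 1 in the notation of Eq. (9). The function name `find_ssps` and the `min_len` filter are hypothetical additions, not part of the paper.

```python
def find_ssps(dots, min_len=1):
    """Return maximal diagonal runs in the similarity map as (i, j, length)."""
    ssps = []
    for (i, j) in sorted(dots):
        if (i - 1, j - 1) in dots:
            continue                      # this dot continues a run found earlier
        length = 1
        while (i + length, j + length) in dots:
            length += 1                   # extend the run along the diagonal
        if length >= min_len:
            ssps.append((i, j, length))
    return ssps


# Hypothetical dots, for illustration only
dots = {(0, 0), (1, 1), (2, 2), (5, 1), (0, 3)}
print(find_ssps(dots))
print(find_ssps(dots, min_len=2))
```

Starting each run only at a dot with no upper-left neighbour means every diagonal is reported once, instead of once per dot it contains.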

Fig. 3. The similarity map of 1HCL and 1JSU:A

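Fig. 4. The k-th SSP, $SSP_k^l(i,j)$, in the similarity map. [Figure: a diagonal segment running from i to i+l along the Protein A axis and from j to j+l along the Protein B axis.]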

2.3 Merging Similar Substructure Pairs

In the previous section, we found many SSPs. Because the computation time of the protein alignment depends on the number of SSPs, we merge certain groups of SSPs into a single SSP.


In the similarity map, we find rectangular regions composed of many SSPs. These rectangles are caused by SSPs that share the same secondary structure (Fig. 5). For example, if protein A and protein B both contain an α-helix, the two helices have a similar local geometry at every helical turn, so many parallel diagonals appear. The same holds for β-strands. In this case, we merge the SSPs with the same secondary structure into a single SSP. The similarity map after merging is shown in Fig. 6.

Fig. 5. The rectangular shape in the similarity map

Fig. 6. After merging SSPs, the similarity map of 1HCL and 1JSU:A

2.4 Joining Similar Substructure Pairs

In this section, we find the optimal set of SSPs, which describes a possible alignment of protein A with protein B.

We apply a modified backtracking algorithm [15] to join SSPs. Backtracking is a refinement of the brute-force approach that systematically searches for a solution to a problem among all the available options. It assumes that solutions are represented by vectors (s1, ..., sm) of values and traverses the domains of the vectors in a depth-first manner until solutions are found. When invoked, the algorithm starts with an empty vector. At each stage it extends the partial vector with a new value. On reaching a partial vector (s1, ..., si) that the promising function rejects as a partial solution, the algorithm backtracks by removing the trailing value from the vector, and then proceeds by trying to extend the vector with alternative values.
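The search pattern described here can be illustrated with a small Python sketch that chains SSPs given as (i, j, length) tuples. It is an assumption-laden stand-in rather than the method of the paper: the promising function used below (`ordered`) only checks that consecutive SSPs do not overlap and keep the same order in both proteins, the search may start from any SSP, and the best chain is chosen by total aligned length instead of by RMSD. The names `join_ssps` and `ordered` are hypothetical.

```python
def join_ssps(ssps, promising):
    """Enumerate chains of SSPs by depth-first search with pruning."""
    solutions = []
    partial = []                              # the partial vector (s1, ..., si)

    def extend(start):
        for k in range(start, len(ssps)):
            partial.append(ssps[k])           # extend with a new value
            if promising(partial):
                solutions.append(list(partial))
                extend(k + 1)                 # go one level deeper
            partial.pop()                     # backtrack: drop the trailing value

    extend(0)
    return solutions


def ordered(partial):
    """Stand-in promising function: the last two SSPs must not overlap."""
    if len(partial) < 2:
        return True
    (i0, j0, l0), (i1, j1, _) = partial[-2], partial[-1]
    return i1 >= i0 + l0 and j1 >= j0 + l0


# Hypothetical SSPs sorted by their start index in protein A
ssps = [(0, 0, 3), (2, 5, 2), (4, 4, 3), (8, 8, 2)]
chains = join_ssps(ssps, ordered)
best = max(chains, key=lambda chain: sum(length for _, _, length in chain))
print(len(chains), best)
```

Pruning happens at the `promising` test: once a partial chain is rejected, none of its extensions is ever generated, which is what distinguishes backtracking from plain enumeration.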

The traversal of the solution space can be represented by a depth-first traversal of a tree. We represent the SSPs as nodes (vi) of the state space tree. Simple pseudo code is shown in Fig. 7.

    void backtrack(node v)    // an SSP is represented as a node
    {
        if (promising(v)) {
            if (there is a solution at node v)
                write solution;
            else
                for (each child node u of node v)
                    backtrack(u);
        }
    }

Fig. 7. The backtracking algorithm

We use a connectivity value for each SSP as the promising function. If two SSPs (SSPk and SSPk+1) have a similar 3D rotation and translation, i.e. their difference is below the threshold, the promising function returns true. The pseudo code is shown in Fig. 8.

    bool promising(node v)
    {
        // take the transform of the parent node's SSP
        Transform T = parentNode().SSP1.GetTransform();
        // apply it to this node's SSP2
        v.SSP2.apply(T);
        // deviation between the two substructures of this node after the transform
        double f = RMSD(v.SSP1, v.SSP2);
        return f <= threshold;
    }

Fig. 8. The promising function in the backtracking algorithm

The root node of the tree is the first SSP. Running this algorithm yields many solutions. We calculate the RMSD value of each solution and select the solution with the minimum RMSD value.

3 Implementation and Results

We have tested our algorithm in the MoleView visualization tool (Fig. 9). MoleView is a Windows 2000/XP-based protein structure visualization tool. It was designed to display and analyze the structural information contained in the Protein Data Bank (PDB), and can be run as a stand-alone application. MoleView is similar to programs such as VMD, MolMol, WebLab, Swiss-PdbViewer, MolScript, RasMol, Qmol [16], and Raster3D [17], but it is optimized for fast, high-quality rendering on the video cards installed in current PCs, with an easy-to-use user interface.

Fig. 9. Screenshots of MoleView: (a) stick model; (b) secondary structure; (c) ball-and-stick model with secondary structure; (d) close-up of (c)

Our empirical study of the protein structure alignment system using the 3D chain code produced encouraging results. Figure 10 shows the alignment of proteins 1HCL_ and 1JSU_A. These proteins are cyclin-dependent protein kinases: the uncomplexed monomer (1HCL:_) in the open state, and the complex with cyclin and P27 (1JSU:A) in the closed state. Although the sequences of the uncomplexed and complexed states are almost identical, with 96.2% homology, there are significant conformational differences, which are found in both active sites. The RMSD between the two proteins is 1.70 and the alignment time is 0.41 sec.

Figure 11 shows the alignment of proteins 1WAJ_ and 1NOY_A. These proteins are DNA polymerases. The matched residues are [7,31]-[63,78]-[87,102]-[104,119]-[128,253]-[260,297]-[310,372] of 1WAJ and [6,30]-[60,75]-[84,99]-[101,116]-[124,249]-[256,293]-[306,368] of 1NOY_A. The number of aligned residues is 261 and the RMSD is 2.67. The processing time for the alignment is 13.18 sec, which is short compared with CE [7], where this time is 298 seconds.

As a further test, we used the protein kinases, for which over 30 structures are available in the PDB. The results of a search against the complete PDB using the quaternary complex of the cAMP-dependent protein kinase in a closed conformation (1ATP:E) as the probe structure are presented in Table 1. The average RMSD is 2.45 and the average alignment time is 0.54 sec.

[Figure annotations: number of aligned amino acids: 202; RMSD: 1.70; alignment time: 0.41 sec. Matched SSPs:
[18,23]-[31,58]-[64,133]-[153,162]-[168,279]
[29,34]-[39,66]-[72,141]-[162,171]-[177,288]]

Fig. 10. The alignment between 1HCL_ and 1JSU_A (image captured by MoleView)

[Figure annotations: number of aligned amino acids: 261; RMSD: 2.67; alignment time: 13.18 sec. Matched SSPs:
[6,30]-[60,75]-[84,99]-[101,116]-[124,249]-[256,293]-[306,368]
[7,31]-[63,78]-[87,102]-[104,119]-[128,253]-[260,297]-[310,372]]

Fig. 11. The alignment between 1WAJ_ and 1NOY_A (image captured by MoleView)


Table 1. Experimental results of protein alignment

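| No. | Chain 2 | #Alignment (aligned residues) | RMSD | Time (seconds) |
|-----|---------|-------------------------------|------|----------------|
| 1   | 1APM_E  | 336 | 0.54 | 0.91 |
| 2   | 1CDK_A  | 336 | 0.80 | 0.51 |
| 3   | 1YDR_E  | 324 | 0.66 | 0.48 |
| 4   | 1CTP_E  | 304 | 2.63 | 0.49 |
| 5   | 1PHK_   | 233 | 1.68 | 0.43 |
| 6   | 1KOA_   | 118 | 1.53 | 0.70 |
| 7   | 1KOB_A  | 166 | 2.95 | 0.54 |
| 8   | 1AD5_A  | 111 | 3.43 | 0.68 |
| 9   | 1CKI_A  | 120 | 2.61 | 0.48 |
| 10  | 1CSN_   | 142 | 2.77 | 0.46 |
| 11  | 1ERK_   | 116 | 2.99 | 0.57 |
| 12  | 1FIN_A  | 73  | 2.46 | 0.55 |
| 13  | 1GOL_   | 116 | 3.07 | 0.56 |
| 14  | 1JST_A  | 88  | 2.60 | 0.45 |
| 15  | 1IRK_   | 64  | 3.25 | 0.46 |
| 16  | 1FGK_A  | 44  | 3.16 | 0.55 |
| 17  | 1FMK_   | 97  | 2.37 | 0.67 |
| 18  | 1WFC_   | 101 | 3.44 | 0.54 |
| 19  | 1KNY_A  | 27  | 2.41 | 0.46 |
| 20  | 1TIG_   | 25  | 3.66 | 0.31 |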
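As a quick consistency check, the averages quoted in the text can be recomputed from the rows of Table 1. The short Python snippet below copies the table values and takes the arithmetic mean of the RMSD and time columns; the variable names are illustrative. Rounded to two decimals, the means agree with the reported 2.45 and 0.54 sec.

```python
# (chain, aligned residues, RMSD, time in seconds), copied from Table 1
TABLE_1 = [
    ("1APM_E", 336, 0.54, 0.91), ("1CDK_A", 336, 0.80, 0.51),
    ("1YDR_E", 324, 0.66, 0.48), ("1CTP_E", 304, 2.63, 0.49),
    ("1PHK_", 233, 1.68, 0.43), ("1KOA_", 118, 1.53, 0.70),
    ("1KOB_A", 166, 2.95, 0.54), ("1AD5_A", 111, 3.43, 0.68),
    ("1CKI_A", 120, 2.61, 0.48), ("1CSN_", 142, 2.77, 0.46),
    ("1ERK_", 116, 2.99, 0.57), ("1FIN_A", 73, 2.46, 0.55),
    ("1GOL_", 116, 3.07, 0.56), ("1JST_A", 88, 2.60, 0.45),
    ("1IRK_", 64, 3.25, 0.46), ("1FGK_A", 44, 3.16, 0.55),
    ("1FMK_", 97, 2.37, 0.67), ("1WFC_", 101, 3.44, 0.54),
    ("1KNY_A", 27, 2.41, 0.46), ("1TIG_", 25, 3.66, 0.31),
]

mean_rmsd = sum(row[2] for row in TABLE_1) / len(TABLE_1)
mean_time = sum(row[3] for row in TABLE_1) / len(TABLE_1)
print(f"mean RMSD = {mean_rmsd:.2f}, mean time = {mean_time:.2f} s")
```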

4 Discussion and Conclusion

This paper proposed a novel protein structure alignment method that combines a 3D chain code, built from the direction vectors of a protein chain, with a backtracking algorithm for joining SSPs. The 3D chain code represents the protein chain structure efficiently. The essential concept is to treat a protein chain as a thread. Starting from this idea, we built a 3D chain code for finding similar substructures. To join SSPs, we use a backtracking algorithm, whereas other protein structure alignment systems use dynamic programming. In this setting, the backtracking algorithm is more intuitive and operates more efficiently.

The merit of this algorithm over other algorithms is that it uses a 3D chain code, which is more intuitive, and a backtracking algorithm, which is generally faster than dynamic programming. As a result, the alignment is very fast: in typical cases the alignment time is about 0.5 seconds, and it rarely exceeds 1.0 second.

Consequently, because the proposed protein structure alignment system delivers fast alignment with relatively precise results, it can be used for pre-screening against very large protein databases.

References

[1] Philip E. Bourne and Helge Weissig: Structural Bioinformatics, Wiley-Liss, 2003.

[2] Taylor, W. and Orengo, C., “Protein structure alignment”, Journal of Molecular Biology, Vol. 208 (1989), pp. 1-22.

[3] L. Holm and C. Sander, “Protein Structure Comparison by Alignment of Distance Matrices”, Journal of Molecular Biology, Vol. 233 (1993), pp. 123-138.

[4] Fabian Schwarzer and Itay Lotan, “Approximation of Protein Structure for Fast Similarity Measures”, Proc. 7th Annual International Conference on Research in Computational Molecular Biology (RECOMB) (2003), pp. 267-276.

[5] Amit P. Singh and Douglas L. Brutlag, “Hierarchical Protein Structure Superposition using both Secondary Structure and Atomic Representation”, Proc. Intelligent Systems for Molecular Biology (1993).

[6] Won, C.S., Park, D.K. and Park, S.J., “Efficient use of MPEG-7 Edge Histogram Descriptor”, ETRI Journal, Vol. 24, No. 1, Feb. 2002, pp. 22-30.

[7] Shindyalov, I.N. and Bourne, P.E., “Protein structure alignment by incremental combinatorial extension (CE) of the optimal path”, Protein Eng., 11 (1998), pp. 739-747.

[8] Databases and Tools for 3-D Protein Structure Comparison and Alignment Using the Combinatorial Extension (CE) Method (http://cl.sdsc.edu/ce.html).

[9] Chanyong Park, et al., MoleView: A program for molecular visualization, Genome Informatics 2004, p167-1

[10] Lamdan, Y. and Wolfson, H.J., “Geometric hashing: a general and efficient model-based recognition scheme”, In Proc. of the 2nd International Conference on Computer Vision (ICCV), 238-249, 1988.

[11] Leibowitz, N., Fligelman, Z.Y., Nussinov, R., and Wolfson, H.J., “Multiple Structural Alignment and Core Detection by Geometric Hashing”, In Proc. of the 7th International Conference on Intelligent Systems for Molecular Biology (ISMB), 169-177, 1999.

[12] Nussinov, R. and Wolfson, H.J., “Efficient detection of three-dimensional structural motifs in biological macromolecules by computer vision techniques”, Biophysics, 88: 10495-10499, 1991.

[13] Pennec, X. and Ayache, N., “A geometric algorithm to find small but highly similar 3D substructures in proteins”, Bioinformatics, 14(6): 516-522, 1998.

[14] Holm, L. and Sander, C., “Protein Structure Comparison by Alignment of Distance Matrices”, Journal of Molecular Biology, 233(1): 123-138, 1993.

[15] S. Golomb and L. Baumert, “Backtrack programming”, J. ACM, 12: 516-524, 1965.

[16] Gans, J. and Shalloway, D., “Qmol: A program for molecular visualization on Windows based PCs”, Journal of Molecular Graphics and Modelling, 19: 557-559, 2001.

[17] http://www.bmsc.washington.edu/raster3d/

[18] Bribiesca, E., “A chain code for representing 3D curves”, Pattern Recognition, 33: 755-765, 2000.