a geometric representation of protein sequences

9
A Geometric Representation of Protein Sequences Shengyin Gu Institute for Data Analysis and Visualization Department of Computer Science University of California Davis, California, USA [email protected] Olivier Poch Laboratoire de Biologie Structurale IGBMC (CNRS/INSERM/ULP) Strasbourg, France [email protected] Bernd Hamann Institute for Data Analysis and Visualization Department of Computer Science University of California Davis, California, USA [email protected] Patrice Koehl Genome Center University of California Davis, California, USA [email protected] Abstract The amino acid sequence of a protein is the key to understanding its structure and ultimately its function in the cell. This paper addresses the fundamental issue of encoding amino acid in ways that the visualization of protein sequences facilitates the decoding of its in- formation content. We show that a feature-based rep- resentation in a three-dimensional (3D) space derived from substitution matrices provides an adequate repre- sentation from which the domain content of a protein can be predicted. In addition, we show that each di- mension of the feature space can be related to a physical property of the amino acids. 1. Introduction The genetic information encoded in the genome of an organism represents the blueprint for its develop- ment and activity; its implementation depends on the functions of the corresponding gene products (i.e., nu- cleic acids and proteins). Among these products, pro- teins play a central role as they catalyze most bio- chemical reactions, and are responsible, among other functions, for the transport of nutrients and for sig- nal transmission within and between cells. As a con- sequence, a major focus of bioinformatics is to study how the information contained in a gene is decoded to yield a functional protein, and how this function has evolved between and within species. It is well-known that proteins function because they adopt a unique native 3D conformation. While a direct relationship between sequence similarity and conservation of 3D structure has been clearly established for proteins [4], the relationship between their 3D structures and func- tions is much more complex [43]. This complexity calls for more rigorous descriptions of molecular and cellu- lar functions, and a better understanding of sequence- structure-function relationships. Efforts to unravel the latter currently focus on protein sequence analysis, as a consequence of the wealth of sequence data result- ing from various genome projects, either completed or ongoing. (There were more than 4,100,000 sequences deposited in Uniprot, the general repository of protein sequences as of early March 2007.) Data produced by these projects have already lead to significant improve- ment in predictions of both 3D structures and functions [43]. However, we still stand at the dawn of under- standing the information encoded in the sequence of a gene. In this paper, we focus on protein sequence representations and show how visualization can play a role in decoding gene information content. Proteins are heteropolymer chains of amino acids, often referred to as residues in structural biology. There are twenty standard types of amino acids, which share a common backbone and are distinguished by their chemically diverse side-chains, which range in size from a single hydrogen atom to large aromatic rings with up to 17 atoms. Amino acids include either only

Upload: others

Post on 04-Feb-2022

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: A Geometric Representation of Protein Sequences

A Geometric Representation of Protein Sequences

Shengyin GuInstitute for Data Analysis and Visualization

Department of Computer ScienceUniversity of CaliforniaDavis, California, USA

[email protected]

Olivier PochLaboratoire de Biologie StructuraleIGBMC (CNRS/INSERM/ULP)

Strasbourg, [email protected]

Bernd HamannInstitute for Data Analysis and Visualization

Department of Computer ScienceUniversity of CaliforniaDavis, California, [email protected]

Patrice KoehlGenome Center

University of CaliforniaDavis, California, [email protected]

Abstract

The amino acid sequence of a protein is the key tounderstanding its structure and ultimately its functionin the cell. This paper addresses the fundamental issueof encoding amino acid in ways that the visualizationof protein sequences facilitates the decoding of its in-formation content. We show that a feature-based rep-resentation in a three-dimensional (3D) space derivedfrom substitution matrices provides an adequate repre-sentation from which the domain content of a proteincan be predicted. In addition, we show that each di-mension of the feature space can be related to a physicalproperty of the amino acids.

1. Introduction

The genetic information encoded in the genome ofan organism represents the blueprint for its develop-ment and activity; its implementation depends on thefunctions of the corresponding gene products (i.e., nu-cleic acids and proteins). Among these products, pro-teins play a central role as they catalyze most bio-chemical reactions, and are responsible, among otherfunctions, for the transport of nutrients and for sig-nal transmission within and between cells. As a con-sequence, a major focus of bioinformatics is to studyhow the information contained in a gene is decoded toyield a functional protein, and how this function has

evolved between and within species. It is well-knownthat proteins function because they adopt a uniquenative 3D conformation. While a direct relationshipbetween sequence similarity and conservation of 3Dstructure has been clearly established for proteins [4],the relationship between their 3D structures and func-tions is much more complex [43]. This complexity callsfor more rigorous descriptions of molecular and cellu-lar functions, and a better understanding of sequence-structure-function relationships. Efforts to unravel thelatter currently focus on protein sequence analysis, asa consequence of the wealth of sequence data result-ing from various genome projects, either completed orongoing. (There were more than 4,100,000 sequencesdeposited in Uniprot, the general repository of proteinsequences as of early March 2007.) Data produced bythese projects have already lead to significant improve-ment in predictions of both 3D structures and functions[43]. However, we still stand at the dawn of under-standing the information encoded in the sequence ofa gene. In this paper, we focus on protein sequencerepresentations and show how visualization can play arole in decoding gene information content.

Proteins are heteropolymer chains of amino acids,often referred to as residues in structural biology.There are twenty standard types of amino acids, whichshare a common backbone and are distinguished bytheir chemically diverse side-chains, which range in sizefrom a single hydrogen atom to large aromatic ringswith up to 17 atoms. Amino acids include either only

Page 2: A Geometric Representation of Protein Sequences

non-polar saturated hydrocarbons (the non-polar, orhydrophobic amino acids), or a mixture of non-polarand polar atoms (polar amino acids), the latter be-ing neutral, positively charged (basic) or negativelycharged (acidic). The order in which amino acids ap-pear defines the primary sequence of a protein. Aminoacids are usually labeled using a one-letter code, andsequences are correspondingly represented as a usuallylong string of letters. This representation has provedvery valuable, especially in the context of sequencecomparisons that are performed using string matchingalgorithms. It does however carry limitations: lettersalone poorly represent the physical and chemical prop-erties of amino acids and as such are usually difficultto decipher. Computer programs that represent pro-tein sequences often resort to different coloring schemes[21, 38] to facilitate their interpretation (see, for exam-ple, ClustalX for multiple sequence alignments [39]), orto increase their information content (see, for example,the SAS server that encodes the 3D structure of a pro-tein on its sequence using a color coding [26]). The ad-dition of well-chosen colors improves the readability ofmultiple sequence alignments (MSA) [39]; their impor-tance however for deciphering single sequences remainslimited. Note that a coloring scheme ultimately corre-sponds to adding dimensions to the representation ofa protein sequence or a multiple sequence alignment inorder to help decipher its information content. Thisconcept of increased dimensions was applied to multi-ple sequence alignments using Hilbert curves [35]. Itcan naturally be extended to the idea of a geomet-ric representation and visualization of individual se-quences.

The concept of geometric representation of proteinsequences was originally introduced by Swanson [36]who proposed a two-dimensional vector representationof the standard 20 amino acids, based on Dayhoff’smutation matrix [34]. In Swanson’s representation,the two coordinates of the vectors coincide with sizeand hydrophobicity. Protein sequences are visualizedby concatenating the vectors representing each aminoacid types, yielding a vector representation of proteins(VRP). Since the original work of Swanson, many newgeometric representations of protein sequences havebeen proposed. Among those, we mention the vectordiagram introduced by Yamamoto and Yoshikura [44],which represents each amino acid according to its hy-drophilicity and propensity to belong to different typesof secondary structures (beta-strands and turns). TheZp plot introduced by Feng et al. [9] represents a pro-tein sequence in 3D space based on its hydrophobic,polar and charged residue content. The Zp plot is infact a graphical extension of PHYSEAN, a physical se-

quence analysis software that takes into account phys-ical, chemical and biological properties of amino acids[22]. Maetschke and colleagues [23] described a seriesof multi-dimensional encoding of amino acids, conclud-ing that among those that they tested, an extensionof the VRP introduced by Swanson [36] to higher di-mensions performed the best in identifying putativecleavage sites in proteins. Recently, Bai and Wang[2] proposed to map each amino acid to a vertex ofa regular dodecahedron, generating vectors that cansubsequently be used via concatenation to represent afull protein sequence, following the concept of Z-curvesintroduced for DNA sequence [45]. These various rep-resentations have been used for detecting and measur-ing similarities between sequences [36, 2], for predictingcleavage sites in protein [23], to predict sub-cellular lo-cations of proteins [9] and to predict the location ofprotein domains [22].

All the methods referenced above share the idea ofmoving away from a simple representation of a pro-tein sequence as a string of letters, encoding insteadeach amino acid as a set of values representing someof its properties. This paper draws from this conceptand describes a feature-based representation of proteinsequences, in which each amino acid is encoded by aunique vector of features. Our approach differs fromthe existing approaches described above in the way weconstruct our 3D vectors. The 3D vectors we com-pute are such that each of the three dimensions en-codes a physical property of the amino acids. Key toour approach is the use of the graphical properties ofour geometrical representation to identify properties ofthe sequence considered. We show preliminary appli-cations to the identification of domains within proteinsequences. This paper presents work in progress andmore details will be provided later. In section 2, we de-scribe 3D feature vectors for representing amino acidsbased on substitution matrices. Section 3 presents howthese vectors can be used to represent entire sequences,as well as applications of these representations. In sec-tion 4, we conclude and allude to other applications ofour graphical representation of protein sequences.

2. A Geometric Representation of AminoAcids

Amino acid sequences are commonly represented asa string of letters, with one letter per amino acid. Thechoice of the letters is purely lexicographic: it is eitherthe first letter of the word naming the amino acid, ora substitute when two or more words start with thesame letter. As such, the string of letter is mostly un-informative and can only be interpreted through the

Page 3: A Geometric Representation of Protein Sequences

Figure 1. 3D vectors based on BLOSUM62: BLV62. This plot represents the similarity betweenamino acids as encoded by the BLOSUM62 matrix. The geometric proximity of amino acids corre-spond to their known chemical similarities. To highlight this fact, we show the known polar residues(Q, R, E, K, N, D, T, S) in solid vectors with upper-case labels, the hydrophobic residues in solid vec-tors with lower-case labels (m, v, l, i, h, p, c) and aromatic residues in dashed vector with lower-caselabels (y, f, w). Note that the two small amino acids, A and G (in dashed vectors with upper-case la-bels), stand out. Note also that Cysteine (c), though non-polar, differs from other amino acids basedon its ability to form disulphide bridges, usually highly conserved in proteins.

trained eyes of a protein chemist who can implicitly vi-sualize the chemical structure of the amino acids, or bya program in which this chemical information has beenencoded. One way to improve upon this is to encodeproperties to each amino acid representation. To fur-ther improve this representation is to color the lettersbased on the amino acid properties. Different coloringschemes have been proposed, and are now standard invisualization programs such as ClustalX [39]. Colorsadd one dimension to the representation (though, the-oretically, it is possible to add nuances in the colorsto add further dimensions [38]), which is usually notenough to discriminate amino acid properties. Swan-son [36] pioneered a vector representation for proteinsequences, in which each amino acid is encoded intoa 2D vector whose coordinates correspond to size andhydrophobicity.

We draw from this original idea and represent aminoacids as 3D vectors, in which each dimension is a fea-ture of the amino acid. Our goal is to incorporate asmany properties of an amino acid as needed into a geo-metric representation that allow us to visualize proteinsequence properties. These properties can then be ana-

lyzed directly visually by a human, or through standardgeometric procedures. This is a generalization of the3D encoding proposed in BLOMAP [23]. We describeone possible set of features, derived from substitutionmatrices. Note that the same concept can accommo-date other features, such as Chou and Fasman propen-sities of amino acids to belong to secondary structures.

2.1. Constructing Feature Vectors Based onSimilarity Matrices

Common measures of similarities between aminoacids are usually presented in the form of a substi-tution matrix, which stores the odds that any givenamino acid can be replaced by any other. Substi-tution matrices can be compiled based on known ac-cepted substitutions in protein sequences (see [16]), ordirectly from amino acids properties (see, for example,[40]). In sequence-based substitution matrices, aminoacids which are frequently mutually substituted areregarded as similar. Schwartz and Dayhoff [34] werethe first to compile such a matrix, using 71 groupsof closely related proteins (i.e., with more than 85%

Page 4: A Geometric Representation of Protein Sequences

%IGENVALUE��

%IG

EN

VA

LUE

#U

MU

LATI

VE

�EN

ERG

Y��

%IGENVALUE��

%IG

EN

VA

LUE

#U

MU

LATI

VE

�EN

ERG

Y��

0!-���� ",/35-��

� � � � �� �� �� �� �� ���

���

���

���

���

��

��

��

���

� � � � �� �� �� �� �� ���

���

����

����

��

��

��

���

Figure 2. Principal component analysis of substitution matrices. The twenty eigenvalues (+, leftaxis) of the PAM250 (left panel) and BLOSUM62 (right panel) substitutions matrices as well as theircumulative energies (o, right axis) are plotted in decreasing order of amplitudes. The largest threeeigenvalues account for 80% and 70% of the energy (or information content) of PAM250 and BLO-SUM62, respectively.

pairwise sequence identity), and collecting the dataof point accepted mutations, or PAMs. Henikoff andHenikoff [15] extended this concept to include more di-vergent sequences and generated the BLOSUM matri-ces. BLOSUM matrices are derived from BLOCKs se-quence alignments [15]. Several matrices have been de-rived, corresponding to different cutoff in the acceptedsequence identity within the BLOCKs. For example,BLOSUM62 is a substitution matrix derived from pro-tein sequence alignments in which the sequences areat least 62% identical; it is considered to provide goodperformance for database search.

Substitution matrices describe each amino acid witha set of twenty numerical values (sometimes referredto as amino acid index [18, 40]), henceforth defin-ing a twenty-dimensional space. While such a high-dimensional space is useful for computer-guided se-quence alignment methods, it is impractical for anyform of visualization. Swanson was the first to em-bed the space corresponding to the original PAM ma-trix MDM78 into a plane, using a principal compo-nent analysis (PCA) approach [36]. More recently,Maetschke et al. [23] embedded the BLOSUM62 ma-trix into five dimensions, using the Sammon’s projec-tion technique [33], noticing that three dimensions al-ready produce a reasonably good approximation. Tofurther characterize which dimension is appropriate forvisualizing the information content of BLOSUM62, werepeated the embedding of both PAM250 (which is verysimilar to the original MDM78) and BLOSUM62, us-

ing a PCA. Results are shown in Figure 2. Swanson[36] and Maetschke et al. [23] used PCA and multi-dimensional scaling using Sammon mapping, respec-tively. Their methods first convert the substitutionmatrix into a “distance” matrix, by exponentiation ofthe scores included in the matrix. We kept the sub-stitution matrix as it is. Each column of this matrixcorresponds to a different amino acid, while each rowis treated as a probe of a property of that amino acid.In the PCA analysis, the substitution matrix is firstcentered, and then the eigenvalues and eigenvectors ofits covariance matrix are computed. We have foundthat the three largest eigenvectors account for 82% and70% of the total “energy” (or information content) ofPAM250 and BLOSUM62, respectively. These resultsagree with those of Swanson [36] and Maetschke et al.[23].

2.2. Information Content of Amino Acids3D Feature Vectors

The entropy value of a substitution matrix is an in-formation theoretic value that measures the informa-tion content [1]. In Figure 3 we show that the energy ofthe three largest eigenvalues of a substitution matrixis correlated with its entropy value, with correlationvalues of -0.91 and 0.46 for PAM and BLOSUM matri-ces, respectively. The difference in sign stems from thedefinitions of the matrices. PAM matrices with low IDnumbers are computed from alignments of highly sim-ilar sequences, and as such are comparable with BLO-

Page 5: A Geometric Representation of Protein Sequences

",/35-�)$

%N

ERG

Y�O

F�FI

RST�

��E

IGE

NVA

LUE

S���

%N

TRO

PY

0!-�)$

%N

ERG

Y�O

F�FI

RST�

��E

IGE

NVA

LUE

S���

%N

TRO

PY

0!-�MATRICES ",/35-�MATRICES

� ��� ��� ��� ��� �����

��

��

��

���

�� �� �� �� �� �� �� �����

��

��

��

���

���

Figure 3. Information content of substitution matrices. The cumulative energy of the three largesteigenvalues (+, left axis) and the entropy (o, right axis) of substitutions matrices are plotted asfunctions of the matrix ID.

SUM matrices of high ID numbers. They are bothdesigned for comparisons of closely related sequences.Reversely, BLOSUM matrices with low ID numbersand PAM matrices with high ID numbers are designedfor comparisons of distantly related proteins. Interest-ingly, the energies of the three largest eigenvalues ofPAM matrices are always higher than the energies oftheir equivalent BLOSUM matrices (with PAM250 cor-responding to BLOSUM45, PAM120 to BLOSUM80and PAM100 to BLOSUM90, based on entropy com-parison).

2.3. BLV62: Information in Each Dimen-sion

BLOSUM62 is the preferred substitution matrix fordatabase search. We focus on this matrix in the follow-ing. Each amino acid can be represented as a 3D vec-tor, using its corresponding coordinates in the largestthree eigenvectors of the covariance matrix of BLO-SUM62. Figure 1 shows these twenty vectors, whichwe refer to as BLV62, all centered at the origin (withthe origin being contained in the bounding box). Forcompleteness, the coordinates of these vectors are pro-vided as an appendix. It is difficult to interpret thethree axes of the BL vectors, as these are mathemati-cally constructed to provide sub-components of the ma-trices with decreasing energy/information content. Wecompared the vector containing the coordinates of thetwenty amino acids on the first axis corresponding tothe BLV62 vectors, with 528 amino acid indices avail-able in the AAIndex database [40]. Five of the 528 in-dices were selected with a correlation coefficient greater

than or equal to 0.95: the “buriability” of Zhou andZhou [46], an amino acid contact number with a cut-off of 14 A[27], a normalized hydrophobicity scale [6],and two interactivity scales designed to correlate withhydropathy scales [3]. Note that all these indices arerelated to amino acid burial and their hydrophobicity.These results are in agreement with the original find-ings of French and Robson [10], Swanson [36] and Tomiiand Kanehisa [40]. Interestingly, this behavior differsfrom the results described by Kinjo and Nishikawa [19],who performed spectral analysis on substitution ma-trices compiled from protein structure alignments, in-cluding proteins with varying levels of sequence sim-ilarities. Using the same AAIndex database that weused [40], Kinjo and Nishikawa showed that at highsequence identities hydrophobicity plays a minor role,and that the “relative mutabilities” of Dayhoff et al. [7]and Jones et al. [17] dominates. As BLOSUM62 is de-rived from blocks of sequences with more than 62% se-quence identify, it qualifies as a high sequence identitysubstitution matrix. The difference between our re-sults and those of Kinjo and Nishikawa is unclear. Thebest correlations between the second and third axes ofthe 3D BLOSUM62 vectors and an amino acid indexcontained in AAIndex are 0.77 and 0.75, respectively.The second axis is found to correlate well with aver-age non-bonded energies [28], which is related to size.Interestingly, the third axis is found to correlate withcomputed alpha-helix propensities [20], as well as withstatistics on turns in proteins [5].

Page 6: A Geometric Representation of Protein Sequences

3 Applications: A Geometric Repre-sentation of Protein Sequences

A sequence of a protein describes the successionof its amino acids from its N-terminal end to its C-terminal end. It is usually encoded as a string of let-ters, one for each residue in the protein, with each letterspecific to one of the twenty amino acids. In the sectionabove, we have shown that representing amino acids as3D vectors improves the decoding of their properties.We extend this geometric concept to the representationof the whole sequence of a protein by direct “head-to-tail concatenation” of the vectors representing its con-stituent amino acids. A protein sequence then becomesa polyline in 3D space, which we refer to as the pro-tein’s 3D trace. We describe one application of sucha representation, namely the detection of domains inlong protein sequences.

Large proteins do not contain a single large hy-drophobic core, probably because of limitations in theirfolding kinetics and stability. Single compact unitsof more than 500 amino acids are rare. Large pro-teins in fact are organized into units with sizes around200-300 residues, referred to as domains [31, 30]. In-terestingly, while the concept of domains in proteinsis well-established, there is no consensus definition ofwhat a domain is. A domain is either defined based onsequence (regions that display a significant level of se-quence similarity), function (the minimal part of a genethat is capable of performing a function) or structure(compact, spatially distinct units of protein structure)[42]. When the structure of the protein is known, itsdomains are usually defined by a combination of visualinspection of the structure with automated methodsthat take into account the globular nature of domains(for a review of existing methods, see [42]). It wouldbe of practical interest, however, to delineate domainboundaries in protein sequence alone, as this informa-tion would facilitate structure and function prediction.Current methods for domain prediction rely mostly onMSAs [14, 12]; these methods perform poorly on or-phan sequences. Other approaches include analysis ofsecondary structure prediction [24], sidechain entropy[11], clusters of hydrophobic residues [12] or amino acidcomposition in the linker regions [13, 37].

We propose to visualize domain transition in pro-teins using our 3D representation of protein sequences,their 3D traces. We illustrate our approach on thesequence of Prf, a disease resistance gene in toma-toes [32]. The Prf gene encodes for a protein, PRF,of approximately 1800 residues, that contains at leastthree domains: an Nterminal domain, of which littleis known, a nucleotide binding domain (NBS), and a

Leucine Rich Repeat domain (LRR) [32]. PRF is amember of the large family of NBS-LRR proteins (forreview, see [25]). The 3D trace representation of PRFusing the BLOSUM62 3D vectors is shown in Figure 4.Each domain transition in PRF is revealed through achange in the overall direction of its 3D sequence trace.

We applied the same procedure to three large pro-teins whose domain definitions are known: pyruvatephosphate dikinase (PDB code 1DIK; 884 residues),human brain hexokinase type I (PDB code 1HKB;914 residues), and human ceruloplasmin (PDB code1KCW; 1040 residues). Results are shown in Figure 4

According to SCOP, 1DIK contains three domains:an ATP binding domain (alpha+beta ATP grasp do-main) from residue 1 to 376, a phosphohistidine do-main (beta/beta/alpha domain) from residue 377 to505, and a pyruvate kinase domain (alpha/beta timbarrel) from residue 510 to 884; all three domains areclearly delineated on the 3D trace.

1HKB is a good example of the current limits of theuse of the 3D trace for domain identification. Chain Bof 1HKB contains four consecutive ribonuclease H-likemotifs (alpha+beta domains): these are more difficultto distinguish based on the 3D trace only, as there areno significant changes in the overall directions of thetrace.

Interestingly, results are much better on 1KCW,which contains six consecutive rubredoxin-like domains(all beta domains), in particular for domains three,four, five and six.

4 Conclusions and Perspectives

The amino acid sequence of a protein is the key tounderstanding its structure and ultimately its functionin the cell. We have shown that amino acids can be en-coded by 3D vectors, thereby allowing us to generatea geometric representation of their properties. We de-rived one set of 3D vectors, namely the BLV62, basedon the BLOSUM62 substitution matrix, respectively.Concatenation of the vectors corresponding to the suc-cessive amino acids in a protein sequence generates a3D trace.

Substitution matrices provide the odds that anygiven amino acid can be replaced by any other for agiven amount of time. Among all existing substitutionmatrices, BLOSUM62 occupies a special position as itis the default matrix used for protein sequence databasesearch. Using PCA, we have shown that BLOSUM62can be projected into a 3D space, without significantloss of information. The three principal axes correlatebest with hydrophobicity, number of contacts (whichrelates to size), and propensities to belong to an α-helix

Page 7: A Geometric Representation of Protein Sequences

Figure 4. 3D sequence traces detect domains in proteins. The 3D sequence trace of PRF, a diseaseresistance gene of tomatoes, 1DIK, a pyruvate phosphate dikinase, chain B of 1HKB (1HKBB), ahexokinase type I, and 1KCW, a ceruloplasmin are shown. Separation between known domainsof these proteins (based on sequence analysis for PRF, and based on the SCOP classification ofprotein structures for 1DIK 1HKBB and 1KCW) are shown as line segments; they usually correspondto change in directions in the 3D trace. 1HKBB is an exception (see text for details). Model structuresfor each known domains are shown in cartoon representation. These models were generated usingpymol (http://www.pymol.org) .

or a turn, respectively. While the dependence for thefirst two axes was already described for MDM78 [36],the dependence of the third axis on secondary structurewas not previously described. We believe that this isof importance as it clearly adds a structural informa-tion onto the sequence representation. We will furtherstudy this correlation.

We have focused this research on the visualization of

protein sequences in 3D space, using the novel 3D traceconcept. Simple visual analysis of this 3D polygonalrepresentation provides access to structural propertiesof the corresponding protein, such as its partitioninginto domains. We will extend this approach. In partic-ular we are interested in developing a quantification ofthe information contained in the 3D trace. We will con-sider applying wavelet transforms to extract geometric

Page 8: A Geometric Representation of Protein Sequences

signatures of the 3D trace. Wavelet transforms have al-ready been applied to the analysis of protein sequences(see, for example, [8, 29]. In these approaches, the pro-tein sequences are converted to numerical sequences,using an amino acid electron-ion interaction potential[41]. Interestingly, this amino acid index (included intoAAIndex), is not correlated to the three principal axesof BLOSUM62 to any significant extend.

There are many ways to combine the 3D vectors cor-responding to the amino acids into a complete repre-sentation for the entire sequence of a protein. We haverelied on probably the simplest of such representations,i.e., the concatenation of the vectors. We will furtherinvestigate which other graphical representations sup-port highly effective visual and quantitative extractionof the information contained in a protein sequence.

In this paper, we have encoded amino acids into3D vectors derived from a substitution matrix (BLO-SUM62). Note that this concept can be generalized toother properties. It is possible, for example, to repre-sent each amino acid using a vector that contains itspropensities to belong to a helix, a β strand, or a turn.Such vectors, and the corresponding 3D traces, shouldprove useful for predicting the structural classes of aprotein. We are currently developing this representa-tion.

5. Acknowledgments

This work was supported in part by the NationalScience Foundation under contracts CCF-0625744 anda large Information Technology Research (ITR) grant.

References

[1] S. Altschul. Amino acid substitution matrices froman information theoretic perspective. J. Mol. Biol.,219:555–565, 1991.

[2] F. Bai and T. Wang. On graphical and numericalrepresentation of protein sequences. J. Biomol. Struct.Dyn., 23:537–545, 2006.

[3] U. Bastolla, M. Porto, H. Roman, and M. Vendrus-colo. Principal eigenvector of contact matrices andhydrophobicity profiles in proteins. Proteins: Struct.Func. Bioinfo., 58:22–30, 2005.

[4] C. Chothia and A. Lesk. The relation between the di-vergence of sequence and structure in proteins. EMBOJ., 5:823–826, 1986.

[5] P. Chou and G. Fasman. Prediction of the secondarystructure of proteins from their amino acid sequence.Adv. Enzymol. Relat. Areas Mol. Biol., 47:45–148,1978.

[6] H. Cid, M. Bunster, M. Canales, and F. Cazitua. Hy-drophobicity and structural classes in proteins. Prot.Eng., 5:373–375, 1992.

[7] M. Dayhoff. A model of evolutionary changes in pro-teins. Atlas of Protein Sequence and Structure, 5:345–352, 1978.

[8] C. H. de Trad, Q. Fang, and I. Cosic. Protein sequencecomparison based on the wavelet transform approach.Protein Eng, 15:193–203, 2002.

[9] Z. Feng and C.-T. Zhang. A graphic representationof protein sequence and predicting the subcellular lo-cations of prokaryotic proteins. Int. J. Biochem. CellBiol., 34:298–307, 2002.

[10] S. French and B. Robson. What is a conservative sub-stitution. J. Molec. Evol., 19:171–175, 1983.

[11] O. Galzitskaya and B. Melnik. Prediction of proteindomain boundaries from sequence alone. Protein Sci.,12:696–701, 2003.

[12] R. George and J. Heringa. Snapdragon: a methodto delineate protein structural domains from sequencedata. J. Mol. Biol., 316:839–851, 2002.

[13] R. George and J. Heringa. An analysis of protein do-main linkers: their classification and role in proteinfolding. Prot. Eng., 15:871–879, 2003.

[14] X. Guan and L. Du. Domain identification by cluster-ing sequence alignments. Bioinformatics, 14:783–788,1998.

[15] S. Henikoff and J. Henikoff. Amino acid substitutionmatrices from protein blocks. Proc. Natl. Acad. Sci.(USA), 89:10915–10919, 1992.

[16] S. Henikoff and J. Henikoff. Amino acid substitutionmatrices. Adv. Protein Chem., 54:73–97, 2000.

[17] D. Jones, W. Taylor, and J. Thornton. The rapidgeneration of mutation data matrices from protein se-quences. CABIOS, 8:275–282, 1992.

[18] A. Kidera, Y. Konishi, M. Oka, T. Ooi, and H. Scher-aga. Statistical analysis of the physical properties ofthe 20 naturally occuring amino acids. J. Prot. Chem.,4:23–55, 1985.

[19] A. Kinjo and K. Nishikawa. Eigenvalue analysis ofamino acid substitution matrices reveals a sharp tran-sition of the mode of sequence conservations in pro-teins. Bioinformatics, 20:2504–2508, 2004.

[20] P. Koehl and M. Levitt. Structure-based conforma-tional preferences of amino acids. Proc. Natl. Acad.Sci. (USA), 96:12524–9, 1999.

[21] J. Kyte and R. F. Doolittle. A simple method fordisplaying the hydropathic character of a protein. J.Mol. Biol., 157:105–132, 1982.

[22] I. Ladunga. Physean: Physical sequence analysisfor the identification of protein domains on the ba-sis of physical and chemical properties of amino acids.Bioinformatics, 15:1028–1038, 1999.

[23] S. Maetschke, M. Towsey, and M. Boden. Blomap: anencoding of amino acids which improves signal peptidecleavage site prediction. Asia Pacific BioinformaticsConference, pages 141–150, 2005.

[24] R. Marsden, L. McGuffin, and D. Jones. Rapid proteindomain assignment from amino acid sequence usingpredicted secondary structure. Protein Sci., 11:2814–2824, 2002.

Page 9: A Geometric Representation of Protein Sequences

[25] L. McHale, X. Tan, P. Koehl, and R. Michelmore.Plant nbs-lrr proteins: adaptable guards. Genome Bi-ology, 7:212, 2006.

[26] D. Milbrurn, R. Laskowski, and J. Thornton. Se-quences annotated by structure: a tool to facilitatethe use of structural information in sequence analysis.Prot. Engineering, 11:855–859, 1998.

[27] K. Nishikawa and T. Ooi. Radial locations of aminoacid residues in a globular protein: Correlation withthe sequence. J. Biochem., 100:1043–1047, 1986.

[28] M. Oobatake and T. Ooi. An analysis of non-bondedenergy of proteins. J. Theor. Biol., 67:567–584, 1977.

[29] T. Riaz, K.-B. Li, F. Tang, and A. Krishnan. Cmd-wave: Conserved motifs detection using wavelets. InSilico Biology, 5:0038, 2005.

[30] J. Richardson. The anatomy and taxonomy of proteinstructure. Adv. Protein Chem., 34:167–339, 1981.

[31] G. Rose. Hierarchic organization of domains in glob-ular proteins. J. Mol. Biol., 134:447–470, 1979.

[32] J. Salmeron, G. Oldroyd, C. Rommens, and S. S. et al.Tomato prf is a member of the leucine-rich repeat classof plant disease resistance genes and lies embeddedwithin the pto kinase gene cluster. Cell, 86:123–133,1996.

[33] J. Sammon. A nonlinear mapping for data structureanalysis. IEEE Trans. Comput., C-18:401–409, 1969.

[34] R. Schwartz and M. Dayhoff. Matrices for detectingdistant relationships. Atlas of Protein Sequence andStructure, 5:345–352, 1978.

[35] N. Shah, S. Dillard, G. Weber, and B. Hamann. Vol-ume vizualization of multiple alignment of large ge-nomic dna. In T. Moeller, B. Hamann, and R. Russell,editors, Mathematical foundations of scientific visual-ization, computer graphics, and massive data explo-ration. Springer Verlag, Heidelberg, Germany, 2007.to appear.

[36] R. Swanson. A vector representation for amino acidsequences. Bull. Math. Bio., 46:623–639, 1984.

[37] T. Tanaka, Y. Kuroda, and S. Yokoyama. Charac-teristics and prediction of domain linker sequences inmulti-domain proteins. J. Struct. Funct. Genomics,4:79–85, 2003.

[38] W. R. Taylor. Residual colours: a proposal foraminochromography. Protein Eng, 10:743–746, 1997.

[39] J. Thompson, T. Gibson, F. Plewniak, F. Jeanmougin,and D. Higgins. The clustalx windows interface: flexi-ble strategies for multiple sequence alignment aided byquality analysis tools. Nucl. Acids. Res., 25:4876–82,1997.

[40] K. Tomii and M. Kanehisa. Analysis of amino acidindices and mutation matrices for sequence compari-son and structure prediction of proteins. Prot. Eng.,9:27–26, 1996.

[41] V. Veljkovic, I. Cosic, B. Dimitrijevic, and D. Lalovic.Is it possible to analyze dna and protein sequences bythe method of digital signal processing? IEEE Trans.Biomed. Eng., 32:337–341, 1985.

[42] S. Veretnik, P. Bourne, N. Alexandrov, andI. Shindyalov. Toward consistent assignment of struc-tural domains in proteins. J. Mol. Biol., 339:647–678,2004.

[43] J. Watson, R. Laskowski, and J. Thornton. Predictingprotein function from sequence and structural data.Curr. Opin. Struct. Biol., 15:275–284, 2005.

[44] K. Yamamoto and H. Yoshikura. A new representationof protein structure: vector diagram. CABIOS, 2:83–88, 1986.

[45] R. Zhang and C.-T. Zhang. Z curves, an intutive toolfor visualizing and analyzing the dna sequences. J.Biomol. Str. Dyn, 11:767–782, 1994.

[46] H. Zhou and Y. Zhou. Quantifying the effect of burialof amino acid residues on protein stability. Proteins:Struct. Func. Genet., 54:315–322, 2004.

6. Appendix

BLV62Amino acid X Y Z

Ala (A) -0.43 4.02 1.99Cys (C) 6.36 5.21 6.80Leu (L) 8.42 3.01 -3.36Met (M) 5.77 2.26 -5.26Glu (E) -7.64 -1.40 -1.83Gln (Q) -6.09 -1.65 -3.45His (H) -3.71 -7.16 -1.47Lys (K) -6.14 -0.50 -1.51Val (V) 6.50 5.15 -2.51Ile (I) 8.28 4.01 -2.74

Phe (F) 8.98 -4.54 -0.12Tyr (Y) 6.29 -7.75 -0.24Trp (W) 8.74 -8.89 4.91Thr (T) -1.05 3.92 -0.25Gly (G) -3.93 0.95 6.28Ser (S) -4.65 1.91 1.93Asp (D) -8.31 0.79 0.48Asn (N) -7.62 -1.00 -0.09Pro (P) -4.12 3.44 3.05Arg (R) -5.65 -1.80 -2.60

Table 1. Coordinates of the centered BLV62vectors for all twenty amino acids.