proteinsclassificationeng.doc


Upload: butest

Post on 14-Jun-2015


Machine learning approach to protein fold class recognition when classes have low homology between their amino acid sequences

Introduction

Over the last several decades there has been great mutual interest between specialists in computer science and in molecular biology. For biologists, this interest stems from the need to develop adequate mathematics and models for storing, accessing and processing the tremendous and constantly growing volume of biological material. Researchers in computer science have similar interests, which stem from the nature of microbiological data. The main feature of such data is that it can be represented by sequences of symbols drawn from a finite and compact alphabet: the set of 20 amino acids. Studying different properties of these sequences yields data structures so distinctive that classical theories of data analysis and pattern recognition cannot be applied to them directly. For this reason, many specialized algorithms for processing such data have been developed. The social significance of this collaboration should also be taken into account. The applications of cell biology are vast: they include pharmaceutics, the food industry, agriculture, ecology, and the oil industry. This collaboration has been called “bioinformatics” [1]. There now exists a settled set of biological problems that can also be posed as machine learning problems: sequence clustering and cluster topology, protein structure prediction, protein function prediction, and protein family classification. This paper describes the construction of an automatic classifier for a specific protein database, drawing on the authors’ experience in pattern recognition.

One of the hardest problems in bioinformatics is protein fold recognition [2,3]. It focuses on the design of a recognizer that is able to assign a given amino acid sequence to one of a set of predefined classes of 3D structures of protein molecules [4]. The problem becomes more difficult when we consider classes of protein structures that contain proteins whose amino acid sequences are very far from each other (more precisely, whose pairwise alignment covers less than 25% of their length [5]). This “subproblem” of protein fold recognition was stimulated by the arrival of new genomic data. This new source of data showed that all existing methods for protein structural assignment work well enough on only 60% of new data; conversely, nothing can be said about the structural properties of the remaining 40% [6]. The problem of recognizing fold classes whose sequence identity is less than or equal to 25% has been a topic of discussion at most bioinformatics conferences and in journals over the last two years. However, until now almost no one has provided a data set that could serve as a basis for experimental studies of the problem. Fortunately, there is one exception. Professor Sung-Hou Kim (Lawrence Berkeley National Laboratory) has recently developed a database which, on the one hand, gives good coverage of the existing and known fold classes of this kind and, on the other hand, provides examples for each class [7]. His database consists of 420 protein domain structures, which were divided into 51 similar fold groups. In Appendix I we give the list of PDB IDs for all these proteins, together with the locations of the domain fragments on the corresponding entire protein sequences. The Appendix also contains information about the grouping of the proteins into 51 fold classes.

Classification of protein 3D structures

As is clear from the foregoing, one of the main tasks in molecular biology is classification: selecting, from all possible sequences, sets that form closely spaced groups. “Closely” here has several meanings. First, it means similarity in terms of biological function. Second, it means distance similarity among 3D structures; it is believed that such similarity reflects the first kind. Finally, it means percentage similarity of alignments. Information about such classifications is collected in special hierarchical databases. Several examples of such databases follow.

(1) SCOP (Structural Classification of Proteins) is the best-known structure classification (http://scop.mrc-lmb.cam.ac.uk/scop/). The SCOP database, created by manual inspection and abetted by a battery of automated methods, aims to provide a detailed and comprehensive description of the structural and evolutionary relationships between all proteins whose structure is known. As such, it provides a broad survey of all known protein folds, detailed information about the close relatives of any particular protein, and a framework for future research and classification. SCOP has four main levels of classification: family, superfamily, fold and class.

(2) CATH (http://www.biochem.ucl.ac.uk/cath/) is a hierarchical classification of protein domain structures, which clusters proteins at four major levels: class (C), architecture (A), topology (T) and homologous superfamily (H). Class, derived from secondary structure content, is assigned automatically for more than 90% of protein structures. Architecture, which describes the gross orientation of secondary structures independent of connectivities, is currently assigned manually. The topology level clusters structures according to their topological connections and numbers of secondary structures. The homologous superfamilies cluster proteins with highly similar structures and functions. The assignments of structures to topology families and homologous superfamilies are made by sequence and structure comparisons.

(3) FSSP (Fold classification based on Structure-Structure alignment of Proteins) (http://www2.ebi.ac.uk/dali/fssp/). Structure alignments are performed automatically using the Dali program. All protein chains from the Protein Data Bank that are longer than 30 residues are included. The chains are divided into a representative set and sequence homologs of structures in the representative set. Sequence homologs have more than 25% sequence identity, and the representative set contains no pair of such sequence homologs. An all-against-all structure comparison is performed on the representative set. The resulting alignments are reported in the FSSP entries for individual chains. In addition, FSSP entries include the structure alignments of the search structure with its sequence homologs. The classification and structure alignments are continuously updated as new structures are released by the Protein Data Bank.

Data and problem task

The Kim data set represents a very difficult biological problem, and one of our first experiments was an attempt to understand the peculiarities and main features of this difficulty.

Data structure: Based on 3D structure, Dr. Kim divided all 420 sequences into 51 groups (fold classes). The groups have varying numbers of representatives. The four smallest folds (#12 - C2 domain, #30 - Thiamin-binding, #35 - Phosphoribosyltransferases (PRTases), #51 - beta-Lactamase/D-ala carboxypeptidase) contain only 3 sequences each; the biggest folds contain 38 (#26 - TIM-barrel), 31 (#9 - Immunoglobulin beta-sandwich) and 20 (#45 - Ferredoxin) sequences.

The problem task: (1) To construct a system of reasonable features and to build a multidimensional feature space. (2) Using methods of machine learning theory, to find discriminant functions for the multiclass recognition task and to confirm the results statistically. (3) Using methods of cluster analysis, to find naturally compact groups of subsequences as well as groups of features.


Table 1. Fold names.

| No. | Fold name | Number of proteins |
|---|---|---|
| 1 | Globin | 12 |
| 2 | Cytochrome c | 7 |
| 3 | Four-helical bundle | 8 |
| 4 | Ferritin | 8 |
| 5 | 4-helical cytokines | 11 |
| 6 | EF Hand | 13 |
| 7 | Cyclin | 4 |
| 8 | Cytochrome P450 | 5 |
| 9 | Immunoglobulin beta-sandwich | 31 |
| 10 | Common fold of diphtheria toxin/transcription factors/cytochrome | 5 |
| 11 | Cupredoxins | 9 |
| 12 | C2 domain | 3 |
| 13 | Viral coat and capsid proteins | 15 |
| 14 | Crystallins/protein S/yeast killer toxin | 5 |
| 15 | Galactose-binding domain | 4 |
| 16 | ConA lectins/glucanases | 8 |
| 17 | OB-fold | 17 |
| 18 | Beta-Trefoil | 5 |
| 19 | Reductase/isomerase/elongation factor common domain | 4 |
| 20 | Trypsin serine proteases | 6 |
| 21 | Acid proteases | 5 |
| 22 | PH domain | 7 |
| 23 | Lipocalins | 6 |
| 24 | Double-stranded beta-helix | 6 |
| 25 | Barrel-sandwich hybrid | 6 |
| 26 | TIM-barrel | 38 |
| 27 | Flavodoxin | 9 |
| 28 | Adenine nucleotide alpha hydrolase | 4 |
| 29 | Rossmann-fold domains | 14 |
| 30 | Thiamin-binding | 3 |
| 31 | P-loop containing NTP hydrolases | 9 |
| 32 | Thioredoxin fold | 9 |
| 33 | Restriction endonucleases | 5 |
| 34 | Ribonuclease H motif | 9 |
| 35 | Phosphoribosyltransferases (PRTases) | 3 |
| 36 | S-adenosyl-L-methionine-dependent methyltransferases | 5 |
| 37 | alpha/beta-Hydrolases | 12 |
| 38 | Phosphorylase/hydrolase | 5 |
| 39 | Periplasmic binding protein I | 7 |
| 40 | Periplasmic binding protein II | 7 |
| 41 | Lysozyme | 4 |
| 42 | Cysteine proteinases | 4 |
| 43 | Beta-Grasp | 8 |
| 44 | Cystatin | 7 |
| 45 | Ferredoxin | 20 |
| 46 | Zincin | 7 |
| 47 | N-terminal nucleophile aminohydrolases (Ntn hydrolases) | 4 |
| 48 | ADP-ribosylation | 4 |
| 49 | C-type lectin | 6 |
| 50 | Protein kinases (PK), catalytic core | 4 |
| 51 | beta-Lactamase/D-ala carboxypeptidase | 3 |

Feature space construction


The input data was given as the so-called “primary structure”, that is, as simple sequences of amino acids. Our first step was therefore to find numerical features that would reflect the essence of the Kim classification. We considered two ways of constructing such features. (1) Using the Fasta 3 alignment procedure [8,9] (ftp://ftp.virginia.edu/pub/fasta), a matrix of mutual similarities between all sequences was built. We then used this matrix as input for pattern recognition algorithms. (2) Hidden Markov models were constructed for all folds, and a score was then calculated for each sequence under each model. The result is a 420x51 matrix whose rows are the score vectors of the sequences. To make the test procedure valid, this scheme was repeated 420 times, each time removing one sequence from the data (jackknife procedure).
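The first feature construction can be illustrated with a minimal sketch. It builds a symmetric matrix of mutual similarities for a list of sequences. The scoring function `toy_similarity` is a hypothetical stand-in based on Python's `difflib`; it is not Fasta 3 and is used only so that the example is self-contained. The function names and the three short sequences are likewise assumptions made for illustration.

```python
from difflib import SequenceMatcher


def toy_similarity(a: str, b: str) -> float:
    """Stand-in for an alignment score (NOT Fasta 3).

    Returns a value in [0, 1]: twice the number of matched characters
    divided by the total length of both sequences.
    """
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def similarity_matrix(seqs):
    """Build the symmetric matrix of mutual similarities between sequences."""
    n = len(seqs)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        sim[i][i] = 1.0  # a sequence is maximally similar to itself
        for j in range(i + 1, n):
            # Score each pair once and mirror it to keep the matrix symmetric.
            score = toy_similarity(seqs[i], seqs[j])
            sim[i][j] = sim[j][i] = score
    return sim


# Hypothetical toy sequences; the real data set has 420 domain sequences.
example = similarity_matrix(["MKVLAA", "MKVLAG", "WWWWWW"])
```

Each row of the resulting matrix can then serve as the feature vector of one sequence, which is the form of input the pattern recognition methods below operate on.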

Methods of pattern recognition

To apply methods of pattern recognition theory, we examined the k-Nearest Neighbor algorithm [10] (with k equal to 1 and 3) and the method of reliability functions. In the 3-Nearest Neighbor algorithm, a decision in favor of a class was taken if two or three of the nearest objects had the same class label. Multiclass classification with the reliability functions method was carried out by the “one-against-others” scheme. The results of the jackknife cross-validation procedure for all 51 folds are presented in Tables 2-4. Each column of a table is the result of one experiment, “test class against others”. The diagonal element shows the number of cases correctly recognized. The sum of the off-diagonal elements shows how many times objects from outside the class were recognized as members of the test class. Clearly, a positive result is a matrix with large numbers on the diagonal and zeros off the diagonal.
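The k-Nearest Neighbor rule with the jackknife procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `knn_jackknife`, the use of a similarity matrix as input, and the toy data are all hypothetical. A decision is returned only when a strict majority of the k nearest objects agree (1 of 1 for k = 1, at least 2 of 3 for k = 3); otherwise the object is left unclassified.

```python
from collections import Counter


def knn_jackknife(sim, labels, k):
    """Leave-one-out k-NN on a similarity matrix.

    sim[i][j] is the similarity between objects i and j (larger = closer).
    Returns one predicted label per object, or None when no class reaches
    a strict majority among the k nearest neighbours.
    """
    n = len(labels)
    predictions = []
    for i in range(n):
        # Jackknife step: object i itself is excluded from the reference set.
        others = [j for j in range(n) if j != i]
        nearest = sorted(others, key=lambda j: sim[i][j], reverse=True)[:k]
        label, votes = Counter(labels[j] for j in nearest).most_common(1)[0]
        predictions.append(label if votes > k // 2 else None)
    return predictions


# Hypothetical toy data: class A has 3 members, class B has only 2.
labels = ["A", "A", "A", "B", "B"]
sim = [[1.0 if i == j else (0.9 if labels[i] == labels[j] else 0.1)
        for j in range(5)] for i in range(5)]

pred_1nn = knn_jackknife(sim, labels, 1)  # every object finds its own class
pred_3nn = knn_jackknife(sim, labels, 3)  # members of B are outvoted by A
```

In this toy example the two-member class is recognized by 1-NN but lost by 3-NN, because after removing the test object only one member of its class remains. This illustrates one way in which very small classes can be penalized by the 3-NN rule.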

Analysis of results

The results for the similarity matrix built by Fasta 3 (Tables 2-4) show approximately equal accuracy for the 3-Nearest Neighbor and reliability functions methods. Combining all 3 methods allows us to recognize 14 of the 51 classes (containing 93 proteins) with an accuracy of about 80% (see Table 8).

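Below we explain how we calculate the parameters Specificity, Sensitivity, False Negative and False Positive. The capital letters A and B denote two classes: A is the test class and B is the non-class (all other objects). In each count, the first letter gives the real class and the second letter the class assigned by the machine. These parameters for all 3 experiments are collected in Tables 5-7.

| | Real A | Real B |
|---|---|---|
| Machine A | AA | BA |
| Machine B | AB | BB |

Specificity = AA/(AA+AB)

Sensitivity = BB/(BA+BB)

False Negative = BA/(BA+AA)

False Positive = AB/(AB+BB)

Thus Specificity is the fraction of objects of class A that are assigned to A, and Sensitivity is the fraction of non-class objects that are assigned to B.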
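The four parameters can be computed directly from the four counts, as in the following minimal sketch. The function name `class_measures` is hypothetical, and returning `None` for a zero denominator is an assumption about how undefined values (shown as "-" in Tables 5-7) arise. The example counts are chosen to be consistent with fold 1 (Globin) under 1-Nearest Neighbor; the value BA = 2 is inferred for illustration and is not stated in the text.

```python
def class_measures(aa, ab, ba, bb):
    """Compute the four parameters, in percent, using the definitions above.

    aa: real A, machine A    ba: real B, machine A
    ab: real A, machine B    bb: real B, machine B
    A measure whose denominator is zero is returned as None.
    """
    def pct(num, den):
        return None if den == 0 else 100.0 * num / den

    return {
        "specificity": pct(aa, aa + ab),
        "sensitivity": pct(bb, ba + bb),
        "false_negative": pct(ba, ba + aa),
        "false_positive": pct(ab, ab + bb),
    }


# Illustrative counts: 12 Globins, all recognized; 2 of the other 408
# proteins assigned to the Globin class (assumed value).
globin = class_measures(aa=12, ab=0, ba=2, bb=406)
```

With these counts Specificity is 100%, Sensitivity is about 99.5%, False Negative is about 14% and False Positive is 0%.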


Analysis of the results obtained on the HMM-based data matrix shows the inadequacy of this approach for these data. A positive result was reached on only one fold class (#8 - Cytochrome P450). The models show good results at the training stage but have poor extrapolation properties: when an object is removed from the training set it receives a small probability of correct membership, although when it is present in the training set it is classified entirely correctly. In our opinion, this “slackness” of features obtained from an objectively more informative model (in comparison with alignment scores) results from the insufficient number of representatives of the classes under study.


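Table 2. Results of jackknife cross-validation. Classifier: 1-Nearest Neighbor.

Each row lists the fold number, the number of proteins in the fold, and the non-zero entries of that row in left-to-right order (columns are numbered 1-51 by fold).

| Fold | Number of proteins | Non-zero entries |
|---|---|---|
| 1 | 12 | 12 |
| 2 | 7 | 7 |
| 3 | 8 | 4 2 1 1 |
| 4 | 8 | 5 1 1 1 |
| 5 | 11 | 1 1 1 1 1 1 1 1 1 1 1 |
| 6 | 13 | 11 1 1 |
| 7 | 4 | 3 1 |
| 8 | 5 | 5 |
| 9 | 31 | 16 1 3 1 1 2 1 1 1 1 1 1 |
| 10 | 5 | 1 1 1 1 1 |
| 11 | 9 | 8 1 |
| 12 | 3 | 1 1 1 |
| 13 | 15 | 1 1 5 1 2 1 1 1 1 1 |
| 14 | 5 | 3 1 1 |
| 15 | 4 | 1 1 1 1 |
| 16 | 8 | 2 2 1 1 2 |
| 17 | 17 | 1 1 1 2 1 1 2 2 1 1 1 1 1 1 |
| 18 | 5 | 1 1 1 1 1 |
| 19 | 4 | 1 1 2 |
| 20 | 6 | 1 3 1 1 |
| 21 | 5 | 4 1 |
| 22 | 7 | 1 2 3 1 |
| 23 | 6 | 1 4 1 |
| 24 | 6 | 1 2 1 1 1 |
| 25 | 6 | 1 5 |
| 26 | 38 | 1 1 1 1 1 24 1 1 1 1 1 1 1 1 1 |
| 27 | 9 | 1 1 4 1 1 1 |
| 28 | 4 | 2 1 1 |
| 29 | 14 | 2 1 11 |
| 30 | 3 | 2 1 |
| 31 | 9 | 1 1 2 1 2 1 1 |
| 32 | 9 | 1 7 1 |
| 33 | 5 | 1 1 1 1 1 |
| 34 | 9 | 1 1 1 2 1 1 1 1 |
| 35 | 3 | 1 2 |
| 36 | 5 | 2 1 1 1 |
| 37 | 12 | 1 1 9 1 |
| 38 | 5 | 1 1 1 2 |
| 39 | 7 | 7 |
| 40 | 7 | 1 6 |
| 41 | 4 | 2 1 1 |
| 42 | 4 | 1 3 |
| 43 | 8 | 1 1 1 1 1 1 1 1 |
| 44 | 7 | 1 1 1 1 1 2 |
| 45 | 20 | 1 1 1 3 2 1 1 9 1 |
| 46 | 7 | 1 1 1 1 3 |
| 47 | 4 | 4 |
| 48 | 4 | 4 |
| 49 | 6 | 1 5 |
| 50 | 4 | 4 |
| 51 | 3 | 3 |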


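Table 3. Results of jackknife cross-validation. Classifier: 3-Nearest Neighbor.

Each row lists the fold number, the number of proteins in the fold, and the non-zero entries of that row in left-to-right order (columns are numbered 1-51 by fold).

| Fold | Number of proteins | Non-zero entries |
|---|---|---|
| 1 | 12 | 10 |
| 2 | 7 | 6 |
| 3 | 8 | 3 1 |
| 4 | 8 | 2 |
| 5 | 11 | 1 1 |
| 6 | 13 | 11 |
| 7 | 4 | 1 |
| 8 | 5 | 5 |
| 9 | 31 | 12 1 1 1 |
| 10 | 5 | 1 |
| 11 | 9 | 7 |
| 12 | 3 | |
| 13 | 15 | 1 1 1 |
| 14 | 5 | 4 |
| 15 | 4 | |
| 16 | 8 | 1 1 |
| 17 | 17 | 1 |
| 18 | 5 | 1 |
| 19 | 4 | 1 |
| 20 | 6 | 3 1 |
| 21 | 5 | 3 |
| 22 | 7 | 1 |
| 23 | 6 | 4 |
| 24 | 6 | 1 1 |
| 25 | 6 | 4 |
| 26 | 38 | 18 |
| 27 | 9 | 1 1 |
| 28 | 4 | |
| 29 | 14 | 1 11 |
| 30 | 3 | |
| 31 | 9 | 1 |
| 32 | 9 | 2 |
| 33 | 5 | |
| 34 | 9 | |
| 35 | 3 | 1 |
| 36 | 5 | 1 |
| 37 | 12 | 5 |
| 38 | 5 | 1 |
| 39 | 7 | 1 1 6 |
| 40 | 7 | 1 1 |
| 41 | 4 | 1 |
| 42 | 4 | 1 |
| 43 | 8 | |
| 44 | 7 | 1 |
| 45 | 20 | 1 1 7 |
| 46 | 7 | 1 2 |
| 47 | 4 | |
| 48 | 4 | |
| 49 | 6 | 2 |
| 50 | 4 | 4 |
| 51 | 3 | |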


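Table 4. Results of jackknife cross-validation. Classifier: Reliability Function.

Each row lists the fold number, the number of proteins in the fold, and the non-zero entries of that row in left-to-right order (columns are numbered 1-51 by fold).

| Fold | Number of proteins | Non-zero entries |
|---|---|---|
| 1 | 12 | 11 1 |
| 2 | 7 | 6 |
| 3 | 8 | 2 1 1 |
| 4 | 8 | 1 1 2 |
| 5 | 11 | 1 1 |
| 6 | 13 | 10 |
| 7 | 4 | 2 |
| 8 | 5 | 4 |
| 9 | 31 | 19 1 2 |
| 10 | 5 | 1 |
| 11 | 9 | 6 |
| 12 | 3 | |
| 13 | 15 | 1 2 1 |
| 14 | 5 | 1 3 1 |
| 15 | 4 | 1 |
| 16 | 8 | |
| 17 | 17 | 1 4 1 1 |
| 18 | 5 | 1 1 |
| 19 | 4 | |
| 20 | 6 | 1 |
| 21 | 5 | 3 |
| 22 | 7 | 2 1 |
| 23 | 6 | 4 |
| 24 | 6 | 1 |
| 25 | 6 | 1 4 |
| 26 | 38 | 1 1 20 |
| 27 | 9 | 2 1 |
| 28 | 4 | 1 |
| 29 | 14 | 7 |
| 30 | 3 | |
| 31 | 9 | 1 |
| 32 | 9 | 5 1 |
| 33 | 5 | |
| 34 | 9 | 1 1 |
| 35 | 3 | |
| 36 | 5 | |
| 37 | 12 | 1 1 4 |
| 38 | 5 | |
| 39 | 7 | 5 |
| 40 | 7 | 3 |
| 41 | 4 | |
| 42 | 4 | |
| 43 | 8 | 2 |
| 44 | 7 | 1 1 1 |
| 45 | 20 | 1 2 3 9 |
| 46 | 7 | 1 3 |
| 47 | 4 | 1 |
| 48 | 4 | |
| 49 | 6 | 1 1 1 |
| 50 | 4 | 3 |
| 51 | 3 | |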


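Table 5. Specificity, Sensitivity, False Negative and False Positive (%); 1-Nearest Neighbor.

| Fold | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Specificity | 100 | 100 | 50 | 63 | 9 | 85 | 75 | 100 | 52 | 0 | 89 | 0 | 33 | 60 | 20 | 20 | 12 | 0 | 50 | 50 | 80 | 43 | 67 | 33 | 83 |
| Sensitivity | ~100 | 100 | ~100 | 98 | 100 | 99 | 100 | 99 | 98 | 99 | 99 | 100 | 98 | 100 | 99 | 98 | 99 | 99 | ~100 | 98 | 99 | ~100 | ~100 | ~100 | ~100 |
| F-Negative | 14 | 0 | 20 | 58 | 0 | 21 | 0 | 44 | 27 | 100 | 33 | - | 67 | 0 | 67 | 78 | 71 | 100 | 67 | 70 | 56 | 25 | 20 | 50 | 29 |
| F-Positive | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

| Fold | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Specificity | 63 | 44 | 50 | 79 | 67 | 22 | 78 | 0 | 22 | 67 | 0 | 75 | 40 | 100 | 86 | 0 | 75 | 0 | 29 | 45 | 43 | 100 | 100 | 83 | 100 | 100 |
| Sensitivity | 93 | 99 | ~100 | 98 | ~100 | 97 | 99 | 100 | 99 | 99 | 99 | 98 | 98 | 97 | 98 | 100 | 99 | 100 | 98 | ~100 | ~100 | 100 | ~100 | 100 | ~100 | 99 |
| F-Negative | 51 | 60 | 33 | 39 | 33 | 83 | 36 | - | 50 | 50 | 100 | 47 | 71 | 59 | 54 | - | 50 | - | 78 | 18 | 50 | 0 | 33 | 0 | 50 | 625 |
| F-Positive | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |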

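Table 6. Specificity, Sensitivity, False Negative and False Positive (%); 3-Nearest Neighbor.

| Fold | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Specificity | 83 | 86 | 38 | 20 | 0 | 85 | 20 | 100 | 39 | 0 | 78 | 0 | 7 | 80 | 0 | 0 | 0 | 0 | 0 | 50 | 60 | 14 | 67 | 0 | 67 |
| Sensitivity | ~100 | 100 | 100 | ~100 | 100 | 100 | 100 | ~100 | ~100 | 100 | 100 | 100 | 100 | ~100 | 100 | ~100 | 100 | 100 | 100 | ~100 | 100 | 100 | ~100 | 100 | 100 |
| F-Negative | 9 | 0 | 0 | 50 | - | 0 | 0 | 17 | 8 | - | 0 | - | 0 | 20 | - | 100 | - | - | - | 50 | 0 | 0 | 20 | - | 0 |
| F-Positive | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

| Fold | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Specificity | 47 | 0 | 0 | 79 | 0 | 0 | 22 | 0 | 0 | 0 | 0 | 42 | 0 | 86 | 14 | 0 | 20 | 0 | 0 | 7 | 2 | 0 | 0 | 33 | 100 | 0 |
| Sensitivity | 96 | 100 | 100 | ~100 | 100 | ~100 | 100 | 100 | 100 | 100 | 100 | ~100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| F-Negative | 44 | - | - | 21 | - | - | 0 | - | - | - | - | 29 | - | 0 | 0 | - | 0 | - | - | 0 | 0 | - | - | 0 | 0 | - |
| F-Positive | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |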

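Table 7. Specificity, Sensitivity, False Negative and False Positive (%); Reliability Function.

| Fold | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Specificity | 92 | 86 | 20 | 13 | 9 | 77 | 50 | 80 | 61 | 0 | 67 | 0 | 13 | 60 | 0 | 0 | 24 | 0 | 0 | 17 | 60 | 14 | 67 | 0 | 67 |
| Sensitivity | 100 | 100 | 100 | 100 | ~100 | 100 | 100 | 100 | 98 | 100 | 100 | 100 | ~100 | 100 | 100 | ~100 | 97 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| F-Negative | 0 | 0 | 0 | 0 | 50 | 0 | 0 | 0 | 37 | - | 0 | - | 33 | 0 | - | 100 | 76 | - | - | 0 | 0 | 0 | 0 | - | 0 |
| F-Positive | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

| Fold | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Specificity | 53 | 22 | 0 | 50 | 0 | 11 | 56 | 0 | 11 | 0 | 0 | 33 | 0 | 71 | 43 | 0 | 0 | 0 | 0 | 45 | 43 | 20 | 0 | 17 | 75 | 0 |
| Sensitivity | 99 | 100 | 100 | ~100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | 100 | 98 | 100 | 100 | 100 | 100 | 100 | 100 |
| F-Negative | 20 | 0 | - | 13 | - | 0 | 0 | - | 0 | - | - | 0 | - | 0 | 0 | - | - | 100 | - | 50 | 0 | 0 | - | 0 | 0 | - |
| F-Positive | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |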


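Table 8. Recognized classes. (RF: Reliability Function; 1NN, 3NN: 1- and 3-Nearest Neighbor.)

| Fold number | Fold name | # obj. in fold | # obj. recogn. | Method |
|---|---|---|---|---|
| 1 | Globin | 12 | 11 | RF |
| 2 | Cytochrome c | 7 | 7 | 1NN |
| 6 | EF Hand | 13 | 11 | 3NN |
| 7 | Cyclin | 4 | 3 | 1NN |
| 8 | Cytochrome P450 | 5 | 4 | RF |
| 11 | Cupredoxins | 9 | 7 | 3NN |
| 14 | Crystallins/protein S/yeast killer toxin | 5 | 3 | RF |
| 21 | Acid proteases | 5 | 3 | RF, 3NN |
| 23 | Lipocalins | 6 | 3 | RF |
| 25 | Barrel-sandwich hybrid | 6 | 4 | RF, 3NN |
| 39 | Periplasmic binding protein I | 7 | 6 | 3NN |
| 47 | N-terminal nucleophile aminohydrolases | 4 | 4 | 1NN |
| 49 | C-type lectin | 6 | 5 | 1NN |
| 50 | Protein kinases (PK), catalytic core | 4 | 4 | 3NN |
| Total: | | 93 | 75 | 80.6% |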
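The totals in Table 8 can be checked with a few lines of arithmetic. The sketch below simply sums the per-fold counts from the table; the variable names are hypothetical.

```python
# (objects in fold, objects recognized) for each recognized fold, from Table 8.
recognized_folds = {
    1: (12, 11), 2: (7, 7), 6: (13, 11), 7: (4, 3), 8: (5, 4),
    11: (9, 7), 14: (5, 3), 21: (5, 3), 23: (6, 3), 25: (6, 4),
    39: (7, 6), 47: (4, 4), 49: (6, 5), 50: (4, 4),
}

total_objects = sum(n for n, _ in recognized_folds.values())
total_recognized = sum(r for _, r in recognized_folds.values())
accuracy_percent = round(100.0 * total_recognized / total_objects, 1)
```

The sums give 93 objects, 75 recognized, and an accuracy of 80.6%, in agreement with the last row of the table.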

Bibliography

1. Baldi P., Brunak S. Bioinformatics: The Machine Learning Approach. A Bradford Book, 1998.

2. Dubchak I., Holbrook S., Kim S-H. Prediction of protein class from amino acid composition. Proteins 16:79-91 (1993).

3. Iorger R., Rendell L., Subramaniam S. Constructive induction and protein tertiary structure prediction. In Proceedings of the First International Conference on Intelligent Systems for Molecular Biology, Bethesda, 1993.

4. Orengo C.A., Flores T.P., Taylor W.R., Thornton J.M. Identification and classification of protein fold families. Protein Engineering 6:485-500 (1993).

5. Fischer D., Eisenberg D. Protein fold recognition using sequence-derived predictions. Protein Science 5:947-955 (1996).

6. Murzin A.G., Brenner S.E., Hubbard T., Chothia C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247:536-540 (1995).

7. Kim S-H. et al. In press (1999).

8. Pearson W.R., Lipman D.J. Improved tools for biological sequence comparison. PNAS 85:2444-2448 (1988).

9. Pearson W.R. Rapid and sensitive sequence comparison with FASTP and FASTA. Methods in Enzymology 183:63-98 (1990).

10. Duda R., Hart P. Pattern Classification and Scene Analysis. New York: Wiley, 1973.


Appendix I. Kim Data

Each entry gives the PDB ID (with chain identifier where present), the domain length and, where given, the location of the domain fragment on the entire protein sequence. Entries are separated by semicolons.

Globin
1flp 142; 1mbd 153; 2fal 146; 2hbg 147; 3sdha 145; 2gdm 153; 2pghd 146; 1ash 147; 1itha 141; 1hlb 157; 1cpca 162; 1phnb 172

Cytochrome c
1cyj 90; 1ycc 108; 1cc5 83; 451c 82; 2mtac 147; 1etpa 92 a:1-92; 1fcdc 80 c:1-80

Four-helical bundle
1nfn 132; 1vls 146; 256ba 106; 1bbha 131; 1cpq 129; 2ccya 127; 2mhr 118; 1rmva 156

Ferritin
1bcfa 157; 1ryt 146 1-146; 2fha 172; 1afra 345; 1mhyd 510; 1mtyb 384; 1xika 340; 1xsm 288

4-helical cytokines
1bgc 158; 1cnt3 146; 1huw 166; 1lki 172; 1hula 108; 1jli 112; 1rcb 129; 2gmfa 121; 3inkc 121; 1rfba 119; 2ilk 155

EF Hand
5icb 75; 1sra 151; 1rro 108; 1ctda 34; 1ncx 162; 1pona 34; 2sas 185; 2scpa 174; 1osa 148; 1wdcb 142; 1wdcc 152; 1rec 185; 1cpo 119 1-119

Cyclin
1jkw 151 1-151; 1vin 128 1-128; 1aisb 98 b:1-97; 1vola 95 a:1-96

Cytochrome P450
1cpt 412; 1oxa 403; 1phd 405; 1rom 399; 2hpda 457

Immunoglobulin beta-sandwich
1cd8 114; 1cdca 96; 1cdy 97 1-97; 1cid 105 1-105; 1hnf 101 1-101; 1igtb 114 b:1-114; 1neu 115; 8fabb 121 b:1-121; 1dbah 112 h:1-112; 1nqba 120 a:1-120; 1tcra 117 a:1-117; 1bec 115 1-115; 1agdb 99; 1tit 89; 1tlk 103; 1vcaa 90 a:1-90; 1wit 93; 1zxq 86 1-86; 2ncm 99; 1ctn 109 1-109; 1fiea 182 a:1-182; 1ksr 100; 1rhoa 145; 1cto 109; 1ebpa 211; 1ten 89; 2hft 106 1-106; 1edha 211; 2mcm 112; 1xsoa 150; 1mspa 124

Common fold of diphtheria toxin/transcription factors/cytochrome
1exg; 1anu 138; 1nbca 155; 1qba 173 1-173; 1tupa 196

Cupredoxins
1aac 104; 1plc 99; 7paz 123; 2cbp 96; 1rcy 151; 1cyx 158; 1aoza 129 a:1-129; 1kcw 192 1-192; 1nif 159 1-159

C2 domain
1rsy 135; 3dpa 99 120-218; 1who 94

Viral coat and capsid proteins
2bpa2 175; 1bmv1 185; 1bmv2 374; 1cwpa 149; 1smva 196; 1stma 141; 2stv 184; 2tbva 283; 2bbva 308; 1bbt1 186; 1bbt2 210; 1bbt3 220; 1pov1 235; 1vpsa 285; 2mev1 268

Crystallins/protein S/yeast killer toxin
1amm 85 1-85; 2bb2 87 -1-87; 4gcr 85 1-85; 1prs 90 1-90; 1wkt 87

Galactose-binding domain
1gof 150 1-150; 1bgla 217 a:1-217; 1bhga 204 a:1-204; 1ulo 152

ConA lectins/glucanases
1led 243; 2ayh 214; 1lcl 141; 1slta 133; 1saca 204; 1kit 192 1-192; 1cela 434; 1xnb 185

OB-fold
1snc 135; 1bcpd 110; 1bova 69; 1prtf 98; 1tiid 98; 2qila 93 a:1-93; 1krs 110; 1jmca 116 a:1-116; 3ulla 106; 1ah9 71; 1mjc 69; 1sro 76; 1rip 81; 1gpc 218; 1gvp 87; 1pfsa 78; 2prd 174

beta-Trefoil
1bfg 126; 2i1b 153; 1abrb 140 b:1-140; 1wba 171; 1hce 118

Reductase/isomerase/elongation factor common domain
1fdr 99 1-99; 1fnc 136 1-136; 1ndh 123 1-123; 2pia 103 1-103

Trypsin serine proteases
1agja 242; 1arb 263; 2sga 181; 5ptp 223; 1svpa 160; 1hava 216

Acid proteases
1fmb 104; 2rspa 115; 1eaga 339; 2asi 356; 1smra 331

PH domain
1btka 160; 1btn 106; 1dyna 113; 1mai 119; 1pls 113; 1irsa 112; 1shca 195

Lipocalins
1bbpa 173; 1beba 156; 1epaa 151; 1mup 157; 1obpa 158; 1hmt 131

Double-stranded beta-helix
1caxb 184; 2phla 200 a:1-200; 1pmi 440; 1rgs 130 1-130; 2arca 161; 1wapa 69

Barrel-sandwich hybrid
1bdo 80 -; 1fyc 106; 1ghj 79; 1htp 131; 1iyu 79; 1gpr 158

TIM-barrel
1pama 383 a:1-383; 1smd 402 1-402; 1vjs 371 1-371; 2aaa 353 1-353; 1jdc 357 1-357; 1amy 346 1-346; 1byb 490; 1ceo 332; 1ecea 358; 1edg 380; 1xyza 320; 1gowa 489; 1pbga 452; 2myr 499; 1cnv 283; 1nar 289; 2ebn 285; 1fkx 348; 1psca 329; 1dhpa 292; 1fbaa 360; 1nal3 291; 1onra 316; 1dosa 343; 1frb 314; 1ak5 329; 1dora 311; 1gox 350; 1oyc 399; 2tmda 340 a:1-340; 1igs 247; 1nsj 205; 1pii 254 1-254; 2tysa 255; 4xis 386; 1luca 326; 1nfp 228; 1pud 372

Flavodoxin
1ntr 124; 3chy 128; 1rcf 169; 5nul 138; 1qrda 273; 1orda 107 a:1-107; 1cex 197; 1esc 302; 1dxy 100 1-100

Adenine nucleotide alpha hydrolase
1gln 305 1-305; 1gtra 331 a:1-331; 2ts1 217 1-217; 1nsya 271

Rossmann-fold domains
1cyda 242; 1dhr 236; 1enp 297; 1eny 268; 1fds 282; 1xel 338; 1ybva 270; 1bmda 154 a:1-154; 1hyha 146 a:1-146; 1ldg 146 1-146; 2cmd 145 1-145; 3ldh 162 1-162; 2pgd 176 1-176; 1scua 121 a:1-121

Thiamin-binding
1poxa 174 a:1-174; 1pvda 180 a:1-180; 1trka 335 a:1-335

P-loop containing NTP hydrolases
1deka 241; 1gky 186; 1dar 282 1-282; 1eft 212 1-212; 1hura 180; 1adea 431; 1dai 219; 1nipa 283; 2reb 266 1-266

Thioredoxin fold
1aba 87; 1erv 105; 1grx 85 -; 1kte 105; 1thx 108; 1gp1a 184; 1gnwa 84 a:1-84; 1pgta 76 a:1-76; 2trcp 217 p:

Restriction endonucleases
1eria 261; 1rvaa 244; 1bam 200; 1pvua 154; 1cfr 283

Ribonuclease H motif
1kay 185 1-185; 2btfa 145 a:1-145; 2yhx 201 1-201; 1glcg 250 g:1-250; 1chma 155 a:1-155; 2rn2 155; 1itg 142; 1vsd 146; 1hjra 158

Phosphoribosyltransferases (PRTases)
1hgxa 164; 1nula 142; 1opr 213

S-adenosyl-L-methionine-dependent methyltransferases
1vid 214; 1xvaa 293; 1v39 292; 1hmy 328; 2adma 386

alpha/beta-Hydrolases
1ac5 483; 1cpy 421; 1ivya 452; 1ede 310; 1din 232; 1thta 294; 1tca 317; 1thg 544; 3tgl 265; 1cvl 316; 1gpl 336 1-336; 1yasa 256

Phosphorylase/hydrolase
1ecpa 237; 1pbn 289; 2ctc 307; 1amp 291 -; 1xjo 271

Periplasmic binding protein I
1dbqa 276; 1gca 309; 1pea 368; 1tlfa 296; 2dri 271; 2liv 344; 8abp 305

Periplasmic binding protein II
1lst 239; 1pda 217 1-217; 1sbp 309; 1pot 322; 1ggga 220; 1lcf 334 1-334; 1ovb 159

Lysozyme
1cnsa 243; 1l92 162; 153l 185; 1chka 238

Cysteine proteinases
1gcb 452; 1ppn 212; 1thea 253; 1fiea 325 a:191-515

beta-Grasp
1igd 61; 2ptl 78; 1ubi 76; 1guab 76; 1alo 80 1-80; 1put 106; 1tif 76; 1lgr 100 1-100

Cystatin
1opy; 1mola 94; 1cewi 108 i:; 1stfi 98 i:; 1std 162; 1ouna 125; 1udii 83 i:

Ferredoxin
1fxd 58; 1xer 103; 2fxb 81; 8atcb 93 b:1-92; 1pyta 94; 1spbp 71; 2pii 112; 1npk 150; 1ha1 85 8-92; 1sxl 97; 1urna 96; 2u1a 88; 1vhia 139; 2bopa 85; 1bura 136 a:1-135; 5ruba 136 a:1-136; 1ris 97; 1afj 72; 1fwp 69; 1regx 122

Zincin
1kuh 132; 1hyt 155 1-155; 1lml 465; 1iae 200; 1atla 200; 1sat 236 1-236; 1hfc 157

N-terminal nucleophile aminohydrolases (Ntn hydrolases)
1ecfb 249 b:1-249; 1gdoa 238; 1pmaa 221; 1pmap 203

ADP-ribosylation
1aera 205; 1bcpa 224; 1ddt 187 1-187; 1lt3a 226

C-type lectin
1esl 118 1-118; 1lit 129; 1rtm1 117 1:33-139; 1bcpb 85 b:1-85; 1prea 83 a:1-83; 1tsg 98

Protein kinases (PK), catalytic core
1cdka 343; 1hcl 294; 1koba 353; 1csn 293

beta-Lactamase/D-ala carboxypeptidase
1btl 263; 2blta 359; 3pte 347