mining the structural genomics pipeline:...

16
Mining the Structural Genomics Pipeline: Identification of Protein Properties that Affect High-throughput Experimental Analysis Chern-Sing Goh 1,2 , Ning Lan 1,2 , Shawn M. Douglas 1,2,3 , Baolin Wu 4 Nathaniel Echols 1,2 , Andrew Smith 1,2,3 , Duncan Milburn 1,2 Gaetano T. Montelione 2,5,6,7 , Hongyu Zhao 4,8 and Mark Gerstein 1,2,3 * 1 Molecular Biophysics and Biochemistry, Yale University, 266 Whitney Ave, New Haven, CT 06520, USA 2 Northeast Structural Genomics Consortium, Robert Wood Johnson Medical School, UMDNJ Piscataway, NJ 08854, USA 3 Computer Science, Yale University, 266 Whitney Ave, New Haven, CT 06520, USA 4 Department of Epidemiology and Public Health, Yale University, 266 Whitney Ave, New Haven, CT 06520, USA 5 Center for Advanced Biotechnology and Medicine Robert Wood Johnson Medical School, UMDNJ, Piscataway, NJ 08854, USA 6 Department of Molecular Biology and Biochemistry, Rutgers University, Robert Wood Johnson Medical School, UMDNJ Piscataway, NJ 08854, USA 7 Department of Biochemistry Robert Wood Johnson Medical School, UMDNJ, Piscataway, NJ 08854, USA 8 Department of Genetics, Yale University, 266 Whitney Ave, New Haven, CT 06520, USA Structural genomics projects represent major undertakings that will change our understanding of proteins. They generate unique datasets that, for the first time, present a standardized view of proteins in terms of their physical and chemical properties. By analyzing these datasets here, we are able to discover correlations between a protein’s character- istics and its progress through each stage of the structural genomics pipe- line, from cloning, expression, purification, and ultimately to structural determination. First, we use tree-based analyses (decision trees and random forest algorithms) to discover the most significant protein features that influence a protein’s amenability to high-throughput experi- mentation. Based on this, we identify potential bottlenecks in various stages of the structural genomics process through specialized “pipeline schematics”. We find that the properties of a protein that are most signifi- cant are: (i) whether it is conserved across many organisms; (ii) the percentage composition of charged residues; (iii) the occurrence of hydro- phobic patches; (iv) the number of binding partners it has; and (v) its length. Conversely, a number of other properties that might have been thought to be important, such as nuclear localization signals, are not significant. Thus, using our tree-based analyses, we are able to identify combinations of features that best differentiate the small group of proteins for which a structure has been determined from all the currently selected targets. This information may prove useful in optimizing high- throughput experimentation. Further information is available from http://mining.nesg.org/. q 2003 Elsevier Ltd. All rights reserved. Keywords: structural genomics; COGs; charged residues; hydrophobicity; decision trees *Corresponding author 0022-2836/$ - see front matter q 2003 Elsevier Ltd. All rights reserved. E-mail address of the corresponding author: [email protected] Abbreviations used: COG, clusters of orthologous groups; NLS, nuclear localization signal; oob, out-of-bag. doi:10.1016/j.jmb.2003.11.053 J. Mol. Biol. (2004) 336, 115–130

Upload: nguyenduong

Post on 10-Apr-2018

221 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Mining the Structural Genomics Pipeline: …papers.gersteinlab.org/e-print/spine2-jmb/reprint.pdfMining the Structural Genomics Pipeline: Identification of Protein Properties that

Mining the Structural Genomics Pipeline:Identification of Protein Properties that AffectHigh-throughput Experimental Analysis

Chern-Sing Goh1,2, Ning Lan1,2, Shawn M. Douglas1,2,3, Baolin Wu4

Nathaniel Echols1,2, Andrew Smith1,2,3, Duncan Milburn1,2

Gaetano T. Montelione2,5,6,7, Hongyu Zhao4,8 and Mark Gerstein1,2,3*

1Molecular Biophysics andBiochemistry, Yale University, 266Whitney Ave, New Haven, CT06520, USA

2Northeast Structural GenomicsConsortium, Robert Wood JohnsonMedical School, UMDNJPiscataway, NJ 08854, USA

3Computer Science, YaleUniversity, 266 Whitney Ave, NewHaven, CT 06520, USA

4Department of Epidemiology andPublic Health, Yale University, 266Whitney Ave, New Haven, CT06520, USA

5Center for AdvancedBiotechnology and MedicineRobert Wood Johnson MedicalSchool, UMDNJ, Piscataway, NJ08854, USA

6Department of Molecular Biologyand Biochemistry, RutgersUniversity, Robert Wood JohnsonMedical School, UMDNJPiscataway, NJ 08854, USA

7Department of BiochemistryRobert Wood Johnson MedicalSchool, UMDNJ, Piscataway, NJ08854, USA

8Department of Genetics, YaleUniversity, 266 Whitney Ave, NewHaven, CT 06520, USA

Structural genomics projects represent major undertakings that willchange our understanding of proteins. They generate unique datasetsthat, for the first time, present a standardized view of proteins in termsof their physical and chemical properties. By analyzing these datasetshere, we are able to discover correlations between a protein’s character-istics and its progress through each stage of the structural genomics pipe-line, from cloning, expression, purification, and ultimately to structuraldetermination. First, we use tree-based analyses (decision trees andrandom forest algorithms) to discover the most significant protein featuresthat influence a protein’s amenability to high-throughput experi-mentation. Based on this, we identify potential bottlenecks in variousstages of the structural genomics process through specialized “pipelineschematics”. We find that the properties of a protein that are most signifi-cant are: (i) whether it is conserved across many organisms; (ii) thepercentage composition of charged residues; (iii) the occurrence of hydro-phobic patches; (iv) the number of binding partners it has; and (v) itslength. Conversely, a number of other properties that might have beenthought to be important, such as nuclear localization signals, are notsignificant. Thus, using our tree-based analyses, we are able to identifycombinations of features that best differentiate the small group ofproteins for which a structure has been determined from all the currentlyselected targets. This information may prove useful in optimizing high-throughput experimentation. Further information is available fromhttp://mining.nesg.org/.

q 2003 Elsevier Ltd. All rights reserved.

Keywords: structural genomics; COGs; charged residues; hydrophobicity;decision trees*Corresponding author

0022-2836/$ - see front matter q 2003 Elsevier Ltd. All rights reserved.

E-mail address of the corresponding author: [email protected] used: COG, clusters of orthologous groups; NLS, nuclear localization signal; oob, out-of-bag.

doi:10.1016/j.jmb.2003.11.053 J. Mol. Biol. (2004) 336, 115–130

Page 2: Mining the Structural Genomics Pipeline: …papers.gersteinlab.org/e-print/spine2-jmb/reprint.pdfMining the Structural Genomics Pipeline: Identification of Protein Properties that

Introduction

With the advent of the post-genomic era, thenext challenge is to determine the structure ofencoded proteins,1 which can lead to functionalannotation of previously uncharacterized geneproducts.2 – 5 The structural genomics effort hasdemonstrated the possibility of rapid structuredetermination on a genome-wide scale and isexpected to generate a considerable amount ofdata. However, there are several challenges thatcan deter the process of proteins through thestructural genomics pipeline,6 – 9 from targetcloning, expression, purification, to structuraldetermination.

In addition to a growing collection of crystal andNMR structures, structural genomics is generatingnew and novel datasets where proteins are subjectto uniform conditions for expression. Never beforehas it been possible to gain access to such a largeamount of standardized experimental proteindata, generated for thousands of targets frommany organisms, at multiple sites over variousstructural genomics consortia. These data sets canbe mined to look for correlations between aprotein’s properties and its performance in thesestandardized experiments. For instance, we mightimagine that proteins that have more hydrophobicsequences might be harder to express, or thatproteins that interact with partner proteins mightbe less able to crystallize or fold correctly. Thesequestions can be answered now through thesenew structural genomics datasets.

The SPINE database was created as an infor-mation repository for the Northeast StructuralGenomics Consortium (NESG), and as a vehicle tointegrate and manage data in a standardizedfashion that makes it accessible to systematic dataanalysis.10,11 Bertone et al. demonstrated thepotential data mining capabilities of the SPINEdatabase by developing a decision tree algorithmthat was used to infer whether a protein wassoluble from a dataset of 562 Methanobacteriumthermoautotrophicum protein expressionconstructs.10 Here, we used information from allthe targets from TargetDB† amounting to over27,000 selected targets from over 120 organisms, tosystematically correlate biophysical properties ofproteins with their sequence features in order todetermine their amenability to high-throughputexperimentation. This work has three values. Firstof all, it utilizes a unique dataset generated underrelatively uniform conditions. Second, it can tellus more about the properties of proteins in asystematic fashion and, thirdly, it can generateinformation needed to optimize protocols and con-ditions for effective high-throughput structuralgenomics.

Results and Discussion

Our overall approach to the data mining analysisis twofold. First, we employ two types of tree-based algorithms, random forest and decision treeanalysis, to identify features most influential indetermining whether a protein is amenable tohigh-throughput experimental analysis. Randomforest analysis12,13 is a robust algorithm particularlyuseful for calculating the importance of features bymeasuring the effect of permutations of eachfeature on prediction accuracy. It uses two tech-niques: bagging (bootstrap aggregating) andrandom feature selection. In combination, thesemethods have been shown to improve the stabilityand accuracy of prediction over a single-treemodel. While the random forest method is a robusttechnique for ranking features in terms of theirimportance, it is more difficult to interpret. Inorder to measure the frequencies of proteinscontaining certain features and understand howcombinations of these protein properties can affecttheir amenability for experimental analysis, weuse decision trees,14,15 a commonly used machinelearning method. In general, we partition the initialsample, consisting of positives and negatives, intodifferent subsets depending on a particular feature,such as amino acid composition or protein bindingpartners. If the feature separates the positives andnegatives preferentially, this is readily apparentand the most selective rules appear at the top ofthe decision trees.16 We use these decision trees toidentify and view features and combinations offeatures that are particularly selective. In a secondtype of analysis, which we call pipeline analysis,we diagram the way in which particular featureschange over the structural genomics pipeline andidentify bottlenecks or stages in the pipelinewhere these features show the largest change.

Tree-based analysis

As of February 2003, sequence and experimentalprogress information for 27,267 protein targetswere collected from the TargetDB and used in thetree analyses. We performed the tree analyses onall the targets found in the TargetDB in order todiscover protein features that are the strongest pre-dictors for whether a protein can be determinedstructurally. The protein properties used in theanalysis are listed in Table 1. These propertiescomprise of general sequence composition, andother protein characteristics such as clusters oforthologous groups (COG) assignment, length ofhydrophobic stretches, number of low-complexityregions, and number of interaction partners.

Based on the current data in TargetDB, 1.3%(370/27711) of all targets are structurally deter-mined. Results from the random forest (Table 2)and decision tree (Figure 1(a)) analyses suggestthat protein properties such as COG17 assignment;percentage composition of charged, polar, and† http://targetdb.pdb.org

116 Mining the Structural Genomics Pipeline

Page 3: Mining the Structural Genomics Pipeline: …papers.gersteinlab.org/e-print/spine2-jmb/reprint.pdfMining the Structural Genomics Pipeline: Identification of Protein Properties that

Table 1. Protein features analyzed

Protein feature Protein description

Number oftrees

feature isfound

Alltargets

Structurallydetermined

targetsMeta

descriptor

COG Have COGs (%) .1 59 85 YesGAVLI Average GAVLI composition (%) .1 34.6 38.4 NoDE Average DE composition (%) .1 12.7 14.2 NoSCTM Average SCTM composition (%) .1 15.5 13.6 NoAny_partners Average number of known binding partners of

the yeast homolog from many sourcesa

.1 3.8 4.1 Yes

pI Average pI .1 7.1 6.3 NoLength (amino acidresidues)

Average length .1 291 243 No

Q Average glutamine composition (%) .1 3.6 3 NoW Average tryptophan composition (%) .1 1.1 1 NoK Average lysine composition (%) .1 6.7 6.7 Nohp_aa Average number of hydrophobic residues

within a hydrophobic stretch below athreshold of 21.0 kcal/mol

1 15 7 No

Sheet Average b-strand composition (%) 1 15.8 19.8 Nocplx_s Average normalized low-complexity

value—short1 14.6 13.3 No

S Average serine composition (%) 1 6.7 5.2 NoE Average glutamic acid composition (%) 1 7.4 8.6 NoNQ Average NQ composition (%) 1 7.6 6.5 NoDENQ Average DENQ composition (%) 1 19.7 20.5 NoI Average isoleucine composition (%) 1 5.9 6.2 NoAILV Average AILV composition (%) 1 29.4 31.6 NoST Average ST composition (%) 1 11.5 10 NoA Average alanine composition (%) 1 7.2 8.1 NoC Average cysteine composition (%) 1 1.5 1 NoKR Average KR composition (%) 1 12.5 12.6 NoM Average methionine composition (%) 1 2.5 2.4 NoP Average proline composition (%) 0 4.7 4.4 NoV Average valine composition (%) 0 6.9 7.8 NoN Average asparagine composition (%) 0 4 3.5 Nohphobe Average minimum hydrophobicity score on

the GES scale0 2.9 -0.6 No

Complex_partners Average number of known binding partners ofthe yeast homolog based on the MIPScomplex catalog35 –39

0 0.74 0.47 Yes

cplx_l Average normalized low complexityvalue—long

0 20.5 21.4 No

helix Average helix composition (%) 0 40.1 39 Nocoil Average coil composition (%) 0 44 41.1 NoLM Average LM composition (%) 0 11.6 11.8 NoR Average arginine composition (%) 0 5.9 5.9 NoDEKR Average DEKR composition (%) 0 26.8 25.2 NoHKR Average HKR composition (%) 0 14.3 14.5 NoD Average aspartic acid composition (%) 0 5.3 5.6 NoF Average phenylalanine composition (%) 0 4.1 3.7 NoY Average tyrosine composition (%) 0 3.1 2.9 NoT Average threonine composition (%) 0 5.1 4.9 NoH Average histidine composition (%) 0 2.3 2 NoG Average glycine composition (%) 0 6.5 7.3 NoFWY Average FWY composition (%) 0 8.4 7.6 Nosignal Have signal sequences(%) 0 15 8 NoPS00015 Contain PS00015 (nuclear localization signal

peptide) prosite motif (%)0 6. 5.4 Yes

PS00013 Contain PS00013 (membrane lipoproteinpeptide) prosite motif (%)

0 0.5 0.4 Yes

PS01129 Contain PS01129 (enzyme involved in RNAmetabolism) prosite motif (%)

0 0 0 Yes

PS00018 Contain PS00018 (EF-hand calcium-bindingdomain) prosite motif (%)

0 0.4 1.4 Yes

PS00030 Contain PS00030 (RNA recognition) prositemotif (%)

0 0.1 0.8 Yes

This Table reports the number of times that a feature is found in the decision trees shown in Figures 1 and 2. Some of the featuresthat appear in more than one tree may still exhibit no distinct difference between all the targets and those that have been determinedstructurally (columns 4 and 5). This occurs because certain features have more effect in different stages of the structural genomicspipeline, such as expression and purification, but not necessarily as great an influence on whether a protein can become characterizedstructurally.

a Sources include BIND,47 – 49 DIP,43 – 45 MIPS,35 – 39 Cellzome (http://www.celzome.com) databases and datasets from various yeasttwo-hybrid experiments.40 – 42

Mining the Structural Genomics Pipeline 117

Page 4: Mining the Structural Genomics Pipeline: …papers.gersteinlab.org/e-print/spine2-jmb/reprint.pdfMining the Structural Genomics Pipeline: Identification of Protein Properties that

non-polar residues; and length of the protein corre-late with a tendency to be structurally determined.

Within the high-throughput structural genomicspipeline, there are many stages in the process thatcontribute to the attrition of proteins. Each stephas its own selective conditions that affect whethera protein target advances to the next step. In orderto identify protein characteristics most influentialin achieving the next level in the high-throughputexperimental determination, random forest anddecision tree analyses were run on proteins thatwere successfully cloned, expressed, or purified(Figures 1 and 2). The tree nodes in Figures 1 and2 represent the probability that proteins that satisfythe rule will be successful. The numbers to theright and left of the node are the numbers ofproteins that are successful and those that are not.The sum of the two numbers at each node is thetotal number of proteins that satisfy the rule or setof rules. To aid experimental design further, anadditional decision tree was created (Figure 2)using the same datasets as in Figure 1 but without“meta-descriptors” such as COG analysis or bind-ing partner information. The evolution of thesefeatures is traced through the pipeline figures(Figure 3). Some protein features such as COGassignment and protein sequence length are foundin both the pipeline analyses and in the overallstructure decision tree. It is noted that at eachstage of the protein determination pipeline, certainfeatures appear to be more influential than others.

Tree-based analysis on targets that have beenexpressed suggests COG assignment, sequencelength, and pI values are important features thataffect the outcome. These results are reflected inthe pipeline figure analysis (Figure 3), where thereare significant differences in these features betweencloned but not expressed proteins and expressedproteins. Out of the 14,385 cloned protein targetsin this analysis, 3764 have a COG assignment anda pI value below 5.9 (Figure 1(c)). The decisiontree analysis suggests that cloned proteins meetingthese criteria have a better chance (73%) of beingexpressed compared to all cloned targets (58%)that are expressed.

In contrast to these findings found for expressedproteins, purified proteins (Figure 1(d)) havedifferent determining characteristics that includethe percentage composition of charged residuessuch as aspartic acid and glutamic acid, and thepercentage sequence composition of asparagineand glutamine amino acids. This suggests that anoptimal combination of aspartic acid, glutamic

acid, asparagine, glutamine, and lysine sequencecomposition can increase the chance of expressedproteins to become purified from 46% to 77%( p-value 4.1 £ 10250).

Similarly, the tree analyses identify proteinfeatures such as methionine and alanine per-centage sequence composition that can affect theoutcome of whether a purified protein becomesstructurally determined (Figure 1(e)). The decisiontree analysis shows that proteins with a very lowcontent of methionine (less than 0.3%) and alaninecomposition less than 8.5% have a 67% chance ofbeing crystallized ( p-value 3.4 £ 1028).

Solubility

Since solubility is an important determinant forwhether a protein is amenable to structural deter-mination, an analysis was performed to findprotein characteristics that influence the outcomeof a protein’s solubility. Serine percentage compo-sition is shown to be the major determinant indetermining solubility. Other predictors of solu-bility such as conservation across organisms(COGs) and charged residue composition aresimilar to the other tree analyses performed on thevarious stages of the structure determination pipe-line, confirming the significance of a protein’ssolubility in its amenability to high-throughputexperimentation.

Analysis of specific structural genomicscenters

Decision tree analysis was performed on six sep-arate structural genomics centers: the NortheastStructural Genomics Consortium (NESG), the JointCenter for Structural Genomics (JCSG), theMycobacterium tuberculosis Structural GenomicsConsortium (TB), the Midwest Center for Struc-tural Genomics (MCSG), the Montreal-KingstonBacterial Structural Genomics Initiative (BSGI),and the Berkeley Structural Genomics Center(BSGC). These groups each have their own separ-ate initiatives with differing methods of targetselection, cloning, and purification.8 The decisiontrees in Figure 4 were performed to identifyimportant protein properties that would influencea target’s amenability to be determined structurallywithin each consortium. The resulting diverse treesillustrate the unique approach that each of theconsortia has taken.

It is notable that more than half of the targets

Table 2. Random forest analysis

Importanceranking

Structure/no structure

Cloned/uncloned

Expressed/cloned

Purified/expressed

Structure/purified

Soluble/insoluble

1 GAVLI I COG DE GAVLI S2 DE Q Length NQ A DE3 SCTM AVILM Hphobe pI C COG4 S Sheet DE COG M SCTM5 DENQ Length pI GAVLI pI Length

118 Mining the Structural Genomics Pipeline

Page 5: Mining the Structural Genomics Pipeline: …papers.gersteinlab.org/e-print/spine2-jmb/reprint.pdfMining the Structural Genomics Pipeline: Identification of Protein Properties that

Figure 1. Decision tree analysis of two subsets of protein targets: (a) structurally determined versus not structurallydetermined; (b) cloned versus not cloned; (c) expressed versus cloned but not expressed; (d) purified versus expressedbut not purified; (e) structurally determined versus purified but not structurally determined; and (f) soluble versus notsoluble. The boxed values in the terminal nodes show the probability of a successful outcome at each node. To theright of the node, the value represents the number of successful proteins and the value to the left of the node denotesthe number of proteins that were unsuccessful. The bracketed number at the root of the tree is the number of totalproteins analyzed.

Mining the Structural Genomics Pipeline 119

Page 6: Mining the Structural Genomics Pipeline: …papers.gersteinlab.org/e-print/spine2-jmb/reprint.pdfMining the Structural Genomics Pipeline: Identification of Protein Properties that

that are not determined structurally can be selectedby the top three rules in each of the consortiadecision trees. The results suggest that eachconsortium has its own methods of target selection,

cloning, and protein production. The decision treeanalysis is able to highlight patterns of successesfor these consortia. For example, the NESG(Figure 4(a)) seems to be more successful with

Figure 2. Decision tree analysis of two subsets of protein targets with only fundamental descriptors.

120 Mining the Structural Genomics Pipeline

Page 7: Mining the Structural Genomics Pipeline: …papers.gersteinlab.org/e-print/spine2-jmb/reprint.pdfMining the Structural Genomics Pipeline: Identification of Protein Properties that

Figure 3. Pipeline figure analysis of: (a) percentage of proteins that belong to COGs; (b) average percent of GAVLIpercent composition; (c) percentage of proteins that have minimum hydrophobicity scores below 21 on the GESscale; (d) number of hydrophobic residues within hydrophobic stretches below 21 on the GES scale; (e) average pro-tein length; (f) average percentage of DE percentage composition; (g) average number of binding partners based onthe MIPS complex catalog; (h) percentage of proteins that contain signal sequences; and (i) average normalized low-complexity values. For proteins containing these features, stages where possible “bottlenecks” can occur are presentedin the pipeline figures. The pipeline figures show the mean values and their standard errors. The numbers in parenth-eses are the actual number of proteins at each stage in the structural genomics pipeline.

Mining the Structural Genomics Pipeline 121

Page 8: Mining the Structural Genomics Pipeline: …papers.gersteinlab.org/e-print/spine2-jmb/reprint.pdfMining the Structural Genomics Pipeline: Identification of Protein Properties that

Figure 4. Decision tree analysis of two subsets of data from targets that are not structurally determined and thosethat are. This analysis is performed on data from: (a) the Northeast Structural Genomics Consortium (NESG); (b) theJoint Center for Structural Genomics (JCSG); (c) the M. tuberculosis Structural Genomics Consortium (TB); (d) theMidwest Center for Structural Genomics (MCSG); (e) the Montreal-Kingston Bacterial Structural Genomics Initiative(BSGI); and (f) the Berkeley Structural Genomics Center (BSGC). The boxed values in the terminal nodes showthe probability of a successful outcome at each node. To the right of the node, the value represents the number ofstructurally determined proteins and the value to the left of the node denotes the number of proteins that wereunsuccessful. The bracketed number at the root of the tree is the number of total proteins analyzed.

122 Mining the Structural Genomics Pipeline

Page 9: Mining the Structural Genomics Pipeline: …papers.gersteinlab.org/e-print/spine2-jmb/reprint.pdfMining the Structural Genomics Pipeline: Identification of Protein Properties that

proteins that have more than 12% aspartic andglutamic acid and protein lengths of less than 112amino acid residues. The NESG targets arecomprised mostly of small (,340 amino acidresidues) prokaryotic and eukaryotic proteins,and, generally, smaller proteins tend to have ahigher concentration of charged residues thanlarger ones. Similarly, the BSGC (Figure 4(f))seems to achieve better results for protein targetswith protein lengths less than 346 amino acidresidues and proteins that are conserved acrossorganisms. Since the data collected from TargetDBfor each consortium do not distinguish betweentargets that have not been determined structurallyand those that cannot be determined structurally,these results may have alternately served to high-light certain targets within each consortium thathave progressed further through the pipeline.

Main protein features

Conservation across organisms

Protein features found commonly in the decisiontrees are indicated in Table 1, which illustrates thedifferences between the subset containing all thetargets and the subset consisting of targets thathave been determined structurally. These resultshighlight which features differ the most betweenthe two subsets. Most of the protein characteristicsfound commonly in the decision trees also contrastmarkedly between the two subsets. The percentageof proteins that have COGs rises from 59% in allthe targets to 85% in targets that have been deter-mined structurally ( p-value 1.4 £ 10224).

Pipeline figures (Figure 3) can show where com-mon bottlenecks occur in each step of proteinstructure determination process. At each stage ofthe pipeline, the number of total proteins isrepresented in parentheses, and the values of thecharacteristics of interest are shown. Of the 27,711protein targets, so far 14,767 have been cloned,8587 expressed, 4115 purified and 370 determinedstructurally. The results from the tree-basedanalyses corroborate with the data shown in thepipeline figures. For example, the tree analysesindicate that COG assignment is a major determi-nant for whether a protein will be expressed. Forthe COG assignment protein characteristic(Figure 3(a)), we can see in the pipeline that amajor bottleneck occurs at this stage. Of proteinsthat are cloned but not expressed, 53% have COGscompared to 70% in expressed proteins ( p-value2.2 £ 10292). Most proteins are expressed inbacteria, so targets with bacterial counterpartshave a better chance of being expressed than thosethat do not. Eukaryotic proteins expressed inbacterial vectors typically have a lower successrate than bacterial proteins.

COGs are families of proteins found in manyorganisms. Generally, more studies have been per-formed on these proteins due to their presencein various organisms, which can increase the

probability that these proteins will be determinedstructurally. Studying proteins that belong toCOGs is one of the most successful methods oftackling crystal structure determination,18 sincethese proteins can be cloned and purified frommany organisms. This study highlights the utilityof multiplex gene expression system analysis.

Hydrophobicity

In Table 1, the percentage of small hydrophobicprotein residues (GAVLI) increases from 34.6% inthe subset containing all the target proteins to38.4% in the subset consisting of proteins that arestructurally determined ( p-value 9.6 £ 10219).However, the average hydrophobicity (hphobe)and the average number of hydrophobic residueswithin hydrophobic stretches (hp_aa) are shownto decrease between the subsets of all targetproteins and structurally characterized proteins.Small hydrophobic residue composition (GAVLI)was found to be one of the determinants forwhether proteins could be determined structurally.Structurally determined proteins had an average of38.4% GAVLI composition as opposed to purifiedbut not structurally determined proteins that hadan average of 36% GAVLI composition ( p-value3.6 £ 1027).

Through target selection, most proteins withpredicted transmembrane regions have beenremoved. However, target proteins that are morehydrophobic are usually less amenable to high-throughput experimentation, probably due tosolubility issues. Overall, 30% of all the targetshave hydrophobic stretches with minimum hydro-phobicity scores below 21 (based on the GESscale19, see Methods) compared to the 21% that arestructurally determined ( p-value 1.4 £ 1024).Figure 3(c) highlights two bottlenecks in the pipe-line that could occur, based on the protein’s hydro-phobicity features. Of the proteins that are clonedbut not expressed, 36% have hydrophobicity scoresbelow 21 compared to a score of 25% forexpressed proteins ( p-value 7.2 £ 10236). Similarly,while 29% of proteins that are expressed but notpurified have low hydrophobicity scores, only20% of purified proteins contain highly hydro-phobic patches ( p-value 3.1 £ 10222). This seems toconfirm the idea that hydrophobic proteins areless likely to be both expressed and purified dueto their decreased solubility.

In general, all the protein targets had an averageof 15 hydrophobic residues within hydrophobicstretches compared to structurally determinedproteins, which had an average of seven hydro-phobic residues ( p-value 3.8 £ 1025). For proteinswith a high number of hydrophobic residueswithin hydrophobic stretches, the pipeline analysissuggests that two bottlenecks can occur at theexpression stage and the purification stage. Thedecision tree for the purification stage indicatesthat this feature is a strong determinant forwhether a protein can be purified. The expressed

Mining the Structural Genomics Pipeline 123

Page 10: Mining the Structural Genomics Pipeline: …papers.gersteinlab.org/e-print/spine2-jmb/reprint.pdfMining the Structural Genomics Pipeline: Identification of Protein Properties that

proteins that do not become purified have a highnumber of hydrophobic residues, 16.3, comparedto the purified proteins that have only 6.5 hydro-phobic residues within hydrophobic stretches( p-value 7.7 £ 10230). This suggests that proteinswith large or many hydrophobic stretches may notfold properly in the experimental conditions beingused. The cloned proteins have an average of 16.3hydrophobic residues within a hydrophobic stretch( p-value 9.3 £ 1028). Cloned proteins that were notexpressed had an average of 22.7 hydrophobic resi-dues compared to expressed proteins, which hadan average of 11.6 hydrophobic residues ( p-value1.0 £ 10263), indicating that proteins with manyhydrophobic stretches may have more difficulty inbeing expressed, due to their decreased solubility.

Protein length

Protein length was another feature thatdecreased from 291 amino acid residues in thedata set containing all the target proteins to 243 inthe data set consisting of proteins that were deter-mined structurally ( p-value 3.2 £ 1024). The treeanalyses suggested that the outcome of whether aprotein was expressed could be correlated to itssequence length. The pipeline analysis results cor-roborated this finding. While cloned proteins hadan average length of 276 residues ( p-value4.8 £ 10225), the average protein length of clonedbut not expressed proteins was considerablyhigher at 305 residues, compared to expressedprotein, which had an average protein length of254 ( p-value 2.6 £ 10234). One explanation is thatlarge proteins could contain multiple domains,which increases the difficulty of experimentalanalysis.

The histogram of the protein lengths (Figure 5)demonstrates that the distribution of proteinlengths is different between proteins that areexpressed and those that are cloned but notexpressed. There are more expressed than non-expressed proteins that have a length of less than400 amino acid residues. However, there is a largernumber of cloned but not expressed proteins forproteins with lengths of 400 to 2000 amino acidresidues. The analysis suggests a possiblecorrelation between a protein’s length and itsamenability to high-throughput structuraldetermination.

Composition of specific amino acids

The percentage composition of small negativelycharged residues (DE) increased from 12.7% in allthe protein targets to 14.2% in structurallydetermined targets. The bottleneck for this feature,indicated in Figure 3(f), was found in the purifi-cation stage, where purified proteins had 14.2%DE (Asp/Glu) composition as opposed to 12.6%expressed proteins that were not purified( p-value ¼ 3.3 £ 102101). The percentage compo-sition of DE was highlighted in several tree

analyses including structure determination, purifi-cation, and solubility. Highly charged amino acidresidues interact favorably with solvent molecules,so proteins with a higher percentage of acidicamino acid residues have a higher probability ofbeing soluble. This is corroborated in the results ofthe decision tree analyses, where this feature washighlighted in the purification and solubility deter-mination trees.

The percentage composition of serine decreased,on average, from 6.7% in all the protein targets to5.2% in structurally determined targets ( p-value3.0 £ 10222). Serine was highlighted in thesolubility tree analyses, suggesting its influence ona protein’s solubility. However, it is not clearlyunderstood how the decrease in serine com-position affects a protein’s amenability to high-throughput experimental analysis.

Number of binding partners

Based on the interaction information found inthe MIPS complex catalog (complex_partners), theaverage number of binding partners for proteinsthat are structurally determined is less than all theprotein targets, suggesting that some proteins mayrequire the presence of their binding partners inorder to fold properly and, in turn, to be deter-mined structurally.20 – 23 The pipeline figure analysisindicates that a bottleneck for this feature occurs atthe purification stage, where proteins that areexpressed but not purified have an average of 0.96binding partner compared to those that become

Figure 5. Histogram of expressed compared to non-expressed protein lengths.

124 Mining the Structural Genomics Pipeline

Page 11: Mining the Structural Genomics Pipeline: …papers.gersteinlab.org/e-print/spine2-jmb/reprint.pdfMining the Structural Genomics Pipeline: Identification of Protein Properties that

purified, which have an average of 0.7 bindingpartner ( p-value 0.06). The average number ofbinding partners decreases even further from thestage when a target is purified to the stage that itbecomes determined structurally, suggesting thatit is easier to purify and crystallize proteins thathave few binding partners. The reported averagenumber of binding partners for all of theseproteins is less than 1, due to the fact that lessthan 4% of these proteins are functionally shownto have binding partners. As more functionalinformation becomes available, the correlationbetween the number of binding partners and a pro-tein’s amenability to high-throughput structuraldetermination will most likely become morepronounced.

Occurrence of signal peptides

Signal peptides control the entry of proteins intosecretory pathways.24 – 26 As the protein is trans-located through the membrane, the signalsequence is cleaved off, releasing the matureprotein. The presence of a signal sequence was nothighlighted as a determining factor in any of thedecision tree analyses. However, there was ageneral decrease from 14.5% in all the targets to8.2% in the structurally determined proteins( p-value 5.3 £ 1024). The pipeline analysis suggeststhat expressed proteins have a lower percentage(13%) of signal sequences than proteins that arenot expressed (16.5%) ( p-value 1.2 £ 1027). Simi-larly, 10.6% of purified proteins have signalsequences compared to 15.7% of expressed but notpurified proteins ( p-value 9.3 £ 10212). Sincesecreted proteins are exported, their nativeenvironment is most likely extracellular. However,the current expectation profile for proteins that areexpected to be successful are proteins that areboth cytoplasmic and monomeric.

Features found not to be predictors

All the target proteins have an averaged normal-ized low-complexity value of 14.4 compared tostructurally determined proteins, which have avalue of 13.6 ( p-value 0.07). While there is a smalldecrease, this difference is probably not significant.However, cloned proteins have an average low-complexity value of 14.6 compared to expressedproteins, which have a value of 12.8 ( p-value2.1 £ 10254). Through target selection, most proteinswith long low-complexity regions have beenremoved,4,27 which may indicate why there is littlechange in this feature between all the targets andthose that become structurally determined.

There is little difference between the percentageof proteins containing a nuclear localization signal(NLS) motif (PS00015) in the structurally deter-mined subset and all the targets. There is a smallincrease from 4.3% of all the targets to 4.9% of thestructurally determined targets that contain the

NLS motif. However, this result is not found to besignificant ( p-value 0.52).

Other features, such as cysteine composition andKR (Lys/Arg) composition, did not show a distinctchange between the structurally determined subsetand the total targets subset. It was thought thatproteins containing disulfide bridges might bemore difficult to crystallize. Similar to proteinswith signal sequences, proteins with disulfidebridges are usually extracellular. However, theamount of cysteine in proteins that were crystal-lized compared to those that were not diddecrease, but not substantially. Additionally, thepercentage composition of the charged residues,DE (Asp/Glu), increased in proteins that werestructurally determined compared to all theprotein targets. We expected to observe a changein the percentage composition of other chargedresidues, such as KR (Lys/Arg). However, therewas very little difference in this feature betweenthe structurally determined subset and all thetargets.

Statistical issues using structural genomicsdatasets

Since the data collected from the TargetDB isbasically a “snapshot” of the structural genomicsprogress, it is not possible to distinguish betweentargets that have failed at a certain stage fromthose targets that are yet to be studied. The subsetof successful proteins can be partitioned into a“white” (or successful) subset. However, theremaining proteins are partitioned into a “gray”subset, rather than a completely unsuccessful(“black”) subset, since some of these proteinscould be successful if attempted. This can createmore difficulty in making accurate conclusionswhen trying to determine which features areimportant to the amenability of a protein to high-throughput experimentation. While the presenceof potential false negatives can decrease thestrength of prediction, our analysis shows astatistically significant correlation between certainprotein features and their amenability to beingdetermined successfully. The issue of having falsenegatives would manifest itself more greatly if thesuccessful subset of data was just as, or almost thesame as, the negative subset of data with respectto the features being studied, thereby creatingnon-statistically significant differences. However,since the protein features in this subset of proteinsare statistically distinct from the protein featuresfound in the rest of the population, this indicatesthat there is enough information to distinguish thetwo (unbalanced) sets even from this smaller suc-cessful subset. With the accumulation of more“successful” data, more information will belearned to increase the ability for predictions.

Using these machine learning techniques, we areable to identify common trends or correlationsbetween specific protein properties and the successof an outcome at each stage of the structural

Mining the Structural Genomics Pipeline 125

Page 12: Mining the Structural Genomics Pipeline: …papers.gersteinlab.org/e-print/spine2-jmb/reprint.pdfMining the Structural Genomics Pipeline: Identification of Protein Properties that

genomics pipeline. This analysis uses standardtools that have been employed widely in suchfields as econometrics and epidemiology, andproduces straightforward, robust statistical con-clusions. However, these techniques cannot beused directly to establish causal relationships.Incorporating general biological information, weare able to utilize statistical information to makeinferences about the relationships between factorsand outcomes, and to determine whether theserelationships are plausible.

More specifically, the correlations found in thesehigh-throughput experiments can be defined aseither causal or non-causal. The causal correlationsare based on protein properties and features of aparticular experiment. For instance, the resultsshow that less hydrophobic proteins have a betterchance of being purified. For the purificationstage, the rate of success is highly dependent on aprotein’s fundamental properties. Alternatively,non-causal correlations are affected by otherfactors, such as the rate that scientists can startanalyzing proteins at each stage of the process orbiases in the way targets have been selected orprioritized for cloning, expression, crystallization,etc. These non-causal relationships have less effectin later stages of the structural genomics pipelinewhere there are fewer proteins. However, theyseem to be more predominant in earlier stages ofthe pipeline, such as in the cloning step, where thenumber of proteins selected outweighs theresources available.

Interpretation of pipeline figures

Because of the distinction between causal andnon-causal correlations that we elaborated onabove, we have to be particularly careful in inter-preting the pipeline figures with respect to cloning.It is known that cloning of proteins is usuallybelieved to be successful. This issue has ramifica-tions for interpreting the pipeline figures. A smallanalysis we have done within the NESG con-sortium has shown that most proteins can becloned with approximately 95% success (somewhathigher cloning success rates for prokaryoticproteins and somewhat lower success rates foreukaryotic proteins). Therefore, most of the statisti-cal differences that we observed between theselected and cloned parts of the pipeline do notreflect intrinsic properties of proteins. Rather, theyreflect subtle sociological biases in the way thevarious structural genomic centers have goneabout picking their targets for cloning. Thus, ininterpreting the pipeline figures, if we want tounderstand the appropriate statistics of each stagewe can start looking at the pipeline figures fromthe cloning stage on, up to expressed, purified,and so forth. Each of these steps represents a realstatistical difference attributed to the properties ofproteins.

However, there is another way that we can lookat the pipeline semantics. In a very global manner,

we can compare the properties of proteins thathave structures to the entire universe that istackled by structural genomics. The latter is essen-tially the proteins of TargetDB. In this way, we cancompare the very top of the pipeline schematic allthe way to the bottom, taking into account thatalmost all of the proteins in the selected poolcould be cloned if attempted. This gives us anidea of the overall statistical differences betweenproteins that are broadly targeted by the centersversus those that have been solved successfully.

Important discoveries for future data collectionin TargetDB

From this comprehensive analysis we can gleana number of points that will help us to bettergather data for TargetDB in the future. First, it isimportant to gather negative as well as positiveinformation. Second, it is critical to distinguishbetween proteins with negative information in thepipeline and proteins that are waiting in the pipe-line. In particular, it is critical to track (i) what isattempted, (ii) what is successful, and (iii) whatfails, at each step of the structure production pipe-line. Therefore, it might be useful to add a numberof categories to TargetDB including more detailabout the status of each protein.

As the recognition in the importance of charac-terizing structural genomics information increases,more detailed mechanisms for capturing the datawill greatly improve and enable further datamining efforts. Here, we have shown the generalutility of performing this type of analysis on infor-mation gathered from the structural genomicsefforts. We identify protein features that correlatewith successful outcomes at each stage of the pipe-line. We demonstrate plausible consistenciesbetween these identified protein properties andthe effect that they may have in determining theoutcome of a protein’s progress through the struc-tural genomics pipeline. The results of this analysiscan aid researchers in the choice of better targetproteins, and can increase the efficiency of high-throughput experimentation.

Conclusions

The structural genomics initiative will produce avast amount of experimental information that canprovide insights into protein structure and func-tion. As the numbers of solved structures areincreasing gradually, data collected from theseefforts can aid in optimizing and accelerating thestructure determination process. This studysuggests that several key protein characteristics,including protein length, composition of negativelycharged and polar residues, hydrophobicity,presence of a signal sequence, and COG assign-ment, can determine whether a protein will pro-gress through the stages of the structuredetermination pipeline. Proteins with an optimal

126 Mining the Structural Genomics Pipeline

Page 13: Mining the Structural Genomics Pipeline: …papers.gersteinlab.org/e-print/spine2-jmb/reprint.pdfMining the Structural Genomics Pipeline: Identification of Protein Properties that

combination of these features can be selectedrapidly and moved through the pipeline more effi-ciently. Additionally, using these parameters, it ispossible that less ideal proteins can be re-engin-eered to increase their chance for being determinedstructurally.

Methods

Targets

The data set of protein targets was collected onFebruary 9, 2003 from TargetDB, a target registrationdatabase that includes target data from worldwidestructural genomics and proteomics projects. This subsetconsisted of 27,711 proteins and was inserted into theSPINE database for further analysis.

Random forest analysis

The random forest analysis combines two powerfulideas in machine learning techniques: bagging and ran-dom feature selection. Bagging (bootstrap aggregating)uses the final vote of bootstrap replicates to create a clas-sifier. The random forest analysis grows a random treefor each bootstrap sample by choosing the best split ateach node from a small number of randomly selectedpredictors.

For each data set, 5000 bootstrap samples were usedand seven features (the square-root of the total numberof features) were selected randomly at each node split.One-third of the original training set was omitted fromeach bootstrap sample. These out-of-bag (oob) sampleswere placed back into the trees and a final predictionwas based on the votes of these tree predictions. Theerror of each data set was calculated based on the per-centage of incorrect predictions made out of the totalnumber of predictions. The error rates for these datasets range, on average, from 15% to 30%.

The importance for each variable was measured bypermuting all the values for the variable in the oobsamples. When these values were placed back into thetree, a new test set error was computed. The amountthat the test error differs from the original test error wasdefined as the importance of the variable.

Decision tree analysis

Decision trees were constructed using the R treemodel software†28 with default parameters of minimumnode deviance of 0.1, minimum node size of 10, andcost complexity pruning of 5. Targets with missingvalues were omitted, resulting in 27,267 total protein tar-gets analyzed. The overall structure determination treeused a subset of proteins that were determined in com-parison to all the rest of the proteins that were not. Cor-respondingly, the cloning determination tree used asubset of proteins of all cloned target proteins comparedto all target proteins that were not cloned. Decision treeswere constructed for the expressed versus clonedbut not-expressed, purified versus expressed but not-purified, and structure versus purified but not struc-turally determined subsets. Under no associations, the

distribution of the positive outcome out of the totalnumber of samples (i.e. 14,385 cloned out of the total27,267 sample) was assigned randomly to each of theterminal nodes. The mean and variance of the terminalnodes were calculated, and the approximate p-value foreach terminal node was derived from its Z-score.

Cross-validation is a commonly used technique toestimate the error rate of future predictions. The ipredpackage‡ in R was used to perform a tenfold cross-validation on each of the decision trees, where eachsuccessive application of the learning procedure used adifferent 90% of the data set for the training and theremaining 10% for testing. The estimated error wascalculated by taking the sum of the number of incorrectclassifications obtained from each one of the ten test sub-sets and dividing that sum by the total number ofinstances that had been used for testing. The averageprediction success over all the decision trees in Figure 1was 76%. For the decision trees in Figure 4, the cross-validation approach resulted in an overall predictionsuccess of 96%.

Data analysis

Amino acid composition

Amino acid compositions were calculated by takingthe fraction of the total number of the amino acidresidues by the total number of amino acid residues inthe whole sequence.

Hydrophobicity

Hydrophobicity scores were measured using the GESscale,19 where the lower the score, the more hydrophobicthe amino acid is. Hydrophobic residues withinhydrophobic stretches were calculated by countingall the residues within a sliding 20 residue windowwith a hydrophobicity score below 21.0 kcal/mol(1 cal ¼ 4.184 J) on the GES hydrophobicity scale.Minimum hydrophobicity scores were calculated usingthe minimum hydrophobicity score of all the sliding20 residue windows for each protein.

Signal sequences

Signal sequences were measured by implementing apattern match for sequences containing a chargedresidue within the first seven amino acid residuesfollowed by a stretch of 14 hydrophobic residues.

Low-complexity scores

Entropic low-complexity scores were calculated usingthe SEG program.29 Long low-complexity regions wereidentified with SEG using standard parameters with atrigger complexity K(1) of 3.4, an extension complexityK(2) of 3.75, and a sequence window length of 45. Shortlow-complexity regions were identified using a triggercomplexity K(1) of 3.0, an extension complexity K(2) of3.3, and a sequence window length of 25.

† Team, R. D. C. (2003). R: a language and environmentfor statistical computing. http://www.R-project.org

‡ Peters, A. & Hothorn, T. (2003). Improved Predictors.http://cran.r-project.org/src/contrib/PACKAGES.html

Mining the Structural Genomics Pipeline 127

Page 14: Mining the Structural Genomics Pipeline: …papers.gersteinlab.org/e-print/spine2-jmb/reprint.pdfMining the Structural Genomics Pipeline: Identification of Protein Properties that

Motifs

Prosite motifs were identified using the ps_scan30

program to scan the Prosite database for known motifs.31

Binding partners

A commonly used method for identifying protein–protein interactions is to utilize known binding infor-mation in one species to predict interactions of proteinsin another species.32 – 34 Interologs are pairs of potentialorthologs of known interacting partners. We employedinterolog information to identify target-binding partnersby mapping the known binding partners of a target’syeast interolog to itself. The complex_partners valueswere measured by calculating the average number ofknown binding partners for the yeast interolog found inthe MIPS35 – 39 complex catalog and mapping it to theprotein target. Comparatively, the any_partners valuesidentified the average number of known binding part-ners for a specific target’s yeast interolog using varioussources35 – 49 of experimentally determined informationincluding the MIPS database.

Statistical significance for pipeline figures

To test whether the features illustrated in Figure 3show an increasing or decreasing trend from the totaltarget population of 27,711 proteins to those 370 proteinswith identified structures, we conducted trend tests inthe following form:

T ¼XC

i¼1

wiyi

where C is the number of classes above the baseline totalpopulation, wi is the weight for the ith class, and yi is theobserved mean or ratio related to the feature of interestin the ith class. In our case, C ¼ 4, which corresponds tothe cloned proteins, expressed proteins, purified pro-teins, and proteins with identified structures. The valueof yi is the proportion of the proteins with a given featurein the ith class for Figure 3(a), (c), and (h), whereas thevalue of yi is the average feature value in the ith classfor the other features in Figure 3. The weights are 1, 2,3, 4 for an increasing trend for features in Figure 3(a),(b), (f), and (i). The weights are 4, 3, 2, and 1 for adecreasing trend for the other features.

To assess the statistical evidence of a trend in the databased on T, we calculated the mean and variance of Tunder the null hypothesis of no trend conditional on thefeature distribution in the baseline total population con-sisting of 27,711 proteins. It can be shown that when thefeature of interest is binary, i.e. a given protein eitherhas or does not have this feature:

EðTÞ ¼XC

i¼1

wiy0 and VarðTÞ

¼XC

i¼1

bi

!y0ð1 2 y0Þ2

XC

i¼1

bivi

where:

bi ¼XC

j¼i

wj

0@

1A

2

Ni21 2 Ni

Ni21 2 1

1

Ni; v1 ¼ 0; viþ1

¼ ð1 2 aiÞvi þ aiy0ð1 2 y0Þ; for i $ 1;ai

¼Ni21 2 Ni

Ni21 2 1

1

Ni

Ni is the number of proteins in the ith class, N0 is thenumber of proteins in the total population, and y0 is theproportion of proteins having a given feature in the totalpopulation. The statistical significance of the observedincreasing trend is:

1 2FT 2 EðTÞffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi

VarðTÞp

and the statistical significance of the observed decreasingtrend is:

FT 2 EðTÞffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi

VarðTÞp

where F is the cumulative function of the standardnormal distribution.

When the feature of interest is a continuous one, it canbe shown that, under the null hypothesis of no trend:

EðTÞ ¼XC

i¼1

wiy0 and VarðTÞ ¼XC

i¼1

gi

!s2

0

where gi ¼ ðPC

j¼i wjÞ2ððNi21 2 NiÞ=Ni21Þð1=NÞ, y0 and s2

0are the mean and variance of the given feature in thetotal population, respectively. For the observed trendtest statistic, we can use the above mean and varianceunder the null hypothesis to assess statistical significancelevel for an increasing or decreasing trend as above forthe binary case.

Acknowledgements

This work was supported, in part, by grant5P50GM062413-03 from the Protein StructureInitiative of the Institute of General MedicalSciences, National Institutes of Health and grantDMS-0241160 (to H.Y.Z.) from the NSF. We thankTom Acton for helpful discussions.

References

1. Service, R. F. (2000). Structural genomics offers high-speed look at proteins. Science, 287, 1954–1956.

2. Brenner, S. E. & Levitt, M. (2000). Expectations fromstructural genomics. Protein Sci. 9, 197–200.

3. Sanchez, R., Pieper, U., Melo, F., Eswar, N., Marti-Renom, M. A., Madhusudhan, M. S. et al. (2000).Protein structure modeling for structural genomics.Nature Struct. Biol. 7, 986–990.

4. Brenner, S. E. (2000). Target selection for structuralgenomics. Nature Struct. Biol. 7, 967–969.

5. Brenner, S. E. (2001). A tour of structural genomics.Nature Rev. Genet. 2, 801–809.

6. Service, R. F. (2002). Structural genomics. Tapping

128 Mining the Structural Genomics Pipeline

Page 15: Mining the Structural Genomics Pipeline: …papers.gersteinlab.org/e-print/spine2-jmb/reprint.pdfMining the Structural Genomics Pipeline: Identification of Protein Properties that

DNA for structures produces a trickle. Science, 298,948–950.

7. Pedelacq, J. D., Piltch, E., Liong, E. C., Berendzen, J.,Kim, C. Y., Rho, B. S. et al. (2002). Engineering solubleproteins for structural genomics. Nature Biotechnol.20, 927–932.

8. Terwilliger, T. C. (2000). Structural genomics inNorth America. Nature Struct. Biol. 7, 935–939.

9. Chance, M. R., Bresnick, A. R., Burley, S. K., Jiang,J. S., Lima, C. D., Sali, A. et al. (2002). Structuralgenomics: a pipeline for providing structures for thebiologist. Protein Sci. 11, 723–738.

10. Bertone, P., Kluger, Y., Lan, N., Zheng, D.,Christendat, D., Yee, A. et al. (2001). SPINE: an inte-grated tracking database and data mining approachfor identifying feasible targets in high-throughputstructural proteomics. Nucl. Acids Res. 29, 2884–2898.

11. Goh, C. S., Lan, N., Echols, N., Douglas, S. M.,Milburn, D., Bertone, P. et al. (2003). SPINE 2: asystem for collaborative structural proteomics withina federated database framework. Nucl. Acids Res. 31,2833–2838.

12. Breiman, L. (2001). Random forests. Machine Learn.45, 5–32.

13. Breiman, L. (2002). IMS Wald Lecture 2 ProceedingsIMS Annual Meeting, Banff, Canada.

14. Quinlan, J. R. (1987). Simplifying decision trees. Int.J. Man-Machine Stud. 27, 221–234.

15. Quinlan, J. R. (1993). C4.5: Programs for MachineLearning, Morgan Kaufmann, San Francisco.

16. Dash, M. & Liu, H. (1997). Feature selection forclassification. Intelligent Data Anal. 1, 131–156.

17. Tatusov, R. L., Koonin, E. V. & Lipman, D. J. (1997). Agenomic perspective on protein families. Science, 278,631–637.

18. Savchenko, A., Yee, A., Khachatryan, A., Skarina, T.,Evdokimova, E., Pavlova, M. et al. (2003). Strategiesfor structural proteomics of prokaryotes: quantifyingthe advantages of studying orthologous proteins andof using both NMR and X-ray crystallographyapproaches. Proteins: Struct. Funct. Genet. 50,392–399.

19. Engelman, D. M., Steitz, T. A. & Goldman, A. (1986).Identifying nonpolar transbilayer helices in aminoacid sequences of membrane proteins. Annu. Rev.Biophys. Biophys. Chem. 15, 321–353.

20. Wright, P. E. & Dyson, H. J. (1999). Intrinsicallyunstructured proteins: re-assessing the protein struc-ture–function paradigm. J. Mol. Biol. 293, 321–331.

21. Dyson, H. J. & Wright, P. E. (2002). Coupling offolding and binding for unstructured proteins. Curr.Opin. Struct. Biol. 12, 54–60.

22. Yokoyama, S. (2003). Protein expression systems forstructural genomics and proteomics. Curr. Opin.Chem. Biol. 7, 39–43.

23. Dunker, A. K. & Obradovic, Z. (2001). The proteintrinity—linking function and disorder. NatureBiotechnol. 19, 805–806.

24. Gierasch, L. M. (1989). Signal sequences. Biochemistry,28, 923–930.

25. von Heijne, G. (1990). Protein targeting signals. Curr.Opin. Cell Biol. 2, 604–608.

26. Rapoport, T. A. (1992). Transport of proteins acrossthe endoplasmic reticulum membrane. Science, 258,931–936.

27. Sali, A. (2001). Target practice. Nature Struct. Biol. 8,482–484.

28. Ihaka, R. & Gentleman, R. (1996). R: a language for

data analysis and graphics. J. Comput. Graph. Stat. 5,299–314.

29. Wootton, J. C. & Federhen, S. (1996). Analysis ofcompositionally biased regions in sequence data-bases. Methods Enzymol. 266, 554–571.

30. Gattiker, A., Bienvenut, W. V., Bairoch, A. &Gasteiger, E. (2002). FindPept, a tool to identifyunmatched masses in peptide mass fingerprintingprotein identification. Proteomics, 2, 1435–1444.

31. Sigrist, C. J., Cerutti, L., Hulo, N., Gattiker, A.,Falquet, L., Pagni, M. et al. (2002). PROSITE: a docu-mented database using patterns and profiles asmotif descriptors. Brief Bioinform. 3, 265–274.

32. Walhout, A. J., Sordella, R., Lu, X., Hartley, J. L.,Temple, G. F., Brasch, M. A. et al. (2000). Proteininteraction mapping in C. elegans using proteinsinvolved in vulval development. Science, 287,116–122.

33. Walhout, A. J. & Vidal, M. (2001). Protein interactionmaps for model organisms. Nature Rev. Mol. CellBiol. 2, 55–62.

34. Yu, H., Luscombe, N. M., Zhu, X., Chung, S., Goh,C.-S. & Gerstein, M. (2004). Annotation transfer forgenomics: assessing the transferability of protein–protein and protein–DNA interactions betweenorganisms. Genome Res. In the press.

35. Mewes, H. W., Frishman, D., Guldener, U.,Mannhaupt, G., Mayer, K., Mokrejs, M. et al. (2002).MIPS: a database for genomes and proteinsequences. Nucl. Acids Res. 30, 31–34.

36. Mewes, H. W., Frishman, D., Gruber, C., Geier, B.,Haase, D., Kaps, A. et al. (2000). MIPS: a databasefor genomes and protein sequences. Nucl. Acids Res.28, 37–40.

37. Mewes, H. W., Heumann, K., Kaps, A., Mayer, K.,Pfeiffer, F., Stocker, S. & Frishman, D. (1999). MIPS:a database for genomes and protein sequences.Nucl. Acids Res. 27, 44–48.

38. Mewes, H. W., Hani, J., Pfeiffer, F. & Frishman, D.(1998). MIPS: a database for protein sequences andcomplete genomes. Nucl. Acids Res. 26, 33–37.

39. Mewes, H. W., Albermann, K., Heumann, K., Liebl,S. & Pfeiffer, F. (1997). MIPS: a database for proteinsequences, homology data and yeast genome infor-mation. Nucl. Acids Res. 25, 28–30.

40. Tong, A. H., Drees, B., Nardelli, G., Bader, G. D.,Brannetti, B., Castagnoli, L. et al. (2002). A combinedexperimental and computational strategy to defineprotein interaction networks for peptide recognitionmodules. Science, 295, 321–324.

41. Uetz, P., Giot, L., Cagney, G., Mansfield, T. A.,Judson, R. S., Knight, J. R. et al. (2000). A compre-hensive analysis of protein-protein interactions inSaccharomyces cerevisiae. Nature, 403, 623–627.

42. Ito, T., Chiba, T., Ozawa, R., Yoshida, M., Hattori, M.& Sakaki, Y. (2001). A comprehensive two-hybridanalysis to explore the yeast protein interactome.Proc. Natl Acad. Sci. USA, 98, 4569–4574.

43. Xenarios, I., Salwinski, L., Duan, X. J., Higney, P.,Kim, S. M. & Eisenberg, D. (2002). DIP, the Databaseof Interacting Proteins: a research tool for studyingcellular networks of protein interactions. Nucl. AcidsRes. 30, 303–305.

44. Xenarios, I., Rice, D. W., Salwinski, L., Baron, M. K.,Marcotte, E. M. & Eisenberg, D. (2000). DIP: the data-base of interacting proteins. Nucl. Acids Res. 28,289–291.

45. Xenarios, I., Fernandez, E., Salwinski, L., Duan, X. J.,Thompson, M. J., Marcotte, E. M. & Eisenberg, D.

Mining the Structural Genomics Pipeline 129

Page 16: Mining the Structural Genomics Pipeline: …papers.gersteinlab.org/e-print/spine2-jmb/reprint.pdfMining the Structural Genomics Pipeline: Identification of Protein Properties that

(2001). DIP: The Database of Interacting Proteins:update. Nucl. Acids Res. 29, 239–241.

46. Gavin, A. C., Bosche, M., Krause, R., Grandi, P.,Marzioch, M., Bauer, A. et al. (2002). Functionalorganization of the yeast proteome by systematicanalysis of protein complexes. Nature, 415, 141–147.

47. Bader, G. D., Betel, D. & Hogue, C. W. (2003). BIND:the Biomolecular Interaction Network Database.Nucl. Acids Res. 31, 248–250.

48. Bader, G. D. & Hogue, C. W. (2000). BIND—a dataspecification for storing and describing biomolecularinteractions, molecular complexes and pathways.Bioinformatics, 16, 465–477.

49. Bader, G. D., Donaldson, I., Wolting, C., Ouellette,B. F., Pawson, T. & Hogue, C. W. (2001). BIND—thebiomolecular interaction network database. Nucl.Acids Res. 29, 242–245.

Edited by F. E. Cohen

(Received 21 July 2003; received in revised form 28 October 2003; accepted 19 November 2003)

130 Mining the Structural Genomics Pipeline