structural genomics of highly conserved microbial genes of...

17
Structural genomics of highly conserved microbial genes of unknown function in search of new antibacterial targets Chantal Abergel*, Bruno Coutard, Deborah Byrne, Sabine Chenivesse, Jean-Baptiste Claude, Céline Deregnaucourt, Thierry Fricaux, Celine Gianesini-Boutreux, Sandra Jeudy, Régine Lebrun 1 , Caroline Maza, Cédric Notredame, Olivier Poirot, Karsten Suhre, Majorie Varagnol & Jean-Michel Claverie Structural and Genomic Information Laboratory, UMR 1889 CNRS-AVENTIS, 1 Institute for Structural Biology and Microbiology, 31 Chemin Joseph Aiguier, 13402 Marseille cedex 20, France; * Author for Correspondence: E-mail: [email protected] Received 22 November 2002; Accepted in revised form 27 April 2003 Key words: Escherichia coli, Structural Genomics, Comparative Genomics, Bioinformatics, anti-bacterial Abstract With more than 100 antibacterial drugs at our disposal in the 1980’s, the problem of bacterial infection was considered solved. Today, however, most hospital infections are insensitive to several classes of antibacterial drugs, and deadly strains of Staphylococcus aureus resistant to vancomycin – the last resort antibiotic – have recently begin to appear. Other life-threatening microbes, such as Enterococcus faecalis and Mycobacterium tu- berculosis are already able to resist every available antibiotic. There is thus an urgent, and continuous need for new, preferably large-spectrum, antibacterial molecules, ideally targeting new biochemical pathways. Here we report on the progress of our structural genomics program aiming at the discovery of new antibacterial gene targets among evolutionary conserved genes of uncharacterized function. A series of bioinformatic and compara- tive genomics analyses were used to identify a set of 221 candidate genes common to Gram-positive and Gram-negative bacteria. These genes were split between two laboratories. They are now submitted to a system- atic 3-D structure determination protocol including cloning, protein expression and purification, crystallization, X-ray diffraction, structure interpretation, and function prediction. We describe here our strategies for the 111 genes processed in our laboratory. Bioinformatics is used at most stages of the production process and out of 111 genes processed – and 17 months into the project – 108 have been successfully cloned, 103 have exhibited de- tectable expression, 84 have led to the production of soluble protein, 46 have been purified, 12 have led to usable crystals, and 7 structures have been determined. Introduction Despite the exponential growth of sequence informa- tion from a large diversity of organisms, each newly sequenced bacterial genome continues to reveal up to 50% of genes without significant similarity to protein of characterized function [1]. By focusing a structural genomics project on such anonymous proteins, our goal is twofold. On one hand, we expect a sizable fraction of these proteins to exhibit a recognizable 3-D structure similarity with previously characterized protein families. Structure determination is thus used as a technique of functional genomics, allowing com- mon functional attributes to be recognized beyond the twilight zone of sequence similarity. On the other hand, focusing a structural genomics effort on anony- mous proteins should also enhance the probability of discovering original folds that are highly valuable by- products for the academic community. The first antibacterial drug was made available in 1936 (sulfonamides), and the list regularly expanded during the following 30 years (beta-lactam (1940), tetracyclines (1949), chloramphenicol (1949), ami- noglycosides (1950), macrolides (1952), strepto- 141 Journal of Structural and Functional Genomics 4: 141–157, 2003. © 2003 Kluwer Academic Publishers. Printed in the Netherlands.

Upload: others

Post on 14-Jul-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Structural genomics of highly conserved microbial genes of ...metabolomics.helmholtz-muenchen.de/suhre/... · By focusing a structural genomics project on such anonymous proteins,

Structural genomics of highly conserved microbial genes of unknownfunction in search of new antibacterial targets

Chantal Abergel Bruno Coutard Deborah Byrne Sabine Chenivesse Jean-BaptisteClaude Ceacuteline Deregnaucourt Thierry Fricaux Celine Gianesini-Boutreux Sandra JeudyReacutegine Lebrun1 Caroline Maza Ceacutedric Notredame Olivier Poirot Karsten Suhre MajorieVaragnol amp Jean-Michel ClaverieStructural and Genomic Information Laboratory UMR 1889 CNRS-AVENTIS 1Institute for Structural Biologyand Microbiology 31 Chemin Joseph Aiguier 13402 Marseille cedex 20 France Author forCorrespondence E-mail ChantalAbergeligscnrs-mrsfr

Received 22 November 2002 Accepted in revised form 27 April 2003

Key words Escherichia coli Structural Genomics Comparative Genomics Bioinformatics anti-bacterial

Abstract

With more than 100 antibacterial drugs at our disposal in the 1980rsquos the problem of bacterial infection wasconsidered solved Today however most hospital infections are insensitive to several classes of antibacterialdrugs and deadly strains of Staphylococcus aureus resistant to vancomycin ndash the last resort antibiotic ndash haverecently begin to appear Other life-threatening microbes such as Enterococcus faecalis and Mycobacterium tu-berculosis are already able to resist every available antibiotic There is thus an urgent and continuous need fornew preferably large-spectrum antibacterial molecules ideally targeting new biochemical pathways Here wereport on the progress of our structural genomics program aiming at the discovery of new antibacterial genetargets among evolutionary conserved genes of uncharacterized function A series of bioinformatic and compara-tive genomics analyses were used to identify a set of 221 candidate genes common to Gram-positive andGram-negative bacteria These genes were split between two laboratories They are now submitted to a system-atic 3-D structure determination protocol including cloning protein expression and purification crystallizationX-ray diffraction structure interpretation and function prediction We describe here our strategies for the 111genes processed in our laboratory Bioinformatics is used at most stages of the production process and out of 111genes processed ndash and 17 months into the project ndash 108 have been successfully cloned 103 have exhibited de-tectable expression 84 have led to the production of soluble protein 46 have been purified 12 have led to usablecrystals and 7 structures have been determined

Introduction

Despite the exponential growth of sequence informa-tion from a large diversity of organisms each newlysequenced bacterial genome continues to reveal up to50 of genes without significant similarity to proteinof characterized function [1] By focusing a structuralgenomics project on such anonymous proteins ourgoal is twofold On one hand we expect a sizablefraction of these proteins to exhibit a recognizable3-D structure similarity with previously characterizedprotein families Structure determination is thus used

as a technique of functional genomics allowing com-mon functional attributes to be recognized beyond thetwilight zone of sequence similarity On the otherhand focusing a structural genomics effort on anony-mous proteins should also enhance the probability ofdiscovering original folds that are highly valuable by-products for the academic community

The first antibacterial drug was made available in1936 (sulfonamides) and the list regularly expandedduring the following 30 years (beta-lactam (1940)tetracyclines (1949) chloramphenicol (1949) ami-noglycosides (1950) macrolides (1952) strepto-

141Journal of Structural and Functional Genomics 4 141ndash157 2003copy 2003 Kluwer Academic Publishers Printed in the Netherlands

gramins (1962) quinolones (1962) rifampin (1963))With more than 100 different drugs available in the1980rsquos the problem of bacterial infection was con-sidered to be definitely solved However resistantstrains started to appear and spread quickly Todaymost hospital-acquired infections are insensitive toone or several classes of antibiotics and deadlystrains of Enterococcus faecalis or Mycobacterium tu-berculosis have been found resistant to all availabledrugs Until the recent approval of an oxazolidinoneLinezolid and ketolide telithromycin by the FDA(2000 2003 respectively) no new class of antibacte-rial drug had been approved for more than 25 yearsand there are very few other classes of compoundcurrently in clinical development [2 3] besides somepromising inhibitors of peptide deformylase [4]

Linezolid and telithromycin [5 6] are activeagainst a broad spectrum of gram-positive pathogensincluding multidrug resistant Staphylococci Strepto-cocci and Enterococci However Linezolid resistantEnterococci strains have already been reported [7]The utility of existing antimicrobial agents is thusrapidly eroding and the need for research directedtoward development of new antibiotics has neverbeen greater

It is remarkable that less than 20 distinct evolu-tionary conserved macromolecules ndash belonging to 5essential pathways ndash constitute the targets of allantibiotics available today Given that pathogenicbacteria have a thousand genes or more it is realisticto expect that the rational analysis of the abundantgenomic information ndash more than 60 complete bacte-rial genomes sequences ndash that have become recentlyavailable could unravel a sizable set of entirely noveltarget genes and inspire the design of original classesof antibacterial molecules Our approach is gearedtoward the identification of genes and proteins in-volved in essential biochemical pathways not yettargeted by existing antibacterial molecules

The work presented in this paper concerns the twofirst steps of a rational drug development program i)the identification of candidate target genes and ii) thedetermination of the three-dimensional structure ofthe corresponding proteins (production purificationcharacterization crystallization and crystallographicanalysis) We first describe the comprehensive bioin-formatic analysis used to define the subset of the mostpromising candidate genes based on their evolution-ary conservation across a large panel of bacterialspecies (Table 1) and their further prioritization usingproperties computed from their amino-acid sequence

We then present the various experimental tech-niques integrated into our 3-D structure determinationpipeline

Table 1 Genome sequence data The lsquoreference genomesrsquo corre-spond to complete genomes used to determine the most conservedgenes Genomes in bold correspond to the one used to determinethe genes conserved both in E coli and at least one gram-positivebacteria lsquoOther genomesrsquo correspond to complete genomes of lessbiomedical relevance or incomplete genomes near completionThese genomes are accessible for further analysis

Reference Genomes Other Genomes

Aquifex aeolicus VF5 Archaeoglobus fulgidus

Bacillus subtilis 168 Aeropyrum pernix

Borrelia burgdorferi B31 Bacillus halodurans Cminus125

Campylobacter jejuni NCTC

11168

Buchnera spClostridium ac-

etobutylicum

Chlamydia muridarum Deinococcus radiodurans R1

Chlamydia pneumoniae AR39 Halobacterium sp

Chlamydia pneumoniae AR39

plasmid

Methanococcus jannaschii

Chlamydia pneumoniae

CWL029

Mycobacterium leprae

Chlamydia trachomatis serovar

D

Mycoplasma pulmonis

Chlamydophila pneumoniae

J138

Mesorhizobium loti

Escherichia coli Kminus12MG1655

Methanobacterium thermoau-

totrophicum

Escherichia coli O157H7 Neisseria meningitidis

Haemophilus influenzae

KW20

Pyrococcus abyssi

Helicobacter pylori J99 Pyrococcus horikoshii

Helicobacter pylori 26695 Sinorhizobium meliloti

Lactococcus lactis IL1403 Staphylococcus aureus strain

Mu50

Mycobacterium tuberculosisH37Rv

Staphylococcus aureus N315

Mycoplasma genitalium Gminus37 Streptococcus pneumoniae

Mycoplasma pneumoniae

M129

Sulfolobus solfataricus

Neisseria meningitidis MC58 Synechocystis sp

Pasteurella multocida PM70 Thermoplasma acidophilum

Pseudomonas aeruginosa PA01 Thermotoga maritima

Rickettsia conorii Treponema pallidum

Rickettsia prowazekii Madrid E Xylella fastidiosa

Streptococcus pyogenesTreponema pallidum Nichols

Ureaplasma urealyticum sero-

var 3

Vibrio cholerae chromosome I

amp II

142

Finally we show how the mapping of paradoxicalconservation patterns revealed by the multiple align-ment of orthologous sequences can help the interpre-tation of anonymous 3-D structures (prediction ofpotential active and cofactor-binding sites) and sug-gest functional hypotheses for further experimentalvalidation

Materials and methods

Target identification

The complete genome sequences were collected for anumber of bacterial species covering a wide range ofevolutionary distances (Gram-negative to Gram-positive) and life-style (free-living parasitic andorintra-cellular) (see Table 1) For a subset of these ge-nomes (lsquoReference Genomesrsquo in Table 1) all potentialcoding regions (ORFs) of length larger than 150 nu-cleotides (with ATG GTG or TTG as initiator codon)were extracted and their conceptual translation storedThis represented a total of 275791 putative ORFsmany of them probably not corresponding to actualproteins

All large-scale sequence comparisons were per-formed using a distributed processing protocol usinga cluster of 48 PC (lsquogigablasterrsquo [8]) under the linuxoperating system Each node included a 500 MH Pen-tium III 256 Mb of RAM and a 13 Gb disk Usingthe sequence alignment program Blastp [9] each ofthese putative ORFs were compared to i) the set ofall E coli K12 ORFs and ii) the sets of B subtilis Llactis S pyogenes and M tuberculosis ORFs Weonly retained the ORFs with a significant match inboth sets Each retained ORF was then mapped backto its E coli orthologous gene using Ecogene [10] asour primary reference database We then further re-duced our target gene set by only retaining the genesof unknown or hypothetical function (lsquoYrsquo genes ac-cording to the E coli nomenclature) A prediction ofmembrane spanning segments [11] was then per-formed on each of the previously selected genes todetermine which were most likely to be expressed assoluble globular proteins

Cloning

In order to allow batches of ORFs to be processed inparallel directional cloning was performed using theGATEWAY system (Invitrogen [12]) Oligonucleo-

tide primers were designed for each of the E coliORF with AttB1 and AttB2 overhangs added for re-combination cloning PCR was performed on E coliK12 purified genomic DNA using Pfx proof readingpolymerase from Invitrogen Alternatively Ex Taq(Takara) or Platinum High fidelity Taq (Invitrogen)were used to relieve amplification problems PCRproducts were purified as described in the GATEWAYmanual (Invitrogen) using Poly Ethylene Glycolprecipitation This procedure was further automatizedusing Multi screen-PCR 96 wells purification fromMillipore The corresponding PCR products were in-serted by homologous recombination using the lsquoonetube reactionrsquo in the pDEST17 expression plasmid inframe with a N-terminal His6-tag under the controlof a T7 promoter The reaction was performed using1 microl BP reaction buffer 25minus125 ng PCR products in avolume of 25 microl 05 microl pDONR vector and 1 microl BPclonase enzyme mix The reaction mix was incubatedat 25 degC for 4 hours before adding 025 microl NaCl075 M 075 microl pDEST17 and 15 microl LR clonase andthen incubated 2 hours at 25 degC 05 microl proteinase Kwas added 10 minutes at 37 degC before transformationin DH5 3 colonies per cloning were picked forovernight cultures using 2ml LB amp medium Plas-mids were then purified using Qiagen miniprep kit(spin or racks) and positive plasmids were screenedby restriction profiling using NdeI and HindIII en-zymes (Biolab) Positive clones were sequenced overa length of 600 nucleotides (GenomExpress) usingattB1 primer (5 ACA AGT TTG TAC AAA AAAGC 3)

Protein expression

The purified plasmids were used for the over-expres-sion of the recombinant proteins both in E coliBL21(DE3) cells and in a cell-free system (RocheApplied Science) (First pass standard protocol [13]for details) Incomplete factorial experimental design(as implemented by the SamBa software [14] httpigs-servercnrs-mrsfrsamba) was used to define a setof 12 conditions corresponding to the combination of3 variables (Expression strain Medium type andTemperature) This set of conditions was used as astarting point for the second pass screeningoptimiza-tion protocols A partially automated procedure (in-cluding liquid handling dilutions bacterial lysatenormalization via a connection to a Packardmicroquantspectrophotometer (Bio-Tek) and dot-blot spotting)was developed using a Multiprobe II (Perkin elmer)

143

automated platform allowing us to screen the expres-sion of 12 ORFs in parallel The cloned ORFs weretransformed in four E coli expression strainsBL21 (DE3) pLysS Rosetta (DE3) Origami (DE3)(Novagen) and C41 (Avidis) For each transformedstrain a colony was picked for overnight preculturein 2 ml 2YT with the suitable antibiotic The Multi-probe II automated platform dispensed on 24 wellsplates 25 ml of one of the 3 selected media (2YT(Difco) Superior Broth (SB) and Turbo Broth (TB)(Athena Enzyme Systems)) ndash each with the suitableantibiotic ndash and inoculated them with 25 precul-tures Culture plates were then incubated at 37 degCwith 250 rpm agitation for one hour before beingtemperature regulated at one of the 3 defined temper-atures (25 degC 37 degC and 42 degC) Induction wasperformed on the Packard platform with 05 mMIPTG when OD600nm reached 04ndash08 Cultures werestopped 3 hours after induction The measuredOD600nm value is used to normalize the amount ofbacteria used for protein quantification The 12 result-ing aliquots were then dispensed in a 96 well plateand centrifuged 10 minutes at 4000 rpm Pellets wereresuspended in 250 microl BugBuster (Novagen) 01 mgml lysozyme and 01 mgml DnaseI (Euromedex) 30min 4 degC under agitation 100 microl of each bacteriallysate was used for estimating the total amount of ex-pressed protein and 150 microl was used to assay thelevel of soluble expression

Assessment of soluble expression

A clearing filter plate (MultiScreen NucleicA fromMillipore) a 022 microm hydrophilic filter plate (Multi-Screen GV from Millipore) and a 96 well plate wereassembled The lysates were dispensed on the firstplate and filtered in one step with one minute centri-fugation at 4000 rpm Lysates were conserved atndash20 degC The detection of recombinant protein wasperformed using a Dot Blot procedure 4 microl of boththe soluble and total protein was resuspended in100 microl of a denaturant buffer (urea 8 M SodiumPhosphate 100 mM Tris-HCl 10 MM pH 8) 25 microl ofthis mix were transferred onto nitrocellulose mem-brane Protan 45 (Schleicher and Schuell) using a blot-ter adapted for the Multiprobe station The membranewas washed twice in TBS Buffer (Tris-HCl 10 mMNaCl 150 mM pH 75) blocked in blocking reagent1X (Qiagen) with 01 Tweenminus20 for 1 hour washed2 times 10 minutes in TBS-TT buffer (Tris HCl20 mM NaCl 500 mM 002 Tweenminus20 02

Triton Xminus100) and washed once 10 minutes in TBSbefore incubation with a penta-His HRP conjugatedantibody (Qiagen) diluted 12000 in blocking reagent1X with 01 Tweenminus20 for 1 hour Subsequentwashes were performed twice for 15 minutes in TBS-TT and once 10 minutes in TBS His-tagged proteinexpression was revealed by detecting HorseRadishPeroxidase activity conjugated to the anti-His5 anti-body using chemiluminescence with ECL RPN2106kit (Amersham Biosciences) on Biomax films(Kodak) Each of the 12 conditions received a scoreaccording to the Dot Blot spot intensity The influenceof each variable on the soluble expression was thenextracted by adding the scores of the experimentssharing the same variable value and comparing thistotal score to the others For instance if we considerthe variable lsquotemperaturersquo the sum of the scores ofthe four experiments performed at 37 degC is comparedto the sum of the scores of the four experiments doneat 25 degC and at 42 degC In the case these total scoresare similar temperature is not considered as a signif-icant factor for protein expression and solubility Ifthe scores are different temperature is retained as apertinent variable The temperature corresponding tothe highest total score was then used for the optimizedexpression protocol

Protein purification and characterization

Protein purifications were performed using the AKTAumlexplorer 10S (Amersham-Pharmacia) using standardaffinity chromatography on HiTrap chelating (Amer-sham-Pharmacia) or NiNTA (Qiagen) columnscharged with Ni2 eluted with an imidazole gradi-ent after removal of non-specific low interactingproteins by a preliminary wash with 50mM imidazolebuffer The recombinant proteins were usually elutedwithin a 70minus150 mM imidazole range A detailed pro-tocol has been published elsewhere [15]

A microdialysis step is applied to each protein toidentify a suitable buffer for desalting and concentra-tion We use 250 microl Dialysis Buttons (HamptonResearch Calafornia USA) with MWCO 10000SpectraPor CE membrane ( Spectrum Labs Califor-nia) 250 microl of each sample are loaded into 6 dialysisbuttons and dialyzed overnight at 4 degC on a rotatingwheel against 50 ml of 6 different buffers in the pres-ence or absence of ionic strength (NaCl 02 M) Weused the following buffers Tris 10 mM pH (7minus9)Mes 10 mM pH (55minus67) Mops 20 mM pH(69minus83) Hepes 10 mM pH (68minus82) Imidazole

144

10 mM pH (62minus78) and Sodium Acetate 10 mM pH(36minus56) The various buffers were adjusted to atleast one pH unit above or below the pI of each pro-tein To estimate the amount of precipitated versussoluble protein 125 microl of each sample were loadedonto a 96 microtitre plate for subsequent UV scanning(DO280total) The remaining 125 microl were centrifugedat 12000 rpm for 20 minutes and the supernatantloaded onto a 96 microtitre plate for soluble proteinevaluation (DO280soluble) The comparison of thetwo UV scans using the respective buffer as the blankwith the KC4 software on a microquant spectrophotom-eter (Bio-tek instruments inc) was used to select theoptimal desalting buffer as the one corresponding tothe smallest (DO280total ndash DO280soluble) The buff-ers were then exchanged using the Hiprep 2610desalting column (Amersham-pharmacia) on theAKTAuml explorer 10S

Quality control of the samples

Before entering crystallization trials purified recom-binant proteins were first characterized using circulardichroiumlsm spectroscopy to assess the folding status ofthe expression products and its consistency withsecondary structure predictions [11] Monodispersitybeing a key factor in the crystallization process theaggregation state of the protein solution was system-atically monitored by dynamics light scattering (Dy-napro MS800-TC) These measurements were alsoused to define the most suitable buffer in which toperform concentration phase diagrams and crystalli-zation trials Each sample was also controlled oniso-electro-focusing native and SDS gels both in theabsence and presence of reducing agent and 2 M ureaThe results were compared with the theoreticalmolecular weight of the target as well as with itspredicted isoelectric point and cysteine content Thestability of the protein over time was assessed bystoring purified samples at 3 temperatures 20 degC4 degC and ndash80 degC Every two weeks they were con-trolled by electrophoresis on SDS gels to reveal aneventual degradation process

Finally the purified proteins were all characterizedby mass spectroscopy (MALDI-TOF VoyagerDE-RP Perceptive Biosystem) and by 30 cycles ofamino-acid N-terminal Edman sequencing (AppliedBiosystem 473A)

Crystallization

The screening for crystallization conditions was per-formed on 3x96-well crystallization plates (Greiner)loaded by an 8-needle dispensing robot (Tecan WS1008 workstation modified for our needs) using one1-microl sitting drop per condition Each protein wasinitially tested against 480 different conditions at aconcentration determined by the phase diagramanalysis The tested crystallization conditions includein-house designed [14] and commercially availablesolution sets (CrystalScreen-Hampton researchWizards-Emerald BioStructures) After analyzing theresults of this initial screen conditions were refinedusing the incomplete factorial design approach [14and ref herein]

Data collection and structure determination

Diffraction data for proteins with structural homo-logues in the PDB [16] are interpreted using themolecular replacement procedure using the AMoResoftware [17] Alternatively the MultiwavelengthAnomalous Diffraction (MAD) technique [18 and refherein] was used for interpreting the diffraction dataof protein with no obvious structural homologuesusing crystals of the seleno-methionine substitutedproteins [19] Detailed protocols can be found in [20]

Results and discussion

Target selection

Because of the detrimental effect of their mutationsgenes belonging to essential pathways exhibit lowevolution rates Moreover the reliable identificationndash using sequence similarity ndash of homologous genesin very distant organisms is only possible for slowlyevolving genes Thus one expect the subsets of ubiq-uitous or lsquowide-spectrumrsquo genes to be enriched in es-sential genes for the same panel of microorganisms

The first step of the target selection process thusconsisted into a comprehensive cross-comparison ofa large panel of complete genomes from Gram-negative and Gram-positive bacteria (separated by 2billions years of divergent evolution) to identify themost conserved genes among those with clear homo-logues both in E coli and at least one Gram-positivebacteria This first step using a BlastP score thresh-old of 150 (with Blosum62) resulted into a set of

145

1300 different homologous ORF sets each with a dif-ferent representative in E coli For each of theseE coli ORFs a conservation index was computed as

CI all

BestScore

EcoliSelfScore

Where EcoliSelfScore is the BlastP [9] score of eachEcoli ORF aligned with itself and BestScore corre-sponds to the score of the best matching ORFs withineach genome (counting 0 for matches scoring belowthe threshold) the sum being performed over alltested genomes The CI value is thus quantifying thesequence conservation of the gene (and its presence)across a large number of evolutionary diverse micro-bial genomes including important pathogens For theinitial 1300 selected genes CI values ranged from225 (for the elongation factor subunit tufB) to 14 forthe putative sugar hydrolase ybgG gene (Figure 1)

For all 1300 selected genes (as well as for all 4222E coli genes) a number of properties were precom-puted and stored in a master database (lsquoMaster DBrsquo)In this database each gene is represented by a spe-cific interactive identity card (Figure 2) The ID cardcontains the gene nucleotide and amino-acid se-quence its precise position in the E coli genome itssignificant protein matches within the Reference Ge-nome set some theoretical properties of its proteinproduct (molecular weight pI extinction coefficient

number of cysteines methionines and chargedresidues) Each gene ID card also gives access to ad-ditional features such as a restriction map of the genethe amino-acid translations derived from alternativereading frames (for error checking) a map of the rarecodons as well as the protein hydropathy profile pre-dicted signal peptide membrane spanning segmentsand secondary structure The lsquoBlastrsquo button (Figure 2)gives access to interactive similarity searches againstadditional genomes listed in Table 1 (lsquoOther Ge-nomesrsquo) This set includes the complete genomes ofmicroorganisms of lesser biomedical relevance (eghyperthermophilic archebacteria) or interesting bacte-ria the genomic sequence of which is nearing com-pletion This list is submitted to periodical changesThe lsquoextended alignmentrsquo button generate a multiplealignment (using the T-coffee [21] program) of the Ecoli ORF with the homologous ORFs identified in theother bacterial genomes The lsquoPdb alignmentrsquo buttondoes the same with homologues of known 3-D struc-tures Master DB and the results of these varioussequence analyses are frequently consulted to guidethe subsequent experimental steps of the project suchas cloning gene expression protein purification andcrystallization trials

The set of 1300 well-conserved gene was furtherreduced to 317 by only retaining the genes of un-known or hypothetical function The distribution ofconservation index (CI values) in this subset rangesfrom 172 to 14 Figure (3A) presents a graph ofknown vs unknown genes ranked according to theirevolutionary conservation (CI value) This graph in-dicates that the most conserved genes (likely to cor-respond to essential pathways) have been historicallycharacterized in priority However it is comforting tosee that a fraction of well-characterized genes arefound at lower CI values where genes of yet unknownfunctions begin to dominate the distribution Thisshows that our protocol of ORF selection based onthe evolutionary conservation across Gram-positiveGram-negative genomes does actually converge ontoactual genes

The set of 317 highly conserved anonymous geneswas further reduced to enhance our probability of suc-cess in the experimental phase of the project that isthe over expression and determination of the 3-Dstructure of the largest possible number of these Ecoli gene products Although the structural determi-nation of membrane-bound proteins will eventuallybe attempted later on (eg by expressing the ORF asseparate domains) we chose to first focus on the

Figure 1 Mandelbrot plot (rank vs score) of the initial 1300 pu-tative target genes The genes (X-axis) are ranked according to theirassociated CI value (Y-axis) The first 200 genes with CI value gt10are ubiquitous genes involved in the most essential cellular path-ways The rest of the distribution is smoothly decreasing suggest-ing a fairly homogeneous selection pressure among all these can-didate target genes Genes of unknown functions are in red geneof known functions are in green

146

easier soluble targets (the so-called lsquolow hangingfruitsrsquo) A prediction of membrane spanning segments[11] was then performed to determine which of theabove 317 genes were most likely to be expressed assoluble globular protein This final bioinformatic fil-ter resulted into a final set of 221 candidate targetgenes The distribution of CI values within this geneset does not exhibit any significant bias ranging from172 to 14 (Figure 3B) Out of the 221 correspond-ing proteins only 17 exhibit a strong sequence simi-larity (defined as gt40 identity over a segment of atleast 130 residues) to entries in the PDB Our 221-candidate gene set is thus likely to correspond to asizable number of interesting new structures and pos-sibly some original folds 110 of these target geneswere attributed to the AFMB laboratory and are notfurther described

Additional bioinformatic analyses

Additional sequence analyses were performed at thevarious stage of the experimental process in particu-lar for genes and proteins for which difficulties wereencountered For instance putative cofactor-bindingmotifs were searched to help expressing purifying orcrystallizing recalcitrant proteins This was performedby combining the recognition of usual sequence sig-natures (eg Prosite motif [22]) with an analysis oftheir conservation within multiple alignments

Multiple alignment is also useful to identify theproper N-terminus of the ORF (for cloning purpose)or delineating detrimental flexible N- or C-terminalsegments of the candidate proteins An analysis of theconfidence measure at the heart of the T-coffee mul-tiple alignment program [21] indicated a correlationbetween the least reliably aligned protein segmentsthe regions of higher sequence variability and thoseof greater flexibility [23] As we work with well-con-servedubiquitous genes the situation is also optimalto take advantage of multiple alignments to identifykey-residues in these lsquoanonymousrsquo proteins and usethe conserved patterns to link them with protein fam-ilies of known functions despite a lack of overallsignificant sequence similarity (see 24 as an exam-ple)

Phylogenomics analyses aiming at identifyingsubset of genes co-segregating or co-evolving acrossmultiple bacterial genomes are also extensively usedto predict protein-protein interaction [25 26] eventu-

ally leading to the design of co-expression co-purifi-cation or co-crystallization experiments Phyloge-nomics is used to suggest links between ORFs ofunknown functions and identified pathways [25]These hints might help the functional interpretation ofthe 3-D structures of our anonymous targets Phylo-genomics analyses are performed with our in-housesoftware PhyDBac [26 httpigs-servercnrs-mrsfrphydbac]

On the negative side the priority of candidategenes was sometimes downgraded due to their ab-sence in important pathogens with reduced genomes(such as H influenzae or Mycoplasma) or for beingtoo similar to their human homologues Ideally onlybacteria-specific biochemical pathways should betargeted for the development of antibiotics (eg peni-cillin blocking peptidoglycan biosynthesis) Howeverthis constraint is probably not acceptable on the longrun given the constant need for new antibiotics andthe limited number of strictly bacterial specific tar-gets A good counter example is the success of thequinolones that are targeting DNA gyrase one of themost conserved proteins (26 identical amino acidsbetween human and E coli)

Finally the amino acid conservation patterns de-rived from multiple alignments become even moreinformative when analyzed in the context of a repre-sentative 3-D structure (from the E coli gene prod-uct) For instance the structural (hydrophobic core orstabilizing residues) vs functional (eg catalyticresidues) nature of the conserved positions can be as-sessed (as illustrated in [20]) Paradoxical features ndashsuch as well-exposed surface loops exhibiting strongresidue conservation ndash are highly informative for thefunctional interpretation of anonymous 3-D struc-tures Protein cavities lined with conserved residuesare also good candidates for lsquoactive sitesrsquo Previouslyunnoticed (ie misaligned) conserved isolated resi-dues (dispersed in the sequence) might be found closetogether in the 3-D structure and suggest the presenceof a catalytic center The spatial distribution of con-served residues can be exhaustively searched withinproteins of known function eventually resulting inthe identification of catalytic sites [27] Finally thedrugability of the regions of the protein pointed outby these analyses can be assessed and large librariesof putative ligands can be computationally screened[28 29]

147

Gene expression and protein purification pipelineFirst pass

The general principle governing our large-scale pro-tein expression approach is to quickly identify thelsquoeasyrsquo ORFs by passing our whole set of 111 througha common first pass standard protocol Further ex-pression trials are then performed on the lsquorecalcitrantrsquoproteins These trials involve lowering the tempera-ture trying other E coli strains adding predictedco-factors etc After about 17 months the first passhas been completed on the 111 targeted ORFs Theresults are as follows 108 ORFs have been success-fully cloned 104 have been tested 89 have exhibiteda detectable expression (either in vivo or using theRoche cell free system) 54 in a soluble form in a firstpass screening In parallel our industrial partner

(Infectious Disease Division Aventis Pharma) is as-sessing the essentiality of these putative target genesin major pathogens using proprietary approaches

Second pass addressing the bottleneck of solubleexpression

The above results identified the expression of the tar-gets in a soluble form as a major bottleneck A sec-ond pass protocol was thus designed using a statisticalscreening of various variables in order to determinethe best conditions for soluble expression We firstdetermined that the three main variables influencingprotein solubility andor expression yield were 1) thebacterial strain 2) the culture temperature and 3) thegrowth medium A set of 12 conditions correspond-ing to the combination of these 3 variables was thus

Figure 2 The yqhE gene identity card in Master DB The various buttons (top left column) provide interactive access to pre-computedadditional information as well as to lsquoon the flyrsquo sequence analysis

148

defined using an incomplete factorial experimentaldesign (see Materials and Methods) A typical exam-ple of such a screen applied to 6 different targets isgiven in Figure 4

At this point out of the 108 ORFs cloned 94 havebeen passed through this screening 93 have been suc-cessfully expressed including 68 in a soluble formThe incomplete factorial approach demonstrated itspower since out of the 94 ORFs screened only 1 wasnot expressed and 25 failed the soluble expressionassay This systematic screening allowed 723 of thetargets to be recovered in a soluble form This is tobe compared to the 52 yield obtained through thefirst pass standard protocol

The scale up bottleneck

For a certain number of proteins we observed thatdirectly scaling up the culture from 4 ml to 1 litercould lead to insoluble expression This problem wasusually solved by simply replacing the 1-liter cultureby several 200-ml cultures

Protein purification

Protein purifications are not yet performed through anautomatic procedure The buffer most suitable forprotein concentration is determined through dialysisexperiments of the purified material monitored byUV spectra and taking into account the theoreticalisoelectric point of each protein The systematiccheck for N-terminal sequence and mass of each pu-rified protein proved extremely useful for debunkingvarious problems (including the mishandling ofsamples) We noticed frequent cases (1531) of spon-taneous removal of the extended His-tagged N-termi-nus inherent to the use of the GATEWAY system (17to 24 first residues removed) The final statistics onthis observation will only be available once all solu-ble proteins will have been purified and sequenced

A prerequisite to successful crystallization wasalso the determination of appropriate storage condi-tions for each purified targets using gel electrophore-sis over time (see Materials and Methods) Forunstable proteins crystallization trials were per-

Figure 2 (Continued)

149

Figure 3 (A) Distribution of known vs unknown genes according to their evolutionary conservation The 1300 most-conserved E coli genesare ranked according to their decreasing CI value (x-axis in parenthesis) and grouped in successive boxes of 50 (Gene rank) For each boxthe number of genes of unknown vs known function are shown in red or green respectively Functionally characterized genes dominate thedistribution for the high CI values but many anonymous genes also exhibit comparable evolutionary conservations(B) Distribution of predicted membrane vs soluble proteins according to their evolutionary conservation (decreasing CI value x-axis inparenthesis) The 317 highly conserved ORFs of unknown function have been divided into membrane-bound (green boxes) vs soluble (redboxes) after sequence analysis There is no overall bias although proteins predicted soluble dominate the distribution of the most conservedcandidates

150

formed on freshly purified samples or using materialstored at minus80 degC

Screening of alternative cloning contexts usinglinear PCR products

To quickly explore alternative cloning contexts (Tagtype and position) for recalcitrant targets we devel-oped an in vitro expression screening based on lineartemplates involving a two steps PCR and universalprimers Tested on 24 different E coli ORFs the lin-ear templates mimicking pDEST17 vector context(N-terminal His-tag) performed as efficiently as thecorresponding circular plasmid This approach canthus be applied to modify the standart pDEST17 ORFcontext into a tagtarget combination more suitablefor soluble expression Possible alternatives includeC-terminal His-tag GST MBP Thioredoxin Streptag and GFP

Overview of the structures obtained to date

YecD molecular replacement using homologymodelingThe interpretation of yecD diffraction data (collectedon the ID29 beam line at the ESRF synchrotron inGrenoble-France) was first attempted by molecularreplacement using AmoRe [17] with the 1NBA and

1IM5 PDB entries These proteins share less than25 identical residues with the yecD sequence andfailed to produce a solution We then build a yecDstructure model with MODELER [30] using theknown structures as templates The yecD model wasused for molecular replacement and produced a goodsolution with 4 molecules per asymmetric unit TheyecD structure was then refined to 13 Aring (PDB 1J2R)YecD and 1NBA are annotated in Swiss-Prot asbelonging to the isochorismatase protein familyHowever the superposition of active site region of theyecD 1NBA and 1IM5 structures reveals that acysteine residue essential to the isochorismatase en-zymatic activity is replaced by a glycine in yecD Inaddition yecD lacks a cis-peptide conserved in 1NBAand 1IM5 It is thus likely that yecD has a differentenzymatic activity Detailed sequence analyses arebeing performed to predict its function

YggV a nucleoside triphosphate phosphohydrolaseThe yggV structure was solved using the MADtechnique with one seleno-methionine Data were col-lected on the BM30 beam line at ESRF The structurewas refined to 18 Aring and was found to be stronglysimilar to PDB entries 2MJP (Nucleoside triphos-phate phosphohydrolase from Methanocaldococcusjannaschii) and 1EX2 (MAF protein from Bacillussubtilis) YggV crystals were thus grown in presenceof dGTP They diffract to 24 Aring resolution The struc-ture of yggV was also solved in the absence of dGTPby the Midwest Center for Structural Genomics (PDB1K7K) A multiple alignment of the homologous se-quences is currently analyzed in the context of thevarious available structures (free forms and com-plexed with nucleotides) Our phylogenomic analysespointed out putative partners to yggV suggesting thatits function in E coli might not be entirely under-stood

YhbO an intracellular proteaseThe yhbO structure was solved by molecular replace-ment using AmoRe with the PDB structure 1G2I (anintracellular protease from Pyrococcus horikoshii)The yhbO structure was subsequently refined to 20Aring resolution A nucleophile elbow motif as well as acysteine residue essential for the protease activity ofthe P horikoshii intracellular protease is conserved inyhbO The multimeric state of the 1G2I structure wasalso described as essential for the catalytic activitythe active site involving residues contributed by twomonomers Interestingly the interface between the

Table 2 Experimental matrix used for expression screening Thefirst column is the experiment number The second column corre-sponds to the 4-state strain variable BL21(DE3) pLysS RosettaOrigami and C41 The third column corresponds to the 3-stategrowth medium variable 2YT SB and TB media The fourth col-umn corresponds to the 3-state temperature variable 25 degC 37 degCand 42 degC

Condition

number

Strain Medium Temperature

1 Bl21 (DE3) pLysS SB 25

2 Rosetta (DE3) TB 25

3 Origami (DE3) 2YT 25

4 C41 (DE3) SB 25

5 Bl21 (DE3) pLysS TB 37

6 Rosetta (DE3) SB 37

7 Origami (DE3) TB 37

8 C41 (DE3) 2YT 37

9 Bl21 (DE3) pLysS 2YT 42

10 Rosetta (DE3) 2YT 42

11 Origami (DE3) SB 42

12 C41 (DE3) TB 42

151

two yhbO monomers as seen in the crystal structureis entirely different from the one found in the 1G2Istructure The functional assignment of the E coliyhbO protein is thus not yet certain The putativeprotease activity of yhbO is currently studied

YqhE significant difference in the active sitesThe yqhE structure was solved by molecular replace-ment using AmoRe with the PDB entry 1A80 ( a 25-diketo-D-gluconic acid reductase A structure fromCorynebacterium) The yqhE structure was then

Figure 4 Dot Blot results of the expression screening on 6 different targets The top 6 Dot blots correspond to the total protein expressionresults The 6 lower Dot blots correspond to the soluble protein expression measurements The numbers above each Dot Blot refer to thetargets Dot Blot results are ranked by decreasing temperature (left to right) and the strains indicated on the right The growth medium isindicated on the left of each Dot Blot

152

Table 3 Project status List and status of our targets as deposited in targetdb (httptargetdbpdborg) under the project name SGI-SG (BIGS)The first column indicates the target name (according to EcoGene [10]) the second column indicates the relevant genome The status (fromlsquoSelectedrsquo to lsquoCrystal Structure) of each target protein is shown in pink For each target arrows are used to indicates the next step to becompleted stars correspond to targets currently stopped and squares correspond to lsquoon holdrsquo targets (failure during the scale-up or crystal-lization processes) pending further testing

153

Table 3 (Continued)

154

refined to 193 Aring resolution (PDB 1MZR) YqhEbelongs to the vast NADPH-dependent aldo-ketoreductase family Multiple alignment was used tocompare the various sequences of the family Thisrevealed two sub-families including all eukaryoticsequences on one hand and all the prokaryoticsequences on the other hand Structurally the eukary-otic proteins differ from the prokaryotic ones by thepresence of much longer loops in the co-factor andligand binding regions These marked structural dif-ferences should make possible the design of smallinhibitory molecule with a good specificity for theprokaryotic targets

YdhF an NADPH-dependent oxidoreductase ofunknown specificityThe ydhF structure was solved using the MAD tech-nique with 8 seleno-methionines The structure wasthen refined to 25 Aring resolution using data collectedon the BM30 beam line at ESRF YdhF belongs to theNADPH-dependent oxidoreductase family Its struc-ture was compared to the other members of the familyusing multiple alignment YqhE shares 267 identi-cal residues over a 217-aa segment with its closeststructural homologue ydhF YdhF exhibits a se-quence insertion in a flexible loop not well resolvedin the structure This difference is not located near theco-factors and ligand binding sites but may still af-fects the enzyme specificity The structure of theNADP containing ydhF is currently under refinementusing data to 28 Aring resolution

YliB a putative n-peptide binding proteinThe yliB structure was solved using the MAD tech-nique with 20 seleno-methionines The structure wasthen refined to 23 Aring resolution using data collectedon the ID14-EH2 beam line at ESRF This membrane-anchored periplasmic protein shares 29 identicalresidues with 1DPE (a dipeptide binding protein fromE coli) and 1DPP (the same protein in complex withglycyl-L-leucine) The loops surrounding the ligand-binding pocket are shorter in the yliB structure com-pared to the homologous structures suggesting thatlarger peptides might be accommodated by yliB Wealso noticed the presence of zinc ion and of some re-sidual electron density in the region homologous tothe 1DPP dipeptide binding site We are currently pro-ducing crystals of yliB in the presence of variousdipeptide in order to verify its affinity for such sub-strates

YdiB a new structural foldThe ydiB structure was solved using the MAD tech-nique with 20 seleno-methionines The structurebuilding is still in progress using data collected at266 Aring resolution on the BM30 beam line at ESRFThe preliminary tracing suggests that the ydiB struc-ture might be an original fold which is confirmed bythe deposition in the PDB of the ydiB structure by theNortheast Structural Genomics Research Consortium(PDB 1NPD) and the Montreal-Kingston BacterialStructural Genomics Initiative (PDB 1O9B)

Conclusion

This structural genomics project is an opportunity todevelop new approaches for the various stages ofgene expression The use of the lsquoone tube reactionrsquoGATEWAY technology allowed the successful clon-ing of 972 of the selected ORFs The developmentof both in vitro and in vivo expression screening withthe efficient sampling of the various parameters influ-encing soluble protein expression allowed 80 of thetested ORFs to be expressed as soluble proteins (anadditional 18 being expressed as inclusion bodies)The automation of the screening using dot-blot rev-elation now allows us to process 12 targets weekly

In our hands the purification step is not a limitingstep since 100 of the 46 proteins that have reachedthat stage were successfully purified However it wasmore challenging to keep them intact and soluble dur-ing the following steps of concentration and bufferexchange required for crystallization The introduc-tion of buffer screening through dialysis controlled byUV scanning has proven very effective allowing theidentification of a suitable buffer for 95 of the pu-rified samples

Finally the quality control we imposed through theentire project albeit time consuming has proven use-ful to trace mistakes such as target mixing andorcross-contamination as well as protein degradationOn a more positive side it also provided informationon unexpected proteinprotein interactions andorligand binding eventually leading to improved crys-tallization protocols To this days 40 of the 46purified recombinant proteins have been throughbiophysical characterization (circular dichroiumlsm spec-troscopy and monodispersity assays using dynamiclight scattering) 32 have entered crystallization trialsand 31 have been sequenced Twelve proteins haveproduced crystals out of which 9 have resulted in us-

155

Figure 5 Protein structures solved in this study The structures are drawn using MOLSCRIPT [31] The YdiB structure was also solved bythe consortium

156

able diffraction data allowing 7 structures to besolved (Figure 5) This number is now expected togrow rapidly as more expressed gene products arereaching the purificationcrystallization stages

Acknowledgements

We thank our collaborators from Aventis Pharma DrsV Blanc E Remy M Black and J Hogdson for theirscientific input We also thank Drs J-L Ferrer FBorel P Carpentier A Thompson G Leonard and JMcCarthy for their help on the FIP BM14 ID29ID14 ESRF beamlines We thank P Pham-Trong(Greiner) for his help and gift of materials Thisproject is funded by a grant (014906055) from theFrench Ministry of Industry lsquopost genomics researchprogramrsquo

References

1 Stover CK Pham XQ Erwin AL Mizoguchi SDWarrener P Hickey MJ Brinkman FS Hufnagle WOKowalik DJ Lagrou M et al (2000) Nature 406959minus964

2 Cassell GH and Mekalanos J JAMA 285 (2001)601minus605

3 Projan S (2002) Curr Opin Pharmacol 2 513minus5224 Clements JM Beckett RP Brown A Catlin G Lobell

M Palan S Thomas W Whittaker M Wood S SalamaS Baker PJ Rodgers HF Barynin V Rice DW andHunter MG (2001) Antimicrob Agents Chemother 45563minus570

5 Barbachyn MR Brickner SJ Gadwood RC GarmonSA Grega KC Hutchinson DK Munesada K Reis-cher RJ Taniguchi M Thomasco LM Toops DS Ya-mada H Ford CW and Zurenko GE (1998) Adv ExpMed Biol 456 219minus238

6 Johnson AP (2001) Curr Opin Investig Drugs121691minus1701

7 Auckland C Teare L Cooke F Kaufmann ME WarnerM Jones G Bamford K Ayles H Johnson AP (2002)J Antimicrob Chemother 50 743minus746

8 Claverie JM Nature 2000 403 12minus12

9 Altschul SF Madden TL Schaffer AA Zhang JZhang Z Miller W and Lipman DJ (1997) Nucleic AcidsRes 25 3389minus3402

10 Rudd KE (2000) Nucleic Acids Res 28 60minus6411 Persson B and Argos P (1997) J Protein Chem 16

453minus45712 Hartley JL Temple GF and Brasch MA (2000) Genome

Res 10 1788minus179513 Monchois V Vincentelli R Deregnaucourt C Abergel C

and Claverie J-M (2002) In Cell-Free Translation Systems(Ed Spirin AS) Springer Verlag New York pp 197minus202

14 Audic S Lopez F Claverie J-M Poirot O and AbergelC (1997) Proteins 29 252minus257

15 Monchois V Abergel C Sturgis J Jeudy S and ClaverieJ-M (2001) J Biol Chem 276 18437minus18441

16 Bhat TN Bourne P Feng Z Gilliland G Jain S Rav-ichandran V Schneider BSchneider K Thanki N Weis-sig H Westbrook J and Berman HM (2001) Nucleic Ac-ids Res 29 214minus218

17 Navaza J (2001) Acta Crystallogr D57 1367minus137218 Liu Y Ogata CM and HendricksonWA (2001) Proc Natl

Acad Sci U S A 98 10648minus1065319 Hendrickson WA Horton JR and LeMaster DM (1990)

EMBO J 9 1665minus167220 Abergel C Bouveret E Claverie JM Brown K Riga

A Lazdunski C and Benedetti H (1999) Structure FoldDes 71291minus1300

21 Notredame C Higgins DG Heringa J (2000) J MolBiol 302 205minus217

22 Falquet L Pagni M Bucher P Hulo N Sigrist CJHofmann K and Bairoch A (2002) Nucleic Acids Res 30235minus238

23 Notredame C and Abergel C (2003) In Bioinformatics andGenomes (Andrade MA ed) Horizon Scientific Press Wy-mondhamUK pp 27minus49

24 Abergel C Robertson DL Claverie J-M (1999) J Virol73 751minus753

25 Marcotte EM Pellegrini M Thompson MJ Yeates TOand Eisenberg D (1999) Nature 402 83minus86

26 Enault F Suhre K Poirot O Abergel C and ClaverieJ-M (2003) Nucleic Acids Res In press

27 Madabushi S Yao H Marsh M Kristensen DM Phil-ippi A Sowa ME and Lichtarge O (2002) J Mol Biol316 139minus154

28 Klebe G (2000) J Mol Med 78 269minus28129 Apostolakis J and Caflisch A (1999) Comb Chem High

Throughput Screen 2 91minus10430 Sali A and Blundell T L (1993) J Mol Biol 234

779minus81531 Kraulis PJ (1991) J Appl Crystallogr 24 946minus950

157

Page 2: Structural genomics of highly conserved microbial genes of ...metabolomics.helmholtz-muenchen.de/suhre/... · By focusing a structural genomics project on such anonymous proteins,

gramins (1962) quinolones (1962) rifampin (1963))With more than 100 different drugs available in the1980rsquos the problem of bacterial infection was con-sidered to be definitely solved However resistantstrains started to appear and spread quickly Todaymost hospital-acquired infections are insensitive toone or several classes of antibiotics and deadlystrains of Enterococcus faecalis or Mycobacterium tu-berculosis have been found resistant to all availabledrugs Until the recent approval of an oxazolidinoneLinezolid and ketolide telithromycin by the FDA(2000 2003 respectively) no new class of antibacte-rial drug had been approved for more than 25 yearsand there are very few other classes of compoundcurrently in clinical development [2 3] besides somepromising inhibitors of peptide deformylase [4]

Linezolid and telithromycin [5 6] are activeagainst a broad spectrum of gram-positive pathogensincluding multidrug resistant Staphylococci Strepto-cocci and Enterococci However Linezolid resistantEnterococci strains have already been reported [7]The utility of existing antimicrobial agents is thusrapidly eroding and the need for research directedtoward development of new antibiotics has neverbeen greater

It is remarkable that less than 20 distinct evolu-tionary conserved macromolecules ndash belonging to 5essential pathways ndash constitute the targets of allantibiotics available today Given that pathogenicbacteria have a thousand genes or more it is realisticto expect that the rational analysis of the abundantgenomic information ndash more than 60 complete bacte-rial genomes sequences ndash that have become recentlyavailable could unravel a sizable set of entirely noveltarget genes and inspire the design of original classesof antibacterial molecules Our approach is gearedtoward the identification of genes and proteins in-volved in essential biochemical pathways not yettargeted by existing antibacterial molecules

The work presented in this paper concerns the twofirst steps of a rational drug development program i)the identification of candidate target genes and ii) thedetermination of the three-dimensional structure ofthe corresponding proteins (production purificationcharacterization crystallization and crystallographicanalysis) We first describe the comprehensive bioin-formatic analysis used to define the subset of the mostpromising candidate genes based on their evolution-ary conservation across a large panel of bacterialspecies (Table 1) and their further prioritization usingproperties computed from their amino-acid sequence

We then present the various experimental tech-niques integrated into our 3-D structure determinationpipeline

Table 1 Genome sequence data The lsquoreference genomesrsquo corre-spond to complete genomes used to determine the most conservedgenes Genomes in bold correspond to the one used to determinethe genes conserved both in E coli and at least one gram-positivebacteria lsquoOther genomesrsquo correspond to complete genomes of lessbiomedical relevance or incomplete genomes near completionThese genomes are accessible for further analysis

Reference Genomes Other Genomes

Aquifex aeolicus VF5 Archaeoglobus fulgidus

Bacillus subtilis 168 Aeropyrum pernix

Borrelia burgdorferi B31 Bacillus halodurans Cminus125

Campylobacter jejuni NCTC

11168

Buchnera spClostridium ac-

etobutylicum

Chlamydia muridarum Deinococcus radiodurans R1

Chlamydia pneumoniae AR39 Halobacterium sp

Chlamydia pneumoniae AR39

plasmid

Methanococcus jannaschii

Chlamydia pneumoniae

CWL029

Mycobacterium leprae

Chlamydia trachomatis serovar

D

Mycoplasma pulmonis

Chlamydophila pneumoniae

J138

Mesorhizobium loti

Escherichia coli Kminus12MG1655

Methanobacterium thermoau-

totrophicum

Escherichia coli O157H7 Neisseria meningitidis

Haemophilus influenzae

KW20

Pyrococcus abyssi

Helicobacter pylori J99 Pyrococcus horikoshii

Helicobacter pylori 26695 Sinorhizobium meliloti

Lactococcus lactis IL1403 Staphylococcus aureus strain

Mu50

Mycobacterium tuberculosisH37Rv

Staphylococcus aureus N315

Mycoplasma genitalium Gminus37 Streptococcus pneumoniae

Mycoplasma pneumoniae

M129

Sulfolobus solfataricus

Neisseria meningitidis MC58 Synechocystis sp

Pasteurella multocida PM70 Thermoplasma acidophilum

Pseudomonas aeruginosa PA01 Thermotoga maritima

Rickettsia conorii Treponema pallidum

Rickettsia prowazekii Madrid E Xylella fastidiosa

Streptococcus pyogenesTreponema pallidum Nichols

Ureaplasma urealyticum sero-

var 3

Vibrio cholerae chromosome I

amp II

142

Finally we show how the mapping of paradoxicalconservation patterns revealed by the multiple align-ment of orthologous sequences can help the interpre-tation of anonymous 3-D structures (prediction ofpotential active and cofactor-binding sites) and sug-gest functional hypotheses for further experimentalvalidation

Materials and methods

Target identification

The complete genome sequences were collected for anumber of bacterial species covering a wide range ofevolutionary distances (Gram-negative to Gram-positive) and life-style (free-living parasitic andorintra-cellular) (see Table 1) For a subset of these ge-nomes (lsquoReference Genomesrsquo in Table 1) all potentialcoding regions (ORFs) of length larger than 150 nu-cleotides (with ATG GTG or TTG as initiator codon)were extracted and their conceptual translation storedThis represented a total of 275791 putative ORFsmany of them probably not corresponding to actualproteins

All large-scale sequence comparisons were per-formed using a distributed processing protocol usinga cluster of 48 PC (lsquogigablasterrsquo [8]) under the linuxoperating system Each node included a 500 MH Pen-tium III 256 Mb of RAM and a 13 Gb disk Usingthe sequence alignment program Blastp [9] each ofthese putative ORFs were compared to i) the set ofall E coli K12 ORFs and ii) the sets of B subtilis Llactis S pyogenes and M tuberculosis ORFs Weonly retained the ORFs with a significant match inboth sets Each retained ORF was then mapped backto its E coli orthologous gene using Ecogene [10] asour primary reference database We then further re-duced our target gene set by only retaining the genesof unknown or hypothetical function (lsquoYrsquo genes ac-cording to the E coli nomenclature) A prediction ofmembrane spanning segments [11] was then per-formed on each of the previously selected genes todetermine which were most likely to be expressed assoluble globular proteins

Cloning

In order to allow batches of ORFs to be processed inparallel directional cloning was performed using theGATEWAY system (Invitrogen [12]) Oligonucleo-

tide primers were designed for each of the E coliORF with AttB1 and AttB2 overhangs added for re-combination cloning PCR was performed on E coliK12 purified genomic DNA using Pfx proof readingpolymerase from Invitrogen Alternatively Ex Taq(Takara) or Platinum High fidelity Taq (Invitrogen)were used to relieve amplification problems PCRproducts were purified as described in the GATEWAYmanual (Invitrogen) using Poly Ethylene Glycolprecipitation This procedure was further automatizedusing Multi screen-PCR 96 wells purification fromMillipore The corresponding PCR products were in-serted by homologous recombination using the lsquoonetube reactionrsquo in the pDEST17 expression plasmid inframe with a N-terminal His6-tag under the controlof a T7 promoter The reaction was performed using1 microl BP reaction buffer 25minus125 ng PCR products in avolume of 25 microl 05 microl pDONR vector and 1 microl BPclonase enzyme mix The reaction mix was incubatedat 25 degC for 4 hours before adding 025 microl NaCl075 M 075 microl pDEST17 and 15 microl LR clonase andthen incubated 2 hours at 25 degC 05 microl proteinase Kwas added 10 minutes at 37 degC before transformationin DH5 3 colonies per cloning were picked forovernight cultures using 2ml LB amp medium Plas-mids were then purified using Qiagen miniprep kit(spin or racks) and positive plasmids were screenedby restriction profiling using NdeI and HindIII en-zymes (Biolab) Positive clones were sequenced overa length of 600 nucleotides (GenomExpress) usingattB1 primer (5 ACA AGT TTG TAC AAA AAAGC 3)

Protein expression

The purified plasmids were used for the over-expres-sion of the recombinant proteins both in E coliBL21(DE3) cells and in a cell-free system (RocheApplied Science) (First pass standard protocol [13]for details) Incomplete factorial experimental design(as implemented by the SamBa software [14] httpigs-servercnrs-mrsfrsamba) was used to define a setof 12 conditions corresponding to the combination of3 variables (Expression strain Medium type andTemperature) This set of conditions was used as astarting point for the second pass screeningoptimiza-tion protocols A partially automated procedure (in-cluding liquid handling dilutions bacterial lysatenormalization via a connection to a Packardmicroquantspectrophotometer (Bio-Tek) and dot-blot spotting)was developed using a Multiprobe II (Perkin elmer)

143

automated platform allowing us to screen the expres-sion of 12 ORFs in parallel The cloned ORFs weretransformed in four E coli expression strainsBL21 (DE3) pLysS Rosetta (DE3) Origami (DE3)(Novagen) and C41 (Avidis) For each transformedstrain a colony was picked for overnight preculturein 2 ml 2YT with the suitable antibiotic The Multi-probe II automated platform dispensed on 24 wellsplates 25 ml of one of the 3 selected media (2YT(Difco) Superior Broth (SB) and Turbo Broth (TB)(Athena Enzyme Systems)) ndash each with the suitableantibiotic ndash and inoculated them with 25 precul-tures Culture plates were then incubated at 37 degCwith 250 rpm agitation for one hour before beingtemperature regulated at one of the 3 defined temper-atures (25 degC 37 degC and 42 degC) Induction wasperformed on the Packard platform with 05 mMIPTG when OD600nm reached 04ndash08 Cultures werestopped 3 hours after induction The measuredOD600nm value is used to normalize the amount ofbacteria used for protein quantification The 12 result-ing aliquots were then dispensed in a 96 well plateand centrifuged 10 minutes at 4000 rpm Pellets wereresuspended in 250 microl BugBuster (Novagen) 01 mgml lysozyme and 01 mgml DnaseI (Euromedex) 30min 4 degC under agitation 100 microl of each bacteriallysate was used for estimating the total amount of ex-pressed protein and 150 microl was used to assay thelevel of soluble expression

Assessment of soluble expression

A clearing filter plate (MultiScreen NucleicA fromMillipore) a 022 microm hydrophilic filter plate (Multi-Screen GV from Millipore) and a 96 well plate wereassembled The lysates were dispensed on the firstplate and filtered in one step with one minute centri-fugation at 4000 rpm Lysates were conserved atndash20 degC The detection of recombinant protein wasperformed using a Dot Blot procedure 4 microl of boththe soluble and total protein was resuspended in100 microl of a denaturant buffer (urea 8 M SodiumPhosphate 100 mM Tris-HCl 10 MM pH 8) 25 microl ofthis mix were transferred onto nitrocellulose mem-brane Protan 45 (Schleicher and Schuell) using a blot-ter adapted for the Multiprobe station The membranewas washed twice in TBS Buffer (Tris-HCl 10 mMNaCl 150 mM pH 75) blocked in blocking reagent1X (Qiagen) with 01 Tweenminus20 for 1 hour washed2 times 10 minutes in TBS-TT buffer (Tris HCl20 mM NaCl 500 mM 002 Tweenminus20 02

Triton Xminus100) and washed once 10 minutes in TBSbefore incubation with a penta-His HRP conjugatedantibody (Qiagen) diluted 12000 in blocking reagent1X with 01 Tweenminus20 for 1 hour Subsequentwashes were performed twice for 15 minutes in TBS-TT and once 10 minutes in TBS His-tagged proteinexpression was revealed by detecting HorseRadishPeroxidase activity conjugated to the anti-His5 anti-body using chemiluminescence with ECL RPN2106kit (Amersham Biosciences) on Biomax films(Kodak) Each of the 12 conditions received a scoreaccording to the Dot Blot spot intensity The influenceof each variable on the soluble expression was thenextracted by adding the scores of the experimentssharing the same variable value and comparing thistotal score to the others For instance if we considerthe variable lsquotemperaturersquo the sum of the scores ofthe four experiments performed at 37 degC is comparedto the sum of the scores of the four experiments doneat 25 degC and at 42 degC In the case these total scoresare similar temperature is not considered as a signif-icant factor for protein expression and solubility Ifthe scores are different temperature is retained as apertinent variable The temperature corresponding tothe highest total score was then used for the optimizedexpression protocol

Protein purification and characterization

Protein purifications were performed using the AKTAumlexplorer 10S (Amersham-Pharmacia) using standardaffinity chromatography on HiTrap chelating (Amer-sham-Pharmacia) or NiNTA (Qiagen) columnscharged with Ni2 eluted with an imidazole gradi-ent after removal of non-specific low interactingproteins by a preliminary wash with 50mM imidazolebuffer The recombinant proteins were usually elutedwithin a 70minus150 mM imidazole range A detailed pro-tocol has been published elsewhere [15]

A microdialysis step is applied to each protein toidentify a suitable buffer for desalting and concentra-tion We use 250 microl Dialysis Buttons (HamptonResearch Calafornia USA) with MWCO 10000SpectraPor CE membrane ( Spectrum Labs Califor-nia) 250 microl of each sample are loaded into 6 dialysisbuttons and dialyzed overnight at 4 degC on a rotatingwheel against 50 ml of 6 different buffers in the pres-ence or absence of ionic strength (NaCl 02 M) Weused the following buffers Tris 10 mM pH (7minus9)Mes 10 mM pH (55minus67) Mops 20 mM pH(69minus83) Hepes 10 mM pH (68minus82) Imidazole

144

10 mM pH (62minus78) and Sodium Acetate 10 mM pH(36minus56) The various buffers were adjusted to atleast one pH unit above or below the pI of each pro-tein To estimate the amount of precipitated versussoluble protein 125 microl of each sample were loadedonto a 96 microtitre plate for subsequent UV scanning(DO280total) The remaining 125 microl were centrifugedat 12000 rpm for 20 minutes and the supernatantloaded onto a 96 microtitre plate for soluble proteinevaluation (DO280soluble) The comparison of thetwo UV scans using the respective buffer as the blankwith the KC4 software on a microquant spectrophotom-eter (Bio-tek instruments inc) was used to select theoptimal desalting buffer as the one corresponding tothe smallest (DO280total ndash DO280soluble) The buff-ers were then exchanged using the Hiprep 2610desalting column (Amersham-pharmacia) on theAKTAuml explorer 10S

Quality control of the samples

Before entering crystallization trials purified recom-binant proteins were first characterized using circulardichroiumlsm spectroscopy to assess the folding status ofthe expression products and its consistency withsecondary structure predictions [11] Monodispersitybeing a key factor in the crystallization process theaggregation state of the protein solution was system-atically monitored by dynamics light scattering (Dy-napro MS800-TC) These measurements were alsoused to define the most suitable buffer in which toperform concentration phase diagrams and crystalli-zation trials Each sample was also controlled oniso-electro-focusing native and SDS gels both in theabsence and presence of reducing agent and 2 M ureaThe results were compared with the theoreticalmolecular weight of the target as well as with itspredicted isoelectric point and cysteine content Thestability of the protein over time was assessed bystoring purified samples at 3 temperatures 20 degC4 degC and ndash80 degC Every two weeks they were con-trolled by electrophoresis on SDS gels to reveal aneventual degradation process

Finally the purified proteins were all characterizedby mass spectroscopy (MALDI-TOF VoyagerDE-RP Perceptive Biosystem) and by 30 cycles ofamino-acid N-terminal Edman sequencing (AppliedBiosystem 473A)

Crystallization

The screening for crystallization conditions was per-formed on 3x96-well crystallization plates (Greiner)loaded by an 8-needle dispensing robot (Tecan WS1008 workstation modified for our needs) using one1-microl sitting drop per condition Each protein wasinitially tested against 480 different conditions at aconcentration determined by the phase diagramanalysis The tested crystallization conditions includein-house designed [14] and commercially availablesolution sets (CrystalScreen-Hampton researchWizards-Emerald BioStructures) After analyzing theresults of this initial screen conditions were refinedusing the incomplete factorial design approach [14and ref herein]

Data collection and structure determination

Diffraction data for proteins with structural homo-logues in the PDB [16] are interpreted using themolecular replacement procedure using the AMoResoftware [17] Alternatively the MultiwavelengthAnomalous Diffraction (MAD) technique [18 and refherein] was used for interpreting the diffraction dataof protein with no obvious structural homologuesusing crystals of the seleno-methionine substitutedproteins [19] Detailed protocols can be found in [20]

Results and discussion

Target selection

Because of the detrimental effect of their mutationsgenes belonging to essential pathways exhibit lowevolution rates Moreover the reliable identificationndash using sequence similarity ndash of homologous genesin very distant organisms is only possible for slowlyevolving genes Thus one expect the subsets of ubiq-uitous or lsquowide-spectrumrsquo genes to be enriched in es-sential genes for the same panel of microorganisms

The first step of the target selection process thusconsisted into a comprehensive cross-comparison ofa large panel of complete genomes from Gram-negative and Gram-positive bacteria (separated by 2billions years of divergent evolution) to identify themost conserved genes among those with clear homo-logues both in E coli and at least one Gram-positivebacteria This first step using a BlastP score thresh-old of 150 (with Blosum62) resulted into a set of

145

1300 different homologous ORF sets each with a dif-ferent representative in E coli For each of theseE coli ORFs a conservation index was computed as

CI all

BestScore

EcoliSelfScore

Where EcoliSelfScore is the BlastP [9] score of eachEcoli ORF aligned with itself and BestScore corre-sponds to the score of the best matching ORFs withineach genome (counting 0 for matches scoring belowthe threshold) the sum being performed over alltested genomes The CI value is thus quantifying thesequence conservation of the gene (and its presence)across a large number of evolutionary diverse micro-bial genomes including important pathogens For theinitial 1300 selected genes CI values ranged from225 (for the elongation factor subunit tufB) to 14 forthe putative sugar hydrolase ybgG gene (Figure 1)

For all 1300 selected genes (as well as for all 4222E coli genes) a number of properties were precom-puted and stored in a master database (lsquoMaster DBrsquo)In this database each gene is represented by a spe-cific interactive identity card (Figure 2) The ID cardcontains the gene nucleotide and amino-acid se-quence its precise position in the E coli genome itssignificant protein matches within the Reference Ge-nome set some theoretical properties of its proteinproduct (molecular weight pI extinction coefficient

number of cysteines methionines and chargedresidues) Each gene ID card also gives access to ad-ditional features such as a restriction map of the genethe amino-acid translations derived from alternativereading frames (for error checking) a map of the rarecodons as well as the protein hydropathy profile pre-dicted signal peptide membrane spanning segmentsand secondary structure The lsquoBlastrsquo button (Figure 2)gives access to interactive similarity searches againstadditional genomes listed in Table 1 (lsquoOther Ge-nomesrsquo) This set includes the complete genomes ofmicroorganisms of lesser biomedical relevance (eghyperthermophilic archebacteria) or interesting bacte-ria the genomic sequence of which is nearing com-pletion This list is submitted to periodical changesThe lsquoextended alignmentrsquo button generate a multiplealignment (using the T-coffee [21] program) of the Ecoli ORF with the homologous ORFs identified in theother bacterial genomes The lsquoPdb alignmentrsquo buttondoes the same with homologues of known 3-D struc-tures Master DB and the results of these varioussequence analyses are frequently consulted to guidethe subsequent experimental steps of the project suchas cloning gene expression protein purification andcrystallization trials

The set of 1300 well-conserved gene was furtherreduced to 317 by only retaining the genes of un-known or hypothetical function The distribution ofconservation index (CI values) in this subset rangesfrom 172 to 14 Figure (3A) presents a graph ofknown vs unknown genes ranked according to theirevolutionary conservation (CI value) This graph in-dicates that the most conserved genes (likely to cor-respond to essential pathways) have been historicallycharacterized in priority However it is comforting tosee that a fraction of well-characterized genes arefound at lower CI values where genes of yet unknownfunctions begin to dominate the distribution Thisshows that our protocol of ORF selection based onthe evolutionary conservation across Gram-positiveGram-negative genomes does actually converge ontoactual genes

The set of 317 highly conserved anonymous geneswas further reduced to enhance our probability of suc-cess in the experimental phase of the project that isthe over expression and determination of the 3-Dstructure of the largest possible number of these Ecoli gene products Although the structural determi-nation of membrane-bound proteins will eventuallybe attempted later on (eg by expressing the ORF asseparate domains) we chose to first focus on the

Figure 1 Mandelbrot plot (rank vs score) of the initial 1300 pu-tative target genes The genes (X-axis) are ranked according to theirassociated CI value (Y-axis) The first 200 genes with CI value gt10are ubiquitous genes involved in the most essential cellular path-ways The rest of the distribution is smoothly decreasing suggest-ing a fairly homogeneous selection pressure among all these can-didate target genes Genes of unknown functions are in red geneof known functions are in green

146

easier soluble targets (the so-called lsquolow hangingfruitsrsquo) A prediction of membrane spanning segments[11] was then performed to determine which of theabove 317 genes were most likely to be expressed assoluble globular protein This final bioinformatic fil-ter resulted into a final set of 221 candidate targetgenes The distribution of CI values within this geneset does not exhibit any significant bias ranging from172 to 14 (Figure 3B) Out of the 221 correspond-ing proteins only 17 exhibit a strong sequence simi-larity (defined as gt40 identity over a segment of atleast 130 residues) to entries in the PDB Our 221-candidate gene set is thus likely to correspond to asizable number of interesting new structures and pos-sibly some original folds 110 of these target geneswere attributed to the AFMB laboratory and are notfurther described

Additional bioinformatic analyses

Additional sequence analyses were performed at thevarious stage of the experimental process in particu-lar for genes and proteins for which difficulties wereencountered For instance putative cofactor-bindingmotifs were searched to help expressing purifying orcrystallizing recalcitrant proteins This was performedby combining the recognition of usual sequence sig-natures (eg Prosite motif [22]) with an analysis oftheir conservation within multiple alignments

Multiple alignment is also useful to identify theproper N-terminus of the ORF (for cloning purpose)or delineating detrimental flexible N- or C-terminalsegments of the candidate proteins An analysis of theconfidence measure at the heart of the T-coffee mul-tiple alignment program [21] indicated a correlationbetween the least reliably aligned protein segmentsthe regions of higher sequence variability and thoseof greater flexibility [23] As we work with well-con-servedubiquitous genes the situation is also optimalto take advantage of multiple alignments to identifykey-residues in these lsquoanonymousrsquo proteins and usethe conserved patterns to link them with protein fam-ilies of known functions despite a lack of overallsignificant sequence similarity (see 24 as an exam-ple)

Phylogenomics analyses aiming at identifyingsubset of genes co-segregating or co-evolving acrossmultiple bacterial genomes are also extensively usedto predict protein-protein interaction [25 26] eventu-

ally leading to the design of co-expression co-purifi-cation or co-crystallization experiments Phyloge-nomics is used to suggest links between ORFs ofunknown functions and identified pathways [25]These hints might help the functional interpretation ofthe 3-D structures of our anonymous targets Phylo-genomics analyses are performed with our in-housesoftware PhyDBac [26 httpigs-servercnrs-mrsfrphydbac]

On the negative side the priority of candidategenes was sometimes downgraded due to their ab-sence in important pathogens with reduced genomes(such as H influenzae or Mycoplasma) or for beingtoo similar to their human homologues Ideally onlybacteria-specific biochemical pathways should betargeted for the development of antibiotics (eg peni-cillin blocking peptidoglycan biosynthesis) Howeverthis constraint is probably not acceptable on the longrun given the constant need for new antibiotics andthe limited number of strictly bacterial specific tar-gets A good counter example is the success of thequinolones that are targeting DNA gyrase one of themost conserved proteins (26 identical amino acidsbetween human and E coli)

Finally the amino acid conservation patterns de-rived from multiple alignments become even moreinformative when analyzed in the context of a repre-sentative 3-D structure (from the E coli gene prod-uct) For instance the structural (hydrophobic core orstabilizing residues) vs functional (eg catalyticresidues) nature of the conserved positions can be as-sessed (as illustrated in [20]) Paradoxical features ndashsuch as well-exposed surface loops exhibiting strongresidue conservation ndash are highly informative for thefunctional interpretation of anonymous 3-D struc-tures Protein cavities lined with conserved residuesare also good candidates for lsquoactive sitesrsquo Previouslyunnoticed (ie misaligned) conserved isolated resi-dues (dispersed in the sequence) might be found closetogether in the 3-D structure and suggest the presenceof a catalytic center The spatial distribution of con-served residues can be exhaustively searched withinproteins of known function eventually resulting inthe identification of catalytic sites [27] Finally thedrugability of the regions of the protein pointed outby these analyses can be assessed and large librariesof putative ligands can be computationally screened[28 29]

147

Gene expression and protein purification pipelineFirst pass

The general principle governing our large-scale pro-tein expression approach is to quickly identify thelsquoeasyrsquo ORFs by passing our whole set of 111 througha common first pass standard protocol Further ex-pression trials are then performed on the lsquorecalcitrantrsquoproteins These trials involve lowering the tempera-ture trying other E coli strains adding predictedco-factors etc After about 17 months the first passhas been completed on the 111 targeted ORFs Theresults are as follows 108 ORFs have been success-fully cloned 104 have been tested 89 have exhibiteda detectable expression (either in vivo or using theRoche cell free system) 54 in a soluble form in a firstpass screening In parallel our industrial partner

(Infectious Disease Division Aventis Pharma) is as-sessing the essentiality of these putative target genesin major pathogens using proprietary approaches

Second pass addressing the bottleneck of solubleexpression

The above results identified the expression of the tar-gets in a soluble form as a major bottleneck A sec-ond pass protocol was thus designed using a statisticalscreening of various variables in order to determinethe best conditions for soluble expression We firstdetermined that the three main variables influencingprotein solubility andor expression yield were 1) thebacterial strain 2) the culture temperature and 3) thegrowth medium A set of 12 conditions correspond-ing to the combination of these 3 variables was thus

Figure 2 The yqhE gene identity card in Master DB The various buttons (top left column) provide interactive access to pre-computedadditional information as well as to lsquoon the flyrsquo sequence analysis

148

defined using an incomplete factorial experimentaldesign (see Materials and Methods) A typical exam-ple of such a screen applied to 6 different targets isgiven in Figure 4

At this point out of the 108 ORFs cloned 94 havebeen passed through this screening 93 have been suc-cessfully expressed including 68 in a soluble formThe incomplete factorial approach demonstrated itspower since out of the 94 ORFs screened only 1 wasnot expressed and 25 failed the soluble expressionassay This systematic screening allowed 723 of thetargets to be recovered in a soluble form This is tobe compared to the 52 yield obtained through thefirst pass standard protocol

The scale up bottleneck

For a certain number of proteins we observed thatdirectly scaling up the culture from 4 ml to 1 litercould lead to insoluble expression This problem wasusually solved by simply replacing the 1-liter cultureby several 200-ml cultures

Protein purification

Protein purifications are not yet performed through anautomatic procedure The buffer most suitable forprotein concentration is determined through dialysisexperiments of the purified material monitored byUV spectra and taking into account the theoreticalisoelectric point of each protein The systematiccheck for N-terminal sequence and mass of each pu-rified protein proved extremely useful for debunkingvarious problems (including the mishandling ofsamples) We noticed frequent cases (1531) of spon-taneous removal of the extended His-tagged N-termi-nus inherent to the use of the GATEWAY system (17to 24 first residues removed) The final statistics onthis observation will only be available once all solu-ble proteins will have been purified and sequenced

A prerequisite to successful crystallization wasalso the determination of appropriate storage condi-tions for each purified targets using gel electrophore-sis over time (see Materials and Methods) Forunstable proteins crystallization trials were per-

Figure 2 (Continued)

149

Figure 3 (A) Distribution of known vs unknown genes according to their evolutionary conservation The 1300 most-conserved E coli genesare ranked according to their decreasing CI value (x-axis in parenthesis) and grouped in successive boxes of 50 (Gene rank) For each boxthe number of genes of unknown vs known function are shown in red or green respectively Functionally characterized genes dominate thedistribution for the high CI values but many anonymous genes also exhibit comparable evolutionary conservations(B) Distribution of predicted membrane vs soluble proteins according to their evolutionary conservation (decreasing CI value x-axis inparenthesis) The 317 highly conserved ORFs of unknown function have been divided into membrane-bound (green boxes) vs soluble (redboxes) after sequence analysis There is no overall bias although proteins predicted soluble dominate the distribution of the most conservedcandidates

150

formed on freshly purified samples or using materialstored at minus80 degC

Screening of alternative cloning contexts usinglinear PCR products

To quickly explore alternative cloning contexts (Tagtype and position) for recalcitrant targets we devel-oped an in vitro expression screening based on lineartemplates involving a two steps PCR and universalprimers Tested on 24 different E coli ORFs the lin-ear templates mimicking pDEST17 vector context(N-terminal His-tag) performed as efficiently as thecorresponding circular plasmid This approach canthus be applied to modify the standart pDEST17 ORFcontext into a tagtarget combination more suitablefor soluble expression Possible alternatives includeC-terminal His-tag GST MBP Thioredoxin Streptag and GFP

Overview of the structures obtained to date

YecD molecular replacement using homologymodelingThe interpretation of yecD diffraction data (collectedon the ID29 beam line at the ESRF synchrotron inGrenoble-France) was first attempted by molecularreplacement using AmoRe [17] with the 1NBA and

1IM5 PDB entries These proteins share less than25 identical residues with the yecD sequence andfailed to produce a solution We then build a yecDstructure model with MODELER [30] using theknown structures as templates The yecD model wasused for molecular replacement and produced a goodsolution with 4 molecules per asymmetric unit TheyecD structure was then refined to 13 Aring (PDB 1J2R)YecD and 1NBA are annotated in Swiss-Prot asbelonging to the isochorismatase protein familyHowever the superposition of active site region of theyecD 1NBA and 1IM5 structures reveals that acysteine residue essential to the isochorismatase en-zymatic activity is replaced by a glycine in yecD Inaddition yecD lacks a cis-peptide conserved in 1NBAand 1IM5 It is thus likely that yecD has a differentenzymatic activity Detailed sequence analyses arebeing performed to predict its function

YggV a nucleoside triphosphate phosphohydrolaseThe yggV structure was solved using the MADtechnique with one seleno-methionine Data were col-lected on the BM30 beam line at ESRF The structurewas refined to 18 Aring and was found to be stronglysimilar to PDB entries 2MJP (Nucleoside triphos-phate phosphohydrolase from Methanocaldococcusjannaschii) and 1EX2 (MAF protein from Bacillussubtilis) YggV crystals were thus grown in presenceof dGTP They diffract to 24 Aring resolution The struc-ture of yggV was also solved in the absence of dGTPby the Midwest Center for Structural Genomics (PDB1K7K) A multiple alignment of the homologous se-quences is currently analyzed in the context of thevarious available structures (free forms and com-plexed with nucleotides) Our phylogenomic analysespointed out putative partners to yggV suggesting thatits function in E coli might not be entirely under-stood

YhbO an intracellular proteaseThe yhbO structure was solved by molecular replace-ment using AmoRe with the PDB structure 1G2I (anintracellular protease from Pyrococcus horikoshii)The yhbO structure was subsequently refined to 20Aring resolution A nucleophile elbow motif as well as acysteine residue essential for the protease activity ofthe P horikoshii intracellular protease is conserved inyhbO The multimeric state of the 1G2I structure wasalso described as essential for the catalytic activitythe active site involving residues contributed by twomonomers Interestingly the interface between the

Table 2 Experimental matrix used for expression screening Thefirst column is the experiment number The second column corre-sponds to the 4-state strain variable BL21(DE3) pLysS RosettaOrigami and C41 The third column corresponds to the 3-stategrowth medium variable 2YT SB and TB media The fourth col-umn corresponds to the 3-state temperature variable 25 degC 37 degCand 42 degC

Condition

number

Strain Medium Temperature

1 Bl21 (DE3) pLysS SB 25

2 Rosetta (DE3) TB 25

3 Origami (DE3) 2YT 25

4 C41 (DE3) SB 25

5 Bl21 (DE3) pLysS TB 37

6 Rosetta (DE3) SB 37

7 Origami (DE3) TB 37

8 C41 (DE3) 2YT 37

9 Bl21 (DE3) pLysS 2YT 42

10 Rosetta (DE3) 2YT 42

11 Origami (DE3) SB 42

12 C41 (DE3) TB 42

151

two yhbO monomers as seen in the crystal structureis entirely different from the one found in the 1G2Istructure The functional assignment of the E coliyhbO protein is thus not yet certain The putativeprotease activity of yhbO is currently studied

YqhE significant difference in the active sitesThe yqhE structure was solved by molecular replace-ment using AmoRe with the PDB entry 1A80 ( a 25-diketo-D-gluconic acid reductase A structure fromCorynebacterium) The yqhE structure was then

Figure 4 Dot Blot results of the expression screening on 6 different targets The top 6 Dot blots correspond to the total protein expressionresults The 6 lower Dot blots correspond to the soluble protein expression measurements The numbers above each Dot Blot refer to thetargets Dot Blot results are ranked by decreasing temperature (left to right) and the strains indicated on the right The growth medium isindicated on the left of each Dot Blot

152

Table 3 Project status List and status of our targets as deposited in targetdb (httptargetdbpdborg) under the project name SGI-SG (BIGS)The first column indicates the target name (according to EcoGene [10]) the second column indicates the relevant genome The status (fromlsquoSelectedrsquo to lsquoCrystal Structure) of each target protein is shown in pink For each target arrows are used to indicates the next step to becompleted stars correspond to targets currently stopped and squares correspond to lsquoon holdrsquo targets (failure during the scale-up or crystal-lization processes) pending further testing

153

Table 3 (Continued)

154

refined to 193 Aring resolution (PDB 1MZR) YqhEbelongs to the vast NADPH-dependent aldo-ketoreductase family Multiple alignment was used tocompare the various sequences of the family Thisrevealed two sub-families including all eukaryoticsequences on one hand and all the prokaryoticsequences on the other hand Structurally the eukary-otic proteins differ from the prokaryotic ones by thepresence of much longer loops in the co-factor andligand binding regions These marked structural dif-ferences should make possible the design of smallinhibitory molecule with a good specificity for theprokaryotic targets

YdhF an NADPH-dependent oxidoreductase ofunknown specificityThe ydhF structure was solved using the MAD tech-nique with 8 seleno-methionines The structure wasthen refined to 25 Aring resolution using data collectedon the BM30 beam line at ESRF YdhF belongs to theNADPH-dependent oxidoreductase family Its struc-ture was compared to the other members of the familyusing multiple alignment YqhE shares 267 identi-cal residues over a 217-aa segment with its closeststructural homologue ydhF YdhF exhibits a se-quence insertion in a flexible loop not well resolvedin the structure This difference is not located near theco-factors and ligand binding sites but may still af-fects the enzyme specificity The structure of theNADP containing ydhF is currently under refinementusing data to 28 Aring resolution

YliB a putative n-peptide binding proteinThe yliB structure was solved using the MAD tech-nique with 20 seleno-methionines The structure wasthen refined to 23 Aring resolution using data collectedon the ID14-EH2 beam line at ESRF This membrane-anchored periplasmic protein shares 29 identicalresidues with 1DPE (a dipeptide binding protein fromE coli) and 1DPP (the same protein in complex withglycyl-L-leucine) The loops surrounding the ligand-binding pocket are shorter in the yliB structure com-pared to the homologous structures suggesting thatlarger peptides might be accommodated by yliB Wealso noticed the presence of zinc ion and of some re-sidual electron density in the region homologous tothe 1DPP dipeptide binding site We are currently pro-ducing crystals of yliB in the presence of variousdipeptide in order to verify its affinity for such sub-strates

YdiB a new structural foldThe ydiB structure was solved using the MAD tech-nique with 20 seleno-methionines The structurebuilding is still in progress using data collected at266 Aring resolution on the BM30 beam line at ESRFThe preliminary tracing suggests that the ydiB struc-ture might be an original fold which is confirmed bythe deposition in the PDB of the ydiB structure by theNortheast Structural Genomics Research Consortium(PDB 1NPD) and the Montreal-Kingston BacterialStructural Genomics Initiative (PDB 1O9B)

Conclusion

This structural genomics project is an opportunity todevelop new approaches for the various stages ofgene expression The use of the lsquoone tube reactionrsquoGATEWAY technology allowed the successful clon-ing of 972 of the selected ORFs The developmentof both in vitro and in vivo expression screening withthe efficient sampling of the various parameters influ-encing soluble protein expression allowed 80 of thetested ORFs to be expressed as soluble proteins (anadditional 18 being expressed as inclusion bodies)The automation of the screening using dot-blot rev-elation now allows us to process 12 targets weekly

In our hands the purification step is not a limitingstep since 100 of the 46 proteins that have reachedthat stage were successfully purified However it wasmore challenging to keep them intact and soluble dur-ing the following steps of concentration and bufferexchange required for crystallization The introduc-tion of buffer screening through dialysis controlled byUV scanning has proven very effective allowing theidentification of a suitable buffer for 95 of the pu-rified samples

Finally the quality control we imposed through theentire project albeit time consuming has proven use-ful to trace mistakes such as target mixing andorcross-contamination as well as protein degradationOn a more positive side it also provided informationon unexpected proteinprotein interactions andorligand binding eventually leading to improved crys-tallization protocols To this days 40 of the 46purified recombinant proteins have been throughbiophysical characterization (circular dichroiumlsm spec-troscopy and monodispersity assays using dynamiclight scattering) 32 have entered crystallization trialsand 31 have been sequenced Twelve proteins haveproduced crystals out of which 9 have resulted in us-

155

Figure 5 Protein structures solved in this study The structures are drawn using MOLSCRIPT [31] The YdiB structure was also solved bythe consortium

156

able diffraction data allowing 7 structures to besolved (Figure 5) This number is now expected togrow rapidly as more expressed gene products arereaching the purificationcrystallization stages

Acknowledgements

We thank our collaborators from Aventis Pharma DrsV Blanc E Remy M Black and J Hogdson for theirscientific input We also thank Drs J-L Ferrer FBorel P Carpentier A Thompson G Leonard and JMcCarthy for their help on the FIP BM14 ID29ID14 ESRF beamlines We thank P Pham-Trong(Greiner) for his help and gift of materials Thisproject is funded by a grant (014906055) from theFrench Ministry of Industry lsquopost genomics researchprogramrsquo

References

1 Stover CK Pham XQ Erwin AL Mizoguchi SDWarrener P Hickey MJ Brinkman FS Hufnagle WOKowalik DJ Lagrou M et al (2000) Nature 406959minus964

2 Cassell GH and Mekalanos J JAMA 285 (2001)601minus605

3 Projan S (2002) Curr Opin Pharmacol 2 513minus5224 Clements JM Beckett RP Brown A Catlin G Lobell

M Palan S Thomas W Whittaker M Wood S SalamaS Baker PJ Rodgers HF Barynin V Rice DW andHunter MG (2001) Antimicrob Agents Chemother 45563minus570

5 Barbachyn MR Brickner SJ Gadwood RC GarmonSA Grega KC Hutchinson DK Munesada K Reis-cher RJ Taniguchi M Thomasco LM Toops DS Ya-mada H Ford CW and Zurenko GE (1998) Adv ExpMed Biol 456 219minus238

6 Johnson AP (2001) Curr Opin Investig Drugs121691minus1701

7 Auckland C Teare L Cooke F Kaufmann ME WarnerM Jones G Bamford K Ayles H Johnson AP (2002)J Antimicrob Chemother 50 743minus746

8 Claverie JM Nature 2000 403 12minus12

9 Altschul SF Madden TL Schaffer AA Zhang JZhang Z Miller W and Lipman DJ (1997) Nucleic AcidsRes 25 3389minus3402

10 Rudd KE (2000) Nucleic Acids Res 28 60minus6411 Persson B and Argos P (1997) J Protein Chem 16

453minus45712 Hartley JL Temple GF and Brasch MA (2000) Genome

Res 10 1788minus179513 Monchois V Vincentelli R Deregnaucourt C Abergel C

and Claverie J-M (2002) In Cell-Free Translation Systems(Ed Spirin AS) Springer Verlag New York pp 197minus202

14 Audic S Lopez F Claverie J-M Poirot O and AbergelC (1997) Proteins 29 252minus257

15 Monchois V Abergel C Sturgis J Jeudy S and ClaverieJ-M (2001) J Biol Chem 276 18437minus18441

16 Bhat TN Bourne P Feng Z Gilliland G Jain S Rav-ichandran V Schneider BSchneider K Thanki N Weis-sig H Westbrook J and Berman HM (2001) Nucleic Ac-ids Res 29 214minus218

17 Navaza J (2001) Acta Crystallogr D57 1367minus137218 Liu Y Ogata CM and HendricksonWA (2001) Proc Natl

Acad Sci U S A 98 10648minus1065319 Hendrickson WA Horton JR and LeMaster DM (1990)

EMBO J 9 1665minus167220 Abergel C Bouveret E Claverie JM Brown K Riga

A Lazdunski C and Benedetti H (1999) Structure FoldDes 71291minus1300

21 Notredame C Higgins DG Heringa J (2000) J MolBiol 302 205minus217

22 Falquet L Pagni M Bucher P Hulo N Sigrist CJHofmann K and Bairoch A (2002) Nucleic Acids Res 30235minus238

23 Notredame C and Abergel C (2003) In Bioinformatics andGenomes (Andrade MA ed) Horizon Scientific Press Wy-mondhamUK pp 27minus49

24 Abergel C Robertson DL Claverie J-M (1999) J Virol73 751minus753

25 Marcotte EM Pellegrini M Thompson MJ Yeates TOand Eisenberg D (1999) Nature 402 83minus86

26 Enault F Suhre K Poirot O Abergel C and ClaverieJ-M (2003) Nucleic Acids Res In press

27 Madabushi S Yao H Marsh M Kristensen DM Phil-ippi A Sowa ME and Lichtarge O (2002) J Mol Biol316 139minus154

28 Klebe G (2000) J Mol Med 78 269minus28129 Apostolakis J and Caflisch A (1999) Comb Chem High

Throughput Screen 2 91minus10430 Sali A and Blundell T L (1993) J Mol Biol 234

779minus81531 Kraulis PJ (1991) J Appl Crystallogr 24 946minus950

157

Page 3: Structural genomics of highly conserved microbial genes of ...metabolomics.helmholtz-muenchen.de/suhre/... · By focusing a structural genomics project on such anonymous proteins,

Finally we show how the mapping of paradoxicalconservation patterns revealed by the multiple align-ment of orthologous sequences can help the interpre-tation of anonymous 3-D structures (prediction ofpotential active and cofactor-binding sites) and sug-gest functional hypotheses for further experimentalvalidation

Materials and methods

Target identification

The complete genome sequences were collected for anumber of bacterial species covering a wide range ofevolutionary distances (Gram-negative to Gram-positive) and life-style (free-living parasitic andorintra-cellular) (see Table 1) For a subset of these ge-nomes (lsquoReference Genomesrsquo in Table 1) all potentialcoding regions (ORFs) of length larger than 150 nu-cleotides (with ATG GTG or TTG as initiator codon)were extracted and their conceptual translation storedThis represented a total of 275791 putative ORFsmany of them probably not corresponding to actualproteins

All large-scale sequence comparisons were per-formed using a distributed processing protocol usinga cluster of 48 PC (lsquogigablasterrsquo [8]) under the linuxoperating system Each node included a 500 MH Pen-tium III 256 Mb of RAM and a 13 Gb disk Usingthe sequence alignment program Blastp [9] each ofthese putative ORFs were compared to i) the set ofall E coli K12 ORFs and ii) the sets of B subtilis Llactis S pyogenes and M tuberculosis ORFs Weonly retained the ORFs with a significant match inboth sets Each retained ORF was then mapped backto its E coli orthologous gene using Ecogene [10] asour primary reference database We then further re-duced our target gene set by only retaining the genesof unknown or hypothetical function (lsquoYrsquo genes ac-cording to the E coli nomenclature) A prediction ofmembrane spanning segments [11] was then per-formed on each of the previously selected genes todetermine which were most likely to be expressed assoluble globular proteins

Cloning

In order to allow batches of ORFs to be processed inparallel directional cloning was performed using theGATEWAY system (Invitrogen [12]) Oligonucleo-

tide primers were designed for each of the E coliORF with AttB1 and AttB2 overhangs added for re-combination cloning PCR was performed on E coliK12 purified genomic DNA using Pfx proof readingpolymerase from Invitrogen Alternatively Ex Taq(Takara) or Platinum High fidelity Taq (Invitrogen)were used to relieve amplification problems PCRproducts were purified as described in the GATEWAYmanual (Invitrogen) using Poly Ethylene Glycolprecipitation This procedure was further automatizedusing Multi screen-PCR 96 wells purification fromMillipore The corresponding PCR products were in-serted by homologous recombination using the lsquoonetube reactionrsquo in the pDEST17 expression plasmid inframe with a N-terminal His6-tag under the controlof a T7 promoter The reaction was performed using1 microl BP reaction buffer 25minus125 ng PCR products in avolume of 25 microl 05 microl pDONR vector and 1 microl BPclonase enzyme mix The reaction mix was incubatedat 25 degC for 4 hours before adding 025 microl NaCl075 M 075 microl pDEST17 and 15 microl LR clonase andthen incubated 2 hours at 25 degC 05 microl proteinase Kwas added 10 minutes at 37 degC before transformationin DH5 3 colonies per cloning were picked forovernight cultures using 2ml LB amp medium Plas-mids were then purified using Qiagen miniprep kit(spin or racks) and positive plasmids were screenedby restriction profiling using NdeI and HindIII en-zymes (Biolab) Positive clones were sequenced overa length of 600 nucleotides (GenomExpress) usingattB1 primer (5 ACA AGT TTG TAC AAA AAAGC 3)

Protein expression

The purified plasmids were used for the over-expres-sion of the recombinant proteins both in E coliBL21(DE3) cells and in a cell-free system (RocheApplied Science) (First pass standard protocol [13]for details) Incomplete factorial experimental design(as implemented by the SamBa software [14] httpigs-servercnrs-mrsfrsamba) was used to define a setof 12 conditions corresponding to the combination of3 variables (Expression strain Medium type andTemperature) This set of conditions was used as astarting point for the second pass screeningoptimiza-tion protocols A partially automated procedure (in-cluding liquid handling dilutions bacterial lysatenormalization via a connection to a Packardmicroquantspectrophotometer (Bio-Tek) and dot-blot spotting)was developed using a Multiprobe II (Perkin elmer)

143

automated platform allowing us to screen the expres-sion of 12 ORFs in parallel The cloned ORFs weretransformed in four E coli expression strainsBL21 (DE3) pLysS Rosetta (DE3) Origami (DE3)(Novagen) and C41 (Avidis) For each transformedstrain a colony was picked for overnight preculturein 2 ml 2YT with the suitable antibiotic The Multi-probe II automated platform dispensed on 24 wellsplates 25 ml of one of the 3 selected media (2YT(Difco) Superior Broth (SB) and Turbo Broth (TB)(Athena Enzyme Systems)) ndash each with the suitableantibiotic ndash and inoculated them with 25 precul-tures Culture plates were then incubated at 37 degCwith 250 rpm agitation for one hour before beingtemperature regulated at one of the 3 defined temper-atures (25 degC 37 degC and 42 degC) Induction wasperformed on the Packard platform with 05 mMIPTG when OD600nm reached 04ndash08 Cultures werestopped 3 hours after induction The measuredOD600nm value is used to normalize the amount ofbacteria used for protein quantification The 12 result-ing aliquots were then dispensed in a 96 well plateand centrifuged 10 minutes at 4000 rpm Pellets wereresuspended in 250 microl BugBuster (Novagen) 01 mgml lysozyme and 01 mgml DnaseI (Euromedex) 30min 4 degC under agitation 100 microl of each bacteriallysate was used for estimating the total amount of ex-pressed protein and 150 microl was used to assay thelevel of soluble expression

Assessment of soluble expression

A clearing filter plate (MultiScreen NucleicA fromMillipore) a 022 microm hydrophilic filter plate (Multi-Screen GV from Millipore) and a 96 well plate wereassembled The lysates were dispensed on the firstplate and filtered in one step with one minute centri-fugation at 4000 rpm Lysates were conserved atndash20 degC The detection of recombinant protein wasperformed using a Dot Blot procedure 4 microl of boththe soluble and total protein was resuspended in100 microl of a denaturant buffer (urea 8 M SodiumPhosphate 100 mM Tris-HCl 10 MM pH 8) 25 microl ofthis mix were transferred onto nitrocellulose mem-brane Protan 45 (Schleicher and Schuell) using a blot-ter adapted for the Multiprobe station The membranewas washed twice in TBS Buffer (Tris-HCl 10 mMNaCl 150 mM pH 75) blocked in blocking reagent1X (Qiagen) with 01 Tweenminus20 for 1 hour washed2 times 10 minutes in TBS-TT buffer (Tris HCl20 mM NaCl 500 mM 002 Tweenminus20 02

Triton Xminus100) and washed once 10 minutes in TBSbefore incubation with a penta-His HRP conjugatedantibody (Qiagen) diluted 12000 in blocking reagent1X with 01 Tweenminus20 for 1 hour Subsequentwashes were performed twice for 15 minutes in TBS-TT and once 10 minutes in TBS His-tagged proteinexpression was revealed by detecting HorseRadishPeroxidase activity conjugated to the anti-His5 anti-body using chemiluminescence with ECL RPN2106kit (Amersham Biosciences) on Biomax films(Kodak) Each of the 12 conditions received a scoreaccording to the Dot Blot spot intensity The influenceof each variable on the soluble expression was thenextracted by adding the scores of the experimentssharing the same variable value and comparing thistotal score to the others For instance if we considerthe variable lsquotemperaturersquo the sum of the scores ofthe four experiments performed at 37 degC is comparedto the sum of the scores of the four experiments doneat 25 degC and at 42 degC In the case these total scoresare similar temperature is not considered as a signif-icant factor for protein expression and solubility Ifthe scores are different temperature is retained as apertinent variable The temperature corresponding tothe highest total score was then used for the optimizedexpression protocol

Protein purification and characterization

Protein purifications were performed using the AKTAumlexplorer 10S (Amersham-Pharmacia) using standardaffinity chromatography on HiTrap chelating (Amer-sham-Pharmacia) or NiNTA (Qiagen) columnscharged with Ni2 eluted with an imidazole gradi-ent after removal of non-specific low interactingproteins by a preliminary wash with 50mM imidazolebuffer The recombinant proteins were usually elutedwithin a 70minus150 mM imidazole range A detailed pro-tocol has been published elsewhere [15]

A microdialysis step is applied to each protein toidentify a suitable buffer for desalting and concentra-tion We use 250 microl Dialysis Buttons (HamptonResearch Calafornia USA) with MWCO 10000SpectraPor CE membrane ( Spectrum Labs Califor-nia) 250 microl of each sample are loaded into 6 dialysisbuttons and dialyzed overnight at 4 degC on a rotatingwheel against 50 ml of 6 different buffers in the pres-ence or absence of ionic strength (NaCl 02 M) Weused the following buffers Tris 10 mM pH (7minus9)Mes 10 mM pH (55minus67) Mops 20 mM pH(69minus83) Hepes 10 mM pH (68minus82) Imidazole

144

10 mM pH (62minus78) and Sodium Acetate 10 mM pH(36minus56) The various buffers were adjusted to atleast one pH unit above or below the pI of each pro-tein To estimate the amount of precipitated versussoluble protein 125 microl of each sample were loadedonto a 96 microtitre plate for subsequent UV scanning(DO280total) The remaining 125 microl were centrifugedat 12000 rpm for 20 minutes and the supernatantloaded onto a 96 microtitre plate for soluble proteinevaluation (DO280soluble) The comparison of thetwo UV scans using the respective buffer as the blankwith the KC4 software on a microquant spectrophotom-eter (Bio-tek instruments inc) was used to select theoptimal desalting buffer as the one corresponding tothe smallest (DO280total ndash DO280soluble) The buff-ers were then exchanged using the Hiprep 2610desalting column (Amersham-pharmacia) on theAKTAuml explorer 10S

Quality control of the samples

Before entering crystallization trials purified recom-binant proteins were first characterized using circulardichroiumlsm spectroscopy to assess the folding status ofthe expression products and its consistency withsecondary structure predictions [11] Monodispersitybeing a key factor in the crystallization process theaggregation state of the protein solution was system-atically monitored by dynamics light scattering (Dy-napro MS800-TC) These measurements were alsoused to define the most suitable buffer in which toperform concentration phase diagrams and crystalli-zation trials Each sample was also controlled oniso-electro-focusing native and SDS gels both in theabsence and presence of reducing agent and 2 M ureaThe results were compared with the theoreticalmolecular weight of the target as well as with itspredicted isoelectric point and cysteine content Thestability of the protein over time was assessed bystoring purified samples at 3 temperatures 20 degC4 degC and ndash80 degC Every two weeks they were con-trolled by electrophoresis on SDS gels to reveal aneventual degradation process

Finally the purified proteins were all characterizedby mass spectroscopy (MALDI-TOF VoyagerDE-RP Perceptive Biosystem) and by 30 cycles ofamino-acid N-terminal Edman sequencing (AppliedBiosystem 473A)

Crystallization

The screening for crystallization conditions was per-formed on 3x96-well crystallization plates (Greiner)loaded by an 8-needle dispensing robot (Tecan WS1008 workstation modified for our needs) using one1-microl sitting drop per condition Each protein wasinitially tested against 480 different conditions at aconcentration determined by the phase diagramanalysis The tested crystallization conditions includein-house designed [14] and commercially availablesolution sets (CrystalScreen-Hampton researchWizards-Emerald BioStructures) After analyzing theresults of this initial screen conditions were refinedusing the incomplete factorial design approach [14and ref herein]

Data collection and structure determination

Diffraction data for proteins with structural homo-logues in the PDB [16] are interpreted using themolecular replacement procedure using the AMoResoftware [17] Alternatively the MultiwavelengthAnomalous Diffraction (MAD) technique [18 and refherein] was used for interpreting the diffraction dataof protein with no obvious structural homologuesusing crystals of the seleno-methionine substitutedproteins [19] Detailed protocols can be found in [20]

Results and discussion

Target selection

Because of the detrimental effect of their mutationsgenes belonging to essential pathways exhibit lowevolution rates Moreover the reliable identificationndash using sequence similarity ndash of homologous genesin very distant organisms is only possible for slowlyevolving genes Thus one expect the subsets of ubiq-uitous or lsquowide-spectrumrsquo genes to be enriched in es-sential genes for the same panel of microorganisms

The first step of the target selection process thusconsisted into a comprehensive cross-comparison ofa large panel of complete genomes from Gram-negative and Gram-positive bacteria (separated by 2billions years of divergent evolution) to identify themost conserved genes among those with clear homo-logues both in E coli and at least one Gram-positivebacteria This first step using a BlastP score thresh-old of 150 (with Blosum62) resulted into a set of

145

1300 different homologous ORF sets each with a dif-ferent representative in E coli For each of theseE coli ORFs a conservation index was computed as

CI all

BestScore

EcoliSelfScore

Where EcoliSelfScore is the BlastP [9] score of eachEcoli ORF aligned with itself and BestScore corre-sponds to the score of the best matching ORFs withineach genome (counting 0 for matches scoring belowthe threshold) the sum being performed over alltested genomes The CI value is thus quantifying thesequence conservation of the gene (and its presence)across a large number of evolutionary diverse micro-bial genomes including important pathogens For theinitial 1300 selected genes CI values ranged from225 (for the elongation factor subunit tufB) to 14 forthe putative sugar hydrolase ybgG gene (Figure 1)

For all 1300 selected genes (as well as for all 4222E coli genes) a number of properties were precom-puted and stored in a master database (lsquoMaster DBrsquo)In this database each gene is represented by a spe-cific interactive identity card (Figure 2) The ID cardcontains the gene nucleotide and amino-acid se-quence its precise position in the E coli genome itssignificant protein matches within the Reference Ge-nome set some theoretical properties of its proteinproduct (molecular weight pI extinction coefficient

number of cysteines methionines and chargedresidues) Each gene ID card also gives access to ad-ditional features such as a restriction map of the genethe amino-acid translations derived from alternativereading frames (for error checking) a map of the rarecodons as well as the protein hydropathy profile pre-dicted signal peptide membrane spanning segmentsand secondary structure The lsquoBlastrsquo button (Figure 2)gives access to interactive similarity searches againstadditional genomes listed in Table 1 (lsquoOther Ge-nomesrsquo) This set includes the complete genomes ofmicroorganisms of lesser biomedical relevance (eghyperthermophilic archebacteria) or interesting bacte-ria the genomic sequence of which is nearing com-pletion This list is submitted to periodical changesThe lsquoextended alignmentrsquo button generate a multiplealignment (using the T-coffee [21] program) of the Ecoli ORF with the homologous ORFs identified in theother bacterial genomes The lsquoPdb alignmentrsquo buttondoes the same with homologues of known 3-D struc-tures Master DB and the results of these varioussequence analyses are frequently consulted to guidethe subsequent experimental steps of the project suchas cloning gene expression protein purification andcrystallization trials

The set of 1300 well-conserved gene was furtherreduced to 317 by only retaining the genes of un-known or hypothetical function The distribution ofconservation index (CI values) in this subset rangesfrom 172 to 14 Figure (3A) presents a graph ofknown vs unknown genes ranked according to theirevolutionary conservation (CI value) This graph in-dicates that the most conserved genes (likely to cor-respond to essential pathways) have been historicallycharacterized in priority However it is comforting tosee that a fraction of well-characterized genes arefound at lower CI values where genes of yet unknownfunctions begin to dominate the distribution Thisshows that our protocol of ORF selection based onthe evolutionary conservation across Gram-positiveGram-negative genomes does actually converge ontoactual genes

The set of 317 highly conserved anonymous geneswas further reduced to enhance our probability of suc-cess in the experimental phase of the project that isthe over expression and determination of the 3-Dstructure of the largest possible number of these Ecoli gene products Although the structural determi-nation of membrane-bound proteins will eventuallybe attempted later on (eg by expressing the ORF asseparate domains) we chose to first focus on the

Figure 1 Mandelbrot plot (rank vs score) of the initial 1300 pu-tative target genes The genes (X-axis) are ranked according to theirassociated CI value (Y-axis) The first 200 genes with CI value gt10are ubiquitous genes involved in the most essential cellular path-ways The rest of the distribution is smoothly decreasing suggest-ing a fairly homogeneous selection pressure among all these can-didate target genes Genes of unknown functions are in red geneof known functions are in green

146

easier soluble targets (the so-called lsquolow hangingfruitsrsquo) A prediction of membrane spanning segments[11] was then performed to determine which of theabove 317 genes were most likely to be expressed assoluble globular protein This final bioinformatic fil-ter resulted into a final set of 221 candidate targetgenes The distribution of CI values within this geneset does not exhibit any significant bias ranging from172 to 14 (Figure 3B) Out of the 221 correspond-ing proteins only 17 exhibit a strong sequence simi-larity (defined as gt40 identity over a segment of atleast 130 residues) to entries in the PDB Our 221-candidate gene set is thus likely to correspond to asizable number of interesting new structures and pos-sibly some original folds 110 of these target geneswere attributed to the AFMB laboratory and are notfurther described

Additional bioinformatic analyses

Additional sequence analyses were performed at thevarious stage of the experimental process in particu-lar for genes and proteins for which difficulties wereencountered For instance putative cofactor-bindingmotifs were searched to help expressing purifying orcrystallizing recalcitrant proteins This was performedby combining the recognition of usual sequence sig-natures (eg Prosite motif [22]) with an analysis oftheir conservation within multiple alignments

Multiple alignment is also useful to identify theproper N-terminus of the ORF (for cloning purpose)or delineating detrimental flexible N- or C-terminalsegments of the candidate proteins An analysis of theconfidence measure at the heart of the T-coffee mul-tiple alignment program [21] indicated a correlationbetween the least reliably aligned protein segmentsthe regions of higher sequence variability and thoseof greater flexibility [23] As we work with well-con-servedubiquitous genes the situation is also optimalto take advantage of multiple alignments to identifykey-residues in these lsquoanonymousrsquo proteins and usethe conserved patterns to link them with protein fam-ilies of known functions despite a lack of overallsignificant sequence similarity (see 24 as an exam-ple)

Phylogenomics analyses aiming at identifyingsubset of genes co-segregating or co-evolving acrossmultiple bacterial genomes are also extensively usedto predict protein-protein interaction [25 26] eventu-

ally leading to the design of co-expression co-purifi-cation or co-crystallization experiments Phyloge-nomics is used to suggest links between ORFs ofunknown functions and identified pathways [25]These hints might help the functional interpretation ofthe 3-D structures of our anonymous targets Phylo-genomics analyses are performed with our in-housesoftware PhyDBac [26 httpigs-servercnrs-mrsfrphydbac]

On the negative side the priority of candidategenes was sometimes downgraded due to their ab-sence in important pathogens with reduced genomes(such as H influenzae or Mycoplasma) or for beingtoo similar to their human homologues Ideally onlybacteria-specific biochemical pathways should betargeted for the development of antibiotics (eg peni-cillin blocking peptidoglycan biosynthesis) Howeverthis constraint is probably not acceptable on the longrun given the constant need for new antibiotics andthe limited number of strictly bacterial specific tar-gets A good counter example is the success of thequinolones that are targeting DNA gyrase one of themost conserved proteins (26 identical amino acidsbetween human and E coli)

Finally the amino acid conservation patterns de-rived from multiple alignments become even moreinformative when analyzed in the context of a repre-sentative 3-D structure (from the E coli gene prod-uct) For instance the structural (hydrophobic core orstabilizing residues) vs functional (eg catalyticresidues) nature of the conserved positions can be as-sessed (as illustrated in [20]) Paradoxical features ndashsuch as well-exposed surface loops exhibiting strongresidue conservation ndash are highly informative for thefunctional interpretation of anonymous 3-D struc-tures Protein cavities lined with conserved residuesare also good candidates for lsquoactive sitesrsquo Previouslyunnoticed (ie misaligned) conserved isolated resi-dues (dispersed in the sequence) might be found closetogether in the 3-D structure and suggest the presenceof a catalytic center The spatial distribution of con-served residues can be exhaustively searched withinproteins of known function eventually resulting inthe identification of catalytic sites [27] Finally thedrugability of the regions of the protein pointed outby these analyses can be assessed and large librariesof putative ligands can be computationally screened[28 29]

147

Gene expression and protein purification pipelineFirst pass

The general principle governing our large-scale pro-tein expression approach is to quickly identify thelsquoeasyrsquo ORFs by passing our whole set of 111 througha common first pass standard protocol Further ex-pression trials are then performed on the lsquorecalcitrantrsquoproteins These trials involve lowering the tempera-ture trying other E coli strains adding predictedco-factors etc After about 17 months the first passhas been completed on the 111 targeted ORFs Theresults are as follows 108 ORFs have been success-fully cloned 104 have been tested 89 have exhibiteda detectable expression (either in vivo or using theRoche cell free system) 54 in a soluble form in a firstpass screening In parallel our industrial partner

(Infectious Disease Division Aventis Pharma) is as-sessing the essentiality of these putative target genesin major pathogens using proprietary approaches

Second pass addressing the bottleneck of solubleexpression

The above results identified the expression of the tar-gets in a soluble form as a major bottleneck A sec-ond pass protocol was thus designed using a statisticalscreening of various variables in order to determinethe best conditions for soluble expression We firstdetermined that the three main variables influencingprotein solubility andor expression yield were 1) thebacterial strain 2) the culture temperature and 3) thegrowth medium A set of 12 conditions correspond-ing to the combination of these 3 variables was thus

Figure 2 The yqhE gene identity card in Master DB The various buttons (top left column) provide interactive access to pre-computedadditional information as well as to lsquoon the flyrsquo sequence analysis

148

defined using an incomplete factorial experimentaldesign (see Materials and Methods) A typical exam-ple of such a screen applied to 6 different targets isgiven in Figure 4

At this point out of the 108 ORFs cloned 94 havebeen passed through this screening 93 have been suc-cessfully expressed including 68 in a soluble formThe incomplete factorial approach demonstrated itspower since out of the 94 ORFs screened only 1 wasnot expressed and 25 failed the soluble expressionassay This systematic screening allowed 723 of thetargets to be recovered in a soluble form This is tobe compared to the 52 yield obtained through thefirst pass standard protocol

The scale up bottleneck

For a certain number of proteins we observed thatdirectly scaling up the culture from 4 ml to 1 litercould lead to insoluble expression This problem wasusually solved by simply replacing the 1-liter cultureby several 200-ml cultures

Protein purification

Protein purifications are not yet performed through anautomatic procedure The buffer most suitable forprotein concentration is determined through dialysisexperiments of the purified material monitored byUV spectra and taking into account the theoreticalisoelectric point of each protein The systematiccheck for N-terminal sequence and mass of each pu-rified protein proved extremely useful for debunkingvarious problems (including the mishandling ofsamples) We noticed frequent cases (1531) of spon-taneous removal of the extended His-tagged N-termi-nus inherent to the use of the GATEWAY system (17to 24 first residues removed) The final statistics onthis observation will only be available once all solu-ble proteins will have been purified and sequenced

A prerequisite to successful crystallization wasalso the determination of appropriate storage condi-tions for each purified targets using gel electrophore-sis over time (see Materials and Methods) Forunstable proteins crystallization trials were per-

Figure 2 (Continued)

149

Figure 3 (A) Distribution of known vs unknown genes according to their evolutionary conservation The 1300 most-conserved E coli genesare ranked according to their decreasing CI value (x-axis in parenthesis) and grouped in successive boxes of 50 (Gene rank) For each boxthe number of genes of unknown vs known function are shown in red or green respectively Functionally characterized genes dominate thedistribution for the high CI values but many anonymous genes also exhibit comparable evolutionary conservations(B) Distribution of predicted membrane vs soluble proteins according to their evolutionary conservation (decreasing CI value x-axis inparenthesis) The 317 highly conserved ORFs of unknown function have been divided into membrane-bound (green boxes) vs soluble (redboxes) after sequence analysis There is no overall bias although proteins predicted soluble dominate the distribution of the most conservedcandidates

150

formed on freshly purified samples or using materialstored at minus80 degC

Screening of alternative cloning contexts usinglinear PCR products

To quickly explore alternative cloning contexts (Tagtype and position) for recalcitrant targets we devel-oped an in vitro expression screening based on lineartemplates involving a two steps PCR and universalprimers Tested on 24 different E coli ORFs the lin-ear templates mimicking pDEST17 vector context(N-terminal His-tag) performed as efficiently as thecorresponding circular plasmid This approach canthus be applied to modify the standart pDEST17 ORFcontext into a tagtarget combination more suitablefor soluble expression Possible alternatives includeC-terminal His-tag GST MBP Thioredoxin Streptag and GFP

Overview of the structures obtained to date

YecD molecular replacement using homologymodelingThe interpretation of yecD diffraction data (collectedon the ID29 beam line at the ESRF synchrotron inGrenoble-France) was first attempted by molecularreplacement using AmoRe [17] with the 1NBA and

1IM5 PDB entries These proteins share less than25 identical residues with the yecD sequence andfailed to produce a solution We then build a yecDstructure model with MODELER [30] using theknown structures as templates The yecD model wasused for molecular replacement and produced a goodsolution with 4 molecules per asymmetric unit TheyecD structure was then refined to 13 Aring (PDB 1J2R)YecD and 1NBA are annotated in Swiss-Prot asbelonging to the isochorismatase protein familyHowever the superposition of active site region of theyecD 1NBA and 1IM5 structures reveals that acysteine residue essential to the isochorismatase en-zymatic activity is replaced by a glycine in yecD Inaddition yecD lacks a cis-peptide conserved in 1NBAand 1IM5 It is thus likely that yecD has a differentenzymatic activity Detailed sequence analyses arebeing performed to predict its function

YggV a nucleoside triphosphate phosphohydrolaseThe yggV structure was solved using the MADtechnique with one seleno-methionine Data were col-lected on the BM30 beam line at ESRF The structurewas refined to 18 Aring and was found to be stronglysimilar to PDB entries 2MJP (Nucleoside triphos-phate phosphohydrolase from Methanocaldococcusjannaschii) and 1EX2 (MAF protein from Bacillussubtilis) YggV crystals were thus grown in presenceof dGTP They diffract to 24 Aring resolution The struc-ture of yggV was also solved in the absence of dGTPby the Midwest Center for Structural Genomics (PDB1K7K) A multiple alignment of the homologous se-quences is currently analyzed in the context of thevarious available structures (free forms and com-plexed with nucleotides) Our phylogenomic analysespointed out putative partners to yggV suggesting thatits function in E coli might not be entirely under-stood

YhbO an intracellular proteaseThe yhbO structure was solved by molecular replace-ment using AmoRe with the PDB structure 1G2I (anintracellular protease from Pyrococcus horikoshii)The yhbO structure was subsequently refined to 20Aring resolution A nucleophile elbow motif as well as acysteine residue essential for the protease activity ofthe P horikoshii intracellular protease is conserved inyhbO The multimeric state of the 1G2I structure wasalso described as essential for the catalytic activitythe active site involving residues contributed by twomonomers Interestingly the interface between the

Table 2 Experimental matrix used for expression screening Thefirst column is the experiment number The second column corre-sponds to the 4-state strain variable BL21(DE3) pLysS RosettaOrigami and C41 The third column corresponds to the 3-stategrowth medium variable 2YT SB and TB media The fourth col-umn corresponds to the 3-state temperature variable 25 degC 37 degCand 42 degC

Condition

number

Strain Medium Temperature

1 Bl21 (DE3) pLysS SB 25

2 Rosetta (DE3) TB 25

3 Origami (DE3) 2YT 25

4 C41 (DE3) SB 25

5 Bl21 (DE3) pLysS TB 37

6 Rosetta (DE3) SB 37

7 Origami (DE3) TB 37

8 C41 (DE3) 2YT 37

9 Bl21 (DE3) pLysS 2YT 42

10 Rosetta (DE3) 2YT 42

11 Origami (DE3) SB 42

12 C41 (DE3) TB 42

151

two yhbO monomers as seen in the crystal structureis entirely different from the one found in the 1G2Istructure The functional assignment of the E coliyhbO protein is thus not yet certain The putativeprotease activity of yhbO is currently studied

YqhE significant difference in the active sitesThe yqhE structure was solved by molecular replace-ment using AmoRe with the PDB entry 1A80 ( a 25-diketo-D-gluconic acid reductase A structure fromCorynebacterium) The yqhE structure was then

Figure 4 Dot Blot results of the expression screening on 6 different targets The top 6 Dot blots correspond to the total protein expressionresults The 6 lower Dot blots correspond to the soluble protein expression measurements The numbers above each Dot Blot refer to thetargets Dot Blot results are ranked by decreasing temperature (left to right) and the strains indicated on the right The growth medium isindicated on the left of each Dot Blot

152

Table 3 Project status List and status of our targets as deposited in targetdb (httptargetdbpdborg) under the project name SGI-SG (BIGS)The first column indicates the target name (according to EcoGene [10]) the second column indicates the relevant genome The status (fromlsquoSelectedrsquo to lsquoCrystal Structure) of each target protein is shown in pink For each target arrows are used to indicates the next step to becompleted stars correspond to targets currently stopped and squares correspond to lsquoon holdrsquo targets (failure during the scale-up or crystal-lization processes) pending further testing

153

Table 3 (Continued)

154

refined to 193 Aring resolution (PDB 1MZR) YqhEbelongs to the vast NADPH-dependent aldo-ketoreductase family Multiple alignment was used tocompare the various sequences of the family Thisrevealed two sub-families including all eukaryoticsequences on one hand and all the prokaryoticsequences on the other hand Structurally the eukary-otic proteins differ from the prokaryotic ones by thepresence of much longer loops in the co-factor andligand binding regions These marked structural dif-ferences should make possible the design of smallinhibitory molecule with a good specificity for theprokaryotic targets

YdhF an NADPH-dependent oxidoreductase ofunknown specificityThe ydhF structure was solved using the MAD tech-nique with 8 seleno-methionines The structure wasthen refined to 25 Aring resolution using data collectedon the BM30 beam line at ESRF YdhF belongs to theNADPH-dependent oxidoreductase family Its struc-ture was compared to the other members of the familyusing multiple alignment YqhE shares 267 identi-cal residues over a 217-aa segment with its closeststructural homologue ydhF YdhF exhibits a se-quence insertion in a flexible loop not well resolvedin the structure This difference is not located near theco-factors and ligand binding sites but may still af-fects the enzyme specificity The structure of theNADP containing ydhF is currently under refinementusing data to 28 Aring resolution

YliB a putative n-peptide binding proteinThe yliB structure was solved using the MAD tech-nique with 20 seleno-methionines The structure wasthen refined to 23 Aring resolution using data collectedon the ID14-EH2 beam line at ESRF This membrane-anchored periplasmic protein shares 29 identicalresidues with 1DPE (a dipeptide binding protein fromE coli) and 1DPP (the same protein in complex withglycyl-L-leucine) The loops surrounding the ligand-binding pocket are shorter in the yliB structure com-pared to the homologous structures suggesting thatlarger peptides might be accommodated by yliB Wealso noticed the presence of zinc ion and of some re-sidual electron density in the region homologous tothe 1DPP dipeptide binding site We are currently pro-ducing crystals of yliB in the presence of variousdipeptide in order to verify its affinity for such sub-strates

YdiB a new structural foldThe ydiB structure was solved using the MAD tech-nique with 20 seleno-methionines The structurebuilding is still in progress using data collected at266 Aring resolution on the BM30 beam line at ESRFThe preliminary tracing suggests that the ydiB struc-ture might be an original fold which is confirmed bythe deposition in the PDB of the ydiB structure by theNortheast Structural Genomics Research Consortium(PDB 1NPD) and the Montreal-Kingston BacterialStructural Genomics Initiative (PDB 1O9B)

Conclusion

This structural genomics project is an opportunity todevelop new approaches for the various stages ofgene expression The use of the lsquoone tube reactionrsquoGATEWAY technology allowed the successful clon-ing of 972 of the selected ORFs The developmentof both in vitro and in vivo expression screening withthe efficient sampling of the various parameters influ-encing soluble protein expression allowed 80 of thetested ORFs to be expressed as soluble proteins (anadditional 18 being expressed as inclusion bodies)The automation of the screening using dot-blot rev-elation now allows us to process 12 targets weekly

In our hands the purification step is not a limitingstep since 100 of the 46 proteins that have reachedthat stage were successfully purified However it wasmore challenging to keep them intact and soluble dur-ing the following steps of concentration and bufferexchange required for crystallization The introduc-tion of buffer screening through dialysis controlled byUV scanning has proven very effective allowing theidentification of a suitable buffer for 95 of the pu-rified samples

Finally the quality control we imposed through theentire project albeit time consuming has proven use-ful to trace mistakes such as target mixing andorcross-contamination as well as protein degradationOn a more positive side it also provided informationon unexpected proteinprotein interactions andorligand binding eventually leading to improved crys-tallization protocols To this days 40 of the 46purified recombinant proteins have been throughbiophysical characterization (circular dichroiumlsm spec-troscopy and monodispersity assays using dynamiclight scattering) 32 have entered crystallization trialsand 31 have been sequenced Twelve proteins haveproduced crystals out of which 9 have resulted in us-

155

Figure 5 Protein structures solved in this study The structures are drawn using MOLSCRIPT [31] The YdiB structure was also solved bythe consortium

156

able diffraction data allowing 7 structures to besolved (Figure 5) This number is now expected togrow rapidly as more expressed gene products arereaching the purificationcrystallization stages

Acknowledgements

We thank our collaborators from Aventis Pharma DrsV Blanc E Remy M Black and J Hogdson for theirscientific input We also thank Drs J-L Ferrer FBorel P Carpentier A Thompson G Leonard and JMcCarthy for their help on the FIP BM14 ID29ID14 ESRF beamlines We thank P Pham-Trong(Greiner) for his help and gift of materials Thisproject is funded by a grant (014906055) from theFrench Ministry of Industry lsquopost genomics researchprogramrsquo

References

1 Stover CK Pham XQ Erwin AL Mizoguchi SDWarrener P Hickey MJ Brinkman FS Hufnagle WOKowalik DJ Lagrou M et al (2000) Nature 406959minus964

2 Cassell GH and Mekalanos J JAMA 285 (2001)601minus605

3 Projan S (2002) Curr Opin Pharmacol 2 513minus5224 Clements JM Beckett RP Brown A Catlin G Lobell

M Palan S Thomas W Whittaker M Wood S SalamaS Baker PJ Rodgers HF Barynin V Rice DW andHunter MG (2001) Antimicrob Agents Chemother 45563minus570

5 Barbachyn MR Brickner SJ Gadwood RC GarmonSA Grega KC Hutchinson DK Munesada K Reis-cher RJ Taniguchi M Thomasco LM Toops DS Ya-mada H Ford CW and Zurenko GE (1998) Adv ExpMed Biol 456 219minus238

6 Johnson AP (2001) Curr Opin Investig Drugs121691minus1701

7 Auckland C Teare L Cooke F Kaufmann ME WarnerM Jones G Bamford K Ayles H Johnson AP (2002)J Antimicrob Chemother 50 743minus746

8 Claverie JM Nature 2000 403 12minus12

9 Altschul SF Madden TL Schaffer AA Zhang JZhang Z Miller W and Lipman DJ (1997) Nucleic AcidsRes 25 3389minus3402

10 Rudd KE (2000) Nucleic Acids Res 28 60minus6411 Persson B and Argos P (1997) J Protein Chem 16

453minus45712 Hartley JL Temple GF and Brasch MA (2000) Genome

Res 10 1788minus179513 Monchois V Vincentelli R Deregnaucourt C Abergel C

and Claverie J-M (2002) In Cell-Free Translation Systems(Ed Spirin AS) Springer Verlag New York pp 197minus202

14 Audic S Lopez F Claverie J-M Poirot O and AbergelC (1997) Proteins 29 252minus257

15 Monchois V Abergel C Sturgis J Jeudy S and ClaverieJ-M (2001) J Biol Chem 276 18437minus18441

16 Bhat TN Bourne P Feng Z Gilliland G Jain S Rav-ichandran V Schneider BSchneider K Thanki N Weis-sig H Westbrook J and Berman HM (2001) Nucleic Ac-ids Res 29 214minus218

17 Navaza J (2001) Acta Crystallogr D57 1367minus137218 Liu Y Ogata CM and HendricksonWA (2001) Proc Natl

Acad Sci U S A 98 10648minus1065319 Hendrickson WA Horton JR and LeMaster DM (1990)

EMBO J 9 1665minus167220 Abergel C Bouveret E Claverie JM Brown K Riga

A Lazdunski C and Benedetti H (1999) Structure FoldDes 71291minus1300

21 Notredame C Higgins DG Heringa J (2000) J MolBiol 302 205minus217

22 Falquet L Pagni M Bucher P Hulo N Sigrist CJHofmann K and Bairoch A (2002) Nucleic Acids Res 30235minus238

23 Notredame C and Abergel C (2003) In Bioinformatics andGenomes (Andrade MA ed) Horizon Scientific Press Wy-mondhamUK pp 27minus49

24 Abergel C Robertson DL Claverie J-M (1999) J Virol73 751minus753

25 Marcotte EM Pellegrini M Thompson MJ Yeates TOand Eisenberg D (1999) Nature 402 83minus86

26 Enault F Suhre K Poirot O Abergel C and ClaverieJ-M (2003) Nucleic Acids Res In press

27 Madabushi S Yao H Marsh M Kristensen DM Phil-ippi A Sowa ME and Lichtarge O (2002) J Mol Biol316 139minus154

28 Klebe G (2000) J Mol Med 78 269minus28129 Apostolakis J and Caflisch A (1999) Comb Chem High

Throughput Screen 2 91minus10430 Sali A and Blundell T L (1993) J Mol Biol 234

779minus81531 Kraulis PJ (1991) J Appl Crystallogr 24 946minus950

157

Page 4: Structural genomics of highly conserved microbial genes of ...metabolomics.helmholtz-muenchen.de/suhre/... · By focusing a structural genomics project on such anonymous proteins,

automated platform allowing us to screen the expres-sion of 12 ORFs in parallel The cloned ORFs weretransformed in four E coli expression strainsBL21 (DE3) pLysS Rosetta (DE3) Origami (DE3)(Novagen) and C41 (Avidis) For each transformedstrain a colony was picked for overnight preculturein 2 ml 2YT with the suitable antibiotic The Multi-probe II automated platform dispensed on 24 wellsplates 25 ml of one of the 3 selected media (2YT(Difco) Superior Broth (SB) and Turbo Broth (TB)(Athena Enzyme Systems)) ndash each with the suitableantibiotic ndash and inoculated them with 25 precul-tures Culture plates were then incubated at 37 degCwith 250 rpm agitation for one hour before beingtemperature regulated at one of the 3 defined temper-atures (25 degC 37 degC and 42 degC) Induction wasperformed on the Packard platform with 05 mMIPTG when OD600nm reached 04ndash08 Cultures werestopped 3 hours after induction The measuredOD600nm value is used to normalize the amount ofbacteria used for protein quantification The 12 result-ing aliquots were then dispensed in a 96 well plateand centrifuged 10 minutes at 4000 rpm Pellets wereresuspended in 250 microl BugBuster (Novagen) 01 mgml lysozyme and 01 mgml DnaseI (Euromedex) 30min 4 degC under agitation 100 microl of each bacteriallysate was used for estimating the total amount of ex-pressed protein and 150 microl was used to assay thelevel of soluble expression

Assessment of soluble expression

A clearing filter plate (MultiScreen NucleicA fromMillipore) a 022 microm hydrophilic filter plate (Multi-Screen GV from Millipore) and a 96 well plate wereassembled The lysates were dispensed on the firstplate and filtered in one step with one minute centri-fugation at 4000 rpm Lysates were conserved atndash20 degC The detection of recombinant protein wasperformed using a Dot Blot procedure 4 microl of boththe soluble and total protein was resuspended in100 microl of a denaturant buffer (urea 8 M SodiumPhosphate 100 mM Tris-HCl 10 MM pH 8) 25 microl ofthis mix were transferred onto nitrocellulose mem-brane Protan 45 (Schleicher and Schuell) using a blot-ter adapted for the Multiprobe station The membranewas washed twice in TBS Buffer (Tris-HCl 10 mMNaCl 150 mM pH 75) blocked in blocking reagent1X (Qiagen) with 01 Tweenminus20 for 1 hour washed2 times 10 minutes in TBS-TT buffer (Tris HCl20 mM NaCl 500 mM 002 Tweenminus20 02

Triton Xminus100) and washed once 10 minutes in TBSbefore incubation with a penta-His HRP conjugatedantibody (Qiagen) diluted 12000 in blocking reagent1X with 01 Tweenminus20 for 1 hour Subsequentwashes were performed twice for 15 minutes in TBS-TT and once 10 minutes in TBS His-tagged proteinexpression was revealed by detecting HorseRadishPeroxidase activity conjugated to the anti-His5 anti-body using chemiluminescence with ECL RPN2106kit (Amersham Biosciences) on Biomax films(Kodak) Each of the 12 conditions received a scoreaccording to the Dot Blot spot intensity The influenceof each variable on the soluble expression was thenextracted by adding the scores of the experimentssharing the same variable value and comparing thistotal score to the others For instance if we considerthe variable lsquotemperaturersquo the sum of the scores ofthe four experiments performed at 37 degC is comparedto the sum of the scores of the four experiments doneat 25 degC and at 42 degC In the case these total scoresare similar temperature is not considered as a signif-icant factor for protein expression and solubility Ifthe scores are different temperature is retained as apertinent variable The temperature corresponding tothe highest total score was then used for the optimizedexpression protocol

Protein purification and characterization

Protein purifications were performed using the AKTAumlexplorer 10S (Amersham-Pharmacia) using standardaffinity chromatography on HiTrap chelating (Amer-sham-Pharmacia) or NiNTA (Qiagen) columnscharged with Ni2 eluted with an imidazole gradi-ent after removal of non-specific low interactingproteins by a preliminary wash with 50mM imidazolebuffer The recombinant proteins were usually elutedwithin a 70minus150 mM imidazole range A detailed pro-tocol has been published elsewhere [15]

A microdialysis step is applied to each protein toidentify a suitable buffer for desalting and concentra-tion We use 250 microl Dialysis Buttons (HamptonResearch Calafornia USA) with MWCO 10000SpectraPor CE membrane ( Spectrum Labs Califor-nia) 250 microl of each sample are loaded into 6 dialysisbuttons and dialyzed overnight at 4 degC on a rotatingwheel against 50 ml of 6 different buffers in the pres-ence or absence of ionic strength (NaCl 02 M) Weused the following buffers Tris 10 mM pH (7minus9)Mes 10 mM pH (55minus67) Mops 20 mM pH(69minus83) Hepes 10 mM pH (68minus82) Imidazole

144

10 mM pH (62minus78) and Sodium Acetate 10 mM pH(36minus56) The various buffers were adjusted to atleast one pH unit above or below the pI of each pro-tein To estimate the amount of precipitated versussoluble protein 125 microl of each sample were loadedonto a 96 microtitre plate for subsequent UV scanning(DO280total) The remaining 125 microl were centrifugedat 12000 rpm for 20 minutes and the supernatantloaded onto a 96 microtitre plate for soluble proteinevaluation (DO280soluble) The comparison of thetwo UV scans using the respective buffer as the blankwith the KC4 software on a microquant spectrophotom-eter (Bio-tek instruments inc) was used to select theoptimal desalting buffer as the one corresponding tothe smallest (DO280total ndash DO280soluble) The buff-ers were then exchanged using the Hiprep 2610desalting column (Amersham-pharmacia) on theAKTAuml explorer 10S

Quality control of the samples

Before entering crystallization trials purified recom-binant proteins were first characterized using circulardichroiumlsm spectroscopy to assess the folding status ofthe expression products and its consistency withsecondary structure predictions [11] Monodispersitybeing a key factor in the crystallization process theaggregation state of the protein solution was system-atically monitored by dynamics light scattering (Dy-napro MS800-TC) These measurements were alsoused to define the most suitable buffer in which toperform concentration phase diagrams and crystalli-zation trials Each sample was also controlled oniso-electro-focusing native and SDS gels both in theabsence and presence of reducing agent and 2 M ureaThe results were compared with the theoreticalmolecular weight of the target as well as with itspredicted isoelectric point and cysteine content Thestability of the protein over time was assessed bystoring purified samples at 3 temperatures 20 degC4 degC and ndash80 degC Every two weeks they were con-trolled by electrophoresis on SDS gels to reveal aneventual degradation process

Finally the purified proteins were all characterizedby mass spectroscopy (MALDI-TOF VoyagerDE-RP Perceptive Biosystem) and by 30 cycles ofamino-acid N-terminal Edman sequencing (AppliedBiosystem 473A)

Crystallization

The screening for crystallization conditions was per-formed on 3x96-well crystallization plates (Greiner)loaded by an 8-needle dispensing robot (Tecan WS1008 workstation modified for our needs) using one1-microl sitting drop per condition Each protein wasinitially tested against 480 different conditions at aconcentration determined by the phase diagramanalysis The tested crystallization conditions includein-house designed [14] and commercially availablesolution sets (CrystalScreen-Hampton researchWizards-Emerald BioStructures) After analyzing theresults of this initial screen conditions were refinedusing the incomplete factorial design approach [14and ref herein]

Data collection and structure determination

Diffraction data for proteins with structural homo-logues in the PDB [16] are interpreted using themolecular replacement procedure using the AMoResoftware [17] Alternatively the MultiwavelengthAnomalous Diffraction (MAD) technique [18 and refherein] was used for interpreting the diffraction dataof protein with no obvious structural homologuesusing crystals of the seleno-methionine substitutedproteins [19] Detailed protocols can be found in [20]

Results and discussion

Target selection

Because of the detrimental effect of their mutationsgenes belonging to essential pathways exhibit lowevolution rates Moreover the reliable identificationndash using sequence similarity ndash of homologous genesin very distant organisms is only possible for slowlyevolving genes Thus one expect the subsets of ubiq-uitous or lsquowide-spectrumrsquo genes to be enriched in es-sential genes for the same panel of microorganisms

The first step of the target selection process thusconsisted into a comprehensive cross-comparison ofa large panel of complete genomes from Gram-negative and Gram-positive bacteria (separated by 2billions years of divergent evolution) to identify themost conserved genes among those with clear homo-logues both in E coli and at least one Gram-positivebacteria This first step using a BlastP score thresh-old of 150 (with Blosum62) resulted into a set of

145

1300 different homologous ORF sets each with a dif-ferent representative in E coli For each of theseE coli ORFs a conservation index was computed as

CI all

BestScore

EcoliSelfScore

Where EcoliSelfScore is the BlastP [9] score of eachEcoli ORF aligned with itself and BestScore corre-sponds to the score of the best matching ORFs withineach genome (counting 0 for matches scoring belowthe threshold) the sum being performed over alltested genomes The CI value is thus quantifying thesequence conservation of the gene (and its presence)across a large number of evolutionary diverse micro-bial genomes including important pathogens For theinitial 1300 selected genes CI values ranged from225 (for the elongation factor subunit tufB) to 14 forthe putative sugar hydrolase ybgG gene (Figure 1)

For all 1300 selected genes (as well as for all 4222E coli genes) a number of properties were precom-puted and stored in a master database (lsquoMaster DBrsquo)In this database each gene is represented by a spe-cific interactive identity card (Figure 2) The ID cardcontains the gene nucleotide and amino-acid se-quence its precise position in the E coli genome itssignificant protein matches within the Reference Ge-nome set some theoretical properties of its proteinproduct (molecular weight pI extinction coefficient

number of cysteines methionines and chargedresidues) Each gene ID card also gives access to ad-ditional features such as a restriction map of the genethe amino-acid translations derived from alternativereading frames (for error checking) a map of the rarecodons as well as the protein hydropathy profile pre-dicted signal peptide membrane spanning segmentsand secondary structure The lsquoBlastrsquo button (Figure 2)gives access to interactive similarity searches againstadditional genomes listed in Table 1 (lsquoOther Ge-nomesrsquo) This set includes the complete genomes ofmicroorganisms of lesser biomedical relevance (eghyperthermophilic archebacteria) or interesting bacte-ria the genomic sequence of which is nearing com-pletion This list is submitted to periodical changesThe lsquoextended alignmentrsquo button generate a multiplealignment (using the T-coffee [21] program) of the Ecoli ORF with the homologous ORFs identified in theother bacterial genomes The lsquoPdb alignmentrsquo buttondoes the same with homologues of known 3-D struc-tures Master DB and the results of these varioussequence analyses are frequently consulted to guidethe subsequent experimental steps of the project suchas cloning gene expression protein purification andcrystallization trials

The set of 1300 well-conserved gene was furtherreduced to 317 by only retaining the genes of un-known or hypothetical function The distribution ofconservation index (CI values) in this subset rangesfrom 172 to 14 Figure (3A) presents a graph ofknown vs unknown genes ranked according to theirevolutionary conservation (CI value) This graph in-dicates that the most conserved genes (likely to cor-respond to essential pathways) have been historicallycharacterized in priority However it is comforting tosee that a fraction of well-characterized genes arefound at lower CI values where genes of yet unknownfunctions begin to dominate the distribution Thisshows that our protocol of ORF selection based onthe evolutionary conservation across Gram-positiveGram-negative genomes does actually converge ontoactual genes

The set of 317 highly conserved anonymous geneswas further reduced to enhance our probability of suc-cess in the experimental phase of the project that isthe over expression and determination of the 3-Dstructure of the largest possible number of these Ecoli gene products Although the structural determi-nation of membrane-bound proteins will eventuallybe attempted later on (eg by expressing the ORF asseparate domains) we chose to first focus on the

Figure 1 Mandelbrot plot (rank vs score) of the initial 1300 pu-tative target genes The genes (X-axis) are ranked according to theirassociated CI value (Y-axis) The first 200 genes with CI value gt10are ubiquitous genes involved in the most essential cellular path-ways The rest of the distribution is smoothly decreasing suggest-ing a fairly homogeneous selection pressure among all these can-didate target genes Genes of unknown functions are in red geneof known functions are in green

146

easier soluble targets (the so-called lsquolow hangingfruitsrsquo) A prediction of membrane spanning segments[11] was then performed to determine which of theabove 317 genes were most likely to be expressed assoluble globular protein This final bioinformatic fil-ter resulted into a final set of 221 candidate targetgenes The distribution of CI values within this geneset does not exhibit any significant bias ranging from172 to 14 (Figure 3B) Out of the 221 correspond-ing proteins only 17 exhibit a strong sequence simi-larity (defined as gt40 identity over a segment of atleast 130 residues) to entries in the PDB Our 221-candidate gene set is thus likely to correspond to asizable number of interesting new structures and pos-sibly some original folds 110 of these target geneswere attributed to the AFMB laboratory and are notfurther described

Additional bioinformatic analyses

Additional sequence analyses were performed at thevarious stage of the experimental process in particu-lar for genes and proteins for which difficulties wereencountered For instance putative cofactor-bindingmotifs were searched to help expressing purifying orcrystallizing recalcitrant proteins This was performedby combining the recognition of usual sequence sig-natures (eg Prosite motif [22]) with an analysis oftheir conservation within multiple alignments

Multiple alignment is also useful to identify theproper N-terminus of the ORF (for cloning purpose)or delineating detrimental flexible N- or C-terminalsegments of the candidate proteins An analysis of theconfidence measure at the heart of the T-coffee mul-tiple alignment program [21] indicated a correlationbetween the least reliably aligned protein segmentsthe regions of higher sequence variability and thoseof greater flexibility [23] As we work with well-con-servedubiquitous genes the situation is also optimalto take advantage of multiple alignments to identifykey-residues in these lsquoanonymousrsquo proteins and usethe conserved patterns to link them with protein fam-ilies of known functions despite a lack of overallsignificant sequence similarity (see 24 as an exam-ple)

Phylogenomics analyses aiming at identifyingsubset of genes co-segregating or co-evolving acrossmultiple bacterial genomes are also extensively usedto predict protein-protein interaction [25 26] eventu-

ally leading to the design of co-expression co-purifi-cation or co-crystallization experiments Phyloge-nomics is used to suggest links between ORFs ofunknown functions and identified pathways [25]These hints might help the functional interpretation ofthe 3-D structures of our anonymous targets Phylo-genomics analyses are performed with our in-housesoftware PhyDBac [26 httpigs-servercnrs-mrsfrphydbac]

On the negative side the priority of candidategenes was sometimes downgraded due to their ab-sence in important pathogens with reduced genomes(such as H influenzae or Mycoplasma) or for beingtoo similar to their human homologues Ideally onlybacteria-specific biochemical pathways should betargeted for the development of antibiotics (eg peni-cillin blocking peptidoglycan biosynthesis) Howeverthis constraint is probably not acceptable on the longrun given the constant need for new antibiotics andthe limited number of strictly bacterial specific tar-gets A good counter example is the success of thequinolones that are targeting DNA gyrase one of themost conserved proteins (26 identical amino acidsbetween human and E coli)

Finally the amino acid conservation patterns de-rived from multiple alignments become even moreinformative when analyzed in the context of a repre-sentative 3-D structure (from the E coli gene prod-uct) For instance the structural (hydrophobic core orstabilizing residues) vs functional (eg catalyticresidues) nature of the conserved positions can be as-sessed (as illustrated in [20]) Paradoxical features ndashsuch as well-exposed surface loops exhibiting strongresidue conservation ndash are highly informative for thefunctional interpretation of anonymous 3-D struc-tures Protein cavities lined with conserved residuesare also good candidates for lsquoactive sitesrsquo Previouslyunnoticed (ie misaligned) conserved isolated resi-dues (dispersed in the sequence) might be found closetogether in the 3-D structure and suggest the presenceof a catalytic center The spatial distribution of con-served residues can be exhaustively searched withinproteins of known function eventually resulting inthe identification of catalytic sites [27] Finally thedrugability of the regions of the protein pointed outby these analyses can be assessed and large librariesof putative ligands can be computationally screened[28 29]

147

Gene expression and protein purification pipelineFirst pass

The general principle governing our large-scale pro-tein expression approach is to quickly identify thelsquoeasyrsquo ORFs by passing our whole set of 111 througha common first pass standard protocol Further ex-pression trials are then performed on the lsquorecalcitrantrsquoproteins These trials involve lowering the tempera-ture trying other E coli strains adding predictedco-factors etc After about 17 months the first passhas been completed on the 111 targeted ORFs Theresults are as follows 108 ORFs have been success-fully cloned 104 have been tested 89 have exhibiteda detectable expression (either in vivo or using theRoche cell free system) 54 in a soluble form in a firstpass screening In parallel our industrial partner

(Infectious Disease Division Aventis Pharma) is as-sessing the essentiality of these putative target genesin major pathogens using proprietary approaches

Second pass addressing the bottleneck of solubleexpression

The above results identified the expression of the tar-gets in a soluble form as a major bottleneck A sec-ond pass protocol was thus designed using a statisticalscreening of various variables in order to determinethe best conditions for soluble expression We firstdetermined that the three main variables influencingprotein solubility andor expression yield were 1) thebacterial strain 2) the culture temperature and 3) thegrowth medium A set of 12 conditions correspond-ing to the combination of these 3 variables was thus

Figure 2 The yqhE gene identity card in Master DB The various buttons (top left column) provide interactive access to pre-computedadditional information as well as to lsquoon the flyrsquo sequence analysis

148

defined using an incomplete factorial experimentaldesign (see Materials and Methods) A typical exam-ple of such a screen applied to 6 different targets isgiven in Figure 4

At this point out of the 108 ORFs cloned 94 havebeen passed through this screening 93 have been suc-cessfully expressed including 68 in a soluble formThe incomplete factorial approach demonstrated itspower since out of the 94 ORFs screened only 1 wasnot expressed and 25 failed the soluble expressionassay This systematic screening allowed 723 of thetargets to be recovered in a soluble form This is tobe compared to the 52 yield obtained through thefirst pass standard protocol

The scale up bottleneck

For a certain number of proteins we observed thatdirectly scaling up the culture from 4 ml to 1 litercould lead to insoluble expression This problem wasusually solved by simply replacing the 1-liter cultureby several 200-ml cultures

Protein purification

Protein purifications are not yet performed through anautomatic procedure The buffer most suitable forprotein concentration is determined through dialysisexperiments of the purified material monitored byUV spectra and taking into account the theoreticalisoelectric point of each protein The systematiccheck for N-terminal sequence and mass of each pu-rified protein proved extremely useful for debunkingvarious problems (including the mishandling ofsamples) We noticed frequent cases (1531) of spon-taneous removal of the extended His-tagged N-termi-nus inherent to the use of the GATEWAY system (17to 24 first residues removed) The final statistics onthis observation will only be available once all solu-ble proteins will have been purified and sequenced

A prerequisite to successful crystallization wasalso the determination of appropriate storage condi-tions for each purified targets using gel electrophore-sis over time (see Materials and Methods) Forunstable proteins crystallization trials were per-

Figure 2 (Continued)

149

Figure 3 (A) Distribution of known vs unknown genes according to their evolutionary conservation The 1300 most-conserved E coli genesare ranked according to their decreasing CI value (x-axis in parenthesis) and grouped in successive boxes of 50 (Gene rank) For each boxthe number of genes of unknown vs known function are shown in red or green respectively Functionally characterized genes dominate thedistribution for the high CI values but many anonymous genes also exhibit comparable evolutionary conservations(B) Distribution of predicted membrane vs soluble proteins according to their evolutionary conservation (decreasing CI value x-axis inparenthesis) The 317 highly conserved ORFs of unknown function have been divided into membrane-bound (green boxes) vs soluble (redboxes) after sequence analysis There is no overall bias although proteins predicted soluble dominate the distribution of the most conservedcandidates

150

formed on freshly purified samples or using materialstored at minus80 degC

Screening of alternative cloning contexts usinglinear PCR products

To quickly explore alternative cloning contexts (Tagtype and position) for recalcitrant targets we devel-oped an in vitro expression screening based on lineartemplates involving a two steps PCR and universalprimers Tested on 24 different E coli ORFs the lin-ear templates mimicking pDEST17 vector context(N-terminal His-tag) performed as efficiently as thecorresponding circular plasmid This approach canthus be applied to modify the standart pDEST17 ORFcontext into a tagtarget combination more suitablefor soluble expression Possible alternatives includeC-terminal His-tag GST MBP Thioredoxin Streptag and GFP

Overview of the structures obtained to date

YecD molecular replacement using homologymodelingThe interpretation of yecD diffraction data (collectedon the ID29 beam line at the ESRF synchrotron inGrenoble-France) was first attempted by molecularreplacement using AmoRe [17] with the 1NBA and

1IM5 PDB entries These proteins share less than25 identical residues with the yecD sequence andfailed to produce a solution We then build a yecDstructure model with MODELER [30] using theknown structures as templates The yecD model wasused for molecular replacement and produced a goodsolution with 4 molecules per asymmetric unit TheyecD structure was then refined to 13 Aring (PDB 1J2R)YecD and 1NBA are annotated in Swiss-Prot asbelonging to the isochorismatase protein familyHowever the superposition of active site region of theyecD 1NBA and 1IM5 structures reveals that acysteine residue essential to the isochorismatase en-zymatic activity is replaced by a glycine in yecD Inaddition yecD lacks a cis-peptide conserved in 1NBAand 1IM5 It is thus likely that yecD has a differentenzymatic activity Detailed sequence analyses arebeing performed to predict its function

YggV a nucleoside triphosphate phosphohydrolaseThe yggV structure was solved using the MADtechnique with one seleno-methionine Data were col-lected on the BM30 beam line at ESRF The structurewas refined to 18 Aring and was found to be stronglysimilar to PDB entries 2MJP (Nucleoside triphos-phate phosphohydrolase from Methanocaldococcusjannaschii) and 1EX2 (MAF protein from Bacillussubtilis) YggV crystals were thus grown in presenceof dGTP They diffract to 24 Aring resolution The struc-ture of yggV was also solved in the absence of dGTPby the Midwest Center for Structural Genomics (PDB1K7K) A multiple alignment of the homologous se-quences is currently analyzed in the context of thevarious available structures (free forms and com-plexed with nucleotides) Our phylogenomic analysespointed out putative partners to yggV suggesting thatits function in E coli might not be entirely under-stood

YhbO an intracellular proteaseThe yhbO structure was solved by molecular replace-ment using AmoRe with the PDB structure 1G2I (anintracellular protease from Pyrococcus horikoshii)The yhbO structure was subsequently refined to 20Aring resolution A nucleophile elbow motif as well as acysteine residue essential for the protease activity ofthe P horikoshii intracellular protease is conserved inyhbO The multimeric state of the 1G2I structure wasalso described as essential for the catalytic activitythe active site involving residues contributed by twomonomers Interestingly the interface between the

Table 2 Experimental matrix used for expression screening Thefirst column is the experiment number The second column corre-sponds to the 4-state strain variable BL21(DE3) pLysS RosettaOrigami and C41 The third column corresponds to the 3-stategrowth medium variable 2YT SB and TB media The fourth col-umn corresponds to the 3-state temperature variable 25 degC 37 degCand 42 degC

Condition

number

Strain Medium Temperature

1 Bl21 (DE3) pLysS SB 25

2 Rosetta (DE3) TB 25

3 Origami (DE3) 2YT 25

4 C41 (DE3) SB 25

5 Bl21 (DE3) pLysS TB 37

6 Rosetta (DE3) SB 37

7 Origami (DE3) TB 37

8 C41 (DE3) 2YT 37

9 Bl21 (DE3) pLysS 2YT 42

10 Rosetta (DE3) 2YT 42

11 Origami (DE3) SB 42

12 C41 (DE3) TB 42

151

two yhbO monomers as seen in the crystal structureis entirely different from the one found in the 1G2Istructure The functional assignment of the E coliyhbO protein is thus not yet certain The putativeprotease activity of yhbO is currently studied

YqhE significant difference in the active sitesThe yqhE structure was solved by molecular replace-ment using AmoRe with the PDB entry 1A80 ( a 25-diketo-D-gluconic acid reductase A structure fromCorynebacterium) The yqhE structure was then

Figure 4 Dot Blot results of the expression screening on 6 different targets The top 6 Dot blots correspond to the total protein expressionresults The 6 lower Dot blots correspond to the soluble protein expression measurements The numbers above each Dot Blot refer to thetargets Dot Blot results are ranked by decreasing temperature (left to right) and the strains indicated on the right The growth medium isindicated on the left of each Dot Blot

152

Table 3 Project status List and status of our targets as deposited in targetdb (httptargetdbpdborg) under the project name SGI-SG (BIGS)The first column indicates the target name (according to EcoGene [10]) the second column indicates the relevant genome The status (fromlsquoSelectedrsquo to lsquoCrystal Structure) of each target protein is shown in pink For each target arrows are used to indicates the next step to becompleted stars correspond to targets currently stopped and squares correspond to lsquoon holdrsquo targets (failure during the scale-up or crystal-lization processes) pending further testing

153

Table 3 (Continued)

154

refined to 193 Aring resolution (PDB 1MZR) YqhEbelongs to the vast NADPH-dependent aldo-ketoreductase family Multiple alignment was used tocompare the various sequences of the family Thisrevealed two sub-families including all eukaryoticsequences on one hand and all the prokaryoticsequences on the other hand Structurally the eukary-otic proteins differ from the prokaryotic ones by thepresence of much longer loops in the co-factor andligand binding regions These marked structural dif-ferences should make possible the design of smallinhibitory molecule with a good specificity for theprokaryotic targets

YdhF an NADPH-dependent oxidoreductase ofunknown specificityThe ydhF structure was solved using the MAD tech-nique with 8 seleno-methionines The structure wasthen refined to 25 Aring resolution using data collectedon the BM30 beam line at ESRF YdhF belongs to theNADPH-dependent oxidoreductase family Its struc-ture was compared to the other members of the familyusing multiple alignment YqhE shares 267 identi-cal residues over a 217-aa segment with its closeststructural homologue ydhF YdhF exhibits a se-quence insertion in a flexible loop not well resolvedin the structure This difference is not located near theco-factors and ligand binding sites but may still af-fects the enzyme specificity The structure of theNADP containing ydhF is currently under refinementusing data to 28 Aring resolution

YliB a putative n-peptide binding proteinThe yliB structure was solved using the MAD tech-nique with 20 seleno-methionines The structure wasthen refined to 23 Aring resolution using data collectedon the ID14-EH2 beam line at ESRF This membrane-anchored periplasmic protein shares 29 identicalresidues with 1DPE (a dipeptide binding protein fromE coli) and 1DPP (the same protein in complex withglycyl-L-leucine) The loops surrounding the ligand-binding pocket are shorter in the yliB structure com-pared to the homologous structures suggesting thatlarger peptides might be accommodated by yliB Wealso noticed the presence of zinc ion and of some re-sidual electron density in the region homologous tothe 1DPP dipeptide binding site We are currently pro-ducing crystals of yliB in the presence of variousdipeptide in order to verify its affinity for such sub-strates

YdiB a new structural foldThe ydiB structure was solved using the MAD tech-nique with 20 seleno-methionines The structurebuilding is still in progress using data collected at266 Aring resolution on the BM30 beam line at ESRFThe preliminary tracing suggests that the ydiB struc-ture might be an original fold which is confirmed bythe deposition in the PDB of the ydiB structure by theNortheast Structural Genomics Research Consortium(PDB 1NPD) and the Montreal-Kingston BacterialStructural Genomics Initiative (PDB 1O9B)

Conclusion

This structural genomics project is an opportunity todevelop new approaches for the various stages ofgene expression The use of the lsquoone tube reactionrsquoGATEWAY technology allowed the successful clon-ing of 972 of the selected ORFs The developmentof both in vitro and in vivo expression screening withthe efficient sampling of the various parameters influ-encing soluble protein expression allowed 80 of thetested ORFs to be expressed as soluble proteins (anadditional 18 being expressed as inclusion bodies)The automation of the screening using dot-blot rev-elation now allows us to process 12 targets weekly

In our hands the purification step is not a limitingstep since 100 of the 46 proteins that have reachedthat stage were successfully purified However it wasmore challenging to keep them intact and soluble dur-ing the following steps of concentration and bufferexchange required for crystallization The introduc-tion of buffer screening through dialysis controlled byUV scanning has proven very effective allowing theidentification of a suitable buffer for 95 of the pu-rified samples

Finally the quality control we imposed through theentire project albeit time consuming has proven use-ful to trace mistakes such as target mixing andorcross-contamination as well as protein degradationOn a more positive side it also provided informationon unexpected proteinprotein interactions andorligand binding eventually leading to improved crys-tallization protocols To this days 40 of the 46purified recombinant proteins have been throughbiophysical characterization (circular dichroiumlsm spec-troscopy and monodispersity assays using dynamiclight scattering) 32 have entered crystallization trialsand 31 have been sequenced Twelve proteins haveproduced crystals out of which 9 have resulted in us-

155

Figure 5 Protein structures solved in this study The structures are drawn using MOLSCRIPT [31] The YdiB structure was also solved bythe consortium

156

able diffraction data allowing 7 structures to besolved (Figure 5) This number is now expected togrow rapidly as more expressed gene products arereaching the purificationcrystallization stages

Acknowledgements

We thank our collaborators from Aventis Pharma DrsV Blanc E Remy M Black and J Hogdson for theirscientific input We also thank Drs J-L Ferrer FBorel P Carpentier A Thompson G Leonard and JMcCarthy for their help on the FIP BM14 ID29ID14 ESRF beamlines We thank P Pham-Trong(Greiner) for his help and gift of materials Thisproject is funded by a grant (014906055) from theFrench Ministry of Industry lsquopost genomics researchprogramrsquo

References

1 Stover CK Pham XQ Erwin AL Mizoguchi SDWarrener P Hickey MJ Brinkman FS Hufnagle WOKowalik DJ Lagrou M et al (2000) Nature 406959minus964

2 Cassell GH and Mekalanos J JAMA 285 (2001)601minus605

3 Projan S (2002) Curr Opin Pharmacol 2 513minus5224 Clements JM Beckett RP Brown A Catlin G Lobell

M Palan S Thomas W Whittaker M Wood S SalamaS Baker PJ Rodgers HF Barynin V Rice DW andHunter MG (2001) Antimicrob Agents Chemother 45563minus570

5 Barbachyn MR Brickner SJ Gadwood RC GarmonSA Grega KC Hutchinson DK Munesada K Reis-cher RJ Taniguchi M Thomasco LM Toops DS Ya-mada H Ford CW and Zurenko GE (1998) Adv ExpMed Biol 456 219minus238

6 Johnson AP (2001) Curr Opin Investig Drugs121691minus1701

7 Auckland C Teare L Cooke F Kaufmann ME WarnerM Jones G Bamford K Ayles H Johnson AP (2002)J Antimicrob Chemother 50 743minus746

8 Claverie JM Nature 2000 403 12minus12

9 Altschul SF Madden TL Schaffer AA Zhang JZhang Z Miller W and Lipman DJ (1997) Nucleic AcidsRes 25 3389minus3402

10 Rudd KE (2000) Nucleic Acids Res 28 60minus6411 Persson B and Argos P (1997) J Protein Chem 16

453minus45712 Hartley JL Temple GF and Brasch MA (2000) Genome

Res 10 1788minus179513 Monchois V Vincentelli R Deregnaucourt C Abergel C

and Claverie J-M (2002) In Cell-Free Translation Systems(Ed Spirin AS) Springer Verlag New York pp 197minus202

14 Audic S Lopez F Claverie J-M Poirot O and AbergelC (1997) Proteins 29 252minus257

15 Monchois V Abergel C Sturgis J Jeudy S and ClaverieJ-M (2001) J Biol Chem 276 18437minus18441

16 Bhat TN Bourne P Feng Z Gilliland G Jain S Rav-ichandran V Schneider BSchneider K Thanki N Weis-sig H Westbrook J and Berman HM (2001) Nucleic Ac-ids Res 29 214minus218

17 Navaza J (2001) Acta Crystallogr D57 1367minus137218 Liu Y Ogata CM and HendricksonWA (2001) Proc Natl

Acad Sci U S A 98 10648minus1065319 Hendrickson WA Horton JR and LeMaster DM (1990)

EMBO J 9 1665minus167220 Abergel C Bouveret E Claverie JM Brown K Riga

A Lazdunski C and Benedetti H (1999) Structure FoldDes 71291minus1300

21 Notredame C Higgins DG Heringa J (2000) J MolBiol 302 205minus217

22 Falquet L Pagni M Bucher P Hulo N Sigrist CJHofmann K and Bairoch A (2002) Nucleic Acids Res 30235minus238

23 Notredame C and Abergel C (2003) In Bioinformatics andGenomes (Andrade MA ed) Horizon Scientific Press Wy-mondhamUK pp 27minus49

24 Abergel C Robertson DL Claverie J-M (1999) J Virol73 751minus753

25 Marcotte EM Pellegrini M Thompson MJ Yeates TOand Eisenberg D (1999) Nature 402 83minus86

26 Enault F Suhre K Poirot O Abergel C and ClaverieJ-M (2003) Nucleic Acids Res In press

27 Madabushi S Yao H Marsh M Kristensen DM Phil-ippi A Sowa ME and Lichtarge O (2002) J Mol Biol316 139minus154

28 Klebe G (2000) J Mol Med 78 269minus28129 Apostolakis J and Caflisch A (1999) Comb Chem High

Throughput Screen 2 91minus10430 Sali A and Blundell T L (1993) J Mol Biol 234

779minus81531 Kraulis PJ (1991) J Appl Crystallogr 24 946minus950

157

Page 5: Structural genomics of highly conserved microbial genes of ...metabolomics.helmholtz-muenchen.de/suhre/... · By focusing a structural genomics project on such anonymous proteins,

10 mM pH (62minus78) and Sodium Acetate 10 mM pH(36minus56) The various buffers were adjusted to atleast one pH unit above or below the pI of each pro-tein To estimate the amount of precipitated versussoluble protein 125 microl of each sample were loadedonto a 96 microtitre plate for subsequent UV scanning(DO280total) The remaining 125 microl were centrifugedat 12000 rpm for 20 minutes and the supernatantloaded onto a 96 microtitre plate for soluble proteinevaluation (DO280soluble) The comparison of thetwo UV scans using the respective buffer as the blankwith the KC4 software on a microquant spectrophotom-eter (Bio-tek instruments inc) was used to select theoptimal desalting buffer as the one corresponding tothe smallest (DO280total ndash DO280soluble) The buff-ers were then exchanged using the Hiprep 2610desalting column (Amersham-pharmacia) on theAKTAuml explorer 10S

Quality control of the samples

Before entering crystallization trials purified recom-binant proteins were first characterized using circulardichroiumlsm spectroscopy to assess the folding status ofthe expression products and its consistency withsecondary structure predictions [11] Monodispersitybeing a key factor in the crystallization process theaggregation state of the protein solution was system-atically monitored by dynamics light scattering (Dy-napro MS800-TC) These measurements were alsoused to define the most suitable buffer in which toperform concentration phase diagrams and crystalli-zation trials Each sample was also controlled oniso-electro-focusing native and SDS gels both in theabsence and presence of reducing agent and 2 M ureaThe results were compared with the theoreticalmolecular weight of the target as well as with itspredicted isoelectric point and cysteine content Thestability of the protein over time was assessed bystoring purified samples at 3 temperatures 20 degC4 degC and ndash80 degC Every two weeks they were con-trolled by electrophoresis on SDS gels to reveal aneventual degradation process

Finally the purified proteins were all characterizedby mass spectroscopy (MALDI-TOF VoyagerDE-RP Perceptive Biosystem) and by 30 cycles ofamino-acid N-terminal Edman sequencing (AppliedBiosystem 473A)

Crystallization

The screening for crystallization conditions was per-formed on 3x96-well crystallization plates (Greiner)loaded by an 8-needle dispensing robot (Tecan WS1008 workstation modified for our needs) using one1-microl sitting drop per condition Each protein wasinitially tested against 480 different conditions at aconcentration determined by the phase diagramanalysis The tested crystallization conditions includein-house designed [14] and commercially availablesolution sets (CrystalScreen-Hampton researchWizards-Emerald BioStructures) After analyzing theresults of this initial screen conditions were refinedusing the incomplete factorial design approach [14and ref herein]

Data collection and structure determination

Diffraction data for proteins with structural homo-logues in the PDB [16] are interpreted using themolecular replacement procedure using the AMoResoftware [17] Alternatively the MultiwavelengthAnomalous Diffraction (MAD) technique [18 and refherein] was used for interpreting the diffraction dataof protein with no obvious structural homologuesusing crystals of the seleno-methionine substitutedproteins [19] Detailed protocols can be found in [20]

Results and discussion

Target selection

Because of the detrimental effect of their mutationsgenes belonging to essential pathways exhibit lowevolution rates Moreover the reliable identificationndash using sequence similarity ndash of homologous genesin very distant organisms is only possible for slowlyevolving genes Thus one expect the subsets of ubiq-uitous or lsquowide-spectrumrsquo genes to be enriched in es-sential genes for the same panel of microorganisms

The first step of the target selection process thusconsisted into a comprehensive cross-comparison ofa large panel of complete genomes from Gram-negative and Gram-positive bacteria (separated by 2billions years of divergent evolution) to identify themost conserved genes among those with clear homo-logues both in E coli and at least one Gram-positivebacteria This first step using a BlastP score thresh-old of 150 (with Blosum62) resulted into a set of

145

1300 different homologous ORF sets each with a dif-ferent representative in E coli For each of theseE coli ORFs a conservation index was computed as

CI all

BestScore

EcoliSelfScore

Where EcoliSelfScore is the BlastP [9] score of eachEcoli ORF aligned with itself and BestScore corre-sponds to the score of the best matching ORFs withineach genome (counting 0 for matches scoring belowthe threshold) the sum being performed over alltested genomes The CI value is thus quantifying thesequence conservation of the gene (and its presence)across a large number of evolutionary diverse micro-bial genomes including important pathogens For theinitial 1300 selected genes CI values ranged from225 (for the elongation factor subunit tufB) to 14 forthe putative sugar hydrolase ybgG gene (Figure 1)

For all 1300 selected genes (as well as for all 4222E coli genes) a number of properties were precom-puted and stored in a master database (lsquoMaster DBrsquo)In this database each gene is represented by a spe-cific interactive identity card (Figure 2) The ID cardcontains the gene nucleotide and amino-acid se-quence its precise position in the E coli genome itssignificant protein matches within the Reference Ge-nome set some theoretical properties of its proteinproduct (molecular weight pI extinction coefficient

number of cysteines methionines and chargedresidues) Each gene ID card also gives access to ad-ditional features such as a restriction map of the genethe amino-acid translations derived from alternativereading frames (for error checking) a map of the rarecodons as well as the protein hydropathy profile pre-dicted signal peptide membrane spanning segmentsand secondary structure The lsquoBlastrsquo button (Figure 2)gives access to interactive similarity searches againstadditional genomes listed in Table 1 (lsquoOther Ge-nomesrsquo) This set includes the complete genomes ofmicroorganisms of lesser biomedical relevance (eghyperthermophilic archebacteria) or interesting bacte-ria the genomic sequence of which is nearing com-pletion This list is submitted to periodical changesThe lsquoextended alignmentrsquo button generate a multiplealignment (using the T-coffee [21] program) of the Ecoli ORF with the homologous ORFs identified in theother bacterial genomes The lsquoPdb alignmentrsquo buttondoes the same with homologues of known 3-D struc-tures Master DB and the results of these varioussequence analyses are frequently consulted to guidethe subsequent experimental steps of the project suchas cloning gene expression protein purification andcrystallization trials

The set of 1300 well-conserved gene was furtherreduced to 317 by only retaining the genes of un-known or hypothetical function The distribution ofconservation index (CI values) in this subset rangesfrom 172 to 14 Figure (3A) presents a graph ofknown vs unknown genes ranked according to theirevolutionary conservation (CI value) This graph in-dicates that the most conserved genes (likely to cor-respond to essential pathways) have been historicallycharacterized in priority However it is comforting tosee that a fraction of well-characterized genes arefound at lower CI values where genes of yet unknownfunctions begin to dominate the distribution Thisshows that our protocol of ORF selection based onthe evolutionary conservation across Gram-positiveGram-negative genomes does actually converge ontoactual genes

The set of 317 highly conserved anonymous geneswas further reduced to enhance our probability of suc-cess in the experimental phase of the project that isthe over expression and determination of the 3-Dstructure of the largest possible number of these Ecoli gene products Although the structural determi-nation of membrane-bound proteins will eventuallybe attempted later on (eg by expressing the ORF asseparate domains) we chose to first focus on the

Figure 1 Mandelbrot plot (rank vs score) of the initial 1300 pu-tative target genes The genes (X-axis) are ranked according to theirassociated CI value (Y-axis) The first 200 genes with CI value gt10are ubiquitous genes involved in the most essential cellular path-ways The rest of the distribution is smoothly decreasing suggest-ing a fairly homogeneous selection pressure among all these can-didate target genes Genes of unknown functions are in red geneof known functions are in green

146

easier soluble targets (the so-called lsquolow hangingfruitsrsquo) A prediction of membrane spanning segments[11] was then performed to determine which of theabove 317 genes were most likely to be expressed assoluble globular protein This final bioinformatic fil-ter resulted into a final set of 221 candidate targetgenes The distribution of CI values within this geneset does not exhibit any significant bias ranging from172 to 14 (Figure 3B) Out of the 221 correspond-ing proteins only 17 exhibit a strong sequence simi-larity (defined as gt40 identity over a segment of atleast 130 residues) to entries in the PDB Our 221-candidate gene set is thus likely to correspond to asizable number of interesting new structures and pos-sibly some original folds 110 of these target geneswere attributed to the AFMB laboratory and are notfurther described

Additional bioinformatic analyses

Additional sequence analyses were performed at thevarious stage of the experimental process in particu-lar for genes and proteins for which difficulties wereencountered For instance putative cofactor-bindingmotifs were searched to help expressing purifying orcrystallizing recalcitrant proteins This was performedby combining the recognition of usual sequence sig-natures (eg Prosite motif [22]) with an analysis oftheir conservation within multiple alignments

Multiple alignment is also useful to identify theproper N-terminus of the ORF (for cloning purpose)or delineating detrimental flexible N- or C-terminalsegments of the candidate proteins An analysis of theconfidence measure at the heart of the T-coffee mul-tiple alignment program [21] indicated a correlationbetween the least reliably aligned protein segmentsthe regions of higher sequence variability and thoseof greater flexibility [23] As we work with well-con-servedubiquitous genes the situation is also optimalto take advantage of multiple alignments to identifykey-residues in these lsquoanonymousrsquo proteins and usethe conserved patterns to link them with protein fam-ilies of known functions despite a lack of overallsignificant sequence similarity (see 24 as an exam-ple)

Phylogenomics analyses aiming at identifyingsubset of genes co-segregating or co-evolving acrossmultiple bacterial genomes are also extensively usedto predict protein-protein interaction [25 26] eventu-

ally leading to the design of co-expression co-purifi-cation or co-crystallization experiments Phyloge-nomics is used to suggest links between ORFs ofunknown functions and identified pathways [25]These hints might help the functional interpretation ofthe 3-D structures of our anonymous targets Phylo-genomics analyses are performed with our in-housesoftware PhyDBac [26 httpigs-servercnrs-mrsfrphydbac]

On the negative side the priority of candidategenes was sometimes downgraded due to their ab-sence in important pathogens with reduced genomes(such as H influenzae or Mycoplasma) or for beingtoo similar to their human homologues Ideally onlybacteria-specific biochemical pathways should betargeted for the development of antibiotics (eg peni-cillin blocking peptidoglycan biosynthesis) Howeverthis constraint is probably not acceptable on the longrun given the constant need for new antibiotics andthe limited number of strictly bacterial specific tar-gets A good counter example is the success of thequinolones that are targeting DNA gyrase one of themost conserved proteins (26 identical amino acidsbetween human and E coli)

Finally the amino acid conservation patterns de-rived from multiple alignments become even moreinformative when analyzed in the context of a repre-sentative 3-D structure (from the E coli gene prod-uct) For instance the structural (hydrophobic core orstabilizing residues) vs functional (eg catalyticresidues) nature of the conserved positions can be as-sessed (as illustrated in [20]) Paradoxical features ndashsuch as well-exposed surface loops exhibiting strongresidue conservation ndash are highly informative for thefunctional interpretation of anonymous 3-D struc-tures Protein cavities lined with conserved residuesare also good candidates for lsquoactive sitesrsquo Previouslyunnoticed (ie misaligned) conserved isolated resi-dues (dispersed in the sequence) might be found closetogether in the 3-D structure and suggest the presenceof a catalytic center The spatial distribution of con-served residues can be exhaustively searched withinproteins of known function eventually resulting inthe identification of catalytic sites [27] Finally thedrugability of the regions of the protein pointed outby these analyses can be assessed and large librariesof putative ligands can be computationally screened[28 29]

147

Gene expression and protein purification pipelineFirst pass

The general principle governing our large-scale pro-tein expression approach is to quickly identify thelsquoeasyrsquo ORFs by passing our whole set of 111 througha common first pass standard protocol Further ex-pression trials are then performed on the lsquorecalcitrantrsquoproteins These trials involve lowering the tempera-ture trying other E coli strains adding predictedco-factors etc After about 17 months the first passhas been completed on the 111 targeted ORFs Theresults are as follows 108 ORFs have been success-fully cloned 104 have been tested 89 have exhibiteda detectable expression (either in vivo or using theRoche cell free system) 54 in a soluble form in a firstpass screening In parallel our industrial partner

(Infectious Disease Division Aventis Pharma) is as-sessing the essentiality of these putative target genesin major pathogens using proprietary approaches

Second pass addressing the bottleneck of solubleexpression

The above results identified the expression of the tar-gets in a soluble form as a major bottleneck A sec-ond pass protocol was thus designed using a statisticalscreening of various variables in order to determinethe best conditions for soluble expression We firstdetermined that the three main variables influencingprotein solubility andor expression yield were 1) thebacterial strain 2) the culture temperature and 3) thegrowth medium A set of 12 conditions correspond-ing to the combination of these 3 variables was thus

Figure 2 The yqhE gene identity card in Master DB The various buttons (top left column) provide interactive access to pre-computedadditional information as well as to lsquoon the flyrsquo sequence analysis

148

defined using an incomplete factorial experimentaldesign (see Materials and Methods) A typical exam-ple of such a screen applied to 6 different targets isgiven in Figure 4

At this point out of the 108 ORFs cloned 94 havebeen passed through this screening 93 have been suc-cessfully expressed including 68 in a soluble formThe incomplete factorial approach demonstrated itspower since out of the 94 ORFs screened only 1 wasnot expressed and 25 failed the soluble expressionassay This systematic screening allowed 723 of thetargets to be recovered in a soluble form This is tobe compared to the 52 yield obtained through thefirst pass standard protocol

The scale up bottleneck

For a certain number of proteins we observed thatdirectly scaling up the culture from 4 ml to 1 litercould lead to insoluble expression This problem wasusually solved by simply replacing the 1-liter cultureby several 200-ml cultures

Protein purification

Protein purifications are not yet performed through anautomatic procedure The buffer most suitable forprotein concentration is determined through dialysisexperiments of the purified material monitored byUV spectra and taking into account the theoreticalisoelectric point of each protein The systematiccheck for N-terminal sequence and mass of each pu-rified protein proved extremely useful for debunkingvarious problems (including the mishandling ofsamples) We noticed frequent cases (1531) of spon-taneous removal of the extended His-tagged N-termi-nus inherent to the use of the GATEWAY system (17to 24 first residues removed) The final statistics onthis observation will only be available once all solu-ble proteins will have been purified and sequenced

A prerequisite to successful crystallization wasalso the determination of appropriate storage condi-tions for each purified targets using gel electrophore-sis over time (see Materials and Methods) Forunstable proteins crystallization trials were per-

Figure 2 (Continued)

149

Figure 3 (A) Distribution of known vs unknown genes according to their evolutionary conservation The 1300 most-conserved E coli genesare ranked according to their decreasing CI value (x-axis in parenthesis) and grouped in successive boxes of 50 (Gene rank) For each boxthe number of genes of unknown vs known function are shown in red or green respectively Functionally characterized genes dominate thedistribution for the high CI values but many anonymous genes also exhibit comparable evolutionary conservations(B) Distribution of predicted membrane vs soluble proteins according to their evolutionary conservation (decreasing CI value x-axis inparenthesis) The 317 highly conserved ORFs of unknown function have been divided into membrane-bound (green boxes) vs soluble (redboxes) after sequence analysis There is no overall bias although proteins predicted soluble dominate the distribution of the most conservedcandidates

150

formed on freshly purified samples or using materialstored at minus80 degC

Screening of alternative cloning contexts usinglinear PCR products

To quickly explore alternative cloning contexts (Tagtype and position) for recalcitrant targets we devel-oped an in vitro expression screening based on lineartemplates involving a two steps PCR and universalprimers Tested on 24 different E coli ORFs the lin-ear templates mimicking pDEST17 vector context(N-terminal His-tag) performed as efficiently as thecorresponding circular plasmid This approach canthus be applied to modify the standart pDEST17 ORFcontext into a tagtarget combination more suitablefor soluble expression Possible alternatives includeC-terminal His-tag GST MBP Thioredoxin Streptag and GFP

Overview of the structures obtained to date

YecD molecular replacement using homologymodelingThe interpretation of yecD diffraction data (collectedon the ID29 beam line at the ESRF synchrotron inGrenoble-France) was first attempted by molecularreplacement using AmoRe [17] with the 1NBA and

1IM5 PDB entries These proteins share less than25 identical residues with the yecD sequence andfailed to produce a solution We then build a yecDstructure model with MODELER [30] using theknown structures as templates The yecD model wasused for molecular replacement and produced a goodsolution with 4 molecules per asymmetric unit TheyecD structure was then refined to 13 Aring (PDB 1J2R)YecD and 1NBA are annotated in Swiss-Prot asbelonging to the isochorismatase protein familyHowever the superposition of active site region of theyecD 1NBA and 1IM5 structures reveals that acysteine residue essential to the isochorismatase en-zymatic activity is replaced by a glycine in yecD Inaddition yecD lacks a cis-peptide conserved in 1NBAand 1IM5 It is thus likely that yecD has a differentenzymatic activity Detailed sequence analyses arebeing performed to predict its function

YggV a nucleoside triphosphate phosphohydrolaseThe yggV structure was solved using the MADtechnique with one seleno-methionine Data were col-lected on the BM30 beam line at ESRF The structurewas refined to 18 Aring and was found to be stronglysimilar to PDB entries 2MJP (Nucleoside triphos-phate phosphohydrolase from Methanocaldococcusjannaschii) and 1EX2 (MAF protein from Bacillussubtilis) YggV crystals were thus grown in presenceof dGTP They diffract to 24 Aring resolution The struc-ture of yggV was also solved in the absence of dGTPby the Midwest Center for Structural Genomics (PDB1K7K) A multiple alignment of the homologous se-quences is currently analyzed in the context of thevarious available structures (free forms and com-plexed with nucleotides) Our phylogenomic analysespointed out putative partners to yggV suggesting thatits function in E coli might not be entirely under-stood

YhbO an intracellular proteaseThe yhbO structure was solved by molecular replace-ment using AmoRe with the PDB structure 1G2I (anintracellular protease from Pyrococcus horikoshii)The yhbO structure was subsequently refined to 20Aring resolution A nucleophile elbow motif as well as acysteine residue essential for the protease activity ofthe P horikoshii intracellular protease is conserved inyhbO The multimeric state of the 1G2I structure wasalso described as essential for the catalytic activitythe active site involving residues contributed by twomonomers Interestingly the interface between the

Table 2 Experimental matrix used for expression screening Thefirst column is the experiment number The second column corre-sponds to the 4-state strain variable BL21(DE3) pLysS RosettaOrigami and C41 The third column corresponds to the 3-stategrowth medium variable 2YT SB and TB media The fourth col-umn corresponds to the 3-state temperature variable 25 degC 37 degCand 42 degC

Condition

number

Strain Medium Temperature

1 Bl21 (DE3) pLysS SB 25

2 Rosetta (DE3) TB 25

3 Origami (DE3) 2YT 25

4 C41 (DE3) SB 25

5 Bl21 (DE3) pLysS TB 37

6 Rosetta (DE3) SB 37

7 Origami (DE3) TB 37

8 C41 (DE3) 2YT 37

9 Bl21 (DE3) pLysS 2YT 42

10 Rosetta (DE3) 2YT 42

11 Origami (DE3) SB 42

12 C41 (DE3) TB 42

151

two yhbO monomers as seen in the crystal structureis entirely different from the one found in the 1G2Istructure The functional assignment of the E coliyhbO protein is thus not yet certain The putativeprotease activity of yhbO is currently studied

YqhE significant difference in the active sitesThe yqhE structure was solved by molecular replace-ment using AmoRe with the PDB entry 1A80 ( a 25-diketo-D-gluconic acid reductase A structure fromCorynebacterium) The yqhE structure was then

Figure 4 Dot Blot results of the expression screening on 6 different targets The top 6 Dot blots correspond to the total protein expressionresults The 6 lower Dot blots correspond to the soluble protein expression measurements The numbers above each Dot Blot refer to thetargets Dot Blot results are ranked by decreasing temperature (left to right) and the strains indicated on the right The growth medium isindicated on the left of each Dot Blot

152

Table 3 Project status List and status of our targets as deposited in targetdb (httptargetdbpdborg) under the project name SGI-SG (BIGS)The first column indicates the target name (according to EcoGene [10]) the second column indicates the relevant genome The status (fromlsquoSelectedrsquo to lsquoCrystal Structure) of each target protein is shown in pink For each target arrows are used to indicates the next step to becompleted stars correspond to targets currently stopped and squares correspond to lsquoon holdrsquo targets (failure during the scale-up or crystal-lization processes) pending further testing

153

Table 3 (Continued)

154

refined to 193 Aring resolution (PDB 1MZR) YqhEbelongs to the vast NADPH-dependent aldo-ketoreductase family Multiple alignment was used tocompare the various sequences of the family Thisrevealed two sub-families including all eukaryoticsequences on one hand and all the prokaryoticsequences on the other hand Structurally the eukary-otic proteins differ from the prokaryotic ones by thepresence of much longer loops in the co-factor andligand binding regions These marked structural dif-ferences should make possible the design of smallinhibitory molecule with a good specificity for theprokaryotic targets

YdhF an NADPH-dependent oxidoreductase ofunknown specificityThe ydhF structure was solved using the MAD tech-nique with 8 seleno-methionines The structure wasthen refined to 25 Aring resolution using data collectedon the BM30 beam line at ESRF YdhF belongs to theNADPH-dependent oxidoreductase family Its struc-ture was compared to the other members of the familyusing multiple alignment YqhE shares 267 identi-cal residues over a 217-aa segment with its closeststructural homologue ydhF YdhF exhibits a se-quence insertion in a flexible loop not well resolvedin the structure This difference is not located near theco-factors and ligand binding sites but may still af-fects the enzyme specificity The structure of theNADP containing ydhF is currently under refinementusing data to 28 Aring resolution

YliB a putative n-peptide binding proteinThe yliB structure was solved using the MAD tech-nique with 20 seleno-methionines The structure wasthen refined to 23 Aring resolution using data collectedon the ID14-EH2 beam line at ESRF This membrane-anchored periplasmic protein shares 29 identicalresidues with 1DPE (a dipeptide binding protein fromE coli) and 1DPP (the same protein in complex withglycyl-L-leucine) The loops surrounding the ligand-binding pocket are shorter in the yliB structure com-pared to the homologous structures suggesting thatlarger peptides might be accommodated by yliB Wealso noticed the presence of zinc ion and of some re-sidual electron density in the region homologous tothe 1DPP dipeptide binding site We are currently pro-ducing crystals of yliB in the presence of variousdipeptide in order to verify its affinity for such sub-strates

YdiB a new structural foldThe ydiB structure was solved using the MAD tech-nique with 20 seleno-methionines The structurebuilding is still in progress using data collected at266 Aring resolution on the BM30 beam line at ESRFThe preliminary tracing suggests that the ydiB struc-ture might be an original fold which is confirmed bythe deposition in the PDB of the ydiB structure by theNortheast Structural Genomics Research Consortium(PDB 1NPD) and the Montreal-Kingston BacterialStructural Genomics Initiative (PDB 1O9B)

Conclusion

This structural genomics project is an opportunity todevelop new approaches for the various stages ofgene expression The use of the lsquoone tube reactionrsquoGATEWAY technology allowed the successful clon-ing of 972 of the selected ORFs The developmentof both in vitro and in vivo expression screening withthe efficient sampling of the various parameters influ-encing soluble protein expression allowed 80 of thetested ORFs to be expressed as soluble proteins (anadditional 18 being expressed as inclusion bodies)The automation of the screening using dot-blot rev-elation now allows us to process 12 targets weekly

In our hands the purification step is not a limitingstep since 100 of the 46 proteins that have reachedthat stage were successfully purified However it wasmore challenging to keep them intact and soluble dur-ing the following steps of concentration and bufferexchange required for crystallization The introduc-tion of buffer screening through dialysis controlled byUV scanning has proven very effective allowing theidentification of a suitable buffer for 95 of the pu-rified samples

Finally the quality control we imposed through theentire project albeit time consuming has proven use-ful to trace mistakes such as target mixing andorcross-contamination as well as protein degradationOn a more positive side it also provided informationon unexpected proteinprotein interactions andorligand binding eventually leading to improved crys-tallization protocols To this days 40 of the 46purified recombinant proteins have been throughbiophysical characterization (circular dichroiumlsm spec-troscopy and monodispersity assays using dynamiclight scattering) 32 have entered crystallization trialsand 31 have been sequenced Twelve proteins haveproduced crystals out of which 9 have resulted in us-

155

Figure 5 Protein structures solved in this study The structures are drawn using MOLSCRIPT [31] The YdiB structure was also solved bythe consortium

156

able diffraction data allowing 7 structures to besolved (Figure 5) This number is now expected togrow rapidly as more expressed gene products arereaching the purificationcrystallization stages

Acknowledgements

We thank our collaborators from Aventis Pharma DrsV Blanc E Remy M Black and J Hogdson for theirscientific input We also thank Drs J-L Ferrer FBorel P Carpentier A Thompson G Leonard and JMcCarthy for their help on the FIP BM14 ID29ID14 ESRF beamlines We thank P Pham-Trong(Greiner) for his help and gift of materials Thisproject is funded by a grant (014906055) from theFrench Ministry of Industry lsquopost genomics researchprogramrsquo

References

1 Stover CK Pham XQ Erwin AL Mizoguchi SDWarrener P Hickey MJ Brinkman FS Hufnagle WOKowalik DJ Lagrou M et al (2000) Nature 406959minus964

2 Cassell GH and Mekalanos J JAMA 285 (2001)601minus605

3 Projan S (2002) Curr Opin Pharmacol 2 513minus5224 Clements JM Beckett RP Brown A Catlin G Lobell

M Palan S Thomas W Whittaker M Wood S SalamaS Baker PJ Rodgers HF Barynin V Rice DW andHunter MG (2001) Antimicrob Agents Chemother 45563minus570

5 Barbachyn MR Brickner SJ Gadwood RC GarmonSA Grega KC Hutchinson DK Munesada K Reis-cher RJ Taniguchi M Thomasco LM Toops DS Ya-mada H Ford CW and Zurenko GE (1998) Adv ExpMed Biol 456 219minus238

6 Johnson AP (2001) Curr Opin Investig Drugs121691minus1701

7 Auckland C Teare L Cooke F Kaufmann ME WarnerM Jones G Bamford K Ayles H Johnson AP (2002)J Antimicrob Chemother 50 743minus746

8 Claverie JM Nature 2000 403 12minus12

9 Altschul SF Madden TL Schaffer AA Zhang JZhang Z Miller W and Lipman DJ (1997) Nucleic AcidsRes 25 3389minus3402

10 Rudd KE (2000) Nucleic Acids Res 28 60minus6411 Persson B and Argos P (1997) J Protein Chem 16

453minus45712 Hartley JL Temple GF and Brasch MA (2000) Genome

Res 10 1788minus179513 Monchois V Vincentelli R Deregnaucourt C Abergel C

and Claverie J-M (2002) In Cell-Free Translation Systems(Ed Spirin AS) Springer Verlag New York pp 197minus202

14 Audic S Lopez F Claverie J-M Poirot O and AbergelC (1997) Proteins 29 252minus257

15 Monchois V Abergel C Sturgis J Jeudy S and ClaverieJ-M (2001) J Biol Chem 276 18437minus18441

16 Bhat TN Bourne P Feng Z Gilliland G Jain S Rav-ichandran V Schneider BSchneider K Thanki N Weis-sig H Westbrook J and Berman HM (2001) Nucleic Ac-ids Res 29 214minus218

17 Navaza J (2001) Acta Crystallogr D57 1367minus137218 Liu Y Ogata CM and HendricksonWA (2001) Proc Natl

Acad Sci U S A 98 10648minus1065319 Hendrickson WA Horton JR and LeMaster DM (1990)

EMBO J 9 1665minus167220 Abergel C Bouveret E Claverie JM Brown K Riga

A Lazdunski C and Benedetti H (1999) Structure FoldDes 71291minus1300

21 Notredame C Higgins DG Heringa J (2000) J MolBiol 302 205minus217

22 Falquet L Pagni M Bucher P Hulo N Sigrist CJHofmann K and Bairoch A (2002) Nucleic Acids Res 30235minus238

23 Notredame C and Abergel C (2003) In Bioinformatics andGenomes (Andrade MA ed) Horizon Scientific Press Wy-mondhamUK pp 27minus49

24 Abergel C Robertson DL Claverie J-M (1999) J Virol73 751minus753

25 Marcotte EM Pellegrini M Thompson MJ Yeates TOand Eisenberg D (1999) Nature 402 83minus86

26 Enault F Suhre K Poirot O Abergel C and ClaverieJ-M (2003) Nucleic Acids Res In press

27 Madabushi S Yao H Marsh M Kristensen DM Phil-ippi A Sowa ME and Lichtarge O (2002) J Mol Biol316 139minus154

28 Klebe G (2000) J Mol Med 78 269minus28129 Apostolakis J and Caflisch A (1999) Comb Chem High

Throughput Screen 2 91minus10430 Sali A and Blundell T L (1993) J Mol Biol 234

779minus81531 Kraulis PJ (1991) J Appl Crystallogr 24 946minus950

157

Page 6: Structural genomics of highly conserved microbial genes of ...metabolomics.helmholtz-muenchen.de/suhre/... · By focusing a structural genomics project on such anonymous proteins,

1300 different homologous ORF sets each with a dif-ferent representative in E coli For each of theseE coli ORFs a conservation index was computed as

CI all

BestScore

EcoliSelfScore

Where EcoliSelfScore is the BlastP [9] score of eachEcoli ORF aligned with itself and BestScore corre-sponds to the score of the best matching ORFs withineach genome (counting 0 for matches scoring belowthe threshold) the sum being performed over alltested genomes The CI value is thus quantifying thesequence conservation of the gene (and its presence)across a large number of evolutionary diverse micro-bial genomes including important pathogens For theinitial 1300 selected genes CI values ranged from225 (for the elongation factor subunit tufB) to 14 forthe putative sugar hydrolase ybgG gene (Figure 1)

For all 1300 selected genes (as well as for all 4222E coli genes) a number of properties were precom-puted and stored in a master database (lsquoMaster DBrsquo)In this database each gene is represented by a spe-cific interactive identity card (Figure 2) The ID cardcontains the gene nucleotide and amino-acid se-quence its precise position in the E coli genome itssignificant protein matches within the Reference Ge-nome set some theoretical properties of its proteinproduct (molecular weight pI extinction coefficient

number of cysteines methionines and chargedresidues) Each gene ID card also gives access to ad-ditional features such as a restriction map of the genethe amino-acid translations derived from alternativereading frames (for error checking) a map of the rarecodons as well as the protein hydropathy profile pre-dicted signal peptide membrane spanning segmentsand secondary structure The lsquoBlastrsquo button (Figure 2)gives access to interactive similarity searches againstadditional genomes listed in Table 1 (lsquoOther Ge-nomesrsquo) This set includes the complete genomes ofmicroorganisms of lesser biomedical relevance (eghyperthermophilic archebacteria) or interesting bacte-ria the genomic sequence of which is nearing com-pletion This list is submitted to periodical changesThe lsquoextended alignmentrsquo button generate a multiplealignment (using the T-coffee [21] program) of the Ecoli ORF with the homologous ORFs identified in theother bacterial genomes The lsquoPdb alignmentrsquo buttondoes the same with homologues of known 3-D struc-tures Master DB and the results of these varioussequence analyses are frequently consulted to guidethe subsequent experimental steps of the project suchas cloning gene expression protein purification andcrystallization trials

The set of 1300 well-conserved gene was furtherreduced to 317 by only retaining the genes of un-known or hypothetical function The distribution ofconservation index (CI values) in this subset rangesfrom 172 to 14 Figure (3A) presents a graph ofknown vs unknown genes ranked according to theirevolutionary conservation (CI value) This graph in-dicates that the most conserved genes (likely to cor-respond to essential pathways) have been historicallycharacterized in priority However it is comforting tosee that a fraction of well-characterized genes arefound at lower CI values where genes of yet unknownfunctions begin to dominate the distribution Thisshows that our protocol of ORF selection based onthe evolutionary conservation across Gram-positiveGram-negative genomes does actually converge ontoactual genes

The set of 317 highly conserved anonymous geneswas further reduced to enhance our probability of suc-cess in the experimental phase of the project that isthe over expression and determination of the 3-Dstructure of the largest possible number of these Ecoli gene products Although the structural determi-nation of membrane-bound proteins will eventuallybe attempted later on (eg by expressing the ORF asseparate domains) we chose to first focus on the

Figure 1 Mandelbrot plot (rank vs score) of the initial 1300 pu-tative target genes The genes (X-axis) are ranked according to theirassociated CI value (Y-axis) The first 200 genes with CI value gt10are ubiquitous genes involved in the most essential cellular path-ways The rest of the distribution is smoothly decreasing suggest-ing a fairly homogeneous selection pressure among all these can-didate target genes Genes of unknown functions are in red geneof known functions are in green

146

easier soluble targets (the so-called lsquolow hangingfruitsrsquo) A prediction of membrane spanning segments[11] was then performed to determine which of theabove 317 genes were most likely to be expressed assoluble globular protein This final bioinformatic fil-ter resulted into a final set of 221 candidate targetgenes The distribution of CI values within this geneset does not exhibit any significant bias ranging from172 to 14 (Figure 3B) Out of the 221 correspond-ing proteins only 17 exhibit a strong sequence simi-larity (defined as gt40 identity over a segment of atleast 130 residues) to entries in the PDB Our 221-candidate gene set is thus likely to correspond to asizable number of interesting new structures and pos-sibly some original folds 110 of these target geneswere attributed to the AFMB laboratory and are notfurther described

Additional bioinformatic analyses

Additional sequence analyses were performed at thevarious stage of the experimental process in particu-lar for genes and proteins for which difficulties wereencountered For instance putative cofactor-bindingmotifs were searched to help expressing purifying orcrystallizing recalcitrant proteins This was performedby combining the recognition of usual sequence sig-natures (eg Prosite motif [22]) with an analysis oftheir conservation within multiple alignments

Multiple alignment is also useful to identify theproper N-terminus of the ORF (for cloning purpose)or delineating detrimental flexible N- or C-terminalsegments of the candidate proteins An analysis of theconfidence measure at the heart of the T-coffee mul-tiple alignment program [21] indicated a correlationbetween the least reliably aligned protein segmentsthe regions of higher sequence variability and thoseof greater flexibility [23] As we work with well-con-servedubiquitous genes the situation is also optimalto take advantage of multiple alignments to identifykey-residues in these lsquoanonymousrsquo proteins and usethe conserved patterns to link them with protein fam-ilies of known functions despite a lack of overallsignificant sequence similarity (see 24 as an exam-ple)

Phylogenomics analyses aiming at identifyingsubset of genes co-segregating or co-evolving acrossmultiple bacterial genomes are also extensively usedto predict protein-protein interaction [25 26] eventu-

ally leading to the design of co-expression co-purifi-cation or co-crystallization experiments Phyloge-nomics is used to suggest links between ORFs ofunknown functions and identified pathways [25]These hints might help the functional interpretation ofthe 3-D structures of our anonymous targets Phylo-genomics analyses are performed with our in-housesoftware PhyDBac [26 httpigs-servercnrs-mrsfrphydbac]

On the negative side the priority of candidategenes was sometimes downgraded due to their ab-sence in important pathogens with reduced genomes(such as H influenzae or Mycoplasma) or for beingtoo similar to their human homologues Ideally onlybacteria-specific biochemical pathways should betargeted for the development of antibiotics (eg peni-cillin blocking peptidoglycan biosynthesis) Howeverthis constraint is probably not acceptable on the longrun given the constant need for new antibiotics andthe limited number of strictly bacterial specific tar-gets A good counter example is the success of thequinolones that are targeting DNA gyrase one of themost conserved proteins (26 identical amino acidsbetween human and E coli)

Finally the amino acid conservation patterns de-rived from multiple alignments become even moreinformative when analyzed in the context of a repre-sentative 3-D structure (from the E coli gene prod-uct) For instance the structural (hydrophobic core orstabilizing residues) vs functional (eg catalyticresidues) nature of the conserved positions can be as-sessed (as illustrated in [20]) Paradoxical features ndashsuch as well-exposed surface loops exhibiting strongresidue conservation ndash are highly informative for thefunctional interpretation of anonymous 3-D struc-tures Protein cavities lined with conserved residuesare also good candidates for lsquoactive sitesrsquo Previouslyunnoticed (ie misaligned) conserved isolated resi-dues (dispersed in the sequence) might be found closetogether in the 3-D structure and suggest the presenceof a catalytic center The spatial distribution of con-served residues can be exhaustively searched withinproteins of known function eventually resulting inthe identification of catalytic sites [27] Finally thedrugability of the regions of the protein pointed outby these analyses can be assessed and large librariesof putative ligands can be computationally screened[28 29]

147

Gene expression and protein purification pipelineFirst pass

The general principle governing our large-scale pro-tein expression approach is to quickly identify thelsquoeasyrsquo ORFs by passing our whole set of 111 througha common first pass standard protocol Further ex-pression trials are then performed on the lsquorecalcitrantrsquoproteins These trials involve lowering the tempera-ture trying other E coli strains adding predictedco-factors etc After about 17 months the first passhas been completed on the 111 targeted ORFs Theresults are as follows 108 ORFs have been success-fully cloned 104 have been tested 89 have exhibiteda detectable expression (either in vivo or using theRoche cell free system) 54 in a soluble form in a firstpass screening In parallel our industrial partner

(Infectious Disease Division Aventis Pharma) is as-sessing the essentiality of these putative target genesin major pathogens using proprietary approaches

Second pass addressing the bottleneck of solubleexpression

The above results identified the expression of the tar-gets in a soluble form as a major bottleneck A sec-ond pass protocol was thus designed using a statisticalscreening of various variables in order to determinethe best conditions for soluble expression We firstdetermined that the three main variables influencingprotein solubility andor expression yield were 1) thebacterial strain 2) the culture temperature and 3) thegrowth medium A set of 12 conditions correspond-ing to the combination of these 3 variables was thus

Figure 2 The yqhE gene identity card in Master DB The various buttons (top left column) provide interactive access to pre-computedadditional information as well as to lsquoon the flyrsquo sequence analysis

148

defined using an incomplete factorial experimentaldesign (see Materials and Methods) A typical exam-ple of such a screen applied to 6 different targets isgiven in Figure 4

At this point out of the 108 ORFs cloned 94 havebeen passed through this screening 93 have been suc-cessfully expressed including 68 in a soluble formThe incomplete factorial approach demonstrated itspower since out of the 94 ORFs screened only 1 wasnot expressed and 25 failed the soluble expressionassay This systematic screening allowed 723 of thetargets to be recovered in a soluble form This is tobe compared to the 52 yield obtained through thefirst pass standard protocol

The scale up bottleneck

For a certain number of proteins we observed thatdirectly scaling up the culture from 4 ml to 1 litercould lead to insoluble expression This problem wasusually solved by simply replacing the 1-liter cultureby several 200-ml cultures

Protein purification

Protein purifications are not yet performed through anautomatic procedure The buffer most suitable forprotein concentration is determined through dialysisexperiments of the purified material monitored byUV spectra and taking into account the theoreticalisoelectric point of each protein The systematiccheck for N-terminal sequence and mass of each pu-rified protein proved extremely useful for debunkingvarious problems (including the mishandling ofsamples) We noticed frequent cases (1531) of spon-taneous removal of the extended His-tagged N-termi-nus inherent to the use of the GATEWAY system (17to 24 first residues removed) The final statistics onthis observation will only be available once all solu-ble proteins will have been purified and sequenced

A prerequisite to successful crystallization wasalso the determination of appropriate storage condi-tions for each purified targets using gel electrophore-sis over time (see Materials and Methods) Forunstable proteins crystallization trials were per-

Figure 2 (Continued)

149

Figure 3 (A) Distribution of known vs unknown genes according to their evolutionary conservation The 1300 most-conserved E coli genesare ranked according to their decreasing CI value (x-axis in parenthesis) and grouped in successive boxes of 50 (Gene rank) For each boxthe number of genes of unknown vs known function are shown in red or green respectively Functionally characterized genes dominate thedistribution for the high CI values but many anonymous genes also exhibit comparable evolutionary conservations(B) Distribution of predicted membrane vs soluble proteins according to their evolutionary conservation (decreasing CI value x-axis inparenthesis) The 317 highly conserved ORFs of unknown function have been divided into membrane-bound (green boxes) vs soluble (redboxes) after sequence analysis There is no overall bias although proteins predicted soluble dominate the distribution of the most conservedcandidates

150

formed on freshly purified samples or using materialstored at minus80 degC

Screening of alternative cloning contexts usinglinear PCR products

To quickly explore alternative cloning contexts (Tagtype and position) for recalcitrant targets we devel-oped an in vitro expression screening based on lineartemplates involving a two steps PCR and universalprimers Tested on 24 different E coli ORFs the lin-ear templates mimicking pDEST17 vector context(N-terminal His-tag) performed as efficiently as thecorresponding circular plasmid This approach canthus be applied to modify the standart pDEST17 ORFcontext into a tagtarget combination more suitablefor soluble expression Possible alternatives includeC-terminal His-tag GST MBP Thioredoxin Streptag and GFP

Overview of the structures obtained to date

YecD molecular replacement using homologymodelingThe interpretation of yecD diffraction data (collectedon the ID29 beam line at the ESRF synchrotron inGrenoble-France) was first attempted by molecularreplacement using AmoRe [17] with the 1NBA and

1IM5 PDB entries These proteins share less than25 identical residues with the yecD sequence andfailed to produce a solution We then build a yecDstructure model with MODELER [30] using theknown structures as templates The yecD model wasused for molecular replacement and produced a goodsolution with 4 molecules per asymmetric unit TheyecD structure was then refined to 13 Aring (PDB 1J2R)YecD and 1NBA are annotated in Swiss-Prot asbelonging to the isochorismatase protein familyHowever the superposition of active site region of theyecD 1NBA and 1IM5 structures reveals that acysteine residue essential to the isochorismatase en-zymatic activity is replaced by a glycine in yecD Inaddition yecD lacks a cis-peptide conserved in 1NBAand 1IM5 It is thus likely that yecD has a differentenzymatic activity Detailed sequence analyses arebeing performed to predict its function

YggV a nucleoside triphosphate phosphohydrolaseThe yggV structure was solved using the MADtechnique with one seleno-methionine Data were col-lected on the BM30 beam line at ESRF The structurewas refined to 18 Aring and was found to be stronglysimilar to PDB entries 2MJP (Nucleoside triphos-phate phosphohydrolase from Methanocaldococcusjannaschii) and 1EX2 (MAF protein from Bacillussubtilis) YggV crystals were thus grown in presenceof dGTP They diffract to 24 Aring resolution The struc-ture of yggV was also solved in the absence of dGTPby the Midwest Center for Structural Genomics (PDB1K7K) A multiple alignment of the homologous se-quences is currently analyzed in the context of thevarious available structures (free forms and com-plexed with nucleotides) Our phylogenomic analysespointed out putative partners to yggV suggesting thatits function in E coli might not be entirely under-stood

YhbO an intracellular proteaseThe yhbO structure was solved by molecular replace-ment using AmoRe with the PDB structure 1G2I (anintracellular protease from Pyrococcus horikoshii)The yhbO structure was subsequently refined to 20Aring resolution A nucleophile elbow motif as well as acysteine residue essential for the protease activity ofthe P horikoshii intracellular protease is conserved inyhbO The multimeric state of the 1G2I structure wasalso described as essential for the catalytic activitythe active site involving residues contributed by twomonomers Interestingly the interface between the

Table 2 Experimental matrix used for expression screening Thefirst column is the experiment number The second column corre-sponds to the 4-state strain variable BL21(DE3) pLysS RosettaOrigami and C41 The third column corresponds to the 3-stategrowth medium variable 2YT SB and TB media The fourth col-umn corresponds to the 3-state temperature variable 25 degC 37 degCand 42 degC

Condition

number

Strain Medium Temperature

1 Bl21 (DE3) pLysS SB 25

2 Rosetta (DE3) TB 25

3 Origami (DE3) 2YT 25

4 C41 (DE3) SB 25

5 Bl21 (DE3) pLysS TB 37

6 Rosetta (DE3) SB 37

7 Origami (DE3) TB 37

8 C41 (DE3) 2YT 37

9 Bl21 (DE3) pLysS 2YT 42

10 Rosetta (DE3) 2YT 42

11 Origami (DE3) SB 42

12 C41 (DE3) TB 42

151

two yhbO monomers as seen in the crystal structureis entirely different from the one found in the 1G2Istructure The functional assignment of the E coliyhbO protein is thus not yet certain The putativeprotease activity of yhbO is currently studied

YqhE significant difference in the active sitesThe yqhE structure was solved by molecular replace-ment using AmoRe with the PDB entry 1A80 ( a 25-diketo-D-gluconic acid reductase A structure fromCorynebacterium) The yqhE structure was then

Figure 4 Dot Blot results of the expression screening on 6 different targets The top 6 Dot blots correspond to the total protein expressionresults The 6 lower Dot blots correspond to the soluble protein expression measurements The numbers above each Dot Blot refer to thetargets Dot Blot results are ranked by decreasing temperature (left to right) and the strains indicated on the right The growth medium isindicated on the left of each Dot Blot

152

Table 3 Project status List and status of our targets as deposited in targetdb (httptargetdbpdborg) under the project name SGI-SG (BIGS)The first column indicates the target name (according to EcoGene [10]) the second column indicates the relevant genome The status (fromlsquoSelectedrsquo to lsquoCrystal Structure) of each target protein is shown in pink For each target arrows are used to indicates the next step to becompleted stars correspond to targets currently stopped and squares correspond to lsquoon holdrsquo targets (failure during the scale-up or crystal-lization processes) pending further testing

153

Table 3 (Continued)

154

refined to 193 Aring resolution (PDB 1MZR) YqhEbelongs to the vast NADPH-dependent aldo-ketoreductase family Multiple alignment was used tocompare the various sequences of the family Thisrevealed two sub-families including all eukaryoticsequences on one hand and all the prokaryoticsequences on the other hand Structurally the eukary-otic proteins differ from the prokaryotic ones by thepresence of much longer loops in the co-factor andligand binding regions These marked structural dif-ferences should make possible the design of smallinhibitory molecule with a good specificity for theprokaryotic targets

YdhF an NADPH-dependent oxidoreductase ofunknown specificityThe ydhF structure was solved using the MAD tech-nique with 8 seleno-methionines The structure wasthen refined to 25 Aring resolution using data collectedon the BM30 beam line at ESRF YdhF belongs to theNADPH-dependent oxidoreductase family Its struc-ture was compared to the other members of the familyusing multiple alignment YqhE shares 267 identi-cal residues over a 217-aa segment with its closeststructural homologue ydhF YdhF exhibits a se-quence insertion in a flexible loop not well resolvedin the structure This difference is not located near theco-factors and ligand binding sites but may still af-fects the enzyme specificity The structure of theNADP containing ydhF is currently under refinementusing data to 28 Aring resolution

YliB a putative n-peptide binding proteinThe yliB structure was solved using the MAD tech-nique with 20 seleno-methionines The structure wasthen refined to 23 Aring resolution using data collectedon the ID14-EH2 beam line at ESRF This membrane-anchored periplasmic protein shares 29 identicalresidues with 1DPE (a dipeptide binding protein fromE coli) and 1DPP (the same protein in complex withglycyl-L-leucine) The loops surrounding the ligand-binding pocket are shorter in the yliB structure com-pared to the homologous structures suggesting thatlarger peptides might be accommodated by yliB Wealso noticed the presence of zinc ion and of some re-sidual electron density in the region homologous tothe 1DPP dipeptide binding site We are currently pro-ducing crystals of yliB in the presence of variousdipeptide in order to verify its affinity for such sub-strates

YdiB a new structural foldThe ydiB structure was solved using the MAD tech-nique with 20 seleno-methionines The structurebuilding is still in progress using data collected at266 Aring resolution on the BM30 beam line at ESRFThe preliminary tracing suggests that the ydiB struc-ture might be an original fold which is confirmed bythe deposition in the PDB of the ydiB structure by theNortheast Structural Genomics Research Consortium(PDB 1NPD) and the Montreal-Kingston BacterialStructural Genomics Initiative (PDB 1O9B)

Conclusion

This structural genomics project is an opportunity todevelop new approaches for the various stages ofgene expression The use of the lsquoone tube reactionrsquoGATEWAY technology allowed the successful clon-ing of 972 of the selected ORFs The developmentof both in vitro and in vivo expression screening withthe efficient sampling of the various parameters influ-encing soluble protein expression allowed 80 of thetested ORFs to be expressed as soluble proteins (anadditional 18 being expressed as inclusion bodies)The automation of the screening using dot-blot rev-elation now allows us to process 12 targets weekly

In our hands the purification step is not a limitingstep since 100 of the 46 proteins that have reachedthat stage were successfully purified However it wasmore challenging to keep them intact and soluble dur-ing the following steps of concentration and bufferexchange required for crystallization The introduc-tion of buffer screening through dialysis controlled byUV scanning has proven very effective allowing theidentification of a suitable buffer for 95 of the pu-rified samples

Finally the quality control we imposed through theentire project albeit time consuming has proven use-ful to trace mistakes such as target mixing andorcross-contamination as well as protein degradationOn a more positive side it also provided informationon unexpected proteinprotein interactions andorligand binding eventually leading to improved crys-tallization protocols To this days 40 of the 46purified recombinant proteins have been throughbiophysical characterization (circular dichroiumlsm spec-troscopy and monodispersity assays using dynamiclight scattering) 32 have entered crystallization trialsand 31 have been sequenced Twelve proteins haveproduced crystals out of which 9 have resulted in us-

155

Figure 5 Protein structures solved in this study The structures are drawn using MOLSCRIPT [31] The YdiB structure was also solved bythe consortium

156

able diffraction data allowing 7 structures to besolved (Figure 5) This number is now expected togrow rapidly as more expressed gene products arereaching the purificationcrystallization stages

Acknowledgements

We thank our collaborators from Aventis Pharma DrsV Blanc E Remy M Black and J Hogdson for theirscientific input We also thank Drs J-L Ferrer FBorel P Carpentier A Thompson G Leonard and JMcCarthy for their help on the FIP BM14 ID29ID14 ESRF beamlines We thank P Pham-Trong(Greiner) for his help and gift of materials Thisproject is funded by a grant (014906055) from theFrench Ministry of Industry lsquopost genomics researchprogramrsquo

References

1 Stover CK Pham XQ Erwin AL Mizoguchi SDWarrener P Hickey MJ Brinkman FS Hufnagle WOKowalik DJ Lagrou M et al (2000) Nature 406959minus964

2 Cassell GH and Mekalanos J JAMA 285 (2001)601minus605

3 Projan S (2002) Curr Opin Pharmacol 2 513minus5224 Clements JM Beckett RP Brown A Catlin G Lobell

M Palan S Thomas W Whittaker M Wood S SalamaS Baker PJ Rodgers HF Barynin V Rice DW andHunter MG (2001) Antimicrob Agents Chemother 45563minus570

5 Barbachyn MR Brickner SJ Gadwood RC GarmonSA Grega KC Hutchinson DK Munesada K Reis-cher RJ Taniguchi M Thomasco LM Toops DS Ya-mada H Ford CW and Zurenko GE (1998) Adv ExpMed Biol 456 219minus238

6 Johnson AP (2001) Curr Opin Investig Drugs121691minus1701

7 Auckland C Teare L Cooke F Kaufmann ME WarnerM Jones G Bamford K Ayles H Johnson AP (2002)J Antimicrob Chemother 50 743minus746

8 Claverie JM Nature 2000 403 12minus12

9 Altschul SF Madden TL Schaffer AA Zhang JZhang Z Miller W and Lipman DJ (1997) Nucleic AcidsRes 25 3389minus3402

10 Rudd KE (2000) Nucleic Acids Res 28 60minus6411 Persson B and Argos P (1997) J Protein Chem 16

453minus45712 Hartley JL Temple GF and Brasch MA (2000) Genome

Res 10 1788minus179513 Monchois V Vincentelli R Deregnaucourt C Abergel C

and Claverie J-M (2002) In Cell-Free Translation Systems(Ed Spirin AS) Springer Verlag New York pp 197minus202

14 Audic S Lopez F Claverie J-M Poirot O and AbergelC (1997) Proteins 29 252minus257

15 Monchois V Abergel C Sturgis J Jeudy S and ClaverieJ-M (2001) J Biol Chem 276 18437minus18441

16 Bhat TN Bourne P Feng Z Gilliland G Jain S Rav-ichandran V Schneider BSchneider K Thanki N Weis-sig H Westbrook J and Berman HM (2001) Nucleic Ac-ids Res 29 214minus218

17 Navaza J (2001) Acta Crystallogr D57 1367minus137218 Liu Y Ogata CM and HendricksonWA (2001) Proc Natl

Acad Sci U S A 98 10648minus1065319 Hendrickson WA Horton JR and LeMaster DM (1990)

EMBO J 9 1665minus167220 Abergel C Bouveret E Claverie JM Brown K Riga

A Lazdunski C and Benedetti H (1999) Structure FoldDes 71291minus1300

21 Notredame C Higgins DG Heringa J (2000) J MolBiol 302 205minus217

22 Falquet L Pagni M Bucher P Hulo N Sigrist CJHofmann K and Bairoch A (2002) Nucleic Acids Res 30235minus238

23 Notredame C and Abergel C (2003) In Bioinformatics andGenomes (Andrade MA ed) Horizon Scientific Press Wy-mondhamUK pp 27minus49

24 Abergel C Robertson DL Claverie J-M (1999) J Virol73 751minus753

25 Marcotte EM Pellegrini M Thompson MJ Yeates TOand Eisenberg D (1999) Nature 402 83minus86

26 Enault F Suhre K Poirot O Abergel C and ClaverieJ-M (2003) Nucleic Acids Res In press

27 Madabushi S Yao H Marsh M Kristensen DM Phil-ippi A Sowa ME and Lichtarge O (2002) J Mol Biol316 139minus154

28 Klebe G (2000) J Mol Med 78 269minus28129 Apostolakis J and Caflisch A (1999) Comb Chem High

Throughput Screen 2 91minus10430 Sali A and Blundell T L (1993) J Mol Biol 234

779minus81531 Kraulis PJ (1991) J Appl Crystallogr 24 946minus950

157

Page 7: Structural genomics of highly conserved microbial genes of ...metabolomics.helmholtz-muenchen.de/suhre/... · By focusing a structural genomics project on such anonymous proteins,

easier soluble targets (the so-called lsquolow hangingfruitsrsquo) A prediction of membrane spanning segments[11] was then performed to determine which of theabove 317 genes were most likely to be expressed assoluble globular protein This final bioinformatic fil-ter resulted into a final set of 221 candidate targetgenes The distribution of CI values within this geneset does not exhibit any significant bias ranging from172 to 14 (Figure 3B) Out of the 221 correspond-ing proteins only 17 exhibit a strong sequence simi-larity (defined as gt40 identity over a segment of atleast 130 residues) to entries in the PDB Our 221-candidate gene set is thus likely to correspond to asizable number of interesting new structures and pos-sibly some original folds 110 of these target geneswere attributed to the AFMB laboratory and are notfurther described

Additional bioinformatic analyses

Additional sequence analyses were performed at thevarious stage of the experimental process in particu-lar for genes and proteins for which difficulties wereencountered For instance putative cofactor-bindingmotifs were searched to help expressing purifying orcrystallizing recalcitrant proteins This was performedby combining the recognition of usual sequence sig-natures (eg Prosite motif [22]) with an analysis oftheir conservation within multiple alignments

Multiple alignment is also useful to identify theproper N-terminus of the ORF (for cloning purpose)or delineating detrimental flexible N- or C-terminalsegments of the candidate proteins An analysis of theconfidence measure at the heart of the T-coffee mul-tiple alignment program [21] indicated a correlationbetween the least reliably aligned protein segmentsthe regions of higher sequence variability and thoseof greater flexibility [23] As we work with well-con-servedubiquitous genes the situation is also optimalto take advantage of multiple alignments to identifykey-residues in these lsquoanonymousrsquo proteins and usethe conserved patterns to link them with protein fam-ilies of known functions despite a lack of overallsignificant sequence similarity (see 24 as an exam-ple)

Phylogenomics analyses aiming at identifyingsubset of genes co-segregating or co-evolving acrossmultiple bacterial genomes are also extensively usedto predict protein-protein interaction [25 26] eventu-

ally leading to the design of co-expression co-purifi-cation or co-crystallization experiments Phyloge-nomics is used to suggest links between ORFs ofunknown functions and identified pathways [25]These hints might help the functional interpretation ofthe 3-D structures of our anonymous targets Phylo-genomics analyses are performed with our in-housesoftware PhyDBac [26 httpigs-servercnrs-mrsfrphydbac]

On the negative side the priority of candidategenes was sometimes downgraded due to their ab-sence in important pathogens with reduced genomes(such as H influenzae or Mycoplasma) or for beingtoo similar to their human homologues Ideally onlybacteria-specific biochemical pathways should betargeted for the development of antibiotics (eg peni-cillin blocking peptidoglycan biosynthesis) Howeverthis constraint is probably not acceptable on the longrun given the constant need for new antibiotics andthe limited number of strictly bacterial specific tar-gets A good counter example is the success of thequinolones that are targeting DNA gyrase one of themost conserved proteins (26 identical amino acidsbetween human and E coli)

Finally the amino acid conservation patterns de-rived from multiple alignments become even moreinformative when analyzed in the context of a repre-sentative 3-D structure (from the E coli gene prod-uct) For instance the structural (hydrophobic core orstabilizing residues) vs functional (eg catalyticresidues) nature of the conserved positions can be as-sessed (as illustrated in [20]) Paradoxical features ndashsuch as well-exposed surface loops exhibiting strongresidue conservation ndash are highly informative for thefunctional interpretation of anonymous 3-D struc-tures Protein cavities lined with conserved residuesare also good candidates for lsquoactive sitesrsquo Previouslyunnoticed (ie misaligned) conserved isolated resi-dues (dispersed in the sequence) might be found closetogether in the 3-D structure and suggest the presenceof a catalytic center The spatial distribution of con-served residues can be exhaustively searched withinproteins of known function eventually resulting inthe identification of catalytic sites [27] Finally thedrugability of the regions of the protein pointed outby these analyses can be assessed and large librariesof putative ligands can be computationally screened[28 29]

147

Gene expression and protein purification pipelineFirst pass

The general principle governing our large-scale pro-tein expression approach is to quickly identify thelsquoeasyrsquo ORFs by passing our whole set of 111 througha common first pass standard protocol Further ex-pression trials are then performed on the lsquorecalcitrantrsquoproteins These trials involve lowering the tempera-ture trying other E coli strains adding predictedco-factors etc After about 17 months the first passhas been completed on the 111 targeted ORFs Theresults are as follows 108 ORFs have been success-fully cloned 104 have been tested 89 have exhibiteda detectable expression (either in vivo or using theRoche cell free system) 54 in a soluble form in a firstpass screening In parallel our industrial partner

(Infectious Disease Division Aventis Pharma) is as-sessing the essentiality of these putative target genesin major pathogens using proprietary approaches

Second pass addressing the bottleneck of solubleexpression

The above results identified the expression of the tar-gets in a soluble form as a major bottleneck A sec-ond pass protocol was thus designed using a statisticalscreening of various variables in order to determinethe best conditions for soluble expression We firstdetermined that the three main variables influencingprotein solubility andor expression yield were 1) thebacterial strain 2) the culture temperature and 3) thegrowth medium A set of 12 conditions correspond-ing to the combination of these 3 variables was thus

Figure 2 The yqhE gene identity card in Master DB The various buttons (top left column) provide interactive access to pre-computedadditional information as well as to lsquoon the flyrsquo sequence analysis

148

defined using an incomplete factorial experimentaldesign (see Materials and Methods) A typical exam-ple of such a screen applied to 6 different targets isgiven in Figure 4

At this point out of the 108 ORFs cloned 94 havebeen passed through this screening 93 have been suc-cessfully expressed including 68 in a soluble formThe incomplete factorial approach demonstrated itspower since out of the 94 ORFs screened only 1 wasnot expressed and 25 failed the soluble expressionassay This systematic screening allowed 723 of thetargets to be recovered in a soluble form This is tobe compared to the 52 yield obtained through thefirst pass standard protocol

The scale up bottleneck

For a certain number of proteins we observed thatdirectly scaling up the culture from 4 ml to 1 litercould lead to insoluble expression This problem wasusually solved by simply replacing the 1-liter cultureby several 200-ml cultures

Protein purification

Protein purifications are not yet performed through anautomatic procedure The buffer most suitable forprotein concentration is determined through dialysisexperiments of the purified material monitored byUV spectra and taking into account the theoreticalisoelectric point of each protein The systematiccheck for N-terminal sequence and mass of each pu-rified protein proved extremely useful for debunkingvarious problems (including the mishandling ofsamples) We noticed frequent cases (1531) of spon-taneous removal of the extended His-tagged N-termi-nus inherent to the use of the GATEWAY system (17to 24 first residues removed) The final statistics onthis observation will only be available once all solu-ble proteins will have been purified and sequenced

A prerequisite to successful crystallization wasalso the determination of appropriate storage condi-tions for each purified targets using gel electrophore-sis over time (see Materials and Methods) Forunstable proteins crystallization trials were per-

Figure 2 (Continued)

149

Figure 3 (A) Distribution of known vs unknown genes according to their evolutionary conservation The 1300 most-conserved E coli genesare ranked according to their decreasing CI value (x-axis in parenthesis) and grouped in successive boxes of 50 (Gene rank) For each boxthe number of genes of unknown vs known function are shown in red or green respectively Functionally characterized genes dominate thedistribution for the high CI values but many anonymous genes also exhibit comparable evolutionary conservations(B) Distribution of predicted membrane vs soluble proteins according to their evolutionary conservation (decreasing CI value x-axis inparenthesis) The 317 highly conserved ORFs of unknown function have been divided into membrane-bound (green boxes) vs soluble (redboxes) after sequence analysis There is no overall bias although proteins predicted soluble dominate the distribution of the most conservedcandidates

150

formed on freshly purified samples or using materialstored at minus80 degC

Screening of alternative cloning contexts usinglinear PCR products

To quickly explore alternative cloning contexts (Tagtype and position) for recalcitrant targets we devel-oped an in vitro expression screening based on lineartemplates involving a two steps PCR and universalprimers Tested on 24 different E coli ORFs the lin-ear templates mimicking pDEST17 vector context(N-terminal His-tag) performed as efficiently as thecorresponding circular plasmid This approach canthus be applied to modify the standart pDEST17 ORFcontext into a tagtarget combination more suitablefor soluble expression Possible alternatives includeC-terminal His-tag GST MBP Thioredoxin Streptag and GFP

Overview of the structures obtained to date

YecD molecular replacement using homologymodelingThe interpretation of yecD diffraction data (collectedon the ID29 beam line at the ESRF synchrotron inGrenoble-France) was first attempted by molecularreplacement using AmoRe [17] with the 1NBA and

1IM5 PDB entries These proteins share less than25 identical residues with the yecD sequence andfailed to produce a solution We then build a yecDstructure model with MODELER [30] using theknown structures as templates The yecD model wasused for molecular replacement and produced a goodsolution with 4 molecules per asymmetric unit TheyecD structure was then refined to 13 Aring (PDB 1J2R)YecD and 1NBA are annotated in Swiss-Prot asbelonging to the isochorismatase protein familyHowever the superposition of active site region of theyecD 1NBA and 1IM5 structures reveals that acysteine residue essential to the isochorismatase en-zymatic activity is replaced by a glycine in yecD Inaddition yecD lacks a cis-peptide conserved in 1NBAand 1IM5 It is thus likely that yecD has a differentenzymatic activity Detailed sequence analyses arebeing performed to predict its function

YggV a nucleoside triphosphate phosphohydrolaseThe yggV structure was solved using the MADtechnique with one seleno-methionine Data were col-lected on the BM30 beam line at ESRF The structurewas refined to 18 Aring and was found to be stronglysimilar to PDB entries 2MJP (Nucleoside triphos-phate phosphohydrolase from Methanocaldococcusjannaschii) and 1EX2 (MAF protein from Bacillussubtilis) YggV crystals were thus grown in presenceof dGTP They diffract to 24 Aring resolution The struc-ture of yggV was also solved in the absence of dGTPby the Midwest Center for Structural Genomics (PDB1K7K) A multiple alignment of the homologous se-quences is currently analyzed in the context of thevarious available structures (free forms and com-plexed with nucleotides) Our phylogenomic analysespointed out putative partners to yggV suggesting thatits function in E coli might not be entirely under-stood

YhbO an intracellular proteaseThe yhbO structure was solved by molecular replace-ment using AmoRe with the PDB structure 1G2I (anintracellular protease from Pyrococcus horikoshii)The yhbO structure was subsequently refined to 20Aring resolution A nucleophile elbow motif as well as acysteine residue essential for the protease activity ofthe P horikoshii intracellular protease is conserved inyhbO The multimeric state of the 1G2I structure wasalso described as essential for the catalytic activitythe active site involving residues contributed by twomonomers Interestingly the interface between the

Table 2 Experimental matrix used for expression screening Thefirst column is the experiment number The second column corre-sponds to the 4-state strain variable BL21(DE3) pLysS RosettaOrigami and C41 The third column corresponds to the 3-stategrowth medium variable 2YT SB and TB media The fourth col-umn corresponds to the 3-state temperature variable 25 degC 37 degCand 42 degC

Condition

number

Strain Medium Temperature

1 Bl21 (DE3) pLysS SB 25

2 Rosetta (DE3) TB 25

3 Origami (DE3) 2YT 25

4 C41 (DE3) SB 25

5 Bl21 (DE3) pLysS TB 37

6 Rosetta (DE3) SB 37

7 Origami (DE3) TB 37

8 C41 (DE3) 2YT 37

9 Bl21 (DE3) pLysS 2YT 42

10 Rosetta (DE3) 2YT 42

11 Origami (DE3) SB 42

12 C41 (DE3) TB 42

151

two yhbO monomers as seen in the crystal structureis entirely different from the one found in the 1G2Istructure The functional assignment of the E coliyhbO protein is thus not yet certain The putativeprotease activity of yhbO is currently studied

YqhE significant difference in the active sitesThe yqhE structure was solved by molecular replace-ment using AmoRe with the PDB entry 1A80 ( a 25-diketo-D-gluconic acid reductase A structure fromCorynebacterium) The yqhE structure was then

Figure 4 Dot Blot results of the expression screening on 6 different targets The top 6 Dot blots correspond to the total protein expressionresults The 6 lower Dot blots correspond to the soluble protein expression measurements The numbers above each Dot Blot refer to thetargets Dot Blot results are ranked by decreasing temperature (left to right) and the strains indicated on the right The growth medium isindicated on the left of each Dot Blot

152

Table 3 Project status List and status of our targets as deposited in targetdb (httptargetdbpdborg) under the project name SGI-SG (BIGS)The first column indicates the target name (according to EcoGene [10]) the second column indicates the relevant genome The status (fromlsquoSelectedrsquo to lsquoCrystal Structure) of each target protein is shown in pink For each target arrows are used to indicates the next step to becompleted stars correspond to targets currently stopped and squares correspond to lsquoon holdrsquo targets (failure during the scale-up or crystal-lization processes) pending further testing

153

Table 3 (Continued)

154

refined to 193 Aring resolution (PDB 1MZR) YqhEbelongs to the vast NADPH-dependent aldo-ketoreductase family Multiple alignment was used tocompare the various sequences of the family Thisrevealed two sub-families including all eukaryoticsequences on one hand and all the prokaryoticsequences on the other hand Structurally the eukary-otic proteins differ from the prokaryotic ones by thepresence of much longer loops in the co-factor andligand binding regions These marked structural dif-ferences should make possible the design of smallinhibitory molecule with a good specificity for theprokaryotic targets

YdhF an NADPH-dependent oxidoreductase ofunknown specificityThe ydhF structure was solved using the MAD tech-nique with 8 seleno-methionines The structure wasthen refined to 25 Aring resolution using data collectedon the BM30 beam line at ESRF YdhF belongs to theNADPH-dependent oxidoreductase family Its struc-ture was compared to the other members of the familyusing multiple alignment YqhE shares 267 identi-cal residues over a 217-aa segment with its closeststructural homologue ydhF YdhF exhibits a se-quence insertion in a flexible loop not well resolvedin the structure This difference is not located near theco-factors and ligand binding sites but may still af-fects the enzyme specificity The structure of theNADP containing ydhF is currently under refinementusing data to 28 Aring resolution

YliB a putative n-peptide binding proteinThe yliB structure was solved using the MAD tech-nique with 20 seleno-methionines The structure wasthen refined to 23 Aring resolution using data collectedon the ID14-EH2 beam line at ESRF This membrane-anchored periplasmic protein shares 29 identicalresidues with 1DPE (a dipeptide binding protein fromE coli) and 1DPP (the same protein in complex withglycyl-L-leucine) The loops surrounding the ligand-binding pocket are shorter in the yliB structure com-pared to the homologous structures suggesting thatlarger peptides might be accommodated by yliB Wealso noticed the presence of zinc ion and of some re-sidual electron density in the region homologous tothe 1DPP dipeptide binding site We are currently pro-ducing crystals of yliB in the presence of variousdipeptide in order to verify its affinity for such sub-strates

YdiB a new structural foldThe ydiB structure was solved using the MAD tech-nique with 20 seleno-methionines The structurebuilding is still in progress using data collected at266 Aring resolution on the BM30 beam line at ESRFThe preliminary tracing suggests that the ydiB struc-ture might be an original fold which is confirmed bythe deposition in the PDB of the ydiB structure by theNortheast Structural Genomics Research Consortium(PDB 1NPD) and the Montreal-Kingston BacterialStructural Genomics Initiative (PDB 1O9B)

Conclusion

This structural genomics project is an opportunity todevelop new approaches for the various stages ofgene expression The use of the lsquoone tube reactionrsquoGATEWAY technology allowed the successful clon-ing of 972 of the selected ORFs The developmentof both in vitro and in vivo expression screening withthe efficient sampling of the various parameters influ-encing soluble protein expression allowed 80 of thetested ORFs to be expressed as soluble proteins (anadditional 18 being expressed as inclusion bodies)The automation of the screening using dot-blot rev-elation now allows us to process 12 targets weekly

In our hands the purification step is not a limitingstep since 100 of the 46 proteins that have reachedthat stage were successfully purified However it wasmore challenging to keep them intact and soluble dur-ing the following steps of concentration and bufferexchange required for crystallization The introduc-tion of buffer screening through dialysis controlled byUV scanning has proven very effective allowing theidentification of a suitable buffer for 95 of the pu-rified samples

Finally the quality control we imposed through theentire project albeit time consuming has proven use-ful to trace mistakes such as target mixing andorcross-contamination as well as protein degradationOn a more positive side it also provided informationon unexpected proteinprotein interactions andorligand binding eventually leading to improved crys-tallization protocols To this days 40 of the 46purified recombinant proteins have been throughbiophysical characterization (circular dichroiumlsm spec-troscopy and monodispersity assays using dynamiclight scattering) 32 have entered crystallization trialsand 31 have been sequenced Twelve proteins haveproduced crystals out of which 9 have resulted in us-

155

Figure 5 Protein structures solved in this study The structures are drawn using MOLSCRIPT [31] The YdiB structure was also solved bythe consortium

156

able diffraction data allowing 7 structures to besolved (Figure 5) This number is now expected togrow rapidly as more expressed gene products arereaching the purificationcrystallization stages

Acknowledgements

We thank our collaborators from Aventis Pharma DrsV Blanc E Remy M Black and J Hogdson for theirscientific input We also thank Drs J-L Ferrer FBorel P Carpentier A Thompson G Leonard and JMcCarthy for their help on the FIP BM14 ID29ID14 ESRF beamlines We thank P Pham-Trong(Greiner) for his help and gift of materials Thisproject is funded by a grant (014906055) from theFrench Ministry of Industry lsquopost genomics researchprogramrsquo

References

1 Stover CK Pham XQ Erwin AL Mizoguchi SDWarrener P Hickey MJ Brinkman FS Hufnagle WOKowalik DJ Lagrou M et al (2000) Nature 406959minus964

2 Cassell GH and Mekalanos J JAMA 285 (2001)601minus605

3 Projan S (2002) Curr Opin Pharmacol 2 513minus5224 Clements JM Beckett RP Brown A Catlin G Lobell

M Palan S Thomas W Whittaker M Wood S SalamaS Baker PJ Rodgers HF Barynin V Rice DW andHunter MG (2001) Antimicrob Agents Chemother 45563minus570

5 Barbachyn MR Brickner SJ Gadwood RC GarmonSA Grega KC Hutchinson DK Munesada K Reis-cher RJ Taniguchi M Thomasco LM Toops DS Ya-mada H Ford CW and Zurenko GE (1998) Adv ExpMed Biol 456 219minus238

6 Johnson AP (2001) Curr Opin Investig Drugs121691minus1701

7 Auckland C Teare L Cooke F Kaufmann ME WarnerM Jones G Bamford K Ayles H Johnson AP (2002)J Antimicrob Chemother 50 743minus746

8 Claverie JM Nature 2000 403 12minus12

9 Altschul SF Madden TL Schaffer AA Zhang JZhang Z Miller W and Lipman DJ (1997) Nucleic AcidsRes 25 3389minus3402

10 Rudd KE (2000) Nucleic Acids Res 28 60minus6411 Persson B and Argos P (1997) J Protein Chem 16

453minus45712 Hartley JL Temple GF and Brasch MA (2000) Genome

Res 10 1788minus179513 Monchois V Vincentelli R Deregnaucourt C Abergel C

and Claverie J-M (2002) In Cell-Free Translation Systems(Ed Spirin AS) Springer Verlag New York pp 197minus202

14 Audic S Lopez F Claverie J-M Poirot O and AbergelC (1997) Proteins 29 252minus257

15 Monchois V Abergel C Sturgis J Jeudy S and ClaverieJ-M (2001) J Biol Chem 276 18437minus18441

16 Bhat TN Bourne P Feng Z Gilliland G Jain S Rav-ichandran V Schneider BSchneider K Thanki N Weis-sig H Westbrook J and Berman HM (2001) Nucleic Ac-ids Res 29 214minus218

17 Navaza J (2001) Acta Crystallogr D57 1367minus137218 Liu Y Ogata CM and HendricksonWA (2001) Proc Natl

Acad Sci U S A 98 10648minus1065319 Hendrickson WA Horton JR and LeMaster DM (1990)

EMBO J 9 1665minus167220 Abergel C Bouveret E Claverie JM Brown K Riga

A Lazdunski C and Benedetti H (1999) Structure FoldDes 71291minus1300

21 Notredame C Higgins DG Heringa J (2000) J MolBiol 302 205minus217

22 Falquet L Pagni M Bucher P Hulo N Sigrist CJHofmann K and Bairoch A (2002) Nucleic Acids Res 30235minus238

23 Notredame C and Abergel C (2003) In Bioinformatics andGenomes (Andrade MA ed) Horizon Scientific Press Wy-mondhamUK pp 27minus49

24 Abergel C Robertson DL Claverie J-M (1999) J Virol73 751minus753

25 Marcotte EM Pellegrini M Thompson MJ Yeates TOand Eisenberg D (1999) Nature 402 83minus86

26 Enault F Suhre K Poirot O Abergel C and ClaverieJ-M (2003) Nucleic Acids Res In press

27 Madabushi S Yao H Marsh M Kristensen DM Phil-ippi A Sowa ME and Lichtarge O (2002) J Mol Biol316 139minus154

28 Klebe G (2000) J Mol Med 78 269minus28129 Apostolakis J and Caflisch A (1999) Comb Chem High

Throughput Screen 2 91minus10430 Sali A and Blundell T L (1993) J Mol Biol 234

779minus81531 Kraulis PJ (1991) J Appl Crystallogr 24 946minus950

157

Page 8: Structural genomics of highly conserved microbial genes of ...metabolomics.helmholtz-muenchen.de/suhre/... · By focusing a structural genomics project on such anonymous proteins,

Gene expression and protein purification pipelineFirst pass

The general principle governing our large-scale pro-tein expression approach is to quickly identify thelsquoeasyrsquo ORFs by passing our whole set of 111 througha common first pass standard protocol Further ex-pression trials are then performed on the lsquorecalcitrantrsquoproteins These trials involve lowering the tempera-ture trying other E coli strains adding predictedco-factors etc After about 17 months the first passhas been completed on the 111 targeted ORFs Theresults are as follows 108 ORFs have been success-fully cloned 104 have been tested 89 have exhibiteda detectable expression (either in vivo or using theRoche cell free system) 54 in a soluble form in a firstpass screening In parallel our industrial partner

(Infectious Disease Division Aventis Pharma) is as-sessing the essentiality of these putative target genesin major pathogens using proprietary approaches

Second pass addressing the bottleneck of solubleexpression

The above results identified the expression of the tar-gets in a soluble form as a major bottleneck A sec-ond pass protocol was thus designed using a statisticalscreening of various variables in order to determinethe best conditions for soluble expression We firstdetermined that the three main variables influencingprotein solubility andor expression yield were 1) thebacterial strain 2) the culture temperature and 3) thegrowth medium A set of 12 conditions correspond-ing to the combination of these 3 variables was thus

Figure 2 The yqhE gene identity card in Master DB The various buttons (top left column) provide interactive access to pre-computedadditional information as well as to lsquoon the flyrsquo sequence analysis

148

defined using an incomplete factorial experimentaldesign (see Materials and Methods) A typical exam-ple of such a screen applied to 6 different targets isgiven in Figure 4

At this point out of the 108 ORFs cloned 94 havebeen passed through this screening 93 have been suc-cessfully expressed including 68 in a soluble formThe incomplete factorial approach demonstrated itspower since out of the 94 ORFs screened only 1 wasnot expressed and 25 failed the soluble expressionassay This systematic screening allowed 723 of thetargets to be recovered in a soluble form This is tobe compared to the 52 yield obtained through thefirst pass standard protocol

The scale up bottleneck

For a certain number of proteins we observed thatdirectly scaling up the culture from 4 ml to 1 litercould lead to insoluble expression This problem wasusually solved by simply replacing the 1-liter cultureby several 200-ml cultures

Protein purification

Protein purifications are not yet performed through anautomatic procedure The buffer most suitable forprotein concentration is determined through dialysisexperiments of the purified material monitored byUV spectra and taking into account the theoreticalisoelectric point of each protein The systematiccheck for N-terminal sequence and mass of each pu-rified protein proved extremely useful for debunkingvarious problems (including the mishandling ofsamples) We noticed frequent cases (1531) of spon-taneous removal of the extended His-tagged N-termi-nus inherent to the use of the GATEWAY system (17to 24 first residues removed) The final statistics onthis observation will only be available once all solu-ble proteins will have been purified and sequenced

A prerequisite to successful crystallization wasalso the determination of appropriate storage condi-tions for each purified targets using gel electrophore-sis over time (see Materials and Methods) Forunstable proteins crystallization trials were per-

Figure 2 (Continued)

149

Figure 3 (A) Distribution of known vs unknown genes according to their evolutionary conservation The 1300 most-conserved E coli genesare ranked according to their decreasing CI value (x-axis in parenthesis) and grouped in successive boxes of 50 (Gene rank) For each boxthe number of genes of unknown vs known function are shown in red or green respectively Functionally characterized genes dominate thedistribution for the high CI values but many anonymous genes also exhibit comparable evolutionary conservations(B) Distribution of predicted membrane vs soluble proteins according to their evolutionary conservation (decreasing CI value x-axis inparenthesis) The 317 highly conserved ORFs of unknown function have been divided into membrane-bound (green boxes) vs soluble (redboxes) after sequence analysis There is no overall bias although proteins predicted soluble dominate the distribution of the most conservedcandidates

150

formed on freshly purified samples or using materialstored at minus80 degC

Screening of alternative cloning contexts usinglinear PCR products

To quickly explore alternative cloning contexts (Tagtype and position) for recalcitrant targets we devel-oped an in vitro expression screening based on lineartemplates involving a two steps PCR and universalprimers Tested on 24 different E coli ORFs the lin-ear templates mimicking pDEST17 vector context(N-terminal His-tag) performed as efficiently as thecorresponding circular plasmid This approach canthus be applied to modify the standart pDEST17 ORFcontext into a tagtarget combination more suitablefor soluble expression Possible alternatives includeC-terminal His-tag GST MBP Thioredoxin Streptag and GFP

Overview of the structures obtained to date

YecD molecular replacement using homologymodelingThe interpretation of yecD diffraction data (collectedon the ID29 beam line at the ESRF synchrotron inGrenoble-France) was first attempted by molecularreplacement using AmoRe [17] with the 1NBA and

1IM5 PDB entries These proteins share less than25 identical residues with the yecD sequence andfailed to produce a solution We then build a yecDstructure model with MODELER [30] using theknown structures as templates The yecD model wasused for molecular replacement and produced a goodsolution with 4 molecules per asymmetric unit TheyecD structure was then refined to 13 Aring (PDB 1J2R)YecD and 1NBA are annotated in Swiss-Prot asbelonging to the isochorismatase protein familyHowever the superposition of active site region of theyecD 1NBA and 1IM5 structures reveals that acysteine residue essential to the isochorismatase en-zymatic activity is replaced by a glycine in yecD Inaddition yecD lacks a cis-peptide conserved in 1NBAand 1IM5 It is thus likely that yecD has a differentenzymatic activity Detailed sequence analyses arebeing performed to predict its function

YggV a nucleoside triphosphate phosphohydrolaseThe yggV structure was solved using the MADtechnique with one seleno-methionine Data were col-lected on the BM30 beam line at ESRF The structurewas refined to 18 Aring and was found to be stronglysimilar to PDB entries 2MJP (Nucleoside triphos-phate phosphohydrolase from Methanocaldococcusjannaschii) and 1EX2 (MAF protein from Bacillussubtilis) YggV crystals were thus grown in presenceof dGTP They diffract to 24 Aring resolution The struc-ture of yggV was also solved in the absence of dGTPby the Midwest Center for Structural Genomics (PDB1K7K) A multiple alignment of the homologous se-quences is currently analyzed in the context of thevarious available structures (free forms and com-plexed with nucleotides) Our phylogenomic analysespointed out putative partners to yggV suggesting thatits function in E coli might not be entirely under-stood

YhbO an intracellular proteaseThe yhbO structure was solved by molecular replace-ment using AmoRe with the PDB structure 1G2I (anintracellular protease from Pyrococcus horikoshii)The yhbO structure was subsequently refined to 20Aring resolution A nucleophile elbow motif as well as acysteine residue essential for the protease activity ofthe P horikoshii intracellular protease is conserved inyhbO The multimeric state of the 1G2I structure wasalso described as essential for the catalytic activitythe active site involving residues contributed by twomonomers Interestingly the interface between the

Table 2 Experimental matrix used for expression screening Thefirst column is the experiment number The second column corre-sponds to the 4-state strain variable BL21(DE3) pLysS RosettaOrigami and C41 The third column corresponds to the 3-stategrowth medium variable 2YT SB and TB media The fourth col-umn corresponds to the 3-state temperature variable 25 degC 37 degCand 42 degC

Condition

number

Strain Medium Temperature

1 Bl21 (DE3) pLysS SB 25

2 Rosetta (DE3) TB 25

3 Origami (DE3) 2YT 25

4 C41 (DE3) SB 25

5 Bl21 (DE3) pLysS TB 37

6 Rosetta (DE3) SB 37

7 Origami (DE3) TB 37

8 C41 (DE3) 2YT 37

9 Bl21 (DE3) pLysS 2YT 42

10 Rosetta (DE3) 2YT 42

11 Origami (DE3) SB 42

12 C41 (DE3) TB 42

151

two yhbO monomers as seen in the crystal structureis entirely different from the one found in the 1G2Istructure The functional assignment of the E coliyhbO protein is thus not yet certain The putativeprotease activity of yhbO is currently studied

YqhE significant difference in the active sitesThe yqhE structure was solved by molecular replace-ment using AmoRe with the PDB entry 1A80 ( a 25-diketo-D-gluconic acid reductase A structure fromCorynebacterium) The yqhE structure was then

Figure 4 Dot Blot results of the expression screening on 6 different targets The top 6 Dot blots correspond to the total protein expressionresults The 6 lower Dot blots correspond to the soluble protein expression measurements The numbers above each Dot Blot refer to thetargets Dot Blot results are ranked by decreasing temperature (left to right) and the strains indicated on the right The growth medium isindicated on the left of each Dot Blot

152

Table 3 Project status List and status of our targets as deposited in targetdb (httptargetdbpdborg) under the project name SGI-SG (BIGS)The first column indicates the target name (according to EcoGene [10]) the second column indicates the relevant genome The status (fromlsquoSelectedrsquo to lsquoCrystal Structure) of each target protein is shown in pink For each target arrows are used to indicates the next step to becompleted stars correspond to targets currently stopped and squares correspond to lsquoon holdrsquo targets (failure during the scale-up or crystal-lization processes) pending further testing

153

Table 3 (Continued)

154

refined to 193 Aring resolution (PDB 1MZR) YqhEbelongs to the vast NADPH-dependent aldo-ketoreductase family Multiple alignment was used tocompare the various sequences of the family Thisrevealed two sub-families including all eukaryoticsequences on one hand and all the prokaryoticsequences on the other hand Structurally the eukary-otic proteins differ from the prokaryotic ones by thepresence of much longer loops in the co-factor andligand binding regions These marked structural dif-ferences should make possible the design of smallinhibitory molecule with a good specificity for theprokaryotic targets

YdhF an NADPH-dependent oxidoreductase ofunknown specificityThe ydhF structure was solved using the MAD tech-nique with 8 seleno-methionines The structure wasthen refined to 25 Aring resolution using data collectedon the BM30 beam line at ESRF YdhF belongs to theNADPH-dependent oxidoreductase family Its struc-ture was compared to the other members of the familyusing multiple alignment YqhE shares 267 identi-cal residues over a 217-aa segment with its closeststructural homologue ydhF YdhF exhibits a se-quence insertion in a flexible loop not well resolvedin the structure This difference is not located near theco-factors and ligand binding sites but may still af-fects the enzyme specificity The structure of theNADP containing ydhF is currently under refinementusing data to 28 Aring resolution

YliB a putative n-peptide binding proteinThe yliB structure was solved using the MAD tech-nique with 20 seleno-methionines The structure wasthen refined to 23 Aring resolution using data collectedon the ID14-EH2 beam line at ESRF This membrane-anchored periplasmic protein shares 29 identicalresidues with 1DPE (a dipeptide binding protein fromE coli) and 1DPP (the same protein in complex withglycyl-L-leucine) The loops surrounding the ligand-binding pocket are shorter in the yliB structure com-pared to the homologous structures suggesting thatlarger peptides might be accommodated by yliB Wealso noticed the presence of zinc ion and of some re-sidual electron density in the region homologous tothe 1DPP dipeptide binding site We are currently pro-ducing crystals of yliB in the presence of variousdipeptide in order to verify its affinity for such sub-strates

YdiB a new structural foldThe ydiB structure was solved using the MAD tech-nique with 20 seleno-methionines The structurebuilding is still in progress using data collected at266 Aring resolution on the BM30 beam line at ESRFThe preliminary tracing suggests that the ydiB struc-ture might be an original fold which is confirmed bythe deposition in the PDB of the ydiB structure by theNortheast Structural Genomics Research Consortium(PDB 1NPD) and the Montreal-Kingston BacterialStructural Genomics Initiative (PDB 1O9B)

Conclusion

This structural genomics project is an opportunity todevelop new approaches for the various stages ofgene expression The use of the lsquoone tube reactionrsquoGATEWAY technology allowed the successful clon-ing of 972 of the selected ORFs The developmentof both in vitro and in vivo expression screening withthe efficient sampling of the various parameters influ-encing soluble protein expression allowed 80 of thetested ORFs to be expressed as soluble proteins (anadditional 18 being expressed as inclusion bodies)The automation of the screening using dot-blot rev-elation now allows us to process 12 targets weekly

In our hands the purification step is not a limitingstep since 100 of the 46 proteins that have reachedthat stage were successfully purified However it wasmore challenging to keep them intact and soluble dur-ing the following steps of concentration and bufferexchange required for crystallization The introduc-tion of buffer screening through dialysis controlled byUV scanning has proven very effective allowing theidentification of a suitable buffer for 95 of the pu-rified samples

Finally the quality control we imposed through theentire project albeit time consuming has proven use-ful to trace mistakes such as target mixing andorcross-contamination as well as protein degradationOn a more positive side it also provided informationon unexpected proteinprotein interactions andorligand binding eventually leading to improved crys-tallization protocols To this days 40 of the 46purified recombinant proteins have been throughbiophysical characterization (circular dichroiumlsm spec-troscopy and monodispersity assays using dynamiclight scattering) 32 have entered crystallization trialsand 31 have been sequenced Twelve proteins haveproduced crystals out of which 9 have resulted in us-

155

Figure 5 Protein structures solved in this study The structures are drawn using MOLSCRIPT [31] The YdiB structure was also solved bythe consortium

156

able diffraction data allowing 7 structures to besolved (Figure 5) This number is now expected togrow rapidly as more expressed gene products arereaching the purificationcrystallization stages

Acknowledgements

We thank our collaborators from Aventis Pharma DrsV Blanc E Remy M Black and J Hogdson for theirscientific input We also thank Drs J-L Ferrer FBorel P Carpentier A Thompson G Leonard and JMcCarthy for their help on the FIP BM14 ID29ID14 ESRF beamlines We thank P Pham-Trong(Greiner) for his help and gift of materials Thisproject is funded by a grant (014906055) from theFrench Ministry of Industry lsquopost genomics researchprogramrsquo

References

1 Stover CK Pham XQ Erwin AL Mizoguchi SDWarrener P Hickey MJ Brinkman FS Hufnagle WOKowalik DJ Lagrou M et al (2000) Nature 406959minus964

2 Cassell GH and Mekalanos J JAMA 285 (2001)601minus605

3 Projan S (2002) Curr Opin Pharmacol 2 513minus5224 Clements JM Beckett RP Brown A Catlin G Lobell

M Palan S Thomas W Whittaker M Wood S SalamaS Baker PJ Rodgers HF Barynin V Rice DW andHunter MG (2001) Antimicrob Agents Chemother 45563minus570

5 Barbachyn MR Brickner SJ Gadwood RC GarmonSA Grega KC Hutchinson DK Munesada K Reis-cher RJ Taniguchi M Thomasco LM Toops DS Ya-mada H Ford CW and Zurenko GE (1998) Adv ExpMed Biol 456 219minus238

6 Johnson AP (2001) Curr Opin Investig Drugs121691minus1701

7 Auckland C Teare L Cooke F Kaufmann ME WarnerM Jones G Bamford K Ayles H Johnson AP (2002)J Antimicrob Chemother 50 743minus746

8 Claverie JM Nature 2000 403 12minus12

9 Altschul SF Madden TL Schaffer AA Zhang JZhang Z Miller W and Lipman DJ (1997) Nucleic AcidsRes 25 3389minus3402

10 Rudd KE (2000) Nucleic Acids Res 28 60minus6411 Persson B and Argos P (1997) J Protein Chem 16

453minus45712 Hartley JL Temple GF and Brasch MA (2000) Genome

Res 10 1788minus179513 Monchois V Vincentelli R Deregnaucourt C Abergel C

and Claverie J-M (2002) In Cell-Free Translation Systems(Ed Spirin AS) Springer Verlag New York pp 197minus202

14 Audic S Lopez F Claverie J-M Poirot O and AbergelC (1997) Proteins 29 252minus257

15 Monchois V Abergel C Sturgis J Jeudy S and ClaverieJ-M (2001) J Biol Chem 276 18437minus18441

16 Bhat TN Bourne P Feng Z Gilliland G Jain S Rav-ichandran V Schneider BSchneider K Thanki N Weis-sig H Westbrook J and Berman HM (2001) Nucleic Ac-ids Res 29 214minus218

17 Navaza J (2001) Acta Crystallogr D57 1367minus137218 Liu Y Ogata CM and HendricksonWA (2001) Proc Natl

Acad Sci U S A 98 10648minus1065319 Hendrickson WA Horton JR and LeMaster DM (1990)

EMBO J 9 1665minus167220 Abergel C Bouveret E Claverie JM Brown K Riga

A Lazdunski C and Benedetti H (1999) Structure FoldDes 71291minus1300

21 Notredame C Higgins DG Heringa J (2000) J MolBiol 302 205minus217

22 Falquet L Pagni M Bucher P Hulo N Sigrist CJHofmann K and Bairoch A (2002) Nucleic Acids Res 30235minus238

23 Notredame C and Abergel C (2003) In Bioinformatics andGenomes (Andrade MA ed) Horizon Scientific Press Wy-mondhamUK pp 27minus49

24 Abergel C Robertson DL Claverie J-M (1999) J Virol73 751minus753

25 Marcotte EM Pellegrini M Thompson MJ Yeates TOand Eisenberg D (1999) Nature 402 83minus86

26 Enault F Suhre K Poirot O Abergel C and ClaverieJ-M (2003) Nucleic Acids Res In press

27 Madabushi S Yao H Marsh M Kristensen DM Phil-ippi A Sowa ME and Lichtarge O (2002) J Mol Biol316 139minus154

28 Klebe G (2000) J Mol Med 78 269minus28129 Apostolakis J and Caflisch A (1999) Comb Chem High

Throughput Screen 2 91minus10430 Sali A and Blundell T L (1993) J Mol Biol 234

779minus81531 Kraulis PJ (1991) J Appl Crystallogr 24 946minus950

157

Page 9: Structural genomics of highly conserved microbial genes of ...metabolomics.helmholtz-muenchen.de/suhre/... · By focusing a structural genomics project on such anonymous proteins,

defined using an incomplete factorial experimentaldesign (see Materials and Methods) A typical exam-ple of such a screen applied to 6 different targets isgiven in Figure 4

At this point out of the 108 ORFs cloned 94 havebeen passed through this screening 93 have been suc-cessfully expressed including 68 in a soluble formThe incomplete factorial approach demonstrated itspower since out of the 94 ORFs screened only 1 wasnot expressed and 25 failed the soluble expressionassay This systematic screening allowed 723 of thetargets to be recovered in a soluble form This is tobe compared to the 52 yield obtained through thefirst pass standard protocol

The scale up bottleneck

For a certain number of proteins we observed thatdirectly scaling up the culture from 4 ml to 1 litercould lead to insoluble expression This problem wasusually solved by simply replacing the 1-liter cultureby several 200-ml cultures

Protein purification

Protein purifications are not yet performed through anautomatic procedure The buffer most suitable forprotein concentration is determined through dialysisexperiments of the purified material monitored byUV spectra and taking into account the theoreticalisoelectric point of each protein The systematiccheck for N-terminal sequence and mass of each pu-rified protein proved extremely useful for debunkingvarious problems (including the mishandling ofsamples) We noticed frequent cases (1531) of spon-taneous removal of the extended His-tagged N-termi-nus inherent to the use of the GATEWAY system (17to 24 first residues removed) The final statistics onthis observation will only be available once all solu-ble proteins will have been purified and sequenced

A prerequisite to successful crystallization wasalso the determination of appropriate storage condi-tions for each purified targets using gel electrophore-sis over time (see Materials and Methods) Forunstable proteins crystallization trials were per-

Figure 2 (Continued)

149

Figure 3 (A) Distribution of known vs unknown genes according to their evolutionary conservation The 1300 most-conserved E coli genesare ranked according to their decreasing CI value (x-axis in parenthesis) and grouped in successive boxes of 50 (Gene rank) For each boxthe number of genes of unknown vs known function are shown in red or green respectively Functionally characterized genes dominate thedistribution for the high CI values but many anonymous genes also exhibit comparable evolutionary conservations(B) Distribution of predicted membrane vs soluble proteins according to their evolutionary conservation (decreasing CI value x-axis inparenthesis) The 317 highly conserved ORFs of unknown function have been divided into membrane-bound (green boxes) vs soluble (redboxes) after sequence analysis There is no overall bias although proteins predicted soluble dominate the distribution of the most conservedcandidates

150

formed on freshly purified samples or using materialstored at minus80 degC

Screening of alternative cloning contexts usinglinear PCR products

To quickly explore alternative cloning contexts (Tagtype and position) for recalcitrant targets we devel-oped an in vitro expression screening based on lineartemplates involving a two steps PCR and universalprimers Tested on 24 different E coli ORFs the lin-ear templates mimicking pDEST17 vector context(N-terminal His-tag) performed as efficiently as thecorresponding circular plasmid This approach canthus be applied to modify the standart pDEST17 ORFcontext into a tagtarget combination more suitablefor soluble expression Possible alternatives includeC-terminal His-tag GST MBP Thioredoxin Streptag and GFP

Overview of the structures obtained to date

YecD molecular replacement using homologymodelingThe interpretation of yecD diffraction data (collectedon the ID29 beam line at the ESRF synchrotron inGrenoble-France) was first attempted by molecularreplacement using AmoRe [17] with the 1NBA and

1IM5 PDB entries These proteins share less than25 identical residues with the yecD sequence andfailed to produce a solution We then build a yecDstructure model with MODELER [30] using theknown structures as templates The yecD model wasused for molecular replacement and produced a goodsolution with 4 molecules per asymmetric unit TheyecD structure was then refined to 13 Aring (PDB 1J2R)YecD and 1NBA are annotated in Swiss-Prot asbelonging to the isochorismatase protein familyHowever the superposition of active site region of theyecD 1NBA and 1IM5 structures reveals that acysteine residue essential to the isochorismatase en-zymatic activity is replaced by a glycine in yecD Inaddition yecD lacks a cis-peptide conserved in 1NBAand 1IM5 It is thus likely that yecD has a differentenzymatic activity Detailed sequence analyses arebeing performed to predict its function

YggV a nucleoside triphosphate phosphohydrolaseThe yggV structure was solved using the MADtechnique with one seleno-methionine Data were col-lected on the BM30 beam line at ESRF The structurewas refined to 18 Aring and was found to be stronglysimilar to PDB entries 2MJP (Nucleoside triphos-phate phosphohydrolase from Methanocaldococcusjannaschii) and 1EX2 (MAF protein from Bacillussubtilis) YggV crystals were thus grown in presenceof dGTP They diffract to 24 Aring resolution The struc-ture of yggV was also solved in the absence of dGTPby the Midwest Center for Structural Genomics (PDB1K7K) A multiple alignment of the homologous se-quences is currently analyzed in the context of thevarious available structures (free forms and com-plexed with nucleotides) Our phylogenomic analysespointed out putative partners to yggV suggesting thatits function in E coli might not be entirely under-stood

YhbO an intracellular proteaseThe yhbO structure was solved by molecular replace-ment using AmoRe with the PDB structure 1G2I (anintracellular protease from Pyrococcus horikoshii)The yhbO structure was subsequently refined to 20Aring resolution A nucleophile elbow motif as well as acysteine residue essential for the protease activity ofthe P horikoshii intracellular protease is conserved inyhbO The multimeric state of the 1G2I structure wasalso described as essential for the catalytic activitythe active site involving residues contributed by twomonomers Interestingly the interface between the

Table 2 Experimental matrix used for expression screening Thefirst column is the experiment number The second column corre-sponds to the 4-state strain variable BL21(DE3) pLysS RosettaOrigami and C41 The third column corresponds to the 3-stategrowth medium variable 2YT SB and TB media The fourth col-umn corresponds to the 3-state temperature variable 25 degC 37 degCand 42 degC

Condition

number

Strain Medium Temperature

1 Bl21 (DE3) pLysS SB 25

2 Rosetta (DE3) TB 25

3 Origami (DE3) 2YT 25

4 C41 (DE3) SB 25

5 Bl21 (DE3) pLysS TB 37

6 Rosetta (DE3) SB 37

7 Origami (DE3) TB 37

8 C41 (DE3) 2YT 37

9 Bl21 (DE3) pLysS 2YT 42

10 Rosetta (DE3) 2YT 42

11 Origami (DE3) SB 42

12 C41 (DE3) TB 42

151

two yhbO monomers as seen in the crystal structureis entirely different from the one found in the 1G2Istructure The functional assignment of the E coliyhbO protein is thus not yet certain The putativeprotease activity of yhbO is currently studied

YqhE significant difference in the active sitesThe yqhE structure was solved by molecular replace-ment using AmoRe with the PDB entry 1A80 ( a 25-diketo-D-gluconic acid reductase A structure fromCorynebacterium) The yqhE structure was then

Figure 4 Dot Blot results of the expression screening on 6 different targets The top 6 Dot blots correspond to the total protein expressionresults The 6 lower Dot blots correspond to the soluble protein expression measurements The numbers above each Dot Blot refer to thetargets Dot Blot results are ranked by decreasing temperature (left to right) and the strains indicated on the right The growth medium isindicated on the left of each Dot Blot

152

Table 3 Project status List and status of our targets as deposited in targetdb (httptargetdbpdborg) under the project name SGI-SG (BIGS)The first column indicates the target name (according to EcoGene [10]) the second column indicates the relevant genome The status (fromlsquoSelectedrsquo to lsquoCrystal Structure) of each target protein is shown in pink For each target arrows are used to indicates the next step to becompleted stars correspond to targets currently stopped and squares correspond to lsquoon holdrsquo targets (failure during the scale-up or crystal-lization processes) pending further testing

153

Table 3 (Continued)

154

refined to 193 Aring resolution (PDB 1MZR) YqhEbelongs to the vast NADPH-dependent aldo-ketoreductase family Multiple alignment was used tocompare the various sequences of the family Thisrevealed two sub-families including all eukaryoticsequences on one hand and all the prokaryoticsequences on the other hand Structurally the eukary-otic proteins differ from the prokaryotic ones by thepresence of much longer loops in the co-factor andligand binding regions These marked structural dif-ferences should make possible the design of smallinhibitory molecule with a good specificity for theprokaryotic targets

YdhF an NADPH-dependent oxidoreductase ofunknown specificityThe ydhF structure was solved using the MAD tech-nique with 8 seleno-methionines The structure wasthen refined to 25 Aring resolution using data collectedon the BM30 beam line at ESRF YdhF belongs to theNADPH-dependent oxidoreductase family Its struc-ture was compared to the other members of the familyusing multiple alignment YqhE shares 267 identi-cal residues over a 217-aa segment with its closeststructural homologue ydhF YdhF exhibits a se-quence insertion in a flexible loop not well resolvedin the structure This difference is not located near theco-factors and ligand binding sites but may still af-fects the enzyme specificity The structure of theNADP containing ydhF is currently under refinementusing data to 28 Aring resolution

YliB a putative n-peptide binding proteinThe yliB structure was solved using the MAD tech-nique with 20 seleno-methionines The structure wasthen refined to 23 Aring resolution using data collectedon the ID14-EH2 beam line at ESRF This membrane-anchored periplasmic protein shares 29 identicalresidues with 1DPE (a dipeptide binding protein fromE coli) and 1DPP (the same protein in complex withglycyl-L-leucine) The loops surrounding the ligand-binding pocket are shorter in the yliB structure com-pared to the homologous structures suggesting thatlarger peptides might be accommodated by yliB Wealso noticed the presence of zinc ion and of some re-sidual electron density in the region homologous tothe 1DPP dipeptide binding site We are currently pro-ducing crystals of yliB in the presence of variousdipeptide in order to verify its affinity for such sub-strates

YdiB a new structural foldThe ydiB structure was solved using the MAD tech-nique with 20 seleno-methionines The structurebuilding is still in progress using data collected at266 Aring resolution on the BM30 beam line at ESRFThe preliminary tracing suggests that the ydiB struc-ture might be an original fold which is confirmed bythe deposition in the PDB of the ydiB structure by theNortheast Structural Genomics Research Consortium(PDB 1NPD) and the Montreal-Kingston BacterialStructural Genomics Initiative (PDB 1O9B)

Conclusion

This structural genomics project is an opportunity todevelop new approaches for the various stages ofgene expression The use of the lsquoone tube reactionrsquoGATEWAY technology allowed the successful clon-ing of 972 of the selected ORFs The developmentof both in vitro and in vivo expression screening withthe efficient sampling of the various parameters influ-encing soluble protein expression allowed 80 of thetested ORFs to be expressed as soluble proteins (anadditional 18 being expressed as inclusion bodies)The automation of the screening using dot-blot rev-elation now allows us to process 12 targets weekly

In our hands the purification step is not a limitingstep since 100 of the 46 proteins that have reachedthat stage were successfully purified However it wasmore challenging to keep them intact and soluble dur-ing the following steps of concentration and bufferexchange required for crystallization The introduc-tion of buffer screening through dialysis controlled byUV scanning has proven very effective allowing theidentification of a suitable buffer for 95 of the pu-rified samples

Finally the quality control we imposed through theentire project albeit time consuming has proven use-ful to trace mistakes such as target mixing andorcross-contamination as well as protein degradationOn a more positive side it also provided informationon unexpected proteinprotein interactions andorligand binding eventually leading to improved crys-tallization protocols To this days 40 of the 46purified recombinant proteins have been throughbiophysical characterization (circular dichroiumlsm spec-troscopy and monodispersity assays using dynamiclight scattering) 32 have entered crystallization trialsand 31 have been sequenced Twelve proteins haveproduced crystals out of which 9 have resulted in us-

155

Figure 5 Protein structures solved in this study The structures are drawn using MOLSCRIPT [31] The YdiB structure was also solved bythe consortium

156

able diffraction data allowing 7 structures to besolved (Figure 5) This number is now expected togrow rapidly as more expressed gene products arereaching the purificationcrystallization stages

Acknowledgements

We thank our collaborators from Aventis Pharma DrsV Blanc E Remy M Black and J Hogdson for theirscientific input We also thank Drs J-L Ferrer FBorel P Carpentier A Thompson G Leonard and JMcCarthy for their help on the FIP BM14 ID29ID14 ESRF beamlines We thank P Pham-Trong(Greiner) for his help and gift of materials Thisproject is funded by a grant (014906055) from theFrench Ministry of Industry lsquopost genomics researchprogramrsquo

References

1 Stover CK Pham XQ Erwin AL Mizoguchi SDWarrener P Hickey MJ Brinkman FS Hufnagle WOKowalik DJ Lagrou M et al (2000) Nature 406959minus964

2 Cassell GH and Mekalanos J JAMA 285 (2001)601minus605

3 Projan S (2002) Curr Opin Pharmacol 2 513minus5224 Clements JM Beckett RP Brown A Catlin G Lobell

M Palan S Thomas W Whittaker M Wood S SalamaS Baker PJ Rodgers HF Barynin V Rice DW andHunter MG (2001) Antimicrob Agents Chemother 45563minus570

5 Barbachyn MR Brickner SJ Gadwood RC GarmonSA Grega KC Hutchinson DK Munesada K Reis-cher RJ Taniguchi M Thomasco LM Toops DS Ya-mada H Ford CW and Zurenko GE (1998) Adv ExpMed Biol 456 219minus238

6 Johnson AP (2001) Curr Opin Investig Drugs121691minus1701

7 Auckland C Teare L Cooke F Kaufmann ME WarnerM Jones G Bamford K Ayles H Johnson AP (2002)J Antimicrob Chemother 50 743minus746

8 Claverie JM Nature 2000 403 12minus12

9 Altschul SF Madden TL Schaffer AA Zhang JZhang Z Miller W and Lipman DJ (1997) Nucleic AcidsRes 25 3389minus3402

10 Rudd KE (2000) Nucleic Acids Res 28 60minus6411 Persson B and Argos P (1997) J Protein Chem 16

453minus45712 Hartley JL Temple GF and Brasch MA (2000) Genome

Res 10 1788minus179513 Monchois V Vincentelli R Deregnaucourt C Abergel C

and Claverie J-M (2002) In Cell-Free Translation Systems(Ed Spirin AS) Springer Verlag New York pp 197minus202

14 Audic S Lopez F Claverie J-M Poirot O and AbergelC (1997) Proteins 29 252minus257

15 Monchois V Abergel C Sturgis J Jeudy S and ClaverieJ-M (2001) J Biol Chem 276 18437minus18441

16 Bhat TN Bourne P Feng Z Gilliland G Jain S Rav-ichandran V Schneider BSchneider K Thanki N Weis-sig H Westbrook J and Berman HM (2001) Nucleic Ac-ids Res 29 214minus218

17 Navaza J (2001) Acta Crystallogr D57 1367minus137218 Liu Y Ogata CM and HendricksonWA (2001) Proc Natl

Acad Sci U S A 98 10648minus1065319 Hendrickson WA Horton JR and LeMaster DM (1990)

EMBO J 9 1665minus167220 Abergel C Bouveret E Claverie JM Brown K Riga

A Lazdunski C and Benedetti H (1999) Structure FoldDes 71291minus1300

21 Notredame C Higgins DG Heringa J (2000) J MolBiol 302 205minus217

22 Falquet L Pagni M Bucher P Hulo N Sigrist CJHofmann K and Bairoch A (2002) Nucleic Acids Res 30235minus238

23 Notredame C and Abergel C (2003) In Bioinformatics andGenomes (Andrade MA ed) Horizon Scientific Press Wy-mondhamUK pp 27minus49

24 Abergel C Robertson DL Claverie J-M (1999) J Virol73 751minus753

25 Marcotte EM Pellegrini M Thompson MJ Yeates TOand Eisenberg D (1999) Nature 402 83minus86

26 Enault F Suhre K Poirot O Abergel C and ClaverieJ-M (2003) Nucleic Acids Res In press

27 Madabushi S Yao H Marsh M Kristensen DM Phil-ippi A Sowa ME and Lichtarge O (2002) J Mol Biol316 139minus154

28 Klebe G (2000) J Mol Med 78 269minus28129 Apostolakis J and Caflisch A (1999) Comb Chem High

Throughput Screen 2 91minus10430 Sali A and Blundell T L (1993) J Mol Biol 234

779minus81531 Kraulis PJ (1991) J Appl Crystallogr 24 946minus950

157

Page 10: Structural genomics of highly conserved microbial genes of ...metabolomics.helmholtz-muenchen.de/suhre/... · By focusing a structural genomics project on such anonymous proteins,

Figure 3 (A) Distribution of known vs unknown genes according to their evolutionary conservation The 1300 most-conserved E coli genesare ranked according to their decreasing CI value (x-axis in parenthesis) and grouped in successive boxes of 50 (Gene rank) For each boxthe number of genes of unknown vs known function are shown in red or green respectively Functionally characterized genes dominate thedistribution for the high CI values but many anonymous genes also exhibit comparable evolutionary conservations(B) Distribution of predicted membrane vs soluble proteins according to their evolutionary conservation (decreasing CI value x-axis inparenthesis) The 317 highly conserved ORFs of unknown function have been divided into membrane-bound (green boxes) vs soluble (redboxes) after sequence analysis There is no overall bias although proteins predicted soluble dominate the distribution of the most conservedcandidates

150

formed on freshly purified samples or using materialstored at minus80 degC

Screening of alternative cloning contexts usinglinear PCR products

To quickly explore alternative cloning contexts (Tagtype and position) for recalcitrant targets we devel-oped an in vitro expression screening based on lineartemplates involving a two steps PCR and universalprimers Tested on 24 different E coli ORFs the lin-ear templates mimicking pDEST17 vector context(N-terminal His-tag) performed as efficiently as thecorresponding circular plasmid This approach canthus be applied to modify the standart pDEST17 ORFcontext into a tagtarget combination more suitablefor soluble expression Possible alternatives includeC-terminal His-tag GST MBP Thioredoxin Streptag and GFP

Overview of the structures obtained to date

YecD molecular replacement using homologymodelingThe interpretation of yecD diffraction data (collectedon the ID29 beam line at the ESRF synchrotron inGrenoble-France) was first attempted by molecularreplacement using AmoRe [17] with the 1NBA and

1IM5 PDB entries These proteins share less than25 identical residues with the yecD sequence andfailed to produce a solution We then build a yecDstructure model with MODELER [30] using theknown structures as templates The yecD model wasused for molecular replacement and produced a goodsolution with 4 molecules per asymmetric unit TheyecD structure was then refined to 13 Aring (PDB 1J2R)YecD and 1NBA are annotated in Swiss-Prot asbelonging to the isochorismatase protein familyHowever the superposition of active site region of theyecD 1NBA and 1IM5 structures reveals that acysteine residue essential to the isochorismatase en-zymatic activity is replaced by a glycine in yecD Inaddition yecD lacks a cis-peptide conserved in 1NBAand 1IM5 It is thus likely that yecD has a differentenzymatic activity Detailed sequence analyses arebeing performed to predict its function

YggV a nucleoside triphosphate phosphohydrolaseThe yggV structure was solved using the MADtechnique with one seleno-methionine Data were col-lected on the BM30 beam line at ESRF The structurewas refined to 18 Aring and was found to be stronglysimilar to PDB entries 2MJP (Nucleoside triphos-phate phosphohydrolase from Methanocaldococcusjannaschii) and 1EX2 (MAF protein from Bacillussubtilis) YggV crystals were thus grown in presenceof dGTP They diffract to 24 Aring resolution The struc-ture of yggV was also solved in the absence of dGTPby the Midwest Center for Structural Genomics (PDB1K7K) A multiple alignment of the homologous se-quences is currently analyzed in the context of thevarious available structures (free forms and com-plexed with nucleotides) Our phylogenomic analysespointed out putative partners to yggV suggesting thatits function in E coli might not be entirely under-stood

YhbO an intracellular proteaseThe yhbO structure was solved by molecular replace-ment using AmoRe with the PDB structure 1G2I (anintracellular protease from Pyrococcus horikoshii)The yhbO structure was subsequently refined to 20Aring resolution A nucleophile elbow motif as well as acysteine residue essential for the protease activity ofthe P horikoshii intracellular protease is conserved inyhbO The multimeric state of the 1G2I structure wasalso described as essential for the catalytic activitythe active site involving residues contributed by twomonomers Interestingly the interface between the

Table 2 Experimental matrix used for expression screening Thefirst column is the experiment number The second column corre-sponds to the 4-state strain variable BL21(DE3) pLysS RosettaOrigami and C41 The third column corresponds to the 3-stategrowth medium variable 2YT SB and TB media The fourth col-umn corresponds to the 3-state temperature variable 25 degC 37 degCand 42 degC

Condition

number

Strain Medium Temperature

1 Bl21 (DE3) pLysS SB 25

2 Rosetta (DE3) TB 25

3 Origami (DE3) 2YT 25

4 C41 (DE3) SB 25

5 Bl21 (DE3) pLysS TB 37

6 Rosetta (DE3) SB 37

7 Origami (DE3) TB 37

8 C41 (DE3) 2YT 37

9 Bl21 (DE3) pLysS 2YT 42

10 Rosetta (DE3) 2YT 42

11 Origami (DE3) SB 42

12 C41 (DE3) TB 42

151

two yhbO monomers as seen in the crystal structureis entirely different from the one found in the 1G2Istructure The functional assignment of the E coliyhbO protein is thus not yet certain The putativeprotease activity of yhbO is currently studied

YqhE significant difference in the active sitesThe yqhE structure was solved by molecular replace-ment using AmoRe with the PDB entry 1A80 ( a 25-diketo-D-gluconic acid reductase A structure fromCorynebacterium) The yqhE structure was then

Figure 4 Dot Blot results of the expression screening on 6 different targets The top 6 Dot blots correspond to the total protein expressionresults The 6 lower Dot blots correspond to the soluble protein expression measurements The numbers above each Dot Blot refer to thetargets Dot Blot results are ranked by decreasing temperature (left to right) and the strains indicated on the right The growth medium isindicated on the left of each Dot Blot

152

Table 3 Project status List and status of our targets as deposited in targetdb (httptargetdbpdborg) under the project name SGI-SG (BIGS)The first column indicates the target name (according to EcoGene [10]) the second column indicates the relevant genome The status (fromlsquoSelectedrsquo to lsquoCrystal Structure) of each target protein is shown in pink For each target arrows are used to indicates the next step to becompleted stars correspond to targets currently stopped and squares correspond to lsquoon holdrsquo targets (failure during the scale-up or crystal-lization processes) pending further testing

153

Table 3 (Continued)

154

refined to 193 Aring resolution (PDB 1MZR) YqhEbelongs to the vast NADPH-dependent aldo-ketoreductase family Multiple alignment was used tocompare the various sequences of the family Thisrevealed two sub-families including all eukaryoticsequences on one hand and all the prokaryoticsequences on the other hand Structurally the eukary-otic proteins differ from the prokaryotic ones by thepresence of much longer loops in the co-factor andligand binding regions These marked structural dif-ferences should make possible the design of smallinhibitory molecule with a good specificity for theprokaryotic targets

YdhF an NADPH-dependent oxidoreductase ofunknown specificityThe ydhF structure was solved using the MAD tech-nique with 8 seleno-methionines The structure wasthen refined to 25 Aring resolution using data collectedon the BM30 beam line at ESRF YdhF belongs to theNADPH-dependent oxidoreductase family Its struc-ture was compared to the other members of the familyusing multiple alignment YqhE shares 267 identi-cal residues over a 217-aa segment with its closeststructural homologue ydhF YdhF exhibits a se-quence insertion in a flexible loop not well resolvedin the structure This difference is not located near theco-factors and ligand binding sites but may still af-fects the enzyme specificity The structure of theNADP containing ydhF is currently under refinementusing data to 28 Aring resolution

YliB a putative n-peptide binding proteinThe yliB structure was solved using the MAD tech-nique with 20 seleno-methionines The structure wasthen refined to 23 Aring resolution using data collectedon the ID14-EH2 beam line at ESRF This membrane-anchored periplasmic protein shares 29 identicalresidues with 1DPE (a dipeptide binding protein fromE coli) and 1DPP (the same protein in complex withglycyl-L-leucine) The loops surrounding the ligand-binding pocket are shorter in the yliB structure com-pared to the homologous structures suggesting thatlarger peptides might be accommodated by yliB Wealso noticed the presence of zinc ion and of some re-sidual electron density in the region homologous tothe 1DPP dipeptide binding site We are currently pro-ducing crystals of yliB in the presence of variousdipeptide in order to verify its affinity for such sub-strates

YdiB a new structural foldThe ydiB structure was solved using the MAD tech-nique with 20 seleno-methionines The structurebuilding is still in progress using data collected at266 Aring resolution on the BM30 beam line at ESRFThe preliminary tracing suggests that the ydiB struc-ture might be an original fold which is confirmed bythe deposition in the PDB of the ydiB structure by theNortheast Structural Genomics Research Consortium(PDB 1NPD) and the Montreal-Kingston BacterialStructural Genomics Initiative (PDB 1O9B)

Conclusion

This structural genomics project is an opportunity todevelop new approaches for the various stages ofgene expression The use of the lsquoone tube reactionrsquoGATEWAY technology allowed the successful clon-ing of 972 of the selected ORFs The developmentof both in vitro and in vivo expression screening withthe efficient sampling of the various parameters influ-encing soluble protein expression allowed 80 of thetested ORFs to be expressed as soluble proteins (anadditional 18 being expressed as inclusion bodies)The automation of the screening using dot-blot rev-elation now allows us to process 12 targets weekly

In our hands the purification step is not a limitingstep since 100 of the 46 proteins that have reachedthat stage were successfully purified However it wasmore challenging to keep them intact and soluble dur-ing the following steps of concentration and bufferexchange required for crystallization The introduc-tion of buffer screening through dialysis controlled byUV scanning has proven very effective allowing theidentification of a suitable buffer for 95 of the pu-rified samples

Finally the quality control we imposed through theentire project albeit time consuming has proven use-ful to trace mistakes such as target mixing andorcross-contamination as well as protein degradationOn a more positive side it also provided informationon unexpected proteinprotein interactions andorligand binding eventually leading to improved crys-tallization protocols To this days 40 of the 46purified recombinant proteins have been throughbiophysical characterization (circular dichroiumlsm spec-troscopy and monodispersity assays using dynamiclight scattering) 32 have entered crystallization trialsand 31 have been sequenced Twelve proteins haveproduced crystals out of which 9 have resulted in us-

155

Figure 5 Protein structures solved in this study The structures are drawn using MOLSCRIPT [31] The YdiB structure was also solved bythe consortium

156

able diffraction data allowing 7 structures to besolved (Figure 5) This number is now expected togrow rapidly as more expressed gene products arereaching the purificationcrystallization stages

Acknowledgements

We thank our collaborators from Aventis Pharma DrsV Blanc E Remy M Black and J Hogdson for theirscientific input We also thank Drs J-L Ferrer FBorel P Carpentier A Thompson G Leonard and JMcCarthy for their help on the FIP BM14 ID29ID14 ESRF beamlines We thank P Pham-Trong(Greiner) for his help and gift of materials Thisproject is funded by a grant (014906055) from theFrench Ministry of Industry lsquopost genomics researchprogramrsquo

References

1 Stover CK Pham XQ Erwin AL Mizoguchi SDWarrener P Hickey MJ Brinkman FS Hufnagle WOKowalik DJ Lagrou M et al (2000) Nature 406959minus964

2 Cassell GH and Mekalanos J JAMA 285 (2001)601minus605

3 Projan S (2002) Curr Opin Pharmacol 2 513minus5224 Clements JM Beckett RP Brown A Catlin G Lobell

M Palan S Thomas W Whittaker M Wood S SalamaS Baker PJ Rodgers HF Barynin V Rice DW andHunter MG (2001) Antimicrob Agents Chemother 45563minus570

5 Barbachyn MR Brickner SJ Gadwood RC GarmonSA Grega KC Hutchinson DK Munesada K Reis-cher RJ Taniguchi M Thomasco LM Toops DS Ya-mada H Ford CW and Zurenko GE (1998) Adv ExpMed Biol 456 219minus238

6 Johnson AP (2001) Curr Opin Investig Drugs121691minus1701

7 Auckland C Teare L Cooke F Kaufmann ME WarnerM Jones G Bamford K Ayles H Johnson AP (2002)J Antimicrob Chemother 50 743minus746

8 Claverie JM Nature 2000 403 12minus12

9 Altschul SF Madden TL Schaffer AA Zhang JZhang Z Miller W and Lipman DJ (1997) Nucleic AcidsRes 25 3389minus3402

10 Rudd KE (2000) Nucleic Acids Res 28 60minus6411 Persson B and Argos P (1997) J Protein Chem 16

453minus45712 Hartley JL Temple GF and Brasch MA (2000) Genome

Res 10 1788minus179513 Monchois V Vincentelli R Deregnaucourt C Abergel C

and Claverie J-M (2002) In Cell-Free Translation Systems(Ed Spirin AS) Springer Verlag New York pp 197minus202

14 Audic S Lopez F Claverie J-M Poirot O and AbergelC (1997) Proteins 29 252minus257

15 Monchois V Abergel C Sturgis J Jeudy S and ClaverieJ-M (2001) J Biol Chem 276 18437minus18441

16 Bhat TN Bourne P Feng Z Gilliland G Jain S Rav-ichandran V Schneider BSchneider K Thanki N Weis-sig H Westbrook J and Berman HM (2001) Nucleic Ac-ids Res 29 214minus218

17 Navaza J (2001) Acta Crystallogr D57 1367minus137218 Liu Y Ogata CM and HendricksonWA (2001) Proc Natl

Acad Sci U S A 98 10648minus1065319 Hendrickson WA Horton JR and LeMaster DM (1990)

EMBO J 9 1665minus167220 Abergel C Bouveret E Claverie JM Brown K Riga

A Lazdunski C and Benedetti H (1999) Structure FoldDes 71291minus1300

21 Notredame C Higgins DG Heringa J (2000) J MolBiol 302 205minus217

22 Falquet L Pagni M Bucher P Hulo N Sigrist CJHofmann K and Bairoch A (2002) Nucleic Acids Res 30235minus238

23 Notredame C and Abergel C (2003) In Bioinformatics andGenomes (Andrade MA ed) Horizon Scientific Press Wy-mondhamUK pp 27minus49

24 Abergel C Robertson DL Claverie J-M (1999) J Virol73 751minus753

25 Marcotte EM Pellegrini M Thompson MJ Yeates TOand Eisenberg D (1999) Nature 402 83minus86

26 Enault F Suhre K Poirot O Abergel C and ClaverieJ-M (2003) Nucleic Acids Res In press

27 Madabushi S Yao H Marsh M Kristensen DM Phil-ippi A Sowa ME and Lichtarge O (2002) J Mol Biol316 139minus154

28 Klebe G (2000) J Mol Med 78 269minus28129 Apostolakis J and Caflisch A (1999) Comb Chem High

Throughput Screen 2 91minus10430 Sali A and Blundell T L (1993) J Mol Biol 234

779minus81531 Kraulis PJ (1991) J Appl Crystallogr 24 946minus950

157

Page 11: Structural genomics of highly conserved microbial genes of ...metabolomics.helmholtz-muenchen.de/suhre/... · By focusing a structural genomics project on such anonymous proteins,

formed on freshly purified samples or using materialstored at minus80 degC

Screening of alternative cloning contexts usinglinear PCR products

To quickly explore alternative cloning contexts (Tagtype and position) for recalcitrant targets we devel-oped an in vitro expression screening based on lineartemplates involving a two steps PCR and universalprimers Tested on 24 different E coli ORFs the lin-ear templates mimicking pDEST17 vector context(N-terminal His-tag) performed as efficiently as thecorresponding circular plasmid This approach canthus be applied to modify the standart pDEST17 ORFcontext into a tagtarget combination more suitablefor soluble expression Possible alternatives includeC-terminal His-tag GST MBP Thioredoxin Streptag and GFP

Overview of the structures obtained to date

YecD molecular replacement using homologymodelingThe interpretation of yecD diffraction data (collectedon the ID29 beam line at the ESRF synchrotron inGrenoble-France) was first attempted by molecularreplacement using AmoRe [17] with the 1NBA and

1IM5 PDB entries These proteins share less than25 identical residues with the yecD sequence andfailed to produce a solution We then build a yecDstructure model with MODELER [30] using theknown structures as templates The yecD model wasused for molecular replacement and produced a goodsolution with 4 molecules per asymmetric unit TheyecD structure was then refined to 13 Aring (PDB 1J2R)YecD and 1NBA are annotated in Swiss-Prot asbelonging to the isochorismatase protein familyHowever the superposition of active site region of theyecD 1NBA and 1IM5 structures reveals that acysteine residue essential to the isochorismatase en-zymatic activity is replaced by a glycine in yecD Inaddition yecD lacks a cis-peptide conserved in 1NBAand 1IM5 It is thus likely that yecD has a differentenzymatic activity Detailed sequence analyses arebeing performed to predict its function

YggV a nucleoside triphosphate phosphohydrolaseThe yggV structure was solved using the MADtechnique with one seleno-methionine Data were col-lected on the BM30 beam line at ESRF The structurewas refined to 18 Aring and was found to be stronglysimilar to PDB entries 2MJP (Nucleoside triphos-phate phosphohydrolase from Methanocaldococcusjannaschii) and 1EX2 (MAF protein from Bacillussubtilis) YggV crystals were thus grown in presenceof dGTP They diffract to 24 Aring resolution The struc-ture of yggV was also solved in the absence of dGTPby the Midwest Center for Structural Genomics (PDB1K7K) A multiple alignment of the homologous se-quences is currently analyzed in the context of thevarious available structures (free forms and com-plexed with nucleotides) Our phylogenomic analysespointed out putative partners to yggV suggesting thatits function in E coli might not be entirely under-stood

YhbO an intracellular proteaseThe yhbO structure was solved by molecular replace-ment using AmoRe with the PDB structure 1G2I (anintracellular protease from Pyrococcus horikoshii)The yhbO structure was subsequently refined to 20Aring resolution A nucleophile elbow motif as well as acysteine residue essential for the protease activity ofthe P horikoshii intracellular protease is conserved inyhbO The multimeric state of the 1G2I structure wasalso described as essential for the catalytic activitythe active site involving residues contributed by twomonomers Interestingly the interface between the

Table 2 Experimental matrix used for expression screening Thefirst column is the experiment number The second column corre-sponds to the 4-state strain variable BL21(DE3) pLysS RosettaOrigami and C41 The third column corresponds to the 3-stategrowth medium variable 2YT SB and TB media The fourth col-umn corresponds to the 3-state temperature variable 25 degC 37 degCand 42 degC

Condition

number

Strain Medium Temperature

1 Bl21 (DE3) pLysS SB 25

2 Rosetta (DE3) TB 25

3 Origami (DE3) 2YT 25

4 C41 (DE3) SB 25

5 Bl21 (DE3) pLysS TB 37

6 Rosetta (DE3) SB 37

7 Origami (DE3) TB 37

8 C41 (DE3) 2YT 37

9 Bl21 (DE3) pLysS 2YT 42

10 Rosetta (DE3) 2YT 42

11 Origami (DE3) SB 42

12 C41 (DE3) TB 42

151

two yhbO monomers as seen in the crystal structureis entirely different from the one found in the 1G2Istructure The functional assignment of the E coliyhbO protein is thus not yet certain The putativeprotease activity of yhbO is currently studied

YqhE significant difference in the active sitesThe yqhE structure was solved by molecular replace-ment using AmoRe with the PDB entry 1A80 ( a 25-diketo-D-gluconic acid reductase A structure fromCorynebacterium) The yqhE structure was then

Figure 4 Dot Blot results of the expression screening on 6 different targets The top 6 Dot blots correspond to the total protein expressionresults The 6 lower Dot blots correspond to the soluble protein expression measurements The numbers above each Dot Blot refer to thetargets Dot Blot results are ranked by decreasing temperature (left to right) and the strains indicated on the right The growth medium isindicated on the left of each Dot Blot

152

Table 3 Project status List and status of our targets as deposited in targetdb (httptargetdbpdborg) under the project name SGI-SG (BIGS)The first column indicates the target name (according to EcoGene [10]) the second column indicates the relevant genome The status (fromlsquoSelectedrsquo to lsquoCrystal Structure) of each target protein is shown in pink For each target arrows are used to indicates the next step to becompleted stars correspond to targets currently stopped and squares correspond to lsquoon holdrsquo targets (failure during the scale-up or crystal-lization processes) pending further testing

153

Table 3 (Continued)

154

refined to 193 Aring resolution (PDB 1MZR) YqhEbelongs to the vast NADPH-dependent aldo-ketoreductase family Multiple alignment was used tocompare the various sequences of the family Thisrevealed two sub-families including all eukaryoticsequences on one hand and all the prokaryoticsequences on the other hand Structurally the eukary-otic proteins differ from the prokaryotic ones by thepresence of much longer loops in the co-factor andligand binding regions These marked structural dif-ferences should make possible the design of smallinhibitory molecule with a good specificity for theprokaryotic targets

YdhF an NADPH-dependent oxidoreductase ofunknown specificityThe ydhF structure was solved using the MAD tech-nique with 8 seleno-methionines The structure wasthen refined to 25 Aring resolution using data collectedon the BM30 beam line at ESRF YdhF belongs to theNADPH-dependent oxidoreductase family Its struc-ture was compared to the other members of the familyusing multiple alignment YqhE shares 267 identi-cal residues over a 217-aa segment with its closeststructural homologue ydhF YdhF exhibits a se-quence insertion in a flexible loop not well resolvedin the structure This difference is not located near theco-factors and ligand binding sites but may still af-fects the enzyme specificity The structure of theNADP containing ydhF is currently under refinementusing data to 28 Aring resolution

YliB a putative n-peptide binding proteinThe yliB structure was solved using the MAD tech-nique with 20 seleno-methionines The structure wasthen refined to 23 Aring resolution using data collectedon the ID14-EH2 beam line at ESRF This membrane-anchored periplasmic protein shares 29 identicalresidues with 1DPE (a dipeptide binding protein fromE coli) and 1DPP (the same protein in complex withglycyl-L-leucine) The loops surrounding the ligand-binding pocket are shorter in the yliB structure com-pared to the homologous structures suggesting thatlarger peptides might be accommodated by yliB Wealso noticed the presence of zinc ion and of some re-sidual electron density in the region homologous tothe 1DPP dipeptide binding site We are currently pro-ducing crystals of yliB in the presence of variousdipeptide in order to verify its affinity for such sub-strates

YdiB a new structural foldThe ydiB structure was solved using the MAD tech-nique with 20 seleno-methionines The structurebuilding is still in progress using data collected at266 Aring resolution on the BM30 beam line at ESRFThe preliminary tracing suggests that the ydiB struc-ture might be an original fold which is confirmed bythe deposition in the PDB of the ydiB structure by theNortheast Structural Genomics Research Consortium(PDB 1NPD) and the Montreal-Kingston BacterialStructural Genomics Initiative (PDB 1O9B)

Conclusion

This structural genomics project is an opportunity todevelop new approaches for the various stages ofgene expression The use of the lsquoone tube reactionrsquoGATEWAY technology allowed the successful clon-ing of 972 of the selected ORFs The developmentof both in vitro and in vivo expression screening withthe efficient sampling of the various parameters influ-encing soluble protein expression allowed 80 of thetested ORFs to be expressed as soluble proteins (anadditional 18 being expressed as inclusion bodies)The automation of the screening using dot-blot rev-elation now allows us to process 12 targets weekly

In our hands the purification step is not a limitingstep since 100 of the 46 proteins that have reachedthat stage were successfully purified However it wasmore challenging to keep them intact and soluble dur-ing the following steps of concentration and bufferexchange required for crystallization The introduc-tion of buffer screening through dialysis controlled byUV scanning has proven very effective allowing theidentification of a suitable buffer for 95 of the pu-rified samples

Finally the quality control we imposed through theentire project albeit time consuming has proven use-ful to trace mistakes such as target mixing andorcross-contamination as well as protein degradationOn a more positive side it also provided informationon unexpected proteinprotein interactions andorligand binding eventually leading to improved crys-tallization protocols To this days 40 of the 46purified recombinant proteins have been throughbiophysical characterization (circular dichroiumlsm spec-troscopy and monodispersity assays using dynamiclight scattering) 32 have entered crystallization trialsand 31 have been sequenced Twelve proteins haveproduced crystals out of which 9 have resulted in us-

155

Figure 5 Protein structures solved in this study The structures are drawn using MOLSCRIPT [31] The YdiB structure was also solved bythe consortium

156

able diffraction data allowing 7 structures to besolved (Figure 5) This number is now expected togrow rapidly as more expressed gene products arereaching the purificationcrystallization stages

Acknowledgements

We thank our collaborators from Aventis Pharma DrsV Blanc E Remy M Black and J Hogdson for theirscientific input We also thank Drs J-L Ferrer FBorel P Carpentier A Thompson G Leonard and JMcCarthy for their help on the FIP BM14 ID29ID14 ESRF beamlines We thank P Pham-Trong(Greiner) for his help and gift of materials Thisproject is funded by a grant (014906055) from theFrench Ministry of Industry lsquopost genomics researchprogramrsquo

References

1 Stover CK Pham XQ Erwin AL Mizoguchi SDWarrener P Hickey MJ Brinkman FS Hufnagle WOKowalik DJ Lagrou M et al (2000) Nature 406959minus964

2 Cassell GH and Mekalanos J JAMA 285 (2001)601minus605

3 Projan S (2002) Curr Opin Pharmacol 2 513minus5224 Clements JM Beckett RP Brown A Catlin G Lobell

M Palan S Thomas W Whittaker M Wood S SalamaS Baker PJ Rodgers HF Barynin V Rice DW andHunter MG (2001) Antimicrob Agents Chemother 45563minus570

5 Barbachyn MR Brickner SJ Gadwood RC GarmonSA Grega KC Hutchinson DK Munesada K Reis-cher RJ Taniguchi M Thomasco LM Toops DS Ya-mada H Ford CW and Zurenko GE (1998) Adv ExpMed Biol 456 219minus238

6 Johnson AP (2001) Curr Opin Investig Drugs121691minus1701

7 Auckland C Teare L Cooke F Kaufmann ME WarnerM Jones G Bamford K Ayles H Johnson AP (2002)J Antimicrob Chemother 50 743minus746

8 Claverie JM Nature 2000 403 12minus12

9 Altschul SF Madden TL Schaffer AA Zhang JZhang Z Miller W and Lipman DJ (1997) Nucleic AcidsRes 25 3389minus3402

10 Rudd KE (2000) Nucleic Acids Res 28 60minus6411 Persson B and Argos P (1997) J Protein Chem 16

453minus45712 Hartley JL Temple GF and Brasch MA (2000) Genome

Res 10 1788minus179513 Monchois V Vincentelli R Deregnaucourt C Abergel C

and Claverie J-M (2002) In Cell-Free Translation Systems(Ed Spirin AS) Springer Verlag New York pp 197minus202

14 Audic S Lopez F Claverie J-M Poirot O and AbergelC (1997) Proteins 29 252minus257

15 Monchois V Abergel C Sturgis J Jeudy S and ClaverieJ-M (2001) J Biol Chem 276 18437minus18441

16 Bhat TN Bourne P Feng Z Gilliland G Jain S Rav-ichandran V Schneider BSchneider K Thanki N Weis-sig H Westbrook J and Berman HM (2001) Nucleic Ac-ids Res 29 214minus218

17 Navaza J (2001) Acta Crystallogr D57 1367minus137218 Liu Y Ogata CM and HendricksonWA (2001) Proc Natl

Acad Sci U S A 98 10648minus1065319 Hendrickson WA Horton JR and LeMaster DM (1990)

EMBO J 9 1665minus167220 Abergel C Bouveret E Claverie JM Brown K Riga

A Lazdunski C and Benedetti H (1999) Structure FoldDes 71291minus1300

21 Notredame C Higgins DG Heringa J (2000) J MolBiol 302 205minus217

22 Falquet L Pagni M Bucher P Hulo N Sigrist CJHofmann K and Bairoch A (2002) Nucleic Acids Res 30235minus238

23 Notredame C and Abergel C (2003) In Bioinformatics andGenomes (Andrade MA ed) Horizon Scientific Press Wy-mondhamUK pp 27minus49

24 Abergel C Robertson DL Claverie J-M (1999) J Virol73 751minus753

25 Marcotte EM Pellegrini M Thompson MJ Yeates TOand Eisenberg D (1999) Nature 402 83minus86

26 Enault F Suhre K Poirot O Abergel C and ClaverieJ-M (2003) Nucleic Acids Res In press

27 Madabushi S Yao H Marsh M Kristensen DM Phil-ippi A Sowa ME and Lichtarge O (2002) J Mol Biol316 139minus154

28 Klebe G (2000) J Mol Med 78 269minus28129 Apostolakis J and Caflisch A (1999) Comb Chem High

Throughput Screen 2 91minus10430 Sali A and Blundell T L (1993) J Mol Biol 234

779minus81531 Kraulis PJ (1991) J Appl Crystallogr 24 946minus950

157

Page 12: Structural genomics of highly conserved microbial genes of ...metabolomics.helmholtz-muenchen.de/suhre/... · By focusing a structural genomics project on such anonymous proteins,

two yhbO monomers as seen in the crystal structureis entirely different from the one found in the 1G2Istructure The functional assignment of the E coliyhbO protein is thus not yet certain The putativeprotease activity of yhbO is currently studied

YqhE significant difference in the active sitesThe yqhE structure was solved by molecular replace-ment using AmoRe with the PDB entry 1A80 ( a 25-diketo-D-gluconic acid reductase A structure fromCorynebacterium) The yqhE structure was then

Figure 4 Dot Blot results of the expression screening on 6 different targets The top 6 Dot blots correspond to the total protein expressionresults The 6 lower Dot blots correspond to the soluble protein expression measurements The numbers above each Dot Blot refer to thetargets Dot Blot results are ranked by decreasing temperature (left to right) and the strains indicated on the right The growth medium isindicated on the left of each Dot Blot

152

Table 3 Project status List and status of our targets as deposited in targetdb (httptargetdbpdborg) under the project name SGI-SG (BIGS)The first column indicates the target name (according to EcoGene [10]) the second column indicates the relevant genome The status (fromlsquoSelectedrsquo to lsquoCrystal Structure) of each target protein is shown in pink For each target arrows are used to indicates the next step to becompleted stars correspond to targets currently stopped and squares correspond to lsquoon holdrsquo targets (failure during the scale-up or crystal-lization processes) pending further testing

153

Table 3 (Continued)

154

refined to 193 Aring resolution (PDB 1MZR) YqhEbelongs to the vast NADPH-dependent aldo-ketoreductase family Multiple alignment was used tocompare the various sequences of the family Thisrevealed two sub-families including all eukaryoticsequences on one hand and all the prokaryoticsequences on the other hand Structurally the eukary-otic proteins differ from the prokaryotic ones by thepresence of much longer loops in the co-factor andligand binding regions These marked structural dif-ferences should make possible the design of smallinhibitory molecule with a good specificity for theprokaryotic targets

YdhF an NADPH-dependent oxidoreductase ofunknown specificityThe ydhF structure was solved using the MAD tech-nique with 8 seleno-methionines The structure wasthen refined to 25 Aring resolution using data collectedon the BM30 beam line at ESRF YdhF belongs to theNADPH-dependent oxidoreductase family Its struc-ture was compared to the other members of the familyusing multiple alignment YqhE shares 267 identi-cal residues over a 217-aa segment with its closeststructural homologue ydhF YdhF exhibits a se-quence insertion in a flexible loop not well resolvedin the structure This difference is not located near theco-factors and ligand binding sites but may still af-fects the enzyme specificity The structure of theNADP containing ydhF is currently under refinementusing data to 28 Aring resolution

YliB a putative n-peptide binding proteinThe yliB structure was solved using the MAD tech-nique with 20 seleno-methionines The structure wasthen refined to 23 Aring resolution using data collectedon the ID14-EH2 beam line at ESRF This membrane-anchored periplasmic protein shares 29 identicalresidues with 1DPE (a dipeptide binding protein fromE coli) and 1DPP (the same protein in complex withglycyl-L-leucine) The loops surrounding the ligand-binding pocket are shorter in the yliB structure com-pared to the homologous structures suggesting thatlarger peptides might be accommodated by yliB Wealso noticed the presence of zinc ion and of some re-sidual electron density in the region homologous tothe 1DPP dipeptide binding site We are currently pro-ducing crystals of yliB in the presence of variousdipeptide in order to verify its affinity for such sub-strates

YdiB a new structural foldThe ydiB structure was solved using the MAD tech-nique with 20 seleno-methionines The structurebuilding is still in progress using data collected at266 Aring resolution on the BM30 beam line at ESRFThe preliminary tracing suggests that the ydiB struc-ture might be an original fold which is confirmed bythe deposition in the PDB of the ydiB structure by theNortheast Structural Genomics Research Consortium(PDB 1NPD) and the Montreal-Kingston BacterialStructural Genomics Initiative (PDB 1O9B)

Conclusion

This structural genomics project is an opportunity todevelop new approaches for the various stages ofgene expression The use of the lsquoone tube reactionrsquoGATEWAY technology allowed the successful clon-ing of 972 of the selected ORFs The developmentof both in vitro and in vivo expression screening withthe efficient sampling of the various parameters influ-encing soluble protein expression allowed 80 of thetested ORFs to be expressed as soluble proteins (anadditional 18 being expressed as inclusion bodies)The automation of the screening using dot-blot rev-elation now allows us to process 12 targets weekly

In our hands the purification step is not a limitingstep since 100 of the 46 proteins that have reachedthat stage were successfully purified However it wasmore challenging to keep them intact and soluble dur-ing the following steps of concentration and bufferexchange required for crystallization The introduc-tion of buffer screening through dialysis controlled byUV scanning has proven very effective allowing theidentification of a suitable buffer for 95 of the pu-rified samples

Finally the quality control we imposed through theentire project albeit time consuming has proven use-ful to trace mistakes such as target mixing andorcross-contamination as well as protein degradationOn a more positive side it also provided informationon unexpected proteinprotein interactions andorligand binding eventually leading to improved crys-tallization protocols To this days 40 of the 46purified recombinant proteins have been throughbiophysical characterization (circular dichroiumlsm spec-troscopy and monodispersity assays using dynamiclight scattering) 32 have entered crystallization trialsand 31 have been sequenced Twelve proteins haveproduced crystals out of which 9 have resulted in us-

155

Figure 5 Protein structures solved in this study The structures are drawn using MOLSCRIPT [31] The YdiB structure was also solved bythe consortium

156

able diffraction data allowing 7 structures to besolved (Figure 5) This number is now expected togrow rapidly as more expressed gene products arereaching the purificationcrystallization stages

Acknowledgements

We thank our collaborators from Aventis Pharma DrsV Blanc E Remy M Black and J Hogdson for theirscientific input We also thank Drs J-L Ferrer FBorel P Carpentier A Thompson G Leonard and JMcCarthy for their help on the FIP BM14 ID29ID14 ESRF beamlines We thank P Pham-Trong(Greiner) for his help and gift of materials Thisproject is funded by a grant (014906055) from theFrench Ministry of Industry lsquopost genomics researchprogramrsquo

References

1 Stover CK Pham XQ Erwin AL Mizoguchi SDWarrener P Hickey MJ Brinkman FS Hufnagle WOKowalik DJ Lagrou M et al (2000) Nature 406959minus964

2 Cassell GH and Mekalanos J JAMA 285 (2001)601minus605

3 Projan S (2002) Curr Opin Pharmacol 2 513minus5224 Clements JM Beckett RP Brown A Catlin G Lobell

M Palan S Thomas W Whittaker M Wood S SalamaS Baker PJ Rodgers HF Barynin V Rice DW andHunter MG (2001) Antimicrob Agents Chemother 45563minus570

5 Barbachyn MR Brickner SJ Gadwood RC GarmonSA Grega KC Hutchinson DK Munesada K Reis-cher RJ Taniguchi M Thomasco LM Toops DS Ya-mada H Ford CW and Zurenko GE (1998) Adv ExpMed Biol 456 219minus238

6 Johnson AP (2001) Curr Opin Investig Drugs121691minus1701

7 Auckland C Teare L Cooke F Kaufmann ME WarnerM Jones G Bamford K Ayles H Johnson AP (2002)J Antimicrob Chemother 50 743minus746

8 Claverie JM Nature 2000 403 12minus12

9 Altschul SF Madden TL Schaffer AA Zhang JZhang Z Miller W and Lipman DJ (1997) Nucleic AcidsRes 25 3389minus3402

10 Rudd KE (2000) Nucleic Acids Res 28 60minus6411 Persson B and Argos P (1997) J Protein Chem 16

453minus45712 Hartley JL Temple GF and Brasch MA (2000) Genome

Res 10 1788minus179513 Monchois V Vincentelli R Deregnaucourt C Abergel C

and Claverie J-M (2002) In Cell-Free Translation Systems(Ed Spirin AS) Springer Verlag New York pp 197minus202

14 Audic S Lopez F Claverie J-M Poirot O and AbergelC (1997) Proteins 29 252minus257

15 Monchois V Abergel C Sturgis J Jeudy S and ClaverieJ-M (2001) J Biol Chem 276 18437minus18441

16 Bhat TN Bourne P Feng Z Gilliland G Jain S Rav-ichandran V Schneider BSchneider K Thanki N Weis-sig H Westbrook J and Berman HM (2001) Nucleic Ac-ids Res 29 214minus218

17 Navaza J (2001) Acta Crystallogr D57 1367minus137218 Liu Y Ogata CM and HendricksonWA (2001) Proc Natl

Acad Sci U S A 98 10648minus1065319 Hendrickson WA Horton JR and LeMaster DM (1990)

EMBO J 9 1665minus167220 Abergel C Bouveret E Claverie JM Brown K Riga

A Lazdunski C and Benedetti H (1999) Structure FoldDes 71291minus1300

21 Notredame C Higgins DG Heringa J (2000) J MolBiol 302 205minus217

22 Falquet L Pagni M Bucher P Hulo N Sigrist CJHofmann K and Bairoch A (2002) Nucleic Acids Res 30235minus238

23 Notredame C and Abergel C (2003) In Bioinformatics andGenomes (Andrade MA ed) Horizon Scientific Press Wy-mondhamUK pp 27minus49

24 Abergel C Robertson DL Claverie J-M (1999) J Virol73 751minus753

25 Marcotte EM Pellegrini M Thompson MJ Yeates TOand Eisenberg D (1999) Nature 402 83minus86

26 Enault F Suhre K Poirot O Abergel C and ClaverieJ-M (2003) Nucleic Acids Res In press

27 Madabushi S Yao H Marsh M Kristensen DM Phil-ippi A Sowa ME and Lichtarge O (2002) J Mol Biol316 139minus154

28 Klebe G (2000) J Mol Med 78 269minus28129 Apostolakis J and Caflisch A (1999) Comb Chem High

Throughput Screen 2 91minus10430 Sali A and Blundell T L (1993) J Mol Biol 234

779minus81531 Kraulis PJ (1991) J Appl Crystallogr 24 946minus950

157

Page 13: Structural genomics of highly conserved microbial genes of ...metabolomics.helmholtz-muenchen.de/suhre/... · By focusing a structural genomics project on such anonymous proteins,

Table 3 Project status List and status of our targets as deposited in targetdb (httptargetdbpdborg) under the project name SGI-SG (BIGS)The first column indicates the target name (according to EcoGene [10]) the second column indicates the relevant genome The status (fromlsquoSelectedrsquo to lsquoCrystal Structure) of each target protein is shown in pink For each target arrows are used to indicates the next step to becompleted stars correspond to targets currently stopped and squares correspond to lsquoon holdrsquo targets (failure during the scale-up or crystal-lization processes) pending further testing

153

Table 3 (Continued)

154

refined to 193 Aring resolution (PDB 1MZR) YqhEbelongs to the vast NADPH-dependent aldo-ketoreductase family Multiple alignment was used tocompare the various sequences of the family Thisrevealed two sub-families including all eukaryoticsequences on one hand and all the prokaryoticsequences on the other hand Structurally the eukary-otic proteins differ from the prokaryotic ones by thepresence of much longer loops in the co-factor andligand binding regions These marked structural dif-ferences should make possible the design of smallinhibitory molecule with a good specificity for theprokaryotic targets

YdhF an NADPH-dependent oxidoreductase ofunknown specificityThe ydhF structure was solved using the MAD tech-nique with 8 seleno-methionines The structure wasthen refined to 25 Aring resolution using data collectedon the BM30 beam line at ESRF YdhF belongs to theNADPH-dependent oxidoreductase family Its struc-ture was compared to the other members of the familyusing multiple alignment YqhE shares 267 identi-cal residues over a 217-aa segment with its closeststructural homologue ydhF YdhF exhibits a se-quence insertion in a flexible loop not well resolvedin the structure This difference is not located near theco-factors and ligand binding sites but may still af-fects the enzyme specificity The structure of theNADP containing ydhF is currently under refinementusing data to 28 Aring resolution

YliB a putative n-peptide binding proteinThe yliB structure was solved using the MAD tech-nique with 20 seleno-methionines The structure wasthen refined to 23 Aring resolution using data collectedon the ID14-EH2 beam line at ESRF This membrane-anchored periplasmic protein shares 29 identicalresidues with 1DPE (a dipeptide binding protein fromE coli) and 1DPP (the same protein in complex withglycyl-L-leucine) The loops surrounding the ligand-binding pocket are shorter in the yliB structure com-pared to the homologous structures suggesting thatlarger peptides might be accommodated by yliB Wealso noticed the presence of zinc ion and of some re-sidual electron density in the region homologous tothe 1DPP dipeptide binding site We are currently pro-ducing crystals of yliB in the presence of variousdipeptide in order to verify its affinity for such sub-strates

YdiB a new structural foldThe ydiB structure was solved using the MAD tech-nique with 20 seleno-methionines The structurebuilding is still in progress using data collected at266 Aring resolution on the BM30 beam line at ESRFThe preliminary tracing suggests that the ydiB struc-ture might be an original fold which is confirmed bythe deposition in the PDB of the ydiB structure by theNortheast Structural Genomics Research Consortium(PDB 1NPD) and the Montreal-Kingston BacterialStructural Genomics Initiative (PDB 1O9B)

Conclusion

This structural genomics project is an opportunity todevelop new approaches for the various stages ofgene expression The use of the lsquoone tube reactionrsquoGATEWAY technology allowed the successful clon-ing of 972 of the selected ORFs The developmentof both in vitro and in vivo expression screening withthe efficient sampling of the various parameters influ-encing soluble protein expression allowed 80 of thetested ORFs to be expressed as soluble proteins (anadditional 18 being expressed as inclusion bodies)The automation of the screening using dot-blot rev-elation now allows us to process 12 targets weekly

In our hands the purification step is not a limitingstep since 100 of the 46 proteins that have reachedthat stage were successfully purified However it wasmore challenging to keep them intact and soluble dur-ing the following steps of concentration and bufferexchange required for crystallization The introduc-tion of buffer screening through dialysis controlled byUV scanning has proven very effective allowing theidentification of a suitable buffer for 95 of the pu-rified samples

Finally the quality control we imposed through theentire project albeit time consuming has proven use-ful to trace mistakes such as target mixing andorcross-contamination as well as protein degradationOn a more positive side it also provided informationon unexpected proteinprotein interactions andorligand binding eventually leading to improved crys-tallization protocols To this days 40 of the 46purified recombinant proteins have been throughbiophysical characterization (circular dichroiumlsm spec-troscopy and monodispersity assays using dynamiclight scattering) 32 have entered crystallization trialsand 31 have been sequenced Twelve proteins haveproduced crystals out of which 9 have resulted in us-

155

Figure 5 Protein structures solved in this study The structures are drawn using MOLSCRIPT [31] The YdiB structure was also solved bythe consortium

156

able diffraction data allowing 7 structures to besolved (Figure 5) This number is now expected togrow rapidly as more expressed gene products arereaching the purificationcrystallization stages

Acknowledgements

We thank our collaborators from Aventis Pharma DrsV Blanc E Remy M Black and J Hogdson for theirscientific input We also thank Drs J-L Ferrer FBorel P Carpentier A Thompson G Leonard and JMcCarthy for their help on the FIP BM14 ID29ID14 ESRF beamlines We thank P Pham-Trong(Greiner) for his help and gift of materials Thisproject is funded by a grant (014906055) from theFrench Ministry of Industry lsquopost genomics researchprogramrsquo

References

1 Stover CK Pham XQ Erwin AL Mizoguchi SDWarrener P Hickey MJ Brinkman FS Hufnagle WOKowalik DJ Lagrou M et al (2000) Nature 406959minus964

2 Cassell GH and Mekalanos J JAMA 285 (2001)601minus605

3 Projan S (2002) Curr Opin Pharmacol 2 513minus5224 Clements JM Beckett RP Brown A Catlin G Lobell

M Palan S Thomas W Whittaker M Wood S SalamaS Baker PJ Rodgers HF Barynin V Rice DW andHunter MG (2001) Antimicrob Agents Chemother 45563minus570

5 Barbachyn MR Brickner SJ Gadwood RC GarmonSA Grega KC Hutchinson DK Munesada K Reis-cher RJ Taniguchi M Thomasco LM Toops DS Ya-mada H Ford CW and Zurenko GE (1998) Adv ExpMed Biol 456 219minus238

6 Johnson AP (2001) Curr Opin Investig Drugs121691minus1701

7 Auckland C Teare L Cooke F Kaufmann ME WarnerM Jones G Bamford K Ayles H Johnson AP (2002)J Antimicrob Chemother 50 743minus746

8 Claverie JM Nature 2000 403 12minus12

9 Altschul SF Madden TL Schaffer AA Zhang JZhang Z Miller W and Lipman DJ (1997) Nucleic AcidsRes 25 3389minus3402

10 Rudd KE (2000) Nucleic Acids Res 28 60minus6411 Persson B and Argos P (1997) J Protein Chem 16

453minus45712 Hartley JL Temple GF and Brasch MA (2000) Genome

Res 10 1788minus179513 Monchois V Vincentelli R Deregnaucourt C Abergel C

and Claverie J-M (2002) In Cell-Free Translation Systems(Ed Spirin AS) Springer Verlag New York pp 197minus202

14 Audic S Lopez F Claverie J-M Poirot O and AbergelC (1997) Proteins 29 252minus257

15 Monchois V Abergel C Sturgis J Jeudy S and ClaverieJ-M (2001) J Biol Chem 276 18437minus18441

16 Bhat TN Bourne P Feng Z Gilliland G Jain S Rav-ichandran V Schneider BSchneider K Thanki N Weis-sig H Westbrook J and Berman HM (2001) Nucleic Ac-ids Res 29 214minus218

17 Navaza J (2001) Acta Crystallogr D57 1367minus137218 Liu Y Ogata CM and HendricksonWA (2001) Proc Natl

Acad Sci U S A 98 10648minus1065319 Hendrickson WA Horton JR and LeMaster DM (1990)

EMBO J 9 1665minus167220 Abergel C Bouveret E Claverie JM Brown K Riga

A Lazdunski C and Benedetti H (1999) Structure FoldDes 71291minus1300

21 Notredame C Higgins DG Heringa J (2000) J MolBiol 302 205minus217

22 Falquet L Pagni M Bucher P Hulo N Sigrist CJHofmann K and Bairoch A (2002) Nucleic Acids Res 30235minus238

23 Notredame C and Abergel C (2003) In Bioinformatics andGenomes (Andrade MA ed) Horizon Scientific Press Wy-mondhamUK pp 27minus49

24 Abergel C Robertson DL Claverie J-M (1999) J Virol73 751minus753

25 Marcotte EM Pellegrini M Thompson MJ Yeates TOand Eisenberg D (1999) Nature 402 83minus86

26 Enault F Suhre K Poirot O Abergel C and ClaverieJ-M (2003) Nucleic Acids Res In press

27 Madabushi S Yao H Marsh M Kristensen DM Phil-ippi A Sowa ME and Lichtarge O (2002) J Mol Biol316 139minus154

28 Klebe G (2000) J Mol Med 78 269minus28129 Apostolakis J and Caflisch A (1999) Comb Chem High

Throughput Screen 2 91minus10430 Sali A and Blundell T L (1993) J Mol Biol 234

779minus81531 Kraulis PJ (1991) J Appl Crystallogr 24 946minus950

157

Page 14: Structural genomics of highly conserved microbial genes of ...metabolomics.helmholtz-muenchen.de/suhre/... · By focusing a structural genomics project on such anonymous proteins,

Table 3 (Continued)

154

refined to 193 Aring resolution (PDB 1MZR) YqhEbelongs to the vast NADPH-dependent aldo-ketoreductase family Multiple alignment was used tocompare the various sequences of the family Thisrevealed two sub-families including all eukaryoticsequences on one hand and all the prokaryoticsequences on the other hand Structurally the eukary-otic proteins differ from the prokaryotic ones by thepresence of much longer loops in the co-factor andligand binding regions These marked structural dif-ferences should make possible the design of smallinhibitory molecule with a good specificity for theprokaryotic targets

YdhF an NADPH-dependent oxidoreductase ofunknown specificityThe ydhF structure was solved using the MAD tech-nique with 8 seleno-methionines The structure wasthen refined to 25 Aring resolution using data collectedon the BM30 beam line at ESRF YdhF belongs to theNADPH-dependent oxidoreductase family Its struc-ture was compared to the other members of the familyusing multiple alignment YqhE shares 267 identi-cal residues over a 217-aa segment with its closeststructural homologue ydhF YdhF exhibits a se-quence insertion in a flexible loop not well resolvedin the structure This difference is not located near theco-factors and ligand binding sites but may still af-fects the enzyme specificity The structure of theNADP containing ydhF is currently under refinementusing data to 28 Aring resolution

YliB a putative n-peptide binding proteinThe yliB structure was solved using the MAD tech-nique with 20 seleno-methionines The structure wasthen refined to 23 Aring resolution using data collectedon the ID14-EH2 beam line at ESRF This membrane-anchored periplasmic protein shares 29 identicalresidues with 1DPE (a dipeptide binding protein fromE coli) and 1DPP (the same protein in complex withglycyl-L-leucine) The loops surrounding the ligand-binding pocket are shorter in the yliB structure com-pared to the homologous structures suggesting thatlarger peptides might be accommodated by yliB Wealso noticed the presence of zinc ion and of some re-sidual electron density in the region homologous tothe 1DPP dipeptide binding site We are currently pro-ducing crystals of yliB in the presence of variousdipeptide in order to verify its affinity for such sub-strates

YdiB a new structural foldThe ydiB structure was solved using the MAD tech-nique with 20 seleno-methionines The structurebuilding is still in progress using data collected at266 Aring resolution on the BM30 beam line at ESRFThe preliminary tracing suggests that the ydiB struc-ture might be an original fold which is confirmed bythe deposition in the PDB of the ydiB structure by theNortheast Structural Genomics Research Consortium(PDB 1NPD) and the Montreal-Kingston BacterialStructural Genomics Initiative (PDB 1O9B)

Conclusion

This structural genomics project is an opportunity todevelop new approaches for the various stages ofgene expression The use of the lsquoone tube reactionrsquoGATEWAY technology allowed the successful clon-ing of 972 of the selected ORFs The developmentof both in vitro and in vivo expression screening withthe efficient sampling of the various parameters influ-encing soluble protein expression allowed 80 of thetested ORFs to be expressed as soluble proteins (anadditional 18 being expressed as inclusion bodies)The automation of the screening using dot-blot rev-elation now allows us to process 12 targets weekly

In our hands the purification step is not a limitingstep since 100 of the 46 proteins that have reachedthat stage were successfully purified However it wasmore challenging to keep them intact and soluble dur-ing the following steps of concentration and bufferexchange required for crystallization The introduc-tion of buffer screening through dialysis controlled byUV scanning has proven very effective allowing theidentification of a suitable buffer for 95 of the pu-rified samples

Finally the quality control we imposed through theentire project albeit time consuming has proven use-ful to trace mistakes such as target mixing andorcross-contamination as well as protein degradationOn a more positive side it also provided informationon unexpected proteinprotein interactions andorligand binding eventually leading to improved crys-tallization protocols To this days 40 of the 46purified recombinant proteins have been throughbiophysical characterization (circular dichroiumlsm spec-troscopy and monodispersity assays using dynamiclight scattering) 32 have entered crystallization trialsand 31 have been sequenced Twelve proteins haveproduced crystals out of which 9 have resulted in us-

155

Figure 5 Protein structures solved in this study The structures are drawn using MOLSCRIPT [31] The YdiB structure was also solved bythe consortium

156

able diffraction data allowing 7 structures to besolved (Figure 5) This number is now expected togrow rapidly as more expressed gene products arereaching the purificationcrystallization stages

Acknowledgements

We thank our collaborators from Aventis Pharma DrsV Blanc E Remy M Black and J Hogdson for theirscientific input We also thank Drs J-L Ferrer FBorel P Carpentier A Thompson G Leonard and JMcCarthy for their help on the FIP BM14 ID29ID14 ESRF beamlines We thank P Pham-Trong(Greiner) for his help and gift of materials Thisproject is funded by a grant (014906055) from theFrench Ministry of Industry lsquopost genomics researchprogramrsquo

References

1 Stover CK Pham XQ Erwin AL Mizoguchi SDWarrener P Hickey MJ Brinkman FS Hufnagle WOKowalik DJ Lagrou M et al (2000) Nature 406959minus964

2 Cassell GH and Mekalanos J JAMA 285 (2001)601minus605

3 Projan S (2002) Curr Opin Pharmacol 2 513minus5224 Clements JM Beckett RP Brown A Catlin G Lobell

M Palan S Thomas W Whittaker M Wood S SalamaS Baker PJ Rodgers HF Barynin V Rice DW andHunter MG (2001) Antimicrob Agents Chemother 45563minus570

5 Barbachyn MR Brickner SJ Gadwood RC GarmonSA Grega KC Hutchinson DK Munesada K Reis-cher RJ Taniguchi M Thomasco LM Toops DS Ya-mada H Ford CW and Zurenko GE (1998) Adv ExpMed Biol 456 219minus238

6 Johnson AP (2001) Curr Opin Investig Drugs121691minus1701

7 Auckland C Teare L Cooke F Kaufmann ME WarnerM Jones G Bamford K Ayles H Johnson AP (2002)J Antimicrob Chemother 50 743minus746

8 Claverie JM Nature 2000 403 12minus12

9 Altschul SF Madden TL Schaffer AA Zhang JZhang Z Miller W and Lipman DJ (1997) Nucleic AcidsRes 25 3389minus3402

10 Rudd KE (2000) Nucleic Acids Res 28 60minus6411 Persson B and Argos P (1997) J Protein Chem 16

453minus45712 Hartley JL Temple GF and Brasch MA (2000) Genome

Res 10 1788minus179513 Monchois V Vincentelli R Deregnaucourt C Abergel C

and Claverie J-M (2002) In Cell-Free Translation Systems(Ed Spirin AS) Springer Verlag New York pp 197minus202

14 Audic S Lopez F Claverie J-M Poirot O and AbergelC (1997) Proteins 29 252minus257

15 Monchois V Abergel C Sturgis J Jeudy S and ClaverieJ-M (2001) J Biol Chem 276 18437minus18441

16 Bhat TN Bourne P Feng Z Gilliland G Jain S Rav-ichandran V Schneider BSchneider K Thanki N Weis-sig H Westbrook J and Berman HM (2001) Nucleic Ac-ids Res 29 214minus218

17 Navaza J (2001) Acta Crystallogr D57 1367minus137218 Liu Y Ogata CM and HendricksonWA (2001) Proc Natl

Acad Sci U S A 98 10648minus1065319 Hendrickson WA Horton JR and LeMaster DM (1990)

EMBO J 9 1665minus167220 Abergel C Bouveret E Claverie JM Brown K Riga

A Lazdunski C and Benedetti H (1999) Structure FoldDes 71291minus1300

21 Notredame C Higgins DG Heringa J (2000) J MolBiol 302 205minus217

22 Falquet L Pagni M Bucher P Hulo N Sigrist CJHofmann K and Bairoch A (2002) Nucleic Acids Res 30235minus238

23 Notredame C and Abergel C (2003) In Bioinformatics andGenomes (Andrade MA ed) Horizon Scientific Press Wy-mondhamUK pp 27minus49

24 Abergel C Robertson DL Claverie J-M (1999) J Virol73 751minus753

25 Marcotte EM Pellegrini M Thompson MJ Yeates TOand Eisenberg D (1999) Nature 402 83minus86

26 Enault F Suhre K Poirot O Abergel C and ClaverieJ-M (2003) Nucleic Acids Res In press

27 Madabushi S Yao H Marsh M Kristensen DM Phil-ippi A Sowa ME and Lichtarge O (2002) J Mol Biol316 139minus154

28 Klebe G (2000) J Mol Med 78 269minus28129 Apostolakis J and Caflisch A (1999) Comb Chem High

Throughput Screen 2 91minus10430 Sali A and Blundell T L (1993) J Mol Biol 234

779minus81531 Kraulis PJ (1991) J Appl Crystallogr 24 946minus950

157

Page 15: Structural genomics of highly conserved microbial genes of ...metabolomics.helmholtz-muenchen.de/suhre/... · By focusing a structural genomics project on such anonymous proteins,

refined to 193 Aring resolution (PDB 1MZR) YqhEbelongs to the vast NADPH-dependent aldo-ketoreductase family Multiple alignment was used tocompare the various sequences of the family Thisrevealed two sub-families including all eukaryoticsequences on one hand and all the prokaryoticsequences on the other hand Structurally the eukary-otic proteins differ from the prokaryotic ones by thepresence of much longer loops in the co-factor andligand binding regions These marked structural dif-ferences should make possible the design of smallinhibitory molecule with a good specificity for theprokaryotic targets

YdhF an NADPH-dependent oxidoreductase ofunknown specificityThe ydhF structure was solved using the MAD tech-nique with 8 seleno-methionines The structure wasthen refined to 25 Aring resolution using data collectedon the BM30 beam line at ESRF YdhF belongs to theNADPH-dependent oxidoreductase family Its struc-ture was compared to the other members of the familyusing multiple alignment YqhE shares 267 identi-cal residues over a 217-aa segment with its closeststructural homologue ydhF YdhF exhibits a se-quence insertion in a flexible loop not well resolvedin the structure This difference is not located near theco-factors and ligand binding sites but may still af-fects the enzyme specificity The structure of theNADP containing ydhF is currently under refinementusing data to 28 Aring resolution

YliB a putative n-peptide binding proteinThe yliB structure was solved using the MAD tech-nique with 20 seleno-methionines The structure wasthen refined to 23 Aring resolution using data collectedon the ID14-EH2 beam line at ESRF This membrane-anchored periplasmic protein shares 29 identicalresidues with 1DPE (a dipeptide binding protein fromE coli) and 1DPP (the same protein in complex withglycyl-L-leucine) The loops surrounding the ligand-binding pocket are shorter in the yliB structure com-pared to the homologous structures suggesting thatlarger peptides might be accommodated by yliB Wealso noticed the presence of zinc ion and of some re-sidual electron density in the region homologous tothe 1DPP dipeptide binding site We are currently pro-ducing crystals of yliB in the presence of variousdipeptide in order to verify its affinity for such sub-strates

YdiB a new structural foldThe ydiB structure was solved using the MAD tech-nique with 20 seleno-methionines The structurebuilding is still in progress using data collected at266 Aring resolution on the BM30 beam line at ESRFThe preliminary tracing suggests that the ydiB struc-ture might be an original fold which is confirmed bythe deposition in the PDB of the ydiB structure by theNortheast Structural Genomics Research Consortium(PDB 1NPD) and the Montreal-Kingston BacterialStructural Genomics Initiative (PDB 1O9B)

Conclusion

This structural genomics project is an opportunity todevelop new approaches for the various stages ofgene expression The use of the lsquoone tube reactionrsquoGATEWAY technology allowed the successful clon-ing of 972 of the selected ORFs The developmentof both in vitro and in vivo expression screening withthe efficient sampling of the various parameters influ-encing soluble protein expression allowed 80 of thetested ORFs to be expressed as soluble proteins (anadditional 18 being expressed as inclusion bodies)The automation of the screening using dot-blot rev-elation now allows us to process 12 targets weekly

In our hands the purification step is not a limitingstep since 100 of the 46 proteins that have reachedthat stage were successfully purified However it wasmore challenging to keep them intact and soluble dur-ing the following steps of concentration and bufferexchange required for crystallization The introduc-tion of buffer screening through dialysis controlled byUV scanning has proven very effective allowing theidentification of a suitable buffer for 95 of the pu-rified samples

Finally the quality control we imposed through theentire project albeit time consuming has proven use-ful to trace mistakes such as target mixing andorcross-contamination as well as protein degradationOn a more positive side it also provided informationon unexpected proteinprotein interactions andorligand binding eventually leading to improved crys-tallization protocols To this days 40 of the 46purified recombinant proteins have been throughbiophysical characterization (circular dichroiumlsm spec-troscopy and monodispersity assays using dynamiclight scattering) 32 have entered crystallization trialsand 31 have been sequenced Twelve proteins haveproduced crystals out of which 9 have resulted in us-

155

Figure 5 Protein structures solved in this study The structures are drawn using MOLSCRIPT [31] The YdiB structure was also solved bythe consortium

156

able diffraction data allowing 7 structures to besolved (Figure 5) This number is now expected togrow rapidly as more expressed gene products arereaching the purificationcrystallization stages

Acknowledgements

We thank our collaborators from Aventis Pharma DrsV Blanc E Remy M Black and J Hogdson for theirscientific input We also thank Drs J-L Ferrer FBorel P Carpentier A Thompson G Leonard and JMcCarthy for their help on the FIP BM14 ID29ID14 ESRF beamlines We thank P Pham-Trong(Greiner) for his help and gift of materials Thisproject is funded by a grant (014906055) from theFrench Ministry of Industry lsquopost genomics researchprogramrsquo

References

1 Stover CK Pham XQ Erwin AL Mizoguchi SDWarrener P Hickey MJ Brinkman FS Hufnagle WOKowalik DJ Lagrou M et al (2000) Nature 406959minus964

2 Cassell GH and Mekalanos J JAMA 285 (2001)601minus605

3 Projan S (2002) Curr Opin Pharmacol 2 513minus5224 Clements JM Beckett RP Brown A Catlin G Lobell

M Palan S Thomas W Whittaker M Wood S SalamaS Baker PJ Rodgers HF Barynin V Rice DW andHunter MG (2001) Antimicrob Agents Chemother 45563minus570

5 Barbachyn MR Brickner SJ Gadwood RC GarmonSA Grega KC Hutchinson DK Munesada K Reis-cher RJ Taniguchi M Thomasco LM Toops DS Ya-mada H Ford CW and Zurenko GE (1998) Adv ExpMed Biol 456 219minus238

6 Johnson AP (2001) Curr Opin Investig Drugs121691minus1701

7 Auckland C Teare L Cooke F Kaufmann ME WarnerM Jones G Bamford K Ayles H Johnson AP (2002)J Antimicrob Chemother 50 743minus746

8 Claverie JM Nature 2000 403 12minus12

9 Altschul SF Madden TL Schaffer AA Zhang JZhang Z Miller W and Lipman DJ (1997) Nucleic AcidsRes 25 3389minus3402

10 Rudd KE (2000) Nucleic Acids Res 28 60minus6411 Persson B and Argos P (1997) J Protein Chem 16

453minus45712 Hartley JL Temple GF and Brasch MA (2000) Genome

Res 10 1788minus179513 Monchois V Vincentelli R Deregnaucourt C Abergel C

and Claverie J-M (2002) In Cell-Free Translation Systems(Ed Spirin AS) Springer Verlag New York pp 197minus202

14 Audic S Lopez F Claverie J-M Poirot O and AbergelC (1997) Proteins 29 252minus257

15 Monchois V Abergel C Sturgis J Jeudy S and ClaverieJ-M (2001) J Biol Chem 276 18437minus18441

16 Bhat TN Bourne P Feng Z Gilliland G Jain S Rav-ichandran V Schneider BSchneider K Thanki N Weis-sig H Westbrook J and Berman HM (2001) Nucleic Ac-ids Res 29 214minus218

17 Navaza J (2001) Acta Crystallogr D57 1367minus137218 Liu Y Ogata CM and HendricksonWA (2001) Proc Natl

Acad Sci U S A 98 10648minus1065319 Hendrickson WA Horton JR and LeMaster DM (1990)

EMBO J 9 1665minus167220 Abergel C Bouveret E Claverie JM Brown K Riga

A Lazdunski C and Benedetti H (1999) Structure FoldDes 71291minus1300

21 Notredame C Higgins DG Heringa J (2000) J MolBiol 302 205minus217

22 Falquet L Pagni M Bucher P Hulo N Sigrist CJHofmann K and Bairoch A (2002) Nucleic Acids Res 30235minus238

23 Notredame C and Abergel C (2003) In Bioinformatics andGenomes (Andrade MA ed) Horizon Scientific Press Wy-mondhamUK pp 27minus49

24 Abergel C Robertson DL Claverie J-M (1999) J Virol73 751minus753

25 Marcotte EM Pellegrini M Thompson MJ Yeates TOand Eisenberg D (1999) Nature 402 83minus86

26 Enault F Suhre K Poirot O Abergel C and ClaverieJ-M (2003) Nucleic Acids Res In press

27 Madabushi S Yao H Marsh M Kristensen DM Phil-ippi A Sowa ME and Lichtarge O (2002) J Mol Biol316 139minus154

28 Klebe G (2000) J Mol Med 78 269minus28129 Apostolakis J and Caflisch A (1999) Comb Chem High

Throughput Screen 2 91minus10430 Sali A and Blundell T L (1993) J Mol Biol 234

779minus81531 Kraulis PJ (1991) J Appl Crystallogr 24 946minus950

157

Page 16: Structural genomics of highly conserved microbial genes of ...metabolomics.helmholtz-muenchen.de/suhre/... · By focusing a structural genomics project on such anonymous proteins,

Figure 5 Protein structures solved in this study The structures are drawn using MOLSCRIPT [31] The YdiB structure was also solved bythe consortium

156

able diffraction data allowing 7 structures to besolved (Figure 5) This number is now expected togrow rapidly as more expressed gene products arereaching the purificationcrystallization stages

Acknowledgements

We thank our collaborators from Aventis Pharma DrsV Blanc E Remy M Black and J Hogdson for theirscientific input We also thank Drs J-L Ferrer FBorel P Carpentier A Thompson G Leonard and JMcCarthy for their help on the FIP BM14 ID29ID14 ESRF beamlines We thank P Pham-Trong(Greiner) for his help and gift of materials Thisproject is funded by a grant (014906055) from theFrench Ministry of Industry lsquopost genomics researchprogramrsquo

References

1 Stover CK Pham XQ Erwin AL Mizoguchi SDWarrener P Hickey MJ Brinkman FS Hufnagle WOKowalik DJ Lagrou M et al (2000) Nature 406959minus964

2 Cassell GH and Mekalanos J JAMA 285 (2001)601minus605

3 Projan S (2002) Curr Opin Pharmacol 2 513minus5224 Clements JM Beckett RP Brown A Catlin G Lobell

M Palan S Thomas W Whittaker M Wood S SalamaS Baker PJ Rodgers HF Barynin V Rice DW andHunter MG (2001) Antimicrob Agents Chemother 45563minus570

5 Barbachyn MR Brickner SJ Gadwood RC GarmonSA Grega KC Hutchinson DK Munesada K Reis-cher RJ Taniguchi M Thomasco LM Toops DS Ya-mada H Ford CW and Zurenko GE (1998) Adv ExpMed Biol 456 219minus238

6 Johnson AP (2001) Curr Opin Investig Drugs121691minus1701

7 Auckland C Teare L Cooke F Kaufmann ME WarnerM Jones G Bamford K Ayles H Johnson AP (2002)J Antimicrob Chemother 50 743minus746

8 Claverie JM Nature 2000 403 12minus12

9 Altschul SF Madden TL Schaffer AA Zhang JZhang Z Miller W and Lipman DJ (1997) Nucleic AcidsRes 25 3389minus3402

10 Rudd KE (2000) Nucleic Acids Res 28 60minus6411 Persson B and Argos P (1997) J Protein Chem 16

453minus45712 Hartley JL Temple GF and Brasch MA (2000) Genome

Res 10 1788minus179513 Monchois V Vincentelli R Deregnaucourt C Abergel C

and Claverie J-M (2002) In Cell-Free Translation Systems(Ed Spirin AS) Springer Verlag New York pp 197minus202

14 Audic S Lopez F Claverie J-M Poirot O and AbergelC (1997) Proteins 29 252minus257

15 Monchois V Abergel C Sturgis J Jeudy S and ClaverieJ-M (2001) J Biol Chem 276 18437minus18441

16 Bhat TN Bourne P Feng Z Gilliland G Jain S Rav-ichandran V Schneider BSchneider K Thanki N Weis-sig H Westbrook J and Berman HM (2001) Nucleic Ac-ids Res 29 214minus218

17 Navaza J (2001) Acta Crystallogr D57 1367minus137218 Liu Y Ogata CM and HendricksonWA (2001) Proc Natl

Acad Sci U S A 98 10648minus1065319 Hendrickson WA Horton JR and LeMaster DM (1990)

EMBO J 9 1665minus167220 Abergel C Bouveret E Claverie JM Brown K Riga

A Lazdunski C and Benedetti H (1999) Structure FoldDes 71291minus1300

21 Notredame C Higgins DG Heringa J (2000) J MolBiol 302 205minus217

22 Falquet L Pagni M Bucher P Hulo N Sigrist CJHofmann K and Bairoch A (2002) Nucleic Acids Res 30235minus238

23 Notredame C and Abergel C (2003) In Bioinformatics andGenomes (Andrade MA ed) Horizon Scientific Press Wy-mondhamUK pp 27minus49

24 Abergel C Robertson DL Claverie J-M (1999) J Virol73 751minus753

25 Marcotte EM Pellegrini M Thompson MJ Yeates TOand Eisenberg D (1999) Nature 402 83minus86

26 Enault F Suhre K Poirot O Abergel C and ClaverieJ-M (2003) Nucleic Acids Res In press

27 Madabushi S Yao H Marsh M Kristensen DM Phil-ippi A Sowa ME and Lichtarge O (2002) J Mol Biol316 139minus154

28 Klebe G (2000) J Mol Med 78 269minus28129 Apostolakis J and Caflisch A (1999) Comb Chem High

Throughput Screen 2 91minus10430 Sali A and Blundell T L (1993) J Mol Biol 234

779minus81531 Kraulis PJ (1991) J Appl Crystallogr 24 946minus950

157

Page 17: Structural genomics of highly conserved microbial genes of ...metabolomics.helmholtz-muenchen.de/suhre/... · By focusing a structural genomics project on such anonymous proteins,

able diffraction data allowing 7 structures to besolved (Figure 5) This number is now expected togrow rapidly as more expressed gene products arereaching the purificationcrystallization stages

Acknowledgements

We thank our collaborators from Aventis Pharma DrsV Blanc E Remy M Black and J Hogdson for theirscientific input We also thank Drs J-L Ferrer FBorel P Carpentier A Thompson G Leonard and JMcCarthy for their help on the FIP BM14 ID29ID14 ESRF beamlines We thank P Pham-Trong(Greiner) for his help and gift of materials Thisproject is funded by a grant (014906055) from theFrench Ministry of Industry lsquopost genomics researchprogramrsquo

References

1 Stover CK Pham XQ Erwin AL Mizoguchi SDWarrener P Hickey MJ Brinkman FS Hufnagle WOKowalik DJ Lagrou M et al (2000) Nature 406959minus964

2 Cassell GH and Mekalanos J JAMA 285 (2001)601minus605

3 Projan S (2002) Curr Opin Pharmacol 2 513minus5224 Clements JM Beckett RP Brown A Catlin G Lobell

M Palan S Thomas W Whittaker M Wood S SalamaS Baker PJ Rodgers HF Barynin V Rice DW andHunter MG (2001) Antimicrob Agents Chemother 45563minus570

5 Barbachyn MR Brickner SJ Gadwood RC GarmonSA Grega KC Hutchinson DK Munesada K Reis-cher RJ Taniguchi M Thomasco LM Toops DS Ya-mada H Ford CW and Zurenko GE (1998) Adv ExpMed Biol 456 219minus238

6 Johnson AP (2001) Curr Opin Investig Drugs121691minus1701

7 Auckland C Teare L Cooke F Kaufmann ME WarnerM Jones G Bamford K Ayles H Johnson AP (2002)J Antimicrob Chemother 50 743minus746

8 Claverie JM Nature 2000 403 12minus12

9 Altschul SF Madden TL Schaffer AA Zhang JZhang Z Miller W and Lipman DJ (1997) Nucleic AcidsRes 25 3389minus3402

10 Rudd KE (2000) Nucleic Acids Res 28 60minus6411 Persson B and Argos P (1997) J Protein Chem 16

453minus45712 Hartley JL Temple GF and Brasch MA (2000) Genome

Res 10 1788minus179513 Monchois V Vincentelli R Deregnaucourt C Abergel C

and Claverie J-M (2002) In Cell-Free Translation Systems(Ed Spirin AS) Springer Verlag New York pp 197minus202

14 Audic S Lopez F Claverie J-M Poirot O and AbergelC (1997) Proteins 29 252minus257

15 Monchois V Abergel C Sturgis J Jeudy S and ClaverieJ-M (2001) J Biol Chem 276 18437minus18441

16 Bhat TN Bourne P Feng Z Gilliland G Jain S Rav-ichandran V Schneider BSchneider K Thanki N Weis-sig H Westbrook J and Berman HM (2001) Nucleic Ac-ids Res 29 214minus218

17 Navaza J (2001) Acta Crystallogr D57 1367minus137218 Liu Y Ogata CM and HendricksonWA (2001) Proc Natl

Acad Sci U S A 98 10648minus1065319 Hendrickson WA Horton JR and LeMaster DM (1990)

EMBO J 9 1665minus167220 Abergel C Bouveret E Claverie JM Brown K Riga

A Lazdunski C and Benedetti H (1999) Structure FoldDes 71291minus1300

21 Notredame C Higgins DG Heringa J (2000) J MolBiol 302 205minus217

22 Falquet L Pagni M Bucher P Hulo N Sigrist CJHofmann K and Bairoch A (2002) Nucleic Acids Res 30235minus238

23 Notredame C and Abergel C (2003) In Bioinformatics andGenomes (Andrade MA ed) Horizon Scientific Press Wy-mondhamUK pp 27minus49

24 Abergel C Robertson DL Claverie J-M (1999) J Virol73 751minus753

25 Marcotte EM Pellegrini M Thompson MJ Yeates TOand Eisenberg D (1999) Nature 402 83minus86

26 Enault F Suhre K Poirot O Abergel C and ClaverieJ-M (2003) Nucleic Acids Res In press

27 Madabushi S Yao H Marsh M Kristensen DM Phil-ippi A Sowa ME and Lichtarge O (2002) J Mol Biol316 139minus154

28 Klebe G (2000) J Mol Med 78 269minus28129 Apostolakis J and Caflisch A (1999) Comb Chem High

Throughput Screen 2 91minus10430 Sali A and Blundell T L (1993) J Mol Biol 234

779minus81531 Kraulis PJ (1991) J Appl Crystallogr 24 946minus950

157