A community-driven annotation platform for structural genomics

Workshop on the Biological Annotation of Novel Proteins, March 7-8, 2008

Biomedical theme: Central Machinery of Life - proteins conserved in all kingdoms of life

Biological theme: Complete coverage of Thermotoga maritima

Adam Godzik and the JCSG Bioinformatics Team

Author: emerald-logan

Post on 31-Dec-2015







  • Science is all about communication

    Since the late 19th century, the dominant way of communicating scientific results has been through peer-reviewed manuscripts

    Pro:
      Peer review ensures quality
      Enforces a publishable unit: decreases noise in the communication space
      Authorship rules ensure proper distribution of credit in a system that is well integrated with the system of promotions and evaluations

    Con:
      Significant time lag and additional costs
      Enforces a publishable unit: results below the threshold are lost
      Not scalable with high-throughput data production

  • Increasingly, it's not the only game in town

    Databases and automated annotation protocols
      pro: fast, machine searchable, scalable
      con: difficult to ensure quality and assign credit, put the burden of expertise on the user

    Wikipedia
      pro: harnesses power of community, scalable
      con: unreliable, difficult to ensure quality and assign credit

  • Can we have the best of all worlds?

    Diagram labels: peer-reviewed manuscript, automated database annotation, Wikipedia entry; combined goal: fast, accurate, scalable

  • Structure determination in PSI centers is done on a semi-automated assembly line

    Pipeline stages shown: structure refinement, structure validation, annotation, publication, PDB

    Joint Center for Structural Genomics
      One of four large-scale (production) centers of PSI2
      ~600 structures deposited in the PDB
      Sustained pace of ~15 PDB depositions per month

  • and the pace of structure determination far outstrips the pace of our publications

    Structure collage: PSI

    Publication statistics: http://olenka.med.virginia.edu/psi/

  • Why?

    Speed: 2-3 months from target selection to structure, 2-3 structures per week in each center

    Assembly-line process: no time to develop a special relationship with each protein

    Structures are not associated with ongoing biochemical and biological research

    Targets are selected based on novelty: no expertise is available anywhere, so it is difficult to reach a publishable unit

  • We are not alone

    Bacterial genomes
      1995-2000: every newly sequenced genome led to a Nature/Science publication
      With 500+ genomes, an increasing percentage of them never become the focus of a dedicated publication
      Community-based annotation efforts become the best source of information (SEED)

  • Web 2.0 is reshaping how we share information:

    Communities of globally distributed peers (networks) built around rich, collaborative environments.

    Wikipedia
    Citizendium
    Scholarpedia
    Google Knols


  • How can we tap into an ultimate research tool?

    Search engines are becoming serious research tools

    Google indexes research papers, books, Wikipedia pages

    Semi-natural language searches

  • Structural coverage of many genomes (here T. maritima) approaches completeness

    ~73% of feasible targets

  • Which brings attention from broader communities

    http://research.calit2.net/metagenomics/thermotoga

  • We can utilize other information

    Metabolic reconstruction of T. maritima was done in collaboration with the UCSD Systems Biology Lab (Bernard Palsson)

    Model is consistent with all the published experimental data on TM (see Ines Thiele poster)

    First-generation model covers 479 genes (1398 are not in the model) and 492 metabolites

    Proteins coded by 113 of these genes have been solved (71 at JCSG, 28 at other PSI centers)

    320 have been modeled

    We know at least the approximate structure of ALL the proteins in the reconstruction
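    The counts quoted on this slide can be turned into coverage fractions with a few lines of arithmetic. The sketch below is only an illustration: the function name `coverage_summary` is hypothetical, and it assumes that the 479 and 1398 genes together make up the full gene set considered and that the solved and modeled sets do not overlap. Neither assumption is stated on the slide.

```python
def coverage_summary(in_model, not_in_model, solved, modeled):
    """Summarize structural coverage of a metabolic reconstruction.

    Assumes (not stated in the source) that in_model + not_in_model is the
    full gene set, and that solved and modeled genes are disjoint.
    """
    total = in_model + not_in_model
    covered = solved + modeled
    return {
        "total_genes": total,
        "model_share": in_model / total,       # share of genes in the model
        "solved_share": solved / in_model,     # experimentally solved
        "modeled_share": modeled / in_model,   # covered by a model only
        "solved_or_modeled": covered,
        "remaining": in_model - covered,       # not covered by either count
    }


# Figures quoted on the slide
summary = coverage_summary(in_model=479, not_in_model=1398, solved=113, modeled=320)
for name, value in summary.items():
    if isinstance(value, float):
        print(f"{name}: {value:.1%}")
    else:
        print(f"{name}: {value}")
```

    Under these assumptions the solved and modeled counts add up to 433 of the 479 genes in the model; the slide does not say how the remaining genes are accounted for.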

  • And bring it together to help make sense of the structures and see them in the full context

  • All available information about a protein on one page

  • We try to combine automated, database-driven annotations with expert-curated input.

    Annotation:
      Feeds from public databases
      Expert-curated information

    Content management:
      Wiki-style editing (WYSIWYG editor)
      Page-level access control
      Structured fields + free text
      Instant publication
      Always open for comments and edits

    Quality control & authorship:
      Encourage community collaboration
      JCSG scientists & invited peers
      Many authors - no contribution too small
      Lead authors (editors) in charge of releases

  • We tried it before

    Internal structure annotation system developed at JCSG in 2003

    Led to several interesting discussions, but mostly didn't catch on

  • TOPSAN content:

  • Protein Groups

  • Browsing Options

  • JCSG: Structures / Structure Notes / TOPSAN

    278 of 593 structures have an annotation on TOPSAN

  • Members of the biological community can utilize PSI structures only when they are aware of them

    Example: a functionally well-characterized enzyme that is also a new fold.

    PDB ID: 3C8W
    TargetDB: 376561

  • TOPSAN access statistics, Jan to Mar 08

    1143 visits from the rest of the world.

  • Google sent us the largest number of visitors.

  • UCSD & Burnham, Bioinformatics Core

    John Wooley, Adam Godzik, Lukasz Jaroszewski, Slawomir Grzechnik, Sri Krishna Subramanian, Andrew Morse, Tamara Astakhova, Lian Duan, Piotr Kozbial, Dana Weekes, Natasha Sefcovic, Josie Alaoen

    GNF & TSRI, Crystallomics Core

    Scott Lesley, Mark Knuth, Heath Klock, Marc Deller, Dennis Carlton, Polat Abdubek, Sanjay Agarwalla, Connie Chen, Thomas Clayton, Dustin Ernst, Julie Feuerhelm, Regina Gorski, Anna Grzechnik, Joanna C. Hale, Thamara Janaratne, Hope Johnson, Sachin Kale, Daniel McMullan, Edward Nigoghossian, Amanda Nopakun, Linda Okach, Jessica Paulsen, Christina Puckett, Sebastian Sudek, Jessica Canseco

    Scientific Advisory Board

    Sir Tom Blundell, Univ. Cambridge
    Homme Hellinga, Duke University Medical Center
    James Naismith, The Scottish Structural Proteomics Facility, Univ. St. Andrews
    James Paulson, Consortium for Functional Glycomics, The Scripps Research Institute
    Robert Stroud, Center for Structure of Membrane Proteins, Membrane Protein Expression Center, UC San Francisco
    Soichi Wakatsuki, Photon Factory, KEK, Japan
    James Wells, UC San Francisco
    Todd Yeates, UCLA-DOE, Inst. for Genomics and Proteomics

    TSRI, NMR Core

    Kurt Wüthrich, Reto Horst, Margaret Johnson, Amaranth Chatterjee, Michael Geralt, Wojtek Augustyniak, Jin-Kyu Rhee, Biswaranjan Mohanty, Bill Pedrini, Pedro Serrano

    TSRI, Administrative Core

    Ian Wilson, Marc Elsliger, Gye Won Han, David Marciano, Henry Tien, Xiaoping Dai, Lisa van Veen

    Stanford / SSRL, Structure Determination Core

    Keith Hodgson, Ashley Deacon, Mitchell Miller, Hsiu-Ju (Jessica) Chiu, Debanu Das, Kevin Jin, Abhinav Kumar, Winnie Lam, Silvya Oommachen, Christopher Rife, Scott Talafuse, Christine Trame, Qingping Xu, Henry van den Bedem, Ronald Reyes

    The JCSG is supported by the NIH Protein Structure Initiative grant U54 GM074898 from the National Institute of General Medical Sciences (www.nigms.nih.gov).

  • Thermotoga browser acknowledgments

    Co-PI of the project: Andrei Osterman (the biochemistry side, specific examples)

    The JCSG team: for all the structures, focus on Thermotoga and CML

    Bernard Palsson group and Ines Thiele: for work with the Thermotoga reconstruction and model simulations

    The JCMM team: for structure modeling

    Krzysztof Ginalski and bioinfor server team: for assistance with borderline predictions

    Ying Zhang (JCMM): finalizing the metabolic reconstruction, network and fold distribution analysis

    Dana Weekes (JCSG): first pass on the Thermotoga metabolic reconstruction, TM TOPSAN pages

    Craig Shepherd (JCMM): network visualization

    Zhanwen Li (JCMM): modeling and fold assignments

    The JCSG Pipeline

    Progress has been made in automating nearly all of the steps in this pipeline. Some of the steps are well defined and strictly procedural, such as oligo design, amplification, expression, first pass purification, and crystallization. The excellent robotic systems developed at GNF work well for these steps in the pipeline. Other stages are more complex, and require judgements based on accumulated human experience, such as evaluating gels, mounting and freezing crystals, assessing diffraction quality, refining and validating protein structures. Progression of samples through these steps is currently increased by web-based interfaces and scripts created to assist the user and through the conversion of human tasks into code where possible (see III.A.2.c). Underlying our pipeline strategy is the assumption that, in year 3, our pipeline throughput will continue to accelerate from the increased connectivity between process steps.

    Figure 18. Bottleneck analysis of the JCSG pipeline using data mining and systems integration. Leaks are illustrated by blue drips, scheduling and resource access bottlenecks by yellow connections and major process bottlenecks by red connections.

    As year 3 progresses, we expect to see a secondary gain from improvements in the overall process that are driven by systematic examination of successes and failures to date. This data mining approach has already been applied in the development of the next tier of the T. maritima crystallome analysis as described earlier. Figure 18 illustrates an extensive analysis of the JCSG pipeline at the end of the initial two year period. Extensive data mining was conducted to identify leaks (blue drops) where gene targets are lost due to processing problems such as the challenges of the C. elegans proteome or loss due to insolubility, bottlenecks due to a lack of resources (green bottlenecks), lack of dedicated instrumentation such as crystallization robotics and beamlines (yellow bottlenecks) and process flow challenges (red bottlenecks). A number of initial feedback loops have already been established throughout the pipeline and are in routine production.

    Of 1728 visits, 585 are from San Diego. The rest are from around the world.