Worldwide Protein Data Bank, www.wwpdb.org, September 7, 2007

Page 1: Worldwide Protein Data Bank  September 7, 2007

Worldwide Protein Data Bank

www.wwpdb.org

September 7, 2007

Page 2: Worldwide Protein Data Bank  September 7, 2007

Agenda

Welcome and introductions
Accomplishments
Remediation rollout summary
Toward the future
Break
Matters arising
– Incorrect structures
Executive session
Feedback to wwPDB
Set next meeting date

Page 3: Worldwide Protein Data Bank  September 7, 2007

wwPDB Achievements, October 2006 - September 2007

Continued growth of archive
Website updates
Publications and presentations
Time-stamped archive
Remediation rollout
Annotation document
One stop shop: NMR, cryoEM

Page 4: Worldwide Protein Data Bank  September 7, 2007

Depositions since wwPDB establishment

Page 5: Worldwide Protein Data Bank  September 7, 2007

PDB entry processing

1-1-2000: 10,997 entries in PDB
Today (10-Jul-2007): 44,578 entries in PDB

Size now is 4 times larger than when the 3 sites started

In 1999, 2361 entries were deposited
In 2006, 7282 entries were deposited

We handle more than 3 times as many entries per year with less staff – and all wwPDB sites produce high quality annotated PDB entries

No current backlog of unprocessed entries
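
The growth factors quoted on this slide can be checked directly from the counts it gives. The following is a minimal sketch; the variable and function names are illustrative and not part of the presentation.

```python
# Counts quoted on the slide.
entries_2000 = 10_997    # entries in the PDB on 1-1-2000
entries_2007 = 44_578    # entries in the PDB on 10-Jul-2007
deposited_1999 = 2_361   # entries deposited during 1999
deposited_2006 = 7_282   # entries deposited during 2006


def growth_factor(later: int, earlier: int) -> float:
    """Return how many times larger `later` is than `earlier`."""
    return later / earlier


archive_growth = growth_factor(entries_2007, entries_2000)
deposition_growth = growth_factor(deposited_2006, deposited_1999)

print(f"archive size: {archive_growth:.2f}x")
print(f"annual depositions: {deposition_growth:.2f}x")
```

Both ratios come out slightly above the rounded figures on the slide (4 times and 3 times).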

Page 6: Worldwide Protein Data Bank  September 7, 2007

Time-stamped copies of the archive

57 Gbytes of data for 2006, released January 2, 2007

68 Gbytes of data for July 2007 snapshot

Both include
– PDB format entries
– mmCIF format entries
– PDBML format entries
– Experimental data
– Dictionary, schema, and format documentation

Page 7: Worldwide Protein Data Bank  September 7, 2007

Outreach

wwPDB website
Discussion forums
NMR Task Force
Publications
Professional society meetings

Page 8: Worldwide Protein Data Bank  September 7, 2007

Page 9: Worldwide Protein Data Bank  September 7, 2007

Joint publications

Nucleic Acids Research, 35: D301 (2007)
– The worldwide Protein Data Bank (wwPDB): ensuring a single, uniform archive of PDB data
Nature Structural & Molecular Biology, 14: 354 (2007)
– Reply to: Building meaningful models of glycoproteins
Nature Biotechnology, 25: 854 (2007)
– Response to “Overhauling the PDB”
Methods in Molecular Biology, in press
– Data deposition and annotation at the wwPDB
Structural Bioinformatics 2nd Edition, in press
– The wwPDB

Page 10: Worldwide Protein Data Bank  September 7, 2007

Interactions since October 2006

Exchange visits
– MSD/RCSB PDB (4)
– PDBj/RCSB PDB (1)
– PDBj/BMRB (2)
– BMRB/RCSB PDB (1)
Phone conference with site directors, twice a year
VTCs among staff
– BMRB/RCSB PDB twice a month (ADIT-NMR)
– MSD/RCSB PDB twice a week (annotation procedures, remediation)
– RCSB PDB/PDBj and BMRB/PDBj on necessary occasions
Email among staff
– MSD/RCSB PDB ~2 per day
– PDBj/RCSB PDB ~2 per day

Page 11: Worldwide Protein Data Bank  September 7, 2007

New initiatives

One stop shop for NMR data and models
One stop shop for electron microscopy maps and models (NIH-funded)

Page 12: Worldwide Protein Data Bank  September 7, 2007

Recommendations from 2006 wwPDBAC report

Implement the recommendations from the November 19-20, 2005 modeling workshop (Berman et al. Structure 14, 1211-1217)
– Models phased out October 16, 2006
Roll out remediated data to superusers by December 31, 2006, and to all users by July 1, 2007; provide access to PDB formatted files following the most current format
– Superusers had access to data in November 2006, all users in April 2007

Page 13: Worldwide Protein Data Bank  September 7, 2007

Recommendations from 2006 wwPDBAC report

Work with SAXS community to create appropriate representation of these data, and circulate progress reports to the Committee as appropriate
– Not done
Expand the four character PDB ID codes before the number of depositions reaches 400,000
– Number of available PDB ID codes has been increased by allowing IDs to start with a character
Develop and present a formal recommendation to the wwPDBAC regarding the purview of the PDB at our September 2007 meeting in Princeton, NJ
– In process
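
As a rough illustration of why the size of the ID space matters, the sketch below counts four-character codes under two assumptions that are not stated on the slide: that a code is one leading character followed by three characters drawn from 0-9 and A-Z, and that the leading character was previously limited to the digits 1-9. The function name and its parameters are hypothetical.

```python
import string

ALPHANUMERIC = string.digits + string.ascii_uppercase  # 36 symbols


def count_ids(first_chars: str, tail_length: int = 3) -> int:
    """Count codes made of one leading character plus `tail_length` alphanumerics."""
    return len(set(first_chars)) * len(ALPHANUMERIC) ** tail_length


# Assumed legacy scheme: leading digit 1-9 only.
digits_only = count_ids("123456789")
# Assumed extended scheme: leading digit 1-9 or any letter A-Z.
with_letters = count_ids("123456789" + string.ascii_uppercase)

print(digits_only)
print(with_letters)
```

Under these assumptions the digit-only scheme gives 9 × 36³ = 419,904 codes, which is in line with the 400,000 threshold named in the recommendation.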

Page 14: Worldwide Protein Data Bank  September 7, 2007

Recommendations from 2006 wwPDBAC report

Coordinate with the wwPDBAC to obtain formal letters of support when seeking funding; establish a coordinated plan to both educate and lobby funding agency representatives; establish a charitable organization to serve as a conduit for receipt of both grant funding and gifts from pharmaceutical and biotechnology companies, involving individual Committee members as needed.
– Funding Representatives Round Table Discussion

Page 15: Worldwide Protein Data Bank  September 7, 2007

Remediation

Page 16: Worldwide Protein Data Bank  September 7, 2007

Key drivers

Chemistry and nomenclature
Sequence and taxonomy
Citations
Viruses

Page 17: Worldwide Protein Data Bank  September 7, 2007

IUPAC, NMR, and the PDB

Atom nomenclature and NMR restraints

John L. Markley

Page 18: Worldwide Protein Data Bank  September 7, 2007

History of the NMR-led requested remediation of hydrogen atom nomenclature

When BMRB was established in the late 1980s, it adopted the IUPAC atom nomenclature recommendations from Biochemistry 9, 3471-3479, 1970

At that time, we noted that NMR structures being deposited in the PDB did not adhere to these recommendations (particularly for H-atoms; e.g. HB1/HB2 instead of HB2/HB3), and I brought this to the attention of the director of the PDB at Brookhaven with the request that it be remedied

A group of NMR spectroscopists led by Kurt Wüthrich worked with the NMR community to develop recommendations for the deposition of NMR structures; all agreed that the prior IUPAC recommendations be maintained (Pure & Appl. Chem., 70, 117-142, 1998)

Over the years, the wwPDB Task Force on NMR has pushed strongly for remediation of atom nomenclature

Page 19: Worldwide Protein Data Bank  September 7, 2007

Accomplished: atom nomenclature remediation

Nomenclature in PDB now matches that in BMRB
The single format will avoid confusion and errors
All discrepancies have been resolved in the remediated files, with the minor exception of atoms at the C-terminus

IUPAC-IUBMB-IUPAB    wwPDB
H''                  HXT
O'                   O
O''                  OXT

– Since these atoms are not observed by NMR spectroscopists, we do not consider this to be a problem

– We plan to write an addendum to the IUPAC-IUBMB-IUPAB “Recommendations” for submission to Pure & Appl. Chem. to formalize these as “accepted atom designators”
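
The three C-terminal exceptions listed on this slide amount to a small lookup table. The sketch below shows one way software reading both conventions might apply it; the dictionary and function names are hypothetical, and only the three name pairs come from the slide.

```python
# Name pairs taken from the slide: IUPAC-IUBMB-IUPAB name -> wwPDB name.
C_TERMINAL_ATOM_NAMES = {
    "H''": "HXT",
    "O'": "O",
    "O''": "OXT",
}


def to_wwpdb_name(iupac_name: str) -> str:
    """Translate a C-terminal IUPAC atom name; other names pass through unchanged."""
    return C_TERMINAL_ATOM_NAMES.get(iupac_name, iupac_name)


print([to_wwpdb_name(name) for name in ("H''", "O'", "O''", "HB2")])
```

Names outside the table, such as HB2, are returned unchanged, which reflects the slide's statement that all other discrepancies have been resolved.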

Page 20: Worldwide Protein Data Bank  September 7, 2007

Remediation of NMR structure files

Required the linking of structure files and restraint files

Atom names, residue numbers and chain identifiers needed to be updated

Remediation of restraint files required the unpacking, parsing, and regularization of legacy information contained in PDB “MR” files into the “NMR Restraints Grid”

Page 21: Worldwide Protein Data Bank  September 7, 2007

NMR Restraints Grid development

BMRB, University of Wisconsin-Madison, USA

MSD, European Bioinformatics Institute, Hinxton, UK

Department of Computer Sciences/Condor Project, University of Wisconsin, USA

Department of NMR Spectroscopy, Utrecht University, The Netherlands

Centre for Molecular and Biomolecular Informatics, Radboud University, The Netherlands

Page 22: Worldwide Protein Data Bank  September 7, 2007

NMR Restraints Grid development

PDB MR files are converted into NMR-STAR

NMR-STAR file and the corresponding PDB coordinate file are parsed; the information is connected inside the CCPN framework; and the results are written out as NMR-STAR files; converted restraint files are filtered to remove redundant restraints

Files made available in the NMR Restraints Grid with access from links in each corresponding PDB entry

NMR restraint data files with atom nomenclature corresponding to remediated PDB data files will be available by the end of 2007
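
The slide does not say how redundant restraints are recognized. Purely as an illustration of the filtering step, the sketch below assumes a simplified distance restraint (two atom labels and an upper bound) and treats two restraints as redundant when they join the same pair of atoms, in either order, with the same bound. The record layout, the atom label format, and the function name are all hypothetical and are not taken from NMR-STAR or CCPN.

```python
from typing import List, NamedTuple


class DistanceRestraint(NamedTuple):
    atom_a: str         # made-up label format: chain:residue:atom
    atom_b: str
    upper_bound: float  # upper distance bound in angstroms


def remove_redundant(restraints: List[DistanceRestraint]) -> List[DistanceRestraint]:
    """Keep the first restraint seen for each unordered atom pair and bound."""
    seen = set()
    unique = []
    for restraint in restraints:
        # frozenset makes (a, b) and (b, a) compare equal.
        key = (frozenset((restraint.atom_a, restraint.atom_b)), restraint.upper_bound)
        if key not in seen:
            seen.add(key)
            unique.append(restraint)
    return unique


sample = [
    DistanceRestraint("A:12:HB2", "A:15:HN", 5.0),
    DistanceRestraint("A:15:HN", "A:12:HB2", 5.0),  # same pair, reversed order
    DistanceRestraint("A:12:HB3", "A:15:HN", 4.0),
]
filtered = remove_redundant(sample)
print(len(sample), "->", len(filtered))
```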

Page 23: Worldwide Protein Data Bank  September 7, 2007

Current state of the NMR Restraints Grid

Grid contains 3583 entries with a total of 3,882,595 parsed restraints

3583 entries out of 6508 in PDB have restraints

Database is updated continuously as new PDB entries are released that have associated NMR restraints
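
The coverage and average size implied by these counts follow from simple division. This is a minimal sketch with illustrative variable names; the 6508 figure is taken to be the set of entries the slide counts as "in PDB".

```python
# Counts quoted on the slide.
entries_with_restraints = 3_583
entries_in_pdb = 6_508          # the total the slide compares against
parsed_restraints = 3_882_595

coverage = entries_with_restraints / entries_in_pdb
mean_restraints = parsed_restraints / entries_with_restraints

print(f"entries with restraints: {coverage:.1%}")
print(f"mean parsed restraints per entry: {mean_restraints:.0f}")
```

This gives a coverage of roughly 55% and a little over a thousand parsed restraints per entry in the Grid.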

Page 24: Worldwide Protein Data Bank  September 7, 2007

Recent agenda items considered by the wwPDB NMR Task Force

Strongly recommend that restraints be mandatory for all NMR depositions to the PDB

Commissioned the development of procedures for representing uncertainty in NMR structures and for specifying the single model meant to be most representative of the structure

Task Force should write an article for J. Biomol. NMR on its recommendations for data representation and submission of experimental data

It was suggested that the Task Force begin to discuss validation issues

Page 25: Worldwide Protein Data Bank  September 7, 2007

Most X-ray structures are supported by structure factors

[Figure: Deposited Crystal Structures and Structure Factor Files. Y axis: Count (0 to 7000). X axis: Year (1999 to 2006). Series: Crystal Structures, Structure Factors]

Page 26: Worldwide Protein Data Bank  September 7, 2007

Less than half of NMR structures are supported by restraint data

[Figure: Deposited NMR Structures and Restraint Files. Y axis: Count (0 to 1200). X axis: Year (1999 to 2007). Series: NMR Structures, Restraint Files]

Page 27: Worldwide Protein Data Bank  September 7, 2007

Most structural genomics centers regularly provide restraints, but the overall average is low

[Figure: Y axis: Percent of deposited structures with restraints (0% to 100%). X axis: Structural genomics center (RIKEN, OTHER SG, TOTAL). Bar labels, number of NMR structures deposited: 1127, 880, 247]

Page 28: Worldwide Protein Data Bank  September 7, 2007

Remediation rollout

Helen M. Berman

Page 29: Worldwide Protein Data Bank  September 7, 2007

Remediation: scope and statistics

All primary citations verified (45K)

Sequences & taxonomy updated for 61K sequences

Ligand stereochemistry and nomenclature for 13M monomers and 170K non-polymer molecules

Symmetry and coordinate transformations for 280 virus entries

10814 diffraction source & beamline updates

~1000 miscellaneous uniformity issues

Page 30: Worldwide Protein Data Bank  September 7, 2007

Remediation process

Corrections contributed and reviewed by all wwPDB members
Corrections on the archival mmCIF data files tracked in a version tracking system (CVS)
New PDBx/mmCIF, PDBML-XML, and PDB format data files produced
Validated by each wwPDB group
Staged public testing began January 2007
Iterative corrections based on external comments made through July 2007
Remediated archive released August 1, 2007

Page 31: Worldwide Protein Data Bank  September 7, 2007

Remediation-supporting infrastructure

Internal (wwPDB) CVS archive remediation data files
Internal (wwPDB) rsync distribution site for remediated data files
Early tests of web, rsync, & ftp distribution sites for dictionaries, PDB, mmCIF, and XML data files
Complete wwPDB ftp site for remediated data and dictionaries updated with remediation corrections and weekly PDB updates
200K CVS remediated data file updates
1M+ remediated file updates to support testing and distribution from January 2007 - present

Page 32: Worldwide Protein Data Bank  September 7, 2007

Checking the remediated files

Haruki Nakamura

Page 33: Worldwide Protein Data Bank  September 7, 2007

Different checks

References to external databases
Data processing consistency checks
PDBML/XML validation
Database loads
User-contributed diagnostics

Page 34: Worldwide Protein Data Bank  September 7, 2007

References to external databases

Sequence and taxonomy (UniProt)
Primary Citations (PubMed)

Page 35: Worldwide Protein Data Bank  September 7, 2007

Data processing consistency checks

Covalent geometry and stereochemistry
Compliance with wwPDB Chemical Component Dictionary
– Molecular and stereochemical assignment
– Atom and residue nomenclature
Compliance with PDB Exchange Dictionary
– Data types, controlled vocabularies, parent-child relations

External tools such as WhatIF

Page 36: Worldwide Protein Data Bank  September 7, 2007

PDBML/XML schema validation

Version control
Data type consistency
Data ranges
Controlled vocabularies
Referential integrity
XPath traversal of PDBML data hierarchy
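
To make the controlled-vocabulary and XPath items above concrete, the sketch below walks a toy XML document with the limited XPath support in Python's standard library and reports values outside an allowed list. The document is a simplified, namespace-free stand-in and not real PDBML (which uses XML namespaces and a far larger schema); the element names, the allowed vocabulary, and the function name are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Toy stand-in for a PDBML file; element names are illustrative only.
TOY_DOCUMENT = """
<datablock name="TOY">
  <entityCategory>
    <entity id="1"><type>polymer</type></entity>
    <entity id="2"><type>water</type></entity>
    <entity id="3"><type>ligand</type></entity>
  </entityCategory>
</datablock>
"""

# Hypothetical controlled vocabulary for the <type> element.
ALLOWED_ENTITY_TYPES = {"polymer", "non-polymer", "water"}


def find_vocabulary_violations(xml_text: str) -> list:
    """Return (entity id, value) pairs whose type is outside the allowed set."""
    root = ET.fromstring(xml_text)
    violations = []
    # XPath-style traversal from the root down to each entity record.
    for entity in root.findall("./entityCategory/entity"):
        value = entity.findtext("type")
        if value not in ALLOWED_ENTITY_TYPES:
            violations.append((entity.get("id"), value))
    return violations


print(find_vocabulary_violations(TOY_DOCUMENT))  # prints [('3', 'ligand')]
```

A production check would more likely validate against the published XML schema; the hand-written loop here only shows the kind of rule such validation enforces.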

Page 37: Worldwide Protein Data Bank  September 7, 2007

Database loads

Diagnostics obtained from loading remediated data into existing database systems
– Relational databases used by MSD-EBI and RCSB PDB
– XML database used by PDBj

Page 38: Worldwide Protein Data Bank  September 7, 2007

User-contributed diagnostics

Batch checking of remediated files by Phenix revealed consistency issues with alternate conformations - Ralf Grosse-Kunstleve
Batch checking for inconsistent linkages and missing residues by docking software - Tommy Carstensen
Nomenclature - Tom Goddard & Chimera Group
Sequence and assembly diagnostics - Roland Dunbrack
Relational data integrity diagnostics - Dan Bosler
Nomenclature and experimental details - Clemens Vonrhein
Many specific issues related to chemical assignments, disorder, and nomenclature

Page 39: Worldwide Protein Data Bank  September 7, 2007

Looking toward the future

Kim Henrick

Page 40: Worldwide Protein Data Bank  September 7, 2007

Annotation project

Standardize annotation rules and policies among wwPDB sites

Document annotation rules and policies

Create venue to update annotation rules and policies as necessary

Page 41: Worldwide Protein Data Bank  September 7, 2007

Annotation project: How did we get there?

Review and discussion of each PDB field by email and VTC
Document written and reviewed by all staff
Final review by site directors
Software compliant with new annotation procedures implemented
Tested software and trained annotators
Published document on web (January 2007)

Page 42: Worldwide Protein Data Bank  September 7, 2007

Annotation document

Specification of ALL fields in PDB file
Clarification of policies
– Assignment of PDB IDs
– Release of files and information
– Changes to entries
Clarification of data representation
– Chain ID for all atoms in the file
– Multi-model representation for alternate conformation or disorder
– Chimeras
– Microheterogeneity

Page 43: Worldwide Protein Data Bank  September 7, 2007

PDB IDs and DOIs

Credit for a PDB entry in CVs

Used as a reference in publications
– http://dx.doi.org/10.2210/pdb4hhb/pdb

See also: DOIs for Biological Databases, Philip E. Bourne, CrossRef 7th Annual Meeting, 1 November 2006, Cambridge, MA
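
The example URL on this slide suggests a simple pattern for deriving a DOI from a PDB ID. The sketch below generalizes from that single example, so the pattern itself is an assumption, as are the four-character validation rule and the function names.

```python
DOI_PREFIX = "10.2210"                 # prefix seen in the slide's example
RESOLVER = "http://dx.doi.org/"        # resolver used in the slide's example


def pdb_doi(pdb_id: str) -> str:
    """Build a DOI for a PDB ID, following the pattern of the 4hhb example."""
    normalized = pdb_id.strip().lower()
    # Assumed rule: IDs are exactly four alphanumeric characters.
    if len(normalized) != 4 or not normalized.isalnum():
        raise ValueError(f"not a four-character PDB ID: {pdb_id!r}")
    return f"{DOI_PREFIX}/pdb{normalized}/pdb"


def pdb_doi_url(pdb_id: str) -> str:
    """Return the resolver URL for the DOI of a PDB ID."""
    return RESOLVER + pdb_doi(pdb_id)


print(pdb_doi_url("4HHB"))  # prints http://dx.doi.org/10.2210/pdb4hhb/pdb
```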

Page 44: Worldwide Protein Data Bank  September 7, 2007

Outstanding issues

Microheterogeneity
Disorder
Large structures

Page 45: Worldwide Protein Data Bank  September 7, 2007

wwPDB and software developers

ACA 24th July 2007 meeting in Salt Lake City

“Future Challenges for the PDB: What should the PDB be doing in 2015?”

Attended by software developers and wwPDB staff

Page 46: Worldwide Protein Data Bank  September 7, 2007

July 24 meeting

Technical discussions
– TLS
– Multiple models
– Large structure: demand for one file per structure
– Microheterogeneity
– Twinning

George Sheldrick, Paul Adams and Garib Murshudov produce a draft of the PDB format to describe twinning and to represent the data in HKLF

Procedural outcomes
– Yearly developer meeting
– Editorial board to assist in difficult annotation problems
– Ongoing electronic forum

Page 47: Worldwide Protein Data Bank  September 7, 2007

Toward a single processing tool

This weekend – wwPDB retreat with contributors from RCSB PDB Rutgers and UCSD, BMRB, PDBj, and EBI-EMBL

Task – come to agreement to pool resources to produce a single deposition tool and design of new processing pipeline