Enabling Rapid Interaction with the Protein Data Bank Alexy Khrabrov Rutgers University John D. Westbrook Rutgers University

Post on 02-Feb-2016

Page 1: Enabling Rapid Interaction   with the Protein Data Bank

Enabling Rapid Interaction with the Protein Data Bank

Alexy Khrabrov, Rutgers University

John D. Westbrook, Rutgers University

Page 2: Enabling Rapid Interaction   with the Protein Data Bank

Goals

• Provide application and database access to macromolecular structure data

• Follow standards-based approach (OMG MMS finalized 2001)

• Build on informatics structure of PDB data ontology

• Provide high-performance access

• Direct access to compact binary data structures (e.g. coordinates)

• Provide broad granularity of access (individual atoms to biological assemblies)

Page 3: Enabling Rapid Interaction   with the Protein Data Bank

Program Level Access to the Details of Molecular Structure

Ligand – Which ligands are contained within the entry?

Chain/Entity – Extract the sequence and coordinates for each molecular entity.

Secondary Structure – Extract helices and sheets for the entry.

Residues/Atoms – What is the environment of this residue? Extract the coordinates for a selection of atoms or residues.
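The "environment of this residue" question can be sketched as a distance search over atom coordinates. This is a minimal illustration only: the `Atom` record, the `residue_environment` function, the 4.0 angstrom default cutoff, and the coordinates are all hypothetical and are not taken from the API described in this deck.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class Atom:
    """Hypothetical minimal atom record; not the API's AtomSite type."""
    atom_id: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float


def residue_environment(atoms, chain, res_seq, cutoff=4.0):
    """Return atoms outside the given residue that lie within `cutoff`
    angstroms of any atom belonging to that residue."""
    def in_residue(a):
        return a.chain == chain and a.res_seq == res_seq

    target = [a for a in atoms if in_residue(a)]
    others = [a for a in atoms if not in_residue(a)]
    return [
        o for o in others
        if any(dist((o.x, o.y, o.z), (t.x, t.y, t.z)) <= cutoff for t in target)
    ]


# Made-up coordinates, for illustration only.
atoms = [
    Atom("1", "A", 10, 0.0, 0.0, 0.0),
    Atom("2", "A", 10, 1.5, 0.0, 0.0),
    Atom("3", "A", 11, 3.0, 0.0, 0.0),
    Atom("4", "B", 55, 20.0, 0.0, 0.0),
]
neighbours = residue_environment(atoms, "A", 10)
print([a.atom_id for a in neighbours])
```

The brute-force pairwise search keeps the sketch short; a real client working on large assemblies would more likely use a spatial index.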

Page 4: Enabling Rapid Interaction   with the Protein Data Bank

API Architecture Features

• API organization based on PDB Exchange Data Dictionary - access methods are provided at the level of data categories/classes

• PDB Exchange Dictionary provides the content to automatically generate:

– OMG Interface Definition Language (IDL) and access classes

– SQL queries required to support CORBA server

– Software to load PDB data files in memory or into a supporting relational database engine

Page 5: Enabling Rapid Interaction   with the Protein Data Bank

Current Data Dictionaries

http://deposit.pdb.org/mmcif/

• PDB data exchange (XML Schema/CIF)

– Including structural genomics and data harvesting extensions

• mmCIF

• NMR

• 3D-EM

• Modeling

• Crystallization

• Symmetry

• Image data

• BIOSYNC

Page 6: Enabling Rapid Interaction   with the Protein Data Bank

Extending Data Dictionaries for Deposition

• X-ray

– macromolecular naming, source organism, crystallization and cell parameters, data collection, structure solution and phasing, model building, refinement, model quality

• NMR

– explicit details on sample preparation, contents and conditions, constraints, force constants, related statistics

• Protein Production

– source information, target gene production, bacterial cloning, bacterial expression, purification

Page 7: Enabling Rapid Interaction   with the Protein Data Bank

Elements of Dictionary Metadata

• Data Attributes

– Definition

– Examples

– Data type (primitive type/regular expression patterns)

– Range or allowed values

• Classes

– Categories

– Subcategories

– Category groups

• Associations

– Parent-child relationships

– Interdependencies/exclusivity

– Methods
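To make the data-attribute metadata concrete, the sketch below models one attribute with a definition, a regular-expression data type, and an allowed range, then validates raw values against it. The `ItemDefinition` class, the float pattern, and the 0.0 to 1.0 bounds are assumptions made for illustration; they are not copied from the PDB Exchange Dictionary.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Illustrative pattern for a floating-point primitive type (an assumption).
FLOAT_PATTERN = r"-?(\d+\.?\d*|\.\d+)([eE][-+]?\d+)?"


@dataclass
class ItemDefinition:
    """Hypothetical stand-in for one dictionary data attribute."""
    name: str
    definition: str
    pattern: str                      # regular expression for the data type
    minimum: Optional[float] = None   # inclusive lower bound, if any
    maximum: Optional[float] = None   # inclusive upper bound, if any

    def validate(self, raw: str) -> bool:
        """Check a raw string against the type pattern, then the range.
        The range check assumes the pattern only admits numeric text."""
        if re.fullmatch(self.pattern, raw) is None:
            return False
        if self.minimum is None and self.maximum is None:
            return True
        value = float(raw)
        if self.minimum is not None and value < self.minimum:
            return False
        if self.maximum is not None and value > self.maximum:
            return False
        return True


# The bounds below are illustrative, not the dictionary's actual constraints.
occupancy = ItemDefinition(
    name="_atom_site.occupancy",
    definition="Fraction of the atom present at this site (paraphrased).",
    pattern=FLOAT_PATTERN,
    minimum=0.0,
    maximum=1.0,
)

for raw in ("0.50", "1.5", "abc"):
    print(raw, occupancy.validate(raw))
```

Keeping the pattern and the range as data rather than code is what lets a generator emit validators, schemas, or IDL from the same dictionary entry.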

Page 8: Enabling Rapid Interaction   with the Protein Data Bank

Automatic Production of Macromolecular Structure API Components

[Diagram. Box labels: PDB Exchange Dictionary + API Specific Data Dictionaries; Metamodel Framework; CORBA IDL, SQL Schema, XML DTD/Schemas, Data Loaders, Database Access Classes]

Page 9: Enabling Rapid Interaction   with the Protein Data Bank

Macromolecular Structure API Data Flow

[Diagram. Box labels: Applications; CORBA Server; Relational Database; mmCIF Data Files (Data Reference Standard); mmCIF Parsers; XML Files]

Page 10: Enabling Rapid Interaction   with the Protein Data Bank

Metadata Framework

• PDB Exchange Dictionary

– Defines content model

• Grouping Dictionary

– Maps dictionary content to API organization

– Assigns attributes to API aggregate data types and indices

• Schema Mapping Dictionary

– Maps content to physical storage layer

Page 11: Enabling Rapid Interaction   with the Protein Data Bank

Automatic Generation of IDL

• Metadata framework is input data for automated generation of CORBA IDL

• IDL is a platform-independent definition of the API

• IDL is used to produce client stubs and server skeleton classes on any platform
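As a rough picture of what generating IDL from metadata could look like, the sketch below renders an IDL struct from a list of (type, member) pairs. The member list mirrors the `AtomSite` struct quoted in the client examples of this deck; the `generate_idl_struct` function and the list-of-pairs input format are hypothetical and do not represent the authors' generator.

```python
def generate_idl_struct(name, fields):
    """Render an IDL struct from (idl_type, member_name) pairs."""
    lines = ["struct %s {" % name]
    lines.extend("    %s %s;" % (idl_type, member) for idl_type, member in fields)
    lines.append("};")
    return "\n".join(lines)


# Members taken from the AtomSite struct shown in the client examples;
# in a real generator they would be derived from the dictionary metadata.
atom_site_fields = [
    ("string", "id"),
    ("IndexId", "type_symbol"),
    ("AtomIndex", "label"),
    ("IndexId", "label_entity"),
    ("VectorXYZ", "cartn"),
    ("float", "occupancy"),
    ("float", "b_iso_or_equiv"),
]

idl_text = generate_idl_struct("AtomSite", atom_site_fields)
print(idl_text)
```

Because the output is plain text, the same field list could feed a different template to target another interface specification.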

Page 12: Enabling Rapid Interaction   with the Protein Data Bank

Automatic Generation of API Server

• Metadata framework is input data for automated generation of server access classes:

– SQL access methods

– Implementation of abstract skeleton methods using DB2 CLI

• Integrate with any custom server methods
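The idea of generated SQL access methods can be sketched as building a parameterized query from a schema-mapping entry. Everything here is an assumption for illustration: the mapping layout, the table and column names, and the `build_select` helper are hypothetical, and Python's built-in `sqlite3` stands in for DB2 CLI purely so that the example runs without a database server.

```python
import sqlite3

# Hypothetical schema-mapping entry: category -> table, key column, columns.
SCHEMA_MAPPING = {
    "atom_site": {
        "table": "atom_site",
        "key": "structure_id",
        "columns": ["id", "cartn_x", "cartn_y", "cartn_z"],
    },
}


def build_select(category, mapping=SCHEMA_MAPPING):
    """Build a parameterized SELECT for one category.
    Identifiers come from the trusted mapping; the key value is bound
    at execution time through the '?' parameter marker."""
    entry = mapping[category]
    return "SELECT %s FROM %s WHERE %s = ? ORDER BY id" % (
        ", ".join(entry["columns"]), entry["table"], entry["key"])


query = build_select("atom_site")

# In-memory stand-in database with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE atom_site "
    "(structure_id TEXT, id TEXT, cartn_x REAL, cartn_y REAL, cartn_z REAL)")
conn.executemany(
    "INSERT INTO atom_site VALUES (?, ?, ?, ?, ?)",
    [("DEMO", "1", 1.0, 2.0, 3.0),
     ("DEMO", "2", 4.0, 5.0, 6.0),
     ("OTHER", "1", 7.0, 8.0, 9.0)])
rows = conn.execute(query, ("DEMO",)).fetchall()
conn.close()

print(query)
print(rows)
```

A generated server skeleton method would wrap a query like this and convert each row into the corresponding API data type.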

Page 13: Enabling Rapid Interaction   with the Protein Data Bank

API Server Extension

• Extend content model through PDB exchange data dictionary

• Extend supporting dictionaries in metadata framework

• Autogenerate IDL

• Autogenerate skeleton implementations

• Integrate custom code

Page 14: Enabling Rapid Interaction   with the Protein Data Bank

Supporting Alternative APIs

• Adapt IDL autogenerator

– Revise MDF->IDL to MDF->new API spec

• Adapt autogenerator of server skeleton implementations

• Integrate custom methods

Page 15: Enabling Rapid Interaction   with the Protein Data Bank

Server Availability

• OpenMMS toolkit provides Java interface to Oracle/MySQL using JDBC (core mmCIF classes)

• C++ server using native interface to DB2 (EEE) implemented on 4-node Linux cluster (NDB beta test in Sept.)

• Installation of DB2 (EEE) at SDSC underway to support high-performance access

Page 16: Enabling Rapid Interaction   with the Protein Data Bank

Client Program Examples

DsMmsMacromolecularStructure.idl excerpt:

struct AtomSite {
    string id;
    IndexId type_symbol;
    AtomIndex label;
    IndexId label_entity;
    VectorXYZ cartn;
    float occupancy;
    float b_iso_or_equiv;
};

Page 17: Enabling Rapid Interaction   with the Protein Data Bank

Client Program Examples

Example 1. Retrieving the AtomSite list for hemoglobin (4HHB) and printing the atomic coordinates.

A primary design requirement was a clearly defined interface that is easy to use when developing new applications. The code examples in this section illustrate how client programs can use the API to quickly access macromolecular structure data. As a simple example, the following Python code fragment prints the atom identifier and the Cartesian (x, y, z) position for the first ten atoms in the macromolecule 4HHB.

import sys

# ef is assumed to be the entry-factory object obtained during
# client setup, which is not shown in the slides.
try:
    sid = "4HHB"
    e = ef.get_entry_from_id(sid)
except Exception:
    print("cannot get entry %s, exiting!" % sid)
    sys.exit(1)

print("got entry!")

# Get the atom site list
atoms = e.get_atom_site_list()
print("got %d atoms total" % len(atoms))

print("A few atoms:")
for a in atoms[:10]:
    print("%s\t%.3f %.3f %.3f" % (a.id, a.cartn.x, a.cartn.y, a.cartn.z))

Page 18: Enabling Rapid Interaction   with the Protein Data Bank

Example 2. Listing symmetry information and the residue ranges for the helices of hemoglobin (4HHB).

# Get the symmetry information
s = e.get_sym_info()

print("space group: %s" % s.space_group)
print("cell constants: ")
c = s.acell.unit_cell
print("a=%.3f, b=%.3f, c=%.3f" %
      (c.length_a, c.length_b, c.length_c))
print("alpha=%.3f, beta=%.3f, gamma=%.3f" %
      (c.angle_alpha, c.angle_beta, c.angle_gamma))

# Get the secondary structures
sconfs = e.get_struct_conf_list()

print("Secondary structures:")
for a in sconfs:
    print("%s\t%s %s %s\t--> %s %s %s" % (
        a.id,
        a.beg_auth.asym.id, a.beg_auth.comp.id, a.beg_auth.seq.id,
        a.end_auth.asym.id, a.end_auth.comp.id, a.end_auth.seq.id))
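The six cell constants printed in Example 2 are enough to compute the unit-cell volume with the standard formula for a general (triclinic) cell. The sketch below is a stand-alone illustration: the `unit_cell_volume` function is hypothetical and not part of the API, and the sample values are made up rather than taken from 4HHB.

```python
from math import cos, radians, sin, sqrt


def unit_cell_volume(a, b, c, alpha, beta, gamma):
    """Volume of a unit cell from edge lengths (angstroms) and
    angles (degrees), using the general triclinic formula."""
    ca, cb, cg = (cos(radians(angle)) for angle in (alpha, beta, gamma))
    return a * b * c * sqrt(1.0 - ca * ca - cb * cb - cg * cg + 2.0 * ca * cb * cg)


# Made-up cells for illustration; with the API, the arguments would be
# c.length_a, c.length_b, c.length_c, c.angle_alpha, c.angle_beta, c.angle_gamma.
cubic = unit_cell_volume(10.0, 10.0, 10.0, 90.0, 90.0, 90.0)
monoclinic = unit_cell_volume(10.0, 20.0, 30.0, 90.0, 100.0, 90.0)

print("cubic volume: %.3f" % cubic)
print("monoclinic volume: %.3f" % monoclinic)
```

For a monoclinic cell the formula reduces to a * b * c * sin(beta), which gives a quick sanity check on the result.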

Page 19: Enabling Rapid Interaction   with the Protein Data Bank

Client Availability

• Example clients provide category-level access in Java OpenMMS and C++ native servers

• Clients available in Java, C++ and Python

• C++ API extended to support efficient detailed molecular selections (e.g. coordinates of secondary structure elements, symmetry-related molecular elements, biological assemblies)

Page 20: Enabling Rapid Interaction   with the Protein Data Bank

Access

• Protein Data Bank Site

– http://www.pdb.org/

• OpenMMS site (Java implementation)

– http://openmms.sdsc.edu

• PDB Software Download Site (C++ and Python implementation)

– http://deposit.pdb.org/mmcif/FILM/

• PDB Dictionary Resource Site

– http://deposit.pdb.org/mmcif/

• PDB Beta Data Site

– ftp://beta.rcsb.org/pub/pdb/uniformity/data/