The BioAssay Research Database: A Platform to Support the Collection, Management and Analysis of Chemical Biology Data
ACS National Meeting, New Orleans, April 7, 2013
@AskTheBARD
http://bard.nih.gov
Direct Contributors
NIH Molecular Libraries – Glenn McFadden, Ajay Pillai
NIH Chemical Genomics Center – Chris Austin (PI), John Braisted, Marc Ferrer, Rajarshi Guha, Ajit Jadhav, Dac-Trung Nguyen, Tyler Peryea, Noel Southall, Henrike Veith
Broad Institute – Benjamin Alexander, Jacob Asiedu, Kay Aubrey, Joshua Bittker, Steve Brudz, Simon Chatwin, Paul Clemons, Vlado Dancik, Siva Dandapani, Andrea DeSouza, Dan Durkin, David Lahr, Jeri Levine, Judy McGloughlin, Phil Montgomery, Jose Perez, Stuart Schreiber (PI), Gil Walzer, Xiaorong Xiang
University of New Mexico – Cristian Bologa, Steve Mathias, Tudor Oprea, Larry Sklar, Oleg Ursu, Anna Waller, Jeremy Yang
University of Miami – Saminda Abeyruwan, Hande Küküc, Vance Lemmon, Ahsan Mir, Magdalena Przydzial, Kunie Sakurai, Stephan Schürer, Uma Vempati, Ubbo Visser
Vanderbilt University – Eric Dawson, Bill Graham, Craig Lindsley, Shaun Stauffer
Sanford-Burnham Medical Research Institute – “T.C.” Chung, Jena Diwan, Michael Hedrick, Gavin Magnuson, Siobhan Malany, Ian Pass, Anthony Pinkerton, Derek Stonich
Scripps Research Institute – Yasel Cruz, Mark Southern
BARD: BioAssay Research Database
BARD’s mission is to enable novice and expert scientists to effectively utilize MLP data to generate new hypotheses
• Unique collaboration amongst NIH and academic centers with expertise in screening and software development
• Developed as an open-source, industrial-strength platform to support public translational research
• Provides opportunity to address existing cheminformatics barriers
  o Deploy predictive models
  o Foster new methods to interpret chemical biology data
  o Enable private data sharing
  o Develop and adopt an Assay Data Standard with tools to:
    o Annotate assays to a minimum standard and definitions
    o Integrate and extend existing ontologies for meaningful experiment descriptions
    o Enable assay creation, registration and modification
  o Provide an easy-to-use portal and an advanced desktop client
Engagement & Milestones
Summer 2011: MLP issues administrative supplement and call for proposals to create the Molecular Libraries Biological Database
January 2012: Inaugural meeting of MLPCN Stakeholders & NIH MLP PT
February 2012: Progress update on data extraction & annotation, test platform selection, GUI design & test, outreach
March 2012: BARD Program Kick-off
April 2012: Outreach strategy & tactic session at UNM w/ subteam
May – July 2012: Discussions with and reviews of Amgen, Vertex, Novartis, Sanofi assay registration and chem-bio information query systems
November 2012: Conducted multi-level usability interviews on BARD GUI & function w/ Dir. Computation, Informatics/Lab Mgr, TA Lead, Dir. Chem, Med chem, Db developer, Cmpd curator
January 2013: BARD Review by Ext. Sci Panel & Public alpha release (CAP, REST API, Web & Desktop clients)
March 2013: BARD limited beta-release, then transition to enabling science
BARD Technology Components
Define & Register Assays
• Data Dictionary – std terms
• Catalog of Assay Protocols
High Quality Data & Result Deposition
• Calculations & Results
• Project-experiment association
Query & Interpret Information
• Intuitive Guided Queries
• Cross Assay & SAR centric views
• Advanced applications
Enable Hypothesis Generation
Novice Expert
Where Are We Today?
CAP, Data Dictionary, and Results Deposition
Data model created & populated
CAP UI with View and basic editing
Dictionary defined as OWL using Protégé
Annotations for 85% of MLPCN experiments & projects loaded via spreadsheet
~95% of PubChem result types mapped to BARD dictionary
~70% of PubChem columns mapped to BARD result types
Warehouse loaded with all PubChem AIDs and results
Warehouse loaded with GO terms, KEGG terms, and DrugBank annotations
Manual annotation of AIDs ~70% completed by centers
The BARD Data Warehouse
• Running on MySQL with replication
• 0.85 TB of data…
  – 151M result rows
  – 46M compound rows
• Locally deployed at UNM
• Planning to build better packaging
  – VM based deployment
Open Source As Far as Possible
ETL
Database
Text Search Engine
Structure Search Engine
Caching Layer
http://bard.nih.gov/api
Jersey Webapps deployed on HA Application Server Cluster
The BARD Public API
• Java, REST-like, read-only, deployed on Glassfish cluster
• Different functionality hosted in different containers
  – Maintenance, security
  – Stability
  – Performance
• Versioned
• Fully documented
[Diagram labels: API, Text Search, Struct Search, Data Warehouse, Plugins]
API Resources
• Extensive list of resources covering many data types
• Each resource supports a variety of sub-resources
  – Usually linked to other resources
API Level of Detail
• Supports different levels of detail
• Allows clients to trade off detail for speed
• Good for mobile apps
API Caching & Storage
• Caching is enabled at resource level
• The API supports ETags
  – Every request returns an ETag in the header
  – With If-None-Match, supports web caching
• We also abuse ETags to support persistent references to collections
• An ETag can refer to other ETags recursively
  – Allows clients to create and store arbitrarily complex collections
• Not permanent, not infinite!
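The caching behaviour described on this slide can be illustrated with a minimal sketch. It is not BARD code: the in-memory server, the client class, the way the ETag is computed, and the dictionary used to model collections are all assumptions made for illustration. The first part shows a conditional GET with If-None-Match; the second shows one possible way a collection ETag that refers to other ETags could be flattened.

```python
import hashlib


class TinyResourceServer:
    """Hypothetical in-memory stand-in for a resource that honours If-None-Match."""

    def __init__(self, body):
        self.body = body

    def etag(self):
        # Assumption: the validator is a hash of the body, quoted as HTTP expects.
        return '"' + hashlib.sha256(self.body.encode()).hexdigest()[:16] + '"'

    def get(self, headers=None):
        headers = headers or {}
        tag = self.etag()
        if headers.get("If-None-Match") == tag:
            return 304, {"ETag": tag}, ""  # Not Modified: no body is sent
        return 200, {"ETag": tag}, self.body


class CachingClient:
    """Remembers the last ETag and body, and revalidates on every fetch."""

    def __init__(self, server):
        self.server = server
        self.cached_tag = None
        self.cached_body = None
        self.transfers = 0  # number of full bodies actually transferred

    def fetch(self):
        headers = {"If-None-Match": self.cached_tag} if self.cached_tag else {}
        status, resp_headers, body = self.server.get(headers)
        if status == 200:
            self.transfers += 1
            self.cached_tag = resp_headers["ETag"]
            self.cached_body = body
        return self.cached_body


def resolve(tag, store, seen=None):
    """Flatten a collection whose members are plain IDs or other ETags.

    `store` maps an ETag to its member list (a hypothetical storage model).
    `seen` guards against ETags that refer to each other in a cycle.
    """
    seen = set() if seen is None else seen
    if tag in seen:
        return []
    seen.add(tag)
    items = []
    for member in store[tag]:
        if member in store:
            items.extend(resolve(member, store, seen))
        else:
            items.append(member)
    return items
```

In this sketch a second `fetch()` of an unchanged resource gets a 304 and reuses the cached body, so only changed resources cost a full transfer.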
Annotating Data
[Figure: BARD Dictionary & Term Hierarchy and BARD Assay Definition Hierarchy, labeled with source ontologies and databases: Entrez, Uniprot, Gene Ontology, Disease Ontology, BioAssay Ontology, Unit Ontology, Chemical Ontology]
• To best exploit the current data set, and encourage discoverability, we need to better structure the data
  – Annotate all assays to a minimum standard
  – Integrate and extend existing ontologies to support meaningful experiment descriptions
  – Develop processes and tools to enable assay registration
(Pseudo) Linked Data
• Full text search enabled by Solr
  – Enables filtering, faceting, auto-suggest
  – Key entry point for users
  – Type ahead suggestions provide guidance
• By virtue of manual associations of data types, we enable “linked data”
  – Allows searches to indicate what matched the query and how
  – Solr supports sophisticated scoring schemes
• Doesn’t yet take advantage of ontologies
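As a rough illustration of filtering and faceting, the sketch below builds a Solr select URL from standard Solr query parameters (`q`, `fq`, `facet`, `facet.field`, `rows`, `wt`). The base URL, core name, and field name are hypothetical; the slides do not describe BARD's actual Solr schema.

```python
from urllib.parse import urlencode


def solr_query_url(base, text, facet_fields=(), filters=None, rows=10):
    """Build a Solr /select URL with optional facet fields and filter queries.

    `base` is the URL of a Solr core (hypothetical here); `filters` maps a
    field name to the exact value it must match.
    """
    params = [("q", text), ("rows", rows), ("wt", "json")]
    if facet_fields:
        params.append(("facet", "true"))
        params.extend(("facet.field", field) for field in facet_fields)
    for field, value in (filters or {}).items():
        # fq narrows the result set without affecting relevance scoring
        params.append(("fq", f'{field}:"{value}"'))
    return f"{base}/select?{urlencode(params)}"


# Hypothetical core and field names, for illustration only.
url = solr_query_url(
    "http://localhost:8983/solr/assays",
    "kinase",
    facet_fields=["detection_method_type"],
    filters={"detection_method_type": "fluorescence"},
)
```

Facet counts returned for a field are what a client would typically use to offer filters alongside the search results.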
Desktop Client
• Support large datasets
• Merge private & public data
• Examine SAR
Web Client
Filter on annotations, such as detection method type
Google-like searching of: 4,000+ assays, 35M+ compounds, 300+ projects
Save items of interest for further analysis
Amazon-like Query Cart
Community Engagement
• Sustained outreach efforts
  – 7 MLPCN sites participating
• Facilitate access, driven by compelling use-cases and stakeholder feedback
  – Assay definition standard is a collaboration with industrial partners in addition to MLPCN
• Publish APIs for data access, first-adopters
• A ‘BARD App Store’: Enabling new approaches to data integration, mining
  – Promiscuity calculations
  – CYP450 prediction
Extending BARD with Plugins
• BARD supports deployment of external code as part of core API
• Plugins can access the data warehouse via direct calls
  – No need to go via REST API
• Plugin resources can accept anything
  – Text, JSON, files, links, …
• Plugin responses can be anything
  – Plain text, JSON, HTML, SVG, …
BARD Plugin Development
Plugins have to be deployable on the JVM
BARD - SMARTCyp
• Predicts site of metabolism by CYP450 isoforms using 2D structures
• Developed by Patrik Rydberg and co-workers
• Released under LGPL
• BARD plugin exposes two resources
  – Summary HTML view
  – Data view (JSON)
BARD - SMARTCyp
P. Rydberg et al., http://www.farma.ku.dk/smartcyp/
BARD - BADAPPLE
• BioActivity Data Associative Promiscuity Pattern Learning Engine
• Associations via scaffolds for chemical space navigation.
Example URI*                                   Description
<base>/badapple/prom/cid/752424                For compound with specified ID, return scaffold IDs and scores.
<base>/badapple/prom/cid/752424?expand=true    Additional statistics, scaffold SMILES, and inDrug flag.
<base>/badapple/prom/scafid/233                For scaffold with specified ID, return statistics and SMILES.
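A client might assemble the BADAPPLE URIs with small helpers like the ones sketched below. The path segments and the `expand=true` parameter are taken from the examples; the helper names are hypothetical, and the base URL is left as a parameter because the slide shows it only as `<base>`.

```python
from urllib.parse import urlencode


def badapple_compound_uri(base, cid, expand=False):
    """URI for the promiscuity resource of a compound, by compound ID.

    With expand=True the expand=true query parameter is appended, which per
    the slide requests additional statistics, scaffold SMILES and the
    inDrug flag.
    """
    uri = f"{base.rstrip('/')}/badapple/prom/cid/{int(cid)}"
    if expand:
        uri += "?" + urlencode({"expand": "true"})
    return uri


def badapple_scaffold_uri(base, scafid):
    """URI for the statistics and SMILES of a scaffold, by scaffold ID."""
    return f"{base.rstrip('/')}/badapple/prom/scafid/{int(scafid)}"


# The base below is a placeholder, not the real plugin location.
example_base = "http://example.org/api"
compact = badapple_compound_uri(example_base, 752424)
detailed = badapple_compound_uri(example_base, 752424, expand=True)
scaffold = badapple_scaffold_uri(example_base, 233)
```

Requesting the compact form first and the expanded form only when needed follows the detail-for-speed trade-off that the API slides describe.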
On the Horizon
• Reproducibility
  – Be honest with me …
• Private data in the context of public data
  – Local installs, molecule hashes
• Mobile
  – Compounds as funny looking QR tags
Long-Term Path Forward
• BARD is not just a data store – it’s a platform
  – Seamlessly interact with users’ preferred tools
  – Allows the community to tailor it to their needs
  – Serve as a meeting ground for experimental and computational methods
  – Enhance collaboration opportunities
  – Consider cloud deployment
• Enhance the ability to translate data from individual experiments to systems level insight