the bioassay research database

The BioAssay Research Database: A Platform to Support the Collection, Management and Analysis of Chemical Biology Data

ACS National Meeting, New Orleans, April 7, 2013

@AskTheBARD

http://bard.nih.gov

Direct Contributors

NIH Molecular Libraries – Glenn McFadden, Ajay Pillai

NIH Chemical Genomics Center – Chris Austin (PI), John Braisted, Marc Ferrer, Rajarshi Guha, Ajit Jadhav, Dac-Trung Nguyen, Tyler Peryea, Noel Southall, Henrike Veith

Broad Institute – Benjamin Alexander, Jacob Asiedu, Kay Aubrey, Joshua Bittker, Steve Brudz, Simon Chatwin, Paul Clemons, Vlado Dancik, Siva Dandapani, Andrea DeSouza, Dan Durkin, David Lahr, Jeri Levine, Judy McGloughlin, Phil Montgomery, Jose Perez, Stuart Schreiber (PI), Gil Walzer, Xiaorong Xiang

University of New Mexico – Cristian Bologa, Steve Mathias, Tudor Oprea, Larry Sklar, Oleg Ursu, Anna Waller, Jeremy Yang

University of Miami – Saminda Abeyruwan, Hande Küküc, Vance Lemmon, Ahsan Mir, Magdalena Przydzial, Kunie Sakurai, Stephan Schürer, Uma Vempati, Ubbo Visser

Vanderbilt University – Eric Dawson, Bill Graham, Craig Lindsley, Shaun Stauffer

Sanford-Burnham Medical Research Institute – “T.C.” Chung, Jena Diwan, Michael Hedrick, Gavin Magnuson, Siobhan Malany, Ian Pass, Anthony Pinkerton, Derek Stonich

Scripps Research Institute – Yasel Cruz, Mark Southern

BARD: BioAssay Research Database

BARD’s mission is to enable novice and expert scientists to effectively utilize MLP data to generate new hypotheses.

•  Unique collaboration amongst NIH and academic centers with expertise in screening and software development
•  Developed as an open-source, industrial-strength platform to support public translational research
•  Provides opportunity to address existing cheminformatics barriers
    o  Deploy predictive models
    o  Foster new methods to interpret chemical biology data
    o  Enable private data sharing
    o  Develop and adopt an Assay Data Standard with tools to:
        o  Annotate assays to minimum standards and definitions
        o  Integrate and extend existing ontologies for meaningful experiment descriptions
        o  Enable assay creation, registration and modification
    o  Provide an easy-to-use portal and an advanced desktop client

Engagement & Milestones

Summer 2011: MLP issues administrative supplement and call for proposals to create the Molecular Libraries Biological Database

January 2012: Inaugural meeting of MLPCN Stakeholders & NIH MLP PT

February 2012: Update on progress (data extraction & annotation, test platform selection, GUI design & test, outreach)

March 2012: BARD Program Kick-off

April 2012: Outreach strategy & tactic session at UNM w/ subteam

May – July 2012: Discussions with and reviews of Amgen, Vertex, Novartis, Sanofi assay registration and chem-bio information query systems

November 2012: Conducted multi-level usability interviews on BARD GUI & function w/ Dir. Computation, Informatics/Lab Mgr, TA Lead, Dir. Chem, Med chem, Db developer, Cmpd curator

January 2013: BARD Review by Ext. Sci Panel & Public alpha release (CAP, REST API, Web & Desktop clients)

March 2013: BARD limited beta release, then transition to enabling science

BARD Technology Components

•  Define & Register Assays
    – Data Dictionary: std terms
    – Catalog of Assay Protocols
•  High Quality Data & Result Deposition
    – Calculations & Results
    – Project-experiment association
•  Query & Interpret Information
    – Intuitive Guided Queries
    – Cross Assay & SAR centric views
•  Advanced applications

Enable Hypothesis Generation

Novice Expert

Where Are We Today? CAP, Data Dictionary, and Results Deposition

Data model created & populated

CAP UI with View and basic editing

Dictionary defined as OWL using Protégé

Annotations for 85% of MLPCN experiments & projects loaded via spreadsheet

~95% of PubChem result types mapped to BARD dictionary

~70% of PubChem columns mapped to BARD result types

Warehouse loaded with all PubChem AIDs and results

Warehouse loaded with GO terms, KEGG terms, and DrugBank annotations

Manual annotation of AIDs ~70% completed by centers

The BARD Data Warehouse

•  Running on MySQL with replication
•  0.85 TB of data…
    – 151M result rows
    – 46M compound rows
•  Locally deployed at UNM
•  Planning to build better packaging
    – VM based deployment

Open Source As Far as Possible

[Architecture diagram labels: ETL; Database; Text Search Engine; Structure Search Engine; Caching Layer; Jersey webapps deployed on HA application server cluster; http://bard.nih.gov/api]

The BARD Public API

•  Java, REST-like, read-only, deployed on Glassfish cluster
•  Different functionality hosted in different containers
    – Maintenance, security
    – Stability
    – Performance
•  Versioned
•  Fully documented

[Diagram labels: API; Text Search; Struct Search; Data Warehouse; Plugins]

API Resources

•  Extensive list of resources covering many data types
•  Each resource supports a variety of sub-resources
    – Usually linked to other resources

API Level of Detail

•  Supports different levels of detail
•  Allows clients to trade off detail for speed
•  Good for mobile apps

API Caching & Storage

•  Caching is enabled at resource level
•  The API supports ETags
    – Every request returns an ETag in the header
    – With If-None-Match, supports web caching
•  We also abuse ETags to support persistent references to collections
•  An ETag can refer to other ETags recursively
    – Allows clients to create and store arbitrarily complex collections
•  Not permanent, not infinite!
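The two ETag uses on this slide can be sketched in a few lines of Python. This is an illustrative model only, not the BARD implementation: `ETagCache`, `fake_server`, `resolve`, and the data shapes are hypothetical names chosen for the example. The first part shows a client revalidating a cached resource with `If-None-Match`; the second shows how a collection whose members may themselves be collection ETags could be flattened.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

# A response is (status, etag, body); body is None on 304 Not Modified.
Response = Tuple[int, str, Optional[str]]


class ETagCache:
    """Client-side cache that revalidates entries with If-None-Match."""

    def __init__(self, fetch: Callable[[str, Dict[str, str]], Response]):
        self._fetch = fetch  # hypothetical transport: (url, headers) -> Response
        self._store: Dict[str, Tuple[str, str]] = {}  # url -> (etag, body)

    def get(self, url: str) -> str:
        headers: Dict[str, str] = {}
        if url in self._store:
            # Send the ETag we already hold so the server can answer 304.
            headers["If-None-Match"] = self._store[url][0]
        status, etag, body = self._fetch(url, headers)
        if status == 304:
            return self._store[url][1]  # unchanged: reuse the cached body
        self._store[url] = (etag, body)
        return body


def fake_server(url: str, headers: Dict[str, str]) -> Response:
    """Stand-in for an HTTP server that always serves the same resource."""
    etag, body = '"v1"', "compound data"
    if headers.get("If-None-Match") == etag:
        return 304, etag, None
    return 200, etag, body


def resolve(etag: str, collections: Dict[str, List[str]]) -> List[str]:
    """Flatten a collection whose members may be other collection ETags."""
    items: List[str] = []
    seen: Set[str] = set()

    def walk(tag: str) -> None:
        if tag in seen:  # guard against collections that reference each other
            return
        seen.add(tag)
        for member in collections[tag]:
            if member in collections:
                walk(member)  # member is itself a collection ETag
            else:
                items.append(member)

    walk(etag)
    return items
```

With `cache = ETagCache(fake_server)`, the second `cache.get(url)` for the same URL is answered from the local store after a 304. The cycle guard in `resolve` is a design choice for the sketch; the slide does not say how the real service handles self-referencing collections.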

Annotating Data

[Figure labels: BARD Dictionary & Term Hierarchy; BARD Assay Definition Hierarchy; sources: Entrez, Uniprot, Gene Ontology, Disease Ontology, BioAssay Ontology, Unit Ontology, Chemical Ontology]

•  To best exploit the current data set, and encourage discoverability, we need to better structure the data
    – Annotate all assays to a minimum standard
    – Integrate and extend existing ontologies to support meaningful experiment descriptions
    – Develop processes and tools to enable assay registration

(Pseudo) Linked Data

•  Full text search enabled by Solr
    – Enables filtering, faceting, auto-suggest
    – Key entry point for users
    – Type ahead suggestions provide guidance
•  By virtue of manual associations of data types, we enable “linked data”
    – Allows searches to indicate what matched the query and how
    – Solr supports sophisticated scoring schemes
•  Doesn’t yet take advantage of ontologies
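As a rough illustration of the filtering and faceting mentioned above, the sketch below assembles a Solr query string from standard Solr request parameters (`q`, `fq`, `facet`, `facet.field`, `wt`). The helper name `solr_query` and the field names in the usage note are assumptions made for the example; the actual BARD Solr schema is not described in these slides.

```python
from typing import Dict, Iterable, List, Optional, Tuple
from urllib.parse import urlencode


def solr_query(text: str,
               filters: Optional[Dict[str, str]] = None,
               facet_fields: Iterable[str] = ()) -> str:
    """Build a URL-encoded Solr query string (field names are hypothetical)."""
    params: List[Tuple[str, str]] = [("q", text), ("wt", "json")]
    # Each filter becomes its own fq clause, narrowing the main query.
    for field, value in (filters or {}).items():
        params.append(("fq", f'{field}:"{value}"'))
    facet_fields = list(facet_fields)
    if facet_fields:
        # Ask Solr to return value counts for each requested field.
        params.append(("facet", "true"))
        params.extend(("facet.field", name) for name in facet_fields)
    return urlencode(params)
```

For example, `solr_query("kinase", {"detection_method_type": "fluorescence"}, ["assay_type"])` yields a query string that searches for "kinase", keeps only documents with the given (hypothetical) annotation value, and requests facet counts for one field.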

Desktop Client

•  Support large datasets
•  Merge private & public data
•  Examine SAR

Web Client

Filter on annotations, such as detection method type

Google-like searching of: 4,000+ assays, 35M+ compounds, 300+ projects

Save items of interest for further analysis

Amazon-like Query Cart

Community Engagement

•  Sustained outreach efforts
    – 7 MLPCN sites participating
•  Facilitate access, driven by compelling use-cases and stakeholder feedback
    – Assay definition standard is a collaboration with industrial partners in addition to MLPCN
•  Publish APIs for data access, first-adopters
•  A ‘BARD App Store’: Enabling new approaches to data integration, mining
    – Promiscuity calculations
    – CYP450 prediction

Extending BARD with Plugins

•  BARD supports deployment of external code as part of core API
•  Plugins can access the data warehouse via direct calls
    – No need to go via REST API
•  Plugin resources can accept anything
    – Text, JSON, files, links, …
•  Plugin responses can be anything
    – Plain text, JSON, HTML, SVG, …

BARD Plugin Development

Plugins have to be deployable on the JVM

BARD - SMARTCyp

•  Predicts site of metabolism by CYP450 isoforms using 2D structures
•  Developed by Patrik Rydberg and co-workers
•  Released under LGPL
•  BARD plugin exposes two resources
    – Summary HTML view
    – Data view (JSON)

P. Rydberg et al., http://www.farma.ku.dk/smartcyp/

BARD - BADAPPLE

•  BioActivity Data Associative Promiscuity Pattern Learning Engine

•  Associations via scaffolds for chemical space navigation.

Example URI* and description

<base>/badapple/prom/cid/752424
    For compound with specified ID, return scaffold IDs and scores.

<base>/badapple/prom/cid/752424?expand=true
    Additional statistics, scaffold smiles, and inDrug flag.

<base>/badapple/prom/scafid/233
    For scaffold with specified ID, return statistics and smiles.
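A small helper can make the URI pattern above explicit. The sketch below only builds the three URI shapes listed on the slide; the function name `badapple_url` is hypothetical, and the base URL is left as a caller-supplied argument because the slide shows it only as the `<base>` placeholder. Whether `expand=true` is also accepted on the scaffold resource is not stated in the source, so the helper does not restrict it.

```python
from urllib.parse import urlencode


def badapple_url(base: str, kind: str, ident: int, expand: bool = False) -> str:
    """Build a BADAPPLE plugin URI of the form <base>/badapple/prom/<kind>/<id>.

    kind is "cid" (compound ID) or "scafid" (scaffold ID), as on the slide.
    """
    if kind not in ("cid", "scafid"):
        raise ValueError(f"unknown identifier kind: {kind!r}")
    url = f"{base.rstrip('/')}/badapple/prom/{kind}/{int(ident)}"
    if expand:
        # Request the additional statistics described for expand=true.
        url += "?" + urlencode({"expand": "true"})
    return url
```

Called with a placeholder base such as `"http://example.org/api"`, the helper reproduces the slide's examples, e.g. `badapple_url(base, "cid", 752424, expand=True)` ends in `/badapple/prom/cid/752424?expand=true`.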

On the Horizon


•  Reproducibility
    – Be honest with me …
•  Private data in the context of public data
    – Local installs, molecule hashes
•  Mobile
    – Compounds as funny looking QR tags

Long-Term Path Forward

•  BARD is not just a data store – it’s a platform
    – Seamlessly interact with users’ preferred tools
    – Allows the community to tailor it to their needs
    – Serve as a meeting ground for experimental and computational methods
    – Enhance collaboration opportunities
    – Consider cloud deployment
•  Enhance the ability to translate data from individual experiments to systems level insight
