NIH Data Summit - The NIH Data Commons
NIH Data Commons
NIH Data Storage Summit
October 20, 2017
Vivien Bonazzi Ph.D.
Senior Advisor for Data Science (NIH/OD)
Project Leader for the NIH Data Commons
What’s driving the need for a
Data Commons?
Challenges with the current state of data
Generating large volumes of biomedical data
Cheap to generate, costly to store on local servers
Multiple copies of the same data in different locations
Building data resources that cannot be easily found by others
Data resources are not connected to each other and cannot
share data or tools
No standards and guidelines on how to share and access data
Convergence of factors
Increasing recognition of the need to support data sharing
Availability of digital technologies and infrastructures that
support Data at scale
Cloud: data storage, compute and sharing
FAIR – Findable, Accessible, Interoperable, Reusable
Understanding that data is a valuable resource that needs to be
sustained
NIH Genomic Data Sharing (GDS) Policy: https://gds.nih.gov/
Went into effect January 25, 2015
NCI guidance:
http://www.cancer.gov/grants-training/grants-management/nci-policies/genomic-data
Requires public sharing of genomic data sets
Findable
Accessible
Interoperable
Reusable
DATA has VALUE
DATA is CENTRAL to the Digital Economy
a signal of the coming Digital Economy
Scientific digital assets
Data
Software
Workflows
Documentation
Journal Articles
Organizations will be defined by their digital assets
The most successful organizations of the
future will be those that can
leverage their digital assets and transform
them into a digital enterprise
Data Commons
Enabling data driven science
Enable investigators to leverage all possible data and
tools in the effort to accelerate biomedical discoveries,
therapies and cures
by
driving the development of data infrastructure and data
science capabilities through collaborative research and
robust engineering
Developing a Data Commons
Treats products of research (data, methods, tools, papers, etc.) as digital objects
For this presentation: Data = Digital Objects
These digital objects exist in a shared virtual space
Find, Deposit, Manage, Share, and Reuse data, software, metadata and workflows
Digital object compliance through FAIR principles:
Findable
Accessible (and usable)
Interoperable
Reusable
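As a rough illustration of what "digital object compliance" could mean in practice, the sketch below models a digital object as a small record and runs one simple presence check per FAIR principle. The class name, field names and checks are assumptions made for illustration only; they are not taken from the presentation or from any NIH specification, and real FAIR metrics would be considerably richer.

```python
from dataclasses import dataclass, field


@dataclass
class DigitalObject:
    """Hypothetical record for one research product (data, software, workflow, ...)."""
    identifier: str = ""    # persistent identifier  -> Findable
    access_url: str = ""    # retrieval location     -> Accessible
    data_format: str = ""   # open/standard format   -> Interoperable
    license: str = ""       # terms of reuse         -> Reusable
    metadata: dict = field(default_factory=dict)


def fair_report(obj: DigitalObject) -> dict:
    """Return one pass/fail flag per FAIR principle (presence checks only)."""
    return {
        "findable": bool(obj.identifier and obj.metadata),
        "accessible": bool(obj.access_url),
        "interoperable": bool(obj.data_format),
        "reusable": bool(obj.license),
    }


# Made-up example object; it has no license, so it is not yet reusable.
example = DigitalObject(
    identifier="guid:example-0001",
    access_url="https://example.org/objects/0001",
    data_format="VCF",
    metadata={"title": "Example variant calls"},
)
report = fair_report(example)
print(report)
```

A report like this makes the gap explicit: the example object would need a stated license before it could be counted as reusable.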
The Data Commons
is a platform
that allows transactions to occur
on FAIR data at scale
The Data Commons Platform
Compute Platform: Cloud
Services: APIs, Containers, Indexing
Software: Services & Tools
scientific analysis tools/workflows
Data
“Reference” Data Sets
User defined data
FAIR
App store/User Interface/Portal
PaaS
SaaS
IaaS
Other Data Commons
Data Commons Engagement
US Government Agencies & EU groups
Interoperability with other Commons
Common goals – democratizing, collaborating & sharing data
Reuse of currently available open source tools which support
interoperability: GA4GH, UCSC, GDC, NYGC
May 2017 BioIT Commons Session
Shared open standard APIs for data access and computing
Ability to deploy and compute across multiple cloud environments
Docker containers – Dockstore/Docker registry
Workflow management, sharing and deployment
Discoverability (indexing) of objects across cloud commons
Global Unique Identifiers
Common user authentication system
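One way to picture how global unique identifiers and indexing could work together across clouds is a small index that maps a single identifier to every known copy of an object. The sketch below is a toy, in-memory version under stated assumptions: the `ObjectIndex` class, the cloud names and the bucket URLs are all hypothetical and do not describe any actual Commons service or API.

```python
import uuid
from collections import defaultdict
from typing import Optional


class ObjectIndex:
    """Toy in-memory index: one GUID, many cloud locations (all names hypothetical)."""

    def __init__(self) -> None:
        self._locations = defaultdict(dict)  # guid -> {cloud name: url}

    def register(self, cloud: str, url: str, guid: Optional[str] = None) -> str:
        """Record a copy of an object; mint a new GUID when none is supplied."""
        guid = guid or str(uuid.uuid4())
        self._locations[guid][cloud] = url
        return guid

    def resolve(self, guid: str, preferred_cloud: Optional[str] = None) -> str:
        """Return a URL for the object, favouring the caller's cloud if a copy is there."""
        copies = self._locations.get(guid)
        if not copies:
            raise KeyError(f"unknown GUID: {guid}")
        if preferred_cloud in copies:
            return copies[preferred_cloud]
        # Fall back to the first registered copy.
        return next(iter(copies.values()))


index = ObjectIndex()
guid = index.register("cloud-a", "s3://example-bucket/sample.bam")
index.register("cloud-b", "gs://example-bucket/sample.bam", guid=guid)
print(index.resolve(guid, preferred_cloud="cloud-b"))
```

Because tools refer to the identifier rather than to a provider-specific path, the same workflow could run next to whichever copy is closest to the compute.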
The Good News
Considerable agreement about the general approaches to
be taken
Many people are already addressing many of the problems:
Data architectures/platforms
Automated/semi-automated data access/authentication protocols
Common metadata standards and templates
Open tools and software
Instantiation and initial metrics of Findability, Accessibility,
Interoperability, and Reusability
Relationships/agreements with Cloud Service Providers that leverage
their interest in hosting NIH data
Moving data to the cloud and operating in a cloud environment
The Challenges
A need to “Bring it all Together” – Community endorsement of:
Metadata standards/tools/approaches
Crosswalks between equivalent terms/ontologies
Robust, shared approaches to data access/authentication
Best practices that will enable existing data to become FAIR and will
guide generation of future datasets
Rapidly evolving field makes approaches/tools/etc subject to
change – approaches need to be adaptable
Effort is required to adapt data to community standards and move
data to the cloud
How much does that cost and how long does it take?
Lack of interoperability between cloud providers
The Challenges
Making data FAIR comes with a cost
How much does it actually cost?
How can we minimize the cost?
How do we determine whether any one set of data warrants the
expense?
What is the value added to the data by making it FAIR?
What new science can be achieved?
How can new derived data or new computational approaches be
added to the dataset to enrich it?
What are the limitations of FAIRness from dataset to dataset?
Development of an
NIH Data Commons Pilot
NIH Data Commons Pilot
allows access, use and sharing
of large, high value NIH data
in the cloud
NIH Data Commons Pilot
NIH Data Commons Structure
Cloud
Services: APIs, Containers, GUIDs, Indexing, Search, Auth
ACCESS
Scientific analysis tools/workflows
Data
“Reference” Data Sets
TOPMed, GTEx, MODs
FAIR
App store/User Interface/Portal/Workspace
PaaS
SaaS
IaaS
Operationalizing
the NIH Data Commons Pilot
NIH Data Commons Pilot : Implementation
Storage, NIH Marketplace, Metrics and Costs
Leveraging and extending relationships established as part of BD2K
to provide access to cloud storage and compute
Supplements: TOPMed, GTEx, MODs groups
Prepare (and move) data sets to the cloud for storage, access and
scientific use
Work collaboratively with the OT awardees to build towards data access
Data Commons OT Solicitation: Other Transaction
ROA: Research Opportunity Announcement
Developing the fundamental FAIR computational components to
support access, use and sharing of the 3 data sets above
NIH Data Commons Pilot Consortium
Establishing a new NIH Marketplace
access to a sustainable cloud infrastructure for data science at NIH
Over the next 18 months, NIH will establish its own NIH Cloud Marketplace
Give Data Commons Pilot Consortium awardees the ability to acquire cloud storage and
compute services
Enable ICs to easily acquire cloud storage and compute services from commercial
cloud providers, resellers, and integrators
Building on existing relationships with CSPs
Led by CIT with input from Multi-IC working group
Storage, NIH Marketplace, Metrics and Costs
Assessment and Evaluation
What are the costs associated with cloud storage and usage?
What are the business best practices?
How should costs be paid?
Who should pay them?
How should highly used data be managed vs less used data?
Are data producers supportive of this model?
Are users (of all experience levels) able to access and use data effectively?
How will we know if the Data Commons Pilot is successful?
How to adjust to changing needs?
Storage, NIH Marketplace, Metrics and Costs
Supplements to 3 Test Data Set Groups
Administrative Supplements to TOPMed, GTEx and MODs
PIs for each data set were requested to review the OT (ROA) and
determine appropriate ways to interact
Prepare (and move) data sets to the cloud for storage, access
and scientific use
Make community workflows and cloud based tools of popular
analysis pipelines from the 3 datasets accessible
Facilitate discovery and interpretation of the association of
human and model organism genotypes and phenotypes
NIH Data Commons: OT ROA
Key Capabilities – modular components
Development of Community Supported FAIR Guidelines and Metrics
Global Unique Identifiers (GUID) for FAIR biomedical data
Open Standard APIs (interoperability & connectivity)
Cloud Agnostic Architecture and Frameworks
Cloud User Workspaces
Research Ethics, Privacy, and Security (AUTH)
Indexing and Search
Scientific Use cases
Training, Outreach, Coordination
Stage 1: 180 day window
Develop MVPs (Minimum Viable Products)
Demonstrations of the Data Commons and its components
Have one copy of each test data set in each cloud provider
Understanding of the process required to achieve this
Draft version of a single standard access control system
Be able to access and use the data through the access control system
Able to use a variety of analysis tools and pipelines on the 3 data sets in the cloud – (driven by scientific use cases)
Have a rudimentary ability to query across test data sets
Display phenotype, expression and variant data aligned with a specific gene or genomic location
Display model organism orthologs for a given set of human genes
Draft FAIR guidelines and metrics
Understand how each of the computational components that support the ability to access data fit together and what standards are needed
Written plans of how and why these demonstrations should be extended into a full Pilot
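The cross-data-set query goals above (for example, displaying model organism orthologs for a given set of human genes) can be pictured as a simple lookup. The sketch below is only an illustration: the `ORTHOLOGS` table is a tiny, hand-entered toy and is not drawn from TOPMed, GTEx or the MODs, and the function name is hypothetical. A real implementation would query the indexed data sets through the Commons APIs.

```python
# Toy, hand-entered ortholog table for illustration only.
ORTHOLOGS = {
    "TP53": {"mouse": "Trp53", "zebrafish": "tp53"},
    "PAX6": {"mouse": "Pax6", "zebrafish": "pax6a"},
}


def orthologs_for(human_genes, table=ORTHOLOGS):
    """Map each requested human gene symbol to its orthologs.

    Symbols are matched case-insensitively; genes missing from the
    table map to an empty dict rather than raising an error.
    """
    return {gene: dict(table.get(gene.upper(), {})) for gene in human_genes}


result = orthologs_for(["TP53", "pax6", "NOTAGENE"])
for gene, hits in result.items():
    print(gene, hits)
```

Returning an empty entry for unknown genes keeps a batch query usable even when only some of the requested genes have annotated orthologs.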
NIH Data Commons Pilot: Outcomes
Stage 2: 4 year period
To extend and fully implement the Data Commons Pilot based on the
design strategies and capabilities developed as part of Stage 1
Review of MVP/demonstrations and written plans from Stage 1
Goals and Milestones with clear and specific outcomes
Evaluate, negotiate, and revise terms of existing awards
Award additional OTs
NIH Data Commons Pilot: Outcomes
Acknowledgments
DPCPSI: Jim Anderson, Betsy Wilder, Vivien Bonazzi, Marie Nierras, Rachel Britt,
Sonyka Ngosso, Lora Kutkat, Kristi Faulk, Jen Lewis, Kate Nicholson,
Chris Darby, Tonya Scott
NHLBI: Gary Gibbons, Alastair Thomson, Teresa Marquette, Jeff Snyder,
Melissa Garcia, Maarten Lerkes, Ann Gawalt, Cashell Jaquish,
George Papanicolaou
NHGRI: Eric Green, Valentina di Francesco, Ajay Pillai, Simona Volpi, Ken Wiley
NIAID: Nick Weber
CIT: Andrea Norris
NLM: Patti Brennan
NCBI: Steve Sherry
Stay in Touch
QR Business Card
@Vivien.Bonazzi
Slideshare
Blog (Coming soon!)