My comments are an informal communication and represent my own best judgment. These comments do not bind or obligate FDA.
• openFDA: FDA launches big data 'openFDA' initiative, giving the public easier access to safety information
• MAQC consortium: FDA-led community-wide consortium aimed at assessing the technical performance of next-generation sequencing platforms
• HIVE: FDA partners with academia to develop a next-generation sequencing platform
• CFSAN: FDA partners with CDC, NIH/NCBI in developing research infrastructure for a risk-based food safety system
FDA leverages big data: current efforts
Current focus:
• Adverse events. FDA’s publicly available drug adverse event and medication error reports, and medical device adverse event reports.
• Recalls. Enforcement report data, containing information gathered from public notices about certain recalls of FDA-regulated products.
• Labeling. Structured Product Labeling (SPL) data for FDA-regulated human prescription drug, OTC drug and biological product labeling.
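As a rough illustration of how the three data categories above can be queried, the sketch below builds request URLs for the public openFDA API at api.fda.gov. The helper name build_query_url, the dataset keys, and the sample search term are assumptions made for this example, not part of the original slides; no network call is made.

```python
from urllib.parse import urlencode

# Base URL of the public openFDA API.
OPENFDA_BASE = "https://api.fda.gov"

# Endpoint paths for the dataset categories listed above.
# The dictionary keys are hypothetical labels chosen for this sketch.
ENDPOINTS = {
    "drug_adverse_events": "/drug/event.json",
    "device_adverse_events": "/device/event.json",
    "drug_recalls": "/drug/enforcement.json",
    "drug_labeling": "/drug/label.json",
}


def build_query_url(dataset, search=None, limit=1):
    """Return an openFDA query URL for one dataset (nothing is fetched)."""
    if dataset not in ENDPOINTS:
        raise ValueError(f"unknown dataset: {dataset}")
    params = {}
    if search:
        # 'search' takes a field:term expression.
        params["search"] = search
    params["limit"] = limit
    return f"{OPENFDA_BASE}{ENDPOINTS[dataset]}?{urlencode(params)}"


if __name__ == "__main__":
    # Illustrative query: adverse event reports mentioning aspirin.
    print(build_query_url(
        "drug_adverse_events",
        search="patient.drug.medicinalproduct:aspirin",
        limit=5,
    ))
```

The resulting URL can be opened in a browser or passed to any HTTP client; the response is JSON.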
[Figure: world map with country labels]
• >10M API calls
• >½ of API calls issued from outside the US
• >12 new software (mobile or web) apps
• >20,000 connected IP addresses
• ~6,000 registered API users
openFDA: Usage
MicroArray Quality Control (MAQC)
An FDA-led, community-wide consortium effort to assess the technical performance and application of genomics technologies (microarrays, GWAS and next-gen sequencing)
MAQC-I (2005.2 – 2006.9): Assess reliability of microarrays
• Repeatability
• Reproducibility
• 137 participants, 51 organizations
• 6 papers, 2006
MAQC-II (2006.9 – 2010.10): Assess microarray-based biomarkers
• Clinical use
• Safety evaluation
• 202 participants, 97 organizations
• 13 papers, 2010
MAQC-III/SEQC (2008.8 – 2014): Assess reliability of next-gen sequencing (RNA-seq) and compare it with microarrays
• >180 participants, 73 organizations
• 11 manuscripts
The 3rd Phase of MAQC - SEquencing Quality Control
• 180 participants from 73 organizations
• Generated >10 Tb of data and >100 billion reads
• Represented ~6% of the data in GEO (Jun, 2014)
• 11 manuscripts: 3 in Nat Biotechnol, 2 in Nat Commun, 3 in Sci Data, 2 in Genome Biology (revision) and 1 in Nat Methods (revision)
Objectives
Study designs
Datasets
FDA, USDA, CDC State, Local and Foreign Public Health Agencies
Academia/Industry ADDITIONAL DATA ANALYSIS
DATA ASSEMBLY, STORAGE and ANALYSIS
DATA ACQUISITION
NCBI, EMBL, DDBJ (INSDC) (Public Access Database)
Food Safety Research Infrastructure – Publicly available data
National Network of Sequencers
International Network of Sequencers
GenomeTrakr Strategy
• Develop and test the performance of a distributed sequencing based network, rather than centralized model
• Provide sequence and minimal metadata in a publicly accessible database
  – Partner with NCBI for storage and serving data
  – Cost prohibitive for FDA to establish its own high capacity data site
  – Cost savings by using NCBI allowed more labs to participate
  – Industry, academia, and other government agencies have access to data for individual needs
GenomeTrakr: distributed sequencing network
~10,000 pre-registered strains
~6,000 genomes
• Robust data loading: multiple sources, large data blobs, through complex handshaking procedures
• Distributed storage: compressed, organized data and metadata
• Hierarchical security: permission-based files, metadata, processes and algorithms in a collaborative environment
• Distributed Computations: private cloud based platform running virtual services
• Interface: customizable, web-driven unified interface with graphical visualizations
• Expertise: in-house research and development team capable of responding to the needs
HIVE: High-performance Integrated Virtual Environment
maxi-HIVE
location: White Oak/CDRH HPC
storage: ~2 Petabytes
cpu: 1500 cores, extensible to 3000-5000
wan: 10Gb ⇒ Internet2
lan: 40Gb ⇒ Infiniband
platform: metal + SunGrid
goal: regulatory next-gen support platform for long-term storage and large-scale computations; to support regulatory submissions for NGS and serve as a standardization portal for NGS evidence submissions
mini-HIVE
location: White Oak/CBER server room
storage: ~380 Terabytes
cpu: ~350 cores
wan: 1Gb
lan: 10-20Gb
platform: metal
goal: research and scientific NGS portal with cutting-edge, production-quality tools
HIVE: deployments
Public-HIVE
location: GWU, Dr Mazumder’s lab
storage: 200 Terabytes
cpu: ~350 cores
wan: 1Gb
lan: 10Gb
platform: metal
goal: support and integrate a wider community of researchers into the HIVE process; allow access to cutting-edge, regulatory-compliant tools and standards; perform free pilot projects with academic, industry and government entities to promote and ease access to novel NGS techniques; incorporate HIVE into education
Public-elastic HIVE
location: ColonialOne/Ashburn datacenter
storage: extensible to Petabytes
cpu: +1000 cores
wan: 10Gb ⇒ Internet2
lan: 10Gb ⇒ Infiniband
platform: Lustre open source
goal: to become an extensibility platform for public HIVE users with large-scale computational needs in large clinical research projects
Amazon
• Standardization: FDA gets ready for standardization of big data submissions
• Bioinformatics harmonization: ongoing efforts to build bioinformatics validation platforms
• MAQC-IV: looking into personalized genome quality metrics projects
• Cloud: FDA prepares for greater use of cloud services to store, manipulate and communicate big data
FDA leverages big data: moving forward
www.ncbi.nlm.nih.gov
Expectations
Thanks to all of those whose hard work and bright ideas made all of the above possible. Slide contributors: Taha Kass-Hout, Tong Weida, Errol Strain and Marc Allard
Acknowledgments