FDA Data Innovation Lab Visualization Gallery

Dr. Brand Niemann
Director and Senior Data Scientist/Data Journalist
Semantic Community
http://semanticommunity.info/
http://www.meetup.com/Federal-Big-Data-Working-Group/
http://semanticommunity.info/Data_Science/Federal_Big_Data_Working_Group_Meetup

October 1, 2014

Overview

• Data Science White House Big Data Review & Brooke Aker: Big Data Lens on OpenFDA
  – July 7th Meetup
• Meeting with Dr. Taha Kass-Hout, FDA’s First Chief Health Informatics Officer (CHIO)
  – September 23rd
• HHS Idea Lab Demo Day with Bryan Sivak and Dr. Taha Kass-Hout
  – September 30th
• FDA Data Innovation Lab and Predictive Analytics Meetup
  – October 6th Meetup
• Data Science, Data Infrastructure, & Data Publications for the HHS IDEA Lab
  – December 1st Meetup

HHS Ignite Application

• Proposal Questions:
  – Executive Summary (Your Elevator Pitch) [ 500 characters ]
  – What’s the problem you’ve identified? [ 2000 characters ]
  – What’s your proposed solution? What do you want to accomplish within the 3 months? [ 1000 characters ]
  – Who is your target end-user / customer? [ 75 characters ]
  – Is there any other information you’d like us to know about? (optional) [ 500 characters ]

Source: http://www.hhs.gov/idealab/what-we-do/hhs-ignite/

HHS Ignite Evaluation Process

• The Scoring Criteria and Selection Process:
  – The project’s importance to the Office, Agency and/or Department [20 points]
  – The potential impact of the proposed solution. [40 points]
  – The proposal’s understanding and explanation of the problem that needs to be solved. [20 points]
  – The proposal’s understanding of the customers that the project serves. [20 points]
• Teams submitting the top proposals will be asked to present and discuss their project with members of the HHS Innovation Council.
• The Council will make recommendations to the Secretary who will make the final selection.
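
As a rough illustration of the arithmetic behind the scoring criteria, the four weights add up to 100 points. The sketch below is only an assumption about how a reviewer might tally one proposal: the `MAX_POINTS` labels, the `total_score` helper, and the example scores are hypothetical, and the source does not say how HHS actually combines the scores.

```python
# Maximum points per criterion, taken from the scoring criteria listed above.
# The dictionary keys are hypothetical labels, not official HHS terms.
MAX_POINTS = {
    "importance": 20,
    "impact": 40,
    "problem_understanding": 20,
    "customer_understanding": 20,
}


def total_score(awarded):
    """Sum awarded points after checking each value against its maximum."""
    for criterion, points in awarded.items():
        if criterion not in MAX_POINTS:
            raise KeyError(f"unknown criterion: {criterion}")
        if not 0 <= points <= MAX_POINTS[criterion]:
            raise ValueError(
                f"{criterion} must be between 0 and {MAX_POINTS[criterion]}"
            )
    return sum(awarded.values())


# Made-up scores for a single proposal, for illustration only.
example = {
    "importance": 15,
    "impact": 30,
    "problem_understanding": 18,
    "customer_understanding": 12,
}
print(total_score(example), "out of", sum(MAX_POINTS.values()))
```

Because impact carries twice the weight of any other criterion, a proposal that is weak there cannot recover the difference elsewhere.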

OpenFDA

• OpenFDA is a new initiative to provide unprecedented access to FDA data and highlight projects in the public and private sector that use these data to further scientific research, educate the public, and save lives.
• OpenFDA is an initiative of FDA’s Office of Informatics and Technology Innovation to provide a new level of access to a number of public high-value FDA datasets via RESTful APIs and structured raw file download. Currently, the project is in an early-development stage, with an alpha release of two datasets planned for spring 2014 and a larger public release later in the year. Additionally, openFDA will provide a platform for the community to interact with each other and FDA domain experts with the goal of spurring innovation around FDA data and creating new partnerships and opportunities between the public and private sector (BOLDING BY ME).
  – Presidential Innovation Fellow: Sean Herron is a Presidential Innovation Fellow serving at FDA | @seanherron

http://www.hhs.gov/idealab/innovate/openfda/
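
To make "access via RESTful APIs" concrete, here is a minimal sketch of building a query URL. It assumes the drug adverse event endpoint at `https://api.fda.gov/drug/event.json` with `search` and `limit` parameters, and a `patient.drug.medicinalproduct` search field; none of these names appear in the slides, and `build_query_url` is a hypothetical helper. The code only constructs the URL and does not contact the service.

```python
from urllib.parse import urlencode

# Assumed base URL of the openFDA drug adverse event endpoint.
BASE_URL = "https://api.fda.gov/drug/event.json"


def build_query_url(search, limit=5):
    """Return a query URL; `search` is assumed to use field:term syntax."""
    return BASE_URL + "?" + urlencode({"search": search, "limit": limit})


# Example search term chosen for illustration only.
url = build_query_url('patient.drug.medicinalproduct:"aspirin"', limit=3)
print(url)
```

The resulting URL could be fetched with any HTTP client; the response format and rate limits are described in the openFDA documentation rather than here.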

FDA's Path Forward for Open Data and Next Generation Sequencing

• Utility NGS (Next Generation Sequencing) in the Internet cloud: FDA is facing growing NGS needs for processing internal genome sequencing data as well as the NGS data from industry submissions. The NGS initiative is planning and developing a cloud-based Big Data platform and analytics for robust, secure and controlled data storage, analysis, and collaboration, and potentially for sharing public-access genome sequencing information.
• NGS is a Big Data Initiative.

https://open.fda.gov/update/fda-path-forward-for-open-data-and-next-generation-sequencing/

Data Science Data Publications for Big Data Analytics

• New Government Data Science Best Practices:
  – Digital Government Strategy
  – Open Research Data Policy
  – Agency: HHS IdeaLab, NIH Data Commons, FDA Innovation Lab
  – White House NITRD Big Data Initiative and NSF Agency Strategic Plan: Data Science, Data Infrastructure, and Data Publications
• New Government Data Science Publication Examples:
  – Federal Data Center Consolidation 2014
  – Performance.gov
  – FDA Data and FDA Data Innovation Lab
  – National Science Board Science & Engineering Indicators

Data Science Data Publications for FDA: Data Science Data Mining Process

• Recall OpenFDA Knowledge Base for previous visualization and analytics:
  – Brooke Aker, Biplab Pal, and Brand Niemann.
• Mined HealthData.gov for FDA data and built linked data spreadsheets (17) for Spotfire:
  – See next slides.
• Mined FDA Site Map for data:
  – Found Two: Data Standards and FDA Drug Approvals & Databases.
  – Downloaded and inventoried files (41) (ZIP, CSV & XLS) for Spotfire.
  – Used for FDA Data Innovation Lab Visualization Gallery.
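
The "downloaded and inventoried files" step can be sketched in a few lines. This is not the procedure actually used for the gallery, just one way to count downloaded files by type; the `inventory` function and the sample file names are hypothetical.

```python
import tempfile
from collections import Counter
from pathlib import Path


def inventory(folder):
    """Count files under `folder` (recursively) by lowercase extension."""
    counts = Counter()
    for path in Path(folder).rglob("*"):
        if path.is_file():
            counts[path.suffix.lower() or "(none)"] += 1
    return counts


# Demonstration on a throwaway folder with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    for name in ("a.zip", "b.csv", "c.CSV", "sub/d.xls"):
        (root / name).write_text("")
    demo_counts = inventory(root)

print(dict(demo_counts))
```

Lowercasing the extension keeps `.csv` and `.CSV` in one bucket, which matters when files come from several sources.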

Data Science for FDA Data
Excel Spreadsheet Data Ecosystem

• FDA @ HealthData.gov
• Summary FDA
• FDA Site Map
• FDA-TRACK
• FDA Glossary
• FDA-TRACK Research Glossary
• FDA Drug Approvals & Databases
• Summary All
• Holdren Memo Agencies
• HealthData.gov Subject 09172014
• HealthData.gov Agency 09172014
• HealthData.gov Date 09172014
• HealthData.gov Year 09172014
• HealthData.gov Period 09172014
• HealthData.gov Spatial 09172014
• HealthData.gov Start 09172014
• HealthData.gov Media 09172014

http://semanticommunity.info/@api/deki/files/30746/HHSFDA.gov.xlsx?origin=mt-web

Data Science Data Publication: FDA Data in Spotfire

• Cover Page-Performance Analytics: FDA TRACK
  – Most programs do not have a Strategic Plan!
• Content Analytics: Summary Statistics
  – Of the 5 HHS agencies that come under the Holdren Memo, CDC and FDA have by far the most and almost equal number of data sets!
• Content Analytics: HealthData.gov Statistics 09172014
  – See how few of these data sets are in readily useable media!
• Content Analytics: FDA @ HealthData.gov
  – A Dashboard to the FDA Dashboards!
• Network Analytics: FDA Glossary & Site Map
  – The FDA Site Map and Glossary as a Linked Data Network!
• Data Analytics: FDA Drug Approvals & Databases
  – The FDA Site Map and Glossary as a Linked Data Network!

FDA Data Innovation Lab Visualization Gallery: Spreadsheet Inventory

http://semanticommunity.info/@api/deki/files/30746/HHSFDA.gov.xlsx?origin=mt-web

My Note: Inventory to prioritize further data science data publication work! This inventory is updated as one drills down into the data sets!

FDA Data Innovation Lab Visualization Gallery: File Folder

My Note: Some folders contain multiple files!

Suggestions

• Help the FDA Data Innovation Lab with data publication gallery and wall posters.

• Help the FDA Data Innovation Lab with their Open Data Lab Day.

• Organize Joint Meetups and promote use of the FDA Data Innovation Lab.

• Help form Data Science Teams to work on FDA big data problems.

SEMOSS FDA Analytics

• This video shows how SEMOSS can be used to analyze adverse drug reaction data from the Food and Drug Administration’s Adverse Event Reporting System. This database includes information on demographics, drugs, reactions, roles, outcomes and more. This is very useful data but even the FDA admits that “users of these files need to be familiar with creation of relational databases…A simple search of AERS/FAERS data cannot be performed with these files by persons who are not familiar with creation of relational databases.” In other words, this data is freely available but people can’t use it or analyze it very easily. In this video we show how we can easily ask questions of this data and arrive quickly at insights. These data and visualizations will be useful for patients, doctors, health administrators and other medical professionals.
  – http://blog.semoss.org/2013/11/fda-analytics.html
• My Note: I would like to try to reproduce these results with Spotfire, but as you will see, the FAERS_ASCII_2013Q4 data set is not good.

Edward Tufte: How to Create Trust in the Agency Pitch Process

• "The Visual Display of Quantitative Information" is one of the most successful self-published books in history.
• Tufte is generally considered to be the biggest thinker in the uber-trendy field of data visualization.
• He thinks The Guardian is the best-designed newspaper and that The New York Times does the best visualizations by far.
• He says:
  – Good content reasoners and presenters are rare, designers are not.
  – PowerPoint should be used solely as a projector operating system to show 100% content, without the "chartjunk" and "chartoons".
  – At NASA, where PowerPoint trumped rocket science -- and the Columbia Accident Investigation Board agreed with and published my analysis in their final Report.
  – Presenters need (1) to tell a coherent story and (2) to convince their audience of their credibility. Not a cherry-picker, but a master of detail.
  – Graphics are at their best for really large data sets, as in sparklines for time series and NASA's photographs of the Earth.
  – Sensibly-designed tables usually outperform graphics for data sets under 100 numbers.
  – I think designers and marketers greatly underestimate their audiences.

http://adage.com/article/adagestat/edward-tufte-adagestat-q-a/230884/

Cover Page: Spotfire

Explanations: I inventoried, downloaded, unzipped, and imported the files in the table below into Spotfire and tried to understand the data and its usefulness.

Web Player

bmis: Spotfire

No Column Names!

CLIL: Spotfire

Some Have No Column Names!

drls_reg: Spotfire

Three firms clearly stand out here!

Drugsatfda: Spotfire

Lots of data here, but not sure of its information value.

EOBZIP_2013_09_16

This data set is unintelligible to me!

EOBZIP_2013_09_16 (2): Spotfire

These data sets have single columns or missing column definitions!

FAERS_ASCII_2013Q4: Spotfire

These data sets also have single columns or missing column definitions!
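
One possible cause of the single-column symptom, offered as an assumption rather than a diagnosis of these particular files, is a delimiter mismatch: FAERS ASCII extracts use `$` as the field separator, so an import that expects commas or tabs sees each row as one wide column. The sketch below uses made-up sample rows (not real FAERS records) and a hypothetical `read_delimited` helper to show the difference.

```python
import csv
import io

# Made-up sample in the style of a "$"-delimited extract.
# The values are illustrative only and are not real FAERS data.
SAMPLE = "primaryid$caseid$age\n1001$501$34\n1002$502$61\n"


def read_delimited(text, delimiter):
    """Parse delimited text into a list of rows (lists of strings)."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


wrong = read_delimited(SAMPLE, ",")  # every row collapses into one column
right = read_delimited(SAMPLE, "$")  # columns split as intended
print(len(wrong[0]), len(right[0]))
```

If the delimiter is right and columns are still unnamed, the remaining fix is to supply column names from the data set's accompanying documentation.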

pmc: Spotfire

Most of the pmc commitments are pending (849 of 2,113) and have been received in 2013-2014.

Data Exchange Standards Catalog: Spotfire

The standard terminology code sets are listed in a separate tab of this worksheet. Please look at the "Terminology" tab to find standard terminology information. This is what GENIS is about.

FDA Data Spreadsheets

• Fraudulent H1N1 Products List 2009
  – 185 rows and 6 columns
• Hydrolyzed Vegetable Protein Products List 2010
  – 177 rows and 9 columns
• Infant Formula Recall List 2010
  – 2173 rows and 8 columns
• Milk Recall Products 2009
  – 286 rows and 9 columns
• Peanut Butter Products 2009
  – 3918 rows and 9 columns
• Pistachio 2009
  – 662 rows and 9 columns
• Shell Eggs Recall List 2010
  – 94 rows and 8 columns

Comment: These are from 2009-2010. Are more recent lists available?

Fraudulent H1N1 Products List 2009: Spotfire

Most of the products were approved, etc. content on the Web and Supplements

FDA ndc: Spotfire

The ndc package.txt and product.txt data sets were visualized in a separate Spotfire file, which showed that HUMAN PRESCRIPTION DRUG accounts for just over half of the rows (43,458 out of 83,167) and that most of the STARTMARKETINGDATE values are after the year 2000.

Web Player
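
The two ndc observations (rows per product type and start dates after 2000) can be checked outside Spotfire with a short script. This sketch assumes product.txt is tab-delimited with `PRODUCTTYPENAME` and `STARTMARKETINGDATE` (YYYYMMDD) columns; the sample rows are made up and `summarize` is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Made-up tab-delimited rows; column names are assumed to match product.txt.
SAMPLE = (
    "PRODUCTNDC\tPRODUCTTYPENAME\tSTARTMARKETINGDATE\n"
    "0001-0001\tHUMAN PRESCRIPTION DRUG\t20050115\n"
    "0001-0002\tHUMAN OTC DRUG\t19980301\n"
    "0001-0003\tHUMAN PRESCRIPTION DRUG\t20121101\n"
)


def summarize(text):
    """Return (rows per product type, rows starting after 2000, total rows)."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
    by_type = Counter(row["PRODUCTTYPENAME"] for row in rows)
    after_2000 = sum(
        1 for row in rows if int(row["STARTMARKETINGDATE"][:4]) > 2000
    )
    return by_type, after_2000, len(rows)


by_type, after_2000, total = summarize(SAMPLE)
print(by_type.most_common(1), after_2000, total)

# Share of prescription-drug rows using the counts reported on the slide.
share_percent = round(100 * 43458 / 83167, 1)
print(share_percent)
```

With the real file, `io.StringIO(text)` would be replaced by an opened file handle; the encoding of product.txt is not stated in the slides and would need to be checked.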

Conclusions

• We have participated in Meetups & Demos to understand the OpenFDA Data & the HHS Ignite Application & Evaluation Criteria.

• We have created an FDA Data Innovation Lab Visualization Gallery. There are some problems with the FDA Data sets.

• We are creating Data Science Data Publications for FDA using the Data Science Data Mining Process.

• Semantic Community has a platform for the community to interact with each other and FDA domain experts with the goal of spurring innovation around FDA data and creating new partnerships and opportunities between the public and private sector.