BIG DATA: BIG OPPORTUNITIES OR BIG TROUBLE?

Kathy Partin, Office of the VP for Research, Dept. of Biomedical Sciences
Shea Swauger, Libraries


TRANSCRIPT

Page 1

BIG DATA: BIG OPPORTUNITIES OR BIG TROUBLE?

Kathy Partin, Office of the VP for Research, Dept. of Biomedical Sciences

Shea Swauger, Libraries

Page 2

What is Big Data?

• Volume
• Variety
• Velocity

• Too Big to Email

• Veracity
• Variability
• Visualization
• Value

Page 3

The Data Lifecycle

• Proposal
• Infrastructure
• Acquisition/Generation
• Management
• Dissemination
• Preservation

Page 4

Proposal

• Grant Funding Requirements

• Data Management Plan

http://lib.colostate.edu/repository/nsf

https://dmptool.org

Page 5

Infrastructure

• Where do you store it?

• How do you move it? (a transfer-time sketch follows below)

• How do you analyze it? (HPC?)

+ Ultra High Speed Research LAN

+ College or Department Servers

+ Bioinformatics & other Clusters

http://istec.colostate.edu/activities/hpc/
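"Too Big to Email" becomes concrete once you estimate how long a transfer actually takes. A back-of-the-envelope sketch in Python; the dataset sizes, link speeds, and the 70% efficiency factor are all assumptions for illustration:

```python
# Rough transfer-time estimates for moving large datasets.
# Real throughput depends on protocol overhead, disk speed, and
# network contention; every number here is an assumption.

def transfer_hours(size_tb: float, link_gbps: float, efficiency: float = 0.7) -> float:
    """Hours to move size_tb terabytes over a link_gbps link,
    assuming only `efficiency` of the nominal bandwidth is usable."""
    size_bits = size_tb * 8e12                # 1 TB = 8e12 bits (decimal TB)
    usable_bps = link_gbps * 1e9 * efficiency
    return size_bits / usable_bps / 3600

for size in (1, 10, 100):                     # dataset sizes in TB
    print(f"{size:>4} TB:  1 Gbps ~ {transfer_hours(size, 1):7.1f} h,  "
          f"10 Gbps ~ {transfer_hours(size, 10):6.1f} h")
```

Even on a 10 Gbps research LAN, 100 TB is roughly a day and a half of sustained transfer, which is why storage and network planning belong in the proposal stage.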

Page 6

Data Acquisition/Generation

Reuse Existing Data

• Where to find it?

• How to understand/use it?

• Do you trust it?

• Create your own data

Metadata + README files (a README sketch follows below)

Data Provenance

Privacy, Security, Proprietary, Dual Use Research of Concern
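A minimal sketch of the "Metadata + README files" idea: write a README.txt beside the data so future users can understand, trust, and trace it. The field names below are illustrative assumptions, not a formal metadata standard such as Dublin Core:

```python
# Write a plain-text README next to a dataset. Fields are illustrative.
from datetime import date
from pathlib import Path

fields = {
    "Title": "Example behavioral dataset",            # hypothetical dataset
    "Creator": "Lastname, Firstname",
    "Date created": date.today().isoformat(),
    "Description": "What was measured, how, and why",
    "Provenance": "Instrument, software versions, processing steps",
    "Variables": "One line per column: name, units, allowed values",
    "Access": "Restrictions (privacy, proprietary, dual use of concern)",
}

readme = "\n".join(f"{key}: {value}" for key, value in fields.items())
Path("README.txt").write_text(readme + "\n")
print(readme)
```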

Page 7

Data Management

• Access/Permissions
• File Naming
• Metadata
• Organization
• Collaboration
• Version Control
• Fixity/Integrity (a checksum sketch follows below)

http://lib.colostate.edu/services/data-management
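Of the items above, fixity/integrity is the most mechanical to automate. A minimal sketch, assuming a data/ directory and a manifest file name of my choosing, that records a SHA-256 checksum per file in the same format the common sha256sum tool emits:

```python
# Record a SHA-256 checksum for every file under data/ so later
# corruption or modification can be detected. Paths are assumptions.
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk: int = 1 << 20) -> str:
    """Hash the file in 1 MiB chunks so large files don't fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while block := f.read(chunk):
            digest.update(block)
    return digest.hexdigest()

data_dir = Path("data")                                # assumed layout
with open("manifest.sha256", "w") as manifest:
    for p in sorted(data_dir.rglob("*")):
        if p.is_file():
            manifest.write(f"{sha256_of(p)}  {p}\n")   # sha256sum format
```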

Page 8

Dissemination

Where to share your data?

• Institutional Repository

• Discipline Specific Repository

How to cite your data?

• Permanent identifier (DOI, handle, PURL, etc.)
• Citation standards (a DOI lookup sketch follows below)

http://lib.colostate.edu/services/data-management/citing-data
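Permanent identifiers also make citation retrieval mechanical. DOI.org supports content negotiation that returns a formatted citation for a registered DOI; a sketch, with a placeholder DOI that will not resolve:

```python
# Ask DOI.org for an APA-formatted citation via content negotiation.
# The DOI below is a placeholder -- substitute your dataset's identifier.
import requests

doi = "10.xxxx/example-dataset"    # hypothetical, will not resolve
response = requests.get(
    f"https://doi.org/{doi}",
    headers={"Accept": "text/x-bibliography; style=apa"},
    timeout=30,
)
print(response.text if response.ok else f"Lookup failed: HTTP {response.status_code}")
```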

Page 9

Data Preservation

• Media Obsolescence
• Software Obsolescence
• Bit Rot
• Back-ups
• Checksums
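Checksums only guard against bit rot if someone re-verifies them. A companion sketch to the manifest example on the Data Management page, assuming the same manifest.sha256 format:

```python
# Re-hash every file listed in manifest.sha256 and flag mismatches,
# which indicate bit rot, silent corruption, or undocumented edits.
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk: int = 1 << 20) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while block := f.read(chunk):
            digest.update(block)
    return digest.hexdigest()

for line in Path("manifest.sha256").read_text().splitlines():
    expected, name = line.split(None, 1)
    p = Path(name)
    if not p.exists():
        print(f"MISSING  {name}")
    elif sha256_of(p) != expected:
        print(f"CHANGED  {name}")
```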

Page 10

Public Outcry Regarding Data Integrity

• "Why Most Published Research Findings Are False," Ioannidis, 2005
• "Update of the Stroke Therapy Academic Industry Roundtable Preclinical Recommendations," Fisher et al., 2009
• "Science Publishing: The Trouble with Retractions," Van Noorden, 2011
• "Believe It or Not: How Much Can We Rely on Published Data on Potential Drug Targets?" Prinz et al., 2011
• "Misconduct Accounts for the Majority of Retracted Scientific Publications," Fang et al., 2012
• "Drug Development: Raise Standards for Preclinical Cancer Research," Begley & Ellis, 2012

[Image: angry mob at Frankenstein's castle]

Page 11

Integrity - Reliability - Translation

• "Power Failure: Why Small Sample Size Undermines the Reliability of Neuroscience," Button et al., 2013
• "Challenges in Translating Academic Research into Therapeutic Advancement," Matos et al., 2013 (epilepsy)
• "Reproducibility," McNutt, 2014
• "NIH Plans to Enhance Reproducibility," Collins & Tabak, 2014
• "Reproducibility: Fraud Is Not the Big Problem," Gunn, 2014
• Taxpayers are wasting their investment because the integrity of basic research is flawed, not due to intentional misconduct but to unintentional mismanagement.

Page 12

Research Misconduct

1. Fabrication, falsification, plagiarism, or other practices that seriously deviate from those commonly accepted within the relevant scientific/academic community for proposing, conducting, reviewing, or reporting research; that

2. has been committed intentionally, knowingly, or recklessly; and that

3. has been proven by a preponderance of the evidence (more likely than not).

Misconduct does not include honest error or honest differences in interpretations or judgments of data.

Page 13

Reporting Concerns

• All employees and individuals associated with CSU should report observed, suspected, or apparent research misconduct to their Department Head, Dean, the RIO (Research Integrity Officer), and/or the Vice President for Research.

• If an individual is unsure whether a suspected incident falls within the definition of research misconduct, a call may be placed to one of these individuals to discuss the concern informally.

http://reportinghotline.colostate.edu/

Page 14

Research Integrity Officer

› Primary contact for departments and deans with questions about potential misconduct issues
› Represents CSU with the PHS Office of Research Integrity (ORI), NSF, USDA, etc.
› Manages the CSU MIS process to meet institutional, state, and federal standards
› [email protected]

Page 15

External Pressure to Fix or Be Fixed

• Issues with data reliability have brought external pressure on the scientific community
• From Congress:
  • President's Council of Advisors on Science and Technology (PCAST), "Improving Scientific Reproducibility in an Age of International Competition and Big Data," 2014: http://www.tvworldwide.com/events/pcast/140131/
• From the popular press and "watchdog" websites/blogs:
  • The Economist, "Unreliable Research: Trouble at the Lab," 2013
  • NYT, "New Truths That Only One Can See," 2014
  • RetractionWatch.com

Page 16

The Gap Between Applied & Basic Research

Innovation and Reliability: two opposing forces pulling on data.

Innovation (pre-preclinical study): dynamic, agile, discovery, exploration, optimization, creative, outside-the-box, anti-dogmatic

Reliability (preclinical or clinical study): reproducible, robust, translatable to bedside, rigid, immutable, non-optimized, boring

Page 17

What needs to change?

• Funding agencies need to raise the bar for data acquisition
• Publishers need to raise the bar for data quality
• Academic institutions need to reassess how success is defined
• Academic institutions need to provide their faculty with the right tools and training to do it right
• Faculty need to pass this down to their trainees

Page 18

External Changes

• NIH appears to be:
  • Developing a new training module on good experimental design to disseminate
  • Developing a data checklist for grant proposals
  • Creating the DDI (Data Discovery Index)
  • Adopting a new biosketch format to reduce the focus on the number of publications and increase the focus on the impact of publications
  • Considering blinded review of grant proposals
• Science Exchange Reproducibility Initiative

Page 19

DDI

“In summary, a Data Discovery Index (DDI) emphasizes development of an adaptable, scalable system through active community engagement that would serve as an index to large biomedical datasets.”

Rather than a traditional "catalog," the DDI concept stresses discoverability, access, and citability.

It would index raw data, which previously rarely saw the light of day in academic research.

Page 20

Publishers

• Preventing plagiarism with iThenticate
• Preventing fabrication/falsification with new data checklists
• Abolishing word limits on methods sections

Page 21

Six Common Experimental Failings

1. Poor experimental design

2. Poor reagents

3. Poor analysis

4. Failure to reject hypothesis after observing discordant, valid experimental results

5. Deliberate bias in selecting positive rather than negative results to report, publish, cite, and fund (simulated in the sketch after this list)

6. Failure to follow through when wondering “Why is this result NOT what I expected?”
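Failing #5 can be made vivid with a small simulation: when only "significant" results in the expected direction get reported, the published effect sizes are badly inflated. Every number below is an illustrative assumption:

```python
# Simulate many small studies of a weak true effect; "publish" only
# the significant, positive ones and compare effect sizes.
import random
import statistics

random.seed(1)
true_effect, n_per_group, sigma = 0.2, 10, 1.0    # weak effect, small n

published = []
for _ in range(5000):                              # hypothetical studies
    a = [random.gauss(0.0, sigma) for _ in range(n_per_group)]
    b = [random.gauss(true_effect, sigma) for _ in range(n_per_group)]
    diff = statistics.mean(b) - statistics.mean(a)
    se = (statistics.variance(a) / n_per_group
          + statistics.variance(b) / n_per_group) ** 0.5
    if diff / se > 1.96:                           # crude one-sided cutoff
        published.append(diff)

print(f"True effect:            {true_effect}")
print(f"Mean published effect:  {statistics.mean(published):.2f}")
print(f"Studies 'published':    {len(published) / 5000:.0%}")
```

The selective filter inflates the apparent effect several-fold without any single researcher fabricating anything.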

Page 22

Statistics & General Methods

1. How was the sample size chosen to ensure adequate power to detect a pre-specified effect size? (A sketch follows this list.)

2. Describe inclusion/exclusion criteria if samples, subjects or animals were excluded from the analysis. Were the criteria pre-established?

3. If a method of randomization was used to determine how samples/subjects/animals were allocated to experimental groups and processed, describe it.
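A hedged sketch covering items 1 and 3: statsmodels can solve for the per-group sample size needed to detect a pre-specified effect, and a seeded shuffle makes the allocation reproducible and documentable. The effect size and subject IDs are assumptions:

```python
# Item 1: power analysis. Item 3: documented randomization.
import random
from statsmodels.stats.power import TTestIndPower

# Per-group n for 80% power to detect Cohen's d = 0.8 at alpha = 0.05
n = TTestIndPower().solve_power(effect_size=0.8, alpha=0.05, power=0.8)
n = round(n)
print(f"Need about {n} subjects per group")        # ~26

# Seeded shuffle: the allocation can be re-derived and audited later
random.seed(20150112)
subjects = [f"mouse_{i:02d}" for i in range(1, 2 * n + 1)]  # hypothetical IDs
random.shuffle(subjects)
groups = {"control": subjects[:n], "treatment": subjects[n:]}
print(groups)
```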

Page 23

Statistics & General Methods

4. If the investigator was blinded to the group allocation during the experiment and/or when assessing the outcome, state the extent of blinding.

5. For every figure, are statistical tests justified as appropriate? Do the data meet the assumptions of the tests (e.g., normal distribution)? (A sketch follows item 5b.)

a) Is there an estimate of variation within each group of data?

b) Is the variance similar between the groups that are being statistically compared?
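For items 5, 5a, and 5b, a sketch of checking test assumptions with SciPy before choosing the test; the data here are simulated placeholders standing in for real measurements:

```python
# Check normality and equal variance, then pick the matching t-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(loc=10.0, scale=1.0, size=12)   # placeholder data
treated = rng.normal(loc=11.0, scale=2.0, size=12)   # note the wider spread

# Item 5 / 5a: normality within each group (Shapiro-Wilk)
for name, group in (("control", control), ("treated", treated)):
    print(f"{name}: normality p = {stats.shapiro(group).pvalue:.3f}")

# Item 5b: similar variance between compared groups (Levene's test)
print(f"equal-variance p = {stats.levene(control, treated).pvalue:.3f}")

# If variances differ, Welch's t-test is safer than Student's
t, p = stats.ttest_ind(control, treated, equal_var=False)
print(f"Welch t = {t:.2f}, p = {p:.4f}")
```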

Page 24

Page 26

Data Notebooks – Another Vulnerability

• Binders
• Electronic Notebooks
• Software documentation
• Field notes
• Images
• Algorithms

Page 27

Data Corrections & Amendments

• Errors, additions, and modifications should be identified by crossing out the original data with a single line (do not obscure the initial data) and initialing, dating, and providing a reason for the change.

• Missing or obscured data/pages are often interpreted as intentional obfuscation of data.

• Absence is interpreted as guilt.

Page 28

Data Forensics

• Numbers
• Images
• Hardware/Software

Page 29

Numbers
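The transcript does not preserve this slide's examples. One widely used screen for suspect numbers is comparing leading-digit frequencies against Benford's law, which many naturally occurring datasets approximately follow and fabricated figures often violate; a sketch with placeholder values:

```python
# Compare observed leading-digit frequencies to Benford's law.
# The values list is a placeholder for real reported numbers.
import math
from collections import Counter

def leading_digit(x: float) -> int:
    return int(f"{abs(x):.10e}"[0])   # scientific notation's first character

values = [13.2, 1.07, 220.0, 3.9, 1.8, 47.0, 2.6, 1.1, 19.0, 8.4]
counts = Counter(leading_digit(v) for v in values)

for d in range(1, 10):
    benford = math.log10(1 + 1 / d)             # expected frequency
    observed = counts.get(d, 0) / len(values)
    print(f"digit {d}: observed {observed:.2f}, Benford {benford:.2f}")
```

With only ten values this is purely illustrative; real digit-analysis screens need far larger samples before a deviation means anything.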

Page 30

Images

Page 31

Hardware

Software

Page 32

Questions?