
Page 1

Resolving Reproducibility in Computational Science:

Tools, Policy, and Culture

Victoria Stodden
School of Information Sciences

University of Illinois at Urbana-Champaign

Department of Computer Science and College of Law Joint Colloquium
University of Arizona

October 8, 2014

Page 2

Agenda

1. Reproducibility: Who Cares?

2. Unpacking Reproducibility: Empirical, Computational, Statistical

3. Empirical Evidence: Journal and Investigator Responses

4. Intellectual Property Barriers

5. (if time) Policy Responses

Page 3

Credibility Crisis

Page 4

Parsing Reproducibility

“Empirical Reproducibility”

“Computational Reproducibility”

“Statistical Reproducibility”

V. Stodden, IMS Bulletin (2013)

Page 6

II. Computational Reproducibility

Traditionally two branches to the scientific method:

• Branch 1 (deductive): mathematics, formal logic,

• Branch 2 (empirical): statistical analysis of controlled experiments.

Now, new branches due to technological changes?

• Branch 3, 4? (computational): large-scale simulations / data-driven computational science.

Page 7

Commonly believed...

“This book is about a new, fourth paradigm for science based on data-intensive computing.”

“It is common now to consider computation as a third branch of science, besides theory and experiment.”

Page 8

The Ubiquity of Error

The central motivation for the scientific method is to root out error:

• Deductive branch: the well-defined concept of the proof,

• Empirical branch: the machinery of hypothesis testing, appropriate statistical methods, structured communication of methods and protocols.

Claim: Computation presents only a potential third/fourth branch of the scientific method (Donoho, Stodden, et al. 2009), until the development of comparable standards.

Page 9

The Impact of Technology

1. Big Data / Data Driven Discovery: high dimensional data, p >> n,

2. Computational Power: simulation of the complete evolution of a physical system, systematically varying parameters,

3. Deep intellectual contributions now encoded only in software.

The software contains “ideas that enable biology...” (Stories from the Supplement, 2013).

Page 10

Scoping the Issue

JASA June issue    Computational articles    Code publicly available
1996               9 of 20                   0%
2006               33 of 35                  9%
2009               32 of 32                  16%
2011               29 of 29                  21%

Ioannidis (2011): of 500 papers studied, 9% had full primary raw data deposited.

Stodden (forthcoming): estimates that the computations in 27% of scientific articles published in Science today are reproducible.

Page 11

Really Reproducible Research

• “Really Reproducible Research” inspired by Stanford Professor Jon Claerbout:

“The idea is: An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete ... set of instructions [and data] which generated the figures.” David Donoho, 1998.
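To make Claerbout's principle concrete, a minimal, hypothetical Python sketch is given below: one script that goes from the raw data file to the published figure, so the "complete set of instructions and data" can travel with the article. The file name, data path, and the toy analysis are illustrative assumptions, not material from the talk.

# make_figure1.py: hypothetical sketch of a "really reproducible" result.
# Every step from raw data to the published figure lives in one script,
# so the figure can be regenerated with a single command.
import numpy as np
import matplotlib
matplotlib.use("Agg")              # render without a display, e.g. on a server
import matplotlib.pyplot as plt

def load_data(path="data/measurements.csv"):
    # Raw data ships with the compendium; nothing is edited by hand.
    return np.loadtxt(path, delimiter=",", skiprows=1)

def analyze(data):
    # The "scholarship": the exact computation behind the figure.
    x, y = data[:, 0], data[:, 1]
    slope, intercept = np.polyfit(x, y, deg=1)
    return x, y, slope, intercept

def main():
    x, y, slope, intercept = analyze(load_data())
    plt.scatter(x, y, s=10, label="observations")
    plt.plot(x, slope * x + intercept,
             label="fit: y = %.2fx + %.2f" % (slope, intercept))
    plt.legend()
    plt.savefig("figure1.pdf")     # the artifact that appears in the paper

if __name__ == "__main__":
    main()

Shipping this script, the data file, and a record of the software versions would let a reader rerun "python make_figure1.py" and obtain figure1.pdf exactly.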

Page 12

Journal Policy

• Journal Policy setting study design:

• Select all journals from ISI classifications “Statistics & Probability,” “Mathematical & Computational Biology,” and “Multidisciplinary Sciences” (this includes Science and Nature).

• N = 170, after deleting journals that have ceased publication.

• Create dataset with ISI information (impact factor, citations, publisher) and supplement with publication policies as listed on journal websites, in June 2011 and June 2012.
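As a rough illustration of the tabulation behind the policy tables on the following slides, here is a hypothetical Python/pandas sketch; the file name and column names are assumptions for illustration, not the actual variables from Stodden, Guo, Ma (2013).

# Tabulate the share of journals in each policy category, per year.
import pandas as pd

journals = pd.read_csv("journal_policies.csv")   # one row per journal, N = 170

for year in ("2011", "2012"):
    shares = journals["data_policy_" + year].value_counts(normalize=True)
    print("Data sharing policy, " + year + ":")
    print((shares * 100).round(1).to_string())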

Page 13

Journal Requirements

In January 2014 Science enacted new policies. The journal will check for:

1. a “data-handling plan,” i.e. how outliers will be dealt with,

2. sample size estimation for effect size,

3. whether samples are treated randomly,

4. whether the experimenter was blinded to the conduct of the experiment.

Statisticians added to the Board of Reviewing Editors.

Page 14

Journal Data Sharing Policy

Policy                                                       2011     2012
Required as condition of publication, barring exceptions    10.6%    11.2%
Required but may not affect editorial decisions              1.7%     5.9%
Encouraged/addressed, may be reviewed and/or hosted         20.6%    17.6%
Implied                                                        0%     2.9%
No mention                                                  67.1%    62.4%

Source: Stodden, Guo, Ma (2013) PLoS ONE, 8(6)

Page 15

Journal Code Sharing Policy

Policy                                                       2011     2012
Required as condition of publication, barring exceptions     3.5%     3.5%
Required but may not affect editorial decisions              3.5%     3.5%
Encouraged/addressed, may be reviewed and/or hosted           10%    12.4%
Implied                                                        0%     1.8%
No mention                                                  82.9%    78.8%

Source: Stodden, Guo, Ma (2013) PLoS ONE, 8(6)

Page 16

Findings

• Changemakers are journals with high impact factors.

• Progressive policies are not yet widespread, but they are being adopted rapidly.

• Close relationship between the existence of a supplemental materials policy and a data policy.

• No statistically significant relationship between data and code policies and open access policy.

• Data and supplemental material policies appear to lead software policy.

Page 17

Data / Code Sharing Practices

Survey of the NIPS community:

• 1,758 NIPS registrants up to and including 2008,

• 1,008 registrants when restricted to .edu registration emails,

• After piloting, the final survey was sent to 638 registrants,

• 37 bounces, 5 away, and 3 in industry left 593 valid recipients (638 − 45), giving a final response rate of 134 of 593, or 23%.

• Queried about reasons for sharing or not sharing data/code associated with their NIPS paper.

Page 18

Sharing Incentives

Code    Incentive                              Data
91%     Encourage scientific advancement       81%
90%     Encourage sharing in others            79%
86%     Be a good community member             79%
82%     Set a standard for the field           76%
85%     Improve the calibre of research        74%
81%     Get others to work on the problem      79%
85%     Increase in publicity                  73%
78%     Opportunity for feedback               71%
71%     Finding collaborators                  71%

Survey of the Machine Learning Community, NIPS (Stodden 2010)

Page 19

Barriers to Sharing

Code    Barrier                                   Data
77%     Time to document and clean up             54%
52%     Dealing with questions from users         34%
44%     Not receiving attribution                 42%
40%     Possibility of patents                    -
34%     Legal barriers (i.e. copyright)           41%
-       Time to verify release with admin         38%
30%     Potential loss of future publications     35%
30%     Competitors may get an advantage          33%
20%     Web/disk space limitations                29%

Survey of the Machine Learning Community, NIPS (Stodden 2010)

Page 20

ICERM Workshop

Page 21

ICERM Workshop Report

Page 22

Page 24

Research Compendia

Goal: improve understanding of reproducible computational science, trace sources of error.

• link data/code to published claims,

• enable re-use,

• sharing guide for researchers,

• certification of results,

• large scale validation of findings,

• stability, sensitivity checks.
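As a rough sketch of the first two goals above (linking data and code to published claims, and enabling re-use), the hypothetical manifest below records, for each claim, the script, inputs, environment, and expected output. The schema, field names, and license identifiers are illustrative assumptions, not the actual ResearchCompendia format.

# Compendium manifest: one entry per published claim, written out as JSON.
import json

manifest = {
    "article_doi": "10.0000/example.doi",        # persistent link to the publication
    "claims": [
        {
            "label": "Figure 1",
            "script": "code/make_figure1.py",     # code that regenerates the result
            "data": ["data/measurements.csv"],    # inputs archived with the compendium
            "environment": "requirements.txt",    # software versions for re-execution
            "expected_output": "figure1.pdf",
        }
    ],
    "license": {"code": "BSD-3-Clause", "data": "CC0-1.0", "text": "CC-BY-4.0"},
}

with open("compendium.json", "w") as fh:
    json.dump(manifest, fh, indent=2)

A certification or validation service could then walk such a manifest, re-execute each script, and compare the outputs against the expected artifacts.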

Page 25

Page 26

Page 27

Page 28

Page 29

3. Reproducibility as a Statistical Issue

• False discovery, chasing significance, p-hacking (Simonsohn 2012), file drawer problem, overuse and misuse of p-values,

• Multiple testing, sensitivity analysis, poor reporting/tracking practices,

• Data preparation, treatment of outliers,

• Poor statistical methods (nonrandom sampling, inappropriate tests or models,..)

• Model robustness to parameter changes and data perturbations,

• Investigator bias toward previous findings; conflicts of interest.
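Two of the points above, multiple testing and robustness to data perturbations, can be illustrated with a minimal sketch in Python, assuming synthetic data and standard numpy/scipy; it is not an analysis from the talk.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# (1) Multiple testing: 1,000 t-tests on pure noise yield roughly 5%
# "significant" results at p < 0.05 by chance alone; a Bonferroni-corrected
# threshold removes essentially all of them.
pvals = np.array([
    stats.ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue
    for _ in range(1000)
])
print("uncorrected 'discoveries':", int((pvals < 0.05).sum()))
print("Bonferroni 'discoveries': ", int((pvals < 0.05 / len(pvals)).sum()))

# (2) Robustness: re-estimate a slope on jittered copies of the data and
# report the spread; a stable finding should not swing wildly.
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(scale=1.0, size=100)
slopes = [
    np.polyfit(x, y + rng.normal(scale=0.1, size=100), 1)[0]
    for _ in range(200)
]
print("slope under perturbation: %.2f +/- %.2f" % (np.mean(slopes), np.std(slopes)))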

Page 30

We need:

Standards for reproducibility of big data findings:

1. data access, software access, persistent linking to publications.

2. innovation around data and code access for privacy protection and scale.

3. robust methods, producing stable results, emphasis on reliability and reproducibility.

Example: Google Flu Trends worked at first, but what happened? (Lazer et al., “The Parable of Google Flu: Traps in Big Data Analysis,” Science, 2014)

Page 31

Legal Barriers: Copyright

• Original expression of ideas falls under copyright by default (papers, code, figures, tables..)

• Copyright secures exclusive rights vested in the author to:

- reproduce the work

- prepare derivative works based upon the original

“To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries.” (U.S. Const. art. I, §8, cl. 8)

Exceptions and Limitations: Fair Use.

Page 32

Responses Outside the Sciences 1: Open Source Software

• Software with licenses that communicate alternative terms of use to code developers, rather than the copyright default.

• Hundreds of open source software licenses:

- GNU General Public License (GPL)

- (Modified) BSD License

- MIT License

- Apache 2.0 License

- ... see http://www.opensource.org/licenses/alphabetical

Page 33

Responses Outside the Sciences 2: Creative Commons

• Founded in 2001 by Stanford Law Professor Larry Lessig, MIT EECS Professor Hal Abelson, and advocate Eric Eldred.

• Adapts the Open Source Software approach to artistic and creative digital works.

Page 34

Response from Within the Sciences

• A suite of license recommendations for computational science:

• Release media components (text, figures) under CC BY,

• Release code components under Modified BSD or similar,

• Release data to public domain or attach attribution license.

➡ Remove copyright’s barrier to reproducible research and,

➡ Realign the IP framework with longstanding scientific norms.

The Reproducible Research Standard (RRS) (Stodden, 2009)

Winner of the Access to Knowledge Kaltura Award 2008

Page 35

Copyright and Data

• Copyright adheres to raw facts in Europe.

• In the US raw facts are not copyrightable, but the original “selection and arrangement” of these facts is copyrightable. (Feist Publns Inc. v. Rural Tel. Serv. Co., 499 U.S. 340 (1991)).

• The possibility of a residual copyright in data suggests attribution licensing or public domain certification.

• Law doesn’t match reality on the ground: What constitutes a “raw” fact anyway?

Page 36

Bayh-Dole Act (1980)

Promotes the transfer of academic discoveries for commercial development via the licensing of patents (i.e., through Technology Transfer Offices).

Bayh-Dole Act gave federal agency grantees and contractors title to government-funded inventions and charged them with using the patent system to aid disclosure and commercialization of the inventions.

Greatest impact in biomedical research collaborations and drug discovery. Now, software patents also impact science.

Page 37

Ownership of Research Codes

Page 38

Disclosure of Research Codes

Claim: Codes would (eventually) be fully open in the absence of Bayh-Dole:

• Grassroots “Reproducible Research” movement in computational science (policy development, best practices, tool development),

• Changes in funding agency requirements,

• Changes in journal publication requirements.

Page 39

Share Your Code..

Page 40

References

“Toward Reproducible Computational Research: An Empirical Analysis of Data and Code Policy Adoption by Journals,” PLoS ONE, June 2013.

“Reproducible Research,” guest editor for Computing in Science and Engineering, July/August 2012.

“Reproducible Research: Tools and Strategies for Scientific Computing,” July 2011.

“Enabling Reproducible Research: Open Licensing for Scientific Innovation,” 2009.

available at http://www.stodden.net

Page 41

Page 42

Policy Responses

Page 43

Open Science from the White House

• Feb 22, 2013: Executive Memorandum directing federal funding agencies to develop plans for public access to data and publications.

• May 9, 2013: Executive Order directing federal agencies to make their data publicly available.

• July 29, 2014: Notice of Request for Information “Strategy for American Innovation”

Page 44

Executive Memorandum: “Expanding Public Access to the Results of Federally Funded Research”

• “Access to digital data sets resulting from federally funded research allows companies to focus resources and efforts on understanding and exploiting discoveries.”

• “digitally formatted scientific data resulting from unclassified research supported wholly or in part by Federal funding should be stored and publicly accessible to search, retrieve, and analyze.”

• “digital recorded factual material commonly accepted in the scientific community as necessary to validate research findings”

• “Each agency shall submit its draft plan to OSTP within six months of publication of this memorandum.”

Page 45

Executive Order: “Making Open and Machine Readable the New Default for Government Information”

• “The Director … shall issue an Open Data Policy to advance the management of Government information as an asset”

• “Agencies shall implement the requirements of the Open Data Policy”

• “Within 30 days of the issuance of the Open Data Policy, the CIO and CTO shall publish an open online repository of tools and best practices”

Page 46

Request for Input: “Strategy for American Innovation”

• “to guide the Administration's efforts to promote lasting economic growth and competitiveness through policies that support transformative American innovation in products, processes, and services and spur new fundamental discoveries that in the long run lead to growing economic prosperity and rising living standards.”

• “(11) Given recent evidence of the irreproducibility of a surprising number of published scientific findings, how can the Federal Government leverage its role as a significant funder of scientific research to most effectively address the problem?”

Page 47

National Science Board Report

“Digital Research Data Sharing and Management,” December 2011.

http://www.nsf.gov/nsb/publications/2011/nsb1124.pdf

Page 48

Sharing: Funding Agency Policy

• NSF grant guidelines: “NSF ... expects investigators to share with other researchers, at no more than incremental cost and within a reasonable time, the data, samples, physical collections and other supporting materials created or gathered in the course of the work. It also encourages grantees to share software and inventions or otherwise act to make the innovations they embody widely useful and usable.” (2005 and earlier)

• NSF peer-reviewed Data Management Plan (DMP), January 2011.

• NIH (2003): “The NIH expects and supports the timely release and sharing of final research data from NIH-supported studies for use by other researchers.” (applications requesting over $500,000 must include a data sharing plan)

Page 49

NSF Data Management Plan

“Proposals submitted or due on or after January 18, 2011, must include a supplementary document of no more than two pages labeled ‘Data Management Plan.’ This supplementary document should describe how the proposal will conform to NSF policy on the dissemination and sharing of research results.” (http://www.nsf.gov/bfa/dias/policy/dmp.jsp)

Software management plans are also appearing (e.g., the joint NSF/NIH BigData solicitation).

Page 50

DOE Data Management Plan

“The Department is taking a phased approach to the implementation of requirements set forth by the OSTP memo. In particular, the Office of Science, which supports roughly two-thirds of the total R&D for the Department, plans to pilot a data management policy with the requirements described below by July 28, 2014. Other DOE Offices and elements with over $100 million in annual conduct of research and development expenditures will implement data management plan requirements that satisfy the requirements of the OSTP memo no later than October 1, 2015 in such a way that there is a single DOE policy for data management planning.” (DOE Public Access Plan 2014)

Page 51

NAS Data Sharing Report

• Sharing Publication-Related Data and Materials: Responsibilities of Authorship in the Life Sciences, (2003)

• “Principle 1. Authors should include in their publications the data, algorithms, or other information that is central or integral to the publication—that is, whatever is necessary to support the major claims of the paper and would enable one skilled in the art to verify or replicate the claims.”