You Got Your Engineering in my Data Science - Addressing the Reproducibility Crisis with Software Engineering

Upload: jonbodner

Post on 15-Jan-2017


YOU GOT YOUR ENGINEERING IN MY DATA SCIENCE

ADDRESSING THE REPRODUCIBILITY CRISIS WITH SOFTWARE ENGINEERING

1

WE SEE PATTERNS

2

SCIENCE USED TO BE A SOLO OPERATION…

3

…BUT NOW IT’S NOT

THE OVERALL HIGGS ANALYSIS WAS PERFORMED BY A TEAM OF MORE THAN 600 PHYSICISTS.

- Neal Hartman, “Who Really Found the Higgs Boson”, Nautilus Issue 18

4

DATA SCIENCE IMPROVES EVERYTHING

Clinical recommendations discouraging the use of CYP2D6 gene testing to guide tamoxifen therapy in breast cancer patients are based on studies with flawed methodology and should be reconsidered, according to the results of a Mayo Clinic study published in the Journal of the National Cancer Institute.

- Joe Dangor, Mayo Clinic News Network, December 9, 2014

5

SEARCHING FOR PATTERNS

6

7

8

PROBLEMS WITH ANALYSIS TOOLS

FALSE POSITIVES IN FMRI RESEARCH

▸ After crunching the numbers, “we think that around 3,000 studies could be affected,” says Dr Eklund. But without revisiting each and every study, it is impossible to know which those 3,000 are.

9

PROBLEMS WITH PROCESS

PSYCHOLOGICAL RESEARCH

▸ “Estimating the reproducibility of psychological science”

▸ Brian Nosek, Science, August 2015

▸ 270 co-authors tried to reproduce 100 studies

▸ 36% could be reproduced

10

PROBLEMS WITH PROCESS

PSYCHOLOGICAL RESEARCH

“Nosek said there were three possible reasons for his results: that the original effect could have been false positive, that the replication was a false negative, or that both the original and replication results are accurate but that each experiment’s methodology differed in significant ways.”

- Colleen Flaherty, Inside Higher Ed, August 2015

11

PROBLEMS WITH DATA

11% OF STUDIES REPRODUCIBLE

12

PROBLEMS WITH DATA

“For results that could not be reproduced, however, data were not routinely analyzed by investigators blinded to the experimental versus control groups. Investigators frequently presented the results of one experiment, such as a single Western-blot analysis. They sometimes said they presented specific experiments that supported their underlying hypothesis, but that were not reflective of the entire data set. There are no guidelines that require all data sets to be reported in a paper; often, original data are removed during the peer review and publication process.”

- C. Glenn Begley

13

IT CAN BE PROVEN THAT MOST CLAIMED RESEARCH FINDINGS ARE FALSE.

- John Ioannidis

14
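The argument behind this claim can be sketched numerically. The following is a minimal sketch, assuming the simplest form of the model in the Ioannidis paper credited for slide 14 (no bias term, a single research team): the chance that a statistically significant finding is true depends on the pre-study odds R, the significance threshold alpha, and the statistical power. The function name and the example numbers are illustrative assumptions, not figures from the talk.

```python
def positive_predictive_value(prior_odds: float, alpha: float = 0.05,
                              power: float = 0.8) -> float:
    """Probability that a statistically significant finding is actually true.

    prior_odds: R, the ratio of true to null relationships among those tested.
    alpha:      the false positive rate (significance threshold).
    power:      1 - beta, the chance of detecting a relationship that exists.
    """
    if prior_odds < 0 or not 0 < alpha < 1 or not 0 < power <= 1:
        raise ValueError("need prior_odds >= 0, 0 < alpha < 1, 0 < power <= 1")
    true_positives = power * prior_odds   # (1 - beta) * R
    false_positives = alpha               # alpha * 1, per unit of null relationships
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Illustrative scenarios (assumed values): a well-powered confirmatory
    # study versus an underpowered exploratory one.
    print(positive_predictive_value(prior_odds=1.0, power=0.8))
    print(positive_predictive_value(prior_odds=0.1, power=0.2))
```

Under this simplified model a finding is more likely true than false only when power times R exceeds alpha, which is why low power and long-shot hypotheses push the share of true findings below one half.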

THE REPRODUCIBILITY CRISIS

15

16

REPRODUCIBILITY IN SOFTWARE ENGINEERING

“IT WORKS ON MY MACHINE”

- Every Single Software Developer Ever

17

VERSION YOUR CODE AND DATA

VERSION CONTROL

18
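Code fits naturally in a version control system; large data files often do not. One possible approach, sketched here under that assumption, is to commit a small manifest of content hashes next to the code so that any change to the data is detectable. The function names and the manifest layout are hypothetical, chosen only for illustration.

```python
import hashlib
import json
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large data sets do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(data_files: list[Path], manifest_path: Path) -> dict[str, str]:
    """Record one hash per data file; the manifest is what gets committed."""
    manifest = {str(path): sha256_of_file(path) for path in sorted(data_files)}
    manifest_path.write_text(json.dumps(manifest, indent=2, sort_keys=True) + "\n")
    return manifest


def verify_manifest(manifest_path: Path) -> list[str]:
    """Return the files whose current contents no longer match the manifest."""
    manifest = json.loads(manifest_path.read_text())
    return [name for name, expected in manifest.items()
            if sha256_of_file(Path(name)) != expected]
```

Running `verify_manifest` at the start of an analysis gives an early warning that the data on disk is not the data the committed code was written against.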

USE A BUILD SCRIPT

19
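A build script for an analysis is a single entry point that runs every step, in order, from raw input to final result. The sketch below assumes a toy pipeline: the step names, the value range used for cleaning, and the default seed are all hypothetical, and a seeded generator stands in for reading real raw data.

```python
import random
import statistics


def load_raw_data(seed: int) -> list[float]:
    """Stand-in for reading raw data; the seed makes every rerun identical."""
    rng = random.Random(seed)
    return [rng.gauss(100.0, 15.0) for _ in range(1000)]


def clean(values: list[float]) -> list[float]:
    """Drop values outside an assumed plausible range."""
    return [v for v in values if 40.0 <= v <= 160.0]


def analyze(values: list[float]) -> dict[str, float]:
    """Summarize the cleaned data."""
    return {
        "n": len(values),
        "mean": statistics.fmean(values),
        "stdev": statistics.stdev(values),
    }


def build(seed: int = 20170115) -> dict[str, float]:
    """Run the whole pipeline with no manual steps in between."""
    return analyze(clean(load_raw_data(seed)))


if __name__ == "__main__":
    print(build())
```

Because nothing depends on the order in which someone happened to run things by hand, two people running `build()` with the same seed get the same numbers.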

REVIEW YOUR CODE

20

21

DEFINE STANDARD FORMATS

22

FUZZING

23
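Fuzzing means feeding a program large amounts of random or malformed input and checking that it never misbehaves. A minimal sketch using only the standard library follows; `parse_measurement` is a hypothetical function under test, and the input alphabet is an assumption chosen to produce near-valid strings.

```python
import math
import random
import string


def parse_measurement(text: str) -> float:
    """Hypothetical function under test: parse text such as '12.5 mg'."""
    parts = text.split()
    if len(parts) != 2 or parts[1] != "mg":
        raise ValueError(f"expected '<number> mg', got {text!r}")
    value = float(parts[0])  # raises ValueError on malformed numbers
    if not math.isfinite(value) or value < 0:
        raise ValueError(f"dose must be finite and non-negative: {text!r}")
    return value


def fuzz(trials: int = 5000, seed: int = 0) -> int:
    """Throw random strings at the parser; return how many it accepted."""
    rng = random.Random(seed)  # seeded so a failing input can be replayed
    alphabet = string.digits + ".-+e mg" + "naif"  # biased toward near-valid input
    accepted = 0
    for _ in range(trials):
        length = rng.randint(0, 10)
        text = "".join(rng.choice(alphabet) for _ in range(length))
        try:
            value = parse_measurement(text)
        except ValueError:
            continue  # rejecting bad input cleanly is the desired behavior
        # Any other exception type propagates and fails the fuzz run.
        assert math.isfinite(value) and value >= 0, text
        accepted += 1
    return accepted


if __name__ == "__main__":
    print("accepted inputs:", fuzz())
```

The check is deliberately weak: it does not know the right answer for each input, only that every accepted value must be a finite, non-negative number and every rejection must be a clean `ValueError`.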

USE IT, RELEASE IT

OPEN SOURCE

24

TAKE ADVANTAGE OF MODERN TECHNOLOGY

25

CREATING INTERACTIVE PUBLICATIONS

“Truly Interactive Science Publishing was shown to have enough educational value that readers were willing to invest in the needed set-up and learning phases. Problems encountered in network and computer speed can now be minimized by running the ISP software in a cloud computing environment which will minimize the dependence on local computer and network speeds. The social aspects of data sharing and the enlarged review process may be the hardest obstacles to overcome.”

- Dr. Michael Ackerman

26

PUTTING IT ALL TOGETHER

BEST PRACTICES FOR SOFTWARE ENGINEERING AND DATA SCIENCE

▸ Version your code and data

▸ Provide a build script

▸ Review your code

▸ Run automated positive and negative tests

▸ Stick to standards

▸ Use open source when you can

▸ Release your own work as open source when you can

▸ Take advantage of modern technology

27
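The testing item in the list above can be illustrated with one positive and one negative test. This is a minimal sketch: `mean_difference` is a hypothetical analysis step, and plain assertions stand in for whichever test framework a team actually uses.

```python
import statistics


def mean_difference(treatment: list[float], control: list[float]) -> float:
    """Hypothetical analysis step: difference between two group means."""
    if not treatment or not control:
        raise ValueError("both groups need at least one observation")
    return statistics.fmean(treatment) - statistics.fmean(control)


def test_known_answer() -> None:
    # Positive test: a hand-checked input must give the hand-checked output.
    assert mean_difference([2.0, 4.0], [1.0, 1.0]) == 2.0


def test_rejects_empty_group() -> None:
    # Negative test: bad input must fail loudly instead of returning a number.
    try:
        mean_difference([], [1.0])
    except ValueError:
        return
    raise AssertionError("empty treatment group was accepted")


if __name__ == "__main__":
    test_known_answer()
    test_rejects_empty_group()
    print("all tests passed")
```

The negative test matters as much as the positive one: an analysis that quietly returns a number for invalid input is how unreproducible results get published.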

THERE IS NO SILVER BULLET

28

THANKS TO

▸ Andrew Schechtman-Rook

▸ Jacqueline Kazil

▸ Jeanie Drury

29

WHO AM I

JONATHAN BODNER

▸ Tech Fellow, Capital One

▸ @jonbodner

30

Image and Content Credits:

2. http://www.telescope.com/assets/images/starcharts/2016-10-starchart_col.png

3. https://xkcd.com/1584/

4. http://nautil.us/issue/18/genius/who-really-found-the-higgs-boson

5. https://news.virginia.edu/content/capital-one-cio-talks-big-data-innovation-ahead-tonight-s-information-session, http://newsnetwork.mayoclinic.org/discussion/mayo-clinic-genotyping-errors-plague-cyp2d6-testing-for-tamoxifen-therapy/, https://www.google.com/patents/US8615473, https://www.bloomberg.com/news/articles/2016-09-20/microsoft-develops-ai-to-help-cancer-doctors-find-the-right-treatments

6. By Lokilech - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=1804667

7. http://news.stanford.edu/news/2012/september/austen-reading-fmri-090712.html

8. http://www.popsci.com/science/article/2010-05/hollywood-science-how-your-brain-reacts-horror-movies

9. http://www.economist.com/news/science-and-technology/21702166-two-studies-one-neuroscience-and-one-palaeoclimatology-cast-doubt

11. https://www.insidehighered.com/news/2015/08/28/landmark-study-suggests-most-psychology-studies-dont-yield-reproducible-results

12. http://www.nature.com/nature/journal/v483/n7391/full/483531a.html

14. http://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124

31

Image and Content Credits:

15. http://xkcd.com/1574/

16. https://www.flickr.com/photos/vannispen/4608436679

18. https://xkcd.com/1597/

20. https://xkcd.com/1695/

21. http://hyperboleandahalf.blogspot.com/2010/06/this-is-why-ill-never-be-adult.html

22. https://xkcd.com/927/

23. https://www.flickr.com/photos/lamenta3/4349576638

24. https://www.flickr.com/photos/jalbertbowdenii/5682524083

25. http://quod.lib.umich.edu/j/jep/3336451.0018.201?view=text;rgn=main

28. https://www.flickr.com/photos/eschipul/4160817135

32