
Reproducible computational research

Geir Kjetil Sandve, INF-STK 5010,

19.feb 2014

The goals for today

Show you why reproducibility is important, and why it is so frustratingly difficult to get it right

The form of this session

Instead of the usual 4x45 minute sessions (2x lecture and 2x exercise), we take a single 105 minute session that is interspersed with 4 group discussions

Outline

• What are replication and reproducibility?

• Ensuring that you can reproduce your own analyses

• Improving transparency and reproducibility through rich results

• Ensuring that published results are reproducible by peers

Replication and reproducibility

• “Replication is a cornerstone of a cumulative science” Science 334(6060):1182

• “Reproducibility has the potential to serve as a minimum standard for judging scientific claims when full independent replication of a study is not possible.” Science 334(6060):1226-7

In silico analysis and reproducibility

• Bioinformaticians get surprised every time they need to redo/modify previous analyses

• But you bench biologists already know the importance of reproducibility!

• You also know that even with a detailed lab journal, reproduction is a challenge

• The question is then how this manifests itself when doing analysis on a computer

What is in silico reproducibility?

• Basically the same issues as at the bench:

• Materials -> Data sources

• Experiment conditions -> Analysis parameters

• Equipment (and models) -> Programs (and versions)

• And the same challenges:

• Are all relevant conditions described accurately?

• Will the same materials and equipment be available?

What is the current status of reproducibility?

• Less than half of selected microarray experiments published in Nature Genetics could be reproduced (Nat Genet 41(2):149-55)

• More than half [of surveyed papers] do not provide primary data and list neither the version nor the parameters used [for read mapping] (Nat Rev Genet 13(9):667-72)

But really, is reproducibility that critical?

If so, why?

What major purposes does reproducibility serve?

When would you start thinking about it - from the start or at time of publication? (and why?)

Discuss in groups of 2-4 people and take notes!

Why should we care? (that published findings are reproducible)

• “Credibility crisis of scientific research” (V. Stodden)

• Possibly related to a sharply increasing rate of retractions. (J Med Ethics. 37(4):249-53)

• Possibly related to a sharply increasing rate of failed clinical trials. (Nat Rev Drug Discov. 10(9):712)

Why should you care? (about making your analyses reproducible)

• Because it’s the right thing to do!

• .. and the one struggling to reproduce it is often the future you

• Journals are becoming aware of the issues

• Reviewers may value it

• “Discussing issues related to data replication should become part of student training” Science 334(6060):1182

Ensuring that you can reproduce your own analyses

A confession: “why can’t I reproduce my good results”

• Just about to submit manuscript

• Did a quick check of parameter robustness

• Much worse results than in manuscript

• Tried dozens of settings, but never as good

• What had I used before? Had I just been extremely lucky?

• Conclusion after much sweat:

• A slight rounding inconsistency had at some point been introduced in the program and weakened performance

Rule 1: Track analyses behind results

• A figure/table/claim is typically the result of a long sequence of interrelated steps

• Note every step, and every detail that may influence the execution of the step

• Results and specification can easily get out of sync

• By specifying the full analysis workflow, in a form that allows for direct execution, one can ensure that the specification matches the analysis that was (subsequently) performed
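
One way to keep specification and results in sync is to make the specification itself executable. The following is a minimal sketch in Python, assuming a hypothetical two-step analysis; the step functions, parameters and log format are illustrative and not taken from the lecture.

```python
import json

def filter_values(values, threshold):
    """Keep only values at or above the threshold."""
    return [v for v in values if v >= threshold]

def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

# The workflow is data: an ordered list of (function, parameters).
# Running it and describing it use the same object, so they cannot drift apart.
WORKFLOW = [
    (filter_values, {"threshold": 2.0}),
    (mean, {}),
]

def run_workflow(data, workflow):
    """Execute each step in order and return (result, log of what was run)."""
    log = []
    for func, params in workflow:
        data = func(data, **params)
        log.append({"step": func.__name__, "params": params})
    return data, log

result, log = run_workflow([1.0, 2.0, 3.0, 4.0], WORKFLOW)
print(result)
print(json.dumps(log, indent=2))
```

Storing the printed log next to the result gives a record of every step and parameter that was actually executed.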

How to ensure that you can reproduce your analyses?

• Just write your analysis as an R script and you’re good!

• Why isn’t it that easy? Why (and how) does it typically still go wrong?

• Discuss in groups of 2-4 people and take notes!

Challenges with reproducing your own results

• Things change!

• Code: When you develop/update your method

• Data: When you get new versions or curate it

• Results: When you run new code or use new data

• You always miss something!

• A typical analysis involves tens of steps: if you forget a tiny detail at a tiny step, that might be enough to derail the whole analysis

Rule 4: Version control scripts

• Every time you change a single line of your code, you have a new program

• Every time you generate a result, it is based on a unique version of your program

• You need to store every version that was used for any (!) purpose

• myProg1.R, myProg2.R … myProg174.R, or use a version control system (e.g. Git)
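
To illustrate how a result can be tied to the exact script version behind it, here is a minimal sketch in Python that stores a content hash of the script alongside the result. The function names, file names and record layout are assumptions made for this example; a version control system does the same job more thoroughly.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def file_sha256(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def save_result_with_provenance(result, script_path, out_path):
    """Write the result together with the digest of the script that produced it."""
    record = {
        "result": result,
        "script": str(script_path),
        "script_sha256": file_sha256(script_path),
    }
    Path(out_path).write_text(json.dumps(record, indent=2))
    return record

with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "myProg.py"
    script.write_text("print('v1')\n")
    first = save_result_with_provenance(0.93, script, Path(tmp) / "result1.json")
    script.write_text("print('v2')\n")  # one changed line = a new program
    second = save_result_with_provenance(0.93, script, Path(tmp) / "result2.json")
    print(first["script_sha256"] != second["script_sha256"])
```

In a Git repository, the commit hash printed by `git rev-parse HEAD` could serve the same purpose, provided there are no uncommitted changes when the result is generated.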

Rule 3: Archive exact program versions

• You will typically also need to rely on external R packages or command line tools

• These programs also tend to be organic - the exact version used might not be available in the future

• A slight change of behavior may derail your whole analysis pipeline, and is usually hard to track down

• Be sure to note the version and archive the program files
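
Noting versions can itself be scripted. The sketch below, in Python, collects the interpreter version, the platform and the installed versions of a list of packages; the function name and the package names passed to it are placeholders chosen for this example.

```python
import json
import platform
from importlib import metadata

def environment_snapshot(package_names):
    """Return interpreter, OS and package versions as a plain dict."""
    packages = {}
    for name in package_names:
        try:
            packages[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            packages[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": packages,
    }

# Hypothetical package list; the second name is deliberately not installed.
snapshot = environment_snapshot(["numpy", "a-package-that-does-not-exist"])
print(json.dumps(snapshot, indent=2))
```

This only covers the "note version" half of the rule. Archiving the program files themselves remains a separate step. In R, `sessionInfo()` reports comparable information.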

Rule 2: Avoid manual data manipulation

• Slight modifications/corrections of a data file may be needed for correct loading in R or a tool

• Example: A single row with a missing entry (column) is enough to make trouble for “read.table” in R

• Very tempting to open in Excel and change a few entries

• But how would you describe such manual adjustments?

• Hint: a video camera may be your easiest option..
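
A scripted correction documents itself. As a rough sketch of the missing-entry case, the Python function below pads short rows in a tab-separated table and reports every change it made. The function name, the `NA` fill value and the sample data are assumptions made for this example.

```python
import csv
import io

def pad_short_rows(raw_text, n_columns, fill="NA"):
    """Pad rows that have too few tab-separated fields; return (fixed_text, changes)."""
    reader = csv.reader(io.StringIO(raw_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    changes = []
    for line_no, row in enumerate(reader, start=1):
        if len(row) < n_columns:
            missing = n_columns - len(row)
            changes.append(f"line {line_no}: padded {missing} field(s)")
            row = row + [fill] * missing
        writer.writerow(row)
    return out.getvalue(), changes

# Made-up table in which the last row lacks its third column.
raw = "gene\tscore\tpvalue\nabc1\t0.5\t0.01\nxyz2\t0.7\n"
fixed, changes = pad_short_rows(raw, n_columns=3)
print(fixed)
print(changes)
```

The original file stays untouched, and the list of changes can be stored with the corrected file, so nobody needs to remember what was edited by hand.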

Improving transparency and reproducibility through rich results

Reproducibility, transparency and rich results

• But results have nothing to do with reproducibility!?

• If you just supply initial data and the full code, everything can be reproduced exactly! (that’s the whole point, right?)

• Why isn’t it as clear-cut? How are intermediate and underlying data/results related to reproducibility?

• Discuss in groups of 2-4 people and take notes!

Rule 5: Record intermediate results

• Initial data and code allow any result to be reproduced

• But what if a step takes weeks to run - do you want to re-run due to a small change downstream?

• What if there is a problem with one slight step - do you want the full chain to flop?

• If you just want a quick verification of intermediate data - do you want to install and run the whole script?

• Store all intermediate data as stable files from the start

• Provides safety net and allows quick inspection
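
Storing intermediate data as files can be built into the pipeline itself. The following Python sketch runs a step only when its output file does not yet exist; `cached_step`, the file name and the toy computation are hypothetical names used for illustration.

```python
import json
import tempfile
from pathlib import Path

def cached_step(path, compute):
    """Load the step's result from `path` if present, otherwise compute and store it."""
    path = Path(path)
    if path.exists():
        return json.loads(path.read_text())
    result = compute()
    path.write_text(json.dumps(result))
    return result

calls = []  # records how often the expensive step really runs

def slow_step():
    """Stand-in for a step that would take weeks to run."""
    calls.append("ran")
    return [x * x for x in range(5)]

with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "step1_squares.json"
    first = cached_step(store, slow_step)   # computes and writes the file
    second = cached_step(store, slow_step)  # reads the stored file instead
    print(first == second, len(calls))
```

A stored file goes stale when the code or input behind it changes, so in practice it should be kept together with a record of the program version that produced it.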

Rule 7: Store raw data behind plots

• Don’t just trust your eyeballing

• And don’t force your peers (and reviewers) to take your word for it - they have the right (and duty) to double check

• Makes it easy to adjust the way (the same) data is plotted

• It doesn’t feel good to be Photoshopping your plots..
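
A simple habit is to write the plotted values to a data file named after the figure, at the moment the figure is made. Here is a minimal Python sketch of that idea; the helper name, the file names and the numbers are made up for this example, and the plotting call itself is left out.

```python
import csv
import tempfile
from pathlib import Path

def save_plot_data(figure_path, x_values, y_values, x_label="x", y_label="y"):
    """Write the plotted points to a CSV file named after the figure."""
    data_path = Path(figure_path).with_suffix(".csv")
    with open(data_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([x_label, y_label])
        writer.writerows(zip(x_values, y_values))
    return data_path

with tempfile.TemporaryDirectory() as tmp:
    # The figure itself would be drawn by whatever plotting library is in use.
    data_file = save_plot_data(
        Path(tmp) / "figure2.png",
        [1, 2, 3],
        [0.2, 0.4, 0.8],
        x_label="threshold",
        y_label="sensitivity",
    )
    stored = data_file.read_text().splitlines()
    print(data_file.name, stored)
```

With the values on file, the same data can be re-plotted in a different style without rerunning the analysis.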

Rule 8: Generate hierarchical output

• Figures and tables typically tell a high-level tale

• Each value in a plot/table may be the average of dozens of concrete values

• Sanity checks can often only be made through inspection of the individual values behind such summaries

• Hypertext shows the way

• Each single value in a table (or plot) can be a link to a new page/table giving all detailed underlying results
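
The hypertext idea can be sketched in a few lines of Python: a summary table in which each mean links to a page listing the individual values behind it. The function name, group names and HTML layout are assumptions for this example, and group names are assumed to be safe to use as file names.

```python
import html
import tempfile
from pathlib import Path
from statistics import mean

def write_hierarchical_report(out_dir, groups):
    """Write index.html with one mean per group, each linking to a detail page."""
    out_dir = Path(out_dir)
    rows = []
    for name, values in groups.items():
        # Detail page: every individual value behind the summary number.
        detail = out_dir / f"{name}.html"
        items = "".join(f"<li>{v}</li>" for v in values)
        detail.write_text(f"<h1>{html.escape(name)}</h1><ul>{items}</ul>")
        # Summary row: the mean is a link to the detail page.
        rows.append(
            f"<tr><td>{html.escape(name)}</td>"
            f'<td><a href="{detail.name}">{mean(values):.2f}</a></td></tr>'
        )
    index = out_dir / "index.html"
    index.write_text("<table>" + "".join(rows) + "</table>")
    return index

with tempfile.TemporaryDirectory() as tmp:
    groups = {"sampleA": [1.0, 2.0, 3.0], "sampleB": [4.0, 6.0]}
    index_text = write_hierarchical_report(tmp, groups).read_text()
    print(index_text)
```

A reader who doubts a summary value can then follow the link and inspect the concrete values directly.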

Another confession

• Email received Feb 2013:

• In the paper [from 2007], you and your colleagues presented that ‘To make the data sets more coherent, we removed fragments that…’

• I am very interested if you could kindly provide a version of the real datasets that retain such corrupted sequences …

• My answer:

• I have been trying to find something on it in my old archives, but unfortunately without success. I have changed institution and computers since I worked on that project, and although I made sure to store the main data sets, I unfortunately did not store all intermediate files and details…

Ensuring that published results are reproducible by peers

Reproducibility and peers

• If someone has questions, my email is in the paper!

• Take a wild guess: do you think people typically answer (promptly)?

• “To authors and developers of the .. method. We are worried about the soundness of your method. We think it is flawed and often produces false results.” (25 days and counting)

• Who wants to reproduce if it involves indefinite waiting?

• “the most important function of prepublication peer review and editing is to make effective postpublication peer review possible” (Ann Intern Med. 146(6):450-3)

Putting it all together (connecting the dots)

• As long as I store every file, all information is there!

• But who wants to browse through hundreds of files to find the exact one that is behind a result of interest? (you don’t want to either)

Making published results reproducible by peers

• How can you in a practical way provide your peers with all the details needed to reproduce your results?

• How can you ensure that you and your peers can trace the exact underlying analyses and data behind a particular (published) claim/figure/table?

• Discuss in groups of 2-4 people and take notes!

Rule 9: Connect statements to results

• As the Chinese say, 1001 words are worth more than a picture..

• The majority of our contributions are brought through words

• We like to think that our major claims are founded on empirics

• When claims are based on concrete underlying analyses and results, they should be connected from the start

• How to connect claims to results/analyses

• Sweave (R with LaTeX), GenePattern add-in (Word), Galaxy Pages (web pages)
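
The tools named above do this properly. The principle behind them can be sketched in plain Python: the sentence holds placeholders, and the numbers are filled in from the stored results rather than typed by hand. The result values and the wording of the sentence are invented for this example.

```python
import json
from string import Template

# Stand-in for a results file written by the analysis (hypothetical values).
results_json = '{"n_genes": 120, "n_significant": 18}'
results = json.loads(results_json)
results["percent_significant"] = round(
    100 * results["n_significant"] / results["n_genes"], 1
)

# The sentence holds placeholders rather than hand-typed numbers.
sentence = Template(
    "Of $n_genes genes tested, $n_significant ($percent_significant%) were significant."
)
text = sentence.substitute(results)
print(text)
```

If the analysis is rerun and the results change, the sentence changes with them, and `substitute` raises a `KeyError` when a placeholder has no matching result.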

Rule 10: Provide public access to analyses and results

• Online supplementary material is a start

• Guaranteed to be stable for many years

• Direct links from figures/tables/claims to pages with underlying data and code are sure to come

• Ideally, you could get to a web tool where settings could be changed and the figure/table re-created

Summary

Do I live by my rules? (today’s last confession)

• “In a pragmatic setting, with publication pressure and deadlines, one may face the need to make a trade-off between the ideals of reproducibility and the need to get the research out while it is still relevant” (self-citation)

Do I live by my rules? (today’s last confession)

• “... However, frequently one will, with the wisdom of hindsight, contemplate the missed opportunity to ensure reproducibility, as it may already be too late to take the necessary notes from memory (or at least much more difficult than to do it while underway).”

• Yes, I just did!

Do I live by my rules? (today’s last confession)

• Referee comment (in my defense: paper initiated in 2011)

• “For section 4-5 case studies, it would be worth describing more details for the testing setting, like number of monte carlo samplings, so others can replicate the results”

• Three lines for him, several work days for me!

• For each figure/table, trying to find which data set and code were behind it, in an unstructured collection of analyses in a dozen versions, from different revision rounds, among a myriad blind alleys

• I wasn’t even able to reproduce the results exactly - had to update figures/tables in accordance with new (reproducible) analyses

Do I live by my rules? (today’s last confession)

• A happy ending!

• (well, these journals take half a year for review, so I don’t yet know if the paper will be accepted..)

• But at least: the results are highly reproducible

So, are there really exactly 10 rules for reproducibility?

• No - it’s just because we wanted to publish in the “Ten Simple Rules” collection of PLoS Computational Biology..

• Go out and find new ones!

• (and don’t hesitate to contact me if you have any comments or find something new)

Conclusion

• Reproducibility is important - for you and for the field

• It is frustratingly difficult to get it right

• It is possible to get it right

• Some is better than none