Reproducibility in Scientific Data Analysis - BioScience Seminar

Reproducibility in Scientific Data Analysis. Samuel Lampa (@smllmp), PhD Student, Pharmaceutical Bioinformatics at pharmb.io, with Assoc. Prof. Ola Spjuth (@ola_spjuth), Dept. of Pharm. Biosci. / Uppsala University. Farmbio BioScience Seminar, Dec 16 2016.

Upload: samuel-lampa

Post on 15-Apr-2017

Category: Science


TRANSCRIPT

Page 1: Reproducibility in Scientific Data Analysis - BioScience Seminar

Reproducibility in Scientific Data Analysis

Samuel Lampa @smllmp

PhD Student
Pharmaceutical Bioinformatics at pharmb.io

with Assoc. Prof. Ola Spjuth @ola_spjuth
@ Dept. of Pharm. Biosci. / Uppsala University

Farmbio BioScience Seminar – Dec 16 2016

Page 2: Reproducibility in Scientific Data Analysis - BioScience Seminar
Page 3: Reproducibility in Scientific Data Analysis - BioScience Seminar

Structure of this talk

Reproducibility in Scientific Data Analysis …

● What is it?
● Why is it important?
● Why is it a problem?
● What can we do about it?
● What does pharmb.io do about it?

Page 4: Reproducibility in Scientific Data Analysis - BioScience Seminar

What is it?

“it” = reproducibility in scientific data analysis

Page 5: Reproducibility in Scientific Data Analysis - BioScience Seminar

reproducible ≠ replicable

Page 6: Reproducibility in Scientific Data Analysis - BioScience Seminar

reproducible ≠ correct

Page 7: Reproducibility in Scientific Data Analysis - BioScience Seminar

Why is it important?

“it” = reproducibility in scientific data analysis

Page 8: Reproducibility in Scientific Data Analysis - BioScience Seminar

Why is it important?

● More and more data generation is automated → more and more focus on data analysis

● Culture of replicability not (yet) as established in computational as in classical disciplines

● “it is the only thing that an investigator can guarantee about a study”
  simplystatistics.org/2014/06/06/the-real-reason-reproducible-research-is-important

Page 9: Reproducibility in Scientific Data Analysis - BioScience Seminar

Why is it a problem?

“it” = reproducibility in scientific data analysis

Page 10: Reproducibility in Scientific Data Analysis - BioScience Seminar

wet lab data analysis?

Page 11: Reproducibility in Scientific Data Analysis - BioScience Seminar

Why is it a problem?

● Complexity of computing environment
  – Software versions, data versions ...

● More black box components

● Assumptions on computing environment often left out

● Manual steps often left out
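
One small way to stop leaving out assumptions about the computing environment is to write them down automatically at the start of every analysis. The sketch below is only an illustration, not something shown in the talk: the function name `describe_environment` and the example package names are assumptions, and it uses only the Python standard library.

```python
import json
import platform
from importlib import metadata


def describe_environment(packages):
    """Return a dict describing the interpreter, OS, and package versions.

    `packages` is a list of distribution names the analysis depends on
    (hypothetical examples are used below).
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed; record that explicitly
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "os": platform.platform(),
        "packages": versions,
    }


if __name__ == "__main__":
    # Example package names only; list whatever the analysis really imports.
    print(json.dumps(describe_environment(["numpy", "pandas"]), indent=2))
```

Saving this output next to the results makes the software versions part of the record instead of something to be remembered later.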

Page 12: Reproducibility in Scientific Data Analysis - BioScience Seminar

What can we do about it?

“it” = reproducibility in scientific data analysis

Page 13: Reproducibility in Scientific Data Analysis - BioScience Seminar

What can we do about it?

Utopia: Infrastructure for all data and computations to be inspected and re-run with other data and parameters by anyone

But: We can’t wait for that

In the meantime: Even small steps towards reproducibility will help. Start today!

Page 14: Reproducibility in Scientific Data Analysis - BioScience Seminar

General themes

Know exactly what data and results mean

Know exactly how results were obtained

Be able to get same result independently

Page 15: Reproducibility in Scientific Data Analysis - BioScience Seminar

More concretely ...

Know exactly what data and results mean
– Open standards, ontologies, data formats

Know exactly how results were obtained
– Keeping track of manual steps, parameters, versions of software and data ...
– Version control
– Automation (scripts)

Be able to get same result independently
– Code, data, and scripts … make it all available!
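
As a hedged illustration of "know exactly how results were obtained", the sketch below writes a small provenance file next to a result, recording the parameters, a checksum of each input file, and a code version string. The names `sha256_of` and `write_run_record`, the `.provenance.json` suffix, and the record layout are all assumptions made for this example, not a format used in the talk.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path):
    """Checksum a file in chunks so large data files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_run_record(result_path, input_paths, parameters, code_version):
    """Write <result>.provenance.json describing how the result was made."""
    record = {
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "result": {"path": str(result_path), "sha256": sha256_of(result_path)},
        "inputs": [{"path": str(p), "sha256": sha256_of(p)} for p in input_paths],
        "parameters": parameters,
        "code_version": code_version,
    }
    record_path = Path(str(result_path) + ".provenance.json")
    record_path.write_text(json.dumps(record, indent=2, sort_keys=True))
    return record_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "measurements.csv"
        result = Path(tmp) / "summary.txt"
        data.write_text("sample,value\na,1.5\nb,2.5\n")
        result.write_text("mean=2.0\n")
        # "unknown" is a placeholder; a commit hash would normally go here.
        record = write_run_record(result, [data], {"method": "mean"}, "unknown")
        print(record.read_text())
```

If the analysis code is under version control with Git, the output of `git rev-parse HEAD` is one reasonable value to pass as `code_version`.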

Page 16: Reproducibility in Scientific Data Analysis - BioScience Seminar

Sandve GK, Nekrutenko A, Taylor J, Hovig E. Ten Simple Rules for Reproducible Computational Research. PLoS Comput Biol. 2013;9(10):1-4. dx.doi.org/10.1371/journal.pcbi.1003285

Page 17: Reproducibility in Scientific Data Analysis - BioScience Seminar

FAIR Principles for data and metadata

F - Findable

A - Accessible

I - Interoperable

R - Reusable

Wilkinson MD, Dumontier M, Aalbersberg IjJ, et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. 2016;3:160018. doi:10.1038/sdata.2016.18.

Page 18: Reproducibility in Scientific Data Analysis - BioScience Seminar

What does pharmb.io do about it?

“it” = reproducibility in scientific data analysis

Page 19: Reproducibility in Scientific Data Analysis - BioScience Seminar

What does pharmb.io do about it?

● Open data, open source, open standards
  Promoting and using as much as possible

● BioImg.org
  Store Virtual Machines & Containers

● Semantic Data Technologies
  Machine readability - Avoiding ambiguity

● Re-runnable computational experiments
  Via workflows, containers, infrastructure as code

Page 20: Reproducibility in Scientific Data Analysis - BioScience Seminar

O’Boyle NM, Guha R, Willighagen EL, et al. Open Data, Open Source and Open Standards in chemistry: The Blue Obelisk five years on. J Cheminform. 2011;3:37. doi:10.1186/1758-2946-3-37

Page 21: Reproducibility in Scientific Data Analysis - BioScience Seminar

BioImg.org

Dahlö M, Haziza F, Kallio A, Korpelainen E, Bongcam-Rudloff E, Spjuth O. BioImg.org: A catalog of virtual machine images for the life sciences. Bioinform Biol Insights. 2015;9(Vmi):125-128. doi:10.4137/BBI.S28636.

Martin Dahlö

Page 22: Reproducibility in Scientific Data Analysis - BioScience Seminar

Semantic Data Technologies

Lampa S, Willighagen E, Kohonen P, King A, Vrandečić D, Grafström R, Spjuth O. RDFIO: Extending Semantic MediaWiki for interoperable biomedical data management. J Biomed Sem. Submitted.

Page 23: Reproducibility in Scientific Data Analysis - BioScience Seminar

Re-runnable experiments via containers

(and infrastructure as code)

Marco Capuccini

github.com/kubenow/KubeNow
github.com/mcapuccini/SparkNow

Page 24: Reproducibility in Scientific Data Analysis - BioScience Seminar

Re-runnable experiments via workflows
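
To make the idea of a re-runnable workflow concrete, here is a deliberately tiny sketch: each step declares its input files and its output file, and a step is skipped when its output already exists. This is not the workflow tooling used by pharmb.io or described in the cited paper; the `Task` class, `run_workflow`, and the two example actions are hypothetical names invented for this illustration.

```python
import tempfile
from pathlib import Path


class Task:
    """One workflow step: named input files, one output file, one action."""

    def __init__(self, name, inputs, output, action):
        self.name = name
        self.inputs = [Path(p) for p in inputs]
        self.output = Path(output)
        self.action = action  # callable(inputs, output) that writes output


def run_workflow(tasks):
    """Run tasks in the given order; skip any whose output already exists.

    Returns the names of the tasks that were actually executed.
    """
    executed = []
    for task in tasks:
        missing = [p for p in task.inputs if not p.exists()]
        if missing:
            raise FileNotFoundError(f"{task.name}: missing inputs {missing}")
        if task.output.exists():
            continue  # already produced by an earlier run
        task.action(task.inputs, task.output)
        executed.append(task.name)
    return executed


def to_upper(inputs, output):
    """Example action: upper-case the text of the first input file."""
    output.write_text(inputs[0].read_text().upper())


def count_lines(inputs, output):
    """Example action: write the number of lines in the first input file."""
    output.write_text(str(len(inputs[0].read_text().splitlines())))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        raw = Path(tmp) / "raw.txt"
        raw.write_text("a\nb\n")
        tasks = [
            Task("upper", [raw], Path(tmp) / "upper.txt", to_upper),
            Task("count", [Path(tmp) / "upper.txt"], Path(tmp) / "count.txt", count_lines),
        ]
        print(run_workflow(tasks))  # first run executes both steps
        print(run_workflow(tasks))  # second run finds the outputs and skips them
```

Because every step is written down as code with explicit inputs and outputs, the whole analysis can be re-run from the raw data by deleting the intermediate files and calling `run_workflow` again.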

Page 25: Reproducibility in Scientific Data Analysis - BioScience Seminar

Lampa S, Alvarsson J, Spjuth O. Towards agile large-scale predictive modelling in drug discovery with flow-based programming design principles. J Cheminform. 2016;8(1):67. doi:10.1186/s13321-016-0179-6.

Page 27: Reproducibility in Scientific Data Analysis - BioScience Seminar

Thank you

pharmb.io