Post on 15-Apr-2017

Reproducibility in Scientific Data Analysis

Samuel Lampa @smllmp

PhD Student, Pharmaceutical Bioinformatics at pharmb.io

with Assoc. Prof. Ola Spjuth @ola_spjuth
@ Dept. of Pharm. Biosci. / Uppsala University

Farmbio BioScience Seminar – Dec 16 2016

Structure of this talk

Reproducibility in Scientific Data Analysis …

● What is it?
● Why is it important?
● Why is it a problem?
● What can we do about it?
● What does pharmb.io do about it?

What is it?

“it” = reproducibility in scientific data analysis

reproducible ≠ replicable

reproducible ≠ correct

Why is it important?

● More and more data generation is automated → more and more focus on data analysis

● Culture of replicability not (yet) as established in computational as in classical disciplines

● “it is the only thing that an investigator can guarantee about a study”
  simplystatistics.org/2014/06/06/the-real-reason-reproducible-research-is-important

Why is it a problem?

wet lab data analysis?

● Complexity of computing environment
  – Software versions, data versions ...
● More black box components
● Assumptions on computing environment often left out
● Manual steps often left out
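
One small way to keep assumptions about the computing environment from being left out is to record them next to the results. The following is a minimal sketch, assuming a Python-based analysis; the function name `describe_environment`, the chosen fields, and the package names are illustrative assumptions and not part of the talk.

```python
import json
import platform
import sys
from importlib import metadata


def describe_environment(packages):
    """Return a dict describing the interpreter, OS, and package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed; record that explicitly
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }


if __name__ == "__main__":
    # "numpy" and "pandas" are placeholder package names for illustration.
    print(json.dumps(describe_environment(["numpy", "pandas"]), indent=2))
```

Saving this output together with each result makes the software versions part of the record rather than something to be reconstructed from memory later.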

What can we do about it?

Utopia: Infrastructure for all data and computations to be inspected and re-run with other data and parameters by anyone

But: We can’t wait for that

In the meantime: Even small steps towards reproducibility will help. Start today!

General themes

Know exactly what data and results mean

Know exactly how results were obtained

Be able to get same result independently

More concretely ...

Know exactly what data and results mean
– Open standards, ontologies, data formats

Know exactly how results were obtained
– Keeping track of manual steps, parameters, versions of software and data ...
– Version control
– Automation (scripts)

Be able to get same result independently
– Code, data, and scripts … make it all available!

Sandve GK, Nekrutenko A, Taylor J, Hovig E. Ten Simple Rules for Reproducible Computational Research. PLoS Comput Biol. 2013;9(10):1-4. dx.doi.org/10.1371/journal.pcbi.1003285
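
Keeping track of parameters and of data and code versions can start with a small record written next to every result. The sketch below is one possible approach in Python, not the method used in the talk; the function names `sha256_of_file` and `write_run_record`, the JSON layout, and the idea of passing a version-control commit id as `code_version` are assumptions made for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_of_file(path, chunk_size=1 << 20):
    """Checksum a file in chunks so large data files do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_run_record(record_path, input_paths, parameters, code_version):
    """Write a JSON record of which data, parameters, and code produced a result."""
    record = {
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "code_version": code_version,  # e.g. a version-control commit id
        "parameters": parameters,
        # A checksum pins down the exact version of each input file.
        "inputs": {path: sha256_of_file(path) for path in input_paths},
    }
    with open(record_path, "w", encoding="utf-8") as handle:
        json.dump(record, handle, indent=2, sort_keys=True)
    return record
```

Calling `write_run_record` at the end of an analysis script leaves a file that states which inputs, parameters, and code version belong to a given result, which can then be shared along with the code and data.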

FAIR Principles for data and metadata

F - Findable

A - Accessible

I - Interoperable

R - Reusable

Wilkinson MD, Dumontier M, Aalbersberg IJJ, et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. 2016;3:160018. doi:10.1038/sdata.2016.18.

What does pharmb.io do about it?

● Open data, open source, open standards
  Promoting and using as much as possible

● BioImg.org
  Store Virtual Machines & Containers

● Semantic Data Technologies
  Machine readability - Avoiding ambiguity

● Re-runnable computational experiments
  Via workflows, containers, infrastructure as code

O’Boyle NM, Guha R, Willighagen EL, et al. Open Data, Open Source and Open Standards in chemistry: The Blue Obelisk five years on. J Cheminform. 2011;3(10):1-16. doi:10.1186/1758-2946-3-37

BioImg.org

Dahlö M, Haziza F, Kallio A, Korpelainen E, Bongcam-Rudloff E, Spjuth O. BioImg.org: A catalog of virtual machine images for the life sciences. Bioinform Biol Insights. 2015;9(Vmi):125-128. doi:10.4137/BBI.S28636.

Martin Dahlö

Semantic Data Technologies

Lampa S, Willighagen E, Kohonen P, King A, Vrandečić D, Grafström R, Spjuth O. RDFIO: Extending Semantic MediaWiki for interoperable biomedical data management. J Biomed Sem. Submitted.

Re-runnable experiments via containers (and infrastructure as code)

Marco Capuccini

github.com/kubenow/KubeNow
github.com/mcapuccini/SparkNow

Re-runnable experiments via workflows

Lampa S, Alvarsson J, Spjuth O. Towards agile large-scale predictive modelling in drug discovery with flow-based programming design principles. J Cheminform. 2016;8(1):67. doi:10.1186/s13321-016-0179-6.

Thank you

pharmb.io
