
SMARTSkills Workshop

Reproducible Analyses and Data Management

Cóilín Minto

Marine and Freshwater Research Centre

Galway-Mayo Institute of Technology

October 24th, 2013


Outline

1 Reproducible Analyses

2 Data management

3 Example database project

4 Summary

1 Reproducible Analyses


Research is reproducible if it can be reproduced by others

Baggerly and Berry (2011)


Image source: http://www.therooms.ca


Emigration to Newfoundland


Emigration to Newfoundland

[Figure: Trinity Bay. Two panels relating population growth rate (1/year) to catch per man (quintals/man) and to catch per boat (quintals/boat); points are labelled by year, 1720 to 1827]

Myers (2001)


Emigration to Newfoundland

Analysis was:

• Conducted in 2000

• Run on a Sun server with documented folders (READMEs) containing:

    • Data

    • Text

    • Analysis code (S-Plus)

• Archived


Emigration to Newfoundland

Year 2009: Contacted by a Norwegian researcher wishing to re-run the analysis, but the sole author (RAM) had very unfortunately passed away in 2007.

In many cases, this would signal the end of the line: we would go back to collating the data all over again, or forget about it. But in this case, three steps:

$ ssh server

$ cd relevant_folder

$ make

recovered the complete analysis, including figure and table preparation, and dynamically linked it to a fresh write-up
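A single `make` can rebuild everything because a Makefile records which outputs depend on which inputs and reruns only the steps whose inputs are newer than their outputs. The sketch below illustrates that staleness rule only. It is written in Python for illustration (the original analysis used make and S-Plus), and the file names are hypothetical.

```python
import os
import tempfile
from pathlib import Path
from typing import Iterable


def needs_rebuild(target: Path, dependencies: Iterable[Path]) -> bool:
    """Return True if target is missing or older than any of its dependencies."""
    if not target.exists():
        return True
    built_at = target.stat().st_mtime
    return any(dep.stat().st_mtime > built_at for dep in dependencies)


# Demonstration with hypothetical file names in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp, "catch.csv")
    figure = Path(tmp, "catch_per_man.pdf")
    data.write_text("year,catch\n1720,40\n")

    # The figure has not been built yet, so it must be (re)built.
    missing_target = needs_rebuild(figure, [data])

    figure.write_text("placeholder figure")
    os.utime(data, (1_000_000_000, 1_000_000_000))    # data last modified first
    os.utime(figure, (1_000_000_100, 1_000_000_100))  # figure built afterwards
    up_to_date = not needs_rebuild(figure, [data])

    # Touching the data after the figure was built makes the figure stale.
    os.utime(data, (1_000_000_200, 1_000_000_200))
    stale_after_edit = needs_rebuild(figure, [data])
```

In a real project this bookkeeping is left to make itself; the point is that the dependency chain from raw data to write-up is written down rather than remembered.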


Getting the structure right

analyses
    project name
        doc
            figures
            ms
            tables
        data
        R
            functions
            scripts

Fastidious data management is paramount for reproducibility
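The folder layout on this slide can be created in one step so that every project starts from the same skeleton. The sketch below assumes one reading of the diagram (a project folder under `analyses`, holding `doc` with `figures`, `ms` and `tables`; `data`; and `R` with `functions` and `scripts`). The function name and the project name are hypothetical.

```python
import tempfile
from pathlib import Path

# Sub-folders of one project, as read from the slide's diagram
# (an assumption about how the tree nests).
PROJECT_SUBDIRS = [
    "doc/figures",
    "doc/ms",
    "doc/tables",
    "data",
    "R/functions",
    "R/scripts",
]


def create_project(root: Path, project_name: str) -> Path:
    """Create analyses/<project_name>/... under root and return the project path."""
    project = root / "analyses" / project_name
    for subdir in PROJECT_SUBDIRS:
        # parents=True builds intermediate folders; exist_ok=True makes reruns safe.
        (project / subdir).mkdir(parents=True, exist_ok=True)
    return project


# Demonstration in a throwaway directory; "newfoundland" is a hypothetical name.
with tempfile.TemporaryDirectory() as tmp:
    project_dir = create_project(Path(tmp), "newfoundland")
    created = sorted(
        p.relative_to(project_dir).as_posix()
        for p in project_dir.rglob("*")
        if p.is_dir()
    )
```

Keeping raw data, code and documents in separate, predictable folders is what lets a later reader find each piece without the original author.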


Image source: moods of norway

2 Data management


What’s data?

Ultimately, a stored array of electrical charges, but I like to think of data as the map and mode of transport that gets you from the start of a research project or program to the final product

Image source: http://www.deviantart.com


What’s data?

It’s not just a spreadsheet!


Data encompasses

• Metadata on what the work was about (who, what, where, when and why?)

• Records: measurements, dates, treatments, etc.

• Code: data extraction and analysis

• Results (value-added collections of records): figures, tables, calculations

• Reporting: documents, mark-up


Losing our way

In science we often lose our map and mode of transport via:

• Damage to files or storage device: "Error: cannot open ..."

• Purported storage device ageing or becoming redundant: “That was three laptops ago”

• Software change: house of punch-cards

• Personnel change: “They left with the laptop”

• Bounce to the next project


Why do some scientists treat data poorly?

Among other reasons:

• Incentive potentially lacking in highly competitive publishing arena

• Focus on the publication as self-contained product of the business

• Data husbandry viewed as diminishing returns

• Shoulders of giants mis-interpreted

• Illusion of ownership


Why these reasons don’t cut the mustard now

Among other reasons:

• Large collaborative initiatives consisting of many sub-projects necessitate data management

• Journal publishing ethics changing and valuing data husbandry, e.g.,

    • Debes PV, Fraser DJ, McBride MC, Hutchings JA (2013) Multigenerational hybridisation and its consequences for maternal effects in Atlantic salmon. Heredity 111: 238-247. doi:10.1038/hdy.2013.43

    • Debes PV, McBride MC, Fraser DJ, Hutchings JA (2013) Data from: Multigenerational hybridisation and its consequences for maternal effects in Atlantic salmon. Dryad Digital Repository. doi:10.5061/dryad.9cs2v

• Granting bodies requesting data management planning


Data management

ONCE COLLECTED AND ELECTRONICALLY ENTERED, DON’T TOUCH THE DATA!

Tempting as it might be to fire up a spreadsheet and start creating worksheets and pasting specially, this will only lead to data woes.

To avoid wondering whether data new.xls or data updated.xls is the relevant copy, leave the data in the data folder or repository alone.
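One way to make "don't touch" enforceable is to mark the raw file read-only and record a checksum, so that any later change is either blocked or detectable. This is a minimal sketch under that assumption, not a procedure from the talk; the file name and the helper names `sha256_of` and `freeze` are hypothetical.

```python
import hashlib
import stat
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def freeze(path: Path) -> str:
    """Make a raw data file read-only and return its checksum for later checks."""
    path.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)  # mode 0o444
    return sha256_of(path)


with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp, "catch_raw.csv")  # hypothetical raw data file
    raw.write_text("year,catch\n1720,40\n")

    recorded = freeze(raw)

    # Later, before any analysis run: confirm the file is byte-for-byte unchanged.
    still_intact = sha256_of(raw) == recorded
    # No write permission bits remain set on the file.
    read_only = (stat.S_IMODE(raw.stat().st_mode) & 0o222) == 0

    # Restore write permission so the temporary directory can be cleaned up
    # on every platform.
    raw.chmod(stat.S_IRUSR | stat.S_IWUSR)
```

Storing the recorded checksum alongside the README gives a later reader a quick test of whether the archived copy is the one the analysis used.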


Data management: Spreadsheet Tales

“In the process of copying, the scribes made (deliberately or otherwise) changes, which were themselves copied.”

Barbrook et al. (1998). Nature (394) p.839.


Data management: solution

All data manipulations should be done programmatically

• Read raw data in analytical software

• Subset, remove, adjust via code

• Leaves a reproducible trail and

• Leaves the original (hard-won) data intact

• Pipe results dynamically into your document (e.g., Sweave, knitr)
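The steps in this list can be sketched as follows: read the raw records, subset and adjust them in code, and keep the result separate from the raw input. The talk's own tools are R, Sweave and knitr; Python is used here purely for illustration. The column names and all values are invented for the example and are not taken from any dataset in the talk.

```python
import csv
import io

# Invented raw records (hypothetical columns and values), standing in for a
# file in the data folder that is only ever read, never edited.
RAW_CSV = """year,area,catch_quintals,men
1720,Bay A,4000,100
1723,Bay A,,120
1724,Bay B,3000,60
1725,Bay A,5400,90
"""


def read_records(text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(records, area):
    """Subset to one area, drop rows with missing catch, derive catch per man."""
    cleaned = []
    for row in records:
        if row["area"] != area or row["catch_quintals"] == "":
            continue
        cleaned.append(
            {
                "year": int(row["year"]),
                "catch_per_man": float(row["catch_quintals"]) / float(row["men"]),
            }
        )
    return cleaned


raw_records = read_records(RAW_CSV)
bay_a = clean(raw_records, "Bay A")
```

Every exclusion (here, the 1723 row with a missing catch) is visible in the code, and the raw records are left exactly as they were read.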


A contention

My over-arching contention with the status quo is that an individual’s laptop or PC is not an acceptable research environment, as it:

• Risks complete data loss

• Fosters the “Chaucer” effect (more later)

• Is anti-collaborative

• Is license hungry and therefore costly

• Is less powerful, slower


A back-to-the-future solution

Need to return to a common research environment - the server

Image source: http://my.opera.com


A back-to-the-future solution

Inasmuch as we have the focal point of the wet-lab to process specimen samples, we should have a central place for data storage and processing, as it:

• Keeps single copies of data centrally

• Has a longer life than the project

• Has a longer life than the researchers (??)

• Gives everyone equal access to high-performance architecture (no need for a new laptop, just use laptop for )

• Is managed centrally


A back-to-the-future solution

Many institutes have servers, but these are rarely used as a common research environment outside of the physical sciences

But the coming of age of high-performance computing now necessitates that we make the move back


Example: data-poor stock status

• FAO and Conservation International project to globally assess status of “data-poor” stocks

• Two research teams from 8 different countries

• Had to work in a central environment: hexagon.bccs.uib.no in Bergen, Norway.


Example: data-poor stock status

• 576 (scenarios) x 10 (iterations) x 4 (methods) = 23,040 stock assessments

• For the agreed convergence level (MCMC, SIR), this requires 19.5 CPU years on a single processor

• Completed work in a walltime of 7.5 days on the Hexagon cluster
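The figures on this slide can be checked with a few lines of arithmetic, which also show roughly how much parallelism the 7.5-day walltime implies. The slide does not state how many processors were used; the ratio below is an implied average that assumes perfect scaling and 365.25 days per year.

```python
# Number of stock assessments, from the factors on the slide.
scenarios, iterations, methods = 576, 10, 4
total_assessments = scenarios * iterations * methods

# Compute cost and elapsed time, from the slide.
cpu_years = 19.5
wall_days = 7.5

DAYS_PER_YEAR = 365.25  # assumption; the slide does not define a CPU year
cpu_days = cpu_years * DAYS_PER_YEAR

# Average number of processors busy at once, if scaling were perfect.
implied_parallelism = cpu_days / wall_days

# Average single-processor cost of one assessment, in hours.
cpu_hours_per_assessment = cpu_days * 24 / total_assessments

print(total_assessments)                    # 23040
print(round(implied_parallelism))           # about 950 processors
print(round(cpu_hours_per_assessment, 1))   # about 7.4 hours each
```

On a single laptop core the same work would have taken close to two decades, which is the practical argument for the shared server.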

3 Example database project


Original database

Ransom Myers’ Stock Recruitment Database

• Approximately 640 stocks.

• Used in many publications on fish population dynamics, e.g.

    • Relationship between recruitment and spawning stock size

    • Density dependence

    • Depensation (Allee effects)

    • Productivity rates across taxa

    • Patterns of depletion and recovery

• Housed in flat text files

• Archived version (no longer updated) available from: http://www.mscs.dal.ca/~myers/welcome.html


Why an updated database?

• Many stocks 15 years out of date

• New data often at:

    • Low population levels

    • Reduced fishing intensities

[Figure: biomass (1000 tonnes) by year, 1985 to 2005, for stock COD2J3KL]

• Interest in:

    • Effects of exploitation on trends in abundance across taxa from many ecosystems

    • Efficacy of harvest policies

    • Recovery trajectories post fishing mortality reductions

• Relational database to support reproducible analyses
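To illustrate what a relational layout adds over flat text files, the sketch below links stocks, assessments and time series through keys, so that one query can assemble an analysis-ready table and the database itself rejects orphaned records. The table names, column names and values are all hypothetical; this is not the schema of the database described in the talk.

```python
import sqlite3

# Hypothetical three-table schema: one stock has many assessments,
# and one assessment has many yearly values.
SCHEMA = """
CREATE TABLE stock (
    stock_id        TEXT PRIMARY KEY,
    scientific_name TEXT NOT NULL
);
CREATE TABLE assessment (
    assessment_id   INTEGER PRIMARY KEY,
    stock_id        TEXT NOT NULL REFERENCES stock(stock_id),
    assessment_year INTEGER NOT NULL
);
CREATE TABLE timeseries (
    assessment_id   INTEGER NOT NULL REFERENCES assessment(assessment_id),
    year            INTEGER NOT NULL,
    biomass_kt      REAL,
    PRIMARY KEY (assessment_id, year)
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript(SCHEMA)

# Invented example rows.
conn.execute("INSERT INTO stock VALUES ('DEMO1', 'Example species')")
conn.execute("INSERT INTO assessment VALUES (1, 'DEMO1', 2010)")
conn.executemany(
    "INSERT INTO timeseries VALUES (1, ?, ?)",
    [(1985, 1000.0), (1995, 50.0)],
)

# One query joins the three tables into an analysis-ready result.
rows = conn.execute(
    """
    SELECT s.scientific_name, t.year, t.biomass_kt
    FROM timeseries AS t
    JOIN assessment AS a ON a.assessment_id = t.assessment_id
    JOIN stock AS s ON s.stock_id = a.stock_id
    ORDER BY t.year
    """
).fetchall()

# A time-series row pointing at a non-existent assessment is refused.
try:
    conn.execute("INSERT INTO timeseries VALUES (99, 2000, 1.0)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

conn.close()
```

Because the extraction query is code, the exact data behind a figure can be regenerated on demand rather than copied between files.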


Geographic coverage

[Map legend classes: 1-4, 5-9, 10-19, 20-29, 30+]


Temporal coverage: orca plots

[Figure, panels A to C: assessment count by year (1850 to 2000), each panel with an inset histogram of span (years) against frequency]


Taxonomic coverage

[Figure: circular taxonomic tree of the species covered, rooted at Animalia, with labelled groups Arthropoda, Mollusca, Chondrichthyes, Perciformes, Scorpaeniformes, Clupeiformes, Pleuronectiformes and Gadiformes]

356 assessments

Order               N    %
Gadiformes          71   20
Perciformes         66   19
Pleuronectiformes   57   16
Scorpaeniformes     45   13
Clupeiformes        36   10
Invertebrates       42   12


Used in 27 publications since inception in 2009

http://depts.washington.edu/ramlegac/


Summary

• Reproducibility is a central component of science

• To date, our general approach to data has been poor, bordering on careless

• Scale of the problems and collaborations now necessitates change for the better

• A laptop/desktop is not a research environment

• Data management is increasingly recognized

• Putting in the spade work of data management can reap good rewards


Acknowledgements

Paulha McGrane, John Boyd and the organizing committee

Julia Baum

Deirdre Brophy

Olaf Jensen

Ray Hilborn

Rick Officer

Daniel Ricard

Conservation International

University of Washington

FAO

Marine and Freshwater Research Centre, GMIT, Galway

Dalhousie University, Nova Scotia


Baggerly, KA and Berry, DA (2011). Reproducible Research. Amstat News, January 2011.

Myers, RA (2001). Testing ecological models: the influence of catch rates on settlement of fishermen in Newfoundland, 1710-1833. Research in Maritime History, 21, 13-29.