data matters for agu early career conference

Data Matters Carly Strasser, California Digital Library [email protected] AGU Student & Early Career Scientist Conference 14 Dec 2014 Tips & Tools for Better Research From Flickr by Lachlan Donald

Upload: carly-strasser

Post on 13-Jul-2015


TRANSCRIPT

Page 1: Data Matters for AGU Early Career Conference

Data Matters

Carly Strasser, California Digital Library [email protected]

AGU Student & Early Career Scientist Conference 14 Dec 2014

Tips & Tools for Better Research

From Flickr by Lachlan Donald

Page 2: Data Matters for AGU Early Career Conference

Why are you here?

Science: you’re (probably) doing it wrong

Page 3: Data Matters for AGU Early Career Conference

From Wikimedia Commons

Back in the day…

From ahswhg.wikispaces.com

Page 4: Data Matters for AGU Early Career Conference

Back in the day…

Da Vinci

Curie

Newton

classicalschool.blogspot.com

Darwin

Page 5: Data Matters for AGU Early Career Conference

Research has changed

Better

Page 6: Data Matters for AGU Early Career Conference

From wikimedia

Such Internet!

So many tools!

From Flickr by John Jobby

So much data!

Page 7: Data Matters for AGU Early Career Conference

Research has changed

Worse

Page 8: Data Matters for AGU Early Career Conference

Digital data

From Flickr by Flickmor

From Flickr by US Army Environmental Command

From Flickr by DW0825

C. Strasser

Courtesy of WHOI

From Flickr by deltaMike

Page 9: Data Matters for AGU Early Career Conference

Digital data +

Complex workflows

Page 10: Data Matters for AGU Early Career Conference

i.telegraph.co.uk

Page 11: Data Matters for AGU Early Career Conference

Scientists are bad at data management.

Page 12: Data Matters for AGU Early Career Conference

An embarrassing example…

From Flickr by lincolnblues

Page 13: Data Matters for AGU Early Career Conference
Page 14: Data Matters for AGU Early Career Conference
Page 15: Data Matters for AGU Early Career Conference

?

Page 16: Data Matters for AGU Early Career Conference

From Flickr by ransomtech

•  Didn’t share the data
•  Didn’t document the data (metadata)
•  Didn’t document provenance/workflow

Page 17: Data Matters for AGU Early Career Conference

From Flickr by johntrainor

Why should I care?

Page 18: Data Matters for AGU Early Career Conference

Because reproducibility is one of the fundamental tenets of science.

Because we need to be credible.

Page 19: Data Matters for AGU Early Career Conference
Page 20: Data Matters for AGU Early Career Conference

Because reproducibility is one of the fundamental tenets of science.

Because we need to be credible.

Because Fox News, creationism, and the war on science.

Page 21: Data Matters for AGU Early Career Conference

“Help us identify grants that are wasteful or that you don’t think are a good use of taxpayer dollars.” Rep. Adrian Smith (R-Nebraska), a member of the House Committee on Science and Technology

Page 22: Data Matters for AGU Early Career Conference

Because reproducibility is one of the fundamental tenets of science.

Because we need to be credible.

Because Fox News, creationism, and the war on science

Because it means faster progress.

Page 23: Data Matters for AGU Early Career Conference
Page 24: Data Matters for AGU Early Career Conference

Because you are a good person.

Page 25: Data Matters for AGU Early Career Conference

From Flickr by Redden-McAllister

From Flickr by Ken Cowell

From Flickr Brandi Jordan

Page 26: Data Matters for AGU Early Career Conference

flowingdata.com

Map of Scientific Collaborations

Page 27: Data Matters for AGU Early Career Conference

Because you have to.

Page 28: Data Matters for AGU Early Career Conference

Journals

Institutions

Funders

From Flickr by Eva Rinaldi Celebrity and Live Music Photographer

Page 29: Data Matters for AGU Early Career Conference
Page 30: Data Matters for AGU Early Career Conference

… “Federal agencies investing in research and development (more than $100 million in annual expenditures) must have clear and coordinated policies for increasing public access to research products.”

Feb 2013

Page 31: Data Matters for AGU Early Career Conference

From Flickr by Michael Tinkler

Page 32: Data Matters for AGU Early Career Conference

data management

From Flickr by Big Swede Guy

Best Practices

Page 33: Data Matters for AGU Early Career Conference

From Flickr by Mark Sardella

Plan before data collection

Page 34: Data Matters for AGU Early Career Conference

•  Create a key (data dictionary)
•  Make sure names are unique
•  Define codes

From Flickr by zebbie

Planning

Design sample naming scheme

Page 35: Data Matters for AGU Early Career Conference

Use descriptive file names
•  Unique
•  Reflect contents

From R Cook, ESA Best Practices Workshop 2010

Bad: Mydata.xls, 2001_data.csv, best version.txt

Better: Eaffinis_nanaimo_2010_counts.xls
•  Study organism: Eaffinis
•  Site name: nanaimo
•  Year: 2010
•  What was measured: counts

*Not for everyone

Planning

Design file naming scheme
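One way to keep names unique and descriptive is to build them from the same parts every time. The sketch below is only an illustration in Python: the function name `build_file_name` and the rule for stripping awkward characters are assumptions, not something from the slide. It reproduces the slide's "Better" example from its four components.

```python
import re


def build_file_name(organism, site, year, measurement, extension="csv"):
    """Join descriptive parts into one file name (hypothetical helper)."""
    parts = [organism, site, str(year), measurement]
    # Drop spaces and punctuation so the name is safe on any operating system.
    cleaned = [re.sub(r"[^A-Za-z0-9]+", "", part) for part in parts]
    if not all(cleaned):
        raise ValueError("every part of the file name must be non-empty")
    return "_".join(cleaned) + "." + extension


name = build_file_name("Eaffinis", "nanaimo", 2010, "counts", extension="xls")
print(name)  # Eaffinis_nanaimo_2010_counts.xls
```

Generating names from a function rather than typing them makes the scheme hard to break by accident.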

Page 36: Data Matters for AGU Early Career Conference

Biodiversity

Lake

Experiments

Field work

Grassland

Biodiv_H20_heatExp_2005to2008.csv
Biodiv_H20_predatorExp_2001to2003.csv
…
Biodiv_H20_PlanktonCount_2001toActive.csv
Biodiv_H20_ChlAprofiles_2003.csv
…

From S. Hampton

Planning

Design file organization

Consider…
•  Dependencies?
•  File formats?
•  Time of collection?
•  Order of analysis?

Page 37: Data Matters for AGU Early Career Conference

Planning

•  Constrain entries
•  Atomize
•  Break down spreadsheets

Design your spreadsheet

From Flickr by Ulleskelf

Page 38: Data Matters for AGU Early Career Conference

A relational database is
•  A set of tables
•  Relationships among the tables
•  A language to specify & query the tables

An RDB provides
•  Scalability: millions+ records
•  Features for sub-setting, querying, sorting
•  Reduced redundancy & entry errors

From Mark Schildhauer

Planning

Consider a database
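To make the three ingredients concrete (tables, relationships, a query language), here is a minimal sketch using Python's built-in `sqlite3` module. The table names, column names, site IDs and counts are all invented for illustration; they are not from the talk.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Two tables; count_obs.site_id is the relationship back to site.
conn.executescript("""
CREATE TABLE site (
    site_id TEXT PRIMARY KEY,
    habitat TEXT NOT NULL
);
CREATE TABLE count_obs (
    obs_id    INTEGER PRIMARY KEY,
    site_id   TEXT NOT NULL REFERENCES site(site_id),
    year      INTEGER NOT NULL,
    n_counted INTEGER NOT NULL CHECK (n_counted >= 0)
);
""")

# Invented example rows.
conn.executemany("INSERT INTO site VALUES (?, ?)",
                 [("siteA", "lake"), ("siteB", "grassland")])
conn.executemany(
    "INSERT INTO count_obs (site_id, year, n_counted) VALUES (?, ?, ?)",
    [("siteA", 2009, 12), ("siteA", 2010, 30), ("siteA", 2010, 18),
     ("siteB", 2010, 7)],
)

# SQL is the language used to subset, join and summarize the tables.
rows = conn.execute("""
    SELECT s.habitat, o.year, SUM(o.n_counted)
    FROM count_obs AS o
    JOIN site AS s ON s.site_id = o.site_id
    GROUP BY s.habitat, o.year
    ORDER BY s.habitat, o.year
""").fetchall()
print(rows)
```

Because habitat is stored once per site rather than once per observation, a typo in it can only happen in one place, which is the "reduced redundancy & entry errors" point.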

Page 39: Data Matters for AGU Early Career Conference

Store your data in a repository

Institutional archive

Discipline/specialty archive

Pick a data repository

From Flickr by torkildr

Planning

Page 40: Data Matters for AGU Early Career Conference

Store your data in a repository

Institutional archive

Discipline/specialty archive

Pick a data repository

From Flickr by torkildr

Ask a librarian

Planning

Page 41: Data Matters for AGU Early Career Conference

Store your data in a repository

Institutional archive

Discipline/specialty archive

Pick a data repository

From Flickr by torkildr

Ask a librarian

Repos of repos: databib.org, re3data.org

Planning

Page 42: Data Matters for AGU Early Career Conference

From Flickr by sepa synod

From Flickr by taberandrew

From Flickr by withassociates

Decide on preservation/backup Planning

Page 43: Data Matters for AGU Early Career Conference

From Flickr by sepa synod

From Flickr by taberandrew

From Flickr by withassociates

What software? What hardware? What personnel?

How often? Set up reminders!

Test system

Decide on preservation/backup Planning

Page 44: Data Matters for AGU Early Career Conference

…document that describes what you will do with your data throughout the research project

From Flickr by Barbies Land

Write a data management plan!

Planning

Page 45: Data Matters for AGU Early Career Conference

DMP components

•  What will be collected
•  Methods
•  Standards
•  Metadata
•  Sharing/access
•  Long-term storage

But they all have different requirements and express them in different ways

Planning

From Flickr by Barbies Land

Page 46: Data Matters for AGU Early Career Conference

Step-by-step wizard for generating DMP

create | edit | re-use | share

Free & open to community

dmptool.org

Planning

Page 47: Data Matters for AGU Early Career Conference

During Data Collection & Entry

From Flickr by Julia Manzerova

Page 48: Data Matters for AGU Early Career Conference

Realistically:
•  Archive .csv version of raw data
•  Make a “raw” tab in working data file
•  Do all work on other tabs

During collection

Keep raw data raw

Page 49: Data Matters for AGU Early Career Conference

Raw data as .csv

R script for processing & analysis

During collection

Ideally:
•  Use scripts to process data
•  Save them with data

Keep raw data raw
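The slide names an R script; the same idea is sketched here in Python as an assumption-laden illustration. The file names, the column `temperature_c` and the sample values are invented. The point is that the script only reads the raw .csv and writes its output to a new file, and a checksum confirms the raw file was left untouched.

```python
import csv
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Checksum of a file's bytes, used to show the raw file did not change."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def process(raw_path, out_path):
    """Copy rows with a temperature value into a NEW file; never edit the raw one."""
    kept = 0
    with open(raw_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row["temperature_c"].strip() == "":
                continue  # Skip rows with a missing reading.
            writer.writerow(row)
            kept += 1
    return kept


# Invented raw data, written to a temporary folder for the demonstration.
workdir = Path(tempfile.mkdtemp())
raw = workdir / "raw_temperature.csv"
raw.write_text("sample_id,temperature_c\ns1,11.5\ns2,\ns3,12.1\n")

before = sha256_of(raw)
kept = process(raw, workdir / "processed_temperature.csv")
after = sha256_of(raw)
print(kept, before == after)
```

Saving this script next to the data means the processed file can always be regenerated from the raw one.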

Page 50: Data Matters for AGU Early Career Conference

During collection Document your workflow

Temperature data

Salinity data

Data import into Excel

Analysis: mean, SD

Graph production

Quality control & data cleaning

“Clean” T & S data

Summary statistics

Data in spread-sheet

Workflow: how you get from the raw data to the final products of your research

Simple workflow: flow chart

Page 51: Data Matters for AGU Early Career Conference

During collection

Workflow: how you get from the raw data to the final products of your research

Commented script

•  R, SAS, MATLAB…
•  Well-documented code is
   Easier to review
   Easier to share
   Easier to use for repeat analysis


Document your workflow
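As a rough illustration of a commented script, the sketch below walks through the temperature and salinity workflow from the flow chart (clean, then mean and SD). It is written in Python rather than the R, SAS or MATLAB the slide lists. The readings and the `-999.0` missing-value sentinel are invented assumptions.

```python
import statistics

# Step 1: raw readings (invented values, for illustration only).
raw_temperature = [11.5, 12.0, -999.0, 12.5]
raw_salinity = [30.0, 31.0, 32.0, -999.0]

# Assumed sentinel that an instrument writes when a reading fails.
MISSING = -999.0


def clean(values):
    """Step 2: quality control. Drop sentinel values, keep everything else."""
    return [v for v in values if v != MISSING]


def summarize(values):
    """Step 3: analysis. Mean and sample standard deviation."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
    }


# Step 4: summary statistics for the "clean" T & S data.
summary_t = summarize(clean(raw_temperature))
summary_s = summarize(clean(raw_salinity))
print(summary_t)
print(summary_s)
```

Each comment names the workflow step it belongs to, so the script doubles as the workflow documentation.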

Page 52: Data Matters for AGU Early Career Conference

Constrain data entries
•  Excel lists
•  Data validation
•  Google docs forms

Modified from K. Vanderbilt

During collection
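The same constraint can be applied in a script when data arrive from somewhere that has no drop-down lists. This is a minimal sketch: the allowed species codes, the field names and the function `validate_entry` are hypothetical.

```python
# Hypothetical controlled vocabulary; in practice it would come from the data dictionary.
ALLOWED_SPECIES = {"Eaffinis", "Acartia"}


def validate_entry(entry):
    """Return a list of problems in one data-entry row (empty list means OK)."""
    problems = []
    if entry.get("species") not in ALLOWED_SPECIES:
        problems.append("species must be one of " + ", ".join(sorted(ALLOWED_SPECIES)))
    try:
        count = int(entry.get("count", ""))
    except (TypeError, ValueError):
        problems.append("count must be a whole number")
    else:
        if count < 0:
            problems.append("count must not be negative")
    return problems


good = validate_entry({"species": "Eaffinis", "count": "30"})
bad = validate_entry({"species": "E. affinis?", "count": "thirty"})
print(good)
print(bad)
```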

Page 53: Data Matters for AGU Early Career Conference

Atomize During collection

One piece of information per cell
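For cells that already mix two pieces of information, a small script can split them. The sketch below assumes a cell holding a number followed by a unit, such as "12.5 C"; that pattern and the function name `atomize` are illustrative choices, not from the slide.

```python
import re

# A number, optional whitespace, then a unit made of letters or a percent sign.
CELL_PATTERN = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*([A-Za-z%]+)\s*$")


def atomize(cell):
    """Split a combined 'value unit' cell into two separate pieces."""
    match = CELL_PATTERN.match(cell)
    if match is None:
        raise ValueError(f"cannot split cell: {cell!r}")
    return float(match.group(1)), match.group(2)


value, unit = atomize("12.5 C")
print(value, unit)  # 12.5 C
```

Cells that do not fit the pattern raise an error rather than being guessed at, so odd entries get looked at by a person.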

Page 54: Data Matters for AGU Early Career Conference

Create parameter table

From doi:10.3334/ORNLDAAC/777

From R Cook, ESA Best Practices Workshop 2010

During collection Break down spreadsheets

Fake a relational database

Create a site table
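A rough sketch of "faking" a relational database: one wide sheet is split into a site table and an observation table linked by a site ID. The column names and rows are invented for illustration, and `split_tables` is a hypothetical helper.

```python
# One wide, repetitive sheet (invented rows): site details repeat on every line.
flat_rows = [
    {"site": "siteA", "habitat": "lake", "year": 2009, "count": 12},
    {"site": "siteA", "habitat": "lake", "year": 2010, "count": 30},
    {"site": "siteB", "habitat": "grassland", "year": 2010, "count": 7},
]


def split_tables(rows):
    """Split flat rows into a site table and an observation table."""
    sites = {}
    observations = []
    for row in rows:
        site_id = row["site"]
        details = {"habitat": row["habitat"]}
        # The same site must always carry the same details.
        if site_id in sites and sites[site_id] != details:
            raise ValueError(f"conflicting details for site {site_id}")
        sites[site_id] = details
        observations.append(
            {"site": site_id, "year": row["year"], "count": row["count"]}
        )
    return sites, observations


sites, observations = split_tables(flat_rows)
print(sites)
print(len(observations))
```

The conflict check is a side benefit: splitting the sheet exposes rows where the same site was described two different ways.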

Page 55: Data Matters for AGU Early Career Conference

Metadata: data reporting

WHO created the data?
WHAT is the content of the data set?
WHEN was it created?
WHERE was it collected?
HOW was it developed?
WHY was it developed?

From Flickr by /\/\ichael Patric|{

During collection Create metadata
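One lightweight way to record the who/what/when/where/how/why answers is a small text file saved next to the data. The sketch below is an assumption, not a metadata standard: the field names mirror the six questions, the values are placeholders, and `build_metadata` is a hypothetical helper that refuses to produce a record with an empty answer.

```python
import json

REQUIRED_FIELDS = ("who", "what", "when", "where", "how", "why")


def build_metadata(**fields):
    """Return a metadata record, failing loudly if any question is unanswered."""
    missing = [name for name in REQUIRED_FIELDS
               if not str(fields.get(name, "")).strip()]
    if missing:
        raise ValueError("missing metadata fields: " + ", ".join(missing))
    return {name: fields[name] for name in REQUIRED_FIELDS}


# Placeholder answers only; replace them with real project details.
record = build_metadata(
    who="A. Researcher (placeholder)",
    what="Copepod counts per sample (placeholder)",
    when="2010 (placeholder)",
    where="Field site name (placeholder)",
    how="Counted under a microscope (placeholder)",
    why="Thesis chapter on population dynamics (placeholder)",
)
metadata_text = json.dumps(record, indent=2)
print(metadata_text)
```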

Page 56: Data Matters for AGU Early Career Conference

Digital context
•  Name of the data set
•  The name(s) of the data file(s) in the data set
•  Date the data set was last modified
•  Example data file records for each data type file
•  Pertinent companion files
•  List of related or ancillary data sets
•  Software (including version number) used to prepare/read the data set
•  Data processing that was performed

Personnel & stakeholders
•  Who collected
•  Who to contact with questions
•  Funders

Scientific context
•  Scientific reason why the data were collected
•  What data were collected
•  What instruments (including model & serial number) were used
•  Environmental conditions during collection
•  Temporal & spatial resolution
•  Standards or calibrations used

Information about parameters
•  How each was measured or produced
•  Units of measure
•  Format used in the data set
•  Precision & accuracy if known

Information about data
•  Definitions of codes used
•  Quality assurance & control measures
•  Known problems that limit data use (e.g. uncertainty, sampling problems)

During collection Create metadata

Page 57: Data Matters for AGU Early Career Conference

•  Provide structure to describe data
   Common terms | definitions | language | structure
•  Come in many flavors
   EML, FGDC, ISO19115, DarwinCore,…
•  Can be met using software tools
   Morpho (EML), Metavist (FGDC), NOAA MERMaid (CSGDM)

What is metadata?

Metadata standards…

During collection

Standard < Create metadata

Page 58: Data Matters for AGU Early Career Conference

Back up daily During collection

From Flickr by lippo

From Flickr by see phar

Original

Near

Far
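A minimal sketch of the original/near/far idea: copy a file to two other locations and confirm each copy matches the original by checksum. Here the "near" and "far" locations are just temporary folders standing in for, say, an external drive and a remote server. The folder names, the file name and the helper `back_up` are assumptions for illustration.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Checksum of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def back_up(original, destinations):
    """Copy the original into each destination folder and verify every copy."""
    expected = sha256_of(original)
    copies = []
    for folder in destinations:
        folder = Path(folder)
        folder.mkdir(parents=True, exist_ok=True)
        target = folder / Path(original).name
        shutil.copy2(original, target)
        if sha256_of(target) != expected:
            raise OSError(f"backup at {target} does not match the original")
        copies.append(target)
    return copies


# Demonstration with temporary folders standing in for real storage.
root = Path(tempfile.mkdtemp())
original = root / "original" / "counts_2010.csv"
original.parent.mkdir()
original.write_text("sample_id,count\ns1,30\n")

copies = back_up(original, [root / "near", root / "far"])
print([str(c.relative_to(root)) for c in copies])
```

A script like this can be run by a daily scheduled task, which covers the "back up daily" part without relying on memory.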

Page 59: Data Matters for AGU Early Career Conference

During collection

From Flickr by Barbies Land

Remember that data management plan?

Revisit Review Revise

Page 60: Data Matters for AGU Early Career Conference

During collection

Schedule a time each week or month

Revisit Review Revise

From Flickr by purplemattfish

Page 61: Data Matters for AGU Early Career Conference

From Flickr by celikins

Where to start?

Page 62: Data Matters for AGU Early Career Conference

From Flickr by Andy Graulund

Make a resolution
•  Triage on current projects
•  Get advisor, lab mates, collaborators on board
•  Do better next time

Page 63: Data Matters for AGU Early Career Conference

Start working online

From Flickr by karindalziel

Page 64: Data Matters for AGU Early Career Conference

http://datapub.cdlib.org

Open notebooks

Page 65: Data Matters for AGU Early Career Conference

Step-by-step wizard for generating DMP

create | edit | re-use | share

Free & open to community

dmptool.org

Write a DMP

Page 66: Data Matters for AGU Early Career Conference

databib.org

Where should I put my data?

Find a repository

Page 67: Data Matters for AGU Early Career Conference

Learn new skills

Software Carpentry

www.software-carpentry.org

Page 68: Data Matters for AGU Early Career Conference

From Flickr by Micah Taylor

Other Fun Stuff

Page 69: Data Matters for AGU Early Career Conference

Altmetrics?

Impact Factors

+ Citation Counts

Credit in academia…

Page 70: Data Matters for AGU Early Career Conference

Altmetrics Article-level metrics Altmetrics for alt-products

Data Code Slides Blogs

Downloads Tweets

Mentions Views

From Flickr by Skakerman

Page 71: Data Matters for AGU Early Career Conference

Altmetrics Article-level metrics Altmetrics for alt-products

Page 72: Data Matters for AGU Early Career Conference

Researcher Identification

Page 73: Data Matters for AGU Early Career Conference
Page 74: Data Matters for AGU Early Career Conference

BIG initiatives…

Page 75: Data Matters for AGU Early Career Conference

NSF funded DataNet Project Office of Cyberinfrastructure

www.dataone.org

Page 76: Data Matters for AGU Early Career Conference
Page 77: Data Matters for AGU Early Career Conference

New partners…

Page 78: Data Matters for AGU Early Career Conference

Better methods…

Page 79: Data Matters for AGU Early Career Conference

Better methods…

Page 80: Data Matters for AGU Early Career Conference

From Flickr by dotpolka

Manage & share your data!

Page 81: Data Matters for AGU Early Career Conference

Website: carlystrasser.net

Email: [email protected]

Twitter: @carlystrasser

Slides: slideshare.net/carlystrasser