data matters for agu early career conference

Post on 13-Jul-2015

Data Matters

Carly Strasser, California Digital Library carlystrasser@gmail.com

AGU Student & Early Career Scientist Conference 14 Dec 2014

Tips & Tools for Better Research

From Flickr by Lachlan Donald

Why are you here?

Science: you’re (probably) doing it wrong

From Wikimedia Commons

Back in the day…

From ahswhg.wikispaces.com

Da Vinci, Curie, Newton, Darwin

classicalschool.blogspot.com

Research has changed

Better

From wikimedia

Such Internet!

So many tools!

From Flickr by John Jobby

So much data!

Research has changed

Worse

Digital data

From Flickr by Flickmor

From Flickr by US Army Environmental Command

From Flickr by DW0825

C. Strasser

Courtesy of WHOI

From Flickr by deltaMike

Digital data +

Complex workflows

i.telegraph.co.uk

Scientists are bad at data management.

An embarrassing example…

From Flickr by lincolnblues

?

From Flickr by ransomtech

• Didn’t share the data
• Didn’t document the data (metadata)
• Didn’t document provenance/workflow

From Flickr by johntrainor

Why should I care?

Because reproducibility is one of the fundamental tenets of science.

Because we need to be credible.

Because Fox News, creationism, and the war on science.

“Help us identify grants that are wasteful or that you don’t think are a good use of taxpayer dollars.” Rep. Adrian Smith (R-Nebraska), a member of the House Committee on Science and Technology

Because it means faster progress.

Because you are a good person.

From Flickr by Redden-McAllister

From Flickr by Ken Cowell

From Flickr by Brandi Jordan

flowingdata.com

Map of Scientific Collaborations

Because you have to.

Journals, institutions, funders

From Flickr by Eva Rinaldi Celebrity and Live Music Photographer

… “Federal agencies investing in research and development (more than $100 million in annual expenditures) must have clear and coordinated policies for increasing public access to research products.”

Feb 2013

From Flickr by Michael Tinkler

data management

From Flickr by Big Swede Guy

Best Practices

From Flickr by Mark Sardella

Plan before data collection

• Create a key (data dictionary)
• Make sure names are unique
• Define codes

From Flickr by zebbie

Planning Design sample naming scheme

Use descriptive file names
• Unique
• Reflect contents

Bad: Mydata.xls, 2001_data.csv, best version.txt

Better: Eaffinis_nanaimo_2010_counts.xls
• Eaffinis: study organism
• nanaimo: site name
• 2010: year
• counts: what was measured

*Not for everyone

From R Cook, ESA Best Practices Workshop 2010
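One way to keep names consistent is to build them from their parts instead of typing them by hand. The sketch below assumes the organism_site_year_measurement pattern from the "Better" example; the function names `build_name` and `parse_name` are illustrative and not from the talk.

```python
import re

# Pattern assumed from the "Better" example: organism_site_year_measured.ext
NAME_PATTERN = re.compile(
    r"^(?P<organism>[A-Za-z]+)_(?P<site>[A-Za-z]+)_"
    r"(?P<year>\d{4})_(?P<measured>[A-Za-z]+)\.(?P<ext>\w+)$"
)


def build_name(organism, site, year, measured, ext="csv"):
    """Assemble a descriptive file name from its parts."""
    parts = [organism, site, str(year), measured]
    for part in parts:
        # Underscores or spaces inside a part would make the name ambiguous.
        if not part or "_" in part or " " in part:
            raise ValueError(f"invalid name part: {part!r}")
    return "_".join(parts) + "." + ext


def parse_name(filename):
    """Split a file name into its parts, or return None if it does not fit."""
    match = NAME_PATTERN.match(filename)
    return match.groupdict() if match else None


print(build_name("Eaffinis", "nanaimo", 2010, "counts", ext="xls"))
print(parse_name("Mydata.xls"))
```

A name that fails `parse_name` is a quick flag that a file was saved outside the scheme.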

Planning Design file naming scheme

Biodiversity

Lake

Experiments

Field work

Grassland

Biodiv_H20_heatExp_2005to2008.csv
Biodiv_H20_predatorExp_2001to2003.csv
…
Biodiv_H20_PlanktonCount_2001toActive.csv
Biodiv_H20_ChlAprofiles_2003.csv
…

From S. Hampton

Planning Design file organization

Consider…
• Dependencies?
• File formats?
• Time of collection?
• Order of analysis?

Planning

• Constrain entries
• Atomize
• Break down spreadsheets

Design your spreadsheet

From Flickr by Ulleskelf

A relational database is
• A set of tables
• Relationships among the tables
• A language to specify & query the tables

An RDB provides
• Scalability: millions+ records
• Features for sub-setting, querying, sorting
• Reduced redundancy & entry errors

From Mark Schildhauer

Planning Consider a database
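A minimal sketch of the three ingredients listed for a relational database (tables, a relationship among them, and a query language), using Python's built-in `sqlite3` module. The table names, column names, and records are hypothetical and chosen only for illustration.

```python
import sqlite3

# In-memory database so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE site (
        site_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL UNIQUE
    );
    CREATE TABLE observation (
        obs_id    INTEGER PRIMARY KEY,
        site_id   INTEGER NOT NULL REFERENCES site(site_id),
        year      INTEGER NOT NULL,
        n_counted INTEGER NOT NULL CHECK (n_counted >= 0)
    );
""")

# Hypothetical records, for illustration only.
conn.executemany(
    "INSERT INTO site (site_id, name) VALUES (?, ?)",
    [(1, "site_A"), (2, "site_B")],
)
conn.executemany(
    "INSERT INTO observation (site_id, year, n_counted) VALUES (?, ?, ?)",
    [(1, 2009, 12), (1, 2010, 15), (2, 2010, 7)],
)

# Query across the relationship: total count per site.
totals = conn.execute("""
    SELECT site.name, SUM(observation.n_counted)
    FROM observation JOIN site USING (site_id)
    GROUP BY site.name
    ORDER BY site.name
""").fetchall()
print(totals)
```

Each site name is stored once and the `CHECK` constraint rejects negative counts, which is where the reduced redundancy and fewer entry errors come from.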

Store your data in a repository
• Institutional archive
• Discipline/specialty archive

Pick a data repository

From Flickr by torkildr

Ask a librarian

Repos of repos: databib.org re3data.org

Planning

From Flickr by sepa synod

From Flickr by taberandrew

From Flickr by withassociates

Decide on preservation/backup Planning

What software? What hardware? What personnel?

How often? Set up reminders!

Test system
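Testing a backup can be as simple as confirming that the copy matches the original byte for byte. The sketch below is one possible approach, assuming a plain file copy to a second location; the function names and the demo file are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_and_verify(source, backup_dir):
    """Copy source into backup_dir and confirm the copy is identical."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / Path(source).name
    shutil.copy2(source, target)
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"backup of {source} does not match the original")
    return target


# Demonstration in a throwaway directory with a hypothetical data file.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "raw_counts.csv"
    original.write_text("site,count\nsite_A,12\n")
    copy = backup_and_verify(original, Path(tmp) / "backup")
    print(copy.name, "verified")
```

Running a check like this on a schedule covers both "How often?" and "Test system" in one step.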

…document that describes what you will do with your data throughout the research project

From Flickr by Barbies Land

Write a data management plan!

Planning

DMP components
• What will be collected
• Methods
• Standards
• Metadata
• Sharing/access
• Long-term storage

But they all have different requirements and express them in different ways

Planning

From Flickr by Barbies Land

Step-by-step wizard for generating DMP
create | edit | re-use | share
Free & open to community

dmptool.org

Planning

During Data Collection & Entry

From Flickr by Julia Manzerova

Realistically:
• Archive .csv version of raw data
• Make a “raw” tab in working data file
• Do all work on other tabs

During collection Keep raw data raw

Raw data as .csv

R script for processing & analysis

During collection

Ideally:
• Use scripts to process data
• Save them with data

Keep raw data raw
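The slides suggest an R script; the sketch below shows the same idea in Python as a stand-in. It assumes the raw data sit in a .csv file, marks that file read-only, and writes every result to a separate file. The file names, column names, and the "drop incomplete rows" step are hypothetical.

```python
import csv
import os
import stat
import tempfile
from pathlib import Path


def protect_raw(path):
    """Mark the raw file read-only so it cannot be edited by accident."""
    os.chmod(path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)


def process(raw_path, out_path):
    """Read the raw file, drop rows with a missing value, write a new file."""
    kept = 0
    with open(raw_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # Keep a row only if every field has a non-blank value.
            if all((value or "").strip() for value in row.values()):
                writer.writerow(row)
                kept += 1
    return kept


# Demonstration in a throwaway directory with hypothetical readings.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw.csv"
    raw.write_text("site,temp_c\nsite_A,11.2\nsite_B,\n")
    protect_raw(raw)
    print(process(raw, Path(tmp) / "processed.csv"), "rows kept")
    # Restore write permission so the temporary directory can be cleaned up.
    os.chmod(raw, stat.S_IRUSR | stat.S_IWUSR)
```

Because the script never opens the raw file for writing, rerunning it always starts from the same untouched input.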

During collection Document your workflow

Workflow: how you get from the raw data to the final products of your research

Simple workflow: flow chart

Temperature data, salinity data → Data import into Excel → Data in spreadsheet → Quality control & data cleaning → “Clean” T & S data → Analysis: mean, SD → Summary statistics → Graph production

During collection

Commented script

• R, SAS, MATLAB…
• Well-documented code is
  • Easier to review
  • Easier to share
  • Easier to use for repeat analysis

Document your workflow
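A commented script can double as the workflow record. The sketch below follows the temperature and salinity example (quality control, then mean and SD) in Python rather than R, SAS, or MATLAB. All readings and the plausible ranges used for quality control are hypothetical.

```python
import statistics

# Step 1: raw readings (hypothetical values, for illustration only).
# None marks a missing reading.
temperature_c = [11.2, 11.5, None, 11.9, 99.0]
salinity_psu = [30.1, 30.4, 30.2, None, 30.3]


# Step 2: quality control and data cleaning.
def clean(values, low, high):
    """Drop missing readings and readings outside [low, high]."""
    return [v for v in values if v is not None and low <= v <= high]


# The plausible ranges below are assumptions, not values from the talk.
clean_t = clean(temperature_c, low=-2.0, high=40.0)
clean_s = clean(salinity_psu, low=0.0, high=45.0)


# Step 3: analysis (mean and sample standard deviation).
def summarize(values):
    """Return the summary statistics for one cleaned variable."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
    }


summary = {
    "temperature_c": summarize(clean_t),
    "salinity_psu": summarize(clean_s),
}

# Step 4: report the summary statistics.
for name, stats in summary.items():
    print(f"{name}: n={stats['n']} mean={stats['mean']:.2f} sd={stats['sd']:.2f}")
```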

Constrain data entries
• Excel lists
• Data validation
• Google docs forms

Modified from K. Vanderbilt

During collection
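The same constraint can be applied in a script when data arrive from outside a spreadsheet. This is a minimal sketch, assuming one entry is a dictionary with a species code and a count; the allowed codes, the range, and the field names are hypothetical.

```python
# Allowed codes and ranges; in practice these would come from the data dictionary.
ALLOWED_SPECIES = {"Eaffinis"}   # hypothetical code list
COUNT_RANGE = (0, 10_000)        # hypothetical plausible range


def validate_entry(entry):
    """Return a list of problems with one entry; an empty list means it passed."""
    problems = []
    if entry.get("species") not in ALLOWED_SPECIES:
        problems.append(f"unknown species code: {entry.get('species')!r}")
    try:
        count = int(entry.get("count", ""))
    except (TypeError, ValueError):
        problems.append(f"count is not a whole number: {entry.get('count')!r}")
    else:
        if not COUNT_RANGE[0] <= count <= COUNT_RANGE[1]:
            problems.append(f"count out of range: {count}")
    return problems


print(validate_entry({"species": "Eaffinis", "count": "42"}))
print(validate_entry({"species": "E. affinis", "count": "many"}))
```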

Atomize During collection

One piece of information per cell
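As a small illustration of atomizing, the sketch below splits a combined cell into one value per field. The cell format ("site" followed by an ISO date) is an assumption made for the example.

```python
from datetime import date


def atomize(cell):
    """Split a hypothetical combined 'site date' cell into one value per field."""
    site, iso_date = cell.split()
    parsed = date.fromisoformat(iso_date)
    return {
        "site": site,
        "year": parsed.year,
        "month": parsed.month,
        "day": parsed.day,
    }


print(atomize("site_A 2010-07-14"))
```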

Create parameter table

From doi:10.3334/ORNLDAAC/777

From R Cook, ESA Best Practices Workshop 2010

During collection Break down spreadsheets

Fake a relational database

Create a site table
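One way to "fake" the relational layout is to split a flat sheet into a site table and an observation table that share a key. The sketch below assumes each flat row repeats the site name and coordinates; the column names and values are hypothetical.

```python
def split_site_table(rows):
    """Split flat rows into a site table and an observation table.

    Each site's attributes are stored once; observations refer to it by site_id.
    """
    site_ids = {}  # maps (site, lat, lon) to site_id
    sites, observations = [], []
    for row in rows:
        key = (row["site"], row["lat"], row["lon"])
        if key not in site_ids:
            site_ids[key] = len(site_ids) + 1
            sites.append(
                {"site_id": site_ids[key], "site": key[0], "lat": key[1], "lon": key[2]}
            )
        observations.append(
            {"site_id": site_ids[key], "year": row["year"], "count": row["count"]}
        )
    return sites, observations


# Hypothetical flat spreadsheet rows, for illustration only.
flat = [
    {"site": "site_A", "lat": 49.1, "lon": -123.9, "year": 2009, "count": 12},
    {"site": "site_A", "lat": 49.1, "lon": -123.9, "year": 2010, "count": 15},
    {"site": "site_B", "lat": 48.4, "lon": -123.4, "year": 2010, "count": 7},
]

sites, observations = split_site_table(flat)
print(len(sites), "sites,", len(observations), "observations")
```

Correcting a site's coordinates then means editing one row in the site table instead of every observation.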

Metadata: data reporting

WHO created the data?
WHAT is the content of the data set?
WHEN was it created?
WHERE was it collected?
HOW was it developed?
WHY was it developed?

From Flickr by /\/\ichael Patric|{

During collection Create metadata

Digital context
• Name of the data set
• The name(s) of the data file(s) in the data set
• Date the data set was last modified
• Example data file records for each data type file
• Pertinent companion files
• List of related or ancillary data sets
• Software (including version number) used to prepare/read the data set
• Data processing that was performed

Personnel & stakeholders
• Who collected
• Who to contact with questions
• Funders

Scientific context
• Scientific reason why the data were collected
• What data were collected
• What instruments (including model & serial number) were used
• Environmental conditions during collection
• Temporal & spatial resolution
• Standards or calibrations used

Information about parameters
• How each was measured or produced
• Units of measure
• Format used in the data set
• Precision & accuracy if known

Information about data
• Definitions of codes used
• Quality assurance & control measures
• Known problems that limit data use (e.g. uncertainty, sampling problems)

During collection Create metadata
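Before adopting a formal standard, the WHO/WHAT/WHEN/WHERE/HOW/WHY questions can be kept as a small machine-readable record saved next to the data. This is a sketch only: the field names and placeholder values are hypothetical, and it does not follow any particular metadata standard.

```python
import json

# The six questions from the metadata slide, used as required fields.
REQUIRED = ("who", "what", "when", "where", "how", "why")


def missing_fields(record):
    """Return the required fields that are absent or blank."""
    return [
        field for field in REQUIRED
        if not str(record.get(field) or "").strip()
    ]


# Placeholder values; fill these in for a real data set.
record = {
    "who": "<name of data creator>",
    "what": "<content of the data set>",
    "when": "<date created>",
    "where": "<collection location>",
    "how": "<instruments and processing>",
    "why": "",
}

print(json.dumps(record, indent=2))
print("missing:", missing_fields(record))
```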

• Provide structure to describe data: common terms | definitions | language | structure
• Come in many flavors: EML, FGDC, ISO19115, DarwinCore,…
• Can be met using software tools: Morpho (EML), Metavist (FGDC), NOAA MERMaid (CSGDM)

What is metadata?

Metadata standards…

During collection

Standard < Create metadata

Back up daily During collection

From Flickr by lippo

From Flickr by see phar

Original, near, far

During collection

From Flickr by Barbies Land

Remember that data management plan?

Revisit Review Revise

During collection

Schedule a time each week or month

From Flickr by purplemattfish

From Flickr by celikins

Where to start?

From Flickr by Andy Graulund

Make a resolution
• Triage on current projects
• Get advisor, lab mates, collaborators on board
• Do better next time

Start working online

From Flickr by karindalziel

http://datapub.cdlib.org

Open notebooks

Write a DMP

Step-by-step wizard for generating DMP
create | edit | re-use | share
Free & open to community

dmptool.org

databib.org

Where should I put my data?

Find a repository

Learn new skills

Software Carpentry
www.software-carpentry.org

From Flickr by Micah Taylor

Other Fun Stuff

Altmetrics?

Impact Factors + Citation Counts

Credit in academia…

Altmetrics
• Article-level metrics
• Altmetrics for alt-products

Data | Code | Slides | Blogs

Downloads | Tweets | Mentions | Views

From Flickr by Skakerman

Researcher Identification

BIG initiatives…

NSF-funded DataNet project, Office of Cyberinfrastructure

www.dataone.org

New partners…

Better methods…

From Flickr by dotpolka

Manage & share your data!

Website: carlystrasser.net
Email: carlystrasser@gmail.com
Twitter: @carlystrasser
Slides: slideshare.net/carlystrasser
