public data archiving: who does? who doesn't? what can we do about it?

Post on 01-Nov-2014

DESCRIPTION

Presentation at UBC Biodiversity Internal Seminar Series (BLISS) http://www.zoology.ubc.ca/~biodiv/BLISS/BLISS.htm

TRANSCRIPT

Public data archiving:

Who shares? Who doesn’t?

What can we do about it?

Heather Piwowar

Presented at UBC BLISS, Sept 2010

DataONE postdoc with Dryad and NESCent, @UBC

PhD in Dept of Biomedical Informatics, U of Pittsburgh

http://www.metmuseum.org/toah/ho/09/euwf/ho_24.45.1.htm

http://www.flickr.com/photos/jsmjr/62443357/

http://www.flickr.com/photos/camilleharrington/3587294608/

http://www.flickr.com/photos/rkuhnau/3318245976/

http://www.flickr.com/photos/conformpdx/1796399674/

http://www.flickr.com/photos/rkuhnau/3317418699/

http://www.flickr.com/photos/zemlinki/261617721/

http://www.flickr.com/photos/tracenmatt/3020786491/

http://www.flickr.com/photos/the-o/2078239333/

http://www.flickr.com/photos/75166820@N00/5318468/

• Find
• Organize
• Document
• Deidentify
• Format
• Decide
• Ask
• Submit

• Answer questions
• Worry about mistakes being found
• Worry about data being misinterpreted
• Worry about being scooped
• Forgo money and IP and prestige???

not very motivating.

As a result, policy makers have spent lots of time and money ....

http://www.flickr.com/photos/tonivc/2283676770/

http://www.flickr.com/photos/johnnyvulkan/381941233/

building databases, developing standards, articulating best practices

to support public archiving of research datasets 

lots of data sharing!

http://www.genome.jp/en/db_growth.html

but how much isn’t shared?

what isn’t shared?

who isn’t sharing it?

why not?

what can we do about it?

how much does it matter?

you can not manage what you do not measure

quote: Lord Kelvin

http://www.flickr.com/photos/archeon/2941655917/

As we seek to embrace and encourage data sharing,

understanding patterns of adoption will allow us to make informed decisions about tools, policies, and best practices.

Measuring adoption over time will allow us to note progress and identify best practices and opportunities for improvement.

1. Is there benefit for those who share?

2. How can we study data sharing behaviour in a scalable, systematic way?

3. What factors are correlated with sharing and withholding data?

research questions

http://www.flickr.com/photos/paulhami/1020538523//

Which data?

Where?

With whom?

When?

Under what terms?

• gene expression microarray data

• raw intensity data

• upon publication

• publicly on the internet

• (centralized databases)

microarray data

http://en.wikipedia.org/wiki/DNA_microarray

http://en.wikipedia.org/wiki/Image:Heatmap.png

http://commons.wikimedia.org/wiki/File:DNA_double_helix_vertikal.PNG

microarray data

http://www.flickr.com/photos/sunrise/35819369/

1.  Is there benefit for those who share?

currency of value?

Citations.

$50!

Diamond, Arthur M. What is a Citation Worth? The Journal of Human Resources (1986) vol. 21 (2) pp. 200-215

dataset: 85 cancer microarray trials published in 1999-2003, as identified by Ntzani and Ioannidis (2003)

citations: ISI Web of Science Citation index, citations from 2004-2005

data sharing locations: publisher and lab websites, microarray databases, WayBack Internet Archive, Oncomine

statistics: multivariate linear regression

Note: log scale

~70%
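The statistics listed above are a multivariate linear regression with citation counts on a log scale. A minimal sketch of that kind of model follows, using ordinary least squares solved through the normal equations in plain Python. The toy rows, the choice of covariates (a sharing flag plus log journal impact factor), and the function names are illustrative assumptions, not the study's data or code.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def ols(rows, y):
    """Least-squares coefficients; each row must already include the intercept column."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical toy data: (shared data? 0/1, journal impact factor, citation count).
toy = [(1, 10.0, 60), (0, 10.0, 40), (1, 5.0, 30), (0, 5.0, 20),
       (1, 2.0, 12), (0, 2.0, 8)]
X = [[1.0, shared, math.log(impact)] for shared, impact, _ in toy]
y = [math.log(cites) for _, _, cites in toy]

intercept, beta_shared, beta_impact = ols(X, y)
# On a log scale, a coefficient b corresponds to a multiplicative change of exp(b).
percent_change = (math.exp(beta_shared) - 1) * 100
print(f"sharing coefficient {beta_shared:.3f}, i.e. {percent_change:.0f}% more citations in this toy data")
```

Because the outcome is logged, the sharing coefficient is read as a percentage difference in citations after adjusting for the other covariates, which is how a figure such as the one above would be interpreted.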

2. Need automated methods to:

a) Identify studies that create datasets

b) Determine which of these have in fact been shared

c) Extract attributes about the environment

a) Identify studies that create datasets

http://www.flickr.com/photos/lofaesofa/248546821/

Combined, these full-text portals reach 85% of the articles available through U of Pittsburgh library subscriptions.

But how to generate an effective query?

Use open access articles.

•text analysis: automatically catalogued single words and word-pairs from full text

•assessed precision and recall

•combined the high performers:

Derived query:

("gene expression" AND microarray AND cell AND rna)

AND (rneasy OR trizol OR "real-time pcr")

NOT ("tissue microarray*" OR "cpg island*")
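To make the derived query concrete, the sketch below applies the same Boolean logic to a string of article full text. It is only an illustration: the function name `matches_derived_query` and the lowercase substring matching are assumptions, and each full-text portal parses and tokenizes queries in its own way.

```python
def matches_derived_query(full_text: str) -> bool:
    """Return True if the text satisfies the derived Boolean query.

    Assumption: case-insensitive substring matching stands in for a
    portal's query engine, and a trailing * wildcard is approximated
    by matching the stem.
    """
    text = full_text.lower()
    required = ["gene expression", "microarray", "cell", "rna"]
    any_of = ["rneasy", "trizol", "real-time pcr"]
    excluded = ["tissue microarray", "cpg island"]
    return (
        all(term in text for term in required)
        and any(term in text for term in any_of)
        and not any(term in text for term in excluded)
    )

# Two made-up snippets of article text.
hit = ("Total RNA was extracted from each cell line with TRIzol and "
       "hybridized to a microarray for gene expression profiling.")
miss = ("A tissue microarray was stained; gene expression in each cell "
        "was confirmed with RNA analysis by real-time PCR.")

print(matches_derived_query(hit))
print(matches_derived_query(miss))
```

Substring matching is crude (for example, "rna" also matches inside "journal"), so a real implementation would match on word boundaries or rely on the portal's own search syntax.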

Evaluation:

Ochsner et al. Nature Methods (2008) 400 studies across 20 journals

Precision: 90% (conf int: 86% to 93%)

Recall: 56% (conf int: 52% to 61%)
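Precision and recall with confidence intervals of this kind can be computed from counts of true positives, false positives, and false negatives. The sketch below uses a Wilson score interval, which is an assumption: the slides do not say which interval method was used, and the counts are hypothetical, chosen only to show the arithmetic.

```python
import math

def wilson_interval(successes: int, total: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    p = successes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half

def precision_recall(tp: int, fp: int, fn: int):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical counts, not the study's actual confusion matrix.
tp, fp, fn = 90, 10, 70
precision, recall = precision_recall(tp, fp, fn)
p_lo, p_hi = wilson_interval(tp, tp + fp)
r_lo, r_hi = wilson_interval(tp, tp + fn)
print(f"precision {precision:.2f} ({p_lo:.2f} to {p_hi:.2f})")
print(f"recall    {recall:.2f} ({r_lo:.2f} to {r_hi:.2f})")
```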

b) Determine which datasets have in fact been shared

77 % 

c) Extract attributes about the environment

Is research data shared after publication?

Funder Journal Investigator Institution Study

funded by NIH?

size of grant

sharing plan req’d?

funded by non-NIH?

impact factor

strength of policy

open access?

number of microarray studies published

years since first paper

# pubs

# citations

previously shared?

previously reused?

gender

sector

size

impact rank

country

humans?

mice?

plants?

cancer?

clinical trial?

number of authors

year

Funder Journal Investigator Institution Study

journal rank

“An inherent principle of publication is that others should be able to replicate and build upon the authors' published claims. Therefore, a condition of publication in a Nature journal is that authors are required to make materials, data and associated protocols available in a publicly accessible database …”

http://www.nature.com/authors/editorial_policies/availability.html

http://www.nature.com/nature/journal/v453/n7197/index.html

journal data sharing policy

institution rank

Yu et al. BMC medical informatics and decision making (2007) vol. 7 pp. 17

study type

Author publication history:

Citation counts:

Author name disambiguation: Author-ity web service

Torvik & Smalheiser. (2009). Author Name Disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3):11.

author “experience”

author gender

funding level

PubMed grant lists + NIH grant details

funder mandates

Requires a data sharing plan for studies funded after October 2003

that receive more than $500 000 in direct funding per year

Proxy for NIH data sharing policy applicability:

If in any year since 2004,

• funded by an NIH grant number with a “1” or “2” type code

• received more than $750 000 in total funding from the grant
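The proxy rule above can be written as a small predicate. In this sketch the `GrantYear` record and its field names are hypothetical, "since 2004" is read as fiscal year 2004 or later, and the two bulleted conditions are assumed to apply to the same grant in the same year; the slides do not spell out these details.

```python
from dataclasses import dataclass

@dataclass
class GrantYear:
    """One NIH grant in one fiscal year (hypothetical record layout)."""
    type_code: str        # application type code, e.g. "1" or "2"
    fiscal_year: int
    total_dollars: float  # total funding received from the grant that year

def policy_proxy_applies(grant_years, first_year=2004, threshold=750_000):
    """True if any grant-year meets every condition of the proxy.

    Assumptions: fiscal_year >= first_year stands for "since 2004", and
    both conditions must hold for the same grant in the same year.
    """
    return any(
        g.fiscal_year >= first_year
        and g.type_code in {"1", "2"}
        and g.total_dollars > threshold
        for g in grant_years
    )

# Made-up funding histories for two studies.
study_a = [GrantYear("1", 2005, 900_000.0)]
study_b = [GrantYear("5", 2006, 2_000_000.0), GrantYear("2", 2003, 800_000.0)]
print(policy_proxy_applies(study_a))
print(policy_proxy_applies(study_b))
```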

funder mandates

and so on...

124 variables

Now equipped with automated methods to:

a) Identify studies that create datasets

b) Determine which of these have in fact been shared

c) Extract attributes about the environment

http://www.flickr.com/photos/cogdog/123072/

3.  What factors are correlated with sharing and withholding data?

11,603 datapoints

25% had links from datasets in databases

univariate analysis

[Figure: Proportion of articles with shared datasets, by year. x-axis: year article published (2000 to 2009); y-axis: proportion of articles with datasets found in GEO or ArrayExpress (ticks from 0.05 to 0.35).]

Across time
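The univariate views in this part of the talk each reduce to the same calculation: group the articles by one attribute and take the proportion with a shared dataset. A minimal sketch, assuming hypothetical records with a boolean `shared` field (the field names and toy rows are illustrative, not the study's data):

```python
from collections import defaultdict

def proportion_shared_by(records, key):
    """Proportion of records with shared=True within each group of `key`."""
    totals = defaultdict(int)
    shared = defaultdict(int)
    for rec in records:
        totals[rec[key]] += 1
        shared[rec[key]] += int(rec["shared"])
    return {group: shared[group] / totals[group] for group in sorted(totals)}

# Hypothetical records, for illustration only.
articles = [
    {"year": 2005, "journal": "Journal A", "shared": False},
    {"year": 2005, "journal": "Journal B", "shared": True},
    {"year": 2005, "journal": "Journal A", "shared": False},
    {"year": 2008, "journal": "Journal B", "shared": True},
    {"year": 2008, "journal": "Journal A", "shared": True},
    {"year": 2008, "journal": "Journal A", "shared": False},
]
by_year = proportion_shared_by(articles, "year")
by_journal = proportion_shared_by(articles, "journal")
print(by_year)
print(by_journal)
```

The same function serves for journal, institution, or any other single attribute by changing `key`.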

Journals (Physiological Genomics)

[Figure: bar chart of the proportion of datasets shared (y-axis, 0.0 to 1.0), one bar per journal. Journals in plotted order: Physiol Genomics; PLoS Genet; Genome Biol; Microbiology; PLoS One; BMC Genomics; Plant Cell; Genome Res; Eukaryot Cell; Appl Environ Microbiol; BMC Med Genomics; Hum Mol Genet; Proc Natl Acad Sci U S A; Infect Immun; Am J Respir Cell Mol Biol; Dev Biol; J Bacteriol; Mol Endocrinol; BMC Cancer; Plant Physiol; Biol Reprod; Blood; J Immunol; FASEB J; Toxicol Sci; J Exp Bot; Nucleic Acids Res; Diabetes; Mol Cell Biol; Mol Cancer Ther; BMC Bioinformatics; Stem Cells; FEBS Lett; J Neurosci; Am J Pathol; J Biol Chem; J Virol; OTHER; Cancer Res; J Clin Endocrinol Metab; Plant Mol Biol; Clin Cancer Res; Genomics; Invest Ophthalmol Vis Sci; Mol Hum Reprod; Carcinogenesis; Gene; Endocrinology; Oncogene; Cancer Lett; Biochem Biophys Res Commun.]

Institutions (Stanford)

[Figure: bar chart of the proportion of datasets shared (y-axis, 0.0 to 1.0), one bar per institution. Institutions in plotted order: Stanford University; University of Pennsylvania; University of Illinois; University of California, Los Angeles; University of Wisconsin, Madison; University of Washington; University of California, Davis; The University of British Columbia; University of California, San Francisco; University of Florida; University of California, San Diego; University of Minnesota, Twin Cities; Baylor College of Medicine; OTHER; Max Planck Gesellschaft; Harvard University; Duke University Medical Center; Yale University; Johns Hopkins University; University of Pittsburgh; Washington University in Saint Louis; University of Toronto; University of California, Berkeley; University of Michigan, Ann Arbor; Michigan State University; National Cancer Institute; Tokyo Daigaku.]

Institution rank

[Figure: proportion of datasets shared (y-axis, 0.0 to 1.0) against institution rank (x-axis, ticks from 1 to 1901).]

multivariate analysis

factor analysis

multivariate logistic regression over the first-order factors
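The regression results in this part of the talk are reported as odds ratios. In a logistic regression each coefficient is a log odds ratio, so an odds ratio and its confidence interval come from exponentiating the coefficient and its limits. The sketch below assumes a Wald interval and uses a made-up coefficient and standard error; neither comes from the study.

```python
import math

def odds_ratio_with_ci(beta: float, std_err: float, z: float = 1.96):
    """Convert a logistic-regression coefficient to an odds ratio with a Wald interval."""
    return (
        math.exp(beta),
        math.exp(beta - z * std_err),
        math.exp(beta + z * std_err),
    )

# Hypothetical coefficient and standard error, for illustration only.
odds_ratio, ci_low, ci_high = odds_ratio_with_ci(beta=math.log(2.0), std_err=0.2)
print(f"odds ratio {odds_ratio:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
```

An odds ratio of 1.00 indicates no association, which is why odds ratio axes are usually drawn on a doubling scale centred on 1.00.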

[Figure: odds ratios from multivariate nonlinear regressions with interactions, one row per first-order factor. Horizontal axis: Odds Ratio (0.25, 0.50, 1.00, 2.00, 4.00, 8.00). Annotations: "Has journal policy", "0.95".]

First-order factors shown:

• Count of R01 & other NIH grants
• Authors prev GEOAE sharing & OA & microarray creation
• NO K funding or P funding
• Institution high citations & collaboration
• Journal impact
• Journal policy consequences & long halflife
• NOT animals or mice
• Institution is government & NOT higher ed
• Last author num prev pubs & first year pub
• Large NIH grant
• Humans & cancer
• NO geo reuse + YES high institution output
• First author num prev pubs & first year pub

logistic regression using second-order factors

[Figure: odds ratios from multivariate nonlinear regression with interactions, one row per second-order factor. Horizontal axis: Odds Ratio (0.25, 0.50, 1.00, 2.00, 4.00). Annotation: "0.95".]

Second-order factors shown:

• OA journal & previous GEO-AE sharing
• Amount of NIH funding
• Journal impact factor and policy
• Higher Ed in USA
• Cancer & humans

Conclusions:

• data sharing rates are increasing, but overall levels are low

Preliminary evidence:

• levels are particularly low in cancer
• levels are highest for those who
  • publish in a journal with a policy
  • publish in an open access journal
  • have shared data before

• data and filters were imperfect
• many assumptions
• didn’t capture all types of sharing
• don’t know how generalizable across datatypes
• should be considered hypothesis-generating

http://www.flickr.com/photos/vlastula/300102949/

http://www.flickr.com/photos/gatewaystreets/3838452287/

DataONE: NSF-funded distributed framework and cyberinfrastructure for environmental science.

Dryad is a repository of data underlying scientific publications, with an initial focus on evolution, ecology, and related fields.

The National Evolutionary Synthesis Center (NESCent), NSF-funded:

• Duke University
• UNC at Chapel Hill
• North Carolina State University

1.  new domain

• evolution and ecology datasets

• raw data that support results

• upon publication or short embargo

• publicly on the internet

challenges!

1. No PubMed

2. Diverse data types, norms, repositories

3. Data almost always collected for a specific hypothesis

4. Less public sharing so far

2.  new initiatives

JDAP (Joint Data Archiving Policy)

• The American Naturalist
• Evolution
• Journal of Evolutionary Biology
• Molecular Ecology
• Evolutionary Applications
• Genetics
• Heredity
• Molecular Biology and Evolution
• Systematic Biology
• Paleobiology
• BMC Evolutionary Biology

http://www.flickr.com/photos/jima/606588905/

Blumenthal et al. Acad Med. 2006.

Campbell et al. JAMA. 2002.

Kyzas et al. J Natl Cancer Inst. 2005.

Vogeli et al. Acad Med. 2006.

Reidpath et al. Bioethics. 2001.

3.  Reuse.

http://www.flickr.com/photos/boitabulle/3668162701/

who reuses data?

when?

why aren’t they?

which datasets are most likely to be reused?

what can we do about it?

how many datasets could be reused but aren’t?

why?

who doesn’t?

does it matter?

http://upload.wikimedia.org/wikipedia/commons/thumb/e/e6/Gamma_distribution_pdf.svg/500px-Gamma_distribution_pdf.svg.png

I post my data, code, and statistical scripts on GitHub (links from http://researchremix.org)

Share yours too!

http://www.flickr.com/photos/myklroventine/892446624/

“Does anyone want your data?

That’s hard to predict […] After all, no one ever knocked on your door asking to buy those figurines collecting dust in your cabinet before you listed them on eBay.

Your data, too, may simply be awaiting an effective matchmaker.”

Got data? Nature Neuroscience (2007)

thank you

Dept of Biomedical Informatics at U of Pittsburgh

Wendy Chapman for support and feedback

Todd Vision, Mike Whitlock for ongoing discussions

NIH NLM. NSF through DataONE, NESCent, Dryad.

Open science online community and those who release their articles, datasets and photos openly

http://www.flickr.com/photos/jep42/3017149415/in/set-72157608797298056/

variables

Journal mandates

• readers

• reusers

• authors

• editors

• reviewers

• funders

• database designers, maintainers, curators

• patients, subjects, or populations

perspectives,

and also driving towards actionable results for these groups

Correlates with self-reported data withholding (Blumenthal et al. Acad Med. 2006)

• industry involvement
• perceived competitiveness of field
• male
• sharing discouraged in training
• human participants
• academic productivity

[Figure: horizontal axis with ticks at 0, 1, 2, 3.]

Self-reported reasons for data withholding (Campbell et al. JAMA 2002.)

• sharing is too much effort
• want student or jr faculty to publish more
• they themselves want to publish more
• cost
• industrial sponsor
• confidentiality
• commercial value of results

[Figure: horizontal axis with ticks at 0%, 20%, 40%, 60%, 80%.]

Table 2: Second-order factor loadings, by first-order factors

Amount of NIH funding
   0.88  Count of R01 & other NIH grants
   0.49  Large NIH grant
  -0.55  NO K funding or P funding

Cancer & humans
   0.83  Humans & cancer

OA journal & previous GEO-AE sharing
   0.59  Authors prev GEOAE sharing & OA & microarray creation
   0.43  Institution high citations & collaboration
   0.31  First author num prev pubs & first year pub
  -0.36  Last author num prev pubs & first year pub

Journal impact factor and policy
   0.57  Journal impact
   0.51  Last author num prev pubs & first year pub

Higher Ed in USA
   0.40  NO geo reuse + YES high institution output
  -0.44  Institution is government & NOT higher ed

Table 3: Second-order factor loadings, by original variables

Amount of NIH funding
   0.87  nih.cumulative.years.tr
   0.85  num.grants.via.nih.tr
   0.84  max.grant.duration.tr
   0.82  num.grant.numbers.tr
   0.80  pubmed.is.funded.nih
   0.79  nih.max.max.dollars.tr
   0.70  nih.sum.avg.dollars.tr
   0.70  nih.sum.sum.dollars.tr
   0.59  has.R.funding
   0.59  num.post2003.morethan500k.tr
   0.58  country.usa
   0.58  has.U.funding
   0.57  has.R01.funding
   0.55  num.post2003.morethan750k.tr
   0.53  has.T.funding
   0.53  num.post2003.morethan1000k.tr
   0.49  num.post2004.morethan500k.tr
   0.45  num.post2004.morethan750k.tr
   0.44  has.P.funding
   0.43  num.post2004.morethan1000k.tr
   0.43  num.nih.is.nci.tr
   0.35  num.post2005.morethan500k.tr
   0.32  num.nih.is.nigms.tr
   0.31  num.post2005.morethan750k.tr

Cancer & humans
   0.60  pubmed.is.cancer
   0.59  pubmed.is.humans
   0.52  pubmed.is.cultured.cells
   0.43  pubmed.is.core.clinical.journal
   0.39  institution.is.medical
  -0.58  pubmed.is.plants
  -0.50  pubmed.is.fungi
  -0.37  pubmed.is.shared.other
  -0.30  pubmed.is.bacteria

OA journal & previous GEO-AE sharing
   0.40  first.author.num.prev.geoae.sharing.tr
   0.37  pubmed.is.open.access
   0.37  first.author.num.prev.oa.tr
   0.35  last.author.num.prev.geoae.sharing.tr
   0.32  pubmed.is.effectiveness
   0.32  last.author.num.prev.oa.tr
   0.31  pubmed.is.geo.reuse
  -0.38  country.japan

Journal impact factor and policy
   0.48  journal.impact.factor.log
   0.47  jour.policy.requires.microarray.accession
   0.46  jour.policy.mentions.exceptions
   0.46  pubmed.num.cites.from.pmc.tr
   0.45  journal.5yr.impact.factor.log
   0.45  jour.policy.contains.word.miame.mged
   0.42  last.author.num.prev.pmc.cites.tr
   0.41  jour.policy.requests.accession
   0.40  journal.immediacy.index.log
   0.40  journal.num.articles.2008.tr
   0.39  years.ago.tr
   0.36  jour.policy.says.must.deposit
   0.35  pubmed.num.cites.from.pmc.per.year
   0.33  institution.mean.norm.citation.score
   0.32  last.author.year.first.pub.ago.tr
   0.31  country.usa
   0.31  last.author.num.prev.pubs.tr
   0.31  jour.policy.contains.word.microarray
  -0.31  pubmed.is.open.access

Higher Ed in USA
   0.36  institution.stanford
   0.36  institution.is.higher.ed
   0.35  country.usa
   0.35  has.R.funding
   0.33  has.R01.funding
   0.30  institution.harvard
  -0.37  institution.is.govnt
