thesis proposal, as presented for dissertation proposal defense

Post on 01-Nov-2014

DESCRIPTION

The slides I presented for my PhD proposal defense for my project, "Foundational studies for measuring the impact, prevalence, and patterns of publicly sharing biomedical research data." Dept of Biomedical Informatics, University of Pittsburgh.

TRANSCRIPT

Foundational studies for measuring the impact, prevalence, and patterns of publicly sharing biomedical research data

Heather Piwowar

Department of Biomedical Informatics

University of Pittsburgh

Sharing research data

PAST MEDICAL HISTORY:

Past medical history showed she had superficial phlebitis times two in the past, had non-insulin dependent diabetes mellitus for four years.

She had been hypothyroid for three years.

HISTORY OF PRESENT ILLNESS:

The patient is a 58-year-old female, …

http://upload.wikimedia.org/wikipedia/commons/7/76/PeptideMSMS.jpg; http://en.wikipedia.org/wiki/Image:Helices.png; http://en.wikipedia.org/wiki/Image:Heatmap.png; http://en.wikipedia.org/wiki/Image:Microarray2.gif; http://zellig.cpmc.columbia.edu/medlee/demo/; http://www.plosone.org/article/fetchArticle.action?articleURI=info:doi/10.1371/journal.pone.0000441

Shared data benefits science

Verify, Understand, Extend, Explore, Combine, Synergize, Train, Reduce

But... costly for authors: Find, Organize, Document, Deidentify, Format, Decide, Ask, Submit

• Answer questions

• Worry about mistakes being found

• Worry about data being misinterpreted

• Worry about being scooped

• Forgo money and IP and prestige???

As a result, policy makers have spent lots of time and money....

http://www.flickr.com/photos/tonivc/2283676770/

http://www.flickr.com/photos/johnnyvulkan/381941233/

...on initiatives, requests, requirements, and tools

NIH data sharing plan requirement

Journal requirements

Databases

Data sharing grids like BIRN and caBIG

Standards

Editorials, letters to the editor, discussion....

lots of data sharing!

http://www.genome.jp/en/db_growth.html

but how much isn’t shared?

what isn’t shared?

who isn’t sharing it? why not?

what can we do about it?

how much does it matter?

you cannot manage what you do not measure

http://www.flickr.com/photos/archeon/2941655917/

Long-term motivation:

I believe that analysis of the impact, prevalence, and patterns with which investigators share and withhold gene expression microarray research data can uncover rewards, best practices, and opportunities for increased adoption of data sharing.

Aim 1: Does sharing have benefit for those who share?

Aim 2: Can sharing and withholding be systematically measured?

Aim 3: How often is data shared? What predicts sharing? How can we model sharing behavior?

Related research

Data usually collected via surveys and/or manual audits

http://www.flickr.com/photos/jima/606588905/

Noor et al. PLoS Biology 2006. Ochsner et al. Nature Methods 2008.

Piwowar et al. PLoS ONE 2007. Editorial. Nature Biotech 2007.

[Figure: Prevalence of data sharing via manual audit. Bars for DNA sequences, gene expression microarrays, and proteomics spectra; axis from 0% to 100%.]

[Figure: Prevalence of data withholding via surveys. Bars: self-reported denying a request in last 3 years; trainees self-reported denying a request; been denied access to data, materials, code; authors “not able to retrieve raw data”; not willing to release data. Axis from 0% to 40%.]

Campbell et al. JAMA. 2002. Kyzas et al. J Natl Cancer Inst. 2005.

Vogeli et al. Acad Med. 2006. Reidpath et al. Bioethics 2001.

Campbell et al. JAMA 2002.

[Figure: Self-reported reasons for data withholding. Bars: sharing is too much effort; want student or jr faculty to publish more; they themselves want to publish more; cost; industrial sponsor; confidentiality; commercial value of results. Axis from 0% to 80%.]

Blumenthal et al. Acad Med. 2006

[Figure: Correlates with self-reported data withholding. Rows: industry involvement; perceived competitiveness of field; male; sharing discouraged in training; human participants; academic productivity. Axis from 0 to 3.]

Models of data and knowledge sharing

Andriessen. Conditions for the willingness to share knowledge, 2006.

Harder. SMG WP 6/2008.

Cabrera and Cabrera. Int J of HR Mgmt. 2005.

Kuo. JASIST. 2008.

Limitations of the related research

• manual audits: small sample sizes

• surveys: few variables + self-reporting bias

• not much focus on measuring demonstrated behavior

• not much focus on impact or policy

• not much focus on biomedical data other than DNA sequences

Needed:

a study of data sharing behavior and impact that includes

• a measurement of demonstrated behavior

• policy variables

• estimate of rewards

• a broad and deep selection of data creation instances

• a focus on biomedical data other than DNA sequences

Aim 1: Does sharing have benefit for those who share?

Aim 2: Can sharing and withholding be systematically measured?

Aim 3: How often is data shared? What predicts sharing? How can we model sharing behavior?

Scope of current study

• type of data: gene expression microarrays

• sharing mechanism: centralized databases

• studies: English full text available in a centralized portal

• covariates: extracted from Medline and database sources

http://en.wikipedia.org/wiki/DNA_microarray

http://en.wikipedia.org/wiki/Image:Heatmap.png

Preliminary research

Aim 1

Aim 1: Does sharing have benefit for those who share?

http://www.flickr.com/photos/sunrise/35819369/

Note the logarithmic scale

Aim 1: Associated citation increase

http://www.flickr.com/photos/sunrise/35819369/

Next:

What factors predict sharing?

http://www.flickr.com/photos/ryanr/142455033/

Can I use the same methods of Aim 1 to choose studies and determine data sharing status?

No, those methods don’t scale to identify or classify enough data points

Aim 2

Need automated methods to:

Identify studies that generate datasets that could potentially be shared (Aim 2a)

Determine which of these have in fact been shared (Aim 2b)

Aim 2a: Identify studies that create gene expression microarray data

http://www.flickr.com/photos/lofaesofa/248546821/

Easy, via MeSH indexing terms?

gene expression profiling and/or microarray analysis

Unfortunately, these have neither high recall nor precision.

BUT this requires developing and maintaining a full-text archive!

What about using PubMed Central?

Can reach ~85% of articles with full-text links via U of Pittsburgh library subscriptions, when combined with two other full-text query portals:

Derive a full-text query with sufficiently high recall (>1250 studies) and precision (>70%).
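As a rough illustration of how a candidate query could be scored against a reference standard, the sketch below computes precision and recall from two sets of PubMed IDs. The function name and the IDs are hypothetical placeholders, not values from the study.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for retrieved IDs scored against a reference standard."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    # Guard against empty sets so the function never divides by zero.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical PubMed IDs, for illustration only.
reference_standard = {"101", "102", "103", "104"}  # studies known to create microarray data
query_hits = {"101", "102", "103", "201"}          # studies returned by a candidate query

precision, recall = precision_recall(query_hits, reference_standard)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here three of the four hits are in the reference standard, and three of the four reference studies are found, so both measures come out at 0.75.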

Reference standard?

Ochsner et al.

• 2007

• 20 journals

• broad query for microarray studies

• identified 400 studies that created gene expression microarray data

Development corpus?

PubMed Central Open Access subset + TREC Genomics IR subset = about 5000 relevant articles with about 50% true positive rate

Development approach?

• Pattern building via manual inspection

• Classification decision trees with n-grams

• Borrow approaches from

  • Autoslog-TS

  • automated regular expression building

  • semi-supervised learning

  • retrieval query aspects
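To make the n-gram idea concrete, here is a minimal sketch of the feature-extraction step only: it turns a sentence into counts of word unigrams and bigrams, which could then be fed to a classifier such as a decision tree. The function name, the whitespace tokenization, and the example sentence are assumptions for illustration; the classifier itself is not shown.

```python
from collections import Counter


def word_ngrams(text, n_values=(1, 2)):
    """Count word n-grams (unigrams and bigrams by default) in a piece of text."""
    tokens = text.lower().split()
    grams = Counter()
    for n in n_values:
        # Slide a window of n tokens across the text.
        for i in range(len(tokens) - n + 1):
            grams[" ".join(tokens[i:i + n])] += 1
    return grams


# Hypothetical sentence of the kind a methods section might contain.
sentence = "RNA was hybridized to Affymetrix arrays"
features = word_ngrams(sentence)
print(features["affymetrix arrays"])
```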

Aim 2b

Aim 2b: Identify studies that share their expression microarray data

http://www.flickr.com/photos/dcassaa/422261773/

pmc_gds[filter]

Unfortunately, the submission citation is often left blank when data is submitted prior to publication.

To achieve 70% recall, I may have to supplement with a query of the full text, such as:

(geo OR omnibus) AND microarray AND "gene expression" AND accession NOT (databases OR user OR users OR (public AND accessed) OR (downloaded AND published))
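The logic of this query can be sketched as a small local filter. This is only an approximation under stated assumptions: it uses exact, case-insensitive word matching, whereas a real full-text portal may stem or tokenize differently, and the function name and both example sentences (including the accession number) are made up for illustration.

```python
import re


def matches_geo_query(text):
    """Apply the Boolean full-text query above to one document's text."""
    lowered = text.lower()
    words = set(re.findall(r"[a-z0-9]+", lowered))

    def has(term):
        return term in words

    include = (
        (has("geo") or has("omnibus"))
        and has("microarray")
        and "gene expression" in lowered  # quoted phrase, matched as a substring
        and has("accession")
    )
    # Terms that suggest the article reuses or describes data rather than sharing it.
    exclude = (
        has("databases") or has("user") or has("users")
        or (has("public") and has("accessed"))
        or (has("downloaded") and has("published"))
    )
    return include and not exclude


# Hypothetical sentences; the accession number is invented.
sharing = "The microarray gene expression data are available in GEO under accession GSE0000."
reusing = "We downloaded published microarray gene expression data from GEO by accession."
print(matches_geo_query(sharing), matches_geo_query(reusing))
```

The first sentence passes because it satisfies every required term and none of the exclusions; the second is rejected by the (downloaded AND published) clause.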

Reference standard:

Aim 3

Aim 3: How often is data shared? What predicts sharing? How can we model sharing behavior?

http://www.flickr.com/photos/ryanr/142455033/

Aim 3a: Prevalence of data sharing

PubMed ID   Portal   Created data?
234         PMC      Yes
345         HighPr   Yes
456         Scirus   Yes
567         PMC      Yes
678         PMC      Yes
789         HighPr   No
890         PMC      No
901         -        ?

PubMed ID   Portal   Created data?   Shared data?
234         PMC      Yes             Yes
345         HighPr   Yes             Yes
456         Scirus   Yes             Yes
567         PMC      Yes             NO
678         PMC      Yes             NO

Prevalence = (Number with Shared data) / (Number with Created data)
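The prevalence ratio can be worked through on the toy rows from the slides. This is a minimal sketch: the record layout and function name are assumptions, and the PubMed IDs are the slide's placeholder values rather than real articles.

```python
# Toy rows from the slides; True/False/None stand in for Yes/No/unknown.
studies = [
    {"pmid": "234", "portal": "PMC", "created": True, "shared": True},
    {"pmid": "345", "portal": "HighPr", "created": True, "shared": True},
    {"pmid": "456", "portal": "Scirus", "created": True, "shared": True},
    {"pmid": "567", "portal": "PMC", "created": True, "shared": False},
    {"pmid": "678", "portal": "PMC", "created": True, "shared": False},
    {"pmid": "789", "portal": "HighPr", "created": False, "shared": None},
    {"pmid": "890", "portal": "PMC", "created": False, "shared": None},
    {"pmid": "901", "portal": None, "created": None, "shared": None},
]


def prevalence(rows):
    """Share of data-creating studies whose data were also shared."""
    created = [row for row in rows if row["created"]]
    if not created:
        raise ValueError("no studies with created data")
    shared = sum(1 for row in created if row["shared"])
    return shared / len(created)


sharing_prevalence = prevalence(studies)
print(sharing_prevalence)
```

Only the five studies that created data enter the denominator, and three of them shared, giving 3/5 = 0.6.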

Aim 3b: Correlates with data sharing

Features to include:

• Does the journal have a data sharing policy?

• Is the study funded by the NIH?

• Number of authors

• Research-orientation of the primary institution

• Journal impact factor

• Are the samples from humans?

• Disease of study

• Year of publication

• …

PubMed ID   Portal   Created data?   Shared data?   Journal policy   NIH funds?   # authors   ...
234         PMC      Yes             Yes            strong           yes          2
345         HighPr   Yes             Yes            weak             yes          5
456         Scirus   Yes             Yes            weak             no           6
567         PMC      Yes             NO             strong           yes          5
678         PMC      Yes             NO             strong           no           2

(Journal policy, NIH funds?, # authors, ... are the covariates.)

[Diagram relating Shared data? to the covariates: Journal policy?, NIH funded?, # authors, ...]
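As a first descriptive look at correlates, one could tabulate the sharing rate within each level of a covariate. The sketch below does this for the toy rows on the slides; it is a simple tabulation, not the statistical method of the study, and the field names and function name are assumptions.

```python
from collections import defaultdict

# Toy rows from the slides (all five studies created data).
rows = [
    {"pmid": "234", "shared": True, "journal_policy": "strong", "nih_funds": True, "n_authors": 2},
    {"pmid": "345", "shared": True, "journal_policy": "weak", "nih_funds": True, "n_authors": 5},
    {"pmid": "456", "shared": True, "journal_policy": "weak", "nih_funds": False, "n_authors": 6},
    {"pmid": "567", "shared": False, "journal_policy": "strong", "nih_funds": True, "n_authors": 5},
    {"pmid": "678", "shared": False, "journal_policy": "strong", "nih_funds": False, "n_authors": 2},
]


def sharing_rate_by(rows, covariate):
    """Return {covariate level: fraction of studies at that level that shared data}."""
    totals = defaultdict(int)
    shared = defaultdict(int)
    for row in rows:
        level = row[covariate]
        totals[level] += 1
        if row["shared"]:
            shared[level] += 1
    return {level: shared[level] / totals[level] for level in totals}


rate_by_policy = sharing_rate_by(rows, "journal_policy")
rate_by_funding = sharing_rate_by(rows, "nih_funds")
print(rate_by_policy, rate_by_funding)
```

With only five toy rows the rates carry no meaning; the point is the shape of the computation, which extends to any categorical covariate in the table.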

Aim 3c: Model of data sharing

[Diagram: the covariates (Journal policy, NIH funds?, # authors, ...) grouped under Mandates, Amount of Collaboration, ... and linked to Shared data?, with links labeled Strong or Weak.]

Assumptions

That the following limitations are randomly distributed:

• Ambiguous author names

• The method of describing data generation

• Studies with data in GEO but no submission links

• Studies that don’t mention sharing in the full-text article

The first and last authors are usually primary decision-makers about whether to share data

Citations are a valued, though imperfect, measure of research impact

Limitations

Association does not imply causation

Only one datatype: microarray data.

Only considering sharing in the primary centralized databases.

Many variables are USA-centric.

Results will only be generalizable to research studies made available in full-text portals.

Risks and contingency plans

• NLP performance may be inadequate: supplement with manual annotating via Mechanical Turk

• Author ambiguity may introduce extreme outliers: use Author-ity software on extreme outliers

• Unable to derive a robust exploratory factor model: try other clustering techniques

• Several variables may be unexpectedly difficult to extract: if not essential, defer the analysis of that variable to future work

Contributions

• an assessment of the observed and measured rewards, prevalence, and patterns of gene expression microarray dataset sharing

• a publicly available dataset associating microarray study publications with data sharing status

• a generalizable approach for developing practical, real-world information retrieval using centralized full-text query portals

• preliminary models of data sharing behavior

Publication plan

http://www.flickr.com/photos/linkwize/926334421/

Publication plan: Aim 1

Do studies with publicly shared datasets receive more citations?

Published in PLoS ONE in February 2007

Publication plan: Aim 2a

How can we identify studies that generate certain data, given full-text query access through centralized portals?

Targeted journal: Journal of Medical Internet Research? BMC Bioinformatics? other?

Publication plan: Aim 2b, 3a, 3b

What factors are associated with demonstrated data sharing behavior?

Targeted journal: BMC Bioinformatics? BMC Biology? PLoS Biology? a research policy journal? other?

Publication plan: Aim 3c

Derive (and validate?) a preliminary model of demonstrated research data sharing behavior

Targeted journal: JASIST (Journal of the American Society for Information Science and Technology)? Information Research? Journal of Documentation? Science Communication? Data Science Journal? other?

Future work

1. Identify and model data reuse

2. Citation analysis of the large cohort

3. Supplement with survey responses

4. Generalize the method for creating queries for full-text portals

http://www.flickr.com/photos/cogdog/123072/

Data sharing plan

I plan to share my code, data, and process openly during the research via blogs and repositories.

http://www.flickr.com/photos/myklroventine/892446624/

Thanks to

the Dept of Biomedical Informatics at the U of Pittsburgh,

the NLM for funding through training grant 5 T15 LM007059-22,

those with photos on Flickr under a Creative Commons license,

Wendy for her support and feedback, and my committee for anticipated feedback....

Questions and Suggestions?

Audience

• Funders, policy makers and thought leaders.

• Database, software, and data standard developers.

• Biomedical informatics community.

• Information science and digital library community.

• Open Science community.

• Primary Investigators.

Recent related grants

NIH: Haga, S. Exploring Attitudes About Data Disclosure and Data-Sharing in Genomics Research.

NSF: Hedstrom, M. Incentives for Data Producers to Create Archive-Ready Data Sets.

National Inst of Nursing Research: Pienta, A. Barriers and Opportunities for Sharing Research Data.

+others
