Transcript
Page 1: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Identifying data sharing in the biomedical literature

Heather Piwowar and Wendy Chapman

Department of Biomedical Informatics, U of Pittsburgh

Page 2: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Visualized as a “Wordle” (font size ~ word frequency, location and orientation are random)

Our full paper:

Page 3: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Created at IBM’s data sharing and visualization site Many Eyes

Page 4: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Our aim:

Identify research articles for which the authors have shared their datasets

For this research:

sharing = submitted to centralized databases

Page 5: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature
Page 6: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature
Page 7: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Links between article and data are important

Page 8: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

The data provides detail for the results of the article

Page 9: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

The article provides detail for the data

Page 10: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Specialized searching methods help us find articles OR data...

but what about when we want articles WITH data?

Page 11: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

How can we find articles that have shared their datasets?

Page 12: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Sometimes the links are easy to discover

Page 13: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

1. Through database citations:

When authors upload data to a database, they have the opportunity to cite the paper that describes the data collection

Page 14: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature
Page 15: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature
Page 16: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Unfortunately, the citation is often left blank because the data is submitted before the paper is published

Page 17: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

2. Through hyperlink URLs in the text

Authors often reference their datasets within their paper with a website URL

Page 18: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature
Page 19: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

But the meaning of the hyperlinks is ambiguous. Sometimes they point to datasets that have been accessed, rather than submitted.

Page 21: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

And often the text contains no hyperlinks at all:

Page 22: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

3. Through text mining

Page 23: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

What if we could extract phrases like

“data of the experiment can be accessed at”

Page 24: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

full-text phrases containing “... accessed”

Page 25: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

“can be accessed” suggests data is shared

Page 26: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

BUT “was/were accessed” suggests data reuse!

Page 27: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

full-text phrases containing “... downloaded”

Page 28: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

“was/were downloaded” suggests data reuse

Page 29: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

while “can be downloaded” suggests data sharing
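The contrast on the last few slides can be sketched in a few lines of Python. This is only an illustration of the idea: the two patterns below are hypothetical and cover just the slide examples ("can be accessed/downloaded" versus "was/were accessed/downloaded"), not the expressions used in the study.

```python
import re

# Hypothetical cue patterns built from the slide examples only.
SHARING_CUE = re.compile(r"\bcan be (accessed|downloaded)\b", re.IGNORECASE)
REUSE_CUE = re.compile(r"\b(was|were) (accessed|downloaded)\b", re.IGNORECASE)


def classify_cue(sentence):
    """Return 'sharing', 'reuse', or 'unknown' for one sentence."""
    if SHARING_CUE.search(sentence):
        return "sharing"
    if REUSE_CUE.search(sentence):
        return "reuse"
    return "unknown"


print(classify_cue("Data of the experiment can be accessed at GEO."))
print(classify_cue("Expression profiles were downloaded from GEO."))
```

The verb tense carries the signal here, which is why a plain keyword search for "accessed" or "downloaded" cannot separate the two cases.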

Page 30: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Our aim:

Identify research articles for which the authors have shared their raw datasets.

Proposed approach:

Develop a system to identify statements of shared data from an article’s full text.

Page 31: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Materials:

Full text from a subset of the open access literature

Database submission citations from five databases:

• Genbank

• Protein Data Bank

• Gene Expression Omnibus

• ArrayExpress

• Stanford Microarray Database

Page 32: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Our Gold standard:

An article was considered to have a “shared dataset” if the article was cited within the primary submission field of a database entry

(+ a small amount of manual screening to find additional positives based on full text)

Page 33: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Approach:

For those articles that mention database names,

• Extract a 300-character window around every mention of a database name

• Apply various mining algorithms to decide if there is evidence that the authors deposited data from this study in the database
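The window-extraction step could look roughly like the sketch below. The slide says only "a 300-character window around every mention", so centering the window on the mention is an assumption, as are the function name `extract_windows` and the case-insensitive matching. Abbreviations such as GEO or PDB are not handled.

```python
import re

# Database names as listed on the Materials slide.
DATABASE_NAMES = [
    "Genbank",
    "Protein Data Bank",
    "Gene Expression Omnibus",
    "ArrayExpress",
    "Stanford Microarray Database",
]


def extract_windows(full_text, names=DATABASE_NAMES, width=300):
    """Return (name, excerpt) pairs, one per mention of a database name.

    Each excerpt is at most `width` characters, centered on the mention
    (the centering is an assumption) and clipped at the text boundaries.
    """
    windows = []
    for name in names:
        for match in re.finditer(re.escape(name), full_text, re.IGNORECASE):
            center = (match.start() + match.end()) // 2
            start = max(0, center - width // 2)
            end = min(len(full_text), center + width // 2)
            windows.append((name, full_text[start:end]))
    return windows
```

Each excerpt can then be passed to whichever classifier decides whether the authors deposited data in that database.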

Page 34: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Results:

• queried 24 000 articles across 27 journals

• 25% of all open access articles mentioned one of the database names (50% Genbank)

• development set of 4434 articles, training set of 2000, test set of 1028

Page 35: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

True positives:

23% of the articles that mentioned a database were cited from within a database submission field

= evidence that article shared its data!

Page 36: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Three simple methods for identifying sharing

Does the excerpt surrounding the database name contain:

1. the word “accession”

2. an accession number

3. a URL
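A minimal sketch of the three simple checks, applied to one excerpt. The accession-number pattern is an assumption for illustration: it matches only GEO-style identifiers (GSE, GSM, GPL, GDS followed by digits), whereas each database has its own formats. The URL pattern is also a simplification.

```python
import re

ACCESSION_WORD = re.compile(r"\baccession", re.IGNORECASE)
# Illustrative only: GEO-style identifiers such as GSE1234.
GEO_ACCESSION = re.compile(r"\bG(SE|SM|PL|DS)\d+\b")
# Simplified URL detector; real text needs more care with punctuation.
URL = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)


def simple_features(excerpt):
    """Return which of the three simple cues appear in the excerpt."""
    return {
        "accession_word": bool(ACCESSION_WORD.search(excerpt)),
        "accession_number": bool(GEO_ACCESSION.search(excerpt)),
        "url": bool(URL.search(excerpt)),
    }


print(simple_features("The data were deposited in GEO under accession number GSE1234."))
```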

Page 37: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Two complex methods:

4. A manually-derived regular expression to match lexical cues that suggest sharing

5. An automatically-derived bag of words decision tree
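To make the bag-of-words idea concrete, here is a toy sketch of a single-split decision tree (a decision stump) learned from word presence. It is not the classifier from the study: the slides do not say which learner or features were used, and the function names and training excerpts below are made up for illustration. It expects at least one training excerpt.

```python
import re
from collections import Counter


def bag_of_words(excerpt):
    """Lowercased word counts for one excerpt."""
    return Counter(re.findall(r"[a-z]+", excerpt.lower()))


def train_stump(excerpts, labels):
    """Pick the word whose presence best separates sharing (True) from not.

    Returns (word, label_if_present, label_if_absent). A full decision
    tree would repeat this split recursively on each branch.
    """
    bags = [bag_of_words(e) for e in excerpts]
    vocabulary = sorted(set().union(*bags))
    best = None
    for word in vocabulary:
        present = [lab for bag, lab in zip(bags, labels) if word in bag]
        absent = [lab for bag, lab in zip(bags, labels) if word not in bag]
        # Majority label on each side of the split.
        present_label = sum(present) * 2 >= len(present)
        absent_label = sum(absent) * 2 > len(absent)
        correct = sum(lab == present_label for lab in present)
        correct += sum(lab == absent_label for lab in absent)
        if best is None or correct > best[0]:
            best = (correct, word, present_label, absent_label)
    return best[1:]


def predict(stump, excerpt):
    """Apply the learned one-word rule to a new excerpt."""
    word, present_label, absent_label = stump
    return present_label if word in bag_of_words(excerpt) else absent_label
```

On a handful of made-up excerpts in which shared data is "deposited" and reused data is "downloaded" or "retrieved", the stump picks a word such as "deposited" as its split.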

Page 38: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Snippet of manually-developed regular expression

(we|have|has|is|are|was|were|be|been)

+

(accessioned|added|archived|assigned|deposited|entered|imported|included|inserted|loaded|lodged|placed|posted|provided|registered|reported to|stored|submitted|uploaded to)

Page 39: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

How accurately were these methods able to identify papers with evidence of public database submissions?

Page 40: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Recall: % of papers cited in database submission fields that were found by our methods

Page 41: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Recall: % of papers cited in database submission fields that were found by our methods

Best method for recall depends on database

Page 42: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Recall: % of papers cited in database submission fields that were found by our methods

“accession” good for some, <url> for others

Page 43: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Recall: % of papers cited in database submission fields that were found by our methods

lexical regular expressions do well overall

Page 44: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Precision: % of papers found by our methods that were cited in database submission fields
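As a worked version of the two definitions, the sketch below computes precision and recall from two sets of article identifiers: those flagged by a method and those cited in database submission fields (the gold standard). The function name and the identifiers in the example are hypothetical.

```python
def precision_recall(found, gold):
    """Return (precision, recall) for articles found by a method.

    found: identifiers of articles the method flagged as sharing data
    gold:  identifiers of articles cited in database submission fields
    """
    found, gold = set(found), set(gold)
    true_positives = len(found & gold)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Four articles flagged, three in the gold standard, two in common.
p, r = precision_recall({"a", "b", "c", "d"}, {"a", "b", "e"})
print(p, r)
```

In this example two of the four flagged articles are correct (precision 0.5) and two of the three gold-standard articles are found (recall about 0.67).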

Page 45: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Precision: % of papers found by our methods that were cited in database submission fields

lexical regular expressions do well overall, bag-of-words does even better

Page 46: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Precision: % of papers found by our methods that were cited in database submission fields

Precision of simple patterns depends on database

Page 47: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Precision: % of papers found by our methods that were cited in database submission fields

Simple patterns do poorly on the most popular databases (those with the most statements of reuse?)

Page 48: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Precision vs. Recall plot of all methods for each database.

Diverse!

Page 49: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

<url>

bag of words

“accession”

<accession>

<lexical patterns>

Relative strength of methods for this task across databases

Page 50: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Limitations:

• bias due to manual screening of negatives

• database-centric classifier

• approach requires computational access to literature full text!

Page 51: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Impact:

• A recent version that runs in PubMed Central:

• could increase GEO article links by 2.6%

• by 5.5% annually when all NIH in PMC

• double the recall (to 80%), double these estimates

• 40 links already added by GEO staff!

Page 52: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Ongoing work:

1. Continue focusing on methods that use existing full-text query interfaces, like PubMed Central

2. Use this tool to evaluate the patterns and prevalence of biomedical research data sharing and reuse

Page 53: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Thanks to

the Dept of Biomedical Informatics at the U of Pittsburgh,

the NLM for funding through training grant 5 T15 LM007059-22,

and everyone who publishes “gold” open access, thereby facilitating reuse of article full text for studies like this.

My shared data: www.dbmi.pitt.edu/piwowar

Share your research data too!

Page 54: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature
Page 55: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Our manual filter for additional positive classifications identified more cases in some databases than others: we reclassified 19% of [article,database] cases from ArrayExpress as positive despite an omitted literature link, compared to 11%, 7%, 2%, and 1% for GEO, Genbank, PDB, and SMD respectively (see Table 2 for raw number of cases). The most common situations included: the database entry listed a citation for another paper by the same authors, the entry listed an erroneous PubMed ID, the entry included a citation without a PubMed ID, or the entry had a blank citation field.

Page 56: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Usage?

• scientists looking for datasets for reuse

• curators looking for primary citations

• researchers studying data sharing behaviour

Page 57: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Regular expression

• Precise one +

• r"(\b(accession.{0,20}(for|at).{0,100}(is|are)))",

• r"(\b(raw|original|our|complete|detailed).{0,20}data)",

• r"(\b(we|have|has|is|are|was|were|be|been).(exported|gave|given|listed|provided|reported))"

• ]) + ")"
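The fragment above looks like a list of alternative patterns joined into one expression. A minimal sketch of that idea follows, using only the three patterns visible on the slide (duplicate alternatives dropped). The "precise" expression that the slide prepends is left out, and the case-insensitive flag and the helper name `has_sharing_cue` are assumptions.

```python
import re

# The three alternatives visible on the slide; the "precise" expression
# that the slide combines with them is not included here.
patterns = [
    r"(\b(accession.{0,20}(for|at).{0,100}(is|are)))",
    r"(\b(raw|original|our|complete|detailed).{0,20}data)",
    r"(\b(we|have|has|is|are|was|were|be|been).(exported|gave|given|listed|provided|reported))",
]

# Join the alternatives with "|" and wrap them in one outer group.
combined = re.compile("(" + "|".join(patterns) + ")", re.IGNORECASE)


def has_sharing_cue(excerpt):
    """True if any of the lexical patterns matches the excerpt."""
    return combined.search(excerpt) is not None


print(has_sharing_cue("The raw microarray data are in GEO."))
print(has_sharing_cue("Sequences were aligned with ClustalW."))
```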

Page 58: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Precise Regular expression

• (we|have|has|is|are|was|were|be|been).(accessioned|added|archived|assigned|deposited|entered|imported|included|inserted|loaded|lodged|placed|posted|provided|registered|reported.to|stored|submitted|uploaded.to)

(is|are|will.be|made).{0,20}(available|accessible)

(be).(accessed|browsed|downloaded|found|obtained|queried|retrieved|searched|viewed)

(through|under|as).{0,20}accession

(given|new|received|assigned).{0,20}(accession)

(data.{0,20}availability|for public distribution|for.{0,20}release upon publication|for the.{0,20}data.{0,20}generated|from this study have.{0,20}accession|data.{0,10}from this study|access to.{0,20}data.

Page 59: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Stopwords are important!

Page 60: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Recall

Page 61: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Precision

Page 62: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

• queried 24 000 articles across 27 journals

• 25% mentioned one of the database names

• development set of 4434, training set of 2000, test set of 1028

Evaluation

Page 63: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Research data

http://upload.wikimedia.org/wikipedia/commons/7/76/PeptideMSMS.jpg; http://en.wikipedia.org/wiki/Image:Helices.png; http://en.wikipedia.org/wiki/Image:Heatmap.png; http://en.wikipedia.org/wiki/Image:Microarray2.gif;

http://zellig.cpmc.columbia.edu/medlee/demo/; http://www.plosone.org/article/fetchArticle.action?articleURI=info:doi/10.1371/journal.pone.0000441

Page 65: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

PAST MEDICAL HISTORY: Past medical history showed she had superficial phlebitis times two in the past, had non-insulin dependent diabetes mellitus for four years. She had been hypothyroid for three years.

HISTORY OF PRESENT ILLNESS: The patient is a 58-year-old female, …
