access to open data through open access articles in the life sciences

Literature-Data Integration in the Life Sciences Lisbon, Oct 2nd 2012

Upload: conferencia-luso-brasileira-sobre-acesso-livre

Post on 11-May-2015


DESCRIPTION

Access to open data through open access articles in the life sciences. - Johanna McEntyre (EMBL-European Bioinformatics Institute)

TRANSCRIPT

Page 1: Access to open data through open access articles in the life sciences

Literature-Data Integration in the Life Sciences

Lisbon, Oct 2nd 2012

Page 2: Access to open data through open access articles in the life sciences

Publications and Data Sources

Page 3: Access to open data through open access articles in the life sciences

26 million abstracts

2.3 million full text articles

Citation networks

Database links

Text-mining

2006 2011 2012 2016?

Europe PubMed Central

Page 4: Access to open data through open access articles in the life sciences

How many open access articles in UKPMC?

PubMed (995K)

UKPMC (18%, 182K)

OA (9.6%, 96K)

Total: 489,000 OA articles

[Chart, x-axis: Publication Date]
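The percentages on this slide can be checked with a few lines of arithmetic. This is a minimal sketch that assumes both percentages are taken relative to the PubMed figure of 995K, which the slide does not state explicitly; the variable and function names are illustrative only.

```python
# Counts as shown on the slide, in thousands of articles.
pubmed = 995
ukpmc = 182
open_access = 96


def share(part, whole):
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Assumption: both slide percentages use the PubMed count as the denominator.
ukpmc_share = share(ukpmc, pubmed)
oa_share = share(open_access, pubmed)

# Derived figure, not on the slide: open access share within the UKPMC subset.
oa_within_ukpmc = share(open_access, ukpmc)

print(ukpmc_share, oa_share, oa_within_ukpmc)
```

Under that assumption the slide's 18% and 9.6% are reproduced after rounding, and roughly half of the UKPMC subset is open access.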

Page 5: Access to open data through open access articles in the life sciences

Figure 2. Growth of key resources

Panels (x-axis: Year, 2001 to 2010):

• European Nucleotide Archive: Nucleotides (millions)
• Ensembl and Ensembl Genomes: Genomes
• UniProt: Entries
• InterPro: Entries
• ArrayExpress: Hybridisations
• PDBe: Structures

• Big data
• Thematic data
• Public data
• Archived data

• Two petabytes of data
• Scales to 7 PB raw disk

• Majority is DNA

Page 6: Access to open data through open access articles in the life sciences

Literature citation from data

vs

Data referral from literature

Page 7: Access to open data through open access articles in the life sciences

PMC336623 Extended to several other biological data types

Page 8: Access to open data through open access articles in the life sciences

Literature citation from data

• Proteins
• Nucleotides
• OMIM
• Chemicals
• Structure
• Clinical reviews
• Protein families
• Protein-protein interactions
• Gene expression experiments

800 K

370 K

110 K

Page 9: Access to open data through open access articles in the life sciences

Semantic Type  Unique Terms   Articles  Annotations
Accession No.       233,017     66,356      387,787
Chemical             76,712  1,694,385   83,923,066
Disease             171,692  1,768,214   57,821,871
Gene/Protein        227,318  1,310,382   77,189,022
GO Terms             32,664  1,832,294   65,061,579
Organism            180,637  1,713,280   70,832,222

Data referral from literature: text mining

2.3 million articles
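Two derived figures help in reading the table: the average number of annotations per annotated article, and the share of the 2.3 million articles that carry at least one annotation of a given type. The sketch below computes both from the numbers on the slide; the function name and dictionary layout are illustrative assumptions.

```python
# (unique terms, articles, annotations) per semantic type, copied from the slide.
stats = {
    "Accession No.": (233_017, 66_356, 387_787),
    "Chemical": (76_712, 1_694_385, 83_923_066),
    "Disease": (171_692, 1_768_214, 57_821_871),
    "Gene/Protein": (227_318, 1_310_382, 77_189_022),
    "GO Terms": (32_664, 1_832_294, 65_061_579),
    "Organism": (180_637, 1_713_280, 70_832_222),
}
TOTAL_ARTICLES = 2_300_000  # "2.3 million articles" on the slide


def summarise(table, total):
    """Return annotations per article and article coverage (%) for each type."""
    rows = {}
    for name, (_terms, articles, annotations) in table.items():
        rows[name] = {
            "per_article": annotations / articles,
            "coverage_pct": 100 * articles / total,
        }
    return rows


summary = summarise(stats, TOTAL_ARTICLES)
for name, row in summary.items():
    print(f"{name:<15}{row['per_article']:>8.1f}{row['coverage_pct']:>8.1f}%")
```

Accession numbers stand out on both measures: they appear in about 3% of the articles, with roughly six mentions per article, whereas each of the other types appears in more than half of the articles.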

Page 10: Access to open data through open access articles in the life sciences

[Two bar charts, each with a y-axis from 0 to 100]

First chart categories: gen, pdb, sprot, genpept, geo, omim, pir, emblalign, pubchem, pmc

Second chart categories: gen, pdb, sprot, arrayexpress, pfam, interpro

publisher-annotated text-mined

Annotation of accession numbers (OA)

~10,000 articles >25,000 articles

BMC Genomics: 1,484 TM tagged, 4,337 articles (1,135 tagged)

PLoS One: 4,226 TM tagged, 42,888 articles

Senay Kafkas and Jee-Hyub Kim
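To give a sense of what text-mined tagging of accession numbers involves, here is a minimal pattern-based sketch. It is not the pipeline used for the figures on this slide: the two regular expressions are simplified assumptions, and the sample sentence and the PDB code `1ABC` are made up for illustration. Only `AY387398` is taken from the slides.

```python
import re

# Simplified, illustrative patterns (assumptions, not production rules).
PATTERNS = {
    # Nucleotide-style accession: one or two capitals plus five or six digits.
    "nucleotide": re.compile(r"\b([A-Z]{1,2}\d{5,6})\b"),
    # PDB-style code, accepted only when preceded by a "PDB" cue, because a
    # bare four-character pattern would also match years such as 2012.
    "pdb": re.compile(r"\bPDB(?:\s+(?:ID|code|entry))?:?\s+([1-9][A-Za-z0-9]{3})\b"),
}


def tag_accessions(text):
    """Return a sorted list of (database, accession) pairs found in text."""
    found = set()
    for database, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.add((database, match.group(1)))
    return sorted(found)


sample = (
    "Sequences were deposited under accession AY387398. "
    "The structure (PDB ID: 1ABC) was solved in 2012; see also PMC336623."
)
print(tag_accessions(sample))
```

The cue-word requirement for PDB codes illustrates why context matters: short identifiers are ambiguous in free text, so a tagger needs surrounding evidence before it assigns a database.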

Page 11: Access to open data through open access articles in the life sciences

Why is this important? Implications

Scientific: Linking articles that cite the same data

Citation: Data Citation as measure of impact (Thomson: Data citation index)

Context of data citation: submission, reuse, analysis

Operational: Services for publishers to improve Accession number tagging

Editorial policies and adherence

Extension of NLM DTD

Lessons learned for considering unstructured data

That we can perform this analysis at all highlights a benefit of Open Access
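Linking articles that cite the same data can be pictured as an inverted index from accession number to articles, followed by pairing the articles that share an entry. The sketch below assumes accession numbers have already been extracted per article; the article identifiers and two of the accession strings are hypothetical placeholders.

```python
from collections import defaultdict
from itertools import combinations


def link_articles(citations):
    """Map each pair of articles to the set of accessions they both cite."""
    # Inverted index: accession -> articles that mention it.
    by_accession = defaultdict(set)
    for article, accessions in citations.items():
        for accession in accessions:
            by_accession[accession].add(article)

    # Every pair of articles sharing an accession becomes a link.
    links = defaultdict(set)
    for accession, articles in by_accession.items():
        for first, second in combinations(sorted(articles), 2):
            links[(first, second)].add(accession)
    return dict(links)


# Hypothetical input: only AY387398 appears in the slides.
citations = {
    "article-A": {"AY387398", "hypothetical-acc-1"},
    "article-B": {"AY387398"},
    "article-C": {"hypothetical-acc-2"},
}
links = link_articles(citations)
print(links)
```

The number of accessions a pair shares gives a simple link weight, and counting the articles per accession gives a crude data citation count of the kind the Citation item refers to.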

Page 12: Access to open data through open access articles in the life sciences

Case Study of an FP7-funded article (1)

Page 13: Access to open data through open access articles in the life sciences

Case Study of an FP7-funded article (2)

Page 14: Access to open data through open access articles in the life sciences

Europe PubMed Central content map

Abstract Full text

Databases

Extracted terms

Unstructured datasets

Citing articles

Citing articles

Page 15: Access to open data through open access articles in the life sciences

AY387398: needle in a haystack

Page 16: Access to open data through open access articles in the life sciences

Europe PubMed Central and Institutional Repositories: content matching

Number of article IDs

OpenAIRE plus

**Coming soon: RESTful interface for data linked to articles

Page 17: Access to open data through open access articles in the life sciences

People

• Paula Buttery
• Andrew Caines
• Norman Cobley
• Yuci Gou
• Senay Kafkas
• Jyothi Katuri
• Oliver Kilian
• Jee-Hyub Kim
• Nikos Marinos
• Jo McEntyre
• Xingjun Pi
• Philip Rossiter

• Rebholz Group
• Peter Stoehr

• University of Manchester
• British Library

• OpenAIRE/OpenAIRE Plus

• NCBI, NLM