Access to open data through open access articles in the life sciences
DESCRIPTION
Access to open data through open access articles in the life sciences. Johanna McEntyre (EMBL-European Bioinformatics Institute)
TRANSCRIPT
Literature-Data Integration in the Life Sciences
Lisbon, Oct 2nd 2012
Publications and Data Sources
26 million abstracts
2.3 million full text articles
Citation networks, database links, text mining
2012 2006 2011 2016?
Europe PubMed Central
How many open access articles in UKPMC?
PubMed (995K)
UKPMC (18%,182K)
OA (9.6%, 96K)
Total: 489,000 OA articles
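The percentages on this slide can be checked against the counts. The sketch below is only a sanity check of that arithmetic; it assumes the 995K PubMed figure is the denominator for both percentages and that the OA set is a subset of the UKPMC set, neither of which the slide states explicitly.

```python
# Counts as shown on the slide (rounded to the nearest thousand).
PUBMED = 995_000
UKPMC = 182_000
OA = 96_000


def share(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100 * part / whole


ukpmc_share = share(UKPMC, PUBMED)    # slide quotes 18%
oa_share = share(OA, PUBMED)          # slide quotes 9.6%
# Assumption: every OA article counted here is also in UKPMC.
oa_within_ukpmc = share(OA, UKPMC)

print(f"UKPMC share of PubMed: {ukpmc_share:.1f}%")
print(f"OA share of PubMed: {oa_share:.1f}%")
print(f"OA share of UKPMC: {oa_within_ukpmc:.1f}%")
```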
[Chart; x-axis: Publication Date]
[Six charts, each plotted by year, 2001 to 2010: European Nucleotide Archive (nucleotides, millions); Ensembl and Ensembl Genomes (genomes); UniProt (entries); InterPro (entries); ArrayExpress (hybridisations); PDBe (structures)]
Figure 2. Growth of key resources
• Big data
• Thematic data
• Public data
• Archived data
• Two petabytes of data
• Scales to 7 PB of raw disk
• Majority is DNA
Literature citation from data
vs
Data referral from literature
PMC336623 Extended to several other biological data types
Literature citation from data
• Proteins
• Nucleotides
• OMIM
• Chemicals
• Structure
• Clinical reviews
• Protein families
• Protein-protein interactions
• Gene expression experiments
800 K
370 K
110 K
Semantic Type    Unique Terms    Articles     Annotations
Accession No.    233,017         66,356       387,787
Chemical         76,712          1,694,385    83,923,066
Disease          171,692         1,768,214    57,821,871
Gene/Protein     227,318         1,310,382    77,189,022
GO Terms         32,664          1,832,294    65,061,579
Organism         180,637         1,713,280    70,832,222
Data referral from literature: text mining
2.3 million articles
[Two bar charts, scale 0 to 100. First chart categories: gen, pdb, sprot, genpept, geo, omim, pir, emblalign, pubchem, pmc. Second chart categories: gen, pdb, sprot, arrayexpress, pfam, interpro. Legend: publisher-annotated; text-mined]
Annotation of accession numbers (OA)
~10,000 articles >25,000 articles
BMC Genomics: 1,484 TM tagged, 4,337 articles (1,135 tagged)
PLoS One: 4,226 TM tagged, 42,888 articles
Senay Kafkas and Jee-Hyub Kim
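To make the idea of text-mining accession numbers concrete, here is a minimal sketch using regular expressions. It is not the pipeline behind the numbers on these slides: the two patterns, the `ACCESSION_PATTERNS` name, the `find_accessions` helper and the sample sentence are all assumptions made for illustration, and real accession formats have many more variants and need context to disambiguate.

```python
import re
from collections import defaultdict

# Illustrative patterns only (assumed, not taken from the slides).
ACCESSION_PATTERNS = {
    # Two letters followed by six digits, e.g. AY387398.
    "nucleotide": re.compile(r"\b([A-Z]{2}\d{6})\b"),
    # Four-character PDB code, accepted only after a "PDB" cue word.
    "pdb": re.compile(
        r"\bPDB(?:\s+(?:ID|code|entry))?:?\s*([1-9][A-Za-z0-9]{3})\b",
        re.IGNORECASE,
    ),
}


def find_accessions(text: str) -> dict:
    """Return {database: sorted list of accession strings found in text}."""
    hits = defaultdict(set)
    for database, pattern in ACCESSION_PATTERNS.items():
        for match in pattern.finditer(text):
            hits[database].add(match.group(1).upper())
    return {database: sorted(ids) for database, ids in hits.items()}


# Hypothetical sentence; "2xyz" is a made-up placeholder code.
sample = (
    "The sequence was deposited under accession number AY387398. "
    "Coordinates are available (PDB ID: 2xyz)."
)
found = find_accessions(sample)
print(found)
```

Requiring a cue word before the PDB code is one simple way to reduce false positives, since a bare four-character token is too ambiguous to tag on its own.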
Scientific: Linking articles that cite the same data
Citation: Data citation as a measure of impact (Thomson: Data Citation Index)
Context of data citation: submission, reuse, analysis
Operational: Services for publishers to improve accession number tagging
Editorial policies and adherence
Extension of NLM DTD
Lessons learned for considering unstructured data
Why is this important? Implications
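The first scientific implication, linking articles that cite the same data, can be sketched as an inverted index from accession number to articles. This is a minimal illustration under assumptions: the function name, the article identifiers and the `ACC-…` accessions are hypothetical placeholders, and only AY387398 appears elsewhere in this talk.

```python
from collections import defaultdict
from itertools import combinations


def link_articles_by_shared_data(article_accessions: dict) -> dict:
    """Map each pair of articles to the set of accessions both of them cite."""
    # Inverted index: accession -> articles that mention it.
    by_accession = defaultdict(set)
    for article_id, accessions in article_accessions.items():
        for accession in accessions:
            by_accession[accession].add(article_id)

    # Every pair of articles sharing an accession becomes a link.
    links = defaultdict(set)
    for accession, articles in by_accession.items():
        for first, second in combinations(sorted(articles), 2):
            links[(first, second)].add(accession)
    return dict(links)


# Hypothetical input: article IDs and the accessions tagged in each.
cited = {
    "article-1": {"AY387398", "ACC-0001"},
    "article-2": {"AY387398"},
    "article-3": {"ACC-0002"},
}
shared = link_articles_by_shared_data(cited)
print(shared)
```

Sorting the articles before pairing keeps each pair in one canonical order, so the same two articles are never stored under two different keys.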
That we can perform this analysis at all highlights a benefit of Open Access
Case Study of an FP7-funded article (1)
Case Study of an FP7-funded article (2)
Europe PubMed Central content map
Abstract Full text
Databases
Extracted terms
Unstructured datasets
Citing articles
Citing articles
AY387398: needle in a haystack
Europe PubMed Central and Institutional Repositories: content matching
Number of article IDs
OpenAIRE plus
Coming soon: RESTful interface for data linked to articles
People
• Paula Buttery
• Andrew Caines
• Norman Cobley
• Yuci Gou
• Senay Kafkas
• Jyothi Katuri
• Oliver Kilian
• Jee-Hyub Kim
• Nikos Marinos
• Jo McEntyre
• Xingjun Pi
• Philip Rossiter
• Rebholz Group
• Peter Stoehr
• University of Manchester
• British Library
• OpenAIRE/OpenAIRE Plus
• NCBI, NLM