specimen-level mining: bringing knowledge back 'home' to the natural history museum,...
Museum Impact: Linking-up specimens with research published on them
Ross Mounce
@rmounce
formerly at
About Me
Currently a Postdoc at plantsci.cam.ac.uk
a Fellow of software.ac.uk/fellows ('Class of 2016')
a researcher with contentmine.org
About This Talk (A little warning!)
● Don't expect to see much biology in this talk
● I'm going to talk about informatics
● I will focus on context, background and methods more than on 'results' per se
● There will be more questions than answers :)
Source: http://www.nhm.ac.uk/our-science/collections.html © The Trustees of the Natural History Museum, London
New
Open Data
Easy-to-use
Quick
Images
Audio
Interactive Maps
Citable
API access
Open Source Infrastructure
It’s not KE EMu :)
What I want to do: link specimen records to their mentions in the literature
“Micro-computed tomography scan slice through four bat skulls, displaying the relative position of the three semicircular canals within the skull. Scans are from the following species: (A) Pteropus rodricensis (BMNH.76.3.15.14); …”
NHM Data Portal Link (Stable, Unique Identifier): http://data.nhm.ac.uk/specimen/69e97f52-0275-4a82-9fa6-cf1c3949f408
Article DOI (Stable, Unique Identifier): http://dx.doi.org/10.1371/journal.pone.0061998
114,000,000 scholarly papers available online
36,000,000 of which are ‘Biology’ / ‘Environmental Studies’ / ‘Geosciences’ / ‘Multidisciplinary’
Khabsa, M. and Giles, C. L. 2014. The number of scholarly documents on the public web. PLoS ONE
Sadly, the vast majority of papers are only ‘available’ online to paying subscribers and no institution in the world has access to everything. Not even close to everything!
In 2016, libraries still pay subscriptions, and individuals pay per-article fees, to access even out-of-copyright works
??
http://outofcopyright.eu/rights-after-digitisation/
Some academic societies recognise the value of releasing out-of-copyright content
This is what a PDF looks like
PDF is NOT a good method of exchanging information
HTML is better, but lacks standardisation
+ italics & bold preserved, semantic links to figures & tables
- lacks standardisation
The industry standard format for scholarly articles is JATS XML
● The Journal Article Tag Suite (JATS) is an application of NISO Z39.96-2015, which defines a set of XML elements and attributes for tagging journal articles
● Standardising the format of digital scholarly publications is HIGHLY desirable
e.g. for this project, knowing whether the string 'NHM' occurs in the Materials section rather than the Acknowledgements section is hugely helpful. That is much harder to do with PDF/HTML.
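To make the point concrete, here is a minimal sketch of a section-aware search over JATS XML. The sample article is hand-written for illustration, and the function name sections_mentioning is hypothetical; it relies only on the standard JATS elements body, sec, title, back and ack. It is not the pipeline used in this project.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written JATS-like article (hypothetical content, illustration only).
SAMPLE_JATS = """<article>
  <body>
    <sec sec-type="materials|methods">
      <title>Materials and Methods</title>
      <p>Skulls were loaned from the NHM, London.</p>
    </sec>
    <sec sec-type="results">
      <title>Results</title>
      <p>Canal radius varied among species.</p>
    </sec>
  </body>
  <back>
    <ack>
      <p>We thank colleagues for comments.</p>
    </ack>
  </back>
</article>"""


def sections_mentioning(jats_xml, needle):
    """Return labels of the top-level sections whose text contains `needle`."""
    root = ET.fromstring(jats_xml)
    hits = []
    # Body sections: label each by its <title>, falling back to sec-type.
    for sec in root.findall("./body/sec"):
        label = sec.findtext("title") or sec.get("sec-type", "untitled")
        if needle in "".join(sec.itertext()):
            hits.append(label)
    # Acknowledgements live in <back><ack>, outside the body.
    for ack in root.findall("./back/ack"):
        if needle in "".join(ack.itertext()):
            hits.append("Acknowledgements")
    return hits


print(sections_mentioning(SAMPLE_JATS, "NHM"))
```

With PDF or unstructured HTML there is no reliable equivalent of the sec and ack elements, which is why the same question is so much harder to answer there.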
Section-based search already implemented in EuropePMC!
→ Section level search functionality in Europe PMC. Kafkas et al (2015) J Biomed Semantics
A plea for full text XML
A minority of journals do not provide full text XML
✓ PLOS, eLife, PeerJ, Pensoft, Wiley, Elsevier, Springer, NPG, Ubiquity Press, Copernicus, Hindawi, MDPI
✘ Geological Society of London Publications, Magnolia Press, a long tail of smaller publishers
Making fuller use of our expensively provisioned access
Image credit: Ubiquity Press
http://ubiquitypress.tumblr.com/post/96012592921/the-right-to-read-is-the-right-to-mine
UK Copyright Law has changed recently, giving a specific copyright exemption for non-commercial text and data mining work
A complicated, fragmented landscape of relevant journals
Nature + Science + PNAS + Phytotaxa + Zootaxa
BioOne Journals (131)
Springer Journals (32)
Wiley Journals (22)
Taylor & Francis Journals (14)
Elsevier Journals (12)
Oxford University Press Journals (8)
SciELO Journals (7) [Open Access but not in PMC]
Ecological Society of America Journals (6)
Geological Society Journals (4)
CSIRO Journals (4)
Cambridge University Press Journals (3)
Royal Society Journals (2)
Journal-omics!
I discover 'new' journals every week
e.g. last week I 'found' Oryctos (published between 1998 and 2010), still behind a paywall. Does anyone have access to this journal? Please let me know
http://www.dinosauria.org/oryctos.php
How are we meant to achieve a comprehensive aggregation of research literature (to do rigorous science, inclusive of all the evidence) when it is so unhelpfully scattered and we don't even know where it all is?
https://github.com/rossmounce/NHM-specimens
I don’t just find in-text mentions.
I’m trying to match them up to our NHM Data Portal records too!
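As a rough illustration of the first step, finding in-text mentions, here is a minimal regular-expression sketch. The pattern and the function name find_specimen_mentions are assumptions for illustration, covering only a few of the citation styles seen in this talk; it is not the pattern used in the project repository. The sample sentence is stitched together from identifiers quoted elsewhere in these slides.

```python
import re

# Illustrative pattern only: real-world specimen citations vary far more than this.
SPECIMEN_RE = re.compile(
    r"\b(?:BMNH|NHMUK)"      # institution acronym
    r"(?:\(E\))?"             # optional department tag, as in BMNH(E)609062
    r"[ .]?"                  # separator may be a space, a dot, or absent
    r"(?:[A-Z]{1,3} ){0,2}"   # optional collection codes, as in 'PV OR'
    r"\d+(?:\.\d+)*"          # number, possibly dotted: 76.3.15.14
)


def find_specimen_mentions(text):
    """Return every substring of `text` that looks like an NHM specimen code."""
    return [match.group(0) for match in SPECIMEN_RE.finditer(text)]


sample = (
    "Scans are from Pteropus rodricensis (BMNH.76.3.15.14); "
    "the type is BMNH 2013.2.13.3, and the Archaeopteryx is NHMUK PV OR 37001. "
    "Specimens were deposited in the British Museum of Natural History (BMNH)."
)
print(find_specimen_mentions(sample))
```

A bare acronym with no number after it, such as the final "(BMNH)" in the sample, is deliberately not reported, because it names a collection rather than a specimen.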
Specimens in RED do not appear to be on the Data Portal ...yet
Blue globe represents a PLOS ONE paper
Searching ALL full texts is not enough!!!
A significant number of specimens are probably ‘hiding-out’ in supplementary data files of all sorts of formats.
Google Scholar does not index SI
Web of Science doesn’t either
Nor does Scopus
At scale, journal-held supplementary data files are the ‘darkest corners’ of science
“Specimens were deposited in the collections of the California Academy of Sciences' Department of Herpetology (CAS), the British Museum of Natural History (BMNH) and of author GJM (Table S1)”
10.1371/journal.pone.0104628
http://rossmounce.co.uk/2015/06/20/deep-indexing-supplementary-data-files/
Why write such descriptive papers in natural language? Keep data as data!
The above was published in 2013(!)
Almost nothing in Nature & Science ‘full (short) text’
Context: 15 years worth of full text research in Nature & Science examined
Science: only 11 NHM specimens found in 39,600 full texts.
Nature: similar story. <30 specimens in 14,132 full texts.
Clearly there are more, but it’s all buried in supplementary materials :(
Blue globe represents a PLOS ONE paper
Very few specimens occur in more than one paper
Can you guess what BMNH 37001 is? Hint: it’s a very famous specimen!
Grey represents an NHMUK specimen
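One way to check how many specimens turn up in more than one paper is to group the mined (paper, specimen) pairs by specimen. The sketch below assumes the mining output is a simple list of pairs; the first two pairs come from examples in this talk, while the "doi:example-…" entries are hypothetical placeholders, as are the function names.

```python
from collections import defaultdict

# (paper identifier, specimen code) pairs.
# The "doi:example-..." identifiers are hypothetical placeholders.
mentions = [
    ("10.1371/journal.pone.0061998", "BMNH 76.3.15.14"),
    ("10.1007/s10228-014-0396-9", "BMNH 2013.2.13.3"),
    ("doi:example-A", "BMNH 37001"),
    ("doi:example-B", "BMNH 37001"),
    ("doi:example-B", "BMNH 37001"),  # repeated mention within the same paper
]


def papers_per_specimen(pairs):
    """Map each specimen code to the set of distinct papers mentioning it."""
    papers = defaultdict(set)
    for paper, specimen in pairs:
        papers[specimen].add(paper)
    return papers


def shared_specimens(pairs):
    """Return, sorted, the specimens mentioned in more than one distinct paper."""
    grouped = papers_per_specimen(pairs)
    return sorted(s for s, found_in in grouped.items() if len(found_in) > 1)


print(shared_specimens(mentions))
```

Using a set per specimen means repeated mentions inside one paper are counted once, so only genuine cross-paper reuse is reported.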
Huge variation in how specimens are cited (not helpful!)
PI AZ 8459 TEXSpruce6067
BM000922891 NYRaz054
BMNH(E)609062 MSB00509
Belize_CW_All_1071 F1629082
BM-BRIT-EURO 3948 OR.5379
“BMNH” is not necessarily British Museum of Natural History (UK). Can also be Beijing Museum of Natural History (CN) or Bell Museum of Natural History (US)
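A minimal sketch of how an ambiguous acronym might be narrowed down using the surrounding text. The three candidate institutions are the ones listed above; the context keywords and the function name resolve_acronym are illustrative assumptions, not a vetted gazetteer or the method used in this project.

```python
# Candidate institutions per acronym, each with context keywords.
# The keywords are illustrative assumptions and would need curating in practice.
CANDIDATES = {
    "BMNH": [
        ("British Museum of Natural History (UK)", ("london", "british museum")),
        ("Beijing Museum of Natural History (CN)", ("beijing",)),
        ("Bell Museum of Natural History (US)", ("bell museum", "minnesota")),
    ],
}


def resolve_acronym(acronym, context):
    """Return the candidate institutions whose keywords appear in `context`.

    Falls back to every known candidate when the context gives no clue,
    so an unresolved acronym stays visibly ambiguous rather than guessed.
    """
    options = CANDIDATES.get(acronym, [])
    lowered = context.lower()
    matched = [
        name for name, keywords in options
        if any(keyword in lowered for keyword in keywords)
    ]
    return matched or [name for name, _ in options]


print(resolve_acronym("BMNH", "the British Museum of Natural History (BMNH)"))
print(resolve_acronym("BMNH", "holotype BMNH 1234"))
```

Returning all candidates when nothing matches keeps a human in the loop for exactly the cases where a wrong link to the Data Portal would be most likely.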
Where possible use standard/permanent identifiers
Want to discuss a particular collection? Use the official GRSciColl identifier
The Global Registry of Scientific Collections (GRSciColl): http://grscicoll.org/
Which for the Natural History Museum, London (UK) is: NHMUK
http://biocol.org/urn:lsid:biocol.org:col:34665
Want to cite the BM Archaeopteryx specimen?
NHMUK PV OR 37001
http://data.nhm.ac.uk/object/57ee3bf1-0a74-4ae4-a588-ba9ea8dc5265
Credit: Davies KTJ, Bates PJJ, Maryanto I, Cotton JA, Rossiter SJ (2013) The Evolution of Bat Vestibular Systems in the Face of Potential Antagonistic Selection Pressures for Flight and Echolocation. PLoS ONE 8(4): e61998. doi:10.1371/journal.pone.0061998
Openly-licensed data on specimens, published elsewhere, could be re-incorporated back into the online museum catalogue. A one-stop shop for information.
Beyond linking: repatriation of knowledge
This is a CT-scan of “BMNH 76.3.15.14”. Without mining, I wouldn’t know this data exists.
Perhaps it could also be made available on the portal?
http://data.nhm.ac.uk/specimen/69e97f52-0275-4a82-9fa6-cf1c3949f408
Does published info make it back ‘home’ to the collections?
BMNH 2013.2.13.3 on the portal as “Petrochromis nov.sp. Takahashi”
I found it (by text mining) here: http://dx.doi.org/10.1007/s10228-014-0396-9
It’s now called Petrochromis horii n. sp., according to the paper.
What mechanisms are there to feed newer information back into the collection?
Content mining could definitely help keep collections data up-to-date!
Can we create a (better) digital NHM metadata catalogue entirely from the literature, hundreds of years before the NHM themselves complete their own digitisation programme?
Given funding and time, perhaps…
Acknowledgements
Sincere thanks to:
Aime Rankin for help with the project
The NHM Library staff, particularly Sarah Vincent for actively supporting my content mining
Nancy Chillingsworth (IPR, NHM London)
Mark Wilkinson (Life Sciences, NHM London)
Peter Murray-Rust & the ContentMine team
Vince Smith (Life Sciences, NHM London)
Ben Scott (NHM Data Portal Lead Architect)
Rod Page (University of Glasgow)
All of the Biodiversity Informatics team
http://contentmine.org/
For a more detailed version of this talk on YouTube see: bit.ly/nhmlink