specimen-level mining: bringing knowledge back 'home' to the natural history museum,...

Museum Impact: Linking-up specimens with research published on them

Ross Mounce

@rmounce

formerly at

About Me

Currently a Postdoc at http://plantsci.cam.ac.uk/

a Fellow of the Software Sustainability Institute ('Class of 2016'): http://www.software.ac.uk/fellows

a researcher with http://contentmine.org/

About This Talk (A little warning!)

● Don't expect to see much biology in this talk

● I'm going to talk about informatics

● I will focus on context, background and methods more than on 'results' per se

● There will be more questions than answers :)

Source: http://www.nhm.ac.uk/our-science/collections.html © The Trustees of the Natural History Museum, London

New

Open Data

Easy-to-use

Quick

Images

Audio

Interactive Maps

Citable

API access

Open Source Infrastructure

It’s not KE Emu :)

What I want to do: link specimen records to their mentions in the literature

“Micro-computed tomography scan slice through four bat skulls, displaying the relative position of the three semicircular canals within the skull. Scans are from the following species: (A) Pteropus rodricensis (BMNH.76.3.15.14); …”

NHM Data Portal Link (Stable, Unique Identifier): http://data.nhm.ac.uk/specimen/69e97f52-0275-4a82-9fa6-cf1c3949f408

Article DOI (Stable, Unique Identifier): http://dx.doi.org/10.1371/journal.pone.0061998
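A minimal sketch of the linking idea in Python. The regular expression is an illustrative assumption that only covers the 'BMNH' plus dotted-number style seen in the caption above; the function name and the layout of the link record are hypothetical, and the portal URL is paired by hand here rather than looked up.

```python
import re

# Illustrative pattern only: "BMNH", an optional "." or space, then a dotted
# register number such as 76.3.15.14. Real citations vary far more than this.
SPECIMEN_PATTERN = re.compile(r"\bBMNH[.\s]?(\d+(?:\.\d+)+)")


def find_specimen_mentions(text):
    """Return the dotted register numbers that follow 'BMNH' in the text."""
    return SPECIMEN_PATTERN.findall(text)


caption = (
    "Scans are from the following species: (A) Pteropus rodricensis "
    "(BMNH.76.3.15.14); ..."
)

mentions = find_specimen_mentions(caption)

# One link = a specimen mention, the article it came from, and the portal record.
# The dictionary keys are hypothetical names chosen for this sketch.
link = {
    "specimen": mentions[0],
    "article_doi": "10.1371/journal.pone.0061998",
    "portal_url": "http://data.nhm.ac.uk/specimen/69e97f52-0275-4a82-9fa6-cf1c3949f408",
}
print(link)
```

Two stable identifiers on either side of the link are what make it durable: neither the article nor the specimen record has to be re-found later.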

114,000,000 scholarly papers available online

36,000,000 of which are ‘Biology’ / ‘Environmental Studies’ / ‘Geosciences’ / ‘Multidisciplinary’

Khabsa, M. and Giles, C. L. 2014. The number of scholarly documents on the public web. PLoS ONE

Sadly, the vast majority of papers are only ‘available’ online to paying subscribers and no institution in the world has access to everything. Not even close to everything!

In 2016, libraries pay subscriptions, and individuals pay per-article fees, to access even out-of-copyright works

??

http://outofcopyright.eu/rights-after-digitisation/

Some academic societies recognise the value of releasing out-of-copyright content

This is what a PDF looks like

PDF is NOT a good method of exchanging information

HTML is better, but lacks standardisation

+ italics & bold preserved, semantic links to figures & tables

- lacks standardisation

The industry standard format for scholarly articles is JATS XML

● The Journal Article Tag Suite (JATS) is an application of NISO Z39.96-2015, which defines a set of XML elements and attributes for tagging journal articles

● Standardising the format of digital scholarly publications is HIGHLY desirable

e.g. for this project, knowing whether the string 'NHM' occurs in the Materials section rather than the Acknowledgements section is hugely helpful. This is much harder to do with PDF or HTML.
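As a rough illustration of why tagged sections help, the Python sketch below reports which parts of an article mention a search term. The XML fragment is hand-written for this example (it is not a real article), and the function name is hypothetical; it relies only on the JATS `sec` element with its `sec-type` attribute and the `ack` element.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written JATS-like fragment, invented for illustration.
JATS_SAMPLE = """<article>
  <body>
    <sec sec-type="materials|methods">
      <title>Materials and Methods</title>
      <p>Skulls were loaned by the NHM.</p>
    </sec>
    <sec sec-type="results">
      <title>Results</title>
      <p>Canal radius varied between species.</p>
    </sec>
  </body>
  <back>
    <ack><p>We thank colleagues at the NHM for discussion.</p></ack>
  </back>
</article>"""


def sections_mentioning(xml_text, term):
    """Return labels of the body sections and acknowledgements containing term."""
    root = ET.fromstring(xml_text)
    hits = []
    # Top-level body sections: label by sec-type, falling back to the title.
    for sec in root.iterfind("./body/sec"):
        if term in "".join(sec.itertext()):
            hits.append(sec.get("sec-type") or sec.findtext("title", default="untitled"))
    # Acknowledgements live in the back matter.
    for ack in root.iterfind("./back/ack"):
        if term in "".join(ack.itertext()):
            hits.append("ack")
    return hits


print(sections_mentioning(JATS_SAMPLE, "NHM"))
```

With PDF or untagged HTML the same question needs layout heuristics; here it is a two-line tree query, and a hit in "ack" can simply be discarded.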

Section-based search already implemented in EuropePMC!

→ Section level search functionality in Europe PMC. Kafkas et al (2015) J Biomed Semantics

A plea for full text XML

A minority of journals do not provide full text XML

✓ PLOS, eLife, PeerJ, Pensoft, Wiley, Elsevier, Springer, NPG, Ubiquity Press, Copernicus, Hindawi, MDPI

✘ Geological Society of London Publications, Magnolia Press, a long tail of smaller publishers

Making fuller use of our expensively provisioned access

Image credit: Ubiquity Press

http://ubiquitypress.tumblr.com/post/96012592921/the-right-to-read-is-the-right-to-mine

UK Copyright Law has changed recently, giving a specific copyright exemption for non-commercial text and data mining work

A complicated, fragmented landscape of relevant journals

Nature + Science + PNAS + Phytotaxa + Zootaxa
BioOne Journals (131)
Springer Journals (32)
Wiley Journals (22)
Taylor & Francis Journals (14)
Elsevier Journals (12)
Oxford University Press Journals (8)
SciELO Journals (7) [Open Access but not in PMC]
Ecological Society of America Journals (6)
Geological Society Journals (4)
CSIRO Journals (4)
Cambridge University Press Journals (3)
Royal Society Journals (2)

Journal-omics!

I discover 'new' journals every week

e.g. last week I 'found' Oryctos (published between 1998 and 2010), still behind a paywall. Does anyone have access to this journal? Please let me know.

http://www.dinosauria.org/oryctos.php

How are we meant to achieve a comprehensive aggregation of research literature (to do rigorous science, inclusive of all the evidence) when it is so unhelpfully scattered and we don't even know where it all is?


https://github.com/rossmounce/NHM-specimens

I don’t just find in-text mentions.

I’m trying to match them up to our NHM Data Portal records too!

Specimens in RED do not appear to be on the Data Portal ...yet

Blue globe represents a PLOS ONE paper

Searching ALL full texts is not enough!!!

A significant number of specimens are probably ‘hiding out’ in supplementary data files of all sorts of formats.

Google Scholar does not index supplementary information (SI)

Web of Science doesn’t either

Nor does Scopus

At scale, journal-held supplementary data files are the ‘darkest corners’ of science

“Specimens were deposited in the collections of the California Academy of Sciences' Department of Herpetology (CAS), the British Museum of Natural History (BMNH) and of author GJM (Table S1)”

doi:10.1371/journal.pone.0104628

http://rossmounce.co.uk/2015/06/20/deep-indexing-supplementary-data-files/
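A small Python sketch of what 'deep indexing' a supplementary table could look like once the file is in hand. The table rows are placeholders invented for illustration (they are not from the paper above), the voucher pattern is an assumption, and the function name is hypothetical; only the two collection codes come from the quoted sentence.

```python
import csv
import io
import re

# Placeholder rows invented for illustration; not taken from any real table.
SAMPLE_TABLE = """species,voucher
Example species one,CAS 000001
Example species two,BMNH 0000.0.0.1
Example species three,field tag only
"""

# Collection codes named in the quoted sentence above.
CODES = ("CAS", "BMNH")
# Assumed voucher shape: a code, an optional space, then digits and dots.
VOUCHER = re.compile(r"\b(" + "|".join(CODES) + r")\s?([\d.]+)")


def vouchers_in_csv(csv_text):
    """Return (code, number) pairs found in any cell of a CSV table."""
    found = []
    for row in csv.reader(io.StringIO(csv_text)):
        for cell in row:
            found.extend(VOUCHER.findall(cell))
    return found


print(vouchers_in_csv(SAMPLE_TABLE))
```

Real supplementary files also arrive as spreadsheets, Word documents and PDFs, so each format would need its own reader in front of the same matching step.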

Why write such descriptive papers in natural language? Keep data as data!

The above was published in 2013(!)

Almost nothing in Nature & Science ‘full (short) text’

Context: 15 years' worth of full-text research in Nature & Science examined

Science: only 11 NHM specimens found in 39,600 full texts.

Nature: similar story. <30 specimens in 14,132 full texts.

Clearly there are more, but it’s all buried in supplementary materials :(

Blue globe represents a PLOS ONE paper

Very few specimens occur in more than one paper

Can you guess what BMNH 37001 is? Hint: it’s a very famous specimen!

Grey represents an NHMUK specimen

Huge variation in how specimens are cited (not helpful!)

PI AZ 8459 TEXSpruce6067

BM000922891 NYRaz054

BMNH(E)609062 MSB00509

Belize_CW_All_1071 F1629082

BM-BRIT-EURO 3948 OR.5379

“BMNH” is not necessarily British Museum of Natural History (UK). Can also be Beijing Museum of Natural History (CN) or Bell Museum of Natural History (US)
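One way to cope with some of this variation is to compare citations after a crude normalisation. The Python sketch below is an assumption-laden illustration, not the method used in this project: it splits a citation into runs of letters and runs of digits, so punctuation and spacing differences disappear. Both function names are hypothetical.

```python
import re


def normalise(citation):
    """Split a specimen citation into upper-case letter and digit tokens."""
    return tuple(token.upper() for token in re.findall(r"[A-Za-z]+|\d+", citation))


def same_citation(a, b):
    """True when two citations differ only in case, spacing or punctuation."""
    return normalise(a) == normalise(b)


print(normalise("BMNH.76.3.15.14"))
print(same_citation("BMNH.76.3.15.14", "bmnh 76.3.15.14"))
print(same_citation("BMNH 37001", "NHMUK PV OR 37001"))
```

Normalisation like this cannot tell the three 'BMNH' institutions apart, and it does not know that an old acronym and a new one refer to the same collection; both need extra context from the paper or a registry lookup.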

Where possible use standard/permanent identifiers

Want to discuss a particular collection? Use the official GRSciColl identifier.

The Global Registry of Scientific Collections (GRSciColl): http://grscicoll.org/

Which for the Natural History Museum, London (UK) is: NHMUK (http://biocol.org/urn:lsid:biocol.org:col:34665)

Want to cite the BM Archaeopteryx specimen?

NHMUK PV OR 37001: http://data.nhm.ac.uk/object/57ee3bf1-0a74-4ae4-a588-ba9ea8dc5265

Credit: Davies KTJ, Bates PJJ, Maryanto I, Cotton JA, Rossiter SJ (2013) The Evolution of Bat Vestibular Systems in the Face of Potential Antagonistic Selection Pressures for Flight and Echolocation. PLoS ONE 8(4): e61998. doi:10.1371/journal.pone.0061998

Openly-licensed data on specimens, published elsewhere, could be re-incorporated back into the online museum catalogue. A one-stop shop for information.

Beyond linking: repatriation of knowledge

This is a CT-scan of “BMNH 76.3.15.14”. Without mining, I wouldn’t know this data exists.

Perhaps it could also be made available on the portal?

http://data.nhm.ac.uk/specimen/69e97f52-0275-4a82-9fa6-cf1c3949f408

Does published info make it back ‘home’ to the collections?

BMNH 2013.2.13.3 is on the portal as “Petrochromis nov.sp. Takahashi”

I found it (by text mining) here: http://dx.doi.org/10.1007/s10228-014-0396-9

According to the paper, it’s now called Petrochromis horii n. sp.

What mechanisms are there to update newer information back into the collection?

Content mining could definitely help keep collections data up-to-date!
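A hedged Python sketch of how mined names might be flagged for a curator to review. The values are copied from the Petrochromis example above; the function names and dictionary keys are hypothetical, and nothing here writes to the Data Portal.

```python
def normalise_name(name):
    """Lower-case a name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def needs_review(portal_name, literature_name):
    """True when the portal and the literature give different names."""
    return normalise_name(portal_name) != normalise_name(literature_name)


# Values copied from the example above; the keys are hypothetical.
candidate = {
    "specimen": "BMNH 2013.2.13.3",
    "portal_name": "Petrochromis nov.sp. Takahashi",
    "literature_name": "Petrochromis horii",
    "source_doi": "10.1007/s10228-014-0396-9",
}

if needs_review(candidate["portal_name"], candidate["literature_name"]):
    print(f"{candidate['specimen']}: review name against doi:{candidate['source_doi']}")
```

Keeping the source DOI with each flag means a curator can check the claim before anything in the catalogue changes.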

Can we create a (better) digital NHM metadata catalogue entirely from the literature, hundreds of years before the NHM themselves complete their own digitisation programme?

Given funding and time, perhaps…

Acknowledgements

Sincere thanks to:

● Aime Rankin for help with the project
● The NHM Library staff, particularly Sarah Vincent, for actively supporting my content mining
● Nancy Chillingsworth (IPR, NHM London)
● Mark Wilkinson (Life Sciences, NHM London)
● Peter Murray-Rust & the ContentMine team
● Vince Smith (Life Sciences, NHM London)
● Ben Scott (NHM Data Portal Lead Architect)
● Rod Page (University of Glasgow)
● All of the Biodiversity Informatics team

http://contentmine.org/

For a more detailed version of this talk on YouTube see: bit.ly/nhmlink
