No specimen left behind: Collections digitisation at the NHM, London*


Post on 16-Apr-2017







Vince Smith
Collections for the 21st Century, Florida
5-6 May 2014

No specimen left behind: Collections digitisation at the NHM, London*


Some history

"the rate of progress by the UK taxonomic institutions in digitising and making collections information available is disappointingly low ... there is a significant risk of damage to the international reputation of major institutions such as The Natural History Museum"

House of Lords Science and Technology Committee, Report on Taxonomy and Systematics, 2009


Digitisation rates at the NHM (circa 2009)

900 years to digitise the collection!


The prevailing attitude: collections digitisation

Biodiversity Informatics, 2010, 7: 120-129
2010 GBIF Task Group: Global Strategy and Action Plan for the Digitisation of Natural History Collections

However desirable, digitization of all specimens across the globe is a noble but impracticable goal.

Digitizing all specimens is not an achievable aim at present

Our collections are


More technology, more automation, more speed

Whole drawer scanning

Herbarium sheet scanning

Microscope slide scanning


European collections rising to the challenge

Large-scale data capture & digitisation in France, Netherlands & Finland


NHM London Science Strategy 2013-17: A New Voyage of Discovery

Three Focal Areas
1. Scientific discovery
2. Scientific infrastructure
3. Scientific engagement

Five Challenges
1. The Digital NHM
2. Origins, evolution & futures
3. Biodiversity discovery
4. Natural resources & hazards
5. Science, society & skills

Resources & funding

Measuring success

7 target
20M specimens available by 2017


A long way to go, practically, technically & culturally

NHM collections comprise c.80m objects
- Physical register: c.5m
- Digital data: 2.8m
- Images: 350k
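As a rough illustration of the gap, the figures above can be turned into coverage percentages. This is a minimal sketch that treats the approximate slide values as exact; the 20M figure is the 2017 target quoted earlier, and the function name is only illustrative.

```python
# Approximate figures quoted on the slides ("c." values treated as exact here).
TOTAL_OBJECTS = 80_000_000
captured = {
    "Physical register": 5_000_000,
    "Digital data": 2_800_000,
    "Images": 350_000,
    "2017 target": 20_000_000,
}


def coverage_percent(count, total=TOTAL_OBJECTS):
    """Return count as a percentage of the whole collection."""
    return 100.0 * count / total


for label, count in captured.items():
    print(f"{label}: {coverage_percent(count):.2f}% of the collection")
```

On these numbers, digital data covers about 3.5% of the collection, while the 2017 target corresponds to 25%.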


NHM Digital Collections Programme

A 2, 5 and 10 year plan...

To collate, organise and make available one of the world's most important natural history collections as a digital resource, delivering:

an online specimen / lot-level database to manage all holdings

core meta-data and / or images for key parts of the collection

flexible informatics tools

750,000 for first 2 years


Outline

Why
- Internal objectives & benefits
- Research opportunity - the iCollections example

What
- How much data to digitise
- Linking digitisation effort to project benefits

How
- Digi-street pilots, quick wins (herbarium, drawer & slide scanning)
- Crowdsourcing pilots & options

Where
- NHM Data Portal
- External Portals (e.g. GBIF, Europeana)

Links
- Crowdfunding
- H2020 projects (COST, SYNTHESYS, LOD, VRE, Dig. Inf.)
- Other museums, herbaria & partners (e.g. CETAF & publishers)

When




1. Why: Research opportunity & the iCollections pilot

Using the NHM collections to track long-term seasonal response of butterflies to climate change
- Digitisation of British and Irish Lepidoptera collection
- Species poor, specimen rich
- ~500,000 specimens, 5,000 drawers
- Re-curation, imaging, label data, georeferenced
- ~25% complete (started Jan.13)
- About 50% specimens useable
- Many specimens in most years (late 19th century to 1970)
- Provide longer time perspective than most observational records (BMS post-1976)


1. Why: Research opportunity & the iCollections pilot

[Figure: relationship between 10th percentile collection date of Anthocharis cardamines (Orange tip) and mean Mar.-May temp. (N.B. temp. axis reversed). Axis labels: Cooler springs / Warmer springs; Earlier collection / Later collection.]

- 1900-2000, strong correlation between initial collection dates & temperature
- Critical marker on phenological response prior to recent rapid climate change
- Longer time perspective than most observational records (BMS post-1976)
- Museum data available for rare or hard to record species
- An example of unique biological and ecological data from collections

Brooks, Self, Toloni & Sparks, 2014, Int. J. Biometeorol. DOI 10.1007/s00484-013-0780-6
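The kind of analysis described on this slide can be sketched as follows: take the 10th percentile of collection day-of-year for each year, then correlate it with mean spring temperature. This is a minimal sketch, not the method of the cited paper; the percentile uses simple linear interpolation, and every day-of-year and temperature value below is invented purely for illustration.

```python
from math import sqrt


def percentile(values, pct):
    """Percentile by linear interpolation between the closest ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile() needs at least one value")
    rank = (len(ordered) - 1) * pct / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# INVENTED example data: collection day-of-year per year, and mean Mar.-May temp (deg C).
collection_days = {
    1901: [110, 115, 120, 125, 130, 135],
    1902: [105, 110, 114, 120, 126, 131],
    1903: [100, 104, 110, 116, 121, 127],
    1904: [118, 122, 127, 131, 136, 140],
}
spring_temp = {1901: 7.0, 1902: 8.0, 1903: 9.0, 1904: 6.0}

years = sorted(collection_days)
early_dates = [percentile(collection_days[y], 10) for y in years]
temps = [spring_temp[y] for y in years]
r = pearson(temps, early_dates)
print(f"Correlation between spring temperature and early collection date: {r:.2f}")
```

With real label data, the input would be the transcribed collection dates per specimen; a negative coefficient corresponds to earlier collection in warmer springs.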


2. What: Linking data capture effort to research benefits

Field | 1 | 2 | 3 | 4 | 5

Administrative
BM Number of object | Unknown / Not assigned | In label image | Transcribed | Transcribed & linked to digitised register | Transcribed & linked to digitised register
Object-level identifier if different | Not assigned | In label image; not machine-readable | Transcribed | Machine readable barcode | Machine readable barcode
Location within the NHM | Defined at project level | Collection | Collection + Cabinet | Collection + Cabinet + Drawer | Collection + Cabinet + Drawer + Location within drawer

Images
Object image | None | Multi-object (e.g. drawer level); not separable | Multi-object (e.g. drawer level); separable | Single object (low-res) | Single object (high-res)
Label image | None | Multi-object (e.g. drawer level); human-readable | Single object; human-readable | Multi-object (e.g. drawer level); transcribed | Single object; transcribed

Object(s) metadata
Taxonomic name | Unidentified | In label image | Transcribed but above species-level | Transcribed to species-level | Name linked to current taxonomy
Type status (if single specimen) | Unknown | In label image | Yes/No | Transcribed & specified | Transcribed & specified
Geographical location | Unknown or in label image | Continent (TDWG level 1) | Country or region (TDWG levels 2-4) | Georeferenced named locality (e.g., town) | Coordinates based on narrative or GPS
Date of collection | Unknown | In label image | Transcribed year | Transcribed month and year or date range | Transcribed date
Collector | Unknown | In label image | Transcribed | Transcribed & linked to controlled list | Transcribed; linked to controlled list & collector's notes
Stratigraphy | Unknown | In label image | Transcribed verbatim | Chronostratigraphic interpretation | Chronostratigraphy, biostratigraphy and lithostratigraphy
Additional data (e.g. host) | Unknown | Imaged but not transcribed | Partially transcribed | Transcribed in full | Transcribed in full and integrated

Level of Data Capture: Low to High

First Sweep

Benefit Justified
- Coarse biogeography
- Coarse macroecology
- Comparative analysis of traits
- Macroecological modelling of phenotype
- Community phylogenetics
- List-based macroecology
- Temporal diversity curves
- Evolution of disparity

Origins

Research Areas
We have mapped out how much data we need to address these questions

Futures
- Coarse analysis of spp. distribution change
- Coarse Species Distribution Models
- Phenological change

Hazards
- Earliest records of invaders
- Effects of decadal climate oscillations
- Modelling biotic consequences of weather
- Evolution of invasive species

Resources
- Semi-automated capture of trait data
- Modelling within-spp variation across range

[Figure: radar plots on a 1-5 scale with axes Object image, Stratigraphy, Collector, Tax. name, Geographical location, Date of collection]
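One way to picture the mapping between capture levels and research questions is to store both as field-to-level profiles (1 = low, 5 = high) and compare them. This is a minimal sketch under stated assumptions: the requirement profile and the dataset profile below are hypothetical and are not taken from the slides, and `unmet_requirements` is an illustrative name.

```python
def unmet_requirements(captured, required):
    """Return {field: (captured_level, required_level)} for fields that fall short."""
    shortfalls = {}
    for field, needed in required.items():
        have = captured.get(field, 1)  # a field with no entry is treated as level 1
        if have < needed:
            shortfalls[field] = (have, needed)
    return shortfalls


# HYPOTHETICAL requirement profile for a phenology question.
phenology_needs = {
    "Date of collection": 5,
    "Geographical location": 3,
    "Taxonomic name": 4,
}

# HYPOTHETICAL capture profile for a dataset digitised at a shallow first pass.
first_pass = {
    "Date of collection": 2,
    "Geographical location": 3,
    "Taxonomic name": 4,
    "Collector": 2,
}

print(unmet_requirements(first_pass, phenology_needs))
```

An empty result would mean the dataset already supports the question; any entries show where further transcription effort would be needed.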


3. How: Digi-street pilots (Herbarium Sheets)



3. How: Digi-street pilots (Herbarium Sheets)
- 33k specimens per day, 3 shifts (6am-10pm), Netherlands collection complete in 1.5 years
- 1.29 Euros per specimen image (if outsourced), transcription at similar cost

Video of Herbarium Sheet Digitisation (not available on SlideShare version of this presentation)
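The throughput and unit cost quoted above lend themselves to a quick planning estimate. This is a minimal sketch: the rate and price come from the slide, while the collection size of one million sheets is a hypothetical input chosen only to show the arithmetic.

```python
from math import ceil

SPECIMENS_PER_DAY = 33_000      # from the slide: 33k specimens per day over 3 shifts
EUROS_PER_IMAGE = 1.29          # from the slide: outsourced cost per specimen image


def imaging_days(specimens, per_day=SPECIMENS_PER_DAY):
    """Whole working days needed to image the given number of specimens."""
    return ceil(specimens / per_day)


def imaging_cost(specimens, unit_cost=EUROS_PER_IMAGE):
    """Imaging cost in euros, excluding transcription."""
    return specimens * unit_cost


hypothetical_sheets = 1_000_000  # ASSUMED collection size, for illustration only
print(imaging_days(hypothetical_sheets), "working days")
print(f"{imaging_cost(hypothetical_sheets):,.0f} euros for imaging")
```

Since the slide puts transcription at a similar cost, a combined budget would be roughly double the imaging figure.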


3. How: Digi-street pilots (Drawer scanning & segmentation)

SatScan whole drawer scanning
- 30 million specimens, 130k drawers
- Fast, high res. multi-specimen drawer images (5 mins. each)
- No specimen handling
- Limited drawer / unit tray metadata, plus identifiers

Specimen segmentation problem
- Digital and physical collection gets out of sync
- Need to automate specimen segmentation


3. How: Digi-street pilots (Drawer scanning & segmentation)

Starting image


Mark errors

Correct

Work with Pieter Holtzhausen and Stéfan van der Walt (Stellenbosch University)
Software: Inselect
Main language: Python
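To make the segmentation problem concrete, the sketch below finds a bounding box for each connected foreground region in a tiny binary mask, which is the general shape of the task of cutting specimens out of a drawer image. It is an assumption-laden toy and not Inselect's algorithm: the mask is hand-made, regions are 4-connected, and a real drawer image would first need thresholding and would contain touching or overlapping specimens that need manual correction.

```python
from collections import deque


def bounding_boxes(mask):
    """Return (top, left, bottom, right) for each 4-connected foreground region."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel.
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < rows and 0 <= nx < cols and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# Hand-made toy mask: 1 = specimen pixel, 0 = drawer background.
drawer = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [1, 0, 0, 0, 0, 0],
]
for box in bounding_boxes(drawer):
    print(box)
```

Each box could then be cropped into a per-specimen image and reviewed, matching the "mark errors, correct" workflow shown on the slide.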


3. How: Digi-street pilots (Slide scanning)
- Originally for histological sections
- Can be adapted for NH specimens
- Many issues:
  - Speed (max. 500 per day)
  - File size (2-5GB per slide)
  - Network ingestion (100MBps)
  - Reading labels at both ends
- NHM testing 6 systems
- NERC capital grant awarded
- Fully operational early 2015

1. Slides cleaned & barcoded
2. Loaded into hopper (50-100)
3. High resolution scan
4. Images stored & databased
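The speed, file size and network figures quoted for slide scanning combine into a daily data volume and transfer time. This is a minimal sketch under two assumptions that the slide does not state: "100MBps" is read as 100 megabytes per second, and decimal units are used (1 GB = 1000 MB).

```python
SLIDES_PER_DAY = 500         # from the slide: maximum scanning speed
GB_PER_SLIDE = (2, 5)        # from the slide: file size range per slide
NETWORK_MB_PER_S = 100       # ASSUMED to mean megabytes per second


def daily_volume_gb(slides, gb_per_slide):
    """Data produced per day in gigabytes."""
    return slides * gb_per_slide


def transfer_hours(volume_gb, rate_mb_per_s=NETWORK_MB_PER_S):
    """Hours needed to move the volume at a sustained rate (1 GB = 1000 MB)."""
    return volume_gb * 1000 / rate_mb_per_s / 3600


for size in GB_PER_SLIDE:
    volume = daily_volume_gb(SLIDES_PER_DAY, size)
    print(f"{size} GB/slide: {volume} GB per day, {transfer_hours(volume):.1f} h to ingest")
```

If the link were instead 100 megabits per second, the transfer times would be eight times longer, which is why network ingestion appears in the list of issues.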


3. How: Crowdsourcing pilot

NHM Bird registers
- No advertising
- Hard to transcribe
- Challenging starting project

Results
- 1 user with 32,629 transcriptions!
- 92 users with 100+ transcriptions
- 363 users with 1 transcription

[Figure axes: Ranked users; Log no. of records transcribed]


3. How: Crowdsourcing options
- Zooniverse Projects
- Smithsonian Digital Volunteers
- Wikisource transcription (WiR)

[email protected] steps: Survey and review of natural history transcription projects cf. paying transcribers


4. Where: NHM Data Portal
- A focus for deposition and discovery of NHM research & collections data
- Stable, citable identifiers on datasets & specimen / lot records
- Transparent data quality (un-reviewed, reviewed, reviewed & updated)
- Download (DwCA), web-services & Linked Open Data
- Built using CKAN, with enhanced mapping functionality

[Screenshots: Search; Datasets matching criteria; Individual dataset; Results; Browse & search criteria; Mapping, table & statistical views]
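Because the portal is built on CKAN, dataset searches can in principle go through CKAN's standard action API. The sketch below only builds a `package_search` URL and reads titles from a response of the usual CKAN shape; it makes no network call. The base URL is a placeholder and not the portal's real address, the sample payload is invented, and the NHM portal may expose additional or different endpoints.

```python
from urllib.parse import urlencode

# PLACEHOLDER base URL; substitute the real portal address.
BASE_URL = "https://data.example.org"


def package_search_url(query, rows=10, base_url=BASE_URL):
    """Build a CKAN action API URL that searches datasets by free text."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url}/api/3/action/package_search?{params}"


def dataset_titles(payload):
    """Extract dataset titles from a decoded package_search JSON response."""
    if not payload.get("success"):
        return []
    return [dataset["title"] for dataset in payload["result"]["results"]]


# INVENTED response in the general shape CKAN returns, for illustration only.
sample_payload = {
    "success": True,
    "result": {"count": 1, "results": [{"title": "Example specimen dataset"}]},
}

print(package_search_url("Coleoptera", rows=5))
print(dataset_titles(sample_payload))
```

In practice the URL would be fetched with an HTTP client and the JSON body decoded before being passed to `dataset_titles`.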


4. Where: External Portals

Flickr


e.g. NHM Coleoptera
- NHM almost getting data to GBIF!
- Submitting to Europeana portal (via Open-Up)
- Niche collections on Flickr
- Robust API services
- Gateway to image analysis projects (e.g. species recognition & trait extraction tools)


5. Links

Crowdfunding
Personalizes don