Digitizing California Arthropod Collections
Peter Oboyski, Phuc Nguyen, Serge Belongie, Rosemary Gillespie
Essig Museum of Entomology
University of California
Berkeley, California, USA
Who is CalBug?
Essig Museum of Entomology
California Academy of Sciences
California State Collection of Arthropods
Bohart Museum, UC Davis
Entomology Research Museum, UC Riverside
San Diego Natural History Museum
LA County Museum
Santa Barbara Museum of Natural History
Digitization workflow
Handling & Imaging
• (Optional) Sort by locality, date, sex, etc.
• Remove labels, add unique identifier
• Take digital image, name and save file
• Replace labels, return to collection
Data Capture
• Manually enter data into MySQL database
• Online crowd-sourcing of manual data entry
• Optical Character Recognition (OCR) & automated data parsing
Data Manipulation
• Error checking
• Geographic referencing
• Aggregate data in online cache
• Temporospatial analyses
Why Image Specimens/Labels?
• Data capture can be done remotely
• Magnify difficult-to-read labels
• Potential for OCR
• Verbatim digital archive of label data
1st generation - DinoLite digital microscope
2nd generation – Digital Camera (Canon G9)
Higher resolution
Labels flat & unobstructed
Scale bar, controlled light
Important to add species name to image or file name
EMEC218958 Paracotalpa ursina.jpg
~150,000 images waiting to be databased
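The file-naming convention shown on this slide (unique identifier, a space, then the species name) lets the identifier and taxon be recovered from the file name alone. A minimal sketch, assuming every file follows the pattern of the single example above; the function name `parse_image_name` and the regular expression are illustrative and not part of the CalBug tooling:

```python
import re
from pathlib import Path

# Assumed pattern: capital letters plus digits (the unique identifier),
# whitespace, then the species name. Inferred from one example file name.
FILENAME_RE = re.compile(r"^(?P<catalog>[A-Z]+\d+)\s+(?P<name>.+)$")


def parse_image_name(filename):
    """Split an image file name into (unique identifier, species name)."""
    stem = Path(filename).stem  # drop the ".jpg" extension
    match = FILENAME_RE.match(stem)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    return match.group("catalog"), match.group("name").strip()


print(parse_image_name("EMEC218958 Paracotalpa ursina.jpg"))
```

File names that do not fit the assumed pattern raise an error rather than being guessed at, so they can be set aside for manual review.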
Data capture
• Manually enter data into MySQL database
• Online crowd-sourcing of manual data entry
• Optical Character Recognition (OCR) & automated data parsing
Using our own MySQL database (EssigDB)
• Built-in error checking
• Data carry-over from one record to the next
• Taxonomy automatically added
“Notes from Nature”
• Collaboration with Zooniverse
• Citizen scientist transcription of labels
Collaboration with UC San Diego
• Improved word spotting & OCR
Notes from Nature
Citizen science data transcription
Integrating OCR with crowd sourcing
o Spotting words within images
o Copy-paste, highlight-drag fields
o Auto-detecting repeated “words” (e.g. species, states, counties)
o Providing an additional “vote” for transcription consensus
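The idea of OCR as an additional “vote” can be sketched as a simple majority count over transcriptions of one field. This is an illustration only: the slides do not describe how consensus is actually computed, and the function name `consensus`, the `min_share` threshold, and the sample values are all assumptions.

```python
from collections import Counter


def consensus(votes, min_share=0.5):
    """Return the most common transcription if it holds more than
    min_share of the non-empty votes; otherwise return None."""
    # Normalize case and surrounding whitespace so trivial differences
    # between transcribers do not split the vote.
    normalized = [v.strip().lower() for v in votes if v and v.strip()]
    if not normalized:
        return None
    value, count = Counter(normalized).most_common(1)[0]
    return value if count / len(normalized) > min_share else None


# Made-up example: two volunteers disagree on a county name.
volunteer_votes = ["Alameda", "Alamada"]
print(consensus(volunteer_votes))                # no majority yet
print(consensus(volunteer_votes + ["alameda"]))  # OCR output breaks the tie
```

Treating the OCR output as one more vote means it can break ties between volunteers without ever overriding a clear human majority.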
The OCR challenge for specimen labels
DETECTION:
Finding text in a complex matrix
Machine-typed vs. hand-written labels
Sliding window classifier creating text bounding boxes
>95% detection and localization using pixel-overlap measures
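One common pixel-overlap measure for scoring a detected text box against a hand-drawn reference box is intersection over union. The slides do not say which overlap measure was used, so the sketch below is an assumption; boxes are taken to be `(x0, y0, x1, y1)` tuples in pixel coordinates, and the function names are illustrative.

```python
def box_area(box):
    """Area in pixels of a box given as (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return max(0, x1 - x0) * max(0, y1 - y0)


def overlap_ratio(pred, truth):
    """Intersection over union of a predicted and a reference box."""
    # Corners of the intersection rectangle; it is empty (area 0)
    # when the boxes do not overlap.
    ix0, iy0 = max(pred[0], truth[0]), max(pred[1], truth[1])
    ix1, iy1 = min(pred[2], truth[2]), min(pred[3], truth[3])
    inter = box_area((ix0, iy0, ix1, iy1))
    union = box_area(pred) + box_area(truth) - inter
    return inter / union if union else 0.0


# Two 10 x 10 boxes offset horizontally by 5 pixels.
print(overlap_ratio((0, 0, 10, 10), (5, 0, 15, 10)))
```

A detection would then count as correct when its ratio exceeds a chosen threshold; the threshold itself is not given in the source.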
RECOGNITION:
Using Tesseract OCR engine
Machine Type
74% accuracy for word-level
82% accuracy for character-level
Handwriting
5.4% accuracy for word-level
9.2% accuracy for character-level
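Word-level and character-level accuracy can differ sharply because one wrong character spoils a whole word. The slides do not state how the figures above were computed; the sketch below assumes word accuracy is the share of reference words reproduced exactly in position, and character accuracy is one minus the edit distance divided by the reference length. The sample label text is made up.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def character_accuracy(truth, ocr):
    """1 - edit distance / reference length, floored at 0."""
    if not truth:
        return 1.0 if not ocr else 0.0
    return max(0.0, 1 - edit_distance(truth, ocr) / len(truth))


def word_accuracy(truth, ocr):
    """Share of reference words matched exactly at the same position."""
    truth_words, ocr_words = truth.split(), ocr.split()
    if not truth_words:
        return 1.0 if not ocr_words else 0.0
    hits = sum(t == o for t, o in zip(truth_words, ocr_words))
    return hits / len(truth_words)


truth = "Alameda Co. Berkeley"  # made-up label text
ocr = "Alameda Co. Berke1ey"    # one character misread
print(word_accuracy(truth, ocr), character_accuracy(truth, ocr))
```

In this example a single misread character costs one of three words but only one of twenty characters, which is the same direction of gap seen between the word-level and character-level figures above.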
Current Progress in OCR recognition
Where do we go from here?
• Improved recognition of hand-writing
• Incorporate OCR into crowd sourcing
• Develop (semi-) automated data parsing
Thank you
http://calbug.berkeley.edu