oboyski ecn2013
Notes from Nature: Citizen Science data transcription
Peter Oboyski, Jun Ying Lim, Joyce Gross, Chris Snyder*, Arfon Smith*, Joanie Ball, Kip Will, Rosemary Gillespie
Essig Museum of Entomology
* Zooniverse Citizen Science Alliance
How does it work?
• Introduction to CalBug
• What is Zooniverse?
• What do we provide?
• What happens online?
• What do we get back?
• Technical issues
• Maintaining interest
• How can you get involved?
What is CalBug?
NSF - ADBC grant
Collaboration among the eight major entomology collections in California
Digitize 1.2 million specimens
Essig Museum of Entomology
California Academy of Sciences
California State Collection of Arthropods
Bohart Museum, UC Davis
Entomology Research Museum, UC Riverside
San Diego Natural History Museum
Santa Barbara Museum of Natural History
LA County Museum
Stephen Dowlan
CalPhotos
MySQL database
Berkeley Mapper
http://calbug.berkeley.edu
• In development
– Integrating point data (specimen records) with habitat, range maps, elevation, climate, etc.
– Historical recreation of the environment
– Predict potential impacts of environmental change
– Facilitate land use/management decisions
Berkeley Natural History Museums
Digitization workflow
• Handling & Imaging
– (Optional) Sort by locality, date, sex, etc.
– Remove labels, add unique identifier
– Take digital image, name and save file
– Replace labels, return to collection
• Data Capture
– Manually enter data into MySQL database
– Online crowd-sourcing of manual data entry
– Optical Character Recognition (OCR) & automated data parsing
• Data Manipulation
– Error checking
– Geographic referencing
– Aggregate data in online cache
– Temporospatial analyses
Why Image Labels?
• Magnify difficult-to-read labels
• Verbatim archive of label data
– Essential for proofing data
– Useful for taxonomists interested in label data
• Data capture can be done remotely
Digital camera tethered to computer
Average 50-55 images per hour, including imaging, file renaming, and upload
Filename = EMEC218958 Paracotalpa ursina.jpg
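The filename above appears to combine a catalog number and a taxon name. A minimal Python sketch of how such names could be built and parsed follows. The pattern (uppercase prefix plus digits, one space, then the taxon) is an assumption inferred from this single example, and the function names are hypothetical.

```python
import re
from pathlib import Path

# Assumed pattern, inferred from one example filename:
# catalog number (uppercase letters + digits), whitespace, then the taxon name.
FILENAME_RE = re.compile(r"^(?P<catalog>[A-Z]+\d+)\s+(?P<taxon>.+)$")


def parse_image_filename(filename):
    """Split an image filename into (catalog_number, taxon_name)."""
    stem = Path(filename).stem  # drop the extension, e.g. ".jpg"
    match = FILENAME_RE.match(stem)
    if match is None:
        raise ValueError(f"unexpected filename format: {filename!r}")
    return match.group("catalog"), match.group("taxon").strip()


def build_image_filename(catalog_number, taxon_name, ext=".jpg"):
    """Compose a filename in the same 'CATALOG Taxon name.jpg' shape."""
    return f"{catalog_number} {taxon_name}{ext}"
```

Parsing the name back out lets the unique identifier travel with the image, so a transcription can later be tied to its specimen record without opening the file.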
Slide Scanning
Average 150 slides per hour, including scan, file renaming, and upload
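To give a sense of scale, the quoted imaging rates can be turned into a rough estimate of total effort. The sketch below is arithmetic only. It assumes, purely for illustration, that all 1.2 million specimens are imaged one at a time with the tethered camera; the real mix of pinned specimens and slides is not given here.

```python
def hours_needed(n_items, items_per_hour):
    """Person-hours needed to process n_items at a steady hourly rate."""
    if items_per_hour <= 0:
        raise ValueError("items_per_hour must be positive")
    return n_items / items_per_hour


TARGET_SPECIMENS = 1_200_000  # project goal stated earlier in the talk

# Camera imaging at 50-55 images per hour (illustrative assumption:
# every specimen goes through the camera workflow).
slow = hours_needed(TARGET_SPECIMENS, 50)
fast = hours_needed(TARGET_SPECIMENS, 55)
print(f"{fast:,.0f} to {slow:,.0f} person-hours")
```

Under that assumption the imaging alone comes to roughly 22,000 to 24,000 person-hours.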
400 DPI
Seems to provide high enough resolution for difficult-to-read labels while keeping file size relatively small
But not high enough resolution for taxonomic work
Using Citizen Scientists to transcribe label data
http://www.notesfromnature.org/ Launched April 22, 2013
Images in, transcriptions out
• We supply JPEG images
– 400 DPI (300 DPI is also good)
– Deposited as zip file
– Stored in Amazon Cloud
• In development
– Automated service to upload images to Amazon Cloud
– Be able to prioritize image set
• Zooniverse provides
– MongoDB data dump
– 1 record = 1 transcription
– 4 transcriptions / image
• In development
– Automated daily dump
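Because each record in the dump is a single transcription and each image receives four, a first processing step is to group records by image. The Python sketch below assumes the dump has already been loaded as a list of dictionaries; the field name `subject_id` is hypothetical, since the actual schema of the dump is not described here.

```python
from collections import defaultdict


def group_by_image(records, key="subject_id"):
    """Group transcription records by the image they describe.

    `key` is a hypothetical field name for the image identifier.
    """
    grouped = defaultdict(list)
    for record in records:
        grouped[record[key]].append(record)
    return dict(grouped)


def images_ready(grouped, required=4):
    """Return image ids that have at least `required` transcriptions."""
    return sorted(
        image_id for image_id, recs in grouped.items() if len(recs) >= required
    )
```

Only images that have reached the full set of four transcriptions would then be passed on for reconciliation.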
Reconciling transcriptions
• Drop-down lists (Country, State, County, Date) are compared for exact match
– Occasionally missing, sometimes wrong
– Majority rule
• Free-form text fields (Locality, Collectors) are much more problematic
– Transcribers asked to record label data verbatim
– Punctuation, capitalization, spacing between words
– Misspelling, expanding abbreviations, interpretations
• Developing scripts in R to reconcile free-form text
• Text matching for maximum correspondence among multiple transcriptions (cf. DNA alignment methods)
• Final result = 1 transcription in our database, marked as a Citizen Science transcribed record, with links to the 4 original transcriptions
• Vetting by CalBug personnel still necessary, but we can prioritize based on record-matching confidence scores
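The reconciliation steps above can be sketched as follows. This is not the R code under development. It is a simplified Python illustration that swaps in pairwise string similarity (the standard library's `difflib.SequenceMatcher`) for a true alignment-based method. The function names and the definition of the confidence score are assumptions made for this example.

```python
import re
from collections import Counter
from difflib import SequenceMatcher


def majority_vote(values):
    """Majority rule for drop-down fields.

    Returns (winner, share of all transcriptions agreeing with it).
    Blank values are ignored as candidates but still count against the share.
    Ties are resolved arbitrarily.
    """
    votes = Counter(v for v in values if v)
    if not votes:
        return None, 0.0
    winner, count = votes.most_common(1)[0]
    return winner, count / len(values)


def normalize(text):
    """Remove differences in case, punctuation, and spacing."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def consensus_text(transcriptions):
    """Pick the free-text transcription most similar, on average, to the rest.

    Returns (chosen transcription, mean similarity in [0, 1]); the second
    value is used here as an illustrative confidence score.
    """
    if not transcriptions:
        raise ValueError("need at least one transcription")
    if len(transcriptions) == 1:
        return transcriptions[0], 1.0
    normalized = [normalize(t) for t in transcriptions]
    best_index, best_score = 0, -1.0
    for i, a in enumerate(normalized):
        ratios = [
            SequenceMatcher(None, a, b).ratio()
            for j, b in enumerate(normalized)
            if j != i
        ]
        score = sum(ratios) / len(ratios)
        if score > best_score:
            best_index, best_score = i, score
    return transcriptions[best_index], best_score
```

Records with a low vote share or a low similarity score could be queued first for vetting, which is the kind of prioritization described in the last bullet.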
Generating & Maintaining Interest
Number of Notes from Nature transcriptions for CalBug
• Popular media, social media, and press releases
– Only so many occasions for a press release
• Campaigns
– Highlight particular taxa, habitats, geographic regions
• Education
– High quality, high resolution photo of species transcribed
– Create links to other services to learn more about species
• Competitions
– Prizes are worth more than badges
– However, need to watch for bad data in pursuit of prize
How can you get involved?
• Right now you cannot
• iDigBio is interested in getting involved
• iDigBio hosting a hackathon in December
• Begin building up collections of images
Thank you
And a HUGE thank you to the CalBug Army who image our specimens
Chris Amy, Maritess Aristorenas, Jazmin Calderon, Alex Carolina, Sonia Castillo, Matthew Chan, Sabina Cook, Alex Darwish, John Davie, Jesson Go, Nick Grady-Grote, Ginger Haight, Laura Hayes, Dennis Ho, Aubrey Huey, Leah Humphreys, Veronica Hurd, Hanna Huynh, Eseosa Igbinedion, Ilona Istenes, Emma Kohlsmith, Asia Kwan, Tiffany Kyo, Jerry Lee, Ken Lee, Christina Lew, Maggie Lewis, Alex Lim, Derick Matano, Christian Munevar, Frank Ngo, Kent Nguyen,
Minh Nguyen, Riley O'Brien, Marielle Pinheiro, Rammonhan Reddy, Jessica Rothery, Stacey Rutherford, Anna Szendrenyi, Anni Sheh, Hannah Shin, Erika So, Mee Thao, Cindy Truong, Darleen Tu, Skyler Valle, Daug Vaughn, Hayden Wong, Yiu Kei Wong, Keane Yang, Kevin Yao, Frances Zhang