Download - BHL / EOL technology sit down
BHL / BIG
2 Mar 2010Woods Hole
BHL: CONTENT & USAGEZea mays http://www.eol.org/pages/1115259
Literature: http://biodiversitylibrary.org/name/Zea_mays
Size of BHL
• 24TB & growing!
Workflow
Selection Preparation
Post Production(Re)publication
Digitization
Conservation
Scanning = human work
Scan & Store: Internet Archive
Scanning on Scribes
Storage in Petaboxes
Scanning Derivatives
• XML• JP2
• PDF• JPG• TXT• DJVu
Master Derivatives
OCR
XML
JP2
MOBOT
Petabox cluster
Internet Archive
Image Server
MBL
Cluster
BHL Data Flow – Sep 2009
CiteBank
Usage: 1 Jan 08 – 31 Jan 10
• Daily average– 1,026 visitors– 1,680 visits / day– 8,200 pageviews / day
Referrers: 1 Jan 08 – 31 Jan 10
Jan 1, 2008 – Jan 31, 2010
BHL: APP / UI / SERVICESZea mays http://www.eol.org/pages/1115259
Literature: http://biodiversitylibrary.org/name/Zea_mays
BHL Development Team
Phil ->
<- Mike
Name Finding via TaxonFinder
Image from ScannerConverted to text via OCRName finding via TaxonFinder Extract namesSubmit to NameBankSOAP response
Name Finding in action
with Taxonomic Intelligence…
Name Finding Evaluation
• Structured and performed by Qin Wei– Ph.D. student at UIUC, working with Bryan Heidorn
• Methodology– Scholarly volunteers manually identified scientific names
on random sample of 392 pages in BHL corpus– Compared those against OCR,then two name finding
algorithms (TaxonFinder & FAT)• Goals
– Spark discussion, set baseline for future work
Characteristics of sample
Number of Pages 392
Average Number of Words per Page 446.8
Average Number of Names per Page 7.7
Total Number of Names 3003
Total Number of Unique Names 2610= 86.91%
OCR error rate for names only
1 Insert Space 8 n->v
2 Omit Space 9 l->i
3 e->c 10 r->i
4 u->I 11 u->ii
5 u->n 12 h->l
6 i->l 13 h->ii
7 c->e 14 e->o
Top OCR errors
35.16%
Of the 3,003 names, 1,056 were incorrectly transcribed by OCR.
Considerations
• Improving OCR software is out of scope– Google’s Tesseract is only viable open source
option– Flurry of activity in 2006-2007, quiet since
• Rekeying is expensive given size of corpus– Will not scale
Name finding statistics
• 27.7 million pages scanned• 70.4 million name strings found• 56.2 million names verified with a
NameBankID• 1.4 million unique names with a NameBankID• 3.3 million unique names *without* a
NameBankID– This is where the interesting data live!!!
http://www.biodiversitylibrary.org/name/Physeter_catodon
But where are the articles??
PDF Generation Stats
Mandate for new development
• display / manage articles
• meet community demands for bibliography / citation management
• build from more open source tools
Development goals re: citations
• Create a repository for community-vetted taxonomic bibliographies.
• Ability to ingest, display, download, and index articles so that the BHL can operate as an article repository.
• Build from existing community of work around Drupal / Biblio.– In use by collaborators
http://www.citebank.org
http://citebank.org/node/47423
Services• OpenURL
– Facilitate links to citations: protologues, articles, references• Documentation:
http://www.biodiversitylibrary.org/openurlhelp.aspx– Useful to Nomenclators, Reference Systems
• IPNI• Tropicos
• Names Service– Return all occurrences of a name throughout BHL digitized corpus
• Documentation: http://bit.ly/2e6sg9– Access to 51million name strings using TaxonFinder
– 1.4million unique names– Working out a strategy for obscure species– Algorithm improvements to detect nomenclatural & taxonomic acts
• New API
Services: OpenURL
http://www.biodiversitylibrary.org/openurl?pid=title:3934&volume=14&issue=&spage=301&date=1879
http://www.tropicos.org/Name/1200408
Services: OpenURL Disambiguation
• Looking for:
• BHL returns:
Services: OpenURL Results
How?
• Tropicos maintains internal authority list of publications:
• Each protologue/reference tied to authority:
• Matched Tropicos TitleIDs to BHL TitleIDs:
• Throw citations at resolver at regular intervals & cache data in Tropicos
http://www.tropicos.org/Publication/775
http://www.biodiversitylibrary.org/title/3934http://www.tropicos.org/Publication/775 =
http://www.biodiversitylibrary.org/openurl?pid=title:3934&volume=14&issue=&spage=301&date=1879
BHL Name Serviceshttp://www.biodiversitylibrary.org/services/name/NameService.asmx
http://www.biodiversitylibrary.org/services/name/NameService.asmx
Other consumers
• EarthCape Lab• BioGuid• BioSTOR
Research projects• BREC - NSF• Conjecturator - NSF• Darwin’s Library – NEH/JISC• Hong Cui @ University of AZ - NSF
http://bioguid.info/bhl/compare.php?name1=Physeter+catodon&name2=Physeter+macrocephalus
HARDWARE / INFRASTRUCTUREZea mays http://www.eol.org/pages/1115259
Literature: http://biodiversitylibrary.org/name/Zea_mays
<insert Phil here>
GLOBAL BHLZea mays http://www.eol.org/pages/1115259
Literature: http://biodiversitylibrary.org/name/Zea_mays
Global BHL
Global BHL Nodes
• BHL-Australia– http://ec2-75-101-224-221.compute-1.amazonaws.com/
• BHL-China– http://bhl-china.org
• BHL-Europe– http://biodiversitylibrary.eu
BHL: COLLABORATION W/ BIGZea mays http://www.eol.org/pages/1115259
Literature: http://biodiversitylibrary.org/name/Zea_mays
<insert discussion here>
• Existing issues in Jira• Taxonomic name finding enhancements
– Nomenclatural acts in web services– Other algorithms / verification
• WoRMS data• Improvement
– Ranking results– Visualization
• LifeDesks– Bibliography sharing– Resolve to articles