bhl / eol technology sit down

BHL / BIG 2 Mar 2010 Woods Hole

Upload: chris-freeland

Post on 18-May-2015


TRANSCRIPT

Page 1: BHL / EOL technology sit down

BHL / BIG

2 Mar 2010, Woods Hole

Page 2: BHL / EOL technology sit down

BHL: CONTENT & USAGE

Zea mays: http://www.eol.org/pages/1115259

Literature: http://biodiversitylibrary.org/name/Zea_mays

Page 3: BHL / EOL technology sit down

Size of BHL

• 24TB & growing!

Page 4: BHL / EOL technology sit down

Workflow

• Selection
• Preparation
• Post Production
• (Re)publication
• Digitization
• Conservation

Page 5: BHL / EOL technology sit down
Page 6: BHL / EOL technology sit down
Page 7: BHL / EOL technology sit down

Scanning = human work

Page 8: BHL / EOL technology sit down

Scan & Store: Internet Archive

Scanning on Scribes

Storage in Petaboxes

Page 9: BHL / EOL technology sit down
Page 10: BHL / EOL technology sit down

Scanning Derivatives

Master
• XML
• JP2

Derivatives
• PDF
• JPG
• TXT
• DjVu

PDF

OCR

XML

JP2

Page 11: BHL / EOL technology sit down

MOBOT

Petabox cluster

Internet Archive

Image Server

MBL

Cluster

Page 12: BHL / EOL technology sit down
Page 13: BHL / EOL technology sit down

BHL Data Flow – Sep 2009

CiteBank

Page 14: BHL / EOL technology sit down

Usage: 1 Jan 08 – 31 Jan 10

• Daily average
  – 1,026 visitors
  – 1,680 visits / day
  – 8,200 pageviews / day

Page 15: BHL / EOL technology sit down

Referrers: 1 Jan 08 – 31 Jan 10


Page 16: BHL / EOL technology sit down
Page 17: BHL / EOL technology sit down

BHL: APP / UI / SERVICES

Zea mays: http://www.eol.org/pages/1115259

Literature: http://biodiversitylibrary.org/name/Zea_mays

Page 18: BHL / EOL technology sit down

BHL Development Team

Phil ->

<- Mike

Page 20: BHL / EOL technology sit down

Name Finding in action

• Image from scanner
• Converted to text via OCR
• Name finding via TaxonFinder
• Extract names
• Submit to NameBank
• SOAP response

with Taxonomic Intelligence…
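
The flow on this slide can be sketched in a few lines, assuming the OCR text is already in hand. The regular expression below is only a toy stand-in for TaxonFinder, and the dictionary is a hypothetical stand-in for NameBank with made-up IDs; neither reflects how the real services work.

```python
import re

# Toy name finder: a capitalised word followed by a lowercase word of three
# or more letters. This is NOT TaxonFinder's algorithm, only a placeholder.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]{3,})\b")

# Hypothetical stand-in for NameBank: a tiny local table with made-up IDs.
KNOWN_NAMES = {"Zea mays": 1, "Physeter catodon": 2}


def find_candidates(ocr_text):
    """Return name-like strings found in OCR text, in order of appearance."""
    return [" ".join(m.groups()) for m in BINOMIAL.finditer(ocr_text)]


def verify(candidates, known=KNOWN_NAMES):
    """Split candidates into those with a known ID and those without."""
    verified = {name: known[name] for name in candidates if name in known}
    unverified = [name for name in candidates if name not in known]
    return verified, unverified


# "Zca mays" imitates an e->c OCR misread of "Zea mays".
page_text = "Specimens of Zea mays and Zca mays were collected."
verified, unverified = verify(find_candidates(page_text))
print(verified)
print(unverified)
```

The misread string is still picked up as a candidate but cannot be verified, which is the kind of unmatched name string the later statistics count separately.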

Page 21: BHL / EOL technology sit down

Name Finding Evaluation

• Structured and performed by Qin Wei
  – Ph.D. student at UIUC, working with Bryan Heidorn
• Methodology
  – Scholarly volunteers manually identified scientific names on random sample of 392 pages in BHL corpus
  – Compared those against OCR, then two name finding algorithms (TaxonFinder & FAT)
• Goals
  – Spark discussion, set baseline for future work

Page 22: BHL / EOL technology sit down

Characteristics of sample

Number of pages: 392
Average number of words per page: 446.8
Average number of names per page: 7.7
Total number of names: 3,003
Total number of unique names: 2,610 (86.91%)

Page 23: BHL / EOL technology sit down

OCR error rate for names only

35.16%

Of the 3,003 names, 1,056 were incorrectly transcribed by OCR.

Top OCR errors

1. Insert space
2. Omit space
3. e->c
4. u->I
5. u->n
6. i->l
7. c->e
8. n->v
9. l->i
10. r->i
11. u->ii
12. h->l
13. h->ii
14. e->o
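
The percentages quoted for the 392-page sample can be re-derived from the raw counts given on the slides. This is a quick arithmetic check; the variable names are illustrative only.

```python
# Raw counts as stated on the slides.
pages = 392
total_names = 3003
unique_names = 2610
ocr_errors = 1056

unique_share = unique_names / total_names    # share of names that are unique
error_rate = ocr_errors / total_names        # share of names OCR got wrong
names_per_page = total_names / pages         # average names per sampled page

print(f"unique names: {unique_share:.2%}")
print(f"OCR error rate: {error_rate:.2%}")
print(f"names per page: {names_per_page:.1f}")
```

All three results agree with the figures on the slides (86.91%, 35.16% and 7.7).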

Page 24: BHL / EOL technology sit down

Considerations

• Improving OCR software is out of scope
  – Google’s Tesseract is only viable open source option
  – Flurry of activity in 2006-2007, quiet since
• Rekeying is expensive given size of corpus
  – Will not scale

Page 25: BHL / EOL technology sit down

Name finding statistics

• 27.7 million pages scanned
• 70.4 million name strings found
• 56.2 million names verified with a NameBankID
• 1.4 million unique names with a NameBankID
• 3.3 million unique names *without* a NameBankID
  – This is where the interesting data live!!!

Page 26: BHL / EOL technology sit down

http://www.biodiversitylibrary.org/name/Physeter_catodon
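
The name pages linked in this deck (Zea_mays, Physeter_catodon) appear to follow one pattern: the scientific name with spaces replaced by underscores. A minimal sketch under that assumption follows; the function name is hypothetical, and the pattern is inferred from these two URLs only.

```python
BHL_NAME_BASE = "http://www.biodiversitylibrary.org/name/"


def bhl_name_url(scientific_name):
    """Build a BHL name-page URL, assuming spaces map to underscores."""
    # split() also collapses any repeated or surrounding whitespace.
    return BHL_NAME_BASE + "_".join(scientific_name.split())


print(bhl_name_url("Physeter catodon"))
```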

Page 27: BHL / EOL technology sit down

But where are the articles??

Page 28: BHL / EOL technology sit down
Page 29: BHL / EOL technology sit down
Page 30: BHL / EOL technology sit down
Page 31: BHL / EOL technology sit down
Page 32: BHL / EOL technology sit down

PDF Generation Stats

Page 33: BHL / EOL technology sit down

Mandate for new development

• display / manage articles

• meet community demands for bibliography / citation management

• build from more open source tools

Page 34: BHL / EOL technology sit down

Development goals re: citations

• Create a repository for community-vetted taxonomic bibliographies.

• Ability to ingest, display, download, and index articles so that the BHL can operate as an article repository.

• Build from existing community of work around Drupal / Biblio.
  – In use by collaborators

Page 35: BHL / EOL technology sit down

http://www.citebank.org

Page 36: BHL / EOL technology sit down

http://citebank.org/search

Page 38: BHL / EOL technology sit down
Page 39: BHL / EOL technology sit down

Services

• OpenURL
  – Facilitate links to citations: protologues, articles, references
    • Documentation: http://www.biodiversitylibrary.org/openurlhelp.aspx
  – Useful to Nomenclators, Reference Systems
    • IPNI
    • Tropicos
• Names Service
  – Return all occurrences of a name throughout BHL digitized corpus
    • Documentation: http://bit.ly/2e6sg9
  – Access to 51 million name strings using TaxonFinder
  – 1.4 million unique names
  – Working out a strategy for obscure species
  – Algorithm improvements to detect nomenclatural & taxonomic acts
• New API

Page 40: BHL / EOL technology sit down

Services: OpenURL

http://www.biodiversitylibrary.org/openurl?pid=title:3934&volume=14&issue=&spage=301&date=1879

http://www.tropicos.org/Name/1200408
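
To see what the example OpenURL carries, it can be split into its query parameters with Python's standard library. The reading of the fields (for instance, `spage` as the start page and `pid` as a "type:value" identifier) is an assumption based on common OpenURL usage; the resolver documentation linked earlier is the authority.

```python
from urllib.parse import parse_qs, urlsplit

url = ("http://www.biodiversitylibrary.org/openurl"
       "?pid=title:3934&volume=14&issue=&spage=301&date=1879")

# keep_blank_values=True preserves the empty `issue` parameter.
raw = parse_qs(urlsplit(url).query, keep_blank_values=True)
params = {key: values[0] for key, values in raw.items()}

# Split the identifier into its type and value parts.
id_type, _, id_value = params["pid"].partition(":")

print(params)
print(id_type, id_value)
```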

Page 41: BHL / EOL technology sit down

Services: OpenURL Disambiguation

• Looking for:

• BHL returns:

Page 42: BHL / EOL technology sit down

Services: OpenURL Results

Page 43: BHL / EOL technology sit down

How?

• Tropicos maintains internal authority list of publications:

• Each protologue/reference tied to authority:

• Matched Tropicos TitleIDs to BHL TitleIDs:

• Throw citations at resolver at regular intervals & cache data in Tropicos

http://www.tropicos.org/Publication/775

http://www.tropicos.org/Publication/775 = http://www.biodiversitylibrary.org/title/3934

http://www.biodiversitylibrary.org/openurl?pid=title:3934&volume=14&issue=&spage=301&date=1879
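
The matching and caching steps above could look roughly like the sketch below. It is not Tropicos code: the function names, the citation tuple layout, and the `resolve` callback are hypothetical, and the only ID pair used is the one shown on the slide (Tropicos publication 775 = BHL title 3934).

```python
from urllib.parse import urlencode

BHL_OPENURL = "http://www.biodiversitylibrary.org/openurl"

# Tropicos publication ID -> BHL title ID (the one pair shown on the slide).
TROPICOS_TO_BHL = {775: 3934}


def openurl_for(tropicos_pub_id, volume, start_page, year):
    """Return a BHL OpenURL for a citation, or None if the title is unmatched."""
    bhl_title = TROPICOS_TO_BHL.get(tropicos_pub_id)
    if bhl_title is None:
        return None
    query = urlencode(
        {"pid": f"title:{bhl_title}", "volume": volume, "issue": "",
         "spage": start_page, "date": year},
        safe=":",  # keep the colon in "title:3934" unescaped
    )
    return f"{BHL_OPENURL}?{query}"


def refresh_cache(citations, resolve, cache):
    """Resolve each uncached citation once and store the answer.

    `resolve` is a caller-supplied function (hypothetical) that takes an
    OpenURL and returns whatever the resolver answered.
    """
    for key, (pub_id, volume, page, year) in citations.items():
        url = openurl_for(pub_id, volume, page, year)
        if url is not None and key not in cache:
            cache[key] = resolve(url)
    return cache


# Usage with a dummy resolver that simply echoes the URL it was given.
citations = {"citation-1": (775, 14, 301, 1879)}
cache = refresh_cache(citations, resolve=lambda u: u, cache={})
print(cache["citation-1"])
```

Passing the resolver in as a function keeps the sketch free of network calls; a scheduled job would supply a real HTTP client instead.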

Page 44: BHL / EOL technology sit down

BHL Name Services

http://www.biodiversitylibrary.org/services/name/NameService.asmx

Page 45: BHL / EOL technology sit down

Other consumers

• EarthCape Lab
• BioGuid
• BioSTOR

Research projects

• BREC - NSF
• Conjecturator - NSF
• Darwin’s Library – NEH/JISC
• Hong Cui @ University of AZ - NSF

Page 47: BHL / EOL technology sit down

HARDWARE / INFRASTRUCTURE

Zea mays: http://www.eol.org/pages/1115259

Literature: http://biodiversitylibrary.org/name/Zea_mays

Page 48: BHL / EOL technology sit down

<insert Phil here>

Page 49: BHL / EOL technology sit down

GLOBAL BHL

Zea mays: http://www.eol.org/pages/1115259

Literature: http://biodiversitylibrary.org/name/Zea_mays

Page 50: BHL / EOL technology sit down

Global BHL

Page 51: BHL / EOL technology sit down
Page 52: BHL / EOL technology sit down

Global BHL Nodes

• BHL-Australia
  – http://ec2-75-101-224-221.compute-1.amazonaws.com/
• BHL-China
  – http://bhl-china.org
• BHL-Europe
  – http://biodiversitylibrary.eu

Page 53: BHL / EOL technology sit down

BHL: COLLABORATION W/ BIG

Zea mays: http://www.eol.org/pages/1115259

Literature: http://biodiversitylibrary.org/name/Zea_mays

Page 54: BHL / EOL technology sit down

<insert discussion here>

• Existing issues in Jira
• Taxonomic name finding enhancements
  – Nomenclatural acts in web services
  – Other algorithms / verification
• WoRMS data
• Improvement
  – Ranking results
  – Visualization
• LifeDesks
  – Bibliography sharing
  – Resolve to articles