Next steps for BHL and Linked Data John Mignault Technical Advisory Group Biodiversity Heritage Library Twitter: @jmignault


Post on 29-Dec-2015


TRANSCRIPT

Next steps for BHL and Linked Data

John Mignault
Technical Advisory Group
Biodiversity Heritage Library
Twitter: @jmignault

The Biodiversity Heritage Library

• BHL is a consortium of natural history and botanical libraries and research institutions

• An open access digital library for legacy biodiversity literature

• An open data repository of taxonomic names and bibliographic information

• An increasingly global effort
  – US/UK, Europe, Egypt, China, Africa

How much text are we talking?

• Just hit the 40-million-page mark
• Tens of thousands of titles
• 110,000 volumes
• Internet Archive is BHL's scanning partner
• In conjunction with local scanning efforts

Issues we’ve faced

• OCR is a *BIG* deal
• A lot of literature is pre-1923
• Expanding the range of material in BHL

OCR is a *BIG* deal

• All book / literature digitization projects affected, not just BHL

• Especially problematic in BHL
  – More than 50 languages represented in BHL
  – Dates of publication from the 1400s to the 2000s
  – Irregular typeface / typesetting
  – Multiple languages on one page

• Botanical descriptions in Latin

2007 Name Finding Study

>35% OCR error rate for names only

Top OCR errors

Rank  Error          Rank  Error
1     Insert space   8     n->v
2     Omit space     9     l->i
3     e->c           10    r->i
4     u->I           11    u->ii
5     u->n           12    h->l
6     i->l           13    h->ii
7     c->e           14    e->o

Of the 3,003 names, 1,056 (35.16%) were incorrectly transcribed by OCR.

Wei, et al. An Evaluation of Taxonomic Name Recognition (TNR) in the Biodiversity Heritage Library. Proceedings of TDWG. 2008. http://www.tdwg.org/proceedings/article/view/380
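The character confusions listed in the study suggest one simple way to recover a misread name: undo a single listed substitution and see whether the result appears in a list of known names. The sketch below is only an illustration of that idea, not the method used in the study or in BHL. It assumes each "a->b" entry means that a true "a" was read as "b", and the `KNOWN` name list and all function names are hypothetical. The last line reproduces the error-rate arithmetic from the study figures.

```python
# Confusion pairs from the "Top OCR errors" list, as (true text, OCR output).
# Assumption: "e->c" means a true "e" was read as "c".
CONFUSIONS = [
    ("e", "c"), ("u", "I"), ("u", "n"), ("i", "l"), ("c", "e"),
    ("n", "v"), ("l", "i"), ("r", "i"), ("u", "ii"), ("h", "l"),
    ("h", "ii"), ("e", "o"),
]


def candidates(ocr_name):
    """Return every string reachable by undoing exactly one listed confusion."""
    found = set()
    for true, seen in CONFUSIONS:
        start = ocr_name.find(seen)
        while start != -1:
            found.add(ocr_name[:start] + true + ocr_name[start + len(seen):])
            start = ocr_name.find(seen, start + 1)
    return found


def correct(ocr_name, known_names):
    """Return the known name for an OCR string, or None if there is no unique match."""
    if ocr_name in known_names:
        return ocr_name
    matches = candidates(ocr_name) & known_names
    return matches.pop() if len(matches) == 1 else None


# Hypothetical authority list, for illustration only.
KNOWN = {"Salmo trutta", "Cyprinus carpio"}

print(correct("Salmo trntta", KNOWN))   # a "u" misread as "n"
print(round(1056 / 3003 * 100, 2))      # error rate from the study figures
```

Undoing only one substitution keeps the candidate set small; names with two or more OCR errors would need a wider search, such as edit distance against the name list.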

Abbild ungen und Beschreibungen der

Fische Syriens, nebst

einer neuen Classification und Characteristik sämmtlicher Gattungen

der i

JOH. JAKOB HECKEL, Inipectoi am k. k. Hof-Natur.-iUenkabinete in

Wien, mehr, yelelirt. UeHtllMeii. MIfglivd.

STUTTGART. E. Schweizerbart' sehe Verlagshandlung,

1843.

Older material

• Great deal of material is pre-1923
• Irregular fonts (blackletter)
• Multiple languages on the same page (English text with Latin scientific names)
• Changes in geographic names
• Changes in scientific names

*E.xvi c piteI von c. cXx.WptdvonfnrWmn � �bu fbe;bcn.5 am cix bIa S &3rn~ 41X � �a m cv(f b1air 'o et ert oiensr ; � � � �

', : hlrfc c wa ff 4am.diug bist a� � � �6aiw~s ff oJrJtwt nof bL4ecImt& blfafra mem b t wag `wr 4 cn wiu 4 e8t5m.ed bvUratflb ck wuo, ma144'*4I bttE5rmbebt =rt3'kn am4ra tif vrmr Waff C * t6rmnli an `tn ciblatGteaM �w ?ffoaifrn w4wmeu nu weib e , wpiteI voE5teiri ct c ober gtUcr cit cm` 91 cLi biar J ' >bSciatl Oiff ;Bruet wacfttc n qmcx b1a bl: �bt5c lttmtt bb9 lkr w.llr#e iti ncn xoa ff cu :r trtuft *e t B Rn " trv W1Rt' ?Cm c blas � �waIwutr Ober ci ti 1V Ces ' wt �gbtiemwwajfu tpctt, afferain 9 c: b titbfof �

r f eran m rs bra wlg auig4;f aer m *mc vrt � �blatcabtfm wfru an'deg~m rt blas Iaum bwWt run f ncmai b14ianf tJobrrfan �ebrut4net vnber Brwt Ober awawi*m.crriii btafwfm uww c on$ 'it ttu wttkc 5,10 $ m~C fca trc* cx u W e &mcyfbq4 Mabtt mmw � �rc a iiu bc Jcn ncI.end.*, blat s. a\ u: rprd3 �rw4ftf wm c ii,+ ttCC tn wa frr9fr orfab fcfbt enb c optiti bt -r9 ceDa ttDcn i34M sn Sem i

Expanding scope

• Manuscripts, field notebooks
  – Mostly handwritten, often with drawings

• Global expansion means dealing with non-Western script systems and a whole new set of OCR problems
  – Arabic materials from Bibliotheca Alexandrina in Egypt

Images

Some current initiatives

• Scientific name extraction
• “Parts”
• PDF Generator

Scientific Name Extraction

• TaxonFinder algorithm in production since 2008
  – More than 100 million candidate name strings
  – More than 1.5 million unique, verified names
  – Available through UI, APIs, Data Exports & Internet Archive

• New collaboration with Global Names
  – Improved algorithm, better precision & recall
  – More data!
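The gap between candidate name strings and verified names can be pictured with a deliberately naive sketch. This is not TaxonFinder or the Global Names algorithm; it only assumes that a candidate binomial looks like a capitalized word followed by a lowercase word, and that verification means a lookup in an authority list. The pattern, function names, sample text, and authority set are all hypothetical.

```python
import re

# Naive pattern: capitalized genus-like word followed by a lowercase epithet-like word.
BINOMIAL = re.compile(r"\b([A-Z][a-z]{2,})\s+([a-z]{3,})\b")


def candidate_names(text):
    """Return every string in text that merely looks like a binomial."""
    return [f"{genus} {epithet}" for genus, epithet in BINOMIAL.findall(text)]


def verified_names(text, authority):
    """Keep only the unique candidates that appear in the authority list."""
    return sorted({name for name in candidate_names(text) if name in authority})


# Hypothetical sample text and authority list, for illustration only.
TEXT = "The brown trout, Salmo trutta, was found. Also Cyprinus carpio occurs here."
AUTHORITY = {"Salmo trutta", "Cyprinus carpio"}

print(candidate_names(TEXT))            # includes the false positive "The brown"
print(verified_names(TEXT, AUTHORITY))
```

The pattern over-generates on purpose: ordinary sentence openings match it too, which is why candidates far outnumber verified names.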

Finding parts

• Disambiguating and locating structural boundaries in the corpus

• Done mainly by crowdsourced means
  – Citebank

• Greatly increases usability and semantic value of the dataset

• Addressing is important: it makes data addressable and thus linkable

Articles in the BHL UI

Images

PDF Generator

What we’d like to do

http://biodivlib.wikispaces.com/BHL+and+Gaming

Challenges framed as games:

• Correcting OCR
• Rekeying Tables of Contents
• Researching candidate Scientific Names
• Image identification & extraction
  – http://biodivlib.wikispaces.com/Art+of+Life
  – Currently funded by NEH

We need your help

• “When in doubt, use humans.” (@dpatil, http://radar.oreilly.com/2012/07/data-jujitsu.html)

• Increase value of biodiversity domain through improved data integration

• Many similarities between specimen labels and literature

Need deep intertwingling

• Wider integration of biodiversity data
• Normalization through controlled vocabularies and authorities
• Linkages between
  – Specimens
  – Descriptions
  – Articles
  – Manuscripts

To sum up

• BHL is a massive dataset useful for multidisciplinary research
  – Systematics
  – Natural Language Processing
  – Humanities

• BHL is open
  – Free to use at http://biodiversitylibrary.org
  – Open access data for scholarly use & reuse

• BHL has APIs and data exports to enable reuse
  – BHL data can be incorporated into other virtual research environments

Get involved

• http://biodiversitylibrary.org
• http://biodivlib.wikispaces.com/Developer+Tools+and+API
• http://biodivlib.wikispaces.com/BHL+and+Gaming

• Thanks!