next steps for bhl and linked data

Next steps for BHL and Linked Data John Mignault Technical Advisory Group Biodiversity Heritage Library Twitter: @jmignault


Post on 30-Jan-2016



Page 1: Next steps for BHL and Linked Data

Next steps for BHL and Linked Data

John Mignault
Technical Advisory Group
Biodiversity Heritage Library
Twitter: @jmignault

Page 2: Next steps for BHL and Linked Data

The Biodiversity Heritage Library

• BHL is a consortium of natural history and botanical libraries and research institutions

• An open access digital library for legacy biodiversity literature

• An open data repository of taxonomic names and bibliographic information

• An increasingly global effort
  – US/UK, Europe, Egypt, China, Africa

Page 3: Next steps for BHL and Linked Data

How much text are we talking?

• Just hit the 40 million page mark
• Tens of thousands of titles
• 110,000 volumes
• Internet Archive is BHL's scanning partner
• In conjunction with local scanning efforts

Page 4: Next steps for BHL and Linked Data

Issues we’ve faced

• OCR is a *BIG* deal
• A lot of literature is pre-1923
• Expanding the range of material in BHL

Page 5: Next steps for BHL and Linked Data

OCR is a *BIG* deal

• All book / literature digitization projects affected, not just BHL

• Especially problematic in BHL
  – More than 50 languages represented in BHL
  – Dates of publication from the 1400s to the 2000s
  – Irregular typeface / typesetting
  – Multiple languages on one page

• Botanical descriptions in Latin

Page 6: Next steps for BHL and Linked Data

2007 Name Finding Study

>35% OCR error rate for names only

Top OCR errors

 1  Insert Space     8  n->v
 2  Omit Space       9  l->i
 3  e->c            10  r->i
 4  u->I            11  u->ii
 5  u->n            12  h->l
 6  i->l            13  h->ii
 7  c->e            14  e->o

35.16%: of the 3,003 names, 1,056 were incorrectly transcribed by OCR.

Wei, et al. An Evaluation of Taxonomic Name Recognition (TNR) in the Biodiversity Heritage Library. Proceedings of TDWG. 2008. http://www.tdwg.org/proceedings/article/view/380
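
The headline figure follows directly from the counts on the slide, and the "Top OCR errors" table suggests one way names might be recovered. The sketch below is an illustration only, not the method used in the study. It recomputes the error rate, then undoes a few of the listed character confusions to propose corrections for an OCR'd string, keeping only candidates found in a small, hypothetical list of known names (`KNOWN`). It also assumes each table entry reads "printed character -> OCR output".

```python
# Counts taken from the slide above.
total_names = 3003
misread_names = 1056
error_rate = misread_names / total_names * 100

# A few confusions from the "Top OCR errors" table, stored as
# (what OCR produced, what was actually printed).
# Assumption: the table entry "e->c" means a printed "e" was read as "c".
REVERSALS = [
    ("c", "e"), ("e", "c"), ("n", "u"), ("i", "l"),
    ("l", "i"), ("ii", "u"), ("v", "n"),
]

def candidates(ocr_text):
    """Return every string reachable by undoing one confusion at one position."""
    found = set()
    for wrong, right in REVERSALS:
        start = ocr_text.find(wrong)
        while start != -1:
            found.add(ocr_text[:start] + right + ocr_text[start + len(wrong):])
            start = ocr_text.find(wrong, start + 1)
    return found

def correct(ocr_text, known_names):
    """Keep only the candidates that appear in a list of known names."""
    return sorted(candidates(ocr_text) & set(known_names))

# Hypothetical lookup list; a real system would use a names authority.
KNOWN = {"Cyprinus carpio", "Salmo trutta"}

print(f"{error_rate:.2f}%")
print(correct("Cyprinns carpio", KNOWN))  # "u" misread as "n"
print(correct("Salmo triitta", KNOWN))    # "u" misread as "ii"
```

Undoing only one confusion per string keeps the candidate set small. Names with two or more OCR errors would need repeated passes or a fuzzy-matching approach instead.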

Page 7: Next steps for BHL and Linked Data

Abbild ungen und Beschreibungen der

Fische Syriens, nebst

einer neuen Classification und Characteristik sämmtlicher Gattungen

der i

JOH. JAKOB HECKEL, Inipectoi am k. k. Hof-Natur.-iUenkabinete in

Wien, mehr, yelelirt. UeHtllMeii. MIfglivd.

STUTTGART. E. Schweizerbart' sehe Verlagshandlung,

1843.

Page 8: Next steps for BHL and Linked Data

Older material

• A great deal of material is pre-1923
• Irregular fonts, e.g. blackletter
• Multiple languages on the same page, e.g. English text with Latin scientific names
• Changes in geographic names
• Changes in scientific names

Page 9: Next steps for BHL and Linked Data

*E.xvi c piteI von c. cXx.WptdvonfnrWmn � �bu fbe;bcn.5 am cix bIa S &3rn~ 41X � �a m cv(f b1air 'o et ert oiensr ; � � � �

', : hlrfc c wa ff 4am.diug bist a� � � �6aiw~s ff oJrJtwt nof bL4ecImt& blfafra mem b t wag `wr 4 cn wiu 4 e8t5m.ed bvUratflb ck wuo, ma144'*4I bttE5rmbebt =rt3'kn am4ra tif vrmr Waff C * t6rmnli an `tn ciblatGteaM �w ?ffoaifrn w4wmeu nu weib e , wpiteI voE5teiri ct c ober gtUcr cit cm` 91 cLi biar J ' >bSciatl Oiff ;Bruet wacfttc n qmcx b1a bl: �bt5c lttmtt bb9 lkr w.llr#e iti ncn xoa ff cu :r trtuft *e t B Rn " trv W1Rt' ?Cm c blas � �waIwutr Ober ci ti 1V Ces ' wt �gbtiemwwajfu tpctt, afferain 9 c: b titbfof �

r f eran m rs bra wlg auig4;f aer m *mc vrt � �blatcabtfm wfru an'deg~m rt blas Iaum bwWt run f ncmai b14ianf tJobrrfan �ebrut4net vnber Brwt Ober awawi*m.crriii btafwfm uww c on$ 'it ttu wttkc 5,10 $ m~C fca trc* cx u W e &mcyfbq4 Mabtt mmw � �rc a iiu bc Jcn ncI.end.*, blat s. a\ u: rprd3 �rw4ftf wm c ii,+ ttCC tn wa frr9fr orfab fcfbt enb c optiti bt -r9 ceDa ttDcn i34M sn Sem i

Page 10: Next steps for BHL and Linked Data

Expanding scope

• Manuscripts, field notebooks: mostly handwritten, often with drawings

• Global expansion means dealing with non-Western script systems and a whole new set of OCR problems
  – Arabic materials from Bibliotheca Alexandrina in Egypt

Page 11: Next steps for BHL and Linked Data

Images

Page 12: Next steps for BHL and Linked Data

Some current initiatives

• Scientific name extraction
• “Parts”
• PDF Generator

Page 13: Next steps for BHL and Linked Data

Scientific Name Extraction

• TaxonFinder algorithm in production since 2008
  – More than 100 million candidate name strings
  – More than 1.5 million unique, verified names
  – Available through UI, APIs, Data Exports & Internet Archive

• New collaboration with Global Names
  – Improved algorithm, better precision & recall
  – More data!
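
To make the distinction between candidate name strings and verified names concrete, here is a deliberately naive sketch. It is not the TaxonFinder or Global Names algorithm. It simply treats any capitalized word followed by a lowercase word as a candidate binomial, then keeps the candidates that appear in a small, hypothetical reference list (`VERIFIED`).

```python
import re

# Capitalized genus-like word followed by a lowercase species-like word.
# This pattern is an assumption for illustration; it over-generates on purpose.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]{3,})\b")

# Hypothetical reference list standing in for a names authority.
VERIFIED = {"Salmo trutta", "Cyprinus carpio"}

def find_candidates(text):
    """Return every string in the text that looks like a binomial."""
    return [" ".join(match.groups()) for match in BINOMIAL.finditer(text)]

def verify(candidate_names):
    """Keep only candidates present in the reference list."""
    return [name for name in candidate_names if name in VERIFIED]

page = (
    "The trout Salmo trutta was taken near Vienna. "
    "Near the shore we also found Cyprinus carpio."
)
found = find_candidates(page)
print(found)          # includes false positives such as "The trout"
print(verify(found))  # only names present in VERIFIED
```

On this sample the pattern yields four candidates, of which two survive verification. That gap between candidates and verified names is what better precision and recall aim to narrow.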

Page 14: Next steps for BHL and Linked Data

Finding parts

• Disambiguating and locating structural boundaries in the corpus

• Done mainly by crowdsourced means
  – Citebank

• Greatly increases usability and semantic value of the dataset

• Addressing is important: it makes data addressable and thus linkable
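
One way to picture why addressing matters: once an article's page range within a scanned volume is recorded, any page can be resolved to the part that contains it, and the part can be given its own link. The sketch below assumes a hypothetical `Part` record and a made-up URL pattern under example.org. It does not reflect BHL's actual data model or identifiers.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Part:
    part_id: int      # hypothetical identifier
    title: str
    first_page: int   # page sequence numbers within one scanned volume
    last_page: int

    def url(self) -> str:
        # Made-up address pattern, for illustration only.
        return f"https://example.org/part/{self.part_id}"

def part_for_page(parts: List[Part], page: int) -> Optional[Part]:
    """Return the part whose page range contains the given page, if any."""
    for part in parts:
        if part.first_page <= page <= part.last_page:
            return part
    return None

volume = [
    Part(1, "First article", 1, 24),
    Part(2, "Second article", 25, 60),
]

hit = part_for_page(volume, 30)
print(hit.title, hit.url())
print(part_for_page(volume, 75))  # page outside every recorded part
```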

Page 15: Next steps for BHL and Linked Data

Articles in the BHL UI

Page 16: Next steps for BHL and Linked Data

Images

Page 17: Next steps for BHL and Linked Data

PDF Generator

Page 18: Next steps for BHL and Linked Data

What we’d like to do
http://biodivlib.wikispaces.com/BHL+and+Gaming

Challenges framed as games:

• Correcting OCR
• Rekeying Tables of Contents
• Researching candidate Scientific Names
• Image identification & extraction
  – http://biodivlib.wikispaces.com/Art+of+Life
  – Currently funded by NEH

Page 19: Next steps for BHL and Linked Data

We need your help

• “When in doubt, use humans.” (@dpatil, http://radar.oreilly.com/2012/07/data-jujitsu.html)

• Increase value of biodiversity domain through improved data integration

• Many similarities between specimen labels and literature

Page 20: Next steps for BHL and Linked Data

Need deep intertwingling

• Wider integration of biodiversity data
• Normalization through controlled vocabularies and authorities
• Linkages between
  – Specimens
  – Descriptions
  – Articles
  – Manuscripts
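
As a rough illustration of such linkages, the sketch below stores statements as subject-predicate-object triples, the basic shape of Linked Data, and follows them from a name to everything that refers to it directly or indirectly. All identifiers and predicate names are hypothetical placeholders under example.org, not BHL data or any real vocabulary.

```python
# Each statement is a (subject, predicate, object) triple.
# Every URI and predicate here is a hypothetical placeholder.
EX = "https://example.org/"

TRIPLES = [
    (EX + "specimen/1", "identifiedAs", EX + "name/salmo-trutta"),
    (EX + "description/1", "describes", EX + "name/salmo-trutta"),
    (EX + "article/1", "contains", EX + "description/1"),
    (EX + "manuscript/1", "mentions", EX + "specimen/1"),
]

def linked_to(target, triples):
    """Return the subjects that point directly at the target."""
    return {s for s, _, o in triples if o == target}

def reachable_from(target, triples):
    """Follow links backwards until no new resources turn up."""
    seen, frontier = set(), {target}
    while frontier:
        pointing = {s for node in frontier for s in linked_to(node, triples)}
        frontier = pointing - seen
        seen |= frontier
    return seen

name = EX + "name/salmo-trutta"
for resource in sorted(reachable_from(name, TRIPLES)):
    print(resource)
```

Using one shared identifier for the name is what allows the specimen, description, article and manuscript to be connected at all, which is the role controlled vocabularies and authorities play.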

Page 21: Next steps for BHL and Linked Data

To sum up

• BHL is a massive dataset useful for multidisciplinary research
  – Systematics
  – Natural Language Processing
  – Humanities

• BHL is open
  – Free to use at http://biodiversitylibrary.org
  – Open access data for scholarly use & reuse

• BHL has APIs and data exports to enable reuse
  – BHL data can be incorporated into other virtual research environments

Page 22: Next steps for BHL and Linked Data

Get involved

• http://biodiversitylibrary.org
• http://biodivlib.wikispaces.com/Developer+Tools+and+API
• http://biodivlib.wikispaces.com/BHL+and+Gaming

• Thanks!