BHL: Big Data, Big Challenges
Presented at the EOL Semantic Reasoning Workshop, 6-7 Sep 2012, Washington, DC.
Chris Freeland
Founding Technical Director, BHL
Sr. Director, University Academic Computing, Washington University
@chrisfreeland
> 100,000 books, > 39 million pages, > 70 TB of data
BHL Architecture: Window Seat Ed.
• Storage: BHL DB; Internet Archive (Images (JP2), PDF, coordinate-based OCR, XML metadata)
• Logic: Data Transform Utilities, Geocoding, Name Finding
• Access: APIs, UI, Data Exports
• Firewall
BHL Content: Structured Data
• Metadata from Library catalogues at title/volume level
<titleInfo>
  <title>Caroli Linnaei ... Species plantarum : exhibentes plantas rite cognitas, ad genera relatas, cum differentiis specificis, nominibus trivialibus, synonymis selectis, locis natalibus, secundum systema sexuale digestas...</title>
</titleInfo>
<titleInfo type="abbreviated">
  <title>Sp. Pl.</title>
</titleInfo>
<name type="personal">
  <namePart>Linné, Carl von,</namePart>
  <namePart type="date">1707-1778</namePart>
</name>
<typeOfResource>text</typeOfResource>
<genre authority="marcgt">book</genre>
<originInfo>
  <place>
    <placeTerm type="text">Holmiae :</placeTerm>
  </place>
  <publisher>Impensis Laurentii Salvii,</publisher>
  <dateIssued>1753.</dateIssued>
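As a rough illustration of how this title-level metadata can be consumed, the sketch below pulls a few fields out of an abridged copy of the record using Python's standard library. The `<mods>` wrapper, the closing `</originInfo>` tag, and the helper name `summarize_mods` are assumptions added so the excerpt parses on its own. Full MODS records normally declare the `http://www.loc.gov/mods/v3` namespace, which would call for namespace-aware lookups.

```python
import xml.etree.ElementTree as ET

# Abridged from the slide. The <mods> wrapper and the closing </originInfo>
# tag are added here only so that the excerpt is well-formed by itself.
MODS_FRAGMENT = """
<mods>
  <titleInfo><title>Caroli Linnaei ... Species plantarum</title></titleInfo>
  <titleInfo type="abbreviated"><title>Sp. Pl.</title></titleInfo>
  <name type="personal">
    <namePart>Linné, Carl von,</namePart>
    <namePart type="date">1707-1778</namePart>
  </name>
  <originInfo>
    <publisher>Impensis Laurentii Salvii,</publisher>
    <dateIssued>1753.</dateIssued>
  </originInfo>
</mods>
"""


def summarize_mods(xml_text):
    """Return a small dict of title, author and date fields (hypothetical helper)."""
    root = ET.fromstring(xml_text.strip())
    record = {"title": None, "abbreviated_title": None}
    for info in root.findall("titleInfo"):
        key = "abbreviated_title" if info.get("type") == "abbreviated" else "title"
        record[key] = info.findtext("title")
    # The first <namePart> carries the name; the typed one carries the life dates.
    author = root.findtext("name/namePart") or ""
    record["author"] = author.rstrip(", ")
    record["author_dates"] = root.findtext("name/namePart[@type='date']")
    # Catalogue punctuation ("1753.") is trimmed for easier downstream use.
    issued = root.findtext("originInfo/dateIssued") or ""
    record["date_issued"] = issued.rstrip(".")
    return record


if __name__ == "__main__":
    for field, value in summarize_mods(MODS_FRAGMENT).items():
        print(f"{field}: {value}")
```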
BHL Content: Unstructured Data
More than 39 million pages!
... of uncorrected OCR
Abbild ungen und Beschreibungen der
Fische Syriens, nebst
einer neuen Classification und Characteristik sämmtlicher Gattungen
der i
JOH. JAKOB HECKEL, Inipectoi am k. k. Hof-Natur.-iUenkabinete in
Wien, mehr, yelelirt. UeHtllMeii. MIfglivd.
STUTTGART. E. Schweizerbart' sehe Verlagshandlung,
1843.
*E.xvi c piteI von c. cXx.WptdvonfnrWmn � �bu fbe;bcn.5 am cix bIa S &3rn~ 41X � �a m cv(f b1air 'o et ert oiensr ; � � � �
', : hlrfc c wa ff 4am.diug bist a� � � �6aiw~s ff oJrJtwt nof bL4ecImt& blfafra mem b t wag `wr 4 cn wiu 4 e8t5m.ed bvUratflb ck wuo, ma144'*4I bttE5rmbebt =rt3'kn am4ra tif vrmr Waff C * t6rmnli an `tn ciblatGteaM �w ?ffoaifrn w4wmeu nu weib e , wpiteI voE5teiri ct c ober gtUcr cit cm` 91 cLi biar J ' >bSciatl Oiff ;Bruet wacfttc n qmcx b1a bl: �bt5c lttmtt bb9 lkr w.llr#e iti ncn xoa ff cu :r trtuft *e t B Rn " trv W1Rt' ?Cm c blas � �waIwutr Ober ci ti 1V Ces ' wt �gbtiemwwajfu tpctt, afferain 9 c: b titbfof �
r f eran m rs bra wlg auig4;f aer m *mc vrt � �blatcabtfm wfru an'deg~m rt blas Iaum bwWt run f ncmai b14ianf tJobrrfan �ebrut4net vnber Brwt Ober awawi*m.crriii btafwfm uww c on$ 'it ttu wttkc 5,10 $ m~C fca trc* cx u W e &mcyfbq4 Mabtt mmw � �rc a iiu bc Jcn ncI.end.*, blat s. a\ u: rprd3 �rw4ftf wm c ii,+ ttCC tn wa frr9fr orfab fcfbt enb c optiti bt -r9 ceDa ttDcn i34M sn Sem i
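One way to put a number on OCR quality like the samples above is a character error rate: the edit distance between the OCR output and a corrected transcription, divided by the length of the correction. The sketch below is a minimal, assumed approach rather than anything BHL is stated to use. The corrected line is a hand correction of the first sample line, supplied here for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def character_error_rate(ocr_text, corrected_text):
    """Edit distance normalised by the length of the corrected text."""
    return edit_distance(ocr_text, corrected_text) / max(len(corrected_text), 1)


ocr_line = "Abbild ungen und Beschreibungen der"
corrected_line = "Abbildungen und Beschreibungen der"  # assumed hand correction

if __name__ == "__main__":
    print(edit_distance(ocr_line, corrected_line))  # one stray space
    print(round(character_error_rate(ocr_line, corrected_line), 4))
```

The heavily garbled blackletter sample would score far worse on the same measure, which is what makes uncorrected OCR hard to search or mine.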
How to connect/consume
http://biodivlib.wikispaces.com/Developer+Tools+and+API
• BHL DB: APIs, UI, Data Exports
• Internet Archive: Images (JP2), PDF, coordinate-based OCR, XML metadata
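For the Internet Archive side, a consumer typically builds download URLs from an item identifier. The sketch below assumes the usual archive.org pattern of `https://archive.org/download/<identifier>/<filename>` and the common derivative suffixes shown. The suffix table, the helper name `derivative_urls`, and the sample identifier are all assumptions for illustration and should be checked against the file list of a real item.

```python
from urllib.parse import quote

IA_DOWNLOAD = "https://archive.org/download"

# Assumed file-name suffixes for the four derivative types named on the slide.
SUFFIXES = {
    "images_jp2": "_jp2.zip",
    "pdf": ".pdf",
    "coordinate_ocr": "_djvu.xml",
    "xml_metadata": "_meta.xml",
}


def derivative_urls(identifier):
    """Map each derivative type to a download URL for one item (hypothetical helper)."""
    safe = quote(identifier, safe="")
    return {kind: f"{IA_DOWNLOAD}/{safe}/{safe}{suffix}"
            for kind, suffix in SUFFIXES.items()}


if __name__ == "__main__":
    # "exampleitem00test" is a made-up identifier, not a real BHL item.
    for kind, url in derivative_urls("exampleitem00test").items():
        print(f"{kind}: {url}")
```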
BHL Data Challenge: Name Finding (pre-2007)
BHL Data Challenge: Name Finding
• TaxonFinder algorithm in production since 2008
  – More than 100 million candidate name strings
  – More than 1.5 million unique, verified names
  – Available through UI, APIs, Data Exports & Internet Archive
• New collaboration with Global Names
  – Improved algorithm, better precision & recall
  – More data!
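Precision and recall, the two measures mentioned above, can be computed by comparing the candidate strings a name finder emits against a verified list. The sketch below shows only the arithmetic. It does not reproduce TaxonFinder or the Global Names algorithm, and both name lists are invented toy data.

```python
def precision_recall(candidates, verified):
    """Return (precision, recall) for candidate names against a verified set."""
    candidates, verified = set(candidates), set(verified)
    true_positives = candidates & verified
    precision = len(true_positives) / len(candidates) if candidates else 0.0
    recall = len(true_positives) / len(verified) if verified else 0.0
    return precision, recall


# Toy data: four strings flagged as candidate names on a page, checked
# against three names assumed to actually occur on that page.
found = ["Cyprinus carpio", "Salmo trutta", "Hof Natur", "Fische Syriens"]
truth = ["Cyprinus carpio", "Salmo trutta", "Barbus barbus"]

if __name__ == "__main__":
    precision, recall = precision_recall(found, truth)
    print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the four candidates are real names (precision 0.5) and two of the three true names were found (recall about 0.67). A better algorithm raises both figures at once.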
New Data Challenges
http://biodivlib.wikispaces.com/BHL+and+Gaming
• Correcting OCR
• Rekeying Tables of Contents
• Researching candidate Scientific Names
• Image identification & extraction
  – http://biodivlib.wikispaces.com/Art+of+Life
  – Currently funded by NEH
^Challenges framed as games