BHL Markup Efforts and Plans

Efforts and plans towards Markup of the BHL Content William Ulate R. BHL Technical Director Missouri Botanical Garden Berlin, Feb. 10, 2014 pro-iBiosphere Markup Workshop

Upload: william-ulate

Post on 10-May-2015


DESCRIPTION

Presentation about past BHL Markup Efforts and present Plans for the pro-iBiosphere Markup Workshop.

TRANSCRIPT

Page 1: BHL Markup Efforts and Plans

Efforts and plans towards Markup of the BHL Content

William Ulate R.
BHL Technical Director

Missouri Botanical Garden

Berlin, Feb. 10, 2014

pro-iBiosphere Markup Workshop

Page 2: BHL Markup Efforts and Plans

BHL Mission and Vision

Page 3: BHL Markup Efforts and Plans

More Online Content

[Chart: Pages (Millions) and Volumes (in Thousands) included in BHL, Oct-08 to Oct-13. Series: Volumes (K) and Pages (M); vertical axis from 0 to 140.]

Data labels as extracted, in order: 22.00, 40.00, 84.86, 94.6, 105.85, 120.09, 130.68 and 9.2, 16.4, 31.8, 35.4, 38.9, 41.9, 42.6

Page 4: BHL Markup Efforts and Plans

Subjects

Page 5: BHL Markup Efforts and Plans

New Types of Content

Page 6: BHL Markup Efforts and Plans

New Types of Content

Page 7: BHL Markup Efforts and Plans

Scientific Name Extraction

• TaxonFinder algorithm in production since 2008
– More than 100 million candidate name strings
– More than 1.5 million unique, verified names
– Available through UI, APIs, Data Exports & Internet Archive

• New collaboration with Global Names project
– Improved algorithm, better precision & recall
– More data with TaxonFinder and Neti Neti!
– http://gnrd.globalnames.org/
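Precision and recall, the two measures named on this slide, can be illustrated with a short sketch. This is not TaxonFinder or Neti Neti: the candidate strings and the reference set are made-up sample data, and `precision_recall` is a hypothetical helper written only for this illustration.

```python
def precision_recall(found, reference):
    """Return (precision, recall) for a set of extracted name strings.

    found:     name strings reported by a name-finding tool
    reference: name strings a human verified as present in the text
    """
    found, reference = set(found), set(reference)
    true_positives = len(found & reference)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Made-up sample data for illustration only.
found = {"Picea abies", "Amandava formosa", "Green Munia", "Sepsis annulipes"}
reference = {"Picea abies", "Amandava formosa", "Sepsis annulipes", "Trifolium ochroleucum"}

precision, recall = precision_recall(found, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision drops when the tool reports strings that are not names (here the vernacular "Green Munia"); recall drops when it misses names that are in the text.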

Page 8: BHL Markup Efforts and Plans

Taxon Names

BEFORE
Name Instances    101,591,803    101,288,804
Unique Names        7,498,554      7,464,924
Verified Names      1,905,507      1,902,803
EOL Names          63,130,350     62,963,582
EOL Pages          13,579,868     13,532,684

AFTER
Name Instances    151,222,182    150,066,425
Unique Names       29,246,382     29,091,767
Verified Names     10,153,165     10,109,540
EOL Names          87,791,695     87,135,089
EOL Pages          15,466,713     15,342,867
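To read the size of the change, the growth factors can be computed directly from the figures on this slide. The sketch below uses only the first figure of each BEFORE/AFTER pair, since the slide does not label the two number columns; `growth_factors` is a hypothetical helper.

```python
# First figure of each BEFORE/AFTER pair, copied from the slide.
before = {
    "Name Instances": 101_591_803,
    "Unique Names": 7_498_554,
    "Verified Names": 1_905_507,
    "EOL Names": 63_130_350,
    "EOL Pages": 13_579_868,
}
after = {
    "Name Instances": 151_222_182,
    "Unique Names": 29_246_382,
    "Verified Names": 10_153_165,
    "EOL Names": 87_791_695,
    "EOL Pages": 15_466_713,
}


def growth_factors(before, after):
    """Return after/before for every label present in `before`."""
    return {label: after[label] / before[label] for label in before}


factors = growth_factors(before, after)
for label, factor in factors.items():
    print(f"{label:<15} x{factor:.2f}")
```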

Page 9: BHL Markup Efforts and Plans
Page 10: BHL Markup Efforts and Plans
Page 11: BHL Markup Efforts and Plans

Article-level metadata

Chapter-level metadata

Treatment-level metadata

Part-level metadata

Page 12: BHL Markup Efforts and Plans

Articles in the BHL UI

Page 13: BHL Markup Efforts and Plans
Page 14: BHL Markup Efforts and Plans

See also:

Page 15: BHL Markup Efforts and Plans

Related Titles

Page 16: BHL Markup Efforts and Plans

Smithsonian

San Francisco

Woods Hole

London

Alexandria

Beijing

Global Replication & Serving

Replicated Data Center

Portal Application

Page 17: BHL Markup Efforts and Plans

BHL-Europe Term Expansion

Page 18: BHL Markup Efforts and Plans

Taxonomic Literature II (TL-2)

Page 19: BHL Markup Efforts and Plans

BioStor articles marked up with JATS
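As a rough illustration of what article-level metadata looks like in JATS, the sketch below builds a minimal `<article>` front matter with Python's standard library. The element names are standard JATS, but the record is not validated against the JATS DTD, the values are placeholders rather than a real BioStor record, and `jats_article` is a hypothetical helper, not BioStor's export code.

```python
import xml.etree.ElementTree as ET


def jats_article(journal_title, article_title, year, volume, first_page, last_page):
    """Build a minimal JATS <article> element holding article-level metadata."""
    article = ET.Element("article")
    front = ET.SubElement(article, "front")

    journal_meta = ET.SubElement(front, "journal-meta")
    title_group = ET.SubElement(journal_meta, "journal-title-group")
    ET.SubElement(title_group, "journal-title").text = journal_title

    article_meta = ET.SubElement(front, "article-meta")
    article_titles = ET.SubElement(article_meta, "title-group")
    ET.SubElement(article_titles, "article-title").text = article_title
    pub_date = ET.SubElement(article_meta, "pub-date")
    ET.SubElement(pub_date, "year").text = str(year)
    ET.SubElement(article_meta, "volume").text = str(volume)
    ET.SubElement(article_meta, "fpage").text = str(first_page)
    ET.SubElement(article_meta, "lpage").text = str(last_page)
    return article


# Placeholder values, not a real BioStor record.
article = jats_article("Example Journal", "An example article", 1898, 12, 101, 110)
xml_text = ET.tostring(article, encoding="unicode")
print(xml_text)
```

The first and last page elements are what tie an article back to a page range inside a scanned volume.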

Page 20: BHL Markup Efforts and Plans

Art of Life

Page 21: BHL Markup Efforts and Plans

Art of Life

Page 22: BHL Markup Efforts and Plans

Art of Life

Page 23: BHL Markup Efforts and Plans
Page 24: BHL Markup Efforts and Plans

Art of Life

Page 25: BHL Markup Efforts and Plans

Macaw

https://github.com/cajunjoel/macaw-book-metadata-tool

Page 26: BHL Markup Efforts and Plans

Reviewing Metadata

Page 27: BHL Markup Efforts and Plans

Reviewing Metadata

Page 28: BHL Markup Efforts and Plans
Page 29: BHL Markup Efforts and Plans

Manually built:

1,693 sets

87,879 images

Page 30: BHL Markup Efforts and Plans
Page 31: BHL Markup Efforts and Plans

The Art of Life schema: describing and providing access to natural history illustrations from the Biodiversity Heritage Library (BHL)

by William Ulate, Trish Rose-Sandler, Gaurav Vaidya, Robert Guralnick

Title         Stictospiza formosa
Type          Illustrations
Date          Publication: 1898
Agent         Author: Arthur G. Butler (1844-1925)
              Illustrator: F.W. Frohawk (1861-1946)
Description   A pair of finches with green and yellow bodies resting on reeds
Subjects      Scientific name: Amandava formosa (Latham, 1790)
              Vernacular Name: Green Avadavat or Green Munia
              Accepted Name: Amandava formosa (Latham, 1790)
              Birds, finches
Inscriptions  bottom center: Green Amaduvade Waxbill (Stictospiza formosa)
Source        Butler, Arthur Gardiner. Foreign finches in captivity. Hull and London: Brumby and Clarke, limited, 1889 (2nd edition). This image comes from the Biodiversity Heritage Library, and is available online at biodiversitylibrary.org/page/17195895
Rights        Public domain

Element / Definition / Example / Repeat

Element: Agents
Definition: Person or corporate entity involved in the creation, design, production, or publication of a visual resource.
Example:
    <vra:agent>
      <vra:name type="personal" vocab="LCNAF" refid="89015596">Curtis, John</vra:name>
      <vra:dates type="life">
        <vra:earliestDate>1791</vra:earliestDate>
        <vra:latestDate>1862</vra:latestDate>
      </vra:dates>
      <vra:role vocab="AAT" refid="300025574">publisher</vra:role>
    </vra:agent>
Repeat: Y

Element: Copyright
Definition: The copyright status of the visual resource.
Example:
    <vra:rights refid="http://creativecommons.org/licenses/by-nc/2.0/deed.en">Creative Commons Attribution-NonCommercial 2.0 Generic (CC BY-NC 2.0)</vra:rights>
Repeat: N

Element: Date
Definition: Date or range of dates associated with the creation or publication of the visual resource.
Example:
    <vra:date type="creation">
      <vra:earliestDate>1945</vra:earliestDate>
      <vra:latestDate>1955</vra:latestDate>
    </vra:date>
Repeat: Y

Element: Description
Definition: A free-text note about the content of the image, including comments, description, or interpretation, that gives additional information not recorded in other categories.
Example:
    <vra:description>This illustration shows a scale, coloured illustration of Sepsis annulipes (now known as Encita annulipes) beside the Trifolium ochroleucum plant. Several dissections from Sepsis cylindrica Fab. (all these details are provided on the next page of this book and the subsequent page).</vra:description>
Repeat: Y

Element: Inscriptions
Definition: All marks, captions, or written words added to the object at the time of production or in its subsequent history, including signatures, dates, dedications, texts, and colophons, as well as marks, such as the stamps of silversmiths, publishers, or printers.
Example:
    <vra:inscription>
      <vra:position>bottom</vra:position>
      <vra:text>Radula of L. souleyetianum on a more reduced scale</vra:text>
    </vra:inscription>
Repeat: Y

Element: Source
Definition: A citation for the book, journal or resource that hosts the visual resource.
Example:
    <vra:source>
      <vra:name type="book">Butler, Arthur Gardiner. Foreign finches in captivity. Hull and London: Brumby and Clarke, limited, 1889 (2nd edition).</vra:name>
      <vra:refid type="URI">http://biodiversitylibrary.org/page/17195895</vra:refid>
    </vra:source>
Repeat: N

Element: Subject
Definition: Terms or phrases that describe, identify, or interpret the visual resource.
Example:
    <vra:subject>
      <vra:term type="personalName">Carl Linnaeus</vra:term>
    </vra:subject>

    <dwc:scientificName>Plant: Picea abies</dwc:scientificName>
    <dwc:acceptedName>Plant: Picea abies</dwc:acceptedName>
    <dwc:vernacularName>Plant: Norway spruce</dwc:vernacularName>
Repeat: Y

Element: Title
Definition: The title or identifying phrase given to an image.
Example:
    <vra:title xml:lang="la">Sepsis annulipes</vra:title>
    <vra:title type="alternate">Orangutan</vra:title>
Repeat: Y

Element: Type
Definition: Identifies a general category for the visual resource.
Example:
    <vra:type>maps</vra:type>
    <vra:type>forestry maps</vra:type>
Repeat: Y

Example of illustration described using Art of Life schema

Art of Life schema elements required in Red

We welcome your feedback on the schema! http://tinyurl.com/9hm7nsb
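For readers who want to produce records of this shape programmatically, here is a minimal sketch that builds and re-reads a `vra:agent` element with Python's standard library. The namespace URI is assumed to be the VRA Core 4 one and should be confirmed against the schema actually in use; `vra` and `agent_element` are hypothetical helpers, and the person data comes from the example illustration described on this slide.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for VRA Core 4; confirm against the schema in use.
VRA_NS = "http://www.vraweb.org/vracore4.htm"
ET.register_namespace("vra", VRA_NS)


def vra(tag):
    """Return the namespace-qualified tag name ElementTree expects."""
    return f"{{{VRA_NS}}}{tag}"


def agent_element(name, role, born, died):
    """Build a <vra:agent> element shaped like the Agents example on the slide."""
    agent = ET.Element(vra("agent"))
    ET.SubElement(agent, vra("name"), {"type": "personal"}).text = name
    dates = ET.SubElement(agent, vra("dates"), {"type": "life"})
    ET.SubElement(dates, vra("earliestDate")).text = str(born)
    ET.SubElement(dates, vra("latestDate")).text = str(died)
    ET.SubElement(agent, vra("role")).text = role
    return agent


agent = agent_element("Frohawk, F.W.", "illustrator", 1861, 1946)
xml_text = ET.tostring(agent, encoding="unicode")
print(xml_text)

# Round trip: parse the serialized record and read the life dates back.
parsed = ET.fromstring(xml_text)
life = parsed.find(vra("dates"))
print(life.find(vra("earliestDate")).text, life.find(vra("latestDate")).text)
```

Because Agents is marked repeatable, a full record would hold one such element per author, illustrator, or publisher.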

Page 32: BHL Markup Efforts and Plans


Page 33: BHL Markup Efforts and Plans

OCR Improvements

• Gaming
• Transcription

Page 34: BHL Markup Efforts and Plans

OCR Improvements

• Transcription
• Purposeful Gaming

• Looking at…
– Crowdsource Markup

Page 35: BHL Markup Efforts and Plans

Purposeful Gaming

DIGITALKOOT

• Joint project run by the National Library of Finland and Microtask to index the library's enormous archives so that they are searchable on the Internet for easier access to the Finnish cultural heritage.

Page 36: BHL Markup Efforts and Plans

Purposeful Gaming

DIGITALKOOT

• Launched on Feb 8, 2011. By Nov 29, 2012, nearly 110,000 participants had completed over 8 million word-fixing tasks.

• DigiTalkoot enabled volunteers to participate in this fixing work by playing games.

Page 37: BHL Markup Efforts and Plans

Purposeful gaming and BHL: engaging the public in improving and enhancing access to digital texts

• IMLS Grant Program: National Leadership Grants for Libraries

• Partners:
– Missouri Botanical Garden
– Harvard University
– Cornell University
– New York Botanical Garden

• P.I.: Trish Rose-Sandler, Missouri Botanical Garden
• Dates: Dec 2013 – Nov. 2015

Page 38: BHL Markup Efforts and Plans

Project objectives and benefits

• Test new means of crowdsourcing to support the enhancement of content in BHL

• Demonstrate whether digital games are an effective tool for analyzing and improving digital outputs from OCR and transcription

• Benefits of gaming include:
– improved access to content by providing richer and more accurate data;
– an extension of limited staff resources; and
– exposure of library content to communities who may not know about the collections otherwise.

Page 39: BHL Markup Efforts and Plans

OCR Improvements

German text interpreted by the OCR process as: “unb auf ben ©elnrgen be6 fublic{)en”

Page 40: BHL Markup Efforts and Plans

OCR Improvements

Different resulting texts from parsing the phrase: “und auf den Gebirgen des südlichen Deutschlands”

(“and on the mountains of southern Germany”)

#   IA OCR            OCR 2          Transcription 1   Transcription 2
1   unb               und            und               und               Ok
2   den               ben            den               den               Ok
3   ©elnrgen          ©ebirgen       Bebirgen          Gebirgen          X
4   be6               des            de5               des               Chk
5   fublic{)en        fublichen      Füdlichen         Südlichen         X
6   £)eittfc{)(anb6   Deutfchlanbs   Deutfchlands      Deutschlands      X
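The status column can be reproduced with a simple agreement count over the four readings of each word. The rule below is an assumption inferred from the six rows of the table, not a documented BHL algorithm, and `consensus_status` is a hypothetical helper: three or more matching readings give "Ok", exactly two give "Chk" (needs checking), and no agreement gives "X".

```python
from collections import Counter


def consensus_status(readings):
    """Classify one word position from several independent readings.

    Assumed rule: "Ok" if at least three readings agree, "Chk" if the
    largest group of matching readings has two members, "X" if every
    reading is different.
    """
    top_count = Counter(readings).most_common(1)[0][1]
    if top_count >= 3:
        return "Ok"
    if top_count == 2:
        return "Chk"
    return "X"


# Rows copied from the table: IA OCR, OCR 2, Transcription 1, Transcription 2.
rows = [
    ["unb", "und", "und", "und"],
    ["den", "ben", "den", "den"],
    ["©elnrgen", "©ebirgen", "Bebirgen", "Gebirgen"],
    ["be6", "des", "de5", "des"],
    ["fublic{)en", "fublichen", "Füdlichen", "Südlichen"],
    ["£)eittfc{)(anb6", "Deutfchlanbs", "Deutfchlands", "Deutschlands"],
]

statuses = [consensus_status(row) for row in rows]
print(statuses)
```

Under this assumed rule the six rows yield the same statuses as the slide. Words marked "Chk" or "X" are the ones worth sending to a game or a human reviewer.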

Page 41: BHL Markup Efforts and Plans

Purposeful Gaming

Page 42: BHL Markup Efforts and Plans

iDigBio’s aOCR Hackathon

• Improve OCR parsing of labels with clear metrics (datasets, output formats, scoring algorithm)

• Libraries of regular expressions to clean up each field (different error correction for latitude/longitude coordinates than for personal names or herbarium catalog numbers)

• Tool for classifying segments of the image before submitting to OCR

• Do a first pass of OCR to clean images before sending them to a second, 'real' pass of OCR
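The idea of field-specific clean-up rules can be sketched as follows. This is not code from the hackathon: the patterns, the look-alike table, and both function names are illustrative assumptions, showing only that a coordinate field and a catalog-number field call for different corrections.

```python
import re
from typing import Optional

# Letters that OCR commonly confuses with digits (illustrative, not exhaustive).
DIGIT_LOOKALIKES = str.maketrans({"O": "0", "o": "0", "I": "1", "l": "1"})

# Degrees, minutes and hemisphere, e.g. "3O° l2' S".
COORDINATE = re.compile(
    r"^\s*([0-9OoIl]{1,3})\s*[°º*]\s*([0-9OoIl]{1,2})\s*['’`]\s*([NSEW])\s*$"
)


def clean_coordinate(text: str) -> Optional[str]:
    """Normalize a degrees/minutes coordinate, or return None if it does not parse."""
    match = COORDINATE.match(text)
    if match is None:
        return None
    # Inside a numeric field, letter look-alikes are safe to map to digits.
    degrees = int(match.group(1).translate(DIGIT_LOOKALIKES))
    minutes = int(match.group(2).translate(DIGIT_LOOKALIKES))
    return f"{degrees}°{minutes:02d}'{match.group(3)}"


def clean_catalog_number(text: str) -> str:
    """Catalog numbers get a different rule: drop separators and uppercase."""
    return re.sub(r"[^A-Za-z0-9]", "", text).upper()


print(clean_coordinate("3O° l2' S"))
print(clean_catalog_number(" mo-1234567 "))
```

Mapping "l" to "1" is reasonable inside a coordinate but would corrupt a collector's name, which is why each field gets its own rule set.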

Page 43: BHL Markup Efforts and Plans

iDigBio’s CITScribe Hackathon

1. Interoperability between public participation tools and biodiversity data systems,

2. Transcription quality assessment/quality control (QA/QC) and the reconciliation of replicate transcriptions,

3. Integration of optical character recognition (OCR) into the transcription workflow,

4. User engagement

Page 44: BHL Markup Efforts and Plans

NfN & iDigBio’s CITScribe Hackathon

• Jason Best’s DarwinScore
• Ben Brumfield’s Handwriting Gibberish Detector
• Dictionaries to improve crowdsourcing consensus (e.g., names of collectors, scientific names)
• Word Clouds created using n-gram scoring, faceting, and Solr for indexing + Carrot2 for specimen selection (visualize and explore the use of a word of interest from the word cloud) and a data cleaning step (the system highlights infrequent words).
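One way a dictionary can help replicate transcriptions converge is to snap each transcribed value to its closest dictionary entry before comparing them. The sketch below uses Python's standard `difflib` for the fuzzy match; it is not one of the hackathon tools, and the dictionary, the cutoff of 0.8, and `snap_to_dictionary` are illustrative assumptions.

```python
import difflib


def snap_to_dictionary(value, dictionary, cutoff=0.8):
    """Return the closest dictionary entry, or the value unchanged if none is close."""
    matches = difflib.get_close_matches(value, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else value


# Illustrative dictionary; a real one would hold collector or scientific names.
names = ["Picea abies", "Amandava formosa", "Sepsis annulipes"]

# Three replicate transcriptions of the same label, one with a typo.
transcriptions = ["Picea abeis", "Picea abies", "Picea abies"]
normalized = [snap_to_dictionary(t, names) for t in transcriptions]
print(normalized)
```

Values with no close dictionary entry are left unchanged, so names missing from the dictionary are not silently rewritten.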

Page 45: BHL Markup Efforts and Plans

NESCent EOL-BHL Research Sprint

There is no place like home: Defining “habitat” for biodiversity science

Robert D. Stevenson
UMass Boston, Dept. of Biology, 100 Morrissey Blvd., Boston, MA 02125-3393

Carl Nordman (NatureServe) and

Evangelos Pafilis
Hellenic Centre for Marine Research, P.O. Box 2214, Heraklion, 71003, Crete, Greece

Page 46: BHL Markup Efforts and Plans

NESCent EOL-BHL Research Sprint

Assessing Risk Status of Mexican Amphibians Through Data Mining.

Esther Quintero and Bárbara Ayala
National Commission for Knowledge and Use of Biodiversity (CONABIO)

and

Anne Thessen
Marine Biological Laboratory and Arizona State University

Page 47: BHL Markup Efforts and Plans

NESCent EOL-BHL Research Sprint

Evolution in the usage of anatomical concepts in the biodiversity literature

Todd Vision ([email protected]),

Prashanti Manda ([email protected]), and

Dongye Meng
University of North Carolina at Chapel Hill

Page 48: BHL Markup Efforts and Plans

MiBIO: Mining Biodiversity

• Mining Biodiversity: Enriching Biodiversity Heritage with Text Mining and Social Media

• One of the international projects that won in the third round of the 2013 Digging Into Data Challenge

• Promote the development of innovative computational techniques to apply to big data in the humanities and social sciences
– The National Centre for Text Mining (UK)
– Missouri Botanical Garden (US)
– Dalhousie University's Big Data Analytics Institute (Canada)
– Social Media Lab (Canada)

Page 49: BHL Markup Efforts and Plans

MiBIO: Mining Biodiversity

1. Automatic correction of OCR text errors.

2. Crowdsource annotation of legacy texts with semantic metadata.

3. Adapt text mining techniques to extract terminology, entities and significant events automatically and to track terminology evolution over time.

4. Use interactive visualization techniques to help users manage search results through next generation browsing capabilities, assisted by a semantic similarity network of important terms and entities.

5. Design a social media layer, serving as an environment for diverse users to interact and collaborate on science, public education, awareness and outreach.
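Tracking terminology over time, as in goal 3, can be pictured at its simplest as counting a term per decade of publication. The sketch below is only that baseline, not the MiBIO text mining pipeline; the snippets are invented and `term_counts_by_decade` is a hypothetical helper.

```python
import re
from collections import defaultdict


def term_counts_by_decade(documents, term):
    """Count whole-word, case-insensitive occurrences of `term` per decade.

    documents: iterable of (publication_year, text) pairs
    """
    pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
    counts = defaultdict(int)
    for year, text in documents:
        decade = (year // 10) * 10
        counts[decade] += len(pattern.findall(text))
    return dict(sorted(counts.items()))


# Invented snippets for illustration only.
documents = [
    (1889, "The habitat of this finch is reed beds."),
    (1898, "Habitat: reeds. The station of the plant is damp meadow."),
    (1912, "Station and habitat are treated as synonyms here."),
]

print(term_counts_by_decade(documents, "habitat"))
print(term_counts_by_decade(documents, "station"))
```

A real analysis would normalize by the number of pages per decade and would depend on OCR quality, which is why goal 1 comes first.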

Page 50: BHL Markup Efforts and Plans

MiBIO: Mining Biodiversity

Page 51: BHL Markup Efforts and Plans

Crowdsource Markup

Display text                                  Species Profile Model category

General/summary                               TaxonBiology
Geographic range                              Distribution
Habitat                                       Habitat
Food sources and feeding behavior             TrophicStrategy
Physical description (general)                Description
Physical description (detailed morphology)    DiagnosticDescription
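The table above amounts to a lookup from the label a volunteer sees to the category stored with the annotation. A minimal sketch, assuming a plain dictionary is enough; `tag_passage` and the record layout are hypothetical, not an existing BHL interface.

```python
# Mapping copied from the table: display text -> Species Profile Model category.
SPM_CATEGORY = {
    "General/summary": "TaxonBiology",
    "Geographic range": "Distribution",
    "Habitat": "Habitat",
    "Food sources and feeding behavior": "TrophicStrategy",
    "Physical description (general)": "Description",
    "Physical description (detailed morphology)": "DiagnosticDescription",
}


def tag_passage(text, display_label):
    """Return an annotation record for a passage a volunteer has labeled."""
    if display_label not in SPM_CATEGORY:
        raise KeyError(f"Unknown display label: {display_label!r}")
    return {"text": text, "category": SPM_CATEGORY[display_label]}


record = tag_passage(
    "A pair of finches with green and yellow bodies resting on reeds",
    "Physical description (general)",
)
print(record)
```

Keeping the friendly label separate from the stored category lets the wording shown to volunteers change without touching existing annotations.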

Page 52: BHL Markup Efforts and Plans

Thank you

William Ulate
Global BHL Project Manager / Technical Director
Missouri Botanical Garden
[email protected]: william_ulate_r