BHL Tech Overview for BHL-Europe

DESCRIPTION

Presented at the BHL-Europe Kickoff Meeting, Museum für Naturkunde, Berlin, 12 May 2009.

TRANSCRIPT

Freeland. BHL Tech Overview. 12 May 2009 www.biodiversitylibrary.org

BHL Technology Overview

Chris Freeland

Technical Director, BHL

Director of Bioinformatics, Missouri Botanical Garden

About BHL: Usage, History

Goals of BHL

• Scan public domain biodiversity literature.

• Negotiate rights to digitize copyrighted materials.

• Ingest content digitized by others.

• Provide interfaces & APIs for repository.

– GUIs

– Services for data mining & citation resolution

http://www.biodiversitylibrary.org

Now Online

• More than:

– 33,000 volumes

– 13.3 million pages

• Avg. monthly growth rate:

– 1,500 volumes

– 600,000 pages

Monthly Usage Stats

• 45,000 unique users

• 250,000 pageviews

History

• Preliminary work: MOBOT’s Botanicus

– http://www.botanicus.org

• Funded by Keck Foundation & IMLS

• Working demonstration of how nomenclators/databases can link into digitized scientific literature

Architecture

Distributed

• Digitized content on Internet Archive servers in California

• Metadata index on MOBOT servers in Missouri

• Image server on MBL servers in Massachusetts

• Nice, but not global

[Diagram: MOBOT; Petabox cluster (Internet Archive); Image Server (MBL)]

Scanning Workflow

Scanning Operations

BHL uses scanning centers established by Internet Archive for mass scanning.

Some partner libraries also scan in-house.

Want to expand international footprint:

• mirrored content

• ingest from global data providers

[Map: Locations of BHL/IA Scanning Centers]

Workflow

[Diagram labels: Selection, Preparation, Post Production, (Re)publication, Digitization, Conservation]

Open Access Data

Flora medica, oder, Abbildung der wichtigsten officinellen Pflanzen… [Heft 1-18]

Publisher: Jena, August Schmid, 1831 [i.e. 1829-1831].

Available formats: PDF, OCR, XML, JP2

Complexities of distributed, mass scanning

[Images: examples from NYBG and from Smithsonian]

Post Processing & Derivatives

Derivatives

• JPEG2000 (JP2) images

• OCR: ABBYY FineReader

• PDF: LuraTech PDF Compressor

• XML metadata

Name Finding via TaxonFinder

Raw image → Converted to text via OCR → Name finding via TaxonFinder → Extract names → Submit to NameBank → SOAP response

Name Finding in action

with Taxonomic Intelligence…
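The name-finding steps (OCR text, candidate extraction, verification against a name authority) can be illustrated with a toy sketch. This is not TaxonFinder's algorithm, and no NameBank service is called: the regular expression, the function names, the sample text, and the in-memory reference set are all assumptions made for illustration.

```python
import re

# Naive pattern for a Latin binomial: capitalized genus + lowercase epithet.
# Illustrative stand-in only; TaxonFinder itself is more sophisticated.
BINOMIAL = re.compile(r"\b([A-Z][a-z]{2,})\s+([a-z]{3,})\b")

def find_candidate_names(ocr_text):
    """Return candidate name strings found in OCR text, in order of appearance."""
    return [f"{genus} {epithet}" for genus, epithet in BINOMIAL.findall(ocr_text)]

def verify_names(candidates, known_names):
    """Keep only candidates present in a reference set.

    The set is a hypothetical stand-in for a lookup against NameBank.
    """
    return [name for name in candidates if name in known_names]

# Hypothetical sample data
ocr_text = "The common Amanita muscaria grows near Quercus alba in autumn."
known_names = {"Amanita muscaria", "Quercus alba"}

candidates = find_candidate_names(ocr_text)
verified = verify_names(candidates, known_names)
```

The naive pattern also picks up "The common" as a candidate, which is why a verification step against a name authority matters: only the two real names survive it.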

Name Finding Stats to date*

• Have mined more than 42 million name string occurrences

• More than 30 million name strings verified by NameBank

– 1.5 million unique

*12 May 2009

Content Delivery

OCR error rate for names only: 35.16%

Study in 2008 found that for a sample population of 3,003 names, 1,056 were incorrectly transcribed by OCR.

Top OCR errors

Rank  Error
1     Insert Space
2     Omit Space
3     e->c
4     u->I
5     u->n
6     i->l
7     c->e
8     n->v
9     l->i
10    r->i
11    u->ii
12    h->l
13    h->ii
14    e->o

http://biodiversitylibrary.blogspot.com/2008/10/evaluation-of-taxonomic-name-finding.html
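The reported rate follows directly from the study counts, and the substitution list suggests one way a garbled name might be recovered. The sketch below is a hedged illustration, not the method used in the study: it assumes "e->c" means a printed "e" was read as "c", uses only a subset of the listed errors, and the function names and sample names are hypothetical.

```python
# Error rate from the 2008 study counts: incorrectly transcribed / sample size.
sample_size = 3003
incorrect = 1056
error_rate_pct = round(100 * incorrect / sample_size, 2)

# A subset of the listed substitutions as (printed, ocr_output) pairs.
# Assumption: "e->c" means a printed "e" was read as "c".
TOP_SUBSTITUTIONS = [
    ("e", "c"), ("u", "n"), ("i", "l"), ("c", "e"),
    ("n", "v"), ("l", "i"), ("h", "l"),
]

def correction_candidates(ocr_name):
    """Generate strings that could yield ocr_name through one listed substitution."""
    candidates = set()
    for printed, garbled in TOP_SUBSTITUTIONS:
        start = ocr_name.find(garbled)
        while start != -1:
            # Undo the substitution at this single position.
            candidates.add(ocr_name[:start] + printed + ocr_name[start + len(garbled):])
            start = ocr_name.find(garbled, start + 1)
    return candidates

def recover(ocr_name, known_names):
    """Return known names reachable from ocr_name by undoing one substitution."""
    return sorted(correction_candidates(ocr_name) & known_names)

# Hypothetical example: "Quercus" misread with e->c
recovered = recover("Qucrcus alba", {"Quercus alba", "Amanita muscaria"})
```

Undoing a single substitution keeps the candidate set small; names with several OCR errors would need a broader fuzzy match.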

Current image delivery: djatoka

• Images stored as JPEG2000 (.jp2)

• Decoded & delivered to browser via djatoka

– Open source JP2 image server

– Developed by digital librarians

– Scalable

– Rapid development cycle (v1.1)

– Growing community of users

BHL/IA architecture

A user requests Mushrooms of America, edible and poisonous, Plate X: http://www.biodiversitylibrary.org/page/1274907

• Browser (IIPViewer) requests /page/1274907 from www.biodiversitylibrary.org (St. Louis)

• BHLdb looks up pageid: 1274907 to locate: http://www.archive.org/download/mushroomsofameri00palm/.../mushroomsofameri00palm_0010.jp2

• djatoka at images.biodivlibrary.org (Woods Hole) reads the .jp2 from IA (San Francisco)

• djatoka returns a .jpg to the browser
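The page request flow can be sketched as a lookup followed by a djatoka image request. This is a minimal illustration under stated assumptions: the resolver path on images.biodivlibrary.org, the in-memory lookup table standing in for BHLdb, and the function names are hypothetical. The request parameters follow djatoka's OpenURL-based getRegion service. The JP2 path is elided in the slides and is kept elided here, so the resulting URL is not fetchable.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Assumed resolver path; the actual deployment path is not given in the slides.
DJATOKA_RESOLVER = "http://images.biodivlibrary.org/adore-djatoka/resolver"

# Hypothetical stand-in for the BHLdb lookup (page id -> JP2 location at IA).
# The "..." is elided in the source and deliberately left as-is.
PAGE_TO_JP2 = {
    1274907: "http://www.archive.org/download/mushroomsofameri00palm/.../mushroomsofameri00palm_0010.jp2",
}

def djatoka_region_url(jp2_url, level=3, fmt="image/jpeg"):
    """Build a djatoka getRegion request (OpenURL syntax) for a remote JP2 file."""
    params = {
        "url_ver": "Z39.88-2004",
        "rft_id": jp2_url,
        "svc_id": "info:lanl-repo/svc/getRegion",
        "svc_val_fmt": "info:ofi/fmt:kev:mtx:jpeg2000",
        "svc.format": fmt,          # format delivered to the browser
        "svc.level": str(level),    # resolution level to decode
    }
    return DJATOKA_RESOLVER + "?" + urlencode(params)

def image_request_for_page(page_id):
    """Resolve a page id to its JP2 location, then build the image request."""
    return djatoka_region_url(PAGE_TO_JP2[page_id])

url = image_request_for_page(1274907)
query = parse_qs(urlparse(url).query)  # decoded parameters, for inspection
```

Keeping the lookup separate from the image request mirrors the split between the metadata index and the image server.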

New delivery option: IA Bookreader

• Open source

• Example: Flora medica in the IA Book Viewer

– http://www.us.archive.org/GnuBook/?id=floramedicaodera118diet#229

APIs & Data Sharing

• Name Service (Documentation)

– REST: XML or JSON

• Data Export (Documentation)

– Monthly export of BHL titles, volumes, pages, names, other metadata in delimited files
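As a rough idea of how a delimited export might be consumed, the sketch below counts the distinct pages on which each name string occurs. The column names (`PageID`, `NameString`), the tab delimiter, and the sample rows are assumptions for illustration; the actual export layout is described in the BHL documentation.

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt of a names export; columns and delimiter are assumptions.
SAMPLE_EXPORT = (
    "PageID\tNameString\n"
    "1274907\tAmanita muscaria\n"
    "1274907\tAgaricus campestris\n"
    "1274907\tAmanita muscaria\n"
    "1274908\tAmanita muscaria\n"
)

def pages_per_name(export_file):
    """Count how many distinct pages mention each name string."""
    seen = set()
    counts = Counter()
    for row in csv.DictReader(export_file, delimiter="\t"):
        key = (row["PageID"], row["NameString"])
        if key not in seen:  # ignore repeat occurrences on the same page
            seen.add(key)
            counts[row["NameString"]] += 1
    return counts

counts = pages_per_name(io.StringIO(SAMPLE_EXPORT))
```

The function accepts any file-like object, so the same code would work on an opened export file.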

*Soon: Citation resolver via OpenURL

Beetle, A. A. 1977. Noteworthy grasses from Mexico V. Phytologia 37(4): 317–407.

http://example.edu/cgi?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:article&rft.jtitle=Phytologia&rft.atitle=Noteworthy+grasses+from+Mexico&rft.aulast=Beetle&rft.aufirst=A&rft.date=1977&rft.volume=37&rft.issue=4&rft.spage=317&rft.epage=407
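The OpenURL above can be assembled from the citation's fields. The sketch below is a minimal illustration that reproduces that query string; the function name and the dictionary layout are assumptions, and `http://example.edu/cgi` is the placeholder resolver address from the slide, not a real service.

```python
from urllib.parse import urlencode

def citation_openurl(base_url, citation):
    """Encode article citation fields as an OpenURL (Z39.88-2004, KEV format)."""
    params = {
        "url_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:article",
    }
    # Each citation field becomes an "rft." (referent) key.
    params.update({f"rft.{key}": value for key, value in citation.items()})
    # safe=":/" keeps the format identifier readable, as in the slide's example.
    return base_url + "?" + urlencode(params, safe=":/")

citation = {
    "jtitle": "Phytologia",
    "atitle": "Noteworthy grasses from Mexico",
    "aulast": "Beetle",
    "aufirst": "A",
    "date": "1977",
    "volume": "37",
    "issue": "4",
    "spage": "317",
    "epage": "407",
}

url = citation_openurl("http://example.edu/cgi", citation)
```

A resolver receiving such a request can match journal title, volume, and start page against its index to find the digitized article.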

Articles

Article repository

• Needed a way to display these PDFs

• Wanted to extend contribution functionality to users

• “Safe harbor” model

– BHL provides platform

– Community provides content

• Scientists, students, libraries

http://cite.biodiversitylibrary.org

• Drupal with Biblio module

• Multi-lingual interface

• Customizable display, layout

• Solr search/faceting

• OAI & other services for discovery/sharing

Outreach

BHL Blog

• Updates

• Announcements

• 1,500 users / month

Twitter

• twitter.com/BioDivLibrary

• Communication tool

– Connecting with LinkedData community, other users

– Receiving assistance, guidance

– FAST turnaround

If BHL-E is not a Research Project…

Technologies in hand:

• TaxonFinder

• djatoka

• IA Bookreader

• Drupal/Biblio

• OAI-PMH

• OpenURL

• Fedora Commons

Needed:

• Deduplication Tools

• Storage

• OCR

• Markup/rekeying

• UI/UX

• Interface translation

• Data synchronization

Thank you

Chris Freeland

4344 Shaw Blvd.

St. Louis, MO 63110

chris.freeland@mobot.org

http://www.biodiversitylibrary.org
