From the Printed Page to Discoverable Content, Library Camp Perth 2010

From the Printed Page to Discoverable Content

the open source way

Steven Miles

@stevermiles stevenmiles.com.au

Tuesday, 18 January 2011

About Me

Web Application Developer @ State Library of Western Australia

S.L.U.R.P.: Digital Content Ingestion & Integration with LMS

PC Reservation: PC Reservations and Booking System

PLO: Public Libraries Online

Venues Bookings: Venues Booking & Reservation System

P.URL: Permanent URL


WARNING !!!!

Lots of technical stuff!


How can I make scanned content more discoverable?

Presentation

Digitisation

Indexing

Capture DIY Scanner

Existing Documents

Dual Camera Setup

Single Camera Setup

Commercial Scanners

Image Processing

OCR

Document Scanners

MFDs

Rotation

Cropping

Normalisation

Levels Correction

Multi page

Tagging

Open source

Commercial

Cuneiform

Tesseract

OCRopus

GOCR

Page Layout Analysis

ABBYY FineReader

Acrobat

Leptonica

Metadata

Manual

Automatic

Persons

Locations

Dates

Organisations

Locations

Formats

hOCR

Text

XML

Manual

Import

Z39.50

SRU/SRW

Engine

Zebra

XML

Z39.50

RDBMS

Postgres

MySQL

Search

Pull from LMS

Search Multiple Databases

Results

Expose Web APIs

Other Library Systems

Z39.50

SRU/SRW

Facets

Page Previews

Ranked

Sortable

Filters

Web Accessible

Simple Keyword Searching

Encourage Exploration

Tagging

Advanced Search

Saved Searches

Social Sharing, Integration

Web Browser Accessible

Auto Updating Downloadable PDFs

User Correctable Text

In Document Searching

Highlight Search Results

Potential Conversion to Other Formats


Most common process of digitisation for public consumption

Scan / Capture

Generate PDF

OCR

Indexed by Content Management System

Link to Downloadable PDF

(Uncorrected OCR)

(Links only to Document)

How can we do this better?


Inspirational Resources

National Library of Australia - Australian Newspapers: http://newspapers.nla.gov.au/

Google Docs: http://docs.google.com

Informit - Text Searchable Content



Proposed Digitisation Process

Scan / Capture

Semi Auto Cropping and Rotation Correction

Optimise Each Page for OCR

OCR Pages

Retain Positional Information (hOCR)

Post OCR Processing

Spell checking & correction of common OCR errors

Natural Language Processing

Auto Extract Names, Organisations, Locations & Dates from Text and Use for Tagging

Store as XML

Generate Page Level XML Index Files

Add/Update XML Indexing Server

Fully Automated Process

Generate Searchable PDF

Generate Web Friendly Versions of each page

Full Text Search

Web Services & Z39.50

Downloadable PDF

Google Docs Style Interface

Individual Line Highlighting to Show Search Results
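The post OCR processing step in this process could be sketched roughly as follows. This is a minimal Python sketch and not the code used in the prototype: it rejoins words split by a line-end hyphen and applies a small table of known OCR slips. The names join_hyphenated_lines, apply_common_fixes and COMMON_FIXES are hypothetical, and the two table entries are simply errors of the kind an OCR engine produces.

```python
# Hypothetical lookup table of known OCR slips; a real one would be much larger.
COMMON_FIXES = {"ofa": "of a", "thei.r": "their"}


def join_hyphenated_lines(lines):
    """Merge OCR lines into one string, rejoining words split by a line-end hyphen."""
    text = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if text.endswith("-"):
            # Assumption: a trailing hyphen always marks a split word, so drop it.
            text = text[:-1] + line
        elif text:
            text += " " + line
        else:
            text = line
    return text


def apply_common_fixes(text, fixes=COMMON_FIXES):
    """Replace whole tokens that appear in the fixes table."""
    return " ".join(fixes.get(token, token) for token in text.split())


lines = [
    "IN a display of unity, Muslims and Chris-",
    "tians gathered at Dianella Uniting Church",
    "last Thursday to share thei.r experiences",
]
print(apply_common_fixes(join_hyphenated_lines(lines)))
```

Dropping every line-end hyphen will also merge genuine compounds that happen to break at a line end, so a dictionary check before joining would be a sensible refinement.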


Available Open Source Projects

OCRopus - Page Layout Analysis: http://code.google.com/p/ocropus/

Tesseract OCR - OCR: http://code.google.com/p/tesseract-ocr/

ImageMagick - Image Processing: http://www.imagemagick.org/

Index Data Zebra - XML Indexing: http://www.indexdata.com/zebra

Index Data Pazpar2 - Federated Search: http://www.indexdata.com/pazpar2

Existing Web Technologies - PHP, HTML, CSS etc.




DIY Book Scanner Project

www.diybookscanner.org



Basic Architecture

Discovery Layer (PHP, HTML, CSS)

Federated Search: Using Pazpar2 - Z39.50, SRU, SRW

Full Text Search: Zebra - XML Indexer via Z39.50

LMS & External Databases: Existing via Z39.50

XML Data Files: MARC, Dublin Core, OAI-PMH

Document Viewer / Editor (PHP, HTML, CSS)

Ingest / Digitisation (PHP, HTML, CSS)

OCR & NLP (Document Processing, OCR & Natural Language Processing)

Downloadable Version: Automatic Generation of Searchable PDF, Text Files etc. (Updated from User Alterations)

External Resources

Crowdsourcing OCR Corrections & Possible translation of handwritten documents
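As an illustration of how the discovery layer might talk to a search back end over SRU, here is a small Python sketch that only builds a searchRetrieve URL. The parameter names follow SRU 1.1; the base URL, port and database name are placeholders, and sru_search_url is a hypothetical helper, not part of the prototype.

```python
from urllib.parse import urlencode


def sru_search_url(base_url, cql_query, start=1, maximum=10):
    """Build an SRU 1.1 searchRetrieve URL for a CQL query."""
    params = {
        "version": "1.1",
        "operation": "searchRetrieve",
        "query": cql_query,
        "startRecord": start,
        "maximumRecords": maximum,
    }
    return base_url + "?" + urlencode(params)


# Placeholder endpoint: substitute the address of the actual SRU server.
print(sru_search_url("http://localhost:9999/digindex", "peace and church"))
```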


Converting Images for OCR

Convert to Grayscale

Generate Text Image Mask

Clean Up Background Noise

OCR Version

OCRopus: Page Layout Analysis

ImageMagick: Image Manipulation

Combined
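The ImageMagick part of this conversion might look something like the Python sketch below. It only builds the command line (rotate, crop, greyscale) and prints it; the text mask and background clean-up steps are not covered. The function name and output file name are hypothetical, and the rotation and crop values are example inputs in the same form as the stored page attributes.

```python
import shlex


def build_convert_command(src, dst, rotate, crop):
    """Return an ImageMagick argument list: rotate, crop, then convert to greyscale."""
    return [
        "convert", src,            # ImageMagick 6 entry point; version 7 installs use "magick"
        "-rotate", str(rotate),    # degrees, clockwise
        "-crop", crop,             # geometry string: WIDTHxHEIGHT+X+Y
        "+repage",                 # reset the canvas offset left behind by the crop
        "-colorspace", "Gray",
        dst,
    ]


cmd = build_convert_command("odd/IMG_0946.JPG", "1-gray.png", 91, "2161x3247+374+274")
print(shlex.join(cmd))
# To actually run it (requires ImageMagick): subprocess.run(cmd, check=True)
```

Rotating before cropping is an assumption here, chosen because the crop size in the example is portrait while the camera original is landscape.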


Images to Text

Image for OCR Processing

Tesseract OCR to hOCR File

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
  <title></title>
  <meta http-equiv="Content-Type" content="text/html;charset=utf-8" >
  <meta name='ocr-system' content='tesseract'>
</head>
<body>
  <div class='ocr_page' id='page_1' title='image "/var/digindex/repository/eastern_reporter/2010/10/5/ocr/1-masked.png"; bbox 0 0 2161 3247'>
    <div class='ocr_carea' id='block_1_1' title="bbox 200 46 1858 233">R r </div>
    <div class='ocr_carea' id='block_1_2' title="bbox 47 1855 241 1883">By LIAM CROY</div>
    <div class='ocr_carea' id='block_1_3' title="bbox 43 1909 533 2404">IN a

Convert hOCR to XML for Storage

<metadata>
  <title>Eastern Reporter Tuesday, October 5, 2010</title>
  <id>eastern_reporter/2010/10/5</id>
</metadata>
<pages>
  <page id="0" origWidth="3648" origHeight="2736" rotate="-90.5" crop="2199x3321+147+147"/>
  <page id="1" origWidth="3648" origHeight="2736" rotate="91" path="odd/IMG_0946.JPG" crop="2161x3247+374+274" width="2161" height="3247">
    <paragraph>
      <line id="line_1_1" top="50" left="201" width="1657" height="180">R r</line>
    </paragraph>
    <paragraph>
      <line id="line_1_2" top="1855" left="47" width="194" height="27">By LIAM CROY</line>
    </paragraph>
    <paragraph>
      <line id="line_1_3" top="1910" left="46" width="485" height="24">IN a display of unity, Muslims and Chris-</line>
      <line id="line_1_4" top="1937" left="45" width="486" height="26">tians gathered at Dianella Uniting Church</line>
      <line id="line_1_5" top="1965" left="45" width="485" height="26">last Thursday to share thei.r experiences</line>
      <line id="line_1_6" top="1993" left="45" width="212" height="24">and pray for peace.</line>
    </paragraph>
    <paragraph>
      <line id="line_1_7" top="2020" left="79" width="451" height="25">Sheikh Muhammad Agherdien of the</line>
    </paragraph>
    <paragraph>
      <line id="line_1_8" top="2048" left="46" width="484" height="25">Mirrabooka mosque opened the service</line>
      <line id="line_1_9" top="2076" left="46" width="484" height="26">with a verse of the Islamic religious text,</line>
      <line id="line_1_10" top="2103" left="45" width="117" height="20">the Koran:</line>
    </paragraph>
    <paragraph>
      <line id="line_1_11" top="2131" left="79" width="451" height="27">“Oh People! Behold, we have created you</line>
    </paragraph>
    <paragraph>
      <line id="line_1_12" top="2158" left="46" width="331" height="22">all out ofa male and a female.</line>
    </paragraph>
    <paragraph>
      <line id="line_1_13" top="2187" left="79" width="451" height="25">“And we have made you into nations</line>
    </paragraph>
    <paragraph>
      <line id="line_1_14" top="2214" left="46"

Sample Auto Generate Tags

IN a display of unity , [MISC Muslims ] and [MISC Chris- ] , tians gathered at [ORG Dianella Uniting Church ] , last Thursday to share thei.r experiences , and pray for peace.
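The hOCR to XML conversion could be sketched along these lines in Python. An hOCR title carries a box as "bbox x0 y0 x1 y1", and the storage format keeps top, left, width and height per line, so the core of the conversion is a subtraction. This is an illustrative sketch only: bbox_to_line is a hypothetical helper, and it is fed a block-level box here, so the height can differ slightly from a line-level value.

```python
import re
import xml.etree.ElementTree as ET

# hOCR stores a box in the title attribute as "bbox x0 y0 x1 y1".
BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def bbox_to_line(line_id, title, text):
    """Turn an hOCR title attribute and its text into a <line> element."""
    match = BBOX.search(title)
    if match is None:
        raise ValueError("no bbox found in title: %r" % title)
    x0, y0, x1, y1 = map(int, match.groups())
    line = ET.Element("line", {
        "id": line_id,
        "top": str(y0),
        "left": str(x0),
        "width": str(x1 - x0),
        "height": str(y1 - y0),
    })
    line.text = text
    return line


line = bbox_to_line("line_1_2", "bbox 47 1855 241 1883", "By LIAM CROY")
print(ET.tostring(line, encoding="unicode"))
```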



Demo


Prototype Interface for Ingesting Pages from Book Scanner


Perform Basic Image Rotation and Cropping

Rotation and Cropping can be replicated to other pages


Prototype Search Pages

The facets to the left of the results are auto-generated from the natural language processing tags
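One way such facets could be derived from tagger output is sketched below in Python. It assumes the bracketed format "[LABEL phrase ]" seen in the tagging sample and simply counts phrases per label; extract_facets is a hypothetical name and this is not the prototype's code.

```python
import re
from collections import Counter, defaultdict

# Matches "[LABEL some phrase ]"; the label is any run of capital letters.
TAG = re.compile(r"\[([A-Z]+)\s+([^\]]+?)\s*\]")


def extract_facets(tagged_text):
    """Count tagged phrases per entity label."""
    facets = defaultdict(Counter)
    for label, phrase in TAG.findall(tagged_text):
        facets[label][phrase] += 1
    return facets


sample = ("IN a display of unity , [MISC Muslims ] and [MISC Chris- ] , "
          "tians gathered at [ORG Dianella Uniting Church ] , last Thursday")
for label, counts in extract_facets(sample).items():
    print(label, dict(counts))
```

The "Chris-" facet shows why line-end hyphens are worth rejoining before the text reaches the tagger.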


Viewing Document Pages


Viewing Document Pages with Highlighted Results
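Because every stored line keeps its position, highlighting can be reduced to finding the matching lines and scaling their boxes to the displayed page size. The Python sketch below illustrates the idea under that assumption; highlight_boxes, the dictionary layout and the 800 pixel display width are all hypothetical.

```python
def highlight_boxes(lines, query, page_width, display_width):
    """Return scaled boxes for every line whose text contains the query."""
    scale = display_width / page_width
    term = query.lower()
    boxes = []
    for line in lines:
        if term in line["text"].lower():
            box = {"id": line["id"]}
            for key in ("top", "left", "width", "height"):
                box[key] = round(line[key] * scale)
            boxes.append(box)
    return boxes


# Two lines in the stored page format: id, position in pixels, and text.
page_lines = [
    {"id": "line_1_4", "top": 1937, "left": 45, "width": 486, "height": 26,
     "text": "tians gathered at Dianella Uniting Church"},
    {"id": "line_1_6", "top": 1993, "left": 45, "width": 212, "height": 24,
     "text": "and pray for peace."},
]
print(highlight_boxes(page_lines, "peace", page_width=2161, display_width=800))
```

Each returned box could then be drawn as an absolutely positioned overlay on the web friendly page image.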


Editing Document with Auto Updating of Indexer


Pazpar2 can be used to build alternative interfaces for searching multiple existing catalogues
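Pazpar2 is driven over HTTP with a command parameter (init, search, show and so on), so an alternative interface mostly needs to build a few URLs. The Python sketch below shows the general shape; the host, port and session value are placeholders, pazpar2_url is a hypothetical helper, and the Pazpar2 documentation should be checked for the full parameter list.

```python
from urllib.parse import urlencode


def pazpar2_url(base_url, command, **params):
    """Build a Pazpar2 webservice URL for the given command."""
    query = {"command": command}
    query.update(params)
    return base_url + "?" + urlencode(query)


BASE = "http://localhost:8004/search.pz2"  # placeholder address

print(pazpar2_url(BASE, "init"))                                # start a session
print(pazpar2_url(BASE, "search", session="1", query="peace"))  # search all targets
print(pazpar2_url(BASE, "show", session="1", start=0, num=20))  # fetch merged results
```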


Questions?


More Info & Credits

Tesseract-OCR: http://code.google.com/p/tesseract-ocr/

OCRopus: http://code.google.com/p/ocropus/

Do-It-Yourself Book Scanning: http://www.diybookscanner.org/

CHDK - Canon Hack Development Kit: http://chdk.wikia.com/wiki/CHDK

Zebra - XML Indexing: http://www.indexdata.com/zebra

Pazpar2 - Federated Search: http://www.indexdata.com/pazpar2

Cuneiform

EyeFi Python Server: http://returnbooleantrue.blogspot.com/2009/01/eye-fi-standalone-server.html/

hOCR - HTML OCR: http://en.wikipedia.org/wiki/HOCR

OpenNLP

Illinois Named Entity Tagger: http://cogcomp.cs.illinois.edu/page/software_view/4

