From the Printed Page to Discoverable Content: Library Camp Perth 2010
From the Printed Page to Discoverable Content
the open source way
Steven Miles
@stevermiles
stevenmiles.com.au
Tuesday, 18 January 2011
About Me
Web Application Developer @ State Library of Western Australia
S.L.U.R.P.: Digital Content Ingestion & Integration with LMS
PC Reservation: PC Reservations and Booking System
PLO: Public Libraries Online
Venues Bookings: Venues Booking & Reservation System
P.URL: Permanent URL
WARNING !!!!
Lots of technical stuff!
How can I make scanned content more discoverable?
Digitisation
  Capture
    DIY Scanner: Dual Camera Setup, Single Camera Setup
    Existing Documents
    Commercial Scanners: Document Scanners, MFDs
  Image Processing: Rotation, Cropping, Normalisation, Levels Correction, Multi page
  OCR
    Open source: Cuneiform, Tesseract, OCRopus, GOCR, Leptonica
    Commercial: ABBYY FineReader, Acrobat
    Page Layout Analysis
  Tagging
    Metadata: Manual
    Automatic: Persons, Locations, Dates, Organisations
  Formats: hOCR, Text, XML
Indexing
  Import: Manual, Z39.50, SRU/SRW
  Engine
    Zebra: XML, Z39.50
    RDBMS: Postgres, MySQL
Search
  Pull from LMS
  Search Multiple Databases
  Expose Web APIs
  Other Library Systems: Z39.50, SRU/SRW
  Results: Facets, Page Previews, Ranked, Sortable, Filters
  Web Accessible
  Simple Keyword Searching
  Encourage Exploration
  Tagging
  Advanced Search
  Saved Searches
  Social Sharing, Integration
Presentation
  Web Browser Accessible
  Auto Updating Downloadable PDFs
  User Correctable Text
  In Document Searching
  Highlight Search Results
  Potential Conversion to Other Formats
Most common process of digitisation for public consumption
Scan / Capture
Generate PDF
OCR
Indexed by Content Management System (links only to document)
Link to Downloadable PDF (uncorrected OCR)
How can we do this better?
Inspirational Resources
National Library of Australia - Australian Newspapers
http://newspapers.nla.gov.au/
Google Docs
http://docs.google.com
Informit - Text Searchable Content
Proposed Digitisation Process
Scan / Capture
Semi-Auto Cropping and Rotation Correction
Optimise Each Page for OCR
OCR Pages: retain positional information (hOCR)
Post-OCR Processing: spell checking and correction of common OCR errors
Natural Language Processing: auto-extract names, organisations, locations and dates from text and use them for tagging
Store as XML: generate page-level XML index files
Add/Update XML Indexing Server
Fully Automated Process
Generate Searchable PDF
Generate Web-Friendly Versions of each page
Full Text Search: web services & Z39.50
Downloadable PDF
Google Docs Style Interface: individual line highlighting to show search results
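The post-OCR processing step could look roughly like the sketch below. It is not the code used in the prototype: the function names and the COMMON_FIXES table are made up for illustration, and the two rules (re-joining words hyphenated across a line break, dropping a stray full stop inside a word) are deliberately naive and would also alter genuine hyphenated words and abbreviations.

```python
import re

# Hypothetical correction table; a real one would be built from errors
# actually observed in the OCR output.
COMMON_FIXES = {
    "ofa": "of a",
}


def join_lines(lines):
    """Join OCR lines into one string, re-attaching words split by a hyphen."""
    text = ""
    for line in lines:
        line = line.strip()
        if text.endswith("-"):
            # "Chris-" followed by "tians" becomes "Christians"
            text = text[:-1] + line
        elif text:
            text += " " + line
        else:
            text = line
    return text


def fix_common_errors(text):
    """Apply a few simple corrections for typical OCR mistakes."""
    # Remove a full stop sitting between two lower-case letters ("thei.r").
    text = re.sub(r"(?<=[a-z])\.(?=[a-z])", "", text)
    words = [COMMON_FIXES.get(word, word) for word in text.split(" ")]
    return " ".join(words)


if __name__ == "__main__":
    ocr_lines = [
        "IN a display of unity, Muslims and Chris-",
        "tians gathered at Dianella Uniting Church",
        "last Thursday to share thei.r experiences",
        "and pray for peace.",
    ]
    print(fix_common_errors(join_lines(ocr_lines)))
```

A dictionary-based spell checker would be the natural next step after these rules; it is left out here to keep the sketch free of external dependencies.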
Available Open Source Projects
OCRopus - Page Layout Analysis
http://code.google.com/p/ocropus/
Tesseract OCR - OCR
http://code.google.com/p/tesseract-ocr/
ImageMagick - Image Processing
http://www.imagemagick.org/
Index Data Zebra - XML Indexing
http://www.indexdata.com/zebra
Index Data Pazpar2 - Federated Search
http://www.indexdata.com/pazpar2
Existing Web Technologies - PHP, HTML, CSS etc.
DIY Book Scanner Project
www.diybookscanner.org
Basic Architecture
Discovery Layer (PHP, HTML, CSS)
Federated Search: using Pazpar2 (Z39.50, SRU, SRW)
Full Text Search: Zebra XML indexer via Z39.50
LMS & External Databases: existing, via Z39.50
XML Data Files: MARC, Dublin Core, OAI-PMH
Document Viewer / Editor (PHP, HTML, CSS)
Ingest / Digitisation (PHP, HTML, CSS)
OCR & NLP (document processing, OCR and natural language processing)
Downloadable Version: automatic generation of searchable PDF, text files etc. (updated from user alterations)
External Resources: crowdsourcing OCR corrections and possible translation of handwritten documents
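As an illustration of how a discovery layer might talk to the index over SRU, the sketch below builds a standard SRU 1.1 searchRetrieve URL. The base URL is hypothetical (the slides do not say where the Zebra server listens), and the function name is made up for this example.

```python
from urllib.parse import urlencode


def sru_search_url(base_url, cql_query, start_record=1, maximum_records=10):
    """Build an SRU 1.1 searchRetrieve URL for a CQL query."""
    params = {
        "version": "1.1",
        "operation": "searchRetrieve",
        "query": cql_query,
        "startRecord": start_record,
        "maximumRecords": maximum_records,
    }
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical endpoint, for illustration only.
    print(sru_search_url("http://localhost:9999/digindex", "Dianella peace"))
```

The response is an XML document, so the same PHP or Python layer that builds the query can parse the records and render them as results.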
Converting Images for OCR
Convert to Grayscale
Generate Text Image Mask
Clean Up Background Noise
OCR Version
OCRopus: Page Layout Analysis
ImageMagick: Image Manipulation
Combined
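The ImageMagick part of this step (rotation, cropping and grayscale conversion) might be driven from a script along the lines of the sketch below. It only assembles the command line; the function name and output file name are assumptions, and the text mask and noise clean-up steps are not covered. The rotate and crop values are the ones stored per page in the XML sample shown later in the talk.

```python
def build_convert_command(src, dst, rotate, crop):
    """Return an ImageMagick argument list that rotates, crops and grayscales a page.

    `crop` uses ImageMagick geometry syntax: WIDTHxHEIGHT+X+Y.
    """
    return [
        "convert",             # ImageMagick 7 installs may use "magick" instead
        src,
        "-rotate", str(rotate),
        "-crop", crop,
        "+repage",             # reset the canvas offset left behind by -crop
        "-colorspace", "Gray",
        dst,
    ]


if __name__ == "__main__":
    cmd = build_convert_command(
        "odd/IMG_0946.JPG", "1-gray.png", 91, "2161x3247+374+274"
    )
    print(" ".join(cmd))
    # To run it (requires ImageMagick to be installed):
    #   import subprocess; subprocess.run(cmd, check=True)
```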
Images to Text
Image for OCR Processing
Tesseract OCR to hOCR File
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><title></title><meta http-equiv="Content-Type" content="text/html;charset=utf-8" ><meta name='ocr-system' content='tesseract'></head><body><div class='ocr_page' id='page_1' title='image "/var/digindex/repository/eastern_reporter/2010/10/5/ocr/1-masked.png"; bbox 0 0 2161 3247'><div class='ocr_carea' id='block_1_1' title="bbox 200 46 1858 233"><p class='ocr_par'><span class='ocr_line' id='line_1_1' title="bbox 201 50 1858 230"><span class='ocr_word' id='word_1_1' title="bbox 1058 50 1211 196"><span class='xocr_word' id='xword_1_1' title="x_wconf -6">R</span></span> <span class='ocr_word' id='word_1_2' title="bbox 1319 88 1858 230"><span class='xocr_word' id='xword_1_2' title="x_wconf -4"> r </span></span></span></p></div><div class='ocr_carea' id='block_1_2' title="bbox 47 1855 241 1883"><p class='ocr_par'><span class='ocr_line' id='line_1_2' title="bbox 47 1855 241 1882"><span class='ocr_word' id='word_1_3' title="bbox 47 1855 77 1882"><span class='xocr_word' id='xword_1_3' title="x_wconf -2">By</span></span> <span class='ocr_word' id='word_1_4' title="bbox 87 1855 153 1877"><span class='xocr_word' id='xword_1_4' title="x_wconf -3">LIAM</span></span> <span class='ocr_word' id='word_1_5' title="bbox 163 1856 241 1878"><span class='xocr_word' id='xword_1_5' title="x_wconf -2">CROY</span></span></span></p></div><div class='ocr_carea' id='block_1_3' title="bbox 43 1909 533 2404"><p class='ocr_par'><span class='ocr_line' id='line_1_3' title="bbox 46 1910 531 1934"><span class='ocr_word' id='word_1_6' title="bbox 46 1910 72 1928"><span class='xocr_word' id='xword_1_6' title="x_wconf -3">IN</span></span> <span class='ocr_word' id='word_1_7' title="bbox 83 1914 94 1928"><span class='xocr_word' id='xword_1_7' title="x_wconf -2">a</span></span> <span class='ocr_word' id='word_1_8' title="bbox 105 1910 185 1933"><span
<document><metadata><title>Eastern Reporter Tuesday, October 5, 2010</title><id>eastern_reporter/2010/10/5</id></metadata><pages><page id="0" origWidth="3648" origHeight="2736" rotate="-90.5" crop="2199x3321+147+147"/><page id="1" origWidth="3648" origHeight="2736" rotate="91" path="odd/IMG_0946.JPG" crop="2161x3247+374+274" width="2161" height="3247"><paragraph><line id="line_1_1" top="50" left="201" width="1657" height="180">R r</line></paragraph><paragraph><line id="line_1_2" top="1855" left="47" width="194" height="27">By LIAM CROY</line></paragraph><paragraph><line id="line_1_3" top="1910" left="46" width="485" height="24">IN a display of unity, Muslims and Chris-</line><line id="line_1_4" top="1937" left="45" width="486" height="26">tians gathered at Dianella Uniting Church</line><line id="line_1_5" top="1965" left="45" width="485" height="26">last Thursday to share thei.r experiences</line><line id="line_1_6" top="1993" left="45" width="212" height="24">and pray for peace.</line></paragraph><paragraph><line id="line_1_7" top="2020" left="79" width="451" height="25">Sheikh Muhammad Agherdien of the</line></paragraph><paragraph><line id="line_1_8" top="2048" left="46" width="484" height="25">Mirrabooka mosque opened the service</line><line id="line_1_9" top="2076" left="46" width="484" height="26">with a verse of the Islamic religious text,</line><line id="line_1_10" top="2103" left="45" width="117" height="20">the Koran:</line></paragraph><paragraph><line id="line_1_11" top="2131" left="79" width="451" height="27">“Oh People! Behold, we have created you</line></paragraph><paragraph><line id="line_1_12" top="2158" left="46" width="331" height="22">all out ofa male and a female.</line></paragraph><paragraph><line id="line_1_13" top="2187" left="79" width="451" height="25">“And we have made you into nations</line></paragraph><paragraph><line id="line_1_14" top="2214" left="46"
Convert hOCR to XML for Storage
Sample Auto-Generated Tags
IN a display of unity , [MISC Muslims ] and [MISC Chris- ] , tians gathered at [ORG Dianella Uniting Church ] , last Thursday to share thei.r experiences , and pray for peace.
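The hOCR to XML conversion on this slide can be approximated with the Python standard library, as sketched below. Judging by the two samples, each line's top, left, width and height are derived from the hOCR bbox (x0 y0 x1 y1). The class and function names are invented for this example, and for brevity all lines are placed in a single paragraph element rather than one per ocr_par.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


def parse_bbox(title):
    """Extract [x0, y0, x1, y1] from an hOCR title attribute, or None."""
    for prop in title.split(";"):
        fields = prop.split()
        if fields and fields[0] == "bbox":
            return [int(n) for n in fields[1:5]]
    return None


class HocrLineParser(HTMLParser):
    """Collect the id, bounding box and text of every ocr_line span."""

    def __init__(self):
        super().__init__()
        self.lines = []        # finished lines: (id, bbox, text)
        self._current = None   # line being read: [id, bbox, text parts]
        self._depth = 0        # span nesting depth inside the current line

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attrs = dict(attrs)
        if self._current is not None:
            self._depth += 1   # a word span nested inside the line
        elif attrs.get("class") == "ocr_line":
            bbox = parse_bbox(attrs.get("title", ""))
            if bbox is not None:
                self._current = [attrs.get("id", ""), bbox, []]
                self._depth = 1

    def handle_endtag(self, tag):
        if tag != "span" or self._current is None:
            return
        self._depth -= 1
        if self._depth == 0:   # the ocr_line span itself has closed
            line_id, bbox, parts = self._current
            text = " ".join("".join(parts).split())
            self.lines.append((line_id, bbox, text))
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current[2].append(data)


def hocr_to_paragraph_xml(hocr):
    """Convert hOCR markup into a <paragraph> of positioned <line> elements."""
    parser = HocrLineParser()
    parser.feed(hocr)
    paragraph = ET.Element("paragraph")
    for line_id, (x0, y0, x1, y1), text in parser.lines:
        line = ET.SubElement(paragraph, "line", {
            "id": line_id,
            "top": str(y0),
            "left": str(x0),
            "width": str(x1 - x0),
            "height": str(y1 - y0),
        })
        line.text = text
    return ET.tostring(paragraph, encoding="unicode")
```

Keeping the position of each line, rather than each word, is enough for the line-level highlighting described earlier and keeps the stored XML small.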
Demo
Prototype Interface for Ingesting Pages from Book Scanner
Perform Basic Image Rotation and Cropping
Rotation and cropping can be replicated to other pages
Prototype Search Pages
The results on the left are auto-generated facets based on the natural language processing tags
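One way the tagged text could be turned into facets is sketched below. It assumes the bracketed "[LABEL phrase ]" format of the tagger sample shown earlier; the function name is invented, and grouping phrases by label is an assumption about how the prototype builds its facets.

```python
import re
from collections import defaultdict

# Matches "[ORG Dianella Uniting Church ]": an upper-case label, then the phrase.
TAG_PATTERN = re.compile(r"\[([A-Z]+)\s+([^\]]+)\]")


def extract_facets(tagged_text):
    """Group tagged phrases by label, keeping first-seen order without repeats."""
    facets = defaultdict(list)
    for label, phrase in TAG_PATTERN.findall(tagged_text):
        phrase = phrase.strip()
        if phrase not in facets[label]:
            facets[label].append(phrase)
    return dict(facets)


if __name__ == "__main__":
    tagged = (
        "IN a display of unity , [MISC Muslims ] and [MISC Chris- ] , "
        "tians gathered at [ORG Dianella Uniting Church ] , last Thursday"
    )
    print(extract_facets(tagged))
```

The "Chris-" entry shows why hyphenated line breaks are worth repairing before the text reaches the tagger.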
Viewing Document Pages
Viewing Document Pages with Highlighted Results
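Line-level highlighting needs the stored line position scaled from the full-size page image to the web-friendly version. A minimal sketch, assuming the highlight is an absolutely positioned element over the page image (function names and the CSS approach are assumptions, not the prototype's code):

```python
def highlight_box(line, page_width, display_width):
    """Scale a line's position from the original page to the displayed image."""
    scale = display_width / page_width
    return {key: round(line[key] * scale)
            for key in ("top", "left", "width", "height")}


def to_css(box):
    """Render the scaled box as inline CSS for an overlay element."""
    return ("position:absolute;top:{top}px;left:{left}px;"
            "width:{width}px;height:{height}px").format(**box)


if __name__ == "__main__":
    # Position of line_1_2 on a page 2161 pixels wide, shown at 1080 pixels.
    line = {"top": 1855, "left": 47, "width": 194, "height": 27}
    print(to_css(highlight_box(line, page_width=2161, display_width=1080)))
```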
Editing Document with Auto Updating of Indexer
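A user correction could be applied to the stored page XML roughly as follows. This is a sketch only: the function name is invented, the XML layout follows the sample shown earlier, and re-submitting the updated file to the indexer is left out.

```python
import xml.etree.ElementTree as ET


def correct_line(document_xml, line_id, new_text):
    """Return the document XML with the text of one <line> replaced."""
    root = ET.fromstring(document_xml)
    for line in root.iter("line"):
        if line.get("id") == line_id:
            line.text = new_text
            return ET.tostring(root, encoding="unicode")
    raise KeyError("no line with id " + line_id)


if __name__ == "__main__":
    doc = (
        '<document><pages><page id="1"><paragraph>'
        '<line id="line_1_5" top="1965" left="45" width="485" height="26">'
        "last Thursday to share thei.r experiences</line>"
        "</paragraph></page></pages></document>"
    )
    print(correct_line(doc, "line_1_5", "last Thursday to share their experiences"))
```

Because the position attributes are untouched, highlighting and the generated PDF can pick up the corrected text without re-running OCR.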
Pazpar2 can be used to build alternative interfaces for searching multiple existing catalogues
Questions?
More Info & Credits
Tesseract-OCR
http://code.google.com/p/tesseract-ocr/
OCRopus
http://code.google.com/p/ocropus/
Do-It-Yourself Book Scanning
http://www.diybookscanner.org/
CHDK - Canon Hack Development Kit
http://chdk.wikia.com/wiki/CHDK
Zebra - XML Indexing
http://www.indexdata.com/zebra
Pazpar2 - Federated Search
http://www.indexdata.com/pazpar2
Cuneiform
http://en.wikipedia.org/wiki/HOCR
EyeFi Python Server
http://returnbooleantrue.blogspot.com/2009/01/eye-fi-standalone-server.html/
hOCR - HTML OCR
http://en.wikipedia.org/wiki/HOCR
OpenNLP
http://www.indexdata.com/pazpar2
Illinois Named Entity Tagger
http://cogcomp.cs.illinois.edu/page/software_view/4