BL Demo Day - July2011 - (4) OCR for IMPACT Part 1

IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. IMPACT OCR in a nutshell Clemens Neudecker, National Library of the Netherlands IMPACT Demo Day, British Library 12/11/11

Upload: impact-centre-of-competence

Post on 11-May-2015

DESCRIPTION

Clemens Neudecker's presentation describing how IMPACT has improved the quality of OCR in conjunction with ABBYY FineReader Engine. Delivered at the IMPACT BL Demo Day on the 12th of July 2011.

TRANSCRIPT

Page 1: BL Demo Day - July2011 - (4) OCR for IMPACT Part 1

IMPACT OCR in a nutshell

Clemens Neudecker, National Library of the Netherlands

IMPACT Demo Day, British Library 12/11/11

Page 2: BL Demo Day - July2011 - (4) OCR for IMPACT Part 1

OCR Process

Binarisation

= transform greyscale or colour images to bitonal (b/w) in order to separate foreground (text) from background

Segmentation

= detection of layout elements in hierarchical order (blocks/regions, lines, words, glyphs)

Pattern Matching (Recognition)

= matching of character shapes with internal font database (classifiers)
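The pattern matching step can be illustrated with a toy classifier. This is a minimal sketch, not how FineReader Engine works internally: the 3x3 templates, the `TEMPLATES` table and the `hamming` and `classify` helpers are all assumptions made up for illustration. It compares a bitonal glyph bitmap with each stored shape and picks the closest one.

```python
# Toy "font database": each template is a small bitonal bitmap (1 = ink).
# The shapes and the 3x3 size are illustrative assumptions only.
TEMPLATES = {
    "I": ["010",
          "010",
          "010"],
    "L": ["100",
          "100",
          "111"],
    "T": ["111",
          "010",
          "010"],
}


def hamming(a, b):
    """Count the pixels that differ between two equally sized bitmaps."""
    return sum(pa != pb
               for row_a, row_b in zip(a, b)
               for pa, pb in zip(row_a, row_b))


def classify(glyph):
    """Return the label of the template closest to the glyph bitmap."""
    return min(TEMPLATES, key=lambda label: hamming(glyph, TEMPLATES[label]))


# An "L" with one missing pixel in the bottom row.
noisy_l = ["100",
           "100",
           "110"]
print(classify(noisy_l))
```

Real classifiers work on normalised glyph images and far richer features, but the principle of scoring a segmented shape against known shapes is the same.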

Page 3: BL Demo Day - July2011 - (4) OCR for IMPACT Part 1

ABBYY FineReader

• Main OCR technology provider in IMPACT
• OCR technology experts for 30 years
• IMPACT uses FineReader Engine (SDK)

Page 4: BL Demo Day - July2011 - (4) OCR for IMPACT Part 1

Binarisation

Page 5: BL Demo Day - July2011 - (4) OCR for IMPACT Part 1

Adaptive Binarisation

Original scan

Prev. binarization

New binarization
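The slide does not say which algorithm the new binarisation uses, so the following is only a sketch of the general idea behind adaptive thresholding. It compares a single global threshold with a plain local-mean threshold. The function names, the `radius` and `offset` parameters and the sample pixel values are assumptions for illustration.

```python
def binarise_global(img, threshold=128):
    """Mark a pixel as foreground (1) if it is darker than one fixed threshold."""
    return [[1 if px < threshold else 0 for px in row] for row in img]


def binarise_adaptive(img, radius=1, offset=10):
    """Mark a pixel as foreground (1) if it is darker than its local mean minus an offset."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            # Collect the neighbourhood, clipped at the image borders.
            window = [img[j][i]
                      for j in range(max(0, y - radius), min(h, y + radius + 1))
                      for i in range(max(0, x - radius), min(w, x + radius + 1))]
            local_mean = sum(window) / len(window)
            row.append(1 if img[y][x] < local_mean - offset else 0)
        out.append(row)
    return out


# Greyscale strip whose background darkens from left to right,
# with two ink strokes at column indexes 1 and 6.
row = [220, 130, 180, 160, 140, 120, 30, 90]
img = [row[:] for _ in range(3)]

global_result = binarise_global(img)
adaptive_result = binarise_adaptive(img)
print(global_result[0])    # [0, 0, 0, 0, 0, 1, 1, 1]
print(adaptive_result[0])  # [0, 1, 0, 0, 0, 0, 1, 0]
```

On this sample the fixed threshold misses the stroke on the bright side and marks the darker background as ink, while the local threshold keeps both strokes and nothing else.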

Page 6: BL Demo Day - July2011 - (4) OCR for IMPACT Part 1

IMPACT Binarisation

Original

State of the Art

IMPACT

Page 7: BL Demo Day - July2011 - (4) OCR for IMPACT Part 1

Segmentation

Blocks/Regions

Words

Glyphs
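The hierarchical splitting named on this slide can be sketched with projection profiles, a classic textbook technique. It is not necessarily what FineReader Engine does. The helpers `runs`, `segment_lines` and `segment_glyphs` and the tiny sample page are assumptions for illustration. The sketch counts ink per row to find lines, then ink per column inside each line to find glyphs.

```python
def runs(profile):
    """Return (start, end) spans of consecutive non-zero entries; end is exclusive."""
    spans, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans


def segment_lines(img):
    """Split a bitonal image into text lines using the ink count per row."""
    return runs([sum(row) for row in img])


def segment_glyphs(img, line):
    """Split one line (a row span) into glyphs using the ink count per column."""
    top, bottom = line
    columns = [sum(img[y][x] for y in range(top, bottom))
               for x in range(len(img[0]))]
    return runs(columns)


# Tiny bitonal "page": two lines separated by one empty row.
page = [
    [1, 1, 0, 1, 0, 0],
    [1, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 1],
]
lines = segment_lines(page)
for line in lines:
    print(line, segment_glyphs(page, line))
```

This only works on clean, deskewed, single-column input. Multi-column layouts, images and touching glyphs need much more robust analysis.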

Page 8: BL Demo Day - July2011 - (4) OCR for IMPACT Part 1

IMPACT Segmentation example

Pre-IMPACT FR Engine 9

FR Engine 10

Part of column was misclassified as image

Page 9: BL Demo Day - July2011 - (4) OCR for IMPACT Part 1

IMPACT Segmentation example

v. 9

v. 10

Linear word order errors

Page 10: BL Demo Day - July2011 - (4) OCR for IMPACT Part 1

IMPACT Segmentation example

v. 9

v. 10

Lost text

Page 11: BL Demo Day - July2011 - (4) OCR for IMPACT Part 1

Recognition

Page 12: BL Demo Day - July2011 - (4) OCR for IMPACT Part 1

Languages and Dictionaries

Goal:

• Develop an interface so that external dictionaries can be integrated into the FineReader Engine

2008 - 2009:

• External Dictionary beta interface
• Same quality as with internal dictionaries possible

2010 - 2011:

• Make interface work reliably
• Teach partners how to use it
• Support for any language, any time period
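The slides do not show the external dictionary interface itself, so the following does not use or imitate the FineReader Engine API. It is only a rough sketch of why a period-specific word list helps. The lexicon `HISTORICAL_LEXICON`, the `correct` helper and the `cutoff` value are assumptions for illustration. Note that it corrects tokens after recognition, whereas an integrated dictionary guides the engine during recognition.

```python
import difflib

# Hypothetical external lexicon with early modern English spellings.
HISTORICAL_LEXICON = ["haue", "kinge", "loue", "vnto"]


def correct(token, lexicon, cutoff=0.7):
    """Return the token if it is in the lexicon, else the closest entry, else the token unchanged."""
    if token in lexicon:
        return token
    matches = difflib.get_close_matches(token, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else token


# "hane" is a plausible misreading of "haue" (u confused with n).
for token in ["vnto", "hane", "xyzzy"]:
    print(token, "->", correct(token, HISTORICAL_LEXICON))
```

A modern dictionary would not contain "haue" or "vnto" at all, which is why support for any language and any time period matters for historical material.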

Page 13: BL Demo Day - July2011 - (4) OCR for IMPACT Part 1

ALTO: New native export format

• Available since FRE 10 R2
• Supports most recent schema: ALTO v. 2.0
• Line coordinates available
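To show what line coordinates in ALTO look like, here is a minimal sketch that builds a single `TextLine` element with Python's standard library. It is not FineReader Engine output and not a complete, schema-valid ALTO file. The `alto_line` helper, the IDs and the pixel values are assumptions for illustration, and the namespace URI is the one I understand ALTO 2.x to use.

```python
import xml.etree.ElementTree as ET

# Assumed ALTO 2.x namespace; check against the schema you validate with.
ALTO_NS = "http://www.loc.gov/standards/alto/ns-v2#"


def alto_line(words, line_id="line_1"):
    """Build a TextLine element from (text, hpos, vpos, width, height) word tuples."""
    # The line box is the bounding box of all its words.
    left = min(w[1] for w in words)
    top = min(w[2] for w in words)
    right = max(w[1] + w[3] for w in words)
    bottom = max(w[2] + w[4] for w in words)
    line = ET.Element(f"{{{ALTO_NS}}}TextLine", {
        "ID": line_id,
        "HPOS": str(left), "VPOS": str(top),
        "WIDTH": str(right - left), "HEIGHT": str(bottom - top),
    })
    for text, hpos, vpos, width, height in words:
        ET.SubElement(line, f"{{{ALTO_NS}}}String", {
            "CONTENT": text,
            "HPOS": str(hpos), "VPOS": str(vpos),
            "WIDTH": str(width), "HEIGHT": str(height),
        })
    return line


line = alto_line([("IMPACT", 100, 200, 60, 12), ("OCR", 170, 198, 30, 14)])
print(ET.tostring(line, encoding="unicode"))
```

In a full ALTO document this element would sit inside `Layout`, `Page`, `PrintSpace` and `TextBlock` elements, alongside a `Description` section.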

Page 14: BL Demo Day - July2011 - (4) OCR for IMPACT Part 1

Thank you! Questions?