integration of bangla script recognition support in ocropus [thesis presentation by shouro...

19
Integration of Bangla script recognition support in OCRopus By Shouro Chowdhury Supervisor Dr. Mumit Khan Co-supervisor Md. Abul Hasnat

Upload: md-abul-hasnat

Post on 11-Apr-2015

477 views

Category:

Documents


0 download

DESCRIPTION

This is the undergrad thesis presentation by Shouro chowdhury. He worked on the integration of Bangla script recognition support with OCRopus.

TRANSCRIPT

Page 1: Integration of Bangla Script Recognition Support in OCRopus [Thesis Presentation by Shouro Chowdhury]

Integration of Bangla script recognition support in OCRopus

By Shouro Chowdhury

SupervisorDr. Mumit Khan

Co-supervisorMd. Abul Hasnat

Page 2: Integration of Bangla Script Recognition Support in OCRopus [Thesis Presentation by Shouro Chowdhury]

What is OCRopus?

OCRopus is an open source state-of-the-art document analysis and OCR system, featuring pluggable layout analysis, pluggable character recognition, statistical natural language modeling, and multi-lingual capabilities.

Reference: "The OCRopus Open Source OCR System", Thomas M. Breuel, Proceedings IS&T/SPIE 20th Annual Symposium 2008.

Page 3: Integration of Bangla Script Recognition Support in OCRopus [Thesis Presentation by Shouro Chowdhury]

Why OCRopus?

A complete OCR framework.Multi-lingual capabilities.Rich image preprocessing and layout analysis library.Two recognition engines:

Tesseract.bpnet (Multi Layer Perceptron Neural Network) classifier.

Statistical natural language model (OpenFST) as a post processor.Developed with the generous support from Google and other organizations.

Page 4: Integration of Bangla Script Recognition Support in OCRopus [Thesis Presentation by Shouro Chowdhury]

Framework overview

Training

Recognition

Page 5: Integration of Bangla Script Recognition Support in OCRopus [Thesis Presentation by Shouro Chowdhury]

Training tesseract

Step 1: Create training image dataStep 2: Make Box fileStep 3: Run Tesseract for TrainingStep 4: ClusteringStep 5: Compute the Character SetStep 6: Prepare Dictionary data files

Page 6: Integration of Bangla Script Recognition Support in OCRopus [Thesis Presentation by Shouro Chowdhury]

Recognition tesseract

Generate a properly segmented character image.Pass this image to the tesseract recognizer.

Page 7: Integration of Bangla Script Recognition Support in OCRopus [Thesis Presentation by Shouro Chowdhury]

Training bpnet

Two Approaches:From individual character image.From text line image.

Ist approach:Individual character image.Character unicode.

2nd approach:Text line image.Segmented color image of the text line image. Transcription of the text image.

Page 8: Integration of Bangla Script Recognition Support in OCRopus [Thesis Presentation by Shouro Chowdhury]

Recognition bpnet

Generate a color segmented character image.Pass this image to the bpnet classifier.

Page 9: Integration of Bangla Script Recognition Support in OCRopus [Thesis Presentation by Shouro Chowdhury]

Example: Training data generation for Tesseract

Step 1: Create training image data

Step 2: Make Box file

Step 3: Run Tesseract for Training

Step 4: Clustering

Page 10: Integration of Bangla Script Recognition Support in OCRopus [Thesis Presentation by Shouro Chowdhury]

Example: Training data generation for Tesseract

Step 5: Compute the Character Set

Step 6: Prepare Dictionary data files

Word List : 180K wordsFrequent word List : 30K words

Reference: http://crblpocr.blogspot.com/2008/07/how-to-train-bangla-and-devanagari.html

http://crblpocr.blogspot.com/2008/08/why-tesseract-need-to-train-all.html

Page 11: Integration of Bangla Script Recognition Support in OCRopus [Thesis Presentation by Shouro Chowdhury]

Example: Testing Tesseract

Step 1: Generate a properly segmented character image.

Step 2: Pass this image to the tesseract recognizer

aদভনণঢডhOcr output

Page 12: Integration of Bangla Script Recognition Support in OCRopus [Thesis Presentation by Shouro Chowdhury]

Example: Training data generation for bpnet

Approach 1: From individual character image

character image Character unicode

Image – transcription mapped list file

Page 13: Integration of Bangla Script Recognition Support in OCRopus [Thesis Presentation by Shouro Chowdhury]

Example: Training data generation for bpnet

Approach 2: From line image

Line image

Line transcription

Image – transcription mapped list file

Color image

Page 14: Integration of Bangla Script Recognition Support in OCRopus [Thesis Presentation by Shouro Chowdhury]

Example: Testing bpnet classifier

Step 1: Generate a color segmented character image.

Fig: Input text Image

Fig: Segmented color Image

Step 2: Pass this image to the bpnet classifier.

কত কাল কেব েস pভােতhOcr output

Page 15: Integration of Bangla Script Recognition Support in OCRopus [Thesis Presentation by Shouro Chowdhury]

Example: Testing bpnet classifier (best NN Parameters)

nhidden - 500 epochs - 300 learningrate - 0.2 testportion - 0 normalize - 1 shuffle - 1

Page 16: Integration of Bangla Script Recognition Support in OCRopus [Thesis Presentation by Shouro Chowdhury]

CRBLP segmenter plugin

Four step character segmentation algorithm.Projection profile based.Elimination of the joining errors

Shadow character segmentationTouching character segmentation

Elimination of splitting errorsReordering the modifiers

Page 17: Integration of Bangla Script Recognition Support in OCRopus [Thesis Presentation by Shouro Chowdhury]

Acknowledgement

Mark Stillwell (University of Hawaii)

Prof. Thomas Breuel (IUPR)

Ilya Mezhirov (IUPR)

Yves Rangoni (IUPR)

Ray Smith (Google research)

CRBLP Lab

Page 18: Integration of Bangla Script Recognition Support in OCRopus [Thesis Presentation by Shouro Chowdhury]

References

http://code.google.com/p/ocropus/http://code.google.com/p/tesseract-ocr/http://code.google.com/p/ocropus-bengali/http://crblpocr.blogspot.com/

Page 19: Integration of Bangla Script Recognition Support in OCRopus [Thesis Presentation by Shouro Chowdhury]