development of an ocr system nathan harmata tjhsst computer systems lab 2007-2008

Post on 13-Jan-2016

216 Views

Category:

Documents

1 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Development of an OCR System

Nathan Harmata

TJHSST Computer Systems Lab2007-2008

What is OCR?

Optical Character Recognition

Font and handwriting based

Goals of My Project

Generic recognition for Latin-based fonts

Proper handling of most formatting

System built from scratch

Overview of Idocrase System

Image Processing

Transformations

Attribute

Character Model

Transformations

Sector Vector - image is parsed into parts that pass the vertical line test

- then each part is transformed into a collection of line segments

Gap Vector - gaps, if any, are found on the four sides of the image

Transformations

Pixel Concentration Vector – which sides, if any, have a higher concentration of pixels

Character Recognition

GCDD – Generic Character Definition Database

Averages of Character Models for every character from many different fonts

0 PixelConcentrationVector balanced balanced SectorVector 4 3 GapVector

Character Recognition

For a single character:

For words, dictionary and grammar references are used.

Idocrase Application

Results

-Mediocre word recognition-Doesn’t handle formatting well-Doesn’t handle small letters well-Fairly accurate single character recognition (93.7%)

top related