POTENTIAL OCR SOFTWARE FOR NUTRITION FACTS LABELS Dennis Given


Post on 23-Dec-2015




THE GENERAL OPTICAL CHARACTER RECOGNITION CONCEPT

Input -> OCR -> Output (e.g. the text "buzz")


PREFERENCES FOR THE OCRS

Accurate
Fast
Written in Java
  This will make it easier to find someone to work on the software in the future.
Open-source (free)
  Commercial options, although considerably faster and more accurate, are costly solutions.
Editable
  So that, if I have to, I can go into the OCR engine and edit whatever I need to.
  Commercial OCRs don’t always allow for this option.


OCRS THAT MEET SOME OF THE PREFERENCES

Aspire OCR SDK
JavaOCR
ABBYY FineReader
Tesseract Version 3.01


COMPARISONS

Preferences compared:
  Accuracy?
  Speed?
  Java?
  Editable?
  Open-source?
Example image used to determine the best:
  GIF image, 1204x2004.
  This resolution is close to that of images from the iPhone 3GS camera (1500x2000) and the iPhone 3G camera (1200x1600).
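The slides do not say how accuracy was scored, so the following is only a sketch of one common approach: character accuracy, computed as one minus the edit distance between a hand-typed reference and the OCR output, divided by the reference length. The class and method names are hypothetical. Speed could be compared in a similar spirit by timing each engine's run with `System.nanoTime()`.

```java
class OcrAccuracy {
    /** Levenshtein edit distance: insertions, deletions and substitutions each cost 1. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Character accuracy in [0, 1], relative to the hand-typed reference text (an assumed metric). */
    static double characterAccuracy(String reference, String ocrOutput) {
        if (reference.isEmpty()) {
            return ocrOutput.isEmpty() ? 1.0 : 0.0;
        }
        double accuracy = 1.0 - (double) editDistance(reference, ocrOutput) / reference.length();
        return Math.max(0.0, accuracy);
    }

    public static void main(String[] args) {
        // Made-up sample strings, not results from the engines in these slides.
        String reference = "Total Fat 8g";
        String ocrOutput = "Tota1 Fat 8g";
        System.out.println(characterAccuracy(reference, ocrOutput));
    }
}
```

In the sample, one of twelve characters is wrong, so the score is 11/12. Scoring every engine against the same reference text for the same example image would keep the comparison consistent.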


EXAMPLE IMAGE

[Example image, labeled 2004 pixels by 1200 pixels]


ASPIRE OCR

Pros:
  Runs across many platforms
  Relatively fast
  Written in Java and meant to be added to Java applications
Cons:
  Not very accurate
  Must pay for the full SDK (Software Development Kit)


ASPIRE RESULTS


JAVAOCR

Pros:
  Written entirely in Java
  Full source code is given (easy to edit)
  Easy graphical user interface
  Relatively fast
Cons:
  Instead of converting the image to text, it converts it to .png files, one per character
  Not very accurate (sometimes won’t even bother converting the image to more than one character)
  Even the images that were converted were not done very well


JAVAOCR RESULTS


ABBYY FINEREADER

Pros:
  Very good interface
  Lots of tools to edit the area being scanned
  The most accurate program tried
Cons:
  Not in Java
  Commercial (not open-source) and VERY expensive


ABBYY FINEREADER RESULTS


TESSERACT VERSION 3.01

Developed by HP Labs
Now used by Google
Pros:
  Close in accuracy to the commercial OCRs
  Easy to use from the command line
  Lots of documentation available
Cons:
  Must use a Java wrapper if we want future edits to be done in Java
  Source code is written in C/C++, so it will be difficult to edit


TESSERACT RESULTS


COMMERCIAL OCR VS. TESSERACT

Commercial OCR                                         | Tesseract
100+ languages                                         | 6+ languages
Accuracy is good                                       | Accuracy is good, but not as good as commercial OCRs
Sophisticated application with complex user interface  | No user interface
Mostly meant for Windows OS                            | Runs on Linux, Mac, Windows, and more
Costs $100+ to use                                     | Open source (free)


WHERE TO GO FROM HERE…

Tesseract is our best option at this point:
  Fast
  Free
  Outperforms the other available open-source OCR engines
  Plenty of documentation
    An Overview of the Tesseract OCR Engine, by Ray Smith
    Tesseract OSCON pdf
    http://code.google.com/p/tesseract-ocr/
There are three different ways to go from here.


OPTION 1 (~5 WEEKS): USE TESSERACT ENGINE AND WRAP IT

Wrapper library
  A library is a collection of subroutines or classes used to develop software. Libraries expose interfaces that clients of the library use to execute library routines.
  A wrapper library (or library wrapper) is a thin layer of code that translates a library's existing interface into a compatible interface.
By wrapping Tesseract, it won’t matter that Tesseract’s source code is written in C++.
However, this means we will still not be able to customize the Tesseract engine to do exactly what we want (specific to nutrition labels).
  We can control the input and output, but the process of determining characters will remain the same.
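As a rough illustration of the wrapping idea, the sketch below calls the Tesseract command-line program from Java with `ProcessBuilder` and reads back the text file it writes. It assumes a `tesseract` executable is installed and on the PATH, and that it follows the usual `tesseract <image> <outputbase> -l <lang>` form, which writes `<outputbase>.txt`. The class name `TesseractCli` and its methods are hypothetical, and this is only one possible kind of wrapper (a JNI or JNA binding to the C++ library would be another).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class TesseractCli {
    private final String executable; // assumed to be "tesseract" on the PATH

    TesseractCli(String executable) {
        this.executable = executable;
    }

    /** Builds the command line: tesseract <image> <outputBase> [-l <lang>]. */
    List<String> buildCommand(Path image, Path outputBase, String lang) {
        List<String> command = new ArrayList<>();
        command.add(executable);
        command.add(image.toString());
        command.add(outputBase.toString());
        if (lang != null && !lang.isEmpty()) {
            command.add("-l");
            command.add(lang);
        }
        return command;
    }

    /** Runs Tesseract on one image and returns the recognized text. */
    String recognize(Path image, String lang) throws IOException, InterruptedException {
        Path outputBase = Files.createTempFile("ocr", "");
        Path textFile = Paths.get(outputBase.toString() + ".txt"); // Tesseract appends .txt
        try {
            Process process = new ProcessBuilder(buildCommand(image, outputBase, lang))
                    .inheritIO() // forward Tesseract's own messages to our console
                    .start();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("tesseract exited with code " + exitCode);
            }
            return new String(Files.readAllBytes(textFile), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(textFile);
            Files.deleteIfExists(outputBase);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("usage: java TesseractCli <image>");
            return;
        }
        System.out.println(new TesseractCli("tesseract").recognize(Paths.get(args[0]), "eng"));
    }
}
```

The Java side only controls what goes in (the image and language) and what comes out (the text), which matches the limitation noted above: the character recognition itself stays inside Tesseract.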


OPTION 2 (~7 WEEKS): BUILD AN OCR ENGINE FROM SCRATCH

Understand general concepts
Can use ideas and implementations from OCRs such as Tesseract and JavaOCR
Can customize the engine to run specifically for nutrition facts labels
  Would be more effective than a “general” OCR, which isn’t looking for specifics
The whole thing can be written in Java (easier for future developers to work on)
However:
  It will take more time
  It will probably have more bugs in it
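To make "looking for specifics" concrete, here is a small hypothetical example of label-specific knowledge that a general OCR engine does not use: in a quantity such as "160mg", the characters before the unit should be digits, so letters commonly confused with digits can be corrected. The confusion pairs (O/0, l/1, I/1, S/5, B/8), the unit list and the class name are assumptions chosen for illustration, not part of any engine discussed here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantityFixer {
    // A quantity-like token: digits or digit look-alikes (at least one real digit),
    // immediately followed by a unit that is not part of a longer word.
    private static final Pattern QUANTITY =
            Pattern.compile("\\b([0-9OolISB.]*[0-9][0-9OolISB.]*)(mg|mcg|g|%)(?![A-Za-z])");

    /** Replaces digit look-alike letters inside quantity tokens; other text is left untouched. */
    static String fix(String line) {
        Matcher matcher = QUANTITY.matcher(line);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String digits = matcher.group(1)
                    .replace('O', '0').replace('o', '0')
                    .replace('l', '1').replace('I', '1')
                    .replace('S', '5').replace('B', '8');
            matcher.appendReplacement(out, Matcher.quoteReplacement(digits + matcher.group(2)));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up OCR output with a letter O misread in place of a zero.
        System.out.println(fix("Sodium 16Omg 7%"));
    }
}
```

Requiring at least one real digit keeps ordinary words such as "Sodium" from being rewritten. An engine built only for nutrition labels could apply rules like this during recognition rather than afterwards.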

Option 3 is to take more time to determine the OCR…


GOALS

At the end of the time frame, I plan to:
Have a running OCR application that will:
  At least be able to scan in cereal box (flat) images effectively and convert the labels to usable data.
  Have minimal bugs (although some will definitely exist).
  Have an accuracy rate of at least 95%.
Have begun to identify effective ways to manage images with curved (jars, bottles, etc.) and wrinkled (bags, packaging, etc.) nutrition facts labels.
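One way "convert the labels to usable data" might look in practice is sketched below: each line of OCR text that has the form "name amount unit" is turned into a map entry. The line format, the supported units and the class name `LabelParser` are assumptions for illustration only. Real labels have more variety (for example, calories have no unit and percentages follow the amount), which this sketch ignores.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LabelParser {
    // "<nutrient name> <amount><unit>", e.g. "Total Fat 8g" or "Sodium 160 mg".
    private static final Pattern LINE =
            Pattern.compile("^\\s*([A-Za-z][A-Za-z ]*?)\\s+(\\d+(?:\\.\\d+)?)\\s*(mg|mcg|g)\\b");

    /** Maps nutrient name to "amount unit", in label order; lines that do not match are skipped. */
    static Map<String, String> parse(String ocrText) {
        Map<String, String> facts = new LinkedHashMap<>();
        for (String line : ocrText.split("\\r?\\n")) {
            Matcher matcher = LINE.matcher(line);
            if (matcher.find()) {
                facts.put(matcher.group(1), matcher.group(2) + " " + matcher.group(3));
            }
        }
        return facts;
    }

    public static void main(String[] args) {
        // Made-up sample text standing in for OCR output from a flat label.
        String ocrText = "Total Fat 8g 12%\nSodium 160mg 7%\nIngredients: corn";
        System.out.println(parse(ocrText));
    }
}
```

A structure like this would also make the 95% target easier to check, since extracted values can be compared field by field against values typed in by hand.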