OCRFeeder - OCR Made Easy on GNOME (GUADEC 2012)

Post on 02-Jul-2015

DESCRIPTION

By Joaquim Rocha. A lot of documents are still stored on paper, which presents problems related to preservation, flexibility and even ecology. With the current Free Software OCR engines it is possible to get a good accuracy rate when converting printed text to digital format, but these engines only perform that basic conversion and know nothing about a document's structure and elements. OCRFeeder presents itself as an easy-to-use solution implemented for GNOME that performs automatic content detection in pages, allows manual correction and uses the system-wide OCR engines to convert the text. It can export documents in various formats such as ODT, HTML or PDF. This project stands as the most complete Free Software solution for converting printed documents to digital formats and competes with the proprietary alternatives.

TRANSCRIPT

static void
_f_do_barnacle_install_properties (GObjectClass *gobject_class)
{
  GParamSpec *pspec;

  /* Party code attribute */
  pspec = g_param_spec_uint64 (F_DO_BARNACLE_CODE,
                               "Barnacle code.",
                               "Barnacle code",
                               0,
                               G_MAXUINT64,
                               G_MAXUINT64 /* default value */,
                               G_PARAM_READABLE | G_PARAM_WRITABLE |
                               G_PARAM_PRIVATE);

  g_object_class_install_property (gobject_class,
                                   F_DO_BARNACLE_PROP_CODE,

Joaquim Rocha, jrocha@igalia.com

OCRFeeder

OCR Made Easy on GNOME

July 27 2012

Joaquim Rocha (Igalia) · OCRFeeder · GUADEC 2012

What is it?

Document Analysis and Optical Character Recognition for GNOME

Why?

Paper has a number of problems

No GNU/Linux applications do a fair job

Paper problems: Security

CC Photo by: http://www.flickr.com/photos/badwsky/

Paper problems: Preservation

CC Photo by: http://www.flickr.com/photos/98469445@N00/

Paper problems: Data processing

CC Photo by: http://www.flickr.com/photos/hugovk/

Paper problems: Ecology

CC Photo by: http://www.flickr.com/photos/pranavsingh/

Paper problems: Accessibility

CC Photo by: http://www.flickr.com/photos/illustrator/

No fair conversion apps for GNU/Linux

apart from OCR engines, but...

OCR != Document Conversion

(it only deals with chars)

(does not consider the layout)

(does not distinguish contents)

What's needed is

Document Analysis and Recognition

(conversion of documents to an electronic format)

(first projects in the 80s)

How it works

So many layouts...

CC Photo by: http://www.flickr.com/photos/uber-tuber/

Layouts vary with the type of document

What works for detecting one layout won't work for others

OCRFeeder focuses on contents, not on layouts!

Key concept:

If a document image can be divided into windows of 1 (content) or 0 (not content), then it is possible to group all the 1s and outline the contents
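
The key concept above can be sketched roughly as follows. This is only an illustration under assumptions, not OCRFeeder's actual code: the image is taken to be already binarized into rows of 0/1 values (1 = ink), a window counts as content if it holds any ink, and the function names `window_mask` and `outline_contents` are hypothetical.

```python
from collections import deque


def window_mask(image, win):
    """Split a binarized image (rows of 0/1, 1 = ink) into win x win windows.

    Returns a grid holding 1 for windows with any ink (content)
    and 0 for empty windows (not content).
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    mask = []
    for top in range(0, rows, win):
        mask_row = []
        for left in range(0, cols, win):
            ink = sum(
                image[y][x]
                for y in range(top, min(top + win, rows))
                for x in range(left, min(left + win, cols))
            )
            mask_row.append(1 if ink > 0 else 0)
        mask.append(mask_row)
    return mask


def outline_contents(mask, win):
    """Group 4-connected 1-windows and return one bounding box per group.

    Boxes are (left, top, right, bottom) in pixels, aligned to the window
    grid, so they may extend past the image edge when the image size is
    not a multiple of win.
    """
    seen = set()
    boxes = []
    for r, row in enumerate(mask):
        for c, value in enumerate(row):
            if not value or (r, c) in seen:
                continue
            # Breadth-first search over neighbouring content windows.
            queue = deque([(r, c)])
            seen.add((r, c))
            r0 = r1 = r
            c0 = c1 = c
            while queue:
                y, x = queue.popleft()
                r0, r1 = min(r0, y), max(r1, y)
                c0, c1 = min(c0, x), max(c1, x)
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    inside = 0 <= ny < len(mask) and 0 <= nx < len(mask[ny])
                    if inside and mask[ny][nx] and (ny, nx) not in seen:
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            boxes.append((c0 * win, r0 * win, (c1 + 1) * win, (r1 + 1) * win))
    return boxes
```

Each returned box outlines one content area that could then be handed to an OCR engine or kept as an image. The window size trades precision for robustness: small windows follow the contents closely, larger ones merge nearby lines into a single block.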

Recognition:

System-wide OCR engines are used

Engines are configured from the GUI or XML files
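
As an illustration of configuring an engine from an XML file, here is a minimal sketch. The element names (`engine`, `name`, `path`, `arguments`), the `$IMAGE` placeholder and the `example-ocr` program are all hypothetical and are not claimed to match OCRFeeder's real file format.

```python
import shlex
import xml.etree.ElementTree as ET

# Hypothetical engine definition; the schema is an assumption for this sketch.
ENGINE_XML = """
<engine>
    <name>ExampleOCR</name>
    <path>/usr/bin/example-ocr</path>
    <arguments>$IMAGE --stdout</arguments>
</engine>
"""


def load_engine(xml_text):
    """Read an engine definition into a plain dictionary."""
    root = ET.fromstring(xml_text.strip())
    return {
        "name": root.findtext("name"),
        "path": root.findtext("path"),
        "arguments": root.findtext("arguments", default=""),
    }


def build_command(engine, image_path):
    """Build the argument list for running the engine on one image.

    The $IMAGE placeholder is replaced by the image path; the result is
    a list suitable for subprocess.run, so no shell quoting is needed.
    """
    args = shlex.split(engine["arguments"])
    return [engine["path"]] + [image_path if a == "$IMAGE" else a for a in args]
```

Keeping the engine definition as data means a new engine can be supported by adding a file rather than changing code, which fits the idea of relying on whatever OCR engines are installed system-wide.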

Most known free OCR engines are detected and configured automatically:

* Tesseract
* GOCR
* OCRAD
* Cuneiform
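
Automatic detection of the engines listed above could work along these lines. This is a sketch, not OCRFeeder's implementation: it assumes each engine is installed under its usual executable name and simply looks that name up on the `PATH`; the `detect_engines` function is hypothetical.

```python
import shutil

# Assumed executable names for the engines named in the slides.
KNOWN_ENGINES = {
    "Tesseract": "tesseract",
    "GOCR": "gocr",
    "OCRAD": "ocrad",
    "Cuneiform": "cuneiform",
}


def detect_engines(which=shutil.which):
    """Return {engine name: executable path} for engines found on the PATH.

    The lookup function is a parameter so it can be swapped out in tests.
    """
    found = {}
    for name, executable in KNOWN_ENGINES.items():
        path = which(executable)
        if path:
            found[name] = path
    return found
```

Calling `detect_engines()` with no arguments checks the real system; which engines it reports depends on what is installed.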

Export formats:

* ODT
* HTML
* Plain text
* PDF

User interaction:

Users can edit everything and review the algorithm's results

So the UI can work in both attended and unattended ways

The CLI only works in unattended mode

Demo time!

Other features:

* PDF import
* Unpaper preprocessor
* Font style editing
* Image deskewing
* OCR results cleaning
* Project saving/loading

Future:

* More export formats: hOCR, etc.

* Make OCR engines' management easier

Webpage: http://live.gnome.org/OCRFeeder

git: http://git.gnome.org/ocrfeeder

Bugzilla: http://bugzilla.gnome.org (product: OCRFeeder)

Thank you!