
Document Analysis and Recognition

CS 661

What is a Document?

a. A written or printed paper that bears the original, official, or legal form of something and can be used to furnish decisive evidence or information.

b. Something, such as a recording or a photograph, that can be used to furnish evidence or information.

c. A writing that contains information.

d. Computer Science. A piece of work created with an application, as by a word processor.

e. Computer Science. A computer file that is not an executable file and contains data for use by applications.

Document Image Analysis

• DIA is the theory and practice of recovering the symbol structures of digital images scanned from paper or produced by computer

• DIA is a subfield of digital image processing

– Digital images of natural objects (X-rays, fingerprints, faces, scenery, etc.) are NOT part of DIA

– Digital images of symbolic objects (postal addresses, printed articles, forms, music sheets, engineering drawings, topographic maps) belong to DIA

– Source: scanners, printers, fax machines, hand!

– Incidental text: license plates, billboards, subtitles, in photos and video

– WWW ??

• DIA’s grand goal is to take us to the land of the paperless office

Paperless Office?

• Traditional transmission and storage of information has been by paper documents

• Documents are increasingly originating on the computer

• Documents printed for reading, dissemination, and markup

• Paper in the office has increased!!

• Goal: Deal with the flow of electronic and paper documents in an efficient and integrated manner

• Implication: Unlike computer media, paper documents should be read by both the computer and people

Short Tour of DIA

• The field started before digital computers could represent information that traditionally appeared on paper

• Patents on OCR for telegraph and reading machines for the blind filed in the 19th century and working models demonstrated in 1916

• OCR on specially designed fonts used in 1950s

• First postal address reader installed in 1965

• OCR systems that read scanned pages came into their own in the 1980s with the advent of low-cost microprocessors, bit-mapped displays, and scanners

• Large capacity storage devices have now ignited the field with the prospects of Digital Libraries

• Document imaging today is a billion dollar industry but document interpretation is only a small part of it


Textual Processing

– Optical Character Recognition: text

– Page Layout Analysis: skew, blocks, paragraphs

Graphical Processing

– Line Processing: lines, curves, corners

– Region and Symbol Processing: filled regions

Current

• Processors getting faster

• Storage costs are down

– Pictures are typically 512 x 512 pixels

– Speech signals are typically 256 sample points

– Business letters are typically 2550 x 3300 pixels at 300 dpi

– Engineering drawings are typically 34000 x 44000 pixels at 1000 dpi

• Digital libraries need WWW interface

• Information retrieval and search

• OCR accuracy on the rise

• Contextual models improved
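
The pixel counts quoted for business letters and engineering drawings follow from sheet size times scan resolution. The sketch below is illustrative only: the 8.5 x 11 in and 34 x 44 in sheet sizes are assumptions inferred from the pixel counts, and the function names are hypothetical. It also estimates raw (uncompressed) storage to show why storage cost matters.

```python
def scan_dimensions(width_in, height_in, dpi):
    """Return (width, height) in pixels for a sheet scanned at `dpi`."""
    return round(width_in * dpi), round(height_in * dpi)


def raw_size_bytes(width_px, height_px, bits_per_pixel):
    """Uncompressed size in bytes, padding each row up to a whole byte."""
    row_bytes = (width_px * bits_per_pixel + 7) // 8
    return row_bytes * height_px


# Assumed sheet sizes (not stated on the slide).
letter = scan_dimensions(8.5, 11, 300)
drawing = scan_dimensions(34, 44, 1000)

letter_bitonal = raw_size_bytes(*letter, 1)    # 1 bit per pixel
letter_gray = raw_size_bytes(*letter, 8)       # 8 bits per pixel
drawing_bitonal = raw_size_bytes(*drawing, 1)

print(letter, letter_bitonal, letter_gray)
print(drawing, drawing_bitonal)
```

Even at one bit per pixel, the assumed engineering drawing needs roughly 187 MB before compression, against about 1 MB for the letter.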

Document page: 300 dpi, 8.5 x 11 in; 255 gray x 3 color; 2,550 x 3,300 pixels

• Data capture: 10^7 pixels

• Pixel-level processing

– 7,500 character boxes, 15 x 20 pixels each

– 500 line and curve segments, 20 to 20,000 pixels each

– 10 filled regions, 20 x 20 to 200 x 200 pixels each

• Feature-level processing

– 7,500 x 10 character features

– 500 x 5 line and curve features

– 10 x 5 region features

• Text analysis & recognition: 1,500 words, 10 paragraphs, 1 title, 2 subtitles, etc.

• Graphics analysis & recognition: 2 line diagrams, 1 company logo, etc.

Document Description
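
The figures above describe how much the data shrinks at each processing stage. As a rough sketch, the arithmetic can be checked as follows. It uses only the round numbers from the slide; the variable names and the choice to count each feature as one value are assumptions for illustration.

```python
# Pixels on the page: 300 dpi, 8.5 x 11 in.
page_pixels = 2550 * 3300

# (item count, features per item) as listed under feature-level processing.
feature_groups = {
    "character": (7500, 10),
    "line_and_curve": (500, 5),
    "region": (10, 5),
}
feature_values = sum(count * per_item for count, per_item in feature_groups.values())

# Stages in pipeline order, with the number of items each one produces.
stages = [
    ("pixels", page_pixels),
    ("primitives (boxes, segments, regions)", 7500 + 500 + 10),
    ("feature values", feature_values),
    ("words", 1500),
]

for name, count in stages:
    ratio = page_pixels / count
    print(f"{name:40s} {count:>10,d}  (1 per {ratio:,.0f} pixels)")
```

The page holds about 8.4 million pixels, which is the order of magnitude (10^7) quoted for data capture, and the final description of 1,500 words is several thousand times smaller.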

Processing | Text | Graphics

Pixels | Preprocessing: representation, noise removal, binarization, skew, script ID, font ID | Preprocessing: representation, noise removal, binarization, thinning, vectorization

Primitives | Glyph recognition: connected components, strokes, punctuation, words | Primitive recognition: straight lines, curve segments, junctions, nodes, loops, characters

Structures | Text recognition: word segmentation, text line reconstruction, table analysis, linguistics | Structure recognition: text fields, legends, labels, dimensions, graphics symbols

Documents | Page layout analysis: text versus non-text, physical component analysis, logical component analysis, functional component analysis, compression | Interpretation: component recognition, connectivity analysis, CAD layer separation, database attribute extraction, compression

Corpus | Information retrieval: document classification, indexing, search, security, authentication, privacy | Database, CAD: validation, search, update
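
Two of the steps named above, binarization at the pixel level and connected components at the primitive level, can be sketched in a few lines. This is a minimal illustration, not the course's method: it assumes a fixed global threshold of 128 (real systems usually pick the threshold adaptively), 8-connectivity, and an image stored as a list of rows. All names are hypothetical.

```python
from collections import deque


def binarize(gray, threshold=128):
    """Map a grayscale image (rows of 0-255 values) to 1 = ink, 0 = background."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def connected_components(binary):
    """Label 8-connected ink regions with a breadth-first flood fill.

    Returns (labels, boxes), where boxes[i] is the bounding box
    (top, left, bottom, right) of the component labeled i + 1.
    """
    h, w = len(binary), len(binary[0])
    labels = [[0] * w for _ in range(h)]
    boxes = []
    for r in range(h):
        for c in range(w):
            if binary[r][c] == 0 or labels[r][c]:
                continue  # background, or already part of a component
            label = len(boxes) + 1
            labels[r][c] = label
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny][nx] and not labels[ny][nx]):
                            labels[ny][nx] = label
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return labels, boxes


# Tiny synthetic "page": dark values are ink, 255 is paper.
page = [
    [255, 255, 255, 255, 255, 255],
    [255,  10,  20, 255, 255,  30],
    [255,  15, 255, 255, 255,  25],
    [255, 255, 255, 255, 255, 255],
]
labels, boxes = connected_components(binarize(page))
print(boxes)
```

The bounding boxes play the role of the character boxes that later stages, such as glyph recognition, take as input.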


Document Taxonomy

Type | Example | DIA Task | Ancillary Data

Plain text narrative | Moby Dick | Extract word order | English lexicon

Newspaper, magazine | NY Times, Vogue | Separate and reassemble articles, pointers to illustrations | Publication specific format

Scholarly, technical text | IEEE PAMI | Index, author, title, page, figs, table, footnotes, equations | Abbreviations, acronyms, units

Formal text | Program listing, chess, bridge, recipe | Extract executable form | Program, chess, bridge syntax

Letter, Envelope | Recommendation | Sender, date, subject, routing info | Directories

Directory | Telephone book | Extract name phone pairs | Previous edition

Structured List | Table of Contents | Recover hierarchy, cross-refs | Previous edition

Business Forms | Order, invoice | Convert to XML, link to Database | Database form

Engineering Drawing | Part drawing, isometric view | Convert to CAD format | Part list, drawing standards

Schematic Diag | Circuits | Convert to CAD format | Constraints

Map | Street map | Convert to GIS format | GIS, other maps

Music score | Moonlight Sonata | Recover MIDI representation | Music syntax

Table | Stock quotes | Construct model; header-entries | Stock abbreviations

Postal Examples

Labeled regions of a mail piece: meter mark, sender’s address, delivery address, linear code, digital post mark, and endorsement ("In Case of Undeliverable as Addressed Return to Sender").

Unconstrained Text

Graphics Documents

Personal DL

DAS 02, Princeton, NJ

• OCR Features and Systems

– Degradation models, script ID, bilingual OCR, Kannada OCR, Tamil OCR, mp versus hw checks, traffic ticket reading

• Handwriting Recognition

– Stochastic models, holistic methods, Japanese OCR

• Classifiers and Learning

– Multi-classifier systems

• Layout Analysis

– Skew correction, geometric methods, text/graphics separation, logical labeling

• Tables and Forms

– Detecting tables in HTML documents, use of graph grammars, semantics

• Text Extraction

• Indexing and Retrieval

• Document Engineering

• New Applications

– CAPTCHA, Tachograph chart system, accessing driving directions

ICDAR 03, Edinburgh, UK

• Multiple Classifiers

• Postal Automation and Check Processing

• Document Understanding

• HMM Classifiers

• Segmentation

• Character Recognition

• Graphics Recognition

• Non-Latin Alphabets: Kanji/Chinese, Korean/Hangul, Arabic/Indian

• Web Documents, Video

• Word Recognition

• Image Processing

• Writer Identification

• Forms and Tables

CS 661 Class Schedule

Week | M | W

AUG 25 | Introduction | NIH /PROJECTS/Other Apps

SEPT 1 | X | IMG PROC for DIA

SEPT 8 | Doc Analysis; GRAPHICS | DIG LIB, Indexing, Retrieval

SEPT 15 | Q PR Statistical | PR- Structural, Neural Nets

SEPT 22 | OCR | OCR, CAMERA based problems

SEPT 29 | Word Recognition | Online

OCT 6 | Q Handwriting Reco Paradigms | PROJECTS

OCT 13 | HMM formulation | HMM, Viterbi, Baum-Welch

OCT 20 | HMM Examples | WEB DIA

OCT 27 | Q Multilingual ; Language Models | MCS, Committee methods

NOV 3 | POSTAL, Forms | BANK CHECK

NOV 10 | Annotations; Historical; Other Apps | IMG PROC, filters (Gabor), transforms (wavelets, cosine)

NOV 17 | Q CAPTCHA, Security | Biometrics, watermarking

NOV 24 | PROJECTS | PROJECTS

Grading

• Home Assignments and Quizzes

– 4 x 10 = 40 points

– Schedule is tentative to preserve surprise element

– Based on class participation and paper handouts

• Midterm project

– Demo: 10%

– Report: 15%

• Final project

– Demo: 10%

– Report: 25%

References

• Handbook of Character Recognition and Document Image Analysis, H. Bunke and P. S. P. Wang (editors), World Scientific Press

• Document Image Analysis, L. O'Gorman and R. Kasturi, IEEE Computer Society Press

• International Conference on Document Analysis and Recognition proceedings

• International Workshop on Document Analysis Systems proceedings

• Symposium on Document Image Understanding Technology
