Document Image Analysis Lecture 12: Word Segmentation

UC Berkeley CS294-9 Fall 2000
Richard J. Fateman, Henry S. Baird
University of California – Berkeley, Xerox Palo Alto Research Center

Page 1: Document Image Analysis Lecture 12:  Word Segmentation


Page 2: Document Image Analysis Lecture 12:  Word Segmentation

The course, recently…

• We studied symbol recognition, classifiers and their combinations

• Word recognition as distinct from character recognition

Page 3: Document Image Analysis Lecture 12:  Word Segmentation

A good segmentation method (or several) is handy

• We cannot rely on a lexicon to have all words (names, proper nouns, numbers, acronyms)

• Insisting that words be in the lexicon does not mean they are correct. PowerPoint tries to refuse “mispell” in favor of “misspell”, since “mispell” is not in the dictionary!

• Good segmentation means that symbol-based recognition has a better chance of success

Page 4: Document Image Analysis Lecture 12:  Word Segmentation

Segmentation: naïve or clever

• Numerous papers on the subject

• Some without strong models (e.g. cut at thin parts)

• Some with exhaustive search / template matching

• Some with learning / internal comparisons

Page 5: Document Image Analysis Lecture 12:  Word Segmentation

Naïve connected component analysis can’t come close…

• Characters like “i j : ; Ξ â %” are separated

• Ligatures are not separated: ffl, Œ Æ œ, ffi

• Vertical cuts between touching characters will not ordinarily work for italics

THIS IS ULTRA CONDENSED ..TZ this is times italic .

(other problems: superscripts and mathematical notation, such as X² and a cube-root expression in x² and y²)
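To make the failure concrete, here is a minimal sketch of naïve connected component analysis in plain Python. The toy bitmap, the `parse` helper and the choice of 8-connectivity are assumptions made for illustration, not something taken from the slides. The "i"-like shape on the left splits into two components, while the two touching strokes on the right come out as one.

```python
from collections import deque

def connected_components(img):
    """Return bounding boxes (top, left, bottom, right) of 8-connected black regions."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for r in range(h):
        for c in range(w):
            if img[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited black pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes

def parse(rows):
    """Hypothetical helper: '#' is a black pixel, anything else is white."""
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]

# Left: a dot above a stem (like "i"). Right: two strokes joined by a bridge.
toy = parse([
    "#...#.#",
    "....#.#",
    "#...###",
    "#...#.#",
    "#...#.#",
])
boxes = connected_components(toy)
print(len(boxes))  # 3: dot, stem, and the merged pair of strokes
```

Neither error can be repaired by the component labeller itself: the dot and stem need to be grouped, and the merged pair needs a cut, which is what the rest of the lecture addresses.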

Page 6: Document Image Analysis Lecture 12:  Word Segmentation

Papers of interest on segmentation

• Tsujimoto and Asada

• Bayer and Kressel

• Tao Hong’s (1995) PhD on Degraded Text Recognition

Page 7: Document Image Analysis Lecture 12:  Word Segmentation

Segmentation + Clustering (Tao Hong)

Page 8: Document Image Analysis Lecture 12:  Word Segmentation

Can lead to decoding!

Page 9: Document Image Analysis Lecture 12:  Word Segmentation

Sometimes the image itself holds a key to decoding…

Page 10: Document Image Analysis Lecture 12:  Word Segmentation

Visual inter-word relations

Page 11: Document Image Analysis Lecture 12:  Word Segmentation

An example text block showing visual inter-word relationships

Page 12: Document Image Analysis Lecture 12:  Word Segmentation

Pattern matching can lead to identifying a segment

Page 13: Document Image Analysis Lecture 12:  Word Segmentation

Page 14: Document Image Analysis Lecture 12:  Word Segmentation

Where this fits…

Page 15: Document Image Analysis Lecture 12:  Word Segmentation

Example

Page 16: Document Image Analysis Lecture 12:  Word Segmentation

Tsujimoto & Asada: Overview

Page 17: Document Image Analysis Lecture 12:  Word Segmentation

Resolve the touching characters:

• New metric for finding breaks (find plausible breaks)

• Use knowledge about “the usual suspects”: rn/m, k/lc, d/cl, … (limits search substantially)

Page 18: Document Image Analysis Lecture 12:  Word Segmentation

Metric, pre-processing

• ANDing columns for profile

• Removing slant from italics
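A rough sketch of both pre-processing ideas, under stated assumptions: the image is a list of rows of 0/1 pixels, "ANDing columns" is read as ANDing each column with its right-hand neighbor before counting black pixels, and slant removal is a plain horizontal shear with a known, non-negative shift per row. The function names and toy bitmaps are invented for illustration; the paper's exact metric may differ.

```python
def parse(rows):
    """Hypothetical helper: '#' is a black pixel, anything else is white."""
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]

def projection_profile(img):
    """Black-pixel count in each column."""
    return [sum(col) for col in zip(*img)]

def and_profile(img):
    """Black-pixel count after ANDing each column with its right neighbor.

    Entry i describes the boundary between columns i and i + 1.
    """
    cols = list(zip(*img))
    return [sum(a & b for a, b in zip(cols[i], cols[i + 1]))
            for i in range(len(cols) - 1)]

def deslant(img, shift_per_row):
    """Shear the image so a right-leaning slant becomes upright.

    Row r is shifted right by about r * shift_per_row pixels (shift_per_row >= 0),
    and every row is padded to the same width.
    """
    shifts = [int(r * shift_per_row + 0.5) for r in range(len(img))]
    pad = max(shifts)
    return [[0] * s + list(row) + [0] * (pad - s) for row, s in zip(img, shifts)]

# Two shapes that touch only diagonally between columns 2 and 3.
touching = parse([
    "###...",
    "#.#...",
    "###...",
    "...###",
    "...#.#",
])
print(projection_profile(touching))  # [3, 2, 3, 2, 1, 2]
print(and_profile(touching))         # [2, 2, 0, 1, 1]

# A one-pixel-wide stroke leaning right becomes a vertical stroke.
slanted = parse(["..#", ".#.", "#.."])
upright = deslant(slanted, 1)
```

The plain profile never reaches zero on the touching pair, while the ANDed profile drops to zero exactly at the boundary between columns 2 and 3, which is why it is a more useful break metric here.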

Page 19: Document Image Analysis Lecture 12:  Word Segmentation

Choosing break candidates

Page 20: Document Image Analysis Lecture 12:  Word Segmentation

Decision Tree for “The”

Page 21: Document Image Analysis Lecture 12:  Word Segmentation

Tree search

• Depth first, looking for solution to the string matching, in sequence.

• Some partitions are penalized (but not eliminated) if the segmentation point is uncertain.

• Segments are matched to omnifont templates (“multiple similarity method”…)
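The search described above can be sketched as a depth-first enumeration. Everything concrete here is an assumption for illustration: the image is abstracted to a list of named pieces between candidate breaks, template matching is replaced by a dictionary lookup returning a label and a cost, and the sketch lists every full decomposition rather than stopping at the first solution.

```python
def search(pieces, templates, break_penalty, start=0):
    """Depth-first enumeration of ways to split pieces[start:] into template matches.

    Yields (text, cost) pairs. break_penalty[i] is the extra cost of cutting just
    before pieces[i]; an uncertain cut is penalized but not eliminated.
    """
    if start == len(pieces):
        yield "", 0.0
        return
    for end in range(start + 1, len(pieces) + 1):
        match = templates.get(tuple(pieces[start:end]))
        if match is None:
            continue  # no template explains this segment, so prune the branch
        label, match_cost = match
        cut_cost = break_penalty[end] if end < len(pieces) else 0.0
        for rest, rest_cost in search(pieces, templates, break_penalty, end):
            yield label + rest, match_cost + cut_cost + rest_cost

# Hypothetical data: pieces "a" + "b" together look like "m", separately like "r", "n".
pieces = ["a", "b", "c"]
templates = {
    ("a",): ("r", 0.2),
    ("b",): ("n", 0.2),
    ("a", "b"): ("m", 0.1),
    ("c",): ("i", 0.0),
}
break_penalty = [0.0, 0.5, 0.0]  # the cut between "a" and "b" is uncertain

results = list(search(pieces, templates, break_penalty))
best = min(results, key=lambda pair: pair[1])
print(best[0])  # mi
```

Because the uncertain cut only adds cost, the reading "rni" survives as an alternative instead of being discarded, matching the "penalized (but not eliminated)" bullet.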

Page 22: Document Image Analysis Lecture 12:  Word Segmentation

Reexamined explanations

m / rn

q / cj

k / lc

B / 13

H / I-I

mm / nun

ck / dc

Etc… 30 confusions

This might be mistaken for This
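One way to picture how such a table is used: given a recognized string, propose alternative readings by swapping one side of a listed confusion for the other. The sketch below is only an illustration. It contains just the seven pairs shown above (the slide mentions about 30), and the single-substitution, either-direction policy is an assumption.

```python
CONFUSIONS = [
    ("m", "rn"), ("q", "cj"), ("k", "lc"), ("B", "13"),
    ("H", "I-I"), ("mm", "nun"), ("ck", "dc"),
]

def alternatives(text):
    """All strings reachable from text by one confusion substitution, either direction."""
    found = set()
    for a, b in CONFUSIONS:
        for src, dst in ((a, b), (b, a)):
            i = text.find(src)
            while i != -1:
                found.add(text[:i] + dst + text[i + len(src):])
                i = text.find(src, i + 1)
    found.discard(text)
    return sorted(found)

print(alternatives("corn"))   # ['com']
print(alternatives("modem"))  # ['modern', 'rnodem']
```

Each alternative would then be re-scored against the image or a lexicon; that step is outside this sketch.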

Page 23: Document Image Analysis Lecture 12:  Word Segmentation

Some tough calls…

Page 24: Document Image Analysis Lecture 12:  Word Segmentation

Unbelievable accuracy…

Page 25: Document Image Analysis Lecture 12:  Word Segmentation

A different, perhaps more general method (Bayer, Kressel)

• Goal: find the column position(s) at which characters are touching

  – Treat as a systematic classification problem

  – Learn from a database containing labelled merged characters

    • Collect real-life data; get human breakpoints [or could be synthetic, I suppose]

    • Find appropriate feature set

    • Learn the features of touching characters

  – Hypothesize column breaks

  – Application: postal addresses, other stuff too

Page 26: Document Image Analysis Lecture 12:  Word Segmentation

Database of touching chars

… 2158 patterns

Page 27: Document Image Analysis Lecture 12:  Word Segmentation

Big idea

Rather than represent the breaks as low points in the projection profile, represent the breaks in the natural context of touching characters by actual example, suitably normalized for size (15-30 pixels high).

These locations are manually marked.
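As a small illustration of "suitably normalized for size", the sketch below rescales a labelled example to a fixed height and maps its marked cut columns into the new coordinates. The nearest-neighbor resampling, the function name and the default height of 20 pixels (chosen from within the 15-30 range on the slide) are all assumptions; the slides do not say how the normalization was done.

```python
def normalize_height(img, cut_columns, target_height=20):
    """Rescale a 0/1 image to target_height rows, keeping the aspect ratio.

    Returns the rescaled image and the cut columns in the new coordinate system.
    Uses nearest-neighbor sampling, which is an assumption of this sketch.
    """
    h, w = len(img), len(img[0])
    scale = target_height / h
    target_width = max(1, round(w * scale))
    out = [
        [img[min(h - 1, int(y / scale))][min(w - 1, int(x / scale))]
         for x in range(target_width)]
        for y in range(target_height)
    ]
    cuts = [min(target_width - 1, int(c * scale)) for c in cut_columns]
    return out, cuts

# A blank 10 x 16 example with one manually marked cut at column 7.
example = [[0] * 16 for _ in range(10)]
scaled, cuts = normalize_height(example, [7])
print(len(scaled), len(scaled[0]), cuts)  # 20 32 [14]
```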

Page 28: Document Image Analysis Lecture 12:  Word Segmentation

Local feature set describing cut locations / measures of similarity

• Number of black pixels (= projection profile!)

• Number of white pixels counting from top/bottom

• Number of white-black transitions

• Number of identical b or w pixels next to this column (derivative of the projection profile?)
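The local features listed above might be computed per column roughly as follows. This is a sketch under assumptions: the image is a list of rows of 0/1 pixels, "white pixels counting from top/bottom" is read as the white run before the first black pixel from each end, transitions are counted top to bottom, and "next to this column" is taken to mean the right-hand neighbor. The feature names are invented.

```python
def parse(rows):
    """Hypothetical helper: '#' is a black pixel, anything else is white."""
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]

def column_features(img, x):
    """Local features of column x of a 0/1 image (1 = black)."""
    col = [row[x] for row in img]
    h = len(col)
    # Length of the white run before the first black pixel, from each end.
    white_from_top = next((i for i, p in enumerate(col) if p), h)
    white_from_bottom = next((i for i, p in enumerate(reversed(col)) if p), h)
    # White-to-black transitions when scanning the column top to bottom.
    transitions = sum(1 for a, b in zip(col, col[1:]) if a == 0 and b == 1)
    # Pixels identical to the right-hand neighbor column (None for the last column).
    if x + 1 < len(img[0]):
        same_as_right = sum(1 for row in img if row[x] == row[x + 1])
    else:
        same_as_right = None
    return {
        "black": sum(col),
        "white_from_top": white_from_top,
        "white_from_bottom": white_from_bottom,
        "transitions": transitions,
        "same_as_right": same_as_right,
    }

img = parse([
    "###...",
    "#.#...",
    "###...",
    "...###",
    "...#.#",
])
print(column_features(img, 1))
print(column_features(img, 2))
```

Column 2 is the last column of the left shape: it shares no pixel values with column 3, so `same_as_right` is 0 there, while interior columns score high.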

Page 29: Document Image Analysis Lecture 12:  Word Segmentation

Global feature set describing cut locations / measures of similarity

• Width-to-height ratio of full image (wider suggests touching characters)

• Width-to-height ratio of the image AFTER cutting(s)

• Number of white-black transitions

• Number of identical b or w pixels next to this column (derivative of the projection profile?)
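The two width-to-height features are simple enough to show as arithmetic. In this sketch the function names are invented, and cuts are given as column indices where the image is split.

```python
def aspect_ratio(width, height):
    """Width-to-height ratio of an image or image piece."""
    return width / height

def ratios_after_cuts(width, height, cuts):
    """Width-to-height ratio of each piece when the image is cut at the given columns."""
    edges = [0] + sorted(cuts) + [width]
    return [aspect_ratio(right - left, height) for left, right in zip(edges, edges[1:])]

# A 30 x 20 image is suspiciously wide for one character; cutting at column 14
# gives two pieces with more character-like proportions.
full = aspect_ratio(30, 20)
pieces = ratios_after_cuts(30, 20, [14])
print(full, pieces)
```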

Page 30: Document Image Analysis Lecture 12:  Word Segmentation

Illustration of the strategy

Page 31: Document Image Analysis Lecture 12:  Word Segmentation

How accurate, how fast? (cut location)

• Finding cuts: 7.8% error in learning set, 7.2% (!) on test set

• 22% of the no-cut regions had errors

• Best results used 50-feature classifier using 9-column width

• Cost for one image cut-analysis one character analysis

• Validates statistics > heuristics…