
UC Berkeley CS294-9 Fall 2000 5- 1

Document Image Analysis
Lecture 5: Metrics

Richard J. Fateman
Henry S. Baird

University of California – Berkeley
Xerox Palo Alto Research Center

UC Berkeley CS294-9 Fall 2000 5- 2

The course so far…

• Reminder: All course materials are online:

http://www-inst.eecs.berkeley.edu/~cs294-9/

• Overview of the DIA Research Field

• Some applications (Postal Addresses, Checks)

• Research Objectives: more systematic modeling, design

• Some basic engineering

UC Berkeley CS294-9 Fall 2000 5- 3

How well are we doing?

• Cost to achieve a useful result
• Compare the digital version to
  – hand keying / digitizing
  – verification
  – correction

• Correction cost may dominate total system cost

UC Berkeley CS294-9 Fall 2000 5- 4

When is a result nearly correct?

• Character model
  – Correct
  – Reject
  – Error

• String model
  – Insertion
  – Deletion
  – Rejection
  – Substitution [wrong letter identification]

UC Berkeley CS294-9 Fall 2000 5- 5

Using ascii character labels

ABCDEFGHIJKL = s1
ACD~~OIIUKL = s2

Insert B after A in s2.
Substitute E for ~, F for ~ [~ = reject].
Substitute G for O in s2.
Substitute H for I in s2.
Substitute I for U … etc.
(Really H was recognized as II, and IJ was recognized as U.)
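As an aside (not in the original slides), here is a minimal Python sketch that recovers an edit script like this one using the standard difflib module. Note that difflib optimizes for long matching blocks rather than a weighted edit distance, so its script may differ from the hand-derived one above.

import difflib

s1 = "ABCDEFGHIJKL"   # ground truth
s2 = "ACD~~OIIUKL"    # OCR output ('~' marks a reject)

# Report the operations needed to turn the OCR output into the truth.
for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, s2, s1).get_opcodes():
    if tag != "equal":
        print(f"{tag}: ocr[{i1}:{i2}]={s2[i1:i2]!r} -> truth[{j1}:{j2}]={s1[j1:j2]!r}")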

UC Berkeley CS294-9 Fall 2000 5- 6

Ascii labels are inadequate

• Unicode +
• Font +
• Point size +
• Tag information <author> .. </author>

UC Berkeley CS294-9 Fall 2000 5- 7

Simple measures may mislead

Increase the rejection rate and this “error rate” decreases. Reject all characters to get 0/0?

Some applications (e.g. post office) force very low error, even if (low confidence) correct results are sometimes rejected.

error rate = (# errors) / (# characters − # rejected) × 100%
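A toy illustration of this effect (my own sketch, not from the slides; it holds the error count fixed to isolate the shrinking denominator):

def error_rate(errors, characters, rejected):
    """Error rate as defined above: errors over non-rejected characters."""
    accepted = characters - rejected
    if accepted == 0:
        raise ZeroDivisionError("all characters rejected: the rate is 0/0")
    return 100.0 * errors / accepted

# The same 20 errors look better and better as more characters are rejected.
for rejected in (0, 500, 900):
    print(rejected, error_rate(errors=20, characters=1000, rejected=rejected))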

UC Berkeley CS294-9 Fall 2000 5- 8

Some errors are acceptable

• Keyword search: if the key word occurs many times and is only occasionally rejected, the document is still found

• Erroneous (nonsense) words are unlikely to be found by a search

• Caveat: if a key word is consistently changed to a nearby word, it may be missed (e.g. “dumptruck” consistently OCRed as “durnptruck”: search for dumptruck and never find it)

UC Berkeley CS294-9 Fall 2000 5- 9

Example: UNLV-ISRI document collection

• 20 million pages of scientific, legal, and official memos from DOE and contractors
  – Rock mining
  – Maps
  – Safe transportation of nuclear waste
  – Average length 44 pages

UC Berkeley CS294-9 Fall 2000 5- 10

Example: UNLV-ISRI document collection

• DOE’s Licensing Support System Prototype
  – 104,000 page images, 2,600 documents
  – Manually typed “correct” text
  – OCR text

• To determine relevance to queries, 3 methods were used
  – Geology students’ ranking (0/1)
  – OCR keyword search
  – “correct” text search

UC Berkeley CS294-9 Fall 2000 5- 11

Example: UNLV-ISRI document collection

• Exact match on 71 queries
  – 632 returned by correct text
  – 617 returned by OCR
  – Essentially: OCR is OK for this application

• Probabilistic ranking / frequency
  – Excessive OCR errors affected ranking
  – On average, similar results

• Feedback on relevance was not helpful for poor OCR

• Benchmarking conclusion: similar relevance rankings mean the OCR results are good enough

UC Berkeley CS294-9 Fall 2000 5- 12

Example: UNLV-ISRI document collection

One surprising result: on some standard tests of precision and recall, the OCRed text did better than the actual text.

[Crummy OCR meant that some terms were not recognized; but the documents were irrelevant….]

UC Berkeley CS294-9 Fall 2000 5- 13

A theory for computing accuracy

• Consider the result of OCR to be a string
  – Idealization: the most common errors involve mis-counting the number of spaces!
  – Ignores size / font / absolute position, etc.

UC Berkeley CS294-9 Fall 2000 5- 14

Computing the shortest edit distance

• Used in bio-informatics sequencing
• Associate a cost with each correspondence. For example,
  – Match or substitute (cost 0 or 1)
  – Insert or delete (cost 2)

UC Berkeley CS294-9 Fall 2000 5- 15

Attempt to align AUGGAA with ACUGAUGUGA. Distances were calculated using the following parameters: s(a,b) = 0 when a equals b; s(a,b) = 1 when a differs from b; insert or delete cost = 2. One of the possible optimal paths is indicated by a solid line connecting cells. It corresponds to the following alignment:

ACUGAUGUGA
A-UG--G-AA

[explain dynamic programming here?]

[Figure: the dynamic-programming cost matrix, with AUGGAA labeling the rows and ACUGAUGUGA labeling the columns.]
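A minimal Python sketch of the dynamic program (my own, not part of the lecture), using the costs quoted above: 0 for a match, 1 for a substitution, 2 for an insert or delete:

def edit_distance(s1, s2, sub_cost=1, indel_cost=2):
    """Weighted edit distance via dynamic programming, O(len(s1)*len(s2))."""
    m, n = len(s1), len(s2)
    # d[i][j] = cost of aligning s1[:i] with s2[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * indel_cost
    for j in range(1, n + 1):
        d[0][j] = j * indel_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = 0 if s1[i - 1] == s2[j - 1] else sub_cost
            d[i][j] = min(d[i - 1][j - 1] + match,   # match / substitute
                          d[i - 1][j] + indel_cost,  # delete from s1
                          d[i][j - 1] + indel_cost)  # insert into s1
    return d[m][n]

# -> 9: four indels at cost 2 each plus one substitution,
# matching the slide's alignment of AUGGAA against ACUGAUGUGA.
print(edit_distance("AUGGAA", "ACUGAUGUGA"))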

UC Berkeley CS294-9 Fall 2000 5- 16

Computing the shortest edit distance

• Also useful for other tasks (recognizing speech)

• Lots of ways of organizing the dynamic programming, all still O(n²).

• Probably of more interest is word accuracy, or accuracy on non-stopwords (excluding and, the, of, etc.); a sketch follows.
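A rough sketch of such a word-level measure (my simplification: a bag-of-words hit rate rather than a proper word-level alignment, with an assumed toy stopword list):

STOPWORDS = {"and", "the", "of", "a", "an", "to", "in"}  # assumed toy list

def nonstop_word_accuracy(truth, ocr):
    """Fraction of non-stopword ground-truth words found in the OCR output."""
    truth_words = [w for w in truth.lower().split() if w not in STOPWORDS]
    ocr_words = set(ocr.lower().split())
    if not truth_words:
        return 1.0
    return sum(w in ocr_words for w in truth_words) / len(truth_words)

# "the" -> "tlle" is an OCR error, but it is a stopword, so accuracy stays 1.0.
print(nonstop_word_accuracy("the quick brown fox", "tlle quick brown fox"))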

UC Berkeley CS294-9 Fall 2000 5- 17

Correct Zoning is essential

• Read order in multi-column pages

• How to compare competing programs’ performance on repeated headers

• What to do with figures, logos.

[Figure: two columns of zones numbered 1–6, illustrating alternative reading orders.]

UC Berkeley CS294-9 Fall 2000 5- 18

Document Attribute Format Specification: DAFS

``While many formats exist for composing a document from electronic storage onto paper, no satisfactory standard exists for the reverse process. DAFS is intended to be a standard for document decomposition. It will be used in applications such as OCR and document image understanding.

There are three storage formats: DAFS-Unicode, DAFS-ASCII, and a more compact DAFS-Binary form.

DAFS is a file format specification for documents with a variety of uses. It is developed under the Document Image Understanding (DIMUND) project funded by ARPA.’’ www.raf.com, Illuminator, UW CD-ROMs (English and Japanese)

UC Berkeley CS294-9 Fall 2000 5- 19

DAFS vs SGML

• DAFS = SGML + Unicode + CCITT Fax4
• SGML requires a DTD (document type definition)
• SGML is intended for structure, not appearance (e.g. not bold, italic)
• Images which accidentally contain an ASCII version of <tag> can be problematical
  – Solved by putting images in separate files!

UC Berkeley CS294-9 Fall 2000 5- 20

Perfect results: how to obtain ground truth?

• Painfully enter it by hand, or
• Painfully correct OCR results, or
• Compute some kind of average of OCR programs

UC Berkeley CS294-9 Fall 2000 5- 21

Perfect ground truth: a synthetic approach

• (Kanungo, UMD): start with TeX
  – Produce the ground truth for layout from TeX
  – Extract character positions and glyphs by analyzing DVI files
  – This provides essentially every bit position of each character

UC Berkeley CS294-9 Fall 2000 5- 22

Ground truth

• Next, commit to paper:
  – Print the DVI files
  – Scan a calibration page
  – Compute parameters of the 2D→2D transformation T imposed by the physics of printing and scanning
  – Scan the printout
  – Align the page
  – Run the recognizer
  – Compare reported positions (composed with T^-1) to the correct ones; a sketch follows
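A minimal sketch of the calibration step, assuming T is affine (the slide does not specify the transformation family; the coordinates and the least-squares fit are my own illustration):

import numpy as np

# Calibration marks: where they were printed vs. where they landed in the scan.
# These coordinates are made-up toy values.
printed = np.array([[0, 0], [100, 0], [0, 100], [100, 100]], dtype=float)
scanned = np.array([[3, 2], [104, 1], [2, 103], [103, 102]], dtype=float)

# Fit an affine map scanned ~= A @ p + b by linear least squares.
X = np.hstack([printed, np.ones((len(printed), 1))])   # rows are [x, y, 1]
params, *_ = np.linalg.lstsq(X, scanned, rcond=None)   # 3x2 parameter matrix
A, b = params[:2].T, params[2]

def to_print_frame(p):
    """Apply T^-1: map a recognizer position on the scan back to print space."""
    return np.linalg.solve(A, np.asarray(p, dtype=float) - b)

# Map a recognizer-reported character position back for comparison.
print(to_print_frame([53.0, 52.0]))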

UC Berkeley CS294-9 Fall 2000 5- 23

Change of Pace

• Assignment 1
  – What does it mean to write a program?

• Documentation
• Demo
• Instructions for use
• (perhaps optional)

– Extensions, limitations, discussion

• Discussion questions