UC Berkeley CS294-9 Fall 2000 21- 1
Document Image Analysis
Lecture 21: Introduction to Layout
Richard J. Fateman
Henry S. Baird
University of California – Berkeley
Xerox Palo Alto Research Center
Page layout analysis
• Structural (Physical, Geometric) Layout Analysis [Segmentation]
• Functional (Syntactic, Logical) Layout Analysis [Classification]
• Read-order determination
Structural
• Isolation of columns, paragraphs, lines, words, tables, figures. Maybe letters.
• Without some layout analysis, much of the previous work would be impossible!
• Without layout analysis, what is the sequence of words in a multi-column format?
Functional
• Typically domain dependent
• May require merging or splitting of syntactic components
• Encoding into ODA (Open Document Architecture) or SGML (DTD describes components like section, title..)
Functional Components
• First page of a technical article may have
– Title
– Author
– Abstract, body/column 1, body/column 2, footnotes
– Pagination
– Journal name/volume/date…
• Business letter might have
– Sender
– Date
– Logo
– Recipient
– Body
– Signature
Finding structural blocks
Common Approaches
• Top Down analysis
– Horizontal and vertical profiles
– Recursive: columns, paragraphs/lines/words
– As illustrated earlier
• Bottom Up analysis
– Use adjacency based on
  • Pixels / morphology of dilation (millions)
  • RLE / merge lines (thousands)
  • Connected Components (hundreds)
• Look at the background (shape-directed covers)
• Also, human hints.
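The top-down idea of cutting on horizontal and vertical profiles can be sketched as follows. This is a minimal illustration, assuming a binary page stored as a list of rows with 1 for ink; the function names and the `min_gap` parameter are made up for this sketch and are not from the lecture.

```python
def horizontal_profile(image):
    """Count ink pixels (1s) in each row of a binary image."""
    return [sum(row) for row in image]


def vertical_profile(image):
    """Count ink pixels (1s) in each column of a binary image."""
    return [sum(column) for column in zip(*image)]


def runs_of_ink(profile, min_gap=1):
    """Split a profile into (start, end) runs, end exclusive.

    Two runs are kept apart only if at least `min_gap` empty
    positions lie between them (min_gap is an assumed tuning knob).
    """
    runs = []
    start = None
    gap = 0
    for i, value in enumerate(profile):
        if value > 0:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                runs.append((start, i - gap + 1))
                start = None
                gap = 0
    if start is not None:
        # The page ended inside a run; drop any short trailing gap.
        runs.append((start, len(profile) - gap))
    return runs


# Toy page: two blocks side by side, a blank row, then one full-width line.
page = [
    [1, 1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
]
bands = runs_of_ink(horizontal_profile(page))           # cut into row bands
top, bottom = bands[0]
columns = runs_of_ink(vertical_profile(page[top:bottom]), min_gap=2)
```

Applying the same two steps again inside each band and column gives the recursive columns, then lines, then words decomposition.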
Standard images…the Scanned Input
Smear character boxes
Smear words to get lines
Smear lines to get paragraphs
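The slides do not say how the smearing is implemented. One common way to get this effect is run-length smearing: short background runs between ink pixels are filled in. The sketch below assumes a binary image stored as a list of rows (1 for ink); the function names and the `threshold` parameter are illustrative only.

```python
def smear_row(row, threshold):
    """Fill background runs of at most `threshold` pixels that lie between ink."""
    out = list(row)
    last_ink = None
    for i, value in enumerate(row):
        if value:
            if last_ink is not None and 0 < i - last_ink - 1 <= threshold:
                for j in range(last_ink + 1, i):
                    out[j] = 1
            last_ink = i
    return out


def smear_horizontal(image, threshold):
    """Smear every row: small thresholds join characters, larger ones join words."""
    return [smear_row(row, threshold) for row in image]


def smear_vertical(image, threshold):
    """Smear every column by transposing, smearing rows, and transposing back."""
    transposed = [list(column) for column in zip(*image)]
    smeared = smear_horizontal(transposed, threshold)
    return [list(row) for row in zip(*smeared)]
```

A small horizontal threshold joins characters into words, a larger one joins words into lines, and a vertical threshold joins lines into paragraphs.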
Issues:
• Sensitivity to noise. Solutions:
– Clean up via kfill or similar filtering, ruthlessly
– Divide the page (artificially) and keep the noise from affecting the document globally
• Slanted lines. Solution(s):
– Deskew (since it is not too hard(?))
– Use nearest neighbors “docstrum”
• Concave regions (text flow around a box). Solution(?) look at background
• Variation in font, spacing can throw off analysis
– Allow for local analysis
Interactive semi-automatic zoning (RJF)
Zoom in
Scroll around
View individual pixels
Semi…
Turn up the noise filter until we start to kill some of the punctuation. How?
As we turn up the threshold, the number of connected components drops, then reaches a stable plateau after the noise is gone, and then drops again as we remove punctuation, the dots above the “i” etc.
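The plateau can be found mechanically. The sketch below assumes the noise filter simply discards connected components of at most t pixels (the slides do not specify the filter, so this is a stand-in), and that the binary image is a list of rows with 1 for ink. All function names are hypothetical.

```python
from collections import deque


def component_sizes(image):
    """Return the pixel count of every 4-connected ink component."""
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    sizes = []
    for r in range(height):
        for c in range(width):
            if image[r][c] and not seen[r][c]:
                seen[r][c] = True
                queue = deque([(r, c)])
                size = 0
                while queue:
                    y, x = queue.popleft()
                    size += 1
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                sizes.append(size)
    return sizes


def counts_by_threshold(sizes, max_threshold):
    """counts[t] = components left after dropping those with at most t pixels."""
    return [sum(1 for s in sizes if s > t) for t in range(max_threshold + 1)]


def longest_plateau(counts):
    """Return (first_t, last_t) of the longest run of equal counts."""
    best = (0, 0)
    start = 0
    for t in range(1, len(counts) + 1):
        if t == len(counts) or counts[t] != counts[start]:
            if (t - 1) - start > best[1] - best[0]:
                best = (start, t - 1)
            start = t
    return best
```

Any threshold inside the plateau removes the noise while keeping punctuation. `max_threshold` should stay below the size of a typical character, otherwise the flat tail after the punctuation is gone may be picked instead.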
auto…
Turn the horizontal smear knob until the number of components drops suddenly from about 3000 to about 600.
Character boxes have been merged into wordboxes.
Turn the horizontal smear knob further until the number of components drops from about 600 to about 100.
Wordboxes have become lineboxes.
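The sudden drops in component count can be reproduced on a toy example. This sketch simplifies to one text line and represents each character box by its (left, right) extent; the smear knob is modeled as the largest horizontal gap that still gets bridged. The function name and the sample coordinates are invented for illustration.

```python
def merge_boxes_on_line(boxes, max_gap):
    """Merge (left, right) extents whose horizontal gap is at most max_gap."""
    merged = []
    for left, right in sorted(boxes):
        if merged and left - merged[-1][1] <= max_gap:
            # Close enough to the previous box: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], right))
        else:
            merged.append((left, right))
    return merged


# Five characters forming two words: gaps of 2 inside words, 14 between them.
chars = [(0, 4), (6, 10), (12, 16), (30, 34), (36, 40)]

# Component count as the knob is turned from 0 to 15.
counts = [len(merge_boxes_on_line(chars, gap)) for gap in range(16)]
```

The count stays at the number of characters, falls to the number of words once the knob passes the inter-character gap, then falls again to one line once it passes the inter-word gap.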
matic..
Tweak the vertical smear knob. Lines become paragraphs.
(Turn further, and paragraphs become columns).
Specify read order
Interactive functional tagging: mark subject/author/etc.? Here we attempt automatic identification of math…
Automatic math zone. This is a challenge because the zone is in two parts, containing the math … f(p)=F(p)
Docstrum / L. O’Gorman
5 nearest neighbors (ogorman93)
Example of “spectrum”
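Each point represents the distance and angle from one connected component (cc) to one of its nearest neighbors.
Brute-force neighbor search is O(N^2) in the number of components, but not so bad.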
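A brute-force version of the nearest-neighbor step might look like the sketch below. It assumes each connected component is reduced to its centroid (x, y); the function name is hypothetical, and angles are reported in degrees from `atan2`, which is a choice made for this example.

```python
import math


def nearest_neighbor_pairs(centroids, k=5):
    """Return (distance, angle_degrees) to the k nearest neighbors of each centroid.

    Every component is compared with every other one, hence O(N^2).
    """
    spectrum = []
    for i, (x1, y1) in enumerate(centroids):
        candidates = []
        for j, (x2, y2) in enumerate(centroids):
            if i == j:
                continue
            dx, dy = x2 - x1, y2 - y1
            angle = math.degrees(math.atan2(dy, dx))
            candidates.append((math.hypot(dx, dy), angle))
        candidates.sort()              # closest neighbors first
        spectrum.extend(candidates[:k])
    return spectrum
```

Plotting the returned pairs in polar form gives a picture of the kind shown on the slide: clusters along the text-line direction at the character spacing, and clusters roughly perpendicular to it at the line spacing.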
Statistics for skew and spacing
Set the knobs?
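One way to turn nearest-neighbor (distance, angle) pairs into knob settings is sketched below. It uses medians rather than histogram peaks, which is a simplification chosen for this example and not necessarily what the docstrum paper does. It also assumes the skew is small (within `tolerance` degrees of horizontal); the names and the default tolerance are hypothetical.

```python
from statistics import median


def fold_angle(angle_degrees):
    """Map an angle into (-90, 90] so that opposite directions coincide."""
    folded = angle_degrees % 180.0
    return folded - 180.0 if folded > 90.0 else folded


def estimate_skew_and_spacing(pairs, tolerance=30.0):
    """Estimate (skew, character spacing, line spacing) from (distance, angle) pairs."""
    folded = [(d, fold_angle(a)) for d, a in pairs]
    # Neighbors roughly along the text line give skew and character spacing.
    along = [(d, a) for d, a in folded if abs(a) <= tolerance]
    skew = median(a for _, a in along)
    char_spacing = median(d for d, _ in along)
    # Neighbors roughly perpendicular to the text line give line spacing.
    across = [d for d, a in folded if abs(a) >= 90.0 - tolerance]
    line_spacing = median(across) if across else None
    return skew, char_spacing, line_spacing
```

The character spacing suggests a horizontal smear threshold and the line spacing a vertical one, so the knobs need not be set by hand.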
Extract Lines, group to paragraphs
• Statistically close enough horizontally to be words, then lines
• Statistically close enough and parallel enough and the same length as… group two lines into the same text block.
• (arguably saving time by not deskewing; dealing with non-constant skew) Example follows..
Sections with different skew
6 business cards, nearest neighbors vectors
Extracted text lines, blocks
Useful? General?
Does Docstrum work?
• Great on this page of business cards
• An attempt to remove the assumption of most previous work that layout was “Manhattan”
• Largely skew-independent.
but
• Useless if characters are not separated
Area Voronoi Diagram (Kise)
Start with connected components
Compute area ratios from pairs of neighboring connected components
Adaptively compute thresholds of intercharacter and interline gaps
Point Voronoi diagram
Area Voronoi Diagram
• Define the distance d between a point p and a figure g to be the minimum distance of p from any point in g
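Written out, the definition above and the region it induces look as follows. The Euclidean norm and the symbols $q$, $g_i$, and $V(g_i)$ are assumptions added here for illustration; the slide only states the point-to-figure distance in words.

```latex
% Distance from a point p to a figure g: the closest point of g decides.
d(p, g) = \min_{q \in g} \lVert p - q \rVert

% Area Voronoi region of figure g_i among figures g_1, \dots, g_n:
V(g_i) = \{\, p \;:\; d(p, g_i) \le d(p, g_j) \ \text{for all } j \ne i \,\}
```

When every figure is a single point, this reduces to the ordinary point Voronoi diagram.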
Computing an approximate area Voronoi diagram
• Compute the point Voronoi diagram from a sampled set of points on the boundary of each figure.
• Delete Voronoi edges generated from point-to-point on the same figure
Advantage: we are not abstracting shapes into points (centroids) or into rectangles.
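The effect of the two steps can be imitated on a pixel grid without building a true point Voronoi diagram. The sketch below swaps in a brute-force raster method: every grid cell is labeled with the figure that owns the nearest sampled boundary point, so boundaries between points of the same figure never appear. It assumes each figure is given as a list of sampled (x, y) boundary points; all names are hypothetical.

```python
def nearest_figure(point, figures):
    """Index of the figure owning the sampled boundary point closest to `point`."""
    px, py = point
    best_index, best_dist = None, float("inf")
    for index, samples in enumerate(figures):
        for qx, qy in samples:
            dist = (px - qx) ** 2 + (py - qy) ** 2   # squared distance suffices
            if dist < best_dist:
                best_index, best_dist = index, dist
    return best_index


def label_grid(width, height, figures):
    """Label every grid cell (x, y) with its nearest figure."""
    return [[nearest_figure((x, y), figures) for x in range(width)]
            for y in range(height)]


def boundary_cells(labels):
    """Cells whose right or lower neighbor belongs to a different figure."""
    cells = set()
    for y, row in enumerate(labels):
        for x, label in enumerate(row):
            if x + 1 < len(row) and row[x + 1] != label:
                cells.add((x, y))
            if y + 1 < len(labels) and labels[y + 1][x] != label:
                cells.add((x, y))
    return cells
```

This costs one distance test per cell per sample point, so it is only practical for small examples; a real implementation would compute the point Voronoi diagram and prune same-figure edges as the slide describes.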
Example
The points don’t show here…
All we have to do now is decide which of the (many) Voronoi edges are appropriate for segmentation.
Features for selecting edges
• Delete edges in narrow spaces, because they are merely separating characters or words.
• Delete edges which divide two components of about equal area
• Delete edges that don’t form loops.
Characters in the same font but in different columns will be in different segments.
Characters, even if they are close to a (large) halftone figure, will be separated from the figure.
Find the threshold from the frequency distribution of distances.
Example
Area Voronoi diagram
After deleting edges
Imposing loop conditions, pasting back the text (etc).
Errors
Fragmentation
Over-merging
Impressive
Reminder: Without layout analysis
• Reading across columns
• Misplacing captions
• Misplacing footnotes
• Misunderstanding page numbers (which should be REMOVED in the reformatting process)
• Need extraction of biblio data: title, author, abstract, keywords
• Nearly every subsequent step is compromised by lack of context.
A Diversion: Separating Math from Text
• Why separate math from text?
• Types of mathematics encountered
• Previous Work
• Two approaches
– post-processing commercial OCR
– character-based (details!)
• Errors and their correction
• Ambiguities
Why separate math/text/images/..
• OCR programs do not work for math
A displayed formula becomes, in TextBridge,
l~F(P)=(~ ~(P)j(P) -~7~(p) fli
Designation as a “picture” is only a partial solution
Mathematics on a Page
Inline math is harder to pick out because it may look like italic text
Previous Work
• Isolation by hand (most math parser papers)
• Texture / statistics based heuristics
– useful for display math “paragraphs”
– not useful for in-line math
• Character based pseudo-parsing (but without font information or true parsing feedback)
• Incomplete
Proposal: Post-Processing of OCR
• Start with commercial best-effort recognition
• Reprocess the intermediate data structure (e.g. for TextBridge, the XDOC file)
• Accept recognition of text zones with high recognition certainty. (Lines with no errors surrounded by lines with no errors are considered solved)
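The "solved line" rule can be stated in a few lines of code. This sketch assumes the OCR output has already been reduced to one boolean per text line (True if the line contains any suspected error); it does not parse XDOC. It also assumes "surrounded" means the immediate neighbors and that the top and bottom of the page count as error-free. The function name is hypothetical.

```python
def solved_lines(has_error):
    """Mark lines that are error-free and whose immediate neighbors are too.

    has_error: list of booleans, one per text line in reading order.
    """
    last = len(has_error) - 1
    solved = []
    for i, error in enumerate(has_error):
        previous_ok = i == 0 or not has_error[i - 1]
        next_ok = i == last or not has_error[i + 1]
        solved.append(not error and previous_ok and next_ok)
    return solved
```

Lines left unmarked, including clean lines next to a doubtful one, stay in the pool of regions to re-examine for mathematics.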
Separate uncertain areas
• Re-consider “the rest of the image” as potential mathematics zones: uncertain regions (including nearby “certain” characters/lines)
• Isolate characters, identify fonts, etc.
• Play out heuristic rules for separating text and math zones.
• Consider eradicating math and re-submitting text; separately recognizing math and reinserting in XDOC
Alternatively, Starting from our own naïve OCR
• Connected component recognition
• Separate characters by initial classification
• Repeatedly re-examine via rules
• Determine text zones, remove math / feed remainder to commercial OCR
– How best to blank-out math? XXX
• Most likely human interaction remains
Two bags: Math vs Text
• Initially Math
– + - = / Greek, scientific symbols, 0-9, italics, bold, (), [], sin, cos, tan, dots, commas, decimal points
• Initially Text
– Roman Letters, junk
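The first-pass split into two bags can be sketched as a simple rule table. This assumes each recognized token arrives with italic and bold flags from an earlier font-identification step; the character sets below are a small illustrative subset, and the function and constant names are hypothetical.

```python
MATH_SYMBOLS = set("+-=/()[].,")          # operators, brackets, dots, commas
MATH_WORDS = {"sin", "cos", "tan"}        # Roman words that start out as math
GREEK = set("αβγδεζηθικλμνξοπρστυφχψωΓΔΘΛΞΠΣΦΨΩ")


def initial_bag(token, italic=False, bold=False):
    """Return "math" or "text" for a token, using only first-pass rules."""
    if italic or bold:
        return "math"
    if token in MATH_WORDS:
        return "math"
    if token and all(ch.isdigit() or ch in MATH_SYMBOLS or ch in GREEK
                     for ch in token):
        return "math"
    return "text"      # Roman letters and unrecognized junk
```

This pass deliberately over-collects math: sentence-ending periods, commas, and parenthesized comments all land in the math bag and have to be moved back by later passes.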
Sample Text Bag
Sample Math Bag
Second Pass
• Correct for too much Math
• Grow “clumps” (expand bounding boxes, BBs) to categorize
– 3.14159 vs “end of sentence.”
– (comment) vs f(x)
– hyphen-words vs x² - y²
– horizontal lines generally
– isolated 1 or is it l “ell” or I “eye”
• “bags” or “zones” of geometric-relation boxes containing either words or potential math
Importance of Context
Here are 12 L’s and a 1
Third Pass
• Too much is in the text bag now
– blur the math to allow for embedded Roman text like “sin” or “l”
• Re-clump the mathematics to see if new bridges have been formed
• Some italics in the math bag may really be
– English words in theorems
– emphasized text
On Ambiguity and Correctness
• Can we find the math in ad - bc by ad hoc methods?
• If we are unable to disambiguate English words, why should we be able to disambiguate mathematics?
• Abuse of mathematical notation is widespread: can we insist that new papers either have a non-ambiguous notation or an underlying electronic non-ambiguous notation?
Conclusions
• We can make a first cut on separating math from text
• If we wish to “enliven” math publication with semantic underpinnings, this may help in their production
• Incorporation of AI rule-based transformations, as well as hand correction, is likely to be important