![Page 1: Datech2014-Session1-Document Representation Refinement for Precise Region Description](https://reader034.vdocument.in/reader034/viewer/2022042613/546fd480af7959b80a8b4694/html5/thumbnails/1.jpg)
Document Representation Refinement for Precise Region Description
Christian Clausner, Stefan Pletschacher and Apostolos Antonacopoulos
PRImA Lab, School of Computing, Science and Engineering, University of Salford,
United Kingdom
![Page 2: Datech2014-Session1-Document Representation Refinement for Precise Region Description](https://reader034.vdocument.in/reader034/viewer/2022042613/546fd480af7959b80a8b4694/html5/thumbnails/2.jpg)
Document Page Regions
DATeCH 2014 2
Segmentation, Classification
• Region (block, zone): Connected area of a document image with content of a single specific type
• Examples: Text, graphic, table
![Page 3: Datech2014-Session1-Document Representation Refinement for Precise Region Description](https://reader034.vdocument.in/reader034/viewer/2022042613/546fd480af7959b80a8b4694/html5/thumbnails/3.jpg)
Region Representation
• By geometric objects
– Bounding box
– Stack of rectangles
– Polygon
• By pixels
– Bitmap
– Run-length encoding
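As a rough illustration of how these representations differ, the sketch below derives a bounding box and a run-length encoding from the same small bitmap. The function names and the `(row, start_x, length)` run format are assumptions made for this example, not something the slides specify.

```python
from typing import List, Tuple

Bitmap = List[List[int]]  # rows of 0/1 pixels, 1 = foreground

def bounding_box(bitmap: Bitmap) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom), inclusive, of all foreground pixels."""
    xs = [x for row in bitmap for x, v in enumerate(row) if v]
    ys = [y for y, row in enumerate(bitmap) if any(row)]
    return min(xs), min(ys), max(xs), max(ys)

def run_length_encode(bitmap: Bitmap) -> List[Tuple[int, int, int]]:
    """Encode the foreground as (row, start_x, length) runs (hypothetical format)."""
    runs = []
    for y, row in enumerate(bitmap):
        x = 0
        while x < len(row):
            if row[x]:
                start = x
                while x < len(row) and row[x]:
                    x += 1
                runs.append((y, start, x - start))
            else:
                x += 1
    return runs

# A small text-like region whose last line is shorter than the others.
region = [
    [0, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 1, 0, 0, 0],
]

box = bounding_box(region)
runs = run_length_encode(region)
```

In this example the bounding box spans 12 pixels while the region itself has only 10 foreground pixels, so the box already includes background that the pixel-based description excludes.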
![Page 4: Datech2014-Session1-Document Representation Refinement for Precise Region Description](https://reader034.vdocument.in/reader034/viewer/2022042613/546fd480af7959b80a8b4694/html5/thumbnails/4.jpg)
Need for Precise Region Descriptions
• Precise description is crucial for all but the most trivial document analysis and recognition applications
• For performance evaluation: The loss of quality introduced by imprecise regions can be bigger than the variation of accuracy of the actual recognition method
![Page 5: Datech2014-Session1-Document Representation Refinement for Precise Region Description](https://reader034.vdocument.in/reader034/viewer/2022042613/546fd480af7959b80a8b4694/html5/thumbnails/5.jpg)
The Situation
• Trend towards more precise descriptions, but…
• Output of state-of-the-art OCR systems:
– Stacks of rectangles (ABBYY FineReader Engine 11)
– Bounding boxes (Tesseract OCR 3.02)
• Popular formats for layout analysis and OCR results:
– ALTO XML (boxes, ellipses, polygons (region level only))
– FineReader XML (stacks of rectangles (region level only))
– PAGE XML (polygons for all levels)
– hOCR (boxes)
![Page 6: Datech2014-Session1-Document Representation Refinement for Precise Region Description](https://reader034.vdocument.in/reader034/viewer/2022042613/546fd480af7959b80a8b4694/html5/thumbnails/6.jpg)
Refinement through Polygonal Fitting
• Applicable to regions that have child objects in the document model
• A typical object hierarchy contains regions, text lines, words and glyphs (characters)
• Idea: Tightly wrap a polygon around the child objects
![Page 7: Datech2014-Session1-Document Representation Refinement for Precise Region Description](https://reader034.vdocument.in/reader034/viewer/2022042613/546fd480af7959b80a8b4694/html5/thumbnails/7.jpg)
Polygonal Fitting Approach
1. Create bitmasks for the child objects and transfer them to an empty bitmap
2. Fill the gaps between the child objects by a smearing approach
3. Optional: Exclude neighbour regions
4. Trace the contour of the foreground and create a polygon
![Page 8: Datech2014-Session1-Document Representation Refinement for Precise Region Description](https://reader034.vdocument.in/reader034/viewer/2022042613/546fd480af7959b80a8b4694/html5/thumbnails/8.jpg)
1 – Transferring Child Objects to Bitmap
• Starting point: Polygonal object (e.g. text line, word, or glyph)
• Lossless conversion to a rectangle-based interval representation
• Transferring the rectangles to the target bitmap
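One way to picture this step is a scanline conversion: each pixel row of the polygon becomes a list of horizontal intervals (one-pixel-high rectangles), which are then painted into the target bitmap. The sketch below is an assumption about how such a conversion could look, not the authors' implementation. It samples rows at pixel centres with the even-odd rule, which is exact for axis-aligned polygons on integer coordinates; `polygon_to_intervals` and `paint_intervals` are hypothetical names.

```python
import math

def polygon_to_intervals(polygon):
    """Return {row: [(x_start, x_end_exclusive), ...]} using an even-odd scanline."""
    ys = [p[1] for p in polygon]
    n = len(polygon)
    intervals = {}
    for y in range(math.floor(min(ys)), math.ceil(max(ys))):
        yc = y + 0.5  # sample each row at its pixel centre
        crossings = []
        for i in range(n):
            (x0, y0), (x1, y1) = polygon[i], polygon[(i + 1) % n]
            if (y0 <= yc) != (y1 <= yc):  # this edge crosses the scanline
                crossings.append(x0 + (yc - y0) * (x1 - x0) / (y1 - y0))
        crossings.sort()
        row = []
        for a, b in zip(crossings[0::2], crossings[1::2]):
            # keep pixels whose centre x + 0.5 lies in [a, b)
            start, end = math.ceil(a - 0.5), math.ceil(b - 0.5)
            if end > start:
                row.append((start, end))
        if row:
            intervals[y] = row
    return intervals

def paint_intervals(bitmap, intervals):
    """Set every pixel covered by an interval to foreground (1)."""
    for y, row in intervals.items():
        for start, end in row:
            for x in range(start, end):
                bitmap[y][x] = 1

# An L-shaped child object given by its corner points (x, y).
child = [(0, 0), (4, 0), (4, 1), (2, 1), (2, 3), (0, 3)]
intervals = polygon_to_intervals(child)

bitmap = [[0] * 4 for _ in range(3)]  # empty target bitmap, 4 wide and 3 high
paint_intervals(bitmap, intervals)
```

The sketch assumes the polygon lies entirely inside the target bitmap; a real implementation would clip the intervals to the bitmap bounds.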
![Page 9: Datech2014-Session1-Document Representation Refinement for Precise Region Description](https://reader034.vdocument.in/reader034/viewer/2022042613/546fd480af7959b80a8b4694/html5/thumbnails/9.jpg)
2 – Smearing Approach
• Goal: Connect all foreground components in the bitmap by filling the gaps in-between
1. Alternately fill horizontal and vertical gaps that are smaller than a dynamic threshold (the threshold is increased after each iteration)
2. If necessary, use diagonal smearing to connect remaining components
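The alternating gap filling could be sketched as follows. This is a minimal illustration under several assumptions: the start threshold and step size are made-up parameters, connectivity is checked with a 4-connected flood fill, a gap is filled when its length is at most the current threshold, and the diagonal smearing fallback is not shown. All function names are hypothetical.

```python
def fill_row_gaps(bitmap, max_gap):
    """Fill background runs of length <= max_gap that sit between foreground pixels."""
    for row in bitmap:
        last = None  # x position of the most recent foreground pixel
        for x in range(len(row)):
            if row[x]:
                if last is not None and 0 < x - last - 1 <= max_gap:
                    for gx in range(last + 1, x):
                        row[gx] = 1
                last = x

def transpose(bitmap):
    return [list(col) for col in zip(*bitmap)]

def count_components(bitmap):
    """Count 4-connected foreground components with an iterative flood fill."""
    h, w = len(bitmap), len(bitmap[0])
    seen = [[False] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if bitmap[y][x] and not seen[y][x]:
                count += 1
                seen[y][x] = True
                stack = [(x, y)]
                while stack:
                    cx, cy = stack.pop()
                    for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                        if 0 <= nx < w and 0 <= ny < h and bitmap[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((nx, ny))
    return count

def smear(bitmap, start_gap=1, step=1):
    """Alternate horizontal and vertical filling with a growing threshold.

    Returns True if a single component remains, False if another strategy
    (such as the diagonal smearing mentioned above) would still be needed.
    """
    gap = start_gap
    limit = max(len(bitmap), len(bitmap[0]))
    while count_components(bitmap) > 1 and gap <= limit:
        fill_row_gaps(bitmap, gap)         # horizontal pass
        columns = transpose(bitmap)
        fill_row_gaps(columns, gap)        # vertical pass on the transposed bitmap
        bitmap[:] = transpose(columns)
        gap += step
    return count_components(bitmap) <= 1

# Two words on the first line and a shorter second line, separated by gaps.
page = [
    [1, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
]
connected = smear(page)
```

Starting with a small threshold and growing it means narrow gaps (between glyphs or lines) are closed first, so the filled area stays close to the child objects.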
![Page 10: Datech2014-Session1-Document Representation Refinement for Precise Region Description](https://reader034.vdocument.in/reader034/viewer/2022042613/546fd480af7959b80a8b4694/html5/thumbnails/10.jpg)
3 – Subtraction of Neighbours
• Optional step to avoid overlap with adjacent regions
• Simply erase the corresponding pixels from the created bitmap
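A minimal sketch of the erase operation, assuming the neighbouring regions are available as bitmaps of the same size as the smeared bitmap (the name `subtract_neighbours` and the mask format are assumptions for illustration):

```python
def subtract_neighbours(bitmap, neighbour_masks):
    """Clear every pixel of `bitmap` that is foreground in any neighbour mask."""
    for mask in neighbour_masks:
        for y, row in enumerate(mask):
            for x, value in enumerate(row):
                if value:
                    bitmap[y][x] = 0
    return bitmap

# Smeared region that spilled into an adjacent region on the right.
smeared = [
    [1, 1, 1, 1],
    [1, 1, 1, 1],
]
neighbour = [
    [0, 0, 0, 1],
    [0, 0, 1, 1],
]
result = subtract_neighbours(smeared, [neighbour])
```

Note that erasing pixels can in principle split the foreground again; the sketch does not check for that.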
![Page 11: Datech2014-Session1-Document Representation Refinement for Precise Region Description](https://reader034.vdocument.in/reader034/viewer/2022042613/546fd480af7959b80a8b4694/html5/thumbnails/11.jpg)
4 – Outline Tracing
• Trace the contour of the foreground component in the created bitmap
• Create polygon on-the-fly by adding points for each change of direction (corner)
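The tracing idea can be sketched by walking the exposed pixel edges of the foreground and recording a point only where the walking direction changes. This is one possible formulation, not necessarily the one used by the authors. It assumes a single 4-connected component without holes, and `trace_outline` is a hypothetical name.

```python
def trace_outline(bitmap):
    """Return the outline polygon of the foreground as a list of corner points (x, y)."""
    h, w = len(bitmap), len(bitmap[0])

    def fg(x, y):
        return 0 <= x < w and 0 <= y < h and bitmap[y][x]

    # Directed boundary edges on the pixel-corner grid, clockwise on screen (y grows downwards).
    nxt = {}
    for y in range(h):
        for x in range(w):
            if not bitmap[y][x]:
                continue
            if not fg(x, y - 1):
                nxt[(x, y)] = (x + 1, y)          # top edge, left to right
            if not fg(x + 1, y):
                nxt[(x + 1, y)] = (x + 1, y + 1)  # right edge, downwards
            if not fg(x, y + 1):
                nxt[(x + 1, y + 1)] = (x, y + 1)  # bottom edge, right to left
            if not fg(x - 1, y):
                nxt[(x, y + 1)] = (x, y)          # left edge, upwards

    # Start at the top-left corner of the first foreground pixel in raster order;
    # that vertex is always a corner of the outline.
    start = next((x, y) for y, row in enumerate(bitmap) for x, v in enumerate(row) if v)

    polygon = []
    prev_dir = None
    p = start
    while True:
        q = nxt[p]
        direction = (q[0] - p[0], q[1] - p[1])
        if direction != prev_dir:  # change of direction: p is a corner
            polygon.append(p)
        prev_dir = direction
        p = q
        if p == start:
            break
    return polygon

shape = [
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]
outline = trace_outline(shape)
```

Because straight runs of boundary pixels contribute no points, the resulting polygon has only as many vertices as the outline has corners, regardless of the region's size in pixels.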
![Page 12: Datech2014-Session1-Document Representation Refinement for Precise Region Description](https://reader034.vdocument.in/reader034/viewer/2022042613/546fd480af7959b80a8b4694/html5/thumbnails/12.jpg)
Experiments
• Carried out on a dataset of contemporary documents consisting of scanned magazine and technical article pages
• Processed with Tesseract OCR 3.02 (open source)
• Exported to PAGE XML with and without refinement
![Page 13: Datech2014-Session1-Document Representation Refinement for Precise Region Description](https://reader034.vdocument.in/reader034/viewer/2022042613/546fd480af7959b80a8b4694/html5/thumbnails/13.jpg)
Figure: Original (unrefined) outlines compared with refined outlines
![Page 14: Datech2014-Session1-Document Representation Refinement for Precise Region Description](https://reader034.vdocument.in/reader034/viewer/2022042613/546fd480af7959b80a8b4694/html5/thumbnails/14.jpg)
Results
• Measurement of region overlaps (number and area)
| | Overlapping Regions | Overlap Area (Megapixel) |
|---|---|---|
| Original Outlines | 621 (45.8%) | 19.9 |
| Refined Outlines | 286 (21.1%) | 2.5 |
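The slides do not spell out how the two overlap figures were computed. As a rough sketch under stated assumptions, the code below counts a region as overlapping when it shares at least one pixel with another region, and measures overlap area as the number of pixels covered by more than one region. The names `rect` and `overlap_stats` and the pixel-set representation are hypothetical.

```python
from itertools import combinations

def rect(x0, y0, x1, y1):
    """Pixel set of an axis-aligned rectangle, x1 and y1 exclusive."""
    return {(x, y) for y in range(y0, y1) for x in range(x0, x1)}

def overlap_stats(regions):
    """Return (regions overlapping at least one other region, pixels shared by two or more regions)."""
    overlapping = set()
    shared = set()
    for (i, a), (j, b) in combinations(enumerate(regions), 2):
        common = a & b
        if common:
            overlapping.update((i, j))
            shared |= common
    return len(overlapping), len(shared)

regions = [
    rect(0, 0, 4, 4),      # overlaps the next region in one pixel column
    rect(3, 0, 6, 4),
    rect(10, 10, 12, 12),  # isolated region
]
count, area = overlap_stats(regions)
```

Comparing every pair of regions is quadratic in the number of regions, which is acceptable for a sketch; a page-scale implementation would likely prefilter pairs by bounding box.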
![Page 15: Datech2014-Session1-Document Representation Refinement for Precise Region Description](https://reader034.vdocument.in/reader034/viewer/2022042613/546fd480af7959b80a8b4694/html5/thumbnails/15.jpg)
Impact on Performance Evaluation
• Real-world scenario
• Measure the performance of Tesseract OCR engine
• Evaluation metrics of previous ICDAR page segmentation competitions
| Measure | Value |
|---|---|
| Average success rate using original outlines | 81.1% |
| Average success rate using refined outlines | 84.5% |
| Average improvement for all documents | 3.4% |
| Maximum improvement | 22.9% |
![Page 16: Datech2014-Session1-Document Representation Refinement for Precise Region Description](https://reader034.vdocument.in/reader034/viewer/2022042613/546fd480af7959b80a8b4694/html5/thumbnails/16.jpg)
Conclusion
• Existing geometric region data can be significantly refined by fitting precise polygons around child objects
• Validity and impact on real-world scenarios have been shown
• Refinement in performance evaluation helps to eliminate problems that arise from insufficient geometric descriptions → Concentrate on real issues of OCR methods
• Positive effect on accuracy of presentation/repurposing systems (highlighting, cropping, article tracking, etc.)
• Approach used in Aletheia ground truth editor and result viewer (primaresearch.org/tools)
![Page 17: Datech2014-Session1-Document Representation Refinement for Precise Region Description](https://reader034.vdocument.in/reader034/viewer/2022042613/546fd480af7959b80a8b4694/html5/thumbnails/17.jpg)