a simple algorithm for page segmentation

AN ALGORITHM FOR PAGE SEGMENTATION

Alexey O. Shigarov 1,2, Roman K. Fedorov 1

10th International Conference on PATTERN RECOGNITION and IMAGE ANALYSIS: NEW INFORMATION TECHNOLOGIES, St. Petersburg, Russia, December 2010

1 Institute for System Dynamics and Control Theory, SB of RAS

2 e-mail: [email protected]




Introduction

• Page and table segmentation (layout analysis) is a task in Document Analysis and Recognition (DAR)

• Page segmentation (document layout analysis) divides a document into parts (e.g. columns, figures, tables)

Existing approaches to page segmentation

• The first approach analyzes the text layout (structure)

  • e.g. using the Voronoi diagram for page segmentation

• The second approach analyzes the page whitespace

  • e.g. using the largest empty rectangle problem

Figure from [Kise K., Sato A., Iwata M. Segmentation of page images using the area Voronoi diagram // Computer Vision and Image Understanding. Elsevier Science Inc. 1998. Vol. 70, No. 3. P. 370–382.]

Figure from [Orlowski M. A new algorithm for the largest empty rectangle problem // Algorithmica. Springer New York. 1990. Vol. 5, No. 1-4. P. 65–73.]


Problem Formulation

• Page segmentation includes dividing multi-column text or a table into columns

• Whitespace analysis can be used to detect columns in multi-column text or in a table

• Our algorithm detects the whitespace gaps located between text blocks on a document page


Algorithm. Input

• Input

  • A bounding box (rectangle)

    • It bounds a page or table

  • A set of obstacles (rectangles)

    • Each obstacle bounds a text block (e.g. a word, several words, or a line)

    • Each obstacle lies inside the bounding box

    • The obstacles do not overlap each other

• The task is to separate the obstacles inside the bounding box by whitespace gaps

• The algorithm consists of two steps
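The input conditions listed above can be checked before the algorithm runs. The following is a minimal Python sketch, not the authors' Object Pascal code. The names `Rect`, `contains`, `overlaps` and `validate_input` are hypothetical, and two further assumptions are made: the Y axis grows downward, and rectangles that only touch along an edge do not count as overlapping.

```python
from itertools import combinations
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned rectangle; y grows downward, so top < bottom (assumption)."""
    left: float
    top: float
    right: float
    bottom: float


def contains(outer: Rect, inner: Rect) -> bool:
    """True if `inner` lies entirely inside `outer` (shared edges allowed)."""
    return (outer.left <= inner.left and inner.right <= outer.right
            and outer.top <= inner.top and inner.bottom <= outer.bottom)


def overlaps(a: Rect, b: Rect) -> bool:
    """True if the interiors intersect; touching edges do not count (assumption)."""
    return (a.left < b.right and b.left < a.right
            and a.top < b.bottom and b.top < a.bottom)


def validate_input(bbox: Rect, obstacles: list) -> None:
    """Raise ValueError if an input condition from the slide is violated."""
    for o in obstacles:
        if not contains(bbox, o):
            raise ValueError(f"obstacle {o} is outside the bounding box")
    # Pairwise check of all obstacles, quadratic in their number.
    for a, b in combinations(obstacles, 2):
        if overlaps(a, b):
            raise ValueError(f"obstacles {a} and {b} overlap")
```

Calling `validate_input(Rect(0, 0, 100, 100), [Rect(10, 10, 40, 30), Rect(60, 10, 90, 30)])` returns without error, while adding `Rect(30, 20, 50, 40)` to the list raises `ValueError` because it overlaps the first obstacle.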


Algorithm. Step 1

• For each obstacle

  • The first line (or rule) is extended up and down from the left bound of the obstacle until it is stopped by either another obstacle or the bounding box. Each resulting line is added to the set L1

  • The second line (or rule) is extended from the right bound of the obstacle in the same way. Each resulting line is added to the set L2
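Step 1 can be sketched in Python as follows. This is an illustration under assumptions, not the original implementation: `Rect`, `VLine`, `extend_vertical` and `step1` are hypothetical names, the Y axis grows downward, each extension is stored as one vertical segment, and a line is stopped only by an obstacle whose horizontal extent strictly contains the line's x (so a line may run along the edges of aligned obstacles).

```python
from typing import NamedTuple


class Rect(NamedTuple):
    left: float
    top: float
    right: float
    bottom: float  # y grows downward, so top < bottom (assumption)


class VLine(NamedTuple):
    """Vertical segment at `x` spanning from `top` to `bottom`."""
    x: float
    top: float
    bottom: float


def extend_vertical(x, y_top, y_bottom, bbox, obstacles):
    """Extend the segment x, [y_top, y_bottom] up and down until blocked."""
    top, bottom = bbox.top, bbox.bottom
    for o in obstacles:
        if not (o.left < x < o.right):
            continue  # this obstacle does not cross the line's x
        if o.bottom <= y_top:        # blocker above: keep the nearest one
            top = max(top, o.bottom)
        elif o.top >= y_bottom:      # blocker below: keep the nearest one
            bottom = min(bottom, o.top)
    return VLine(x, top, bottom)


def step1(bbox, obstacles):
    """Return (L1, L2): lines grown from left bounds and from right bounds."""
    L1 = {extend_vertical(o.left, o.top, o.bottom, bbox, obstacles)
          for o in obstacles}
    L2 = {extend_vertical(o.right, o.top, o.bottom, bbox, obstacles)
          for o in obstacles}
    return L1, L2
```

For a 100 x 100 page with two words `Rect(10, 10, 40, 30)` and `Rect(60, 10, 90, 30)` above a full-width line `Rect(10, 50, 90, 70)`, the line grown from the left bound of the second word runs from the top of the page down to y = 50, where the full-width line stops it. Each call scans all obstacles, so this sketch of Step 1 is quadratic in the number of obstacles.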


Algorithm. Step 2

• Pairs of lines (l1, l2) are formed such that

  • Either the set L1 includes l1, or l1 is the right bound of the bounding box

  • Either the set L2 includes l2, or l2 is the left bound of the bounding box

  • There are no obstacles between l1 and l2

  • The top Y-coordinates of l1 and l2 are the same

  • The bottom Y-coordinates of l1 and l2 are the same

• Each pair of lines (l1, l2) is a whitespace gap

• The output is the set of whitespace gaps
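The pairing conditions of Step 2 can be sketched in Python as below. This is a naive illustration, not the authors' code: `Rect`, `VLine`, `overlaps` and `find_gaps` are hypothetical names, `L1` and `L2` are assumed to be sets of vertical segments produced by Step 1, and l2 is assumed to be the left side of a gap and l1 its right side. The sketch tests every pair against every obstacle, so it favors clarity and does not try to match the complexity stated in the conclusion.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    left: float
    top: float
    right: float
    bottom: float  # y grows downward, so top < bottom (assumption)


class VLine(NamedTuple):
    x: float
    top: float
    bottom: float


def overlaps(a, b):
    """True if the interiors of two rectangles intersect."""
    return (a.left < b.right and b.left < a.right
            and a.top < b.bottom and b.top < a.bottom)


def find_gaps(bbox, obstacles, L1, L2):
    """Pair lines (l1, l2) into whitespace gaps, returned as rectangles."""
    # l1 candidates: set L1 plus the right bound of the bounding box.
    right_sides = set(L1) | {VLine(bbox.right, bbox.top, bbox.bottom)}
    # l2 candidates: set L2 plus the left bound of the bounding box.
    left_sides = set(L2) | {VLine(bbox.left, bbox.top, bbox.bottom)}
    gaps = set()
    for l2 in left_sides:
        for l1 in right_sides:
            if l2.x >= l1.x:
                continue  # l2 must lie to the left of l1 (assumption)
            if (l1.top, l1.bottom) != (l2.top, l2.bottom):
                continue  # top and bottom Y-coordinates must match
            gap = Rect(l2.x, l2.top, l1.x, l2.bottom)
            if not any(overlaps(gap, o) for o in obstacles):
                gaps.add(gap)  # no obstacle lies between l1 and l2
    return gaps
```

With a 100 x 100 bounding box, the obstacles `Rect(10, 10, 40, 30)`, `Rect(60, 10, 90, 30)` and `Rect(10, 50, 90, 70)`, `L1 = {VLine(10, 0, 100), VLine(60, 0, 50)}` and `L2 = {VLine(40, 0, 50), VLine(90, 0, 100)}`, the sketch yields three gaps: the left margin, the right margin, and the column gap `Rect(40, 0, 60, 50)` between the two words.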

Algorithm. Output


Using the algorithm for table detection

Text lines are grouped into table regions

Table regions are grouped into tables


Using the algorithm for table segmentation

• Recovering the graphical lines (rules) of a table can be used for table segmentation

• Vertical lines are recovered from vertical whitespace gaps inside a table

• Horizontal lines are recovered from horizontal whitespace gaps inside a table
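One way to turn gaps into rules is sketched below in Python. The slides do not say where inside a gap a recovered line is drawn, so placing it on the gap's center line is purely an assumption, as are the names `Rect`, `vertical_rule`, `horizontal_rule` and `transpose`. The `transpose` helper illustrates one possible way to obtain horizontal gaps: swap the axes of the input, run a vertical-gap search, and swap the axes of the result back.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    left: float
    top: float
    right: float
    bottom: float  # y grows downward, so top < bottom (assumption)


def vertical_rule(gap):
    """Segment (x1, y1, x2, y2) down the middle of a vertical gap."""
    x = (gap.left + gap.right) / 2
    return (x, gap.top, x, gap.bottom)


def horizontal_rule(gap):
    """Segment (x1, y1, x2, y2) across the middle of a horizontal gap."""
    y = (gap.top + gap.bottom) / 2
    return (gap.left, y, gap.right, y)


def transpose(r):
    """Swap the X and Y axes of a rectangle; applying it twice restores it."""
    return Rect(r.top, r.left, r.bottom, r.right)
```

For the column gap `Rect(40, 0, 60, 50)`, `vertical_rule` gives a segment at x = 50 running from y = 0 to y = 50.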

Conclusion

1. Our algorithm can be used for

1. Multi-column text segmentation

2. Table detection

3. Table segmentation

2. The computational complexity of the algorithm is O(n^2)

3. The algorithm is simple enough to implement (about 60 statements of Object Pascal)