
Master Thesis

Document Logical Structure Extraction

Mehrdad Nojoumian

Supervisor: Professor T. C. Lethbridge

University of Ottawa

School of Information Technology and Engineering

October 27, 2006

Contents

- Motivation and Goal
- Document Properties (UML Superstructure Specification)
- Document Transformation
- Logical Structure Extraction
- Summary and Future Work


Motivation and Goal

Problems:

Specifications are:
- Dense, repetitive and difficult to use
- Written primarily in semi-structured text, but the structure must be maintained manually, resulting in inconsistency

End users cannot use them efficiently due to:
- Duplications
- Numerous concepts connected only implicitly
- General complexity of the document

High-level goal:
- Enable easier browsing and editing of specifications

To achieve this we have the following lower-level goals:
- Extract the document's logical structure
- Generate a Knowledge Base for the UML specification


Definitions

Document analysis: Extraction of the geometric (physical) structure, which refers to the pages, blocks, lines, and words

Document understanding: Mapping the physical structure into the logical structure, which refers to the chapters, sections, subsections, etc.

Knowledge acquisition: Extracting concepts embedded in the document structure (physical or logical)

Unstructured document: Plain text in natural language

Semi-structured document: A document with tags dividing it into paragraphs, headings and sections, such as a web page

Structured document: A document in which all the elements are marked with meta-tags, typically using XML


Quick Literature Review

Document analysis:

WISDOM (Wise System for Document Management) is a document processing system that operates in five steps:

1. Document analysis (physical)
2. Document classification
3. Document understanding (mapping physical to logical)
4. Text recognition with OCR
5. Text transformation into XML format

MKB (Mathematical Knowledge Browser): using this browser,

1. Printed mathematical documents can be scanned and recognized by OCR
2. The meta-information (e.g. title, author, abstract, etc.) can be extracted
3. The logical structure (e.g. theorem, lemma, proof, etc.) can be extracted

OCR: Optical Character Recognition

Document Properties

UML Superstructure Specification (version 2.1):

Is a large specification in PDF format

Has 771 pages

Contains almost 2200 headings with a lot of nested lists, hyperlinks, figures, tables, etc.

Reasons for choosing the PDF format:

People do not have access to the original word-processor formats much of the time

PDF format has some useful features that make it semi-structured such as bookmarks

When documents are published, the best choice is PDF format to guarantee that everyone can read it


Document Transformation

I. Transforming the raw input into a format more amenable to analysis

II. Extracting and refining the structure

Conversion Experiments:

We performed various conversions using a similar sample file

We applied different tools such as:

1. Adobe Acrobat Professional 7.8
2. Microsoft Word 2003
3. Stylus Studio 2006 XML Enterprise Suite
4. ABBYY PDF Transformer 1.0


Input Format (Size KB)   Tools for Conversions             Output Format (Size KB)

DOC (34.5)               Microsoft Office Word 2003        TXT (2.81)
                         Word 2003                         RTF (55)
                         Word 2003                         HTML (40.7)
                         Word 2003                         XML (55)
DOC (34.5)               Adobe Acrobat Professional 7.8    PDF (19) with Bookmarks
DOC (34.5)               Adobe Acrobat Professional 7.8    PDF (15.9) without Bookmarks
                         Adobe Acrobat Professional 7.8    HTML (6.38)
                                                           HTML (5.15)
                                                           XML (9.92)
                                                           XML (8.30)
                         ABBYY PDF Transformer 1.0         HTML (19.2)
                         ABBYY PDF Transformer 1.0         TXT (2.82)

Document Transformation (Cont)

Criteria: To select the best conversion, we defined a set of criteria

1. Generality: A format should enable the design of a general extraction algorithm for processing other electronic documents

2. Low volume: We should avoid a format which contains a lot of extra material that is not related to the document content

3. Clean and understandable: Even if the output is small, it should be clean and understandable, e.g. formats which mark constructs such as paragraphs

4. Similarity to XML: We prefer a format which has a similar structure to XML because our final goal is to extract the logical structure in this style

5. Having good clues: A format should use markers which provide accurate and good clues for finding the logical structure, e.g. meaningful keywords with respect to the headings: “LinkTarget”, “DIV”, “Sect”, “Part”, etc.



First Stage of Evaluation:

DOC & RTF: They are messy and even encode figures among the contents of the document. In addition, they store information related to the font, size, style, etc. of each heading, paragraph, sentence and even word

HTML/XML: If we extract HTML/XML formats from DOC/RTF, the results tend to have the same properties

TXT: It is very simple but does not give us any clues for processing; you may not even find the beginning of the chapters, headings, tables, etc.

PDF: It is complex itself, but after a conversion into HTML/XML by Adobe Acrobat Professional 7.8 the result is very good, especially in the case of PDF files which have bookmarks



Second Stage of Evaluation:

Our finalist candidates are the HTML and XML formats extracted by Adobe Acrobat Professional 7.8 from the PDF file with bookmarks

We analyzed the following sample parts using the two finalist candidates:

- Sample paragraphs
- Sample figures (e.g. figure 7.25)
- Sample tables (e.g. table 2.1)
- Complex tables which have phrases, figures and hyperlinks in their cells
- Complex nested lists which have complicated hierarchy structures

After many assessments, we found that XML is the best candidate for processing


Logical Structure Extraction

First refinement approach: Grammars

Applied various parsing packages

Tried to write a comprehensive grammar to parse the XML document

Sample headings:
    7 Classes\n
    7.1 Overview\n
    7.2 Abstract Syntax\n
    7.3.1 Abstraction\n

Encountered too many exceptions, resulting in the need for:

1. Too many rules

2. Context-sensitive parsing


[Figure: heading grammar diagram with the labels Start, Number, Period, Space, Word, Next Line]
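The heading shape that the grammar tried to capture (a number, a space, then words up to the end of the line) can be approximated with a regular expression. The following is only a minimal sketch under that assumption: the class name HeadingPattern and the pattern itself are hypothetical and are not the grammar or parsing package used in this work.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: recognizes lines shaped like "7.3.1 Abstraction".
final class HeadingPattern {
    // One or more dot-separated integers, a single space, then the title text.
    private static final Pattern NUMBERED =
            Pattern.compile("^(\\d+(?:\\.\\d+)*) (\\S.*)$");

    // Returns the heading number (e.g. "7.3.1") if the whole line matches.
    static Optional<String> numberOf(String line) {
        Matcher m = NUMBERED.matcher(line);
        return m.matches() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // Nesting depth implied by the number: "7" is 1, "7.3" is 2, "7.3.1" is 3.
    static int depthOf(String number) {
        return number.split("\\.").length;
    }

    public static void main(String[] args) {
        String[] samples = {"7 Classes", "7.3.1 Abstraction", "Notation"};
        for (String line : samples) {
            System.out.println(line + " -> " + numberOf(line).orElse("(no match)"));
        }
    }
}
```

A body-text line such as "7 classes were generated" also matches this pattern, which gives a feel for why a purely pattern-based grammar ran into so many exceptions.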

Logical Structure Extraction (Cont)

Second refinement approach: stack-based parsing written in Java

We turned to writing a simple Java program to match the major tags, such as <Part>, <Sect> and <Div>, which Acrobat used to open and close each part, chapter, section, etc. of the document

    <Sect name="Generalization">        <Generalization>
      <Sect name="Class-Ref">             <Class-Ref>
        <Sect name="Name">                  <Name>
        </Sect>                             </Name>
        <Sect name="Package-Ref">           <Package-Ref>
        </Sect>                             </Package-Ref>
      </Sect>                             </Class-Ref>
    </Sect>                             </Generalization>

Using a straightforward stack-based parsing approach
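The tag matching described on this slide can be sketched as follows. This is an illustration rather than the program written for the thesis: the class SectRenamer, its method name, and the assumption that each tag arrives as its own string with straight quotes are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: rewrite <Sect name="X"> ... </Sect> as <X> ... </X>.
final class SectRenamer {
    private static final Pattern OPEN = Pattern.compile("<Sect name=\"([^\"]+)\">");
    private static final String CLOSE = "</Sect>";

    static List<String> rename(List<String> tags) {
        Deque<String> open = new ArrayDeque<>(); // names of currently open sections
        List<String> out = new ArrayList<>();
        for (String raw : tags) {
            String tag = raw.trim();
            Matcher m = OPEN.matcher(tag);
            if (m.matches()) {
                // Opening tag: remember its name so the matching close can reuse it.
                open.push(m.group(1));
                out.add("<" + m.group(1) + ">");
            } else if (tag.equals(CLOSE)) {
                if (open.isEmpty()) {
                    throw new IllegalStateException("</Sect> without an open <Sect>");
                }
                // Closing tag: it belongs to the most recently opened section.
                out.add("</" + open.pop() + ">");
            } else {
                out.add(raw); // anything else passes through unchanged
            }
        }
        if (!open.isEmpty()) {
            throw new IllegalStateException("unclosed <Sect>: " + open.peek());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "<Sect name=\"Generalization\">",
                "<Sect name=\"Class-Ref\">",
                "</Sect>",
                "</Sect>");
        rename(input).forEach(System.out::println);
    }
}
```

The stack makes each </Sect> close the most recently opened section, so the approach is only as reliable as the placement of the closing tags in the input.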


Logical Structure Extraction (Cont)

Second refinement approach: stack-based parsing written in Java

After running the program on various chapters and on the whole document, it failed

The tool opened each part, chapter, section, etc. with <Sect> in the proper place in the document, but it closed these tags with </Sect> in the wrong places

The problem was more severe when we processed the whole document at once, because the mis-tagging accumulated

    <Sect number="7.3">
      <Sect number="7.3.1"></Sect>
      <Sect number="7.3.2"></Sect>
                                      <- Correct place for closing <Sect number="7.3">
      <Sect number="7.4"></Sect>
    </Sect>                           <- Wrong place
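One way to see this kind of mis-tagging mechanically is to check that every nested section number extends the number of its enclosing section. The sketch below is an illustration only and was not part of the thesis tooling; the class NestingCheck and its event encoding (a section number opens a section, "/" closes the innermost one) are assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical check: flag sections that are nested inside a non-parent section.
final class NestingCheck {
    // events: "7.3" opens section 7.3; "/" closes the innermost open section.
    static List<String> misNested(List<String> events) {
        Deque<String> open = new ArrayDeque<>();
        List<String> bad = new ArrayList<>();
        for (String e : events) {
            if (e.equals("/")) {
                if (!open.isEmpty()) {
                    open.pop();
                }
            } else {
                String parent = open.peek(); // null when nothing is open
                // A proper child of "7.3" must start with "7.3." (e.g. "7.3.1").
                if (parent != null && !e.startsWith(parent + ".")) {
                    bad.add(e);
                }
                open.push(e);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        // The tag sequence from the slide: 7.4 is opened before 7.3 is closed.
        List<String> events = Arrays.asList("7.3", "7.3.1", "/", "7.3.2", "/", "7.4", "/", "/");
        System.out.println(misNested(events));
    }
}
```

For the sequence on the slide, the check reports 7.4, the section that ends up inside 7.3 because the closing tag came too late.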



Third implementation approach: leveraging the bookmarks

We wrote a Java-based parser which focused on a keyword: LinkTarget
- It corresponds to the bookmark elements created in the transformation phase
- It is attached to each heading in the bookmarks

e.g.: <P id="LinkTarget_111914">7 Classes</P>

We extracted all the lines containing LinkTarget and put them in a queue

We also defined the different types of headings in our document:


T   Sample Heading                   Type

1   Part I - Structure               Part
2   7 Classes                        Chapter
3   7.3 Class Descriptions           Section
4   7.3.1 Abstraction                Subsection
5   Generalization, Notation, etc    Keyword
6   Annex                            End part
7   Index                            Last Part
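Pulling the heading text out of a LinkTarget line and assigning it a type T from the table above can be sketched as follows. The slides do not say how the types were detected, so the matching rules, the class HeadingClassifier and its method names are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: extract a heading from a LinkTarget line and classify it.
final class HeadingClassifier {
    private static final Pattern LINK_TARGET =
            Pattern.compile("<P id=\"LinkTarget_\\d+\">(.*?)</P>");

    // Returns the heading text, or null if the line holds no LinkTarget paragraph.
    static String headingOf(String line) {
        Matcher m = LINK_TARGET.matcher(line);
        return m.find() ? m.group(1).trim() : null;
    }

    // Maps a heading to its type T (1..7). These rules are assumed, not the authors'.
    static int typeOf(String heading) {
        if (heading.startsWith("Part ")) return 1;             // Part I - Structure
        if (heading.matches("\\d+ .+")) return 2;              // 7 Classes
        if (heading.matches("\\d+\\.\\d+ .+")) return 3;       // 7.3 Class Descriptions
        if (heading.matches("\\d+\\.\\d+\\.\\d+ .+")) return 4; // 7.3.1 Abstraction
        if (heading.startsWith("Annex")) return 6;             // Annex
        if (heading.equals("Index")) return 7;                 // Index
        return 5;                                              // Generalization, Notation, ...
    }

    public static void main(String[] args) {
        String h = headingOf("<P id=\"LinkTarget_111914\">7 Classes</P>");
        System.out.println(h + " has type " + typeOf(h));
    }
}
```

Anything that is not recognized as a part, numbered heading, annex or index falls through to the keyword type, which mirrors the open-ended "Generalization, Notation, etc" row of the table.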

Logical Structure Extraction (Cont)

Q: Part I, 1 Classes, 1.1 Description, 1.1.1 Abstraction

Procedure DocumentStructureAnalysis(LinkTargetQueue)
    T of the last member of the HeadingStack = 0, HeadingStack = empty
    While (LinkTargetQueue != empty) do
        Get “L” from the LinkTargetQueue              // L: Line, e.g. <P id="LinkTarget_111914">7 Classes</P>
        Extract the heading “H” from the “L”          // H: Heading, e.g. 7 Classes
        Define heading's type: “T”                    // T: Type, e.g. for the Chapters, T Chapter = 2
        While (T <= T of the last member of the HeadingStack) do
            Pop “H” and “T” from the HeadingStack
            Close the suitable tag w.r.t the popped “T”
            If (HeadingStack == empty)
                Break this while loop
            End if
        End while
        Push the new “H” and “T” in the HeadingStack
        Open new tags w.r.t the pushed “H” & “T”
    End while
    While (HeadingStack != empty) do
        Pop “H” and “T” from the HeadingStack
        Close the suitable tag w.r.t the popped “T”
    End while
    Return “F”
End procedure


Heading              Type                 Tag

Part I               T Part = 1           <Part I>
1 Classes            T Chapter = 2        <Chapter 1>
1.1 Description      T Section = 3        <Section 1.1>
1.1.1 Abstraction    T Subsection = 4     <Subsection 1.1.1>
1.1.1 Abstraction    T Subsection = 4     </Subsection 1.1.1>
1.1 Description      T Section = 3        </Section 1.1>
1 Classes            T Chapter = 2        </Chapter 1>
Part I               T Part = 1           </Part I>
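The procedure above translates almost line for line into Java. The sketch below is one possible rendering, not the thesis code: the class StructureBuilder, the nested Heading type and the element names are hypothetical, and it emits well-formed tags with a title attribute instead of the <Chapter 1> notation used in the trace. It follows the pseudocode literally, so it does not treat the Annex and Index types specially, and it does not escape special characters in titles.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of the DocumentStructureAnalysis procedure.
final class StructureBuilder {
    // Element names indexed by heading type T (1..7); index 0 is unused.
    private static final String[] NAMES =
            {"", "Part", "Chapter", "Section", "Subsection", "Keyword", "EndPart", "LastPart"};

    static final class Heading {
        final String text;
        final int type;

        Heading(String text, int type) {
            this.text = text;
            this.type = type;
        }
    }

    static List<String> build(List<Heading> queue) {
        Deque<Heading> stack = new ArrayDeque<>();
        List<String> out = new ArrayList<>();
        for (Heading h : queue) {
            // Close every open element at the same level or deeper than the new heading.
            while (!stack.isEmpty() && h.type <= stack.peek().type) {
                out.add("</" + NAMES[stack.pop().type] + ">");
            }
            stack.push(h);
            out.add("<" + NAMES[h.type] + " title=\"" + h.text + "\">");
        }
        // The queue is exhausted: close whatever is still open, innermost first.
        while (!stack.isEmpty()) {
            out.add("</" + NAMES[stack.pop().type] + ">");
        }
        return out;
    }

    public static void main(String[] args) {
        List<Heading> q = Arrays.asList(
                new Heading("Part I", 1),
                new Heading("1 Classes", 2),
                new Heading("1.1 Description", 3),
                new Heading("1.1.1 Abstraction", 4));
        build(q).forEach(System.out::println);
    }
}
```

Because closing tags are generated from the heading levels rather than read from the Acrobat output, the misplaced </Sect> tags no longer matter.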


We extracted 2191 headings from the UML Superstructure Specification (V: 2.1)

We tested other specifications such as:

UML Infrastructure Specification (V: 2.0)

Extraction succeeded in all cases with 100% accuracy

We also imported the new XML file into Protégé


Logical structure in XML format

Logical structure model in Protégé

Summary

Goal:
- Make specifications more usable

Main task in this stage of our work:
- Extract clean structure from published PDF specification

Document transformation:
- Took raw PDF version of a published specification
- Experimented with tools to convert to other formats: DOC, RTF, TXT, XML, etc.

Logical structure extraction:
- Best format: XML file extracted from PDF with bookmarks
- Key challenge: Dealing with mis-tagging
- We needed to write a procedural program
- Declarative grammars with a parsing package did not work well


Future Work

We extracted the document’s logical structure (document entity)

We intend to focus on the hidden concepts in the document (UML entity)

We are interested in knowing what knowledge could be captured:
- We will capture a list of all words, bi-grams, tri-grams and quad-grams
- Calculate their frequency of occurrence
- Gain a sense of the terminology and concepts from these frequencies

We would also like to do related-phrases analysis: “X is a kind of Y”, “X has a Y”
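The planned word and n-gram frequency count could look roughly like the sketch below. It is an illustration of the idea, not work that has been done: the class NGramCounter, the tokenization rule (lowercase ASCII letters and digits only) and the sample sentence are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: count how often each n-gram of words occurs in a text.
final class NGramCounter {
    static Map<String, Integer> count(String text, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        // Tokenize: lowercase, then split on anything that is not a letter or digit.
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        // Slide a window of n words over the text and tally each window.
        Map<String, Integer> freq = new TreeMap<>();
        for (int i = 0; i + n <= words.size(); i++) {
            String gram = String.join(" ", words.subList(i, i + n));
            freq.merge(gram, 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        String text = "A class is a kind of classifier. A class has a name.";
        System.out.println(count(text, 2)); // n = 2 gives bi-grams
    }
}
```

Calling count with n set to 1, 2, 3 and 4 gives the word, bi-gram, tri-gram and quad-gram tables in turn. This simple tokenizer ignores sentence boundaries, so some n-grams span two sentences.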



Thanks

Questions?
