Master Thesis
Document Logical Structure Extraction
Mehrdad Nojoumian
Supervisor: Professor T. C. Lethbridge
University of Ottawa
School of Information Technology and Engineering
October 27, 2006
Contents
Motivation and Goal
Document Properties (UML Superstructure Specification)
Document Transformation
Logical Structure Extraction
Summary and Future Work
Motivation and Goal
Problems:
Specifications are:
  Dense, repetitive and difficult to use
  Written primarily in semi-structured text, but the structure must be maintained manually, resulting in inconsistency
End users cannot use them efficiently due to:
  Duplications
  Numerous concepts connected only implicitly
  General complexity of the document
High-level goal:
  Enable easier browsing and editing of specifications
To achieve this, we have the following lower-level goals:
  Extract the document's logical structure
  Generate a knowledge base for the UML specification
Definitions
Document analysis: Extraction of the geometric structure, which refers to the pages, blocks, lines, and words
Document understanding: Mapping the physical structure onto the logical structure, which refers to the chapters, sections, subsections, etc.
Knowledge acquisition: Extracting concepts embedded in the document structure (physical or logical)
Unstructured document: Plain text in natural language
Semi-structured document: A document with tags dividing it into paragraphs, headings and sections, such as a web page
Structured document: A document in which all the elements are marked with meta-tags, typically using XML
Quick Literature Review
Document analysis:
WISDOM (Wise System for Document Management): a document processing system that operates in five steps:
1. Document analysis (physical)
2. Document classification
3. Document understanding (mapping physical to logical)
4. Text recognition with OCR
5. Text transformation into XML format
MKB (Mathematical Knowledge Browser): using this browser,
1. Printed mathematical documents can be scanned and recognized by OCR
2. The meta-information (e.g. title, author, abstract, etc.) can be extracted
3. The logical structure (e.g. theorem, lemma, proof, etc.) can be extracted
OCR: Optical Character Recognition
Document Properties
UML Superstructure Specification (version 2.1):
Is a large specification in PDF format
Has 771 pages
Contains almost 2200 headings with a lot of nested lists, hyperlinks, figures, tables, etc.
Reasons for choosing the PDF format:
People do not have access to the original word-processor formats much of the time
PDF format has some useful features that make it semi-structured such as bookmarks
When documents are published, the best choice is PDF format to guarantee that everyone can read it
Document Transformation
I. Transforming the raw input into a format more amenable to analysis
II. Extracting and refining the structure
Conversion Experiments:
We performed various conversions using a similar sample file
We applied different tools such as:
1. Adobe Acrobat Professional 7.8
2. Microsoft Word 2003
3. Stylus Studio 2006 XML Enterprise Suite
4. ABBYY PDF Transformer 1.0
Input Format (Size KB)          Tool for Conversion               Output Format (Size KB)
DOC (34.5)                      Microsoft Office Word 2003        TXT (2.81)
DOC (34.5)                      Microsoft Office Word 2003        RTF (55)
DOC (34.5)                      Microsoft Office Word 2003        HTML (40.7)
DOC (34.5)                      Microsoft Office Word 2003        XML (55)
DOC (34.5)                      Adobe Acrobat Professional 7.8    PDF (19) with Bookmarks
DOC (34.5)                      Adobe Acrobat Professional 7.8    PDF (15.9) without Bookmarks
PDF (19) with Bookmarks         Adobe Acrobat Professional 7.8    HTML (6.38)
PDF (15.9) without Bookmarks    Adobe Acrobat Professional 7.8    HTML (5.15)
PDF (19) with Bookmarks         Adobe Acrobat Professional 7.8    XML (9.92)
PDF (15.9) without Bookmarks    Adobe Acrobat Professional 7.8    XML (8.30)
PDF (19) with Bookmarks         ABBYY PDF Transformer 1.0         HTML (19.2)
PDF (19) with Bookmarks         ABBYY PDF Transformer 1.0         TXT (2.82)
Document Transformation (Cont)
Criteria: To select the best conversion, we defined a set of criteria
1. Generality: A format should enable the design of a general extraction algorithm for processing other electronic documents
2. Low volume: We should avoid a format which contains a lot of extra material unrelated to the document content
3. Clean and understandable: Even if the output is small, it should be clean and understandable, e.g. formats which mark constructs such as paragraphs
4. Similarity to XML: We prefer a format which has a similar structure to XML, because our final goal is to extract the logical structure in this style
5. Having good clues: A format should use markers which provide accurate and good clues for finding the logical structure, e.g. meaningful keywords with respect to the headings: “LinkTarget”, “DIV”, “Sect”, “Part”, etc.
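One rough way to compare candidate formats on the "good clues" criterion is to count how often the structural markers occur in each converted file. The sketch below is only an illustration in Python: the marker names come from the criterion above, while the function name `count_clues`, the letter-boundary matching rule and the sample string are assumptions, not part of the thesis.

```python
import re
from collections import Counter

# Marker names taken from the criteria above; counting them is an assumed heuristic.
CLUE_MARKERS = ("LinkTarget", "DIV", "Sect", "Part")

def count_clues(converted_text, markers=CLUE_MARKERS):
    """Count occurrences of each marker that are not embedded in a longer word."""
    counts = Counter()
    for marker in markers:
        # Reject matches with a letter directly before or after the marker,
        # so "Part" is not counted inside a word such as "Partial".
        pattern = r"(?<![A-Za-z])" + re.escape(marker) + r"(?![A-Za-z])"
        counts[marker] = len(re.findall(pattern, converted_text))
    return counts

# Hypothetical fragment in the style of the Acrobat XML output.
sample = '<Part><Sect><P id="LinkTarget_111914">7 Classes</P></Sect></Part>'
clue_counts = count_clues(sample)
```

A format whose output contains many such markers relative to its size would score well on both the "low volume" and "good clues" criteria.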
Document Transformation (Cont)
First Stage of Evaluation:
DOC & RTF: They are messy; even figures are encoded among the contents of the document. In addition, they store information about the font, size, style, etc. of each heading, paragraph, sentence and even word
HTML/XML: If we extract HTML/XML formats from DOC/RTF, the results tend to have the same properties
TXT: It is very simple but gives us no clues for processing; you may not even find the beginning of the chapters, headings, tables, etc.
PDF: The format itself is complex, but after conversion into HTML/XML by Adobe Acrobat Professional 7.8 the result is very good, especially for PDF files which have bookmarks
Document Transformation (Cont)
Second Stage of Evaluation:
Our finalist candidates are the HTML and XML formats extracted by Adobe Acrobat Professional 7.8 from the PDF file with bookmarks
We analyzed the following sample parts using the two finalist candidates:
  Sample paragraphs
  Sample figures (e.g. figure 7.25)
  Sample tables (e.g. table 2.1)
  Complex tables which have phrases, figures and hyperlinks in their cells
  Complex nested lists which have complicated hierarchical structures
After many assessments, we found that XML is the best candidate for processing
Logical Structure Extraction
First refinement approach: Grammars
Applied various parsing packages
Tried to write a comprehensive grammar to parse the XML document
Sample headings:
  7 Classes
  7.1 Overview
  7.2 Abstract Syntax
  7.3.1 Abstraction
Encountered too many exceptions, resulting in the need for:
1. Too many rules
2. Context-sensitive parsing
[Figure: heading pattern as a sequence of states: Start, Number, Period, Space, Word, Next Line]
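The heading pattern (a number, optional period-separated sub-numbers, a space, words, end of line) can be approximated with a single regular expression. The Python sketch below is not the grammar used in the thesis; the names `HEADING_RE` and `parse_heading` and the last test line are hypothetical. It also hints at why exceptions arise: any body line that happens to start with a number matches the same pattern.

```python
import re

# Number(.Number)* Space Words, anchored to a whole line.
HEADING_RE = re.compile(r"^(\d+(?:\.\d+)*) ([A-Z][^\n]*)$")

def parse_heading(line):
    """Return (number, title) for a numbered heading line, else None."""
    match = HEADING_RE.match(line.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)

headings = [parse_heading(s) for s in
            ["7 Classes", "7.1 Overview", "7.3.1 Abstraction", "Generalization"]]

# Hypothetical body line: it is not a heading, yet the pattern accepts it.
false_positive = parse_heading("3 Associations may be navigable")
```

Telling such lines apart requires knowledge of the surrounding context, which is consistent with the need for context-sensitive parsing noted on the slide.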
Logical Structure Extraction (Cont)
Second refinement approach: stack-based parsing written in Java
We turned to writing a simple Java program to match the major tags, such as <Part>, <Sect> and <Div>, which Acrobat used to open and close each part, chapter, section, etc. of the document
<Sect name="Generalization">        <Generalization>
  <Sect name="Class-Ref">             <Class-Ref>
    <Sect name="Name">                  <Name>
    </Sect>                             </Name>
    <Sect name="Package-Ref">           <Package-Ref>
    </Sect>                             </Package-Ref>
  </Sect>                             </Class-Ref>
</Sect>                             </Generalization>
Using a straightforward stack-based parsing approach
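A minimal sketch of stack-based tag matching, written in Python for brevity (the thesis used Java, and its code is not shown in these slides). The function name `check_nesting` and the returned tuple are assumptions made for illustration.

```python
import re

# Opening or closing <Part>, <Sect> or <Div> tags, with optional attributes.
TAG_RE = re.compile(r"<(/?)(Part|Sect|Div)\b[^>]*>")

def check_nesting(xml_text):
    """Return (ok, max_depth); ok is False if a closing tag does not match
    the most recently opened tag, or if tags remain open at the end."""
    stack = []
    max_depth = 0
    for match in TAG_RE.finditer(xml_text):
        is_close = match.group(1) == "/"
        name = match.group(2)
        if not is_close:
            stack.append(name)
            max_depth = max(max_depth, len(stack))
        elif not stack or stack.pop() != name:
            return False, max_depth
    return not stack, max_depth

balanced = check_nesting('<Part><Sect name="A"><Sect name="B"></Sect></Sect></Part>')
crossed = check_nesting('<Sect><Div></Sect></Div>')
```

A check of this kind only verifies that tags are balanced. It cannot tell whether a section was closed at the point where it logically ends.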
Logical Structure Extraction (Cont)
Second refinement approach: stack-based parsing written in Java
When we ran the program on various chapters and on the whole document, it failed
The conversion tool opened each part, chapter, section, etc. with <Sect> in the proper place in the document, but it closed all of these tags with </Sect> in the wrong places
The problem was more severe when we processed the whole document at once, because of the cumulative mis-tagging
<Sect number="7.3">
  <Sect number="7.3.1"></Sect>
  <Sect number="7.3.2"></Sect>
                                 <-- Correct place for closing <Sect number="7.3">
  <Sect number="7.4"></Sect>
</Sect>                          <-- Wrong place
Logical Structure Extraction (Cont)
Third implementation approach: leveraging the bookmarks
We wrote a Java-based parser which focused on a keyword: LinkTarget
  It corresponds to the bookmark elements created in the transformation phase
  It is attached to each heading in the bookmarks
  e.g.: <P id="LinkTarget_111914">7 Classes</P>
We extracted all the lines containing LinkTarget and put them in a queue
We also defined the different types of headings in our document:
T    Sample Heading                   Type
1    Part I - Structure               Part
2    7 Classes                        Chapter
3    7.3 Class Descriptions           Section
4    7.3.1 Abstraction                Subsection
5    Generalization, Notation, etc    Keyword
6    Annex                            End part
7    Index                            Last Part
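The two steps above, collecting the LinkTarget lines into a queue and assigning each heading a type T, might look roughly as follows. This is a Python sketch, not the thesis's Java code. The classification rules are guesses based on the sample headings in the table, and treating numbering deeper than three levels as a subsection is an assumption.

```python
import re
from collections import deque

LINK_TARGET_RE = re.compile(r'<P id="LinkTarget_\d+">(.*?)</P>')

def extract_headings(xml_lines):
    """Collect the text of every LinkTarget paragraph, in document order."""
    queue = deque()
    for line in xml_lines:
        match = LINK_TARGET_RE.search(line)
        if match:
            queue.append(match.group(1).strip())
    return queue

def heading_type(heading):
    """Map a heading to its type number T (1..7) as listed in the table."""
    if heading.startswith("Part "):
        return 1
    if heading.startswith("Annex"):
        return 6
    if heading == "Index":
        return 7
    number = re.match(r"(\d+(?:\.\d+)*)\s", heading)
    if number:
        # 0 dots -> chapter (2), 1 dot -> section (3), 2 or more -> subsection (4)
        return min(2 + number.group(1).count("."), 4)
    return 5  # unnumbered headings such as "Generalization" or "Notation"

lines = ['<P id="LinkTarget_111914">7 Classes</P>',
         '<P>Some body text</P>',
         '<P id="LinkTarget_111915">7.3.1 Abstraction</P>']  # second id is made up
queue = extract_headings(lines)
types = [heading_type(h) for h in queue]
```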
Logical Structure Extraction (Cont)
Procedure DocumentStructureAnalysis(LinkTargetQueue)
// Q (sample queue): Part I, 1 Classes, 1.1 Description, 1.1.1 Abstraction
  HeadingStack = empty, T of the last member of the HeadingStack = 0
  While (LinkTargetQueue != empty) do
    Get “L” from the LinkTargetQueue        // L (Line), e.g.: <P id="LinkTarget_111914">7 Classes</P>
    Extract the heading “H” from the “L”    // H (Heading), e.g.: 7 Classes
    Define the heading's type “T”           // T (Type), e.g.: for chapters, T Chapter = 2
    While (T <= T of the last member of the HeadingStack) do
      Pop “H” and “T” from the HeadingStack
      Close the suitable tag w.r.t. the popped “T”
      If (HeadingStack == empty)
        Break this while loop
      End if
    End while
    Push the new “H” and “T” onto the HeadingStack
    Open new tags w.r.t. the pushed “H” & “T”
  End while
  While (HeadingStack != empty) do
    Pop “H” and “T” from the HeadingStack
    Close the suitable tag w.r.t. the popped “T”
  End while
  Return “F”
End procedure
Heading              Type                 Tag
Part I               T Part = 1           <Part I>
1 Classes            T Chapter = 2        <Chapter 1>
1.1 Description      T Section = 3        <Section 1.1>
1.1.1 Abstraction    T Subsection = 4     <Subsection 1.1.1>
1.1.1 Abstraction    T Subsection = 4     </Subsection 1.1.1>
1.1 Description      T Section = 3        </Section 1.1>
1 Classes            T Chapter = 2        </Chapter 1>
Part I               T Part = 1           </Part I>
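The procedure can be sketched in a few lines of Python (the thesis implementation was in Java and is not reproduced here). The sketch assumes the queue already holds (heading, type) pairs. Putting the heading in a `name` attribute is a choice made here to keep the output well-formed, and types 6 and 7 are left out because the slides do not show how Annex and Index are closed.

```python
from collections import deque

# Element names for types 1..5; the names themselves are illustrative.
TAG_NAMES = {1: "Part", 2: "Chapter", 3: "Section", 4: "Subsection", 5: "Keyword"}

def build_structure(heading_queue):
    """Turn a queue of (heading, type) pairs into nested open/close tags."""
    stack = []    # currently open headings, innermost last
    output = []
    while heading_queue:
        heading, level = heading_queue.popleft()
        # Close every open heading at the same or a deeper level.
        while stack and level <= stack[-1][1]:
            _closed_heading, closed_level = stack.pop()
            output.append("</%s>" % TAG_NAMES[closed_level])
        stack.append((heading, level))
        output.append('<%s name="%s">' % (TAG_NAMES[level], heading))
    # Close whatever is still open at the end of the document.
    while stack:
        _closed_heading, closed_level = stack.pop()
        output.append("</%s>" % TAG_NAMES[closed_level])
    return output

trace = build_structure(deque([("Part I", 1), ("1 Classes", 2),
                               ("1.1 Description", 3), ("1.1.1 Abstraction", 4)]))
siblings = build_structure(deque([("7 Classes", 2), ("7.1 Overview", 3),
                                  ("7.2 Abstract Syntax", 3)]))
```

Because closing tags are generated from the heading levels alone, the result does not depend on where the conversion tool placed its own </Sect> tags.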
Logical Structure Extraction (Cont)
We extracted 2191 headings from the UML Superstructure Specification (V: 2.1)
We tested other specifications such as:
  UML Infrastructure Specification (V: 2.0)
Extraction succeeded in all cases, with 100% accuracy
We also imported the new XML file into Protégé
[Figure: Logical structure in XML format]
[Figure: Logical structure model in Protégé]
Summary
Goal:
  Make specifications more usable
Main task in this stage of our work:
  Extract a clean structure from a published PDF specification
Document transformation:
  Took the raw PDF version of a published specification
  Experimented with tools to convert it to other formats: DOC, RTF, TXT, XML, etc.
Logical structure extraction:
  Best format: XML file extracted from PDF with bookmarks
  Key challenge: Dealing with mis-tagging
  We needed to write a procedural program
  Declarative grammars with a parsing package did not work well
Future Work
We extracted the document's logical structure (document entity)
We intend to focus on the hidden concepts in the document (UML entity)
We are interested in what knowledge could be captured:
  We will capture a list of all words, bi-grams, tri-grams and quad-grams
  Calculate their frequency of occurrence
  Gain a sense of the terminology and concepts from these frequencies
We would also like to perform related-phrase analysis: “X is a kind of Y”, “X has a Y”
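As a rough illustration of the planned analysis, the Python sketch below counts word n-grams and finds “X is a kind of Y” phrases. It is only a sketch of one possible approach: the tokenisation rule, the function names and the sample sentence are all assumptions, and the sample text is made up for the example.

```python
import re
from collections import Counter

def ngram_counts(text, n):
    """Count word n-grams (n=2 gives bi-grams) in lower-cased text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)

KIND_OF_RE = re.compile(r"(\w+) is a kind of (\w+)")

def kind_of_pairs(text):
    """Find (X, Y) pairs in phrases of the form 'X is a kind of Y'."""
    return KIND_OF_RE.findall(text)

# Made-up sample text, not quoted from the specification.
sample = "A class is a kind of classifier. A classifier is a kind of namespace."
bigrams = ngram_counts(sample, 2)
pairs = kind_of_pairs(sample)
```

The most frequent n-grams would give a first impression of the terminology, and the extracted pairs could seed a simple concept hierarchy.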
Some References
[1] S. Mao, A. Rosenfeld, and T. Kanungo, “Document Structure Analysis Algorithms: A Literature Survey”, in Proceedings of SPIE Electronic Imaging, USA, 2003, pp. 197–207.
[2] S. Klink, A. Dengel, and T. Kieninger, “Document structure analysis based on layout and textual features”, in Proceedings of International Workshop on Document Analysis Systems, Brazil, 2000, pp. 99–111.
[3] J. Liang, “Document Structure Analysis and Performance Evaluation”, PhD thesis, University of Washington, Seattle, USA, 1999.
[4] K. Nakagawa, A. Nomura, and M. Suzuki, “Extraction of Logical Structure from Articles in Mathematics”, 3rd International Conference on Mathematical Knowledge Management, Bialowieza, Poland, 2004, pp. 276–289.
[5] O. Altamura, F. Esposito, and D. Malerba, “Transforming paper documents into XML format with WISDOM++”, International Journal on Document Analysis and Recognition, vol. 4, 2001, pp. 2–17.
[6] W. Cohen and L. Jensen, “A structured wrapper induction system for extracting information from semi-structured documents”, 17th International Joint Conference on AI, Workshop on Adaptive Text Extraction and Mining, Seattle, USA, 2001.
Thanks
Questions?