Structural analysis of documents: Functional Extension Parser (FEP). Günter Mühlberger

IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Structural analysis of documents Functional Extension Parser (FEP) Günter Mühlberger University Innsbruck Library (UIBK)

Upload: biblioteca-nacional-de-espana

Post on 21-Jun-2015

DESCRIPTION

Presented at the IMPACT demonstration session at the Biblioteca Nacional de España (BNE).

TRANSCRIPT

Page 1: Structural analysis of documents Functional Extension Parser (FEP). Günter Mühlberger

IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.

Structural analysis of documents
Functional Extension Parser (FEP)

Günter Mühlberger
University Innsbruck Library (UIBK)

Page 2: Structural analysis of documents Functional Extension Parser (FEP). Günter Mühlberger

Agenda

Introduction
Features
– What do we recognise with the structural analysis?
Benefits
– Why is structural analysis useful?
Architecture
– How does it work?
Results
– How good are we?
Roadmap
– When will it come into being?
Business
– Which offers will be available?

Page 3: Structural analysis of documents Functional Extension Parser (FEP). Günter Mühlberger

Introduction

Document understanding platform
Try to enhance and exploit the logical structure of documents for
– Display
– Navigation
– Retrieval
Enhance OCR output with structural metadata
– Fully automated processing
– Interactive correction

IMPACT EVA/MINERVA 12th Nov. 2008 3

Page 4: Structural analysis of documents Functional Extension Parser (FEP). Günter Mühlberger

Features

General
– We are able to recognise all structural elements which have some layout representation: e.g. region, size, typeface, distance to other elements, etc.
– Focus in IMPACT: Basic features which are typical for all documents
– The rule set can be extended or specialised for other datasets
  E.g. journals, dissertations, index cards, yearbooks, newspapers, etc.
– The better the OCR, the better our structural analysis

Basic features for books
– Page numbers
– Running titles (headers)
– Print space
– Footnotes
– Signature marks
– Headings (within the running text)
– Table of contents entries (additional to headings)
– Front/Body/Back
– Paragraphs

Page 5: Structural analysis of documents Functional Extension Parser (FEP). Günter Mühlberger

Print space
Headings
Footnotes

Page 6: Structural analysis of documents Functional Extension Parser (FEP). Günter Mühlberger

Running title (header)
Page number
Signature mark

Page 7: Structural analysis of documents Functional Extension Parser (FEP). Günter Mühlberger

Table of contents (linked with headings in the running text or with page numbers)
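
Linking a table of contents entry to a heading in the running text could be sketched as follows. This is only an illustration under stated assumptions, not the FEP implementation: the `TocEntry` and `Heading` types, the `link_toc` function and the 0.8 similarity threshold are hypothetical. The sketch narrows candidates by printed page number where one was recognised, then uses fuzzy title similarity to tolerate OCR errors.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Dict, List, Optional


@dataclass
class TocEntry:
    title: str
    printed_page: Optional[int]  # page number printed in the ToC, if recognised


@dataclass
class Heading:
    text: str
    printed_page: int  # recognised page number of the page carrying the heading
    image_index: int   # position of the scanned image in the volume


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity between two strings, in the range 0..1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_toc(entries: List[TocEntry], headings: List[Heading],
             threshold: float = 0.8) -> Dict[str, Optional[int]]:
    """Map each ToC title to the image index of its best heading, or None."""
    links: Dict[str, Optional[int]] = {}
    for entry in entries:
        # Restrict to headings on the printed page named in the ToC, if known.
        candidates = [h for h in headings
                      if entry.printed_page is None
                      or h.printed_page == entry.printed_page]
        best = max(candidates,
                   key=lambda h: similarity(entry.title, h.text),
                   default=None)
        if best is not None and similarity(entry.title, best.text) >= threshold:
            links[entry.title] = best.image_index
        else:
            links[entry.title] = None  # entry stays without a link
    return links
```

An entry whose page carries no sufficiently similar heading is left unlinked rather than guessed, so that it can be picked up in manual correction.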

Page 8: Structural analysis of documents Functional Extension Parser (FEP). Günter Mühlberger

Benefits (1)

Display
– Correct print space allows images to be displayed centred (no flipping between pages)
Search & retrieval
– Scoring of results
  Could take into account structural data (headings, footnotes)
– Noise reduction
  Front, body, back are separated; text from the front is often misleading
  Running titles repeat the same words
  Footnotes can be included or excluded
– Faceted search
  Results can be displayed for running text, footnotes, headings
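
The noise reduction and faceted search described above could look roughly like this in an indexing step. This is a minimal sketch, not FEP code: the label names, `build_index` and `search` are assumptions made for illustration. Lines labelled as running titles or signature marks are skipped at indexing time, and hits are counted per structural label so that results can be filtered by facet.

```python
from collections import defaultdict

# Structural labels treated as noise and kept out of the index (assumed names).
NOISE_LABELS = {"running_title", "signature_mark"}


def build_index(lines):
    """Build {word: {label: count}} from (label, text) pairs, skipping noise."""
    index = defaultdict(lambda: defaultdict(int))
    for label, text in lines:
        if label in NOISE_LABELS:
            continue
        for raw in text.lower().split():
            word = raw.strip(".,;:!?()")
            if word:
                index[word][label] += 1
    return index


def search(index, word, facets=None):
    """Return hit counts per label, optionally restricted to the given facets."""
    hits = index.get(word.lower(), {})
    return {label: count for label, count in hits.items()
            if facets is None or label in facets}


lines = [
    ("running_title", "History of Tyrol"),
    ("heading", "The History of the Valley"),
    ("running_text", "The history of the valley begins early."),
    ("footnote", "See the history by Egger."),
]
index = build_index(lines)
print(search(index, "history"))                # hits per structural label
print(search(index, "history", {"footnote"}))  # footnotes only
print(search(index, "tyrol"))                  # empty: running title was skipped
```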

Page 9: Structural analysis of documents Functional Extension Parser (FEP). Günter Mühlberger

Benefits (2)

Navigation
– Page numbers allow usage of original table of contents
– Original table of contents can be linked with headings/page numbers in the book
Document editing
– Further mark up (e.g. TEI) is supported
– Manual preparation for Print-on-Demand is eased (print space)
– Selective OCR correction can be applied:
  E.g. only headings, running text, footnotes could be fed to CONCERT
Document matching
– Contributions or footnotes can be matched with existing bibliographical databases

Page 10: Structural analysis of documents Functional Extension Parser (FEP). Günter Mühlberger

Improved display on the Internet and in PDF

Page 11: Structural analysis of documents Functional Extension Parser (FEP). Günter Mühlberger

Refinement of full-text search

Facets for e.g.
– Running text
– Footnotes
– Headings
Less noise
– Running titles, signature marks excluded from search

Page 12: Structural analysis of documents Functional Extension Parser (FEP). Günter Mühlberger

Clickable table of contents entries – Google style

Selective OCR correction
– Correct only ToC, headings, footnotes, etc.

Page 13: Structural analysis of documents Functional Extension Parser (FEP). Günter Mühlberger

Matching of documents with external sources
– Match footnotes with library catalogues (bibliographies)
Clickable table of contents
– Match table of contents entries and headings with bibliographies

Page 14: Structural analysis of documents Functional Extension Parser (FEP). Günter Mühlberger

Improved editing
– Alternating print spaces for Print on Demand
– Further processing for TEI editions etc.
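
One way to picture how a recognised print space helps with centred display and Print on Demand is the sketch below. It is a hypothetical illustration, not the method used by FEP: the `Box` type, `print_space` and `centring_offset` are assumed names, and the print space is simplified to the bounding box of all body text lines on a page. The offset is the shift needed to move that box to the centre of the page, which differs between left-hand and right-hand pages when the print space alternates.

```python
from typing import Iterable, NamedTuple, Tuple


class Box(NamedTuple):
    left: int
    top: int
    right: int
    bottom: int


def print_space(line_boxes: Iterable[Box]) -> Box:
    """Bounding box enclosing all given text line boxes."""
    boxes = list(line_boxes)
    if not boxes:
        raise ValueError("no text lines on this page")
    return Box(min(b.left for b in boxes), min(b.top for b in boxes),
               max(b.right for b in boxes), max(b.bottom for b in boxes))


def centring_offset(space: Box, page_width: int,
                    page_height: int) -> Tuple[float, float]:
    """Shift (dx, dy) that moves the centre of `space` to the page centre."""
    dx = (page_width - (space.left + space.right)) / 2
    dy = (page_height - (space.top + space.bottom)) / 2
    return dx, dy


# A right-hand page whose text block sits close to the left (inner) margin.
recto_lines = [Box(100, 200, 1300, 260), Box(100, 280, 1280, 340),
               Box(120, 2200, 1300, 2260)]
space = print_space(recto_lines)
print(space, centring_offset(space, page_width=2000, page_height=2600))
```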

Page 15: Structural analysis of documents Functional Extension Parser (FEP). Günter Mühlberger

Architecture

Input
– Results from OCR processing on word level (coordinates)
– E.g. ALTO file, ABBYY XML file or Google HTML
Output
– Structural annotations for recognized text features, e.g. page numbers, running titles, headings, etc.
– E.g. XML, ALTO, METS, TEI, etc.
General workflow
– OCR result files are parsed (FEP general XML format)
– The rule set is applied to the dataset (rules are managed by a rules engine)
– Results are stored in a database
– Export on various levels is provided
Optional
– Online or offline correction (GUI)
– Adaptation of the rule set
– Quality assurance on the basis of ground truth
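
The first workflow step, reading word-level OCR results with coordinates, can be sketched for an ALTO-style file as follows. This is an assumption-laden illustration rather than the FEP parser: the sample document is heavily simplified (no namespace declaration, one page), and the `Word` type and `parse_lines` function are hypothetical. The attribute names `CONTENT`, `HPOS`, `VPOS`, `WIDTH` and `HEIGHT` on `String` elements are those defined by ALTO.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List

# Simplified ALTO-like sample: a page number line and one line of running text.
SAMPLE_ALTO = """<alto><Layout><Page WIDTH="2000" HEIGHT="2600"><PrintSpace>
<TextBlock>
  <TextLine>
    <String CONTENT="17" HPOS="980" VPOS="120" WIDTH="40" HEIGHT="30"/>
  </TextLine>
  <TextLine>
    <String CONTENT="The" HPOS="300" VPOS="200" WIDTH="60" HEIGHT="30"/>
    <String CONTENT="valley" HPOS="370" VPOS="200" WIDTH="110" HEIGHT="30"/>
  </TextLine>
</TextBlock>
</PrintSpace></Page></Layout></alto>"""


@dataclass
class Word:
    text: str
    x: float
    y: float
    width: float
    height: float


def local_name(tag: str) -> str:
    """Strip an XML namespace prefix such as '{uri}TextLine'."""
    return tag.rsplit("}", 1)[-1]


def parse_lines(alto_xml: str) -> List[List[Word]]:
    """Return the words of each text line together with their coordinates."""
    root = ET.fromstring(alto_xml)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        words = [Word(s.get("CONTENT", ""),
                      float(s.get("HPOS", 0)), float(s.get("VPOS", 0)),
                      float(s.get("WIDTH", 0)), float(s.get("HEIGHT", 0)))
                 for s in element if local_name(s.tag) == "String"]
        if words:
            lines.append(words)
    return lines


for line in parse_lines(SAMPLE_ALTO):
    print([(w.text, w.x, w.y) for w in line])
```

A structure like this, words grouped into lines with positions and sizes, is the kind of input on which layout-based rules can then operate.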

Page 17: Structural analysis of documents Functional Extension Parser (FEP). Günter Mühlberger

The FEP Core

Based on an expert-system-like rule engine for Java (Jess)
Both manually crafted rules and rules obtained by machine learning
Uses fuzzy logic to deal with uncertainty

Typical rules:
IF there is a numeral in the first line of the page AND this numeral is centred THEN this numeral may be the page number
IF there is a numeral in the first line of the page AND this numeral is at the right hand side of the page AND this numeral is an odd number THEN this numeral may be the page number
IF there is a numeral in the first line of the page AND this numeral is at the left hand side of the page AND this numeral is an even number THEN this numeral may be the page number
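
The three page-number rules above can be mimicked in a few lines of plain Python. This is a sketch for illustration only: FEP expresses its rules in Jess, whereas here the `Token` type, the position tolerance and the confidence values 0.8 and 0.9 are invented, and taking the maximum over all rules that fire is an assumed, simple stand-in for a fuzzy OR.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    x_centre: float  # horizontal centre of the token, in pixels
    line_index: int  # 0 means the first line of the page


def position(token: Token, page_width: float, tolerance: float = 0.1) -> str:
    """Classify the token as 'left', 'centre' or 'right' on the page."""
    relative = token.x_centre / page_width
    if abs(relative - 0.5) <= tolerance:
        return "centre"
    return "left" if relative < 0.5 else "right"


def page_number_confidence(token: Token, page_width: float) -> float:
    """Confidence (0..1) that the token is the page number."""
    # All three rules require a numeral in the first line of the page.
    if token.line_index != 0 or not token.text.isdecimal():
        return 0.0
    value = int(token.text)
    where = position(token, page_width)
    scores = [0.0]
    if where == "centre":                    # rule 1: centred numeral
        scores.append(0.8)
    if where == "right" and value % 2 == 1:  # rule 2: odd number on the right
        scores.append(0.9)
    if where == "left" and value % 2 == 0:   # rule 3: even number on the left
        scores.append(0.9)
    return max(scores)


for token in (Token("17", 1000, 0), Token("17", 1800, 0), Token("17", 200, 0)):
    print(token.text, position(token, 2000), page_number_confidence(token, 2000))
```

The result is a degree of confidence, not a yes/no decision, which mirrors the cautious "may be the page number" wording of the rules.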

Page 18: Structural analysis of documents Functional Extension Parser (FEP). Günter Mühlberger

Results

Basic rule set
– General features for books from 1700 to 2000
– Dataset of 155 books, 30,673 pages (141 training set, 41 evaluation set)
– All books were manually annotated (ground truth)

Recall, Precision, F-Measure
– E.g. 10 lines with headings in a book. We find 12 lines, 8 of them are correct, 4 are false.
– Recall = 8 of 10 = 0.8
– Precision = 8 of 12 ≈ 0.67
– F-Measure = 2*0.8*0.67/(0.8+0.67) ≈ 0.73

More explanations
– Important: We are counting lines, not structural items!
  E.g. if a heading consists of two lines (often with different typeface sizes), we have to find both to succeed
– Differences between training and evaluation sets are marginal
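
The heading example can be recomputed with a few lines of code. The function name `recall_precision_f` is an assumption made for this sketch; the formulas are the standard definitions of recall, precision and F-measure (harmonic mean). Without intermediate rounding, the example yields a recall of 0.80, a precision of about 0.67 and an F-measure of about 0.73.

```python
def recall_precision_f(true_lines: int, found_lines: int, correct_lines: int):
    """Return (recall, precision, F-measure) for line-based evaluation."""
    recall = correct_lines / true_lines if true_lines else 0.0
    precision = correct_lines / found_lines if found_lines else 0.0
    if recall + precision == 0:
        return recall, precision, 0.0
    f_measure = 2 * recall * precision / (recall + precision)
    return recall, precision, f_measure


# 10 heading lines in the ground truth, 12 lines found, 8 of them correct.
recall, precision, f_measure = recall_precision_f(10, 12, 8)
print(f"recall={recall:.2f} precision={precision:.2f} F={f_measure:.2f}")
```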

Page 19: Structural analysis of documents Functional Extension Parser (FEP). Günter Mühlberger

Results on Evaluation Set

                 Recall  Precision  F-measure
Running text     0.99    0.98       0.98
Footnotes        0.83    0.89       0.86
Page numbers     0.97    1          0.98
Running titles   0.97    1          0.98
Heading          0.85    0.80       0.82
Signature marks  0.68    0.89       0.77

Page 20: Structural analysis of documents Functional Extension Parser (FEP). Günter Mühlberger

Roadmap

Summer 2011: Beta version
– Integration into IMPACT Interoperability Platform
– Basic rules set: books from 1700 to 1900
End of the year: Version 1.0
– Full featured version
– Enhanced online correction interface
– FEP as a service, not as a product for local installation

Page 21: Structural analysis of documents Functional Extension Parser (FEP). Günter Mühlberger

Business offers

Web service for processing single volumes and correction
– Will be integrated into the eBooks-on-Demand (EOD) Network
– 30 libraries are already uploading their images to the OCR server in Innsbruck
– FEP will be an additional service for general material
– Similar offers can be made to other libraries or networks as well
Adaptation of the rule set
– For specific datasets much more can be detected than just the basic features
– E.g. journals with a fixed structure over many years or parliamentary papers, dissertations, research papers, etc.
Onsite installations
– Not our focus, but could be done for very large datasets or due to legal requirements (e.g. Google images)

Page 26: Structural analysis of documents Functional Extension Parser (FEP). Günter Mühlberger

Results: TOC

25 TOC entries in total
22 TOC entries are completely correct
1 TOC entry was missed
2 TOC entries are grouped incorrectly
1 TOC entry has no link
1 TOC entry has a wrong link

Page 27: Structural analysis of documents Functional Extension Parser (FEP). Günter Mühlberger

Thank you for your attention!
