7 information extraction - automated...

7 Information Extraction - AutomatedIndexing

Prof. Dr. Knut Hinkelmann 27 Information Extraction - Automated Indexing

Information ExtractionInformation Extraction is the automaticidentification and structured representation of relevant information in documents

extract well-defined pieces of relevant information from collections of documentgoal: populate a database (e.g. metadata)

General FunctionalityInput

Templates coding relevant information, e.g. metadata atributesset of real world texts

Outputset of instantiated templates filled withrelevant text fragments


Application Scenarios for Information Extraction

Indexing: Creating indexes for information retrieval systemsAutomated determination of metadata of documents

Question AnsweringAnswer an arbitrary question by using textual documents as knowledge base

Mail distributionIdentification of recipients in incoming letters of a company

Converting unstructured text to structured dataautomatic insertion of data into operative application systemsand databases

Evaluation of surveysCapturing and analysis of questionnaires


Information extraction depends on …

… structural degree of input datastructured: tables with typed data like numberssemi-structured: XML, tables with textnon-structured: text

… formatelectronic information

codednon-coded

paper documents

… structural degree of output datatext summaryfulltext indexstructured data: database, attributes, classification


7.1 Information Extraction from Text Documentstoken

scanner

lexicalanalysis

named entityrecognition

parsing

coreferenceresolution

templateunification

patternrecognition

featureidentification

classification

classification informationextraction


Lexical Analysis

Token scanner:Identification of text structure (e.g. paragraphs, title etc.) and special strings (tokens) like date, time, punctuationsHTML or XML-parsers can be applied for markupdocuments

Lexical analysis (morphology):Determination of word forms (singular-plural)Determination of the kind of word (verb,noun)

Part of Speech tagging, POS

in German: composita analysis (in German)

tokenscanner

lexicalanalysis


Automatic Classification

Each document is described by a setof features

Each class is described using thesame kind of features

A document is associated to theclass(es) where the features are mostsimilar. This can be tested using rulesor similarity measures.

ClassificationC(FD)

document D

classdescriptions

classdescriptions



ClassifierC

ClassifierC

featurerepresentation

FD of thedocument


FD of thedocument


classification


Rule-based Text ClassificationThe features are keywords that are either associated to a document as metadata or that occur in the documents

Example: Assume there are three classes: businesscomputer scienceinformation systems

The keywords in this example are: processOOPaccountingERPdatabase

The classifier can be represented as a set of rules:IF a documents has the keywords process, accounting, and ERP

THEN the document belongs to class „business“IF a documents has the keywords OOP and database

THEN the document belongs to class „computer science“IF a documents has the keywords process, database, and ERP

THEN the document belongs to class „information systems“


Fulltext Classification

In the full text classification, the features are the terms occuring in thedocuments (fulltext index)

The classes are represented as vectors

c1 c2 c3w11 w21 w31w12 w22 w32w13 w23 w33w14 w24 w34w15 w25 w35w16 w26 w36

t1t2t3t4t5t6

The classification of a document is computed using a well-known rankingfunction well-known from inforamtion retrieval (cosinus).


Automatic Learning of Classification Rules

Training phase:

A characteristic set of documentsis manually classified.

A learning component analysesthe features of the documents in the classes

ClassificationC(FD)

document D

classdescriptions

classdescriptions



ClassifierC

ClassifierC


FD of thedocument


FD of thedocument


Classification Methods

Specific Document classifiers, e.g.Linear Least Square Fit (LLSF)Latent Semantic Analysis (LSA)

Adaptation of general Classifiers, e.g. Decision Trees

Explicit rules to test document featuresK Nearest Neighbor

Documents are represented as vectorsA new document is compared with all documents of the trainingsetThe majority of the k most similar documents gives theclassification

ZentroidEach class is represented by a prototypical vector

Neural Network

class A class B

newdocument


Information Extraction

Example: From business news information about job changes shouldbe extractedSample text:

Peter SmithSusan WinterdirectorArconia Ltd31 March 2007

PersonOutPersonInPosition OrganizationDate

Peter Smith left Arconia Ltd. The former directorretired on 31 March 2007. His successor isSusan Winter. At the same time George Young became sales manager. He followed John Kelly.

John KellyGeorge Young sales managerArconia Ltd31 March 2007

PersonOutPersonInPosition OrganizationDate

Template Instancesthat should be extractedfrom the sample text


Named Entity Recognition

Mark into the text each string that represents a person, organization, or location name, or a date or time, or a currency or percentage figure.

Example:

<name type=person>Peter Smith</name>, left<name type=organisation>Arconia Ltd. </name>. The former director retired on <date>31 March2007</date>. His successor is <nametype=person>Susan Winter</name>. At the sametime <name type=person>George Young</name>became sales manager. He followed <nametype=person>John Kelly</name>.

lexicalanalysis



templateunification

tokenscanner

parsing


Parsing

Parsing: Identification of phrase structures: noun phrase (NP), verb phrase (VP), ..

S

NP VP

Peter Smith left NP

Arconia Ltd.

lexicalanalysis



templateunification

tokenscanner

parsing


Coreference ResolutionCapture information on corefering expressions, i.e. all mentions of a given entity, including those marked in NE and TE (nouns, noun phrases, pronouns).

Example:„the former director“ refers to „Peter Smith“„His“ refers to „Peter Smith“„He“ refers to „Georgs Young“„At the same time“ refers to „31 March 2007“

<name type=person>Peter Smith</name>, left <nametype=organisation>Arconia Ltd. </name>. The formerdirector retired on <date>31 March 2007</date>. His successor is <name type=person>Susan Winter</name>. At the same time <name type=person>George Young</name>became sales manager. He followed <nametype=person>John Kelly</name>.

lexicalanalysis



templateunification

tokenscanner

parsing


Template Unification

Information for instantiating a single template often isdistributed over multiple sentences. This informationhas to be collected and unified.

Template Unification can comprise multiple tasks:Template Element Recognition (TE)

Extract basic information related to organization, person, and artifact entities, drawing evidence from everywhere in the text

Scenario Template Recognition (ST)Extract prespecified event information and relate the event information to particular organization, person, or artifact entities.

Pattern Recognition (PR)Identification of domain specific patterns (“Microsoft founder” = “Bill Gates”

lexicalanalysis



templateunification

tokenscanner

parsing


imageimage Objects

layoutcharactes

KNOWLEDGE

Example:

domainknowledge

termslogical objectsmessage type

Interpretation

INFORMATION

7.2 Information Extraction from (semi-)structuredDocument

Integrated consideration of layout structurelogical structurecontent (semantics)

Source: A. Dengel, DFKI


Information Extraction using Layout, LogicalStructure and Content

Example: Letter

Address of Recipient

Layout: General Rules for position of address block

Structure: Recipient consists of nameand address

Recipient

Content: Knowledge aboutnamedentities and context„Dear Mr Trasher“

Office World Inc.Anvenue 101New York

Connecticut, 18.2.2006

Dear Mr Trasher

According to your offer from 16.2.2006 we order:

100 rack HU150 white50 office desk BT344 frey50 office chair BS 382 black

We expect the delivery until 28.2.1993

Yours sincerely,

Office Space Ltd.City Center 2201

Connecticut


Treatyclientproduct...

AXA ColoniaDread disease...

Guiding Extraction by Classification

Knowledge about documentstructure can target informationextraction

1. Classification:Assigning documents to predefined document classesFor the document classes thestructural objects are defined

2. Information ExtractionIdentification of relevant informationTargeted seach in structuralelements

?

articlestreaties lessonslearned

memos

documentsimilarity


Information Extraction from Markup Documents: XML

<researcher><name> Knut Hinkelmann </name><affiliation>

<university> FachhochschuleNordwestschweiz</university>

<group> Wirtschaftsinformatik</group><address>

<street> Riggenbachstrasse 16 </street><city> 4600 Olten </city>

</address> </affiliation><phone > ++41 62 286 00 80 </phone><email> [email protected] </email></researcher>

Predefined markup guidesinformation extraction and recognition:

Elements (tags, attributes)Structure

researcher

name affiliation phone email

university group address

street city


7.3 Information Extraction from Paper Documents

ScanningResult: Image of the document (non-coded information)

PreprocessingCorrectionOptical Character Recognition OCRIntelligent Character Recognition ICR (advanced OCR e.g. hand writing) Result: Content as text (coded information)

ClassificationResult: Document class (e.g. invoice of Hamilton Inc., ...)

Information extraktionResult: Relevant information in structured form (e.g. amount invoiced)


Information Extraction from forms

In forms the layout (position) determines the meaning of information

The layout must be known to therecognition system

The form must be sparated from theentries (content)

:


Types of documens

Fixed form

space for entries fixed

Dynamic form

forms with space for freeentries (text, tables)

Free documents

no predefined layout


Dokumentklassen

Um Informationen extrahieren zu können, muss der Aufbau der Dokumente bekann sein.

Dokumentklassen sind Dokumente mit gleichartigem Aufbau

Dokumentklassen steuern die InformationsextraktionZu jeder Dokumentklasse ist definiert, wo welche Information extrahiert wirdBeispiel: Rechnung: > Adresse > Kunden.-Nr.

> Bank > Bankleitzahl> Kontonummer > Betrag

Dokumentklassen können sehr spezifisch seinz.B. Rechnungsformular der Firma Meyer GmbHin diesem Fall ist genau bekannt, wo die gesucht Information zu finden ist

Dokumentklassen können sehr allgemein seinz.B. allgemeine Arztrechnungin diesem Fall ist mehr Aufwand bei der Suche nach Information auf dem Dokument notwendig


Elimination of lines:lines negatively influence OCR results

Noiseelimination

Phase 1: Preprocessing

Rotationcorrection

Uside-down-correction


Problems with OCR/ICR

Ambiguities

Wrong segmentation

Errors in


Phase 2: Clasification

Layout: lines, tables, ...

Using layout and logic structure as additional features for classification

predefined searchpatterns(regular expressions)

table structure and content ...


Definition of Document Classes in DocumentAnalysis Systems Document Definition Interface:

Use the mouse to marks areas withrelevant informationDefine search pattern, regularexpression (e.g.for date) etc. for theexpected information

insurance number

table


Phase 3: Information ExtractionExtract relevant Information from

Form fields with fixed position

Search patterns

Tables

Regular expression hiermit kündige ich zum 31.12.2003 mein Abonnement …


Field `Netto´

Field `Mwst´

Field `Brutto´

Phase 4: Automatic Verification

Logical verification: Checking logical or mathematical conditions

Nettosumme + Mehrwertsteuer = BruttosummeExpression: EQUAL(ROI(`Brutto´), SUM(ROI(`Netto´), ROI(`Mwst´)))

Database matching: Compare extracted ifnormation with content of a database (Levensthein distance)


Phase 5: Manual VerificationDocument Analysis Tools provide an interface for manual verifcation