7 information extraction - automated...
TRANSCRIPT
7 Information Extraction - AutomatedIndexing
Prof. Dr. Knut Hinkelmann 27 Information Extraction - Automated Indexing
Information ExtractionInformation Extraction is the automaticidentification and structured representation of relevant information in documents
extract well-defined pieces of relevant information from collections of documentgoal: populate a database (e.g. metadata)
General FunctionalityInput
Templates coding relevant information, e.g. metadata atributesset of real world texts
Outputset of instantiated templates filled withrelevant text fragments
Prof. Dr. Knut Hinkelmann 37 Information Extraction - Automated Indexing
Application Scenarios for Information Extraction
Indexing: Creating indexes for information retrieval systemsAutomated determination of metadata of documents
Question AnsweringAnswer an arbitrary question by using textual documents as knowledge base
Mail distributionIdentification of recipients in incoming letters of a company
Converting unstructured text to structured dataautomatic insertion of data into operative application systemsand databases
Evaluation of surveysCapturing and analysis of questionnaires
Prof. Dr. Knut Hinkelmann 47 Information Extraction - Automated Indexing
Information extraction depends on …
… structural degree of input datastructured: tables with typed data like numberssemi-structured: XML, tables with textnon-structured: text
… formatelectronic information
codednon-coded
paper documents
… structural degree of output datatext summaryfulltext indexstructured data: database, attributes, classification
Prof. Dr. Knut Hinkelmann 57 Information Extraction - Automated Indexing
7.1 Information Extraction from Text Documentstoken
scanner
lexicalanalysis
named entityrecognition
parsing
coreferenceresolution
templateunification
patternrecognition
featureidentification
classification
classification informationextraction
Prof. Dr. Knut Hinkelmann 67 Information Extraction - Automated Indexing
Lexical Analysis
Token scanner:Identification of text structure (e.g. paragraphs, title etc.) and special strings (tokens) like date, time, punctuationsHTML or XML-parsers can be applied for markupdocuments
Lexical analysis (morphology):Determination of word forms (singular-plural)Determination of the kind of word (verb,noun)
Part of Speech tagging, POS
in German: composita analysis (in German)
tokenscanner
lexicalanalysis
Prof. Dr. Knut Hinkelmann 77 Information Extraction - Automated Indexing
Automatic Classification
Each document is described by a setof features
Each class is described using thesame kind of features
A document is associated to theclass(es) where the features are mostsimilar. This can be tested using rulesor similarity measures.
ClassificationC(FD)
document D
classdescriptions
classdescriptions
featureidentification
featureidentification
ClassifierC
ClassifierC
featurerepresentation
FD of thedocument
featurerepresentation
FD of thedocument
featureidentification
classification
Prof. Dr. Knut Hinkelmann 87 Information Extraction - Automated Indexing
Rule-based Text ClassificationThe features are keywords that are either associated to a document as metadata or that occur in the documents
Example: Assume there are three classes: businesscomputer scienceinformation systems
The keywords in this example are: processOOPaccountingERPdatabase
The classifier can be represented as a set of rules:IF a documents has the keywords process, accounting, and ERP
THEN the document belongs to class „business“IF a documents has the keywords OOP and database
THEN the document belongs to class „computer science“IF a documents has the keywords process, database, and ERP
THEN the document belongs to class „information systems“
Prof. Dr. Knut Hinkelmann 97 Information Extraction - Automated Indexing
Fulltext Classification
In the full text classification, the features are the terms occuring in thedocuments (fulltext index)
The classes are represented as vectors
c1 c2 c3w11 w21 w31w12 w22 w32w13 w23 w33w14 w24 w34w15 w25 w35w16 w26 w36
t1t2t3t4t5t6
The classification of a document is computed using a well-known rankingfunction well-known from inforamtion retrieval (cosinus).
Prof. Dr. Knut Hinkelmann 107 Information Extraction - Automated Indexing
Automatic Learning of Classification Rules
Training phase:
A characteristic set of documentsis manually classified.
A learning component analysesthe features of the documents in the classes
ClassificationC(FD)
document D
classdescriptions
classdescriptions
featureidentification
featureidentification
ClassifierC
ClassifierC
featurerepresentation
FD of thedocument
featurerepresentation
FD of thedocument
Prof. Dr. Knut Hinkelmann 117 Information Extraction - Automated Indexing
Classification Methods
Specific Document classifiers, e.g.Linear Least Square Fit (LLSF)Latent Semantic Analysis (LSA)
Adaptation of general Classifiers, e.g. Decision Trees
Explicit rules to test document featuresK Nearest Neighbor
Documents are represented as vectorsA new document is compared with all documents of the trainingsetThe majority of the k most similar documents gives theclassification
ZentroidEach class is represented by a prototypical vector
Neural Network
class A class B
newdocument
Prof. Dr. Knut Hinkelmann 127 Information Extraction - Automated Indexing
Information Extraction
Example: From business news information about job changes shouldbe extractedSample text:
Peter SmithSusan WinterdirectorArconia Ltd31 March 2007
PersonOutPersonInPosition OrganizationDate
Peter Smith left Arconia Ltd. The former directorretired on 31 March 2007. His successor isSusan Winter. At the same time George Young became sales manager. He followed John Kelly.
John KellyGeorge Young sales managerArconia Ltd31 March 2007
PersonOutPersonInPosition OrganizationDate
Template Instancesthat should be extractedfrom the sample text
Prof. Dr. Knut Hinkelmann 137 Information Extraction - Automated Indexing
Named Entity Recognition
Mark into the text each string that represents a person, organization, or location name, or a date or time, or a currency or percentage figure.
Example:
<name type=person>Peter Smith</name>, left<name type=organisation>Arconia Ltd. </name>. The former director retired on <date>31 March2007</date>. His successor is <nametype=person>Susan Winter</name>. At the sametime <name type=person>George Young</name>became sales manager. He followed <nametype=person>John Kelly</name>.
lexicalanalysis
named entityrecognition
coreferenceresolution
templateunification
tokenscanner
parsing
Prof. Dr. Knut Hinkelmann 147 Information Extraction - Automated Indexing
Parsing
Parsing: Identification of phrase structures: noun phrase (NP), verb phrase (VP), ..
S
NP VP
Peter Smith left NP
Arconia Ltd.
lexicalanalysis
named entityrecognition
coreferenceresolution
templateunification
tokenscanner
parsing
Prof. Dr. Knut Hinkelmann 157 Information Extraction - Automated Indexing
Coreference ResolutionCapture information on corefering expressions, i.e. all mentions of a given entity, including those marked in NE and TE (nouns, noun phrases, pronouns).
Example:„the former director“ refers to „Peter Smith“„His“ refers to „Peter Smith“„He“ refers to „Georgs Young“„At the same time“ refers to „31 March 2007“
<name type=person>Peter Smith</name>, left <nametype=organisation>Arconia Ltd. </name>. The formerdirector retired on <date>31 March 2007</date>. His successor is <name type=person>Susan Winter</name>. At the same time <name type=person>George Young</name>became sales manager. He followed <nametype=person>John Kelly</name>.
lexicalanalysis
named entityrecognition
coreferenceresolution
templateunification
tokenscanner
parsing
Prof. Dr. Knut Hinkelmann 167 Information Extraction - Automated Indexing
Template Unification
Information for instantiating a single template often isdistributed over multiple sentences. This informationhas to be collected and unified.
Template Unification can comprise multiple tasks:Template Element Recognition (TE)
Extract basic information related to organization, person, and artifact entities, drawing evidence from everywhere in the text
Scenario Template Recognition (ST)Extract prespecified event information and relate the event information to particular organization, person, or artifact entities.
Pattern Recognition (PR)Identification of domain specific patterns (“Microsoft founder” = “Bill Gates”
lexicalanalysis
named entityrecognition
coreferenceresolution
templateunification
tokenscanner
parsing
Prof. Dr. Knut Hinkelmann 177 Information Extraction - Automated Indexing
imageimage Objects
layoutcharactes
KNOWLEDGE
Example:
domainknowledge
termslogical objectsmessage type
Interpretation
INFORMATION
7.2 Information Extraction from (semi-)structuredDocument
Integrated consideration of layout structurelogical structurecontent (semantics)
Source: A. Dengel, DFKI
Prof. Dr. Knut Hinkelmann 187 Information Extraction - Automated Indexing
Information Extraction using Layout, LogicalStructure and Content
Example: Letter
Address of Recipient
Layout: General Rules for position of address block
Structure: Recipient consists of nameand address
Recipient
Content: Knowledge aboutnamedentities and context„Dear Mr Trasher“
Office World Inc.Anvenue 101New York
Connecticut, 18.2.2006
Dear Mr Trasher
According to your offer from 16.2.2006 we order:
100 rack HU150 white50 office desk BT344 frey50 office chair BS 382 black
We expect the delivery until 28.2.1993
Yours sincerely,
Office Space Ltd.City Center 2201
Connecticut
Prof. Dr. Knut Hinkelmann 197 Information Extraction - Automated Indexing
Treatyclientproduct...
AXA ColoniaDread disease...
Guiding Extraction by Classification
Knowledge about documentstructure can target informationextraction
1. Classification:Assigning documents to predefined document classesFor the document classes thestructural objects are defined
2. Information ExtractionIdentification of relevant informationTargeted seach in structuralelements
?
articlestreaties lessonslearned
memos
documentsimilarity
Prof. Dr. Knut Hinkelmann 207 Information Extraction - Automated Indexing
Information Extraction from Markup Documents: XML
<researcher><name> Knut Hinkelmann </name><affiliation>
<university> FachhochschuleNordwestschweiz</university>
<group> Wirtschaftsinformatik</group><address>
<street> Riggenbachstrasse 16 </street><city> 4600 Olten </city>
</address> </affiliation><phone > ++41 62 286 00 80 </phone><email> [email protected] </email></researcher>
Predefined markup guidesinformation extraction and recognition:
Elements (tags, attributes)Structure
researcher
name affiliation phone email
university group address
street city
Prof. Dr. Knut Hinkelmann 217 Information Extraction - Automated Indexing
7.3 Information Extraction from Paper Documents
ScanningResult: Image of the document (non-coded information)
PreprocessingCorrectionOptical Character Recognition OCRIntelligent Character Recognition ICR (advanced OCR e.g. hand writing) Result: Content as text (coded information)
ClassificationResult: Document class (e.g. invoice of Hamilton Inc., ...)
Information extraktionResult: Relevant information in structured form (e.g. amount invoiced)
Prof. Dr. Knut Hinkelmann 227 Information Extraction - Automated Indexing
Information Extraction from forms
In forms the layout (position) determines the meaning of information
The layout must be known to therecognition system
The form must be sparated from theentries (content)
:
Prof. Dr. Knut Hinkelmann 237 Information Extraction - Automated Indexing
Types of documens
Fixed form
space for entries fixed
Dynamic form
forms with space for freeentries (text, tables)
Free documents
no predefined layout
Prof. Dr. Knut Hinkelmann 247 Information Extraction - Automated Indexing
Dokumentklassen
Um Informationen extrahieren zu können, muss der Aufbau der Dokumente bekann sein.
Dokumentklassen sind Dokumente mit gleichartigem Aufbau
Dokumentklassen steuern die InformationsextraktionZu jeder Dokumentklasse ist definiert, wo welche Information extrahiert wirdBeispiel: Rechnung: > Adresse > Kunden.-Nr.
> Bank > Bankleitzahl> Kontonummer > Betrag
Dokumentklassen können sehr spezifisch seinz.B. Rechnungsformular der Firma Meyer GmbHin diesem Fall ist genau bekannt, wo die gesucht Information zu finden ist
Dokumentklassen können sehr allgemein seinz.B. allgemeine Arztrechnungin diesem Fall ist mehr Aufwand bei der Suche nach Information auf dem Dokument notwendig
Prof. Dr. Knut Hinkelmann 257 Information Extraction - Automated Indexing
Elimination of lines:lines negatively influence OCR results
Noiseelimination
Phase 1: Preprocessing
Rotationcorrection
Uside-down-correction
Prof. Dr. Knut Hinkelmann 267 Information Extraction - Automated Indexing
Problems with OCR/ICR
Ambiguities
Wrong segmentation
Errors in
Prof. Dr. Knut Hinkelmann 277 Information Extraction - Automated Indexing
Phase 2: Clasification
Layout: lines, tables, ...
Using layout and logic structure as additional features for classification
predefined searchpatterns(regular expressions)
table structure and content ...
Prof. Dr. Knut Hinkelmann 287 Information Extraction - Automated Indexing
Definition of Document Classes in DocumentAnalysis Systems Document Definition Interface:
Use the mouse to marks areas withrelevant informationDefine search pattern, regularexpression (e.g.for date) etc. for theexpected information
insurance number
table
Prof. Dr. Knut Hinkelmann 297 Information Extraction - Automated Indexing
Phase 3: Information ExtractionExtract relevant Information from
Form fields with fixed position
Search patterns
Tables
Regular expression hiermit kündige ich zum 31.12.2003 mein Abonnement …
Prof. Dr. Knut Hinkelmann 307 Information Extraction - Automated Indexing
Field `Netto´
Field `Mwst´
Field `Brutto´
Phase 4: Automatic Verification
Logical verification: Checking logical or mathematical conditions
Nettosumme + Mehrwertsteuer = BruttosummeExpression: EQUAL(ROI(`Brutto´), SUM(ROI(`Netto´), ROI(`Mwst´)))
Database matching: Compare extracted ifnormation with content of a database (Levensthein distance)
Prof. Dr. Knut Hinkelmann 317 Information Extraction - Automated Indexing
Phase 5: Manual VerificationDocument Analysis Tools provide an interface for manual verifcation