
Page 1: Information Extraction A Practical Survey

Information Extraction: A Practical Survey

TALP Research Center, Dep. Llenguatges i Sistemes Informàtics
Universitat Politècnica de Catalunya
[email protected]

Mihai Surdeanu

Page 2: Information Extraction A Practical Survey

Overview
- What is information extraction?
- A "traditional" system and its problems
- Pattern learning and classification
- Beyond patterns

Page 3: Information Extraction A Practical Survey


What is information extraction?

The extraction or pulling out of pertinent information from large volumes of texts. (http://www.itl.nist.gov/iad/894.02/related_projects/muc/index.html)

Information extraction (IE) systems extract concepts, events, and relations that are relevant for a given scenario domain.

But, what is a concept, an event, or a scenario domain? Actual implementations of IE systems varied throughout the history of the task: MUC, Event99, EELD.

The tendency is to simplify the definition (or rather the implementation) of the task.

Page 4: Information Extraction A Practical Survey


Information Extraction at the Message Understanding Conferences

- Seven MUC conferences, between 1987 and 1998.
- Scenario domains driven by template specifications (fairly similar to database schemas), which define the content to be extracted.
- Each event fills exactly one template (fairly similar to a database record).
- Each template slot contains either text or pointers to other templates.
- The goal was to use IE technology to populate relational databases. This never really happened:
  - The chosen representation was too complicated.
  - The benchmarks were artificial and did not address real-world problems.
  - Systems never achieved good-enough accuracy.

Page 5: Information Extraction A Practical Survey


MUC-6 “Management Succession” Example

<SUCCESSION_EVENT-9301190125-1> :=
  SUCCESSION_ORG: <ORGANIZATION-9301190125-1>
  POST: "chief executive officer"
  IN_AND_OUT: <IN_AND_OUT-9301190125-1>
              <IN_AND_OUT-9301190125-2>
  VACANCY_REASON: REASSIGNMENT

<IN_AND_OUT-9301190125-1> :=
  IO_PERSON: <PERSON-9301190125-1>
  NEW_STATUS: IN
  ON_THE_JOB: UNCLEAR
  OTHER_ORG: <ORGANIZATION-9301190125-2>
  REL_OTHER_ORG: OUTSIDE_ORG
  COMMENT: "Barry Diller IN"
…
<ORGANIZATION-9301190125-1> :=
  ORG_NAME: "QVC Network Inc."
  ORG_TYPE: COMPANY

MUC-6 template. POST is a template slot with a text fill; SUCCESSION_ORG and IN_AND_OUT are slots that point to other templates.

Source text: …Barry Diller was appointed chief executive officer of QVC Network Inc…
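
To make the "text fill versus pointer fill" distinction concrete, here is a minimal sketch of the succession event above as plain Python dictionaries. The field names mirror the slide, but this in-memory representation is hypothetical and not part of any MUC system.

```python
# Hypothetical in-memory view of the MUC-6 templates above.
# Slots hold either literal strings (text fills) or references to
# other template dicts (pointer fills).
organization = {
    "id": "ORGANIZATION-9301190125-1",
    "ORG_NAME": "QVC Network Inc.",   # text fill
    "ORG_TYPE": "COMPANY",
}

in_and_out_1 = {
    "id": "IN_AND_OUT-9301190125-1",
    "NEW_STATUS": "IN",
    "ON_THE_JOB": "UNCLEAR",
    "COMMENT": "Barry Diller IN",
}

succession_event = {
    "id": "SUCCESSION_EVENT-9301190125-1",
    "SUCCESSION_ORG": organization,    # pointer fill: references another template
    "POST": "chief executive officer", # text fill
    "IN_AND_OUT": [in_and_out_1],      # a slot may point to several templates
    "VACANCY_REASON": "REASSIGNMENT",
}

print(succession_event["SUCCESSION_ORG"]["ORG_NAME"])  # -> QVC Network Inc.
```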

Page 6: Information Extraction A Practical Survey

Information Extraction at DARPA's HUB-4 Event99

- Was planned as a successor of MUC.
- Identification and extraction of relevant information is dictated by templettes, which are "flat", simplified templates. Slots are filled only with text; no pointers to other templettes are accepted.
- Domains closer to real-world applications are addressed: natural disasters, bombings, deaths, elections, financial fluctuations, illness outbreaks.
- The goal was to provide event-level indexing into documents such as news wires, radio and television transcripts, etc. Imagine querying "BOMBING AND Gaza" over news messages and retrieving only the relevant text about bombing events in the Gaza area, classified into templettes.

Event99: A Proposed Event Indexing Task For Broadcast News. Lynette Hirschman et al. (http://citeseer.nj.nec.com/424439.html)

Page 7: Information Extraction A Practical Survey

Event99 "Death" Example: Templettes Versus Templates

The sole survivor of the car crash that killed Princess Diana and Dodi Fayed last year in France is remembering more about the accident.

<DEATH-CNN3-1> :=
  DECEASED: "Princess [Diana]" / "[Dodi Fayed]"
  MANNER_OF_DEATH: "the car [crash] that killed Princess Diana and Dodi Fayed" / "the [accident]"
  LOCATION: "in [France]"
  DATE: "last [year]"

Compare with the MUC-6 <SUCCESSION_EVENT-9301190125-1> template shown on Page 5: the templette is flat, and every slot is filled with text extracted directly from the document.

Page 8: Information Extraction A Practical Survey

Information Extraction at DARPA's Evidence Extraction and Link Detection (EELD) Program

- IE used as a tool for the more general problem of link discovery: sift through large data collections and derive complex rules from collections of simpler IE patterns.
- Example: certain sets of account_number(Person, Account), deposit(Account, Amount), greater_than(Amount, reporting_amount) patterns imply is_a(Person, money_launderer). Note: the fact that Person is a money_launderer is not stated in any form in the text! (A small sketch of this inference follows at the end of this slide.)
- IE is used to identify concepts (typically named entities), events (typically identified by trigger words), and basic entity-entity and entity-event relations.
- Simpler IE problem:
  - No templates or templettes are generated.
  - No event merging.
  - Events are always marked by trigger words, e.g. "murder" triggers a MURDER event.
  - Relations are always intra-sentential.

EELD web portal: http://www.rl.af.mil/tech/programs/eeld/
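
A minimal sketch of the link-discovery idea from the example above: IE produces atomic facts, and a hand-written rule over those facts infers something never stated in the text. The fact store, the threshold value, and the rule encoding are illustrative assumptions, not part of EELD.

```python
# Extracted facts as (predicate, arguments) tuples; the values are illustrative.
facts = {
    ("account_number", ("John Smith", "ACC-1")),
    ("deposit", ("ACC-1", 45000)),
    ("deposit", ("ACC-1", 30000)),
}
REPORTING_AMOUNT = 10000  # hypothetical reporting threshold


def is_money_launderer(person):
    """Rule: the person owns an account that received a deposit above the threshold."""
    accounts = {acc for pred, (p, acc) in facts
                if pred == "account_number" and p == person}
    return any(pred == "deposit" and acc in accounts and amount > REPORTING_AMOUNT
               for pred, (acc, amount) in facts)


print(is_money_launderer("John Smith"))  # True, although no sentence says so explicitly
```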

Page 9: Information Extraction A Practical Survey


EELD Example

John Smith is the chief scientist of Hardcom Corporation.

Entities: Person(John Smith), Organization(Hardcom Corporation)
Events: --
Relations: person-affiliation(Person(John Smith), Organization(Hardcom Corporation))

The murder of John Smith…

Entities: Person(John Smith)
Events: Murder(murder)
Relations: murder-victim(Person(John Smith), Murder(murder))

Page 10: Information Extraction A Practical Survey

Overview
- What is information extraction?
- A "traditional" system and its problems
- Pattern learning and classification
- Beyond patterns

Page 11: Information Extraction A Practical Survey

Traditional IE Architecture

- The Finite State Automaton Text Understanding System (FASTUS) approach: cascaded finite state automata (FSA).
- Each FSA level recognizes larger linguistic constructs (from tokens to chunks to clauses to domain patterns), which become the simplified input for the next FSA in the cascade.
- Why? Speed. Robustness to unstructured input. Handles data sparsity well.

The FSA cascade is enriched with limited discourse processing components: coreference resolution and event merging.

Most systems in MUC ended up using this architecture: CIRCUS from UMass (was actually the first to introduce the cascaded FSA architecture), PROTEUS (NYU), PLUM (BBN), CICERO (LCC) and many others.

An ocean of information available:
- FASTUS: A Cascaded Finite-State Transducer for Extracting Information from Natural-Language Text. Jerry R. Hobbs et al. http://www.ai.sri.com/natural-language/projects/fastus-schabes.html
- Infrastructure for Open-Domain Information Extraction. Mihai Surdeanu and Sanda Harabagiu. http://www.languagecomputer.com/papers/hlt2002.pdf
- A rich IE bibliography maintained by Horacio Rodriguez: http://www.lsi.upc.es/~horacio/varios/sevilla2001.zip

Page 12: Information Extraction A Practical Survey

Language Computer's CICERO Information Extraction System

[Block diagram: Documents flow in, Templettes/Templates flow out. The processing stages and their roles:]
- known word recognition: recognizes known concepts using lexicons and gazetteers
- numerical-entity recognition: identifies numerical entities such as money, percents, dates and times (FSA)
- named-entity recognition: identifies named entities such as person, location, and organization names (FSA); also available as a stand-alone named-entity recognizer
- name aliasing: disambiguates incomplete or ambiguous names
- phrasal parser: identifies basic noun, verb, and particle phrases (TBL + FSA)
- phrase combiner: identifies domain-dependent complex noun and verb phrases (FSA)
- entity coreference resolution: detects pronominal and nominal coreference links
- domain pattern recognition: identifies domain-dependent patterns (FSA)
- event coreference: resolves empty templette slots
- event merging: merges templettes belonging to the same event

Page 13: Information Extraction A Practical Survey

Walk-Through Example (1/5)

Source text: At least seven police officers were killed and as many as 52 other people, including several children, were injured Monday in a car bombing that also wrecked a police station. Kirkuk's police said they had "good information" that Ansar al-Islam was behind the blast.

Target templette:
<BOMBING> :=
  BOMB: "a car bombing"
  PERPETRATOR: "Ansar al-Islam"
  DEAD: "At least seven police officers"
  INJURED: "as many as 52 other people, including several children"
  DAMAGE: "a police station"
  LOCATION: "Kirkuk"
  DATE: "Monday"

Page 14: Information Extraction A Practical Survey

Walk-Through Example (2/5)

Lexicon + numerical entities + NER:
At least seven/NUMBER police officers were killed and as many as 52/NUMBER other people, including several children, were injured Monday/DATE in a car bombing that also wrecked a police station. Kirkuk/LOC's police said they had "good information" that Ansar al-Islam/ORG was behind the blast.

Phrasal parser:
At least seven police [officers]/NP were [killed]/VP and as many as 52 other [people]/NP, [including]/VP several [children]/NP, were [injured]/VP [Monday]/NP in a car [bombing]/NP that also [wrecked]/VP a police [station]/NP. [Kirkuk]/NP's [police]/NP [said]/VP [they]/NP [had]/NP "good [information]"/NP that [Ansar al-Islam]/NP [was]/VP behind the [blast]/NP.

Page 15: Information Extraction A Practical Survey

Walk-Through Example (3/5)

Complex phrase detection:
At least seven police [officers]/NP were [killed]/VP and as many as 52 other [people, including several children]/NP, were [injured]/VP [Monday]/NP in a car [bombing]/NP that also [wrecked]/VP a police [station]/NP. [Kirkuk]/NP's [police]/NP [said]/VP [they]/NP [had]/NP "good [information]"/NP that [Ansar al-Islam]/NP [was]/VP behind the [blast]/NP.

Entity coreference resolution:
they → The police
the blast → a car bombing

Page 16: Information Extraction A Practical Survey

Walk-Through Example (4/5)

Domain pattern recognition:
At least seven police officers were killed/PATTERN and as many as 52 other people, including several children, were injured Monday in a car bombing/PATTERN {car bombing} that also wrecked a police station/PATTERN. Kirkuk's police said they had "good information" that Ansar al-Islam was behind the blast/PATTERN.

Resulting templettes:
TEMPLETTE
  DEAD: "At least seven police officers"

TEMPLETTE
  BOMB: "a car bombing"
  INJURED: "as many as 52 other people, including several children"
  DATE: "Monday"

TEMPLETTE
  BOMB: "a car bombing"
  DAMAGE: "a police station"

TEMPLETTE
  BOMB: "a car bombing"
  PERPETRATOR: "Ansar al-Islam"

Page 17: Information Extraction A Practical Survey

Walk-Through Example (5/5)

Input: the four templettes produced in the previous step.

Event coreference (resolves empty templette slots, here DATE and LOCATION):
TEMPLETTE
  BOMB: "a car bombing"
  DEAD: "At least seven police officers"
  DATE: "Monday"
  LOCATION: "Kirkuk"

TEMPLETTE
  BOMB: "a car bombing"
  INJURED: "as many as 52 other people, including several children"
  DATE: "Monday"
  LOCATION: "Kirkuk"

TEMPLETTE
  BOMB: "a car bombing"
  DAMAGE: "a police station"
  DATE: "Monday"
  LOCATION: "Kirkuk"

TEMPLETTE
  BOMB: "a car bombing"
  PERPETRATOR: "Ansar al-Islam"
  DATE: "Monday"
  LOCATION: "Kirkuk"

Event merging (merges templettes belonging to the same event):
TEMPLETTE
  BOMB: "a car bombing"
  PERPETRATOR: "Ansar al-Islam"
  DEAD: "At least seven police officers"
  INJURED: "as many as 52 other people, including several children"
  DAMAGE: "a police station"
  DATE: "Monday"
  LOCATION: "Kirkuk"

Page 18: Information Extraction A Practical Survey

Coreference for IE

Algorithm detailed in: Recognizing Referential Links: An Information Extraction Perspective. Megumi Kameyama. http://citeseer.nj.nec.com/kameyama97recognizing.html

A 3-step algorithm (a minimal sketch follows below):
1. Identify all anaphoric entities, e.g. pronouns, nouns, ambiguous named entities.
2. For each anaphoric entity, identify all possible candidates and sort them according to some salience ordering, e.g. left-to-right traversal in the same sentence, right-to-left traversal in previous sentences.
3. Extract the first candidate that matches some semantic constraints, e.g. number and gender consistency. Merge the candidate with the anaphoric entity.
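
A minimal sketch of the salience-ordered resolution described above, assuming mentions carry a sentence index, a token position, number, and gender. The data model is hypothetical and much simpler than Kameyama's actual algorithm.

```python
from dataclasses import dataclass


@dataclass
class Mention:
    text: str
    sent: int     # sentence index
    pos: int      # token position inside the sentence
    number: str   # "sg" or "pl"
    gender: str   # "m", "f", or "n"


def resolve(anaphor, candidates):
    """Pick the first salience-ordered candidate that satisfies the constraints."""
    same = sorted([c for c in candidates if c.sent == anaphor.sent and c.pos < anaphor.pos],
                  key=lambda c: c.pos)               # left-to-right in the same sentence
    prev = sorted([c for c in candidates if c.sent < anaphor.sent],
                  key=lambda c: (-c.sent, -c.pos))   # right-to-left in previous sentences
    for cand in same + prev:
        if cand.number == anaphor.number and cand.gender == anaphor.gender:
            return cand                              # merge the anaphor with this candidate
    return None


police = Mention("Kirkuk's police", sent=1, pos=0, number="pl", gender="n")
they = Mention("they", sent=1, pos=4, number="pl", gender="n")
print(resolve(they, [police]).text)  # -> Kirkuk's police
```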

Page 19: Information Extraction A Practical Survey


The Role of Coreference in Named Entity Recognition

Classifies unknown named entities that are likely part of a name but cannot be identified as such due to insufficient local context.

Example: “Michigan National Corp./ORG said it will eliminate some senior management jobs… Michigan National/? said the restructuring…”

Disambiguates named entities of ambiguous length and/or ambiguous type.

“Michigan” changed from LOC to ORG when “Michigan Corp.” appears in the same context.

The text "McDonald's" may contain a person name "McDonald" or an organization name "McDonald's". A non-deterministic FSA is used to maintain both alternatives until after name aliasing, when one is selected.

Disambiguates headline named entities. Headlines are typically capitalized, e.g. "McDermott Completes Sale":
- Processing of headlines is postponed until after the body of the text is processed.
- A "longest-match" approach is used to match the headline token sequence against entities found in the first body paragraph. For example, "McDermott" is labeled as ORG because it matches over "McDermott International Inc." in the first document paragraph. (A small sketch follows at the end of this slide.)

Over 5% increase in accuracy (F-measure): from 87.81% to 93.64%.
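
A minimal sketch of the longest-match idea: headline token sequences are matched against entities already found in the body, and the longest overlapping entity lends its type to the headline tokens. Tokenization and the entity list are illustrative assumptions.

```python
def label_headline(headline_tokens, body_entities):
    """body_entities: list of (entity_tokens, entity_type) found in the first body paragraph."""
    labels = [None] * len(headline_tokens)
    for i in range(len(headline_tokens)):
        best = None
        for ent_tokens, ent_type in body_entities:
            n = 0
            while (i + n < len(headline_tokens) and n < len(ent_tokens)
                   and headline_tokens[i + n] == ent_tokens[n]):
                n += 1                       # length of the overlap starting at position i
            if n > 0 and (best is None or n > best[0]):
                best = (n, ent_type)         # keep the longest match
        if best:
            for j in range(best[0]):
                labels[i + j] = best[1]
    return labels


body = [(["McDermott", "International", "Inc."], "ORG")]
print(label_headline(["McDermott", "Completes", "Sale"], body))
# -> ['ORG', None, None]
```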

Page 20: Information Extraction A Practical Survey

The Role of Coreference in IE

[Figure: F-measure as a function of the number of extraction rules (1 to 25) for three configurations: NE only; NE + entity coreference; NE + entity + event coreference.]

Page 21: Information Extraction A Practical Survey

The Good
- Relatively good performance with a simple system: F-measures over 75% and up to 88% for some simpler Event99 domains.
- Execution times below 10 seconds per 5KB document.
- Improvements to the FSA-only approach:
  - Coreference almost doubles the FSA-only performance.
  - More extraction rules add little to the IE performance, whereas different forms of coreference add more.
  - Non-determinism is used to mitigate the limited power of FSA grammars.

Page 22: Information Extraction A Practical Survey

The Bad
- Needs domain-specific lexicons, e.g. an ontology of bombing devices. Work to automate this process: Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping. Ellen Riloff and Rosie Jones. http://www.cs.utah.edu/~riloff/psfiles/aaai99.pdf (not covered in this presentation)
- Domain-specific patterns must be developed, e.g. "<SUBJECT> explode".
- Patterns must be classified: what does the above pattern mean? Is the subject a bomb, a perpetrator, a location?
- Patterns cannot cover the flexibility of natural language. We need better models that go beyond the pattern limitations.
- Event merging is another NP-complete problem. One of the few stochastic models for event merging: Probabilistic Coreference in Information Extraction. Andrew Kehler. http://ling.ucsd.edu/~kehler/Papers/emnlp97.ps.gz (not covered in this presentation)
- All of the above resources are developed manually, which yields high domain development time (more than 40 person-hours per domain). This prohibits the use of this approach for "real-time" information extraction.

Page 23: Information Extraction A Practical Survey

Overview
- What is information extraction?
- A "traditional" system and its problems
- Pattern learning and classification
- Beyond patterns

Page 24: Information Extraction A Practical Survey

Automatically Generating Extraction Patterns from Untagged Text

The first system to successfully discover domain patterns: AutoSlog-TS.

Automatically Generating Extraction Patterns from Untagged Text. Ellen Riloff. http://www.cs.utah.edu/~riloff/psfiles/aaai96.pdf

The intuition is that domain-specific patterns will appear more often in documents related to the domain of interest than in unrelated documents.

Page 25: Information Extraction A Practical Survey


Weakly-Supervised Pattern Learning Algorithm (1/2)

1. Separate the training document set into relevant and irrelevant documents (manual process).

2. Generate all possible patterns in all documents, according to some meta-patterns. Examples below.

Meta pattern          Pattern
<subj> active-verb    <perpetrator> bombed
active-verb <dobj>    bombed <target>
infinitive <dobj>     to kill <victim>
gerund <dobj>         killing <victim>
<np> prep <np>        <bomb> against <target>

Page 26: Information Extraction A Practical Survey

Weakly-Supervised Pattern Learning Algorithm (2/2)

3. Rank all generated patterns according to the formula relevance_rate × log2(frequency), where relevance_rate indicates the ratio of relevant instances of the pattern (i.e. instances in relevant documents versus non-relevant documents), and frequency indicates the number of times the pattern was seen in relevant documents.
4. Add the top-ranked pattern to the list of learned patterns, and mark all documents where the pattern appears as relevant.
5. Repeat the process from Step 3 for N iterations. The output of the algorithm is therefore N learned patterns. (A minimal sketch of steps 3-5 follows below.)
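
A minimal sketch of the ranking loop in steps 3-5, assuming candidate patterns have already been generated and we know, for each pattern, which documents it occurs in. Pattern matching itself is abstracted away, instance counts are approximated by document counts, and the data structures are illustrative.

```python
import math


def learn_patterns(pattern_docs, relevant_docs, n_iterations):
    """
    pattern_docs: dict mapping a candidate pattern to the set of document ids it occurs in.
    relevant_docs: set of document ids initially marked relevant (step 1, done manually).
    Returns the n_iterations top-ranked patterns (steps 3-5).
    """
    relevant = set(relevant_docs)
    learned = []
    for _ in range(n_iterations):
        best, best_score = None, float("-inf")
        for pattern, docs in pattern_docs.items():
            if pattern in learned:
                continue
            freq = len(docs & relevant)          # occurrences in relevant documents
            if freq == 0:
                continue
            relevance_rate = freq / len(docs)    # relevant occurrences / all occurrences
            score = relevance_rate * math.log2(freq)
            if score > best_score:
                best, best_score = pattern, score
        if best is None:
            break
        learned.append(best)                     # step 4: keep the top-ranked pattern ...
        relevant |= pattern_docs[best]           # ... and mark its documents as relevant
    return learned
```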

Page 27: Information Extraction A Practical Survey

Examples of Learned Patterns

Patterns learned for the MUC-4 "terrorism" domain:

<subj> exploded

murder of <np>

assassination of <np>

<subj> was killed

<subj> was kidnapped

attack on <np>

<subj> was injured

exploded in <np>

death of <np>

<subj> took_place

Page 28: Information Extraction A Practical Survey

The Good and the Bad

The good:
- Performance very close to the manually customized system.

The bad:
- Documents must be separated into relevant/irrelevant by hand.
- When does the learning process stop?
- Pattern classification and event merging are still developed by human experts.

Page 29: Information Extraction A Practical Survey

The ExDisco IE System

- Automatic Acquisition of Domain Knowledge for Information Extraction. Roman Yangarber et al. http://www.cs.nyu.edu/roman/Papers/2000-coling-pub.ps.gz
- Quasi-automatically separates documents into relevant/non-relevant using a set of "seed" patterns selected by the user, e.g. <company> appoint-verb <person> for the MUC-6 "management succession" domain.
- In addition to ranking patterns, ExDisco ranks documents based on how many relevant patterns they contain, which has an immediate application to text filtering.

Page 30: Information Extraction A Practical Survey

Counter-Training for Pattern Discovery

- Counter-Training in Discovery of Semantic Patterns. Roman Yangarber. http://www.cs.nyu.edu/roman/Papers/2003-acl-countertrain-web.pdf
- Previous approaches are iterative learning algorithms whose output is a continuous stream of patterns with degrading precision. What is the best stopping point?
- The approach is to introduce competition among multiple scenario learners (e.g. management succession, mergers and acquisitions, legal actions). Stop when the learners wander into territories already discovered by others.
- Pattern frequency is weighted by document relevance. Document relevance receives a negative weight based on how many patterns from a different scenario the document contains. The learning for each scenario stops when the best pattern has a negative score.

Page 31: Information Extraction A Practical Survey

Pattern Classification

Multiple systems perform successful pattern acquisition by now, e.g. "attacked <np>" is discovered for the bombing domain. But what does the <np> actually mean? Is it the victim, the physical target, or something else?

An Empirical Approach to Conceptual Case Frame Acquisition. Ellen Riloff and Mark Schmelzenbach. http://www.cs.utah.edu/~riloff/psfiles/wvlc98.pdf

Page 32: Information Extraction A Practical Survey

Pattern Classification Algorithm

- Requires 5 seed words per semantic category (e.g. PERPETRATOR, VICTIM, etc.).
- Builds a context for each semantic category by expanding the seed word set with words that appear frequently in the proximity of previous seed words.
- Uses AutoSlog to discover domain patterns.
- Builds a semantic profile for each discovered pattern based on the overlap between the noun phrases contained in the pattern and the previous semantic contexts.
- Each pattern is associated with the best-ranked semantic category. (A minimal sketch of the profiling step follows below.)
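
A minimal sketch of the last two steps: score a discovered pattern against each semantic category by the overlap between the noun phrases the pattern extracts and the category's context words, then assign the best-scoring category. The category contexts, the extracted noun phrases, and the scoring itself are illustrative simplifications of the case-frame acquisition method.

```python
def semantic_profile(extracted_nps, category_words):
    """
    extracted_nps: noun-phrase strings filled by one pattern across the corpus.
    category_words: dict category -> set of context words built from the seed words.
    Returns dict category -> fraction of extracted NPs overlapping that category.
    """
    profile = {}
    for category, words in category_words.items():
        hits = sum(1 for np in extracted_nps
                   if any(tok.lower() in words for tok in np.split()))
        profile[category] = hits / len(extracted_nps) if extracted_nps else 0.0
    return profile


categories = {
    "BUILDING": {"embassy", "station", "office", "building"},
    "CIVILIAN": {"people", "children", "civilians", "residents"},
}
nps = ["a police station", "the US embassy", "several children"]
profile = semantic_profile(nps, categories)
print(max(profile, key=profile.get), profile)  # best-ranked category for the pattern
```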

Page 33: Information Extraction A Practical Survey

Pattern Classification Example

Semantic profile for the pattern "attack on <np>":

Semantic Category   Probability
BUILDING            0.10
CIVILIAN            0.03
DATE                0.05
GOVOFFICIAL         0.03
LOCATION            0.03
MILITARYPEOPLE      0.09
TERRORIST           0.00
VEHICLE             0.03
WEAPON              0.00

Page 34: Information Extraction A Practical Survey

Other Pattern-Learning Systems: RAPIER (1/2)

- Relational Learning of Pattern-Match Rules for Information Extraction. Mary Elaine Califf and Raymond J. Mooney. http://citeseer.nj.nec.com/califf98relational.html
- Uses Inductive Logic Programming (ILP) to implement a bottom-up generalization of patterns.
- Patterns are specified with pre-fillers (conditions on the tokens preceding the pattern), fillers (conditions on the tokens included in the pattern), and post-fillers (conditions on the tokens following the pattern).
- The only linguistic resource used is a part-of-speech (POS) tagger. No parser (full or partial) is used!
- More robust to unstructured text; applicability limited to simpler domains (e.g. job postings).

Page 35: Information Extraction A Practical Survey

Other Pattern-Learning Systems: RAPIER (2/2)

Rule for "located in Atlanta, Georgia":
  Pre-filler:  word: located, tag: VBN; word: in, tag: IN
  Filler:      word: Atlanta, tag: NNP
  Post-filler: word: ",", tag: ","; word: Georgia, tag: NNP

Rule for "offices in Kansas City, Missouri":
  Pre-filler:  word: offices, tag: NNS; word: in, tag: IN
  Filler:      word: Kansas, tag: NNP; word: City, tag: NNP
  Post-filler: word: ",", tag: ","; word: Missouri, tag: NNP

Generalized rule:
  Pre-filler:  word: in, tag: IN
  Filler:      list: len: 2, tag: NNP
  Post-filler: word: ",", tag: ","; semantic: STATE, tag: NNP

Page 36: Information Extraction A Practical Survey

Other Pattern-Learning Systems

SRV:
- Toward General-Purpose Learning for Information Extraction. Dayne Freitag. http://citeseer.nj.nec.com/freitag98toward.html
- Supervised machine learning based on FOIL; constructs Horn clauses from examples.

Active learning:
- Active Learning for Information Extraction with Multiple View Feature Sets. Rosie Jones et al. http://www.cs.utah.edu/~riloff/psfiles/ecml-wkshp03.pdf
- Active learning with multiple views. Ion Muslea. http://www.ai.sri.com/~muslea/PS/dissertation-02.pdf
- Interactively learn and annotate data to reduce the human effort of data annotation.

Page 37: Information Extraction A Practical Survey

Overview
- What is information extraction?
- A "traditional" system and its problems
- Pattern learning and classification
- Beyond patterns

Page 38: Information Extraction A Practical Survey

The Need to Move Beyond the Pattern-Based Paradigm (1/2)

The space shuttle Challenger/AGENT_OF_DEATH flew apart over Florida like a billion-dollar confetti killing/MANNER_OF_DEATH six astronauts/DECEASED.

[Parse tree of the sentence: the NP "The space shuttle Challenger" carries AGENT_OF_DEATH, the verb "killing" carries MANNER_OF_DEATH, the NP "six astronauts" carries DECEASED, and "over Florida" is a locative PP.]

Hard using surface-level information; easier using full parse trees.

Page 39: Information Extraction A Practical Survey

The Need to Move Beyond the Pattern-Based Paradigm (2/2)

Pattern-based systems:
- Have limited power due to the strict formalism: accuracy < 60% without additional discourse processing.
- Were also developed due to a historical circumstance: there was no high-performance full parser widely available.

Recent NLP developments:
- Full syntactic parsing at roughly 90% accuracy [Collins, 1997], [Charniak, 2000].
- Predicate-argument frames provide an open-domain event representation [Surdeanu et al., 2003], [Gildea and Jurafsky, 2002], [Gildea and Palmer, 2002].

Page 40: Information Extraction A Practical Survey

Goal

Novel IE paradigm:
- Syntactic representation provided by a full parser.
- Event representation based on predicate-argument frames.
- Entity coreference provides pronominal and nominal anaphora resolution (future work).
- Event merging merges similar/overlapping events (future work).

Advantages:
- High accuracy due to enhanced syntactic and semantic processing.
- Minimal domain customization time, because most components are open-domain.

Page 41: Information Extraction A Practical Survey

Proposition Bank Overview

- A one-million-word corpus annotated with predicate-argument structures [Kingsbury, 2002]. Currently only predicates lexicalized by verbs.
- Numbered arguments from 0 to 5. Typically ARG0 = agent, ARG1 = direct object or theme, ARG2 = indirect object, benefactive, or instrument, but they are predicate dependent!
- Functional tags: ARGM-LOC = locative, ARGM-TMP = temporal, ARGM-DIR = direction.

[Parse tree for "The futures halt was assailed by Big Board floor traders": ARG1 = "The futures halt" (entity assailed), PRED = "assailed", ARG0 = "Big Board floor traders" (agent).]

Page 42: Information Extraction A Practical Survey

Block Architecture

[Block diagram. Documents in, Templettes out.]
- Open-domain components: named-entity recognizer, syntactic parser, identification of pred-arg structures, entity coreference.
- Domain-specific components: mapping pred-arg structures to templettes, event merging.

Page 43: Information Extraction A Practical Survey

Walk-Through Example

The space shuttle Challenger flew apart over Florida like a billion-dollar confetti killing six astronauts.

[Parse tree with predicate-argument structure: ARG0 = "The space shuttle Challenger" → AGENT_OF_DEATH, PRED = "killing" → MANNER_OF_DEATH, ARG1 = "six astronauts" → DECEASED, with "over Florida" as a locative PP.]

Page 44: Information Extraction A Practical Survey

The Model

- Consists of two tasks: (1) identifying the parse tree constituents corresponding to predicate arguments, and (2) assigning a role to each argument constituent.
- Both tasks are modeled using C5.0 decision tree learning and two sets of features: Feature Set 1, adapted from [Gildea and Jurafsky, 2002], and Feature Set 2, a novel set of semantic and syntactic features.

[Parse tree for "The futures halt was assailed by Big Board floor traders": Task 1 identifies the argument constituents of the PRED "assailed"; Task 2 assigns the roles ARG1 and ARG0.]

Page 45: Information Extraction A Practical Survey

Feature Set 1

- PHRASE TYPE (pt): type of the syntactic phrase used as argument. E.g. NP for ARG1.
- PARSE TREE PATH (path): path between the argument and the predicate. E.g. NP S VP VP for ARG1.
- PATH LENGTH (pathLen): number of labels stored in the predicate-argument path. E.g. 4 for ARG1.
- POSITION (pos): indicates whether the constituent appears before the predicate in the sentence. E.g. true for ARG1 and false for ARG0.
- VOICE (voice): predicate voice (active or passive). E.g. passive for PRED.
- HEAD WORD (hw): head word of the evaluated phrase. E.g. "halt" for ARG1.
- GOVERNING CATEGORY (gov): indicates whether an NP is dominated by an S phrase or a VP phrase. E.g. S for ARG1, VP for ARG0.
- PREDICATE WORD: the verb with morphological information preserved (verb), and the verb normalized to lower case and infinitive form (lemma). E.g. for PRED, verb is "assailed" and lemma is "assail".

(A minimal feature-extraction sketch follows at the end of this slide.)

[Example parse tree: "The futures halt" = ARG1, "assailed" = PRED, "Big Board floor traders" = ARG0.]
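
A minimal sketch of computing a few of the Feature Set 1 values from a constituency tree, using NLTK's Tree class on the example sentence. The head-word heuristic, the path computation, and the choice of passing the predicate's VP position are simplifying assumptions, not the system's actual implementation.

```python
from nltk import Tree

sentence = Tree.fromstring(
    "(S (NP (DT The) (NNS futures) (NN halt))"
    " (VP (VBD was) (VP (VBN assailed)"
    " (PP (IN by) (NP (NNP Big) (NNP Board) (NN floor) (NNS traders))))))")


def fs1_features(tree, arg_pos, pred_pos):
    """arg_pos: tree position of the argument constituent; pred_pos: position of the
    VP immediately dominating the predicate verb (both as nltk tree-position tuples)."""
    arg, pred = tree[arg_pos], tree[pred_pos]

    # PARSE TREE PATH: labels from the argument up to the lowest common ancestor,
    # then down to the predicate constituent.
    common = 0
    while common < min(len(arg_pos), len(pred_pos)) and arg_pos[common] == pred_pos[common]:
        common += 1
    up = [tree[arg_pos[:i]].label() for i in range(len(arg_pos), common, -1)]
    down = [tree[pred_pos[:i]].label() for i in range(common, len(pred_pos) + 1)]
    path = up + down

    def first_leaf_index(pos):
        return tree.leaves().index(tree[pos].leaves()[0])

    return {
        "pt": arg.label(),                                              # PHRASE TYPE
        "path": " ".join(path),                                         # PARSE TREE PATH
        "pathLen": len(path),                                           # PATH LENGTH
        "pos": first_leaf_index(arg_pos) < first_leaf_index(pred_pos),  # POSITION
        "hw": arg.leaves()[-1],                                         # HEAD WORD (naive rightmost token)
        "verb": pred.leaves()[0],                                       # PREDICATE WORD (surface form)
    }


# ARG1 = "The futures halt" at tree position (0,); the predicate VP is at (1, 1).
print(fs1_features(sentence, arg_pos=(0,), pred_pos=(1, 1)))
# {'pt': 'NP', 'path': 'NP S VP VP', 'pathLen': 4, 'pos': True, 'hw': 'halt', 'verb': 'assailed'}
```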

Page 46: Information Extraction A Practical Survey

Observations about Feature Set 1

- Because most of the argument constituents are prepositional attachments (PP) or relative clauses (SBAR), the head word (hw) is often not the most informative word in the phrase.
- Due to its strong lexicalization, the model suffers from data sparsity, e.g. hw used < 3%. The problem can be addressed with a back-off model from words to part-of-speech tags.
- The features in Set 1 capture only syntactic information, even though semantic information like named-entity tags should help. For example, ARGM-TMP typically contains DATE entities, and ARGM-LOC includes LOCATION named entities.
- Feature Set 1 does not capture predicates lexicalized by phrasal verbs, e.g. "put up".

[Example constituents: the PP "in last June", the SBAR "that occurred yesterday", and the VP "to be declared".]

Page 47: Information Extraction A Practical Survey

Feature Set 2 (1/2)

- CONTENT WORD (cw): lexicalized feature that selects an informative word from the constituent, other than the head. Selection heuristics are available in the paper. E.g. "June" for the phrase "in last June".
- PART OF SPEECH OF CONTENT WORD (cPos): part-of-speech tag of the content word. E.g. NNP for the phrase "in last June".
- PART OF SPEECH OF HEAD WORD (hPos): part-of-speech tag of the head word. E.g. NN for the phrase "the futures halt".
- NAMED ENTITY CLASS OF CONTENT WORD (cNE): the class of the named entity that includes the content word. 7 named entity classes (from the MUC-7 specification) are covered. E.g. DATE for "in last June".

Page 48: Information Extraction A Practical Survey

Feature Set 2 (2/2)

- BOOLEAN NAMED ENTITY FLAGS: set of features that indicate whether a named entity is included at any position in the phrase:
  - neOrganization: set to true if an organization name is recognized in the phrase.
  - neLocation: set to true if a location name is recognized in the phrase.
  - nePerson: set to true if a person name is recognized in the phrase.
  - neMoney: set to true if a currency expression is recognized in the phrase.
  - nePercent: set to true if a percentage expression is recognized in the phrase.
  - neTime: set to true if a time-of-day expression is recognized in the phrase.
  - neDate: set to true if a date temporal expression is recognized in the phrase.
- PHRASAL VERB COLLOCATIONS: set of two features that capture information about phrasal verbs:
  - pvcSum: the frequency with which a verb is immediately followed by any preposition or particle.
  - pvcMax: the frequency with which a verb is followed by its predominant preposition or particle.

(A minimal sketch of computing the phrasal-verb counts follows below.)
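
A minimal sketch of how the two phrasal-verb collocation counts could be gathered from a POS-tagged corpus: for every verb, count how often it is immediately followed by any preposition or particle (pvcSum) and how often by its single most frequent one (pvcMax). The corpus format and tag set are illustrative assumptions.

```python
from collections import Counter, defaultdict


def phrasal_verb_counts(tagged_sentences):
    """tagged_sentences: list of [(word, pos_tag), ...]; returns verb -> (pvcSum, pvcMax)."""
    following = defaultdict(Counter)
    for sent in tagged_sentences:
        for (word, tag), (nxt, nxt_tag) in zip(sent, sent[1:]):
            if tag.startswith("VB") and nxt_tag in ("IN", "RP"):  # verb followed by prep/particle
                following[word.lower()][nxt.lower()] += 1
    return {verb: (sum(counts.values()), counts.most_common(1)[0][1])
            for verb, counts in following.items()}


corpus = [
    [("He", "PRP"), ("put", "VBD"), ("up", "RP"), ("a", "DT"), ("fight", "NN")],
    [("They", "PRP"), ("put", "VBD"), ("up", "RP"), ("posters", "NNS")],
    [("She", "PRP"), ("put", "VBD"), ("on", "RP"), ("a", "DT"), ("coat", "NN")],
]
print(phrasal_verb_counts(corpus))  # {'put': (3, 2)} -> pvcSum=3, pvcMax=2
```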

Page 49: Information Extraction A Practical Survey

Experiments (1/3)

- Trained on PropBank release 2002/7/15 and Treebank release 2, both without Section 23. Named entity information extracted using CiceroLite.
- Tested on PropBank and Treebank Section 23. Used gold-standard trees from Treebank and named entities from CiceroLite.
- Task 1 (identifying argument constituents):
  - Negative examples: any Treebank phrases not tagged in PropBank. Due to memory limitations, we used ~11% of Treebank.
  - Positive examples: Treebank phrases (from the same 11% set) annotated with any PropBank role.
- Task 2 (assigning roles to argument constituents): due to memory limitations, we limited the example set to the first 60% of PropBank annotations.

Page 50: Information Extraction A Practical Survey

Experiments (2/3)

Features                          Arg P   Arg R   Arg F1   Role A
FS1                               84.96   84.26   84.61    78.76
FS1 + POS tag of head word        92.24   84.50   88.20    79.04
FS1 + content word and POS tag    92.19   84.67   88.27    80.80
FS1 + NE label of content word    83.93   85.69   84.80    79.85
FS1 + phrase NE flags             87.78   85.71   86.73    81.28
FS1 + phrasal verb information    84.88   82.77   83.81    78.62
FS1 + FS2                         91.62   85.06   88.22    83.05
FS1 + FS2 + boosting              93.00   85.29   88.98    83.74

(Arg P/R/F1: precision, recall, and F-measure for argument identification, Task 1; Role A: role assignment accuracy, Task 2.)

Page 51: Information Extraction A Practical Survey

Experiments (3/3)

Four models compared:
- [Gildea and Palmer, 2002]
- [Gildea and Palmer, 2002], our implementation
- Our model with FS1
- Our model with FS1 + FS2 + boosting

Model            Implementation          Arg F1   Role A
Statistical      Gildea and Palmer       -        82.8
Statistical      This study              71.86    78.87
Decision Trees   FS1                     84.61    78.76
Decision Trees   FS1 + FS2 + boosting    88.98    83.74

Page 52: Information Extraction A Practical Survey

Mapping Predicate-Argument Structures to Templettes

- The mapping rules from predicate-argument structures to templette slots are currently produced manually, using training texts and the corresponding templettes. Effort per domain: < 3 person-hours, if training information is available.
- We focused on two Event99 domains:
  - "Market change" tracks changes of financial instruments. Relevant slots: INSTRUMENT (description of the financial instrument), AMOUNT_CHANGE (change amount), and CURRENT_VALUE (current instrument value after the change).
  - "Death" extracts person death events. Relevant slots: DECEASED (person deceased), MANNER_OF_DEATH (manner of death), and AGENT_OF_DEATH (entity that caused the death event).

Page 53: Information Extraction A Practical Survey

Mappings for the Event99 "Death" and "Market Change" Domains

Mapping rules for the "market change" domain:
(1) ARG1 and MARKET_CHANGE_VERB → INSTRUMENT
(2) ARG2 and (MONEY or PERCENT or NUMBER or QUANTITY) and MARKET_CHANGE_VERB → AMOUNT_CHANGE
(3) (ARG4 or ARGM-DIR) and NUMBER and MARKET_CHANGE_VERB → CURRENT_VALUE

Mapping rules for the "death" domain:
(1) (PERSON and ARG0 and DIE_VERB) or (PERSON and ARG1 and KILL_VERB) → DECEASED
(2) (ARG0 and KILL_VERB) or (ARG1 and DIE_VERB) → AGENT_OF_DEATH
(3) (ARGM-TMP and ILLNESS_NOUN) or KILL_VERB or DIE_VERB → MANNER_OF_DEATH
(4) ARGM-TMP and DATE → DATE
(5) (ARGM-LOC or ARGM-TMP) and LOCATION → LOCATION
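
A minimal sketch of applying the "death" mapping rules above to a single predicate-argument structure: each argument is a labeled span with optional named-entity information, and each rule tests the argument label, the predicate class, and the entity type. The data structures, the verb lists, and the rule encoding are illustrative assumptions, and rule (3) is simplified (the ARGM-TMP/ILLNESS_NOUN disjunct is omitted).

```python
DIE_VERBS = {"die", "perish"}                   # illustrative verb lists
KILL_VERBS = {"kill", "murder", "assassinate"}


def map_death_templette(pred, args):
    """
    pred: predicate lemma; args: list of dicts with 'label' (ARG0, ARG1, ARGM-TMP, ...),
    'text', and optional 'ne' (PERSON, DATE, LOCATION). Returns a 'death' templette.
    """
    kill, die = pred in KILL_VERBS, pred in DIE_VERBS
    t = {}
    for a in args:
        label, ne, text = a["label"], a.get("ne"), a["text"]
        if ne == "PERSON" and ((label == "ARG0" and die) or (label == "ARG1" and kill)):
            t["DECEASED"] = text                                  # rule (1)
        elif (label == "ARG0" and kill) or (label == "ARG1" and die):
            t["AGENT_OF_DEATH"] = text                            # rule (2)
        elif label == "ARGM-TMP" and ne == "DATE":
            t["DATE"] = text                                      # rule (4)
        elif label in ("ARGM-LOC", "ARGM-TMP") and ne == "LOCATION":
            t["LOCATION"] = text                                  # rule (5)
    if kill or die:
        t["MANNER_OF_DEATH"] = pred                               # rule (3), simplified
    return t


args = [
    {"label": "ARG0", "text": "The space shuttle Challenger"},
    {"label": "ARG1", "text": "six astronauts", "ne": "PERSON"},
    {"label": "ARGM-LOC", "text": "over Florida", "ne": "LOCATION"},
]
print(map_death_templette("kill", args))
# {'AGENT_OF_DEATH': 'The space shuttle Challenger', 'DECEASED': 'six astronauts',
#  'LOCATION': 'over Florida', 'MANNER_OF_DEATH': 'kill'}
```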

Page 54: Information Extraction A Practical Survey

Experimental Setup

Three systems compared:
- This model, with predicate-argument structures detected using the statistical approach.
- This model, with predicate-argument structures detected using decision trees.
- The cascaded finite-state-automata system (CICERO).

In all systems, entity coreference and event fusion were disabled.

[Block diagram of the baseline FSA system (Documents in, Templettes out): named-entity recognizer, phrasal parser, combiner, entity coreference, event recognizer (domain-specific), event fusion.]

Page 55: Information Extraction A Practical Survey

Experiments

System                  Market Change   Death
Pred/Args Statistical   68.9%           58.4%
Pred/Args Inductive     82.8%           67.0%
FSA                     91.3%           72.7%

System                  Correct   Missed   Incorrect
Pred/Args Statistical   26        16       3
Pred/Args Inductive     33        9        2
FSA                     38        4        2

Page 56: Information Extraction A Practical Survey

The Good and the Bad

The good:
- The method achieves over 88% F-measure for the task of identifying argument constituents, and over 83% accuracy for role labeling.
- The model scales well to unknown predicates, because predicate lexical information is used for less than 5% of the branching decisions.
- Domain customization of the complete IE system takes less than 3 person-hours per domain, because most of the components are open-domain. Domain-specific components can be modeled with machine learning (future work).
- Performance degradation versus a fully-customized IE system is only 10%. It will be further decreased by including coreference resolution (open-domain) and event fusion (domain-specific).

The bad:
- Currently PropBank provides annotations only for verb-based predicates. Noun-noun relations cannot be modeled for now.
- Cannot be applied to unstructured text, where full parsing does not work.
- Slower than the cascaded FSA models.

Page 57: Information Extraction A Practical Survey

Other Pattern-Free Systems

- Algorithms That Learn To Extract Information. BBN: Description of the SIFT System As Used For MUC-7. Scott Miller et al. http://citeseer.nj.nec.com/miller98algorithms.html
  Probabilistic model with features extracted from full parse trees enhanced with NEs.
- Kernel Methods for Relation Extraction. Dmitry Zelenko and Chinatsu Aone. http://citeseer.nj.nec.com/zelenko02kernel.html
  Tree-based SVM kernels used to discover EELD relations.
- Automatic Pattern Acquisition for Japanese Information Extraction. Kiyoshi Sudo et al. http://citeseer.nj.nec.com/sudo01automatic.html
  Learns parse trees that subsume the information of interest.

Page 58: Information Extraction A Practical Survey

End

Thank you! (Gràcies!)