1
Information Extraction: A Practical Survey
Mihai Surdeanu
TALP Research Center, Dep. Llenguatges i Sistemes Informàtics
Universitat Politècnica de Catalunya
[email protected]
2
Overview
What is information extraction?
A "traditional" system and its problems
Pattern learning and classification
Beyond patterns
3
What is information extraction?
The extraction or pulling out of pertinent information from large volumes of texts. (http://www.itl.nist.gov/iad/894.02/related_projects/muc/index.html)
Information extraction (IE) systems extract concepts, events, and relations that are relevant for a given scenario domain.
But, what is a concept, an event, or a scenario domain? Actual implementations of IE systems varied throughout the history of the task: MUC, Event99, EELD.
The tendency is to simplify the definition (or rather the implementation) of the task.
4
Information Extraction at the Message Understanding Conferences
Seven MUC conferences, between 1987 and 1998.
Scenario domains driven by template specifications (fairly similar to database schemas), which define the content to be extracted.
Each event fills exactly one template (fairly similar to a database record).
Each template slot contains either text, or pointers to other templates.
The goal was to use IE technology to populate relational databases. This never really happened:
  The chosen representation was too complicated.
  Did not address real-world problems, but artificial benchmarks.
  Systems never achieved good-enough accuracy.
5
MUC-6 “Management Succession” Example
<SUCCESSION_EVENT-9301190125-1> :=
  SUCCESSION_ORG: <ORGANIZATION-9301190125-1>
  POST: "chief executive officer"
  IN_AND_OUT: <IN_AND_OUT-9301190125-1>
              <IN_AND_OUT-9301190125-2>
  VACANCY_REASON: REASSIGNMENT

<IN_AND_OUT-9301190125-1> :=
  IO_PERSON: <PERSON-9301190125-1>
  NEW_STATUS: IN
  ON_THE_JOB: UNCLEAR
  OTHER_ORG: <ORGANIZATION-9301190125-2>
  REL_OTHER_ORG: OUTSIDE_ORG
  COMMENT: "Barry Diller IN"

…
<ORGANIZATION-9301190125-1> :=
  ORG_NAME: "QVC Network Inc."
  ORG_TYPE: COMPANY
MUC-6 template. POST is a template slot with a text fill; IN_AND_OUT is a template slot that points to other templates.
Source text: "…Barry Diller was appointed chief executive officer of QVC Network Inc…"
6
Information Extraction at DARPA´s HUB-4 Event99
Was planned as a successor of MUC.
Identification and extraction of relevant information dictated by templettes, which are "flat", simplified templates. Slots are filled only with text; no pointers to other templettes are accepted.
Domains closer to real-world applications are addressed: natural disasters, bombing, deaths, elections, financial fluctuations, illness outbreaks.
The goal was to provide event-level indexing into documents such as news wires, radio and television transcripts, etc. Imagine querying "BOMBING AND Gaza" in news messages and retrieving only the relevant text about bombing events in the Gaza area, classified into templettes.
Event99: A Proposed Event Indexing Task For Broadcast News. Lynette Hirschman et al. (http://citeseer.nj.nec.com/424439.html)
7
Event99 "Death" Example: Templettes Versus Templates
The sole survivor of the car crash that killed Princess Diana and Dodi Fayed last year in France is remembering more about the accident.
<DEATH-CNN3-1> :=
  DECEASED: "Princess [Diana]" / "[Dodi Fayed]"
  MANNER_OF_DEATH: "the car [crash] that killed Princess Diana and Dodi Fayed" / "the [accident]"
  LOCATION: "in [France]"
  DATE: "last [year]"
Compare with the MUC-6 template:

<SUCCESSION_EVENT-9301190125-1> :=
  SUCCESSION_ORG: <ORGANIZATION-9301190125-1>
  POST: "chief executive officer"
  IN_AND_OUT: <IN_AND_OUT-9301190125-1>
              <IN_AND_OUT-9301190125-2>
  VACANCY_REASON: REASSIGNMENT

<IN_AND_OUT-9301190125-1> :=
  IO_PERSON: <PERSON-9301190125-1>
  NEW_STATUS: IN
  ON_THE_JOB: UNCLEAR
  OTHER_ORG: <ORGANIZATION-9301190125-2>
  REL_OTHER_ORG: OUTSIDE_ORG
  COMMENT: "Barry Diller IN"

…
<ORGANIZATION-9301190125-1> :=
  ORG_NAME: "QVC Network Inc."
  ORG_TYPE: COMPANY
8
Information Extraction at DARPA´s Evidence Extraction and Link Detection (EELD) Program
IE used as a tool for the more general problem of link discovery: sift through large data collections and derive complex rules from collections of simpler IE patterns.
Example: certain sets of account_number(Person,Account), deposit(Account,Amount), greater_than(Amount,reporting_amount) patterns imply is_a(Person, money_launderer). Note: the fact that Person is a money_launderer is not stated in any form in text!
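To make the link-discovery idea concrete, here is a minimal, purely illustrative Python sketch (not EELD's actual machinery) that composes the simple extracted facts named above into the higher-level inference; the threshold value and the example facts are invented for the illustration.

    # Illustrative sketch: composing simple extracted facts into a higher-level
    # inference, using the predicate names from this slide.
    REPORTING_AMOUNT = 10_000  # hypothetical threshold; the real value is domain-specific

    facts = {
        ("account_number", "John Smith", "ACC-1"),
        ("deposit", "ACC-1", 25_000),
    }

    def implied_money_launderers(facts):
        """Derive is_a(Person, money_launderer) from simpler IE facts."""
        accounts = {(p, a) for (r, p, a) in facts if r == "account_number"}
        deposits = {(a, amt) for (r, a, amt) in facts if r == "deposit"}
        return {person
                for (person, acc) in accounts
                for (acc2, amount) in deposits
                if acc == acc2 and amount > REPORTING_AMOUNT}  # greater_than(Amount, reporting_amount)

    print(implied_money_launderers(facts))  # {'John Smith'}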
IE used to identify concepts (typically named entities), events (typically identified by trigger words), and basic entity-entity and entity-event relations.
Simpler IE problem:
  No templates or templettes generated.
  Not dealing with event merging.
  Events always marked by trigger words, e.g. "murder" triggers a MURDER event.
  Relations are always intra-sentential.
EELD web portal: http://www.rl.af.mil/tech/programs/eeld/
9
EELD Example
John Smith is the chief scientist of Hardcom Corporation.
Entities: Person(John Smith), Organization(Hardcom Corporation)
Events: --
Relations: person-affiliation(Person(John Smith), Organization(Hardcom Corporation))
The murder of John Smith…
Entities: Person(John Smith)
Events: Murder(murder)
Relations: murder-victim(Person(John Smith), Murder(murder))
10
Overview
What is information extraction?
A "traditional" system and its problems
Pattern learning and classification
Beyond patterns
11
Traditional IE Architecture
The Finite State Automaton Text Understanding System (FASTUS) approach: cascaded finite state automata (FSA). Each FSA level recognizes larger linguistic constructs (from tokens to chunks to clauses to domain patterns), which become the simplified input for the next FSA in the cascade. (A toy sketch of such a cascade follows the references below.)
Why? Speed. Robustness to unstructured input. Handles data sparsity well.
The FSA cascade is enriched with limited discourse processing components: coreference resolution and event merging.
Most systems in MUC ended up using this architecture: CIRCUS from UMass (was actually the first to introduce the cascaded FSA architecture), PROTEUS (NYU), PLUM (BBN), CICERO (LCC) and many others.
An ocean of information available:
  FASTUS: A Cascaded Finite-State Transducer for Extracting Information from Natural-Language Text. Jerry R. Hobbs et al. http://www.ai.sri.com/natural-language/projects/fastus-schabes.html
  Infrastructure for Open-Domain Information Extraction. Mihai Surdeanu and Sanda Harabagiu. http://www.languagecomputer.com/papers/hlt2002.pdf
  Rich IE bibliography maintained by Horacio Rodriguez at: http://www.lsi.upc.es/~horacio/varios/sevilla2001.zip
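The toy sketch referenced above: a two-stage cascade in the spirit of FASTUS, where a crude chunking stage simplifies the text and a domain-pattern stage matches over its output. The regular expressions and the chunking heuristic are invented for the illustration and are far simpler than a real FSA grammar.

    import re

    def chunk(text):
        # hypothetical, overly simple chunker: mark capitalized word sequences as NP
        return re.sub(r"((?:[A-Z][a-z]+ ?)+)", r"[NP \1] ", text)

    def domain_patterns(chunked):
        # hypothetical domain pattern: "<NP> was appointed <POST>"
        m = re.search(r"\[NP (?P<person>[^\]]+)\] was appointed (?P<post>[a-z ]+)", chunked)
        return {k: v.strip() for k, v in m.groupdict().items()} if m else {}

    text = "Barry Diller was appointed chief executive officer of QVC Network Inc."
    print(domain_patterns(chunk(text)))
    # {'person': 'Barry Diller', 'post': 'chief executive officer of'}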
12
Language Computer's CICERO Information Extraction System: a processing pipeline from Documents to Templettes/Templates. Components and their functions:
known word recognition: recognizes known concepts using lexicons and gazetteers
numerical-entity recognition: identifies numerical entities such as money, percents, dates and times (FSA)
named-entity recognition (also available as a stand-alone named-entity recognizer): identifies named entities such as person, location, and organization names (FSA)
name aliasing: disambiguates incomplete or ambiguous names
phrasal parser: identifies basic noun, verb, and particle phrases (TBL + FSA)
phrase combiner: identifies domain-dependent complex noun and verb phrases (FSA)
entity coreference resolution: detects pronominal and nominal coreference links
domain pattern recognition: identifies domain-dependent patterns (FSA)
event coreference: resolves empty templette slots
event merging: merges templettes belonging to the same event
13
Walk-Through Example (1/5)
<BOMBING> :=
  BOMB: "a car bombing"
  PERPETRATOR: "Ansar al-Islam"
  DEAD: "At least seven police officers"
  INJURED: "as many as 52 other people, including several children"
  DAMAGE: "a police station"
  LOCATION: "Kirkuk"
  DATE: "Monday"
At least seven police officers were killed and as many as 52 other people, including several children, were injured Monday in a car bombing that also wrecked a police station. Kirkuk´s police said they had "good information" that Ansar al-Islam was behind the blast.
14
Walk-Through Example (2/5)
Lexicon + numerical entities + NER
At least seven/NUMBER police officers were killed and as many as 52/NUMBER other people, including several children, were injured Monday/DATE in a car bombing that also wrecked a police station. Kirkuk/LOC ´s police said they had "good information" that Ansar al-Islam/ORG was behind the blast.
At least seven police [officers]/NP were [killed]/VP and as many as 52 other [people]/NP, [including]/VP several [children]/NP, were [injured]/VP [Monday]/NP in a car [bombing]/NP that also [wrecked]/VP a police [station]/NP. [Kirkuk]/NP ´s [police]/NP [said]/VP [they]/NP [had]/NP "good [information]“/NP that [Ansar al-Islam]/NP [was]/VP behind the [blast]/NP.
Phrasal parser
15
Walk-Through Example (3/5)
At least seven police [officers]/NP were [killed]/VP and as many as 52 other [people], including several children/NP, were [injured]/VP [Monday]/NP in a car [bombing]/NP that also [wrecked]/VP a police [station]/NP. [Kirkuk]/NP ´s [police]/NP [said]/VP [they]/NP [had]/NP "good [information]“/NP that [Ansar al-Islam]/NP [was]/VP behind the [blast]/NP.
Complex phrase detection
Entity coreference resolution
they → The police
the blast → a car bombing
16
Walk-Through Example (4/5)
At least seven police officers were killed/PATTERN and as many as 52 other people, including several children, were injured Monday in a car bombing/PATTERN {car bombing} that also wrecked a police station/PATTERN. Kirkuk´s police said they had "good information" that Ansar al-Islam was behind the blast/PATTERN.
TEMPLETTE
  DEAD: "At least seven police officers"

TEMPLETTE
  BOMB: "a car bombing"
  INJURED: "as many as 52 other people, including several children"
  DATE: "Monday"

TEMPLETTE
  BOMB: "a car bombing"
  DAMAGE: "a police station"

TEMPLETTE
  BOMB: "a car bombing"
  PERPETRATOR: "Ansar al-Islam"
17
Walk-Through Example (5/5)
Templettes from the previous step:

TEMPLETTE
  DEAD: "At least seven police officers"

TEMPLETTE
  BOMB: "a car bombing"
  INJURED: "as many as 52 other people, including several children"
  DATE: "Monday"

TEMPLETTE
  BOMB: "a car bombing"
  DAMAGE: "a police station"

TEMPLETTE
  BOMB: "a car bombing"
  PERPETRATOR: "Ansar al-Islam"

After event coreference (empty slots resolved):

TEMPLETTE
  BOMB: "a car bombing"
  DEAD: "At least seven police officers"
  DATE: "Monday"
  LOCATION: "Kirkuk"

TEMPLETTE
  BOMB: "a car bombing"
  INJURED: "as many as 52 other people, including several children"
  DATE: "Monday"
  LOCATION: "Kirkuk"

TEMPLETTE
  BOMB: "a car bombing"
  DAMAGE: "a police station"
  DATE: "Monday"
  LOCATION: "Kirkuk"

TEMPLETTE
  BOMB: "a car bombing"
  PERPETRATOR: "Ansar al-Islam"
  DATE: "Monday"
  LOCATION: "Kirkuk"
After event merging, a single templette remains:

TEMPLETTE
  BOMB: "a car bombing"
  PERPETRATOR: "Ansar al-Islam"
  DEAD: "At least seven police officers"
  INJURED: "as many as 52 other people, including several children"
  DAMAGE: "a police station"
  DATE: "Monday"
  LOCATION: "Kirkuk"
18
Coreference for IE
Algorithm detailed in: Recognizing Referential Links: An Information Extraction Perspective. Megumi Kameyama. http://citeseer.nj.nec.com/kameyama97recognizing.html
3-step algorithm:
1. Identify all anaphoric entities, e.g. pronouns, nouns, ambiguous named entities.
2. For each anaphoric entity, identify all possible candidates and sort them according to some salience ordering, e.g. left-to-right traversal in the same sentence, right-to-left traversal in previous sentences.
3. Extract the first candidate that matches some semantic constraints, e.g. number and gender consistency. Merge the candidate with the anaphoric entity.
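A minimal Python sketch of this salience-based resolution loop (illustrative only, not Kameyama's implementation); the mention representation, with sentence index, position, number, and gender, is an assumption made for the example.

    def resolve(anaphor, mentions):
        """Return the first salient candidate satisfying the semantic constraints."""
        candidates = [m for m in mentions
                      if (m["sent"], m["pos"]) < (anaphor["sent"], anaphor["pos"])]
        # salience ordering: left-to-right in the same sentence,
        # then right-to-left in previous sentences
        same = sorted([m for m in candidates if m["sent"] == anaphor["sent"]],
                      key=lambda m: m["pos"])
        prev = sorted([m for m in candidates if m["sent"] < anaphor["sent"]],
                      key=lambda m: (-m["sent"], -m["pos"]))
        for cand in same + prev:
            if (cand["number"] == anaphor["number"]
                    and cand["gender"] in (anaphor["gender"], "any")):
                return cand  # merging with the anaphoric entity would happen here
        return None

    mentions = [
        {"text": "Kirkuk's police", "sent": 1, "pos": 0, "number": "pl", "gender": "any"},
        {"text": "the blast",       "sent": 1, "pos": 1, "number": "sg", "gender": "any"},
    ]
    they = {"text": "they", "sent": 1, "pos": 2, "number": "pl", "gender": "any"}
    print(resolve(they, mentions)["text"])  # Kirkuk's police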
19
The Role of Coreference in Named Entity Recognition
Classifies unknown named entities that are likely part of a name but cannot be identified as such due to insufficient local context.
Example: “Michigan National Corp./ORG said it will eliminate some senior management jobs… Michigan National/? said the restructuring…”
Disambiguates named entities of ambiguous length and/or ambiguous type.
“Michigan” changed from LOC to ORG when “Michigan Corp.” appears in the same context.
The text “McDonald´s” may contain a person name “McDonald” or an organization name “McDonald´s”. Non-deterministic FSA used to maintain both alternatives until after name aliasing, when one is selected.
Disambiguates headline named entities. Headlines are typically capitalized, e.g. "McDermott Completes Sale".
  Processing of headlines is postponed until after the body of text is processed.
  A "longest-match" approach is used to match the headline sequence of tokens against entities found in the first body paragraph. For example, "McDermott" is labeled ORG because it matches over "McDermott International Inc." in the first document paragraph. (A toy sketch follows below.)
Over 5% increase in accuracy (F-measure): from 87.81% to 93.64%.
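The toy sketch referenced above: a rough longest-match labeling of headline tokens against entities found in the first body paragraph. The entity list and its types are made up for the example; a real system would also handle case normalization and multiple candidate entities.

    def label_headline(headline_tokens, body_entities):
        """Label headline spans by longest-prefix match against body entities."""
        labels = {}
        i = 0
        while i < len(headline_tokens):
            best = None
            for name, etype in body_entities.items():
                entity_tokens = name.split()
                # longest run of headline tokens that is a prefix of this entity
                k = 0
                while (i + k < len(headline_tokens) and k < len(entity_tokens)
                       and headline_tokens[i + k] == entity_tokens[k]):
                    k += 1
                if k and (best is None or k > best[0]):
                    best = (k, etype)
            if best:
                k, etype = best
                labels[" ".join(headline_tokens[i:i + k])] = etype
                i += k
            else:
                i += 1
        return labels

    body_entities = {"McDermott International Inc.": "ORG"}
    print(label_headline("McDermott Completes Sale".split(), body_entities))
    # {'McDermott': 'ORG'}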
20
The Role of Coreference in IE
[Chart: F-measure as a function of rule count (1 to 25) for three configurations: NE only; NE + entity coreference; NE + entity + event coreference. The y-axis ranges from 0 to 90.]
21
The Good
Relatively good performance with a simple system: F-measures over 75%, up to 88% for some simpler Event99 domains.
Execution times below 10 seconds per 5KB document.
Improvements to the FSA-only approach:
  Coreference almost doubles the FSA-only performance.
  More extraction rules add little to the IE performance, whereas different forms of coreference add more.
  Non-determinism used to mitigate the limited power of FSA grammars.
22
The Bad
Needs domain-specific lexicons, e.g. an ontology of bombing devices.
  Work to automate this process: Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping. Ellen Riloff and Rosie Jones. http://www.cs.utah.edu/~riloff/psfiles/aaai99.pdf (not covered in this presentation)
Domain-specific patterns must be developed, e.g. "<SUBJECT> explode".
Patterns must be classified: what does the above pattern mean? Is the subject a bomb, a perpetrator, a location?
Patterns cannot cover the flexibility of natural language. Need better models that go beyond the pattern limitations.
Event merging is another NP-complete problem. One of the few stochastic models for event merging: Probabilistic Coreference in Information Extraction. Andrew Kehler. http://ling.ucsd.edu/~kehler/Papers/emnlp97.ps.gz (not covered in this presentation)
All of the above components are developed manually, which yields high domain development time (more than 40 person hours per domain). This prohibits the use of this approach for "real-time" information extraction.
23
Overview
What is information extraction?
A "traditional" system and its problems
Pattern learning and classification
Beyond patterns
24
Automatically Generating Extraction Patterns from Untagged Text
The first system to successfully discover domain patterns: AutoSlog-TS.
Automatically Generating Extraction Patterns from Untagged Text. Ellen Riloff. http://www.cs.utah.edu/~riloff/psfiles/aaai96.pdf
The intuition is that domain-specific patterns will appear more often in documents related to the domain of interest than in unrelated documents.
25
Weakly-Supervised Pattern Learning Algorithm (1/2)
1. Separate the training document set into relevant and irrelevant documents (manual process).
2. Generate all possible patterns in all documents, according to some meta-patterns. Examples below.
Meta Pattern          Pattern
<subj> active-verb    <perpetrator> bombed
active-verb <dobj>    bombed <target>
infinitive <dobj>     to kill <victim>
gerund <dobj>         killing <victim>
<np> prep <np>        <bomb> against <target>
26
Weakly-Supervised Pattern Learning Algorithm (2/2)
3. Rank all generated patterns according to the formula: relevance_rate x log2(frequency), where the relevance_rate indicates the ratio of relevant instances (i.e. in relevant documents versus non-relevant documents) of the corresponding pattern, and frequency indicates the number of times the pattern was seen in relevant documents.
4. Add the top-ranked pattern to the list of learned patterns, and mark all documents where the pattern appears as relevant.
5. Repeat the process from Step 3 for N iterations. Hence the output of the algorithm is N learned patterns.
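A small sketch of the Step 3 ranking score, relevance_rate x log2(frequency), with invented counts for two candidate patterns; it illustrates why a pattern that occurs mostly in relevant documents outranks a frequent but generic one.

    from math import log2

    def score(rel_freq, total_freq):
        """rel_freq: occurrences in relevant documents; total_freq: occurrences overall."""
        if rel_freq == 0:
            return 0.0
        relevance_rate = rel_freq / total_freq
        return relevance_rate * log2(rel_freq)

    # invented counts: (occurrences in relevant docs, occurrences overall)
    candidates = {"<subj> exploded": (20, 22), "<subj> said": (300, 1200)}
    ranked = sorted(candidates, key=lambda p: score(*candidates[p]), reverse=True)
    print(ranked)  # ['<subj> exploded', '<subj> said']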
27
Examples of Learned Patterns
Patterns learned for the MUC-4 "terrorism" domain:
<subj> exploded
murder of <np>
assassination of <np>
<subj> was killed
<subj> was kidnapped
attack on <np>
<subj> was injured
exploded in <np>
death of <np>
<subj> took_place
28
The Good and the Bad
The good:
  Performance very close to the manually-customized system.
The bad:
  Documents must be separated into relevant/irrelevant by hand.
  When does the learning process stop?
  Pattern classification and event merging still developed by human experts.
29
The ExDisco IE System
Automatic Acquisition of Domain Knowledge for Information Extraction. Roman Yangarber et al. http://www.cs.nyu.edu/roman/Papers/2000-coling-pub.ps.gz
Quasi-automatically separates documents into relevant/non-relevant using a set of "seed" patterns selected by the user, e.g. <company> appoint-verb <person> for the MUC-6 "management succession" domain.
In addition to ranking patterns, ExDisco ranks documents based on how many relevant patterns they contain, which has an immediate application to text filtering.
30
Counter-Training for Pattern Discovery
Counter-Training in Discovery of Semantic Patterns. Roman Yangarber. http://www.cs.nyu.edu/roman/Papers/2003-acl-countertrain-web.pdf
Previous approaches are iterative learning algorithms, where the output is a continuous stream of patterns with degrading precision. What is the best stopping point?
The approach is to introduce competition among multiple scenario learners (e.g. management succession, mergers and acquisitions, legal actions). Stop when the learners wander into territory already discovered by the others.
  Pattern frequency is weighted by document relevance.
  Document relevance receives a negative weight based on how many patterns from a different scenario it contains.
  The learning for each scenario stops when the best pattern has a negative score.
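A toy illustration of this stopping criterion (our own simplification, not Yangarber's exact scoring): a pattern is supported by the relevance its own scenario assigns to the documents it occurs in and penalized by the relevance claimed by competing scenarios, and learning stops when the best remaining pattern scores negative. All counts and weights are invented.

    def pattern_score(pattern, doc_relevance_own, doc_relevance_others, docs_with_pattern):
        score = 0.0
        for d in docs_with_pattern[pattern]:
            score += doc_relevance_own.get(d, 0.0)                 # support from own scenario
            score -= max(doc_relevance_others.get(d, 0.0), 0.0)    # penalty from competitors
        return score

    docs_with_pattern = {"<company> acquire <company>": ["d1", "d2"],
                         "<person> say <np>": ["d2", "d3"]}
    own = {"d1": 0.9, "d2": 0.4, "d3": 0.1}
    others = {"d2": 0.3, "d3": 0.8}   # relevance claimed by a competing scenario

    scores = {p: pattern_score(p, own, others, docs_with_pattern) for p in docs_with_pattern}
    best = max(scores, key=scores.get)
    if scores[best] < 0:
        print("stop learning for this scenario")
    else:
        print("accept pattern:", best, round(scores[best], 2))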
31
Pattern Classification
Multiple systems perform successful pattern acquisition by now, e.g. "attacked <np>" is discovered for the bombing domain. But what does the <np> actually mean? Is it the victim, the physical target, or something else?
An Empirical Approach to Conceptual Case Frame Acquisition. Ellen Riloff and Mark Schmelzenbach. http://www.cs.utah.edu/~riloff/psfiles/wvlc98.pdf
32
Pattern Classification Algorithm
Requires 5 seed words per semantic category (e.g. PERPETRATOR, VICTIM, etc.).
Builds a context for each semantic category by expanding the seed word set with words that appear frequently in the proximity of the previous seed words.
Uses AutoSlog to discover domain patterns.
Builds a semantic profile for each discovered pattern based on the overlap between the noun phrases contained in the pattern and the previous semantic contexts.
Each pattern is associated with the best-ranked semantic category.
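A minimal sketch of the semantic-profile step just described, assuming the expanded category contexts are simple word sets; the word lists, the pattern, and its extracted noun phrases are invented for the example.

    from collections import Counter

    category_context = {
        "VICTIM": {"people", "officers", "children", "civilians"},
        "PERPETRATOR": {"guerrillas", "terrorists", "gunmen"},
    }

    def classify_pattern(extracted_nps, category_context):
        """Assign a pattern the category whose context overlaps its NPs the most."""
        overlap = Counter()
        for cat, context in category_context.items():
            for np in extracted_nps:
                overlap[cat] += sum(1 for w in np.lower().split() if w in context)
        total = sum(overlap.values()) or 1
        profile = {cat: overlap[cat] / total for cat in category_context}
        return max(profile, key=profile.get), profile

    nps_found_by_pattern = ["seven police officers", "several children"]  # fills of "<subj> was killed"
    print(classify_pattern(nps_found_by_pattern, category_context))
    # ('VICTIM', {'VICTIM': 1.0, 'PERPETRATOR': 0.0})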
33
Pattern Classification Example
Semantic Category   Probability
BUILDING            0.10
CIVILIAN            0.03
DATE                0.05
GOVOFFICIAL         0.03
LOCATION            0.03
MILITARYPEOPLE      0.09
TERRORIST           0.00
VEHICLE             0.03
WEAPON              0.00
Semantic profile for the pattern: attack on <np>
34
Other Pattern-Learning Systems: RAPIER (1/2)
Relational Learning of Pattern-Match Rules for Information Extraction. Mary Elaine Califf and Raymond J. Mooney. http://citeseer.nj.nec.com/califf98relational.html
Uses Inductive Logic Programming (ILP) to implement a bottom-up generalization of patterns.
Patterns specified with pre-fillers (conditions on the tokens preceding the pattern), fillers (conditions on the tokens included in the pattern), and post-fillers (conditions on the tokens following the pattern)
The only linguistic resource used is a part-of-speech (POS) tagger. No parser (full or partial) used!
More robust to unstructured text. Applicability limited to simpler domains (e.g. job postings)
35
Other Pattern-Learning Systems: RAPIER (2/2)
Training examples:
  located in Atlanta, Georgia
  offices in Kansas City, Missouri

Rule from the first example:
  Pre-filler:  word: located, tag: VBN; word: in, tag: IN
  Filler:      word: Atlanta, tag: NNP
  Post-filler: word: ",", tag: ","; word: Georgia, tag: NNP

Rule from the second example:
  Pre-filler:  word: offices, tag: NNS; word: in, tag: IN
  Filler:      word: Kansas, tag: NNP; word: City, tag: NNP
  Post-filler: word: ",", tag: ","; word: Missouri, tag: NNP

Generalized rule:
  Pre-filler:  word: in, tag: IN
  Filler:      list: len: 2, tag: NNP
  Post-filler: word: ",", tag: ","; semantic: STATE, tag: NNP
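A sketch of how a RAPIER-style rule with pre-filler, filler, and post-filler constraints could be applied to a POS-tagged sentence. The rule mirrors the generalized rule above; the matching code, the toy STATE lexicon standing in for the semantic constraint, and the hand-tagged sentence are ours, not RAPIER's implementation.

    STATES = {"Georgia", "Missouri"}   # toy lexicon for the semantic: STATE constraint

    rule = {
        "pre":  [{"word": "in", "tag": "IN"}],
        "filler_max_len": 2, "filler_tag": "NNP",
        "post": [{"word": ",", "tag": ","}, {"semantic": "STATE", "tag": "NNP"}],
    }

    def token_matches(token, constraint):
        if "word" in constraint and token["word"] != constraint["word"]:
            return False
        if constraint.get("semantic") == "STATE" and token["word"] not in STATES:
            return False
        return token["tag"] == constraint["tag"]

    def apply_rule(tokens, rule):
        n_pre, n_post = len(rule["pre"]), len(rule["post"])
        for flen in range(1, rule["filler_max_len"] + 1):
            for i in range(n_pre, len(tokens) - flen - n_post + 1):
                pre = tokens[i - n_pre:i]
                filler = tokens[i:i + flen]
                post = tokens[i + flen:i + flen + n_post]
                if (all(token_matches(t, c) for t, c in zip(pre, rule["pre"]))
                        and all(t["tag"] == rule["filler_tag"] for t in filler)
                        and all(token_matches(t, c) for t, c in zip(post, rule["post"]))):
                    return " ".join(t["word"] for t in filler)
        return None

    sent = [{"word": w, "tag": t} for w, t in
            [("offices", "NNS"), ("in", "IN"), ("Kansas", "NNP"), ("City", "NNP"),
             (",", ","), ("Missouri", "NNP")]]
    print(apply_rule(sent, rule))  # Kansas City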
36
Other Pattern-Learning Systems
SRV
  Toward General-Purpose Learning for Information Extraction. Dayne Freitag. http://citeseer.nj.nec.com/freitag98toward.html
  Supervised machine learning based on FOIL; constructs Horn clauses from examples.
Active learning
  Active Learning for Information Extraction with Multiple View Feature Sets. Rosie Jones et al. http://www.cs.utah.edu/~riloff/psfiles/ecml-wkshp03.pdf
  Active learning with multiple views. Ion Muslea. http://www.ai.sri.com/~muslea/PS/dissertation-02.pdf
  Interactively learn and annotate data to reduce the human effort in data annotation.
37
Overview
What is information extraction?
A "traditional" system and its problems
Pattern learning and classification
Beyond patterns
38
The Need to Move Beyond the Pattern-Based Paradigm (1/2)
The space shuttle Challenger/AGENT_OF_DEATH flew apart over Florida like a billion-dollar confetti killing/MANNER_OF_DEATH six astronauts/DECEASED.
[Parse tree for the sentence above: the NP "The space shuttle Challenger" (AGENT_OF_DEATH), the PP "over Florida" (LOC), the verb "killing" (MANNER_OF_DEATH), and the NP "six astronauts" (DECEASED) hang off the S/VP/PP/ADVP structure.]
Hard using surface-level information; easier using full parse trees.
39
The Need to Move Beyond the Pattern-Based Paradigm (2/2)
Pattern-based systems
  Have limited power due to the strict formalism: accuracy < 60% without additional discourse processing.
  Were also developed due to a historical circumstance: no high-performance full parser was widely available.
Recent NLP developments:
  Full syntactic parsing at around 90% accuracy [Collins, 1997], [Charniak, 2000].
  Predicate-argument frames provide an open-domain event representation [Surdeanu et al., 2003], [Gildea and Jurafsky, 2002], [Gildea and Palmer, 2002].
40
Goal
Novel IE paradigm:
  Syntactic representation provided by a full parser.
  Event representation based on predicate-argument frames.
  Entity coreference provides pronominal and nominal anaphora resolution (future work).
  Event merging merges similar/overlapping events (future work).
Advantages:
  High accuracy due to enhanced syntactic and semantic processing.
  Minimal domain customization time because most components are open-domain.
41
Proposition Bank Overview
A one million word corpus annotated with predicate argument structures [Kingsbury, 2002]. Currently only predicates lexicalized by verbs.
Numbered arguments from 0 to 5. Typically ARG0 = agent, ARG1 = direct object or theme, ARG2 = indirect object, benefactive, or instrument, but they are predicate dependent!
Functional tags: ARGM-LOC = locative, ARGM-TMP = temporal, ARGM-DIR = direction.
[Parse tree for "The futures halt was assailed by Big Board floor traders": ARG1 = entity assailed ("The futures halt"), PRED ("assailed"), ARG0 = agent ("Big Board floor traders").]
42
Block Architecture: from Documents to Templettes.
Open-domain components: named-entity recognizer, syntactic parser, identification of pred-arg structures, entity coreference.
Domain-specific components: mapping pred-arg structures to templettes, event merging.
43
Walk-Through Example
The space shuttle Challenger flew apart over Florida like a billion-dollar confetti killing six astronauts.
[Parse tree for the sentence, with predicate-argument labels mapped to domain roles: ARG0 "The space shuttle Challenger" → AGENT_OF_DEATH, PRED "killing" → MANNER_OF_DEATH, ARG1 "six astronauts" → DECEASED; the PP "over Florida" carries the LOC label.]
44
The Model
Consists of two tasks: (1) identifying parse tree constituents corresponding to predicate arguments, and (2) assigning a role to each argument constituent.
Both tasks modeled using C5.0 decision tree learning, and two sets of features: Feature Set 1 adapted from [Gildea and Jurafsky, 2002], and Feature Set 2, novel set of semantic and syntactic features.
[Parse tree for "The futures halt was assailed by Big Board floor traders": Task 1 identifies the argument constituents for the PRED "assailed"; Task 2 assigns them the roles ARG1 ("The futures halt") and ARG0 ("Big Board floor traders").]
45
Feature Set 1
PHRASE TYPE (pt): type of the syntactic phrase as argument. E.g. NP for ARG1.
PARSE TREE PATH (path): path between argument and predicate. E.g. NP S VP VP for ARG1.
PATH LENGTH (pathLen): number of labels stored in the predicate-argument path. E.g. 4 for ARG1.
POSITION (pos): indicates if the constituent appears before the predicate in the sentence. E.g. true for ARG1 and false for ARG0.
VOICE (voice): predicate voice (active or passive). E.g. passive for PRED.
HEAD WORD (hw): head word of the evaluated phrase. E.g. “halt” for ARG1.
GOVERNING CATEGORY (gov): indicates if an NP is dominated by an S phrase or a VP phrase. E.g. S for ARG1, VP for ARG0.
PREDICATE WORD: the verb with morphological information preserved (verb), and the verb normalized to lower case and infinitive form (lemma). E.g. for PRED verb is “assailed”, lemma is “assail”.
[Parse tree for "The futures halt was assailed by Big Board floor traders", with ARG1, PRED, and ARG0 marked as in the previous figure.]
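A small sketch of computing a few Feature Set 1 values for the ARG1 constituent of this example, over a hand-built (simplified) tree for the sentence above. The head word, voice, and lemma are supplied by hand here; a real system would derive them with head rules, a voice detector, and a lemmatizer.

    class Node:
        def __init__(self, label, children=(), word=None):
            self.label, self.children, self.word, self.parent = label, list(children), word, None
            for c in self.children:
                c.parent = self

    def path_to_root(node):
        path = []
        while node is not None:
            path.append(node)
            node = node.parent
        return path

    def tree_path(arg, target):
        """Labels from the argument up to the lowest common ancestor, then down to the target."""
        up, down = path_to_root(arg), path_to_root(target)
        common = next(n for n in up if n in down)
        up_part = [n.label for n in up[:up.index(common) + 1]]
        down_part = [n.label for n in reversed(down[:down.index(common)])]
        return up_part + down_part

    # S -> NP-ARG1  VP(was  VP(assailed  PP(by NP-ARG0)))
    arg1 = Node("NP", word="The futures halt")
    pred = Node("VBN", word="assailed")
    arg0 = Node("NP", word="Big Board floor traders")
    pp   = Node("PP", [Node("IN", word="by"), arg0])
    vp2  = Node("VP", [pred, pp])
    vp1  = Node("VP", [Node("VBD", word="was"), vp2])
    s    = Node("S", [arg1, vp1])

    features = {
        "pt": arg1.label,                                   # PHRASE TYPE: NP
        "path": " ".join(tree_path(arg1, pred.parent)),     # PARSE TREE PATH: NP S VP VP
        "pathLen": len(tree_path(arg1, pred.parent)),       # 4 (stops at the VP over the predicate)
        "pos": True,                                        # ARG1 appears before the predicate
        "voice": "passive",                                 # supplied by hand here
        "hw": "halt",                                       # head word, assumed given by head rules
        "gov": "S",                                         # NP dominated by S
        "lemma": "assail",
    }
    print(features)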
46
Observations about Feature Set 1
Because most of the argument constituents are prepositional attachments (PP) and relative clauses (SBAR), often the head word (hw) is not the most informative word in the phrase.
Due to its strong lexicalization, the model suffers from data sparsity, e.g. hw is used in less than 3% of cases. The problem can be addressed with a back-off model from words to part-of-speech tags.
The features in set 1 capture only syntactic information, even though semantic information like named-entity tags should help. For example, ARGM-TMP typically contains DATE entities, and ARGM-LOC includes LOCATION named entities.
Feature set 1 does not capture predicates lexicalized by phrasal verbs, e.g. “put up”.
[Example constituents where the head word is not the most informative word: the PP "in last June", the SBAR "that occurred yesterday", and the VP "to be declared".]
47
Feature Set 2 (1/2)
CONTENT WORD (cw): lexicalized feature that selects an informative word from the constituent, other than the head. Selection heuristics available in the paper. E.g. "June" for the phrase "in last June".
PART OF SPEECH OF CONTENT WORD (cPos): part of speech tag of the content word. E.g. NNP for the phrase “in last June”.
PART OF SPEECH OF HEAD WORD (hPos): part of speech tag of the head word. E.g. NN for the phrase “the futures halt”.
NAMED ENTITY CLASS OF CONTENT WORD (cNE): The class of the named entity that includes the content word. 7 named entity classes (from the MUC-7 specification) covered. E.g. DATE for “in last June”.
48
Feature Set 2 (2/2)
BOOLEAN NAMED ENTITY FLAGS: set of features that indicate if a named entity is included at any position in the phrase:
  neOrganization: set to true if an organization name is recognized in the phrase.
  neLocation: set to true if a location name is recognized in the phrase.
  nePerson: set to true if a person name is recognized in the phrase.
  neMoney: set to true if a currency expression is recognized in the phrase.
  nePercent: set to true if a percentage expression is recognized in the phrase.
  neTime: set to true if a time of day expression is recognized in the phrase.
  neDate: set to true if a date temporal expression is recognized in the phrase.
PHRASAL VERB COLLOCATIONS: set of two features that capture information about phrasal verbs:
pvcSum: the frequency with which a verb is immediately followed by any preposition or particle.
pvcMax: the frequency with which a verb is followed by its predominant preposition or particle.
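A sketch of the two phrasal-verb collocation features, computed from a toy list of (verb, next token) pairs; the counts and the preposition/particle list are invented for the illustration.

    from collections import Counter

    verb_next_token = [("put", "up"), ("put", "up"), ("put", "off"), ("put", "the"),
                       ("assail", "the"), ("assail", "with")]
    PARTICLES_AND_PREPOSITIONS = {"up", "off", "with", "by", "in", "on"}

    def phrasal_verb_features(verb, pairs):
        follows = Counter(tok for v, tok in pairs
                          if v == verb and tok in PARTICLES_AND_PREPOSITIONS)
        pvc_sum = sum(follows.values())              # verb followed by any prep/particle
        pvc_max = max(follows.values(), default=0)   # verb followed by its predominant one
        return {"pvcSum": pvc_sum, "pvcMax": pvc_max}

    print(phrasal_verb_features("put", verb_next_token))     # {'pvcSum': 3, 'pvcMax': 2}
    print(phrasal_verb_features("assail", verb_next_token))  # {'pvcSum': 1, 'pvcMax': 1}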
49
Experiments (1/3)
Trained on PropBank release 2002/7/15 and Treebank release 2, both without Section 23. Named entity information extracted using CiceroLite.
Tested on PropBank and Treebank Section 23. Used gold-standard trees from Treebank, and named entities from CiceroLite.
Task 1 (identifying argument constituents):
  Negative examples: any Treebank phrases not tagged in PropBank. Due to memory limitations, we used ~11% of Treebank.
  Positive examples: Treebank phrases (from the same 11% set) annotated with any PropBank role.
Task 2 (assigning roles to argument constituents):
  Due to memory limitations we limited the example set to the first 60% of PropBank annotations.
50
Experiments (2/3)
Features                          Arg P   Arg R   Arg F1   Role A
FS1                               84.96   84.26   84.61    78.76
FS1 + POS tag of head word        92.24   84.50   88.20    79.04
FS1 + content word and POS tag    92.19   84.67   88.27    80.80
FS1 + NE label of content word    83.93   85.69   84.80    79.85
FS1 + phrase NE flags             87.78   85.71   86.73    81.28
FS1 + phrasal verb information    84.88   82.77   83.81    78.62
FS1 + FS2                         91.62   85.06   88.22    83.05
FS1 + FS2 + boosting              93.00   85.29   88.98    83.74
51
Experiments (3/3)
Four models compared:
  [Gildea and Palmer, 2002]
  [Gildea and Palmer, 2002], our implementation
  Our model with FS1
  Our model with FS1 + FS2 + boosting

Model            Implementation          Arg F1   Role A
Statistical      Gildea and Palmer       -        82.8
Statistical      This study              71.86    78.87
Decision Trees   FS1                     84.61    78.76
Decision Trees   FS1 + FS2 + boosting    88.98    83.74
52
Mapping Predicate-Argument Structures to Templettes
The mapping rules from predicate-argument structures to templette slots are currently manually produced, using training texts and the corresponding templettes. Effort per domain < 3 person hours, if training information is available.
We focused on two Event99 domains:
  "Market change" tracks changes of financial instruments. Relevant slots: INSTRUMENT (description of the financial instrument); AMOUNT_CHANGE (change amount); and CURRENT_VALUE (current instrument value after change).
  "Death" extracts person death events. Relevant slots: DECEASED (person deceased); MANNER_OF_DEATH (manner of death); and AGENT_OF_DEATH (entity that caused the death event).
53
Mappings for Event99 “Death” and “Market Change” Domains
(1) ARG1 and MARKET_CHANGE_VERB → INSTRUMENT
(2) ARG2 and (MONEY or PERCENT or NUMBER or QUANTITY) and MARKET_CHANGE_VERB → AMOUNT_CHANGE
(3) (ARG4 or ARGM-DIR) and NUMBER and MARKET_CHANGE_VERB → CURRENT_VALUE
Mapping rules for the “market change” domain
(1) (PERSON and ARG0 and DIE_VERB) or (PERSON and ARG1 and KILL_VERB) → DECEASED
(2) (ARG0 and KILL_VERB) or (ARG1 and DIE_VERB) → AGENT_OF_DEATH
(3) (ARGM-TMP and ILLNESS_NOUN) or KILL_VERB or DIE_VERB → MANNER_OF_DEATH
(4) ARGM-TMP and DATE → DATE
(5) (ARGM-LOC or ARGM-TMP) and LOCATION → LOCATION

Mapping rules for the "death" domain
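An illustrative sketch of applying mapping rules of this kind to one predicate-argument structure from the walk-through sentence. The rule encoding, lexicons, and the structure itself are hand-made for the example and only cover a fragment of the rules above.

    KILL_VERBS = {"kill"}
    DIE_VERBS = {"die"}

    pred_arg = {                      # from "... killing six astronauts"
        "predicate_lemma": "kill",
        "predicate_text": "killing",
        "args": {
            "ARG0": {"text": "The space shuttle Challenger", "ne": None},
            "ARG1": {"text": "six astronauts", "ne": "PERSON"},
        },
    }

    def map_to_templette(pa):
        templette = {}
        lemma = pa["predicate_lemma"]
        for role, arg in pa["args"].items():
            # rule (1): PERSON and ARG1 and KILL_VERB -> DECEASED
            if arg["ne"] == "PERSON" and role == "ARG1" and lemma in KILL_VERBS:
                templette["DECEASED"] = arg["text"]
            # rule (2): ARG0 and KILL_VERB -> AGENT_OF_DEATH
            if role == "ARG0" and lemma in KILL_VERBS:
                templette["AGENT_OF_DEATH"] = arg["text"]
        # rule (3), simplified here: the predicate itself indicates the manner of death
        if lemma in KILL_VERBS or lemma in DIE_VERBS:
            templette["MANNER_OF_DEATH"] = pa["predicate_text"]
        return templette

    print(map_to_templette(pred_arg))
    # {'AGENT_OF_DEATH': 'The space shuttle Challenger', 'DECEASED': 'six astronauts',
    #  'MANNER_OF_DEATH': 'killing'}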
54
Experimental Setup
Three systems compared:
This model with predicate-argument structures detected using the statistical approach.
This model with predicate-argument structures detected using decision trees.
Cascaded Finite-State-Automata system (Cicero).
In all systems entity coreference and event fusion disabled.
[Diagram of the cascaded FSA (Cicero) system used for comparison: named-entity recognizer, phrasal parser, combiner, domain-specific event recognizer, entity coreference, and event fusion, mapping Documents to Templettes.]
55
Experiments
System                  Market Change   Death
Pred/Args Statistical   68.9%           58.4%
Pred/Args Inductive     82.8%           67.0%
FSA                     91.3%           72.7%

System                  Correct   Missed   Incorrect
Pred/Args Statistical   26        16       3
Pred/Args Inductive     33        9        2
FSA                     38        4        2
56
The good and the bad
The good:
The method achieves over 88% F-measure for the task of identifying argument constituents, and over 83% accuracy for role labeling.
The model scales well to unknown predicates because predicate lexical information is used for less than 5% of the branching decisions.
Domain customization of the complete IE system is less than 3 person hours per domain because most of the components are open-domain. Domain-specific components can be modeled with machine learning (future work).
Performance degradation versus a fully-customized IE system is only 10%. Will be further decreased by including coreference resolution (open-domain) and event fusion (domain-specific).
The bad:
  Currently PropBank provides annotations only for verb-based predicates; noun-noun relations cannot be modeled for now.
  Cannot be applied to unstructured text, where full parsing does not work.
  Slower than the cascaded FSA models.
57
Other Pattern-Free Systems
Algorithms That Learn To Extract Information. BBN: Description Of The Sift System As Used For MUC-7. Scott Miller et al. http://citeseer.nj.nec.com/miller98algorithms.html
Probabilistic model with features extracted from full parse trees enhanced with NEs
Kernel Methods for Relation Extraction. Dmitry Zelenko and Chinatsu Aone. http://citeseer.nj.nec.com/zelenko02kernel.html
Tree-based SVM kernels used to discover EELD relations.
Automatic Pattern Acquisition for Japanese Information Extraction. Kiyoshi Sudo et al. http://citeseer.nj.nec.com/sudo01automatic.html
Learns parse trees that subsume the information of interest.
58
End
Thank you! (Gràcies!)