Automatic Acquisition of Lexical Classes and Extraction Patterns for Information Extraction

Kiyoshi Sudo

Ph.D. Research Proposal

New York University

Committee:

Ralph Grishman

Satoshi Sekine

I. Dan Melamed

August 9, 2002. Kiyoshi Sudo, Thesis Proposal Presentation

Outline

Introduction
Research Proposal
– Problem Setting
– Approach
– Application to Information Extraction
Discussion

MURREE, Pakistan (AP) -- Masked gunmen firing Kalashnikov rifles burst through the front gates of a Christian school Monday, killing six people and wounding three in the latest attack against Western interests since Pakistan joined the war against terrorism.

MUC Scenario Template Task

Date Perpetrator Weapon Victim Location

MUC Scenario Template Task (filled)

Date: Monday
Perpetrator: Masked gunmen
Weapon: Kalashnikov rifles
Victim: six people; three
Location: a Christian school
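As a minimal sketch, the filled template can be held in a small record type. The class name `TerrorismTemplate` and the choice of lists for multi-valued slots are assumptions made for illustration; the slides only name the five slots and their fillers.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TerrorismTemplate:
    """One scenario template; slot names follow the slide (hypothetical class)."""
    date: List[str] = field(default_factory=list)
    perpetrator: List[str] = field(default_factory=list)
    weapon: List[str] = field(default_factory=list)
    victim: List[str] = field(default_factory=list)
    location: List[str] = field(default_factory=list)


# Slot fillers taken from the AP passage on the slide.
template = TerrorismTemplate(
    date=["Monday"],
    perpetrator=["Masked gunmen"],
    weapon=["Kalashnikov rifles"],
    victim=["six people", "three"],
    location=["a Christian school"],
)
```

Lists are used because a slot such as Victim can take more than one filler from a single passage.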

High Cost for Acquiring Knowledge-Base

Find extraction patterns
– Find relevant documents
– Find relevant events
– Analyze sentences

Find domain-specific lexicon
– Find existing KB (e.g. thesaurus, gazetteers)

Prior Work

Automatic Knowledge Acquisition

Lexical Acquisition
Pattern Acquisition

Mutual Bootstrapping (Riloff and Jones 1999)

Simultaneous Multi-Semantic Class (Thelen and Riloff 2002; Yangarber et al. 2002)

Pattern Discovery with Document Re-ranking (Yangarber et al. 2000)

Pattern Acquisition for QA (Ravichandran and Hovy 2002)

Challenge

Seed Lexicon, Seed Pattern

Expanded Lexicon, Expanded Pattern Set

User

Knowledge Base

MUC-3: Terrorism Event
– Date
– Type
– Perpetrator-Individual
– Perpetrator-Org
– Physical Target
– Physical Target-Num
– Physical Target-Type
– Human Target
– Human Target-Num
– …

Meeting the Challenge

Seed Lexicon, Seed Pattern

Expanded Lexicon, Expanded Pattern Set

User

Knowledge Base

Semantic Clustering

Scenario Description

Semantic Cluster

Semantic Clustering

Input: Scenario Description
– Description specific enough to define the scenario
– e.g. (terrorism, bombing, kidnapping)
– e.g. “Tell me about the terrorism action, such as bombing and kidnapping.”

Goal: Semantic Cluster
– Find scenario-specific semantic clusters, each of which consists of
    Semantic Lexicon
    Extraction Patterns

Benefit for User

[Diagram labels: Scenario Description, Semantic Clustering, Semantic Cluster]

Simplify Domain Analysis

Low-cost Knowledge-base Acquisition for IE systems

Extraction Patterns

Definition: a set of patterns, context(w, c) : L
where c unifies with the context that is defined by semantic class L

context =

Case Frame: (bomb (v), x (subj), himself (obj))

Sequential: (x, bombs, himself)

Dependency: x <-[V:subj]- bomb -[V:obj]-> himself (cf. Sudo et al. 2001)
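A minimal sketch of how a dependency-style pattern with one open slot x might be matched against parsed text. The `Dep` and `Pattern` types, the `extract` function, and the example parse are all hypothetical; the slides do not specify a data structure, and the relation labels simply reuse the V:subj / V:obj notation above.

```python
from typing import List, NamedTuple


class Dep(NamedTuple):
    """One dependency triple from a parsed sentence (hypothetical type)."""
    head: str       # lemma of the governing word, e.g. "bomb"
    relation: str   # relation label, e.g. "V:subj"
    dependent: str  # the word attached to the head


class Pattern(NamedTuple):
    """A pattern (head, relation, x) whose slot x is left open."""
    head: str
    relation: str


def extract(pattern: Pattern, deps: List[Dep]) -> List[str]:
    """Return every word that fills slot x of the pattern in the parse."""
    return [d.dependent for d in deps
            if d.head == pattern.head and d.relation == pattern.relation]


# Hand-written parse of an invented sentence, "the attacker bombs himself".
parse = [Dep("bomb", "V:subj", "attacker"), Dep("bomb", "V:obj", "himself")]
fillers = extract(Pattern("bomb", "V:subj"), parse)  # words filling x
```

The words returned for a pattern are the candidates for the semantic class L that the pattern is associated with.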

Outline

Introduction
Research Proposal
– Problem Setting
– Approach
– Information Extraction
Evaluation

Overview

[Diagram: Semantic Clustering takes a Scenario Description and produces a Semantic Cluster. Components shown: Source, Information Retrieval, Bootstrapping, Query Expansion.]

Information Retrieval

Get relevant document set (R)

Get list of lexical items and extraction patterns ordered by relevance to the scenario
– TF/IDF scoring
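The slides do not give the exact TF/IDF formula, so the following is only a sketch under a common assumption: term frequency is counted inside the relevant set R and normalized by its total count, and inverse document frequency is the log of collection size over document frequency. The function name `tfidf_scores` and the toy corpus are hypothetical, and R is assumed to be a subset of the collection.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_scores(relevant_docs: List[List[str]],
                 all_docs: List[List[str]]) -> Dict[str, float]:
    """Score each item seen in the relevant set R (assumed formula)."""
    doc_freq = Counter()
    for doc in all_docs:
        doc_freq.update(set(doc))      # count each item once per document
    term_freq = Counter()
    for doc in relevant_docs:
        term_freq.update(doc)          # raw counts inside R
    total = sum(term_freq.values())
    n_docs = len(all_docs)
    return {item: (term_freq[item] / total) * math.log(n_docs / doc_freq[item])
            for item in term_freq}


# Toy collection: the first two documents play the role of R.
collection = [
    ["president", "named", "company"],
    ["president", "resign", "company"],
    ["weather", "rain", "company"],
    ["game", "score", "company"],
    ["market", "stock", "company"],
]
scores = tfidf_scores(collection[:2], collection)
ranked = sorted(scores, key=scores.get, reverse=True)  # most relevant first
```

The same scoring applies to extraction patterns if each document is represented by the patterns found in it rather than by its words.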

Example of TF/IDF scoring (Management Succession: Business)

300 documents retrieved from WSJ (7/94 - 8/94)
Extracted by MINIPAR (Lin 1998)

w (N,V)      TF/IDF
president    0.3897
officer      0.3835
named        0.3832
executive    0.3273
Mr.          0.2587
chairman     0.2214
vice         0.2186
years        0.1800
company      0.1606
Inc.         0.1605

p                            TF/IDF
(succeed, V:obj:N, x)        0.3435
(x, N:title:, Mr.)           0.3311
(succeed, V:subj:N, x)       0.3167
(name, V:obj:N, x)           0.3141
(name, V:subj:N, x)          0.3069
(name, V:iobj:N, x)          0.2454
(resign, V:subj:N, x)        0.1920
(as, Prep:pcomp-n:N, x)      0.1118
(retire, V:subj:N, x)        0.0915
(remain, V:subj:N, x)        0.0861

Overview

[Diagram: Semantic Clustering overview (Scenario Description, Source, Information Retrieval, Bootstrapping, Query Expansion, Semantic Cluster), with the labels "extraction patterns" and "lexicon" added.]

Bootstrapping

Assumption:
– Patterns provide Lexical Classes.
– Lexicon provides contextual information.
(Riloff and Jones 1999; Agichtein and Gravano 2000)

Find one cluster that consists of Lexicon and Extraction Patterns

Bootstrapping (Cont.)

Algorithm (cf. Riloff and Jones 1999)
– Given
    the ordered list of terms
    the ordered list of extraction patterns
    Lexicon = (), Pattern = ()
– w ← the most relevant term in the list; add it to Lexicon
1. p ← the most relevant pattern among those that extract w
2. Add p to Pattern
3. w ← the most relevant term among those that are extracted by p
4. Add w to Lexicon
5. Go to 1
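The steps above can be sketched as follows. Two details are assumptions, because the slide leaves them open: items already selected are skipped, and the loop stops when no new pattern or term is available or after `max_iters` rounds. The `extracts` mapping (pattern to the terms it extracts) and the toy data are hypothetical and are not the WSJ results.

```python
from typing import Dict, List, Set, Tuple


def bootstrap(ranked_terms: List[str],
              ranked_patterns: List[str],
              extracts: Dict[str, Set[str]],
              max_iters: int = 10) -> Tuple[List[str], List[str]]:
    """Alternate between picking a pattern and a term, best-ranked first."""
    lexicon: List[str] = []
    patterns: List[str] = []
    if not ranked_terms:
        return lexicon, patterns
    w = ranked_terms[0]                  # the most relevant term overall
    lexicon.append(w)
    for _ in range(max_iters):           # assumed stopping criterion
        # Step 1: best-ranked unused pattern that extracts w.
        p = next((q for q in ranked_patterns
                  if q not in patterns and w in extracts.get(q, set())), None)
        if p is None:
            break
        patterns.append(p)               # Step 2
        # Step 3: best-ranked unused term extracted by p.
        nxt = next((t for t in ranked_terms
                    if t not in lexicon and t in extracts[p]), None)
        if nxt is None:
            break
        w = nxt
        lexicon.append(w)                # Step 4, then loop back to Step 1
    return lexicon, patterns


# Toy input: both lists are already ordered by relevance.
terms = ["president", "officer", "chairman", "company"]
pats = ["(succeed, V:obj:N, x)", "(name, V:obj:N, x)", "(resign, V:subj:N, x)"]
toy_extracts = {
    "(succeed, V:obj:N, x)": {"president", "chairman"},
    "(name, V:obj:N, x)": {"officer", "chairman"},
    "(resign, V:subj:N, x)": {"officer", "company"},
}
lexicon, patterns = bootstrap(terms, pats, toy_extracts)
```

Because each round follows only the single best pattern and term, the result is one cluster, in line with the goal stated on the previous slide.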

Example of Bootstrapping (Management Succession: Business)

From WSJ (7/94 - 8/94)
Extracted by MINIPAR (Lin 1998)

w (N,V)      TF/IDF
president    0.3897
officer      0.3835
named        0.3832
executive    0.3273
Mr.          0.2587
chairman     0.2214
vice         0.2186
years        0.1800
company      0.1606
Inc.         0.1605

p                            TF/IDF
(succeed, V:obj:N, x)        0.3435
(x, N:title:, Mr.)           0.3311
(succeed, V:subj:N, x)       0.3167
(name, V:obj:N, x)           0.3141
(name, V:subj:N, x)          0.3069
(name, V:iobj:N, x)          0.2454
(resign, V:subj:N, x)        0.1920
(as, Prep:pcomp-n:N, x)      0.1118
(retire, V:subj:N, x)        0.0915
(remain, V:obj:N, x)         0.0861

Problem: Polysemous Lexicon, Pattern

Lexicon can be ambiguous
– e.g. Clinton (Person, Organization, Location …)

Extraction patterns can be ambiguous
– e.g. be killed in <x> (x: Location, Date …)

Needs more study
– more restriction
– Probabilistic Model ??

Overview

[Diagram: Semantic Clustering overview (Scenario Description, Source, Information Retrieval, Bootstrapping, Query Expansion, Semantic Cluster), with the labels "pattern", "lexicon", "pt" and "lex" added.]

Query Expansion

Generalize terms in a query with a newly discovered cluster
– cf. Rocchio 1971 (Vector model)
– Zhai and Lafferty 2001 (Language-modeling)
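For the vector-model variant, a minimal sketch of the standard Rocchio update is given below: the new query is alpha times the old query, plus beta times the centroid of the relevant vectors, minus gamma times the centroid of the non-relevant vectors. Treating the items of a newly discovered cluster as the "relevant" vectors is an assumption about how the cluster would be folded in; the slides do not spell this out. The default weights are illustrative only.

```python
from collections import defaultdict
from typing import Dict, Sequence

Vector = Dict[str, float]  # sparse term -> weight


def rocchio(query: Vector,
            relevant: Sequence[Vector],
            nonrelevant: Sequence[Vector] = (),
            alpha: float = 1.0,
            beta: float = 0.75,
            gamma: float = 0.15) -> Vector:
    """Standard Rocchio update on sparse vectors."""
    expanded = defaultdict(float)
    for term, weight in query.items():
        expanded[term] += alpha * weight
    # Add the relevant centroid, subtract the non-relevant centroid.
    for docs, coeff in ((relevant, beta), (nonrelevant, -gamma)):
        if not docs:
            continue
        for doc in docs:
            for term, weight in doc.items():
                expanded[term] += coeff * weight / len(docs)
    # Terms pushed to zero or below are dropped, a common convention.
    return {t: w for t, w in expanded.items() if w > 0}


# Toy example: expand a one-term query with two cluster-derived vectors.
new_query = rocchio({"terrorism": 1.0},
                    [{"bombing": 1.0, "terrorism": 1.0}, {"kidnapping": 1.0}],
                    beta=0.5, gamma=0.0)
```

The expanded query keeps the original term with the highest weight and adds the cluster terms at lower weight, which is the generalization effect the slide describes.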

Application to Information Extraction

[Diagram: Semantic Clustering takes a Scenario Description and produces a Semantic Cluster (Semantic Lexicon, Extraction Patterns). IE components shown: Preprocessing, Entity Recognition, Event Recognition, Role Assignment, Merging, Pattern Matching.]

Human Intervention

Extraction patterns
– Event pattern
    Context contains a verb or nominalization of verb
    Used for event extraction and role assignment
    e.g. (terrorist, fire, x)
– Local pattern
    Context contains only enough information to recognize semantic class
    Used for entity recognition only
    e.g. (x, Inc.)

Association of Event Pattern to Role
– e.g. (company, hire, x) → PersonIn and (company, fire, x) → PersonOut
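The association of event patterns to roles is the part supplied by a person, and it can be sketched as a plain lookup table. The names `ROLE_OF` and `assign_role`, and the filler "Smith", are hypothetical; only the two pattern-to-role pairs come from the slide.

```python
from typing import Dict, Optional, Tuple

EventPattern = Tuple[str, str, str]  # (subject class, verb, open slot "x")

# Hand-written association of event patterns to template roles.
ROLE_OF: Dict[EventPattern, str] = {
    ("company", "hire", "x"): "PersonIn",
    ("company", "fire", "x"): "PersonOut",
}


def assign_role(subject_class: str, verb: str,
                filler: str) -> Optional[Tuple[str, str]]:
    """Return (role, filler) if the matched event pattern has a role."""
    role = ROLE_OF.get((subject_class, verb, "x"))
    return (role, filler) if role is not None else None


hired = assign_role("company", "hire", "Smith")
```

Local patterns such as (x, Inc.) would not appear in this table, since they are used for entity recognition only.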

Discussion

Domain Portability
– User only needs to specify the scenario

Language Portability
– Language-dependent Tools
    Segmentation (Lemmatization)
    Dependency Parsing

Evaluation

MUC-style (Scenario-Template task)
– Slot-based
    Precision, Recall, F-measure
– Domain Portability
    Several pre-defined tasks that differ in difficulty
– Language Portability
    Japanese, English
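A minimal sketch of slot-based scoring, assuming exact-match comparison of (slot, filler) pairs and the balanced F-measure. The official MUC scorer is more elaborate (for example, it gives partial credit), so this is an illustration of the three measures rather than a reimplementation; the function name and the predicted set are hypothetical.

```python
from typing import Set, Tuple

Slot = Tuple[str, str]  # (slot name, filler)


def slot_scores(predicted: Set[Slot], gold: Set[Slot]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and balanced F-measure over slots."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Gold slots from the terrorism example; the system output is invented.
gold = {
    ("Date", "Monday"), ("Perpetrator", "Masked gunmen"),
    ("Weapon", "Kalashnikov rifles"), ("Victim", "six people"),
    ("Victim", "three"), ("Location", "a Christian school"),
}
predicted = {
    ("Date", "Monday"), ("Perpetrator", "Masked gunmen"),
    ("Weapon", "Kalashnikov rifles"), ("Location", "Pakistan"),
}
p, r, f = slot_scores(predicted, gold)
```

Here three of the four predicted slots are correct and three of the six gold slots are found, so precision is 0.75, recall is 0.5, and the F-measure is 0.6.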

Contribution

Tool for Domain Analysis

Low-cost Knowledge-base Acquisition

Towards Open-domain Information Extraction

Conclusion

Proposed New Approach for Knowledge-base Acquisition (Semantic Clustering)

Discussed Application of Acquired KB to Information Extraction (Human Intervention and Local vs. Event patterns)

Discussed Evaluation with several predefined MUC-style tasks different in difficulty and across languages (Domain portability and Language portability)

ToDo

Implementation

Preparation for Evaluation

Evaluation

Time for Questions