Post on 21-Dec-2015
Graph-Based Methods for “Open Domain”
Information Extraction
William W. Cohen
Machine Learning Dept. and Language Technologies Institute
School of Computer Science, Carnegie Mellon University
Traditional IE vs Open Domain IE
Traditional IE:
• Goal: recognize people, places, companies, times, dates, … in NL text.
• Supervised learning from corpus completely annotated with target entity class (e.g. “people”)
• Linear-chain CRFs
• Language- and genre-specific extractors
Open Domain IE:
• Goal: recognize arbitrary entity sets in text
– Minimal info about entity class
– Example 1: “ICML, NIPS”
– Example 2: “Machine learning conferences”
• Semi-supervised learning from very large corpora (WWW)
• Graph-based learning methods
• Techniques are largely language-independent (!)
– Graph abstraction fits many languages
Outline
• History
– Open-domain IE by pattern-matching
• The bootstrapping-with-noise problem
– Bootstrapping as a graph walk
• Open-domain IE as finding nodes “near” seeds on a graph
– Approach 1: A “natural” graph derived from a smaller corpus + learned similarity
– Approach 2: A carefully-engineered graph derived from huge corpus (examples above)
History: Open-domain IE by pattern-matching (Hearst, 92)
• Start with seeds: “NIPS”, “ICML”
• Look through a corpus for certain patterns:
– … “at NIPS, AISTATS, KDD and other learning conferences…”
• Expand from seeds to new instances
• Repeat … until ___
– “on PC of KDD, SIGIR, … and…”
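The loop above can be sketched in a few lines. This is only a toy illustration, not Hearst's system: the corpus sentences, the single regular expression `LIST_PATTERN`, and the stopping rule (stop when nothing new is found, or after `max_iters` rounds) are all assumptions, since the slide leaves the "until" condition blank.

```python
import re

# Hypothetical toy corpus echoing the examples on the slides.
CORPUS = [
    "He spoke at NIPS, AISTATS, KDD and other learning conferences.",
    "She is on the PC of KDD, SIGIR and other venues.",
    "For skiers, NIPS, SNOWBIRD and other meetings are convenient.",
]

# One coordination pattern: "A, B, C and other ..." over capitalized tokens.
LIST_PATTERN = re.compile(r"((?:[A-Z][A-Za-z]+, )+[A-Z][A-Za-z]+) and other")


def bootstrap(seeds, corpus, max_iters=5):
    """Return {instance: iteration at which it was first found} (seeds are 0)."""
    found_at = {seed: 0 for seed in seeds}
    for iteration in range(1, max_iters + 1):
        known = set(found_at)
        new = set()
        for sentence in corpus:
            for match in LIST_PATTERN.finditer(sentence):
                items = set(match.group(1).split(", "))
                # A list that mentions a known instance contributes its other items.
                if items & known:
                    new |= items - known
        if not new:  # assumed stopping rule: no new instances
            break
        for item in new:
            found_at[item] = iteration
    return found_at


result = bootstrap(["NIPS", "ICML"], CORPUS)
```

With this corpus, SIGIR is only reached in the second round (through KDD), and SNOWBIRD is pulled in by the ski-related sentence, which is the kind of drift the next slides address.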
Bootstrapping as graph proximity
[Figure: graph connecting context strings to the entity names that occur in them]
Contexts: “…at NIPS, AISTATS, KDD and other learning conferences…”; “… on PC of KDD, SIGIR, … and…”; “For skiers, NIPS, SNOWBIRD, … and…”; “… AISTATS, KDD, …”
Entities: NIPS, AISTATS, KDD, SNOWBIRD, SIGIR
• Shorter paths ~ earlier iterations
• Many paths ~ additional evidence
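A minimal sketch of the two observations, assuming the four contexts are named `c1`..`c4` (hypothetical identifiers) and that "evidence" is measured by simply counting walks of bounded length from the seed; the talk itself moves on to probabilistic walks.

```python
# Hypothetical context ids mapped to the entity names they contain.
CONTEXTS = {
    "c1": ["NIPS", "AISTATS", "KDD"],
    "c2": ["KDD", "SIGIR"],
    "c3": ["NIPS", "SNOWBIRD"],
    "c4": ["AISTATS", "KDD"],
}


def build_graph(contexts):
    """Undirected graph: each context is linked to every entity it mentions."""
    adj = {}
    for ctx, entities in contexts.items():
        for entity in entities:
            adj.setdefault(ctx, set()).add(entity)
            adj.setdefault(entity, set()).add(ctx)
    return adj


def walk_counts(adj, seed, max_len):
    """Count walks of length 1..max_len from seed to each node, and record
    the length of the shortest walk that first reaches each node."""
    current = {seed: 1}
    total, first_seen = {}, {seed: 0}
    for step in range(1, max_len + 1):
        nxt = {}
        for node, n_walks in current.items():
            for nbr in adj[node]:
                nxt[nbr] = nxt.get(nbr, 0) + n_walks
        for node, n_walks in nxt.items():
            total[node] = total.get(node, 0) + n_walks
            first_seen.setdefault(node, step)
        current = nxt
    return total, first_seen


total, first_seen = walk_counts(build_graph(CONTEXTS), "NIPS", max_len=4)
```

Here `first_seen` plays the role of "iteration" (each bootstrapping round corresponds to two steps, entity to context to entity), and `total` is larger for KDD than for SNOWBIRD because more walks lead to it.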
Outline
• Open-domain IE as finding nodes “near” seeds on a graph
– Approach 1: A “natural” graph derived from a smaller corpus + learned similarity (“with” Einat Minkov (CMU Nokia))
– Approach 2: A carefully-engineered graph derived from huge corpus (above) (“with” Richard Wang (CMU ?))
Learning Similarity Measures for Parsed Text (Minkov & Cohen, EMNLP 2008)
[Figure: dependency parse tree. Words: boys, like, playing, all, kinds, cars. Edge labels: nsubj, partmod, prep.with, det, prep.of. POS tags: NN, VB, DT.]
A dependency-parsed sentence is naturally represented as a tree
Learning Similarity Measures for Parsed Text (Minkov & Cohen, EMNLP 2008)
A dependency-parsed corpus is “naturally” represented as a graph
Learning Similarity Measures for Parsed Text (Minkov & Cohen, EMNLP 2008)
Open IE Goal:
• Find “coordinate terms” (e.g., girl/boy, dolls/cars) in the graph, or find
• Similarity measure S so S(girl,boy) is high
• What about off-the-shelf similarity measures:
– Random Walk with Restart (RWR)
– Hitting time
– Commute time
– … ?
Personalized PR/RWR
• The graph: nodes, node types, edge labels, edge weights
• Graph walk parameters: edge weights Θ, walk length K and reset probability γ.
• M[x,y] = probability of reaching y from x in one step: the weight of the edge from x to y, divided by the total outgoing weight from x.
• “Personalized PageRank”: reset probability biased towards the initial distribution.
• A query language: Q: { , }
• Returns a list of nodes (of type ) ranked by the graph walk probabilities.
• Approximate with power iteration, cut off after a fixed number of iterations K.
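A minimal sketch of the walk, assuming the usual update V ← γ·V0 + (1 − γ)·Mᵀ·V, a graph given as `(x, label, y)` triples, and one weight per edge label in `theta`. The function names and the toy graph are illustrative, not the authors' code.

```python
def transition_probs(edges, theta):
    """M[x][y] = weight of edges x->y divided by total outgoing weight of x.
    edges: iterable of (x, label, y); theta: weight for each edge label."""
    out = {}
    for x, label, y in edges:
        out.setdefault(x, {})
        out[x][y] = out[x].get(y, 0.0) + theta[label]
    return {
        x: {y: w / sum(nbrs.values()) for y, w in nbrs.items()}
        for x, nbrs in out.items()
    }


def personalized_pagerank(M, v0, gamma=0.5, K=6):
    """Power iteration, cut off after K steps; v0 is the initial distribution."""
    v = dict(v0)
    for _ in range(K):
        nxt = {x: gamma * p for x, p in v0.items()}  # reset towards v0
        for x, p in v.items():
            for y, m in M.get(x, {}).items():        # one walk step
                nxt[y] = nxt.get(y, 0.0) + (1 - gamma) * p * m
        v = nxt
    return v


# Toy graph: every edge has an inverse, so no probability mass is lost.
EDGES = [("a", "r", "b"), ("b", "r-inv", "a"), ("a", "s", "c"), ("c", "s-inv", "a")]
THETA = {"r": 2.0, "r-inv": 1.0, "s": 1.0, "s-inv": 1.0}
scores = personalized_pagerank(transition_probs(EDGES, THETA), {"a": 1.0})
ranked = sorted(scores, key=scores.get, reverse=True)
```

Raising the weight of edge label `r` moves `b` above `c` in the ranking, which is the lever that weight tuning uses later in the talk.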
Example paths:
girls -mention-> girls1 -nsubj-> like1 -mention-1-> like -mention-> like2 -nsubj-1-> boys2 -mention-1-> boys
girls -mention-> girls1 -nsubj-> like1 -partmod-> playing1 -mention-1-> playing -mention-> … -mention-1-> boys
girls -mention-> girls1 -nsubj-> like1 -partmod-> playing1 -prep.with-> dolls1 -mention-1-> dolls
Useful but not our goal here…
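A sketch of how such paths arise, assuming each sentence is given as `(source_word, label, target_word)` triples in the direction used on the slide, that mention nodes are named word + sentence index, and that inverse edges carry an `-inv` suffix. The second sentence is simplified; all names are illustrative.

```python
from collections import defaultdict


def build_graph(sentences):
    """Word nodes link to per-sentence mention nodes; mentions link by
    dependency label. Every edge also gets an inverse ('-inv') edge."""
    graph = defaultdict(list)  # node -> [(edge_label, neighbor), ...]

    def add_edge(x, label, y):
        graph[x].append((label, y))
        graph[y].append((label + "-inv", x))

    for i, triples in enumerate(sentences, start=1):
        seen = set()
        for src, label, dst in triples:
            for word in (src, dst):
                if word not in seen:
                    seen.add(word)
                    add_edge(word, "mention", f"{word}{i}")
            add_edge(f"{src}{i}", label, f"{dst}{i}")
    return graph


def label_sequences(graph, source, target, max_len):
    """Edge-label sequences of all walks of length <= max_len from source to target."""
    results, stack = [], [(source, ())]
    while stack:
        node, labels = stack.pop()
        if node == target and labels:
            results.append(labels)
        if len(labels) < max_len:
            for label, nbr in graph[node]:
                stack.append((nbr, labels + (label,)))
    return results


SENTENCES = [
    [("girls", "nsubj", "like"), ("like", "partmod", "playing"), ("playing", "prep.with", "dolls")],
    [("boys", "nsubj", "like"), ("like", "partmod", "playing"), ("playing", "prep.with", "cars")],
]
paths = label_sequences(build_graph(SENTENCES), "girls", "boys", max_len=6)
```

With a six-step limit the only connection found is the one through the shared word node `like`, matching the first path above; the connection through `playing` needs eight steps.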
Learning a better similarity metric
[Figure: for a task T (query class), each training query (Query a, Query b, …, Query q) is run through the GRAPH WALK, yielding a ranked list of nodes (rank 1 … rank 50); each list is paired with the relevant answers for that query (+ Rel. answers a, + Rel. answers b, …, + Rel. answers q).]
• Seed words (“girl”, “boy”, …)
• Potential new instances of the target concept (“doll”, “child”, “toddler”, …)
Learning methods
• Weight tuning – weights learned per edge type [Diligenti et al., 2005]
• Reranking – re-order the retrieved list using global features of all paths from source to destination [Minkov et al., 2006]
FEATURES
• Edge label sequences (example nodes: boys, dolls):
– nsubj.nsubj-inv
– nsubj partmod partmod-inv nsubj-inv
– nsubj partmod prep.in
• Lexical unigrams: “like”, “playing”
• …
Learning methods: Path-Constrained Graph Walk
PCW (summary): for each node x, learn
$P(x \to z : \mathrm{relevant}(z) \mid \mathrm{history}(V_q, x))$
History(Vq,x) = sequence of edge labels leading from Vq to x, with all histories stored in a tree
[Figure: path tree rooted at Vq = “girls”; branches labeled nsubj, nsubj-inv, partmod, partmod-inv, prep.in; intermediate nodes x1, x2, x3; leaves boys, dolls. Edge-label sequences shown: nsubj.nsubj-inv; nsubj partmod partmod-inv nsubj-inv; nsubj partmod prep.in]
City and person name extraction
City names: Vq = {sydney, stamford, greenville, los_angeles}
Person names: Vq = {carter, dave_kingman, pedro_ramos, florio}

Corpus   words    nodes    edges    NEs          Labeling
MUC      140K     82K      244K     3K (true)    Complete
MUC+AP   2,440K   1,030K   3,550K   36K (auto)   Partial/Noisy

– 10 (X4) queries for each task
  • Train queries q1-q5 / test queries q6-q10
– Extract nodes of type NE.
– GW: 6 steps, uniform/learned weights
– Reranking: top 200 nodes (using learned weights)
– Path trees: 20 correct / 20 incorrect; threshold 0.5
[Figure: precision (0 to 1) vs. rank (1 to 100) on MUC; panels: City names, Person names; curve: Graph Walk]
[Figure: precision (0 to 1) vs. rank (1 to 100) on MUC; panels: City names, Person names; curves: Graph Walk, Weight Tuning]
Edge types shown for weight tuning: City names: conj-and, prep-in, nn, appos …; Person names: subj, obj, poss, nn …
[Figure: precision (0 to 1) vs. rank (1 to 100) on MUC; panels: City names, Person names; curves: Graph Walk, Weight Tuning, PCW]
Edge types shown for weight tuning: City names: conj-and, prep-in, nn, appos …; Person names: subj, obj, poss, nn …
Paths shown for PCW: City names: prep-in-inv conj-and; nsubj nsubj-inv. Person names: nn-inv nn; appos nn-inv.
[Figure: precision (0 to 1) vs. rank (1 to 100) on MUC; panels: City names, Person names; curves: Graph Walk, Weight Tuning, PCW, Reranking]
Edge types shown for weight tuning: City names: conj-and, prep-in, nn, appos …; Person names: subj, obj, poss, nn …
Paths shown for PCW: City names: prep-in-inv conj-and; nsubj nsubj-inv. Person names: nn-inv nn; appos nn-inv.
Lexical features shown for reranking: City names: LEX.”based”, LEX.”downtown”; Person names: LEX.”mr”, LEX.”president”
Vector-space models
• Co-occurrence vectors (counts; window: +/- 2)
• Dependency vectors [Padó & Lapata, Comp Ling 07]
– A path value function:
  • Length-based value: 1 / length(path)
  • Relation based value: subj-5, obj-4, obl-3, gen-2, else-1
– Context selection function:
  • Minimal: verbal predicate-argument (length 1)
  • Medium: coordination, genitive construction, noun compounds (<=3)
  • Maximal: combinations of the above (<=4)
– Similarity function:
  • Cosine
  • Lin
Only score the top nodes retrieved with reranking (~1000 overall)
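A minimal sketch of the simplest baseline above: co-occurrence count vectors with a +/- 2 token window, compared by cosine. The two toy sentences and the function names are assumptions for illustration; the dependency-vector model of Padó & Lapata is not reproduced here.

```python
import math
from collections import Counter


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within +/- window tokens."""
    vectors = {}
    for tokens in sentences:
        for i, word in enumerate(tokens):
            context = vectors.setdefault(word, Counter())
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    context[tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


SENTENCES = [
    "girls like playing with dolls".split(),
    "boys like playing with cars".split(),
]
vectors = cooccurrence_vectors(SENTENCES)
sim_girls_boys = cosine(vectors["girls"], vectors["boys"])
sim_girls_dolls = cosine(vectors["girls"], vectors["dolls"])
```

On this toy input "girls" and "boys" share both context words, while "girls" and "dolls" share only one, so the first similarity is higher.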
GWs – Vector models
[Figure: precision (0 to 1) vs. rank (1 to 100) on MUC; panels: City names, Person names; curves: PCW, Rerank, CO, DV]
The graph-based methods are best (syntactic + learning)
GWs – Vector models
[Figure: precision (0 to 1) vs. rank (1 to 100) on MUC + AP; panels: City names, Person names; curves: Rerank, PCW, DV, CO]
The advantage of the graph based models diminishes with the amount of data.
This is hard to evaluate at high ranks
Outline
• Open-domain IE as finding nodes “near” seeds on a graph
– Approach 1: A “natural” graph derived from a smaller corpus + learned similarity (“with” Einat Minkov (CMU Nokia))
– Approach 2: A carefully-engineered graph derived from huge corpus (“with” Richard Wang (CMU ?))
Set Expansion for Any Language (SEAL) – (Wang & Cohen, ICDM 07)
• Basic ideas
– Dynamically build the graph using queries to the web
– Constrain the graph to be as useful as possible
  • Be smart about queries
  • Be smart about “patterns”: use clever methods for finding meaningful structure on web pages
System Architecture
• Fetcher: download web pages from the Web that contain all the seeds
• Extractor: learn wrappers from web pages
• Ranker: rank entities extracted by wrappers
Example ranked output:
1. Canon
2. Nikon
3. Olympus
4. Pentax
5. Sony
6. Kodak
7. Minolta
8. Panasonic
9. Casio
10. Leica
11. Fuji
12. Samsung
13. …
The Extractor
• Learn wrappers from web documents and seeds on the fly
– Utilize semi-structured documents
– Wrappers defined at character level
  • Very fast
  • No tokenization required; thus language independent
  • Wrappers derived from doc d applied to d only
– See ICDM 2007 paper for details
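A rough sketch of a character-level wrapper, under simplifying assumptions: only the first occurrence of each seed in the page is used, the wrapper is the longest left and right character context shared by those occurrences, and it is applied with a non-greedy regular expression. The page string and function names are made up for illustration; the ICDM 2007 paper describes the actual algorithm.

```python
import re


def common_prefix(strings):
    """Longest string that every element of `strings` starts with."""
    prefix = strings[0]
    for s in strings[1:]:
        n = 0
        while n < min(len(prefix), len(s)) and prefix[n] == s[n]:
            n += 1
        prefix = prefix[:n]
    return prefix


def common_suffix(strings):
    """Longest string that every element of `strings` ends with."""
    return common_prefix([s[::-1] for s in strings])[::-1]


def learn_wrapper(doc, seeds):
    """Return (left, right) context shared by the first occurrence of each seed,
    or None if some seed does not occur in doc."""
    lefts, rights = [], []
    for seed in seeds:
        i = doc.find(seed)
        if i < 0:
            return None
        lefts.append(doc[:i])
        rights.append(doc[i + len(seed):])
    return common_suffix(lefts), common_prefix(rights)


def apply_wrapper(doc, wrapper):
    """Extract every string bracketed by the wrapper's left and right context.
    Assumes both contexts are non-empty."""
    left, right = wrapper
    return re.findall(re.escape(left) + r"(.+?)" + re.escape(right), doc)


# Hypothetical page fragment, loosely modeled on the dealer list on the next slide.
DOC = (
    '<li class="ford"><a href="x">Curry Ford</a></li>'
    '<li class="nissan"><a href="x">Curry Nissan</a></li>'
    '<li class="toyota"><a href="x">Curry Toyota</a></li>'
    '<li class="honda"><a href="x">Curry Honda</a></li>'
)
wrapper = learn_wrapper(DOC, ["ford", "nissan", "toyota"])
extracted = apply_wrapper(DOC, wrapper)
```

Because nothing is tokenized, the same code works on any character sequence; the wrapper is only meaningful for the page it was learned from.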
<li class="honda"><a href="http://www.curryauto.com/"> <img src="/common/logos/honda/logo-horiz-rgb-lg-dkbg.gif" alt="4"></a> <ul><li><a href="http://www.curryhonda-ga.com/"> <span class="dName">Curry Honda Atlanta</span>...</li> <li><a href="http://www.curryhondamass.com/"> <span class="dName">Curry Honda</span>...</li> <li class="last"><a href="http://www.curryhondany.com/"> <span class="dName">Curry Honda Yorktown</span>...</li></ul> </li>
<li class="acura"><a href="http://www.curryauto.com/"> <img src="/curryautogroup/images/logo-horiz-rgb-lg-dkbg.gif" alt="5"></a> <ul><li class="last"><a href="http://www.curryacura.com/"> <span class="dName">Curry Acura</span>...</li></ul> </li>
<li class="toyota"><a href="http://www.curryauto.com/"> <img src="/common/logos/toyota/logo-horiz-rgb-lg-dkbg.gif" alt="7"></a> <ul><li class="last"><a href="http://www.geisauto.com/toyota/"> <span class="dName">Curry Toyota</span>...</li></ul> </li>
<li class="nissan"><a href="http://www.curryauto.com/"> <img src="/common/logos/nissan/logo-horiz-rgb-lg-dkbg.gif" alt="6"></a> <ul><li class="last"><a href="http://www.geisauto.com/"> <span class="dName">Curry Nissan</span>...</li></ul> </li>
<li class="ford"><a href="http://www.curryauto.com/"> <img src="/common/logos/ford/logo-horiz-rgb-lg-dkbg.gif" alt="3"></a> <ul><li class="last"><a href="http://www.curryauto.com/"> <span class="dName">Curry Ford</span>...</li></ul> </li>
I am noise
Me too!
The Ranker
• Rank candidate entity mentions based on “similarity” to seeds
– Noisy mentions should be ranked lower
• Random Walk with Restart (GW)
• As before…
• What’s the graph?
Building a Graph
• A graph consists of a fixed set of…
– Node Types: {seeds, document, wrapper, mention}
– Labeled Directed Edges: {find, derive, extract}
  • Each edge asserts that a binary relation r holds
  • Each edge has an inverse relation r-1 (graph is cyclic)
– Intuition: good extractions are extracted by many good wrappers, and good wrappers extract many good extractions
[Figure: example graph. Seeds node (“ford”, “nissan”, “toyota”) -find-> documents (curryauto.com, northpointcars.com) -derive-> wrappers (Wrapper #1, #2, #3, #4) -extract-> mentions with walk scores: “honda” 26.1%, “acura” 34.6%, “chevrolet” 22.5%, “bmw pittsburgh” 8.4%, “volvo chicago” 8.4%]
Evaluation Method
• Mean Average Precision
– Commonly used for evaluating ranked lists in IR
– Contains recall and precision-oriented aspects
– Sensitive to the entire ranking
– Mean of average precisions for each ranked list
• Evaluation Procedure (per dataset)
1. Randomly select three true entities and use their first listed mentions as seeds
2. Expand the three seeds obtained from step 1
3. Repeat steps 1 and 2 five times
4. Compute MAP for the five ranked lists
where L = ranked list of extracted mentions, r = rank
Prec(r) = precision at rank r
$$\mathrm{NewEntity}(r) = \begin{cases} 1 & \text{if (a) and (b) are true} \\ 0 & \text{otherwise} \end{cases}$$
(a) Extracted mention at r matches any true mention
(b) There exists no other extracted mention at rank less than r that is of the same entity as the one at r
# True Entities = total number of true entities in this dataset
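The definitions above can be sketched in code. Two assumptions are made here because the formula itself did not survive in the transcript: average precision is taken to be the sum over ranks r of Prec(r) × NewEntity(r), divided by # True Entities, and Prec(r) counts every extracted mention that matches a true mention. The `mention_to_entity` mapping and the example list are hypothetical.

```python
def average_precision(ranked, mention_to_entity, num_true_entities):
    """Assumed form: sum_r Prec(r) * NewEntity(r) / num_true_entities.
    mention_to_entity maps each true mention to its entity; any mention
    missing from the mapping counts as incorrect."""
    seen_entities = set()
    correct = 0
    total = 0.0
    for r, mention in enumerate(ranked, start=1):
        entity = mention_to_entity.get(mention)
        if entity is None:
            continue                      # condition (a) fails
        correct += 1
        if entity not in seen_entities:   # condition (b): first mention of entity
            seen_entities.add(entity)
            total += correct / r          # Prec(r) at a new-entity rank
    return total / num_true_entities


def mean_average_precision(ranked_lists, mention_to_entity, num_true_entities):
    """Mean of the average precisions of several ranked lists."""
    scores = [average_precision(l, mention_to_entity, num_true_entities) for l in ranked_lists]
    return sum(scores) / len(scores)


MENTION_TO_ENTITY = {"honda": "Honda", "Honda": "Honda", "acura": "Acura", "bmw": "BMW"}
RANKED = ["honda", "acura", "Honda", "junk", "bmw"]
ap = average_precision(RANKED, MENTION_TO_ENTITY, num_true_entities=4)
```

In this example the second mention of Honda at rank 3 adds nothing, and the unfound fourth entity lowers the score through the denominator, which is how the measure combines precision and recall.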
Experimental Results: 3 seeds
Vary: [Extractor] + [Ranker] + [Top N URLs]
Extractor:
• E1: Baseline Extractor (longest common context for all seed occurrences)
• E2: Smarter Extractor (longest common context for 1 occurrence of each seed)
Ranker: { EF: Baseline (Most Frequent), GW: Graph Walk }
N URLs: { 100, 200, 300 }
Overall MAP vs. Various Methods

Method          MAP (%)
G.Sets          14.59%
G.Sets (Eng)    43.76%
E1+EF+100       82.39%
E2+EF+100       87.61%
E2+GW+100       93.13%
E2+GW+200       94.03%
E2+GW+300       94.18%
A limitation of the original SEAL
[Figure: Preliminary Study on Seed Sizes. Mean Average Precision (75% to 85%) vs. # Seeds (Seed Size), 2 to 6; curves: RW, PR, BS, WL]
Proposed Solution: Iterative SEAL (iSEAL) (Wang & Cohen, ICDM 2008)
• Makes several calls to SEAL, each call…
– Expands a couple of seeds
– Aggregates statistics
• Evaluate iSEAL using…
– Two iterative processes
  • Supervised vs. Unsupervised (Bootstrapping)
– Two seeding strategies
  • Fixed Seed Size vs. Increasing Seed Size
– Five ranking methods
iSEAL (Fixed Seed Size, Supervised)
Initial Seeds
• Finally rank nodes by proximity to seeds in the full graph
• Refinement (ISS): Increase size of seed set for each expansion over time: 2,3,4,4,…
• Variant (Bootstrap): use high-confidence extractions when seeds run out
Ranking Methods
• Random Graph Walk with Restart (RW)
– H. Tong, C. Faloutsos, and J.-Y. Pan. Fast random walk with restart and its applications. In ICDM, 2006.
• PageRank (PR)
– L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. 1998.
• Bayesian Sets (BS) (over flattened graph)
– Z. Ghahramani and K. A. Heller. Bayesian sets. In NIPS, 2005.
• Wrapper Length (WL)
– Weights each item based on the length of common contextual string of that item and the seeds
• Wrapper Frequency (WF)
– Weights each item based on the number of wrappers that extract the item
[Figure: Mean Average Precision vs. # Iterations (Cumulative Expansions), 1 to 10, for ranking methods RW, PR, BS, WL, WF. Four panels: Fixed Seed Size (Supervised), y-axis 89% to 98%; Fixed Seed Size (Bootstrap), y-axis 86% to 92%; Increasing Seed Size (Supervised), y-axis 90% to 97%; Increasing Seed Size (Bootstrapping), y-axis 89% to 94%]
Little difference between ranking methods for supervised case (all seeds correct); large differences when bootstrapping
Increasing seed size {2,3,4,4,…} makes all ranking methods improve steadily in bootstrapping case
Current work
• Start with name of concept (e.g., “NFL teams”)
• Look for (language-dependent) patterns:
– “… for successful NFL teams (e.g., Pittsburgh Steelers, New York Giants, …)”
• Take most frequent answers as seeds
• Run bootstrapping iSEAL with seed sizes 2,3,4,4….
Summary/Conclusions
• Open-domain IE as finding nodes “near” seeds on a graph
[Figure: graph connecting context strings to the entity names that occur in them]
Contexts: “…at NIPS, AISTATS, KDD and other learning conferences…”; “… on PC of KDD, SIGIR, … and…”; “For skiers, NIPS, SNOWBIRD, … and…”; “… AISTATS, KDD, …”
Entities: NIPS, AISTATS, KDD, SNOWBIRD, SIGIR
• Shorter paths ~ earlier iterations
• Many paths ~ additional evidence
Summary/Conclusions
• Open-domain IE as finding nodes “near” seeds on a graph, approach 1:
– Minkov & Cohen, EMNLP 08
– Graph ~ dependency-parsed corpus
– Off-the-shelf distance metrics not great
– With learning:
  • Results significantly better than state-of-the-art on small corpora (e.g. a personal email corpus)
  • Results competitive on 2M+ word corpora
Summary/Conclusions
• Open-domain IE as finding nodes “near” seeds on a graph, approach 2:
– Wang & Cohen, ICDM 07, 08
– Graph built on-the-fly with web queries
  • A good graph matters!
– Off-the-shelf distance metrics work
  • Differences are minimal for clean seeds
  • Modest improvements from learning w/ clean seeds
    – E.g., reranking (not described here)
  • Bigger differences in similarity measures with noisy seeds