TRANSCRIPT
Active Sampling for Entity Matching
Aditya Parameswaran
Stanford University
Jointly with: Kedar Bellare, Suresh Iyengar, Vibhor Rastogi
Yahoo! Research
Entity Matching
Goal: Find duplicate entities in a given data set
Fundamental data cleaning primitive; decades of prior work
Especially important at Yahoo! (and other web companies)
Example of a duplicate pair:
— Homma’s Brown Rice Sushi | California Avenue | Palo Alto
— Homma’s Sushi | Cal Ave | Palo Alto
Why is it important?
[Figure: Content Providers (Websites, Databases) supply Dirty Entities; a "Find Duplicates" step (e.g., matching listings across Yelp, Zagat, Foursquare) produces Deduplicated Entities]
Applications: Business Listings in Y! Local, Celebrities in Y! Movies, Events in Y! Upcoming, ...
How?
Reformulated Goal: Construct a high-quality classifier identifying duplicate entity pairs
Problem: How do we select training data?
Answer: Active Learning with Human Experts!
Reformulated Workflow
[Figure: Content Providers (Websites, Databases) supply Dirty Entities; Our Technique produces Deduplicated Entities]
Active Learning (AL) Primer
Properties of an AL algorithm:
— Label complexity
— Time complexity
— Consistency
Prior work:
— Uncertainty Sampling
— Query by Committee
— ...
— Importance Weighted Active Learning (IWAL)
— Online IWAL without Constraints
IWAL in particular:
• Implemented in Vowpal Wabbit (VW)
• 0-1 metric
• Time and label efficient
• Provably consistent
(the above hold even under noisy settings)
Problem One: Imbalanced Data
Typical to have 100:1 non-matches to matches, even after blocking.
Degenerate "solution": predict non-match for everything. Precision is (vacuously) 100% and 0-1 error ≈ 0, yet no matches are identified.
Solution: Metric from [Arasu11]:
— Maximize recall (fraction of correct matches identified)
— Such that precision > τ
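The imbalance argument above can be checked with toy numbers (illustration only, assuming the 100:1 ratio from the slide):

```python
# The classifier that labels every pair "non-match" has tiny 0-1 error
# on 100:1 data, but zero recall: it never finds a single match.
n_neg, n_pos = 100, 1                     # non-matching vs. matching pairs
zero_one_error = n_pos / (n_neg + n_pos)  # every true match is misclassified
recall = 0 / n_pos                        # no true match is identified
print(round(zero_one_error, 4), recall)   # prints 0.0099 0.0
```

This is why plain 0-1 error is uninformative here and a precision-constrained recall objective is used instead.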
Problem Two: Guarantees
Prior work on entity matching:
— No guarantees on recall/precision
— Even when guarantees exist: high time + label complexity
Can we adapt prior work on AL to the new objective (maximize recall, such that precision > τ) with:
— Sub-linear label complexity
— Efficient time complexity?
Overview of Our Approach
Recall Optimization with Precision Constraint
  ↓ Reduction: Convex-hull Search in Relaxed Lagrangian (this talk)
Weighted 0-1 Error
  ↓ Reduction: Rejection Sampling (paper)
Active Learning with 0-1 Error
Objective
Given:
— Hypothesis class H
— Threshold τ in [0,1]
Objective: Find h in H that
— Maximizes recall(h)
— Such that precision(h) >= τ
Equivalently:
— Maximize -falseneg(h)
— Such that truepos(h) - ε falsepos(h) >= 0
— Where ε = τ/(1-τ)
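The equivalence follows by expanding the precision constraint (a short check in the slide's notation):

```latex
\mathrm{precision}(h) \ge \tau
\;\Longleftrightarrow\; \frac{tp(h)}{tp(h)+fp(h)} \ge \tau
\;\Longleftrightarrow\; (1-\tau)\,tp(h) \ge \tau\,fp(h)
\;\Longleftrightarrow\; tp(h) - \tfrac{\tau}{1-\tau}\,fp(h) \ge 0
```

and, writing $P$ for the fixed number of true matching pairs, $tp(h) = P - fn(h)$, so maximizing recall $tp(h)/P$ is the same as maximizing $-fn(h)$.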
Unconstrained Objective
Current formulation:
— Maximize X(h) = -falseneg(h)
— Such that Y(h) = truepos(h) - ε falsepos(h) >= 0
Introduce a Lagrange multiplier λ:
— Maximize X(h) + λ Y(h)
— This can be rewritten as: minimize δ falseneg(h) + (1-δ) falsepos(h), a weighted 0-1 objective
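The rewrite can be verified directly (a sketch, assuming $X(h) = -fn(h)$ and $Y(h) = tp(h) - \varepsilon\,fp(h)$ as in the constraint, with $P$ the fixed number of true matching pairs so that $tp(h) = P - fn(h)$):

```latex
X(h) + \lambda Y(h)
  = -fn(h) + \lambda\bigl(tp(h) - \varepsilon\,fp(h)\bigr)
  = \lambda P - (1+\lambda)\,fn(h) - \lambda\varepsilon\,fp(h)
```

Since $\lambda P$ is constant, maximizing $X + \lambda Y$ is the same as minimizing $(1+\lambda)\,fn(h) + \lambda\varepsilon\,fp(h)$; normalizing by $(1+\lambda+\lambda\varepsilon)$ gives the weighted 0-1 form $\delta\,fn(h) + (1-\delta)\,fp(h)$ with $\delta = (1+\lambda)/(1+\lambda+\lambda\varepsilon)$.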
Convex Hull of Classifiers
[Figure: each classifier h plotted as a point (X(h), Y(h)); goal: maximize X(h) such that Y(h) >= 0]
— Convex hull formed by joining the classifiers that strictly dominate the others
— Can have an exponential number of points inside
Convex Hull of Classifiers (contd.)
[Figure: a supporting line of slope -1/λ touching the hull at classifiers u and v]
— For any λ > 0, there is a point (or edge) with the largest value of X + λ Y
— If λ = -1/slope of an edge, we get a classifier on that edge; otherwise we get a vertex classifier
— Plug λ into the weighted objective to get the classifier h with highest X(h) + λ Y(h)
Convex Hull of Classifiers (contd.)
[Figure: in the worst case, the search returns a particular vertex classifier]
— Naïve strategy: try all λ (equivalently, all slopes). Too long!
— Instead, binary search for λ
— Problem: when to stop? (1) Bounds, (2) Discretization of λ. Details in the paper!
Algorithm I (Ours Weighted)
Given: AL black box C for weighted 0-1 error
Goal: Precision-constrained objective
— Range of λ: [Λmin, Λmax]
— Don't enumerate all candidate λ: too expensive, O(n³)
— Instead, discretize using factor θ (see paper!)
— Binary search over the discretized values: same complexity as binary search, O(log n)
Algorithm II (Weighted 0-1)
Given: AL black box B for 0-1 error
Goal: AL black box C for weighted 0-1 error
— Use a trick from supervised learning [Zadrozny03]
— Cost-sensitive objective → binary objective
— Reduction by rejection sampling
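The rejection-sampling reduction can be sketched as follows (a hypothetical helper in the style of [Zadrozny03]'s costing trick, assuming examples arrive as `(features, label)` pairs; not the paper's exact procedure):

```python
import random

def rejection_sample(examples, delta):
    """Keep each labeled example with probability proportional to its
    misclassification cost, so an unweighted 0-1 learner trained on the
    kept sample optimizes the weighted loss in expectation.
    Cost is delta for positives (a false negative costs delta) and
    1 - delta for negatives (a false positive costs 1 - delta)."""
    max_cost = max(delta, 1.0 - delta)
    kept = []
    for x, y in examples:
        cost = delta if y == 1 else 1.0 - delta
        if random.random() < cost / max_cost:   # accept w.p. cost / max_cost
            kept.append((x, y))
    return kept
```

With δ = 0.5 both classes have equal cost and everything is kept; as δ grows, negatives are increasingly rejected, tilting the retained sample toward the expensive class.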
Overview of Our Approach (recap)
Recall Optimization with Precision Constraint
  ↓ Reduction: Convex-hull Search in Relaxed Lagrangian (this talk): O(log n)
Weighted 0-1 Error
  ↓ Reduction: Rejection Sampling (paper): O(log n)
Active Learning with 0-1 Error
Overall: Labels = O(log² n) · L(B), Time = O(log² n) · T(B)
Experiments
Four real-world data sets; all labels are known, so active learning is simulated.
Two approaches for AL with a precision constraint:
— Ours, with Vowpal Wabbit as the 0-1 AL black box
— Monotone [Arasu11]: assumes monotonicity of similarity features; high computational + label complexity

Data Set                  | Size   | Ratio (+/-) | Features
Y! Local Businesses       | 3958   | 0.115       | 5
UCI Person Linkage        | 574913 | 0.004       | 9
DBLP-ACM Bibliography     | 494437 | 0.005       | 7
Scholar-DBLP Bibliography | 589326 | 0.009       | 7
Results I (Runtime with #Features)
[Plot: computational complexity on UCI Person; x-axis: number of similarity features (5 to 9); y-axis: time in seconds (0 to 2000); comparing Ours and Monotone]
Results II (Quality & #Label Queries)
[Plots, Business and Person data sets: F-1 and number of label queries vs. precision threshold τ (0.5 to 0.95), comparing Ours and Monotone]
Results II (Contd.)
[Plots, DBLP-ACM and Scholar data sets: F-1 and number of label queries vs. precision threshold τ (0.5 to 0.95), comparing Ours and Monotone]
Results III (0-1 Active Learning)
[Bar chart: precision constraint satisfaction % of plain 0-1 AL on the business, person, dblp-acm, and scholar-dblp data sets, for thresholds τ = 0.7, 0.8, 0.9, 0.95]