TRANSCRIPT
Active Sampling for Entity Matching
Aditya Parameswaran
Stanford University
Jointly with: Kedar Bellare, Suresh Iyengar, Vibhor Rastogi
Yahoo! Research
Entity Matching
Goal: Find duplicate entities in a given data set
Fundamental data cleaning primitive; decades of prior work
Especially important at Yahoo! (and other web companies)
Example of a duplicate pair:
— Homma’s Brown Rice Sushi | California Avenue | Palo Alto
— Homma’s Sushi | Cal Ave | Palo Alto
Why is it important?
[Figure: Content Providers (Websites, Databases) supply Dirty Entities; a "Find Duplicates" step (e.g., matching listings across Yelp, Zagat, Foursquare) produces Deduplicated Entities]
Applications: Business Listings in Y! Local, Celebrities in Y! Movies, Events in Y! Upcoming, ...
How?
Reformulated Goal: Construct a high-quality classifier identifying duplicate entity pairs
Problem: How do we select training data?
Answer: Active Learning with Human Experts!
Reformulated Workflow
[Figure: Content Providers (Websites, Databases) supply Dirty Entities; Our Technique produces Deduplicated Entities]
Active Learning (AL) Primer
Properties of an AL algorithm:
— Label complexity
— Time complexity
— Consistency
Prior work:
— Uncertainty Sampling
— Query by Committee
— ...
— Importance Weighted Active Learning (IWAL)
— Online IWAL without Constraints
IWAL in particular:
• Implemented in Vowpal Wabbit (VW)
• 0-1 metric
• Time and label efficient
• Provably consistent
(the above hold even under noisy settings)
Problem One: Imbalanced Data
Typical to have 100:1 non-matches to matches, even after blocking.
Degenerate "solution": predict non-match for everything. Precision is (vacuously) 100% and 0-1 error ≈ 0, yet no matches are identified.
Solution: Metric from [Arasu11]:
— Maximize recall (fraction of correct matches identified)
— Such that precision > τ
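The imbalance argument above can be checked with toy numbers (illustration only, assuming the 100:1 ratio from the slide):

```python
# The classifier that labels every pair "non-match" has tiny 0-1 error
# on 100:1 data, but zero recall: it never finds a single match.
n_neg, n_pos = 100, 1                     # non-matching vs. matching pairs
zero_one_error = n_pos / (n_neg + n_pos)  # every true match is misclassified
recall = 0 / n_pos                        # no true match is identified
print(round(zero_one_error, 4), recall)   # prints 0.0099 0.0
```

This is why plain 0-1 error is uninformative here and a precision-constrained recall objective is used instead.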
Problem Two: Guarantees
Prior work on entity matching:
— No guarantees on recall/precision
— Even when guarantees exist: high time + label complexity
Can we adapt prior work on AL to the new objective (maximize recall, such that precision > τ) with:
— Sub-linear label complexity
— Efficient time complexity?
Overview of Our Approach
Recall Optimization with Precision Constraint
  ↓ Reduction: Convex-hull Search in Relaxed Lagrangian (this talk)
Weighted 0-1 Error
  ↓ Reduction: Rejection Sampling (paper)
Active Learning with 0-1 Error
Objective
Given:
— Hypothesis class H
— Threshold τ in [0,1]
Objective: Find h in H that
— Maximizes recall(h)
— Such that precision(h) >= τ
Equivalently:
— Maximize -falseneg(h)
— Such that truepos(h) - ε falsepos(h) >= 0
— Where ε = τ/(1-τ)
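The equivalence follows by expanding the precision constraint (a short check in the slide's notation):

```latex
\mathrm{precision}(h) \ge \tau
\;\Longleftrightarrow\; \frac{tp(h)}{tp(h)+fp(h)} \ge \tau
\;\Longleftrightarrow\; (1-\tau)\,tp(h) \ge \tau\,fp(h)
\;\Longleftrightarrow\; tp(h) - \tfrac{\tau}{1-\tau}\,fp(h) \ge 0
```

and, writing $P$ for the fixed number of true matching pairs, $tp(h) = P - fn(h)$, so maximizing recall $tp(h)/P$ is the same as maximizing $-fn(h)$.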
Unconstrained Objective
Current formulation:
— Maximize X(h) = -falseneg(h)
— Such that Y(h) = truepos(h) - ε falsepos(h) >= 0
Introduce a Lagrange multiplier λ:
— Maximize X(h) + λ Y(h)
— This can be rewritten as: minimize δ falseneg(h) + (1-δ) falsepos(h), a weighted 0-1 objective
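The rewrite can be verified directly (a sketch, assuming $X(h) = -fn(h)$ and $Y(h) = tp(h) - \varepsilon\,fp(h)$ as in the constraint, with $P$ the fixed number of true matching pairs so that $tp(h) = P - fn(h)$):

```latex
X(h) + \lambda Y(h)
  = -fn(h) + \lambda\bigl(tp(h) - \varepsilon\,fp(h)\bigr)
  = \lambda P - (1+\lambda)\,fn(h) - \lambda\varepsilon\,fp(h)
```

Since $\lambda P$ is constant, maximizing $X + \lambda Y$ is the same as minimizing $(1+\lambda)\,fn(h) + \lambda\varepsilon\,fp(h)$; normalizing by $(1+\lambda+\lambda\varepsilon)$ gives the weighted 0-1 form $\delta\,fn(h) + (1-\delta)\,fp(h)$ with $\delta = (1+\lambda)/(1+\lambda+\lambda\varepsilon)$.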
Convex Hull of Classifiers
[Figure: each classifier h plotted as a point (X(h), Y(h)); goal: maximize X(h) such that Y(h) >= 0]
— Convex hull formed by joining the classifiers that strictly dominate the others
— Can have an exponential number of points inside
Convex Hull of Classifiers (contd.)
[Figure: a supporting line of slope -1/λ touching the hull at classifiers u and v]
— For any λ > 0, there is a point (or edge) with the largest value of X + λ Y
— If λ = -1/slope of an edge, we get a classifier on that edge; otherwise we get a vertex classifier
— Plug λ into the weighted objective to get the classifier h with highest X(h) + λ Y(h)
Convex Hull of Classifiers (contd.)
[Figure: in the worst case, the search returns a particular vertex classifier]
— Naïve strategy: try all λ (equivalently, all slopes). Too long!
— Instead, binary search for λ
— Problem: when to stop? (1) Bounds, (2) Discretization of λ. Details in the paper!
Algorithm I (Ours Weighted)
Given: AL black box C for weighted 0-1 error
Goal: Precision-constrained objective
— Range of λ: [Λmin, Λmax]
— Don't enumerate all candidate λ: too expensive, O(n³)
— Instead, discretize using factor θ (see paper!)
— Binary search over the discretized values: same complexity as binary search, O(log n)
Algorithm II (Weighted 0-1)
Given: AL black box B for 0-1 error
Goal: AL black box C for weighted 0-1 error
— Use a trick from supervised learning [Zadrozny03]
— Cost-sensitive objective → binary objective
— Reduction by rejection sampling
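The rejection-sampling reduction can be sketched as follows (a hypothetical helper in the style of [Zadrozny03]'s costing trick, assuming examples arrive as `(features, label)` pairs; not the paper's exact procedure):

```python
import random

def rejection_sample(examples, delta):
    """Keep each labeled example with probability proportional to its
    misclassification cost, so an unweighted 0-1 learner trained on the
    kept sample optimizes the weighted loss in expectation.
    Cost is delta for positives (a false negative costs delta) and
    1 - delta for negatives (a false positive costs 1 - delta)."""
    max_cost = max(delta, 1.0 - delta)
    kept = []
    for x, y in examples:
        cost = delta if y == 1 else 1.0 - delta
        if random.random() < cost / max_cost:   # accept w.p. cost / max_cost
            kept.append((x, y))
    return kept
```

With δ = 0.5 both classes have equal cost and everything is kept; as δ grows, negatives are increasingly rejected, tilting the retained sample toward the expensive class.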
Overview of Our Approach (recap)
Recall Optimization with Precision Constraint
  ↓ Reduction: Convex-hull Search in Relaxed Lagrangian (this talk): O(log n)
Weighted 0-1 Error
  ↓ Reduction: Rejection Sampling (paper): O(log n)
Active Learning with 0-1 Error
Overall: Labels = O(log² n) · L(B), Time = O(log² n) · T(B)
Experiments
Four real-world data sets; all labels are known, so active learning is simulated.
Two approaches for AL with a precision constraint:
— Ours, with Vowpal Wabbit as the 0-1 AL black box
— Monotone [Arasu11]: assumes monotonicity of similarity features; high computational + label complexity

Data Set                  | Size   | Ratio (+/-) | Features
Y! Local Businesses       | 3958   | 0.115       | 5
UCI Person Linkage        | 574913 | 0.004       | 9
DBLP-ACM Bibliography     | 494437 | 0.005       | 7
Scholar-DBLP Bibliography | 589326 | 0.009       | 7
Results I (Runtime with #Features)
[Plot: computational complexity on UCI Person; x-axis: number of similarity features (5 to 9); y-axis: time in seconds (0 to 2000); comparing Ours and Monotone]
Results II (Quality & #Label Queries)
[Plots, Business and Person data sets: F-1 and number of label queries vs. precision threshold τ (0.5 to 0.95), comparing Ours and Monotone]
Results II (Contd.)
[Plots, DBLP-ACM and Scholar data sets: F-1 and number of label queries vs. precision threshold τ (0.5 to 0.95), comparing Ours and Monotone]
Results III (0-1 Active Learning)
[Bar chart: precision constraint satisfaction % of plain 0-1 AL on the business, person, dblp-acm, and scholar-dblp data sets, for thresholds τ = 0.7, 0.8, 0.9, 0.95]