Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets

DESCRIPTION
Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets. Hector Garcia-Molina, Stanford University. Work with: Omar Benjelloun, Qi Su, Jennifer Widom, Tyson Condie, Johnson Gong, Nicolas Pombourcq, David Menestrina, Steven Whang. PowerPoint PPT presentation.

TRANSCRIPT
Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets

Hector Garcia-Molina
Stanford University

Work with: Omar Benjelloun, Qi Su, Jennifer Widom, Tyson Condie, Johnson Gong, Nicolas Pombourcq, David Menestrina, Steven Whang
Entity Resolution

e1: [N: a, A: b, CC#: c, Ph: e]
e2: [N: a, Exp: d, Ph: e]
Applications

• comparison shopping
• mailing lists
• classified ads
• customer files
• counter-terrorism

e1: [N: a, A: b, CC#: c, Ph: e]
e2: [N: a, Exp: d, Ph: e]
Outline

• Why is ER challenging?
• How is ER done?
• Some ER work at Stanford
• Confidences
Challenges (1)

• No keys!
• Value matching
  – “Kaddafi”, “Qaddafi”, “Kadafi”, “Kaddaffi”, ...
• Record matching

[Nm: Tom, Ad: 123 Main St, Ph: (650) 555-1212, Ph: (650) 777-7777]
[Nm: Thomas, Ad: 132 Main St, Ph: (650) 555-1212]
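The value-matching problem above (deciding that "Kaddafi" and "Qaddafi" denote the same value) is commonly handled with approximate string comparison. A minimal sketch using Python's standard `difflib`; the 0.75 similarity threshold is an arbitrary choice for this illustration, not a value from the talk:

```python
from difflib import SequenceMatcher

def values_match(a: str, b: str, threshold: float = 0.75) -> bool:
    """Treat two strings as the same value when their similarity
    ratio exceeds the threshold (an assumed, illustrative rule)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

print(values_match("Kaddafi", "Qaddafi"))  # True: spelling variants match
print(values_match("Kaddafi", "Widom"))    # False: unrelated strings do not
```

In practice the threshold and the similarity measure (edit distance, Jaro-Winkler, phonetic codes, ...) are application-specific choices.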
Challenges (2)

• Merging records

[Nm: Tom, Ad: 123 Main St, Ph: (650) 555-1212, Ph: (650) 777-7777]
[Nm: Thomas, Ad: 132 Main St, Ph: (650) 555-1212, Zp: 94305]

merge into:

[Nm: Tom, Nm: Thomas, Ad: 123 Main St, Ph: (650) 555-1212, Ph: (650) 777-7777, Zp: 94305]
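One simple way to realize a merge like the one above is to represent a record as a set of attribute-value pairs and take the union. This representation is an assumption of this sketch, not the talk's definition:

```python
def merge(r, s):
    """Illustrative merge: union of the two records'
    (attribute, value) pairs; conflicting values are both kept."""
    return r | s

r = {("Nm", "Tom"), ("Ad", "123 Main St"), ("Ph", "(650) 555-1212")}
s = {("Nm", "Thomas"), ("Ph", "(650) 555-1212"), ("Zp", "94305")}
print(sorted(merge(r, s)))  # both names survive; the shared phone appears once
```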
Challenges (3)

• Chaining

[Nm: Tom, Wk: IBM, Oc: laywer, Sal: 500K]
[Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM]
[Nm: Thomas, Ad: 123 Maim, Oc: lawyer]

merging step by step yields:

[Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM, Oc: lawyer]
[Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM, Oc: lawyer, Sal: 500K]
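Chaining means two records that do not match each other directly can still end up in one entity, because an intermediate merge matches both. A sketch with hypothetical data and an assumed match rule (records match when they share at least two attribute values; merge is union):

```python
def match(r, s):
    # assumed rule for this sketch: same entity if >= 2 shared values
    return len(r & s) >= 2

def merge(r, s):
    return r | s  # merge = union of (attribute, value) pairs

r1 = frozenset({("Nm", "Tom"), ("Wk", "IBM"), ("Sal", "500K")})
r2 = frozenset({("Nm", "Tom"), ("Ad", "123 Main"), ("Wk", "IBM")})
r3 = frozenset({("Nm", "Tom"), ("Ad", "123 Main"), ("Oc", "lawyer")})

print(match(r1, r3))             # False: the end records share only a name
m23 = merge(r2, r3)
print(match(r1, m23))            # True: the merged middle record matches r1
```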
Challenges (4)

• Un-merging

[Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM, Oc: lawyer, Sal: 500K]

Too young to make 500K at IBM!!
Challenges (5)

• Confidences in data

Nm: Tom (0.9)
Ad: 123 Main St (1.0)
Ph: (650) 555-1212 (0.6)
Ph: (650) 777-7777 (0.8)
(0.8)

• In value matching, match rules, merge: conf = ?
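How confidences should combine is exactly the open question on the slide. Purely for illustration, one naive rule (an assumption, not the talk's answer) treats the two input confidences as independent probabilities and multiplies them, so merged records become less certain:

```python
def merge_conf(c_r: float, c_s: float) -> float:
    """One possible combination rule among many: multiply
    the input confidences as if they were independent."""
    return c_r * c_s

print(merge_conf(0.9, 0.8))  # roughly 0.72
```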
Taxonomy

• Pairwise snaps vs. clustering
• De-duplication vs. fidelity enhancement
• Schema differences
• Relationships
• Exact vs. approximate
• Generic vs. application-specific
• Confidences
Schema Differences

[Name: Tom, Address: 123 Main St, Ph: (650) 555-1212, Ph: (650) 777-7777]
[FirstName: Tom, StreetName: Main St, StreetNumber: 123, Tel: (650) 777-7777]
Pair-Wise Snaps vs. Clustering

[Diagram: on the left, records r1–r6 are snapped pairwise into merged records s7–s10; on the right, records r1–r10 are grouped into clusters.]
De-Duplication vs. Fidelity Enhancement

[Diagram]
Relationships

[Diagram: records r1, r2, r5, r7 linked by "brother", "father", and "business" relationships.]
Using Relationships

[Diagram: authors a1–a5 linked to papers p1, p2, p5, p7; shared papers suggest two author records may be the same.]
Exact vs Approximate ER

[Diagram: the products set is partitioned into cameras, CDs, books, ...; ER is run on each partition, producing resolved cameras, resolved CDs, resolved books, ...]
Exact vs Approximate ER

[Diagram: a set of terrorist records is sorted by age; the record "Widom, 30" is matched only against records with ages 25–35.]
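The age-window idea above is one way to approximate ER: sort records on an attribute and compare a record only against nearby records. A sketch with hypothetical records and a ±5-year window (both are assumptions of this illustration):

```python
def candidates(target, records, window=5):
    """Return only the records whose age is within `window` years
    of the target's age; all other comparisons are skipped."""
    name, age = target
    return [r for r in records
            if r != target and abs(r[1] - age) <= window]

people = [("Widom", 30), ("A", 26), ("B", 41), ("C", 33), ("D", 70)]
print(candidates(("Widom", 30), people))  # only the 25-35 neighbors remain
```

This trades some recall (two matching records far apart in age are never compared) for far fewer comparisons.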
Generic vs Application Specific

• Match function M(r, s)
• Merge function <r, s> => t
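In the generic view above, the ER engine treats M and merge as black boxes supplied by the application. One simple instantiation, using the record notation from the later brute-force slides (the share-a-value match rule and set-of-pairs representation are assumptions of this sketch):

```python
def M(r, s):
    """Assumed match function: records refer to the same
    entity if they share at least one attribute value."""
    return len(r & s) >= 1

def merge(r, s):
    """Merge <r, s>: union of the (attribute, value) pairs."""
    return r | s

r = frozenset({("a", 1), ("b", 2)})         # r1 = [a:1, b:2]
s = frozenset({("a", 1), ("c", 4), ("e", 5)})  # r2 = [a:1, c:4, e:5]
print(M(r, s))          # True: both contain a:1
print(sorted(merge(r, s)))
```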
Taxonomy

• Pairwise snaps vs. clustering
• De-duplication vs. fidelity enhancement
• Schema differences
• Relationships
• Exact vs. approximate
• Generic vs. application-specific
• Confidences
Outline

• Why is ER challenging?
• How is ER done?
• Some ER work at Stanford
• Confidences
Taxonomy

• Pairwise snaps vs. clustering
• De-duplication vs. fidelity enhancement
• Schema differences: No
• Relationships: No
• Exact vs. approximate
• Generic vs. application-specific
• Confidences ... later on
Model

Input records r1, r2, r3:

[Nm: Tom, Wk: IBM, Oc: laywer, Sal: 500K]
[Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM]
[Nm: Thomas, Ad: 123 Maim, Oc: lawyer]

M(r1, r2) holds, so merge: r4 = <r1, r2>:
[Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM, Oc: lawyer]

M(r4, r3) holds, so merge: <r4, r3>:
[Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM, Oc: lawyer, Sal: 500K]
Correct Answer

[Diagram: records r1–r6 and merged records s7–s10.]

ER(R) = all derivable records ... minus "dominated" records
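The "dominated" notion above can be made concrete under the set-of-pairs representation used in these sketches: a record is dominated when some other derivable record contains everything it contains.

```python
def dominated(r, records):
    """r is dominated if some other record strictly contains it
    (proper-subset test on the (attribute, value) pairs)."""
    return any(r < s for s in records)

r1 = frozenset({("a", 1), ("b", 2)})
r12 = frozenset({("a", 1), ("b", 2), ("c", 4)})
print(dominated(r1, [r1, r12]))   # True: r12 subsumes r1
print(dominated(r12, [r1, r12]))  # False: nothing subsumes r12
```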
Question

• What is the best sequence of match and merge calls that gives us the right answer?
Brute Force Algorithm

• Input R:
  – r1 = [a:1, b:2]
  – r2 = [a:1, c:4, e:5]
  – r3 = [b:2, c:4, f:6]
  – r4 = [a:7, e:5, f:6]
Brute Force Algorithm

• Input R:
  – r1 = [a:1, b:2]
  – r2 = [a:1, c:4, e:5]
  – r3 = [b:2, c:4, f:6]
  – r4 = [a:7, e:5, f:6]
• Match all pairs:
  – r1 = [a:1, b:2]
  – r2 = [a:1, c:4, e:5]
  – r3 = [b:2, c:4, f:6]
  – r4 = [a:7, e:5, f:6]
  – r12 = [a:1, b:2, c:4, e:5]
Brute Force Algorithm

• Match all pairs:
  – r1 = [a:1, b:2]
  – r2 = [a:1, c:4, e:5]
  – r3 = [b:2, c:4, f:6]
  – r4 = [a:7, e:5, f:6]
  – r12 = [a:1, b:2, c:4, e:5]
• Repeat:
  – r1 = [a:1, b:2]
  – r2 = [a:1, c:4, e:5]
  – r3 = [b:2, c:4, f:6]
  – r4 = [a:7, e:5, f:6]
  – r12 = [a:1, b:2, c:4, e:5]
  – r123 = [a:1, b:2, c:4, e:5, f:6]
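The brute-force loop above can be written as a small fixpoint computation on the slide's own records. The match rule (share at least one attribute value) and the union merge are assumptions of this sketch, not the talk's exact functions:

```python
from itertools import combinations

def M(r, s):
    # assumed match rule for this sketch: share at least one value
    return len(r & s) >= 1

def merge(r, s):
    return r | s  # merge = union of (attribute, value) pairs

def brute_force_er(records):
    """Match all pairs, add every new merge result, and repeat
    until no new record appears."""
    result = set(records)
    while True:
        new = {merge(r, s)
               for r, s in combinations(result, 2) if M(r, s)} - result
        if not new:
            return result
        result |= new

R = [frozenset({("a", 1), ("b", 2)}),            # r1
     frozenset({("a", 1), ("c", 4), ("e", 5)}),  # r2
     frozenset({("b", 2), ("c", 4), ("f", 6)}),  # r3
     frozenset({("a", 7), ("e", 5), ("f", 6)})]  # r4
out = brute_force_er(R)
# r123 = [a:1, b:2, c:4, e:5, f:6] from the slide is among the results
print(frozenset({("a", 1), ("b", 2), ("c", 4), ("e", 5), ("f", 6)}) in out)
```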
Question # 1

Brute Force Algorithm

• Input R:
  – r1 = [a:1, b:2]
  – r2 = [a:1, c:4, e:5]
  – r3 = [b:2, c:4, f:6]
  – r4 = [a:7, e:5, f:6]
• Match all pairs:
  – r1 = [a:1, b:2]
  – r2 = [a:1, c:4, e:5]
  – r3 = [b:2, c:4, f:6]
  – r4 = [a:7, e:5, f:6]
  – r12 = [a:1, b:2, c:4, e:5]

Can we delete r1, r2?
Question # 2

Brute Force Algorithm

• Match all pairs:
  – r1 = [a:1, b:2]
  – r2 = [a:1, c:4, e:5]
  – r3 = [b:2, c:4, f:6]
  – r4 = [a:7, e:5, f:6]
  – r12 = [a:1, b:2, c:4, e:5]
• Repeat:
  – r1 = [a:1, b:2]
  – r2 = [a:1, c:4, e:5]
  – r3 = [b:2, c:4, f:6]
  – r4 = [a:7, e:5, f:6]
  – r12 = [a:1, b:2, c:4, e:5]
  – r123 = [a:1, b:2, c:4, e:5, f:6]

Can we avoid comparisons?
ICAR Properties

• Idempotence:
  – M(r1, r1) = true; <r1, r1> = r1
• Commutativity:
  – M(r1, r2) = M(r2, r1)
  – <r1, r2> = <r2, r1>
• Associativity:
  – <r1, <r2, r3>> = <<r1, r2>, r3>
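Under the same assumed match/merge pair used in the earlier sketches (records as frozensets of attribute-value pairs, match = shared value, merge = union), the ICAR properties above can be checked directly:

```python
def M(r, s):
    return len(r & s) >= 1   # assumed match: share a value

def merge(r, s):
    return r | s             # assumed merge: union

r1 = frozenset({("a", 1), ("b", 2)})
r2 = frozenset({("a", 1), ("c", 4)})
r3 = frozenset({("c", 4), ("f", 6)})

# Idempotence: a record matches itself and merging with itself changes nothing
print(M(r1, r1) and merge(r1, r1) == r1)
# Commutativity: match and merge are symmetric in their arguments
print(M(r1, r2) == M(r2, r1) and merge(r1, r2) == merge(r2, r1))
# Associativity: grouping of merges does not matter
print(merge(r1, merge(r2, r3)) == merge(merge(r1, r2), r3))
```

Union-based merges satisfy these properties almost for free; merges that pick a "best" value per attribute often do not.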
More Properties

• Representativity
  – If <r1, r2> = r3, then for any r4 such that M(r1, r4) is true, we also have M(r3, r4) = true.

[Diagram: r1 and r2 merge into r3; r4 matches r1 and therefore also matches r3.]
ICAR Properties => Efficiency

• Commutativity
• Idempotence
• Associativity
• Representativity

=>

• Can discard records
• ER result independent of processing order
Swoosh Algorithms

• Record Swoosh (R-Swoosh)
  – merges records as soon as they match
  – optimal in terms of record comparisons
• Feature Swoosh (F-Swoosh)
  – remembers values seen for each feature
  – avoids redundant value comparisons
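A compact sketch of the R-Swoosh idea described above: maintain a set of pairwise non-matching resolved records, take one unresolved record at a time, and merge it as soon as it matches something. Match and merge are the same assumed functions as in the earlier sketches, not the paper's exact definitions:

```python
def M(r, s):
    return len(r & s) >= 1   # assumed match: share a value

def merge(r, s):
    return r | s             # assumed merge: union

def r_swoosh(records):
    resolved = []            # invariant: no two records here match
    todo = list(records)
    while todo:
        r = todo.pop()
        partner = next((s for s in resolved if M(r, s)), None)
        if partner is None:
            resolved.append(r)               # r matches nothing so far
        else:
            resolved.remove(partner)
            todo.append(merge(r, partner))   # merge immediately, re-queue
    return resolved

R = [frozenset({("a", 1), ("b", 2)}),
     frozenset({("a", 1), ("c", 4), ("e", 5)}),
     frozenset({("b", 2), ("c", 4), ("f", 6)}),
     frozenset({("x", 9)})]
out = r_swoosh(R)
print(len(out))  # the chained records collapse into one; the singleton stays
```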
Swoosh Performance

[Performance chart]
If ICAR Properties Do Not Hold?

r1: [Joe Sr., 123 Main, DL: X]
r2: [Joe, 123 Main, Ph: 123]
r3: [Joe Jr., 123 Main, DL: Y]
r12: [Joe Sr., 123 Main, Ph: 123, DL: X]
r23: [Joe Jr., 123 Main, Ph: 123, DL: Y]
If ICAR Properties Do Not Hold?

r1: [Joe Sr., 123 Main, DL: X]
r2: [Joe, 123 Main, Ph: 123]
r3: [Joe Jr., 123 Main, DL: Y]
r12: [Joe Sr., 123 Main, Ph: 123, DL: X]
r23: [Joe Jr., 123 Main, Ph: 123, DL: Y]

Full Answer: ER(R) = {r12, r23, r1, r2, r3}
Minus Dominated: ER(R) = {r12, r23}
If ICAR Properties Do Not Hold?

r1: [Joe Sr., 123 Main, DL: X]
r2: [Joe, 123 Main, Ph: 123]
r3: [Joe Jr., 123 Main, DL: Y]
r12: [Joe Sr., 123 Main, Ph: 123, DL: X]
r23: [Joe Jr., 123 Main, Ph: 123, DL: Y]

Full Answer: ER(R) = {r12, r23, r1, r2, r3}
Minus Dominated: ER(R) = {r12, r23}
R-Swoosh Yields: ER(R) = {r12, r3} or {r1, r23}
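The order-dependence in the Joe example can be reproduced with a match rule that lacks representativity. The specific rules below ("Joe" matches both "Joe Sr." and "Joe Jr.", but Sr. and Jr. never match each other; merge keeps the more specific name) are assumptions constructed to mimic the slide, not the talk's functions:

```python
def names_match(a, b):
    return a == b or a == "Joe" or b == "Joe"   # assumed rule

def M(r, s):
    return names_match(r["Nm"], s["Nm"]) and r["Ad"] == s["Ad"]

def merge(r, s):
    out = dict(s)
    out.update(r)
    # keep the more specific name, dropping the plain "Joe"
    out["Nm"] = r["Nm"] if r["Nm"] != "Joe" else s["Nm"]
    return out

r1 = {"Nm": "Joe Sr.", "Ad": "123 Main", "DL": "X"}
r2 = {"Nm": "Joe", "Ad": "123 Main", "Ph": "123"}
r3 = {"Nm": "Joe Jr.", "Ad": "123 Main", "DL": "Y"}

r12, r23 = merge(r1, r2), merge(r2, r3)
print(M(r1, r2), M(r2, r3))  # r2 matches both neighbors
print(M(r12, r3))            # False: <r1, r2> no longer matches r3
print(M(r1, r23))            # False: <r2, r3> no longer matches r1
```

Whichever merge consumes r2 first blocks the other, which is why R-Swoosh can end with {r12, r3} or {r1, r23} depending on processing order.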
Swoosh Without ICAR Properties
Distributed Swoosh

[Diagram: records r1, r2, r3, r4, r5, r6, ... to be distributed across processors P1, P2, P3.]
Distributed Swoosh

P1: r1, r3, r4, r6, ...
P2: r1, r2, r4, r5, ...
P3: r2, r3, r5, r6, ...

(Records are replicated so that every pair of records meets at some processor.)
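The key requirement in the diagram above is coverage: every pair of records must land together on some processor. One simple round-robin scheme that guarantees this (purely an illustration; the actual DSwoosh distribution strategies are more refined):

```python
from itertools import combinations

def assign_pairs(record_ids, n_procs):
    """Assign each pair of records to a processor round-robin,
    so every pair is compared exactly once somewhere."""
    buckets = {p: [] for p in range(n_procs)}
    for k, pair in enumerate(combinations(record_ids, 2)):
        buckets[k % n_procs].append(pair)
    return buckets

buckets = assign_pairs(["r1", "r2", "r3", "r4"], n_procs=3)
for p, pairs in buckets.items():
    print(p, pairs)
```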
DSwoosh Performance

[Performance chart]
Outline

• Why is ER challenging?
• How is ER done?
• Some ER work at Stanford
• Confidences
Conclusion

• ER is an old and important problem
• Our approach: generic
• Confidences
  – challenging
  – two ways to tame: thresholds, packages
Thanks.