Exploiting Relationships for Object Consolidation
Zhaoqi Chen Dmitri V. Kalashnikov Sharad Mehrotra
Computer Science DepartmentUniversity of California, Irvine
http://www.ics.uci.edu/~dvk/RelDChttp://www.itr-rescue.org (RESCUE)
ACM IQIS 2005
Work supported by NSF Grants IIS-0331707 and IIS-0083489
2
Talk Overview
• Motivation
• Object consolidation problem
• Proposed approach – RelDC: Relationship based data cleaning– Relationship analysis and graph partitioning
• Experiments
3
Why do we need “Data Cleaning”?
Jane Smith (fresh Ph.D.): Hi, my name is Jane Smith. I'd like to apply for a faculty position at your university.
Tom (recruiter): OK, let me check something quickly… Wow! Unbelievable! Are you sure you will join us even if we do not offer you tenure right away?
[Figure: two publication lists and the candidate's CiteSeer rank, which the recruiter checks]
4
What is the problem?
• Names often do not uniquely identify people
[Figure: CiteSeer's list of the top-k most cited authors alongside the corresponding DBLP author pages]
5
Comparing raw and cleaned CiteSeer
Rank Author Location # citations
1 (100.00%) douglas schmidt cs@wustl 5608
2 (100.00%) rakesh agrawal almaden@ibm 4209
3 (100.00%) hector garciamolina @ 4167
4 (100.00%) sally floyd @aciri 3902
5 (100.00%) jennifer widom @stanford 3835
6 (100.00%) david culler cs@berkeley 3619
6 (100.00%) thomas henzinger eecs@berkeley 3752
7 (100.00%) rajeev motwani @stanford 3570
8 (100.00%) willy zwaenepoel cs@rice 3624
9 (100.00%) van jacobson lbl@gov 3468
10 (100.00%) rajeev alur cis@upenn 3577
11 (100.00%) john ousterhout @pacbell 3290
12 (100.00%) joseph halpern cs@cornell 3364
13 (100.00%) andrew kahng @ucsd 3288
14 (100.00%) peter stadler tbi@univie 3187
15 (100.00%) serge abiteboul @inria 3060
[Figure: the raw CiteSeer top-k and the cleaned CiteSeer top-k, shown side by side]
6
Object Consolidation Problem
• Cluster representations that correspond to the same real-world object/entity
• Two instances of the problem: the real-world objects are known / unknown
[Figure: representations r1, r2, ..., rN in the database mapped to real objects o1, o2, ..., oM]
7
RelDC Approach
• Exploit relationships among objects to disambiguate when the traditional approach of clustering based on feature similarity does not work
[Figure: RelDC framework. Traditional methods compare entities X and Y on features f1–f4 and leave "?" matches unresolved; relationship analysis over the ARG (nodes A–F connecting X and Y) resolves them. Features and context feed into relationship-based data cleaning.]
8
Attributed Relational Graph (ARG)
View the database as an ARG
• Nodes
  – one per cluster of representations (if already resolved by a feature-based approach)
  – one per representation (for "tough" cases)
• Edges
  – Regular – correspond to relationships between entities
  – Similarity – created using feature-based methods on representations
[Figure: example ARG with node types person, publication, department, organization]
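The ARG described above can be sketched as a small adjacency structure. This is a minimal illustration; the node names, types, and edges below are invented for the example and are not from the paper's dataset:

```python
from collections import defaultdict

class ARG:
    """Attributed relational graph: typed nodes, regular and similarity edges."""
    def __init__(self):
        self.node_type = {}           # node -> type ("person", "publication", ...)
        self.adj = defaultdict(list)  # node -> [(neighbor, edge_kind)]

    def add_node(self, name, ntype):
        self.node_type[name] = ntype

    def add_edge(self, u, v, kind="regular"):
        # regular edges model relationships; similarity edges come from FBS
        self.adj[u].append((v, kind))
        self.adj[v].append((u, kind))

g = ARG()
g.add_node("R1:J.Smith", "person")
g.add_node("R2:J.Smith", "person")
g.add_node("P1", "publication")
g.add_node("CS Dept", "department")
g.add_edge("R1:J.Smith", "P1")                        # authorship
g.add_edge("P1", "CS Dept")                           # affiliation
g.add_edge("R1:J.Smith", "R2:J.Smith", "similarity")  # FBS: possible match
```

Keeping the two edge kinds separate matters later: connection strength is computed over regular edges, while similarity edges mark the "tough" pairs to be resolved.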
9
Context Attraction Principle (CAP)
• Who is "J. Smith" – Jane? John?
[Figure: merging a new publication whose author "J. Smith" could refer to Jane Smith or John Smith]
10
Questions to Answer
1. Does the CAP hold over real datasets? That is, if we consolidate objects based on it, will the quality of consolidation improve?
2. Can we design a generic strategy that exploits the CAP for consolidation?
11
Consolidation Algorithm
1. Construct the ARG and identify all virtual clusters (VCSs)
   – use FBS (feature-based similarity) in constructing the ARG
2. Choose a VCS and compute the connection strength between nodes
   – for each pair of representations connected via a similarity edge
3. Partition the VCS
   – use a graph partitioning algorithm
   – partitioning is based on connection strength
   – after partitioning, adjust the ARG accordingly
   – go to Step 2 if more potential clusters exist
12
Connection Strength c(u,v)
• Models for c(u,v)
  – many possibilities: diffusion kernels, random walks, etc.
  – none is fully adequate; cannot learn similarity from data
[Figure: paths between nodes u and v through intermediate nodes A–H]
• Diffusion kernels
  – λ(x,y) = λ1(x,y): the "base similarity", via direct links (of length 1)
  – λk(x,y): "indirect similarity", via links of length k
  – B: the base similarity matrix, where Bxy = B1xy = λ1(x,y)
  – Bk: the indirect similarity matrix
  – K: the total similarity matrix, or "kernel"
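As a concreteness check, the kernel construction above can be sketched in a few lines: K is a weighted sum of powers of the base similarity matrix B, truncated at path length L. The decay factor and the toy matrix are illustrative assumptions, not values from the paper:

```python
def matmul(A, B):
    """Plain dense matrix multiplication."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def truncated_kernel(B, decay, L):
    """K = sum_{k=1..L} decay^k * B^k: total similarity from paths up to length L."""
    n = len(B)
    K = [[0.0] * n for _ in range(n)]
    P = [row[:] for row in B]        # P holds B^k, starting with k = 1
    for k in range(1, L + 1):
        for i in range(n):
            for j in range(n):
                K[i][j] += (decay ** k) * P[i][j]
        if k < L:
            P = matmul(P, B)
    return K

B = [[0, 1], [1, 0]]                 # two nodes joined by one direct link
K = truncated_kernel(B, decay=0.5, L=2)
# K = 0.5*B + 0.25*B^2 = [[0.25, 0.5], [0.5, 0.25]]
```

The decay weight makes longer connections count less, which is exactly the intuition the CAP relies on.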
13
Connection Strength c(u,v) (cont.)
[Figure: example ARG fragment – paper P1 by John Smith and Alan White, both connected to MIT; edges have types T1, T2]
• Instantiating parameters – determining λ(x,y)
  – regular edges have types T1,...,Tn
  – types T1,...,Tn have weights w1,...,wn
  – λ(x,y) = wi: get the type of a given edge and assign its weight as the base similarity
• Handling similarity edges
  – λ(x,y) is assigned a value proportional to the similarity (heuristic)
  – an approach to learn λ(x,y) from data is ongoing work
• Implementation
  – we do not compute the whole matrix K; we compute one c(u,v) at a time
  – path lengths are limited by L
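One simple instantiation of the per-pair computation is to sum, over all simple paths from u to v of length at most L, the product of the base similarities along the path. This sketch is a plausible simplification of the model described above, not the exact formula from the paper, and the toy directed graph and its weights are made up:

```python
def c_uv(adj, u, v, L):
    """c(u,v): sum over simple u->v paths of length <= L of the product
    of base-similarity weights along the path."""
    total = 0.0

    def dfs(node, weight, visited, edges_used):
        nonlocal total
        if edges_used == L:              # cannot extend the path any further
            return
        for nxt, w in adj.get(node, []):
            if nxt == v:
                total += weight * w      # path completed at v
            elif nxt not in visited:     # simple paths: no repeated nodes
                dfs(nxt, weight * w, visited | {nxt}, edges_used + 1)

    dfs(u, 1.0, {u}, 0)
    return total

# Toy graph: two 2-hop connections between u and v with different weights.
adj = {
    "u": [("a", 0.5), ("b", 0.25)],
    "a": [("v", 0.5)],
    "b": [("v", 1.0)],
}
print(c_uv(adj, "u", "v", L=2))   # 0.5*0.5 + 0.25*1.0 = 0.5
```

Enumerating all L-short simple paths is the expensive step, which is why the slides call it the bottleneck and compute c(u,v) one pair at a time.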
[Figure: (a)–(c) example subgraphs – paths connecting representations R1:John, R2:J.Smith, R3:John through papers P1–P4, co-authors A1–A7, MIT, and Stanford]
14
Consolidation via Partitioning
• Observations
  – each VCS contains the representations of at least 1 object
  – if a representation is in a VCS, then the rest of the representations of the same object are in it too
• Partitioning – two cases
  – k, the number of entities in the VCS, is known
  – k is unknown
• When k is known
  – use any partitioning algorithm
  – maximize inside-connections, minimize outside-connections
  – we use the normalized cut of [Shi, Malik 2000]
• When k is unknown
  – split into two, just to see the cut
  – compare the cut against a threshold
  – decide "to split" or "not to split", and iterate
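For tiny VCSs the unknown-k case can be illustrated by brute force: try every bipartition, take the cheapest cut, and split only if it falls below the threshold. This is a sketch under simplifying assumptions; it uses the plain cut value rather than the normalized cut of Shi and Malik, and the weights below are invented:

```python
from itertools import combinations

def cut_value(weights, part):
    """Total connection strength crossing the bipartition."""
    return sum(w for (a, b), w in weights.items()
               if (a in part) != (b in part))

def maybe_split(nodes, weights, threshold):
    """Split a small VCS in two if its cheapest cut is below the threshold;
    otherwise keep it as a single cluster."""
    nodes = list(nodes)
    best_cut, best_part = None, None
    for r in range(1, len(nodes) // 2 + 1):
        for combo in combinations(nodes, r):
            c = cut_value(weights, set(combo))
            if best_cut is None or c < best_cut:
                best_cut, best_part = c, set(combo)
    if best_part is not None and best_cut < threshold:
        return [sorted(best_part), sorted(set(nodes) - best_part)]
    return [sorted(nodes)]

# Two tightly connected pairs joined by one weak edge.
weights = {("a", "b"): 0.9, ("c", "d"): 0.8, ("b", "c"): 0.1}
print(maybe_split(["a", "b", "c", "d"], weights, threshold=0.5))
# [['a', 'b'], ['c', 'd']]
```

In the iterated version, each half would be fed back into the same test until no cut falls below the threshold.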
[Figure: two VCSs of representations labeled 1–5, partitioned into per-entity clusters]
15
Measuring Quality of Outcome
• Dispersion – for an entity, into how many clusters its representations are clustered; ideal is 1
• Diversity – for a cluster, how many distinct entities it covers; ideal is 1
• Entity uncertainty – for an entity, if out of its m representations m1 go to C1; ...; mn go to Cn, then the entropy is H = -Σi (mi/m) log2 (mi/m)
• Cluster uncertainty – defined analogously: if a cluster consists of m1 representations of E1; ...; mn of En, then H is computed by the same formula; the ideal entropy is zero
[Figure: four example clusterings of the representations of entities 1 and 2 into clusters C1, C2]

Ideal Clustering
  C1: Div = 1, H = 0      C2: Div = 1, H = 0
  E1: Dis = 1, H = 0      E2: Dis = 1, H = 0

One Misassigned (Example 1)
  C1: Div = 2, H = 0.65   C2: Div = 2, H = 0.65
  E1: Dis = 2, H = 0.65   E2: Dis = 2, H = 0.65

Half Misassigned
  C1: Div = 2, H = 1      C2: Div = 2, H = 1
  E1: Dis = 2, H = 1      E2: Dis = 2, H = 1

One Misassigned (Example 2)
  C1: Div = 2, H = 0.592  C2: Div = 1, H = 0
  E1: Dis = 1, H = 0      E2: Dis = 2, H = 0.65

Dis/Div cannot distinguish the two cases
Entropy can: since 0.65 < 1, the first clustering is better
Average entropy decreases (improves) compared to Example 1
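The entropy values on this slide can be reproduced with the standard formula H = -Σ (mi/m) log2 (mi/m). The per-entity counts below (six representations per entity, so the mixed cluster in Example 2 holds seven) are inferred from the slide's numbers rather than stated in it:

```python
from math import log2

def entropy(counts):
    """H = -sum (mi/m) * log2(mi/m) over the nonzero counts."""
    m = sum(counts)
    return -sum((mi / m) * log2(mi / m) for mi in counts if mi)

print(round(entropy([5, 1]), 2))   # 0.65  - five of six representations together
print(round(entropy([6, 1]), 3))   # 0.592 - cluster with 6 of E1 and 1 of E2
print(round(entropy([3, 3]), 2))   # 1.0   - half misassigned
```

This also shows why entropy separates the cases that dispersion and diversity cannot: a 5/1 split and a 3/3 split both give Dis = Div = 2, but very different H.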
16
Experimental Setup
• Parameters
  – L-short simple paths, L = 7
  – L is the path-length limit
• Note – the algorithm is applied to "tough cases", after FBS has already successfully consolidated many entries!
• RealMov
  – movies (12K)
  – people (22K): actors, directors, producers
  – studios (1K): producing, distributing
• Uncertainty
  – d1,d2,...,dn are the director entities
  – pick a fraction d1,d2,...,dm
  – group entries in size k, e.g. in groups of two {d1,d2}, ..., {d9,d10}
  – make all representations of a group indiscernible by FBS, ...
• Baseline 1
  – one cluster per VCS, regardless
  – equivalent to using only FBS
  – ideal dispersion & H(E)!
• Baseline 2
  – knows the grouping statistics
  – guesses #entities in a VCS
  – randomly assigns representations to clusters
17
Sample Movies Data
18
The Effect of L on Quality
[Figures: cluster entropy & diversity; entity entropy & dispersion]
19
Effect of Threshold and Scalability
20
Summary
• RelDC
  – a domain-independent data cleaning framework
  – uses relationships for data cleaning
  – reference disambiguation [SDM'05]
  – object consolidation [IQIS'05]
• Ongoing work
  – "learning" the importance of relationships from data
  – exploiting relationships among entities for other data cleaning problems
21
Contact Information
RelDC project
www.ics.uci.edu/~dvk/RelDC
www.itr-rescue.org (RESCUE)

Zhaoqi Chen
[email protected]

Dmitri V. Kalashnikov
www.ics.uci.edu/~dvk

Sharad Mehrotra
www.ics.uci.edu/~sharad
22
extra slides…
24
Object Consolidation
Notation
• O = {o1,...,o|O|} – the set of entities
  – unknown in general
• X = {x1,...,x|X|} – the set of representations
• d[xi] – the entity xi refers to
  – unknown in general
• C[xi] – all representations that refer to d[xi]
  – the "group set"
  – unknown in general
  – the goal is to find it for each xi
• S[xi] – all representations that can be xi
  – the "consolidation set"
  – determined by FBS
• we assume C[xi] ⊆ S[xi]
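The notation can be made concrete with a toy instance: d maps each representation to its (normally unknown) entity, C[x] is derived from d, and S[x] is what FBS would produce. All of the identifiers below are invented for illustration:

```python
# Ground truth (unknown in general): which entity each representation refers to.
d = {"x1": "o1", "x2": "o1", "x3": "o2", "x4": "o2"}

def group_set(x):
    """C[x]: all representations that refer to the same entity as x."""
    return {xi for xi, o in d.items() if o == d[x]}

# S[x]: consolidation sets as FBS might produce them (a coarse over-grouping).
S = {x: {"x1", "x2", "x3", "x4"} for x in d}

# The assumption C[x] ⊆ S[x]: FBS may fail to separate different entities,
# but it never splits the representations of one entity apart.
for x in d:
    assert group_set(x) <= S[x]
```

Consolidation then amounts to recovering each C[x] by partitioning S[x].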
25
Object Consolidation Problem
• Let O = {o1,...,o|O|} be the set of entities
  – unknown in general
• Let X = {x1,...,x|X|} be the set of representations
• Map each xi to its corresponding entity oj in O
  – d[xi] – the entity xi refers to; unknown in general
  – C[xi] – all representations that refer to d[xi]; the "group set"; unknown in general; the goal is to find it for each xi
  – S[xi] – all representations that can be xi; the "consolidation set"; determined by FBS
  – we assume C[xi] ⊆ S[xi]
26
RelDC Framework
[Figure: the full RelDC framework. Raw data is extracted into a representation (tables/ARGs); traditional methods compare entities X and Y on features f1–f4, leaving "?" matches; relationship analysis over the ARG resolves them. Features and context feed both steps; together they form relationship-based data cleaning, followed by analysis.]
27
Connection Strength
Computation of c(u,v)
• Phase 1: discover connections
  – all L-short simple paths between u and v
  – the bottleneck
  – optimizations exist, not covered in IQIS'05
• Phase 2: measure the strength
  – of the discovered connections
  – many c(u,v) models exist
  – we use a model similar to diffusion kernels
[Figure: paths between nodes u and v through intermediate nodes A–H]
28
Our c(u,v) Model
[Figure: example ARG fragment – paper P1 by John Smith and Alan White, both connected to MIT; edges have types T1, T2]
Our c(u,v) model
• regular edges have types T1,...,Tn
• types T1,...,Tn have weights w1,...,wn
• λ(x,y) = wi
  – get the type of a given edge
  – assign its weight as the base similarity
• paths with similarity edges
  – might not exist; use heuristics
Our model & diffusion kernels
• virtually identical, but...
  – we do not compute the whole matrix K; we compute one c(u,v) at a time
  – we limit path lengths by L
  – λ(x,y) is unknown in general: the analyst assigns the weights; learning them from data is ongoing work
[Figure: (a)–(c) example subgraphs – paths connecting representations R1:John, R2:J.Smith, R3:John through papers P1–P4, co-authors A1–A7, MIT, and Stanford]