Exploiting Relationships for Object Consolidation Zhaoqi Chen Dmitri V. Kalashnikov Sharad Mehrotra Computer Science Department University of California, Irvine http://www.ics.uci.edu/~dvk/RelDC http://www.itr-rescue.org (RESCUE) ACM IQIS 2005 Work supported by NSF Grants IIS-0331707 and IIS-0083489

Page 1: Exploiting Relationships  for Object Consolidation

Exploiting Relationships for Object Consolidation

Zhaoqi Chen Dmitri V. Kalashnikov Sharad Mehrotra

Computer Science Department, University of California, Irvine

http://www.ics.uci.edu/~dvk/RelDC

http://www.itr-rescue.org (RESCUE)

ACM IQIS 2005

Work supported by NSF Grants IIS-0331707 and IIS-0083489

Page 2: Exploiting Relationships  for Object Consolidation


Talk Overview

• Motivation

• Object consolidation problem

• Proposed approach
– RelDC: relationship-based data cleaning
– Relationship analysis and graph partitioning

• Experiments

Page 3: Exploiting Relationships  for Object Consolidation


Why do we need “Data Cleaning”?

Jane Smith (fresh Ph.D.): Hi, my name is Jane Smith. I’d like to apply for a faculty position at your university.

Tom (recruiter): OK, let me check something quickly …

[Figure: CiteSeer Rank lookup showing two publication lists (Publications: 1. …… 2. …… 3. ……), marked “???”]

Tom (recruiter): Wow! Unbelievable! Are you sure you will join us even if we do not offer you tenure right away?

Page 4: Exploiting Relationships  for Object Consolidation


What is the problem?

• Names often do not uniquely identify people

[Figure: CiteSeer, the top-k most cited authors; two DBLP pages]

Page 5: Exploiting Relationships  for Object Consolidation


Comparing raw and cleaned CiteSeer

Rank            Author              Location         # citations
1 (100.00%)     douglas schmidt     cs@wustl         5608
2 (100.00%)     rakesh agrawal      almaden@ibm      4209
3 (100.00%)     hector garciamolina @                4167
4 (100.00%)     sally floyd         @aciri           3902
5 (100.00%)     jennifer widom      @stanford        3835
6 (100.00%)     david culler        cs@berkeley      3619
6 (100.00%)     thomas henzinger    eecs@berkeley    3752
7 (100.00%)     rajeev motwani      @stanford        3570
8 (100.00%)     willy zwaenepoel    cs@rice          3624
9 (100.00%)     van jacobson        lbl@gov          3468
10 (100.00%)    rajeev alur         cis@upenn        3577
11 (100.00%)    john ousterhout     @pacbell         3290
12 (100.00%)    joseph halpern      cs@cornell       3364
13 (100.00%)    andrew kahng        @ucsd            3288
14 (100.00%)    peter stadler       tbi@univie       3187
15 (100.00%)    serge abiteboul     @inria           3060

[Panels: CiteSeer top-k; Cleaned CiteSeer top-k]

Page 6: Exploiting Relationships  for Object Consolidation


Object Consolidation Problem

• Cluster representations that correspond to the same real-world object/entity

• Two instances of the problem: the real-world objects are either known or unknown

[Figure: representations of objects in the database (r1, r2, ..., rN) linked to the real objects (o1, o2, ..., oM)]

Page 7: Exploiting Relationships  for Object Consolidation


RelDC Approach

• Exploit relationships among objects to disambiguate when the traditional approach of clustering based on similarity does not work

[Figure: RelDC framework = traditional methods (features and context f1, f2, f3, f4 of X and Y) + relationship analysis (ARG over nodes X, Y, A, B, C, D, E, F): relationship-based data cleaning]

Page 8: Exploiting Relationships  for Object Consolidation


Attributed Relational Graph (ARG)

View the database as an ARG

Nodes
– per cluster of representations (if already resolved by a feature-based approach)
– per representation (for “tough” cases)

Edges
– Regular: correspond to relationships between entities
– Similarity: created using feature-based methods on representations

[Figure: example entity types: person, publication, department, organization]
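The ARG described on this slide can be pictured as a small data structure. The sketch below is only an illustration under assumptions: the class name `ARG`, its methods, the relationship type names ("writes", "affiliated") and the similarity scores are all hypothetical and not taken from the RelDC implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ARG:
    """Toy attributed relational graph: typed nodes plus two kinds of edges."""
    node_type: dict = field(default_factory=dict)   # node id -> entity type
    regular: dict = field(default_factory=dict)     # frozenset({u, v}) -> relationship type
    similarity: dict = field(default_factory=dict)  # frozenset({u, v}) -> feature-based score

    def add_node(self, node, kind):
        self.node_type[node] = kind

    def add_regular(self, u, v, rel_type):
        # Regular edge: a relationship between two entities.
        self.regular[frozenset((u, v))] = rel_type

    def add_similarity(self, u, v, score):
        # Similarity edge: two representations a feature-based method could not tell apart.
        self.similarity[frozenset((u, v))] = score

    def neighbors(self, node):
        """All nodes that share a regular or similarity edge with `node`."""
        out = set()
        for edge in list(self.regular) + list(self.similarity):
            if node in edge:
                out |= edge - {node}
        return out


# Hypothetical example data in the spirit of the "J. Smith" slides.
g = ARG()
for name, kind in [("Jane Smith", "person"), ("John Smith", "person"),
                   ("J. Smith", "person"), ("P1", "publication"),
                   ("MIT", "organization")]:
    g.add_node(name, kind)
g.add_regular("J. Smith", "P1", "writes")
g.add_regular("Jane Smith", "MIT", "affiliated")
g.add_similarity("J. Smith", "Jane Smith", 0.8)
g.add_similarity("J. Smith", "John Smith", 0.8)
```

Keeping regular and similarity edges in separate maps mirrors the two edge kinds on the slide and makes it easy to find the "tough" cases later: they are exactly the nodes that still carry similarity edges.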

Page 9: Exploiting Relationships  for Object Consolidation


Context Attraction Principle (CAP)

Who is “J. Smith”?
– Jane?
– John?

[Figure: merging a new publication by “J. Smith”, with candidate nodes Jane Smith and John Smith]

Page 10: Exploiting Relationships  for Object Consolidation


Questions to Answer

1. Does the CAP principle hold over real datasets? That is, if we consolidate objects based on it, will the quality of consolidation improve?

2. Can we design a generic strategy that exploits CAP for consolidation?

Page 11: Exploiting Relationships  for Object Consolidation


Consolidation Algorithm

1. Construct the ARG and identify all virtual clusters (VCSs)
– use FBS (feature-based similarity) in constructing the ARG

2. Choose a VCS and compute connection strength between nodes
– for each pair of representations connected via a similarity edge

3. Partition the VCS
– use a graph partitioning algorithm
– partitioning is based on connection strength
– after partitioning, adjust the ARG accordingly
– go to Step 2 if more potential clusters exist
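The three steps above can be sketched as a short driver loop. This is a minimal illustration under assumptions, not the authors' code: a VCS is taken to be a connected component over similarity edges, the `strength` and `partition` callables are hypothetical placeholders for the connection-strength model and the graph partitioner, and the "adjust the ARG" step is omitted.

```python
def virtual_clusters(nodes, sim_edges):
    """Step 1 (assumed): each connected component over similarity edges is one VCS."""
    adj = {n: set() for n in nodes}
    for u, v in sim_edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, result = set(), []
    for start in nodes:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            n = stack.pop()
            if n not in comp:
                comp.add(n)
                stack.extend(adj[n] - comp)
        seen |= comp
        result.append(comp)
    return result


def consolidate(nodes, sim_edges, strength, partition):
    """Steps 2 and 3: weigh each similarity edge in a VCS, then partition the VCS."""
    clusters = []
    for vcs in virtual_clusters(nodes, sim_edges):
        weights = {(u, v): strength(u, v)
                   for u, v in sim_edges if u in vcs and v in vcs}
        clusters.extend(partition(vcs, weights))
    return clusters


def threshold_partition(vcs, weights, t=0.5):
    """Hypothetical stand-in partitioner: keep only edges with strength >= t."""
    strong = [edge for edge, w in weights.items() if w >= t]
    return virtual_clusters(sorted(vcs), strong)


# Toy run with made-up connection strengths.
toy_strength = {("a", "b"): 0.9, ("b", "c"): 0.1}
clusters = consolidate(["a", "b", "c", "d"], [("a", "b"), ("b", "c")],
                       lambda u, v: toy_strength[(u, v)], threshold_partition)
```

In the toy run, "a" and "b" stay together because their connection is strong, "c" is split off, and "d" forms its own cluster since it has no similarity edges.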

Page 12: Exploiting Relationships  for Object Consolidation


Connection Strength c(u,v)

Models for c(u,v)
– many possibilities: diffusion kernels, random walks, etc.
– none is fully adequate
– cannot learn similarity from data

[Figure: graph connecting u and v through nodes A, B, C, D, E, F, G, H, z]

Diffusion kernels
– $\sigma(x,y) = \sigma_1(x,y)$: “base similarity”, via direct links (of size 1)
– $\sigma_k(x,y)$: “indirect similarity”, via links of size k
– $B$: base similarity matrix, where $B_{xy} = B^1_{xy} = \sigma_1(x,y)$
– $B^k$: indirect similarity matrix
– $K$: total similarity matrix, or “kernel”
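To make the matrices B, B^k and K concrete, here is a small pure-Python sketch. The slide does not say how the indirect similarities are combined into K; the sketch assumes the common form of a decayed, truncated sum, K = sum over k = 1..max_len of decay^k · B^k, where `decay` and `max_len` are parameters introduced here for illustration.

```python
def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def truncated_kernel(base, decay, max_len):
    """Assumed kernel: K = sum_{k=1..max_len} decay**k * B**k."""
    n = len(base)
    total = [[0.0] * n for _ in range(n)]
    power = base  # B^1: base similarity via direct links
    for k in range(1, max_len + 1):
        for i in range(n):
            for j in range(n):
                total[i][j] += decay ** k * power[i][j]
        power = matmul(power, base)  # B^(k+1): indirect similarity via longer links
    return total


# Path graph 0 - 1 - 2 with base similarity 1 on each direct link.
B = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
K = truncated_kernel(B, decay=0.5, max_len=2)
```

Here nodes 0 and 2 have no direct link, so their entry in K comes only from the length-2 term. Note that matrix powers count walks, which may revisit nodes, whereas the later slides restrict attention to simple paths.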

Page 13: Exploiting Relationships  for Object Consolidation


Connection Strength c(u,v) (cont.)

[Figure: edges of types T1 and T2 on paths through N-2 intermediate nodes; example nodes John Smith, Alan White, P1, MIT]

Instantiating parameters
– Determining $\sigma(x,y)$
  – regular edges have types T1,...,Tn
  – types T1,...,Tn have weights w1,...,wn
  – $\sigma(x,y) = w_i$: get the type of a given edge and assign its weight as the base similarity
– Handling similarity edges
  – $\sigma(x,y)$ is assigned a value proportional to the similarity (heuristic)
– Approach to learn $\sigma(x,y)$ from data (ongoing work)

Implementation
– we do not compute the whole matrix K
– we compute one c(u,v) at a time
– limit path lengths by L

[Figure: example graphs (a), (b), (c) over nodes P1, P2, P3, P4, R1:John, R2:J.Smith, R3:John, A1:John, A3:John, A4:Alan, A5:Mike, A6:Tom, A7:Kate, MIT, Stanford]
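The parameter instantiation on this slide amounts to a lookup by edge type plus a scaling rule for similarity edges. A minimal sketch, assuming edges are given as (kind, value) pairs; the type names, the weights in `TYPE_WEIGHTS` and the `sim_scale` factor are hypothetical.

```python
# Hypothetical weights w_i, one per regular edge type T_i.
TYPE_WEIGHTS = {"writes": 0.5, "affiliated": 0.3}


def base_similarity(edge, type_weights=TYPE_WEIGHTS, sim_scale=1.0):
    """Base similarity of one edge.

    edge is ("regular", type_name) or ("similarity", score).
    """
    kind, value = edge
    if kind == "regular":
        # Get the type of the edge and use that type's weight.
        return type_weights[value]
    if kind == "similarity":
        # Heuristic from the slide: a value proportional to the similarity.
        return sim_scale * value
    raise ValueError(f"unknown edge kind: {kind!r}")
```

For example, `base_similarity(("regular", "writes"))` returns the weight registered for the "writes" type, and `base_similarity(("similarity", 0.8), sim_scale=0.5)` scales the feature-based score by one half.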

Page 14: Exploiting Relationships  for Object Consolidation


Consolidation via Partitioning

Observations
– each VCS contains representations of at least 1 object
– if a representation is in a VCS, then the rest of the representations of the same object are in it too

Partitioning
– two cases
  – k, the number of entities in the VCS, is known
  – k is unknown
– when k is known, use any partitioning algorithm
  – maximize inside connections, minimize outside connections
  – we use normalized cut [Shi, Malik’2000]
– when k is unknown
  – split into two: just to see the cut
  – compare the cut against a threshold
  – decide “to split” or “not to split”
  – iterate

[Figure: VCS 1 and VCS 2, with representation nodes labeled by entity number 1 to 5]
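The unknown-k case can be sketched as recursive bisection. Two things in the sketch are assumptions rather than the slide's method: the best bisection is found by brute force over all two-way splits (a stand-in for the spectral normalized-cut algorithm of Shi and Malik, usable only on small VCSs), and a split is accepted when the normalized cut falls below `threshold`.

```python
from itertools import combinations


def ncut(part_a, part_b, w):
    """Normalized cut: cut(A,B)/assoc(A,V) + cut(A,B)/assoc(B,V), with V = A | B."""
    def weight(u, v):
        return w.get((u, v), w.get((v, u), 0.0))
    nodes = part_a | part_b
    cut = sum(weight(u, v) for u in part_a for v in part_b)
    assoc_a = sum(weight(u, v) for u in part_a for v in nodes if u != v)
    assoc_b = sum(weight(u, v) for u in part_b for v in nodes if u != v)
    if assoc_a == 0 or assoc_b == 0:
        return 0.0  # one side has no connections at all, so cutting it off is free
    return cut / assoc_a + cut / assoc_b


def best_bisection(nodes, w):
    """Brute-force search for the two-way split with the smallest normalized cut."""
    nodes = sorted(nodes)
    first, rest = nodes[0], nodes[1:]
    best = None
    for size in range(len(rest)):  # the second part always stays non-empty
        for extra in combinations(rest, size):
            a = {first, *extra}
            b = set(nodes) - a
            score = ncut(a, b, w)
            if best is None or score < best[0]:
                best = (score, a, b)
    return best


def split_recursively(nodes, w, threshold):
    """Split into two, compare the cut against the threshold, then iterate."""
    nodes = set(nodes)
    if len(nodes) < 2:
        return [nodes]
    score, a, b = best_bisection(nodes, w)
    if score >= threshold:  # parts are too strongly connected: do not split
        return [nodes]
    return split_recursively(a, w, threshold) + split_recursively(b, w, threshold)


# Made-up connection strengths: a-b and c-d are strong, b-c is weak.
strengths = {("a", "b"): 0.9, ("c", "d"): 0.8, ("b", "c"): 0.1}
parts = split_recursively({"a", "b", "c", "d"}, strengths, threshold=0.5)
```

In this toy input the weak b-c link is the cheapest cut, so the VCS splits into {a, b} and {c, d}; neither half is split further because cutting a single strong edge scores well above the threshold.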

Page 15: Exploiting Relationships  for Object Consolidation


Measuring Quality of Outcome

– Dispersion: for an entity, into how many clusters its representations are clustered; ideal is 1
– Diversity: for a cluster, how many distinct entities it covers; ideal is 1
– Entity uncertainty: for an entity, if out of m representations m1 go to C1; ...; mn go to Cn, then $H(E) = -\sum_{i=1}^{n} \frac{m_i}{m} \log_2 \frac{m_i}{m}$
– Cluster uncertainty: if a cluster consists of m representations, m1 of E1; ...; mn of En, then $H(C)$ is given by the same formula
– ideal entropy is zero

[Figure: four clusterings of twelve representations (six of entity 1, six of entity 2) into clusters C1 and C2: Ideal Clustering; One Misassigned (Example 1); Half Misassigned; One Misassigned (Example 2)]

Clustering                    Div(C1)  H(C1)  Div(C2)  H(C2)  Dis(E1)  H(E1)  Dis(E2)  H(E2)
Ideal Clustering              1        0      1        0      1        0      1        0
One Misassigned (Example 1)   2        0.65   2        0.65   2        0.65   2        0.65
Half Misassigned              2        1      2        1      2        1      2        1
One Misassigned (Example 2)   2        0.592  1        0      1        0      2        0.65

– Example 1 vs. Half Misassigned: Dis/Div cannot distinguish the two cases. Entropy can: since 0.65 < 1, the first clustering is better
– Example 2: average entropy decreases (improves), compared to Example 1
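The four measures can be computed directly from a list of (entity, cluster) assignments. The sketch below assumes base-2 entropy, which matches the 0.65, 0.592 and 1 values on the slide; the function names and the input layout are illustrative choices.

```python
from collections import Counter
from math import log2


def entropy(counts):
    """Base-2 entropy of a split into groups with the given sizes."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c)


def quality(assignments):
    """assignments: one (entity, cluster) pair per representation."""
    by_entity, by_cluster = {}, {}
    for entity, cluster in assignments:
        by_entity.setdefault(entity, Counter())[cluster] += 1
        by_cluster.setdefault(cluster, Counter())[entity] += 1
    dispersion = {e: len(c) for e, c in by_entity.items()}   # clusters per entity
    diversity = {c: len(e) for c, e in by_cluster.items()}   # entities per cluster
    h_entity = {e: entropy(c.values()) for e, c in by_entity.items()}
    h_cluster = {c: entropy(e.values()) for c, e in by_cluster.items()}
    return dispersion, diversity, h_entity, h_cluster


# "One Misassigned (Example 2)": C1 holds six 1s and one 2, C2 holds five 2s.
example2 = [(1, "C1")] * 6 + [(2, "C1")] + [(2, "C2")] * 5
dispersion, diversity, h_entity, h_cluster = quality(example2)
```

For this input the sketch gives dispersion 1 and 2 for entities 1 and 2, diversity 2 and 1 for C1 and C2, and entropies that round to the 0.592 and 0.65 shown for Example 2.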

Page 16: Exploiting Relationships  for Object Consolidation


Experimental Setup

Parameters
– L-short simple paths, L = 7
– L is the path-length limit

Note
– The algorithm is applied to “tough cases”, after FBS has already successfully consolidated many entries!

RealMov
– movies (12K)
– people (22K): actors, directors, producers
– studios (1K): producing, distributing

Uncertainty
– d1,d2,...,dn are director entities
– pick a fraction d1,d2,...,dm
– group entries into groups of size k, e.g. in groups of two: {d1,d2}, ... ,{d9,d10}
– make all representations of a group indiscernible by FBS, ...

Baseline 1
– one cluster per VCS, regardless
– equivalent to using only FBS
– ideal dispersion & H(E)!

Baseline 2
– knows grouping statistics
– guesses the number of entities in a VCS
– randomly assigns representations to clusters
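The uncertainty-injection procedure can be sketched as follows. The slide does not say how the fraction is picked or how a group is made indiscernible, so the sketch assumes a seeded random sample and a shared alias string per group; the function name and the alias format are hypothetical.

```python
import random


def inject_uncertainty(directors, fraction, k, seed=0):
    """Pick a fraction of the director entities and merge them in groups of size k.

    Returns a mapping entity -> shared alias. All entities in one group get the
    same alias, so a feature-based comparison can no longer tell them apart.
    """
    rng = random.Random(seed)
    m = int(len(directors) * fraction)
    m -= m % k  # keep only complete groups
    chosen = rng.sample(directors, m)
    alias = {}
    for start in range(0, m, k):
        group = chosen[start:start + k]
        label = "|".join(sorted(group))
        for d in group:
            alias[d] = label
    return alias


directors = [f"d{i}" for i in range(1, 21)]
alias = inject_uncertainty(directors, fraction=0.5, k=2)
```

With 20 directors, a fraction of 0.5 and k = 2, ten directors are affected and they fall into five groups of two; every VCS created this way contains exactly k entities, which is the "grouping statistics" that Baseline 2 is allowed to know.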

Page 17: Exploiting Relationships  for Object Consolidation


Sample Movies Data

Page 18: Exploiting Relationships  for Object Consolidation


The Effect of L on Quality

[Plots: Cluster Entropy & Diversity; Entity Entropy & Dispersion]

Page 19: Exploiting Relationships  for Object Consolidation


Effect of Threshold and Scalability

Page 20: Exploiting Relationships  for Object Consolidation


Summary

RelDC
– domain-independent data cleaning framework
– uses relationships for data cleaning
  – reference disambiguation [SDM’05]
  – object consolidation [IQIS’05]

Ongoing work
– “learning” the importance of relationships from data
– exploiting relationships among entities for other data cleaning problems

Page 21: Exploiting Relationships  for Object Consolidation


Contact Information

RelDC project: www.ics.uci.edu/~dvk/RelDC, www.itr-rescue.org (RESCUE)

Zhaoqi Chen: [email protected]

Dmitri V. Kalashnikov: www.ics.uci.edu/~dvk, [email protected]

Sharad Mehrotra: www.ics.uci.edu/~sharad, [email protected]

Page 22: Exploiting Relationships  for Object Consolidation


extra slides…

Page 23: Exploiting Relationships  for Object Consolidation


Object Consolidation

Notation
– O = {o1,...,o|O|}: set of entities
  – unknown in general
– X = {x1,...,x|X|}: set of representations
– d[xi]: the entity xi refers to
  – unknown in general
– C[xi]: all representations that refer to d[xi], the “group set”
  – unknown in general
  – the goal is to find it for each xi
– S[xi]: all representations that might refer to the same entity as xi, the “consolidation set”
  – determined by FBS
  – we assume C[xi] ⊆ S[xi]
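As an illustration of how a consolidation set S[xi] could be produced, the sketch below uses a string-similarity score as a hypothetical feature-based similarity. The choice of `difflib.SequenceMatcher`, the 0.6 threshold and the sample names are assumptions made for this example; the slides do not specify which FBS method is used.

```python
from difflib import SequenceMatcher


def fbs(a, b):
    """Hypothetical feature-based similarity: character-level ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def consolidation_sets(reprs, threshold=0.6):
    """S[x]: every representation whose similarity to x reaches the threshold."""
    return {x: {y for y in reprs if fbs(x, y) >= threshold} for x in reprs}


reprs = ["J. Smith", "John Smith", "Jane Smith", "Alan White"]
S = consolidation_sets(reprs)
```

Here S["J. Smith"] contains "J. Smith", "John Smith" and "Jane Smith" but not "Alan White". The true group set C["J. Smith"] is some subset of it, and finding that subset is the part left to relationship analysis.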

Page 24: Exploiting Relationships  for Object Consolidation


Object Consolidation Problem

• Let O = {o1,...,o|O|} be the set of entities
– unknown in general

• Let X = {x1,...,x|X|} be the set of representations

• Map xi to its corresponding entity oj in O
– d[xi]: the entity xi refers to
  – unknown in general
– C[xi]: all representations that refer to d[xi], the “group set”
  – unknown in general
  – the goal is to find it for each xi
– S[xi]: all representations that might refer to the same entity as xi, the “consolidation set”
  – determined by FBS
  – we assume C[xi] ⊆ S[xi]

Page 25: Exploiting Relationships  for Object Consolidation


RelDC Framework

[Figure: RelDC framework. Stage labels: Raw Data, Representation (Tables/ARGs), Extraction, Analysis, Data Cleaning. Traditional methods (features and context f1, f2, f3, f4 of X and Y) + relationship analysis (ARG over nodes X, Y, A, B, C, D, E, F): relationship-based data cleaning]

Page 26: Exploiting Relationships  for Object Consolidation


Connection Strength

Computation of c(u,v)

Phase 1: Discover connections
– all L-short simple paths between u and v
– bottleneck
– optimizations, not in IQIS’05

Phase 2: Measure the strength
– in the discovered connections
– many c(u,v) models exist
– we use a model similar to diffusion kernels

[Figure: graph connecting u and v through nodes A, B, C, D, E, F, G, H, z]
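The two phases can be sketched in a few lines. Phase 1 enumerates all simple paths of at most L edges between u and v; Phase 2 scores them. The scoring rule used here, the product of edge weights along each path summed over all paths, is an assumption chosen to resemble diffusion kernels, and the graph and weights are made up.

```python
def simple_paths(adj, u, v, max_len):
    """Phase 1: yield every simple path from u to v with at most max_len edges."""
    stack = [(u, [u])]
    while stack:
        node, path = stack.pop()
        if node == v:
            yield path
            continue
        if len(path) - 1 == max_len:
            continue  # path-length limit L reached
        for nxt in adj.get(node, {}):
            if nxt not in path:  # simple path: never revisit a node
                stack.append((nxt, path + [nxt]))


def connection_strength(adj, u, v, max_len):
    """Phase 2 (assumed model): sum over paths of the product of edge weights."""
    total = 0.0
    for path in simple_paths(adj, u, v, max_len):
        strength = 1.0
        for a, b in zip(path, path[1:]):
            strength *= adj[a][b]
        total += strength
    return total


# Undirected toy graph: u-A-v and u-B-C-v, every edge with weight 0.5.
edges = {("u", "A"): 0.5, ("A", "v"): 0.5,
         ("u", "B"): 0.5, ("B", "C"): 0.5, ("C", "v"): 0.5}
adj = {}
for (a, b), w in edges.items():
    adj.setdefault(a, {})[b] = w
    adj.setdefault(b, {})[a] = w
```

With L = 2 only the path through A is found, so the strength is 0.25; raising L to 3 also admits the path through B and C and adds 0.125. The enumeration is exponential in the worst case, which is consistent with the slide calling Phase 1 the bottleneck.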

Page 27: Exploiting Relationships  for Object Consolidation


Our c(u,v) Model

[Figure: edges of types T1 and T2 on paths through N-2 intermediate nodes; example nodes John Smith, Alan White, P1, MIT]

Our c(u,v) model
– regular edges have types T1,...,Tn
– types T1,...,Tn have weights w1,...,wn
– $\sigma(x,y) = w_i$
  – get the type of a given edge
  – assign its weight as the base similarity
– paths with similarity edges
  – might not exist, use heuristics

Our model & diffusion kernels
– virtually identical, but...
– we do not compute the whole matrix K
  – we compute one c(u,v) at a time
  – we limit path lengths by L
– $\sigma(x,y)$ is unknown in general
  – the analyst assigns them
  – learn from data (ongoing work)

[Figure: example graphs (a), (b), (c) over nodes P1, P2, P3, P4, R1:John, R2:J.Smith, R3:John, A1:John, A3:John, A4:Alan, A5:Mike, A6:Tom, A7:Kate, MIT, Stanford]