cs345 compact skeletons. assume tuples components are scattered over website we have a tagger that...

Post on 22-Dec-2015

215 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

CS345CS345

Compact Skeletons

Compact SkeletonsCompact Skeletons

Assume tuples components are scattered over website

We have a tagger that can tag all tuple components on website– Assume no noise for now

Reconstruct relation

Website

Data Graph

Skeleton

Relation

Compact SkeletonsCompact Skeletons

Dept (D)

Title (T)

Salary (S)

Address (A)

Welcome to Big Corp!

Join our team.Jobs are available in these departments:

R&D

Corporate The following jobs are open:

Job #12345

Job #12346

Send resumes to:1200 Jose Blvd, CA 94123Job Title:

Salary: Must know Java…..

Programmer100K

Dept (D)

Title (T)

Salary (S)

Address (A)

Welcome to Big Corp!

Join our team.Jobs are available in these departments:

R&D

Corporate The following jobs are open:

Job #12345

Job #12346

Send resumes to:1200 Jose Blvd, CA 94123Job Title:

Salary: Must know Java…..

Programmer100K

R & D

Programmer 100K

1200 Jose Blvd

Corporate

R & D

Programmer 100K CTO 150K

1200 Jose Blvd

Corporate

CEOAdminAsst

400 7th Ave

60K

Programmer 100K R &D 1200 Jose BlvdCTO 150K R & D 1200 Jose BlvdAdmin Asst 60K Corporate 400 7th AveCEO (null) Corporate 400 7th Ave

T S D A

R & D

Programmer 100K CTO 150K

1200 Jose Blvd

Corporate

CEOAdminAsst

60K

Programmer 100K R &D 1200 Jose BlvdCTO 150K R & D 1200 Jose BlvdAdmin Asst 60K Corporate 1200 Jose BlvdCEO (null) Corporate 1200 Jose Blvd

T S D A

Website

Data Graph

Skeleton

Relation

SkeletonsSkeletons

Labeled treesTransformation from data graphs to

relations

D

T S

A

D

T S

A

R & D

Programmer 100K CTO 150K

1200 Jose Blvd

D

T S

A

Overlays

R & D

Programmer 100K CTO 150K

1200 Jose Blvd

D

T S

A

Overlays

T S D A

Programmer 100K R &D 1200 Jose Blvd

R & D

Programmer 100K CTO 150K

1200 Jose Blvd

D

T S

A

Overlays

T S D A

Programmer 100K R &D 1200 Jose Blvd

CTO 150K R &D 1200 Jose Blvd

D

T S

A

R & D

Programmer 100K CTO 150K

1200 Jose Blvd

Overlays

D

T S

A

R & D

Programmer 100K CTO 150K

1200 Jose Blvd

T S D A

Programmer 150K R &D 1200 Jose Blvd

Overlays

D

T S

A

R & D

Programmer 100K CTO 150K

1200 Jose Blvd

T S D A

Programmer 150K R &D 1200 Jose Blvd

CTO 100K R & D 1200 Jose Blvd

Overlays

D

T S

A

R & D

Programmer 100K CTO 150K

1200 Jose Blvd

Inconsistent Overlays

D

T S

A

R & D

Programmer 100K CTO 150K

1200 Jose Blvd

Inconsistent Overlays

Compact SkeletonsCompact Skeletons

A skeleton is compact if all overlays are consistent

Perfect if each node and edge of data graph is covered by at least one overlay

Given a data graph G, does G have a Perfect Compact Skeleton (PCS)?– Not always– But if it exists it is unique

R & D

Programmer 100K CTO 150K

1200 Jose Blvd

PCS Algorithm

D

T S T S

A

PCS Algorithm

Work bottom-up:Compute node signaturesPlace nodes in equivalence classes based on signatureConstruct skeleton from equivalence classes

D

T S T S

A

T S

A

D

PCS Algorithm

Corporate

CEOAdminAsst

400 7th Ave

60K

D

T S

A

Incomplete information

Corporate

CEOAdminAsst

400 7th Ave

60K

D

T S

A

T S D AAdmin Asst 60K Corporate 400 7th Ave

Incomplete information

D

T S

A

Corporate

CEOAdminAsst

400 7th Ave

60K

T S D AAdmin Asst 60K Corporate 400 7th Ave

Incomplete information

CEO Corporate 400 7th Ave

Partial Compact SkeletonsPartial Compact Skeletons

For data graphs with incomplete information, we allow partial overlays– Results in nulls in relation

If we can use consistent partial overlays to cover every node and edge of the graph, we have a partially perfect compact skeleton (PPCS)

Tuple subsumptionTuple subsumption

Tuple t subsumes tuple u if t and u agree on every component of u that is not null

tu

Noisy Data GraphsNoisy Data Graphs

Real-life websites are noisy– False positives e.g., MS = degree, state or

Microsoft?– Non-skeleton links e.g., featured products

Data graph for a retail websiteData graph for a retail website

c1 c2 c3

p1 p3 p4 a3a1 a2

i1 i2 i3 i4a4 C: CategoryI: ItemP: PriceA: Availability

C

I

Skeleton K1

P A

For simplicity: assume all nodes have a label

Coverage of a skeletonCoverage of a skeleton

c1 c2 c3

p1 p3 p4 a3

C

I

Skeleton K1

a1 a2

i1 i2 i3 i4a4

P A

Coverage of a skeletonCoverage of a skeleton

c1 c2 c3

p1 p3 p4 a3

C

I

Skeleton K1Coverage = 28

a1 a2

i1 i2 i3 i4a4

P A

Coverage of a skeletonCoverage of a skeleton

c1 c2 c3

p1 p3 p4 a3

C

I

C I

Skeleton K1Coverage = 28

Skeleton K2Coverage = 12

a1 a2

i1 i2 i3 i4a4

P A

A P

Skeletons for Noisy Data GraphsSkeletons for Noisy Data Graphs

Problem:– Find skeleton K with optimal coverage,

called the best-fit skeleton (BFS)NP-complete

Greedy Heuristic for BFSGreedy Heuristic for BFS

c1 c2 c3

p1 p3 p4 a3a1 a2

i1 i2 i3 i4a4

r

Greedy Heuristic for BFSGreedy Heuristic for BFS

C C C

P P P A CA A

I I I IA

R

R

Label Parent Count

P I 3

A I 3

C 1

I C 4

R 1

C R 1

I

P A

R

A B

CC C

D D D D

Label Parent Count

D C 4

C A 2

B 1

A R 1

B R 1

C

R

A

D

B

Greedy skeleton

R

A B

CC C

D D D D

C

R

A

D

B

Greedy skeletonCoverage = 9

R

A B

CC C

D D D D

C

R

A

D

B

Greedy skeletonCoverage = 9

C

R

A

D

B

Optimal skeletonCoverage = 15

Weighted Greedy Heuristic Weighted Greedy Heuristic

Simple Greedy heuristic uses parent counts– “Memory-less”

Weighted Greedy heuristic takes into account past selections to improve simple greedy selection– Computes “benefit” of each decision at

every stage

C

R

A

D

B

Greedy skeletonCoverage = 9

R

A B

CC C

D D D D

C

D

Weighted Greedy

C

R

A

D

B

Greedy skeletonCoverage = 9

R

A B

CC C

D D D D

C

D

Weighted Greedy

A C ) = 4benefit(

C

R

A

D

B

Greedy skeletonCoverage = 9

R

A B

CC C

D D D D

C

D

Weighted Greedy

A C ) = 4benefit((B C ) = 10benefit

C

R

A

D

B

Greedy skeletonCoverage = 9

R

A B

CC C

D D D D

C

R

A

D

B

Weighted Greedy

R

A B

CC C

D D D D

C

R

A

D

B

Greedy skeletonCoverage = 9

C

R

A

D

B

Weighted greedy skeletonCoverage = 15

Website

Data Graph

Compact Skeleton

Relation

SummarySummary

top related