comp53311 data warehouse prepared by raymond wong presented by raymond wong raywong@cse

COMP5331 1

COMP5331

Data Warehouse

Prepared by Raymond WongPresented by Raymond Wong

raywong@cse

COMP5331 2

Data Warehouse

Also called Online Analytical Processing (OLAP)

Many corporations use data warehouses for their analysis

COMP5331 3

Data Warehouse

Databases Users

Databases UsersData Warehouse

Need to wait for a long time (e.g., 1 day to 1 week)

Pre-computed results

COMP5331 4

Advantages

Fast Query Response

COMP5331 5

Data Warehouse

Problem Data Warehouse NP-hardness

Algorithm Performance Study

COMP5331 6

Data WarehouseParts are bought from suppliers and then sold to customers at a sale price SP

part supplier customer SP

p1 s1 c1 4

p3 s1 c2 3

p2 s3 c1 7

… … … …

Table T

COMP5331 7

p1 s1 c1 4

p3 s1 c2 3

p2 s3 c1 7

… … … …

Table T

p1 p2 p3 p4 p5 suppliers1s2

customer

Data cube

COMP5331 8

e.g., select part, customer, SUM(SP)from table Tgroup by part, customerpart

customer

SUM(SP)

p1 c1 4

p3 c2 3

p2 c1 7

e.g., select customer, SUM(SP)from table Tgroup by customer

customer

SUM(SP)

pc 3 c 2

p1 s1 c1 4

p3 s1 c2 3

p2 s3 c1 7

… … … …

Table T

AVG(SP), MAX(SP), MIN(SP), …

COMP5331 9

Data Warehouse

psc 6M

pc 4M ps 0.8M sc 2M

p 0.2M s 0.01M c 0.1M

none 1

Parts are bought from suppliers and then sold to customers at a sale price SP

p1 s1 c1 4

p3 s1 c2 3

p2 s3 c1 7

… … … …

Table T

COMP5331 10

Data Warehouse

psc 6M

pc 4M ps 0.8M sc 2M

p 0.2M s 0.01M c 0.1M

none 1

Suppose we materialize all views. This wastes a lot of space.

Cost for accessing pc = 4M

Cost for accessing ps = 0.8M

Cost for accessing sc = 2MCost for accessing p = 0.2M

Cost for accessing c = 0.1MCost for accessing s = 0.01M

COMP5331 11

Data Warehouse

psc 6M

pc 4M ps 0.8M sc 2M

p 0.2M s 0.01M c 0.1M

none 1

Suppose we materialize the top view only.

Cost for accessing pc = 6M(not 4M)

Cost for accessing ps = 6M(not 0.8M)

Cost for accessing sc = 6M(not 2M)Cost for accessing p = 6M

(not 0.2M)

Cost for accessing c = 6M(not 0.1M)

Cost for accessing s = 6M(not 0.01M)

COMP5331 12

Data Warehouse

psc 6M

pc 4M ps 0.8M sc 2M

p 0.2M s 0.01M c 0.1M

none 1

Suppose we materialize the top view and the view for “ps” only.

Cost for accessing pc = 6M(still 6M)

Cost for accessing sc = 6M(still 6M)

Cost for accessing p = 0.8M(not 6M previously)

Cost for accessing ps = 0.8M(not 6M previously)

Cost for accessing c = 6M(still 6M)

Cost for accessing s = 0.8M(not 6M previously)

COMP5331 13

Data Warehouse

psc 6M

pc 4M ps 0.8M sc 2M

p 0.2M s 0.01M c 0.1M

none 1

Suppose we materialize the top view and the view for “ps” only.

Cost for accessing pc = 6M(still 6M)

Cost for accessing sc = 6M(still 6M)

Cost for accessing p = 0.8M(not 6M previously)

Cost for accessing ps = 0.8M(not 6M previously)

Cost for accessing c = 6M(still 6M)

Cost for accessing s = 0.8M(not 6M previously)

Gain = 0Gain = 5.2M

Gain = 5.2M Gain = 0

Gain({view for “ps”, top view}, {top view}) = 5.2*3 = 15.6

Selective Materialization Problem:We can select a set V of k views such that Gain(V U {top view}, {top view}) is maximized.

COMP5331 14

Data Warehouse

COMP5331 15

NP-hardness

Selective Materialization Problem is NP-hard.

COMP5331 16

NP-hardness

Selective Materialization Decision Problem (SMD) Given an integer k and a real number J,

We want to find a set V of k views such that Gain(V U {top view}, {top view}) is at least J.

Selective Materialization Decision Problem is NP-hard.

COMP5331 17

NP-hardness Exact Cover by 3-Sets (XC)

Instance: Set X with 3q elements, and a collection C of size 3 subsets of X

Question: Does C contain an exact cover for X, i.e., a subcollection C’ C such that every element of X occurs in exactly one set of C’.

It is well-known that this problem is NP-complete.

COMP5331 18

NP-hardness Problem XC can be transformed to Problem SMD

Create a root node with size = 200 (at level 1) Create a bottom node with size 1 (at level 4) For each element x in X,

Create a node Nx with size = 50 at level 3 Create an edge between Nx and the bottom node

For each element a C (where a = (x, y, z))

Create a node Na with size = 100 at level 2 Create an edge between Na and the root node Create an edge between Na and Nx Create an edge between Na and Ny Create an edge between Na and Nz

Set k = q Set J = 400q

COMP5331 19

NP-hardness

E.g., X = {A, B, C, D, E, F} C = {(A, B, C), (B, C, D), (D, E, F)}

A B C D E F

50 50 50 50 50 50

100 100 100

J = 400x2 = 800

COMP5331 20

NP-hardness

It is easy to verify that solving the problem SMD is equal to solving problem XC

Problem SMD is NP-hard.

COMP5331 21

Data Warehouse

psc 6M

pc 4M ps 0.8M sc 2M

p 0.2M s 0.01M c 0.1M

none 1

p1 s1 c1 4

p3 s1 c2 3

p2 s3 c1 7

… … … …

Table T

COMP5331 22

Data Warehouse

psc 6M

pc 6M ps 0.8M sc 2M

p 0.2M s 0.01M c 0.1M

none 1

p1 s1 c1 4

p3 s1 c2 3

p2 s3 c1 7

… … … …

Table T

COMP5331 23

Data Warehouse

Problem Date Warehouse NP-hardness

COMP5331 24

Greedy Algorithm k = number of views to be materialized

Given v is a view S is a set of views which are selected to be

materialized Define the benefit of selecting v for

materialization as B(v, S) = Gain(S U {v}, S)

COMP5331 25

Greedy Algorithm

S {top view}; For i = 1 to k do

Select that view v not in S such that B(v, S) is maximized;

S S U {v} Resulting S is the greedy selection

COMP5331 26

1.1 Data Cube

psc 6M

pc 6M ps 0.8M sc 6M

p 0.2M s 0.01M c 0.1M

none 1

1st Choice (M)

2nd Choice (M)

0 x 3= 0

Benefit from pc =

Benefit

6M-6M = 0 k = 2

COMP5331 27

1.1 Data Cube

psc 6M

pc 6M ps 0.8M sc 6M

p 0.2M s 0.01M c 0.1M

none 1

1st Choice (M)

2nd Choice (M)

0 x 3= 0

5.2 x 3= 15.6

Benefit from ps =

Benefit

6M-0.8M = 5.2M k = 2

COMP5331 28

1.1 Data Cube

psc 6M

pc 6M ps 0.8M sc 6M

p 0.2M s 0.01M c 0.1M

none 1

1st Choice (M)

2nd Choice (M)

0 x 3= 0

5.2 x 3= 15.6

Benefit from sc =

Benefit

6M-6M = 0

0 x 3= 0

COMP5331 29

1.1 Data Cube

psc 6M

pc 6M ps 0.8M sc 6M

p 0.2M s 0.01M c 0.1M

none 1

1st Choice (M)

2nd Choice (M)

0 x 3= 0

5.2 x 3= 15.6

Benefit from p =

Benefit

6M-0.2M = 5.8M

0 x 3= 0

5.8 x 1= 5.8

COMP5331 30

1.1 Data Cube

psc 6M

pc 6M ps 0.8M sc 6M

p 0.2M s 0.01M c 0.1M

none 1

1st Choice (M)

2nd Choice (M)

0 x 3= 0

5.2 x 3= 15.6

Benefit from s =

Benefit

6M-0.01M = 5.99M

0 x 3= 0

5.8 x 1= 5.8

5.99 x 1= 5.99

COMP5331 31

1.1 Data Cube

psc 6M

pc 6M ps 0.8M sc 6M

p 0.2M s 0.01M c 0.1M

none 1

1st Choice (M)

2nd Choice (M)

0 x 3= 0

5.2 x 3= 15.6

Benefit from c =

Benefit

6M-0.1M = 5.9M

0 x 3= 0

5.8 x 1= 5.8

5.99 x 1= 5.99

5.9 x 1= 5.9

COMP5331 32

1.1 Data Cube

psc 6M

pc 6M ps 0.8M sc 6M

p 0.2M s 0.01M c 0.1M

none 1

1st Choice (M)

2nd Choice (M)

0 x 3= 0

5.2 x 3= 15.6

Benefit

0 x 3= 0

5.8 x 1= 5.8

5.99 x 1= 5.99

5.9 x 1= 5.9

Benefit from pc = 6M-6M = 0

0 x 2= 0

COMP5331 33

1.1 Data Cube

psc 6M

pc 6M ps 0.8M sc 6M

p 0.2M s 0.01M c 0.1M

none 1

1st Choice (M)

2nd Choice (M)

0 x 3= 0

5.2 x 3= 15.6

Benefit

0 x 3= 0

5.8 x 1= 5.8

5.99 x 1= 5.99

5.9 x 1= 5.9

Benefit from sc = 6M-6M = 0

0 x 2= 0

COMP5331 34

1.1 Data Cube

psc 6M

pc 6M ps 0.8M sc 6M

p 0.2M s 0.01M c 0.1M

none 1

1st Choice (M)

2nd Choice (M)

0 x 3= 0

5.2 x 3= 15.6

Benefit

0 x 3= 0

5.8 x 1= 5.8

5.99 x 1= 5.99

5.9 x 1= 5.9

Benefit from p = 0.8M-0.2M = 0.6M

0 x 2= 0

0.6 x 1= 0.6

COMP5331 35

1.1 Data Cube

psc 6M

pc 6M ps 0.8M sc 6M

p 0.2M s 0.01M c 0.1M

none 1

1st Choice (M)

2nd Choice (M)

0 x 3= 0

5.2 x 3= 15.6

Benefit

0 x 3= 0

5.8 x 1= 5.8

5.99 x 1= 5.99

5.9 x 1= 5.9

Benefit from s = 0.8M-0.01M = 0.79M

0 x 2= 0

0.6 x 1= 0.6

0.79 x 1= 0.79

COMP5331 36

1.1 Data Cube

psc 6M

pc 6M ps 0.8M sc 6M

p 0.2M s 0.01M c 0.1M

none 1

1st Choice (M)

2nd Choice (M)

0 x 3= 0

5.2 x 3= 15.6

Benefit

0 x 3= 0

5.8 x 1= 5.8

5.99 x 1= 5.99

5.9 x 1= 5.9

Benefit from c = 6M-0.1M = 5.9M

0 x 2= 0

0.6 x 1= 0.6

0.79 x 1= 0.79

5.9 x 1= 5.9

COMP5331 37

1.1 Data Cube

psc 6M

pc 6M ps 0.8M sc 6M

p 0.2M s 0.01M c 0.1M

none 1

1st Choice (M)

2nd Choice (M)

0 x 3= 0

5.2 x 3= 15.6

Benefit

0 x 3= 0

5.8 x 1= 5.8

5.99 x 1= 5.99

5.9 x 1= 5.9

0 x 2= 0

0.6 x 1= 0.6

0.79 x 1= 0.79

5.9 x 1= 5.9

Two views to be materialized are1. ps2. c

V = {ps, c} Gain(V U {top view}, {top view})= 15.6 + 5.9 = 21.5

COMP5331 38

Data Warehouse

COMP5331 39

Performance Study

How bad does the Greedy Algorithm perform?

COMP5331 40

1.1 Data Cube

b 100 c 99 d 100

none 1

1st Choice (M) 2nd Choice (M)

… … …

41 x 100= 4100

Benefit from b =

Benefit

200-100= 100

20 nodes …

p20 97

q20 97

r20 97

…s20 97

COMP5331 41

1.1 Data Cube

b 100 c 99 d 100

none 1

… … …

41 x 100= 4100

Benefit from c =

Benefit

200-99 = 101

20 nodes …

p20 97

q20 97

r20 97

…s20 97

41 x 101= 4141

COMP5331 42

1.1 Data Cube

b 100 c 99 d 100

none 1

… … …

41 x 100= 4100

Benefit

20 nodes …

p20 97

q20 97

r20 97

…s20 97

41 x 101= 4141

41 x 100= 4100

COMP5331 43

1.1 Data Cube

b 100 c 99 d 100

none 1

… … …

41 x 100= 4100

Benefit

20 nodes …

p20 97

q20 97

r20 97

…s20 97

41 x 101= 4141

41 x 100= 4100

Benefit from b = 200-100= 100

21 x 100= 2100

COMP5331 44

1.1 Data Cube

b 100 c 99 d 100

none 1

… … …

41 x 100= 4100

Benefit

20 nodes …

p20 97

q20 97

r20 97

…s20 97

41 x 101= 4141

41 x 100= 4100

21 x 100= 2100

Greedy: V = {b, c} Gain(V U {top view}, {top view})= 4141 + 2100 = 6241

COMP5331 45

1.1 Data Cube

b 100 c 99 d 100

none 1

… … …

41 x 100= 4100

Benefit

20 nodes …

p20 97

q20 97

r20 97

…s20 97

41 x 101= 4141

41 x 100= 4100 41 x 100= 4100

21 x 101 + 20 x 1= 2141

Optimal: V = {b, d} Gain(V U {top view}, {top view})= 4100 + 4100 = 8200

COMP5331 46

1.1 Data Cube

b 100 c 99 d 100

none 1

20 nodes …

p20 97

q20 97

r20 97

…s20 97

Optimal: V = {b, d} Gain(V U {top view}, {top view})= 4100 + 4100 = 8200

Greedy

Optimal=

8200=0.7611

If this ratio = 1, Greedy can give an optimal solution. If this ratio 0, Greedy may give a “bad” solution.

Does this ratio has a “lower” bound?

It is proved that this ratio is at least 0.63.

COMP5331 47

Performance Study

This is just an example to show that this greedy algorithm can perform badly.

A complete proof of the lower bound can be found in the paper.

comp53311 data warehouse prepared by raymond wong presented by raymond wong raywong@cse

Documents

anti-money laundering and counter-terrorist financing...

anti-money laundering and counter-terrorist financing...

1 probabilistic inference protection on anonymized data...

comp53311 data stream prepared by raymond wong presented by...

1 intro to crypto and mod supplementary notes prepared by...

1 variance of rvs supplementary notes prepared by raymond...

1 inclusion-exclusion supplementary notes prepared by...

learning on probabilistic labels peng peng, raymond chi-wing...

creating energy with infinite...

comp53311 classification prepared by raymond wong the...

wong and wong

john q. wong, md, msc justine joyce alim, jose lorenzo...

1 efficient method for maximizing bichromatic reverse...

intrinsic physiological properties of rat retinal ganglion...

surmounting challenges - southern acids report-2013.pdf ·...

thalassemia chelation update: prof raymond wong

1 binomial coefficient supplementary notes prepared by...

diaphragm wall 2 - const-infobank.org – raymond wong wai...

1 independence supplementary notes prepared by raymond wong...

construction of langham place by raymond wong wai...