1 cube computation and indexes for data warehouses cps 196.03 notes 7


Upload: margaret-chandler

Post on 12-Jan-2016



Cube Computation and Indexes for Data Warehouses

CPS 196.03 Notes 7


Processing

• ROLAP servers vs. MOLAP servers
• Index Structures
• Cube computation
• What to Materialize?
• Algorithms

[Figure: warehouse architecture with layers, bottom to top: Source (x3), Integration, Warehouse (with Metadata), Query & Analysis, Client (x2).]


ROLAP Server

Relational OLAP Server

[Figure labels: tools, utilities, ROLAP server, relational DBMS.]

sale
prodId  date  sum
p1      1     62
p2      1     19
p1      2     48

• Special indices, tuning
• Schema is “denormalized”


MOLAP Server

Multi-Dimensional OLAP Server

[Figure labels: M.D. tools, utilities, multi-dimensional server (could also sit on relational DBMS).]

[Figure: Sales cube with dimensions Product (milk, soda, eggs, soap), City (A, B), and Date (1, 2, 3, 4).]


MOLAP

[Figure: data cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, VCR, PC, sum), and Country (U.S.A, Canada, Mexico, sum). One cell is labeled "Total annual sales of TV in U.S.A."]


MOLAP

[Figure: 3-D array over dimensions A (a0 to a3), B (b0 to b3), and C (c0 to c3), partitioned into 64 chunks numbered 1 to 64. The chunk numbers run first along A, then along B, then along C.]


Challenges in MOLAP

• Storing large arrays for efficient access
  • Row-major, column-major
  • Chunking
  • Compressing sparse arrays
• Creating array data from data in tables
• Efficient techniques for cube computation

Topics are discussed in the paper for reading.
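The storage options listed on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names, the 4 x 4 x 4 shape, and the (offset, value) encoding for sparse data are all hypothetical choices, not taken from the assigned paper.

```python
def row_major_offset(index, shape):
    """Linear offset when the LAST dimension varies fastest."""
    offset = 0
    for i, n in zip(index, shape):
        offset = offset * n + i
    return offset


def column_major_offset(index, shape):
    """Linear offset when the FIRST dimension varies fastest."""
    return row_major_offset(tuple(reversed(index)), tuple(reversed(shape)))


def chunk_coordinates(index, chunk_shape):
    """Which chunk a cell falls into when the array is cut into equal blocks."""
    return tuple(i // c for i, c in zip(index, chunk_shape))


def compress_sparse(dense):
    """Keep only non-zero cells as (offset, value) pairs."""
    return [(offset, v) for offset, v in enumerate(dense) if v != 0]


shape = (4, 4, 4)                 # assumed sizes of dimensions (A, B, C)
cell = (3, 0, 1)                  # position (a3, b0, c1)
row_major = row_major_offset(cell, shape)
col_major = column_major_offset(cell, shape)
chunk = chunk_coordinates((5, 9, 2), (4, 4, 4))
pairs = compress_sparse([0, 0, 7, 0, 3])
```

With dimensions ordered (A, B, C), the column-major offset plus one gives 20 for (a3, b0, c1), which is a numbering that runs along A first, then B, then C.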


Index Structures

• Traditional access methods
  • B-trees, hash tables, R-trees, grids, …
• Popular in warehouses
  • inverted lists
  • bit map indexes
  • join indexes
  • text indexes


Inverted Lists

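[Figure: an age index (root keys 20 and 23; leaves 18 19 | 20 21 22 | 23 25 26) points to inverted lists, which point to data records.]

Inverted lists:
age 20: r4, r18, r34, r35
age 21: r5, r19, r37, r40

Data records:
rId  name   age
r4   joe    20
r18  fred   20
r19  sally  21
r34  nancy  20
r35  tom    20
r36  pat    25
r5   dave   21
r41  jeff   26
...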


Using Inverted Lists

Query: Get people with age = 20 and name = “fred”

• List for age = 20: r4, r18, r34, r35
• List for name = “fred”: r18, r52
• Answer is intersection: r18
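The intersection step can be sketched as a merge of two sorted lists. This is a minimal illustration, assuming record ids are kept as sorted integers (18 stands for r18); the function name `intersect_sorted` is hypothetical.

```python
def intersect_sorted(xs, ys):
    """Intersect two ascending lists of record ids in one pass."""
    i = j = 0
    out = []
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            out.append(xs[i])
            i += 1
            j += 1
        elif xs[i] < ys[j]:
            i += 1          # advance the list with the smaller id
        else:
            j += 1
    return out


age_20 = [4, 18, 34, 35]    # inverted list for age = 20
name_fred = [18, 52]        # inverted list for name = "fred"
matches = intersect_sorted(age_20, name_fred)
```

Only the matching record (r18) then needs to be fetched from the data file.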


Bit Maps

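[Figure: an age index (root keys 20 and 23; leaves 18 19 | 20 21 22 | 23 25 26) points to bitmaps, which are aligned with the data records.]

Bitmaps:
age 20: 110110000
age 21: 0010001011

Data records:
id  name   age
1   joe    20
2   fred   20
3   sally  21
4   nancy  20
5   tom    20
6   pat    25
7   dave   21
8   jeff   26
...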


Bitmap Index

• Index on a particular column
• Each value in the column has a bit vector: bit-op is fast
• The length of the bit vector: # of records in the base table
• The i-th bit is set if the i-th row of the base table has the value for the indexed column
• Not suitable for high-cardinality domains

Base table
Cust  Region   Type
C1    Asia     Retail
C2    Europe   Dealer
C3    Asia     Dealer
C4    America  Retail
C5    Europe   Dealer

Index on Region
RecID  Asia  Europe  America
1      1     0       0
2      0     1       0
3      1     0       0
4      0     0       1
5      0     1       0

Index on Type
RecID  Retail  Dealer
1      1       0
2      0       1
3      0       1
4      1       0
5      0       1
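The construction described above can be sketched as follows. This is a minimal illustration using the slide's base table; the helper name `build_bitmap_index` and the list-of-ints representation of a bit vector are assumptions made for readability.

```python
base_table = [
    ("C1", "Asia", "Retail"),
    ("C2", "Europe", "Dealer"),
    ("C3", "Asia", "Dealer"),
    ("C4", "America", "Retail"),
    ("C5", "Europe", "Dealer"),
]


def build_bitmap_index(rows, column):
    """One bit vector per distinct value; bit i is 1 if row i has that value."""
    index = {}
    for rec_id, row in enumerate(rows):
        bits = index.setdefault(row[column], [0] * len(rows))
        bits[rec_id] = 1
    return index


region_index = build_bitmap_index(base_table, column=1)
type_index = build_bitmap_index(base_table, column=2)
```

Each distinct value costs one vector as long as the table, which is why the approach suits low-cardinality columns such as Region and Type.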


Using Bit Maps

Query: Get people with age = 20 and name = “fred”

• List for age = 20: 1101100000
• List for name = “fred”: 0100000001
• Answer is intersection (bitwise AND): 0100000000

• Good if domain cardinality small
• Bit vectors can be compressed
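The query can be checked with a single bitwise AND. This is a small sketch that assumes each bitmap is held in a Python integer and that the leftmost bit corresponds to record 1; the variable names are illustrative.

```python
WIDTH = 10                              # number of records covered by each bitmap

age_20 = int("1101100000", 2)           # bitmap for age = 20
name_fred = int("0100000001", 2)        # bitmap for name = "fred"

both = age_20 & name_fred               # one machine-level AND
answer_bits = format(both, f"0{WIDTH}b")

# Translate set bits back to 1-based record positions.
record_ids = [pos + 1 for pos, bit in enumerate(answer_bits) if bit == "1"]
```

Here `answer_bits` is `0100000000`, so only record 2 qualifies.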


Join

• “Combine” SALE, PRODUCT relations
• In SQL: SELECT * FROM SALE, PRODUCT WHERE ...

sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

product
id  name  price
p1  bolt  10
p2  nut   5

joinTb
prodId  name  price  storeId  date  amt
p1      bolt  10     c1       1     12
p2      nut   5      c1       1     11
p1      bolt  10     c3       1     50
p2      nut   5      c2       1     8
p1      bolt  10     c1       2     44
p1      bolt  10     c2       2     4
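The join can be reproduced with Python's built-in sqlite3 module. This is a sketch under an assumption: the slide elides the WHERE clause, and the condition `s.prodId = p.id` used here is inferred from the joinTb result rather than stated in the notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER);
    CREATE TABLE product (id TEXT, name TEXT, price INTEGER);
""")
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
])
conn.executemany("INSERT INTO product VALUES (?, ?, ?)", [
    ("p1", "bolt", 10), ("p2", "nut", 5),
])

# Assumed join condition: sale.prodId matches product.id.
join_tb = conn.execute("""
    SELECT p.id, p.name, p.price, s.storeId, s.date, s.amt
    FROM sale AS s, product AS p
    WHERE s.prodId = p.id
    ORDER BY s.rowid
""").fetchall()
conn.close()
```

Ordering by the sale row id keeps the result in the same order as the joinTb table on the slide.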


Join Indexes

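product
id  name  price  jIndex
p1  bolt  10     r1, r3, r5, r6
p2  nut   5      r2, r4

sale
rId  prodId  storeId  date  amt
r1   p1      c1       1     12
r2   p2      c1       1     11
r3   p1      c3       1     50
r4   p2      c2       1     8
r5   p1      c1       2     44
r6   p1      c2       2     4

The jIndex column of product is the join index: for each product it lists the sale records that join with it.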
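A join index of this kind can be sketched as a mapping from each product id to the ids of its matching sale records. The dictionary layout and the name `build_join_index` are illustrative assumptions, not the storage format of any particular system.

```python
sale = {
    "r1": ("p1", "c1", 1, 12),
    "r2": ("p2", "c1", 1, 11),
    "r3": ("p1", "c3", 1, 50),
    "r4": ("p2", "c2", 1, 8),
    "r5": ("p1", "c1", 2, 44),
    "r6": ("p1", "c2", 2, 4),
}


def build_join_index(sale_rows):
    """Map each prodId to the list of sale record ids that reference it."""
    j_index = {}
    for r_id, (prod_id, _store, _date, _amt) in sale_rows.items():
        j_index.setdefault(prod_id, []).append(r_id)
    return j_index


j_index = build_join_index(sale)

# Using the index: total amount sold for product p1 without scanning sale.
bolt_total = sum(sale[r_id][3] for r_id in j_index["p1"])
```

The join is precomputed once, so a query on one product touches only the sale records listed for it.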


Cube Computation for Data Warehouses


Counting Exercise

• How many cuboids are there in a cube?
  • The full or nothing case
  • When dimension hierarchies are present
• What is the size of each cuboid?


Lattice of Cuboids

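[Figure: lattice of cuboids, from most to least detailed]
city, product, date
city, product | city, date | product, date
city | product | date
all

Example cuboids:

(city, product, date), day 1:
     c1   c2   c3
p1   12        50
p2   11    8

(city, product, date), day 2:
     c1   c2   c3
p1   44    4
p2

(city, product):
     c1   c2   c3
p1   56    4   50
p2   11    8

(city):
c1   c2   c3
67   12   50

(all): 129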
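Every cuboid in the lattice is a group-by over a subset of the dimensions. The sketch below computes all of them naively from the base sale data; the names `compute_cube` and `DIMS` are hypothetical, and a practical algorithm would derive coarser cuboids from finer ones instead of rescanning the base table each time.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("product", "city", "date")

# (product, city, date, amt) rows from the sale table.
sale = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]


def compute_cube(rows):
    """Return {group-by dimensions: {cell key: summed amt}} for all subsets."""
    cube = {}
    for k in range(len(DIMS), -1, -1):
        for group_by in combinations(range(len(DIMS)), k):
            cuboid = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in group_by)
                cuboid[key] += row[3]
            cube[tuple(DIMS[d] for d in group_by)] = dict(cuboid)
    return cube


cube = compute_cube(sale)
```

With three dimensions and no hierarchies this yields 2^3 = 8 cuboids, from the base cuboid down to the single-cell "all" cuboid.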


Dimension Hierarchies

[Figure: hierarchy on the city dimension, top to bottom: all, state, city.]

cities
city  state
c1    CA
c2    NY


Dimension Hierarchies

[Figure: lattice of cuboids extended with the state level; not all arcs shown.]

Cuboids shown:
• city, product, date
• city, product
• city, date
• product, date
• city
• product
• date
• state, product, date
• state, product
• state, date
• state
• all


Efficient Data Cube Computation

• Data cube can be viewed as a lattice of cuboids
  • The bottom-most cuboid is the base cuboid
  • The top-most cuboid (apex) contains only one cell
  • How many cuboids in an n-dimensional cube with L levels?

$$T = \prod_{i=1}^{n} (L_i + 1)$$

• Materialization of data cube
  • Materialize every (cuboid) (full materialization), none (no materialization), or some (partial materialization)
  • Selection of which cuboids to materialize
    • Based on size, sharing, access frequency, etc.
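As a quick check of the counting formula, the sketch below reads L_i as the number of hierarchy levels of dimension i, not counting the virtual top level "all" (hence the +1). That reading is an assumption made here, and the function name `count_cuboids` is hypothetical.

```python
from math import prod


def count_cuboids(levels_per_dimension):
    """T = product over dimensions of (L_i + 1)."""
    return prod(levels + 1 for levels in levels_per_dimension)


# Three flat dimensions (city, product, date): one level each.
flat = count_cuboids([1, 1, 1])

# Location gains a second level (city, state); product and date stay flat.
with_state = count_cuboids([2, 1, 1])
```

The first case gives 8, the familiar 2^n for n = 3. The second gives 12: the location dimension can appear as city, as state, or not at all.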


Derived Data

• Derived warehouse data
  • indexes
  • aggregates
  • materialized views (next slide)
• When to update derived data?
• Incremental vs. refresh
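The two update strategies can be contrasted on a simple derived aggregate, total amount per product. This is a minimal sketch with hypothetical function names; it covers insertions only, and deletions or non-additive aggregates would need more bookkeeping.

```python
def refresh(sales):
    """Recompute the aggregate from scratch over all base rows."""
    totals = {}
    for prod_id, amt in sales:
        totals[prod_id] = totals.get(prod_id, 0) + amt
    return totals


def apply_insert(totals, prod_id, amt):
    """Incrementally fold one new base row into the existing aggregate."""
    totals[prod_id] = totals.get(prod_id, 0) + amt


sales = [("p1", 12), ("p2", 11), ("p1", 50), ("p2", 8)]    # day 1
totals = refresh(sales)

for new_row in [("p1", 44), ("p1", 4)]:                    # day 2 arrives
    sales.append(new_row)
    apply_insert(totals, *new_row)

recomputed = refresh(sales)
```

Both paths end with the same totals. The incremental path touches only the new rows, while refresh rescans everything.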


Idea of Materialized Views

• Define new warehouse tables/arrays

sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

product
id  name  price
p1  bolt  10
p2  nut   5

joinTb (does not exist at any source)
prodId  name  price  storeId  date  amt
p1      bolt  10     c1       1     12
p2      nut   5      c1       1     11
p1      bolt  10     c3       1     50
p2      nut   5      c2       1     8
p1      bolt  10     c1       2     44
p1      bolt  10     c2       2     4


Efficient OLAP Processing

• Determine which operations should be performed on available cuboids
  • Transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projection
• Determine which materialized cuboid(s) should be selected for OLAP:
  • Let the query to be processed be on {brand, province_or_state} with the condition “year = 2004”, and there are 4 materialized cuboids available:
    1) {year, item_name, city}
    2) {year, brand, country}
    3) {year, brand, province_or_state}
    4) {item_name, province_or_state} where year = 2004
  • Which should be selected to process the query?
• Explore indexing structures & compressed vs. dense arrays in MOLAP
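One way to approach the cuboid-selection question is to test which cuboids hold data at least as fine as the query needs. The sketch below assumes concept hierarchies that the slide does not state (item_name rolls up to brand; city rolls up to province_or_state, then country), and all names in it are hypothetical.

```python
# Assumed hierarchies, finest level first.
HIERARCHIES = [
    ["item_name", "brand"],
    ["city", "province_or_state", "country"],
    ["year"],
]
POSITION = {level: (dim, depth)
            for dim, levels in enumerate(HIERARCHIES)
            for depth, level in enumerate(levels)}


def covers(cuboid_levels, needed):
    """True if the cuboid has the needed level or a finer one in that dimension."""
    dim, depth = POSITION[needed]
    return any(POSITION[lvl][0] == dim and POSITION[lvl][1] <= depth
               for lvl in cuboid_levels)


def can_answer(cuboid, query):
    # A cuboid built with a filter the query lacks is missing data.
    if any(query["where"].get(a) != v for a, v in cuboid["where"].items()):
        return False
    if not all(covers(cuboid["levels"], lvl) for lvl in query["group_by"]):
        return False
    # Each query condition is either pre-applied or evaluable on the cuboid.
    return all(cuboid["where"].get(a) == v or covers(cuboid["levels"], a)
               for a, v in query["where"].items())


query = {"group_by": {"brand", "province_or_state"}, "where": {"year": 2004}}
cuboids = [
    {"levels": {"year", "item_name", "city"}, "where": {}},
    {"levels": {"year", "brand", "country"}, "where": {}},
    {"levels": {"year", "brand", "province_or_state"}, "where": {}},
    {"levels": {"item_name", "province_or_state"}, "where": {"year": 2004}},
]
usable = [n for n, c in enumerate(cuboids, start=1) if can_answer(c, query)]
```

Under these assumptions cuboid 2 is ruled out, since country is coarser than province_or_state. Choosing among the remaining candidates would still require a cost estimate, for example by cuboid size and available indexes.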


What to Materialize?

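• Store in warehouse results useful for common queries
• Example (candidate aggregates to materialize):

day 1:
     c1   c2   c3
p1   12        50
p2   11    8

day 2:
     c1   c2   c3
p1   44    4
p2

Summed over date:
     c1   c2   c3
p1   56    4   50
p2   11    8

Summed over date and product:
c1   c2   c3
67   12   50

Summed over date and city:
p1   110
p2   19

Total sales: 129

. . .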


Materialization Factors

• Type/frequency of queries
• Query response time
• Storage cost
• Update cost

Will study a concrete algorithm later


Iceberg Cube

• Computing only the cuboid cells whose count or other aggregate satisfies a condition like
  HAVING COUNT(*) >= minsup
• Motivation
  • Only a small portion of cube cells may be “above the water” in a sparse cube
  • Only calculate “interesting” cells: data above a certain threshold
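The iceberg condition can be illustrated by counting every group-by cell and keeping those with COUNT(*) >= minsup. This is a deliberately naive sketch with hypothetical names: it counts all cells and then filters, whereas dedicated iceberg-cube algorithms avoid computing the cells below the threshold in the first place.

```python
from collections import Counter
from itertools import combinations


def iceberg_cube(rows, n_dims, minsup):
    """Return {group-by positions: {cell: count}} keeping counts >= minsup."""
    result = {}
    for k in range(n_dims + 1):
        for group_by in combinations(range(n_dims), k):
            counts = Counter(tuple(row[d] for d in group_by) for row in rows)
            kept = {cell: c for cell, c in counts.items() if c >= minsup}
            if kept:
                result[group_by] = kept
    return result


# (product, city, date) of the six sale records used throughout these notes.
rows = [
    ("p1", "c1", 1), ("p2", "c1", 1), ("p1", "c3", 1),
    ("p2", "c2", 1), ("p1", "c1", 2), ("p1", "c2", 2),
]
iceberg = iceberg_cube(rows, n_dims=3, minsup=3)
```

With minsup = 3, only four cells survive: the grand total, product p1, city c1, and date 1. Every two- and three-dimensional cell falls below the threshold.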
