
Post on 01-Jan-2016


Page 1: Hongwei Zhao, Xiaojun Ye MOLAP on Cloud Interactive, Cluster Data Warehouse Tsinghua University hwzhao73@gmail.com, yexj@mail.tsinghua.edu.cn

Hongwei Zhao, Xiaojun Ye

MOLAP on Cloud

Interactive, Cluster Data Warehouse

Tsinghua University hwzhao73@gmail.com, yexj@mail.tsinghua.edu.cn

Page 2:

Motivation

Extend the cube model to support OLAP operations on Big Data:
» OLAP operations
» Interactive queries

Page 3:

Outline

Cube modelling

Building and querying

Experimenting

Page 4:

Data Transform for Cube

TPC-DS tables → Star views → Cube data → User queries

Page 5:

A Simplified Cube Model

[Diagram: a Cube Instance holds Cube Metadata, Dimension Instances, and Cuboid Instances. Each Dimension Instance maps Keys to Dimension Members; each Cuboid Instance maps Keys to Measure Cells.]

Cuboid lattice over dimensions A, B, C: ABC; AB, AC, BC; A, B, C; *

[Diagram labels: Base Cuboids, Result]
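The lattice on this slide is the set of all subsets of the dimensions. A minimal sketch of how it could be enumerated follows; the function name `cuboid_lattice` is hypothetical and not taken from the slides.

```python
from itertools import combinations

def cuboid_lattice(dimensions):
    """List every cuboid, from the one with all dimensions down to the apex '*'."""
    names = list(dimensions)
    lattice = []
    # Walk from the full dimension set down to the empty set.
    for size in range(len(names), -1, -1):
        for combo in combinations(names, size):
            lattice.append("".join(combo) if combo else "*")
    return lattice

print(cuboid_lattice("ABC"))
# ['ABC', 'AB', 'AC', 'BC', 'A', 'B', 'C', '*']
```

With n dimensions the lattice has 2^n cuboids, which is why only some of them are materialized ahead of time and the others are built on demand.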

Page 6:

Example: TPC-DS Query 7

select i_item_id,
       avg(ss_quantity) agg1,
       avg(ss_list_price) agg2,
       avg(ss_coupon_amt) agg3,
       avg(ss_sales_price) agg4
from store_sales, customer_demographics, date_dim, item, promotion
where ss_sold_date_sk = d_date_sk
  and ss_item_sk = i_item_sk
  and ss_cdemo_sk = cd_demo_sk
  and ss_promo_sk = p_promo_sk
  and cd_gender = '[GEN]'
  and cd_marital_status = '[MS]'
  and cd_education_status = '[ES]'
  and (p_channel_email = 'N' or p_channel_event = 'N')
  and d_year = [YEAR]
group by i_item_id
order by i_item_id

Page 7:

Relation Schema

[Diagram: Store Sales linked to Date Dim, Item, Promotion, and Customer Demographics]

Page 8:

Converting to BitKey

Fact table:

Dimension A   Dimension B   Dimension C   Measure
A1            B1            C1            M1
A2            B1            C2            M2
A3            B2            C2            M3

Member encoding, in order of appearance:

Dimension Member   BitKey   Dimension Mask
A1                 000001   000001
B1                 000010   000010
C1                 000100   000100
A2                 001000   001001
B1                 000010   000010
C2                 010000   010100

Resulting cells:

BitKey   Value
000111   M1
011010   M2

[Diagram labels: Fact1, Fact2, Intermediate, Result1, Result2]
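The table above is consistent with one particular encoding scheme, sketched below as an assumption rather than as the authors' implementation: each dimension numbers its members from 1 in order of appearance, writes that number in binary across the bit positions it owns, and claims the next free global bit whenever the number no longer fits. The class name `BitKeyEncoder` and its methods are hypothetical.

```python
class BitKeyEncoder:
    """Assigns bit keys to dimension members, growing each dimension's mask on demand."""

    def __init__(self):
        self.next_bit = 0    # next unused global bit position
        self.dim_bits = {}   # dimension -> list of bit positions it owns
        self.ordinals = {}   # dimension -> {member: ordinal starting at 1}

    def member_key(self, dim, member):
        ordinals = self.ordinals.setdefault(dim, {})
        bits = self.dim_bits.setdefault(dim, [])
        if member not in ordinals:
            ordinals[member] = len(ordinals) + 1
        ordinal = ordinals[member]
        # Claim new global bits until the ordinal fits in this dimension's bits.
        while ordinal >= (1 << len(bits)):
            bits.append(self.next_bit)
            self.next_bit += 1
        key = 0
        for i, pos in enumerate(bits):
            if (ordinal >> i) & 1:
                key |= 1 << pos
        return key

    def dim_mask(self, dim):
        mask = 0
        for pos in self.dim_bits.get(dim, []):
            mask |= 1 << pos
        return mask

    def fact_key(self, row):
        """row maps dimension name -> member; the cell key is the OR of member keys."""
        key = 0
        for dim, member in row.items():
            key |= self.member_key(dim, member)
        return key


enc = BitKeyEncoder()
facts = [
    {"A": "A1", "B": "B1", "C": "C1"},
    {"A": "A2", "B": "B1", "C": "C2"},
    {"A": "A3", "B": "B2", "C": "C2"},
]
keys = [enc.fact_key(row) for row in facts]
print([format(k, "06b") for k in keys])
```

Under this reading the first two keys come out as 000111 and 011010, matching the slide. Keys issued earlier stay valid when a mask grows, because the newly claimed bits are zero for every earlier member.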

Page 9:

Cube Storage

[Diagram: Table, Region, ColumnFamily, Row, Column, Version, Value, Cell]

One table for dimension instance storage:

Row Key         Dimension Name
Column Family   Default
Column          Member BitKey
Value           Member Value

Multiple tables for cuboid instances:

Table Name      Cuboid Name
Row Key         Cell BitKey
Column Family   Default
Column          Measure Name
Value           Measure Value
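To make the two layouts concrete, here is a small in-memory sketch that mimics the row key / column family / column / value nesting with plain dictionaries. It does not use any real database client; `new_table`, `put`, and `get` are hypothetical helpers, and the measure value 10.0 is a made-up stand-in for M1.

```python
from collections import defaultdict

def new_table():
    """Nested mapping: row key -> column family -> column -> value."""
    return defaultdict(lambda: defaultdict(dict))

def put(table, row, column, value, family="default"):
    table[row][family][column] = value

def get(table, row, column, family="default"):
    return table[row][family].get(column)

# One table for all dimension instances: row key is the dimension name,
# each column is a member bit key, each value is the member.
dimension_table = new_table()
put(dimension_table, "Dimension A", 0b000001, "A1")
put(dimension_table, "Dimension A", 0b001000, "A2")
put(dimension_table, "Dimension B", 0b000010, "B1")

# One table per cuboid: row key is the cell bit key,
# each column is a measure name, each value is the measure.
cuboid_tables = {"Cuboid_ABC": new_table()}
put(cuboid_tables["Cuboid_ABC"], 0b000111, "Mea_count", 1)
put(cuboid_tables["Cuboid_ABC"], 0b000111, "Mea_sum", 10.0)  # stand-in for M1

print(get(dimension_table, "Dimension A", 0b001000))
print(get(cuboid_tables["Cuboid_ABC"], 0b000111, "Mea_sum"))
```

Keeping one table per cuboid means a query only scans the rows of the cuboid it needs, while the single dimension table stays small enough to load in full.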

Page 10:

MDX for Query 7

select { i_item_id } on rows,
       { avg(ss_quantity), avg(ss_list_price), avg(ss_coupon_amt), avg(ss_sales_price) } on columns
from store_sales_cube
where (cd_gender.[Male], cd_marital_status.[Single], cd_education_status.[College], d_year.[2000])

Page 11:

Cube Implementation

Base cuboid building with 4 stages:
• Dimension constructing
• Hive query
• Aggregation
• Saving

Query execution with 4 stages:
• Loading dimension
• Other cuboid constructing
• Mapping
• Reducing
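The aggregation stage can be pictured as folding fact rows that share a bit key into one cell. The sketch below assumes each cell keeps a count and a sum (the deck's storage example uses Mea_count and Mea_sum columns) so that an average can be derived later; the function names and the sample measures are hypothetical.

```python
from collections import defaultdict

def aggregate(fact_rows):
    """Fold (bitkey, measure) rows into cells of [count, sum]."""
    cells = defaultdict(lambda: [0, 0.0])
    for key, measure in fact_rows:
        cell = cells[key]
        cell[0] += 1
        cell[1] += measure
    return dict(cells)

def average(cell):
    """Derive avg from the stored count and sum."""
    count, total = cell
    return total / count

# Made-up measures; two rows fall into the same cell 000111.
rows = [(0b000111, 4.0), (0b011010, 6.0), (0b000111, 8.0)]
cells = aggregate(rows)
print(average(cells[0b000111]))  # 6.0
```

Storing count and sum instead of the average itself keeps cells mergeable: two partial cells combine by adding their counts and sums, which a stored average would not allow.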

Page 12:

OLAP System

[Diagram labels: Engine (Dispatcher Node, Worker Nodes, cached data); Columnar Database (Master Node, Region Nodes, cube data)]

Cluster Framework

Dispatcher Node, Worker Nodes
• Dynamically distribute cube data onto worker nodes
• Parallelize OLAP operations into a concurrent model

Page 13:

Actor of Akka
• State
• Behavior
• Mailbox
• Lifecycle
• Fault tolerance

Page 14:

Actors for Query

[Diagram: Execute Query passes through the Query Dispatcher, Cuboid Manager, Dimension Manager, Mapper, and Reducer in steps 1 to 4; message labels: require, Cuboid ready, Dimension load, data ready, Extract Query, Hit Cell]

• Load dimension members
• Build other cuboids
• Mapping
• Reducing

Page 15:

Compiling & Mapping

Query 7 Condition: GEN=M and MS=S and ES=College and YEAR=2000

Dimension   Mask        Member    BitKey
GEN         000000011   Male      000000010
MS          000011100   Single    000001100
ES          001100000   College   001000000
YEAR        110000000   2000      010000000

Mask:      111111111
FilterKey: 011001110

[Diagram: Query Dispatcher, Mapper1, Mapper2, Mapper3]

For each cell in mapper {
    If ((key & mask) == FilterKey)
        Send to Reducer
}
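The compile and map steps can be sketched as follows, using the masks and member keys listed on this slide. The function names and the sample cells are hypothetical; in particular, placing the i_item_id bits above the nine filter bits is an assumption made only to build test data.

```python
def compile_filter(conditions):
    """OR the per-dimension masks and member keys into one mask and one filter key."""
    mask = 0
    filter_key = 0
    for dim_mask, member_key in conditions:
        mask |= dim_mask
        filter_key |= member_key
    return mask, filter_key

def map_cells(cells, mask, filter_key):
    """Keep the cells whose masked key equals the filter key."""
    return [(key, value) for key, value in cells if (key & mask) == filter_key]

QUERY7 = [
    (0b000000011, 0b000000010),  # GEN  = Male
    (0b000011100, 0b000001100),  # MS   = Single
    (0b001100000, 0b001000000),  # ES   = College
    (0b110000000, 0b010000000),  # YEAR = 2000
]
mask, filter_key = compile_filter(QUERY7)
print(format(mask, "09b"), format(filter_key, "09b"))  # 111111111 011001110

# Hypothetical cells: item bits sit above bit 8, measures are made up.
cells = [
    ((1 << 9) | 0b011001110, 3.0),  # item 1, matches every condition
    ((2 << 9) | 0b011001110, 5.0),  # item 2, matches every condition
    ((1 << 9) | 0b011001101, 7.0),  # item 1, different GEN member
]
hits = map_cells(cells, mask, filter_key)
print(len(hits))  # 2
```

The parentheses around `key & mask` matter when porting the slide's pseudocode: in C, Java, and Scala the equality operator binds more tightly than `&`, so the unparenthesized form would not compute the intended test.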

Page 16:

Query Execution

[Diagram: the Master exchanges messages and results with three Workers; the Workers are paired with Region 1 to 3 and Cache 1 to 3]

• Master sends task messages to workers
• Each worker caches the data of its region
• Subsequent tasks reuse the cached data

The first query on 1G takes 48 secs; the following queries with various parameters take 2.4 secs.
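The caching behaviour described here can be illustrated with a toy worker that reads its region once and answers later tasks from memory. This is only a sketch: `Worker`, `fake_region`, and the `loads` counter are hypothetical, and a plain Python object stands in for what the slides describe as an Akka actor.

```python
class Worker:
    """Reads its region on the first task and reuses the cached cells afterwards."""

    def __init__(self, region_id, load_region):
        self.region_id = region_id
        self._load_region = load_region  # hypothetical callable that reads a region
        self._cache = None
        self.loads = 0                   # how many times the region was actually read

    def cells(self):
        if self._cache is None:
            self._cache = self._load_region(self.region_id)
            self.loads += 1
        return self._cache

    def run(self, mask, filter_key):
        """Handle one task message: filter the cached cells."""
        return [(k, v) for k, v in self.cells() if (k & mask) == filter_key]


def fake_region(region_id):
    # Stand-in for reading one region of a cuboid table.
    return [(0b01, region_id), (0b10, region_id * 10)]

worker = Worker(1, fake_region)
first = worker.run(0b11, 0b01)   # triggers the region load
second = worker.run(0b11, 0b10)  # served from the cache
print(first, second, worker.loads)
```

Only the first task pays the loading cost, which is the effect behind the reported drop from 48 secs to 2.4 secs, a factor of 20.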

Page 17:

Experiments On TPC-DS

[Chart: fact records vs. cube cells for 1G, 10G, 100G; y-axis 0 to 300,000,000]

                   1G          10G          100G
records number     2,653,108   26,532,571   265,325,821
cube cell number   1,836,162   10,190,922   41,892,286

4 nodes:
• 2*Intel Xeon CPU E5-2630
• 4*600G 15000 rpm SAS
• 256G RAM
• 10Gb Network

Dimensions: 1. "i_item_id", 2. "cd_gender", 3. "cd_marital_status", 4. "cd_education_status", 5. "p_channel_email", 6. "p_channel_event", 7. "d_year"

Measures: 8. ss_quantity_avg, 9. ss_list_price_avg, 10. ss_coupon_amt_avg, 11. ss_sales_price_avg
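A quick calculation on the table above shows how much the cube compresses the fact data at each scale. The numbers are copied from the slide; the variable names are arbitrary.

```python
records = {"1G": 2_653_108, "10G": 26_532_571, "100G": 265_325_821}
cells = {"1G": 1_836_162, "10G": 10_190_922, "100G": 41_892_286}

# Cube cells per fact record at each data size.
ratios = {size: cells[size] / records[size] for size in records}
for size, ratio in ratios.items():
    print(f"{size}: {ratio:.2f} cells per fact record")
```

The ratio falls from about 0.69 at 1G to about 0.38 at 10G and about 0.16 at 100G, so the cell count grows much more slowly than the fact table.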

Page 18:

Build Cube for Query 7

[Chart: running time (seconds, 0 to 6000) by TPC-DS data size (1G, 10G, 100G), split into querying, aggregating, and saving]

• Partition by the largest dimension (i_item_id)
• In-memory aggregation
• Saving stage can be ignored (cache)

Page 19:

Execute Query 7

[Chart: running time (seconds, 0 to 400) over iterations 1 to 5, for 4 workers, 8 workers, and 16 workers]

First execution on the cube includes:
• Dimension loading
• Other cuboids construction
• Caching
• Mapping
• Reducing

Subsequent executions include:
• Mapping
• Reducing

Page 20:

Hive Query for Fact Data

select p_channel_email, p_channel_event, cd_gender, cd_marital_status, cd_education_status,
       i_item_id, d_year, ss_quantity, ss_list_price, ss_coupon_amt, ss_sales_price
from store_sales
join date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
join item on (store_sales.ss_item_sk = item.i_item_sk)
join customer_demographics on (store_sales.ss_cdemo_sk = customer_demographics.cd_demo_sk)
join promotion on (store_sales.ss_promo_sk = promotion.p_promo_sk)

Page 21:

Compare with Hive

[Two charts: hive vs. prototype for 1G, 10G, 100G; y-axis 0 to 1400]

First query time comparison: 2-3X

Subsequent execution time: 30-50X

Page 22:

Future work
• Cube Model:
    • Demand-driven & Data-driven
• Cube Data:
    • Model-driven & Requirement-driven
• More experiments on TPC-DS queries
    • Report, ad hoc, iterative, data mining
• MDX/XMLA compliance

Page 23:

Thanks.

Page 24:

Storage for Example

Table: Dimension

Row Key       Column Family: default
Dimension A   Mask     000001   001000   001001
              001001   A1       A2       A3
Dimension B   Mask     000010   100000
              100010   B1       B2

Table: Cuboid_ABC

Row Key   Column Family: default
000111    Mea_count   Mea_sum
          1           M1
011010    Mea_count   Mea_sum
          1           M2
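Given the masks stored in the Dimension table, a coarser cuboid can be derived from Cuboid_ABC by masking away the dropped dimensions and merging cells that collapse onto the same key. The sketch below is an illustration under that assumption, not the authors' code; `roll_up` is a hypothetical name, and 10.0 and 20.0 are made-up stand-ins for M1 and M2.

```python
MASK_A = 0b001001  # from the Dimension table
MASK_B = 0b100010

# Cuboid_ABC cells as (Mea_count, Mea_sum); the sums are stand-ins for M1 and M2.
cuboid_abc = {
    0b000111: (1, 10.0),
    0b011010: (1, 20.0),
}

def roll_up(cells, keep_mask):
    """Project each key onto the kept dimensions and merge counts and sums."""
    out = {}
    for key, (count, total) in cells.items():
        projected = key & keep_mask
        prev_count, prev_total = out.get(projected, (0, 0.0))
        out[projected] = (prev_count + count, prev_total + total)
    return out

cuboid_b = roll_up(cuboid_abc, MASK_B)
cuboid_ab = roll_up(cuboid_abc, MASK_A | MASK_B)
print({format(k, "06b"): v for k, v in cuboid_b.items()})
print({format(k, "06b"): v for k, v in cuboid_ab.items()})
```

Both ABC cells carry member B1, so they merge into a single cell 000010 in cuboid B, while in cuboid AB they stay apart as 000011 and 001010.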