Efficient OLAP Query Processing for Distributed Data Warehouses

Michael O. Akinde, SMHI, Sweden & NDB, Aalborg University, Denmark

Michael H. Böhlen, NDB, Aalborg University, Denmark

Theodore Johnson, AT&T Labs-Research, USA

Laks V. S. Lakshmanan, University of British Columbia, Canada

Divesh Srivastava, AT&T Labs-Research, USA

Michael O. Akinde, EDBT’2002, March 24-28, Prague

Motivation

Analysis of network data
  • Collect, correlate, and analyze data across the network
  • Huge amounts of decentrally collected data
  • Complex OLAP operations
  • Performed using ad-hoc Perl scripts

Experiment: OLAP technology
  • Pro: Improves specification, performance
  • Con: Expensive (or data loss) when centralized

Existing centralized OLAP tools are inadequate

Solution: Distributed Data Warehouse

  • Local DW at each collection point (e.g., router)
  • Compute queries across multiple DWs
  • A technology is needed for distributed processing of complex OLAP queries

[Figure: Applications (network administrators and control systems) send queries to a coordinator, which works with several local DWs, each placed close to a data collection point (source).]

Complex OLAP Queries

Examples:
  • Network usage: For each IP address, what fraction of the total number of flows is due to web traffic?
  • Principal components: On an hourly basis, what fraction of the total traffic is from IP subnets whose total hourly traffic is within 10% of the maximum?
  • Pattern identification: Break down all flows recorded on US election day by all possible combinations of source AS, destination AS, and protocol

Diverse OLAP queries, involving pivots, correlations, and multiple levels of aggregation

Skalla

Translates OLAP queries, expressed in an extended algebra, into distributed query evaluation plans

Salient features:
  • Efficiently handles a significant variety of complex OLAP queries (incl. pivots, correlations, etc.)
  • Only partial results are shipped between the sites and the coordinator, never subsets of the detail data
  • No site-to-site communication

Extended Algebra: GMDJ

Algebraic OLAP operator [Chatziantoniou et al., 2001]

Salient feature: Splits grouping and aggregation

MD(B, R, l, θ)
  • B is the base-values table (the “groups”)
  • R is the detail table (fact data)
  • l is the list of aggregate functions
  • θ is a possibly complex condition over B and R describing what fact data is to be aggregated

Result: The table B extended with the aggregates in l
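
The operator described above can be illustrated with a minimal in-memory sketch. This is not the Skalla implementation; the function name `gmdj`, the row representation (Python dicts), and the toy columns `ip` and `bytes` are all assumptions made for illustration. Each aggregate carries its own condition (called `theta` in the code), and the result is the base table with one extra column per aggregate.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
# One aggregate: (output column, function over the matching detail rows, condition theta(b, r)).
Aggregate = Tuple[str, Callable[[List[Row]], object], Callable[[Row, Row], bool]]


def gmdj(base: List[Row], detail: List[Row], aggregates: List[Aggregate]) -> List[Row]:
    """Return a copy of `base` extended with one column per aggregate (hypothetical helper)."""
    result = []
    for b in base:
        out = dict(b)  # copy, so the base-values table itself is not modified
        for name, agg, theta in aggregates:
            matching = [r for r in detail if theta(b, r)]
            out[name] = agg(matching)
        result.append(out)
    return result


# Toy data (illustrative only).
base = [{"ip": "1.2.0"}, {"ip": "2.5.0"}]
flows = [
    {"ip": "1.2.0", "bytes": 10},
    {"ip": "1.2.0", "bytes": 30},
    {"ip": "2.5.0", "bytes": 5},
]

same_ip = lambda b, r: b["ip"] == r["ip"]
extended = gmdj(
    base,
    flows,
    [
        ("cnt", len, same_ip),
        ("total_bytes", lambda rows: sum(r["bytes"] for r in rows), same_ip),
    ],
)
print(extended)
```

Because the groups (the base table) are supplied separately from the aggregation conditions, the same base table can be extended again by a later GMDJ with a different condition.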

Extended Algebra: Example

For each IP address, what fraction of the total number of flows is due to web traffic?

MD( MD(IPT, Flows, (Cnt1), (IPT.key = Flows.key)),
    Flows,
    (Cnt2),
    (IPT.key = Flows.key and Flows.Source = WEB) )

  • Result of inner GMDJ: (IPT, Cnt1)
  • Result of outer GMDJ: (IPT, Cnt1, Cnt2)
  • Sequences of GMDJs instead of multiple aggregate-join expressions
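
The nested expression above can be traced on toy data. The sketch below assumes a count-only helper, `gmdj_count`, and invented flow records; neither comes from the original slides. The inner call adds Cnt1 (all flows per IP), the outer call adds Cnt2 (web flows per IP), and the requested fraction is Cnt2 / Cnt1.

```python
def gmdj_count(base, detail, name, theta):
    """Count-only GMDJ (hypothetical helper): add column `name` with the
    number of detail rows that satisfy theta for each base row."""
    return [{**b, name: sum(1 for r in detail if theta(b, r))} for b in base]


# Toy stand-ins for the IPT and Flows tables (illustrative values only).
ipt = [{"key": "1.2.0"}, {"key": "2.5.0"}]
flows = [
    {"key": "1.2.0", "source": "WEB"},
    {"key": "1.2.0", "source": "FTP"},
    {"key": "1.2.0", "source": "WEB"},
    {"key": "2.5.0", "source": "FTP"},
]

# Inner GMDJ: (IPT, Cnt1)
inner = gmdj_count(ipt, flows, "Cnt1", lambda b, r: b["key"] == r["key"])
# Outer GMDJ: (IPT, Cnt1, Cnt2)
outer = gmdj_count(
    inner, flows, "Cnt2",
    lambda b, r: b["key"] == r["key"] and r["source"] == "WEB",
)

# Final projection: fraction of flows that is web traffic (0.0 if an IP has no flows).
fractions = {
    row["key"]: (row["Cnt2"] / row["Cnt1"] if row["Cnt1"] else 0.0) for row in outer
}
print(outer)
print(fractions)
```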

Skalla Architecture & Evaluation

Skalla evaluation rounds:
  • Computation of GMDJ at the local DWs
  • Synchronize sub-results at coordinator

[Figure: Skalla architecture. Applications talk to the coordinator (query engine and mediator), which computes distributed query plans and synchronizes sub-results. A site wrapper at each local DW administers the local GMDJ queries over that site's source.]

Skalla Evaluation: Example

For each IP address, what fraction of the total number of flows is due to web traffic?

[Figure: the coordinator and the local DWs at sites S1 and S2.]

Build Groups (coordinator):

IP
1.2.0
2.5.0
xxxx

First round: Distribute, Compute Aggregates

S1:
IP      Cnt1
1.2.0   75
2.5.0   0
xxxx    xxxx

S2:
IP      Cnt1
1.2.0   0
2.5.0   111
xxxx    xxxx

Synchronize (coordinator):
IP      Cnt1
1.2.0   75
2.5.0   111
xxxx    xxxx

Second round: Distribute, Compute Aggregates

S1:
IP      Cnt1    Cnt2
1.2.0   75      34
2.5.0   111     0
xxxx    xxxx    xxxx

S2:
IP      Cnt1    Cnt2
1.2.0   75      0
2.5.0   111     53
xxxx    xxxx    xxxx

Synchronize result (coordinator):
IP      Cnt1    Cnt2
1.2.0   75      34
2.5.0   111     53
xxxx    xxxx    xxxx
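
The two rounds in this example can be sketched as a small simulation. The function names (`local_count`, `synchronize`, `evaluate`), the flow records, and the choice to merge counts by summing per-site sub-results are assumptions for illustration; summing is consistent with the numbers on the slide (75 + 0 = 75) but the slides do not spell out the merge rule.

```python
def local_count(base, local_flows, theta):
    """Site side: sub-aggregate for every group, using only this site's flows."""
    return {b["ip"]: sum(1 for r in local_flows if theta(b, r)) for b in base}


def synchronize(base, name, sub_results):
    """Coordinator side: merge the per-site counts by summing them per group."""
    return [{**b, name: sum(sub[b["ip"]] for sub in sub_results)} for b in base]


def evaluate(base, sites, rounds):
    """One round per GMDJ: distribute the base table, compute locally, synchronize."""
    for name, theta in rounds:
        sub_results = [local_count(base, flows, theta) for flows in sites]
        base = synchronize(base, name, sub_results)
    return base


# Toy detail data held at two sites (illustrative only); no site sees the other's flows.
s1 = [{"ip": "1.2.0", "web": True}, {"ip": "1.2.0", "web": False}]
s2 = [{"ip": "2.5.0", "web": True}, {"ip": "1.2.0", "web": True}]

rounds = [
    ("Cnt1", lambda b, r: b["ip"] == r["ip"]),
    ("Cnt2", lambda b, r: b["ip"] == r["ip"] and r["web"]),
]
result = evaluate([{"ip": "1.2.0"}, {"ip": "2.5.0"}], [s1, s2], rounds)
print(result)
```

Only the base table and the per-group sub-results cross the site boundary in this sketch; the flow records themselves stay where they were collected.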

Skalla Evaluation: Features

Each round of computation in the distributed query evaluation computes a single GMDJ.

Features of the evaluation:
  • The semantics of the query plans ensure that the amount of data shipped by the algorithm depends on the number of groups and aggregate functions, and is independent of the size of the fact relation in the database.
  • The algorithm permits a wide variety of optimizations
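
A back-of-the-envelope sketch of the first feature follows. It rests on simplifying assumptions that are not in the slides: no optimizations are applied, every round sends one tuple per group to each site and receives one tuple per group back, and tuples are counted rather than bytes. The function name `tuples_shipped` is hypothetical.

```python
def tuples_shipped(num_sites: int, num_groups: int, num_rounds: int) -> int:
    """Tuples moved over the network under the simplifying assumptions stated above."""
    per_round = num_sites * num_groups * 2  # base table down, sub-results back up
    return per_round * num_rounds


# The size of the fact relation is not a parameter at all, so growing the
# detail data by a factor of 100 leaves the estimate unchanged.
small_db = {"fact_rows": 1_000_000, "shipped": tuples_shipped(4, 1000, 2)}
large_db = {"fact_rows": 100_000_000, "shipped": tuples_shipped(4, 1000, 2)}
print(small_db["shipped"], large_db["shipped"])
```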

Optimizations: Group Reduction

During processing, we only ship data that has actually been changed

Example:
  • Query: For each IP address, what fraction of the total number of flows is due to web traffic?
  • Each local DW receives a base-values table containing all source data
  • Coordinator has a copy of the base-values table
  • Local DWs ship only those tuples back that have actually been changed
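
A minimal sketch of this optimization for a count aggregate is shown below. It assumes that "changed" means the sub-aggregate differs from its initial value of 0, and that the coordinator fills in unchanged groups from its own copy of the base-values table; the function names and flow records are invented for illustration.

```python
def site_round(base, local_flows, theta):
    """Site side: compute the count per group, but ship back only the groups
    whose value changed from the initial 0 (assumed meaning of 'changed')."""
    changed = {}
    for b in base:
        count = sum(1 for r in local_flows if theta(b, r))
        if count != 0:
            changed[b["ip"]] = count
    return changed


def coordinator_merge(base, name, shipped):
    """Coordinator side: start from its own copy of the base table (all zeros)
    and add whatever each site shipped."""
    merged = {b["ip"]: 0 for b in base}
    for changed in shipped:
        for ip, count in changed.items():
            merged[ip] += count
    return [{**b, name: merged[b["ip"]]} for b in base]


base = [{"ip": "1.2.0"}, {"ip": "2.5.0"}]
s1 = [{"ip": "1.2.0"}] * 75    # toy flows: 75 for 1.2.0 at the first site
s2 = [{"ip": "2.5.0"}] * 111   # toy flows: 111 for 2.5.0 at the second site

same_ip = lambda b, r: b["ip"] == r["ip"]
shipped = [site_round(base, flows, same_ip) for flows in (s1, s2)]
result = coordinator_merge(base, "Cnt1", shipped)
print(shipped)
print(result)
```

In this toy run each site ships one tuple instead of two, and the coordinator still reconstructs the full synchronized table.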

Group Reduction: Example

For each IP address, what fraction of the total number of flows is due to web traffic?

[Figure: the coordinator and the local DWs at sites S1 and S2.]

Build Groups (coordinator):

IP
1.2.0
2.5.0
xxxx

First round: Distribute, Compute Aggregates

S1:
IP      Cnt1
1.2.0   75
2.5.0   0
xxxx    xxxx

S2:
IP      Cnt1
1.2.0   0
2.5.0   111
xxxx    xxxx

Synchronize (coordinator):
IP      Cnt1
1.2.0   75
2.5.0   111
xxxx    xxxx

Second round: Distribute, Compute Aggregates

S1:
IP      Cnt1    Cnt2
1.2.0   75      34
2.5.0   111     0
xxxx    xxxx    xxxx

S2:
IP      Cnt1    Cnt2
1.2.0   75      0
2.5.0   111     53
xxxx    xxxx    xxxx

Synchronize result (coordinator):
IP      Cnt1    Cnt2
1.2.0   75      34
2.5.0   111     53
xxxx    xxxx    xxxx

Optimizations: Synchronization Reduction

It is possible to detect cases where no synchronization is required between passes.

Example:
  • DW data: All the flows of an autonomous system are always registered (stored) at a particular local DW
  • Query: For each IP address, what fraction of the total number of flows is due to web traffic?
  • Each IP address belongs to a particular autonomous system; i.e., all data for a particular IP address is located at the system storing the flows of its autonomous system
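
The idea can be sketched as follows, under the partitioning assumption from the example (all flows of an IP address live at exactly one site) and the additional assumption that the second condition does not read Cnt1. Each site then evaluates the whole chain of GMDJs locally and the coordinator merges once at the end. Function names and data are invented for illustration.

```python
def local_chain(base, local_flows, rounds):
    """Site side: evaluate every GMDJ of the chain locally, with no
    intermediate synchronization between the rounds."""
    rows = [dict(b) for b in base]
    for name, theta in rounds:
        for row in rows:
            row[name] = sum(1 for r in local_flows if theta(row, r))
    return rows


def synchronize_once(site_results, names):
    """Coordinator side: a single merge, summing each count column per IP."""
    merged = {}
    for rows in site_results:
        for row in rows:
            acc = merged.setdefault(row["ip"], {"ip": row["ip"], **{n: 0 for n in names}})
            for n in names:
                acc[n] += row[n]
    return list(merged.values())


base = [{"ip": "1.2.0"}, {"ip": "2.5.0"}]
# Toy partitioning: every flow of a given IP is stored at a single site.
s1 = [{"ip": "1.2.0", "web": True}, {"ip": "1.2.0", "web": True}, {"ip": "1.2.0", "web": False}]
s2 = [{"ip": "2.5.0", "web": True}, {"ip": "2.5.0", "web": False}]

rounds = [
    ("Cnt1", lambda b, r: b["ip"] == r["ip"]),
    ("Cnt2", lambda b, r: b["ip"] == r["ip"] and r["web"]),
]
site_results = [local_chain(base, flows, rounds) for flows in (s1, s2)]
result = synchronize_once(site_results, ["Cnt1", "Cnt2"])
print(result)
```

Compared with one synchronization per round, this sketch needs a single exchange with each site, at the price of relying on how the detail data is distributed.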

Synch Reduction: Example (1)

For each IP address, what fraction of the total number of flows is due to web traffic?

[Figure: the coordinator and the local DWs at sites S1 and S2.]

Build Groups (coordinator):

IP
1.2.0
2.5.0
xxxx

First round: Distribute, Compute Aggregates

S1:
IP      Cnt1
1.2.0   75
2.5.0   0
xxxx    xxxx

S2:
IP      Cnt1
1.2.0   0
2.5.0   111
xxxx    xxxx

Synchronize (coordinator):
IP      Cnt1
1.2.0   75
2.5.0   111
xxxx    xxxx

Second round: Distribute, Compute Aggregates

S1:
IP      Cnt1    Cnt2
1.2.0   75      34
2.5.0   111     0
xxxx    xxxx    xxxx

S2:
IP      Cnt1    Cnt2
1.2.0   75      0
2.5.0   111     53
xxxx    xxxx    xxxx

Synchronize result (coordinator):
IP      Cnt1    Cnt2
1.2.0   75      34
2.5.0   111     53
xxxx    xxxx    xxxx

Synch Reduction: Example (2)

For each IP address, what fraction of the total number of flows is due to web traffic?

[Figure: the coordinator and the local DWs at sites S1 and S2.]

Build Groups (coordinator):

IP
1.2.0
2.5.0
xxxx

Distribute, Compute Aggregates

S1, after the first GMDJ:
IP      Cnt1
1.2.0   75
2.5.0   0
xxxx    xxxx

S2, after the first GMDJ:
IP      Cnt1
1.2.0   0
2.5.0   111
xxxx    xxxx

S1, after the second GMDJ:
IP      Cnt1    Cnt2
1.2.0   75      34
2.5.0   0       0
xxxx    xxxx    xxxx

S2, after the second GMDJ:
IP      Cnt1    Cnt2
1.2.0   0       0
2.5.0   111     53
xxxx    xxxx    xxxx

Synchronize result (coordinator):
IP      Cnt1    Cnt2
1.2.0   75      34
2.5.0   111     53
xxxx    xxxx    xxxx

Experiment: Number of Sites (GR)

[Figure: Query evaluation time (high cardinality). x-axis: number of sites (0 to 10); y-axis: seconds (0 to 600). Series: No Group Reduction, Group Reduction.]

Experiment: Number of Sites (SR)

[Figure: Query evaluation time (high cardinality). x-axis: number of sites (2 to 8); y-axis: seconds (0 to 350). Series: No Synchronization Reduction, Synchronization Reduction.]

Experiments: Size of Database

[Figure: Query evaluation time. x-axis: database size, relative (0 to 5); y-axis: seconds (0 to 1000). Series: not optimized, optimized.]

Experiments: Cost Breakdown

[Figure: Query evaluation time breakdown. x-axis: database size (1 to 4); y-axis: seconds (0 to 600). Components: Communication, Client Compute, Server Compute.]

Conclusions

We develop a framework for evaluating complex OLAP queries on a distributed data warehouse
  • Efficient query plans that minimize data transfer over the network

Further work:
  • Additional developments of the architecture
  • Cost-based query optimization