Efficient OLAP Query Processing for Distributed Data Warehouses
Michael O. Akinde, SMHI, Sweden & NDB, Aalborg University, Denmark
Michael H. Böhlen, NDB, Aalborg University, Denmark
Theodore Johnson, AT&T Labs-Research, USA
Laks V. S. Lakshmanan, University of British Columbia, Canada
Divesh Srivastava, AT&T Labs-Research, USA
Michael O. Akinde, EDBT’2002 -- March 24-28, Prague

Motivation
• Analysis of network data
  • Collect, correlate, and analyze data across the network
  • Huge amounts of decentrally collected data
  • Complex OLAP operations
  • Performed using ad-hoc Perl scripts
• Experiment: OLAP technology
  • Pro: Improves specification, performance
  • Con: Expensive (or data loss) when centralized
• Existing centralized OLAP tools are inadequate

Solution: Distributed Data Warehouse
• Local DW at each collection point (e.g., router)
• Compute queries across multiple DWs
• A technology is needed for distributed processing of complex OLAP queries
[Figure: architecture diagram with three Source/DW pairs (DWs close to the data collection points), a Query Coordinator, and Applications (network administrators and control systems).]

Complex OLAP Queries
• Examples:
  • Network usage: For each IP address, what fraction of the total number of flows is due to web traffic?
  • Principal components: On an hourly basis, what fraction of the total traffic is from IP subnets whose total hourly traffic is within 10% of the maximum?
  • Pattern identification: Break down all flows recorded on US election day by all possible combinations of source AS, destination AS, and protocol
• Diverse OLAP queries, involving pivots, correlations, and multiple levels of aggregation

Skalla
• Translates OLAP queries, expressed in an extended algebra, into distributed query evaluation plans
• Salient features:
  • Efficiently handles a significant variety of complex OLAP queries (incl. pivots, correlations, etc.)
  • Only partial results are shipped between the sites and the coordinator, never subsets of the detail data
  • No site-to-site communication

Extended Algebra: GMDJ
• Algebraic OLAP operator [Chatziantoniou et al., 2001]
• Salient feature: Splits grouping and aggregation
• MD(B, R, l, θ)
  • B is the base-values table (the “groups”)
  • R is the detail table (fact data)
  • l is the list of aggregate functions
  • θ is a possibly complex condition over B and R describing what fact data is to be aggregated
• Result: The table B extended with the aggregates in l
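
The operator described above can be illustrated with a minimal in-memory sketch. This is not the Skalla implementation: the function name `md`, the row representation (Python dicts), and the `(name, initial value, step function)` encoding of an aggregate are assumptions made purely for illustration.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
# Hypothetical encoding of an aggregate: (output column, initial value, step function).
Aggregate = Tuple[str, object, Callable[[object, Row], object]]


def md(base: List[Row], detail: List[Row], aggs: List[Aggregate],
       theta: Callable[[Row, Row], bool]) -> List[Row]:
    """Return the base-values table extended with one column per aggregate."""
    result = []
    for b in base:
        out = dict(b)
        # Every group starts from the aggregate's initial value ...
        for name, init, _ in aggs:
            out[name] = init
        # ... and is updated by each detail row that satisfies theta(b, r).
        for r in detail:
            if theta(b, r):
                for name, _, step in aggs:
                    out[name] = step(out[name], r)
        result.append(out)
    return result


def count(name: str) -> Aggregate:
    """COUNT(*) expressed in the encoding above."""
    return (name, 0, lambda acc, r: acc + 1)


groups = [{"ip": "1.2.0"}, {"ip": "2.5.0"}]
flows = [{"ip": "1.2.0", "bytes": 10}, {"ip": "1.2.0", "bytes": 5}]
extended = md(groups, flows, [count("cnt")], lambda b, r: b["ip"] == r["ip"])
```

Note that every row of the base-values table appears in the result, including groups that match no detail rows; those simply keep the initial aggregate value.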

Extended Algebra: Example
For each IP address, what fraction of the total number of flows is due to web traffic?
  MD( MD(IPT, Flows, (Cnt1), (IPT.key = Flows.key)),
      Flows,
      (Cnt2),
      (IPT.key = Flows.key and Flows.Source = WEB) )
• Result of inner GMDJ: (IPT, Cnt1)
• Result of outer GMDJ: (IPT, Cnt1, Cnt2)
• Sequences of GMDJs instead of multiple aggregate-join expressions
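
As a rough illustration of the nested expression above, the following sketch evaluates the inner and outer GMDJ on a toy table. The simplified `md` helper (a single COUNT aggregate), the sample rows, and the final division `cnt2 / cnt1` are assumptions for illustration; the slide itself only shows the two GMDJs.

```python
def md(base, detail, name, theta):
    """Minimal GMDJ with one COUNT aggregate: extend each base row with the
    number of detail rows that satisfy theta(base_row, detail_row)."""
    return [dict(b, **{name: sum(1 for r in detail if theta(b, r))}) for b in base]


# Hypothetical sample data standing in for IPT and Flows.
ipt = [{"key": "1.2.0"}, {"key": "2.5.0"}]
flows = [
    {"key": "1.2.0", "source": "WEB"},
    {"key": "1.2.0", "source": "FTP"},
    {"key": "2.5.0", "source": "WEB"},
    {"key": "2.5.0", "source": "WEB"},
]

# Inner GMDJ: all flows per IP address (Cnt1).
inner = md(ipt, flows, "cnt1", lambda b, r: b["key"] == r["key"])

# Outer GMDJ: web flows per IP address (Cnt2), computed over the inner result.
outer = md(inner, flows, "cnt2",
           lambda b, r: b["key"] == r["key"] and r["source"] == "WEB")

# Final projection (assumed): fraction of flows due to web traffic.
fractions = {row["key"]: (row["cnt2"] / row["cnt1"] if row["cnt1"] else None)
             for row in outer}
```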

Skalla Architecture & Evaluation
• Skalla evaluation rounds:
  • Computation of GMDJ at the local DWs
  • Synchronize sub-results at coordinator
[Figure: Skalla architecture. Applications; the Skalla coordinator (query engine, mediator), which computes distributed query plans and synchronizes sub-results; three site wrappers, each administering local GMDJ queries for a DW fed by a source.]

Skalla Evaluation: Example
For each IP address, what fraction of the total number of flows is due to web traffic?

Build groups (coordinator):
  IP
  1.2.0
  2.5.0
  xxxx

Round 1: distribute, compute aggregates at the local DWs (S1, S2), synchronize:
  IP      Cnt1 at S1   Cnt1 at S2   Cnt1 synchronized
  1.2.0   75           0            75
  2.5.0   0            111          111
  xxxx    xxxx         xxxx         xxxx

Round 2: distribute, compute aggregates at the local DWs (S1, S2), synchronize result:
  IP      Cnt1   Cnt2 at S1   Cnt2 at S2   Cnt2 synchronized
  1.2.0   75     34           0            34
  2.5.0   111    0            53           53
  xxxx    xxxx   xxxx         xxxx         xxxx
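
The two evaluation rounds in this example can be sketched as a small simulation. The function names `local_count` and `synchronize`, the site contents, and the boolean `web` field are hypothetical; the sketch only assumes that sub-results for COUNT can be merged by summation at the coordinator.

```python
from collections import Counter


def local_count(base_keys, flows, pred):
    """Site side: count the local flows per group that satisfy pred."""
    hits = Counter(f["ip"] for f in flows if pred(f))
    return {k: hits.get(k, 0) for k in base_keys}


def synchronize(base_keys, partials):
    """Coordinator side: merge per-site sub-results by summing the counts."""
    return {k: sum(p.get(k, 0) for p in partials) for k in base_keys}


# Build groups at the coordinator (hypothetical data).
base_keys = ["1.2.0", "2.5.0"]
sites = {
    "S1": [{"ip": "1.2.0", "web": True}, {"ip": "1.2.0", "web": True},
           {"ip": "1.2.0", "web": False}],
    "S2": [{"ip": "2.5.0", "web": True}, {"ip": "2.5.0", "web": False},
           {"ip": "1.2.0", "web": False}],
}

# Round 1: every site computes Cnt1 locally, then the coordinator synchronizes.
round1 = [local_count(base_keys, fl, lambda f: True) for fl in sites.values()]
cnt1 = synchronize(base_keys, round1)

# Round 2: the same pattern for Cnt2 (web flows only).
round2 = [local_count(base_keys, fl, lambda f: f["web"]) for fl in sites.values()]
cnt2 = synchronize(base_keys, round2)

# Each site ships one value per group per round, however many flows it stores.
shipped_per_site = [len(p) for p in round1]
```

Only the per-group partial results cross the network; the detail rows in `sites` never leave their site.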

Skalla Evaluation: Features
• Each round of computation in the distributed query evaluation computes a single GMDJ.
• Features of the evaluation:
  • The semantics of the query plans ensure that the amount of data shipped by the algorithm depends on the number of groups and aggregate functions and is independent of the size of the fact relation in the database!
  • The algorithm permits a wide variety of optimizations

Optimizations: Group Reduction
• During processing, we only ship data that has actually been changed
• Example:
  • Query: For each IP address, what fraction of the total number of flows is due to web traffic?
  • Each local DW receives a base-values table containing all source data
  • Coordinator has a copy of base-values table
  • Local DWs ship back only those tuples that have actually been changed
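
A minimal sketch of the idea, under the assumption that "changed" means the aggregate differs from its initial value (0 for a count). The helper names `ship_changed` and `merge` are hypothetical; the numbers are the Cnt1 values from the example slides.

```python
def ship_changed(local_result, initial=0):
    """Site side: keep only the groups whose aggregate differs from its initial value."""
    return {k: v for k, v in local_result.items() if v != initial}


def merge(base_copy, shipped_parts):
    """Coordinator side: fold the shipped tuples into its own copy of the base-values table."""
    merged = dict(base_copy)
    for part in shipped_parts:
        for k, v in part.items():
            merged[k] += v
    return merged


# The coordinator keeps a copy of the base-values table with initial aggregates.
base_copy = {"1.2.0": 0, "2.5.0": 0}

# Local Cnt1 results at the two sites.
s1_local = {"1.2.0": 75, "2.5.0": 0}
s2_local = {"1.2.0": 0, "2.5.0": 111}

shipped = [ship_changed(s1_local), ship_changed(s2_local)]
cnt1 = merge(base_copy, shipped)
```

Here each site ships one tuple instead of two; the saving grows with the number of groups that a site does not touch.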

Group Reduction: Example
For each IP address, what fraction of the total number of flows is due to web traffic?

Build groups (coordinator):
  IP
  1.2.0
  2.5.0
  xxxx

Round 1: distribute, compute aggregates at the local DWs (S1, S2), synchronize:
  IP      Cnt1 at S1   Cnt1 at S2   Cnt1 synchronized
  1.2.0   75           0            75
  2.5.0   0            111          111
  xxxx    xxxx         xxxx         xxxx

Round 2: distribute, compute aggregates at the local DWs (S1, S2), synchronize result:
  IP      Cnt1   Cnt2 at S1   Cnt2 at S2   Cnt2 synchronized
  1.2.0   75     34           0            34
  2.5.0   111    0            53           53
  xxxx    xxxx   xxxx         xxxx         xxxx

Optimizations: Synchronization Reduction
• It is possible to detect cases where no synchronization is required between passes.
• Example:
  • DW data: All the flows of an autonomous system are always registered (stored) at a particular local DW
  • Query: For each IP address, what fraction of the total number of flows is due to web traffic?
  • Each IP address belongs to a particular autonomous system; i.e., all data for a particular IP address is located at the system storing the flows of its autonomous system
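
The condition in this example can be sketched as follows. The check below scans the data to see whether each IP address occurs at a single site, which is an assumption made for illustration; a real system would more plausibly derive this from knowledge about how the data is distributed. The names `groups_are_site_local` and `evaluate_locally`, and the sample rows, are hypothetical.

```python
def groups_are_site_local(site_flows, key="ip"):
    """True if no group key occurs at more than one site."""
    owner = {}
    for site, flows in site_flows.items():
        for f in flows:
            # setdefault records the first site seen for this key.
            if owner.setdefault(f[key], site) != site:
                return False
    return True


def evaluate_locally(flows):
    """Site side: compute Cnt1 and Cnt2 in one pass, with no coordinator
    round-trip between the two aggregates."""
    out = {}
    for f in flows:
        cnt1, cnt2 = out.get(f["ip"], (0, 0))
        out[f["ip"]] = (cnt1 + 1, cnt2 + (1 if f["web"] else 0))
    return out


site_flows = {
    "S1": [{"ip": "1.2.0", "web": True}, {"ip": "1.2.0", "web": False}],
    "S2": [{"ip": "2.5.0", "web": True}],
}

single_pass_ok = groups_are_site_local(site_flows)
result = {}
if single_pass_ok:
    for flows in site_flows.values():
        # Group keys are disjoint across sites, so a plain union suffices.
        result.update(evaluate_locally(flows))
```

When the check fails, the evaluation would fall back to one synchronization per GMDJ, as in the unoptimized rounds.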

Synch Reduction: Example (1)
For each IP address, what fraction of the total number of flows is due to web traffic?

Build groups (coordinator):
  IP
  1.2.0
  2.5.0
  xxxx

Round 1: distribute, compute aggregates at the local DWs (S1, S2), synchronize:
  IP      Cnt1 at S1   Cnt1 at S2   Cnt1 synchronized
  1.2.0   75           0            75
  2.5.0   0            111          111
  xxxx    xxxx         xxxx         xxxx

Round 2: distribute, compute aggregates at the local DWs (S1, S2), synchronize result:
  IP      Cnt1   Cnt2 at S1   Cnt2 at S2   Cnt2 synchronized
  1.2.0   75     34           0            34
  2.5.0   111    0            53           53
  xxxx    xxxx   xxxx         xxxx         xxxx

Synch Reduction: Example (2)
For each IP address, what fraction of the total number of flows is due to web traffic?

Build groups (coordinator):
  IP
  1.2.0
  2.5.0
  xxxx

Distribute, compute aggregates at the local DWs (S1, S2):
  IP      Cnt1 at S1   Cnt2 at S1   Cnt1 at S2   Cnt2 at S2
  1.2.0   75           34           0            0
  2.5.0   0            0            111          53
  xxxx    xxxx         xxxx         xxxx         xxxx

Synchronize result:
  IP      Cnt1   Cnt2
  1.2.0   75     34
  2.5.0   111    53
  xxxx    xxxx   xxxx

Experiment: Number of Sites (GR)
[Figure: Query evaluation time (high cardinality). x-axis: number of sites (0 to 10); y-axis: seconds (0 to 600). Series: No Group Reduction, Group Reduction.]

Experiment: Number of Sites (SR)
[Figure: Query evaluation time (high cardinality). x-axis: number of sites (2 to 8); y-axis: seconds (0 to 350). Series: No Synchronization Reduction, Synchronization Reduction.]

Experiments: Size of Database
[Figure: Query evaluation time. x-axis: database size, relative (0 to 5); y-axis: seconds (0 to 1000). Series: not optimized, optimized.]

Experiments: Cost Breakdown
[Figure: Query evaluation time breakdown. x-axis: database size (1 to 4); y-axis: seconds (0 to 600). Series: Communication, Client Compute, Server Compute.]

Conclusions
• We develop a framework for evaluating complex OLAP queries on a distributed data warehouse
  • Efficient query plans that minimize data transfer over the network
• Further work:
  • Additional developments of the architecture
  • Cost-based query optimization