Efficient OLAP Query Processing for Distributed Data Warehouses
Michael O. Akinde, SMHI, Sweden & NDB, Aalborg University, Denmark
Michael H. Böhlen, NDB, Aalborg University, Denmark
Theodore Johnson, AT&T Labs-Research, USA
Laks V. S. Lakshmanan, University of British Columbia, Canada
Divesh Srivastava, AT&T Labs-Research, USA
Michael O. Akinde, EDBT’2002 -- March 24-28, Prague
Motivation
• Analysis of network data
  • Collect, correlate, and analyze data across the network
  • Huge amounts of decentrally collected data
  • Complex OLAP operations
  • Performed using ad-hoc Perl scripts
• Experiment: OLAP technology
  • Pro: Improves specification, performance
  • Con: Expensive (or data loss) when centralized
• Existing centralized OLAP tools are inadequate
Solution: Distributed Data Warehouse
• Local DW at each collection point (e.g., router)
• Compute queries across multiple DWs
• A technology is needed for distributed processing of complex OLAP queries
[Figure: distributed data warehouse. Labels: Source, DW (DWs close to the data collection points), Coordinator, Query, Application (Network Administrators & Control Systems).]
Complex OLAP Queries
• Examples:
  • Network usage: For each IP address, what fraction of the total number of flows is due to web traffic?
  • Principal components: On an hourly basis, what fraction of the total traffic is from IP subnets whose total hourly traffic is within 10% of the maximum?
  • Pattern identification: Break down all flows recorded on US election day by all possible combinations of source AS, destination AS, and protocol
• Diverse OLAP queries, involving pivots, correlations, and multiple levels of aggregation
Skalla
• Translates OLAP queries, expressed in an extended algebra, into distributed query evaluation plans
• Salient features:
  • Efficiently handles a significant variety of complex OLAP queries (incl. pivots, correlations, etc.)
  • Only partial results are shipped between the sites and the coordinator -- never subsets of the detail data
  • No site-to-site communication
Extended Algebra: GMDJ
• Algebraic OLAP operator [Chatziantoniou et al., 2001]
• Salient feature: Splits grouping and aggregation
• MD(B, R, l, θ)
  • B is the base-values table (the “groups”)
  • R is the detail table (fact data)
  • l is the list of aggregate functions
  • θ: possibly complex condition over B and R describing what fact data is to be aggregated
• Result: The table B extended with the aggregates in l
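The operator definition above can be illustrated with a minimal in-memory sketch. This is not Skalla's implementation: the function name `md_join`, the row representation as dictionaries, and the `(name, initial value, step)` encoding of an aggregate are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
# Hypothetical encoding of an aggregate: (output column, initial value, fold step).
Aggregate = Tuple[str, object, Callable[[object, Row], object]]


def md_join(base: List[Row], detail: List[Row], aggs: List[Aggregate],
            theta: Callable[[Row, Row], bool]) -> List[Row]:
    """Extend every row of `base` with aggregates over the detail rows matching theta."""
    result = []
    for b in base:
        out = dict(b)  # keep the group columns of B
        for name, init, step in aggs:
            acc = init
            for r in detail:
                if theta(b, r):  # theta decides which fact rows feed this group
                    acc = step(acc, r)
            out[name] = acc
        result.append(out)
    return result


def count(name: str) -> Aggregate:
    """COUNT aggregate written in the encoding above."""
    return (name, 0, lambda acc, _row: acc + 1)


ipt = [{"ip": "1.2.0"}, {"ip": "2.5.0"}]
flows = [
    {"ip": "1.2.0", "source": "WEB"},
    {"ip": "1.2.0", "source": "FTP"},
    {"ip": "2.5.0", "source": "WEB"},
]
extended = md_join(ipt, flows, [count("cnt1")], lambda b, r: b["ip"] == r["ip"])
```

Because the groups come from B rather than from the detail table, a group with no matching fact rows still appears in the result, carrying the aggregate's initial value.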
Extended Algebra: Example
For each IP address, what fraction of the total number of flows is due to web traffic?
  MD( MD(IPT, Flows, (Cnt1), (IPT.key = Flows.key)),
      Flows,
      (Cnt2),
      (IPT.key = Flows.key and Flows.Source = WEB) )
• Result of inner GMDJ: (IPT, Cnt1)
• Result of outer GMDJ: (IPT, Cnt1, Cnt2)
• Sequences of GMDJs instead of multiple aggregate-join expressions
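As a rough illustration of the nested expression, the sketch below chains two counting passes and then divides Cnt2 by Cnt1 to obtain the fraction the query asks for. The helper `md_count` is a hypothetical GMDJ restricted to a single COUNT aggregate, and the sample flows are made up.

```python
def md_count(base, detail, name, theta):
    """Hypothetical GMDJ with one COUNT aggregate: add column `name` to each base row."""
    return [dict(b, **{name: sum(1 for r in detail if theta(b, r))}) for b in base]


ipt = [{"key": "1.2.0"}, {"key": "2.5.0"}]
flows = [
    {"key": "1.2.0", "source": "WEB"},
    {"key": "1.2.0", "source": "WEB"},
    {"key": "1.2.0", "source": "MAIL"},
    {"key": "2.5.0", "source": "MAIL"},
]

# Inner GMDJ: all flows per IP address.
inner = md_count(ipt, flows, "cnt1", lambda b, r: b["key"] == r["key"])
# Outer GMDJ: web flows per IP address, computed on top of the inner result.
outer = md_count(inner, flows, "cnt2",
                 lambda b, r: b["key"] == r["key"] and r["source"] == "WEB")

# Final projection: fraction of flows due to web traffic (skipping empty groups).
fractions = {row["key"]: row["cnt2"] / row["cnt1"] for row in outer if row["cnt1"]}
```

The outer call takes the inner result as its base-values table, which is what lets a query be written as a sequence of GMDJs.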
Skalla Architecture & Evaluation
• Skalla evaluation rounds:
  • Computation of GMDJ at the local DWs
  • Synchronize sub-results at coordinator
[Figure: Skalla architecture. Components: Application, Coordinator (Query Engine, Mediator), Site Wrapper, DW, Source. Annotations: "Administers local GMDJ queries"; "Computes distributed query plans"; "Synchronizes sub-results".]
Skalla Evaluation: Example
For each IP address, what fraction of the total number of flows is due to web traffic?
[Figure: evaluation between the coordinator and the local DWs at sites S1 and S2.]
• Build groups (coordinator):
    IP
    1.2.0
    2.5.0
    xxxx
• Distribute, compute aggregates (one sub-result per site, S1 and S2):
    IP     Cnt1        IP     Cnt1
    1.2.0  75          1.2.0  0
    2.5.0  0           2.5.0  111
    xxxx   xxxx        xxxx   xxxx
• Synchronize (coordinator):
    IP     Cnt1
    1.2.0  75
    2.5.0  111
    xxxx   xxxx
• Distribute, compute aggregates (one sub-result per site, S1 and S2):
    IP     Cnt1  Cnt2        IP     Cnt1  Cnt2
    1.2.0  75    34          1.2.0  75    0
    2.5.0  111   0           2.5.0  111   53
    xxxx   xxxx  xxxx        xxxx   xxxx  xxxx
• Synchronize result (coordinator):
    IP     Cnt1  Cnt2
    1.2.0  75    34
    2.5.0  111   53
    xxxx   xxxx  xxxx
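The two rounds in this example can be sketched as follows. All names (`local_count`, `synchronize`, `evaluation_round`, `make_flows`) are hypothetical, and the flows are generated so that the counts match the figures on the slide. Summing is assumed to be the right way to combine sub-results, which holds for COUNT but not for every aggregate.

```python
def local_count(base, flows, theta):
    """Site side of a round: count matching local flows for every group."""
    return {b["ip"]: sum(1 for f in flows if theta(b, f)) for b in base}


def synchronize(base, name, sub_results):
    """Coordinator side: combine the sites' sub-results (counts add up)."""
    return [dict(b, **{name: sum(sub[b["ip"]] for sub in sub_results)}) for b in base]


def evaluation_round(base, sites, name, theta):
    """One round: distribute the base table, compute locally, synchronize."""
    subs = [local_count(base, flows, theta) for flows in sites.values()]
    return synchronize(base, name, subs)


def make_flows(ip, web, other):
    """Generate illustrative flow records for one IP address."""
    return [{"ip": ip, "source": "WEB"}] * web + [{"ip": ip, "source": "OTHER"}] * other


sites = {"S1": make_flows("1.2.0", 34, 41), "S2": make_flows("2.5.0", 53, 58)}
base = [{"ip": "1.2.0"}, {"ip": "2.5.0"}]


def same_ip(b, f):
    return b["ip"] == f["ip"]


round1 = evaluation_round(base, sites, "cnt1", same_ip)
round2 = evaluation_round(round1, sites, "cnt2",
                          lambda b, f: same_ip(b, f) and f["source"] == "WEB")
```

Only the per-group dictionaries returned by `local_count` cross the network in this sketch; the flow records never leave their site.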
Skalla Evaluation: Features
• Each round of computation in the distributed query evaluation computes a single GMDJ.
• Features of the evaluation:
  • The semantics of the query plans ensure that the amount of data shipped by the algorithm depends on the number of groups and aggregate functions, and is independent of the size of the fact relation in the database!
  • The algorithm permits a wide variety of optimizations
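A small sketch can make the size argument concrete: a site's sub-result has one entry per group, however many fact rows were scanned to produce it. The function `sub_result` and the sample data are hypothetical.

```python
def sub_result(base_keys, flow_ips):
    """Count local flows per group; the result always has len(base_keys) entries."""
    counts = {k: 0 for k in base_keys}
    for ip in flow_ips:
        if ip in counts:
            counts[ip] += 1
    return counts


base_keys = ["1.2.0", "2.5.0", "9.9.9"]
small = sub_result(base_keys, ["1.2.0"] * 10)
large = sub_result(base_keys, ["1.2.0"] * 10_000 + ["2.5.0"] * 5_000)

# Both sites would ship three tuples, although one scanned 1500 times more facts.
shipped_small, shipped_large = len(small), len(large)
```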
Optimizations: Group Reduction
• During processing, we only ship data that has actually been changed
• Example:
  • Query: For each IP address, what fraction of the total number of flows is due to web traffic?
  • Each local DW receives a base-values table containing all source data
  • Coordinator has a copy of base-values table
  • Local DWs ship only those tuples back that have actually been changed
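One way to picture group reduction is sketched below. It assumes that "changed" means the local aggregate differs from its initial value (a count other than 0); that reading, and the names `local_changes` and `merge`, are assumptions rather than details taken from the slides.

```python
def local_changes(base_keys, flow_ips):
    """Site side: count local flows per group, keep only groups whose count changed."""
    counts = {k: 0 for k in base_keys}
    for ip in flow_ips:
        if ip in counts:
            counts[ip] += 1
    return {k: c for k, c in counts.items() if c != 0}


def merge(coordinator_copy, shipped):
    """Coordinator side: fold the shipped tuples into its own copy of the base table."""
    for k, c in shipped.items():
        coordinator_copy[k] += c


base_keys = ["1.2.0", "2.5.0"]
from_s1 = local_changes(base_keys, ["1.2.0"] * 75)
from_s2 = local_changes(base_keys, ["2.5.0"] * 111)

coordinator = {k: 0 for k in base_keys}
for shipped in (from_s1, from_s2):
    merge(coordinator, shipped)

tuples_shipped = len(from_s1) + len(from_s2)  # 2 here, instead of 4 without reduction
```

The coordinator can fill in the untouched groups itself because it holds a copy of the base-values table.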
Group Reduction: Example
For each IP address, what fraction of the total number of flows is due to web traffic?
[Figure: evaluation between the coordinator and the local DWs at sites S1 and S2.]
• Build groups (coordinator):
    IP
    1.2.0
    2.5.0
    xxxx
• Distribute, compute aggregates (one sub-result per site, S1 and S2):
    IP     Cnt1        IP     Cnt1
    1.2.0  75          1.2.0  0
    2.5.0  0           2.5.0  111
    xxxx   xxxx        xxxx   xxxx
• Synchronize (coordinator):
    IP     Cnt1
    1.2.0  75
    2.5.0  111
    xxxx   xxxx
• Distribute, compute aggregates (one sub-result per site, S1 and S2):
    IP     Cnt1  Cnt2        IP     Cnt1  Cnt2
    1.2.0  75    34          1.2.0  75    0
    2.5.0  111   0           2.5.0  111   53
    xxxx   xxxx  xxxx        xxxx   xxxx  xxxx
• Synchronize result (coordinator):
    IP     Cnt1  Cnt2
    1.2.0  75    34
    2.5.0  111   53
    xxxx   xxxx  xxxx
Optimizations: Synchronization Reduction
• It is possible to detect cases where no synchronization is required between passes.
• Example:
  • DW data: All the flows of an autonomous system are always registered (stored) at a particular local DW
  • Query: For each IP address, what fraction of the total number of flows is due to web traffic?
  • Each IP address belongs to a particular autonomous system; i.e., all data for a particular IP address is located at the system storing the flows of its autonomous system
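The idea can be sketched as below: when every IP address has all of its flows at one site, each site can run both counting passes back to back and the coordinator synchronizes once at the end. The names are hypothetical, and `partitioned_on` checks the partitioning by scanning the data purely for illustration; the slides do not say how Skalla detects such cases.

```python
def partitioned_on(sites, key):
    """True if every value of `key` occurs at no more than one site."""
    owner = {}
    for site, flows in sites.items():
        for f in flows:
            if owner.setdefault(f[key], site) != site:
                return False
    return True


def local_chain(base_keys, flows):
    """Site side: compute [cnt1, cnt2] per group with no synchronization in between."""
    out = {k: [0, 0] for k in base_keys}
    for f in flows:
        if f["ip"] in out:
            out[f["ip"]][0] += 1          # first pass: all flows
            if f["source"] == "WEB":
                out[f["ip"]][1] += 1      # second pass: web flows only
    return out


def synchronize_once(base_keys, subs):
    """Coordinator side: a single synchronization after both passes."""
    return {k: tuple(sum(sub[k][i] for sub in subs) for i in (0, 1)) for k in base_keys}


def make_flows(ip, web, other):
    """Generate illustrative flow records for one IP address."""
    return [{"ip": ip, "source": "WEB"}] * web + [{"ip": ip, "source": "OTHER"}] * other


sites = {"S1": make_flows("1.2.0", 34, 41), "S2": make_flows("2.5.0", 53, 58)}
base_keys = ["1.2.0", "2.5.0"]

if partitioned_on(sites, "ip"):
    subs = [local_chain(base_keys, flows) for flows in sites.values()]
    result = synchronize_once(base_keys, subs)
```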
Synch Reduction: Example (1)
For each IP address, what fraction of the total number of flows is due to web traffic?
[Figure: evaluation between the coordinator and the local DWs at sites S1 and S2.]
• Build groups (coordinator):
    IP
    1.2.0
    2.5.0
    xxxx
• Distribute, compute aggregates (one sub-result per site, S1 and S2):
    IP     Cnt1        IP     Cnt1
    1.2.0  75          1.2.0  0
    2.5.0  0           2.5.0  111
    xxxx   xxxx        xxxx   xxxx
• Synchronize (coordinator):
    IP     Cnt1
    1.2.0  75
    2.5.0  111
    xxxx   xxxx
• Distribute, compute aggregates (one sub-result per site, S1 and S2):
    IP     Cnt1  Cnt2        IP     Cnt1  Cnt2
    1.2.0  75    34          1.2.0  75    0
    2.5.0  111   0           2.5.0  111   53
    xxxx   xxxx  xxxx        xxxx   xxxx  xxxx
• Synchronize result (coordinator):
    IP     Cnt1  Cnt2
    1.2.0  75    34
    2.5.0  111   53
    xxxx   xxxx  xxxx
Synch Reduction: Example (2)
For each IP address, what fraction of the total number of flows is due to web traffic?
[Figure: evaluation between the coordinator and the local DWs at sites S1 and S2.]
• Build groups (coordinator):
    IP
    1.2.0
    2.5.0
    xxxx
• Distribute, compute aggregates (one sub-result per site, S1 and S2):
    IP     Cnt1  Cnt2        IP     Cnt1  Cnt2
    1.2.0  75    34          1.2.0  0     0
    2.5.0  0     0           2.5.0  111   53
    xxxx   xxxx  xxxx        xxxx   xxxx  xxxx
• Synchronize result (coordinator):
    IP     Cnt1  Cnt2
    1.2.0  75    34
    2.5.0  111   53
    xxxx   xxxx  xxxx
Experiment: Number of Sites (GR)
[Figure: Query Evaluation Time (high cardinality). x-axis: Number of Sites (0 to 10); y-axis: Seconds (0 to 600). Series: No Group Reduction, Group Reduction.]
Experiment: Number of Sites (SR)
[Figure: Query Evaluation Time (high cardinality). x-axis: Number of Sites (2 to 8); y-axis: Seconds (0 to 350). Series: No Synchronization Reduction, Synchronization Reduction.]
Experiments: Size of Database
[Figure: Query Evaluation Time. x-axis: Database Size (Relative) (0 to 5); y-axis: Seconds (0 to 1000). Series: not optimized, optimized.]
Experiments: Cost Breakdown
[Figure: Query Evaluation Time Breakdown. x-axis: Database Size (1 to 4); y-axis: Seconds (0 to 600). Series: Communication, Client Compute, Server Compute.]
Conclusions
• We develop a framework for evaluating complex OLAP queries on a distributed data warehouse
  • Efficient query plans that minimize data transfer over the network
• Further work:
  • Additional developments of the architecture
  • Cost-based query optimization