efficient query optimization for distributed join in database federation
Efficient Query Optimization for Distributed Join in Database Federation. A Master’s Thesis Proposal by Di Wang. Advisor: Prof. Murali Mani. Dec 4, 2008. Outline: Introduction – Query Optimization in Database Federations; Architecture and Problem Definition; Proposed Work; Schedule.
![Page 1: Efficient Query Optimization for Distributed Join in Database Federation](https://reader036.vdocument.in/reader036/viewer/2022062305/568164d8550346895dd71d18/html5/thumbnails/1.jpg)
Efficient Query Optimization for Distributed Join
in Database Federation
A Master’s Thesis Proposal by
Di Wang
Advisor: Prof. Murali Mani
Dec 4, 2008
![Page 2: Efficient Query Optimization for Distributed Join in Database Federation](https://reader036.vdocument.in/reader036/viewer/2022062305/568164d8550346895dd71d18/html5/thumbnails/2.jpg)
Outline
Introduction – Query Optimization in Database Federations
Architecture and Problem Definition
Proposed Work
Schedule
![Page 3: Efficient Query Optimization for Distributed Join in Database Federation](https://reader036.vdocument.in/reader036/viewer/2022062305/568164d8550346895dd71d18/html5/thumbnails/3.jpg)
Introduction: Need for data integration
◦ Various systems -> full picture
◦ Mergers -> access both resources with a common interface
◦ Business partners -> combine data
Multiple Access Methods
Multiple Data Schemas
![Page 4: Efficient Query Optimization for Distributed Join in Database Federation](https://reader036.vdocument.in/reader036/viewer/2022062305/568164d8550346895dd71d18/html5/thumbnails/4.jpg)
Introduction to Database Federation
Database Federation is one approach to data integration
◦ Key performance advantage: efficiently combine data from multiple sources in a single statement
◦ The data sources are federated under a unified middleware layer, called the mediator.
![Page 5: Efficient Query Optimization for Distributed Join in Database Federation](https://reader036.vdocument.in/reader036/viewer/2022062305/568164d8550346895dd71d18/html5/thumbnails/5.jpg)
Key Components of Database Federation
Query Rewriter
Cost-Based Optimizer
Query
Research Issues:
• Containment algorithms for conjunctive queries
• Schema mapping
• Capability-based optimization
Cost-based optimization: closely related to the optimization techniques developed for distributed database systems
![Page 6: Efficient Query Optimization for Distributed Join in Database Federation](https://reader036.vdocument.in/reader036/viewer/2022062305/568164d8550346895dd71d18/html5/thumbnails/6.jpg)
The problem
![Page 7: Efficient Query Optimization for Distributed Join in Database Federation](https://reader036.vdocument.in/reader036/viewer/2022062305/568164d8550346895dd71d18/html5/thumbnails/7.jpg)
Things that make us unhappy

Optimizer inputs for sites M1, M2, M3:
◦ Estimated conditions: available buffer sizes of sites; CPU utilization of sites; network traffic …
◦ Statistics: physical designs …

Candidate plans (operator trees, root first):
◦ Plan 1: SortMerge on M1 ( NestLoop on M1 ( M1.R1, M2.R2 ), M3.R3 )
◦ Plan 2: SortMerge on M2 ( NestLoop on M2 ( M1.R1, M2.R2 ), M3.R3 )
◦ Plan 3: HashJoin on M3 ( SortMerge on M1 ( M1.R1, M2.R2 ), M3.R3 )

| Run | CPU Utilization M1 | CPU Utilization M2 | CPU Utilization M3 | Available Buffer M1 | Available Buffer M2 | Available Buffer M3 | Chosen Plan | Optimal Plan |
|---|---|---|---|---|---|---|---|---|
| 1 | 25% | 25% | 25% | B(R1) | - | - | Plan 1 | Plan 1 |
| 2 | 75% | 10% | 25% | > B(R1) | > B(R1) | - | Plan 1 | Plan 2 |
| 3 | 50% | 50% | 15% | > | - | > | Plan 1 | Plan 3 |

Assume: B(R1) < B(R2) < B(R3), B(R1 join R2) < B(R3)

Need to take run-time conditions into account at optimization time.
![Page 8: Efficient Query Optimization for Distributed Join in Database Federation](https://reader036.vdocument.in/reader036/viewer/2022062305/568164d8550346895dd71d18/html5/thumbnails/8.jpg)
Existing Solution - Parametric Query Optimization
Y. E. Ioannidis, et al. Parametric Query Optimization. VLDB 1992.
Key idea: identify several execution plans, each of which is optimal for a subset of all possible values of the run-time parameters
E.g. Two parameters:
◦ Buffer size B = [2, 151]
◦ Kind of indexes I = {no_index, clustered_Btree, non_clustered_BTree}
P – possible vectors of values of parameters: P = cross product B × I, |P| = 150 * 3 = 450
The optimization problem: ∀ p ∈ P, find the plan s0 in the plan space S that satisfies the condition [formula not recovered]
[symbol not recovered] is the static parameters, c( ) is the cost function
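The parameter space in the example above can be enumerated directly. The following is a minimal sketch, not the algorithm of the cited paper: it builds P = B × I, confirms |P| = 450, and picks a minimum-cost plan for each p. The plan names in `PLANS` and the `cost` function are hypothetical placeholders chosen only for illustration.

```python
from itertools import product

# Parameter domains taken from the example: buffer size B = [2, 151], three index kinds.
BUFFER_SIZES = range(2, 152)  # 2..151 inclusive, 150 values
INDEX_KINDS = ("no_index", "clustered_Btree", "non_clustered_BTree")

# P is the cross product B x I.
P = list(product(BUFFER_SIZES, INDEX_KINDS))

# Hypothetical plan space S; the names are illustrative only.
PLANS = ("nested_loop", "sort_merge")


def cost(plan, p):
    """Hypothetical cost function c(s, p); the numbers are made up."""
    buffer_size, index_kind = p
    if plan == "nested_loop":
        # Assumed: cheap with a clustered index, otherwise shrinks as the buffer grows.
        return 100 if index_kind == "clustered_Btree" else 10_000 / buffer_size
    # Assumed: sort-merge cost is flat in this toy model.
    return 200


def optimal_plan(p):
    # s0 is the plan in S with minimum cost for parameter vector p.
    return min(PLANS, key=lambda s: cost(s, p))


# One optimal plan per parameter vector.
plan_for = {p: optimal_plan(p) for p in P}
```

Even in this toy model the optimal plan changes across P, which is the situation parametric optimization is meant to handle: a single plan chosen at compile time cannot be optimal for every p.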
![Page 9: Efficient Query Optimization for Distributed Join in Database Federation](https://reader036.vdocument.in/reader036/viewer/2022062305/568164d8550346895dd71d18/html5/thumbnails/9.jpg)
Existing Solution - Parametric Query Optimization (Cont.)
Efficient exploration algorithm – Randomized Algorithm
Justification for using parametric query optimization
[Figure: relative cost vs. buffer size]
Problems of the implementation in distributed databases
• Site selection + algebraic transformation + physical method selection
• Many more combinations of run-time parameters
![Page 10: Efficient Query Optimization for Distributed Join in Database Federation](https://reader036.vdocument.in/reader036/viewer/2022062305/568164d8550346895dd71d18/html5/thumbnails/10.jpg)
Existing Solution – Two-Phase Algorithm
W. Hong, et al. Optimization of Parallel Query Execution Plans in XPRS. PDIS, 1991.
Developed for a parallel database based on a shared-memory multiprocessor
Phase 1: find the optimal sequential plan assuming the entire buffer pool is available
Phase 2: find the optimal parallelization of the optimal sequential plan, considering run-time available buffer size & # of free processors
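The two phases can be sketched as two separate searches. This is a toy illustration of the structure only, not the XPRS implementation: the plan names, their costs, and the `parallel_cost` model (work divided by the degree of parallelism plus a per-worker startup cost, with a fixed buffer need per worker) are all assumptions.

```python
# Hypothetical sequential plans with assumed costs under the full buffer pool.
SEQUENTIAL_PLANS = {"left_deep": 1200, "bushy": 900, "right_deep": 1500}


def phase1():
    # Phase 1: pick the cheapest sequential plan, assuming the whole buffer pool is available.
    return min(SEQUENTIAL_PLANS, key=SEQUENTIAL_PLANS.get)


def parallel_cost(seq_cost, degree, available_buffer, buffer_per_worker=10, startup=50):
    # Assumed toy model: work divides across workers, each worker adds startup overhead,
    # and a degree that does not fit in the available buffer is ruled out.
    if degree * buffer_per_worker > available_buffer:
        return float("inf")
    return seq_cost / degree + startup * degree


def phase2(plan, available_buffer, free_processors):
    # Phase 2: search parallelizations of the single plan from phase 1 only.
    seq_cost = SEQUENTIAL_PLANS[plan]
    degrees = range(1, free_processors + 1)
    return min(degrees, key=lambda d: parallel_cost(seq_cost, d, available_buffer))


def two_phase(available_buffer, free_processors):
    plan = phase1()
    return plan, phase2(plan, available_buffer, free_processors)
```

For example, `two_phase(100, 8)` keeps the same sequential plan as `two_phase(20, 8)` but settles on a different degree of parallelism, because only phase 2 sees the run-time buffer size.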
![Page 11: Efficient Query Optimization for Distributed Join in Database Federation](https://reader036.vdocument.in/reader036/viewer/2022062305/568164d8550346895dd71d18/html5/thumbnails/11.jpg)
Existing Solution – Two-Phase Algorithm (Cont.)
Benefits:
◦ Phase 1 has the same plan space as a System-R-style algorithm, but only one plan is explored in Phase 2
◦ Capability of dealing with parameters that are unknown at compile time
Problems for applying in database federations:
◦ Communication cost is not considered
◦ Exhaustive search in Phase 2 is still expensive for a large number of data sources
![Page 12: Efficient Query Optimization for Distributed Join in Database Federation](https://reader036.vdocument.in/reader036/viewer/2022062305/568164d8550346895dd71d18/html5/thumbnails/12.jpg)
Proposed Work
![Page 13: Efficient Query Optimization for Distributed Join in Database Federation](https://reader036.vdocument.in/reader036/viewer/2022062305/568164d8550346895dd71d18/html5/thumbnails/13.jpg)
Important Observation
Many national-scale or global-scale data federations are built on networks that consist of both broad LAN paths and narrow long-haul paths.
Many highly integrated systems have to access data through a large number of databases that belong to multiple different organizations.
![Page 14: Efficient Query Optimization for Distributed Join in Database Federation](https://reader036.vdocument.in/reader036/viewer/2022062305/568164d8550346895dd71d18/html5/thumbnails/14.jpg)
Cluster-and-Conquer
Consider all data resources in the database federation as a set of several clusters of sites
Design two layers of mediators to schedule the query plan cooperatively:
◦ Global Mediator + Cluster Mediator
[Figure: Global Mediator and Clusters 1, 2, 4-13]
![Page 15: Efficient Query Optimization for Distributed Join in Database Federation](https://reader036.vdocument.in/reader036/viewer/2022062305/568164d8550346895dd71d18/html5/thumbnails/15.jpg)
Architecture
• System-R-style algorithm
• Runs at compile time
• Considers all tables as being stored in a clustered fashion
• Decides inter-cluster operations

• Schedules the optimal plan found by the optimizer in a distributed and parallelized way
• Assigns each sub-plan to the corresponding cluster

• Considers run-time conditions and static physical designs
• Finds an intra-cluster optimal plan
• Every cluster mediator functions independently and potentially in parallel
![Page 16: Efficient Query Optimization for Distributed Join in Database Federation](https://reader036.vdocument.in/reader036/viewer/2022062305/568164d8550346895dd71d18/html5/thumbnails/16.jpg)
Cost Model and Optimization Goal
Cost Model
Optimization Goal
◦ To find the distributed join schedule plan with minimum cost.
![Page 17: Efficient Query Optimization for Distributed Join in Database Federation](https://reader036.vdocument.in/reader036/viewer/2022062305/568164d8550346895dd71d18/html5/thumbnails/17.jpg)
Problem Definition
Run-time parameters:
◦ Available buffer size
◦ CPU utilization
Parallelism:
◦ Partitioned parallelism
◦ Pipelined parallelism
Reasons: partitioning the input data is often not feasible; in bushy plans it is common to have two operations that do not use each other’s output
◦ Independent parallelism
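Independent parallelism in a bushy plan can be illustrated with two sub-joins that do not use each other's output. The sketch below is an assumption-laden toy: the relations `R1` to `R4`, their columns, and the `hash_join` helper are hypothetical, and Python threads are used only to show the scheduling structure, not to claim real CPU parallelism.

```python
from concurrent.futures import ThreadPoolExecutor


def hash_join(left, right, key):
    # Build a hash table on the left input, then probe it with the right input.
    table = {}
    for row in left:
        table.setdefault(row[key], []).append(row)
    return [{**l, **r} for r in right for l in table.get(r[key], [])]


# Hypothetical toy relations.
R1 = [{"a": 1, "x": 10}, {"a": 2, "x": 20}]
R2 = [{"a": 1, "k": 7}, {"a": 2, "k": 8}]
R3 = [{"b": 5, "k": 7}]
R4 = [{"b": 5, "y": 99}]


def bushy_plan():
    # (R1 join R2) and (R3 join R4) do not use each other's output,
    # so both can be submitted at the same time.
    with ThreadPoolExecutor(max_workers=2) as pool:
        left = pool.submit(hash_join, R1, R2, "a")
        right = pool.submit(hash_join, R3, R4, "b")
        # The top join has to wait for both independent sub-joins.
        return hash_join(left.result(), right.result(), "k")
```

In a federation the two sub-joins would typically run on different sites, so neither needs the input data to be partitioned first.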
![Page 18: Efficient Query Optimization for Distributed Join in Database Federation](https://reader036.vdocument.in/reader036/viewer/2022062305/568164d8550346895dd71d18/html5/thumbnails/18.jpg)
Optimization Algorithm
E.g.
SELECT *
FROM S1.t1, S2.t2, S5.t7, S1.t2, S6.t5, S2.t3
WHERE S1.t1.CustomerID = S2.t2.CustomerID
AND S2.t2.SupplierID = S5.t7.SupplierID
AND S5.t7.ItemID = S6.t5.ItemID
AND S6.t5.Country = S1.t2.Country
AND S1.t2.Year = S2.t3.Year
Global Mediator
Clustered view
Physical design info: B(R), T(R), V(R.attr), ……
Rule 1: only determine inter-cluster operations
Rule 2: plans that join two relations in distinct clusters are eliminated
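One way to read Rule 1 is that the global mediator first separates the join predicates of the example query into inter-cluster joins, which it decides itself, and intra-cluster joins, which it leaves to the cluster mediators. The sketch below shows that split. The site-to-cluster assignment in `CLUSTER_OF_SITE` is an assumption, since the slides do not state which sites belong to which cluster.

```python
# Join predicates from the example query, as (left_table, right_table) pairs.
JOINS = [
    ("S1.t1", "S2.t2"),  # CustomerID
    ("S2.t2", "S5.t7"),  # SupplierID
    ("S5.t7", "S6.t5"),  # ItemID
    ("S6.t5", "S1.t2"),  # Country
    ("S1.t2", "S2.t3"),  # Year
]

# Assumed site-to-cluster assignment; the slides do not give one.
CLUSTER_OF_SITE = {"S1": "C1", "S2": "C1", "S5": "C2", "S6": "C2"}


def cluster_of(table):
    # "S1.t1" -> site "S1" -> cluster "C1"
    site = table.split(".", 1)[0]
    return CLUSTER_OF_SITE[site]


def split_joins(joins):
    # Inter-cluster joins stay with the global mediator;
    # intra-cluster joins are grouped per cluster for the cluster mediators.
    inter, intra = [], {}
    for left, right in joins:
        cl, cr = cluster_of(left), cluster_of(right)
        if cl == cr:
            intra.setdefault(cl, []).append((left, right))
        else:
            inter.append((left, right))
    return inter, intra


inter_cluster, intra_cluster = split_joins(JOINS)
```

Under this assumed assignment only two of the five joins cross a cluster boundary, so the global search space is much smaller than the space over all six tables.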
![Page 19: Efficient Query Optimization for Distributed Join in Database Federation](https://reader036.vdocument.in/reader036/viewer/2022062305/568164d8550346895dd71d18/html5/thumbnails/19.jpg)
Optimization Algorithm (Cont.)
Cluster 1 Mediator
Sub-plan
Search space:
• Algebraic transformation
• Physical method selection – Available_buffer
• Site selection – CPU_utility (fine-grained operator scheduling)
Run-time conditions: Available_buffer(S1), CPU_utility(S1), ……
Physical design info: B(R), T(R), V(R.attr), ……
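The two run-time decisions of a cluster mediator can be sketched as follows. This is a hypothetical illustration, not the proposed algorithm: the rule in `choose_method` (hash join when the smaller input fits in the available buffer, sort-merge otherwise) is a textbook-style assumption, and `choose_site` simply picks the least-loaded site.

```python
def choose_method(b_left, b_right, available_buffer):
    # Assumed rule of thumb: if the smaller input fits in the available buffer,
    # use a hash join; otherwise fall back to sort-merge.
    if min(b_left, b_right) < available_buffer:
        return "HashJoin"
    return "SortMerge"


def choose_site(cpu_utility):
    # Pick the least-loaded site; cpu_utility maps site name -> utilization in [0, 1].
    return min(cpu_utility, key=cpu_utility.get)


def schedule_join(b_left, b_right, available_buffer, cpu_utility):
    # Site selection uses CPU_utility; method selection uses that site's Available_buffer.
    site = choose_site(cpu_utility)
    method = choose_method(b_left, b_right, available_buffer[site])
    return method, site
```

With B(R) values of 100 and 400 blocks, `schedule_join(100, 400, {"S1": 50, "S2": 200}, {"S1": 0.75, "S2": 0.10})` selects a hash join on S2, while swapping the load so that S1 is the idle site yields a sort-merge join on S1 because its buffer is too small.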
![Page 20: Efficient Query Optimization for Distributed Join in Database Federation](https://reader036.vdocument.in/reader036/viewer/2022062305/568164d8550346895dd71d18/html5/thumbnails/20.jpg)
Theoretical Analysis
In global mediator
In cluster mediator
Comparison with related work
![Page 21: Efficient Query Optimization for Distributed Join in Database Federation](https://reader036.vdocument.in/reader036/viewer/2022062305/568164d8550346895dd71d18/html5/thumbnails/21.jpg)
Experiment Design
![Page 22: Efficient Query Optimization for Distributed Join in Database Federation](https://reader036.vdocument.in/reader036/viewer/2022062305/568164d8550346895dd71d18/html5/thumbnails/22.jpg)
That is what I want to do for my Master’s Thesis …
Thanks