query optimization in object databases

Post on 11-Jan-2016


Page 1: Query Optimization in Object Databases

Query Optimization in Object Databases

Georges GARDARIN

Laboratoire PRiSM/UVSQ

Page 2: Query Optimization in Object Databases

1. Introduction

Object models provide ADTs, inheritance, complex structures, relationships and object identity

Query optimizers transform a query into query plans composed of low-level operations evaluated on the object collections

New techniques are required for supporting the object-oriented features

Page 3: Query Optimization in Object Databases

Outline

Object Query Languages
Complex Object Algebra
Operator Algorithms
Query Plan Transformations
Cost Models
Search Strategies
Open Problems

Page 4: Query Optimization in Object Databases

Overview

Presentation of the various topics of query processing in OODBMSs

Topics are not independent
  Operators depend on data structures (index)
  Search strategies depend on the cost model

An optimizer has to consider all aspects
  Complex piece of software
  Has to be extensible for additional features: data types, access methods, operators

Page 5: Query Optimization in Object Databases

Vocabulary

Collection: a set, list, array or bag of objects
Query: a user query in a high-level language
Predicate: a term of a query criterion
Qualification: a logical expression of predicates
Operator: a low-level accessor to one or several collections
Annotation: the selected algorithm for executing an operator
Query plan: a program of annotated operators
Cluster: a group of related objects stored together in a bucket
Index: an accelerator by value of an attribute
Path index: an accelerator by values along a path

Page 6: Query Optimization in Object Databases

2. Object Query Languages

Extension of SQL with:
  user-defined functions in predicates and results
  user-defined comparison predicates
  path expressions to traverse relationships
  flattening, grouping and degrouping operators
  automatic scan of inheritance hierarchies

Two "standards" are under construction:
  The object standard of ODMG (OQL)
  The object-relational standard of ISO/ANSI (SQL3)

Page 7: Query Optimization in Object Databases

Database Example (1)

[Figure: schema graph with three classes. Vehicle: Number, Color (String), Maker (reference to Company). Company: Name, City (String), President (reference to Employee). Employee: Ssn, Name, BirthDate.]

Page 8: Query Optimization in Object Databases

Query Example

Object identity:
  SELECT E.Name, C.Name
  FROM Employee E, Company C
  WHERE C.President == E

Paths and method:
  SELECT Number
  FROM Vehicle
  WHERE Color = "Red"
  AND Vehicle.Maker.City = "Paris"
  AND Vehicle.Maker.President.age() < 50

Page 9: Query Optimization in Object Databases

Database Example (2)

[Figure: schema graph. Company (Name, City) is linked by Employs to Person (Name, Age), which is linked by Owns to Vehicle (Number, Power). Attribute types are String and Int.]

Page 10: Query Optimization in Object Databases

Qualified Path Expression

OQL form:
  SELECT C.Name, P.Name, V.Number
  FROM C IN Companies, P IN C.Employs, V IN P.Owns
  WHERE C.City="Paris" AND P.Age<30 AND V.Pow>10

Direct form:
  SELECT C.Name, P.Name, V.Number
  FROM Companies C, Persons P, Vehicles V
  WHERE C[City="Paris"].Employs.P[Age<30].Owns.V[Pow>10]

Page 11: Query Optimization in Object Databases

Exercise: Queries

Express in OQL, then with qualified path expressions, a set of given queries.

Page 12: Query Optimization in Object Databases

3. Complex Object Algebra

Generalization of relational algebra
Set-oriented processing of objects
A set of operations on collections of objects generating collections of objects
Different types of collections: class extent, set, bag, list, array
Any query can be expressed as a complex object algebra expression
Logical algebra annotated for execution

Page 13: Query Optimization in Object Databases

The LORA Algebra

[Figure: hierarchy of LORA operators rooted at LoraOp, with nodes SearchOp, UpdateOp, TransactOp, Filter, Map, SetOp (Union, Intersect, Difference), GroupOp (RemoveDup, Aggregate, Nest, Unnest), Join (RJoin, VJoin) and Sort.]

Finance and Gardarin 1991

Page 14: Query Optimization in Object Databases

Main Operator Signatures

JOIN: Col, Exp, Col => Col
OUTER_JOIN: Col, Exp, Col => Col
SORT: Col, Exp => Col
AGG: Col, Exp, Exp => Col
NEST: Col, Nest_Exp => Col
UNNEST: Col, Nest_Exp => Col
RDUPLICATE: Col, Exp => Col
FILTER: Col, Qual => Col
MAP: Col, Exp, Qual => Col
MINUS: Col, Col => Col
DIVIDE: Col, Col => Col
UNION: Col, Col => Col
OUTER_UNION: Col, Col => Col
INTERSECT: Col, Col => Col
UPDATE: Col, Col, Ident, Assignment => Col
INSERT: Col, Col => Col
DELETE: Col, Col, Ident => Col

Page 15: Query Optimization in Object Databases

Algebraic Tree

[Figure: algebraic tree for the query below. Operators shown: Filter(Color="Red") on Vehicle, RJoin(Maker) with Company restricted by Filter(City="Paris",*), RJoin(President) with Employee restricted by Filter(age()<50,*), and a final Filter(*,Number).]

SELECT Number
FROM Vehicle
WHERE Color = "Red"
AND Vehicle.Maker.City = "Paris"
AND Vehicle.Maker.President.age() < 50

Page 16: Query Optimization in Object Databases

The ENCORE Algebra (1)

Shaw and Zdonik 1990

Select(InputCollection, p) = {s | (s in InputCollection) ∧ p(s)}

Image(InputCollection, f : T) = {f(s) | s in InputCollection}

Project(InputCollection, <(A1, f1), ..., (An, fn)>) = {<A1 : f1(s), ..., An : fn(s)> | (s in InputCollection)}
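The three definitions above read naturally as comprehensions. A minimal Python sketch, assuming objects are plain dictionaries and collections are lists (ENCORE itself works on typed sets with object identity); the function names and sample data are illustrative only:

```python
from typing import Any, Callable, Dict, Iterable, List


def select(collection: Iterable[Any], p: Callable[[Any], bool]) -> List[Any]:
    """Select: keep the objects s of the collection for which p(s) holds."""
    return [s for s in collection if p(s)]


def image(collection: Iterable[Any], f: Callable[[Any], Any]) -> List[Any]:
    """Image: apply f to every object of the collection."""
    return [f(s) for s in collection]


def project(collection: Iterable[Any],
            fields: Dict[str, Callable[[Any], Any]]) -> List[Dict[str, Any]]:
    """Project: build one tuple <A1: f1(s), ..., An: fn(s)> per object s."""
    return [{name: f(s) for name, f in fields.items()} for s in collection]


# Hypothetical sample data, not taken from the slides.
companies = [{"Name": "Acme", "City": "Paris"},
             {"Name": "Globex", "City": "Lyon"}]

paris = select(companies, lambda c: c["City"] == "Paris")
names = image(paris, lambda c: c["Name"])
labels = project(companies, {"Label": lambda c: c["Name"] + "/" + c["City"]})
```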

Page 17: Query Optimization in Object Databases

The ENCORE Algebra (2)

Nest(InputCollection, Ai) = {<A1 : s.A1, ..., Ai : t, ..., An : s.An> | ∀r ∃s (r in t ∧ s in InputCollection ∧ s.Ai = r)}

UnNest(InputCollection, Ai) = {<A1 : s.A1, ..., Ai : t, ..., An : s.An> | s in InputCollection ∧ t in s.Ai}

Flatten(InputCollection) = {r | ∃t in InputCollection ∧ r in t}

DupEliminate(InputCollection)
Coalesce(InputCollection, Ai)

Page 18: Query Optimization in Object Databases

The ENCORE Algebra (3)

OJoin(InputCollection1, InputCollection2, A1, A2, p) = {<A1 : s, A2 : r> | s in InputCollection1 ∧ r in InputCollection2 ∧ p(s,r)}

Set-oriented operations, with set membership based on object identity:
  Union
  Intersection
  Difference

Page 19: Query Optimization in Object Databases

The OFL Operators

Gardarin & Machucca 1995

Navigational traversal is often interesting:
  Existential quantification
  Better control of query plans, smaller granularity

Mixing navigational and set-oriented traversal
Based on Backus' functional approach
Processing of collections of objects
Side effects introduced through cursors

Page 20: Query Optimization in Object Databases

The OFL Language

Definition: Abstract Collection
  A container of objects encapsulated by a finite set of behavioral and traversal functions.

Constructions:
  Composition: f.g (x) = f(g(x))
  Path expressions: f0.f1....fn(x)
  Conditional: If_Then_Else (p, f1, f2) (x)
  Iteration: While (p, f)
  Sequence: Sequence(f1, f2, …, fn)

Page 21: Query Optimization in Object Databases

Collection Traversal in OFL

Quantified function "apply to all"
  A second-order function of signature ForAll(C, p, f) that applies a function f to all objects of a collection C satisfying a predicate p.

Quantified function "apply to any"
  A second-order function of signature ForAny(C, p, f) that applies a function f to any one object of a collection C satisfying a predicate p.

Iterator and annotations
  Each quantified function works on an iterator
  Set-oriented or navigational traversal is selected

Page 22: Query Optimization in Object Databases

Database Example

[Figure: schema graph with classes Person, Vehicle and Part, the relationships Owner (Person to Vehicle) and Composed (Vehicle to Part), and the attributes LastName, Birthyear, Salary, Color, Price and PartLabel.]

Page 23: Query Optimization in Object Databases

Translating a Query into OFL

SELECT p.lastname
FROM p in Person
WHERE exists v in p.owner : v.color = "Red"

ForAll(Person P, null,
  ForAny(Owner(P) V, StringEqual(Color(V),"Red"), LastName(P)));
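The ForAll/ForAny translation above can be mimicked with two higher-order functions. This is only a sketch under stated assumptions: `for_all` and `for_any` are hypothetical Python stand-ins for the OFL quantified functions, a `None` predicate plays the role of `null`, objects are dictionaries, and results are collected in a list rather than through cursors.

```python
from typing import Any, Callable, Iterable, List, Optional

Pred = Optional[Callable[[Any], bool]]


def for_all(collection: Iterable[Any], p: Pred, f: Callable[[Any], Any]) -> List[Any]:
    """Apply f to every object satisfying p and collect the non-None results."""
    out = []
    for x in collection:
        if p is None or p(x):
            r = f(x)
            if r is not None:
                out.append(r)
    return out


def for_any(collection: Iterable[Any], p: Pred, f: Callable[[Any], Any]) -> Any:
    """Apply f to the first object satisfying p (existential quantification)."""
    for x in collection:
        if p is None or p(x):
            return f(x)
    return None


# Hypothetical data: each person carries its Owner collection of vehicles.
persons = [
    {"LastName": "Durand", "Owner": [{"Color": "Red"}, {"Color": "Blue"}]},
    {"LastName": "Martin", "Owner": [{"Color": "Green"}]},
]

red_owners = for_all(
    persons, None,
    lambda p: for_any(p["Owner"], lambda v: v["Color"] == "Red",
                      lambda v: p["LastName"]))
```

Because `for_any` stops at the first matching vehicle, a person owning several red vehicles is reported once, which is the point of the existential form.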

Page 24: Query Optimization in Object Databases

A More Complex Query

OQL query:
  SELECT tuple(p.lastname, v.price, c.partlabel)
  FROM p in Person, v in p.owner, c in v.composed
  WHERE p.age = 16 and v.price = c.price

OFL translation:
  ForAll(Find(AgeIndex,16) P, null,
    ForAll(Owner(P) V, null,
      ForAll(Composed(V) T, IntegerEqual(Price(V),Price(T)),
        Tuple(LastName(P), Price(V), PartLabel(T)))));

Page 25: Query Optimization in Object Databases

Further Operators

Recursive operators
  FixPoint(ResultCollection, InitializationExpression, RecursivePredicate, RecursiveExpression, FinalExpression)

gives the OFL program:
  Sequence(OFLInitializationExpression,
    While(OFLRecursivePredicate, OFLRecursiveExpression),
    OFLFinalExpression)

Page 26: Query Optimization in Object Databases

Exercise: Algebra

Write in OFL the definition of the LORA operations. Example:

  Join(InputCollection1, InputCollection2, ResultCollection, JoinPredicate, ProjectionExpression)

gives the OFL program:
  ForAll(InputCollection1, null,
    ForAll(InputCollection2, OFLJoinPredicate,
      InsertResultCollection(ResultCollection, OFLProjectionExpression)))

Page 27: Query Optimization in Object Databases

4. Algebraic Operator Algorithms

Classical relational operators are still valid ...

Filtering with a predicate (Restriction)
  Sequential scan
  Index scan, clustered or non-clustered

Value-based join
  Nested loop join: iterate on the outer collection and compare each outer object with each object in the inner collection
  Merge join: sort the two collections on the join fields and then merge
  Hash join: hash the outer collection on the join fields, scan the inner collection and probe the hashed collection
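As an illustration of the value-based joins listed above, here is a minimal in-memory Python sketch of the nested loop and hash variants. It assumes an equality join on a key extracted by caller-supplied functions and ignores paging; the function names and data are illustrative only.

```python
from collections import defaultdict
from typing import Any, Callable, Hashable, List, Sequence, Tuple

Key = Callable[[Any], Hashable]


def nested_loop_join(outer: Sequence[Any], inner: Sequence[Any],
                     key_outer: Key, key_inner: Key) -> List[Tuple[Any, Any]]:
    """Compare each outer object with each inner object."""
    return [(o, i) for o in outer for i in inner if key_outer(o) == key_inner(i)]


def hash_join(outer: Sequence[Any], inner: Sequence[Any],
              key_outer: Key, key_inner: Key) -> List[Tuple[Any, Any]]:
    """Hash the outer collection on the join key, then scan the inner one and probe."""
    table = defaultdict(list)
    for o in outer:
        table[key_outer(o)].append(o)
    return [(o, i) for i in inner for o in table.get(key_inner(i), [])]


# Hypothetical data: (name, city) pairs joined on the city.
persons = [("Ann", "Paris"), ("Bob", "Lyon")]
companies = [("Acme", "Paris"), ("Initech", "Paris")]

nl = nested_loop_join(persons, companies, lambda p: p[1], lambda c: c[1])
hj = hash_join(persons, companies, lambda p: p[1], lambda c: c[1])
```

Both functions return the same pairs; only the number of comparisons differs, which is what the cost formulas later in the deck try to capture.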

Page 28: Query Optimization in Object Databases

Path Traversals

Paths may involve multiple collections
Each collection can be qualified by predicates

Page 29: Query Optimization in Object Databases

Depth-First-Fetch

Depth-First-Fetch (DFF) is the natural algorithm for evaluating a path expression.

It follows the path from the root to the target collection, using a depth-first graph traversal algorithm.

The corresponding operator is an n-ary operator denoted DFF.

Advantages:
  no intermediate results, simple pointer chasing
  results are assembled one at a time, allowing pipelining
  efficient when the memory size is large enough to avoid swapping of objects
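A depth-first fetch can be sketched as a recursive generator. The sketch below is an assumption-laden illustration, not the DFF operator of the cited papers: objects are in-memory dictionaries holding direct references to their children, `path` lists the reference attributes to follow, and `preds` gives one predicate per collection on the path.

```python
from typing import Any, Callable, Dict, Iterator, List, Sequence, Tuple

Obj = Dict[str, Any]


def dff(roots: Sequence[Obj], path: List[str],
        preds: List[Callable[[Obj], bool]]) -> Iterator[Tuple[Obj, ...]]:
    """Follow the path depth first, yielding one qualified object tuple at a time."""
    def visit(obj: Obj, depth: int, prefix: Tuple[Obj, ...]) -> Iterator[Tuple[Obj, ...]]:
        if not preds[depth](obj):
            return                      # prune this branch early
        prefix = prefix + (obj,)
        if depth == len(path):
            yield prefix                # reached the target collection
            return
        for child in obj[path[depth]]:  # pointer chasing
            yield from visit(child, depth + 1, prefix)

    for root in roots:
        yield from visit(root, 0, ())


# Hypothetical instance of the Company / Person / Vehicle schema.
v1 = {"Number": "V1", "Pow": 12}
v2 = {"Number": "V2", "Pow": 8}
p1 = {"Name": "Ann", "Age": 25, "Owns": [v1, v2]}
p2 = {"Name": "Bob", "Age": 40, "Owns": [v1]}
c1 = {"Name": "Acme", "City": "Paris", "Employs": [p1, p2]}
c2 = {"Name": "Globex", "City": "Lyon", "Employs": [p1]}

preds = [lambda c: c["City"] == "Paris",
         lambda p: p["Age"] < 30,
         lambda v: v["Pow"] > 10]
result = [(c["Name"], p["Name"], v["Number"])
          for c, p, v in dff([c1, c2], ["Employs", "Owns"], preds)]
```

The generator produces results one at a time and keeps no intermediate table, matching the pipelining advantage listed above; the price is that a shared object such as `v1` may be visited several times.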

Page 30: Query Optimization in Object Databases

Breadth-First-Fetch

Breadth-First-Fetch (BFF) traversal processes the tree of objects using a Forward Join (FJ) algorithm, which is based on pointer chasing between two collections.

Successive binary joins of collections are performed from the source collection to the target, following the path in forward order.

Advantages:
  no multiple fetch of objects
  but requires the construction of a hashed support table to memorize FJ results
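One forward join step can be sketched as follows. This is a simplified illustration under assumptions: collections are dictionaries from object identifier to object, references are lists of identifiers, the support table is a list of identifier tuples, and a dictionary stands in for the hashed structure that prevents fetching the same target twice.

```python
from typing import Any, Callable, Dict, List, Tuple

Collection = Dict[str, Dict[str, Any]]
Support = List[Tuple[str, ...]]


def forward_join(support: Support, ref_attr: str, source: Collection,
                 target: Collection, pred: Callable[[Dict[str, Any]], bool]) -> Support:
    """Extend each support row by following ref_attr from its last identifier."""
    fetched: Dict[str, bool] = {}   # each target object is fetched and tested once
    out: Support = []
    for row in support:
        for oid in source[row[-1]][ref_attr]:
            if oid not in fetched:
                fetched[oid] = pred(target[oid])
            if fetched[oid]:
                out.append(row + (oid,))
    return out


# Hypothetical instance of the Company / Person / Vehicle schema.
companies = {"c1": {"City": "Paris", "Employs": ["p1", "p2"]},
             "c2": {"City": "Lyon", "Employs": ["p1"]}}
persons = {"p1": {"Age": 25, "Owns": ["v1", "v2"]},
           "p2": {"Age": 40, "Owns": ["v1"]}}
vehicles = {"v1": {"Pow": 12}, "v2": {"Pow": 8}}

ta = [(oid,) for oid, c in companies.items() if c["City"] == "Paris"]
tb = forward_join(ta, "Employs", companies, persons, lambda p: p["Age"] < 30)
tc = forward_join(tb, "Owns", persons, vehicles, lambda v: v["Pow"] > 10)
```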

Page 31: Query Optimization in Object Databases

Reverse-Breadth-First-Fetch

Reverse-Breadth-First-Fetch (RBFF) performs a sequence of binary joins between two neighbor collections to traverse the path, but it proceeds in the reverse order of the path.

Thus, each join is called a Reverse Join (RJ). The join criterion is the membership of the object identifier of the second collection in the pointer attribute values of the first collection.

Advantages:
  efficient when the predicate(s) on the last collection(s) are selective
  but requires supporting tables and value-based joins
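A reverse join step can be sketched in the same style. The assumptions are the same as for any small in-memory model: collections map object identifiers to dictionaries, references are identifier lists, and the support table is a list of identifier tuples that grows to the left as the path is traversed backwards.

```python
from typing import Any, Callable, Dict, List, Tuple

Collection = Dict[str, Dict[str, Any]]
Support = List[Tuple[str, ...]]


def reverse_join(support: Support, ref_attr: str, source: Collection,
                 pred: Callable[[Dict[str, Any]], bool]) -> Support:
    """Prepend every qualified source object whose ref_attr contains a row's first identifier."""
    by_head: Dict[str, Support] = {}
    for row in support:                      # index the support table on its head identifier
        by_head.setdefault(row[0], []).append(row)
    out: Support = []
    for oid, obj in source.items():          # value-based join on identifier membership
        if not pred(obj):
            continue
        for ref in obj[ref_attr]:
            for row in by_head.get(ref, []):
                out.append((oid,) + row)
    return out


# Hypothetical instance of the Company / Person / Vehicle schema.
companies = {"c1": {"City": "Paris", "Employs": ["p1", "p2"]},
             "c2": {"City": "Lyon", "Employs": ["p1"]}}
persons = {"p1": {"Age": 25, "Owns": ["v1", "v2"]},
           "p2": {"Age": 40, "Owns": ["v1"]}}
vehicles = {"v1": {"Pow": 12}, "v2": {"Pow": 8}}

te = [(oid,) for oid, v in vehicles.items() if v["Pow"] > 10]
td = reverse_join(te, "Owns", persons, lambda p: p["Age"] < 30)
tc = reverse_join(td, "Employs", companies, lambda c: c["City"] == "Paris")
```

Starting from the most selective end keeps the support tables small, which is the case where the slide claims RBFF pays off.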

Page 32: Query Optimization in Object Databases

Illustration of BFF & RBFF

[Figure: four panels (a) to (d) showing forward joins (FJ) and reverse joins (RJ) between collections A, B, C, D and E, with support tables Ta, Tb, Tc and Te holding pairs of object identifiers.]

Page 33: Query Optimization in Object Databases

Further Algorithms

Various algorithms are available for each operator

Combinations of operators can be applied:
  to traverse long paths
  to derive new algorithms: hash both & sort buckets & merge buckets, limited breadth-first-fetch

Cost depends on many factors:
  physical organization of objects
  size of collections
  selectivity of predicates
  available memory size
  possible degree of parallelism

Page 34: Query Optimization in Object Databases

Exercise: Algorithms

List all the possible annotated query plans to process the query:
  SELECT C.Name, P.Name, V.Number
  FROM C IN Companies, P IN C.Employs, V IN P.Owns
  WHERE C.City="Paris" AND P.Age<30 AND V.Pow>10

over the database schema:

[Figure: schema graph. Company (Name, City) is linked by Employs to Person (Name, Age), which is linked by Owns to Vehicle (Number, Power).]

Page 35: Query Optimization in Object Databases

5. Query Plan Transformations

Query rewrite: algebraic rewrite of the query tree
  semantic transformations based on properties of data types and integrity constraints
  syntactic transformations based on properties of operators

Query planning: selection of the best algorithms
  annotation of logical operators with selected algorithms
  the cost of an annotated algorithm often depends on the result of the previous algorithm, e.g., no sort is needed if the result is already sorted

Query rewrite and query planning are not independent

Page 36: Query Optimization in Object Databases

Extensible Optimizers

Closed optimizer
  fixed set of operators and transformations
  heuristic-based or cost-based selection of plans
  efficient but hard to modify and extend
  e.g., Oracle 7.3, SQL Server 10, ...

Extensible optimizer
  extensible set of operators and transformations
  rule-based generation of query plans
  selection of the "best" plan using a search strategy
  e.g., Exodus, Starburst and DB2 CS, Esprit EDS & IDEA, Illustra, ...

Page 37: Query Optimization in Object Databases

Rewrite Rule Base

[Figure: modular rule base. A STRATEGY module drives four rule modules (Common Expression Detection, Syntactic Optimization, Semantic Optimization, Predicate Simplification) and, with a cost model and heuristics, turns a Query Plan into an Optimized Query Plan.]

From Gardarin, Finance DKE 93

Page 38: Query Optimization in Object Databases

Syntactic Rewrite Rules (1)

Restrict through Union pushing rule:
  Restrict(Union(C1,C2)) <=> Union(Restrict(C1),Restrict(C2))

Restrict through Super Class pushing rule:
  Restrict(Super(C1,C2)) <=> Super(Restrict(C1),Restrict(C2))

Page 39: Query Optimization in Object Databases

Syntactic Rewrite Rules (2)

Join commutativity:
  Join(C1,C2) <=> Join(C2,C1)

Join associativity:
  (C1 Join C2) Join C3 <=> C1 Join (C2 Join C3)

Restrict through Join pushing rule:
  Restrict(Join(C1,C2)) <=> Join(Restrict(C1),Restrict(C2))

Page 40: Query Optimization in Object Databases

Planning Rules

Join method choice
  JoinNL (C1,C2) <=> JoinSM (C1,C2)
  JoinHP (C1,C2) <=> JoinSM (C1,C2)

Depth First Fetch introduction
  DFF(C1,C2,C3) <=> Join(C1,Join(C2,C3))

Index Scan introduction
  Scan(C1,P) <=> Scan(IScan(C1,I(P)),P~I(P))

Page 41: Query Optimization in Object Databases

Semantic Rules

Integrity constraints
  Type(x) = Square <=> Type(x) = Polygon and large(x) = long(x)

User function properties
  draw(x+y) = draw(x) + draw(y)

Page 42: Query Optimization in Object Databases

What Rule Language?

Rules are often complex to express
  Conditions on qualifications, operators, results, ...

Proposed rule languages:
  C rewriting procedure [Exodus, Starburst]
    if <C procedure> is true then <C procedure>
    Practical, but makes the optimizer hard to extend
  Rule language with side effects [Finance91]
    WHEN <Query Expression> IF <Condition> THEN <Query Expression'> UNDER <Action>
    Complex to implement for pattern matching
  OQL query equivalence [Florescu95]
    Parametrized Query ~ Parametrized Query
    Lack of generality (e.g., query planning not possible)

Page 43: Query Optimization in Object Databases

Choice of Best Query Plan

[Figure: a Query Plan Generator takes the Algebraic Tree and the Database Schema and, using the Transformation Rule base, produces Query Plans; a Search Strategy (action: { }, cost: float, goal: boolean) and the Cost Model select the "Best" Query Plan.]

Page 44: Query Optimization in Object Databases

Exercise: Rules

Given a linear path expression from collection C1 to collection Ci, determine the number of distinct query plans to process the query, assuming that 3 algorithms are available to process any path expression (DFF, BFF, RBFF)

Give the rule base to generate those plans

[Figure: linear path C1, C2, ..., Ci-1, Ci]

Page 45: Query Optimization in Object Databases

6. Cost Models

Extension of the relational cost model to handle:
  Object identifiers
  Path indexes
  Object linking and embedding
  Clustering

Takes into account CPU cost and I/O cost:
  CPU cost is proportional to the number of examined objects
  I/O cost is proportional to the number of pages read

Page 46: Query Optimization in Object Databases

Collection Parameters

|C| = number of pages of collection C
||C|| = number of objects of collection C
|Ci| = number of pages of cluster i of collection C
||Ci|| = number of objects of cluster i of collection C
SC = average object size in collection C
SProj = average size of projection result
M = available memory size
Sel(Qual) = selectivity of qualification Qual
Sel(Pred) = selectivity of indexed predicate Pred

Page 47: Query Optimization in Object Databases

I/O Scan Formulas

Sequential scan
  I/O cost = I/OScan + I/OResult
  I/OScan = |C|
  I/OResult = max(0, Sel(Qual)*|C|*SProj/SC - M)

Unclustered index scan
  I/O cost = I/OIndex + I/OHit + I/OResult
  I/OIndex = Blevel(I)
  I/OHit = Yao(||C||, |C|, Sel(Pred)*||C||)

Clustered index scan
  I/O cost = I/OIndex + I/OHit + I/OResult
  I/OHit = Sel(Pred)*|C|
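The scan formulas can be turned into a small calculator. This sketch rests on several assumptions that are not in the slides: it uses the classical product form of Yao's block-hit estimate, it reads the third argument of Yao as a number of selected objects (rounded), it assumes objects are spread uniformly over pages, and all function and parameter names are illustrative.

```python
def yao(n_objects: int, n_pages: int, k: int) -> float:
    """Expected number of pages touched when k of n_objects are selected at random."""
    if k <= 0:
        return 0.0
    per_page = n_objects / n_pages
    if k > n_objects - per_page:
        return float(n_pages)           # every page is necessarily hit
    miss = 1.0                          # probability that a given page is not hit
    for i in range(1, k + 1):
        miss *= (n_objects - per_page - i + 1) / (n_objects - i + 1)
    return n_pages * (1.0 - miss)


def sequential_scan_io(pages: float, sel_qual: float, s_proj: float,
                       s_obj: float, memory: float) -> float:
    """I/OScan + I/OResult, the result being written only if it overflows memory."""
    return pages + max(0.0, sel_qual * pages * s_proj / s_obj - memory)


def unclustered_index_scan_io(blevel: int, n_objects: int, pages: int,
                              sel_pred: float, io_result: float = 0.0) -> float:
    """I/OIndex + I/OHit + I/OResult with hits estimated by Yao."""
    return blevel + yao(n_objects, pages, round(sel_pred * n_objects)) + io_result


def clustered_index_scan_io(blevel: int, pages: int, sel_pred: float,
                            io_result: float = 0.0) -> float:
    """I/OIndex + I/OHit + I/OResult with selected objects stored contiguously."""
    return blevel + sel_pred * pages + io_result


# Hypothetical collection: 10,000 objects on 1,000 pages, 1% selectivity.
unclustered = unclustered_index_scan_io(3, 10_000, 1_000, 0.01)
clustered = clustered_index_scan_io(3, 1_000, 0.01)
```

With these numbers the unclustered scan touches far more pages than the clustered one, because the 100 selected objects are scattered over nearly as many pages.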

Page 48: Query Optimization in Object Databases

Object Clustering

Clustering by class
  All the instances of a class are grouped in the same file

Clustering by composition
  An object of a class is grouped with one or several of its component objects. This placement is suited to path traversals

Random clustering
  Objects are placed in the order of their creation, in a single space.

Page 49: Query Optimization in Object Databases

Clustered Collection Cases

Cluster objects on disk
Reduce the number of I/Os

(Placement trees are represented by directed graphs)

[Figure: two placement trees. Default clustering: COMPANY alone. Simple clustering: COMPANY with PRODUCT.]

Page 50: Query Optimization in Object Databases

More Clustering Cases

[Figure: two placement trees over COMPANY, PRODUCT, COMMAND and PROPOSAL, labeled "Conjunctive clustering" and "Disjunctive clustering"; the labels 5 and 10 appear on edges.]

Page 51: Query Optimization in Object Databases

Clustering: Statistics on Partitions

How many pages will have to be loaded to scan the collection A?

||A||: Cardinality
|A|: Number of disk blocks
SA: Average size of object
Sp: Available page size
DA,B: Average number of distinct references
...

These statistics can be maintained by the system or can be evaluated

Page 52: Query Optimization in Object Databases

Clustering: Statistics Example

[Formulas garbled in extraction: number of pages |A| and |B| to load under the clusterings ClA, ClB and ClAB, expressed with ||A||, ||B||, XA,B, DA,B, ZA,B, SA, SB and Sp, with case distinctions on whether SClAB and ZA,B * SB exceed Sp, where SClAB = SA + ZA,B * SB. The example is illustrated on |COMPANY| and |PRODUCT|.]

Page 53: Query Optimization in Object Databases

Yao': Number of Clustered Block Hits

Yao function: Yao(||C||, |C|, k), with k the number of selected objects, returns the number of block hits

Yao' function: sum of Yao functions applied on each involved cluster

Given a clustered collection C and p the number of partitions to be scanned, we have:

$\mathrm{Yao}'(C, k) = \sum_{i=1}^{p} \mathrm{Yao}(\|C_i\|, |C_i|, k_i)$

where ki is the number of objects to be selected in cluster i

Page 54: Query Optimization in Object Databases

Yao': Example

x in Companies, x.asset > 100 000 kF

x in Companies, x.asset > 100 000 kF and x.product.year < 1980

x in Companies, x.asset > 100 000 kF and x.product.year < 1980 and x.product.command.N° < 1000

[Figure: placement tree COMPANY, PRODUCT, COMMAND; the block hits of each query are given as a sum of Yao terms, one per involved cluster (the arguments of the Yao terms are lost in extraction).]

Page 55: Query Optimization in Object Databases

I/O Join Formulas

Nested loops
  ||C1|| + ||C1||*||C2||

Merge join
  cost(sort(C1)) + cost(sort(C2)) + cost(merge(C1,C2)) + cost(Result)
  on the order of ||C1||*log||C1|| + ||C2||*log||C2|| + ||C1|| + ||C2|| + ...

Hash join
  cost(hash(C1)) + cost(scan(C2)) + |C2|*cost(probe(C1)) + cost(Result)
  on the order of ||C1|| + ||C2|| + ...

Page 56: Query Optimization in Object Databases

Parameters for Links

fanC1,C2 = average number of references from a C1 object to a C2 object

DC1,C2 = number of distinct references from a C1 object to a C2 object

XC1,C2 = number of C1 objects having no reference to a C2 object

ZC1,C2 = average number of distinct references from C1 objects having at least one reference to a C2 object

ZC1,C2 = DC1,C2 * ||C1|| / (||C1|| - XC1,C2)

Page 57: Query Optimization in Object Databases

I/O Path Traversal Formulas

Cost of DFF [Gardarin, Gruser, Tang 96]
  Large memory (no swap)
  Small memory (worst case)

[Formulas garbled in extraction. One is a C1 term plus a sum for i = 2 to n of yao(...) terms over Ci and Xi; the other combines the fan-outs fan(Cj, Cj+1) and the selectivities Sel along the path.]

DFF is efficient with large memory

Page 58: Query Optimization in Object Databases

Exercise: Cost Model

Compare the I/O costs of DFF, BFF and RBFF
Discuss the advantages of each of them according to memory size and predicate selectivity

Page 59: Query Optimization in Object Databases

7. Search Strategies

[Figure: taxonomy of search strategies, with the classes Enumerative, Randomized and Mixed (2 phases) and the strategies Exhaustive Search, Augmentation Heuristic, Iterative Improvement, Simulated Annealing, Genetic Algorithm and Tabu Search.]

from Lanzelotte 1992

Page 60: Query Optimization in Object Databases

Exhaustive Search

Function Exhaustive(Query) {
  p := Parse(Query) ; // Set the initial plan
  S := {p} ; // S is the set of all investigated plans
  while not StopCond() {
    p' := Transform(p) ; // Apply a transformation rule
    if p' ∉ S then {
      p := p' ;
      Insert(S, p') ; // Maintain the set of investigated plans
    }
  }
  return Optimal(S) ; // Select best plan among all generated plans
}
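The exhaustive scheme can be tried on a toy search space. Everything below is an assumption made for illustration: a plan is a left-deep join order (a tuple of collection names), the only transformation rule swaps two adjacent collections, the cost is the sum of intermediate result sizes, and the cardinalities in `CARD` are invented. Unlike the pseudocode, which follows one plan at a time, this sketch keeps a frontier so that every reachable plan is investigated.

```python
from typing import Dict, List, Tuple

Plan = Tuple[str, ...]

# Hypothetical cardinalities of four collections.
CARD: Dict[str, int] = {"A": 1000, "B": 10, "C": 100, "D": 5}


def cost(plan: Plan) -> int:
    """Toy cost: sum of the intermediate result sizes of a left-deep join order."""
    total, size = 0, 1
    for name in plan:
        size *= CARD[name]
        total += size
    return total


def neighbors(plan: Plan) -> List[Plan]:
    """Plans obtained by applying the single rule 'swap two adjacent collections'."""
    return [plan[:i] + (plan[i + 1], plan[i]) + plan[i + 2:]
            for i in range(len(plan) - 1)]


def exhaustive(initial: Plan) -> Tuple[Plan, int]:
    """Generate every reachable plan, then return the cheapest and the number explored."""
    seen = {initial}
    frontier = [initial]
    while frontier:                  # stop condition: nothing left to investigate
        p = frontier.pop()
        for q in neighbors(p):
            if q not in seen:        # maintain the set of investigated plans
                seen.add(q)
                frontier.append(q)
    return min(seen, key=cost), len(seen)


best, explored = exhaustive(("A", "B", "C", "D"))
```

With four collections the 24 join orders are all visited; the factorial growth of that count is what motivates the randomized strategies that follow.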

Page 61: Query Optimization in Object Databases

Illustration of ES

[Figure: tree of plans generated from Parse(Query) by applying rules r1 to r11, up to StopCond (exhausted time); the minimal cost plan is then selected.]

Page 62: Query Optimization in Object Databases

Classical Improvements

Reduce the search space and control rule selection
  Select the most profitable move at each step
  Introduce a gain estimator for each rule
  Apply only rules with the best estimators
  Avoid loops and applying rules in both directions

Such approaches find only a local minimum
  Risk of falling into a hole
  Iterative improvement minimizes the risk

Page 63: Query Optimization in Object Databases

Iterative Improvement Scheme

II randomly chooses an initial processing tree. It then accepts only downhill moves.

This is called local optimization.

When the local condition is reached, II picks a new random state and performs local optimization from that state.

The process is repeated until a stopping condition is met.

The plan retained as the global minimum is the best local minimum found so far.

Page 64: Query Optimization in Object Databases

Iterative Improvement Procedure

Procedure II() {
  p = Initialize() ; // Set an initial state, i.e., pick a random PT for evaluating the query
  OptimalPlan = p ; // Initialize optimal plan
  while not(stopping_condition) do { // Loop for global optimization (on various initial states)
    while not(local_condition) do { // Loop for local optimization
      p' = move(p) ; // Apply a valid transformation to p
      if (cost(p') < cost(p)) then p = p' ; // Keep plan if less costly
    }
    if cost(p) < cost(OptimalPlan) then OptimalPlan = p ; // Select optimal plan
    p = RandomPlan ; // Move to next randomly selected plan
  }
  return(OptimalPlan) ;
}
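A runnable counterpart of the procedure, on a toy search space chosen for illustration only: plans are left-deep join orders, moves swap two adjacent collections, and the cost and cardinalities are invented. The stopping condition is a fixed number of restarts and the local condition is "no cheaper neighbor", both assumptions of this sketch.

```python
import random
from typing import Dict, List, Optional, Sequence, Tuple

Plan = Tuple[str, ...]

# Hypothetical cardinalities of four collections.
CARD: Dict[str, int] = {"A": 1000, "B": 10, "C": 100, "D": 5}


def cost(plan: Plan) -> int:
    """Toy cost: sum of the intermediate result sizes of a left-deep join order."""
    total, size = 0, 1
    for name in plan:
        size *= CARD[name]
        total += size
    return total


def neighbors(plan: Plan) -> List[Plan]:
    """Plans reachable by swapping two adjacent collections."""
    return [plan[:i] + (plan[i + 1], plan[i]) + plan[i + 2:]
            for i in range(len(plan) - 1)]


def iterative_improvement(names: Sequence[str], restarts: int = 5, seed: int = 0) -> Plan:
    rng = random.Random(seed)
    best: Optional[Plan] = None
    for _ in range(restarts):                          # global loop over random starts
        p = tuple(rng.sample(list(names), len(names)))  # random initial state
        while True:                                    # local optimization
            better = [q for q in neighbors(p) if cost(q) < cost(p)]
            if not better:
                break                                  # local minimum reached
            p = rng.choice(better)                     # accept a downhill move only
        if best is None or cost(p) < cost(best):
            best = p
    return best


best = iterative_improvement(["A", "B", "C", "D"])
```

In this particular toy space every local minimum is also the global one, so II always returns the ascending-cardinality order; real plan spaces have many local minima, which is why the restarts matter.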

Page 65: Query Optimization in Object Databases

Illustration of I.I.

[Figure: three descents made of profitable moves (r1, r2; r'1; r"1, r"2), starting from Parse(Query), Rand(Parse(Query)) and Rand(Rand(Parse(Query))); the minimal cost plan among the local minima is selected.]

Page 66: Query Optimization in Object Databases

Simulated Annealing Scheme

SA also starts at a random processing tree and generates the next state by applying a transformation rule to the current processing tree.

Unlike II, SA accepts both downhill and uphill moves. Uphill moves are allowed with the probability

$P = e^{-(\mathrm{cost}(t') - \mathrm{cost}(t)) / \mathrm{temperature}}$

where t is the current processing tree and t' the candidate one.

The temperature parameter decreases when the inner block reaches an equilibrium point. Thus uphill moves are accepted with lower and lower probability.

When a stopping condition is satisfied, the best traversed plan is selected as optimal.

Page 67: Query Optimization in Object Databases

Simulated Annealing Procedure

Procedure SA() {
  p = Initialize() ; // Set an initial state, i.e., pick a random PT for evaluating the query
  OptimalPlan = p ; // Initialize optimal plan
  T = T0 ; // Initialize temperature
  while not(stopping_condition) do { // Loop for global optimization
    while not(equilibrium) do { // Loop for local optimization
      p' = move(p) ; // Apply a valid transformation to p
      delta = cost(p') - cost(p) ; // Compute differential cost
      if (delta < 0) then p = p' ; // If cost reduced, pick new plan
      if (delta > 0) then p = p' with probability e^(-delta/T) ; // If cost increased, accept if hot
      if cost(p) < cost(OptimalPlan) then OptimalPlan = p ; // Maintain optimal plan
    }
    T = reduce(T) ; // Reduce temperature
  }
  return(OptimalPlan) ;
}
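The procedure can be exercised on a toy search space. The plan representation, the cost function, the cardinalities, the geometric cooling schedule and the fixed number of moves per temperature (standing in for the equilibrium test) are all assumptions of this sketch, not parameters given in the slides.

```python
import math
import random
from typing import Dict, List, Tuple

Plan = Tuple[str, ...]

# Hypothetical cardinalities of four collections.
CARD: Dict[str, int] = {"A": 1000, "B": 10, "C": 100, "D": 5}


def cost(plan: Plan) -> int:
    """Toy cost: sum of the intermediate result sizes of a left-deep join order."""
    total, size = 0, 1
    for name in plan:
        size *= CARD[name]
        total += size
    return total


def neighbors(plan: Plan) -> List[Plan]:
    """Plans reachable by swapping two adjacent collections."""
    return [plan[:i] + (plan[i + 1], plan[i]) + plan[i + 2:]
            for i in range(len(plan) - 1)]


def simulated_annealing(initial: Plan, t0: float = 1e6, cooling: float = 0.9,
                        steps_per_temp: int = 20, t_min: float = 1.0,
                        seed: int = 0) -> Plan:
    rng = random.Random(seed)
    p = best = initial
    t = t0
    while t > t_min:                              # stopping condition: system is frozen
        for _ in range(steps_per_temp):           # stands in for the equilibrium test
            q = rng.choice(neighbors(p))
            delta = cost(q) - cost(p)
            # Downhill moves are always taken, uphill ones with probability e^(-delta/T).
            if delta < 0 or rng.random() < math.exp(-delta / t):
                p = q
            if cost(p) < cost(best):
                best = p                          # maintain the best traversed plan
        t *= cooling                              # reduce the temperature
    return best


start = ("A", "C", "B", "D")
best = simulated_annealing(start)
```

Because the best traversed plan is tracked separately from the current one, accepting uphill moves can never make the returned plan worse than the starting plan.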

Page 68: Query Optimization in Object Databases

Illustration of S.A.

[Figure: cost as a function of moves, starting from Parse(Query). Moves r1, r3, r4 and r6 are profitable, moves r2 and r5 are non-profitable; the best traversed plan is the selected plan.]

Page 69: Query Optimization in Object Databases

Tabu Search Scheme

TS is a general meta-heuristic procedure for global optimization, which performs an aggressive exploration of the state space (best possible move, with a restriction list).

TS starts from a randomly generated initial state and repeatedly moves from a state to a neighbor one.

At each iteration the procedure generates a subset V* of the set N(S) of the neighbors of the current state S and selects the best.

The subset does not contain any state recorded in the Tabu list. This avoids cycling or at least reduces its probability. The Tabu list is updated each time the current state is updated. This forbids moves that would lead back to a previously explored state.

Page 70: Query Optimization in Object Databases

Tabu Search Procedure

Procedure TS() {
  p = Initialize() ; // Set an initial state, i.e., pick a random PT for evaluating the query
  OptimalPlan = p ; // Initialize optimal plan
  T = ∅ ; // Initialize Tabu list
  while not(stopping_condition) do { // Global loop
    generate the set V* ⊆ N(S) - T by applying move(S) ; // All moves accepted except tabu ones
    choose the best solution p ∈ V* ; // Pick best move
    T = (T - {oldest}) ∪ {p} ; // Update the tabu list by removing the oldest plan and adding the picked one
    if cost(p) < cost(OptimalPlan) then OptimalPlan = p ; // Maintain optimal plan
  }
  return(OptimalPlan) ;
}
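A runnable sketch of the tabu procedure on a toy search space. The plan representation, cost function and cardinalities are invented for illustration; the tabu list is a bounded queue of visited plans (a `deque` with `maxlen` drops the oldest entry automatically), and V* is taken to be all non-tabu neighbors, which is one possible choice among many.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

Plan = Tuple[str, ...]

# Hypothetical cardinalities of four collections.
CARD: Dict[str, int] = {"A": 1000, "B": 10, "C": 100, "D": 5}


def cost(plan: Plan) -> int:
    """Toy cost: sum of the intermediate result sizes of a left-deep join order."""
    total, size = 0, 1
    for name in plan:
        size *= CARD[name]
        total += size
    return total


def neighbors(plan: Plan) -> List[Plan]:
    """Plans reachable by swapping two adjacent collections."""
    return [plan[:i] + (plan[i + 1], plan[i]) + plan[i + 2:]
            for i in range(len(plan) - 1)]


def tabu_search(initial: Plan, iterations: int = 50, tabu_size: int = 5) -> Plan:
    p = best = initial
    tabu: Deque[Plan] = deque([initial], maxlen=tabu_size)
    for _ in range(iterations):                 # stopping condition: move budget
        candidates = [q for q in neighbors(p) if q not in tabu]
        if not candidates:
            break                               # every neighbor is tabu
        p = min(candidates, key=cost)           # best non-tabu move, even if uphill
        tabu.append(p)                          # oldest entry is dropped when full
        if cost(p) < cost(best):
            best = p                            # maintain the optimal plan
    return best


best = tabu_search(("A", "C", "B", "D"))
```

Once the minimum is reached the search keeps moving, uphill if necessary, because the minimum itself is now tabu; this is what lets tabu search escape local minima in less regular spaces.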

Page 71: Query Optimization in Object Databases

Comparison of Strategies

[Figure: cost of the best plan as a function of the number of moves (0 to 1400) for SA, II Swap, II Join exchange and Tabu.]

II is the best with good random sampling

Tabu looks attractive

Page 72: Query Optimization in Object Databases

Genetic Algorithm

A Genetic Algorithm (GA) is a non-gradient optimization algorithm used to search for local extrema (minimum or maximum) of functions with many variables and many extrema.

These functions are usually defined on very complex and discrete domains.

The basic idea of GA is to use the principles of evolution of organisms in nature.

Instead of working on one particular solution at a time, it considers a population of solutions.

Page 73: Query Optimization in Object Databases

GA Principle

[Figure: flowchart with the steps Initialisation, Mutation, Crossover, Evaluation, Sort, Selection and a Terminate test (Yes: stop; No: loop back).]

Page 74: Query Optimization in Object Databases

GA Phases

Initialization: randomly generate an initial small population of solutions (i.e., processing trees) from the whole search space.

Mutation: choose one solution (i.e., processing tree) from the population and apply transformation rules to it.

Crossover: randomly choose two solutions from the population and exchange their common subtrees in order to generate two new processing trees.

Evaluation: for each solution, evaluate the value of its fitness function (i.e., cost function).

Sort: sort all solutions according to their cost values.

Selection: choose a certain number of the best solutions from the result of Sort as the parents of the next generation.

Termination: check the termination criteria for stopping the optimization.

Page 75: Query Optimization in Object Databases

Gene Base for 5 Collections

[Figure: gene base over the collections 0 to 4. Genes between neighbor collections are FJ/PI/RJ (i, i+1) and FJ/PI/RJ (i+1, i); genes over longer sub-paths are DFF/PI (i, j) in both directions, e.g., DFF/PI (0, 4), DFF/PI (1, 3), DFF/PI (4, 0).]

Page 76: Query Optimization in Object Databases

Mutation Operator

[Figure: examples of mutations on processing trees built from DFF, FJ, RJ and PI operators over collections A, B, C and D, including the use of a reverse link.]

Crossover Operator

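[Figure: crossover example. Two processing trees over collections A to G, built from the RJ, FJ and DFF operators, exchange subtrees to produce two new processing trees.]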

Improved GA

[Figure: improved GA flowchart. Boxes: Initialisation, Mutation, Crossover, Evaluation, Sort, Selection, a Terminate test with Yes and No branches, and an added Random Gene Generator.]

GA Procedure

Procedure GA() {
  Generate the initial population : Popu[BasePopu] ;        // Initialize the base population at random
  Sort(Popu) ;                                              // Sort population of PTs on increasing cost
  OptimalPlan = Popu[0] ;                                   // Keep the best traversed plan
  While not (stopping_condition) do {
    Percent = 0 ;
    While Percent < Part * BasePopu do {                    // Apply Crossover to Part of the population
      p1 = Popu[Random(BasePopu)] ;                         // Randomly choose p1 and p2 from Popu
      p2 = Popu[Random(BasePopu)] ;
      Crossover(p1, p2) ;                                   // Apply Crossover if possible
      Percent = Percent + 2 ; }
    For the rest in Popu do Mutation ;                      // Apply Mutation for the rest of the population
    For (i=0 ; i < NewPopu ; i++) do evaluate(Popu[i]) ;    // Compute cost for new population
    Sort(Popu) ;                                            // Sort population of PTs on increasing cost
    if (Popu[0] < OptimalPlan) then OptimalPlan = Popu[0] ; // Keep the best traversed plan
    // Optional replacement of the worst elements
    Percent = 0 ; i = BasePopu ;                            // Initialize for replacement
    While Percent < Repl * BasePopu do {                    // Apply replacement to Repl of the population
      p = RandomPlan() ;                                    // Pick a random PT for insertion in Popu
      Popu[i] = p ;                                         // Replace worst plan by p
      i = i-1 ;                                             // Prepare for next replacement
      Percent = Percent + 1 ; } }
  return(OptimalPlan) ;
}
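The procedure above can be tried out on a toy problem. The following Python sketch is an interpretation under stated assumptions, not the author's implementation: a plan is simplified to an ordering of collections instead of a processing tree, cost is a made-up cost function, mutation swaps two positions, and crossover is a simple order-preserving exchange standing in for the exchange of common subtrees. The parameters part and repl play the roles of Part and Repl.

```python
import random

def cost(plan, sizes, selectivity=0.01):
    """Hypothetical cost: sum of intermediate result sizes for a linear order."""
    inter, total = sizes[plan[0]], 0.0
    for c in plan[1:]:
        inter = inter * sizes[c] * selectivity
        total += inter
    return total

def random_plan(n, rng):
    plan = list(range(n))
    rng.shuffle(plan)
    return plan

def mutate(plan, rng):
    """Stand-in for a transformation rule: swap two collections."""
    i, j = rng.sample(range(len(plan)), 2)
    child = plan[:]
    child[i], child[j] = child[j], child[i]
    return child

def crossover(p1, p2, rng):
    """Stand-in for subtree exchange: keep a prefix, fill in the other's order."""
    cut = rng.randrange(1, len(p1))
    def cross(a, b):
        head = a[:cut]
        return head + [c for c in b if c not in head]
    return cross(p1, p2), cross(p2, p1)

def ga(sizes, base_popu=20, part=0.5, repl=0.1, generations=50, seed=0):
    rng = random.Random(seed)
    n = len(sizes)
    key = lambda p: cost(p, sizes)
    popu = sorted((random_plan(n, rng) for _ in range(base_popu)), key=key)
    best = popu[0]                                   # best traversed plan
    for _ in range(generations):                     # stopping condition
        offspring = []
        while len(offspring) < part * base_popu:     # crossover on Part of the population
            p1, p2 = rng.choice(popu), rng.choice(popu)
            offspring.extend(crossover(p1, p2, rng))
        for plan in popu[len(offspring):]:           # mutation for the rest
            offspring.append(mutate(plan, rng))
        # Evaluate, sort on increasing cost, select the best base_popu plans.
        popu = sorted(popu + offspring, key=key)[:base_popu]
        if key(popu[0]) < key(best):
            best = popu[0]
        for i in range(1, int(repl * base_popu) + 1):
            popu[-i] = random_plan(n, rng)           # replace the worst plans
    return best

sizes = [1000, 10, 500, 50, 200, 5]                  # hypothetical cardinalities
best = ga(sizes)
print(best, cost(best, sizes))
```

Keeping the best traversed plan outside the population means the random replacement step can never lose it, which is why the returned cost is never worse than the best plan of the initial population.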

8. Open Problems

Efficient control of rule applications
  What is the best strategy (Genetic ?)
  Simple but accurate rule gain estimator (Priorities ?)

Estimation of method costs
  Statistics : keep average and variations at each call
  Revelation : the user provides a cost estimate attribute
  Disencapsulation : understand the method code
  Problem : late binding complicates the estimation

Querying bulk types
  Collections may be lists, arrays, trees or matrices
  Ordered collections may require additional operators
  Cost model should be extended to capture aggregates
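For the "Statistics" option listed under the estimation of method costs, the average and variation can be maintained incrementally at each call. The sketch below is a minimal illustration using Welford's running mean and variance; the class name MethodCostStats and the cost unit are assumptions made for the example.

```python
class MethodCostStats:
    """Running mean and variance of observed method costs (Welford's method)."""

    def __init__(self):
        self.calls = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def record(self, observed_cost):
        """Update the statistics after one method call."""
        self.calls += 1
        delta = observed_cost - self.mean
        self.mean += delta / self.calls
        self._m2 += delta * (observed_cost - self.mean)

    @property
    def variance(self):
        """Sample variance; 0.0 until at least two calls have been seen."""
        return self._m2 / (self.calls - 1) if self.calls > 1 else 0.0

stats = MethodCostStats()
for observed in [2, 4, 4, 4, 5, 5, 7, 9]:  # hypothetical costs, e.g. in ms
    stats.record(observed)
print(stats.mean, stats.variance)
```

Only three numbers are stored per method, so the optimizer can keep such statistics for every method without retaining the call history.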

Exercise : Strategies

Discuss the advantages and drawbacks of each search strategy

Compare them in the case of a small rule base and of a large rule base

Conclusion

The Query Optimizer is a key component of a DBMS

Relational techniques can be generalized
  Object algebra
  Cost model
  Search strategies

New techniques are required for :
  Extensibility of data types
  Optimizing path expressions
  Optimizing method calls
  Optimizing bulk data types
  Path index maintenance and access

Not much has been done on the new features of OODBMSs

For More Information (1)

Finance, Gardarin, IEEE DE 91 : LORA algebra and EDS extensible optimizer

Finance, Gardarin, DKE 93 : Rule language for extensible optimizer

Gardarin, Gruser, Tang, VLDB 95 : Cost model for OODBs : Scan, DFF, validation on O2

Gardarin, Gruser, Tang, VLDB 96 : Analytical & experimental comparisons of DFF, BFF, RBFF

Gardarin, in Advances in OO DB Systems, Springer 94 : Object rule language, optimisation of recursive updates & extension of LORA to recursion

For More Information (2)

Mitchell, Zdonik, Dayal, in Advances in OO DB Systems, Springer Verlag 94 : Optimization of OO Query Languages : Problems and Approaches

Cluet, Delobel, SIGMOD 92 : A General Framework for the Optimization of Object-Oriented Queries

Kemper, Moerkotte, VLDB 90 : Advanced Query Processing in Object Bases Using Access Support Relations

Lanzelotte, Valduriez, VLDB 91 : Extending the Search Strategy in a Query Optimizer

Path Index

Multi-index [Gemstone]
  Paths are sequences of variables belonging to the structure of the objects.
  An index is defined for each link of the path.
  The indexes hold object identifiers or variable values.
  Implemented as B+ trees.

Nested index [Bertino 89]
  A single entry is defined for the whole length of a path.
  Gives access to the start of the path only, given the end of the path.
  Useful only if the path is known exactly and is used often.
  Difficult to maintain.

Path index [Bertino 89] [Kemper 90]
  Associates the end of the path with all the suffixes of the path.
  It works with sub-paths.
  Implemented as relations :
    each column of a tuple corresponds to one step of the path
    each field of the tuple contains an object identifier or a value.
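The relational layout of a path index can be sketched in a few lines of Python. This is a hedged illustration only: the class PathIndex, the path Person.car.maker.name, and the object identifiers are hypothetical, and a real system would store the relation in an indexed structure instead of an in-memory dictionary.

```python
from collections import defaultdict

class PathIndex:
    """Path index stored as a relation: one tuple per path instance.

    Each column is one step of the path; the last column holds the end
    value, the other columns hold object identifiers.
    """

    def __init__(self, rows):
        self.by_end = defaultdict(list)
        for row in rows:
            self.by_end[row[-1]].append(row)

    def lookup(self, end_value, step=0):
        """Return the objects at position `step` whose path ends in end_value."""
        return sorted({row[step] for row in self.by_end.get(end_value, [])})

# Hypothetical path Person.car.maker.name
rows = [
    ("p1", "c1", "m1", "Renault"),
    ("p2", "c2", "m1", "Renault"),
    ("p3", "c3", "m2", "Fiat"),
]
index = PathIndex(rows)
print(index.lookup("Renault"))          # persons: full path
print(index.lookup("Renault", step=1))  # cars: sub-path car.maker.name
```

Because every intermediate identifier is kept in the tuple, the same structure answers queries on sub-paths, which a nested index keeping only the start and the end of the path cannot do.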