V Storage Manager

Shahram Ghandeharizadeh
Computer Science Department
University of Southern California

Two New Methods

void PrintStats()
  Prints the cache hit rate and byte hit rate observed by your implementation of the V storage manager.

void DumpCache()
  For each data-zone name, shows the object-id and size of every object occupying the cache.
  Concludes with a summary table showing, per data zone: data-zone name, number of objects, average object size, min object size, max object size.
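The bookkeeping behind these two methods can be sketched as follows. This is only an illustration under stated assumptions: the types CacheStats, CachedObject, and ZoneSummary, and the parameters passed to PrintStats and DumpCache, are hypothetical and not part of the V interface (the slide declares both methods without arguments). Hit rate is taken as hits divided by Get requests, and byte hit rate as bytes served from the cache divided by bytes requested.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>
#include <vector>

// Hypothetical counters; none of these names come from the assignment.
struct CacheStats {
    std::uint64_t requests = 0, hits = 0;            // Get requests and cache hits
    std::uint64_t bytesRequested = 0, bytesHit = 0;  // bytes asked for / served from cache

    // Record one Get of `size` bytes and whether the cache served it.
    void RecordGet(std::uint64_t size, bool hit) {
        ++requests;
        bytesRequested += size;
        if (hit) { ++hits; bytesHit += size; }
    }
    double HitRate() const {
        assert(hits <= requests);
        return requests ? static_cast<double>(hits) / requests : 0.0;
    }
    double ByteHitRate() const {
        return bytesRequested ? static_cast<double>(bytesHit) / bytesRequested : 0.0;
    }
};

// Hypothetical description of one cached object.
struct CachedObject { std::string zone; std::uint32_t objectId; std::uint64_t size; };

struct ZoneSummary {
    std::uint64_t count = 0, total = 0, minSize = 0, maxSize = 0;
    double Average() const { return count ? static_cast<double>(total) / count : 0.0; }
};

// Group cached objects by data zone and accumulate count, total, min, and max.
std::map<std::string, ZoneSummary> Summarize(const std::vector<CachedObject>& cache) {
    std::map<std::string, ZoneSummary> out;
    for (const CachedObject& o : cache) {
        ZoneSummary& z = out[o.zone];
        z.minSize = (z.count == 0) ? o.size : std::min(z.minSize, o.size);
        z.maxSize = std::max(z.maxSize, o.size);
        z.total += o.size;
        ++z.count;
    }
    return out;
}

void PrintStats(const CacheStats& s) {
    std::printf("cache hit rate: %.4f\nbyte hit rate: %.4f\n", s.HitRate(), s.ByteHitRate());
}

void DumpCache(const std::vector<CachedObject>& cache) {
    // Per-object listing: data zone, object-id, size.
    for (const CachedObject& o : cache)
        std::printf("%s\t%u\t%llu\n", o.zone.c_str(), static_cast<unsigned>(o.objectId),
                    static_cast<unsigned long long>(o.size));
    // Summary table, one row per data zone.
    std::printf("zone\t#objects\tavg\tmin\tmax\n");
    const std::map<std::string, ZoneSummary> summary = Summarize(cache);
    for (const auto& kv : summary)
        std::printf("%s\t%llu\t%.1f\t%llu\t%llu\n", kv.first.c_str(),
                    static_cast<unsigned long long>(kv.second.count), kv.second.Average(),
                    static_cast<unsigned long long>(kv.second.minSize),
                    static_cast<unsigned long long>(kv.second.maxSize));
}
```

A real implementation would update the counters inside Get and walk its own cache structure in DumpCache; the vector here only stands in for that structure.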

Minor Modifications

The Create method should check whether its input dzname already exists. If so, it should not create the data zone a second time.

Trace Driven Evaluation

A trace consisting of the following information:

Trs,02/26/2009 23:07:48.4856586,Get,24,113392298,2797,0
Trs,02/26/2009 23:07:48.4856586,Get,17,120188330,64,0
Trs,02/26/2009 23:07:48.6262836,Save,24,89768490,1404,0
Trs,02/26/2009 23:07:48.7200336,Get,11,12671850,11155,0
Trs,02/26/2009 23:07:48.7200336,Get,5,449482986,229,0
Trs,02/26/2009 23:07:49.2200336,Delete,11,49174954,0,0
Trs,02/26/2009 23:07:49.2200336,Get,2,14161514,117,0
Trs,02/26/2009 23:07:49.2200336,Delete,11,444312042,0,0

Each record holds a time stamp, a command (one of “Get”, “Delete”, or “Save”), and then the data-zone name, key, and size of the value.
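A record in this format could be parsed roughly as below. The names TraceRecord, Command, and ParseTraceLine are hypothetical. The sketch assumes comma-separated fields in the order shown in the sample lines, and it ignores the leading "Trs" tag and the trailing field because the slide does not describe them.

```cpp
#include <cassert>
#include <exception>
#include <sstream>
#include <string>
#include <vector>

enum class Command { Get, Save, Delete };

// Hypothetical in-memory form of one trace record.
struct TraceRecord {
    std::string timestamp;
    Command cmd = Command::Get;
    int zone = 0;   // data-zone name (0 to 24 in this trace)
    int key = 0;
    int size = 0;   // size of the value in bytes
};

// Parses one line such as
//   Trs,02/26/2009 23:07:48.4856586,Get,24,113392298,2797,0
// Returns false if the line does not have the expected shape.
bool ParseTraceLine(const std::string& line, TraceRecord* out) {
    assert(out != nullptr);
    std::vector<std::string> fields;
    std::stringstream ss(line);
    std::string field;
    while (std::getline(ss, field, ',')) fields.push_back(field);
    if (fields.size() < 6) return false;

    TraceRecord r;
    r.timestamp = fields[1];
    if (fields[2] == "Get") r.cmd = Command::Get;
    else if (fields[2] == "Save") r.cmd = Command::Save;
    else if (fields[2] == "Delete") r.cmd = Command::Delete;
    else return false;

    try {
        r.zone = std::stoi(fields[3]);
        r.key = std::stoi(fields[4]);
        r.size = std::stoi(fields[5]);
    } catch (const std::exception&) {
        return false;  // non-numeric or out-of-range field
    }
    *out = r;
    return true;
}
```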

Trace Driven Evaluation

Gathered from a powerful production server at MySpace.
Idea: Submit the requests as fast as possible. We probably cannot submit requests at the rate at which this server processed them.
Objective: A multi-threaded workload generator that issues the requests as fast as possible.

Workload Generator

Consists of a main thread and N worker threads.
The main thread is in charge of reading a trace file and populating main-memory data structures for the worker threads.
Each worker thread reads its corresponding main-memory data structure element and invokes the corresponding Get, Insert, and Delete methods of V.

Main Thread

DZ-array: an array of 25 data-zones, each represented as a Vdt element. The data field is one character long (its size is one) and holds one of the values 0 to 24.
Invokes the Create method of the V storage manager for each element of the array. If the data-zone already exists, Create returns with an error message.
Populates a linked list of P EachSecond elements, each consisting of m trace elements:

  struct EachSecond {
      char Assigned[MAXTHREADS];
      char Complete[MAXTHREADS];
      int NumTraceElts[MAXTHREADS];
      int *key[MAXTHREADS];
      int *zoneid[MAXTHREADS];
      int *size[MAXTHREADS];
      cmndType *cmnd[MAXTHREADS];
      struct EachSecond *next;
  } OneSecond;

Starts up N worker threads, each with a unique id from 0 to N-1.

Worker Thread

A worker thread with id j corresponds to element j of each array in EachSecond.
Iterates over the NumTraceElts[j] trace elements (indices 0 to NumTraceElts[j]-1) of cmnd[j], zoneid[j], key[j], and size[j].
cmndType is {Get, Save, Delete}, invoking your implementation of Get, Insert, and Delete, respectively.
Zoneid is an index into the DZ-array of the main thread.
Key is a 4-byte char: the Vdt data field holds it as characters and its length is 4 bytes.

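The worker's inner loop might look like the sketch below. It is not the V interface: WorkerSlice is a hypothetical view of one thread's entries in an EachSecond element, FakeStore is a stand-in for your storage manager, and the key is copied into a 4-byte character buffer to mirror the Vdt description above.

```cpp
#include <cassert>
#include <cstring>
#include <map>
#include <string>
#include <utility>

enum class cmndType { Get, Save, Delete };

// Hypothetical view of thread j's entries in one EachSecond element.
struct WorkerSlice {
    int numTraceElts;
    const int* key;
    const int* zoneid;
    const int* size;
    const cmndType* cmnd;
};

// Stand-in for the V storage manager; only tracks which keys are present.
struct FakeStore {
    std::map<std::pair<int, std::string>, int> objects;  // (zone, key) -> size
    int hits = 0, misses = 0;

    void Get(int zone, const std::string& key) {
        if (objects.count(std::make_pair(zone, key))) ++hits; else ++misses;
    }
    void Insert(int zone, const std::string& key, int size) {
        objects[std::make_pair(zone, key)] = size;
    }
    void Delete(int zone, const std::string& key) {
        objects.erase(std::make_pair(zone, key));
    }
};

static_assert(sizeof(int) == 4, "the sketch assumes 4-byte int keys");

// Replays one worker's trace elements against the store.
void RunWorker(const WorkerSlice& w, FakeStore& store) {
    assert(w.numTraceElts >= 0);
    for (int i = 0; i < w.numTraceElts; ++i) {
        char keyBytes[4];
        std::memcpy(keyBytes, &w.key[i], sizeof keyBytes);  // key as 4 bytes of char data
        const std::string key(keyBytes, sizeof keyBytes);
        switch (w.cmnd[i]) {
            case cmndType::Get:    store.Get(w.zoneid[i], key); break;
            case cmndType::Save:   store.Insert(w.zoneid[i], key, w.size[i]); break;
            case cmndType::Delete: store.Delete(w.zoneid[i], key); break;
        }
    }
}
```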

Synchronization

The main thread and worker threads are synchronized using handles (HANDLE hEvent).
The main thread creates an event and assigns it to hEvent.
After populating the P elements of the EachSecond linked list and activating N threads to process the elements, the main thread does:

  for (int k = 0; k < N; k++) WaitForSingleObject(hEvent, INFINITE);

When a worker thread is done with one EachSecond element, it sets Complete[thread-id] to 1 and does a SetEvent(hEvent).
Once the main thread falls out of the for loop, it checks whether the first N elements of the Complete array in the current EachSecond are set. If so, all N threads are done with this EachSecond element.
The main thread then populates this EachSecond element with additional trace elements and links it to the end of the P list.
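The handshake can be illustrated with a portable sketch. Note the substitution: instead of the Win32 event calls named on the slide, it uses std::mutex and std::condition_variable with a counter of pending signals, so a completion is not lost if several workers finish before the main thread starts waiting. Completion, MarkDone, WaitOne, and RunOneElement are hypothetical names.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

struct Completion {
    std::mutex m;
    std::condition_variable cv;
    std::vector<char> complete;  // plays the role of Complete[MAXTHREADS]
    int signals = 0;             // completions signalled but not yet consumed

    explicit Completion(int n) : complete(n, 0) {}

    // Worker side: set Complete[threadId] and wake the main thread
    // (the role SetEvent plays on the slide).
    void MarkDone(int threadId) {
        {
            std::lock_guard<std::mutex> lock(m);
            complete[threadId] = 1;
            ++signals;
        }
        cv.notify_one();
    }

    // Main side: block until one completion has been signalled
    // (the role of one WaitForSingleObject call).
    void WaitOne() {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return signals > 0; });
        --signals;
    }

    bool AllDone() {
        std::lock_guard<std::mutex> lock(m);
        for (char c : complete) if (!c) return false;
        return true;
    }
};

// Starts n workers, waits n times, then checks the Complete flags.
bool RunOneElement(int n, void (*work)(int)) {
    assert(n > 0);
    Completion c(n);
    std::vector<std::thread> workers;
    for (int id = 0; id < n; ++id)
        workers.emplace_back([&c, work, id] { work(id); c.MarkDone(id); });
    for (int k = 0; k < n; ++k) c.WaitOne();  // mirrors the main thread's for loop
    const bool ok = c.AllDone();
    for (std::thread& t : workers) t.join();
    return ok;
}
```

In the assignment's setting the workers keep running across EachSecond elements rather than being joined after each one; the join here only keeps the sketch self-contained.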

Termination Condition

Once the main thread hits the end of the trace file, it waits for the N threads to complete.
Once a worker thread has processed all elements of the P linked list (it encounters a NULL next pointer), it terminates by returning.

An Overview of Query Optimization in Relational Systems (by S. Chaudhuri)

Shahram Ghandeharizadeh
Computer Science Department
University of Southern California

Terminology

A SQL relational database management system consists of:
  Query optimizer
  Query execution engine: implements a set of physical operators.
    An operator consumes one or more data streams and produces an output data stream, e.g., sort, sequential scan, index scan, nested loop join, etc.
    A physical operator tree, or execution plan, ties different operators to one another.

Terminology (Cont…)

A physical property is any characteristic of a plan that is not shared by all plans for the same logical expression, but can impact the cost of subsequent operations. Example: the concept of interesting order in System R.

Query Execution Engine

The RSS of System R is a query execution engine.
Is Berkeley-DB a query execution engine?

Query Optimizer

Generates the most efficient execution plan for the execution engine.
  Non-trivial because there can be a large number of possible trees.
  The optimization criteria might be different: throughput versus response time.
Optimization as a search problem, providing:
  A space of plans, the search space.
  A cost estimation technique.
  An enumeration algorithm to search through the search space.
Ideal optimizer:
  The search space includes low-cost plans,
  Cost estimation techniques are accurate,
  Enumeration algorithm is efficient.

Basic Estimation Framework

1. Collect statistical summaries of the data that has been stored.
2. Given an operator and the statistical summary for each of its input data streams, determine the:
   Statistical summary of the output data stream,
   Estimated cost of executing the operation.

EQUALITY JOIN (Cont…)

Example: Assume the following statistics on the Employee and Department relations: t(Dept)=1000 tuples, P(Dept)=100 disk pages, ν(Dept,dno)=1000, ν(Dept,dname)=500, t(Employee)=100,000 tuples, and P(Employee)=10,000 pages. Note that 10 tuples of each relation fit on a disk page. Assume that a concatenation of one Employee and one Dept record is wide enough that a disk page holds five such records.

Let's compare the cost of two alternative algebraic expressions for processing a query that retrieves those employees that work for the toy department:
  σ_dname=Toy(Emp ⋈ Dept)
  Emp ⋈ σ_dname=Toy(Dept)

EQUALITY JOIN (Cont…)

σ_dname=Toy(Emp ⋈ Dept)

Plan: Emp ⋈ Dept produces Tmp1; σ_dname=Toy over Tmp1 produces Tmp2.

t(Tmp1) = t(Emp) × t(Dept) / ν(Dept, dno) = 100,000 × 1000 / 1000 = 100,000
P(Tmp1) = 100,000 / 5 = 20,000
C(⋈) = P(Dept) + P(Emp) × P(Dept) = 100 + 10,000 × 100 = 1,000,100 (page nested loop)
Cw(Tmp1) = 20,000
C(σ) = 20,000
t(Tmp2) = t(Tmp1) / ν(Dept, dname) = 100,000 / 500 = 200
P(Tmp2) = 200 / 5 = 40
Cw(Tmp2) = 40
Cost = C(⋈) + Cw(Tmp1) + C(σ) + Cw(Tmp2) = 1,000,100 + 20,000 + 20,000 + 40 = 1,040,140 I/O

EQUALITY JOIN (Cont…)

Emp ⋈ σ_dname=Toy(Dept)

Plan: σ_dname=Toy over Dept produces Tmp1; Emp ⋈ Tmp1 produces Tmp2.

C(σ) = 100
t(Tmp1) = t(Dept) / ν(Dept, dname) = 1000 / 500 = 2
P(Tmp1) = 1
Cw(Tmp1) = 1
C(⋈) = P(Tmp1) + P(Tmp1) × P(Emp) = 1 + 1 × 10,000 = 10,001 (page nested loop)
t(Tmp2) = t(Emp) × t(Tmp1) / ν(Emp, dno) = 100,000 × 2 / 1000 = 200
P(Tmp2) = 200 / 5 = 40
Cw(Tmp2) = 40
Cost = C(σ) + Cw(Tmp1) + C(⋈) + Cw(Tmp2) = 100 + 1 + 10,001 + 40 = 10,142 I/O
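The two cost computations can be replayed in code to check the arithmetic. The struct and function names below are hypothetical. The value ν(Emp, dno) = 1000 is taken from the computation above rather than from the stated statistics, and page counts are rounded up to whole pages.

```cpp
#include <cassert>

// Statistics from the example; field names are illustrative only.
struct Stats {
    long long tDept = 1000, pDept = 100;
    long long vDeptDno = 1000, vDeptDname = 500;
    long long tEmp = 100000, pEmp = 10000;
    long long vEmpDno = 1000;       // as used in the second plan's computation
    long long basePerPage = 10;     // tuples of Emp or Dept per page
    long long joinedPerPage = 5;    // concatenated Emp+Dept records per page
};

// Number of pages needed for `tuples` records, rounded up.
long long Pages(long long tuples, long long perPage) {
    assert(perPage > 0);
    return (tuples + perPage - 1) / perPage;
}

// Join first, then select: sigma(Emp join Dept).
long long CostSelectAfterJoin(const Stats& s) {
    const long long tTmp1 = s.tEmp * s.tDept / s.vDeptDno;   // join result size
    const long long pTmp1 = Pages(tTmp1, s.joinedPerPage);
    const long long cJoin = s.pDept + s.pEmp * s.pDept;      // page nested loop
    const long long cSelect = pTmp1;                         // read Tmp1 once
    const long long tTmp2 = tTmp1 / s.vDeptDname;
    const long long pTmp2 = Pages(tTmp2, s.joinedPerPage);
    return cJoin + pTmp1 + cSelect + pTmp2;                  // pTmp1, pTmp2: write costs
}

// Select first, then join: Emp join sigma(Dept).
long long CostSelectBeforeJoin(const Stats& s) {
    const long long cSelect = s.pDept;                       // scan Dept once
    const long long tTmp1 = s.tDept / s.vDeptDname;
    const long long pTmp1 = Pages(tTmp1, s.basePerPage);
    const long long cJoin = pTmp1 + pTmp1 * s.pEmp;          // page nested loop
    const long long tTmp2 = s.tEmp * tTmp1 / s.vEmpDno;
    const long long pTmp2 = Pages(tTmp2, s.joinedPerPage);
    return cSelect + pTmp1 + cJoin + pTmp2;
}
```

With the default statistics, the first function returns 1,040,140 and the second 10,142, matching the two totals above; pushing the selection below the join is roughly a hundred times cheaper in this example.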

Statistical Summaries of Data

Equi-height histograms

Review of System R

Use of dynamic programming and interesting order as a heuristic.
Dynamic programming when optimizing a query with multiple join predicates:
  Assumption: To obtain an optimal plan for a query Q consisting of k joins, it suffices to consider only the optimal plans for subexpressions of Q that consist of (k-1) joins and extend those plans with an additional join.
  Prune suboptimal plans for subexpressions of Q consisting of (k-1) joins.
  For example, the optimal plan for {R1, R2, R3, R4} is obtained by picking the cheapest plan from among the optimal plans for:
    Join({R1, R2, R3}, R4)
    Join({R1, R2, R4}, R3)
    Join({R1, R3, R4}, R2)
    Join({R2, R3, R4}, R1)
  Instead of analyzing O(n!) plans, it considers O(n 2^(n-1)) plans, where n is the number of relations.
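The enumeration can be sketched as a dynamic program over subsets of relations. This is a simplified illustration, not System R's optimizer: it keeps one best plan per subset, ignores interesting orders and access-path choices, and takes the cost model as a caller-supplied function. OptimizeLeftDeep, DpResult, and joinCost are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <limits>
#include <vector>

struct DpResult {
    double bestCost;                // cost of the cheapest plan for all n relations
    std::uint64_t plansConsidered;  // number of (subset, last relation) pairs examined
};

// Relations are numbered 0..n-1 and a subset is a bitmask.
// joinCost(left, r) is an assumed cost model: the cost of joining the
// result of subset `left` with relation r.
DpResult OptimizeLeftDeep(int n, const std::function<double(unsigned, int)>& joinCost) {
    assert(n >= 1 && n <= 20);
    const unsigned full = (1u << n) - 1;
    std::vector<double> best(full + 1u, std::numeric_limits<double>::infinity());
    best[0] = 0.0;  // empty subexpression costs nothing
    std::uint64_t considered = 0;

    // Subsets are visited in increasing mask order, so best[rest] is
    // already final whenever rest is a proper subset of s.
    for (unsigned s = 1; s <= full; ++s) {
        for (int r = 0; r < n; ++r) {
            if (!(s & (1u << r))) continue;
            const unsigned rest = s & ~(1u << r);  // subexpression with one fewer relation
            ++considered;
            // Extend the optimal plan for `rest` with one more join;
            // a single relation (rest == 0) involves no join.
            const double cost = best[rest] + (rest ? joinCost(rest, r) : 0.0);
            if (cost < best[s]) best[s] = cost;    // prune the more expensive alternative
        }
    }
    return {best[full], considered};
}
```

For four relations the loop examines 4 × 2^3 = 32 pairs, which is the n 2^(n-1) count quoted above. A fuller version would keep several plans per subset, one for each interesting order.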

Left Outer Join

R Left Outer Join S

Merging Views

If one or more relations in a query are views, and each is defined using a conjunctive predicate, then the view definition can simply be “unfolded” to obtain a single-block SQL query.
This may not work when the views are more complex, for example when they use aggregates or eliminate duplicates with “Select distinct”.

Nested Subqueries

“Flatten” nested queries whenever possible


“Flattening” Queries with Aggregates

Aggregates are trickier. Consider:

“Flattening” Queries with Aggregates

What is wrong with this?

  SELECT   Dept.name
  FROM     Dept, Emp
  WHERE    Dept.name = Emp.dept_name
  GROUP BY Dept.name
  HAVING   Dept.num-of-machines < Count(Emp.*)

“Flattening” Queries with Aggregates

Correct flattening will use outerjoin:
