chapter 12 query processing (1) yonsei university 2 nd semester, 2013 sanghyun park

Chapter 12 Query Processing (1)

Yonsei University, 2nd Semester, 2013

Sanghyun Park

Outline

Overview
Measures of Query Cost
Selection Operation
Sorting
Join Operation (will be covered in the next file)
Other Operations (will be covered in the next file)
Evaluation of Expressions (will be covered in the next file)

Basic Steps in Query Processing

1. Parsing and translation

2. Optimization

3. Evaluation

Optimization

A relational algebra expression can be evaluated in many ways

Example: $\sigma_{salary<75000}(\Pi_{salary}(instructor))$ is equivalent to $\Pi_{salary}(\sigma_{salary<75000}(instructor))$

An annotated expression specifying a detailed evaluation strategy is called an evaluation plan. Example:
  (1) can use an index on instructor to find tuples with salary < 75000
  (2) can perform a complete relation scan and discard instructors with salary ≥ 75000

Query optimization: Amongst all equivalent evaluation plans choose the one with lowest cost (details in Chapter 13)

Measures of Query Cost (1/2)

Cost is generally measured as the total elapsed time for answering a query; many factors contribute to the cost (disk accesses, CPU, …)

Typically disk access is the predominant cost, and is also relatively easy to estimate; it is measured by taking into account
  Number of seeks × average-seek-cost
  Number of blocks read × average-block-read-cost
  Number of blocks written × average-block-write-cost

For simplicity we just use number of block transfers from disk as the cost measure
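The two cost measures above can be written down as a minimal sketch. The function and parameter names are illustrative assumptions, not taken from the slides; the average unit costs would have to be measured or supplied for a particular disk.

```python
def disk_cost(seeks, blocks_read, blocks_written,
              avg_seek_cost, avg_block_read_cost, avg_block_write_cost):
    """Detailed estimate: each count is weighted by its average unit cost."""
    return (seeks * avg_seek_cost
            + blocks_read * avg_block_read_cost
            + blocks_written * avg_block_write_cost)


def simplified_cost(blocks_read, blocks_written=0):
    """Simplified measure: just the number of block transfers."""
    return blocks_read + blocks_written


# Hypothetical unit costs (arbitrary time units), only for illustration.
print(disk_cost(2, 10, 5, avg_seek_cost=4,
                avg_block_read_cost=1, avg_block_write_cost=2))  # 28
print(simplified_cost(10, 5))                                     # 15
```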

Measures of Query Cost (2/2)

Costs depend on the size of the buffer in main memory
  Having more memory reduces the need for disk access
  The amount of real memory available to the buffer depends on other concurrent OS processes, and is hard to determine ahead of actual execution
  We often use worst-case estimates, assuming only the minimum amount of memory needed for the operation is available

Real systems take CPU cost, difference between sequential and random I/O, and buffer size into account

We do not include the cost of writing output to disk in our cost formulas

Selection Operation

In query processing, the file scan is the lowest-level operator to access data

File scans are search algorithms that locate and retrieve records that satisfy a selection condition

In relational systems, a file scan allows an entire relation to be read in those cases where the relation is stored in a single, dedicated file

Basic Algorithms: Linear Search (A1)

Scan each file block and test all records to see whether they satisfy the selection condition

Cost estimate (number of disk blocks scanned) = $b_r$ ($b_r$ denotes the number of blocks containing records from relation r)

Selections on key attributes have an average cost of $b_r/2$, since the scan can stop as soon as the record is found, but still have a worst-case cost of $b_r$

The linear search algorithm can be applied to any file, regardless of
  Ordering of records in the file
  Availability of indices
  Nature of the selection operation
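Linear search (A1) can be sketched as follows. The file is modelled as a Python list of blocks, each block a list of records; this representation and the `key_lookup` flag are assumptions made for illustration only.

```python
def linear_search(blocks, predicate, key_lookup=False):
    """Scan blocks in order; return (matching records, blocks scanned).

    key_lookup=True models an equality selection on a key attribute:
    at most one record matches, so the scan stops at the first hit.
    """
    matches = []
    blocks_scanned = 0
    for block in blocks:
        blocks_scanned += 1              # one block transfer
        for record in block:
            if predicate(record):
                matches.append(record)
                if key_lookup:
                    return matches, blocks_scanned
    return matches, blocks_scanned


file_blocks = [[5, 3], [9, 1], [7, 2], [8, 4]]   # b_r = 4, records unordered
print(linear_search(file_blocks, lambda rec: rec > 6))          # ([9, 7, 8], 4)
print(linear_search(file_blocks, lambda rec: rec == 9, True))   # ([9], 2)
```

The first call scans all $b_r$ blocks; the key lookup stops early, which is where the $b_r/2$ average comes from.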

Basic Algorithms: Binary Search (A2)

Applicable if the selection is an equality comparison on the attribute on which the file is ordered

Cost estimate (number of disk blocks to be scanned)
  $\lceil \log_2(b_r) \rceil$: cost of locating the first tuple by a binary search on the blocks
  Plus the number of blocks containing records that satisfy the selection condition
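A minimal sketch of binary search (A2) over blocks, assuming the file is a list of non-empty blocks sorted on the search attribute. The names are illustrative. The sketch does no caching and reads one block past the last match, so its read count can sit slightly above the estimate on the slide.

```python
def binary_search_select(blocks, key):
    """Equality selection on a sorted file; returns (matches, block reads)."""
    reads = 0
    lo, hi = 0, len(blocks)
    # Phase 1: binary search for the first block whose last record is >= key.
    while lo < hi:
        mid = (lo + hi) // 2
        reads += 1                       # read block mid
        if blocks[mid][-1] < key:
            lo = mid + 1
        else:
            hi = mid
    # Phase 2: scan consecutive blocks while they can still contain the key.
    matches = []
    i = lo
    while i < len(blocks):
        block = blocks[i]
        reads += 1                       # read block i (no caching here)
        if block[0] > key:
            break
        matches.extend(rec for rec in block if rec == key)
        i += 1
    return matches, reads


sorted_file = [[2 * i, 2 * i + 1] for i in range(64)]   # b_r = 64
print(binary_search_select(sorted_file, 41))             # ([41], 8)
```

A linear search for the same key would read 21 of the 64 blocks before finding the record.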

Selection Using Indices (1/2)

Search algorithms that use an index are referred to as index scans (the selection condition must be on the search-key of the index)

A3 (primary index on candidate key, equality)
  Retrieve a single record that satisfies the equality condition
  If a B+-tree is used, the cost is equal to the height of the tree plus one I/O to fetch the record; Cost = $HT_i$ + 1

Selection Using Indices (2/2)

A4 (primary index on nonkey, equality)
  Records will be on consecutive blocks
  Cost = $HT_i$ + number of blocks containing retrieved records

A5(a) (secondary index on candidate key, equality)
  Cost = $HT_i$ + 1 (ignoring the cost for bucket access)

A5(b) (secondary index on nonkey, equality)
  Cost = $HT_i$ + number of records retrieved (ignoring the cost for bucket access)
  Each record may be on a different block, so this can be very expensive
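The cost formulas for A3 to A5 can be compared with a small sketch. The function names and the example numbers (tree height 3, 50 matching records, 20 records per block) are assumptions chosen only to show how far A4 and A5(b) can diverge.

```python
import math


def cost_a3(ht_i):
    """Primary index, candidate key, equality: tree height + 1 record fetch."""
    return ht_i + 1


def cost_a4(ht_i, matching_blocks):
    """Primary index, nonkey, equality: matches lie on consecutive blocks."""
    return ht_i + matching_blocks


def cost_a5a(ht_i):
    """Secondary index, candidate key, equality (bucket access ignored)."""
    return ht_i + 1


def cost_a5b(ht_i, matching_records):
    """Secondary index, nonkey, equality: up to one I/O per record."""
    return ht_i + matching_records


ht_i, matching_records, records_per_block = 3, 50, 20    # hypothetical values
matching_blocks = math.ceil(matching_records / records_per_block)
print(cost_a4(ht_i, matching_blocks))    # 6
print(cost_a5b(ht_i, matching_records))  # 53
```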

Selections Involving Comparisons (1/2)

Can implement selections of the form $\sigma_{A \le v}(r)$ or $\sigma_{A \ge v}(r)$ by using
  A file scan
  Or by using indices in the following ways

A6 (primary index, comparison) (relation is sorted on A)
  For $\sigma_{A \ge v}(r)$, use the index to find the first tuple ≥ v and scan the relation sequentially from there
  For $\sigma_{A \le v}(r)$, just scan the relation sequentially till the first tuple > v; do not use the index
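A6 can be sketched on an in-memory stand-in for a relation sorted on A. Here `bisect_left` from the Python standard library plays the role of the primary index lookup; that substitution, and the function names, are assumptions for illustration.

```python
from bisect import bisect_left


def select_ge(sorted_values, v):
    """A >= v: locate the first tuple >= v via the 'index', then scan to the end."""
    start = bisect_left(sorted_values, v)
    return sorted_values[start:]


def select_le(sorted_values, v):
    """A <= v: scan from the start until the first tuple > v; no index needed."""
    result = []
    for value in sorted_values:
        if value > v:
            break
        result.append(value)
    return result


relation = [10, 20, 20, 30, 40, 50]      # sorted on A
print(select_ge(relation, 25))           # [30, 40, 50]
print(select_le(relation, 20))           # [10, 20, 20]
```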

Selections Involving Comparisons (2/2)

A7 (secondary index, comparison)

For $\sigma_{A \ge v}(r)$, use the index to find the first index entry ≥ v and scan the index sequentially from there, to find pointers to records

For $\sigma_{A \le v}(r)$, just scan the leaf pages of the index, finding pointers to records, till the first entry > v

In either case, retrieve the records that are pointed to
  Requires an I/O for each record
  A linear file scan may be cheaper if many records are to be fetched
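The trade-off in A7 can be illustrated with a rough sketch. The leaf level of the secondary index is modelled as a sorted list of (key, block number) pairs, and the cost of traversing the index itself is ignored; both are simplifying assumptions, and the names are hypothetical.

```python
from bisect import bisect_left


def secondary_index_ge(index_entries, v):
    """Return (record pointers, record I/Os) for A >= v.

    index_entries: (key, block_no) pairs sorted by key.
    Each pointed-to record is charged one I/O, as on the slide.
    """
    keys = [key for key, _ in index_entries]
    start = bisect_left(keys, v)
    pointers = [ptr for _, ptr in index_entries[start:]]
    return pointers, len(pointers)


def cheaper_plan(record_ios, b_r):
    """Pick the plan with fewer block transfers (ties go to the file scan)."""
    return "index scan" if record_ios < b_r else "linear file scan"


index = [(10, 3), (20, 0), (30, 2), (40, 1), (50, 3), (60, 0)]
b_r = 4                                       # blocks in the underlying file

pointers, ios = secondary_index_ge(index, 25)
print(pointers, ios, cheaper_plan(ios, b_r))  # [2, 1, 3, 0] 4 linear file scan

pointers, ios = secondary_index_ge(index, 55)
print(pointers, ios, cheaper_plan(ios, b_r))  # [0] 1 index scan
```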

Sorting

For relations that fit in memory, techniques like quicksort can be used

For relations that don’t fit in memory, external sort-merge is a good choice

External Sort-Merge (1/3)

Let M denote memory size (in blocks)

Create sorted runs

Let i be 0 initially

Repeatedly do the following till the end of the relation:
  Read M blocks of the relation into memory
  Sort the in-memory blocks
  Write the sorted data to run $R_i$; increment i

Let the final value of i be N
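The run-creation phase can be sketched as below. The relation is a list of blocks, and each run is kept as a flat sorted list rather than being written back to disk in blocks; that simplification and the function name are assumptions.

```python
def create_sorted_runs(relation_blocks, M):
    """Read M blocks at a time, sort them in memory, emit one run per chunk."""
    runs = []
    for start in range(0, len(relation_blocks), M):
        chunk = relation_blocks[start:start + M]       # read M blocks
        records = sorted(rec for block in chunk for rec in block)
        runs.append(records)                           # "write" run R_i
    return runs


relation = [[24, 19], [31, 33], [14, 16], [21, 3], [2, 7], [4, 14]]
runs = create_sorted_runs(relation, M=3)
print(runs)        # [[14, 16, 19, 24, 31, 33], [2, 3, 4, 7, 14, 21]]
print(len(runs))   # N = 2
```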

External Sort-Merge (2/3)

Merge the runs (N-way merge)

We assume (for now) that N < M

Use N blocks of memory as buffers for input runs, and 1 block as buffer for output. Read the first block of each run into its input buffer

Repeatedly do the following until all input buffers are empty:
  Select the first record (in sort order) among all input buffers
  Write the record to the output buffer; if the output buffer is full, write it to disk
  Delete the record from its input buffer; if the input buffer becomes empty, then read the next block (if any) of the run into the input buffer
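The N-way merge can be sketched with one input buffer per run and a single output buffer. Each run is a list of non-empty blocks. A heap (`heapq`) is used here to pick the smallest buffered record; that is an implementation choice of this sketch, not something the slides prescribe, and the names are illustrative.

```python
import heapq
from collections import deque


def merge_runs(runs, block_size):
    """Merge sorted runs (lists of blocks) into one sorted list of blocks."""
    next_block = [0] * len(runs)            # next unread block of each run
    buffers = [deque() for _ in runs]       # one input buffer per run

    def refill(i):
        """Read the next block (if any) of run i into its input buffer."""
        if next_block[i] < len(runs[i]):
            buffers[i].extend(runs[i][next_block[i]])
            next_block[i] += 1

    for i in range(len(runs)):
        refill(i)

    # Heap of (first record in buffer i, i) for every non-empty buffer.
    heap = [(buf[0], i) for i, buf in enumerate(buffers) if buf]
    heapq.heapify(heap)

    disk, out_buf = [], []
    while heap:
        record, i = heapq.heappop(heap)     # smallest record over all buffers
        buffers[i].popleft()                # delete it from its input buffer
        out_buf.append(record)
        if len(out_buf) == block_size:      # output buffer full: write to disk
            disk.append(out_buf)
            out_buf = []
        if not buffers[i]:
            refill(i)
        if buffers[i]:
            heapq.heappush(heap, (buffers[i][0], i))
    if out_buf:                             # flush the last partial block
        disk.append(out_buf)
    return disk


runs = [[[14, 16], [19, 24], [31, 33]], [[2, 3], [4, 7], [14, 21]]]
print(merge_runs(runs, block_size=2))
# [[2, 3], [4, 7], [14, 14], [16, 19], [21, 24], [31, 33]]
```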

External Sort-Merge (3/3)

If N ≥ M, several merge passes are required

In each pass, contiguous groups of M - 1 runs are merged

A pass reduces the number of runs by a factor of M - 1, and creates runs longer by the same factor

Repeated passes are performed till all runs have been merged into one
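The number of merge passes follows directly from the factor M - 1. The sketch below just counts passes; the function name and the requirement M ≥ 3 (at least two input buffers plus one output buffer) are assumptions of this sketch.

```python
def merge_passes(num_runs, M):
    """Count merge passes when each pass merges groups of M - 1 runs."""
    if M < 3:
        raise ValueError("need at least 2 input buffers and 1 output buffer")
    passes = 0
    while num_runs > 1:
        num_runs = -(-num_runs // (M - 1))   # ceiling division
        passes += 1
    return passes


print(merge_passes(2, 3))     # 1  (N < M: a single merge pass)
print(merge_passes(4, 3))     # 2  (4 runs -> 2 runs -> 1 run)
print(merge_passes(90, 11))   # 2  (90 runs -> 9 runs -> 1 run)
```

The loop result agrees with the closed form $\lceil \log_{M-1} N \rceil$ for N initial runs.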

Example: External Sort-Merge
