secondary storage management the memory hierarchy
TRANSCRIPT
Secondary Storage Management: The Memory Hierarchy
The Memory Hierarchy
• Computer systems have several different components in which data may be stored.
• Data capacities and access speeds range over at least seven orders of magnitude.
• Devices with the smallest capacity also offer the fastest access speed.
• The term memory hierarchy is used in computer architecture when discussing performance issues in architectural design, algorithm predictions, and lower-level programming constructs such as locality of reference.
Description of Levels
1. Cache
• A megabyte or more of cache storage.
• On-board cache: on the same chip as the processor.
• Level-2 cache: on another chip.
• Cache data is accessed in a few nanoseconds.
• Data is moved from main memory to the cache when it is needed by the processor.
• Volatile.
Description of Levels
2. Main Memory
• 1 GB or more of main memory.
• Instruction execution and data manipulation involve information resident in main memory.
• Time to move data from main memory to the processor or cache is in the 10-100 nanosecond range.
• Volatile.
3. Secondary Storage
• Typically a magnetic disk.
• Capacity up to 1 TB.
• One machine can have several disk units.
• Time to transfer a single byte between disk and main memory is around 10 milliseconds.
Description of Levels
4. Tertiary Storage
• Holds data volumes measured in terabytes.
– As capacious as a collection of disk units can be, there are databases much larger than what can be stored on the disk(s) of a single machine, or even of several machines.
• Tertiary storage is characterized by significantly higher read/write times than secondary storage.
• Smaller cost per byte.
• Retrieval takes seconds or minutes, but capacities in the petabyte range are possible.
Transfer of Data Between Levels
• Data moves between adjacent levels of the hierarchy.
• Each level is organized to transfer large amounts of data to or from the level below
• A key technique for speeding up database operations is to arrange data so that when one piece of a disk block is needed, it is likely that other data on the same block will also be needed at about the same time.
Volatile & Non-Volatile Storage
• A volatile device “forgets” what is stored in it when the power goes off.
• Example: main memory.
• A nonvolatile device, on the other hand, is expected to keep its contents intact even for long periods when the device is turned off or there is a power failure.
• Example: Secondary & Tertiary Storage
Note: No change to the database can be considered final until it has migrated to nonvolatile, secondary storage.
Virtual Memory
• Managed by the operating system.
• Typical software executes in a virtual address space that is typically 32 bits wide; there are 2^32 bytes, or 4 gigabytes, in such a virtual memory.
• Some of the virtual memory is held in main memory, and the rest on disk.
• Transfer between the two is in units of disk blocks (pages).
• Virtual memory is not a level of the memory hierarchy.
Thank you!
Section 13.2 – Secondary storage management
CS-257 Database System Principles
Avinash Anantharamu (102)
008629907
Index
• 13.2 Disks
• 13.2.1 Mechanics of Disks
• 13.2.2 The Disk Controller
• 13.2.3 Disk Access Characteristics
Structure of a Disk
• The two principal moving pieces of a hard drive:
1. The head assembly
2. The disk assembly
• The disk assembly has one or more circular platters that rotate around a central spindle.
• The upper and lower surfaces of the platters are covered with a thin layer of magnetic material, on which bits are stored.
• 0’s and 1’s are represented by different patterns in the magnetic material.
• A common diameter for disk platters is 3.5 inches, although disks with diameters from an inch to several feet have been built.
Mechanics of Disks
Top View of Disk Surface
• Tracks are concentric circles on a platter.
• The disk is organized into tracks, and tracks are organized into sectors, which are segments of the circular platter.
• In 2008, a typical disk had about 100,000 tracks per inch but stored about a million bits per inch along the tracks.
• Sectors are indivisible as far as errors are concerned.
• Blocks are logical data-transfer units.
Mechanics of Disks
Disk Controller
• Controls the actuator that moves the head assembly.
• Selects the surface from which to read or write.
• Transfers bits from the desired sector to main memory.
• By buffering an entire track or more in the local memory of the disk controller, additional accesses to the disk can be avoided.
Simple Single Processor Computer
Disk Access Characteristics
• Seek time
• Rotational latency
• Transfer time
• Latency of the disk
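The total latency of a single disk access is the sum of these components. A rough sketch follows; the drive parameters below are illustrative assumptions, not values from the slides:

```python
# Hypothetical drive parameters (illustrative assumptions only).
AVG_SEEK_MS = 9.0            # average seek time, ms
RPM = 7200                   # spindle speed, rotations per minute
TRANSFER_RATE_MB_S = 100.0   # sustained transfer rate, MB/s
BLOCK_KB = 16                # block size, KB

def avg_block_access_ms() -> float:
    """Latency = seek time + rotational latency + transfer time."""
    rotation_ms = 60_000.0 / RPM             # time for one full rotation
    rotational_latency_ms = rotation_ms / 2  # on average, half a rotation
    transfer_ms = (BLOCK_KB / 1024.0) / TRANSFER_RATE_MB_S * 1000.0
    return AVG_SEEK_MS + rotational_latency_ms + transfer_ms
```

With these assumed numbers the average access works out to roughly 13 ms, consistent with the ~10 ms figure used later in Section 13.3.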
Thank you
13.3 Accelerating Access to Secondary Storage
San Jose State University
Spring 2012
13.3 Accelerating Access to Secondary Storage
Section Overview
13.3.1: The I/O Model of Computation
13.3.2: Organizing Data by Cylinders
13.3.3: Using Multiple Disks
13.3.4: Mirroring Disks
13.3.5: Disk Scheduling and the Elevator Algorithm
13.3.6: Prefetching and Large-Scale Buffering
13.3 Introduction
 Average block access time is ~10 ms.
 Disks may be busy serving other requests.
 If requests arrive faster than the disk can serve them, the scheduling latency becomes infinite.
 There are various strategies to increase disk throughput.
 The “I/O model” is the correct model for determining the speed of database operations.
13.3 Introduction (Contd.)
Actions that improve database access speed:
– Place blocks that are accessed together within the same cylinder
– Increase the number of disks
– Mirror disks
– Use an improved disk-scheduling algorithm
– Use prefetching
13.3.1 The I/O Model of Computation
If we have a computer running a DBMS that:
– Is trying to serve a number of users
– Has 1 processor, 1 disk controller, and 1 disk
– Where each user is accessing different parts of the DB
It can be assumed that:
– The time required for disk access is much larger than the time for access to main memory; and as a result:
– The number of block accesses is a good approximation of the time required by a DB algorithm
13.3.2 Organizing Data by Cylinders
It is more efficient to store data that might be accessed together in the same or adjacent cylinder(s).
In a relational database, related data should be stored in the same cylinder.
By doing so, we can approach the theoretical transfer rate for moving data on or off the disk.
13.3.3 Using Multiple Disks
If the disk controller supports the addition of multiple disks and has efficient scheduling, using multiple disks can improve performance significantly.
By striping a relation across multiple disks, each chunk of data can be retrieved in parallel, improving performance by up to a factor of n, where n is the total number of disks the data is striped over.
If the disk controller, bus, and main memory can handle n times the data-transfer rate, then n disks will have approximately the performance of one disk that operates n times as fast.
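Striping itself is just a round-robin mapping of logical blocks to disks; a minimal sketch (the function name and layout are illustrative, not from the text):

```python
def striped_location(block_no: int, n_disks: int) -> tuple[int, int]:
    """Round-robin striping: logical block b lives on disk (b mod n),
    at position (b div n) within that disk."""
    return block_no % n_disks, block_no // n_disks
```

Consecutive logical blocks land on different disks, which is what lets a large read proceed on all n disks in parallel.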
A drawback of striping data across multiple disks is that it increases the chance that some disk will fail.
To mitigate this risk, some DBMSs use a disk-mirroring configuration.
Disk mirroring makes each disk a copy of the others, so that if any disk fails, the data is not lost.
Since all the data is in multiple places, the access speedup can be more than a factor of n, because the disk whose head is closest to the requested block can be chosen.
13.3.4 Mirroring Disks
           Advantages                                Disadvantages
Striping   Read/write speedup ~n;                    Higher risk of failure
           capacity increased by ~n
Mirroring  Read speedup ~n; reduced failure risk;    High cost per bit; slow writes
           fast initial access                       compared to striping
13.3.5 Disk Scheduling
One way to improve disk throughput is to improve disk scheduling, prioritizing requests so that they are served more efficiently.
– The elevator algorithm is a simple yet effective disk-scheduling algorithm.
– The algorithm makes the heads of a disk oscillate back and forth, much as an elevator goes up and down.
– The access requests closest to the head’s current position are processed first.
When sweeping outward, the direction of head movement changes only after the largest cylinder request has been processed
When sweeping inward, the direction of head movement changes only after the smallest cylinder request has been processed
Example:

Cylinder   Time Requested (ms)
8000       0
24000      0
56000      0
16000      10
64000      20
40000      30

Cylinder   Time Completed (ms)
8000       4.3
24000      13.6
56000      26.9
64000      34.2
40000      45.5
16000      56.8
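For a single sweep-and-reverse pass, the elevator algorithm's service order can be sketched as follows. This is a simplification of the example above: it assumes all requests are already pending when the sweep starts, whereas the completion times in the tables also depend on when each request arrives:

```python
def elevator_order(head: int, pending: list[int]) -> list[int]:
    """Serve pending cylinder requests with one outward (increasing)
    sweep followed by one inward (decreasing) sweep."""
    outward = sorted(c for c in pending if c >= head)
    inward = sorted((c for c in pending if c < head), reverse=True)
    return outward + inward
```

For a head at cylinder 20000 with all six requests pending, the sweep serves 24000, 40000, 56000, 64000 moving outward, then reverses and serves 16000 and 8000.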
13.3.6 Prefetching and Large-Scale Buffering
In some cases we can anticipate what data will be needed.
We can take advantage of this by prefetching data from the disk before the DBMS requests it.
Since the data is already in memory, the DBMS receives it almost instantly.
? Questions ?
Disk Failures
Presented by Timothy Chen
Spring 2013
Index
• 13.4 Disk Failures
13.4.1 Intermittent Failures
13.4.2 Checksums
13.4.3 Stable Storage
13.4.4 Error-Handling Capabilities of Stable Storage
13.4.5 Recovery from Disk Crashes
13.4.6 Mirroring as a Redundancy Technique
13.4.7 Parity Blocks
13.4.8 An Improvement: RAID 5
13.4.9 Coping with Multiple Disk Crashes
Intermittent Failures
• An intermittent failure occurs if we try to read a sector, but the correct contents of that sector are not delivered to the disk controller.
• With repeated tries, we are able to read or write successfully.
• The controller can tell good sectors from bad ones: after a write, a read is performed to check that the sector is good.
• The controller may attempt to write a sector, but the contents of the sector are not what was intended.
• We can check whether the write was correct; if the sector read back is bad, then the write was apparently unsuccessful and must be repeated.
CheckSum
• A checksum lets the read operation determine the good or bad status of a sector.
• If, on reading, we find that the checksum is not proper for the data bits, then we know there is an error in reading.
• If the checksum is proper, there is still a small chance that the block was not read correctly, but by using many checksum bits we can make the probability of missing a bad read arbitrarily small.
How Checksums Work
• Each sector has some additional bits, set depending on the values of the data bits stored in that sector.
• A simple form of checksum is based on the parity of all the bits in the sector.
• If the data bits and the checksum bits do not have the proper parity, we know there is an error in reading.
• An odd number of 1’s: the bits have odd parity (e.g., 01101000).
• An even number of 1’s: the bits have even parity (e.g., 111011100).
• A single parity bit can detect any one-bit error.
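A one-bit parity checksum can be sketched as follows (the function names are illustrative):

```python
def parity_bit(data_bits: str) -> str:
    """Checksum bit chosen so the total number of 1's, including the
    checksum bit itself, is even."""
    return '1' if data_bits.count('1') % 2 == 1 else '0'

def read_looks_good(data_bits: str, stored_checksum: str) -> bool:
    """A mismatch proves a read error; a match means the read is
    probably (but not certainly) correct."""
    return parity_bit(data_bits) == stored_checksum
```

Flipping any single bit of the sector changes its parity, so a one-bit error is always caught; an even number of flipped bits would go unnoticed, which is why more checksum bits are used in practice.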
Stable Storage
• Deals with disk errors that checksums can detect but cannot correct.
• Sectors are paired, and each pair represents one sector-contents X, with left and right copies XL and XR.
• Reads check the parity of XL and XR, falling back to the other copy (or a spare sector) and retrying until a good value is returned.
• We assume that if the read function returns a good value w for either XL or XR, then w is the true value of X.
Error-Handling Capabilities of Stable Storage
• Since there are two copies, XL and XR, if one of them fails we can still read the other.
• The chance that both fail is quite small.
• A write can fail, for example, during a power outage.
• Media failures and write failures:
– The failure occurred as we were writing XL.
– The failure occurred after we wrote XL.
Recovery from Disk Crashes
• The most serious mode of failure for disks is a “head crash,” where data is permanently destroyed.
• This situation represents a disaster for many DBMS applications, such as banking and other financial applications.
• The way to recover from a crash is to use a RAID method.
• RAID: Redundant Arrays of Independent Disks.
Mirroring as a Redundancy Technique
• Mirroring, as a protection against data loss, is often referred to as RAID level 1.
• We simply mirror each disk.
• Essentially, with mirroring and the other redundancy schemes we discuss, the only way data can be lost is if there is a second disk crash while the first crash is being repaired.
RAID 1 diagram
Parity Blocks
• This technique is often called RAID 4.
• The redundant disk holds the modulo-2 sum (parity) of the corresponding bits of all the other disks, computed column by column:
disk 1: 11110000
disk 2: 10101010
disk 3: 00111000
The redundant disk 4 (an even number of 1’s in a column gives 0, an odd number gives 1):
disk 4: 01100010
RAID 4 diagram
Parity Blocks: Failure Recovery
• RAID 4 can recover from only one disk failure.
• If two or more disks fail, the modulo-2 sum cannot recover them.
• If the failed disk is one of the data disks, we swap in a good disk and recompute its data as the modulo-2 sum of the other disks (including the redundant disk).
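Both the parity computation and single-disk recovery are the same column-by-column modulo-2 sum; a sketch using the blocks from the example above:

```python
def mod2_sum(blocks: list[str]) -> str:
    """Column-by-column modulo-2 (XOR) sum of equal-length bit strings."""
    return ''.join(str(sum(int(b[i]) for b in blocks) % 2)
                   for i in range(len(blocks[0])))

# The redundant (parity) block is the mod-2 sum of the data blocks.
# A single failed block is the mod-2 sum of all surviving blocks,
# because XOR-ing a block with itself cancels it out.
```

For example, recomputing disk 2 from disks 1 and 3 plus the parity disk yields exactly the lost contents.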
An Improvement: RAID 5
• RAID 5 rotates the role of the redundant (parity) disk among all the disks, so that no single disk becomes a write bottleneck.
Coping with Multiple Disk Crashes
• For more than one disk failure, neither RAID 4 nor RAID 5 works.
• So we need RAID 6.
• It needs at least 2 redundant disks.
RAID 6
Reference
• http://www.definethecloud.net/wp-content/uploads/2010/12/325px-RAID_1.svg_.png
• http://en.wikipedia.org/wiki/RAID
Secondary Storage Management
13.5 Arranging data on disk
Mangesh Dahale
ID-105
CS 257
Outline
• Fixed-Length Records
• Example of Fixed-Length Records
• Packing Fixed-Length Records into Blocks
• Example of Packing Fixed-Length Records into Blocks
• Details of the Block Header
Arranging Data on Disk
• A data element such as a tuple or object is represented by a record, which consists of consecutive bytes in some disk block.
Fixed Length Records
The simplest record consists of fixed-length fields.
The record begins with a header, a fixed-length region where information about the record itself is kept.
It is necessary to lay out the record so it can be moved to main memory and accessed efficiently there.

Fixed-Length Record Header
1. A pointer to the record schema.
2. The length of the record; this information helps us skip over records without consulting the schema.
3. A timestamp to indicate when the record was created.
Example
CREATE TABLE employee(
name CHAR(30) PRIMARY KEY,
address VARCHAR(255),
gender CHAR(1),
birthdate DATE
);
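Every tuple of this table occupies the same number of bytes, so each field sits at a fixed offset from the start of the record. A sketch of the offset computation; the 12-byte header and the 10-byte DATE encoding are illustrative assumptions, not values from the slides:

```python
HEADER_BYTES = 12  # assumed size of the record header
# (field name, byte length); lengths follow the schema above, with DATE
# assumed to be stored as 10 bytes (e.g., 'YYYY-MM-DD').
FIELDS = [("name", 30), ("address", 255), ("gender", 1), ("birthdate", 10)]

def field_offsets() -> tuple[dict, int]:
    """Return {field: starting byte offset} and the total record length."""
    offsets, pos = {}, HEADER_BYTES
    for fname, size in FIELDS:
        offsets[fname] = pos
        pos += size
    return offsets, pos
```

Because the offsets are the same for every record, a field can be read without consulting the schema at access time.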
Packing Fixed Length Records into Blocks
• Records are stored in blocks of the disk and moved into main memory when we need to access or update them.
• A block header is written first, and it is followed by a series of records.
Example
• Along with the header, we can pack as many records as will fit into one block, as shown in the figure; any remaining space is unused.
The block header contains the following information:
• Links to one or more other blocks that are part of a network of blocks
• Information about the role played by this block in such a network
• Information about which relation the tuples of this block belong to
• A “directory” giving the offset of each record in the block
• Timestamp(s) indicating the time of the block’s last modification and/or access
•Thank You
13.6 Representing Block and Record Addresses
Lokmanya Thilak
CS257 ID: 106
Topics
Addresses in Client-Server Systems
Logical and Structured Addresses
Pointer Swizzling
Returning Blocks to Disk
Pinned Records and Blocks
Introduction
• A database system consists of a server process that provides data from secondary storage to one or more client processes that are applications using the data.
• Address of a block and of a record:
▫ In main memory, the address of a block is the virtual-memory address of its first byte, and the address of a record within the block is the virtual-memory address of the first byte of the record.
▫ In secondary memory, a sequence of bytes describes the location of the block: the device ID for the disk, the cylinder number, etc.
Addresses in Client-Server Systems
The client application uses a conventional “virtual” address space, typically 32 bits, or about 4 billion different addresses.
Physical addresses: byte strings referring to the place within the secondary storage system where the record can be found.
Logical addresses: arbitrary strings of bytes of some fixed length that map to physical addresses via a map table, stored on disk in a known location.
Map table: maps each logical address to a physical address.
13.6.2 Logical and Structured Addresses
• Purpose of a logical address:
▫ Gives more flexibility when we
 Move the record around within the block
 Move the record to another block
▫ Allows easy updating of records
▫ A structured address gives us an option of deciding what to do when a record is deleted
(Figure: a block with a header, an offset table, and unused space.)
Pointer Swizzling
• Having pointers is common in object-relational database systems.
• Every data item (block, record, etc.) has two addresses:
– database address: the address on disk
– memory address, if the item is in virtual memory
• When a block is moved from secondary storage to main memory, pointers within the block are “swizzled,” i.e., translated from the database address space (server) to the virtual address space (client).
Translation table: maps each database address to its memory address.
Types of Swizzling
 Automatic swizzling: as soon as a block is brought into memory, swizzle all relevant pointers.
 Swizzling on demand: only swizzle a pointer if and when it is actually followed.
 No swizzling: pointers are not swizzled; they are accessed using the database address.
• Unswizzling
– When a block is moved from memory back to disk, all pointers must go back to database (disk) addresses
– Use the translation table again
– It is important to have an efficient data structure for the translation table
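The translation table can be sketched as a pair of dictionaries, i.e., a hash table keyed on each address space (class and method names are illustrative):

```python
class TranslationTable:
    """Maps database (disk) addresses to memory addresses and back, so
    pointers can be swizzled on load and unswizzled on write-back."""
    def __init__(self):
        self.db_to_mem = {}
        self.mem_to_db = {}

    def record_load(self, db_addr, mem_addr):
        """Note that the item at db_addr now lives at mem_addr."""
        self.db_to_mem[db_addr] = mem_addr
        self.mem_to_db[mem_addr] = db_addr

    def swizzle(self, db_addr):
        """Database address -> memory address (None if not loaded)."""
        return self.db_to_mem.get(db_addr)

    def unswizzle(self, mem_addr):
        """Memory address -> database address, for write-back."""
        return self.mem_to_db[mem_addr]
```

Keeping both directions indexed is what makes unswizzling on write-back as cheap as swizzling on load.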
Returning Blocks to Disk
• When a block is moved from memory back to disk, any pointers within that block must be unswizzled, i.e. memory address must be replaced by database address.
• A block in memory is said to be pinned if it cannot be moved back to disk safely, for example because it contains swizzled pointers.
Thank you
Variable Length Data and Records
- Ashwin Kalbhor, Class ID: 107
Agenda
• Records with Variable-Length Fields
• Records with Repeating Fields
• Variable-Format Records
• Records That Do Not Fit in a Block
• Example of a fixed-length record, with the byte offset at which each field begins:

name (0)   address (30)   gender (286)   birth date (287)   [record ends at 297]
Records with Variable Length Fields
• Simple and Effective way to represent variable length records is as follows –1. Fixed length fields are kept ahead of the variable length records.2. A header is put in front of the of the record.3. Record header contains• Length of the record• Pointers to the beginning of all variable length
fields except the first one.
Example
A record with name and address as variable-length fields. The header holds the record length and a pointer to the beginning of address; the fixed-length fields (birth date, gender) come next, followed by the variable-length fields (name, address).
(Figure: record layout showing the header information, record length, and pointer to address.)
Records with repeating fields
• A record may contain a variable number of occurrences of a field F, all occurrences having the same length L.
• All occurrences of field F are grouped together.
• A pointer to the first occurrence of F is put in the header.
• Based on the length L, the starting offset of any occurrence of the repeating field can be obtained.
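The offset arithmetic is a one-liner (the function name is illustrative):

```python
def occurrence_offset(first_offset: int, L: int, k: int) -> int:
    """Starting offset of the k-th occurrence (0-based) of repeating
    field F, given the offset of the first occurrence and length L."""
    return first_offset + k * L
```

So only one pointer, to the first occurrence, needs to be stored in the header.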
Example of a record with Repeating Fields
A movie-star record with “movies” as the repeating field.
(Figure: the header holds the record length, a pointer to address, a pointer to the movie pointers, and other header information; then come name, address, and the pointers to movies.)
Alternative representation
• The record itself is of fixed length.
• Variable-length fields are stored in a separate block.
• The record itself keeps track of:
1. Pointers to the place where each repeating field begins, and
2. Either how many repetitions there are, or where the repetitions end.
Storing variable length fields separately from the record.
Variable Format Records
• Records that do not have a fixed schema.
• Represented by a sequence of tagged fields.
• Each tagged field carries information about itself:
• The attribute or field name
• The type of the field
• The length of the field
• The value of the field
Variable Format Records
(Figure: two tagged fields of a movie-star record.)
N (code for name)              S (code for string type)   16 (length)   Clint Eastwood
R (code for restaurant owned)  S (code for string type)   14 (length)   Hog’s Breath Inn
Records That Do Not Fit in a Block
• When the length of a record is greater than the block size, the record is divided and placed into two or more blocks.
• The portion of the record in each block is referred to as a RECORD FRAGMENT.
• A record with two or more fragments is called a SPANNED RECORD.
• A record that does not cross a block boundary is called an UNSPANNED RECORD.
Spanned Records
• Spanned records require the following extra header information:
• A bit indicating whether the piece is a fragment or not
• A bit indicating whether it is the first or last fragment of a record
• Pointers to the next or previous fragment of the same record
Spanned Records
(Figure: block 1 holds record 1 and fragment 2-a; block 2 holds fragment 2-b and record 3; block headers and record headers are shown.)
Thank You.
13.8 Record Modifications
CS257
Lok Kei Leong (108)
Outline
• Record Insertion
• Record Deletion
• Record Update
Insertion
• Insert new records into a relation:
- records of the relation kept in no particular order
- records of the relation kept in a fixed order (e.g., sorted by primary key)
• A pointer to a record from outside the block is a “structured address.”
What If The Block is Full?
• If we need to insert a record into a particular block but the block is full, what should we do?
• Find room outside the block.
• There are two solutions:
I. Find space on a nearby block
II. Create an overflow block
Insertion (solution 1)
• Find space on a “nearby” block.
• If block B1 has no space and space is available on block B2, move records of B1 to B2.
• If there are external pointers to the records of B1 that moved to B2, leave a forwarding address in the offset table of B1.
Insertion (solution 2)
• Create an overflow block.
• Each block B has in its header a pointer to an overflow block where additional records of B can be placed.
Deletion
• Reclaim the freed space by sliding records around the block.
• If we cannot slide records, maintain an available-space list in the block header to keep track of the space available.
• Avoid pointers that dangle or wind up pointing to a new record.
Tombstone
• What about pointers to deleted records?
• A tombstone is placed in place of each deleted record.
• A tombstone is a bit placed at the first byte of the deleted record to indicate that the record was deleted (0 = not deleted, 1 = deleted).
• A tombstone is permanent.
Update
• For fixed-length records, an update has no effect on the storage system.
• For variable-length records, updating raises the same issues as insertion and deletion, except that we never create a tombstone for the old record.
• For a longer updated record, create more space on its block by:
- sliding records
- creating an overflow block
Question?
BITMAP INDEXES
Mahathi Kashojula (ID: 132)
Contents
• 14.7.1 Motivation for Bitmap Indexes
• 14.7.2 Compressed Bitmaps
• 14.7.3 Operating on Run-Length-Encoded Bit-Vectors
• 14.7.4 Managing Bitmap Indexes
Introduction
• A bitmap index is a special kind of index that stores the data as bit arrays (commonly called "bitmaps").
• It answers most queries by performing bitwise logical operations on these bitmaps.
• The bitmap index is designed for cases where the number of distinct values is low; in other words, the values repeat very frequently.
Example
No F G
1 30 FOO
2 30 BAR
3 40 BAZ
4 50 FOO
5 40 BAR
6 30 BAZ
• Suppose a file consists of records with two fields, F and G, of type integer and string, respectively. The current file has six records, numbered 1 through 6, with the following values in order:
Example (contd…)
Value Vector
30 110001
40 001010
50 000100
• A bitmap index for the first field, F, would have three bit-vectors, each of length 6, as shown in the table.
• In each case, the 1's indicate the records in which the corresponding value appears.
• Table 2
Example (contd…)
• Table 3
• A bitmap index for the second field, G, would have three bit-vectors, each of length 6, as shown in the table.
• In each case, the 1's indicate the records in which the corresponding string appears.
Value Vector
FOO 100100
BAR 010010
BAZ 001001
Motivation for Bitmap Indexes:
• Table 4
• Bitmap indexes can help answer range queries.
• Example: given is the data of a jewelry store; the attributes are age and salary.
No Age Salary
1 25 60
2 45 60
3 50 75
4 50 100
5 50 120
6 70 110
7 85 140
8 30 260
9 25 400
10 45 350
11 50 275
12 60 260
Motivation (contd…)
• Table 5
• A bitmap index for the first field, age, would have seven bit-vectors, each of length 12, as shown in the table.
• In each case, the 1's indicate the records in which the corresponding value appears.
Value Vector
25 100000001000
30 000000010000
45 010000000100
50 001110000010
60 000000000001
70 000001000000
85 000000100000
Motivation (contd…)
• Table 6
• A bitmap index for the second field, salary, would have ten bit-vectors, each of length 12, as shown in the table.
• In each case, the 1's indicate the records in which the corresponding value appears.
Value Vector
60 110000000000
75 001000000000
100 000100000000
110 000001000000
120 000010000000
140 000000100000
260 000000010001
275 000000000010
350 000000000100
400 000000001000
Motivation (contd…)
• Suppose we want to find the jewelry buyers with an age in the range 45-55 and a salary in the range 100-200.
• We first find the bit-vectors for the age values in this range; in this example there are only two: 010000000100 and 001110000010, for 45 and 50, respectively.
• If we take their bitwise OR, we have a new bit-vector with 1 in position i if and only if the ith record has an age in the desired range.
• The new bit-vector is 011110000110.
Motivation (contd…)
• Next, we find the bit-vectors for the salaries between 100 and 200.
• There are four, corresponding to salaries 100, 110, 120, and 140:
100: 000100000000
110: 000001000000
120: 000010000000
140: 000000100000
• Their bitwise OR is 000111100000.
Motivation (contd…)
• The last step is to take the bitwise AND of the two bit-vectors we calculated by OR:
    011110000110
AND 000111100000
----------------
    000110000000
• We thus find that only the fourth and fifth records, which are (50,100) and (50,120), are in the desired range.
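The whole range query can be reproduced with ordinary integer bitwise operations, holding each bit-vector as a string (helper names are illustrative):

```python
# Bit-vectors from the jewelry example (record positions 1..12).
AGE = {25: "100000001000", 30: "000000010000", 45: "010000000100",
       50: "001110000010", 60: "000000000001", 70: "000001000000",
       85: "000000100000"}
SALARY = {60: "110000000000", 75: "001000000000", 100: "000100000000",
          110: "000001000000", 120: "000010000000", 140: "000000100000",
          260: "000000010001", 275: "000000000010", 350: "000000000100",
          400: "000000001000"}

def bits_or(vectors: list[str]) -> str:
    """Bitwise OR of equal-length bit-vectors."""
    acc = 0
    for v in vectors:
        acc |= int(v, 2)
    return format(acc, "0%db" % len(vectors[0]))

def range_vector(index: dict, lo: int, hi: int) -> str:
    """OR together the vectors of all indexed values in [lo, hi]."""
    return bits_or([v for val, v in index.items() if lo <= val <= hi])

def bits_and(a: str, b: str) -> str:
    """Bitwise AND of two equal-length bit-vectors."""
    return format(int(a, 2) & int(b, 2), "0%db" % len(a))
```

`bits_and(range_vector(AGE, 45, 55), range_vector(SALARY, 100, 200))` reproduces the result 000110000000 computed above.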
Compressed Bitmaps
• Consider: the file F has n records, and attribute A has m distinct values in F.
• The size of a bitmap index on attribute A is m*n bits.
• If m is large, then 1's in a bit-vector will be very rare, and such sparse vectors compress well.
• A common encoding approach is called run-length encoding.
Run-Length Encoding
• Represents runs: a run is a sequence of i 0's followed by a 1, represented by a suitable binary encoding of the integer i.
• A run of i 0's followed by a 1 is encoded by:
 First computing how many bits are needed to represent i in binary; call that number j.
 Then representing the run by j-1 1's and a single 0, followed by the j bits that represent i in binary.
 The encoding for i = 1 is 01 (j = 1); the encoding for i = 0 is 00 (j = 1).
• We concatenate the codes for the runs together, and the resulting sequence of bits is the encoding of the entire bit-vector.
Run-Length Encoding (contd…)
• Let us decode the sequence 11101101001011.
• Starting at the beginning (leftmost bit):
 First run: the first 0 is at position 4, so j = 4. The next 4 bits are 1101, so the first integer is i = 13.
 Second run: the bits 00 give j = 1 and i = 0.
 Third run: the bits 1011 give j = 2 and i = 3.
• The run lengths are thus 13, 0, 3; hence the bit-vector is 0000000000000110001.
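The encoding and decoding just described can be sketched directly (function names are illustrative):

```python
def encode_runs(runs: list[int]) -> str:
    """Encode each run length i as (j-1) 1's, a 0, then i in j bits,
    where j is the number of bits needed to write i in binary."""
    out = []
    for i in runs:
        j = max(i.bit_length(), 1)          # i = 0 still needs one bit
        out.append("1" * (j - 1) + "0" + format(i, "0%db" % j))
    return "".join(out)

def decode_runs(code: str) -> list[int]:
    """Inverse of encode_runs: recover the run lengths."""
    runs, pos = [], 0
    while pos < len(code):
        j = 1
        while code[pos] == "1":             # count the j-1 leading 1's
            j += 1
            pos += 1
        pos += 1                            # skip the separating 0
        runs.append(int(code[pos:pos + j], 2))
        pos += j
    return runs

def runs_to_bitvector(runs: list[int]) -> str:
    """Each run of length i stands for i 0's followed by a 1."""
    return "".join("0" * i + "1" for i in runs)
```

Running the decoder on the sequence from the slide recovers the runs 13, 0, 3 and the 19-bit vector above.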
Managing Bitmap Indexes
1) Finding bit-vectors
• Think of each bit-vector as the value associated with its attribute value, used as a key.
• Any secondary-storage index technique will then be efficient in retrieving the bit-vectors.
• Create a secondary index with the attribute value as the search key.
2) Finding records
• Create a secondary index with the record number as the search key (if we need record k, we can use the kth position as a search key).
3) Handling modifications
• Record numbers must remain fixed once assigned.
• Changes to the data file require changes to the bitmap index.
References:
http://en.wikipedia.org/wiki/Bitmap_index
http://en.wikipedia.org/wiki/Run-length_encoding
Thank You
Questions ????
Query Execution
Section 15.1
Sweta Shah
CS257: Database Systems
ID: 118
Agenda
 Query Processor
 Query Compilation
 Physical Query Plan Operators
 Scanning Tables
 Table Scan
 Index Scan
 Sorting While Scanning Tables
 Model of Computation for Physical Operators
 Parameters for Measuring Cost
 Iterators
Query Processor
The query processor is the group of components of a DBMS that turns user queries and data-modification commands into a sequence of database operations and executes those operations.
The query processor is responsible for supplying the details of how the query is to be executed; a naive execution strategy for a query may take far more time than necessary.
The major parts of the query processor
Query compilation itself is a multi-step process consisting of :
Parsing: in which a parse tree representing query and its structure is constructed
Query rewrite: in which the parse tree is converted to an initial query plan
Physical plan generation: where the logical query plan is turned into a physical query plan by selecting algorithms.
The physical plan also includes details such as how the queried relations are accessed, and when and if a relation should be sorted.
Query compilation
Outline of query compilation
Physical Query Plan Operators
 Physical query plans are built from operators.
 Each operator implements one step of the plan.
 They are particular implementations of the operators of relational algebra.
 We also need physical operators for tasks that do not involve an operation of relational algebra, such as “scan,” which scans tables.
Scanning Tables
 Scanning is one of the most basic operations in a physical query plan.
 It is necessary when we want to perform a join or union of a relation with another relation.
 There are two basic approaches to locating the tuples of a relation R:
1. Table-scan
2. Index-scan
Table-scan:
 Relation R is stored in secondary memory with its tuples arranged in blocks.
 It is possible to get the blocks one by one.
Index-scan:
 There is an index on some attribute of relation R.
 Use this index to get all the tuples of R.
Sorting While Scanning Tables
Why do we need sorting while scanning?
 The query could include an ORDER BY clause requiring that a relation be sorted.
 Various algorithms for relational-algebra operations require one or both of their arguments to be sorted relations.
 Sort-scan takes a relation R and a specification of the attributes on which the sort is to be made, and produces R in that sorted order.
 If relation R must be sorted by attribute a, and there is a B-tree index on a, then a scan of the index allows us to produce R in the desired order.
Model of Computation for Physical Operators
 Choosing physical-plan operators wisely is essential for a good query processor.
 The cost of an operation is measured in the number of disk I/O operations.
 If an operator produces the final answer to a query, and that result is indeed written to disk, then the cost of doing so depends only on the size of the answer, and that write-back cost is added to the total cost of the query.

Improvements in Cost
 Major improvements in the cost of the physical operators can be achieved by avoiding or reducing the number of disk I/O operations.
 This can be achieved by passing the answer of one operator to the next in main memory, without writing it to disk.
 We shall also see situations where several operations share the main memory, so M could be much smaller than the total main memory.
Parameters for Measuring Costs
Parameters that affect the performance of a query:
 Buffer-space availability in main memory at the time the query is executed
 The size of the input and the size of the output generated
 The size of a block on disk and the amount of available main memory also affect performance
Iterators for Implementation of Physical Operators
 Many physical operators can be implemented as an iterator.
 An iterator is a group of three functions that allows a consumer of the result of the physical operator to get the result one tuple at a time.
 The three methods forming the iterator for an operation are:
1. Open()
2. GetNext()
3. Close()
The three functions forming the iterator are: Open: This function starts the process of getting tuples. It initializes any data structures needed to perform the
operation
Iterator
GetNext(): Returns the next tuple in the result and adjusts data structures as necessary to allow subsequent tuples to be obtained. If there are no more tuples to return, GetNext returns a special value, NotFound.
Iterator
Close(): Ends the iteration after all tuples have been obtained; it calls Close on any arguments of the operator.
Iterator
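The three-method interface above can be rendered in Python. This is a minimal, illustrative sketch (the class name TableScan and the NOT_FOUND sentinel are hypothetical, not from the slides), shown for the simplest physical operator, a table scan:

```python
# Sketch of the Open/GetNext/Close iterator interface for a table scan.
# 'blocks' stands in for the relation's disk blocks: a list of lists of tuples.

NOT_FOUND = object()  # sentinel playing the role of the NotFound value

class TableScan:
    """Iterator that yields the tuples of a relation one at a time."""

    def __init__(self, blocks):
        self.blocks = blocks

    def open(self):
        # Initialize the data structures needed to get tuples.
        self.block_no = 0
        self.offset = 0

    def get_next(self):
        # Return the next tuple, or NOT_FOUND when the relation is exhausted.
        while self.block_no < len(self.blocks):
            block = self.blocks[self.block_no]
            if self.offset < len(block):
                t = block[self.offset]
                self.offset += 1
                return t
            self.block_no += 1   # current block exhausted: move to the next
            self.offset = 0
        return NOT_FOUND

    def close(self):
        pass  # nothing to release in this in-memory sketch
```

A consumer calls open() once, then get_next() repeatedly until NOT_FOUND, then close(); more complex operators (joins, sorts) expose the same three methods.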
Thank You !!!
Query Execution
One-pass algorithm for database operations
Chetan Sharma 008565661
Overview
One-Pass Algorithm
One-Pass Algorithm Methods:
1) Tuple-at-a-time, unary operations.
2) Full-relation, unary operations.
3) Full-relation, binary operations.
One-Pass Algorithm
• Reading the data only once from disk.
• Usually, they require at least one of the arguments to fit in main memory
• The choice of algorithm for each operator is an essential part of the process of transforming a logical query plan into a physical query plan.
Tuple-at-a-Time
• These operations do not require an entire relation, or even a large part of it, in memory at once. Thus, we can read a block at a time, use one main memory buffer, and produce our output.
• Ex- selection and projection
Tuple-at-a-Time
A selection or projection being performed on a relation R
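As an illustrative Python sketch (not from the slides; the blocks-of-dicts representation and the condition/fields parameters are assumptions), a tuple-at-a-time selection plus projection reads one block at a time into a single buffer and emits output tuples as it goes:

```python
# Sketch of a tuple-at-a-time pass: one input buffer holds the current block;
# each tuple is filtered (selection) and cut down to some fields (projection).

def select_project(blocks, condition, fields):
    """One-pass selection + projection over a relation stored as blocks."""
    output = []
    for block in blocks:          # read one block at a time into the buffer
        for t in block:           # t is a dict: attribute name -> value
            if condition(t):                              # selection
                output.append({f: t[f] for f in fields})  # projection
    return output
```

The point of the sketch is the memory requirement: only one block of the input is ever held at once.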
Full-relation, unary operations
• These one-argument operations require seeing all or most of the tuples in memory at once,
• so one-pass algorithms are limited to relations that are approximately of size M (the number of main-memory buffers available) or less.
• Ex: the grouping operator; the duplicate-elimination operator.
Full-relation, unary operations
Managing memory for a one-pass duplicate-elimination
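A minimal Python sketch of one-pass duplicate elimination (illustrative, not from the slides): a main-memory search structure, here a Python set keyed on the whole tuple, remembers every tuple seen so far, so the relation must be small enough for that structure to fit in the M available buffers:

```python
# Sketch of one-pass duplicate elimination: a main-memory set remembers
# every tuple seen so far; only the first copy of each tuple is emitted.

def distinct(blocks):
    seen = set()          # in-memory search structure keyed on the whole tuple
    output = []
    for block in blocks:  # one block of R read at a time
        for t in block:   # tuples must be hashable, e.g. plain tuples
            if t not in seen:
                seen.add(t)
                output.append(t)   # first copy goes to the output
    return output
```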
Grouping
• A grouping operation gives us zero or more grouping attributes and presumably one or more aggregated attributes. If we create in main memory one entry for each group — that is, for each value of the grouping attributes — then we can scan the tuples of R, one block at a time.
• Ex- MIN(a) , MAX(a) , COUNT , SUM(a), AVG(a)
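The one-entry-per-group idea can be sketched in Python (illustrative; the dict-of-accumulators layout and parameter names are assumptions). SUM and COUNT are shown; MIN, MAX and AVG follow the same pattern:

```python
# Sketch of one-pass grouping: one main-memory entry per group value,
# accumulating the aggregates as the tuples of R stream by.

def group_sum_count(blocks, group_attr, agg_attr):
    groups = {}                       # group value -> (sum, count)
    for block in blocks:              # scan R one block at a time
        for t in block:               # t is a dict of attribute -> value
            key = t[group_attr]
            s, c = groups.get(key, (0, 0))
            groups[key] = (s + t[agg_attr], c + 1)
    return groups
```

Memory use is proportional to the number of groups, not to the size of R, which is why grouping can sometimes handle relations larger than M blocks.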
Full-relation, binary operations
• All other operations are in this class: set and bag versions of union, intersection, difference, joins, and products.
• Except for bag union, each of these operations requires at least one argument to be limited to size M, if we are to use a one-pass algorithm
Full-relation, binary operations examples
• Set Union:
- Read S into M - 1 buffers of main memory and build a search structure where the search key is the entire tuple.
- All these tuples are also copied to the output.
- Read each block of R into the Mth buffer, one at a time.
- For each tuple t of R, see if t is in S; if not, copy t to the output. If t is also in S, skip t.
• Set Intersection:
- Read S into M - 1 buffers and build a search structure with full tuples as the search key.
- Read each block of R, and for each tuple t of R, see if t is also in S. If so, copy t to the output; if not, ignore t.
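The two one-pass binary algorithms above can be sketched in Python (illustrative, and assuming set semantics, i.e. no duplicates within either argument, as the slides do): S is read fully into a main-memory structure standing in for the M - 1 buffers, then R streams past it one block at a time:

```python
# Sketches of one-pass set union and intersection.
# S fits in M - 1 buffers; R uses the one remaining buffer.

def one_pass_union(S_blocks, R_blocks):
    s_tuples = set()
    output = []
    for block in S_blocks:            # build search structure on whole tuples
        for t in block:
            s_tuples.add(t)
            output.append(t)          # all of S is copied to the output
    for block in R_blocks:            # R streams by, one block at a time
        for t in block:
            if t not in s_tuples:     # emit t only if it is not in S
                output.append(t)
    return output

def one_pass_intersection(S_blocks, R_blocks):
    s_tuples = {t for block in S_blocks for t in block}
    return [t for block in R_blocks for t in block if t in s_tuples]
```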
NESTED LOOPS JOINS
Book Section of chapter 15.3
Submitted to : Prof. Dr. T.Y. LIN
• Tuple-Based Nested-Loop Join
• An Iterator for Tuple-Based Nested-Loop Join
• A Block-Based Nested-Loop Join Algorithm
• Analysis of Nested-Loop Join
15.3.1 Tuple-Based Nested-Loop Join
The simplest variation of nested-loop join has loops that range over individual tuples of the relations involved. In this algorithm, which we call tuple-based nested-loop join, we compute the join as follows
R |><| S
Continued
FOR each tuple s in S DO
    FOR each tuple r in R DO
        IF r and s join to make a tuple t THEN output t;
If we are careless about how we buffer the blocks of relations R and S, then this algorithm could require as many as T(R)T(S) disk I/O's.
There are many situations where this algorithm can be modified to have much lower cost.
Continued
One case is when we can use an index on the join attribute or attributes of R to find the tuples of R that match a given tuple of S, without having to read the entire relation R.
The second improvement looks much more carefully at the way tuples of R and S are divided among blocks, and uses as much of the memory as it can to reduce the number of disk I/O's as we go through the inner loop.
We shall consider this block-based version of nested-loop join.
15.3.2 An Iterator for Tuple-Based Nested-Loop Join
Open() {
    R.Open();
    S.Open();
    s := S.GetNext();
}

GetNext() {
    REPEAT {
        r := R.GetNext();
        IF (r = NotFound) { /* R is exhausted for the current s */
            R.Close();
            s := S.GetNext();
            IF (s = NotFound) RETURN NotFound; /* both R and S are exhausted */
            R.Open();
            r := R.GetNext();
        }
    } UNTIL (r and s join);
    RETURN the join of r and s;
}

Close() {
    R.Close();
    S.Close();
}
15.3.3 A Block-Based Nested-Loop Join Algorithm
We can improve on the tuple-based nested-loop join that computes R |><| S by:
1. Organizing access to both argument relations by blocks.
2. Using as much main memory as we can to store tuples belonging to the relation S, the relation of the outer loop.
The nested-loop join algorithm
FOR each chunk of M-1 blocks of S DO BEGIN
    read these blocks into main-memory buffers;
    organize their tuples into a search structure whose
        search key is the common attributes of R and S;
    FOR each block b of R DO BEGIN
        read b into main memory;
        FOR each tuple t of b DO BEGIN
            find the tuples of S in main memory that join with t;
            output the join of t with each of these tuples;
        END;
    END;
END;
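The same algorithm rendered as Python (an illustrative sketch, not from the slides; relations are represented as lists of blocks holding (x, y) and (y, z) tuples, and the chunking of S stands in for loading M - 1 blocks at a time):

```python
# Sketch of block-based nested-loop join: S is consumed in chunks of
# M - 1 blocks, each chunk indexed in memory on the join attribute y,
# then all of R is scanned one block at a time against that chunk.

def nested_loop_join(R_blocks, S_blocks, M):
    output = []
    for i in range(0, len(S_blocks), M - 1):   # chunk of M - 1 blocks of S
        chunk = S_blocks[i:i + M - 1]
        index = {}                             # search structure keyed on y
        for block in chunk:
            for (y, z) in block:
                index.setdefault(y, []).append(z)
        for block in R_blocks:                 # one block of R at a time
            for (x, y) in block:
                for z in index.get(y, []):
                    output.append((x, y, z))   # emit a join tuple
    return output
```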
15.3.4 Analysis of Nested-Loop Join
Assuming S is the smaller relation, the number of chunks or iterations of outer loop is B(S)/(M - 1).
At each iteration, we read M - 1 blocks of S and B(R) blocks of R. The number of disk I/O's is thus
B(S)/(M-1) * (M - 1 + B(R)), or B(S) + B(S)B(R)/(M-1).
Continued
Assuming all of M, B(S), and B(R) are large, but M is the smallest of these, an approximation to the above formula is B(S)B(R)/M. That is, the cost is proportional to the product of the sizes of the two relations, divided by the amount of available main memory.
Example B(R) = 1000, B(S) = 500, M = 101
Important Aside: 101 buffer blocks is not as unrealistic as it sounds. There may be many queries at the same time, competing for main memory buffers.
Outer loop iterates 5 times. At each iteration we read M-1 (i.e. 100) blocks of S and all of R (i.e. 1000 blocks).
Total cost: 5*(100 + 1000) = 5500 I/O's.
Question: What if we reversed the roles of R and S?
We would iterate 10 times, and in each we would read 100+500 blocks, for a total of 6000 I/O’s.
Compare with one-pass join, if it could be done! We would need only 1500 disk I/O's if B(S) ≤ M - 1.
Continued…….
1. The cost of the nested-loop join is not much greater than the cost of a one-pass join, which is 1500 disk I/O's for this example. In fact, if B(S) ≤ M - 1, the nested-loop join becomes identical to the one-pass join algorithm of Section 15.2.3.
2. Nested-loop join is generally not the most efficient join algorithm.
Summary of the topic
In this topic we have learned how nested-loop joins are used in query execution and the process by which they compute their result.
Two-Pass Algorithms Based on Sorting
Section 15.4
CS257 Spring 2013
Swapna Vemparala
Class ID: 131
Contents:
• Two-Pass Algorithms
• Two-Phase, Multiway Merge-Sort
• Duplicate Elimination Using Sorting
• Grouping and Aggregation Using Sorting
• A Sort-Based Union Algorithm
• Sort-Based Intersection and Difference
• A Simple Sort-Based Join Algorithm
• A More Efficient Sort-Based Join
Two-Pass Algorithms
Data from the operand relation is read into main memory, processed, written out to disk again, and reread from disk to complete the operation.
The idea extends to any number of passes, where the data is read several times into main memory.
15.4.1 Two-Phase, Multiway Merge-Sort
Very large relations can be sorted in two passes using an algorithm called Two-Phase, Multiway Merge-Sort (TPMMS).
Phase 1: Repeatedly fill the M buffers with new tuples from R and sort them, using any main-memory sorting algorithm. Write out each sorted sublist to secondary storage.
Phase 2: Merge the sorted sublists. For this phase to work, there can be at most M - 1 sorted sublists, which limits the size of R. We allocate one input block to each sorted sublist and one block to the output.
Merging:
• Find the smallest key among the first remaining elements of all the lists.
• Move the smallest element to the first available position of the output block.
• If the output block is full, write it to disk and reinitialize the same buffer in main memory to hold the next output block.
• If an input block is exhausted of records, read the next block from the same sorted sublist into the same buffer that was used for the block just exhausted.
• If no blocks remain, stop.
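The two phases can be sketched in Python (illustrative, not from the slides; a list of tuples stands in for the relation, run_length plays the role of the M buffers, and heapq.merge performs the "find the smallest key among the lists" step):

```python
# Sketch of TPMMS. Phase 1 sorts memory-sized runs; phase 2 merges the
# sorted sublists, as if one buffer were allocated per sublist.

import heapq

def tpmms_sort(tuples, run_length):
    # Phase 1: fill "memory", sort, write out each sorted sublist.
    runs = [sorted(tuples[i:i + run_length])
            for i in range(0, len(tuples), run_length)]
    # Phase 2: merge the sorted sublists (at most M - 1 of them).
    return list(heapq.merge(*runs))
```

In a real implementation each run would be written to disk in phase 1 and streamed back one block at a time in phase 2; the sketch keeps everything in memory to show only the control flow.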
15.4.2 Duplicate Elimination Using Sorting
Same as the previous algorithm, except that instead of sorting on the second pass, we repeatedly select the first unconsidered tuple t among all the sorted sublists.
Write one copy of t to the output and eliminate from the input blocks all occurrences of t.
The output is exactly one copy of every tuple in R.
15.4.3 Grouping and Aggregation Using Sorting
Read the tuples of R into memory, M blocks at a time. Sort the tuples in each set of M blocks, using the grouping attributes of L as the sort key. Write each sorted sublist to disk.
Use one main-memory buffer for each sublist, and initially load the first block of each sublist into its buffer.
Repeatedly find the least value of the sort key present among the first available tuples in the buffers.
15.4.4 A Sort-Based Union Algorithm
In the first phase, create sorted sublists from both R and S.
Use one main-memory buffer for each sublist of R and S.
Initialize each with the first block from the corresponding sublist.
Repeatedly find the first remaining tuple t among all the buffers.
15.4.5 Sort-Based Intersection and Difference
For both the set and bag versions, the algorithm is the same as for set-union, except in the way we handle the copies of a tuple t at the fronts of the sorted sublists.
For set intersection -output t if it appears in both R and S.
For bag intersection -output t the minimum of the number of times it appears in R and in S.
For set difference -output t if and only if it appears in R but not in S.
For bag difference-output t the number of times it appears in R minus the number of times it appears in S.
15.4.6 A Simple Sort-Based Join Algorithm
Given relations R(X, Y) and S(Y, Z) to join, and given M blocks of main memory for buffers:
• Sort R, using TPMMS, with Y as the sort key.
• Sort S similarly.
• Merge the sorted R and S, using only two buffers.
15.4.8 A More Efficient Sort-Based Join
If we do not have to worry about very large numbers of tuples with a common value for the join attribute(s), then we can save two disk I/O's per block by combining the second phase of the sorts with the join itself.
To compute R(X, Y) |><| S(Y, Z) using M main-memory buffers:
Create sorted sublists of size M, using Y as the sort key, for both R and S.
Bring the first block of each sublist into a buffer
Repeatedly find the least Y-value y among the first available tuples of all the sublists. Identify all the tuples of both relations that have Y-value y. Output the join of all tuples from R with all tuples from S that share this common Y-value
We can perform the algorithm-on data that is almost as large as that of the previous algorithm.
Two-Pass Algorithms Based on Hashing
Chapter 15.5
CS 257
ID 131 SWAPNA VEMPARALA
Contents
Introduction
Partitioning Relations by Hashing
A Hash-Based Algorithm for Duplicate Elimination
Hash-Based Grouping and Aggregation
Hash-Based Union, Intersection, and Difference
The Hash-Join Algorithm
Saving Some Disk I /O ’s
Differences between sort-based and corresponding hash-based algorithms
Introduction
The essential idea behind all these previous algorithms is as follows:
If the data is too big to store in main-memory buffers, hash all the tuples of the argument or arguments using an appropriate hash key.
For all the common operations, there is a way to select the hash key so all the tuples that need to be considered together when we perform the operation fall into the same bucket.
We then perform the operation by working on one bucket at a time (or on a pair of buckets with the same hash value, in the case of a binary operation).
In effect, we have reduced the size of the operand(s) by a factor equal to the number of buckets, which is roughly M.
15.5.1 Partitioning Relations by Hashing
Take a relation R and, using M buffers, partition R into M - 1 buckets of roughly equal size.
We assume that h is the hash function, and that h takes complete tuples of R as its argument.
We associate one buffer with each bucket.
The last buffer holds blocks of R , one at a time. Each tuple t in the block is hashed to bucket h(t) and copied to the appropriate buffer.
If that buffer is full, we write it out to disk, and initialize another block for the same bucket.
At the end, we write out the last block of each bucket if it is not empty.
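The partitioning loop described above can be sketched in Python (illustrative, not from the slides; lists of blocks stand in for the disk, block_size is an assumed parameter, and appending a full buffer to a bucket's list stands in for writing it to disk):

```python
# Sketch of partitioning a relation by hashing: each tuple goes to one of
# M - 1 buckets; a full in-memory buffer is "written to disk" (appended to
# that bucket's list of blocks), and the last nonempty blocks are flushed.

def partition(R_blocks, M, block_size, h=hash):
    buckets = [[] for _ in range(M - 1)]   # blocks written out, per bucket
    buffers = [[] for _ in range(M - 1)]   # one in-memory buffer per bucket
    for block in R_blocks:                 # the last buffer holds blocks of R
        for t in block:
            i = h(t) % (M - 1)             # tuple t hashes to bucket i
            buffers[i].append(t)
            if len(buffers[i]) == block_size:   # buffer full: flush it
                buckets[i].append(buffers[i])
                buffers[i] = []
    for i, buf in enumerate(buffers):      # write out last nonempty blocks
        if buf:
            buckets[i].append(buf)
    return buckets
```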
15.5.2 A Hash-Based Algorithm for Duplicate Elimination
We shall now consider the details of hash-based algorithms for the various operations of relational algebra that might need two-pass algorithms.
First, consider duplicate elimination, that is, the operation δ(R).
We hash R to M - 1 buckets; two copies of the same tuple t will hash to the same bucket.
Thus, we can examine one bucket at a time, perform δ on that bucket in isolation, and take as the answer the union of δ(Ri), where Ri is the portion of R that hashes to the ith bucket.
The one-pass algorithm eliminates duplicates from each Ri in turn and writes out the resulting unique tuples.
This method will work as long as the individual R i ’s are sufficiently small to fit in main memory and thus allow a one-pass algorithm.
Since we may assume the hash function h partitions R into equal-sized buckets, each Ri will be approximately B(R)/(M - 1) blocks in size.
If that number of blocks is no larger than M, i.e., B(R) < M(M - 1), then the two-pass, hash-based algorithm will work.
Thus, a conservative estimate (assuming M and M - 1 are essentially the same) is B(R) < M², exactly as for the sort-based, two-pass algorithm for δ.
The number of disk I/O ’s is also similar to that of the sort-based algorithm.
We read each block of R once as we hash its tuples, and we write each block of each bucket to disk.
We then read each block of each bucket again in the one-pass algorithm that focuses on that bucket.
Thus, the total number of disk I/O ’s is 3B(R).
15.5.3 Hash-Based Grouping and Aggregation
To perform the γL(R) operation, we again start by hashing all the tuples of R to M - 1 buckets.
However, in order to make sure that all tuples of the same group wind up in the same bucket,
we must choose a hash function that depends only on the grouping attributes of the list L.
Having partitioned R into buckets, we can then use the one-pass algorithm for γ to process each bucket in turn.
As for δ, we can process each bucket in main memory provided B(R) < M².
However, on the second pass, we need only one record per group as we process each bucket
Thus, even if the size of a bucket is larger than M, we can handle the bucket in one pass provided the records for all the groups in the bucket take no more than M buffers.
As a consequence, if groups are large, then we may actually be able to handle much larger relations R than is indicated by the B(R) < M² rule.
On the other hand, if M exceeds the number of groups, then we cannot fill all buckets.
Thus, the actual limitation on the size of R as a function of M is complex, but B(R) < M² is a conservative estimate.
Finally, we observe that the number of disk I/O's for γ, as for δ, is 3B(R).
15.5.4 Hash-Based Union, Intersection, and Difference
When the operation is binary, we use the same hash function to hash tuples of both arguments. For example, to compute the set-union R ∪ S, we hash both R and S to M - 1 buckets each, say R1, R2, ..., RM-1 and S1, S2, ..., SM-1.
We then take the set-union of Ri with Si for all i, and output the result.
Notice that if a tuple t appears in both R and S, then for some i we shall find t in both Ri and Si.
Thus, when we take the union of these two buckets, we shall output only one copy of t , and there is no possibility of introducing duplicates into the result.
To take the intersection or difference of R and S, we create the 2(M — 1) buckets exactly as for set-union and apply the appropriate one-pass algorithm to each pair of corresponding buckets.
Notice that all these one-pass algorithms require B(R) + B(S) disk I/O's.
To this quantity we must add the two disk I/O's per block that are necessary to hash the tuples of the two relations and store the buckets on disk, for a total of 3(B(R) + B(S)) disk I/O's.
In order for the algorithms to work, we must be able to take the one-pass union, intersection, or difference of Ri and Si, whose sizes will be approximately B(R)/(M - 1) and B(S)/(M - 1), respectively.
Recall that the one-pass algorithms for these operations require that the smaller operand occupies at most M - 1 blocks.
Thus, the two-pass, hash-based algorithms require that min(B(R), B(S)) < M², approximately.
15.5.5 The Hash-Join Algorithm
To compute R(X, Y) |><| S(Y, Z) using a two-pass, hash-based algorithm, we act almost as for the other binary operations.
The only difference is that we must use as the hash key just the join attributes, Y.
Then we can be sure that if tuples of R and S join, they will wind up in corresponding buckets Ri and Si for some i.
A one-pass join of all pairs of corresponding buckets completes this algorithm, which we call hash-join.
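Hash-join can be sketched in Python (illustrative, not from the slides; plain lists of tuples stand in for the relations, and n_buckets plays the role of M - 1). Both relations are partitioned on Y with the same hash function, then each pair of corresponding buckets is joined one-pass:

```python
# Sketch of hash-join: partition both relations on the join attribute y,
# then one-pass join each pair of corresponding buckets, loading the
# S-bucket into an in-memory dict keyed on y.

def hash_join(R, S, n_buckets):
    """R holds (x, y) tuples, S holds (y, z) tuples."""
    R_buckets = [[] for _ in range(n_buckets)]
    S_buckets = [[] for _ in range(n_buckets)]
    for (x, y) in R:                           # partition R on y
        R_buckets[hash(y) % n_buckets].append((x, y))
    for (y, z) in S:                           # partition S with the same h
        S_buckets[hash(y) % n_buckets].append((y, z))
    out = []
    for rb, sb in zip(R_buckets, S_buckets):   # corresponding bucket pair
        index = {}
        for (y, z) in sb:                      # one-pass join of the pair
            index.setdefault(y, []).append(z)
        for (x, y) in rb:
            for z in index.get(y, []):
                out.append((x, y, z))
    return out
```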
15.5.6 Saving Some Disk I /O ’s
If there is more memory available on the first pass than we need to hold one block per bucket, then we have some opportunities to save disk I/O ’s.
One option is to use several blocks for each bucket, and write them out as a group, in consecutive blocks of disk.
Strictly speaking, this technique doesn’t save disk I/O ’s, but it makes the I/O ’s go faster, since we save seek time and rotational latency when we write.
An effective method called hybrid hash-join works as follows.
In general, suppose we decide that to join R |><| S, with S the smaller relation, we need to create k buckets, where k is much less than M, the available memory. When we hash S, we can choose to keep m of the k buckets entirely in main memory, while keeping only one block for each of the other k - m buckets. We can manage to do so provided the expected size of the buckets in memory, plus one block for each of the other buckets, does not exceed M; that is:
m * B(S)/k + k - m ≤ M
The expected size of a bucket is B(S)/k, and there are m buckets in memory.
Now, when we read the tuples of the other relation, R, to hash that relation into buckets, we keep in memory:
1. The m buckets of S that were never written to disk, and
2. One block for each of the k - m buckets of R whose corresponding buckets of S were written to disk.
If a tuple t of R hashes to one of the first m buckets, then we immediately join it with all the tuples of the corresponding S-bucket, as if this were a one-pass hash-join.
It is necessary to organize each of the in-memory buckets of S into an efficient search structure to facilitate this join, just as for the one-pass hash-join.
If t hashes to one of the buckets whose corresponding S-bucket is on disk, then t is sent to the main-memory block for that bucket, and eventually migrates to disk, as for a two-pass, hash-based join.
On the second pass, we join the corresponding buckets of R and S as usual.
However, there is no need to join the pairs of buckets for which the S-bucket was left in memory; these buckets have already been joined.
The savings in disk I/O's is equal to two for every block of the buckets of S that remain in memory, and their corresponding R-buckets.
Since m / k of the buckets are in memory, the savings is 2(m/k)(B(R) + B(S)).
The intuitive justification is that all but k - m of the main-memory buffers can be used to hold tuples of S in main memory, and the more of these tuples, the fewer the disk I/O's.
Thus, we want to minimize k, the total number of buckets.
We do so by making each bucket about as big as can fit in main memory; that is, buckets are of size M, and therefore k = B(S)/M.
If that is the case, then there is only room for one bucket in the extra main memory; i.e., m = 1.
In fact, we really need to make the buckets slightly smaller than B(S)/M, or else we shall not quite have room for one full bucket and one block for each of the other k - 1 buckets in memory at the same time.
Assuming, for simplicity, that k is about B(S)/M and m = 1, the savings in disk I/O's is 2M(B(R) + B(S))/B(S), and the total cost is (3 - 2M/B(S))(B(R) + B(S)).
15.5.7 Summary of Hash-Based Algorithms
Differences between sort-based and corresponding hash-based algorithms:
1. Hash-based algorithms for binary operations have a size requirement that depends only on the smaller of the two arguments, rather than on the sum of the argument sizes, as sort-based algorithms require.
2. Sort-based algorithms sometimes allow us to produce a result in sorted order and take advantage of that sort later.
3. Hash-based algorithms depend on the buckets being of equal size.
4. In sort-based algorithms, the sorted sublists may be written to consecutive blocks of the disk
5. Moreover, if M is much larger than the number of sorted sublists, then we may read in several consecutive blocks at a time from a sorted sublist, again saving some latency and seek time.
6. On the other hand, if we can choose the number of buckets to be less than M in a hash-based algorithm, then we can write out several blocks of a bucket at once.
15.6 Index-Based Algorithms
By: Tomas Tupy (123)
Outline:
• Terminology
• Clustered Indexes
• Example
• Non-Clustered Indexes
• Index-Based Selection
• Joining by Using an Index
• Join Using a Sorted Index
What is an Index?
A data structure which improves the speed of data retrieval operations on a relation, at the cost of slower writes and the use of more storage space.
It enables sub-linear time lookup.
Data is stored in arbitrary order, while logical ordering is achieved by using the index.
Index-based algorithms are especially useful for the selection operator.
Terminology Recap:
• B(R) – Number of blocks needed to hold R
• T(R) – Number of tuples in R
• V(R,a) – Number of distinct values of column a in R
• Clustered Relation – Tuples are packed into as few blocks as possible.
• Clustered Indexes – Indexes on attribute(s) such that all tuples with a fixed value for the search key appear on as few blocks as possible.
Clustering Indexes
A relation is clustered if its tuples are packed into relatively few blocks.
Clustering indexes are indexes on an attribute or attributes such that all the tuples with a fixed value for the search key of this index appear in as few blocks as possible.
Tuples are stored to match the index order.
A relation that isn't clustered cannot have a clustering index.
Clustering Indexes
Let R(a,b) be a relation sorted on attribute a. Let the index on a be a clustering index. Let a1 be a specific value for a.
A clustering index has all tuples with a fixed value packed into the minimum number of blocks.
[Figure: consecutive blocks holding all the a1 tuples]
Pros/Cons
Pros:
• Faster reads for particular selections.
Cons:
• Writing to a table with a clustered index can be slower, since there might be a need to rearrange data.
• Only one clustered index is possible.
Clustered Index Example
Customer(ID, Name, Address)
Order(ID, CustomerID, Price)
Problem: We want to quickly retrieve all orders for a particular customer.
How do we do this?
Clustered Index Example
Solution: Create a clustered index on the "CustomerID" column of the Order table.
Now the tuples with the same CustomerID will be physically stored close to one another on disk.
Non-Clustered Indexes
• There can be many per table.
• Quicker for insert and update operations.
• The physical order of tuples is not the same as the index order.
Index-Based Algorithms
• Especially useful for the selection operator.
• Algorithms for join and other binary operators also use indexes to very good advantage.
Index-Based Selection: No Index
Without an index on relation R, we have to read all the tuples in order to implement the selection σC(R) and see which tuples match our condition C.
What is the cost in disk I/O's to implement σC(R)? (For both clustered and non-clustered relations.)
Index-Based Selection: No Index
Answer: B(R) if our relation is clustered; up to T(R) if the relation is not clustered.
Index-Based Selection
Let us consider an index on attribute a, where our condition C is a = v: σa=v(R). In this case we just search the index for value v, and we get pointers to exactly the tuples we need.
Index-Based Selection
Let's say that for our selection σa=v(R), our index is clustering. What is the cost in the number of disk I/O's to retrieve the set σa=v(R)?
Index-Based Selection
Answer: the average is B(R) / V(R,a), plus a few more I/O's:
• The index might not be in main memory.
• Tuples with a = v might not be block-aligned.
• Even if clustered, the tuples might not be packed as tightly as possible (extra space is left for insertion).
Index-Based Selection
Now suppose the index for our selection σa=v(R) is non-clustering. What is the cost in the number of disk I/O's to retrieve the set σa=v(R)?
Index-Based Selection
Answer: the worst case is T(R) / V(R,a). This can happen if the matching tuples all live in different blocks.
Joining by Using an Index (Algorithm 1)
Consider the natural join R(X,Y) |><| S(Y,Z), and suppose S has an index on attribute Y.
Start by examining each block of R, and within each block consider each tuple t, where tY is the component of t corresponding to attribute Y.
Now we use the index to find the tuples of S that have tY in their Y component.
These tuples form the join.
Joining by Using an Index (Algorithm 1): Analysis
Consider R(X,Y) |><| S(Y,Z). If R is clustered, then we have to read B(R) blocks to get all tuples of R. If R is not clustered, then up to T(R) disk I/O's are required.
For each tuple t of R, we must read an average of T(S) / V(S,Y) tuples of S.
Total: B(R)T(S) / V(S,Y) disk I/O's if R is clustered, and T(R)T(S) / V(S,Y) if it is not.
Join Using a Sorted Index
Consider R(X,Y) |><| S(Y,Z). Data structures such as B-trees provide the best sorted indexes.
In the best case, if we have sorted indexes on Y for both R and S, then we perform only the last step of the simple sort-based join.
This is sometimes called a zig-zag join.
Join Using a Sorted Index (Zig-zag Join)
Consider R(X,Y) |><| S(Y,Z), where we have indexes on Y for both R and S.
Tuples from R with a Y value that does not appear in S never need to be retrieved, and vice-versa.
[Figure: the zig-zag join stepping between the index on Y in R and the index on Y in S]
Chapter 15.7
Buffer Management
Class: CS257 Instructor: Dr. T.Y.Lin
What does a buffer manager do?
Assume that the operators on relations need M main-memory buffers to store needed data.
In practice:
1) buffers are rarely allocated in advance;
2) the value of M may vary depending on system conditions.
Therefore, a buffer manager is used to allow processes to get the memory they need, while minimizing delays and unsatisfiable requests.
[Figure 1: The role of the buffer manager: it responds to requests for main-memory access to disk blocks, issuing reads and writes between the buffers and the disk.]
The role of the buffer manager
15.7.1 Buffer Management Architecture
Two broad architectures for a buffer manager:
1) The buffer manager controls main memory directly. • Relational DBMS
2) The buffer manager allocates buffers in virtual memory, allowing the OS to decide how to use buffers. • “main-memory” DBMS • “object-oriented” DBMS
Buffer Pool
Key setting for the Buffer manager to be efficient:
The buffer manager should limit the number of buffers in use so that they fit in the available main memory, i.e. Don’t exceed available space.
The number of buffers is a parameter set when the DBMS is initialized.
No matter which architecture of buffering is used, we simply assume that there is a fixed-size buffer pool, a set of buffers available to queries and other database actions.
Data must be in RAM for DBMS to operate on it! Buffer Manager hides the fact that not all data is in RAM.
DB
MAIN MEMORY
DISK
disk page
free frame
Page Requests from Higher Levels
BUFFER POOL
choice of frame dictatedby replacement policy
Buffer Pool
15.7.2 Buffer Management Strategies
Buffer-replacement strategies:
When a buffer is needed for a newly requested block and the buffer pool is full, which block should be thrown out of the buffer pool?
Buffer-replacement strategy -- LRU
Least-Recently Used (LRU):
To throw out the block that has not been read or written for the longest time.
• Requires more maintenance but it is effective.
• The time table must be updated on every access.
• The block that has gone unused the longest is usually the one least likely to be accessed soon, which makes it a good choice for replacement.
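An LRU pool can be sketched in Python (illustrative, not from the slides; collections.OrderedDict stands in for the "time table", with the most recently touched blocks kept at the end, and read_block is a hypothetical callback standing in for a disk read):

```python
# Sketch of LRU buffer replacement: on a hit the block moves to the
# "most recent" end; on a miss with a full pool, the block at the
# "least recent" end is evicted.

from collections import OrderedDict

class LRUBufferPool:
    def __init__(self, n_buffers):
        self.n_buffers = n_buffers
        self.pool = OrderedDict()             # block id -> block contents

    def access(self, block_id, read_block):
        """Return the block, reading it 'from disk' on a miss."""
        if block_id in self.pool:
            self.pool.move_to_end(block_id)   # update the access time
            return self.pool[block_id]
        if len(self.pool) == self.n_buffers:
            self.pool.popitem(last=False)     # evict the least recently used
        self.pool[block_id] = read_block(block_id)
        return self.pool[block_id]
```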
Buffer-replacement strategy -- FIFO
First-In-First-Out (FIFO):
The buffer that has been occupied the longest by the same block is emptied and used for the new block.
• Requires less maintenance but it can make more mistakes.
• Only the loading time is kept.
• The oldest block is not necessarily less likely to be accessed. Example: the root block of a B-tree index.
Buffer-replacement strategy – “Clock”
The “Clock” Algorithm (“Second Chance”)
Think of the 8 buffers as arranged in a circle, as shown in Figure 3.
Flag 0 and 1:
buffers with a 0 flag may have their contents sent back to disk, i.e. they are ok to be replaced
buffers with a 1 flag are not ok to be replaced
Buffer-replacement strategy – “Clock”
[Figure 3: the clock algorithm. Eight buffers with flags 0 or 1 are arranged in a circle. The hand starts at some buffer and searches clockwise for a 0 flag; the first buffer found with a 0 flag is replaced. When the hand passes a buffer, its flag is set to 0; if the buffer's contents have not been accessed by the time the hand reaches it again (flag still 0), the buffer is replaced. That is the "second chance".]
Buffer-replacement strategy -- Clock
A buffer's flag is set to 1 when:
• a block is read into the buffer
• the contents of the buffer are accessed
A buffer's flag is set to 0 when:
• the buffer manager needs a buffer for a new block: it looks for the first 0 it can find, rotating clockwise, and if it passes 1's, it sets them to 0.
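The eviction sweep can be sketched in Python (illustrative, not from the slides; the circular array of flags and the load method are assumptions, and in a full implementation an access to a resident block would also set its flag back to 1):

```python
# Sketch of the clock ("second chance") replacement policy: the hand
# sweeps clockwise, clearing 1-flags as it passes (their second chance)
# and replacing the first buffer it finds flagged 0.

class ClockPool:
    def __init__(self, n_buffers):
        self.blocks = [None] * n_buffers   # block held by each buffer
        self.flags = [0] * n_buffers
        self.hand = 0

    def load(self, block_id):
        """Place block_id in some buffer, evicting by the clock rule."""
        while self.flags[self.hand] == 1:
            self.flags[self.hand] = 0      # passed: flag cleared to 0
            self.hand = (self.hand + 1) % len(self.blocks)
        victim = self.hand                 # first 0-flagged buffer found
        self.blocks[victim] = block_id     # "read" the block into it
        self.flags[victim] = 1             # flag set to 1 on load
        self.hand = (self.hand + 1) % len(self.blocks)
        return victim
```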
System Control helps Buffer-replacement strategy
System Control
The query processor or other components of a DBMS can give advice to the buffer manager in order to avoid some of the mistakes that would occur with a strict policy such as LRU, FIFO or Clock.
For example:
A “pinned” block means it can’t be moved to disk without first modifying certain other blocks that point to it.
In FIFO, use “pinned” to force root of a B-tree to remain in memory at all times.
15.7.3 The Relationship Between Physical Operator Selection and Buffer Management
Problem:
A physical operator expects a certain number of buffers M for its execution.
However, the buffer manager may not be able to guarantee that these M buffers are available.
15.7.3 The Relationship Between Physical Operator Selection and Buffer Management
Questions:
Can the algorithm adapt to changes of M, the number of main-memory buffers available?
When fewer than M buffers are available, some blocks have to be kept on disk instead of in memory.
How does the buffer-replacement strategy impact performance (i.e. the number of additional I/O's)?
Example
FOR each chunk of M-1 blocks of S DO BEGIN
read these blocks into main-memory buffers;
organize their tuples into a search structure whose
search key is the common attributes of R and S;
FOR each block b of R DO BEGIN
read b into main memory;
FOR each tuple t of b DO BEGIN
find the tuples of S in main memory that
join with t ;
output the join of t with each of these tuples;
END ;
END ;
END ;
Figure 15.8: The nested-loop join algorithm
Example
The number of iterations of the outer loop depends on the average number of buffers available at each iteration.
The outer loop uses M-1 buffers, and 1 is reserved for a block of R, the relation of the inner loop.
If we pin the M-1 blocks we use for S on one iteration of the outer loop, we shall not lose their buffers during the round.
Also, more buffers may become available and then we could keep more than one block of R in memory.
Will these extra buffers improve the running time?
Example
CASE1: NO
Buffer-replacement strategy: LRU. Buffers for R: k.
We read each block of R in order into the buffers, so by the end of an iteration of the outer loop, the last k blocks of R are in the buffers.
However, the next iteration starts from the beginning of R again, so all k buffers holding R will have to be replaced.
Example
CASE 2: YES
Buffer-replacement strategy: LRU
Buffers for R: k
We read the blocks of R in an order that alternates: first-to-last on one iteration, then last-to-first on the next.
In this way, we save k disk I/O's on each iteration of the outer loop except the first.
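The two cases can be checked with a short simulation. The sketch below is illustrative (the helper `simulate` is not from the text): it counts disk reads under LRU for repeated scans of R's blocks, comparing the straight order against the alternating "rocking" order.

```python
from collections import OrderedDict

def simulate(order_per_iteration, capacity):
    """Count disk reads for a sequence of block-access iterations
    under LRU replacement with `capacity` buffers."""
    buffers, reads = OrderedDict(), 0
    for iteration in order_per_iteration:
        for block in iteration:
            if block in buffers:
                buffers.move_to_end(block)   # hit: now most recent
            else:
                reads += 1                   # miss: read from disk
                if len(buffers) >= capacity:
                    buffers.popitem(last=False)  # evict LRU block
                buffers[block] = True
    return reads

blocks = list(range(10))   # 10 blocks of R
k = 3                      # buffers available for R
straight = simulate([blocks, blocks], k)        # CASE 1
rocking = simulate([blocks, blocks[::-1]], k)   # CASE 2
# With one extra iteration, rocking saves exactly k reads:
# the last k blocks of the first pass are still buffered when
# the second pass revisits them first.
```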
Other Algorithms and M buffers
Other algorithms are also impacted by M and the buffer-replacement strategy.
Sort-based algorithms:
If M shrinks, we can change the size of a sublist.
Unexpected result: too many sublists, so that we cannot allocate a buffer to each sublist during merging.
Hash-based algorithms:
If M shrinks, we can reduce the number of buckets, as long as the buckets still fit in M buffers.
THANK YOU !
Buffer Management
By Snigdha Rao Parvatneni
SJSU ID: 008648978
Roll Number: 124
Course: CS257
Agenda
• Introduction
• Role of Buffer Management
• Architecture of Buffer Management
• Buffer Management Strategies
• Relation Between Physical Operator Selection And Buffer Management
• Example
Introduction
• We assume that operators on relations have some main-memory buffers in which to store the data they need.
• It is very rare that these buffers are allocated to an operator in advance.
• The task of assigning main-memory buffers to processes is given to the buffer manager.
• The buffer manager is responsible for allocating main memory to processes as needed, while minimizing delays and unsatisfiable requests.
Role of Buffer Manager
• The buffer manager responds to requests for main-memory access to disk blocks.
• The buffer manager controls main memory directly.
Architecture of Buffer Management
• There are two broad architectures for a buffer manager:
– The buffer manager controls main memory directly, as in many relational DBMSs.
– The buffer manager allocates buffers in virtual memory and lets the OS decide which buffers should be in main memory and which should be in the OS-managed disk swap space, as in many object-oriented DBMSs and main-memory DBMSs.
Problem
• Irrespective of the approach, the buffer manager must limit the number of buffers to fit in the available main memory.
– In the case where the buffer manager controls main memory directly:
• If requests exceed the available space, the buffer manager has to select a buffer to empty by returning its contents to disk.
• Blocks that have not been changed are simply erased from main memory; blocks that have been changed are written back to their place on disk.
– In the case where the buffer manager allocates space in virtual memory:
• The buffer manager has the option of allocating more buffers than can actually fit into main memory. When all of these buffers are in use, thrashing occurs.
• Thrashing is an operating-system problem in which many blocks are moved in and out of the disk's swap space; the system ends up spending most of its time swapping blocks and getting very little work done.
Solution
• To resolve this problem, the number of buffers is set when the DBMS is initialized.
• Users need not worry about the mode of buffering used.
• From the users' point of view there is a fixed-size buffer pool, in other words a set of buffers available to queries and other database actions.
Buffer Management Strategies
• The buffer manager must make the critical choice of which block to keep and which to discard when a buffer is needed for a newly requested block.
• For this, the buffer manager uses buffer-replacement strategies. Some common strategies are:
– Least-Recently Used (LRU)
– First-In-First-Out (FIFO)
– The Clock Algorithm (Second Chance)
– System Control
Least-Recently Used (LRU)
• The rule is to throw out the block that has gone unread and unwritten for the longest time.
• To do this, the buffer manager needs to maintain a table indicating the last time the block in each buffer was accessed.
• Each database access must make an entry in this table, so a significant amount of effort is involved in maintaining this information.
• Buffers that have not been used for a long time are less likely to be accessed soon than buffers that have been accessed recently. Hence, it is an effective strategy.
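A minimal sketch of an LRU buffer pool, using Python's `OrderedDict` as the "last accessed" table described above (the class and attribute names are illustrative, not from any real DBMS):

```python
from collections import OrderedDict

class LRUBufferPool:
    """Illustrative LRU buffer replacement."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.buffers = OrderedDict()  # block_id -> contents, LRU first
        self.disk_reads = 0

    def access(self, block_id):
        if block_id in self.buffers:
            # Hit: record this block as the most recently used.
            self.buffers.move_to_end(block_id)
            return
        # Miss: read from disk; evict the least recently used if full.
        self.disk_reads += 1
        if len(self.buffers) >= self.capacity:
            self.buffers.popitem(last=False)
        self.buffers[block_id] = True
```

Note that every `access` call updates the ordering, which is exactly the per-access bookkeeping cost the slide mentions.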
First-In-First-Out (FIFO)
• Under this rule, when a buffer is needed, the buffer that has been occupied the longest by the same block is emptied and used for the new block.
• The buffer manager needs to know only the time at which the block occupying each buffer was loaded into it.
• An entry in the table is made when a block is read from disk, not every time it is accessed.
• FIFO involves less maintenance than LRU, but it is more prone to mistakes.
The Clock Algorithm
• It is an efficient approximation of LRU and is commonly implemented.
• The buffers are treated as if arranged in a circle, with an arrow pointing to one of them. The arrow rotates clockwise when it needs to find a buffer in which to place a disk block.
• Each buffer has an associated flag with value 0 or 1. Buffers with flag value 0 are vulnerable to having their contents transferred to disk, whereas buffers with flag value 1 are not.
• Whenever a block is read into a buffer, or the contents of a buffer are accessed, the associated flag is set to 1.
Working of the Clock Algorithm
• Whenever a buffer is needed for a block, the arrow looks for the first 0 it can find in the clockwise direction.
• As the arrow moves, it changes flag values from 1 to 0.
• A block is thrown out of its buffer only if it remains unaccessed (i.e., its flag stays 0) for the time between two rotations of the arrow.
• On the first rotation the flag is set from 1 to 0, and on the second rotation the arrow comes back to check the flag value.
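The two-rotation behavior can be sketched in a few lines of Python. This is an illustrative implementation of the clock (second-chance) scheme described above; the class and method names are assumptions made for the example.

```python
class ClockBufferPool:
    """Illustrative clock (second-chance) replacement."""
    def __init__(self, capacity):
        self.blocks = [None] * capacity   # block held by each buffer
        self.flags = [0] * capacity       # 1 = recently used
        self.hand = 0                     # the rotating "arrow"

    def access(self, block_id):
        """Return True on a buffer hit, False if the block was loaded."""
        if block_id in self.blocks:
            self.flags[self.blocks.index(block_id)] = 1
            return True
        # Rotate the hand until a buffer with flag 0 is found,
        # clearing flags from 1 to 0 as the hand passes them.
        while self.flags[self.hand] == 1:
            self.flags[self.hand] = 0
            self.hand = (self.hand + 1) % len(self.blocks)
        # This buffer went unaccessed for a full rotation: evict it.
        self.blocks[self.hand] = block_id
        self.flags[self.hand] = 1
        self.hand = (self.hand + 1) % len(self.blocks)
        return False
```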
System Control
• The query processor and other DBMS components can advise the buffer manager, to avoid some of the mistakes that occur with LRU, FIFO, or Clock.
• Some blocks cannot be moved out of main memory without first modifying other blocks that point to them. Such blocks are called pinned blocks.
• The buffer manager must modify its replacement strategy to avoid expelling pinned blocks. As a result, some blocks remain in main memory even though, apart from the pin, there would be no obstacle to writing them to disk.
Relation Between Physical Operator Selection And Buffer Management
• The physical operators used to execute a query are selected by the query optimizer. This selection may assume that a certain number of buffers are available for executing the operators.
• But, as we know, the buffer manager does not guarantee that this number of buffers is available when the query is executed.
• Two questions now arise:
– Can an algorithm adapt to changes in the number of available main-memory buffers?
– When the expected number of buffers is not available, and certain blocks expected in main memory are moved to disk, how does the buffer-replacement strategy of the buffer manager affect the number of I/O's performed?
Example
• Block-based nested-loop join: the algorithm itself does not depend on the number of available buffers M, but its performance does.
• For each chunk of M-1 blocks of the outer-loop relation S, read the blocks into main memory and organize their tuples into a search structure whose key is the common attributes of R and S.
• Then, for each block b of R, read b into main memory, and for each tuple t of b find the tuples of S in main memory that join with t.
• The number of outer-loop iterations depends on the average number of buffers available at each iteration; the outer loop uses M-1 buffers, and 1 is reserved for the inner-loop relation, a block of R.
• If we pin the M-1 blocks used for S in one iteration of the outer loop, we cannot lose those buffers during that round. In addition, if more buffers become available, more than one block of R can be kept in memory. Will that improve the running time?
Cases with LRU
• Case 1
– With LRU as the buffer-replacement strategy, suppose k buffers are available to hold blocks of R.
– R is read in order, so the blocks remaining in the buffers at the end of an iteration of the outer loop are the last k blocks of R.
– The next iteration starts from the beginning of R again. Therefore, the k buffers for R all need to be replaced.
• Case 2
– In a better implementation of nested-loop join under LRU, visit the blocks of R in an order that alternates: first-to-last, then last-to-first.
– In this way we save k disk I/O's on each iteration except the first.
With Other Algorithms
• Other algorithms are also impacted by the fact that the availability of buffers can vary, and by the buffer-replacement strategy used by the buffer manager.
• In sort-based algorithms, when the availability of buffers is reduced, we can change the size of the sublists. The major limitation is that we may be forced to create so many sublists that we cannot allocate a buffer to each sublist in the merging process.
• In hash-based algorithms, when the availability of buffers is reduced, we can reduce the number of buckets, provided the buckets do not then become so large that they do not fit into the allotted main memory.
References
• Database Systems: The Complete Book, Second Edition, by Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom
Thank You
David Le
CS257, ID: 126
Feb 28, 2013
15.9 Query Execution Summary
Query Processing
Outline of Query Compilation
Table Scanning
Cost Measures
Review of Algorithms:
– One-pass methods
– Nested-loop join
– Two-pass (sort-based, hash-based)
– Index-based
– Multi-pass
Overview
[Diagram: a query enters query compilation, which produces a query plan; query execution then carries out the plan against the data, consulting the metadata.]
The query is compiled. This involves extensive optimization using the operations of relational algebra.
It is first compiled into a logical query plan, e.g., using expressions of relational algebra.
The logical plan is then converted to a physical query plan, by selecting an implementation for each operator, ordering the joins, and so on.
The query is then executed.
Outline of Query Compilation
[Diagram: SQL query → parse query (expression tree) → select logical plan → query optimization: logical query plan tree → select physical plan → physical query plan tree → execute plan.]
Parsing: A parse tree for the query is constructed.
Query Rewrite: The parse tree is converted to an initial query plan and transformed into a logical query plan.
Physical Plan Generation: The logical plan is converted into a physical plan by selecting algorithms and an order of execution.
Table Scanning
There are two approaches for locating the tuples of a relation R:
• Table-scan: Get the blocks one by one.
• Index-scan: Use an index to lead us to all blocks holding R.
A sort-scan takes a relation R and sorting specifications, and produces R in sorted order.
This can be accomplished with the SQL clause ORDER BY.
Cost Measures
Estimates of cost are essential for query optimization.
They allow us to determine the slow and fast parts of a query plan.
Reading many consecutive blocks on a track is extremely important, since disk I/O's are expensive in terms of time.
EXPLAIN SELECT * FROM a JOIN b ON a.id = b.id;
One-pass Methods
Tuple-at-a-time: Selection and projection, which do not require an entire relation in memory at once.
Full-relation, unary operations: Must see all or most of the tuples in memory at once; used by the grouping and duplicate-elimination operators.
Full-relation, binary operations: These include union, intersection, difference, product, and join.
Review of Algorithms
Nested-Loop Joins
In a sense, it is a 'one-and-a-half'-pass method, since one argument has its tuples read only once, while the other is read repeatedly.
Can use relations of any size; the data does not all have to fit in main memory.
Two variations of nested-loop join:
• Tuple-based: The simplest form; can be very slow, since it takes T(R)*T(S) disk I/O's if we are joining R(x,y) with S(y,z).
• Block-based: Organizes access to both argument relations by blocks and uses as much main memory as possible to store tuples.
Review of Algorithms
Two-pass Algorithms
Two passes are usually enough, even for large relations.
Based on sorting:
• Partition the arguments into memory-sized, sorted sublists.
• The sorted sublists are then merged appropriately to produce the desired result.
Based on hashing:
• Partition the arguments into buckets. Useful if the data is too big to fit in memory.
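The hashing idea can be sketched as follows: the first pass partitions each argument into buckets on the join key, so that the second pass only has to join one bucket pair at a time in memory. The function names and tuple layout below are assumptions made for illustration.

```python
def hash_partition(tuples, key, num_buckets):
    """First pass: partition tuples into buckets by hashing the key,
    so each bucket can later be processed in memory."""
    buckets = [[] for _ in range(num_buckets)]
    for t in tuples:
        buckets[hash(key(t)) % num_buckets].append(t)
    return buckets

def hash_join(R, S, num_buckets):
    """Illustrative two-pass hash join of (key, value) tuple lists.
    Matching keys always land in the same bucket pair."""
    r_buckets = hash_partition(R, lambda t: t[0], num_buckets)
    s_buckets = hash_partition(S, lambda t: t[0], num_buckets)
    out = []
    # Second pass: join corresponding buckets one pair at a time.
    for rb, sb in zip(r_buckets, s_buckets):
        lookup = {}
        for k, v in sb:
            lookup.setdefault(k, []).append(v)
        for k, v in rb:
            for w in lookup.get(k, []):
                out.append((k, v, w))
    return out
```

Only one bucket of each relation needs to be in memory at a time, which is why hashing requires just one of the arguments to be small relative to memory.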
Review of Algorithms
Two-pass Algorithms
Sort-based vs. hash-based:
• Hash-based algorithms are often superior to sort-based ones, since they require only one of the arguments to be small.
• Sort-based algorithms work well when there is a reason to keep some of the data sorted.
Review of Algorithms
Index-based Algorithms
Index-based joins are excellent when one of the relations is small, and the other has an index on the join attributes.
Clustering and non-clustering indexes:
• A clustering index has all tuples with a fixed value packed into the minimum number of blocks.
• A clustered relation can also have non-clustering indexes.
Review of Algorithms
Multi-pass Algorithms
Two-pass algorithms based on sorting or hashing can be generalized to three or more passes, and then work for larger data sets.
Each pass of a sorting algorithm reads all data from disk and writes it out again.
Thus, a k-pass sorting algorithm requires 2·k·B(R) disk I/O’s.
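The stated cost is simple arithmetic; the function below just encodes the 2·k·B(R) formula from the slide (the function name is illustrative).

```python
def multipass_sort_io(k, B):
    """Total disk I/O's for a k-pass sort in which every pass
    reads all B blocks from disk and writes them back: 2*k*B."""
    return 2 * k * B

# For example, a 3-pass sort of a relation with B(R) = 1,000,000
# blocks costs 2 * 3 * 1,000,000 = 6,000,000 disk I/O's.
```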
Review of Algorithms
Questions or Cookies?
THANK YOU.