secondary storage management the memory hierarchy
TRANSCRIPT
Secondary Storage Management: The Memory Hierarchy
The Memory Hierarchy
• Computer systems have several different components in which data may be stored.
• Data capacities and access speeds range over at least seven orders of magnitude.
• Devices with the smallest capacity also offer the fastest access speed.
• The term memory hierarchy is used in computer architecture when discussing performance issues in architectural design, algorithm predictions, and lower-level programming constructs such as locality of reference.
Description of Levels
1. Cache
• A megabyte or more of cache storage.
• On-board cache: on the same chip as the processor.
• Level-2 cache: on another chip.
• Cache data is accessed in a few nanoseconds.
• Data is moved from main memory to the cache when it is needed by the processor.
• Volatile.
Description of Levels
2. Main Memory
• 1 GB or more of main memory.
• Instruction execution and data manipulation involve information resident in main memory.
• Time to move data from main memory to the processor or cache is in the 10-100 nanosecond range.
• Volatile.
3. Secondary Storage
• Typically a magnetic disk.
• Capacity up to 1 TB.
• One machine can have several disk units.
• Time to transfer a single byte between disk and main memory is around 10 milliseconds.
Description of Levels
4. Tertiary Storage
• Holds data volumes measured in terabytes.
– As capacious as a collection of disk units can be, there are databases much larger than what can be stored on the disk(s) of a single machine, or even of several machines.
• Tertiary storage is characterized by significantly higher read/write times than secondary storage.
• Smaller cost per byte.
• Retrieval takes seconds or minutes, but capacities in the petabyte range are possible.
Transfer of Data Between Levels
• Data moves between adjacent levels of the hierarchy.
• Each level is organized to transfer large amounts of data to or from the level below
• A key technique for speeding up database operations is to arrange data so that when one piece of a disk block is needed, it is likely that other data on the same block will also be needed at about the same time.
Volatile & Non-Volatile Storage
• A volatile device “forgets” what is stored in it when the power goes off.
• Example: main memory.
• A nonvolatile device, on the other hand, is expected to keep its contents intact even for long periods when the device is turned off or there is a power failure.
• Example: Secondary & Tertiary Storage
Note: No change to the database can be considered final until it has migrated to nonvolatile, secondary storage.
Virtual Memory
• Managed by the operating system.
• Typical software executes in a virtual address space that is typically 32 bits wide; there are 2^32 bytes, or 4 gigabytes, in such a virtual memory.
• Some of the virtual memory is held in main memory, and the rest on disk.
• Transfer between the two is in units of disk blocks (pages).
• Virtual memory is not a level of the memory hierarchy.
Thank you!
Section 13.2 – Secondary storage management
CS-257 Database System Principles
Avinash Anantharamu (102)
008629907
Index
• 13.2 Disks
• 13.2.1 Mechanics of Disks
• 13.2.2 The Disk Controller
• 13.2.3 Disk Access Characteristics
Structure of a Disk
• The two principal moving pieces of a hard drive:
1. The head assembly
2. The disk assembly
• The disk assembly has one or more circular platters that rotate around a central spindle.
• The upper and lower surfaces of the platters are covered with a thin layer of magnetic material, on which bits are stored.
• 0’s and 1’s are represented by different patterns in the magnetic material.
• A common diameter for disk platters is 3.5 inches, although disks with diameters from an inch to several feet have been built.
Mechanics of Disks
Top View of Disk Surface
• Tracks are concentric circles on a platter.
• The disk is organized into tracks, and tracks are organized into sectors, which are segments of the circular platter.
• In 2008, a typical disk had about 100,000 tracks per inch but stored about a million bits per inch along the tracks.
• Sectors are indivisible as far as errors are concerned.
• Blocks are logical data-transfer units.
Mechanics of Disks
Disk Controller
• Controls the actuator that moves the head assembly.
• Selects the surface from which to read or write.
• Transfers bits from the desired sector to main memory.
• By buffering an entire track or more in the local memory of the disk controller, additional accesses to the disk can be avoided.
Simple Single Processor Computer
Disk Access Characteristics
• Seek time
• Rotational latency
• Transfer time
• Latency of the disk
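The total latency of a single disk access is the sum of these components. A rough sketch follows; the drive parameters below are illustrative assumptions, not values from the slides:

```python
# Hypothetical drive parameters (illustrative assumptions only).
AVG_SEEK_MS = 9.0            # average seek time, ms
RPM = 7200                   # spindle speed, rotations per minute
TRANSFER_RATE_MB_S = 100.0   # sustained transfer rate, MB/s
BLOCK_KB = 16                # block size, KB

def avg_block_access_ms() -> float:
    """Latency = seek time + rotational latency + transfer time."""
    rotation_ms = 60_000.0 / RPM             # time for one full rotation
    rotational_latency_ms = rotation_ms / 2  # on average, half a rotation
    transfer_ms = (BLOCK_KB / 1024.0) / TRANSFER_RATE_MB_S * 1000.0
    return AVG_SEEK_MS + rotational_latency_ms + transfer_ms
```

With these assumed numbers the average access works out to roughly 13 ms, consistent with the ~10 ms figure used later in Section 13.3.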
Thank you
13.3 Accelerating Access to Secondary Storage
San Jose State University
Spring 2012
13.3 Accelerating Access to Secondary Storage
Section Overview
13.3.1: The I/O Model of Computation
13.3.2: Organizing Data by Cylinders
13.3.3: Using Multiple Disks
13.3.4: Mirroring Disks
13.3.5: Disk Scheduling and the Elevator Algorithm
13.3.6: Prefetching and Large-Scale Buffering
13.3 Introduction
 Average block access time is ~10 ms.
 Disks may be busy serving other requests.
 If requests arrive faster than the disk can serve them, the scheduling latency becomes infinite.
 There are various strategies to increase disk throughput.
 The “I/O model” is the correct model for determining the speed of database operations.
13.3 Introduction (Contd.)
Actions that improve database access speed:
– Place blocks that are accessed together within the same cylinder
– Increase the number of disks
– Mirror disks
– Use an improved disk-scheduling algorithm
– Use prefetching
13.3.1 The I/O Model of Computation
If we have a computer running a DBMS that:
– Is trying to serve a number of users
– Has 1 processor, 1 disk controller, and 1 disk
– Where each user is accessing different parts of the DB
It can be assumed that:
– The time required for disk access is much larger than the time for access to main memory; and as a result:
– The number of block accesses is a good approximation of the time required by a DB algorithm
13.3.2 Organizing Data by Cylinders
It is more efficient to store data that might be accessed together in the same or adjacent cylinder(s).
In a relational database, related data should be stored in the same cylinder.
By doing so, we can approach the theoretical transfer rate for moving data on or off the disk.
13.3.3 Using Multiple Disks
If the disk controller supports the addition of multiple disks and has efficient scheduling, using multiple disks can improve performance significantly.
By striping a relation across multiple disks, each chunk of data can be retrieved in parallel, improving performance by up to a factor of n, where n is the total number of disks the data is striped over.
If the disk controller, bus, and main memory can handle n times the data-transfer rate, then n disks will have approximately the performance of one disk that operates n times as fast.
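Striping itself is just a round-robin mapping of logical blocks to disks; a minimal sketch (the function name and layout are illustrative, not from the text):

```python
def striped_location(block_no: int, n_disks: int) -> tuple[int, int]:
    """Round-robin striping: logical block b lives on disk (b mod n),
    at position (b div n) within that disk."""
    return block_no % n_disks, block_no // n_disks
```

Consecutive logical blocks land on different disks, which is what lets a large read proceed on all n disks in parallel.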
A drawback of striping data across multiple disks is that it increases the chance that some disk will fail.
To mitigate this risk, some DBMSs use a disk-mirroring configuration.
Disk mirroring makes each disk a copy of the others, so that if any disk fails, the data is not lost.
Since all the data is in multiple places, the access speedup can be more than a factor of n, because the disk whose head is closest to the requested block can be chosen.
13.3.4 Mirroring Disks
           Advantages                                Disadvantages
Striping   Read/write speedup ~n;                    Higher risk of failure
           capacity increased by ~n
Mirroring  Read speedup ~n; reduced failure risk;    High cost per bit; slow writes
           fast initial access                       compared to striping
13.3.5 Disk Scheduling
One way to improve disk throughput is to improve disk scheduling, prioritizing requests so that they are served more efficiently.
– The elevator algorithm is a simple yet effective disk-scheduling algorithm.
– The algorithm makes the heads of a disk oscillate back and forth, much as an elevator goes up and down.
– The access requests closest to the head’s current position are processed first.
When sweeping outward, the direction of head movement changes only after the largest cylinder request has been processed
When sweeping inward, the direction of head movement changes only after the smallest cylinder request has been processed
Example:

Cylinder   Time Requested (ms)
8000       0
24000      0
56000      0
16000      10
64000      20
40000      30

Cylinder   Time Completed (ms)
8000       4.3
24000      13.6
56000      26.9
64000      34.2
40000      45.5
16000      56.8
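For a single sweep-and-reverse pass, the elevator algorithm's service order can be sketched as follows. This is a simplification of the example above: it assumes all requests are already pending when the sweep starts, whereas the completion times in the tables also depend on when each request arrives:

```python
def elevator_order(head: int, pending: list[int]) -> list[int]:
    """Serve pending cylinder requests with one outward (increasing)
    sweep followed by one inward (decreasing) sweep."""
    outward = sorted(c for c in pending if c >= head)
    inward = sorted((c for c in pending if c < head), reverse=True)
    return outward + inward
```

For a head at cylinder 20000 with all six requests pending, the sweep serves 24000, 40000, 56000, 64000 moving outward, then reverses and serves 16000 and 8000.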
13.3.6 Prefetching and Large-Scale Buffering
In some cases we can anticipate what data will be needed.
We can take advantage of this by prefetching data from the disk before the DBMS requests it.
Since the data is already in memory, the DBMS receives it almost instantly.
? Questions ?
Disk Failures
Presented by Timothy Chen
Spring 2013
Index
• 13.4 Disk Failures
13.4.1 Intermittent Failures
13.4.2 Checksums
13.4.3 Stable Storage
13.4.4 Error-Handling Capabilities of Stable Storage
13.4.5 Recovery from Disk Crashes
13.4.6 Mirroring as a Redundancy Technique
13.4.7 Parity Blocks
13.4.8 An Improvement: RAID 5
13.4.9 Coping with Multiple Disk Crashes
Intermittent Failures
• An intermittent failure occurs if we try to read a sector, but the correct contents of that sector are not delivered to the disk controller.
• With repeated tries, we are able to read or write successfully.
• The controller can tell good sectors from bad ones: after a write, a read is performed to check that the sector is good.
• The controller may attempt to write a sector, but the contents of the sector are not what was intended.
• We can check whether the write was correct; if the sector read back is bad, then the write was apparently unsuccessful and must be repeated.
CheckSum
• A checksum lets the read operation determine the good or bad status of a sector.
• If, on reading, we find that the checksum is not proper for the data bits, then we know there is an error in reading.
• If the checksum is proper, there is still a small chance that the block was not read correctly, but by using many checksum bits we can make the probability of missing a bad read arbitrarily small.
How Checksums Work
• Each sector has some additional bits, set depending on the values of the data bits stored in that sector.
• A simple form of checksum is based on the parity of all the bits in the sector.
• If the data bits and the checksum bits do not have the proper parity, we know there is an error in reading.
• An odd number of 1’s: the bits have odd parity (e.g., 01101000).
• An even number of 1’s: the bits have even parity (e.g., 111011100).
• A single parity bit can detect any one-bit error.
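A one-bit parity checksum can be sketched as follows (the function names are illustrative):

```python
def parity_bit(data_bits: str) -> str:
    """Checksum bit chosen so the total number of 1's, including the
    checksum bit itself, is even."""
    return '1' if data_bits.count('1') % 2 == 1 else '0'

def read_looks_good(data_bits: str, stored_checksum: str) -> bool:
    """A mismatch proves a read error; a match means the read is
    probably (but not certainly) correct."""
    return parity_bit(data_bits) == stored_checksum
```

Flipping any single bit of the sector changes its parity, so a one-bit error is always caught; an even number of flipped bits would go unnoticed, which is why more checksum bits are used in practice.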
Stable Storage
• Deals with disk errors that checksums can detect but cannot correct.
• Sectors are paired, and each pair represents one sector-contents X, with left and right copies XL and XR.
• Reads check the parity of XL and XR, falling back to the other copy (or a spare sector) and retrying until a good value is returned.
• We assume that if the read function returns a good value w for either XL or XR, then w is the true value of X.
Error-Handling Capabilities of Stable Storage
• Since there are two copies, XL and XR, if one of them fails we can still read the other.
• The chance that both fail is quite small.
• A write can fail, for example, during a power outage.
• Media failures and write failures:
– The failure occurred as we were writing XL.
– The failure occurred after we wrote XL.
Recovery from Disk Crashes
• The most serious mode of failure for disks is a “head crash,” where data is permanently destroyed.
• This situation represents a disaster for many DBMS applications, such as banking and other financial applications.
• The way to recover from a crash is to use a RAID method.
• RAID: Redundant Arrays of Independent Disks.
Mirroring as a Redundancy Technique
• Mirroring, as a protection against data loss, is often referred to as RAID level 1.
• We simply mirror each disk.
• Essentially, with mirroring and the other redundancy schemes we discuss, the only way data can be lost is if there is a second disk crash while the first crash is being repaired.
RAID 1 diagram
Parity Blocks
• This technique is often called RAID 4.
• The redundant disk holds the modulo-2 sum (parity) of the corresponding bits of all the other disks, computed column by column:
disk 1: 11110000
disk 2: 10101010
disk 3: 00111000
The redundant disk 4 (an even number of 1’s in a column gives 0, an odd number gives 1):
disk 4: 01100010
RAID 4 diagram
Parity Blocks: Failure Recovery
• RAID 4 can recover from only one disk failure.
• If two or more disks fail, the modulo-2 sum cannot recover them.
• If the failed disk is one of the data disks, we swap in a good disk and recompute its data as the modulo-2 sum of the other disks (including the redundant disk).
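Both the parity computation and single-disk recovery are the same column-by-column modulo-2 sum; a sketch using the blocks from the example above:

```python
def mod2_sum(blocks: list[str]) -> str:
    """Column-by-column modulo-2 (XOR) sum of equal-length bit strings."""
    return ''.join(str(sum(int(b[i]) for b in blocks) % 2)
                   for i in range(len(blocks[0])))

# The redundant (parity) block is the mod-2 sum of the data blocks.
# A single failed block is the mod-2 sum of all surviving blocks,
# because XOR-ing a block with itself cancels it out.
```

For example, recomputing disk 2 from disks 1 and 3 plus the parity disk yields exactly the lost contents.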
An Improvement: RAID 5
• RAID 5 rotates the role of the redundant (parity) disk among all the disks, so that no single disk becomes a write bottleneck.
Coping with Multiple Disk Crashes
• For more than one disk failure, neither RAID 4 nor RAID 5 works.
• So we need RAID 6.
• It needs at least 2 redundant disks.
RAID 6
Reference
• http://www.definethecloud.net/wp-content/uploads/2010/12/325px-RAID_1.svg_.png
• http://en.wikipedia.org/wiki/RAID
Secondary Storage Management
13.5 Arranging data on disk
Mangesh Dahale
ID-105
CS 257
Outline
• Fixed-Length Records
• Example of Fixed-Length Records
• Packing Fixed-Length Records into Blocks
• Example of Packing Fixed-Length Records into Blocks
• Details of the Block Header
Arranging Data on Disk
• A data element such as a tuple or object is represented by a record, which consists of consecutive bytes in some disk block.
Fixed Length Records
The simplest record consists of fixed-length fields.
The record begins with a header, a fixed-length region where information about the record itself is kept.
It is necessary to lay out the record so it can be moved to main memory and accessed efficiently there.

Fixed-Length Record Header
1. A pointer to the record schema.
2. The length of the record; this information helps us skip over records without consulting the schema.
3. A timestamp to indicate when the record was created.
Example
CREATE TABLE employee(
name CHAR(30) PRIMARY KEY,
address VARCHAR(255),
gender CHAR(1),
birthdate DATE
);
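Every tuple of this table occupies the same number of bytes, so each field sits at a fixed offset from the start of the record. A sketch of the offset computation; the 12-byte header and the 10-byte DATE encoding are illustrative assumptions, not values from the slides:

```python
HEADER_BYTES = 12  # assumed size of the record header
# (field name, byte length); lengths follow the schema above, with DATE
# assumed to be stored as 10 bytes (e.g., 'YYYY-MM-DD').
FIELDS = [("name", 30), ("address", 255), ("gender", 1), ("birthdate", 10)]

def field_offsets() -> tuple[dict, int]:
    """Return {field: starting byte offset} and the total record length."""
    offsets, pos = {}, HEADER_BYTES
    for fname, size in FIELDS:
        offsets[fname] = pos
        pos += size
    return offsets, pos
```

Because the offsets are the same for every record, a field can be read without consulting the schema at access time.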
Packing Fixed Length Records into Blocks
• Records are stored in blocks of the disk and moved into main memory when we need to access or update them.
• A block header is written first, and it is followed by a series of records.
Example
• Along with the header, we can pack as many records as will fit into one block, as shown in the figure; any remaining space is unused.
The block header contains the following information:
• Links to one or more other blocks that are part of a network of blocks
• Information about the role played by this block in such a network
• Information about which relation the tuples of this block belong to
• A “directory” giving the offset of each record in the block
• Timestamp(s) indicating the time of the block’s last modification and/or access
•Thank You
13.6 Representing Block and Record Addresses
Lokmanya Thilak
CS257 ID: 106
Topics
Addresses in Client-Server Systems
Logical and Structured Addresses
Pointer Swizzling
Returning Blocks to Disk
Pinned Records and Blocks
Introduction
• A database system consists of a server process that provides data from secondary storage to one or more client processes that are applications using the data.
• Address of a block and of a record:
▫ In main memory, the address of a block is the virtual-memory address of its first byte, and the address of a record within the block is the virtual-memory address of the first byte of the record.
▫ In secondary memory, a sequence of bytes describes the location of the block: the device ID for the disk, the cylinder number, etc.
Addresses in Client-Server Systems
The client application uses a conventional “virtual” address space, typically 32 bits, or about 4 billion different addresses.
Physical addresses: byte strings referring to the place within the secondary storage system where the record can be found.
Logical addresses: arbitrary strings of bytes of some fixed length that map to physical addresses via a map table, stored on disk in a known location.
Map table: maps each logical address to a physical address.
13.6.2 Logical and Structured Addresses
• Purpose of a logical address:
▫ Gives more flexibility when we
 Move the record around within the block
 Move the record to another block
▫ Allows easy updating of records
▫ A structured address gives us an option of deciding what to do when a record is deleted
(Figure: a block with a header, an offset table, and unused space.)
Pointer Swizzling
• Having pointers is common in object-relational database systems.
• Every data item (block, record, etc.) has two addresses:
– database address: the address on disk
– memory address, if the item is in virtual memory
• When a block is moved from secondary storage to main memory, pointers within the block are “swizzled,” i.e., translated from the database address space (server) to the virtual address space (client).
Translation table: maps each database address to its memory address.
Types of Swizzling
 Automatic swizzling: as soon as a block is brought into memory, swizzle all relevant pointers.
 Swizzling on demand: only swizzle a pointer if and when it is actually followed.
 No swizzling: pointers are not swizzled; they are accessed using the database address.
• Unswizzling
– When a block is moved from memory back to disk, all pointers must go back to database (disk) addresses
– Use the translation table again
– It is important to have an efficient data structure for the translation table
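The translation table can be sketched as a pair of dictionaries, i.e., a hash table keyed on each address space (class and method names are illustrative):

```python
class TranslationTable:
    """Maps database (disk) addresses to memory addresses and back, so
    pointers can be swizzled on load and unswizzled on write-back."""
    def __init__(self):
        self.db_to_mem = {}
        self.mem_to_db = {}

    def record_load(self, db_addr, mem_addr):
        """Note that the item at db_addr now lives at mem_addr."""
        self.db_to_mem[db_addr] = mem_addr
        self.mem_to_db[mem_addr] = db_addr

    def swizzle(self, db_addr):
        """Database address -> memory address (None if not loaded)."""
        return self.db_to_mem.get(db_addr)

    def unswizzle(self, mem_addr):
        """Memory address -> database address, for write-back."""
        return self.mem_to_db[mem_addr]
```

Keeping both directions indexed is what makes unswizzling on write-back as cheap as swizzling on load.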
Returning Blocks to Disk
• When a block is moved from memory back to disk, any pointers within that block must be unswizzled, i.e. memory address must be replaced by database address.
• A block in memory is said to be pinned if it cannot be moved back to disk safely, for example because it contains swizzled pointers.
Thank you
Variable Length Data and Records
- Ashwin Kalbhor, Class ID: 107
Agenda
• Records with Variable-Length Fields
• Records with Repeating Fields
• Variable-Format Records
• Records That Do Not Fit in a Block
• Example of a fixed-length record, with the byte offset at which each field begins:

name (0)   address (30)   gender (286)   birth date (287)   [record ends at 297]
Records with Variable Length Fields
• Simple and Effective way to represent variable length records is as follows –1. Fixed length fields are kept ahead of the variable length records.2. A header is put in front of the of the record.3. Record header contains• Length of the record• Pointers to the beginning of all variable length
fields except the first one.
Example
A record with name and address as variable-length fields. The header holds the record length and a pointer to the beginning of address; the fixed-length fields (birth date, gender) come next, followed by the variable-length fields (name, address).
(Figure: record layout showing the header information, record length, and pointer to address.)
Records with repeating fields
• A record may contain a variable number of occurrences of a field F, all occurrences having the same length L.
• All occurrences of field F are grouped together.
• A pointer to the first occurrence of F is put in the header.
• Based on the length L, the starting offset of any occurrence of the repeating field can be obtained.
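The offset arithmetic is a one-liner (the function name is illustrative):

```python
def occurrence_offset(first_offset: int, L: int, k: int) -> int:
    """Starting offset of the k-th occurrence (0-based) of repeating
    field F, given the offset of the first occurrence and length L."""
    return first_offset + k * L
```

So only one pointer, to the first occurrence, needs to be stored in the header.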
Example of a record with Repeating Fields
A movie-star record with “movies” as the repeating field.
(Figure: the header holds the record length, a pointer to address, a pointer to the movie pointers, and other header information; then come name, address, and the pointers to movies.)
Alternative representation
• The record itself is of fixed length.
• Variable-length fields are stored in a separate block.
• The record itself keeps track of:
1. Pointers to the place where each repeating field begins, and
2. Either how many repetitions there are, or where the repetitions end.
Storing variable length fields separately from the record.
Variable Format Records
• Records that do not have a fixed schema.
• Represented by a sequence of tagged fields.
• Each tagged field carries information about itself:
• The attribute or field name
• The type of the field
• The length of the field
• The value of the field
Variable Format Records
(Figure: two tagged fields of a movie-star record.)
N (code for name)              S (code for string type)   16 (length)   Clint Eastwood
R (code for restaurant owned)  S (code for string type)   14 (length)   Hog’s Breath Inn
Records That Do Not Fit in a Block
• When the length of a record is greater than the block size, the record is divided and placed into two or more blocks.
• The portion of the record in each block is referred to as a RECORD FRAGMENT.
• A record with two or more fragments is called a SPANNED RECORD.
• A record that does not cross a block boundary is called an UNSPANNED RECORD.
Spanned Records
• Spanned records require the following extra header information:
• A bit indicating whether the piece is a fragment or not
• A bit indicating whether it is the first or last fragment of a record
• Pointers to the next or previous fragment of the same record
Spanned Records
(Figure: block 1 holds record 1 and fragment 2-a; block 2 holds fragment 2-b and record 3; block headers and record headers are shown.)
Thank You.
13.8 Record Modifications
CS257
Lok Kei Leong (108)
Outline
• Record Insertion
• Record Deletion
• Record Update
Insertion
• Insert new records into a relation:
- records of the relation kept in no particular order
- records of the relation kept in a fixed order (e.g., sorted by primary key)
• A pointer to a record from outside the block is a “structured address.”
What If The Block is Full?
• If we need to insert a record into a particular block but the block is full, what should we do?
• Find room outside the block.
• There are two solutions:
I. Find space on a nearby block
II. Create an overflow block
Insertion (solution 1)
• Find space on a “nearby” block.
• If block B1 has no space and space is available on block B2, move records of B1 to B2.
• If there are external pointers to the records of B1 that moved to B2, leave a forwarding address in the offset table of B1.
Insertion (solution 2)
• Create an overflow block.
• Each block B has in its header a pointer to an overflow block where additional records of B can be placed.
Deletion
• Reclaim the freed space by sliding records around the block.
• If we cannot slide records, maintain an available-space list in the block header to keep track of the space available.
• Avoid pointers that dangle or wind up pointing to a new record.
Tombstone
• What about pointers to deleted records?
• A tombstone is placed in place of each deleted record.
• A tombstone is a bit placed at the first byte of the deleted record to indicate that the record was deleted (0 = not deleted, 1 = deleted).
• A tombstone is permanent.
Update
• For fixed-length records, an update has no effect on the storage system.
• For variable-length records, updating raises the same issues as insertion and deletion, except that we never create a tombstone for the old record.
• For a longer updated record, create more space on its block by:
- sliding records
- creating an overflow block
Question?
BITMAP INDEXES
Mahathi Kashojula (ID: 132)
Contents
• 14.7.1 Motivation for Bitmap Indexes
• 14.7.2 Compressed Bitmaps
• 14.7.3 Operating on Run-Length-Encoded Bit-Vectors
• 14.7.4 Managing Bitmap Indexes
Introduction
• A bitmap index is a special kind of index that stores the data as bit arrays (commonly called "bitmaps").
• It answers most queries by performing bitwise logical operations on these bitmaps.
• The bitmap index is designed for cases where the number of distinct values is low; in other words, the values repeat very frequently.
Example
No F G
1 30 FOO
2 30 BAR
3 40 BAZ
4 50 FOO
5 40 BAR
6 30 BAZ
• Suppose a file consists of records with two fields, F and G, of type integer and string, respectively. The current file has six records, numbered 1 through 6, with the following values in order:
Example (contd…)
Value Vector
30 110001
40 001010
50 000100
• A bitmap index for the first field, F, would have three bit-vectors, each of length 6, as shown in the table.
• In each case, the 1's indicate the records in which the corresponding value appears.
• Table 2
Example (contd…)
• Table 3
• A bitmap index for the second field, G, would have three bit-vectors, each of length 6, as shown in the table.
• In each case, the 1's indicate the records in which the corresponding string appears.
Value Vector
FOO 100100
BAR 010010
BAZ 001001
Motivation for Bitmap Indexes:
• Table 4
• Bitmap indexes can help answer range queries.
• Example: given is the data of a jewelry store; the attributes are age and salary.
No Age Salary
1 25 60
2 45 60
3 50 75
4 50 100
5 50 120
6 70 110
7 85 140
8 30 260
9 25 400
10 45 350
11 50 275
12 60 260
Motivation (contd…)
• Table 5
• A bitmap index for the first field, age, would have seven bit-vectors, each of length 12, as shown in the table.
• In each case, the 1's indicate the records in which the corresponding value appears.
Value Vector
25 100000001000
30 000000010000
45 010000000100
50 001110000010
60 000000000001
70 000001000000
85 000000100000
Motivation (contd…)
• Table 6
• A bitmap index for the second field, salary, would have ten bit-vectors, each of length 12, as shown in the table.
• In each case, the 1's indicate the records in which the corresponding value appears.
Value Vector
60 110000000000
75 001000000000
100 000100000000
110 000001000000
120 000010000000
140 000000100000
260 000000010001
275 000000000010
350 000000000100
400 000000001000
Motivation (contd…)
• Suppose we want to find the jewelry buyers with an age in the range 45-55 and a salary in the range 100-200.
• We first find the bit-vectors for the age values in this range; in this example there are only two: 010000000100 and 001110000010, for 45 and 50, respectively.
• If we take their bitwise OR, we have a new bit-vector with 1 in position i if and only if the ith record has an age in the desired range.
• The new bit-vector is 011110000110.
Motivation (contd…)
• Next, we find the bit-vectors for the salaries between 100 and 200.
• There are four, corresponding to salaries 100, 110, 120, and 140:
100: 000100000000
110: 000001000000
120: 000010000000
140: 000000100000
• Their bitwise OR is 000111100000.
Motivation (contd…)
• The last step is to take the bitwise AND of the two bit-vectors we calculated by OR:
    011110000110
AND 000111100000
----------------
    000110000000
• We thus find that only the fourth and fifth records, which are (50,100) and (50,120), are in the desired range.
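The whole range query can be reproduced with ordinary integer bitwise operations, holding each bit-vector as a string (helper names are illustrative):

```python
# Bit-vectors from the jewelry example (record positions 1..12).
AGE = {25: "100000001000", 30: "000000010000", 45: "010000000100",
       50: "001110000010", 60: "000000000001", 70: "000001000000",
       85: "000000100000"}
SALARY = {60: "110000000000", 75: "001000000000", 100: "000100000000",
          110: "000001000000", 120: "000010000000", 140: "000000100000",
          260: "000000010001", 275: "000000000010", 350: "000000000100",
          400: "000000001000"}

def bits_or(vectors: list[str]) -> str:
    """Bitwise OR of equal-length bit-vectors."""
    acc = 0
    for v in vectors:
        acc |= int(v, 2)
    return format(acc, "0%db" % len(vectors[0]))

def range_vector(index: dict, lo: int, hi: int) -> str:
    """OR together the vectors of all indexed values in [lo, hi]."""
    return bits_or([v for val, v in index.items() if lo <= val <= hi])

def bits_and(a: str, b: str) -> str:
    """Bitwise AND of two equal-length bit-vectors."""
    return format(int(a, 2) & int(b, 2), "0%db" % len(a))
```

`bits_and(range_vector(AGE, 45, 55), range_vector(SALARY, 100, 200))` reproduces the result 000110000000 computed above.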
Compressed Bitmaps
• Consider: the file F has n records, and attribute A has m distinct values in F.
• The size of a bitmap index on attribute A is m*n bits.
• If m is large, then 1's in a bit-vector will be very rare, and such sparse vectors compress well.
• A common encoding approach is called run-length encoding.
Run-Length Encoding
• Represents runs: a run is a sequence of i 0's followed by a 1, represented by a suitable binary encoding of the integer i.
• A run of i 0's followed by a 1 is encoded by:
 First computing how many bits are needed to represent i in binary; call that number j.
 Then representing the run by j-1 1's and a single 0, followed by the j bits that represent i in binary.
 The encoding for i = 1 is 01 (j = 1); the encoding for i = 0 is 00 (j = 1).
• We concatenate the codes for the runs together, and the resulting sequence of bits is the encoding of the entire bit-vector.
Run-Length Encoding (contd…)
• Let us decode the sequence 11101101001011.
• Starting at the beginning (leftmost bit):
 First run: the first 0 is at position 4, so j = 4. The next 4 bits are 1101, so the first integer is i = 13.
 Second run: the bits 00 give j = 1 and i = 0.
 Third run: the bits 1011 give j = 2 and i = 3.
• The run lengths are thus 13, 0, 3; hence the bit-vector is 0000000000000110001.
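The encoding and decoding just described can be sketched directly (function names are illustrative):

```python
def encode_runs(runs: list[int]) -> str:
    """Encode each run length i as (j-1) 1's, a 0, then i in j bits,
    where j is the number of bits needed to write i in binary."""
    out = []
    for i in runs:
        j = max(i.bit_length(), 1)          # i = 0 still needs one bit
        out.append("1" * (j - 1) + "0" + format(i, "0%db" % j))
    return "".join(out)

def decode_runs(code: str) -> list[int]:
    """Inverse of encode_runs: recover the run lengths."""
    runs, pos = [], 0
    while pos < len(code):
        j = 1
        while code[pos] == "1":             # count the j-1 leading 1's
            j += 1
            pos += 1
        pos += 1                            # skip the separating 0
        runs.append(int(code[pos:pos + j], 2))
        pos += j
    return runs

def runs_to_bitvector(runs: list[int]) -> str:
    """Each run of length i stands for i 0's followed by a 1."""
    return "".join("0" * i + "1" for i in runs)
```

Running the decoder on the sequence from the slide recovers the runs 13, 0, 3 and the 19-bit vector above.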
Managing Bitmap Indexes
1) Finding bit-vectors
• Think of each bit-vector as the value associated with its attribute value, used as a key.
• Any secondary-storage index technique will then be efficient in retrieving the bit-vectors.
• Create a secondary index with the attribute value as the search key.
2) Finding records
• Create a secondary index with the record number as the search key (if we need record k, we can use the kth position as a search key).
3) Handling modifications
• Record numbers must remain fixed once assigned.
• Changes to the data file require changes to the bitmap index.
References:
http://en.wikipedia.org/wiki/Bitmap_index
http://en.wikipedia.org/wiki/Run-length_encoding
Thank You
Questions ????
Query Execution
Section 15.1
Sweta Shah
CS257: Database Systems
ID: 118
Agenda
 Query Processor
 Query Compilation
 Physical Query Plan Operators
 Scanning Tables
 Table Scan
 Index Scan
 Sorting While Scanning Tables
 Model of Computation for Physical Operators
 Parameters for Measuring Cost
 Iterators
Query Processor
The query processor is the group of components of a DBMS that turns user queries and data-modification commands into a sequence of database operations and executes those operations.
The query processor is responsible for supplying the details of how the query is to be executed; a naive execution strategy for a query may take far more time than necessary.
The major parts of the query processor
Query compilation itself is a multi-step process consisting of :
Parsing: in which a parse tree representing query and its structure is constructed
Query rewrite: in which the parse tree is converted to an initial query plan
Physical plan generation: where the logical query plan is turned into a physical query plan by selecting algorithms.
The physical plan also includes details such as how the queried relations are accessed, and when and if a relation should be sorted.
Query compilation
Outline of query compilation
Physical Query Plan Operators
 Physical query plans are built from operators.
 Each operator implements one step of the plan.
 They are particular implementations of the operators of relational algebra.
 We also need physical operators for tasks that do not involve an operation of relational algebra, such as “scan,” which scans tables.
Scanning Tables
 Scanning is one of the most basic operations in a physical query plan.
 It is necessary when we want to perform a join or union of a relation with another relation.
 There are two basic approaches to locating the tuples of a relation R:
1. Table-scan
2. Index-scan
Table-scan:
 Relation R is stored in secondary memory with its tuples arranged in blocks.
 It is possible to get the blocks one by one.
Index-scan:
 There is an index on some attribute of relation R.
 Use this index to get all the tuples of R.
Sorting While Scanning Tables
Why do we need sorting while scanning?
 The query could include an ORDER BY clause requiring that a relation be sorted.
 Various algorithms for relational-algebra operations require one or both of their arguments to be sorted relations.
 Sort-scan takes a relation R and a specification of the attributes on which the sort is to be made, and produces R in that sorted order.
 If relation R must be sorted by attribute a, and there is a B-tree index on a, then a scan of the index allows us to produce R in the desired order.
Model of Computation for Physical Operators
 Choosing physical-plan operators wisely is essential for a good query processor.
 The cost of an operation is measured in the number of disk I/O operations.
 If an operator produces the final answer to a query, and that result is indeed written to disk, then the cost of doing so depends only on the size of the answer, and that write-back cost is added to the total cost of the query.

Improvements in Cost
 Major improvements in the cost of the physical operators can be achieved by avoiding or reducing the number of disk I/O operations.
 This can be achieved by passing the answer of one operator to the next in main memory, without writing it to disk.
 We shall also see situations where several operations share the main memory, so M could be much smaller than the total main memory.
Parameters for Measuring Costs
Parameters that affect the performance of a query:
 Buffer-space availability in main memory at the time the query is executed
 The size of the input and the size of the output generated
 The size of a block on disk and the amount of available main memory also affect performance
Iterators for Implementation of Physical Operators
 Many physical operators can be implemented as an iterator.
 An iterator is a group of three functions that allows a consumer of the result of the physical operator to get the result one tuple at a time.
 The three methods forming the iterator for an operation are:
1. Open()
2. GetNext()
3. Close()
The three functions forming the iterator are: Open: This function starts the process of getting tuples. It initializes any data structures needed to perform the
operation
Iterator
GetNext(): Returns the next tuple in the result and adjusts data structures as necessary to allow subsequent tuples to be obtained. If there are no more tuples to return, GetNext returns a special value, NotFound.
Iterator
Close(): Ends the iteration after all tuples have been obtained; it calls Close on any arguments of the operator.
Iterator
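The three-method interface above can be rendered in Python. This is a minimal, illustrative sketch (the class name TableScan and the NOT_FOUND sentinel are hypothetical, not from the slides), shown for the simplest physical operator, a table scan:

```python
# Sketch of the Open/GetNext/Close iterator interface for a table scan.
# 'blocks' stands in for the relation's disk blocks: a list of lists of tuples.

NOT_FOUND = object()  # sentinel playing the role of the NotFound value

class TableScan:
    """Iterator that yields the tuples of a relation one at a time."""

    def __init__(self, blocks):
        self.blocks = blocks

    def open(self):
        # Initialize the data structures needed to get tuples.
        self.block_no = 0
        self.offset = 0

    def get_next(self):
        # Return the next tuple, or NOT_FOUND when the relation is exhausted.
        while self.block_no < len(self.blocks):
            block = self.blocks[self.block_no]
            if self.offset < len(block):
                t = block[self.offset]
                self.offset += 1
                return t
            self.block_no += 1   # current block exhausted: move to the next
            self.offset = 0
        return NOT_FOUND

    def close(self):
        pass  # nothing to release in this in-memory sketch
```

A consumer calls open() once, then get_next() repeatedly until NOT_FOUND, then close(); more complex operators (joins, sorts) expose the same three methods.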
Thank You !!!
Query Execution
One-pass algorithm for database operations
Chetan Sharma 008565661
Overview
One-Pass Algorithm
One-Pass Algorithm Methods:
1) Tuple-at-a-time, unary operations.
2) Full-relation, unary operations.
3) Full-relation, binary operations.
One-Pass Algorithm
• Reading the data only once from disk.
• Usually, they require at least one of the arguments to fit in main memory
• The choice of algorithm for each operator is an essential part of the process of transforming a logical query plan into a physical query plan.
Tuple-at-a-Time
• These operations do not require an entire relation, or even a large part of it, in memory at once. Thus, we can read a block at a time, use one main memory buffer, and produce our output.
• Ex- selection and projection
Tuple-at-a-Time
A selection or projection being performed on a relation R
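As an illustrative Python sketch (not from the slides; the blocks-of-dicts representation and the condition/fields parameters are assumptions), a tuple-at-a-time selection plus projection reads one block at a time into a single buffer and emits output tuples as it goes:

```python
# Sketch of a tuple-at-a-time pass: one input buffer holds the current block;
# each tuple is filtered (selection) and cut down to some fields (projection).

def select_project(blocks, condition, fields):
    """One-pass selection + projection over a relation stored as blocks."""
    output = []
    for block in blocks:          # read one block at a time into the buffer
        for t in block:           # t is a dict: attribute name -> value
            if condition(t):                              # selection
                output.append({f: t[f] for f in fields})  # projection
    return output
```

The point of the sketch is the memory requirement: only one block of the input is ever held at once.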
Full-relation, unary operations
• These one-argument operations require seeing all or most of the tuples in memory at once,
• so one-pass algorithms are limited to relations that are approximately of size M (the number of main-memory buffers available) or less.
• Ex: the grouping operator; the duplicate-elimination operator.
Full-relation, unary operations
Managing memory for a one-pass duplicate-elimination
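A minimal Python sketch of one-pass duplicate elimination (illustrative, not from the slides): a main-memory search structure, here a Python set keyed on the whole tuple, remembers every tuple seen so far, so the relation must be small enough for that structure to fit in the M available buffers:

```python
# Sketch of one-pass duplicate elimination: a main-memory set remembers
# every tuple seen so far; only the first copy of each tuple is emitted.

def distinct(blocks):
    seen = set()          # in-memory search structure keyed on the whole tuple
    output = []
    for block in blocks:  # one block of R read at a time
        for t in block:   # tuples must be hashable, e.g. plain tuples
            if t not in seen:
                seen.add(t)
                output.append(t)   # first copy goes to the output
    return output
```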
Grouping
• A grouping operation gives us zero or more grouping attributes and presumably one or more aggregated attributes. If we create in main memory one entry for each group — that is, for each value of the grouping attributes — then we can scan the tuples of R, one block at a time.
• Ex- MIN(a) , MAX(a) , COUNT , SUM(a), AVG(a)
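The one-entry-per-group idea can be sketched in Python (illustrative; the dict-of-accumulators layout and parameter names are assumptions). SUM and COUNT are shown; MIN, MAX and AVG follow the same pattern:

```python
# Sketch of one-pass grouping: one main-memory entry per group value,
# accumulating the aggregates as the tuples of R stream by.

def group_sum_count(blocks, group_attr, agg_attr):
    groups = {}                       # group value -> (sum, count)
    for block in blocks:              # scan R one block at a time
        for t in block:               # t is a dict of attribute -> value
            key = t[group_attr]
            s, c = groups.get(key, (0, 0))
            groups[key] = (s + t[agg_attr], c + 1)
    return groups
```

Memory use is proportional to the number of groups, not to the size of R, which is why grouping can sometimes handle relations larger than M blocks.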
Full-relation, binary operations
• All other operations are in this class: set and bag versions of union, intersection, difference, joins, and products.
• Except for bag union, each of these operations requires at least one argument to be limited to size M, if we are to use a one-pass algorithm
Full-relation, binary operations examples
• Set Union:
- Read S into M - 1 buffers of main memory and build a search structure where the search key is the entire tuple.
- All these tuples are also copied to the output.
- Read each block of R into the Mth buffer, one at a time.
- For each tuple t of R, see if t is in S; if not, copy t to the output. If t is also in S, skip t.
• Set Intersection:
- Read S into M - 1 buffers and build a search structure with full tuples as the search key.
- Read each block of R, and for each tuple t of R, see if t is also in S. If so, copy t to the output; if not, ignore t.
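The two one-pass binary algorithms above can be sketched in Python (illustrative, and assuming set semantics, i.e. no duplicates within either argument, as the slides do): S is read fully into a main-memory structure standing in for the M - 1 buffers, then R streams past it one block at a time:

```python
# Sketches of one-pass set union and intersection.
# S fits in M - 1 buffers; R uses the one remaining buffer.

def one_pass_union(S_blocks, R_blocks):
    s_tuples = set()
    output = []
    for block in S_blocks:            # build search structure on whole tuples
        for t in block:
            s_tuples.add(t)
            output.append(t)          # all of S is copied to the output
    for block in R_blocks:            # R streams by, one block at a time
        for t in block:
            if t not in s_tuples:     # emit t only if it is not in S
                output.append(t)
    return output

def one_pass_intersection(S_blocks, R_blocks):
    s_tuples = {t for block in S_blocks for t in block}
    return [t for block in R_blocks for t in block if t in s_tuples]
```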
NESTED LOOPS JOINS
Book Section of chapter 15.3
Submitted to : Prof. Dr. T.Y. LIN
• Tuple-Based Nested-Loop Join
• An Iterator for Tuple-Based Nested-Loop Join
• A Block-Based Nested-Loop Join Algorithm
• Analysis of Nested-Loop Join
15.3.1 Tuple-Based Nested-Loop Join
The simplest variation of nested-loop join has loops that range over individual tuples of the relations involved. In this algorithm, which we call tuple-based nested-loop join, we compute the join as follows
R |><| S
Continued
FOR each tuple s in S DO
    FOR each tuple r in R DO
        IF r and s join to make a tuple t THEN output t;
If we are careless about how we buffer the blocks of relations R and S, then this algorithm could require as many as T(R)T(S) disk I/O's.
There are many situations where this algorithm can be modified to have much lower cost.
Continued
One case is when we can use an index on the join attribute or attributes of R to find the tuples of R that match a given tuple of S, without having to read the entire relation R.
The second improvement looks much more carefully at the way tuples of R and S are divided among blocks, and uses as much of the memory as it can to reduce the number of disk I/O's as we go through the inner loop.
We shall consider this block-based version of nested-loop join.
15.3.2 An Iterator for Tuple-Based Nested-Loop Join
Open() {
    R.Open();
    S.Open();
    s := S.GetNext();
}

GetNext() {
    REPEAT {
        r := R.GetNext();
        IF (r = NotFound) { /* R is exhausted for the current s */
            R.Close();
            s := S.GetNext();
            IF (s = NotFound) RETURN NotFound; /* both R and S are exhausted */
            R.Open();
            r := R.GetNext();
        }
    } UNTIL (r and s join);
    RETURN the join of r and s;
}

Close() {
    R.Close();
    S.Close();
}
15.3.3 A Block-Based Nested-Loop Join Algorithm
We can improve on the tuple-based nested-loop join that computes R |><| S by:
1. Organizing access to both argument relations by blocks.
2. Using as much main memory as we can to store tuples belonging to the relation S, the relation of the outer loop.
The nested-loop join algorithm
FOR each chunk of M-1 blocks of S DO BEGIN
    read these blocks into main-memory buffers;
    organize their tuples into a search structure whose
        search key is the common attributes of R and S;
    FOR each block b of R DO BEGIN
        read b into main memory;
        FOR each tuple t of b DO BEGIN
            find the tuples of S in main memory that join with t;
            output the join of t with each of these tuples;
        END;
    END;
END;
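The same algorithm rendered as Python (an illustrative sketch, not from the slides; relations are represented as lists of blocks holding (x, y) and (y, z) tuples, and the chunking of S stands in for loading M - 1 blocks at a time):

```python
# Sketch of block-based nested-loop join: S is consumed in chunks of
# M - 1 blocks, each chunk indexed in memory on the join attribute y,
# then all of R is scanned one block at a time against that chunk.

def nested_loop_join(R_blocks, S_blocks, M):
    output = []
    for i in range(0, len(S_blocks), M - 1):   # chunk of M - 1 blocks of S
        chunk = S_blocks[i:i + M - 1]
        index = {}                             # search structure keyed on y
        for block in chunk:
            for (y, z) in block:
                index.setdefault(y, []).append(z)
        for block in R_blocks:                 # one block of R at a time
            for (x, y) in block:
                for z in index.get(y, []):
                    output.append((x, y, z))   # emit a join tuple
    return output
```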
15.3.4 Analysis of Nested-Loop Join
Assuming S is the smaller relation, the number of chunks or iterations of outer loop is B(S)/(M - 1).
At each iteration, we read M - 1 blocks of S and B(R) blocks of R. The number of disk I/O's is thus
B(S)/(M-1) * (M - 1 + B(R)), or B(S) + B(S)B(R)/(M-1).
Continued
Assuming all of M, B(S), and B(R) are large, but M is the smallest of these, an approximation to the above formula is B(S)B(R)/M. That is, the cost is proportional to the product of the sizes of the two relations, divided by the amount of available main memory.
Example B(R) = 1000, B(S) = 500, M = 101
Important Aside: 101 buffer blocks is not as unrealistic as it sounds. There may be many queries at the same time, competing for main memory buffers.
Outer loop iterates 5 times. At each iteration we read M-1 (i.e. 100) blocks of S and all of R (i.e. 1000 blocks).
Total cost: 5*(100 + 1000) = 5500 I/O's.
Question: What if we reversed the roles of R and S?
We would iterate 10 times, and in each we would read 100+500 blocks, for a total of 6000 I/O’s.
Compare with one-pass join, if it could be done! We would need only 1500 disk I/O's if B(S) ≤ M - 1.
Continued…….
1. The cost of the nested-loop join is not much greater than the cost of a one-pass join, which is 1500 disk I/O's for this example. In fact, if B(S) ≤ M - 1, the nested-loop join becomes identical to the one-pass join algorithm of Section 15.2.3.
2. Nested-loop join is generally not the most efficient join algorithm.
Summary of the topic
In this topic we have learned how nested-loop joins are used in query execution and the process by which they compute their result.
Two-Pass Algorithms Based on Sorting
Section 15.4
CS257 Spring 2013
Swapna Vemparala
Class ID: 131
Contents:
• Two-Pass Algorithms
• Two-Phase, Multiway Merge-Sort
• Duplicate Elimination Using Sorting
• Grouping and Aggregation Using Sorting
• A Sort-Based Union Algorithm
• Sort-Based Intersection and Difference
• A Simple Sort-Based Join Algorithm
• A More Efficient Sort-Based Join
Two-Pass Algorithms
Data from the operand relation is read into main memory, processed, written out to disk again, and reread from disk to complete the operation.
The idea extends to any number of passes, where the data is read several times into main memory.
15.4.1 Two-Phase, Multiway Merge-Sort
Very large relations can be sorted in two passes using an algorithm called Two-Phase, Multiway Merge-Sort (TPMMS).
Phase 1: Repeatedly fill the M buffers with new tuples from R and sort them, using any main-memory sorting algorithm. Write out each sorted sublist to secondary storage.
Phase 2: Merge the sorted sublists. For this phase to work, there can be at most M - 1 sorted sublists, which limits the size of R. We allocate one input block to each sorted sublist and one block to the output.
Merging:
• Find the smallest key among the first remaining elements of all the lists.
• Move the smallest element to the first available position of the output block.
• If the output block is full, write it to disk and reinitialize the same buffer in main memory to hold the next output block.
• If an input block is exhausted of records, read the next block from the same sorted sublist into the same buffer that was used for the block just exhausted.
• If no blocks remain, stop.
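The two phases can be sketched in Python (illustrative, not from the slides; a list of tuples stands in for the relation, run_length plays the role of the M buffers, and heapq.merge performs the "find the smallest key among the lists" step):

```python
# Sketch of TPMMS. Phase 1 sorts memory-sized runs; phase 2 merges the
# sorted sublists, as if one buffer were allocated per sublist.

import heapq

def tpmms_sort(tuples, run_length):
    # Phase 1: fill "memory", sort, write out each sorted sublist.
    runs = [sorted(tuples[i:i + run_length])
            for i in range(0, len(tuples), run_length)]
    # Phase 2: merge the sorted sublists (at most M - 1 of them).
    return list(heapq.merge(*runs))
```

In a real implementation each run would be written to disk in phase 1 and streamed back one block at a time in phase 2; the sketch keeps everything in memory to show only the control flow.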
15.4.2 Duplicate Elimination Using Sorting
Same as the previous algorithm, except that instead of sorting on the second pass, we repeatedly select the first unconsidered tuple t among all the sorted sublists.
Write one copy of t to the output and eliminate from the input blocks all occurrences of t.
The output is exactly one copy of every tuple in R.
15.4.3 Grouping and Aggregation Using Sorting
Read the tuples of R into memory, M blocks at a time. Sort the tuples in each set of M blocks, using the grouping attributes of L as the sort key. Write each sorted sublist to disk.
Use one main-memory buffer for each sublist, and initially load the first block of each sublist into its buffer.
Repeatedly find the least value of the sort key present among the first available tuples in the buffers.
15.4.4 A Sort-Based Union Algorithm
In the first phase, create sorted sublists from both R and S.
Use one main-memory buffer for each sublist of R and S.
Initialize each with the first block from the corresponding sublist.
Repeatedly find the first remaining tuple t among all the buffers.
15.4.5 Sort-Based Intersection and Difference
For both the set and bag versions, the algorithm is the same as for set-union, except in the way we handle the copies of a tuple t at the fronts of the sorted sublists.
For set intersection -output t if it appears in both R and S.
For bag intersection -output t the minimum of the number of times it appears in R and in S.
For set difference -output t if and only if it appears in R but not in S.
For bag difference-output t the number of times it appears in R minus the number of times it appears in S.
15.4.6 A Simple Sort-Based Join Algorithm
Given relations R(X, Y) and S(Y, Z) to join, and given M blocks of main memory for buffers:
• Sort R, using TPMMS, with Y as the sort key.
• Sort S similarly.
• Merge the sorted R and S, using only two buffers.
15.4.8 A More Efficient Sort-Based Join
If we do not have to worry about very large numbers of tuples with a common value for the join attribute(s), then we can save two disk I/O's per block by combining the second phase of the sorts with the join itself.
To compute R(X, Y) |><| S(Y, Z) using M main-memory buffers:
Create sorted sublists of size M, using Y as the sort key, for both R and S.
Bring the first block of each sublist into a buffer
Repeatedly find the least Y-value y among the first available tuples of all the sublists. Identify all the tuples of both relations that have Y-value y. Output the join of all tuples from R with all tuples from S that share this common Y-value
We can perform the algorithm-on data that is almost as large as that of the previous algorithm.
Two-Pass Algorithms Based on Hashing
Chapter 15.5
CS 257
ID 131 SWAPNA VEMPARALA
Contents
Introduction
Partitioning Relations by Hashing
A Hash-Based Algorithm for Duplicate Elimination
Hash-Based Grouping and Aggregation
Hash-Based Union, Intersection, and Difference
The Hash-Join Algorithm
Saving Some Disk I /O ’s
Differences between sort-based and corresponding hash-based algorithms
Introduction
The essential idea behind all these previous algorithms is as follows:
If the data is too big to store in main-memory buffers, hash all the tuples of the argument or arguments using an appropriate hash key.
For all the common operations, there is a way to select the hash key so all the tuples that need to be considered together when we perform the operation fall into the same bucket.
We then perform the operation by working on one bucket at a time (or on a pair of buckets with the same hash value, in the case of a binary operation).
In effect, we have reduced the size of the operand(s) by a factor equal to the number of buckets, which is roughly M.
15.5.1 Partitioning Relations by Hashing
Take a relation R and, using M buffers, partition R into M - 1 buckets of roughly equal size.
We assume that h is the hash function, and that h takes complete tuples of R as its argument.
We associate one buffer with each bucket.
The last buffer holds blocks of R , one at a time. Each tuple t in the block is hashed to bucket h(t) and copied to the appropriate buffer.
If that buffer is full, we write it out to disk, and initialize another block for the same bucket.
At the end, we write out the last block of each bucket if it is not empty.
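The partitioning loop described above can be sketched in Python (illustrative, not from the slides; lists of blocks stand in for the disk, block_size is an assumed parameter, and appending a full buffer to a bucket's list stands in for writing it to disk):

```python
# Sketch of partitioning a relation by hashing: each tuple goes to one of
# M - 1 buckets; a full in-memory buffer is "written to disk" (appended to
# that bucket's list of blocks), and the last nonempty blocks are flushed.

def partition(R_blocks, M, block_size, h=hash):
    buckets = [[] for _ in range(M - 1)]   # blocks written out, per bucket
    buffers = [[] for _ in range(M - 1)]   # one in-memory buffer per bucket
    for block in R_blocks:                 # the last buffer holds blocks of R
        for t in block:
            i = h(t) % (M - 1)             # tuple t hashes to bucket i
            buffers[i].append(t)
            if len(buffers[i]) == block_size:   # buffer full: flush it
                buckets[i].append(buffers[i])
                buffers[i] = []
    for i, buf in enumerate(buffers):      # write out last nonempty blocks
        if buf:
            buckets[i].append(buf)
    return buckets
```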
15.5.2 A Hash-Based Algorithm for Duplicate Elimination
We shall now consider the details of hash-based algorithms for the various operations of relational algebra that might need two-pass algorithms.
First, consider duplicate elimination, that is, the operation δ(R).
We hash R to M - 1 buckets; two copies of the same tuple t will hash to the same bucket.
Thus, we can examine one bucket at a time, perform δ on that bucket in isolation, and take as the answer the union of δ(Ri), where Ri is the portion of R that hashes to the ith bucket.
The one-pass algorithm eliminates duplicates from each Ri in turn and writes out the resulting unique tuples.
This method will work as long as the individual R i ’s are sufficiently small to fit in main memory and thus allow a one-pass algorithm.
Since we may assume the hash function h partitions R into equal-sized buckets, each Ri will be approximately B(R)/(M - 1) blocks in size.
If that number of blocks is no larger than M, i.e., B(R) < M(M - 1), then the two-pass, hash-based algorithm will work.
Thus, a conservative estimate (assuming M and M - 1 are essentially the same) is B(R) < M², exactly as for the sort-based, two-pass algorithm for δ.
The number of disk I/O ’s is also similar to that of the sort-based algorithm.
We read each block of R once as we hash its tuples, and we write each block of each bucket to disk.
We then read each block of each bucket again in the one-pass algorithm that focuses on that bucket.
Thus, the total number of disk I/O ’s is 3B(R).
15.5.3 Hash-Based Grouping and Aggregation
To perform the γL(R) operation, we again start by hashing all the tuples of R to M - 1 buckets.
However, in order to make sure that all tuples of the same group wind up in the same bucket,
we must choose a hash function that depends only on the grouping attributes of the list L.
Having partitioned R into buckets, we can then use the one-pass algorithm for γ to process each bucket in turn.
As for δ, we can process each bucket in main memory provided B(R) < M².
However, on the second pass, we need only one record per group as we process each bucket
Thus, even if the size of a bucket is larger than M, we can handle the bucket in one pass provided the records for all the groups in the bucket take no more than M buffers.
As a consequence, if groups are large, then we may actually be able to handle much larger relations R than is indicated by the B(R) < M² rule.
On the other hand, if M exceeds the number of groups, then we cannot fill all buckets.
Thus, the actual limitation on the size of R as a function of M is complex, but B(R) < M² is a conservative estimate.
Finally, we observe that the number of disk I/O's for γ, as for δ, is 3B(R).
15.5.4 Hash-Based Union, Intersection, and Difference
When the operation is binary, we use the same hash function to hash tuples of both arguments. For example, to compute the set-union R ∪ S, we hash both R and S to M - 1 buckets each, say R1, R2, ..., RM-1 and S1, S2, ..., SM-1.
We then take the set-union of Ri with Si for all i, and output the result.
Notice that if a tuple t appears in both R and S, then for some i we shall find t in both Ri and Si.
Thus, when we take the union of these two buckets, we shall output only one copy of t , and there is no possibility of introducing duplicates into the result.
To take the intersection or difference of R and S, we create the 2(M — 1) buckets exactly as for set-union and apply the appropriate one-pass algorithm to each pair of corresponding buckets.
Notice that all these one-pass algorithms require B(R) + B(S) disk I/O's.
To this quantity we must add the two disk I/O's per block that are necessary to hash the tuples of the two relations and store the buckets on disk, for a total of 3(B(R) + B(S)) disk I/O's.
In order for the algorithms to work, we must be able to take the one-pass union, intersection, or difference of Ri and Si, whose sizes will be approximately B(R)/(M - 1) and B(S)/(M - 1), respectively.
Recall that the one-pass algorithms for these operations require that the smaller operand occupies at most M - 1 blocks.
Thus, the two-pass, hash-based algorithms require that min(B(R), B(S)) < M², approximately.
15.5.5 The Hash-Join Algorithm
To compute R(X, Y) |><| S(Y, Z) using a two-pass, hash-based algorithm, we act almost as for the other binary operations.
The only difference is that we must use as the hash key just the join attributes, Y.
Then we can be sure that if tuples of R and S join, they will wind up in corresponding buckets Ri and Si for some i.
A one-pass join of all pairs of corresponding buckets completes this algorithm, which we call hash-join.
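Hash-join can be sketched in Python (illustrative, not from the slides; plain lists of tuples stand in for the relations, and n_buckets plays the role of M - 1). Both relations are partitioned on Y with the same hash function, then each pair of corresponding buckets is joined one-pass:

```python
# Sketch of hash-join: partition both relations on the join attribute y,
# then one-pass join each pair of corresponding buckets, loading the
# S-bucket into an in-memory dict keyed on y.

def hash_join(R, S, n_buckets):
    """R holds (x, y) tuples, S holds (y, z) tuples."""
    R_buckets = [[] for _ in range(n_buckets)]
    S_buckets = [[] for _ in range(n_buckets)]
    for (x, y) in R:                           # partition R on y
        R_buckets[hash(y) % n_buckets].append((x, y))
    for (y, z) in S:                           # partition S with the same h
        S_buckets[hash(y) % n_buckets].append((y, z))
    out = []
    for rb, sb in zip(R_buckets, S_buckets):   # corresponding bucket pair
        index = {}
        for (y, z) in sb:                      # one-pass join of the pair
            index.setdefault(y, []).append(z)
        for (x, y) in rb:
            for z in index.get(y, []):
                out.append((x, y, z))
    return out
```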
15.5.6 Saving Some Disk I /O ’s
If there is more memory available on the first pass than we need to hold one block per bucket, then we have some opportunities to save disk I/O ’s.
One option is to use several blocks for each bucket, and write them out as a group, in consecutive blocks of disk.
Strictly speaking, this technique doesn’t save disk I/O ’s, but it makes the I/O ’s go faster, since we save seek time and rotational latency when we write.
An effective method called hybrid hash-join works as follows.
In general, suppose we decide that to join R |><| S, with S the smaller relation, we need to create k buckets, where k is much less than M, the available memory. When we hash S, we can choose to keep m of the k buckets entirely in main memory, while keeping only one block for each of the other k - m buckets. We can manage to do so provided the expected size of the buckets in memory, plus one block for each of the other buckets, does not exceed M; that is:
m * B(S)/k + k - m ≤ M
The expected size of a bucket is B(S)/k, and there are m buckets in memory.
Now, when we read the tuples of the other relation, R, to hash that relation into buckets, we keep in memory:
1. The m buckets of S that were never written to disk, and
2. One block for each of the k - m buckets of R whose corresponding buckets of S were written to disk.
If a tuple t of R hashes to one of the first m buckets, then we immediately join it with all the tuples of the corresponding S-bucket, as if this were a one-pass hash-join.
It is necessary to organize each of the in-memory buckets of S into an efficient search structure to facilitate this join, just as for the one-pass hash-join.
If t hashes to one of the buckets whose corresponding S-bucket is on disk, then t is sent to the main-memory block for that bucket, and eventually migrates to disk, as for a two-pass, hash-based join.
On the second pass, we join the corresponding buckets of R and S as usual.
However, there is no need to join the pairs of buckets for which the S-bucket was left in memory; these buckets have already been joined.
The savings in disk I/O's is equal to two for every block of the buckets of S that remain in memory, and their corresponding R-buckets.
Since m / k of the buckets are in memory, the savings is 2(m/k)(B(R) + B(S)).
The intuitive justification is that all but k - m of the main-memory buffers can be used to hold tuples of S in main memory, and the more of these tuples, the fewer the disk I/O's.
Thus, we want to minimize k, the total number of buckets.
We do so by making each bucket about as big as can fit in main memory; that is, buckets are of size M, and therefore k = B(S)/M.
If that is the case, then there is only room for one bucket in the extra main memory; i.e., m = 1.
In fact, we really need to make the buckets slightly smaller than B(S)/M, or else we shall not quite have room for one full bucket and one block for each of the other k - 1 buckets in memory at the same time.
Assuming, for simplicity, that k is about B(S)/M and m = 1, the savings in disk I/O's is 2M(B(R) + B(S))/B(S), and the total cost is (3 - 2M/B(S))(B(R) + B(S)).
15.5.7 Summary of Hash-Based Algorithms
Differences between sort-based and corresponding hash-based algorithms:
1. Hash-based algorithms for binary operations have a size requirement that depends only on the smaller of the two arguments, rather than on the sum of the argument sizes, as sort-based algorithms require.
2. Sort-based algorithms sometimes allow us to produce a result in sorted order and take advantage of that sort later.
3. Hash-based algorithms depend on the buckets being of equal size.
4. In sort-based algorithms, the sorted sublists may be written to consecutive blocks of the disk
5. Moreover, if M is much larger than the number of sorted sublists, then we may read in several consecutive blocks at a time from a sorted sublist, again saving some latency and seek time.
6. On the other hand, if we can choose the number of buckets to be less than M in a hash-based algorithm, then we can write out several blocks of a bucket at once.
15.6 Index-Based Algorithms
By: Tomas Tupy (123)
Outline:
• Terminology
• Clustered Indexes
• Example
• Non-Clustered Indexes
• Index-Based Selection
• Joining by Using an Index
• Join Using a Sorted Index
What is an Index?
A data structure which improves the speed of data retrieval operations on a relation, at the cost of slower writes and the use of more storage space.
It enables sub-linear time lookup.
Data is stored in arbitrary order, while logical ordering is achieved by using the index.
Index-based algorithms are especially useful for the selection operator.
Terminology Recap:
• B(R) – Number of blocks needed to hold R
• T(R) – Number of tuples in R
• V(R,a) – Number of distinct values of column a in R
• Clustered Relation – Tuples are packed into as few blocks as possible.
• Clustered Indexes – Indexes on attribute(s) such that all tuples with a fixed value for the search key appear on as few blocks as possible.
Clustering Indexes
A relation is clustered if its tuples are packed into relatively few blocks.
Clustering indexes are indexes on an attribute or attributes such that all the tuples with a fixed value for the search key of this index appear in as few blocks as possible.
Tuples are stored to match the index order.
A relation that isn't clustered cannot have a clustering index.
Clustering Indexes
Let R(a,b) be a relation sorted on attribute a. Let the index on a be a clustering index. Let a1 be a specific value for a.
A clustering index has all tuples with a fixed value packed into the minimum number of blocks.
[Figure: consecutive blocks holding all the a1 tuples]
Pros/Cons
Pros:
• Faster reads for particular selections.
Cons:
• Writing to a table with a clustered index can be slower, since there might be a need to rearrange data.
• Only one clustered index is possible.
Clustered Index Example
Customer(ID, Name, Address)
Order(ID, CustomerID, Price)
Problem: We want to quickly retrieve all orders for a particular customer.
How do we do this?
Clustered Index Example
Solution: Create a clustered index on the "CustomerID" column of the Order table.
Now the tuples with the same CustomerID will be physically stored close to one another on disk.
Non-Clustered Indexes
• There can be many per table.
• Quicker for insert and update operations.
• The physical order of tuples is not the same as the index order.
Index-Based Algorithms
• Especially useful for the selection operator.
• Algorithms for join and other binary operators also use indexes to very good advantage.
Index-Based Selection: No Index
Without an index on relation R, we have to read all the tuples in order to implement the selection σC(R) and see which tuples match our condition C.
What is the cost in disk I/O's to implement σC(R)? (For both clustered and non-clustered relations.)
Index-Based Selection: No Index
Answer: B(R) if our relation is clustered; up to T(R) if the relation is not clustered.
Index-Based Selection
Let us consider an index on attribute a, where our condition C is a = v: σa=v(R). In this case we just search the index for value v, and we get pointers to exactly the tuples we need.
Index-Based Selection
Let's say that for our selection σa=v(R), our index is clustering. What is the cost in the number of disk I/O's to retrieve the set σa=v(R)?
Index-Based Selection
Answer: the average is B(R) / V(R,a), plus a few more I/O's:
• The index might not be in main memory.
• Tuples with a = v might not be block-aligned.
• Even if clustered, the tuples might not be packed as tightly as possible (extra space is left for insertion).
Index-Based Selection
Now suppose the index for our selection σa=v(R) is non-clustering. What is the cost in the number of disk I/O's to retrieve the set σa=v(R)?
Index-Based Selection
Answer: the worst case is T(R) / V(R,a). This can happen if the matching tuples all live in different blocks.
Joining by Using an Index (Algorithm 1)
Consider the natural join R(X,Y) |><| S(Y,Z), and suppose S has an index on attribute Y.
Start by examining each block of R, and within each block consider each tuple t, where tY is the component of t corresponding to attribute Y.
Now we use the index to find the tuples of S that have tY in their Y component.
These tuples form the join.
Joining by Using an Index (Algorithm 1): Analysis
Consider R(X,Y) |><| S(Y,Z). If R is clustered, then we have to read B(R) blocks to get all tuples of R. If R is not clustered, then up to T(R) disk I/O's are required.
For each tuple t of R, we must read an average of T(S) / V(S,Y) tuples of S.
Total: B(R)T(S) / V(S,Y) disk I/O's if R is clustered, and T(R)T(S) / V(S,Y) if it is not.
Join Using a Sorted Index
Consider R(X,Y) |><| S(Y,Z). Data structures such as B-trees provide the best sorted indexes.
In the best case, if we have sorted indexes on Y for both R and S, then we perform only the last step of the simple sort-based join.
This is sometimes called a zig-zag join.
Join Using a Sorted Index (Zig-zag Join)
Consider R(X,Y) |><| S(Y,Z), where we have indexes on Y for both R and S.
Tuples from R with a Y value that does not appear in S never need to be retrieved, and vice-versa.
[Figure: the zig-zag join stepping between the index on Y in R and the index on Y in S]
Chapter 15.7
Buffer Management
Class: CS257 Instructor: Dr. T.Y.Lin
What does a buffer manager do?
Assume that the operators on relations need M main-memory buffers to store needed data.
In practice:
1) buffers are rarely allocated in advance;
2) the value of M may vary depending on system conditions.
Therefore, a buffer manager is used to allow processes to get the memory they need, while minimizing delays and unsatisfiable requests.
[Figure 1: The role of the buffer manager: it responds to requests for main-memory access to disk blocks, issuing reads and writes between the buffers and the disk.]
The role of the buffer manager
15.7.1 Buffer Management Architecture
Two broad architectures for a buffer manager:
1) The buffer manager controls main memory directly. • Relational DBMS
2) The buffer manager allocates buffers in virtual memory, allowing the OS to decide how to use buffers. • “main-memory” DBMS • “object-oriented” DBMS
Buffer Pool
Key setting for the Buffer manager to be efficient:
The buffer manager should limit the number of buffers in use so that they fit in the available main memory, i.e. Don’t exceed available space.
The number of buffers is a parameter set when the DBMS is initialized.
No matter which architecture of buffering is used, we simply assume that there is a fixed-size buffer pool, a set of buffers available to queries and other database actions.
Data must be in RAM for DBMS to operate on it! Buffer Manager hides the fact that not all data is in RAM.
DB
MAIN MEMORY
DISK
disk page
free frame
Page Requests from Higher Levels
BUFFER POOL
choice of frame dictatedby replacement policy
Buffer Pool
15.7.2 Buffer Management Strategies
Buffer-replacement strategies:
When a buffer is needed for a newly requested block and the buffer pool is full, which block should be thrown out of the buffer pool?
Buffer-replacement strategy -- LRU
Least-Recently Used (LRU):
To throw out the block that has not been read or written for the longest time.
• Requires more maintenance but it is effective.
• The time table must be updated on every access.
• The block that has gone unused the longest is usually the one least likely to be accessed soon, which makes it a good choice for replacement.
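An LRU pool can be sketched in Python (illustrative, not from the slides; collections.OrderedDict stands in for the "time table", with the most recently touched blocks kept at the end, and read_block is a hypothetical callback standing in for a disk read):

```python
# Sketch of LRU buffer replacement: on a hit the block moves to the
# "most recent" end; on a miss with a full pool, the block at the
# "least recent" end is evicted.

from collections import OrderedDict

class LRUBufferPool:
    def __init__(self, n_buffers):
        self.n_buffers = n_buffers
        self.pool = OrderedDict()             # block id -> block contents

    def access(self, block_id, read_block):
        """Return the block, reading it 'from disk' on a miss."""
        if block_id in self.pool:
            self.pool.move_to_end(block_id)   # update the access time
            return self.pool[block_id]
        if len(self.pool) == self.n_buffers:
            self.pool.popitem(last=False)     # evict the least recently used
        self.pool[block_id] = read_block(block_id)
        return self.pool[block_id]
```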
Buffer-replacement strategy -- FIFO
First-In-First-Out (FIFO):
The buffer that has been occupied the longest by the same block is emptied and used for the new block.
• Requires less maintenance but it can make more mistakes.
• Only the loading time is kept.
• The oldest block is not necessarily less likely to be accessed. Example: the root block of a B-tree index.
Buffer-replacement strategy – “Clock”
The “Clock” Algorithm (“Second Chance”)
Think of the 8 buffers as arranged in a circle, as shown in Figure 3.
Flag 0 and 1:
buffers with a 0 flag may have their contents sent back to disk, i.e. they are ok to be replaced
buffers with a 1 flag are not ok to be replaced
Buffer-replacement strategy – “Clock”
[Figure 3: the clock algorithm. Eight buffers with flags 0 or 1 are arranged in a circle. The hand starts at some buffer and searches clockwise for a 0 flag; the first buffer found with a 0 flag is replaced. When the hand passes a buffer, its flag is set to 0; if the buffer's contents have not been accessed by the time the hand reaches it again (flag still 0), the buffer is replaced. That is the "second chance".]
Buffer-replacement strategy -- Clock
A buffer's flag is set to 1 when:
• a block is read into the buffer
• the contents of the buffer are accessed
A buffer's flag is set to 0 when:
• the buffer manager needs a buffer for a new block: it looks for the first 0 it can find, rotating clockwise, and if it passes 1's, it sets them to 0.
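The eviction sweep can be sketched in Python (illustrative, not from the slides; the circular array of flags and the load method are assumptions, and in a full implementation an access to a resident block would also set its flag back to 1):

```python
# Sketch of the clock ("second chance") replacement policy: the hand
# sweeps clockwise, clearing 1-flags as it passes (their second chance)
# and replacing the first buffer it finds flagged 0.

class ClockPool:
    def __init__(self, n_buffers):
        self.blocks = [None] * n_buffers   # block held by each buffer
        self.flags = [0] * n_buffers
        self.hand = 0

    def load(self, block_id):
        """Place block_id in some buffer, evicting by the clock rule."""
        while self.flags[self.hand] == 1:
            self.flags[self.hand] = 0      # passed: flag cleared to 0
            self.hand = (self.hand + 1) % len(self.blocks)
        victim = self.hand                 # first 0-flagged buffer found
        self.blocks[victim] = block_id     # "read" the block into it
        self.flags[victim] = 1             # flag set to 1 on load
        self.hand = (self.hand + 1) % len(self.blocks)
        return victim
```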
System Control helps Buffer-replacement strategy
System Control
The query processor or other components of a DBMS can give advice to the buffer manager in order to avoid some of the mistakes that would occur with a strict policy such as LRU, FIFO or Clock.
For example:
A “pinned” block means it can’t be moved to disk without first modifying certain other blocks that point to it.
In FIFO, use “pinned” to force root of a B-tree to remain in memory at all times.
15.7.3 The Relationship Between Physical Operator Selection and Buffer Management
Problem:
A physical operator expects a certain number of buffers M for its execution.
However, the buffer manager may not be able to guarantee that these M buffers are available.
15.7.3 The Relationship Between Physical Operator Selection and Buffer Management
Questions:
Can the algorithm adapt to changes of M, the number of main-memory buffers available?
When fewer than M buffers are available, some blocks have to be kept on disk instead of in memory.
How does the buffer-replacement strategy impact performance (i.e. the number of additional I/O's)?
Example
FOR each chunk of M-1 blocks of S DO BEGIN
read these blocks into main-memory buffers;
organize their tuples into a search structure whose
search key is the common attributes of R and S;
FOR each block b of R DO BEGIN
read b into main memory;
FOR each tuple t of b DO BEGIN
find the tuples of S in main memory that
join with t ;
output the join of t with each of these tuples;
END ;
END ;
END ;
Figure 15.8: The nested-loop join algorithm
Example
The number of iterations of the outer loop depends on the average number of buffers available at each iteration.
The outer loop uses M-1 buffers, and 1 is reserved for a block of R, the relation of the inner loop.
If we pin the M-1 blocks we use for S on one iteration of the outer loop, we shall not lose their buffers during the round.
Also, more buffers may become available and then we could keep more than one block of R in memory.
Will these extra buffers improve the running time?
Example
CASE1: NO
Buffer-replacement strategy: LRU. Buffers for R: k.
We read each block of R in order into the buffers, so by the end of an iteration of the outer loop, the last k blocks of R are in the buffers.
However, the next iteration starts from the beginning of R again, so all k buffers holding R will have to be replaced.
Example
CASE 2: YES
Buffer-replacement strategy: LRU
Buffers for R: k
We read the blocks of R in an order that alternates: first-to-last on one iteration, then last-to-first on the next.
In this way, we save k disk I/O's on each iteration of the outer loop except the first.
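The two cases can be checked with a short simulation. The sketch below is illustrative (the helper `simulate` is not from the text): it counts disk reads under LRU for repeated scans of R's blocks, comparing the straight order against the alternating "rocking" order.

```python
from collections import OrderedDict

def simulate(order_per_iteration, capacity):
    """Count disk reads for a sequence of block-access iterations
    under LRU replacement with `capacity` buffers."""
    buffers, reads = OrderedDict(), 0
    for iteration in order_per_iteration:
        for block in iteration:
            if block in buffers:
                buffers.move_to_end(block)   # hit: now most recent
            else:
                reads += 1                   # miss: read from disk
                if len(buffers) >= capacity:
                    buffers.popitem(last=False)  # evict LRU block
                buffers[block] = True
    return reads

blocks = list(range(10))   # 10 blocks of R
k = 3                      # buffers available for R
straight = simulate([blocks, blocks], k)        # CASE 1
rocking = simulate([blocks, blocks[::-1]], k)   # CASE 2
# With one extra iteration, rocking saves exactly k reads:
# the last k blocks of the first pass are still buffered when
# the second pass revisits them first.
```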
Other Algorithms and M buffers
Other algorithms are also impacted by M and the buffer-replacement strategy.
Sort-based algorithms:
If M shrinks, we can change the size of a sublist.
Unexpected result: too many sublists, so that we cannot allocate a buffer to each sublist during merging.
Hash-based algorithms:
If M shrinks, we can reduce the number of buckets, as long as the buckets still fit in M buffers.
THANK YOU !
Buffer Management
By Snigdha Rao Parvatneni
SJSU ID: 008648978
Roll Number: 124
Course: CS257
Agenda
• Introduction
• Role of Buffer Management
• Architecture of Buffer Management
• Buffer Management Strategies
• Relation Between Physical Operator Selection And Buffer Management
• Example
Introduction
• We assume that operators on relations have some main-memory buffers in which to store the data they need.
• It is very rare that these buffers are allocated to an operator in advance.
• The task of assigning main-memory buffers to processes is given to the buffer manager.
• The buffer manager is responsible for allocating main memory to processes as needed, while minimizing delays and unsatisfiable requests.
Role of Buffer Manager
• The buffer manager responds to requests for main-memory access to disk blocks.
• The buffer manager controls main memory directly.
Architecture of Buffer Management
• There are two broad architectures for a buffer manager:
– The buffer manager controls main memory directly, as in many relational DBMSs.
– The buffer manager allocates buffers in virtual memory and lets the OS decide which buffers should be in main memory and which should be in the OS-managed disk swap space, as in many object-oriented DBMSs and main-memory DBMSs.
Problem
• Irrespective of the approach, the buffer manager must limit the number of buffers to fit in the available main memory.
– In the case where the buffer manager controls main memory directly:
• If requests exceed the available space, the buffer manager has to select a buffer to empty by returning its contents to disk.
• Blocks that have not been changed are simply erased from main memory; blocks that have been changed are written back to their place on disk.
– In the case where the buffer manager allocates space in virtual memory:
• The buffer manager has the option of allocating more buffers than can actually fit into main memory. When all of these buffers are in use, thrashing occurs.
• Thrashing is an operating-system problem in which many blocks are moved in and out of the disk's swap space; the system ends up spending most of its time swapping blocks and getting very little work done.
Solution
• To resolve this problem, the number of buffers is set when the DBMS is initialized.
• Users need not worry about the mode of buffering used.
• From the users' point of view there is a fixed-size buffer pool, in other words a set of buffers available to queries and other database actions.
Buffer Management Strategies
• The buffer manager must make the critical choice of which block to keep and which to discard when a buffer is needed for a newly requested block.
• For this, the buffer manager uses buffer-replacement strategies. Some common strategies are:
– Least-Recently Used (LRU)
– First-In-First-Out (FIFO)
– The Clock Algorithm (Second Chance)
– System Control
Least-Recently Used (LRU)
• The rule is to throw out the block that has gone unread and unwritten for the longest time.
• To do this, the buffer manager needs to maintain a table indicating the last time the block in each buffer was accessed.
• Each database access must make an entry in this table, so a significant amount of effort is involved in maintaining this information.
• Buffers that have not been used for a long time are less likely to be accessed soon than buffers that have been accessed recently. Hence, it is an effective strategy.
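A minimal sketch of an LRU buffer pool, using Python's `OrderedDict` as the "last accessed" table described above (the class and attribute names are illustrative, not from any real DBMS):

```python
from collections import OrderedDict

class LRUBufferPool:
    """Illustrative LRU buffer replacement."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.buffers = OrderedDict()  # block_id -> contents, LRU first
        self.disk_reads = 0

    def access(self, block_id):
        if block_id in self.buffers:
            # Hit: record this block as the most recently used.
            self.buffers.move_to_end(block_id)
            return
        # Miss: read from disk; evict the least recently used if full.
        self.disk_reads += 1
        if len(self.buffers) >= self.capacity:
            self.buffers.popitem(last=False)
        self.buffers[block_id] = True
```

Note that every `access` call updates the ordering, which is exactly the per-access bookkeeping cost the slide mentions.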
First-In-First-Out (FIFO)
• Under this rule, when a buffer is needed, the buffer that has been occupied the longest by the same block is emptied and used for the new block.
• The buffer manager needs to know only the time at which the block occupying each buffer was loaded into it.
• An entry in the table is made when a block is read from disk, not every time it is accessed.
• FIFO involves less maintenance than LRU, but it is more prone to mistakes.
The Clock Algorithm
• It is an efficient approximation of LRU and is commonly implemented.
• The buffers are treated as if arranged in a circle, with an arrow pointing to one of them. The arrow rotates clockwise when it needs to find a buffer in which to place a disk block.
• Each buffer has an associated flag with value 0 or 1. Buffers with flag value 0 are vulnerable to having their contents transferred to disk, whereas buffers with flag value 1 are not.
• Whenever a block is read into a buffer, or the contents of a buffer are accessed, the associated flag is set to 1.
Working of the Clock Algorithm
• Whenever a buffer is needed for a block, the arrow looks for the first 0 it can find in the clockwise direction.
• As the arrow moves, it changes flag values from 1 to 0.
• A block is thrown out of its buffer only if it remains unaccessed (i.e., its flag stays 0) for the time between two rotations of the arrow.
• On the first rotation the flag is set from 1 to 0, and on the second rotation the arrow comes back to check the flag value.
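The two-rotation behavior can be sketched in a few lines of Python. This is an illustrative implementation of the clock (second-chance) scheme described above; the class and method names are assumptions made for the example.

```python
class ClockBufferPool:
    """Illustrative clock (second-chance) replacement."""
    def __init__(self, capacity):
        self.blocks = [None] * capacity   # block held by each buffer
        self.flags = [0] * capacity       # 1 = recently used
        self.hand = 0                     # the rotating "arrow"

    def access(self, block_id):
        """Return True on a buffer hit, False if the block was loaded."""
        if block_id in self.blocks:
            self.flags[self.blocks.index(block_id)] = 1
            return True
        # Rotate the hand until a buffer with flag 0 is found,
        # clearing flags from 1 to 0 as the hand passes them.
        while self.flags[self.hand] == 1:
            self.flags[self.hand] = 0
            self.hand = (self.hand + 1) % len(self.blocks)
        # This buffer went unaccessed for a full rotation: evict it.
        self.blocks[self.hand] = block_id
        self.flags[self.hand] = 1
        self.hand = (self.hand + 1) % len(self.blocks)
        return False
```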
System Control
• The query processor and other DBMS components can advise the buffer manager, to avoid some of the mistakes that occur with LRU, FIFO, or Clock.
• Some blocks cannot be moved out of main memory without first modifying other blocks that point to them. Such blocks are called pinned blocks.
• The buffer manager must modify its replacement strategy to avoid expelling pinned blocks. As a result, some blocks remain in main memory even though, apart from the pin, there would be no obstacle to writing them to disk.
Relation Between Physical Operator Selection And Buffer Management
• The physical operators used to execute a query are selected by the query optimizer. This selection may assume that a certain number of buffers are available for executing the operators.
• But, as we know, the buffer manager does not guarantee that this number of buffers is available when the query is executed.
• Two questions now arise:
– Can an algorithm adapt to changes in the number of available main-memory buffers?
– When the expected number of buffers is not available, and certain blocks expected in main memory are moved to disk, how does the buffer-replacement strategy of the buffer manager affect the number of I/O's performed?
Example
• Block-based nested-loop join: the algorithm itself does not depend on the number of available buffers M, but its performance does.
• For each chunk of M-1 blocks of the outer-loop relation S, read the blocks into main memory and organize their tuples into a search structure whose key is the common attributes of R and S.
• Then, for each block b of R, read b into main memory, and for each tuple t of b find the tuples of S in main memory that join with t.
• The number of outer-loop iterations depends on the average number of buffers available at each iteration; the outer loop uses M-1 buffers, and 1 is reserved for the inner-loop relation, a block of R.
• If we pin the M-1 blocks used for S in one iteration of the outer loop, we cannot lose those buffers during that round. In addition, if more buffers become available, more than one block of R can be kept in memory. Will that improve the running time?
Cases with LRU
• Case 1
– With LRU as the buffer-replacement strategy, suppose k buffers are available to hold blocks of R.
– R is read in order, so the blocks remaining in the buffers at the end of an iteration of the outer loop are the last k blocks of R.
– The next iteration starts from the beginning of R again. Therefore, the k buffers for R all need to be replaced.
• Case 2
– In a better implementation of nested-loop join under LRU, visit the blocks of R in an order that alternates: first-to-last, then last-to-first.
– In this way we save k disk I/O's on each iteration except the first.
With Other Algorithms
• Other algorithms are also impacted by the fact that the availability of buffers can vary, and by the buffer-replacement strategy used by the buffer manager.
• In sort-based algorithms, when the availability of buffers is reduced, we can change the size of the sublists. The major limitation is that we may be forced to create so many sublists that we cannot allocate a buffer to each sublist in the merging process.
• In hash-based algorithms, when the availability of buffers is reduced, we can reduce the number of buckets, provided the buckets do not then become so large that they do not fit into the allotted main memory.
References
• Database Systems: The Complete Book, Second Edition, by Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom
Thank You
David Le
CS257, ID: 126
Feb 28, 2013
15.9 Query Execution Summary
Query Processing
Outline of Query Compilation
Table Scanning
Cost Measures
Review of Algorithms:
– One-pass methods
– Nested-loop join
– Two-pass (sort-based, hash-based)
– Index-based
– Multi-pass
Overview
[Diagram: a query enters query compilation, which produces a query plan; query execution then carries out the plan against the data, consulting the metadata.]
The query is compiled. This involves extensive optimization using the operations of relational algebra.
It is first compiled into a logical query plan, e.g., using expressions of relational algebra.
The logical plan is then converted to a physical query plan, by selecting an implementation for each operator, ordering the joins, and so on.
The query is then executed.
Outline of Query Compilation
[Diagram: SQL query → parse query (expression tree) → select logical plan → query optimization: logical query plan tree → select physical plan → physical query plan tree → execute plan.]
Parsing: A parse tree for the query is constructed.
Query Rewrite: The parse tree is converted to an initial query plan and transformed into a logical query plan.
Physical Plan Generation: The logical plan is converted into a physical plan by selecting algorithms and an order of execution.
Table Scanning
There are two approaches for locating the tuples of a relation R:
• Table-scan: Get the blocks one by one.
• Index-scan: Use an index to lead us to all blocks holding R.
A sort-scan takes a relation R and sorting specifications, and produces R in sorted order.
This can be accomplished with the SQL clause ORDER BY.
Cost Measures
Estimates of cost are essential for query optimization.
They allow us to determine the slow and fast parts of a query plan.
Reading many consecutive blocks on a track is extremely important, since disk I/O's are expensive in terms of time.
EXPLAIN SELECT * FROM a JOIN b ON a.id = b.id;
One-pass Methods
Tuple-at-a-time: Selection and projection, which do not require an entire relation in memory at once.
Full-relation, unary operations: Must see all or most of the tuples in memory at once; used by the grouping and duplicate-elimination operators.
Full-relation, binary operations: These include union, intersection, difference, product, and join.
Review of Algorithms
Nested-Loop Joins
In a sense, it is a 'one-and-a-half'-pass method, since one argument has its tuples read only once, while the other is read repeatedly.
Can use relations of any size; the data does not all have to fit in main memory.
Two variations of nested-loop join:
• Tuple-based: The simplest form; can be very slow, since it takes T(R)*T(S) disk I/O's if we are joining R(x,y) with S(y,z).
• Block-based: Organizes access to both argument relations by blocks and uses as much main memory as possible to store tuples.
Review of Algorithms
Two-pass Algorithms
Two passes are usually enough, even for large relations.
Based on sorting:
• Partition the arguments into memory-sized, sorted sublists.
• The sorted sublists are then merged appropriately to produce the desired result.
Based on hashing:
• Partition the arguments into buckets. Useful if the data is too big to fit in memory.
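The hashing idea can be sketched as follows: the first pass partitions each argument into buckets on the join key, so that the second pass only has to join one bucket pair at a time in memory. The function names and tuple layout below are assumptions made for illustration.

```python
def hash_partition(tuples, key, num_buckets):
    """First pass: partition tuples into buckets by hashing the key,
    so each bucket can later be processed in memory."""
    buckets = [[] for _ in range(num_buckets)]
    for t in tuples:
        buckets[hash(key(t)) % num_buckets].append(t)
    return buckets

def hash_join(R, S, num_buckets):
    """Illustrative two-pass hash join of (key, value) tuple lists.
    Matching keys always land in the same bucket pair."""
    r_buckets = hash_partition(R, lambda t: t[0], num_buckets)
    s_buckets = hash_partition(S, lambda t: t[0], num_buckets)
    out = []
    # Second pass: join corresponding buckets one pair at a time.
    for rb, sb in zip(r_buckets, s_buckets):
        lookup = {}
        for k, v in sb:
            lookup.setdefault(k, []).append(v)
        for k, v in rb:
            for w in lookup.get(k, []):
                out.append((k, v, w))
    return out
```

Only one bucket of each relation needs to be in memory at a time, which is why hashing requires just one of the arguments to be small relative to memory.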
Review of Algorithms
Two-pass Algorithms
Sort-based vs. hash-based:
• Hash-based algorithms are often superior to sort-based ones, since they require only one of the arguments to be small.
• Sort-based algorithms work well when there is a reason to keep some of the data sorted.
Review of Algorithms
Index-based Algorithms
Index-based joins are excellent when one of the relations is small, and the other has an index on the join attributes.
Clustering and non-clustering indexes:
• A clustering index has all tuples with a fixed value packed into the minimum number of blocks.
• A clustered relation can also have non-clustering indexes.
Review of Algorithms
Multi-pass Algorithms
Two-pass algorithms based on sorting or hashing can be generalized to three or more passes, and then work for larger data sets.
Each pass of a sorting algorithm reads all data from disk and writes it out again.
Thus, a k-pass sorting algorithm requires 2·k·B(R) disk I/O’s.
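The stated cost is simple arithmetic; the function below just encodes the 2·k·B(R) formula from the slide (the function name is illustrative).

```python
def multipass_sort_io(k, B):
    """Total disk I/O's for a k-pass sort in which every pass
    reads all B blocks from disk and writes them back: 2*k*B."""
    return 2 * k * B

# For example, a 3-pass sort of a relation with B(R) = 1,000,000
# blocks costs 2 * 3 * 1,000,000 = 6,000,000 disk I/O's.
```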
Review of Algorithms
Questions or Cookies?
THANK YOU.