CS2300: File Structures and Introduction to Database Systems Lecture 4: File Structure Doug McGeehan

Post on 22-Mar-2020

Page 1: CS2300: File Structures and Introduction to Database Systemsweb.mst.edu/~djmvfb/courses/cs2300/static/media/cs2300 - 04 - Disk Storage and File...CS2300: File Structures and Introduction

CS2300: File Structures and

Introduction to Database Systems

Lecture 4: File Structure

Doug McGeehan

How is data stored in the database?

File Structure

Outline

▪ Disk storage devices

▪ Files of records

▪ Operations on files

▪ Types of files

▪ Unordered files

▪ Ordered files

▪ Hash files

Disk Storage Devices

▪ Storage hierarchy.

▪ Primary storage: main memory, cache

▪ Secondary storage: magnetic disk, CD-ROM/DVD, tape, solid state, etc.

▪ Most databases are stored on disk

▪ Disk is cheaper than main memory and non-volatile, though slower.

▪ DBMS files often optimized for spinning magnetic disks

Disk Storage Devices

▪ Disks are divided into concentric circular tracks on each disk surface.

▪ A track is further divided into sectors, whose size is traditionally 512 bytes (modern: 4096 bytes).

▪ A sector is the smallest addressable unit on a disk.

▪ Tracks from all surfaces which are at the same diameter form a cylinder.

[Figure: disk drive anatomy, labeling the platters, spindle, disk head, arm assembly, arm movement, tracks, and a sector]

Disk Storage Devices

▪ Contiguous sectors are organized into blocks

▪ The block size B is fixed during disk formatting

▪ Typical range: 512 bytes to 8192 bytes

▪ Blocks are separated by fixed-size interblock gaps

▪ Whole blocks are transferred between disk and main memory for processing

▪ A read-write head moves to the track that contains the block to be transferred

▪ Disk rotation moves the block under the read-write head for reading or writing.

Disk Storage Devices

▪ A physical disk block address consists of:

▪ Surface number

▪ Track number (within surface)

▪ Block number (within track)

▪ Disk drives typically rotate continuously at a constant speed, i.e. a fixed number of revolutions per minute (rpm)

Disk Storage Devices

▪ The total time of reading or writing a disk block is the sum of:

– Seek time: the time required to move the read-write head to the correct cylinder.

– Rotational delay: the time required to rotate the disk so the desired block can be placed under the read-write head.

– Transfer time: the time required to transfer the data from the disk to main memory (buffer).

Disk Storage Devices

Let B be the block size in bytes

T be the track size in bytes

V be the spindle speed in rpm (revolutions per minute)

s be the seek time in ms

Average rotational delay (rd) = the time of half a revolution

= 0.5 * (60*1000/V) ms

Transfer rate (tr) = track size / time to spin once

= T / (60*1000/V) bytes/ms

Block transfer time (btt) = block size / bytes transferred per ms

= B / tr ms

Total time to find and transfer a block is (s + rd + btt) ms

Example

Block size = 500 bytes

# of blocks per track = 20

# of tracks per surface = 400

A disk pack consists of 15 double-sided disks.

Seek time = 30ms

1. How many cylinders?

2. What is the total capacity of a disk pack?

3. At 5000 rpm (revolutions per minute):

a. What is the transfer rate and block transfer time?

b. What is the total time to find and transfer a block?

Example

1. 400 cylinders (one per track position, since a cylinder groups the same track across all surfaces)

2. Total capacity = 500*20*400*15*2 = 120,000,000 bytes

3. Track size = 500*20 = 10,000 bytes

Transfer rate = 10,000 / ((60*1000)/5000) ≈ 833 bytes/ms

Block transfer time = 500/833 ≈ 0.6 ms

Average rotational delay = 0.5 * (60*1000/5000) = 6 ms

Total time to find and transfer a block

= seek time + average rotational delay + Block transfer time

= 30 + 6 + 0.6 = 36.6 (ms)
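
The arithmetic in this example can be checked with a short Python sketch. The function and parameter names below are illustrative choices, not part of the lecture; the formulas are the ones defined for rd, tr, and btt.

```python
def pack_capacity(block_size, blocks_per_track, tracks_per_surface, disks, sides=2):
    """Total capacity of a disk pack in bytes."""
    return block_size * blocks_per_track * tracks_per_surface * disks * sides


def disk_timing(block_size, blocks_per_track, rpm, seek_ms):
    """Return (tr in bytes/ms, btt in ms, rd in ms, total time in ms)."""
    track_size = block_size * blocks_per_track   # T, bytes per track
    revolution_ms = 60 * 1000 / rpm              # time for one full revolution
    tr = track_size / revolution_ms              # transfer rate
    btt = block_size / tr                        # block transfer time
    rd = 0.5 * revolution_ms                     # average rotational delay
    return tr, btt, rd, seek_ms + rd + btt


capacity = pack_capacity(500, 20, 400, 15)
tr, btt, rd, total = disk_timing(500, 20, rpm=5000, seek_ms=30)
print(capacity, round(tr), round(btt, 1), rd, round(total, 1))
```

Seek time and rotational delay dominate the total here: only about 0.6 ms of the 36.6 ms is spent actually transferring data.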

Disk Storage Devices

▪ Reading or writing a disk block is time consuming

▪ Seek time

▪ Rotational delay (latency)

▪ Locating data on disk is a major bottleneck in database applications.

Outline

▪ Disk storage devices

▪ Files of records

▪ Operations on files

▪ Types of files

▪ Unordered files

▪ Ordered files

▪ Hash files

Files of Records

▪ A file is a sequence of records

▪ A record is a collection of fields

▪ Corresponds to an entity

▪ Records are stored on disk blocks

▪ The blocking factor (bfr) for a file: the (average) number of file records stored in a disk block

▪ A file descriptor (or file header) includes information that:

▪ Describes the file (e.g. field names and data types)

▪ Gives the addresses of the file blocks on disk

Files of Records

[Figure: hierarchy of fields, records, blocks, and files]

Files of Records

▪ A file can have fixed-length records or variable-length records.

▪ A record may have fixed-length fields or variable-length fields.

Fixed-length Records

Employee record schema:

(1) EID, 2-byte integer

(2) Name, 10 characters

(3) Dept, 2-byte code

[Figure: sample records stored under this schema]
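
As a rough illustration of why fixed-length records are easy to locate, the sketch below packs the Employee schema into 14 bytes with Python's `struct` module. Treating Dept as an unsigned integer code, the little-endian layout, and the helper names are assumptions made for this example.

```python
import struct

# "<" = little-endian with no padding; H = 2-byte unsigned int; 10s = 10-byte char field
EMPLOYEE = struct.Struct("<H10sH")


def pack_employee(eid, name, dept):
    # struct pads a short name with zero bytes and truncates a long one to 10 bytes
    return EMPLOYEE.pack(eid, name.encode("ascii"), dept)


def unpack_employee(raw):
    eid, name, dept = EMPLOYEE.unpack(raw)
    return eid, name.rstrip(b"\x00").decode("ascii"), dept


def read_record(file_bytes, i):
    """Record i starts at byte i * record_size, so no scanning is needed."""
    start = i * EMPLOYEE.size
    return unpack_employee(file_bytes[start:start + EMPLOYEE.size])


data = pack_employee(7, "Smith", 12) + pack_employee(8, "Jones", 30)
print(EMPLOYEE.size, read_record(data, 1))
```

Because every record has the same size, the offset of any record is a simple multiplication.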

Variable-length Records

• Variable-length fields: exact length not known ahead of time

• Special separator characters denote the end of a field, such as ?, %, or $
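
A minimal sketch of separator-terminated fields, assuming `%` as the separator and assuming that field values never contain it (a real system would need an escaping rule or length prefixes). The function names are hypothetical.

```python
SEP = "%"  # assumed separator; must not occur inside any field value


def encode_record(fields):
    """Terminate every field with the separator character."""
    if any(SEP in field for field in fields):
        raise ValueError("field contains the separator character")
    return "".join(field + SEP for field in fields)


def decode_record(text):
    """Split on the separator; the final piece after the last separator is empty."""
    return text.split(SEP)[:-1]


encoded = encode_record(["Smith", "Research", ""])
print(encoded, decode_record(encoded))
```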

Records

▪ Records can be unspanned (no record can span two blocks) or spanned (a record can be stored in more than one block).

▪ Unspanned: [Figure: records R1 to R5 stored whole in Block 1 and Block 2]

▪ Spanned:

Block 1: R1 R2 R3(a)

Block 2: R3(b) R4 R5 R6 R7(a)

Example

• Suppose that blocks are of size 1024 bytes, a table has 1000 tuples of 100 bytes each, and no tuple is allowed to span two blocks

• How many blocks are needed to store these tuples?

floor(1024/100) = 10 tuples/block

1000/10 = 100 blocks

• How much space is “wasted”?

100*(1024 – 100*10) = 2400 bytes
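
The same calculation as a small Python sketch. The function names are illustrative, and the spanned estimate ignores the pointer each block would need to link record fragments, so it is only a lower bound.

```python
import math


def unspanned_layout(block_size, record_size, num_records):
    """Return (blocking factor, blocks needed, wasted bytes) for unspanned records."""
    bfr = block_size // record_size                  # whole records per block
    blocks = math.ceil(num_records / bfr)
    wasted = blocks * block_size - num_records * record_size
    return bfr, blocks, wasted


def spanned_blocks(block_size, record_size, num_records):
    """Lower bound on blocks if records may span blocks (no pointer overhead)."""
    return math.ceil(num_records * record_size / block_size)


print(unspanned_layout(1024, 100, 1000))
print(spanned_blocks(1024, 100, 1000))
```

Note that `wasted` also counts empty record slots in the last block; in this example the last block is full, so the result matches the 2400 bytes computed by hand.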

Spanned vs. Unspanned

• Unspanned is much simpler, but may waste space

• Spanned is essential if record size > block size

Files of Records

▪ Allocated blocks for storing records may be contiguous, linked, or indexed.

• Contiguous: the file's blocks occupy consecutive disk blocks

• Linked: each file block holds a pointer to the next file block

• Indexed: one or more index blocks hold pointers to the file's blocks

Operations on Files

Actual operations vary from system to system

▪ OPEN

▪ FIND

▪ FINDNEXT

▪ READ

▪ INSERT

▪ DELETE

▪ MODIFY

▪ CLOSE

Types of Files

• Typically, all the tuples in a given table in the database are stored in files

• The database takes over all interactions with these files from the operating system

• Files can be organized in three different ways:

– Heap files (unordered files)

– Sorted files (ordered files)

– Hash files

Unordered Files

▪ Also called a heap or a pile file.

▪ New records are inserted at the end of the file, which is efficient.

▪ To search for a record, a linear search through the file records is necessary.

▪ Requires reading and searching half the file blocks on average

▪ Quite expensive for large files
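
A toy model of a heap file, sketched in Python with the file as a list of blocks and each block as a list of records. It assumes the search key is the first field of each record; the names and the in-memory representation are illustrative only. The second value returned by the search is the number of blocks read.

```python
def heap_insert(blocks, record, bfr):
    """Append to the last block, starting a new block when the last one is full."""
    if not blocks or len(blocks[-1]) == bfr:
        blocks.append([])
    blocks[-1].append(record)


def heap_search(blocks, key):
    """Linear search; returns (record or None, number of blocks read)."""
    for reads, block in enumerate(blocks, start=1):
        for record in block:
            if record[0] == key:
                return record, reads
    return None, len(blocks)


blocks = []
for k in range(10):
    heap_insert(blocks, (k, "x"), bfr=2)

average_reads = sum(heap_search(blocks, k)[1] for k in range(10)) / 10
print(len(blocks), heap_search(blocks, 5), average_reads)
```

With 5 blocks, a successful search reads 3 blocks on average, roughly half the file, while an unsuccessful search reads all of them.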

Example of an Unordered File

123456 CS305 F1998 3.0

232323 MAT123 S1996 2.0 Block 1 (Page 1)

123456 CS305 F1995 2.0

234567 EE101 F1995 3.0

123456 CS315 S1997 4.0

111111 MGT123 F1994 4.0

123456 EE101 S1998 3.0 Block 2 (Page 2)

Ordered Files

▪ Also called a sorted file.

▪ File records are kept sorted by the values of an ordering field.

▪ Insertion is expensive: records must be inserted in the correct order.

▪ Searching for a record by its ordering field value is quite efficient.

▪ Can use a binary search

▪ Requires accessing about log2(b) of the b file blocks (average)
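
A sketch of binary search over the blocks of an ordered file, in Python. It assumes every block is non-empty, records are sorted on their first field, and ordering-field values are unique; the function name and in-memory layout are hypothetical. Each loop iteration stands for one block read.

```python
def ordered_search(blocks, key):
    """Binary search on blocks; returns (record or None, number of blocks read)."""
    lo, hi, reads = 0, len(blocks) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        block = blocks[mid]          # one block access
        reads += 1
        if key < block[0][0]:        # key precedes the first record in this block
            hi = mid - 1
        elif key > block[-1][0]:     # key follows the last record in this block
            lo = mid + 1
        else:                        # key can only be inside this block
            for record in block:
                if record[0] == key:
                    return record, reads
            return None, reads
    return None, reads


# 1024 blocks of 4 records each, keys 0..4095
blocks = [[(b * 4 + i, "x") for i in range(4)] for b in range(1024)]
worst = max(ordered_search(blocks, k)[1] for k in range(4096))
print(ordered_search(blocks, 2500), worst)
```

For 1024 blocks no search reads more than 11 blocks, compared with about 512 on average for a linear search of an unordered file of the same size.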

Example of an Ordered File

CS 238 - Dan Lin 31

111111 CS238 F2008 4.0

111111 CS238 S2008 4.0 Block 1 (Page 1)

123456 CS101 F2007 2.0

123456 CS338 F2008 3.0

123456 CS315 S2008 4.0

123456 EE101 S2008 3.0 Block 2 (Page 2)

234567 EE101 F2008 3.0

600017 EE101 F2008 3.0

Insert new record “150000 CS101 S2007 3.0”

Hash Files

▪ Also called a direct file

▪ Execute a hash function on a key field

▪ Yields disk block address containing record (ideally)

▪ Static external hashing

▪ Dynamic hashing techniques

Static External Hashing

▪ The file blocks are divided into M equal-sized buckets, numbered bucket_0, bucket_1, ..., bucket_{M-1}

▪ Typically, a bucket corresponds to one (or a fixed number of) disk block(s)

▪ One of the key fields is designated the hash key of the file

Static External Hashing

▪ Record with key value K

▪ Hash value i = h(K) for hash function h()

▪ Store record in bucket_i

▪ A hash table mapping each bucket number to its disk block address is maintained in the file header
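
The lookup path can be sketched as two steps: hash the key to a bucket number, then use the header table to turn that number into a block address. In the Python sketch below, h(K) = K mod M and the block addresses are made-up values chosen only for illustration; the lecture does not specify either.

```python
M = 4  # number of buckets (illustrative)

# Hypothetical header table: bucket number -> disk block address
bucket_directory = [1040, 2210, 3178, 4096]


def h(key):
    """Assumed hash function: K mod M."""
    return key % M


def block_address(key):
    """Block that holds (or should hold) the record with this key."""
    return bucket_directory[h(key)]


print(h(10), block_address(10))
```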

Example of a Hash File

• There are 10 buckets.

• Suppose the hash function on branch names gives:

• h(Perryridge) = 5

• h(Round Hill) = 3

• h(Brighton) = 3 (so Round Hill and Brighton land in the same bucket)

Static External Hashing

▪ Search is very efficient on the hash key.

▪ Search on fields other than the hash key is as expensive as in an unordered file.

▪ Collisions can occur

▪ A new record hashes to a full bucket

▪ An overflow file is kept for storing such records

▪ Overflow records from a particular bucket can be linked together
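
A minimal in-memory sketch of static hashing with overflow handling, in Python. The class name, the hash function K mod M, and the use of one Python list per bucket as its overflow chain are all assumptions for illustration; a real file would store overflow records in separate blocks linked by pointers.

```python
class StaticHashFile:
    def __init__(self, num_buckets, bucket_capacity):
        self.m = num_buckets
        self.capacity = bucket_capacity
        self.buckets = [[] for _ in range(num_buckets)]
        self.overflow = [[] for _ in range(num_buckets)]  # per-bucket overflow chain

    def _h(self, key):
        return key % self.m  # assumed hash function

    def insert(self, key, value):
        i = self._h(key)
        # A full primary bucket sends the record to that bucket's overflow chain
        if len(self.buckets[i]) < self.capacity:
            self.buckets[i].append((key, value))
        else:
            self.overflow[i].append((key, value))

    def search(self, key):
        i = self._h(key)
        for k, v in self.buckets[i] + self.overflow[i]:
            if k == key:
                return v
        return None


f = StaticHashFile(num_buckets=10, bucket_capacity=2)
for key, value in [(3, "a"), (13, "b"), (23, "c"), (5, "d")]:
    f.insert(key, value)
print(f.search(23), len(f.overflow[3]))
```

Keys 3, 13, and 23 all hash to bucket 3, so the third one goes to the overflow chain and costs an extra access to find.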

Hashed Files - Overflow handling

Static External Hashing

▪ To reduce overflow records, a hash file is typically kept 70-80% full.

▪ A good hash function h

▪ Distributes records uniformly among buckets

▪ Otherwise, search time will increase because many overflow records will exist

▪ A fixed bucket count M is problematic

▪ The number of records in the file grows or shrinks
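
To see why a fixed M becomes a problem, the Python sketch below sizes a file for a target fill of 75% and then checks the fill after the record count doubles. The sizing rule (records divided by bucket capacity times target fill) and the function names are assumptions for illustration, not a formula from the lecture.

```python
import math


def buckets_needed(num_records, records_per_bucket, target_fill=0.75):
    """Smallest M that keeps the file at or below the target fill fraction."""
    return math.ceil(num_records / (records_per_bucket * target_fill))


def fill_fraction(num_records, records_per_bucket, num_buckets):
    return num_records / (records_per_bucket * num_buckets)


m = buckets_needed(1000, 10)
print(m, fill_fraction(1000, 10, m), fill_fraction(2000, 10, m))
```

With M fixed at 134, doubling the file to 2000 records pushes the fill fraction above 1, so overflow records become unavoidable no matter how uniform the hash function is.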