
Disks and Files

Vivek Pai

Princeton University

2

Gedankyou

Imagine the following: A disk scheduling policy says “handle the request that is closest to where the disk head currently is”

On a system with lots of disk-intensive jobs, what problem can arise?

What tweaks can avoid this problem?
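One way to explore the question is to simulate the policy. The sketch below is a toy under stated assumptions: the track numbers, the arrival pattern, and the function name `sstf_order` are all made up for illustration.

```python
def sstf_order(head, pending, arrivals):
    """Serve requests closest-first; return the order in which tracks are served.

    head     -- starting track of the disk head (hypothetical numbers)
    pending  -- tracks already queued at time 0
    arrivals -- one list of newly arriving tracks per service step
    """
    pending = list(pending)
    served = []
    step = 0
    while pending:
        # Pick the request with the smallest seek distance from the head.
        nxt = min(pending, key=lambda t: abs(t - head))
        pending.remove(nxt)
        served.append(nxt)
        head = nxt
        # New requests arrive while the disk is busy.
        if step < len(arrivals):
            pending.extend(arrivals[step])
        step += 1
    return served


# One far-away request (track 900) competes with a stream of nearby ones.
nearby_stream = [[12], [9], [14], [11], [8]]
order = sstf_order(head=10, pending=[900, 10], arrivals=nearby_stream)
print(order)
```

In this run the request for track 900 is served only after the nearby stream dries up; with a stream that never ends, it would wait indefinitely.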

3

Why Files

Physical reality: block oriented; physical sector #s; no protection among users of the system; data might be corrupted if machine crashes

Filesystem model: byte oriented; named files; users protected from each other; robust to machine failures

4

File Structures

Byte sequence: read or write a number of bytes; unstructured or linear

Record sequence: fixed or variable length; read or write a number of records

Tree: records with keys; read, insert, delete a record (typically using B-tree)

5

File Structures Today

Stream of bytes: simplest to implement in kernel; easy to manipulate in other forms; little performance loss

More complicated structures: hardware assist fell out of favor; special-purpose hardware slower, costly

6

File Types

ASCII – plain text

A Unix executable file: header (magic number, sizes, entry point, flags); text (code); data; relocation bits; symbol table

Devices

Everything else in the system

7

So What Makes Filesystems Hard?

Files grow and shrink in pieces

Little a priori knowledge

6 orders of magnitude in file sizes

Overcoming disk performance behavior

Desire for efficiency

Coping with failure

8

File System Components

Disk management: arrange collection of disk blocks into files

Naming: user gives file name, not track or sector number, to locate data

Security: keep information secure

Reliability/durability: when system crashes, lose stuff in memory, but want files to be durable

[Figure: layers from top to bottom: User, File naming, File access, Disk management, Disk drivers]

9

Some Definitions

File descriptor (fd) – an integer used to represent a file – easier than using names

Metadata – data about data - bookkeeping data used to eventually access the “real” data

Open file table – system-wide list of descriptors in use
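To make the file descriptor definition concrete, here is a minimal sketch using Python's `os` module, which wraps the underlying system calls. The scratch file name is an assumption made for the example.

```python
import os
import tempfile

# Create a scratch file so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "example.txt")

# os.open returns a file descriptor: a small integer, not a name.
fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
print(type(fd), fd)

# Later calls refer to the file only by that integer.
os.write(fd, b"hello")
os.lseek(fd, 0, os.SEEK_SET)
data = os.read(fd, 5)
os.close(fd)

os.remove(path)
os.rmdir(tmp_dir)
```

The name is resolved once, at open time; every later read, write, and seek skips the lookup.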

10

Kinds of Metadata

inode – index node, or a specific set of information kept about each file. Two forms: on disk and in memory

Directory – names and location information for files and subdirectories. Note: stored in files in Unix

Superblock – contains information to describe the file system, disk layout

Information about free blocks/inodes on disk

11

Contents of an Inode

Disk inode: file type, size, blocks on disk; owner, group, permissions (r/w/x); reference count; times (creation, last access, last mod); inode generation number; padding & other stuff

128 bytes on classic Unix
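Several of these inode fields are visible from user space. The sketch below assumes a Unix-like system and reads them with `os.stat` on a temporary file created just for the example.

```python
import os
import stat
import tempfile

with tempfile.NamedTemporaryFile() as f:
    f.write(b"x" * 100)
    f.flush()
    info = os.stat(f.name)

print("inode number:", info.st_ino)
print("regular file:", stat.S_ISREG(info.st_mode))
print("permissions: ", stat.filemode(info.st_mode))
print("size (bytes):", info.st_size)
print("link count:  ", info.st_nlink)
print("owner, group:", info.st_uid, info.st_gid)
print("last access: ", info.st_atime)
print("last mod:    ", info.st_mtime)
```

The file name is not among the fields: it lives in the directory, not in the inode.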

12

Directories in Unix

Stored like regular files: contents are file names and inode #s; names are nul-terminated strings

Logic: separates file from location in tree; file can appear in multiple places

What are the drawbacks?
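A file appearing in multiple places can be demonstrated with hard links. This is a minimal sketch assuming a Unix-like file system that supports `os.link`; the names `a.txt` and `b.txt` are made up.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    first = os.path.join(d, "a.txt")
    second = os.path.join(d, "b.txt")

    with open(first, "w") as f:
        f.write("shared contents")

    # Add a second directory entry that refers to the same inode.
    os.link(first, second)

    same_inode = os.stat(first).st_ino == os.stat(second).st_ino
    links_after_link = os.stat(first).st_nlink

    # Removing one name only drops the reference count; the data stays.
    os.remove(first)
    links_after_remove = os.stat(second).st_nlink
    with open(second) as f:
        surviving = f.read()

print(same_inode, links_after_link, links_after_remove, surviving)
```

Both names map to one inode, and the inode's reference count tracks how many directory entries still point at it.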

13

Effects of Corruption

inode – file gets “damaged”. Maybe some “free” block gets viewed

Directory – “lose” files/directories. Might get to read deleted files

Superblock – can’t figure out anything. This is why we replicate the superblock

14

Data Structures for A Typical File System

[Figure: process control block with open file pointer array; open file table (system-wide); memory inode; disk inode]

15

Opening A File

File name lookup and authentication

Copy the file metadata into the in-memory data structure, if it is not in yet

Create an entry in the open file table (system wide) if there isn’t one

Create an entry in PCB

Link up the data structures

Return a pointer to user

[Figure: fd = open( FileName, access ) performs file name lookup & authentication against the file system on disk, then allocates & links up data structures: PCB, open file table, metadata]

16

Reading And Writing

What happens when you… read 10 bytes from a file? write 10 bytes into an existing file? write 4096 bytes into a file?

Disk works on blocks (sectors): can have temporary (ephemeral) buffers; longer lasting buffers = disk cache
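The questions above can be explored with a toy model. The sketch below assumes 512-byte blocks and a hypothetical `ToyDisk` that transfers only whole blocks; it counts the block reads and writes needed for a small write and for an aligned 4096-byte write.

```python
BLOCK_SIZE = 512  # assumed sector size for this toy model


class ToyDisk:
    """A hypothetical block device that only transfers whole blocks."""

    def __init__(self, num_blocks):
        self.blocks = [bytes(BLOCK_SIZE) for _ in range(num_blocks)]
        self.reads = 0
        self.writes = 0

    def read_block(self, n):
        self.reads += 1
        return self.blocks[n]

    def write_block(self, n, data):
        assert len(data) == BLOCK_SIZE
        self.writes += 1
        self.blocks[n] = bytes(data)


def write_bytes(disk, offset, data):
    """Write `data` at byte `offset`, going through whole blocks."""
    pos = 0
    while pos < len(data):
        block_no, start = divmod(offset + pos, BLOCK_SIZE)
        chunk = data[pos:pos + BLOCK_SIZE - start]
        if len(chunk) == BLOCK_SIZE:
            # Full-block overwrite: no need to read the old contents.
            buf = bytearray(chunk)
        else:
            # Partial block: read, modify in a buffer, write back.
            buf = bytearray(disk.read_block(block_no))
            buf[start:start + len(chunk)] = chunk
        disk.write_block(block_no, buf)
        pos += len(chunk)


disk = ToyDisk(num_blocks=16)
write_bytes(disk, offset=100, data=b"0123456789")
small = (disk.reads, disk.writes)      # partial block: read + write

disk.reads = disk.writes = 0
write_bytes(disk, offset=0, data=bytes(4096))
large = (disk.reads, disk.writes)      # eight aligned 512-byte blocks
print(small, large)
```

In this model the 10-byte write costs one read plus one write, while the aligned 4096-byte write needs no reads at all. A real system would also consult the buffer cache before touching the disk.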

17

Reading A Block

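[Figure: read( fd, userBuf, size ) goes through the PCB, open file table, and metadata; logical block is mapped to physical block; read( device, phyBlock, size ) gets the physical block into sysBuf via the buffer cache and disk device driver; sysBuf is then copied to userBuf]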

18

A Disk Layout for A File System

Superblock defines a file system: size of the file system; size of the file descriptor area; free list pointer, or pointer to bitmap; location of the file descriptor of the root directory; other meta-data such as permission and various times

For reliability, replicate the superblock

[Figure: disk layout with boot block, superblock, file metadata (i-node in Unix), and file data blocks]

19

File Usage Patterns

How do users access files? Sequential: bytes read in order. Random: read/write element out of middle of arrays. Whole file or partial file

How are files used? Most files are small; large files use up most of the disk space; large files account for most of the bytes transferred

Bad news: need everything to be efficient

20

Data Structures for Disk Management

A “header” for each file (part of the file meta-data): disk sectors associated with each file

A data structure to represent free space on disk

Bit map: 1 bit per block (sector); blocks numbered in cylinder-major order, why?

Linked list

Others?

How much space does a bit map need for a 4G disk?
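The answer depends on the block size, which the slide does not fix. As a worked sketch, the code below reads "4G" as 4 GiB and tries two assumed block sizes, 512 bytes and 4096 bytes.

```python
DISK_BYTES = 4 * 2**30  # reading "4G" as 4 GiB (an assumption)


def bitmap_bytes(disk_bytes, block_bytes):
    """Bytes needed for a free-space bitmap with 1 bit per block."""
    num_blocks = disk_bytes // block_bytes
    return (num_blocks + 7) // 8  # round up to whole bytes


for block_bytes in (512, 4096):
    size = bitmap_bytes(DISK_BYTES, block_bytes)
    print(f"{block_bytes}-byte blocks -> {size // 1024} KiB bitmap")
```

Under these assumptions the bitmap takes 1 MiB with 512-byte sectors and 128 KiB with 4 KiB blocks, a small fraction of the disk either way.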

21

Linked Files (Alto)

File header points to 1st block on disk

Each block points to next

Pros: can grow files dynamically; free list is similar to a file

Cons: random access: horrible; unreliable: losing a block means losing the rest

[Figure: file header pointing to a chain of blocks ending in null]

22

Contiguous Allocation

Request in advance for the size of the file

Search bit map or linked list to locate a space

File header: first sector in file; number of sectors

Pros: fast sequential access; easy random access

Cons: external fragmentation; hard to grow files

23

Single-Level Indexed Files or Extent-based Filesystems

A user declares max size

A file header holds an array of pointers to point to disk blocks

Pros: can grow up to a limit; random access is fast

Cons: clumsy to grow beyond limit; periodic cleanup of new files; up-front declaration a real pain

[Figure: file header with pointers to disk blocks]

24


File Allocation Table (FAT) Approach

A section of disk for each partition is reserved

One entry for each block

A file is a linked list of blocks

A directory entry points to the 1st block of the file

Pros: simple

Cons: always go to FAT; wasting space

[Figure: FAT with directory entry foo pointing to block 217; labels shown are 0, 217, 399, 619, and EOF]
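Following a FAT chain can be sketched in a few lines. The table contents here are an assumption consistent with the labels in the figure (foo starts at 217, then 619, then 399, then EOF); the `EOF` sentinel value and the dictionary representation are made up for the example.

```python
EOF = -1  # sentinel for end of chain (hypothetical encoding)

# Assumed table contents: one entry per block, holding the next block.
fat = {217: 619, 619: 399, 399: EOF}
directory = {"foo": 217}  # directory entry points to the first block


def file_blocks(name):
    """Follow the FAT chain and return the file's blocks in order."""
    blocks = []
    block = directory[name]
    while block != EOF:
        blocks.append(block)
        block = fat[block]
    return blocks


print(file_blocks("foo"))  # [217, 619, 399]
```

Reaching the n-th block of a file still means walking n entries, but the walk stays inside the FAT and never touches the data blocks.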

25

Multi-Level Indexed Files (Unix)

13 pointers in a header: 10 direct pointers; 11: 1-level indirect; 12: 2-level indirect; 13: 3-level indirect

Pros & cons: in favor of small files; can grow; limit is 16G and lots of seek

What happens to reach block 23, 5, 340?

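[Figure: file header whose pointers 1, 2, ... point directly to data blocks, and whose pointers 11, 12, 13 point through one, two, and three levels of indirect blocks to data]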
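The lookup for blocks 23, 5, and 340 can be worked through in code. This sketch assumes 1 KB blocks holding 256 four-byte pointers each, which is consistent with the 16G limit (256^3 blocks of 1 KB), and it treats file block numbers as zero-based; the function name `locate` is made up.

```python
NUM_DIRECT = 10
PTRS_PER_BLOCK = 256  # assumed: 1 KB blocks holding 4-byte pointers


def locate(block_no):
    """Return (pointer kind, index path) for a zero-based file block number."""
    if block_no < NUM_DIRECT:
        return ("direct", [block_no])
    block_no -= NUM_DIRECT

    if block_no < PTRS_PER_BLOCK:
        return ("1-level indirect", [block_no])
    block_no -= PTRS_PER_BLOCK

    if block_no < PTRS_PER_BLOCK ** 2:
        return ("2-level indirect", list(divmod(block_no, PTRS_PER_BLOCK)))
    block_no -= PTRS_PER_BLOCK ** 2

    if block_no < PTRS_PER_BLOCK ** 3:
        outer, rest = divmod(block_no, PTRS_PER_BLOCK ** 2)
        return ("3-level indirect", [outer, *divmod(rest, PTRS_PER_BLOCK)])
    raise ValueError("block number beyond the maximum file size")


for n in (23, 5, 340):
    print(n, locate(n))
```

Under these assumptions block 5 is reached through a direct pointer, block 23 through entry 13 of the 1-level indirect block, and block 340 through entry 0 and then entry 74 of the 2-level indirect tree. Each level of indirection adds one more block read before the data block, unless the indirect block is already cached.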

26

Reliability In Disk Systems

Make sure certain actions have occurred before function completes: known as “synchronous” operation. Ex: make sure new inode is on disk & that the directory has been modified before declaring a file creation is complete

Drawback: speed. Some ops easily asynchronous: access time. Some filesystems don’t care: Linux ext2fs
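The slide describes what the file system does internally. A user-level analogue, sketched below for a POSIX-like system, is to request the same guarantee explicitly with `os.fsync` on both the new file and its directory. Opening a directory with `os.open` is an assumption that does not hold on every platform.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "new_file")

    # Create the file and ask the kernel to flush it to stable storage.
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
    os.write(fd, b"important")
    os.fsync(fd)  # blocks until the flush has been carried out
    os.close(fd)

    # Also flush the directory so the new name itself is durable.
    dir_fd = os.open(d, os.O_RDONLY)
    os.fsync(dir_fd)
    os.close(dir_fd)

    created = os.path.exists(path)
```

Each `fsync` call waits on the storage device, which is the speed drawback noted above.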

27

Recovery After Failure

Need to ensure consistency: does free bitmap match tree walk? Do reference counts in inodes match directory entries? Do blocks appear in multiple inodes?

This kind of recovery grows with disk size

Clean shutdown – mark as such, no recovery
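The three consistency questions can be sketched as checks over a toy in-memory image. The data layout below (dictionaries for inodes and directories, a list of booleans for the bitmap) is invented for illustration and is not how any real checker stores its state.

```python
def check_consistency(inodes, directories, free_bitmap):
    """Run the three checks on a toy in-memory file system image.

    inodes      -- {inode number: {"refs": int, "blocks": [block numbers]}}
    directories -- list of {name: inode number} dictionaries
    free_bitmap -- list of booleans, True meaning the block is free
    """
    problems = []

    # 1. Does the free bitmap match what the tree walk finds in use?
    owners = {}
    for ino, inode in inodes.items():
        for block in inode["blocks"]:
            owners.setdefault(block, []).append(ino)
    for block, is_free in enumerate(free_bitmap):
        if is_free == (block in owners):
            problems.append(f"bitmap wrong for block {block}")

    # 2. Do reference counts match the number of directory entries?
    seen = {}
    for directory in directories:
        for ino in directory.values():
            seen[ino] = seen.get(ino, 0) + 1
    for ino, inode in inodes.items():
        if inode["refs"] != seen.get(ino, 0):
            problems.append(f"inode {ino} has wrong reference count")

    # 3. Does any block appear in more than one inode?
    for block, inos in owners.items():
        if len(set(inos)) > 1:
            problems.append(f"block {block} shared by inodes {sorted(set(inos))}")

    return problems


inodes = {1: {"refs": 1, "blocks": [0, 1]}, 2: {"refs": 2, "blocks": [1]}}
directories = [{"a": 1, "b": 2}]
free_bitmap = [False, False, False, True]
print(check_consistency(inodes, directories, free_bitmap))
```

Every check walks all inodes, directory entries, or blocks, which is why this kind of recovery takes longer as disks grow.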

28

Reducing Synchronous Times

Write to a faster storage: nonvolatile memory – expensive, requires some additional OS/firmware support

Write to a special disk or section – logging: only have to examine log when recovering; eventually have to put information in place; some information dies in the log itself

Write in a special order: write metadata in a way that is consistent but possibly recovers less

29

Challenges

Unix filesystem has great flexibility

Extent-based filesystems have speed

Seeks kill performance – locality

Bitmaps show contiguous free space

Linked lists easy to search

How do you perform backup/restore?

30

A Quick XOR Overview

XOR = eXclusive OR

a XOR a = 0

a XOR 0 = a

a XOR b = b XOR a

(a XOR b) XOR c = a XOR (b XOR c)

In other words, count the 1 bits: even = 0, odd = 1

31

More Fun With XOR

Result = XOR (a1, a2, a3, a4,…)

a2 goes bad. Can we reconstruct a2?

a2 = XOR (a1, result, a3, a4,…)

What does this imply for disks?

What kinds of failures does it handle?
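The reconstruction can be checked directly. In this minimal sketch the four byte values are arbitrary stand-ins for disk blocks.

```python
from functools import reduce
from operator import xor

# Hypothetical contents of four data blocks (one byte each for brevity).
a1, a2, a3, a4 = 0b10110010, 0b01101100, 0b11110000, 0b00001111

# Parity is the XOR of all data values.
result = reduce(xor, [a1, a2, a3, a4])

# a2 is lost: XOR the parity with the surviving values to rebuild it.
rebuilt_a2 = reduce(xor, [a1, result, a3, a4])
print(rebuilt_a2 == a2)  # True
```

The rebuild works only when it is known which value went bad; the XOR alone cannot say which one is wrong.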

32

Bigger, Faster, Stronger

Making individual disks larger is hard

Throw more disks at the problem: capacity increases; effective access speed may increase; probability of failure also increases

Use some disks to provide redundancy: generally assume a fail-stop model; fail-stop versus Byzantine failures

33

RAID (Redundant Array of Inexpensive Disks)

Main idea: store the error correcting codes on other disks; general error correcting codes are too powerful; use XORs or single parity; upon any failure, one can recover the entire block from the spare disk (or any disk) using XORs

Pros: reliability; high bandwidth

Cons: the controller is complex

[Figure: RAID controller computing XOR across the disks]

34

Synopsis of RAID Levels

RAID Level 0: Non redundant (JBOD)

RAID Level 1: Mirroring

RAID Level 2: Byte-interleaved, ECC

RAID Level 3: Byte-interleaved, parity

RAID Level 4: Block-interleaved, parity

RAID Level 5: Block-interleaved, distributed parity

35

Did RAID Work?

Performance: yes

Reliability: yes

Cost: no. Controller design complicated; fewer economies of scale; high-reliability environments don’t care

Now also software implementations

36

RAID’s Real Benefit

Partly addresses the failure problem: backup/restore less of an issue; failed disk “rebuilt” at sector level; lower performance during rebuild, but system still on-line

Still not perfect: geographic problems; failure during rebuild
