
11/13/2007 ecs150, Fall 2007 1

UCDavis, ecs150 Fall 2007

ecs150 Fall 2007: Operating System
#5: File Systems (chapters: 6.4~6.7, 8)

Dr. S. Felix Wu

Computer Science Department

University of California, Davis
http://www.cs.ucdavis.edu/~wu/



File System Abstraction

Files Directories

System-call interface

Active file entries

VNODE Layer or VFS

Local naming (UFS)

FFS

Buffer cache

Block or character device driver

Hardware



DIR *dirp = opendir(filename);            /* filename is a const char * */
struct dirent *direntp = readdir(dirp);

struct dirent {
    ino_t d_ino;                  /* i-node number */
    char  d_name[NAME_MAX+1];     /* file name */
};

[Figure: a directory is a list of dirent entries; each dirent pairs an inode with a file_name, and each inode leads to a file.]
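The opendir/readdir calls above can be combined into a small directory walker. This is a minimal sketch, assuming a POSIX system; the function name list_directory is made up for illustration and is not part of the lecture.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>

/* Print the i-node number and name of every entry in a directory.
 * Returns the number of entries seen, or -1 if the directory cannot
 * be opened. list_directory is a hypothetical helper name. */
int list_directory(const char *path)
{
    assert(path != NULL);

    DIR *dirp = opendir(path);
    if (dirp == NULL)
        return -1;

    int count = 0;
    struct dirent *direntp;
    while ((direntp = readdir(dirp)) != NULL) {
        printf("%10lu  %s\n", (unsigned long)direntp->d_ino, direntp->d_name);
        count++;
    }
    closedir(dirp);
    return count;
}
```

Calling list_directory(".") prints one line per dirent, including the "." and ".." entries, which is a quick way to see that a directory is just a table of (inode, name) pairs.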

Local versus Remote

System Call Interface
V-node
Local versus remote
– NFS or i-node
– Stackable File System

Hard-disk blocks

File-System Structure

File structure
– Logical storage unit
– Collection of related information

File system resides on secondary storage (disks).

File system organized into layers.

File control block – storage structure consisting of information about a file.

File -> Disk

separate the disk into blocks
separate the file into blocks as well
paging from file to disk

blocks: 4 - 7 - 2 - 10 - 12

How to represent the file??
How to link these 5 pages together??

BitTorrent pieces

1 big file (X Gigabytes) with a number of pieces (5%) already in (and sharing with others).

How much disk space do we need at this moment?

Hard Disk

Track, Sector, Head
– Track + Heads -> Cylinder

Performance
– seek time
– rotation time
– transfer time

LBA
– Logical Block Addressing

File -> Disk blocks

file block    disk block
0             4
1             7
2             2
3             10
4             12

What are the disadvantages?
1. disk access can be slow for “random access”.
2. How big is each block? 64 bytes? 68 bytes?
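The table lookup behind this slide can be sketched in a few lines. The block numbers 4, 7, 2, 10, 12 come from the slide; the 1024-byte block size and the function name are assumptions made only for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_SIZE 1024   /* assumed block size; the slide leaves it open */

/* Per-file block table from the slide: file block i lives in disk block
 * block_map[i]. */
static const unsigned int block_map[] = { 4, 7, 2, 10, 12 };
#define NBLOCKS (sizeof block_map / sizeof block_map[0])

/* Translate a byte offset within the file into a disk block number and
 * store the offset inside that block in *within. Returns -1 when the
 * offset lies beyond the mapped blocks. Hypothetical helper. */
long file_offset_to_disk_block(size_t offset, size_t *within)
{
    assert(within != NULL);

    size_t file_block = offset / BLOCK_SIZE;
    if (file_block >= NBLOCKS)
        return -1;
    *within = offset % BLOCK_SIZE;
    return (long)block_map[file_block];
}
```

Every read costs one table lookup, and consecutive file blocks may sit far apart on disk, which is where the "random access" cost in the first disadvantage comes from.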

Kernel Hacking Session

This Friday from 7:30 p.m. until midnight, 3083 Kemper
– Bring your laptop
– And bring your mug…

A File System

partition | partition | partition

sb | i-list | directory and data blocks

i-list: i-node, i-node, ……, i-node

One Logical File -> Physical Disk Blocks

efficient representation & access

An i-node

Typical: each block 8K or 16K bytes

??? entries in one disk block

A file

inode (index node) structure

meta-data of the file.
– di_mode 02
– di_nlinks 02
– di_uid 02
– di_gid 02
– di_size 04
– di_addr 39
– di_gen 01
– di_atime 04
– di_mtime 04
– di_ctime 04

System-call interface

Active file entries

VNODE Layer or VFS

Local naming (UFS)

FFS

Buffer cache

Block or character device driver

Hardware



struct ufs2_dinode {
    u_int16_t    di_mode;          /*   0: IFMT, permissions; see below. */
    int16_t      di_nlink;         /*   2: File link count. */
    u_int32_t    di_uid;           /*   4: File owner. */
    u_int32_t    di_gid;           /*   8: File group. */
    u_int32_t    di_blksize;       /*  12: Inode blocksize. */
    u_int64_t    di_size;          /*  16: File byte count. */
    u_int64_t    di_blocks;        /*  24: Bytes actually held. */
    ufs_time_t   di_atime;         /*  32: Last access time. */
    ufs_time_t   di_mtime;         /*  40: Last modified time. */
    ufs_time_t   di_ctime;         /*  48: Last inode change time. */
    ufs_time_t   di_birthtime;     /*  56: Inode creation time. */
    int32_t      di_mtimensec;     /*  64: Last modified time. */
    int32_t      di_atimensec;     /*  68: Last access time. */
    int32_t      di_ctimensec;     /*  72: Last inode change time. */
    int32_t      di_birthnsec;     /*  76: Inode creation time. */
    int32_t      di_gen;           /*  80: Generation number. */
    u_int32_t    di_kernflags;     /*  84: Kernel flags. */
    u_int32_t    di_flags;         /*  88: Status flags (chflags). */
    int32_t      di_extsize;       /*  92: External attributes block. */
    ufs2_daddr_t di_extb[NXADDR];  /*  96: External attributes block. */
    ufs2_daddr_t di_db[NDADDR];    /* 112: Direct disk blocks. */
    ufs2_daddr_t di_ib[NIADDR];    /* 208: Indirect disk blocks. */
    int64_t      di_spare[3];      /* 232: Reserved; currently unused */
};

struct ufs1_dinode {
    u_int16_t    di_mode;          /*   0: IFMT, permissions; see below. */
    int16_t      di_nlink;         /*   2: File link count. */
    union {
        u_int16_t oldids[2];       /*   4: Ffs: old user and group ids. */
    } di_u;
    u_int64_t    di_size;          /*   8: File byte count. */
    int32_t      di_atime;         /*  16: Last access time. */
    int32_t      di_atimensec;     /*  20: Last access time. */
    int32_t      di_mtime;         /*  24: Last modified time. */
    int32_t      di_mtimensec;     /*  28: Last modified time. */
    int32_t      di_ctime;         /*  32: Last inode change time. */
    int32_t      di_ctimensec;     /*  36: Last inode change time. */
    ufs1_daddr_t di_db[NDADDR];    /*  40: Direct disk blocks. */
    ufs1_daddr_t di_ib[NIADDR];    /*  88: Indirect disk blocks. */
    u_int32_t    di_flags;         /* 100: Status flags (chflags). */
    int32_t      di_blocks;        /* 104: Blocks actually held. */
    int32_t      di_gen;           /* 108: Generation number. */
    u_int32_t    di_uid;           /* 112: File owner. */
    u_int32_t    di_gid;           /* 116: File group. */
    int32_t      di_spare[2];      /* 120: Reserved; currently unused */
};

BitTorrent pieces

File size: 10 GB
Pieces downloaded: 512 MB
How much disk space do we need?

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>   /* sleep() */

int
main(void)
{
    FILE *f1 = fopen("./sss.txt", "w");
    int i;

    if (f1 == NULL)
        return 1;

    for (i = 0; i < 1000; i++) {
        /* Seek to a random offset, leaving holes between the writes. */
        fseek(f1, rand(), SEEK_SET);
        fprintf(f1, "%d%d%d%d", rand(), rand(), rand(), rand());
        if (i % 100 == 0)
            sleep(1);
    }
    fflush(f1);
    fclose(f1);
    return 0;
}

# ./t
# ls -l ./sss.txt

Typical: each block 1K


A file

i-node

How many disk blocks can a FS have?
How many levels of i-node indirection will be necessary to store a file of 2G bytes? (I.e., 0, 1, 2 or 3)
What is the largest possible file size in i-node?
What is the size of the i-node itself for a file of 10GB with only 512 MB downloaded?

Answer

How many disk blocks can a FS have?
– 2^64 or 2^32: Pointer (to blocks) size is 8/4 bytes.

How many levels of i-node indirection will be necessary to store a file of 2G (2^31) bytes? (I.e., 0, 1, 2 or 3)
– 12*2^10 + 2^8*2^10 + 2^8*2^8*2^10 + 2^8*2^8*2^8*2^10 >? 2^31

What is the largest possible file size in i-node?
– 12*2^10 + 2^8*2^10 + 2^8*2^8*2^10 + 2^8*2^8*2^8*2^10
– 2^64 – 1
– 2^32 * 2^10

You need to consider three issues and find the minimum!
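The indirection question can be checked mechanically. The sketch below assumes the slide's layout (12 direct pointers, then single, double and triple indirect blocks); the function name indirection_levels is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NDIRECT 12   /* direct block pointers in the i-node, as on the slide */

/* Smallest indirection level (0..3) whose cumulative capacity covers
 * file_size bytes, or -1 if even triple indirection is too small.
 * block_size and ptr_size are in bytes. Very large block sizes could
 * overflow the 64-bit arithmetic; this sketch does not guard against that. */
int indirection_levels(uint64_t file_size, uint64_t block_size, uint64_t ptr_size)
{
    assert(block_size > 0 && ptr_size > 0 && ptr_size <= block_size);

    uint64_t ptrs_per_block = block_size / ptr_size;
    uint64_t capacity = NDIRECT * block_size;   /* direct blocks only */
    uint64_t blocks_at_level = 1;

    for (int level = 0; level <= 3; level++) {
        if (file_size <= capacity)
            return level;
        blocks_at_level *= ptrs_per_block;      /* 2^8, 2^16, 2^24 for 1K/4B */
        capacity += blocks_at_level * block_size;
    }
    return -1;
}
```

With 1K blocks and 4-byte pointers, double indirection tops out a little above 64 MB, so a 2^31-byte file needs the triple indirect block; the total capacity is a little over 16 GB.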

Answer: Lower Bound

How many pointers?
– 512MB divided by the block size (1K)
– 512K pointers times 8 (4) bytes = 4 (2) MB



Answer: Upper Bound

In the worst case, EVERY indirection block has at least one entry!

How many indirection blocks?
– Single: 1 block
– Double: 1 + 2^8
– Triple: 1 + 2^8 + 2^16

Total ~ 2^16 blocks times 1K = 64 MB
– 2^14 times 1K = 16MB (ufs2 inode)

Answer (4)

2 MB ~ 64 MB (ufs1)
4 MB ~ 16 MB (ufs2)

Answer: sss.txt ~17 MB
– ~16 MB (inode indirection blocks)
– 1000 writes times 1K ~ 1MB



FFS and UFS

/usr/src/sys/ufs/ffs/*
– Higher-level: directory structure
– Soft updates & Snapshot

/usr/src/sys/ufs/ufs/*
– Lower-level: buffer, i-node

# of i-nodes

UFS1: pre-allocation
– 3% of HD, about < 25% used.

UFS2: dynamic allocation
– Still limited # of i-nodes

di_size vs. di_blocks

???



di_size vs. di_blocks

di_size     di_blocks
Logical     Physical
fstat       du
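The logical/physical split is visible from user space through fstat. This is a minimal sketch, assuming a POSIX system where st_blocks counts 512-byte units (true on the BSDs and Linux, but an assumption in general); the function name sparse_file_sizes is made up for illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create (or truncate) a file, seek `hole` bytes past the start and write
 * one byte. On success, stores the logical size (st_size) in *logical and
 * the allocated bytes (st_blocks * 512, an assumed unit) in *physical,
 * and returns 0. Returns -1 on any error. Hypothetical helper. */
int sparse_file_sizes(const char *path, off_t hole, off_t *logical, off_t *physical)
{
    assert(path != NULL && logical != NULL && physical != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    struct stat st;
    if (lseek(fd, hole, SEEK_SET) == (off_t)-1 ||
        write(fd, "x", 1) != 1 ||
        fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }
    close(fd);

    *logical = st.st_size;                   /* what ls -l reports */
    *physical = (off_t)st.st_blocks * 512;   /* roughly what du reports */
    return 0;
}
```

On a file system that supports holes, the physical figure is typically far smaller than the logical one, which is the same effect the sss.txt program produces.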

Extended Attributes in UFS2

Attributes associated with the File
– di_extb[2];
– two blocks, but indirection if needed.

Format
– Length 4
– Name Space 1
– Content Pad Length 1
– Name Length 1
– Name mod 8
– Content variable

Applications: ACL, Data Labelling

Some thoughts….

What can you do with “extended attributes”?
How to design/implement?
– Should/can we do it “Stackable File Systems”?
– Otherwise, the program to manipulate the EA’s will have to be very UFS2-dependent or FiST with an UFS2 optimization option.

Are there any counter examples?
– security and performance considerations.


struct dirent {
    ino_t d_ino;
    char  d_name[NAME_MAX+1];
};

struct stat {
    …
    short nlinks;
    …
};

[Figure: a directory is a list of dirent entries; each dirent pairs an inode with a file_name, and each inode leads to a file.]



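[Figure: the i-node list (i-nodes 2 to 9) and the data blocks the i-nodes reference]

i-node attributes shown in the figure:
drwxr-xr-x  Apr 1 2004   root wheel
rwxr-xr-x   Apr 15 2004  root wheel
rw-rw-r--   Jan 19 2004  kirk staff
rwxr-xr-x   Apr 15 2004  bin bin

directory /         (i-node 2):  . 2,  .. 2,  usr 4,  vmunix 5
directory /usr      (i-node 4):  . 4,  .. 2,  bin 7,  foo 6
directory /usr/bin  (i-node 7):  . 7,  .. 4,  ex 9,  groff 10,  vi 9
file /vmunix        (i-node 5):  text data
file /usr/foo       (i-node 6):  Hello World!
file /usr/bin/vi    (i-node 9):  text data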

What is the difference?

ln -s /usr/src/sys/sys/proc.h ppp.h
ln /usr/src/sys/sys/proc.h ppp.h

Hard versus Symbolic

ln -s /usr/src/sys/sys/proc.h ppp.h
– Link to anything, any mounted partitions
– Delete a Symbolic link?

ln /usr/src/sys/sys/proc.h ppp.h
– Link only to “file” (not directory)
– Link only within the same partition -- why?
– Delete a Hard Link?
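The two questions about deletion can be explored with stat and lstat. This is a minimal sketch, assuming a POSIX system; link_count and same_inode are hypothetical helper names.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Return the hard-link count (nlink) of path, or -1 on error. lstat() is
 * used so that a symbolic link reports its own i-node, not its target's. */
long link_count(const char *path)
{
    assert(path != NULL);

    struct stat st;
    if (lstat(path, &st) != 0)
        return -1;
    return (long)st.st_nlink;
}

/* Return 1 if both names resolve to the same i-node on the same file
 * system, 0 otherwise (including when either name cannot be resolved). */
int same_inode(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);

    struct stat sa, sb;
    if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
        return 0;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

A hard link is a second directory entry for the same i-node, so link_count goes from 1 to 2 and removing either name leaves the file intact. A symbolic link has its own i-node holding a path, so removing the target leaves it dangling. Since an i-node number only means something within one file system, this also hints at why hard links cannot cross partitions.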



File System Buffer Cache

application: read/write files

OS: translate file to disk blocks

...buffer cache ...maintains

controls disk accesses: read/write blocks

hardware:

Any problems?

File System Consistency

To maintain file system consistency the ordering of updates from buffer cache to disk is critical

Example:
– if the directory block is written back before the i-node and the system crashes, the directory structure will be inconsistent

File System Consistency

File systems almost always use a buffer/disk cache for performance reasons.

This problem is critical especially for the blocks that contain control information: i-node, free-list, directory blocks.

Two copies of a disk block (buffer cache, disk): a consistency problem arises if the system crashes before all the modified blocks are written back to disk.

Write back critical blocks from the buffer cache to disk immediately.

Data blocks are also written back periodically: sync

Two Strategies

Prevention
– Use un-buffered I/O when writing i-nodes or pointer blocks
– Use buffered I/O for other writes and force sync every 30 seconds

Detect and Fix
– Detect the inconsistency
– Fix them according to the “rules”
– Fsck (File System Checker)

File System Integrity

Block consistency:
– Block-in-use table
– Free-list table

File consistency:
– how many directories pointing to that i-node?
– nlink?
– three cases: D == L, L > D, D > L

What to do with the latter two cases?

0 1 1 1 0 0 0 1 0 0 0 2

1 0 0 0 1 1 1 0 1 0 2 0

File System Integrity

File system states
(a) consistent
(b) missing block
(c) duplicate block in free list
(d) duplicate data block
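The four states can be derived from two per-block counters, one from the block-in-use table and one from the free-list table. The sketch below is only an illustration of that check, not fsck's actual code; the names are hypothetical, and the fifth state (a block that is both in use and free) is an addition that the slide does not list.

```c
#include <assert.h>

enum block_state {
    BLK_CONSISTENT,     /* (a) exactly one of the two tables claims it */
    BLK_MISSING,        /* (b) neither in use nor free */
    BLK_DUP_FREE,       /* (c) listed more than once in the free list */
    BLK_DUP_DATA,       /* (d) claimed by more than one file */
    BLK_USED_AND_FREE   /* not on the slide: in a file and in the free list */
};

/* Classify one block from how many times it appears in files (in_use)
 * and how many times it appears in the free list (is_free). */
enum block_state classify_block(int in_use, int is_free)
{
    assert(in_use >= 0 && is_free >= 0);

    if (in_use > 1)
        return BLK_DUP_DATA;
    if (is_free > 1)
        return BLK_DUP_FREE;
    if (in_use == 0 && is_free == 0)
        return BLK_MISSING;
    if (in_use == 1 && is_free == 1)
        return BLK_USED_AND_FREE;
    return BLK_CONSISTENT;
}
```

Reading the two rows of counters shown earlier as the in-use table followed by the free-list table (an assumption about the figure), block 9 is missing, block 10 appears twice in the free list, and block 11 is a duplicate data block.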

Metadata Operations

Metadata operations modify the structure of the file system
– Creating, deleting, or renaming files, directories, or special files
– Directory & I-node

Data must be written to disk in such a way that the file system can be recovered to a consistent state after a system crash

Metadata Integrity

FFS uses synchronous writes to guarantee the integrity of metadata
– Any operation modifying multiple pieces of metadata will write its data to disk in a specific order
– These writes will be blocking

Guarantees integrity and durability of metadata updates

Deleting a file (I)

abc

def

ghi

i-node-1

i-node-2

i-node-3

Assume we want to delete file “def”

Deleting a file (II)

abc

def

ghi

i-node-1

i-node-3

Cannot delete i-node before directory entry “def”

?

Deleting a file (III)

Correct sequence is
1. Write to disk directory block containing deleted directory entry “def”
2. Write to disk i-node block containing deleted i-node

Leaves the file system in a consistent state

Creating a file (I)

abc

ghi

i-node-1

i-node-3

Assume we want to create new file “tuv”

Creating a file (II)

abc

ghi

tuv

i-node-1

i-node-3

Cannot write directory entry “tuv” before i-node

?

Creating a file (III)

Correct sequence is
1. Write to disk i-node block containing new i-node
2. Write to disk directory block containing new directory entry

Leaves the file system in a consistent state

Synchronous Updates

Used by FFS to guarantee consistency of metadata:
– All metadata updates are done through blocking writes

Increases the cost of metadata updates

Can significantly impact the performance of whole file system

SOFT UPDATES

Use delayed writes (write back)

Maintain dependency information about cached pieces of metadata:
This i-node must be updated before/after this directory entry

Guarantee that metadata blocks are written to disk in the required order

3 Soft Update Rules

Never point to a structure before it has been initialized.

Never reuse a resource before nullifying all previous pointers to it.

Never reset the old pointer to a live resource before the new pointer has been set.

Problem #1 with S.U.

Synchronous writes guaranteed that metadata operations were durable once the system call returned

Soft Updates guarantee that file system will recover into a consistent state but not necessarily the most recent one
– Some updates could be lost



We want to delete file “foo” and create new file “bar”

i-node-2 foo

NEW bar

NEW i-node-3

Block A Block B

What are the dependency relationships?



We want to delete file “foo” and create new file “bar”

i-node-2 foo

NEW bar

NEW i-node-3

Block A Block B

Circular Dependency
X-2nd Y-1st

Problem #2 with S.U.

Cyclical dependencies:
– Same directory block contains entries to be created and entries to be deleted
– These entries point to i-nodes in the same block

Brainstorming:
– How to resolve this issue in S.U.?

FS: buffer or disk??

They appear in both and we try to synchronize them..

Disk

i-node-2 foo

Block A-Dir Block B-i-Node

Buffer

NEW bar

NEW i-node-3

Block A-Dir Block B-i-Node

Synchronize??

i-node-2 foo

NEW bar

NEW i-node-3

Block A Block B

How to update?? i-node first or directory block first?

Solution in S.U.

Roll back metadata in one of the blocks to an earlier, safe state

(Safe state does not contain new directory entry)

def

Block A’

Write first block with metadata that were rolled back (block A’ of example)

Write blocks that can be written after first block has been written (block B of example)

Roll forward block that was rolled back

Write that block

Breaks the cyclical dependency but must now write block A twice

Before any Write Operation
– SU Dependency Checking (roll back if necessary)

After any Write Operation
– SU Dependency Processing (task list updating) (roll forward if necessary)

Two most popular approaches for improving the performance of metadata operations and recovery:
– Journaling
– Soft Updates

Journaling systems record metadata operations on an auxiliary log

Soft Updates uses ordered writes

JOURNALING

Journaling systems maintain an auxiliary log that records all meta-data operations

Write-ahead logging ensures that the log is written to disk before any blocks containing data modified by the corresponding operations.
– After a crash, can replay the log to bring the file system to a consistent state
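The write-ahead rule and log replay can be illustrated with a toy in-memory model. Everything here (the struct layout, the names, the crash flag) is a made-up illustration, not how any real journaling file system lays out its log.

```c
#include <assert.h>
#include <stddef.h>

#define NBLOCKS 8
#define LOG_CAP 16

struct log_record { int block; int value; };

/* Toy model: disk[] stands for metadata blocks, log[] for the auxiliary
 * log; both are imagined to survive a crash. */
struct journaled_fs {
    int disk[NBLOCKS];
    struct log_record log[LOG_CAP];
    int log_len;
};

/* Write-ahead: append the intended update to the log first, and only then
 * modify the block. crash_before_apply simulates a crash after the log
 * write but before the block write. Returns -1 if the log is full. */
int journal_update(struct journaled_fs *fs, int block, int value, int crash_before_apply)
{
    assert(fs != NULL && block >= 0 && block < NBLOCKS);

    if (fs->log_len == LOG_CAP)
        return -1;
    fs->log[fs->log_len].block = block;
    fs->log[fs->log_len].value = value;
    fs->log_len++;

    if (crash_before_apply)
        return 0;               /* the block itself was never written */
    fs->disk[block] = value;
    return 0;
}

/* Recovery: replay every logged record in order. Replaying a record that
 * was already applied is harmless because the update is idempotent. */
void journal_replay(struct journaled_fs *fs)
{
    assert(fs != NULL);
    for (int i = 0; i < fs->log_len; i++)
        fs->disk[fs->log[i].block] = fs->log[i].value;
}
```

Because the log record always reaches "disk" before the block it describes, a crash can lose the block write but never the intent, and replay restores it.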

JOURNALING

Log writes are performed in addition to the regular writes

Journaling systems incur log write overhead but
– Log writes can be performed efficiently because they are sequential (block operation consideration)
– Metadata blocks do not need to be written back after each update

JOURNALING

Journaling systems can provide
– same durability semantics as FFS if log is forced to disk after each meta-data operation
– the laxer semantics of Soft Updates if log writes are buffered until entire buffers are full

Soft Updates vs. Journaling

Advantages disadvantages

With Soft Updates??

CPU

Do we still need “FSCK”? at boot time?

Recover the Missing Resources

In the background, in an active FS…
– We don’t want to wait for the lengthy FSCK process to complete…

A related issue:
– the virus scanning process
– what happens if we get a new virus signature?

Snapshot of the FS

backup and restore
dump reliably an active File System
– what will we do today to dump our 40GB FS “consistent” snapshots? (in the midnight…)

“background FSCK checks”

What is a snapshot? (I mean “conceptually”.)

Freeze all activities related to the FS.
Copy everything to “some space”.
Resume the activities.

How do we efficiently implement this concept such that the activities will only be blocked for about 0.25 seconds, and we don’t have to buy a really big hard drive?



Copy-on-Write
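Copy-on-write avoids copying everything at snapshot time: a block's old contents are saved only when the live file system first overwrites it. The toy model below illustrates the idea with in-memory arrays; the struct, sizes and function names are hypothetical and much simpler than the UFS2 snapshot code.

```c
#include <assert.h>
#include <string.h>

#define NBLOCKS 4
#define BLOCK_SIZE 16

struct volume {
    char live[NBLOCKS][BLOCK_SIZE];    /* blocks of the active file system */
    char saved[NBLOCKS][BLOCK_SIZE];   /* old contents preserved for the snapshot */
    int  copied[NBLOCKS];              /* 0 = "not yet copied", 1 = copied */
};

/* Taking a snapshot copies no data; it only marks every block as
 * "not yet copied", so the freeze is short. */
void take_snapshot(struct volume *v)
{
    assert(v != NULL);
    memset(v->copied, 0, sizeof v->copied);
}

/* Write to the live file system. The first write to a block after the
 * snapshot saves the old contents before overwriting them. */
void write_block(struct volume *v, int b, const char *data)
{
    assert(v != NULL && data != NULL && b >= 0 && b < NBLOCKS);
    if (!v->copied[b]) {
        memcpy(v->saved[b], v->live[b], BLOCK_SIZE);
        v->copied[b] = 1;
    }
    strncpy(v->live[b], data, BLOCK_SIZE - 1);
    v->live[b][BLOCK_SIZE - 1] = '\0';
}

/* Read a block as it was when the snapshot was taken: untouched blocks
 * are read straight from the live file system. */
const char *snapshot_read(const struct volume *v, int b)
{
    assert(v != NULL && b >= 0 && b < NBLOCKS);
    return v->copied[b] ? v->saved[b] : v->live[b];
}
```

The snapshot's logical size is the whole volume, while its physical size grows only with the number of blocks modified since the snapshot was taken.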

Snapshot: a file

Logical size versus physical size

Example

# mkdir /backups/usr/noon
# mount -u -o snapshot /usr/snap.noon /usr
# mdconfig -a -t vnode -u 0 -f /usr/snap.noon
# mount -r /dev/md0 /backups/usr/noon

/* do whatever you want to test it */

# umount /backups/usr/noon
# mdconfig -d -u 0
# rm -f /usr/snap.noon


A Snapshot i-node

A file

Not used or not yet copied

Copy-on-write

A file

Multiple Snapshots

about 20 snapshots
Interactions/sharing among snapshots



VFS: the FS Switch

[Figure: user space sits above the syscall layer (file, uio, etc.); below it are the Virtual File System (VFS) and the network protocol stack (TCP/IP); VFS dispatches to NFS, FFS, LFS, *FS, etc.; device drivers are at the bottom.]

Sun Microsystems introduced the virtual file system interface in 1985 to accommodate diverse filesystem types cleanly.

VFS allows diverse specific file systems to coexist in a file tree, isolating all FS-dependencies in pluggable filesystem modules.

VFS was an internal kernel restructuring with no effect on the syscall interface.

Incorporates object-oriented concepts: a generic procedural interface with multiple implementations.

Based on abstract objects with dynamic method binding by type... in C.

Other abstract interfaces in the kernel: device drivers, file objects, executable files, memory objects.
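"Dynamic method binding by type in C" usually means a table of function pointers attached to each object. The sketch below shows the pattern with two made-up operations; the ops table, the ffs_/nfs_ functions and the VOP_GETSIZE macro are hypothetical and far smaller than the real BSD vnode interface.

```c
#include <assert.h>
#include <string.h>

struct vnode;

/* Per-filesystem operations vector (hypothetical, two operations only). */
struct vnodeops {
    const char *(*vop_fsname)(const struct vnode *vp);
    long (*vop_getsize)(const struct vnode *vp);
};

/* Generic vnode: an ops table plus an opaque filesystem-specific pointer. */
struct vnode {
    const struct vnodeops *v_op;
    void *v_data;
};

/* Filesystem-specific structures, seen only by their own file system. */
struct ffs_inode { long size; };
struct nfs_rnode { long cached_size; };

static const char *ffs_fsname(const struct vnode *vp) { (void)vp; return "ffs"; }
static long ffs_getsize(const struct vnode *vp)
{
    return ((const struct ffs_inode *)vp->v_data)->size;
}

static const char *nfs_fsname(const struct vnode *vp) { (void)vp; return "nfs"; }
static long nfs_getsize(const struct vnode *vp)
{
    return ((const struct nfs_rnode *)vp->v_data)->cached_size;
}

const struct vnodeops ffs_ops = { ffs_fsname, ffs_getsize };
const struct vnodeops nfs_ops = { nfs_fsname, nfs_getsize };

/* Generic call site: the macro vectors to whichever implementation the
 * vnode's ops table names. */
#define VOP_GETSIZE(vp) ((vp)->v_op->vop_getsize(vp))
```

Code above the VFS layer calls VOP_GETSIZE(vp) without knowing whether vp belongs to a local or a remote file system, which is what lets different file systems coexist in one file tree.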

vnode

In the VFS framework, every file or directory in active use is represented by a vnode object in kernel memory.

[Figure: the syscall layer above NFS and UFS, with a list of free vnodes.]

Each vnode has a standard file attributes struct.

Vnode operations are macros that vector to filesystem-specific procedures.

Generic vnode points at filesystem-specific struct (e.g., inode, rnode), seen only by the filesystem.

Each specific file system maintains a cache of its resident vnodes.

vnode Operations and Attributes

directories only
vop_lookup (OUT vpp, name)
vop_create (OUT vpp, name, vattr)
vop_remove (vp, name)
vop_link (vp, name)
vop_rename (vp, name, tdvp, tvp, name)
vop_mkdir (OUT vpp, name, vattr)
vop_rmdir (vp, name)
vop_symlink (OUT vpp, name, vattr, contents)
vop_readdir (uio, cookie)
vop_readlink (uio)

files only
vop_getpages (page**, count, offset)
vop_putpages (page**, count, sync, offset)
vop_fsync ()

vnode attributes (vattr)
type (VREG, VDIR, VLNK, etc.)
mode (9+ bits of permissions)
nlink (hard link count)
owner user ID
owner group ID
filesystem ID
unique file ID
file size (bytes and blocks)
access time
modify time
generation number

generic operations
vop_getattr (vattr)
vop_setattr (vattr)
vhold()
vholdrele()

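Network File System (NFS)

[Figure: on the client, user programs call the syscall layer and then VFS, which dispatches to UFS or to the NFS client; the NFS client reaches the NFS server over the network; on the server, the NFS server sits with UFS under the server's VFS and syscall layer.]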

vnode Cache

HASH(fsid, fileid)

VFS free list head

Active vnodes are reference-counted by the structures that hold pointers to them.
- system open file table
- process current directory
- file system mount points
- etc.

Each specific file system maintains its own hash of vnodes (BSD).
- specific FS handles initialization
- free list is maintained by VFS

vget(vp): reclaim cached inactive vnode from VFS free list
vref(vp): increment reference count on an active vnode
vrele(vp): release reference count on a vnode
vgone(vp): vnode is no longer valid (file is removed)

struct vnode {
    struct mtx      v_interlock;      /* lock for "i" things */
    u_long          v_iflag;          /* i vnode flags (see below) */
    int             v_usecount;       /* i ref count of users */
    long            v_numoutput;      /* i writes in progress */
    struct thread  *v_vxthread;       /* i thread owning VXLOCK */
    int             v_holdcnt;        /* i page & buffer references */
    struct buflists v_cleanblkhd;     /* i SORTED clean blocklist */
    struct buf     *v_cleanblkroot;   /* i clean buf splay tree */
    int             v_cleanbufcnt;    /* i number of clean buffers */
    struct buflists v_dirtyblkhd;     /* i SORTED dirty blocklist */
    struct buf     *v_dirtyblkroot;   /* i dirty buf splay tree */
    int             v_dirtybufcnt;

Distributed FS

/

usr sys dev etc bin

/

local adm home lib bin

ftp.cs.ucdavis.edu fs0: /dev/hd0a

Server.yahoo.com fs0: /dev/hd0e

logical disks

/

usr sys dev etc bin

/

local adm home lib bin

fs0: /dev/hd0a

fs1: /dev/hd0e

mount -t ufs /dev/hd0e /usr

mount -t nfs 152.1.23.12:/export/cdrom /mnt/cdrom

Correctness

One-copy Unix Semantics
– every modification to every byte of a file has to be immediately and permanently visible to every client.
– Conceptually FS sequential access
– Makes sense in a local file system
– Single processor versus shared memory

Is this necessary?

DFS Architecture

Server
– storage for the distributed/shared files.
– provides an access interface for the clients.

Client
– consumer of the files.
– runs applications in a distributed environment.

applications: open, close, read, write, opendir, stat, readdir

NFS (SUN, 1985)

Based on RPC (Remote Procedure Call) and XDR (External Data Representation)

Server maintains no state
– a READ on the server opens, seeks, reads, and closes
– a WRITE is similar, but the buffer is flushed to disk before closing

Server crash: client continues to try until server reboots – no loss

Client crashes: client must rebuild its own state – no effect on server

RPC - XDR

RPC: Standard protocol for calling procedures in another machine

Procedure is packaged with authorization and admin info

XDR: standard format for data, because manufacturers of computers cannot agree on byte ordering.
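The byte-ordering problem is easy to see with a 32-bit integer. XDR transmits integers most significant byte first (big-endian); the sketch below encodes and decodes by hand so that it behaves the same on any host. The function names are made up for illustration and are not the XDR library API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode a 32-bit unsigned integer most significant byte first, the
 * order XDR uses on the wire, regardless of the host's own byte order. */
void encode_u32(uint32_t value, unsigned char out[4])
{
    assert(out != NULL);
    out[0] = (unsigned char)(value >> 24);
    out[1] = (unsigned char)(value >> 16);
    out[2] = (unsigned char)(value >> 8);
    out[3] = (unsigned char)value;
}

/* Decode four big-endian bytes back into a host integer. */
uint32_t decode_u32(const unsigned char in[4])
{
    assert(in != NULL);
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
}
```

Because both sides shift and mask instead of copying raw memory, a little-endian client and a big-endian server agree on the value.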

rpcgen

[Figure: an RPC program (with its data structures) is fed to rpcgen, which generates RPC client.c, RPC server.c, and RPC.h.]

NFS Operations

Every operation is independent: server opens file for every operation

File identified by handle -- no state information retained by server

client maintains mount table, v-node, offset in file table etc.

What do these imply???

[Figure: NFS architecture. On the client computer, application programs make system calls into the UNIX kernel, where the virtual file system routes operations on local files to the UNIX file system (or another file system) and operations on remote files to the NFS client. The NFS client talks to the NFS server on the server computer using the NFS protocol (remote operations); the server's virtual file system passes the requests to its UNIX file system.]

mount -t nfs home.yahoo.com:/pub/linux /mnt/linux

State-ful vs. State-less

A server is fully aware of its clients
– does the client have the newest copy?
– what is the offset of an opened file?
– “a session” between a client and a server!

A server is completely unaware of its clients
– memory-less: I do not remember you!!
– Just tell me what you want to get (and where).
– I am not responsible for your offset values (the client needs to maintain the state).
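A stateless read can be sketched as a function that receives everything it needs in the request and keeps nothing afterwards. The request struct and function name below are hypothetical, and a path stands in for the opaque file handle that NFS actually uses.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical request: the client supplies the file, the offset and the
 * count every time, because the server remembers nothing between calls. */
struct read_request {
    const char *path;    /* stand-in for an NFS file handle */
    off_t       offset;
    size_t      count;
};

/* Serve one read: open, read at the given offset, close. Returns the
 * number of bytes read, or -1 on error. No per-client state survives. */
ssize_t stateless_read(const struct read_request *req, void *buf)
{
    assert(req != NULL && buf != NULL);

    int fd = open(req->path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = pread(fd, buf, req->count, req->offset);
    close(fd);
    return n;
}
```

Since each request is self-contained, a server that reboots between two calls can answer the second one exactly as if nothing had happened; the client simply retries with the same offset.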

The State

applications
open read stat lseek

applications
open read stat lseek

offset

Network File Sharing

Server side:
– rpcbind (portmap)
– mountd - responds to mount requests (sometimes called rpc.mountd). Relies on several files:
   – /etc/dfs/dfstab
   – /etc/exports
   – /etc/netgroup
– nfsd - serves files - actually a call to kernel level code.
– lockd – file locking daemon.
– statd – manages locks for lockd.
– rquotad – manages quotas for exported file systems.

11/13/2007 ecs150, Fall 2007 142


Network File Sharing

Client side:

– biod - client side caching daemon

– mount must understand the hostname:directory convention.

– Filesystem entries in /etc/[v]fstab tell the client what filesystems to mount.



Unix file semantics

NFS:
– open a file with read-write mode
– later, the server’s copy becomes read-only mode
– now, the application tries to write it!!



Problems with NFS

Performance is not scalable:
– maybe it is OK for a local office.
– will be horrible with large-scale systems.



Server caching is similar to UNIX file caching for local files:
– pages (blocks) from disk are held in a main memory buffer cache until the space is required for newer pages. Read-ahead and delayed-write optimisations.
– For local files, writes are deferred to next sync event (30 second intervals)
– Works well in local context, where files are always accessed through the local cache, but in the remote case it doesn't offer necessary synchronization guarantees to clients.

NFS v3 servers offer two strategies for updating the disk:
– write-through - altered pages are written to disk as soon as they are received at the server. When a write() RPC returns, the NFS client knows that the page is on the disk.
– delayed commit - pages are held only in the cache until a commit() call is received for the relevant file. This is the default mode used by NFS v3 clients. A commit() is issued by the client whenever a file is closed.
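The difference between the two strategies can be sketched with a toy model. `ToyNfsServer`, its dictionaries, and `crash` are hypothetical names for illustration, not NFS v3 internals: `cache` stands for volatile memory and `disk` for stable storage.

```python
class ToyNfsServer:
    """Toy model of write-through versus delayed commit on the server."""

    def __init__(self, write_through):
        self.write_through = write_through
        self.cache = {}   # (file, page_no) -> bytes, lost on a crash
        self.disk = {}    # (file, page_no) -> bytes, survives a crash

    def write(self, file, page_no, data):
        self.cache[(file, page_no)] = data
        if self.write_through:
            # The page is on disk before the write call returns.
            self.disk[(file, page_no)] = data

    def commit(self, file):
        # Flush every cached page of this file to stable storage.
        for key, data in self.cache.items():
            if key[0] == file:
                self.disk[key] = data

    def crash(self):
        self.cache.clear()
```

In this model a crash between `write` and `commit` loses the page only in delayed-commit mode, which is why the client has to hold on to its copy until the commit returns.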



Server caching does nothing to reduce RPC traffic between client and server
– further optimisation is essential to reduce server load in large networks
– NFS client module caches the results of read, write, getattr, lookup and readdir operations
– synchronization of file contents (one-copy semantics) is not guaranteed when two or more clients are sharing the same file.

Timestamp-based validity check
– reduces inconsistency, but doesn't eliminate it
– validity condition for cache entries at the client: $(T - T_c < t) \lor (T_{m,client} = T_{m,server})$
– $t$ is configurable (per file) but is typically set to 3 seconds for files and 30 secs. for directories
– it remains difficult to write distributed applications that share files with NFS

Symbols:
– $t$: freshness guarantee
– $T_c$: time when cache entry was last validated
– $T_m$: time when block was last updated at server
– $T$: current time
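The validity condition translates almost directly into code. In this minimal sketch the function name and the `get_Tm_server` callable (a stand-in for asking the server for the file's modification time) are assumptions made for illustration. The second return value records whether the server had to be contacted.

```python
def cache_entry_valid(T, Tc, t, Tm_client, get_Tm_server):
    """Evaluate (T - Tc < t) or (Tm_client == Tm_server).

    Returns (valid, contacted_server). The server is asked for its
    modification time only when the freshness interval has expired.
    """
    if T - Tc < t:
        # Entry was validated recently enough: trust it without an RPC.
        return True, False
    # Freshness interval expired: compare modification times.
    return Tm_client == get_Tm_server(), True
```

The first branch is what saves RPC traffic, and also what leaves a window of up to t seconds in which a client can read stale data.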


AFS

State-ful clients and servers.
Caching the files to clients.
– File close ==> check-in the changes.
How to maintain consistency?
– Using “Callback” in v2/3 (Valid or Cancelled)

[Figure: applications open and read the cached file; when a callback is cancelled, the client invalidates and re-caches.]
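The callback idea can be sketched as follows. This is a toy model, not AFS code: `AfsServer`, `AfsClient`, and their methods are hypothetical names. The server remembers which clients cache a file, and when one client checks in changes on close it cancels the callbacks of the others, who then re-fetch on their next open.

```python
class AfsServer:
    def __init__(self):
        self.files = {}       # name -> contents
        self.callbacks = {}   # name -> set of clients with a valid callback

    def fetch(self, client, name):
        # Hand out the file together with a callback promise.
        self.callbacks.setdefault(name, set()).add(client)
        return self.files[name]

    def store(self, writer, name, data):
        self.files[name] = data
        # Cancel the callback of every other client caching this file.
        for client in self.callbacks.get(name, set()) - {writer}:
            client.break_callback(name)
        self.callbacks[name] = {writer}


class AfsClient:
    def __init__(self, server):
        self.server = server
        self.cache = {}      # name -> cached contents
        self.valid = set()   # names whose callback is still valid

    def open_read(self, name):
        if name not in self.valid:
            # Callback missing or cancelled: invalidate and re-cache.
            self.cache[name] = self.server.fetch(self, name)
            self.valid.add(name)
        return self.cache[name]

    def write_close(self, name, data):
        # File close ==> check in the changes.
        self.cache[name] = data
        self.valid.add(name)
        self.server.store(self, name, data)

    def break_callback(self, name):
        self.valid.discard(name)
```

As long as a callback stays valid, `open_read` never contacts the server, which is where the scalability gain over per-access validation comes from in this sketch.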



Why AFS?

Shared files are infrequently updated
Local cache of a few hundred megabytes
– Now 50~100 gigabytes
Unix workload:
– Files are small, read operations dominate, sequential access is common, files are read/written by one user, references come in bursts.
– Are these still true?



Fault Tolerance in AFS

a server crashes

a client crashes
– check for call-back tokens first.



Problems with AFS

Availability
What happens if the call-back itself is lost??



GFS – Google File System

“Failures” are the norm
Multiple-GB files are common
Append rather than overwrite
– Random writes are rare
Can we relax the consistency?


The Master

Maintains all file system metadata: namespace, access control info, file-to-chunk mappings, chunk (including replica) locations, etc.

Periodically communicates with chunkservers in HeartBeat messages to give instructions and check state



The Master

Helps make sophisticated chunk placement and replication decisions, using global knowledge

For reading and writing, client contacts Master to get chunk locations, then deals directly with chunkservers

Master is not a bottleneck for reads/writes


Chunkservers

Files are broken into chunks. Each chunk has an immutable, globally unique 64-bit chunk handle.

The handle is assigned by the master at chunk creation

Chunk size is 64 MB

Each chunk is replicated on 3 (default) servers
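With a fixed 64 MB chunk size, a client can turn a byte offset into a chunk index with plain integer arithmetic before asking the master for that chunk's handle and locations. The helper names below are hypothetical; only the chunk size and the default of 3 replicas come from the slide.

```python
CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB, as stated above
REPLICAS = 3                    # default replication factor, as stated above


def locate(offset):
    """Map a byte offset in a file to (chunk index, offset inside that chunk)."""
    return offset // CHUNK_SIZE, offset % CHUNK_SIZE


def chunks_for_range(offset, length):
    """Chunk indices touched by reading `length` bytes starting at `offset`."""
    if length <= 0:
        return []
    first = offset // CHUNK_SIZE
    last = (offset + length - 1) // CHUNK_SIZE
    return list(range(first, last + 1))


def chunk_count(file_size):
    """Number of chunks needed for a file of `file_size` bytes (ceiling division)."""
    return -(-file_size // CHUNK_SIZE)
```

For example, a 3 GB file needs 48 chunks, and with 3 replicas each that is 144 chunk replicas spread across chunkservers.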



Clients

Linked to apps using the file system API.

Communicates with master and chunkservers for reading and writing

Master interactions only for metadata

Chunkserver interactions for data

Only caches metadata information

Data is too large to cache.



Chunk Locations

Master does not keep a persistent record of locations of chunks and replicas.

Polls chunkservers at startup, and when new chunkservers join/leave for this.

Stays up to date by controlling placement of new chunks and through HeartBeat messages (when monitoring chunkservers)
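A minimal sketch of this bookkeeping, assuming each chunkserver reports the full list of chunk handles it holds (the `Master` class and its method names are hypothetical, not the GFS implementation): the location map lives only in memory and is rebuilt from reports, so a restarted master starts empty and fills in as chunkservers answer its poll.

```python
class Master:
    """Toy master that keeps chunk locations in memory only."""

    def __init__(self):
        # chunk handle -> set of chunkserver names; never written to disk
        self.locations = {}

    def report(self, server, handles):
        """Apply a full report from one chunkserver (startup poll, join, or HeartBeat)."""
        self.remove_server(server)
        for handle in handles:
            self.locations.setdefault(handle, set()).add(server)

    def remove_server(self, server):
        """Forget a chunkserver that left or stopped answering."""
        for handle in list(self.locations):
            self.locations[handle].discard(server)
            if not self.locations[handle]:
                del self.locations[handle]
```

Treating the chunkservers as the source of truth avoids having to keep a persistent copy in sync with what is actually on their disks.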



CODA

Server Replication:
– if one server goes down, I can get another.

Disconnected Operation:
– if all go down, I will use my own cache.



Disconnected Operation

Continue critical work when the repository is inaccessible.

Key idea: caching data.
– Performance
– Availability

Server Replication



Consistency

If John updates file X on server A and Mary reads file X on server B…

Read-one & Write-all


Read x & Write (N-x+1)

[Figure: a read set and a write set over the replicas.]


Example: R3W4 (6+1)

(columns: replicas 1 to 6)

Operation   1  2  3  4  5  6
Initial     0  0  0  0  0  0
Alice-W     2  2  0  2  2  0
Bob-W       2  3  3  3  3  0
Alice-R     2  3  3  3  3  0
Chris-W     2  1  1  1  1  0
Dan-R       2  1  1  1  1  0
Emily-W     7  7  1  1  1  7
Frank-R     7  7  1  1  1  7
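The reason for reading x and writing N-x+1 replicas is that the two sets then add up to N+1, so any read set must share at least one replica with the latest write set. The sketch below is a toy model under stated assumptions: `QuorumStore` is a hypothetical name, and the per-replica version number and the single global `clock` are additions for illustration (the table above shows only values), needed so a reader can tell which of the values it sees is newest.

```python
class QuorumStore:
    """Toy replicated register with read quorum r and write quorum w."""

    def __init__(self, n, r, w):
        if r + w <= n:
            raise ValueError("need r + w > n so read and write sets intersect")
        self.n, self.r, self.w = n, r, w
        self.replicas = [(0, 0)] * n   # (version, value) per replica
        self.clock = 0                 # assumed global version counter

    def write(self, value, targets):
        if len(set(targets)) < self.w:
            raise ValueError("write quorum too small")
        self.clock += 1
        for i in targets:
            self.replicas[i] = (self.clock, value)

    def read(self, targets):
        if len(set(targets)) < self.r:
            raise ValueError("read quorum too small")
        # The highest version among the replicas read is the latest write.
        return max(self.replicas[i] for i in targets)[1]

    def values(self):
        return [value for _, value in self.replicas]
```

With N = 6, R = 3, W = 4, replaying the first two writes of the table (indices are zero-based) and then reading any three replicas returns Bob's value 3, even when the read set includes replicas that still hold older values.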
