More on File Systems
CS-502, Operating Systems, Fall 2007
(Slides include materials from Operating System Concepts, 7th ed., by Silberschatz, Galvin, & Gagne and from Modern Operating Systems, 2nd ed., by Tanenbaum)



Reading Assignments

• Silberschatz, §12.7 & 12.8
  – §12.7 – RAID systems
  – §12.8 – Stable storage

• Silberschatz, §11.8
  – Log-structured file systems (aka journaling file systems)

• Silberschatz, §21.7
  – Linux file systems, including journaling


Mapping files to Virtual Memory

• Instead of “reading” from disk into virtual memory, why not simply use the file as the swapping storage for certain VM pages?

• Called memory mapping

• Page tables in the kernel point to disk blocks of the file


Memory-Mapped Files

• Memory-mapped file I/O allows file I/O to be treated as routine memory access by mapping a disk block to a page in memory

• A file is initially “read” using demand paging: a page-sized portion of the file is read from the file system into a physical page. Subsequent reads and writes of the file are treated as ordinary memory accesses.

• Simplifies file access by allowing the application to simply access memory rather than being forced to use read() & write() calls to the file system
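As a sketch of the idea, this Python fragment (standing in for the C mmap() system call; the scratch file is a throwaway demo) maps a file and then updates it through ordinary memory stores rather than write() calls:

```python
import mmap
import os
import tempfile

# Create a small scratch file to map (throwaway demo file).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"hello, file system")

# Map the file; afterwards, reads and writes are plain memory accesses.
with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), 0) as mapped:
        first = bytes(mapped[:5])   # a "read" is just a memory fetch
        mapped[0:5] = b"HELLO"      # a "write" is just a memory store

# The store went through the mapping to the underlying file.
with open(path, "rb") as f:
    contents = f.read()
os.remove(path)

print(first, contents)  # b'hello' b'HELLO, file system'
```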


Memory-Mapped Files (continued)

• A tantalizingly attractive notion, but …

• Cannot meaningfully store C/C++ pointers within the mapped data structure

• Corrupted data structures are likely to persist in the file
  – Recovery after a crash is more difficult

• Don’t really save anything in terms of
  – Programming energy
  – Thought processes
  – Storage space & efficiency


Memory-Mapped Files (continued)

Nevertheless, the idea has its uses:

1. Simpler implementation of file operations
   – read() and write() become memory-to-memory operations
   – seek() is simply changing a pointer, etc.
   – Called memory-mapped I/O

2. Shared virtual memory among processes


Shared Virtual Memory


Shared Virtual Memory (continued)

• Supported in
  – Apollo DOMAIN
  – Windows XP
  – Linux (shmget, etc.)

• Synchronization is the responsibility of the sharing applications
  – The OS retains no knowledge of the sharing
  – Few (if any) synchronization primitives between processes in separate address spaces
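Python's multiprocessing.shared_memory can stand in for shmget/shmat to illustrate the point. This is a sketch within one process (a real use would attach from a second process), and note that any locking would be the applications' own job:

```python
from multiprocessing import shared_memory

# Create a named shared segment (roughly what shmget does on Linux).
shm = shared_memory.SharedMemory(create=True, size=16)
shm.buf[:5] = b"hello"

# Attach a second handle by name -- in real use, a different process
# would do this. Both handles view the same physical pages.
other = shared_memory.SharedMemory(name=shm.name)
seen = bytes(other.buf[:5])

# The OS provides no synchronization here; concurrent writers would
# need their own locks or semaphores.
other.close()
shm.close()
shm.unlink()

print(seen)  # b'hello'
```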


Questions?


Problem

• Question:
  – If the mean time to failure of a disk drive is 100,000 hours,
  – and if your system has 100 identical disks,
  – what is the mean time between drive replacements?

• Answer:
  – 1,000 hours (i.e., 41.67 days ≈ 6 weeks)

• I.e., you lose 1% of your data every 6 weeks!

• But don’t worry – you can restore most of it from backup!
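The arithmetic behind the answer is just the single-disk MTTF divided by the number of disks:

```python
# Back-of-the-envelope check of the slide's numbers.
mttf_hours = 100_000   # mean time to failure of one disk
n_disks = 100

hours_between_replacements = mttf_hours / n_disks
days = hours_between_replacements / 24

print(hours_between_replacements)  # 1000.0
print(round(days, 2))              # 41.67  (about 6 weeks)
```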


Can we do better?

• Yes, mirrored
  – Write every block twice, on two separate disks
  – Mean time between simultaneous failure of both disks is >57,000 years

• Can we do even better?
  – E.g., use fewer extra disks?
  – E.g., get more performance?


RAID – Redundant Array of Inexpensive Disks

• Distribute a file system intelligently across multiple disks to
  – Maintain high reliability and availability
  – Enable fast recovery from failure
  – Increase performance


“Levels” of RAID

• Level 0 – non-redundant striping of blocks across disks

• Level 1 – simple mirroring

• Level 2 – striping of bytes or bits with ECC

• Level 3 – Level 2 with parity, not ECC

• Level 4 – Level 0 with parity block

• Level 5 – Level 4 with distributed parity blocks


RAID Level 0 – Simple Striping

• Each stripe is one block or a group of contiguous blocks

• Block/group i is on disk (i mod n)

• Advantage
  – Read/write n blocks in parallel; n times the bandwidth

• Disadvantage
  – No redundancy at all. System MTBF is 1/n of the single-disk MTBF!

Disk layout (n = 4 disks):
  Disk 0: stripe 0, stripe 4, stripe 8
  Disk 1: stripe 1, stripe 5, stripe 9
  Disk 2: stripe 2, stripe 6, stripe 10
  Disk 3: stripe 3, stripe 7, stripe 11
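The placement rule is tiny to express in code; this hypothetical sketch just computes which disk holds each block group and reproduces the layout above:

```python
# RAID 0 placement: block/group i lives on disk (i mod n).
def raid0_disk(i, n):
    return i % n

n = 4
layout = {}
for i in range(12):
    layout.setdefault(raid0_disk(i, n), []).append(i)

print(layout)  # {0: [0, 4, 8], 1: [1, 5, 9], 2: [2, 6, 10], 3: [3, 7, 11]}
```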


RAID Level 1 – Striping and Mirroring

• Each stripe is written twice
  – To two separate, identical disks

• Block/group i is on disks (i mod 2n) & ((i+n) mod 2n)

• Advantages
  – Read/write n blocks in parallel; n times the bandwidth
  – Redundancy: system MTBF = (disk MTBF)² at twice the cost
  – A failed disk can be replaced by copying

• Disadvantage
  – A lot of extra disks for much more reliability than we need

Disk layout (n = 4, each stripe mirrored on two disks):
  Disks 0 & 4: stripe 0, stripe 4, stripe 8
  Disks 1 & 5: stripe 1, stripe 5, stripe 9
  Disks 2 & 6: stripe 2, stripe 6, stripe 10
  Disks 3 & 7: stripe 3, stripe 7, stripe 11


RAID Levels 2 & 3

• Bit- or byte-level striping

• Requires synchronized disks
  – Highly impractical

• Requires fancy electronics for ECC calculations

• Not used; academic interest only

• See Silberschatz, §12.7.3 (pp. 471-472)


Observation

• When a disk or stripe is read incorrectly, we know which one failed!

• Conclusion:
  – A simple parity disk can provide very high reliability
  – (unlike simple parity in memory, where the position of the failed bit is unknown)


RAID Level 4 – Parity Disk

• parity 0-3 = stripe 0 xor stripe 1 xor stripe 2 xor stripe 3
• n stripes plus parity are written/read in parallel
• If any disk/stripe fails, it can be reconstructed from the others
  – E.g., stripe 1 = stripe 0 xor stripe 2 xor stripe 3 xor parity 0-3

• Advantages
  – n times the read bandwidth
  – System MTBF = (disk MTBF)² at 1/n additional cost
  – A failed disk can be reconstructed “on-the-fly” (hot swap)
  – Hot expansion: simply add n + 1 disks, all initialized to zeros

• However
  – Writing requires a read-modify-write of the parity stripe, so only 1× write bandwidth

Disk layout (n = 4 data disks + 1 parity disk):
  Disk 0: stripe 0, stripe 4, stripe 8
  Disk 1: stripe 1, stripe 5, stripe 9
  Disk 2: stripe 2, stripe 6, stripe 10
  Disk 3: stripe 3, stripe 7, stripe 11
  Disk 4: parity 0-3, parity 4-7, parity 8-11
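The XOR parity and reconstruction rules can be checked directly; a sketch with toy two-byte stripes:

```python
# RAID 4 parity: byte-wise XOR of the data stripes. If any one stripe
# is lost, XOR-ing the survivors with the parity rebuilds it.
def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for j, byte in enumerate(block):
            out[j] ^= byte
    return bytes(out)

stripes = [b"\x01\x02", b"\x0f\x00", b"\xf0\xff", b"\x55\xaa"]
parity = xor_blocks(stripes)

# Pretend the disk holding stripe 1 failed; rebuild it from the rest.
rebuilt = xor_blocks([stripes[0], stripes[2], stripes[3], parity])

print(rebuilt == stripes[1])  # True
```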


RAID Level 5 – Distributed Parity

• Parity calculation is the same as RAID Level 4

• Advantages & disadvantages – mostly the same as RAID Level 4

• Additional advantages
  – Avoids beating up on a single parity disk
  – Some writes can proceed in parallel (no contention for one parity drive)

• Writing individual stripes (RAID 4 & 5)
  – Read the existing stripe and the existing parity
  – Recompute the parity
  – Write the new stripe and the new parity

Disk layout (parity rotates across the disks):
  Disk 0: stripe 0, stripe 4, stripe 8, stripe 12
  Disk 1: stripe 1, stripe 5, stripe 9, parity 12-15
  Disk 2: stripe 2, stripe 6, parity 8-11, stripe 13
  Disk 3: stripe 3, parity 4-7, stripe 10, stripe 14
  Disk 4: parity 0-3, stripe 7, stripe 11, stripe 15
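The read-modify-write step has a well-known shortcut: the new parity is the old parity XOR the old data XOR the new data, so a small write touches only two disks rather than re-reading every stripe. A sketch:

```python
# Small-write parity update (RAID 4/5):
#   new parity = old parity XOR old data XOR new data
def update_parity(old_parity, old_data, new_data):
    return bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_data, new_data))

stripes = [b"\x01", b"\x0f", b"\xf0", b"\x55"]
parity = bytes(a ^ b ^ c ^ d for a, b, c, d in zip(*stripes))

new_stripe = b"\x99"
parity = update_parity(parity, stripes[2], new_stripe)  # two reads, two writes
stripes[2] = new_stripe

# The shortcut agrees with recomputing parity from all four stripes.
full = bytes(a ^ b ^ c ^ d for a, b, c, d in zip(*stripes))
print(parity == full)  # True
```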


RAID 4 & 5

• Very popular in data centers
  – Corporate and academic servers

• Built-in support in Windows XP and Linux
  – Connect a group of disks to a fast SCSI port (320 MB/sec bandwidth)
  – OS RAID support does the rest!

• Other RAID variations are also available


New Topic


Incomplete Operations

• Problem – how to protect against disk write operations that don’t finish
  – Power or CPU failure in the middle of a block
  – A related series of writes interrupted before all are completed

• Examples:
  – Database update of charge and credit
  – RAID 1, 4, 5 failure between redundant writes


Solution (part 1) – Stable Storage

• Write everything twice, to separate disks
  – Be sure the 1st write does not invalidate the previous 2nd copy
  – RAID 1 is okay; RAID 4/5 are not!

• Read the blocks back to validate; then report completion

• Reading both copies:
  – If the 1st copy is okay, use it – i.e., the newest value
  – If the 2nd copy is different or bad, update it with the 1st copy
  – If the 1st copy is bad, update it with the 2nd copy – i.e., the old value


Stable Storage (continued)

• Crash recovery
  – Scan the disks, comparing corresponding blocks
  – If one is bad, replace it with the good one
  – If both are good but different, replace the 2nd with the 1st copy

• Result:
  – If the 1st block is good, it contains the latest value
  – If not, the 2nd block still contains the previous value

• An abstraction of an atomic disk write of a single block
  – Uninterruptible by power failure, etc.
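The recovery comparison can be sketched as a small function; the ok flags here stand in for whatever per-copy checksum validation a real implementation would do:

```python
# Stable-storage recovery for one logical block with two physical copies.
# ok1/ok2 stand in for per-copy checksum validation.
def recover(copy1, ok1, copy2, ok2):
    if ok1 and not ok2:
        return copy1, copy1      # repair the 2nd copy from the 1st
    if ok2 and not ok1:
        return copy2, copy2      # repair the 1st copy from the 2nd
    if copy1 != copy2:
        return copy1, copy1      # both good but different: 1st (newest) wins
    return copy1, copy2          # already consistent

# Crash after writing copy 1 but before copy 2: the latest value survives.
print(recover(b"new", True, b"old", True))   # (b'new', b'new')

# Crash mid-write corrupted copy 1: fall back to the previous value.
print(recover(b"gar", False, b"old", True))  # (b'old', b'old')
```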


What about more complex disk operations?

• E.g., a file create operation involves
  – Allocating free blocks
  – Constructing and writing the i-node (possibly multiple i-node blocks)
  – Reading and updating the directory
  – Updating the free list and storing it back onto disk

• What if the system crashes with the sequence only partly completed?

• Answer: inconsistent data structures on disk


Solution (Part 2) – Journaling File System

• Make changes to cached copies in memory

• Collect together all changed blocks
  – Including i-nodes and directory blocks

• Write them to the log file (aka journal file)
  – A circular buffer on disk
  – Fast, contiguous write

• Update the log file pointer in stable storage

• Later: play back the log file to actually update directories, i-nodes, free list, etc.
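In miniature, the write-ahead discipline looks like this (a sketch, not a real journal such as ext3's): append the changed blocks to the log, advance the commit pointer, and replay later — and replay is harmless to repeat:

```python
# Write-ahead journaling in miniature.
disk = {}        # block number -> contents (the "in-place" structures)
log = []         # append-only journal of (block number, contents) records
committed = 0    # log pointer, kept in stable storage in a real system

def journal_write(updates):
    global committed
    log.extend(updates)     # 1) changed blocks go to the log first
    committed = len(log)    # 2) only then is the commit pointer advanced

def replay():
    for block, data in log[:committed]:
        disk[block] = data  # 3) apply committed records in place

journal_write([(7, "i-node"), (3, "directory block")])
replay()
replay()  # replaying again after a crash during replay changes nothing

print(disk)  # {7: 'i-node', 3: 'directory block'}
```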


Journaling File System – Crash Recovery

• If a crash occurs before the log pointer is updated
  – The file system reverts to its previous state
  – The contents of the log are discarded

• If a crash occurs after the log pointer is updated but before the log is replayed
  – Replay the log at system restart
  – The file system reflects the updated contents

• …


Journaling File System – Crash Recovery

• …

• If a crash occurs during replay of the log
  – Replay the log again at system restart
  – Replaying the log multiple times does not hurt (replay is idempotent)

• If replay succeeds, update the log pointer in stable storage


Journaling File System (continued)

• What if a process wants to use blocks that are currently in the log and not yet replayed?
  – The log is a cache of disk blocks
  – Must check there first for valid contents

• Further updates are added to the log after the current log pointer
  – Just as if they had been in their original places
  – The log pointer can be updated in stable storage after each set of updates


Transactional Database Systems

• Similar techniques
  – Every transaction is recorded in the log before being recorded on disk
  – Stable storage techniques for managing log pointers
  – Once the log write is confirmed, the disk can be updated in place
  – After a crash, replay the log to redo disk operations


Journaling File Systems

• Linux ext3 file system

• Windows NTFS


Berkeley LFS — a slight variation

• Everything is written to the log
  – i-nodes point to updated blocks in the log
  – The i-node cache in memory is updated whenever an i-node is written
  – A cleaner daemon follows behind to compact the log

• Advantages:
  – LFS is always consistent
  – LFS performance
    • Much better than the Unix file system for small writes
    • At least as good for reads and large writes

• Tanenbaum, §6.3.8, pp. 428-430
• Rosenblum & Ousterhout, “The Design and Implementation of a Log-Structured File System”

• Note: not the same as the Linux LFS (large file system)


Example

Before: the i-node points to blocks a, b, and c; blocks a, b, and c are then modified.

After: new copies of blocks a, b, and c are appended to the log, followed by a new i-node that points to the new blocks; the old i-node and old blocks remain behind as garbage for the cleaner.


Reading Assignments

• Silberschatz, §12.7 & 12.8
  – §12.7 – RAID systems
  – §12.8 – Stable storage

• Silberschatz, §11.8
  – Log-structured file systems (aka journaling file systems)

• Silberschatz, §21.7
  – Linux file systems, including journaling


Questions?