
Computer Science 213 © 2006 Donald Acton

The Role of Unix I/O

• File system works at the block level
• Applications work at the byte level
• Unix I/O converts the byte level access to block level operations

Figure (File System Layering), top to bottom: Application, Unix I/O, File System, Disk Drive

Unix I/O API

• Some of the most common Unix I/O API functions used by applications are:
  – open()
  – close()
  – read()
  – write()
  – lseek()

Opening Files

• Opening a file informs the kernel that an application wants to access a file
• Allows the kernel to set aside resources

int source_fd;

if ((source_fd = open(argv[1], O_RDONLY)) < 0) {
    perror("Open source failed:");
    exit(2);
}

Opening cont’d

• open() returns a small integer called a file descriptor
• The application passes this value back to the kernel in subsequent requests to work with the file
• Each process created starts with three open files:
  – 0: standard input (stdin)
  – 1: standard output (stdout)
  – 2: standard error (stderr)

Closing Files

• Closing a file tells the kernel it may free resources associated with managing the file

int rc;

if ((rc = close(source_fd)) < 0) {
    perror("close");
    exit(10);
}

Reading Files

• Each open file has a notion of a current position in the stream of bytes
• read() copies bytes from the current file position to memory and updates the file position
• read() returns the number of bytes read
  – If bytes read < 0: error
  – A return value of 0 means end of file
  – read() may return fewer bytes than requested (short reads)
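The short-read behaviour above can be handled with a small retry loop. The following is a minimal sketch, not part of the original slides; the helper name `read_fully` is hypothetical, and retrying on `EINTR` is an added assumption about how interrupted calls should be treated.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: keep calling read() until `count` bytes have
 * arrived, end of file is reached, or a real error occurs.
 * Returns the number of bytes read (less than `count` only at end of
 * file), or -1 on error with errno set by read(). */
ssize_t read_fully(int fd, void *buf, size_t count)
{
    char *p = buf;
    size_t total = 0;

    while (total < count) {
        ssize_t n = read(fd, p + total, count - total);
        if (n < 0) {
            if (errno == EINTR)
                continue;        /* interrupted by a signal: retry */
            return -1;           /* genuine error */
        }
        if (n == 0)
            break;               /* end of file */
        total += (size_t)n;      /* short read: ask for the rest */
    }
    return (ssize_t)total;
}
```

A caller would use it where it needs a whole record, for example `read_fully(source_fd, buf, sizeof(buf))`, and treat a result smaller than the request as end of file.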

Read Example

char buf[512];
int chars_read;

chars_read = read(source_fd, buf, sizeof(buf));
while (chars_read > 0) {
    // Do something
    chars_read = read(source_fd, buf, sizeof(buf));
}
if (chars_read < 0) {
    perror("Reading error:");
    exit(5);
}

Writing Files

• Writing copies bytes from memory to the current file position and updates the position
• write() returns the number of bytes written
  – If bytes written < 0: error
  – It is possible that fewer bytes were written than requested (short writes); this is not an error, but certainly a challenge to deal with
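One common way to deal with short writes is to loop until everything has been handed to the kernel. This is a sketch under assumptions rather than the course's own code: the name `write_all` is hypothetical, and retrying on `EINTR` is an added choice.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: call write() repeatedly until all `count`
 * bytes are written. Returns `count` on success, -1 on error. */
ssize_t write_all(int fd, const void *buf, size_t count)
{
    const char *p = buf;
    size_t total = 0;

    while (total < count) {
        ssize_t n = write(fd, p + total, count - total);
        if (n < 0) {
            if (errno == EINTR)
                continue;        /* interrupted by a signal: retry */
            return -1;           /* genuine error */
        }
        total += (size_t)n;      /* short write: send the remainder */
    }
    return (ssize_t)total;
}
```

With such a helper, a short write no longer has to be reported as a failure the way the simple example on the next slide does.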

Writing Example

while (chars_read > 0) {
    // write() takes a file descriptor, so use STDOUT_FILENO, not the FILE * stdout
    if (write(STDOUT_FILENO, buf, chars_read) < chars_read) {
        perror("Write problems:");
        exit(4);
    }
    // Do another read and work
}

Seek

• Causes the logical position in the file to change (i.e. where the next read or write will commence from)
• Position can be changed
  – To an absolute offset in the file
  – Relative to the current location
  – Relative to the end of the file

Seek example

off_t new_offset;   // lseek() returns an off_t

new_offset = lseek(fd, 2346, SEEK_CUR);   // 2346 bytes past the current position
new_offset = lseek(fd, 10, SEEK_SET);     // byte 10 from the start of the file
new_offset = lseek(fd, 25, SEEK_END);     // 25 bytes past the end of the file
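As an illustration of the three seek modes working together, the sketch below finds a file's size by seeking to the end and then restores the original position. The function name `file_size` is hypothetical and not from the slides.

```c
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: return the size of the file behind `fd`
 * without disturbing its current position. Returns -1 on error. */
off_t file_size(int fd)
{
    off_t saved = lseek(fd, 0, SEEK_CUR);   /* remember where we are */
    if (saved < 0)
        return -1;

    off_t end = lseek(fd, 0, SEEK_END);     /* offset of end of file */
    if (end < 0)
        return -1;

    if (lseek(fd, saved, SEEK_SET) < 0)     /* put the position back */
        return -1;
    return end;
}
```

Note that this only works on seekable targets; lseek() fails on pipes and terminals.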

Unix I/O Example

• Simple program that copies contents of file named by argument 1 to file named by argument 2 (i.e. the cp command)

cs213copy fname1 [fname2]

Pseudo Code

open argument 1 for input
open argument 2 for output (if present)
if arg 2 present then connect stdout to this file
read from input
while read succeeds
    write to stdout
    read from input

Unix I/O Copy Command

// Includes

int main(int argc, char **argv) {
    // Check arguments
    int source_fd;
    if ((source_fd = open(argv[1], O_RDONLY)) < 0) {
        perror("Open source failed:");
        exit(2);
    }

    int dest_fd;
    if (argc > 2) {
        if ((dest_fd = open(argv[2], O_WRONLY | O_CREAT, 0600)) < 0) {
            perror("Destination open failed:");
            int rc;
            if ((rc = close(source_fd)) < 0) {
                perror("close");
                exit(10);
            }
            exit(3);
        }
        dup2(dest_fd, STDOUT_FILENO);
    }

    char buf[512];
    int chars_read;

    chars_read = read(source_fd, buf, sizeof(buf));
    while (chars_read > 0) {
        if (write(STDOUT_FILENO, buf, chars_read) < chars_read) {
            perror("Write problems:");
            exit(4);
        }
        chars_read = read(source_fd, buf, sizeof(buf));
    }
    if (chars_read < 0) {
        perror("Reading error:");
        exit(5);
    }
}

1) Unix I/O

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>

int main(int argc, char **argv) {

    if (argc <= 1) {
        printf("Usage: cs213cp source_file [destination_file]\n");
        exit(1);
    }

2) Unix I/O

    int source_fd;

    if ((source_fd = open(argv[1], O_RDONLY)) < 0) {
        perror("Open source failed:");
        exit(2);
    }

3) Unix I/O

    int dest_fd;
    if (argc > 2) {
        if ((dest_fd = open(argv[2], O_WRONLY | O_CREAT, 0600)) < 0) {
            perror("Destination open failed:");
            int rc;
            if ((rc = close(source_fd)) < 0) {
                perror("close");
                exit(10);
            }
            exit(3);
        }
        dup2(dest_fd, STDOUT_FILENO);
    }

4) Unix I/O

    char buf[512];
    int chars_read;

    chars_read = read(source_fd, buf, sizeof(buf));
    while (chars_read > 0) {
        if (write(STDOUT_FILENO, buf, chars_read) < chars_read) {
            perror("Write problems:");
            exit(4);
        }
        chars_read = read(source_fd, buf, sizeof(buf));
    }
    if (chars_read < 0) {
        perror("Reading error:");
        exit(5);
    }
}

Unix I/O

• By making everything appear to be a file, the kernel can provide a single simple interface for performing I/O to a variety of devices
• Recall the basic operations are:
  – Opening and closing files: open() and close()
  – Changing the current file position: lseek()
  – Reading and writing files: read() and write()

Adding Other Devices

• Most devices tend to be producers or consumers of streams of data and fit the UNIX I/O API model described
  – Mouse: producer
  – Joystick: producer
  – Keyboard: producer
  – Display: consumer
  – Audio device: consumer
  – Tape: both

New Devices

Figure: Application on top of UNIX I/O; below UNIX I/O sit the File System (on the Disk Drive) and the devices Keyboard, Terminal, Tape, and Audio

Getting data to/from the hardware

• There are 2 main issues to deal with
  – buffering of data going to and from the disk
  – I/O requests that are not block aligned or in block multiples

Figure (File System Layering), top to bottom: Application, Unix I/O, File System, Disk Drive

File Descriptors

• Calls to routines like open(), socket(), accept() and pipe() return file descriptors

• A file descriptor is just a small integer

• When this “integer” is passed back to the kernel via calls like read() or write() the kernel manipulates the opened “file” the descriptor corresponds to

The Kernel’s View of a File Descriptor

• Each process has associated with it a fixed size file descriptor table

• The file descriptor is just the index into this table!

• Each active entry in the table identifies an entry in a shared system wide open file table

• Entries are created in the open file table each time open() succeeds

Open File Table

• Entries in the open file table identify the I/O target in a v-node table
• Open file table keeps current position and reference count of its usage
• v-node
  – virtual inode, basically a cache of an inode
  – may contain pointers to buffers/caches for the file/device
  – identifies legal operations on a file/device

The Kernel View

Figure (adapted from Computer Systems: A Programmer’s Perspective): three linked tables.
  – Descriptor table (one table per process): fd 0 (stdin), fd 1 (stdout), fd 2 (stderr), fd 3, fd 4
  – Open file table (shared by all processes): an entry for File A holding File pos, refcnt=1, ...
  – v-node table: an entry holding File access, File size, File type, ...

The File A entry shown is one struct in the open file table.

v-node role

Figure: Application on top of UNIX I/O; below UNIX I/O sit the File System (on the Disk Drive) and the devices Keyboard, Terminal, Tape, and Audio

To the Device

• Unix I/O uses the open file table and v-node table to determine the “device” specific code for the standard operations (open, close, read, write…)
• These routines use buffers identified by the v-node table
• Buffers are caches of on disk blocks
• Changes to buffers result in writes being scheduled

write()

• lseek(fd, 931, SEEK_SET);
  – Change file position in open file table to 931
• write(fd, buff, 128);
  – If block #1 (bytes 512 – 1023) not cached, read it
  – If block #2 (bytes 1024 – 1535) not cached, read it
  – Change bytes 931 – 1023 and 1024 – 1058
  – Have blocks 1 and 2 scheduled for writing to disk

read()

• lseek(fd, 500, SEEK_SET);
  – Change file position in open file table to 500
• read(fd, buff, 1024);
  – If any of blocks 0 (0 – 511), 1 (512 – 1023) or 2 (1024 – 1535) are not cached, order them read
  – Transfer bytes 500 – 511, 512 – 1023, and 1024 – 1523 to buff when the blocks are available
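The block numbers in the write() and read() walkthroughs follow from integer division by the block size. The sketch below reproduces that arithmetic; the 512-byte block size comes from the slides, while the names `blocks_touched` and `struct block_range` are hypothetical.

```c
#include <stddef.h>

#define BLOCK_SIZE 512   /* block size used in the slides' examples */

struct block_range {
    size_t first;        /* first block the request touches */
    size_t last;         /* last block the request touches  */
};

/* Blocks touched by a request of `count` bytes (count > 0)
 * starting at byte `offset`. */
struct block_range blocks_touched(size_t offset, size_t count)
{
    struct block_range r;
    r.first = offset / BLOCK_SIZE;
    r.last  = (offset + count - 1) / BLOCK_SIZE;
    return r;
}
```

For the write of 128 bytes at offset 931 this gives blocks 1 through 2, and for the read of 1024 bytes at offset 500 it gives blocks 0 through 2, matching the slides.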

Sharing Files

• At this point we have
  – File descriptors
  – The open file table
  – V-nodes
• It is relatively easy to explain what happens when file sharing results from:
  – Opens in the same process
  – Opens in different processes
  – Forks

Actions on open()

Figure (adapted from Computer Systems: A Programmer’s Perspective): the effect of fd = open("B",…). The descriptor table (one table per process; fd 0 stdin, fd 1 stdout, fd 2 stderr, fd 3, fd 4) now refers to two open file table entries (shared by all processes), one for File A and one for File B, each with its own File pos and refcnt=1. Each entry refers to a v-node table entry (File access, File size, File type, ...).

Same File Different Process

Figure (adapted from Computer Systems: A Programmer’s Perspective): two processes each call fd = open("A",…). Each process has its own descriptor table (fd 0 stdin, fd 1 stdout, fd 2 stderr, fd 3, fd 4), and each open creates its own open file table entry for File A (File pos, refcnt=1). Both entries refer to the same v-node table entry (File access, File size, File type, ...).

Same File Same Process

Figure (adapted from Computer Systems: A Programmer’s Perspective): one process calls fd = open("A",…); on a file it already has open. Its single descriptor table (fd 0 stdin, fd 1 stdout, fd 2 stderr, fd 3, fd 4) refers to two open file table entries for File A (each with its own File pos and refcnt=1), and both refer to the same v-node table entry (File access, File size, File type, ...).

Close()

Figure: the effect of close(4). The descriptor table (one table per process; fd 0 stdin, fd 1 stdout, fd 2 stderr, fd 3, fd 4) has the closed slot marked Empty, and the refcnt of the corresponding open file table entry drops to 0. The open file table and the v-node table (File access, File size, File type, ...) are both shared by all processes; entries for File A and File B are shown.

I/O Redirection

COMOX(114): ls > /tmp/out

• The above causes standard output (file descriptor 1) to be set to /tmp/out

Figure (adapted from Computer Systems: A Programmer’s Perspective): the process file descriptor table (fd 0 stdin, fd 1 stdout, fd 2 stderr, fd 3, fd 4) with open file table entries for the terminal and for /tmp/out (File pos, refcnt) and their v-nodes (File access, File size, File type, ...). Refcnt annotations shown in the figure: 4, 1, 3.

dup2

• The Unix system call dup2, which has the form dup2(fd, newfd), copies fd to newfd in the descriptor table.

Figure (adapted from Computer Systems: A Programmer’s Perspective): descriptor tables (fd 0 to fd 4) before and after dup2(4,1). Before the call fd 1 refers to a and fd 4 refers to b; afterwards both fd 1 and fd 4 refer to b.

dup2 example

open("/tmp/foo",…);
dup2(4,1);
close(4);

Figure: the process file descriptor table (fd 0 to fd 4) with open file table entries for the terminal and for /tmp/out (File pos, refcnt) and their v-nodes (File access, File size, File type, ...). Refcnt annotations shown in the figure: 1, 1, 0, 2.
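The open / dup2 / close sequence on this slide is essentially what a shell does for `ls > /tmp/out`. Here is a minimal sketch of that sequence as a reusable function; the name `redirect_stdout` and the O_TRUNC flag are assumptions added for illustration.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: make file descriptor 1 refer to `path`.
 * Returns 0 on success, -1 on error. */
int redirect_stdout(const char *path)
{
    /* open() yields a fresh descriptor and open file table entry */
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    /* copy that descriptor table slot over slot 1 (stdout) */
    if (dup2(fd, STDOUT_FILENO) < 0) {
        close(fd);
        return -1;
    }

    /* the extra descriptor is no longer needed; the open file table
     * entry stays alive because fd 1 still refers to it */
    if (fd != STDOUT_FILENO)
        close(fd);
    return 0;
}
```

After a successful call, anything written to STDOUT_FILENO lands in the named file; a shell would do this in the child process between fork() and exec.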

Application

• Given what we know, are there interesting things we can do at the application layer to speed things up?
• Making a system call is several orders of magnitude more expensive than a function call

Figure (File System Layering), top to bottom: Application, Unix I/O, File System, Disk Drive

Caching in the Application

• Applications can use caching to improve performance just like the kernel
• Most I/O has both
  – Spatial locality
  – Temporal locality
  – An application level cache in the form of the Standard I/O library attempts to take advantage of this

Figure (File System Layering), top to bottom: Application with a Buffered I/O layer, Unix I/O, File System, Disk Drive

STDIO (Caching)

• Each Unix I/O call has a corresponding stdio call
  – open() → fopen(), close() → fclose()
  – read() → fread(), write() → fwrite()
• Instead of returning a file descriptor, fopen() returns a FILE *
• The FILE struct contains:
  – actual file descriptor
  – pointer to a buffer
  – position in buffer
  – other bookkeeping information

How it works - writes

• When fwrite() is called, bytes are copied to the stream buffer
• If the stream buffer fills during the fwrite()
  – write() is called to “write” the stream buffer
  – Stream buffer cleared
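The write path described above can be sketched with a toy stream type. This is not how any real C library lays out FILE; `struct stream`, `stream_write`, `stream_flush`, and the 4096-byte buffer size are all hypothetical names and choices for illustration.

```c
#include <string.h>
#include <unistd.h>

#define STREAM_BUF_SIZE 4096     /* assumed buffer size */

/* Toy stand-in for FILE: descriptor, buffer, and position in buffer */
struct stream {
    int fd;
    size_t used;                 /* bytes currently held in buf */
    char buf[STREAM_BUF_SIZE];
};

/* Hand the buffered bytes to the kernel with write(), then clear. */
int stream_flush(struct stream *s)
{
    size_t done = 0;
    while (done < s->used) {
        ssize_t n = write(s->fd, s->buf + done, s->used - done);
        if (n < 0)
            return -1;
        done += (size_t)n;       /* cope with short writes */
    }
    s->used = 0;
    return 0;
}

/* Copy bytes into the buffer; only call write() when it fills. */
int stream_write(struct stream *s, const void *data, size_t len)
{
    const char *p = data;
    while (len > 0) {
        size_t room = STREAM_BUF_SIZE - s->used;
        size_t chunk = len < room ? len : room;
        memcpy(s->buf + s->used, p, chunk);
        s->used += chunk;
        p += chunk;
        len -= chunk;
        if (s->used == STREAM_BUF_SIZE && stream_flush(s) < 0)
            return -1;
    }
    return 0;
}
```

Many small `stream_write` calls therefore cost one system call per 4096 bytes rather than one per call, which is the saving the later Analysis slide refers to.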

fwrite()

Figure: the FILE struct (Buffer, Buffer offset, fd) in the application; write() crosses the kernel boundary to the cached file blocks

How it works - reads

• When fread() is called, bytes are copied from the stream buffer to the application designated location
• If the stream buffer empties during the fread()
  – read() is called to refill the stream buffer
  – Position in stream buffer reset

fread()

Figure: the FILE struct (Buffer, Buffer offset, fd) in the application; read() crosses the kernel boundary to fetch from the cached file block

Analysis

• Costs over doing a system call
  – Need extra buffer space
  – One extra set of copies
  – Bookkeeping to ensure the stream buffer exactly matches real file location
  – I/O to random locations can be inefficient
• Advantage over system call
  – If the application's I/O requests are for much less data than the underlying buffer holds, buffering greatly reduces the number of system calls
  – System calls are very expensive

What are files good for?

• A bulk storage mechanism
• A more permanent form of storing information
• A form of interprocess communication
  – The mere existence of a file can mean something
  – Data in a file can be a message to a process that doesn’t exist yet

Sharing data on disk

Figure: Application 1 calls write() to put “Hi” in a file on disk while Application 2 calls read() on the same file; what Application 2 sees is marked “?”

Two processes, same time

• As the gap between the two processes’ file accesses narrows, just what one process sees relative to the actions of the other becomes unpredictable
• Two common problems
  – Lost update
  – Inconsistent retrievals

The Lost Update

• Withdraw(A, 4); Deposit(B, 4);
  – Bal = A.read();     100
  – A.write(Bal - 4);   96
  – Bal = B.read()      200
  – B.write(Bal + 4)    204

• Withdraw(C, 3); Deposit(B, 3);
  – Bal = C.read();     300
  – C.write(Bal - 3);   297
  – Bal = B.read()      200
  – B.write(Bal + 3)    203

Both sequences read B as 200 before either writes it back, so one of the two deposits is lost.
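The interleaving can be replayed deterministically in a few lines. The sketch below simulates the two deposits to B with plain variables standing in for the file contents; the function names are hypothetical, and the exact ordering of the four steps is an assumption consistent with both readers seeing 200.

```c
/* Both "processes" read B before either writes it back. */
int interleaved_deposits(void)
{
    int b = 200;         /* balance of B as stored in the file        */
    int bal1 = b;        /* process 1: Bal = B.read()   sees 200      */
    int bal2 = b;        /* process 2: Bal = B.read()   sees 200      */
    b = bal1 + 4;        /* process 1: B.write(Bal + 4) stores 204    */
    b = bal2 + 3;        /* process 2: B.write(Bal + 3) stores 203    */
    return b;            /* 203: the deposit of 4 has been lost       */
}

/* The same two deposits performed one after the other. */
int serial_deposits(void)
{
    int b = 200;
    int bal = b;         /* process 1 reads 200 */
    b = bal + 4;         /* process 1 writes 204 */
    bal = b;             /* process 2 reads 204 */
    b = bal + 3;         /* process 2 writes 207 */
    return b;            /* 207: both deposits applied */
}
```

The difference between 203 and 207 is exactly the update that was overwritten.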

Aside - cache consistency

• The previous problem illustrates the issue of cache consistency

• The values read from disk and then used are cached

• Multiple programs cache and change the same data simultaneously without regard for one another

• Result

Inconsistent Retrievals

• Withdraw(A, 5); Deposit(B, 5);
  – Bal = A.read();     200
  – A.write(Bal - 5);   195
  – Bal = B.read();     100
  – B.write(bal + 5)    105

• TotalAccounts();
  – Bal = A.read();     195
  – Bal += B.read()     295
  – Bal += C.read()     …

Are these familiar types of problems?

• How was it solved?
• Would the same solution work here?
• Would it scale?

Lock a file region

int sharedData;
Lock aLock;
…
aLock.acquire();
    read or write sharedData    // Shared Region
aLock.release();
…

lockf()

lockf(fd, function, size)

• fd: an open file descriptor that allows writing
• function: one of F_ULOCK, F_LOCK, F_TLOCK, F_TEST
• size: starting from the current file position, the number of bytes to lock

Using lockf()

int main(int argc, char **argv) {
    int status;
    int fd = open(argv[1], O_RDWR);

    // Try to lock the next 60 bytes without blocking
    if ((status = lockf(fd, F_TLOCK, 60)) < 0) {
        printf("locked\n");
        // The region is locked by someone else: wait for it
        lockf(fd, F_LOCK, 60);
    }
}

The Lost Update (2)

• Withdraw(A, 4); Deposit(B, 4);
  – Bal = A.read();     100
  – A.write(Bal - 4);   96
  – Bal = B.read()      200
  – B.write(Bal + 4)    204

• Withdraw(C, 3); Deposit(B, 3);
  – Bal = C.read();     300
  – C.write(Bal - 3);   297
  – Bal = B.read()      200
  – B.write(Bal + 3)    203

Types of lock requests

• Regular lock (really a writer lock)
  – Only one acquisition allowed at a time
• Read lock
  – Allows multiple readers to hold the lock at the same time (increased concurrency)
  – Basically prevents a writer from making changes
• Write lock
  – Only one acquisition allowed at a time
  – Prevents read lock from being acquired

Reader – Writer locks

int sharedData;
Lock aLock;
…
aLock.acquireWrite();
    write sharedData    // Shared Region
aLock.release();
…
aLock.acquireRead();
    read sharedData     // Shared Region
aLock.release();
…
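The slides use lockf(), which only offers exclusive locks. As an illustration of separate read and write locks on a file region, the sketch below uses POSIX fcntl() record locks instead; this is a swapped-in technique, not something the slides show, and the helper name `lock_region` is hypothetical.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: apply a lock of the given type to bytes
 * [start, start + len) of the file, waiting if necessary.
 * `type` is F_RDLCK (shared), F_WRLCK (exclusive), or F_UNLCK.
 * Returns -1 on error. */
int lock_region(int fd, short type, off_t start, off_t len)
{
    struct flock fl = {
        .l_type = type,
        .l_whence = SEEK_SET,    /* start is an absolute offset */
        .l_start = start,
        .l_len = len,
    };
    return fcntl(fd, F_SETLKW, &fl);   /* F_SETLKW blocks until granted */
}
```

A reader would bracket its access with `lock_region(fd, F_RDLCK, 0, 60)` and `lock_region(fd, F_UNLCK, 0, 60)`; a writer would use F_WRLCK. These locks are held per process, so they arbitrate between processes rather than between descriptors in one process.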

Implementing Locks

• Each lock requires
  – Lists of process IDs
    • Process with lock
    • Processes waiting for lock
  – Regions: what part of the file is being locked and how (read/write)

Where are locks implemented?

• Requirements
  – Must be (potentially) 1 per file
  – All processes must be able to locate the lock
  – Created on demand (sort of)
• What kernel data structure associated with file management has these properties?

Locking and Vnodes

Figure (adapted from Computer Systems: A Programmer’s Perspective): two processes each call fd = open("A",…). Each has its own descriptor table (fd 0 stdin, fd 1 stdout, fd 2 stderr, fd 3, fd 4) and its own open file table entry for File A (File pos, refcnt=1); both entries refer to the same v-node table entry (File access, File size, File type, ...).

Are locks enough?

• Locks can control concurrency
• Sometimes a collection of actions needs to be atomic
  – Locks can’t ensure this in the face of failures
  – Undoing (rolling back) things can be a challenge

Transactions - Definition

• A transaction is a sequence of data operations with the following properties:
  – A  Atomic: all or nothing
  – C  Consistent: consistent state in => consistent state out
  – I  Independent (isolated): partial results are not visible to concurrent transactions
  – D  Durable: once completed, new state survives crashes

Transaction Operations

• tid = beginTX()
  – Start a new transaction and return a transaction identifier
• status = commitTX(tid)
  – Cause the transaction to commit
  – Return success indication if transaction committed, otherwise return failure indication

Transaction Operations cont’d

• abortTX(tid)
  – Abort the transaction and cause all files to take on the values they had before the transaction started
• readTX(tid, file values)
  – Read the given “values” from a file and associate the read with the indicated transaction

Transaction Operations cont’d

• writeTX(tid, values)
  – Write the given values to the file and associate the write with the indicated transaction

Example transaction

tid = beginTX();
readTX(tid, &a, file_to_read_from, …);
readTX(tid, &b, file_to_read_from, …);
perform computations
writeTX(tid, &a, file_to_write_to, ...);
readTX(tid, &c, file_to_read_from, …);
if (error reading) { abortTX(tid); return; }
perform computations
writeTX(tid, &c, file_to_write_to, …);
commitTX(tid);

Ensuring Atomicity

• Problem
  – ensure all changes get made or none get made
• If no failure, it’s easy
  – just do the updates
• If failure occurs while updates are performed must either
  – Go back to the initial state
  – Go to the final state

Strategy

• Use another file, called a log file, to record our intentions
• Write information to indicate
  – That a transaction has started
  – The new values a file is to have
  – That a transaction has committed
  – That a transaction has aborted
  – That the transaction can be truncated

Logging

• Persistent (on disk) log
  – records information to support recovery and abort
• Types of log records
  – begin, update, abort, commit, and truncate
• Atomic update
  – atomic operation is write of commit record to disk
  – transaction committed iff commit record in log

Ways to log the “values”

• Value logging
  – write new value of modified data to log
  – simple, but not always space efficient or easy
    • hard for some things such as malloc and system calls
• Operation logging
  – write name of operation and its arguments
  – usually used for roll forward logging

Logging for Roll Forward

• For each transactional update
  – Change in-memory copy
  – Write new value to log
  – Do not change on-disk copy until commit
• Commit
  – Write commit record to log
  – Write changed data to disk
  – Write truncate record to log
• Abort
  – Write abort record to log
  – Invalidate in-memory data
  – Nothing to do with on disk copies

Roll forward recovery

• When the system restarts after a failure
  – use log to roll forward committed transactions
  – normal access stopped until recovery is completed

Recovery Continued

• Complete committed, but un-truncated transactions
  – for every trans with a commit but no truncate
  – read new values from log and update disk values
  – write truncate record to log
• Abort all uncommitted trans
  – for every trans with no commit or abort
  – write abort record to log
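The roll-forward rule above can be sketched over an in-memory log. Everything here is an illustrative assumption: the record layout, the names `struct log_rec` and `redo_recover`, and the use of an int array to stand for the on-disk values. The sketch only replays new values; appending the truncate and abort records that the slide also calls for is left out.

```c
#include <stddef.h>

enum rec_type { REC_BEGIN, REC_NVAL, REC_COMMIT, REC_ABORT, REC_TRUNC };

/* One log record (hypothetical layout) */
struct log_rec {
    enum rec_type type;
    int tid;             /* transaction the record belongs to      */
    int slot;            /* which data item (REC_NVAL only)        */
    int newval;          /* the value to install (REC_NVAL only)   */
};

/* Does transaction `tid` have a record of type `t` anywhere in the log? */
static int has_record(const struct log_rec *log, size_t n,
                      int tid, enum rec_type t)
{
    for (size_t i = 0; i < n; i++)
        if (log[i].tid == tid && log[i].type == t)
            return 1;
    return 0;
}

/* Replay, in log order, the new values of every transaction that has a
 * commit record but no truncate record. Uncommitted transactions are
 * skipped, so their values never reach disk[].
 * Returns the number of values written. */
size_t redo_recover(const struct log_rec *log, size_t n, int *disk)
{
    size_t applied = 0;
    for (size_t i = 0; i < n; i++) {
        if (log[i].type != REC_NVAL)
            continue;
        int tid = log[i].tid;
        if (has_record(log, n, tid, REC_COMMIT) &&
            !has_record(log, n, tid, REC_TRUNC)) {
            disk[log[i].slot] = log[i].newval;
            applied++;
        }
    }
    return applied;
}
```

Replaying is safe to repeat: installing the same new values twice leaves the same result, which matters if the system fails again during recovery.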

Logging/Recover Example

• Application Actions
  – tid = beginTX
  – ReadTX(tid, &a, …)
  – ReadTX(tid, &b, …)
  – WriteTX(tid, &b, …)
  – WriteTX(tid, &a, …)
  – commitTX(tid)
• Write out a and b to real file
• Write truncate to log

• Log File Records
  – BEGIN<1>
  – NVAL<1, b, newval>
  – NVAL<1, a, newval>
  – COMMIT<1>
  – TRUNC<1>

Role of Locking

• Locks must still be acquired to prevent inconsistent retrieval and lost updates

• Upon first time access of a value its source must be locked

• Locks released after all writes to real file completed (or reads if no writes being done)

• Locks are also used on the log file

Log File

• Log file can be shared by different processes
• Writes are always done to the end
• Before doing a write, a lock is acquired and released upon write completion
• Write consists of one or more log records

Roll backwards logging

• This is the opposite of redo or roll-forward logging

• Instead of writing new values to the log file old values are written

• Real files are updated before commit is written

• On abort, log is used to restore old values

Undo logging - roll backward: Normal operation

• For each transactional update
  – write old value to log
  – modify data and write to disk any time
• Commit
  – ensure that all updates have been written to disk
  – write commit record to log
• Abort
  – use log to recover disk to old values

Undo logging - roll backward: Recovery

• When the system restarts after a failure
  – use log to rollback uncommitted transactions
  – normal access stopped until recovery completed
• Undo effect of any uncommitted transactions
  – for every trans with no commit or abort use log to recover disk to old values
  – write abort record to log
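Undo recovery mirrors the redo case: old values are put back for every transaction that never finished. The sketch below uses an in-memory log; the record layout and the names `struct undo_rec` and `undo_recover` are hypothetical, and walking the log backwards (so that the oldest saved value is the one that ends up on disk) is an assumption the slides do not spell out. Writing the abort record is omitted.

```c
#include <stddef.h>

enum undo_type { U_BEGIN, U_OVAL, U_COMMIT, U_ABORT };

/* One log record (hypothetical layout) */
struct undo_rec {
    enum undo_type type;
    int tid;             /* transaction the record belongs to     */
    int slot;            /* which data item (U_OVAL only)         */
    int oldval;          /* value before the update (U_OVAL only) */
};

/* Has transaction `tid` logged a commit or an abort? */
static int finished(const struct undo_rec *log, size_t n, int tid)
{
    for (size_t i = 0; i < n; i++)
        if (log[i].tid == tid &&
            (log[i].type == U_COMMIT || log[i].type == U_ABORT))
            return 1;
    return 0;
}

/* Scan the log from newest to oldest and restore the old value of
 * every update made by a transaction with no commit or abort record.
 * Returns the number of values restored. */
size_t undo_recover(const struct undo_rec *log, size_t n, int *disk)
{
    size_t restored = 0;
    for (size_t i = n; i-- > 0; ) {
        if (log[i].type == U_OVAL && !finished(log, n, log[i].tid)) {
            disk[log[i].slot] = log[i].oldval;
            restored++;
        }
    }
    return restored;
}
```

Committed transactions are left alone because, under undo logging, their updates were already forced to disk before the commit record was written.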

Logging/Recover Example

• Application Actions
  – tid = beginTX
  – ReadTX(tid, &a, …)
  – ReadTX(tid, &b, …)
  – WriteTX(tid, &b, …)
  – WriteTX(tid, &a, …)
  – commitTX(tid)
• Ensure updated a and b written to real file
• Write commit to log

• Log File Records
  – BEGIN<1>
  – OVAL<1, b, oldval>
  – OVAL<1, a, oldval>
  – COMMIT<1>

Outstanding problems?

• What about disk write order?
  – When application writes to disk the operating system decides write time and order
  – This is a problem for transactions
• Keeping the log file from growing infinitely large
  – Log file truncation

fsync()

• The order of writes is important
• For example in redo logging
  – All new values must be written to the log file before the commit is written
  – All updates to the “real” files need to be on disk before truncate is written
• fsync(fd) will not return until all outstanding writes on the file descriptor are complete

fsync() cont’d

• fsync() does not guarantee that writes go to the disk in program order
• If disk write order is important (e.g. when commit is written) then
  – Call fsync() before writing commit
  – Write commit
  – Call fsync() again
• Could also open file with O_SYNC option
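The fsync / write commit / fsync sequence can be written down directly. The following is a minimal sketch; the function names and the idea of passing log records as strings are assumptions made for brevity.

```c
#include <string.h>
#include <unistd.h>

/* Write a whole string, treating a short write as failure. */
static int write_str(int fd, const char *s)
{
    size_t len = strlen(s);
    return write(fd, s, len) == (ssize_t)len ? 0 : -1;
}

/* Hypothetical helper: append the update records, force them to disk,
 * then append the commit record and force that to disk too.
 * Returns 0 on success, -1 on error. */
int log_commit(int log_fd, const char *update_recs, const char *commit_rec)
{
    if (write_str(log_fd, update_recs) < 0)
        return -1;
    if (fsync(log_fd) < 0)        /* updates are on disk before commit */
        return -1;
    if (write_str(log_fd, commit_rec) < 0)
        return -1;
    if (fsync(log_fd) < 0)        /* commit record itself is on disk */
        return -1;
    return 0;
}
```

Only after the second fsync() returns can the transaction be reported as committed, since the commit record in the log is what defines the commit.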

Shrinking the Log File (Truncation)

• Truncation is the process of
  – removing unneeded records from transaction log
• For redo logging
  – remove transactions with truncate or abort records
• For undo logging
  – Remove transactions with commit or abort records

Layering - revisited

• STDIO and transaction systems are layers within the application layer
• Notice that layers don’t have to extend completely across the level they are in
• When using a layer don’t circumvent it
  – Example: when using STDIO don’t get the file descriptor and then do your own reads or writes and continue to use the f*() calls

Application Layering

Figure: STDIO and the Transaction System sit inside the Application layer, above UNIX I/O; below UNIX I/O sit the File System (on the Disk Drive) and the devices Keyboard, Terminal, Tape, and Audio

Layering in the File System

• Disks present very similar interfaces but the precise way to control different disk types differs

• To simplify the task of dealing with different disk types the notion of a virtual disk interface is used

• Each time a new type of drive is introduced one simply implements the virtual interface
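In C, a virtual interface of this kind is often expressed as a table of function pointers that each driver fills in. The sketch below shows the idea with a RAM-backed "disk"; the names `struct vdisk`, `struct vdisk_ops`, and `ram_disk_ops`, and the two-operation interface, are hypothetical and not taken from any particular kernel.

```c
#include <stddef.h>
#include <string.h>

#define VD_BLOCK_SIZE 512

struct vdisk;

/* The virtual disk interface: what every disk type must implement */
struct vdisk_ops {
    int (*read_block)(struct vdisk *d, size_t blockno, void *buf);
    int (*write_block)(struct vdisk *d, size_t blockno, const void *buf);
};

struct vdisk {
    const struct vdisk_ops *ops;   /* driver-specific implementation */
    void *state;                   /* driver-private data            */
    size_t nblocks;                /* capacity in blocks             */
};

/* One "driver": blocks kept in ordinary memory */
static int ram_read(struct vdisk *d, size_t blockno, void *buf)
{
    if (blockno >= d->nblocks)
        return -1;
    memcpy(buf, (char *)d->state + blockno * VD_BLOCK_SIZE, VD_BLOCK_SIZE);
    return 0;
}

static int ram_write(struct vdisk *d, size_t blockno, const void *buf)
{
    if (blockno >= d->nblocks)
        return -1;
    memcpy((char *)d->state + blockno * VD_BLOCK_SIZE, buf, VD_BLOCK_SIZE);
    return 0;
}

const struct vdisk_ops ram_disk_ops = { ram_read, ram_write };
```

The file system layer would only ever call `d->ops->read_block(d, n, buf)` and `d->ops->write_block(d, n, buf)`, so supporting a new drive type means supplying another `struct vdisk_ops` without touching the layers above.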

Yet Another Layer

Figure: Application (containing STDIO and the Transaction System) over UNIX I/O; below UNIX I/O sit the File System and Other Devices; the File System uses the Virtual Disk Interface, which is implemented for SCSI, ESDI, and IDE above the Disk Drive

Extending the File System

• Layering makes it “easy” to extend the file system architecture provided the various boundaries are well defined
• Example:
  – Journaling/logging file systems
  – Network File Systems (NFS)
  – iSCSI
• Just insert the new service at the appropriate layer

Inserting New Functionality

Figure: Application over UNIX I/O; at the File System level sit Unix FFS, Logging FS, and NFS Client, alongside Other Devices; below them the Virtual Disk Interface with SCSI, IDE, and iSCSI; the Network Protocol Stack also appears in the figure

Layering Yet Again!

Figure: three layering structures side by side, each listed top to bottom.
  – General Layering Structure: Application programs, Operating system, Hardware
  – Network Layering: Application, Transport, Network, Link
  – File System Layering: Application, Unix I/O, File System, Disk Drive