
Page 1: Log-structured File System

Sriram Govindan
sgovinda@cse

Page 2: The Hierarchy

User level:
  User programs
  Libraries

Kernel level:
  System call interface
  File System
  Buffer cache
  Device driver

Hardware

Page 3: File System (1)

The kernel has three tables:

  Per-process user file descriptor table.
  System-wide open file table.
  Inode table.

A physical disk can be partitioned into several file systems, each with a different logical block size.

Conversion between logical addresses and physical addresses is done by the device driver.
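Going back to the three kernel tables above: a rough sketch of how they chain together, one level of indirection per table (all names and fields here are hypothetical, not the actual kernel definitions):

/* Hypothetical sketch of the three kernel tables and how they point at
 * each other; names and fields are illustrative, not real UNIX code. */

struct inode {                 /* inode table entry (in-core inode) */
    unsigned long ino;         /* inode number */
    unsigned int  ref_count;   /* open-file entries pointing here */
    /* ... owner, permissions, size, block pointers ... */
};

struct open_file {             /* system-wide open file table entry */
    struct inode *ip;          /* the file's in-core inode */
    long          offset;      /* current read/write offset */
    int           flags;       /* read/write mode, etc. */
};

struct proc {                  /* per-process data */
    struct open_file *fd[64];  /* user file descriptor table:
                                  fd number -> open file table entry */
};

int main(void) { return 0; }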

Page 4: File system (2)

File system structures:

  Boot block – at the beginning of the file system, typically the first sector; holds bootstrap code that is read into memory to boot the system. Every file system has a boot block (it may be empty).

  Super block – state of the file system: how large it is, how many files it can store, where to find free space, etc.

  Inode list – list of inodes; an inode is referenced by its index into the inode list.

  Data blocks – a data block can belong to one and only one file in the file system.
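As a hedged sketch only (the constants, field names, and ordering are assumptions for illustration, not any particular UNIX on-disk format), the layout above can be pictured like this:

#include <stdint.h>

/* Illustrative on-disk layout: boot block, super block, inode list,
 * then data blocks.  All names and sizes are hypothetical. */

struct superblock {
    uint32_t fs_size_blocks;   /* how large the file system is */
    uint32_t ninodes;          /* how many files it can store */
    uint32_t free_list_head;   /* where to start looking for free space */
};

struct disk_inode {
    uint16_t mode;             /* file type and permissions */
    uint32_t size;             /* file size in bytes */
    uint32_t addrs[13];        /* direct and indirect block addresses */
};

/* Block 0: boot block (possibly empty), block 1: super block,
 * blocks 2..k: the inode list, indexed by inode number,
 * remaining blocks: data blocks, each owned by exactly one file. */

int main(void) { return 0; }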

Page 5: Inode – in-core and on-disk

Page 6: Earlier UNIX file system – 1970s

Assigned disk addresses to new blocks as they were created.

Wrote modified blocks back to their original addresses (overwrite).

So the disk became fragmented over time: new files were allocated randomly across the disk, and even reading/writing the same file sequentially required a lot of seeks.

Page 7: Berkeley Unix FFS – 1984

Increased block size – improved bandwidth.

Placed related information close together – blocks of the same file are placed on nearby cylinders.

Limiting factors: synchronous I/O for file creation and deletion (done for better crash recovery), and seek times between I/O requests for different files.

Page 8: Log-structured File System Sriram Govindan sgovinda@cse

Motivation/Need ?

Any optimization/design is dependent on the workload.

General observation on workloads: Small file accesses Meta data update

Page 9: FFS – problems

Problems with FFS:

  Inodes, the corresponding directory entries, and the associated data blocks are not close together.
  Synchronous metadata updates.

Creating a file in FFS involves several writes, each separated by a seek (sketched below):

  Get a free inode, mark it used, fill in name/times/...
  Go to the directory data block and insert the new entry.
  Get a free file block and write into it.
  Update the file inode with a pointer to this block, and update the modification time.

All of the above are short writes!
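A hedged illustration of why this hurts: each step below stands for one small synchronous write preceded by a seek (sync_write and ffs_create_file are hypothetical names, not real FFS code):

#include <stdio.h>

/* Hypothetical illustration of file creation as a series of small,
 * synchronous, seek-separated writes.  Not real FFS code. */

static void sync_write(const char *what)
{
    /* Stand-in for "seek to the block and write it synchronously". */
    printf("seek + synchronous write: %s\n", what);
}

static void ffs_create_file(void)
{
    sync_write("inode block (allocate inode, set name/times)");
    sync_write("directory data block (insert new entry)");
    sync_write("new data block (first file contents)");
    sync_write("inode block again (block pointer, mtime)");
}

int main(void)
{
    ffs_create_file();   /* four short writes, four seeks */
    return 0;
}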

Page 10: Log structured file system

Store all file system information in a single continuous log.

Improve write performance by buffering a sequence of file system changes, including those to the metadata, in the buffer cache, and then writing them sequentially to disk in a single disk write operation.

Optimized for writing, since "no" seek is involved – also note that the buffer cache does little for write performance (writing to the same block within a short period of time gets help from the buffer cache, but writing to multiple files does not).

Helps long reads since data is placed contiguously – I would assume otherwise??

Temporal locality, of course.
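A minimal sketch of that write path under assumed names and sizes (segment_buf, disk_write, and the block/segment sizes are all hypothetical): dirty blocks accumulate in memory and go to disk in one large sequential write per segment.

#include <string.h>
#include <stdio.h>

/* Hedged sketch of the LFS write path: dirty blocks (data and metadata)
 * accumulate in memory and are flushed to the next free segment in one
 * large sequential write.  Names, sizes, and disk_write are made up. */

#define BLOCK_SIZE   4096
#define SEG_BLOCKS   128               /* ~512 KB segment */

static char segment_buf[SEG_BLOCKS][BLOCK_SIZE];
static int  nbuffered = 0;
static long next_segment = 0;          /* next free segment in the log */

static void disk_write(long seg, const void *buf, size_t len)
{
    /* Stand-in for one big sequential disk write. */
    printf("write %zu bytes at segment %ld\n", len, seg);
    (void)buf;
}

void lfs_dirty_block(const char block[BLOCK_SIZE])
{
    memcpy(segment_buf[nbuffered++], block, BLOCK_SIZE);
    if (nbuffered == SEG_BLOCKS) {     /* segment full: flush it */
        disk_write(next_segment++, segment_buf, sizeof segment_buf);
        nbuffered = 0;
    }
}

int main(void)
{
    char blk[BLOCK_SIZE] = {0};
    for (int i = 0; i < 300; i++)      /* many small writes, few big flushes */
        lfs_dirty_block(blk);
    return 0;
}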

Page 11: LFS vs FFS

Major differences:

  Disk layout (data structures).
  Recovery mechanisms.
  Performance.

For writes, FFS uses only 5 to 10 percent of the disk bandwidth, whereas LFS can use up to 70% of the disk bandwidth.

That is, for writes, FFS spends 90 to 95 percent of the disk bandwidth on seeking, while LFS spends about 30% of the disk bandwidth on cleaning.

Page 12: FFS disk data structures

Inodes, of course.

Super block:

  Block size, file system size, rotational delay, number of sectors per track, number of cylinders.
  Replicated throughout the file system – for crash recovery.

The disk is statically partitioned into cylinder groups.

Each cylinder group:

  Is a collection of around 16 to 32 cylinders.
  Has a fixed number of inodes (one for every 2 KB of data blocks).
  Has a bitmap to record free inodes and data blocks.

From an inode number we can calculate its disk address (example below).

New blocks are allocated in the same cylinder group, possibly at the rotationally optimal position – to optimize for sequential accesses.
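Since inodes live at fixed, statically computed locations, the address calculation is simple arithmetic. A minimal sketch under assumed layout parameters (every constant here is made up purely for illustration):

#include <stdio.h>
#include <stdint.h>

/* Hypothetical static layout: each cylinder group holds a fixed number
 * of inodes at a fixed offset, so an inode number maps directly to a
 * disk address.  All constants are invented for the example. */

#define INODES_PER_GROUP   2048
#define GROUP_SIZE_BYTES   (8L * 1024 * 1024)   /* one cylinder group */
#define INODE_AREA_OFFSET  8192                 /* past the group's bookkeeping */
#define INODE_SIZE         128

long inode_disk_address(uint32_t inum)
{
    uint32_t group = inum / INODES_PER_GROUP;
    uint32_t index = inum % INODES_PER_GROUP;
    return (long)group * GROUP_SIZE_BYTES
         + INODE_AREA_OFFSET
         + (long)index * INODE_SIZE;
}

int main(void)
{
    printf("inode 5000 lives at byte offset %ld\n", inode_disk_address(5000));
    return 0;
}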

Page 13: FFS disk layout

Page 14: Now, LFS

LFS is a hybrid between sequential database logs and FFS:

  Sequential database logs as in writing sequentially.
  FFS as in indexing into the log to support efficient random retrieval.

Disk layout:

  Analogous to FFS cylinder groups, in LFS the disk is statically partitioned into fixed-size "segments" (say around 500 KB).
  A logical ordering of these segments creates the log.
  Has a super block similar to FFS.

LFS accumulates writes as dirty pages in memory and writes them, along with their inodes, sequentially to the "next" (in terms of spatial contiguity) available segment on the disk.

Inodes are no longer in fixed locations. An additional data structure called the "inode map" maps inode numbers to their locations on the disk (sketched below).
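A minimal sketch of the indirection the inode map adds. An in-memory array is used here for simplicity; in real LFS the inode map is itself written out to the log, and every name below is hypothetical:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical inode map: inode number -> current disk address of the
 * inode.  Updated every time an inode is rewritten to a new segment. */

#define MAX_INODES 65536

static int64_t inode_map[MAX_INODES];   /* 0 means "unused" in this toy sketch */

void imap_update(uint32_t inum, int64_t new_disk_addr)
{
    inode_map[inum] = new_disk_addr;    /* inode just moved to the log head */
}

int64_t imap_lookup(uint32_t inum)
{
    return inode_map[inum];             /* where the inode currently lives */
}

int main(void)
{
    imap_update(42, 1048576);           /* inode 42 written at byte 1 MB */
    printf("inode 42 is at %lld\n", (long long)imap_lookup(42));
    return 0;
}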

Page 15: LFS writes

Since dirty blocks in LFS are written sequentially into the next available segment on the disk (the no-overwrite policy), the old copies of those blocks are no longer valid and therefore have to be cleaned.

A "cleaner" is a garbage collection process that reclaims free space for the file system; it should ensure that large extents of free space are always available.

Policies to determine:

  What to clean – segment utilization, rate of change of the segment, etc.
  When and how many to clean – watermarks.
  How to group/re-organize live blocks – age sort, etc.

Page 16: More on segment cleaning

The log can either thread through the free extents, or move (copy) live data that is in the way. LFS chose the "and" option – thread through cold segments, copy and re-group hot blocks.

A cleaner reads a fixed number of segments into memory, discards the dead blocks (those that were deleted or overwritten), and appends the live blocks from those segments back to the log.

No need to maintain a list of free blocks – no need for a bitmap (as in FFS).

How does the cleaner determine whether a block is dead or not?

"Segment summary block(s)" are included in every segment for this purpose.
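A hedged sketch of the cleaner's copy step (the segment layout is toy-sized and block_is_live is only a stub; the real liveness test is described on the next slide):

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical sketch of one cleaner pass: read a segment into memory,
 * keep only the live blocks, and append them to the log head. */

#define SEG_BLOCKS 4            /* tiny segments, just for the demo */

struct segment { int blocks[SEG_BLOCKS]; };

static bool block_is_live(const struct segment *seg, int i)
{
    return seg->blocks[i] != 0;            /* stub: nonzero means "live" */
}

static void append_to_log(int block)
{
    printf("copy live block %d to the log head\n", block);
}

static void clean_segment(struct segment *seg)
{
    for (int i = 0; i < SEG_BLOCKS; i++)
        if (block_is_live(seg, i))         /* dead blocks are simply dropped */
            append_to_log(seg->blocks[i]); /* live blocks move forward */
    /* after this, the whole segment can be reused as free space */
}

int main(void)
{
    struct segment s = { { 7, 0, 9, 0 } }; /* two live, two dead blocks */
    clean_segment(&s);
    return 0;
}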

Page 17: Log-structured File System Sriram Govindan sgovinda@cse

Segment Summary Block (SSB)

SSB Contains the inode and logical block number information for every block in the segment.

Cleaner checks for all blocks if they are still pointed to by their inode (else dead)

Optimize on this by associating a version number for each of the block – incremented on every file deletion/truncation to length 0, compared with version number of its inode in inode map.

Kernel maintains a “segment usage table”, which shows the number of live bits in that segment and its last modified time

Used by the cleaner to determine which segment to clean. On a sync system call (update super block)

Inode map and segment usage table are written to the disk – checkpoint.
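A minimal sketch of that version-number shortcut (structures and names are hypothetical): a stale version in the summary entry means the whole file has since been deleted or truncated, so the block is dead without reading the inode itself.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical segment-summary entry and inode-map entry used for the
 * cleaner's quick liveness check.  Field names are illustrative only. */

struct ssb_entry {
    uint32_t inum;        /* which inode this block belonged to */
    uint32_t lbn;         /* logical block number within that file */
    uint32_t version;     /* file version when the block was written */
};

struct imap_entry {
    uint32_t version;     /* current version of the file */
    int64_t  inode_addr;  /* where the inode currently lives on disk */
};

static struct imap_entry inode_map[1024];

/* A mismatch means the file was deleted or truncated since the block
 * was written, so the block is dead without even reading the inode. */
static bool quick_liveness_check(const struct ssb_entry *e)
{
    return inode_map[e->inum].version == e->version;
}

int main(void)
{
    inode_map[42].version = 3;                      /* file re-created since */
    struct ssb_entry old = { 42, 0, 2 };            /* written at version 2 */
    printf("block live? %d\n", quick_liveness_check(&old));  /* prints 0 */
    return 0;
}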

Page 18: Physical layout of the LFS

Page 19: LFS – cleaning policies – performance

Performance metric: "write cost"

  The average amount of time the disk is busy per unit of new data written, including all cleaning overheads.
  Normalized to a write done at full disk bandwidth (no seek or cleaning delays), which has a write cost of 1.
  The write cost is tied to the fraction of live data in the segments being cleaned.
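As a rough worked version of that arithmetic (this follows the steady-state model in the LFS paper: to reclaim a segment whose live fraction is u, the cleaner reads the whole segment and writes back the u live fraction, leaving 1 - u of it for new data):

  write cost = (total I/O) / (new data written)
             = (read segments + rewrite live data + write new data) / (new data)
             = (N + N*u + N*(1-u)) / (N*(1-u))
             = 2 / (1 - u)

So at u = 0.8 the write cost is 10, while at u = 0.2 it is only 2.5, which is why the cleaning policy tries to clean segments that are mostly empty.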

Page 20: Simulate cleaning

Uniform – write out live data in the same order it was read in.

Hot and cold – regroup live data.

What to clean – the least utilized segments (greedy).

Page 21: Recovery

Recovery will involve:

r1) Bringing the file system to a physically consistent state.

  Consistent with what the disk layout/data structures (e.g. the free bitmap in the cylinder group block) say.

r2) Verifying the logical structure of the file system.

  Verify all directories and inode pointers; find dangling pointers.

What happens when a block is added to a file, in both LFS and FFS? We may need to modify:

  The block itself, the inode, the free block map (not in LFS), possibly indirect blocks, and the record of the last allocation location.

More than that, these modifications must be done atomically.

r1) and r2) are done by the Unix FFS fsck utility.

Page 22: Recovery in FFS – fsck

Page 23: Recovery in FFS vs LFS

FFS cannot localize inconsistencies, since the modifications mentioned on the previous slide can happen anywhere on the disk.

Therefore it has to check the whole file system for errors (fsck) – highly time consuming.

In LFS, since the modifications are localized to the end of the log, extensive checking of the whole file system is not required.

Similar to standard database recovery.

Page 24: Recovery in LFS

Find the most recent checkpoint. (possible in FFS)

  The file system would have been checkpointed at some point before the crash, e.g. by the "sync" system call, which writes all the file system data structures to the disk.

Initialize the file system data structures from this last checkpointed state. (possible in FFS)

Replay all modifications done after the checkpoint. ("NOT" possible in FFS)

  Read the segments written after the checkpoint in time order and apply their updates to the file system state (data structures); checksums are used to identify valid segments.
  Since the segments are threaded together using the next-segment pointer, we can easily traverse to the end of the log.
  Cleaning and re-grouping of live blocks may overwrite old data – this is captured by the timestamp field.
  FINFO entries – used to update inodes, the inode map, and the segment usage table.

Page 25: LFS recovery – replay

Replay (sketched below):

  If an inode block is present, update the inode map.
  If a data block is present without its corresponding inode, ignore it.

Verification of block pointers and directory structure is crucial to recover from media failures.

LFS checkpointing is done every 30 seconds; the last two checkpoints are kept.
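A minimal sketch of the two replay rules above (segment scanning, checksum checks, and the record format are all simplified placeholders, and this toy version ignores every data block rather than only those whose inode never appears later in the log):

#include <stdint.h>
#include <stdio.h>

/* Hypothetical roll-forward over log records written after the last
 * checkpoint.  Only the two replay rules from the slide are shown. */

enum rec_type { INODE_BLOCK, DATA_BLOCK };

struct log_record {
    enum rec_type type;
    uint32_t inum;        /* owning inode number */
    int64_t  disk_addr;   /* where this block now lives */
};

static int64_t inode_map[1024];       /* inode number -> inode address */

static void replay(const struct log_record *recs, int n)
{
    for (int i = 0; i < n; i++) {
        if (recs[i].type == INODE_BLOCK) {
            /* inode block present: point the inode map at it */
            inode_map[recs[i].inum] = recs[i].disk_addr;
        } else {
            /* data block with no inode in the log: ignored here */
        }
    }
}

int main(void)
{
    struct log_record recs[] = {
        { DATA_BLOCK,  7, 4096 },     /* ignored: no inode followed */
        { INODE_BLOCK, 9, 8192 },     /* inode map updated */
    };
    replay(recs, 2);
    printf("inode 9 now at %lld\n", (long long)inode_map[9]);
    return 0;
}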

Page 26: Data structures used by LFS

Page 27: LFS vs FFS

Page 28: Problems with LFS design

Appending/inserting blocks into a file – a single file ends up scattered throughout the disk??

What percentage of the total disk requests are writes? They probably missed out on the read cost.

Why is it not used now? EXT3?

  Metadata journal.

Page 29: Thank you :)

Acknowledgement: Some of the information was taken from CSE511 slides.