An Introduction to Disk-Based Linux File Systems

Avishay Traeger

IBM Haifa Research Lab Internal Storage Course

October 2012

v1.3


Outline

- The Basics
- The Virtual File System (VFS)
- File System Layout
- Journaling


What does a disk-based file system do?

Provides structure to the array of bits residing on the disk

- File and directory naming and hierarchy
- File access: open, read, write, seek, close, ...
- Knows how to map <file,offset> to <sector,offset>
- Tracks which sectors are used and which are “free”
- Access control
- Extra features (e.g., improved reliability, snapshots, compression, encryption)


Linux File System Types

- Disk (ext2/3/4, xfs, btrfs, ntfs, vfat, etc.)
- Network (nfs, cifs, afs, ceph, etc.)
- Memory (ramfs, tmpfs, etc.)
- Pseudo (proc, sysfs, etc.)
- Stackable (ecryptfs, etc.)
- Object store (exofs)
- FUSE (Filesystem in USErspace): allows developers to implement file systems in userspace (easier to develop, slower to run)
- ...

Approximately 60 file systems currently in the Linux kernel!


Important Metadata Structures

- Superblock: on-disk metadata for the entire file system
  - Block size, pointer to the fs root directory, ...
- Inode: on-disk metadata for a single file
  - Inode number (unique ID), owners, timestamps, size, data block pointers, ...
- Dentry: metadata for a directory entry, a single component of a path (not synced to disk)
- File: open file structure (not synced to disk)
- File → Dentry → Inode → Superblock
- Each structure has associated operations that are implemented by each file system

Note: All Linux file system implementations have the above structures in memory, but not all have superblocks and inodes on disk (especially file systems not native to Linux/Unix, like FAT). These must map on-disk structures to those in memory.


Directories

- A directory is simply a special type of file
- Can contain other files, directories, links, etc.
- Each entry has an inode number and name
- The file system knows how to find a file based on its inode number
- What are the basic steps for performing a lookup on file /foo/bar?
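
One possible answer to the /foo/bar question, as a minimal Python sketch: start at the root inode, then for each path component read the current directory, find the entry with that name, and continue from its inode number. The `inodes` table, its layout, and the `lookup` helper are hypothetical stand-ins for on-disk structures. The root inode number 2 follows the ext2 convention and is used here only for flavor.

```python
# Hypothetical in-memory model: inode number -> inode contents.
ROOT_INO = 2  # ext2 reserves inode 2 for the root directory

inodes = {
    2: {"type": "dir", "entries": {"foo": 11, "etc": 12}},
    11: {"type": "dir", "entries": {"bar": 20}},
    12: {"type": "dir", "entries": {}},
    20: {"type": "file", "data": b"hello"},
}


def lookup(path):
    """Resolve an absolute path to an inode number, one component at a time."""
    ino = ROOT_INO
    for name in path.strip("/").split("/"):
        if not name:
            continue  # handles "/" and repeated slashes
        inode = inodes[ino]
        if inode["type"] != "dir":
            raise NotADirectoryError(path)
        if name not in inode["entries"]:
            raise FileNotFoundError(path)
        ino = inode["entries"][name]  # directory entry: name -> inode number
    return ino


bar_ino = lookup("/foo/bar")
```

Each loop iteration corresponds to one directory read. This sketch ignores caching of directory entries, symlinks, and mount points.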


Hard Links

- Associate several names with one inode
- When creating a link, increment the inode's reference count (refcount)
- The inode and associated data will only be deleted when the refcount is zero
- Can only be used within a single file system
- Can only point to files. This prevents cycles in the directory tree
- Not supported by all file systems

[Figure: two dentries pointing to the same inode, which points to the data blocks]
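
The refcount behavior can be observed from user space. The sketch below uses Python's standard `os.link`, `os.stat`, and `os.unlink` in a temporary directory. The file names are made up for illustration, and it assumes the temporary directory sits on a file system that supports hard links.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    original = os.path.join(d, "original.txt")
    alias = os.path.join(d, "alias.txt")
    with open(original, "w") as f:
        f.write("shared data\n")

    os.link(original, alias)  # second name for the same inode
    st_a, st_b = os.stat(original), os.stat(alias)
    same_inode = st_a.st_ino == st_b.st_ino
    links_after_link = st_a.st_nlink  # link count seen through either name

    os.unlink(original)  # drop one name; the inode survives
    links_after_unlink = os.stat(alias).st_nlink
    with open(alias) as f:
        surviving_data = f.read()
```

The data stays reachable through the remaining name until the last link is removed.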


Symbolic Links (Symlinks)

- A special file that contains a file name
- When the kernel encounters a symlink during a pathname lookup, it replaces the name of the link with its contents (the name of the target file) and restarts the pathname interpretation
- Can point to files on another file system
- Can point to any type of file (e.g., directory)
- Can become a dangling pointer if the target file is deleted
- Use more inodes than hard links (2 vs. 1)
- Higher overhead than hard links for resolution
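
A small user-space sketch of these properties, using Python's standard `os.symlink`, `os.readlink`, and `os.lstat`. The file names are made up, and it assumes a Linux system where the temporary directory permits symlinks.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "target.txt")
    link = os.path.join(d, "link")
    with open(target, "w") as f:
        f.write("data\n")

    os.symlink(target, link)
    stored_name = os.readlink(link)  # the link's contents: a path name
    # lstat examines the link itself, stat follows it to the target
    distinct_inodes = os.lstat(link).st_ino != os.stat(target).st_ino

    os.unlink(target)  # delete the target
    link_still_exists = os.path.lexists(link)  # the symlink itself remains
    resolves = os.path.exists(link)  # False once the link is dangling
```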


Device Files

In Linux, devices can be accessed via special files, generally found under /dev

Two main types:
- Character: stream of bytes (keyboard, serial)
- Block: random access of blocks (hard disk, CD-ROM)
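
The device type is recorded in the file's mode bits, so it can be checked with Python's standard `os.stat` and `stat` modules. The helper name `device_kind` is made up for this sketch, and it assumes a Linux system where /dev/null exists.

```python
import os
import stat


def device_kind(path):
    """Classify a path as a character device, a block device, or neither."""
    mode = os.stat(path).st_mode
    if stat.S_ISCHR(mode):
        return "character"
    if stat.S_ISBLK(mode):
        return "block"
    return "not a device"


kind_of_null = device_kind("/dev/null")
```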


The Virtual File System (VFS)

When we have so many file systems, we need to ensure that:
- User programs do not need to be file-system-aware
- File systems don't re-implement similar functionality

Solution: the VFS, a kernel layer that:
- Handles all system calls related to a standard Unix file system (all file systems have the same API)
- Handles generic activities (e.g., caching, readahead)
- Has generic file system “library” functions that can be used by any file system (e.g., fs/libfs.c)

Each specific file system implements a set of functions (operations vectors)
- Object-oriented programming in C
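
The operations-vector idea can be sketched in a few lines of Python: each file system supplies a table of functions, and the generic layer calls through that table without knowing which file system it is talking to. All names below (`FILE_OPERATIONS`, `vfs_read`, `vfs_write`, and the per-file-system functions) are hypothetical and only mimic the shape of the kernel's C structures of function pointers.

```python
# Hypothetical per-file-system implementations.
def ext3_read(path):
    return f"ext3 read of {path}"


def ext3_write(path, data):
    return len(data)  # pretend every byte was written


def isofs_read(path):
    return f"isofs read of {path}"


def read_only_write(path, data):
    raise PermissionError("read-only file system")


# One "operations vector" per file system type.
FILE_OPERATIONS = {
    "ext3": {"read": ext3_read, "write": ext3_write},
    "isofs": {"read": isofs_read, "write": read_only_write},
}


def vfs_read(fs_type, path):
    """Generic entry point: dispatch through the file system's own table."""
    return FILE_OPERATIONS[fs_type]["read"](path)


def vfs_write(fs_type, path, data):
    return FILE_OPERATIONS[fs_type]["write"](path, data)
```

Adding a new file system in this model means adding one more table. The generic entry points stay unchanged.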


The Virtual File System (VFS)

[Figure: VFS architecture. User space: Application. Kernel: System call handler, VFS, Page Cache, file systems (ext3, isofs, NFS), Scheduler, Memory management, Interrupt handler, Driver (Disk), Driver (CD-ROM), Driver (Network).]


Readahead

- Takes advantage of the page cache
- When a page is read, the VFS code may ask the file system to read the next several contiguous blocks
- Hopefully, the next block read by the application will already be loaded into the page cache
- Performed during:
  - Sequential reads on files
  - Directory reads
- The VFS contains the logic to perform readahead effectively
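
A toy model of the effect, assuming a fixed readahead window and an unbounded cache. The real kernel logic is more elaborate, and `count_disk_requests` is a made-up helper for this sketch only.

```python
def count_disk_requests(access_pattern, readahead_pages):
    """Count requests sent to the disk for a sequence of page reads.

    On a cache miss, the requested page plus the next `readahead_pages`
    pages are loaded with a single request.
    """
    cache = set()
    requests = 0
    for page in access_pattern:
        if page not in cache:
            requests += 1
            cache.update(range(page, page + readahead_pages + 1))
    return requests


sequential = list(range(16))
without_readahead = count_disk_requests(sequential, 0)  # one request per page
with_readahead = count_disk_requests(sequential, 3)  # one request per 4 pages
```

For a purely sequential scan, the number of disk requests drops by roughly the window size. For random access patterns the extra pages are mostly wasted work.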


Example File System Operations

[Figure: an ext3 file system mounted at /, containing the directories mnt, home, etc, …]


Example File System Operations

[Figure: the ext3 tree (/, with mnt, home, etc, …), now with an xfs file system mounted on home, containing the directory avishay]

mount -t xfs /dev/sdb1 /home

The VFS mount operation:
1) Calls the xfs get_sb function to read the superblock from the partition
2) This function also reads the inode of the root directory

Note that performing a lookup on 'home' would have previously invoked ext3, but now it is xfs. Any files/directories in 'home' on ext3 will now be hidden by 'home' on xfs.


Example File System Operations

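[Figure: the ext3 tree with xfs mounted on home (containing avishay), now also with an isofs file system mounted on mnt/cdrom, containing the file foo]

mount -t xfs /dev/sdb1 /home
mount -t iso9660 /dev/hdc1 /mnt/cdrom

A similar sequence of events occurs here, this time mounting an isofs file system on a CD-ROM drive (the isofs code is selected with the mount type name iso9660).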


Example File System Operations

[Figure: the ext3 tree with xfs mounted on home and isofs mounted on mnt/cdrom; foo is on the CD-ROM and bar is the new copy under avishay]

mount -t xfs /dev/sdb1 /home
mount -t iso9660 /dev/hdc1 /mnt/cdrom
cp /mnt/cdrom/foo /home/avishay/bar

Lookup operations will be performed on all 3 file systems. The copy operation will read from 'foo' (isofs) and write to 'bar' (xfs). The VFS determines which file system to invoke.
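
How the right file system gets picked can be approximated with a longest-prefix match over a mount table. This is only a sketch: the kernel crosses mount points while walking the path component by component rather than comparing strings, and `MOUNTS` and `filesystem_for` are hypothetical names mirroring the example above.

```python
# Hypothetical mount table mirroring the slides: mount point -> file system.
MOUNTS = {
    "/": "ext3",
    "/home": "xfs",
    "/mnt/cdrom": "isofs",
}


def filesystem_for(path):
    """Return the file system mounted deepest along an absolute path."""
    best = "/"
    for mount_point in MOUNTS:
        prefix = mount_point.rstrip("/") + "/"
        covers = path == mount_point or path.startswith(prefix)
        if covers and len(mount_point) > len(best):
            best = mount_point
    return MOUNTS[best]


source_fs = filesystem_for("/mnt/cdrom/foo")
destination_fs = filesystem_for("/home/avishay/bar")
```

In this model the cp command reads through the isofs operations and writes through the xfs operations, while /mnt itself still belongs to ext3.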


File System Layout

Some considerations:
- Minimize seeks between metadata and related data
- Minimize number of disk reads required to get to data
- Maximize readahead (sequential access)
- Recovery from disk corruption, power outage, etc.
- Management: fragmentation, compaction, etc.


Contiguous Allocation

- Files are allocated contiguously on the disk
- Space for the entire file must be requested in advance
- Search a bitmap or linked list to locate a space

Pros
- Fast sequential access
- Easy random access

Cons
- External fragmentation
- Hard to grow files: may have to move (large) files
- May need compaction

[Figure: disk layout with contiguously allocated files A through E]
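
The bitmap search and the external fragmentation problem can both be shown with a first-fit sketch. The `first_fit` helper and the example bitmap are made up for illustration (1 = used, 0 = free). Real allocators may use other policies.

```python
def first_fit(bitmap, length):
    """Return the start of the first run of `length` free blocks, or None."""
    run_start, run_length = None, 0
    for block, used in enumerate(bitmap):
        if used:
            run_start, run_length = None, 0  # run broken by a used block
            continue
        if run_start is None:
            run_start = block
        run_length += 1
        if run_length == length:
            return run_start
    return None


bitmap = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
free_blocks = bitmap.count(0)
two_block_file = first_fit(bitmap, 2)
three_block_file = first_fit(bitmap, 3)
four_block_file = first_fit(bitmap, 4)
```

Here six blocks are free in total, yet a four-block file cannot be placed because no free run is long enough. That is external fragmentation.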


Linked Files (Alto)

- Each file is a linked list
- File header (like an inode) points to the first block on disk
- Each block points to the next

Pros
- Can grow files dynamically
- Free list is similar to a file
- No external fragmentation or need to move files

Cons
- Random access is horrible
- Even sequential access needs one seek per block
- Unreliable: losing one block means losing the rest

[Figure: file header pointing to file block 1, with blocks linked onward to file block N]


File Allocation Table (FAT)

- Table of “next pointers”, indexed by block
- Dentry points to the first block of the file
- Two copies of the FAT, at the beginning of the volume

Pros
- Faster random access
- Cache the FAT and traverse it in memory

Cons
- The FAT may be too large to cache, leading to long seeks
- Pointers for all files are interspersed in the FAT
  - Need the full table in memory, even for one file
- Solution: indexed files
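
The table of next pointers can be modeled as a mapping from block number to the following block. The sketch below is illustrative only: the block numbers, the `END_OF_CHAIN` marker, and the helper names are hypothetical, and real FAT variants use reserved table values rather than -1.

```python
END_OF_CHAIN = -1  # hypothetical end marker for this sketch

# fat[b] is the block that follows block b in its file.
fat = {
    4: 7, 7: 2, 2: 9, 9: END_OF_CHAIN,  # file whose first block is 4
    5: 6, 6: END_OF_CHAIN,              # file whose first block is 5
}


def blocks_of(first_block):
    """Follow next pointers from the first block to the end of the file."""
    chain = []
    block = first_block
    while block != END_OF_CHAIN:
        chain.append(block)
        block = fat[block]
    return chain


def block_at(first_block, index):
    """Random access: walk the chain in the table, not on the data blocks."""
    return blocks_of(first_block)[index]
```

Compared with linked files, the walk touches only the table, which is why caching it in memory makes random access faster. The entries of the two files are also interleaved in one table, as the Cons list notes.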


Single-Level Indexed Files

- User declares maximum file size
- A file header holds an array of pointers to disk blocks

Pros
- Random access is fast
- Better metadata caching than FAT

Cons
- Clumsy to grow beyond the limit
- Many seeks

[Figure: file header holding pointers to disk blocks]


ext2: Block Groups

[Figure: disk layout. A boot block is followed by block group 0, block group 1, ..., block group n. Each block group contains: superblock, group descriptors, data block bitmap, inode bitmap, inode table, data blocks.]

Improved reliability
- Control structures are replicated
- Easy to recover the superblock

Improved performance
- Reduces the distance between the inodes and related data blocks
- It is possible to reduce the disk head seeks during I/O on files


ext2: Multi-Level Indexed Files

The inode contains 15 pointers:
- 12 direct pointers
- 13: 1-level indirect
- 14: 2-level indirect
- 15: 3-level indirect

Pros & Cons
- In favor of small files
- Can grow
- Lots of seeking (somewhat limited by block groups)

ext3: same on-disk format plus journal (covered later)

[Figure: an inode whose pointers 1, 2, ..., 12 point directly to data blocks, while pointers 13, 14, and 15 reach data blocks through one, two, and three levels of indirection]
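
The pointer scheme translates into simple arithmetic. The sketch below assumes 4KB blocks and 4-byte block pointers, so 1024 pointers fit in one indirect block. The function names are made up, and real ext2 has additional limits that this arithmetic ignores.

```python
DIRECT = 12  # number of direct pointers in the inode


def indirection_level(logical_block, block_size=4096, pointer_size=4):
    """Return 0 for a direct pointer, or 1-3 for the indirect level used."""
    per_block = block_size // pointer_size  # pointers per indirect block
    if logical_block < DIRECT:
        return 0
    remaining = logical_block - DIRECT
    for level in (1, 2, 3):
        capacity = per_block ** level  # blocks reachable through this level
        if remaining < capacity:
            return level
        remaining -= capacity
    raise ValueError("block is beyond what the 15 pointers can map")


def max_mappable_blocks(block_size=4096, pointer_size=4):
    """Blocks reachable through direct plus 1-, 2-, and 3-level pointers."""
    per_block = block_size // pointer_size
    return DIRECT + per_block + per_block ** 2 + per_block ** 3
```

Under these assumptions only the first 48KB of a file (12 blocks of 4KB) is reachable without reading an indirect block first, which is the sense in which the layout favors small files.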


ext4/xfs/Btrfs: Extents & Trees

- Extent: set of logically contiguous blocks within a file that are stored contiguously on disk
- Single ext4 extent: up to 128MB with 4KB block size
- Less metadata: only need to remember <1st logical block, # blocks, 1st physical block>
- xfs and Btrfs store extents in B-tree variants
- These are newer and very interesting Linux disk-based file systems and have become more “standard”
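
An extent lookup can be sketched as a search over a sorted list of <1st logical block, # blocks, 1st physical block> triples. A flat sorted list stands in here for the tree structures mentioned above, and the extent values and names are hypothetical. The last line checks the 128MB figure: 128MB / 4KB = 32768 blocks per extent.

```python
from bisect import bisect_right
from collections import namedtuple

Extent = namedtuple("Extent", "logical_start length physical_start")

# Hypothetical extent list for one file, sorted by logical_start.
extents = [
    Extent(0, 100, 5000),
    Extent(100, 50, 9000),
    Extent(150, 32768, 20000),
]


def physical_block(logical_block):
    """Map a logical block of the file to its physical block on disk."""
    starts = [e.logical_start for e in extents]
    i = bisect_right(starts, logical_block) - 1  # last extent starting <= block
    if i < 0:
        raise ValueError("block is not mapped by any extent")
    e = extents[i]
    if logical_block >= e.logical_start + e.length:
        raise ValueError("block is not mapped by any extent")
    return e.physical_start + (logical_block - e.logical_start)


MAX_EXTENT_BYTES = 32768 * 4096  # 32768 blocks of 4KB each
```

Three triples describe 32918 blocks in this example. A block-pointer scheme would need one pointer per block for the same file.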


Log-structured File System

Will be covered separately tomorrow


File System Corruption

- Some FS operations require multiple writes, which may not all complete (power failure, crash)
- If they do not all complete, the on-disk state will be invalid on the next mount
- Example: to write to a file, 3 main operations:
  1. Write data to disk block
  2. Update the free space map
  3. Update pointer from inode to block
- With no help, detecting and recovering from errors requires examining all data structures
  - In Linux, this is done by fsck (file system check)
- This was acceptable in the past, but takes too long for larger file systems


Journaling

- Journal: a special file that logs the changes destined for the file system in a circular buffer
- Idea: use a journal to log changes before they're committed to the file system to avoid metadata corruption
- Examples: JFS/JFS2, ext3/4, XFS, ReiserFS
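
The log-before-apply idea can be illustrated with a toy model. This is a sketch under simplifying assumptions: a Python list replaces the circular buffer, a dict replaces the on-disk structures, and the class and method names are invented for illustration. It does not reflect how any of the listed file systems implement their journals.

```python
class JournaledStore:
    """Toy store: every update is logged and committed before it is applied."""

    def __init__(self):
        self.disk = {}     # stands in for the in-place on-disk structures
        self.journal = []  # logged transactions, oldest first

    def begin(self, updates):
        txn = {"updates": dict(updates), "committed": False}
        self.journal.append(txn)
        return txn

    def commit(self, txn):
        txn["committed"] = True  # from here on, the transaction must survive

    def checkpoint(self, txn, crash_after=None):
        """Apply a committed transaction in place; optionally crash part way."""
        for n, (key, value) in enumerate(txn["updates"].items()):
            if crash_after is not None and n >= crash_after:
                return False  # simulated crash: journal entry is still present
            self.disk[key] = value
        self.journal.remove(txn)  # fully applied, journal space can be reused
        return True

    def recover(self):
        """After a crash: replay committed transactions, discard the rest."""
        for txn in self.journal:
            if txn["committed"]:
                self.disk.update(txn["updates"])
        self.journal.clear()


store = JournaledStore()
txn = store.begin({"free_map": "block 7 used", "inode 20": "points to block 7"})
store.commit(txn)
store.checkpoint(txn, crash_after=1)  # crash after the first in-place write
store.recover()
```

After the simulated crash only one of the two in-place updates has landed. Replaying the committed journal entry brings both structures to a consistent state without scanning everything the way fsck does.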


ext3 Journaling Modes

- Writeback: Only metadata is journaled. Data is written independently. Preserves file system structure and avoids corruption, but files may contain stale data (like ext2 + fast fsck).
- Ordered (default): Data is written to disk before metadata transactions commit → no stale data blocks.
- Journal: Journals all data and metadata, so data is written twice (same consistency guarantees as 'ordered', different performance).


References & Further Reading

References in this presentation refer to Linux 2.6.35: http://lxr.linux.no/#linux+v2.6.35/

Further reading
- Linux Kernel Development (Love): good for overview; 3rd edition recently published
  - 2nd edition: http://linuxkernel2.atw.hu/ (hopefully posted with the author's permission...)
- Understanding the Linux Kernel (Bovet & Cesati): good for reference
- btrfs: http://lwn.net/Articles/342892/

Some of the content in these slides is taken from:
- http://www.cs.princeton.edu/courses/archive/fall09/cos318/lectures/FileLayout.pdf
- http://www.ntfs.com/fat-allocation.htm
- http://www.ibm.com/developerworks/library/l-journaling-filesystems/index.html
- Tel-Aviv University advanced storage course slides by Ronen Kat and Ohad Rodeh
- Various Wikipedia articles
- http://static.usenix.org/event/usenix05/tech/general/full_papers/prabhakaran/prabhakaran_html/main.html