Operating Systems: File System Reliability and Performance

Pedro F. Souto ([email protected])

May 25, 2012

Source: web.fe.up.pt/~pfs/aulas/so2013/at/13fs.pdf


Outline

Reliability

Performance

Virtual File System (VFS)

Further Reading


File System Reliability

- Users expect data on disk to persist until they explicitly change it
- Different events contribute to file systems failing those expectations:

  Disk failures: Disks are fragile electromechanical devices with a relatively short lifetime (about 5 years)
    - Google has reported failure rates of 2% per year
  Human errors: Many users type faster than they think
    - Windows uses the recycle bin
    - In Unix/Linux one can change rm, e.g. in bash with GNU mv (the directory /tmp/${LOGNAME} must already exist):

        alias rm='mv -i -t /tmp/${LOGNAME}'

  System failures: caused by power failures or crashes

- Backups can address the first two problems
  - Disk failures can also be addressed by redundant media such as RAID


System Failures and FS Reliability

Facts:
1. File systems cache data and metadata in main memory
   - They use write-back rather than write-through
2. Some metadata updates require changing more than one disk sector

Problem: System failures (that do not damage the media) may
- Lead to loss of data that has not made it to disk
- Lead to inconsistency of file system data structures on disk
  - Some sectors are updated but others are not

Example: File creation
1. Allocate an inode and initialize it
2. Allocate a directory entry and make it point to the inode

If the system goes down after the directory entry is written to disk but before the inode is written, the file system becomes inconsistent
- What if the writes to disk are done in the inverse order?
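The effect of write order can be explored with a toy model. The sketch below is only an illustration under simplifying assumptions: the "disk" is a Python dictionary, each of the two metadata writes is atomic, and the inode number 7 and the name notes.txt are made up.

```python
# Toy model of file creation as two separate disk writes, with a crash in between.
# "disk" holds only what reached the media; all names and numbers are hypothetical.

def create_file(order, crash_after):
    """Apply the two metadata writes in `order`, stopping after `crash_after` writes."""
    disk = {"inodes": {}, "directory": {}}
    writes = {
        "inode": lambda: disk["inodes"].update({7: "initialized"}),
        "dirent": lambda: disk["directory"].update({"notes.txt": 7}),
    }
    for step in order[:crash_after]:
        writes[step]()
    return disk


def check(disk):
    """Classify the on-disk state found after the crash."""
    referenced = set(disk["directory"].values())
    allocated = set(disk["inodes"])
    if referenced - allocated:
        return "dangling directory entry"  # a name points to an uninitialized inode
    if allocated - referenced:
        return "orphan inode"              # an inode is in use but no name reaches it
    return "consistent"


dirent_first = check(create_file(["dirent", "inode"], crash_after=1))
inode_first = check(create_file(["inode", "dirent"], crash_after=1))
complete = check(create_file(["inode", "dirent"], crash_after=2))
print(dirent_first, "|", inode_first, "|", complete)
```

In this model, writing the inode first still leaves an inconsistency after a crash (an allocated inode that nothing references), but no directory entry ever points to uninitialized metadata.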


File System Recovery

- Upon restart, if the FS was not cleanly shut down, the OS executes a utility (fsck/scandisk) that:
  - Checks the integrity of the FS
  - Tries to fix the inconsistencies found
- For example, in the case of the Unix FS, fsck checks, at least:
  - The bitmap of free blocks
  - The inodes and their reference counts
  by scanning the FS metadata (including directory entries)

It is also possible for a block to be both in use and in the free list.
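The two checks listed above can be sketched as follows. This is not how fsck is implemented; it is a minimal illustration that assumes the metadata has already been read into plain Python dictionaries, and the function names check_blocks and check_link_counts are hypothetical.

```python
from collections import Counter


def check_blocks(n_blocks, inode_blocks, free_list):
    """Compare block usage recorded in inodes against the free list.

    inode_blocks: dict mapping inode number -> list of block numbers it uses
    free_list:    iterable of block numbers recorded as free
    """
    in_use = Counter(b for blocks in inode_blocks.values() for b in blocks)
    free = Counter(free_list)
    problems = {"missing": [], "used_and_free": [], "multiply_used": [], "multiply_free": []}
    for b in range(n_blocks):
        if in_use[b] == 0 and free[b] == 0:
            problems["missing"].append(b)        # neither in a file nor free
        if in_use[b] > 0 and free[b] > 0:
            problems["used_and_free"].append(b)  # in use and in the free list
        if in_use[b] > 1:
            problems["multiply_used"].append(b)  # claimed by more than one file
        if free[b] > 1:
            problems["multiply_free"].append(b)  # listed as free twice
    return problems


def check_link_counts(inode_links, directories):
    """Return {inode: (stored count, counted references)} for every mismatch."""
    seen = Counter(ino for entries in directories.values() for ino in entries.values())
    return {ino: (stored, seen[ino]) for ino, stored in inode_links.items() if stored != seen[ino]}


block_report = check_blocks(
    n_blocks=8,
    inode_blocks={1: [0, 1], 2: [2, 3], 3: [3]},
    free_list=[1, 4, 5, 5, 6],
)
link_report = check_link_counts(
    inode_links={1: 1, 2: 2, 3: 1},
    directories={"/": {"a": 1, "b": 2}, "/d": {"c": 3, "d": 3}},
)
print(block_report)
print(link_report)
```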


Reducing File System Inconsistencies

FS inconsistencies cannot be avoided (in the "Unix FS")
- Even if the FS uses synchronous writes for metadata updates

Challenges:
  Asynchrony: A system failure may happen at any time
  Recovery: Metadata must be updated in the right order to:
    - Allow recovery
    - Avoid a full disk scan
  Performance: Synchronous writes hurt performance

Goals are to reduce:
- The metadata update overhead during normal operation
- The recovery time at startup after a system failure

Solutions:
- Enforcing order in metadata updates, taking advantage of metadata semantics
  - FS dependent, but usually very hard


Reducing File System Inconsistencies with Logs

Idea: Use logs, like transactions in databases. Indeed, we want disk metadata updates to be:
  Atomic: i.e. either all of them are performed, or none are
  Consistent: i.e. they must preserve system invariants
  Isolated: i.e. as if metadata updates were executed by a single thread
  Durable: i.e. they should persist until modified by other metadata updates
These are known as the ACID properties of transactions

Advantage: Systematic approach using a very mature technology

Variations: Practically all modern FS use logs
  What is logged? Is data also logged?
  How is (meta)data logged? Values vs. operations
  Does the log contain all FS data and metadata? Log-structured vs. journaled FS
  Type of log: redo (write-ahead) vs. undo log
  Guarantees: fully transactional or only ordering
    - Some do not ensure isolation
    - May still have consistency problems


Metadata-only Write-Ahead (Redo) Log

Data structures:
  Log: An append-only file (on disk)
    - Its tail may be in main memory
  FS metadata: On disk
    - Cached in main memory

Operation: Metadata updates are grouped in transactions, i.e. sequences of updates that must have ACID properties
- Update the cached metadata
- Add entries with the updates at the log tail in main memory
  - They must contain enough information to be able to redo them
- At the end, add an "end of transaction" entry to the log
  - An alternative is to use a single log entry per transaction

Disk log: Log entries must be written to disk before the cached metadata
- Either at the end of each transaction
- Or when convenient
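The operation described above can be sketched in a few lines. This is a toy model, not any real file system's journal: the disk is represented by two Python containers, log entries record new values (one of the variations mentioned earlier), and the class name MetadataWAL and its methods are hypothetical.

```python
class MetadataWAL:
    """Toy redo log; 'disk_*' attributes stand for what has reached the disk."""

    def __init__(self):
        self.disk_log = []       # log entries already on disk
        self.disk_metadata = {}  # metadata already on disk
        self.log_tail = []       # log entries still in main memory
        self.cache = {}          # cached metadata, possibly dirty
        self.dirty = set()
        self.next_txid = 1

    def transaction(self, updates):
        """Apply a group of updates {key: new_value} as one transaction."""
        txid = self.next_txid
        self.next_txid += 1
        for key, value in updates.items():
            self.cache[key] = value                             # update the cached metadata
            self.dirty.add(key)
            self.log_tail.append(("update", txid, key, value))  # new value is enough to redo
        self.log_tail.append(("end", txid))                     # end-of-transaction entry
        return txid

    def flush_log(self):
        """Append the in-memory tail to the on-disk log."""
        self.disk_log.extend(self.log_tail)
        self.log_tail.clear()

    def write_back(self):
        """Write dirty metadata to disk, forcing the log out first (write-ahead rule)."""
        self.flush_log()
        for key in self.dirty:
            self.disk_metadata[key] = self.cache[key]
        self.dirty.clear()


wal = MetadataWAL()
wal.transaction({"inode 7": "allocated", "dir /tmp": "notes.txt -> 7"})
log_before = list(wal.disk_log)  # nothing has been forced to disk yet
wal.write_back()
print(log_before, wal.disk_log)
```

Calling flush_log() at the end of every transaction corresponds to the first option above; deferring it until write_back() corresponds to "when convenient".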


Metadata-only Redo Log: Recovery

Idea: Reconstruct the cached metadata by scanning the log and applying its entries
Problem: If the log is large, this may take too long
Solution: Checkpoint the metadata on disk
- This is a consistent snapshot of the metadata
and keep track of the first log entry whose update is not in that checkpoint
- This also prevents the log from growing too large
- Log entries for transactions that made it to disk can be freed

Recovery becomes a two-step process:
1. Read the most recent metadata checkpoint from disk
2. Apply all the entries in the log for transactions that terminated since that checkpoint

Why does this work?
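The two recovery steps can be illustrated with a short sketch. It assumes that log entries carry new values, that entries have the shapes ("update", txid, key, value) and ("end", txid), and that checkpoints are taken at transaction boundaries; the function name recover and the sample data are made up.

```python
def recover(checkpoint, log, first_unapplied):
    """Rebuild the metadata from a checkpoint plus the redo log.

    checkpoint:      dict with a consistent metadata snapshot read from disk
    log:             list of ("update", txid, key, value) and ("end", txid) entries
    first_unapplied: index of the first log entry not reflected in the checkpoint
    """
    tail = log[first_unapplied:]
    committed = {entry[1] for entry in tail if entry[0] == "end"}
    metadata = dict(checkpoint)
    for entry in tail:
        if entry[0] == "update" and entry[1] in committed:
            _, _, key, value = entry
            metadata[key] = value  # redo: installing a value twice changes nothing
    return metadata


log = [
    ("update", 1, "inode 3", "allocated"),
    ("end", 1),                                # transaction 1 is in the checkpoint
    ("update", 2, "inode 7", "allocated"),
    ("update", 2, "dir /", "notes.txt -> 7"),
    ("end", 2),                                # transaction 2 terminated: redo it
    ("update", 3, "inode 9", "allocated"),     # transaction 3 has no "end": ignore it
]
checkpoint = {"inode 3": "allocated", "inode 7": "free"}

recovered = recover(checkpoint, log, first_unapplied=2)
recovered_again = recover(recovered, log, first_unapplied=2)
print(recovered)
```

One way to look at the question above: in this sketch, updates of unterminated transactions are never applied, which gives atomicity, and replaying the same entries again yields the same result, so a crash during recovery is harmless.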


Metadata-only Redo Log: Assessment

Advantages:
  Recovery: There is no need to scan and check the entire FS metadata. Need only:
    - Scan the log since the last checkpoint
    - Replay it
      - Must read the metadata that was changed since then
  Normal operation: Log entries are appended at the end of the log
    - Writing to disk may be deferred
    - Minimizes seeks

Disadvantages:
- Log requires extra space
- Metadata updates are written to disk more than once
- Log cleanup adds overhead
- Optimizing log performance is not trivial

What about the data?
- Programmers can invoke fsync()/fdatasync()
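From an application, forcing file data to disk might look like the sketch below. It uses Python's os.fsync(), which wraps the fsync() system call; the helper name durable_write and the file name are made up. os.fdatasync() can be used the same way on platforms that provide it.

```python
import os
import tempfile


def durable_write(path, data):
    """Write bytes and return only after the kernel was asked to push them to the device."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # user-space buffer -> kernel buffer cache
        os.fsync(f.fileno())  # kernel buffer cache -> storage device


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "record.bin")
    durable_write(path, b"balance=42\n")
    with open(path, "rb") as f:
        read_back = f.read()
print(read_back)
```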


Performance

Problem: Disks are too slow
Solution:
  Avoid disk access
  - Cache metadata and data in memory
    - Often, data has to be read from disk
    - To ensure data persistency, data has to be written to disk
  Avoid seeks when disk access is unavoidable
  - Try to put close together on disk:
    - Data that belongs to the same file
    - Data and metadata for the same file

Problem: Fine-tuning these techniques is very hard
- File systems and disks are complex
- File sizes and access patterns vary widely


Cache

What to cache? Everything that can be frequently reused
  Data blocks: i.e. disk blocks with data, a.k.a. the buffer cache
  Inodes: of opened files
  Directory names: but not the on-disk blocks/inodes of directories
  Indirect blocks: i.e. disk blocks with pointers to data blocks

How to manage the buffer cache? Can use pure LRU

[Figure: hash table and list of cached blocks, ordered from Front (LRU) to Rear (MRU)]

... almost. Ensuring consistency across system failures may prevent it.
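A pure LRU buffer cache, as in the figure, can be sketched with Python's OrderedDict, which plays the role of both the hash table and the LRU list. The class name BufferCache and the read_block callback standing in for the disk are hypothetical.

```python
from collections import OrderedDict


class BufferCache:
    """Fixed-size cache of disk blocks with pure LRU replacement."""

    def __init__(self, capacity, read_block):
        self.capacity = capacity
        self.read_block = read_block  # callable: block number -> bytes (the "disk")
        self.blocks = OrderedDict()   # front = least recently used, rear = most recently used
        self.hits = 0
        self.misses = 0

    def get(self, block_no):
        if block_no in self.blocks:
            self.hits += 1
            self.blocks.move_to_end(block_no)    # becomes the most recently used
        else:
            self.misses += 1
            if len(self.blocks) >= self.capacity:
                self.blocks.popitem(last=False)  # evict from the front (LRU)
            self.blocks[block_no] = self.read_block(block_no)
        return self.blocks[block_no]


cache = BufferCache(capacity=2, read_block=lambda n: b"block %d" % n)
for n in (1, 2, 1, 3, 2):
    cache.get(n)
print(cache.hits, cache.misses, list(cache.blocks))
```

The sketch ignores dirty blocks. A real cache may need to write some modified metadata blocks early or in a particular order, which is one reason pure LRU is not always applicable.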


Cache Management

How large should the cache be? Difficult to say ...

In systems with VM, use integrated buffer management, i.e. any frame can be used either for VM pages or for the buffer cache, as needed. For example, Linux:

    $ top
    [...]
    Mem:   4048160k total,  1672080k used,  2376080k free,    45560k buffers
    Swap:  4000180k total,  1582636k used,  2417544k free,   348752k cached
    [...]
    pedro@ceuta:~/tmp/snapshots$ free
                 total       used       free    buffers     cached
    Mem:       4048160    1672804    2375356      45592     348860
    -/+ buffers/cache:    1278352    2769808
    Swap:      4000180    1582628    2417552

buffers is the buffer cache
cached is the page cache, i.e. in-memory copies of file data (top prints it on the Swap line, but it is not a cache of swap)

These can be freed if the system needs more pages, hence the 2nd line in free's output
- This is useful, because in this system the swap area is smaller than the physical memory, and hibernation ...
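The second line of free's output can be recomputed from the first. The snippet below only redoes that arithmetic with the figures shown above (in KiB); the variable names are arbitrary.

```python
# Figures (in KiB) copied from the "Mem:" line of the free output above.
total, used, free_mem = 4048160, 1672804, 2375356
buffers, cached = 45592, 348860

used_without_caches = used - buffers - cached      # memory that cannot simply be dropped
free_with_caches = free_mem + buffers + cached     # free now, plus reclaimable caches

print(used_without_caches, free_with_caches)       # the "-/+ buffers/cache" line
```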


Buffer Cache and Reads/Writes

Reads:
- Prefetch, i.e. read blocks ahead
  - Works with sequential access
  - Usually, disk controllers cache entire tracks in the disk's cache
- Why not free-behind/replace-behind?
  - Discard a buffer from the cache when the next one is requested

Writes:
  Synchronous writes: write the block to disk immediately
    - No data loss
  Deferred writes: write the block later
    - May lead to fewer disk writes
      - If a block is modified several times between writes to disk
      - Temporary files may not even go to disk
    - Allows further performance gains, by disk scheduling
    - Applications may flush the cache by invoking fsync()
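The saving from deferred writes can be made concrete by counting disk writes in a toy model. The classes Disk and WriteCache below are hypothetical, and writing dirty blocks in sorted order is only a stand-in for real disk scheduling.

```python
class Disk:
    """Counts how many block writes reach the device."""

    def __init__(self):
        self.blocks = {}
        self.writes = 0

    def write(self, block_no, data):
        self.blocks[block_no] = data
        self.writes += 1


class WriteCache:
    def __init__(self, disk, deferred):
        self.disk = disk
        self.deferred = deferred
        self.dirty = {}  # block number -> latest data not yet on disk

    def write(self, block_no, data):
        if self.deferred:
            self.dirty[block_no] = data      # replaces any earlier pending version
        else:
            self.disk.write(block_no, data)  # synchronous: straight to disk

    def fsync(self):
        for block_no in sorted(self.dirty):  # sorted order stands in for disk scheduling
            self.disk.write(block_no, self.dirty[block_no])
        self.dirty.clear()


def run(deferred):
    disk = Disk()
    cache = WriteCache(disk, deferred)
    for i in range(5):
        cache.write(9, f"version {i}")       # same block modified five times
    cache.write(4, "other block")
    cache.fsync()
    return disk.writes, disk.blocks[9]


sync_writes, sync_final = run(deferred=False)
deferred_writes, deferred_final = run(deferred=True)
print(sync_writes, deferred_writes)
```

Both runs end with the same contents on disk, but in the deferred case anything still in the dirty table would be lost if the system failed before fsync().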


Performance: Avoiding Seeks (1/2)

- Access to even a small file requires reading at least two blocks
  - The file inode
  - The file data block
- Performance may be improved by locating a file's metadata close to its data

[Figure: (a) I-nodes are located near the start of the disk. (b) Disk is divided into cylinder groups, each with its own i-nodes.]

- What about multi-platter disks?

Anyway, nowadays disk controllers hide the disk geometry


Performance: Avoiding Seeks (2/2)

- Keep data blocks of the same file sequentially on disk. Use extents
  Extent: a set of consecutive blocks on disk
  - Extent sizes range from 128 KiB (kibi = 2^10) to several MiB
  - Space in each extent is allocated sequentially
  - For each extent, keep only the first block number and its length
- When a new file is created, allocate an extent rather than a single block for it
  - As the file grows, use the remaining space in the extent
  - If the extent runs out of space, allocate another extent

src: Getting to know the Solaris filesystem, Part 1
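Extent-based allocation and the resulting block lookup can be sketched as follows. The sketch assumes fixed-size extents of 32 blocks (128 KiB with 4 KiB blocks); the class ExtentFile, the allocator callback and the starting block numbers are all made up.

```python
BLOCKS_PER_EXTENT = 32  # assumed: 32 blocks of 4 KiB = 128 KiB per extent


class ExtentFile:
    def __init__(self, allocate_extent):
        self.allocate_extent = allocate_extent  # callable returning the first block of a free extent
        self.extents = []                       # list of (first_block, length_in_blocks)
        self.n_blocks = 0                       # blocks currently used by the file

    def append_block(self):
        """Grow the file by one block, allocating a new extent only when needed."""
        if self.n_blocks == len(self.extents) * BLOCKS_PER_EXTENT:
            self.extents.append((self.allocate_extent(), BLOCKS_PER_EXTENT))
        self.n_blocks += 1

    def physical_block(self, logical_block):
        """Map a block offset within the file to a disk block number."""
        if not 0 <= logical_block < self.n_blocks:
            raise IndexError(logical_block)
        for first, length in self.extents:
            if logical_block < length:
                return first + logical_block
            logical_block -= length


free_extents = iter([1024, 4096, 512])  # made-up starting blocks of free extents
f = ExtentFile(lambda: next(free_extents))
for _ in range(40):
    f.append_block()
print(f.extents, f.physical_block(31), f.physical_block(32))
```

A 40-block file needs only two (first block, length) pairs here, instead of 40 individual block pointers.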


Other Issues (Not covered)

Disk caches: Nowadays disks have caches of tens of MiB
- And some disks write sectors to the media when they deem best, not when the OS tells them to do it
SSDs: won't be on servers for a while
- No seeks
- Access time gap is much shorter than for disk
Networked file systems: add the network, server and client side caches, consistency issues ...
- The design space is considerably larger


Virtual File System (VFS) Layer (1/2)

Problem: How to use different FS types on the same OS?
- ext2/ext3/ext4 and NTFS, for disk FS
- (V)FAT, on USB pens
- ISO9660 on CDs/DVDs
- NFS via the network
- /proc, for access to kernel structures

Solution: Add another layer on top of the disk stack

src: Anatomy of the Linux virtual file system switch

- The VFS layer is implemented with main memory data structures only
- The VFS layer was originally designed by Sun for NFS


Virtual File System (VFS) Layer (2/2)

Each file system: Must provide a uniform interface, i.e. a set of filesystem (e.g. mount()) and file/directory operations
- Just like character device drivers in the Linux kernel must implement a set of functions defined in struct file_operations

The VFS layer:
- Provides file system independent functionality
  - Validates system call parameters
  - Copies data to and from user space
  - Manages the directory name caches
- Maps system calls to the VFS operations that are implemented by the underlying FS

- In Linux, the buffer cache is in the Block Layer, between the different FS and the device drivers
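The dispatching role of the VFS layer can be illustrated with a small user-space analogy. This is not the Linux kernel interface: the abstract class FileSystem, the two toy file systems and the single read operation are hypothetical, chosen only to show a uniform interface plus a mount table.

```python
from abc import ABC, abstractmethod


class FileSystem(ABC):
    """Operations every concrete file system must implement (hypothetical interface)."""

    @abstractmethod
    def read(self, path):
        ...


class MemFS(FileSystem):
    """Stands in for a disk file system; files live in a dictionary."""

    def __init__(self, files):
        self.files = files

    def read(self, path):
        return self.files[path]


class ProcFS(FileSystem):
    """Stands in for /proc: contents are generated on demand."""

    def read(self, path):
        return f"generated on demand for {path}"


class VFS:
    def __init__(self):
        self.mounts = {}  # mount point -> file system object

    def mount(self, mount_point, fs):
        self.mounts[mount_point.rstrip("/") or "/"] = fs

    def read(self, path):
        # FS-independent work: validate the parameter before dispatching.
        if not isinstance(path, str) or not path.startswith("/"):
            raise ValueError("absolute path required")
        candidates = [m for m in self.mounts
                      if path == m or path.startswith(m.rstrip("/") + "/")]
        if not candidates:
            raise FileNotFoundError(path)
        mount_point = max(candidates, key=len)  # longest matching mount point wins
        relative = "/" + path[len(mount_point):].lstrip("/")
        return self.mounts[mount_point].read(relative)  # FS-specific implementation


vfs = VFS()
vfs.mount("/", MemFS({"/etc/motd": "hello"}))
vfs.mount("/proc", ProcFS())
motd = vfs.read("/etc/motd")
cpuinfo = vfs.read("/proc/cpuinfo")
print(motd, "|", cpuinfo)
```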


Further Reading

Sistemas Operativos
- Subsection 9.2.3: Estruturas de Suporte à Utilização dos Ficheiros
- Section 9.3: Linux
  - Starting at Subsection 9.3.2.3 (inclusive)

Modern Operating Systems, 2nd Ed.
- Sections 6.1 and 6.2: Files and Directories
- Section 6.3: File System Implementation
  - Subsections 6.3.6, 6.3.7 and 6.3.8