File-System Implementation (Galvin)


Outline

FILE-SYSTEM STRUCTURE

FILE-SYSTEM IMPLEMENTATION
o Overview
o Partitions and Mounting
o Virtual File Systems

DIRECTORY IMPLEMENTATION
o Linear List
o Hash Table

ALLOCATION METHODS
o Contiguous Allocation
o Linked Allocation
o Indexed Allocation
o Performance

FREE-SPACE MANAGEMENT
o Bit Vector
o Linked List
o Grouping
o Counting
o Space Maps

EFFICIENCY AND PERFORMANCE
o Efficiency
o Performance

RECOVERY
o Consistency Checking
o Log-Structured File Systems
o Other Solutions
o Backup and Restore

NFS (Optional)
o Overview
o The Mount Protocol
o The NFS Protocol
o Path-Name Translation
o Remote Operations

EXAMPLE: THE WAFL FILE SYSTEM (Optional--SKIPPED)

Contents

FILE-SYSTEM STRUCTURE

The file system resides permanently on secondary storage. This chapter is primarily concerned with issues surrounding file storage and access on the most common secondary-storage medium, the disk.

Hard disks have two important properties that make them suitable for secondary storage of files in file systems: (1) blocks of data can be rewritten in place, so it is possible to read a block from the disk, modify the block, and write it back into the same place; and (2) they are direct access, allowing any block of data to be accessed with only (relatively) minor movement of the disk heads and rotational latency. (Disks are usually accessed in physical blocks, each consisting of one or more sectors, rather than a byte at a time. Block sizes may range from 512 bytes to 4 KB or larger.)


To provide efficient and convenient access to the disk, the OS imposes one or more file systems to allow the data to be stored, located, and retrieved easily. One of the design problems a file system poses is creating algorithms and data structures to map the logical file system onto the physical secondary-storage devices.

The file system itself is generally composed of many different levels. The structure shown in Figure 11.1 is an example of a layered design, where each level in the design uses the features of lower levels to create new features for use by higher levels.

File systems organize storage on disk drives, and can be viewed as a layered design:

o At the lowest layer are the physical devices, consisting of the magnetic media, motors & controls, and the electronics connected to them and controlling them. Modern disks put more and more of the electronic controls directly on the disk drive itself, leaving relatively little work for the disk controller card to perform.

o I/O Control consists of device drivers, special software programs (often written in assembly) which communicate with the devices by reading and writing special codes directly to and from memory addresses corresponding to the controller card's registers. Each controller card (device) on a system has a different set of addresses (registers, a.k.a. ports) that it listens to, and a unique set of command codes and result codes that it understands. (Book: The I/O control is the lowest level and consists of device drivers and interrupt handlers to transfer information between the main memory and the disk system. A device driver can be thought of as a translator. Its input consists of high-level commands such as "retrieve block 123". Its output consists of low-level, hardware-specific instructions that are used by the hardware controller, which interfaces the I/O device to the rest of the system. The device driver usually writes specific bit patterns to special locations in the I/O controller's memory to tell the controller which device location to act on and what actions to take.)

o The basic file system level works directly with the device drivers in terms of retrieving and storing raw blocks of data, without any consideration for what is in each block. Depending on the system, blocks may be referred to with a single block number (e.g. block # 234234) or with head-sector-cylinder combinations. (Book: The basic file system needs only to issue generic commands to the appropriate device driver to read and write physical blocks on the disk. Each physical block is identified by its numeric disk address, for example, drive 1, cylinder 73, track 2, sector 10.)

o The file organization module knows about files and their logical blocks, and how they map to physical blocks on the disk. In addition to translating from logical to physical blocks, the file organization module also maintains the list of free blocks, and allocates free blocks to files as needed. (Book: The file-organization module knows about files and their logical blocks, as well as physical blocks. By knowing the type of file allocation used and the location of the file, the file-organization module can translate logical block addresses to physical block addresses for the basic file system to transfer. Each file's logical blocks are numbered from 0 (or 1) through N. Since the physical blocks containing the data usually do not match the logical numbers, a translation is needed to locate each block. The file-organization module also includes the free-space manager, which tracks unallocated blocks and provides these blocks to the file-allocation module when requested.)
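As a minimal sketch, the translation step described above can be reduced to a lookup from logical block number to physical block number. The in-memory map below is hypothetical; a real file-organization module derives it from whichever allocation method (contiguous, linked, or indexed) is in use, as covered later in this chapter.

```python
# Hypothetical per-file map: logical block number -> physical block number.
# A real file-organization module computes this from the allocation method.
file_map = {0: 217, 1: 618, 2: 339}

def translate(logical_block):
    """Return the physical block holding the file's logical block."""
    if logical_block not in file_map:
        raise ValueError(f"logical block {logical_block} not allocated")
    return file_map[logical_block]
```

For a contiguously allocated file the map degenerates to `physical = start + logical`; the scattered mappings shown here are what linked and indexed allocation produce.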

o The logical file system deals with all of the metadata associated with a file (UID, GID, mode, dates, etc.), i.e. everything about the file except the data itself. This level manages the directory structure and the mapping of file names to file control blocks, FCBs, which contain all of the metadata as well as block number information for finding the data on the disk. (IBM KnowledgeCenter: The logical file system is the level of the file system at which users can request file operations by system call. This level of the file system provides the kernel with a consistent view of what might be multiple physical file systems and multiple file system implementations. As far as the logical file system is concerned, file system types, whether local, remote, or strictly logical, and regardless of implementation, are indistinguishable.) (Book: The logical file system manages metadata information. Metadata includes all of the file-system structure except the actual data (or contents of the files). The logical file system manages the directory structure to provide the file-organization module with the information the latter needs, given a symbolic file name. It maintains file structure via FCBs. An FCB contains information about the file, including ownership, permissions, and location of the file contents. The logical file system is also responsible for protection and security.)

The layered approach to file systems means that much of the code can be used uniformly for a wide variety of different file systems, and only certain layers need to be filesystem specific. (Book: When a layered structure is used for file-system implementation, duplication of code is minimized. The I/O control and sometimes the basic file-system code can be used by multiple file systems. Each file system can then have its own logical file system and file-organization modules.)

Most operating systems support more than one file system. In addition to removable-media file systems, each OS has one or more disk-based file systems. UNIX uses the UNIX file system (UFS), which is based on the Berkeley Fast File System (FFS). Windows NT, 2000, and XP support the disk file-system formats FAT, FAT32, and NTFS (the Windows NT File System), as well as CD-ROM, DVD, and floppy-disk file-system formats. Although Linux supports over 40 different file systems, the standard Linux file system is known as the extended file system, the most common versions being ext2 and ext3.

File System Implementation

As was described in Section 10.1.2, operating systems implement open() and close() system calls for processes to request access to file contents. In this section, we delve into the structures and operations used to implement file-system operations.

Overview


Several on-disk and in-memory structures are used to implement a file system. These structures vary depending on the OS and the file system, but some general principles apply. On disk, the file system may contain information about how to boot an operating system stored there, the total number of blocks, the number and location of free blocks, the directory structure, and individual files. Many of these structures are detailed throughout the remainder of this chapter; here we describe them briefly.

File systems store several important data structures on the disk (the Illinois notes here are erroneous; refer to the book parts):

o A boot-control block (per volume), a.k.a. the boot block in UNIX or the partition boot sector in Windows, contains information about how to boot the system off of this disk. This will generally be the first sector of the volume if there is a bootable system loaded on that volume, or the block will be left vacant otherwise. (Book: A boot control block (per volume) can contain information needed by the system to boot an operating system from that volume. If the disk does not contain an operating system, this block can be empty. It is typically the first block of a volume. In UFS, this is called the boot block; in NTFS, it is the partition boot sector.)

o A volume control block (per volume), a.k.a. the superblock in UNIX or the master file table in Windows, which contains information such as the number of blocks in the file system, the block size, and pointers to free blocks and free FCBs. (Book: A volume control block (per volume) contains volume (or partition) details, such as the number of blocks in the partition, size of the blocks, free-block count and free-block pointers, and free FCB count and FCB pointers. In UFS, this is called a superblock; in NTFS, it is stored in the master file table.)

o A directory structure (per file system), containing file names and pointers to corresponding FCBs. UNIX uses inode numbers, and NTFS uses a master file table. (Book: A directory structure per file system is used to organize the files. In UFS, this includes file name and associated inode numbers. In NTFS, it is stored in the master file table.)

o The File Control Block (FCB) (per file), containing details about ownership, size, permissions, dates, etc. UNIX stores this information in inodes, and NTFS in the master file table as a relational database structure. (Book: A per-file FCB contains many details about the file, including file permissions, ownership, size, and location of the data blocks. In UFS, this is called the inode. In NTFS, this information is actually stored within the master file table, which uses a relational database structure, with a row per file.)
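A minimal sketch of a per-file FCB as a record is shown below. The field names are illustrative only; they do not match the exact UFS inode or NTFS MFT row layout.

```python
# Sketch of a File Control Block (FCB). Field names are illustrative,
# not the actual UFS inode or NTFS master-file-table layout.
from dataclasses import dataclass, field

@dataclass
class FCB:
    owner: int                  # UID of the file's owner
    group: int                  # GID
    mode: int                   # permission bits, e.g. 0o644
    size: int                   # file size in bytes
    blocks: list = field(default_factory=list)  # data-block addresses

fcb = FCB(owner=1000, group=100, mode=0o644, size=0)
fcb.blocks.append(217)          # first data block allocated to the file
fcb.size = 512                  # one block's worth of data written
```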

There are also several key data structures stored in memory (Book: The in-memory information is used for both file-system management and performance improvement via caching. The data are loaded at mount time and discarded at dismount. The structures may include the ones described below):

o An in-memory mount table contains information about each mounted volume.
o An in-memory directory-structure cache holds the directory information of recently accessed directories. (For directories at which volumes are mounted, it can contain a pointer to the volume table.)
o The system-wide open-file table contains a copy of the FCB of each open file, as well as other information.
o A per-process open-file table, containing a pointer to the system open-file table as well as some other information. (For example, the current file position pointer may be either here or in the system file table, depending on the implementation and whether the file is being shared or not.) (Book: The per-process open-file table contains a pointer to the appropriate entry in the system-wide open-file table, as well as other information.)

Interactions of file-system components when files are created and/or used:

To create a new file, an application program calls the logical file system, which knows the format of the directory structures. The logical file system allocates a new FCB. (Alternatively, if the file-system implementation creates all FCBs at file-system creation time, an FCB is allocated from the set of free FCBs.) The system then reads the appropriate directory into memory, updates it with the new file name and FCB, and writes it back to the disk. A typical FCB is shown in Figure 11.2. Some operating systems, including UNIX, treat a directory exactly the same as a file, with a type field indicating that it is a directory. Other operating systems, including Windows NT, implement separate system calls for files and directories and treat directories as entities separate from files. Whatever the larger structural issues, the logical file system can call the file-organization module to map the directory I/O into disk-block numbers, which are passed on to the basic file system and I/O control system.

Now that a file has been created, it can be used for I/O. First, though, it must be opened. The open() call passes a file name to the file system. The open() system call first searches the system-wide open-file table to see if the file is already in use by another process. If it is, a per-process open-file table entry is created pointing to the existing system-wide open-file table entry. This algorithm can save substantial overhead. If the file is not already open, the directory structure is searched for the given file name. Parts of the directory structure are usually cached in memory to speed directory operations. Once the file is found, the FCB is copied into the system-wide open-file table in memory. This table not only stores the FCB but also tracks the number of processes that have the file open.

Next, an entry is made in the per-process open-file table, with a pointer to the entry in the system-wide open-file table and some other fields. These other fields can include a pointer to the current location in the file (for the next read() or write() operation) and the access mode in which the file is open. The open() call returns a pointer to the appropriate entry in the per-process file-system table. All file operations are then performed via this pointer. The file name may not be part of the open-file table, as the system has no use for it once the appropriate FCB is located on disk. It could be cached, though, to save time on subsequent opens of the same file. The name given to the entry varies: UNIX systems refer to it as a file descriptor; Windows refers to it as a file handle. Consequently, as long as the file is not closed, all file operations are done on


the open-file table. When a process closes the file, the per-process table entry is removed, and the system-wide entry's open count is decremented. When all users that have opened the file close it, any updated metadata is copied back to the disk-based directory structure, and the system-wide open-file table entry is removed.

Some systems complicate this scheme further by using the file system as an interface to other system aspects, such as networking. For example, in UFS, the system-wide open-file table holds the inodes and other information for files and directories. It also holds similar information for network connections and devices. In this way, one mechanism is used for multiple purposes.

The caching aspects of file-system structures should not be overlooked. Most systems keep all information about an open file, except for its actual data blocks, in memory. The BSD UNIX system is typical in its use of caches wherever disk I/O can be saved. Its average cache hit rate of 85% shows that these techniques are well worth implementing. The operating structures of a file-system implementation are summarized in Figure 11.3.
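The open-count bookkeeping described above can be sketched as follows. The two tables are simplified stand-ins: the system-wide table is keyed by file name rather than by FCB/inode location, FCB fields are omitted, and file descriptors are handed out naively.

```python
# Simplified stand-ins for the two open-file tables. Real systems key
# the system-wide table by FCB/inode, not by file name.
system_table = {}   # name -> {"open_count": n}  (FCB fields omitted)
process_table = {}  # fd -> {"name": ..., "offset": 0, "mode": ...}
next_fd = 0

def open_file(name, mode="r"):
    global next_fd
    entry = system_table.setdefault(name, {"open_count": 0})
    entry["open_count"] += 1           # one more handle on this file
    fd = next_fd
    next_fd += 1
    process_table[fd] = {"name": name, "offset": 0, "mode": mode}
    return fd

def close_file(fd):
    name = process_table.pop(fd)["name"]
    system_table[name]["open_count"] -= 1
    if system_table[name]["open_count"] == 0:
        del system_table[name]         # metadata would be flushed to disk here

fd1 = open_file("a.txt")
fd2 = open_file("a.txt")               # second open reuses the system entry
close_file(fd1)                        # entry survives: fd2 is still open
```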

Before moving on to the next section, go to the reference material on MBR, MFT, VBR and FCB in the “Assorted Content” section.

Partitions and Mounting:

Partitions can either be used as raw devices (with no structure imposed upon them), or they can be formatted to hold a filesystem (i.e. populated with FCBs and initial directory structures as appropriate.) Raw partitions are generally used for swap space, and may also be used for certain programs such as databases that choose to manage their own disk storage system. Partitions containing filesystems can generally only be accessed using the file system structure by ordinary users, but can often be accessed as a raw device also by root.

The boot block is accessed as part of a raw partition, by the boot program prior to any operating system being loaded. Modern boot programs understand multiple OSes and filesystem formats, and can give the user a choice of which of several available systems to boot.

The root partition contains the OS kernel and at least the key portions of the OS needed to complete the boot process. At boot time the root partition is mounted, and control is transferred from the boot program to the kernel found there. (Older systems required that the root partition lie completely within the first 1024 cylinders of the disk, because that was as far as the boot program could reach. Once the kernel had control, then it could access partitions beyond the 1024 cylinder boundary.)

Continuing with the boot process, additional filesystems get mounted, adding their information into the appropriate mount table structure. As part of the mounting process the file systems may be checked for errors or inconsistencies, either because they are flagged as not having been closed properly the last time they were used, or just on general principles. Filesystems may be mounted either automatically or manually. In UNIX a mount point is indicated by setting a flag in the in-memory copy of the inode, so all future references to that inode get re-directed to the root directory of the mounted filesystem.

Virtual File Systems: Virtual File Systems (VFS) provide a common interface to multiple different filesystem types. In addition, a VFS provides a unique identifier (vnode) for files across the entire space, including across all filesystems of different types. (UNIX inodes are unique only across a single filesystem, and certainly do not carry across networked file systems.) The VFS in Linux is based upon four key object types: (a) the inode object, representing an individual file; (b) the file object, representing an open file; (c) the superblock object, representing a filesystem; (d) the dentry object, representing a directory entry.
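The relationship between the four VFS object types can be sketched as plain classes. The fields are illustrative; the real definitions are C structs in the Linux kernel with many more members and function-pointer tables.

```python
# Sketch of the four Linux VFS object types. Fields are illustrative;
# the real objects are kernel C structs with operation tables.
class SuperBlock:        # one per mounted file system
    def __init__(self, fs_type):
        self.fs_type = fs_type

class Inode:             # one per file, unique within its file system
    def __init__(self, sb, ino):
        self.sb, self.ino = sb, ino

class Dentry:            # one per directory entry: maps a name to an inode
    def __init__(self, name, inode):
        self.name, self.inode = name, inode

class File:              # one per open file: tracks position, mode, etc.
    def __init__(self, dentry):
        self.dentry, self.pos = dentry, 0

sb = SuperBlock("ext3")
f = File(Dentry("notes.txt", Inode(sb, 42)))
```

Note how an open file reaches its file system only indirectly: file -> dentry -> inode -> superblock, which is what lets the layers above treat all filesystem types uniformly.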

Directory Implementation

The selection of directory-allocation and directory-management algorithms significantly affects the efficiency, performance, and reliability of the file system. In this section, we discuss the trade-offs involved in choosing one of these algorithms. (Directories need to be fast to search, insert, and delete, with a minimum of wasted disk space.)

Linear List: The simplest method of implementing a directory is to use a linear list of file names with pointers to the data blocks. This method is simple to program but time-consuming to execute. To create a new file, we must first search the directory to be sure that no existing file has the same name. Then, we add a new entry at the end of the directory. To delete a file, we search the directory for the named file, then release


the space allocated to it. To reuse the directory entry, we can do one of several things. We can mark the entry as unused (by assigning it a special name, such as an all-blank name, or with a used-unused bit in each entry), or we can attach it to a list of free directory entries. A third alternative is to copy the last entry in the directory into the freed location and to decrease the length of the directory. A linked list can also be used to decrease the time required to delete a file (there is an overhead for the links). The real disadvantage of a linear list of directory entries is that finding a file requires a linear search. Directory information is used frequently, and users will notice if access to it is slow. A sorted binary list allows a binary search and decreases the average search time. However, the requirement that the list be kept sorted may complicate creating and deleting files, since we may have to move substantial amounts of directory information to maintain a sorted directory. A more sophisticated tree data structure, such as a B-tree, might help here. An advantage of the sorted list is that a sorted directory listing can be produced without a separate sort step.
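The linear-list operations above can be sketched directly; this version uses the "copy the last entry into the freed location" reuse strategy from the text. The entry format (name, first data block) is a simplification of a real directory entry.

```python
# Sketch of a linear-list directory. Entries are (name, first_block);
# a real entry would hold an FCB/inode pointer and more fields.
directory = []

def create(name, block):
    # Creation must first scan the whole list for a duplicate name.
    if any(n == name for n, _ in directory):
        raise FileExistsError(name)
    directory.append((name, block))

def delete(name):
    for i, (n, _) in enumerate(directory):
        if n == name:
            directory[i] = directory[-1]   # fill the hole with the last entry
            directory.pop()                # then shrink the directory
            return
    raise FileNotFoundError(name)

create("a", 10); create("b", 20); create("c", 30)
delete("a")    # "c" is moved into the freed slot
```

Both create and delete are O(n) in the number of entries, which is exactly the linear-search cost the text warns about.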

Hash Table: Another data structure used for a file directory is a hash table. With this method, a linear list stores the directory entries, but a hash data structure is also used. The hash table takes a value computed from the file name and returns a pointer to the file name in the linear list. Therefore, it can greatly decrease the directory search time.
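The combination described above, a linear list of entries plus a hash table pointing into it, can be sketched as follows (Python's dict is itself a hash table, so it stands in for the hashing machinery):

```python
# Sketch of a hash-table directory: the linear list still holds the
# entries, but a hash table maps each name to its position in the list.
entries = []   # linear list of (name, first_block)
index = {}     # hash table: name -> position in `entries`

def add(name, block):
    index[name] = len(entries)
    entries.append((name, block))

def lookup(name):
    # O(1) hash probe instead of an O(n) scan of `entries`.
    return entries[index[name]]

add("report.txt", 77)
add("data.bin", 912)
```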

Allocation methods

Here we discuss how to allocate space to files so that disk space is utilized effectively and files can be accessed quickly. Three major methods of allocating disk space are in wide use: contiguous, linked, and indexed. Some systems (such as Data General's RDOS for its Nova line of computers) support all three. More commonly, a system uses one method for all files within a file-system type.

Contiguous Allocation: It requires that all blocks of a file be kept together contiguously. Performance is very fast, because reading successive blocks of the same file generally requires no movement of the disk heads, or at most one small step to the next adjacent cylinder.

Storage allocation involves the same issues discussed earlier for the allocation of contiguous blocks of memory (first fit, best fit, fragmentation problems, etc.) The distinction is that the high time penalty required for moving the disk heads from spot to spot may now justify the benefits of keeping files contiguously when possible. (Even file systems that do not by default store files contiguously can benefit from certain utilities that compact the disk and make all files contiguous in the process.)

Problems can arise when files grow, or if the exact size of a file is unknown at creation time: Over-estimation of the file's final size increases external fragmentation and wastes disk space. Under-estimation may require that a file be moved or a process aborted if the file grows beyond its originally allocated space. If a file grows slowly over a long time period and the total final space must be allocated initially, then a lot of space becomes unusable before the file fills the space.

To minimize these drawbacks, some operating systems use a modified contiguous-allocation scheme. Here, a contiguous chunk of space is allocated initially; then, if that amount proves not to be large enough, another chunk of contiguous space, known as an extent, is added. The location of a file's blocks is then recorded as a location and a block count, plus a link to the first block of the next extent (this scheme is used by the Veritas file system).
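Contiguous allocation reuses the first-fit/best-fit logic discussed earlier for memory. A first-fit sketch over a hypothetical free list of (start, length) runs, recording a file as (location, block count) as in the text:

```python
# First-fit contiguous allocation over a hypothetical free list.
# Each run is (start_block, length); a file is stored as (start, count).
free_runs = [(0, 5), (8, 10), (25, 3)]

def allocate_contiguous(nblocks):
    for i, (start, length) in enumerate(free_runs):
        if length >= nblocks:                    # first run big enough wins
            if length == nblocks:
                free_runs.pop(i)                 # run consumed exactly
            else:
                free_runs[i] = (start + nblocks, length - nblocks)
            return (start, nblocks)
    raise OSError("no contiguous run large enough (external fragmentation)")

loc = allocate_contiguous(6)   # skips the 5-block run, carves the 10-block run
```

Note that a 6-block file cannot use the 5-block and 3-block runs even though together they hold 8 free blocks; that stranded space is exactly the external fragmentation the text describes.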

Linked Allocation: Linked allocation solves all problems of contiguous allocation. With linked allocation, each file is a linked list of disk blocks; the disk blocks may be scattered anywhere on the disk. The directory contains a pointer to the first and last blocks of the file (Each block contains a pointer to the next block). These pointers are not made available to the user. Thus, if each block is 512 bytes in size, and a disk address (the pointer) requires 4 bytes, then the user sees blocks of 508 bytes.

To create a new file, we simply create a new entry in the directory. With linked allocation, each directory entry has a pointer to the first disk block of the file. This pointer is initialized to nil (the end-of-list pointer value) to signify an empty file. The size field is also set to 0. A write to the file causes the free-space management system to find a free block, and this new block is written to and is linked to the end of the file. To read a file, we simply read blocks by following the pointers from block to block. There is no external fragmentation with linked allocation, and any free block on the free-space list can be used to satisfy a request. The size of a file need not be declared when that file is created. A file can continue to grow as long as free blocks are available. Consequently, it is never necessary to compact disk space.
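Reading a linked-allocation file means chasing the per-block pointers, as sketched below. The disk is modeled as a dict mapping block number to (data, next-block); the block numbers and contents are illustrative.

```python
# Sketch of linked allocation: each disk block stores its data plus a
# pointer to the file's next block (None plays the role of nil/EOF).
disk = {
    9:  ("part-1", 16),
    16: ("part-2", 1),
    1:  ("part-3", None),   # last block of the file
}

def read_file(first_block):
    data, block = [], first_block
    while block is not None:        # each step costs one disk read
        payload, block = disk[block]
        data.append(payload)
    return data

contents = read_file(9)
```

The loop makes the direct-access problem visible: reaching the ith block requires i pointer-following reads from the start of the chain.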


Linked allocation does have disadvantages, however. The major problem is that it can be used effectively only for sequential-access files. To find the ith block of a file, we must start at the beginning of that file and follow the pointers until we get to the ith block. Each access to a pointer requires a disk read, and some require a disk seek. Consequently, it is inefficient to support a direct-access capability for linked-allocation files. (Another disadvantage is the space required for the pointers.)

The usual solution to this problem is to collect blocks into multiples, called clusters, and to allocate clusters rather than blocks. For instance, the file system may define a cluster as four blocks and operate on the disk only in cluster units. Pointers then use a much smaller percentage of the file's disk space. The cost of this approach is an increase in internal fragmentation, because more space is wasted when a cluster is partially full than when a block is partially full. Clusters can be used to improve the disk-access time for many other algorithms as well, so they are used in most file systems.

Another problem of linked allocation is reliability. The files are linked together by pointers scattered all over the disk, so consider what would happen if a pointer were lost or damaged. One partial solution is to use doubly-linked lists, and another is to store the file name and relative block number in each block; however, these schemes require even more overhead for each file.

An important variation on linked allocation is the use of a file-allocation table (FAT). This simple but efficient method of disk-space allocation is used by the MS-DOS and OS/2 operating systems. A section of disk at the beginning of each volume is set aside to contain the table. The table has one entry for each disk block and is indexed by block number. The FAT is used in much the same way as a linked list. The directory entry contains the block number of the first block of the file. The table entry indexed by that block number contains the block number of the next block in the file. This chain continues until the last block, which has a special end-of-file value as the table entry. Unused blocks are indicated by a 0 table value. Allocating a new block to a file is a simple matter of finding the first 0-valued table entry and replacing the previous end-of-file value with the address of the new block. The 0 is then replaced with the end-of-file value. An illustrative example is the FAT structure shown in Figure 11.7 for a file consisting of disk blocks 217, 618, and 339. The FAT allocation scheme can result in a significant number of disk head seeks, unless the FAT is cached. The disk head must move to the start of the volume to read the FAT and find the location of the block in question, then move to the location of the block itself. In the worst case, both moves occur for each of the blocks. A benefit is that random-access time is improved, because the disk head can find the location of any block by reading the information in the FAT.
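The Figure 11.7 example (a file occupying blocks 217, 618, and 339) can be traced through a sketched FAT. Only the three entries of interest are modeled, and -1 stands in for the end-of-file marker.

```python
# Sketch of the FAT for the Figure 11.7 file: 217 -> 618 -> 339 -> EOF.
# Unused blocks would hold 0; -1 stands in for the end-of-file value.
EOF = -1
fat = {217: 618, 618: 339, 339: EOF}

def chain(first_block):
    """Follow FAT entries from the directory's first block to EOF."""
    blocks, b = [], first_block
    while b != EOF:
        blocks.append(b)
        b = fat[b]
    return blocks
```

Because the whole chain lives in the table rather than in the data blocks, finding the ith block needs only FAT lookups, no data-block reads, which is why caching the FAT makes random access cheap.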

Indexed Allocation: Linked allocation solves the external-fragmentation and size-declaration problems of contiguous allocation. However, in the absence of a FAT, linked allocation cannot support efficient direct access, since the pointers to the blocks are scattered with the blocks themselves all over the disks and must be retrieved in order. Indexed allocation solves this problem by bringing all the pointers together into one location: the index block.

Each file has its own index block, which is an array of disk-block addresses. The ith entry in the index block points to the ith block of the file. The directory contains the address of the index block (Figure 11.8). To find and read the ith block, we use the pointer in the ith index-block entry. This scheme is similar to the paging scheme described in Section 8.4.

When the file is created, all pointers in the index block are set to nil. When the ith block is first written, a block is obtained from the free-space manager, and its address is put in the ith index-block entry.

Indexed allocation supports direct access, without suffering from external fragmentation, because any free block on the disk can satisfy a request for more space. Indexed allocation does suffer from wasted space, however. The pointer overhead of the index block is generally greater than the pointer overhead of linked allocation. Consider a common case in which we have a file of only one or two blocks. With linked allocation, we lose the space of only one pointer per block. With indexed allocation, an entire index block must be allocated, even if only one or two pointers will be non-nil.
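A single-index-block sketch of the scheme just described: entry i of the index block points to logical block i, and nil pointers (None here) mark blocks that have never been written. The 8-entry size is purely illustrative.

```python
# Sketch of indexed allocation with one index block per file.
# Entry i holds the physical address of logical block i; None is nil.
index_block = [None] * 8     # small index block, for illustration only

def write_block(i, free_block):
    # The free-space manager supplied free_block; record where
    # logical block i now lives.
    index_block[i] = free_block

def read_block(i):
    if index_block[i] is None:
        raise ValueError(f"logical block {i} was never written")
    return index_block[i]

write_block(0, 19)
write_block(1, 4)
```

The wasted-space point from the text is visible here: even a one-block file pays for the whole index block, most of whose entries stay nil.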

This point raises the question of how large the index block should be. Every file must have an index block, so we want the index block to be as small as possible. If the index block is too small, however, it will not be able to hold enough pointers for a large file, and a mechanism will have to be available to deal with the issue. Mechanisms for this purpose include the following:


o Linked scheme – An index block is normally one disk block. Thus, it can be read and written directly by itself. To allow for large files, we can link together several index blocks. For example, an index block might contain a small header giving the name of the file and a set of the first 100 disk-block addresses. The next address (the last word in the index block) is nil (for a small file) or is a pointer to

another index block (for a large file). o Multilevel index – A variant of the linked representation is to use a first-level index block to a set of second-level index blocks, which

in turn point to the file blocks. To access a block, the OS uses the first-level index to find a second-level index block and then uses that block to find the desired data block. This approach could be continued to a third or fourth level, depending on the desired maximum file size. With 4096-byte blocks, we could store 1,024 4-byte pointers in an index block. Two levels of indexes allow 1,048,576 data blocks and a file size of up to 4 GB.

o Combined scheme – Another alternative, used in UFS, is to keep the first, say, 15 pointers of the index block in the file's inode. The first 12 of these pointers point to direct blocks; that is, they contain addresses of blocks that contain data of the file. Thus, the data for small files (of no more than 12 blocks) do not need a separate index block. If the block size is 4 KB, then up to 48 KB of data can be accessed directly. The next three pointers point to indirect blocks. The first points to a single indirect block, which is an index block containing not data but the addresses of blocks that do contain data. The second points to a double indirect block, which contains the address of a block that contains the addresses of blocks that contain pointers to the actual data blocks. The last pointer contains the address of a triple indirect block. Under this method, the number of blocks that can be allocated to a file exceeds the amount of space addressable by the 4-byte file pointers used by many OSes. A 32-bit file pointer reaches only 2^32 bytes, or 4 GB. Many UNIX implementations, including Solaris and IBM's AIX, now support 64-bit file pointers, which allow files and file systems to be terabytes in size. A UNIX inode is shown in Figure 11.9.

Indexed-allocation schemes suffer from some of the same performance problems as linked allocation: the index blocks can be cached in memory, but the data blocks may be spread all over a volume.
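The combined scheme above can be sketched numerically. The following is an illustrative calculation only (parameters assume 4-KB blocks and 4-byte pointers, as in the text); it classifies a file's logical block number by the indirection level needed to reach it:

```python
# Assumed UFS-style parameters from the text: 4-KB blocks, 4-byte pointers,
# 12 direct pointers, then one single, double, and triple indirect pointer.
BLOCK_SIZE = 4096
PTRS = BLOCK_SIZE // 4   # 1,024 pointers fit in one index block
NDIRECT = 12

def classify_block(n):
    """Return the indirection level (0-3) needed for logical block n."""
    if n < NDIRECT:
        return 0                          # direct block
    n -= NDIRECT
    if n < PTRS:
        return 1                          # single indirect
    n -= PTRS
    if n < PTRS ** 2:
        return 2                          # double indirect
    n -= PTRS ** 2
    if n < PTRS ** 3:
        return 3                          # triple indirect
    raise ValueError("beyond maximum file size")

# Maximum file size under these assumptions:
MAX_BLOCKS = NDIRECT + PTRS + PTRS**2 + PTRS**3
```

With these numbers, MAX_BLOCKS × 4 KB comes to a little over 4 TB, which is why the 32-bit file pointer (4 GB) becomes the limiting factor, as noted above.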

Performance: The optimal allocation method is different for sequential-access files than for random-access files, and is also different for small files than for large files. Some systems support more than one allocation method, which may require specifying how the file is to be used (sequential or random access) at the time it is allocated. Such systems also provide conversion utilities. Some systems have been known to use contiguous allocation for small files and automatically switch to an indexed scheme when file sizes surpass a certain threshold. And of course some systems adjust their allocation schemes (e.g. block sizes) to best match the characteristics of the hardware for optimum performance.

Free-Space Management Another important aspect of disk management is keeping track of and allocating free space.

Bit Vector: One simple approach is to use a bit vector, in which each bit represents a disk block, set to 1 if free or 0 if allocated. Fast algorithms exist for quickly finding contiguous blocks of a given size. The downside is the space the bitmap itself consumes: for example, a 40-GB disk with 1-KB blocks requires over 5 MB just to store the bitmap.
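A minimal sketch of the bit-vector approach (a Python list stands in for the on-disk bitmap; real implementations pack bits into machine words and search a word at a time):

```python
# Toy free-space bitmap: 1 = free, 0 = allocated, one entry per disk block.
class BitVector:
    def __init__(self, nblocks):
        self.bits = [1] * nblocks          # all blocks start out free

    def allocate(self):
        """Find the first free block, mark it allocated, return its number."""
        for i, b in enumerate(self.bits):
            if b:
                self.bits[i] = 0
                return i
        raise RuntimeError("disk full")

    def free(self, i):
        self.bits[i] = 1

# Bitmap overhead for the example in the text: a 40-GB disk with 1-KB
# blocks has 40 * 2**20 blocks, i.e. 40 * 2**20 bits of bitmap.
overhead_bytes = (40 * 2**30 // 2**10) // 8   # = 5 MB
```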

Linked List: A linked list can also be used to keep track of all free blocks. Traversing the list and finding a contiguous block of a given size are not easy, but fortunately these are not frequently needed operations; generally the system just adds and removes single blocks from the beginning of the list. A FAT file system keeps track of the free list as just one more linked list within the table.

Grouping: A variation on the linked-list free list is to link together blocks of indices of free blocks. If a block holds up to N addresses, then the first block in the linked list contains up to N-1 addresses of free blocks and a pointer to the next block of free addresses.
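The grouping idea can be sketched with nested pairs standing in for the chained index blocks (block numbers here are made up for illustration):

```python
# Each "index block" is a pair: (list of free-block addresses, next block).
# With N-address blocks, each pair holds up to N-1 addresses plus the link.
free_chain = ([7, 9, 12], ([3, 5, 8], None))

def count_free(chain):
    """Walk the chain of index blocks and count the free blocks listed."""
    total = 0
    while chain is not None:
        addrs, chain = chain
        total += len(addrs)
    return total
```

The advantage over a plain linked list is that the addresses of many free blocks can be found by reading a single index block.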

Counting: When there are multiple contiguous blocks of free space, the system can keep track of the starting address of the group and the number of contiguous free blocks. As long as the average length of a contiguous group of free blocks is greater than two, this offers a savings in space needed for the free list. (This is similar to run-length compression techniques used for graphics images when a group of pixels of the same color is encountered.)

Space Maps: Sun's ZFS file system was designed for HUGE numbers and sizes of files, directories, and even file systems. The resulting data structures could be VERY inefficient if not implemented carefully. For example, freeing up a 1 GB file on a 1 TB file system could involve updating thousands of blocks of free-list bitmaps if the file were spread across the disk. ZFS uses a combination of techniques, starting with dividing the disk up into (hundreds of) metaslabs of a manageable size, each having its own space map. Free blocks are managed using the counting technique, but rather than write the information to a table, it is recorded in a log-structured transaction record. Adjacent free blocks are also coalesced into a larger single free block. An in-memory space map is constructed from the log data using a balanced-tree data structure. The combination of the in-memory tree and the on-disk log provides very fast and efficient management of these very large files and free blocks.
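The counting technique amounts to run-length encoding the free list. A toy version that collapses individual free block numbers into (start, count) runs:

```python
def runs_from_free_blocks(free):
    """Collapse a sorted list of free block numbers into (start, count) runs."""
    runs = []
    for b in free:
        if runs and runs[-1][0] + runs[-1][1] == b:
            runs[-1][1] += 1          # extends the current contiguous run
        else:
            runs.append([b, 1])       # starts a new run
    return [tuple(r) for r in runs]
```

Each run costs two entries where a plain free list would cost one per block, which is why the scheme only saves space when runs average more than two blocks, as noted above.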

Efficiency and Performance

Efficiency: The efficient use of disk space depends heavily on the disk allocation and directory algorithms in use. For instance, UNIX pre-allocates inodes, which occupies space even before any files are created. UNIX also distributes inodes across the disk, and tries to store data files near their inode to reduce the distance of disk seeks between the inodes and the data. Some systems use variable-size clusters depending on the file size. The more data that is stored in a directory (e.g., information like last access time), the more often the directory blocks have to be rewritten. As technology advances, addressing schemes have had to grow as well. Sun's ZFS file system uses 128-bit pointers, which should theoretically never need to be expanded. (The mass required to store 2^128 bytes with atomic storage would be at least 272 trillion kilograms!) Kernel table sizes used to be fixed, and could be changed only by rebuilding the kernel. Modern tables are dynamically allocated, but that requires more complicated algorithms for accessing them.

Performance: Even after the basic file-system algorithms have been selected, we can still improve performance in several ways. Disk controllers generally include on-board caching. When a seek is requested, the heads are moved into place, and then an entire track is read, starting from whatever sector is currently under the heads (reducing latency). The requested sector is returned and the unrequested portion of the track is cached in the disk's electronics. Some OSes cache disk blocks they expect to need again in a buffer cache. A page cache connected to the virtual memory system is actually more efficient, as memory addresses do not need to be converted to disk block addresses and back again. Some systems (Solaris, Linux, Windows 2000, NT, XP) use page caching for both process pages and file data in a unified virtual memory. Figures 11.11 and 11.12 show the advantages of the unified buffer cache found in some versions of UNIX and Linux: data does not need to be stored twice, and problems of inconsistent buffer information are avoided. (Book: Some systems maintain a separate section of main memory for a buffer cache, where blocks are kept under the assumption that they will be used again shortly. Other systems cache file data using a page cache. The page cache uses virtual memory techniques to cache file data as pages rather than as file-system-oriented blocks. Caching file data using virtual addresses is far more efficient than caching through physical blocks, as accesses interface with virtual memory rather than the file system. Several systems, including Solaris, Linux, and Windows NT/XP, use page caching

to cache both process pages and file data. This is known as unified virtual memory.) (Book: Some versions of UNIX and Linux provide a unified buffer cache. To illustrate the benefits of the unified buffer cache, consider the two alternatives for opening and accessing a file. One approach is to use memory mapping (Section 9.7); the second is to use the standard system calls read() and write(). Without a unified buffer cache, we have a situation similar to Figure 11.11. Here, the read() and write() system calls go through the buffer cache. The memory-mapping call, however, requires using two caches - the page cache and the buffer cache. A memory mapping proceeds by reading in disk blocks from the file system and storing them in the buffer cache. Because the virtual memory does not interface with the buffer cache, the contents of the file in the buffer cache must be copied into the page cache. This situation is known as double caching and requires caching file-system data twice. Not only does it waste memory but it also wastes significant CPU and I/O cycles due to the extra data movement within system memory. In addition, inconsistencies between the two caches can result in corrupt files. In contrast, when a unified buffer cache is provided, both memory mapping and the read() and write() system calls use the same page cache. This has the benefit of avoiding double caching, and it allows the virtual memory system to manage file-system data. The unified buffer cache is shown in Figure 11.12.)

o Page replacement strategies can be complicated with a unified cache, as one needs to decide whether to replace process or file pages, and how many pages to guarantee to each category of pages. Solaris, for example, has gone through many variations, resulting in priority paging giving process pages priority over file I/O pages, and setting limits so that neither can knock the other completely out of memory.

o Another issue affecting performance is the question of whether to implement synchronous writes or asynchronous writes. Synchronous writes occur in the order in which the disk subsystem receives them, without caching; asynchronous writes are cached, allowing the disk subsystem to schedule writes in a more efficient order (see Chapter 12). Metadata writes are often done synchronously. Some systems support flags to the open call requiring that writes be synchronous, for example for the benefit of database systems that require their writes to be performed in a specific order.

o The type of file access can also have an impact on optimal page replacement policies. For example, LRU is not necessarily a good policy for sequential-access files. For these types of files, progression normally goes in a forward direction only, and the most recently used page will not be needed again until after the file has been rewound and re-read from the beginning (if it is ever needed at all). On the other hand, we can expect to need the next page in the file fairly soon. For this reason, sequential-access files often take advantage of two special policies:

Free-behind frees up a page as soon as the next page in the file is requested, with the assumption that we are now done with the old page and won't need it again for a long time.


Read-ahead reads the requested page and several subsequent pages at the same time, with the assumption that those pages will be needed in the near future. This is similar to the track caching that is already performed by the disk controller, except it saves the future latency of transferring data from the disk controller memory into motherboard main memory.
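Free-behind and read-ahead can be sketched together on a toy cache (the read-ahead window size here is an arbitrary assumption, and `fetch` stands in for the actual disk read):

```python
READ_AHEAD = 2   # assumed prefetch window

def read_sequential(page, cache, fetch):
    """Read one page of a sequentially accessed file through a toy cache."""
    # Free-behind: we are done with the page just behind this one.
    cache.discard(page - 1)
    # Read-ahead: bring in the requested page plus the next READ_AHEAD pages.
    for p in range(page, page + READ_AHEAD + 1):
        if p not in cache:
            fetch(p)
            cache.add(p)
    return page
```

Note how a second sequential read only fetches the one new page at the leading edge of the window, which is exactly the benefit read-ahead is after.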

o The caching system and asynchronous writes speed up disk writes considerably, because the disk subsystem can schedule physical writes to the disk to minimize head movement and disk seek times. (See Chapter 12). Reads, on the other hand, must be done more synchronously in spite of the caching system, with the result that disk writes can counter-intuitively be much faster on average than disk reads.

Recovery

Files and directories are kept both in main memory and on disk, and care must be taken to ensure that system failure does not result in loss of data or in data inconsistency. We deal with these issues in the following sections.

Consistency Checking: The storing of certain data structures (e.g. directories and inodes) in memory and the caching of disk operations can speed up performance, but what happens in the event of a system crash? All volatile memory structures are lost, and the information stored on the hard drive may be left in an inconsistent state. A consistency checker (fsck in UNIX, chkdsk or scandisk in Windows) is often run at boot time or mount time, particularly if a filesystem was not closed down properly. Some of the problems that these tools look for include:

o Disk blocks allocated to files and also listed on the free list.
o Disk blocks neither allocated to files nor on the free list.
o Disk blocks allocated to more than one file.
o The number of disk blocks allocated to a file inconsistent with the file's stated size.
o Properly allocated files/inodes which do not appear in any directory entry.
o Link counts for an inode not matching the number of references to that inode in the directory structure.
o Two or more identical file names in the same directory.
o Illegally linked directories, e.g. cyclical relationships where those are not allowed, or files/directories that are not accessible from the root of the directory tree.
o Consistency checkers will often collect questionable disk blocks into new files with names such as chk00001.dat. These files may contain valuable information that would otherwise be lost, but in most cases they can be safely deleted (returning those disk blocks to the free list). UNIX caches directory information for reads, but any changes that affect space allocation or other metadata are written synchronously, before any of the corresponding data blocks are written.
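The first two checks in the list above amount to simple set arithmetic over the allocation metadata. A toy version:

```python
def check_blocks(nblocks, allocated, free_list):
    """Return (blocks both allocated and free, blocks neither - i.e. leaked)."""
    allocated, free_list = set(allocated), set(free_list)
    both = allocated & free_list                           # double-booked
    neither = set(range(nblocks)) - allocated - free_list  # leaked blocks
    return both, neither
```

A real fsck resolves these findings rather than just reporting them, e.g. by returning leaked blocks to the free list.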

Log-Structured File Systems: Log-based transaction-oriented (a.k.a. journaling) filesystems borrow techniques developed for databases, guaranteeing that any given transaction either completes successfully or can be rolled back to a safe state before the transaction commenced:

o All metadata changes are written sequentially to a log.
o A set of changes for performing a specific task (e.g. moving a file) is a transaction.
o As changes are written to the log they are said to be committed, allowing the system to return to its work.
o In the meantime, the changes from the log are carried out on the actual filesystem, and a pointer keeps track of which changes in the log have been completed and which have not.
o When all changes corresponding to a particular transaction have been completed, that transaction can be safely removed from the log.
o At any given time, the log will contain information pertaining only to uncompleted transactions, e.g. actions that were committed but for which the entire transaction has not yet been completed. From the log, the remaining transactions can be completed, or, if a transaction was aborted, the partially completed changes can be undone.
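The journaling cycle above can be sketched with a minimal write-ahead log (an in-memory dict stands in for the on-disk metadata; this is illustrative only, not any real journal format):

```python
log = []         # holds uncompleted (but committed) transactions only
fs_state = {}    # stands in for the actual on-disk metadata

def begin(changes):
    """Append a transaction's metadata changes to the log (commit point)."""
    txn = {"changes": changes, "committed": True}
    log.append(txn)
    return txn

def checkpoint():
    """Carry committed transactions out on the filesystem, then trim the log."""
    while log:
        txn = log.pop(0)
        fs_state.update(txn["changes"])
```

After a crash, recovery simply replays whatever transactions remain in the log, which is why the log needs to hold only uncompleted transactions.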

Backup and Restore: A full backup copies every file on a filesystem. Incremental backups copy only files which have changed since some previous time. A combination of full and incremental backups can offer a compromise between full recoverability, the number and size of backup tapes needed, and the number of tapes that need to be used to do a full restore. For example, one strategy might be:

o At the beginning of the month, do a full backup.
o At the end of the first and again at the end of the second week, back up all files which have changed since the beginning of the month.
o At the end of the third week, back up all files that have changed since the end of the second week.
o Every other day of the month, do an incremental backup of all files that have changed since the most recent of the weekly backups described above.
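The example strategy above can be written down as a small schedule function (the day numbering and the 28-day month are simplifying assumptions for illustration):

```python
def backup_type(day):
    """Which backup the example monthly strategy takes on a given day (1-28)."""
    if day == 1:
        return "full"
    if day in (7, 14):
        return "cumulative since day 1"       # changed since start of month
    if day == 21:
        return "incremental since day 14"     # changed since second week
    return "incremental since last weekly"    # daily incremental
```

A full restore then needs the day-1 full backup, the most recent weekly backup (plus any earlier weekly it depends on), and the daily incrementals taken since.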

Other Solutions: Sun's ZFS and Network Appliance's WAFL file systems take a different approach to file-system consistency. No blocks of data are ever overwritten in place. Rather, the new data is written into fresh blocks, and after the transaction is complete, the metadata (data-block pointers) is updated to point to the new blocks. The old blocks can then be freed up for future use. Alternatively, if the old blocks and old metadata are saved, then a snapshot of the system in its original state is preserved. This approach is taken by WAFL. ZFS combines this with check-summing of all metadata and data blocks, and with RAID, to ensure that no inconsistencies are possible, and therefore ZFS does not incorporate a consistency checker.


NFS (Optional)

The NFS protocol is implemented as a set of remote procedure calls (RPCs): searching for a file in a directory, reading a set of directory entries, manipulating links and directories, accessing file attributes, and reading and writing files. For remote operations, buffering and caching improve performance, but can cause a disparity between local and remote views of the same file(s).

(In addition to Figure 12.15, you can also review the preceding figures illustrating NFS file-system mounting.)

Assorted Content

Master Boot Record (MBR; Wiki): A master boot record (MBR) is a special type of boot sector at the very beginning of partitioned computer mass storage devices like fixed disks or removable drives intended for use with IBM PC-compatible systems and beyond. The MBR holds the information on how the logical partitions, containing file systems, are organized on that medium. The MBR also contains executable code to function as a loader for the installed operating system, usually by passing control over to the loader's second stage, or in conjunction with each partition's volume boot record (VBR). This MBR code is usually referred to as a boot loader. MBRs are not present on non-partitioned media such as floppies, superfloppies, or other storage devices configured to behave as such. The MBR is not located in a partition; it is located at the first sector of the device (physical offset 0), preceding the first partition. (The boot sector present on a non-partitioned device or within an individual partition is called a volume boot record instead.) The organization of the partition table in the MBR limits the maximum addressable storage space of a disk to 2 TiB (2^32 × 512 bytes). Approaches to slightly raise this limit using 33-bit arithmetic or 4096-byte sectors are not officially supported, as they fatally break compatibility with existing boot loaders and most MBR-compliant operating systems and system tools, and can cause serious data corruption when used outside of narrowly controlled system environments. Therefore, the MBR-based partitioning scheme is in the process of being superseded by the GUID Partition Table (GPT) scheme in new computers. A GPT can coexist with an MBR in order to provide some limited form of backward compatibility for older systems. The MBR consists of 512 or more bytes located in the first sector of the drive. It may contain one or more of: (A) a partition table describing the partitions of a storage device (in this context the boot sector may also be called a partition sector); (B) bootstrap code, i.e. instructions to identify the configured bootable partition, then load and execute its volume boot record (VBR) as a chain loader; (C) an optional 32-bit disk timestamp; (D) an optional 32-bit disk signature. By convention, there are exactly four primary partition table entries in the MBR partition table scheme.
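A sketch of reading the four primary partition entries out of an MBR sector (offsets follow the classic MBR layout: partition table at byte 446, four 16-byte entries, 0x55AA signature at byte 510):

```python
import struct

def parse_mbr(sector):
    """Parse the four primary partition entries from a 512-byte MBR sector."""
    assert sector[510:512] == b"\x55\xaa", "missing MBR boot signature"
    parts = []
    for i in range(4):
        entry = sector[446 + 16 * i : 446 + 16 * (i + 1)]
        status, ptype = entry[0], entry[4]          # bootable flag, type byte
        # Starting LBA and sector count are little-endian 32-bit values.
        lba_start, nsectors = struct.unpack_from("<II", entry, 8)
        parts.append((status, ptype, lba_start, nsectors))
    return parts

# The 2-TiB limit mentioned above comes from the 32-bit sector fields:
MAX_MBR_BYTES = 2**32 * 512
```

The 32-bit `lba_start` and `nsectors` fields are exactly why MBR addressing tops out at 2^32 × 512 bytes = 2 TiB.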

Second-stage boot loader: Second-stage boot loaders, such as GNU GRUB, BOOTMGR, Syslinux, NTLDR or BootX, are not themselves operating systems, but are able to load an operating system properly and transfer execution to it; the operating system subsequently initializes itself and may load extra device drivers. The second-stage boot loader does not need drivers for its own operation, but may instead use generic storage access methods provided by system firmware such as the BIOS or Open Firmware, though typically with restricted hardware functionality and lower performance.

Volume Boot Record (VBR): A Volume Boot Record (VBR) (also known as a volume boot sector, a partition boot record or a partition boot sector) is a type of boot sector introduced by the IBM Personal Computer. It may be found on a partitioned data storage device such as a hard disk, or an unpartitioned device such as a floppy disk, and contains machine code for bootstrapping programs (usually, but not necessarily, operating systems) stored in other parts of the device. On non-partitioned storage devices, it is the first sector of the device. On partitioned devices, it is the first sector of an individual partition on the device, with the first sector of the entire device being a Master Boot Record (MBR) containing the partition table. The code in volume boot records is invoked either directly by the machine's firmware or indirectly by code in the master boot record or a boot manager. Code in the MBR and VBR is in essence loaded the same way. Invoking a VBR via a boot manager is known as chain loading.

Master File Table (MFT): The NTFS file system contains a file called the master file table, or MFT. There is at least one entry in the MFT for every file on an NTFS file system volume, including the MFT itself. All information about a file, including its size, time and date stamps, permissions, and data content, is stored either in MFT entries, or in space outside the MFT that is described by MFT entries. As files are added to an NTFS file system volume, more entries are added to the MFT and the MFT increases in size. When files are deleted from an NTFS file system volume, their MFT entries are marked as free and may be reused. However, disk space that has been allocated for these entries is not reallocated, and the size of the MFT does not decrease. (The master file table (MFT) is a database in which information about every file and directory on an NT File System (NTFS) volume is stored. There is at least one record for every file and directory on the NTFS logical volume. Each record contains attributes that tell the operating system (OS) how to deal with the file or directory associated with the record.)

File Control Block (FCB): A File Control Block (FCB) is a file system structure in which the state of an open file is maintained. A FCB is managed by the operating system, but it resides in the memory of the program that uses the file, not in operating system memory. This allows a process to have as many


files open at one time as it wants to, provided it can spare enough memory for an FCB per file. A full FCB is 36 bytes long; in early versions of CP/M, it was 33 bytes. This fixed size, which could not be increased without breaking application compatibility, led to the FCB's eventual demise as the standard method of accessing files. The meanings of several of the fields in the FCB differ between CP/M and DOS, and also depend on what operation is being performed.
