The EXT2 File System Presented by: S. Arun Nair Abhinav Golas

Post on 29-Jan-2016

The EXT2 File System

Presented by:

S. Arun Nair

Abhinav Golas

The Second Extended File System

• The Second Extended File system was devised (by Rémy Card) as an extensible and powerful file system for Linux.

• It is also the most successful file system so far in the Linux community and is the basis for all of the currently shipping distributions.

• Due to this, it is extremely well integrated into the kernel, with good performance enhancements.

Disk Layout

• Partition Table
– Master Boot Record (MBR)
– List of partitions:
   • Primary Partitions
   • Extended Partitions
– How to specify:
   • Block size for partition table
   • Beginning block
   • Ending block

Filesystem

• Way to organize data on 1 or multiple partitions

• Basic abstraction for file usage

Ext2 File System Layout

Partition: BLOCK GP 0 | BLOCK GP 1 | . . . | BLOCK GP N-1 | BLOCK GP N

Each block group: SUPER BLOCK | GROUP DESCRIPTOR | BLOCK BITMAP | INODE BITMAP | INODE TABLE | DATA BLOCKS

Partition Layout – ext2

• The Boot sector block is optional, not required if you do not want to make this partition bootable

• Each Block group has the same number of available data blocks and inodes

• Having multiple block groups helps counter fragmentation, improves reliability (since backups of the superblock are there) and even speeds up access as the inode table is near the data blocks – reduced seek time for data blocks

Partition layout – ext2

• Each block group has the following structure: superblock, group descriptors, block bitmap, inode bitmap, inode table, data blocks

• Not all block groups have the superblock. The first block group, however, must have it, and that copy is the one used by the kernel. The others are backups, used by filesystem checkers for consistency checks.

Some definitions

• Boot sector – Block which may contain the stage 1 boot loader and which points to the stage 1.5 or stage 2 boot loader

• Superblock – The filesystem header; it identifies and represents the filesystem and provides relevant information about the fs. It always starts 1024 bytes into the partition, whether or not a boot sector is present: that is block 1 with 1 KB blocks, and inside block 0 with larger blocks

• FS/Group descriptor – Pointers to the bitmaps and table in the block group
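As a rough illustration of how a superblock is identified, the sketch below checks a partition image for the ext2 magic number. It assumes the usual on-disk layout, in which the superblock begins 1024 bytes into the partition and the 16-bit little-endian magic sits 56 bytes into the superblock; the function name looks_like_ext2 is made up for this example and is not a kernel interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EXT2_SUPER_OFFSET 1024   /* assumed: superblock starts 1024 bytes in */
#define EXT2_MAGIC_OFFSET 56     /* assumed: offset of the magic inside it */
#define EXT2_SUPER_MAGIC  0xEF53

/* Read a little-endian 16-bit value regardless of host byte order. */
static uint16_t read_le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Hypothetical helper: 1 if the image carries the ext2 magic, else 0. */
int looks_like_ext2(const unsigned char *image, size_t len)
{
    size_t pos = EXT2_SUPER_OFFSET + EXT2_MAGIC_OFFSET;

    assert(image != NULL);
    if (len < pos + 2)
        return 0;               /* image too short to hold a superblock */
    return read_le16(image + pos) == EXT2_SUPER_MAGIC;
}
```

Reading the value byte by byte keeps the check independent of the host's endianness, which matters because ext2 stores its fields little-endian on disk.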

Some definitions

• Block bitmap – Block usage information, tells which blocks in the block group are empty(0) or used(1)

• Inode Bitmap – Inode usage information

• Inode table – Table of the inodes. Each inode provides necessary and relevant information about each file.

• Data blocks – blocks where the data is stored!
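The bitmap definitions above can be made concrete with a few helpers. This is a user-space sketch, not the kernel's bit operations: it assumes bit 0 is the least significant bit of byte 0, and all function names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* 1 if bit nr is set (block or inode in use), 0 if it is clear (free). */
int bitmap_test(const unsigned char *map, size_t nr)
{
    return (map[nr / 8] >> (nr % 8)) & 1;
}

/* Mark bit nr as used. */
void bitmap_set(unsigned char *map, size_t nr)
{
    map[nr / 8] |= (unsigned char)(1u << (nr % 8));
}

/* Clear bit nr and return its previous value; 0 means it was already free. */
int bitmap_test_and_clear(unsigned char *map, size_t nr)
{
    int old = bitmap_test(map, nr);

    map[nr / 8] &= (unsigned char)~(1u << (nr % 8));
    return old;
}

/* Index of the first free (zero) bit among the first nbits, or nbits if none. */
size_t bitmap_find_first_zero(const unsigned char *map, size_t nbits)
{
    assert(map != NULL);
    for (size_t i = 0; i < nbits; i++)
        if (!bitmap_test(map, i))
            return i;
    return nbits;
}
```

Returning the old value from the clear operation is what lets a caller notice a double free, in the spirit of the "bit already cleared" error reported by ext2_free_inode().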

The Ext2 Superblock

• The Superblock contains a description of the basic size and shape of this file system.

• System keeps multiple copies of the Superblock in many Block Groups.

• It holds the following information:
– Magic Number : 0xef53 for the current implementation.
– Revision Level : for checking compatibility
– Mount Count and Maximum Mount Count : to ensure that the filesystem is periodically checked
– Block Group Number : the Block Group that holds this copy of the Superblock.
– Block Size : size of a block for the file system, in bytes.

The Ext2 Superblock

– Blocks per Group : fixed when the file system is created. The block bitmap must fit into 1 block, hence the number of blocks per group is at most 8 * block size
– Free Blocks : number of free blocks in the system, excluding the blocks reserved for root
– Free Inodes : number of free Inodes in the system, again excluding inodes reserved for root
– First Inode : the first Inode in an EXT2 root file system would be the directory entry for the '/' directory.
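The blocks-per-group limit is simple arithmetic: one bitmap block has 8 bits per byte, and each bit tracks one block. The sketch below works this out and derives how many groups a partition needs; the function names are illustrative, and the group count deliberately ignores details such as the first data block.

```c
#include <assert.h>
#include <stdint.h>

/* A one-block bitmap can describe at most 8 * block_size blocks. */
uint32_t max_blocks_per_group(uint32_t block_size)
{
    assert(block_size >= 1024);
    return 8u * block_size;
}

/* Number of block groups needed to cover total_blocks, rounding up. */
uint32_t group_count(uint32_t total_blocks, uint32_t blocks_per_group)
{
    assert(blocks_per_group > 0);
    /* Widen to 64 bits so the rounding addition cannot overflow. */
    return (uint32_t)(((uint64_t)total_blocks + blocks_per_group - 1)
                      / blocks_per_group);
}
```

With 1 KB blocks this gives 8192 blocks (8 MB) per group, and with 4 KB blocks 32768 blocks (128 MB) per group.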

Superblock

• Defined in include/linux/fs.h

struct super_block {
    struct list_head s_list;            /* Keep this first */
    dev_t s_dev;                        /* search index; _not_ kdev_t */
    unsigned long s_blocksize;
    unsigned long s_old_blocksize;
    unsigned char s_blocksize_bits;
    unsigned char s_dirt;
    unsigned long long s_maxbytes;      /* Max file size */
    struct file_system_type *s_type;
    struct super_operations *s_op;
    struct dquot_operations *dq_op;
    struct quotactl_ops *s_qcop;
    struct export_operations *s_export_op;
    unsigned long s_flags;
    unsigned long s_magic;
    struct dentry *s_root;
    struct rw_semaphore s_umount;
    struct semaphore s_lock;
    int s_count;
    int s_syncing;
    int s_need_sync_fs;
    atomic_t s_active;
    void *s_security;

    struct list_head s_dirty;           /* dirty inodes */
    struct list_head s_io;              /* parked for writeback */
    struct hlist_head s_anon;           /* anonymous dentries for (nfs) exporting */
    struct list_head s_files;

    struct block_device *s_bdev;
    struct list_head s_instances;
    struct quota_info s_dquot;          /* Diskquota specific options */

    int s_frozen;
    wait_queue_head_t s_wait_unfrozen;

    char s_id[32];                      /* Informational name */

    void *s_fs_info;                    /* Filesystem private info */

    /*
     * The next field is for VFS *only*. No filesystems have any business
     * even looking at it. You had been warned.
     */
    struct semaphore s_vfs_rename_sem;  /* Kludge */
};

The Ext2 Group Descriptor

• All the group descriptors for all of the Block Groups are duplicated in each Block Group in case of file system corruption.

• The Group Descriptor contains the following:
– Blocks Bitmap : block number of the block allocation bitmap
– Inode Bitmap : block number of the Inode allocation bitmap
– Inode Table : the block number of the starting block for the Inode table for this Block Group.
– Free blocks count : number of data blocks free in the Group
– Free Inodes count : number of Inodes free in the Group
– Used directory count : number of inodes allocated to directories
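To show how the group descriptor's Inode Table field is used, here is a sketch of locating an inode on disk from its number. It relies on inode numbers starting at 1 and on every group holding the same number of inodes; the function names and the way parameters are passed are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Block group that holds inode number ino (inode numbers start at 1). */
uint32_t inode_group(uint32_t ino, uint32_t inodes_per_group)
{
    assert(ino >= 1 && inodes_per_group > 0);
    return (ino - 1) / inodes_per_group;
}

/* Slot of the inode inside that group's inode table. */
uint32_t inode_index(uint32_t ino, uint32_t inodes_per_group)
{
    assert(ino >= 1 && inodes_per_group > 0);
    return (ino - 1) % inodes_per_group;
}

/* Block holding the slot, given the table's first block from the descriptor. */
uint32_t inode_block(uint32_t inode_table_block, uint32_t index,
                     uint32_t inode_size, uint32_t block_size)
{
    return inode_table_block
           + (uint32_t)(((uint64_t)index * inode_size) / block_size);
}

/* Byte offset of the slot within that block. */
uint32_t inode_offset(uint32_t index, uint32_t inode_size, uint32_t block_size)
{
    return (uint32_t)(((uint64_t)index * inode_size) % block_size);
}
```

For example, with 2048 inodes per group, 128-byte inodes and 1 KB blocks, inode 2 (the root directory) is slot 1 of group 0, 128 bytes into the first block of that group's inode table.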

The superblock usage sequence

• Mount – when the filesystem is mounted rw, the VFS mount path sets the s_state variable to EXT2_ERROR_FS. At all other times it is EXT2_VALID_FS. This lets the next mount check whether the filesystem was mounted and unmounted cleanly.

• Cached copies of this superblock and the group descriptor are always kept.

• Most VFS superblock operations are inherited for ext2

The Ext2 Inode

[Figure: the ext2 inode. Fields: Mode, Owner Info., Size, Timestamps, Direct Blocks, Indirect Blocks, Double Indirect, Triple Indirect. The block pointers lead, directly or through indirect blocks, to Data blocks.]

The Ext2 Inode

• Direct/Indirect Blocks : Pointers to the blocks that contain the data that this Inode is describing.

• Timestamp : The time that the Inode was created and the last time that it was modified.

• Size : The size of the file in bytes.

• Owner info : This stores user and group identifiers of the owners of this file or directory

• Mode : This holds two pieces of information: what this inode describes and the permissions that users have to it.
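The direct and indirect pointers can be illustrated by working out which level of indirection serves a given block of a file. The sketch assumes the standard ext2 layout of 12 direct pointers followed by one single, one double and one triple indirect pointer, with 4-byte block numbers; block_pointer_level is a name invented here.

```c
#include <assert.h>
#include <stdint.h>

#define EXT2_NDIR_BLOCKS 12   /* direct pointers held in the inode itself */

/*
 * Level of indirection needed to reach logical block file_block:
 * 0 = direct, 1 = single, 2 = double, 3 = triple indirect,
 * -1 = beyond what the inode can address.
 */
int block_pointer_level(uint64_t file_block, uint32_t block_size)
{
    uint64_t per_block = block_size / 4;  /* 4-byte pointers per block */

    assert(block_size >= 1024);
    if (file_block < EXT2_NDIR_BLOCKS)
        return 0;
    file_block -= EXT2_NDIR_BLOCKS;
    if (file_block < per_block)
        return 1;
    file_block -= per_block;
    if (file_block < per_block * per_block)
        return 2;
    file_block -= per_block * per_block;
    if (file_block < per_block * per_block * per_block)
        return 3;
    return -1;
}
```

With 1 KB blocks each indirect block holds 256 pointers, so blocks 0 to 11 are direct, 12 to 267 go through the single indirect block, and the double indirect range starts at block 268.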

• struct inode {
    kdev_t i_dev;
    unsigned long i_ino;
    umode_t i_mode;
    nlink_t i_nlink;
    /* uid, gid etc. */
    ….
}

• Inodes are managed as doubly linked lists as well as a hash table.

• The iget() function can be used to get the inode specified by the superblock and inode number. It uses hints to resolve cross-mounted file systems as well. Any access to the inode increments a usage counter.

Inode Allocation

• There are two policies for allocating an inode. If the new inode is a directory, then a forward search is made for a block group with both free space and a low directory-to-inode ratio (find_group_dir); if that fails, then of the groups with above-average free space, that group with the fewest directories already is chosen (find_group_orlov). For other inodes, search forward from the parent directory's block group to find a free inode (find_group_other).
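The "search forward from the parent directory's block group" policy can be sketched as a wrap-around scan over per-group free inode counts. This is a simplified stand-in, not the kernel's find_group_other(), which applies further heuristics; the array of counts and the function name are assumptions for the example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Starting at parent_group, scan forward through all groups, wrapping
 * around at the end, and return the first group that still has a free
 * inode. Returns -1 if every group is full.
 */
int find_group_forward(const uint32_t *free_inodes, int ngroups,
                       int parent_group)
{
    assert(ngroups > 0 && parent_group >= 0 && parent_group < ngroups);
    for (int i = 0; i < ngroups; i++) {
        int group = (parent_group + i) % ngroups;  /* modulo gives the wrap */

        if (free_inodes[group] > 0)
            return group;
    }
    return -1;
}
```

Starting at the parent's group keeps a file's inode close to its directory when space allows, and the modulo step guarantees that each group is visited exactly once.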

struct inode *ext2_new_inode(struct inode *dir, int mode)
{
    ….
    if (S_ISDIR(mode)) {
        if (test_opt(sb, OLDALLOC))
            group = find_group_dir(sb, dir);
        else
            group = find_group_orlov(sb, dir);
    } else
        group = find_group_other(sb, dir);
    ….
    /*
     * Pseudocode: loop through all N block groups, starting with the
     * one computed above:
     *   find the first zero bit in the group's inode bitmap
     *   if no bit is zero then group = (group + 1) % N; continue;
     *   else try to set that bit to 1; if it has already become 1 {
     *       if the bitmap has no more zero bits then group = (group + 1) % N
     *       else find a new zero bit and try to set it to 1 again
     *   }
     */
    . . . . . . .
    /* Set all the inode parameters from the mode information and from the parent directory. */
    if (test_opt (sb, GRPID))
        inode->i_gid = dir->i_gid;
    else if (dir->i_mode & S_ISGID) {
        inode->i_gid = dir->i_gid;
        if (S_ISDIR(mode))
            mode |= S_ISGID;
    } else
        . . . . . .
    insert_inode_hash(inode);
    . . . . . .
    ext2_preread_inode(inode);
    return inode;
}

We perform asynchronous prereading of the new inode's inode block when we create the inode, in the expectation that the inode will be written back soon. There are two reasons:
– When creating a large number of files, the async prereads will be nicely merged into large reads
– When writing out a large number of inodes, we don't need to keep on stalling the writes while we read the inode block.

Inode De-allocation

• When we get the inode, we're the only people that have access to it, and as such there are no race conditions we have to worry about. The inode is not on the hash-lists, and it cannot be reached through the file system because the directory entry has been deleted earlier.

void ext2_free_inode (struct inode * inode)
{
    /*
     * We must free any quota before locking the superblock, as writing
     * the quota to disk may need the lock as well.
     */
    . . . . .
    if (!is_bad_inode(inode)) {
        ext2_xattr_delete_inode(inode);
        DQUOT_FREE_INODE(inode);
        DQUOT_DROP(inode);
    }
    . . . . . . .
    /*
     * We must make sure that we get no aliases, which means that we have
     * to call "clear_inode()" _before_ we mark the inode not in use in
     * the inode bitmaps. Otherwise a newly created file might use the
     * same inode number (not actually the same pointer though), and then
     * we'd have two inodes sharing the same inode number and space on
     * the hard disk.
     */
    . . . .
    clear_inode (inode);
    if (!ext2_clear_bit_atomic(sb_bgl_lock(EXT2_SB(sb), block_group),
                bit, (void *) bitmap_bh->b_data))
        ext2_error (sb, "ext2_free_inode",
                "bit already cleared for inode %lu", ino);
    else
        ext2_release_inode(sb, block_group, is_directory);
    mark_buffer_dirty(bitmap_bh);
    if (sb->s_flags & MS_SYNCHRONOUS)
        sync_dirty_buffer(bitmap_bh);
error_return:
    brelse(bitmap_bh);
}

Inode Update

• First, get pointers to the buffer head and to the on-disk inode in memory using ext2_get_inode(pointer to the superblock, inode no., pointer to the pointer to the buffer head)

. . . . . . .
struct ext2_inode * raw_inode = ext2_get_inode(sb, ino, &bh);
. . . . . . .

• Then update the raw inode from the inode given. The update depends on what kind of file the inode represents

raw_inode->i_blocks = cpu_to_le32(inode->i_blocks);
raw_inode->i_dtime = cpu_to_le32(ei->i_dtime);
raw_inode->i_flags = cpu_to_le32(ei->i_flags);
raw_inode->i_faddr = cpu_to_le32(ei->i_faddr);
raw_inode->i_frag = ei->i_frag_no;
raw_inode->i_fsize = ei->i_frag_size;

• If it is a regular file, then we copy the attributes as well as the addresses of the blocks containing data into the field i_block[n] of the raw_inode (the final else branch of the code that follows).

. . . . . .
if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode)) {
    if (old_valid_dev(inode->i_rdev)) {
        raw_inode->i_block[0] =
            cpu_to_le32(old_encode_dev(inode->i_rdev));
        raw_inode->i_block[1] = 0;
    } else {
        raw_inode->i_block[0] = 0;
        raw_inode->i_block[1] =
            cpu_to_le32(new_encode_dev(inode->i_rdev));
        raw_inode->i_block[2] = 0;
    }
} else for (n = 0; n < EXT2_N_BLOCKS; n++)
    raw_inode->i_block[n] = ei->i_data[n];

• Else, if it is a special file type (a character or block device), then we set its attributes in a different manner: the encoded device number is stored in i_block[0] or i_block[1].

Inode Deletion

• Check whether the inode actually exists or not. Record the current time as the deletion time (time keeping is done in this way).

• Mark this inode dirty and then call the update function.

void ext2_delete_inode (struct inode * inode)
{
    if (is_bad_inode(inode))
        goto no_delete;
    EXT2_I(inode)->i_dtime = get_seconds();
    mark_inode_dirty(inode);
    ext2_update_inode(inode, inode_needs_sync(inode));

• Then free the inode, i.e. we first release the space used by the inode and then make changes in the inode and block bitmaps (ext2_free_inode()) to reflect these changes in the block group.

    inode->i_size = 0;
    if (inode->i_blocks)
        ext2_truncate (inode);
    ext2_free_inode (inode);
    return;
no_delete:
    clear_inode(inode); /* We must guarantee clearing of inode... */
}

The EXT2 Directories

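[Figure: a directory block holding entries of the form (inode, entry length, name length, name): (i1, 15, 5, file) at offset 0 and (i2, 40, 14, arbit) at offset 15, with the next entry at offset 55. The inode numbers refer to the INODE TABLE.]

Directories are special files that are used to create and hold access paths to the files in the file system

The first two entries for every directory are always the standard ‘.’ and ‘..’ entries meaning ``this directory'' and ``the parent directory'' respectively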
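A directory lookup amounts to walking the variable-length entries of a directory block. The sketch below assumes the revision-1 entry layout (4-byte inode, 2-byte entry length, 1-byte name length, 1-byte file type, then the name, all little-endian); dir_lookup is a name made up for the example and handles a single block only.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DIRENT_HEADER 8   /* assumed: inode(4) rec_len(2) name_len(1) type(1) */

static uint32_t le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static uint16_t le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Search one directory block for name; return its inode number, or 0. */
uint32_t dir_lookup(const unsigned char *block, size_t block_size,
                    const char *name)
{
    size_t want = strlen(name);
    size_t pos = 0;

    assert(block != NULL);
    while (pos + DIRENT_HEADER <= block_size) {
        uint32_t ino = le32(block + pos);
        uint16_t rec_len = le16(block + pos + 4);
        uint8_t name_len = block[pos + 6];

        if (rec_len < DIRENT_HEADER || pos + rec_len > block_size)
            break;                        /* corrupt entry: stop walking */
        if (ino != 0 && name_len == want &&
            DIRENT_HEADER + (size_t)name_len <= rec_len &&
            memcmp(block + pos + DIRENT_HEADER, name, want) == 0)
            return ino;
        pos += rec_len;                   /* entry length links to the next */
    }
    return 0;
}
```

Because each entry carries its own length, an entry can be deleted by folding its space into the previous entry's length, and entries with inode 0 are simply skipped.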

Block manipulation

• Aims:
– Avoid fragmentation
   • Block groups
– Low access times
• Access : use a logical address inside the block group and translate it using the block group number
• Allocation : ext2_get_block() -> ext2_alloc_block() called with inode pointer and a goal.

Block allocation

• Goal decided by ext2_getblk()
• Heuristic:
– 1. The block to be allocated is next to the last allocated block
– 2. If not 1, then it is next to some previously allocated block
– 3. If not 2, then it is in the same block group as the inode
• Search:
– If the goal is in the preallocated blocks, allocate it
– If the goal is free, allocate it, and preallocate up to 8 blocks after that
– Else search the next 64 blocks
– Consider all block groups, first for a set of at least 8 blocks, then for solo blocks
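Part of the search order above can be sketched against a single group's block bitmap. This simplified version tries the goal, then the 64 blocks after it, then any free block in the group; preallocation, the runs of at least 8 blocks and the move to other groups are left out, and pick_block is a name invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* 1 if block nr is marked used in the bitmap (bit 0 = LSB of byte 0). */
static int in_use(const unsigned char *map, size_t nr)
{
    return (map[nr / 8] >> (nr % 8)) & 1;
}

/*
 * Pick a free block in one group: the goal itself if it is free,
 * otherwise the first free block among the 64 after the goal,
 * otherwise the first free block anywhere in the group.
 * Returns nbits if the group is completely full.
 */
size_t pick_block(const unsigned char *map, size_t nbits, size_t goal)
{
    size_t end;

    assert(map != NULL && goal < nbits);
    if (!in_use(map, goal))
        return goal;

    end = goal + 1 + 64;
    if (end > nbits)
        end = nbits;
    for (size_t i = goal + 1; i < end; i++)
        if (!in_use(map, i))
            return i;

    for (size_t i = 0; i < nbits; i++)
        if (!in_use(map, i))
            return i;
    return nbits;
}
```

Trying the goal and its neighbourhood first is what keeps consecutive blocks of a file close together on disk, which is the point of the heuristic.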

Block allocation

• Preallocation :
– Set in superblock, used to avoid extra disk accesses
– Used even if disk is close to being filled, because it is a big time saver
– Preallocated blocks are released on truncation, close or a non-sequential write

• Also, corresponding fields in group descriptor, inode, block bitmap are updated

Filesystem consistency

• Superblock, group descriptors etc. – all metadata must be consistent with each other

• E2fsck – file system consistency checker, invoked if partition not unmounted before shutdown, or timeout – each disk must be checked after a certain number of mounts

• Will try to ensure that the metadata, superblock downwards, is consistent.

• Consistency of data with metadata is not ensured – big problem
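The two triggers for a check, an unclean shutdown and the mount-count limit, can be expressed as a small predicate over superblock fields. This is a sketch of the idea rather than e2fsck's actual logic; it assumes the state flag values EXT2_VALID_FS = 1 and EXT2_ERROR_FS = 2 and treats a non-positive maximum mount count as "limit disabled".

```c
#include <assert.h>
#include <stdint.h>

#define EXT2_VALID_FS 0x0001   /* assumed value: unmounted cleanly */
#define EXT2_ERROR_FS 0x0002   /* assumed value: errors detected */

/* Hypothetical helper: 1 if the filesystem should be checked, else 0. */
int needs_fsck(uint16_t state, uint16_t mnt_count, int16_t max_mnt_count)
{
    if (!(state & EXT2_VALID_FS))
        return 1;                      /* not unmounted cleanly */
    if (state & EXT2_ERROR_FS)
        return 1;                      /* errors were recorded */
    if (max_mnt_count > 0 && mnt_count >= (uint16_t)max_mnt_count)
        return 1;                      /* periodic check is due */
    return 0;
}
```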

Possible solutions

• Careful ordering of changes can minimize damage, e.g. increment the link counter of an inode before putting the hard link on the disk

• Still not as completely safe as certain systems require

• Journaling

Journaling

• Possible solution to data-metadata consistency problem

• Log all changes before writing them to disk, preferably keeping the log on a separate partition/disk from the data itself

• Recovery:
– System failure before commit to journal – ignore, as no changes have been made to data or metadata
– After commit – make all changes mentioned in the journal to the filesystem

• Expensive operation – too many disk writes

• Another problem – a change may involve many low level ops – all may not be safeguarded by journal -> partially copied files etc.

The ext solution

• Journaling is not a part of ext2, but can be switched on

• A pivotal part of ext3

• 3 modes to balance performance with safety:
– Journal – safest, log all data and metadata before write
– Ordered (default) – log only metadata, but group metadata with its related data, and write the data to disk before the metadata, because the metadata will be restored from the log
– Writeback – log only metadata, mode similar to the journalling mode found in other filesystems

Ext3 journalling

• Uses JBD layer in kernel – Journaling Block Device layer

• Intended as general journaling support, currently used only by ext3

JBD logging

• Log structure:
– Log record – describes a single update of a disk block in the filesystem
– Atomic operation handle – includes log records relative to a single high-level change of the filesystem
– Transaction – includes several atomic operations, basic unit for fsck retrieval

Log record

• Description of low-level operation to be executed by system

• Represented as normal blocks of data, marked with journal_block_tag_t tag – saves logical block number affected, and status flags

• journal_head attached to the buffer head if ordered-mode write ordering is to be maintained

Atomic operations

• Group of log records

• journal_start() indicates start, journal_stop() indicates end

• Ensures that the intended operations are executed either completely or not at all, never as a partial subset

Transaction

• A grouping of consecutive atomic operations

• MUST be stored in consecutive blocks

• After creation, a transaction ends if:
– Fixed timeout expires, typically 5 seconds (fs set)
– No free blocks are left in the journal for a new atomic operation handle

• Described by descriptor of type transaction_t

Transaction

• State described by t_state

• Only complete transactions are processed for recovery – all log records included in the transaction have been physically written into the journal – t_state stores T_FINISHED

• Incomplete transactions – skipped by fsck. Possible t_state values:
– T_RUNNING – still accepting atomic operation handles
– T_LOCKED – not accepting new op handles, but some are incomplete
– T_FLUSH – all atomic op handles have finished, but some log records are being written to journal
– T_COMMIT – all log records written to disk, transaction to be marked complete
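The recovery rule, replay complete transactions and skip the rest, can be sketched as a filter over the journal. The enum and struct here are illustrative stand-ins that reuse the state names from the slide; they are not the real JBD transaction_t, and the numeric values are arbitrary.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative states, named after the t_state values listed above. */
enum t_state { T_RUNNING, T_LOCKED, T_FLUSH, T_COMMIT, T_FINISHED };

/* Hypothetical, minimal stand-in for a journal transaction. */
struct transaction {
    unsigned id;
    enum t_state state;
};

/*
 * Copy the ids of the transactions that recovery should replay, in
 * journal order, into out (which must have room for n entries).
 * Only T_FINISHED transactions qualify. Returns the number kept.
 */
size_t select_for_replay(const struct transaction *journal, size_t n,
                         unsigned *out)
{
    size_t kept = 0;

    assert(journal != NULL && out != NULL);
    for (size_t i = 0; i < n; i++)
        if (journal[i].state == T_FINISHED)
            out[kept++] = journal[i].id;
    return kept;
}
```

Skipping anything short of T_FINISHED is what makes recovery safe: a transaction whose log records are only partly in the journal is never applied to the filesystem.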

JBD functioning

• At any time, there can be several transactions in journal, but only 1 may be incomplete

• A completed transaction is removed from the journal after JBD verifies that all buffers referred to by its log records have been successfully written to disk