
ECE5658, Operating Systems Design, Fall 2019, Dongkun Shin

File Systems – EXT4

Dongkun Shin

Embedded Software Laboratory

Sungkyunkwan University

http://nyx.skku.ac.kr/


Evolution of EXT File System

• EXT: a very popular family of Linux file systems, known for reliability, a rich feature set, relatively good performance, and strong compatibility between versions

                EXT2                         EXT3            EXT4
Introduced      1993                         2001 (2.4.15)   2006 (2.6.19), 2008 (2.6.28)
Max file size   16 GB ~ 2 TB                 16 GB ~ 2 TB    16 GB ~ 16 TB
Max FS size     2 TB ~ 32 TB                 2 TB ~ 32 TB    1 EB
Feature         Block group, no journaling   Journaling      Extent mapping, multiblock allocation, delayed allocation

1 EB (exabyte) = 1024 PB (petabyte)
1 PB = 1024 TB (terabyte)

Size limits on EXT2/3

Block size   Max file size   Max FS size
1 KB         16 GB           2 TB
2 KB         256 GB          8 TB
4 KB         2 TB            16 TB
8 KB         2 TB            32 TB


Disk Data Structures

• Block groups
– keep the data blocks belonging to a file in the same block group
• Both the superblock and the group descriptors are duplicated in each block group
– Only the SB and GD in block group 0 are used by the kernel
– e2fsck can refer to the old copies in other block groups
• The block bitmap must be stored in a single block
– Example: 32-GB EXT3 with 4-KB block size
– A 4-KB block bitmap describes 32K data blocks (128 MB)
– 32 GB / 128 MB = 256 block groups
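The arithmetic above can be checked with a few lines of Python. This is only a sketch of the slide's calculation; the function name block_group_geometry is made up for illustration.

```python
def block_group_geometry(fs_bytes: int, block_size: int) -> dict:
    """Derive block-group geometry from the rule that the bitmap fills one block."""
    # One bit per block, and the bitmap occupies exactly one block.
    blocks_per_group = 8 * block_size
    group_bytes = blocks_per_group * block_size
    total_blocks = fs_bytes // block_size
    # Round up: a trailing partial group still counts as a group.
    group_count = -(-total_blocks // blocks_per_group)
    return {
        "blocks_per_group": blocks_per_group,
        "group_bytes": group_bytes,
        "group_count": group_count,
    }


GIB = 1024 ** 3
geometry = block_group_geometry(32 * GIB, 4096)
print(geometry)
```

For the 32-GB, 4-KB example this gives 32768 blocks (128 MB) per group and 256 groups, matching the slide.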


Superblock

• 2 sectors (1024 bytes) that describe the file system

– Volume label

– Block size

– # of blocks per group

– # of reserved blocks before the 1st block group

– Superblock block group number

– # of free inodes & blocks (total over all groups)
• For block group 0, the first 1024 bytes are unused
– reserved for boot sectors and other oddities
• Copies of the superblock are in the first block of each block group (with sparse_super, only in selected block groups)


Superblock

struct ext4_super_block (1024 bytes)

Type        Field                  Description
__le32      s_inodes_count         # of inodes in filesystem
__le32      s_blocks_count         # of blocks in filesystem
__le32      s_free_blocks_count    Free blocks counter
__le32      s_free_inodes_count    Free inodes counter
__le32      s_log_block_size       Block size (0: 1024 bytes, 1: 2048 bytes, …)
__le32      s_blocks_per_group     # of blocks per group
__le32      s_inodes_per_group     # of inodes per group
__le16      s_state                Status flag (mounted, unmounted, error)
__le16      s_block_group_nr       Block group number of this superblock
char [64]   s_last_mounted         Pathname of last mount point
…           …                      …


Block Group Descriptor

• Group descriptors: one for each block group
• Starting block addresses
– block bitmap, inode bitmap, inode table
• # of free inodes & blocks for the group
• Located in the block after the superblock
• Backup copies are in the same block groups as the superblock backups

# dumpe2fs /dev/hda3 | grep -i superblock

Primary superblock at 0, Group descriptors at 1-1

Backup superblock at 32768, Group descriptors at 32769-32769

Backup superblock at 98304, Group descriptors at 98305-98305

Backup superblock at 163840, Group descriptors at 163841-163841

Backup superblock at 229376, Group descriptors at 229377-229377

Backup superblock at 294912, Group descriptors at 294913-294913
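The backup locations in this listing are consistent with the sparse_super rule, under which backups live only in group 1 and in groups whose number is a power of 3, 5, or 7. The sketch below reproduces the listed block numbers under that assumption, with 32768 blocks per group and first data block 0; the helper names are illustrative.

```python
def is_power_of(n: int, base: int) -> bool:
    """True if n == base**k for some k >= 0."""
    while n > 1 and n % base == 0:
        n //= base
    return n == 1


def has_superblock_backup(group: int) -> bool:
    """sparse_super rule: groups 0 and 1, plus powers of 3, 5, and 7."""
    if group <= 1:
        return True
    return any(is_power_of(group, base) for base in (3, 5, 7))


def backup_superblock_blocks(group_count: int, blocks_per_group: int,
                             first_data_block: int = 0) -> list:
    """Block numbers of the backup superblocks (group 0 holds the primary)."""
    return [first_data_block + g * blocks_per_group
            for g in range(1, group_count)
            if has_superblock_backup(g)]


backups = backup_superblock_blocks(group_count=10, blocks_per_group=32768)
print(backups)
```

For the first ten groups this yields the backups at groups 1, 3, 5, 7, and 9, the same five block numbers that dumpe2fs reports above.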


SB & BDT

[Figure: on-disk layout of block groups (Group 0, Group 1, ..., Group n). Labels: Super Block, Block Desc Table, and, for each group, Block Bitmap, Inode Bitmap, Inode Table.]


Block Group Descriptor

struct ext4_group_desc (64 bytes)

Type     Field                   Description
__le32   bg_block_bitmap         Block number of block bitmap
__le32   bg_inode_bitmap         Block number of inode bitmap
__le32   bg_inode_table          Block number of first inode table block
__le16   bg_free_blocks_count    Number of free blocks in the group
__le16   bg_free_inodes_count    Number of free inodes in the group
__le16   bg_used_dirs_count      Number of directories in the group
__le16   bg_flags                0x1: inode table and bitmap are not initialized (EXT4_BG_INODE_UNINIT)
                                 0x2: block bitmap is not initialized (EXT4_BG_BLOCK_UNINIT)
                                 0x4: inode table is zeroed (EXT4_BG_INODE_ZEROED)
…        …                       …


Bitmaps

• Block bitmap
– manages the allocation status of the blocks in the group
– One bit per block in the group
– A block bitmap is always 1 block in size
– block group size (in blocks) = 8 * number_of_bytes_in_a_logical_block
• Inode bitmap
– manages the allocation status of the inodes in the group
– Size (in bytes) = # of inodes per group / 8
– Size defined at file system creation
– Typically fewer inodes than blocks per group


Inode Table

• Contains the inodes that describe the files in the group
• Inode table
– Multiple consecutive blocks
– Each block contains a predefined number of inodes
• Size = # of inodes * 256 bytes (inode size)
• Inode
– Corresponds to one file/directory and stores the file's primary metadata
– file's size, ownership, and temporal information
– Typically 128~256 bytes
– Inode points to the file content blocks
– A directory entry holds the file/directory name and a pointer to the inode in the inode table

[Figure: the inode table drawn as a grid of 256-byte inodes; labels: DE (directory entry), Data block.]
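Given an inode number, the position of the on-disk inode follows from simple arithmetic. The sketch below assumes that inode numbers start at 1 and that each group's inode table begins at the block recorded in its group descriptor (bg_inode_table); locate_inode and its parameters are illustrative names, not kernel functions.

```python
def locate_inode(ino: int, inodes_per_group: int, inode_size: int,
                 block_size: int) -> dict:
    """Find the group and the position inside that group's inode table."""
    if ino < 1:
        raise ValueError("inode numbers start at 1")
    index = ino - 1
    group = index // inodes_per_group
    slot = index % inodes_per_group          # position inside the group's table
    byte_offset = slot * inode_size
    return {
        "group": group,
        "slot": slot,
        # Block offset relative to the group's bg_inode_table block.
        "block_in_table": byte_offset // block_size,
        "offset_in_block": byte_offset % block_size,
    }


root = locate_inode(2, inodes_per_group=8192, inode_size=256, block_size=4096)
other = locate_inode(8209, inodes_per_group=8192, inode_size=256, block_size=4096)
print(root, other)
```

With 256-byte inodes and 4-KB blocks, each inode table block holds 16 inodes, so inode 8209 (slot 16 of group 1) is the first inode in the second block of that group's table.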


Inode

struct ext4_inode (128/256 bytes)

Type                     Field           Description
__le16                   i_mode          File type and access rights
__le16                   i_uid           Owner identifier
__le32                   i_size          File length in bytes
__le32                   i_atime         Last access time
__le32                   i_ctime         Last inode change time
__le32                   i_mtime         Last data modification time
__le32                   i_dtime         Deletion time
__le16                   i_links_count   Hard links counter
__le32                   i_blocks        Number of data blocks of the file
__le32 [EXT4_N_BLOCKS]   i_block         Pointers to data blocks
__le32                   i_file_acl      File access control list
__le32                   i_dir_acl       Directory access control list

A read can update atime and thus modify the inode.
noatime: reads do not update atime, so only changes/writes modify the inode.


Inode (EXT3)

• All inodes have the same size: 128 bytes.

– 4KB block contains 32 inodes

• i_block

– Allocated block numbers


Inode & Directory

• A directory maps a file name to the related inode
• A directory is itself a file (supporting the file hierarchy)
• Each directory entry holds an inode number and a file name

[Figure: resolving /home/alphabet.txt in six steps (① to ⑥) across the inode table and the disk blocks.
Inode table rows shown: (status: dir, size: **, data blocks: 1), (status: dir, size: **, data blocks: 7), (status: file, size: 26, data blocks: 10).
Directory block: 2 "..", 2 ".", 3 usr, 4 home, 6 dev, 7 etc, …
Directory block: 2 "..", 4 ".", 8 reports.doc, 9 hello.c, 10 sudbir, 5 alphabet.txt, …
File data blocks: "abcdefghi…" and "/* comment for hello.c */ int main(){…}"]
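The lookup in the figure can be replayed with a small in-memory model. The tables below copy the directory entries from the figure; pairing the three inode rows with inode numbers 2, 4, and 5 is my reading of the figure and should be treated as an assumption, as are the names INODES, DIR_BLOCKS, and lookup.

```python
ROOT_INO = 2   # the root directory always uses inode 2

# inode number -> inode contents (assumed pairing with the figure's rows)
INODES = {
    2: {"type": "dir", "blocks": [1]},
    4: {"type": "dir", "blocks": [7]},
    5: {"type": "file", "size": 26, "blocks": [10]},
}

# data block number -> directory entries (name -> inode number)
DIR_BLOCKS = {
    1: {".": 2, "..": 2, "usr": 3, "home": 4, "dev": 6, "etc": 7},
    7: {".": 4, "..": 2, "reports.doc": 8, "hello.c": 9,
        "sudbir": 10, "alphabet.txt": 5},
}


def lookup(path: str):
    """Walk the path one component at a time, starting at the root inode."""
    ino = ROOT_INO
    visited = [ino]
    for name in (part for part in path.split("/") if part):
        inode = INODES[ino]
        if inode["type"] != "dir":
            raise NotADirectoryError(name)
        entries = {}
        for block in inode["blocks"]:        # read the directory's data blocks
            entries.update(DIR_BLOCKS[block])
        if name not in entries:
            raise FileNotFoundError(name)
        ino = entries[name]                  # directory entry -> inode number
        visited.append(ino)
    return ino, visited


ino, visited = lookup("/home/alphabet.txt")
print(ino, visited, INODES[ino]["blocks"])
```

Each path component costs one inode read plus one or more directory block reads, which is why the dentry cache described later matters.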


Inode

• Mode
– Different types of files are recognized
– regular files, pipes, etc.
– they use data blocks in different ways
• Regular file
– needs data blocks only when it starts to have data
– When first created, it is empty and needs no data blocks
• Directory
– a special kind of file whose data blocks store filenames together with the corresponding inode numbers
– ext4_dir_entry_2
– variable length
– the last field, name, is a variable-length array of up to EXT4_NAME_LEN (255) characters

File_type   Description
0           Unknown
1           Regular file
2           Directory
3           Character device
4           Block device
5           Named pipe
6           Socket
7           Symbolic link


Directory Entry

• The rec_len field may be interpreted as a pointer to the next valid directory entry
• Linear search causes a performance problem in large directories
• EXT4_INDEX_FL (htree)
– tune2fs -O dir_index /dev/XX
– e2fsck -fD /dev/XX

struct ext4_dir_entry_2

Type                   Field       Description
__le32                 inode       Inode number
__le16                 rec_len     Directory entry length (pointer to next item)
__u8                   name_len    Filename length (real)
__u8                   file_type   File type
char [EXT4_NAME_LEN]   name        Filename (padded to a multiple of 4)

When Ext4 deletes a directory entry, it just increases the record length of the previous entry so that it extends to the end of the deleted entry.

[Figure: directory block layout with one entry marked "Deleted".]
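To make the rec_len mechanism concrete, the sketch below packs a few entries into a 1-KB directory block using the field layout from the table, walks them by following rec_len, and deletes one entry by enlarging its predecessor. It is a simplified model rather than kernel code: all function names are illustrative, and clearing the inode field when the first entry of a block is deleted is an assumption based on my understanding of the kernel's behavior.

```python
import struct

_HDR = struct.Struct("<IHBB")    # inode, rec_len, name_len, file_type (8 bytes)


def pack_dir_block(entries, block_size):
    """entries: list of (inode, file_type, name); returns a mutable block."""
    block = bytearray(block_size)
    pos = 0
    for i, (ino, ftype, name) in enumerate(entries):
        raw = name.encode()
        rec_len = (_HDR.size + len(raw) + 3) & ~3     # round up to 4 bytes
        if i == len(entries) - 1:
            rec_len = block_size - pos                # last entry spans the rest
        _HDR.pack_into(block, pos, ino, rec_len, len(raw), ftype)
        block[pos + 8:pos + 8 + len(raw)] = raw
        pos += rec_len
    return block


def walk(block):
    """Yield (offset, rec_len, inode, name) by following rec_len."""
    pos = 0
    while pos < len(block):
        ino, rec_len, name_len, _ftype = _HDR.unpack_from(block, pos)
        if rec_len < _HDR.size:
            raise ValueError("corrupt directory block")
        yield pos, rec_len, ino, block[pos + 8:pos + 8 + name_len].decode()
        pos += rec_len


def list_names(block):
    return [name for _pos, _len, ino, name in walk(block) if ino != 0]


def delete_entry(block, name):
    """Unlink an entry by letting the previous entry's rec_len cover it."""
    prev = None
    for pos, rec_len, ino, entry_name in walk(block):
        if ino != 0 and entry_name == name:
            if prev is None:
                struct.pack_into("<I", block, pos, 0)         # mark as unused
            else:
                prev_pos, prev_len = prev
                struct.pack_into("<H", block, prev_pos + 4, prev_len + rec_len)
            return True
        prev = (pos, rec_len)
    return False


demo = pack_dir_block([(2, 2, "."), (2, 2, ".."),
                       (11, 1, "hello.c"), (12, 1, "notes.txt")], 1024)
before = list_names(demo)
delete_entry(demo, "hello.c")
after = list_names(demo)
print(before, after)
```

The bytes of the deleted entry are left in place; only the walk skips them, which is why deleted names can sometimes be recovered from a directory block.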


Inode

• Symbolic link
– If the pathname of a symbolic link has up to 60 characters, it is stored in the i_block field of the inode; no data block is therefore required.
– If the pathname is longer than 60 characters, however, a single data block is required.
• Device file, pipe, and socket
– No data blocks are required for these kinds of files.
– All the necessary information is stored in the inode.


Overall Disk Structure

[Figure: inode table (blocks 0 to 6 …) and data blocks (20 to 30 …).
Root dir inode: status dir, data block 20. Block 20 holds the directory entries 2 ".", 2 "..", 3 File1.c, 4 mydir, 5 myfile, 7 mydir2.
File1.c inode: status file, data blocks 21 22 23.
mydir inode: status dir, data block 24. Block 24 holds the directory entries 4 ".", 2 "..", 10 a.hwp, 11 b.c, 24 Test.c, 19 Note.doc.
myfile inode: status file, data blocks 25 26 27 29 30 31 32 33 34 35 36 37, followed by block 28, labeled indirect, which leads to blocks 38 to 43.]

• Data search
– 1. Boot sector -> starting address
– 2. Get SB information
– 3. GD -> inode table address
– 4. Inode #2 -> root directory -> data block #
– 5. Directory entry -> inode # of files & subdirectories


Example

• create a file1.dat under /dir1/ in Ext3.


Memory Data Structures

• For performance, most of the information stored in the disk data structures of an Ext4 partition is copied into RAM when the filesystem is mounted
• The kernel uses the page cache to keep the disk data structures up to date

In dynamic mode, the data is kept in a cache as long as the associated object is in use; when the file is closed or the data block is deleted, the data may be removed from the cache.


Memory Data Structures

• Inode object
– When opening a file, a pathname lookup is performed.
– For each component of the pathname that is not already in the dentry cache, a new dentry object and a new inode object are created.
– When the VFS accesses a disk inode, it creates a corresponding inode descriptor of type ext3_inode_info
  • It holds most of the fields found in the disk's inode structure that are not kept in the VFS inode
  • i_next_alloc_block and i_next_alloc_goal: the logical block number and the physical block number of the disk block that was most recently allocated to the file, respectively
  • i_prealloc_block and i_prealloc_count: used for data block preallocation
– ext4_inode_info
  • i_prealloc_list


Methods

• VFS methods have a corresponding Ext4 implementation
• Superblock operations

    static const struct super_operations ext4_sops = {
        .alloc_inode    = ext4_alloc_inode,
        .destroy_inode  = ext4_destroy_inode,
        .write_inode    = ext4_write_inode,
        .dirty_inode    = ext4_dirty_inode,
        .drop_inode     = ext4_drop_inode,
        .evict_inode    = ext4_evict_inode,
        .put_super      = ext4_put_super,
        .sync_fs        = ext4_sync_fs,
        .freeze_fs      = ext4_freeze,
        .unfreeze_fs    = ext4_unfreeze,
        .statfs         = ext4_statfs,
        .remount_fs     = ext4_remount,
        .show_options   = ext4_show_options,
    #ifdef CONFIG_QUOTA
        .quota_read     = ext4_quota_read,
        .quota_write    = ext4_quota_write,
    #endif
        .bdev_try_to_free_page = bdev_try_to_free_page,
    };

fs/ext4/super.c
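The operations-table pattern can be mimicked outside the kernel. The following is only a Python analogy, not kernel code: DemoExt4, make_super_ops, and vfs_call are hypothetical names, and a dictionary of callables stands in for the struct of function pointers.

```python
class DemoExt4:
    """Hypothetical stand-in for a filesystem implementation."""

    def statfs(self):
        return {"name": "demo-ext4", "block_size": 4096}

    def sync_fs(self, wait):
        return "synced (wait=%s)" % wait


def make_super_ops(fs):
    # Like ext4_sops: map each generic operation name to this filesystem's
    # implementation; None means the operation is not provided.
    return {"statfs": fs.statfs, "sync_fs": fs.sync_fs, "freeze_fs": None}


def vfs_call(ops, name, *args):
    """Generic layer: dispatch through the table without knowing the filesystem."""
    method = ops.get(name)
    if method is None:
        raise NotImplementedError(name)
    return method(*args)


ops = make_super_ops(DemoExt4())
print(vfs_call(ops, "statfs"))
print(vfs_call(ops, "sync_fs", True))
```

The caller only knows the operation names, so another filesystem can be plugged in by supplying a different table.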


Methods

• Ext4 Inode Operations

    const struct inode_operations ext4_dir_inode_operations = {
        .create      = ext4_create,
        .lookup      = ext4_lookup,
        .link        = ext4_link,
        .unlink      = ext4_unlink,
        .symlink     = ext4_symlink,
        .mkdir       = ext4_mkdir,
        .rmdir       = ext4_rmdir,
        .mknod       = ext4_mknod,
        .rename      = ext4_rename,
        .setattr     = ext4_setattr,
        .setxattr    = generic_setxattr,
        .getxattr    = generic_getxattr,
        .listxattr   = ext4_listxattr,
        .removexattr = generic_removexattr,
        .get_acl     = ext4_get_acl,
        .fiemap      = ext4_fiemap,
    };

/fs/ext4/namei.c


Methods

• Ext4 File Operations

    const struct file_operations ext4_file_operations = {
        .llseek         = ext4_llseek,
        .read           = do_sync_read,
        .write          = do_sync_write,
        .aio_read       = generic_file_aio_read,
        .aio_write      = ext4_file_write,
        .unlocked_ioctl = ext4_ioctl,
    #ifdef CONFIG_COMPAT
        .compat_ioctl   = ext4_compat_ioctl,
    #endif
        .mmap           = ext4_file_mmap,
        .open           = ext4_file_open,
        .release        = ext4_release_file,
        .fsync          = ext4_sync_file,
        .splice_read    = generic_file_splice_read,
        .splice_write   = generic_file_splice_write,
        .fallocate      = ext4_fallocate,
    };

/fs/ext4/file.c


Managing Disk Space

• Creating inodes

[Figure: inode creation call flow; labels: i_sb, ext4_alloc_inode, ext4_new_inode() in /fs/ext4/ialloc.c, ext4_read_inode_bitmap]


Directory inode allocation

• __ext4_new_inode() ➔ find_group_orlov()
– tries to spread first-level directories
– Using the total number of free inodes and blocks in the superblock, Ext4 calculates the average free inodes and blocks per group.
– If there are block groups whose free inode and free block counts are both not worse than average, return the one with the smallest directory count.
– Otherwise simply return a random group.
– CORBET, J. The Orlov block allocator. http://lwn.net/Articles/14633/
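The top-level directory rule described above can be sketched as follows. This is a simplified model of the slide's description, not the kernel's find_group_orlov(): the averages are computed by summing per-group counters instead of reading the superblock totals, and the dictionary keys and function name are made up for illustration.

```python
import random


def pick_group_for_top_dir(groups, rng=random):
    """groups: list of dicts with free_inodes, free_blocks, used_dirs."""
    n = len(groups)
    avg_free_inodes = sum(g["free_inodes"] for g in groups) / n
    avg_free_blocks = sum(g["free_blocks"] for g in groups) / n
    # Candidates: both counters are not worse than the average.
    candidates = [i for i, g in enumerate(groups)
                  if g["free_inodes"] >= avg_free_inodes
                  and g["free_blocks"] >= avg_free_blocks]
    if candidates:
        # Spread directories: prefer the group with the fewest directories.
        return min(candidates, key=lambda i: groups[i]["used_dirs"])
    return rng.randrange(n)              # otherwise, any group


demo_groups = [
    {"free_inodes": 100, "free_blocks": 1000, "used_dirs": 5},
    {"free_inodes": 400, "free_blocks": 4000, "used_dirs": 9},
    {"free_inodes": 300, "free_blocks": 3500, "used_dirs": 2},
    {"free_inodes": 50,  "free_blocks": 200,  "used_dirs": 0},
]
chosen = pick_group_for_top_dir(demo_groups)
print(chosen)
```

Group 3 has the fewest directories but is below average in both counters, so the sketch picks group 2, the emptiest of the above-average groups.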


File inode allocation

• __ext4_new_inode() ➔ find_group_other()
– Allocates an inode in the same flex group as the parent directory.
  • If it can't find space, use the Orlov algorithm to find another flex group, and store that information in the parent directory's inode information (i_last_alloc_group) so that the same flex group is used for future allocations.
– Try to place the inode in the same block group as its parent directory.
  • If that fails, place this inode in a different block group from its parent.
– Use a quadratic hash to find a group with a free inode and some free blocks.
– If that fails, try a linear search for a free inode, even if that group has no free blocks.
• The aim is for files in a common directory to land in the same block group,
• and for files in a different directory that shares a block group with our parent to land in a different block group.
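The fallback order (parent's group, quadratic hash, linear search) can be sketched as below. The probe sequence follows my recollection of the kernel's find_group_other() and should be treated as an assumption; the flex group handling and i_last_alloc_group are left out, and the data layout is invented for the example.

```python
def find_group_other_sketch(groups, parent_group, parent_ino):
    """groups: list of dicts with free_inodes and free_blocks. Returns a group or None."""
    n = len(groups)

    def usable(g):
        return groups[g]["free_inodes"] > 0 and groups[g]["free_blocks"] > 0

    # 1. Same block group as the parent directory.
    if usable(parent_group):
        return parent_group

    # 2. Quadratic hash: start from a group derived from the parent's inode
    #    number, then probe with steps 1, 2, 4, ...
    group = (parent_group + parent_ino) % n
    step = 1
    while step < n:
        group = (group + step) % n
        if usable(group):
            return group
        step <<= 1

    # 3. Linear search for any free inode, even if the group has no free blocks.
    group = parent_group
    for _ in range(n):
        group = (group + 1) % n
        if groups[group]["free_inodes"] > 0:
            return group
    return None


def make_groups(n, overrides):
    groups = [{"free_inodes": 0, "free_blocks": 0} for _ in range(n)]
    for index, (inodes, blocks) in overrides.items():
        groups[index] = {"free_inodes": inodes, "free_blocks": blocks}
    return groups


hashed = find_group_other_sketch(make_groups(8, {1: (5, 5), 3: (1, 0)}), 2, 12)
linear = find_group_other_sketch(make_groups(8, {3: (1, 0)}), 2, 12)
print(hashed, linear)
```

With 8 groups, parent group 2, and parent inode 12, the hash probes groups 7, 1, and 5, so the first call returns 1; in the second call no probed group qualifies and the linear search settles for group 3, which has a free inode but no free blocks.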


Data Blocks Addressing

• Each nonempty regular file consists of a group of data blocks
• Such blocks may be referred to either by their relative position inside the file (their file block number) or by their position inside the disk partition (their logical block number, LBN)
• Mapping: an offset f inside a file → LBN
– Derive the file block number from the offset f
– Translate the file block number to the LBN


Data Blocks Addressing

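• The i_block field in the disk inode is an array of EXT2_N_BLOCKS components that contain logical block numbers
• b is the filesystem's block size
• Each logical block number is stored in 4 bytes, so b is divided by 4 in the formula

[Figure: i_block addressing with block size = 4 KB. File block 0 covers offsets 0 to 4095 and file block 1 covers offsets 4096 to 8191; an i_block component holds the LBN of file block 1; further components lead to indirect, double indirect, and triple indirect blocks. A 4-KB block contains the pointers to 1024 LBNs.]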
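The mapping from a file offset to a level of indirection can be written out directly. The sketch assumes the classic layout of 12 direct pointers followed by one indirect, one double indirect, and one triple indirect pointer, with b/4 pointers per block; the function names are illustrative.

```python
N_DIRECT = 12   # i_block[0..11] are direct; [12], [13], [14] are indirect levels


def offset_to_file_block(offset: int, block_size: int) -> int:
    """Step 1: derive the file block number from the offset f."""
    return offset // block_size


def classify_file_block(fbn: int, block_size: int):
    """Step 2 (first half): which pointer level covers this file block?"""
    p = block_size // 4                 # pointers per block (4 bytes per LBN)
    if fbn < N_DIRECT:
        return "direct", fbn
    fbn -= N_DIRECT
    if fbn < p:
        return "indirect", fbn
    fbn -= p
    if fbn < p * p:
        return "double indirect", fbn
    fbn -= p * p
    if fbn < p ** 3:
        return "triple indirect", fbn
    raise ValueError("file block number beyond triple indirection")


def max_blocks_by_addressing(block_size: int) -> int:
    """Largest number of file blocks the pointer scheme can reach."""
    p = block_size // 4
    return N_DIRECT + p + p * p + p ** 3


level, index = classify_file_block(offset_to_file_block(5000, 4096), 4096)
print(level, index)
for bs in (1024, 2048):
    print(bs, max_blocks_by_addressing(bs) * bs // 2 ** 30, "GB")
```

For 1-KB and 2-KB blocks the addressing limit works out to about 16 GB and 256 GB, matching the size table earlier in these slides. For 4-KB blocks the same formula gives roughly 4 TB, so the 2 TB figure in that table must come from a different limit; my understanding is that it is the 32-bit i_blocks counter, which counts 512-byte units.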


Ext3 Block Allocator

• ext3_alloc_block() searches for a free block
• When allocating a block for a file, it tries to reduce fragmentation of large files
– ext3_get_block() passes a goal block, the preferred LBN of the new block, to ext3_alloc_block()
– Try to get a new block for a file near the last block already allocated for the file
– If that fails, search for a new block in the block group that includes the file’s inode
• Try to keep the metadata and data blocks close together
• Try to keep the files under the same directory close together


Ext3 Block Allocator

• Ext3 block reservation (preallocation)
– Addresses the case of multiple files allocating blocks concurrently
– Block reservation lets subsequent block requests for a file be served before allocations for other files are interleaved
  • Preallocate up to 8 adjacent free blocks
  • i_prealloc_block, i_prealloc_count
– A per-file reservation window, which sets aside a range of blocks, is created, and the actual block allocations are taken from the window

    ext3_try_to_allocate_with_rsv(struct super_block *sb, handle_t *handle,
            unsigned int group, struct buffer_head *bitmap_bh,
            int goal, struct ext3_reserve_window_node *my_rsv,
            int *errp)

/fs/ext3/balloc.c
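The effect of a reservation window can be seen in a toy simulation: two files append blocks in turn, and each file takes blocks from its own window until the window is used up. This is an illustration of the idea only, not the ext3 algorithm; a window of 1 models the case without reservation, and all names are made up.

```python
def allocate_round_robin(files, blocks_each, window):
    """Files append one block at a time in turn; returns file -> block list."""
    next_free = 0
    reserved = {f: [] for f in files}     # blocks still reserved for each file
    layout = {f: [] for f in files}
    for _ in range(blocks_each):
        for f in files:
            if not reserved[f]:
                # Set aside the next `window` adjacent free blocks for this file.
                reserved[f] = list(range(next_free, next_free + window))
                next_free += window
            layout[f].append(reserved[f].pop(0))
    return layout


def count_extents(blocks):
    """Number of contiguous runs in a file's block list."""
    return sum(1 for i, b in enumerate(blocks)
               if i == 0 or b != blocks[i - 1] + 1)


no_reservation = allocate_round_robin(["A", "B"], 16, window=1)
with_reservation = allocate_round_robin(["A", "B"], 16, window=8)
print(count_extents(no_reservation["A"]), count_extents(with_reservation["A"]))
```

Without reservation every block of file A is separated from the next by a block of file B (16 fragments); with an 8-block window the same 16 blocks form 2 contiguous runs.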


Ext4 features

• Bigger file/filesystem size support
– Compared to ext3 (with 4-KB blocks), the maximum file size is 8 times larger (2 TB → 16 TB)
– and the maximum filesystem size is 65536 times larger (16 TB → 1 EB)
• I/O performance improvement
– delayed allocation, multiblock allocator, extent map, and persistent preallocation
– Fast fsck: flex_bg and uninit_bg
– Reliability: journal checksumming
– Maintenance: online defrag
– Misc: backward compatibility with ext2/ext3, nanosecond timestamps, subdirectory scalability, etc.


Feature Flags

• EXT2
– ext_attr (extended attributes such as access control)
– resize_inode (reserve space so the block group descriptor table may grow in the future; useful for online resizing)
– dir_index (use hashed b-trees to speed up lookups in large directories)
– filetype (store file type information in directory entries)
– sparse_super (create a filesystem with fewer superblock backup copies; saves space on large filesystems)
• EXT3
– has_journal
– journal_dev (create an external ext3 journal on the given device)
• EXT4
– huge_file (larger files)
– uninit_bg (create a filesystem without initializing all of the block groups; reduces e2fsck time)
– dir_nlink (larger directories, 32,000 → 64,000 subdirectories)
– extra_isize (nanosecond inode timestamps)
– extent (extent-mapped files, not removable)
– flex_bg
• Note: Feature flags are enabled/disabled by tune2fs, except for flex_bg (only by mkfs).


Ext3 vs. Ext4


Ext4 Scalability Enhancements

• Extent: represents a range of contiguous physical blocks
• Efficient for representing large files
• Better CPU utilization, fewer metadata I/Os
• One extent: up to 2^15 contiguous blocks (128 MB with 4-KB blocks)
• 4 extents fit in the ext4 inode structure, after the extent header


Ext4 extent

/* This is the extent on-disk structure. It's used at the bottom of the tree. */
struct ext4_extent {            /* 12 bytes */
    __le32 ee_block;            /* first logical block extent covers */
    __le16 ee_len;              /* number of blocks covered by extent, max 128MB */
    __le16 ee_start_hi;         /* high 16 bits of physical block */
    __le32 ee_start_lo;         /* low 32 bits of physical block, 48bits = 1EB */
};

/* This is index on-disk structure. It's used at all the levels except the bottom. */
struct ext4_extent_idx {
    __le32 ei_block;            /* index covers logical blocks from 'block' */
    __le32 ei_leaf_lo;          /* pointer to the physical block of the next
                                 * level. leaf or next index could be there */
    __le16 ei_leaf_hi;          /* high 16 bits of physical block */
    __u16  ei_unused;
};

struct ext4_extent_header {
    __le16 eh_magic;            /* probably will support different formats */
    __le16 eh_entries;          /* number of valid entries */
    __le16 eh_max;              /* capacity of store in entries */
    __le16 eh_depth;            /* has tree real underlying blocks? */
    __le32 eh_generation;       /* generation of the tree */
};


Ext3 vs. Ext4: block addressing


Ext3 vs. Ext4: block addressing


Ext4: Block Allocation Enhancements

• Persistent preallocation
– Preallocate blocks for a file up-front
– fallocate() system call
– Useful for databases and streaming media servers
– Ensures contiguous allocation as far as possible for a file
– Blocks are allocated but uninitialized
– The MSB of the extent length field indicates whether a given extent contains uninitialized data.
• Delayed block allocation
– Block allocations are postponed to page flush time rather than performed during the write()
– Combines many block allocation requests into a single request
  • Reduces fragmentation and saves CPU cycles
  • Avoids unnecessary block allocation for short-lived files
– There is a trade-off between performance and reliability
– 30% improved throughput, 50% reduction in CPU
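From user space, persistent preallocation is typically requested before the data is written. The sketch below uses os.posix_fallocate, Python's wrapper around posix_fallocate(3), which to my knowledge relies on the fallocate() system call on Linux filesystems that support it. The ftruncate fallback is an addition for portability and only sets the file size without reserving blocks; the file name is arbitrary.

```python
import os
import tempfile


def preallocate(path: str, nbytes: int) -> int:
    """Reserve nbytes for the file up-front and return its resulting size."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        try:
            # Ask the filesystem to allocate blocks for [0, nbytes).
            os.posix_fallocate(fd, 0, nbytes)
        except (AttributeError, OSError):
            # Fallback: sets the size only; the file may stay sparse.
            os.ftruncate(fd, nbytes)
        return os.fstat(fd).st_size
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as tmp:
    size = preallocate(os.path.join(tmp, "table.db"), 1 << 20)
    print(size)
```

A database or media server would call something like this once when creating a file, so that later writes land in blocks that were already laid out contiguously.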


Ext4: Block Allocation Enhancements

• Online defragmentation
– With age, the filesystem can still become quite fragmented
– e4defrag
  • Creates a temporary inode and allocates contiguous extents using multiple block allocation
  • Copies the original file data to the page cache and flushes the dirty pages to the temporary inode’s blocks
  • Migrates the block pointers from the temporary inode to the original inode


Problems with Ext2/3 block allocator

• Lack of free extent information across the file system
– Poor allocation pattern for multiple blocks, since the allocator searches for free blocks only inside the reservation window.
• Doesn’t differentiate allocation for small / large files
– Large directories, such as /etc, contain large numbers of small configuration files that need to be read during boot.
– If the files are placed far apart on the disk, the bootup process is delayed by expensive seeks across the underlying device to load all the files.
– If the block allocator could place these related small files closer together, it would greatly benefit read performance.


Problems with Ext2/3 block allocator

• Test case 1
– Used one thread to sequentially create 20 small files of 12 KB
– The locality of the small files is poor, though the files are not fragmented
– Those small files are generated by the same process, so they should be kept close to each other
• Test case 2
– Created a single large file and multiple small files in parallel (with two threads)
– Illustrates the fragmentation of a large file
– The allocations for the large file and the small files are fighting for free space close to each other


Ext2/3 Block Allocator

• The Ext3 allocator intentionally keeps small files apart to avoid excessive fragmentation in case the files turn out to be large.

• This happens because the allocator has no information that those small files are generated by the same process.

• As a result, the locality of those small files is poor.

• The allocations for the large file and the small files compete for free space close to each other.

• (Ext3 does not know that the large file is unrelated to the small files.)

• A better solution is to keep the large file's allocation far apart from unrelated allocations from the very beginning, to avoid interleaved fragmentation.


Ext4: Multiple Blocks Allocator

• EXT3 block reservation

– Subsequent block requests for a file are served from its reservation window before allocations for other files can interleave

– per-file reservation window

• EXT4 Multiple Blocks Allocator (/fs/ext4/mballoc.c)

– Different strategy for different allocation requests

– Per-block-group buddy cache

• Contiguous multiple blocks are allocated at once to prevent file fragmentation ➔ decreases CPU utilization and seek time.

• Builds per-block-group free extent information based on the on-disk block bitmap to guide the search for free extents

• This information is generated at filesystem mount time and stored in memory using a buddy structure.
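
The idea of deriving buddy-style free extent information from a block bitmap can be sketched as follows. This is a minimal illustration, not the code in mballoc.c: the function name `build_buddy`, the `max_order` parameter, and the in-memory layout (a dictionary of start blocks per order) are assumptions made for the example.

```python
def build_buddy(bitmap, max_order):
    """Return {order: [start, ...]} listing maximal aligned free chunks.

    bitmap[i] is True when block i is in use, as in an on-disk block bitmap.
    A chunk of a given order covers 2**order contiguous blocks and starts
    at a block number that is a multiple of 2**order.
    """
    n = len(bitmap)
    free = {order: [] for order in range(max_order + 1)}
    i = 0
    while i < n:
        if bitmap[i]:
            i += 1
            continue
        # Grow the chunk while it stays aligned, in range, and entirely free.
        order = 0
        while order < max_order:
            size = 2 ** (order + 1)
            if i % size != 0 or i + size > n or any(bitmap[i:i + size]):
                break
            order += 1
        free[order].append(i)
        i += 2 ** order
    return free


# 16-block group with blocks 0, 1 and 5 in use.
example_bitmap = [True, True, False, False, False, True, False, False] + [False] * 8
example_free = build_buddy(example_bitmap, max_order=3)
```

With this summary a search for, say, 8 contiguous blocks only has to look at the order-3 list instead of scanning the bitmap.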


Ext4: Multiple Blocks Allocator

• Different strategy for different allocation requests

– Better allocation for small and large files

• The Ext4 multiple block allocator maintains two preallocated spaces

– Small allocation requests

• per-CPU locality group preallocation

• used so that small files are placed closer together on disk

– Large allocation requests

• per-file (per-inode) preallocation

• used so that larger files are less interleaved

• Which preallocation space to use

– depends on the total size derived from the current file size and the allocation request size.

– If the total size < stream_req blocks, the per-CPU locality group preallocation space is used; otherwise the per-file preallocation space is used.

– Default stream_req is 16 blocks (/proc/fs/ext4/<partition>/stream_req)
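
The selection rule can be sketched in a few lines. The slide only says that the total size is derived from the current file size and the request size; the sketch assumes it is the larger of the current file size and the end of the requested range, and the function and constant names are hypothetical.

```python
STREAM_REQ_BLOCKS = 16  # default threshold quoted on the slide


def choose_prealloc(file_size_blocks, req_start_block, req_len_blocks,
                    stream_req=STREAM_REQ_BLOCKS):
    """Pick the preallocation space for one allocation request.

    Assumption: total size = max(current file size, end of the request),
    all measured in file system blocks.
    """
    total = max(file_size_blocks, req_start_block + req_len_blocks)
    return "locality_group" if total < stream_req else "inode"
```

For example, a new 3-block file falls below the threshold and uses the per-CPU locality group, while an append that takes a file to 16 blocks or more switches to per-inode preallocation.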


Ext4: Multiple Blocks Allocator

• Per-block-group buddy cache

– Used when blocks cannot be allocated from the preallocation space

– Contiguous free blocks of a block group are managed by the buddy system in memory ($2^{1}$ to $2^{\text{size of block (bits)}+1}$).

ext4_mb_load_buddy
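
A classic buddy allocation step, which is the general technique the buddy cache builds on, can be sketched as follows. This is not the search that mballoc.c performs (which is more elaborate); `buddy_alloc` and the `free` dictionary layout are hypothetical names chosen for the example.

```python
def buddy_alloc(free, nblocks):
    """Allocate a power-of-two chunk covering nblocks blocks.

    free maps order -> list of start blocks of free chunks of 2**order
    blocks, with keys 0..max_order. Returns the start block or None.
    """
    # Smallest order whose chunk size is at least nblocks.
    want = 0
    while 2 ** want < nblocks:
        want += 1
    # Smallest order >= want that currently has a free chunk.
    order = want
    while order in free and not free[order]:
        order += 1
    if order not in free:
        return None
    start = free[order].pop(0)
    # Split the chunk in halves, returning each upper half to the free lists.
    while order > want:
        order -= 1
        free[order].append(start + 2 ** order)
    return start


demo_free = {0: [], 1: [], 2: [], 3: [8]}
first = buddy_alloc(demo_free, 2)   # splits the order-3 chunk at block 8
second = buddy_alloc(demo_free, 3)  # rounded up to a 4-block chunk
```

Splitting only when needed keeps large chunks intact for later large requests.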


Ext4: Multiple Blocks Allocator

• Per-file (per-inode) preallocation

– Before allocating blocks via the buddy cache, the requested block count is normalized.

– Heuristics based on file size (ext4_mb_normalize_request)

– The normalized request covers more blocks than are currently needed.

– Extra blocks unused by the current allocation are added to the inode's prealloc list

– Blocks in the inode preallocation are assigned preferentially when the next block allocation arrives.

– Consequently, contiguous multiple blocks are used.
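
The flow above can be sketched as a small class. The rounding rule in `normalize` (round the resulting file size up to a power of two) is an assumed stand-in for the real heuristics in ext4_mb_normalize_request, and `InodePrealloc`, `allocate_extent` and the bump allocator in the demo are hypothetical.

```python
class InodePrealloc:
    """Toy per-inode preallocation: over-allocate, keep the surplus."""

    def __init__(self, allocate_extent):
        # Hypothetical backend: callable(nblocks) -> start block.
        self.allocate_extent = allocate_extent
        self.prealloc = []  # unused preallocated ranges as (start, length)

    @staticmethod
    def normalize(file_size_blocks, req_len):
        # Assumed heuristic: grow the file to the next power-of-two size.
        target = 1
        while target < file_size_blocks + req_len:
            target *= 2
        return target - file_size_blocks

    def allocate(self, file_size_blocks, req_len):
        # Prefer blocks left over from an earlier, normalized request.
        for idx, (start, length) in enumerate(self.prealloc):
            if length >= req_len:
                if length > req_len:
                    self.prealloc[idx] = (start + req_len, length - req_len)
                else:
                    del self.prealloc[idx]
                return start
        # Otherwise allocate the normalized amount and remember the surplus.
        want = self.normalize(file_size_blocks, req_len)
        start = self.allocate_extent(want)
        if want > req_len:
            self.prealloc.append((start + req_len, want - req_len))
        return start


next_free = [100]


def bump(nblocks):
    """Hypothetical backend that hands out blocks sequentially."""
    start = next_free[0]
    next_free[0] += nblocks
    return start


inode = InodePrealloc(bump)
a = inode.allocate(0, 3)  # normalized to 4 blocks, 1 block kept in reserve
b = inode.allocate(3, 1)  # served from the reserved block
c = inode.allocate(4, 2)  # reserve empty, new normalized allocation
```

In the demo the three requests land on consecutive blocks even though the second one never reached the backend allocator.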


Ext4: Multiple Blocks Allocator

• Per-cpu preallocation

– For group preallocation, the request is normalized to sbi->s_mb_group_prealloc.

– The default value is 512 blocks.

/sys/fs/ext4/<partition>/mb_group_prealloc

• A file smaller than 16 blocks is added to the per-CPU locality group to pack small files together.

e.g. when 3 blocks are allocated to a small file, per-CPU preallocation is used.
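
The packing effect can be sketched with a toy locality group that hands out consecutive pieces of one preallocated pool. The class name, the pool start block and the per-CPU dictionary are assumptions for illustration; only the 512-block default comes from the slide.

```python
class LocalityGroup:
    """Toy per-CPU pool: small files get consecutive pieces of one range."""

    def __init__(self, pool_start, pool_len=512):
        self.next = pool_start
        self.end = pool_start + pool_len

    def allocate(self, nblocks):
        if self.next + nblocks > self.end:
            return None  # pool exhausted; a real allocator would refill it
        start = self.next
        self.next += nblocks
        return start


# One pool per CPU (hypothetical start blocks).
groups = {0: LocalityGroup(1000), 1: LocalityGroup(5000)}

# Three 3-block files created on CPU 0 end up next to each other.
starts = [groups[0].allocate(3) for _ in range(3)]
```

Files created on the same CPU share a pool, which is how files generated by one process tend to end up adjacent on disk.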


Ext4: Multiple Blocks Allocator


File System Layout Improvements

• Ext3

– Metadata for each block group (inode table, block/inode allocation bitmaps) is located at the beginning of each block group

– With 4k file system blocks, block groups are 128 MB each

– files > 128 MB cannot be contiguous

• Ext4

– Block groups are grouped together into “flex_bg groups”

– By default mke2fs uses 16 block groups per flex_bg group (must be a power of 2)

– The inode table and bitmaps are placed at the beginning of the flex_bg group (in the first block group)

– Allocating bitmaps and inode tables tightly together builds a large virtual block group

– Because metadata blocks are moved to the beginning of a large virtual block group, the chances of allocating larger extents improve

• Reserve the first block group in each flex_bg group for extent tree blocks and directory blocks

– reduces seek times when reading the directory blocks
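
The 128 MB figure follows from the block bitmap occupying one block: with one bit per block, a group covers 8 × block size blocks. A short calculation, with hypothetical helper names, shows this and the size of the virtual group formed by a default flex_bg group:

```python
def block_group_bytes(block_size):
    """Size of one block group when its block bitmap fits in one block."""
    blocks_per_group = block_size * 8  # one bitmap bit per block
    return blocks_per_group * block_size


def flex_bg_bytes(block_size, groups_per_flex=16):
    """Size of the virtual block group formed by one flex_bg group."""
    return groups_per_flex * block_group_bytes(block_size)


MIB = 1024 * 1024
group_mib = block_group_bytes(4096) // MIB  # 4 KB blocks
flex_mib = flex_bg_bytes(4096) // MIB       # 16 groups per flex_bg group
```

With 4 KB blocks this gives 128 MB per block group and 2 GB per default flex_bg group, so far larger contiguous extents become possible.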


Ext4: flex_BG


Performance Evaluation

• FFSB (Flexible File System Benchmark)

FFSB small meta-data, FiberChannel (1 thread) – FLEX_BG with 64 block groups: 10% overall improvement

FFSB small meta-data, FiberChannel (16 threads) – FLEX_BG with 64 block groups: 18% overall improvement