TRANSCRIPT
Silicon Graphics, Inc.
January 31, 2007
Scalable Filesystems: XFS & CXFS
Presented by: Yingping Lu
Outline
• XFS Overview
• XFS Architecture
• XFS Fundamental Data Structures
  – Extent list
  – B+Tree
  – Inode
• XFS Filesystem On-Disk Layout
• XFS Directory Structure
• CXFS: Shared File System
XFS: A World-Class File System
– Scalable
  • Full 64-bit support
  • Dynamic allocation of metadata space
  • Scalable structures and algorithms
– Fast
  • Fast metadata speeds
  • High bandwidth
  • High transaction rates
– Reliable
  • Field proven
  • Log/journal
Scalable
– Full 64-bit support
  • Large filesystems: up to 2^64 − 1 = 18,446,744,073,709,551,615 bytes (~18 million TB, i.e. exabyte scale)
  • Large files: up to 2^63 − 1 = 9,223,372,036,854,775,807 bytes (~9 million TB)
– Dynamic allocation of metadata space
  • Inode size is configurable; inode space is allocated dynamically
  • Unlimited number of files (constrained only by storage space)
– Scalable structures and algorithms (B-trees)
  • Performance does not degrade with large numbers of files and directories
Fast
– Fast metadata speeds
  • B-trees everywhere (nearly all lists of metadata information)
    – Directory contents
    – Metadata free lists
    – Extent lists within files
– High bandwidth (storage: RM6700)
  • 7.32 GB/s on one filesystem (32p Origin2000, 897 FC disks)
  • >4 GB/s in one file (same Origin, 704 FC disks)
  • Large extents (4 KB to 4 GB)
  • Request parallelism (multiple AGs)
  • Delayed allocation, read-ahead/write-behind
– High transaction rates: 92,423 IOPS (storage: TP9700)
Reliable
– Field proven
  • Has run for years on hundreds of thousands of IRIX systems, over more than a decade
  • Ships as the default filesystem on the SGI Altix family (64-bit Linux)
  • Commercial vendors shipping XFS:
    – Ciprico DiMeda NAS Solutions, the Quantum Guardian™ 14000, BigStorage K2~NAS, EchoStar DishPVR 721, Sun Cobalt RaQ™ 550
  • Linux distributions shipping XFS:
    – Mandrake Linux, SuSE Linux, Gentoo Linux, Slackware Linux, JB Linux
  • Now in the mainline Linux kernel
– Log/journal
  • XFS is designed around the log
  • No UNIX fsck is needed
  • Recovery time is independent of filesystem size
    – It depends on system activity levels
    – Usually, recovery completes in under a second
Other XFS Features
– Large Range of Block Sizes (512 B to 64 KB)
– Extended attributes
– Sparse files (holes do not use disk space)
– Guaranteed rate I/O
– Online Dump, Resize, Defrag active file systems
– DMAPI interface supported (for HSM)
– Real-time file stream support
– Runs on IRIX, Linux, and FreeBSD
The System Architecture (Linux)
[Diagram: the Linux storage stack. User system calls (mount, open, read, write, lseek) enter the kernel through the VFS, which dispatches to a filesystem driver such as XFS, ext2, ext3, or nfs. Below the filesystem driver sit the cache manager and the SCSI middle layer (sd, sr, st), with an HBA driver (e.g. qla2200) at the bottom.]
XFS Architecture
[Diagram: the XFS architecture, layered above the XVM volume manager.]
Inode Structure
An on-disk inode (xfs_dinode_t) consists of:
– di_core (96 bytes), the core component
  • Timestamps, size, inode number
  • The formats of the two other components
– di_u, the data fork (a union)
  • B+tree root / extent list
  • Symbolic link target / small inline data
  • Shortform directory list
– di_a, the extended attribute fork (a union)
  • B+tree root / extent list
  • Local (inline) attributes
– di_next_unlinked (4 bytes)
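Because both forks are unions, di_core must record which interpretation is active. A minimal Python sketch of that tagged-union idea (these names are illustrative stand-ins, not the real XFS_DINODE_FMT_* constants):

```python
from enum import Enum

# Hypothetical model of the data-fork format tag held in di_core.
class ForkFormat(Enum):
    LOCAL = "local"      # small payload stored inline: symlink target, shortform dir
    EXTENTS = "extents"  # inline array of extent records
    BTREE = "btree"      # root of a B+tree of extent records

def describe_fork(fmt: ForkFormat) -> str:
    """Say how the fork's bytes should be interpreted for a given format."""
    return {
        ForkFormat.LOCAL: "payload stored directly in the inode",
        ForkFormat.EXTENTS: "fork holds an extent list",
        ForkFormat.BTREE: "fork holds a B+tree root",
    }[fmt]
```

The later slides on directory forms show XFS switching a directory's data fork between exactly these interpretations as the directory grows.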
Extent
• An extent is a run of contiguous file blocks; the minimum size of an extent is one file block.
• An extent is represented as a triple: starting filesystem block, number of blocks, and a flag.
• For a regular file, the extent record also includes the starting file offset.
• Extents can significantly reduce the space needed to record allocated and free space when that space consists of large contiguous runs of blocks.
• Extents make sparse files, i.e. files with potential "holes", easy to support.
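The record described above can be modeled directly. A hedged sketch (field names are illustrative, not the packed on-disk bmbt record layout):

```python
from dataclasses import dataclass

# Illustrative extent record; for a regular file the starting file offset
# is included, giving the 4-tuple used in the mapping example below.
@dataclass(frozen=True)
class Extent:
    file_offset: int   # first file block covered (regular files)
    start_block: int   # first filesystem block of the run
    block_count: int   # number of contiguous blocks
    flag: int = 0      # e.g. an unwritten-extent marker

# One record can stand in for thousands of per-block pointers: a contiguous
# 1 GiB run of 4 KiB blocks is a single extent of 262,144 blocks.
one_gib = Extent(file_offset=0, start_block=1000, block_count=262144)
```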
Extent List Mapping
[Diagram: an inode with dformat=extent and aformat=local (the attribute fork holds inline attributes: totalCount, then nlen/vlen/name/value per attribute). The data fork holds three extent records mapping file space to filesystem space:
  <0, 20, 4, 0>  — file blocks 0–3 → filesystem blocks 20–23
  <4, 32, 10, 0> — file blocks 4–13 → filesystem blocks 32–41
  <16, 50, 5, 0> — file blocks 16–20 → filesystem blocks 50–54
File blocks 14–15 are a hole and occupy no disk space.]
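The mapping above amounts to a small lookup: given a file block number, scan the extent list and translate, reporting a hole wherever no extent covers the offset. A sketch using the three extents from the example:

```python
def extent_lookup(extents, file_block):
    """Translate a file block number to a filesystem block number.

    extents: list of (file_offset, start_block, block_count) triples.
    Returns None for a hole (no extent covers the offset).
    """
    for file_off, start_blk, count in extents:
        if file_off <= file_block < file_off + count:
            return start_blk + (file_block - file_off)
    return None

# The three extents from the mapping example; file blocks 14-15 are a hole.
example = [(0, 20, 4), (4, 32, 10), (16, 50, 5)]
```

For instance, file block 2 maps to filesystem block 22, while file block 14 falls in the hole and maps to nothing.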
Data B+tree
[Diagram: an inode with dformat=btree and aformat=local. The data fork holds the B+tree root information (level=2, numrecs=1) with a single file-offset/fs-block entry 0/7 pointing at the root node in block 7. The root node (numrecs=3) holds file-offset/fs-block pairs 0/13, 5000/14, 9800/15 pointing at intermediate nodes in blocks 13, 14 and 15. The intermediate nodes (level=1) hold pairs such as 0/40, 100/51, 200/55, 290/56, … (block 13, numrecs=50); 5000/120, 5100/121, 5190/122, … (block 14, numrecs=50); and 9800/802, 9910/803, 10005/805, … (block 15, numrecs=32), each pointing at leaf nodes. The leaf nodes (level=0, e.g. blocks 40 and 51) hold file-offset/fs-block/block-count extent records such as 0/201/2, 2/206/2, 4/210/3, 7/218/1 and 100/311/2, 102/317/2, 104/325/4, 110/340/3.]
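A lookup walks this tree from the fork root down to a leaf, then translates within the matching extent record. A simplified sketch (node layout is illustrative; real nodes are packed on-disk structures), using a two-level slice with block numbers from the figure:

```python
import bisect

def btree_lookup(blocks, root_block, file_block):
    """Descend a data-fork B+tree to map a file block to a filesystem block.

    blocks: block number -> node dict. Interior nodes: {"level", "keys",
    "ptrs"}; leaf nodes: {"level": 0, "recs": [(offset, fs_block, count)]}.
    """
    node = blocks[root_block]
    while node["level"] > 0:
        # follow the child whose key is the largest one <= file_block
        i = max(bisect.bisect_right(node["keys"], file_block) - 1, 0)
        node = blocks[node["ptrs"][i]]
    for off, blk, count in node["recs"]:
        if off <= file_block < off + count:
            return blk + (file_block - off)
    return None  # hole

# A two-level slice of the tree in the figure (blocks 7, 40 and 51).
tree = {
    7:  {"level": 1, "keys": [0, 100], "ptrs": [40, 51]},
    40: {"level": 0, "recs": [(0, 201, 2), (2, 206, 2), (4, 210, 3), (7, 218, 1)]},
    51: {"level": 0, "recs": [(100, 311, 2), (102, 317, 2), (104, 325, 4)]},
}
```

Because keys in each node are sorted, every level is a binary search, so lookups stay logarithmic in the number of extents.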
File System On-Disk Layout
A filesystem is divided into allocation groups: AG0, AG1, AG2, …
Each AG contains its header structures (SB, AGF, AGFL, AGI, …) followed by file data blocks (FDB1, FDB2, …).

AG: Allocation Group
SB: Superblock (1 sector, xfs_sb_t)
AGF: Allocation Group Free Space (1 sector, xfs_agf_t)
AGI: Allocation Group Inode (1 sector, xfs_agi_t)
AGFL: Allocation Group Freelist (1 sector, xfs_agfl_t)
FDB: File Data Block (file block size: 4 KB by default)
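Since each header occupies one sector at the front of its allocation group, the byte offset of any header follows from the AG geometry. A hedged sketch (the sector ordering shown is the conventional XFS layout, but treat it as illustrative):

```python
SECTOR_SIZE = 512

def ag_header_offsets(ag_number, blocks_per_ag, block_size=4096):
    """Byte offsets of the one-sector AG headers for a given AG.

    Assumes the common ordering: superblock, AGF, AGI, AGFL in the first
    four sectors of each AG (only AG 0's superblock is the primary copy).
    """
    base = ag_number * blocks_per_ag * block_size
    return {
        "sb":   base,                    # xfs_sb_t
        "agf":  base + 1 * SECTOR_SIZE,  # xfs_agf_t (free space)
        "agi":  base + 2 * SECTOR_SIZE,  # xfs_agi_t (inodes)
        "agfl": base + 3 * SECTOR_SIZE,  # xfs_agfl_t (freelist)
    }
```

Spreading identical header layouts across AGs is what lets allocation proceed in parallel: each AG manages its own free space and inodes independently.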
AGF B+Tree
AGF B+Tree (2 levels)
AGI B+Tree
Inode Allocation
• Inode size
  – Configurable at filesystem creation time
  – Can be 256 B, 512 B, and so on up to 4096 B
• Inode allocation
  – The allocation unit is a cluster of 64 inodes
  – An inode mask records which inodes in the cluster are free
• An inode number consists of:
  – The AG number
  – The filesystem block number of the inode cluster
  – The index within the cluster
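The three fields can be packed into one integer with shifts and masks. A sketch with made-up field widths (XFS derives the real widths from superblock geometry such as sb_agblklog and sb_inopblog):

```python
# Hypothetical field widths chosen only for illustration.
INDEX_BITS = 6    # 64 inodes per cluster -> 6 bits of index
BLOCK_BITS = 26   # block number of the cluster within the AG

def pack_ino(ag, block, index):
    """Compose an inode number from (AG number, cluster block, index)."""
    return (ag << (BLOCK_BITS + INDEX_BITS)) | (block << INDEX_BITS) | index

def unpack_ino(ino):
    """Recover (AG number, cluster block, index) from an inode number."""
    index = ino & ((1 << INDEX_BITS) - 1)
    block = (ino >> INDEX_BITS) & ((1 << BLOCK_BITS) - 1)
    ag = ino >> (BLOCK_BITS + INDEX_BITS)
    return ag, block, index
```

Encoding the location into the number itself is why no global inode table is needed: the number alone says where on disk the inode lives.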
XFS Directory Structure
• Unix files are organized in an inverted tree structure.
• Each directory holds a list of the files beneath it.
• Each entry in a directory represents a file, a symbolic link, or a subdirectory.
• Each entry has the object's name, the length of the name, and the corresponding inode number.
• Directory data are usually stored in directory blocks; a directory block's size is a multiple of the file data block size. The superblock's sb_dirblklog field designates the size, which can range from 4 KB to 64 KB.
Directory Forms
• Directory data
  – Directory entries (name, length, inode number, offset)
  – Leaf array: hash/address pairs for lookup
  – Freeindex array for allocation
• Directory forms
  – Shortform directory: directory data stored within the inode
  – Block directory: one extent; all directory entries stored within a single directory block
  – Leaf directory: extent list; multiple data blocks, one leaf block
  – Node directory: extent list; multiple data blocks, B+tree-like leaf blocks
  – B+tree directory: B+tree format for the data fork
• The system dynamically adjusts the format as directory entries are added or removed.
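That dynamic adjustment can be pictured as an escalation ladder driven by how much room the entries need. A toy sketch (the thresholds are invented; XFS's real checks are byte-size based, not entry counts):

```python
def directory_form(n_entries):
    """Pick a directory representation for a given entry count (toy numbers)."""
    if n_entries <= 8:
        return "shortform"   # entries fit inside the inode itself
    if n_entries <= 128:
        return "block"       # one block: data + leaf array + tail
    if n_entries <= 4096:
        return "leaf"        # several data blocks, a single leaf block
    return "node/btree"      # leaf becomes a B+tree; freeindex blocks added
```

The payoff of the ladder is that tiny directories (the common case) cost no extra I/O at all, while huge directories still get logarithmic lookups.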
Shortform Directory
Block Directory
• Uses a single directory block to store the directory entries.
• The location of the block is stored in the inode's in-core extent list (di_u.u_bmx[0]).
• The directory block (xfs_dir2_block_t) has the following data fields:
  – A header with the magic number and the freespace list (the 3 largest free spaces)
  – The directory entry list (name length, name, inode number, offset)
  – A leaf array of hashval/address pairs for quickly looking up a name by its hash value
  – A tail structure giving the number of elements in the leaf array and the number of stale entries in it; the tail is always located at the end of the block
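Lookups in the leaf array key on a 32-bit hash of the entry name. An illustrative rotate-and-xor hash in the spirit of XFS's directory name hash (the exact on-disk function differs in detail):

```python
def name_hash(name: bytes) -> int:
    """32-bit rolling hash: rotate left by 7, then xor in the next byte."""
    h = 0
    for byte in name:
        h = (((h << 7) | (h >> 25)) & 0xFFFFFFFF) ^ byte
    return h
```

Hashing lets a lookup binary-search the sorted hashval/address pairs instead of string-comparing every entry; only entries whose hash matches need a full name comparison.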
Leaf Directory
• When the directory entries can no longer be stored in one block, an extent list is used to store them.
• Data and leaf information are split into different blocks.
• There are one or more data blocks; each directory data block has its own header and bestfree list.
• There is only one leaf block (the last one). The leaf block has its own header, the hash/address array, and a best-free-space array that records the length of each data block's bestfree[0] entry. The tail holds the number of bestfree elements.
Leaf Block
Node Directory
• When the leaf fills a block, a further split is needed.
• The data blocks are the same as in a leaf directory.
• The leaf blocks are changed into a B+tree, with a generic header pointing to the directory "leaves".
• A new freeindex block records the best free space of each data block.
• The leaf blocks can appear in any order; the only way to determine the appropriate one is through the node block's hash/before values.
B+Tree-style Leaf Blocks

[Diagram: a node block pointing to the leaf blocks.]
B+Tree Directory
• With a very large number of directory entries, the inode format is changed to B+tree.
• The B+tree's extents contain extent maps for the data (directory entries), node, leaf (hash/address), and freeindex blocks.
• The node/leaf trees can be more than one level deep.
• More than one freeindex block may exist.
CXFS Clustered File System
[Diagram: hosts sharing storage over a Fibre Channel Storage Area Network.]
• Full standard Unix interface
• As easy to share files as with NFS, but faster
• Near-local file performance from direct data channels
• Fully resilient (HA)
CXFS Concepts
– Metadata
  • The data about a file, including size, inode, create/modify times, and permissions
– Metadata server node (a.k.a. CXFS server)
  • The one machine in the cluster responsible for controlling the metadata of files; it plays "traffic cop" to control access to each file
– Metadata client node (a.k.a. CXFS client)
  • A machine in the cluster that is not the metadata server; it must obtain permission from the metadata server before accessing a file
– A single server manages the metadata
  • Backup metadata servers are designated for fail-over
  • No single point of failure
CXFS networks
• Besides the storage area network, CXFS uses the following networks:
  – Metadata network
    • A dedicated IP network for metadata and tokens
  – Membership network
    • An IP network used for heartbeating
  – Reset network between metadata servers
    • Non-IP serial lines used to reset nodes
  – I/O fencing
    • SAN switch port disable/enable
Data Integrity - IO fencing
• CXFS nodes all have direct access to FC storage.
• The integrity of a shared filesystem requires a unified view of who is allowed to read and write what.
• Tokens control access.
• A failed node may still hold write tokens; such a node must be prevented from unilaterally writing to the shared filesystem.
• Fencing applies to all CXFS platforms and is independent of the disk subsystems.
• It uses a Brocade switch to disable/enable FC ports.
• The I/O fencing architecture could be ported to other switches.
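The fencing decision itself is simple: if a node has stopped heartbeating while it may still hold write tokens, disable its switch port before recovery proceeds. A conceptual sketch (the switch interface here is entirely hypothetical, not a real Brocade API):

```python
class FabricSwitch:
    """Stand-in for an FC switch's management interface (hypothetical API)."""
    def __init__(self):
        self.disabled_ports = set()

    def disable_port(self, port):
        self.disabled_ports.add(port)

def fence_if_needed(node, switch):
    """Fence a node that stopped heartbeating while holding write tokens."""
    if not node["heartbeat_ok"] and node["write_tokens"]:
        switch.disable_port(node["fc_port"])  # node can no longer reach the SAN
        return True
    return False
```

Disabling the port at the fabric rather than at the node is the key design choice: it works even when the failed node is completely unresponsive.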
CXFS Architecture

[Diagram: CXFS clients and the CXFS metadata server.]
Research Issues
• Self-healing, especially deadlock detection and self-recovery
• I/O fencing
• Fail-over
• Intelligent data placement algorithms
• QoS provisioning
• Scalable clusters
• OSD support