TRANSCRIPT
Silicon Graphics, Inc.
January 31, 2007
Scalable Filesystems: XFS & CXFS
Presented by: Yingping Lu
Outline
• XFS Overview
• XFS Architecture
• XFS Fundamental Data Structures
  – Extent list
  – B+Tree
  – Inode
• XFS Filesystem On-Disk Layout
• XFS Directory Structure
• CXFS: Shared File System
XFS: A World-Class File System
– Scalable
  • Full 64-bit support
  • Dynamic allocation of metadata space
  • Scalable structures and algorithms
– Fast
  • Fast metadata speeds
  • High bandwidth
  • High transaction rates
– Reliable
  • Field proven
  • Log/journal
Scalable
– Full 64-bit support
  • Large filesystems: up to 2^64 − 1 = 18,446,744,073,709,551,615 bytes (~18 million TB, i.e. exabyte scale)
  • Large files: up to 2^63 − 1 = 9,223,372,036,854,775,807 bytes (~9 million TB)
– Dynamic allocation of metadata space
  • Inode size is configurable; inode space is allocated dynamically
  • Unlimited number of files (constrained only by storage space)
– Scalable structures and algorithms (B-trees)
  • Performance does not degrade with large numbers of files and directories
Fast
– Fast metadata speeds
  • B-trees everywhere (nearly all lists of metadata information)
    – Directory contents
    – Metadata free lists
    – Extent lists within files
– High bandwidth (storage: RM6700)
  • 7.32 GB/s on one filesystem (32p Origin2000, 897 FC disks)
  • >4 GB/s in one file (same Origin, 704 FC disks)
  • Large extents (4 KB to 4 GB)
  • Request parallelism (multiple AGs)
  • Delayed allocation, read-ahead/write-behind
– High transaction rates: 92,423 IOPS (storage: TP9700)
Reliable
– Field proven
  • Has run for years on hundreds of thousands of IRIX systems, over more than a decade
  • Ships as the default filesystem on the SGI Altix family (64-bit Linux)
  • Commercial vendors shipping XFS:
    – Ciprico DiMeda NAS Solutions, the Quantum Guardian™ 14000, BigStorage K2~NAS, EchoStar DishPVR 721, Sun Cobalt RaQ™ 550
  • Linux distributions shipping XFS:
    – Mandrake Linux, SuSE Linux, Gentoo Linux, Slackware Linux, JB Linux
  • Now in the mainline Linux kernel
– Log/journal
  • XFS is designed around the log
  • No UNIX fsck is needed
  • Recovery time is independent of filesystem size
    – It depends on system activity levels
    – Usually, recovery completes in under a second
Other XFS Features
– Large Range of Block Sizes (512 B to 64 KB)
– Extended attributes
– Sparse files (holes do not use disk space)
– Guaranteed rate I/O
– Online Dump, Resize, Defrag active file systems
– DMAPI interface supported (for HSM)
– Real-time file stream support
– Runs on IRIX, Linux, and FreeBSD
The System Architecture (Linux)
[Diagram: the Linux storage stack. User system calls (mount, open, read, write, lseek) enter the kernel through the VFS, which dispatches to a filesystem driver such as XFS, ext2, ext3, or nfs. Below the filesystem driver sit the cache manager and the SCSI middle layer (sd, sr, st), with an HBA driver (e.g. qla2200) at the bottom.]
XFS Architecture
[Diagram: the XFS architecture, layered above the XVM volume manager.]
Inode Structure
An on-disk inode (xfs_dinode_t) consists of:
– di_core (96 bytes), the core component
  • Timestamps, size, inode number
  • The formats of the two other components
– di_u, the data fork (a union)
  • B+tree root / extent list
  • Symbolic link target / small inline data
  • Shortform directory list
– di_a, the extended attribute fork (a union)
  • B+tree root / extent list
  • Local (inline) attributes
– di_next_unlinked (4 bytes)
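Because both forks are unions, di_core must record which interpretation is active. A minimal Python sketch of that tagged-union idea (these names are illustrative stand-ins, not the real XFS_DINODE_FMT_* constants):

```python
from enum import Enum

# Hypothetical model of the data-fork format tag held in di_core.
class ForkFormat(Enum):
    LOCAL = "local"      # small payload stored inline: symlink target, shortform dir
    EXTENTS = "extents"  # inline array of extent records
    BTREE = "btree"      # root of a B+tree of extent records

def describe_fork(fmt: ForkFormat) -> str:
    """Say how the fork's bytes should be interpreted for a given format."""
    return {
        ForkFormat.LOCAL: "payload stored directly in the inode",
        ForkFormat.EXTENTS: "fork holds an extent list",
        ForkFormat.BTREE: "fork holds a B+tree root",
    }[fmt]
```

The later slides on directory forms show XFS switching a directory's data fork between exactly these interpretations as the directory grows.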
Extent
• An extent is a run of contiguous file blocks; the minimum size of an extent is one file block.
• An extent is represented as a triple: starting filesystem block, number of blocks, and a flag.
• For a regular file, the extent record also includes the starting file offset.
• Extents can significantly reduce the space needed to record allocated and free space when that space consists of large contiguous runs of blocks.
• Extents make sparse files, i.e. files with potential "holes", easy to support.
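The record described above can be modeled directly. A hedged sketch (field names are illustrative, not the packed on-disk bmbt record layout):

```python
from dataclasses import dataclass

# Illustrative extent record; for a regular file the starting file offset
# is included, giving the 4-tuple used in the mapping example below.
@dataclass(frozen=True)
class Extent:
    file_offset: int   # first file block covered (regular files)
    start_block: int   # first filesystem block of the run
    block_count: int   # number of contiguous blocks
    flag: int = 0      # e.g. an unwritten-extent marker

# One record can stand in for thousands of per-block pointers: a contiguous
# 1 GiB run of 4 KiB blocks is a single extent of 262,144 blocks.
one_gib = Extent(file_offset=0, start_block=1000, block_count=262144)
```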
Extent List Mapping
[Diagram: an inode with dformat=extent and aformat=local (the attribute fork holds inline attributes: totalCount, then nlen/vlen/name/value per attribute). The data fork holds three extent records mapping file space to filesystem space:
  <0, 20, 4, 0>  — file blocks 0–3 → filesystem blocks 20–23
  <4, 32, 10, 0> — file blocks 4–13 → filesystem blocks 32–41
  <16, 50, 5, 0> — file blocks 16–20 → filesystem blocks 50–54
File blocks 14–15 are a hole and occupy no disk space.]
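The mapping above amounts to a small lookup: given a file block number, scan the extent list and translate, reporting a hole wherever no extent covers the offset. A sketch using the three extents from the example:

```python
def extent_lookup(extents, file_block):
    """Translate a file block number to a filesystem block number.

    extents: list of (file_offset, start_block, block_count) triples.
    Returns None for a hole (no extent covers the offset).
    """
    for file_off, start_blk, count in extents:
        if file_off <= file_block < file_off + count:
            return start_blk + (file_block - file_off)
    return None

# The three extents from the mapping example; file blocks 14-15 are a hole.
example = [(0, 20, 4), (4, 32, 10), (16, 50, 5)]
```

For instance, file block 2 maps to filesystem block 22, while file block 14 falls in the hole and maps to nothing.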
Data B+tree
[Diagram: an inode with dformat=btree and aformat=local. The data fork holds the B+tree root information (level=2, numrecs=1) with a single file-offset/fs-block entry 0/7 pointing at the root node in block 7. The root node (numrecs=3) holds file-offset/fs-block pairs 0/13, 5000/14, 9800/15 pointing at intermediate nodes in blocks 13, 14 and 15. The intermediate nodes (level=1) hold pairs such as 0/40, 100/51, 200/55, 290/56, … (block 13, numrecs=50); 5000/120, 5100/121, 5190/122, … (block 14, numrecs=50); and 9800/802, 9910/803, 10005/805, … (block 15, numrecs=32), each pointing at leaf nodes. The leaf nodes (level=0, e.g. blocks 40 and 51) hold file-offset/fs-block/block-count extent records such as 0/201/2, 2/206/2, 4/210/3, 7/218/1 and 100/311/2, 102/317/2, 104/325/4, 110/340/3.]
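A lookup walks this tree from the fork root down to a leaf, then translates within the matching extent record. A simplified sketch (node layout is illustrative; real nodes are packed on-disk structures), using a two-level slice with block numbers from the figure:

```python
import bisect

def btree_lookup(blocks, root_block, file_block):
    """Descend a data-fork B+tree to map a file block to a filesystem block.

    blocks: block number -> node dict. Interior nodes: {"level", "keys",
    "ptrs"}; leaf nodes: {"level": 0, "recs": [(offset, fs_block, count)]}.
    """
    node = blocks[root_block]
    while node["level"] > 0:
        # follow the child whose key is the largest one <= file_block
        i = max(bisect.bisect_right(node["keys"], file_block) - 1, 0)
        node = blocks[node["ptrs"][i]]
    for off, blk, count in node["recs"]:
        if off <= file_block < off + count:
            return blk + (file_block - off)
    return None  # hole

# A two-level slice of the tree in the figure (blocks 7, 40 and 51).
tree = {
    7:  {"level": 1, "keys": [0, 100], "ptrs": [40, 51]},
    40: {"level": 0, "recs": [(0, 201, 2), (2, 206, 2), (4, 210, 3), (7, 218, 1)]},
    51: {"level": 0, "recs": [(100, 311, 2), (102, 317, 2), (104, 325, 4)]},
}
```

Because keys in each node are sorted, every level is a binary search, so lookups stay logarithmic in the number of extents.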
File System On-Disk Layout
A filesystem is divided into allocation groups: AG0, AG1, AG2, …
Each AG contains its header structures (SB, AGF, AGFL, AGI, …) followed by file data blocks (FDB1, FDB2, …).

AG: Allocation Group
SB: Superblock (1 sector, xfs_sb_t)
AGF: Allocation Group Free Space (1 sector, xfs_agf_t)
AGI: Allocation Group Inode (1 sector, xfs_agi_t)
AGFL: Allocation Group Freelist (1 sector, xfs_agfl_t)
FDB: File Data Block (file block size: 4 KB by default)
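Since each header occupies one sector at the front of its allocation group, the byte offset of any header follows from the AG geometry. A hedged sketch (the sector ordering shown is the conventional XFS layout, but treat it as illustrative):

```python
SECTOR_SIZE = 512

def ag_header_offsets(ag_number, blocks_per_ag, block_size=4096):
    """Byte offsets of the one-sector AG headers for a given AG.

    Assumes the common ordering: superblock, AGF, AGI, AGFL in the first
    four sectors of each AG (only AG 0's superblock is the primary copy).
    """
    base = ag_number * blocks_per_ag * block_size
    return {
        "sb":   base,                    # xfs_sb_t
        "agf":  base + 1 * SECTOR_SIZE,  # xfs_agf_t (free space)
        "agi":  base + 2 * SECTOR_SIZE,  # xfs_agi_t (inodes)
        "agfl": base + 3 * SECTOR_SIZE,  # xfs_agfl_t (freelist)
    }
```

Spreading identical header layouts across AGs is what lets allocation proceed in parallel: each AG manages its own free space and inodes independently.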
AGF B+Tree
AGF B+Tree (2 levels)
AGI B+Tree
Inode Allocation
• Inode size
  – Configurable at filesystem creation time
  – Can be 256 B, 512 B, and so on up to 4096 B
• Inode allocation
  – The allocation unit is a cluster of 64 inodes
  – An inode mask records which inodes in the cluster are free
• An inode number consists of:
  – The AG number
  – The filesystem block number of the inode cluster
  – The index within the cluster
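The three fields can be packed into one integer with shifts and masks. A sketch with made-up field widths (XFS derives the real widths from superblock geometry such as sb_agblklog and sb_inopblog):

```python
# Hypothetical field widths chosen only for illustration.
INDEX_BITS = 6    # 64 inodes per cluster -> 6 bits of index
BLOCK_BITS = 26   # block number of the cluster within the AG

def pack_ino(ag, block, index):
    """Compose an inode number from (AG number, cluster block, index)."""
    return (ag << (BLOCK_BITS + INDEX_BITS)) | (block << INDEX_BITS) | index

def unpack_ino(ino):
    """Recover (AG number, cluster block, index) from an inode number."""
    index = ino & ((1 << INDEX_BITS) - 1)
    block = (ino >> INDEX_BITS) & ((1 << BLOCK_BITS) - 1)
    ag = ino >> (BLOCK_BITS + INDEX_BITS)
    return ag, block, index
```

Encoding the location into the number itself is why no global inode table is needed: the number alone says where on disk the inode lives.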
XFS Directory Structure
• Unix files are organized in an inverted tree structure.
• Each directory holds a list of the files beneath it.
• Each entry in a directory represents a file, a symbolic link, or a subdirectory.
• Each entry has the object's name, the length of the name, and the corresponding inode number.
• Directory data are usually stored in directory blocks; a directory block's size is a multiple of the file data block size. The superblock's sb_dirblklog field designates the size, which can range from 4 KB to 64 KB.
Directory Forms
• Directory data
  – Directory entries (name, length, inode number, offset)
  – Leaf array: hash/address pairs for lookup
  – Freeindex array for allocation
• Directory forms
  – Shortform directory: directory data stored within the inode
  – Block directory: one extent; all directory entries stored within a single directory block
  – Leaf directory: extent list; multiple data blocks, one leaf block
  – Node directory: extent list; multiple data blocks, B+tree-like leaf blocks
  – B+tree directory: B+tree format for the data fork
• The system dynamically adjusts the format as directory entries are added or removed.
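That dynamic adjustment can be pictured as an escalation ladder driven by how much room the entries need. A toy sketch (the thresholds are invented; XFS's real checks are byte-size based, not entry counts):

```python
def directory_form(n_entries):
    """Pick a directory representation for a given entry count (toy numbers)."""
    if n_entries <= 8:
        return "shortform"   # entries fit inside the inode itself
    if n_entries <= 128:
        return "block"       # one block: data + leaf array + tail
    if n_entries <= 4096:
        return "leaf"        # several data blocks, a single leaf block
    return "node/btree"      # leaf becomes a B+tree; freeindex blocks added
```

The payoff of the ladder is that tiny directories (the common case) cost no extra I/O at all, while huge directories still get logarithmic lookups.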
Shortform Directory
Block Directory
• Uses a single directory block to store the directory entries.
• The location of the block is stored in the inode's in-core extent list (di_u.u_bmx[0]).
• The directory block (xfs_dir2_block_t) has the following data fields:
  – A header with the magic number and the freespace list (the 3 largest free spaces)
  – The directory entry list (name length, name, inode number, offset)
  – A leaf array of hashval/address pairs for quickly looking up a name by its hash value
  – A tail structure giving the number of elements in the leaf array and the number of stale entries in it; the tail is always located at the end of the block
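Lookups in the leaf array key on a 32-bit hash of the entry name. An illustrative rotate-and-xor hash in the spirit of XFS's directory name hash (the exact on-disk function differs in detail):

```python
def name_hash(name: bytes) -> int:
    """32-bit rolling hash: rotate left by 7, then xor in the next byte."""
    h = 0
    for byte in name:
        h = (((h << 7) | (h >> 25)) & 0xFFFFFFFF) ^ byte
    return h
```

Hashing lets a lookup binary-search the sorted hashval/address pairs instead of string-comparing every entry; only entries whose hash matches need a full name comparison.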
Leaf Directory
• When the directory entries can no longer be stored in one block, an extent list is used to store them.
• Data and leaf information are split into different blocks.
• There are one or more data blocks; each directory data block has its own header and bestfree list.
• There is only one leaf block (the last one). The leaf block has its own header, the hash/address array, and a best-free-space array that records the length of each data block's bestfree[0] entry. The tail holds the number of bestfree elements.
Leaf Block
Node Directory
• When the leaf fills a block, a further split is needed.
• The data blocks are the same as in a leaf directory.
• The leaf blocks are changed into a B+tree, with a generic header pointing to the directory "leaves".
• A new freeindex block records the best free space of each data block.
• The leaf blocks can appear in any order; the only way to determine the appropriate one is through the node block's hash/before values.
B+Tree-style Leaf Blocks

[Diagram: a node block pointing to the leaf blocks.]
B+Tree Directory
• With a very large number of directory entries, the inode format is changed to B+tree.
• The B+tree's extents contain extent maps for the data (directory entries), node, leaf (hash/address), and freeindex blocks.
• The node/leaf trees can be more than one level deep.
• More than one freeindex block may exist.
CXFS Clustered File System
[Diagram: hosts sharing storage over a Fibre Channel Storage Area Network.]
• Full standard Unix interface
• As easy to share files as with NFS, but faster
• Near-local file performance from direct data channels
• Fully resilient (HA)
CXFS Concepts
– Metadata
  • The data about a file, including size, inode, create/modify times, and permissions
– Metadata server node (a.k.a. CXFS server)
  • The one machine in the cluster responsible for controlling the metadata of files; it plays "traffic cop" to control access to each file
– Metadata client node (a.k.a. CXFS client)
  • A machine in the cluster that is not the metadata server; it must obtain permission from the metadata server before accessing a file
– A single server manages the metadata
  • Backup metadata servers are designated for fail-over
  • No single point of failure
CXFS networks
• Besides the storage area network, CXFS uses the following networks:
  – Metadata network
    • A dedicated IP network for metadata and tokens
  – Membership network
    • An IP network used for heartbeating
  – Reset network between metadata servers
    • Non-IP serial lines used to reset nodes
  – I/O fencing
    • SAN switch port disable/enable
Data Integrity - IO fencing
• CXFS nodes all have direct access to FC storage.
• The integrity of a shared filesystem requires a unified view of who is allowed to read and write what.
• Tokens control access.
• A failed node may still hold write tokens; such a node must be prevented from unilaterally writing to the shared filesystem.
• Fencing applies to all CXFS platforms and is independent of the disk subsystems.
• It uses a Brocade switch to disable/enable FC ports.
• The I/O fencing architecture could be ported to other switches.
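The fencing decision itself is simple: if a node has stopped heartbeating while it may still hold write tokens, disable its switch port before recovery proceeds. A conceptual sketch (the switch interface here is entirely hypothetical, not a real Brocade API):

```python
class FabricSwitch:
    """Stand-in for an FC switch's management interface (hypothetical API)."""
    def __init__(self):
        self.disabled_ports = set()

    def disable_port(self, port):
        self.disabled_ports.add(port)

def fence_if_needed(node, switch):
    """Fence a node that stopped heartbeating while holding write tokens."""
    if not node["heartbeat_ok"] and node["write_tokens"]:
        switch.disable_port(node["fc_port"])  # node can no longer reach the SAN
        return True
    return False
```

Disabling the port at the fabric rather than at the node is the key design choice: it works even when the failed node is completely unresponsive.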
CXFS Architecture

[Diagram: CXFS clients and the CXFS metadata server.]
Research Issues
• Self-healing, especially deadlock detection and self-recovery
• I/O fencing
• Fail-over
• Intelligent data placement algorithms
• QoS provisioning
• Scalable clusters
• OSD support