
Page 1: Scalable Filesystems XFS & CXFS - Digital Technology Center Home Page

Silicon Graphics, Inc.

January 31, 2007

Scalable Filesystems: XFS & CXFS

Presented by: Yingping Lu

Page 2: Scalable Filesystems XFS & CXFS

Outline

• XFS Overview
• XFS Architecture
• XFS Fundamental Data Structures
  – Extent list
  – B+tree
  – Inode
• XFS Filesystem On-Disk Layout
• XFS Directory Structure
• CXFS: shared file system

Page 3: Scalable Filesystems XFS & CXFS

XFS: A World-Class File System

– Scalable
  • Full 64-bit support
  • Dynamic allocation of metadata space
  • Scalable structures and algorithms
– Fast
  • Fast metadata speeds
  • High bandwidths
  • High transaction rates
– Reliable
  • Field proven
  • Log/journal

Page 4: Scalable Filesystems XFS & CXFS

Scalable

– Full 64-bit support
  • Large filesystems: 18,446,744,073,709,551,615 bytes = 2^64 − 1 ≈ 18 million TB (18 exabytes)
  • Large files: 9,223,372,036,854,775,807 bytes = 2^63 − 1 ≈ 9 million TB (9 exabytes)
– Dynamic allocation of metadata space
  • Inode size configurable; inode space allocated dynamically
  • Unlimited number of files (constrained only by storage space)
– Scalable structures and algorithms (B-trees)
  • Performance holds up even with large numbers of files and directories

Page 5: Scalable Filesystems XFS & CXFS

Fast

– Fast metadata speeds
  • B+trees everywhere (nearly all lists of metadata information)
    – Directory contents
    – Metadata free lists
    – Extent lists within files
– High bandwidths (storage: RM6700)
  • 7.32 GB/s on one filesystem (32p Origin2000, 897 FC disks)
  • >4 GB/s in one file (same Origin, 704 FC disks)
  • Large extents (4 KB to 4 GB)
  • Request parallelism (multiple AGs)
  • Delayed allocation, read ahead/write behind
– High transaction rates: 92,423 IOPS (storage: TP9700)

Page 6: Scalable Filesystems XFS & CXFS

Reliable

– Field proven
  • Has run for over a decade on hundreds of thousands of IRIX systems
  • Ships as the default filesystem on the SGI Altix family (64-bit Linux)
  • Commercial vendors shipping XFS: Ciprico DiMeda NAS Solutions, the Quantum Guardian™ 14000, BigStorage K2~NAS, EchoStar DishPVR 721, Sun Cobalt RaQ™ 550
  • Linux distributions shipping XFS: Mandrake Linux, SuSE Linux, Gentoo Linux, Slackware Linux, JB Linux
  • Now in the Linux kernel
– Log/journal
  • XFS is designed around the log
  • No UNIX fsck is needed
  • Recovery time is independent of filesystem size (it depends on system activity levels); recovery usually completes in under a second

Page 7: Scalable Filesystems XFS & CXFS

Other XFS Features

– Large Range of Block Sizes (512 B to 64 KB)

– Extended attributes

– Sparse files (holes do not use disk space)

– Guaranteed rate I/O

– Online Dump, Resize, Defrag active file systems

– DMAPI interface supported (for HSM)

– Real-time file streams supported

– Runs on IRIX, Linux, and FreeBSD

Page 8: Scalable Filesystems XFS & CXFS

The System Architecture (Linux)

[Figure: the Linux storage stack. User system calls (mount, open, read, write, lseek) enter the kernel through the VFS, which dispatches to filesystem drivers (XFS, ext2, ext3, nfs) backed by the cache manager; I/O then passes through the SCSI middle layer (sd, sr, st) down to the HBA driver (qla2200).]

Page 9: Scalable Filesystems XFS & CXFS

XFS Architecture

[Figure: XFS architecture, layered above the XVM volume manager.]

Page 10: Scalable Filesystems XFS & CXFS

Inode Structure

An on-disk inode (xfs_dinode_t) has four parts:

– di_core (96 bytes): the inode core
  • Timestamps, size, inode number
  • Records the formats of the other two forks
– di_u, the data fork (a union)
  • B+tree root / extent list
  • Symbolic link or other small inline data
  • Small (shortform) directory list
– di_a, the extended attribute fork (a union)
  • B+tree root / extent list
  • Local (inline) attributes
– di_next_unlinked (4 bytes)

Page 11: Scalable Filesystems XFS & CXFS

Extent

• An extent is a run of contiguous file blocks; the minimum size of an extent is one file block.
• An extent is represented as a triple: starting filesystem block, number of blocks, and a flag.
• Extents can significantly reduce the space needed to record allocated and free space when that space contains long runs of contiguous blocks.
• For a regular file, the extent record also includes the file offset.
• With extents, sparse files (files with potential "holes") are supported.

Page 12: Scalable Filesystems XFS & CXFS

Extent List Mapping

[Figure: an inode with dformat=extent and aformat=local maps file space to filesystem space through its extent list. The three extent records, as <file offset, fs block, # blocks, flag> triples:

  <0, 20, 4, 0>    file blocks 0–3   → fs blocks 20–23
  <4, 32, 10, 0>   file blocks 4–13  → fs blocks 32–41
  <16, 50, 5, 0>   file blocks 16–20 → fs blocks 50–54

File blocks 14–15 are unmapped: a hole.]

Page 13: Scalable Filesystems XFS & CXFS

Data B+tree

[Figure: an inode with dformat=btree and aformat=local. The data fork holds the B+tree root node (level=2, numrecs=3), whose file-offset keys 0, 5000, and 9800 point to intermediate nodes in fs blocks 13, 14, and 15. The intermediate nodes (level=1) hold further file offset / fs block keys (e.g. 0/40, 100/51, …) pointing to leaf nodes (level=0) such as fs blocks 40 and 51, whose records are (file offset, fs block, # blocks) extent triples, e.g. 0/201/2 and 2/206/2.]

Page 14: Scalable Filesystems XFS & CXFS

File System On-Disk Layout

A filesystem is divided into allocation groups (AGs): AG0, AG1, AG2, …

Each AG starts with its metadata headers (SB, AGF, AGFL, AGI, …) followed by file data blocks (FDB1, FDB2, …):

– SB: superblock (1 sector, xfs_sb_t)
– AGF: allocation group free space (1 sector, xfs_agf_t)
– AGI: allocation group inode information (1 sector, xfs_agi_t)
– AGFL: allocation group freelist (1 sector, xfs_agfl_t)
– FDB: file data block (file block size: 4 KB by default)

Page 15: Scalable Filesystems XFS & CXFS

AGF B+Tree

Page 16: Scalable Filesystems XFS & CXFS

AGF B+Tree (2 levels)

Page 17: Scalable Filesystems XFS & CXFS

AGI B+Tree

Page 18: Scalable Filesystems XFS & CXFS

Inode Allocation

• Inode size
  – Configurable at filesystem creation time
  – Can be 256 B, 512 B, up to 4096 B
• Inode allocation
  – The allocation unit is a cluster of 64 inodes
  – An inode mask records which inodes in the cluster are free
• An inode number consists of:
  – AG number
  – FS block number of the inode cluster
  – Index within the cluster

Page 19: Scalable Filesystems XFS & CXFS

XFS Directory Structure

• Unix files are organized in an inverted tree structure.
• Each directory holds a list of the files under it.
• Each entry in a directory represents a file object, a link, or a sub-directory.
• Each entry records the object name, the length of the name, and the corresponding inode number.
• Directory data are usually stored in directory blocks; a directory block's size is a multiple of the file data block size. The superblock's sb_dirblklog designates the size, which can range from 4 KB to 64 KB.

Page 20: Scalable Filesystems XFS & CXFS

Directory Forms

• Directory data
  – Directory entries (name, length, inode number, offset)
  – Leaf array: hash/address pairs for lookup
  – Freeindex array for allocation
• Directory forms
  – Shortform directory: directory data stored within the inode
  – Block directory: 1 extent; all directory entries stored within a single directory block
  – Leaf directory: extent list; multiple data blocks, one leaf block
  – Node directory: extent list; multiple data blocks, B+tree-like leaf blocks
  – B+tree directory: B+tree format for the data fork
• The system dynamically adjusts the format as directory entries are added or removed.

Page 21: Scalable Filesystems XFS & CXFS

Shortform Directory

Page 22: Scalable Filesystems XFS & CXFS

Block Directory

• Uses a single directory block to store the directory entries.
• The location of the block is stored in the inode's in-core extent list: di_u.u_bmx[0].
• The directory block (xfs_dir2_block_t) has the following data fields:
  – A header with the magic number and a freespace list (the 3 largest free spaces)
  – The directory entry list (name length, name, inode number, offset)
  – A leaf array: hashval/address pairs for quickly looking up a name by its hash value
  – A tail structure giving the number of elements in the leaf array and the number of stale entries in it; the tail is always located at the end of the block

Page 23: Scalable Filesystems XFS & CXFS

Page 24: Scalable Filesystems XFS & CXFS

Leaf Directory

• When the directory entries no longer fit in one block, an extent list is used to store them.
• Data and leaf information are split into different blocks.
• There are one or more data blocks; each directory data block has its own header and bestfree list.
• There is only one leaf block (the last one). The leaf block has its own header, the hash/address array, and a best-free-space array that records each data block's bestfree[0] length; the tail part holds the number of bestfree elements.

Page 25: Scalable Filesystems XFS & CXFS

Leaf Block

Page 26: Scalable Filesystems XFS & CXFS

Node Directory

• When the leaf fills a block, another split is needed.
• The data blocks are the same as in a leaf directory.
• The leaf blocks change into a B+tree with a generic header pointing to the directory "leaves".
• A new freeindex block records the best free space for each data block.
• The leaf blocks can be located in any order; the only way to find the appropriate one is through the node block's hash/before values.

Page 27: Scalable Filesystems XFS & CXFS

B+Tree-style Leaf Blocks

[Figure: a node block pointing to the leaf blocks.]

Page 28: Scalable Filesystems XFS & CXFS

B+Tree Directory

• With a very large number of directory entries, the inode format changes to B+tree.
• The B+tree's extents contain the extent maps for the data (directory entries), node, leaf (hash/address), and freeindex blocks.
• The node/leaf trees can be more than one level deep.
• More than one freelist may exist.

Page 29: Scalable Filesystems XFS & CXFS

CXFS Clustered File System

[Figure: a CXFS cluster sharing a Fibre Channel storage area network. CXFS offers a full standard Unix interface, makes it as easy to share files as with NFS but faster, delivers near-local file performance through direct data channels, and is fully resilient (HA).]

Page 30: Scalable Filesystems XFS & CXFS

CXFS Concepts

– Metadata
  • The data about a file, including size, inode, create/modify times, and permissions
– Metadata server node (a.k.a. CXFS server)
  • The one machine in the cluster responsible for controlling the metadata of files; it plays "traffic cop" to control access to each file
– Metadata client node (a.k.a. CXFS client)
  • A machine in the cluster that is not the metadata server; it must obtain permission from the metadata server before accessing a file
– A single server manages the metadata
  • Backup metadata servers are designated for fail-over
  • No single point of failure

Page 31: Scalable Filesystems XFS & CXFS

CXFS networks

• Besides the storage area network, CXFS uses the following networks:
  – Metadata network: a dedicated IP network for metadata and tokens
  – Membership network: an IP network used for heartbeating
  – Reset network between metadata servers: non-IP serial lines used to reset nodes
  – I/O fencing: SAN switch port disable/enable

Page 32: Scalable Filesystems XFS & CXFS

Data Integrity - IO fencing

• CXFS nodes all have direct access to the FC storage.
• The integrity of a shared filesystem requires a unified view of who is allowed to read/write what.
• Tokens control access.
• A failed node may retain write tokens; such a node must be prevented from unilaterally writing to the shared filesystem.
• Fencing applies to all CXFS platforms and is independent of the disk subsystems.
• It uses a Brocade switch to disable/enable FC ports.
• The I/O fencing architecture could be ported to other switches.

Page 33: Scalable Filesystems XFS & CXFS

CXFS Architecture

[Figure: CXFS architecture, centered on the CXFS metadata server.]

Page 34: Scalable Filesystems XFS & CXFS

Research Issues

• Self-healing, especially deadlock detection and self-recovery
• I/O fencing
• Fail-over
• Intelligent data placement algorithms
• QoS provisioning
• Scalable clusters
• OSD support