GlusterFS for Storage Admins
Vikhyat and Bipin
GlusterFS Meetup, 7th Feb. 2014


TRANSCRIPT

Page 1: GlusterFS for Storage Admins

Vikhyat and Bipin

GlusterFS Meetup, 7th Feb. 2014

Page 2: Quick Start

● Available in Fedora, Debian, NetBSD and others

● Community packages in multiple versions for different distributions on http://download.gluster.org/

● Quick Start guides on http://gluster.org

Page 3: Quick Start

1. Install the packages (on all storage servers)

2. Start the glusterd service (on all storage servers)

3. Peer probe the other storage servers

4. Create and mount a local filesystem to host a brick

5. Create a volume

6. Start the new volume
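As a rough sketch of these six steps on a two-node replicated setup (the hostnames server1/server2, the device /dev/vg_bricks/brick1, the brick path /bricks/brick1, and the volume name testvol are all assumptions; package and service commands vary by distribution):

# yum install glusterfs-server            (on both servers; apt-get install glusterfs-server on Debian)
# systemctl start glusterd                (on both servers; "service glusterd start" on older init systems)
# gluster peer probe server2              (run once, from server1)
# mkfs.xfs -i size=512 /dev/vg_bricks/brick1
# mkdir -p /bricks/brick1 && mount /dev/vg_bricks/brick1 /bricks/brick1    (on both servers)
# gluster volume create testvol replica 2 server1:/bricks/brick1/data server2:/bricks/brick1/data
# gluster volume start testvol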

Page 4: Configuration Files

● Configuration files can be found under /var/lib/glusterd:

● geo-replication

● glusterd.info

● glustershd

● hooks

● nfs

● peers

● quotad

● snaps

● vols

Page 5: Logs

Component/Service Name: glusterd
Location of the Log File: /var/log/glusterfs/etc-glusterfs-glusterd.vol.log
Remarks: One glusterd log file per server. This log file also contains the snapshot and user logs.

Component/Service Name: bricks
Location of the Log File: /var/log/glusterfs/bricks/<path extraction of brick path>.log
Remarks: One log file per brick on the server.

Component/Service Name: quota
Location of the Log Files and Remarks:
➔ /var/log/glusterfs/quotad.log: Log of the quota daemons running on each node.
➔ /var/log/glusterfs/quota-crawl.log: Whenever quota is enabled, a file system crawl is performed and the corresponding log is stored in this file.
➔ /var/log/glusterfs/quota-mount-VOLNAME.log: An auxiliary FUSE client is mounted in <gluster-run-dir>/VOLNAME of the glusterFS and the corresponding client logs are found in this file.

Page 6: Logs

Component/Service Name: glusterfs-nfs
Location of the Log File: /var/log/glusterfs/nfs.log
Remarks: One log file per server.

Component/Service Name: samba-gluster
Location of the Log File: /var/log/samba/glusterfs-VOLNAME-<ClientIP>.log
Remarks: If the client mounts this on a glusterFS server node, the actual log file or the mount point may not be found. In such a case, the mount outputs of all the glusterFS type mount operations need to be considered.

Component/Service Name: glusterfs-fuse-client
Location of the Log File: /var/log/glusterfs/<mountpoint path extraction>.log

Component/Service Name: geo-replication
Location of the Log Files: /var/log/glusterfs/geo-replication/<master> and /var/log/glusterfs/geo-replication-slaves

Component/Service Name: rebalance
Location of the Log File: /var/log/glusterfs/VOLNAME-rebalance.log
Remarks: One log file per volume on the server.

Component/Service Name: self-heal daemon
Location of the Log File: /var/log/glusterfs/glustershd.log
Remarks: One log file per server.
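For example, a quick way to check these logs for problems (the brick log file name below assumes a brick mounted at /bricks/brick1, since GlusterFS derives the file name from the brick path):

# grep " E " /var/log/glusterfs/etc-glusterfs-glusterd.vol.log     (show glusterd error messages)
# tail -f /var/log/glusterfs/bricks/bricks-brick1.log              (follow one brick's log)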

Page 7: Key Points to remember

● XFS Inode Size = 512 bytes

● XFS Allocation Strategy: inode64

● Access Time: noatime

● iptables

● SELinux

● Do not use a mix of FQDNs and IP addresses

● Do not create GlusterFS volume bricks on raw disks. Bricks must be created on thin-provisioned Logical Volumes (LVs). This is also recommended because the GlusterFS snapshot feature is based on the LVM snapshot feature.
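A minimal sketch of preparing such a brick; the device /dev/sdb, the names vg_bricks, brickpool and brick1, and the sizes are all assumptions:

# pvcreate /dev/sdb
# vgcreate vg_bricks /dev/sdb
# lvcreate -L 500G -T vg_bricks/brickpool                    (thin pool)
# lvcreate -V 200G -T vg_bricks/brickpool -n brick1          (thin-provisioned LV for the brick)
# mkfs.xfs -i size=512 /dev/vg_bricks/brick1
# mkdir -p /bricks/brick1
# mount -o inode64,noatime /dev/vg_bricks/brick1 /bricks/brick1    (add a matching /etc/fstab entry)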

Page 8: Key Points to remember cont..

● Each brick should be mapped to a single thin-provisioned Logical Volume (LV) because of the snapshot feature.

● Bricks should be of the same size; as of now, GlusterFS does not support bricks of different sizes. This feature might be added in future releases.

● If you have a small-file workload, you can set the volume options below:

● # gluster volume set VOLUME group small-file-perf

● To check the other options available, use the command below:

● # gluster volume set help
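For example, setting a single option and confirming it (the volume name myvol and the chosen option are only illustrations):

# gluster volume set myvol performance.cache-size 256MB
# gluster volume info myvol               (options you have set appear under "Options Reconfigured")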

Page 9: Statedump

● Statedump is a mechanism through which you can get details of all internal variables and state of the glusterFS process at the time of issuing the command. You can perform statedumps of the brick processes and NFS server process of a volume using the statedump command.

● # gluster volume statedump VOLNAME [nfs] [all|mem|iobuf|callpool|priv|fd|inode|history]

● To retrieve the statedump information for glusterfs-fuse client processes:

● # ps -ef | grep glusterfs (this gives the process_ID)

● # kill -USR1 <process_ID>

● You can locate the statedump files in /var/run/gluster.
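For example, to dump only the memory details of a volume's bricks and then look at the result (the volume name myvol is an assumption; the dump file names include the PID and a timestamp):

# gluster volume statedump myvol mem
# ls -lt /var/run/gluster/ | head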

Page 10: Statedump Cont..

● The following options determine what information is to be dumped:

● mem - Dumps the memory usage and memory pool details of the bricks.

● iobuf - Dumps iobuf details of the bricks.

● priv - Dumps private information of loaded translators.

● callpool - Dumps the pending calls of the volume.

● fd - Dumps the open file descriptor tables of the volume.

● inode - Dumps the inode tables of the volume.

● history - Dumps the event history of the volume.

Page 11: Repair/Replace Faulty Brick

● We do have replace-brick and remove-brick commands.

● How do you replace a failed GlusterFS brick with a new brick that has the same name as the old brick, on the same GlusterFS node?

● The steps below assume that the new brick's XFS partition is mounted at the same path as the old brick and that /etc/fstab has an entry for it.

● If the XFS filesystem is mounted on /bricks, create the brick directories inside /bricks after checking the paths of the other active bricks; for example, if the brick path is /bricks/brick1:

● # mkdir -p /bricks/brick1

● Create the .glusterfs directory

● # mkdir -p /bricks/brick1/.glusterfs/00/00

● Create the symlink for the root of the brick inside /bricks/brick1/.glusterfs/00/00:

● # cd /bricks/brick1/.glusterfs/00/00

● # ln -s ../../.. 00000000-0000-0000-0000-000000000001

Page 12: Repair/Replace Faulty Brick Cont..

● ll should return this:

● lrwxrwxrwx 1 root root <date> <time> 00000000-0000-0000-0000-000000000001 -> ../../..

● /bricks/brick1 should have a replica pair brick on one of the nodes of the trusted pool; check its details.

● Log in to that node, as later we need to verify the data from this node against the new brick node.

● Also check "trusted.glusterfs.volume-id" on the replica pair brick node and get the volume-id; for example, the replica pair brick is named /bricks/brick2:

● # getfattr -d -m. -e hex /bricks/brick2

● trusted.glusterfs.volume-id= <volume id>

● Go back to the new brick node and set trusted.glusterfs.volume-id on /bricks/brick1:

● # setfattr -n trusted.glusterfs.volume-id -v <volume id> /bricks/brick1

Page 13: Repair/Replace Faulty Brick Cont..

● After setting the volume-id, verify it:

● # getfattr -d -m. -e hex /bricks/brick1

● trusted.glusterfs.volume-id= <volume id>

● If the volume is in the stopped state, start the volume:

● # gluster volume start VOLNAME

● Check # gluster volume status to verify that the new brick came online and has a PID and port number.

● If the brick is not online, restart the volume:

● # gluster volume stop VOLNAME force

● # gluster volume start VOLNAME

● Run a full self-heal:

● # gluster volume heal VOLNAME full

● Compare the data on the replica brick /bricks/brick2 with the new brick /bricks/brick1; they should hold the same data.
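A hedged way to confirm the heal has finished before comparing the data by hand (VOLNAME and the brick paths as in this example):

# gluster volume heal VOLNAME info        (should eventually list no pending entries)
# du -sh /bricks/brick1                   (on the new brick node)
# du -sh /bricks/brick2                   (on the replica pair node; the sizes should roughly match)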

Page 14: Client Side Quorum & Split Brain

● Client side quorum should be properly set in order to minimize split-brain

● Client side quorum can be set as below:

● # gluster volume set VOLNAME cluster.quorum-type fixed/auto

● # gluster volume set VOLNAME cluster.quorum-count 1/2

● cluster.quorum-type : If set to fixed, this option allows writes to a file only if the number of active bricks in that replica set (to which the file belongs) is greater than or equal to the count specified in the cluster.quorum-count option. If set to auto, this option allows writes to the file only if the percentage of active replicate bricks is more than 50% of the total number of bricks that constitute that replica. If there are only two bricks in the replica group, the first brick must be up and running to allow modifications.

● cluster.quorum-count : The minimum number of bricks that must be active in a replica-set to allow writes. This option is used in conjunction with cluster.quorum-type =fixed option to specify the number of bricks to be active to participate in quorum. The cluster.quorum-type = auto option will override this value.

● Data Integrity vs Data availability
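As a concrete illustration (the volume name repvol is an assumption): on a replica 3 volume, the settings below block writes unless at least two bricks of each replica set are up, trading some availability for data integrity:

# gluster volume set repvol cluster.quorum-type fixed
# gluster volume set repvol cluster.quorum-count 2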

Page 15: Troubleshooting Split Brain

● Run:

● # gluster volume heal VOLNAME info split-brain

● Close the application using these files.

● Obtain and verify the AFR changelog extended attributes of the file using the getfattr command.

● For example,

# getfattr -d -e hex -m. brick-a/file.txt

# file: brick-a/file.txt

security.selinux=0x726f6f743a6f626a6563745f723a66696c655f743a733000

trusted.afr.vol-client-2=0x000000000000000000000000

trusted.afr.vol-client-3=0x000000000200000000000000

trusted.gfid=0x307a5c9efddd4e7c96e94fd4bcdcbd1b

● Interpreting changelog

0x 000003d7 00000001 00000000
      |        |        |
      |        |        \_ changelog of directory entries
      |        \_ changelog of metadata
      \_ changelog of data

Page 16: Troubleshooting Split Brain cont.

● The following is an example of both data and metadata split-brain on the same file:

# getfattr -d -m . -e hex /gfs/brick-?/a

getfattr: Removing leading '/' from absolute path names

# file: gfs/brick-a/a

trusted.afr.vol-client-0=0x000000000000000000000000

trusted.afr.vol-client-1=0x000003d70000000100000000

trusted.gfid=0x80acdbd886524f6fbefa21fc356fed57

# file: gfs/brick-b/a

trusted.afr.vol-client-0=0x000003b00000000100000000

trusted.afr.vol-client-1=0x000000000000000000000000

trusted.gfid=0x80acdbd886524f6fbefa21fc356fed57

● The client log for the volume shows messages like:

[afr-common.c:1215:afr_detect_self_heal_by_lookup_status] 0-my-replicate-0: entries are missing in lookup of /path/to/file.
[afr-common.c:1341:afr_launch_self_heal] 0-my-replicate-0: background meta-data data entry missing-entry gfid self-heal triggered. path: /path/to/file, reason: lookup detected pending operations
[afr-self-heal-common.c:1087:afr_sh_common_lookup_resp_handler] 0-my-replicate-0: path /path/to/file on subvolume my-client-0 => -1 (No such file or directory)
[afr-self-heal-common.c:1087:afr_sh_common_lookup_resp_handler] 0-my-replicate-0: path /path/to/file on subvolume my-client-2 => -1 (No such file or directory)
[afr-self-heal-data.c:769:afr_sh_data_fxattrop_fstat_done] 0-my-replicate-0: Unable to self-heal contents of '/path/to/file' (possible split-brain). Please delete the file from all but the preferred subvolume.
[afr-self-heal-common.c:2156:afr_self_heal_completion_cbk] 0-my-replicate-0: background meta-data data entry missing-entry gfid self-heal failed on /path/to/file
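One manual way to resolve the data split-brain in the example above, shown only as a sketch: decide which copy to keep (say the one on brick-a), then on the other brick zero out the changelog attribute that accuses brick-a, so that self-heal treats brick-a as the source. The paths and volume name repeat the example above.

# setfattr -n trusted.afr.vol-client-0 -v 0x000000000000000000000000 /gfs/brick-b/a    (clear brick-b's accusation of brick-a)
# gluster volume heal vol                 (or simply access the file from a client mount to trigger the heal)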

Page 17: Questions?