Performance Comparison of Distributed File Systems
GlusterFS, XtreemFS, FhgFS
Marian Marinov, CEO of 1H Ltd.
HighLoad++ 2013 - transcript
What have I tested?
➢ GlusterFS http://glusterfs.org
➢ XtreemFS http://www.xtreemfs.org/
➢ FhgFS (Fraunhofer) http://www.fhgfs.com/cms/
➢ Tahoe-LAFS http://tahoe-lafs.org/
➢ PlasmaFS http://blog.camlcity.org/blog/plasma4.html
What will be compared?
➢ Ease of install and configuration
➢ Sequential write and read (large file)
➢ Sequential write and read (many small files of the same size)
➢ Copy from local to distributed
➢ Copy from distributed to local
➢ Copy from distributed to distributed
➢ Creating many files with random sizes (real-world cases)
➢ Creating many links (cp -al)
Why only 1Gbit/s?
➢ It is considered commodity
➢ 6-7 years ago it was considered high performance
➢ Some projects have started around that time
➢ And lastly, I only had 1Gbit/s switches available for the tests
Let's get the theory first
1Gbit/s has ~950Mbit/s usable bandwidth,
which is 118.75 MBytes/s usable speed.
iperf tests: 512Mbit/s -> 65MByte/s; 938Mbit/s -> 117MByte/s
hping3 TCP PPS tests: 50096 PPS (75 MBytes/s); 62964 PPS (94 MBytes/s)
* Wikipedia: Ethernet frame (for framing overhead)
There are many 1Gbit/s adapters that cannot go beyond 70k pps.
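The PPS figures map back to throughput like this (a quick shell sanity check, assuming full 1500-byte frames):
# packets/s * bytes/frame = bytes/s
echo $(( 50096 * 1500 / 1000000 ))   # ~75 MBytes/s
echo $(( 62964 * 1500 / 1000000 ))   # ~94 MBytes/s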
Verify what the hardware can deliver locally
# echo 3 > /proc/sys/vm/drop_caches
# time dd if=/dev/zero of=test1 bs=XX count=1000
# time dd if=test1 of=/dev/null bs=XX
bs      Local write           Local read
1M      141 MB/s (0m7.493s)   228 MB/s (0m4.605s)
100K    141 MB/s (0m7.639s)   226 MB/s (0m4.596s)
1K      126 MB/s (0m8.354s)   220 MB/s (0m4.770s)
* most distributed filesystems write with the speed of the slowest member node
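A minimal sketch of the baseline run (the same dd commands as above, with the page cache dropped before every measurement):
for bs in 1M 100K 1K; do
    echo 3 > /proc/sys/vm/drop_caches
    time dd if=/dev/zero of=test1 bs=$bs count=1000   # write
    echo 3 > /proc/sys/vm/drop_caches
    time dd if=test1 of=/dev/null bs=$bs              # read
done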
Linux Kernel Tuning
sysctl:
➢ net.core.netdev_max_backlog=2000 (default 1000)
Congestion control - selective acknowledgments:
➢ net.ipv4.tcp_sack=0 (default enabled)
➢ net.ipv4.tcp_dsack=0 (default enabled)
Linux Kernel Tuning
TCP memory optimizations (roughly double the default tcp memory):
                     min     pressure  max
net.ipv4.tcp_mem  =  41460   42484     82920
                     min     default   max
net.ipv4.tcp_rmem =  8192    87380     6291456
net.ipv4.tcp_wmem =  8192    87380     6291456
Linux Kernel Tuning
➢ net.ipv4.tcp_syncookies=0 default 1
➢ net.ipv4.tcp_timestamps=0 default 1
➢ net.ipv4.tcp_app_win=40 default 31
➢ net.ipv4.tcp_early_retrans=1 default 2
* For more information - Documentation/networking/ip-sysctl.txt
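Collected in one place, the settings above can be applied via a sysctl file (a sketch; the file name is arbitrary):
cat > /etc/sysctl.d/90-dfs-tuning.conf <<'EOF'
net.core.netdev_max_backlog = 2000
net.ipv4.tcp_sack = 0
net.ipv4.tcp_dsack = 0
net.ipv4.tcp_mem = 41460 42484 82920
net.ipv4.tcp_rmem = 8192 87380 6291456
net.ipv4.tcp_wmem = 8192 87380 6291456
net.ipv4.tcp_syncookies = 0
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_app_win = 40
net.ipv4.tcp_early_retrans = 1
EOF
sysctl -p /etc/sysctl.d/90-dfs-tuning.conf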
More tuning :)
Ethernet Tuning
➢ TSO (TCP segmentation offload)
➢ GSO (generic segmentation offload)
➢ GRO/LRO (Generic/Large receive offload)
➢ TX/RX checksumming
➢ ethtool -K ethX tx on rx on tso on gro on lro on
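Not every NIC supports all of these offloads; verify what actually got enabled with the lowercase -k query (ethX as above):
ethtool -k ethX   # lists each offload as on/off; unsupported ones show as [fixed]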
GlusterFS setup
1. gluster peer probe nodeX
2. gluster volume create NAME replica/stripe 2 node1:/path/to/storage node2:/path/to/storage
3. gluster volume start NAME
4. mount -t glusterfs nodeX:/NAME /mnt
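A worked example of those four steps, assuming two nodes named node1 and node2 with bricks under /data/brick (names and paths are placeholders):
gluster peer probe node2                 # run on node1
gluster volume create testvol replica 2 node1:/data/brick node2:/data/brick
gluster volume start testvol
mount -t glusterfs node1:/testvol /mnt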
XtreemFS setup
1. Configure and start the directory server(s)
2. Configure and start the metadata server(s)
3. Configure and start the storage server(s)
4. mkfs.xtreemfs localhost/myVolume
5. mount.xtreemfs localhost/myVolume /some/local/path
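On a single test node the five steps could look like this (a sketch; the init script names follow the XtreemFS packages and may vary by distro):
/etc/init.d/xtreemfs-dir start    # directory server
/etc/init.d/xtreemfs-mrc start    # metadata server
/etc/init.d/xtreemfs-osd start    # storage server
mkfs.xtreemfs localhost/myVolume
mount.xtreemfs localhost/myVolume /some/local/path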
FhgFS setup
1. Configure /etc/fhgfs/fhgfs-*
2. /etc/init.d/fhgfs-client rebuild
3. Start daemons fhgfs-mgmtd fhgfs-meta fhgfs-storage fhgfs-admon fhgfs-helperd
4. Configure the local client on all machines
5. Start the local client fhgfs-client
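For a single test node, the daemon start order could look like this (a sketch; init script names from the fhgfs packages, fhgfs-admon is the optional monitoring daemon):
/etc/init.d/fhgfs-mgmtd start     # management daemon first
/etc/init.d/fhgfs-meta start      # metadata server
/etc/init.d/fhgfs-storage start   # storage targets
/etc/init.d/fhgfs-helperd start   # client helper daemon
/etc/init.d/fhgfs-client start    # mounts the filesystem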
Tahoe-LAFS setup
➢ Download
➢ python setup.py build
➢ export PATH="$PATH:$(pwd)/bin"
➢ Install sshfs
➢ Setup ssh rsa key
Tahoe-LAFS setup
➢ mkdir /storage/tahoe
➢ cd /storage/tahoe && tahoe create-introducer .
➢ tahoe start .
➢ cat /storage/tahoe/private/introducer.furl
➢ mkdir /storage/tahoe-storage
➢ cd /storage/tahoe-storage && tahoe create-node .
➢ Add the introducer.furl to tahoe.cfg
➢ Add [sftpd] section to tahoe.cfg
Tahoe-LAFS setup
➢ Configure the shares
➢ shares.needed = 2
➢ shares.happy = 2
➢ shares.total = 2
➢ Add accounts to the accounts file:
# This is a password line: (username, password, cap)
alice password URI:DIR2:ioej8xmzrwilg772gzj4fhdg7a:wtiizszzz2rgmczv4wl6bqvbv33ag4kvbr6prz3u6w3geixa6m6a
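The [sftpd] section referenced above could look like this (a sketch following the Tahoe-LAFS SFTP frontend documentation; the port and key paths are assumptions):
[sftpd]
enabled = true
port = tcp:8022
host_pubkey_file = private/ssh_host_rsa_key.pub
host_privkey_file = private/ssh_host_rsa_key
accounts.file = private/accounts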
Statistics
Sequential write (MBytes/s, higher is better)
dd if=/dev/zero of=test1 bs=1M count=1000
dd if=/dev/zero of=test1 bs=100K count=10000
dd if=/dev/zero of=test1 bs=1K count=1000000

bs      GlusterFS   XtreemFS   FhgFS
1K      13.7        1.7        342
100K    112.6       43.53      358
1M      106.3       59.83      467
Sequential read (MBytes/s, higher is better)
dd if=/mnt/test1 of=/dev/zero bs=XX

bs      GlusterFS   XtreemFS   FhgFS
1K      185.3       74.6       214.6
100K    179.6       105        225
1M      181.3       105.6      209
Sequential write, local to cluster (MBytes/s, higher is better)
dd if=/tmp/test1 of=/mnt/test1 bs=XX (bs = 1K / 100K / 1M)
[Chart comparing GlusterFS, XtreemFS, FhgFS and Tahoe-LAFS; the per-bar mapping is not recoverable from the transcript. Values shown: 11.36, 76.7, 70.3, 96.33, 93.7, 87.26, 43.7, 57.96, 5.41]
Sequential read, cluster to local (MBytes/s, higher is better)
dd if=/mnt/test1 of=/tmp/test1 bs=XX (bs = 1K / 100K / 1M)
[Chart comparing GlusterFS, XtreemFS and FhgFS; the per-bar mapping is not recoverable from the transcript. Values shown: 74.83, 72.56, 82.56, 77.5, 83.76, 85.4, 66.1, 67.13]
Sequential read/write, cluster to cluster (MBytes/s, higher is better)
dd if=/mnt/test1 of=/mnt/test2 bs=XX (bs = 1K / 100K / 1M)
[Chart comparing GlusterFS, XtreemFS and FhgFS; the per-bar mapping is not recoverable from the transcript. Values shown: 11.8, 62.7, 59.6, 94.4, 93.73, 103.96, 36, 40.7]
Joomla tests, local to cluster (28MB, 6384 inodes; seconds per copy, lower is better)
# for i in {1..100}; do time cp -a /tmp/joomla /mnt/joomla$i; done

GlusterFS   19.26
XtreemFS    62.83
FhgFS       31.42
Joomla tests, cluster to local (28MB, 6384 inodes; seconds per copy, lower is better)
# for i in {1..100}; do time cp -a /mnt/joomla /tmp/joomla$i; done

GlusterFS   19.26
XtreemFS    200.73
FhgFS       39.7
Joomla tests, cluster to cluster (28MB, 6384 inodes; seconds per copy, lower is better)
# for i in {1..100}; do time cp -a joomla joomla$i; done
# for i in {1..100}; do time cp -al joomla joomla$i; done

            copy     link (cp -al)
GlusterFS   51.31    22.53
XtreemFS    265.02   113.46
FhgFS       89.52    76.44
Conclusion
➢ Distributed FS for large file storage – FhgFS
➢ General purpose distributed FS - GlusterFS
QUESTIONS?
Marian Marinov<[email protected]>http://www.1h.comhttp://hydra.azilian.netirc.freenode.net hackmanICQ: 7556201Jabber: [email protected]