Performance Comparison of Distributed File Systems
GlusterFS, XtreemFS, FhgFS
Marian Marinov, CEO of 1H Ltd.
HighLoad++ 2013 - transcript
What have I tested?
➢ GlusterFS http://glusterfs.org
➢ XtreemFS http://www.xtreemfs.org/
➢ FhgFS (Fraunhofer) http://www.fhgfs.com/cms/
➢ Tahoe-LAFS http://tahoe-lafs.org/
➢ PlasmaFS http://blog.camlcity.org/blog/plasma4.html
What will be compared?
➢ Ease of install and configuration
➢ Sequential write and read (large file)
➢ Sequential write and read (many small files of the same size)
➢ Copy from local to distributed
➢ Copy from distributed to local
➢ Copy from distributed to distributed
➢ Creating many files with random sizes (real-world cases)
➢ Creating many links (cp -al)
Why only 1Gbit/s?
➢ It is considered commodity
➢ 6-7 years ago it was considered high performance
➢ Some projects have started around that time
➢ And lastly, I only had 1Gbit/s switches available for the tests
Let's get the theory first
1Gbit/s has ~950Mbit/s usable bandwidth,
which is 118.75 MBytes/s usable speed.
iperf tests: 512Mbit/s -> 65MByte/s; 938Mbit/s -> 117MByte/s
hping3 TCP PPS tests: 50096 PPS (75 MBytes/s); 62964 PPS (94 MBytes/s)
* Wikipedia: Ethernet frame (for framing overhead)
There are many 1Gbit/s adapters that cannot go beyond 70k pps.
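The PPS figures map back to throughput like this (a quick shell sanity check, assuming full 1500-byte frames):
# packets/s * bytes/frame = bytes/s
echo $(( 50096 * 1500 / 1000000 ))   # ~75 MBytes/s
echo $(( 62964 * 1500 / 1000000 ))   # ~94 MBytes/s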
Verify what the hardware can deliver locally
# echo 3 > /proc/sys/vm/drop_caches
# time dd if=/dev/zero of=test1 bs=XX count=1000
# time dd if=test1 of=/dev/null bs=XX
bs      Local write           Local read
1M      141 MB/s (0m7.493s)   228 MB/s (0m4.605s)
100K    141 MB/s (0m7.639s)   226 MB/s (0m4.596s)
1K      126 MB/s (0m8.354s)   220 MB/s (0m4.770s)
* most distributed filesystems write with the speed of the slowest member node
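A minimal sketch of the baseline run (the same dd commands as above, with the page cache dropped before every measurement):
for bs in 1M 100K 1K; do
    echo 3 > /proc/sys/vm/drop_caches
    time dd if=/dev/zero of=test1 bs=$bs count=1000   # write
    echo 3 > /proc/sys/vm/drop_caches
    time dd if=test1 of=/dev/null bs=$bs              # read
done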
Linux Kernel Tuning
sysctl:
➢ net.core.netdev_max_backlog=2000 (default 1000)
Congestion control - selective acknowledgments:
➢ net.ipv4.tcp_sack=0 (default enabled)
➢ net.ipv4.tcp_dsack=0 (default enabled)
Linux Kernel Tuning
TCP memory optimizations (roughly double the default tcp memory):
                     min     pressure  max
net.ipv4.tcp_mem  =  41460   42484     82920
                     min     default   max
net.ipv4.tcp_rmem =  8192    87380     6291456
net.ipv4.tcp_wmem =  8192    87380     6291456
Linux Kernel Tuning
➢ net.ipv4.tcp_syncookies=0 default 1
➢ net.ipv4.tcp_timestamps=0 default 1
➢ net.ipv4.tcp_app_win=40 default 31
➢ net.ipv4.tcp_early_retrans=1 default 2
* For more information - Documentation/networking/ip-sysctl.txt
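Collected in one place, the settings above can be applied via a sysctl file (a sketch; the file name is arbitrary):
cat > /etc/sysctl.d/90-dfs-tuning.conf <<'EOF'
net.core.netdev_max_backlog = 2000
net.ipv4.tcp_sack = 0
net.ipv4.tcp_dsack = 0
net.ipv4.tcp_mem = 41460 42484 82920
net.ipv4.tcp_rmem = 8192 87380 6291456
net.ipv4.tcp_wmem = 8192 87380 6291456
net.ipv4.tcp_syncookies = 0
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_app_win = 40
net.ipv4.tcp_early_retrans = 1
EOF
sysctl -p /etc/sysctl.d/90-dfs-tuning.conf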
More tuning :)
Ethernet Tuning
➢ TSO (TCP segmentation offload)
➢ GSO (generic segmentation offload)
➢ GRO/LRO (Generic/Large receive offload)
➢ TX/RX checksumming
➢ ethtool -K ethX tx on rx on tso on gro on lro on
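Not every NIC supports all of these offloads; verify what actually got enabled with the lowercase -k query (ethX as above):
ethtool -k ethX   # lists each offload as on/off; unsupported ones show as [fixed]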
GlusterFS setup
1. gluster peer probe nodeX
2. gluster volume create NAME replica/stripe 2 node1:/path/to/storage node2:/path/to/storage
3. gluster volume start NAME
4. mount -t glusterfs nodeX:/NAME /mnt
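A worked example of those four steps, assuming two nodes named node1 and node2 with bricks under /data/brick (names and paths are placeholders):
gluster peer probe node2                 # run on node1
gluster volume create testvol replica 2 node1:/data/brick node2:/data/brick
gluster volume start testvol
mount -t glusterfs node1:/testvol /mnt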
XtreemFS setup
1. Configure and start the directory server(s)
2. Configure and start the metadata server(s)
3. Configure and start the storage server(s)
4. mkfs.xtreemfs localhost/myVolume
5. mount.xtreemfs localhost/myVolume /some/local/path
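On a single test node the five steps could look like this (a sketch; the init script names follow the XtreemFS packages and may vary by distro):
/etc/init.d/xtreemfs-dir start    # directory server
/etc/init.d/xtreemfs-mrc start    # metadata server
/etc/init.d/xtreemfs-osd start    # storage server
mkfs.xtreemfs localhost/myVolume
mount.xtreemfs localhost/myVolume /some/local/path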
FhgFS setup
1. Configure /etc/fhgfs/fhgfs-*
2. /etc/init.d/fhgfs-client rebuild
3. Start daemons fhgfs-mgmtd fhgfs-meta fhgfs-storage fhgfs-admon fhgfs-helperd
4. Configure the local client on all machines
5. Start the local client fhgfs-client
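For a single test node, the daemon start order could look like this (a sketch; init script names from the fhgfs packages, fhgfs-admon is the optional monitoring daemon):
/etc/init.d/fhgfs-mgmtd start     # management daemon first
/etc/init.d/fhgfs-meta start      # metadata server
/etc/init.d/fhgfs-storage start   # storage targets
/etc/init.d/fhgfs-helperd start   # client helper daemon
/etc/init.d/fhgfs-client start    # mounts the filesystem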
Tahoe-LAFS setup
➢ Download
➢ python setup.py build
➢ export PATH="$PATH:$(pwd)/bin"
➢ Install sshfs
➢ Setup ssh rsa key
Tahoe-LAFS setup
➢ mkdir /storage/tahoe
➢ cd /storage/tahoe && tahoe create-introducer .
➢ tahoe start .
➢ cat /storage/tahoe/private/introducer.furl
➢ mkdir /storage/tahoe-storage
➢ cd /storage/tahoe-storage && tahoe create-node .
➢ Add the introducer.furl to tahoe.cfg
➢ Add [sftpd] section to tahoe.cfg
Tahoe-LAFS setup
➢ Configure the shares
➢ shares.needed = 2
➢ shares.happy = 2
➢ shares.total = 2
➢ Add accounts to the accounts file:
# This is a password line: (username, password, cap)
alice password URI:DIR2:ioej8xmzrwilg772gzj4fhdg7a:wtiizszzz2rgmczv4wl6bqvbv33ag4kvbr6prz3u6w3geixa6m6a
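The [sftpd] section referenced above could look like this (a sketch following the Tahoe-LAFS SFTP frontend documentation; the port and key paths are assumptions):
[sftpd]
enabled = true
port = tcp:8022
host_pubkey_file = private/ssh_host_rsa_key.pub
host_privkey_file = private/ssh_host_rsa_key
accounts.file = private/accounts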
Statistics
Sequential write (MBytes/s, higher is better)
dd if=/dev/zero of=test1 bs=1M count=1000
dd if=/dev/zero of=test1 bs=100K count=10000
dd if=/dev/zero of=test1 bs=1K count=1000000

bs      GlusterFS   XtreemFS   FhgFS
1K      13.7        1.7        342
100K    112.6       43.53      358
1M      106.3       59.83      467
Sequential read (MBytes/s, higher is better)
dd if=/mnt/test1 of=/dev/zero bs=XX

bs      GlusterFS   XtreemFS   FhgFS
1K      185.3       74.6       214.6
100K    179.6       105        225
1M      181.3       105.6      209
Sequential write, local to cluster (MBytes/s, higher is better)
dd if=/tmp/test1 of=/mnt/test1 bs=XX (bs = 1K / 100K / 1M)
[Chart comparing GlusterFS, XtreemFS, FhgFS and Tahoe-LAFS; the per-bar mapping is not recoverable from the transcript. Values shown: 11.36, 76.7, 70.3, 96.33, 93.7, 87.26, 43.7, 57.96, 5.41]
Sequential read, cluster to local (MBytes/s, higher is better)
dd if=/mnt/test1 of=/tmp/test1 bs=XX (bs = 1K / 100K / 1M)
[Chart comparing GlusterFS, XtreemFS and FhgFS; the per-bar mapping is not recoverable from the transcript. Values shown: 74.83, 72.56, 82.56, 77.5, 83.76, 85.4, 66.1, 67.13]
Sequential read/write, cluster to cluster (MBytes/s, higher is better)
dd if=/mnt/test1 of=/mnt/test2 bs=XX (bs = 1K / 100K / 1M)
[Chart comparing GlusterFS, XtreemFS and FhgFS; the per-bar mapping is not recoverable from the transcript. Values shown: 11.8, 62.7, 59.6, 94.4, 93.73, 103.96, 36, 40.7]
Joomla tests, local to cluster (28MB, 6384 inodes; seconds per copy, lower is better)
# for i in {1..100}; do time cp -a /tmp/joomla /mnt/joomla$i; done

GlusterFS   19.26
XtreemFS    62.83
FhgFS       31.42
Joomla tests, cluster to local (28MB, 6384 inodes; seconds per copy, lower is better)
# for i in {1..100}; do time cp -a /mnt/joomla /tmp/joomla$i; done

GlusterFS   19.26
XtreemFS    200.73
FhgFS       39.7
Joomla tests, cluster to cluster (28MB, 6384 inodes; seconds per copy, lower is better)
# for i in {1..100}; do time cp -a joomla joomla$i; done
# for i in {1..100}; do time cp -al joomla joomla$i; done

            copy     link (cp -al)
GlusterFS   51.31    22.53
XtreemFS    265.02   113.46
FhgFS       89.52    76.44
Conclusion
➢ Distributed FS for large file storage – FhgFS
➢ General purpose distributed FS - GlusterFS
QUESTIONS?
Marian Marinov<[email protected]>http://www.1h.comhttp://hydra.azilian.netirc.freenode.net hackmanICQ: 7556201Jabber: [email protected]