ADIC / CASPUR / CERN / DataDirect / ENEA / IBM / RZ Garching / SGI
New results from CASPUR Storage Lab
Andrei Maslennikov, CASPUR Consortium
May 2004
A.Maslennikov - May 2004 - SLAB update
Participated:
ADIC Software: E. Eastman
CASPUR: A. Maslennikov (*), M. Mililotti, G. Palumbo
CERN: C. Curran, J. Garcia Reyero, M. Gug, A. Horvath, J. Iven, P. Kelemen, G. Lee, I. Makhlyueva, B. Panzer-Steindel, R. Többicke, L. Vidak
DataDirect Networks: L. Thiers
ENEA: G. Bracco, S. Pecoraro
IBM: F. Conti, S. De Santis, S. Fini
RZ Garching: H. Reuter
SGI: L. Bagnaschi, P. Barbieri, A. Mattioli
(*) Project Coordinator
Sponsors for these test sessions:
ACAL Storage Networking: Loaned a 16-port Brocade switch
ADIC Software: Provided the StorNext file system product, actively participated in tests
DataDirect Networks: Loaned an S2A 8000 disk system, actively participated in tests
E4 Computer Engineering: Loaned 10 assembled biprocessor nodes
Emulex Corporation: Loaned 16 Fibre Channel HBAs
IBM: Loaned a FAStT900 disk system and the SANFS product complete with 2 MDS units, actively participated in tests
Infortrend-Europe: Sold 4 EonStor disk systems at a discount price
INTEL: Donated 10 motherboards and 20 CPUs
SGI: Loaned the CXFS product
Storcase: Loaned an InfoStation disk system
Contents
• Goals
• Components under test
• Measurements:
  - SATA/FC systems
  - SAN File Systems
  - AFS Speedup
  - Lustre (preliminary)
  - LTO2
• Final remarks
Goals for these test series

1. Performance of low-cost SATA/FC disk systems
2. Performance of SAN File Systems
3. AFS Speedup options
4. Lustre
5. Performance of LTO-2 tape drive
Components

Disk systems:
4x Infortrend EonStor A16F-G1A2 16-bay SATA-to-FC arrays:
  Maxtor MaxLine Plus II 250 GB SATA disks (7200 rpm)
  Dual Fibre Channel outlet at 2 Gbit
  Cache: 1 GB
2x IBM FAStT900 dual-controller arrays with SATA expansion units:
  4x EXP100 expansion units with 14 Maxtor SATA disks of the same type
  Dual Fibre Channel outlet at 2 Gbit
  Cache: 1 GB
1x StorCase InfoStation 12-bay array:
  same Maxtor SATA disks
  Dual Fibre Channel outlet at 2 Gbit
  Cache: 256 MB
1x DataDirect S2A 8000 system:
  2 controllers with 74 FC disks of 146 GB
  8 Fibre Channel outlets at 2 Gbit
  Cache: 2.56 GB
Infortrend EonStor A16F-G1A2
- Two 2 Gbps Fibre host channels
- RAID levels supported: 0, 1 (0+1), 3, 5, 10, 30, 50, NRAID and JBOD
- Multiple arrays configurable with dedicated or global hot spares
- Automatic background rebuild
- Configurable stripe size and write policy per array
- Up to 1024 LUNs supported
- 3.5", 1"-high 1.5 Gbps SATA disk drives
- Variable stripe size per logical drive
- Up to 64 TB per logical drive
- Up to 1 GB SDRAM
FAStT900 Storage Server
- 2 Gbps SFP host interfaces
- Expansion units: EXP700 (FC) / EXP100 (SATA)
- Four SAN (FC-SW) or eight direct (FC-AL) host connections
- Four (redundant) 2 Gbps drive channels
- Capacity: 250 GB min – 56 TB max (14 disks x EXP100 SATA); 32 GB min – 32 TB max (14 disks x EXP700 FC)
- Dual-active controllers
- Cache: 2 GB
- RAID support: 0, 1, 3, 5, 10
StorCase Fibre-to-SATA
- SATA and Ultra ATA/133 drive interfaces
- 12 hot-swappable drives
- Switched or FC-AL host connections
- RAID levels: 0, 1, 0+1, 3, 5, 30, 50 and JBOD
- Dual 2 Gbps Fibre host ports
- Supports up to 8 arrays and 128 LUNs
- Up to 1 GB PC200 DDR cache memory
DataDirect S²A8000
- Single 2U S2A8000 with four 2 Gb/s ports, or dual 4U with eight 2 Gb/s ports
- Up to 1120 disk drives; 8192 LUNs supported
- 5 TB to 130 TB with FC disks, 20 TB to 250 TB with SATA disks
- Sustained performance well over 1 GB/s (1.6 GB/s theoretical)
- Full Fibre Channel duplex performance on every port
- PowerLUN™: 1 GB/s+ individual LUNs without host-based striping
- Up to 20 GB of cache, LUN-in-cache solid state disk functionality
- Real-time any-to-any virtualization
- Very fast rebuild rate
Components

- High-end Linux units for both servers and clients: biprocessor Pentium IV Xeon 2.4+ GHz, 1 GB RAM, Qlogic QLA2300 2 Gbit or Emulex LP9xxx Fibre Channel HBAs
- Network: 2x Dell 5224 GigE switches
- SAN: Brocade 3800 switch, 16 ports (test series 1); Qlogic SANbox 5200, 32 ports (test series 2)
- Tapes: 2x IBM Ultrium LTO2 (3580-TD2, Rev: 36U3)
Qlogic SANbox 5200 Stackable Switch
- 8, 12 or 16 auto-detecting 2 Gb/1 Gb device ports with 4-port incremental upgrades
- Stacking of up to 4 units for 64 available user ports
- Interoperable with all FC-SW-2 compliant Fibre Channel switches
- Full-fabric, public-loop or switch-to-switch connectivity on 2 Gb or 1 Gb front ports
- "No-Wait" routing: guaranteed maximum performance independent of data traffic
- Supports traffic between switches, servers and storage at up to 10 Gb/s
- Low cost: the 5200/16p costs at most half as much as a Brocade 3800/16p
- May be upgraded in 8-port steps
IBM LTO Ultrium 2 Tape Drive Features
- 200 GB native capacity (400 GB compressed)
- 35 MB/s native I/O (70 MB/s compressed)
- Native 2 Gb FC interface
- Backward read/write compatibility with Ultrium 1 cartridges
- 64 MB buffer (vs 32 MB in Ultrium 1)
- Speed matching, channel calibration
- 512 tracks (vs 384 in Ultrium 1)
- Faster load/unload, data access and rewind times
SATA / FC Systems
Typical array features:
- single or dual (active-active) controllers
- up to 1 GB of RAID cache
- battery backup to preserve the cache during power cuts
- 8 to 16 drive slots
- cost: 4-6 KUSD per 12/16-bay unit (Infortrend, Storcase)
Case and backplane directly impact the disks' lifetime:
- protection against inrush currents
- protection against rotational vibration
- orientation (horizontal better than vertical – remark by A. Sansum)

Infortrend EonStor: well engineered (removable controller module, lower vibration, horizontal orientation)
Storcase: special protection against inrush currents ("soft-start" drive power circuitry), low vibration
SATA / FC Systems – hw details
High-capacity ATA/SATA disk drives:
- 250 GB (Maxtor, IBM), 400 GB (Hitachi)
- RPM: 7200
- improved quality: 3-year warranty, 5-year component design lifetime

CASPUR experience with Maxtor drives:
- in 1.5 years we lost 5 drives out of ~100, 2 of them due to power cuts
- factory quality for recent Maxtor MaxLine Plus II 250 GB disks: of the 66 disks purchased, 4 were replaced shortly after; the others stand the stress very well
Learned during this meeting:
- RAL annual failure rate is 21 out of 920 Maxtor MaxLine drives
SATA / FC Systems – hw details
SATA / FC Systems – test setup
Parameters to select / tune:
- stripe size for RAID-5
- SCSI queue depth on controller and on Qlogic HBAs
- number of disks per logical drive
In the end we worked with RAID-5 LUNs composed of 8 HDs each. Stripe size: 128K (and 256K in some tests).
[Test setup diagram: 16 dual-2.4+ GHz nodes with Qlogic 2310F HBAs, Dell 5224 GigE switch, 2x Qlogic SANbox 5200, connected to 4x IFT A16F-G1A2, 4x IBM FAStT900 and a Storcase InfoStation]
Kernel settings:
- Kernels: 2.4.20-30.9smp, 2.4.20-20.9.XFS1.3.1smp
- vm.bdflush: "2 500 0 0 500 1000 20 10 0"
- vm.max(min)-readahead: 256(127) for large streaming writes; 4(3) for random reads with small block sizes

File systems:
- EXT3 (128k RAID-5 stripe size):
  fs options: "-m 0 -j -J size=128 -R stride=32 -T largefile4"
  mount options: "data=writeback"
- XFS 1.3.1 (128k RAID-5 stripe size):
  fs options: "-i size=512 -d agsize=4g,su=128k,sw=7,unwritten=0 -l su=128k"
  mount options: "logbsize=262144,logbufs=8"
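Assembled into complete commands, the settings above would look roughly as follows. This is only a sketch: /dev/sdb1 and /fs are placeholder names, and the option syntax follows the 2.4-kernel-era tools used in these tests.

```shell
# VM tuning for large streaming writes (values from the slide)
sysctl -w vm.bdflush="2 500 0 0 500 1000 20 10 0"
sysctl -w vm.max-readahead=256
sysctl -w vm.min-readahead=127

# EXT3 on a 128k-stripe RAID-5 LUN; stride=32 = 128k stripe / 4k blocks
mkfs.ext3 -m 0 -j -J size=128 -R stride=32 -T largefile4 /dev/sdb1
mount -o data=writeback /dev/sdb1 /fs

# XFS 1.3.1 on the same geometry: su=128k, sw=7 (8-disk RAID-5)
mkfs.xfs -i size=512 -d agsize=4g,su=128k,sw=7,unwritten=0 -l su=128k /dev/sdb1
mount -o logbsize=262144,logbufs=8 /dev/sdb1 /fs
```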
SATA / FC tests – kernel and fs details
Large serial writes and reads:
- "lmdd" from the "lmbench" suite: http://sourceforge.net/projects/lmbench
  typical invocation: lmdd of=/fs/file bs=1000k count=8000 fsync=1
Random reads:
- Pileup benchmark ([email protected]), designed to emulate the disk activity of multiple data analysis jobs:
  1) a series of 2 GB files is created in the destination directory
  2) these files are then read in a random way, in many threads
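The two Pileup phases can be sketched as a small shell script. This is only an illustration (tiny sizes, a /tmp path, serial reads), not the actual Pileup code, which used 2 GB files and true concurrent threads:

```shell
#!/bin/bash
# Mini-Pileup sketch: phase 1 creates a set of files, phase 2 reads
# them back at random offsets. All names and sizes are illustrative.
DIR=/tmp/pileup
NFILES=4
FSIZE_KB=64          # real benchmark: 2 GB per file
NREADS=16            # real benchmark: many concurrent threads
BLK_KB=8

mkdir -p "$DIR"
# Phase 1: create the file set in the destination directory
for i in $(seq 0 $((NFILES-1))); do
    dd if=/dev/zero of="$DIR/f$i" bs=1k count=$FSIZE_KB 2>/dev/null
done

# Phase 2: read the files back at random offsets (serially here; the
# real test runs many threads and reports the aggregate MB/sec)
for j in $(seq 1 $NREADS); do
    f=$((RANDOM % NFILES))
    off=$((RANDOM % (FSIZE_KB / BLK_KB)))
    dd if="$DIR/f$f" of=/dev/null bs=${BLK_KB}k skip=$off count=1 2>/dev/null
done
echo "done: $NREADS random reads over $NFILES files"
```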
SATA / FC tests – benchmarks used
EXT3 results – filling 1.7 TB with 8 GB files

IFT systems show anomalous behaviour with the EXT3 file system: performance varies along the file system. The effect visibly depends on the RAID-5 stripe size:
SATA / FC results
[Plots: write performance along the file system for RAID-5 stripe sizes of 32K, 128K and 256K]
! The problem was reproduced and understood by Infortrend. New firmware is due in July.
IBM FAStT and Storcase behave in a more predictable manner with EXT3. Both systems may, however, lose up to 20% in performance along the file system:
SATA / FC results
XFS results – filling 1.7 TB with 8 GB files

The situation changes radically with this file system. The curves now become almost flat, and everything is much faster than with EXT3:
SATA / FC results
[Plots: XFS write/read curves for the IBM, Storcase and Infortrend systems]
Infortrend and Storcase show comparable write speeds of about 135-140 MB/sec; IBM is much slower on writes (below 100 MB/sec). Read speeds are visibly higher thanks to the controllers' read-ahead function (the IBM and IFT systems had 1 GB of RAID cache, the Storcase only 256 MB).
Pileup tests: these were done only on the IFT and Storcase systems. The results depend to a large extent on the number of threads that access the previously prepared files (beyond a certain number of threads, performance may drop since the test machine may struggle to handle many threads at a time).
The best result was obtained with the Infortrend array for XFS file system:
SATA / FC results
Threads   EXT3, MB/sec            XFS, MB/sec
          Storcase   Infortrend   Storcase   Infortrend
4         3.7        3.8          9.5        12.1
8         4.4        4.4          10.3       16.8
16        4.4        4.7          12.0       19.3
32        4.5        4.8          12.6       17.9
64        4.4        4.7          11.0       15.9
Operation in degraded mode:
We tried it on a single Infortrend LUN of 5 HDs with EXT3. One of the disks was removed, and the rebuild process was started.

- Write speed went down from 105 to 91 MB/sec
- Read speed went down from 105 to 28 MB/sec, and at times even less
SATA / FC results
1) The recent low-cost SATA-to-FC disk arrays (Infortrend, Storcase) operate very well and deliver excellent I/O speeds, far exceeding that of Gigabit Ethernet. The cost of such systems may be as low as 2.5 USD per raw GB. The quality of these systems is dominated by the quality of the SATA disks.
2) The choice of local file system is fundamental. XFS easily outperforms EXT3.
On one occasion we observed an XFS hang under a very heavy load. "xfs_repair" was run, and the error never reappeared. We are now planning to investigate this in depth. CASPUR AFS and NFS servers are all XFS-based, and there has been only one XFS-related problem since we put XFS into production 1.5 years ago. But perhaps we were simply lucky.
SATA / FC results - conclusions
SAN File Systems
SAN FS Placement These advanced distributed file systems allow clients to operate directly with block devices (block-level file access). Metadata traffic: via GigE. Required: Storage Area Network.
Current cost of a single Fibre Channel connection: > 1000 USD
- switch port: min ~500 USD including GBIC
- Host Bus Adapter: min ~800 USD

Special discounts for volume purchases are possible, but it is very hard to imagine the cost of a connection dropping below 600-700 USD in the near future. A SAN FS with native Fibre Channel connections is still not an option for large farms. A SAN FS with iSCSI connections may be re-evaluated in combination with the new iSCSI-SATA disk arrays.
SAN File Systems
Where SAN file systems with FC connections may be used:
1) High Performance Computing: fast parallel I/O, faster sequential I/O
2) Hybrid SAN / NAS systems: a relatively small number of SAN clients acting as (also redundant) NAS servers
3) HA clusters with file locking: mail (shared pool), web, etc.
SAN File Systems
So far, we have tried these products:
0) Sistina GFS (see our 2002 and 2003 reports)
1) ADIC StorNext File System
2) IBM SANFS (StorTank) (preliminary; we continue looking into it)
3) SGI CXFS (work in progress)
SAN File Systems
FS        Platforms                                                               MDS host required   Max FS size
GFS       Server-Client: Linux32/64                                               No                  2 TB
StorNext  Server-Client: AIX, Linux, Solaris, Irix, Windows                       No                  petabytes
StorTank  Server: Linux32; Client: AIX, Linux, Windows, Solaris                   Yes                 petabytes
CXFS      Server: Irix/Linux64; Client: Irix, Solaris, AIX, Windows, Linux, OS X  Yes                 exabytes (Linux32: 2 TB)
SAN File Systems
What was measured (StorNext and StorTank):
1) Aggregate write and read speeds on 1, 7 and 14 clients
2) Aggregate Pileup speed on 1, 7 and 14 clients accessing:
   A) different sets of files
   B) the same set of files

During these tests we used 4 LUNs of 13 HDs each, as recommended by IBM. For each SAN FS we tried both the IFT and FAStT disk systems.
SAN File Systems
[Test setup diagram: 16 dual-2.4+ GHz nodes with Qlogic 2310F HBAs, Dell 5224 GigE switch, 2x Qlogic SANbox 5200, 4x IFT A16F-G1A2, 4x IBM FAStT900, an IA32 IBM StorTank MDS and an Origin 200 CXFS MDS]
SAN File Systems
Large sequential files: StorNext and StorTank behave in a similar manner on writes; StorNext does better on reads. The IBM disk systems perform better than IFT on reads with multiple clients. All numbers in MB/sec.

IBM StorTank
           1 Client       7 Clients      14 Clients
           IBM    IFT     IBM    IFT     IBM    IFT
Write      115    107     275    -       300    341
Read       125    135     357    252     423    322

ADIC StorNext
           1 Client       7 Clients      14 Clients
           IBM    IFT     IBM    IFT     IBM    IFT
Write      131    157     246    300     331    340
Read       186    174     532    270     630    285
SAN File Systems
Pileup tests: StorTank definitely outperforms StorNext in this type of benchmark. The results are very interesting: it turns out that peak Pileup speeds with StorTank on a single client may reach GigE speed (case of the IFT disk). All numbers in MB/sec.

IBM StorTank
Threads    1 Client       7 Clients      14 Clients
           IBM    IFT     IBM    IFT     IBM    IFT
32 A       55     91      111    88      124    102
64 A       72     120     159    116     138    72
64 B       100    23      -      -       -      -

ADIC StorNext
Threads    1 Client       7 Clients      14 Clients
           IBM    IFT     IBM    IFT     IBM    IFT
32 A       19     23      47     44      43     42
64 A       21     23      45     44      46     42
64 B       31     10      -      -       -      -

! Unstable for IFT with more than 1 client
CXFS experience:
- MDS on an SGI Origin 200 with 1 GB of RAM (IRIX 6.5.22), 4 IFT arrays
- The first numbers were not bad, but with 4 clients or more the system becomes unstable (when all are used at the same time, one client will hang). This is what we have observed so far:
SAN File Systems
N of Clients   Seq. Write   Seq. Read
1              62 MB/s      130 MB/s
2              91 MB/s      245 MB/s
3              117 MB/s     306 MB/s
We are currently investigating the problem together with SGI.
StorNext on DataDirect system
SAN File Systems
Clients   EXT2, 8 distinct LUNs   StorNext, 2 PowerLUNs
          R/W, MB/sec             R/W, MB/sec
1         140 / 144               178 / 180
8         470 / 700               380 / 535
16        -                       570 / 1000
[Test setup diagram: 16 dual-2.4+ GHz nodes with Emulex LP9xxx HBAs, Dell 5224 GigE switch, 2x Brocade 3800, 2x S2A8000 with 8 FC outlets]
- The S2A 8000 came with FC disks, although we asked for SATA
- Quite easy to configure, extremely flexible
- Multiple levels of redundancy; small declared performance degradation on rebuilds
- We ran only large serial write and read 8 GB lmdd tests using all the available power:
- The performance of a SAN file system is quite close to that of the disk hardware it is built upon (case of native FC connection).
- StorNext is the easiest to configure. It does not require a standalone MDS and works smoothly with all kinds of disk systems, FC switches etc. We were able to export it via NFS, but with the loss of 50% of the available bandwidth. iSCSI=?
- StorTank is probably the most solid implementation of a SAN FS, and it has a lot of useful options. It delivers the best numbers for random reads, and may be considered a good candidate for relatively small clusters with native FC connections intended for express data analysis. It may have issues with 3rd-party disks. Supports iSCSI.
- CXFS uses the very performant XFS base and hence should have good potential, although the 2 TB file system limit on Linux/32bit is a real limitation (the same is true for GFS). Some functions, like MDS fencing, require particular hardware. iSCSI=?
- MDS loads: small for StorNext and CXFS, quite high for StorTank.
SAN File Systems – some remarks
AFS Speedup
- AFS performance for large files is quite poor (max 35-40 MB/sec even on very performant hardware). To a large extent this is due to limitations of the Rx RPC protocol, and to a suboptimal implementation of the file server.
- One possible workaround is to replace the Rx protocol with an alternative one wherever it is used for file serving. We evaluated two such experimental implementations:
1) AFS with OSD support (Rainer Toebbicke). Rainer stores AFS data inside Object-based Storage Devices (OSDs), which need not reside inside the AFS file servers. The OSD performs basic space management and access control, and is implemented as a Linux daemon in user space on top of an EXT2 file system. The AFS file server acts only as an MDS.

2) Reuter's Fast AFS (Hartmut Reuter). In this approach, AFS partitions (/vicepXX) are made visible on the clients via a fast SAN or NAS mechanism. As in case 1), the AFS file server acts as an MDS and directs the clients to the right files inside /vicepXX for faster data access.
AFS speedup options
Both methods worked!
The AFS/OSD scheme was tested during the Fall 2003 session; the tests were done with DataDirect's S2A 8000 system. In one particular test we achieved a 425 MB/sec write speed for both the native EXT2 and the AFS/OSD configurations. The Reuter AFS was evaluated during the Spring 2004 session; the StorNext SAN file system was used to distribute a /vicepX partition among several clients. As in the previous case, AFS/Reuter performance was practically equal to the native performance of StorNext for large files.
To learn more on the DataDirect system and the Fall 2003 session, please visit the following site: http://afs.caspur.it/slab2003b.
AFS speedup options
Lustre!
- Lustre 1.0.4
- We used 4 Object Storage Targets on 4 Infortrend arrays, no striping
- Very interesting numbers for sequential I/O (8 GB files, MB/sec):
Lustre – preliminary results
N of Clients Seq. Write Seq. Read
1 72 33
6 319 234
14 310 287
- These numbers may be directly compared with SAN FS results obtained with the same disk arrays:
               Seq. Write   Seq. Read
StorTank, 1    107          135
StorNext, 1    157          174
StorTank, 14   341          322
StorNext, 14   340          285
LTO-2 Tape Drive
The drive is a "Factor 2" evolution of its predecessor, LTO-1. According to the specs, it should be able to deliver up to 35 MB/sec native I/O speed and 200 GB of native capacity.
We were mainly interested to check the following (see next page):
- write speed as a function of block size
- time to write a tape mark
- positioning times
The overall judgement: quite positive. The drive fits well for backup applications and is acceptable for staging systems. Its strong point is definitely its relatively low cost (10-11 KUSD), which makes it quite competitive (compare with ~30 KUSD for an STK 9940B).
LTO-2 tape drive
Write speed as a function of blocksize: > 31 MB/sec native for large blocks, very stable
LTO-2
Tape mark writing is rather slow, 1.4-1.5 sec/TM
Positioning: it may take up to 1.5 minutes to fsf to the needed file (average: 1 minute)
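The blocksize dependence was measured with simple streaming writes; a sweep of that kind can be sketched as below. This is only an illustration: on the real drive the target would be the no-rewind tape device (e.g. /dev/nst0) and each pass would write many GB, while here a small scratch file stands in so the loop can be run anywhere.

```shell
#!/bin/bash
# Blocksize sweep sketch. TARGET and TOTAL_KB are scaled-down
# placeholders for the real tape device and data volume.
TARGET=${TARGET:-/tmp/lto2-sweep.bin}
TOTAL_KB=$((4*1024))            # 4 MB per pass (real test: many GB)
for bs in 32 64 128 256 512 1024; do
    start=$(date +%s)
    # write the same total amount at each blocksize and time it
    dd if=/dev/zero of="$TARGET" bs=${bs}k count=$((TOTAL_KB / bs)) 2>/dev/null
    echo "bs=${bs}k wrote $(wc -c < "$TARGET") bytes in $(( $(date +%s) - start )) s"
done
```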
Final remarks
Our immediate plans include:
- Further investigation of StorTank, CXFS and yet another SAN file system (Veritas) including NFS export
- Evaluation of iSCSI-enabled SATA RAID arrays in combination with SAN file systems
- Further Lustre testing on IFT and IBM hardware (new version 1.2, striping, other benchmarks)
Feel free to join us at any moment !