TRANSCRIPT
Grid-Ireland Storage Update
Geoff Quigley, Stephen Childs and Brian Coghlan, Trinity College Dublin
Overview
• e-INIS Regional Datastore @ TCD
• Recent storage procurement
• Physical infrastructure
• 10Gb networking
• Simple lessons learned
• STEP09 experiences
• Monitoring
  • Network (STEP09)
  • Storage
e-INIS: The Irish National e-Infrastructure
• Funds Grid-Ireland Operations Centre
• Creating a National Datastore
  • Multiple Regional Datastores
  • Ops Centre runs the TCD regional datastore
• For all disciplines, not just science & technology
• Projects with (inter)national dimension
• Central allocation process
• Grid and non-grid use
Procurement
Grid-Ireland @ TCD already had some hardware:
• Dell PowerEdge 2950 (2x quad-core Xeon)
• Dell MD1000 (SAS JBOD)
After procurement the datastore has in total:
• 8x Dell PE2950 (6x 1TB disks, 10GbE)
• 30x MD1000, each with 15x 1TB disks
  • ~11.6 TiB each after RAID6 and XFS format (~350 TiB total)
• 2x Dell blade chassis with 8x M600 blades each
• Dell tape library (24x Ultrium 4 tapes)
• HP ExDS9100 with 4 capacity blocks of 82x 1TB disks each and 4 blades
  • ~233 TiB total available for NFS/http export
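The per-shelf figure can be sanity-checked: RAID6 gives up two disks' worth of capacity per array, and drive vendors quote decimal terabytes while the OS reports binary tebibytes. A quick sketch (the ~2% XFS formatting overhead is an assumed value, not taken from the slides):

```python
# Sanity check of the quoted capacities (illustrative; the XFS
# overhead factor is an assumed ~2%, not from the slides).

TB = 10**12   # vendor "terabyte" (decimal)
TiB = 2**40   # binary tebibyte, as reported by the OS

def raid6_usable_tib(disks: int, disk_tb: float) -> float:
    """Usable capacity of a RAID6 array: two disks go to parity."""
    return (disks - 2) * disk_tb * TB / TiB

per_shelf = raid6_usable_tib(15, 1.0)   # one MD1000: 15x 1TB disks
after_xfs = per_shelf * 0.98            # assumed filesystem overhead
total = 30 * after_xfs                  # 30 shelves

print(f"{per_shelf:.1f} TiB raw RAID6, ~{after_xfs:.1f} TiB formatted, ~{total:.0f} TiB total")
# → 11.8 TiB raw RAID6, ~11.6 TiB formatted, ~348 TiB total
```

This matches the ~11.6 TiB per shelf and ~350 TiB total quoted above.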
Division of storage
DPM installed on Dell hardware
• ~100TB for Ops Centre to allocate
• Rest for Irish users via allocation process
• May also try to combine with iRODS
HP ExDS high-availability store
• iRODS primarily
• vNFS exports
• Not for conventional grid use
• Bridge services on blades for community-specific access patterns
Infrastructure
Machine room needed an upgrade
• Another cooler
• UPS was maxed out
New high-current AC circuits added
2x 3kVA UPS per rack acquired for the Dell equipment
ExDS has 4x 16A three-phase feeds: 2 on room UPS, 2 raw
10 GbE to move data!
10GbE Optimisations
Benchmarked with netperf (http://www.netperf.org)
• Initially 1-2Gb/s... not good
• Some machines produced figures of 4Gb/s+. What's the difference?
Looked at a couple of documents on this:
• http://www.redhat.com/promo/summit/2008/downloads/pdf/Thursday/Mark_Wagner.pdf
• http://docs.sun.com/source/819-0938-13/D_linux.html
Tested several of these optimisations
• Initially little improvement (~100Mb/s)
• Then identified the most important changes
Best optimisations
Cards fitted to the wrong PCI-E port
• Were in x4 slots instead of x8
New kernel version
• New kernel supports MSI-X (multiqueue)
• Was saturating one core; now interrupts are distributed
Increased MTU (from 1500 to 9216)
• Large difference to netperf
• Smaller difference to real loads
Then compared two switches with direct connection
[Chart: Throughput test. netperf 60s transfer tests (Mbits/sec, scale 0-9000) for Direct, Force 10, Arista and Arista rerun configurations; series A-B Solo, C-D Solo, A-B Sim, C-D Sim. Repeat results shown for the Arista switch.]
[Chart: Latency test. netperf 60s TCP Request/Response tests (requests/sec, scale 0-9000) for Direct, Force 10 and Arista configurations; series A-B Solo, C-D Solo, A-B Sim, C-D Sim.]
ATLAS STEP '09
Storage was mostly in place
10GbE was there but still being tested
• Brought into production early in STEP09
Useful exercise for us
• See bulk data transfer in conjunction with user access to stored data
• The first large 'real' load on the new equipment
Grid-Ireland OpsCentre at TCD involved as a Tier-2 site
• Associated with the NL Tier-1
Peak traffic observed during STEP '09
What did we see?
Data transfers into TCD from NL
• Peaked at 440 Mbit/s (capped at 500)
• Recently upgraded firewall box coped well
Internet to storage:
• HEAnet view of GEANT link
• TCD view of Grid-Ireland link
What else did we see?
Lots of analysis jobs
• Running on cluster nodes
• Accessing large datasets directly from storage
• Caused heavy load on network and disk servers
• Caused problems for other jobs accessing storage
• Now known that the access patterns were pathological
Also production jobs: ATLAS analysis, ATLAS production, LHCb production
Storage to cluster
• 3x 1Gbit bonded links set up
• Almost all data stored on this server
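One caveat with bonded links: under Linux bonding's default layer-2 transmit hash, every frame between one pair of hosts lands on the same slave, so a single client-server flow cannot exceed 1Gb/s even on a 3x 1Gb bond. A sketch of that slave-selection rule (the MAC addresses are invented; the formula follows the kernel's `layer2` xmit_hash_policy):

```python
# Sketch of Linux bonding's default "layer2" transmit hash:
# slave = (last byte of src MAC XOR last byte of dst MAC) % n_slaves.
# MAC addresses here are invented for illustration.

def bond_slave(src_mac: str, dst_mac: str, n_slaves: int = 3) -> int:
    src_last = int(src_mac.split(":")[-1], 16)
    dst_last = int(dst_mac.split(":")[-1], 16)
    return (src_last ^ dst_last) % n_slaves

server = "00:1e:c9:aa:bb:01"
# Every frame between one pair of hosts hashes to the same slave,
# so one flow can never use more than a single 1Gb link.
print(bond_slave(server, "00:1e:c9:aa:bb:10"))  # → 2
```

Spreading data (and hence clients) across several disk servers, as noted in the conclusions, sidesteps this per-flow limit.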
Network update
MonAMI DPM work
• Fix to distinguish filesystems with identical names on different servers
• Fixed display of long labels
• Display space-token stats in TB
• New code for pool stats
MonAMI status
• Pool stats are the first to use the DPM C API; previously everything was done via MySQL
• Was able to merge some of these fixes
  • Time-consuming to contribute patches
  • Single "maintainer" with no dedicated effort...
• MonAMI useful but its future is uncertain
  • Should UKI contribute effort to plugin development?
  • Or should similar functionality be created for "native" Ganglia?
Conclusions
Recent procurement gave us a huge increase in capacity
STEP09 was a great test of the data paths into and within our new infrastructure
Identified bottlenecks and tuned the configuration
• Back-ported the SL5 kernel to support 10GbE on SL4
• Spread data across disk servers for load balancing
• Increased capacity of the cluster-storage link
• Have since upgraded switches
Monitoring crucial to understanding what's going on
• Weathermap for a quick visual check
• Cacti for detailed information on network traffic
• LEMON and Ganglia for host load, cluster usage, etc.
Thanks for your attention!
Monitoring Links
Ganglia monitoring system• http://ganglia.info/
Cacti • http://www.cacti.net/
Network weathermap• http://www.network-weathermap.com/
MonAMI• http://monami.sourceforge.net/
DPM wishes
Quotas are close to becoming essential for us
10GbE problems have highlighted that releases on new platforms are needed far more quickly
10Gb summary
Firewall: 1Gb outbound, 10Gb internally
M8024 switch in 'bridge' blade chassis
• 24-port (16 to blades) layer-3 switch
Force10 switch is the main 'backbone'
• 10GbE cards in DPM servers
• 10GbE uplink from 'National Servers' 6224 switch
10GbE copper (CX4) from ExDS to M6220 in 2nd blade chassis
• Link between the 2 blade chassis: M6220 to M8024
4-way LAG Force10 to M8024
Force10 S2410 switch
• 24-port 10Gb switch, XFP modules
  • Dell supplied our XFPs, so cost per port was reduced
• 10Gb/s only, layer-2 switch
• Same Fulcrum ASIC as the Arista switch tested
  • Uses a standard reference implementation
Arista demo switch
• Arista Networks 7124S 24-port switch, SFP+ modules
  • Low cost per port (switches relatively cheap too)
• 'Open' software: Linux
  • Even has bash available
  • Potential for customisation (e.g. iptables being ported)
• Can run 1Gb/s and 10Gb/s simultaneously
  • Just plug in the different SFPs
• Layer 2/3
  • Some docs refer to layer 3 as a software upgrade
PCI-E
Our 10GbE cards are Intel PCI-E 10GBASE-SR
Dell had plugged most of them into the x4 PCI-E slot
• An error was showing up in dmesg
Trivial solution: moved the cards to x8 slots
• Now get >5Gb/s on some machines
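The x4-vs-x8 problem is easy to see from the raw numbers. These are PCIe Gen1 cards: each lane signals at 2.5 GT/s, and 8b/10b encoding means only 80% of that carries data. A back-of-envelope sketch (protocol overheads reduce the usable figure further):

```python
# Back-of-envelope PCIe Gen1 bandwidth: 2.5 GT/s per lane,
# 8b/10b encoding gives 8 data bits per 10 transferred bits.

def pcie_gen1_gbps(lanes: int) -> float:
    return lanes * 2.5 * 8 / 10

print(pcie_gen1_gbps(4))   # → 8.0  (below 10GbE line rate: an x4 slot caps the card)
print(pcie_gen1_gbps(8))   # → 16.0 (comfortable headroom for 10GbE)
```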
MTU
Maximum Transmission Unit
• Ethernet spec says 1500
• Most hardware/software can support jumbo frames
ixgbe driver allowed MTU=9216
• Must be set through the whole path
• Different switches have different maximum values
Makes a big difference to netperf. Example on SL5 machines, 30s tests:
• MTU=1500: TCP stream at 5399 Mb/s
• MTU=9216: TCP stream at 8009 Mb/s
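Part of the jumbo-frame gain is plain wire efficiency: every frame carries fixed Ethernet overhead (8-byte preamble, 14-byte header, 4-byte FCS, 12-byte inter-frame gap) plus 40 bytes of IP+TCP headers. A sketch of payload efficiency at both MTUs; note the larger real-world win, fewer packets per second and hence less per-packet CPU and interrupt work, is not captured by this number:

```python
# TCP payload efficiency on Ethernet for a given MTU.
# Per-frame wire overhead: preamble 8 + header 14 + FCS 4 + inter-frame gap 12 = 38 bytes.
# Inside the frame, IPv4 + TCP headers take another 40 bytes.

WIRE_OVERHEAD = 8 + 14 + 4 + 12
IP_TCP_HEADERS = 40

def tcp_efficiency(mtu: int) -> float:
    return (mtu - IP_TCP_HEADERS) / (mtu + WIRE_OVERHEAD)

print(f"{tcp_efficiency(1500):.1%}")  # → 94.9%
print(f"{tcp_efficiency(9216):.1%}")  # → 99.2%
```

The efficiency gap alone is only a few percent, which is consistent with most of the 5399 → 8009 Mb/s improvement coming from reduced per-packet processing cost rather than framing overhead.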
MSI-X and Multiqueue
Machines on SL4 kernels had very poor receive performance (50Mb/s)
One core was at 0% idle
• Use mpstat -P ALL
• Sys/soft time used up the whole core
/proc/interrupts showed PCI-MSI in use, with all RX interrupts going to one core
The new kernel had MSI-X and multiqueue
• Interrupts distributed, full RX performance
Multiqueues
-bash-3.1$ grep eth2 /proc/interrupts
114:     247  694613 5597495 1264609    1103   15322  426508 2089709  PCI-MSI-X  eth2:v0-Rx
122:     657 2401390  462620  499858  644629     234 1660625 1098900  PCI-MSI-X  eth2:v1-Rx
130:     220  600108  453070  560354 1937777  128178  468223 3059723  PCI-MSI-X  eth2:v2-Rx
138:      27  764411 1621884 1226975  839601     473  497416 2110542  PCI-MSI-X  eth2:v3-Rx
146:      37  171163  418685  349575 1809175   17262  574859 2744006  PCI-MSI-X  eth2:v4-Rx
154:      27  251647  210168    1889  795228  137892 2018363 2834302  PCI-MSI-X  eth2:v5-Rx
162:      27   85615 2221420  286245  779341     363  415259 1628786  PCI-MSI-X  eth2:v6-Rx
170:      27 1119768 1060578  892101 1312734     813  495187 2266459  PCI-MSI-X  eth2:v7-Rx
178: 1834310  371384  149915  104323   27463 16021786     461 2405659  PCI-MSI-X  eth2:v8-Tx
186:      45       0     158       0       0       1      23       0  PCI-MSI-X  eth2:lsc
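A quick way to confirm that RX interrupts really are spread across cores is to sum the per-CPU columns of `/proc/interrupts` for the NIC's queues. A minimal sketch, run here against a two-line excerpt of the listing above:

```python
# Sum per-CPU interrupt counts for eth2 RX queues from
# /proc/interrupts-style lines; with MSI-X/multiqueue working,
# several CPU columns should show substantial counts.

sample = """\
114:     247  694613 5597495 1264609    1103   15322  426508 2089709  PCI-MSI-X  eth2:v0-Rx
122:     657 2401390  462620  499858  644629     234 1660625 1098900  PCI-MSI-X  eth2:v1-Rx
"""

def per_cpu_totals(text: str) -> list[int]:
    totals = None
    for line in text.splitlines():
        fields = line.split()
        counts = [int(f) for f in fields[1:] if f.isdigit()]
        totals = counts if totals is None else [a + b for a, b in zip(totals, counts)]
    return totals

totals = per_cpu_totals(sample)
busy = sum(1 for t in totals if t > 100_000)
print(f"{busy} of {len(totals)} CPUs handling significant RX interrupt load")
# → 6 of 8 CPUs handling significant RX interrupt load
```

On the broken SL4 kernel the same check would show a single busy column, matching the saturated core seen in mpstat.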