TRANSCRIPT
Grid-Ireland Storage Update
Geoff Quigley, Stephen Childs and Brian Coghlan, Trinity College Dublin
Overview
• e-INIS Regional Datastore @ TCD
• Recent storage procurement
• Physical infrastructure
• 10Gb networking
• Simple lessons learned
• STEP09 experiences
• Monitoring
  • Network (STEP09)
  • Storage
e-INIS: The Irish National e-Infrastructure
• Funds Grid-Ireland Operations Centre
• Creating a National Datastore
  • Multiple Regional Datastores
  • Ops Centre runs the TCD regional datastore
• For all disciplines, not just science & technology
• Projects with (inter)national dimension
• Central allocation process
• Grid and non-grid use
Procurement
Grid-Ireland @ TCD already had some hardware:
• Dell PowerEdge 2950 (2x quad-core Xeon)
• Dell MD1000 (SAS JBOD)
After procurement the datastore has in total:
• 8x Dell PE2950 (6x 1TB disks, 10GbE)
• 30x MD1000, each with 15x 1TB disks
  • ~11.6 TiB each after RAID6 and XFS format (~350 TiB total)
• 2x Dell blade chassis with 8x M600 blades each
• Dell tape library (24x Ultrium 4 tapes)
• HP ExDS9100 with 4 capacity blocks of 82x 1TB disks each and 4 blades
  • ~233 TiB total available for NFS/http export
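The per-shelf figure can be sanity-checked: RAID6 gives up two disks' worth of capacity per array, and drive vendors quote decimal terabytes while the OS reports binary tebibytes. A quick sketch (the ~2% XFS formatting overhead is an assumed value, not taken from the slides):

```python
# Sanity check of the quoted capacities (illustrative; the XFS
# overhead factor is an assumed ~2%, not from the slides).

TB = 10**12   # vendor "terabyte" (decimal)
TiB = 2**40   # binary tebibyte, as reported by the OS

def raid6_usable_tib(disks: int, disk_tb: float) -> float:
    """Usable capacity of a RAID6 array: two disks go to parity."""
    return (disks - 2) * disk_tb * TB / TiB

per_shelf = raid6_usable_tib(15, 1.0)   # one MD1000: 15x 1TB disks
after_xfs = per_shelf * 0.98            # assumed filesystem overhead
total = 30 * after_xfs                  # 30 shelves

print(f"{per_shelf:.1f} TiB raw RAID6, ~{after_xfs:.1f} TiB formatted, ~{total:.0f} TiB total")
# → 11.8 TiB raw RAID6, ~11.6 TiB formatted, ~348 TiB total
```

This matches the ~11.6 TiB per shelf and ~350 TiB total quoted above.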
Division of storage
DPM installed on Dell hardware
• ~100TB for Ops Centre to allocate
• Rest for Irish users via allocation process
• May also try to combine with iRODS
HP ExDS high-availability store
• iRODS primarily
• vNFS exports
• Not for conventional grid use
• Bridge services on blades for community-specific access patterns
Infrastructure
Machine room needed an upgrade
• Another cooler
• UPS was maxed out
New high-current AC circuits added
2x 3kVA UPS per rack acquired for the Dell equipment
ExDS has 4x 16A three-phase feeds: 2 on room UPS, 2 raw
10 GbE to move data!
10GbE Optimisations
Benchmarked with netperf (http://www.netperf.org)
• Initially 1-2Gb/s... not good
• Some machines produced figures of 4Gb/s+. What's the difference?
Looked at a couple of documents on this:
• http://www.redhat.com/promo/summit/2008/downloads/pdf/Thursday/Mark_Wagner.pdf
• http://docs.sun.com/source/819-0938-13/D_linux.html
Tested several of these optimisations
• Initially little improvement (~100Mb/s)
• Then identified the most important changes
Best optimisations
Cards fitted to the wrong PCI-E port
• Were in x4 slots instead of x8
New kernel version
• New kernel supports MSI-X (multiqueue)
• Was saturating one core; now interrupts are distributed
Increased MTU (from 1500 to 9216)
• Large difference to netperf
• Smaller difference to real loads
Then compared two switches with direct connection
[Chart: Throughput test. netperf 60s transfer tests (Mbits/sec, scale 0-9000) for Direct, Force 10, Arista and Arista rerun configurations; series A-B Solo, C-D Solo, A-B Sim, C-D Sim. Repeat results shown for the Arista switch.]
[Chart: Latency test. netperf 60s TCP Request/Response tests (requests/sec, scale 0-9000) for Direct, Force 10 and Arista configurations; series A-B Solo, C-D Solo, A-B Sim, C-D Sim.]
ATLAS STEP '09
Storage was mostly in place
10GbE was there but still being tested
• Brought into production early in STEP09
Useful exercise for us
• See bulk data transfer in conjunction with user access to stored data
• The first large 'real' load on the new equipment
Grid-Ireland OpsCentre at TCD involved as a Tier-2 site
• Associated with the NL Tier-1
Peak traffic observed during STEP '09
What did we see?
Data transfers into TCD from NL
• Peaked at 440 Mbit/s (capped at 500)
• Recently upgraded firewall box coped well
Internet to storage:
• HEAnet view of GEANT link
• TCD view of Grid-Ireland link
What else did we see?
Lots of analysis jobs
• Running on cluster nodes
• Accessing large datasets directly from storage
• Caused heavy load on network and disk servers
• Caused problems for other jobs accessing storage
• Now known that the access patterns were pathological
Also production jobs: ATLAS analysis, ATLAS production, LHCb production
Storage to cluster
• 3x 1Gbit bonded links set up
• Almost all data stored on this server
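One caveat with bonded links: under Linux bonding's default layer-2 transmit hash, every frame between one pair of hosts lands on the same slave, so a single client-server flow cannot exceed 1Gb/s even on a 3x 1Gb bond. A sketch of that slave-selection rule (the MAC addresses are invented; the formula follows the kernel's `layer2` xmit_hash_policy):

```python
# Sketch of Linux bonding's default "layer2" transmit hash:
# slave = (last byte of src MAC XOR last byte of dst MAC) % n_slaves.
# MAC addresses here are invented for illustration.

def bond_slave(src_mac: str, dst_mac: str, n_slaves: int = 3) -> int:
    src_last = int(src_mac.split(":")[-1], 16)
    dst_last = int(dst_mac.split(":")[-1], 16)
    return (src_last ^ dst_last) % n_slaves

server = "00:1e:c9:aa:bb:01"
# Every frame between one pair of hosts hashes to the same slave,
# so one flow can never use more than a single 1Gb link.
print(bond_slave(server, "00:1e:c9:aa:bb:10"))  # → 2
```

Spreading data (and hence clients) across several disk servers, as noted in the conclusions, sidesteps this per-flow limit.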
Network update
MonAMI DPM work
• Fix to distinguish filesystems with identical names on different servers
• Fixed display of long labels
• Display space-token stats in TB
• New code for pool stats
MonAMI status
• Pool stats are the first to use the DPM C API; previously everything was done via MySQL
• Was able to merge some of these fixes
  • Time-consuming to contribute patches
  • Single "maintainer" with no dedicated effort...
• MonAMI useful but its future is uncertain
  • Should UKI contribute effort to plugin development?
  • Or should similar functionality be created for "native" Ganglia?
Conclusions
Recent procurement gave us a huge increase in capacity
STEP09 was a great test of the data paths into and within our new infrastructure
Identified bottlenecks and tuned the configuration
• Back-ported the SL5 kernel to support 10GbE on SL4
• Spread data across disk servers for load balancing
• Increased capacity of the cluster-storage link
• Have since upgraded switches
Monitoring crucial to understanding what's going on
• Weathermap for a quick visual check
• Cacti for detailed information on network traffic
• LEMON and Ganglia for host load, cluster usage, etc.
Thanks for your attention!
Monitoring Links
Ganglia monitoring system• http://ganglia.info/
Cacti • http://www.cacti.net/
Network weathermap• http://www.network-weathermap.com/
MonAMI• http://monami.sourceforge.net/
DPM wishes
Quotas are close to becoming essential for us
10GbE problems have highlighted that releases on new platforms are needed far more quickly
10Gb summary
Firewall: 1Gb outbound, 10Gb internally
M8024 switch in 'bridge' blade chassis
• 24-port (16 to blades) layer-3 switch
Force10 switch is the main 'backbone'
• 10GbE cards in DPM servers
• 10GbE uplink from 'National Servers' 6224 switch
10GbE copper (CX4) from ExDS to M6220 in 2nd blade chassis
• Link between the 2 blade chassis: M6220 to M8024
4-way LAG Force10 to M8024
Force10 S2410 switch
• 24-port 10Gb switch, XFP modules
  • Dell supplied our XFPs, so cost per port was reduced
• 10Gb/s only, layer-2 switch
• Same Fulcrum ASIC as the Arista switch tested
  • Uses a standard reference implementation
Arista demo switch
• Arista Networks 7124S 24-port switch, SFP+ modules
  • Low cost per port (switches relatively cheap too)
• 'Open' software: Linux
  • Even has bash available
  • Potential for customisation (e.g. iptables being ported)
• Can run 1Gb/s and 10Gb/s simultaneously
  • Just plug in the different SFPs
• Layer 2/3
  • Some docs refer to layer 3 as a software upgrade
PCI-E
Our 10GbE cards are Intel PCI-E 10GBASE-SR
Dell had plugged most of them into the x4 PCI-E slot
• An error was showing up in dmesg
Trivial solution: moved the cards to x8 slots
• Now get >5Gb/s on some machines
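The x4-vs-x8 problem is easy to see from the raw numbers. These are PCIe Gen1 cards: each lane signals at 2.5 GT/s, and 8b/10b encoding means only 80% of that carries data. A back-of-envelope sketch (protocol overheads reduce the usable figure further):

```python
# Back-of-envelope PCIe Gen1 bandwidth: 2.5 GT/s per lane,
# 8b/10b encoding gives 8 data bits per 10 transferred bits.

def pcie_gen1_gbps(lanes: int) -> float:
    return lanes * 2.5 * 8 / 10

print(pcie_gen1_gbps(4))   # → 8.0  (below 10GbE line rate: an x4 slot caps the card)
print(pcie_gen1_gbps(8))   # → 16.0 (comfortable headroom for 10GbE)
```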
MTU
Maximum Transmission Unit
• Ethernet spec says 1500
• Most hardware/software can support jumbo frames
ixgbe driver allowed MTU=9216
• Must be set through the whole path
• Different switches have different maximum values
Makes a big difference to netperf. Example on SL5 machines, 30s tests:
• MTU=1500: TCP stream at 5399 Mb/s
• MTU=9216: TCP stream at 8009 Mb/s
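Part of the jumbo-frame gain is plain wire efficiency: every frame carries fixed Ethernet overhead (8-byte preamble, 14-byte header, 4-byte FCS, 12-byte inter-frame gap) plus 40 bytes of IP+TCP headers. A sketch of payload efficiency at both MTUs; note the larger real-world win, fewer packets per second and hence less per-packet CPU and interrupt work, is not captured by this number:

```python
# TCP payload efficiency on Ethernet for a given MTU.
# Per-frame wire overhead: preamble 8 + header 14 + FCS 4 + inter-frame gap 12 = 38 bytes.
# Inside the frame, IPv4 + TCP headers take another 40 bytes.

WIRE_OVERHEAD = 8 + 14 + 4 + 12
IP_TCP_HEADERS = 40

def tcp_efficiency(mtu: int) -> float:
    return (mtu - IP_TCP_HEADERS) / (mtu + WIRE_OVERHEAD)

print(f"{tcp_efficiency(1500):.1%}")  # → 94.9%
print(f"{tcp_efficiency(9216):.1%}")  # → 99.2%
```

The efficiency gap alone is only a few percent, which is consistent with most of the 5399 → 8009 Mb/s improvement coming from reduced per-packet processing cost rather than framing overhead.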
MSI-X and Multiqueue
Machines on SL4 kernels had very poor receive performance (50Mb/s)
One core was at 0% idle
• Use mpstat -P ALL
• Sys/soft time used up the whole core
/proc/interrupts showed PCI-MSI in use, with all RX interrupts going to one core
The new kernel had MSI-X and multiqueue
• Interrupts distributed, full RX performance
Multiqueues
-bash-3.1$ grep eth2 /proc/interrupts
114:     247  694613 5597495 1264609    1103   15322  426508 2089709  PCI-MSI-X  eth2:v0-Rx
122:     657 2401390  462620  499858  644629     234 1660625 1098900  PCI-MSI-X  eth2:v1-Rx
130:     220  600108  453070  560354 1937777  128178  468223 3059723  PCI-MSI-X  eth2:v2-Rx
138:      27  764411 1621884 1226975  839601     473  497416 2110542  PCI-MSI-X  eth2:v3-Rx
146:      37  171163  418685  349575 1809175   17262  574859 2744006  PCI-MSI-X  eth2:v4-Rx
154:      27  251647  210168    1889  795228  137892 2018363 2834302  PCI-MSI-X  eth2:v5-Rx
162:      27   85615 2221420  286245  779341     363  415259 1628786  PCI-MSI-X  eth2:v6-Rx
170:      27 1119768 1060578  892101 1312734     813  495187 2266459  PCI-MSI-X  eth2:v7-Rx
178: 1834310  371384  149915  104323   27463 16021786     461 2405659  PCI-MSI-X  eth2:v8-Tx
186:      45       0     158       0       0       1      23       0  PCI-MSI-X  eth2:lsc
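A quick way to confirm that RX interrupts really are spread across cores is to sum the per-CPU columns of `/proc/interrupts` for the NIC's queues. A minimal sketch, run here against a two-line excerpt of the listing above:

```python
# Sum per-CPU interrupt counts for eth2 RX queues from
# /proc/interrupts-style lines; with MSI-X/multiqueue working,
# several CPU columns should show substantial counts.

sample = """\
114:     247  694613 5597495 1264609    1103   15322  426508 2089709  PCI-MSI-X  eth2:v0-Rx
122:     657 2401390  462620  499858  644629     234 1660625 1098900  PCI-MSI-X  eth2:v1-Rx
"""

def per_cpu_totals(text: str) -> list[int]:
    totals = None
    for line in text.splitlines():
        fields = line.split()
        counts = [int(f) for f in fields[1:] if f.isdigit()]
        totals = counts if totals is None else [a + b for a, b in zip(totals, counts)]
    return totals

totals = per_cpu_totals(sample)
busy = sum(1 for t in totals if t > 100_000)
print(f"{busy} of {len(totals)} CPUs handling significant RX interrupt load")
# → 6 of 8 CPUs handling significant RX interrupt load
```

On the broken SL4 kernel the same check would show a single busy column, matching the saturated core seen in mpstat.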