
Page 1

Disk to Disk Data Transfers at 100Gbps SuperComputing 2011

Azher Mughal Caltech (HEP)

CENIC 2012

http://supercomputing.caltech.edu

Page 2

Agenda
• Motivation behind SC 2011 Demo
• Collaboration (Caltech, Univ of Victoria, Vendors)
• Network & Servers Design
• 100G Network Testing
• Disk to Disk Transfers
• PCIe Gen3 Server Performance
• How to Design a Fast Data Transfer Kit
• Questions?

Page 3

Motivation behind SC 2011 Demo
Harvey Newman, Caltech

• The LHC experiments, with their distributed Computing Models and worldwide hands-on involvement in LHC physics, have brought renewed focus on networks, and thus renewed emphasis on the capacity and reliability of those networks.
• Experiments have seen exponential growth in capacity: roughly 10X growth in usage every 47 months in ESnet over 18 years, and about 6M times capacity growth over 25 years across the Atlantic (LEP3Net in 1985 to USLHCNet today).
• The LHC experiments (CMS / ATLAS) are generating massively large data sets that need to be transferred efficiently to end sites anywhere in the world.
• A sustained ability to use ever-larger continental and transoceanic networks effectively: high-throughput transfers.
• HEP as a driver of R&E and mission-oriented networks; testing the latest innovations in both software and hardware.

Page 4

• Fiber distance from the Seattle show floor to Univ of Victoria: about 217 km
• Optical switches: Ciena OME-6500 using 100GE OTU4 cards
• Brocade MLXe-4 routers
  – One 100GE card with LR4
  – One 8+8 port 10GE line card
• Force10 Z9000 40GE switch
• Mellanox PCIe Gen3 dual-port 40GE NICs
• Servers with PCIe Gen3 slots using Intel E5 Sandy Bridge processors

CMS Data Transfer Volume (Oct 2010 – Feb 2011): 10 PetaBytes transferred over 4 months = 8.0 Gbps avg. (15 Gbps peak)
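As a quick sanity check on that average (my own back-of-the-envelope arithmetic, not from the slides): 10 PB is about 8 x 10^16 bits, and four months is roughly 1.04 x 10^7 seconds, so the sustained rate is about 8 x 10^16 / 1.04 x 10^7 ≈ 7.7 x 10^9 bits/s, consistent with the quoted 8.0 Gbps average.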

Page 5

SuperComputing 2011 Collaborators

Caltech Booth 1223

Courtesy of Ciena

Page 6

SC11: Hardware Facts

Caltech Booth
• SuperMicro SandyBridge E5 based servers
• Mellanox 40GE ConnectX-3 NICs, active fiber or passive copper cables
• Dell-Force10 Z9000 switch (all 40GE ports, 54MB shared buffer)
• Brocade MLXe-4 router with 100GE LR4 port and 16 x 10GE ports
• LSI 9265-8i RAID controllers (with FastPath), OCZ Vertex 3 SSD drives

SCinet
• Ciena OME6500 OTN switch

BCNET
• Ciena OME6500 OTN switch

Univ of Victoria
• Brocade MLXe-4 router with 100GE LR4 port and 16 x 10GE ports
• Dell R710 servers with 10GE Intel NICs and OCZ Deneva SSD disks

Page 7

SC11 - WAN Design for 100G

Page 8

Caltech Booth - Detailed Network

Page 9

Key Challenges
• First-hand experience with PCIe Gen3 servers using sample E5 Sandy Bridge processors; not many vendors were available for testing.
• Will FDT be able to get close to the line rate of the Mellanox ConnectX-3 network cards, 39.6 Gbps (theoretical peak)? (See the back-of-the-envelope calculation after this list.)
• What about the firmware and drivers for both Mellanox and LSI?
• Will the LAG between the Brocade 100G router and the Z9000, 10 x 10GE, work?
• End-to-end 100G and 40G testing: any transport issues?
• What do we know about the BIOS settings for Gen3?
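Where the 39.6 Gbps figure comes from, roughly (my own arithmetic, assuming a 9000-byte MTU and standard Ethernet/IP/TCP overheads): each frame carries about 8948 bytes of TCP payload (9000 minus 20 bytes IP header, 20 bytes TCP header, 12 bytes TCP timestamp option) while occupying about 9038 bytes on the wire (9000 + 14 Ethernet header + 4 FCS + 8 preamble + 12 inter-frame gap). That is 8948/9038 ≈ 99% of the 40 Gbps line rate, i.e. roughly 39.6 Gbps of usable TCP throughput.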

Page 10

Issues we faced
• SuperMicro servers and Mellanox CX3 drivers were all at the BETA stage.
• The Mellanox NIC randomly threw interface errors, though no physical errors.
• The QSFP passive copper cable had issues at full rates, with occasional cable errors.
• The LAG between the Brocade and the Z9000 had hashing issues for 10 x 10GE, so we moved to 12 x 10GE.
• The LSI drivers are single threaded, utilizing a single core to the maximum.

Page 11

How it looks from inside

[Server block diagram: two Sandy Bridge E5 CPUs (CPU 0 and CPU 1) connected by QPI, each with four DDR3 memory channels and several PCIe Gen3 x8 slots, plus PCIe Gen2 x8/x4 and DMI2 links. The LSI RAID controllers and the Mellanox NIC attach to these PCIe lanes, the FDT transfer application runs on the host, and the Mellanox card uses 2/4/8 MSI interrupts.]
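Because this layout makes throughput depend on which socket owns a given PCIe slot, it helps to inspect the topology on the host before pinning anything. A minimal sketch using generic Linux commands (not from the slides; the PCI address 0000:02:00.0 and interface name eth2 are made-up examples):

  # NUMA node to which a given PCIe device (e.g. the Mellanox NIC) is attached
  cat /sys/bus/pci/devices/0000:02:00.0/numa_node

  # NUMA node for the device behind a network interface
  cat /sys/class/net/eth2/device/numa_node

  # overall CPU/memory topology and the PCIe tree
  numactl --hardware
  lspci -tv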

Page 12

SC11: Software and Tuning
• RHEL 6.1 / Scientific Linux 6.1 distribution
• Fast Data Transfer (FDT) utility for moving data among the sites
  – Writing on the RAID-0 (SSD disk pool)
  – /dev/zero to /dev/null memory-to-memory tests
• Kernel SMP affinity
  – Keep the Mellanox NIC driver queues on the processor cores where the NIC's PCIe lanes are connected
  – Move the LSI driver IRQs to the second processor
• Use NUMA control (numactl) to bind the FDT application to the second processor
• Change kernel TCP/IP parameters
(A sketch of these tuning steps follows this list.)
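A minimal sketch of what this kind of tuning looks like on a RHEL 6-era host. The buffer sizes, IRQ number, CPU mask, paths, and FDT flags are illustrative placeholders, not the exact SC11 settings; see the FDT documentation at monalisa.cern.ch/FDT for the actual command-line options.

  # larger TCP buffers for high bandwidth-delay-product paths
  sysctl -w net.core.rmem_max=134217728
  sysctl -w net.core.wmem_max=134217728
  sysctl -w net.ipv4.tcp_rmem="4096 87380 134217728"
  sysctl -w net.ipv4.tcp_wmem="4096 65536 134217728"

  # pin a NIC queue IRQ to the socket that owns the NIC's PCIe lanes
  # (IRQ numbers come from /proc/interrupts; 000000ff is an example CPU mask)
  echo 000000ff > /proc/irq/110/smp_affinity

  # run the FDT client on the second socket, with its memory allocated there too
  # (flags are illustrative: -c server, -P parallel streams, -d destination dir)
  numactl --cpunodebind=1 --membind=1 java -jar fdt.jar -c <server> -P 8 -d /raid0 /data/*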

Page 13

Hardware Setting/Tuning
• SuperMicro motherboards
  – The PCIe slot needs to be manually set to Gen3; otherwise it defaults to Gen2
  – Disable Hyper-Threading
  – Change the PCIe payload size to the maximum (for the Mellanox NICs)
• Z9000
  – Flow control needs to be turned on for the Mellanox NIC to work properly with the Z9000 switch
  – Single-queue compared to 4-queue model
• Mellanox
  – Got the latest firmware, which helped reduce the interface errors the NIC had been throwing
(A few host-side commands for checking these settings follow this list.)
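From the host side, a couple of generic Linux commands can confirm that these settings took effect (a sketch assuming a Linux host with lspci and ethtool; the PCI address 02:00.0 and interface name eth2 are placeholders):

  # check negotiated PCIe link speed/width and max payload size for the NIC
  lspci -vv -s 02:00.0 | grep -E "LnkSta|MaxPayload"

  # enable and then verify Ethernet pause-frame flow control on the 40GE interface
  ethtool -A eth2 rx on tx on
  ethtool -a eth2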

Page 14

Servers Testing, reaching 100G

[Traffic In/Out chart, in Gbps] Sustained 186 Gbps; enough to transfer 100,000 Blu-ray discs per day.
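To put that headline figure in perspective (my own back-of-the-envelope arithmetic, not from the slides): 186 Gbps is about 23 GB/s, or roughly 2 PB per day; at around 20-25 GB per Blu-ray disc that works out to on the order of 80,000-100,000 discs per day.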

Page 15

Disk to Disk Results; Peaks of 60Gbps

Disk write on 7 SuperMicro and Dell servers, with a mix of 40GE and 10GE servers.

[Throughput chart, annotated at 60 Gbps]

Page 16

Total Transfers

Total traffic among all the links during the show: 4.657 PetaBytes

Page 17

Single Server Gen3 performance: 36.8 Gbps (TCP) during SC11

Post SC11: reaching 37.5 Gbps inbound, with spikes of 38 Gbps.

[Throughput charts, annotated at 37 Gbps (SC11) and at 37.5 Gbps with a 38 Gbps spike (post-SC11), comparing the CUBIC and HTCP TCP congestion-control algorithms; an example of switching between the two follows below.]
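Since the charts compare CUBIC and HTCP, here is how the TCP congestion-control algorithm is typically switched on a Linux host (generic sysctl commands, not the exact SC11 configuration; the htcp module may need to be loaded first):

  # list the algorithms currently available to the kernel
  sysctl net.ipv4.tcp_available_congestion_control

  # switch the default from cubic to htcp
  modprobe tcp_htcp
  sysctl -w net.ipv4.tcp_congestion_control=htcp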

Page 18

40GE Server Design Kit
• SandyBridge E5 based servers (SuperMicro X9DRi-F or Dell R720), Intel E5-2670 with C1 or C2 stepping
• Mellanox ConnectX-3 PCIe Gen3 NIC
• Dell-Force10 Z9000 40GE switch
• Mellanox QSFP active fiber cables
• LSI 9265-8i 8-port SATA 6G RAID controller
• OCZ Vertex 3 SSDs, 6 Gb/s
• Server cost ≈ $10k

Page 19

Future Directions
• Finding bottlenecks in the LSI RAID card driver; a new driver supporting MSI vectors (many configurable queues) is available
• A more refined approach to distributing the application and drivers among the cores
• Optimizing the Linux kernel, timers, and other unknowns
• New driver available from Mellanox: 1.5.7.2
• Working with the Mellanox team to find the performance limitations that keep us from reaching the possible 39.6 Gbps Ethernet rate using 9K packets
• Ways to lower CPU utilization …
• Understand/overcome SSD wear-out problems over time
• Yet to run tests with E5-2670 C2 stepping chips, which arrived a week ago

Page 20

Summary
• 100Gbps network technology has shown the potential to transfer petascale physics datasets around the world in a matter of hours.
• A couple of highly tuned servers can reach 100GE line rate, effectively utilizing PCIe Gen3 technology.
• Individual server tests using E5 processors and PCIe Gen3 based network cards have shown stable performance, reaching close to 37.5 Gbps.
• The Fast Data Transfer (FDT) application achieved an aggregate disk write of 60 Gbps.
• The MonALISA intelligent monitoring software effectively recorded and displayed the traffic on the 40/100G and the other 10GE links.

Page 21

Questions?