TRANSCRIPT
Disk to Disk Data Transfers at 100Gbps SuperComputing 2011
Azher Mughal, Caltech (HEP)
CENIC 2012
http://supercomputing.caltech.edu
Agenda
• Motivation behind the SC 2011 Demo
• Collaboration (Caltech, Univ of Victoria, Vendors)
• Network & Servers Design
• 100G Network Testing
• Disk to Disk Transfers
• PCIe Gen3 Server Performance
• How to Design a Fast Data Transfer Kit
• Questions?
Motivation behind SC 2011 Demo
Harvey Newman, Caltech
• The LHC experiments, with their distributed computing models and worldwide hands-on involvement in LHC physics, have brought renewed focus on networks, and thus renewed emphasis on network capacity and reliability.
• Experiments have seen exponential growth in capacity: 10X growth in usage every 47 months in ESnet over 18 years, and about 6M times capacity growth over 25 years across the Atlantic (LEP3Net in 1985 to USLHCNet today).
• The LHC experiments (CMS / ATLAS) are generating massively large data sets that need to be efficiently transferred to end sites anywhere in the world.
• A sustained ability to use ever-larger continental and transoceanic networks effectively: high-throughput transfers.
• HEP as a driver of R&E and mission-oriented networks; testing the latest innovations in both software and hardware.
• Fiber distance from the Seattle show floor to the Univ of Victoria: about 217 km
• Optical switches: Ciena OME-6500 using 100GE OTU4 cards
• Brocade MLXe-4 routers
  – One 100GE card with LR4 optics
  – One 8+8 port 10GE line card
• Force10 Z9000 40GE switch
• Mellanox PCIe Gen3 dual-port 40GE NICs
• Servers with PCIe Gen3 slots using Intel E5 Sandy Bridge processors
CMS Data Transfer Volume (Oct 2010– Feb. 2011) 10 PetaBytes Transferred Over 4 Months = 8.0 Gbps Avg. (15 Gbps Peak)
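The average-rate figure above can be checked with back-of-the-envelope arithmetic (a sketch; the 4-month window is approximated as ~122 days, and petabytes are taken as decimal):

```python
# Sanity check: 10 PB transferred over ~4 months, expressed as an average rate.
PETABYTE = 10**15               # bytes (decimal convention)
volume_bits = 10 * PETABYTE * 8
seconds = 122 * 24 * 3600       # ~4 months (Oct 2010 - Feb 2011 window)
avg_gbps = volume_bits / seconds / 1e9
print(round(avg_gbps, 1))       # ~7.6, consistent with the slide's 8.0 Gbps average
```

The small gap to the quoted 8.0 Gbps disappears if the window is slightly shorter than 122 days; the order of magnitude is what matters here.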
SuperComputing 2011 Collaborators
Caltech Booth 1223
Courtesy of Ciena
SC11: Hardware Facts
Caltech Booth
• SuperMicro Sandy Bridge E5 based servers
• Mellanox 40GE ConnectX-3 NICs, active fiber or passive copper cables
• Dell-Force10 Z9000 switch (all 40GE ports, 54 MB shared buffer)
• Brocade MLXe-4 router with a 100GE LR4 port and 16 x 10GE ports
• LSI 9265-8i RAID controllers (with FastPath), OCZ Vertex 3 SSD drives
SCinet
• Ciena OME6500 OTN switch
BCNET
• Ciena OME6500 OTN switch
Univ of Victoria
• Brocade MLXe-4 router with a 100GE LR4 port and 16 x 10GE ports
• Dell R710 servers with 10GE Intel NICs and OCZ Deneva SSD disks
SC11 - WAN Design for 100G
Caltech Booth - Detailed Network
Key Challenges
• First-hand experience with PCIe Gen3 servers using sample E5 Sandy Bridge processors; not many vendors were available for testing.
• Will FDT be able to get close to the line rate of the Mellanox ConnectX-3 network cards, 39.6 Gbps (theoretical peak)?
• What about the firmware and drivers for both Mellanox and LSI?
• Will LAG (10 x 10GE) between the Brocade 100G router and the Z9000 work?
• End-to-end 100G and 40G testing: any transport issues?
• What do we know about the BIOS settings for Gen3?
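The 39.6 Gbps "theoretical peak" can be recovered from Ethernet framing overhead at a 9000-byte MTU (a sketch; it assumes standard 40 GbE framing and a 52-byte TCP/IPv4 header including the timestamp option):

```python
# TCP goodput achievable on 40 Gigabit Ethernet with 9000-byte jumbo frames.
LINE_RATE_GBPS = 40.0
MTU = 9000                      # jumbo-frame IP packet size
ETH_OVERHEAD = 14 + 4 + 8 + 12  # Ethernet header + FCS + preamble + inter-frame gap
TCP_IP_HEADERS = 20 + 20 + 12   # IPv4 + TCP + TCP timestamp option
payload = MTU - TCP_IP_HEADERS  # 8948 bytes of application data per frame
on_wire = MTU + ETH_OVERHEAD    # 9038 bytes consumed on the wire per frame
goodput = LINE_RATE_GBPS * payload / on_wire
print(round(goodput, 1))        # ~39.6 -- the peak rate quoted above
```

With 1500-byte frames the same arithmetic gives roughly 38.4 Gbps, which is one reason jumbo frames matter for these transfers.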
Issues we faced
• SuperMicro servers and Mellanox CX3 drivers were all in BETA stage.
• Mellanox NICs randomly threw interface errors, though no physical errors.
• QSFP passive copper cables had issues at full rates, with occasional cable errors.
• LAG between the Brocade and the Z9000 had hashing issues at 10 x 10GE, so we moved to 12 x 10GE.
• LSI drivers were single-threaded, utilizing a single core to the maximum.
[Server block diagram: two Sandy Bridge E5 CPUs (CPU 0 and CPU 1) linked by QPI, each with four DDR3 memory channels. PCIe Gen3 x8 slots host the Mellanox NICs (MSI interrupts: 2/4/8); the two LSI RAID controllers sit on PCIe Gen2 x8; a PCIe Gen2 x4 link hangs off the DMI2 uplink. The FDT transfer application drives the LSI RAID controller.]
How it looks from inside
SC11: Software and Tuning
• RHEL 6.1 / Scientific Linux 6.1 distribution
• Fast Data Transfer (FDT) utility for moving data among the sites
  – Writing on the RAID-0 (SSD disk pool)
  – /dev/zero to /dev/null memory test
• Kernel SMP affinity
  – Keep the Mellanox NIC driver queues on the processor cores where the NIC's PCIe lanes are connected
  – Move the LSI driver IRQs to the second processor
• Use numactl to bind the FDT application to the second processor
• Change kernel TCP/IP parameters
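The SMP-affinity pinning above is done by writing a hexadecimal CPU mask into /proc/irq/&lt;n&gt;/smp_affinity. A small helper to build such masks (a sketch; the IRQ numbers and the core-to-socket layout are machine-specific, and writing the file requires root):

```python
def cores_to_mask(cores):
    """Build the hex CPU bitmask that /proc/irq/<n>/smp_affinity expects."""
    mask = 0
    for c in cores:
        mask |= 1 << c      # one bit per allowed core
    return format(mask, "x")

# Hypothetical layout: pin NIC queue IRQs to cores 0-7 (the socket whose
# PCIe lanes host the NIC) and LSI IRQs to cores 8-15 (second processor).
print(cores_to_mask(range(0, 8)))   # "ff"
print(cores_to_mask(range(8, 16)))  # "ff00"
# then, as root:  echo ff00 > /proc/irq/<lsi_irq>/smp_affinity
```

The same masks are reused when binding the FDT process itself, e.g. with numactl or taskset, so that the application, its NIC queues, and its RAID IRQs stay on predictable sockets.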
Hardware Settings/Tuning
• SuperMicro motherboards
  – PCIe slots need to be manually set to Gen3; otherwise they default to Gen2
  – Disable Hyper-Threading
  – Change the PCIe payload size to the maximum (for the Mellanox NICs)
• Z9000
  – Flow control needs to be turned on for the Mellanox NICs to work properly with the Z9000 switch
  – Single-queue compared to a 4-queue model
• Mellanox
  – Got the latest firmware, which helped reduce the interface errors the NIC was throwing
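The /dev/zero to /dev/null memory test mentioned earlier isolates the network path from the disks. The same idea can be sketched as a minimal TCP sender/receiver that streams zero-filled buffers and discards them (a toy loopback illustration of the concept, not the FDT tool itself):

```python
import socket
import threading
import time

CHUNK = 1 << 20          # 1 MiB zero-filled buffer (the "/dev/zero" side)
TOTAL = 256 * CHUNK      # total amount to stream

def sink(server_sock, result):
    # Accept one connection and discard everything (the "/dev/null" side).
    conn, _ = server_sock.accept()
    received = 0
    while True:
        data = conn.recv(1 << 16)
        if not data:
            break
        received += len(data)
    conn.close()
    result.append(received)

server = socket.socket()
server.bind(("127.0.0.1", 0))      # ephemeral port on loopback
server.listen(1)
result = []
t = threading.Thread(target=sink, args=(server, result))
t.start()

client = socket.create_connection(server.getsockname())
buf = bytes(CHUNK)                 # zeroed memory, never touches disk
start = time.time()
sent = 0
while sent < TOTAL:
    client.sendall(buf)
    sent += CHUNK
client.close()
t.join()
elapsed = time.time() - start
print(f"received {result[0]} bytes, {result[0] * 8 / elapsed / 1e9:.2f} Gbps loopback")
```

Because no disk I/O is involved, any shortfall from line rate in such a test points at the NIC, driver, or kernel tuning rather than the storage stack.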
[Traffic graph: In/Out traffic in Gbps, 0-100 scale]
Sustained 186 Gbps; enough to transfer 100,000 Blu-ray discs per day
Servers Testing, reaching 100G
Disk to Disk Results; Peaks of 60 Gbps
Disk writes on 7 SuperMicro and Dell servers with a mix of 40GE and 10GE NICs.
Total Transfers: traffic among all the links during the show totaled 4.657 PetaBytes.
Single server Gen3 performance: 36.8 Gbps (TCP) during SC11.
Post SC11: reaching 37.5 Gbps inbound, with spikes of 38 Gbps.
[Graph annotations: CUBIC vs. HTCP congestion control; 38 Gbps spike]
40GE Server Design Kit
Sandy Bridge E5 based servers (SuperMicro X9DRi-F or Dell R720):
• Intel E5-2670 with C1 or C2 stepping
• Mellanox ConnectX-3 PCIe Gen3 NIC
• Dell-Force10 Z9000 40GE switch
• Mellanox QSFP active fiber cables
• LSI 9265-8i 8-port SATA 6G RAID controller
• OCZ Vertex 3 SSDs, 6 Gb/s
Server cost: ~$10k
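A rough capacity check for the disk side of this kit (a sketch; the ~500 MB/s sequential rate per Vertex 3 SSD is an assumption, and RAID-0 rarely scales perfectly):

```python
# Rough upper bound for the RAID-0 SSD pool behind one LSI 9265-8i.
SSD_SEQ_MBPS = 500   # assumed sequential MB/s per OCZ Vertex 3 (SATA 6G)
N_SSDS = 8           # one drive per controller port
pool_gbps = N_SSDS * SSD_SEQ_MBPS * 8 / 1000
print(pool_gbps)     # 32.0 Gbps per controller, before RAID/driver overhead
```

Under these assumptions a single controller's pool sits below the 40GE NIC's ~39.6 Gbps ceiling, which is consistent with the disk-to-disk results trailing the memory-to-memory ones.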
Future Directions
• Finding bottlenecks in the LSI RAID card driver; a new driver supporting MSI vectors (many configurable queues) is available
• A better-refined approach to distributing the application and drivers among the cores
• Optimizing the Linux kernel: timers, other unknowns
• New driver available from Mellanox, 1.5.7.2
• Working with the Mellanox team to find why performance does not reach close to the possible 39.6 Gbps Ethernet rate using 9K packets
• Ways to lower CPU utilization
• Understand/overcome SSD wear-out problems over time
• Yet to run tests with E5-2670 C2 chipsets, which arrived a week ago
Summary
• 100Gbps network technology has shown the potential to transfer petascale physics datasets around the world in a matter of hours.
• A couple of highly tuned servers can readily reach the 100GE line rate, effectively utilizing PCIe Gen3 technology.
• Individual server tests using E5 processors and PCIe Gen3 network cards have shown stable performance reaching close to 37.5 Gbps.
• The Fast Data Transfer (FDT) application achieved an aggregate disk write of 60 Gbps.
• The MonALISA intelligent monitoring software effectively recorded and displayed the traffic on the 40/100G and the other 10GE links.
Questions?