TRANSCRIPT
Dror Goldenberg, HPCAC Switzerland Conference
March 2015
High Performance Computing
© 2015 Mellanox Technologies 2
End-to-End Interconnect Solutions for All Platforms
Highest Performance and Scalability for
X86, Power, GPU, ARM and FPGA-based Compute and Storage Platforms
Smart Interconnect to Unleash The Power of All Compute Architectures
© 2015 Mellanox Technologies 3
Technology Roadmap – One-Generation Lead over the Competition
[Roadmap graphic: 20Gb/s (~2005), 40Gb/s, 56Gb/s (~2010), 100Gb/s (2015), 200Gb/s (~2020); milestones include the 2003 Virginia Tech (Apple) TOP500 system and the "Roadrunner" Mellanox Connected system; Terascale → Petascale → Exascale]
© 2015 Mellanox Technologies 4
The Future is Here
Entering the Era of 100Gb/s
Adapter: 100Gb/s, sub-0.7us latency, 150 million messages per second (10 / 25 / 40 / 50 / 56 / 100Gb/s)
Switch: 36 EDR (100Gb/s) ports, <90ns latency, throughput of 7.2Tb/s
Cables: copper (passive, active), optical cables (VCSEL), silicon photonics
© 2015 Mellanox Technologies 5
Mellanox QSFP 100Gb/s Cables
Complete Solution of 100Gb/s Copper and Fiber Cables
Copper Cables VCSEL AOCs Silicon Photonics AOCs
Making 100Gb/s Deployments as Easy as 10Gb/s
© 2015 Mellanox Technologies 6
The Advantage of 100G Passive Copper Cables (Over Fiber)
1000-node cluster: CAPEX saving $420K, power saving 7.6KW
5000-node cluster: CAPEX saving $2.1M, power saving 37.3KW
Lower Cost, Lower Power Consumption, Higher Reliability
Supercomputing’14 Demo: 4m, 6m and 8m Copper Cables
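For scale, the cluster-level savings above work out to roughly the same per-node figure in both cases. A quick back-of-the-envelope check, derived only from the numbers on this slide:

```python
# Per-node savings implied by the slide's cluster-level figures.
for nodes, capex_usd, power_kw in [(1000, 420_000, 7.6), (5000, 2_100_000, 37.3)]:
    print(f"{nodes} nodes: ${capex_usd / nodes:.0f} and {1000 * power_kw / nodes:.1f} W saved per node")
# -> about $420 and 7.5-7.6 W saved per node in both cluster sizes
```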
© 2015 Mellanox Technologies 7
Enter the World of Scalable Performance – 100Gb/s Adapter
Connect. Accelerate. Outperform
ConnectX-4: Highest Performance Adapter in the Market
InfiniBand: SDR / DDR / QDR / FDR / EDR
Ethernet: 10 / 25 / 40 / 50 / 56 / 100GbE
100Gb/s, <0.7us latency
150 million messages per second
OpenPOWER CAPI technology
CORE-Direct technology
GPUDirect RDMA
Dynamically Connected Transport (DCT)
Ethernet /IPoIB offloads (HDS, RSS, TSS, LRO, LSOv2)
© 2015 Mellanox Technologies 8
Shattering The World of Interconnect Performance!
ConnectX-4 EDR 100G InfiniBand
InfiniBand Throughput: 100 Gb/s
InfiniBand Bi-Directional Throughput: 195 Gb/s
InfiniBand Latency: 0.61 us
InfiniBand Message Rate: 149.5 million messages/sec
HPC-X MPI Bi-Directional Throughput: 193.1 Gb/s
*First results, optimizations in progress
© 2015 Mellanox Technologies 9
ConnectX-4 InfiniBand Message Rate
© 2015 Mellanox Technologies 10
Enter the World of Scalable Performance – 100Gb/s Switch
Store Analyze
7th Generation InfiniBand Switch
36 EDR (100Gb/s) Ports, <90ns Latency
Throughput of 7.2 Tb/s
InfiniBand Router
Adaptive Routing
Switch-IB: Highest Performance Switch in the Market
© 2015 Mellanox Technologies 11
FDR to EDR Switch Hardware Improvements
• x86 CPU
• Moving to efficient power supplies (80 PLUS Gold+ and Energy Star certifications)
• All ports support 3.5W (class 4)
• 4 separate fan drawers for improved fan MTBF
• Two power supplies out of the box
• AC or DC power supply ordering options
• BBU (Battery Backup Unit) ordering option
• Improved rail kits (captive screws, stopper, routing)
• Dual leaf design
• N+N power supplies @ 220VAC
• Modular design
  - Same leaf / spine / power supplies and fans
• Protected power supply fuses
© 2015 Mellanox Technologies 12
36-Port Switch IC versus 48-Port Switch IC
Cluster Size (nodes) | Mellanox EDR 100G 36-port Switch IC, Network Latency | 48-port Switch IC, Network Latency
32    | 90ns  | 113ns
128   | 270ns | 339ns
512   | 270ns | 339ns
2K    | 450ns | 565ns
5K    | 450ns | ?
10K   | 450ns | ?
Mellanox 36-Port Switch Solution Delivers 20% Lower Network Latency
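The table above follows directly from hop counts: at the cluster sizes where both columns are filled in, the number of switch hops is the same, so the 36-port silicon's lower per-hop latency (about 90 ns vs. 113 ns) compounds over each hop. A minimal sketch of that arithmetic, with per-hop values taken from the table and the 1/3/5 hop counts for 1-, 2- and 3-tier fat-trees assumed for illustration:

```python
# Network latency = switch hops x per-hop latency.
def network_latency_ns(hops, per_hop_ns):
    return hops * per_hop_ns

for hops in (1, 3, 5):
    print(f"{hops} hops: 36-port {network_latency_ns(hops, 90)} ns, "
          f"48-port {network_latency_ns(hops, 113)} ns")
# -> 90/113, 270/339, 450/565 ns, matching the table above
```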
© 2015 Mellanox Technologies 13
Optimize Network Performance
Adaptive Routing
© 2015 Mellanox Technologies 14
How does it work?
• Each destination is given at least one address, a Local IDentifier (LID)
• Routing is destination based: OutPort = LFT(DLID)
• Multi-pathing is supported by assigning more than one LID to a destination (see the sketch after the figure below)
Advantages
• Fast – single memory lookup
• Low cost – no TCAM / LPM required
• Simple to compute
Destination Based Forwarding
[Figure: a small fat-tree example with two switch Linear Forwarding Tables (LFTs), each mapping destination LIDs 1-8 to output ports]
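A minimal sketch of the lookup described above; the LID-to-port values are made up for illustration and are not taken from the figure:

```python
# Destination-based forwarding: one Linear Forwarding Table (LFT) per switch
# maps a destination LID to an output port; multi-pathing comes from giving
# a destination more than one LID, each of which may map to a different port.
LFT = {1: 2, 2: 3, 3: 2, 4: 3, 5: 6, 6: 7}   # illustrative DLID -> port entries

def out_port(dlid: int) -> int:
    """OutPort = LFT(DLID): a single, cheap table lookup."""
    return LFT[dlid]

# A destination holding LIDs 3 and 4 is reachable over two distinct paths:
print(out_port(3), out_port(4))   # -> 2 3
```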
© 2015 Mellanox Technologies 15
Non-Blocking
• A network is Non-Blocking if it can be routed to support any possible source-destination pairing (a.k.a. "Permutation") with no network contention
Strictly Non-Blocking:
• When routing of new pairs does not interfere with previously routed pairs
Rearrangeable Non-Blocking:
• When routing of new pairs may require re-routing of previously routed pairs
Destination Based Forwarding and Non-Blocking Clos Networks
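To make the definition concrete, here is a small sketch that checks whether one given routing of a permutation is contention-free; the topology, link names and routes are invented for illustration:

```python
from collections import Counter

def has_contention(routes):
    """routes: {(src, dst): [link, ...]}; True if any link carries >1 flow."""
    load = Counter(link for path in routes.values() for link in path)
    return any(count > 1 for count in load.values())

# Two flows forced onto the same leaf-to-spine uplink contend; a non-blocking
# network admits some routing of the permutation for which this returns False.
routes = {
    ("a", "c"): [("leaf0", "spine0"), ("spine0", "leaf1")],
    ("b", "d"): [("leaf0", "spine0"), ("spine0", "leaf1")],
}
print(has_contention(routes))   # -> True
```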
© 2015 Mellanox Technologies 16
Traffic Aware Load Balancing Systems
Centralized
• Flows are routed according to "global" knowledge
Distributed
• Each flow is routed by its input switch with "local" knowledge
Property     | Central Adaptive Routing | Distributed Adaptive Routing
Scalability  | Low                      | High
Knowledge    | Global                   | Local (to keep scalability)
Non-Blocking | Yes                      | Unknown
[Diagram: central routing control managing all switches vs. distributed adaptive routing with a self-routing unit in each switch]
© 2015 Mellanox Technologies 17
Trial and Error Is Fundamental to Distributed AR
• Trial 1: randomize an output port and send the traffic → contention
• Trial 2: un-route the contending flow, randomize a new output port and send the traffic → contention again
• Trial 3: un-route the contending flow, randomize a new output port and send the traffic → convergence! (a sketch of this loop follows below)
Adaptive Routing is a Reactive Mechanism!
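A toy sketch of the trial-and-error loop above; the flow names, uplinks and the purely random policy are illustrative assumptions, not Mellanox's implementation:

```python
import random
from collections import Counter

def route_flows(flows, uplinks, max_trials=100):
    """Pick a random uplink per flow; on contention, un-route one contending
    flow and re-randomize it, repeating until no uplink carries two flows."""
    choice = {f: random.choice(uplinks) for f in flows}        # Trial 1
    for _ in range(max_trials):
        load = Counter(choice.values())
        contended = [f for f in flows if load[choice[f]] > 1]
        if not contended:
            return choice                                      # convergence!
        victim = random.choice(contended)                      # un-route it
        choice[victim] = random.choice(uplinks)                # next trial
    return choice

print(route_flows(["f1", "f2", "f3"], ["up0", "up1", "up2", "up3"]))
```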
© 2015 Mellanox Technologies 18
Mellanox Adaptive Routing Notification (ARN)
Switch-to-switch communication
Faster convergence after routing changes
Fast notification to decision point
Fully configurable (topology agnostic)
Adapt even when congestion is far away
[Figure: 1) sampling prefers the large flows; 2) the ARN is forwarded when congestion cannot be fixed locally; 3) the ARN causes a new output port selection at the decision point (sketched below)]
Faster Routing Modifications, Resilient Network
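A hedged sketch of the notification flow shown in the figure; the class and method names are invented for illustration, the real ARN is a hardware mechanism:

```python
class Switch:
    """A switch either re-routes a congested flow itself (the decision point)
    or, if it has no alternative port toward the destination, forwards an ARN
    toward the previous hop on the flow's path."""
    def __init__(self, name, ports_to_dest, upstream=None):
        self.name, self.ports_to_dest, self.upstream = name, ports_to_dest, upstream
        self.current = ports_to_dest[0]

    def on_congestion(self, flow):
        alternatives = [p for p in self.ports_to_dest if p != self.current]
        if alternatives:
            self.current = alternatives[0]
            print(f"{self.name}: moved {flow} to port {self.current}")
        elif self.upstream:
            print(f"{self.name}: cannot fix here, forwarding ARN upstream")
            self.upstream.on_congestion(flow)

tor = Switch("tor", ["up0", "up1"])               # has an alternative uplink
spine = Switch("spine", ["down3"], upstream=tor)  # single path onward
spine.on_congestion("large-flow")                 # ARN travels back to the ToR
```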
© 2015 Mellanox Technologies 19
InfiniBand Adaptive Routing
[Chart: b_eff benchmark bandwidth (MB/s) at 16 to 1585 nodes, FDR InfiniBand vs. Cray Aries]
Higher Performance and Better Network Utilization
© 2015 Mellanox Technologies 20
Comprehensive HPC Software
© 2015 Mellanox Technologies 21
Mellanox HPC-X™ Scalable HPC Software Toolkit
MPI, PGAS/OpenSHMEM and UPC package for HPC environments
Fully optimized for Mellanox InfiniBand and 3rd party interconnect solutions
Maximize application performance
Mellanox tested, supported and packaged
For commercial and open source usage
© 2015 Mellanox Technologies 22
Enabling Highest Applications Scalability and Performance
Software stack (top to bottom):
• Applications
• Mellanox HPC-X™: MPI, SHMEM, UPC, MXM, FCA
• Mellanox OFED®: PeerDirect™, Core-Direct™, GPUDirect® RDMA
• Operating System
• Platforms: x86, Power8, ARM, GPU, FPGA
• Interconnect: Mellanox InfiniBand, Mellanox Ethernet (RoCE), 3rd party standard interconnect (InfiniBand, Ethernet)
Comprehensive MPI, PGAS/OpenSHMEM/UPC Software Suite
© 2015 Mellanox Technologies 23
Mellanox Delivers Highest Applications Performance
World Record Performance with Mellanox Connect-IB and HPC-X
Achieve Higher Performance with 67% less compute infrastructure versus Cray
© 2015 Mellanox Technologies 24
HPC-X Performance Advantage
Enabling Highest Applications Scalability and Performance
[Charts: HPC-X shows 58%, 21% and 20% performance advantages]
© 2015 Mellanox Technologies 25
GPUDirect RDMA
Accelerator and GPU Offloads
© 2015 Mellanox Technologies 26
PeerDirect Technology
Based on Peer-to-Peer capability of PCIe
Support for any PCIe peer which can provide access to its memory
• NVIDIA GPU, XEON PHI, AMD, custom FPGA
[Diagram: PeerDirect RDMA path – the adapter moves data over PCIe directly between GPU memory on one node and GPU memory on another, bypassing CPU and system memory]
© 2015 Mellanox Technologies 27
Eliminates CPU bandwidth and latency bottlenecks
Uses remote direct memory access (RDMA) transfers between GPUs
Resulting in significantly improved MPI SendRecv efficiency between GPUs in remote nodes
Based on PeerDirect technology
GPUDirect™ RDMA (GPUDirect 3.0)
[Diagram: data path with GPUDirect™ RDMA, using PeerDirect™]
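For orientation, this is roughly what the data path looks like from an application: with a CUDA-aware MPI (such as MVAPICH2, or HPC-X/Open MPI built with CUDA support) the GPU buffer is handed to MPI directly, and GPUDirect RDMA lets the HCA read and write it without staging through host memory. A hedged sketch using mpi4py (recent versions accept CUDA-array buffers) and CuPy; both libraries are assumptions about the software environment, not part of the slide:

```python
from mpi4py import MPI
import cupy as cp   # the buffer lives in GPU memory

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
n = 1 << 20

if rank == 0:
    buf = cp.arange(n, dtype=cp.float32)
    comm.Send([buf, MPI.FLOAT], dest=1)     # HCA can read GPU memory directly
elif rank == 1:
    buf = cp.empty(n, dtype=cp.float32)
    comm.Recv([buf, MPI.FLOAT], source=0)   # ... and write it on the receiver
    print("received, last element:", int(buf[-1]))
```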
© 2015 Mellanox Technologies 28
[Charts: GPU-GPU internode MPI latency and bandwidth vs. message size (1 byte to 4K), MVAPICH2 with GPUDirect RDMA]
Performance of MVAPICH2 with GPUDirect RDMA: 67% lower latency (5.49 usec) and a 5X increase in throughput
Source: Prof. DK Panda
© 2015 Mellanox Technologies 29
Mellanox PeerDirect™ with NVIDIA GPUDirect RDMA
HOOMD-blue is a general-purpose Molecular Dynamics simulation code accelerated on GPUs
GPUDirect RDMA allows direct peer-to-peer GPU communications over InfiniBand
• Unlocks performance between GPU and InfiniBand
• Provides a significant decrease in GPU-GPU communication latency
• Provides complete CPU offload from all GPU communications across the network
Demonstrated up to 102% performance improvement with a large number of particles
© 2015 Mellanox Technologies 30
GPUDirect Sync (GPUDirect 4.0)
GPUDirect RDMA (3.0) – direct data path between the GPU and Mellanox interconnect
• Control path still uses the CPU
- CPU prepares and queues communication tasks on GPU
- GPU triggers communication on HCA
- Mellanox HCA directly accesses GPU memory
GPUDirect Sync (GPUDirect 4.0)
• Both data path and control path go directly between the GPU and the Mellanox interconnect
[Chart: 2D stencil benchmark, average time per iteration (us) on 2 and 4 nodes/GPUs; RDMA+PeerSync is 27% and 23% faster than RDMA only]
Maximum Performance for GPU Clusters
© 2015 Mellanox Technologies 31
Remote GPU Access through rCUDA
GPU servers provide GPU as a Service
[Diagram: client side – application → rCUDA library → network interface; server side – network interface → rCUDA daemon → CUDA driver + runtime → GPUs]
rCUDA provides remote access from every node to any GPU in the system
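A very rough sketch of the forwarding idea in the diagram: the client-side library intercepts GPU calls and ships them over the network to a daemon that owns the physical GPU. Everything here (wire format, function names, the stand-in "kernel") is invented for illustration and is not the actual rCUDA protocol:

```python
import pickle, socket, threading

def rcuda_daemon(server_sock):
    """Server side: receive one (function, args) request, execute it where the
    physical GPU lives (a plain Python stand-in here), return the result."""
    conn, _ = server_sock.accept()
    func_name, args = pickle.loads(conn.recv(1 << 16))
    kernels = {"vector_add": lambda a, b: [x + y for x, y in zip(a, b)]}
    conn.sendall(pickle.dumps(kernels[func_name](*args)))
    conn.close()

def remote_call(addr, func_name, *args):
    """Client side: forward the call over the network instead of running it locally."""
    with socket.create_connection(addr) as conn:
        conn.sendall(pickle.dumps((func_name, args)))
        return pickle.loads(conn.recv(1 << 16))

server = socket.socket()
server.bind(("localhost", 0)); server.listen(1)
threading.Thread(target=rcuda_daemon, args=(server,), daemon=True).start()
print(remote_call(server.getsockname(), "vector_add", [1, 2, 3], [10, 20, 30]))
```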
© 2015 Mellanox Technologies 32
Multi-Host Technology
© 2015 Mellanox Technologies 33
Data Center Evolution Over Time
Multi-Core Multi-Host Multi-Server
© 2015 Mellanox Technologies 34
New Compute Rack / Data Center Architecture
Traditional Data Center
Expensive design for fixed data centers
Requires many ports on ToR switch
Dedicated NIC / cable per server
Scalable Data Center with Multi-Host
The future of data center design
CPU becomes plug-in peripheral
Flexible, configurable, application optimized
Optimized ToR switches
Takes advantage of high-throughput network
[Diagram: traditional server with its own NIC per CPU vs. Multi-Host server where CPU slots share one NIC]
The Network is The Computer
© 2015 Mellanox Technologies 35
Lower OPEX and CAPEX
New Scalable Rack Design
Smart Interconnect to Unleash the Power of All Compute Architectures
[Diagram: multiple CPUs connected through a single Multi-Host adapter]
© 2015 Mellanox Technologies 36
ConnectX-4 on Facebook OCP Multi-Host Platform (Yosemite)
[Diagram: Yosemite compute slots sharing one ConnectX-4 Multi-Host adapter]
The Next Generation Compute and Storage Rack Design
© 2015 Mellanox Technologies 37
Higher Performance Data Center
Smart Interconnect for a High Variety of Compute Architectures
Overcoming CPU-to-CPU Connectivity Bottlenecks
Lower Application Latency
© 2015 Mellanox Technologies 38
Efficient GPU-based Data Centers
[Diagram: dual-socket nodes with GPUs on each CPU; GPUDirect® RDMA is enabled across all available GPUs]
Smart Interconnect to Unleash the Power of All Compute Architectures
Enabling GPUDirect RDMA Across all Available GPUs
© 2015 Mellanox Technologies 39
HPC Clouds
All HPC Clouds to Use Mellanox Solutions
© 2015 Mellanox Technologies 40
Virtualization for HPC: Mellanox SR-IOV
© 2015 Mellanox Technologies 41
HPC Clouds – Performance Demands Mellanox Solutions
San Diego Supercomputing Center “Comet” System (2015) to
Leverage Mellanox Solutions and Technology to Build HPC Cloud
© 2015 Mellanox Technologies 42
Summary
© 2015 Mellanox Technologies 43
PCIe 4.0 Accelerating CPU / Memory - Interconnect Performance
PCIe 1.0 (2003-2004): 2.5Gb/s per lane, 8/10 bit encoding
PCIe 2.0 (2007-2008): 5Gb/s per lane, 8/10 bit encoding
PCIe 3.0 (2011-2012): 8Gb/s per lane, 128/130 bit encoding
PCIe 4.0 (2015-2016): 16Gb/s per lane, 128/130 bit encoding
PCIe 4.0 New Capabilities:
• Higher bandwidth (16 to 25Gb/s per lane)
• Cache Coherency
• Atomic Operations
• Advanced Power Management
• Memory Management…
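The jump from 3.0 to 4.0 is what keeps the PCIe slot from becoming the bottleneck behind 100Gb/s ports. A rough effective-bandwidth calculation, counting only the encoding overhead and ignoring other protocol overheads:

```python
def pcie_gbps(lane_gbps, enc_num, enc_den, lanes=16):
    """Per-direction link bandwidth = lane rate x encoding efficiency x lanes."""
    return lane_gbps * enc_num / enc_den * lanes

print(f"PCIe 3.0 x16: ~{pcie_gbps(8, 128, 130):.0f} Gb/s")    # ~126 Gb/s
print(f"PCIe 4.0 x16: ~{pcie_gbps(16, 128, 130):.0f} Gb/s")   # ~252 Gb/s, ample headroom for EDR 100Gb/s
```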
© 2015 Mellanox Technologies 44
Mellanox Interconnect Advantages
Mellanox solutions provide proven, scalable and high-performance end-to-end connectivity
Flexible, supporting all compute architectures: x86, Power, ARM, GPU, FPGA, etc.
Standards-based (InfiniBand, Ethernet), supported by a large eco-system
Higher performance: 100Gb/s, sub-0.7usec latency, 150 million messages/sec
HPC-X software provides leading performance for MPI, OpenSHMEM/PGAS and UPC
Superior application offloads: RDMA, collectives, scalable transport
Backward and future compatible
Speed-Up Your Present, Protect Your Future
Paving The Road to Exascale Computing Together
Thank You