TRANSCRIPT
Dror Goldenberg, HPCAC Switzerland Conference
March 2015
High Performance Computing
© 2015 Mellanox Technologies 2
End-to-End Interconnect Solutions for All Platforms
Highest Performance and Scalability for
X86, Power, GPU, ARM and FPGA-based Compute and Storage Platforms
Smart Interconnect to Unleash The Power of All Compute Architectures
© 2015 Mellanox Technologies 3
Technology Roadmap – One-Generation Lead over the Competition
[Roadmap graphic: 20Gb/s (~2005), 40Gb/s, 56Gb/s (~2010), 100Gb/s (2015), 200Gb/s (~2020); milestones include the 2003 Virginia Tech (Apple) TOP500 system and the "Roadrunner" Mellanox Connected system; Terascale → Petascale → Exascale]
© 2015 Mellanox Technologies 4
The Future is Here
Entering the Era of 100Gb/s
Adapter: 100Gb/s, sub-0.7us latency, 150 million messages per second (10 / 25 / 40 / 50 / 56 / 100Gb/s)
Switch: 36 EDR (100Gb/s) ports, <90ns latency, throughput of 7.2Tb/s
Cables: copper (passive, active), optical cables (VCSEL), silicon photonics
© 2015 Mellanox Technologies 5
Mellanox QSFP 100Gb/s Cables
Complete Solution of 100Gb/s Copper and Fiber Cables
Copper Cables VCSEL AOCs Silicon Photonics AOCs
Making 100Gb/s Deployments as Easy as 10Gb/s
© 2015 Mellanox Technologies 6
The Advantage of 100G Passive Copper Cables (Over Fiber)
1000-node cluster: CAPEX saving $420K, power saving 7.6KW
5000-node cluster: CAPEX saving $2.1M, power saving 37.3KW
Lower Cost, Lower Power Consumption, Higher Reliability
Supercomputing’14 Demo: 4m, 6m and 8m Copper Cables
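For scale, the cluster-level savings above work out to roughly the same per-node figure in both cases. A quick back-of-the-envelope check, derived only from the numbers on this slide:

```python
# Per-node savings implied by the slide's cluster-level figures.
for nodes, capex_usd, power_kw in [(1000, 420_000, 7.6), (5000, 2_100_000, 37.3)]:
    print(f"{nodes} nodes: ${capex_usd / nodes:.0f} and {1000 * power_kw / nodes:.1f} W saved per node")
# -> about $420 and 7.5-7.6 W saved per node in both cluster sizes
```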
© 2015 Mellanox Technologies 7
Enter the World of Scalable Performance – 100Gb/s Adapter
Connect. Accelerate. Outperform
ConnectX-4: Highest Performance Adapter in the Market
InfiniBand: SDR / DDR / QDR / FDR / EDR
Ethernet: 10 / 25 / 40 / 50 / 56 / 100GbE
100Gb/s, <0.7us latency
150 million messages per second
OpenPOWER CAPI technology
CORE-Direct technology
GPUDirect RDMA
Dynamically Connected Transport (DCT)
Ethernet /IPoIB offloads (HDS, RSS, TSS, LRO, LSOv2)
© 2015 Mellanox Technologies 8
Shattering The World of Interconnect Performance!
ConnectX-4 EDR 100G InfiniBand
InfiniBand Throughput: 100 Gb/s
InfiniBand Bi-Directional Throughput: 195 Gb/s
InfiniBand Latency: 0.61 us
InfiniBand Message Rate: 149.5 million messages/sec
HPC-X MPI Bi-Directional Throughput: 193.1 Gb/s
*First results, optimizations in progress
© 2015 Mellanox Technologies 9
ConnectX-4 InfiniBand Message Rate
© 2015 Mellanox Technologies 10
Enter the World of Scalable Performance – 100Gb/s Switch
Store Analyze
7th Generation InfiniBand Switch
36 EDR (100Gb/s) Ports, <90ns Latency
Throughput of 7.2 Tb/s
InfiniBand Router
Adaptive Routing
Switch-IB: Highest Performance Switch in the Market
© 2015 Mellanox Technologies 11
FDR to EDR Switch Hardware Improvements
• x86 CPU
• Moving to efficient power supplies (80 PLUS Gold+ and Energy Star certifications)
• All ports support 3.5W (class 4)
• 4 separate fan drawers for improved fan MTBF
• Two power supplies out of the box
• AC or DC power supply ordering options
• BBU (Battery Backup Unit) ordering option
• Improved rail kits (captive screws, stopper, routing)
• Dual leaf design
• N+N power supplies @ 220VAC
• Modular design
  - Same leaf / spine / power supplies and fans
• Protected power supply fuses
© 2015 Mellanox Technologies 12
36-Port Switch IC versus 48-Port Switch IC
Cluster Size (nodes) | Mellanox EDR 100G 36-port Switch IC, Network Latency | 48-port Switch IC, Network Latency
32    | 90ns  | 113ns
128   | 270ns | 339ns
512   | 270ns | 339ns
2K    | 450ns | 565ns
5K    | 450ns | ?
10K   | 450ns | ?
Mellanox 36-Port Switch Solution Delivers 20% Lower Network Latency
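The table above follows directly from hop counts: at the cluster sizes where both columns are filled in, the number of switch hops is the same, so the 36-port silicon's lower per-hop latency (about 90 ns vs. 113 ns) compounds over each hop. A minimal sketch of that arithmetic, with per-hop values taken from the table and the 1/3/5 hop counts for 1-, 2- and 3-tier fat-trees assumed for illustration:

```python
# Network latency = switch hops x per-hop latency.
def network_latency_ns(hops, per_hop_ns):
    return hops * per_hop_ns

for hops in (1, 3, 5):
    print(f"{hops} hops: 36-port {network_latency_ns(hops, 90)} ns, "
          f"48-port {network_latency_ns(hops, 113)} ns")
# -> 90/113, 270/339, 450/565 ns, matching the table above
```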
© 2015 Mellanox Technologies 13
Optimize Network Performance
Adaptive Routing
© 2015 Mellanox Technologies 14
How does it work?
• Each destination is given at least one address, a Local IDentifier (LID)
• Routing is destination based: OutPort = LFT(DLID)
• Multi-pathing is supported by assigning more than one LID to a destination (see the sketch after the figure below)
Advantages
• Fast – single memory lookup
• Low cost – no TCAM / LPM required
• Simple to compute
Destination Based Forwarding
[Figure: a small fat-tree example with two switch Linear Forwarding Tables (LFTs), each mapping destination LIDs 1-8 to output ports]
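A minimal sketch of the lookup described above; the LID-to-port values are made up for illustration and are not taken from the figure:

```python
# Destination-based forwarding: one Linear Forwarding Table (LFT) per switch
# maps a destination LID to an output port; multi-pathing comes from giving
# a destination more than one LID, each of which may map to a different port.
LFT = {1: 2, 2: 3, 3: 2, 4: 3, 5: 6, 6: 7}   # illustrative DLID -> port entries

def out_port(dlid: int) -> int:
    """OutPort = LFT(DLID): a single, cheap table lookup."""
    return LFT[dlid]

# A destination holding LIDs 3 and 4 is reachable over two distinct paths:
print(out_port(3), out_port(4))   # -> 2 3
```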
© 2015 Mellanox Technologies 15
Non-Blocking
• A network is Non-Blocking if it can be routed to support any possible source-destination pairing (a.k.a. "Permutation") with no network contention
Strictly Non-Blocking:
• When routing of new pairs does not interfere with previously routed pairs
Rearrangeable Non-Blocking:
• When routing of new pairs may require re-routing of previously routed pairs
Destination Based Forwarding and Non-Blocking Clos Networks
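To make the definition concrete, here is a small sketch that checks whether one given routing of a permutation is contention-free; the topology, link names and routes are invented for illustration:

```python
from collections import Counter

def has_contention(routes):
    """routes: {(src, dst): [link, ...]}; True if any link carries >1 flow."""
    load = Counter(link for path in routes.values() for link in path)
    return any(count > 1 for count in load.values())

# Two flows forced onto the same leaf-to-spine uplink contend; a non-blocking
# network admits some routing of the permutation for which this returns False.
routes = {
    ("a", "c"): [("leaf0", "spine0"), ("spine0", "leaf1")],
    ("b", "d"): [("leaf0", "spine0"), ("spine0", "leaf1")],
}
print(has_contention(routes))   # -> True
```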
© 2015 Mellanox Technologies 16
Traffic Aware Load Balancing Systems
Centralized
• Flows are routed according to "global" knowledge
Distributed
• Each flow is routed by its input switch with "local" knowledge
Property     | Central Adaptive Routing | Distributed Adaptive Routing
Scalability  | Low                      | High
Knowledge    | Global                   | Local (to keep scalability)
Non-Blocking | Yes                      | Unknown
[Diagram: central routing control managing all switches vs. distributed adaptive routing with a self-routing unit in each switch]
© 2015 Mellanox Technologies 17
Trial and Error Is Fundamental to Distributed AR
• Trial 1: randomize an output port and send the traffic → contention
• Trial 2: un-route the contending flow, randomize a new output port and send the traffic → contention again
• Trial 3: un-route the contending flow, randomize a new output port and send the traffic → convergence! (a sketch of this loop follows below)
Adaptive Routing is a Reactive Mechanism!
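A toy sketch of the trial-and-error loop above; the flow names, uplinks and the purely random policy are illustrative assumptions, not Mellanox's implementation:

```python
import random
from collections import Counter

def route_flows(flows, uplinks, max_trials=100):
    """Pick a random uplink per flow; on contention, un-route one contending
    flow and re-randomize it, repeating until no uplink carries two flows."""
    choice = {f: random.choice(uplinks) for f in flows}        # Trial 1
    for _ in range(max_trials):
        load = Counter(choice.values())
        contended = [f for f in flows if load[choice[f]] > 1]
        if not contended:
            return choice                                      # convergence!
        victim = random.choice(contended)                      # un-route it
        choice[victim] = random.choice(uplinks)                # next trial
    return choice

print(route_flows(["f1", "f2", "f3"], ["up0", "up1", "up2", "up3"]))
```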
© 2015 Mellanox Technologies 18
Mellanox Adaptive Routing Notification (ARN)
Switch-to-switch communication
Faster convergence after routing changes
Fast notification to decision point
Fully configurable (topology agnostic)
Adapt even when congestion is far away
[Figure: 1) sampling prefers the large flows; 2) the ARN is forwarded when congestion cannot be fixed locally; 3) the ARN causes a new output port selection at the decision point (sketched below)]
Faster Routing Modifications, Resilient Network
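A hedged sketch of the notification flow shown in the figure; the class and method names are invented for illustration, the real ARN is a hardware mechanism:

```python
class Switch:
    """A switch either re-routes a congested flow itself (the decision point)
    or, if it has no alternative port toward the destination, forwards an ARN
    toward the previous hop on the flow's path."""
    def __init__(self, name, ports_to_dest, upstream=None):
        self.name, self.ports_to_dest, self.upstream = name, ports_to_dest, upstream
        self.current = ports_to_dest[0]

    def on_congestion(self, flow):
        alternatives = [p for p in self.ports_to_dest if p != self.current]
        if alternatives:
            self.current = alternatives[0]
            print(f"{self.name}: moved {flow} to port {self.current}")
        elif self.upstream:
            print(f"{self.name}: cannot fix here, forwarding ARN upstream")
            self.upstream.on_congestion(flow)

tor = Switch("tor", ["up0", "up1"])               # has an alternative uplink
spine = Switch("spine", ["down3"], upstream=tor)  # single path onward
spine.on_congestion("large-flow")                 # ARN travels back to the ToR
```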
© 2015 Mellanox Technologies 19
InfiniBand Adaptive Routing
[Chart: b_eff benchmark bandwidth (MB/s) at 16 to 1585 nodes, FDR InfiniBand vs. Cray Aries]
Higher Performance and Better Network Utilization
© 2015 Mellanox Technologies 20
Comprehensive HPC Software
© 2015 Mellanox Technologies 21
Mellanox HPC-X™ Scalable HPC Software Toolkit
MPI, PGAS/OpenSHMEM and UPC package for HPC environments
Fully optimized for Mellanox InfiniBand and 3rd party interconnect solutions
Maximize application performance
Mellanox tested, supported and packaged
For commercial and open source usage
© 2015 Mellanox Technologies 22
Enabling Highest Applications Scalability and Performance
Software stack (top to bottom):
• Applications
• Mellanox HPC-X™: MPI, SHMEM, UPC, MXM, FCA
• Mellanox OFED®: PeerDirect™, Core-Direct™, GPUDirect® RDMA
• Operating System
• Platforms: x86, Power8, ARM, GPU, FPGA
• Interconnect: Mellanox InfiniBand, Mellanox Ethernet (RoCE), 3rd party standard interconnect (InfiniBand, Ethernet)
Comprehensive MPI, PGAS/OpenSHMEM/UPC Software Suite
© 2015 Mellanox Technologies 23
Mellanox Delivers Highest Applications Performance
World Record Performance with Mellanox Connect-IB and HPC-X
Achieve Higher Performance with 67% less compute infrastructure versus Cray
© 2015 Mellanox Technologies 24
HPC-X Performance Advantage
Enabling Highest Applications Scalability and Performance
[Charts: HPC-X shows 58%, 21% and 20% performance advantages]
© 2015 Mellanox Technologies 25
GPUDirect RDMA
Accelerator and GPU Offloads
© 2015 Mellanox Technologies 26
PeerDirect Technology
Based on Peer-to-Peer capability of PCIe
Support for any PCIe peer which can provide access to its memory
• NVIDIA GPU, XEON PHI, AMD, custom FPGA
[Diagram: PeerDirect RDMA path – the adapter moves data over PCIe directly between GPU memory on one node and GPU memory on another, bypassing CPU and system memory]
© 2015 Mellanox Technologies 27
Eliminates CPU bandwidth and latency bottlenecks
Uses remote direct memory access (RDMA) transfers between GPUs
Resulting in significantly improved MPI SendRecv efficiency between GPUs in remote nodes
Based on PeerDirect technology
GPUDirect™ RDMA (GPUDirect 3.0)
[Diagram: data path with GPUDirect™ RDMA, using PeerDirect™]
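For orientation, this is roughly what the data path looks like from an application: with a CUDA-aware MPI (such as MVAPICH2, or HPC-X/Open MPI built with CUDA support) the GPU buffer is handed to MPI directly, and GPUDirect RDMA lets the HCA read and write it without staging through host memory. A hedged sketch using mpi4py (recent versions accept CUDA-array buffers) and CuPy; both libraries are assumptions about the software environment, not part of the slide:

```python
from mpi4py import MPI
import cupy as cp   # the buffer lives in GPU memory

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
n = 1 << 20

if rank == 0:
    buf = cp.arange(n, dtype=cp.float32)
    comm.Send([buf, MPI.FLOAT], dest=1)     # HCA can read GPU memory directly
elif rank == 1:
    buf = cp.empty(n, dtype=cp.float32)
    comm.Recv([buf, MPI.FLOAT], source=0)   # ... and write it on the receiver
    print("received, last element:", int(buf[-1]))
```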
© 2015 Mellanox Technologies 28
[Charts: GPU-GPU internode MPI latency and bandwidth vs. message size (1 byte to 4K), MVAPICH2 with GPUDirect RDMA]
Performance of MVAPICH2 with GPUDirect RDMA: 67% lower latency (5.49 usec) and a 5X increase in throughput
Source: Prof. DK Panda
© 2015 Mellanox Technologies 29
Mellanox PeerDirect™ with NVIDIA GPUDirect RDMA
HOOMD-blue is a general-purpose Molecular Dynamics simulation code accelerated on GPUs
GPUDirect RDMA allows direct peer-to-peer GPU communications over InfiniBand
• Unlocks performance between GPU and InfiniBand
• Provides a significant decrease in GPU-GPU communication latency
• Provides complete CPU offload from all GPU communications across the network
Demonstrated up to 102% performance improvement with a large number of particles
© 2015 Mellanox Technologies 30
GPUDirect Sync (GPUDirect 4.0)
GPUDirect RDMA (3.0) – direct data path between the GPU and Mellanox interconnect
• Control path still uses the CPU
- CPU prepares and queues communication tasks on GPU
- GPU triggers communication on HCA
- Mellanox HCA directly accesses GPU memory
GPUDirect Sync (GPUDirect 4.0)
• Both data path and control path go directly between the GPU and the Mellanox interconnect
[Chart: 2D stencil benchmark, average time per iteration (us) on 2 and 4 nodes/GPUs; RDMA+PeerSync is 27% and 23% faster than RDMA only]
Maximum Performance for GPU Clusters
© 2015 Mellanox Technologies 31
Remote GPU Access through rCUDA
GPU servers provide GPU as a Service
[Diagram: client side – application → rCUDA library → network interface; server side – network interface → rCUDA daemon → CUDA driver + runtime → GPUs]
rCUDA provides remote access from every node to any GPU in the system
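A very rough sketch of the forwarding idea in the diagram: the client-side library intercepts GPU calls and ships them over the network to a daemon that owns the physical GPU. Everything here (wire format, function names, the stand-in "kernel") is invented for illustration and is not the actual rCUDA protocol:

```python
import pickle, socket, threading

def rcuda_daemon(server_sock):
    """Server side: receive one (function, args) request, execute it where the
    physical GPU lives (a plain Python stand-in here), return the result."""
    conn, _ = server_sock.accept()
    func_name, args = pickle.loads(conn.recv(1 << 16))
    kernels = {"vector_add": lambda a, b: [x + y for x, y in zip(a, b)]}
    conn.sendall(pickle.dumps(kernels[func_name](*args)))
    conn.close()

def remote_call(addr, func_name, *args):
    """Client side: forward the call over the network instead of running it locally."""
    with socket.create_connection(addr) as conn:
        conn.sendall(pickle.dumps((func_name, args)))
        return pickle.loads(conn.recv(1 << 16))

server = socket.socket()
server.bind(("localhost", 0)); server.listen(1)
threading.Thread(target=rcuda_daemon, args=(server,), daemon=True).start()
print(remote_call(server.getsockname(), "vector_add", [1, 2, 3], [10, 20, 30]))
```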
© 2015 Mellanox Technologies 32
Multi-Host Technology
© 2015 Mellanox Technologies 33
Data Center Evolution Over Time
Multi-Core Multi-Host Multi-Server
© 2015 Mellanox Technologies 34
New Compute Rack / Data Center Architecture
Traditional Data Center
Expensive design for fixed data centers
Requires many ports on ToR switch
Dedicated NIC / cable per server
Scalable Data Center with Multi-Host
The future of data center design
CPU becomes plug-in peripheral
Flexible, configurable, application optimized
Optimized ToR switches
Takes advantage of high-throughput network
[Diagram: traditional server with its own NIC per CPU vs. Multi-Host server where CPU slots share one NIC]
The Network is The Computer
© 2015 Mellanox Technologies 35
Lower OPEX and CAPEX
New Scalable Rack Design
Smart Interconnect to Unleash the Power of All Compute Architectures
[Diagram: multiple CPUs connected through a single Multi-Host adapter]
© 2015 Mellanox Technologies 36
ConnectX-4 on Facebook OCP Multi-Host Platform (Yosemite)
[Diagram: Yosemite compute slots sharing one ConnectX-4 Multi-Host adapter]
The Next Generation Compute and Storage Rack Design
© 2015 Mellanox Technologies 37
Higher Performance Data Center
Smart Interconnect for a High Variety of Compute Architectures
Overcoming CPU-to-CPU Connectivity Bottlenecks
Lower Application Latency
© 2015 Mellanox Technologies 38
Efficient GPU-based Data Centers
[Diagram: dual-socket nodes with GPUs on each CPU; GPUDirect® RDMA is enabled across all available GPUs]
Smart Interconnect to Unleash the Power of All Compute Architectures
Enabling GPUDirect RDMA Across all Available GPUs
© 2015 Mellanox Technologies 39
HPC Clouds
All HPC Clouds to Use Mellanox Solutions
© 2015 Mellanox Technologies 40
Virtualization for HPC: Mellanox SR-IOV
© 2015 Mellanox Technologies 41
HPC Clouds – Performance Demands Mellanox Solutions
San Diego Supercomputing Center “Comet” System (2015) to
Leverage Mellanox Solutions and Technology to Build HPC Cloud
© 2015 Mellanox Technologies 42
Summary
© 2015 Mellanox Technologies 43
PCIe 4.0 Accelerating CPU / Memory - Interconnect Performance
PCIe 1.0 (2003-2004): 2.5Gb/s per lane, 8/10 bit encoding
PCIe 2.0 (2007-2008): 5Gb/s per lane, 8/10 bit encoding
PCIe 3.0 (2011-2012): 8Gb/s per lane, 128/130 bit encoding
PCIe 4.0 (2015-2016): 16Gb/s per lane, 128/130 bit encoding
PCIe 4.0 New Capabilities:
• Higher bandwidth (16 to 25Gb/s per lane)
• Cache Coherency
• Atomic Operations
• Advanced Power Management
• Memory Management…
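The jump from 3.0 to 4.0 is what keeps the PCIe slot from becoming the bottleneck behind 100Gb/s ports. A rough effective-bandwidth calculation, counting only the encoding overhead and ignoring other protocol overheads:

```python
def pcie_gbps(lane_gbps, enc_num, enc_den, lanes=16):
    """Per-direction link bandwidth = lane rate x encoding efficiency x lanes."""
    return lane_gbps * enc_num / enc_den * lanes

print(f"PCIe 3.0 x16: ~{pcie_gbps(8, 128, 130):.0f} Gb/s")    # ~126 Gb/s
print(f"PCIe 4.0 x16: ~{pcie_gbps(16, 128, 130):.0f} Gb/s")   # ~252 Gb/s, ample headroom for EDR 100Gb/s
```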
© 2015 Mellanox Technologies 44
Mellanox Interconnect Advantages
Mellanox solutions provide proven, scalable and high-performance end-to-end connectivity
Flexible, supporting all compute architectures: x86, Power, ARM, GPU, FPGA, etc.
Standards-based (InfiniBand, Ethernet), supported by a large eco-system
Higher performance: 100Gb/s, sub-0.7usec latency, 150 million messages/sec
HPC-X software provides leading performance for MPI, OpenSHMEM/PGAS and UPC
Superior application offloads: RDMA, collectives, scalable transport
Backward and future compatible
Speed-Up Your Present, Protect Your Future
Paving The Road to Exascale Computing Together
Thank You