
Page 1:

Dror Goldenberg, HPCAC Switzerland Conference

March 2015

High Performance Computing

Page 2:

End-to-End Interconnect Solutions for All Platforms

Highest Performance and Scalability for

X86, Power, GPU, ARM and FPGA-based Compute and Storage Platforms


Smart Interconnect to Unleash The Power of All Compute Architectures

Page 3:

Technology Roadmap – One-Generation Lead over the Competition

[Roadmap graphic, 2000–2020: 20Gb/s → 40Gb/s → 56Gb/s → 100Gb/s (2015) → 200Gb/s, spanning Terascale → Petascale → Exascale; Mellanox-connected milestones include "Roadrunner" (1st on the TOP500) and the Virginia Tech Apple cluster (3rd, TOP500 2003)]

Page 4:

The Future is Here

Entering the Era of 100Gb/s

• Adapter: 100Gb/s (10 / 25 / 40 / 50 / 56 / 100Gb/s), sub 0.7us latency, 150 million messages per second
• Switch: 36 EDR (100Gb/s) ports, <90ns latency, throughput of 7.2Tb/s
• Cables: copper (passive, active), optical cables (VCSEL), silicon photonics

Page 5:

Mellanox QSFP 100Gb/s Cables

Complete Solution of 100Gb/s Copper and Fiber Cables

Copper Cables, VCSEL AOCs, Silicon Photonics AOCs

Making 100Gb/s Deployments as Easy as 10Gb/s

Page 6:

The Advantage of 100G Passive Copper Cables (Over Fiber)

Cluster Size     CAPEX Saving    Power Saving
1,000 nodes      $420K           7.6 kW
5,000 nodes      $2.1M           37.3 kW

Lower Cost, Lower Power Consumption, Higher Reliability

Supercomputing’14 Demo: 4m, 6m and 8m Copper Cables
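As a sanity check, the two cluster sizes quoted above imply essentially the same saving per node; a minimal sketch deriving it (the per-node figures are computed here, not stated on the slide):

```python
# Back-of-the-envelope check of the quoted copper-vs-fiber savings per node.
clusters = {1000: (420_000, 7_600), 5000: (2_100_000, 37_300)}   # nodes: ($ CAPEX, W power)

for nodes, (capex_usd, power_w) in clusters.items():
    print(f"{nodes} nodes: ~${capex_usd / nodes:.0f} CAPEX and "
          f"~{power_w / nodes:.1f} W saved per node/link")
# Both cluster sizes work out to about $420 and 7-8 W per link,
# i.e. the quoted savings scale roughly linearly with cluster size.
```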

Page 7:

Enter the World of Scalable Performance – 100Gb/s Adapter

Connect. Accelerate. Outperform

ConnectX-4: Highest Performance Adapter in the Market

• InfiniBand: SDR / DDR / QDR / FDR / EDR
• Ethernet: 10 / 25 / 40 / 50 / 56 / 100GbE
• 100Gb/s, <0.7us latency
• 150 million messages per second
• OpenPOWER CAPI technology
• CORE-Direct technology
• GPUDirect RDMA
• Dynamically Connected Transport (DCT)
• Ethernet / IPoIB offloads (HDS, RSS, TSS, LRO, LSOv2)

Page 8:

Shattering The World of Interconnect Performance!

ConnectX-4 EDR 100G InfiniBand*

InfiniBand Throughput                   100 Gb/s
InfiniBand Bi-Directional Throughput    195 Gb/s
InfiniBand Latency                      0.61 us
InfiniBand Message Rate                 149.5 million/sec
HPC-X MPI Bi-Directional Throughput     193.1 Gb/s

*First results, optimizations in progress

Page 9:

ConnectX-4 InfiniBand Message Rate

Page 10:

Enter the World of Scalable Performance – 100Gb/s Switch


7th Generation InfiniBand Switch

36 EDR (100Gb/s) Ports, <90ns Latency

Throughput of 7.2 Tb/s

InfiniBand Router

Adaptive Routing

Switch-IB: Highest Performance Switch in the Market

Page 11:

FDR to EDR Switch Hardware Improvements

• x86 CPU
• 80 PLUS Gold and Energy Star certifications – move to more efficient power supplies
• All ports support 3.5W (Class 4)
• Four separate fan drawers – improved fan MTBF
• Two power supplies out of the box
• AC or DC power supply ordering options
• BBU (Battery Backup Unit) ordering option
• Improved rail kits (captive screws, stopper, routing)
• Dual-leaf design
• N+N power supplies @ 220VAC
• Modular design
  - Same leaf/spine/PS and fans
• Protected power-supply fuses

Page 12:

36-Port Switch IC versus 48-Port Switch IC

Cluster Size (nodes)   36-port Switch IC Network Latency   48-port Switch IC Network Latency
32                     90ns                                113ns
128                    270ns                               339ns
512                    270ns                               339ns
2K                     450ns                               565ns
5K                     450ns                               ?
10K                    450ns                               ?

Mellanox 36-Port Switch Solution Delivers 20% Lower Network Latency
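The latencies in the table are consistent with a simple model: per-hop switch latency multiplied by the number of switch hops at each cluster size. A minimal sketch of that reading (the hop counts of 1, 3 and 5 are an inference from the numbers, not stated on the slide):

```python
# Network latency read as (number of switch hops) x (per-hop switch latency).
PER_HOP_NS = {"36-port switch IC": 90, "48-port switch IC": 113}

# Assumed switch-hop counts per cluster size (inferred from the table:
# 90 -> 1 hop, 270 -> 3 hops, 450 -> 5 hops).
HOPS = {32: 1, 128: 3, 512: 3, 2048: 5}

for nodes, hops in HOPS.items():
    row = ", ".join(f"{name}: {hops * ns} ns" for name, ns in PER_HOP_NS.items())
    print(f"{nodes:>5} nodes ({hops} switch hops) -> {row}")
# 36-port: 90 / 270 / 270 / 450 ns; 48-port: 113 / 339 / 339 / 565 ns (~20% higher).
```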

Page 13:

Optimize Network Performance

Adaptive Routing

Page 14:

How Does It Work?

• Each destination is given at least one address, a Local IDentifier (LID)
• Routing is destination based: OutPort = LFT(DLID)
• Multi-pathing is supported by assigning more than one LID to a destination

Advantages

• Fast – a single memory lookup
• Low cost – no TCAM / LPM required
• Simple to compute

Destination-Based Forwarding

[Figure: example linear forwarding tables (LFTs) at two switches, mapping destination LIDs to output ports; 0xFF marks an unreachable LID]
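A minimal sketch of destination-based forwarding as described above, with the LFT as a plain lookup table and multi-pathing via multiple LIDs per destination (the LID-to-port values are illustrative, not taken from the diagram):

```python
# Destination-based forwarding: OutPort = LFT[DLID], a single table lookup per packet.
UNREACHABLE = 0xFF

# Illustrative linear forwarding table for one switch: destination LID -> output port.
lft = {1: 2, 2: 3, 3: 2, 4: 3, 5: 6, 6: 7}

def forward(dlid: int) -> int:
    """Return the output port for a destination LID (one memory lookup, no prefix match)."""
    return lft.get(dlid, UNREACHABLE)

# Multi-pathing: the same destination node is given two LIDs (here 5 and 6), and the
# LFT maps them to different output ports, yielding two distinct paths through the fabric.
assert forward(5) != forward(6)
print(forward(3), forward(5), forward(99))   # -> 2 6 255 (0xFF = unreachable)
```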

Page 15:

Non-Blocking

• A network is non-blocking if it can be routed to support any possible source-destination pairing (a.k.a. a "permutation") with no network contention

Strictly Non-Blocking:
• Routing of new pairs does not interfere with previously routed pairs

Rearrangeably Non-Blocking:
• Routing of new pairs may require re-routing of previously routed pairs

Destination-Based Forwarding and Non-Blocking Clos Networks

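For a three-stage Clos network with n inputs per ingress switch and m middle-stage switches, the textbook conditions behind these two definitions are m ≥ 2n − 1 for strictly non-blocking and m ≥ n for rearrangeably non-blocking. A small sketch applying them (standard Clos results, not taken from the slide):

```python
# Classical conditions for a 3-stage Clos network with n inputs per ingress switch
# and m middle-stage switches:
#   strictly non-blocking      : m >= 2n - 1  (new pairs never disturb existing routes)
#   rearrangeably non-blocking : m >= n       (new pairs may force re-routing)

def clos_class(m: int, n: int) -> str:
    if m >= 2 * n - 1:
        return "strictly non-blocking"
    if m >= n:
        return "rearrangeably non-blocking"
    return "blocking"

for m, n in [(18, 18), (35, 18), (12, 18)]:
    print(f"m={m}, n={n}: {clos_class(m, n)}")
# m=18 -> rearrangeable, m=35 -> strict, m=12 -> blocking (for n=18)
```

A fat tree built from 36-port switches with 18 ports down and 18 up corresponds roughly to the m = n case, i.e. rearrangeably rather than strictly non-blocking, which is one reason routing quality matters.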

Page 16:

Traffic-Aware Load Balancing Systems

Centralized
• Flows are routed according to "global" knowledge

Distributed
• Each flow is routed by its input switch with "local" knowledge

Property        Central Adaptive Routing    Distributed Adaptive Routing
Scalability     Low                         High
Knowledge       Global                      Local (to keep scalability)
Non-Blocking    Yes                         Unknown

[Diagram: central routing with a control unit vs. distributed adaptive routing (AR) with self-routing units]

Page 17:

Trial and Error Is Fundamental to Distributed AR

1. Randomize output port – Trial 1
2. Send the traffic
3. Contention 1
4. Un-route the contending flow; randomize a new output port – Trial 2
5. Send the traffic
6. Contention 2
7. Un-route the contending flow; randomize a new output port – Trial 3
8. Send the traffic
9. Convergence!


Adaptive Routing is a Reactive Mechanism!
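A toy simulation of this loop (a sketch only; the flows, ports and contention rule are invented for illustration): every flow picks a random output port, traffic is "sent", and any flow that collides on a port is un-routed and re-randomized until the placement converges.

```python
import random

# Toy model of the trial-and-error loop (illustrative only): three flows compete
# for three uplink ports on one switch; a port can carry one flow without contention.
FLOWS, PORTS = ["A", "B", "C"], [1, 2, 3]

def route_by_trial_and_error(seed: int = 0) -> dict:
    rng = random.Random(seed)
    chosen = {flow: rng.choice(PORTS) for flow in FLOWS}   # Trial 1: random output ports
    trial = 1
    while True:
        used, contending = {}, []
        for flow, port in chosen.items():
            if port in used:
                contending.append(flow)                    # contention on this port
            else:
                used[port] = flow
        if not contending:                                 # convergence!
            print(f"converged after {trial} trial(s): {chosen}")
            return chosen
        for flow in contending:                            # un-route and re-randomize
            chosen[flow] = rng.choice(PORTS)
        trial += 1

route_by_trial_and_error()
```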

Page 18:

Mellanox Adaptive Routing Notification (ARN)

• Switch-to-switch communications for adaptive routing
• Faster convergence after routing changes
• Fast notification to the decision point
• Fully configurable (topology agnostic)
• Adapts even when congestion is far away

[Diagram: 1) sampling prefers the large flows; 2) the ARN is forwarded where congestion can't be fixed locally; 3) the ARN causes a new output-port selection]

Faster Routing Modifications, Resilient Network
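A minimal sketch of the notification flow (the classes and port numbers are invented for illustration, not Mellanox's implementation): the congested switch cannot fix the problem locally, so it forwards the ARN back toward the decision point, which reselects the output port for the sampled large flow.

```python
from dataclasses import dataclass, field

# Illustrative ARN flow: congestion detected downstream, fixed at the decision point.
@dataclass
class Switch:
    name: str
    alt_ports: dict = field(default_factory=dict)    # flow -> spare output ports

    def on_arn(self, flow: str, prev_hop: "Switch" = None) -> None:
        if self.alt_ports.get(flow):                  # can this switch re-route the flow?
            new_port = self.alt_ports[flow].pop()
            print(f"{self.name}: ARN for '{flow}' -> new output port {new_port}")
        elif prev_hop is not None:                    # can't fix here: forward the ARN
            print(f"{self.name}: ARN for '{flow}' forwarded toward the decision point")
            prev_hop.on_arn(flow)

# Sampling prefers the large flows, so the ARN is raised for the 'elephant' flow.
spine = Switch("spine")                               # congested, no alternative here
leaf = Switch("leaf", {"elephant": [3, 4]})           # decision point with spare uplinks
spine.on_arn("elephant", prev_hop=leaf)
```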

Page 19:

InfiniBand Adaptive Routing

B_Eff Benchmark

[Chart: bandwidth (MB/s) at 16–1585 nodes, FDR InfiniBand vs. Cray Aries]

Higher Performance and Better Network Utilization

Page 20:

Comprehensive HPC Software

Page 21:

Mellanox HPC-X™ Scalable HPC Software Toolkit

• MPI, PGAS/OpenSHMEM and UPC package for HPC environments
• Fully optimized for Mellanox InfiniBand and 3rd-party interconnect solutions
• Maximizes application performance
• Mellanox tested, supported and packaged
• For commercial and open source usage

Page 22:

Enabling Highest Application Scalability and Performance

Comprehensive MPI, PGAS/OpenSHMEM/UPC Software Suite

[Software stack, top to bottom:]
Applications
Mellanox HPC-X™ – MPI, SHMEM, UPC, MXM, FCA
Mellanox OFED® – PeerDirect™, Core-Direct™, GPUDirect® RDMA
Operating System
Platforms (x86, Power8, ARM, GPU, FPGA)
Interconnect (InfiniBand, Ethernet): Mellanox InfiniBand, Mellanox Ethernet (RoCE), or 3rd-party standard

Page 23:

Mellanox Delivers Highest Application Performance

World Record Performance with Mellanox Connect-IB and HPC-X

Achieve Higher Performance with 67% less compute infrastructure versus Cray


Page 24:

HPC-X Performance Advantage

58% Performance Advantage! 21% Performance Advantage! 20% Performance Advantage!

Enabling Highest Application Scalability and Performance

Page 25:

GPUDirect RDMA

Accelerator and GPU Offloads

Page 26:

PeerDirect Technology

Based on Peer-to-Peer capability of PCIe

Support for any PCIe peer which can provide access to its memory

• NVIDIA GPU, XEON PHI, AMD, custom FPGA

[Diagram: PeerDirect RDMA – the HCA moves data over PCIe directly between GPU memory on peer nodes, bypassing the CPU and system memory]

Page 27:

GPUDirect™ RDMA (GPUDirect 3.0)

• Based on PeerDirect technology
• Eliminates CPU bandwidth and latency bottlenecks
• Uses remote direct memory access (RDMA) transfers between GPUs
• Results in significantly improved MPI SendRecv efficiency between GPUs in remote nodes

[Diagrams: data path using PeerDirect™ alone vs. with GPUDirect™ RDMA]
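From the application's point of view this is transparent: with a CUDA-aware MPI built with GPUDirect RDMA support (for example MVAPICH2-GDR), the GPU buffer is handed straight to MPI. A hedged sketch using mpi4py (≥ 3.1) and CuPy, which accept device buffers when the underlying MPI is CUDA-aware; the buffer size and tag are arbitrary:

```python
# Sketch: GPU-to-GPU MPI send/recv with a CUDA-aware MPI library.
# Run with: mpirun -np 2 python gpu_sendrecv.py
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

buf = cp.arange(1 << 20, dtype=cp.float32)   # 4 MB buffer resident in GPU memory
cp.cuda.Device().synchronize()               # make sure the fill kernel has finished

if rank == 0:
    comm.Send(buf, dest=1, tag=7)            # device buffer passed directly to MPI
elif rank == 1:
    recv = cp.empty_like(buf)
    comm.Recv(recv, source=0, tag=7)         # data lands directly in GPU memory
    print("ok:", bool(cp.allclose(recv, buf)))
```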

Page 28:

Performance of MVAPICH2 with GPUDirect RDMA

[Charts: GPU-GPU internode MPI latency (us) and bandwidth (MB/s) vs. message size, 1B–4KB]

67% Lower Latency (5.49 usec)
5X Increase in Throughput

Source: Prof. DK Panda

Page 29:

Mellanox PeerDirect™ with NVIDIA GPUDirect RDMA

HOOMD-blue is a general-purpose molecular dynamics simulation code accelerated on GPUs

GPUDirect RDMA allows direct peer-to-peer GPU communication over InfiniBand
• Unlocks performance between the GPU and InfiniBand
• Provides a significant decrease in GPU-GPU communication latency
• Provides complete CPU offload for all GPU communications across the network

Demonstrated up to 102% performance improvement with a large number of particles

Page 30:

GPUDirect Sync (GPUDirect 4.0)

GPUDirect RDMA (3.0) – direct data path between the GPU and the Mellanox interconnect
• The control path still uses the CPU
  - The CPU prepares and queues communication tasks on the GPU
  - The GPU triggers communication on the HCA
  - The Mellanox HCA directly accesses GPU memory

GPUDirect Sync (GPUDirect 4.0)
• Both the data path and the control path go directly between the GPU and the Mellanox interconnect

[Chart: 2D stencil benchmark – average time per iteration (us) on 2 and 4 nodes/GPUs; RDMA+PeerSync is 27% and 23% faster than RDMA only]

Maximum Performance for GPU Clusters

Page 31:

Remote GPU Access through rCUDA

rCUDA provides remote access from every node to any GPU in the system

[Diagram: client side – the application runs over the rCUDA library and network interface; server side – the rCUDA daemon runs over the CUDA driver and runtime on GPU servers offered as GPU-as-a-Service; clients see virtual GPUs (vGPUs) backed by the physical GPU pool]

Page 32:

Multi-Host Technology

Page 33:

Data Center Evolution Over Time

Multi-Core → Multi-Host → Multi-Server

Page 34:

New Compute Rack / Data Center Architecture

Traditional Data Center
• Expensive design for fixed data centers
• Requires many ports on the ToR switch
• Dedicated NIC / cable per server

Scalable Data Center with Multi-Host
• The future of data center design
• CPU becomes a plug-in peripheral
• Flexible, configurable, application optimized
• Optimized ToR switches
• Takes advantage of the high-throughput network

[Diagram: traditional server with its own CPU and NIC vs. a multi-host server where CPU slots share one NIC]

The Network is The Computer

Page 35:

Lower OPEX and CAPEX

New Scalable Rack Design

Smart Interconnect to Unleash the Power of All Compute Architectures

[Diagram: multiple CPUs sharing a single adapter]

Page 36:

ConnectX-4 on Facebook OCP Multi-Host Platform (Yosemite)

[Photo: OCP Yosemite compute slots served by a single ConnectX-4 multi-host adapter]

The Next Generation Compute and Storage Rack Design

Page 37:

Higher Performance Data Center

Smart Interconnect for a Wide Variety of Compute Architectures

[Diagram: four CPUs connected through the network adapter rather than over QPI]

Overcoming CPU-to-CPU Connectivity Bottlenecks

Lower Application Latency

Page 38:

Efficient GPU-based Data Centers

[Diagram: dual-socket GPU nodes (GPU-CPU-GPU over QPI) with GPUDirect® RDMA paths from the interconnect to each GPU]

Smart Interconnect to Unleash the Power of All Compute Architectures

Enabling GPUDirect RDMA Across all Available GPUs

Page 39:

HPC Clouds

All HPC Clouds to Use Mellanox Solutions

Page 40:

Virtualization for HPC: Mellanox SR-IOV

Page 41:

HPC Clouds – Performance Demands Mellanox Solutions

San Diego Supercomputing Center "Comet" System (2015) to Leverage Mellanox Solutions and Technology to Build HPC Cloud

Page 42:

Summary

Page 43:

PCIe 4.0 Accelerating CPU / Memory - Interconnect Performance

PCIe 1.0 (2003–2004): 2.5Gb/s per lane, 8b/10b encoding
PCIe 2.0 (2007–2008): 5Gb/s per lane, 8b/10b encoding
PCIe 3.0 (2011–2012): 8Gb/s per lane, 128b/130b encoding
PCIe 4.0 (2015–2016): 16Gb/s per lane, 128b/130b encoding
(a worked bandwidth example follows the capability list below)

New Capabilities in PCIe 4.0
• Higher Bandwidth (16 to 25Gb/s per lane)

• Cache Coherency

• Atomic Operations

• Advanced Power Management

• Memory Management…
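As a worked example of what the per-lane rates and encodings above mean for an x16 slot (a sketch that ignores protocol overheads beyond line encoding):

```python
# Effective x16 bandwidth per PCIe generation: lane_rate * encoding_efficiency * lanes / 8.
GENS = {  # generation: (GT/s per lane, encoding efficiency)
    "1.0": (2.5, 8 / 10),
    "2.0": (5.0, 8 / 10),
    "3.0": (8.0, 128 / 130),
    "4.0": (16.0, 128 / 130),
}
LANES = 16

for gen, (gts, eff) in GENS.items():
    gbytes = gts * eff * LANES / 8          # GB/s of payload bandwidth, one direction
    print(f"PCIe {gen} x{LANES}: ~{gbytes:.1f} GB/s per direction")
# ~4, ~8, ~15.8, ~31.5 GB/s: PCIe 3.0 x16 can just carry one 100Gb/s (12.5 GB/s)
# EDR port, while PCIe 4.0 x16 roughly doubles the headroom.
```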

Page 44:

Mellanox Interconnect Advantages

Mellanox solutions provide proven, scalable and high-performance end-to-end connectivity

Flexible, supporting all compute architectures: x86, Power, ARM, GPU, FPGA, etc.

Standards-based (InfiniBand, Ethernet), supported by a large eco-system

Higher performance: 100Gb/s, sub-0.7usec latency, 150 million messages/sec

HPC-X software provides leading performance for MPI, OpenSHMEM/PGAS and UPC

Superior application offloads: RDMA, collectives, scalable transport

Backward and future compatible

Speed-Up Your Present, Protect Your Future

Paving The Road to Exascale Computing Together

Page 45:

Thank You