
Page 1: Ethernet Enhancements for High Performance Computing and

Ethernet Enhancements for High Performance Computing and Storage Systems

Yan Zhuang, Huawei Technologies

IEEE 802.3 New Ethernet Adhoc

April 07, 2021

Page 2: Ethernet Enhancements for High Performance Computing and

Contributors and Supporters

• José Duato, Polytechnic University of Valencia, Member of the Royal Spanish Academy of Sciences

• Fazhi Qi, Computer Center, Institute of High Energy Physics, CAS

• Shan Zeng, Computer Center, Institute of High Energy Physics, CAS

• Liang Guo, CAICT

• Jie Li, CAICT

• Zhaogeng Li, Baidu

• Xuequn Wei, JD.com

• Hao Quan, Meituan

• Leon Bruckman, Huawei

2

Page 3: Ethernet Enhancements for High Performance Computing and

Acknowledgement

The authors would like to thank Sudheer Chunduri, Scott Parker, Pavan Balaji, Kevin Harms and Kalyan Kumaran from ALCF, Argonne National Laboratory, for their analysis in the paper "Characterization of MPI Usage on a Production Supercomputer" and for permission to share their work in this contribution.

The authors would also like to thank SNIA for their work, which is likewise shared in this contribution.

3

Page 4: Ethernet Enhancements for High Performance Computing and

Agenda

• Market Drivers
− Computing Market
− Storage Market

• Ethernet enhancements for high performance applications
− Why Ethernet Helps
− The way for Ethernet to go
∙ Current Challenges
∙ Enhancements to be considered

• Q&A

4

Page 5: Ethernet Enhancements for High Performance Computing and

Market Drivers
The market that we are talking about and its Ethernet trends:
− Computing Market
− Storage Market

5

[Diagram: Data, Storage, Computing, Communication, Processing System, Drive]

Page 6: Ethernet Enhancements for High Performance Computing and

AI Market

6

Source: B400G CFI: https://www.ieee802.org/3/ad_hoc/ngrates/public/calls/20_1029/CFI_Beyond400GbE_Rev7_201029.pdf

Page 7: Ethernet Enhancements for High Performance Computing and

7

HPC Market

Source: B400G CFI: https://www.ieee802.org/3/ad_hoc/ngrates/public/calls/20_1029/CFI_Beyond400GbE_Rev7_201029.pdf

HPC is now moving toward more AI-focused use cases for scientific computation as well as data analysis. Meanwhile, OTT companies (such as AWS, IBM and Huawei) are providing HPC clouds to their customers for high performance computing. These two markets are tending to merge, providing computing-intensive services to customers through cloud access.

Page 8: Ethernet Enhancements for High Performance Computing and

• The primary objective of computing infrastructures is to provide high/large computing power for their customers to deal with various computing-intensive tasks.

• Efforts to build such a system include:

− Improving computing unit performance; however, since Moore's Law is coming to an end, increasing the number of accelerators in a server comes at the expense of proportionally increased power consumption.

− Using multiple accelerators in each server efficiently without introducing a communication bottleneck. Accelerators should be directly connected to the interconnect (e.g. Habana Gaudi). Similarly, in a disaggregated architecture, the different pooled components (CPUs, accelerators, memory; similar work can be found in the Dell blog "Server Disaggregation: Sometimes the Sum of the Parts Is Greater Than the Whole") need direct access to the interconnect. In both cases, the number of network ports per server increases significantly.

Interconnects to connect computing power

8

[Figure placeholder]

Page 9: Ethernet Enhancements for High Performance Computing and

• Efforts to build such a system also include:

− Increasing the total number of servers per system, which also contributes to a much higher number of interconnect ports.

− Combining both of the above needs requires that the total number of interconnect end nodes and the total aggregated network bandwidth increase by a factor of 10x to 100x. Additionally, as network size increases, it becomes increasingly difficult to guarantee low latency.

• Given that, the computing interconnect is going beyond a single device to a more scalable fabric system, interconnecting a larger (and increasing) number of computing and storage units to compose a high performance system.

Interconnects to connect computing power

9

Computing fabric becomes a key point to "sum up" the computing power of increasingly heterogeneous nodes!

Page 10: Ethernet Enhancements for High Performance Computing and

10

Ethernet Interconnects in Top500

Ethernet interconnects take the largest share (50.8%) of the Top500 systems, yet contribute only a 19.6% performance share (i.e. Rmax), compared with IB interconnects, which contribute a 40% performance share with a 31% system share.

Note: all data from https://www.top500.org/statistics/list/

Infiniband interconnects are mainly EDR and stepped up to HDR200 in Nov. 2020, while Ethernet interconnects still sit at 10GE (17.4%), 25GE (16%) and 40GE (14.6%).

Note: Rmax is a system's maximal achieved performance.

[Pie chart, Interconnect Family System Share: Gigabit Ethernet 50.8%, Infiniband 31.0%, Omnipath 9.4%, Custom Interconnect 7.4%, Proprietary Network 1.2%, Myrinet 0.2%]

[Pie chart, Interconnect Family Performance Share: Gigabit Ethernet 19.6%, Infiniband 40%, Omnipath 7.6%, Custom Interconnect 13.3%, Proprietary Network and Others 19.5%]

[Pie chart, Interconnect System Share: 10G Ethernet 17.4%, 25G Ethernet 16%, 40G Ethernet 14.6%, Infiniband EDR 10%, Intel Omni-Path 8.8%, Infiniband FDR 6.2%, Aries Interconnect 6%, Mellanox HDR Infiniband 2.6%, Mellanox InfiniBand HDR100 2.4%, Mellanox InfiniBand EDR 2.2%, Others 13.8%]

Page 11: Ethernet Enhancements for High Performance Computing and

11

Interconnect Upgrades in HPC

Note: all data is from: https://www.top500.org/statistics/list/.

• Ethernet started earlier in HPC networking, while IB brings its highest rates into HPC networking much faster.

• The highest-ranked Ethernet interconnect system is #86, HPE's Unizah-II (Cray CS500), while Infiniband takes 7 positions in the Top10 systems.

• Starting from 100Gb/s, HPC networking is pursuing Exascale or higher computing performance with much more data exchange, calling for more efficient interconnects.

Ethernet rates and the standards that introduced them:
GbE: 802.3ab-1999
10GbE: 802.3an-2006
25GbE: 802.3by-2016, 802.3bq-2016
40GbE: 802.3ba-2010, 802.3bq-2016
100GbE: 802.3ba-2010, 802.3bj-2014, 802.3cu-2020

[Chart: Ethernet Interconnect Trends; number of Top500 systems using 10G, 25G, 40G and 100G Ethernet, June 2002 to Nov. 2020]

[Chart: Infiniband Interconnect Trends; number of Top500 systems using IB QDR, FDR, EDR, HDR, HDR100 and HDR200, June 2002 to Nov. 2020]

Infiniband generations (per-lane rate and year of introduction): QDR 10Gb/s (2007), FDR 14Gb/s (2011), EDR 25Gb/s (2014), HDR 50Gb/s (2018), NDR 100Gb/s (2021), XDR 250Gb/s (??).

Page 12: Ethernet Enhancements for High Performance Computing and

JBOF to NVMe-oF (NoF) Ethernet BOF

• From JBOF to NoF EBOF:

− A simple backplane design that offers high density and full utilization of the attached flash.

− Scalable Ethernet switching and extensibility for more SSD nodes.

− Less power, with drives connected directly to Ethernet.

Shahar Noy from Marvell also presented "Storage - Ethernet as the Next Storage Fabric of Choice" at the OCP Summit 2019.

Notes: JBOF = Just a Bunch of Flash.

[Diagram: evolution from a JBOF (CPU, DRAM, NVMe host driver, PCIe switch, PCIe-attached NVMe SSDs), to an NoF JBOF (NVMe-oF host driver, CPU, DRAM, NVMe-oF controller, PCIe fabric, NVMe SSDs), to an NoF Ethernet BOF (NVMe-oF host driver, Ethernet switch fabric, Ethernet-attached eSSDs)]

12

Page 13: Ethernet Enhancements for High Performance Computing and

New Storage Media Needs a Microsecond-Level Network

The network has to step into the ~10us or even ~1us range to offer proportional gains!

• Storage media is getting faster, from SSD to SCM (Storage Class Memory, aka PM, Persistent Memory), with latency down to ~10us or even below 1us.

• Note that even with 10us media latency, the network takes 50% of the total latency (with 10us SCM) and will perform even worse with more advanced media (e.g. 1us PM); see the breakdown and arithmetic below.
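Using the per-era figures from the chart data below (SCM media roughly 10 us, NVMe protocol roughly 20 us, NoF network roughly 30 us), the 50% figure follows directly:

$$\frac{\text{network (NoF)}}{\text{media (SCM)}+\text{protocol (NVMe)}+\text{network (NoF)}}=\frac{30\,\mu\text{s}}{10\,\mu\text{s}+20\,\mu\text{s}+30\,\mu\text{s}}=50\%$$

With ~1us PM the denominator shrinks further, so the network's share of the total latency grows even larger.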

13

[Chart data: latency by era (us)]
HDD era (~2010): Media HDD 3500, Storage Protocol SAS 100, Network FC 300
Flash 1.0 (2010-2015): Media Flash 150, Storage Protocol SAS 100, Network FC 300
Flash 2.0 (2016-2020?): Media Flash 150, Storage Protocol NVMe 20, Network NoF 30
SCM (2021?): Media SCM 10, Storage Protocol NVMe 20, Network NoF 30

Source: 'Storage Hierarchy: Capacity & Speed' diagram reprinted with permission from SNIA, 'Introduction to SNIA Persistent Memory Performance Test Specification', © 2020, www.sina.org.

Page 14: Ethernet Enhancements for High Performance Computing and

Ethernet Enhancements for high performance applications

• Why Ethernet?
• The way for Ethernet to go:
− High throughput
− Low latency improvement, even with small steps.

14

Page 15: Ethernet Enhancements for High Performance Computing and

Why Ethernet?

15

Development of High Speed Technologies

Deep ecosystem involvement and rich connections.

Source: https://www.ieee802.org/3/ad_hoc/ngrates/public/calls/20_0430/dambrosia_nea_02a_200430.pdf

Page 16: Ethernet Enhancements for High Performance Computing and

Challenge 1: High Throughput

We are focusing on higher bandwidth, however:

MPI is the communication protocol for most HPC applications. It provides support for collective communication, which includes Reduce, AllReduce and Broadcast operations. For HPC applications, over 85% of the Reduce messages are below 64 bytes and 70% of the Allreduce messages are below 64 bytes.
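To make these message sizes concrete, here is a minimal MPI sketch (the ranks and payload are illustrative assumptions, the calls are standard MPI) that performs an Allreduce on a single 8-byte double, well under the 64-byte threshold cited above. At such sizes, per-message rate rather than link bandwidth limits performance.

/* Minimal sketch: an Allreduce of one 8-byte double, far below 64 B.
 * Illustrates the small-message collectives that dominate HPC MPI traffic. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = (double)rank;   /* each rank contributes a single value */
    double sum = 0.0;

    /* 8-byte payload: performance is bounded by message rate, not bandwidth */
    MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of ranks = %f\n", sum);

    MPI_Finalize();
    return 0;
}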

16

• As we can see from the graph, Ethernet provides lower message rates than an IB port at the same rate.

• The Next Platform also posted an article, "How Cray Makes Ethernet Suited For HPC And AI With Slingshot" (August 16, 2019, by Timothy Prickett Morgan), which provides a message rate comparison between Ethernet and other interconnects.

S. Chunduri, S. Parker, P. Balaji, K. Harms and K. Kumaran, "Characterization of MPI Usage on a Production Supercomputer," SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, Dallas, TX, USA, 2018, pp. 386-400, doi: 10.1109/SC.2018.00033. (© 2018 IEEE)

Analysis of the message rates of IB and Ethernet @ 25G/50G/100G

Page 17: Ethernet Enhancements for High Performance Computing and

17

Challenge 2: Low Latency

• By using RDMA, transfer time improves by around 5x, which brings round trips below 10us.

• With the new storage media and storage protocols, the network's share of the total latency increases. With 10us-latency SCM, the network takes 50% of the total latency and will perform even worse with more advanced media (e.g. 1us PM).

• For large parallel applications executed on next generation high performance computing (HPC) systems, MPI communication latency should be lower than 1us*.

Transport gets faster: from TCP to RDMA. Storage gets faster: from 100us SSD to 1us PM/SCM.

Storage media and the storage stack get faster and faster. Even with 10us media latency, the network takes 50% of the total latency (with 10us SCM) and will perform even worse with more advanced media (e.g. 1us PM).

[Chart: storage latency by era (media, storage protocol, network); same data as the breakdown on page 13]

* Reference: J. Tomkins, "Interconnects: A Buyers Point of View," ACS Workshop, June 2007.

Page 18: Ethernet Enhancements for High Performance Computing and

Conceptual interconnects within a system

[Diagram: Server #1 through Server #n, each containing CPUs, GPUs, FPGAs, DRAM and SCM attached to an internal switch (SW); the servers are interconnected through a fabric of switches]

① Data within a device (inbox interconnect)
② Data exchange over a fabric (outbox interconnect)
③ External network interface

18

Interconnect latency = static latency + dynamic latency

Page 19: Ethernet Enhancements for High Performance Computing and

Latency Analysis

Interconnect latency = Static latency + Dynamic latency

Static latency = Endpoint_tx_phy_delay + Endpoint_rx_phy_delay + (processing time per hop) × hops + Σ_i (link latency_i)

Processing time per hop = rx_phy_delay + tx_phy_delay + switching time

① Inbox interconnects (unified interconnects, without heterogeneous link transformation delay).

② Outbox interconnects*:
• In a 2-layer CLOS topology there are 3 hops, giving 4 × (rx_phy_delay + tx_phy_delay)
• In a 3-layer CLOS topology there are 5 hops, giving 6 × (rx_phy_delay + tx_phy_delay)
In general, PHY latency = (1 + hops) × (rx_phy_delay + tx_phy_delay).

A. Dynamic latency is related to queuing time due to congestion and to retransmission caused by packet loss.
B. Link latency = 5 ns/m × reach. If the anticipated communication latency is less than 1us, then the reach is limited (perhaps below 20m).
C. Switching time = pipeline time + SWB time.

* Note: for simplicity, we assume all interfaces run at the same rate, with no convergence.

19
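As a rough worked example of the static term for a 2-layer CLOS (3 hops, 4 links), take the ~150 ns KP4 FEC figure from the next slide as the dominant rx+tx PHY contribution per hop, and assume, purely for illustration, 10 m links and 100 ns of switching time per hop:

$$\text{PHY} \approx (1+3)\times 150\,\text{ns} = 600\,\text{ns},\quad \text{links} \approx 4\times 10\,\text{m}\times 5\,\text{ns/m} = 200\,\text{ns},\quad \text{switching} \approx 3\times 100\,\text{ns} = 300\,\text{ns}$$

$$\text{Static latency} \approx 600 + 200 + 300 = 1100\,\text{ns} \approx 1.1\,\mu\text{s}$$

Even before any dynamic (queuing or retransmission) latency, this already exceeds the sub-1us MPI communication target cited on page 17.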

Page 20: Ethernet Enhancements for High Performance Computing and

20

Ethernet Delay constraints in spec

Taking KP4 FEC, with 150ns, as an example: in a 2-layer CLOS network the end-to-end FEC delay alone would be greater than 0.6us, regardless of the other sublayers and of the link transmission time.

The FEC sublayer takes over 60% of the whole latency. For 200GbE, the PCS/FEC sublayer takes over 67.9% of the whole PHY delay constraint, while for 400GbE it takes 69%.

Sublayer delay constraints (maximum, ns):

Sublayer                              | 25GbE  | 40GbE | 100GbE                         | 200GbE | 400GbE
RS, MAC and MAC control (round-trip)  | 327.68 | 409.6 | 245.76                         | 245.76 | 245.76
xxGBASE-R PCS                         | 143.36 | 281.6 | 353.28                         | 801.28 | 800
BASE-R FEC                            | 245.76 | 614.4 | 1228.8                         | --     | --
RS-FEC                                | -      | -     | 409.60                         | --     | --
BASE-R PMA                            | 163.84 | 102.4 | 92.16                          | 92.16  | 92.16
BASE-R PMD: CR/CR-S                   | 20.48  | 102.4 | CR4: 20.48, CR10: 97.28        | 40.96  | 20.48
BASE-R PMD: KR/KR-S                   | 51.2   | -     | KR4: 20.48, KP4 PMA/PMD: 81.92 | 40.96  | 20.48
BASE-R PMD: SR/LR/ER                  | -      | 25.6  | 20.48                          | 20.48  | 20.48
BASE-T PHY                            | 1024   | 640   | --                             | --     | --
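A small illustrative C sketch (sublayer values copied from the table above; the 150 ns per-hop KP4 figure is the one cited on this slide) showing how the quoted 67.9%/69% PCS/FEC shares and the >0.6us end-to-end FEC figure are obtained:

/* Illustrative check of the PCS/FEC share of the PHY delay budget,
 * using the sublayer delay maxima from the table above (ns). */
#include <stdio.h>

int main(void)
{
    /* 200GbE: PCS, PMA, PMD (CR/KR), RS/MAC and MAC control round-trip */
    double pcs200 = 801.28, pma200 = 92.16, pmd200 = 40.96, mac200 = 245.76;
    /* 400GbE equivalents */
    double pcs400 = 800.00, pma400 = 92.16, pmd400 = 20.48, mac400 = 245.76;

    double total200 = pcs200 + pma200 + pmd200 + mac200;
    double total400 = pcs400 + pma400 + pmd400 + mac400;

    printf("200GbE PCS/FEC share: %.1f%%\n", 100.0 * pcs200 / total200); /* ~67.9% */
    printf("400GbE PCS/FEC share: %.1f%%\n", 100.0 * pcs400 / total400); /* ~69.1% */

    /* End-to-end FEC contribution in a 2-layer CLOS: 3 hops -> 4 rx/tx pairs,
     * with ~150 ns of KP4 FEC per pair. */
    printf("2-layer CLOS FEC delay: %.0f ns\n", 4 * 150.0); /* >= 600 ns (0.6 us) */

    return 0;
}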

Page 21: Ethernet Enhancements for High Performance Computing and

21

• Based on existing Ethernet PMDs
− 100GbE/200GbE/400GbE

• Directions to be considered (thoughts and suggestions invited)
− To improve throughput for computing messages:
• Support flexible frame sizes (e.g. below 64B)
− To improve latency for low latency applications, even with small steps:
• Support PCS/FEC latency < TBD ns for 100G/lane with at least XX m reach...
• Support PCS/FEC latency < TBD ns for 200G/lane with at least XX m reach...
• Support PCS/FEC latency < TBD ns for 400G/lane with at least XX m reach...
− Anything else...

Ethernet Enhancements for High Performance Computing and Storage

Page 22: Ethernet Enhancements for High Performance Computing and

Thanks!

22