TRANSCRIPT
Ethernet Enhancements for High Performance Computing and Storage
Systems
Yan Zhuang, Huawei Technologies
IEEE 802.3 New Ethernet Adhoc
April 07, 2021
Contributors and Supporters
• José Duato, Polytechnic University of Valencia, Member of the Royal Spanish Academy of Sciences
• Fazhi Qi, Computer Center, Institute of High Energy Physics, CAS
• Shan Zeng, Computer Center, Institute of High Energy Physics, CAS
• Liang Guo, CAICT
• Jie Li, CAICT
• Zhaogeng Li, Baidu
• Xuequn Wei, JD.com
• Hao Quan, Meituan
• Leon Bruckman, Huawei
Acknowledgement
The authors would like to thank Sudheer Chunduri, Scott Parker, Pavan Balaji, Kevin Harms and Kalyan Kumaran from ALCF, Argonne National Laboratory, for their analysis in the paper "Characterization of MPI Usage on a Production Supercomputer" and for permission to share their work in this contribution.
The authors would also like to thank SNIA for their work and for permission to share it in this contribution.
Agenda
• Market Drivers
– Computing Market
– Storage Market
• Ethernet enhancements for high performance applications
– Why Ethernet helps
– The way for Ethernet to go
– Current challenges
– Enhancements to be considered
• Q&A
Market Drivers
The market that we are talking about and its Ethernet trends
– Computing Market
– Storage Market
AI Market
[Figure: the AI market spans data, storage, computing, communication, processing systems and drives]
Source: B400G CFI: https://www.ieee802.org/3/ad_hoc/ngrates/public/calls/20_1029/CFI_Beyond400GbE_Rev7_201029.pdf
HPC Market
Source: B400G CFI: https://www.ieee802.org/3/ad_hoc/ngrates/public/calls/20_1029/CFI_Beyond400GbE_Rev7_201029.pdf
HPC is now stepping toward more AI-focused use cases for scientific computation as well as data analysis. Meanwhile, cloud providers (AWS, IBM, Huawei, etc.) are offering HPC clouds to their customers for high performance computing. These two markets are tending to merge, providing computing-intensive services to customers through cloud access.
• The primary objective of computing infrastructures is to provide high/large computing power for their customers to deal with various computing-intensive tasks.
• Efforts to build such a system include:
– Improving computing unit performance; however, since Moore's Law is coming to an end, increasing the number of accelerators in a server comes at the expense of proportionally increased power consumption.
– Efficient use of multiple accelerators in each server without introducing a communication bottleneck. Accelerators should be directly connected to the interconnect (e.g. Habana Gaudi). Similarly, in a disaggregated architecture, the different pooled components (CPUs, accelerators, memory; similar work can be found at the Dell blog "Server Disaggregation: Sometimes the Sum of the Parts Is Greater Than the Whole") need direct access to the interconnect. In both cases, the number of network ports per server increases significantly.
Interconnects to connect computing power
[Figure omitted]
• Efforts to build such a system also include:
– Increasing the total number of servers per system, which also contributes to a much higher number of interconnect ports.
– Combining both of the above needs requires that both the total number of interconnect end nodes and the total aggregated network bandwidth increase by a factor of 10x to 100x. Additionally, as network size increases, it becomes increasingly difficult to guarantee low latency.
• Given that, the computing interconnect is going beyond a single device toward a more scalable fabric system that interconnects a larger (and increasing) number of computing and storage units to compose a high performance system.
Computing fabric becomes a key point to "sum up" the computing power of increasingly heterogeneous nodes!
Ethernet Interconnects in Top500
Ethernet interconnects take the largest share (50.8%) of the Top500 systems, yet contribute only a 19.6% performance share (i.e. Rmax), compared with Infiniband interconnects, which contribute a 40% performance share on a 31% system share.
Note: all data from https://www.top500.org/statistics/list/
Infiniband interconnects are mainly EDR and stepped into HDR200 in Nov. 2020, while Ethernet interconnects still remain at 10GE (17.4%), 25GE (16%) and 40GE (14.6%).
Note: Rmax is a systemโs maximal achieved performance
[Pie chart: Interconnect family system share – Gigabit Ethernet 50.8%, Infiniband 31.0%, Omnipath 9.4%, Custom Interconnect 7.4%, Proprietary Network 1.2%, Myrinet 0.2%]
[Pie chart: Interconnect family performance share – Gigabit Ethernet 19.6%, Infiniband 40%, Omnipath 7.6%, Custom Interconnect 13.3%, Proprietary Network/Others 19.5%]
[Pie chart: Interconnect system share – 10G Ethernet 17.4%, 25G Ethernet 16%, 40G Ethernet 14.6%, Infiniband EDR 10%, Intel Omni-Path 8.8%, Infiniband FDR 6.2%, Aries Interconnect 6%, Mellanox HDR Infiniband 2.6%, Mellanox InfiniBand HDR100 2.4%, Mellanox InfiniBand EDR 2.2%, Others 13.8%]
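As a quick sanity check, the share figures above reduce to a performance-per-system-share ratio. The following is a small illustrative sketch (plain arithmetic on the quoted percentages, not processing of Top500 raw data):

```python
# Sketch: performance share vs. system share from the Top500 pie charts above
# (plain arithmetic on the quoted percentages, not Top500 raw data).
families = {
    # family: (system share %, Rmax performance share %)
    "Gigabit Ethernet": (50.8, 19.6),
    "Infiniband":       (31.0, 40.0),
}
for name, (system_share, perf_share) in families.items():
    ratio = perf_share / system_share
    print(f"{name}: {perf_share}% of Rmax on {system_share}% of systems "
          f"(performance/system ratio {ratio:.2f})")
```

The ratios (roughly 0.39 for Ethernet vs. 1.29 for Infiniband) quantify the efficiency gap that the rest of this contribution targets.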
Interconnect Upgrades in HPC
Note: all data is from: https://www.top500.org/statistics/list/.
• Ethernet entered HPC networking earlier, while IB brings its high rates into HPC networking much faster.
• The highest-ranked Ethernet interconnect system is #86, HPE Unizah-II (Cray CS500), while Infiniband takes 7 positions in the Top 10 systems.
• Starting from 100Gb/s, HPC networking is pursuing Exascale or higher computing performance with much more data exchange, aiming for more efficient interconnects.
Ethernet standards by rate:
GbE: 802.3ab-1999
10GbE: 802.3an-2006
25GbE: 802.3by-2016, 802.3bq-2016
40GbE: 802.3ba-2010, 802.3bq-2016
100GbE: 802.3ba-2010, 802.3bj-2014, 802.3cu-2020
[Chart: Ethernet Interconnect Trends – number of systems (0 to 300) per Top500 list from 2002.06 to 2020.11, broken out into the Ethernet family overall and 10G, 25G, 40G and 100G Ethernet]
[Chart: Infiniband Interconnect Trends – number of systems (0 to 300) per Top500 list from 2002.06 to 2020.11, broken out into the INFINIBAND family overall and IB QDR, FDR, EDR, HDR, HDR 100 and HDR 200]
Infiniband generations (lane rate, year introduced):
QDR: 10Gb/s (2007)
FDR: 14Gb/s (2011)
EDR: 25Gb/s (2014)
HDR: 50Gb/s (2018)
NDR: 100Gb/s (2021)
XDR: 250Gb/s (??)
JBOF to NVMe-oF (NoF) Ethernet BOF
• From JBOF to NoF EBOF:
– A simple backplane design that offers high density and full utilization of the attached flash.
– Scalable Ethernet switching and extensibility for more SSD nodes.
– Lower power through direct connection to Ethernet.
Shahar Noy from Marvell also presented "Storage – Ethernet as the Next Storage Fabric of Choice", OCP Summit 2019.
Note: JBOF = Just a Bunch of Flash.
[Figure: three architectures –
JBOF: NVMe host driver, CPU, DRAM and a PCIe switch fronting NVMe SSDs over PCIe;
NoF JBOF: NVMe-oF host driver, CPU, DRAM and an NVMe-oF controller fronting NVMe SSDs over a PCIe fabric;
NoF Ethernet BOF: NVMe-oF host driver, CPU, DRAM and an Ethernet switch fabric fronting eSSDs directly]
New Storage Media Needs Microseconds Network
The network must step into the ~10us or even ~1us range to offer proportional gains!
• Storage media is getting faster, from SSD to SCM (Storage Class Memory, a.k.a. PM, Persistent Memory), with latency down to ~10us or even below 1us.
• Note that even with 10us media latency, the network takes 50% of the total latency (with 10us SCM) and will perform even worse with more advanced media (e.g. 1us PM).
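The 50% figure follows directly from the per-era latency numbers used in this contribution; a minimal sketch (figures in microseconds, taken from the chart on this slide):

```python
# Sketch: network share of end-to-end storage access latency, using the
# per-era figures (in microseconds) from the chart in this contribution.
eras = {
    # era: (media, storage protocol, network) latency in us
    "HDD era (~2010)":       (3500, 100, 300),
    "Flash 1.0 (2010-2015)": (150,  100, 300),
    "Flash 2.0 (2016-2020)": (150,   20,  30),
    "SCM (2021~)":           (10,    20,  30),
}
for era, (media, protocol, network) in eras.items():
    total = media + protocol + network
    share = 100.0 * network / total
    print(f"{era}: total {total} us, network share {share:.1f}%")
# Within the NoF era, the network share jumps from 15% (Flash 2.0) to 50% (SCM)
# purely because the media got faster.
```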
[Stacked bar chart: latency (us) by era, split into media / storage protocol / network –
HDD era (~2010): HDD 3500 / SAS 100 / FC 300
Flash 1.0 (2010-2015): Flash 150 / SAS 100 / FC 300
Flash 2.0 (2016-2020): Flash 150 / NVMe 20 / NoF 30
SCM (2021~): SCM 10 / NVMe 20 / NoF 30]
Source: "Storage Hierarchy: Capacity & Speed" diagram reprinted with permission from SNIA, "Introduction to SNIA Persistent Memory Performance Test Specification", ©2020, www.snia.org.
Ethernet Enhancements for high performance applications
• Why Ethernet?
• The way for Ethernet to go
– High throughput
– Low latency improvement, even with small steps.
Why Ethernet?
Development of High Speed Technologies
Deep ecosystem involvement and rich connections.
Source: https://www.ieee802.org/3/ad_hoc/ngrates/public/calls/20_0430/dambrosia_nea_02a_200430.pdf
Challenge 1: High Throughput
We are focusing on higher bandwidth; however:
MPI is the communication protocol for most HPC applications. It provides support for collective communication, which includes Reduce, AllReduce and Broadcast operations. For HPC applications, over 85% of the Reduce messages are below 64 bytes and 70% of the AllReduce messages are below 64 bytes.
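To see why sub-64-byte messages stress throughput, the following illustrative sketch computes wire efficiency under standard Ethernet framing (14B header + 4B FCS, 8B preamble + 12B inter-frame gap, 64B minimum frame). The framing constants are standard Ethernet; the one-message-per-frame model is a simplifying assumption, not the measurement method of the cited paper:

```python
# Illustrative sketch: wire efficiency of small messages on Ethernet,
# assuming one message per frame (a simplifying assumption).
HEADER, FCS = 14, 4        # Ethernet header + frame check sequence
PREAMBLE, IFG = 8, 12      # preamble/SFD + inter-frame gap (bytes on the wire)
MIN_FRAME = 64             # minimum Ethernet frame; short payloads are padded

def wire_efficiency(payload_bytes: int) -> float:
    """Fraction of the line rate that carries useful payload."""
    frame = max(HEADER + payload_bytes + FCS, MIN_FRAME)  # pad to minimum
    on_wire = PREAMBLE + frame + IFG
    return payload_bytes / on_wire

for size in (8, 32, 64, 256):
    print(f"{size:4d}B payload -> {wire_efficiency(size) * 100:.1f}% efficiency")
```

At 64B and below, more than a third of the line rate is framing overhead, so raw bandwidth alone does not translate into message rate.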
• As we can see from the graph, Ethernet provides lower message rates when compared with an IB port at the same rate.
• The Next Platform also posted an article, "How Cray Makes Ethernet Suited For HPC And AI With Slingshot" (August 16, 2019, by Timothy Prickett Morgan), which provides a message rate comparison between Ethernet and other interconnects.
S. Chunduri, S. Parker, P. Balaji, K. Harms and K. Kumaran, "Characterization of
MPI Usage on a Production Supercomputer," SC18: International Conference for
High Performance Computing, Networking, Storage and Analysis, Dallas, TX, USA,
2018, pp. 386-400, doi: 10.1109/SC.2018.00033. (© 2018 IEEE).
Analysis of the message rates of IB and Ethernet @ 25G/50G/100G
Challenge 2: Low Latency
• By using RDMA, transfer time improves around 5x, which leads to below-10us round trips.
• With the new storage media and storage protocols, the network's percentage of the total latency increases. With 10us-latency SCM, the network takes 50% of the total latency, and it will perform even worse with more advanced media (e.g. 1us PM).
• For large parallel applications executed on next generation high performance computing (HPC) systems, MPI communication latency should be lower than 1us*.
Transport gets faster: from TCP to RDMA. Storage gets faster: from 100us SSD to 1us PM/SCM.
The storage media and stack get faster and faster. Even with 10us media latency, the network takes 50% of the total latency (with 10us SCM) and will perform even worse with more advanced media (e.g. 1us PM).
* Reference: J. Tomkins, "Interconnects: A Buyer's Point of View," ACS Workshop, June 2007.
Conceptual interconnects within a system
[Figure: servers #1 to #n, each containing CPUs, GPUs, FPGAs, DRAM and SCM, interconnected through switches (SW) –
① the inbox interconnect carries data within a device;
② the outbox interconnect carries data exchange over a fabric via external network interfaces]
Latency Analysis

Interconnect latency = Static latency + Dynamic latency

Static latency = Endpoint tx PHY delay + Endpoint rx PHY delay + (Message passing time per hop × hops) + Σ Link latency(i)

Message passing time per hop = rx PHY delay + tx PHY delay + Switching time

② Outbox interconnects:
• 2-layer CLOS topology: 3 hops, i.e. 4 × (rx PHY delay + tx PHY delay)
• 3-layer CLOS topology: 5 hops, i.e. 6 × (rx PHY delay + tx PHY delay)

① Inbox interconnects (unified interconnects without heterogeneous link transformation delay):
PHY latency = (1 + hops) × (rx PHY delay + tx PHY delay)

A. Dynamic latency is related to queuing time due to congestion and to latency caused by retransmission due to packet loss.
B. Link latency = 5ns/m × reach. If the anticipated communication latency is less than 1us, then the reach is limited (might be below 20m).
C. Switching time = Pipeline time + SWB time.

* Note: for simplicity, we assume all interfaces have the same rate, with no convergence.
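The static-latency model above can be sketched in a few lines. All numeric values below (per-direction PHY delay, per-hop switching time, link reach) are illustrative assumptions, not values from any spec:

```python
# Sketch of the static-latency model above. The 150 ns PHY, 300 ns switch
# and 10 m link numbers are illustrative assumptions only.
NS_PER_M = 5.0  # link latency = 5 ns/m * reach (point B above)

def static_latency_ns(hops, link_reaches_m, rx_phy_ns, tx_phy_ns, switch_ns):
    # (1 + hops) traversals of the (rx + tx) PHY pair: the two endpoints
    # plus one pair per switch hop (4x for a 2-layer CLOS with 3 hops).
    phy = (1 + hops) * (rx_phy_ns + tx_phy_ns)
    switching = hops * switch_ns  # point C: per-hop switching time
    links = sum(NS_PER_M * reach for reach in link_reaches_m)
    return phy + switching + links

# 2-layer CLOS: 3 hops, hence 4 links end to end.
total = static_latency_ns(hops=3, link_reaches_m=[10] * 4,
                          rx_phy_ns=150, tx_phy_ns=150, switch_ns=300)
print(f"static latency: {total:.0f} ns ({total / 1000:.2f} us)")
```

With these assumed numbers the static part alone already exceeds 2us, which illustrates why a sub-1us target constrains PHY delay, switching time and reach simultaneously.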
Ethernet delay constraints in the spec
Taking KP4 FEC with 150ns as an example, in a 2-layer CLOS network the end-to-end FEC delay would be greater than 0.6us, regardless of the other sublayers and the link transmission time.
The FEC sublayer takes over 60% of the whole latency. For 200GbE, the PCS/FEC sublayer takes over 67.9% of the whole PHY delay constraints, while for 400GbE it takes 69%.
Sublayer delay constraints (maximum, ns):

Sublayer                             | 25GbE  | 40GbE | 100GbE                         | 200GbE | 400GbE
RS, MAC and MAC control (round-trip) | 327.68 | 409.6 | 245.76                         | 245.76 | 245.76
xxGBASE-R PCS                        | 143.36 | 281.6 | 353.28                         | 801.28 | 800
BASE-R FEC                           | 245.76 | 614.4 | 1228.8                         | --     | --
RS-FEC                               | -      | -     | 409.60                         | --     | --
BASE-R PMA                           | 163.84 | 102.4 | 92.16                          | 92.16  | 92.16
BASE-R PMD: CR/CR-S                  | 20.48  | 102.4 | CR4: 20.48, CR10: 97.28        | 40.96  | 20.48
BASE-R PMD: KR/KR-S                  | 51.2   | -     | KR4: 20.48, KP4 PMA/PMD: 81.92 | 40.96  | 20.48
BASE-R PMD: SR/LR/ER                 | 25.6   | -     | 20.48                          | 20.48  | 20.48
BASE-T PHY                           | 1024   | 640   | --                             | --     | --
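The percentages quoted on this slide can be reproduced from the table. One assumption here: the CR PMD value is used for the 200GbE/400GbE PMD column, since the share depends on which PMD is counted:

```python
# Sketch: reproduce the PCS/FEC share figures quoted above from the table.
# Assumption: the CR PMD delay is used for the 200GbE/400GbE PMD entry.
phy_budget_ns = {
    "200GbE": {"RS/MAC": 245.76, "PCS/FEC": 801.28, "PMA": 92.16, "PMD": 40.96},
    "400GbE": {"RS/MAC": 245.76, "PCS/FEC": 800.00, "PMA": 92.16, "PMD": 20.48},
}
for speed, sublayers in phy_budget_ns.items():
    total = sum(sublayers.values())
    share = 100.0 * sublayers["PCS/FEC"] / total
    print(f"{speed}: total constraint {total:.2f} ns, PCS/FEC share {share:.1f}%")

# KP4 FEC alone is ~150 ns per traversal; a 2-layer CLOS path crosses
# 4 PHY pairs end to end, so the FEC delay alone is already >= 0.6 us.
print(f"KP4 end-to-end FEC delay (2-layer CLOS): {4 * 150} ns")
```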
• Based on existing Ethernet PMDs
– 100GbE/200GbE/400GbE
• Directions to be considered (inviting thoughts and suggestions)
– To improve throughput for computing messages
• Support flexible frame sizes (e.g. below 64B)
– To improve latency for low latency applications, even with small steps.
• Support PCS/FEC latency < TBD ns for 100G/lane at at least XXm reach...
• Support PCS/FEC latency < TBD ns for 200G/lane at at least XXm reach…
• Support PCS/FEC latency < TBD ns for 400G/lane at at least XXm reach…
– Anything else…
Ethernet Enhancements for High Performance Computing and Storage
Thanks!