Download - The Tofu Interconnect D for Supercomputer Fugaku · Role of Supercomputer in Fujitsu Products Supercomputer is a technology driver for Fujitsu’s computing products in technologies

20th June 2019

Yuichiro Ajima

Fujitsu Limited

The Tofu Interconnect D for

Supercomputer Fugaku

Copyright 2019 FUJITSU LIMITED020th June 2019, ExaComm 2019

Overview of Fujitsu’s Computing Products

Fujitsu continues to develop general purpose computing products and new domain specific computing devices

1 Copyright 2019 FUJITSU LIMITED20th June 2019, ExaComm 2019

Domain Specific

Computing DevicesGeneral Purpose Computing Products

Mainframes

PC Clusters

SPARC Servers x86 Servers

Supercomputers

High Performance Computing

Servers w/ Fujitsu own processorDeep Learning

Combinatorial Problems

Quantum Computing

Role of Supercomputer in Fujitsu Products

Supercomputer is a technology driver for Fujitsu’s computing products in technologies including packaging and interconnect

2 Copyright 2019 FUJITSU LIMITED20th June 2019, ExaComm 2019

Mainframes

PC Clusters

SPARC Servers x86 Servers

Supercomputers

High Performance Computing

Servers w/ Fujitsu own processorDeep Learning

Combinatorial Problems

Quantum Computing

General Purpose Computing Products Domain Specific

Computing Devices

Development of Packaging Technology

Fujitsu has developed single-socket node and water-cooled supercomputer with 3D-stacked memory

Fugaku will integrate memory stacks into a CPU package


Single-Socket Node

K computer

FX10 FX100FX1HPC2500

3

Water-Cooling

3D-Stacked Memory

2.5D Package

2009 2012 20152003 2021

Fugaku

(This image is prototype)

Development of Interconnect Technology

Tofu Interconnect (Tofu1) for K computer

6D mesh/torus network, virtual torus rank mapping, Tofu Barrier

Tofu2 added new functions; atomic operations, cache injection

Tofu Interconnect D (TofuD) for Fugaku

Increased resources for high-density node configuration

Fault-resilience – dynamic packet slicing


2009 2012 20152003 2021

InfiniBandDTU

K computer

FX10 FX100FX1HPC2500

Fugaku

Tofu2Tofu1 TofuD

4

(This image is prototype)

6D Mesh/Torus Network

Virtual 3D-Torus Rank-mapping

Tofu Barrier

Characteristics of Torus Network

Features of the Tofu interconnect family

Copyright 2019 FUJITSU LIMITED20th June 2019, ExaComm 2019 5

6D Mesh/Torus Network

Six coordinate axes: X, Y, Z, A, B, C

X, Y, Z: the size varies according to the system configuration

A, B, C: the size is fixed to 2×3×2

Tofu stands for “torus fusion”: (X, Y, Z)×(A, B, C)


Z

XC

A

B

X×Y×Z×2×3×2

Y

6

Virtual 3D-Torus Rank-mapping

A rank-mapping option for topology-awareness

A 3D torus rank can be mapped to a 6D submesh even if there is an offline node

This fault tolerance contributes to the system availability


6D

submesh

0

1110 1

89 2

73 5

64

B Y0

7

1

6

2

5

3

4A

X

07

16

25

34

C

Z

7

Tofu Barrier

Tofu Barrier offloads Barrier and Allreduce communications

Barrier channel (BCH) is an interface

Barrier gate (BG) is a communication engine

Tofu barrier can execute an arbitrary communication algorithm

Recursive-doubling algorithm uses log2(n) of BGs in each process

Reduce-broadcast algorithm uses a maximum of 5 BGs in each process


Process 0

Process 1

Process 2

Process 3

Process 4

Process 5

Process 6

Process 7

Process 8

Process 9

BCH + start/end BG

Intermediate BG

8

Characteristics of Torus Network

All systems have the same order of bisection bandwidth

No significant performance difference in global data exchange

Torus networks have higher total injection bandwidth

Topology-aware communication such as nearest-neighbor data exchange results in higher performance


System Network Total Injection

Bandwidth (PB/s)

Bisection

Bandwidth (TB/s)

Blue Gene/Q Torus (5D) 1.97 49

K Computer Mesh/Torus (6D)

Virtual Torus (3D)

1.66 46

34

Sunway TaihuLight Tapered Fat-Tree 0.51 70

Piz Daint Dragonfly 0.07 36

Summit Fat-Tree 0.12 115

Oakforest-PACS Fat-Tree 0.10 102

40X

36X

7.3X

2.0X

1.0X

1.0X

9

High Density Node Configuration

Link Configuration and Injection Bandwidth

Packaging

Dynamic Packet Slicing

Increased Tofu Barrier Resources

The Design of TofuD


High Density Node Configuration

The processor die size gets smaller from FX100

The off-chip channels are halved

# memory stacks: 8 to 4

# high speed serial lanes for Tofu: 40 to 20

The area of Tofu interconnect shrinks to about 1/3 size


Fugaku – A64FX (7nm)

FX100 – SPARC64TM XIfx (20nm)

Tofu2

HMC

HMC

HMC

HMC

HMC

HMC

HMC

HMC

TofuD

HBM

HBM

HBM

HBM

High Density Node Configuration (cont.)

More resources are integrated into the CPU

# CPU Memory Groups (NUMA nodes): 2 to 4

The expected number of processes per node is also doubled

# Tofu Network Interfaces: 4 to 6

Provide more resources and accelerate collective communications


PCIe

TNI0

TNI1

TNI2

TNI3

To

fu N

etw

ork

Rou

terc c c c c c c c

c c c c c c c c

c

c c c c c c c c

c c c c c c c c

c

HMC HMC HMC HMC

HMC HMC HMC HMC

4 lan

es ×

10 p

ort

s

SPARC64 XIfxC M G

C M G Tofu2

PCIe

TNI0

NOC

cccc

c ccc

ccccc

cccc

cccc

ccc

cc

HBM2

cccc

c ccc

cc

ccc

HBM2

HBM2

cccc

cccc

ccc

cc

HBM2

TNI1

TNI2

TNI3

TNI4

TNI5

lan

es ×

10

po

rts

2la

ne

s ×

10

po

rts

To

fu N

etw

ork

Rou

ter

A64FXC M G C M G

C M G C M G TofuD

12

Copyright 2019 FUJITSU LIMITED

Data transfer rate increased from 25 Gbps to 28 Gbps

Link bandwidth reduced from 12.5 GB/s to 6.8 GB/s

TofuD simultaneously transmits in 6 directions

Increased from 4 directions in the case of Tofu1 and Tofu2

Total injection bandwidth per node is 40.8 GB/s

Approximately, twice that of Tofu1 or 80% that of Tofu2

Link Configuration and Injection Bandwidth

20th June 2019, ExaComm 2019

Tofu1 Tofu2 TofuD

Data rate (Gbps) 6.25 25.78125 28.05

Number of signal lanes per link 8 4 2

Link bandwidth (GB/s) 5.0 12.5 6.8

Number of TNIs per node 4 4 6

Injection bandwidth per node (GB/s) 20 50 40.8

13

Packaging – CPU Memory Unit (CMU)

Two CPUs connected with C-axis

X×Y×Z×A×B×C = 1×1×1×1×1×2

Two or three active optical cable (AOC) cages on the board

Each cable bundles two lanes of signals from each of the two CPUs


CPU

CPU

AOC (X)

AOC (Y)

AOC (Z)

AOC

AOC

14

Packaging – Rack Structure

Rack

8 shelves

192 CMUs (384 CPUs)

Shelf

24 CMUs (48 CPUs)

X×Y×Z×A×B×C = 1×1×4×2×3×2

Top or bottom half of rack

4 shelves

X×Y×Z×A×B×C = 2×2×4×2×3×2


Rack (prototype)

Shelves

15

Dynamic Packet Slicing – Split Mode

The physical layer of TofuD is independent for each lane

In the ordinary multi-lane transmission, the physical layer has media-independent interface and hides the number of signal lanes

A packet is sliced and each is injected into a different lane

The routing header of the packet is copied to both slices for virtual cut-through packet transfer

This is normal operation and is called split mode


Slice 0

Slice 1

Packet Routing

Header

Slice 0

Slice 1

16

Dynamic Packet Slicing – Duplicate Mode

When the error rate is high, the operation mode falls down to duplicate mode that duplicates packets

If the error rate returns to low, the link can return to split mode

Each lane is never disconnected independently

The error rates of both lanes are continuously monitored and fed back


Packet

Packet

Packet Routing

Header

Slice 0

Slice 1

Error rate feedback

17


The number of Tofu Barrier resources significantly increased

All 6 TNIs of TofuD have Tofu barrier

Only TNI #0 of Tofu1/2 has Tofu barrier

This change intended to support intra-node synchronization


Tofu1/2 TofuD

TNINumber of BCHs 8 16

Number of BGs 64 48

Node

Number of TNI w/ Tofu Barrier 1 6

Number of BCHs 8 96

Number of BGs 64 288

18

Put Latencies

Latency Breakdown

Injection Rates

Tofu Barrier

Performance Evaluations


Put Latencies

8B Put transfer between nodes on the same board

The low-latency features were used

Tofu2 reduced the Put latency by 0.20 μs from that of Tofu1

The cache injection feature contributed to this reduction

TofuD reduced the Put latency by 0.22 μs from that of Tofu2


Communication settings Latency

Tofu1 Descriptor on main memory 1.15 µs

Direct Descriptor 0.91 µs

Tofu2 Cache injection OFF 0.87 µs

Cache injection ON 0.71 µs

TofuD To/From far CMGs 0.54 µs

To/From near CMGs 0.49 µs

0.20 µs

0.22 µs

20

Latency Breakdown

The overhead increase in Tofu2 has been reducedCopyright 2019 FUJITSU LIMITED20th June 2019, ExaComm 2019

0

100

200

300

400

500

600

700

800

900

1000

Tofu1 Tofu2 TofuD

La

ten

cy (

ns

ec

)

Rx CPU

Rx Host bus

Rx TNI

Packet Transfer

Tx TNI

Tx Host bus

Tx CPU

Cache injection

Tx Optimization

Rx Optimization

Increased Overhead in Physical Layer

Overhead Reduced

21

Injection Rates per Node

Simultaneous Put transfers to the nearest-neighbor nodes

Tofu1 and Tofu2 used 4 TNIs, and TofuD used 6 TNIs

The efficiencies of Tofu1 were lower than 90%

Because of a bottleneck in the bus that connects CPU and ICC

The efficiencies of Tofu2 and TofuD exceeded 90 %

Integration into the processor chip removed the bottleneck


Injection rate Efficiency

Tofu1 (K) 15.0 GB/s 77 %

Tofu1 (FX10) 17.6 GB/s 88 %

Tofu2 45.8 GB/s 92 %

TofuD 38.1 GB/s 93 %

22

Tofu Barrier – Intra-Node Synchronization

The test program synchronized multiple BCHs in a node

For 8 and 16 BCHs, some TNIs are shared by multiple BCHs

Sharing TNI causes the serialization of BCHs/BGs processing

Two delay estimation approaches

One approach only considers the number of communication stages and the delays of BCH (0.48 μs) and BG (0.13 μs)

The other approach considers the serialization of BCH and BG processing using the same TNI


Number of BCHs 1 4 8 16 48

Number of used TNIs 1 4 6 6 6

Number of communication stages 2 2 4 6 9

Max. number of BCHs per TNI 1 1 2 3 8

Max. number of BGs per TNI 2 2 5 9 24

23

Tofu Barrier – Results

The simple estimate results were too low

The estimates considering the serialization of BCHs/BGs processing were consistent with the evaluation results

MPI needs to assign BCHs in the same synchronization group to different TNIs to avoid serialization penalties


0.0

1.0

2.0

3.0

4.0

5.0

1 4 16 64

Late

ncy (

μse

c)

Number of BCHs per node

Estimate considering serialization

Evaluation Result

Simple Estimate

un

de

restim

ate

24

Summary

Supercomputer is a technology driver for Fujitsu’s computing products in technologies including packaging and interconnect

TofuD is developed for Fugaku and designed for high-density node configuration and fault-resilience

The design of TofuD

Node configuration, injection rate and packaging

Dynamic Packet Slicing


The evaluation results of TofuD

Latency was 0.49 μs, which was reduced by 0.22 μs from that of Tofu2

Injection rate was 38.1 GB/s and the link efficiency exceeds 90%

The evaluation results of Tofu Barrier showed the serialization penalties which a proper MPI implementation can avoid


Download - The Tofu Interconnect D for Supercomputer Fugaku · Role of Supercomputer in Fujitsu Products Supercomputer is a technology driver for Fujitsu’s computing products in technologies

Top Related