20th June 2019
Yuichiro Ajima
Fujitsu Limited
The Tofu Interconnect D for
Supercomputer Fugaku
Copyright 2019 FUJITSU LIMITED020th June 2019, ExaComm 2019
Overview of Fujitsu’s Computing Products
Fujitsu continues to develop general purpose computing products and new domain specific computing devices
1 Copyright 2019 FUJITSU LIMITED20th June 2019, ExaComm 2019
Domain Specific
Computing DevicesGeneral Purpose Computing Products
Mainframes
PC Clusters
SPARC Servers x86 Servers
Supercomputers
High Performance Computing
Servers w/ Fujitsu own processorDeep Learning
Combinatorial Problems
Quantum Computing
Role of Supercomputer in Fujitsu Products
Supercomputer is a technology driver for Fujitsu’s computing products in technologies including packaging and interconnect
2 Copyright 2019 FUJITSU LIMITED20th June 2019, ExaComm 2019
Mainframes
PC Clusters
SPARC Servers x86 Servers
Supercomputers
High Performance Computing
Servers w/ Fujitsu own processorDeep Learning
Combinatorial Problems
Quantum Computing
General Purpose Computing Products Domain Specific
Computing Devices
Development of Packaging Technology
Fujitsu has developed single-socket node and water-cooled supercomputer with 3D-stacked memory
Fugaku will integrate memory stacks into a CPU package
Copyright 2019 FUJITSU LIMITED20th June 2019, ExaComm 2019
Single-Socket Node
K computer
FX10 FX100FX1HPC2500
3
Water-Cooling
3D-Stacked Memory
2.5D Package
2009 2012 20152003 2021
Fugaku
(This image is prototype)
Development of Interconnect Technology
Tofu Interconnect (Tofu1) for K computer
6D mesh/torus network, virtual torus rank mapping, Tofu Barrier
Tofu2 added new functions; atomic operations, cache injection
Tofu Interconnect D (TofuD) for Fugaku
Increased resources for high-density node configuration
Fault-resilience – dynamic packet slicing
Copyright 2019 FUJITSU LIMITED20th June 2019, ExaComm 2019
2009 2012 20152003 2021
InfiniBandDTU
K computer
FX10 FX100FX1HPC2500
Fugaku
Tofu2Tofu1 TofuD
4
(This image is prototype)
6D Mesh/Torus Network
Virtual 3D-Torus Rank-mapping
Tofu Barrier
Characteristics of Torus Network
Features of the Tofu interconnect family
Copyright 2019 FUJITSU LIMITED20th June 2019, ExaComm 2019 5
6D Mesh/Torus Network
Six coordinate axes: X, Y, Z, A, B, C
X, Y, Z: the size varies according to the system configuration
A, B, C: the size is fixed to 2×3×2
Tofu stands for “torus fusion”: (X, Y, Z)×(A, B, C)
Copyright 2019 FUJITSU LIMITED20th June 2019, ExaComm 2019
Z
XC
A
B
X×Y×Z×2×3×2
Y
6
Virtual 3D-Torus Rank-mapping
A rank-mapping option for topology-awareness
A 3D torus rank can be mapped to a 6D submesh even if there is an offline node
This fault tolerance contributes to the system availability
Copyright 2019 FUJITSU LIMITED20th June 2019, ExaComm 2019
6D
submesh
0
1110 1
89 2
73 5
64
B Y0
7
1
6
2
5
3
4A
X
07
16
25
34
C
Z
7
Tofu Barrier
Tofu Barrier offloads Barrier and Allreduce communications
Barrier channel (BCH) is an interface
Barrier gate (BG) is a communication engine
Tofu barrier can execute an arbitrary communication algorithm
Recursive-doubling algorithm uses log2(n) of BGs in each process
Reduce-broadcast algorithm uses a maximum of 5 BGs in each process
Copyright 2019 FUJITSU LIMITED20th June 2019, ExaComm 2019
Process 0
Process 1
Process 2
Process 3
Process 4
Process 5
Process 6
Process 7
Process 8
Process 9
BCH + start/end BG
Intermediate BG
8
Characteristics of Torus Network
All systems have the same order of bisection bandwidth
No significant performance difference in global data exchange
Torus networks have higher total injection bandwidth
Topology-aware communication such as nearest-neighbor data exchange results in higher performance
Copyright 2019 FUJITSU LIMITED20th June 2019, ExaComm 2019
System Network Total Injection
Bandwidth (PB/s)
Bisection
Bandwidth (TB/s)
Blue Gene/Q Torus (5D) 1.97 49
K Computer Mesh/Torus (6D)
Virtual Torus (3D)
1.66 46
34
Sunway TaihuLight Tapered Fat-Tree 0.51 70
Piz Daint Dragonfly 0.07 36
Summit Fat-Tree 0.12 115
Oakforest-PACS Fat-Tree 0.10 102
40X
36X
7.3X
2.0X
1.0X
1.0X
9
High Density Node Configuration
Link Configuration and Injection Bandwidth
Packaging
Dynamic Packet Slicing
Increased Tofu Barrier Resources
The Design of TofuD
Copyright 2019 FUJITSU LIMITED20th June 2019, ExaComm 2019 10
High Density Node Configuration
The processor die size gets smaller from FX100
The off-chip channels are halved
# memory stacks: 8 to 4
# high speed serial lanes for Tofu: 40 to 20
The area of Tofu interconnect shrinks to about 1/3 size
Copyright 2019 FUJITSU LIMITED20th June 2019, ExaComm 2019 11
Fugaku – A64FX (7nm)
FX100 – SPARC64TM XIfx (20nm)
Tofu2
HMC
HMC
HMC
HMC
HMC
HMC
HMC
HMC
TofuD
HBM
HBM
HBM
HBM
High Density Node Configuration (cont.)
More resources are integrated into the CPU
# CPU Memory Groups (NUMA nodes): 2 to 4
The expected number of processes per node is also doubled
# Tofu Network Interfaces: 4 to 6
Provide more resources and accelerate collective communications
Copyright 2019 FUJITSU LIMITED20th June 2019, ExaComm 2019
PCIe
TNI0
TNI1
TNI2
TNI3
To
fu N
etw
ork
Rou
terc c c c c c c c
c c c c c c c c
c
c c c c c c c c
c c c c c c c c
c
HMC HMC HMC HMC
HMC HMC HMC HMC
4 lan
es ×
10 p
ort
s
SPARC64 XIfxC M G
C M G Tofu2
PCIe
TNI0
NOC
cccc
c ccc
ccccc
cccc
cccc
ccc
cc
HBM2
cccc
c ccc
cc
ccc
HBM2
HBM2
cccc
cccc
ccc
cc
HBM2
TNI1
TNI2
TNI3
TNI4
TNI5
lan
es ×
10
po
rts
2la
ne
s ×
10
po
rts
To
fu N
etw
ork
Rou
ter
A64FXC M G C M G
C M G C M G TofuD
12
Copyright 2019 FUJITSU LIMITED
Data transfer rate increased from 25 Gbps to 28 Gbps
Link bandwidth reduced from 12.5 GB/s to 6.8 GB/s
TofuD simultaneously transmits in 6 directions
Increased from 4 directions in the case of Tofu1 and Tofu2
Total injection bandwidth per node is 40.8 GB/s
Approximately, twice that of Tofu1 or 80% that of Tofu2
Link Configuration and Injection Bandwidth
20th June 2019, ExaComm 2019
Tofu1 Tofu2 TofuD
Data rate (Gbps) 6.25 25.78125 28.05
Number of signal lanes per link 8 4 2
Link bandwidth (GB/s) 5.0 12.5 6.8
Number of TNIs per node 4 4 6
Injection bandwidth per node (GB/s) 20 50 40.8
13
Packaging – CPU Memory Unit (CMU)
Two CPUs connected with C-axis
X×Y×Z×A×B×C = 1×1×1×1×1×2
Two or three active optical cable (AOC) cages on the board
Each cable bundles two lanes of signals from each of the two CPUs
Copyright 2019 FUJITSU LIMITED20th June 2019, ExaComm 2019
CPU
CPU
AOC (X)
AOC (Y)
AOC (Z)
AOC
AOC
14
Packaging – Rack Structure
Rack
8 shelves
192 CMUs (384 CPUs)
Shelf
24 CMUs (48 CPUs)
X×Y×Z×A×B×C = 1×1×4×2×3×2
Top or bottom half of rack
4 shelves
X×Y×Z×A×B×C = 2×2×4×2×3×2
Copyright 2019 FUJITSU LIMITED20th June 2019, ExaComm 2019
Rack (prototype)
Shelves
15
Dynamic Packet Slicing – Split Mode
The physical layer of TofuD is independent for each lane
In the ordinary multi-lane transmission, the physical layer has media-independent interface and hides the number of signal lanes
A packet is sliced and each is injected into a different lane
The routing header of the packet is copied to both slices for virtual cut-through packet transfer
This is normal operation and is called split mode
Copyright 2019 FUJITSU LIMITED20th June 2019, ExaComm 2019
Slice 0
Slice 1
Packet Routing
Header
Slice 0
Slice 1
16
Dynamic Packet Slicing – Duplicate Mode
When the error rate is high, the operation mode falls down to duplicate mode that duplicates packets
If the error rate returns to low, the link can return to split mode
Each lane is never disconnected independently
The error rates of both lanes are continuously monitored and fed back
Copyright 2019 FUJITSU LIMITED20th June 2019, ExaComm 2019
Packet
Packet
Packet Routing
Header
Slice 0
Slice 1
Error rate feedback
17
Increased Tofu Barrier Resources
The number of Tofu Barrier resources significantly increased
All 6 TNIs of TofuD have Tofu barrier
Only TNI #0 of Tofu1/2 has Tofu barrier
This change intended to support intra-node synchronization
Copyright 2019 FUJITSU LIMITED20th June 2019, ExaComm 2019
Tofu1/2 TofuD
TNINumber of BCHs 8 16
Number of BGs 64 48
Node
Number of TNI w/ Tofu Barrier 1 6
Number of BCHs 8 96
Number of BGs 64 288
18
Put Latencies
Latency Breakdown
Injection Rates
Tofu Barrier
Performance Evaluations
Copyright 2019 FUJITSU LIMITED20th June 2019, ExaComm 2019 19
Put Latencies
8B Put transfer between nodes on the same board
The low-latency features were used
Tofu2 reduced the Put latency by 0.20 μs from that of Tofu1
The cache injection feature contributed to this reduction
TofuD reduced the Put latency by 0.22 μs from that of Tofu2
Copyright 2019 FUJITSU LIMITED20th June 2019, ExaComm 2019
Communication settings Latency
Tofu1 Descriptor on main memory 1.15 µs
Direct Descriptor 0.91 µs
Tofu2 Cache injection OFF 0.87 µs
Cache injection ON 0.71 µs
TofuD To/From far CMGs 0.54 µs
To/From near CMGs 0.49 µs
0.20 µs
0.22 µs
20
Latency Breakdown
The overhead increase in Tofu2 has been reducedCopyright 2019 FUJITSU LIMITED20th June 2019, ExaComm 2019
0
100
200
300
400
500
600
700
800
900
1000
Tofu1 Tofu2 TofuD
La
ten
cy (
ns
ec
)
Rx CPU
Rx Host bus
Rx TNI
Packet Transfer
Tx TNI
Tx Host bus
Tx CPU
Cache injection
Tx Optimization
Rx Optimization
Increased Overhead in Physical Layer
Overhead Reduced
21
Injection Rates per Node
Simultaneous Put transfers to the nearest-neighbor nodes
Tofu1 and Tofu2 used 4 TNIs, and TofuD used 6 TNIs
The efficiencies of Tofu1 were lower than 90%
Because of a bottleneck in the bus that connects CPU and ICC
The efficiencies of Tofu2 and TofuD exceeded 90 %
Integration into the processor chip removed the bottleneck
Copyright 2019 FUJITSU LIMITED20th June 2019, ExaComm 2019
Injection rate Efficiency
Tofu1 (K) 15.0 GB/s 77 %
Tofu1 (FX10) 17.6 GB/s 88 %
Tofu2 45.8 GB/s 92 %
TofuD 38.1 GB/s 93 %
22
Tofu Barrier – Intra-Node Synchronization
The test program synchronized multiple BCHs in a node
For 8 and 16 BCHs, some TNIs are shared by multiple BCHs
Sharing TNI causes the serialization of BCHs/BGs processing
Two delay estimation approaches
One approach only considers the number of communication stages and the delays of BCH (0.48 μs) and BG (0.13 μs)
The other approach considers the serialization of BCH and BG processing using the same TNI
Copyright 2019 FUJITSU LIMITED20th June 2019, ExaComm 2019
Number of BCHs 1 4 8 16 48
Number of used TNIs 1 4 6 6 6
Number of communication stages 2 2 4 6 9
Max. number of BCHs per TNI 1 1 2 3 8
Max. number of BGs per TNI 2 2 5 9 24
23
Tofu Barrier – Results
The simple estimate results were too low
The estimates considering the serialization of BCHs/BGs processing were consistent with the evaluation results
MPI needs to assign BCHs in the same synchronization group to different TNIs to avoid serialization penalties
Copyright 2019 FUJITSU LIMITED20th June 2019, ExaComm 2019
0.0
1.0
2.0
3.0
4.0
5.0
1 4 16 64
Late
ncy (
μse
c)
Number of BCHs per node
Estimate considering serialization
Evaluation Result
Simple Estimate
un
de
restim
ate
24
Summary
Supercomputer is a technology driver for Fujitsu’s computing products in technologies including packaging and interconnect
TofuD is developed for Fugaku and designed for high-density node configuration and fault-resilience
The design of TofuD
Node configuration, injection rate and packaging
Dynamic Packet Slicing
Increased Tofu Barrier Resources
The evaluation results of TofuD
Latency was 0.49 μs, which was reduced by 0.22 μs from that of Tofu2
Injection rate was 38.1 GB/s and the link efficiency exceeds 90%
The evaluation results of Tofu Barrier showed the serialization penalties which a proper MPI implementation can avoid
Copyright 2019 FUJITSU LIMITED20th June 2019, ExaComm 2019 25
Copyright 2019 FUJITSU LIMITED26