Networks: Topologies and How to Design Them
Gilad Shainer, [email protected]
2
TOP500 Statistics
4
World Leading Large-Scale Systems
• National Supercomputing Centre in Shenzhen
  – Fat-tree, 5.2K nodes, 120K cores, NVIDIA GPUs, China (Petaflop)
• Tokyo Institute of Technology
  – Fat-tree, 4K nodes, NVIDIA GPUs, Japan (Petaflop)
• Commissariat a l'Energie Atomique (CEA)
  – Fat-tree, 4K nodes, 140K cores, France (Petaflop)
• Los Alamos National Lab – Roadrunner
  – Fat-tree, 4K nodes, 130K cores, USA (Petaflop)
• NASA
  – Hypercube, 9.2K nodes, 82K cores, USA
• Jülich JuRoPA
  – Fat-tree, 3K nodes, 30K cores, Germany
• Sandia National Labs – Red Sky
  – 3D-Torus, 5.4K nodes, 43K cores, USA
5
ORNL “Spider” System – Lustre File System
• Oak Ridge National Lab central storage system
  – 13400 drives
  – 192 Lustre OSS servers
  – 240GB/s bandwidth
  – InfiniBand interconnect
  – 10PB capacity
6
Network Topologies
• Fat-tree (CLOS), Mesh, and 3D-Torus topologies
• CLOS (fat-tree)
  – Can be fully non-blocking (1:1) or blocking (x:1)
  – Typically enables the best performance
    • Non-blocking bandwidth, lowest network latency
• Mesh or 3D Torus
  – Blocking network, cost-effective for systems at scale
  – Great performance for applications with locality
  – Support for dedicated sub-networks
  – Simple expansion for future growth
[Figure: 3x3 2D torus, nodes labeled (0,0) through (2,2)]
7
d-Dimensional Torus Topology
• Formal definition – T=(V,E) is said to be a d-dimensional torus of size N1xN2x…xNd if:
  • V = {(v1, v2, …, vd) : 0 ≤ vi ≤ Ni − 1}
  • E = {(u,v) : there exists j s.t. 1) for each i≠j, vi = ui, AND 2) vj = (uj ± 1) mod Nj}
• Examples
  [Figure: a ring of 5 nodes, 0–4 (N1=5), and a 3x3 torus with nodes (0,0) through (2,2) (N1=N2=3)]
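The definition above translates directly into code. A minimal sketch (the function name `torus_edges` is my own) that builds the edge set from the +1 neighbour in each dimension and reproduces both examples:

```python
from itertools import product

def torus_edges(dims):
    """Build the node and edge sets of a d-dimensional torus of size
    N1 x ... x Nd: u and v are adjacent iff they differ in exactly one
    coordinate j, by +/-1 modulo Nj.  Assumes each Ni >= 3 so the +1
    and -1 neighbours in a dimension are distinct."""
    nodes = list(product(*(range(n) for n in dims)))
    edges = set()
    for u in nodes:
        for j, nj in enumerate(dims):
            v = list(u)
            v[j] = (u[j] + 1) % nj  # +1 neighbour in dimension j
            edges.add(frozenset((u, tuple(v))))  # undirected edge
    return nodes, edges

# The two examples above: a ring of 5 nodes (N1=5) and a 3x3 torus.
ring_nodes, ring_edges = torus_edges((5,))
grid_nodes, grid_edges = torus_edges((3, 3))
print(len(ring_edges))  # 5 edges in the 5-node ring
print(len(grid_edges))  # 18 edges in the 3x3 torus (every node has degree 4)
```

Each undirected edge is stored once as a frozenset, so the +1 sweep alone covers both directions.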
8
3D-Torus System – Key Items
• Multiple server nodes per cube junction
• The smaller the 3D cube, the better
  – Lower latency between remote nodes
  – Minimized throughput contention
• Ability to connect storage
• Support for separate networks
  – Dedicated network links for specific applications/usage
  – Example: links dedicated to collectives or to specific jobs
9
InfiniBand 3D Torus
10
Routing for 3D Torus (Avoiding Deadlocks)
• Setting up routing might look simple
  – Just route packets on the shortest path between source and destination
• In lossless networks, trivial routing can be disastrous
• Communication pairs (on the 5-node ring below)
  1. 0→2
  2. 1→3
  3. 2→4
  4. 3→0
  5. 4→1
  [Figure: 5-node ring, nodes 0–4, showing the five shortest paths]
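Why these five pairs deadlock can be checked mechanically: build the channel-dependency graph (each path holds one link's buffer while requesting the next) and look for a cycle. A minimal sketch, with function names and the clockwise tie-breaking being my own assumptions:

```python
def shortest_ring_path(src, dst, n):
    """Shortest path on an n-node ring, as a list of directed links."""
    fwd = (dst - src) % n
    step = 1 if fwd <= n - fwd else -1  # ties broken clockwise
    path, cur = [], src
    while cur != dst:
        nxt = (cur + step) % n
        path.append((cur, nxt))
        cur = nxt
    return path

def has_cycle(deps):
    """Detect a cycle in the channel-dependency graph (recursive DFS,
    fine at this scale)."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}
    def visit(c):
        color[c] = GRAY
        for d in deps.get(c, ()):
            if color.get(d, WHITE) == GRAY:
                return True  # back edge: a dependency cycle exists
            if color.get(d, WHITE) == WHITE and visit(d):
                return True
        color[c] = BLACK
        return False
    return any(color.get(c, WHITE) == WHITE and visit(c) for c in list(deps))

# The five communication pairs from the slide, on a 5-node ring.
pairs = [(0, 2), (1, 3), (2, 4), (3, 0), (4, 1)]
deps = {}
for s, d in pairs:
    links = shortest_ring_path(s, d, 5)
    for a, b in zip(links, links[1:]):  # holding link a while waiting for b
        deps.setdefault(a, set()).add(b)
print(has_cycle(deps))  # True: shortest-path routing here can deadlock
```

The five clockwise paths chain all five clockwise links into one dependency ring, which is exactly the cycle the slide warns about.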
11
Avoiding Deadlock – Restrictive Approach
• Idea
  – Define a set of rules forbidding the use of some resources, or a (temporal) combination of resources, that will guarantee freedom from deadlock
  – Design a routing scheme complying with the rules
  [Figure: 5-node ring, nodes 0–4, with the restricted link usage]
12
Avoiding Deadlock – Separation Approach
• Idea
  – Decompose each (unidirectional) physical link into several logical channels with private buffer resources
  – Use the logical channels to separate the network into virtual networks, each dependency-cycle-free
  – Assign communication pairs (with their paths) to the virtual networks
• Back to our ring (nodes 0–4)
  – Routing: shortest path
  – Virtual mapping: if a path uses the 0→4 or 4→0 link, map it to the red virtual network; otherwise, to the black one
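The red/black split can be verified the same way: once paths crossing the 0↔4 link get their own virtual network, each per-network dependency graph is acyclic. A hedged sketch (function names are illustrative):

```python
from collections import defaultdict

def ring_path(src, dst, n=5):
    """Shortest (clockwise-biased) path on the ring as directed links."""
    step = 1 if (dst - src) % n <= (src - dst) % n else -1
    path, cur = [], src
    while cur != dst:
        path.append((cur, (cur + step) % n))
        cur = (cur + step) % n
    return path

def acyclic(deps):
    """Kahn's algorithm: True iff the dependency graph has no cycle."""
    nodes = set(deps) | {d for ds in deps.values() for d in ds}
    indeg = {v: 0 for v in nodes}
    for ds in deps.values():
        for d in ds:
            indeg[d] += 1
    queue = [v for v in nodes if indeg[v] == 0]
    seen = 0
    while queue:
        v = queue.pop()
        seen += 1
        for d in deps.get(v, ()):
            indeg[d] -= 1
            if indeg[d] == 0:
                queue.append(d)
    return seen == len(nodes)

pairs = [(0, 2), (1, 3), (2, 4), (3, 0), (4, 1)]
vnet_deps = {"red": defaultdict(set), "black": defaultdict(set)}
for s, d in pairs:
    links = ring_path(s, d)
    # the slide's rule: paths crossing the 4->0 (or 0->4) link go to "red"
    vnet = "red" if (4, 0) in links or (0, 4) in links else "black"
    for a, b in zip(links, links[1:]):
        vnet_deps[vnet][a].add(b)
print(acyclic(vnet_deps["red"]), acyclic(vnet_deps["black"]))  # True True
```

Pairs 3→0 and 4→1 land in the red network, the other three in black; each network's dependency chain now terminates instead of wrapping around.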
13
InfiniBand 3D Torus
• The InfiniBand subnet manager includes routing engines for
  – Fat-tree: min-hop, up/down, etc.
  – 3D Torus: dimension-ordered routing
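Dimension-ordered routing itself is simple to sketch: correct one coordinate at a time, in a fixed dimension order, taking the shorter way around each ring. A minimal illustration (my own code, not the subnet manager's implementation; on a torus the wrap-around links still need the virtual-channel separation from the previous slides to stay deadlock-free):

```python
def dor_path(src, dst, dims):
    """Dimension-ordered routing on a torus: fix the first coordinate,
    then the second, then the third, each along the shorter ring arc."""
    path, cur = [], list(src)
    for j, nj in enumerate(dims):
        # shorter direction around ring j; ties broken in the + direction
        step = 1 if (dst[j] - cur[j]) % nj <= (cur[j] - dst[j]) % nj else -1
        while cur[j] != dst[j]:
            cur[j] = (cur[j] + step) % nj
            path.append(tuple(cur))
    return path

# hops from (0,0,0) to (2,3,1) on an 8x8x8 torus: x first, then y, then z
print(dor_path((0, 0, 0), (2, 3, 1), (8, 8, 8)))
```

Because every source/destination pair corrects dimensions in the same order, traffic never turns from a higher dimension back into a lower one, which is what removes the inter-dimension dependency cycles.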
14
Mixed Topologies
• Fat-tree topology provides the best-performing solution
• 3D-Torus can be more cost-effective, easier to scale, and a good fit for applications with locality
• Mixed topology
  – System connected as a 3D Torus
  – Fast fat-tree for collective operations
  [Figure: 3x3 torus with nodes (0,0) through (2,2)]
15
Notes
• For the following Fat-Tree network configurations
  – Flat network
  – No unused ports
  – Two layers of switch fabric (L1 and L2)
• For the following 3D Torus configurations
  – Each 3D Torus junction is a 36-port switch
  – Switch counts refer to 36-port switches
• InfiniBand is a great interconnect technology for flat connectivity of thousands to tens of thousands of servers in future mega warehouse data centers
16
Example: Non-blocking, Fat-Tree, 40Gb/s
• Each L1 switch: 18 ports down to servers, 18 ports up to the non-blocking L2 network
• 648 L1 36-port switches, 18 L2 648-port switches
• Total: 11664 servers (nodes)
• Throughput: 40Gb/s to the node
17
Example: 2:1 Oversubscription, Fat-Tree, 40Gb/s
• Each L1 switch: 24 ports down to servers, 12 ports up to the L2 network
• 648 L1 36-port switches, 12 L2 648-port switches
• Total: 15552 servers (nodes)
• Throughput: 20Gb/s to the node
18
Example: 3:1 Oversubscription, Fat-Tree, 40Gb/s
• Each L1 switch: 27 ports down to servers, 9 ports up to the L2 network
• 648 L1 36-port switches, 9 L2 648-port switches
• Total: 17496 servers (nodes)
• Throughput: ~13Gb/s to the node
19
Example: 8:1 Oversubscription, Fat-Tree, 40Gb/s
• Each L1 switch: 32 ports down to servers, 4 ports up to the L2 network
• 324 L1 36-port switches, 2 L2 648-port switches
• Total: 10368 servers (nodes)
• Throughput: 5Gb/s to the node
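The four sizing examples follow one formula: servers = L1 switches × down-ports per switch, L2 switches = total uplinks ÷ 648, and per-node bandwidth = 40Gb/s × up/down. A small calculator reproducing the slides' numbers (the function name is illustrative):

```python
def fat_tree(down, up, l1_switches, link_gbps=40, l2_ports=648):
    """Two-level fat-tree sizing, matching the slide examples: 36-port L1
    switches with `down` server ports and `up` uplink ports each, and
    648-port L2 switches absorbing all the uplinks."""
    assert down + up == 36, "each L1 switch has 36 ports"
    servers = l1_switches * down
    uplinks = l1_switches * up
    l2_switches = uplinks // l2_ports
    per_node_gbps = link_gbps * up / down  # oversubscribed share per node
    return servers, l2_switches, per_node_gbps

for down, up, l1 in [(18, 18, 648), (24, 12, 648), (27, 9, 648), (32, 4, 324)]:
    servers, l2, bw = fat_tree(down, up, l1)
    print(f"{down}:{up} split -> {servers} servers, {l2} L2 switches, "
          f"{bw:.0f} Gb/s per node")
```

Running it gives exactly the slide figures: 11664/18/40, 15552/12/20, 17496/9/~13, and 10368/2/5.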
20
Example: 3D Torus
• 3D Torus switch junction: 120Gb/s in each of the six directions (+x, −x, +y, −y, +z, −z)
• 18 servers (nodes) per junction, 40Gb/s each
• 3D Torus size: 8x8x8 (512 36-port switches)
• Total number of servers: 9216
21
3D Torus Connections Example
[Figure: one 36-port switch junction – ports labeled +x, −x, +y, −y, +z, −z carry the torus links, and the server ports connect compute nodes Node 0 through Node 16 plus one I/O node (ION 1)]
22
Choosing the Right Topology
• Performance: Fat-tree
  – Application locality? 3D Torus becomes an option
  – Multiple users/applications? Fat-tree
  – Non-blocking? Fat-tree
• Cost
  – Depends on the size of the system
  – Very large systems can be more cost-effective with a 3D Torus
• Future expansion? A 3D Torus is easier to expand
23
Network Offloading
• Transport offloads – critical for CPU efficiency
• Congestion avoidance – must be done in the network
• Application offloads (MPI offloading)
  – For example: MPI collectives offloads
[Chart, lower is better: software MPI loses performance beyond 20% CPU availability for computation; collectives-offload-based MPI keeps over 80% CPU availability for computation without any performance loss]
24
Thank You
www.hpcadvisorycouncil.com