Graph500: From Kepler to Pascal
Julien Loiseau, Michaël Krajecki, François Alin and Christophe Jaillet
GTC 2017
University of Reims Champagne Ardenne (URCA)
Multidisciplinary University
- About 27 000 students
- 5 campuses: Reims, Troyes, Charleville-Mézières, Chaumont and Châlons-en-Champagne
- A wide initial undergraduate studies program
- Graduate studies and PhD programs linked with research labs
Graph500: From Kepler to Pascal J. Loiseau et al. 1 / 19
HPC issues
Power efficiency
Exascale architecture
- Computational power: Petaflop → ×1000 → Exaflop
- Moore's law is over
- Energy efficiency: 8 MW → ×1000 → 8 GW ??
HPC issues
HPC Architectures
[Diagram: nodes made of CPU(s) + Accelerator(s), communicating through MPI]
- Accelerators: Xeon Phi, GPU, FPGA, ASIC
- Memory: in-core, out-of-core
- Other diagram labels: SSD, HDD, CPU/GPU
HPC issues
ROMEO, Reims, France
ROMEO supercomputer
- Reims, Champagne-Ardenne, France
- 130 nodes
  - 2 × CPU Intel Xeon E5-2650v2 2.6GHz (16GB RAM)
  - 2 × GPU NVIDIA K20Xm (6GB RAM)
- FatTree with InfiniBand
- 1 × DGX-1 node
HPC issues
ROMEO, Regional HPC Center
Its mission is to deliver, for both industrial and academic researchers:
- high performance computing resources
- secured storage spaces
- specific & scientific software
- advanced user support in exploiting these resources
- in-depth expertise in different engineering fields: HPC, applied mathematics, physics, biophysics and chemistry, ...
- Promote and diffuse HPC and simulation to companies / SMB
- identify, experiment and master breakthrough technologies
  - which give new opportunities for our users
  - from technology-watching to production
  - for all research domains
HPC issues
GPU Integration
- 2008: 1 server, Tesla S1070, 960 cores/U
- 2010: 10% of cluster nodes with GPU, Fermi - M2090
- 2013: 100% of cluster nodes with GPU, 260 K20x
TOP 500 & GREEN 500
2012
2015
2016
- 2016: DGX1 server, first server dedicated to Deep Learning, 8 GPU P100
HPC issues
Benchmarking HPC Architectures
How to compare the computing power of parallel architectures?
TOP500
- LINPACK
- Solving n equations with n unknowns
- "Regular"
FLoating-point Operations Per Second, FLOPS
GRAPH500
- BFS
- Large randomly generated graphs
- "Irregular"
Traversed Edges Per Second, TEPS
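The TEPS metric can be illustrated with a small sketch. It assumes TEPS is simply the number of edges traversed by one BFS divided by its run time, and that several runs are averaged with a harmonic mean, which is the usual choice for rates. The function names and the timings are hypothetical, not taken from the slides.

```python
def teps(traversed_edges: int, seconds: float) -> float:
    """Traversed edges per second for a single BFS run."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return traversed_edges / seconds


def harmonic_mean(values):
    """Harmonic mean, the natural way to average rates such as TEPS."""
    values = list(values)
    return len(values) / sum(1.0 / v for v in values)


# Hypothetical timings (seconds) for three BFS runs over the same 2**20 edges.
runs = [teps(2**20, t) for t in (0.5, 1.0, 2.0)]
mean_gteps = harmonic_mean(runs) / 1e9  # GTEPS = 1e9 TEPS
```

The harmonic mean gives slow runs more weight than an arithmetic mean would, so a single fast BFS cannot hide several slow ones.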
GRAPH500
Protocol and ranking
Graph algorithms:
- Irregular memory access
- Irregular communications
- No heavy computation step
Data dense application
Steps
- Graph generation (SKG, RMAT)
- Randomly sample 64 unique root vertices
- Structure generation
- For each root vertex:
  - BFS
  - Validate BFS tree
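The root sampling and validation steps can be sketched as follows. This is a simplified illustration, not the reference validator: it assumes an undirected edge list, a parent array where -1 means "not reached", and roots restricted to vertices with at least one edge. All function names are hypothetical.

```python
import random


def sample_roots(degrees, count=64, seed=0):
    """Pick up to `count` distinct root vertices that have at least one edge."""
    candidates = [v for v, d in enumerate(degrees) if d > 0]
    return random.Random(seed).sample(candidates, min(count, len(candidates)))


def tree_levels(parent, root):
    """Depth of every reached vertex, or None if `parent` is not a tree."""
    n = len(parent)
    level = [-1] * n
    level[root] = 0
    for v in range(n):
        if parent[v] == -1:
            continue
        path, u = [], v
        while level[u] == -1:
            if parent[u] == -1 or len(path) > n:
                return None  # dangling parent or cycle
            path.append(u)
            u = parent[u]
        for depth, w in enumerate(reversed(path), start=level[u] + 1):
            level[w] = depth
    return level


def validate_bfs_tree(edges, parent, root):
    """Simplified checks on a BFS parent array."""
    if parent[root] != root:
        return False
    level = tree_levels(parent, root)
    if level is None:
        return False
    edge_set = {frozenset(e) for e in edges}
    for v, p in enumerate(parent):
        # Every tree edge must be an edge of the graph.
        if p != -1 and v != root and frozenset((v, p)) not in edge_set:
            return False
    for u, v in edges:
        # An edge cannot link a reached vertex to an unreached one.
        if (level[u] == -1) != (level[v] == -1):
            return False
        # In a BFS tree, neighbours are at most one level apart.
        if level[u] != -1 and abs(level[u] - level[v]) > 1:
            return False
    return True
```

For example, on the cycle 0-1-2-3-0 the parent array `[0, 0, 1, 0]` passes, while `[0, 0, 1, 2]` is a valid spanning tree but not a BFS tree, because the edge (0, 3) would join levels 0 and 3.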
GRAPH500
Protocol and ranking
Goal: Breadth First Search on random graph:
[Figure: BFS from a source vertex, with the vertices grouped into Level 1, Level 2 and Level 3]
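A minimal sequential sketch of the level structure produced by BFS, assuming the graph is given as a dictionary of adjacency lists. The graph and the function name are hypothetical and only serve to show how levels are assigned.

```python
from collections import deque


def bfs_levels(adjacency, source):
    """Return the BFS level of every vertex reachable from `source`."""
    level = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in level:  # first visit fixes the level
                level[v] = level[u] + 1
                queue.append(v)
    return level


# Hypothetical 5-vertex graph: the source is 0, vertex 4 ends up at level 3.
graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
levels = bfs_levels(graph, 0)
```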
GRAPH500
Problem Scale
- 2^SCALE ⇒ vertices
- 2^(SCALE+4) ⇒ edges
Problem size   Scale   Memory (TB)
Toy            26      0.0172
Mini           29      0.1374
Small          32      1.0995
Medium         36      17.5922
Large          39      140.7375
Huge           42      1125.8999
- For graph generation
- Converted before use
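The sizes above can be reproduced with a short calculation. The sketch assumes 16 bytes per generated edge (two 64-bit vertex identifiers) and decimal terabytes; under these assumptions the result matches the Memory column for every problem class.

```python
EDGE_FACTOR = 16     # 2**(SCALE+4) edges = 16 edges per vertex
BYTES_PER_EDGE = 16  # assumption: two 64-bit vertex ids per edge


def problem_size(scale):
    """Return (vertices, edges, edge-list memory in TB) for a given SCALE."""
    vertices = 2 ** scale
    edges = EDGE_FACTOR * vertices
    memory_tb = edges * BYTES_PER_EDGE / 1e12
    return vertices, edges, memory_tb


classes = [("Toy", 26), ("Mini", 29), ("Small", 32),
           ("Medium", 36), ("Large", 39), ("Huge", 42)]
for name, scale in classes:
    _, _, tb = problem_size(scale)
    print(f"{name:<7}{scale:>3}{tb:>12.4f}")
```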
- Graph500 current list (Nov. 2016)
Best CPU & GPU machines:
Name               Scale   GTEPS
(1) K computer     40      38621.4
(2) Sunway         40      23755.7
(3) Sequoia        41      23751
...                ...     ...
(31) TSUBAME 2.0   35      462.25
(39) GSIC Center   35      317.09
(43) HA-PACS       32      223.634
- BlueGene ⇒ 19 in top 30
- GPU ⇒ NVIDIA
GRAPH500
Graph generation
[Figure: adjacency matrix (From nodes × To nodes) recursively divided into four quadrants a, b, c, d]
Sparse Graph
- Kronecker
- Generation: a = 0.57 b = 0.19 c = 0.19 d = 0.05
- Edges: 16× more than vertices
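The recursive quadrant choice behind this generation can be sketched as below. This is a bare R-MAT style sampler using the probabilities a, b, c, d given above. It assumes a is the top-left quadrant and d the bottom-right one, and it leaves out refinements that a full generator may apply, such as vertex relabelling. Function names are hypothetical.

```python
import random


def rmat_edge(scale, rng, a=0.57, b=0.19, c=0.19):
    """Draw one edge by picking a quadrant at each of `scale` levels."""
    src = dst = 0
    for _ in range(scale):
        r = rng.random()
        src <<= 1
        dst <<= 1
        if r < a:            # quadrant a: top-left, no bit set
            pass
        elif r < a + b:      # quadrant b: top-right
            dst |= 1
        elif r < a + b + c:  # quadrant c: bottom-left
            src |= 1
        else:                # quadrant d: bottom-right (probability 0.05)
            src |= 1
            dst |= 1
    return src, dst


def rmat_edges(scale, edge_factor=16, seed=0):
    """Generate edge_factor * 2**scale edges over 2**scale vertices."""
    rng = random.Random(seed)
    return [rmat_edge(scale, rng) for _ in range(edge_factor * 2 ** scale)]


edges = rmat_edges(scale=4)  # 16 vertices, 256 edges
```

Because a is much larger than d, low-numbered vertices receive far more edges than high-numbered ones, which gives the skewed degree distribution that makes the traversal irregular.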
GRAPH500
Data structure format
Structure format
→ Bitmap
- Natural representation
- 2^20 vertices = 128GB
- BG/Q version
→ CSR/CSC (Compressed Sparse Rows/Columns)
- Compressed format
- 2^20 vertices < 1GB
- BG/P and GPU version
Compressed Sparse Row
[Figure: a sparse matrix stored as two arrays, row pointers and column indices]
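A small sketch of the Compressed Sparse Row layout, with a row-pointer array and a column-index array. The 4 × 4 matrix is a hypothetical example chosen for readability, not the one drawn on the slide.

```python
def to_csr(matrix):
    """Convert a dense 0/1 adjacency matrix to (row_ptr, col_idx)."""
    row_ptr = [0]
    col_idx = []
    for row in matrix:
        col_idx.extend(j for j, value in enumerate(row) if value)
        row_ptr.append(len(col_idx))  # end of this row = start of the next
    return row_ptr, col_idx


def neighbours(row_ptr, col_idx, u):
    """Neighbours of vertex u are one contiguous slice of col_idx."""
    return col_idx[row_ptr[u]:row_ptr[u + 1]]


matrix = [
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
]
row_ptr, col_idx = to_csr(matrix)
```

Storage grows with the number of edges rather than with the square of the number of vertices, which is why the compressed format fits where the bitmap does not. Applying the same routine to the transposed matrix gives the CSC layout.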
GRAPH500
Parallel algorithm
- Split the adjacency matrix into blocks
A0,0 A0,1 A0,2 A0,3
A1,0 A1,1 A1,2 A1,3
A2,0 A2,1 A2,2 A2,3
A3,0 A3,1 A3,2 A3,3
- Generate output queue
- Share input queue
- Repeat until end
- l × l machines with l = 2^k (k ∈ N)
- Machine Mi,j ⇔ block Ai,j
→ Vertices "In" Ri
→ Vertices "Out" Rj
- Predecessors, 1D distribution: Mi,j gets 1/4 of the vertices in Ri
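The block layout can be sketched as a mapping from an edge to the grid coordinates of the machine that stores it. The sketch assumes the vertex count is divisible by l × l and that vertices are split into l contiguous ranges. The helper names are hypothetical, and which range plays the "In" or "Out" role is not settled here.

```python
def block_owner(u, v, n_vertices, l):
    """Grid coordinates (i, j) of the machine holding matrix entry (u, v)."""
    block = n_vertices // l  # vertices per range R_i
    return u // block, v // block


def vertex_range(i, n_vertices, l):
    """Vertices belonging to range R_i."""
    block = n_vertices // l
    return range(i * block, (i + 1) * block)


def predecessor_slice(i, j, n_vertices, l):
    """Part of R_i whose predecessor entries machine (i, j) keeps (1/l of R_i)."""
    block = n_vertices // l
    part = block // l
    start = i * block + j * part
    return range(start, start + part)


# Hypothetical setup: 32 vertices on a 4 x 4 grid, so each machine keeps 1/4 of R_i.
owner = block_owner(9, 30, n_vertices=32, l=4)
mine = list(predecessor_slice(1, 2, n_vertices=32, l=4))
```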
GRAPH500
Exploration
CSR
- Top-down
- in_queue → out_queue
- Search for children: from the current frontier toward vertices not yet visited
CSC
- Bottom-up
- out_queue & visited ← in_queue
- Search for parents: from vertices not yet visited toward the current frontier
Iteration   Top-down     Bottom-up    Hybrid version
0           27           22 090 111   27
1           8 156        1 568 798    8 156
2           3 695 684    587 893      587 893
3           19 565 465   12 586       12 586
4           214 578      8 256        8 256
5           5 865        1 201        1 201
6           12           156          12
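The two exploration directions and their combination can be sketched sequentially as below. In the table, the hybrid column takes the smaller of the two costs at each iteration. The switching rule used here (go bottom-up once the frontier exceeds a fraction of the vertices) is a hypothetical stand-in, since the slides do not give the actual criterion. `out_adj` plays the role of the CSR structure and `in_adj` that of the CSC structure.

```python
def top_down_step(out_adj, frontier, parent):
    """Frontier vertices search for unvisited children."""
    next_frontier = set()
    for u in frontier:
        for v in out_adj[u]:
            if parent[v] == -1:
                parent[v] = u
                next_frontier.add(v)
    return next_frontier


def bottom_up_step(in_adj, frontier, parent):
    """Unvisited vertices search for a parent in the current frontier."""
    next_frontier = set()
    for v in range(len(parent)):
        if parent[v] != -1:
            continue
        for u in in_adj[v]:
            if u in frontier:
                parent[v] = u
                next_frontier.add(v)
                break  # one parent is enough
    return next_frontier


def hybrid_bfs(out_adj, in_adj, source, threshold=0.05):
    """BFS that picks a direction per iteration; returns the parent array."""
    n = len(out_adj)
    parent = [-1] * n
    parent[source] = source
    frontier = {source}
    while frontier:
        # Hypothetical heuristic: a large frontier favours bottom-up.
        if len(frontier) > threshold * n:
            frontier = bottom_up_step(in_adj, frontier, parent)
        else:
            frontier = top_down_step(out_adj, frontier, parent)
    return parent
```

For an undirected graph the same adjacency lists can be passed for both arguments, for example `hybrid_bfs(adj, adj, 0)`. The early exit in the bottom-up step is what makes it cheap when most neighbours of an unvisited vertex are already in the frontier.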
Results and prospects
Performance analysis
CPU/GPU Comparison
- one CPU or GPU
- Different graph scales
[Plot: CPU/GPU Comparison. GTEPS (0 to 3) versus SCALE (16 to 21) for GTX 970, GTX 780 Ti, K20X, CPU E5-2650v2, TX1 and CPU GRAPH500]
Results and prospects
Scalability
[Plot: weak scaling. GTEPS (0 to 14) versus #GPU (1, 4, 16, 64), CPU and GPU curves, with labels 21, 23, 25, 27]
[Plot: strong scaling (SCALE=21). GTEPS (0 to 6) versus #GPU (1, 4, 16, 64), CPU and GPU curves]
- ROMEO → 105th (Nov. 2016)
Results and prospects
P100 GPU
P100 GPU
- Pascal Architecture
- Several improvements
- Base component of DGX-1
Communication, NVLink
- DGX-1, Power 8
- 40 GB/s bidirectional
⇒ Advantage for graph
Kepler vs Pascal
Product            K20X         Tesla P100
Arch               Kepler       Pascal
GPU                GK110        GP100
SMs                14           56           | More concurrent blocks
FP32/SM            192          64
FP32/GPU           2688         3584         | More concurrent threads
FP64/SM            64           32
FP64/GPU           896          1792
Base Clock         732 MHz      1328 MHz
FP32 GFLOPs        3950         10600
FP64 GFLOPs        1310         5300
Memory Interface   384b GDDR5   4096b HBM2
Memory Size        6 GB         16 GB        | 2.6× more mem.
L2 Cache Size      1536 KB      4096 KB
Register/SM        256 KB       256 KB       | same register size
Register/GPU       3584 KB      14336 KB
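The per-GPU rows of this comparison follow from the per-SM rows multiplied by the SM count, which can be checked with a few lines. The dictionary keys are hypothetical names and the values are copied from the table.

```python
specs = {
    "K20X": {"sms": 14, "fp32_per_sm": 192, "fp64_per_sm": 64,
             "reg_kb_per_sm": 256, "mem_gb": 6},
    "Tesla P100": {"sms": 56, "fp32_per_sm": 64, "fp64_per_sm": 32,
                   "reg_kb_per_sm": 256, "mem_gb": 16},
}


def per_gpu(spec):
    """Scale the per-SM resources by the number of SMs."""
    return {
        "fp32": spec["sms"] * spec["fp32_per_sm"],
        "fp64": spec["sms"] * spec["fp64_per_sm"],
        "reg_kb": spec["sms"] * spec["reg_kb_per_sm"],
    }


totals = {name: per_gpu(spec) for name, spec in specs.items()}
memory_ratio = specs["Tesla P100"]["mem_gb"] / specs["K20X"]["mem_gb"]
```

Each P100 SM has fewer cores than a K20X SM, but four times as many SMs with the same register file per SM means four times the total register space.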
Kepler vs Pascal
ROMEO Supercomputer
- 105th of GRAPH500 (November 2016 list)
Benchmark for architectures and accelerators
- Machine MESCA: 12TB of RAM + 8 sockets (256 threads)
- FPGA: Intel partnership
- Xeon Phi: Knights Landing ?
- IBM OpenPower: new communications device (NVLINK)
Applications
- Social networks
- Management of electric networks
- Big data and deep learning