HPC and AI Engineering at the Innovation Lab
TRANSCRIPT
HPC and AI Solution Overview
Garima Kochhar
HPC and AI Innovation Lab
Dell EMC HPC and DL team charter
New investment: more SMEs, huge innovation ecosystem
Design, develop and integrate HPC and DL systems
• Flexible reference architectures
• Systems tuned for research computing, manufacturing, life sciences, oil and gas, etc.
Act as the focal point for joint R&D activities
• Technology collaboration with partners for joint innovation
• Research coordination with DSC, COEs and customers
Prototype and evaluate advanced technologies
• HPC + Cloud, HPC + Big Data
• NVMe, FPGAs, containers, DL/ML workloads, etc.
Conduct application performance studies and develop best practices
• White papers, blogs, presentations
• www.hpcatdell.com
Technical briefings, tours, remote access
World-class infrastructure in the Innovation Lab
Zenith
• TOP500-class system based on the Intel Scalable System Framework (OPA, KNL, Xeon)
• 324 nodes with Intel Xeon Gold 6148F processors, Omni-Path fabric and 655 TF sustained performance
• +160 Intel Xeon Phi (KNL) servers; 805 TF combined performance
• #292 on the TOP500, 1.4 PFlop/s theoretical peak
• Isilon H600, F800
Rattler
• Research/development system with Mellanox, NVIDIA and Bright Computing
• 88 nodes with Intel Xeon Gold 6148 processors and EDR InfiniBand
• 16 nodes with Intel Xeon Gold 6148 processors and 4 V100 GPUs each
13K ft² lab, 1,300+ servers, ~10 PB storage dedicated to HPC, in collaboration with the community
Focus areas
• HPC software stack: Bright Cluster Manager, OpenHPC; integration of all software components
• Compute performance and tuning: application focus plus BIOS, memory, interconnect; accelerators and co-processors; different workloads
• Interconnect performance and tuning
• Storage solutions: Isilon, NSS, Lustre
• Vertical solutions: genomics research, CFD/manufacturing
• Proof-of-concept studies: containers, FPGAs, NVMe-oF, etc.
Compute performance and tuning
Careful memory configuration - Skylake
• Unbalanced configurations hurt performance badly: with 16 DIMMs, memory bandwidth drops to 0.35x-0.65x of the 12-DIMM value, depending on the CPU SKU
• Balanced and near-balanced configurations are ideal for HPC (see the channel arithmetic sketch after the charts)
[Chart: Impact of an unbalanced 512 GB memory configuration. STREAM Triad GB/s (higher is better) and 16-DIMM bandwidth relative to 12 DIMMs, by processor SKU (8176, 6142, 5120, 4114, 3106); series: 384 GB (12x32GB), 512 GB (16x32GB), and relative performance of 16x32GB, which ranges from 0.35 to 0.65 across the SKUs.]
[Chart: Near-balanced configurations, 288 GB and 576 GB. Relative memory bandwidth (higher is better) for balanced 12x16GB (192 GB), 12x32GB (384 GB) and 24x32GB (768 GB) populations versus near-balanced 12x16GB + 12x32GB (576 GB) and 12x8GB + 12x16GB (288 GB) populations, on Xeon Gold 6138 and 6142 with 2666 MT/s memory.]
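The balanced, near-balanced and unbalanced labels follow from Skylake-SP's six memory channels per socket (12 channels in a two-socket server): bandwidth is even only when every channel carries the same number of DIMMs, ideally of the same capacity. A minimal sketch of that channel arithmetic, for illustration only (the classify helper and example populations are assumptions, not a Dell tool):

    # Skylake-SP: 2 sockets x 6 memory channels per socket = 12 channels total.
    CHANNELS = 2 * 6

    def classify(dimms_per_channel, mixed_capacities=False):
        """Classify a DIMM population by how evenly it spreads across channels."""
        if len(set(dimms_per_channel)) > 1:
            return "unbalanced"        # some channels carry more DIMMs than others
        return "near-balanced" if mixed_capacities else "balanced"

    # 12 x 32GB = 384 GB: one DIMM on every channel
    print(classify([1] * CHANNELS))                          # -> balanced
    # 16 x 32GB = 512 GB: only 4 of the 12 channels get a second DIMM
    print(classify([2] * 4 + [1] * 8))                       # -> unbalanced
    # 12 x 16GB + 12 x 32GB = 576 GB: two DIMMs per channel, mixed capacities
    print(classify([2] * CHANNELS, mixed_capacities=True))   # -> near-balanced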
ANSYS® Fluent®—Two Socket System Performance
• Skylake provides significantly better performance for ANSYS Fluent than Broadwell; the gain depends on the benchmark dataset, roughly 1.2x to 1.4x for the Gold 6150.
[Chart: ANSYS Fluent v17.2—aircraft_wing_14m. Performance and performance per core relative to the E5-2697A v4, across IVB, HSW, BDW and SKX processors: E5-2697 v2 (12c, 2.7 GHz, 130W, 1866 MT/s), E5-2660 v3 (10c, 2.6/2.2 GHz, 105W, 2133 MT/s), E5-2697 v3 (14c, 2.6/2.2 GHz, 145W, 2133 MT/s), E5-2697A v4 (16c, 2.6/2.2 GHz, 145W, 2400 MT/s), E5-2697 v4 (18c, 2.3/2.0 GHz, 145W, 2400 MT/s), 6130 (16c, 2.1 GHz, 125W), 6136 (12c, 3.0 GHz, 150W), 6142 (16c, 2.6 GHz, 150W), 6148 (20c, 2.4 GHz, 150W), 6150 (18c, 2.7 GHz, 165W) and 8168 (24c, 2.7 GHz, 205W), the Skylake systems with 2666 MT/s memory.]
AMD EPYC - WRF multi-node tests
• The EPYC 7601 is ~3% faster than the 7551; its base frequency is 10% higher and its turbo frequency 6% higher.
[Charts: WRF conus 12km and conus 2.5km. Relative performance (higher is better) versus core count, from 64 to 1024 cores (1 to 16 EPYC servers), for EPYC 7551 (2.0 GHz) and EPYC 7601 (2.2 GHz), compared against linear scaling.]
Interconnects
EDR Latency w/ c-states
[Chart: SKL EDR latency, C-states enabled vs. disabled. Latency (us) versus message size (0 to 512 bytes) for four cases: C-states enabled with a switch, C-states enabled back-to-back, C-states disabled with a switch, C-states disabled back-to-back. SKL = Intel Xeon Gold 6142 @ 2.6 GHz, 16 cores.]
• Latency is essentially the same with C-states enabled and disabled.
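Numbers like these come from a point-to-point ping-pong microbenchmark; the lab's results were presumably gathered with a native tool such as the OSU osu_latency benchmark rather than the rough Python/mpi4py sketch below, which only illustrates the measurement pattern (Python overhead makes its absolute latencies higher than the chart's):

    # Ping-pong one-way latency sketch with mpi4py; run with exactly two ranks,
    # e.g.:  mpirun -np 2 -host nodeA,nodeB python pingpong.py  (hypothetical hosts)
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    iters, skip = 10000, 100                      # warm-up iterations are not timed

    for size in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]:
        buf = np.zeros(size, dtype=np.uint8)
        comm.Barrier()
        for i in range(iters + skip):
            if i == skip:
                t0 = MPI.Wtime()
            if rank == 0:
                comm.Send(buf, dest=1, tag=0)
                comm.Recv(buf, source=1, tag=0)
            elif rank == 1:
                comm.Recv(buf, source=0, tag=0)
                comm.Send(buf, dest=0, tag=0)
        if rank == 0:
            one_way_us = (MPI.Wtime() - t0) / iters / 2 * 1e6
            print(f"{size:4d} bytes: {one_way_us:.2f} us")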
ANSYS® Fluent®—Small Model Scaling
• With a small dataset, Skylake's performance advantage shrinks at scale but remains positive.
• A model of this size is unlikely to be run on more than a few nodes.
[Chart: ANSYS Fluent v17.2—ice_2m. Solver rating (higher is better) versus number of cores (nodes), from 36 (1) to 576 (16), for C6320 + E5-2697 v4 + EDR and C6420 + 6150 + EDR; performance relative to the E5-2697 v4 falls from about 1.25 on one node to about 1.15 on 16 nodes.]
(For reference, Fluent's solver rating is a throughput metric: conventionally the number of benchmark jobs that could be completed in a 24-hour day, i.e. 86,400 divided by the wall-clock seconds per run.)
GPGPU and accelerators
AMBER
Configuration B: V100 vs. P100, +79%
Configuration K: V100 vs. P100, +80%
• The V100 is significantly faster: 1x V100 outperforms 4x P100
• SXM2 is better than PCIe: ~9% for the P100, up to 30% for the V100
• The SXM2 module runs at a slightly higher frequency than the PCIe card
[Chart: AMBER STMV throughput in ns/day (higher is better) with 1, 2 and 4 GPUs, for the 13G C4130 with P100 (PCIe and SXM2) and the 14G C4140 with V100 (PCIe and SXM2), Configurations B and K.]
HPC storage solutions: NSS, IEEL, Isilon, proofs of concept
Dell Storage for HPC with Intel EE for Lustre Solution: a turn-key solution designed for high-speed scratch storage
Solution benefits & Dell differentiation
• Parallel, scalable file system based on Intel EE for Lustre software
• Single file system namespace, scalable to high capacities and performance
• Best practices developed by Dell HPC Engineering provide optimal performance on Dell hardware
• Tests yield peaks of roughly 15 GB/s write and 17 GB/s read per building block
• Lustre Distributed Namespace (DNE) distributes Lustre sub-directories across multiple MDTs to increase metadata capacity and performance
• Share data with other file systems using an optional NFS/CIFS gateway
• Dell Networking 10/40GbE, InfiniBand or Omni-Path
[Diagram: MDS pair of Dell PowerEdge R730 servers (active/passive) with 12 Gbps SAS failover connections to a Dell PowerVault MD3420, plus a second MD3420 (optional, for DNE); OSS pair of Dell PowerEdge R730 servers (active/active) with 12 Gbps SAS failover connections to a Dell PowerVault MD3460; Intel Manager for Lustre on a PowerEdge R630.]
IEEL 3.0 + OPA
Support for Isilon for scalable NFS: sequential performance (N-N), write
Deep Learning
TensorFlow+Horovod on multiple V100 nodes
[Chart: TensorFlow+Horovod ResNet-50 on multiple nodes, throughput and speedup]
GPUs       Images/sec   Speedup
1 V100     338          1.0
2 V100     589          1.7
4 V100     1115         3.3
8 V100     2173         6.4
12 V100    3205         9.5
16 V100    4205         12.5
• TensorFlow+Horovod scales well across nodes: 12.5x speedup with 16 V100 GPUs (4 nodes)
• InfiniBand verbs (ibverbs) and MPI are used for inter-node communication
• 4 nodes with V100-PCIe GPUs were used
• FP32 precision, batch size 128 per GPU
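A minimal sketch of the data-parallel pattern behind these results, using Horovod with tf.keras. This is illustrative only: it uses the newer tf.keras API rather than the exact benchmark scripts behind the slide, and the random dataset is a placeholder for real training data.

    # One MPI process per GPU, launched across the nodes with horovodrun/mpirun.
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()

    # Pin this process to its local GPU.
    gpus = tf.config.list_physical_devices('GPU')
    if gpus:
        tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

    # Placeholder input pipeline; the slide's benchmark trained ResNet-50 in FP32
    # with a batch size of 128 per GPU.
    images = tf.random.uniform([1024, 224, 224, 3])
    labels = tf.random.uniform([1024], maxval=1000, dtype=tf.int32)
    dataset = tf.data.Dataset.from_tensor_slices((images, labels)).repeat().batch(128)

    model = tf.keras.applications.ResNet50(weights=None, classes=1000)

    # Scale the learning rate by the number of workers and wrap the optimizer so
    # Horovod averages gradients across all GPUs/nodes every step.
    opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
    model.compile(optimizer=opt, loss='sparse_categorical_crossentropy')

    model.fit(dataset,
              steps_per_epoch=100,
              epochs=1,
              callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
              verbose=1 if hvd.rank() == 0 else 0)

Launched, for example, as horovodrun -np 16 -H node1:4,node2:4,node3:4,node4:4 python train.py (hypothetical host names); Horovod then carries the per-step gradient exchange over MPI/ibverbs as noted above.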
Open Source Frameworks
TensorFlow, MXNet, CNTK, Theano, Torch, Caffe/Caffe2 …
Neural Network Libraries
MLPython, CaffeOnSpark, cuDNN, cuBLAS, NCCL, Keras, GIE …
Ready Bundle for Deep Learning - NVIDIA
Platform:                     C4140
Processor:                    2 x Intel Xeon Gold 6148
Memory:                       384 GB DDR4 @ 2400 MHz
Drives:                       2 x 200 GB 1.8" SSD
Network:                      Mellanox ConnectX-5 VPI (EDR 100 Gb/s)
GPU:                          4 x V100-16GB SXM2
Software & firmware [reference]:
Operating system:             RHEL 7.4 x86_64
Provisioning and management:  Bright Cluster Manager 8.0
Tying it together: access to the lab, white papers and blogs
Recent Publications
• Design Principles for HPC
• 14G with Skylake – how much better for HPC?
• BIOS characterization for HPC with Intel Skylake processor
• HPCG Performance study with Intel Skylake processors
• Skylake memory study
• Performance study of four socket PowerEdge R940 Server with Intel Skylake processors
• Dell EMC HPC Systems - SKY is the limit
• Entering a new arena in Computing - KNL
• System benchmark results on KNL – STREAM and HPL
• HPCG Performance study with Intel KNL
• NAMD Performance Analysis on Skylake Architecture
• LAMMPS Four Node Comparative Performance Analysis on Skylake Processors
• De novo assembly with PowerEdge R940
• Dell EMC HPC System for Life Science v1.1
• HPC Applications Performance on V100
• Application Performance on P100-PCIe GPUs
• Containerizing HPC Applications with Singularity
• Performance of LS-DYNA on Singularity Containers
• Scaling Deep Learning on Multiple V100 Nodes
• Deep Learning on V100
• Deep Learning Inference on P40 vs P4 with Skylake
• Deep Learning Inference on P40 GPUs
• Deep Learning Performance with Intel Caffe – Training, CPU model choice and Scalability
• Deep Learning Performance on R740 with V100 PCIe GPUs
• Getting Started With OpenHPC
• Dell EMC Isilon F800 and F600 I/O Performance
• Dell EMC Isilon F800 and H600 Whole Genome Analysis Performance
• Digital Manufacturing with 14G
www.dellhpc.org