HPC and AI Engineering at the Innovation Lab
TRANSCRIPT
HPC and AI Solution Overview
Garima Kochhar
HPC and AI Innovation Lab
Dell EMC HPC and DL team charter
New investment: more SMEs, huge innovation ecosystem
Design, develop and integrate HPC and DL systems
• Flexible reference architectures
• Systems tuned for research computing, manufacturing, life sciences, oil and gas, etc.
Act as the focal point for joint R&D activities
• Technology collaboration with partners for joint innovation
• Research coordination with DSC, COEs and customers
Prototype and evaluate advanced technologies
• HPC + Cloud, HPC + Big Data
• NVMe, FPGAs, containers, DL/ML workloads, etc.
Conduct application performance studies and develop best practices
• White papers, blogs, presentations
• www.hpcatdell.com
Technical briefings, tours, remote access
World-class infrastructure in the Innovation Lab
Zenith
• TOP500-class system based on the Intel Scalable System Framework (OPA, KNL, Xeon)
• 324 nodes with Intel Xeon Gold 6148F processors, Omni-Path fabric and 655 TF sustained performance
• +160 Intel Xeon Phi (KNL) servers; 805 TF combined performance
• #292 on the TOP500, 1.4 PFlop/s theoretical peak
• Isilon H600, F800
Rattler
• Research/development system with Mellanox, NVIDIA and Bright Computing
• 88 nodes with Intel Xeon Gold 6148 processors and EDR InfiniBand
• 16 nodes with Intel Xeon Gold 6148 processors and 4 V100 GPUs each
13K ft² lab, 1,300+ servers, ~10 PB storage dedicated to HPC, in collaboration with the community
Focus areas
• HPC software stack: Bright Cluster Manager, OpenHPC; integration of all software components
• Compute performance and tuning: application focus plus BIOS, memory, interconnect; accelerators and co-processors; different workloads
• Interconnect performance and tuning
• Storage solutions: Isilon, NSS, Lustre
• Vertical solutions: genomics research, CFD/manufacturing
• Proof-of-concept studies: containers, FPGAs, NVMe-oF, etc.
Compute performance and tuning
Careful memory configuration - Skylake
• Unbalanced configurations hurt performance badly: with 16 DIMMs, memory bandwidth drops to 0.35x-0.65x of the 12-DIMM value, depending on the CPU SKU
• Balanced and near-balanced configurations are ideal for HPC (see the channel arithmetic sketch after the charts)
[Chart: Impact of an unbalanced 512 GB memory configuration. STREAM Triad GB/s (higher is better) and 16-DIMM bandwidth relative to 12 DIMMs, by processor SKU (8176, 6142, 5120, 4114, 3106); series: 384 GB (12x32GB), 512 GB (16x32GB), and relative performance of 16x32GB, which ranges from 0.35 to 0.65 across the SKUs.]
[Chart: Near-balanced configurations, 288 GB and 576 GB. Relative memory bandwidth (higher is better) for balanced 12x16GB (192 GB), 12x32GB (384 GB) and 24x32GB (768 GB) populations versus near-balanced 12x16GB + 12x32GB (576 GB) and 12x8GB + 12x16GB (288 GB) populations, on Xeon Gold 6138 and 6142 with 2666 MT/s memory.]
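The balanced, near-balanced and unbalanced labels follow from Skylake-SP's six memory channels per socket (12 channels in a two-socket server): bandwidth is even only when every channel carries the same number of DIMMs, ideally of the same capacity. A minimal sketch of that channel arithmetic, for illustration only (the classify helper and example populations are assumptions, not a Dell tool):

    # Skylake-SP: 2 sockets x 6 memory channels per socket = 12 channels total.
    CHANNELS = 2 * 6

    def classify(dimms_per_channel, mixed_capacities=False):
        """Classify a DIMM population by how evenly it spreads across channels."""
        if len(set(dimms_per_channel)) > 1:
            return "unbalanced"        # some channels carry more DIMMs than others
        return "near-balanced" if mixed_capacities else "balanced"

    # 12 x 32GB = 384 GB: one DIMM on every channel
    print(classify([1] * CHANNELS))                          # -> balanced
    # 16 x 32GB = 512 GB: only 4 of the 12 channels get a second DIMM
    print(classify([2] * 4 + [1] * 8))                       # -> unbalanced
    # 12 x 16GB + 12 x 32GB = 576 GB: two DIMMs per channel, mixed capacities
    print(classify([2] * CHANNELS, mixed_capacities=True))   # -> near-balanced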
ANSYS® Fluent®—Two Socket System Performance
• Skylake provides significantly better performance for ANSYS Fluent than Broadwell; the gain depends on the benchmark dataset, roughly 1.2x to 1.4x for the Gold 6150.
[Chart: ANSYS Fluent v17.2—aircraft_wing_14m. Performance and performance per core relative to the E5-2697A v4, across IVB, HSW, BDW and SKX processors: E5-2697 v2 (12c, 2.7 GHz, 130W, 1866 MT/s), E5-2660 v3 (10c, 2.6/2.2 GHz, 105W, 2133 MT/s), E5-2697 v3 (14c, 2.6/2.2 GHz, 145W, 2133 MT/s), E5-2697A v4 (16c, 2.6/2.2 GHz, 145W, 2400 MT/s), E5-2697 v4 (18c, 2.3/2.0 GHz, 145W, 2400 MT/s), 6130 (16c, 2.1 GHz, 125W), 6136 (12c, 3.0 GHz, 150W), 6142 (16c, 2.6 GHz, 150W), 6148 (20c, 2.4 GHz, 150W), 6150 (18c, 2.7 GHz, 165W) and 8168 (24c, 2.7 GHz, 205W), the Skylake systems with 2666 MT/s memory.]
AMD EPYC - WRF multi-node tests
• The EPYC 7601 is ~3% faster than the 7551; its base frequency is 10% higher and its turbo frequency 6% higher.
[Charts: WRF conus 12km and conus 2.5km. Relative performance (higher is better) versus core count, from 64 to 1024 cores (1 to 16 EPYC servers), for EPYC 7551 (2.0 GHz) and EPYC 7601 (2.2 GHz), compared against linear scaling.]
Interconnects
EDR Latency w/ c-states
[Chart: SKL EDR latency, C-states enabled vs. disabled. Latency (us) versus message size (0 to 512 bytes) for four cases: C-states enabled with a switch, C-states enabled back-to-back, C-states disabled with a switch, C-states disabled back-to-back. SKL = Intel Xeon Gold 6142 @ 2.6 GHz, 16 cores.]
• Latency is essentially the same with C-states enabled and disabled.
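Numbers like these come from a point-to-point ping-pong microbenchmark; the lab's results were presumably gathered with a native tool such as the OSU osu_latency benchmark rather than the rough Python/mpi4py sketch below, which only illustrates the measurement pattern (Python overhead makes its absolute latencies higher than the chart's):

    # Ping-pong one-way latency sketch with mpi4py; run with exactly two ranks,
    # e.g.:  mpirun -np 2 -host nodeA,nodeB python pingpong.py  (hypothetical hosts)
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    iters, skip = 10000, 100                      # warm-up iterations are not timed

    for size in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]:
        buf = np.zeros(size, dtype=np.uint8)
        comm.Barrier()
        for i in range(iters + skip):
            if i == skip:
                t0 = MPI.Wtime()
            if rank == 0:
                comm.Send(buf, dest=1, tag=0)
                comm.Recv(buf, source=1, tag=0)
            elif rank == 1:
                comm.Recv(buf, source=0, tag=0)
                comm.Send(buf, dest=0, tag=0)
        if rank == 0:
            one_way_us = (MPI.Wtime() - t0) / iters / 2 * 1e6
            print(f"{size:4d} bytes: {one_way_us:.2f} us")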
ANSYS® Fluent®—Small Model Scaling
• With a small dataset, Skylake's performance advantage shrinks at scale but remains positive.
• A model of this size is unlikely to be run on more than a few nodes.
[Chart: ANSYS Fluent v17.2—ice_2m. Solver rating (higher is better) versus number of cores (nodes), from 36 (1) to 576 (16), for C6320 + E5-2697 v4 + EDR and C6420 + 6150 + EDR; performance relative to the E5-2697 v4 falls from about 1.25 on one node to about 1.15 on 16 nodes.]
(For reference, Fluent's solver rating is a throughput metric: conventionally the number of benchmark jobs that could be completed in a 24-hour day, i.e. 86,400 divided by the wall-clock seconds per run.)
GPGPU and accelerators
AMBER
Configuration B: V100 vs. P100, +79%
Configuration K: V100 vs. P100, +80%
• The V100 is significantly faster: 1x V100 outperforms 4x P100
• SXM2 is better than PCIe: ~9% for the P100, up to 30% for the V100
• The SXM2 module runs at a slightly higher frequency than the PCIe card
[Chart: AMBER STMV throughput in ns/day (higher is better) with 1, 2 and 4 GPUs, for the 13G C4130 with P100 (PCIe and SXM2) and the 14G C4140 with V100 (PCIe and SXM2), Configurations B and K.]
HPC storage solutions: NSS, IEEL, Isilon, proofs of concept
Dell Storage for HPC with Intel EE for Lustre Solution: a turn-key solution designed for high-speed scratch storage
Solution benefits & Dell differentiation
• Parallel, scalable file system based on Intel EE for Lustre software
• Single file system namespace, scalable to high capacities and performance
• Best practices developed by Dell HPC Engineering provide optimal performance on Dell hardware
• Tests yield peaks of roughly 15 GB/s write and 17 GB/s read per building block
• Lustre Distributed Namespace (DNE) distributes Lustre sub-directories across multiple MDTs to increase metadata capacity and performance
• Share data with other file systems using an optional NFS/CIFS gateway
• Dell Networking 10/40GbE, InfiniBand or Omni-Path
[Diagram: MDS pair of Dell PowerEdge R730 servers (active/passive) with 12 Gbps SAS failover connections to a Dell PowerVault MD3420, plus a second MD3420 (optional, for DNE); OSS pair of Dell PowerEdge R730 servers (active/active) with 12 Gbps SAS failover connections to a Dell PowerVault MD3460; Intel Manager for Lustre on a PowerEdge R630.]
IEEL 3.0 + OPA
Support for Isilon for scalable NFS: sequential performance (N-N), write
Deep Learning
TensorFlow+Horovod on multiple V100 nodes
[Chart: TensorFlow+Horovod ResNet-50 on multiple nodes, throughput and speedup]
GPUs       Images/sec   Speedup
1 V100     338          1.0
2 V100     589          1.7
4 V100     1115         3.3
8 V100     2173         6.4
12 V100    3205         9.5
16 V100    4205         12.5
• TensorFlow+Horovod scales well across nodes: 12.5x speedup with 16 V100 GPUs (4 nodes)
• InfiniBand verbs (ibverbs) and MPI are used for inter-node communication
• 4 nodes with V100-PCIe GPUs were used
• FP32 precision, batch size 128 per GPU
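A minimal sketch of the data-parallel pattern behind these results, using Horovod with tf.keras. This is illustrative only: it uses the newer tf.keras API rather than the exact benchmark scripts behind the slide, and the random dataset is a placeholder for real training data.

    # One MPI process per GPU, launched across the nodes with horovodrun/mpirun.
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()

    # Pin this process to its local GPU.
    gpus = tf.config.list_physical_devices('GPU')
    if gpus:
        tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

    # Placeholder input pipeline; the slide's benchmark trained ResNet-50 in FP32
    # with a batch size of 128 per GPU.
    images = tf.random.uniform([1024, 224, 224, 3])
    labels = tf.random.uniform([1024], maxval=1000, dtype=tf.int32)
    dataset = tf.data.Dataset.from_tensor_slices((images, labels)).repeat().batch(128)

    model = tf.keras.applications.ResNet50(weights=None, classes=1000)

    # Scale the learning rate by the number of workers and wrap the optimizer so
    # Horovod averages gradients across all GPUs/nodes every step.
    opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
    model.compile(optimizer=opt, loss='sparse_categorical_crossentropy')

    model.fit(dataset,
              steps_per_epoch=100,
              epochs=1,
              callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
              verbose=1 if hvd.rank() == 0 else 0)

Launched, for example, as horovodrun -np 16 -H node1:4,node2:4,node3:4,node4:4 python train.py (hypothetical host names); Horovod then carries the per-step gradient exchange over MPI/ibverbs as noted above.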
Open Source Frameworks
TensorFlow, MXNet, CNTK, Theano, Torch, Caffe/Caffe2 …
Neural Network Libraries
MLPython, CaffeOnSpark, cuDNN, cuBLAS, NCCL, Keras, GIE …
Ready Bundle for Deep Learning - NVIDIA
Platform:                     C4140
Processor:                    2 x Intel Xeon Gold 6148
Memory:                       384 GB DDR4 @ 2400 MHz
Drives:                       2 x 200 GB 1.8" SSD
Network:                      Mellanox ConnectX-5 VPI (EDR 100 Gb/s)
GPU:                          4 x V100-16GB SXM2
Software & firmware [reference]:
Operating system:             RHEL 7.4 x86_64
Provisioning and management:  Bright Cluster Manager 8.0
Tying it together: access to the lab, white papers and blogs
Recent Publications
• Design Principles for HPC
• 14G with Skylake – how much better for HPC?
• BIOS characterization for HPC with Intel Skylake processor
• HPCG Performance study with Intel Skylake processors
• Skylake memory study
• Performance study of four socket PowerEdge R940 Server with Intel Skylake processors
• Dell EMC HPC Systems - SKY is the limit
• Entering a new arena in Computing - KNL
• System benchmark results on KNL – STREAM and HPL
• HPCG Performance study with Intel KNL
• NAMD Performance Analysis on Skylake Architecture
• LAMMPS Four Node Comparative Performance Analysis on Skylake Processors
• De novo assembly with PowerEdge R940
• Dell EMC HPC System for Life Science v1.1
• HPC Applications Performance on V100
• Application Performance on P100-PCIe GPUs
• Containerizing HPC Applications with Singularity
• Performance of LS-DYNA on Singularity Containers
• Scaling Deep Learning on Multiple V100 Nodes
• Deep Learning on V100
• Deep Learning Inference on P40 vs P4 with Skylake
• Deep Learning Inference on P40 GPUs
• Deep Learning Performance with Intel Caffe – Training, CPU model choice and Scalability
• Deep Learning Performance on R740 with V100 PCIe GPUs
• Getting Started With OpenHPC
• Dell EMC Isilon F800 and F600 I/O Performance
• Dell EMC Isilon F800 and H600 Whole Genome Analysis Performance
• Digital Manufacturing with 14G
www.dellhpc.org