Deep Learning on HPC: Performance Factors and Lessons Learned
Weijia Xu Scalable Computational Intelligence Group
Texas Advanced Computing Center The University of Texas at Austin
BenchCouncil'19, Denver, CO
Outline
• Motivating applications at TACC
• Challenges for running deep learning on HPC
  • Scalability and accuracy
  • Scalability and I/O
  • Memory error impact
• Conclusions and Discussions
Motivating Applications
• Traffic camera video analysis
  • In collaboration with the City of Austin
  • ~540 MB/hour of MPEG-4 video from one camera
  • ~100 GB for a typical study from a single camera
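At ~540 MB/hour, a ~100 GB study corresponds to roughly a week of footage (a back-of-the-envelope sketch, assuming continuous recording):

    gb_per_hour = 0.54          # ~540 MB/hour of MPEG-4 video (from the slide)
    hours = 100 / gb_per_hour   # ~100 GB per typical study
    print(f"~{hours:.0f} hours (~{hours / 24:.0f} days) of video")   # ~185 h, ~8 days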
[1] Weijia Xu, Natalia Ruiz, Ruizhu Huang, Joel Meyer, Jen Duthie, and John Clary, "Deep Learning Methods to Leverage Traffic Monitoring Cameras for Pedestrian Data Applications," 26th ITS World Congress. (Best Technical Paper)
[2] Weijia Xu, Natalia Ruiz, Kelly Pierce, Ruizhu Huang, Joel Meyer, and Jen Duthie, "Detecting Pedestrian Crossing Events in Large Video Data from Traffic Monitoring Cameras," to appear, IEEE BigData 2019.
Motivating Applications
• There are over 400 CCTV IP cameras within the city limits of Austin
• Mostly used only for manual monitoring
• With deep learning, we can
  • Learn more about traffic patterns
  • Understand how roads are used
  • Improve pedestrian safety
  • A lot of unexpected…
Motivating Applications
• Neural image resolution enhancement with a super-resolution generative adversarial network
  • In collaboration with the Salk Institute
  • ~600 GB neural image dataset
  • PyTorch + fastai; each run of the early version takes ~24 hours on 16 NVIDIA V100 GPUs
[bioRxiv'19] Fang, L., Monroe, F., Novak, S.W., Kirk, L., Schiavon, C.R., Seungyoon, B.Y., Zhang, T., Wu, M., Kastner, K., Kubota, Y., and Zhang, Z., 2019. "Deep Learning-Based Point-Scanning Super-Resolution Imaging." bioRxiv, p. 740548.
Motivating Applications
• Face recognition
  • In collaboration with NASA JPL
  • ~100 GB image data
  • TensorFlow + Horovod
  • Each run takes ~12 hours on 16 NVIDIA GTX 1080 Ti GPUs
[1] [DLS'19-1] Mattmann, Chris A. and Zhang, Z., "Deep Facial Recognition with TensorFlow," The 3rd Deep Learning on Supercomputers Workshop, in conjunction with SC'19, Denver, CO.
[2] Courtesy image from: https://megapixels.cc/datasets/msceleb/
More DL Applications at TACC
• Deep learning is both compute-intensive and data-intensive.
Scale Up vs. Scale Out
• Scale up
  • Better and faster GPU cards / specialized hardware, e.g., TPUs
  • High acquisition cost to build a large cluster
• Scale out
  • Using more computing nodes
  • Consistent with traditional HPC operations
  • Specific challenges
    • Accuracy vs. scalability
    • I/O issues
The Race of ResNet-50
• 90-epoch ResNet-50 training finished in 20 minutes on 2,048 KNL nodes with 74.9% top-1 accuracy
• Measured against an 8-GPU baseline
[Figure: ResNet-50 ImageNet training acceleration, 2016-2019. Speedup over the 8-GPU baseline: He et al. (Microsoft) 1x, Goyal et al. (Facebook) 29x, Codreanu et al. (SURF & Intel) 28x, You et al. (UCB & TACC) 56x, Preferred Networks 116x, Tencent & HKBU 264x, Sony Research 466x, Google (1,024 TPUv3) 791x.]
Accuracy vs. Scalability
• To yield high utilization at scale, we need to feed enough data (computation), which results in a large batch size
• Validation (test) accuracy is sensitive to batch size: a large batch size can result in degraded validation accuracy
• Layer-wise Adaptive Rate Scaling (LARS)1
  • Intuition: the learning rate should be adjusted according to the norm of the weights in each layer (see the sketch below)
[1] You, Yang, Zhao Zhang, Cho-Jui Hsieh, James Demmel, and Kurt Keutzer, "ImageNet Training in Minutes," Proceedings of the 47th International Conference on Parallel Processing, p. 1, ACM, 2018. (Best Paper)
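As a rough illustration of the LARS intuition (a minimal sketch, not the paper's implementation; the trust_coef and weight_decay values are placeholders):

    import numpy as np

    def lars_local_lr(weights, grads, trust_coef=0.001, weight_decay=5e-4):
        """Layer-wise learning-rate scale in the spirit of LARS (sketch)."""
        w_norm = np.linalg.norm(weights)
        g_norm = np.linalg.norm(grads)
        # The local rate grows with ||w|| and shrinks with ||grad||, so layers
        # whose gradients are large relative to their weights take smaller steps.
        return trust_coef * w_norm / (g_norm + weight_decay * w_norm + 1e-9)

    # Each layer l then updates with: global_lr * lars_local_lr(w_l, g_l) * g_l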
Scalable Training Algorithm
• Using a batch size of 32K while preserving validation accuracy
Scalable Training Algorithm
• Using a batch size of 32K on Intel Xeon Phi 7250 (KNL) and Intel Xeon Platinum 8160 (SKX) nodes
Scalability vs. Data I/O
• ResNet-50 with ImageNet on 16 NVIDIA 1080 Ti GPUs, mini-batch of 64 per GPU
[Figure: Throughput (imgs/sec) vs. number of nodes (4 GPUs per node: 1, 4, 8, 16). Ideal scaling runs from 544 imgs/sec on 1 node to 8,704 on 16 nodes; Lustre falls increasingly short of ideal as the node count grows.]
I/O on Lustre
Dataset        # Files       # Dirs   Total Size   File Size
ImageNet       1.3 million   2,002    140 GB       KB-MB
NeuralImage    0.6 million   6        500 GB       MB
ReactorStatus  0.17 million  1        65 GB        KB
Deep Learning I/O
• DL's long-lasting, repeated, high-volume, and highly concurrent file access can easily saturate the metadata and data services of a traditional shared file system
• ResNet-50 with Keras, TensorFlow, and Horovod on 16 nodes, each with 4 GPUs
  • 128K stat() and readdir() operations with 64-way concurrent access
  • 117M stat(), open(), close() operations with 256-way concurrent access
  • ~180M read() operations with the same concurrency
  • ~8 hour duration
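A back-of-the-envelope check of the operation counts above (assuming each of ImageNet's ~1.3 million files is opened roughly once per epoch over a standard 90-epoch run):

    files, epochs = 1.3e6, 90
    print(f"~{files * epochs / 1e6:.0f}M file opens")   # ~117M, matching the slide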
[DLS'19-2] Zhang, Z., Huang, L., Pauloski, J. G., and Foster, Ian T., "Aggregating Local Storage for Scalable Deep Learning I/O," The 3rd Deep Learning on Supercomputers Workshop, in conjunction with SC'19, Denver, CO.
FanStore Design
• FanStore is a transient runtime file system that optimizes I/O for distributed DL training1
• Data is partitioned (optionally compressed) and spread across local storage space
• File access functions are intercepted and handled in user space (see the sketch below)
• Remote file access takes the form of MPI round-trip messages
[1] Zhang, Z., Huang, L., Pauloski, J. G., and Foster, Ian T., "Aggregating Local Storage for Scalable Deep Learning I/O," The 3rd Deep Learning on Supercomputers Workshop, in conjunction with SC'19, Denver, CO.
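The real system intercepts file-access functions in user space at the POSIX level; purely as a loose Python-level illustration of that interception idea (LOCAL_ROOT and the /fanstore/ prefix are hypothetical):

    import builtins
    import os

    REAL_OPEN = builtins.open
    LOCAL_ROOT = "/tmp/fanstore_demo"   # hypothetical node-local partition root

    def intercepted_open(path, mode="r", *args, **kwargs):
        # Route accesses under the virtual dataset prefix to node-local storage;
        # everything else falls through to the real open().
        prefix = "/fanstore/"
        if isinstance(path, str) and path.startswith(prefix):
            path = os.path.join(LOCAL_ROOT, path[len(prefix):])
        return REAL_OPEN(path, mode, *args, **kwargs)

    builtins.open = intercepted_open    # handled entirely in user space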
ResNet-50 GPU Results
[Figure: Throughput (imgs/sec) vs. number of nodes (4 GPUs per node: 1, 4, 8, 16), comparing Lustre, FanStore, and ideal scaling. FanStore tracks the ideal curve closely, reaching 7,867 imgs/sec on 16 nodes vs. an ideal 8,704, while Lustre plateaus far below.]
• 4 NVIDIA 1080 Ti GPUs per node, mini-batch of 64 per GPU
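From the figure's endpoints, FanStore's scaling efficiency at 16 nodes works out to roughly:

    ideal, fanstore = 8704, 7867        # imgs/sec at 16 nodes (from the figure)
    print(f"{fanstore / ideal:.1%} of ideal throughput")   # ~90.4%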
ResNet-50 CPU Results
[Figure: Throughput (imgs/sec) vs. number of nodes (1, 64, 128, 256, 512), comparing FanStore with ideal scaling. FanStore reaches 15,109 imgs/sec on 512 nodes vs. an ideal 16,384, roughly 92% scaling efficiency.]
• Intel Xeon Platinum 8160 nodes on Stampede2, mini-batch size of 256 per node
Memory Errors in Deep Learning
• The impact of memory errors on deep learning training is unclear, due to training's stochastic nature and mathematical properties
  • Difficult for computing centers or individual researchers to make hardware procurement decisions
  • Difficult for users to estimate their confidence in training correctness on ECC-free processors
  • Potential performance gain from not using ECC
• Goal: quantify the impact of memory errors on deep learning training and investigate alternative solutions for memory error detection
[Cluster'19] Zhang, Z., Huang, L., Huang, R., Xu, W., and Katz, D. S., "Quantifying the Impact of Memory Errors in Deep Learning," IEEE Cluster 2019, Albuquerque, NM.
Technical Approach
• Focusing on the impact of silent data corruption (SDC)
• P(Failure) ≈ P(Failure, SDC) = P(Failure | SDC) × P(SDC)
• To evaluate P(Failure | SDC):
  • Sample the experiment design space
  • Manually flip the selected bit (see the sketch below)
  • Observe validation accuracy and training loss
  • Estimate P(Failure | SDC) via marginal probability
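The bit flip itself can be sketched by viewing a float32's bits as an integer (a minimal illustration of the injection step; the actual experiments inject into the tested frameworks' tensors):

    import numpy as np

    def flip_bit(value, bit):
        """Flip one bit of a float32 value to simulate an SDC."""
        bits = np.float32(value).view(np.uint32)
        return (bits ^ np.uint32(1 << bit)).view(np.float32)

    # Bit 31 is the sign, bits 30-23 the exponent, bit 22 the top mantissa bit.
    print(flip_bit(0.5, 30))   # exponent flip: catastrophic magnitude change
    print(flip_bit(0.5, 22))   # top mantissa flip: 0.5 becomes 0.75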
Testing Applications
App       SW           Version   Nodes   Device       Memory   Mem Usage   Run Time
ConvNet   nvcaffe      0.16.5    1       2x 1080 Ti   11 GB    0.45 GB     4.5 mins
LRCN      caffe        1.0.0     1       1x 1080 Ti   11 GB    3.9 GB      16 mins
ResNet50  Intel-Caffe  1.1.0     512     KNL          96 GB    18.4 GB     8 mins
Example: ConvNet with CIFAR-10
• ConvNet with the CIFAR-10 dataset, baseline:
  • 50,000 training items / 10,000 validation items
  • 60,000 iterations / 120 epochs
  • Batch size: 100
  • Top-1 test accuracy acceptable range: 76.52% - 80.83%
  • Training loss acceptable range: 0.2594 - 0.4975
Parameter         Value
Iteration         200, 10200, 20200, 30200, 40200, 50200, 60000
Phase             forward, backward
Place             data, model
Layers            1, 2, …, 15
Parameter Layers  1, 2, …, 7
Data Position     0, mid, last
Bit Position      31, 30, 29, 28, 27, 22
Repetition        3
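Multiplying out the table gives the scale of the sampled design space (a hypothetical enumeration; the layer dimensions are omitted here since their ranges depend on the injection place):

    from itertools import product

    iterations    = [200, 10200, 20200, 30200, 40200, 50200, 60000]
    phases        = ["forward", "backward"]
    places        = ["data", "model"]
    bit_positions = [31, 30, 29, 28, 27, 22]
    repetitions   = 3

    configs = list(product(iterations, phases, places, bit_positions))
    print(len(configs) * repetitions)   # 7 * 2 * 2 * 6 * 3 = 504 injection runs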
Key Observations
• Training failure is independent of the iteration number, so we used part of the training process instead of complete runs
• Errors on less significant bits lead to fewer training failures
• Convolution layers have the most training failures, so we estimate the worst-case failure rate by assuming every layer is a convolution layer
• Training loss in the immediately following iteration is an effective signal for detecting catastrophic SDCs (see the sketch below)
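That last observation suggests a cheap online detector: flag a sudden loss spike relative to recent history (a sketch; the 10x threshold is illustrative, not from the paper):

    def looks_catastrophic(loss_history, new_loss, factor=10.0):
        """Flag a suspected catastrophic SDC when loss spikes past recent levels."""
        recent = loss_history[-100:]            # sliding window of recent losses
        baseline = sum(recent) / len(recent)
        return new_loss > factor * baseline

    # Steady losses around 0.3, then a sudden jump after a corrupted update:
    print(looks_catastrophic([0.31, 0.30, 0.29, 0.30], 42.0))   # True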
Memory Error Impact on DL Training
App       P(SDC)        P(F|SDC)   Scaling Factor   P(F)          Expected Runs per Failure
ConvNet   3.07 × 10^-6  1.76%      1                5.4 × 10^-8   18.5 M
ResNet50  5.89 × 10^-2  1.22%      9                7.18 × 10^-4  1,610
LRCN      5.19 × 10^-3  0.61%      110              3.17 × 10^-5  31,500
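The P(F) column is consistent with the product P(SDC) × P(F|SDC); a quick check with the table's values:

    table = {
        "ConvNet":  (3.07e-6, 0.0176),
        "ResNet50": (5.89e-2, 0.0122),
        "LRCN":     (5.19e-3, 0.0061),
    }
    for app, (p_sdc, p_f_given_sdc) in table.items():
        print(f"{app}: P(F) = {p_sdc * p_f_given_sdc:.2e}")
    # ConvNet: 5.40e-08, ResNet50: 7.19e-04, LRCN: 3.17e-05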
Conclusions
• Scalable deep learning is required for large current and future problems
• Challenges for scaling out deep learning:
  • How to maintain accuracy while maximizing scalability?
  • Deep learning is also a big data problem with heavy computation, i.e., both I/O-intensive and compute-intensive
  • Different hardware requirements
Discussions
• Machine learning on HPC is a growing area in need of a dedicated benchmark
  • Science applications
    • Datasets
    • Models
  • Running on HPC
    • Accuracy and convergence
    • Communication and data bottlenecks
    • System cost, e.g., acquisition and operations
    • Overall system balance and reliability
THANKS
• Distributed Deep Learning tutorial
  • 1:30 - 5:00 pm, Nov. 17, Room 201
• Deep Learning on Supercomputers workshop
  • 9:00 am - 5:30 pm, Nov. 17, Room 502
• Contact
  • Weijia Xu: [email protected]
  • General data inquiry: [email protected]