TRANSCRIPT
Using the Intel Optimized Caffe Framework
Stephen Blair-Chappell, Bayncore
Three Ingredients to Success
Data center, gateway, edge, vision:
1. Intel Silicon (CPU+)
2. Optimised Frameworks
3. Intel S/W & tools
https://software.intel.com/en-us/parallel-studio-xe
https://www.intelnervana.com/
End-to-end example
[Diagram: Train -> Configure -> Run. Training with Intel optimised multi-node Caffe produces the model and weights; the pre-trained model is then configured and run with the Intel Neural Compute SDK.]
The Application - YOLO
You Only Look Once
• State-of-the-art, real-time object detection system
• Identifies most things in a couple of seconds
• Designed by Joseph Redmon
https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Redmon_You_Only_Look_CVPR_2016_paper.pdf
The Model
https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Redmon_You_Only_Look_CVPR_2016_paper.pdf
Introducing Intel Optimized Caffe: What Intel Offers
Intel Optimized Caffe brings improved performance and functionality
Performance
• Single-node performance improvements provided by Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN).
Functionality
• The Intel version introduces multi-node execution.
Improved single-node performance
Intel® MKL and Intel® MKL-DNN, tuned for SKX and KNL/KNM
https://github.com/01org/mkl-dnn
Multi-node execution
• Multi-node execution provided by the Intel® Machine Learning Scaling Library (MLSL)
• Under the hood it uses MPI
https://github.com/01org/MLSL
https://github.com/intel/caffe/wiki/Multinode-guide
Intel® Machine Learning Scaling Library
• Built on top of MPI
• Optimized to drive scalability of communication patterns
• Works across various interconnects: Intel® Omni-Path Architecture, InfiniBand*, and Ethernet
• Common API to support deep learning frameworks (Caffe*, Theano*, Torch*, etc.)
[Diagram: multi-node execution provided by MLSL. Forward propagation and backpropagation across layers 1..N, with the collective operations (Allreduce, Alltoall, Reduce-Scatter, Allgather) handled by MLSL.]
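MLSL itself is a C/C++ library, but the core pattern it optimizes in the diagram above, averaging layer gradients across nodes, can be sketched conceptually with plain MPI. The snippet below uses mpi4py purely as an illustration; it is not the MLSL API.

# Conceptual sketch only: the gradient-allreduce pattern MLSL optimizes,
# written with mpi4py instead of the real MLSL API.
# Run with: mpirun -n 4 python allreduce_sketch.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

# Each node computes gradients on its own mini-batch (random stand-in here).
local_grads = np.random.randn(1000).astype(np.float32)

# Sum the gradients across all nodes, then average them.
global_grads = np.empty_like(local_grads)
comm.Allreduce(local_grads, global_grads, op=MPI.SUM)
global_grads /= comm.Get_size()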
Intel Caffe! Getting started
End-to-end example
[Diagram: Train -> Configure -> Run. The pre-trained model and weights feed the Intel Neural Compute SDK for deployment.]
https://software.intel.com/en-us/ai-academy/tools/devcloud
Intel Caffe up & running – the easy (best) way
>> install miniconda
>> export PATH=<miniconda_install_root>/bin:$PATH
>> conda create -n intel_caffe -c intel --override-channels caffe
>> source activate intel_caffe
Shortcut to the Python wrapper, no build required
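As a quick sanity check that the conda package exposes pycaffe (a minimal sketch; run inside the activated intel_caffe environment):

# Verify the conda-installed pycaffe is importable and CPU mode works.
import caffe

caffe.set_mode_cpu()      # Intel Caffe targets the CPU
print(caffe.__file__)     # confirm which caffe package was picked up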
Intel Caffe up & running – customize the install process and build your own optimized system
https://github.com/intel/caffe
To activate the Python wrapper:
>> cd caffe/python
>> for req in $(cat requirements.txt); do pip install $req; done
>> cd <caffe_root>
>> make pycaffe
>> make distribute
>> export PYTHONPATH=/path/to/caffe/python:$PYTHONPATH
Hands-on Demo: Training on the cloud
Network Definition
solver.prototxt: a configuration file that tells Caffe how you want the network trained
https://github.com/BVLC/caffe/wiki/Solver-Prototxt
net: "models/intel_optimized_models/alexnet/train_val.prototxt"test_iter: 1000test_interval: 10000base_lr: 0.007lr_policy: "poly"power: 0.6display: 20max_iter: 250000momentum: 0.9weight_decay: 0.0005snapshot: 50000snapshot_prefix: "models/intel_optimized_models/alexnet/alexnet_train"solver_mode: CPU
Network Definition
train_val.prototxt: the file that defines the network
http://caffe.berkeleyvision.org/tutorial/layers.html
Network Definition

import caffe
from caffe import layers as L
from caffe import params as P

def lenet(lmdb, batch_size):
    # our version of LeNet: a series of linear and simple nonlinear transformations
    n = caffe.NetSpec()
    n.data, n.label = L.Data(batch_size=batch_size, backend=P.Data.LMDB, source=lmdb,
                             transform_param=dict(scale=1./255), ntop=2)
    n.conv1 = L.Convolution(n.data, kernel_size=5, num_output=20, weight_filler=dict(type='xavier'))
    n.pool1 = L.Pooling(n.conv1, kernel_size=2, stride=2, pool=P.Pooling.MAX)
    n.conv2 = L.Convolution(n.pool1, kernel_size=5, num_output=50, weight_filler=dict(type='xavier'))
    n.pool2 = L.Pooling(n.conv2, kernel_size=2, stride=2, pool=P.Pooling.MAX)
    n.ip1 = L.InnerProduct(n.pool2, num_output=500, weight_filler=dict(type='xavier'))
    n.relu1 = L.ReLU(n.ip1, in_place=True)
    n.ip2 = L.InnerProduct(n.relu1, num_output=10, weight_filler=dict(type='xavier'))
    n.loss = L.SoftmaxWithLoss(n.ip2, n.label)
    return n.to_proto()

with open('examples/mnist/lenet_auto_train.prototxt', 'w') as f:
    f.write(str(lenet('examples/mnist/mnist_train_lmdb', 64)))

with open('examples/mnist/lenet_auto_test.prototxt', 'w') as f:
    f.write(str(lenet('examples/mnist/mnist_test_lmdb', 100)))
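To confirm the generated prototxt parses, it can be loaded straight back (a quick check, not from the original slides; assumes the standard Caffe MNIST LMDBs exist):

# Load the freshly written network definition in TRAIN phase.
net = caffe.Net('examples/mnist/lenet_auto_train.prototxt', caffe.TRAIN)
print(net.blobs['data'].data.shape)   # (64, 1, 28, 28) for the MNIST LMDB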
Network Definition
# from Caffe root
>> python/draw_net.py path/to/train_val.prototxt output.png
Network Execution
From Caffe root:
# Train command, single node
>> ./build/tools/caffe train \
   --solver=models/intel_optimized_models/alexnet/solver.prototxt \
   --engine=MKL2017 | tee train_alexnet.log

Python:
# Train command, single node
>> python my_code.py
Network Execution (multi-node)
From Caffe root:
# Train command, multi-node
>> mpirun -n 4 -ppn 1 -machinefile mpd.hosts -genv OMP_NUM_THREADS=64 \
   ./build/tools/caffe train \
   --solver=models/intel_optimized_models/alexnet/solver.prototxt \
   --engine MKL2017 \
   | tee train_googlenet_4nodes_0410_tmi_omp_set_64.log
Network Execution
Fine-tune a model:
# Train command
>> ./build/tools/caffe train \
   --solver=examples/cifar10/cifar10_full_solver.prototxt \
   --weights my_model.caffemodel
Example available here: http://caffe.berkeleyvision.org/gathered/examples/finetune_flickr_style.html
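The same fine-tuning flow is available from pycaffe. A minimal sketch, assuming the solver and .caffemodel paths above:

# Fine-tune: start from pre-trained weights instead of random init.
import caffe

caffe.set_mode_cpu()
solver = caffe.SGDSolver('examples/cifar10/cifar10_full_solver.prototxt')
solver.net.copy_from('my_model.caffemodel')   # load pre-trained weights
solver.solve()                                # train to max_iter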
Inference on the edge with the Movidius Compute Stick
Movidius Neural Compute Stick: Redefining the AI developer kit
• Neural Network Accelerator in USB Stick Form Factor
• No additional heat-sink, no fan, no cables, no additional power supply
• Prototype, tune, validate and deploy deep neural networks at the edge
• Features the same Movidius vision processing unit (VPU) used in drones, surveillance cameras, VR headsets, and other low-power intelligent and autonomous products
Myriad 2 Vision Processing Unit (VPU)
Edge example use: the DJI Spark drone
See: https://www.dji.com/spark
Face Aware
Gesture Mode
Safe Landing
Example: Scaling inference performance with multiple sticks
Movidius Neural Compute Stick: Redefining the AI developer kit
NC SDK = NC Toolkit (Profiler, Checker, Compiler) + NC API
Free download @ developer.movidius.com
NC SDK workflow
Profiler: a tool that provides a detailed stage-by-stage breakdown of where the bottlenecks are in your system.
Checker: runs a single inference on the NCS using the provided model, allowing for the calculation of classification correctness.
Compiler: used to create a graph, an optimized binary file that can be processed by the NCS.

What can I do with the NCS?
• DNN architect / data scientist: the NC Toolkit (Profiler, Checker, Compiler)
• Applications developer: the NC API
C API: GetDeviceName, OpenDevice, AllocateGraph, DeallocateGraph, LoadTensor, SetGraphOption, CloseDevice, …
Python bindings: Status, GlobalOption, DeviceOption, GraphOption, EnumerateDevices, SetGlobalOption, LoadTensor, …
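A minimal inference sketch with the NC API Python bindings (NCSDK 1.x style; assumes a graph file already produced by the Compiler and a preprocessed half-precision image):

# Open the stick, load a compiled graph, run one inference.
import numpy as np
from mvnc import mvncapi as mvnc

devices = mvnc.EnumerateDevices()           # list attached NCS devices
device = mvnc.Device(devices[0])
device.OpenDevice()

with open('graph', 'rb') as f:              # binary produced by the Compiler
    graph = device.AllocateGraph(f.read())

img = np.zeros((224, 224, 3), np.float16)   # stand-in for a real image
graph.LoadTensor(img, 'user object')
output, userobj = graph.GetResult()         # blocking call, returns scores

graph.DeallocateGraph()
device.CloseDevice()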
Benchmarking & Performance
Benchmarking
Calculate img/s as:
    img/s = num_nodes * batch_size * max_iter / time
where the execution time is:
    time = timestamp(iteration N) - timestamp(iteration 0)   (from the training .log file)
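A minimal sketch of that calculation, assuming you have already pulled the first and last iteration timestamps out of the training log (all names here are illustrative):

# Throughput in images/s across all nodes.
def images_per_second(num_nodes, batch_size, max_iter, t_iter0, t_iterN):
    elapsed = t_iterN - t_iter0          # seconds between iteration 0 and N
    return num_nodes * batch_size * max_iter / elapsed

# Example: 4 nodes, batch size 256, 1000 iterations in 600 s -> ~1707 img/s
print(images_per_second(4, 256, 1000, 0.0, 600.0))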
Performance Evaluation
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance/datacenter. Contact your Intel representative for more information on how to obtain the binary. For information on workload, see https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.
Node Number   Images/s   Scalability
1             480        -
4             1708       88.96%
8             3376       87.92%
16            6368       82.92%
32            11616      75.63%
Topology: AlexNet | Dataset: ImageNet | Input: JPEG raw data | Batch size: 256 | Xeon Phi: KNL 7210
[Chart: normalized training time, higher is better: 1x (1 node), 3.6x (4 nodes), 7.03x (8 nodes), 13.3x (16 nodes), 24.2x (32 nodes)]
Performance Evaluation
[Chart: normalized training time, higher is better: 1x (1 node), 3.7x (4 nodes), 7.7x (8 nodes), 14.8x (16 nodes), 29.7x (32 nodes)]
Node Number   Images/s   Scalability
1             556        -
4             2064.5     92.83%
8             4311.6     96.93%
16            8241.4     92.64%
32            16516      92.83%
Topology: AlexNet | Dataset: ImageNet | Input: compressed LMDB | Batch size: 256 | Xeon Phi: KNL 7210
Compressed Data gives better performance!
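One way a compressed LMDB can be produced: Caffe's Datum records can carry the raw JPEG bytes with the encoded flag set, so decoding happens at training time (Caffe's convert_imageset tool with --encoded does the same at scale). A minimal sketch, with an illustrative file list and output path:

# Write JPEG-compressed records into an LMDB for Caffe training.
import lmdb
from caffe.proto import caffe_pb2

env = lmdb.open('train_lmdb_compressed', map_size=1 << 40)
with env.begin(write=True) as txn:
    for i, (jpeg_path, label) in enumerate([('img0.jpg', 0), ('img1.jpg', 1)]):
        datum = caffe_pb2.Datum()
        datum.encoded = True                  # keep the compressed bytes
        datum.label = label
        with open(jpeg_path, 'rb') as f:
            datum.data = f.read()             # raw JPEG, decoded by Caffe
        txn.put('{:08d}'.format(i).encode(), datum.SerializeToString())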
Performance Evaluation
[Chart: GoogLeNet v1 training loss vs. iterations; series: 1node_batch32_step_lr0.005_training_loss]
Multi-node training converges faster than single-node training; top-1 and top-5 accuracy also increase faster than on a single node.
[Chart: GoogLeNet v1 accuracy vs. iterations; series: 1node_batch32_step_lr0.005_loss3_accuracy@top1, 1node_batch32_step_lr0.005_loss3_accuracy@top5, 8node_batch32*8_step_lr0.005_loss3_accuracy@top1, 8node_batch32*8_step_lr0.005_loss3_accuracy@top5]
Top Tips
BIOS settings: single node
• Cluster Mode: AlltoAll
• Cache Mode: Flat for workload <=16GB memory, Cache otherwise
• Hyper threading: Enabled
• CPU Power and Performance Policy: Performance
• Set Fan Profile: Performance
BIOS settings: multi node
• Intel Hyper-Thread: Disabled
• Cluster Mode: Quadrant
• MCDRAM: Cache
• CPU Power and Performance Policy: Performance
• Set Fan Profile: Performance
• Use an SSD drive. If during training/scoring you observe "waiting for data" in the logs, you should install a better SSD or reduce the batch size
Top tips
• Choose a big batch size to take advantage of the large memory of KNL/KNM systems
• Multi-node Intel Caffe on KNL/KNM + OPA is the most scalable solution, with better accuracy vs. Ethernet
• Use the right BIOS settings and the latest software versions
• Understand the MLSL/OMP settings
• Pay attention to affinity settings (see next slides)
CPU Affinity for Performance Management
The Intel® OpenMP* runtime library has the ability to bind OpenMP threads to physical processing units.
The interface is controlled using the KMP_AFFINITY and KMP_PLACE_THREADS environment variables.
There are two considerations for OpenMP threading and affinity:
1. Determine the number of threads to utilize
2. Bind threads to specific processor cores
Example command (running Intel Caffe on 8 nodes of KNL 7250):
>> mpirun -l -n 8 -ppn 1 -machinefile 8nodes.hosts -genv OMP_NUM_THREADS=64 \
   -genv KMP_AFFINITY="proclist=[0-63],granularity=thread,explicit" \
   -genv MIC_KMP_AFFINITY="verbose,none" -genv KMP_HW_SUBSET=1t \
   -genv MLSL_NUM_SERVERS=4 numactl -i all \
   /root/intelcaffe/build/tools/caffe train \
   --solver /root/Lei/iFlytek/solver.prototxt -iterations 1000
Detailed information on using the KMP_AFFINITY environment variable: https://software.intel.com/en-us/node/522691
KNL CPU Affinity with numactl
[Diagram: CPU core resource hierarchy on a KNL 7250 with HT on (Linux kernel 3.x). With four hardware threads per core, each core id maps to four processor ids, e.g. core 0 -> processor ids 0, 68, 136, 204.]
• Check the core_id to processor_id mapping:
>> egrep "(( id|processor).*:|^ *$)" /proc/cpuinfo
• Use the numactl command for CPU affinity.
Example: pin a Python script to core_ids 0-9:
>> numactl -C +0-9 python your_script.py
Example: pin a Python script to core_ids 0 and 67:
>> numactl -C +0,67 python your_script.py
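The same mapping check can be scripted; a small sketch that parses /proc/cpuinfo (Linux only, names illustrative):

# Print which processor ids belong to each physical core.
from collections import defaultdict

cores = defaultdict(list)
proc = None
with open('/proc/cpuinfo') as f:
    for line in f:
        if line.startswith('processor'):
            proc = int(line.split(':')[1])
        elif line.startswith('core id'):
            cores[int(line.split(':')[1])].append(proc)

for core_id, procs in sorted(cores.items()):
    print('core {}: processors {}'.format(core_id, procs))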
Legal Disclaimer & Optimization Notice
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Copyright © 2017, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.
*Other names and brands may be claimed as the property of others.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804
Legal Notices & Disclaimers
This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer. No computer system can be absolutely secure.
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.
Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.
Statements in this document that refer to Intel’s plans and expectations for the quarter, the year, and the future, are forward-looking statements that involve a number of risks and uncertainties. A detailed discussion of the factors that could affect Intel’s results and plans is included in Intel’s SEC filings, including the annual report on Form 10-K.
The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.
Intel, the Intel logo, Pentium, Celeron, Atom, Core, Xeon and others are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.
© 2016 Intel Corporation.