TRANSCRIPT
Using the Intel Optimized Caffe Framework
Stephen Blair-Chappell, Bayncore
Three Ingredients to Success
Data center, gateway, edge, vision:
1. Intel Silicon (CPU+)
2. Optimised Frameworks
3. Intel S/W & tools
https://software.intel.com/en-us/parallel-studio-xe
https://www.intelnervana.com/
End-to-end example
[Diagram: Train -> Configure -> Run. Training with Intel optimised multi-node Caffe produces the model and weights; the pre-trained model is then configured and run with the Intel Neural Compute SDK.]
The Application - YOLO
You Only Look Once
• State-of-the-art, real-time object detection system
• Identifies most things in a couple of seconds
• Designed by Joseph Redmon
https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Redmon_You_Only_Look_CVPR_2016_paper.pdf
The Model
https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Redmon_You_Only_Look_CVPR_2016_paper.pdf
Introducing Intel Optimized Caffe: What Intel Offers
Intel Optimized Caffe brings improved performance and functionality
Performance
• Single-node performance improvements provided by Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN).
Functionality
• The Intel version introduces multi-node execution.
Improved single-node performance
Intel® MKL and Intel® MKL-DNN, tuned for SKX and KNL/KNM
https://github.com/01org/mkl-dnn
Multi-node execution
• Multi-node execution provided by the Intel® Machine Learning Scaling Library (MLSL)
• Under the hood it uses MPI
https://github.com/01org/MLSL
https://github.com/intel/caffe/wiki/Multinode-guide
Intel® Machine Learning Scaling Library
• Built on top of MPI
• Optimized to drive scalability of communication patterns
• Works across various interconnects: Intel® Omni-Path Architecture, InfiniBand*, and Ethernet
• Common API to support deep learning frameworks (Caffe*, Theano*, Torch*, etc.)
[Diagram: multi-node execution provided by MLSL. Forward propagation and backpropagation across layers 1..N, with the collective operations (Allreduce, Alltoall, Reduce-Scatter, Allgather) handled by MLSL.]
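MLSL itself is a C/C++ library, but the core pattern it optimizes in the diagram above, averaging layer gradients across nodes, can be sketched conceptually with plain MPI. The snippet below uses mpi4py purely as an illustration; it is not the MLSL API.

# Conceptual sketch only: the gradient-allreduce pattern MLSL optimizes,
# written with mpi4py instead of the real MLSL API.
# Run with: mpirun -n 4 python allreduce_sketch.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

# Each node computes gradients on its own mini-batch (random stand-in here).
local_grads = np.random.randn(1000).astype(np.float32)

# Sum the gradients across all nodes, then average them.
global_grads = np.empty_like(local_grads)
comm.Allreduce(local_grads, global_grads, op=MPI.SUM)
global_grads /= comm.Get_size()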
Intel Caffe! Getting started
End-to-end example
[Diagram: Train -> Configure -> Run. The pre-trained model and weights feed the Intel Neural Compute SDK for deployment.]
https://software.intel.com/en-us/ai-academy/tools/devcloud
Intel Caffe up & running – the easy (best) way
>> install miniconda
>> export PATH=<miniconda_install_root>/bin:$PATH
>> conda create -n intel_caffe -c intel --override-channels caffe
>> source activate intel_caffe
Shortcut to the Python wrapper, no build required
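As a quick sanity check that the conda package exposes pycaffe (a minimal sketch; run inside the activated intel_caffe environment):

# Verify the conda-installed pycaffe is importable and CPU mode works.
import caffe

caffe.set_mode_cpu()      # Intel Caffe targets the CPU
print(caffe.__file__)     # confirm which caffe package was picked up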
Intel Caffe up & running – customize the install process and build your own optimized system
https://github.com/intel/caffe
To activate the Python wrapper:
>> cd caffe/python
>> for req in $(cat requirements.txt); do pip install $req; done
>> cd <caffe_root>
>> make pycaffe
>> make distribute
>> export PYTHONPATH=/path/to/caffe/python:$PYTHONPATH
Hands-on Demo: Training on the cloud
Network Definition
solver.prototxt: a configuration file that tells Caffe how you want the network trained
https://github.com/BVLC/caffe/wiki/Solver-Prototxt
net: "models/intel_optimized_models/alexnet/train_val.prototxt"test_iter: 1000test_interval: 10000base_lr: 0.007lr_policy: "poly"power: 0.6display: 20max_iter: 250000momentum: 0.9weight_decay: 0.0005snapshot: 50000snapshot_prefix: "models/intel_optimized_models/alexnet/alexnet_train"solver_mode: CPU
Network Definition
train_val.prototxt: the file that defines the network
http://caffe.berkeleyvision.org/tutorial/layers.html
Network Definition

import caffe
from caffe import layers as L
from caffe import params as P

def lenet(lmdb, batch_size):
    # our version of LeNet: a series of linear and simple nonlinear transformations
    n = caffe.NetSpec()
    n.data, n.label = L.Data(batch_size=batch_size, backend=P.Data.LMDB, source=lmdb,
                             transform_param=dict(scale=1./255), ntop=2)
    n.conv1 = L.Convolution(n.data, kernel_size=5, num_output=20, weight_filler=dict(type='xavier'))
    n.pool1 = L.Pooling(n.conv1, kernel_size=2, stride=2, pool=P.Pooling.MAX)
    n.conv2 = L.Convolution(n.pool1, kernel_size=5, num_output=50, weight_filler=dict(type='xavier'))
    n.pool2 = L.Pooling(n.conv2, kernel_size=2, stride=2, pool=P.Pooling.MAX)
    n.ip1 = L.InnerProduct(n.pool2, num_output=500, weight_filler=dict(type='xavier'))
    n.relu1 = L.ReLU(n.ip1, in_place=True)
    n.ip2 = L.InnerProduct(n.relu1, num_output=10, weight_filler=dict(type='xavier'))
    n.loss = L.SoftmaxWithLoss(n.ip2, n.label)
    return n.to_proto()

with open('examples/mnist/lenet_auto_train.prototxt', 'w') as f:
    f.write(str(lenet('examples/mnist/mnist_train_lmdb', 64)))

with open('examples/mnist/lenet_auto_test.prototxt', 'w') as f:
    f.write(str(lenet('examples/mnist/mnist_test_lmdb', 100)))
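To confirm the generated prototxt parses, it can be loaded straight back (a quick check, not from the original slides; assumes the standard Caffe MNIST LMDBs exist):

# Load the freshly written network definition in TRAIN phase.
net = caffe.Net('examples/mnist/lenet_auto_train.prototxt', caffe.TRAIN)
print(net.blobs['data'].data.shape)   # (64, 1, 28, 28) for the MNIST LMDB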
Network Definition
# from Caffe root
>> python/draw_net.py path/to/train_val.prototxt output.png
Network Execution
From Caffe root:
# Train command, single node
>> ./build/tools/caffe train \
   --solver=models/intel_optimized_models/alexnet/solver.prototxt \
   --engine=MKL2017 | tee train_alexnet.log

Python:
# Train command, single node
>> python my_code.py
Network Execution (multi-node)
From Caffe root:
# Train command, multi-node
>> mpirun -n 4 -ppn 1 -machinefile mpd.hosts -genv OMP_NUM_THREADS=64 \
   ./build/tools/caffe train \
   --solver=models/intel_optimized_models/alexnet/solver.prototxt \
   --engine MKL2017 \
   | tee train_googlenet_4nodes_0410_tmi_omp_set_64.log
Network Execution
Fine-tune a model:
# Train command
>> ./build/tools/caffe train \
   --solver=examples/cifar10/cifar10_full_solver.prototxt \
   --weights my_model.caffemodel
Example available here: http://caffe.berkeleyvision.org/gathered/examples/finetune_flickr_style.html
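The same fine-tuning flow is available from pycaffe. A minimal sketch, assuming the solver and .caffemodel paths above:

# Fine-tune: start from pre-trained weights instead of random init.
import caffe

caffe.set_mode_cpu()
solver = caffe.SGDSolver('examples/cifar10/cifar10_full_solver.prototxt')
solver.net.copy_from('my_model.caffemodel')   # load pre-trained weights
solver.solve()                                # train to max_iter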
Inference on the edge with the Movidius Compute Stick
Movidius Neural Compute Stick: Redefining the AI developer kit
• Neural Network Accelerator in USB Stick Form Factor
• No additional heat-sink, no fan, no cables, no additional power supply
• Prototype, tune, validate and deploy deep neural networks at the edge
• Features the same Movidius vision processing unit (VPU) used in drones, surveillance cameras, VR headsets, and other low-power intelligent and autonomous products
Myriad 2 Vision Processing Unit (VPU)
Edge example use: the DJI Spark drone
See: https://www.dji.com/spark
Face Aware
Gesture Mode
Safe Landing
Example: Scaling inference performance with multiple sticks
Movidius Neural Compute Stick: Redefining the AI developer kit
NC SDK = NC Toolkit (Profiler, Checker, Compiler) + NC API
Free download @ developer.movidius.com
NC SDK workflow
Profiler: a tool that provides a detailed stage-by-stage breakdown of where the bottlenecks are in your system.
Checker: runs a single inference on the NCS using the provided model, allowing for the calculation of classification correctness.
Compiler: used to create a graph, an optimized binary file that can be processed by the NCS.

What can I do with the NCS?
• DNN architect / data scientist: the NC Toolkit (Profiler, Checker, Compiler)
• Applications developer: the NC API
C API: GetDeviceName, OpenDevice, AllocateGraph, DeallocateGraph, LoadTensor, SetGraphOption, CloseDevice, …
Python bindings: Status, GlobalOption, DeviceOption, GraphOption, EnumerateDevices, SetGlobalOption, LoadTensor, …
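A minimal inference sketch with the NC API Python bindings (NCSDK 1.x style; assumes a graph file already produced by the Compiler and a preprocessed half-precision image):

# Open the stick, load a compiled graph, run one inference.
import numpy as np
from mvnc import mvncapi as mvnc

devices = mvnc.EnumerateDevices()           # list attached NCS devices
device = mvnc.Device(devices[0])
device.OpenDevice()

with open('graph', 'rb') as f:              # binary produced by the Compiler
    graph = device.AllocateGraph(f.read())

img = np.zeros((224, 224, 3), np.float16)   # stand-in for a real image
graph.LoadTensor(img, 'user object')
output, userobj = graph.GetResult()         # blocking call, returns scores

graph.DeallocateGraph()
device.CloseDevice()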
Benchmarking & Performance
Benchmarking
Calculate img/s as:
    img/s = num_nodes * batch_size * max_iter / time
where the execution time is:
    time = timestamp(iteration N) - timestamp(iteration 0)   (from the training .log file)
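A minimal sketch of that calculation, assuming you have already pulled the first and last iteration timestamps out of the training log (all names here are illustrative):

# Throughput in images/s across all nodes.
def images_per_second(num_nodes, batch_size, max_iter, t_iter0, t_iterN):
    elapsed = t_iterN - t_iter0          # seconds between iteration 0 and N
    return num_nodes * batch_size * max_iter / elapsed

# Example: 4 nodes, batch size 256, 1000 iterations in 600 s -> ~1707 img/s
print(images_per_second(4, 256, 1000, 0.0, 600.0))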
Performance Evaluation
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance/datacenter. Contact your Intel representative for more information on how to obtain the binary. For information on workload, see https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.
Node Number   Images/s   Scalability
1             480        -
4             1708       88.96%
8             3376       87.92%
16            6368       82.92%
32            11616      75.63%
Topology: AlexNet | Dataset: ImageNet | Input: JPEG raw data | Batch size: 256 | Xeon Phi: KNL 7210
[Chart: normalized training time, higher is better: 1x (1 node), 3.6x (4 nodes), 7.03x (8 nodes), 13.3x (16 nodes), 24.2x (32 nodes)]
Performance Evaluation
[Chart: normalized training time, higher is better: 1x (1 node), 3.7x (4 nodes), 7.7x (8 nodes), 14.8x (16 nodes), 29.7x (32 nodes)]
Node Number   Images/s   Scalability
1             556        -
4             2064.5     92.83%
8             4311.6     96.93%
16            8241.4     92.64%
32            16516      92.83%
Topology: AlexNet | Dataset: ImageNet | Input: compressed LMDB | Batch size: 256 | Xeon Phi: KNL 7210
Compressed Data gives better performance!
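One way a compressed LMDB can be produced: Caffe's Datum records can carry the raw JPEG bytes with the encoded flag set, so decoding happens at training time (Caffe's convert_imageset tool with --encoded does the same at scale). A minimal sketch, with an illustrative file list and output path:

# Write JPEG-compressed records into an LMDB for Caffe training.
import lmdb
from caffe.proto import caffe_pb2

env = lmdb.open('train_lmdb_compressed', map_size=1 << 40)
with env.begin(write=True) as txn:
    for i, (jpeg_path, label) in enumerate([('img0.jpg', 0), ('img1.jpg', 1)]):
        datum = caffe_pb2.Datum()
        datum.encoded = True                  # keep the compressed bytes
        datum.label = label
        with open(jpeg_path, 'rb') as f:
            datum.data = f.read()             # raw JPEG, decoded by Caffe
        txn.put('{:08d}'.format(i).encode(), datum.SerializeToString())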
Performance Evaluation
[Chart: GoogLeNet v1 training loss vs. iterations; series: 1node_batch32_step_lr0.005_training_loss]
Multi-node training converges faster than single-node training; top-1 and top-5 accuracy also increase faster than on a single node.
[Chart: GoogLeNet v1 accuracy vs. iterations; series: 1node_batch32_step_lr0.005_loss3_accuracy@top1, 1node_batch32_step_lr0.005_loss3_accuracy@top5, 8node_batch32*8_step_lr0.005_loss3_accuracy@top1, 8node_batch32*8_step_lr0.005_loss3_accuracy@top5]
Top Tips
BIOS settings: single node
• Cluster Mode: AlltoAll
• Cache Mode: Flat for workload <=16GB memory, Cache otherwise
• Hyper threading: Enabled
• CPU Power and Performance Policy: Performance
• Set Fan Profile: Performance
BIOS settings: multi node
• Intel Hyper-Thread: Disabled
• Cluster Mode: Quadrant
• MCDRAM: Cache
• CPU Power and Performance Policy: Performance
• Set Fan Profile: Performance
• Use an SSD drive. If during training/scoring you observe "waiting for data" in the logs, you should install a better SSD or reduce the batch size
Top tips
• Choose a big batch size to take advantage of the large memory of KNL/KNM systems
• Multi-node Intel Caffe on KNL/KNM + OPA is the most scalable solution, with better accuracy vs. Ethernet
• Use the right BIOS settings and the latest software versions
• Understand the MLSL/OMP settings
• Pay attention to affinity settings (see next slides)
CPU Affinity for Performance Management
The Intel® OpenMP* runtime library has the ability to bind OpenMP threads to physical processing units.
The interface is controlled using the KMP_AFFINITY and KMP_PLACE_THREADS environment variables.
There are two considerations for OpenMP threading and affinity:
1. Determine the number of threads to utilize
2. Bind threads to specific processor cores
Example command (running Intel Caffe on 8 nodes of KNL 7250):
>> mpirun -l -n 8 -ppn 1 -machinefile 8nodes.hosts -genv OMP_NUM_THREADS=64 \
   -genv KMP_AFFINITY="proclist=[0-63],granularity=thread,explicit" \
   -genv MIC_KMP_AFFINITY="verbose,none" -genv KMP_HW_SUBSET=1t \
   -genv MLSL_NUM_SERVERS=4 numactl -i all \
   /root/intelcaffe/build/tools/caffe train \
   --solver /root/Lei/iFlytek/solver.prototxt -iterations 1000
Detailed information on using the KMP_AFFINITY environment variable: https://software.intel.com/en-us/node/522691
KNL CPU Affinity with numactl
[Diagram: CPU core resource hierarchy on a KNL 7250 with HT on (Linux kernel 3.x). With four hardware threads per core, each core id maps to four processor ids, e.g. core 0 -> processor ids 0, 68, 136, 204.]
• Check the core_id to processor_id mapping:
>> egrep "(( id|processor).*:|^ *$)" /proc/cpuinfo
• Use the numactl command for CPU affinity.
Example: pin a Python script to core_ids 0-9:
>> numactl -C +0-9 python your_script.py
Example: pin a Python script to core_ids 0 and 67:
>> numactl -C +0,67 python your_script.py
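The same mapping check can be scripted; a small sketch that parses /proc/cpuinfo (Linux only, names illustrative):

# Print which processor ids belong to each physical core.
from collections import defaultdict

cores = defaultdict(list)
proc = None
with open('/proc/cpuinfo') as f:
    for line in f:
        if line.startswith('processor'):
            proc = int(line.split(':')[1])
        elif line.startswith('core id'):
            cores[int(line.split(':')[1])].append(proc)

for core_id, procs in sorted(cores.items()):
    print('core {}: processors {}'.format(core_id, procs))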
Legal Disclaimer & Optimization Notice
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Copyright © 2017, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.
*Other names and brands may be claimed as the property of others.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804
Legal Notices & Disclaimers
This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer. No computer system can be absolutely secure.
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.
Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.
Statements in this document that refer to Intel’s plans and expectations for the quarter, the year, and the future, are forward-looking statements that involve a number of risks and uncertainties. A detailed discussion of the factors that could affect Intel’s results and plans is included in Intel’s SEC filings, including the annual report on Form 10-K.
The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.
Intel, the Intel logo, Pentium, Celeron, Atom, Core, Xeon and others are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.
© 2016 Intel Corporation.