Building end-to-end ML workflows with Arm
TRANSCRIPT
Gian Marco Iodice, Tech Lead ACL, Arm
Wei Xiao, Principal Evangelist AI Ecosystem, Arm
Agenda
• Introduction: Presenting ArmNN and Compute Library
• Q/A – Break
• Hands-on session: Preventing Disaster with CNN
• Q/A – Break
• Performance Analysis for Deep Learning Inference
• Wrap up
ArmNN and Compute Library
ArmNN and the Compute Library are free and open-source software libraries developed by Arm for ML inference applications.
• Both support Android and Linux
• Official releases are published together quarterly on GitHub
• A development branch is available on Linaro
Linaro AI Initiative
The home for ArmNN and the Compute Library; it brings companies together to develop best-in-class deep learning performance on Arm.
https://mlplatform.org
https://mlplatform.org/contributing
Challenges Addressed
Deploying ML applications raises challenges such as:
• Framework/code/performance portability
• Code optimization for specific architectures
ArmNN and the Compute Library were developed to address these challenges: they aim to make the deployment of intelligent vision applications easy and performant on Arm-based platforms.
SW Architecture Overview
• ML workloads can use ArmNN or the Compute Library to access CPU and GPU acceleration
• Only ArmNN provides access to the Arm Ethos NPU
• Only ArmNN provides parsers for 3rd-party frameworks (TensorFlow, TensorFlow Lite, Caffe, ONNX, …)
• The Arm Android NN HAL driver provides access to ArmNN for Android applications
[Diagram: ML workload → Android NN / 3rd-party ML library (TensorFlow, Caffe, …) → ArmNN (via the HAL driver on Android) → Compute Library / NPU driver → Cortex-A CPU, Mali GPU, Ethos NPU]
ArmNN
The inference engine that enables efficient deployment of ML workloads on Arm Cortex-A CPUs, Mali GPUs, and Ethos NPUs.
• ArmNN is also available through the Android NN API
• The ArmNN SDK includes tools and parsers for various frameworks (e.g. TensorFlow, TensorFlow Lite, Caffe, …) and uses the Compute Library to target Arm Cortex-A CPUs and Arm Mali GPUs
https://developer.arm.com/ip-products/processors/machine-learning/arm-nn
Arm Compute Library (ACL)
A bundle of optimized functions for ML, computer vision, and image processing on Arm Cortex-A CPUs and Arm Mali GPUs.
• Provides acceleration on Arm CPUs through Neon/SVE (aarch32/aarch64)
• Provides acceleration on Arm Mali GPUs through OpenCL
• Over 120 functions implemented!
https://developer.arm.com/technologies/compute-library
Functions
Machine learning:
• Activation
• Convolution
• Depth-wise Convolution
• Normalization
• Pooling
• Softmax
• and many more…
Computer vision:
• Canny Edge
• Harris Corner
• HOG
• Gaussian Pyramid
• Gradient
• Optical Flow
• and many more…
Image processing:
• Colour convert
• Dilate
• Gaussian/Sobel 3x3/5x5
• Histogram equalization
• Remap
• Warp Affine/Perspective
• and many more…
Key Features
• Multiple data types supported on CPU and GPU: FP32/FP16/Int8/Uint8
• More than 40 examples ready to be profiled with our benchmark test suite
• Different algorithms available for the convolution layer (GEMM, Winograd, FFT and Direct)
• Memory manager
• Micro-architecture optimizations for key algorithms like GEMM or Winograd
• OpenCL tuner
• Fast math support
OpenCL Tuner
Simply tweaking the number of work-items in a work-group can have a huge performance impact.
Setting the optimal local work-group size (LWS) can be tricky because of:
• Cache size
• Maximum number of threads per compute unit
• Work-group dispatching
• Input and output dimensions
• …
In ACL we implemented the OpenCL tuner to search for the optimal LWS.
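The tuner's idea can be sketched as a simple exhaustive search over candidate LWS values. This is a toy illustration, not the ACL implementation; `measure_kernel` and the candidate timings below are hypothetical stand-ins for a real OpenCL kernel-timing harness:

```python
def tune_lws(measure_kernel, candidates):
    """Exhaustively try each candidate LWS and keep the fastest.

    measure_kernel(lws) is assumed to run the kernel with the given
    local work-group size and return its measured execution time.
    """
    best_lws, best_time = None, float("inf")
    for lws in candidates:
        t = measure_kernel(lws)
        if t < best_time:
            best_lws, best_time = lws, t
    return best_lws, best_time

# Hypothetical measurements: (8, 8) happens to be the sweet spot.
timings = {(4, 4): 3.1, (8, 8): 1.2, (16, 16): 2.4, (32, 4): 2.9}
best, t = tune_lws(lambda lws: timings[lws], list(timings))
# 'best' is now (8, 8)
```

The Exhaustive/Normal/Rapid modes on the next slide correspond, roughly, to how large the candidate list is allowed to be.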
OpenCL Tuner Improvement
[Chart: performance improvement (%) from the OpenCL tuner in Exhaustive, Normal and Rapid modes, ranging from 0% to 60% (higher is better), across: AlexNet F32, GoogleNet F32, Inception V3 F32, Inception V4 F32, MobileNet F32, MobileNet v2 F32, ResNet12 F32, ResNet50 F32, SqueezeNet F32, VGG16 F32, MobileNet QASYMM8, MobileNet v2 QASYMM8, VGG VDSR QASYMM8]
Three different levels of tuning are supported, trading off performance improvement against tuning time.
Hands-on session
Let’s have fun on Raspberry Pi 4!
Raspberry Pi 4
A tiny but powerful development board with the mission to put the power of computing and digital making into the hands of people all over the world*
Raspberry Pi 4 is the latest version, released in June 2019:
• Arm Cortex-A72 quad-core running at 1.5 GHz
• Wi-Fi, Bluetooth, Ethernet
• Dual-monitor support (4K)
• and many more…
* https://www.raspberrypi.org/
What Will You Find on Your Raspberry Pi?
• Raspbian OS pre-installed (September 2019 release), with serial communication and remote connections (ssh) enabled
• Pre-built binaries for ArmNN and the Compute Library (19.08 release)
• Ready-to-use examples for the hands-on sessions
All the instructions to reproduce the labs from scratch can be found in the Backup section of the Support Material.
Note about Raspbian OS
The Arm Cortex-A72 is an aarch64-capable processor, but Raspbian OS is based on an Armv7-A (aarch32) filesystem.
This means that the Arm Compute Library will not be able to call the optimized routines for aarch64.
How to Control the Raspberry Pi
There are multiple ways to control the Raspberry Pi, depending on your needs. Assuming:
• both the host (your laptop) and the Raspberry Pi are on the same network
• an internet connection and control of the desktop interface are needed
we can use the serial and Ethernet cables to control our board.
Raspberry Pi 4: Connections
[Diagram: the laptop (host) connects to the Raspberry Pi 4 (target) via a USB serial cable; the Pi is powered through a USB-C power cable from an AC socket and networked through a router / Ethernet socket]
Raspberry Pi 4: Serial Cable
[Diagram: the Vcc/Gnd/Tx/Rx pins on the Raspberry Pi 4 (target) connect to Vcc/Gnd/Rx/Tx on the laptop (host) — Tx and Rx are swapped! Wire colors: Red = Vcc, Black = Gnd, White = Tx, Green = Rx]
Step 1: Get IP Address
• Open the serial terminal following the instructions provided
• Log in to Raspbian OS
  • User: pi
  • Password: raspberry
• In the terminal window, enter the command:
  $ vncserver
• Copy the IP address along with the desktop number (e.g. 10.42.0.53:1)
Step 2: Control the Desktop Interface
• Install VNC Viewer for Google Chrome on your laptop, following the instructions provided
• Open the VNC app and enter the IP address and desktop number of the Raspberry Pi
• Log in to Raspbian OS using the same credentials as in Step 1
Setup: https://tinyurl.com/u7a6jfv
Workshop instructions: https://tinyurl.com/uwvy59a
Now, we are ready to play!
Lab 1
Run Image Classification for Fire Detection
Arm NN Components
• Arm NN Core
  • Graph builder API
  • Optimizer
  • Runtime
  • Backends: Reference and Neon/CL via the Compute Library (new backend planned)
• Parsers
  • TensorFlow
  • TensorFlow Lite
  • Caffe
  • ONNX
• Android NNAPI Driver
Arm NN Supported Layers … more to come!
Activation, Addition, BatchNormalization, BatchToSpaceND, Constant, Convolution2d, DepthwiseConvolution2d, DetectionPostProcess, Division, Equal, Floor, FullyConnected, Gather, Greater, Input, L2Normalization, LSTM, Maximum, Mean, MemCopy, Merger, Minimum, Multiplication, Normalization, Output, Pad, Permute, Pooling2d, Reshape, ResizeBilinear, Rsqrt, Softmax, SpaceToBatchND, Splitter, StridedSlice, Subtraction
Lab 2
Run Time Series ML Model
Performance Analysis for Deep Learning Inference
Introduction
What can we say about the following measurements?
• SqueezeNet: 200 fps, 5 ms
• MobileNet: 250 fps, 4 ms
Nothing…
• What processor?
• What frequency?
• How many threads?
• What data type?
Goals for Performance Analysis
1. Give meaning to our performance numbers (i.e. good? bad?)
2. Understand performance bottlenecks in order to fix inefficiencies
• No lucky or casual optimizations
• Better optimization planning for the short and long term
We need to introduce a formal methodology.
DNN Under Test (DUT)
To give meaning to our performance numbers, we need to know:
• Hardware capabilities
• Algorithm complexity (the number of ops required by the algorithm)
[Diagram: input (e.g. image, audio, text) → DNN → output, measured as execution time / fps]
HW Capabilities vs Algorithm Complexity
Hardware capabilities:
• (Fl)ops/Core/Cycle: the number of arithmetic operations per core and per cycle
• MACs/Core/Cycle: the number of multiply-accumulate (mul+add) operations per core and per cycle
• Maximum external memory bandwidth (e.g. DDR): the number of bytes read from or stored into memory per unit of time (bytes/s)
Algorithm complexity:
• The sum of the ops required by each function/layer in the network
• Well defined for DNNs, and mainly dominated by the MAC operations in convolution/fully connected layers
• Since a convolution layer can use multiple algorithms (GEMM, Winograd, FFT, …), we should count the ops required by the one actually executed
Example Algorithm Complexity
VGG16 layer             Parameters (W/H/Kernel/IFMs/OFMs)  Algorithm  GFlops
Conv1                   224/224/3/3/64                     GEMM       3.55
Conv2                   224/224/3/64/64                    GEMM       1.95
Conv3                   112/112/3/64/128                   GEMM       1.20
Conv4                   112/112/3/128/128                  GEMM       2.08
Conv5                   56/56/3/128/256                    GEMM       1.20
Conv6 / Conv7           56/56/3/256/256                    GEMM       2.39 x2
…                       …                                  …          …
Conv11 / Conv12 / Conv13  14/14/3/512/512                  GEMM       0.354 x3
FC1                     1/1/1/4096                         GEMM       0.0000502
FC2                     1/1/1/4096                         GEMM       0.00000819
FC3                     1/1/1/1000                         GEMM       0.00000819
Total GFlops ~30
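The total above can be reproduced with a short script. The helper below counts 2 × H × W × K × K × IFMs × OFMs flops per convolution (one multiply plus one add per MAC); the layer list is the standard VGG16 configuration, and the convolution layers alone land at roughly 30 GFlops, matching the table's total:

```python
def conv_flops(h, w, k, ifms, ofms):
    # One MAC = 1 mul + 1 add = 2 flops, per output pixel and output channel.
    return 2 * h * w * k * k * ifms * ofms

# VGG16 convolution layers: (output H, output W, kernel, IFMs, OFMs)
vgg16_convs = [
    (224, 224, 3, 3, 64), (224, 224, 3, 64, 64),
    (112, 112, 3, 64, 128), (112, 112, 3, 128, 128),
    (56, 56, 3, 128, 256), (56, 56, 3, 256, 256), (56, 56, 3, 256, 256),
    (28, 28, 3, 256, 512), (28, 28, 3, 512, 512), (28, 28, 3, 512, 512),
    (14, 14, 3, 512, 512), (14, 14, 3, 512, 512), (14, 14, 3, 512, 512),
]
total = sum(conv_flops(*layer) for layer in vgg16_convs)
print(f"Total: {total / 1e9:.1f} GFlops")  # ~30.7 GFlops for the conv layers
```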
Processor Utilization
Expresses how well the algorithm uses the available processor computational resources.
Putil = Tt / Ta ∈ [0, 1]
where:
• Ta: actual execution time (the time measured)
• Tt: theoretical execution time
Tt = Ops_algorithm / (Ops_per_core_per_cycle × num_cores × freq)
• Ops_algorithm: the ops required by the algorithm
• Ops_per_core_per_cycle: the processor's Ops/Core/Cycle
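As a sketch, Putil can be computed directly from these definitions (the function and parameter names below are my own, not from ACL):

```python
def processor_utilization(ops, ops_per_core_cycle, num_cores, freq_hz, ta_s):
    """Putil = Tt / Ta, where Tt = ops / (ops/core/cycle * cores * freq)."""
    tt = ops / (ops_per_core_cycle * num_cores * freq_hz)
    return tt / ta_s

# Toy numbers: a 1 Gop network on a 10-core, 1 GHz chip doing 10 ops/core/cycle.
# Peak = 10 * 10 * 1e9 = 1e11 ops/s, so Tt = 0.01 s; measured Ta = 0.02 s.
putil = processor_utilization(1e9, 10, 10, 1e9, 0.02)  # 0.5 -> 50% utilization
```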
Static Memory Bound Analysis
Used to check whether an algorithm may be limited by memory transfers.
Algorithm_BW = (DR_total + DW_total) / Tt
where:
• Tt: theoretical execution time
• DR_total: total data read
• DW_total: total data written
Algorithm_BW should be compared with the external memory bandwidth (e.g. DDR): if it exceeds the available bandwidth, the algorithm is memory bound.
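A sketch of this check in code (the helper name is mine); the numbers are taken from the second GEMM in Exercise 3 below, which turns out to be memory bound:

```python
def algorithm_bandwidth(dr_total_gb, dw_total_gb, tt_s):
    """Bandwidth the algorithm would need in order to hit its theoretical time Tt."""
    return (dr_total_gb + dw_total_gb) / tt_s

# Second GEMM from Exercise 3: M,N,K = 1, 4096, 25088 in F32.
bw = algorithm_bandwidth(9.34e-5, 0.38, 0.0016)  # ~237.6 GB/s required
memory_bound = bw > 12.9                         # vs 12.9 GB/s of DDR -> True
```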
Exercise 1
Feature                         Value
GFlops network                  1.1
Actual execution time (ms)      4
Input image size                224x224
Flops/Core/Cycle                512
Num. cores                      20
Frequency                       500 MHz
Max external memory bandwidth   12.9 GB/s
Processor utilization? 5.6% — really low!
Exercise 2
Feature                         Value
GFlops network                  34.3
Actual execution time (ms)      200
Input image size                224x224
Flops/Core/Cycle                16
Num. cores                      4
Frequency                       2 GHz
Max external memory bandwidth   12.9 GB/s
Processor utilization? 133.6% — it cannot be > 100%! Maybe the network's GFlops figure is not correct?
Exercise 3
Feature                         Value
Flops/Core/Cycle                16
Num. cores                      4
Frequency                       2 GHz
Max external memory bandwidth   12.9 GB/s

Data type  M, N, K           DR [GB]   DW [GB]   Tt [s]   Algorithm Max BW [GB/s]
F32        2704, 256, 1152   0.01160   0.00109   0.012    1.019
F32        1, 4096, 25088    9.34E-5   0.38      0.0016   238.5
The second case (238.5 GB/s needed vs 12.9 GB/s available) is memory bound!
Matrix multiplication:
• M: number of output rows
• N: number of output columns
• K: number of right-hand-side matrix rows
Formal Methodology
A top-down approach, made up of increasingly granular investigations to find performance bottlenecks:
• Level 1: Graph (DNN) profiling — network timing
• Level 2: Function profiling — layer/function timing
• Level 3: HW counters profiling — function HW counters (L2 cache hits, mis-predictions, cycles)
Driver Overhead Estimation
[Diagram: graph profiling measures the whole network time TN0 on the CPU, which includes driver overhead between the functions func0, func1, func2 running on the other processor at times t0, t1, t2; function profiling measures t0, t1, t2 individually, but the profiler itself increases the overhead, so the profiled network time TN1 != TN0]
TN0 != TN1
TN0 != (t0 + t1 + t2)
Overhead = TN0 - (t0 + t1 + t2)
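The overhead formula above can be sketched as a one-liner; the measurements below are hypothetical:

```python
def driver_overhead(network_time, function_times):
    """Overhead = TN0 - (t0 + t1 + ...): the graph-profiling network time
    minus the sum of the per-function times from function profiling."""
    return network_time - sum(function_times)

# Hypothetical measurements in milliseconds.
tn0 = 12.0            # network time TN0 from graph profiling
ts = [3.0, 4.5, 2.5]  # t0, t1, t2 from function profiling
overhead = driver_overhead(tn0, ts)  # 12.0 - 10.0 = 2.0 ms of driver overhead
```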
Streamline
Arm® Streamline Performance Analyzer is a system-wide visualizer and profiler for Arm hardware targets, and models of them.
Locating and Fixing Inefficiencies
With graph and function profiling we can gather four important pieces of information:
1. Network utilization (Phase 1)
2. Overhead limitation (Phase 1)
3. Layer/function utilization (Phase 2)
4. Memory transfer limitation (Phase 2)
Phase 1
1. Graph profiling: compute the network utilization Putil.
   • If Putil is not below Threshold_util, the DNN is considered optimized — done.
2. Otherwise, run function profiling and estimate the driver overhead.
   • If the overhead is not below Threshold_overhead, perform overhead optimization and start over.
   • If the overhead is below Threshold_overhead, move on to Phase 2 and, if needed, to Level 3 (HW counters profiling).
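The Phase 1 flow can be sketched as a small decision function (the threshold values and return labels are my own, for illustration):

```python
def phase1_next_step(putil, overhead, util_threshold=0.5, overhead_threshold=0.1):
    """Decide the next step of the top-down methodology after graph profiling.

    putil: network processor utilization in [0, 1]
    overhead: driver overhead as a fraction of the network time
    """
    if putil >= util_threshold:
        return "DNN optimized"          # utilization is already good enough
    if overhead >= overhead_threshold:
        return "overhead optimization"  # fix the driver overhead first
    return "phase 2"                    # dig into the per-layer analysis

phase1_next_step(0.7, 0.02)  # -> "DNN optimized"
phase1_next_step(0.1, 0.30)  # -> "overhead optimization"
phase1_next_step(0.1, 0.02)  # -> "phase 2"
```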
Phase 2 (1 of 2)
Starting from the function profiling results, perform a layer analysis:
1. Remove from the list the layers with % time spent < Threshold_t (e.g. < 1)
2. Remove from the list the layers with Putil > Threshold_p (e.g. > 50)
3. Is the list empty? If yes, go back to Phase 1; if no, continue with the memory bandwidth evaluation (next slide).

Layer analysis: ResNet12 (F32)
Layer          % time spent   % Putil
Conv12         61.2           5.3
Conv1          3.63           40.1
Conv4          3.15           58.3
Conv6          3.15           59.2
Conv3          3.11           61.3
Conv1          3.11           60.1
Conv5          3.1            59.3
Conv7          3.1            58.6
…
Activation12   0.01           3
Phase 2 (2 of 2)
For each remaining layer, evaluate the memory bandwidth and check whether it is memory bound:
• If it is not memory bound, go to Level 3 (HW counters profiling).
• If it is memory bound:
  • Can we use a different algorithm (e.g. FFT)? If yes, change the algorithm, update the Flops required, and go back to Phase 1.
  • Otherwise, can we change the DNN design? If yes, change the DNN design and go back to Phase 1.
  • If neither is possible, go to Level 3 (HW counters profiling).

Layer analysis: ResNet12 (F32)
Layer    Max memory bandwidth [GB/s]
Conv12   79.5 (GEMM-based)
Conv1    4.71 (GEMM-based)

Features                   Value
Flops/Core/Cycle           16
Num. cores                 4
Frequency                  2 GHz
Max memory bandwidth       12.9 GB/s
Here Conv12 (79.5 GB/s) exceeds the 12.9 GB/s available, so it is memory bound, while Conv1 (4.71 GB/s) is not.
MLPerf
A useful set of benchmarks for measuring the training and inference performance of ML hardware, software, and services, developed as a collaboration of companies and researchers.
https://mlperf.org
Lab 3
Performance Evaluation with ACL
ACL Benchmark Framework
• Currently in an experimental phase
• Only available on the Arm Compute Library development branch (Linaro)
• Allows performance profiling of all the examples included in the “examples” folder
• For each example, it generates a new binary with the prefix “benchmark_” in the build/tests folder
Benchmark Examples
Plenty of DNN examples are included in ACL and ready to be profiled, e.g.:
./benchmark_graph_mobilenet --example_args=--threads=1 --iterations=10
./benchmark_graph_mobilenet --example_args=--threads=4 --iterations=10
./benchmark_graph_mobilenet --example_args=--threads=4 --iterations=10 --instruments=scheduler_timer_ms