cvml software development tools

N. Tsapanos, P. Bassia, P. Giannakeris,

Prof. Ioannis Pitas

Aristotle University of Thessaloniki

[email protected]

www.aiia.csd.auth.gr

Version 2.5

CVML software development

tools

1

Distributed computing (under development)

• ROS

• Libraries

• OpenCV

• BLAS, cuBLAS

• MKL DNN, cuDNN

• DNN Frameworks

• Neon, Tensorflow, Pytortch, Keras, MXNet

• Distributed/cloud computing

• MapReduce programming model

• Apache Spark

• Collaborative SW Development

• GitHub/Bitbucket.

Robotic Operating System (ROS)

ROS is used in developing robotic applications• ROS architecture:

• ROS master, ROS nodes.

• Node communications:

• Topics

• Publisher and Subscriber nodes.

• Functionalities:• Node graph visualization • Logging/plotting

• Checks and diagnostics.

ROS architecture

(Adapted from Willow Garage’s “What is ROS?” Presentation)

Master

Publisher

Publisher

Subscriber

Subscriber

/topic

(DNS-like)

ROS architecture

• A Node is a single ROS-enabled program: – Most communication happens between nodes. – Nodes can run on many different devices.

• One Master per system.

ROS Master • Enables nodes to locate one another • Handles Parameter Server

camera interface

image processing

motion logic

robot planning

robot interface

ROS Communications

msg … msg … msg

Publisher Node

Publishing msg data On channel /topic

Publisher Node

Advertises /topic is available

with type msg

Subscriber Node

Subscribed to /topicwith type msg

ROS MASTER

Subscriber Node

Listening for /topicwith type msg

/topic

Topics are for Streaming Data.

msg … msg … msg

Libraries vs Frameworks

7https://www.programcreek.com/

•When you use a library, you are in charge of the flow of the

application. You are choosing when and where to call the library.

•When you use a framework, the framework is in charge of the flow.

It provides some places for you to plug in your code, but it calls the

code you plugged in as needed.

https://www.programcreek.com/

OpenCVOpen Computer vision library:• Image processing.• Video analysis.

• Deep NN, Machine Learning• Object detection.

• Computer vision• Camera calibration, 3D world modeling.

Basic Linear Algebra

Subprograms (BLAS) Library

Basic Linear Algebra Subprograms (BLAS) is a software

library of high performance Linear Algebra routines:

• BLAS has three routine sets (“levels“).

• They correspond to both the chronological order of

definition and publication, as well as the degree of algorithm

complexity.

9

BLAS

BLAS Level 1:

• It contains 𝑂(𝑛) routines for vector operations:

• Dot products, vector norms for vector dimensionality 𝑛;

• A generalized vector addition of the form:

𝐲 ← 𝑎𝐱 + 𝐲.• Several other operations.

10

BLAS

BLAS Level 2:

• It contains 𝑂(𝑛2) matrix-vector operations for vector

dimensionality 𝑛, matrix dimensionality 𝑛 × 𝑛.

• A generalized matrix to vector multiplication:

𝐲 ← 𝛼𝐀𝐱 + 𝛽𝐲.• System solver for system of linear equation:

𝐓𝐱 = 𝐲,

• 𝐓: triangular matrix.

11

BLAS

BLAS Level 3:

• It contains 𝑂(𝑛3) matrix to matrix operations for 𝑛 ×𝑛 matrices:

• General matrix multiplication of the form:

𝐂 ← a𝐀𝐁 + β𝐂,• 𝐀, 𝐁 can optionally be transposed or Hermitian-conjugated inside

the routine;

• All three matrices 𝐀,𝐁, 𝐂 may be strided.

• Routines for solving:

𝐁 ← a𝐓−1𝐁,

• 𝐓 is a triangular matrix.12

BLAS• BLAS is extremely fast.

• There are several implementations.

• Some of them are hardware specific.

• Most of them are heavily optimized:

• Taking advantage of data locality in memory to reduce CPU cache

misses.

• Loop unrolling (probably best handled by the compiler).

• Using multiple threads to execute on every CPU core.

• Employing the Single Input Multiple Data (SIMD) CPU operation

capabilities (Streaming SIMD Extensions).

13

cuBLAS

• NVIDIA cuBLAS library is a fast GPU-accelerated

implementation of the standard basic linear algebra

subroutines (BLAS).

• Speed up is achieved by deploying compute-intensive

operations to a single GPU or scale up and distribute work

across multi-GPU configurations efficiently.

• Data needs to be copied to the GPU memory prior to

calling cuBLAS functions. Results need to be copied

backwards to the CPU memory.

cuBLAS

• It has support for multiple GPUs, concurrent kernels and streams.

• The optimization includes:

• Software pipelining.

• Use of vector memory operations that keeps data coalesced.

• Hide any kind of memory path latencies, with instruction and thread

level parallelism (ILP, TLP).

• Take advantage of each different GPU memory (shared, constant,

texture).

• Loop unrolling.

• cuBLAS 9.2 is up to 35 times faster than BLAS.

15

KL DNN

• Intel Math Kernel Library – Deep Neural Network is an

open source, performance-enhancing library for

accelerating deep learning frameworks on Intel Architecture.

• Optimizations are abstracted and integrated directly into

PyTorch, MXNet frameworks.

• Makes use of SIMD instructions available on Intel

processors

• Engine choice available (CPU/GPU).

16

cuDNN

• The NVIDIA CUDA® Deep Neural Network library (cuDNN) is a GPU-

accelerated library of primitives for deep neural networks.

• Standard routines:

• Forward and backward convolution.

• Pooling.

• Normalization layers.

• Activation layers.

• cuDNN accelerates widely used deep learning frameworks:

• Caffe, Keras, MATLAB, TensorFlow, Torch.

17https://developer.nvidia.com/cudnn

https://developer.nvidia.com/cudnn

DNN Frameworks

Image Source: Heehoon Kim, Hyoungwook Nam, Wookeun Jung, and Jaejin Le - Performance Analysis of CNN Frameworks for GPUs

DNN Frameworks• All 5 frameworks work with cuDNN as backend.

• cuDNN unfortunately is not open source.

• cuDNN supports FFT and Winograd convolutions.

Image Source: Heehoon Kim, Hyoungwook Nam, Wookeun Jung, and Jaejin Le - Performance Analysis of CNN Frameworks for GPUs

DNN Frameworks

Neon

• Intel Neon is a modern deep learning framework created by

Nervana Systems.

• Implemented in Python, while Nervana Caffe framework is

written in C and C++.

• Impressively fast compared to other frameworks.

• Image processing oriented (not general purpose enough).

20

DNN FrameworksNeon:

• Written in Python and C;

• It does not support Windows;

• Uses MKL for CPU (highly optimized by Intel);

• Supports CUDA for GPU;

• Known mostly to be the first to implement Winograd

convolutions faster than others;

• The library is open source and it is an good starting point for

those interested to study the code.

DNN FrameworksTensorFlow (TF)

• For Python (but also APIs in C++, C#, JavaScript, Java).

• It operates with static computation graphs.

• Recommended for production (cloud, mobile).

• Supported by Google.

• Scalable.

Pytorch

• Main competitor of TF.

• Dynamically updated graph (advantage over TF).

• Suited for research, small projects and prototyping.

• Supports distributed learning.22

DNN FrameworksKeras

• High-level API for TF.

• Suitable for quick testing, prototyping.

• Very easy to use, good for beginners.

• Many plug-and-play architectures available.

MXNet

• Highly scalable.

• Very effective parallelization on multiple GPUs and machines.

• Supports many languages (C ++, Python, R, Julia, JavaScript,

Scala, Go, Perl).

• Supported by Apache.23

DNN Frameworks

Recommended for beginners:

Recommended for research:

Recommended for production

(cloud or mobile):

24

Distributed/cloud

computing

• An alternative to expensive Graphics cards and

supercomputers.

• It can consist of any combination of PCs of

various processing power.

• Extremely flexible and scalable.

• Designed to handle Big Data.

MapReduce programming

model

▪ This programming model was developed to implement

the Map and Reduce commands, which can, in turn, be

used to write distributed programs with ease.

▪ The Map command applies a function on every

element of the distributed dataset. This can be used to

filter, sort, or transform the data.

▪ The Reduce command collects the distributed

dataset, or the results of a previous Map or Reduce

command, in a summary operation, such as addition.

Apache Spark• Open software for cluster

computing.

• Developed by Apache.

• Distributed big data computing.

• It is based on MapReduce.

• Up to 100 times faster than

Hadoop MapReduce.

• Machine Learning, Graph

Processing, SQL, Streaming

modules.

Apache Spark

▪ The Apache Spark framework is an implementation of

the MapReduce programming model.

▪ It provides the Map and Reduce command, among

others, in a fault-tolerant environment.

▪ Its advantages over other alternatives, like Hadoop,

include the ability to cache data in memory, in order to

minimize disk access.

▪ The data are stored in data collections named Resilient

Distributed Datasets (RDDs).

Apache Spark

• Supported languages include Java, Scala and Python.

• Spark also includes specialized APIs, like GraphX for

graph processing and SparkML for Machine Learning.

• It is possible to use native libraries to process data

(avoid OpenJDK).

Apache Spark

▪ In Spark, a master node coordinates several worker

nodes.

▪ We used VirtualBox VMs running Ubuntu to set up our

computing cluster.

Apache Spark Mllib

• Machine Learning Library:

• Clustering

• Classification

• Regression

• Data modeling

• Dimensionality reduction

• Available in Pyspark (Spark

wrapper for Python).

other partners offering

Collaborative SW Development

32

• Modern on-line collaborative development platforms provide a common

project repository hosted on-cloud.

• Several features are typically available:

• Distributed version control.

• Source code management.

• Access control.

• Bug tracking.

• Task management.

• Web interface.

Collaborative SW Development

33

• Several alternatives exist.

• Most popular front-end services:

• GitHub.

• Bitbucket.

• Most popular underlying open-source version control systems:

• Git.

• Mercurial.

34

References[BRA2008] Bradski G, Kaehler A. Learning OpenCV: Computer vision with the

OpenCV library. " O'Reilly Media, Inc."; 2008 Sep 24.

[KIR2016] Kirk DB, Wen-Mei WH. Programming massively parallel processors: a

hands-on approach. Morgan kaufmann; 2016.

[ABA2016] Abadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, Devin M,

Ghemawat S, Irving G, Isard M, Kudlur M. Tensorflow: A system for large-scale

machine learning. In12th {USENIX} Symposium on Operating Systems Design and

Implementation ({OSDI} 16) 2016 (pp. 265-283).

[DEA2010] Dean J, Ghemawat S. MapReduce: a flexible data processing tool.

Communications of the ACM. 2010 Jan 1;53(1):72-7.

[ZAH2026] Zaharia M, Xin RS, Wendell P, Das T, Armbrust M, Dave A, Meng X,

Rosen J, Venkataraman S, Franklin MJ, Ghodsi A. Apache spark: a unified engine

for big data processing. Communications of the ACM. 2016 Oct 28;59(11):56-65.

[KET2017] Ketkar N. Introduction to pytorch. InDeep learning with python 2017 (pp.

195-208). Apress.

Q & A

Thank you very much for your attention!

Contact: Prof. I. Pitas

[email protected]

35

cvml software development tools

Documents