aa 2018-2019 computer vision - cal unibg · computer vision (cv), from the perspective of...

A brief introduction to Computer Vision

Michele Ermidoro

SPEAKER

Michele Ermidoro

PLACE

UniBG, Dalmine

DATE

12 March 2019

Ingegneria dei Sistemi di Controllo

AA 2018-2019

Outline

1. What is Machine Vision?

2. Digital image: start from basics

• Image representation

• Image processing

3. Classic approach

4. Convolutional Neural Network

5. An Object Detection Pipeline

6. Deep Learning Framework

2

What is Machine Vision?

Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do.Computer vision tasks include methods for acquiring, processing, analyzing and understanding digital images, and extraction of high-dimensional data from the real world in order to produce numerical or symbolic information, e.g., in the forms of decisions.Understanding in this context means the transformation of visual images (the input of the retina) into descriptions of the world that can interface with other thought processes and elicit appropriate action.

[https://en.wikipedia.org/wiki/Computer_vision]

Machine vision (MV) is the technology and methods used to provide imaging-based automatic inspection and analysis for such applications as automatic inspection, process control, and robot guidance, usually in industryMachine vision as a systems engineering discipline can be considered distinct from computer vision, a form of computer science. It attempts to integrate existing technologies in new ways and apply them to solve real world problems.

[https://en.wikipedia.org/wiki/Machine_vision]


Computer

Vision

Machine

Vision

Image

Processing

Neurobiology Imaging

Optics

Signal

processing

RoboticsAr tificial

intelligence(AI)

Machine

Learning

Math

Computer

intelligence

Object

Detection

Cognitive

Vision

Geometry

Statistics

Optimization

Human Vision

Lighting

Imaging

sensors

Lenses

Odometry

Navigation

Data

Estimation

Filtering


• Almost 80% of the data traveling on the net is visual data

• Everybody has a smartphone, and every smartphone has at least 2 cameras

• "A picture is worth a thousand words" is an English language-idiom

• A camera is one of the powerful sensor in a lot of applications

• It has wide fields of application:• Robotics• Surveillance• Industry • Self-driving cars/drones/buses..• Medical• …

http://crcv.ucf.edu/people/faculty/Bagci/research.php

Computer Vision – Why?

Computer Vision – Hype?


≈5.1

Human


≈5.1

Human

What happened

here?

1. Computational Power

2. Convolutional Neural

Network (CNN)

Computational Power

https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/

2009 - “We argue that modern graphics processors far surpass the computational capabilities of multicore CPUs, and have the potential to revolutionize the applicability of deep unsupervised learning methods” “Large-scale Deep Unsupervised Learning using Graphics

Processors” Rajat Raina, Anand Madhavan, Andrew Y. Ng

https://blog.inten.to/cpu-hardware-for-deep-learning-b91f53cb18af

IMAGENET Challenge

Convolutional Neural Network

Computer Vision Tasks

http://vision.stanford.edu

Classification

What’s in the image?

• People• Car• Traffic light• Clock • …


Car


Detection

What’s in the image? And Where it is?

• A car in the orange box

CarPerson

Clock



Detection


• A car in the orange box

• A person in the blue box

• A clock in the green box



Segmentation


And Which pixels are?

Car

Clock



Annotation

How would you describe the picture?

“People crossing a

street while a car

is waiting”


https://engineering.matterport.com/splash-of-color-instance-

segmentation-with-mask-r-cnn-and-tensorflow-7c761e238b46

• Face detection

• Smile detection

• Eye-open detection

• …

Computer Vision Tasks - Others

https://medium.com/waymo/recreating-the-self-driving-experience-the-making-of-the-waymo-360-video-

37a80466af49


Figure credit: Dai, He, and Sun, “Instance-aware Semantic Segmentation via Multi-task Network Cascades”, CVPR 2016


Cardiologist-Level Arrhythmia Detection

with Convolutional Neural Networks


https://www.youtube.com/watch?v=bcswZLwhTUI


https://www.youtube.com/watch?v=NrmMk1Myrxc


Digital images Basics

Classic approachFeature engineering

CNN

What’s next

Digital image: start from basicsImage representation

Image representation

An image, inside a PC, is just a matrix of numbers

255 -> white

0 -> black

0

1

1

1

1

0

1

1

1

1

1

0

1

1

0

1

4x4x1 matrixColor depth – 1 bitColor channels – 116 pixel


An image, inside a PC, is just a matrix of numbers

217

255

255

255

255

191

255

255

255

255

255

127

255

255

0

255



255

255

255

255

255

170

255

255

255

255

255

85

255

255

0

255


0

255

255

255

255

85

255

255

255

255

255

170

255

255

255

255


2453x2453x3 matrixColor depth – 24 bitColor channels – 3Color space - sRGB

6 Mpixel (6.017.209 pixel)

Color spaces

Color space, also known as the color model (or colorsystem), is an abstract mathematical model which simply describes the range of colors as tuples of numbers, typically as 3 or 4 values or color components

There are a variety of color spaces, such as RGB, CMYK, HSV, CIEXYZ..

CMYK

(Cyan, Magenta, Yellow,

Key black)

RGB (Red, Green, Blue)

Color spaces

CIE XYZ

It is the most accurate from a scientific point of view. It tries to represent all the colors that an human eye can see.

RGBThe RGB color model is an additive color model in which red, green and blue light are added. The name comes from the three additive primary colors, red, green, and blue.

HSB (Hue/Sat/Bright)

Designed in the 1970s by

computer graphics researchers

to more closely align with the

way human vision perceives

color-making attributes

Histogram

A color histogram is a representation of the distribution of colors in an image

Gray-scale histogram

Color histogram

Histogram equalization

Digital image: start from basicsImage processing

BinarizationImage binarization

It trasfroms a grey scale image in a binary

image, depending on a threshold

𝑔[𝑛,𝑚] = ቊ255, 𝑓 𝑛,𝑚 > 100

0, 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒

Filters

Most of the operation on an image are made using filters

• Filters are mathematic functions which are applied to each pixel.

• The filter is represented using a matrix, called Kernel

• Depending on the size of the kernel (3x3, 5x5, 7x7..), the functions will involve the pixel and its neighbors.

• The filters are applied to an image convolvingthe image and the kernel

• The kernel sum must be 1

ConvolutionConvolution is the process of adding each element of the image to its local neighbors, weighted by the kernel.

Input image

Kernel

-13

Output

STEP 1

http://www.songho.ca/dsp/convolution/convolution2d_example.html

Convolution

Input image

Kernel Output

Convolution is the process of adding each element of the image to its local neighbors, weighted by the kernel.

STEP 5

-13

-18

-20

-24

-17


Convolution

Input image

Kernel Output

Convolution is the process of adding each element of the image to its local neighbors, weighted by the kernel.

STEP 9

-13

-18

13

-20

-24

20

-17

-18

17


Convolution

Kernel shape

Depending on the shape of the kernel, different operations can be made:

-1 -1-1

-1 -18

-1 -1-1

Edge-Detection

0 0-1

-1 -15

0 0-1

Sharpen

1 11

1 11

1 11

Blur/moving average

1

9

0 00

0 01

0 00

Original

Kernel shape

Depending on the shape of the kernel, different operations can be made:

-1

2

-1

-1

2

-1

-1

2

-1

-1

-1

-1

2

2

2

-1

-1

-1

-1

-1

2

-1

2

-1

2

-1

-1

-1

-1

-1

-1

8

-1

-1

-1

-1

EDGE 45° Lines

Horiz. Lines Vert. Lines

Kernel example

Denoising

Reduce noise in an image

1 12

2 24

1 12

1

16

Filtro Gaussiano

Classic approach

Visual recognition

Analyze images and extract high level information, like what’s in the

picture, where it is..

Different viewpoint Different illumination

Deformations Different shape of the same object

How to identify an object?

Object Bag of ‘Features’

How to identify an object?

The idea is to extract from the image some of its most important characteristics. We’ll call them “features”

These features will be then compared with a dictionary, where the specific knowledge is stored.

The selection of what’s important to search in the image is called features

engineering and extraction

The comparison phase is done through a classifier

The knowledge is passed to the classifier during the training

Classification pipeline

The classifier is a function that receive as input all the representative features

of an image and it decides what’s in the picture.

𝑓 = 𝑎𝑝𝑝𝑙𝑒

𝑓 = 𝑐𝑜𝑤

𝑓 = 𝑡𝑜𝑚𝑎𝑡𝑜

features

extraction


Train images

Features Training

Training

labels

Classifier

Test image

features

extractionFeatures Classifier It’s an apple

Prediction

features

extraction


Train images

Features Training

Training

labels

Classifier

Test image

features

extractionFeatures Classifier It’s an apple

Prediction

Features engineering Choice of the classifier

Features?

In Computer Vision, a feature is a piece of information which is particularly relevant in the solution of a certain problem (e.g. in the detection of a face, the presence of two eyes is a good feature)

We can identify 3 hierarchical categories of features:

Low-level features• Colors• Edges• Blob• Corners

Mid-level features• Scale-invariant

features• SIFT• SVD

High-level features• Histograms of

gradients (HOG)• Region descriptors

The selection process is called feature engineering

Low Level Features

Edges & Corners:Since the process of image classification involve the exploitation of edges and corners, there are a lot of methods for finding them.

Canny edge detector Harris corner detection

https://dsp.stackexchange.com/questions/14338/corner-detection-using-chris-harris-mike-stephens

It uses a 5 step algorithm, involving some filters (gradient) to compute all the edges in an image.

A mobile window slide over the image and compute the Hessian. Evaluating the eigenvalue of each matrix the corners are detected.

Mid Level Features

SIFT – Scale Invariant Feature Transform [1999]It is an algorithm which is able to detect and describe features in an image. In particular it is able to do this at different scales and rotations.

It has diffent steps which involve a scaling of the image and the computation of the Difference Of Gaussians (DoG)

High Level Features

HOG – Histogram of Oriented Gradients [2005]The technique counts occurrences of gradient orientation in localized portions of an image. The HOG feature descriptor, the distribution ( histograms ) of directions of gradients ( oriented gradients ) are used as features. Gradients ( x and y derivatives ) of an image are useful because the magnitude of gradients is large around edges and corners ( regions of abrupt intensity changes ) and we know that edges and corners pack in a lot more information about object shape than flat regions

https://www.learnopencv.com/histogram-of-oriented-gradients/

Classifier

• The classifier needs a phase of training or learning

• In this phase the classifier “learn” how to distinguish between the classes in output

• This training phase needs a dataset where each images is labelled with the corresponding class name

Training

Training

labels

Classifier

Choice of classifier

DOGS CATS

The classifiers which learn from a labelled dataset are called supervised and they can be divided into 3 major categories:1. Linear (with training)2. Trees (with training)3. Based on distances (without training)

Manual classifier - example

As hypothesis, we want to recognize the type of bottle watching it from above. The

diameter of the neck determines the type of bottle.

Classifierfeatures info

Hough Transform

𝑑1

Hough Transform

𝑑2

IF 𝑑𝑖𝑛 > 𝑡ℎ1 THEN

𝑏𝑜𝑡𝑡𝑙𝑒𝑡𝑦𝑝𝑒 = 1

ELSE

𝑏𝑜𝑡𝑡𝑙𝑒𝑡𝑦𝑝𝑒 = 2

END

1

2

Features extraction

features Manual classifier Class

Supervised classifier – distance example

A more complex problem, is there a cat or a dog in the image?

Kaggle Dogs vs. Cats dataset.

Istogramma RGB - 3D

Manual

classifier?


Supervised classifier – distance example

A more complex problem, is there a cat or a dog in the image?


Istogramma RGB - 3D


x x

xx

x

x

x

x

o

oo

o

o

o

o

x2

x1

+

1-NN - cat

3-NN - dog

5-NN - dog

cat

dog

k-Nearest

Neighbor

FeaturesClassifier

without training

Dataset

Supervised classifier - linear

There are various linear supervised classifier, some of them are:• Logistic regression• SVM• Neural Network


Istogramma RGB - 3D

Training

Labels

Trained

Classifier

cat

Parameters

Supervised learning – tree

We have different classifier based on trees:• Decision trees• Random forest• Gradient Boosting


Istogramma RGB - 3D

Training

Labels

Trained

Classifier

cat

Parameters

Viola Jones object detector (Haar Cascades)

• First object detection framework to provide competitive object detection rates in real-time

• Employ Haar Features tocharacterize the input image

Viola, Jones: Robust Real-time Object Detection, IJCV 2001

Each feature resuts in a single value computedsubtracting the sum of pixels under white rectanglefrom the sum of pixels under black rectangle




• Makes use of a bank of Adaboost classifiers in acascade fashion

Viola, Jones: Robust Real-time Object Detection, IJCV 2001

Sub windows of the image at different scales are passedthrougth a series of classifier and discarded if they fail in any of the stage




• Makes use of a bank of Adaboost classifiers in acascade fashion

https://vimeo.com/12774628

N. Dalal and B. Triggs Histograms of oriented gradients for human detection CVPR, 2005

Step 1: scan image at

all scales and locations

Step 2: extract features

over each sliding

window location

Step 3: use linear SVM

to classify features

extracted from each

window

Step 4: apply non-

maxima suppression to

obtain final bounding

boxes

Pros/Cons:

+ Real-time

+ Open source software implementation (Dlib)

+ Higher accuracy than Haar Cascades

- Pre-trained only for frontal faces

- Low accuracy on distance and overlapping faces

- High false-positive rate

- High sensitivity to parameter changes

An example – HOG for pedestrian detection

Where we are?

PASCAL VOC Challenge

It is an online competion for object detection (classification and localization of objects)

20 classes.

11,530 images

27,450 objects

Where we are?

• Plateau of results between2011/2012

• Results of 2012 are achieved using a combination of HOG+LBP features and an ensemble of classifier in: Boosted Local Structured HOG-LBP for Object Localization Junge Zhang, Kaiqi Huang, Yinan Yu and Tieniu Tan, 2012

PASCAL Visual Object Classes:• Train/validation/test: 9,963 images

containing 24,640 annotated objects• 20 classes of objects

Where we are?

Convolutional Neural NetworkCNN


Train images Training

labels

Test image/s

It’s an applePrediction

Training

Features Classifier

Learned Model

Learned

Features

Learned

Classifier

Learned Model

Learned

Features

Learned

Classifier


In deep learning, a convolutional neural network (CNN, or ConvNet) is a class of deep neural networks, most commonly applied to analysing visual imagery.

Convolutional networks were inspired by biological processes in that the connectivity pattern between neurons resembles the organization of the animal visual cortex. Individual cortical neurons respond to stimuli only in a restricted region of the visual field known as the receptive field.

CNNs use relatively little pre-processing compared to other image classification algorithms. This means that the network learns the filters that in traditional algorithms were hand-engineered.

A CNN consists of an input and an output layer, as well as multiple hidden layers. The hidden layers of a CNN typically consist of convolutional layers, RELU layer i.e. activation function, pooling layers, fully connected layers and normalization layers.

https://en.wikipedia.org/wiki/Convolutional_neural_network


≈5.1

Human

CNN - AlexNet

Conv 1

Conv 2

Conv 5

Conv 4

Conv 3

FC

8

FC

7

FC

6

OU

TP

UT

Inp

ut Im

age

Convolutional Layer

Recap

Convolution

Convolutional Layer

32

32

3

5

5

3

Input image: 32x32x3 (height,width,depth)

Filter: 5x5x3

1. Convolve the filter with the image

(“slide over the image spatially)

2. Filter should have the same depth

of the previous layer (in this case 3)

3. Convolution preserve spatial

structure

Convolutional Layer

28

28

1

Input image: 32x32x3

Filter: 5x5x3

1 number:Convolution result, the dot product

between the filter and a small 5x5x3

chunk of the image

Slide (convolve) over spatial location

Activation Map

Convolutional Layer

28

28

1

Input image: 32x32x3

Filter #2: 5x5x3


Activation Map

Convolutional Layer

If we have a 4 filters, we’ll have 4 activation maps


We obtain a new 28x28x4 image

Convolutional Layer

In the previous step, convolving we reduced the dimension from 32 to 28

7

7

3

3

7x7 image3x3 filter

Convolutional Layer

In the previous step, convolving we reduced the dimension from 32 to 28

7

7

3

3

7x7 image3x3 filter

→ Produce a 5x5 output

In order to keep the dimensionality (keep to a very depth neauralnetwork), we can add a frame, called padding

Convolutional Layer

9

3

9x9 (7x7 image + 1 frame of padding)

3x3 filter

The size of padding depend on the size of the filter


Assume we want to reduce the size of the output of the convolution layer, we can use the stride parameter

0 0 0 0 0 0 0 0 0

0 0

0 0

0 0

0 0

0 0

0 0

0 0

0 0 0 0 0 0 0 0 0

93

Convolutional Layer


3x3 filter

Stride 2

9

3

0 0 0 0 0 0 0 0 0

0 0

0 0

0 0

0 0

0 0

0 0

0 0

0 0 0 0 0 0 0 0 0

93

Convolutional Layer


3x3 filter

Stride 2

9

3

0 0 0 0 0 0 0 0 0

0 0

0 0

0 0

0 0

0 0

0 0

0 0

0 0 0 0 0 0 0 0 0

9 [N]3 [F]


In general:

𝐎𝐮𝐭𝐩𝐮𝐭 =𝑵− 𝑭

𝒔𝒕𝒓𝒊𝒅𝒆+ 𝟏

Convolutional Layer

The Conv Layer:• Accepts a volume of size W1 x H1 x D1

• Requires 4 hyperparameters:• Number of filters K• Their spatial extent F• The amount of zero padding P• The stride S

• Produce a volume of size W2 x H2 x D2 where:

• 𝑊2 =𝑊1−𝐹+2𝑃

𝑆+ 1

• 𝐻2 =𝐻1−𝐹+2𝑃

𝑆+ 1

• 𝐷2 = 𝐾• It introduces 𝐹 ∙ 𝐹 ∙ 𝐷1 weights per filter, for a total of (𝐹 ∙ 𝐹 ∙ 𝐷1) ∙ 𝐾

weights (10 filter with a dimension of 5x5 on an RGB image will have 5 ∙ 5 ∙ 3 ∙ 10 = 𝟕𝟓𝟎 parameters)

Common settings: • K = (powers of 2, e.g. 32, 64, 128, 512) • F = 3, S = 1, P = 1 • F = 5, S = 1, P = 2 • F = 5, S = 2, P = ? (whatever fits) • F = 1, S = 1, P = 0

Brain / Convolutional Layer

28

28

1

Filter: 5x5x3 Convolution

result

Activation Map

32

32

3

This is like a single neuron with local connectivity

An activation map is a 28x28 sheet of neuron outputs:1. Each is connected to a small region in the

input 2. All of them share parameters

28 “5x5 filter” -> “5x5 receptive field for each neuron”

Brain / Convolutional Layer

28

28

5

32

32

3

Using 5 filters, we are stacking neurons in a matrix 28x28x5.

This means that, somehow, 5 different neurons are looking at the same piece of image and producing an output.

28

Activation layer

Done by convolution

Done by activation function

Activation layer

Sigmoid

𝒚 𝒙 =𝟏

𝟏 + 𝒆−𝒙

Hyperbolic tang.𝒚 𝒙 = 𝒕𝒂𝒏𝒉(𝒙)

ReLU𝒚 𝒙 = 𝒎𝒂𝒙(𝟎,𝒙)

Leaky ReLU𝒚 𝒙 = 𝒎𝒂𝒙(𝟎.𝟏𝒙, 𝒙)

ELU

𝒚 𝒙 = ቊ𝒙

𝜶(𝒆𝒙 − 𝟏)𝒙 ≥ 𝟎𝒙 < 𝟎

Pooling layer

This layer reduce the spatial size of the representation of the image. It aims to reduce the amount of parameters and computation in the network, and hence to also control overfitting

MAX POOLING2x2 filter

Stride of 2

The idea is to reduce the size keeping the neuron “more activated”

Pooling layer

• Accepts a volume of size W1 x H1 x D1

• Requires two hyperparameters:

• their spatial extent F,• the stride S,

• Produces a volume of size W2 x H2 x D2 where:

• 𝑊2 =𝑊1−𝐹

𝑆+ 1

• 𝐻2 =𝐻1−𝐹

𝑆+ 1

• 𝐷1 = 𝐷2

• Introduces zero parameters since it computes a fixed function of the input

• Different version of pooling exist. The most used is MAX pooling, then you have AVG

pooling..

• Pooling can be obtained even through Conv Layer with big stride

Fully connected layer

These layers are just like the classic neural network layers.

All the inputs are connected to the outputs

Fully connected layer

In the CNNs they have the task to “classify” the high level features extracted by the previous layers

Image 32x32x3Stretched to 3072x1

3072

1

Output can be the number of classes (e.g. 10)

10

1Wx10x3072 weights

This will be the layer involved in the concept of “Transfer Learning” (more on this later)

CNN – AlexNet – Recap

Conv 1

11

x11

x3x9

6 s

trid

e 4

Conv 2

5x5

x96

x25

6

Conv 5

3x3

x38

4x2

56

Conv 4

3x3

x38

4x3

84

Conv 3

3x3

x25

6x3

84

FC

8

FC

7

FC

6

OU

TP

UT

Inp

ut Im

age

CNN – AlexNet – Recap

Figure credit: Zeiler and Fergus, “Visualizing and Understanding Convolutional Networks”, ECCV 2014

Features (and filters) are more complex compared to the manual ones

Hierarchical

features

Other networksGoogLeNet. The ILSVRC 2014 winner. Its main contribution was the development of an Inception Module that dramatically reduced the number of parameters in the network (4M, compared to AlexNet with 60M). Additionally, this paper uses Average Pooling instead of Fully Connected layers at the top of the ConvNet, eliminating a large amount of parameters. (Inception)

VGGNet. The runner-up in ILSVRC 2014. Its main contribution was in showing that the depth of the network is a critical component for good performance. Their final best network contains 16 CONV/FC layers and, appealingly, features an extremely homogeneous architecture that only performs 3x3 convolutions and 2x2 pooling from the beginning to the end.

ResNet. Residual Network was the winner of ILSVRC 2015. It features special skip connections and a heavy use of batch normalization. The architecture is also missing fully connected layers at the end of the network (as of May 10, 2016).

Architecture #Params Size Accuracy Year #OPS FW time

[GPU]

FW time

[CPU]

AlexNet 61M 238 MB 80.2 2012 724M 3.1 ms 0.29 s

Inception V1 7M 70 MB 88.3 2014 1.43B - -

VGGNet 138M 528 MB 91.2 2014 15.5B 9.4 ms 4.36 s

ResNet-50 25.5M 99 MB 93 2015 3.9B 11 ms 1.13 s

GPU - Titan X | CPU i7-4790K (4 GHz)

Training Process

1. Create your model

2. Choose the Activation Functions (use ReLU)

3. Data Preprocessing (images: subtract mean)

4. Weight Initialization

5. Use Batch Normalization

6. Babysitting the Learning process

7. Hyperparameter Optimization

An Object Detection Pipeline

How to use a CNN on your data

AIM: being able to reuse a CNN for object detection on our own data

1. REUSE A CNN2. OBJECT DETECTION3. OWN DATA

Data gathering

Suppose we want to create a CV software which is able to detect playing cards (for simplicity only 9,10, Jack, Queen , King and Ace)

Since the algorithm is supervised we need a dataset with the label associated

The dataset must be:1. As large as possible (200 items per class at least)2. With the same object in different “conditions” (background, lights)3. With random objects along with the desired object4. It should respect “application condition”. So decide if we need partial objects,

overlapping and so on5. With no label errors6. Not too large (less training time)

You can create the dataset taking pictures (e.g. smartphone) or harvest from Google Images.

Data gathering

LabellingWe need to put an annotation on each image, to explain to the CNN what’s in the image.

We are building an object detection, so the label process will involve the creation of bounding boxes

10

King

King

1. This process is how we transfer the knowledge to the data

2. We can use a lot of open-source software (LabelImg→https://github.com/tzutalin/labelImg)

3. It will create an XML file associated to the image

<object>

<name>ten</name>

<pose>Unspecified</pose>

<truncated>0</truncated>

<difficult>0</difficult>

<bndbox>

<xmin>145</xmin>

<ymin>68</ymin>

<xmax>303</xmax>

<ymax>225</ymax>

</bndbox>

</object>

https://github.com/tzutalin/labelImg

Object detection

Our dataset is ready, we need now to search a model for Object detection.

So far we learned how to do Image Classification, we can use the same models and slide a window over the image.

CONS: different shape of the window in different position. Huge amount of time.

Object detection – Classification based

Region proposal: run the CNN only on part which can contain an object

R-CNN

Further improvements leaded to the creation of:• Fast-RCNN• Faster-RCNN

They are also called two-stage algorithms

Girshick et al, “Rich feature hierarchies for accurate object detection and

semantic segmentation”, CVPR 2014

Object detection – Regression based

NO Region proposal: run the CNN in “pre-defined” areas

Divide image into grid 7 x 7

Imagine you have B base boxes for each grid cell

The CNN will regress from base boxes to a new tensor

(dx, dy, dh, dw, confidence)

Output of these network are a big tensor

7 x 7 x (B * (5 + C))C - classes

Object detection – Regression based

NO Region proposal

Classes: C=2 [cat – dog]Boxes: B = 3

Output: 7x7x21

(confidence[0.85], bx, by, bh, bw, cat[0], dog[1])

The most famous structures are:• YOLO (You Only Look Once)• SSD (Single Shot Detection)

They are also called one-stage algorithms and they’ve been built with the aim of obj detection

They are faster compared to other methods, but even less accurate.

Object detection

Once we decided the model, we can download the structure and the weights.

https://github.com/tensorflow/models/blob/master/resea

rch/object_detection/g3doc/detection_model_zoo.md

These models already have the structure of the CNN implemented.

The weights we download, however, are trained on some other dataset (COCO, Pascal, Kitti…)

How to use these huge network already trained for our problem?

The solution is transfer learning

Transfer learningTraining a CNN from zero requires a very big dataset and a very expensive hardware

RetrainFreeze

Already trained Small dataset Big dataset

#Class #Class

It re-uses the ability learnt in another task

Pre trained on ImageNet

Object detection pipeline

Create your own Dataset

Choose your network structure

(Faster-RCNN / SSD..)

Modify the Fully Connected layers

Do a «trasnfer training»

Detect your objects

Object detection examples

YOLO – 1 class SSD – 5 classes

Calibrated

cropping

Persons Detection

R-FCN

Face Extraction

D-Lib

Age/Gender estim.

VGG-16 + DC layer

W –

(25,35)

M –

(38,43)

Object detection examples

Bibliografia

• https://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html• https://www.youtube.com/playlist?list=PL3FW7Lu3i5JvHM8ljYj-zLfQRF3EO8sYv• http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-

networks.pdf• https://github.com/EdjeElectronics/TensorFlow-Object-Detection-API-Tutorial-Train-

Multiple-Objects-Windows-10• https://project.inria.fr/deeplearning/files/2016/05/DLFrameworks.pdf [for in depth

comparison of DL frameworks]

https://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html

https://www.youtube.com/playlist?list=PL3FW7Lu3i5JvHM8ljYj-zLfQRF3EO8sYv

http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf

https://github.com/EdjeElectronics/TensorFlow-Object-Detection-API-Tutorial-Train-Multiple-Objects-Windows-10

https://project.inria.fr/deeplearning/files/2016/05/DLFrameworks.pdf

aa 2018-2019 computer vision - cal unibg · computer vision (cv), from the perspective of...

Documents