![Page 1: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/1.jpg)
A brief introduction to Computer Vision
SPEAKER: Michele Ermidoro
PLACE: UniBG, Dalmine
DATE: 12 March 2019
Ingegneria dei Sistemi di Controllo (Control Systems Engineering)
Academic year (AA) 2018-2019
![Page 2: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/2.jpg)
Outline
1. What is Machine Vision?
2. Digital image: start from basics
• Image representation
• Image processing
3. Classic approach
4. Convolutional Neural Network
5. An Object Detection Pipeline
6. Deep Learning Framework
![Page 3: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/3.jpg)
What is Machine Vision?
![Page 4: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/4.jpg)
Computer vision (CV), from the perspective of engineering, seeks to automate tasks that the human visual system can do. Computer vision tasks include methods for acquiring, processing, analyzing and understanding digital images, and extraction of high-dimensional data from the real world in order to produce numerical or symbolic information, e.g., in the form of decisions. Understanding in this context means the transformation of visual images (the input of the retina) into descriptions of the world that can interface with other thought processes and elicit appropriate action.
[https://en.wikipedia.org/wiki/Computer_vision]
Machine vision (MV) is the technology and methods used to provide imaging-based automatic inspection and analysis for such applications as automatic inspection, process control, and robot guidance, usually in industry. Machine vision as a systems engineering discipline can be considered distinct from computer vision, a form of computer science. It attempts to integrate existing technologies in new ways and apply them to solve real world problems.
[https://en.wikipedia.org/wiki/Machine_vision]
What is Machine Vision?
![Page 5: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/5.jpg)
[Diagram: Computer Vision and Machine Vision together with related fields: Image Processing; Neurobiology; Imaging; Optics; Signal processing; Robotics; Artificial intelligence (AI); Machine Learning; Math; Computer intelligence; Object Detection; Cognitive Vision; Geometry; Statistics; Optimization; Human Vision; Lighting; Imaging sensors; Lenses; Odometry; Navigation; Data; Estimation; Filtering]
What is Machine Vision?
![Page 6: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/6.jpg)
• Almost 80% of the data traveling on the net is visual data
• Everybody has a smartphone, and every smartphone has at least 2 cameras
• "A picture is worth a thousand words" is an English-language idiom
• A camera is one of the most powerful sensors in a lot of applications
• It has wide fields of application:
  • Robotics
  • Surveillance
  • Industry
  • Self-driving cars/drones/buses
  • Medical
  • …
http://crcv.ucf.edu/people/faculty/Bagci/research.php
Computer Vision – Why?
![Page 7: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/7.jpg)
Computer Vision – Hype?
![Page 8: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/8.jpg)
Computer Vision – Hype?
Human: ≈5.1
![Page 9: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/9.jpg)
Computer Vision – Hype?
Human: ≈5.1
What happened here?
1. Computational Power
2. Convolutional Neural Network (CNN)
![Page 10: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/10.jpg)
Computational Power
https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/
2009 - “We argue that modern graphics processors far surpass the computational capabilities of multicore CPUs, and have the potential to revolutionize the applicability of deep unsupervised learning methods”
Rajat Raina, Anand Madhavan, Andrew Y. Ng, “Large-scale Deep Unsupervised Learning using Graphics Processors”
https://blog.inten.to/cpu-hardware-for-deep-learning-b91f53cb18af
IMAGENET Challenge
![Page 11: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/11.jpg)
Convolutional Neural Network
![Page 12: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/12.jpg)
Computer Vision Tasks
http://vision.stanford.edu
Classification
What’s in the image?
• People
• Car
• Traffic light
• Clock
• …
![Page 13: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/13.jpg)
Computer Vision Tasks
Car
http://vision.stanford.edu
Detection
What’s in the image? And where is it?
• A car in the orange box
![Page 14: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/14.jpg)
Car
Person
Clock
Computer Vision Tasks
http://vision.stanford.edu
Detection
What’s in the image? And where is it?
• A car in the orange box
• A person in the blue box
• A clock in the green box
![Page 15: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/15.jpg)
Computer Vision Tasks
http://vision.stanford.edu
Segmentation
What’s in the image? And where is it?
And which pixels belong to it?
Car
Clock
![Page 16: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/16.jpg)
Computer Vision Tasks
http://vision.stanford.edu
Annotation
How would you describe the picture?
“People crossing a street while a car is waiting”
![Page 17: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/17.jpg)
Computer Vision Tasks
https://engineering.matterport.com/splash-of-color-instance-segmentation-with-mask-r-cnn-and-tensorflow-7c761e238b46
![Page 18: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/18.jpg)
• Face detection
• Smile detection
• Eye-open detection
• …
Computer Vision Tasks - Others
![Page 19: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/19.jpg)
https://medium.com/waymo/recreating-the-self-driving-experience-the-making-of-the-waymo-360-video-37a80466af49
Computer Vision Tasks - Others
![Page 20: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/20.jpg)
Figure credit: Dai, He, and Sun, “Instance-aware Semantic Segmentation via Multi-task Network Cascades”, CVPR 2016
Computer Vision Tasks - Others
![Page 21: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/21.jpg)
Cardiologist-Level Arrhythmia Detection with Convolutional Neural Networks
Computer Vision Tasks - Others
![Page 22: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/22.jpg)
https://www.youtube.com/watch?v=bcswZLwhTUI
Computer Vision Tasks - Others
![Page 23: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/23.jpg)
Computer Vision Tasks - Others
![Page 24: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/24.jpg)
https://www.youtube.com/watch?v=NrmMk1Myrxc
Computer Vision Tasks - Others
![Page 25: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/25.jpg)
Digital images Basics
Classic approach
Feature engineering
CNN
What’s next
![Page 26: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/26.jpg)
Digital image: start from basics
Image representation
![Page 27: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/27.jpg)
Image representation
An image, inside a PC, is just a matrix of numbers
255 -> white
0 -> black

0 1 1 1
1 0 1 1
1 1 1 0
1 1 0 1

4x4x1 matrix
Color depth – 1 bit
Color channels – 1
16 pixels
![Page 28: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/28.jpg)
Image representation
An image, inside a PC, is just a matrix of numbers
217 255 255 255
255 191 255 255
255 255 255 127
255 255   0 255

4x4x1 matrix
Color depth – 8 bit
Color channels – 1
16 pixels
![Page 29: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/29.jpg)
Image representation
Channel 1:
255 255 255 255
255 170 255 255
255 255 255  85
255 255   0 255

Channel 2:
  0 255 255 255
255  85 255 255
255 255 255 170
255 255 255 255

4x4x2 matrix
Color depth – 8 bit
Color channels – 2
16 pixels
![Page 30: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/30.jpg)
Image representation
2453x2453x3 matrix
Color depth – 24 bit
Color channels – 3
Color space – sRGB
6 Mpixel (6,017,209 pixels)
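The "matrix of numbers" idea can be sketched in a few lines. This is a minimal illustration in plain Python with nested lists; in practice an array library would normally hold the data, and the helper names (`image_shape`, `to_8bit`, `raw_size_bytes`) are assumptions of this sketch, not part of the slides. The 4x4 pattern is the 1-bit example above, and the size estimate assumes 8 bits per channel, which is what 24-bit color with 3 channels implies.

```python
# A 4x4, 1-channel, 1-bit image stored as rows of pixel values (0 = black, 1 = white).
binary_image = [
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
]

def image_shape(image):
    """Return (height, width) of a single-channel image stored as a list of rows."""
    return len(image), len(image[0])

def to_8bit(image):
    """Rescale a 1-bit image (values 0/1) to the 8-bit range (0..255)."""
    return [[pixel * 255 for pixel in row] for row in image]

def raw_size_bytes(height, width, channels, bits_per_channel):
    """Uncompressed size in bytes of an image with the given geometry."""
    return height * width * channels * bits_per_channel // 8

height, width = image_shape(binary_image)
pixel_count = height * width            # 16 pixels
gray_image = to_8bit(binary_image)      # white becomes 255, black stays 0

# The 2453x2453 sRGB photo: 3 channels, assumed 8 bits each (24-bit color).
photo_pixels = 2453 * 2453              # about 6 Mpixel
photo_bytes = raw_size_bytes(2453, 2453, 3, 8)
```

The last line gives the uncompressed size, roughly 18 MB, which is why image files on disk are almost always compressed.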
![Page 31: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/31.jpg)
Color spaces
A color space, also known as a color model (or color system), is an abstract mathematical model which describes the range of colors as tuples of numbers, typically 3 or 4 values or color components.
There are a variety of color spaces, such as RGB, CMYK, HSV, CIE XYZ, ...
CMYK (Cyan, Magenta, Yellow, Key black)
RGB (Red, Green, Blue)
![Page 32: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/32.jpg)
Color spaces
CIE XYZ
It is the most accurate from a scientific point of view. It tries to represent all the colors that a human eye can see.
RGB
The RGB color model is an additive color model in which red, green and blue light are added together. The name comes from the three additive primary colors: red, green, and blue.
HSB (Hue/Sat/Bright)
Designed in the 1970s by computer graphics researchers to more closely align with the way human vision perceives color-making attributes
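As a small illustration of moving between two of these color spaces, the sketch below converts 8-bit RGB triples to HSB (also called HSV) with Python's standard `colorsys` module. The wrapper name `rgb8_to_hsv` and the choice to express hue in degrees are assumptions of this example.

```python
import colorsys

def rgb8_to_hsv(r, g, b):
    """Convert 8-bit RGB to (hue in degrees, saturation, brightness), the last two in [0, 1]."""
    h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    return h * 360, s, v

pure_red = rgb8_to_hsv(255, 0, 0)        # hue 0 degrees, fully saturated
pure_blue = rgb8_to_hsv(0, 0, 255)       # hue around 240 degrees
mid_gray = rgb8_to_hsv(128, 128, 128)    # no saturation: hue carries no information
```

The same color is described by different tuples in the two spaces; HSB separates "which color" (hue) from "how vivid" and "how bright", which is often more convenient for color-based selection.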
![Page 33: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/33.jpg)
Histogram
A color histogram is a representation of the distribution of colors in an image
Gray-scale histogram
Color histogram
Histogram equalization
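A gray-scale histogram and a basic histogram equalization can be sketched as follows. This is a plain-Python illustration under stated assumptions: the image is a list of rows of 8-bit gray levels, the function names are hypothetical, and the remapping uses the common normalized-cumulative-histogram formula rather than any specific library's implementation.

```python
def gray_histogram(image, levels=256):
    """Count how many pixels take each gray level (0..levels-1)."""
    hist = [0] * levels
    for row in image:
        for pixel in row:
            hist[pixel] += 1
    return hist

def equalize(image, levels=256):
    """Remap gray levels through the normalized cumulative histogram."""
    hist = gray_histogram(image, levels)
    total = sum(hist)
    cdf, running = [], 0
    for count in hist:                      # cumulative distribution of gray levels
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    if total == cdf_min:                    # flat image: nothing to stretch
        return [row[:] for row in image]
    lut = [max(0, round((c - cdf_min) / (total - cdf_min) * (levels - 1))) for c in cdf]
    return [[lut[pixel] for pixel in row] for row in image]

# A low-contrast image: every pixel lies between 100 and 103.
low_contrast = [
    [100, 100, 101, 101],
    [101, 102, 102, 102],
    [102, 102, 103, 103],
    [103, 103, 103, 103],
]
hist = gray_histogram(low_contrast)
equalized = equalize(low_contrast)
```

After equalization the four gray levels are spread over the full 0..255 range, so the darkest pixels map to 0 and the brightest to 255.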
![Page 34: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/34.jpg)
Digital image: start from basics
Image processing
![Page 35: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/35.jpg)
Binarization
Image binarization
It transforms a gray-scale image into a binary image, depending on a threshold
$$g[n,m] = \begin{cases} 255, & f[n,m] > 100 \\ 0, & \text{otherwise} \end{cases}$$
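The thresholding rule above translates almost literally into code. A minimal sketch, assuming the image is a list of rows of gray levels; the function name `binarize` and the sample values are illustrative, while the default threshold of 100 is the one used on the slide.

```python
def binarize(image, threshold=100):
    """Return a binary image: 255 where the pixel exceeds the threshold, 0 elsewhere."""
    return [[255 if pixel > threshold else 0 for pixel in row] for row in image]

gray = [
    [12, 100, 101],
    [200, 99, 255],
]
binary = binarize(gray)
```

Note that a pixel exactly equal to the threshold (100 here) maps to 0, because the comparison is strict.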
![Page 36: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/36.jpg)
Filters
Most operations on an image are performed using filters
• Filters are mathematical functions which are applied to each pixel.
• The filter is represented using a matrix, called the kernel
• Depending on the size of the kernel (3x3, 5x5, 7x7, ...), the function will involve the pixel and its neighbors.
• Filters are applied to an image by convolving the image with the kernel
• For filters that should preserve the overall brightness (e.g. blur, sharpen), the kernel sum must be 1; edge-detection kernels sum to 0
![Page 37: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/37.jpg)
Convolution
Convolution is the process of adding each element of the image to its local neighbors, weighted by the kernel.
[Figure: input image, kernel and output. STEP 1: the first output value is -13]
http://www.songho.ca/dsp/convolution/convolution2d_example.html
![Page 38: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/38.jpg)
Convolution
Convolution is the process of adding each element of the image to its local neighbors, weighted by the kernel.
[Figure: input image, kernel and output. STEP 5: output values computed so far]
-13 -20 -17
-18 -24
http://www.songho.ca/dsp/convolution/convolution2d_example.html
![Page 39: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/39.jpg)
Convolution
Convolution is the process of adding each element of the image to its local neighbors, weighted by the kernel.
[Figure: input image, kernel and output. STEP 9: complete output]
-13 -20 -17
-18 -24 -18
 13  20  17
http://www.songho.ca/dsp/convolution/convolution2d_example.html
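The step-by-step figures can be reproduced with a short sketch. The input image and kernel do not survive in this transcript; the values below are assumed to be those of the songho.ca example cited on the slides (a 3x3 ramp image and a Sobel-like kernel), and the function name `convolve2d` is hypothetical. Pixels outside the image are treated as 0 (zero padding), which is also an assumption of this sketch.

```python
def convolve2d(image, kernel):
    """2-D convolution with zero padding; the output has the same size as the input."""
    rows, cols = len(image), len(image[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    r_off, c_off = k_rows // 2, k_cols // 2
    # True convolution flips the kernel in both directions before sliding it.
    flipped = [row[::-1] for row in kernel[::-1]]
    output = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0
            for i in range(k_rows):
                for j in range(k_cols):
                    rr, cc = r + i - r_off, c + j - c_off
                    if 0 <= rr < rows and 0 <= cc < cols:   # outside pixels count as 0
                        acc += image[rr][cc] * flipped[i][j]
            output[r][c] = acc
    return output

# Assumed input and kernel (from the cited songho.ca example).
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[-1, -2, -1],
          [0, 0, 0],
          [1, 2, 1]]
result = convolve2d(image, kernel)
```

With these inputs the first output value is -13 and the full result matches the nine values shown at STEP 9. The four nested loops make the cost explicit: every output pixel needs one multiply-add per kernel element.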
![Page 40: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/40.jpg)
Convolution
![Page 41: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/41.jpg)
Kernel shape
Depending on the shape of the kernel, different operations can be made:
Edge-Detection
-1 -1 -1
-1  8 -1
-1 -1 -1

Sharpen
 0 -1  0
-1  5 -1
 0 -1  0

Blur/moving average
1/9 ×
1 1 1
1 1 1
1 1 1

Original
0 0 0
0 1 0
0 0 0
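To see what the four kernels on this slide do, the sketch below evaluates each of them at the center of two tiny test patches: a uniform region and a single bright pixel on a flat background. The patch values and helper names are assumptions of this example; exact fractions are used for the blur kernel so that the results are not affected by rounding.

```python
from fractions import Fraction

KERNELS = {
    "edge":     [[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]],
    "sharpen":  [[0, -1, 0], [-1, 5, -1], [0, -1, 0]],
    "blur":     [[Fraction(1, 9)] * 3 for _ in range(3)],
    "original": [[0, 0, 0], [0, 1, 0], [0, 0, 0]],
}

def kernel_sum(kernel):
    """Sum of all kernel weights."""
    return sum(sum(row) for row in kernel)

def apply_at(image, kernel, r, c):
    """Response of a 3x3 kernel centered on interior pixel (r, c).
    All four kernels are symmetric, so flipping them for convolution changes nothing."""
    return sum(image[r + i - 1][c + j - 1] * kernel[i][j]
               for i in range(3) for j in range(3))

flat = [[50] * 3 for _ in range(3)]                      # uniform region
spot = [[50, 50, 50], [50, 140, 50], [50, 50, 50]]       # bright pixel on flat background

sums = {name: kernel_sum(k) for name, k in KERNELS.items()}
flat_response = {name: apply_at(flat, k, 1, 1) for name, k in KERNELS.items()}
spot_response = {name: apply_at(spot, k, 1, 1) for name, k in KERNELS.items()}
```

On the uniform patch the edge kernel (weights summing to 0) returns 0, while the other three (weights summing to 1) return the original value 50. On the bright spot, blur pulls the value toward its neighbors and sharpen pushes it away; in an 8-bit image the sharpened value would then be clipped to 255.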
![Page 42: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/42.jpg)
Kernel shape
Depending on the shape of the kernel, different operations can be made:
Vert. Lines
-1  2 -1
-1  2 -1
-1  2 -1

Horiz. Lines
-1 -1 -1
 2  2  2
-1 -1 -1

45° Lines
-1 -1  2
-1  2 -1
 2 -1 -1

EDGE
-1 -1 -1
-1  8 -1
-1 -1 -1
![Page 43: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/43.jpg)
Kernel example
Denoising
Reduce noise in an image
Gaussian filter
1/16 ×
1 2 1
2 4 2
1 2 1
![Page 44: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/44.jpg)
Classic approach
![Page 45: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/45.jpg)
Visual recognition
Analyze images and extract high-level information, such as what is in the picture and where it is.
• Different viewpoint
• Different illumination
• Deformations
• Different shapes of the same object
![Page 46: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/46.jpg)
How to identify an object?
Object Bag of ‘Features’
![Page 47: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/47.jpg)
How to identify an object?
The idea is to extract from the image some of its most important characteristics. We’ll call them “features”.
These features are then compared with a dictionary, where the specific knowledge is stored.
Selecting what is important to search for in the image is called feature engineering and extraction.
The comparison phase is done by a classifier.
The knowledge is passed to the classifier during training.
![Page 48: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/48.jpg)
Classification pipeline
The classifier is a function that receives as input all the representative features of an image and decides what is in the picture.
f(apple image) = apple
f(cow image) = cow
f(tomato image) = tomato
![Page 49: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/49.jpg)
Classification pipeline
Training: Train images → features extraction → Features + Training labels → Training → Classifier
Prediction: Test image → features extraction → Features → Classifier → “It’s an apple”
![Page 50: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/50.jpg)
Classification pipeline
Training: Train images → features extraction → Features + Training labels → Training → Classifier
Prediction: Test image → features extraction → Features → Classifier → “It’s an apple”
Design choices highlighted: Features engineering; Choice of the classifier
![Page 51: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/51.jpg)
Features?
In Computer Vision, a feature is a piece of information which is particularly relevant in the solution of a certain problem (e.g. in the detection of a face, the presence of two eyes is a good feature)
We can identify 3 hierarchical categories of features:
Low-level features
• Colors
• Edges
• Blob
• Corners
Mid-level features
• Scale-invariant features
• SIFT
• SVD
High-level features
• Histograms of gradients (HOG)
• Region descriptors
The selection process is called feature engineering
![Page 52: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/52.jpg)
Low Level Features
Edges & Corners: since image classification exploits edges and corners, there are many methods for finding them.
Canny edge detector
It uses a 5-step algorithm, involving some (gradient) filters, to compute all the edges in an image.
Harris corner detection
A sliding window moves over the image and computes, at each position, the structure tensor (the second-moment matrix of the image gradients). Corners are detected by evaluating the eigenvalues of each matrix.
https://dsp.stackexchange.com/questions/14338/corner-detection-using-chris-harris-mike-stephens
![Page 53: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/53.jpg)
Mid Level Features
SIFT – Scale Invariant Feature Transform [1999]
It is an algorithm which is able to detect and describe features in an image. In particular, it is able to do this at different scales and rotations.
It has different steps, which involve scaling the image and computing the Difference of Gaussians (DoG)
![Page 54: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/54.jpg)
High Level Features
HOG – Histogram of Oriented Gradients [2005]
The technique counts occurrences of gradient orientation in localized portions of an image. In the HOG feature descriptor, the distributions (histograms) of the directions of gradients (oriented gradients) are used as features. Gradients (x and y derivatives) of an image are useful because the magnitude of the gradients is large around edges and corners (regions of abrupt intensity changes), and edges and corners pack in a lot more information about object shape than flat regions.
https://www.learnopencv.com/histogram-of-oriented-gradients/
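The core of the idea, a histogram of gradient orientations weighted by gradient magnitude, can be sketched for a single cell. This is a deliberately simplified illustration, not the full HOG descriptor: it uses central differences, hard assignment to 9 unsigned-orientation bins over 0..180 degrees, and skips the vote interpolation and block normalization of the original method. The function name and the test cell are assumptions of this example.

```python
import math

def cell_histogram(cell, bins=9):
    """Histogram of unsigned gradient orientations (0..180 degrees) for one cell,
    each interior pixel voting with its gradient magnitude."""
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for r in range(1, len(cell) - 1):
        for c in range(1, len(cell[0]) - 1):
            gx = cell[r][c + 1] - cell[r][c - 1]     # x derivative (central difference)
            gy = cell[r + 1][c] - cell[r - 1][c]     # y derivative
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % bins] += magnitude
    return hist

# A vertical edge: dark on the left, bright on the right.
cell = [[0, 0, 255, 255] for _ in range(4)]
hist = cell_histogram(cell)
```

For this vertical edge all the gradient energy falls in the first bin (gradient direction around 0 degrees, that is, pointing horizontally), while a flat cell would produce an all-zero histogram.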
![Page 55: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/55.jpg)
Classifier
• The classifier needs a training (learning) phase
• In this phase the classifier “learns” how to distinguish between the output classes
• The training phase needs a dataset in which each image is labelled with the corresponding class name
[Diagram: Training labels (DOGS, CATS) → Training → Classifier; design step: choice of classifier]
The classifiers which learn from a labelled dataset are called supervised, and they can be divided into 3 major categories:
1. Linear (with training)
2. Trees (with training)
3. Based on distances (without training)
![Page 56: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/56.jpg)
Manual classifier - example
As a hypothesis, we want to recognize the type of a bottle by looking at it from above. The diameter of the neck determines the type of bottle.
[Diagram: Features extraction → features → Manual classifier → Class]
Features extraction: Hough Transform → d1 (bottle 1); Hough Transform → d2 (bottle 2)
Manual classifier:
IF d_in > th1 THEN
  bottle_type = 1
ELSE
  bottle_type = 2
END
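The manual rule above is a one-line function once the feature is available. A minimal sketch: the neck diameter is assumed to have been measured already (for example by a Hough-transform circle detector, which is not shown), and the threshold value and units are hypothetical.

```python
def classify_bottle(neck_diameter, threshold):
    """Manual classifier: type 1 if the neck diameter exceeds the threshold, else type 2."""
    return 1 if neck_diameter > threshold else 2

TH1_MM = 30.0   # hypothetical threshold th1, in millimetres

wide_neck = classify_bottle(38.0, TH1_MM)
narrow_neck = classify_bottle(26.0, TH1_MM)
```

No training is involved: the designer picks both the feature (neck diameter) and the decision rule (the threshold) by hand, which only works when a single feature separates the classes cleanly.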
![Page 57: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/57.jpg)
Supervised classifier – distance example
A more complex problem: is there a cat or a dog in the image?
Kaggle Dogs vs. Cats dataset.
RGB histogram - 3D
Manual classifier?
[Diagram: features → Classifier → info]
![Page 58: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/58.jpg)
Supervised classifier – distance example
A more complex problem: is there a cat or a dog in the image?
Kaggle Dogs vs. Cats dataset.
RGB histogram - 3D
[Diagram: features → Classifier → info]
[Figure: k-Nearest Neighbor in a 2-D feature space (x1, x2) with cat (x) and dog (o) samples. For the query point (+): 1-NN → cat, 3-NN → dog, 5-NN → dog]
k-Nearest Neighbor: Dataset → Features → Classifier without training
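A k-Nearest Neighbor classifier needs no training step: it simply stores the labelled samples and votes among the closest ones. The sketch below uses hypothetical 2-D feature points, chosen so that the answer depends on k in the same way as in the figure (1-NN says cat, 3-NN and 5-NN say dog); the real slide uses RGB-histogram features, which are not reproduced here.

```python
import math
from collections import Counter

def knn_predict(samples, query, k):
    """Label a query point by majority vote among its k nearest labelled samples."""
    by_distance = sorted(samples, key=lambda s: math.dist(s[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-D features (x1, x2) with their labels.
samples = [
    ((1.0, 0.0), "cat"),
    ((0.0, 2.0), "dog"),
    ((-2.5, 0.0), "dog"),
    ((0.0, -3.0), "dog"),
    ((3.5, 0.0), "cat"),
    ((4.0, 4.0), "cat"),
]
query = (0.0, 0.0)
predictions = {k: knn_predict(samples, query, k) for k in (1, 3, 5)}
```

The single nearest sample is a cat, but the wider neighborhoods are dominated by dogs, so the prediction flips as k grows. Odd values of k avoid ties in a two-class problem.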
![Page 59: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/59.jpg)
Supervised classifier - linear
There are various linear supervised classifiers; some of them are:
• Logistic regression
• SVM
• Neural Network
Kaggle Dogs vs. Cats dataset.
RGB histogram - 3D
[Diagram: Training + Labels → Trained Classifier (Parameters) → cat]
![Page 60: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/60.jpg)
Supervised learning – tree
We have different classifiers based on trees:
• Decision trees
• Random forest
• Gradient Boosting
Kaggle Dogs vs. Cats dataset.
RGB histogram - 3D
[Diagram: Training + Labels → Trained Classifier (Parameters) → cat]
![Page 61: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/61.jpg)
Viola Jones object detector (Haar Cascades)
• First object detection framework to provide competitive object detection rates in real-time
• Employs Haar features to characterize the input image
Viola, Jones: Robust Real-time Object Detection, IJCV 2001
Each feature results in a single value, computed by subtracting the sum of the pixels under the white rectangle from the sum of the pixels under the black rectangle
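A two-rectangle Haar feature can be sketched as follows. The slide does not show how the rectangle sums are computed; this sketch assumes the integral image (summed-area table) described in the Viola-Jones paper, and the layout (white left half, black right half), the function names and the toy image are illustrative choices.

```python
def integral_image(img):
    """ii[y][x] holds the sum of img over rows < y and columns < x."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum of the pixels in a rectangle, using four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]

def two_rect_haar(ii, top, left, height, width):
    """Black (right half) sum minus white (left half) sum."""
    half = width // 2
    white = rect_sum(ii, top, left, height, half)
    black = rect_sum(ii, top, left + half, height, half)
    return black - white

# Toy 4x4 image: dark left half, bright right half (a vertical edge).
img = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
ii = integral_image(img)
print(two_rect_haar(ii, 0, 0, 4, 4))  # 72
```

A large absolute value means the window contains the edge pattern the feature is looking for; a flat region gives a value close to zero.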
![Page 62: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/62.jpg)
Viola Jones object detector (Haar Cascades)
• First object detection framework to provide competitive object detection rates in real-time
• Employs Haar features to characterize the input image
• Makes use of a bank of AdaBoost classifiers in a cascade fashion
Viola, Jones: Robust Real-time Object Detection, IJCV 2001
Sub-windows of the image at different scales are passed through a series of classifiers and are discarded as soon as they fail at any stage
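The early-rejection logic of the cascade can be sketched as below. The stage tests are toy stand-ins (simple thresholds on a flat list of pixel values), not trained AdaBoost classifiers, and the names are hypothetical.

```python
def run_cascade(window, stages):
    """Return (accepted, stages_passed); stop at the first stage that rejects."""
    for index, stage in enumerate(stages):
        if not stage(window):
            return False, index
    return True, len(stages)

# Toy stages, cheapest first. Real stages would be boosted Haar-feature classifiers.
stages = [
    lambda w: sum(w) / len(w) > 50,   # stage 0: bright enough on average
    lambda w: max(w) - min(w) > 30,   # stage 1: enough contrast
]

print(run_cascade([10, 10, 10, 10], stages))     # rejected by stage 0
print(run_cascade([100, 100, 100, 100], stages)) # rejected by stage 1
print(run_cascade([40, 120, 60, 90], stages))    # passes every stage
```

Because most sub-windows are rejected by the first, cheap stages, the expensive later stages run only on a small fraction of the image.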
![Page 63: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/63.jpg)
Viola Jones object detector (Haar Cascades)
• First object detection framework to provide competitive object detection rates in real-time
• Employs Haar features to characterize the input image
• Makes use of a bank of AdaBoost classifiers in a cascade fashion
https://vimeo.com/12774628
![Page 64: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/64.jpg)
An example – HOG for pedestrian detection
N. Dalal and B. Triggs, Histograms of oriented gradients for human detection, CVPR, 2005
Step 1: scan the image at all scales and locations
Step 2: extract features over each sliding window location
Step 3: use a linear SVM to classify the features extracted from each window
Step 4: apply non-maxima suppression to obtain the final bounding boxes
Pros/Cons:
+ Real-time
+ Open source software implementation (Dlib)
+ Higher accuracy than Haar Cascades
- Pre-trained only for frontal faces
- Low accuracy on distant and overlapping faces
- High false-positive rate
- High sensitivity to parameter changes
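Step 4 (non-maxima suppression) can be sketched as a greedy procedure: keep the highest-scoring box, discard the boxes that overlap it too much, and repeat. The intersection-over-union (IoU) overlap measure and the 0.5 threshold are common choices assumed here, not values taken from the slide; boxes are (xmin, ymin, xmax, ymax).

```python
def iou(a, b):
    """Intersection over union of two boxes (xmin, ymin, xmax, ymax)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes kept, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

# Two overlapping detections of the same object and one separate detection.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))  # [0, 2]
```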
![Page 65: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/65.jpg)
Where are we?
PASCAL VOC Challenge
It is an online competition for object detection (classification and localization of objects):
• 20 classes
• 11,530 images
• 27,450 objects
![Page 66: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/66.jpg)
Where are we?
• Plateau of results between 2011 and 2012
• The 2012 results were achieved using a combination of HOG+LBP features and an ensemble of classifiers, in: Junge Zhang, Kaiqi Huang, Yinan Yu and Tieniu Tan, Boosted Local Structured HOG-LBP for Object Localization, 2012
PASCAL Visual Object Classes:
• Train/validation/test: 9,963 images containing 24,640 annotated objects
• 20 classes of objects
![Page 67: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/67.jpg)
Where are we?
![Page 68: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/68.jpg)
Convolutional Neural Network (CNN)
![Page 69: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/69.jpg)
Convolutional Neural Network
[Figure: training images + training labels → training → learned model (learned features + learned classifier); test image(s) → learned model (learned features + learned classifier) → prediction: "It's an apple"]
![Page 70: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/70.jpg)
Convolutional Neural Network
In deep learning, a convolutional neural network (CNN, or ConvNet) is a class of deep neural networks, most commonly applied to analysing visual imagery.
Convolutional networks were inspired by biological processes in that the connectivity pattern between neurons resembles the organization of the animal visual cortex. Individual cortical neurons respond to stimuli only in a restricted region of the visual field known as the receptive field.
CNNs use relatively little pre-processing compared to other image classification algorithms. This means that the network learns the filters that in traditional algorithms were hand-engineered.
A CNN consists of an input and an output layer, as well as multiple hidden layers. The hidden layers of a CNN typically consist of convolutional layers, ReLU layers (i.e. activation functions), pooling layers, fully connected layers and normalization layers.
https://en.wikipedia.org/wiki/Convolutional_neural_network
![Page 71: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/71.jpg)
Convolutional Neural Network
[Figure: results chart; human reference level ≈ 5.1]
![Page 72: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/72.jpg)
CNN - AlexNet
[Figure: AlexNet architecture: Input image → Conv 1 → Conv 2 → Conv 3 → Conv 4 → Conv 5 → FC 6 → FC 7 → FC 8 → Output]
![Page 73: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/73.jpg)
Convolutional Layer
Recap
Convolution
![Page 74: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/74.jpg)
Convolutional Layer
Input image: 32x32x3 (height, width, depth)
Filter: 5x5x3
1. Convolve the filter with the image ("slide" it over the image spatially)
2. The filter must have the same depth as the previous layer (in this case 3)
3. Convolution preserves the spatial structure
![Page 75: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/75.jpg)
Convolutional Layer
Input image: 32x32x3
Filter: 5x5x3
1 number: the convolution result, i.e. the dot product between the filter and a small 5x5x3 chunk of the image
Slide (convolve) over all spatial locations
Activation map: 28x28x1
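The sliding dot product can be written out directly. This is a naive sketch for illustration only (real frameworks use optimized routines); as in most CNN libraries, the filter is slid without flipping it, which strictly speaking is a cross-correlation. The function name and the all-ones test data are assumptions.

```python
def conv2d_valid(image, kernel):
    """image: H x W x D nested lists, kernel: F x F x D. No padding, stride 1.

    Each output value is the dot product between the kernel and one
    F x F x D chunk of the image.
    """
    H, W, D = len(image), len(image[0]), len(image[0][0])
    F = len(kernel)
    out = []
    for i in range(H - F + 1):
        row = []
        for j in range(W - F + 1):
            s = 0.0
            for di in range(F):
                for dj in range(F):
                    for c in range(D):
                        s += image[i + di][j + dj][c] * kernel[di][dj][c]
            row.append(s)
        out.append(row)
    return out

# 32x32x3 image and 5x5x3 filter, both filled with ones for easy checking.
image = [[[1.0] * 3 for _ in range(32)] for _ in range(32)]
kernel = [[[1.0] * 3 for _ in range(5)] for _ in range(5)]

activation_map = conv2d_valid(image, kernel)
print(len(activation_map), len(activation_map[0]))  # 28 28
print(activation_map[0][0])                         # 75.0 = 5 * 5 * 3
```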
![Page 76: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/76.jpg)
Convolutional Layer
Input image: 32x32x3
Filter #2: 5x5x3
Slide (convolve) over all spatial locations
Activation map: 28x28x1
![Page 77: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/77.jpg)
Convolutional Layer
If we have 4 filters, we will have 4 activation maps
Slide (convolve) over all spatial locations
We obtain a new 28x28x4 "image"
![Page 78: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/78.jpg)
Convolutional Layer
In the previous step, the convolution reduced the spatial dimension from 32 to 28
7x7 image, 3x3 filter
![Page 81: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/81.jpg)
Convolutional Layer
In the previous step, the convolution reduced the spatial dimension from 32 to 28
7x7 image, 3x3 filter
→ Produces a 5x5 output
In order to keep the dimensionality (which is key for a very deep neural network), we can add a frame around the image, called padding
![Page 82: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/82.jpg)
Convolutional Layer
9x9 input (7x7 image + 1 frame of zero padding)
3x3 filter
The size of the padding depends on the size of the filter
→ Produces a 7x7 output
If we want to reduce the size of the output of the convolution layer, we can use the stride parameter
[Figure: 9x9 grid: the 7x7 image surrounded by a one-pixel frame of zeros]
![Page 83: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/83.jpg)
Convolutional Layer
9x9 input (7x7 image + 1 frame of zero padding)
3x3 filter
Stride 2
[Figure: 9x9 grid: the 7x7 image surrounded by a one-pixel frame of zeros, with the 3x3 filter position highlighted]
![Page 86: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/86.jpg)
Convolutional Layer
9x9 input (7x7 image + 1 frame of zero padding): N = 9
3x3 filter: F = 3
Stride 2
→ Produces a 4x4 output
In general:
$$\text{Output} = \frac{N - F}{\text{stride}} + 1$$
![Page 87: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/87.jpg)
Convolutional Layer
The Conv Layer:
• Accepts a volume of size W1 x H1 x D1
• Requires 4 hyperparameters:
  • Number of filters K
  • Their spatial extent F
  • The amount of zero padding P
  • The stride S
• Produces a volume of size W2 x H2 x D2 where:
  • $W_2 = \frac{W_1 - F + 2P}{S} + 1$
  • $H_2 = \frac{H_1 - F + 2P}{S} + 1$
  • $D_2 = K$
• It introduces $F \cdot F \cdot D_1$ weights per filter, for a total of $(F \cdot F \cdot D_1) \cdot K$ weights (10 filters of size 5x5 on an RGB image have $5 \cdot 5 \cdot 3 \cdot 10 = 750$ weights, or 760 parameters if each filter also has a bias)
Common settings:
• K = powers of 2 (e.g. 32, 64, 128, 512)
• F = 3, S = 1, P = 1
• F = 5, S = 1, P = 2
• F = 5, S = 2, P = ? (whatever fits)
• F = 1, S = 1, P = 0
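The size formulas are easy to check numerically. The sketch below assumes the hyperparameter names of this slide (K, F, S, P); the function names are illustrative, and the optional bias count is an addition not shown on the slide.

```python
def conv_output_shape(w1, h1, d1, k, f, s, p):
    """Return (W2, H2, D2) for a conv layer with K filters of size F, stride S, padding P."""
    def out(n):
        span = n - f + 2 * p
        if span < 0 or span % s != 0:
            raise ValueError("filter, stride and padding do not fit the input")
        return span // s + 1
    return out(w1), out(h1), k

def conv_weight_count(d1, k, f, bias=False):
    """F*F*D1 weights per filter, times K filters, plus K biases if requested."""
    return f * f * d1 * k + (k if bias else 0)

print(conv_output_shape(32, 32, 3, k=4, f=5, s=1, p=0))  # (28, 28, 4)
print(conv_output_shape(7, 7, 1, k=1, f=3, s=1, p=0))    # (5, 5, 1)
print(conv_output_shape(7, 7, 1, k=1, f=3, s=1, p=1))    # (7, 7, 1)
print(conv_output_shape(7, 7, 1, k=1, f=3, s=2, p=1))    # (4, 4, 1)
print(conv_weight_count(d1=3, k=10, f=5))                # 750
```

The four shape calls reproduce the examples of the previous slides: 32 to 28 without padding, 7 to 5 without padding, 7 to 7 with one frame of padding, and 7 to 4 with padding and stride 2.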
![Page 88: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/88.jpg)
Brain / Convolutional Layer
[Figure: a 5x5x3 filter applied to the 32x32x3 input produces one convolution result in the 28x28x1 activation map]
This is like a single neuron with local connectivity
An activation map is a 28x28 sheet of neuron outputs:
1. Each is connected to a small region in the input
2. All of them share parameters
"5x5 filter" -> "5x5 receptive field for each neuron"
![Page 89: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/89.jpg)
Brain / Convolutional Layer
[Figure: 32x32x3 input volume mapped to a 28x28x5 output volume]
Using 5 filters, we stack the neurons in a 28x28x5 volume.
This means that 5 different neurons are looking at the same region of the image, each producing its own output.
![Page 90: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/90.jpg)
Activation layer
[Figure: model of a neuron, with one part labelled "done by convolution" and the other labelled "done by activation function"]
![Page 91: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/91.jpg)
Activation layer
Sigmoid: $y(x) = \frac{1}{1 + e^{-x}}$
Hyperbolic tangent: $y(x) = \tanh(x)$
ReLU: $y(x) = \max(0, x)$
Leaky ReLU: $y(x) = \max(0.1x, x)$
ELU: $y(x) = \begin{cases} x & x \geq 0 \\ \alpha (e^{x} - 1) & x < 0 \end{cases}$
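The activation functions listed above translate directly into code. In this sketch the leaky slope 0.1 follows the slide, while the default value of `alpha` for ELU is an assumption (the slide leaves it symbolic).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return math.tanh(x)

def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.1):
    return max(slope * x, x)

def elu(x, alpha=1.0):
    return x if x >= 0 else alpha * (math.exp(x) - 1.0)

for name, fn in [("sigmoid", sigmoid), ("tanh", tanh), ("relu", relu),
                 ("leaky_relu", leaky_relu), ("elu", elu)]:
    print(name, [round(fn(x), 3) for x in (-2.0, 0.0, 2.0)])
```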
![Page 92: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/92.jpg)
Pooling layer
This layer reduces the spatial size of the representation of the image. It aims to reduce the number of parameters and the amount of computation in the network, and hence also to control overfitting
MAX POOLING: 2x2 filter, stride of 2
The idea is to reduce the size while keeping the most activated neurons
![Page 93: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/93.jpg)
Pooling layer
• Accepts a volume of size W1 x H1 x D1
• Requires two hyperparameters:
  • the spatial extent F
  • the stride S
• Produces a volume of size W2 x H2 x D2 where:
  • $W_2 = \frac{W_1 - F}{S} + 1$
  • $H_2 = \frac{H_1 - F}{S} + 1$
  • $D_2 = D_1$
• Introduces zero parameters, since it computes a fixed function of the input
• Different versions of pooling exist. The most used is MAX pooling; another one is AVG (average) pooling
• The same size reduction can also be obtained with a Conv Layer with a large stride
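Max pooling on a single activation map can be sketched as follows, with the 2x2 filter and stride 2 of the previous slide as defaults. The function name and the 4x4 example values are illustrative.

```python
def max_pool2d(fmap, size=2, stride=2):
    """Keep the maximum of each size x size block, moving by stride."""
    h, w = len(fmap), len(fmap[0])
    return [
        [
            max(fmap[i + di][j + dj] for di in range(size) for dj in range(size))
            for j in range(0, w - size + 1, stride)
        ]
        for i in range(0, h - size + 1, stride)
    ]

fmap = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
print(max_pool2d(fmap))  # [[6, 8], [3, 4]]
```

The output is (4 - 2) / 2 + 1 = 2 values per side, in agreement with the formula above, and no parameter is learned.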
![Page 94: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/94.jpg)
Fully connected layer
These layers are just like the classic neural network layers.
All the inputs are connected to the outputs
![Page 95: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/95.jpg)
Fully connected layer
In CNNs, their task is to "classify" the high-level features extracted by the previous layers
Input: image 32x32x3, stretched to 3072x1
Weights W: 10x3072
Output: 10x1, one value per class (e.g. 10 classes)
This is the layer involved in the concept of "Transfer Learning" (more on this later)
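The fully connected layer of this slide is a matrix-vector product on the flattened volume. The sketch below uses integer toy values so that the result is easy to verify; the weight values, the zero biases and the function names are assumptions.

```python
def flatten(volume):
    """Stretch an H x W x D volume into a flat list of H*W*D values."""
    return [value for row in volume for pixel in row for value in pixel]

def fully_connected(x, weights, biases):
    """One output per row of weights: score = w . x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]

# 32x32x3 "image" of ones, stretched to 3072 values.
image = [[[1] * 3 for _ in range(32)] for _ in range(32)]
x = flatten(image)

# Toy 10x3072 weight matrix: row k is filled with the value k.
num_classes = 10
weights = [[k] * len(x) for k in range(num_classes)]
biases = [0] * num_classes

scores = fully_connected(x, weights, biases)
print(len(x), len(scores))  # 3072 10
```

The layer has 10 x 3072 = 30,720 weights, one per input-output pair, which is why fully connected layers dominate the parameter count of many CNNs.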
![Page 96: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/96.jpg)
CNN – AlexNet – Recap
[Figure: AlexNet architecture]
Input image
→ Conv 1: 11x11x3x96, stride 4
→ Conv 2: 5x5x96x256
→ Conv 3: 3x3x256x384
→ Conv 4: 3x3x384x384
→ Conv 5: 3x3x384x256
→ FC 6 → FC 7 → FC 8
→ Output
![Page 97: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/97.jpg)
CNN – AlexNet – Recap
Figure credit: Zeiler and Fergus, “Visualizing and Understanding Convolutional Networks”, ECCV 2014
The learned features (and filters) are more complex than the hand-crafted ones
Hierarchical features
![Page 98: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/98.jpg)
Other networks
GoogLeNet. The ILSVRC 2014 winner. Its main contribution was the development of an Inception Module that dramatically reduced the number of parameters in the network (4M, compared to AlexNet with 60M). Additionally, this paper uses Average Pooling instead of Fully Connected layers at the top of the ConvNet, eliminating a large amount of parameters. (Inception)
VGGNet. The runner-up in ILSVRC 2014. Its main contribution was in showing that the depth of the network is a critical component for good performance. Their final best network contains 16 CONV/FC layers and, appealingly, features an extremely homogeneous architecture that only performs 3x3 convolutions and 2x2 pooling from the beginning to the end.
ResNet. Residual Network was the winner of ILSVRC 2015. It features special skip connections and a heavy use of batch normalization. The architecture is also missing fully connected layers at the end of the network (as of May 10, 2016).
| Architecture | #Params | Size | Accuracy | Year | #OPS | FW time [GPU] | FW time [CPU] |
|---|---|---|---|---|---|---|---|
| AlexNet | 61M | 238 MB | 80.2 | 2012 | 724M | 3.1 ms | 0.29 s |
| Inception V1 | 7M | 70 MB | 88.3 | 2014 | 1.43B | - | - |
| VGGNet | 138M | 528 MB | 91.2 | 2014 | 15.5B | 9.4 ms | 4.36 s |
| ResNet-50 | 25.5M | 99 MB | 93 | 2015 | 3.9B | 11 ms | 1.13 s |

Hardware: GPU Titan X | CPU i7-4790K (4 GHz)
![Page 99: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/99.jpg)
Training Process
1. Create your model
2. Choose the Activation Functions (use ReLU)
3. Data Preprocessing (images: subtract mean)
4. Weight Initialization
5. Use Batch Normalization
6. Babysitting the Learning process
7. Hyperparameter Optimization
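Step 3 (subtract the mean) can be sketched as below: the per-channel mean is computed over the training images and subtracted from every pixel. Whether the mean is taken per channel or per pixel is a design choice; the per-channel variant, the function names and the tiny 1x2 images are assumptions made for illustration.

```python
def channel_means(images):
    """Mean of each of the 3 channels over all pixels of all images."""
    totals = [0.0, 0.0, 0.0]
    count = 0
    for image in images:
        for row in image:
            for pixel in row:
                for c in range(3):
                    totals[c] += pixel[c]
                count += 1
    return [t / count for t in totals]

def subtract_mean(image, means):
    """Return a new image with the channel means removed from every pixel."""
    return [[[pixel[c] - means[c] for c in range(3)] for pixel in row]
            for row in image]

# Two tiny 1x2 RGB "images".
image_a = [[[10, 20, 30], [20, 40, 60]]]
image_b = [[[30, 60, 90], [20, 40, 60]]]

means = channel_means([image_a, image_b])
print(means)                          # [20.0, 40.0, 60.0]
print(subtract_mean(image_a, means))  # [[[-10.0, -20.0, -30.0], [0.0, 0.0, 0.0]]]
```

The means computed on the training set should be reused unchanged on the test images.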
![Page 100: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/100.jpg)
An Object Detection Pipeline
![Page 101: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/101.jpg)
How to use a CNN on your data
AIM: being able to reuse a CNN for object detection on our own data
1. REUSE A CNN
2. OBJECT DETECTION
3. OWN DATA
![Page 102: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/102.jpg)
Data gathering
Suppose we want to create CV software that is able to detect playing cards (for simplicity, only 9, 10, Jack, Queen, King and Ace)
Since the algorithm is supervised, we need a dataset with the associated labels
The dataset must:
1. Be as large as possible (at least 200 items per class)
2. Show the same object in different "conditions" (background, lighting)
3. Include random objects along with the desired object
4. Respect the "application conditions", so decide whether we need partial objects, overlapping objects and so on
5. Contain no label errors
6. Not be too large (less training time)
You can create the dataset by taking pictures (e.g. with a smartphone) or by harvesting images from Google Images.
![Page 103: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/103.jpg)
Data gathering
![Page 104: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/104.jpg)
Labelling
We need to put an annotation on each image, to tell the CNN what is in the image.
We are building an object detector, so the labelling process involves the creation of bounding boxes
[Figure: image of playing cards with bounding boxes labelled "10", "King" and "King"]
1. This process is how we transfer our knowledge to the data
2. We can use one of the many open-source tools (LabelImg → https://github.com/tzutalin/labelImg)
3. It creates an XML file associated with the image
<object>
  <name>ten</name>
  <pose>Unspecified</pose>
  <truncated>0</truncated>
  <difficult>0</difficult>
  <bndbox>
    <xmin>145</xmin>
    <ymin>68</ymin>
    <xmax>303</xmax>
    <ymax>225</ymax>
  </bndbox>
</object>
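An annotation like the one above can be read with the Python standard library. In this sketch the `<annotation>` root element wrapping the `<object>` entry is an assumption about the full file (only the `<object>` fragment is shown on the slide), and `parse_objects` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# The <object> fragment from the slide, wrapped in an assumed root element.
ANNOTATION = """
<annotation>
  <object>
    <name>ten</name>
    <pose>Unspecified</pose>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <bndbox>
      <xmin>145</xmin>
      <ymin>68</ymin>
      <xmax>303</xmax>
      <ymax>225</ymax>
    </bndbox>
  </object>
</annotation>
"""

def parse_objects(xml_text):
    """Return one dict per labelled object: class name and bounding box."""
    root = ET.fromstring(xml_text)
    objects = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        coords = tuple(int(box.findtext(tag))
                       for tag in ("xmin", "ymin", "xmax", "ymax"))
        objects.append({"name": obj.findtext("name"), "box": coords})
    return objects

print(parse_objects(ANNOTATION))
# [{'name': 'ten', 'box': (145, 68, 303, 225)}]
```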
![Page 105: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/105.jpg)
Object detection
Our dataset is ready; we now need to find a model for object detection.
So far we have learned how to do image classification; we can use the same models and slide a window over the image.
CONS: the window needs a different shape in different positions, so the classifier must be run a huge number of times, which takes a huge amount of time.
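A rough count shows why the sliding-window approach is slow. The sketch below enumerates the windows for one window shape only; the image size, window size and stride are arbitrary example values.

```python
def sliding_windows(image_w, image_h, win_w, win_h, stride):
    """Yield (xmin, ymin, xmax, ymax) for every window position."""
    for top in range(0, image_h - win_h + 1, stride):
        for left in range(0, image_w - win_w + 1, stride):
            yield left, top, left + win_w, top + win_h

# One 64x64 window shape on a 640x480 image, moved 16 pixels at a time.
count = sum(1 for _ in sliding_windows(640, 480, 64, 64, 16))
print(count)  # 999
```

Each of these windows would need one full CNN classification, and the count must be multiplied by the number of window shapes and scales that are tried.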
![Page 106: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/106.jpg)
Object detection – Classification based
Region proposal: run the CNN only on the parts of the image that may contain an object
R-CNN
Further improvements led to the creation of:
• Fast-RCNN
• Faster-RCNN
They are also called two-stage algorithms
Girshick et al, "Rich feature hierarchies for accurate object detection and semantic segmentation", CVPR 2014
![Page 107: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/107.jpg)
Object detection – Regression based
NO region proposal: run the CNN on "pre-defined" areas
Divide the image into a 7 x 7 grid
Imagine you have B base boxes for each grid cell
For each base box, the CNN regresses a new set of values: (dx, dy, dh, dw, confidence)
The output of these networks is one big tensor:
7 x 7 x (B * (5 + C)), where C is the number of classes
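The size of the output tensor follows directly from the formula above. A small sketch, with hypothetical function and parameter names:

```python
def detection_output_shape(grid=7, boxes_per_cell=1, num_classes=1):
    """Shape of the output tensor: grid x grid x (B * (5 + C)).

    Each box carries 5 numbers (dx, dy, dh, dw, confidence) plus C class scores.
    """
    return grid, grid, boxes_per_cell * (5 + num_classes)

# B = 3 base boxes and C = 2 classes, e.g. cat and dog.
print(detection_output_shape(grid=7, boxes_per_cell=3, num_classes=2))  # (7, 7, 21)
```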
![Page 108: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/108.jpg)
Object detection – Regression based
NO Region proposal
Classes: C = 2 [cat – dog]
Boxes: B = 3
Output: 7x7x21
Values for each box: (confidence [0.85], bx, by, bh, bw, cat [0], dog [1])
The most famous architectures are:
• YOLO (You Only Look Once)
• SSD (Single Shot MultiBox Detector)
They are also called one-stage algorithms, and they were designed specifically for object detection
They are faster than the other methods, but also less accurate.
![Page 109: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/109.jpg)
Object detection
Once we have chosen the model, we can download its structure and its weights.
https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md
These models already have the structure of the CNN implemented.
The weights we download, however, are trained on some other dataset (COCO, Pascal, Kitti…)
How can we use these huge, already trained networks for our problem?
The solution is transfer learning
![Page 110: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/110.jpg)
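Transfer learning
Training a CNN from scratch requires a very big dataset and very expensive hardware
It re-uses the ability learnt in another task
[Figure: a network pre-trained on ImageNet ("already trained") compared with its reuse on a small dataset and on a big dataset: layers marked "Freeze" keep their weights, layers marked "Retrain" are trained again, and the output has size #Class]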
![Page 111: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/111.jpg)
Object detection pipeline
1. Create your own dataset
2. Choose your network structure (Faster-RCNN / SSD ...)
3. Modify the fully connected layers
4. Do a «transfer training»
5. Detect your objects
![Page 112: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/112.jpg)
Object detection examples
[Figure: two detection examples: YOLO – 1 class; SSD – 5 classes]
![Page 113: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/113.jpg)
Object detection examples
[Figure: pipeline: calibrated cropping → persons detection (R-FCN) → face extraction (D-Lib) → age/gender estimation (VGG-16 + DC layer); example outputs: W – (25,35), M – (38,43)]
![Page 114: AA 2018-2019 Computer Vision - CAL UniBg · Computer vision (CV), from the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision](https://reader034.vdocument.in/reader034/viewer/2022051811/60209abeac7aec58854ba0eb/html5/thumbnails/114.jpg)
Bibliography
• https://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html
• https://www.youtube.com/playlist?list=PL3FW7Lu3i5JvHM8ljYj-zLfQRF3EO8sYv
• http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
• https://github.com/EdjeElectronics/TensorFlow-Object-Detection-API-Tutorial-Train-Multiple-Objects-Windows-10
• https://project.inria.fr/deeplearning/files/2016/05/DLFrameworks.pdf [for an in-depth comparison of DL frameworks]