
Deep Learning for Computer Vision Pr. Jenny Benois-Pineau LABRI UMR 5800/Université Bordeaux Chapter 2. From Shallow to Deep

Chapter 2

Summary. 1. Kinds of machine learning: 1.1. Unsupervised learning; 1.2. Supervised learning, main formulations. 2. Artificial Neural Networks. 3. Multi-Layered Perceptron (MLP)

Deep Learning for Computer Vision 2

1. Kinds of Machine learning

➔ Once more: machine learning teaches computers to do what comes naturally to humans and animals: learn from experience.
➔ Machine learning algorithms use computational methods to "learn" information directly from data, without relying on a predetermined equation as a model.
➔ The algorithms adaptively improve their performance as the number of samples available for learning increases.

Deep Learning for Computer Vision 3

Oge Marques, IPTA’2017

Types of Learning

Deep Learning for Computer Vision 4

Evaluation Metrics

Deep Learning for Computer Vision 5

Confusion Matrix

Quality Metrics

➔ ACC = (TP + TN)/(TP + FP + TN + FN)
➔ BACC = (TP/P + TN/N)/2 *
➔ TPR = TP/(TP + FN), also called recall (R)
➔ TNR = TN/(TN + FP)
➔ P (precision) = TP/(TP + FP)
➔ F-score = 2/(1/R + 1/P) *
➔ FPR = FP/(FP + TN)
➔ FNR = FN/(FN + TP)
➔ * recommended for unbalanced classes
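As a quick illustration, here is a minimal sketch in plain Python that computes these metrics from the four confusion-matrix cells; the counts are hypothetical values chosen only for the example.

```python
# Minimal sketch: quality metrics from confusion-matrix counts (hypothetical values).
TP, FP, TN, FN = 40, 10, 35, 15          # counts of a binary confusion matrix
P_pos, N_neg = TP + FN, TN + FP          # actual positives and negatives

acc  = (TP + TN) / (TP + FP + TN + FN)   # accuracy
bacc = (TP / P_pos + TN / N_neg) / 2     # balanced accuracy (unbalanced classes)
tpr  = TP / (TP + FN)                    # recall / sensitivity
tnr  = TN / (TN + FP)                    # specificity
prec = TP / (TP + FP)                    # precision
f1   = 2 / (1 / tpr + 1 / prec)          # F-score: harmonic mean of R and P
fpr  = FP / (FP + TN)
fnr  = FN / (FN + TP)
print(f"ACC={acc:.2f} BACC={bacc:.2f} TPR={tpr:.2f} F1={f1:.2f}")
```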

Deep Learning for Computer Vision 6

ROC curve

➔ Consider a binary classification problem in which the classifier's decision depends on a threshold; for different values of the threshold we obtain different (FPR, TPR) pairs, which trace the ROC curve.
➔ In a multi-class classification problem: one vs. all! We plot one curve for each of the N classes.
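A minimal numpy sketch of how such a curve is built by sweeping the threshold; the labels y and scores s below are toy data, assumed only for illustration.

```python
import numpy as np

# Minimal sketch: build an ROC curve by sweeping the decision threshold.
# y (0/1 labels) and s (classifier scores) are hypothetical toy data.
y = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 0])
s = np.array([0.1, 0.4, 0.35, 0.8, 0.5, 0.9, 0.6, 0.2, 0.7, 0.3])

P, N = y.sum(), (1 - y).sum()
points = []
for thr in np.sort(np.unique(s))[::-1]:      # from strict to permissive threshold
    pred = (s >= thr).astype(int)
    TP = ((pred == 1) & (y == 1)).sum()
    FP = ((pred == 1) & (y == 0)).sum()
    points.append((FP / N, TP / P))          # one (FPR, TPR) point of the ROC curve
print(points)
```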

Deep Learning for Computer Vision 7

1.1. Unsupervised learning(1)

➔ Unsupervised learning finds hidden patterns or intrinsic structures in data.
➔ It is used to draw inferences from datasets consisting of input data without labelled responses.
➔ Clustering is the most common unsupervised learning technique.
➔ It is used for exploratory data analysis to find hidden patterns or groupings in data.
➔ Applications in computer vision: visual data summarization (collections of images, video).

Deep Learning for Computer Vision 8

K-means clustering

➔ J. MacQueen, "Some methods for classification and analysis of multivariate observations", Proc. of the Fifth Berkeley Symposium on Math. Stat. and Prob., pp. 281–296, 1967

➔ Principle: unsupervised classification with an a priori known number of clusters.
➔ Parameter: the number k of clusters.
➔ Input data: a sample of M descriptor vectors x1, ..., xM.
➔ (1) Choose k initial centers c1, ..., ck.
➔ (2) Assign each of the M vectors to the i-th cluster whose center ci is closest in the sense of the chosen metric.
➔ (3) If no vector changes its class, then stop.
➔ (4) Compute new centers: for each i, ci is the mean of the vectors of class i.
➔ (5) Go to (2).
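A minimal numpy sketch of these steps; the toy data, the value of k and the Euclidean metric are assumptions chosen for illustration.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means sketch following steps (1)-(5); Euclidean distance assumed."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]    # (1) choose k initial centers
    labels = None
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)                      # (2) assign each vector to the closest center
        if labels is not None and np.array_equal(new_labels, labels):
            break                                              # (3) no vector changed its class -> stop
        labels = new_labels
        for i in range(k):                                     # (4) new centers = means of each class
            if np.any(labels == i):
                centers[i] = X[labels == i].mean(axis=0)
    return labels, centers                                     # (5) otherwise loop back to (2)

X = np.random.default_rng(1).normal(size=(200, 2))             # toy descriptor vectors
labels, centers = kmeans(X, k=3)
```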

Deep Learning for Computer Vision 9

Application example(1)

➔ Lifelogging (C. Gurrin, A. Smeaton, DCU)

Deep Learning for Computer Vision 10

http://www.slideshare.net/cgurrin/biohackers-summit-2015-lifelogging-a-new-era-of-personal-data

Application Example (2)

➔ Grouping of similar images
➔ Selection of the cluster center: "hyper-scenes"

Deep Learning for Computer Vision 11

[Figure: hyper-scenes H1–H4]

H. Nicolas, A. Manoury, J. Benois-Pineau, W. Dupuis, D. Barba: Grouping video shots into scenes based on 1D mosaic descriptors. IEEE ICIP 2004: 637-640

Hierarchical agglomerative clustering (HAC)

➔ Principle:
➔ (1) At initialisation, each descriptor vector in the sample forms its own class.
➔ (2) While the number of clusters is larger than k (in the limit, k = 1):
› group the two closest classes in the sense of a distance d. Distances between clusters: max-link, min-link, mean-link (defined below).

Deep Learning for Computer Vision 12

Max-link: $d_{\max}(C_i, C_j) = \max_{x \in C_i,\, y \in C_j} d(x, y)$

Min-link: $d_{\min}(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} d(x, y)$

Mean-link: $d_{\mathrm{mean}}(C_i, C_j) = \dfrac{1}{n_i \, n_j} \sum_{l=1}^{n_i} \sum_{p=1}^{n_j} d(x_l, y_p)$
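A small numpy sketch of the three linkage distances between two clusters; the toy 2-D clusters and the Euclidean metric are assumptions for illustration.

```python
import numpy as np

# Pairwise Euclidean distances between all points of two toy clusters Ci and Cj.
Ci = np.random.default_rng(0).normal(0.0, 1.0, size=(5, 2))
Cj = np.random.default_rng(1).normal(3.0, 1.0, size=(7, 2))
D = np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=2)   # shape (5, 7)

d_max  = D.max()    # max-link  (complete linkage)
d_min  = D.min()    # min-link  (single linkage)
d_mean = D.mean()   # mean-link (average linkage): sum over all pairs / (ni * nj)
print(d_min, d_mean, d_max)
```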

Dendrogram

Deep Learning for Computer Vision 13

S. Benini et al. Extraction of Significant Video Summaries by Dendrogram Analysis. IEEE ICIP 2006: 133-136

Supervised learning(1)

➔ Problem statement:
➔ Let us consider a set of pairs $\{(x_1, y_1), \ldots, (x_n, y_n), \ldots, (x_N, y_N)\}$
➔ $x_n \in \mathbb{R}^K = X$ – feature vectors
➔ $y_n \in Y$ – labels (of classes)
➔ Let us consider a function $g(x, \alpha) : X \to Y$, with prediction $g(x_n, \alpha) = \hat{y}_n$
➔ Let us now consider a function $L(y_n, \hat{y}_n)$ – the loss of predicting $\hat{y}_n$

Deep Learning for Computer Vision 14

Supervised learning(2)

➔ Empirical risk minimization: find a function $g$ which minimizes

$R_{emp}(g) = \dfrac{1}{N} \sum_{n=1}^{N} L\big(y_n, g(x_n, \alpha)\big)$

➔ Structural risk minimization: consider a penalty $C(g) : G \to \mathbb{R}^+$ and minimize

$J(g) = R_{emp}(g) + \lambda\, C(g), \quad \lambda \ge 0$

➔ If the variable y is discrete – classification; otherwise – regression.
➔ For a given form of g, the problem consists in finding the optimal parameters $\alpha^*$.

Deep Learning for Computer Vision 15
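A small numpy sketch of both risks; the linear form of g, the squared loss and the L2 penalty are assumptions chosen only to make the formulas concrete.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                      # feature vectors x_n in R^K
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=100)

def g(x, alpha):                                   # chosen form of g: a linear model
    return x @ alpha

def empirical_risk(alpha, X, y):                   # R_emp(g) = (1/N) sum L(y_n, g(x_n, alpha))
    return np.mean((y - g(X, alpha)) ** 2)

def structural_risk(alpha, X, y, lam=0.1):         # J(g) = R_emp(g) + lambda * C(g), C = ||alpha||^2
    return empirical_risk(alpha, X, y) + lam * np.sum(alpha ** 2)

alpha = rng.normal(size=5)
print(empirical_risk(alpha, X, y), structural_risk(alpha, X, y))
```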

Quality of prediction

Deep Learning for Computer Vision 16

[Figure: confusion matrix; Type I error = false positive, Type II error = false negative]

2. Artificial Neural Networks

➔ Biological inspiration
➔ The basic computational unit of the brain is a neuron. Approximately 86 billion neurons can be found in the human nervous system, and they are connected by approximately $10^{14}$–$10^{15}$ synapses.

Deep Learning for Computer Vision 17

$g(x, W, b) = f\Big(\sum_i w_i x_i + b\Big)$

McCulloch and Pitts, 1943, activation function:

$f(t) = \begin{cases} 1, & t > 0 \\ 0, & \text{otherwise} \end{cases}$

Commonly used non-linear functions

➔ Sigmoid: $f(t) = \dfrac{1}{1 + e^{-t}}$
➔ Tanh: $f(t) = \dfrac{e^{t} - e^{-t}}{e^{t} + e^{-t}}$

➔ Sigmoids saturate and kill gradients (!)
➔ The tanh non-linearity is always preferred to the sigmoid non-linearity.

[Figure: a) Sigmoid, b) Tanh]

Deep Learning for Computer Vision 18

ReLU non-linearity

➔ Rectified Linear Unit: $f(t) = \max(t, 0)$

- Lower computational cost w.r.t. sigmoid and tanh
- Faster convergence

Deep Learning for Computer Vision 19
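A short numpy sketch of the three activations and their derivatives; the derivatives are included here because they reappear in back-propagation below.

```python
import numpy as np

# Common activation functions and their derivatives (reused later for back-propagation).
def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def d_sigmoid(t):
    s = sigmoid(t)
    return s * (1.0 - s)          # close to 0 for large |t|: the sigmoid "kills" gradients

def tanh(t):
    return np.tanh(t)

def d_tanh(t):
    return 1.0 - np.tanh(t) ** 2

def relu(t):
    return np.maximum(t, 0.0)

def d_relu(t):
    return (t > 0).astype(float)  # a comparison only: cheaper than the exponentials above

t = np.linspace(-5.0, 5.0, 11)
print(sigmoid(t), tanh(t), relu(t))
```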

A simple neuron

Deep Learning for Computer Vision 20

$x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \qquad w = \begin{pmatrix} w_1 \\ w_2 \end{pmatrix}$

$x^T w = x_1 w_1 + x_2 w_2$

$y = f(x_1 w_1 + x_2 w_2)$

Our simplest function f: the Heaviside step function

$f(t) = \begin{cases} 1, & t > 0 \\ 0, & \text{otherwise} \end{cases}$

(Source: Wikipedia)

$y = f(x_1 w_1 + x_2 w_2) = \begin{cases} 1, & x_1 w_1 + x_2 w_2 > 0 \\ 0, & \text{otherwise} \end{cases}$

How to determine the weights $w_i$?
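A tiny sketch of this neuron with the Heaviside step as f; the weights and bias are hypothetical values chosen so that the neuron behaves like a 2-input AND gate.

```python
import numpy as np

def heaviside(t):
    """Heaviside step activation: 1 if t > 0, else 0."""
    return 1 if t > 0 else 0

def simple_neuron(x, w, b=0.0):
    """Weighted sum x^T w (+ optional bias) passed through the step function."""
    return heaviside(float(np.dot(x, w)) + b)

# Hypothetical weights: with w = (1, 1) and b = -1.5 the neuron computes a 2-input AND.
w = np.array([1.0, 1.0])
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, simple_neuron(np.array(x, dtype=float), w, b=-1.5))
```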

Training a neuron

➔ The artificial neuron can be trained to perform as an elementary linear classifier, i.e. we can determine the weights $w_i$ which minimize the empirical (or structural) risk of our classifier.
➔ The elementary training algorithm was proposed by Rosenblatt (1958): the "Perceptron".
➔ Consider the "training set" $\{(x_1, y_1), \ldots, (x_n, y_n), \ldots, (x_N, y_N)\}$, $y_i \in \{0; 1\}$ (in our case of a simple neuron, the target is a binary label).
➔ Initialize the weights randomly: $w \leftarrow w_0$.
➔ Then at each iteration t the weights are updated as

$w_i^{t+1} = w_i^{t} + \eta\, \big(y_n - \hat{y}_n^{t}\big)\, x_{i,n}$

➔ "Back-propagation" of the error (a minimal sketch of this rule follows below).

Limitations: the set of functions (classifiers) which can be simulated by the Perceptron is narrow (cf. Minsky and Papert (1969), XOR).

Deep Learning for Computer Vision 21
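A minimal sketch of the Rosenblatt update rule on toy linearly separable data; the data, the learning rate and the omission of a bias term are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linearly separable data: label 1 if x1 > x2 (a boundary through the origin).
X = rng.uniform(-1, 1, size=(100, 2))
y = (X[:, 0] > X[:, 1]).astype(int)

w = rng.normal(size=2)                            # random initial weights
eta = 0.1                                         # learning rate (assumed value)

for epoch in range(50):
    for xn, yn in zip(X, y):
        y_hat = 1 if xn @ w > 0 else 0            # Heaviside output of the neuron
        w += eta * (yn - y_hat) * xn              # w_i <- w_i + eta * (y_n - y_hat_n^t) * x_{i,n}

accuracy = np.mean(((X @ w) > 0).astype(int) == y)
print("training accuracy:", accuracy)
```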

3. Multi-Layered Perceptron (MLP)

➔ Perceptron (1958) -> MLP (1961) – Rosenblatt
➔ Let us consider a binary classification problem
➔ The input layer is just our data
➔ Hidden layers produce more abstract features
➔ Hidden layers are fully connected
➔ The output of the hidden layers is usually not binary (ReLU, tanh, sigmoid)

Deep Learning for Computer Vision 22 http://neuralnetworksanddeeplearning.com/chap1.html

Example : Recognition of Handwritten digits

➔  Input is a binarised image

➔ Each image is a 28x28 matrix

➔  10-class classification problem

Deep Learning for Computer Vision 23

The architecture of an MLP with 1 hidden layer

Deep Learning for Computer Vision 24

http://neuralnetworksanddeeplearning.com/chap1.html
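A minimal numpy sketch of a forward pass through such an MLP for the digits example; the hidden-layer size of 30 units and the sigmoid activation are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)

# 784 inputs (28x28 pixels), one hidden layer of 30 units (arbitrary choice), 10 output classes.
n_in, n_hidden, n_out = 28 * 28, 30, 10
W1, b1 = rng.normal(0, 0.1, (n_in, n_hidden)), np.zeros(n_hidden)
W2, b2 = rng.normal(0, 0.1, (n_hidden, n_out)), np.zeros(n_out)

def forward(x):
    """Feed-forward pass: input -> fully connected hidden layer -> output scores."""
    h = sigmoid(x @ W1 + b1)          # hidden layer: more abstract features
    o = sigmoid(h @ W2 + b2)          # output layer: one score per digit class
    return o

x = rng.integers(0, 2, n_in).astype(float)   # a fake binarised 28x28 image, flattened
scores = forward(x)
print("predicted digit:", scores.argmax())
```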

Training of MLP

➔ As in the case of the Perceptron, training an MLP consists in finding the optimal configuration of weights at all layers.
➔ Principle: it is the same, i.e. minimize the loss L between the prediction and the ground-truth labels.
➔ To stress the multilayer architecture, let us denote by $w_{ij}^{l,(l+1)}$ the weight between the i-th neuron of layer l and the j-th neuron of layer l+1. The update is

$w_{ij}^{(t+1),\,l,(l+1)} = w_{ij}^{(t),\,l,(l+1)} - \eta\, \dfrac{\partial L}{\partial w_{ij}^{l,(l+1)}}$

➔ $\eta$ is the learning rate.
➔ I.e. each parameter of the MLP is updated in the direction opposite to the gradient of the loss function L – gradient descent. To compute the derivatives of the loss function at each layer we use the "chain rule" (a sketch of the update loop follows below).

Deep Learning for Computer Vision 25
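A schematic gradient-descent step over all layers; the container names W and dL_dW are hypothetical, standing for the per-layer weights and the gradients produced by back-propagation.

```python
# Schematic gradient-descent step over all layers of an MLP.
# W[l] and dL_dW[l] are hypothetical containers: the weights of layer l and the
# gradient of the loss w.r.t. those weights, as obtained by back-propagation.
def gradient_descent_step(W, dL_dW, eta=0.01):
    for l in range(len(W)):
        W[l] = W[l] - eta * dL_dW[l]   # move each weight against the gradient of L
    return W
```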

Back-propagation and chain rule(1)

➔  Let us consider a trivial Neural Network

➔ x is the input
➔ h stands for the hidden layer
➔ o stands for the output layer
➔ f is a non-linear transformation
➔ w are the synaptic weights
➔ y is the known output
➔ $\hat{y}$ is the predicted output

[Diagram: $x \to x \cdot w_h \to f(x \cdot w_h) \to w_o \cdot f(x \cdot w_h) \to \hat{y} = f\big(w_o \cdot f(x \cdot w_h)\big)$, compared with the known output $y$]

Deep Learning for Computer Vision 26

Back-propagation(2)

➔ Let us consider the Error/Loss function $L = \frac{1}{2}\,(y - \hat{y})^2$
➔ The goal is to find $(w_o^*, w_h^*)^T = \operatorname{Argmin} L(w_o, w_h)$
➔ Method: gradient descent

$w_o^{(t+1)} = w_o^{(t)} - \eta\, L'_{w_o}\big(w_o^{(t)}, w_h^{(t)}\big)$

$w_h^{(t+1)} = w_h^{(t)} - \eta\, L'_{w_h}\big(w_o^{(t)}, w_h^{(t)}\big)$

➔ $\eta$ is the "learning rate", for simplicity the same for all layers.

Deep Learning for Computer Vision 27

Back-propagation(3)

➔ How to compute the partial derivatives?
➔ Chain rule: $\big(a(b(c(x)))\big)' = a'(b) \cdot b'(c) \cdot c'(x)$

[Diagram: the trivial network $x \to f(x \cdot w_h) \to \hat{y} = f\big(w_o \cdot f(x \cdot w_h)\big)$, target $y$]

$L = \frac{1}{2}\,(y - \hat{y})^2$

$L'_{w_o} = \,? \qquad L'_{w_o} = (\hat{y} - y)\cdot \hat{y}'_{w_o} = (\hat{y} - y)\cdot f' \cdot f(x \cdot w_h)$

$L'_{w_h} = \,? \qquad L'_{w_h} = (\hat{y} - y)\cdot \hat{y}'_{w_h} = (\hat{y} - y)\cdot f' \cdot w_o \cdot f' \cdot x$

Deep Learning for Computer Vision 28

Back-propagation(4)

➔  How to compute the partial derivatives

Deep Learning for Computer Vision 29

[Diagram: the same trivial network; each gradient decomposes into a layer error multiplied by the layer input]

$L'_{w_o} = (\hat{y} - y)\cdot \hat{y}'_{w_o} = \underbrace{(\hat{y} - y)\cdot f'}_{\text{layer error}} \cdot \underbrace{f(x \cdot w_h)}_{\text{layer input } y_h}$

$L'_{w_h} = (\hat{y} - y)\cdot \hat{y}'_{w_h} = \underbrace{(\hat{y} - y)\cdot f' \cdot w_o \cdot f'}_{\text{layer error}} \cdot \underbrace{x}_{\text{layer input}}$
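To make the chain rule concrete, here is a small numpy sketch of one forward and backward pass through this trivial network; the sigmoid as f, the scalar values and the learning rate are assumptions for illustration.

```python
import numpy as np

def f(t):                      # non-linear transformation (sigmoid chosen for illustration)
    return 1.0 / (1.0 + np.exp(-t))

def df(t):                     # its derivative
    s = f(t)
    return s * (1.0 - s)

x, y = 0.5, 1.0                # input and known output (arbitrary values)
w_h, w_o = 0.3, -0.8           # hidden and output weights (arbitrary values)
eta = 0.1                      # learning rate

for step in range(3):
    # Forward pass
    a_h = x * w_h              # hidden pre-activation  x . w_h
    y_h = f(a_h)               # hidden output          f(x . w_h)
    a_o = w_o * y_h            # output pre-activation  w_o . f(x . w_h)
    y_hat = f(a_o)             # prediction             f(w_o . f(x . w_h))
    L = 0.5 * (y - y_hat) ** 2

    # Backward pass (chain rule): layer error times layer input
    dL_dwo = (y_hat - y) * df(a_o) * y_h
    dL_dwh = (y_hat - y) * df(a_o) * w_o * df(a_h) * x

    # Gradient-descent updates
    w_o -= eta * dL_dwo
    w_h -= eta * dL_dwh
    print(f"step {step}: loss = {L:.4f}")
```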

MLP- conclusion

➔ The MLP is a fully connected network: each output of a previous layer is connected to all inputs of the next layer.
➔ The MLP is a feed-forward neural network: at test time the input data pass in a direct manner from the input layer up to the output layer.
➔ The MLP trained with back-propagation proved to be a very effective supervised learning algorithm.
➔ It was widely used for character recognition, face recognition, etc.
➔ Nevertheless, if we work with high-resolution images, the number of parameters to train becomes very high. This would kill performance.
➔ Solution: Convolutional Neural Networks (CNN).

Deep Learning for Computer Vision 30
