Training Convolutional Neural Networks
R. Q. FEITOSA
Overview
1. Activation Functions
2. Data Preprocessing
3. Weight Initialization
4. Batch Normalization
Activation Functions
Activation Layer
A neuron scales each input xᵢ by a weight wᵢ, sums the products w₀x₀, w₁x₁, …, wₙxₙ together with a bias b, and passes the result through an activation function f:

output = f(Σᵢ wᵢxᵢ + b)
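As a minimal sketch of this computation (the function names below are illustrative, not from the slides), in Python/NumPy:

import numpy as np

def neuron(x, w, b, f):
    # Weighted sum of the inputs plus bias, passed through the activation f.
    return f(np.dot(w, x) + b)

# Example with a 3-input neuron and a sigmoid activation.
sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))
out = neuron(np.array([1.0, 2.0, 3.0]), np.array([0.1, -0.2, 0.3]), 0.5, sigmoid)
print(out)  # a single scalar in (0, 1)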
Activation Functions
Four commonly used activation functions:
• Sigmoid: σ(x) = 1 / (1 + e^(−x))
• tanh: f(x) = tanh(x)
• ReLU: f(x) = max(0, x)
• Leaky ReLU: f(x) = max(0.01x, x)
Sigmoid
• Squashes numbers to the range [0, 1].
• Historically popular since it has a nice interpretation as the saturating “firing rate” of a neuron.
Three shortcomings:
1) Saturated neurons “kill” the gradients,
2) Sigmoid outputs are not zero-centered,
3) exp() is somewhat expensive to compute.
Sigmoid kills gradient
Backpropagating through a sigmoid gate, the downstream gradient is the product of the local gradient and the upstream gradient:

∂L/∂x = (∂σ/∂x) · (∂L/∂σ)

In the flat, saturated regions of σ(x) = 1 / (1 + e^(−x)), i.e. for large |x|, the local gradient ∂σ/∂x is nearly zero, so the gate “kills” the gradient at both ends.
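A quick numeric check of this effect (illustrative code, not from the lecture): the local gradient of the sigmoid is σ′(x) = σ(x)(1 − σ(x)), which collapses toward zero in the saturated regions.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_local_grad(x):
    # dσ/dx = σ(x) * (1 - σ(x))
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"x = {x:+5.1f}   dσ/dx = {sigmoid_local_grad(x):.6f}")
# At x = ±10 the local gradient is about 4.5e-5: whatever the upstream
# gradient ∂L/∂σ is, the downstream gradient ∂L/∂x is effectively "killed".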
Sigmoid output not zero-centered
What happens when the input to a neuron, x in f(Σᵢ wᵢxᵢ + b), is always positive (as it is when x comes from a sigmoid)? The gradients on w are then all positive or all negative, which restricts the possible gradient update directions: the weight vector can only approach a hypothetical optimal w through inefficient zig-zag updates. Therefore, zero-mean data is highly desirable!
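A small sketch of why the signs align (all values below are made up for illustration). By the chain rule, ∂L/∂wᵢ = (∂L/∂f) · σ′(s) · xᵢ with s = Σᵢ wᵢxᵢ + b; since σ′(s) > 0, every component of the weight gradient inherits the sign of the upstream gradient whenever all xᵢ > 0.

import numpy as np

np.random.seed(0)
x = np.abs(np.random.randn(5))         # all-positive inputs, e.g. sigmoid outputs
w, b = np.random.randn(5), 0.1
s = w @ x + b                          # pre-activation
sig = 1.0 / (1.0 + np.exp(-s))
upstream = -0.7                        # some upstream gradient ∂L/∂f
dw = upstream * sig * (1.0 - sig) * x  # ∂L/∂w_i, one entry per weight
print(np.sign(dw))                     # all entries share the sign of `upstream`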
Tanh
f(x) = tanh(x)
• Squashes numbers to the range [-1, 1].
• Zero-centered.
Shortcoming:
Still “kills” the gradients when saturated, at both ends.
Rectified Linear Unit
f(x) = ReLU(x) = max(0, x)
• Does not saturate in the positive region.
• Very computationally efficient.
• Converges much faster than sigmoid/tanh (about 6×).
Shortcoming:
Not zero-centered output.
ReLU kills gradient for negative input
Backpropagating through a ReLU gate, with z = ReLU(x) = max(0, x):

∂L/∂x = (∂z/∂x) · (∂L/∂z)

The local gradient ∂z/∂x is 1 for x > 0 and 0 for x < 0, so for negative inputs the gate “kills” the gradient.
Dead ReLU
Now let y = Wx + d and z = ReLU(y) = max(0, Wx + d). The local gradient is

∂z/∂y = 1 if Wx + d > 0, and 0 otherwise,

and ∂L/∂y = (∂z/∂y) · (∂L/∂z). Only data on the positive side of the hyperplane Wx + d = 0 yields a non-zero derivative. If that hyperplane lies entirely outside the data cloud (in the x₁, x₂ plane), the neuron never activates and never receives a gradient update: it is “dead”. See the sketch below.
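An illustrative check of a dead ReLU (the numbers are made up): when Wx + d is negative for every point in the data cloud, no sample ever yields a non-zero local gradient, so W and d can never be updated.

import numpy as np

np.random.seed(1)
X = np.random.randn(1000, 2) + 3.0   # data cloud centered at (3, 3)
W = np.array([-1.0, -1.0])           # weights of a single ReLU neuron
d = -2.0                             # bias chosen so Wx + d < 0 on the cloud

pre = X @ W + d                      # y = Wx + d for every sample
active = pre > 0                     # where the local gradient ∂z/∂y is 1
print(active.mean())                 # 0.0 -> the neuron is dead: it never
                                     # fires, so it never gets a gradient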
Leaky ReLU
f(x) = max(0.01x, x)
• Never saturates.
• Computationally efficient.
• Converges much faster than sigmoid/tanh (about 6×).
• Will not “die”.
Shortcoming:
Not zero-centered output.
A generalization is the Parametric Rectifier (PReLU): f(x) = max(αx, x), where the slope α is learned. A side-by-side sketch of the three rectifiers follows.
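A minimal sketch of the three rectifiers (helper names are illustrative):

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # max(0.01x, x): a small non-zero slope for x < 0, so the unit never dies.
    return np.maximum(alpha * x, x)

def prelu(x, alpha):
    # Parametric Rectifier: same shape, but alpha is a learnable parameter.
    return np.maximum(alpha * x, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))               # [ 0.     0.     0.     0.5    2.   ]
print(leaky_relu(x))         # [-0.02  -0.005  0.     0.5    2.   ]
print(prelu(x, alpha=0.25))  # [-0.5   -0.125  0.     0.5    2.   ]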
Data Preprocessing
Preprocessing Operations
In practice, for images:
• No normalization (pixel values are already in the same range).
• Subtract the mean image (e.g. AlexNet).
• Subtract the per-channel mean (e.g. VGGNet).
A sketch of both subtraction schemes follows.
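A minimal sketch of both schemes, assuming a batch of images with shape (N, H, W, C); names are illustrative:

import numpy as np

def subtract_mean_image(X_train, X):
    # AlexNet-style: subtract the per-pixel mean image (shape H x W x C).
    return X - X_train.mean(axis=0)

def subtract_channel_mean(X_train, X):
    # VGGNet-style: subtract a single scalar mean per channel (shape C).
    return X - X_train.mean(axis=(0, 1, 2))

X_train = np.random.rand(100, 32, 32, 3) * 255.0       # toy image batch
print(subtract_mean_image(X_train, X_train).mean())    # ~0
print(subtract_channel_mean(X_train, X_train).mean())  # ~0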
Weight Initialization
Small initial weights
As the data flows forward, the input (x) at each layer tends to concentrate around zero. Recall that for a product Wx the local gradient is ∂(Wx)/∂W = x. As the gradient propagates back, it therefore tends to vanish: no update.

(Figure: histograms of the input data at each convolutional layer, concentrating around zero in the deeper layers.)
Large initial weights
Assume a tanh or a sigmoid activation function. The inputs to the activation functions then lie in the saturated regions, where the local gradient is small. As the gradient propagates back, it again tends to vanish: no update. An illustrative simulation of both failure modes follows.
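An illustrative simulation (made-up layer sizes, tanh activations): tiny weights drive the activations, and hence the local gradients ∂(Wx)/∂W = x, toward zero, while large weights push the activations into saturation.

import numpy as np

np.random.seed(0)
x0 = np.random.randn(1000, 500)          # a batch of unit-Gaussian inputs

for scale in (0.01, 1.0):                # small vs. large initial weights
    x = x0
    stds = []
    for _ in range(10):                  # ten fully connected tanh layers
        W = np.random.randn(500, 500) * scale
        x = np.tanh(x @ W)
        stds.append(x.std())
    print(f"scale = {scale}:", " ".join(f"{s:.3f}" for s in stds))
# scale = 0.01: the stds shrink toward 0 (activations collapse, gradients vanish)
# scale = 1.0 : the activations pile up near ±1 (saturated, local gradient ~ 0)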
Optimal Weight Initialization
Still an active research topic:
• “Understanding the difficulty of training deep feedforward neural networks” by Glorot and Bengio, 2010
• “Exact solutions to the nonlinear dynamics of learning in deep linear neural networks” by Saxe et al., 2013
• “Random walk initialization for training very deep feedforward networks” by Sussillo and Abbott, 2014
• “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification” by He et al., 2015
• “Data-dependent Initializations of Convolutional Neural Networks” by Krähenbühl et al., 2015
• “All you need is a good init” by Mishkin and Matas, 2015
A sketch of the Xavier/Glorot and He recipes follows.
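A hedged sketch of two recipes from this literature: Xavier/Glorot initialization (fan-in scaling 1/√n_in, suited to tanh) and the He et al. variant for ReLUs (√(2/n_in)). Exact formulas vary across the papers (e.g. fan-in only vs. fan-in plus fan-out); this uses the common fan-in form.

import numpy as np

rng = np.random.default_rng(0)

def xavier_init(n_in, n_out):
    # Glorot & Bengio (2010): scale 1/sqrt(n_in) keeps the activation
    # variance roughly constant from layer to layer for tanh units.
    return rng.standard_normal((n_in, n_out)) / np.sqrt(n_in)

def he_init(n_in, n_out):
    # He et al. (2015): sqrt(2/n_in) compensates for ReLU zeroing out
    # half of the pre-activations.
    return rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)

print(xavier_init(500, 500).std())  # ~1/sqrt(500) ≈ 0.045
print(he_init(500, 500).std())      # ~sqrt(2/500) ≈ 0.063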
Batch Normalization
Making the Activations Gaussian
Take a batch of activations at some layer. To make each dimension k look like a unit Gaussian, apply

x̂^(k) = (x^(k) − E[x^(k)]) / √(Var[x^(k)])

Notice that this function has a simple derivative, so gradients can flow through it.
In practice, we take the empirical mean and variance. For x over a mini-batch ℬ = {x₁, …, xₘ} (rows are samples, columns are features):

μ_ℬ = (1/m) Σᵢ xᵢ
σ²_ℬ = (1/m) Σᵢ (xᵢ − μ_ℬ)²
x̂ᵢ = (xᵢ − μ_ℬ) / √(σ²_ℬ + ε)

where the sums run over the m samples, ε is a small constant added for numerical stability, and μ_ℬ and σ²_ℬ are computed for each feature dimension independently.
Recovering the Identity Mapping
Normalize:

x̂^(k) = (x^(k) − E[x^(k)]) / √(Var[x^(k)])

and then allow the network to undo it, if “wished”, through the learnable parameters γ^(k) and β^(k):

y^(k) = γ^(k) x̂^(k) + β^(k)

Notice that the network can learn γ^(k) = √(Var[x^(k)]) and β^(k) = E[x^(k)], which recovers the identity mapping.
Obs.: recall that k refers to a (feature) dimension.
ConvNet Layer with BN
Batch normalization, y^(k) = γ^(k) x̂^(k) + β^(k), is inserted prior to the nonlinearity:

convolution → batch normalization → activation function → pooling
Batch Normalizing Transform
Input: values of x over a mini-batch ℬ = {x₁, …, xₘ}; parameters to be learned: γ, β.
Output: yᵢ = BN_{γ,β}(xᵢ)

μ_ℬ = (1/m) Σᵢ xᵢ // mini-batch mean
σ²_ℬ = (1/m) Σᵢ (xᵢ − μ_ℬ)² // mini-batch variance
x̂ᵢ = (xᵢ − μ_ℬ) / √(σ²_ℬ + ε) // normalize
yᵢ = γ x̂ᵢ + β ≡ BN_{γ,β}(xᵢ) // scale and shift

Ioffe, S. and Szegedy, C., “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, 2015, arXiv. Available online at https://arxiv.org/abs/1502.03167
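A minimal NumPy rendering of the transform above, with per-dimension statistics over a batch of shape (m, k); the parameter names follow the slide:

import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # Batch Normalizing Transform over a mini-batch x of shape (m, k).
    mu = x.mean(axis=0)                     # mini-batch mean, per dimension
    var = x.var(axis=0)                     # mini-batch variance, per dimension
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize
    return gamma * x_hat + beta             # scale and shift: BN_{γ,β}(x)

x = np.random.randn(64, 10) * 5.0 + 3.0     # a badly scaled batch
y = batchnorm_forward(x, gamma=np.ones(10), beta=np.zeros(10))
print(y.mean(axis=0).round(3))              # ~0 in every dimension
print(y.std(axis=0).round(3))               # ~1 in every dimension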
Batch normalization:
• Improves gradient flow through the network.
• Allows higher learning rates.
• Reduces the dependence on the initialization.
BN at test time
At test time, BN functions differently: the mean and std are not computed from the batch. Instead, fixed empirical values computed during training are used.
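A sketch of this train/test asymmetry, assuming the fixed statistics are tracked as running averages during training (a common implementation choice; the momentum value is illustrative):

import numpy as np

class BatchNorm:
    def __init__(self, k, eps=1e-5, momentum=0.9):
        self.gamma, self.beta = np.ones(k), np.zeros(k)
        self.run_mu, self.run_var = np.zeros(k), np.ones(k)
        self.eps, self.momentum = eps, momentum

    def forward(self, x, train=True):
        if train:
            mu, var = x.mean(axis=0), x.var(axis=0)
            # Accumulate the fixed empirical values used at test time.
            m = self.momentum
            self.run_mu = m * self.run_mu + (1 - m) * mu
            self.run_var = m * self.run_var + (1 - m) * var
        else:
            mu, var = self.run_mu, self.run_var   # no batch statistics
        return self.gamma * (x - mu) / np.sqrt(var + self.eps) + self.beta

bn = BatchNorm(10)
for _ in range(100):                                 # training accumulates stats
    bn.forward(np.random.randn(64, 10), train=True)
y = bn.forward(np.random.randn(1, 10), train=False)  # works even for one sample
print(y.shape)  # (1, 10)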
Summary
• Activation Functions (use ReLU or Leaky ReLU)
• Data Preprocessing (subtract the mean)
• Weight Initialization (use Xavier init)
• Batch Normalization (use it)
Training Convolutional Neural Networks
Thank you!
R. Q. FEITOSA