Training Convolutional Neural Networks
R. Q. FEITOSA
Overview
1. Activation Functions
2. Data Preprocessing
3. Weight Initialization
4. Batch Normalization
Activation Functions
Activation Layer
A neuron scales each input xᵢ by a weight wᵢ, sums the products w₀x₀, w₁x₁, …, wₙxₙ together with a bias b, and passes the result through an activation function f:

output = f(Σᵢ wᵢxᵢ + b)
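As a minimal sketch of this computation (the function names below are illustrative, not from the slides), in Python/NumPy:

import numpy as np

def neuron(x, w, b, f):
    # Weighted sum of the inputs plus bias, passed through the activation f.
    return f(np.dot(w, x) + b)

# Example with a 3-input neuron and a sigmoid activation.
sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))
out = neuron(np.array([1.0, 2.0, 3.0]), np.array([0.1, -0.2, 0.3]), 0.5, sigmoid)
print(out)  # a single scalar in (0, 1)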
Activation Functions
Four commonly used activation functions:
• Sigmoid: σ(x) = 1 / (1 + e^(−x))
• tanh: f(x) = tanh(x)
• ReLU: f(x) = max(0, x)
• Leaky ReLU: f(x) = max(0.01x, x)
Sigmoid
• Squashes numbers to the range [0, 1].
• Historically popular since it has a nice interpretation as the saturating “firing rate” of a neuron.
Three shortcomings:
1) Saturated neurons “kill” the gradients,
2) Sigmoid outputs are not zero-centered,
3) exp() is somewhat expensive to compute.
Sigmoid kills gradient
Backpropagating through a sigmoid gate, the downstream gradient is the product of the local gradient and the upstream gradient:

∂L/∂x = (∂σ/∂x) · (∂L/∂σ)

In the flat, saturated regions of σ(x) = 1 / (1 + e^(−x)), i.e. for large |x|, the local gradient ∂σ/∂x is nearly zero, so the gate “kills” the gradient at both ends.
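A quick numeric check of this effect (illustrative code, not from the lecture): the local gradient of the sigmoid is σ′(x) = σ(x)(1 − σ(x)), which collapses toward zero in the saturated regions.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_local_grad(x):
    # dσ/dx = σ(x) * (1 - σ(x))
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"x = {x:+5.1f}   dσ/dx = {sigmoid_local_grad(x):.6f}")
# At x = ±10 the local gradient is about 4.5e-5: whatever the upstream
# gradient ∂L/∂σ is, the downstream gradient ∂L/∂x is effectively "killed".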
Sigmoid output not zero-centered
What happens when the input to a neuron, x in f(Σᵢ wᵢxᵢ + b), is always positive (as it is when x comes from a sigmoid)? The gradients on w are then all positive or all negative, which restricts the possible gradient update directions: the weight vector can only approach a hypothetical optimal w through inefficient zig-zag updates. Therefore, zero-mean data is highly desirable!
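A small sketch of why the signs align (all values below are made up for illustration). By the chain rule, ∂L/∂wᵢ = (∂L/∂f) · σ′(s) · xᵢ with s = Σᵢ wᵢxᵢ + b; since σ′(s) > 0, every component of the weight gradient inherits the sign of the upstream gradient whenever all xᵢ > 0.

import numpy as np

np.random.seed(0)
x = np.abs(np.random.randn(5))         # all-positive inputs, e.g. sigmoid outputs
w, b = np.random.randn(5), 0.1
s = w @ x + b                          # pre-activation
sig = 1.0 / (1.0 + np.exp(-s))
upstream = -0.7                        # some upstream gradient ∂L/∂f
dw = upstream * sig * (1.0 - sig) * x  # ∂L/∂w_i, one entry per weight
print(np.sign(dw))                     # all entries share the sign of `upstream`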
Tanh
f(x) = tanh(x)
• Squashes numbers to the range [-1, 1].
• Zero-centered.
Shortcoming:
Still “kills” the gradients when saturated, at both ends.
Rectified Linear Unit
f(x) = ReLU(x) = max(0, x)
• Does not saturate in the positive region.
• Very computationally efficient.
• Converges much faster than sigmoid/tanh (about 6×).
Shortcoming:
Not zero-centered output.
ReLU kills gradient for negative input
Backpropagating through a ReLU gate, with z = ReLU(x) = max(0, x):

∂L/∂x = (∂z/∂x) · (∂L/∂z)

The local gradient ∂z/∂x is 1 for x > 0 and 0 for x < 0, so for negative inputs the gate “kills” the gradient.
Dead ReLU
Now let y = Wx + d and z = ReLU(y) = max(0, Wx + d). The local gradient is

∂z/∂y = 1 if Wx + d > 0, and 0 otherwise,

and ∂L/∂y = (∂z/∂y) · (∂L/∂z). Only data on the positive side of the hyperplane Wx + d = 0 yields a non-zero derivative. If that hyperplane lies entirely outside the data cloud (in the x₁, x₂ plane), the neuron never activates and never receives a gradient update: it is “dead”. See the sketch below.
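An illustrative check of a dead ReLU (the numbers are made up): when Wx + d is negative for every point in the data cloud, no sample ever yields a non-zero local gradient, so W and d can never be updated.

import numpy as np

np.random.seed(1)
X = np.random.randn(1000, 2) + 3.0   # data cloud centered at (3, 3)
W = np.array([-1.0, -1.0])           # weights of a single ReLU neuron
d = -2.0                             # bias chosen so Wx + d < 0 on the cloud

pre = X @ W + d                      # y = Wx + d for every sample
active = pre > 0                     # where the local gradient ∂z/∂y is 1
print(active.mean())                 # 0.0 -> the neuron is dead: it never
                                     # fires, so it never gets a gradient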
Leaky ReLU
f(x) = max(0.01x, x)
• Never saturates.
• Computationally efficient.
• Converges much faster than sigmoid/tanh (about 6×).
• Will not “die”.
Shortcoming:
Not zero-centered output.
A generalization is the Parametric Rectifier (PReLU): f(x) = max(αx, x), where the slope α is learned. A side-by-side sketch of the three rectifiers follows.
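A minimal sketch of the three rectifiers (helper names are illustrative):

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # max(0.01x, x): a small non-zero slope for x < 0, so the unit never dies.
    return np.maximum(alpha * x, x)

def prelu(x, alpha):
    # Parametric Rectifier: same shape, but alpha is a learnable parameter.
    return np.maximum(alpha * x, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))               # [ 0.     0.     0.     0.5    2.   ]
print(leaky_relu(x))         # [-0.02  -0.005  0.     0.5    2.   ]
print(prelu(x, alpha=0.25))  # [-0.5   -0.125  0.     0.5    2.   ]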
Data Preprocessing
Preprocessing Operations
In practice, for images:
• No normalization (pixel values are already in the same range).
• Subtract the mean image (e.g. AlexNet).
• Subtract the per-channel mean (e.g. VGGNet).
A sketch of both subtraction schemes follows.
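A minimal sketch of both schemes, assuming a batch of images with shape (N, H, W, C); names are illustrative:

import numpy as np

def subtract_mean_image(X_train, X):
    # AlexNet-style: subtract the per-pixel mean image (shape H x W x C).
    return X - X_train.mean(axis=0)

def subtract_channel_mean(X_train, X):
    # VGGNet-style: subtract a single scalar mean per channel (shape C).
    return X - X_train.mean(axis=(0, 1, 2))

X_train = np.random.rand(100, 32, 32, 3) * 255.0       # toy image batch
print(subtract_mean_image(X_train, X_train).mean())    # ~0
print(subtract_channel_mean(X_train, X_train).mean())  # ~0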
Weight Initialization
Small initial weights
As the data flows forward, the input (x) at each layer tends to concentrate around zero. Recall that for a product Wx the local gradient is ∂(Wx)/∂W = x. As the gradient propagates back, it therefore tends to vanish: no update.

(Figure: histograms of the input data at each convolutional layer, concentrating around zero in the deeper layers.)
Large initial weights
Assume a tanh or a sigmoid activation function. The inputs to the activation functions then lie in the saturated regions, where the local gradient is small. As the gradient propagates back, it again tends to vanish: no update. An illustrative simulation of both failure modes follows.
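An illustrative simulation (made-up layer sizes, tanh activations): tiny weights drive the activations, and hence the local gradients ∂(Wx)/∂W = x, toward zero, while large weights push the activations into saturation.

import numpy as np

np.random.seed(0)
x0 = np.random.randn(1000, 500)          # a batch of unit-Gaussian inputs

for scale in (0.01, 1.0):                # small vs. large initial weights
    x = x0
    stds = []
    for _ in range(10):                  # ten fully connected tanh layers
        W = np.random.randn(500, 500) * scale
        x = np.tanh(x @ W)
        stds.append(x.std())
    print(f"scale = {scale}:", " ".join(f"{s:.3f}" for s in stds))
# scale = 0.01: the stds shrink toward 0 (activations collapse, gradients vanish)
# scale = 1.0 : the activations pile up near ±1 (saturated, local gradient ~ 0)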
Optimal Weight Initialization
Still an active research topic:
• “Understanding the difficulty of training deep feedforward neural networks” by Glorot and Bengio, 2010
• “Exact solutions to the nonlinear dynamics of learning in deep linear neural networks” by Saxe et al., 2013
• “Random walk initialization for training very deep feedforward networks” by Sussillo and Abbott, 2014
• “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification” by He et al., 2015
• “Data-dependent Initializations of Convolutional Neural Networks” by Krähenbühl et al., 2015
• “All you need is a good init” by Mishkin and Matas, 2015
A sketch of the Xavier/Glorot and He recipes follows.
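A hedged sketch of two recipes from this literature: Xavier/Glorot initialization (fan-in scaling 1/√n_in, suited to tanh) and the He et al. variant for ReLUs (√(2/n_in)). Exact formulas vary across the papers (e.g. fan-in only vs. fan-in plus fan-out); this uses the common fan-in form.

import numpy as np

rng = np.random.default_rng(0)

def xavier_init(n_in, n_out):
    # Glorot & Bengio (2010): scale 1/sqrt(n_in) keeps the activation
    # variance roughly constant from layer to layer for tanh units.
    return rng.standard_normal((n_in, n_out)) / np.sqrt(n_in)

def he_init(n_in, n_out):
    # He et al. (2015): sqrt(2/n_in) compensates for ReLU zeroing out
    # half of the pre-activations.
    return rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)

print(xavier_init(500, 500).std())  # ~1/sqrt(500) ≈ 0.045
print(he_init(500, 500).std())      # ~sqrt(2/500) ≈ 0.063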
Batch Normalization
Making the Activations Gaussian
Take a batch of activations at some layer. To make each dimension k look like a unit Gaussian, apply

x̂^(k) = (x^(k) − E[x^(k)]) / √(Var[x^(k)])

Notice that this function has a simple derivative, so gradients can flow through it.
In practice, we take the empirical mean and variance. For x over a mini-batch ℬ = {x₁, …, xₘ} (rows are samples, columns are features):

μ_ℬ = (1/m) Σᵢ xᵢ
σ²_ℬ = (1/m) Σᵢ (xᵢ − μ_ℬ)²
x̂ᵢ = (xᵢ − μ_ℬ) / √(σ²_ℬ + ε)

where the sums run over the m samples, ε is a small constant added for numerical stability, and μ_ℬ and σ²_ℬ are computed for each feature dimension independently.
Recovering the Identity Mapping
Normalize:

x̂^(k) = (x^(k) − E[x^(k)]) / √(Var[x^(k)])

and then allow the network to undo it, if “wished”, through the learnable parameters γ^(k) and β^(k):

y^(k) = γ^(k) x̂^(k) + β^(k)

Notice that the network can learn γ^(k) = √(Var[x^(k)]) and β^(k) = E[x^(k)], which recovers the identity mapping.
Obs.: recall that k refers to a (feature) dimension.
ConvNet Layer with BN
Batch normalization, y^(k) = γ^(k) x̂^(k) + β^(k), is inserted prior to the nonlinearity:

convolution → batch normalization → activation function → pooling
Batch Normalizing Transform
Input: values of x over a mini-batch ℬ = {x₁, …, xₘ}; parameters to be learned: γ, β.
Output: yᵢ = BN_{γ,β}(xᵢ)

μ_ℬ = (1/m) Σᵢ xᵢ // mini-batch mean
σ²_ℬ = (1/m) Σᵢ (xᵢ − μ_ℬ)² // mini-batch variance
x̂ᵢ = (xᵢ − μ_ℬ) / √(σ²_ℬ + ε) // normalize
yᵢ = γ x̂ᵢ + β ≡ BN_{γ,β}(xᵢ) // scale and shift

Ioffe, S. and Szegedy, C., “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, 2015, arXiv. Available online at https://arxiv.org/abs/1502.03167
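A minimal NumPy rendering of the transform above, with per-dimension statistics over a batch of shape (m, k); the parameter names follow the slide:

import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # Batch Normalizing Transform over a mini-batch x of shape (m, k).
    mu = x.mean(axis=0)                     # mini-batch mean, per dimension
    var = x.var(axis=0)                     # mini-batch variance, per dimension
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize
    return gamma * x_hat + beta             # scale and shift: BN_{γ,β}(x)

x = np.random.randn(64, 10) * 5.0 + 3.0     # a badly scaled batch
y = batchnorm_forward(x, gamma=np.ones(10), beta=np.zeros(10))
print(y.mean(axis=0).round(3))              # ~0 in every dimension
print(y.std(axis=0).round(3))               # ~1 in every dimension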
Batch normalization:
• Improves gradient flow through the network.
• Allows higher learning rates.
• Reduces the dependence on the initialization.
BN at test time
At test time, BN functions differently: the mean and std are not computed from the batch. Instead, fixed empirical values computed during training are used.
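A sketch of this train/test asymmetry, assuming the fixed statistics are tracked as running averages during training (a common implementation choice; the momentum value is illustrative):

import numpy as np

class BatchNorm:
    def __init__(self, k, eps=1e-5, momentum=0.9):
        self.gamma, self.beta = np.ones(k), np.zeros(k)
        self.run_mu, self.run_var = np.zeros(k), np.ones(k)
        self.eps, self.momentum = eps, momentum

    def forward(self, x, train=True):
        if train:
            mu, var = x.mean(axis=0), x.var(axis=0)
            # Accumulate the fixed empirical values used at test time.
            m = self.momentum
            self.run_mu = m * self.run_mu + (1 - m) * mu
            self.run_var = m * self.run_var + (1 - m) * var
        else:
            mu, var = self.run_mu, self.run_var   # no batch statistics
        return self.gamma * (x - mu) / np.sqrt(var + self.eps) + self.beta

bn = BatchNorm(10)
for _ in range(100):                                 # training accumulates stats
    bn.forward(np.random.randn(64, 10), train=True)
y = bn.forward(np.random.randn(1, 10), train=False)  # works even for one sample
print(y.shape)  # (1, 10)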
Summary
• Activation Functions (use ReLU or Leaky ReLU)
• Data Preprocessing (subtract the mean)
• Weight Initialization (use Xavier init)
• Batch Normalization (use it)
Training Convolutional Neural Networks
Thank you!
R. Q. FEITOSA