TRANSCRIPT
Introduction to Deep Learning: Convolutional Neural Networks (2)
Prof. Songhwai Oh, ECE, SNU
Prof. Songhwai Oh (ECE, SNU) Introduction to Deep Learning 1
STOCHASTIC GRADIENT DESCENT
Gradient Descent
Batch gradient descent (steepest descent): updates using the gradient over the entire data set,
θ ← θ − η ∇(1/N) Σᵢ Lᵢ(θ)
Stochastic gradient descent: processes one data point at a time,
θ ← θ − η ∇Lᵢ(θ)
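As an illustrative sketch (not from the slides), the two variants can be written in NumPy for a toy least-squares problem; the data, learning rates, and iteration counts below are assumptions chosen for the example:

```python
import numpy as np

# Toy least-squares problem: minimize (1/N) * sum_i (x_i . w - y_i)^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

def batch_gd(X, y, lr=0.1, steps=200):
    """Batch (steepest) descent: each step uses the gradient over ALL data."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def sgd(X, y, lr=0.01, epochs=50):
    """Stochastic gradient descent: one update per data point."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            grad = 2.0 * X[i] * (X[i] @ w - y[i])
            w -= lr * grad
    return w
```

Both recover the true weights here; SGD trades noisier steps for a much cheaper per-update cost on large data sets.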
Gradient Descent
Sources: https://hackernoon.com/dl03-gradient-descent-719aff91c7d6, https://alykhantejani.github.io/images/gradient_descent_line_graph.gif
Loss Functions
• y: target value
• ŷ: network output

• Cross entropy loss (or log loss) [y, ŷ: distributions]
  −Σᵢ yᵢ log ŷᵢ
• Hinge loss [y ∈ {−1, +1}, for classification]
  max(0, 1 − y·ŷ)
• Huber loss [for regression]
  ½(y − ŷ)²  if |y − ŷ| ≤ δ
  δ|y − ŷ| − ½δ²  otherwise
• Kullback-Leibler (KL) divergence [y, ŷ: distributions]
  Σᵢ yᵢ log(yᵢ/ŷᵢ)
• Mean absolute error (MAE) (or L1 loss)
  (1/n) Σᵢ |yᵢ − ŷᵢ|
• Mean squared error (MSE) (or L2 loss)
  (1/n) Σᵢ (yᵢ − ŷᵢ)²
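For concreteness, these losses can be sketched in NumPy; this is a minimal illustration (not from the slides), assuming `y` is the target and `y_hat` the network output (distributions in the cross-entropy and KL cases):

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    """Cross entropy (log loss); y and y_hat are distributions."""
    return -np.sum(y * np.log(y_hat + eps))

def hinge(y, y_hat):
    """Hinge loss for classification with y in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * y_hat)

def huber(y, y_hat, delta=1.0):
    """Huber loss: quadratic for small errors, linear for large ones."""
    e = np.abs(y - y_hat)
    return np.where(e <= delta, 0.5 * e**2, delta * e - 0.5 * delta**2)

def kl_divergence(y, y_hat, eps=1e-12):
    """KL divergence between distributions y and y_hat."""
    return np.sum(y * np.log((y + eps) / (y_hat + eps)))

def mae(y, y_hat):
    """Mean absolute error (L1 loss)."""
    return np.mean(np.abs(y - y_hat))

def mse(y, y_hat):
    """Mean squared error (L2 loss)."""
    return np.mean((y - y_hat) ** 2)
```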
Momentum
• Stochastic gradient descent can be slow on a plateau
• Momentum method (Polyak, 1964)
  – Accelerates gradient descent using the momentum (speed) of previous gradients

v ← αv − η∇L(θ)
θ ← θ + v

• v: velocity
• θ: parameter
• α: decay rate (α < 1)
• η: learning rate
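The update can be sketched as a single step function (an illustrative sketch; the quadratic objective in the test usage and the default α = 0.9, η = 0.01 are assumptions):

```python
def momentum_step(theta, v, grad, alpha=0.9, eta=0.01):
    """One momentum update: accumulate a velocity from past gradients,
    then move the parameter along the velocity."""
    v = alpha * v - eta * grad   # v: velocity, alpha: decay rate, eta: learning rate
    theta = theta + v
    return theta, v
```

Because the velocity averages past gradients, consistent directions accumulate speed, which is what helps on a plateau.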
RMSprop
• RMSProp: Root Mean Square Propagation

r ← ρr + (1 − ρ)(∇L(θ))²
θ ← θ − η ∇L(θ)/√r

• r: moving average of magnitudes of previous gradients
• θ: parameter
• ρ: forgetting factor (0 < ρ < 1)
• η: learning rate
• Squaring and square-rooting are elementwise
• Works well in online and non-stationary settings
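A minimal NumPy sketch of one RMSprop step (illustrative; the ε added for numerical stability and the defaults ρ = 0.9, η = 0.001 are common conventions, not from the slides):

```python
import numpy as np

def rmsprop_step(theta, r, grad, rho=0.9, eta=0.001, eps=1e-8):
    """One RMSprop update: scale the step by a running RMS of past gradients."""
    r = rho * r + (1.0 - rho) * grad**2          # elementwise squared-gradient average
    theta = theta - eta * grad / (np.sqrt(r) + eps)
    return theta, r
```

Dividing by the RMS makes the effective step size roughly η regardless of the raw gradient magnitude, which is why it adapts well to non-stationary objectives.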
ADAM
• ADAM: Adaptive Moment Estimation

m ← β₁m + (1 − β₁)∇L(θ)
v ← β₂v + (1 − β₂)(∇L(θ))²
m̂ = m/(1 − β₁ᵗ),  v̂ = v/(1 − β₂ᵗ)
θ ← θ − η m̂/(√v̂ + ε)

• m: moving average of previous gradients
• v: moving average of magnitudes of previous gradients
• θ: parameter
• β₁, β₂: forgetting factors (β₁, β₂ ∈ [0, 1))
• η: learning rate
• ε: a small number
• Squaring and square-rooting are elementwise
* Combines AdaGrad and RMSprop
Initialization bias correction
Diederik P. Kingma and Jimmy Ba, "Adam: A Method for Stochastic Optimization," ICLR, 2015.
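The full update can be sketched as one NumPy step function (illustrative; the defaults η = 0.001, β₁ = 0.9, β₂ = 0.999, ε = 1e-8 are the paper's recommended values, and the quadratic in the test usage is an assumption):

```python
import numpy as np

def adam_step(theta, m, v, grad, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update; t is the step count, starting at 1."""
    m = beta1 * m + (1.0 - beta1) * grad           # moving average of gradients
    v = beta2 * v + (1.0 - beta2) * grad**2        # moving average of squared gradients
    m_hat = m / (1.0 - beta1**t)                   # initialization bias correction
    v_hat = v / (1.0 - beta2**t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

The bias correction matters early on: since m and v start at zero, dividing by (1 − βᵗ) undoes their initial underestimate.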
Valley
Saddle Point
DROPOUT, BATCH NORMALIZATION
Dropout
• A regularization method to address the overfitting problem
• Randomly drop units during training
• Each unit is retained with probability p (a hyperparameter), independent of the other units
Srivastava, Hinton, Krizhevsky, Sutskever, Salakhutdinov. "Dropout: a simple way to prevent neural networks from overfitting." Journal of Machine Learning Research 15, no. 1 (2014): 1929‐1958.
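A minimal sketch of a dropout layer. Note one assumption: the paper scales the weights by p at test time, whereas the common equivalent "inverted dropout" variant shown here rescales by 1/p during training so the test-time layer is the identity:

```python
import numpy as np

def dropout(h, p, training=True, rng=None):
    """Inverted dropout: keep each unit independently with probability p and
    rescale by 1/p during training, so expected activations are unchanged;
    at test time the layer is the identity."""
    if not training:
        return h
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(h.shape) < p   # True = unit retained
    return h * mask / p
```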
Batch Normalization
• Reduces internal covariate shift (changes in input distribution)
• Faster convergence (requires only 7% of the training steps)
• Applied before a nonlinear activation unit by normalizing each dimension
• During testing, fixed mean and variance (estimated during training) are used
Ioffe, Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” ICML, 2015.
Layer ordering: CONV → Batch Normalization → ReLU
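The training-time normalization can be sketched as follows (illustrative; gamma and beta are the learned scale and shift, and the running mean/variance tracking used for test time is omitted for brevity):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Training-time batch norm for a (batch, features) array: normalize each
    feature over the batch, then scale and shift with learned gamma and beta."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```

With gamma = 1 and beta = 0 each feature comes out with (approximately) zero mean and unit variance over the batch.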
Accelerating BN Networks
• Increase the learning rate (×5, ×30)
• Remove dropout
  – BN can address the overfitting problem similarly to dropout
• Reduce the weight of L2 regularization (×5)
• Accelerate the learning rate decay (×6)
• Remove local response normalization
• Shuffle training examples more thoroughly
• Reduce the photometric distortions
Ioffe, Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” ICML, 2015.
POPULAR CNN ARCHITECTURES
AlexNet (2012)
5 convolutional layers, 3 fully connected layers
Key ideas:
• Rectified Linear Unit (ReLU): an activation function
• GPU implementation (2 GPUs)
• Local response normalization, overlapping pooling
• Data augmentation, dropout (p = 0.5)
• 62M parameters
Learned 11x11x3 filters
VGGNet (2015)
• Increasing depth with 3x3 convolution filters (with stride 1)
  – More discriminative (compared to larger receptive fields)
  – Fewer parameters
• 1x1 convolution filters
• Dropout (p = 0.5)
• VGG16, VGG19
  – 138M parameters for VGG16
VGG16
Source: https://www.cs.toronto.edu/~frossard/post/vgg16/
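A quick arithmetic check of the parameter-saving claim: two stacked 3x3 convolutions cover the same 5x5 receptive field as one 5x5 convolution but with fewer weights (assuming C input and output channels throughout; biases ignored):

```python
def conv_params(k, c_in, c_out):
    """Number of weights in a k x k convolution layer (bias terms ignored)."""
    return k * k * c_in * c_out

C = 64                                   # assumed channel count
stacked_3x3 = 2 * conv_params(3, C, C)   # two 3x3 layers: 5x5 receptive field
single_5x5 = conv_params(5, C, C)        # one 5x5 layer: same receptive field
# 18*C**2 = 73728 weights vs 25*C**2 = 102400 weights
```

The stack is also more discriminative because a nonlinearity sits between the two 3x3 layers.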
ResNet (2016)
• Degradation problem with deep networks
  – Accuracy saturates as the depth increases
• Solution
  – Instead of learning a mapping H(x) directly, learn the residual function F(x) = H(x) − x
  – E.g., learning an identity function then only requires driving F(x) to zero
• Shortcut connections add x back, so the block outputs F(x) + x
Residual Learning
ResNet-18, ResNet-34, ResNet-50, ResNet-101 (40M), ResNet-152, ResNet-1202
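A toy fully-connected sketch of a residual block (illustrative; real ResNet blocks use convolutions and batch normalization, and the names W1, W2 are hypothetical):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    """Toy residual block: the weight layers model the residual F(x),
    and the shortcut connection adds x back, giving F(x) + x."""
    out = relu(x @ W1)        # first weight layer + nonlinearity
    out = out @ W2            # second weight layer: this is F(x)
    return relu(out + x)      # shortcut connection
```

With all-zero weights the block reduces to the identity (for nonnegative inputs), which illustrates why identity mappings are easy to learn in residual form.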
Other Architectures
GoogLeNet (2015, 22 layers, 5M, Inception)
• Inception: good representation with lower complexity

DenseNet (2017, 264 layers, 70M)

ResNeXt
• Multiple paths with the same topology

WRN: Wide Residual Networks
• Increases width, not depth, for computational efficiency
ImageNet Challenge
K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," CVPR, 2016.
Comparison
Canziani, Paszke, Culurciello, "An analysis of deep neural network models for practical applications," arXiv preprint arXiv:1605.07678, 2016.
Inception‐v4 = ResNet + Inception
Wrap Up
• Stochastic gradient descent
  – Momentum
  – RMSprop
  – ADAM
• Dropout
• Batch normalization
• CNN architectures
  – AlexNet
  – VGGNet
  – ResNet