
A Brief Introduction to Deep Learning and its Application to Vision Recognition

Shangwen Li
Advisor: C.-C. Jay Kuo

Outline

CONCEPTS

TECHNIQUES

CONCLUSION

My Work in CSCI 686

What is Deep Learning?

3 Key Ideas

• Hierarchical Representation
  • The structure of the system naturally matches the problem, which is inherently hierarchical.
  • Called "Deep Learning" because the hierarchy is "DEEEEEEP"!
• Feature Learning
  • Features are learned from data rather than hand-crafted.
• End-to-End Learning
  • Features and classifiers are learned jointly from data.

Hierarchical Representation

Traditional Object Classification ("they call it shallow learning"):
Pixels → SIFT/HOG → K-Means → Classifier → "Car"

Deep Learning Based: recognizing the object hierarchically
Pixels → Edge → Texture → Pattern → Part → Object → "Car"

Feature Learning

[Figure: features extracted hierarchically, layer by layer: input image → 1st Layer → 2nd Layer → 3rd Layer → Classifier → "Car"]

• Features are extracted hierarchically (i.e., layer by layer).
• Low-level features are shared among categories.
• High-level features are more global and more invariant.

Zeiler, Matthew D., and Rob Fergus. "Visualizing and Understanding Convolutional Networks." arXiv preprint arXiv:1311.2901 (2013).

End-to-End Learning

[Figure: input image → 1st Layer → 2nd Layer → 3rd Layer → Classifier → "Car"]

• One end is the image; the other end is the class label.
• The layers and the classifier form one connected network and are trained jointly.
• Each layer can be seen as a non-linear transformation of its input.
• Goal: learn a non-linear mapping function from image to label.

Why use Deep Learning?

Theoretically (without proof!!)

• Simplest answer: more efficient.
• Simpler answer: more efficient for representing complicated mapping functions.
• More technically: trade breadth for depth.

Practically

• Learns good features.

"Shallow" Theory of Deep Learning

• Let us look at a common non-linear mapping: the kernel machine.
• A kernel machine can be considered a two-layer non-linear mapping.
• What does deep learning try to learn?
  • A hierarchy of non-linear mappings (K layers).
• Intuitively, deep architectures are more efficient at representing complex functions.

(No solid proof!)

"If we only study models for which we can prove things, we wouldn't have speech, handwriting, and visual object recognition systems today."
— Yann LeCun
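In symbols (a sketch in my own notation, not taken from the slides), the contrast between the two-layer kernel machine and a K-layer deep mapping is:

```latex
% Kernel machine: a fixed non-linear first layer (the kernel evaluations),
% followed by a learned linear combination
f(x) = \sum_{i=1}^{m} \alpha_i \, K(x, x_i) + b

% Deep architecture: a learned hierarchy of K non-linear mappings
f(x) = f_K\!\left( f_{K-1}\!\left( \cdots f_1(x; W_1) \cdots ; W_{K-1} \right); W_K \right)
```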

Trade Breadth for Depth

• "So-called" logic circuit example: computing N-bit parity.

Use multiple layers:
• N-1 XOR gates in a tree of depth log(N).
[Figure: a binary tree of XOR gates]

Use two layers??
• Decompose the function into "AND" and "OR" gates.
• It has been proved that O(exp(N)) gate elements are needed to achieve this.
• Shorter but wider.
[Figure: a two-layer circuit of AND gates feeding into OR gates]

Bengio, Yoshua. "Learning Deep Architectures for AI." Foundations and Trends in Machine Learning 2.1 (2009): 1-127.
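To make the depth-versus-breadth trade concrete, here is a minimal sketch (mine, not from the slides) of the multi-layer option: N-bit parity computed with N-1 XOR gates arranged in a tree of depth log2(N). The two-layer AND/OR alternative would need exponentially many gates.

```python
def parity_xor_tree(bits):
    """Compute the parity of a list of 0/1 bits with a tree of pairwise XOR gates."""
    layer = list(bits)
    depth = 0
    while len(layer) > 1:
        # Each layer roughly halves the number of wires: pairwise XOR.
        layer = [layer[i] ^ layer[i + 1] for i in range(0, len(layer) - 1, 2)] + \
                (layer[-1:] if len(layer) % 2 else [])
        depth += 1
    return layer[0], depth

if __name__ == "__main__":
    bits = [1, 0, 1, 1, 0, 0, 1, 0]    # N = 8, four ones -> parity 0
    p, depth = parity_xor_tree(bits)
    print(p, depth)                     # 0, depth = log2(8) = 3
```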

What are Good Features?

• The Manifold Hypothesis:
  • Natural data lives on a low-dimensional (non-linear) manifold,
  • because the variables in natural data are mutually dependent.

Source: LeCun, Yann. "Deep Learning Tutorial." ICML, Atlanta, 2013-06-16.

What are Good Features?

• Example: all face images of a person
  • 1000x1000 pixels = 1,000,000 dimensions
  • But the face has 3 Cartesian coordinates and 3 Euler angles,
  • and humans have fewer than about 50 muscles in the face.
  • Hence the manifold of face images for a person has fewer than about 56 dimensions.
• The perfect representation of a face image:
  • its coordinates on the face manifold,
  • its coordinates away from the manifold.
• There is no general method to learn this kind of representation.

Source: LeCun, Yann. "Deep Learning Tutorial." ICML, Atlanta, 2013-06-16.

What are Good Features?

• The ideal: a disentangling feature extractor.
• Deep learning aims at learning this kind of good feature extractor.

Source: LeCun, Yann. "Deep Learning Tutorial." ICML, Atlanta, 2013-06-16.

Current Status

• Despite lacking a theoretical foundation, it works much better than traditional algorithms in vision recognition.
• All participants in ILSVRC (the Large Scale Visual Recognition Challenge hosted by ImageNet) adopt deep architectures in their frameworks.

Interesting demo: http://www.clarifai.com/

clarifai Examples

[Screenshots of the Clarifai demo (http://www.clarifai.com/) tagging example images, including one labeled "Yann LeCun"]


Aside: “Avengers” in Machine Learning

• SVM
• Deep Belief Net
• Convolutional Neural Network
• Deep Auto Encoder
• Deconvolutional Network

Space of Machine Learning Algorithms

[Figure: a map of machine learning algorithms laid out along two axes, Deep vs. Shallow and Supervised vs. Unsupervised, with a region marked Probabilistic. Algorithms shown include: RNN, CNN, RBM, GMM, Sparse Coding, Auto Encoder, SVM, Boosting, Decision Tree, Perceptron, DBN, Deep Auto Encoder, Neural Network, Sum-Product Network.]

Source: Ranzato, Marc'Aurelio. "Deep Learning for Object Category Recognition." Guest Lecture, Stanford, 11 February 2014.

Neural Network

Basic Neuron
• Connection weight vector W, activation function f, activation a = h_{W,b}(x).

Simplest Neural Network (Shallow)
• Only one hidden layer.
• In general, the connection weights between layers form a matrix.

http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial
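As a minimal sketch (my own, following the UFLDL convention a = h_{W,b}(x) = f(Wx + b) with a sigmoid f), a single neuron and a one-hidden-layer forward pass look like:

```python
import numpy as np

def sigmoid(z):
    # Logistic activation function f(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    # A single neuron: activation a = h_{W,b}(x) = f(w^T x + b)
    return sigmoid(np.dot(w, x) + b)

def shallow_net(x, W1, b1, W2, b2):
    # One hidden layer: the connection weights between layers are matrices.
    h = sigmoid(W1 @ x + b1)      # hidden activations
    y = sigmoid(W2 @ h + b2)      # output activations
    return y

# Tiny usage example with random weights (3 inputs, 4 hidden units, 2 outputs).
rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
print(shallow_net(x, W1, b1, W2, b2))
```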

Multilayer Neural Network

• When we stack several hidden layers together, we obtain a deep architecture.
• Goal: train the network, i.e., obtain the connection weights that minimize a certain cost function.

[Figure: a neural network with two hidden layers]

http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial

Neural Network Training

• Optimize the connection weights with respect to a certain cost function (e.g., the MSE cost).
• What is the first optimization algorithm that comes to mind?
  • Gradient descent.
• Is it obvious how to calculate the derivative of the cost function?
  • No: the final output is a hierarchy of non-linear mappings, i.e., a complicated composite function.
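The "MSE Cost" on the original slide was an equation image; in the UFLDL notation used elsewhere in this talk, it is presumably along the lines of the following, together with the gradient-descent update:

```latex
J(W,b) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} \left\lVert h_{W,b}\!\left(x^{(i)}\right) - y^{(i)} \right\rVert^{2},
\qquad
W \leftarrow W - \alpha \, \frac{\partial J(W,b)}{\partial W}
```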

Calculate Gradient

Back Propagation = Use the Chain Rule

[Figure: a three-layer network x —W1→ h1 —W2→ h2 —W3→ h3; the network's prediction is compared with the class label ("Car") to compute the loss.]

• The loss L(W) can be:
  • MSE for a regression problem;
  • cross-entropy for a classification problem.
• Assume we can calculate the gradient of the loss with respect to the output h3.
• The chain rule then propagates this gradient backward, layer by layer, to h2, h1, and the weights W3, W2, W1 (the detailed equations appeared as figures on the original slides).

Source: Ranzato, Marc'Aurelio. "Deep Learning for Object Category Recognition." Guest Lecture, Stanford, 11 February 2014.

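A minimal numpy sketch (my own, matching the x → h1 → h2 → h3 picture above) of back propagation as repeated application of the chain rule, for an MSE loss and sigmoid layers:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights):
    # Forward pass: h_k = f(W_k h_{k-1}); keep every activation for the backward pass.
    hs = [x]
    for W in weights:
        hs.append(sigmoid(W @ hs[-1]))
    return hs

def backward(hs, y, weights):
    # MSE loss L = 0.5 * ||h3 - y||^2, so dL/dh3 = h3 - y.
    grads = []
    delta = (hs[-1] - y) * hs[-1] * (1 - hs[-1])   # chain rule through the last sigmoid
    for k in range(len(weights) - 1, -1, -1):
        grads.insert(0, np.outer(delta, hs[k]))    # dL/dW_k = delta * h_{k-1}^T
        if k > 0:
            # Chain rule: push the gradient back through W_k and the previous sigmoid.
            delta = (weights[k].T @ delta) * hs[k] * (1 - hs[k])
    return grads

# Tiny usage example: one gradient-descent step on a 3-layer net.
rng = np.random.default_rng(0)
x, y = rng.normal(size=4), np.array([0.0, 1.0])
weights = [rng.normal(size=(5, 4)), rng.normal(size=(5, 5)), rng.normal(size=(2, 5))]
hs = forward(x, weights)
grads = backward(hs, y, weights)
weights = [W - 0.1 * G for W, G in zip(weights, grads)]
```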

Auto Encoder

• Essentially a neural network.
• The network aims at reconstructing its own input with minimum error.
• If we force the hidden layer to have fewer units than the input layer, what does this remind you of?
  • Compression!!! This is why it is called an "encoder".

http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial

Sparse Auto Encoder

• Even if we do not force the hidden layer to be smaller than the input layer, we can still recover meaningful structure in the data.
• Enforce sparse outputs of the hidden-layer units:
  • constrain the average activation of each hidden unit to be small, e.g., approximately 0.05.

http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial
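A rough sketch (my own, following the UFLDL tutorial's formulation) of the sparsity idea: the average activation rho_hat of each hidden unit is pushed toward a small target rho (e.g., 0.05) by adding a KL-divergence penalty to the reconstruction cost:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sparse_autoencoder_cost(X, W1, b1, W2, b2, rho=0.05, beta=3.0, lam=1e-4):
    """X: (n_features, m) batch of inputs; returns the scalar cost."""
    m = X.shape[1]
    H = sigmoid(W1 @ X + b1[:, None])           # hidden activations
    X_hat = sigmoid(W2 @ H + b2[:, None])       # reconstruction of the input
    recon = 0.5 / m * np.sum((X_hat - X) ** 2)  # reconstruction (MSE) term
    rho_hat = H.mean(axis=1)                    # average activation of each hidden unit
    kl = np.sum(rho * np.log(rho / rho_hat) +
                (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))  # sparsity penalty
    weight_decay = 0.5 * lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
    return recon + beta * kl + weight_decay
```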

Deep Sparse Auto Encoder

• We can stack several such layers together to train a deep sparse auto encoder.
• A layer-wise training method can be used to train it.

http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial

Convolutional Neural Network

• Example: a 200x200 image
  • Fully connected, 400,000 hidden units: 16 billion parameters.
  • Locally connected, 400,000 hidden units with 10x10 receptive fields: 40 million parameters.
  • Local connections capture local dependencies.

Source: LeCun, Yann. "Deep Learning Tutorial." ICML, Atlanta, 2013-06-16.
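The parameter counts follow directly from the connectivity (a quick check, assuming no weight sharing yet):

```python
pixels = 200 * 200                     # 40,000 input pixels
hidden = 400_000

fully_connected = pixels * hidden      # every hidden unit sees every pixel
locally_connected = hidden * 10 * 10   # every hidden unit sees only a 10x10 patch

print(f"{fully_connected:,}")          # 16,000,000,000  (16 billion)
print(f"{locally_connected:,}")        # 40,000,000      (40 million)
```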

Convolutional Neural Network

• Stationarity: statistics are similar at different locations.
  • A feature that is useful in one part of the image is probably useful elsewhere.
• All units share the same set of weights.
• Shift-equivariant processing: when the input shifts, the output also shifts but stays otherwise unchanged.
• The filtered "image" is called a feature map.

Source: LeCun, Yann. "Deep Learning Tutorial." ICML, Atlanta, 2013-06-16.
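A minimal sketch (my own) of weight sharing: one small filter slid over the whole image produces one feature map:

```python
import numpy as np

def feature_map(image, kernel):
    """Valid 2D convolution (strictly, cross-correlation, as in most CNN code):
    the same kernel weights are shared across every image location."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.default_rng(0).normal(size=(200, 200))
kernel = np.random.default_rng(1).normal(size=(10, 10))   # only 100 shared weights
fmap = feature_map(image, kernel)
print(fmap.shape)   # (191, 191): one feature map
```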

Convolutional Neural Network

• Detect multiple features at each location.
  • The collection of units looking at the same patch is akin to a feature vector for that patch.
  • The result is a 3D array, where each slice is a feature map.

Source: LeCun, Yann. "Deep Learning Tutorial." ICML, Atlanta, 2013-06-16.

Convolutional Neural Network - Pooling

• Let us assume the filter is an "eye" detector.
• How can we make the detection robust to the exact location of the eye?
• By "pooling" (e.g., taking the max of) filter responses at different locations, we gain robustness to the exact spatial location of features.

Source: LeCun, Yann. "Deep Learning Tutorial." ICML, Atlanta, 2013-06-16.
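A small sketch (mine) of non-overlapping 2x2 max pooling over a feature map:

```python
import numpy as np

def max_pool(fmap, size=2):
    """Non-overlapping max pooling: keep only the strongest response in each
    size x size block, discarding its exact position within the block."""
    H, W = fmap.shape
    H, W = H - H % size, W - W % size          # crop so the blocks tile evenly
    blocks = fmap[:H, :W].reshape(H // size, size, W // size, size)
    return blocks.max(axis=(1, 3))

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool(fmap))
# [[ 5.  7.]
#  [13. 15.]]
```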

Deep Convolutional Neural Network

[Figure: the architecture of a deep convolutional neural network]

WINNER of ILSVRC2010

Source: http://deeplearning.net/ & Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks." NIPS 2012.

Three Types of Training Framework

• Purely supervised
  • Initialize parameters randomly.
  • Train in supervised mode with stochastic gradient descent.
  • Good when there is lots of labeled data.

• Layer-wise unsupervised + supervised classifier
  • Train each layer unsupervised, in sequence.
  • Hold the feature extractor fixed and train a linear classifier on the features.
  • Good when labeled data is scarce but there is lots of unlabeled data.

• Layer-wise unsupervised + supervised ALL
  • Layer-wise unsupervised + supervised classifier, as above.
  • Then retrain the whole framework in supervised mode.
  • Better performance.

Source: LeCun, Yann. "Deep Learning Tutorial." ICML, Atlanta, 2013-06-16.

WARNING: A Bunch of Training Tricks

• Use ReLU non-linearities (tanh and logistic are falling out of favor).
• Use the cross-entropy loss for classification.
• Use stochastic gradient descent on mini-batches.
• Shuffle the training samples.
• Normalize the input variables (zero mean, unit variance).
• Schedule a decrease of the learning rate.
• Use a bit of L1 or L2 regularization on the weights (or a combination),
  but it is best to turn it on after a couple of epochs.
• Use "dropout" for regularization (Hinton et al. 2012, http://arxiv.org/abs/1207.0580).
• Lots more in LeCun et al., "Efficient BackProp," 1998.
• Lots, lots more in "Neural Networks: Tricks of the Trade" (2012 edition), edited by G. Montavon, G. B. Orr, and K.-R. Müller (Springer).

Source: LeCun, Yann. "Deep Learning Tutorial." ICML, Atlanta, 2013-06-16.
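A compact sketch (my own, in plain numpy rather than any particular framework) of how several of these tricks fit into one training loop: input normalization, shuffled mini-batches, a learning-rate schedule, ReLU, softmax cross-entropy, and L2 weight decay:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def train(X, Y, hidden=64, epochs=20, batch=32, lr0=0.1, l2=1e-4, seed=0):
    """X: (n_samples, n_features); Y: one-hot labels, (n_samples, n_classes)."""
    rng = np.random.default_rng(seed)
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)        # zero mean, unit variance
    W1 = rng.normal(scale=0.01, size=(X.shape[1], hidden))
    W2 = rng.normal(scale=0.01, size=(hidden, Y.shape[1]))
    for epoch in range(epochs):
        lr = lr0 / (1.0 + 0.1 * epoch)                        # decreasing learning-rate schedule
        for idx in np.array_split(rng.permutation(len(X)), max(1, len(X) // batch)):
            xb, yb = X[idx], Y[idx]                           # shuffled mini-batch
            h = relu(xb @ W1)
            logits = h @ W2
            p = np.exp(logits - logits.max(axis=1, keepdims=True))
            p /= p.sum(axis=1, keepdims=True)                 # softmax probabilities
            # For softmax + cross-entropy, the gradient w.r.t. the logits is (p - y).
            g2 = h.T @ (p - yb) / len(idx)
            g1 = xb.T @ (((p - yb) @ W2.T) * (h > 0)) / len(idx)
            W2 -= lr * (g2 + l2 * W2)                         # SGD step with L2 weight decay
            W1 -= lr * (g1 + l2 * W1)
    return W1, W2
```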

WARNING: Non-Convex Optimization

Deep learning optimization is non-convex: like walking on a ridge between valleys.

Source: Ranzato, Marc'Aurelio. "Deep Learning for Object Category Recognition." Guest Lecture, Stanford, 11 February 2014.


Introduction

Object classification is hard!!!

67% on Caltech-101 vs. 36% on Caltech-256

• Performance decreases dramatically as the number of classes increases.
• More training data per class helps to increase performance.

Griffin, Gregory, Alex Holub, and Pietro Perona. "Caltech-256 Object Category Dataset." (2007).

Related Work

ILSVRC2010 (Large Scale Visual Recognition Challenge 2010)

WINNER: a research group from the University of Toronto.

PERFORMANCE: 62.5% recognition accuracy in the classification task over 1000 categories.

SECRET:
• Takes advantage of a deep convolutional neural network (CNN).
• The non-linearity of each activation function within the network.
• The massive amount of labeled data provided by ImageNet.

PROBLEM!!!
• Such a massive amount of labeled data seldom exists in the real world!

Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks." NIPS 2012.

Dataset Problem

• Consider a situation where we want to train a classifier to distinguish between images of cats and dogs.

[Figures: what the training dataset looks like vs. what real-world data look like]

Dataset Problem

• To create a training dataset:
  • retrieve images that contain only dogs and cats;
  • manually label each image with "cat" or "dog".
• This is time consuming and laborious.
• Labeled training data are a limited but valuable resource.
• Can we learn something from unlabeled data and thus boost our performance on object classification?
  • This is called unsupervised feature learning.

Dataset

STL-10 dataset
• Contains 10 classes of objects: airplane, bird, car, cat, deer, dog, horse, monkey, ship, truck.
• Also contains a large number of unlabeled images drawn from a wide range of other classes.

ILSVRC2010 dataset
• Focuses on classifying very fine-grained classes located at the leaf nodes of the concept tree.
• Its huge size (150 GB) goes beyond our handling capability.

Methodology

Supervised Training
• A convolutional neural network trained on the labeled training dataset (500 images per class).
• The network is trained jointly using back propagation.

Unsupervised Training
• Convolutional filters trained on 100,000 image patches sampled from the unlabeled data.
• The multinomial logistic regression is still trained on the labeled dataset.

Comparison of classification accuracy and training time. Theoretically, unsupervised training has two advantages:
• It utilizes more unlabeled data and can thus extract more representative features: better classification.
• It avoids back propagation through the convolutional layers, so the filter kernels are more efficient to train: faster training.

Supervised Training CNN Framework

• The whole network (convolution filters and the multinomial logistic model) is trained jointly using back propagation.
• Slow, due to the convolution operations during back propagation.

[Figure: Convolution Layer → Pooling Layer → Stacked Feature Vector → Multinomial Logistic Regression → "Cat". The convolution filters and the multinomial logistic regression weights are trained jointly.]

Unsupervised Feature Learning Framework

• First train a sparse auto encoder; the trained filter kernels are then used in the following stages.

[Figure: randomly sampled image patches → hidden layer → reconstructed image patches; the learned weights are reused as convolution filters.]

• No back propagation is needed through the convolutional layers; only the multinomial logistic regression weights at the end need to be trained (output: "Cat").

Preliminary Experiment Results

• 2000 training images and 3200 test images from 4 classes in the dataset.
• 100,000 image patches randomly sampled from the unlabeled data.

Method        Classification Accuracy   Training Time
Supervised    65.156%                   27,964 s (~8 h)
Unsupervised  80.406%                   5,023 s (~1.5 h)

[Figures: trained features and convolved feature maps]

Mini Conclusion for My Work

• The current good performance of deep learning methods in object classification may be partly due to the large amount of training data.
• Unsupervised feature learning is promising for boosting object classification, in terms of both classification accuracy and training speed.


Conclusion

Conceptually
• What is deep learning?
  • Learning features in a deep hierarchy.
• Why does deep learning seem good?
  • A cascade of non-linear transformations is more efficient.
• Current status
  • It lacks a theoretical foundation and requires tricky training techniques.

Technically, I hope you remember:
• Back propagation = using the chain rule to calculate the gradient.
• What an auto encoder and a convolutional neural network are.

Thanks! Questions?