TRANSCRIPT
Transfer Learning
&
Style Transfer in Deep Learning
4-DEC-2016
Gal Barzilai, Ram Machlev
Deep Learning Seminar
School of Electrical Engineering – Tel Aviv University
Part 1: Transfer Learning in Deep Learning
Based on: "DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition" (Donahue, Jia, et al.), 6-OCT-2013, 976 citations.
Yangqing Jia, author of Caffe and DeCAF.
One of the main problems in deep learning approaches:
− with limited training data, fully-supervised deep architectures generally overfit
− many visual recognition challenges have tasks with only a few training examples
Transfer Learning Concept
[Diagram: Task A is trained on Input A (for example: cars), Task B on Input B (for example: trucks). In the transfer network AnB, the first n layers are frozen to the weights learned for Task A, and the remaining layers are retrained for Task B by back-propagation.]
The concept: learning the features on large-scale data in a supervised setting, then transferring them to different tasks with different labels.
[Diagram: Task A and Task B are each trained on disjoint sets of 500 classes from ImageNet. In the transfer network A3B, the first 3 layers are frozen to the weights learned for Task A, and the remaining layers are newly learned for Task B by back-propagation.]
Accuracy experiments in: Yosinski et al., 2014 ("How transferable are features in deep neural networks?").
DeCAF Approach
Deep convolutional representations are learned on a set of related problems but applied to new tasks which have too few training examples to learn a full deep representation.
The model can be considered as:
a deep architecture for transfer learning based on a supervised pre-training phase,
or simply as:
convolutional network weights learned on a set of pre-defined object recognition tasks.
Adopted Network
Deep CNN architecture proposed by Krizhevsky et al. [Krizhevsky, NIPS 2012]:
− 5 convolutional layers (with pooling and ReLU)
− 3 fully-connected layers
− won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012, with ImageNet (10,000,000 labeled images depicting 10,000+ object categories) as training data
− top-1 validation error rate of 40.7%
The authors follow this architecture and training protocol with two differences:
− input 256 x 256 images rather than 224 x 224 images
− no data augmentation trick
DeCAFn: the activations of the nth hidden layer of the deep convolutional neural network, used as a feature vector.
[Diagram: the successive layers of the network yield features DeCAF1 through DeCAF7.]
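As a rough illustration (not the original decaf code; this sketch assumes PyTorch/torchvision, with AlexNet standing in for Krizhevsky's net), the activations of an intermediate layer can be read out with a forward hook:

import torch
import torchvision.models as models

net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()

activations = {}
def save_activation(module, inputs, output):
    activations["decaf_n"] = output.detach()

# Treat the output of the last convolutional block as a DeCAF5-like feature.
net.features[-1].register_forward_hook(save_activation)
with torch.no_grad():
    net(torch.rand(1, 3, 224, 224))              # dummy image batch
features = activations["decaf_n"].flatten(1)     # one feature vector per image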
Feature Generalization and Visualization
The features are visualized in the following way:
• run the t-SNE algorithm to obtain a 2-dimensional embedding of the high-dimensional feature space
• plot the features as points, colored depending on their semantic category
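A minimal sketch of this visualization step with scikit-learn and matplotlib (features and labels are assumed to be given, e.g. DeCAF activations and their category indices):

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels):
    # features: (N, D) array; labels: (N,) integer category ids
    xy = TSNE(n_components=2).fit_transform(features)  # 2-D embedding
    plt.scatter(xy[:, 0], xy[:, 1], c=labels, s=3, cmap="tab20")
    plt.title("t-SNE of the feature space, colored by semantic category")
    plt.show()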
[Diagram: Krizhevsky's net is trained on the ImageNet database (ILSVRC-2012); its DeCAFn activations are embedded into a t-SNE map, alongside t-SNE maps of LLC and GIST features for comparison.]
Features that were compared:
− GIST features: a known feature extraction approach (a low-dimensional representation of the scene that does not require any form of segmentation; Oliva and Torralba, 2001)
− LLC features: a known feature extraction approach (Locality-constrained Linear Coding; J. Wang et al., 2010)
[Figure: t-SNE feature visualizations of GIST and LLC features on the ILSVRC-2012 validation set (the features were trained on the ILSVRC-2012 training set; the held-out validation set is visualized to avoid showing overfitting).]
GIST and LLC fail to capture the semantic differences between images.
[Figure: t-SNE feature visualizations of DeCAF1 and DeCAF6 on the ILSVRC-2012 validation set (same protocol as above).]
The first layers learn "low-level" features, while the latter layers learn semantic or "high-level" features.
DeCAF6 features trained on ILSVRC-2012 generalize to SUN-397.
SUN-397: large-scale scene recognition from abbey to zoo (899 categories and 130,519 images).
[Figure: t-SNE map of DeCAF6 features on SUN-397, with points colored by the different semantic categories.]
All of the network's hidden layer weights are frozen to those learned on the ILSVRC-2012 dataset.
Experiments
Results on multiple datasets evaluate the strength of DeCAF; each task differs somewhat from the one the architecture was trained for.
[Diagram: Krizhevsky's net, after training on ILSVRC-2012, has its weights frozen; activation features are extracted for the new dataset and a linear classifier is trained on the new task.]
The tasks: object recognition, domain adaptation, subcategory recognition, and scene recognition. (A sketch of this pipeline follows below.)
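A compact sketch of that pipeline, assuming PyTorch/torchvision and scikit-learn rather than the paper's Caffe/decaf code (AlexNet stands in for Krizhevsky's net; X_train and y_train are placeholders for the new dataset):

import torch
import torchvision.models as models
from sklearn.linear_model import LogisticRegression

# 1. Load a network pre-trained on ILSVRC-2012 and freeze all weights.
net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()
for p in net.parameters():
    p.requires_grad = False

# 2. DeCAF6-like features: activations of the first fully-connected layer
#    (net.classifier[:3] is Dropout -> Linear(fc6) -> ReLU).
def decaf6(images):                       # images: (N, 3, 224, 224) tensor
    x = net.features(images)
    x = torch.flatten(net.avgpool(x), 1)
    return net.classifier[:3](x)          # (N, 4096) feature vectors

# 3. Train a simple linear classifier on the frozen features of the new task.
#    X_train: image tensor of the new dataset, y_train: its labels (assumed given).
with torch.no_grad():
    feats = decaf6(X_train).numpy()
clf = LogisticRegression(max_iter=1000).fit(feats, y_train)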
Experiment: Object Recognition
Caltech-101: pictures of objects belonging to 101 categories, with about 40 to 800 images per category; most categories have about 50 images.
• Evaluating linear classifier performance on DeCAF6 and DeCAF7
• Using "dropout" regularization
Also compared with the two-layer convolutional network of Jarrett et al. (2009).
Experiment: Domain Adaptation
Office dataset (Saenko et al., 2010), which has 3 domains (31 categories in each domain):
− Amazon: images taken from amazon.com
− Webcam & DSLR: images taken in an office environment using a webcam or a digital SLR camera
[Table: accuracy under domain shift (source -> target, e.g. Amazon -> Webcam) comparing trained linear classifiers, adaptive methods, and recent deep domain adaptation methods.]
Experiment: Domain Adaptation (cont.)
[Figure: t-SNE visualizations of SURF features vs. DeCAF6 features on the Office dataset.]
− DeCAF is robust to resolution changes
− DeCAF provides better category clustering than SURF
− DeCAF clusters same-category instances across domains
Experiment: Subcategory Recognition
Caltech-UCSD Birds dataset (~6,000 photos of 200 bird species).
Fine-grained recognition involves recognizing subclasses of the same object class, such as different bird species, dog breeds, flower types, etc.
− First, adopt an ImageNet-like pipeline: DeCAF6 and a multi-class logistic regression (as in the previous experiments)
− Second, adopt the deformable part descriptors (DPD) method [Zhang et al., 2013]
Experiment: Subcategory Recognition (cont.)
[Table: accuracy of DeCAF6 alone and of DPD with DeCAF (DeCAF applied in the same pre-trained DPM model and part predictions, with the same pooling weights), compared with prior methods.]
Experiment: Scene Recognition
SUN-397 large-scale scene recognition database
(899 categories and 130,519 images)
Goal: classify the scene of the entire image
Outperforms Xiao et al. (2010), the state-of-the-art method at the time.
Discussion
DeCAF demonstrates:
− high classification accuracy on tasks with sparse labeled data using simple linear classifiers,
− outperforming state-of-the-art approaches based on sophisticated multi-kernel learning techniques with traditional hand-engineered features;
− the features tend to cluster images into interesting semantic categories on which the network was never explicitly trained;
− DeCAF can substantially improve the performance of a wide variety of existing methods across a spectrum of visual recognition tasks.
An Open-Source Convolutional Model
Caffe (at first it was called decaf) is a deep learning framework, with a Python interface, made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center (BVLC) and by community contributors. Yangqing Jia created the project during his PhD at UC Berkeley.
The framework allows one to easily train networks consisting of various layer types and to execute pre-trained networks efficiently, without being restricted to a GPU: it is able to process about 40 images per second on an 8-core commodity machine when the CNN model is executed in minibatch mode.
In addition, the authors have released the network parameters used in their experiments, allowing out-of-the-box feature extraction without the need to re-train the large network.
Image Style Transfer Using Convolutional Neural Networks
4-DEC-2016
Gal Barzilai, Ram Machlev
Deep Learning Seminar
School of Electrical Engineering – Tel Aviv University
Part 2: Style Transfer
Texture Transfer – Review
Transferring the style from one image onto another can be considered a problem of texture transfer.
Our goal – synthesize a texture from a source image while constraining the texture synthesis in order to preserve the semantic content of a target image.
Several Examples
Texture Transfer – Former Approaches
A large range of powerful non-parametric algorithms can synthesize photorealistic natural textures by resampling the pixels of a given source texture; for example, "Texture Synthesis by Non-parametric Sampling" (Efros and Leung, 1999).
These algorithms suffer from the limitation that they use only low-level image features of the target image to inform the texture transfer.
There is a need for an algorithm that uses high-level image features for style transfer, and this article addresses this issue.
Texture Transfer – Deep Learning Approach
The article proposes a novel algorithm, "A Neural Algorithm of Artistic Style" (also presented by the authors in an earlier article of the same name).
Many implementations are available on GitHub (for example:
https://github.com/jcjohnson/neural-style )
The algorithm uses a CNN that was trained for object recognition and localization (the chosen network was a VGG network).
Image Representations in CNN
The number of different filters increases along the processing hierarchy.
The size of the filtered images is reduced by a down-sampling mechanism (in our network, average pooling), leading to a decrease in the total number of units per layer of the network.
CNN Network – VGG
The VGG network was designed by the Visual Geometry Group in Oxford. Article: "Very Deep Convolutional Networks for Large-Scale Image Recognition"
http://www.robots.ox.ac.uk/~vgg/research/very_deep/
The group developed 16-layer and 19-layer models.
Network parameters: a normalized version of the 16 convolutional and 5 pooling layers of the 19-layer VGG network. The normalization is of the weights, such that the mean activation of each convolutional filter over images and positions is equal to one.
No fully-connected layers are used. Average pooling is used instead of max pooling (it gave better results; no theoretical explanation was provided).
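As a rough sketch of obtaining such intermediate VGG activations (assuming PyTorch/torchvision with the standard, non-normalized VGG-19 weights, not the authors' normalized Caffe weights):

import torch
import torchvision.models as models

vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
for p in vgg.parameters():
    p.requires_grad = False

# Swap max pooling for average pooling, as the paper recommends.
for i, layer in enumerate(vgg):
    if isinstance(layer, torch.nn.MaxPool2d):
        vgg[i] = torch.nn.AvgPool2d(kernel_size=2, stride=2)

def feature_maps(x, layer_ids):
    # Collect the activations of the requested layers for image batch x.
    feats = []
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in layer_ids:
            feats.append(x)
    return feats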
Style Transfer Algorithm
Content Representation – Notations
A layer with $N_l$ distinct filters has $N_l$ feature maps of size $M_l$, where $M_l$ is the height times the width of the feature map. The responses in layer $l$ can be stored in a matrix $F^l \in \mathbb{R}^{N_l \times M_l}$, where $F^l_{ij}$ is the activation of the $i$-th filter at position $j$ in layer $l$.
$\vec{p}$ is the original content image, $\vec{x}$ is the generated image (initialized from white noise).
$P^l$ and $F^l$ are their respective feature representations in layer $l$.
Content Representation – Calculations
The squared-error loss between the two feature representations:
$\mathcal{L}_{content}(\vec{p}, \vec{x}, l) = \frac{1}{2} \sum_{i,j} \left( F^l_{ij} - P^l_{ij} \right)^2$
The derivative of the loss with respect to the activations in layer $l$:
$\frac{\partial \mathcal{L}_{content}}{\partial F^l_{ij}} = \begin{cases} (F^l - P^l)_{ij} & \text{if } F^l_{ij} > 0 \\ 0 & \text{if } F^l_{ij} < 0 \end{cases}$
The gradient with respect to the image $\vec{x}$ can then be computed using standard error back-propagation.
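A one-line sketch of this loss in PyTorch (autograd then provides the back-propagated gradient with respect to the generated image; fmap_x and fmap_p are assumed to be the layer-l feature maps of the generated and content images):

def content_loss(fmap_x, fmap_p):
    # 1/2 * sum_ij (F^l_ij - P^l_ij)^2
    return 0.5 * ((fmap_x - fmap_p) ** 2).sum()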
Style Transfer Algorithm with the content representation only, ignoring the style effect:
$\mathcal{L}_{total} = \mathcal{L}_{content}$
Style Representation 1
To obtain a representation of the style of an input image, the authors used a feature space designed to capture texture information (published by the authors in "Texture Synthesis Using Convolutional Neural Networks").
The feature space can be built on top of the filter responses in any layer of the network. It consists of the correlations between the different filter responses. The feature correlations are given by the Gram matrix $G^l \in \mathbb{R}^{N_l \times N_l}$, where $G^l_{ij}$ is the inner product between the vectorised feature maps $i$ and $j$ in layer $l$:
$G^l_{ij} = \sum_k F^l_{ik} F^l_{jk}$
Style Representation 2
We include the feature correlations of multiple layers, and gain a stationary, multi-scale representation of the input image which captures its texture.
We can visualize the information captured by the style feature spaces built on different layers of the network by constructing an image that matches the style representation of the style image.
This is done by using gradient descent from a white noise image to minimize the mean-squared distance between the entries of the Gram matrices of the style image and the Gram matrices of the image to be generated.
Style Representation 3
$\vec{a}$ is the original style image, $\vec{x}$ is the generated image (initialized from white noise).
$A^l$ and $G^l$ are their respective style representations in layer $l$.
The contribution of each layer $l$ to the total loss is:
$E_l = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} \left( G^l_{ij} - A^l_{ij} \right)^2$
The total style loss is:
$\mathcal{L}_{style}(\vec{a}, \vec{x}) = \sum_l w_l E_l$
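A minimal PyTorch sketch of the Gram matrix and style loss (shapes and names are illustrative; the feature maps are assumed to come from a VGG forward pass like the one sketched earlier):

import torch

def gram_matrix(fmap):
    # fmap: (N_l, H, W) feature maps of layer l, vectorised to (N_l, M_l)
    n_l = fmap.shape[0]
    F = fmap.reshape(n_l, -1)            # M_l = H * W
    return F @ F.t()                     # G^l_ij = sum_k F^l_ik F^l_jk

def style_layer_loss(fmap_x, fmap_a):
    # E_l = 1 / (4 N_l^2 M_l^2) * sum_ij (G^l_ij - A^l_ij)^2
    n_l, m_l = fmap_x.shape[0], fmap_x.shape[1] * fmap_x.shape[2]
    G, A = gram_matrix(fmap_x), gram_matrix(fmap_a)
    return ((G - A) ** 2).sum() / (4 * n_l**2 * m_l**2)

def style_loss(fmaps_x, fmaps_a, weights):
    # L_style = sum_l w_l * E_l over the chosen layers
    return sum(w * style_layer_loss(fx, fa)
               for fx, fa, w in zip(fmaps_x, fmaps_a, weights))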
Style Representation 4
$w_l$ are weighting factors of the contribution of each layer to the total loss.
The derivative of $E_l$ with respect to the activations in layer $l$ can be computed as:
$\frac{\partial E_l}{\partial F^l_{ij}} = \begin{cases} \frac{1}{N_l^2 M_l^2} \left( (F^l)^{\mathrm{T}} (G^l - A^l) \right)_{ji} & \text{if } F^l_{ij} > 0 \\ 0 & \text{if } F^l_{ij} < 0 \end{cases}$
The gradients of $E_l$ with respect to the pixel values of $\vec{x}$ can be computed using standard error back-propagation.
Style Transfer Algorithm with the style representation only, ignoring the content effect:
$\mathcal{L}_{total} = \mathcal{L}_{style}$
Style Transfer
Now we want to find a compromise between the style of the style image and the content of the content image.
We jointly minimize the distance of the feature representations of a white noise image from the content representation of the photograph in one (higher) layer, and from the style representation of the painting defined on a number of layers of the CNN.
The loss function to be minimized is:
$\mathcal{L}_{total}(\vec{p}, \vec{a}, \vec{x}) = \alpha \mathcal{L}_{content}(\vec{p}, \vec{x}) + \beta \mathcal{L}_{style}(\vec{a}, \vec{x})$
where $\alpha$ and $\beta$ are weighting factors for the content and style reconstruction, respectively.
Style Transfer – Implementation Considerations
The optimization strategy is L-BFGS, which the authors found works best for image synthesis. L-BFGS is a limited-memory variant of BFGS, an iterative method for solving unconstrained nonlinear optimization problems.
The style image was resized to the size of the content image in order to extract image information at a comparable scale.
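A sketch of the joint optimization loop with PyTorch's L-BFGS (not the authors' code; content_loss_of and style_loss_of are assumed helpers that wrap the losses above with the fixed targets $P^l$ and $A^l$, and the weighting factors are illustrative):

import torch

alpha, beta = 1.0, 1e3                              # illustrative weights
x = torch.rand(1, 3, 512, 512, requires_grad=True)  # white-noise init
optimizer = torch.optim.LBFGS([x])

def closure():
    optimizer.zero_grad()
    loss = alpha * content_loss_of(x) + beta * style_loss_of(x)
    loss.backward()              # gradients via standard back-propagation
    return loss

for _ in range(20):              # each step runs closure() several times
    optimizer.step(closure)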
The Main Result
The representations of content and style in a CNN are well separable. Therefore, we can manipulate both representations independently to produce new, perceptually meaningful images.
Trade-Off Between Content and Style Matching
The higher the ratio $\alpha / \beta$, the more the generated picture resembles the content of the content image, and the less it matches the style of the style image.
The Effect of Matching the Content Representation in Different Layers of the Network
On a lower layer of the network (conv2_2), the texture of the painting is blended over the photograph. On a higher layer of the network (conv4_2), the new picture looks as if the content of the original picture was preserved in the style of the painting. Therefore, the more appealing images are usually created by matching the content representation on the higher layers.
Both images were produced with a ratio $\alpha / \beta = 10^{-3}$.
Initialization of Gradient Descent
The initial guess changes the output image!
Initializing from a predefined image leads to a single output image (neglecting the stochasticity of the gradient descent).
Image A – initialized from the content image.
Image B – initialized from the style image.
The last four images were initialized from white noise.
Initializing from white noise gives us an infinite number of potential output images. There is only a small bias toward the initial guess (A or B).
Photorealistic Style Transfer
Style – New York. Content – London.
The photo-realism is not fully preserved.
Discussion Slide 1
The article demonstrated how to use feature representations from a CNN to transfer image style between arbitrary images.
Limitations – the resolution of the synthesized images: the speed of generating an image depends linearly on the number of pixels (both for the optimization problem and for the number of units in the CNN). In this article, 512x512-pixel images were generated with an Nvidia K40 GPU, and synthesis could take up to an hour. This limitation means the algorithm cannot be used for online and interactive applications.
Discussion Slide 2
Synthesized images are sometimes subject to some low-level noise. This is less problematic for artistic style transfer, but more relevant when both the content and style images are photographs, because the photorealism of the result is affected. The authors note that the noise resembles the filters of units in the network, and suggest developing a de-noising technique to post-process the image after the optimization.
The separation of image content from style is not a well-defined problem, because it is hard to define what constitutes style in an image.
Questions