TRANSCRIPT
Transfer Learning
&
Style Transfer in Deep Learning
4-DEC-2016
Gal Barzilai, Ram Machlev
Deep Learning Seminar
School of Electrical Engineering – Tel Aviv University
Part 1: Transfer Learning in Deep Learning
Based on: "DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition" (Donahue, Jia, et al.), 6-OCT-2013, 976 citations.
Yangqing Jia, author of Caffe and DeCAF.
One of the main problems in deep learning approaches:
− with limited training data, fully-supervised deep architectures generally overfit
− many visual recognition challenges have tasks with only a few training examples
Transfer Learning Concept
[Diagram: Task A is trained on Input A (for example: cars), Task B on Input B (for example: trucks). In the transfer network AnB, the first n layers are frozen to the weights learned for Task A, and the remaining layers are retrained for Task B by back-propagation.]
The concept: learning the features on large-scale data in a supervised setting, then transferring them to different tasks with different labels.
[Diagram: Task A and Task B are each trained on disjoint sets of 500 classes from ImageNet. In the transfer network A3B, the first 3 layers are frozen to the weights learned for Task A, and the remaining layers are newly learned for Task B by back-propagation.]
Accuracy experiments in: Yosinski et al., 2014 ("How transferable are features in deep neural networks?").
DeCAF Approach
Deep convolutional representations are learned on a set of related problems but applied to new tasks which have too few training examples to learn a full deep representation.
The model can be considered as:
a deep architecture for transfer learning based on a supervised pre-training phase,
or simply as:
convolutional network weights learned on a set of pre-defined object recognition tasks.
Adopted Network
Deep CNN architecture proposed by Krizhevsky et al. [Krizhevsky, NIPS 2012]:
− 5 convolutional layers (with pooling and ReLU)
− 3 fully-connected layers
− won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012, with ImageNet (10,000,000 labeled images depicting 10,000+ object categories) as training data
− top-1 validation error rate of 40.7%
The authors follow this architecture and training protocol with two differences:
− input 256 x 256 images rather than 224 x 224 images
− no data augmentation trick
DeCAFn: the activations of the nth hidden layer of the deep convolutional neural network, used as a feature vector.
[Diagram: the successive layers of the network yield features DeCAF1 through DeCAF7.]
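As a rough illustration (not the original decaf code; this sketch assumes PyTorch/torchvision, with AlexNet standing in for Krizhevsky's net), the activations of an intermediate layer can be read out with a forward hook:

import torch
import torchvision.models as models

net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()

activations = {}
def save_activation(module, inputs, output):
    activations["decaf_n"] = output.detach()

# Treat the output of the last convolutional block as a DeCAF5-like feature.
net.features[-1].register_forward_hook(save_activation)
with torch.no_grad():
    net(torch.rand(1, 3, 224, 224))              # dummy image batch
features = activations["decaf_n"].flatten(1)     # one feature vector per image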
Feature Generalization and Visualization
The features are visualized in the following way:
• run the t-SNE algorithm to obtain a 2-dimensional embedding of the high-dimensional feature space
• plot the features as points, colored depending on their semantic category
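A minimal sketch of this visualization step with scikit-learn and matplotlib (features and labels are assumed to be given, e.g. DeCAF activations and their category indices):

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels):
    # features: (N, D) array; labels: (N,) integer category ids
    xy = TSNE(n_components=2).fit_transform(features)  # 2-D embedding
    plt.scatter(xy[:, 0], xy[:, 1], c=labels, s=3, cmap="tab20")
    plt.title("t-SNE of the feature space, colored by semantic category")
    plt.show()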
[Diagram: Krizhevsky's net is trained on the ImageNet database (ILSVRC-2012); its DeCAFn activations are embedded into a t-SNE map, alongside t-SNE maps of LLC and GIST features for comparison.]
Features that were compared:
− GIST features: a known feature extraction approach (a low-dimensional representation of the scene that does not require any form of segmentation; Oliva and Torralba, 2001)
− LLC features: a known feature extraction approach (Locality-constrained Linear Coding; J. Wang et al., 2010)
[Figure: t-SNE feature visualizations of GIST and LLC features on the ILSVRC-2012 validation set (the features were trained on the ILSVRC-2012 training set; the held-out validation set is visualized to avoid showing overfitting).]
GIST and LLC fail to capture the semantic differences between images.
[Figure: t-SNE feature visualizations of DeCAF1 and DeCAF6 on the ILSVRC-2012 validation set (same protocol as above).]
The first layers learn "low-level" features, while the latter layers learn semantic or "high-level" features.
DeCAF6 features trained on ILSVRC-2012 generalize to SUN-397.
SUN-397: large-scale scene recognition from abbey to zoo (899 categories and 130,519 images).
[Figure: t-SNE map of DeCAF6 features on SUN-397, with points colored by the different semantic categories.]
All of the network's hidden layer weights are frozen to those learned on the ILSVRC-2012 dataset.
Experiments
Results on multiple datasets evaluate the strength of DeCAF; each task differs somewhat from the one the architecture was trained for.
[Diagram: Krizhevsky's net, after training on ILSVRC-2012, has its weights frozen; activation features are extracted for the new dataset and a linear classifier is trained on the new task.]
The tasks: object recognition, domain adaptation, subcategory recognition, and scene recognition. (A sketch of this pipeline follows below.)
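A compact sketch of that pipeline, assuming PyTorch/torchvision and scikit-learn rather than the paper's Caffe/decaf code (AlexNet stands in for Krizhevsky's net; X_train and y_train are placeholders for the new dataset):

import torch
import torchvision.models as models
from sklearn.linear_model import LogisticRegression

# 1. Load a network pre-trained on ILSVRC-2012 and freeze all weights.
net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()
for p in net.parameters():
    p.requires_grad = False

# 2. DeCAF6-like features: activations of the first fully-connected layer
#    (net.classifier[:3] is Dropout -> Linear(fc6) -> ReLU).
def decaf6(images):                       # images: (N, 3, 224, 224) tensor
    x = net.features(images)
    x = torch.flatten(net.avgpool(x), 1)
    return net.classifier[:3](x)          # (N, 4096) feature vectors

# 3. Train a simple linear classifier on the frozen features of the new task.
#    X_train: image tensor of the new dataset, y_train: its labels (assumed given).
with torch.no_grad():
    feats = decaf6(X_train).numpy()
clf = LogisticRegression(max_iter=1000).fit(feats, y_train)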
Experiment: Object Recognition
Caltech-101: pictures of objects belonging to 101 categories, with about 40 to 800 images per category; most categories have about 50 images.
• Evaluating linear classifier performance on DeCAF6 and DeCAF7
• Using "dropout" regularization
Also compared with the two-layer convolutional network of Jarrett et al. (2009).
Experiment: Domain Adaptation
Office dataset (Saenko et al., 2010), which has 3 domains (31 categories in each domain):
− Amazon: images taken from amazon.com
− Webcam & DSLR: images taken in an office environment using a webcam or a digital SLR camera
[Table: accuracy under domain shift (source -> target, e.g. Amazon -> Webcam) comparing trained linear classifiers, adaptive methods, and recent deep domain adaptation methods.]
Experiment: Domain Adaptation (cont.)
[Figure: t-SNE visualizations of SURF features vs. DeCAF6 features on the Office dataset.]
− DeCAF is robust to resolution changes
− DeCAF provides better category clustering than SURF
− DeCAF clusters same-category instances across domains
Experiment: Subcategory Recognition
Caltech-UCSD Birds dataset (~6,000 photos of 200 bird species).
Fine-grained recognition involves recognizing subclasses of the same object class, such as different bird species, dog breeds, flower types, etc.
− First, adopt an ImageNet-like pipeline: DeCAF6 and a multi-class logistic regression (as in the previous experiments)
− Second, adopt the deformable part descriptors (DPD) method [Zhang et al., 2013]
Experiment: Subcategory Recognition (cont.)
[Table: accuracy of DeCAF6 alone and of DPD with DeCAF (DeCAF applied in the same pre-trained DPM model and part predictions, with the same pooling weights), compared with prior methods.]
Experiment: Scene Recognition
SUN-397 large-scale scene recognition database
(899 categories and 130,519 images)
Goal: classify the scene of the entire image
Outperforms Xiao et al. (2010), the state-of-the-art method at the time.
Discussion
DeCAF demonstrates:
− high classification accuracy on tasks with sparse labeled data using simple linear classifiers,
− outperforming state-of-the-art approaches based on sophisticated multi-kernel learning techniques with traditional hand-engineered features;
− the features tend to cluster images into interesting semantic categories on which the network was never explicitly trained;
− DeCAF can substantially improve the performance of a wide variety of existing methods across a spectrum of visual recognition tasks.
An Open-Source Convolutional Model
Caffe (at first it was called decaf) is a deep learning framework, with a Python interface, made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center (BVLC) and by community contributors. Yangqing Jia created the project during his PhD at UC Berkeley.
The framework allows one to easily train networks consisting of various layer types and to execute pre-trained networks efficiently, without being restricted to a GPU: it is able to process about 40 images per second on an 8-core commodity machine when the CNN model is executed in minibatch mode.
In addition, the authors have released the network parameters used in their experiments, allowing out-of-the-box feature extraction without the need to re-train the large network.
Image Style Transfer Using Convolutional Neural Networks
4-DEC-2016
Gal Barzilai, Ram Machlev
Deep Learning Seminar
School of Electrical Engineering – Tel Aviv University
Part 2: Style Transfer
Texture Transfer – Review
Transferring the style from one image onto another can be considered a problem of texture transfer.
Our goal – synthesize a texture from a source image while constraining the texture synthesis in order to preserve the semantic content of a target image.
Several Examples
Texture Transfer – Former Approaches
A large range of powerful non-parametric algorithms can synthesize photorealistic natural textures by resampling the pixels of a given source texture; for example, "Texture Synthesis by Non-parametric Sampling" (Efros and Leung, 1999).
These algorithms suffer from the limitation that they use only low-level image features of the target image to inform the texture transfer.
There is a need for an algorithm that uses high-level image features for style transfer, and this article addresses this issue.
Texture Transfer – Deep Learning Approach
The article proposes a novel algorithm, "A Neural Algorithm of Artistic Style" (also presented by the authors in an earlier article of the same name).
Many implementations are available on GitHub (for example:
https://github.com/jcjohnson/neural-style )
The algorithm uses a CNN that was trained for object recognition and localization (the chosen network was a VGG network).
Image Representations in CNN
The number of different filters increases along the processing hierarchy.
The size of the filtered images is reduced by a down-sampling mechanism (in our network, average pooling), leading to a decrease in the total number of units per layer of the network.
CNN Network – VGG
The VGG network was designed by the Visual Geometry Group in Oxford. Article: "Very Deep Convolutional Networks for Large-Scale Image Recognition"
http://www.robots.ox.ac.uk/~vgg/research/very_deep/
The group developed 16-layer and 19-layer models.
Network parameters: a normalized version of the 16 convolutional and 5 pooling layers of the 19-layer VGG network. The normalization is of the weights, such that the mean activation of each convolutional filter over images and positions is equal to one.
No fully-connected layers are used. Average pooling is used instead of max pooling (it gave better results; no theoretical explanation was provided).
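As a rough sketch of obtaining such intermediate VGG activations (assuming PyTorch/torchvision with the standard, non-normalized VGG-19 weights, not the authors' normalized Caffe weights):

import torch
import torchvision.models as models

vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
for p in vgg.parameters():
    p.requires_grad = False

# Swap max pooling for average pooling, as the paper recommends.
for i, layer in enumerate(vgg):
    if isinstance(layer, torch.nn.MaxPool2d):
        vgg[i] = torch.nn.AvgPool2d(kernel_size=2, stride=2)

def feature_maps(x, layer_ids):
    # Collect the activations of the requested layers for image batch x.
    feats = []
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in layer_ids:
            feats.append(x)
    return feats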
Style Transfer Algorithm
Content Representation – Notations
A layer with $N_l$ distinct filters has $N_l$ feature maps of size $M_l$, where $M_l$ is the height times the width of the feature map. The responses in layer $l$ can be stored in a matrix $F^l \in \mathbb{R}^{N_l \times M_l}$, where $F^l_{ij}$ is the activation of the $i$-th filter at position $j$ in layer $l$.
$\vec{p}$ is the original content image, $\vec{x}$ is the generated image (initialized from white noise).
$P^l$ and $F^l$ are their respective feature representations in layer $l$.
Content Representation – Calculations
The squared-error loss between the two feature representations:
$\mathcal{L}_{content}(\vec{p}, \vec{x}, l) = \frac{1}{2} \sum_{i,j} \left( F^l_{ij} - P^l_{ij} \right)^2$
The derivative of the loss with respect to the activations in layer $l$:
$\frac{\partial \mathcal{L}_{content}}{\partial F^l_{ij}} = \begin{cases} (F^l - P^l)_{ij} & \text{if } F^l_{ij} > 0 \\ 0 & \text{if } F^l_{ij} < 0 \end{cases}$
The gradient with respect to the image $\vec{x}$ can then be computed using standard error back-propagation.
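A one-line sketch of this loss in PyTorch (autograd then provides the back-propagated gradient with respect to the generated image; fmap_x and fmap_p are assumed to be the layer-l feature maps of the generated and content images):

def content_loss(fmap_x, fmap_p):
    # 1/2 * sum_ij (F^l_ij - P^l_ij)^2
    return 0.5 * ((fmap_x - fmap_p) ** 2).sum()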
Style Transfer Algorithm with the content representation only, ignoring the style effect:
$\mathcal{L}_{total} = \mathcal{L}_{content}$
Style Representation 1
To obtain a representation of the style of an input image, the authors used a feature space designed to capture texture information (published by the authors in "Texture Synthesis Using Convolutional Neural Networks").
The feature space can be built on top of the filter responses in any layer of the network. It consists of the correlations between the different filter responses. The feature correlations are given by the Gram matrix $G^l \in \mathbb{R}^{N_l \times N_l}$, where $G^l_{ij}$ is the inner product between the vectorised feature maps $i$ and $j$ in layer $l$:
$G^l_{ij} = \sum_k F^l_{ik} F^l_{jk}$
Style Representation 2
We include the feature correlations of multiple layers, and gain a stationary, multi-scale representation of the input image which captures its texture.
We can visualize the information captured by the style feature spaces built on different layers of the network by constructing an image that matches the style representation of the style image.
This is done by using gradient descent from a white noise image to minimize the mean-squared distance between the entries of the Gram matrices of the style image and the Gram matrices of the image to be generated.
Style Representation 3
$\vec{a}$ is the original style image, $\vec{x}$ is the generated image (initialized from white noise).
$A^l$ and $G^l$ are their respective style representations in layer $l$.
The contribution of each layer $l$ to the total loss is:
$E_l = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} \left( G^l_{ij} - A^l_{ij} \right)^2$
The total style loss is:
$\mathcal{L}_{style}(\vec{a}, \vec{x}) = \sum_l w_l E_l$
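A minimal PyTorch sketch of the Gram matrix and style loss (shapes and names are illustrative; the feature maps are assumed to come from a VGG forward pass like the one sketched earlier):

import torch

def gram_matrix(fmap):
    # fmap: (N_l, H, W) feature maps of layer l, vectorised to (N_l, M_l)
    n_l = fmap.shape[0]
    F = fmap.reshape(n_l, -1)            # M_l = H * W
    return F @ F.t()                     # G^l_ij = sum_k F^l_ik F^l_jk

def style_layer_loss(fmap_x, fmap_a):
    # E_l = 1 / (4 N_l^2 M_l^2) * sum_ij (G^l_ij - A^l_ij)^2
    n_l, m_l = fmap_x.shape[0], fmap_x.shape[1] * fmap_x.shape[2]
    G, A = gram_matrix(fmap_x), gram_matrix(fmap_a)
    return ((G - A) ** 2).sum() / (4 * n_l**2 * m_l**2)

def style_loss(fmaps_x, fmaps_a, weights):
    # L_style = sum_l w_l * E_l over the chosen layers
    return sum(w * style_layer_loss(fx, fa)
               for fx, fa, w in zip(fmaps_x, fmaps_a, weights))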
Style Representation 4
$w_l$ are weighting factors of the contribution of each layer to the total loss.
The derivative of $E_l$ with respect to the activations in layer $l$ can be computed as:
$\frac{\partial E_l}{\partial F^l_{ij}} = \begin{cases} \frac{1}{N_l^2 M_l^2} \left( (F^l)^{\mathrm{T}} (G^l - A^l) \right)_{ji} & \text{if } F^l_{ij} > 0 \\ 0 & \text{if } F^l_{ij} < 0 \end{cases}$
The gradients of $E_l$ with respect to the pixel values of $\vec{x}$ can be computed using standard error back-propagation.
Style Transfer Algorithm with the style representation only, ignoring the content effect:
$\mathcal{L}_{total} = \mathcal{L}_{style}$
Style Transfer
Now we want to find a compromise between the style of the style image and the content of the content image.
We jointly minimize the distance of the feature representations of a white noise image from the content representation of the photograph in one (higher) layer, and from the style representation of the painting defined on a number of layers of the CNN.
The loss function to be minimized is:
$\mathcal{L}_{total}(\vec{p}, \vec{a}, \vec{x}) = \alpha \mathcal{L}_{content}(\vec{p}, \vec{x}) + \beta \mathcal{L}_{style}(\vec{a}, \vec{x})$
where $\alpha$ and $\beta$ are weighting factors for the content and style reconstruction, respectively.
Style Transfer – Implementation Considerations
The optimization strategy is L-BFGS, which the authors found works best for image synthesis. L-BFGS is a limited-memory variant of BFGS, an iterative method for solving unconstrained nonlinear optimization problems.
The style image was resized to the size of the content image in order to extract image information at a comparable scale.
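A sketch of the joint optimization loop with PyTorch's L-BFGS (not the authors' code; content_loss_of and style_loss_of are assumed helpers that wrap the losses above with the fixed targets $P^l$ and $A^l$, and the weighting factors are illustrative):

import torch

alpha, beta = 1.0, 1e3                              # illustrative weights
x = torch.rand(1, 3, 512, 512, requires_grad=True)  # white-noise init
optimizer = torch.optim.LBFGS([x])

def closure():
    optimizer.zero_grad()
    loss = alpha * content_loss_of(x) + beta * style_loss_of(x)
    loss.backward()              # gradients via standard back-propagation
    return loss

for _ in range(20):              # each step runs closure() several times
    optimizer.step(closure)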
The Main Result
The representations of content and style in a CNN are well separable. Therefore, we can manipulate both representations independently to produce new, perceptually meaningful images.
Trade-Off Between Content and Style Matching
The higher the ratio $\alpha / \beta$, the more the generated picture resembles the content of the content image, and the less it matches the style of the style image.
The Effect of Matching the Content Representation in Different Layers of the Network
On a lower layer of the network (conv2_2), the texture of the painting is blended over the photograph. On a higher layer of the network (conv4_2), the new picture looks as if the content of the original picture was preserved in the style of the painting. Therefore, the more appealing images are usually created by matching the content representation on the higher layers.
Both images were produced with a ratio $\alpha / \beta = 10^{-3}$.
Initialization of Gradient Descent
The initial guess changes the output image!
Initializing from a predefined image leads to a single output image (neglecting the stochasticity of the gradient descent).
Image A – initialized from the content image.
Image B – initialized from the style image.
The last four images were initialized from white noise.
Initializing from white noise gives us an infinite number of potential output images. There is only a small bias toward the initial guess (A or B).
Photorealistic Style Transfer
Style – New York. Content – London.
The photo-realism is not fully preserved.
Discussion Slide 1
The article demonstrated how to use feature representations from a CNN to transfer image style between arbitrary images.
Limitations – the resolution of the synthesized images: the speed of generating an image depends linearly on the number of pixels (both for the optimization problem and for the number of units in the CNN). In this article, 512x512-pixel images were generated with an Nvidia K40 GPU, and synthesis could take up to an hour. This limitation means the algorithm cannot be used for online and interactive applications.
Discussion Slide 2
Synthesized images are sometimes subject to some low-level noise. This is less problematic for artistic style transfer, but more relevant when both the content and style images are photographs, because the photorealism of the result is affected. The authors note that the noise resembles the filters of units in the network, and suggest developing a de-noising technique to post-process the image after the optimization.
The separation of image content from style is not a well-defined problem, because it is hard to define what constitutes style in an image.
Questions