Post on 11-Jul-2020
Ehsan Amiri
Unsupervised Learning of Visual
Representations by Solving Jigsaw Puzzles
Mehdi Noroozi and Paolo Favaro
Presented by : Ehsan Amiri
Outline
• Introduction
– Deep Learning in Visual Tasks
• Unsupervised Learning
• Self-supervised learning
• Transfer learning
• Related works
• The Jigsaw puzzles
– Motivation
• Proposed Method
– CFN Architecture
– Training the CFN
– Implementation
• Experiments
• Summary
Introduction: Deep Learning in Visual Tasks
• Supervised learning
– Use labeled data to train a parametric model
– Deep Convolutional Neural Networks (AlexNet)
– Manual labeling of data is costly
Source: Krizhevsky et al. [1]
Introduction: Deep Learning in Visual Tasks
• Unsupervised learning
– Representation / Feature learning
• General-purpose priors (smoothness, temporal and spatial coherence, sparsity, sharing of factors, and other priors).
• No general training criterion is available.
• Solution: disentangle the factors of variation.
– Methods :
1. Probabilistic Methods
2. Direct Mapping Methods
3. Manifold learning Methods
4. Self-supervised learning Methods
Introduction: Deep Learning in Visual Tasks
• Unsupervised learning
– Probabilistic Methods
• Observed and latent variables
• Maximize P(latent | observed)
• Restricted Boltzmann Machine (RBM)
• Problem: intractable in the presence of multiple layers
– Direct Mapping Methods (autoencoders)
• Feature extraction function (encoder)
• Mapping from features back to the input (decoder)
• Minimize the reconstruction error
– Manifold learning Methods
• Map smooth variations of factors to observations
• Problem: nearest-neighbor computation scales quadratically, and a high density of samples is needed
– Self-supervised learning Methods
Introduction: Deep Learning in Visual Tasks
• Self-supervised learning
– Exploit labels that are freely available in visual data
– Two types of labels:
• Easily accessible via non-visual signals (ego-motion, audio, text and so on)
• Obtained from the structure of the data (pixel arrangement)
Introduction: Deep Learning in Visual Tasks
• Transfer learning
[Diagram: features learned on Task 1 (pre-training) are extracted and repurposed for Task 2 (fine-tuning).]
Related works
• Wang and Gupta
– Extract matched patches by tracking in videos.
– Bounding boxes (SURF), tracking (KCF)
Source: Wang and Gupta [2]
Related works
• Wang and Gupta
– Siamese-triplet network
– Learns a metric that defines the similarity between patches
– Use the learned features in object detection (PASCAL VOC 2012) and surface normal estimation
Source: Wang and Gupta [2]
Related works
• Wang and Gupta
– Advantage: captures intraclass variability (illumination, occlusion, viewpoint, pose, and clutter)
– Disadvantage: different instances of the same object are not necessarily clustered together semantically.
Source: Wang and Gupta [2]
Related works
• Agrawal et al.
– Freely available egomotion
– Siamese Network (on MNIST)
– Use the learned features in object recognition (ILSVRC-2012), scene recognition (SUN), intraclass keypoint matching (PASCAL VOC 2012) and visual odometry (SF)
Source: Agrawal et al. [3]
Related works
• Agrawal et al.
– Disadvantage:
• Intraclass variability is limited.
• Learned features focus on similarities (color and texture) rather than high-level structure.
Source: Agrawal et al. [3]
Related works
• Doersch et al.
– Convolutional network
– Classifies the relative position of two patches
– ImageNet 2012
Source: Doersch et al. [4]
Related works
• Doersch et al.
– Use the learned features in object detection (PASCAL VOC 2007) and visual data mining (PASCAL VOC 2011)
– Many ambiguities (only two patches).
Source: Doersch et al. [4]
The Jigsaw puzzles
• Origin
– Introduced by John Spilsbury (1760)
– Associated with learning
– Hooper Visual Organization Test
• visual perception, construction and integration
Source : http://www.jigzone.com
The Jigsaw puzzles: Motivation
• Reassembly problem
– Visuospatial representation of objects
– Observing all tiles together intersects the ambiguities of the individual tiles and, ideally, reduces them to a single solution.
Source: Noroozi and Favaro
Method
• Solving the puzzle
– Convolutional Neural Network (CNN)
• Immediate solution:
– Stack the 9 tiles along the channel axis: 9 × 3 = 27 input channels
– Increase the filter depth in the 1st layer of AlexNet accordingly
– Problem: the CNN learns low-level texture statistics close to the tile boundaries
– An understanding of the global object is needed instead
• Idea
– First compute features based on each tile’s pixels
– Delay the computation of statistics across tiles
Method
• Proposed Architecture
– Siamese-ennead convolutional neural network
– Context Free Network (CFN)
– Context is only handled in the last fully connected layers.
– Each row up to the fc6 layer uses the AlexNet architecture.
– Weights are shared across the nine rows up to fc6; the fc6 outputs are concatenated and fed to fc7.
Source: Noroozi and Favaro
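The late-fusion idea on this slide can be illustrated with a toy forward pass. This is a minimal sketch, not the authors' network: a single shared linear layer with ReLU stands in for each AlexNet-style row, and all names (`init_params`, `cfn_forward`, `W_tile`, `W_fc7`, `W_fc8`) and layer sizes are assumptions chosen to keep the example small.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def init_params(rng, tile_dim, feat_dim=32, hidden=64, n_perms=100):
    """Toy weights; sizes are illustrative, not the sizes used in the paper."""
    scale = 0.01
    return {
        # One weight matrix reused for all nine tiles (the shared "row").
        "W_tile": rng.normal(0.0, scale, (tile_dim, feat_dim)),
        # Context is only mixed from here on, after concatenation.
        "W_fc7": rng.normal(0.0, scale, (9 * feat_dim, hidden)),
        "W_fc8": rng.normal(0.0, scale, (hidden, n_perms)),
    }


def cfn_forward(tiles, params):
    """tiles: array of shape (9, H, W, C). Returns probabilities over permutations."""
    flat = tiles.reshape(tiles.shape[0], -1)   # (9, tile_dim)
    feats = relu(flat @ params["W_tile"])      # same weights applied to every tile
    joint = feats.reshape(-1)                  # concatenate the nine feature vectors
    hidden = relu(joint @ params["W_fc7"])
    return softmax(hidden @ params["W_fc8"])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tiles = rng.random((9, 8, 8, 3))           # tiny tiles to keep the demo cheap
    params = init_params(rng, tile_dim=8 * 8 * 3)
    probs = cfn_forward(tiles, params)
    print(probs.shape, float(probs.sum()))
```

Because `W_tile` sees one tile at a time, no per-tile feature can depend on the neighbouring tiles; only `W_fc7` and `W_fc8` combine information across tiles.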
Method
• Context Free Network (CFN)
Source : Noroozi and Favaro
Method
• CFN vs. AlexNet Architecture
– Each CFN row has the same layers as AlexNet
– Stride in the first layer is set to 2 instead of 4
– CFN is more compact than AlexNet
• Total: 27.5M vs. 61M parameters in AlexNet
• fc6 layer: ~2M vs. 37.5M parameters in AlexNet
• fc7 layer: 2M more parameters than the same layer in AlexNet
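The fc6 and fc7 figures above can be checked with a short back-of-the-envelope calculation. The layer shapes used here are assumptions not stated on the slide: a 6 × 6 × 256 feature map into 4096 units for AlexNet fc6, a 4 × 4 × 256 map into 512 units for each CFN row, and nine concatenated 512-dimensional vectors into 4096 units for the CFN fc7. Only weights are counted, biases are ignored.

```python
# Weight counts only (biases ignored); all layer shapes are assumptions.
alexnet_fc6 = 6 * 6 * 256 * 4096   # 6x6x256 feature map into 4096 units
cfn_fc6 = 4 * 4 * 256 * 512        # assumed 4x4x256 feature map into 512 units
alexnet_fc7 = 4096 * 4096          # 4096 units into 4096 units
cfn_fc7 = 9 * 512 * 4096           # nine concatenated fc6 outputs into 4096 units

for name, count in [
    ("AlexNet fc6", alexnet_fc6),
    ("CFN fc6", cfn_fc6),
    ("AlexNet fc7", alexnet_fc7),
    ("CFN fc7", cfn_fc7),
    ("fc7 difference", cfn_fc7 - alexnet_fc7),
]:
    print(f"{name}: {count / 1e6:.1f}M")
```

Under these assumptions the CFN fc6 has about 2.1M weights against about 37.7M for AlexNet, and the CFN fc7 has about 2.1M more weights than AlexNet's, which is in line with the rounded numbers on the slide.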
Training the CFN
• Input data
– ImageNet (1.3M Images)
– Resize each input image so that its height or width is 256 pixels
– Crop a random 225 × 225 region
– Split it into a 3 × 3 grid of 75 × 75 pixel tiles
– Extract a randomly shifted 64 × 64 region from each tile
– No color dropping or filling with noise
[Figure: a 225 × 225 crop divided into tiles, with a 64 × 64 region extracted from each tile.]
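The cropping steps above can be sketched in a few lines of NumPy. This is an illustrative sketch that assumes the image has already been resized; the function name `extract_tiles` and its parameters are hypothetical, and the uniform sampling of offsets is an assumption.

```python
import numpy as np


def extract_tiles(image, rng, crop=225, grid=3, tile=64):
    """Return grid*grid tiles of size tile x tile from a random crop of `image`.

    image: array of shape (H, W, C) with H >= crop and W >= crop.
    """
    h, w = image.shape[:2]
    # Random 225 x 225 crop.
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    region = image[top:top + crop, left:left + crop]

    cell = crop // grid  # 75 pixels per grid cell
    tiles = []
    for row in range(grid):
        for col in range(grid):
            # Random shift of the 64 x 64 window inside its 75 x 75 cell.
            dy = rng.integers(0, cell - tile + 1)
            dx = rng.integers(0, cell - tile + 1)
            y = row * cell + dy
            x = col * cell + dx
            tiles.append(region[y:y + tile, x:x + tile])
    return np.stack(tiles)  # shape (9, 64, 64, C) with the defaults


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((256, 340, 3))  # stand-in for a resized ImageNet image
    print(extract_tiles(image, rng).shape)
```

The random shift leaves a gap of up to 11 pixels between neighbouring tiles, so tile borders no longer line up exactly.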
Training the CFN
• Jigsaw Puzzle task
– Set of tile configurations
– Rearrange the input tiles according to one configuration.
– Use only a subset of 100 of the 9! = 362,880 possible permutations.
– Select them based on Hamming distance (minimal, average, or maximal).
– Generate them in each iteration via hash tables.
– Output: a vector of probability values, one per configuration
[Figure: the tiles of a 3 × 3 grid (1 to 9) are shuffled into the order 8 7 3 / 6 5 1 / 4 2 9 and fed to the CFN. A table of possible solutions maps index 1 to {1,2,3,4,5,6,7,8,9}, index 2 to {7,8,3,2,5,4,6,1,9}, index 3 to {8,7,3,6,5,1,4,2,9}, ..., index 100 to {6,1,3,7,5,2,8,4,9}. The output vector is 1.0 at index 3 and 0.0 elsewhere.]
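One way to build such a permutation subset is a greedy search that keeps adding the permutation farthest, in total Hamming distance, from those already chosen. The sketch below shows only this "maximal" variant and is an assumption about the procedure, not the authors' code; `select_permutations` and `mean_hamming` are hypothetical names, and tiles are numbered from 0 rather than 1.

```python
import itertools
import numpy as np


def select_permutations(n_tiles=9, n_select=100, seed=0):
    """Greedily pick permutations that are far apart in Hamming distance."""
    rng = np.random.default_rng(seed)
    all_perms = np.array(list(itertools.permutations(range(n_tiles))), dtype=np.int8)
    available = np.ones(len(all_perms), dtype=bool)
    dist_sum = np.zeros(len(all_perms), dtype=np.int64)

    current = int(rng.integers(len(all_perms)))  # random starting permutation
    chosen = [current]
    available[current] = False
    for _ in range(n_select - 1):
        # Add each candidate's Hamming distance to the newest pick.
        dist_sum += (all_perms != all_perms[current]).sum(axis=1)
        # Next pick: the unused candidate with the largest total distance.
        current = int(np.where(available, dist_sum, -1).argmax())
        chosen.append(current)
        available[current] = False
    return all_perms[chosen]


def mean_hamming(perms):
    """Average pairwise Hamming distance, normalised by the number of tiles."""
    n = len(perms)
    diff = (perms[:, None, :] != perms[None, :, :]).mean(axis=2)
    return diff.sum() / (n * (n - 1))  # diagonal entries are zero


if __name__ == "__main__":
    perms = select_permutations()      # 100 permutations of 9 tiles
    print(perms.shape, round(float(mean_hamming(perms)), 2))
```

The row index of a permutation in the returned array can serve as the class label, so the training target is a one-hot vector like the one in the figure.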
Training the CFN
• Jigsaw Puzzle Task
– Output as a probability density function (PDF) over the spatial arrangements of scene parts
$p(S \mid A_1, A_2, \ldots, A_9) = p(S \mid F_1, F_2, \ldots, F_9) \prod_{i=1}^{9} p(F_i \mid A_i)$
– $S$: configuration of the tiles
– $A_i$: appearance of the i-th part of the object
– $F_i$: intermediate feature representation
– Goal: train the CFN so that the features $F_i$ carry semantic attributes that identify the relative position of each part
– High-dimensional PDF
Training the CFN
24
• Jigsaw Puzzle Task
– Output as a PDF of scene part’s spatial arrangements
𝑝 S 𝐴1, 𝐴2, … , 𝐴9 = 𝑝 S 𝐹1, 𝐹2, … , 𝐹9 𝑝(
9
𝑖=1
𝐹𝑖|𝐴𝑖)
– Problem : CFN learns to associate each 𝐴𝑖 to an absolute
position. 𝐹𝑖 will have no semantic meaning.
– Strategy : feed several puzzles of the same image
• (average 69 /100 configurations)
– If 𝑆 = {𝐿1, 𝐿2, … , 𝐿9}, then
𝑝 𝐿1, 𝐿2, … , 𝐿9 𝐹1, 𝐹2, … , 𝐹9 = 𝑝(9𝑖=1 𝐹𝑖|𝐿𝑖)
– About 90M jigsaw puzzles(from 1.3M images)
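The 90M figure follows from the two numbers on this slide, assuming every one of the 1.3M images contributes the average of 69 puzzles:

```latex
\[
1.3 \times 10^{6}\ \text{images} \times 69\ \text{puzzles per image}
\approx 8.97 \times 10^{7} \approx 90\text{M puzzles}
\]
```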
Implementation
• Jigsaw Puzzle task
– Stochastic gradient descent
– Without batch normalization
– Titan X GPU
– Converges after 350K iterations
– Base learning rate 0.01
– 59.5 hours in total (about 2.5 days)
CFN Filter activations
• Visualization of top 16 activations
• 6 significant hand-picked channels
• 20 randomly sampled 64 × 64 patches from the ImageNet validation set
[Figure: Conv1 filters]
Source: Noroozi and Favaro
CFN Filter activations
Conv1 activations
• Different types of textures
Conv2 activations
• Different types of textures
Source: Noroozi and Favaro
CFN Filter activations
Conv3 activations
• Face Detector
Conv4 activations
• Part Detector
Source : Noroozi and Favaro
CFN Filter activations
Conv5 activations
• Other Part Detectors
• Scene Part Detectors
Source : Noroozi and Favaro
Results
• Experiment 1: transfer learning from the classification task to Jigsaw puzzles
– Goal: show the relation between object classification and the jigsaw puzzle task.
– Transfer features from a pre-trained AlexNet to solve Jigsaw puzzles
– Use a locking scheme
– Semantic training is helpful.
Source: Noroozi and Favaro
Results
• Experiment 2: object classification
– Question: from which layer should the features be extracted?
– The last layers of AlexNet are specific to the task, while the first layers are general-purpose.
– Repurpose the CFN, [2], and [4] for classification on ImageNet 2012.
– Use a locking scheme
– Reference: AlexNet reaches a maximum accuracy of 57.4%
Source: Noroozi and Favaro, [5]
Results
• Experiment 3: object detection
– Use CFN features for object detection with Fast R-CNN [6].
– Baseline: AlexNet trained on ImageNet as pre-training weights for Fast R-CNN (56.5% mAP)
– Initialize the fully connected layers in Fast R-CNN with Gaussian random weights (mean: 0.1, std: 0.001).
– Step strategy
• Base learning rate: 0.001
• Step: 5K
• Max. iterations: 150K
– Evaluate all methods on PASCAL VOC 2007
Results
• Experiment 3: object detection
– CFN pre-trained on ImageNet (CFN-Sup): 56.3% mAP
– CFN pre-trained with jigsaw puzzles
• CFN-4: based on a 2×2 tile grid
• CFN-9: based on a 3×3 tile grid
– CFN-9(min): average Hamming distance 0.45
– CFN-9(middle): average Hamming distance 0.67
– CFN-9(max): average Hamming distance 0.88
* Using R-CNN
Source: Noroozi and Favaro
Results
• Experiment 4: image retrieval
– Find the nearest neighbors (NN) of pool5 features
• Bounding boxes of the PASCAL VOC 2007 test set (queries)
• Bounding boxes of the trainval set (retrieval entries)
– Discard bounding boxes containing fewer than 10K pixels
– Rank the images
• By the inner product between the normalized features of a query image and the normalized features of the retrieval set
– Show the top 4 matches
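The ranking step above amounts to a cosine-similarity search. A minimal sketch, assuming L2 normalization and using toy 2-D vectors in place of pool5 features; `rank_by_cosine` is a hypothetical helper name.

```python
import numpy as np


def rank_by_cosine(query, gallery, top_k=4):
    """Rank gallery rows by inner product of L2-normalised features.

    query: array of shape (D,); gallery: array of shape (N, D).
    Returns the indices of the top_k matches and their scores.
    """
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    scores = g @ q                       # cosine similarity to every entry
    order = np.argsort(-scores)[:top_k]  # highest similarity first
    return order, scores[order]


if __name__ == "__main__":
    gallery = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
    query = np.array([2.0, 0.1])
    order, scores = rank_by_cosine(query, gallery)
    print(order, scores)
```

Normalizing both sides makes the score independent of feature magnitude, so only the direction of the feature vector decides the ranking.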
Results
• Experiment 4: image retrieval
– Qualitative evaluation
[Figure columns: Query, AlexNet, CFN, [4] Doersch et al.]
Source: Noroozi and Favaro
Results
• Experiment 4: image retrieval
– Qualitative evaluation
[Figure columns: Query, [2] Wang and Gupta, [1] AlexNet with random weights]
Source: Noroozi and Favaro
Results
• Experiment 4: image retrieval
– Quantitative evaluation
Source: Noroozi and Favaro
Summary
• Context Free Network (CFN)
– Features transfer between Jigsaw puzzle reassembly and detection/classification tasks (the tasks are compatible).
– Requires no manual labeling.
• Shorter convergence time (2.5 days) than Doersch et al. (4 weeks).
• In object classification
– On ImageNet 2012, without fine-tuning: 38.1% (best among the unsupervised methods)
• In object detection
– On PASCAL VOC 2007: 51.8% mAP (the learned features perform close to the supervised AlexNet at 56.5% mAP)
[Diagram: learning the features (pretext task: solving the Jigsaw puzzle), then extracted features, then transfer learning to utilize the features (classification / object detection).]
References
[1] Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25, pp. 1097-1105 (2012)
[2] Wang, X., Gupta, A.: Unsupervised learning of visual representations using videos. ICCV (2015)
[3] Agrawal, P., Carreira, J., Malik, J.: Learning to see by moving. The IEEE International Conference on Computer Vision (ICCV) (December 2015)
[4] Doersch, C., Gupta, A., Efros, A.A.: Unsupervised visual representation learning by context prediction. ICCV (2015)
[5] Yosinski, J., Clune, J., Bengio, Y., Lipson, H.: How transferable are features in deep neural networks? NIPS, pp. 3320-3328 (2014)
[6] Girshick, R.: Fast R-CNN. The IEEE International Conference on Computer Vision (ICCV) (December 2015)
Thank you
Q&A