Unsupervised Deep Learning · 2017-11-03
TRANSCRIPT
![Page 1: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/1.jpg)
UNSUPERVISED DEEP LEARNING
Erez Aharonov, Noam Eilon
Deep Learning Seminar, School of Electrical Engineering, Tel Aviv University
1
![Page 2: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/2.jpg)
Building High-level Features Using Large Scale Unsupervised Learning
Quoc V. Le
Marc’Aurelio Ranzato
Rajat Monga
Matthieu Devin
Kai Chen
Greg S. Corrado
Jeff Dean
Andrew Y. Ng
2012
2
![Page 3: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/3.jpg)
Outline
• Short introduction - Unsupervised Learning
• Overview
• Training Deep autoencoder
• Model Architecture
• Parallelism and ASGD
• Results
3
![Page 4: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/4.jpg)
Supervised Learning
[Figure: block diagram with the labels Input Data, Learning Machine, Outputs, Objective, External rewards, Target, World, Machine]
4
![Page 5: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/5.jpg)
Unsupervised Learning
[Figure: block diagram with the labels Input Data, Learning Machine, Outputs, Objective, Intrinsic rewards, Target, World, Machine]
5
![Page 6: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/6.jpg)
Three Kinds of Learning
| | Supervised Learning | Unsupervised Learning | Reinforcement Learning |
| --- | --- | --- | --- |
| Input | X (data), Y (label) | X (data) | Current state, reward |
| Goal | Learn a function to map X to Y | Learn structure | Optimize reward |
| Limitation | Availability of labeled data | Complexity and size | Training model |
| Examples | Classification, segmentation, object detection, image captioning | Feature learning, generative models | Policy/decisions/games |
6
![Page 7: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/7.jpg)
Overview
• Building high-level, class-specific feature detectors from unlabeled data.
• How can a perceptual system build itself by looking at the world? How much prior structure is necessary?
• Could a network learn, in an unsupervised way, to be sensitive to high-level concepts such as human faces or cats?
• Inspiration: “grandmother neurons”, each of which represents a complex but specific concept or object.
“Invariant visual representation by single neurons in the human brain,” Quian Quiroga et al.
7
![Page 8: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/8.jpg)
Main concept: Deep Autoencoders
• Hierarchy of representations with increasing level of abstraction.
• Each module transforms its input representation into a higher-level one.
• High-level features are more global and more invariant.
• Low-level features are shared among categories.
[Figure: autoencoder diagram. Inputs x1 to x5 pass through an Encoding stage and a Decoding stage to produce reconstructions x'1 to x'5.]
8
![Page 9: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/9.jpg)
Training Deep autoencoders
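End-to-end training:
[Figure: autoencoder diagram. Inputs x1 to x5 pass through an Encoding stage and a Decoding stage to produce reconstructions x'1 to x'5.]
• Encoding and decoding through all layers.
• Calculating the loss between the input and its reconstruction.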
9
![Page 10: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/10.jpg)
Training Deep autoencoders
Greedy layer-wise training:
• Training each layer separately as an autoencoder.
• The input of each autoencoder is the output of the previous level of the hierarchy.
• Fine-tuning on the full network.
[Figure: a sequence of progressively smaller autoencoders (5, 4, and 3 units), each with inputs x and reconstructions x', trained one after another.]
10
![Page 11: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/11.jpg)
The Network Outline
• 3 Encoding-Decoding layers.
• 9 Layer autoencoder.
• All parameters in our model were trained jointly with the objective being the sum of the objectives of the three layers.
[Figure: Image, followed by three repetitions of Encode, Pool & LCN, Decode. Labels: 200x200 image, 60,000 neurons.]
11
![Page 12: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/12.jpg)
One layer architecture
First sublayer: Local receptive fields.
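• 18x18 pixel receptive field (RF) windows.
• 8 feature maps.
• Each neuron connects to all input channels.
• Not convolutional: weights are not shared across locations, which allows learning invariances other than translation.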
12
![Page 13: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/13.jpg)
One layer architecture
Second sublayer - Pooling
• L2 pooling.
• 5x5 overlapping windows.
• H: fixed pooling matrix.
• Pooling over one feature.
• Improves invariance to local deformations.
$$y_{j,i} = \sqrt{\sum_{u,v} H_{u,v}\, g_{j+u,i+v}^{2}}$$
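The pooling formula can be illustrated with a small sketch. This is not the authors' implementation: the function name `l2_pool`, the stride of 1, the absence of padding, and the 2x2 all-ones window (used instead of 5x5 to keep the example short) are all assumptions made for illustration.

```python
import math

def l2_pool(feature_map, weights):
    """L2-pool one feature map with a fixed weight window (stride 1, no padding)."""
    size = len(weights)
    rows = len(feature_map) - size + 1
    cols = len(feature_map[0]) - size + 1
    pooled = []
    for j in range(rows):
        row = []
        for i in range(cols):
            # Weighted sum of squared activations inside the window, then square root.
            total = sum(
                weights[u][v] * feature_map[j + u][i + v] ** 2
                for u in range(size)
                for v in range(size)
            )
            row.append(math.sqrt(total))
        pooled.append(row)
    return pooled

# Hypothetical toy input: one 3x3 feature map and a 2x2 window of ones.
feature_map = [[3.0, 4.0, 0.0],
               [0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0]]
ones = [[1.0, 1.0], [1.0, 1.0]]
print(l2_pool(feature_map, ones))  # [[5.0, 4.0], [0.0, 0.0]]
```

Because the activations are squared before summing, flipping the sign of every input leaves the pooled output unchanged, which is one simple way to see why this kind of pooling tolerates small local changes.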
13
![Page 14: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/14.jpg)
One layer architecture
Third sublayer – Local contrast normalization
$$g_{i,j,k} = h_{i,j,k} - \sum_{i,u,v} G_{u,v}\, h_{i,j+u,k+v}$$
$$y_{i,j,k} = \frac{g_{i,j,k}}{\max\left\{c,\ \left(\sum_{i,u,v} G_{u,v}\, g_{i,j+u,k+v}^{2}\right)^{0.5}\right\}}$$
G: Gaussian-weighted 5x5 window
c: small constant to prevent numerical errors
i: channel index; u, v: offsets within the window
• 5x5 overlapping windows.
• Connects to all input channels.
14
![Page 15: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/15.jpg)
Local contrast normalization
• Relatively dominant activations are preferred over activations that are high on all features.
• Enforces a form of local competition between adjacent features, and between features at the same spatial location in different feature maps.
• Improves optimization.
[Figure: two LCN examples. The first LCN block is shown with the values (0, 1, 0, -1) and (0, 0.5, 0, -0.5); the second with (10, 10, 10, 10) and (0, 0, 0, 0).]
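A simplified sketch of the idea behind local contrast normalization follows. It treats a short list of activations as a single neighborhood, whereas the slides apply a Gaussian-weighted 5x5 window around every location and across all channels. The function name, the uniform weights, and the value of `c` are assumptions, so the numbers are not meant to reproduce the figure.

```python
import math

def local_contrast_normalize(values, weights, c=0.01):
    """Subtract the weighted mean, then divide by the weighted standard deviation."""
    mean = sum(w * v for w, v in zip(weights, values))
    centered = [v - mean for v in values]
    scale = math.sqrt(sum(w * g * g for w, g in zip(weights, centered)))
    # max(c, scale) guards against division by (almost) zero in flat regions.
    return [g / max(c, scale) for g in centered]

uniform = [0.25] * 4  # assumed weights that sum to 1
flat = local_contrast_normalize([10.0, 10.0, 10.0, 10.0], uniform)
weak = local_contrast_normalize([0.0, 2.0, 0.0, -2.0], uniform)
strong = local_contrast_normalize([0.0, 20.0, 0.0, -20.0], uniform)
print(flat)    # [0.0, 0.0, 0.0, 0.0]
print(weak)
print(strong)
```

A uniformly high input is mapped to zeros, while `weak` and `strong` produce the same output even though their magnitudes differ by a factor of ten: only the relative dominance of an activation within its neighborhood survives.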
15
![Page 16: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/16.jpg)
One Layer summary
Input: 3 x 200 x 200 image x_i
First sublayer (W1), local receptive fields:
• 18x18 pixel RF windows.
• Not convolutional.
• 8 feature maps.
• Each neuron connects to all input channels.
Second sublayer (H), pooling:
• 5x5 overlapping pooling windows.
• L2 pooling.
• Pooling over one feature.
Third sublayer, local contrast normalization:
• Pooling over all features.
Output: 8 output maps, 171x171.
16
![Page 17: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/17.jpg)
9 Layer structure
[Figure: the 3 x 200 x 200 image x_i passes through three stacked stages. Each stage consists of local receptive fields (W1^1, W1^2, W1^3), L2 pooling (H), and LCN.]
17
![Page 18: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/18.jpg)
The Optimization Problem
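$$\underset{W_1, W_2}{\text{minimize}} \ \sum_{i=1}^{m} \left( \left\| W_2 W_1^T x^{(i)} - x^{(i)} \right\|_2^2 + \lambda \sum_{j=1}^{k} \sqrt{\epsilon + H_j \left( W_1^T x^{(i)} \right)^2} \right)$$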
𝑊1 Encoding matrix
𝑊2 Decoding matrix
𝜆 Tradeoff between sparsity and reconstruction (0.1)
k Number of pooling units
𝐻𝑗 Vector of weights of the j-th pooling unit (constant)
𝜖 Numerical stability constant
m Number of examples
ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning, Le, Q. V. et al.
18
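To make the two terms of the objective concrete, here is a minimal sketch that evaluates it for a toy layer. The helper names, the argument layout (`w1_t` holds the rows of W1^T, `pooling` holds the rows H_j), and the tiny identity weights are assumptions for illustration only.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def layer_objective(examples, w1_t, w2, pooling, lam=0.1, eps=1e-8):
    """Sum over examples of reconstruction cost plus lam times pooled sparsity."""
    total = 0.0
    for x in examples:
        code = matvec(w1_t, x)        # W1^T x
        recon = matvec(w2, code)      # W2 W1^T x
        recon_cost = sum((r - xi) ** 2 for r, xi in zip(recon, x))
        squared = [a * a for a in code]
        # One term per pooling unit j: sqrt(eps + H_j (W1^T x)^2).
        sparsity = sum(
            math.sqrt(eps + sum(h * s for h, s in zip(h_j, squared)))
            for h_j in pooling
        )
        total += recon_cost + lam * sparsity
    return total

# Hypothetical toy setup: identity encoder and decoder, one pooling unit over both codes.
identity = [[1.0, 0.0], [0.0, 1.0]]
value = layer_objective([[3.0, 4.0]], identity, identity, [[1.0, 1.0]], lam=0.1, eps=0.0)
print(value)
```

With identity weights the reconstruction is perfect, so the whole value comes from the sparsity term: 0.1 times sqrt(3^2 + 4^2), which is about 0.5.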
![Page 19: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/19.jpg)
The Optimization Problem
$$\underset{W_1, W_2}{\text{minimize}} \ \sum_{i=1}^{m} \left( \left\| W_2 W_1^T x^{(i)} - x^{(i)} \right\|_2^2 + \lambda \sum_{j=1}^{k} \sqrt{\epsilon + H_j \left( W_1^T x^{(i)} \right)^2} \right)$$
Global reconstruction cost (first term): ensures the representations encode important information about the data, i.e., they can reconstruct the input data.
Group sparsity / spatial pooling (second term):
• Outputs of the second sublayer.
• A lower sum of activations is preferred.
• Encourages pooling to group similar features together to achieve invariances.
19
![Page 20: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/20.jpg)
Feature Grouping
Forces encodings to be organized in a topographic map by pooling together structure-correlated features that belong to the same hidden topic. More specifically, features that are near each other in the topographic map are relatively strongly dependent in the sense of mutual information.
Kavukcuoglu, Koray, Rob Fergus, and Yann LeCun. "Learning invariant features through topographic filter maps."
20
![Page 21: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/21.jpg)
Training the Network
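[Figure: the 3 x 200 x 200 image x_i feeds three stacked stages with encoding weights W1^1, W1^2, W1^3, fixed pooling H, and decoding weights W2^1, W2^2, W2^3. Each stage is labeled "8 LCN maps. 5x5 kernels. Unit computes"; the second and third stages take the LCN maps from the prior layer as input.]
$$\underset{W_1, W_2}{\text{minimize}} \ \sum_{i=1}^{m} \left( \left\| W_2 W_1^T x^{(i)} - x^{(i)} \right\|_2^2 + \lambda \sum_{j=1}^{k} \sqrt{\epsilon + H_j \left( W_1^T x^{(i)} \right)^2} \right)$$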
21
![Page 22: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/22.jpg)
Implementation
| Year | Deep network architecture | Parameters |
| --- | --- | --- |
| 2012 | AlexNet | 60M |
| 2014 | VGGNet | 138M |
| 2014 | GoogLeNet | 5M |
| 2012 | Google autoencoder | 1.15B |
Dataset: 10 million 200X200 unlabeled images from YouTube
Training: 1,000 machines with 16,000 CPU cores for three days
Parameters: 1.15B learned weights
22
![Page 23: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/23.jpg)
Model parallelism
• The network is partitioned across machines.
• Each machine stores the parameters of its partition.
• Partitions pass update messages to each other.
• Less fault tolerant (requires some recovery if any single machine fails).
• Good for convolutional layers, less so for fully-connected layers.
Large Scale Distributed Deep Networks, Dean et al.
23
![Page 24: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/24.jpg)
Data parallelism
• Multiple instances of the model.
• Each instance computes parameter updates.
• Each instance communicates its results to a parameter server.
Large Scale Distributed Deep Networks, Dean et al.
24
![Page 25: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/25.jpg)
Asynchronous Gradient Descent
TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems, Abadi et al
Synchronized parallelism
Each iteration:
• Waiting for all devices to finish.
• Calculating parameter updates.
• Updating the parameter server.
Asynchronous parallelism
• Each model replica runs separately.
• Updates the parameters without synchronization.
• Updates can be less accurate, because gradients may be computed from stale parameters.
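The difference between the two update schemes can be sketched with a deterministic toy simulation. Everything here is an assumption made for illustration: the one-dimensional loss (w - 3)^2, the four workers, the round-robin order in which workers finish, and the learning rate. It is not how DistBelief or TensorFlow implement parameter servers.

```python
def grad(w, target=3.0):
    """Gradient of the toy loss (w - target)^2."""
    return 2.0 * (w - target)

def synchronous_sgd(w, workers=4, steps=50, lr=0.1):
    """All workers read the same parameters; the server applies the averaged update."""
    for _ in range(steps):
        grads = [grad(w) for _ in range(workers)]
        w -= lr * sum(grads) / workers
    return w

def asynchronous_sgd(w, workers=4, steps=50, lr=0.1):
    """Each worker holds a possibly stale copy and pushes its update without waiting."""
    stale = [w] * workers
    for step in range(steps):
        k = step % workers        # worker k finishes next (round-robin for determinism)
        w -= lr * grad(stale[k])  # gradient was computed from worker k's stale copy
        stale[k] = w              # worker k then fetches the fresh parameters
    return w

print(synchronous_sgd(0.0))
print(asynchronous_sgd(0.0))
```

Both runs approach the optimum at 3, but the asynchronous one overshoots and oscillates on the way because each gradient refers to parameters that are a few updates old. In exchange, no worker ever waits for the slowest one.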
25
![Page 26: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/26.jpg)
Results - Detection
• Looking for a neuron that is sensitive to a high-level concept, i.e., a face, cat, or body-part detector.
• Method: a test set with a known positive/negative ratio. (Example, faces: 37,000 images, of which 13,026 are faces.)
• For each neuron, find the minimum and maximum activation values.
• Split the activation range into 20 equally spaced thresholds.
• Pick the neuron and threshold that together give the highest accuracy.
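The selection procedure above can be sketched as follows. The function names and the toy activations are hypothetical, and whether the 20 thresholds include both endpoints of the activation range is an assumption.

```python
def best_threshold(activations, labels, steps=20):
    """Return (accuracy, threshold) for the best of `steps` equally spaced thresholds."""
    low, high = min(activations), max(activations)
    best = (0.0, low)
    for t in range(steps):
        threshold = low + (high - low) * t / (steps - 1)
        # Predict "positive" when the activation reaches the threshold.
        correct = sum((a >= threshold) == bool(y) for a, y in zip(activations, labels))
        accuracy = correct / len(labels)
        if accuracy > best[0]:
            best = (accuracy, threshold)
    return best

def best_neuron(activation_table, labels):
    """activation_table[n] lists neuron n's activations over the test set."""
    scored = [best_threshold(acts, labels) + (n,) for n, acts in enumerate(activation_table)]
    accuracy, threshold, neuron = max(scored)
    return neuron, threshold, accuracy

# Hypothetical test set of four images: two positives (1) and two negatives (0).
labels = [1, 1, 0, 0]
table = [
    [0.9, 0.8, 0.2, 0.1],  # neuron 0 separates the classes
    [0.5, 0.1, 0.6, 0.2],  # neuron 1 does not
]
neuron, threshold, accuracy = best_neuron(table, labels)
print(neuron, accuracy)  # 0 1.0
```

Note that the labels are only used to evaluate neurons after training; they play no role in learning the features.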
26
![Page 27: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/27.jpg)
Results - Detection
Summary of numerical comparisons against other baselines:
Histograms of faces (red) vs. no faces (blue):
27
![Page 28: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/28.jpg)
Results - Invariance
• Method: choose 10 face images and apply distortions to them, e.g., scaling and translation.
• Out-of-plane rotation: 10 images of faces rotating in 3D.
[Figure: invariance plots; horizontal axes in pixels.]
28
![Page 29: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/29.jpg)
Results - Visualization
[Figure, left: the most responsive stimuli in the test set. Right: the optimal stimulus according to numerical constrained optimization.]
29
![Page 30: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/30.jpg)
Results – ImageNet
• Unsupervised training on YouTube and ImageNet images.
• Logistic classifier on top of the highest layer.
• The logistic classifiers were trained first, and then the network was fine-tuned.
• The entire training was carried out on 2,000 machines for one week.
Summary of classification accuracies for our method and other state-of-the-art baselines on ImageNet.
30
![Page 31: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/31.jpg)
Summary
• This work shows that it is possible to train neurons to be selective for high-level concepts using entirely unlabeled data.
• The network was able to learn invariances from unlabeled data.
• Object recognition on ImageNet: A significant leap of 70% relative improvement over the state-of-the-art.
Google Builds a Brain that Can Search for Cat Videos, Time, June 2012
How Many Computers to Identify a Cat? 16,000, NYT, June 2012
31
![Page 32: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/32.jpg)
Unsupervised Learning of Visual Representations using Videos
Xiaolong Wang, Abhinav Gupta
Robotics Institute, Carnegie Mellon University
Published in 2015
http://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Wang_Unsupervised_Learning_of_ICCV_2015_paper.pdf
32
![Page 33: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/33.jpg)
Agenda
• Overview
• Patch Mining in Videos
• CNN implementation
• Results
• Discussion and Conclusion
33
![Page 34: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/34.jpg)
Overview
• Do we really need millions of semantically-labeled images to learn a good representation?
• It seems humans can learn visual representations using little or no semantic supervision but our approaches still remain completely supervised.
34
![Page 35: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/35.jpg)
Overview
• Previous work on unsupervised learning:
  o Millions of static images or frames extracted from videos.
  o The most common architecture is an auto-encoder, which learns representations based on its ability to reconstruct the input images.
35
![Page 36: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/36.jpg)
Overview
• Previous work on unsupervised learning:
  o Has been able to automatically learn V1-like filters from unlabeled data, but is still far behind supervised approaches on tasks such as object detection.
36
![Page 37: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/37.jpg)
Overview
37
![Page 38: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/38.jpg)
Overview
• Key insight:
  o Visual tracking is one of the first capabilities that develops in infants, often before semantic representations are learned.
  o Using a video and tracking, we can produce patches of the same object. These patches should have similar visual representations in deep feature space.
38
![Page 39: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/39.jpg)
Overview
http://www.aoa.org/patients-and-public/good-vision-throughout-life/childrens-vision/infant-vision-birth-to-24-months-of-age?sso=y#1
39
![Page 40: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/40.jpg)
Overview
• Proposal:
  o Siamese-triplet network with a ranking loss function to train a CNN representation.
  o The ranking loss enforces that, in the final deep feature space, the first-frame patch is much closer to the tracked patch than to any other randomly sampled patch.
40
![Page 41: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/41.jpg)
Overview
• Proposal:
41
![Page 42: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/42.jpg)
Patch Mining in Videos
• Source for videos: YouTube
  o Estimated number of new videos uploaded: 300K per minute (2016)
• Tracking:
  o Obtain SURF interest points (Speeded-Up Robust Features, 2006)
  o Improved Dense Trajectories (IDT) to obtain motion (2013)
  o Kernelized correlation filter (KCF, 2014)
42
![Page 43: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/43.jpg)
Patch Mining in Videos
43
• Patches accepted:
  o > 25% of moving SURF points, and
  o < 75% of moving SURF points
![Page 44: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/44.jpg)
Patch Mining in Videos
44
![Page 45: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/45.jpg)
Siamese Triplet Network
• 3 networks that share the same parameters.
• Image of size 227 × 227 as input.
• Based on the AlexNet architecture.
• Two fully connected layers stacked on the pool5 outputs, with 4096 and 1024 neurons respectively.
• Thus the final output of each single network is a 1024-dimensional feature space.
45
![Page 46: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/46.jpg)
Siamese Triplet Network
46
![Page 47: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/47.jpg)
Ranking Loss Function
• Cosine distance in the feature space
$$D(X_1, X_2) = 1 - \frac{f(X_1) \cdot f(X_2)}{\lVert f(X_1) \rVert \, \lVert f(X_2) \rVert}$$
• Goal: $D(X_i, X_i^-) > D(X_i, X_i^+)$
• $X_i$: first frame patch
• $X_i^+$: last frame patch
• $X_i^-$: patch from a different video
47
![Page 48: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/48.jpg)
Ranking Loss Function
• Per triplet of images:
$$L(X_i, X_i^+, X_i^-) = \max\{0,\ D(X_i, X_i^+) - D(X_i, X_i^-) + M\}$$
• Total objective:
$$\min_{W} \ \frac{\lambda}{2} \lVert W \rVert_2^2 + \sum_{i=1}^{N} L(X_i, X_i^+, X_i^-)$$
M = 0.5
λ = 0.0005
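The per-triplet loss can be sketched directly from the two formulas. In this illustration the short lists stand in for the 1024-dimensional features f(X) produced by the network; the function names and the toy vectors are assumptions.

```python
import math

def cosine_distance(a, b):
    """D(a, b) = 1 - cosine similarity of two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def ranking_loss(anchor, positive, negative, margin=0.5):
    """Hinge loss that is zero once the negative is at least `margin` farther away."""
    return max(0.0, cosine_distance(anchor, positive)
               - cosine_distance(anchor, negative) + margin)

anchor = [1.0, 0.0]    # stand-in for f(X_i), the first-frame patch
positive = [1.0, 0.0]  # stand-in for f(X_i^+), the tracked patch
print(ranking_loss(anchor, positive, [0.0, 1.0]))  # 0.0, the negative is far enough
print(ranking_loss(anchor, positive, [1.0, 1.0]))  # positive loss, the negative is too close
```

The margin M = 0.5 means a triplet only stops contributing once the negative patch is at least 0.5 farther from the anchor (in cosine distance) than the tracked patch.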
48
![Page 49: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/49.jpg)
Patch Mining for Triplet Sampling
• Given $X_i$ and $X_i^+$, how to select $X_i^-$?
• Random selection:
  o For each pair of patches in a batch of size B, randomly sample K negative matches from the same batch.
  o Shuffle all the images randomly after each epoch of training.
49
![Page 50: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/50.jpg)
Patch Mining for Triplet Sampling
• Given $X_i$ and $X_i^+$, how to select $X_i^-$?
• Hard negative mining:
  o Applied after 10 epochs of training.
  o Choose the K samples from the batch with the highest loss.
  o K = 4, B = 100
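Hard negative mining can be sketched as a ranking of candidate negatives by their loss. The function name `hardest_negatives`, the two-dimensional stand-in features, and the choice of k = 2 in the example are assumptions; the slides use K = 4 within batches of B = 100.

```python
import math

def cosine_distance(a, b):
    """D(a, b) = 1 - cosine similarity of two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def hardest_negatives(anchor, positive, candidates, k=4, margin=0.5):
    """Return the indices of the k candidates with the highest ranking loss."""
    d_pos = cosine_distance(anchor, positive)
    losses = [max(0.0, d_pos - cosine_distance(anchor, c) + margin) for c in candidates]
    order = sorted(range(len(candidates)), key=lambda n: losses[n], reverse=True)
    return order[:k]

# Hypothetical features for one (anchor, positive) pair and four other patches in the batch.
anchor, positive = [1.0, 0.0], [1.0, 0.1]
candidates = [[0.0, 1.0], [1.0, 1.0], [1.0, 0.2], [-1.0, 0.0]]
print(hardest_negatives(anchor, positive, candidates, k=2))  # [2, 1]
```

Candidates that already sit far from the anchor have zero loss and contribute no gradient, so concentrating on the closest impostors keeps training informative once random negatives have become easy.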
50
![Page 51: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/51.jpg)
Adapting for Supervised Tasks
• Method #1:
  o Based on the RCNN paper.
  o Use the pre-trained unsupervised “AlexNet”-based network.
  o Parameters of the layers up to pool5 are used as initialization.
  o The two fully connected layers are initialized randomly.
  o Learning rate is 0.01 instead of 0.001 (RCNN).
51
![Page 52: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/52.jpg)
Adapting for Supervised Tasks
• Method #2:
  o Iterative approach:
    1) Fine-tune using the PASCAL VOC data.
    2) Re-adapt to the ranking triplet task.
    3) Again, transfer the convolutional parameters for re-adapting.
  o Network converges after two iterations.
52
![Page 53: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/53.jpg)
Implementation Details
• 100K videos into 8M patches
• 3 different networks using 1.5M, 5M and 8M patches
• Batch size: 100
• Initial learning rate: 0.001
• Random negative sampling for 150K iterations, afterwards hard negative mining
53
![Page 54: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/54.jpg)
Implementation Details
• 1.5M patches:
  o Reduce the learning rate by a factor of 10 every 80K iterations.
  o Total: 240K iterations
• 5M & 8M patches:
  o Reduce the learning rate by a factor of 10 every 120K iterations.
  o Total: 350K iterations
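The step schedule above amounts to one line of arithmetic. This sketch uses a hypothetical `learning_rate` helper with the initial rate of 0.001 from the previous slide; the 80K step applies to the 1.5M-patch network and 120K to the larger ones.

```python
def learning_rate(iteration, initial=0.001, step=80_000, factor=0.1):
    """Step schedule: multiply the rate by `factor` after every `step` iterations."""
    return initial * factor ** (iteration // step)

# 1.5M-patch network: drops at 80K and 160K within the 240K total iterations.
for iteration in (0, 79_999, 80_000, 160_000, 239_999):
    print(iteration, learning_rate(iteration))

# 5M and 8M-patch networks: same rule with a 120K step.
print(learning_rate(240_000, step=120_000))
```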
54
![Page 55: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/55.jpg)
Results: Learned features
55
![Page 56: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/56.jpg)
Results: Network response
56
![Page 57: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/57.jpg)
Results, no fine-tuning: Qualitative comparison
57
![Page 58: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/58.jpg)
Results, no fine-tuning: Quantitative comparison
• Measurement: retrieval rate, counting the number of correct retrievals in the top-K retrievals (K=20)
• Pool5 features with cosine distance
| Method | Score |
| --- | --- |
| This paper's method | 40% |
| ELDA on HOG | 24% |
| Random AlexNet | 19% |
| ImageNet CNN | 62% |
58
![Page 59: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/59.jpg)
Results, with fine-tuning: Object detection
• Follows the pipeline in RCNN
• PASCAL VOC 2012 dataset
• Trainval set & test set: ~10K images
• SVM classifier
• Learning rate: 0.01, multiplied by 0.1 every 80K iterations
• Total iterations for fine-tuning: 200K
• 21 classes
59
![Page 60: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/60.jpg)
Results, with fine-tuning: Object detection
60
![Page 61: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/61.jpg)
Results, with fine-tuning: Object detection
•Without using a single image from ImageNet, just 100K unlabeled videos and VOC 2012 dataset, an ensemble of AlexNet networks achieves 52% mAP.
• ImageNet-supervised counterpart: an ensemble which achieves 54.4% mAP
61
![Page 62: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/62.jpg)
Results, with fine-tuning: Surface Normal Estimation
62
![Page 63: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/63.jpg)
Results, with fine-tuning: Surface Normal Estimation
63
• 227 × 227 image as input
• Output of our network is 20 × 20 pixels
• Each pixel is represented by a distribution over 20 codewords, which are learned using k-means
• Dimension of the output is 20 × 20 × 20 = 8000
• Two fully connected layers with 4096 and 8000 neurons on top of pool5
![Page 64: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/64.jpg)
Results, with fine-tuning: Surface Normal Estimation
64
![Page 65: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/65.jpg)
Discussion and Conclusion
65
• Much more data available
• Might be as close as 2.5% in mAP to supervised networks
• Greater boost using ensemble of networks
• Can be generalized to different tasks
• Mimic human brain?
![Page 66: UNSUPERVISED DEEP LEARNING · 2017-11-03 · One Layer summery W 1 H 8 output maps 171x171. Local Contrast Normalization. Input 3 x 200 x 200 image Local receptive fields: •18x18](https://reader034.vdocument.in/reader034/viewer/2022050210/5f5cf9ac6aba2845a829dfa6/html5/thumbnails/66.jpg)
Questions
66