unsupervised deep learning · 2017-11-03 · one layer summery w 1 h 8 output maps 171x171. local...

UNSUPERVISED DEEP LEARNING

Erez Aharonov, Noam Eilon

Deep Learning Seminar, School of Electrical Engineering – Tel Aviv University

Building High-level Features Using Large Scale Unsupervised Learning

Quoc V. Le

Marc’Aurelio Ranzato

Rajat Monga

Matthieu Devin

Kai Chen

Greg S. Corrado

Jeff Dean

Andrew Y. Ng

2012

Outline

• Short introduction - Unsupervised Learning

• Overview

• Training Deep autoencoder

• Model Architecture

• Parallelism and ASGD

• Results


Supervised Learning

[Diagram labels: World, Input Data, Learning Machine, Outputs, Target, Machine. Objective: external rewards.]

Unsupervised Learning

[Diagram labels: World, Input Data, Learning Machine, Outputs, Target, Machine. Objective: intrinsic rewards.]

Three Kinds of Learning

Supervised Learning
• Input: X (data), Y (label)
• Goal: learn a function that maps X to Y
• Limitation: availability of labeled data
• Examples: classification, segmentation, object detection, image captioning

Unsupervised Learning
• Input: X (data)
• Goal: learn structure
• Limitation: complexity and size
• Examples: feature learning, generative models

Reinforcement Learning
• Input: current state, reward
• Goal: optimize reward
• Limitation: training the model
• Examples: policy, decisions, games

Overview

• Building high-level, class-specific feature detectors from unlabeled data.

• How can a perceptual system build itself by looking at the world? How much prior structure is necessary?

• Could a network learn, in an unsupervised way, to be sensitive to high-level concepts such as human faces or cats?

• Inspiration: "grandmother neurons", single neurons that represent a complex but specific concept or object.

“Invariant visual representation by single neurons in the human brain,” Quian Quiroga et al.

Main concept: Deep Autoencoders

• Hierarchy of representations with increasing level of abstraction.

• Each module transforms its input representation into a higher-level one.

• High-level features are more global and more invariant.

• Low-level features are shared among categories.

[Figure: autoencoder with inputs x1–x5, an encoding stage, a decoding stage, and reconstructions x'1–x'5.]

Training Deep autoencoders

End-to-end training:

• Encode and decode through all layers.

• Compute the loss between the input and its reconstruction.

[Figure: full autoencoder, inputs x1–x5 encoded and decoded into reconstructions x'1–x'5.]

Training Deep autoencoders

Greedy layer-wise training:

• Train each layer separately as an autoencoder.

• The input of each autoencoder is the output of the previous level of the hierarchy.

• Fine-tune the full network afterwards.

[Figure: a stack of progressively smaller autoencoders, each reconstructing the output of the one before it.]

The Network Outline

• 3 encoding-decoding stages.

• 9 layers in total: each stage has three sublayers (local filtering, pooling, local contrast normalization).

• All parameters in our model were trained jointly with the objective being the sum of the objectives of the three layers.

[Figure: 200x200 image → (Encode → Pool & LCN → Decode) repeated three times; 60,000 neurons.]

One layer architecture

First sublayer: Local receptive fields.

• 18x18 pixel receptive field (RF) windows.

• 8 feature maps.

• Each neuron connects to all input channels.

• Not convolutional (weights are not shared across locations), to allow more invariance.


Second sublayer - Pooling

• L2 pooling.

• 5x5 overlapping windows.

• H is a fixed pooling matrix.

• Each pooling unit pools over a single feature map.

• Improves invariance to local deformations.

$y_{j,i} = \sqrt{\sum_{u,v} H_{u,v}\, g_{j+u,\,i+v}^{2}}$
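The pooling formula can be sketched in a few lines of plain Python. This is only an illustration, not the authors' implementation: the function name `l2_pool` and the `stride` parameter are assumptions (the slides say the windows overlap but do not give the stride).

```python
import math

def l2_pool(g, H, stride=1):
    """L2-pool one 2-D feature map g with a fixed weight window H.

    Each output is sqrt(sum_{u,v} H[u][v] * g[j+u][i+v]**2).
    `stride` is an assumed parameter; any value below len(H) gives overlap.
    """
    k = len(H)
    rows = range(0, len(g) - k + 1, stride)
    cols = range(0, len(g[0]) - k + 1, stride)
    return [[math.sqrt(sum(H[u][v] * g[j + u][i + v] ** 2
                           for u in range(k) for v in range(k)))
             for i in cols]
            for j in rows]

# A 2x2 window of ones over a 2x2 map yields a single pooled value.
pooled = l2_pool([[3.0, 4.0], [0.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]])
```

Because the inputs are squared before summing, the output does not depend on the sign of a response, which is one reason this kind of pooling tolerates small local deformations.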



Third sublayer – Local contrast normalization

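Subtractive step:

$g_{i,j,k} = h_{i,j,k} - \sum_{i,u,v} G_{u,v}\, h_{i,\,j+u,\,k+v}$

Divisive step:

$y_{i,j,k} = \dfrac{g_{i,j,k}}{\max\{c,\ (\sum_{i,u,v} G_{u,v}\, g_{i,\,j+u,\,k+v}^{2})^{0.5}\}}$

• G: Gaussian-weighted 5x5 window.

• c: small constant that prevents numerical errors.

• i: channel (feature map) index; u, v: offsets within the window.

• 5x5 overlapping windows.

• Connects to all input channels.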
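The two normalization steps can be illustrated on a one-dimensional, single-channel signal. This is a simplified sketch under stated assumptions, not the network's 5x5 multi-channel layer: `lcn_1d` is a hypothetical name, the window wraps around at the borders, and the weights `G` are assumed to sum to 1.

```python
import math

def lcn_1d(h, G, c=0.01):
    """Local contrast normalization on a 1-D signal (illustrative only).

    G is an odd-length weight window assumed to sum to 1; indices wrap
    around at the borders, which is a simplification chosen for brevity.
    """
    n, r = len(h), len(G) // 2

    def window_sum(x, j):
        # Weighted sum of x over the window centred at position j.
        return sum(G[u + r] * x[(j + u) % n] for u in range(-r, r + 1))

    # Subtractive step: remove the local weighted mean.
    g = [h[j] - window_sum(h, j) for j in range(n)]
    # Divisive step: divide by the local weighted standard deviation,
    # floored at c to avoid dividing by values close to zero.
    g_sq = [v * v for v in g]
    return [g[j] / max(c, math.sqrt(window_sum(g_sq, j))) for j in range(n)]

flat = lcn_1d([10.0, 10.0, 10.0, 10.0], [0.25, 0.5, 0.25])
varied = lcn_1d([0.0, 1.0, 0.0, -1.0], [0.25, 0.5, 0.25])
```

A uniformly high input comes out as all zeros, while an input that varies locally keeps its pattern: what matters after normalization is how a response compares with its neighbours, not its absolute size.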


Local contrast normalization

• Relatively dominant activations are preferred over uniformly high activations on all features.

• Enforces a form of local competition between adjacent features, and between features at the same spatial location in different feature maps.

• Improves optimization.

[Figure: two LCN examples. Activations (0, 1, 0, -1) are shown alongside (0, 0.5, 0, -0.5); uniform activations (10, 10, 10, 10) are shown alongside (0, 0, 0, 0).]


One Layer Summary

Input: 3 x 200 x 200 image x_i

First sublayer (W1), local receptive fields:
• 18x18 pixel RF windows.
• Not convolutional.
• 8 feature maps.
• Each neuron connects to all input channels.

Second sublayer (H), pooling:
• 5x5 overlapping pooling windows.
• L2 pooling.
• Pooling over one feature map.

Third sublayer, local contrast normalization:
• Normalization over all feature maps.

Output: 8 output maps, 171x171.

9 Layer structure

[Figure: the 3 x 200 x 200 input image x_i passes through three stacked stages. Each stage applies local receptive fields (weights W1 of stage 1, 2 and 3), L2 pooling (H) and LCN.]

The Optimization Problem

$\underset{W_1, W_2}{\text{minimize}} \ \sum_{i=1}^{m} \left( \left\| W_2 W_1^{T} x^{(i)} - x^{(i)} \right\|_2^2 + \lambda \sum_{j=1}^{k} \sqrt{\epsilon + H_j \left( W_1^{T} x^{(i)} \right)^{2}} \right)$

• $W_1$: encoding matrix

• $W_2$: decoding matrix

• $\lambda$: tradeoff between sparsity and reconstruction (0.1)

• $k$: number of pooling units

• $H_j$: vector of weights of the j-th pooling unit (constant)

• $\epsilon$: numerical stability constant

• $m$: number of examples

ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning, Le, Q. V. et al.
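To make the two terms of the objective concrete, here is a minimal sketch that evaluates it for small dense matrices stored as nested lists. The names `matvec` and `objective`, the default `eps`, and the dense layout are assumptions for illustration; the real network uses local receptive fields rather than dense matrices.

```python
import math

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def objective(X, W1T, W2, H, lam=0.1, eps=1e-8):
    """Reconstruction cost plus pooled sparsity penalty, summed over X.

    W1T holds W1 transposed (hidden x input), W2 is input x hidden, and
    each row of H holds the fixed weights of one pooling unit.
    """
    total = 0.0
    for x in X:
        a = matvec(W1T, x)        # encoding W1^T x
        x_hat = matvec(W2, a)     # reconstruction W2 W1^T x
        recon = sum((r - xi) ** 2 for r, xi in zip(x_hat, x))
        a_sq = [v * v for v in a]  # element-wise square of the encoding
        sparsity = sum(math.sqrt(eps + hj) for hj in matvec(H, a_sq))
        total += recon + lam * sparsity
    return total

identity = [[1.0, 0.0], [0.0, 1.0]]
# Perfect reconstruction, one pooling unit over both hidden units.
value = objective([[3.0, 4.0]], identity, identity, [[1.0, 1.0]], eps=0.0)
```

With identity encoding and decoding the reconstruction term vanishes, so the value is just lambda times the pooled activation sqrt(3^2 + 4^2) = 5, that is 0.5.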

The Optimization Problem


The objective has two terms:

First term, global reconstruction cost:
• Ensures the representations encode important information about the data, i.e. they can reconstruct the input data.

Second term, group sparsity / spatial pooling:
• Computed on the outputs of the second sublayer.
• A lower sum of activations is preferred.
• Encourages pooling to group similar features together to achieve invariances.

Feature Grouping

Forces encodings to be organized in a topographic map by pooling together structure-correlated features that belong to the same hidden topic. More specifically, features that are near each other in the topographic map are relatively strongly dependent in the sense of mutual information.

Kavukcuoglu, Koray, Rob Fergus, and Yann LeCun. "Learning invariant features through topographic filter maps."

Training the Network

[Figure: training diagram. The 3 x 200 x 200 image x_i feeds three stacked stages; each stage has encoding weights W1, fixed pooling weights H, decoding weights W2, and 8 LCN maps with 5x5 kernels. Stages 2 and 3 take the LCN maps from the prior layer as input. Each stage is annotated with the objective above.]

Implementation

Year  Deep network architecture  Parameters
2012  AlexNet                    60M
2014  VGGNet                     138M
2014  GoogLeNet                  5M
2012  Google autoencoder         1.15B

Dataset: 10 million 200x200 unlabeled images from YouTube

Training: 1,000 machines with 16,000 CPU cores for three days

Parameters: 1.15B learned weights

Model parallelism

• The network is partitioned across machines.

• Each machine stores the parameters of its partition.

• Partitions pass update messages to each other.

• Less fault tolerant (requires some recovery if any single machine fails).

• Good for convolutional layers, less so for fully-connected layers.

Large Scale Distributed Deep Networks, Dean et al.

Data parallelism

• Multiple instances (replicas) of the model.

• Each replica computes parameter updates.

• Each replica communicates its results to a parameter server.

Large Scale Distributed Deep Networks, Dean et al.

Asynchronous Gradient Descent

Synchronous parallelism, at each iteration:
• Wait for all devices to finish.
• Compute the parameter updates.
• Update the parameter server.

Asynchronous parallelism:
• Each model replica runs separately.
• Replicas update the parameters without synchronization.
• Updates can be less accurate with ASGD, because a gradient may be computed on parameters that have since changed.

TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems, Abadi et al.
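The effect of unsynchronized updates can be shown with a toy, single-process simulation. Everything here is an assumption made for illustration (the quadratic loss, the round-robin schedule, the name `async_sgd`); it is not the distributed system described in the cited papers.

```python
def async_sgd(n_workers=4, steps=200, lr=0.05, w0=5.0):
    """Simulate asynchronous SGD on f(w) = w**2 with stale gradients.

    Each simulated replica computes its gradient on the parameter copy it
    fetched the last time it pushed an update, so every update applied to
    the server is up to n_workers - 1 steps out of date.
    """
    def grad(w):
        return 2.0 * w                 # gradient of f(w) = w**2

    server_w = w0
    local = [w0] * n_workers           # last-fetched copy held by each replica
    for t in range(steps):
        k = t % n_workers              # replica that finishes at this tick
        server_w -= lr * grad(local[k])  # push a gradient from stale parameters
        local[k] = server_w            # fetch the fresh parameters
    return server_w

final_w = async_sgd()
```

With a small learning rate the stale updates still drive `w` to the minimum at 0; increasing `lr` or `n_workers` in this toy makes the staleness matter more, which mirrors the accuracy caveat on the slide.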


Results -Detection

• Goal: find a neuron that is sensitive to a high-level concept, i.e. a face, cat or body-part detector.

• Method: use a test set with a known positive/negative ratio. (Example, faces: 37,000 images, of which 13,026 contain faces.)

• For each neuron, find the minimum and maximum activation values.

• Split the activation range into 20 equally spaced thresholds.

• Pick the neuron and the threshold that together give the highest classification accuracy.
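The selection procedure above can be sketched as follows. The function names are hypothetical, and two details are assumptions that the slides do not state: the thresholds include both ends of the activation range, and an activation at or above the threshold counts as a positive prediction.

```python
def best_threshold_accuracy(activations, labels, n_thresholds=20):
    """Return (accuracy, threshold) for the best of n equally spaced thresholds."""
    lo, hi = min(activations), max(activations)
    best_acc, best_t = 0.0, lo
    for s in range(n_thresholds):
        t = lo + (hi - lo) * s / (n_thresholds - 1)
        # Predict "positive" when the activation reaches the threshold.
        correct = sum((a >= t) == bool(y) for a, y in zip(activations, labels))
        acc = correct / len(labels)
        if acc > best_acc:
            best_acc, best_t = acc, t
    return best_acc, best_t

def best_neuron(per_neuron_activations, labels):
    """Return (index, accuracy) of the neuron with the highest accuracy."""
    scores = [best_threshold_accuracy(a, labels)[0]
              for a in per_neuron_activations]
    idx = max(range(len(scores)), key=scores.__getitem__)
    return idx, scores[idx]

labels = [0, 0, 1, 1]
neurons = [[0.5, 0.5, 0.5, 0.5],   # uninformative neuron
           [0.1, 0.2, 0.8, 0.9]]   # neuron that separates the classes
winner = best_neuron(neurons, labels)
```

Here the second neuron wins because some threshold between 0.2 and 0.8 classifies all four examples correctly, while the constant neuron can do no better than the majority rate.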


Results - Detection

[Table: summary of numerical comparisons against other baselines.]

[Figure: histograms of activation values for faces (red) vs. no faces (blue).]

Results - Invariance

• Method: choose 10 face images and apply distortions to them, e.g. scaling and translation.

• Out-of-plane rotation: use 10 images of faces rotating in 3D.

[Figure: neuron response plotted against the amount of distortion; horizontal axes in pixels.]

Results -Visualization

[Figure: the most responsive stimuli in the test set.]

[Figure: the optimal stimulus, found by numerical constrained optimization.]

Results – ImageNet

• Unsupervised training on YouTube and ImageNet images.

• A logistic classifier is placed on top of the highest layer.

• The logistic classifiers are trained first, and then the network is fine-tuned.

• The entire training was carried out on 2,000 machines for one week.

[Table: summary of classification accuracies for our method and other state-of-the-art baselines on ImageNet.]

Summary

• This work shows that it is possible to train neurons to be selective for high-level concepts using entirely unlabeled data.

• The network was able to learn invariances from unlabeled data.

• Object recognition on ImageNet: a significant leap of 70% relative improvement over the state of the art.

Press coverage:

Google Builds a Brain that Can Search for Cat Videos, Time, June 2012

How Many Computers to Identify a Cat? 16,000, NYT, June 2012

http://newsfeed.time.com/2012/06/27/google-builds-a-brain-that-can-search-for-cat-videos/

http://www.nytimes.com/2012/06/26/technology/in-a-big-network-of-computers-evidence-of-machine-learning.html

Unsupervised Learning of Visual Representations using Videos

Xiaolong Wang, Abhinav Gupta

Robotics Institute, Carnegie Mellon University

Published in 2015

http://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Wang_Unsupervised_Learning_of_ICCV_2015_paper.pdf


Agenda

• Overview

• Patch Mining in Videos

• CNN implementation

• Results

• Discussion and Conclusion


Overview

• Do we really need millions of semantically-labeled images to learn a good representation?

• It seems humans can learn visual representations using little or no semantic supervision but our approaches still remain completely supervised.


Overview

• Previous work on unsupervised learning:

• Uses millions of static images, or frames extracted from videos.

• The most common architecture is an auto-encoder, which learns representations based on its ability to reconstruct the input images.

Overview

• Previous work on unsupervised learning:

• These methods can automatically learn V1-like filters from unlabeled data, but they are still far behind supervised approaches on tasks such as object detection.


Overview

• Key insight:

• Visual tracking is one of the first capabilities to develop in infants, often before semantic representations are learned.

• Tracking an object through a video yields patches of the same object, and these patches should have similar visual representations in deep feature space.

Overview

http://www.aoa.org/patients-and-public/good-vision-throughout-life/childrens-vision/infant-vision-birth-to-24-months-of-age?sso=y#1


Overview

• Proposal:

• A Siamese-triplet network with a ranking loss function to train a CNN representation.

• The ranking loss enforces that, in the final deep feature space, the first-frame patch is much closer to the tracked patch than to any other randomly sampled patch.


Patch Mining in Videos

• Source for videos: YouTube.

• Estimated number of new videos uploaded: 300K per minute (2016).

• Tracking:

• Obtain SURF interest points (Speeded-Up Robust Features, 2006).

• Use Improved Dense Trajectories (IDT, 2013) to obtain motion.

• Track with a kernelized correlation filter (KCF, 2014).


• Patches accepted when:
o more than 25% of the SURF points are moving, and
o fewer than 75% of the SURF points are moving.

Siamese Triplet Network

• 3 networks that share the same parameters.

• Input: image of size 227 × 227.

• Based on the AlexNet architecture.

• Two fully connected layers are stacked on the pool5 outputs, with 4096 and 1024 neurons respectively.

• The final output of each single network is therefore a 1024-dimensional feature vector.


Ranking Loss Function

• Cosine distance in the feature space:

$D(X_1, X_2) = 1 - \dfrac{f(X_1) \cdot f(X_2)}{\|f(X_1)\| \, \|f(X_2)\|}$

• Goal: $D(X_i, X_i^-) > D(X_i, X_i^+)$

• $X_i$: first-frame patch

• $X_i^+$: last-frame (tracked) patch

• $X_i^-$: patch from a different video


Ranking Loss Function

• Per triplet of images:

$L(X_i, X_i^+, X_i^-) = \max\{0,\ D(X_i, X_i^+) - D(X_i, X_i^-) + M\}$

• Total objective:

$\min_{W} \ \dfrac{\lambda}{2} \|W\|_2^2 + \sum_{i=1}^{N} L(X_i, X_i^+, X_i^-)$

• $M = 0.5$

• $\lambda = 0.0005$
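The ranking loss can be written out directly from the definitions on these slides. The sketch below works on plain Python lists standing in for feature vectors f(X); the function names are illustrative, and only the defaults M = 0.5 and lambda = 0.0005 come from the slides.

```python
import math

def cosine_distance(a, b):
    """D = 1 - (a . b) / (|a| |b|) for two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def triplet_loss(f_x, f_pos, f_neg, margin=0.5):
    """Hinge loss that wants D(x, neg) to exceed D(x, pos) by at least margin."""
    return max(0.0, cosine_distance(f_x, f_pos)
                    - cosine_distance(f_x, f_neg) + margin)

def total_objective(weights, triplets, lam=0.0005, margin=0.5):
    """Weight decay (lam / 2) * |W|^2 plus the summed triplet losses."""
    decay = 0.5 * lam * sum(w * w for w in weights)
    return decay + sum(triplet_loss(x, p, n, margin) for x, p, n in triplets)

anchor, same, other = [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]
easy = triplet_loss(anchor, same, other)    # negative already far away
wrong = triplet_loss(anchor, other, same)   # positive and negative swapped
```

A triplet whose negative is already more than the margin farther away than the positive contributes zero loss, so only violating triplets produce gradients.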


Patch Mining for Triplet Sampling

• Given $X_i$ and $X_i^+$, how should $X_i^-$ be selected?

• Random selection:

• For each pair of patches in a batch of size B, randomly sample K negative matches from the same batch.

• Shuffle all the images randomly after each epoch of training.

Patch Mining for Triplet Sampling

• Given $X_i$ and $X_i^+$, how should $X_i^-$ be selected?

• Hard negative mining:

• Applied after 10 epochs of training.

• Choose the K samples from the batch with the highest loss.

• K = 4, B = 100
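Hard negative selection can be sketched as ranking the candidate negatives in a batch by their triplet loss and keeping the top K. The function name `hardest_negatives` and the list-based features are assumptions for illustration; the margin 0.5 and K = 4 are the values given on the slides.

```python
import math

def cosine_distance(a, b):
    """D = 1 - (a . b) / (|a| |b|) for two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def hardest_negatives(f_x, f_pos, candidates, k=4, margin=0.5):
    """Return indices of the k candidates with the highest triplet loss."""
    d_pos = cosine_distance(f_x, f_pos)
    losses = [max(0.0, d_pos - cosine_distance(f_x, c) + margin)
              for c in candidates]
    order = sorted(range(len(candidates)),
                   key=lambda i: losses[i], reverse=True)
    return order[:k]

anchor = positive = [1.0, 0.0]
batch = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0], [1.0, 1.0]]
picked = hardest_negatives(anchor, positive, batch, k=2)
```

The candidates that look most like the anchor produce the largest loss, so they are the ones selected; candidates that are already far away contribute zero loss and are skipped.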


Adapting for Supervised Tasks

• Method #1:
o Based on the RCNN paper.
o Use the pre-trained unsupervised "AlexNet"-based network.
o Parameters of the layers up to pool5 are used as initialization.
o The two fully connected layers are initialized randomly.
o Learning rate is 0.01 instead of 0.001 (RCNN).

Adapting for Supervised Tasks

• Method #2:
o Iterative approach:
1) Fine-tune using the PASCAL VOC data.
2) Re-adapt to the ranking triplet task.
3) Again, transfer the convolutional parameters for re-adapting.
o The network converges after two iterations.

Implementation Details

• 100K videos into 8M patches

• 3 different networks using 1.5M, 5M and 8M patches

• Batch size: 100

• Initial learning rate: 0.001

• Random negative sampling for 150K iterations, afterwards hard negative mining


Implementation Details

• 1.5M patches:

• Reduce the learning rate by a factor of 10 every 80K iterations.

• Total: 240K iterations.

• 5M and 8M patches:

• Reduce the learning rate by a factor of 10 every 120K iterations.

• Total: 350K iterations.
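The step schedule described above amounts to a one-line formula. The function name and argument names below are illustrative; the initial rate 0.001 and the 80K and 120K step sizes are the values from the slides.

```python
def learning_rate(iteration, base_lr=0.001, step=80_000, factor=0.1):
    """Step decay: multiply the rate by `factor` once every `step` iterations."""
    return base_lr * factor ** (iteration // step)

# 1.5M-patch network: drops at 80K and 160K, training stops at 240K.
small_run = [learning_rate(i) for i in (0, 80_000, 239_999)]
# 5M and 8M-patch networks: drops every 120K iterations.
large_run = [learning_rate(i, step=120_000) for i in (0, 120_000, 349_999)]
```

Both schedules therefore pass through the same three rates (0.001, 0.0001, 0.00001); the larger datasets simply stay longer at each rate.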


Results: Learned features

[Figure omitted.]

Results: Network response

[Figure omitted.]

Results, no fine-tuning: Qualitative comparison

[Figure omitted.]

Results, no fine-tuning: Quantitative comparison

• Measurement: retrieval rate, i.e. the number of correct retrievals among the top-K retrievals (K = 20).

• Pool5 features with cosine distance.

Method                       Score
Proposed method (the paper)  40%
ELDA on HOG                  24%
Random AlexNet               19%
ImageNet CNN                 62%

Results, with fine-tuning: Object detection

• Follows the pipeline in RCNN.

• PASCAL VOC 2012 dataset.

• Trainval set and test set: ~10K images.

• SVM classifier.

• Learning rate: 0.01, multiplied by 0.1 every 80K iterations.

• Total iterations for fine-tuning: 200K.

• 21 classes.

Results, with fine-tuning: Object detection

• Without using a single image from ImageNet, just 100K unlabeled videos and the VOC 2012 dataset, an ensemble of AlexNet networks achieves 52% mAP.

• The ImageNet-supervised counterpart is an ensemble that achieves 54.4% mAP.

Results, with fine-tuning: Surface Normal Estimation

• Input: 227 × 227 image.

• The output of the network is 20 × 20 pixels.

• Each output pixel is represented by a distribution over 20 code-words, which are learned using K-means.

• The dimension of the output is 20 × 20 × 20 = 8000.

• Two fully connected layers with 4096 and 8000 neurons are placed on top of pool5.

Discussion and Conclusion


• Much more data available

• Might be as close as 2.5% in mAP to supervised networks

• Greater boost using ensemble of networks

• Can be generalized to different tasks

• Mimic human brain?

Questions
