
Page 1: Deep  Learning of  Representations  for  Unsupervised  and Transfer Learning

DEEP LEARNING OF REPRESENTATIONS FOR UNSUPERVISED AND TRANSFER LEARNING

Yoshua Bengio, Statistical Machine Learning Chair, U. Montreal

ICML 2011 Workshop on Unsupervised and Transfer Learning, July 2nd 2011, Bellevue, WA

Page 2: Deep  Learning of  Representations  for  Unsupervised  and Transfer Learning

How to Beat the Curse of Many Factors of Variation?

Compositionality: exponential gain in representational power

• Distributed representations
• Deep architecture

Page 3: Deep  Learning of  Representations  for  Unsupervised  and Transfer Learning

Distributed Representations

• Many neurons active simultaneously
• Input represented by the activation of a set of features that are not mutually exclusive
• Can be exponentially more efficient than local representations

Page 4: Deep  Learning of  Representations  for  Unsupervised  and Transfer Learning

Local vs Distributed

Page 5: Deep  Learning of  Representations  for  Unsupervised  and Transfer Learning

RBM Hidden Units Carve Input Space

[Figure: hidden units h1, h2, h3 carve up the input space spanned by x1, x2]

Page 6: Deep  Learning of  Representations  for  Unsupervised  and Transfer Learning

Unsupervised Deep Feature Learning

Classical: pre-process data with PCA = leading factors
New: learning multiple levels of features/factors, often over-complete

Greedy layer-wise strategy (sketched below): stack unsupervised layers on the raw input x, then add supervised fine-tuning of P(y|x).

[Figure: raw input x feeds a stack of unsupervised layers, topped by supervised fine-tuning of P(y|x)]

(Hinton et al 2006, Bengio et al 2007, Ranzato et al 2007)
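To make the greedy layer-wise flow concrete, here is a minimal Python sketch (not code from the talk). The `train_layer` callback is a hypothetical stand-in for whatever unsupervised learner is used at each level (RBM, auto-encoder, etc.); it only needs to return an encoder function.

```python
def greedy_layerwise_pretrain(x, layer_sizes, train_layer):
    """Train each layer unsupervised on the representation produced by the
    layers below, returning the stack of learned encoders.
    `train_layer(data, n_hidden)` is assumed to return an encoder function."""
    encoders, h = [], x
    for n_hidden in layer_sizes:
        encode = train_layer(h, n_hidden)   # unsupervised training of one layer
        encoders.append(encode)
        h = encode(h)                       # representation fed to the next layer
    return encoders

def top_representation(x, encoders):
    """Compute the top-level representation; a supervised predictor of P(y|x)
    would be trained (and the whole stack fine-tuned) on top of this output."""
    h = x
    for encode in encoders:
        h = encode(h)
    return h
```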

Page 7: Deep  Learning of  Representations  for  Unsupervised  and Transfer Learning

Why Deep Learning?

Hypothesis 1: a deep hierarchy of features is needed to efficiently represent and learn the complex abstractions required for AI and mammalian intelligence.

Computational & statistical efficiency

Hypothesis 2: unsupervised learning of representations is a crucial component of the solution.

Optimization & regularization.

Theoretical and ML-experimental support for both.

Cortex: deep architecture, the same learning rule everywhere

Page 8: Deep  Learning of  Representations  for  Unsupervised  and Transfer Learning

Deep Motivations

Brains have a deep architecture
Humans’ ideas & artifacts are composed from simpler ones

Insufficient depth can be exponentially inefficient

Distributed (possibly sparse) representations necessary for non-local generalization, exponentially more efficient than 1-of-N enumeration of latent variable values

Multiple levels of latent variables allow combinatorial sharing of statistical strength

[Figure: task 1, task 2, and task 3 built on shared intermediate representations over the raw input x]

Page 9: Deep  Learning of  Representations  for  Unsupervised  and Transfer Learning

Deep Architectures are More Expressive

Theoretical arguments:

2 layers of logic gates / formal neurons / RBF units = universal approximator

[Figure: a shallow 2-layer network may need on the order of 2^n units where a deeper network needs only n]

Theorems for all 3: (Hastad et al 86 & 91, Bengio et al 2007)

Functions compactly represented with k layers may require exponential size with k-1 layers

Page 10: Deep  Learning of  Representations  for  Unsupervised  and Transfer Learning

Sharing Components in a Deep Architecture

Polynomial expressed with shared components: the advantage of depth may grow exponentially (Bengio & Delalleau, Learning Workshop 2011)

Sum-product network

Page 11: Deep  Learning of  Representations  for  Unsupervised  and Transfer Learning

Parts Are Composed to Form Objects

Layer 1: edges

Layer 2: parts

Lee et al. ICML’2009

Layer 3: objects

Page 12: Deep  Learning of  Representations  for  Unsupervised  and Transfer Learning

Deep Architectures and Sharing Statistical Strength, Multi-Task Learning

Generalizing better to new tasks is crucial to approach AI

Deep architectures learn good intermediate representations that can be shared across tasks

Good representations make sense for many tasks

[Figure: task 1 (output y1), task 2 (output y2), and task 3 (output y3) computed from a shared intermediate representation h over the raw input x]

Page 13: Deep  Learning of  Representations  for  Unsupervised  and Transfer Learning

Feature and Sub-Feature Sharing

Different tasks can share the same high-level features

Different high-level features can be built from the same set of lower-level features

More levels = up to exponential gain in representational efficiency

[Figure: two diagrams contrasting sharing vs. not sharing intermediate features: tasks 1..N with outputs y1..yN built on high-level features, which are themselves built from low-level features, either shared across tasks or duplicated per task]

Page 14: Deep  Learning of  Representations  for  Unsupervised  and Transfer Learning

Representations as Coordinate Systems

PCA: removing low-variance directions is easy, but what if the signal has low variance? We would like to disentangle the factors of variation, keeping them all.

Overcomplete representations: richer, even if the underlying distribution concentrates near a low-dimensional manifold.

Sparse/saturated features allow for variable-dimension manifolds: the few sensitive features at x form a local chart coordinate system.

Page 15: Deep  Learning of  Representations  for  Unsupervised  and Transfer Learning

Effect of Unsupervised Pre-training

AISTATS’2009 + JMLR 2010, with Erhan, Courville, Manzagol, Vincent, S. Bengio

Page 16: Deep  Learning of  Representations  for  Unsupervised  and Transfer Learning

Effect of Depth

[Figure: w/o pre-training vs. with pre-training]

Page 17: Deep  Learning of  Representations  for  Unsupervised  and Transfer Learning

Unsupervised Feature Learning: a Regularizer to Find Better Local Minima of Generalization Error

Unsupervised pre-training acts like a regularizer

Helps initialize the network in the basin of attraction of local minima with better generalization error

Page 18: Deep  Learning of  Representations  for  Unsupervised  and Transfer Learning

Non-Linear Unsupervised Feature Extraction Algorithms

• CD for RBMs
• SML (PCD) for RBMs
• Sampling beyond Gibbs (e.g. tempered MCMC)
• Mean-field + SML for DBMs
• Sparse auto-encoders
• Predictive Sparse Decomposition
• Denoising Auto-Encoders
• Score Matching / Ratio Matching
• Noise-Contrastive Estimation
• Pseudo-Likelihood
• Contractive Auto-Encoders

See my book / review paper (F&TML 2009): Learning Deep Architectures for AI

Page 19: Deep  Learning of  Representations  for  Unsupervised  and Transfer Learning

Auto-Encoders

• Reconstruction = decoder(encoder(input))
• Probable inputs have small reconstruction error
• Linear decoder/encoder = PCA up to rotation
• Can be stacked to form highly non-linear representations, increasing disentangling (Goodfellow et al, NIPS 2009)

[Figure: the encoder maps the input to a code (latent features); the decoder maps the code back to a reconstruction]
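As a concrete, hedged illustration of the reconstruction = decoder(encoder(input)) idea, the following numpy sketch trains one auto-encoder layer with tied weights on the squared reconstruction error. The layer size, learning rate, and sigmoid non-linearity are illustrative choices, not the exact setup from the slides.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_autoencoder(x, n_hidden=64, lr=0.1, n_epochs=20, seed=0):
    """One auto-encoder layer with tied weights, trained by batch gradient
    descent on the squared reconstruction error."""
    rng = np.random.default_rng(seed)
    n_visible = x.shape[1]
    W = rng.normal(0.0, 0.01, size=(n_visible, n_hidden))
    b_h, b_v = np.zeros(n_hidden), np.zeros(n_visible)
    for _ in range(n_epochs):
        h = sigmoid(x @ W + b_h)          # encoder: input -> code (latent features)
        r = sigmoid(h @ W.T + b_v)        # decoder: code -> reconstruction
        d_r = (r - x) * r * (1 - r)       # gradient through the decoder sigmoid
        d_h = (d_r @ W) * h * (1 - h)     # gradient through the encoder sigmoid
        W -= lr * (x.T @ d_h + d_r.T @ h) / len(x)
        b_h -= lr * d_h.mean(axis=0)
        b_v -= lr * d_r.mean(axis=0)
    return W, b_h, b_v
```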

Page 20: Deep  Learning of  Representations  for  Unsupervised  and Transfer Learning

Restricted Boltzmann Machine (RBM)

The most popular building block for deep architectures

Bipartite undirected graphical model

Inference is trivial: P(h|x) & P(x|h) factorize

[Figure: bipartite graph connecting observed units x and hidden units h]

Page 21: Deep  Learning of  Representations  for  Unsupervised  and Transfer Learning

Gibbs Sampling in RBMs

P(h|x) and P(x|h) factorize

[Figure: alternating Gibbs chain x1 -> h1 ~ P(h|x1) -> x2 ~ P(x|h1) -> h2 ~ P(h|x2) -> x3 ~ P(x|h2) -> h3 ~ P(h|x3) -> ...]

Easy inference
Convenient Gibbs sampling: x -> h -> x -> h -> ...
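A small numpy sketch of this alternating chain, assuming a binary RBM with weights W and biases b_v, b_h (generic names, not from a specific implementation). Because both conditionals factorize, each step is a single matrix product followed by independent Bernoulli samples.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sample_h_given_x(x, W, b_h, rng):
    p = sigmoid(x @ W + b_h)               # P(h_j = 1 | x), independent per unit
    return (rng.random(p.shape) < p).astype(float)

def sample_x_given_h(h, W, b_v, rng):
    p = sigmoid(h @ W.T + b_v)             # P(x_i = 1 | h), independent per unit
    return (rng.random(p.shape) < p).astype(float)

def gibbs_chain(x1, W, b_v, b_h, k=3, seed=0):
    """Run k steps of the alternating chain x1 -> h1 -> x2 -> h2 -> ..."""
    rng = np.random.default_rng(seed)
    x = x1
    for _ in range(k):
        h = sample_h_given_x(x, W, b_h, rng)
        x = sample_x_given_h(h, W, b_v, rng)
    return x
```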

Page 22: Deep  Learning of  Representations  for  Unsupervised  and Transfer Learning

RBMs are Universal Approximators

Adding one hidden unit (with proper choice of parameters) guarantees increasing likelihood

With enough hidden units, can perfectly model any discrete distribution

RBMs with a variable number of hidden units = non-parametric

(Le Roux & Bengio 2008, Neural Comp.)

Page 23: Deep  Learning of  Representations  for  Unsupervised  and Transfer Learning

Denoising Auto-Encoder

• Learns a vector field towards higher probability regions

• Minimizes variational lower bound on a generative model

• Similar to pseudo-likelihood

• A form of regularized score matching

[Figure: the corrupted input is encoded and decoded into a reconstruction of the clean input]
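A hedged numpy sketch of the denoising criterion: corrupt the input (masking noise here, an illustrative choice), encode the corrupted version, and penalize the reconstruction error against the clean input. Parameter names mirror the auto-encoder sketch above.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def corrupt(x, corruption_level, rng):
    """Masking noise: randomly zero out a fraction of the input dimensions."""
    return x * (rng.random(x.shape) > corruption_level)

def denoising_step(x, W, b_h, b_v, lr=0.1, corruption_level=0.3, rng=None):
    rng = rng if rng is not None else np.random.default_rng(0)
    x_tilde = corrupt(x, corruption_level, rng)   # corrupted input
    h = sigmoid(x_tilde @ W + b_h)                # encode the corrupted input
    r = sigmoid(h @ W.T + b_v)                    # reconstruction
    d_r = (r - x) * r * (1 - r)                   # error against the CLEAN input
    d_h = (d_r @ W) * h * (1 - h)
    W -= lr * (x_tilde.T @ d_h + d_r.T @ h) / len(x)
    b_h -= lr * d_h.mean(axis=0)
    b_v -= lr * d_r.mean(axis=0)
    return W, b_h, b_v
```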

Page 24: Deep  Learning of  Representations  for  Unsupervised  and Transfer Learning

Stacked Denoising Auto-Encoders

• No partition function, can measure training criterion

• Encoder & decoder: any parametrization

• Performs as well as or better than stacking RBMs for unsupervised pre-training

Infinite MNIST

Page 25: Deep  Learning of  Representations  for  Unsupervised  and Transfer Learning

Contractive Auto-Encoders

• Contractive Auto-Encoders: Explicit Invariance During Feature Extraction, Rifai, Vincent, Muller, Glorot & Bengio, ICML 2011.

• Higher Order Contractive Auto-Encoders, Rifai, Mesnil, Vincent, Muller, Bengio, Dauphin, Glorot, ECML 2011.

• Part of the winning toolbox in the final phase of the Unsupervised & Transfer Learning Challenge 2011

Page 26: Deep  Learning of  Representations  for  Unsupervised  and Transfer Learning

Contractive Auto-Encoders

Training criterion: the contraction penalty wants contraction in all directions, but the reconstruction term cannot afford contraction in the manifold directions.

• Few active units represent the active subspace (local chart)

• Jacobian’s spectrum is peaked = local low-dimensional representation / relevant factors
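For concreteness, a sketch of the contractive penalty under the usual sigmoid-encoder assumption: the Frobenius norm of the encoder Jacobian has a cheap closed form and is added to the reconstruction error with a weight lambda_c (an illustrative hyper-parameter name, not from the slides).

```python
import numpy as np

def contractive_penalty(x, W, b_h):
    """||J_f(x)||_F^2 for the sigmoid encoder h = sigmoid(x W + b_h):
    dh_j/dx_i = h_j (1 - h_j) W_ij, so the norm factorizes per hidden unit."""
    h = 1.0 / (1.0 + np.exp(-(x @ W + b_h)))
    return np.mean(((h * (1 - h)) ** 2) @ (W ** 2).sum(axis=0))

def cae_criterion(x, r, W, b_h, lambda_c=0.1):
    """Reconstruction error plus the weighted contraction term."""
    reconstruction = np.mean(np.sum((r - x) ** 2, axis=1))
    return reconstruction + lambda_c * contractive_penalty(x, W, b_h)
```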

Page 27: Deep  Learning of  Representations  for  Unsupervised  and Transfer Learning

Unsupervised and Transfer Learning Challenge: 1st Place in Final Phase

[Figure: challenge results for raw data and for representations learned with 1, 2, 3, and 4 layers]

Page 28: Deep  Learning of  Representations  for  Unsupervised  and Transfer Learning

Transductive Representation Learning

The validation and test sets have different classes than the training set, hence very different input distributions.

Directions that matter for distinguishing the new classes might have small variance under the training set.

Solution: perform the last level of unsupervised feature learning (PCA) on the validation / test set input data.
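A minimal numpy sketch of that last step, assuming the lower layers have already produced a representation of the validation / test inputs; the PCA is fit on those inputs themselves (plain SVD, illustrative function name).

```python
import numpy as np

def transductive_pca(test_features, n_components):
    """Fit PCA on the validation/test representations themselves, so that
    directions that matter for the new classes are kept even if they had
    small variance under the training distribution."""
    mu = test_features.mean(axis=0)
    centered = test_features - mu
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]             # leading principal directions
    return centered @ components.T, components, mu
```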

Page 29: Deep  Learning of  Representations  for  Unsupervised  and Transfer Learning

Domain Adaptation (ICML 2011)

• Small (4-domain) Amazon benchmark: we beat the state-of-the-art handsomely

• Sparse rectifier SDA finds more features that tend to be useful for predicting either domain or sentiment, but not both

Page 30: Deep  Learning of  Representations  for  Unsupervised  and Transfer Learning

Sentiment Analysis: Transfer Learning

• 25 Amazon.com domains: toys, software, video, books, music, beauty, …

• Unsupervised pre-training of input space on all domains

• Supervised SVM on 1 domain, generalize out-of-domain

• Baseline: bag-of-words + SVM
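A hedged sketch of this evaluation protocol with the bag-of-words baseline, using scikit-learn for brevity; the deep variant would substitute the learned representation for the bag-of-words features. Data loading and domain selection are assumed.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

def out_of_domain_accuracy(source_texts, source_labels, target_texts, target_labels):
    """Train a linear SVM on one Amazon domain, test on another (baseline:
    binary bag-of-words features; a learned deep representation would
    replace the vectorizer output)."""
    vectorizer = CountVectorizer(binary=True)
    x_source = vectorizer.fit_transform(source_texts)   # fit vocabulary on source domain only
    x_target = vectorizer.transform(target_texts)
    clf = LinearSVC().fit(x_source, source_labels)
    return clf.score(x_target, target_labels)
```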

Page 31: Deep  Learning of  Representations  for  Unsupervised  and Transfer Learning

Sentiment Analysis: Large Scale Out-of-Domain Generalization

Relative loss by going from in-domain to out-of-domain testing

340k examples from Amazon, with per-domain counts ranging from 56 (tools) to 124k (music)

Page 32: Deep  Learning of  Representations  for  Unsupervised  and Transfer Learning

Representing Sparse High-Dimensional Stuff

• Deep Sparse Rectifier Neural Networks, Glorot, Bordes & Bengio, AISTATS 2011.

• Sampled Reconstruction for Large-Scale Learning of Embeddings, Dauphin, Glorot & Bengio, ICML 2011.

[Figure: encoding the sparse input into latent features is cheap; reconstructing dense output probabilities is expensive]

Page 33: Deep  Learning of  Representations  for  Unsupervised  and Transfer Learning

Speedup from Sampled Reconstruction

Page 34: Deep  Learning of  Representations  for  Unsupervised  and Transfer Learning

Deep Self-Taught Learning for Handwritten Character Recognition

Y. Bengio & 16 others (IFT6266 class project & AISTATS 2011 paper)

• Discriminate 62 character classes (upper case, lower case, digits), 800k to 80M examples

• Deep learners beat state-of-the-art on NIST and reach human-level performance

• Deep learners benefit more from perturbed (out-of-distribution) data

• Deep learners benefit more from multi-task setting

Page 35: Deep  Learning of  Representations  for  Unsupervised  and Transfer Learning


Page 36: Deep  Learning of  Representations  for  Unsupervised  and Transfer Learning

Improvement due to training on perturbed data

[Figure: relative improvement for deep vs. shallow models]

{SDA,MLP}1: trained only on data with distortions (thickness, slant, affine, elastic, pinch)
{SDA,MLP}2: trained with all perturbations

Page 37: Deep  Learning of  Representations  for  Unsupervised  and Transfer Learning

Improvement due to multi-task setting

[Figure: relative improvement for deep vs. shallow models]

Multi-task: Train on all categories, test on target categories (shared representations)

Not multi-task: Train and test on target categories only (no sharing, separate models)

Page 38: Deep  Learning of  Representations  for  Unsupervised  and Transfer Learning

Comparing against Humans and SOA

All 62 character classes

Deep learners reach human performance

[1] Granger et al., 2007

[2] Pérez-Cortes et al., 2000 (nearest neighbor)

[3] Oliveira et al., 2002b (MLP)

[4] Milgram et al., 2005 (SVM)

Page 39: Deep  Learning of  Representations  for  Unsupervised  and Transfer Learning

Tips & Tricks

• Don’t be scared by the many hyper-parameters: use random sampling (not grid search) & clusters / GPUs (see the sketch after this list)

• Learning rate is the most important, along with top-level dimension

• Make sure selected hyper-param value is not on the border of interval

• Early stopping
• Using NLDR for visualization
• Simulation of the final evaluation scenario
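As a rough illustration of random hyper-parameter sampling (the first tip above), the sketch below draws each hyper-parameter independently and keeps the configuration with the best validation score. `train_and_validate`, the sampling ranges, and the log-uniform learning-rate prior are all illustrative assumptions, not part of the talk.

```python
import numpy as np

def random_search(train_and_validate, n_trials=50, seed=0):
    """Random sampling of hyper-parameters (instead of grid search); trials
    are independent, so they parallelize trivially over a cluster or GPUs."""
    rng = np.random.default_rng(seed)
    best_score, best_config = -np.inf, None
    for _ in range(n_trials):
        config = {
            "learning_rate": 10 ** rng.uniform(-4, -1),   # log-uniform prior
            "n_hidden": int(rng.integers(100, 2000)),     # top-level dimension
            "n_layers": int(rng.integers(1, 5)),
        }
        score = train_and_validate(**config)              # validation performance
        if score > best_score:
            best_score, best_config = score, config
    # If the best value sits on the border of its sampling interval,
    # widen the interval and search again.
    return best_config, best_score
```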

Page 40: Deep  Learning of  Representations  for  Unsupervised  and Transfer Learning

Conclusions

• Deep Learning: powerful arguments & generalization principles

• Unsupervised Feature Learning is crucial: many new algorithms and applications in recent years

• DL particularly suited for multi-task learning, transfer learning, domain adaptation, self-taught learning, and semi-supervised learning with few labels

Page 41: Deep  Learning of  Representations  for  Unsupervised  and Transfer Learning

http://deeplearning.net/software/theano : numpy → GPU

http://deeplearning.net

Page 42: Deep  Learning of  Representations  for  Unsupervised  and Transfer Learning

Merci! Questions?

LISA ML Lab team