Deep Learning via Semi-Supervised Embedding
Jason Weston∗, Frederic Ratle† & Ronan Collobert∗
∗ NEC Labs America, Princeton, USA
† IGAR, University of Lausanne, Lausanne, Switzerland
1 Summary
A new approach to semi-supervised deep learning for neural networks
Really easy, simple, fast, really works
Experiments: can train very deep networks (15 layers) with better results than shallow networks (≤ 4 layers), including SVMs (1 layer!).
NLP application: semi-supervised learning on over 600 million examples.
What’s the idea?
Train the NN jointly with supervised and unsupervised signals. Unsupervised = learning an embedding on any (or all) layers of the NN, as sketched below.
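Schematically, the joint objective has the form below (a sketch in the slides' notation; the trade-off weight λ is my label, and the experiments later simply give both terms equal weight):

$$\min \;\; \sum_{i=1}^{L} \ell\big(f(x_i), y_i\big) \;+\; \lambda \sum_{i,j=1}^{U} L\big(g(x_i),\, g(x_j),\, W_{ij}\big)$$

where $f$ is the supervised network output, $g$ is the output of whichever layer carries the embedding regularizer, and $W_{ij}$ says whether unlabeled points $x_i$, $x_j$ are neighbors.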
2 Deep Learning with Neural Networks [Image: Y. Bengio, 2007]
Deep = lots of layers. Powerful systems.
Standard backpropagation doesn’t give great results.
3 Some New Deep Training Methods
Hinton et al.'s DBNs.
Bengio et al. and Ranzato et al. propose different autoencoders.
Pre-train with unlabeled data: “hope parameters are in a region of space where a good optimum can be reached by local descent.”
Pre-training: greedy layer-wise [Image: Larochelle et al. 2007]
“Fine-tune” network afterwards using backprop.
Works, but all seems a bit messy...?
4 Deep and Shallow Research
Deep Researchers (DRs) believe:
Learn sub-tasks in layers. Essential for hard tasks.
Natural for multi-task learning.
Non-linearity is efficient compared to O(n³) shallow (kernel) methods.
Shallow Researchers believe:
NNs were already complicated and messy.
New deep methods are even more complicated and messy.
Shallow methods: clean and give valuable insights into what works.
5 Existing Embedding Algorithms
Many existing (“shallow”) embedding algorithms optimize:
$$\min \sum_{i,j=1}^{U} L(f(x_i), f(x_j), W_{ij}), \qquad f_i = f(x_i) \in \mathbb{R}^d$$
MDS: minimize $(\|f_i - f_j\| - W_{ij})^2$
ISOMAP: same, but W defined by shortest path on neighborhood graph.
Laplacian Eigenmaps: minimize $\sum_{ij} W_{ij}\,\|f_i - f_j\|^2$
subject to the “balancing constraint”: $f^\top D f = I$ and $f^\top D \mathbf{1} = 0$, where $D$ is the diagonal degree matrix, $D_{ii} = \sum_j W_{ij}$.
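For concreteness, a minimal NumPy/SciPy sketch of Laplacian Eigenmaps via the standard generalized eigenproblem (the function name and the dense solver are my choices, not from the slides; large graphs would need sparse eigensolvers):

```python
import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmaps(W, d):
    """Embed points from an affinity matrix W by minimizing
    sum_ij W_ij ||f_i - f_j||^2 subject to f^T D f = I, solved as the
    generalized eigenproblem (D - W) f = lambda * D f."""
    D = np.diag(W.sum(axis=1))      # diagonal degree matrix
    L = D - W                       # graph Laplacian
    vals, vecs = eigh(L, D)         # eigenvalues in ascending order
    # Drop the first (constant) eigenvector, which enforces f^T D 1 = 0.
    return vecs[:, 1:d + 1]         # one d-dim embedding row per point
```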
6 Siamese Networks: functional embedding
Equivalent to Lap. Eigenmaps but f(x) is a NN.
DrLIM [Hadsell et al., '06]:

$$L(f_i, f_j, W_{ij}) = \begin{cases} \|f_i - f_j\|^2 & \text{if } W_{ij} = 1, \\ \max(0,\; m - \|f_i - f_j\|)^2 & \text{if } W_{ij} = 0. \end{cases}$$
→ neighbors close, others have distance of at least m
• Balancing is handled by the Wij = 0 case → easy optimization
• f(x) is not just a lookup table → control capacity, add prior knowledge, no out-of-sample problem
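As a reference, a minimal PyTorch sketch of this pairwise loss (the function name and the default margin value are mine, not from the slides):

```python
import torch

def drlim_loss(f_i, f_j, w_ij, margin=1.0):
    """Margin-based pairwise loss: pull neighbor pairs (W_ij = 1) together,
    push non-neighbor pairs (W_ij = 0) at least `margin` apart."""
    dist = torch.norm(f_i - f_j, dim=-1)
    pull = dist ** 2                               # W_ij = 1: attract
    push = torch.clamp(margin - dist, min=0) ** 2  # W_ij = 0: repel up to margin
    return torch.where(w_ij == 1, pull, push).mean()
```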
7 Shallow Semi-supervision
SVM: $\min_{w,b}\; \gamma\|w\|^2 + \sum_{i=1}^{L} H(y_i f(x_i))$, where $H$ is the hinge loss.
Add an embedding regularizer: unlabeled neighbors should have the same output:
• LapSVM [Belkin et al.]: SVM objective $+\; \lambda \sum_{i,j=1}^{U} W_{ij}\, \|f(x^*_i) - f(x^*_j)\|^2$
e.g. Wij = 1 if two points are neighbors, 0 otherwise.
• “Preprocessing”:
Using ISOMAP vectors as input to SVM [Chapelle et al.]. . .
8 New regularizer for NNs: Deep Embedding
[Figure: three network diagrams (input → layers 1–3 → output), showing the embedding regularizer applied at (a) the output layer, (b) an internal hidden layer, and (c) a separate auxiliary embedding layer.]
• Define the neural network: $f(x) = h_3(h_2(h_1(x)))$
• Supervised training: minimize $\sum_i \ell(f(x_i), y_i)$
• Add embedding regularizer(s) to training:
(a) $\sum_{ij} L(f(x_i), f(x_j), W_{ij})$, or
(b) $\sum_{ij} L(h_2(h_1(x_i)), h_2(h_1(x_j)), W_{ij})$, or
(c) $\sum_{ij} L(e(x_i), e(x_j), W_{ij})$, where $e(x) = e_3(h_2(h_1(x)))$
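A minimal PyTorch sketch of variant (c), the auxiliary embedding layer (the class name, Tanh activation, and default widths are assumptions; the 50-unit layers and 10-dim embedding match the experiments later in the talk):

```python
import torch.nn as nn

class EmbedNN(nn.Module):
    """Classifier f(x) = h3(h2(h1(x))) plus an auxiliary embedding head
    e(x) = e3(h2(h1(x))) sharing the first two layers."""
    def __init__(self, n_in, n_hid=50, n_classes=10, n_embed=10):
        super().__init__()
        self.h1 = nn.Sequential(nn.Linear(n_in, n_hid), nn.Tanh())
        self.h2 = nn.Sequential(nn.Linear(n_hid, n_hid), nn.Tanh())
        self.h3 = nn.Linear(n_hid, n_classes)  # supervised output layer
        self.e3 = nn.Linear(n_hid, n_embed)    # auxiliary embedding layer

    def forward(self, x):
        z = self.h2(self.h1(x))
        return self.h3(z), self.e3(z)          # (class scores, embedding)
```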
9 Deep Semi-Supervised Embedding
Input: labeled data (xi, yi), unlabeled data x∗i, and matrix W
repeat
    Pick a random labeled example (xi, yi)
    Gradient step for H(yi f(xi))
    for each embedding layer do
        Pick a random pair of neighbors x∗i, x∗j
        Gradient step for L(x∗i, x∗j, 1)
        Pick a random pair x∗i, x∗k
        Gradient step for L(x∗i, x∗k, 0)
    end for
until stopping criteria
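A hedged PyTorch sketch of one iteration of this loop, reusing the EmbedNN sketch above (not the authors' code; cross-entropy stands in for the margin loss H, and batching details are glossed over):

```python
import torch
import torch.nn.functional as F

def train_step(model, opt, labeled, neighbor_pair, random_pair, margin=1.0):
    """One iteration: a supervised gradient step, then an embedding step on a
    neighbor pair (W = 1) and on a random non-neighbor pair (W = 0)."""
    x, y = labeled
    opt.zero_grad()
    scores, _ = model(x)
    F.cross_entropy(scores, y).backward()    # supervised step (stands in for H)
    opt.step()

    xi, xj = neighbor_pair                   # W_ij = 1
    opt.zero_grad()
    _, ei = model(xi)
    _, ej = model(xj)
    (ei - ej).pow(2).sum().backward()        # L(., ., 1): pull neighbors together
    opt.step()

    xi, xk = random_pair                     # W_ik = 0
    opt.zero_grad()
    _, ei = model(xi)
    _, ek = model(xk)
    d = torch.norm(ei - ek)
    torch.clamp(margin - d, min=0).pow(2).backward()  # L(., ., 0): push apart
    opt.step()
```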
10 Semi-Supervised Experiments
Typical shallow semi-supervised datasets:
data set   classes   dims   points   labeled
g50c           2       50      500        50
Text           2     7511     1946        50
Uspst         10      256     2007        50
Mnist1h       10      784      70k       100
Mnist6h       10      784      70k       600
Mnist1k       10      784      70k      1000
• Keep algorithm simple: equal weight to supervised & unlabeled.
• First experiment: only consider two-layer nets.
11 Deep Semi-Supervised Results
Test error (%):

Method          g50c    Text    Uspst
SVM             8.32    18.86   23.18
SVMLight-TSVM   6.87     7.44   26.46
∇TSVM           5.80     5.71   17.61
LapSVM∗         5.4     10.4    12.7
NN              8.54    15.87   24.57
EmbedNN         5.66     5.82   15.49
Method         Mnist1h   Mnist6h   Mnist1k
SVM              23.44      8.85      7.77
TSVM             16.81      6.16      5.38
RBM(∗)           21.5       -         8.8
SESM(∗)          20.6       -         9.6
DBN-rNCA(∗)       -         8.7       -
NN               25.81     11.44     10.70
EmbedONN         17.05      5.97      5.73
EmbedI1NN        16.86      9.44      8.52
EmbedA1NN        17.17      7.56      7.89
CNN              22.98      7.68      6.45
EmbedOCNN        11.73      3.42      3.34
EmbedI5CNN        7.75      3.82      2.73
EmbedA5CNN        7.87      3.82      2.76
12 Really Deep Results
Same Mnist1h dataset, but training 2–15 layer nets (50 hidden units each):
layers =        2      4      6      8     10     15
NN           26.0   26.1   27.2   28.3   34.2   47.7
EmbedNNO     19.7   15.1   15.1   15.0   13.7   11.8
EmbedNNALL   18.2   12.6    7.9    8.5    6.3    9.3
• EmbedNNO: auxiliary 10-dim embedding on output layer
• EmbedNNALL: auxiliary 10-dim embedding on every layer.
• Trained jointly with supervised signal, as before.
13 Training on Pairs: “isn’t that slow?” (ANSWER: No)
k-NN to build the neighbor graph is slow, but many methods exist to speed it up (naively: sampling).
Sequence data gives neighbor pairs for free: text, images (video), speech (audio):
video: patches in frames t and t+1 → same label
audio: consecutive audio frames → same speaker + word ...
text: two words close in text → same topic
Web: use link information to collect neighbors
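A minimal sketch of this kind of cheap pair mining from sequences (the function name and sampling scheme are mine; treating a randomly drawn pair as non-neighbors is the usual approximation):

```python
import random

def sample_sequence_pairs(sequences, num_pairs):
    """Consecutive items in a sequence (adjacent video frames, nearby words)
    are labeled neighbors (W = 1); two items drawn at random are treated
    as non-neighbors (W = 0)."""
    pairs = []
    for _ in range(num_pairs):
        seq = random.choice(sequences)
        t = random.randrange(len(seq) - 1)
        pairs.append((seq[t], seq[t + 1], 1))      # consecutive -> W = 1
        other = random.choice(random.choice(sequences))
        pairs.append((seq[t], other, 0))           # random pair -> W = 0
    return pairs
```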
14 NLP: semantic role labeling
• Take a large unlabeled text corpus, Wikipedia (631M words), for the embedding
• Pick two random windows → Wij = 1:
“The cat sat on the mat”  “Champion Federer wins again”   (+ve pair)
• Pick two random windows and “distort” one of them → Wij = 0:
“The cat sat on the mat”  “Champion began wins again”   (−ve pair)
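A small sketch of this pair construction (the function name and the choice of which word to replace are mine; the slide only requires that one window be distorted):

```python
import random

def wikipedia_pairs(windows, vocab):
    """Two genuine text windows form a positive pair (W = 1); replacing one
    word of a window with a random vocabulary word yields a distorted
    window for a negative pair (W = 0)."""
    a, b = random.sample(windows, 2)       # two real windows -> +ve pair
    distorted = list(b)
    distorted[random.randrange(len(distorted))] = random.choice(vocab)
    return (a, b, 1), (a, tuple(distorted), 0)
```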
Method                             Test Error
ASSERT [Pradhan et al.]            16.54%
CNN [no prior knowledge]           18.40%
EmbedA1CNN [no prior knowledge]    14.55%
More info: “A Unified Architecture for NLP: Deep Neural Networks with Multitask Learning”, Collobert & Weston.
15 Conclusions
Existing “shallow” methods:
• embedding unlabeled data as a separate pre-processing step (a “first layer”)
• using the embedding as a regularizer at the output layer
EmbedNN generalizes both approaches.
Easy to train.
No pre-training, no decoding step = simple, fast.
Can train very deep networks.
Should help whenever the “auxiliary” embedding matrix W is correlated with the supervised task.