TRANSCRIPT
Source slides: pages.cs.wisc.edu/.../slides/presentation11.pdf
A Simple Framework for
Contrastive Learning of Visual Representations
Ting Chen, Simon Kornblith, Mohammad Norouzi, Geoffrey Hinton
Presented by Yien Xu and Lichengxi Huang
Previously…
Learning without human supervision is a long-standing problem.
Two mainstream approaches:
Generative
+ Learns to generate or model pixels in the input space
- Computationally expensive
- Pixel-level generation may not be necessary for representation learning
Discriminative
+ Learns representations using objective functions similar to those used in supervised learning
+ Trains networks to perform pretext tasks
+ Both the inputs and labels are derived from an unlabeled dataset
SimCLR
Experiment Results
Evaluated on ImageNet:
SimCLR achieves 76.5% top-1 accuracy, a 7% relative improvement over the previous state-of-the-art.
When fine-tuned with only 1% of the ImageNet labels:
SimCLR achieves 85.8% top-5 accuracy, a 10% relative improvement over the previous state-of-the-art.
When fine-tuned on other natural image classification datasets:
SimCLR performs on par with or better than a strong supervised baseline on 10 out of 12 datasets.
Outline
Motivation
Framework
Evaluation
Conclusion
Framework
Data Augmentation
Base Encoder Network
Projection Head
Contrastive Loss Function
Batch Size
Data Augmentation
Transforms any given data example randomly, resulting in two correlated views of the same example.
Three augmentations applied sequentially:
random cropping (with resizing)
random color distortion
random Gaussian blur
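The two-view generation step can be sketched in plain NumPy. This is a minimal illustration, not the paper's exact pipeline: the `random_crop_resize` and `color_jitter` helpers below are hypothetical stand-ins for the torchvision-style ops SimCLR uses, and the Gaussian blur step is omitted for brevity.

```python
import numpy as np

def random_crop_resize(img, out_size, rng):
    """Crop a random square region and resize it (nearest neighbor) to out_size."""
    h, w, _ = img.shape
    size = rng.integers(out_size, min(h, w) + 1)   # random crop size
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    crop = img[top:top + size, left:left + size]
    idx = np.arange(out_size) * size // out_size   # nearest-neighbor resize indices
    return crop[idx][:, idx]

def color_jitter(img, rng, strength=0.5):
    """Randomly rescale and shift each color channel, then clip to [0, 1]."""
    scale = 1.0 + strength * rng.uniform(-1, 1, size=3)
    shift = strength * rng.uniform(-0.2, 0.2, size=3)
    return np.clip(img * scale + shift, 0.0, 1.0)

def two_views(img, out_size=24, rng=None):
    """Return two independently augmented, correlated views of one image."""
    if rng is None:
        rng = np.random.default_rng()
    t = lambda: color_jitter(random_crop_resize(img, out_size, rng), rng)
    return t(), t()

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))      # stand-in for a normalized RGB image
v1, v2 = two_views(img, rng=rng)
```

Each call to `t()` draws fresh random parameters, which is what makes the two views correlated but not identical.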
Data Augmentation
Why? Data augmentation defines the predictive tasks.
Previous approaches define contrastive prediction tasks by changing the architecture:
global-to-local view prediction via constraining the receptive field in the network architecture
neighboring view prediction via a fixed image-splitting procedure and a context aggregation network
This architectural complexity can be avoided by performing simple random cropping (with resizing).
Broader contrastive prediction tasks can be defined by extending the family of augmentations and composing them stochastically.
Data Augmentation
Composition of Data Augmentation
To investigate which data augmentations matter, apply augmentations individually or in pairs.
Always crop and resize images first (since ImageNet images are of different sizes).
On one branch, apply the targeted transformation(s).
On the other branch, apply only the identity transformation: t(x_i) = x_i.
Composition of Data Augmentation
Composition of Data Augmentation
No single transformation suffices to learn good representations.
When augmentations are composed, the quality of the representation improves.
One composition of augmentations stands out: random cropping combined with random color distortion.
Composition of Data Augmentation
Why Color Distortion?
Before: random crops of the same image share similar color distributions, so color histograms alone suffice to distinguish images, and the network may take this shortcut.
After color distortion: this shortcut is removed.
Stronger Data Augmentation
Unsupervised contrastive learning benefits from stronger (color) data augmentation than supervised learning does.
Base Encoder
Extracts representation vectors from augmented data examples
Framework allows various choices of the network architecture
SimCLR chooses ResNet: h_i = f(x̃_i) = ResNet(x̃_i), where x̃_i is an augmented example.
Base Encoder
The performance gap between supervised and unsupervised models shrinks as model size increases: unsupervised learning benefits more from bigger models.
Projection Head
Maps representations to the space where the contrastive loss is applied.
A small neural network: a multilayer perceptron (MLP) with one hidden layer,
z_i = g(h_i) = W^(2) σ(W^(1) h_i),
where σ is a ReLU non-linearity.
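The projection head can be sketched in a few lines of NumPy. The dimensions below (2048 → 2048 → 128) match the paper's ResNet-50 setup, but the randomly initialized weights are placeholders, not trained parameters.

```python
import numpy as np

def projection_head(h, w1, w2):
    """z = g(h) = W2 @ relu(W1 @ h): a one-hidden-layer MLP projection."""
    return w2 @ np.maximum(w1 @ h, 0.0)   # np.maximum(., 0) is ReLU

rng = np.random.default_rng(0)
d_rep, d_hidden, d_proj = 2048, 2048, 128   # ResNet-50 output -> 128-d projection
w1 = rng.normal(0.0, 0.01, (d_hidden, d_rep))
w2 = rng.normal(0.0, 0.01, (d_proj, d_hidden))

h = rng.normal(size=d_rep)    # stand-in for a ResNet representation h_i
z = projection_head(h, w1, w2)
```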
Projection Head
Non-Linear > Linear >> None
h vs g(h)
h > g(h)
Conjecture: this is due to the loss of information induced by the contrastive loss; g can remove information that may be useful for the downstream task.
Contrastive Loss Function
Given a set {x̃_k}, the contrastive prediction task aims to identify x̃_j in {x̃_k}_{k≠i} for a given x̃_i.
Randomly sample a minibatch of N examples and define the task on pairs of augmented examples (2N images):
pick out one pair (2 images) as positives
treat the other 2(N − 1) augmented examples as negatives
NT-Xent (the normalized temperature-scaled cross-entropy loss): for a positive pair (i, j),
ℓ(i, j) = −log [ exp(sim(z_i, z_j)/τ) / Σ_{k≠i} exp(sim(z_i, z_k)/τ) ],
where sim(u, v) = uᵀv / (‖u‖ ‖v‖) is cosine similarity and τ is a temperature parameter.
SimCLR computes the loss from all positive pairs in a mini-batch.
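A minimal NumPy version of NT-Xent follows. It assumes the 2N projections are stacked so that rows 2k and 2k+1 form the k-th positive pair; batch layout conventions vary across implementations.

```python
import numpy as np

def nt_xent(z, tau=0.5):
    """NT-Xent over 2N projections; rows 2k and 2k+1 are a positive pair."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity via unit vectors
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                     # exclude self-similarity (k != i)
    # Row-wise log-softmax, evaluated at each row's positive partner.
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    n2 = z.shape[0]
    pos = np.arange(n2) ^ 1            # partner index: 0<->1, 2<->3, ...
    return -log_prob[np.arange(n2), pos].mean()

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 128))          # N = 4 pairs -> 2N = 8 projections
loss = nt_xent(z)
```

Averaging over all 2N rows is what computes the loss over both orderings, ℓ(i, j) and ℓ(j, i), of every positive pair.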
Batch Size
Vary the training batch size N from 256 to 8192
Use the LARS optimizer (You et al., 2017)
Batch Size vs Training
Contrastive learning benefits from larger batch sizes and longer training
Outline
Motivation
Framework
Evaluation
Conclusion
Evaluation Protocol
Dataset: ImageNet ILSVRC-2012
Also evaluated on CIFAR-10 and others (for transfer learning)
Protocol: linear evaluation; a linear classifier is trained on top of the frozen base network, and test accuracy is used as a proxy for representation quality.
Data augmentation: crop & resize, color distortion, and Gaussian blur
Optimizer: LARS with learning rate 4.8
Batch size: 4096
Epochs: 100
Comparison with State-of-the-art
Semi-Supervised Learning
Sample 1% or 10% of the labeled ImageNet training dataset in a class-balanced way (roughly 12.8 and 128 images per class, respectively).
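Class-balanced subsampling of this kind can be sketched as follows. The `class_balanced_subsample` helper is hypothetical; the official SimCLR label splits are fixed files rather than re-sampled subsets.

```python
import numpy as np

def class_balanced_subsample(labels, fraction, rng):
    """Pick ~fraction of the indices, drawing the same number from every class."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    per_class = max(1, int(round(fraction * len(labels) / len(classes))))
    keep = []
    for c in classes:
        idx = np.flatnonzero(labels == c)
        keep.extend(rng.choice(idx, size=per_class, replace=False))
    return np.sort(np.array(keep))

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(10), 100)        # 10 classes, 100 examples each
idx = class_balanced_subsample(labels, 0.01, rng)
```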
Semi-Supervised Learning
Transfer Learning
Outline
Motivation
Framework
Evaluation
Conclusion
Conclusion
Composition of multiple data augmentation operations is crucial
A nonlinear projection head (g) substantially improves the quality of the representation in the layer before it.
Larger batch sizes and more training steps bring more benefits
Quiz Questions
Which of the following statements are true about effective visual representation learning?
• Heuristics may limit generality of representations.
• Generative approaches train networks using pretext tasks with unlabeled data.
• Discriminative approaches are more widely used than generative approaches.
• Pixel-level generation is expensive in computation.
Which of the following statements are true about SimCLR?
• Given batch size N, SimCLR’s learning algorithm requires computing similarities between all pairs of 2N projections.
• SimCLR’s learning algorithm computes contrastive loss across all positive and all negative pairs.
• The representation h is achieved after max pooling.
• It is beneficial to define the loss on z instead of h, because a nonlinear projection head can improve the representation quality of the layer before it.
Which of the following statements are true about projection heads?
• Nonlinear projection is better than linear projection.
• Nonlinear projection is worse than linear projection.
• Linear projection is better than no projection.
• Linear projection is worse than no projection.