Spatial Transformer Networks
Shashank Tyagi, Ishan Gupta

Based on: Jaderberg, Max, et al. "Spatial Transformer Networks." Proceedings of the 28th International Conference on Neural Information Processing Systems. MIT Press, 2015.


Source: cseweb.ucsd.edu/classes/sp17/cse252C-a/CSE252C_20170522.pdf

Outline
● Introduction
● Limitations of CNNs
● Related work
● Spatial transformer
  ○ Architecture
  ○ Mathematical formulation
● Experimental results
● Conclusion

Introduction
● Convolutional Neural Networks.

Visualizing CNNs

Harley, Adam W. "An interactive node-link visualization of convolutional neural networks." International Symposium on Visual Computing. Springer International Publishing, 2015.

Limitations
● Limited spatial invariance.
● Max pooling has small spatial support.
● Only deep layers (towards the output) achieve invariance.
● No rotation or scaling invariance.
● The fixed location and size of the receptive field is a bottleneck for handling invariance.

Image: http://cdn-ak.f.st-hatena.com/images/fotolife/v/vaaaaaanquish/20150126/20150126055504.png

Related Work
● Hinton's work on autoencoders (transforming auto-encoders).
● Locally Scale-Invariant Convolutional Neural Networks.

Related Work
● Previous works cover the ideas behind modelling transformations with neural networks and learning transformation-invariant representations.
● Spatial transformers manipulate the data itself rather than the feature extractors.
● The introduction of selective attention brought the idea of looking at specific parts of the image, which can be termed regions of interest.
● In that sense, spatial transformers can be seen as a differentiable attention scheme that also learns the spatial transformation.

Spatial Transformer
● A dynamic mechanism that actively spatially transforms an image or feature map by learning an appropriate transformation matrix.
● The transformation matrix can encode translation, rotation, scaling, cropping and non-rigid deformations.
● Allows end-to-end trainable models using standard back-propagation.

Spatial Transformer
● Three differentiable modules:
  ○ Localisation network.
  ○ Parameterised sampling grid (grid generator).
  ○ Differentiable image sampling (sampler).

Localisation Network
● Takes in a feature map U ∈ R^(H×W×C) and outputs the parameters θ of the transformation.
● Can be realised as a fully-connected or convolutional network that regresses the transformation parameters.
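As a rough sketch, the localisation network is just a small regressor over the feature map. The two-layer fully-connected shape and sizes below are illustrative assumptions, not the paper's exact architecture; following the paper, the final layer is initialised so the network starts from the identity transform.

```python
import numpy as np

def localisation_net(U, W1, b1, W2, b2):
    """Toy fully-connected localisation network: flattens the feature
    map U and regresses the 6 parameters of a 2x3 affine matrix."""
    x = U.reshape(-1)                 # flatten H x W x C features
    h = np.maximum(0.0, W1 @ x + b1)  # ReLU hidden layer
    theta = (W2 @ h + b2).reshape(2, 3)
    return theta

H, W, C, hidden = 8, 8, 1, 16
rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.01, (hidden, H * W * C)); b1 = np.zeros(hidden)
W2 = np.zeros((6, hidden))                     # zero final weights ...
b2 = np.array([1., 0., 0., 0., 1., 0.])        # ... identity bias
theta = localisation_net(rng.normal(size=(H, W, C)), W1, b1, W2, b2)
print(theta)  # the identity transform [[1, 0, 0], [0, 1, 0]]
```

With zero final weights, every input initially maps to the identity, and training gradually moves θ away from it.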

Parameterised Sampling Grid (Grid Generator)
● Generates the sampling grid using the transformation predicted by the localisation network.

Parameterised Sampling Grid (Grid Generator)
● Attention model: maps the target regular grid to the source transformed grid.
● Identity transform: s = 1, tx = 0, ty = 0.
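The attention model constrains the transformation to an isotropic scale s plus a translation (tx, ty), a special case of the full affine matrix. A minimal sketch (NumPy, illustrative function name):

```python
import numpy as np

def attention_theta(s, tx, ty):
    """Attention-style transform: isotropic scaling plus translation,
    a constrained special case of the full 2x3 affine matrix."""
    return np.array([[s, 0.0, tx],
                     [0.0, s, ty]])

# The slide's identity transform: s = 1, tx = 0, ty = 0.
print(attention_theta(1.0, 0.0, 0.0))
```

Because only three parameters are free, this variant can crop, translate and scale, but not rotate or shear.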

Parameterised Sampling Grid (Grid Generator)
● Affine transform: maps the target regular grid to the source transformed grid.
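The grid generator can be sketched as follows: build a regular target grid of normalised coordinates in [-1, 1] and push each point through the 2x3 affine matrix θ to obtain the source sampling coordinates. This is an illustrative NumPy version, not the paper's implementation.

```python
import numpy as np

def affine_grid(theta, H, W):
    """Map each target grid point (x_t, y_t) in [-1, 1] through theta
    to source coords (x_s, y_s) = theta @ [x_t, y_t, 1]^T."""
    ys, xs = np.meshgrid(np.linspace(-1, 1, H),
                         np.linspace(-1, 1, W), indexing="ij")
    ones = np.ones_like(xs)
    tgt = np.stack([xs, ys, ones], axis=-1)   # (H, W, 3) homogeneous
    return tgt @ theta.T                      # (H, W, 2) source coords

identity = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
grid = affine_grid(identity, 4, 4)
# Under the identity transform the source grid equals the target grid.
```

Scaling θ's diagonal down below 1 shrinks the source grid, which is how the transformer implements cropping/zooming.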

Differentiable Image Sampling (Sampler)
● Samples the input feature map at the sampling-grid locations to produce the output map.

Mathematical Formulation of Sampling
● General formulation, written in terms of:
  ○ V_i^c : target feature value at location i in channel c.
  ○ U_nm^c : input feature value at location (n, m) in channel c.
  ○ (x_i^s, y_i^s) : sampling coordinates.
  ○ k(·; Φ) : sampling kernel with parameters Φ_x, Φ_y.
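The general sampling equation that these terms refer to (from Jaderberg et al.) is:

```latex
V_i^c = \sum_{n=1}^{H} \sum_{m=1}^{W} U_{nm}^c \, k(x_i^s - m;\, \Phi_x) \, k(y_i^s - n;\, \Phi_y)
\qquad \forall i \in [1, \dots, H'W'],\; \forall c \in [1, \dots, C]
```

Each output value is a kernel-weighted sum over the whole input feature map, with identical sampling applied to every channel.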

Kernels
● Integer sampling kernel: copies the value at the nearest integer location.
● Bilinear sampling kernel: a weighted average of the four nearest pixels.
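The bilinear kernel from the paper replaces k with max(0, 1 - |·|) in each axis, so only the four neighbouring pixels get non-zero weight. A minimal single-channel NumPy sketch:

```python
import numpy as np

def bilinear_sample(U, xs, ys):
    """Bilinear sampling: V = sum_{n,m} U[n, m]
       * max(0, 1 - |xs - m|) * max(0, 1 - |ys - n|),
    with (xs, ys) in pixel coordinates of a 2-D map U."""
    H, W = U.shape
    m = np.arange(W); n = np.arange(H)
    kx = np.maximum(0.0, 1.0 - np.abs(xs - m))   # (W,) x-axis weights
    ky = np.maximum(0.0, 1.0 - np.abs(ys - n))   # (H,) y-axis weights
    return ky @ U @ kx                            # separable weighted sum

U = np.array([[0.0, 1.0],
              [2.0, 3.0]])
print(bilinear_sample(U, 0.5, 0.5))  # centre of the 2x2 map -> 1.5
```

At integer coordinates the result equals the underlying pixel, so the bilinear kernel subsumes integer sampling.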

Backpropagation through the Sampling Mechanism
● Gradients with the bilinear sampling kernel.
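The gradients this slide refers to (from the paper) are sub-differentiable everywhere, which is what allows loss gradients to flow back to both the feature map and the sampling coordinates:

```latex
\frac{\partial V_i^c}{\partial U_{nm}^c}
  = \max(0,\, 1 - |x_i^s - m|)\,\max(0,\, 1 - |y_i^s - n|)

\frac{\partial V_i^c}{\partial x_i^s}
  = \sum_{n=1}^{H} \sum_{m=1}^{W} U_{nm}^c \,\max(0,\, 1 - |y_i^s - n|)
    \begin{cases} 0 & \text{if } |m - x_i^s| \ge 1 \\
                  1 & \text{if } m \ge x_i^s \\
                 -1 & \text{if } m < x_i^s \end{cases}
```

The gradient with respect to y_i^s is symmetric, and the chain rule then carries these back through the grid generator to the localisation network's parameters θ.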

Experiments: evaluating spatial transformer networks
● Distorted MNIST
● Traffic Sign Detection
● Co-localisation

Applications: incorporating spatial transformers in CNNs
● Multiple Spatial Transformers
● Spatial Attention
● Saliency Detection and Refinement

Distorted MNIST
● Heavy reductions in training loss can easily be achieved using deep networks already trained on diverse classes of images.
● But what happens when the trained network sees this?

Distorted MNIST
● The distorted MNIST dataset is created by applying rotation; rotation, translation and scaling; and projective transformations to the original MNIST dataset.
● Affine, projective and thin plate spline transformations were learnt by the localisation network of the spatial transformer.
● The ST-FCN models improve on the baseline FCN and CNN models, and the experiment suggests that spatial transformers are complementary to max pooling.

Distorted MNIST
● Results

R : rotation
RTS : rotation, translation and scaling
P : projective distortion
E : elastic distortion

(a) : inputs to the network
(b) : transformation applied by the spatial transformer
(c) : output of the spatial transformer network

Traffic Sign Detection
● Experiment performed by Moodstocks (a French image-recognition startup).
● Evaluation on GTSRB (the German Traffic Sign Recognition Benchmark).
● The GTSRB dataset contains images spread over 43 classes.
● A total of 39,209 training examples and 12,630 test examples.

Traffic Sign Detection
Visualising spatial transformers during training:
● On the left is the original image.
● On the right is the spatially transformed output.
● At the bottom is a counter of training steps.

Traffic Sign Detection
Post training:
● Images taken from a video sequence while approaching a traffic sign.

Traffic Sign Detection
Results:

Co-Localisation
● A semi-supervised learning scheme.
● Requires no training labels or location ground truth.
● Applied to a dataset where each sample contains a common feature of some class.
● Wait, this covers the "semi" part, but it is still supervised. How do you train it? With a triplet loss.
● The triplet consists of two cropped images, I_n and I_m, and a randomly sampled patch.
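A minimal sketch of such a hinge-style triplet loss over encoded crops (NumPy; the encoding vectors, margin value and names are illustrative assumptions):

```python
import numpy as np

def triplet_hinge_loss(anchor, positive, negative, margin=1.0):
    """Co-localisation-style triplet loss: the encoding of one crop
    (anchor) should be closer to the other crop (positive) than to a
    randomly sampled patch (negative), by at least a margin."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

a = np.array([0.0, 0.0])   # encoding of crop I_n
p = np.array([0.1, 0.0])   # encoding of crop I_m, same common object
n = np.array([3.0, 4.0])   # encoding of a random patch, far away
print(triplet_hinge_loss(a, p, n))  # 0.0: already separated by the margin
```

Minimising this loss pushes the spatial transformers of different images towards crops that look alike, without any location labels.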

Co-Localisation
Training procedure:

Co-Localisation
● Iterating the training process.

Multiple Spatial Transformers
● As seen in the previous slides, spatial transformers can be inserted before/after the conv layers and before/after max pooling.
● Spatial transformers can also be attached in parallel to focus on multiple objects at once.
● Limitation:
  ○ You need as many parallel spatial transformers as there are objects to model.

Multiple Spatial Transformers
● Adding the digits in two images.

Spatial Attention
Inspiration behind attention:
● How do humans perceive a scene?
● Do they compress the entire image into a static representation?
● Or do we focus on a single object at a time and learn the generated sequence to develop semantics?

Spatial Attention
Inspiration behind attention: neural machine translation.

Spatial Attention
● Motivation.

Spatial Attention
Hard vs. soft attention:

Spatial Attention
Soft attention:
● Uses a weighted sum of features as the input to the sequence generator.
● The weight (probability) assigned to each feature is learned.
● Fully differentiable, so it can be trained with standard back-propagation.
● Uses the whole input at all times.
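The weighted-sum idea above can be sketched in a few lines (NumPy; the scores would come from a learned scoring network, which is omitted here):

```python
import numpy as np

def soft_attention(features, scores):
    """Soft attention: softmax the learned scores into weights that
    sum to 1, then pass the weighted sum of features downstream.
    Differentiable, but touches every feature at every step."""
    e = np.exp(scores - np.max(scores))   # numerically stable softmax
    weights = e / e.sum()
    return weights @ features             # weighted sum, shape (D,)

feats = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])            # 3 feature vectors of dim 2
ctx = soft_attention(feats, np.array([0.0, 0.0, 0.0]))
# Equal scores -> uniform weights -> simple average of the features.
```

Sharpening the scores pushes the weights towards one-hot, which is exactly the hard-attention limit described on the next slides.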

Spatial Attention
Soft attention:

Spatial Attention
Hard attention:
● Uses a single feature at a time for sequence generation.
● A special case of soft attention where all the weights except one are zero.
● Not differentiable.
● Uses reinforcement learning to assign rewards and decide the next state.

Spatial Attention
Hard attention:

Spatial Attention
● Spatial transformers can be used as a differentiable attention mechanism.
● Each transformer in the network focuses on discriminative object parts.
● Each predicts the location of an attention window and samples the cropped region.
● Each output can then be described by its own network stream.

Spatial Attention
Network architecture:

Spatial Attention
Results on the CUB-200-2011 birds dataset using spatial transformers.

Saliency Detection and Refinement
What is saliency detection?
● Detecting high-level modalities in the image by segmenting out objects with their boundaries.

(Figure: input image and its saliency map.)

Saliency Detection and Refinement
Detection cues:
● Colour spatial distribution.
● Centre-surround histogram.
● Multi-scale contrast.

Saliency Detection and Refinement
The need for accurate detection and refinement:
● Low-level cues cannot capture high-level information about the object and its surroundings.
● Handling all scales requires computationally intensive solutions.

Saliency Detection and Refinement
CNN-DecNN architecture (input image → saliency map).

Saliency Detection and Refinement
Recurrent model using spatial transformers.
● Remember: spatial transformers can perform attention!

Saliency Detection and Refinement
Implementation details:
● Generate an initial saliency map using a predefined CNN-DecNN network.
● RNNs provide recurrent attention to refine the saliency map.
● Spatial transformers learn to focus on sub-parts of the image.
● The focus is decided using context information from the previous RNN state.

(Figure: input image, initial saliency map, and attended regions.)

Saliency Detection and Refinement
Implementation details:
● Hidden-to-hidden interactions pass contextual information, which is used for saliency refinement.
● Convolutional operations are used in the RNNs to preserve the spatial information needed by the deconvolutional networks.
● A double-layer RNN learns location and contextual dependencies separately.

Saliency Detection and Refinement
Implementation details:

Saliency Detection and Refinement
Results: precision-recall curves.

Saliency Detection and Refinement
Results: qualitative saliency maps for some evaluated images. From the leftmost column: input image, saliency ground truth, the output maps of the proposed method (CNN-DecNN + RACDNN) with mean-shift post-processing, then MCDL, MDF, RRWR, BSCA, DRFI, RBD, DSR, MC and HS.

Conclusion
● Introduced a new module: the spatial transformer.
● Helps to learn explicit spatial transformations of features, such as translation, rotation, scaling, cropping and non-rigid deformations.
● Can be used in any network, at any layer, and learnt in an end-to-end trainable manner.
● Improves the performance of existing models.

QUESTIONS?

Resources
● Jaderberg, Max, Karen Simonyan, and Andrew Zisserman. "Spatial Transformer Networks." Advances in Neural Information Processing Systems. 2015.
● Harley, Adam W. "An Interactive Node-Link Visualization of Convolutional Neural Networks." ISVC, pages 867-877, 2015.
● CS231n coursework @ Stanford.
● "Spatial Transformer Networks", slides by Victor Campos.
● Kuen, Jason, Zhenhua Wang, and Gang Wang. "Recurrent Attentional Networks for Saliency Detection." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
● Hinton, Geoffrey, Alex Krizhevsky, and Sida Wang. "Transforming Auto-encoders." Artificial Neural Networks and Machine Learning - ICANN 2011 (2011): 44-51.
● Kanazawa, Angjoo, Abhishek Sharma, and David Jacobs. "Locally Scale-Invariant Convolutional Neural Networks." arXiv preprint arXiv:1412.5104 (2014).