classification and semantic segmentationyboykov/courses/cs898/lectures/lec5... · semantic...

Classification and Semantic Segmentation

Most slides are from Fei-Fei Li, Justin Johnson, Andrej Karpathy, Serena Yeung, Jia-Bin Huang, Bharath Hariharan, Jeremy Jordan

Supervised Machine Learning

▪ Training data $x_1, \dots, x_N$ with true labels (targets) $y_1, \dots, y_N$
▪ Choose hypothesis class $h(x, W)$
▪ Define loss function for $x$ when the true label is $y$
  ▪ e.g. $L(h(x, W), y) = (y - h(x, W))^2$
▪ Training stage
  ▪ minimize total loss on training set using gradient descent

$$\min_W \sum_{i=1}^{N} L(h(x_i, W), y_i)$$

▪ Test stage
  ▪ compute accuracy on test data, unseen during training
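The training and test stages can be sketched as follows. This is a minimal illustration that assumes a linear hypothesis $h(x, W) = Wx$ with the squared loss; the synthetic data, the step size `lr`, and the number of steps are assumptions chosen for the example, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): targets follow a noiseless linear rule.
W_true = np.array([2.0, -1.0, 0.5])
X_train = rng.normal(size=(200, 3))
y_train = X_train @ W_true
X_test = rng.normal(size=(50, 3))
y_test = X_test @ W_true

def total_loss(W, X, y):
    """Sum over samples of (y_i - h(x_i, W))^2 with h(x, W) = W . x."""
    return np.sum((y - X @ W) ** 2)

def gradient(W, X, y):
    """Gradient of the total squared loss with respect to W."""
    return -2.0 * X.T @ (y - X @ W)

# Training stage: plain gradient descent on the total training loss.
W = np.zeros(3)
lr = 1e-3  # step size (assumption)
for _ in range(500):
    W = W - lr * gradient(W, X_train, y_train)

# Test stage: evaluate on data that was not used for training.
train_loss = total_loss(W, X_train, y_train)
test_loss = total_loss(W, X_test, y_test)
```

Because this toy problem is a regression, the test stage reports a loss rather than an accuracy; for a classifier one would count correct predictions instead.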

Single Layer Neural Network on Images

▪ 2 classes (cat vs dog)
▪ $h(x, W) = \sigma(Wx)$
  ▪ range in (0,1)

[Figure: image → $W$ → sigmoid → p(dog)]

▪ Also called Linear Classifier
▪ Works well only for linearly separable classes
  ▪ not expressive enough

Single Layer NN for multiple classes

▪ Several classes (dog, cat, horse)

[Figure: softmax outputs p(horse), p(dog), p(cat)]

▪ One-hot encoding for labels $y$: horse = (1, 0, 0), dog = (0, 1, 0), cat = (0, 0, 1)
▪ Cross-entropy loss

$$-\sum_{classes} y_{true} \log(y_{pred})$$
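A small sketch of softmax and the cross-entropy loss for one-hot labels. The weights `W`, the input `x`, and their sizes are hypothetical; only the class order (horse, dog, cat) follows the encoding above.

```python
import numpy as np

def softmax(scores):
    """Turn a vector of class scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp)

def cross_entropy(y_true, y_pred):
    """-sum over classes of y_true * log(y_pred), with y_true one-hot."""
    return -np.sum(y_true * np.log(y_pred))

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # hypothetical weights: 3 classes, 4 input features
x = rng.normal(size=4)        # hypothetical input

y_pred = softmax(W @ x)               # (p(horse), p(dog), p(cat))
y_dog = np.array([0.0, 1.0, 0.0])     # one-hot label for "dog"
loss = cross_entropy(y_dog, y_pred)
```

With a one-hot label the sum keeps a single term, so the loss is just the negative log-probability assigned to the true class.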

Multilayer Neural Network on images

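▪ cat vs dog

[Figure: fully connected network. A 256x256 input passes through Linear + ReLU, Linear + ReLU, and Linear + sigmoid layers to produce p(dog); layer widths labeled in the figure: 2048, 1024, 32]

▪ Layers are called fully-connected
▪ Expressive enough, but huge number of parameters
  ▪ expensive, requires lots of data to train well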
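To get a feel for "huge number of parameters", the sketch below counts weights and biases in a chain of fully-connected layers. The order of the layer widths is an assumption (the figure's layout is not recoverable from the text); only the 256x256 input size is taken directly from the slide.

```python
def count_fc_params(widths):
    """Weights plus biases for each layer in a chain of fully-connected layers."""
    return [n_in * n_out + n_out for n_in, n_out in zip(widths[:-1], widths[1:])]

# Assumed layer widths: flattened 256x256 image, three hidden layers, one output.
widths = [256 * 256, 2048, 1024, 32, 1]
per_layer = count_fc_params(widths)
total = sum(per_layer)
```

Under these assumptions the first layer alone holds about 134 million parameters, which is why the following slides look for ways to reduce the count.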

Reducing Number of Parameters

[Figure: a 256x256 image has 65,536 pixels]

Idea 1: local connectivity

Pixels only related to nearby pixels

Idea 2: Translation invariance

Weights should not depend on the location of the neighborhood

Linear function + translation invariance = convolution

▪ Local connectivity determines kernel size

Example 3x3 kernel:

5.4 0.1 3.6
1.8 2.3 4.5
1.1 3.4 7.2
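A minimal sketch of the idea: one small kernel is slid over the image, so the same 9 weights are reused at every location. The kernel values are the ones shown above; the 5x5 test image is an illustrative assumption.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid sliding-window filter: the same kernel weights at every location.

    As in most deep learning libraries, the kernel is not flipped
    (strictly speaking this is cross-correlation).
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

kernel = np.array([[5.4, 0.1, 3.6],
                   [1.8, 2.3, 4.5],
                   [1.1, 3.4, 7.2]])

image = np.zeros((5, 5))
image[2, 2] = 1.0  # single bright pixel (illustrative input)
response = conv2d(image, kernel)
```

The layer has only `kernel.size` = 9 weights regardless of image size, compared with one weight per pixel per output for a fully-connected layer.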

Convolution over multiple channels

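[Figure: each input channel is convolved (*) with its own kernel slice, and the per-channel results are summed (+) into one output feature map]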

CNN: Convolutional layer

[Figure: convolution maps a w × h × c input to a w × h × c' output]
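A sketch of a convolutional layer that maps c input channels to c' output channels. Zero padding is assumed so that height and width are preserved, and arrays are stored as (h, w, channels); both are choices made for this example.

```python
import numpy as np

def conv_layer(x, filters):
    """Apply c_out filters to an (h, w, c) input with zero padding.

    x:       array of shape (h, w, c)
    filters: array of shape (k, k, c, c_out), k odd
    returns: array of shape (h, w, c_out)
    """
    h, w, c = x.shape
    k, _, _, c_out = filters.shape
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros((h, w, c_out))
    for i in range(h):
        for j in range(w):
            patch = xp[i:i + k, j:j + k, :]  # (k, k, c) window
            # Multiply by every filter and sum over the window and all channels.
            out[i, j, :] = np.tensordot(patch, filters, axes=3)
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 6, 3))           # hypothetical input, c = 3
filters = rng.normal(size=(3, 3, 3, 5))  # hypothetical 3x3 filters, c' = 5
y = conv_layer(x, filters)
```

Each output channel is a sum over all input channels, so the layer has k · k · c · c' weights.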

CNN: Convolution Subsampling Convolution

▪ Subsampling can be implemented by applying convolution in strides
  ▪ every 2 (or 3, or 4, …) pixels
▪ number of features is usually increased after subsampling, to maintain expressiveness

[Figure: subsampling]

▪ Convolution in earlier steps detects more local patterns, less resilient to distortion
▪ Convolution in later steps detects more global patterns, more resilient to distortion
▪ Subsampling allows capture of larger, more invariant patterns
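Strided convolution can be sketched as evaluating the filter only at every `stride`-th position. The 8x8 input and the averaging kernel are illustrative assumptions.

```python
import numpy as np

def conv2d_strided(image, kernel, stride):
    """Valid convolution evaluated only every `stride` pixels."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

image = np.arange(64, dtype=float).reshape(8, 8)
kernel = np.ones((2, 2)) / 4.0                      # illustrative averaging kernel
dense = conv2d_strided(image, kernel, stride=1)     # shape (7, 7)
strided = conv2d_strided(image, kernel, stride=2)   # shape (4, 4)
```

The strided result equals the dense result sampled at every second position, which is why stride acts as subsampling.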

Invariance to distortions: Pooling

▪ Each window reduced to one value
  ▪ with max or average

4 7 6 9 3 11

8 3 21 4 0 0

1 2 1 3 5 6

7 9 4 3 1 8

5 2 1 5 5 0

0 1 6 4 5 6

Invariance to distortions: Max Pooling

8 21 11

9 4 8

5 6 6

Invariance to distortions: Average Pooling

5.5 10 3.5

4.75 2.75 5

2 4 4
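The two pooling results can be reproduced with a short sketch. It assumes non-overlapping 2x2 windows, which is what the 6x6 input and 3x3 outputs above imply.

```python
import numpy as np

def pool2d(x, size, reduce):
    """Reduce each non-overlapping size x size window to one value."""
    h, w = x.shape
    windows = x.reshape(h // size, size, w // size, size)
    return reduce(windows, axis=(1, 3))

x = np.array([[4, 7, 6, 9, 3, 11],
              [8, 3, 21, 4, 0, 0],
              [1, 2, 1, 3, 5, 6],
              [7, 9, 4, 3, 1, 8],
              [5, 2, 1, 5, 5, 0],
              [0, 1, 6, 4, 5, 6]], dtype=float)

max_pooled = pool2d(x, 2, np.max)
avg_pooled = pool2d(x, 2, np.mean)
```

The function has no learnable weights; the window size is its only setting.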

CNN: Pooling Layer

▪ Each pooling layer takes a collection of feature maps as input and produces a collection of feature maps as output
▪ Output feature maps are usually smaller in height and width
▪ Parameters: None

Convolutional networks

[Figure: input image → convolutional and pooling layers → fully connected layers → Horse]

First Successful Classification CNN

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86.11 (1998): 2278-2324.

AlexNet - 2012

▪ Won ImageNet competition by a large margin

VGGNet 2014

▪ First simple widely used net
▪ Smaller filters and deeper network

ResNet 2015

▪ Many more layers
▪ Special skip connections for better training


[Figure: image with per-pixel class labels: person, grass, trees, motorbike, road]

Semantic Segmentation: One-hot encoding

Semantic Segmentation: Cross-Entropy Loss Function

$$-\sum_{classes} y_{true} \log(y_{pred})$$

▪ Pixelwise loss
▪ Added over all pixels
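A sketch of the pixelwise loss: cross-entropy is computed at every pixel from its one-hot label and then added over the image. The 2x2 label map, the 3 classes, and the uniform prediction are hypothetical.

```python
import numpy as np

def pixelwise_cross_entropy(y_true, y_pred):
    """Cross-entropy per pixel, added over all pixels.

    y_true: one-hot labels, shape (h, w, classes)
    y_pred: predicted class probabilities, shape (h, w, classes)
    """
    per_pixel = -np.sum(y_true * np.log(y_pred), axis=-1)  # shape (h, w)
    return per_pixel.sum()

# Hypothetical 2x2 image with 3 classes.
labels = np.array([[0, 1],
                   [2, 2]])
y_true = np.eye(3)[labels]               # one-hot encoding, shape (2, 2, 3)
y_pred = np.full((2, 2, 3), 1.0 / 3.0)   # uniform prediction at every pixel
loss = pixelwise_cross_entropy(y_true, y_pred)
```

With a uniform prediction over 3 classes, each of the 4 pixels contributes log 3 to the total.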

Semantic Segmentation with CNNs

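[Figure: input image of size h × w × 3 → convolution and subsampling layers → feature map of size h/4 × w/4 × $d$]

▪ Each 'pixel' of the feature map holds $d$ features, e.g. $d$ good features for classifying the top left 'pixel'
▪ Convolve with $c$ filters of size 1x1 to get a map of size h/4 × w/4 × $c$
▪ Finally, pass the $c$ features of each pixel through softmax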
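A 1x1 convolution is the same linear map from $d$ features to $c$ class scores applied independently at every pixel, which the sketch below writes as a matrix product. All sizes and the random features and weights are hypothetical.

```python
import numpy as np

def conv1x1(features, weights):
    """1x1 convolution: the same linear map applied at every pixel.

    features: (h, w, d) feature map
    weights:  (d, c), one column per 1x1 filter
    returns:  (h, w, c) class scores
    """
    return features @ weights

def softmax(scores):
    """Softmax over the last axis, i.e. separately for each pixel."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
h4, w4, d, c = 8, 8, 16, 5                    # hypothetical sizes
features = rng.normal(size=(h4, w4, d))
weights = rng.normal(size=(d, c))
probs = softmax(conv1x1(features, weights))   # shape (h4, w4, c)
```

Every pixel ends up with its own probability distribution over the $c$ classes.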


▪ Pass image through convolution and subsampling layers
▪ Final convolution with #classes outputs
▪ Get scores for subsampled image
▪ Upsample back to original size

[Figure: example segmentation with classes person and bicycle]
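The last step, upsampling the coarse scores back to the original size, can be sketched with nearest-neighbor repetition. This is only one simple choice, assumed here for illustration; the sizes and the factor of 4 are also assumptions.

```python
import numpy as np

def upsample_nearest(scores, factor):
    """Repeat every score `factor` times along height and width."""
    return np.repeat(np.repeat(scores, factor, axis=0), factor, axis=1)

rng = np.random.default_rng(0)
coarse_scores = rng.normal(size=(4, 6, 3))        # hypothetical (h/4, w/4, classes)
full_scores = upsample_nearest(coarse_scores, 4)  # shape (16, 24, 3)
label_map = full_scores.argmax(axis=-1)           # predicted class per pixel
```

Every 4x4 block of the output shares one prediction, which is the blockiness that motivates the resolution discussion that follows.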

The Resolution Issue

▪ Problem: Need fine details!
▪ Shallower network/earlier layers?
  ▪ not very semantic

Visualizations from : M. Zeiler and R. Fergus. Visualizing and Understanding Convolutional Networks. In ECCV 2014

The Resolution Issue

▪ Problem: Need fine details!
▪ Remove subsampling?
  ▪ Need many features per pixel
    ▪ expensive without subsampling
  ▪ Need large field of view for final features
    ▪ very deep network, expensive without subsampling

Solution 1: Image pyramids

[Figure: image pyramid; labels: Higher resolution, Less context]

Learning Hierarchical Features for Scene Labeling. Clement Farabet, Camille Couprie, Laurent Najman, Yann LeCun. In TPAMI, 2013

▪ Does not scale well to deep architectures

Solution 2: CNN+Conditional Random Fields

▪ Combine with CRF as post-processing
▪ “Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs”, Chen et al., ICLR 2015

Solution 2: CNN+Conditional Random Fields

[Figure: input → CNN → class probabilities → Full CRF (as RNN) → final output]

▪ Combine with CRF in end-to-end trainable system
▪ mean field inference implemented as RNN
▪ Zheng et al., “Conditional Random Fields as Recurrent Neural Networks”, ICCV 2015

Solution 3: Learn to Upsample

▪ Encoder/decoder structure

Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
Noh et al., “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015
Badrinarayanan et al., “SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation”, TPAMI 2017

Methods for Upsampling

Decoding using only Upsampling

From Long et al.: struggles to produce fine-grained segmentations

“Semantic segmentation faces an inherent tension between semantics and location: global information resolves what while local information resolves where... Combining fine layers and coarse layers lets the model make local predictions that respect global structure.”

Solution 4: Skip connections

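[Figure: coarse predictions are upsampled and combined with finer layers through skip connections]

Fully convolutional networks for semantic segmentation. Evan Shelhamer, Jon Long, Trevor Darrell. In CVPR 2015

[Figure: segmentation results without skip connections vs. with skip connections]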
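A rough sketch of a skip connection in the spirit of the approach cited above: coarse class scores from a deep layer are upsampled 2x and added to class scores computed from a finer, earlier layer. The nearest-neighbor upsampling, the array sizes, and the function names are assumptions made for this example.

```python
import numpy as np

def upsample_nearest(x, factor):
    """Repeat every value `factor` times along height and width."""
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

def fuse_with_skip(coarse_scores, fine_scores):
    """Upsample coarse scores 2x and add scores from a finer layer."""
    return upsample_nearest(coarse_scores, 2) + fine_scores

rng = np.random.default_rng(0)
coarse = rng.normal(size=(4, 4, 3))   # hypothetical deep-layer class scores
fine = rng.normal(size=(8, 8, 3))     # hypothetical earlier-layer class scores
fused = fuse_with_skip(coarse, fine)  # shape (8, 8, 3)
```

The coarse term carries the "what" and the fine term adds the "where", matching the tension described in the quotation from Long et al.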

Solution 5: Dilation

▪ Need subsampling to allow convolutional layers to capture large regions with small filters
  ▪ can we do this without subsampling?

Fully convolutional networks for semantic segmentation. Evan Shelhamer, Jon Long, Trevor Darrell. In CVPR 2015
Multi-Scale Context Aggregation by Dilated Convolutions. Yu et al. In ICLR 2016

▪ Instead of subsampling by factor of 2: dilate by factor of 2
  ▪ allows for exponential increase in field of view without decrease of spatial dimensions
▪ Not a panacea: without subsampling, feature maps are much larger
  ▪ memory issues
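The growth in field of view can be checked with a short calculation. The sketch assumes stride-1 layers with 3x3 kernels and compares undilated layers with dilation that doubles at each layer; the layer counts are illustrative.

```python
def receptive_field(kernel_size, dilations):
    """Field of view (in pixels, per axis) after stacking stride-1 layers."""
    field = 1
    for d in dilations:
        # A dilated kernel spans (kernel_size - 1) * d + 1 pixels.
        field += (kernel_size - 1) * d
    return field

# 1 to 4 stacked layers, without dilation and with dilation 1, 2, 4, 8.
plain = [receptive_field(3, [1] * n) for n in range(1, 5)]
dilated = [receptive_field(3, [2 ** i for i in range(n)]) for n in range(1, 5)]
```

Here `plain` grows linearly (3, 5, 7, 9) while `dilated` roughly doubles per layer (3, 7, 15, 31), with no reduction in the size of the feature maps.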

Putting it all together

[Figure: bar chart of mean IoU on PASCAL VOC (axis from 55 to 70) for Basic, +Skip, +Dilation, +CRF]

Best Non-CNN approach: ~46.4%

Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, Alan Yuille. In ICLR, 2015.
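For reference, mean IoU (intersection over union) can be sketched as below for a single pair of label maps. Benchmarks typically accumulate the counts over a whole dataset; the per-image form, the tiny label maps, and the rule for skipping absent classes are simplifying assumptions.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Average over classes of intersection / union of the class masks.

    Classes absent from both maps are skipped (a convention assumed here).
    """
    ious = []
    for k in range(num_classes):
        p, t = pred == k, target == k
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))

target = np.array([[0, 0, 1],
                   [0, 1, 1]])
pred = np.array([[0, 1, 1],
                 [0, 1, 1]])
score = mean_iou(pred, target, num_classes=2)
```

In this toy case class 0 has IoU 2/3 and class 1 has IoU 3/4, so a single mislabeled pixel lowers both.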

More Architectures: U-net

▪ expanding the decoder to be symmetric with the encoder

“U-Net: Convolutional Networks for Biomedical Image Segmentation”, Ronneberger et al., MICCAI 2015

PSPNet

▪ Pyramid Pooling module
  ▪ new module to capture global scene context
▪ 82.6 mean IoU on PASCAL VOC

“Pyramid Scene Parsing Network”, Zhao et al., CVPR 2017

ICNet for Real-Time Semantic Segmentation

“ICNet for Real-Time Semantic Segmentation on High-Resolution Images”, Zhao et al., ECCV 2018

▪ Apply heavier CNN to low-resolution input
