
Module 5: Deep Convnets for Local Recognition
Joost van de Weijer, 4 April 2016

Previously, end-to-end..

Slide credit: Jose M. Álvarez

Image input, through a Learned Representation, to a label output ("Dog").

Part I: End-to-end learning (E2E): the representation is learned for Task A (e.g. image classification).

Previously, end-to-end..

Part I: End-to-end learning (E2E): a Learned Representation is trained on Domain A.
Part I': End-to-End Fine-Tuning (FT): the learned representation is transferred to Domain B and fine-tuned.

Previously, fine-tuning..

slide credit: X. Giro
Slide credit: Victor Campos, "Layer-wise CNN surgery for Visual Sentiment Prediction" (ETSETB 2015)

Fine-tuning a pre-trained network

Fine-tuning: a high learning rate in the new layer, and a low learning rate in all other layers.
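This recipe can be sketched in a few lines of plain Python; the layer names and the learning-rate multipliers below are illustrative placeholders, not values from the lecture.

```python
# Sketch of fine-tuning with per-layer learning rates (illustrative values).
base_lr = 0.001

# low multiplier for pre-trained layers, high for the newly added layer
lr_mult = {"conv1": 0.1, "conv2": 0.1, "fc_new": 10.0}

def sgd_step(params, grads):
    """One SGD step where each layer uses base_lr times its own multiplier."""
    return {name: p - base_lr * lr_mult[name] * grads[name]
            for name, p in params.items()}

params = {"conv1": 1.0, "conv2": 1.0, "fc_new": 1.0}
grads = {"conv1": 1.0, "conv2": 1.0, "fc_new": 1.0}
params = sgd_step(params, grads)
# fc_new moved 100x further than conv1/conv2 in this single step
```

In frameworks this is typically expressed with per-parameter-group learning rates rather than a hand-written loop.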

Previously, off-the-shelf features..

slide credit: X. Giro

Part I: End-to-end learning (E2E): a Learned Representation is trained for Task A (e.g. image classification).
Part II: Off-the-shelf features: the same representation is reused for Task B (e.g. image retrieval).

Image classification: image as an input, label ("Orange") as output.

Representations extracted from the network:
- spatially coded image representations (like spatial pyramids): d_x x d_y x d_F
- orderless image representations (like BOW): 1 x 1 x d_F

Two deep lectures in M5: Deep ConvNets for Recognition at...
- Global Scale (today's lecture)
- Local Scale (next lecture)

Image Classification

Image classification: image as an input, label ("Orange") as output.

How to process non-squared images? Options: resize, zero padding, or taking the largest centred square.
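Two of the three options can be sketched on a toy "image" stored as nested lists (hypothetical helper names; a proper resize needs interpolation and is omitted here).

```python
# Handling non-squared inputs, on a toy 2x4 "image".
def zero_pad_to_square(img):
    """Centre the image inside a square canvas of zeros."""
    h, w = len(img), len(img[0])
    side = max(h, w)
    out = [[0] * side for _ in range(side)]
    top, left = (side - h) // 2, (side - w) // 2
    for i in range(h):
        for j in range(w):
            out[top + i][left + j] = img[i][j]
    return out

def largest_centred_square(img):
    """Crop the largest square from the image centre."""
    h, w = len(img), len(img[0])
    side = min(h, w)
    top, left = (h - side) // 2, (w - side) // 2
    return [row[left:left + side] for row in img[top:top + side]]

img = [[1, 2, 3, 4],
       [5, 6, 7, 8]]
padded = zero_pad_to_square(img)        # 4x4 canvas, image centred
cropped = largest_centred_square(img)   # 2x2 centre crop -> [[2, 3], [6, 7]]
```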

Local object recognition:
- object localization (single object)
- object detection
- semantic segmentation

Classification + LOCALIZATION

Localization as regression

slide credit: Li, Karpathy, Johnson

On top of the shared features the network has two heads:
- classification head: C class scores
- regression head: C x 4 numbers (one bounding box per class)

Problem: multiple classes
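A shape-level sketch of how the two heads handle multiple classes: the regression head outputs one box per class, and at test time the box of the highest-scoring class is kept. The values below are random placeholders, not a trained model.

```python
import random
random.seed(0)

C = 20  # e.g. the PASCAL classes

# the two heads on top of a shared feature vector
class_scores = [random.random() for _ in range(C)]               # C scores
boxes = [[random.random() for _ in range(4)] for _ in range(C)]  # C x 4 numbers

# at test time: take the box belonging to the highest-scoring class
best = max(range(C), key=lambda c: class_scores[c])
predicted_box = boxes[best]  # 4 numbers, e.g. (x, y, w, h)
```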

Localization as regression (example)

Example of localization of clothes. Regression is done in two steps: first the person bounding box and then the clothing bounding boxes (master project 2015).

Esteve Cervantes: Evaluating deep features for Fashion Recognition

Local object recognition:
- object localization (single object)
- object detection
- semantic segmentation

Object detection: any ideas?

Sliding window

Slide a 227x227 window over the image. For each window position, run classification + regression: the network returns a classification score (e.g. 0.03, 0.83, 0.99 at different positions) and a regressed bounding box. Compute a new regressed bounding box and classification score for all sliding window positions.

Repeat for different scales and combine all results (e.g. with non maxima suppression).
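Combining the overlapping per-position detections with non-maxima suppression can be sketched as follows (plain Python, greedy NMS; the 0.5 IoU threshold is an illustrative choice).

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: keep the best box, drop boxes overlapping it, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) <= thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.83, 0.99, 0.75]
keep = nms(boxes, scores)  # -> [1, 2]: the two overlapping windows collapse
```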

Sliding window (efficient computation)

Let us for simplicity consider a simple three-layer network applied to 10x10 windows:
- conv1: 5 filters of 5x5x3, giving a 6x6x5 feature map
- fc1: 10 units (equivalent to a convolution with 10 filters of 6x6x5), giving 1x1x10
- fc2: 2 units, car/not car (equivalent to a convolution with 2 filters of 1x1x10), giving 1x1x2

Now slide the 10x10 window over a 12x17 image. Part of the convolutional features are the same between overlapping windows and do not need recomputation! How many 10x10 windows are there in this 12x17 image? The convolutions can be computed in a single pass:
- conv1 (5 filters of 5x5x3): 12x17x3 input, 8x13x5 output
- fc1 as conv (10 filters of 6x6x5): 8x13x5 input, 3x8x10 output
- fc2 as conv (2 filters of 1x1x10): 3x8x10 input, 3x8x2 output

We have the 8x3 = 24 classification scores while sharing computation of the convolutional features.
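The output-size arithmetic above can be checked in a few lines (valid convolutions, no stride):

```python
# Turning the fc layers into convolutions lets the network process the
# whole 12x17 image in one pass, producing one 10x10-window score per
# output position.
def valid_conv(h, w, k):
    """Spatial size after a 'valid' convolution with a k x k filter."""
    return h - k + 1, w - k + 1

h, w = 12, 17
h, w = valid_conv(h, w, 5)  # conv1: 5x5 filters  -> 8 x 13
h, w = valid_conv(h, w, 6)  # fc1 as conv: 6x6    -> 3 x 8
h, w = valid_conv(h, w, 1)  # fc2 as conv: 1x1    -> 3 x 8
n_windows = h * w           # = number of 10x10 windows in a 12x17 image
```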

Example of bear and fish detection on multiple scales.

Sermanet et al., 'Integrated Recognition, Localization and Detection using Convolutional Networks', ICLR 2014

Networks can be written as fully convolutional networks to speed up computation at testing time.

object proposals

selective search: K. van de Sande et al. Segmentation as selective search for object recognition. ICCV 2011.

- Object proposal methods compute boxes which potentially contain an object.
- Features for each box are extracted and a classifier is applied.
- Typically thousands of boxes (but far fewer than sliding window).
- Many different approaches: selective search, edge boxes, GOP, etc.

object proposals (R-CNN)

Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." CVPR 2014.

1. compute object proposals (~2k)
2. warp the dilated bounding box
3. compute CNN features
4. classify regions (car: yes, person: no) and regress the bounding box

object proposals (R-CNN)

Take AlexNet, remove the last layer and fine-tune for the 20 PASCAL classes. Use the fc7 4096-d vector as the description of the bounding box, and train an SVM on this representation for classification.

object proposals (R-CNN)

slide credit: Girshick; Li, Karpathy, Johnson

Drawbacks of R-CNN:
- not end-to-end
- warping of boxes
- lots of duplicated computation (overlap of bounding boxes)

object proposals (Fast R-CNN)

This was first proposed by: He, Kaiming, et al. "Spatial pyramid pooling in deep convolutional networks for visual recognition." PAMI 2015.

- Compute the convolutional features once per image: shared computation (conv1-conv5) up to 'conv 5'.
- Extract features from conv5 for all bounding boxes.

object proposals (Fast R-CNN)

- Pool the features in a spatial grid. For all bounding boxes: Region of Interest pooling (ROI pooling).
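ROI pooling can be sketched for a single-channel feature map, assuming integer box coordinates that divide evenly into the grid (real implementations also handle fractional cell boundaries):

```python
# Divide a region of the conv5 feature map into a fixed grid (here 2x2)
# and max-pool each cell, so every box yields a fixed-size feature
# regardless of the box size.
def roi_pool(fmap, box, grid=2):
    """box = (r1, c1, r2, c2) in feature-map coordinates, end-exclusive."""
    r1, c1, r2, c2 = box
    h, w = r2 - r1, c2 - c1
    out = []
    for gi in range(grid):
        row = []
        for gj in range(grid):
            rs, re = r1 + gi * h // grid, r1 + (gi + 1) * h // grid
            cs, ce = c1 + gj * w // grid, c1 + (gj + 1) * w // grid
            row.append(max(fmap[r][c]
                           for r in range(rs, re) for c in range(cs, ce)))
        out.append(row)
    return out

fmap = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
pooled = roi_pool(fmap, (0, 0, 4, 4))  # -> [[6, 8], [14, 16]]
```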

object proposals (Fast R-CNN)

- Pool the features in a spatial grid (ROI pooling), followed by FCs with two heads. Classification: log loss; regression: smooth L1 loss.
- End-to-end training with shared computation.

object proposals (Fast R-CNN)

                   Fast R-CNN    R-CNN
Train time (h)     9.5           84
Train speedup      8.8x          -
Test time/image    0.32 s        47 s
Test speedup       146x          -
mAP                66.9%         66.0%

Multi-task training also improves classification performance; end-to-end training improves results. Test time does not include object proposal computation (which is now the bottleneck).

object proposals (Faster R-CNN)

Compute the object proposals directly in the network: shared computation up to 'conv5', then a Region Proposal Network (RPN) produces the boxes that are fed to ROI pooling and the FCs.

object proposals (Faster R-CNN)

slide credit: Kaiming He

Slide a window over the feature map. Add a network which classifies and regresses the bounding boxes; the classification score provides the confidence of the presence of an object.

Use N anchors per position for proposals of varying scales and aspect ratios.
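Anchor generation can be sketched as follows; the scales and ratios below are illustrative choices, not the paper's exact configuration.

```python
# N anchors per sliding position, several scales and aspect ratios,
# all centred on the same point of the feature map.
def make_anchors(cx, cy, scales=(64, 128), ratios=(0.5, 1.0, 2.0)):
    """Return (x1, y1, x2, y2) anchors centred at (cx, cy).

    Each anchor has area scale**2 and height/width aspect ratio 'ratio'."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / r ** 0.5
            h = s * r ** 0.5
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

anchors = make_anchors(100, 100)  # 2 scales x 3 ratios = 6 anchors
```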

object proposals (Faster R-CNN)

slide credit: Kaiming He

Computation for 1000 boxes:

Model                     Time
Edge boxes + R-CNN        0.25 sec + 1000*ConvTime + 1000*FcTime
Edge boxes + Fast R-CNN   0.25 sec + 1*ConvTime + 1000*FcTime
Faster R-CNN              1*ConvTime + 1000*FcTime

object proposals (Faster R-CNN)

slide credit: Li, Karpathy, Johnson

object localization

Winner of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2015 with residual networks and Faster R-CNN.

summary object detection

slide credit: Li, Karpathy, Johnson

- Object localization: when there is one (or a known number of) objects/classes, you can do object localization by adding a 'regression head' to your network.
- Sliding window + CNN can be computed efficiently by writing the network as a fully convolutional network.
- Object proposal methods are straightforwardly combined with CNNs, but for fast/good results consider:
  - adding a regression head to improve bounding box estimation
  - sharing computation of the convolutional features (SPP)
  - end-to-end training of the network (Fast R-CNN)
  - including a Region Proposal Network for fast object proposals within the network (Faster R-CNN)

Local object recognition:
- object localization (single object)
- object detection
- semantic segmentation

semantic segmentation

- semantic segmentation: assign a class to all pixels
- instance segmentation: assign pixels to a particular instance of a class (chair1, etc.)

semantic segmentation

A ConvNet applied to a patch predicts the class of its center pixel. Write the network as a fully convolutional network and apply it to the whole image. Because of the convolutions the output resolution is smaller than the input and upsampling is required.

semantic segmentation

Long et al., Fully Convolutional Networks for Semantic Segmentation, ICCV 2015

pixelwise loss
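The pixelwise loss is just softmax cross-entropy applied at every output position and averaged over the image. A tiny sketch with three "pixels" and two classes (illustrative numbers):

```python
import math

def pixelwise_loss(logits, labels):
    """logits: [pixel][class] scores, labels: [pixel] class indices.

    Softmax cross-entropy per pixel, averaged over all pixels."""
    total = 0.0
    for scores, y in zip(logits, labels):
        z = sum(math.exp(s) for s in scores)
        total += -math.log(math.exp(scores[y]) / z)
    return total / len(labels)

logits = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
labels = [0, 1, 0]
loss = pixelwise_loss(logits, labels)  # two confident pixels, one uncertain
```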

semantic segmentation


Convolution (3x3), padding [1 1 1 1], stride [1 1]: the output has the same resolution as the input.

Convolution (3x3), padding [1 1 1 1], stride [2 2]: the output has half the resolution of the input.

Deconvolution (3x3), padding [1 1 1 1], stride [2 2]: the output has (roughly) twice the resolution of the input.

• Deconvolutions are also called fractionally strided convolutions or transposed convolutions.
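One way to see why they are called "fractionally strided": insert stride-1 zeros between the input samples and then run an ordinary convolution, so the output is roughly stride times larger than the input. A 1-D sketch:

```python
# 1-D fractionally strided ("transposed") convolution: dilate the input
# with zeros, then do a 'full' convolution with the kernel.
def fractionally_strided_conv1d(x, kernel, stride=2):
    # insert (stride - 1) zeros between input samples
    dilated = []
    for v in x:
        dilated.append(v)
        dilated.extend([0] * (stride - 1))
    dilated = dilated[:-(stride - 1)] if stride > 1 else dilated
    # 'full' convolution of the dilated signal with the (flipped) kernel
    k = len(kernel)
    padded = [0] * (k - 1) + dilated + [0] * (k - 1)
    return [sum(padded[i + j] * kernel[k - 1 - j] for j in range(k))
            for i in range(len(padded) - k + 1)]

y = fractionally_strided_conv1d([1, 2, 3], [1, 1, 1])  # 3 samples -> 7
```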

semantic segmentation

Noh et al. ICCV 2015

semantic segmentation

Combine where (local, shallow) with what (global, deep).

Long et al., Fully Convolutional Networks for Semantic Segmentation, ICCV 2015

semantic segmentation

Long et al., Fully Convolutional Networks for Semantic Segmentation, ICCV 2015

'skip layers': interp + sum predictions from several layers to obtain a dense output.
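The "interp + sum" skip can be sketched with nearest-neighbour upsampling (the paper uses bilinear interpolation; the values below are illustrative):

```python
# Upsample the coarse, deep prediction and sum it with the prediction
# from a shallower, higher-resolution layer.
def upsample2x(grid):
    """Nearest-neighbour 2x upsampling of a 2-D grid."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in (0, 1)]
        out.append(wide)
        out.append(list(wide))
    return out

coarse = [[1, 2],
          [3, 4]]                       # deep "what" prediction, low resolution
shallow = [[0] * 4 for _ in range(4)]   # shallow "where" prediction
shallow[0][0] = 5                       # a fine detail only the shallow layer sees

fused = [[u + s for u, s in zip(ur, sr)]
         for ur, sr in zip(upsample2x(coarse), shallow)]
```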

semantic segmentation

Long et al., Fully Convolutional Networks for Semantic Segmentation, ICCV 2015

- stride 32, no skips
- stride 16, 1 skip
- stride 8, 2 skips
- ground truth / input image

semantic segmentation

Eigen, Fergus, Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture, ICCV 2015

Surface normals results

instance segmentation

Dai et al. 'Instance-aware Semantic Segmentation via Multi-task Network Cascades', arXiv 2015.

results / ground-truth

Generative Adversarial Networks

Fractionally strided convolutions (deconvolutions) can be used to generate images from noise.

Generative Adversarial Networks

Consider I would like to generate images of horses. My generated horse images G(z) are generated from noise z. I can train a discriminative network D which is trained to distinguish real horse images x from generated horse images G(z):

max_D E_x[log D(x)] + E_z[log(1 - D(G(z)))]

Generative Adversarial Networks

I can then optimize my generative network G to fool the discriminative network:

min_G max_D E_x[log D(x)] + E_z[log(1 - D(G(z)))]

Generative Adversarial Networks

You can then re-optimize the discriminative network D, etc., until D gives in and can no longer tell real horses x from generated horses G(z):

min_G max_D E_x[log D(x)] + E_z[log(1 - D(G(z)))]

Goodfellow et al. Generative Adversarial Nets, NIPS 2014

Generative Adversarial Networks

Examples of generated bedrooms, and interpolation between points in z.

Radford et al., Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, ICLR 2016

summary semantic segmentation

slide credit: Li, Karpathy, Johnson

- Fully convolutional networks can be applied for efficient classification of all pixels.
- To get high-quality segmentations, deep features of multiple scales need to be combined (e.g. with skip layers).
- Upsampling can be done by deconvolution and de-pooling operations.
- Instance segmentation can be performed by combining object detection and semantic segmentation pipelines.
