
Module 5: Deep Convnets for Local Recognition
Joost van de Weijer, 4 April 2016

Previously, end-to-end..

Slide credit: Jose M. Álvarez

Image input, through a Learned Representation, to a label output ("Dog").

Part I: End-to-end learning (E2E): the representation is learned for Task A (e.g. image classification).

Previously, end-to-end..

Part I: End-to-end learning (E2E): a Learned Representation is trained on Domain A.
Part I': End-to-End Fine-Tuning (FT): the learned representation is transferred to Domain B and fine-tuned.

Previously, fine-tuning..

slide credit: X. Giro
Slide credit: Victor Campos, "Layer-wise CNN surgery for Visual Sentiment Prediction" (ETSETB 2015)

Fine-tuning a pre-trained network

Fine-tuning: a high learning rate in the new layer, and a low learning rate in all other layers.
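This recipe can be sketched in a few lines of plain Python; the layer names and the learning-rate multipliers below are illustrative placeholders, not values from the lecture.

```python
# Sketch of fine-tuning with per-layer learning rates (illustrative values).
base_lr = 0.001

# low multiplier for pre-trained layers, high for the newly added layer
lr_mult = {"conv1": 0.1, "conv2": 0.1, "fc_new": 10.0}

def sgd_step(params, grads):
    """One SGD step where each layer uses base_lr times its own multiplier."""
    return {name: p - base_lr * lr_mult[name] * grads[name]
            for name, p in params.items()}

params = {"conv1": 1.0, "conv2": 1.0, "fc_new": 1.0}
grads = {"conv1": 1.0, "conv2": 1.0, "fc_new": 1.0}
params = sgd_step(params, grads)
# fc_new moved 100x further than conv1/conv2 in this single step
```

In frameworks this is typically expressed with per-parameter-group learning rates rather than a hand-written loop.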

Previously, off-the-shelf features..

slide credit: X. Giro

Part I: End-to-end learning (E2E): a Learned Representation is trained for Task A (e.g. image classification).
Part II: Off-the-shelf features: the same representation is reused for Task B (e.g. image retrieval).

Image classification: image as an input, label ("Orange") as output.

Representations extracted from the network:
- spatially coded image representations (like spatial pyramids): d_x x d_y x d_F
- orderless image representations (like BOW): 1 x 1 x d_F

Two deep lectures in M5: Deep ConvNets for Recognition at...
- Global Scale (today's lecture)
- Local Scale (next lecture)

Image Classification

Image classification: image as an input, label ("Orange") as output.

How to process non-squared images? Options: resize, zero padding, or taking the largest centred square.
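Two of the three options can be sketched on a toy "image" stored as nested lists (hypothetical helper names; a proper resize needs interpolation and is omitted here).

```python
# Handling non-squared inputs, on a toy 2x4 "image".
def zero_pad_to_square(img):
    """Centre the image inside a square canvas of zeros."""
    h, w = len(img), len(img[0])
    side = max(h, w)
    out = [[0] * side for _ in range(side)]
    top, left = (side - h) // 2, (side - w) // 2
    for i in range(h):
        for j in range(w):
            out[top + i][left + j] = img[i][j]
    return out

def largest_centred_square(img):
    """Crop the largest square from the image centre."""
    h, w = len(img), len(img[0])
    side = min(h, w)
    top, left = (h - side) // 2, (w - side) // 2
    return [row[left:left + side] for row in img[top:top + side]]

img = [[1, 2, 3, 4],
       [5, 6, 7, 8]]
padded = zero_pad_to_square(img)        # 4x4 canvas, image centred
cropped = largest_centred_square(img)   # 2x2 centre crop -> [[2, 3], [6, 7]]
```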

Local object recognition:
- object localization (single object)
- object detection
- semantic segmentation

Classification + LOCALIZATION

Localization as regression

slide credit: Li, Karpathy, Johnson

On top of the shared features the network has two heads:
- classification head: C class scores
- regression head: C x 4 numbers (one bounding box per class)

Problem: multiple classes
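A shape-level sketch of how the two heads handle multiple classes: the regression head outputs one box per class, and at test time the box of the highest-scoring class is kept. The values below are random placeholders, not a trained model.

```python
import random
random.seed(0)

C = 20  # e.g. the PASCAL classes

# the two heads on top of a shared feature vector
class_scores = [random.random() for _ in range(C)]               # C scores
boxes = [[random.random() for _ in range(4)] for _ in range(C)]  # C x 4 numbers

# at test time: take the box belonging to the highest-scoring class
best = max(range(C), key=lambda c: class_scores[c])
predicted_box = boxes[best]  # 4 numbers, e.g. (x, y, w, h)
```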

Localization as regression (example)

Example of localization of clothes. Regression is done in two steps: first the person bounding box and then the clothing bounding boxes (master project 2015).

Esteve Cervantes: Evaluating deep features for Fashion Recognition

Local object recognition:
- object localization (single object)
- object detection
- semantic segmentation

Object detection: any ideas?

Sliding window

Slide a 227x227 window over the image. For each window position, run classification + regression: the network returns a classification score (e.g. 0.03, 0.83, 0.99 at different positions) and a regressed bounding box. Compute a new regressed bounding box and classification score for all sliding window positions.

Repeat for different scales and combine all results (e.g. with non maxima suppression).
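Combining the overlapping per-position detections with non-maxima suppression can be sketched as follows (plain Python, greedy NMS; the 0.5 IoU threshold is an illustrative choice).

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: keep the best box, drop boxes overlapping it, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) <= thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.83, 0.99, 0.75]
keep = nms(boxes, scores)  # -> [1, 2]: the two overlapping windows collapse
```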

Sliding window (efficient computation)

Let us for simplicity consider a simple three-layer network applied to 10x10 windows:
- conv1: 5 filters of 5x5x3, giving a 6x6x5 feature map
- fc1: 10 units (equivalent to a convolution with 10 filters of 6x6x5), giving 1x1x10
- fc2: 2 units, car/not car (equivalent to a convolution with 2 filters of 1x1x10), giving 1x1x2

Now slide the 10x10 window over a 12x17 image. Part of the convolutional features are the same between overlapping windows and do not need recomputation! How many 10x10 windows are there in this 12x17 image? The convolutions can be computed in a single pass:
- conv1 (5 filters of 5x5x3): 12x17x3 input, 8x13x5 output
- fc1 as conv (10 filters of 6x6x5): 8x13x5 input, 3x8x10 output
- fc2 as conv (2 filters of 1x1x10): 3x8x10 input, 3x8x2 output

We have the 8x3 = 24 classification scores while sharing computation of the convolutional features.
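The output-size arithmetic above can be checked in a few lines (valid convolutions, no stride):

```python
# Turning the fc layers into convolutions lets the network process the
# whole 12x17 image in one pass, producing one 10x10-window score per
# output position.
def valid_conv(h, w, k):
    """Spatial size after a 'valid' convolution with a k x k filter."""
    return h - k + 1, w - k + 1

h, w = 12, 17
h, w = valid_conv(h, w, 5)  # conv1: 5x5 filters  -> 8 x 13
h, w = valid_conv(h, w, 6)  # fc1 as conv: 6x6    -> 3 x 8
h, w = valid_conv(h, w, 1)  # fc2 as conv: 1x1    -> 3 x 8
n_windows = h * w           # = number of 10x10 windows in a 12x17 image
```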

Example of bear and fish detection on multiple scales.

Sermanet et al., 'Integrated Recognition, Localization and Detection using Convolutional Networks', ICLR 2014

Networks can be written as fully convolutional networks to speed up computation at testing time.

object proposals

selective search: K. van de Sande et al. Segmentation as selective search for object recognition. ICCV 2011.

- Object proposal methods compute boxes which potentially contain an object.
- Features for each box are extracted and a classifier is applied.
- Typically thousands of boxes (but far fewer than sliding window).
- Many different approaches: selective search, edge boxes, GOP, etc.

object proposals (R-CNN)

Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." CVPR 2014.

1. compute object proposals (~2k)
2. warp the dilated bounding box
3. compute CNN features
4. classify regions (car: yes, person: no) and regress the bounding box

object proposals (R-CNN)

Take AlexNet, remove the last layer and fine-tune for the 20 PASCAL classes. Use the fc7 4096-d vector as the description of the bounding box, and train an SVM on this representation for classification.

object proposals (R-CNN)

slide credit: Girshick; Li, Karpathy, Johnson

Drawbacks of R-CNN:
- not end-to-end
- warping of boxes
- lots of duplicated computation (overlap of bounding boxes)

object proposals (Fast R-CNN)

This was first proposed by: He, Kaiming, et al. "Spatial pyramid pooling in deep convolutional networks for visual recognition." PAMI 2015.

- Compute the convolutional features once per image: shared computation (conv1-conv5) up to 'conv 5'.
- Extract features from conv5 for all bounding boxes.

object proposals (Fast R-CNN)

- Pool the features in a spatial grid. For all bounding boxes: Region of Interest pooling (ROI pooling).
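ROI pooling can be sketched for a single-channel feature map, assuming integer box coordinates that divide evenly into the grid (real implementations also handle fractional cell boundaries):

```python
# Divide a region of the conv5 feature map into a fixed grid (here 2x2)
# and max-pool each cell, so every box yields a fixed-size feature
# regardless of the box size.
def roi_pool(fmap, box, grid=2):
    """box = (r1, c1, r2, c2) in feature-map coordinates, end-exclusive."""
    r1, c1, r2, c2 = box
    h, w = r2 - r1, c2 - c1
    out = []
    for gi in range(grid):
        row = []
        for gj in range(grid):
            rs, re = r1 + gi * h // grid, r1 + (gi + 1) * h // grid
            cs, ce = c1 + gj * w // grid, c1 + (gj + 1) * w // grid
            row.append(max(fmap[r][c]
                           for r in range(rs, re) for c in range(cs, ce)))
        out.append(row)
    return out

fmap = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
pooled = roi_pool(fmap, (0, 0, 4, 4))  # -> [[6, 8], [14, 16]]
```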

object proposals (Fast R-CNN)

- Pool the features in a spatial grid (ROI pooling), followed by FCs with two heads. Classification: log loss; regression: smooth L1 loss.
- End-to-end training with shared computation.

object proposals (Fast R-CNN)

                   Fast R-CNN    R-CNN
Train time (h)     9.5           84
Train speedup      8.8x          -
Test time/image    0.32 s        47 s
Test speedup       146x          -
mAP                66.9%         66.0%

Multi-task training also improves classification performance; end-to-end training improves results. Test time does not include object proposal computation (which is now the bottleneck).

object proposals (Faster R-CNN)

Compute the object proposals directly in the network: shared computation up to 'conv5', then a Region Proposal Network (RPN) produces the boxes that are fed to ROI pooling and the FCs.

object proposals (Faster R-CNN)

slide credit: Kaiming He

Slide a window over the feature map. Add a network which classifies and regresses the bounding boxes; the classification score provides the confidence of the presence of an object.

Use N anchors per position for proposals of varying scales and aspect ratios.
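Anchor generation can be sketched as follows; the scales and ratios below are illustrative choices, not the paper's exact configuration.

```python
# N anchors per sliding position, several scales and aspect ratios,
# all centred on the same point of the feature map.
def make_anchors(cx, cy, scales=(64, 128), ratios=(0.5, 1.0, 2.0)):
    """Return (x1, y1, x2, y2) anchors centred at (cx, cy).

    Each anchor has area scale**2 and height/width aspect ratio 'ratio'."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / r ** 0.5
            h = s * r ** 0.5
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

anchors = make_anchors(100, 100)  # 2 scales x 3 ratios = 6 anchors
```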

object proposals (Faster R-CNN)

slide credit: Kaiming He

Computation for 1000 boxes:

Model                     Time
Edge boxes + R-CNN        0.25 sec + 1000*ConvTime + 1000*FcTime
Edge boxes + Fast R-CNN   0.25 sec + 1*ConvTime + 1000*FcTime
Faster R-CNN              1*ConvTime + 1000*FcTime

object proposals (Faster R-CNN)

slide credit: Li, Karpathy, Johnson

object localization

Winner of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2015 with residual networks and Faster R-CNN.

summary object detection

slide credit: Li, Karpathy, Johnson

- Object localization: when there is one (or a known number of) objects/classes, you can do object localization by adding a 'regression head' to your network.
- Sliding window + CNN can be computed efficiently by writing the network as a fully convolutional network.
- Object proposal methods are straightforwardly combined with CNNs, but for fast/good results consider:
  - adding a regression head to improve bounding box estimation
  - sharing computation of the convolutional features (SPP)
  - end-to-end training of the network (Fast R-CNN)
  - including a Region Proposal Network for fast object proposals within the network (Faster R-CNN)

Local object recognition:
- object localization (single object)
- object detection
- semantic segmentation

semantic segmentation

- semantic segmentation: assign a class to all pixels
- instance segmentation: assign pixels to a particular instance of a class (chair1, etc.)

semantic segmentation

A ConvNet applied to a patch predicts the class of its center pixel. Write the network as a fully convolutional network and apply it to the whole image. Because of the convolutions the output resolution is smaller than the input and upsampling is required.

semantic segmentation

Long et al., Fully Convolutional Networks for Semantic Segmentation, ICCV 2015

pixelwise loss
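The pixelwise loss is just softmax cross-entropy applied at every output position and averaged over the image. A tiny sketch with three "pixels" and two classes (illustrative numbers):

```python
import math

def pixelwise_loss(logits, labels):
    """logits: [pixel][class] scores, labels: [pixel] class indices.

    Softmax cross-entropy per pixel, averaged over all pixels."""
    total = 0.0
    for scores, y in zip(logits, labels):
        z = sum(math.exp(s) for s in scores)
        total += -math.log(math.exp(scores[y]) / z)
    return total / len(labels)

logits = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
labels = [0, 1, 0]
loss = pixelwise_loss(logits, labels)  # two confident pixels, one uncertain
```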

semantic segmentation


Convolution (3x3), padding [1 1 1 1], stride [1 1]: the output has the same resolution as the input.

Convolution (3x3), padding [1 1 1 1], stride [2 2]: the output has half the resolution of the input.

Deconvolution (3x3), padding [1 1 1 1], stride [2 2]: the output has (roughly) twice the resolution of the input.

• Deconvolutions are also called fractionally strided convolutions or transposed convolutions.
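One way to see why they are called "fractionally strided": insert stride-1 zeros between the input samples and then run an ordinary convolution, so the output is roughly stride times larger than the input. A 1-D sketch:

```python
# 1-D fractionally strided ("transposed") convolution: dilate the input
# with zeros, then do a 'full' convolution with the kernel.
def fractionally_strided_conv1d(x, kernel, stride=2):
    # insert (stride - 1) zeros between input samples
    dilated = []
    for v in x:
        dilated.append(v)
        dilated.extend([0] * (stride - 1))
    dilated = dilated[:-(stride - 1)] if stride > 1 else dilated
    # 'full' convolution of the dilated signal with the (flipped) kernel
    k = len(kernel)
    padded = [0] * (k - 1) + dilated + [0] * (k - 1)
    return [sum(padded[i + j] * kernel[k - 1 - j] for j in range(k))
            for i in range(len(padded) - k + 1)]

y = fractionally_strided_conv1d([1, 2, 3], [1, 1, 1])  # 3 samples -> 7
```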

semantic segmentation

Noh et al. ICCV 2015

semantic segmentation

Combine where (local, shallow) with what (global, deep).

Long et al., Fully Convolutional Networks for Semantic Segmentation, ICCV 2015

semantic segmentation

Long et al., Fully Convolutional Networks for Semantic Segmentation, ICCV 2015

'skip layers': interp + sum predictions from several layers to obtain a dense output.
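The "interp + sum" skip can be sketched with nearest-neighbour upsampling (the paper uses bilinear interpolation; the values below are illustrative):

```python
# Upsample the coarse, deep prediction and sum it with the prediction
# from a shallower, higher-resolution layer.
def upsample2x(grid):
    """Nearest-neighbour 2x upsampling of a 2-D grid."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in (0, 1)]
        out.append(wide)
        out.append(list(wide))
    return out

coarse = [[1, 2],
          [3, 4]]                       # deep "what" prediction, low resolution
shallow = [[0] * 4 for _ in range(4)]   # shallow "where" prediction
shallow[0][0] = 5                       # a fine detail only the shallow layer sees

fused = [[u + s for u, s in zip(ur, sr)]
         for ur, sr in zip(upsample2x(coarse), shallow)]
```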

semantic segmentation

Long et al., Fully Convolutional Networks for Semantic Segmentation, ICCV 2015

- stride 32, no skips
- stride 16, 1 skip
- stride 8, 2 skips
- ground truth / input image

semantic segmentation

Eigen, Fergus, Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture, ICCV 2015

Surface normals results

instance segmentation

Dai et al. 'Instance-aware Semantic Segmentation via Multi-task Network Cascades', arXiv 2015.

results / ground-truth

Generative Adversarial Networks

Fractionally strided convolutions (deconvolutions) can be used to generate images from noise.

Generative Adversarial Networks

Consider I would like to generate images of horses. My generated horse images G(z) are generated from noise z. I can train a discriminative network D which is trained to distinguish real horse images x from generated horse images G(z):

max_D E_x[log D(x)] + E_z[log(1 - D(G(z)))]

Generative Adversarial Networks

I can then optimize my generative network G to fool the discriminative network:

min_G max_D E_x[log D(x)] + E_z[log(1 - D(G(z)))]

Generative Adversarial Networks

You can then re-optimize the discriminative network D, etc., until D gives in and can no longer tell real horses x from generated horses G(z):

min_G max_D E_x[log D(x)] + E_z[log(1 - D(G(z)))]

Goodfellow et al. Generative Adversarial Nets, NIPS 2014

Generative Adversarial Networks

Examples of generated bedrooms, and interpolation between points in z.

Radford et al., Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, ICLR 2016

summary semantic segmentation

slide credit: Li, Karpathy, Johnson

- Fully convolutional networks can be applied for efficient classification of all pixels.
- To get high-quality segmentations, deep features of multiple scales need to be combined (e.g. with skip layers).
- Upsampling can be done by deconvolution and de-pooling operations.
- Instance segmentation can be performed by combining object detection and semantic segmentation pipelines.
