Adaptive Computation Time in Deep Visual Learning at AI NEXT Conference
TRANSCRIPT
[Slide 1]
Li Zhang, Google
[Slide 3]
Deep learning has been prevalent in computer vision in the last 5 years:
● Image classification
● Object detection
● Image segmentation
● Image captioning
● Visual question answering
● Image synthesis
● ...
[Slide 4]
More data, more compute => bigger models, better results.
Okay as a cloud service; more challenging on mobile or even embedded devices.
[Slide 5]
● Reduce the number of channels in each layer
● Low-rank decomposition
○ Speeding up convolutional neural networks with low rank expansions. BMVC, 2014.
○ Efficient and accurate approximations of nonlinear convolutional networks. CVPR, 2015.
○ ResNet, Inception
● Reduce connections
○ Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. ICLR, 2016.
[Slide 6]
● Cascade classifiers
○ A convolutional neural network cascade for face detection. CVPR, 2015.
○ Exploit all the layers: Fast and accurate CNN object detector with scale dependent pooling and cascaded rejection classifiers. CVPR, 2016.
[Slide 7]
● Glimpse-based attention models
○ Learning to combine foveal glimpses with a third-order Boltzmann machine. NIPS, 2010.
○ Recurrent models of visual attention. NIPS, 2014.
○ Multiple object recognition with visual attention. ICLR, 2015.
○ Spatial transformer networks. NIPS, 2015.
[Slide 9]
● Switch on/off subnets
○ Dynamic capacity networks. ICML, 2016.
○ PerforatedCNNs: Acceleration through elimination of redundant convolutions. NIPS, 2016.
○ Conditional computation in neural networks for faster models. ICLR Workshop, 2016.
○ BranchyNet: Fast inference via early exiting from deep neural networks. ICPR, 2016.
[Slide 12]
http://mscoco.org/explore/?id=19431
https://en.wikipedia.org/wiki/Westphalian_horse
[Slide 14]
Consider an RNN where output = state (https://arxiv.org/abs/1603.08983): starting from state s1 at t = 1, applying F at each step produces s2 through s6 at t = 2, ..., 6. Output: s6.
[Slide 15]
RNN-ACT (https://arxiv.org/abs/1603.08983): at each step, a halting unit H outputs a halting probability alongside the state: 0.01 at t = 1, 0.1 at t = 2, 0.7 at t = 3, 0.5 at t = 4. A cumulative sum of these probabilities is tracked (0.01, 0.11, 0.81, ...); computation halts at t = 4, the first step where it exceeds 1 - ε (0.81 + 0.5 > 1 - ε). Remainder: 1 - 0.01 - 0.1 - 0.7 = 0.19. Output: 0.01 s1 + 0.1 s2 + 0.7 s3 + 0.19 s4. Ponder cost ρ: 4.19 (1 per evaluated step, plus the remainder 0.19). Differentiable w.r.t. the halting probabilities!
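In general form (a sketch following the definitions in the ACT paper linked above; h_t is the halting probability at step t and T the maximum number of steps):

```latex
N = \min\Big\{ n \le T : \sum_{t=1}^{n} h_t \ge 1 - \varepsilon \Big\}, \qquad
R = 1 - \sum_{t=1}^{N-1} h_t, \\
\text{output} = \sum_{t=1}^{N-1} h_t \, s_t + R \, s_N, \qquad
\rho = N + R.
```

N is piecewise constant, so the gradient of the ponder cost ρ flows through the remainder R, which is linear in the halting probabilities; that is the sense in which ρ is differentiable.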
[Slide 16]
ResNet (http://arxiv.org/abs/1512.03385) is built from residual blocks; the overall architecture is image → group → group → group → avg. pool + fc, where each group is a stack of residual blocks.
[Slides 17-20]
Stochastic depth (https://arxiv.org/abs/1603.09382): residual blocks are randomly dropped during training, while the overall architecture (groups of blocks, then avg. pool + fc) stays the same. A powerful regularizer; the representations of the layers are compatible with each other.
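For context, a minimal sketch of the stochastic-depth mechanism described in that paper (not the authors' code; `residual_fn` and the RNG seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_depth_block(x, residual_fn, survival_prob, training):
    """One residual block with stochastic depth (per arXiv:1603.09382).

    Training: the residual branch is dropped with probability
    1 - survival_prob (identity shortcut only). Test: the branch is
    always kept, scaled by its survival probability.
    """
    if training:
        if rng.random() < survival_prob:
            return x + residual_fn(x)   # block survives this pass
        return x                        # block dropped entirely
    return x + survival_prob * residual_fn(x)
```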
[Slide 21]
ACT for ResNet: apply the same mechanism to a group of residual blocks (s_i = ResNet block activation). Each block F_i is followed by a halting unit H_i: halting probabilities 0.1, 0.1, 0.1, 0.9 for blocks 1-4. Computation halts at block 4 with remainder 1 - 0.1 - 0.1 - 0.1 = 0.7, so block F5 is never evaluated. Output: 0.1 s1 + 0.1 s2 + 0.1 s3 + 0.7 s4. Ponder cost ρ: 4.7.
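A minimal Python sketch of this mechanism (the names `blocks`, `halting_units`, and the toy usage are mine, not the paper's code):

```python
def act_group(x, blocks, halting_units, eps=0.01):
    """Adaptive computation time over one group of residual blocks.

    blocks[i] plays the role of F_i (state -> state); halting_units[i]
    plays the role of H_i (state -> halting probability in (0, 1)).
    Returns (output, ponder_cost).
    """
    s = x
    cumsum, remainder = 0.0, 1.0
    output, ponder = 0.0, 0.0
    for i, (F, H) in enumerate(zip(blocks, halting_units)):
        s = F(s)                          # evaluate residual block F_i
        h = H(s)                          # halting probability for block i
        ponder += 1.0                     # rho counts every evaluated block
        if cumsum + h >= 1.0 - eps or i == len(blocks) - 1:
            output += remainder * s       # halting block gets the remainder
            ponder += remainder           # rho = N + R
            break                         # remaining blocks are skipped
        output += h * s
        cumsum += h
        remainder -= h
    return output, ponder


# Toy run reproducing the slide: halting probabilities 0.1, 0.1, 0.1, 0.9
# give output weights 0.1, 0.1, 0.1, 0.7, and F5 never runs.
probs = iter([0.1, 0.1, 0.1, 0.9, 1.0])
out, rho = act_group(1.0, [lambda s: s] * 5, [lambda s: next(probs)] * 5)
print(round(out, 6), round(rho, 6))  # 1.0 4.7
```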
[Slide 22]
High ponder cost vs. low ponder cost examples (ResNet-110, τ = 0.01).
[Slide 23]
High ponder cost vs. low ponder cost examples (ResNet-101, τ = 0.001).
[Slide 24]
SACT: halting probabilities are computed per spatial position (the slide shows example values such as 0.7/0.1 after F1 and 0.6/0.4 after F2, with cumulative sums of 1.0 at positions that have halted). Once a position's cumulative halting probability reaches 1, it is no longer updated: its value is copied from the previous block, while the still-active positions are updated by the next residual block in the group (F1, F2, F3) via the halting units (H1, H2).
[Slide 25]
Within a residual block, halted spatial positions are simply copied through; the block's computation is applied only at the active positions.
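A sketch of the spatial version, with the same hypothetical stand-ins as before; halting probabilities, remainders, and ponder costs become per-position maps, and halted positions stop being updated:

```python
import numpy as np

def sact_group(x, blocks, halting_units, eps=0.01):
    """Spatially adaptive computation time over one group of blocks (sketch).

    x: feature map of shape (H, W, C).
    blocks[i](s) -> updated feature map (the residual block F_i).
    halting_units[i](s) -> per-position halting probabilities, shape (H, W).
    Returns (output, ponder), where ponder has shape (H, W).
    """
    s = x
    hw = x.shape[:2]
    cumsum = np.zeros(hw)                 # running sum of halting probs
    remainder = np.ones(hw)               # 1 - cumsum, per position
    active = np.ones(hw, dtype=bool)      # positions not yet halted
    output = np.zeros_like(x)
    ponder = np.zeros(hw)
    for i, (F, H_unit) in enumerate(zip(blocks, halting_units)):
        if not active.any():
            break                          # every position has halted
        s = np.where(active[..., None], F(s), s)  # halted positions: copy
        h = H_unit(s)
        ponder += active                   # +1 wherever the block ran
        last = i == len(blocks) - 1
        halt_now = active & ((cumsum + h >= 1.0 - eps) | last)
        still = active & ~halt_now
        weight = np.where(halt_now, remainder, np.where(still, h, 0.0))
        output += weight[..., None] * s
        ponder += np.where(halt_now, remainder, 0.0)   # rho = N + R
        cumsum = np.where(still, cumsum + h, cumsum)
        remainder = np.where(still, remainder - h, remainder)
        active = still
    return output, ponder
```

A real implementation evaluates F only at the active positions (e.g. via the perforated convolutions cited earlier); the sketch computes everywhere and masks purely for clarity.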
[Slide 26]
Halting unit: per-position logits from a 3x3 conv on s_i are added to a global logit obtained by global avg-pooling s_i followed by a linear model; a sigmoid σ then gives the halting probabilities h_i. This is a strict generalization of ACT (consider zero weights for the 3x3 conv: the halting probability becomes identical at every position).
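A sketch of that halting unit (shapes and names are my assumptions; `conv3x3` stands for a 3x3 convolution with a single output channel):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def halting_map(s, w_pool, b_pool, conv3x3):
    """Per-position halting probabilities for a feature map s of shape (H, W, C).

    w_pool (shape (C,)) and b_pool: the linear model applied to the global
    average pool; conv3x3(s) -> (H, W) logits, one per spatial position.
    """
    global_logit = s.mean(axis=(0, 1)) @ w_pool + b_pool  # one scalar per map
    local_logits = conv3x3(s)                             # one logit per position
    # Zero conv weights => local_logits == 0, so the probability is the
    # same at every position: plain (non-spatial) ACT as a special case.
    return sigmoid(global_logit + local_logits)
```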
[Slide 27]
Two ways to train the model:
● From scratch
● Warm-up with a pretrained model (the following results use this)

Important trick: initialize the biases of the halting probabilities with negative values (see the sketch below).
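A sketch of the trick; the particular value -3 is an assumption, the point is that σ(bias) starts near zero:

```python
import numpy as np

# Biases of the halting units start negative, so the initial halting
# probabilities are close to zero: the warmed-up network evaluates
# (almost) all blocks at first and so behaves like the pretrained model.
halting_bias = -3.0                          # assumed value for illustration
p0 = 1.0 / (1.0 + np.exp(-halting_bias))
print(p0)  # ~0.047: at initialization, computation rarely halts early
```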
[Slide 28]
ResNet-110, 𝜏 = 0.01
[Slides 29-31]
ResNet-101, τ = 0.005
[Slide 33]
Suppose the average number of blocks used in the four groups is 3 - 3.9 - 13.7 - 3.
Baseline: a ResNet with 3 - 4 - 14 - 3 blocks, trained with "warming up" from the ResNet-101 network.
[Slide 34]
Apply the models to images of higher resolution than the training set. SACT improves scale invariance.
[Slide 35]
● Train on ImageNet classification, fine-tune on COCO detection
● Apply a ponder cost penalty to the feature extractor (see the loss sketch after the table below)
| Model | mAP | Feature extractor FLOPs |
| --- | --- | --- |
| ResNet v2 101 | 29.24 | 100% |
| SACT, τ = 0.001 | 29.04 | 72.44% |
| SACT, τ = 0.005 | 27.61 | 55.98% |
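The penalty enters the objective as task loss plus τ times the total ponder cost, with τ the trade-off constant from the table above (a sketch with hypothetical names):

```python
def training_loss(task_loss, ponder_costs, tau):
    """task_loss: the original classification/detection loss.
    ponder_costs: one rho per ACT-augmented group (for SACT, average rho
    over spatial positions first). tau trades accuracy for computation.
    """
    return task_loss + tau * sum(ponder_costs)
```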
[Slide 37]
CAT2000 dataset (http://saliency.mit.edu/home.html)
● No explicit supervision for attention!
● No center prior

| Model | AUC-Judd |
| --- | --- |
| ImageNet SACT, τ = 0.005 | 77% |
| COCO SACT, τ = 0.005 | 80% |
| One human | 65% |
| Center prior | 83% |
| State of the art | 87% |

Middle of the leaderboard. Kudos to Maxwell for evaluating!
[Slide 39]
“Spatially Adaptive Computation Time for Residual Networks” to appear in CVPR 2017, https://arxiv.org/pdf/1612.02297.pdf
[Slide 40]
● The idea of Adaptive Computation Time can be successfully used for computer vision
● Adaptive Computation Time
○ Dynamic number of layers in ResNet
● Spatially Adaptive Computation Time
○ Dynamic number of layers for different parts of the image
○ Attention maps for free :)
● Both models
○ Reduce the amount of computation
○ Can be implemented efficiently
○ Work on ImageNet classification (first attention models with this property?)
○ Work on COCO detection