Adaptive Computation Time in Deep Visual Learning at AI NEXT Conference

Li Zhang, Google


TRANSCRIPT

Page 1: Adaptive Computation Time in Deep Visual Learning at AI NEXT Conference

Li Zhang, Google

Page 2:
Page 3:

Prevalent in computer vision in the last 5 years:
} Image classification
} Object detection
} Image segmentation
} Image captioning
} Visual question answering
} Image synthesis
} ...

Page 4:

More data, more compute => bigger models, better results. That is okay as a cloud service, but more challenging on mobile or even embedded devices.

Page 5:

} Reduce number of channels in each layer

} Low rank decomposition (see the sketch after this list)

◦ Speeding up Convolutional Neural Networks with Low Rank Expansions. BMVC, 2014.

◦ Efficient and Accurate Approximations of Nonlinear Convolutional Networks. CVPR, 2015.

◦ ResNet, Inception

} Reduce connections

◦ Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. ICLR, 2016.
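
To make the low-rank idea concrete, here is a back-of-the-envelope parameter count (an illustrative sketch, not the exact factorization from the papers above): a 3x3 convolution with 256 input and 256 output channels is replaced by a 3x3 convolution into a 64-channel bottleneck followed by a 1x1 expansion.

```python
# Rough parameter-count comparison for a low-rank (bottleneck) factorization.
# All channel sizes here are assumed examples, not numbers from the talk.
full = 3 * 3 * 256 * 256                        # single 3x3 conv: 589,824 weights
low_rank = 3 * 3 * 256 * 64 + 1 * 1 * 64 * 256  # factored pair: 163,840 weights
print(full, low_rank, round(full / low_rank, 1))  # ~3.6x fewer parameters
```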

Page 6:

} Cascade classifier

◦ A Convolutional Neural Network Cascade for Face Detection. CVPR, 2015.
◦ Exploit All the Layers: Fast and Accurate CNN Object Detector with Scale Dependent Pooling and Cascaded Rejection Classifiers. CVPR, 2016.

Page 7:

} Glimpse-based attention model

◦ Learning to Combine Foveal Glimpses with a Third-Order Boltzmann Machine. NIPS, 2010.
◦ Recurrent Models of Visual Attention. NIPS, 2014.
◦ Multiple Object Recognition with Visual Attention. ICLR, 2015.
◦ Spatial Transformer Networks. NIPS, 2015.

Page 8: (same as Page 7)

Page 9:

} Switch on/off subnets (a sketch of the early-exit variant follows this list)

◦ Dynamic Capacity Networks. ICML, 2016.
◦ PerforatedCNNs: Acceleration through Elimination of Redundant Convolutions. NIPS, 2016.
◦ Conditional Computation in Neural Networks for Faster Models. ICLR Workshop, 2016.
◦ BranchyNet: Fast Inference via Early Exiting from Deep Neural Networks. ICPR, 2016.
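
A minimal sketch of the early-exit idea behind approaches like BranchyNet (the stage and classifier interfaces and the confidence threshold are assumed for illustration, not the paper's exact exit criterion):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def early_exit_forward(x, stages, exit_classifiers, threshold=0.9):
    # Run the network stage by stage; after each stage a side classifier
    # may stop inference early once its confidence clears the threshold.
    probs = None
    for stage, clf in zip(stages, exit_classifiers):
        x = stage(x)
        probs = softmax(clf(x))
        if probs.max() > threshold:   # confident enough: skip later stages
            break
    return probs
```

Easy inputs exit from an early branch and pay for only part of the network; hard inputs fall through to the full depth.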

Page 10: (same as Page 9)
Page 11: (same as Page 9)

Page 12:

http://mscoco.org/explore/?id=19431

https://en.wikipedia.org/wiki/Westphalian_horse

Page 13:
Page 14:

[Figure: an RNN unrolled over t = 1..6: states s1..s6, each step applying the transition F. Output: s6.]

Consider an RNN: output = state

https://arxiv.org/abs/1603.08983
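
A minimal sketch of this fixed-computation baseline (F, s0, and the step count are placeholder assumptions): every input pays for the same six steps, however easy it is.

```python
def rnn_fixed(x, s0, F, T=6):
    # Unroll the same transition F for exactly T steps; the output is
    # simply the final state, and no input can use fewer (or more) steps.
    s = s0
    for t in range(T):
        s = F(s, x)
    return s
```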

Page 15:

RNN-ACT: a halting unit H now emits a halting probability at every step (here 0.01, 0.1, 0.7, 0.5). The cumulative sum is tracked, and computation stops at the first step where it would exceed 1 - 𝜀; that step is weighted by the remainder 1 - 0.01 - 0.1 - 0.7 = 0.19.

Output: 0.01 s1 + 0.1 s2 + 0.7 s3 + 0.19 s4
Ponder cost ρ: 4 + 0.19 = 4.19 (steps taken plus remainder)

Differentiable w.r.t. halting probabilities!

https://arxiv.org/abs/1603.08983
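
The halting rule above can be written out directly. Below is a minimal numpy-style sketch under assumed interfaces (F is the RNN transition, H the halting unit); the authors' implementation builds this into the training graph rather than a Python loop.

```python
def act_rnn(x, s, F, H, eps=0.01, max_steps=100):
    # Adaptive Computation Time: keep stepping until the cumulative halting
    # probability would exceed 1 - eps; the final step gets the remainder.
    weighted_states, weights = [], []
    cum, n = 0.0, 0
    while True:
        n += 1
        s = F(s, x)
        h = float(H(s))                   # halting probability in (0, 1)
        if cum + h > 1 - eps or n == max_steps:
            w = 1.0 - cum                 # remainder: 1 minus earlier h's
            weighted_states.append(w * s)
            weights.append(w)
            break
        cum += h
        weighted_states.append(h * s)
        weights.append(h)
    output = sum(weighted_states)         # weighted mean of the states
    ponder_cost = n + weights[-1]         # steps taken + remainder
    return output, ponder_cost
```

With the slide's halting probabilities 0.01, 0.1, 0.7, 0.5 and 𝜀 = 0.01, the loop halts at step 4 with remainder 0.19, reproducing the figure: output = 0.01 s1 + 0.1 s2 + 0.7 s3 + 0.19 s4, ponder cost 4.19.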

Page 16:

Residual block: http://arxiv.org/abs/1512.03385

[Figure: ResNet architecture: image → group → group → group → avg. pool + fc.]

Page 17:
Page 18:
Page 19:

https://arxiv.org/abs/1603.09382

[Figures: ResNet with residual blocks randomly dropped (stochastic depth); avg. pool + fc head.]

Page 20:

Powerful regularizer: the representations of the layers are compatible with each other.

https://arxiv.org/abs/1603.09382

Page 21:

ACT applied to a group of residual blocks (si = ResNet block activation):

[Figure: blocks F1..F5 with halting units H1..H4; halting probabilities 0.1, 0.1, 0.1, then 0.9, which pushes the cumulative sum past 1 - 𝜀, so the fourth step is weighted by the remainder 0.7.]

Output: 0.1 s1 + 0.1 s2 + 0.1 s3 + 0.7 s4
Ponder cost ρ: 4 + 0.7 = 4.7

Page 22:

[Figure: ponder cost map (high to low), ResNet-110, 𝜏 = 0.01.]

Page 23:

[Figure: ponder cost map (high to low), ResNet-101, 𝜏 = 0.001.]

Page 24:

[Figure: SACT within a group of residual blocks (F1, F2, F3; states s1, s2, s3; halting units H1, H2): every spatial position accumulates its own halting probability (values such as 0.1, 0.2, 0.4, 0.7, 1.0 across positions); positions that have halted are copied from the previous state, while the rest are updated by the next block.]

Page 25:

[Figure: once a position halts, its value is copied through the remaining residual blocks.]
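
A minimal numpy sketch of the per-position bookkeeping (blocks and halting_units are assumed lists of callables; a real implementation batches this and only saves compute with masked or sparse kernels):

```python
import numpy as np

def sact_group(s, blocks, halting_units, eps=0.01):
    # s: feature map of shape (H, W, C). Each spatial position runs its own
    # ACT; halted positions are copied forward instead of recomputed.
    H, W = s.shape[:2]
    cum = np.zeros((H, W))                # cumulative halting probability
    out = np.zeros_like(s)
    active = np.ones((H, W), dtype=bool)  # positions still computing
    for i, (F, Hu) in enumerate(zip(blocks, halting_units)):
        s = np.where(active[..., None], F(s), s)  # copy halted positions
        h = Hu(s)                          # per-position probs, shape (H, W)
        last = (i == len(blocks) - 1)
        halting = active & ((cum + h > 1 - eps) | last)
        w = np.where(halting, 1.0 - cum, h)       # remainder when halting
        out += np.where(active, w, 0.0)[..., None] * s
        cum = np.where(active & ~halting, cum + h, cum)
        active &= ~halting
        if not active.any():               # all positions halted: stop early
            break
    return out
```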

Page 26: Adaptive Computation Time in Deep Visual Learning at AI NEXT Conference

Strict generalization of ACT (consider zero weights for 3x3 conv)

global avg-pooling

3x3 conv

Linear model

add𝛔

sihi
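
A sketch of this halting unit with assumed parameter names (w_pool, w_conv, b): the pooled linear term is shared by all positions, and the 3x3 conv term makes the probability spatially varying.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv3x3_same(s, w):
    # Naive 3x3 'same' convolution collapsing channels to a single map;
    # s: (H, W, C), w: (3, 3, C) -> output of shape (H, W).
    H, W, _ = s.shape
    pad = np.pad(s, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((H, W))
    for i in range(3):
        for j in range(3):
            out += (pad[i:i + H, j:j + W] * w[i, j]).sum(axis=-1)
    return out

def halting_unit(s, w_pool, w_conv, b):
    global_term = s.mean(axis=(0, 1)) @ w_pool    # linear model on avg-pooled features
    local_term = conv3x3_same(s, w_conv)          # per-position term
    return sigmoid(global_term + local_term + b)  # halting map hi, shape (H, W)
```

With w_conv set to zero the map is constant over positions and the unit reduces to the plain ACT halting unit, which is the "strict generalization" point above.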

Page 27:

Two ways to train the model:
● From scratch
● Warm-up with a pretrained model (the following results use this)

Important trick: initialize the biases of the halting probabilities with negative values (see the sketch below).
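
A small sketch of how these pieces fit together (the function and the bias value are assumed illustrations; only the 𝜏 values come from the other slides):

```python
import numpy as np

def total_loss(task_loss, ponder_cost, tau=0.005):
    # tau trades accuracy for computation; the slides sweep roughly 0.001-0.01.
    return task_loss + tau * ponder_cost

# Negative halting bias => small initial halting probabilities, so the
# warm-started network keeps using all of its blocks early in training.
halting_bias_init = -3.0                         # assumed example value
print(1.0 / (1.0 + np.exp(-halting_bias_init)))  # ≈ 0.047
```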

Page 28:

[Results plot: ResNet-110, 𝜏 = 0.01.]

Page 29:

[Results plot: ResNet-101, 𝜏 = 0.005.]

Page 30:

[Results plot: ResNet-101, 𝜏 = 0.005.]

Page 31:

[Results plot: ResNet-101, 𝜏 = 0.005.]

Page 32:
Page 33:

Suppose that the average number of blocks used in the groups is 3 - 3.9 - 13.7 - 3.

Baseline: train a ResNet with 3 - 4 - 14 - 3 blocks, either from scratch or "warmed up" from the ResNet-101 network.

Page 34:

Apply the models to images of higher resolution than the training set: SACT improves scale invariance.

Page 35:

● Train on ImageNet classification, fine-tune on COCO detection
● Apply the ponder cost penalty to the feature extractor

Model           | mAP   | Feature extractor FLOPS
ResNet v2 101   | 29.24 | 100%
SACT, 𝜏 = 0.001 | 29.04 | 72.44%
SACT, 𝜏 = 0.005 | 27.61 | 55.98%

Page 36:
Page 37:

cat2000 dataset
● No explicit supervision for attention!
● No center prior

Model                    | AUC-Judd
ImageNet SACT, 𝜏 = 0.005 | 77%
COCO SACT, 𝜏 = 0.005     | 80%
One human                | 65%
Center prior             | 83%
State of the art         | 87%

Middle of the leaderboard. Kudos to Maxwell for evaluating!

http://saliency.mit.edu/home.html

Page 38:
Page 39:

“Spatially Adaptive Computation Time for Residual Networks” to appear in CVPR 2017, https://arxiv.org/pdf/1612.02297.pdf

Page 40:

● The idea of Adaptive Computation Time can be successfully used for computer vision

● Adaptive Computation Time
○ Dynamic number of layers in ResNet

● Spatially Adaptive Computation Time
○ Dynamic number of layers for different parts of the image
○ Attention maps for free :)

● Both models
○ Reduce the amount of computation
○ Can be implemented efficiently
○ Work on ImageNet classification (first attention models with this property?)
○ Work on COCO detection

Page 41: