Adaptive Computation Time in Deep Visual Learning at AI NEXT Conference
TRANSCRIPT
[Slide 1]
Li Zhang, Google
[Slide 3]
Deep learning has been prevalent in computer vision in the last 5 years:
● Image classification
● Object detection
● Image segmentation
● Image captioning
● Visual question answering
● Image synthesis
● ...
[Slide 4]
More data, more compute => bigger models, better results.
Okay as a cloud service; more challenging on mobile or even embedded devices.
[Slide 5]
● Reduce the number of channels in each layer
● Low-rank decomposition
○ Speeding up convolutional neural networks with low rank expansions. BMVC, 2014.
○ Efficient and accurate approximations of nonlinear convolutional networks. CVPR, 2015.
○ ResNet, Inception
● Reduce connections
○ Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. ICLR, 2016.
[Slide 6]
● Cascade classifiers
○ A convolutional neural network cascade for face detection. CVPR, 2015.
○ Exploit all the layers: Fast and accurate CNN object detector with scale dependent pooling and cascaded rejection classifiers. CVPR, 2016.
[Slide 7]
● Glimpse-based attention models
○ Learning to combine foveal glimpses with a third-order Boltzmann machine. NIPS, 2010.
○ Recurrent models of visual attention. NIPS, 2014.
○ Multiple object recognition with visual attention. ICLR, 2015.
○ Spatial transformer networks. NIPS, 2015.
[Slide 9]
● Switch on/off subnets
○ Dynamic capacity networks. ICML, 2016.
○ PerforatedCNNs: Acceleration through elimination of redundant convolutions. NIPS, 2016.
○ Conditional computation in neural networks for faster models. ICLR Workshop, 2016.
○ BranchyNet: Fast inference via early exiting from deep neural networks. ICPR, 2016.
[Slide 12]
http://mscoco.org/explore/?id=19431
https://en.wikipedia.org/wiki/Westphalian_horse
[Slide 14]
Consider an RNN where output = state (https://arxiv.org/abs/1603.08983): starting from state s1 at t = 1, applying F at each step produces s2 through s6 at t = 2, ..., 6. Output: s6.
[Slide 15]
RNN-ACT (https://arxiv.org/abs/1603.08983): at each step, a halting unit H outputs a halting probability alongside the state: 0.01 at t = 1, 0.1 at t = 2, 0.7 at t = 3, 0.5 at t = 4. A cumulative sum of these probabilities is tracked (0.01, 0.11, 0.81, ...); computation halts at t = 4, the first step where it exceeds 1 - ε (0.81 + 0.5 > 1 - ε). Remainder: 1 - 0.01 - 0.1 - 0.7 = 0.19. Output: 0.01 s1 + 0.1 s2 + 0.7 s3 + 0.19 s4. Ponder cost ρ: 4.19 (1 per evaluated step, plus the remainder 0.19). Differentiable w.r.t. the halting probabilities!
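In general form (a sketch following the definitions in the ACT paper linked above; h_t is the halting probability at step t and T the maximum number of steps):

```latex
N = \min\Big\{ n \le T : \sum_{t=1}^{n} h_t \ge 1 - \varepsilon \Big\}, \qquad
R = 1 - \sum_{t=1}^{N-1} h_t, \\
\text{output} = \sum_{t=1}^{N-1} h_t \, s_t + R \, s_N, \qquad
\rho = N + R.
```

N is piecewise constant, so the gradient of the ponder cost ρ flows through the remainder R, which is linear in the halting probabilities; that is the sense in which ρ is differentiable.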
[Slide 16]
ResNet (http://arxiv.org/abs/1512.03385) is built from residual blocks; the overall architecture is image → group → group → group → avg. pool + fc, where each group is a stack of residual blocks.
[Slides 17-20]
Stochastic depth (https://arxiv.org/abs/1603.09382): residual blocks are randomly dropped during training, while the overall architecture (groups of blocks, then avg. pool + fc) stays the same. A powerful regularizer; the representations of the layers are compatible with each other.
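For context, a minimal sketch of the stochastic-depth mechanism described in that paper (not the authors' code; `residual_fn` and the RNG seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_depth_block(x, residual_fn, survival_prob, training):
    """One residual block with stochastic depth (per arXiv:1603.09382).

    Training: the residual branch is dropped with probability
    1 - survival_prob (identity shortcut only). Test: the branch is
    always kept, scaled by its survival probability.
    """
    if training:
        if rng.random() < survival_prob:
            return x + residual_fn(x)   # block survives this pass
        return x                        # block dropped entirely
    return x + survival_prob * residual_fn(x)
```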
[Slide 21]
ACT for ResNet: apply the same mechanism to a group of residual blocks (s_i = ResNet block activation). Each block F_i is followed by a halting unit H_i: halting probabilities 0.1, 0.1, 0.1, 0.9 for blocks 1-4. Computation halts at block 4 with remainder 1 - 0.1 - 0.1 - 0.1 = 0.7, so block F5 is never evaluated. Output: 0.1 s1 + 0.1 s2 + 0.1 s3 + 0.7 s4. Ponder cost ρ: 4.7.
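A minimal Python sketch of this mechanism (the names `blocks`, `halting_units`, and the toy usage are mine, not the paper's code):

```python
def act_group(x, blocks, halting_units, eps=0.01):
    """Adaptive computation time over one group of residual blocks.

    blocks[i] plays the role of F_i (state -> state); halting_units[i]
    plays the role of H_i (state -> halting probability in (0, 1)).
    Returns (output, ponder_cost).
    """
    s = x
    cumsum, remainder = 0.0, 1.0
    output, ponder = 0.0, 0.0
    for i, (F, H) in enumerate(zip(blocks, halting_units)):
        s = F(s)                          # evaluate residual block F_i
        h = H(s)                          # halting probability for block i
        ponder += 1.0                     # rho counts every evaluated block
        if cumsum + h >= 1.0 - eps or i == len(blocks) - 1:
            output += remainder * s       # halting block gets the remainder
            ponder += remainder           # rho = N + R
            break                         # remaining blocks are skipped
        output += h * s
        cumsum += h
        remainder -= h
    return output, ponder


# Toy run reproducing the slide: halting probabilities 0.1, 0.1, 0.1, 0.9
# give output weights 0.1, 0.1, 0.1, 0.7, and F5 never runs.
probs = iter([0.1, 0.1, 0.1, 0.9, 1.0])
out, rho = act_group(1.0, [lambda s: s] * 5, [lambda s: next(probs)] * 5)
print(round(out, 6), round(rho, 6))  # 1.0 4.7
```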
[Slide 22]
High ponder cost vs. low ponder cost examples (ResNet-110, τ = 0.01).
[Slide 23]
High ponder cost vs. low ponder cost examples (ResNet-101, τ = 0.001).
[Slide 24]
SACT: halting probabilities are computed per spatial position (the slide shows example values such as 0.7/0.1 after F1 and 0.6/0.4 after F2, with cumulative sums of 1.0 at positions that have halted). Once a position's cumulative halting probability reaches 1, it is no longer updated: its value is copied from the previous block, while the still-active positions are updated by the next residual block in the group (F1, F2, F3) via the halting units (H1, H2).
[Slide 25]
Within a residual block, halted spatial positions are simply copied through; the block's computation is applied only at the active positions.
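A sketch of the spatial version, with the same hypothetical stand-ins as before; halting probabilities, remainders, and ponder costs become per-position maps, and halted positions stop being updated:

```python
import numpy as np

def sact_group(x, blocks, halting_units, eps=0.01):
    """Spatially adaptive computation time over one group of blocks (sketch).

    x: feature map of shape (H, W, C).
    blocks[i](s) -> updated feature map (the residual block F_i).
    halting_units[i](s) -> per-position halting probabilities, shape (H, W).
    Returns (output, ponder), where ponder has shape (H, W).
    """
    s = x
    hw = x.shape[:2]
    cumsum = np.zeros(hw)                 # running sum of halting probs
    remainder = np.ones(hw)               # 1 - cumsum, per position
    active = np.ones(hw, dtype=bool)      # positions not yet halted
    output = np.zeros_like(x)
    ponder = np.zeros(hw)
    for i, (F, H_unit) in enumerate(zip(blocks, halting_units)):
        if not active.any():
            break                          # every position has halted
        s = np.where(active[..., None], F(s), s)  # halted positions: copy
        h = H_unit(s)
        ponder += active                   # +1 wherever the block ran
        last = i == len(blocks) - 1
        halt_now = active & ((cumsum + h >= 1.0 - eps) | last)
        still = active & ~halt_now
        weight = np.where(halt_now, remainder, np.where(still, h, 0.0))
        output += weight[..., None] * s
        ponder += np.where(halt_now, remainder, 0.0)   # rho = N + R
        cumsum = np.where(still, cumsum + h, cumsum)
        remainder = np.where(still, remainder - h, remainder)
        active = still
    return output, ponder
```

A real implementation evaluates F only at the active positions (e.g. via the perforated convolutions cited earlier); the sketch computes everywhere and masks purely for clarity.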
[Slide 26]
Halting unit: per-position logits from a 3x3 conv on s_i are added to a global logit obtained by global avg-pooling s_i followed by a linear model; a sigmoid σ then gives the halting probabilities h_i. This is a strict generalization of ACT (consider zero weights for the 3x3 conv: the halting probability becomes identical at every position).
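A sketch of that halting unit (shapes and names are my assumptions; `conv3x3` stands for a 3x3 convolution with a single output channel):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def halting_map(s, w_pool, b_pool, conv3x3):
    """Per-position halting probabilities for a feature map s of shape (H, W, C).

    w_pool (shape (C,)) and b_pool: the linear model applied to the global
    average pool; conv3x3(s) -> (H, W) logits, one per spatial position.
    """
    global_logit = s.mean(axis=(0, 1)) @ w_pool + b_pool  # one scalar per map
    local_logits = conv3x3(s)                             # one logit per position
    # Zero conv weights => local_logits == 0, so the probability is the
    # same at every position: plain (non-spatial) ACT as a special case.
    return sigmoid(global_logit + local_logits)
```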
[Slide 27]
Two ways to train the model:
● From scratch
● Warm-up with a pretrained model (the following results use this)

Important trick: initialize the biases of the halting probabilities with negative values (see the sketch below).
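A sketch of the trick; the particular value -3 is an assumption, the point is that σ(bias) starts near zero:

```python
import numpy as np

# Biases of the halting units start negative, so the initial halting
# probabilities are close to zero: the warmed-up network evaluates
# (almost) all blocks at first and so behaves like the pretrained model.
halting_bias = -3.0                          # assumed value for illustration
p0 = 1.0 / (1.0 + np.exp(-halting_bias))
print(p0)  # ~0.047: at initialization, computation rarely halts early
```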
[Slide 28]
ResNet-110, 𝜏 = 0.01
[Slides 29-31]
ResNet-101, τ = 0.005
[Slide 33]
Suppose the average number of blocks used in the four groups is 3 - 3.9 - 13.7 - 3.
Baseline: a ResNet with 3 - 4 - 14 - 3 blocks, trained with "warming up" from the ResNet-101 network.
[Slide 34]
Apply the models to images of higher resolution than the training set. SACT improves scale invariance.
[Slide 35]
● Train on ImageNet classification, fine-tune on COCO detection
● Apply a ponder cost penalty to the feature extractor (see the loss sketch after the table below)
| Model | mAP | Feature extractor FLOPs |
| --- | --- | --- |
| ResNet v2 101 | 29.24 | 100% |
| SACT, τ = 0.001 | 29.04 | 72.44% |
| SACT, τ = 0.005 | 27.61 | 55.98% |
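The penalty enters the objective as task loss plus τ times the total ponder cost, with τ the trade-off constant from the table above (a sketch with hypothetical names):

```python
def training_loss(task_loss, ponder_costs, tau):
    """task_loss: the original classification/detection loss.
    ponder_costs: one rho per ACT-augmented group (for SACT, average rho
    over spatial positions first). tau trades accuracy for computation.
    """
    return task_loss + tau * sum(ponder_costs)
```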
[Slide 37]
CAT2000 dataset (http://saliency.mit.edu/home.html)
● No explicit supervision for attention!
● No center prior

| Model | AUC-Judd |
| --- | --- |
| ImageNet SACT, τ = 0.005 | 77% |
| COCO SACT, τ = 0.005 | 80% |
| One human | 65% |
| Center prior | 83% |
| State of the art | 87% |

Middle of the leaderboard. Kudos to Maxwell for evaluating!
[Slide 39]
“Spatially Adaptive Computation Time for Residual Networks” to appear in CVPR 2017, https://arxiv.org/pdf/1612.02297.pdf
[Slide 40]
● The idea of Adaptive Computation Time can be successfully used for computer vision
● Adaptive Computation Time
○ Dynamic number of layers in ResNet
● Spatially Adaptive Computation Time
○ Dynamic number of layers for different parts of the image
○ Attention maps for free :)
● Both models
○ Reduce the amount of computation
○ Can be implemented efficiently
○ Work on ImageNet classification (first attention models with this property?)
○ Work on COCO detection