cs 103: representation learning, information theory and ... · image from cnops et al., 2008 hubel...

19
CS 103: Representation Learning, Information Theory and Control Lecture 8, Mar 1, 2019

Upload: others

Post on 15-Oct-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: CS 103: Representation Learning, Information Theory and ... · Image from Cnops et al., 2008 Hubel and Wiesel. 18 Critical learning periods and Information in Weights Achille, Rovere,

CS 103: Representation Learning, Information Theory and ControlLecture 8, Mar 1, 2019

Page 2: CS 103: Representation Learning, Information Theory and ... · Image from Cnops et al., 2008 Hubel and Wiesel. 18 Critical learning periods and Information in Weights Achille, Rovere,

2

Recap

Group nuisances- Group convolutions- Canonical reference frames- SIFT descriptorsGeneral nuisances- Minimal information in activation ⇒ Invariance to nuisances- Information Bottleneck- IB loss can upper-bounded by introducing an auxiliary variable- Aside: Variational Auto-Encoder can be seen as a particular case- Aside: Disentanglement in VAE

How does this relate to standard deep learning?

Page 3: CS 103: Representation Learning, Information Theory and ... · Image from Cnops et al., 2008 Hubel and Wiesel. 18 Critical learning periods and Information in Weights Achille, Rovere,

3

The Kolmogorov Structure of a Task

How can we define the structure of a task? Define the Kolmogorov Structure Function:

Trai

ning

Los

s

Kolmogorov complexity of model

S𝒟(t) = minK(M)≤t

L(𝒟; M)

Increasing the complexity of the model leadsto big gains in accuracy: We are learning the structure of the problem.

After learning all the structure, we can only memorize: inefficient asymptotic phase.

Optimal

Information Complexity of Tasks, their Structure and their Distance, Achille et al., 2018

Tangent = 1 in the asymptote: Need to store 1 bit in the model to decrease the loss by 1 bit

Kolmogorov minimal sufficient statistic

Kolmogorov's Structure Functions and Model Selection, Vereshchagin and Vitanyi, 2002

Page 4: CS 103: Representation Learning, Information Theory and ... · Image from Cnops et al., 2008 Hubel and Wiesel. 18 Critical learning periods and Information in Weights Achille, Rovere,

4

Optimizing using Deep Neural Networks

How do we find the optimal solution?

S𝒟(t) = minK(M)<t

L(𝒟; M)

ℒ(M) = L(𝒟; M) + λK(M)

Corresponding Lagrangian

K(M) ≤ KL(q(w |𝒟)∥p(w))

ℒ(M) = L(𝒟; M) + λ KL(q(w |𝒟)∥p(w))

Use the boundLet w be the parameters of the model

This loss can be implemented using a DNN and the local reparametrization trick.** Variational Dropout and the Local Reparameterization Trick, Kingma et al., 2015

Information Complexity of Tasks, their Structure and their Distance, Achille et al., 2018

Page 5: CS 103: Representation Learning, Information Theory and ... · Image from Cnops et al., 2008 Hubel and Wiesel. 18 Critical learning periods and Information in Weights Achille, Rovere,

5

Let’s rewrite it using Information Theory

We used an upperbound, what is the best we value it can assume?

I(w; 𝒟) ≤ 𝔼𝒟[KL(q(w |𝒟)∥p(w))],Recall that:

which is obtained when p(w) = q(w|D). Hence, on expectation over the datasets, the best function loss function to use to recover the task structure is:

ℒ(M) = 𝔼𝒟[H(𝒟 |w)] + λI(w; 𝒟) .

IB Lagrangian for the weights

ℒ(M) = 𝔼w∼q(w|𝒟)[Hp,q(𝒟 |w)] + λ KL(q(w |𝒟)∥p(w)) .

Page 6: CS 103: Representation Learning, Information Theory and ... · Image from Cnops et al., 2008 Hubel and Wiesel. 18 Critical learning periods and Information in Weights Achille, Rovere,

A new Information Bottleneck

dataset weights real distributionD w p(y|x)

data activations labelx z y

Activations IBInvariance

Weights IBOverfitting

6

minwL = Hp,qw (y |z) + �I(D;w)

<latexit sha1_base64="07Llzu3pX36qhiNEl5FSqs7OCBw=">AAACf3iclVFdSxtBFJ3d2hqjrVHRF18uDdKEStgtBRURBPug0gcLRoUkLLOTSTI4O7vO3G0+1gX/pn+g+DOcfCD148ULA+eecw/3ciZMpDDoefeO+2Hu46f5wkJxcenzl+XSyuqFiVPNeJ3FMtZXITVcCsXrKFDyq0RzGoWSX4bXR2P98i/XRsTqHIcJb0W0q0RHMIqWCkp3W9CMhAqym8rodlDNbUexx6jMfudwAMdBlmzDTdDPoTK8HVXhOzRDjhROKqN9GFSLU3P/Hb6nwV/5PvSrQans1bxJwWvgz0CZzOosKD002zFLI66QSWpMw/cSbGVUo2CS58VmanhC2TXt8oaFikbctLJJUjlsWaYNnVjbpxAm7P+OjEbGDKPQTo7PNC+1MfmW1kixs9vKhEpS5IpNF3VSCRjDOHZoC80ZyqEFlGlhbwXWo5oytJ9TfLaGRaEW3R7mNhr/ZRCvwcWPmu/V/D8/y4ens5AKZJN8JRXikx1ySI7JGakTRv45S866s+E67je35nrTUdeZedbIs3L3HgEoLL3n</latexit><latexit sha1_base64="07Llzu3pX36qhiNEl5FSqs7OCBw=">AAACf3iclVFdSxtBFJ3d2hqjrVHRF18uDdKEStgtBRURBPug0gcLRoUkLLOTSTI4O7vO3G0+1gX/pn+g+DOcfCD148ULA+eecw/3ciZMpDDoefeO+2Hu46f5wkJxcenzl+XSyuqFiVPNeJ3FMtZXITVcCsXrKFDyq0RzGoWSX4bXR2P98i/XRsTqHIcJb0W0q0RHMIqWCkp3W9CMhAqym8rodlDNbUexx6jMfudwAMdBlmzDTdDPoTK8HVXhOzRDjhROKqN9GFSLU3P/Hb6nwV/5PvSrQans1bxJwWvgz0CZzOosKD002zFLI66QSWpMw/cSbGVUo2CS58VmanhC2TXt8oaFikbctLJJUjlsWaYNnVjbpxAm7P+OjEbGDKPQTo7PNC+1MfmW1kixs9vKhEpS5IpNF3VSCRjDOHZoC80ZyqEFlGlhbwXWo5oytJ9TfLaGRaEW3R7mNhr/ZRCvwcWPmu/V/D8/y4ens5AKZJN8JRXikx1ySI7JGakTRv45S866s+E67je35nrTUdeZedbIs3L3HgEoLL3n</latexit><latexit sha1_base64="07Llzu3pX36qhiNEl5FSqs7OCBw=">AAACf3iclVFdSxtBFJ3d2hqjrVHRF18uDdKEStgtBRURBPug0gcLRoUkLLOTSTI4O7vO3G0+1gX/pn+g+DOcfCD148ULA+eecw/3ciZMpDDoefeO+2Hu46f5wkJxcenzl+XSyuqFiVPNeJ3FMtZXITVcCsXrKFDyq0RzGoWSX4bXR2P98i/XRsTqHIcJb0W0q0RHMIqWCkp3W9CMhAqym8rodlDNbUexx6jMfudwAMdBlmzDTdDPoTK8HVXhOzRDjhROKqN9GFSLU3P/Hb6nwV/5PvSrQans1bxJwWvgz0CZzOosKD002zFLI66QSWpMw/cSbGVUo2CS58VmanhC2TXt8oaFikbctLJJUjlsWaYNnVjbpxAm7P+OjEbGDKPQTo7PNC+1MfmW1kixs9vKhEpS5IpNF3VSCRjDOHZoC80ZyqEFlGlhbwXWo5oytJ9TfLaGRaEW3R7mNhr/ZRCvwcWPmu/V/D8/y4ens5AKZJN8JRXikx1ySI7JGakTRv45S866s+E67je35nrTUdeZedbIs3L3HgEoLL3n</latexit><latexit sha1_base64="07Llzu3pX36qhiNEl5FSqs7OCBw=">AAACf3iclVFdSxtBFJ3d2hqjrVHRF18uDdKEStgtBRURBPug0gcLRoUkLLOTSTI4O7vO3G0+1gX/pn+g+DOcfCD148ULA+eecw/3ciZMpDDoefeO+2Hu46f5wkJxcenzl+XSyuqFiVPNeJ3FMtZXITVcCsXrKFDyq0RzGoWSX4bXR2P98i/XRsTqHIcJb0W0q0RHMIqWCkp3W9CMhAqym8rodlDNbUexx6jMfudwAMdBlmzDTdDPoTK8HVXhOzRDjhROKqN9GFSLU3P/Hb6nwV/5PvSrQans1bxJwWvgz0CZzOosKD002zFLI66QSWpMw/cSbGVUo2CS58VmanhC2TXt8oaFikbctLJJUjlsWaYNnVjbpxAm7P+OjEbGDKPQTo7PNC+1MfmW1kixs9vKhEpS5IpNF3VSCRjDOHZoC80ZyqEFlGlhbwXWo5oytJ9TfLaGRaEW3R7mNhr/ZRCvwcWPmu/V/D8/y4ens5AKZJN8JRXikx1ySI7JGakTRv45S866s+E67je35nrTUdeZedbIs3L3HgEoLL3n</latexit>

minq(z |x)

L = Hp,q(y |z) + �I(z ; x)<latexit sha1_base64="eu/9thb7n8qPJF11t/o556inDEc=">AAACfHicdVFdSxtBFJ1d26qxrVEffbk0Cgm2YVdKFaQg2AdbfLBgVMiGZXYySQZnZteZu43Juj+0z33pryidfCDV1AMDh3Pu4V7OJJkUFoPgp+cvvXj5anlltbL2+s3b9erG5qVNc8N4i6UyNdcJtVwKzVsoUPLrzHCqEsmvkpuTiX/1gxsrUn2Bo4x3FO1r0ROMopPiahkpoePitj6+v2uUECmKA0ZlcVbCZziNi+w93JZQH92PG7AHUcKRwtf6+AjuGpXdWXb4TCweLgYfBr+URzBsxNVa0AymgEUSzkmNzHEeV39H3ZTlimtkklrbDoMMOwU1KJjkZSXKLc8ou6F93nZUU8Vtp5jWVMKuU7rQS417GmGq/psoqLJ2pBI3OTnTPvUm4v+8do69w04hdJYj12y2qJdLwBQmnUNXGM5QjhyhzAh3K7ABNZSh+5nKozVMJUb0B1i6asKnRSySy/1mGDTD7x9rx9/mJa2QbfKO1ElIDsgxOSXnpEUY+eWtepvelvfH3/H3/A+zUd+bZ7bII/if/gLpa73S</latexit><latexit sha1_base64="eu/9thb7n8qPJF11t/o556inDEc=">AAACfHicdVFdSxtBFJ1d26qxrVEffbk0Cgm2YVdKFaQg2AdbfLBgVMiGZXYySQZnZteZu43Juj+0z33pryidfCDV1AMDh3Pu4V7OJJkUFoPgp+cvvXj5anlltbL2+s3b9erG5qVNc8N4i6UyNdcJtVwKzVsoUPLrzHCqEsmvkpuTiX/1gxsrUn2Bo4x3FO1r0ROMopPiahkpoePitj6+v2uUECmKA0ZlcVbCZziNi+w93JZQH92PG7AHUcKRwtf6+AjuGpXdWXb4TCweLgYfBr+URzBsxNVa0AymgEUSzkmNzHEeV39H3ZTlimtkklrbDoMMOwU1KJjkZSXKLc8ou6F93nZUU8Vtp5jWVMKuU7rQS417GmGq/psoqLJ2pBI3OTnTPvUm4v+8do69w04hdJYj12y2qJdLwBQmnUNXGM5QjhyhzAh3K7ABNZSh+5nKozVMJUb0B1i6asKnRSySy/1mGDTD7x9rx9/mJa2QbfKO1ElIDsgxOSXnpEUY+eWtepvelvfH3/H3/A+zUd+bZ7bII/if/gLpa73S</latexit><latexit sha1_base64="eu/9thb7n8qPJF11t/o556inDEc=">AAACfHicdVFdSxtBFJ1d26qxrVEffbk0Cgm2YVdKFaQg2AdbfLBgVMiGZXYySQZnZteZu43Juj+0z33pryidfCDV1AMDh3Pu4V7OJJkUFoPgp+cvvXj5anlltbL2+s3b9erG5qVNc8N4i6UyNdcJtVwKzVsoUPLrzHCqEsmvkpuTiX/1gxsrUn2Bo4x3FO1r0ROMopPiahkpoePitj6+v2uUECmKA0ZlcVbCZziNi+w93JZQH92PG7AHUcKRwtf6+AjuGpXdWXb4TCweLgYfBr+URzBsxNVa0AymgEUSzkmNzHEeV39H3ZTlimtkklrbDoMMOwU1KJjkZSXKLc8ou6F93nZUU8Vtp5jWVMKuU7rQS417GmGq/psoqLJ2pBI3OTnTPvUm4v+8do69w04hdJYj12y2qJdLwBQmnUNXGM5QjhyhzAh3K7ABNZSh+5nKozVMJUb0B1i6asKnRSySy/1mGDTD7x9rx9/mJa2QbfKO1ElIDsgxOSXnpEUY+eWtepvelvfH3/H3/A+zUd+bZ7bII/if/gLpa73S</latexit><latexit sha1_base64="eu/9thb7n8qPJF11t/o556inDEc=">AAACfHicdVFdSxtBFJ1d26qxrVEffbk0Cgm2YVdKFaQg2AdbfLBgVMiGZXYySQZnZteZu43Juj+0z33pryidfCDV1AMDh3Pu4V7OJJkUFoPgp+cvvXj5anlltbL2+s3b9erG5qVNc8N4i6UyNdcJtVwKzVsoUPLrzHCqEsmvkpuTiX/1gxsrUn2Bo4x3FO1r0ROMopPiahkpoePitj6+v2uUECmKA0ZlcVbCZziNi+w93JZQH92PG7AHUcKRwtf6+AjuGpXdWXb4TCweLgYfBr+URzBsxNVa0AymgEUSzkmNzHEeV39H3ZTlimtkklrbDoMMOwU1KJjkZSXKLc8ou6F93nZUU8Vtp5jWVMKuU7rQS417GmGq/psoqLJ2pBI3OTnTPvUm4v+8do69w04hdJYj12y2qJdLwBQmnUNXGM5QjhyhzAh3K7ABNZSh+5nKozVMJUb0B1i6asKnRSySy/1mGDTD7x9rx9/mJa2QbfKO1ElIDsgxOSXnpEUY+eWtepvelvfH3/H3/A+zUd+bZ7bII/if/gLpa73S</latexit>

Page 7: CS 103: Representation Learning, Information Theory and ... · Image from Cnops et al., 2008 Hubel and Wiesel. 18 Critical learning periods and Information in Weights Achille, Rovere,

7

PAC-Bayes bound (Catoni, 2007; McAllester 2013).

Corollary. Minimizing the IB Lagrangian for the weights minimizes an upper bound on the test error.

This gives non-vacuous generalization bounds! (Dziugaite and Roy, 2017)

Catoni, 2007; McAllester 2013The PAC-Bayes generalization bound

Page 8: CS 103: Representation Learning, Information Theory and ... · Image from Cnops et al., 2008 Hubel and Wiesel. 18 Critical learning periods and Information in Weights Achille, Rovere,

8

Can we really minimize the IBL for the weights?

We are making an approximation by assuming that both q(w|D) and p(w) are Gaussian.

Let w* be a local minimum. The optimal amount of gaussian noise is to add is:

where F(w*) is the Fisher Information Matrix (equiv. Hessian) computed in w*.

Flat minima have low information in the weights.

Σ = (I +2λ2

βF (w0))

−1,

I (w ;D) kwk2

�2+ log |2�2 N F (w⇤) + I |

<latexit sha1_base64="QDF8UiYJ9M3yOcB2uVn0sVGhLVM=">AAACUHicbVDLSgMxFL1T3/VVdekmWIRWpcwUQcGNqIhuRNFqoaklk2ZqMPMwyVjKtB/mb7hz5U70D9xp+lCseiHkcM69ObnHjQRX2rafrNTI6Nj4xORUenpmdm4+s7B4qcJYUlaioQhl2SWKCR6wkuZasHIkGfFdwa7c2/2ufnXPpOJhcKFbEav6pBFwj1OiDVXLnB/nmjsI+0TfUCKSg04eYcHuEPYkoQluN3H7uthJsDBP1omBaN00hA3ULn5zeOMEbxzmmtdreaMet2uZrF2we4X+AmcAsjCo01rmBddDGvss0FQQpSqOHelqQqTmVLBOGseKRYTekgarGBgQn6lq0lu+g1YNU0deKM0JNOqxPycS4ivV8l3T2d1S/da65H9aJdbedjXhQRRrFtC+kRcLpEPUTRLVuWRUi5YBhEpu/oroDTGxaZN3eshGeT0XE4zzO4a/4LJYcOyCc7aZ3d0bRDQJy7ACOXBgC3bhCE6hBBQe4Ble4c16tN6tj5TVb/26YQmGKpX+BMWJsk0=</latexit><latexit sha1_base64="QDF8UiYJ9M3yOcB2uVn0sVGhLVM=">AAACUHicbVDLSgMxFL1T3/VVdekmWIRWpcwUQcGNqIhuRNFqoaklk2ZqMPMwyVjKtB/mb7hz5U70D9xp+lCseiHkcM69ObnHjQRX2rafrNTI6Nj4xORUenpmdm4+s7B4qcJYUlaioQhl2SWKCR6wkuZasHIkGfFdwa7c2/2ufnXPpOJhcKFbEav6pBFwj1OiDVXLnB/nmjsI+0TfUCKSg04eYcHuEPYkoQluN3H7uthJsDBP1omBaN00hA3ULn5zeOMEbxzmmtdreaMet2uZrF2we4X+AmcAsjCo01rmBddDGvss0FQQpSqOHelqQqTmVLBOGseKRYTekgarGBgQn6lq0lu+g1YNU0deKM0JNOqxPycS4ivV8l3T2d1S/da65H9aJdbedjXhQRRrFtC+kRcLpEPUTRLVuWRUi5YBhEpu/oroDTGxaZN3eshGeT0XE4zzO4a/4LJYcOyCc7aZ3d0bRDQJy7ACOXBgC3bhCE6hBBQe4Ble4c16tN6tj5TVb/26YQmGKpX+BMWJsk0=</latexit><latexit sha1_base64="QDF8UiYJ9M3yOcB2uVn0sVGhLVM=">AAACUHicbVDLSgMxFL1T3/VVdekmWIRWpcwUQcGNqIhuRNFqoaklk2ZqMPMwyVjKtB/mb7hz5U70D9xp+lCseiHkcM69ObnHjQRX2rafrNTI6Nj4xORUenpmdm4+s7B4qcJYUlaioQhl2SWKCR6wkuZasHIkGfFdwa7c2/2ufnXPpOJhcKFbEav6pBFwj1OiDVXLnB/nmjsI+0TfUCKSg04eYcHuEPYkoQluN3H7uthJsDBP1omBaN00hA3ULn5zeOMEbxzmmtdreaMet2uZrF2we4X+AmcAsjCo01rmBddDGvss0FQQpSqOHelqQqTmVLBOGseKRYTekgarGBgQn6lq0lu+g1YNU0deKM0JNOqxPycS4ivV8l3T2d1S/da65H9aJdbedjXhQRRrFtC+kRcLpEPUTRLVuWRUi5YBhEpu/oroDTGxaZN3eshGeT0XE4zzO4a/4LJYcOyCc7aZ3d0bRDQJy7ACOXBgC3bhCE6hBBQe4Ble4c16tN6tj5TVb/26YQmGKpX+BMWJsk0=</latexit><latexit sha1_base64="QDF8UiYJ9M3yOcB2uVn0sVGhLVM=">AAACUHicbVDLSgMxFL1T3/VVdekmWIRWpcwUQcGNqIhuRNFqoaklk2ZqMPMwyVjKtB/mb7hz5U70D9xp+lCseiHkcM69ObnHjQRX2rafrNTI6Nj4xORUenpmdm4+s7B4qcJYUlaioQhl2SWKCR6wkuZasHIkGfFdwa7c2/2ufnXPpOJhcKFbEav6pBFwj1OiDVXLnB/nmjsI+0TfUCKSg04eYcHuEPYkoQluN3H7uthJsDBP1omBaN00hA3ULn5zeOMEbxzmmtdreaMet2uZrF2we4X+AmcAsjCo01rmBddDGvss0FQQpSqOHelqQqTmVLBOGseKRYTekgarGBgQn6lq0lu+g1YNU0deKM0JNOqxPycS4ivV8l3T2d1S/da65H9aJdbedjXhQRRrFtC+kRcLpEPUTRLVuWRUi5YBhEpu/oroDTGxaZN3eshGeT0XE4zzO4a/4LJYcOyCc7aZ3d0bRDQJy7ACOXBgC3bhCE6hBBQe4Ble4c16tN6tj5TVb/26YQmGKpX+BMWJsk0=</latexit>

Weight Information is bounded by the geometry of the loss landscape*

Page 9: CS 103: Representation Learning, Information Theory and ... · Image from Cnops et al., 2008 Hubel and Wiesel. 18 Critical learning periods and Information in Weights Achille, Rovere,

9

Is this approximation good? Phase transition

𝛽 < 1 ⇒ overfitting

𝛽 > 1 ⇒ underfitting

𝛽 < 1 ⇒ overfitting

𝛽 >> 1 ⇒ underfitting

fitting

Consider a dataset with random labels. There is no structure, so for λ < 1we should not fit anything, and for λ > 1 we should memorize everything (see the structure function).

10�2 10�1 100 101 102

Value of �

0%

20%

40%

60%

80%

100%

Tra

iner

ror

All-CNN

ResNet

Small AlexNet

Achille and Soatto, Emergence of Invariance and Disentanglement in Deep Representations, JMLR 2018

Phase transition

Even with local approximation, we can observe this behavior in real deep networks.For real label, we have a “Goldilocks Zone” where we fit without overfitting for λ > 1.

Page 10: CS 103: Representation Learning, Information Theory and ... · Image from Cnops et al., 2008 Hubel and Wiesel. 18 Critical learning periods and Information in Weights Achille, Rovere,

Information is a better measure of complexity than number of parameters

10

Bias-variance tradeoff

Variance

Bias2

Total error

Model complexity

Err

or

Optimal DNN

Optimal model

Optimal DNN

Information complexity

Parametrizing the complexity with information in the weights, we recover bias-variance trade-off trend.

Achille and Soatto, Emergence of Invariance and Disentanglement in Deep Representations, JMLR 2018Arora et al., Stronger generalization bounds for deep nets via a compression approach, ICML 2018

Page 11: CS 103: Representation Learning, Information Theory and ... · Image from Cnops et al., 2008 Hubel and Wiesel. 18 Critical learning periods and Information in Weights Achille, Rovere,

A new Information Bottleneck

dataset weights real distributionD w p(y|x)

data activations labelx z y

Activations IBInvariance

Weights IBOverfitting

11

minwL = Hp,qw (y |z) + �I(D;w)

<latexit sha1_base64="07Llzu3pX36qhiNEl5FSqs7OCBw=">AAACf3iclVFdSxtBFJ3d2hqjrVHRF18uDdKEStgtBRURBPug0gcLRoUkLLOTSTI4O7vO3G0+1gX/pn+g+DOcfCD148ULA+eecw/3ciZMpDDoefeO+2Hu46f5wkJxcenzl+XSyuqFiVPNeJ3FMtZXITVcCsXrKFDyq0RzGoWSX4bXR2P98i/XRsTqHIcJb0W0q0RHMIqWCkp3W9CMhAqym8rodlDNbUexx6jMfudwAMdBlmzDTdDPoTK8HVXhOzRDjhROKqN9GFSLU3P/Hb6nwV/5PvSrQans1bxJwWvgz0CZzOosKD002zFLI66QSWpMw/cSbGVUo2CS58VmanhC2TXt8oaFikbctLJJUjlsWaYNnVjbpxAm7P+OjEbGDKPQTo7PNC+1MfmW1kixs9vKhEpS5IpNF3VSCRjDOHZoC80ZyqEFlGlhbwXWo5oytJ9TfLaGRaEW3R7mNhr/ZRCvwcWPmu/V/D8/y4ens5AKZJN8JRXikx1ySI7JGakTRv45S866s+E67je35nrTUdeZedbIs3L3HgEoLL3n</latexit><latexit sha1_base64="07Llzu3pX36qhiNEl5FSqs7OCBw=">AAACf3iclVFdSxtBFJ3d2hqjrVHRF18uDdKEStgtBRURBPug0gcLRoUkLLOTSTI4O7vO3G0+1gX/pn+g+DOcfCD148ULA+eecw/3ciZMpDDoefeO+2Hu46f5wkJxcenzl+XSyuqFiVPNeJ3FMtZXITVcCsXrKFDyq0RzGoWSX4bXR2P98i/XRsTqHIcJb0W0q0RHMIqWCkp3W9CMhAqym8rodlDNbUexx6jMfudwAMdBlmzDTdDPoTK8HVXhOzRDjhROKqN9GFSLU3P/Hb6nwV/5PvSrQans1bxJwWvgz0CZzOosKD002zFLI66QSWpMw/cSbGVUo2CS58VmanhC2TXt8oaFikbctLJJUjlsWaYNnVjbpxAm7P+OjEbGDKPQTo7PNC+1MfmW1kixs9vKhEpS5IpNF3VSCRjDOHZoC80ZyqEFlGlhbwXWo5oytJ9TfLaGRaEW3R7mNhr/ZRCvwcWPmu/V/D8/y4ens5AKZJN8JRXikx1ySI7JGakTRv45S866s+E67je35nrTUdeZedbIs3L3HgEoLL3n</latexit><latexit sha1_base64="07Llzu3pX36qhiNEl5FSqs7OCBw=">AAACf3iclVFdSxtBFJ3d2hqjrVHRF18uDdKEStgtBRURBPug0gcLRoUkLLOTSTI4O7vO3G0+1gX/pn+g+DOcfCD148ULA+eecw/3ciZMpDDoefeO+2Hu46f5wkJxcenzl+XSyuqFiVPNeJ3FMtZXITVcCsXrKFDyq0RzGoWSX4bXR2P98i/XRsTqHIcJb0W0q0RHMIqWCkp3W9CMhAqym8rodlDNbUexx6jMfudwAMdBlmzDTdDPoTK8HVXhOzRDjhROKqN9GFSLU3P/Hb6nwV/5PvSrQans1bxJwWvgz0CZzOosKD002zFLI66QSWpMw/cSbGVUo2CS58VmanhC2TXt8oaFikbctLJJUjlsWaYNnVjbpxAm7P+OjEbGDKPQTo7PNC+1MfmW1kixs9vKhEpS5IpNF3VSCRjDOHZoC80ZyqEFlGlhbwXWo5oytJ9TfLaGRaEW3R7mNhr/ZRCvwcWPmu/V/D8/y4ens5AKZJN8JRXikx1ySI7JGakTRv45S866s+E67je35nrTUdeZedbIs3L3HgEoLL3n</latexit><latexit sha1_base64="07Llzu3pX36qhiNEl5FSqs7OCBw=">AAACf3iclVFdSxtBFJ3d2hqjrVHRF18uDdKEStgtBRURBPug0gcLRoUkLLOTSTI4O7vO3G0+1gX/pn+g+DOcfCD148ULA+eecw/3ciZMpDDoefeO+2Hu46f5wkJxcenzl+XSyuqFiVPNeJ3FMtZXITVcCsXrKFDyq0RzGoWSX4bXR2P98i/XRsTqHIcJb0W0q0RHMIqWCkp3W9CMhAqym8rodlDNbUexx6jMfudwAMdBlmzDTdDPoTK8HVXhOzRDjhROKqN9GFSLU3P/Hb6nwV/5PvSrQans1bxJwWvgz0CZzOosKD002zFLI66QSWpMw/cSbGVUo2CS58VmanhC2TXt8oaFikbctLJJUjlsWaYNnVjbpxAm7P+OjEbGDKPQTo7PNC+1MfmW1kixs9vKhEpS5IpNF3VSCRjDOHZoC80ZyqEFlGlhbwXWo5oytJ9TfLaGRaEW3R7mNhr/ZRCvwcWPmu/V/D8/y4ens5AKZJN8JRXikx1ySI7JGakTRv45S866s+E67je35nrTUdeZedbIs3L3HgEoLL3n</latexit>

minq(z |x)

L = Hp,q(y |z) + �I(z ; x)<latexit sha1_base64="eu/9thb7n8qPJF11t/o556inDEc=">AAACfHicdVFdSxtBFJ1d26qxrVEffbk0Cgm2YVdKFaQg2AdbfLBgVMiGZXYySQZnZteZu43Juj+0z33pryidfCDV1AMDh3Pu4V7OJJkUFoPgp+cvvXj5anlltbL2+s3b9erG5qVNc8N4i6UyNdcJtVwKzVsoUPLrzHCqEsmvkpuTiX/1gxsrUn2Bo4x3FO1r0ROMopPiahkpoePitj6+v2uUECmKA0ZlcVbCZziNi+w93JZQH92PG7AHUcKRwtf6+AjuGpXdWXb4TCweLgYfBr+URzBsxNVa0AymgEUSzkmNzHEeV39H3ZTlimtkklrbDoMMOwU1KJjkZSXKLc8ou6F93nZUU8Vtp5jWVMKuU7rQS417GmGq/psoqLJ2pBI3OTnTPvUm4v+8do69w04hdJYj12y2qJdLwBQmnUNXGM5QjhyhzAh3K7ABNZSh+5nKozVMJUb0B1i6asKnRSySy/1mGDTD7x9rx9/mJa2QbfKO1ElIDsgxOSXnpEUY+eWtepvelvfH3/H3/A+zUd+bZ7bII/if/gLpa73S</latexit><latexit sha1_base64="eu/9thb7n8qPJF11t/o556inDEc=">AAACfHicdVFdSxtBFJ1d26qxrVEffbk0Cgm2YVdKFaQg2AdbfLBgVMiGZXYySQZnZteZu43Juj+0z33pryidfCDV1AMDh3Pu4V7OJJkUFoPgp+cvvXj5anlltbL2+s3b9erG5qVNc8N4i6UyNdcJtVwKzVsoUPLrzHCqEsmvkpuTiX/1gxsrUn2Bo4x3FO1r0ROMopPiahkpoePitj6+v2uUECmKA0ZlcVbCZziNi+w93JZQH92PG7AHUcKRwtf6+AjuGpXdWXb4TCweLgYfBr+URzBsxNVa0AymgEUSzkmNzHEeV39H3ZTlimtkklrbDoMMOwU1KJjkZSXKLc8ou6F93nZUU8Vtp5jWVMKuU7rQS417GmGq/psoqLJ2pBI3OTnTPvUm4v+8do69w04hdJYj12y2qJdLwBQmnUNXGM5QjhyhzAh3K7ABNZSh+5nKozVMJUb0B1i6asKnRSySy/1mGDTD7x9rx9/mJa2QbfKO1ElIDsgxOSXnpEUY+eWtepvelvfH3/H3/A+zUd+bZ7bII/if/gLpa73S</latexit><latexit sha1_base64="eu/9thb7n8qPJF11t/o556inDEc=">AAACfHicdVFdSxtBFJ1d26qxrVEffbk0Cgm2YVdKFaQg2AdbfLBgVMiGZXYySQZnZteZu43Juj+0z33pryidfCDV1AMDh3Pu4V7OJJkUFoPgp+cvvXj5anlltbL2+s3b9erG5qVNc8N4i6UyNdcJtVwKzVsoUPLrzHCqEsmvkpuTiX/1gxsrUn2Bo4x3FO1r0ROMopPiahkpoePitj6+v2uUECmKA0ZlcVbCZziNi+w93JZQH92PG7AHUcKRwtf6+AjuGpXdWXb4TCweLgYfBr+URzBsxNVa0AymgEUSzkmNzHEeV39H3ZTlimtkklrbDoMMOwU1KJjkZSXKLc8ou6F93nZUU8Vtp5jWVMKuU7rQS417GmGq/psoqLJ2pBI3OTnTPvUm4v+8do69w04hdJYj12y2qJdLwBQmnUNXGM5QjhyhzAh3K7ABNZSh+5nKozVMJUb0B1i6asKnRSySy/1mGDTD7x9rx9/mJa2QbfKO1ElIDsgxOSXnpEUY+eWtepvelvfH3/H3/A+zUd+bZ7bII/if/gLpa73S</latexit><latexit sha1_base64="eu/9thb7n8qPJF11t/o556inDEc=">AAACfHicdVFdSxtBFJ1d26qxrVEffbk0Cgm2YVdKFaQg2AdbfLBgVMiGZXYySQZnZteZu43Juj+0z33pryidfCDV1AMDh3Pu4V7OJJkUFoPgp+cvvXj5anlltbL2+s3b9erG5qVNc8N4i6UyNdcJtVwKzVsoUPLrzHCqEsmvkpuTiX/1gxsrUn2Bo4x3FO1r0ROMopPiahkpoePitj6+v2uUECmKA0ZlcVbCZziNi+w93JZQH92PG7AHUcKRwtf6+AjuGpXdWXb4TCweLgYfBr+URzBsxNVa0AymgEUSzkmNzHEeV39H3ZTlimtkklrbDoMMOwU1KJjkZSXKLc8ou6F93nZUU8Vtp5jWVMKuU7rQS417GmGq/psoqLJ2pBI3OTnTPvUm4v+8do69w04hdJYj12y2qJdLwBQmnUNXGM5QjhyhzAh3K7ABNZSh+5nKozVMJUb0B1i6asKnRSySy/1mGDTD7x9rx9/mJa2QbfKO1ElIDsgxOSXnpEUY+eWtepvelvfH3/H3/A+zUd+bZ7bII/if/gLpa73S</latexit>

Page 12: CS 103: Representation Learning, Information Theory and ... · Image from Cnops et al., 2008 Hubel and Wiesel. 18 Critical learning periods and Information in Weights Achille, Rovere,

12

The Emergence Bound

Minimality of the weights (representation of the training set) induces minimality (hence invariance) and disentanglement of the activations.

g(I (w ;D)) I (z ; x) + TC (z) g(I (w ;D)) + O(1/ dim(x))<latexit sha1_base64="1GEXv2gC0eqEpKMRcwOX24wTxDo=">AAACU3icbVBNS8MwGM7q15xfU49egkNoEWYrgoKXwTy4i05xOljLSLN0CyZtTVLdVva//CMevHhV/AtezLYedPpA4OF5n/cjjx8zKpVtv+aMufmFxaX8cmFldW19o7i5dSujRGDSwBGLRNNHkjAakoaiipFmLAjiPiN3/n11XL97JELSKLxRg5h4HHVDGlCMlJbaxeuuWTOfTqHLkephxNKzkWVBl5EHWDOHp7BvwX14UzWHmfivfR9ems6B26Hc7FtWu1iyy/YE8C9xMlICGert4qfbiXDCSagwQ1K2HDtWXoqEopiRUcFNJIkRvkdd0tI0RJxIL538fQT3tNKBQST0CxWcqD87UsSlHHBfO8c3y9naWPyv1kpUcOKlNIwTRUI8XRQkDKoIjoOEHSoIVmygCcKC6lsh7iGBsNJxF36NwtxLZTDZpMNxZqP4S24Py45ddq6OSpWLLKY82AG7wAQOOAYVcA7qoAEweAZv4B185F5yX4ZhzE+tRi7r2Qa/YKx9A10Xr2Y=</latexit><latexit sha1_base64="1GEXv2gC0eqEpKMRcwOX24wTxDo=">AAACU3icbVBNS8MwGM7q15xfU49egkNoEWYrgoKXwTy4i05xOljLSLN0CyZtTVLdVva//CMevHhV/AtezLYedPpA4OF5n/cjjx8zKpVtv+aMufmFxaX8cmFldW19o7i5dSujRGDSwBGLRNNHkjAakoaiipFmLAjiPiN3/n11XL97JELSKLxRg5h4HHVDGlCMlJbaxeuuWTOfTqHLkephxNKzkWVBl5EHWDOHp7BvwX14UzWHmfivfR9ems6B26Hc7FtWu1iyy/YE8C9xMlICGert4qfbiXDCSagwQ1K2HDtWXoqEopiRUcFNJIkRvkdd0tI0RJxIL538fQT3tNKBQST0CxWcqD87UsSlHHBfO8c3y9naWPyv1kpUcOKlNIwTRUI8XRQkDKoIjoOEHSoIVmygCcKC6lsh7iGBsNJxF36NwtxLZTDZpMNxZqP4S24Py45ddq6OSpWLLKY82AG7wAQOOAYVcA7qoAEweAZv4B185F5yX4ZhzE+tRi7r2Qa/YKx9A10Xr2Y=</latexit><latexit sha1_base64="1GEXv2gC0eqEpKMRcwOX24wTxDo=">AAACU3icbVBNS8MwGM7q15xfU49egkNoEWYrgoKXwTy4i05xOljLSLN0CyZtTVLdVva//CMevHhV/AtezLYedPpA4OF5n/cjjx8zKpVtv+aMufmFxaX8cmFldW19o7i5dSujRGDSwBGLRNNHkjAakoaiipFmLAjiPiN3/n11XL97JELSKLxRg5h4HHVDGlCMlJbaxeuuWTOfTqHLkephxNKzkWVBl5EHWDOHp7BvwX14UzWHmfivfR9ems6B26Hc7FtWu1iyy/YE8C9xMlICGert4qfbiXDCSagwQ1K2HDtWXoqEopiRUcFNJIkRvkdd0tI0RJxIL538fQT3tNKBQST0CxWcqD87UsSlHHBfO8c3y9naWPyv1kpUcOKlNIwTRUI8XRQkDKoIjoOEHSoIVmygCcKC6lsh7iGBsNJxF36NwtxLZTDZpMNxZqP4S24Py45ddq6OSpWLLKY82AG7wAQOOAYVcA7qoAEweAZv4B185F5yX4ZhzE+tRi7r2Qa/YKx9A10Xr2Y=</latexit><latexit sha1_base64="1GEXv2gC0eqEpKMRcwOX24wTxDo=">AAACU3icbVBNS8MwGM7q15xfU49egkNoEWYrgoKXwTy4i05xOljLSLN0CyZtTVLdVva//CMevHhV/AtezLYedPpA4OF5n/cjjx8zKpVtv+aMufmFxaX8cmFldW19o7i5dSujRGDSwBGLRNNHkjAakoaiipFmLAjiPiN3/n11XL97JELSKLxRg5h4HHVDGlCMlJbaxeuuWTOfTqHLkephxNKzkWVBl5EHWDOHp7BvwX14UzWHmfivfR9ems6B26Hc7FtWu1iyy/YE8C9xMlICGert4qfbiXDCSagwQ1K2HDtWXoqEopiRUcFNJIkRvkdd0tI0RJxIL538fQT3tNKBQST0CxWcqD87UsSlHHBfO8c3y9naWPyv1kpUcOKlNIwTRUI8XRQkDKoIjoOEHSoIVmygCcKC6lsh7iGBsNJxF36NwtxLZTDZpMNxZqP4S24Py45ddq6OSpWLLKY82AG7wAQOOAYVcA7qoAEweAZv4B185F5yX4ZhzE+tRi7r2Qa/YKx9A10Xr2Y=</latexit>

I(zL;x) mink<L{dim(zk)⇥g( I(Wk;D)

dim(Wk) ) + 1⇤}

Tight for one layer; for more layers we have:

Achille and Soatto, Emergence of Invariance and Disentanglement in Deep Representations, JMLR 2018

Info in activations Info in weights

Page 13: CS 103: Representation Learning, Information Theory and ... · Image from Cnops et al., 2008 Hubel and Wiesel. 18 Critical learning periods and Information in Weights Achille, Rovere,

13

How do minimize the information in the weights?

1. Explicitly minimize the IBL for the weights using local reparametrization trick.2. Let SGD do it for you:• Empirically we know that SGD tend to find flatter minima.• We know from the local information bound that flat minima have less

information in the weights.• Hence, SGD implicitly minimizes the information in the weights.

3. Modify SGD to reduce information more aggressively.

Warning: These are still open questions and the claims are not proved.

Counterpoint: Are flat minima the whole story?

Page 14: CS 103: Representation Learning, Information Theory and ... · Image from Cnops et al., 2008 Hubel and Wiesel. 18 Critical learning periods and Information in Weights Achille, Rovere,

14

Information in Weights during training

Training Epoch

Info

rmat

ion

in W

eigh

ts

What should we expect from the information in the weights during training?

Maybe something like this?

Page 15: CS 103: Representation Learning, Information Theory and ... · Image from Cnops et al., 2008 Hubel and Wiesel. 18 Critical learning periods and Information in Weights Achille, Rovere,

15

Information during training

Achille, Rovere, Soatto, Critical Learning Periods in Deep Networks, 2018

Information extraction Information consolidation

Page 16: CS 103: Representation Learning, Information Theory and ... · Image from Cnops et al., 2008 Hubel and Wiesel. 18 Critical learning periods and Information in Weights Achille, Rovere,

16

A snag: Critical periods in Deep Networks

Achille, Rovere, Soatto, Critical Learning Periods in Deep Networks, 2018

Deficit Normal training

0 160+NN

Show network blurred images to simulate cataract

The network does not classify correctly if the deficit is removed to late

A short deficit at epoch ~40 is enough to permanently damage the network.

Page 17: CS 103: Representation Learning, Information Theory and ... · Image from Cnops et al., 2008 Hubel and Wiesel. 18 Critical learning periods and Information in Weights Achille, Rovere,

17

Critical periods

Critical periods. A time-period in early development where sensory deficits can permanently impair the acquisition of a skill

Examples: monocular deprivation, cataracts, imprinting, language acquisition, …

Deficit Normal training

0 180 + NN

Kitten does not recover vision in covered eye

Image from Cnops et al., 2008

Hubel and Wiesel

Page 18: CS 103: Representation Learning, Information Theory and ... · Image from Cnops et al., 2008 Hubel and Wiesel. 18 Critical learning periods and Information in Weights Achille, Rovere,

18

Critical learning periods and Information in Weights

Achille, Rovere, Soatto, Critical Learning Periods in Deep Networks, 2018

Sensitivity to deficits peaks when network is absorbing information.Is minimal when the network is consolidating information.

Page 19: CS 103: Representation Learning, Information Theory and ... · Image from Cnops et al., 2008 Hubel and Wiesel. 18 Critical learning periods and Information in Weights Achille, Rovere,

19

Are flat minima an epiphenomenon?

Final sharpness correlates with generalization…

…but generalization quality is decided here, far from convergence to minima