IDS Lab
Understanding deep learning requires rethinking generalization
Is deep learning really doing any generalization?
presented by Jamie Seol
Motivation
• Normally, we measure generalization by:
• generalization error = |training error - test error|
• if we overfit, the training error is low while the test error becomes large = high generalization error!
• However, a complex neural network is prone to overfitting!
• for example, let’s train a human baby on a randomly labeled CIFAR-10 dataset
• then, show them some samples from the training set again (2nd+ epoch)
• they will say "what the…" to any question
• because it’s impossible to generalize any kind of abstract concept from random labels!
• what about a neural network?
CIFAR-10
• This is the CIFAR-10 dataset
• The goal of this task is to classify a given image into one of 10 classes
• The CNNs that we know well solve this rather easily
Randomized CIFAR-10
• When we randomize the information (such as the labels) in CIFAR-10’s training set, the resulting accuracy becomes:
[Figure: accuracy after randomizing the CIFAR-10 training set]
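The randomization step itself (not the resulting accuracy) could be sketched roughly as follows. This is a minimal illustration, not the paper's code: the function name `corrupt_labels`, the `corruption_prob` parameter, and the stand-in label array are all assumptions made for this example. Each label is replaced, with the given probability, by a class drawn uniformly at random.

```python
import numpy as np

def corrupt_labels(labels, corruption_prob, num_classes=10, seed=0):
    """Replace each label, with probability corruption_prob, by a uniformly random class."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    # Decide independently for every example whether its label gets replaced.
    mask = rng.random(labels.shape[0]) < corruption_prob
    random_labels = rng.integers(0, num_classes, size=labels.shape[0])
    return np.where(mask, random_labels, labels)

# Hypothetical stand-in for the 50,000 CIFAR-10 training labels.
true_labels = np.arange(50_000) % 10
noisy_labels = corrupt_labels(true_labels, corruption_prob=1.0)

# A uniformly random label agrees with the true one about 1 time in 10.
agreement = float(np.mean(noisy_labels == true_labels))
```

With `corruption_prob=1.0` the labels carry no information about the images, so any training accuracy above chance has to come from memorization.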
Randomized CIFAR-10
• This is nothing more than over-overfitting!
• What’s the problem, then?
• neural networks memorized the dataset
• even though it should have no meaning!
• it’s random! raaaaandddddddommm!!!
• aaaaarrrrrrrr!!!
• it did not generalize any concept
• it just memorized!!!!
Randomized CIFAR-10
• Even if you didn’t intend it, neural nets can just memorize things rather than generalizing!
• According to the experiment,
• the effective capacity of neural networks is sufficient for memorizing the entire data set
• randomizing (corrupting) the data set makes the task (optimization) harder by only a small constant factor compared to the original task!
• Again, even if you didn’t want it!! neural networks are prone to overfitting in a natural sense!!
• "You don’t have to explain the meanings. I’ll just memorize it" - Chatur, from the movie "3 Idiots"
Regularization
• However, we do know that there are a lot of regularization techniques that support generalization!
• dropout, batch norm, early stopping, weight decay…
• They do seem to help, but wait….
• can someone prove that regularization fundamentally improves generalization?
• does this work really, really well? really???
Regularization
• Isn’t data augmentation significantly more important than weight decay?
• Even with regularization, neural networks are good memorizers
• Just changing the model increased test accuracy
Regularization
• Early stopping helps
• but not necessarily…
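For readers unfamiliar with the technique, a generic patience-based early stopping rule might look like the sketch below. This is a common textbook variant, not necessarily the procedure used in the paper; the function name `early_stopping_epoch`, the `patience` parameter, and the validation losses are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the epoch whose weights would be kept under patience-based early stopping."""
    best_epoch = 0
    best_loss = float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch

# Made-up validation losses: they improve, then degrade as the model starts to overfit.
losses = [1.00, 0.80, 0.65, 0.60, 0.62, 0.66, 0.71, 0.75]
stop_at = early_stopping_epoch(losses, patience=3)
```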
Regularization
• Well… these techniques do seem helpful, but suspicion remains…
Rademacher complexity
• By the way, what’s the big deal about memorizing everything?
• The following measure is called Rademacher complexity
• Detailed math is omitted here
• The thing is, if a model can memorize everything (actually, if the hypothesis class has the power to fit a randomized dataset), then the theoretical upper bound on the generalization error is just 1
• which is useless!!!!
• actually, using a regularization scheme lowers the bound, but this is not true for ReLU, and we’ll show that there are situations where regularization does not help at all
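For reference, the measure named on this slide is usually defined as below. This is the standard textbook form, written with notation assumed for this note: H is the hypothesis class, x_1, …, x_n is the sample, and the σ_i are independent uniform ±1 signs.

```latex
\hat{\mathfrak{R}}_n(\mathcal{H})
  = \mathbb{E}_{\sigma}\left[ \sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i \, h(x_i) \right],
\qquad \sigma_1, \dots, \sigma_n \in \{\pm 1\} \ \text{i.i.d. uniform}
```

If H can fit every sign pattern on the sample, the supremum is reached by an h with h(x_i) = σ_i for all i, so the value is 1. That is the trivial bound this slide refers to.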
Finite-sample expressivity
• Remember the Universal Approximation Theorem (UAT)?
• the finite-sample expressivity theorem is a more practical version of it
• note that this statement shows that the UAT does not guarantee generalization!
• Theorem 1: there exists a 2-layer ReLU NN with 2n+d weights that can represent any function on a sample of n points in d dimensions
• This is not a hard theorem to prove, so let’s do it
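Lemma 1
• Lemma 1: for b_1 < x_1 < b_2 < … < b_n < x_n, the matrix A = [ReLU(x_i - b_j)]_ij has full rank
• Proof: obvious (A is lower triangular with a strictly positive diagonal, since x_i - b_j > 0 exactly when j ≤ i)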
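A quick numerical sanity check of the lemma, as a sketch: the interleaved values below are an arbitrary example chosen for illustration, and `relu_matrix` is a helper name introduced here.

```python
import numpy as np

def relu_matrix(x, b):
    """A[i, j] = ReLU(x[i] - b[j])."""
    return np.maximum(x[:, None] - b[None, :], 0.0)

# Interleaved values b_1 < x_1 < b_2 < x_2 < ... < b_n < x_n (arbitrary example).
n = 5
b = np.arange(n, dtype=float)   # 0, 1, 2, 3, 4
x = b + 0.5                     # 0.5, 1.5, 2.5, 3.5, 4.5
A = relu_matrix(x, b)

# Entries above the diagonal vanish because x_i < b_j whenever j > i.
is_lower_triangular = bool(np.allclose(A, np.tril(A)))
diagonal_positive = bool(np.all(np.diag(A) > 0))
rank = int(np.linalg.matrix_rank(A))
```

A triangular matrix with a nonzero diagonal has a nonzero determinant, which is why the rank comes out full.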
Theorem 1
• Theorem 1: there exists a 2-layer ReLU NN with 2n+d weights that can represent any function on a sample of n points in d dimensions
• Proof: Note that a 2-layer neural network with ReLU can be expressed as
• NN2(z) = Σ_j w_j ReLU(⟨a, z⟩ - b_j), with j from 1 to n
• where w, b ∈ ℝ^n and a ∈ ℝ^d
• for data S = {z_1, …, z_n} and labels y ∈ ℝ^n where z_i ∈ ℝ^d, we want to show (WTS) y_i = NN2(z_i) for all i from 1 to n
• choose a, b so that x_i = ⟨a, z_i⟩ meets the condition of Lemma 1 (possible whenever the z_i are distinct)
• Then, this becomes y = Aw, and Lemma 1 says that A is invertible, so w = A^(-1) y
• done
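The construction in the proof can be tried out numerically. The following is a minimal sketch under stated assumptions, not the paper's code: the projection a is drawn at random (so the projected points are distinct with probability 1), the b_j are placed between consecutive sorted projections, and the names `fit_two_layer_relu` and `nn2` are introduced here for illustration.

```python
import numpy as np

def fit_two_layer_relu(Z, y, seed=0):
    """Return (a, b, w) such that sum_j w[j] * ReLU(<a, z_i> - b[j]) equals y[i] for all i."""
    rng = np.random.default_rng(seed)
    n, d = Z.shape
    a = rng.standard_normal(d)      # random direction: projections are distinct almost surely
    x = Z @ a
    xs = np.sort(x)
    # Interleave in sorted order: b_1 < x_(1) < b_2 < x_(2) < ... < b_n < x_(n).
    b = np.empty(n)
    b[0] = xs[0] - 1.0
    b[1:] = (xs[:-1] + xs[1:]) / 2.0
    A = np.maximum(x[:, None] - b[None, :], 0.0)
    # A is a row permutation of a lower-triangular matrix with positive diagonal.
    w = np.linalg.solve(A, y)
    return a, b, w

def nn2(Z, a, b, w):
    """Evaluate the 2-layer ReLU network on every row of Z."""
    return np.maximum((Z @ a)[:, None] - b[None, :], 0.0) @ w

rng = np.random.default_rng(1)
Z = rng.standard_normal((20, 8))    # n = 20 samples in d = 8 dimensions
y = rng.standard_normal(20)         # arbitrary (random) targets
a, b, w = fit_two_layer_relu(Z, y)
max_abs_error = float(np.max(np.abs(nn2(Z, a, b, w) - y)))
num_weights = a.size + b.size + w.size
```

The targets here are pure noise, yet the network reproduces them on the sample using 2n + d = 48 numbers, which is the memorization the theorem describes.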
Finite-sample expressivity
• What does it mean?
• It means that once you have more than about 2n + d parameters, your model already has enough power to super-overfit and just remember everything instead of generalizing any concept
• therefore it gets only a trivial bound on the generalization error and is exposed to the sudden-death danger of being nothing more than a memorizer
• long story short: we can’t speak formally about generalization in deep learning yet
• a snake’s leg (a superfluous aside): for a deeper network, use intermediate layers to choose split intervals rather than targets, which requires a similar O(n + d) number of parameters
Stochastic Gradient Descent
• Let’s think about linear optimization
• If we have a large d, which makes this an underdetermined problem, then we can have multiple global minima
• But hey, can we determine which optimum gives the best generalization?
• in non-linear systems, looking at the curvature helped
• but in a linear system the curvature is the same at every optimum, so it cannot tell them apart!
Stochastic Gradient Descent
• The funny thing about SGD is that, for an underdetermined system, it converges to the minimum l2-norm solution, so it is known to act as a regularizer itself
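The claim about the minimum-norm solution can be checked on a toy problem. The sketch below makes several assumptions that are not from the slides: it uses full-batch gradient descent instead of SGD for simplicity, a squared loss, random Gaussian data, and a zero initialization (which keeps the iterates in the span of the data points).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 50                        # fewer equations than unknowns: underdetermined
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Plain gradient descent on 0.5 * ||X w - y||^2, started at w = 0.
w = np.zeros(d)
step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / (largest singular value)^2
for _ in range(5_000):
    w -= step * X.T @ (X @ w - y)

w_min_norm = np.linalg.pinv(X) @ y       # minimum l2-norm solution of X w = y
gap = float(np.linalg.norm(w - w_min_norm))

# Another exact solution: add a component from the null space of X.
null_dir = rng.standard_normal(d)
null_dir -= np.linalg.pinv(X) @ (X @ null_dir)
w_other = w_min_norm + null_dir
```

Both `w_min_norm` and `w_other` fit the training data exactly, so the training loss alone cannot distinguish them; gradient descent from zero simply ends up at the one with the smaller norm.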
Stochastic Gradient Descent
• However… the results show that the minimum l2-norm solution wasn’t always the best one in the sense of generalization
• furthermore, it is possible to construct a dataset for which the minimum l2-norm solution is not optimal! a constructive counterexample!
• adding l2 regularization to the parameters didn’t help at all (not shown in the table)
[Table annotations: norm = 220, norm = 390]
Conclusion
• "Be careful whenever you say 'generalization' in deep learning"
• Contributions of this paper:
• an experimental framework for questioning the suspicious behavior of generalization techniques
• a demonstration that a meaningful theoretical bound on the generalization error is lacking in deep learning (since even a modest effective capacity suffices to memorize it all)
• optimization does not necessarily mean generalization
• "beware of the light" - Caliban, from the movie "Logan"
References
• Zhang, Chiyuan, et al. "Understanding deep learning requires rethinking generalization." arXiv preprint arXiv:1611.03530 (2016).
• https://www.slideshare.net/JungHoonSeo2/understanding-deep-learning-requires-rethinking-generalization-2017-12
• https://www.slideshare.net/JungHoonSeo2/understanding-deep-learning-requires-rethinking-generalization-2017-2-22