

Understanding Generalization

Ou Changkun

WiSe 2017/2018, University of Munich, Seminar Deep Learning

hi@changkun.us

February 1, 2018


“This paper explains why deep learning can generalize well”


Outline

1 Motivation


● Background Introduction

● Recap of Traditional Generalization Theory

2 Rethinking of Generalization
3 Bounds of Generalization Gap
4 DARC Regularizer


What is Generalization?

1 Motivation Background Introduction


Definition: Generalization Bound


Goals of Generalization Theory

1 Motivation Background Introduction


Generalization theory is all about: Test error - Training error = ????
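The formula the slide builds on was not captured in this transcript. As a minimal sketch using standard definitions (not taken verbatim from the slides), the quantity of interest is the generalization gap, and a generalization bound controls it with high probability:

\text{gap}(f) \;=\; \underbrace{\mathbb{E}_{(x,y)\sim\mathcal{D}}\,\ell\big(f(x),y\big)}_{\text{test error}} \;-\; \underbrace{\frac{1}{m}\sum_{i=1}^{m}\ell\big(f(x_i),y_i\big)}_{\text{training error}},
\qquad
\Pr\big[\,\text{gap}(f) \le \epsilon(m,\delta,\mathcal{F})\,\big] \;\ge\; 1-\delta .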


The Force of Generalization

1 Motivation Background Introduction


Recap I: Model Complexity

1 Motivation Recap of Generalization Theory


[Figure: memorization vs. generalization regimes; image source: Mohri et al. 2012]

Test error ≤ Training error + Complexity Penalty

Rademacher bound:

VC bound:
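The equations on this slide were lost in extraction. As a sketch, the standard textbook forms (in the spirit of Mohri et al. 2012, not necessarily with the exact constants shown on the slide) are, with probability at least 1 − δ over an i.i.d. sample of size m:

\text{Rademacher:}\quad R(f) \;\le\; \hat{R}_S(f) + 2\,\mathfrak{R}_m(\mathcal{F}) + \sqrt{\frac{\log(1/\delta)}{2m}}

\text{VC:}\quad R(f) \;\le\; \hat{R}_S(f) + \sqrt{\frac{2\,d\,\log(em/d)}{m}} + \sqrt{\frac{\log(1/\delta)}{2m}}, \qquad d = \text{VCdim}(\mathcal{F}).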

[Figure: test error, training error, and complexity term plotted against model complexity]


Recap II: Algorithmic Stability & Robustness

1 Motivation Recap of Generalization Theory


Image Source [Google Inc., 2015]

Test error ≤ Training error + Algorithm Penalty
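The bound itself did not survive extraction. A sketch of the classical uniform-stability guarantee this line alludes to (Bousquet and Elisseeff): for a β-uniformly stable algorithm A with loss bounded by M, with probability at least 1 − δ,

R(A_S) \;\le\; \hat{R}_S(A_S) + 2\beta + \big(4m\beta + M\big)\sqrt{\frac{\log(1/\delta)}{2m}} .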


Recap III: Flat Region Conjecture

1 Motivation Recap of Generalization Theory


[Figure: loss landscape visualization; image source: Li et al. 2018]
[Figure: "flat region" vs. "sharp region" minima; image sources: He et al. 2015, Dauphin et al. 2015]


Outline

1 Motivation
2 Rethinking of Generalization

● Understanding generalization failure

● Empirical risk guarantee

3 Bounds of Generalization Gap
4 DARC Regularizer


On the origin of “paradox”

2 Rethinking of Generalization


Empirical Risk Guarantee (Estimating Test Error)

2 Rethinking of Generalization Empirical Risk Guarantee


Theorem: empirical risk estimation (for finite hypothesis class)
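The theorem body did not survive extraction. A minimal sketch of the standard finite-class guarantee it refers to (Hoeffding's inequality plus a union bound, assuming a loss bounded in [0, 1]): with probability at least 1 − δ,

\forall h \in \mathcal{H}: \quad R(h) \;\le\; \hat{R}_S(h) + \sqrt{\frac{\log|\mathcal{H}| + \log(2/\delta)}{2m}} .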


Empirical Risk Guarantee II: An Example

2 Rethinking of Generalization Empirical Risk Guarantee

Theorem: empirical risk estimation (for finite hypothesis class)


Outline

1 Motivation
2 Rethinking of Generalization
3 Bounds of Generalization Gap

● Data-dependent Bound

● Data-independent Bound

4 DARC Regularizer


“Good” Dataset

3 Bounds of Generalization Gap Data-dependent Bound

Definition: Concentrated dataset


Data-dependent Bound (Generalization Quality)

3 Bounds of Generalization Gap Data-dependent Bound


Theorem: Data-dependent Bound


Two-phase Training Technique

3 Bounds of Generalization Gap Data-independent Bound


"Standard" phase: train the network in the standard way on αm samples of the dataset;


"Freeze" phase: freeze the activation pattern and keep training on the remaining (1−α)m samples (a code sketch follows below).

* ratio = two-phase/base
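The slides contain no code for this procedure. The following is a minimal NumPy sketch of the two-phase idea for a one-hidden-layer ReLU regression network; the function names, squared loss, and hyperparameters are illustrative assumptions, not the paper's exact setup.

import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def train_two_phase(X, y, alpha=0.5, hidden=32, lr=1e-2, epochs=200):
    # X: (m, d) inputs, y: (m, 1) targets
    m, d = X.shape
    split = int(alpha * m)
    X1, y1 = X[:split], y[:split]    # "standard" phase data (alpha * m samples)
    X2, y2 = X[split:], y[split:]    # "freeze" phase data ((1 - alpha) * m samples)

    W1 = rng.normal(scale=0.1, size=(d, hidden))
    W2 = rng.normal(scale=0.1, size=(hidden, 1))

    # Phase 1: ordinary gradient descent with ReLU activations.
    for _ in range(epochs):
        H = relu(X1 @ W1)
        err = H @ W2 - y1                                   # squared-loss residuals
        grad_W2 = H.T @ err / split
        grad_W1 = X1.T @ ((err @ W2.T) * (H > 0)) / split
        W1 -= lr * grad_W1
        W2 -= lr * grad_W2

    # Freeze the activation pattern of the phase-2 samples at the current weights:
    # a fixed 0/1 mask per sample and hidden unit, no longer weight-dependent.
    mask = (X2 @ W1 > 0).astype(float)

    # Phase 2: keep training on the remaining samples, but reuse the frozen mask
    # instead of recomputing ReLU, so the model is linear in the weights.
    for _ in range(epochs):
        H = (X2 @ W1) * mask
        err = H @ W2 - y2
        grad_W2 = H.T @ err / (m - split)
        grad_W1 = X2.T @ ((err @ W2.T) * mask) / (m - split)
        W1 -= lr * grad_W1
        W2 -= lr * grad_W2

    return W1, W2, mask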


Data-independent Bound (Determining a Good Model)

3 Bounds of Generalization Gap Data-independent Bound

Theorem: Data-independent bound

● ρ, m_σ


Data-independent Bound: Examples

3 Bounds of Generalization Gap Data-independent Bound

Theorem: Data-independent bound


Outline

1 Motivation
2 Rethinking of Generalization
3 Bounds of Generalization Gap
4 DARC Regularizer

● Directly Approximately Regularizing Complexity

● DARC1 Experiments


Directly Approximately Regularizing Complexity (DARC) Family

4 DARC Regularizer Directly Approximately Regularizing Complexity


DARC1 Regularization
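The regularizer's formula is not in the transcript. Based on the Keras snippet later in these slides (and hedging on normalization constants), the DARC1 objective adds to the base loss λ times the largest per-class sum of absolute network outputs over the batch:

J_{\text{DARC1}}(\theta) \;=\; \frac{1}{m}\sum_{i=1}^{m}\ell\big(f(x_i;\theta),\,y_i\big) \;+\; \lambda\,\max_{k}\,\sum_{i=1}^{m}\big|f_k(x_i;\theta)\big|,

where f_k denotes the k-th output unit of the network.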


Experiment results: DARC1 Regularizer on MNIST & CIFAR-10

4 DARC Regularizer DARC1 Experiments

Test error ratio (DARC1 / Base):

              MNIST (ND)       MNIST            CIFAR-10
              mean    stdv     mean    stdv     mean    stdv
DARC1/Base    0.89    0.61     0.95    0.67     0.97    0.79


DARC1 implementation (Keras)

from keras import backend as K

def darc1_loss(model, lamb=0.001):
    def _loss(y_true, y_pred):
        # base cross-entropy term
        original_loss = K.categorical_crossentropy(y_true, y_pred)
        # DARC1 term: largest per-class sum of absolute last-layer outputs over the batch
        custom_loss = lamb * K.max(K.sum(K.abs(model.layers[-1].output), axis=0))
        return original_loss + custom_loss
    return _loss
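A hypothetical usage sketch; build_model, the optimizer, and the fit call are assumptions, and the closure over model.layers[-1].output relies on the graph-mode (TF1-era) Keras backend:

model = build_model()  # any Keras classification model (hypothetical helper)
model.compile(optimizer="adam",
              loss=darc1_loss(model, lamb=0.001),
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=10, batch_size=128)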


TL;DR: Only Model Architecture & Dataset Matter

5 Conclusions


This paper is all about: Test error ≤ Training error + (Model, Data) Penalty


Outline

1 Motivation
2 Rethinking of Generalization
3 Bounds of Generalization Gap
4 DARC Regularizer
5 Related Readings


Related Readings I: Previous Research

References

[ Shalev-Shwartz et al. 2014 ] Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.

[ Neyshabur et al. 2017 ] Exploring generalization in deep learning. arXiv preprint: 1706.08947, NIPS 2017.

[ Xu et al. 2010 ] Robustness and generalization. arXiv preprint: 1005.2243, JMLR 2012.

[ Bousquet et al. 2002 ] Stability and generalization. JMLR 2002.


Related Readings II: Rethinking Generalization

References

[ Zhang et al. 2017 ] Understanding deep learning requires rethinking generalization. arXiv preprint: 1611.03530, ICLR 2017.

[ Krueger et al. 2017 ] Deep nets don't learn via memorization. ICLR 2017 Workshop.

[ Kawaguchi et al. 2017 ] Generalization in deep learning. arXiv preprint: 1710.05468.


Related Readings III: The Trick of Deep Path

References

[ Baldi et al. 1989 ] Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, Vol. 2, 1989, 52–58.

[ Choromanska et al. 2015 ] The Loss Surfaces of Multilayer Networks. arXiv preprint: 1412.0233, AISTATS 2015.

[ Kawaguchi 2016 ] Deep learning without poor local minima. NIPS 2016 (oral presentation).


Related Readings IV: Recent Research

References

[ Dziugaite et al. 2017 ] Computing Non-vacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data. arXiv preprint: 1703.11008.

[ Bartlett et al. 2017 ] Spectrally-normalized margin bounds for neural networks. arXiv preprint: 1706.08498, NIPS 2017.

[ Neyshabur et al. 2018 ] A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint: 1707.09564, ICLR 2018.

[ Sinha et al. 2018 ] Certifiable Distributional Robustness with Principled Adversarial Training. arXiv preprint: 1710.10571, ICLR 2018 Best Paper.


General Concept Questions

1 Motivation Background Introduction


Properties of (over-parameterized) linear models

2 Rethinking of Generalization

Theorem: Generalization failure on linear models


On the origin of “paradox”

2 Rethinking of Generalization

Proposition


Path Trick in ReLU Net [ Choromanska et al. 2015 ]

3 Bounds of Generalization Gap Data-dependent Bound

[Diagram: ReLU network layers Layer 1, …, Layer (H), Layer (H+1)]


Concentrated Dataset

3 Bounds of Generalization Gap Data-dependent Bound


Definition 3: “good” dataset


Data-dependent Guarantee

3 Bounds of Generalization Gap Data-dependent Bound


Proposition 4: Data-dependent Bound


Why z?

2 Rethinking of Generalization Empirical Risk Guarantee


Bernstein Inequality
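The statement was lost in extraction. As a sketch, one standard form of Bernstein's inequality for independent zero-mean random variables Z_1, …, Z_m with |Z_i| ≤ M and total variance σ² = Σ_i E[Z_i²] is:

\Pr\Big[\sum_{i=1}^{m} Z_i > t\Big] \;\le\; \exp\!\left(-\frac{t^2/2}{\sigma^2 + M t/3}\right).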


SGD, z, w; relations?

3 Bounds of Generalization Gap Data-independent Bound


Did you forget that sigma depends on w?

No, they could be independent.


Two-phase training

3 Bounds of Generalization Gap Data-independent Bound


Standard phase: train the network in the standard way on part of the dataset;
Freeze phase: freeze the activation pattern and keep training on the rest of the dataset.


Directly Approximately Regularizing Complexity (DARC)

Backstage Slides Directly Approximately Regularizing Complexity

Bound by Rademacher Complexity [Koltchinskii and Panchenko 2002]
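The bound itself is not in the transcript. As a sketch, a binary-classification margin bound of this type states: for margin γ > 0, with probability at least 1 − δ,

R(f) \;\le\; \hat{R}_{S,\gamma}(f) \;+\; \frac{2}{\gamma}\,\mathfrak{R}_m(\mathcal{F}) \;+\; \sqrt{\frac{\log(1/\delta)}{2m}},

where \hat{R}_{S,\gamma} is the empirical margin loss.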


Experiment results: DARC1 Regularizer with ResNet-50 on ILSVRC2012

Backstage Slides Experiments


Evaluation settings: 10 image classes; entire dataset


Personal Opinions regarding SGD

5 Conclusions

● nonlinearity

● bypass
