Introduction to the analysis of learning algorithms ― Does Bayesianism save you?
Ryota Tomioka, Toyota Technological Institute at Chicago, [email protected]
Exercise materials: https://github.com/ryotat/dtuphd14

Page 1:

Introduction to the analysis of learning algorithms ― Does Bayesianism save you?

Ryota Tomioka Toyota Technological Institute at Chicago [email protected]

Exercise materials: https://github.com/ryotat/dtuphd14

Page 2:

Why learn theory? •  Think about your market value – Many good algorithms – Many good implementations (e.g., scikit-learn, mahout, Stan, infer.net, Church,…)

Page 3:

Two sides of machine learning

Modeling (How to do things)

Evaluating (How they perform)

Page 4:

About this lecture •  I will try to make it as interactive as possible

•  If you don’t get something, probably I am doing something wrong.

•  So, please ask questions –  It will not only help you but also help others.

–  It will help you stay awake!

Page 5:

Key questions •  Learning – the goal is to generalize to a new test example from a limited number of training instances

•  What is over-fitting? •  How do we avoid over-fitting? •  Do Bayesian methods avoid over-fitting?

The key is to understand an estimator as a random variable.

Page 6:

What we will cover •  First part – Ridge regression – Bias-variance decomposition – Model selection • Mallows’ CL •  Leave-one-out cross validation

•  Second part – Bayesian regression – PAC Bayes theory

Page 7:

Ridge Regression

Key idea: Estimator is a random variable

Page 8:

Problem Setting •  Training examples: (xi, yi) (i=1,…, n), xi∈Rd

•  Goal –  Learn a linear function f(x) = wTx (w∈Rd) that predicts the output yn+1 for a test point (xn+1, yn+1) ~ P(X,Y)

•  Note that the test point is not included in the training examples (We want generalization!)

[Figure: training examples (x1, y1), …, (xn, yn) drawn i.i.d. from P(X, Y); the task is to predict yn+1 at a new test point xn+1]

Page 9:

Ridge Regression •  Solve the minimization problem

\mathrm{minimize}_{w \in \mathbb{R}^d} \; \sum_{i=1}^{n} \left( y_i - x_i^\top w \right)^2 + \lambda \|w\|^2

The first term is the training error; the second is the regularization (ridge) term (λ: regularization constant).

Note: Can be interpreted as a Maximum A Posteriori (MAP) estimation – Gaussian likelihood with Gaussian prior.

Page 10:

Ridge Regression •  More compactly

\mathrm{minimize}_{w} \; \|y - Xw\|^2 + \lambda \|w\|^2

The first term is the training error; the second is the regularization (ridge) term (λ: regularization constant).

Note: Can be interpreted as a Maximum A Posteriori (MAP) estimation – Gaussian likelihood with Gaussian prior.

Here

y = (y_1, y_2, \ldots, y_n)^\top   (target output),   X = (x_1, x_2, \ldots, x_n)^\top   (design matrix, whose i-th row is x_i^\top).

Page 11:

Designing the design matrix •  Columns of X can be different sources of info

–  e.g., predicting the price of an apartment

•  Columns of X can also be nonlinear –  e.g., polynomial regression

For the apartment example, the columns of X could be: #rooms, neighborhood, size, bathroom, sunlight, train station.

For polynomial regression of degree p,

X = \begin{pmatrix} x_1^p & \cdots & x_1^2 & x_1 & 1 \\ x_2^p & \cdots & x_2^2 & x_2 & 1 \\ \vdots & & & & \vdots \\ x_n^p & \cdots & x_n^2 & x_n & 1 \end{pmatrix}

Page 12:

Solving ridge regression •  Take the gradient, and solve

-X^\top (y - Xw) + \lambda w = 0,

which gives

\hat{w} = \left( X^\top X + \lambda I_d \right)^{-1} X^\top y   (I_d: d×d identity matrix).

The solution can also be written as (exercise)

\hat{w} = X^\top \left( X X^\top + \lambda I_n \right)^{-1} y.
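Not part of the original slides: a minimal MATLAB/Octave sketch (variable names and the synthetic data are my own) that computes the ridge estimator in both forms above and checks that they coincide.

% Sketch: ridge regression in primal and dual form on synthetic data.
n = 50; d = 5; lambda = 0.1;
X = randn(n, d);                              % design matrix
w_star = randn(d, 1);                         % true coefficients
y = X*w_star + 0.1*randn(n, 1);               % noisy outputs

w_primal = (X'*X + lambda*eye(d)) \ (X'*y);   % (X'X + lambda I_d)^{-1} X'y
w_dual   = X' * ((X*X' + lambda*eye(n)) \ y); % X'(XX' + lambda I_n)^{-1} y

disp(norm(w_primal - w_dual));                % should be tiny: the two forms agree

The dual form only requires solving an n×n system, which is cheaper when n < d.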

Page 13:

Example: polynomial fitting

•  Degree d-1 polynomial model

y = w_1 x^{d-1} + \cdots + w_{d-1} x + w_d + \text{noise}
  = \left( x^{d-1}, \ldots, x, 1 \right) \left( w_1, \ldots, w_{d-1}, w_d \right)^\top + \text{noise}

Design matrix:

X = \begin{pmatrix} x_1^{d-1} & \cdots & x_1^2 & x_1 & 1 \\ x_2^{d-1} & \cdots & x_2^2 & x_2 & 1 \\ \vdots & & & & \vdots \\ x_n^{d-1} & \cdots & x_n^2 & x_n & 1 \end{pmatrix}
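As a concrete illustration (this is my own sketch, not the exp_ridgeregression_poly.m script from the exercise materials; the input range and noise level are made up), a short MATLAB snippet that builds this design matrix and fits the degree-5 polynomial used in the following slides:

% Sketch: fit the cubic y = x^3 - x with a degree-5 polynomial ridge model.
n = 20; d = 6; lambda = 0.01;
x = linspace(-2, 2, n)';                 % 1-D inputs (range chosen arbitrarily)
y = x.^3 - x + 0.3*randn(n, 1);          % noisy observations of the true function

% Design matrix with columns x^5, x^4, ..., x, 1.
X = zeros(n, d);
for j = 1:d
    X(:, j) = x.^(d - j);
end

w = (X'*X + lambda*eye(d)) \ (X'*y);     % ridge estimate of the 6 coefficients
disp(w');                                % compare with the true (0, 0, 1, 0, -1, 0)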

Page 14:

Example: 5th-order polynomial fitting

True:   w^* = (0, 0, 1, 0, -1, 0)^\top

Learned (h = 0.001):   \hat{w} = (-0.36, 0.30, 2.32, -1.34, -1.93, 0.61)^\top

[Figure: true function, learned function, and training examples]

Page 15:

Example: 5th-order polynomial fitting

True:   w^* = (0, 0, 1, 0, -1, 0)^\top

Learned (h = 0.01):   \hat{w} = (-0.27, 0.25, 1.99, -1.16, -1.73, 0.56)^\top

[Figure: true function, learned function, and training examples]

Page 16:

Example: 5th-order polynomial fitting

True:   w^* = (0, 0, 1, 0, -1, 0)^\top

Learned (h = 0.1):   \hat{w} = (0.08, 0.05, 0.74, -0.52, -0.98, 0.36)^\top

[Figure: true function, learned function, and training examples]

Page 17:

Example: 5th-order polynomial fitting

True:   w^* = (0, 0, 1, 0, -1, 0)^\top

Learned (h = 1):   \hat{w} = (0.27, -0.06, -0.01, -0.12, -0.41, 0.19)^\top

[Figure: true function, learned function, and training examples]

Page 18:

Example: 5th-order polynomial fitting

True:   w^* = (0, 0, 1, 0, -1, 0)^\top

Learned (h = 10):   \hat{w} = (0.22, -0.07, 0.01, -0.05, -0.10, 0.04)^\top

[Figure: true function, learned function, and training examples]

Page 19:

Binary classification •  Target y is +1 or -1.

•  Just apply ridge regression with +1/-1 targets –  forget about the Gaussian noise assumption!

Orange (+1) or lemon (−1):

y = (1, -1, 1, \ldots, 1)^\top   (outputs to be predicted)
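A minimal MATLAB sketch of this recipe (my own; Xtr, ytr, Xte, yte are placeholders for a loaded dataset with ±1 labels):

% Sketch: binary classification by ridge regression on +1/-1 targets.
lambda = 1e-6;
d = size(Xtr, 2);
w = (Xtr'*Xtr + lambda*eye(d)) \ (Xtr'*ytr);

yhat = sign(Xte*w);                 % predict by the sign of the real-valued output
fprintf('test accuracy: %.1f%%\n', 100*mean(yhat == yte));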

Page 20:

Multi-class classification

[Figure: sample 16×16 images of the ten digit classes 0–9 from the USPS digits dataset]

USPS digits dataset: http://www-stat-class.stanford.edu/~tibs/ElemStatLearn/datasets/zip.info

The target Y is an n×10 indicator matrix: each row is all zeros except for a 1 in the column of the correct class.

Number of samples: 7291 training samples, 2007 test samples.
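The same idea in a short MATLAB sketch (my own, not the exercise script; Xtr, labels_tr, Xte, labels_te are placeholders for the loaded USPS data with labels 0–9):

% Sketch: multi-class classification via ridge regression with one-hot targets.
lambda = 1e-6;
[n, d] = size(Xtr);
K = 10;

Ytr = zeros(n, K);
Ytr(sub2ind([n, K], (1:n)', labels_tr + 1)) = 1;   % one-hot rows

W = (Xtr'*Xtr + lambda*eye(d)) \ (Xtr'*Ytr);       % d x K coefficient matrix

[~, pred] = max(Xte*W, [], 2);                     % predicted class = column with largest output
fprintf('test accuracy: %.1f%%\n', 100*mean(pred - 1 == labels_te));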

Page 21:

USPS dataset

[Figure: test accuracy (%) vs. number of training samples (500 to 8000), λ = 10^{-6}]

We can obtain 88% accuracy on a held-out test set using about 7300 training examples.

A machine can learn! (using a very simple learning algorithm)

Page 22:

Summary (so far) •  Ridge regression (RR) is very simple. •  RR can be coded in one line:

W=(X’*X+lambda*eye(d))\(X’*Y);!

•  RR can prevent over-fitting by regularization. •  Classification problems can also be solved by properly defining the output Y.

•  Nonlinearities can be handled by using basis functions (polynomial, Gaussian RBF, etc.).

Page 23:

Singularity - The dark side of RR

Page 24:

USPS dataset (d=256) (What I have been hiding) •  The more data the less accurate??

[Figure: test accuracy (%) vs. number of training samples n (16 to 8192, log scale), λ = 10^{-6}]

256 is the number of pixels (16×16) in the image.

Page 25:

Breast Cancer Wisconsin (diagnostic) dataset (d=30)

[Figure: test accuracy (%) vs. number of training samples n (0 to 250), λ = 10^{-6}; Breast Cancer Wisconsin]

30 real-valued features: radius, texture, perimeter, area, etc.

Page 26:

SPECT Heart dataset (d=22)

[Figure: test accuracy (%) vs. number of training samples n (0 to 250), λ = 10^{-6}; SPECT Heart, p = 22]

22 binary features.

Page 27:

Spambase dataset (d=57)

[Figure: test accuracy (%) vs. number of training samples n (log scale), λ = 10^{-6}; Spambase, p = 57]

55 real-valued features (word frequency, character frequency) and 2 integer-valued features (run-length).

Page 28:

Musk dataset (d=166)

[Figure: test accuracy (%) vs. number of training samples n (0 to 500), λ = 10^{-6}; musk, p = 166]

166 real-valued features.

Page 29:

Singularity Why does it happen? How can we avoid it?

Page 30:

Let's analyze the simplest case: regression.

•  Model – the design matrix X is fixed (X is not random) – output y = Xw^* + \xi, where \xi is the noise.

•  Estimator:   \hat{w} = \left( X^\top X + \lambda I_d \right)^{-1} X^\top y

•  Estimation error:   \mathrm{Err}(\hat{w}) = E_\xi \|\hat{w} - w^*\|^2   (expectation over the noise)

The estimator is a random variable!
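To see this randomness concretely, here is a small MATLAB sketch of my own (the exercise script exp_ridgeregression_poly.m demonstrates the same point) that redraws the noise many times for a fixed design and plots the resulting cloud of estimators:

% Sketch: the ridge estimator as a random variable (fixed design, repeated noise draws).
n = 30; d = 2; lambda = 1; sigma = 0.5; trials = 1000;
X = randn(n, d);                       % fixed design matrix
w_star = [1; -1];                      % true coefficients
W = zeros(d, trials);
for t = 1:trials
    y = X*w_star + sigma*randn(n, 1);  % fresh noise draw
    W(:, t) = (X'*X + lambda*eye(d)) \ (X'*y);
end
plot(W(1,:), W(2,:), '.'); hold on;
plot(w_star(1), w_star(2), 'r+', 'MarkerSize', 12);   % the target w*
xlabel('w_1'); ylabel('w_2');          % the estimates scatter around their mean, not around w*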

Page 31:

Demo •  try exp_ridgeregression_poly.m

Page 32:

Estimator as a random variable

•  Deriving the generalization error reduces to understanding how the estimator behaves as a random variable.

•  Two strategies – Worst case – Average case ‒ this is what we’ll do today

Page 33:

Average case analysis •  Be careful!

•  Which is smaller?

E_\xi \|\hat{w} - w^*\|^2 \;\ne\; \|E_\xi \hat{w} - w^*\|^2

The left-hand side is the average-case error (what we will analyze); the right-hand side is the error of the averaged estimator.

Page 34:

Bias-variance decomposition

E_\xi \|\hat{w} - w^*\|^2 = E_\xi \|\hat{w} - \bar{w}\|^2 + \|\bar{w} - w^*\|^2,   where \bar{w} = E_\xi \hat{w}.

The first term is the variance, the second is the squared bias.

Bias: error coming from the model/design matrix – under-fitting.
Variance: error caused by the noise – over-fitting.

[Figure: w^*, \bar{w}, and \hat{w}, with the bias and the variance indicated]

Page 35:

Demo •  Try exp_ridgeregression_poly.m again – How can we reduce variance? – How can we reduce bias2?

Page 36:

For ridge regression, since \hat{w} = \left( X^\top X + \lambda I_d \right)^{-1} X^\top y, if y = Xw^* + \xi with E\xi = 0 and \mathrm{Cov}(\xi) = \sigma^2 I_n, then

E_\xi[\hat{w}] = \left( \hat{\Sigma} + \lambda_n I_d \right)^{-1} \hat{\Sigma} w^*

\mathrm{Cov}(\hat{w}) = \frac{\sigma^2}{n} \left( \hat{\Sigma} + \lambda_n I_d \right)^{-1} \hat{\Sigma} \left( \hat{\Sigma} + \lambda_n I_d \right)^{-1}

where \lambda_n := \lambda / n and \hat{\Sigma} := \frac{1}{n} X^\top X.

Let's see if this is correct…

Page 37:

Exercise •  Analytical exercise:

–  Derive the expressions for both E_\xi[\hat{w}] and \mathrm{Cov}(\hat{w}).
–  Use them to derive bias² and variance.

•  Empirical exercise: Plot the ellipse corresponding to the theoretically derived mean and covariance of the ridge regression estimator.
–  Key function: plotEllipse(mu, Sigma, color, width, marker_size), where mu is the mean (2×1 column vector) and Sigma is the covariance (2×2 matrix).

Page 38:

Bias² and variance from the mean and covariance

•  Bias²:   \|\bar{w} - w^*\|^2 = \|E_\xi \hat{w} - w^*\|^2 = \lambda_n^2 \left\| \left( \hat{\Sigma} + \lambda_n I_d \right)^{-1} w^* \right\|^2

•  Variance:   E_\xi \|\hat{w} - \bar{w}\|^2 = \mathrm{Tr}\left( \mathrm{Cov}(\hat{w}) \right)

(\lambda_n := \lambda / n)
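A MATLAB sketch of my own (synthetic data, arbitrary constants) that evaluates these formulas and checks them against a Monte Carlo estimate of the error:

% Sketch: theoretical bias^2 + variance vs. simulated estimation error for ridge regression.
n = 50; d = 10; lambda = 1; sigma = 0.5; trials = 2000;
X = randn(n, d); w_star = randn(d, 1);
lambda_n = lambda / n;
Sigma_hat = (X'*X) / n;
A = inv(Sigma_hat + lambda_n*eye(d));           % (Sigma_hat + lambda_n I)^{-1}

bias2    = lambda_n^2 * norm(A*w_star)^2;       % squared bias
variance = (sigma^2/n) * trace(A*Sigma_hat*A);  % Tr(Cov(w_hat))

err = 0;                                        % Monte Carlo estimate of E_xi ||w_hat - w*||^2
for t = 1:trials
    y = X*w_star + sigma*randn(n, 1);
    w = (X'*X + lambda*eye(d)) \ (X'*y);
    err = err + norm(w - w_star)^2 / trials;
end
fprintf('bias^2 + variance = %.4f, simulated error = %.4f\n', bias2 + variance, err);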

Page 39:

Explaining the singularity
•  Bias² is an increasing function of λ and is bounded by \|w^*\|^2 (so it cannot cause the phase transition).
•  Variance can be very large when the smallest eigenvalue of \hat{\Sigma} is close to zero (⇔ the smallest singular value of X is close to zero).
•  Try sampling a random d × n matrix and see when the smallest singular value is close to zero.
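One way to try this (my own sketch, not from the exercise materials):

% Sketch: smallest singular value of an n x d Gaussian matrix as n varies.
d = 100;
for n = [50 75 90 100 110 125 150 200 400]
    s = svd(randn(n, d));
    fprintf('n = %4d: smallest singular value = %.3f\n', n, min(s));
end
% The smallest singular value approaches zero when n is close to d, which is
% where the variance of ridge regression with a tiny lambda blows up.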

Page 40:

Simulation (λ = 10^{-6})

[Figure: estimation error \|\hat{w} - w^*\|^2 vs. number of samples n (log-log axes) for d = 100 variables and λ = 10^{-6}; curves: simulation, bias², variance]

Page 41:

Simulation (λ = 0.001)

[Figure: estimation error \|\hat{w} - w^*\|^2 vs. number of samples n (log-log axes) for d = 100 variables and λ = 0.001; curves: simulation, bias², variance]

Page 42:

Simulation (λ = 1)

[Figure: estimation error \|\hat{w} - w^*\|^2 vs. number of samples n (log-log axes) for d = 100 variables and λ = 1; curves: simulation, bias², variance]

Page 43:

Estimation error and generalization error •  So far, we’ve analyzed the estimation error

•  One might be more interested in analyzing the generalization error

•  Try exp_frequentists_errorbar.m

Estimation error:   E_\xi \|\hat{w} - w^*\|^2

Generalization error at a test point x:

\mathrm{Gen}(x) = E_\xi \left( x^\top w^* - x^\top \hat{w} \right)^2 = E_\xi \left( x^\top (w^* - \hat{w}) \right)^2

Page 44:

Exercise •  Analytical: derive the expression for the generalization error Gen(x) at an arbitrary point x. – Hint: use the decomposition

•  Empirical: try exp_frequentists_errorbar.m and see – when is the error under-estimated? – how does it compare to Bayesian posterior?

w^* - \hat{w} = (w^* - \bar{w}) + (\bar{w} - \hat{w})

Page 45:

Generalization error at x

\mathrm{Gen}(x) = E_\xi \left( x^\top (w^* - \hat{w}) \right)^2
  = \lambda_n^2 \left\{ x^\top \hat{\Sigma}_{\lambda_n}^{-1} w^* \right\}^2 + \frac{\sigma^2}{n} \, x^\top \hat{\Sigma}_{\lambda_n}^{-1} \hat{\Sigma} \hat{\Sigma}_{\lambda_n}^{-1} x
  \qquad \left( \hat{\Sigma}_{\lambda_n} := \hat{\Sigma} + \lambda_n I_d \right)

•  Caution – w^* is not known!
–  Worst case:   \left\{ x^\top \hat{\Sigma}_{\lambda_n}^{-1} w^* \right\}^2 \le \| \hat{\Sigma}_{\lambda_n}^{-1} x \|^2 \cdot \|w^*\|^2
–  Average case:   E_{w^*} \left\{ x^\top \hat{\Sigma}_{\lambda_n}^{-1} w^* \right\}^2 = \alpha^{-1} \| \hat{\Sigma}_{\lambda_n}^{-1} x \|^2,   assuming E_{w^*}[w^* w^{*\top}] = \alpha^{-1} I_d.
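Not from the slides: a short MATLAB sketch evaluating the two contributions to Gen(x) at a single test point (synthetic data, arbitrary constants):

% Sketch: bias and variance contributions to the generalization error Gen(x).
n = 50; d = 10; lambda = 1; sigma = 0.5;
X = randn(n, d); w_star = randn(d, 1); x = randn(d, 1);
lambda_n = lambda / n;
Sigma_hat = (X'*X) / n;
A = inv(Sigma_hat + lambda_n*eye(d));              % Sigma_hat_{lambda_n}^{-1}

gen_bias = lambda_n^2 * (x'*A*w_star)^2;           % squared-bias part (requires knowing w*)
gen_var  = (sigma^2/n) * (x'*A*Sigma_hat*A*x);     % variance part
fprintf('Gen(x) = %.4f  (bias part %.4f + variance part %.4f)\n', ...
        gen_bias + gen_var, gen_bias, gen_var);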

Page 46:

Frequentists’ error-bar

[Figure: input vs. output for n = 20 samples and d = 6; curves show the true error, predicted error, true function, learned function, and the samples]

Page 47:

How do we choose λ? •  Bias2 cannot be computed in practice (because we don’t know w*)

•  Practical approaches – Mallows' CL – Leave-one-out cross validation

Page 48:

Mallows’ CL [Mallows 1973]

•  Tells us how the training error is related to bias² and variance:

\frac{1}{n} E_\xi \|y - X\hat{w}\|^2 + \frac{2\sigma^2}{n} \mathrm{Tr}\left( \hat{\Sigma} \left( \hat{\Sigma} + \lambda_n I_d \right)^{-1} \right)
  = \sigma^2 + E_\xi (\hat{w} - \bar{w})^\top \hat{\Sigma} (\hat{w} - \bar{w}) + (\bar{w} - w^*)^\top \hat{\Sigma} (\bar{w} - w^*)

(\lambda_n := \lambda / n; the second term on the right is the variance, the third is the bias²)

\mathrm{Tr}\left( \hat{\Sigma} \left( \hat{\Sigma} + \lambda_n I_d \right)^{-1} \right) is known as the effective degrees of freedom.
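Not from the slides: a minimal MATLAB sketch using this as a model-selection criterion, C_L(λ) = training error + (2σ²/n)·df(λ). Xtr and ytr are placeholders for a loaded dataset, and the noise variance sigma2 is assumed known, as Mallows' CL requires:

% Sketch: choosing lambda with Mallows' C_L.
[n, d] = size(Xtr);
lambdas = 10.^(-6:0.5:3);
CL = zeros(size(lambdas));
for k = 1:numel(lambdas)
    lam = lambdas(k);
    w  = (Xtr'*Xtr + lam*eye(d)) \ (Xtr'*ytr);
    df = trace((Xtr'*Xtr) / (Xtr'*Xtr + lam*eye(d)));   % effective degrees of freedom
    CL(k) = mean((ytr - Xtr*w).^2) + 2*sigma2*df/n;     % training error + complexity penalty
end
[~, ibest] = min(CL);
fprintf('C_L picks lambda = %g\n', lambdas(ibest));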

Page 49:

Schematically

Expected training error:   \frac{1}{n} E_\xi \|y - X\hat{w}\|^2

Expected (fixed-design) generalization error:   \frac{1}{n} E_{\xi'} E_\xi \|y' - X\hat{w}\|^2,   where y' = Xw^* + \xi'.

The gap between the two is \frac{2\sigma^2}{n} \mathrm{Tr}\left( \hat{\Sigma} \left( \hat{\Sigma} + \lambda_n I_d \right)^{-1} \right) – this is how much we have overfitted!

Page 50:

Leave-one-out cross validation

•  Idea: compute an estimator \hat{w}^{\backslash i} leaving sample (x_i, y_i) out. Then test it on (x_i, y_i).

•  It turns out that

\sum_{i=1}^{n} \left( y_i - x_i^\top \hat{w}^{\backslash i} \right)^2 = \sum_{i=1}^{n} \left( \frac{y_i - x_i^\top \hat{w}}{1 - S(i, i)} \right)^2,

where S = X \left( X^\top X + \lambda I_d \right)^{-1} X^\top.

–  This can be obtained by solving just one ridge regression problem.
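A MATLAB sketch of my own that computes the leave-one-out error via this shortcut and checks it against the naive n refits (Xtr, ytr, and lambda are placeholders):

% Sketch: leave-one-out CV for ridge regression via the shortcut formula.
[n, d] = size(Xtr);
w = (Xtr'*Xtr + lambda*eye(d)) \ (Xtr'*ytr);
S = Xtr * ((Xtr'*Xtr + lambda*eye(d)) \ Xtr');      % "hat" matrix
loo_shortcut = sum(((ytr - Xtr*w) ./ (1 - diag(S))).^2);

loo_naive = 0;                                      % naive version: refit n times
for i = 1:n
    idx = [1:i-1, i+1:n];
    wi = (Xtr(idx,:)'*Xtr(idx,:) + lambda*eye(d)) \ (Xtr(idx,:)'*ytr(idx));
    loo_naive = loo_naive + (ytr(i) - Xtr(i,:)*wi)^2;
end
fprintf('shortcut = %.6f, naive = %.6f\n', loo_shortcut, loo_naive);   % these should match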

Page 51:

Discussion •  Mallows' CL is a good approximation of the test error when \hat{\Sigma} \simeq \Sigma – but it requires the knowledge of σ².

•  Leave-one-out cross validation is an almost unbiased estimator of the generalization error – does not require the knowledge of σ².

– can be unstable (e.g., S(i, i) close to one) – cannot be used for other numbers of folds (e.g., 10 folds).

Page 52:

Exercise •  Analytical exercise: Derive Mallows' CL, or LOO-CV, or both.

•  Empirical exercise: – Try and compare the two strategies on some dataset.

– compare them to the cheating strategy, i.e., choose λ that minimizes the test error

– also try them on a classification problem.

Page 53:

Further exercise •  Take any model or classifier (logistic regression, L1-regularization, kernel ridge regression, etc) – simulate a problem – visualize the scattering of the estimated coefficient vector

– does it look Gaussian? – can you see a trade-off between bias and variance?

Page 54:

Summary •  Estimator is a random variable –  it fluctuates depending on the training examples

– characterizing the fluctuation is key to understanding its ability

•  Training error is an under-estimate of the generalization error – systematically biased – understanding the bias is key to deriving a model selection criterion

Page 55:

What we did not discuss •  Other loss functions/regularization – analysis becomes significantly more challenging because the estimator is not analytically obtained

– Solution 1: asymptotic second-order expansion. Cf. AIC

–  Solution 2: upper bounding using \mathrm{Objective}(\hat{w}) \le \mathrm{Objective}(w^*)

•  Truth (w*) not contained in the model – VC dimension, Rademacher complexity, etc. The bound becomes significantly looser.

Page 56:

Bayesian regression

Can we justify why we should predict with uncertainty?

Page 57:

Bayesian linear regression •  Generative process

w \sim N(0, \alpha^{-1} I_d)   (coefficient vector)
\xi \sim N(0, \sigma^2 I_n)   (noise vector)
y = Xw + \xi   (observation)

•  Estimator: the posterior distribution

w \,|\, y \sim N(\mu, C),   where
\mu := \left( X^\top X + \sigma^2 \alpha I_d \right)^{-1} X^\top y,
C := \sigma^2 \left( X^\top X + \sigma^2 \alpha I_d \right)^{-1}.
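In MATLAB this is a couple of lines (a sketch; X, y, the noise variance sigma2, and the prior precision alpha are assumed given):

% Sketch: posterior mean and covariance for Bayesian linear regression.
d  = size(X, 2);
A  = X'*X + sigma2*alpha*eye(d);
mu = A \ (X'*y);           % posterior mean (= ridge estimate with lambda = sigma2*alpha)
C  = sigma2 * inv(A);      % posterior covariance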

Page 58:

Let’s visualize it •  Try exp_bayesian_regression.m •  Does Bayesian regression get away with over-fitting?

Page 59:

Discussion •  From a frequentists’ point of view, Bayesian posterior is a distribution-valued estimator.

•  In fact,

p(w \,|\, y) = \arg\min_{q(w)} \; E_{w \sim q(w)} \left[ -\log p(y \,|\, w) \right] + D(q \,\|\, p),   subject to \int q(w) \, dw = 1,

where p(w) is the prior distribution. The first term is the (negative) average log-likelihood; the second, the Kullback-Leibler divergence to the prior, acts as the regularization.

Page 60:

Predictive distributions •  Bayesian predictive distribution

y_{n+1} \,|\, x_{n+1}, y \sim N\left( x_{n+1}^\top \mu, \; \sigma^2 + x_{n+1}^\top C \, x_{n+1} \right)

•  Plug-in predictive distribution (via RR):

y_{n+1} \,|\, x_{n+1}, y \sim N\left( x_{n+1}^\top \hat{w}, \; \sigma^2 \right)

Note: \hat{w} = \mu if \lambda = \alpha \sigma^2.

⇒ They only differ in the predictive variance!
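Continuing the posterior sketch above (mu, C, sigma2 as computed there; xtest is a d×1 test input, a placeholder of my own):

% Sketch: Bayesian vs. plug-in predictive distribution at a test point.
m_pred   = xtest' * mu;                  % predictive mean (same for both when lambda = alpha*sigma2)
v_bayes  = sigma2 + xtest' * C * xtest;  % Bayesian predictive variance
v_plugin = sigma2;                       % plug-in predictive variance (ignores uncertainty in w)
fprintf('predictive std: Bayesian %.3f vs. plug-in %.3f\n', sqrt(v_bayes), sqrt(v_plugin));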

Page 61:

Evaluating the qualities of predictive distributions •  Kullback-Leibler divergence between the true and the predictive distributions

D\left( p_{w^*}(y_{n+1} \,|\, x_{n+1}) \,\|\, \hat{p}(y_{n+1} \,|\, x_{n+1}) \right)
  = \frac{ \left( x_{n+1}^\top (w^* - \hat{w}) \right)^2 }{ 2 \sigma_{\mathrm{pred}}^2 }
    + \frac{1}{2} \left( \frac{\sigma^2}{\sigma_{\mathrm{pred}}^2} + \log \frac{\sigma_{\mathrm{pred}}^2}{\sigma^2} - 1 \right)

The first term is a discounted generalization error; the second is a penalty for uncertainty.

Here

p_{w^*}(y_{n+1} \,|\, x_{n+1}):   y_{n+1} \,|\, x_{n+1} \sim N(x_{n+1}^\top w^*, \, \sigma^2)   (true distribution)
\hat{p}(y_{n+1} \,|\, x_{n+1}):   y_{n+1} \,|\, x_{n+1} \sim N(x_{n+1}^\top \hat{w}, \, \sigma_{\mathrm{pred}}^2)   (predictive distribution)
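Not from the slides: a quick MATLAB check of this closed form against direct numerical integration, for one arbitrary pair of Gaussians:

% Sketch: verify the KL decomposition for two 1-D Gaussians numerically.
m_true = 0.3; s2_true = 1.0;     % true predictive: N(x'w*, sigma^2)
m_pred = 0.0; s2_pred = 1.5;     % model predictive: N(x'w_hat, sigma_pred^2)

kl_formula = (m_true - m_pred)^2/(2*s2_pred) ...
           + 0.5*(s2_true/s2_pred + log(s2_pred/s2_true) - 1);

p = @(t) exp(-(t - m_true).^2/(2*s2_true)) / sqrt(2*pi*s2_true);
q = @(t) exp(-(t - m_pred).^2/(2*s2_pred)) / sqrt(2*pi*s2_pred);
kl_numeric = integral(@(t) p(t).*log(p(t)./q(t)), -20, 20);

fprintf('closed form %.6f vs. numerical %.6f\n', kl_formula, kl_numeric);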

Page 62:

Exercise
1.  Derive the expression for the KL divergence.
2.  Show that the penalty term is nonnegative and increasing in \sigma_{\mathrm{pred}}^2 for \sigma_{\mathrm{pred}}^2 \ge \sigma^2.
3.  Derive the optimal \sigma_{\mathrm{pred}}^2 that minimizes the KL divergence.

Page 63:

Optimal predictive variance

{\sigma_{\mathrm{pred}}^*}^2 = \sigma^2 + \left\{ x_{n+1}^\top (w^* - \hat{w}) \right\}^2

(noise variance + frequentists' generalization error)

Page 64:

Is Bayesian predictive variance optimal? •  In some sense, yes:

– this assumes that we know the correct noise variance σ2 and the prior variance α-1

– average over the draw of the true coefficient vector w*

E_{w^* \sim N(0, \alpha^{-1} I_d)} \, E_\xi \left( x_{n+1}^\top (w^* - \hat{w}) \right)^2 = x_{n+1}^\top C \, x_{n+1}

Page 65:

Bayes risk [see Haussler & Opper 1997] •  More generally, Bayesian predictive distribution is the minimizer of the Bayes risk

R[q_y] = E_{w \sim p(w)} \, E_{y \sim \prod_{i=1}^{n} p(y_i | w)} \left[ D\left( p(y_{n+1} \,|\, w) \,\|\, q_y(y_{n+1}) \right) \right]

where q_y is any distribution over y_{n+1} that depends on the previous samples y_1, \ldots, y_n.

This assumes that the truth w comes from the prior, and the samples are drawn from the likelihood p(y|w)!

Page 66:

Discussion •  Bayesian predictive distribution minimizes the Bayes risk given the correct prior and correct likelihood. – Clearly not satisfying.

•  Can we make it independent of the choice of prior/likelihood? – PAC Bayes theory

Page 67:

Preliminaries

•  Loss function L(s, w)
–  assumed to be bounded by L_max
–  e.g., classification error:   L(s, w) = 0 if y \, x^\top w \ge 0, and 1 otherwise   (s = (y, x), L_max = 1)

•  Training Gibbs error:   \hat{L}(Q) = \frac{1}{n} \sum_{i=1}^{n} E_{w \sim Q(w)}[L(s_i, w)]

•  Gibbs error (for some "posterior" Q over w):   L(Q) = E_s \, E_{w \sim Q(w)}[L(s, w)]
–  this is the quantity that we care about

Page 68:

PAC-Bayes training-variance bound •  Let λ>1/2, “prior” P(w) is fixed before seeing the data, “posterior” Q(w) can be any distribution that depends on the data. Then we have

[McAllester  1999,  2013;  Catoni  2007]

E_S L(Q) \;\le\; \frac{1}{1 - \frac{1}{2\lambda}} \left( E_S \hat{L}(Q) + \frac{\lambda L_{\max}}{n} E_S D(Q \,\|\, P) \right)

(E_S: expectation with respect to the training examples – the average case)

Note: the worst case version is more commonly presented as PAC Bayes
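Not from the slides: a sketch that evaluates the right-hand side of this bound for a Gaussian posterior Q = N(muQ, CQ) and prior P = N(0, alpha^{-1} I) with the 0/1 classification loss. Here muQ, CQ, alpha, Xtr, and ytr are placeholders (e.g., from Bayesian regression on ±1 labels), and the single observed training set stands in for the expectation E_S:

% Sketch: right-hand side of the PAC-Bayes bound for Gaussian Q and P.
[n, d] = size(Xtr);
Lmax = 1; lam = 10;                  % lambda > 1/2, fixed at a constant as in the slides

% KL divergence between the Gaussians, D(Q || P).
KL = 0.5*( alpha*trace(CQ) + alpha*(muQ'*muQ) - d - log(det(alpha*CQ)) );

% Monte Carlo estimate of the training Gibbs error (average 0/1 loss of w ~ Q).
m = 2000; R = chol(CQ, 'lower');
losses = zeros(m, 1);
for t = 1:m
    w = muQ + R*randn(d, 1);
    losses(t) = mean(ytr .* (Xtr*w) < 0);
end
trainGibbs = mean(losses);

rhs = (trainGibbs + lam*Lmax/n * KL) / (1 - 1/(2*lam));
fprintf('training Gibbs error %.3f, KL %.1f, bound on E_S L(Q): %.3f\n', trainGibbs, KL, rhs);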

Page 69:

Discussion •  What is Gibbs error?

–  Error of a prediction made randomly according to the posterior

–  Bayes generalization error ≤ Gibbs generalization error

•  What is the role of λ? – more or less an artifact in the analysis –  can be fixed at a large but fixed constant (say λ=10)

•  What is the best prior P(w)? –  P(w) = E_S[Q(w)] minimizes E_S D(Q(w) \| P(w)) –  E_S D(Q(w) \| E_S[Q(w)]) is a measure of the variance of the posterior Q(w)

Page 70:

Summary •  Bayesian methods are not exempt from overfitting.

•  Posterior and predictive distributions are random distributions

•  Does it make sense to predict with posterior variance? –  Only if you measure the quality of the predictive distribution with the KL (or other) divergence.

•  PAC-Bayes training-variance bound reflects the variance of the posterior distribution.

Page 71:

Beyond this lecture •  Non-parametric analysis of GP – van der Vaart & van Zanten (2011) “Information Rates of Nonparametric Gaussian Process Methods”

E_S \|\hat{f} - f^*\|_n^2 \le O\left( n^{-\min(\alpha, \beta)/(2\alpha + d)} \right)

for f^* with smoothness parameter β and posterior mean \hat{f}, using a Matérn kernel with parameter α.