excessriskdecomposition - github pages · excessriskdecomposition davidrosenberg new york...

31
Excess Risk Decomposition David Rosenberg New York University October 29, 2016 David Rosenberg (New York University) DS-GA 1003 October 29, 2016 1 / 31

Upload: others

Post on 15-Oct-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: ExcessRiskDecomposition - GitHub Pages · ExcessRiskDecomposition DavidRosenberg New York University October29,2016 David Rosenberg (New York University) DS-GA 1003 October 29, 2016

Excess Risk Decomposition

David Rosenberg

New York University

October 29, 2016

David Rosenberg (New York University) DS-GA 1003 October 29, 2016 1 / 31

Page 2: ExcessRiskDecomposition - GitHub Pages · ExcessRiskDecomposition DavidRosenberg New York University October29,2016 David Rosenberg (New York University) DS-GA 1003 October 29, 2016

Review: Statistical Learning Theory

Review: Statistical Learning Theory

David Rosenberg (New York University) DS-GA 1003 October 29, 2016 2 / 31

Page 3: ExcessRiskDecomposition - GitHub Pages · ExcessRiskDecomposition DavidRosenberg New York University October29,2016 David Rosenberg (New York University) DS-GA 1003 October 29, 2016

Review: Statistical Learning Theory

Statistical Learning Theory Framework

The Spaces

X: input space Y: output space A: action space

Decision FunctionA decision function produces an action a ∈A for any input x ∈ X:

f : X → A

x 7→ f (x)

Loss FunctionA loss function evaluates an action in the context of the output y .

` : A×Y → R(a,y) 7→ `(a,y)

David Rosenberg (New York University) DS-GA 1003 October 29, 2016 3 / 31

Page 4: ExcessRiskDecomposition - GitHub Pages · ExcessRiskDecomposition DavidRosenberg New York University October29,2016 David Rosenberg (New York University) DS-GA 1003 October 29, 2016

Review: Statistical Learning Theory

The Gold Standard: Bayes Decision Function

DefinitionThe expected loss or “risk” of a decision function f : X→A is

R(f ) = E`(f (x),y),

where the expectation taken is over (x ,y) ∼ PX×Y.

DefinitionA Bayes decision function f ∗ : X→A is a function that achieves theminimal risk among all possible functions:

R(f ∗) = inffE`(f (x),y).

But risk function cannot be computed because we don’t know PX×Y.

David Rosenberg (New York University) DS-GA 1003 October 29, 2016 4 / 31

Page 5: ExcessRiskDecomposition - GitHub Pages · ExcessRiskDecomposition DavidRosenberg New York University October29,2016 David Rosenberg (New York University) DS-GA 1003 October 29, 2016

Review: Statistical Learning Theory

Empirical Risk Minimization

Let Dn = ((x1,y1), . . . ,(xn,yn)) be drawn i.i.d. from PX×Y.

DefinitionThe empirical risk of f : X→A with respect to Dn is

R̂n(f ) =1n

n∑i=1

`(f (xi ),yi ).

Minimizing empirical risk over all functions leads to overfitting.

David Rosenberg (New York University) DS-GA 1003 October 29, 2016 5 / 31

Page 6: ExcessRiskDecomposition - GitHub Pages · ExcessRiskDecomposition DavidRosenberg New York University October29,2016 David Rosenberg (New York University) DS-GA 1003 October 29, 2016

Review: Statistical Learning Theory

Constrain to a Hypothesis Space

Hypothesis space F, a set of functions mapping X→A

Example hypothesis spaces?

Empirical risk minimizer (ERM) in F is

f̂n = argminf∈F

1n

n∑i=1

`(f (xi ),yi ).

Risk minimizer in F is

fF = argminf∈F

E`(f (x),y).

David Rosenberg (New York University) DS-GA 1003 October 29, 2016 6 / 31

Page 7: ExcessRiskDecomposition - GitHub Pages · ExcessRiskDecomposition DavidRosenberg New York University October29,2016 David Rosenberg (New York University) DS-GA 1003 October 29, 2016

Excess Risk Decomposition

Excess Risk Decomposition

David Rosenberg (New York University) DS-GA 1003 October 29, 2016 7 / 31

Page 8: ExcessRiskDecomposition - GitHub Pages · ExcessRiskDecomposition DavidRosenberg New York University October29,2016 David Rosenberg (New York University) DS-GA 1003 October 29, 2016

Excess Risk Decomposition

Error Decomposition

f ∗ =argminf

E`(f (X ),Y )

fF =argminf∈F

E`(f (X ),Y ))

f̂n =argminf∈F

1n

n∑i=1

`(f (xi ),yi )

Approximation Error (of F) = R(fF)−R(f ∗)

Estimation error (of f̂n in F) = R(f̂n)−R(fF)

Figure from Sasha Rakhlin’s MLSS Lectures (2012):http://yosinski.com/mlss12/MLSS-2012-Rakhlin-Statistical-Learning-Theory/

David Rosenberg (New York University) DS-GA 1003 October 29, 2016 8 / 31

Page 9: ExcessRiskDecomposition - GitHub Pages · ExcessRiskDecomposition DavidRosenberg New York University October29,2016 David Rosenberg (New York University) DS-GA 1003 October 29, 2016

Excess Risk Decomposition

Excess Risk

DefinitionThe excess risk compares the risk of f to the Bayes optimal f ∗:

Excess Risk(f ) = R(f )−R(f ∗)

Can excess risk ever be negative?

David Rosenberg (New York University) DS-GA 1003 October 29, 2016 9 / 31

Page 10: ExcessRiskDecomposition - GitHub Pages · ExcessRiskDecomposition DavidRosenberg New York University October29,2016 David Rosenberg (New York University) DS-GA 1003 October 29, 2016

Excess Risk Decomposition

Excess Risk Decomposition for ERM

The excess risk of the ERM f̂n can be decomposed:

Excess Risk(f̂n) = R(f̂n)−R(f ∗)

= R(f̂n)−R(fF)︸ ︷︷ ︸estimation error

+ R(fF)−R(f ∗)︸ ︷︷ ︸approximation error

.

David Rosenberg (New York University) DS-GA 1003 October 29, 2016 10 / 31

Page 11: ExcessRiskDecomposition - GitHub Pages · ExcessRiskDecomposition DavidRosenberg New York University October29,2016 David Rosenberg (New York University) DS-GA 1003 October 29, 2016

Excess Risk Decomposition

Approximation Error

Approximation error R(fF)−R(f ∗) isa property of the class Fthe penalty for restricting to F rather than all possible functions

Bigger F mean smaller approximation error.

Concept check: Is approximation error a random or non-random variable?

David Rosenberg (New York University) DS-GA 1003 October 29, 2016 11 / 31

Page 12: ExcessRiskDecomposition - GitHub Pages · ExcessRiskDecomposition DavidRosenberg New York University October29,2016 David Rosenberg (New York University) DS-GA 1003 October 29, 2016

Excess Risk Decomposition

Estimation Error

Estimation error R(f̂n)−R(fF)

is the performance hit for choosing f using finite training datais the performance hit for using empirical risk rather than true risk

With smaller F we expect smaller estimation error.

Concept check: Is estimation error a random or non-random variable?

David Rosenberg (New York University) DS-GA 1003 October 29, 2016 12 / 31

Page 13: ExcessRiskDecomposition - GitHub Pages · ExcessRiskDecomposition DavidRosenberg New York University October29,2016 David Rosenberg (New York University) DS-GA 1003 October 29, 2016

Excess Risk Decomposition

ERM Overview

Given a loss function ` :A×Y→ R.Choose hypothesis space F.Use an optimization method to find ERM f̂n ∈ F:

f̂n = argminf∈F

1n

n∑i=1

`(f (xi ),yi ).

Data scientist’s job:

choose F to balance between approximation and estimation error.as we get more training data, use a bigger F

David Rosenberg (New York University) DS-GA 1003 October 29, 2016 13 / 31

Page 14: ExcessRiskDecomposition - GitHub Pages · ExcessRiskDecomposition DavidRosenberg New York University October29,2016 David Rosenberg (New York University) DS-GA 1003 October 29, 2016

Excess Risk Decomposition

ERM in Practice

We’ve been cheating a bit by writing “argmin”.In practice, we need a method to find f̂n ∈ F.For nice choices of loss functions and classes F, the algorithmicproblem can be solved to any desired accuracy

But takes time – is it worth it?

For neural networks, we have no idea how to find f̂n ∈ F.

David Rosenberg (New York University) DS-GA 1003 October 29, 2016 14 / 31

Page 15: ExcessRiskDecomposition - GitHub Pages · ExcessRiskDecomposition DavidRosenberg New York University October29,2016 David Rosenberg (New York University) DS-GA 1003 October 29, 2016

Excess Risk Decomposition

Optimization Error

In practice, we don’t find the ERM f̂n ∈ F.We find f̃n ∈ F that we hope is good enough.Optimization error: If f̃n is the function our optimization methodreturns, and f̂n is the empirical risk minimizer, then

Optimization Error = R(f̃n)−R(f̂n).

Can optimization error be negative? Yes!But

R̂(f̃n)− R̂(f̂ (n))> 0.

David Rosenberg (New York University) DS-GA 1003 October 29, 2016 15 / 31

Page 16: ExcessRiskDecomposition - GitHub Pages · ExcessRiskDecomposition DavidRosenberg New York University October29,2016 David Rosenberg (New York University) DS-GA 1003 October 29, 2016

Excess Risk Decomposition

Error Decomposition in Practice

Excess risk decomposition for function f̃n returned by algorithm:

Excess Risk(f̃n) = R(f̃n)−R(f ∗)

= R(f̃n)−R(f̂n)︸ ︷︷ ︸optimization error

+R(f̂n)−R(fF)︸ ︷︷ ︸estimation error

+ R(fF)−R(f ∗)︸ ︷︷ ︸approximation error

David Rosenberg (New York University) DS-GA 1003 October 29, 2016 16 / 31

Page 17: ExcessRiskDecomposition - GitHub Pages · ExcessRiskDecomposition DavidRosenberg New York University October29,2016 David Rosenberg (New York University) DS-GA 1003 October 29, 2016

Excess Risk Decomposition: Example

Excess Risk Decomposition: Example

David Rosenberg (New York University) DS-GA 1003 October 29, 2016 17 / 31

Page 18: ExcessRiskDecomposition - GitHub Pages · ExcessRiskDecomposition DavidRosenberg New York University October29,2016 David Rosenberg (New York University) DS-GA 1003 October 29, 2016

Excess Risk Decomposition: Example

A Simple Classification Problem

Bayes Error Rate = 0.1

Y= {blue,orange}

PX = Uniform([0,1]2)P(orange | x1 > x2) = .9P(orange | x1 < x2) = .1

David Rosenberg (New York University) DS-GA 1003 October 29, 2016 18 / 31

Page 19: ExcessRiskDecomposition - GitHub Pages · ExcessRiskDecomposition DavidRosenberg New York University October29,2016 David Rosenberg (New York University) DS-GA 1003 October 29, 2016

Excess Risk Decomposition: Example

Binary Decision Trees on R2

Consider a binary tree on {(X1,X2) | X1,X2 ∈ R}

From An Introduction to Statistical Learning, with applications in R (Springer, 2013) with permission from theauthors: G. James, D. Witten, T. Hastie and R. Tibshirani.

David Rosenberg (New York University) DS-GA 1003 October 29, 2016 19 / 31

Page 20: ExcessRiskDecomposition - GitHub Pages · ExcessRiskDecomposition DavidRosenberg New York University October29,2016 David Rosenberg (New York University) DS-GA 1003 October 29, 2016

Excess Risk Decomposition: Example

Hypothesis Space: Decision Tree

F ={all decision tree classifiers on [0,1]2

}Fd =

{all decision tree classifiers on [0,1]2 with DEPTH6 d

}We’ll consider

F2 ⊂ F3 ⊂ F4 · · · ⊂ F15

Bayes error rate = 0.1

David Rosenberg (New York University) DS-GA 1003 October 29, 2016 20 / 31

Page 21: ExcessRiskDecomposition - GitHub Pages · ExcessRiskDecomposition DavidRosenberg New York University October29,2016 David Rosenberg (New York University) DS-GA 1003 October 29, 2016

Excess Risk Decomposition: Example

Theoretical Best in F2

Risk Minimizer in F2 (e.g. assuming infinite training data)Risk = P(error) = 0.2Approximation Error = 0.2 - 0.1 = 0.1

David Rosenberg (New York University) DS-GA 1003 October 29, 2016 21 / 31

Page 22: ExcessRiskDecomposition - GitHub Pages · ExcessRiskDecomposition DavidRosenberg New York University October29,2016 David Rosenberg (New York University) DS-GA 1003 October 29, 2016

Excess Risk Decomposition: Example

Theoretical Best in F3

Risk Minimizer in F3 (e.g. assuming infinite training data)Risk = P(error) = 0.15Approximation Error = 0.15 - 0.1 = 0.05

David Rosenberg (New York University) DS-GA 1003 October 29, 2016 22 / 31

Page 23: ExcessRiskDecomposition - GitHub Pages · ExcessRiskDecomposition DavidRosenberg New York University October29,2016 David Rosenberg (New York University) DS-GA 1003 October 29, 2016

Excess Risk Decomposition: Example

Theoretical Best in F4

Risk Minimizer (e.g. assuming infinite training data)Risk = P(error) = 0.125Approximation Error = 0.125 - 0.1 = 0.025

David Rosenberg (New York University) DS-GA 1003 October 29, 2016 23 / 31

Page 24: ExcessRiskDecomposition - GitHub Pages · ExcessRiskDecomposition DavidRosenberg New York University October29,2016 David Rosenberg (New York University) DS-GA 1003 October 29, 2016

Excess Risk Decomposition: Example

Decision Tree in F3 Estimated From Sample (n = 1024)

R(f̃ ) = P(error) = 0.176± .004

Estimation Error+Optimization Error= 0.176± .004︸ ︷︷ ︸R(f̃ )

− 0.150︸ ︷︷ ︸minf∈F3 R(f )

= .026± .004David Rosenberg (New York University) DS-GA 1003 October 29, 2016 24 / 31

Page 25: ExcessRiskDecomposition - GitHub Pages · ExcessRiskDecomposition DavidRosenberg New York University October29,2016 David Rosenberg (New York University) DS-GA 1003 October 29, 2016

Excess Risk Decomposition: Example

Decision Tree in F4 Estimated From Sample (n = 1024)

R(f̃ ) = P(error) = 0.144± .005

Estimation Error+Optimization Error= 0.144± .005︸ ︷︷ ︸R(f̃ )

− 0.125︸ ︷︷ ︸minf∈F4 R(f )

= .019± .005David Rosenberg (New York University) DS-GA 1003 October 29, 2016 25 / 31

Page 26: ExcessRiskDecomposition - GitHub Pages · ExcessRiskDecomposition DavidRosenberg New York University October29,2016 David Rosenberg (New York University) DS-GA 1003 October 29, 2016

Excess Risk Decomposition: Example

Decision Tree in F6 Estimated From Sample (n = 1024)

R(f̃ ) = P(error) = 0.148± .007

Estimation Error+Optimization Error= 0.148± .007︸ ︷︷ ︸R(f̃ )

− 0.106︸ ︷︷ ︸minf∈F6 R(f )

= .042± .008David Rosenberg (New York University) DS-GA 1003 October 29, 2016 26 / 31

Page 27: ExcessRiskDecomposition - GitHub Pages · ExcessRiskDecomposition DavidRosenberg New York University October29,2016 David Rosenberg (New York University) DS-GA 1003 October 29, 2016

Excess Risk Decomposition: Example

Decision Tree in F8 Estimated From Sample (n = 1024)

R(f̃ ) = P(error) = 0.162± .009

Estimation Error+Optimization Error= 0.162± .009︸ ︷︷ ︸R(f̃ )

− 0.102︸ ︷︷ ︸minf∈F8 R(f )

= .061± .009David Rosenberg (New York University) DS-GA 1003 October 29, 2016 27 / 31

Page 28: ExcessRiskDecomposition - GitHub Pages · ExcessRiskDecomposition DavidRosenberg New York University October29,2016 David Rosenberg (New York University) DS-GA 1003 October 29, 2016

Excess Risk Decomposition: Example

Decision Tree in F8 Estimated From Sample (n = 2048)

R(f̃ ) = P(error) = 0.146± .006

Estimation Error+Optimization Error= 0.146± .006︸ ︷︷ ︸R(f̃ )

− 0.102︸ ︷︷ ︸minf∈F3 R(f )

= .045± .006David Rosenberg (New York University) DS-GA 1003 October 29, 2016 28 / 31

Page 29: ExcessRiskDecomposition - GitHub Pages · ExcessRiskDecomposition DavidRosenberg New York University October29,2016 David Rosenberg (New York University) DS-GA 1003 October 29, 2016

Excess Risk Decomposition: Example

Decision Tree in F8 Estimated From Sample (n = 8192)

R(f̃ ) = P(error) = 0.121± .002

Estimation Error+Optimization Error= 0.121± .002︸ ︷︷ ︸R(f̃ )

− 0.102︸ ︷︷ ︸minf∈F3 R(f )

= .019± .002David Rosenberg (New York University) DS-GA 1003 October 29, 2016 29 / 31

Page 30: ExcessRiskDecomposition - GitHub Pages · ExcessRiskDecomposition DavidRosenberg New York University October29,2016 David Rosenberg (New York University) DS-GA 1003 October 29, 2016

Excess Risk Decomposition: Example

Risk Summary

Why do some curves have confidence bands and others not?David Rosenberg (New York University) DS-GA 1003 October 29, 2016 30 / 31

Page 31: ExcessRiskDecomposition - GitHub Pages · ExcessRiskDecomposition DavidRosenberg New York University October29,2016 David Rosenberg (New York University) DS-GA 1003 October 29, 2016

Excess Risk Decomposition: Example

Excess Risk Decomposition

Why do some curves have confidence bands and others not?David Rosenberg (New York University) DS-GA 1003 October 29, 2016 31 / 31