
Introduction to Machine Learning (67577), Lecture 5

Shai Shalev-Shwartz

School of CS and Engineering, The Hebrew University of Jerusalem

Non-uniform learning, MDL, SRM, Decision Trees, Nearest Neighbor


Outline

1 Minimum Description Length

2 Non-uniform learnability

3 Structural Risk Minimization

4 Decision Trees

5 Nearest Neighbor and Consistency


How to Express Prior Knowledge

So far, the learner expresses prior knowledge by specifying the hypothesis class H


Other Ways to Express Prior Knowledge

William of Occam (1287-1347)

Occam's Razor: "A short explanation is preferred over a longer one"

"Things that look alike must be alike"


Bias to Shorter Description

Let H be a countable hypothesis class

Let w : H → [0, 1] be such that ∑_{h∈H} w(h) ≤ 1

The function w reflects prior knowledge on how important each hypothesis h is


Example: Description Length

Suppose that each h ∈ H is described by some word d(h) ∈ {0, 1}*. E.g.: H is the class of all Python programs

Suppose that the description language is prefix-free, namely, for every h ≠ h′, d(h) is not a prefix of d(h′) (always achievable by including an "end-of-word" symbol)

Let |h| be the length of d(h)

Then, set w(h) = 2^{−|h|}

Kraft's inequality implies that ∑_h w(h) ≤ 1

Proof: define probability over words in d(H) as follows: repeatedly toss an unbiased coin, until the sequence of outcomes is a member of d(H), and then stop. Since d(H) is prefix-free, this is a valid probability over d(H), and the probability to get d(h) is w(h).
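As a quick illustration (not from the slides), the following Python sketch checks Kraft's inequality for a small, made-up prefix-free code:

    # Sanity check of Kraft's inequality; hypotheses h1..h4 and their
    # code words are illustrative only.
    codes = {"h1": "0", "h2": "10", "h3": "110", "h4": "111"}
    words = list(codes.values())
    # prefix-free: no code word is a prefix of another
    assert all(not w2.startswith(w1)
               for w1 in words for w2 in words if w1 != w2)
    # Kraft's inequality: sum over h of 2^{-|h|} is at most 1
    kraft_sum = sum(2.0 ** -len(w) for w in words)
    print(kraft_sum)  # 0.5 + 0.25 + 0.125 + 0.125 = 1.0
    assert kraft_sum <= 1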


Bias to Shorter Description

Theorem (Minimum Description Length (MDL) bound)

Let w : H → [0, 1] be such that ∑_{h∈H} w(h) ≤ 1. Then, with probability of at least 1 − δ over S ∼ D^m we have:

∀h ∈ H,  L_D(h) ≤ L_S(h) + √( (−log(w(h)) + log(2/δ)) / (2m) )

Compare to VC bound:

∀h ∈ H,  L_D(h) ≤ L_S(h) + C·√( (VCdim(H) + log(2/δ)) / (2m) )
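To get a feel for the size of the MDL slack term, here is an illustrative computation (not from the slides); the numbers (a 100-bit description, m = 10000, δ = 0.01) are made up:

    # slack = sqrt((-log w(h) + log(2/delta)) / (2m)) with w(h) = 2^{-|h|};
    # natural logs throughout, so -log w(h) = |h| * log(2)
    import math

    def mdl_slack(description_bits, m, delta):
        return math.sqrt((description_bits * math.log(2) + math.log(2 / delta)) / (2 * m))

    print(mdl_slack(100, 10_000, 0.01))  # ~0.061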


Proof

For every h, define δ_h = w(h) · δ

By Hoeffding's bound, for every h,

D^m({ S : L_D(h) > L_S(h) + √( log(2/δ_h) / (2m) ) }) ≤ δ_h

Applying the union bound,

D^m({ S : ∃h ∈ H, L_D(h) > L_S(h) + √( log(2/δ_h) / (2m) ) })
= D^m( ∪_{h∈H} { S : L_D(h) > L_S(h) + √( log(2/δ_h) / (2m) ) } )
≤ ∑_{h∈H} δ_h ≤ δ.

Since δ_h = w(h) · δ, we have log(2/δ_h) = −log(w(h)) + log(2/δ), which is exactly the bound stated in the theorem.


Bound Minimization

MDL bound: ∀h ∈ H, L_D(h) ≤ L_S(h) + √( (−log(w(h)) + log(2/δ)) / (2m) )

VC bound: ∀h ∈ H, L_D(h) ≤ L_S(h) + C·√( (VCdim(H) + log(2/δ)) / (2m) )

Recall that our goal is to minimize L_D(h) over h ∈ H

Minimizing the VC bound leads to the ERM rule

Minimizing the MDL bound leads to the MDL rule:

MDL(S) ∈ argmin_{h∈H} [ L_S(h) + √( (−log(w(h)) + log(2/δ)) / (2m) ) ]

When w(h) = 2^{−|h|} we obtain −log(w(h)) = |h|·log(2)

Explicit tradeoff between bias (small L_S(h)) and complexity (small |h|)
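A minimal sketch of the MDL rule as literal bound minimization (not the lecture's code); `hypotheses` as a list of (h, description_bits) pairs and `empirical_error(h)` returning L_S(h) are assumed placeholders:

    import math

    def mdl_rule(hypotheses, empirical_error, m, delta):
        # pick the hypothesis minimizing L_S(h) + sqrt((|h| log 2 + log(2/delta)) / (2m))
        def objective(item):
            h, bits = item
            slack = math.sqrt((bits * math.log(2) + math.log(2 / delta)) / (2 * m))
            return empirical_error(h) + slack
        best_h, _ = min(hypotheses, key=objective)
        return best_h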


MDL guarantee

Theorem

For every h* ∈ H, w.p. ≥ 1 − δ over S ∼ D^m we have:

L_D(MDL(S)) ≤ L_D(h*) + √( (−log(w(h*)) + log(2/δ)) / (2m) )

Example: Take H to be the class of all Python programs, with |h| being the code length (in bits)

Assume ∃h* ∈ H with L_D(h*) = 0. Then, for every ε, δ, there exists a sample size m s.t. D^m({S : L_D(MDL(S)) ≤ ε}) ≥ 1 − δ

MDL is a Universal Learner

Aside: the bias to shorter descriptions can equivalently be phrased via a prior p : Y^X → [0, 1] with ∑_h p(h) ≤ 1. Based on a prefix-free description d : H → {0, 1}* (d(h) is never a prefix of d(h′) for h ≠ h′), set p(h) = 2^{−|d(h)|}; Kraft's inequality gives ∑_h p(h) ≤ 1. Alternatively, fix a prefix-free set of "legal programs" U ⊂ {0, 1}* and a map c : U → Y^X (e.g., Python code ↦ the function it implements), and take d(h) to be a shortest program computing h. The MDL learning rule then prefers hypotheses with large p(h), i.e., short |d(h)|; this is the idea behind Kolmogorov complexity (Andrei Kolmogorov, 1903-1987; Ray Solomonoff, 1926-2009).


Contradiction to the fundamental theorem of learning?

Take again H to be the class of all Python programs

Note that VCdim(H) = ∞, so by the No-Free-Lunch theorem H is not PAC learnable. So how come we can learn H using MDL?


Non-uniform learning

Definition (Non-uniformly learnable)

H is non-uniformly learnable if ∃A and m_H^NUL : (0, 1)² × H → N s.t. ∀ε, δ ∈ (0, 1), ∀h ∈ H, if m ≥ m_H^NUL(ε, δ, h) then ∀D,

D^m({S : L_D(A(S)) ≤ L_D(h) + ε}) ≥ 1 − δ.

Number of required examples depends on ε, δ, and h

Definition (Agnostic PAC learnable)

H is agnostically PAC learnable if ∃A and m_H : (0, 1)² → N s.t. ∀ε, δ ∈ (0, 1), if m ≥ m_H(ε, δ), then ∀D and ∀h ∈ H,

D^m({S : L_D(A(S)) ≤ L_D(h) + ε}) ≥ 1 − δ.

Number of required examples depends only on ε, δ


Non-uniform learning vs. PAC learning

Corollary

Let H be the class of all computable functions

H is non-uniformly learnable, with sample complexity

m_H^NUL(ε, δ, h) ≤ (−log(w(h)) + log(2/δ)) / (2ε²)

H is not PAC learnable.

We saw that the VC dimension characterizes PAC learnability

What characterizes non-uniform learnability?


Characterizing Non-uniform Learnability

Theorem

A class H ⊂ {0, 1}^X is non-uniformly learnable if and only if it is a countable union of PAC learnable hypothesis classes.


Proof (Non-uniform learnable ⇒ countable union)

Assume that H is non-uniformly learnable using A with sample complexity m_H^NUL

For every n ∈ N, let H_n = {h ∈ H : m_H^NUL(1/8, 1/7, h) ≤ n}

Clearly, H = ∪_{n∈N} H_n.

For every D s.t. ∃h ∈ H_n with L_D(h) = 0 we have that D^n({S : L_D(A(S)) ≤ 1/8}) ≥ 6/7

The fundamental theorem of statistical learning implies that VCdim(H_n) < ∞, and therefore H_n is agnostic PAC learnable


Proof (Countable union ⇒ non-uniform learnable)

Assume H = ∪_{n∈N} H_n, and VCdim(H_n) = d_n < ∞

Choose w : N → [0, 1] s.t. ∑_n w(n) ≤ 1. E.g. w(n) = 6/(π²n²)

Choose δ_n = δ · w(n) and ε_n = √( C·(d_n + log(1/δ_n)) / m )

By the fundamental theorem, for every n,

D^m({S : ∃h ∈ H_n, L_D(h) > L_S(h) + ε_n}) ≤ δ_n.

Applying the union bound over n we obtain

D^m({S : ∃n, h ∈ H_n, L_D(h) > L_S(h) + ε_n}) ≤ ∑_n δ_n ≤ δ.

This yields a generic non-uniform learning rule


Structural Risk Minimization (SRM)

SRM(S) ∈ argmin_{h∈H} [ L_S(h) + min_{n : h∈H_n} √( (C·d_n − log(w(n)) + log(1/δ)) / m ) ]

As in the analysis of MDL, it is easy to show that for every h ∈ H,

L_D(SRM(S)) ≤ L_S(h) + min_{n : h∈H_n} √( (C·d_n − log(w(n)) + log(1/δ)) / m )

Hence, SRM is a generic non-uniform learner with sample complexity

m_H^NUL(ε, δ, h) ≤ min_{n : h∈H_n} (C·d_n − log(w(n)) + log(1/δ)) / ε²
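A minimal SRM sketch under assumptions (not the lecture's code): it takes VCdim(H_n) = d_n = n and w(n) = 6/(π²n²) as on the slides, and `fit_erm(n, S)` / `empirical_error(h, S)` are placeholder callables supplied by the user:

    import math

    def srm(S, max_n, fit_erm, empirical_error, delta, C=1.0):
        m = len(S)
        best_h, best_score = None, float("inf")
        for n in range(1, max_n + 1):
            h = fit_erm(n, S)                    # ERM over H_n
            w_n = 6.0 / (math.pi ** 2 * n ** 2)  # prior weight of the class H_n
            slack = math.sqrt((C * n - math.log(w_n) + math.log(1 / delta)) / m)
            score = empirical_error(h, S) + slack
            if score < best_score:
                best_h, best_score = h, score
        return best_h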


No-free-lunch for non-uniform learnability

Claim: For any infinite domain set X, the class H = {0, 1}^X is not a countable union of classes of finite VC-dimension.

Hence, such classes H are not non-uniformly learnable


The cost of weaker prior knowledge

Suppose H = ∪_n H_n, where VCdim(H_n) = n

Suppose that some h* ∈ H_n has L_D(h*) = 0

With this prior knowledge, we can apply ERM on H_n, and the sample complexity is C·(n + log(1/δ)) / ε²

Without this prior knowledge, SRM will need C·(n + log(π²n²/6) + log(1/δ)) / ε² examples

That is, we pay order of log(n)/ε² more examples for not knowing n in advance

SRM can thus be used for model selection


Decision Trees

(Figure: an example decision tree for classifying tasty vs. not-tasty. The root asks "Color?": anything other than pale green to pale yellow is labeled not-tasty; otherwise "Softness?" is asked: if it gives slightly to palm pressure the label is tasty, otherwise not-tasty.)


VC dimension of Decision Trees

Claim: Consider the class of decision trees over X with k leaves. Then, the VC dimension of this class is k

Proof: A set of k instances that arrive at different leaves can be shattered. A set of k + 1 instances can't be shattered since 2 instances must arrive at the same leaf


Description Language for Decision Trees

Suppose X = {0, 1}^d and splitting rules are according to 1[x_i = 1] for some i ∈ [d]

Consider the class of all such decision trees over X

Claim: This class contains {0, 1}^X and hence its VC dimension is |X| = 2^d

But, we can bias to "small trees"

A tree with n nodes can be described as n + 1 blocks, each of size log₂(d + 3) bits, indicating (in depth-first order):

An internal node of the form '1[x_i = 1]' for some i ∈ [d]
A leaf whose value is 1
A leaf whose value is 0
End of the code

Can apply the MDL learning rule: search for a tree with n nodes that minimizes

L_S(h) + √( ((n + 1)·log₂(d + 3) + log(2/δ)) / (2m) )
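A sketch of the quantity being minimized (not the lecture's code); `num_nodes(tree)` and `train_error(tree, S)` are placeholders for whatever tree representation is used:

    import math

    def tree_mdl_score(tree, S, d, delta, num_nodes, train_error):
        m = len(S)
        n = num_nodes(tree)
        description_bits = (n + 1) * math.log2(d + 3)
        return train_error(tree, S) + math.sqrt(
            (description_bits + math.log(2 / delta)) / (2 * m))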


Decision Tree Algorithms

Exactly minimizing this objective is an NP-hard problem ...

Greedy approach: 'Iterative Dichotomizer 3'

Following the MDL principle, attempts to create a small tree with low train error

Proposed by Ross Quinlan


ID3(S,A)

Input: training set S, feature subset A ⊆ [d]

if all examples in S are labeled by 1, return a leaf 1
if all examples in S are labeled by 0, return a leaf 0
if A = ∅, return a leaf whose value = majority of labels in S
else:
    Let j = argmax_{i∈A} Gain(S, i)
    if all examples in S have the same label:
        return a leaf whose value = majority of labels in S
    else:
        Let T1 be the tree returned by ID3({(x, y) ∈ S : x_j = 1}, A \ {j})
        Let T2 be the tree returned by ID3({(x, y) ∈ S : x_j = 0}, A \ {j})
        Return the tree whose root asks 'x_j = 1?', with T2 as the 'no' subtree and T1 as the 'yes' subtree
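A direct Python rendering of the pseudocode above (a sketch, not the lecture's reference code). Assumptions: examples are (x, y) pairs with x an indexable collection of 0/1 features and y in {0, 1}, A is a set of feature indices, and `gain` is any of the criteria on the next slide; the guard against an empty split is an added practical detail:

    from collections import Counter

    def majority_label(S):
        return Counter(y for _, y in S).most_common(1)[0][0]

    def id3(S, A, gain):
        labels = {y for _, y in S}
        if labels == {1}:
            return ("leaf", 1)
        if labels == {0}:
            return ("leaf", 0)
        if not A:
            return ("leaf", majority_label(S))
        j = max(A, key=lambda i: gain(S, i))       # best splitting feature
        S1 = [(x, y) for x, y in S if x[j] == 1]
        S0 = [(x, y) for x, y in S if x[j] == 0]
        if not S1 or not S0:                       # degenerate split: stop
            return ("leaf", majority_label(S))
        T1 = id3(S1, A - {j}, gain)
        T2 = id3(S0, A - {j}, gain)
        return ("node", j, T2, T1)                 # 'x_j = 1?' with (no, yes) subtrees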


Gain Measures

Gain(S, i) = C(P_S[y]) − ( P_S[x_i]·C(P_S[y | x_i]) + P_S[¬x_i]·C(P_S[y | ¬x_i]) )

where P_S denotes the empirical probability over the sample S (e.g., P_S[y] is the fraction of examples with label 1).

Train error: C(a) = min{a, 1 − a}
Information gain: C(a) = −a log(a) − (1 − a) log(1 − a)
Gini index: C(a) = 2a(1 − a)

(Plot: C(a) as a function of a ∈ [0, 1] for the three criteria: Error, Info Gain, Gini.)
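Python sketches of the three C(a) criteria and the Gain measure above (not the lecture's code); S and x are as in the ID3 sketch earlier, and natural logs are used for the entropy criterion:

    import math

    def c_error(a):
        return min(a, 1 - a)

    def c_entropy(a):  # the "information gain" criterion
        if a in (0.0, 1.0):
            return 0.0
        return -a * math.log(a) - (1 - a) * math.log(1 - a)

    def c_gini(a):
        return 2 * a * (1 - a)

    def gain(S, i, C=c_entropy):
        p_y = sum(y for _, y in S) / len(S)                 # P_S[y = 1]
        S1 = [(x, y) for x, y in S if x[i] == 1]
        S0 = [(x, y) for x, y in S if x[i] == 0]
        p_xi = len(S1) / len(S)                             # P_S[x_i = 1]
        p1 = sum(y for _, y in S1) / len(S1) if S1 else 0.0 # P_S[y = 1 | x_i = 1]
        p0 = sum(y for _, y in S0) / len(S0) if S0 else 0.0 # P_S[y = 1 | x_i = 0]
        return C(p_y) - (p_xi * C(p1) + (1 - p_xi) * C(p0))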


Pruning, Random Forests,...

In the exercise you’ll learn about additional practical variants:

Pruning the tree

Random Forests

Dealing with real valued features


Nearest Neighbor

"Things that look alike must be alike"

Memorize the training set S = (x_1, y_1), ..., (x_m, y_m)

Given new x, find the k closest points in S and return the majority vote among their labels
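A sketch of exactly this rule in Python (not the lecture's code), with Euclidean distance as an assumed default metric:

    from collections import Counter
    import math

    def knn_predict(S, x, k=1):
        # S is the memorized training set: a list of (x_i, y_i) pairs
        def dist(a, b):
            return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
        neighbors = sorted(S, key=lambda xy: dist(xy[0], x))[:k]
        return Counter(y for _, y in neighbors).most_common(1)[0][0]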


1-Nearest Neighbor: Voronoi Tessellation

Unlike ERM, SRM, MDL, etc., there's no H

At training time: "do nothing"

At test time: search S for the nearest neighbors


Analysis of k-NN

X = [0, 1]^d, Y = {0, 1}, D is a distribution over X × Y, D_X is the marginal distribution over X, and η : R^d → R is the conditional probability over the labels, that is, η(x) = P[y = 1 | x].

Recall: the Bayes optimal rule (that is, the hypothesis that minimizes L_D(h) over all functions) is

h*(x) = 1[η(x) > 1/2].

Prior knowledge: η is c-Lipschitz. Namely, for all x, x′ ∈ X, |η(x) − η(x′)| ≤ c‖x − x′‖

Theorem: Let h_S be the k-NN rule, then

E_{S∼D^m}[L_D(h_S)] ≤ (1 + √(8/k))·L_D(h*) + (6c√d + k)·m^{−1/(d+1)}.


k-Nearest Neighbor: Data Fit / Complexity Tradeoff

(Figure: the k-NN predictor on a fixed sample S for k = 1, 5, 12, 50, 100, 200, shown together with S and the target h*.)


Curse of Dimensionality

E_{S∼D^m}[L_D(h_S)] ≤ (1 + √(8/k))·L_D(h*) + (6c√d + k)·m^{−1/(d+1)}.

Suppose L_D(h*) = 0. Then, to have error ≤ ε we need m ≥ (4c√d/ε)^{d+1}.

Number of examples grows exponentially with the dimension
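To make the exponential growth concrete, here is an illustrative computation (not from the slides); c = 1 and ε = 0.1 are made-up values:

    import math

    c, eps = 1.0, 0.1
    for d in (1, 2, 5, 10, 20):
        print(d, f"{(4 * c * math.sqrt(d) / eps) ** (d + 1):.1e}")
    # d = 1 -> 1.6e+03, d = 5 -> ~5.1e+11, d = 20 -> ~2.0e+47 examples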

This is not an artifact of the analysis

Theorem

For any c > 1, and every learner, there exists a distribution over [0, 1]^d × {0, 1}, such that η(x) is c-Lipschitz, the Bayes error of the distribution is 0, but for sample sizes m ≤ (c + 1)^d / 2, the true error of the learner is greater than 1/4.


Contradicting the No-Free-Lunch?

E_{S∼D^m}[L_D(h_S)] ≤ (1 + √(8/k))·L_D(h*) + (6c√d + k)·m^{−1/(d+1)}.

Seemingly, we learn the class of all functions over [0, 1]^d

But this class is not learnable even in the non-uniform model ...

There's no contradiction: The number of required examples depends on the Lipschitzness of η (the parameter c), which depends on D.

PAC: m(ε, δ)
non-uniform: m(ε, δ, h)
consistency: m(ε, δ, h, D)


Issues with Nearest Neighbor

Need to store the entire training set ("replace intelligence with fast memory")

Curse of dimensionality. We'll later learn dimensionality reduction methods

Computational problem of finding the nearest neighbor

What is the "correct" metric between objects? Success depends on the Lipschitzness of η, which depends on the right metric


Summary

Expressing prior knowledge: hypothesis class, weighting hypotheses, metric

Weaker notions of learnability: "PAC" is stronger than "non-uniform", which is stronger than "consistency"

Learning rules: ERM, MDL, SRM

Decision trees

Nearest Neighbor
