Machine Learning Theory by the People, for the People, of the People

Xiaojin Zhu
Department of Computer Sciences, University of Wisconsin-Madison
October 2011

(slides: pages.cs.wisc.edu/~jerryzhu/ssl/pub/ml4people.pdf)

war        moonlight  fever  bravery  lice
A          B          A      B        A

Notation

- X: the domain, e.g., a finite set of words
- P_X: a distribution on X, e.g., uniform
- f : X → {−1, 1}: a classifier
- F: the hypothesis space, f ∈ F


In-class exam

bravery  fever  lice  moonlight  war

Take-home exam

cowardice  daylight  fun  hero  screech

Did you overfit?

- (x, y) ~ P_XY, i.i.d.
- training error: ê(f) = (1/n) Σ_{i=1}^n 1{y_i ≠ f(x_i)}
- true error: e(f) = E_{(x,y) ~ P_XY}[1{y ≠ f(x)}]
- want a bound: e(f) ≤ ê(f) + something
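The gap between training error ê(f) and true error e(f) is easy to see in a toy simulation. A minimal sketch (the 20-word domain, the sample size, and the "memorizer" learner are my assumptions for illustration, not from the slides):

```python
import random

rng = random.Random(0)
domain = list(range(20))                           # a small finite word domain X
truth = {x: rng.choice((-1, 1)) for x in domain}   # arbitrary ground-truth labels

# Draw a small training set (x, y) i.i.d. with uniform P_X
train = [(x, truth[x]) for x in (rng.choice(domain) for _ in range(5))]

# A memorizer: reproduces training labels exactly, guesses +1 elsewhere
memo = dict(train)
f = lambda x: memo.get(x, 1)

train_err = sum(f(x) != y for x, y in train) / len(train)
true_err = sum(f(x) != truth[x] for x in domain) / len(domain)
# train_err is 0 by construction; true_err is typically much larger
```

The memorizer drives ê(f) to zero while saying nothing about e(f), which is exactly the gap the bound on the next slides controls.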


Rademacher bound [Bartlett and Mendelson 2002]

On a training set of size n, with probability at least 1 − δ, for all f ∈ F:

e(f) ≤ ê(f) + R(F, X, P_X, n)/2 + sqrt(ln(1/δ) / (2n))
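Numerically the bound is a one-liner. A sketch (the helper name and the δ = 0.05 default are my choices):

```python
import math

def rademacher_bound(train_err, R, n, delta=0.05):
    """The slide's bound: w.p. >= 1 - delta,
    e(f) <= e_hat(f) + R/2 + sqrt(ln(1/delta) / (2n))."""
    return train_err + R / 2 + math.sqrt(math.log(1 / delta) / (2 * n))

# For a fixed complexity R, the confidence slack shrinks as n grows
b5 = rademacher_bound(0.0, 1.0, 5)
b40 = rademacher_bound(0.0, 1.0, 40)
```

In the human experiments later in the deck, both ê(f) and R are measured per condition and plugged into exactly this expression.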

Rademacher complexity [Bartlett and Mendelson 2002]

- x_1, …, x_n ~ P_X, i.i.d.
- σ_1, …, σ_n i.i.d. with values ±1, each with probability 1/2
- fit of f: |Σ_{i=1}^n σ_i f(x_i)|
- fit of F: sup_{f∈F} |Σ_{i=1}^n σ_i f(x_i)|

Rademacher complexity:

R(F, X, P_X, n) = E_{x,σ} [ sup_{f∈F} | (2/n) Σ_{i=1}^n σ_i f(x_i) | ]
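For a finite hypothesis class the expectation over σ can be approximated by Monte Carlo. A sketch that conditions on a fixed sample (the function name, the toy threshold class, and the trial count are my assumptions):

```python
import random

def rademacher_complexity(F, xs, n_trials=2000, rng=None):
    """Monte Carlo estimate of E_sigma[ sup_{f in F} |(2/n) sum_i sigma_i f(x_i)| ]
    for a fixed sample xs (conditioning on x instead of averaging over P_X)."""
    rng = rng or random.Random(0)
    n = len(xs)
    total = 0.0
    for _ in range(n_trials):
        sigma = [rng.choice((-1, 1)) for _ in range(n)]
        total += max(abs(2 * sum(s * f(x) for s, x in zip(sigma, xs)) / n) for f in F)
    return total / n_trials

xs = list(range(5))
constants = [lambda x: 1, lambda x: -1]
thresholds = [lambda x, t=t: 1 if x >= t else -1 for t in range(6)]
r_const = rademacher_complexity(constants, xs, rng=random.Random(1))
r_thresh = rademacher_complexity(thresholds, xs, rng=random.Random(1))
# the richer threshold class fits random signs better, so r_thresh >= r_const
```

The comparison illustrates the point of the definition: a class that can chase the random signs σ (here, thresholds vs. constants) has larger complexity, and a class that can realize every sign pattern would reach the maximum value 2.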


Estimating human Rademacher complexity [NIPS 09]

"Learning the noise"

1. participant studies {(x_i, σ_i)}_{i=1}^n
2. filler task
3. classify {x_i}_{i=1}^n: re-ordered; not told these were the training items

At the end, we observe f(x_1), …, f(x_n) from the human.

Estimating human Rademacher complexity (cont.)

Key assumption: the human's classifier fits the noise as well as possible,

f̂ = arg sup_{f∈F} Σ_{i=1}^n σ_i f(x_i)

therefore

sup_{f∈F} |(2/n) Σ_{i=1}^n σ_i f(x_i)| ≈ |(2/n) Σ_{i=1}^n σ_i f̂(x_i)|

Averaging over participants gives an estimate of R.
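The plug-in estimate above is simple to compute from the recorded σ's and each participant's responses. A minimal sketch (the function name is mine):

```python
def human_rademacher_estimate(sigmas, responses):
    """Average of |(2/n) sum_i sigma_i f(x_i)| over participants, where each
    row of `responses` holds one participant's labels f(x_1), ..., f(x_n)."""
    vals = []
    for sig, lab in zip(sigmas, responses):
        n = len(sig)
        vals.append(abs(2 * sum(s * y for s, y in zip(sig, lab)) / n))
    return sum(vals) / len(vals)

# A participant who reproduces the random labels perfectly contributes 2.0;
# answers uncorrelated with sigma contribute about 0 on average.
```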


Human Rademacher complexity

rape  killer  funeral  · · ·  fun  laughter  joy

the Shape X / the Word X

[Plots: estimated human Rademacher complexity (y-axis, 0–2) vs. n (x-axis, 0–40); left: R(F, Shape, uniform, n); right: R(F, Word, uniform, n).]


Does the bound work?

"Learning any task"

1. participant studies {(x_i, y_i)}_{i=1}^n
2. filler task
3. classify {x_i}_{i=1}^{n+100}: re-ordered; not told some were training items

Yes, the bound works

e(f) ≤ ê(f) + R(F, X, P_X, n)/2 + sqrt(ln(1/δ) / (2n))

condition           subject   ê      bound   e
WordEmotion, n=5    101       0.00   1.43    0.58
                    102       0.00   1.43    0.46
                    103       0.00   1.43    0.04
                    104       0.00   1.43    0.03
                    105       0.00   1.43    0.31
WordEmotion, n=40   106       0.70   1.23    0.65
                    107       0.00   0.53    0.04
                    108       0.00   0.53    0.00
                    109       0.62   1.15    0.53
                    110       0.00   0.53    0.05

Oh how they overfit!

mnemonics
- (grenade, B), (skull, A), (conflict, A), (meadow, B), (queen, B)
- "a queen was sitting in a meadow and then a grenade was thrown (B = before), then this started a conflict ending in bodies & skulls (A = after)"

idiosyncratic rules
- whether the shape "faces downward"
- whether the word "tastes good"
- "anything related to omitting (sic) light"
- "things you can go inside"
- odd or even number of syllables
- "relates to motel service"
- "physical vs. abstract"


Smaller Rademacher complexity, less actual overfitting

[Plot: e − ê (y-axis, 0–1.5) vs. Rademacher complexity (x-axis, 0–2), showing bound and observed values for the conditions Shape,40; Word,40; Shape,5; Word,5.]

Now you be the teacher

B  B  B  B  B

The teaching dimension [Goldman and Kearns 1995]

- items X; H: threshold functions
- teaching set of h ∈ H: a labeled subset of X consistent with h only
- TD(h): the size of the smallest teaching set of h (here 1 or 2)
- TD(H): TD(h*) for the hardest h* ∈ H (here 2)

Optimal teaching should start around the decision boundary.
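These definitions can be checked by brute force for threshold functions on a small ordered domain. A sketch (the function name, the enumeration strategy, and the toy domain are my choices):

```python
from itertools import combinations

def teaching_dim(h_star, H, X):
    """TD(h_star): size of the smallest labeled subset of X that is
    consistent with h_star but with no other hypothesis in H."""
    labeled = [(x, h_star(x)) for x in X]
    others = [h for h in H if any(h(x) != h_star(x) for x in X)]
    for k in range(1, len(X) + 1):
        for S in combinations(labeled, k):
            # S teaches h_star iff every other hypothesis disagrees with S somewhere
            if all(any(h(x) != y for x, y in S) for h in others):
                return k
    return len(X)

X = [1, 2, 3, 4, 5]
H = [lambda x, t=t: 1 if x >= t else -1 for t in range(1, 7)]  # thresholds
tds = [teaching_dim(h, H, X) for h in H]
# interior thresholds need the two items flanking the boundary (TD = 2);
# the two extreme thresholds need only one item, so TD(H) = 2
```

The size-2 teaching sets for interior thresholds are exactly the pair of items straddling the boundary, which is the "teach at the boundary" intuition stated on the slide.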


Curriculum learning [Bengio et al. 2009]

Teaching should go from easy to hard, i.e., from outside to inside.

You teach a robot ... [to appear at NIPS 11]

... graspability

Observed human teaching strategy 1

[Plots, participant P31, natural condition: teaching item location x (0–1) vs. iteration t (0–15); and |V_1| (0–1) vs. iteration t (1–15).]

Observed human teaching strategy 2

[Plot, participant P08, constrained condition: teaching item location x (0–1) vs. iteration t (0–15).]

Extending teaching dimension for curriculum learning

Humans represent objects by many dimensions!

squirrel = (graspable, shy, store supplies for the winter, is not poisonous, has four paws, has teeth, has two ears, has two eyes, is beautiful, is brown, lives in trees, rodent, doesn't herd, doesn't sting, drinks water, eats nuts, feels soft, fluffy, gnaws on everything, has a beautiful tail, has a large tail, has a mouth, has a small head, has gnawing teeth, has pointy ears, has short paws, is afraid of people, is cute, is difficult to catch, is found in Belgium, is light, is not a pet, is not very big, is short haired, is sweet, jumps, lives in Europe, lives in the wild, short front legs, small ears, smaller than a horse, soft fur, timid animal, can't fly, climbs in trees, collects nuts, crawls up trees, eats acorns, eats plants, does not lay eggs ...)

Idealized assumptions

- available teaching items x_1, …, x_n ~ unif[0, 1]^d
- the first dimension determines the label: p(y_i = 1 | x_i) = 1{x_i1 > 1/2}
- learner's version space V: axis-parallel decision boundaries, after two teaching items (x_1, 1), (x_2, 0)
- learner is a Gibbs classifier

[Figure: the version space after the two teaching items, per dimension: in dim 1 the boundary θ_1 ranges over (b, a), with a ≡ x_11 and b ≡ x_21; in dim 2 the boundary θ_2 ranges between the items' second coordinates x_12 and x_22.]


Risk minimization leads to teaching extremes

learner's risk:

R = (1/|V|) ( ∫_b^a |θ_1 − 1/2| dθ_1 + Σ_{k=2}^d ∫_{min(x_1k, x_2k)}^{max(x_1k, x_2k)} (1/2) dθ_k )

the teacher chooses a, b to minimize R (a trade-off)

Theorem. The risk R is minimized by a* = (√(c² + 2c) − c + 1) / 2 and b* = 1 − a*, where c ≡ Σ_{k=2}^d |x_1k − x_2k|. Here c is the sum of d − 1 Beta(1, 2) random variables.

Corollary. As d → ∞, the minimizer of R satisfies a* → 1, b* → 0.

In practice: d = 10 gives a* ≈ 0.94; d = 100 gives a* ≈ 0.99.
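The theorem's minimizer is easy to evaluate numerically. A sketch that plugs in E[c] = (d − 1)/3, since each per-dimension gap |x_1k − x_2k| is Beta(1, 2) with mean 1/3 (using the expected c here, rather than a random draw, is my simplification):

```python
import math

def a_star(c):
    """Risk-minimizing location of the positive teaching item:
    a* = (sqrt(c^2 + 2c) - c + 1) / 2."""
    return (math.sqrt(c * c + 2 * c) - c + 1) / 2

for d in (10, 100):
    c = (d - 1) / 3          # expected sum of d - 1 Beta(1,2) gaps
    print(d, round(a_star(c), 2))
```

This reproduces the slide's numbers (0.94 at d = 10, 0.99 at d = 100), and a_star(c) → 1 as c grows, matching the corollary: with many irrelevant dimensions the optimal teaching items sit at the extremes, not at the boundary.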


Teaching items should approach the decision boundary

Theorem. Let the teaching sequence contain t_0 negative labels and t − t_0 positive ones. Then the version space in dim k has size |V_k| = α_k β_k, where

α_k ~ Bernoulli(2 / C(t, t_0)),   β_k ~ Beta(1, t)

independently for k = 2, …, d. Consequently, E(c) = 2(d − 1) / (C(t, t_0)(1 + t)).

[Plot: |V_1| (0–1) vs. iteration t (1–15), for d = 1000, 100, 12, 2.]
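The expectation in the theorem follows from independence: E[α_k] = 2/C(t, t_0) and E[β_k] = 1/(1 + t). A sketch (the helper name is mine) that also cross-checks the two-item case against the Beta(1, 2) mean used earlier:

```python
from math import comb

def expected_c(d, t, t0):
    """E[c] = sum over dims 2..d of E[alpha_k] * E[beta_k]
            = 2 (d - 1) / (C(t, t0) (1 + t))."""
    return 2 * (d - 1) / (comb(t, t0) * (1 + t))

# With t = 2 items, one negative and one positive, each extra dimension
# contributes E|x_1k - x_2k| = 1/3, the Beta(1,2) mean
check = expected_c(2, 2, 1)   # one extra dimension: should equal 1/3
```

As t grows, C(t, t_0)(1 + t) grows quickly, so E(c) shrinks: later teaching items can, and should, move toward the decision boundary.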


Conclusion

Machine learning and cognitive science have much to offer each other.

Acknowledgments
- Bryan Gibson, Faisal Khan, Bilge Mutlu, Tim Rogers
- NSF CAREER Award IIS-0953219
- AFOSR FA9550-09-1-0313
- The Wisconsin Alumni Research Foundation