CS 109 Lecture 20, May 11th, 2016 (Stanford University)

Page 1

CS 109 Lecture 20

May 11th, 2016

Page 2

Review

Page 3

Classification Task

Examples: Heart, Ancestry, Netflix

Page 4

Training

[Pipeline diagram: Real World Problem → (model the problem) → Formal Model 𝜃; Training Data → Learning Algorithm → g𝜃(X); Testing Data → Classification Accuracy]

Page 5

Testing

[Same pipeline diagram: Real World Problem → (model the problem) → Formal Model 𝜃; Learning Algorithm and Training Data produce g𝜃(X), which is applied to Testing Data to measure Classification Accuracy]

Page 6

• We consider the statistical learning paradigm here
§ We are given a set of N "training" instances
o Each training instance is a pair: (<x1, x2, …, xm>, y)
o Training instances are previously observed data
o Gives the output value y associated with each observed vector of input values <x1, x2, …, xm>
§ Learning: use training data to specify g(X)
o Generally, first select a parametric form for g(X)
o Then, estimate parameters of model g(X) using training data
o For classification, generally the best choice of g is:

$\hat{g}(\mathbf{X}) = \hat{Y} = \arg\max_y \hat{P}(Y = y \mid \mathbf{X}) = \arg\max_y \hat{P}(\mathbf{X}, Y = y) = \arg\max_y \hat{P}(\mathbf{X} \mid Y = y)\,\hat{P}(Y = y)$

Training a Learning Machine

Page 7

• Say, we have m input values X = <X1, X2, …, Xm>
§ Assume variables X1, X2, …, Xm are conditionally independent given Y
o Really don't believe X1, X2, …, Xm are conditionally independent
o Just an approximation we make to be able to make predictions
o This is called the "Naive Bayes" assumption, hence the name
§ Predict Y using
$\hat{Y} = \arg\max_y P(\mathbf{X}, Y = y) = \arg\max_y P(\mathbf{X} \mid Y = y)\,P(Y = y)$
o But, we now have, by conditional independence:
$P(\mathbf{X} \mid Y) = P(X_1, X_2, \ldots, X_m \mid Y) = \prod_{i=1}^{m} P(X_i \mid Y)$
§ Note: computation of the PMF table is linear in m: O(m)
o Don't need much data to get good probability estimates

Naïve Bayes Classifier
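To make this prediction rule concrete, here is a minimal Python sketch of Naive Bayes prediction for binary features, assuming the prior and conditional probabilities have already been estimated; the names prior and cond and the numbers are illustrative, not from the slides.

import math

def naive_bayes_predict(x, prior, cond):
    """Predict y in {0, 1} for a binary feature vector x.

    prior[y]   : estimated P(Y = y)
    cond[y][i] : estimated P(X_i = 1 | Y = y)
    Works in log space: log P(X, Y=y) = log P(Y=y) + sum_i log P(X_i | Y=y).
    """
    best_y, best_log_joint = None, -math.inf
    for y in (0, 1):
        log_joint = math.log(prior[y])
        for i, xi in enumerate(x):
            p_one = cond[y][i]                      # P(X_i = 1 | Y = y)
            log_joint += math.log(p_one if xi == 1 else 1.0 - p_one)
        if log_joint > best_log_joint:
            best_y, best_log_joint = y, log_joint
    return best_y

# Tiny illustrative example (numbers are made up):
prior = {0: 0.6, 1: 0.4}
cond = {0: [0.2, 0.7], 1: [0.8, 0.3]}
print(naive_bayes_predict([1, 0], prior, cond))   # -> 1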

Page 8

• Various probabilities you will need to compute for the Naive Bayes Classifier (using MLE here):

$\hat{Y} = \arg\max_y P(\mathbf{X} \mid Y = y)\,P(Y = y) = \arg\max_y \log\left[P(\mathbf{X} \mid Y = y)\,P(Y = y)\right]$

$\log P(\mathbf{X} \mid Y) = \log P(X_1, \ldots, X_m \mid Y) = \log \prod_{i=1}^{m} P(X_i \mid Y) = \sum_{i=1}^{m} \log P(X_i \mid Y)$

$\hat{P}(Y = 0) = \dfrac{\#\text{ instances in class } 0}{\text{total } \#\text{ instances}}$

$\hat{P}(X_i = 0, Y = 0) = \dfrac{\#\text{ instances where } X_i = 0 \text{ and class } 0}{\text{total } \#\text{ instances}}$

$\hat{P}(X_i = 0 \mid Y = 0) = \dfrac{\hat{P}(X_i = 0, Y = 0)}{\hat{P}(Y = 0)} \qquad \hat{P}(X_i = 0 \mid Y = 1) = \dfrac{\hat{P}(X_i = 0, Y = 1)}{\hat{P}(Y = 1)}$

$\hat{P}(X_i = 1 \mid Y = 0) = 1 - \hat{P}(X_i = 0 \mid Y = 0)$

Computing Probabilities from Data
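A minimal sketch of these MLE counts in Python, assuming binary features and labels stored as (x, y) pairs; the function and variable names are illustrative, and Laplace smoothing (not shown here) is often added in practice to avoid zero counts.

from collections import Counter

def naive_bayes_mle(data):
    """Estimate Naive Bayes parameters from (x, y) pairs with binary features.

    data: list of (x, y) where x is a list of 0/1 features and y is 0 or 1.
    Returns prior[y] = P_hat(Y = y) and cond[y][i] = P_hat(X_i = 1 | Y = y).
    """
    n = len(data)
    m = len(data[0][0])
    class_counts = Counter(y for _, y in data)
    prior = {y: class_counts[y] / n for y in (0, 1)}

    cond = {0: [0.0] * m, 1: [0.0] * m}
    for y in (0, 1):
        for i in range(m):
            ones = sum(1 for x, label in data if label == y and x[i] == 1)
            cond[y][i] = ones / class_counts[y]    # P_hat(X_i=1, Y=y) / P_hat(Y=y)
    return prior, cond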

Page 9

On biased datasets

Page 10

Ancestry dataset prediction

East Asian or Admixed American (Native + Early Immigrants)

Page 11

Is the ancestry dataset biased?

Page 12

Yes!

Page 13

It is much easier to write a binary classifier when learning ML for the first time

Page 14

Learn Two Things From This

1. What classification with DNA Single Nucleotide Polymorphisms looks like.

2. That genetic ancestry paints a more realistic picture of how we are mixed in many nuanced ways.

3. The importance of choosing the right data to learn from. Your results will be as biased as your dataset.

Know it so you can beat it!

Page 15

Ethics in Machine Learning is a whole new field

Page 16

End Review

Page 17

Notation For Today

Sigmoid function:
$\sigma(z) = \dfrac{1}{1 + e^{-z}}$

Weighted sum (aka dot product):
$\theta^T x = \sum_{i=1}^{n} \theta_i x_i = \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$

Sigmoid function of weighted sum:
$\sigma(\theta^T x) = \dfrac{1}{1 + e^{-\theta^T x}}$
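A minimal Python sketch of this notation (function names are illustrative):

import math

def weighted_sum(theta, x):
    """theta^T x = sum_i theta_i * x_i."""
    return sum(t * xi for t, xi in zip(theta, x))

def sigmoid(z):
    """sigma(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Sigmoid of a weighted sum, e.g. theta = [1.0, -2.0], x = [3.0, 1.0]:
print(sigmoid(weighted_sum([1.0, -2.0], [3.0, 1.0])))   # sigma(1.0) ~= 0.731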

Page 18

Step 1: Big Picture

Page 19

• Recall the Naive Bayes Classifier
§ Predict
$\hat{Y} = \arg\max_y P(\mathbf{X}, Y = y) = \arg\max_y P(\mathbf{X} \mid Y = y)\,P(Y = y)$
§ Use the assumption that
$P(\mathbf{X} \mid Y) = P(X_1, X_2, \ldots, X_m \mid Y) = \prod_{i=1}^{m} P(X_i \mid Y)$
§ We are really modeling the joint probability P(X, Y)

• But for classification, we really care about P(Y | X)
§ Really want to predict
$\hat{Y} = \arg\max_y P(Y = y \mid \mathbf{X})$
§ Modeling the full joint probability P(X, Y) is equivalent

• Could we model P(Y | X) directly?
§ Welcome our friend: logistic regression!

From Naïve Bayes to Logistic Regression

Page 20

• Model the conditional likelihood P(Y | X) directly
§ Model this probability with the logistic function:
$P(Y = 1 \mid \mathbf{X}) = \sigma(z) \quad \text{where} \quad z = \theta_0 + \sum_{i=1}^{m} \theta_i x_i$
Recall the sigmoid function: $\sigma(z) = \dfrac{1}{1 + e^{-z}}$
§ For simplicity define $x_0 = 1$, so $z = \theta^T x$ and
$P(Y = 1 \mid X = x) = \sigma(\theta^T x)$
§ Since P(Y = 0 | X) + P(Y = 1 | X) = 1:
$P(Y = 0 \mid X = x) = 1 - \sigma(\theta^T x)$

Logistic Regression Assumption

Page 21

[Plot of the sigmoid function $f(z) = \dfrac{1}{1 + e^{-z}}$ versus z]

Note: inflection point at z = 0. f(0) = 0.5

Want to distinguish y = 1 (blue) points from y = 0 (red) points

The Sigmoid Function

Page 22

[Plot comparing two curves: the standard normal CDF and the sigmoid]

The Sigmoid Function

Page 23

What is in a Name

Regression Algorithms: Linear Regression
Classification Algorithms: Naïve Bayes, Logistic Regression

If Chris could rename it he would call it: Sigmoidal Classification

Awesome classifier, terrible name

Page 24

Training Data

Assume IID data: $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})$

$m = |x^{(i)}|$: each datapoint has m features and a single y

Page 25

Logistic Regression

1. Make the logistic regression assumption:
$P(Y = 1 \mid X = x) = \sigma(\theta^T x)$
$P(Y = 0 \mid X = x) = 1 - \sigma(\theta^T x)$

2. Calculate the log probability for all data:
$LL(\theta) = \sum_{i=1}^{n} y^{(i)} \log \sigma(\theta^T x^{(i)}) + (1 - y^{(i)}) \log\left[1 - \sigma(\theta^T x^{(i)})\right]$

3. Get the derivative of the log probability with respect to the thetas:
$\dfrac{\partial LL(\theta)}{\partial \theta_j} = \sum_{i=1}^{n} \left[ y^{(i)} - \sigma(\theta^T x^{(i)}) \right] x_j^{(i)}$
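A minimal Python sketch of steps 2 and 3, assuming each x already includes the constant feature x0 = 1 (function names are illustrative):

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(theta, xs, ys):
    """LL(theta) = sum_i y_i * log(sigma(theta^T x_i)) + (1 - y_i) * log(1 - sigma(theta^T x_i))."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def gradient(theta, xs, ys):
    """dLL/dtheta_j = sum_i (y_i - sigma(theta^T x_i)) * x_ij."""
    grad = [0.0] * len(theta)
    for x, y in zip(xs, ys):
        error = y - sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        for j, xj in enumerate(x):
            grad[j] += error * xj
    return grad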

Page 26

Gradient Ascent

Walk uphill and you will find a local maximum (if your step size is small enough)

The logistic regression LL function is concave (the optimization problem is convex), so the local maximum found is the global maximum

Page 27

Gradient Ascent Step

$\dfrac{\partial LL(\theta)}{\partial \theta_j} = \sum_{i=1}^{n} \left[ y^{(i)} - \sigma(\theta^T x^{(i)}) \right] x_j^{(i)}$

$\theta_j^{\text{new}} = \theta_j^{\text{old}} + \eta \cdot \dfrac{\partial LL(\theta^{\text{old}})}{\partial \theta_j^{\text{old}}} = \theta_j^{\text{old}} + \eta \cdot \sum_{i=1}^{n} \left[ y^{(i)} - \sigma(\theta^T x^{(i)}) \right] x_j^{(i)}$

Do this for all thetas!

[Plot of the LL(𝜃) surface over 𝜃1 and 𝜃2]

Page 28

Initialize: 𝜃j = 0 for all 0 ≤ j ≤ m

// "epochs" = number of passes over data during learning
for (i = 0; i < epochs; i++) {
    Initialize: gradient[j] = 0 for all 0 ≤ j ≤ m

    // Compute "batch" gradient vector
    for each training instance (<x1, x2, …, xm>, y) in data {
        // Add contribution to gradient for each data point
        for (j = 0; j <= m; j++) {
            // Note: xj below is the j-th input variable and x0 = 1.
            gradient[j] += xj * (y − 1 / (1 + e^(−z)))    where z = Σ_{k=0}^{m} 𝜃k xk
        }
    }

    // Update all 𝜃j. Note learning rate η is a pre-set constant.
    𝜃j += η * gradient[j] for all 0 ≤ j ≤ m
}

Logistic Regression Training
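For reference, a runnable Python sketch of the same batch gradient ascent loop (the function name and toy data below are illustrative, not from the slides):

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(data, epochs=1000, eta=0.01):
    """Batch gradient ascent on the logistic regression log likelihood.

    data: list of (x, y) pairs; each x is a list of m features. x0 = 1 is
    prepended here so theta[0] plays the role of the intercept theta_0.
    """
    m = len(data[0][0])
    theta = [0.0] * (m + 1)                      # theta_0 ... theta_m
    for _ in range(epochs):
        gradient = [0.0] * (m + 1)
        for x, y in data:
            x = [1.0] + list(x)                  # x0 = 1
            z = sum(t * xj for t, xj in zip(theta, x))
            error = y - sigmoid(z)               # y - P(Y = 1 | x)
            for j, xj in enumerate(x):
                gradient[j] += error * xj
        theta = [t + eta * g for t, g in zip(theta, gradient)]
    return theta

# Toy usage (illustrative data): y = 1 roughly when the single feature is large.
data = [([0.5], 0), ([1.0], 0), ([2.5], 1), ([3.0], 1)]
print(train_logistic_regression(data))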

Page 29

• Training: determine parameters 𝜃j (for all 0 ≤ j ≤ m)
§ After parameters 𝜃j have been learned, test classifier

• To test classifier, for each new (test) instance X:
§ Compute:
$p = P(Y = 1 \mid \mathbf{X}) = \dfrac{1}{1 + e^{-z}}, \quad \text{where } z = \theta^T x = \sum_{j=0}^{m} \theta_j X_j$
§ Classify instance as:
$\hat{y} = \begin{cases} 1 & \text{if } p > 0.5 \\ 0 & \text{otherwise} \end{cases}$
§ Note about evaluation set-up: parameters 𝜃j are not updated during "testing" phase

Classification with Logistic Regression
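A minimal sketch of this test-time rule in Python, assuming theta comes from a training procedure like the one above (the helper name is illustrative):

import math

def classify(theta, x):
    """Return (p, y_hat): p = P(Y = 1 | x) under the learned theta; y_hat = 1 iff p > 0.5.

    x is the raw feature vector; x0 = 1 is prepended to match theta_0.
    """
    x = [1.0] + list(x)
    z = sum(t * xj for t, xj in zip(theta, x))   # z = theta^T x
    p = 1.0 / (1.0 + math.exp(-z))
    return p, 1 if p > 0.5 else 0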

Page 30

Step 2: How Come?

Page 31

Logistic Regression

1. Make the logistic regression assumption:
$P(Y = 1 \mid X = x) = \sigma(\theta^T x)$
$P(Y = 0 \mid X = x) = 1 - \sigma(\theta^T x)$

2. Calculate the log probability for all data:
$LL(\theta) = \sum_{i=1}^{n} y^{(i)} \log \sigma(\theta^T x^{(i)}) + (1 - y^{(i)}) \log\left[1 - \sigma(\theta^T x^{(i)})\right]$

3. Get the derivative of the log probability with respect to the thetas:
$\dfrac{\partial LL(\theta)}{\partial \theta_j} = \sum_{i=1}^{n} \left[ y^{(i)} - \sigma(\theta^T x^{(i)}) \right] x_j^{(i)}$

Page 32

How did we get that LL function?

Page 33

Log Probability of Data

$P(Y = 1 \mid X = x) = \sigma(\theta^T x)$
$P(Y = 0 \mid X = x) = 1 - \sigma(\theta^T x)$

These two cases can be written as one expression:
$P(Y = y \mid X = x) = \sigma(\theta^T x)^{y} \cdot \left[1 - \sigma(\theta^T x)\right]^{(1 - y)}$

$L(\theta) = \prod_{i=1}^{n} P(Y = y^{(i)} \mid X = x^{(i)}) = \prod_{i=1}^{n} \sigma(\theta^T x^{(i)})^{y^{(i)}} \cdot \left[1 - \sigma(\theta^T x^{(i)})\right]^{(1 - y^{(i)})}$

Taking the log gives:
$LL(\theta) = \sum_{i=1}^{n} y^{(i)} \log \sigma(\theta^T x^{(i)}) + (1 - y^{(i)}) \log\left[1 - \sigma(\theta^T x^{(i)})\right]$

Page 34

How did we get that gradient?

Page 35

Sigmoid has a Beautiful Slope

Sigmoid, you should be a ski hill.

We want: $\dfrac{\partial}{\partial \theta_j} \sigma(\theta^T x) = \; ?$

True fact about sigmoid functions:
$\dfrac{\partial}{\partial z} \sigma(z) = \sigma(z)\left[1 - \sigma(z)\right]$

Chain rule!
$\dfrac{\partial}{\partial \theta_j} \sigma(\theta^T x) = \dfrac{\partial \sigma(z)}{\partial z} \cdot \dfrac{\partial z}{\partial \theta_j}$

Plug and chug:
$\dfrac{\partial}{\partial \theta_j} \sigma(\theta^T x) = \sigma(\theta^T x)\left[1 - \sigma(\theta^T x)\right] x_j$
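A quick numerical sanity check of that "true fact" (a sketch, not from the slides): compare a central-difference estimate of dσ/dz against σ(z)[1 − σ(z)].

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Numerically check d/dz sigma(z) = sigma(z) * (1 - sigma(z)) at a few points.
for z in (-2.0, 0.0, 1.5):
    h = 1e-6
    numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)   # central difference
    analytic = sigmoid(z) * (1 - sigmoid(z))
    print(z, round(numeric, 6), round(analytic, 6))         # the two columns agree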

Page 36

Gradient Update

$LL(\theta) = \sum_{i=1}^{n} y^{(i)} \log \sigma(\theta^T x^{(i)}) + (1 - y^{(i)}) \log\left[1 - \sigma(\theta^T x^{(i)})\right]$

For a single data point $(x, y)$:

$\begin{aligned}
\frac{\partial LL(\theta)}{\partial \theta_j} &= \frac{\partial}{\partial \theta_j} \, y \log \sigma(\theta^T x) + \frac{\partial}{\partial \theta_j} (1 - y) \log\left[1 - \sigma(\theta^T x)\right] \\
&= \left[ \frac{y}{\sigma(\theta^T x)} - \frac{1 - y}{1 - \sigma(\theta^T x)} \right] \frac{\partial}{\partial \theta_j} \sigma(\theta^T x) \\
&= \left[ \frac{y}{\sigma(\theta^T x)} - \frac{1 - y}{1 - \sigma(\theta^T x)} \right] \sigma(\theta^T x)\left[1 - \sigma(\theta^T x)\right] x_j \\
&= \left[ \frac{y - \sigma(\theta^T x)}{\sigma(\theta^T x)\left[1 - \sigma(\theta^T x)\right]} \right] \sigma(\theta^T x)\left[1 - \sigma(\theta^T x)\right] x_j \\
&= \left[ y - \sigma(\theta^T x) \right] x_j
\end{aligned}$

For many data points:

$\dfrac{\partial LL(\theta)}{\partial \theta_j} = \sum_{i=1}^{n} \left[ y^{(i)} - \sigma(\theta^T x^{(i)}) \right] x_j^{(i)}$

Page 37

Logistic Regression

1. Make the logistic regression assumption:
$P(Y = 1 \mid X = x) = \sigma(\theta^T x)$
$P(Y = 0 \mid X = x) = 1 - \sigma(\theta^T x)$

2. Calculate the log probability for all data:
$LL(\theta) = \sum_{i=1}^{n} y^{(i)} \log \sigma(\theta^T x^{(i)}) + (1 - y^{(i)}) \log\left[1 - \sigma(\theta^T x^{(i)})\right]$

3. Get the derivative of the log probability with respect to the thetas:
$\dfrac{\partial LL(\theta)}{\partial \theta_j} = \sum_{i=1}^{n} \left[ y^{(i)} - \sigma(\theta^T x^{(i)}) \right] x_j^{(i)}$

Page 38

Step 3: Philosophy

Page 39

§ Logistic regression is trying to fit a line that separates data instances where y = 1 from those where y = 0:
$\theta^T x = 0 \quad\Longleftrightarrow\quad \theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_m x_m = 0$
§ We call such data (or the functions generating the data) "linearly separable"
§ Naïve Bayes is linear too, as there is no interaction between different features.

Discrimination Intuition

Page 40

• Some data sets/functions are not separable
§ Consider the function: y = x1 XOR x2
§ Note: y = 1 iff exactly one of x1 or x2 = 1
§ Not possible to draw a line that successfully separates all the y = 1 points (blue) from the y = 0 points (red)
§ Despite this fact, logistic regression and Naive Bayes still often work well in practice (see the sketch below)

[Plot: the four XOR points on X1 and X2 axes, with (0,1) and (1,0) labeled y = 1 and (0,0) and (1,1) labeled y = 0]

Some Data Not Linearly Separable
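As a small illustration (not a proof), a brute-force search over a coarse grid of candidate lines θ0 + θ1x1 + θ2x2 finds none that classifies all four XOR points correctly; the grid and names below are illustrative.

import itertools

# The four XOR points and their labels: y = 1 iff exactly one of x1, x2 is 1.
points = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def linearly_classifies(theta0, theta1, theta2):
    """True if sign(theta0 + theta1*x1 + theta2*x2) matches every XOR label."""
    for (x1, x2), y in points:
        z = theta0 + theta1 * x1 + theta2 * x2
        if (z > 0) != (y == 1):
            return False
    return True

# Search a coarse grid of candidate lines; none of them separates XOR.
grid = [i / 2 for i in range(-10, 11)]
found = any(linearly_classifies(a, b, c) for a, b, c in itertools.product(grid, repeat=3))
print(found)   # False: no line in this grid separates the XOR data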

Page 41

• Compare Naive Bayes and Logistic Regression
§ Recall that Naive Bayes models P(X, Y) = P(X | Y) P(Y)
§ Logistic Regression directly models P(Y | X)
§ We call Naive Bayes a "generative model"
o Tries to model the joint distribution of how data is "generated"
o I.e., could use P(X, Y) to generate new data points if we wanted
o But lots of effort to model something that may not be needed
§ We call Logistic Regression a "discriminative model"
o Just tries to model a way to discriminate y = 0 vs. y = 1 cases
o Cannot use the model to generate new data points (no P(X, Y))
o Note: Logistic Regression can be generalized to more than two output values for y (have multiple sets of parameters 𝜃j)

Logistic Regression vs Naïve Bayes

Page 42

• Many trade-offs in choosing a learning algorithm
§ Continuous input variables
o Logistic Regression easily deals with continuous inputs
o Naive Bayes needs to use some parametric form for continuous inputs (e.g., Gaussian) or "discretize" continuous values into ranges (e.g., temperature in range: <50, 50-60, 60-70, >70)
§ Discrete input variables
o Naive Bayes naturally handles multi-valued discrete data by using a multinomial distribution for P(Xi | Y)
o Logistic Regression requires some sort of representation of multi-valued discrete data (e.g., multiple binary features; see the sketch below)
o Say Xi ∈ {A, B, C}. Not necessarily a good idea to encode Xi as taking on input values 1, 2, or 3 corresponding to A, B, or C.

Choosing an Algorithm?
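One common way to feed multi-valued discrete data to Logistic Regression is a "one-hot" encoding into multiple binary features, sketched below (the helper name is illustrative):

def one_hot(value, categories):
    """Encode a multi-valued discrete feature as several binary features.

    E.g., Xi in {"A", "B", "C"} becomes three 0/1 inputs instead of 1/2/3.
    """
    return [1 if value == c else 0 for c in categories]

print(one_hot("B", ["A", "B", "C"]))   # [0, 1, 0]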

Page 43

• Goal of machine learning: build models that generalize well to predicting new data
§ "Overfitting": fitting the training data too well, so we lose generality of the model
o Example: linear regression vs. Newton's interpolating polynomial
o The interpolating polynomial fits the training data perfectly!
o Which would you rather use to predict a new data point?

Good ML = Generalization

Page 44

• Logistic Regression can more easily overfit training data than Naive Bayes
§ Logistic Regression is not modeling the whole distribution, it is just optimizing the prediction of Y
§ Overfitting is problematic if the distributions of training data and testing data differ a bit
§ There are methods to mitigate overfitting in Logistic Regression
o Called "regularizers" (see the sketch below)
o Use Bayesian priors on the parameters 𝜃j rather than just maximizing conditional likelihood (like MAP)
o Many others!

To Consider with Logistic Regression
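As one concrete example of a regularizer (a sketch, not from the slides): an L2 penalty on the parameters, which corresponds to a zero-mean Gaussian prior on each 𝜃j (a MAP estimate), simply subtracts λ𝜃j from each gradient component; lam is an illustrative hyperparameter.

def regularized_gradient(grad, theta, lam=0.1):
    """Gradient of LL(theta) - (lam/2) * sum_j theta_j^2 with respect to each theta_j."""
    return [g - lam * t for g, t in zip(grad, theta)]

# Usage inside a gradient ascent loop: replace the raw gradient with
# regularized_gradient(gradient, theta) before the theta update.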

Page 45

• Consider logistic regression as:

• A neural network

[Diagram: inputs x1, x2, x3, x4 connect with weights β1, β2, β3, β4 to a single output node y]

Logistic regression is the same as a one-node neural network

Logistic Regression and Neural Networks

Page 46

• A neuron

• Your brain

Actually, it's probably someone else's brain

[Diagram: the same one-node network with inputs x1, x2, x3, x4, weights β1, β2, β3, β4, and output y]

Biological Basis for Neural Networks

Page 47

Next up: Neural Networks!