
Page 1: Midterm Reviews (CS 229 / STATS 229)

Midterm Reviews (CS 229/ STATS 229)

Zhihan Xiong

Stanford University

15th May, 2020


Page 2: Outline

1. Supervised Learning
   - Discriminative Algorithms
   - Generative Algorithms
   - Kernel and SVM
2. Neural Networks
3. Unsupervised Learning
4. Practice Exam Problem (If time permits)

Page 3: Outline (repeated; now in Supervised Learning / Discriminative Algorithms)

Page 4: Optimization Methods

Gradient and Hessian (differentiable function $f : \mathbb{R}^d \to \mathbb{R}$):

$$\nabla_x f = \begin{bmatrix} \frac{\partial f}{\partial x_1} & \cdots & \frac{\partial f}{\partial x_d} \end{bmatrix}^T \in \mathbb{R}^d \qquad \text{(Gradient)}$$

$$\nabla_x^2 f = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_d} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_d \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_d^2} \end{bmatrix} \in \mathbb{R}^{d \times d} \qquad \text{(Hessian)}$$

Gradient Descent and Newton's Method (objective function $J(\theta)$):

$$\theta^{(t+1)} = \theta^{(t)} - \alpha \nabla_\theta J(\theta^{(t)}) \qquad \text{(Gradient descent)}$$

$$\theta^{(t+1)} = \theta^{(t)} - \left[\nabla_\theta^2 J(\theta^{(t)})\right]^{-1} \nabla_\theta J(\theta^{(t)}) \qquad \text{(Newton's method)}$$
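As a quick illustration, here is a minimal NumPy sketch (not from the slides; the quadratic objective, the matrix $A$, and the step size are assumptions chosen for illustration) comparing the two update rules. On a quadratic, Newton's method converges in a single step while gradient descent takes many small ones.

```python
import numpy as np

# Sketch: gradient descent vs. Newton's method on the quadratic
# J(theta) = 0.5 * theta^T A theta - b^T theta, whose gradient is
# A @ theta - b and whose Hessian is the constant matrix A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # positive-definite Hessian (assumed)
b = np.array([1.0, -1.0])
grad = lambda th: A @ th - b

theta_gd = np.zeros(2)
for _ in range(100):                     # gradient descent: many small steps
    theta_gd = theta_gd - 0.1 * grad(theta_gd)

# Newton's method: one step from 0 already lands at the minimizer
theta_nt = np.zeros(2) - np.linalg.solve(A, grad(np.zeros(2)))

print(theta_gd, theta_nt, np.linalg.solve(A, b))  # all approach A^{-1} b
```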


Page 5: Least Squares—Gradient Descent

Model: $h_\theta(x) = \theta^T x$

Training data: $\{(x^{(i)}, y^{(i)})\}_{i=1}^n$, with $x^{(i)} \in \mathbb{R}^d$

Loss: $J(\theta) = \frac{1}{2}\sum_{i=1}^n \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$

Update rule:

$$\theta^{(t+1)} = \theta^{(t)} - \alpha \sum_{i=1}^n \left(h_\theta(x^{(i)}) - y^{(i)}\right) x^{(i)}$$

Stochastic Gradient Descent (SGD)

Pick one data point $x^{(i)}$ and then update:

$$\theta^{(t+1)} = \theta^{(t)} - \alpha \left(h_\theta(x^{(i)}) - y^{(i)}\right) x^{(i)}$$
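A rough NumPy sketch of both update rules (the synthetic data, true parameter, and step sizes below are assumptions, not from the slides): batch gradient descent sums the residual-weighted features over all examples, while SGD uses one randomly chosen example per step.

```python
import numpy as np

# Sketch: batch gradient descent vs. SGD for least squares, h_theta(x) = theta^T x.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                      # n = 100, d = 3 (assumed data)
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)

theta = np.zeros(3)
for _ in range(300):                               # batch: full gradient per step
    theta -= 0.005 * X.T @ (X @ theta - y)

theta_sgd = np.zeros(3)
for _ in range(2000):                              # SGD: one random example per step
    i = rng.integers(len(y))
    theta_sgd -= 0.01 * (X[i] @ theta_sgd - y[i]) * X[i]

print(theta, theta_sgd)                            # both near [2, -1, 0.5]
```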


Page 6: Least Squares—Closed Form

Loss in matrix form: $J(\theta) = \frac{1}{2}\|X\theta - y\|_2^2$, where $X \in \mathbb{R}^{n \times d}$, $y \in \mathbb{R}^n$

Normal Equation (set gradient to 0):

$$X^T (X\theta^\star - y) = 0$$

Closed form solution:

$$\theta^\star = \left(X^T X\right)^{-1} X^T y$$

Connection to Newton's Method

One Newton step from $\theta = 0$ lands exactly at the minimizer: since $\nabla_\theta J(0) = -X^T y$ and $\nabla_\theta^2 J = X^T X$, the step $-[\nabla_\theta^2 J]^{-1}\nabla_\theta J(0)$ equals $\theta^\star$. Newton's method is exact for a second-order (quadratic) objective.
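A quick numerical check (random data assumed for illustration) that the closed-form solution satisfies the normal equation; `np.linalg.lstsq` computes the same minimizer more stably than forming $X^T X$ explicitly.

```python
import numpy as np

# Sketch: verify the normal equation X^T (X theta* - y) = 0 numerically.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)

theta_star = np.linalg.solve(X.T @ X, X.T @ y)       # (X^T X)^{-1} X^T y
print(np.allclose(X.T @ (X @ theta_star - y), 0))    # True: gradient is zero
print(np.allclose(theta_star,
                  np.linalg.lstsq(X, y, rcond=None)[0]))  # same minimizer
```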


Page 7: Logistic Regression

A binary classification model with $y^{(i)} \in \{0, 1\}$.

Assumed model:

$$p(y \mid x; \theta) = \begin{cases} g_\theta(x) & \text{if } y = 1 \\ 1 - g_\theta(x) & \text{if } y = 0 \end{cases}, \qquad \text{where } g_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$

Log-likelihood function:

$$\ell(\theta) = \sum_{i=1}^n \log p(y^{(i)} \mid x^{(i)}; \theta) = \sum_{i=1}^n \left[ y^{(i)} \log g_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - g_\theta(x^{(i)})\right) \right]$$

Find parameters by maximizing the log-likelihood, $\max_\theta \ell(\theta)$ (in Pset 1).
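A minimal sketch of the maximization (synthetic data and step size are assumptions): since $\nabla_\theta \ell(\theta) = \sum_i (y^{(i)} - g_\theta(x^{(i)}))\, x^{(i)}$, plain gradient ascent on the log-likelihood already works.

```python
import numpy as np

# Sketch: logistic regression fit by gradient ascent on l(theta).
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
theta_true = np.array([1.5, -2.0])                     # assumed ground truth
y = (rng.random(200) < 1 / (1 + np.exp(-X @ theta_true))).astype(float)

theta = np.zeros(2)
for _ in range(1000):
    g = 1 / (1 + np.exp(-X @ theta))                   # g_theta(x) for all examples
    theta += 0.01 * X.T @ (y - g)                      # ascent on the log-likelihood

print(theta)  # roughly recovers theta_true
```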


Page 8: The Exponential Family

Definition

A probability distribution (with natural parameter $\eta$) whose density (or mass function) can be written in the following form:

$$p(y; \eta) = b(y) \exp\left(\eta^T T(y) - a(\eta)\right)$$

Example

Bernoulli distribution:

$$p(y; \phi) = \phi^y (1 - \phi)^{1-y} = \exp\left(\left(\log\frac{\phi}{1-\phi}\right) y + \log(1 - \phi)\right)$$

$$\Longrightarrow\; b(y) = 1, \quad T(y) = y, \quad \eta = \log\frac{\phi}{1-\phi}, \quad a(\eta) = \log(1 + e^\eta)$$


Page 9: The Exponential Family

More Examples

Categorical distribution, Poisson distribution, (Multivariate) normal distribution, etc.

Properties (in Pset 1)

$$\mathbb{E}[T(Y); \eta] = \nabla_\eta a(\eta), \qquad \mathrm{Var}(T(Y); \eta) = \nabla^2_\eta a(\eta)$$

Non-exponential Family Distribution

Uniform distribution over the interval $[a, b]$:

$$p(y; a, b) = \frac{1}{b - a}\cdot \mathbf{1}\{a \le y \le b\}$$

Reason: the indicator $\mathbf{1}\{a \le y \le b\}$ would have to live in $b(y)$, but $b(y)$ cannot depend on the parameters.
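A small numerical check of the mean property for the Bernoulli family (the value of $\eta$ is arbitrary): $a(\eta) = \log(1 + e^\eta)$, so $\nabla_\eta a(\eta)$ should equal the Bernoulli mean $\phi = 1/(1 + e^{-\eta})$.

```python
import numpy as np

# Sketch: check E[T(Y); eta] = a'(eta) for Bernoulli via a finite difference.
a = lambda e: np.log1p(np.exp(e))         # log partition a(eta) = log(1 + e^eta)
eta, eps = 0.7, 1e-6
phi = 1 / (1 + np.exp(-eta))              # Bernoulli mean implied by eta
print((a(eta + eps) - a(eta - eps)) / (2 * eps), phi)  # both ~0.668
```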



Page 11: The Generalized Linear Model (GLM)

Components

Assumed model: $y \mid x; \theta \sim \text{ExponentialFamily}(\eta)$ with $\eta = \theta^T x$

Predictor: $h(x) = \mathbb{E}[T(Y); \eta] = \nabla_\eta a(\eta)$

Fitting through maximum likelihood:

$$\max_\theta \ell(\theta) = \max_\theta \sum_{i=1}^n \log p(y^{(i)} \mid x^{(i)}; \eta)$$

Examples

- GLM under Bernoulli distribution: logistic regression
- GLM under Poisson distribution: Poisson regression (in Pset 1)
- GLM under Normal distribution: linear regression
- GLM under Categorical distribution: softmax regression
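As a sketch of GLM fitting (the data-generating parameters and step size are assumptions), take the Poisson case: $a(\eta) = e^\eta$, so the predictor is $h(x) = e^{\theta^T x}$ and the log-likelihood gradient has the same $(y - h(x))\,x$ form as logistic regression.

```python
import numpy as np

# Sketch: Poisson regression (a GLM) fit by gradient ascent; the gradient of
# the log-likelihood is sum_i (y_i - exp(theta^T x_i)) x_i.
rng = np.random.default_rng(3)
X = 0.5 * rng.normal(size=(300, 2))
y = rng.poisson(np.exp(X @ np.array([1.0, -0.5])))   # assumed true theta

theta = np.zeros(2)
for _ in range(2000):
    theta += 1e-3 * X.T @ (y - np.exp(X @ theta))

print(theta)  # roughly [1.0, -0.5]
```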


Page 12: Outline (repeated; now in Supervised Learning / Generative Algorithms)

Page 13: Gaussian Discriminant Analysis (GDA)

Generative Algorithm for Classification

Learn $p(x \mid y)$ and $p(y)$.

Classify through Bayes rule: $\arg\max_y p(y \mid x) = \arg\max_y p(x \mid y)\, p(y)$

GDA Formulation

Assume $p(x \mid y) \sim \mathcal{N}(\mu_y, \Sigma)$ for some $\mu_y \in \mathbb{R}^d$ and $\Sigma \in \mathbb{R}^{d \times d}$.

Estimate $\mu_y$, $\Sigma$ and $p(y)$ through maximum likelihood, which is

$$\max \sum_{i=1}^n \left[\log p(x^{(i)} \mid y^{(i)}) + \log p(y^{(i)})\right]$$

and gives

$$p(y) = \frac{\sum_{i=1}^n \mathbf{1}\{y^{(i)} = y\}}{n}, \qquad \mu_y = \frac{\sum_{i=1}^n \mathbf{1}\{y^{(i)} = y\}\, x^{(i)}}{\sum_{i=1}^n \mathbf{1}\{y^{(i)} = y\}}, \qquad \Sigma = \frac{1}{n}\sum_{i=1}^n (x^{(i)} - \mu_{y^{(i)}})(x^{(i)} - \mu_{y^{(i)}})^T$$
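A compact sketch of the estimators on assumed two-class synthetic data: class frequencies for $p(y)$, per-class means for $\mu_y$, and one covariance pooled over all examples.

```python
import numpy as np

# Sketch: GDA maximum-likelihood estimates on two-class data.
rng = np.random.default_rng(4)
S = [[1.0, 0.3], [0.3, 1.0]]                               # assumed shared covariance
X = np.vstack([rng.multivariate_normal([0, 0], S, size=100),
               rng.multivariate_normal([2, 1], S, size=120)])
y = np.array([0] * 100 + [1] * 120)

phi = np.array([(y == c).mean() for c in (0, 1)])          # p(y)
mu = np.array([X[y == c].mean(axis=0) for c in (0, 1)])    # mu_y
R = X - mu[y]                                              # center by own class mean
Sigma = R.T @ R / len(y)                                   # pooled covariance
print(phi, mu, Sigma, sep="\n")
```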


Page 14: Naive Bayes

Formulation

Assume $p(x \mid y) = \prod_{j=1}^d p(x_j \mid y)$.

Estimate $p(x_j \mid y)$ and $p(y)$ through maximum likelihood, which gives

$$p(x_j \mid y) = \frac{\sum_{i=1}^n \mathbf{1}\{x_j^{(i)} = x_j,\; y^{(i)} = y\}}{\sum_{i=1}^n \mathbf{1}\{y^{(i)} = y\}}, \qquad p(y) = \frac{\sum_{i=1}^n \mathbf{1}\{y^{(i)} = y\}}{n}$$

Laplace Smoothing

Assume $x_j$ takes values in $\{1, 2, \ldots, k\}$; the corresponding modified estimator is

$$p(x_j \mid y) = \frac{1 + \sum_{i=1}^n \mathbf{1}\{x_j^{(i)} = x_j,\; y^{(i)} = y\}}{k + \sum_{i=1}^n \mathbf{1}\{y^{(i)} = y\}}$$
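A minimal sketch of the smoothed estimators (random categorical data assumed; features here take values in $\{0, \ldots, k-1\}$ rather than $\{1, \ldots, k\}$ for 0-based indexing):

```python
import numpy as np

# Sketch: Laplace-smoothed naive Bayes estimates for categorical features.
rng = np.random.default_rng(5)
n, d, k = 500, 3, 4
X = rng.integers(k, size=(n, d))          # categorical features (assumed data)
y = rng.integers(2, size=n)               # binary labels

p_y = np.array([(y == c).mean() for c in (0, 1)])
p = np.empty((2, d, k))                   # p[c, j, v] estimates p(x_j = v | y = c)
for c in (0, 1):
    Xc = X[y == c]
    for j in range(d):
        counts = np.bincount(Xc[:, j], minlength=k)
        p[c, j] = (1 + counts) / (k + len(Xc))  # add-one smoothing

print(p_y, p[0, 0], sep="\n")             # each row p[c, j] sums to 1
```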


Page 15: Outline (repeated; now in Supervised Learning / Kernel and SVM)

Page 16: Kernel

Motivation

Feature map: $\phi : \mathbb{R}^d \to \mathbb{R}^p$

Fitting a linear model with gradient descent gives us $\theta = \sum_{i=1}^n \beta_i\, \phi(x^{(i)})$

Predicting on a new example $z$: $h_\theta(z) = \sum_{i=1}^n \beta_i\, \phi(x^{(i)})^T \phi(z) = \sum_{i=1}^n \beta_i\, K(x^{(i)}, z)$

This brings in nonlinearity without much sacrifice in efficiency, as long as $K(\cdot, \cdot)$ can be computed efficiently.

Definition

$K(x, z) : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is a valid kernel if there exists $\phi : \mathbb{R}^d \to \mathbb{R}^p$ for some $p \ge 1$ such that

$$K(x, z) = \phi(x)^T \phi(z)$$
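A concrete check of the definition (a standard example, not from the slides): for $K(x, z) = (x^T z)^2$ on $\mathbb{R}^2$, an explicit feature map is $\phi(x) = [x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2]$, and the kernel equals the inner product of the mapped features.

```python
import numpy as np

# Sketch: K(x, z) = (x^T z)^2 equals phi(x)^T phi(z) for an explicit phi.
def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print((x @ z) ** 2, phi(x) @ phi(z))      # both 1.0, since x^T z = 1
```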


Page 17: Kernel (Continued)

Examples

Polynomial kernels: $K(x, z) = \left(x^T z + c\right)^d$, $\forall\, c \ge 0$ and $d \in \mathbb{N}$

Gaussian kernels: $K(x, z) = \exp\left(-\frac{\|x - z\|_2^2}{2\sigma^2}\right)$, $\forall\, \sigma^2 > 0$

More in Pset 2...

Theorem

$K(x, z)$ is a valid kernel if and only if for any set $\{x^{(1)}, \ldots, x^{(n)}\}$, its Gram matrix, defined as

$$G = \begin{bmatrix} K(x^{(1)}, x^{(1)}) & \cdots & K(x^{(1)}, x^{(n)}) \\ \vdots & \ddots & \vdots \\ K(x^{(n)}, x^{(1)}) & \cdots & K(x^{(n)}, x^{(n)}) \end{bmatrix} \in \mathbb{R}^{n \times n}$$

is positive semi-definite.
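A quick sanity check of the theorem's forward direction on assumed random points: the Gaussian-kernel Gram matrix should have no negative eigenvalues (up to roundoff).

```python
import numpy as np

# Sketch: build a Gaussian-kernel Gram matrix and check it is PSD.
rng = np.random.default_rng(6)
X = rng.normal(size=(20, 3))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise ||x - z||_2^2
G = np.exp(-sq / (2 * 1.0 ** 2))                      # sigma^2 = 1 (assumed)
print(np.linalg.eigvalsh(G).min() >= -1e-10)          # True: PSD up to roundoff
```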


Page 18: SVM

Formulation ($y \in \{-1, 1\}$)

$$\min_{w, b}\; \frac{1}{2}\|w\|_2^2 \quad \text{subject to } y^{(i)}(w^T x^{(i)} + b) \ge 1,\; \forall\, i \in \{1, \ldots, n\} \qquad \text{(Hard-SVM)}$$

$$\min_{w, b, \xi}\; \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^n \xi_i \quad \text{subject to } y^{(i)}(w^T x^{(i)} + b) \ge 1 - \xi_i,\; \xi_i \ge 0,\; \forall\, i \in \{1, \ldots, n\} \qquad \text{(Soft-SVM)}$$

Properties

- The optimal solution has the form $w^\star = \sum_{i=1}^n \alpha_i y^{(i)} x^{(i)}$ and thus can be kernelized.
- The soft-SVM can be treated as a minimization over hinge loss plus $\ell_2$ regularization:

$$\min_{w, b}\; \sum_{i=1}^n \max\left\{0,\; 1 - y^{(i)}(w^T x^{(i)} + b)\right\} + \lambda\|w\|_2^2$$
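A rough sketch of the hinge-loss view (assumed separable data, fixed step size, and $\lambda = 0.1$; a real solver would use the dual or a decaying step): subgradient descent, where each margin-violating example contributes $-y^{(i)} x^{(i)}$ to the $w$-subgradient.

```python
import numpy as np

# Sketch: soft-SVM via subgradient descent on hinge loss + l2 regularization.
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)   # assumed separable labels

w, b, lam, lr = np.zeros(2), 0.0, 0.1, 0.01
for _ in range(500):
    viol = y * (X @ w + b) < 1                   # margin violations
    gw = -(y[viol, None] * X[viol]).sum(axis=0) + 2 * lam * w
    gb = -y[viol].sum()
    w -= lr * gw
    b -= lr * gb

print(w, b, (np.sign(X @ w + b) == y).mean())    # high training accuracy
```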


Page 19: Outline (repeated; now in Neural Networks)

Page 20: Model Formulation

Multi-layer Fully-connected Neural Networks (with Activation Function f)

$$a^{[1]} = f\left(W^{[1]} x + b^{[1]}\right)$$
$$a^{[2]} = f\left(W^{[2]} a^{[1]} + b^{[2]}\right)$$
$$\cdots$$
$$a^{[r-1]} = f\left(W^{[r-1]} a^{[r-2]} + b^{[r-1]}\right)$$
$$h_\theta(x) = a^{[r]} = W^{[r]} a^{[r-1]} + b^{[r]}$$

Possible Activation Functions

- ReLU: $f(z) = \mathrm{ReLU}(z) := \max\{z, 0\}$
- Sigmoid: $f(z) = \frac{1}{1 + e^{-z}}$
- Hyperbolic tangent: $f(z) = \tanh(z) := \frac{e^z - e^{-z}}{e^z + e^{-z}}$
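A minimal sketch of the forward pass (layer sizes and random weights are assumptions), with ReLU hidden layers and the linear output layer from the last line above:

```python
import numpy as np

# Sketch: forward pass of a fully-connected network, linear output layer.
def forward(x, Ws, bs):
    a = x
    for W, b in zip(Ws[:-1], bs[:-1]):    # hidden layers: a = f(W a + b)
        a = np.maximum(W @ a + b, 0.0)    # f = ReLU
    return Ws[-1] @ a + bs[-1]            # output layer has no activation

rng = np.random.default_rng(8)
sizes = [4, 8, 8, 1]                      # assumed layer widths
Ws = [0.5 * rng.normal(size=(m, n)) for n, m in zip(sizes, sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]
print(forward(rng.normal(size=4), Ws, bs))
```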


Page 21: Backpropagation

Let $J$ be the loss function and $z^{[k]} = W^{[k]} a^{[k-1]} + b^{[k]}$. By the chain rule, we have

$$\frac{\partial J}{\partial W^{[r]}_{ij}} = \frac{\partial J}{\partial z^{[r]}_i}\frac{\partial z^{[r]}_i}{\partial W^{[r]}_{ij}} = \frac{\partial J}{\partial z^{[r]}_i}\, a^{[r-1]}_j \;\Longrightarrow\; \frac{\partial J}{\partial W^{[r]}} = \frac{\partial J}{\partial z^{[r]}}\, a^{[r-1]T}, \qquad \frac{\partial J}{\partial b^{[r]}} = \frac{\partial J}{\partial z^{[r]}}$$

$$\frac{\partial J}{\partial a^{[r-1]}_i} = \sum_{j=1}^{d_r} \frac{\partial J}{\partial z^{[r]}_j}\frac{\partial z^{[r]}_j}{\partial a^{[r-1]}_i} = \sum_{j=1}^{d_r} \frac{\partial J}{\partial z^{[r]}_j}\, W^{[r]}_{ji} \;\Longrightarrow\; \frac{\partial J}{\partial a^{[r-1]}} = W^{[r]T}\frac{\partial J}{\partial z^{[r]}}$$

$$\frac{\partial J}{\partial z^{[r]}} := \delta^{[r]} \;\Longrightarrow\; \frac{\partial J}{\partial z^{[r-1]}} = \left(W^{[r]T}\delta^{[r]}\right) \odot f'\left(z^{[r-1]}\right) := \delta^{[r-1]}$$

$$\Longrightarrow\; \frac{\partial J}{\partial W^{[r-1]}} = \delta^{[r-1]} a^{[r-2]T}, \qquad \frac{\partial J}{\partial b^{[r-1]}} = \delta^{[r-1]}$$

Continue for layers $r-2, \ldots, 1$.
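A compact sketch of these recursions for a two-layer network with squared loss $J = \frac{1}{2}\|a^{[2]} - y\|^2$ (the shapes, $f = \tanh$, and the finite-difference check are assumptions for illustration):

```python
import numpy as np

# Sketch: backprop deltas for a1 = f(W1 x + b1), a2 = W2 a1 + b2,
# with J = 0.5 * ||a2 - y||^2, checked against a finite difference.
rng = np.random.default_rng(9)
x, y = rng.normal(size=3), rng.normal(size=2)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
f, fprime = np.tanh, lambda z: 1 - np.tanh(z) ** 2

def loss(W1_):
    return 0.5 * np.sum((W2 @ f(W1_ @ x + b1) + b2 - y) ** 2)

z1 = W1 @ x + b1                           # forward pass
a1 = f(z1)
a2 = W2 @ a1 + b2
delta2 = a2 - y                            # dJ/dz2 (linear output layer)
dW2, db2 = np.outer(delta2, a1), delta2    # dJ/dW2 = delta2 a1^T
delta1 = (W2.T @ delta2) * fprime(z1)      # dJ/dz1 = (W2^T delta2) * f'(z1)
dW1, db1 = np.outer(delta1, x), delta1

eps = 1e-6
E = np.zeros_like(W1); E[0, 1] = eps
print(dW1[0, 1], (loss(W1 + E) - loss(W1 - E)) / (2 * eps))  # agree
```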


Page 22: Outline (repeated; now in Unsupervised Learning)

Page 23: k-means

Algorithm 1: k-means

Input: training data $\{x^{(1)}, \ldots, x^{(n)}\}$; number of clusters $k$
1. Initialize $c^{(1)}, \ldots, c^{(k)} \in \mathbb{R}^d$ as cluster centers
2. while not converged do
3.     Assign each $x^{(i)}$ to its closest cluster center $c^{(j)}$
4.     Take the mean of each cluster as the new cluster center
5. end

Property

k-means approximately minimizes the following loss function:

$$\min_{\{c^{(1)}, \ldots, c^{(k)}\}} \sum_{i=1}^n \left\|x^{(i)} - c^{(j(i))}\right\|_2^2, \qquad \text{where } j(i) = \arg\min_{j' \in \{1, \ldots, k\}} \left\|x^{(i)} - c^{(j')}\right\|_2^2$$

However, it is not guaranteed to find the global minimum.
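A short sketch of the loop (the three-blob data and the stop-when-assignments-freeze test are assumptions; ties and empty clusters are not handled):

```python
import numpy as np

# Sketch: Lloyd's iterations for k-means on three assumed Gaussian blobs.
rng = np.random.default_rng(10)
X = np.vstack([rng.normal(m, 0.3, size=(50, 2))
               for m in ([0, 0], [3, 3], [0, 3])])

k = 3
c = X[rng.choice(len(X), k, replace=False)]         # init centers at data points
assign = None
while True:
    d2 = ((X[:, None, :] - c[None, :, :]) ** 2).sum(-1)
    new_assign = d2.argmin(axis=1)                  # closest center per point
    if np.array_equal(new_assign, assign):
        break                                       # assignments stopped changing
    assign = new_assign
    c = np.array([X[assign == j].mean(axis=0) for j in range(k)])
print(c)
```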


Page 24: Gaussian Mixture Model (GMM)

Formulation

Assume each data point $x^{(i)}$ is generated independently through the following procedure:

1. Sample $z^{(i)} \sim \text{Multinomial}(\phi)$, where $\sum_{j=1}^k \phi_j = 1$
2. Sample $x^{(i)} \sim \mathcal{N}(\mu_{z^{(i)}}, \Sigma_{z^{(i)}})$

How can we estimate the parameters $\phi$, $\{\mu_1, \ldots, \mu_k\}$ and $\{\Sigma_1, \ldots, \Sigma_k\}$ if $z^{(i)}$ cannot be observed?

Maximum Likelihood

$$\ell(\theta) = \sum_{i=1}^n \log\left(\sum_{j=1}^k \phi_j\, p(x^{(i)}; \mu_j, \Sigma_j)\right),$$

where

$$p(x^{(i)}; \mu_j, \Sigma_j) = \frac{1}{\sqrt{(2\pi)^d |\Sigma_j|}} \exp\left(-\frac{1}{2}(x^{(i)} - \mu_j)^T \Sigma_j^{-1} (x^{(i)} - \mu_j)\right)$$

This is too complicated to optimize directly!


Page 25: Expectation-Maximization (EM)

Jensen's Inequality

By Jensen's inequality, for any distribution $Q_i$ over $\{1, \ldots, k\}$, we have

$$\ell(\theta) = \sum_{i=1}^n \log\left(\sum_{j=1}^k Q_i(j)\, \frac{p(x^{(i)}, z^{(i)} = j; \theta)}{Q_i(j)}\right) \ge \sum_{i=1}^n \sum_{j=1}^k Q_i(j) \log\frac{p(x^{(i)}, z^{(i)} = j; \theta)}{Q_i(j)} := \mathrm{ELBO}(\theta)$$

Theorem

If we take $Q_i(j) = p(z^{(i)} = j \mid x^{(i)}; \theta^{(t)})$ and let $\theta^{(t+1)} := \arg\max_\theta \mathrm{ELBO}(\theta)$, we then have $\ell(\theta^{(t+1)}) \ge \ell(\theta^{(t)})$ (proved in lecture).

Algorithm 2: EM Algorithm

Input: training data $\{x^{(1)}, \ldots, x^{(n)}\}$
1. Initialize $\theta^{(0)}$ by some random guess
2. for $t = 0, 1, 2, \ldots$ do
3.     Set $Q_i(j) = p(z^{(i)} = j \mid x^{(i)}; \theta^{(t)})$ for each $i, j$  // E-step
4.     Set $\theta^{(t+1)} = \arg\max_\theta \mathrm{ELBO}(\theta)$  // M-step
5. end


Page 26: EM in GMM

Posterior of $z^{(i)}$

$$p(z^{(i)} = j \mid x^{(i)}; \theta^{(t)}) = \frac{\phi_j^{(t)}\, p(x^{(i)}; \mu_j^{(t)}, \Sigma_j^{(t)})}{\sum_{j'=1}^k \phi_{j'}^{(t)}\, p(x^{(i)}; \mu_{j'}^{(t)}, \Sigma_{j'}^{(t)})}$$

GMM Update Rules

By defining $w_j^{(i)} = p(z^{(i)} = j \mid x^{(i)}; \theta^{(t)})$, we have

$$\phi_j^{(t+1)} = \frac{\sum_{i=1}^n w_j^{(i)}}{n}, \qquad \mu_j^{(t+1)} = \frac{\sum_{i=1}^n w_j^{(i)} x^{(i)}}{\sum_{i=1}^n w_j^{(i)}}, \qquad \forall\, j \in \{1, \ldots, k\}$$

$$\Sigma_j^{(t+1)} = \frac{\sum_{i=1}^n w_j^{(i)} (x^{(i)} - \mu_j^{(t+1)})(x^{(i)} - \mu_j^{(t+1)})^T}{\sum_{i=1}^n w_j^{(i)}}, \qquad \forall\, j \in \{1, \ldots, k\}$$
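A condensed sketch putting the E-step posterior and these M-step rules together on assumed two-component 2-D data (random initialization; a practical implementation would work in log space and guard against degenerate covariances):

```python
import numpy as np

# Sketch: EM for a GMM, alternating the posterior (E-step) and the
# weighted-average update rules above (M-step).
def gauss(X, mu, S):                      # N(x; mu, S) density, row-wise
    d = X.shape[1]
    R = X - mu
    q = np.einsum('ni,ij,nj->n', R, np.linalg.inv(S), R)
    return np.exp(-0.5 * q) / np.sqrt((2 * np.pi) ** d * np.linalg.det(S))

rng = np.random.default_rng(11)
X = np.vstack([rng.normal([0, 0], 0.5, size=(100, 2)),
               rng.normal([3, 2], 0.7, size=(150, 2))])
k, n = 2, len(X)
phi = np.full(k, 1 / k)
mu = X[rng.choice(n, k, replace=False)].astype(float)
Sig = np.array([np.eye(2) for _ in range(k)])

for _ in range(50):
    # E-step: w[i, j] = p(z_i = j | x_i; current parameters)
    w = np.column_stack([phi[j] * gauss(X, mu[j], Sig[j]) for j in range(k)])
    w /= w.sum(axis=1, keepdims=True)
    # M-step: weighted frequencies, means, covariances
    Nj = w.sum(axis=0)
    phi = Nj / n
    mu = (w.T @ X) / Nj[:, None]
    for j in range(k):
        R = X - mu[j]
        Sig[j] = (w[:, j, None] * R).T @ R / Nj[j]

print(phi, mu, sep="\n")                  # ~[0.4, 0.6] and the two blob means
```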


Page 27: Outline (repeated; now in Practice Exam Problem (If time permits))