Gradient Ascent
Chris Piech, CS109, Stanford University


Page 1

Gradient Ascent
Chris Piech

CS109, Stanford University

Page 2

Our Path

Parameter Estimation

Linear Regression
Naïve Bayes
Logistic Regression

Deep Learning

Page 3

Our Path

Linear Regression
Naïve Bayes
Logistic Regression

Deep Learning

Unbiased estimators
Maximizing likelihood
Bayesian estimation

Page 4

Review

Page 5

Page 6

• Consider n I.I.D. random variables X1, X2, ..., Xn

§ Xi is a sample from density function f(Xi | θ)

§ What is the best choice of parameters θ?

Parameter Learning

Page 7


Likelihood (of data given parameters):

$$L(\theta) = \prod_{i=1}^{n} f(X_i \mid \theta)$$

Page 8


Maximum Likelihood Estimation

$$L(\theta) = \prod_{i=1}^{n} f(X_i \mid \theta)$$

$$LL(\theta) = \sum_{i=1}^{n} \log f(X_i \mid \theta)$$

$$\hat{\theta} = \underset{\theta}{\operatorname{argmax}} \; LL(\theta)$$

Page 9

Argmax?

Page 10

Option #1: Straight optimization

Page 11

• General approach for finding the MLE of θ

§ Determine the formula for LL(θ)

§ Differentiate LL(θ) w.r.t. (each) θ:
$$\frac{\partial LL(\theta)}{\partial \theta}$$

§ To maximize, set
$$\frac{\partial LL(\theta)}{\partial \theta} = 0$$

§ Solve the resulting (simultaneous) equations to get θMLE
o Make sure the derived θMLE is actually a maximum (and not a minimum or saddle point), e.g., check LL(θMLE ± ε) < LL(θMLE)
• This step is often ignored in expository derivations
• So we'll ignore it here too (and won't require it in this class)

Computing the MLE

Page 12


End Review

Page 13

Maximizing Likelihood with Bernoulli

• Consider I.I.D. random variables X1, X2, ..., Xn
§ Xi ~ Ber(p)
§ Probability mass function, f(Xi | p):

Page 14

• Consider I.I.D. random variables X1, X2, ..., Xn
§ Xi ~ Ber(p)
§ Probability mass function, f(Xi | p):

Maximizing Likelihood with Bernoulli

[Figure: PMF of Bernoulli, with mass 1 - p at x = 0 and mass p at x = 1]

$$f(X_i \mid p) = p^{x_i}(1-p)^{1-x_i} \quad \text{where } x_i = 0 \text{ or } 1$$

PMF of Bernoulli (p = 0.2):

$$f(x) = 0.2^{x}(1 - 0.2)^{1-x}$$

Page 15

Bernoulli PMF

$$f(X = x \mid p) = p^{x}(1-p)^{1-x}$$

$$X \sim \text{Ber}(p)$$

Page 16

• Consider I.I.D. random variables X1, X2, ..., Xn

§ Xi ~ Ber(p)

§ Probability mass function, f(Xi | p), can be written as:

$$f(X_i \mid p) = p^{x_i}(1-p)^{1-x_i} \quad \text{where } x_i = 0 \text{ or } 1$$

§ Likelihood:

$$L(\theta) = \prod_{i=1}^{n} p^{X_i}(1-p)^{1-X_i}$$

§ Log-likelihood:

$$LL(\theta) = \sum_{i=1}^{n} \log\!\left(p^{X_i}(1-p)^{1-X_i}\right) = \sum_{i=1}^{n}\left[X_i \log(p) + (1 - X_i)\log(1-p)\right] = Y\log(p) + (n - Y)\log(1-p), \quad \text{where } Y = \sum_{i=1}^{n} X_i$$

§ Differentiate w.r.t. p, and set to 0:

$$\frac{\partial LL(p)}{\partial p} = Y\,\frac{1}{p} - (n - Y)\,\frac{1}{1-p} = 0 \;\Rightarrow\; p_{MLE} = \frac{Y}{n} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

Maximizing Likelihood with Bernoulli
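As a concrete check of this result, here is a minimal Python sketch (the sample array is made up for illustration) that computes pMLE as the sample mean and confirms the log-likelihood is higher there than at nearby values of p:

import numpy as np

# made-up i.i.d. Bernoulli samples (e.g., coin flips)
samples = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
n, Y = len(samples), samples.sum()

p_mle = samples.mean()   # derived above: p_MLE = Y / n

def log_likelihood(p):
    # LL(p) = Y log(p) + (n - Y) log(1 - p)
    return Y * np.log(p) + (n - Y) * np.log(1 - p)

print(p_mle)                                   # 0.7
print(log_likelihood(p_mle))                   # larger than either value below
print(log_likelihood(0.65), log_likelihood(0.75))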

Page 17

Isn’t that the same as the sample mean?

Page 18

Yes. For Bernoulli.

Page 19

Page 20

Maximum Likelihood Algorithm

1. Decide on a model for the distribution of your samples. Define the PMF / PDF for your sample.

2. Write out the log likelihood function.

3. State that the optimal parameters are the argmax of the log likelihood function.

4. Use an optimization algorithm to calculate the argmax.

Page 21

Page 22

• Consider I.I.D. random variables X1, X2, ..., Xn

§ Xi ~ Poi(λ)

§ PMF:

$$f(X_i \mid \lambda) = \frac{e^{-\lambda}\lambda^{X_i}}{X_i!}$$

Likelihood:

$$L(\theta) = \prod_{i=1}^{n} \frac{e^{-\lambda}\lambda^{X_i}}{X_i!}$$

§ Log-likelihood:

$$LL(\theta) = \sum_{i=1}^{n}\log\!\left(\frac{e^{-\lambda}\lambda^{X_i}}{X_i!}\right) = \sum_{i=1}^{n}\left[-\lambda + X_i\log(\lambda) - \log(X_i!)\right] = -n\lambda + \log(\lambda)\sum_{i=1}^{n} X_i - \sum_{i=1}^{n}\log(X_i!)$$

§ Differentiate w.r.t. λ, and set to 0:

$$\frac{\partial LL(\lambda)}{\partial \lambda} = -n + \frac{1}{\lambda}\sum_{i=1}^{n} X_i = 0 \;\Rightarrow\; \lambda_{MLE} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

Maximizing Likelihood with Poisson

Page 23

It is so general!

Page 24

• Consider I.I.D. random variables X1, X2, ..., Xn

§ Xi ~ Uni(a, b)

§ PDF:

$$f(X_i \mid a, b) = \begin{cases} \frac{1}{b-a} & a \le x_i \le b \\ 0 & \text{otherwise} \end{cases}$$

§ Likelihood:

$$L(\theta) = \begin{cases} \left(\frac{1}{b-a}\right)^{n} & a \le x_1, x_2, \ldots, x_n \le b \\ 0 & \text{otherwise} \end{cases}$$

o Constraint a ≤ x1, x2, …, xn ≤ b makes differentiation tricky

o Intuition: want the interval size (b - a) to be as small as possible to maximize the likelihood of each data point

o But need to make sure all observed data is contained in the interval
• If any observed data point is not in the interval, then L(θ) = 0

§ Solution: aMLE = min(x1, …, xn), bMLE = max(x1, …, xn)

Maximizing Likelihood with Uniform
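A minimal Python sketch of this solution, using the seven Uni(0, 1) observations that appear on the next slide:

import numpy as np

samples = np.array([0.15, 0.20, 0.30, 0.40, 0.65, 0.70, 0.75])

# The likelihood (1/(b - a))^n grows as the interval shrinks, but is 0 if any
# sample falls outside [a, b], so the tightest interval containing all samples wins.
a_mle = samples.min()   # 0.15
b_mle = samples.max()   # 0.75
print(a_mle, b_mle)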

Page 25

• Consider I.I.D. random variables X1, X2, ..., Xn
§ Xi ~ Uni(0, 1)
§ Observe data:
o 0.15, 0.20, 0.30, 0.40, 0.65, 0.70, 0.75

[Figure: the likelihood L(a, 1) plotted as a function of a, and the likelihood L(0, b) plotted as a function of b, for the data above]

Understanding MLE with Uniform

Page 26

• How do small samples affect MLE?

§ In many cases,
$$\hat{\theta}_{MLE} = \frac{1}{n}\sum_{i=1}^{n} X_i = \text{sample mean}$$
o Unbiased. Not too shabby…

§ Estimating a Normal,
$$\sigma^{2}_{MLE} = \frac{1}{n}\sum_{i=1}^{n}\left(X_i - \mu_{MLE}\right)^{2}$$
o Biased. Underestimates for small n (e.g., 0 for n = 1)

§ As seen with Uniform, aMLE ≥ a and bMLE ≤ b
o Biased. Problematic for small n (e.g., aMLE = bMLE when n = 1)

§ Small sample phenomena intuitively make sense:
o Maximum likelihood ⇒ best explain the data we've seen
o Does not attempt to generalize to unseen data

Small Samples = Problems
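To see the variance bias concretely, here is a small simulation sketch (the true distribution and the sample size n = 3 are chosen arbitrarily); on average the MLE of the variance comes out near (n - 1)/n times the true value:

import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0            # simulate X_i ~ N(0, 4)
n, trials = 3, 100_000    # a small sample size, repeated many times

samples = rng.normal(0.0, np.sqrt(true_var), size=(trials, n))
mle_vars = samples.var(axis=1, ddof=0)   # (1/n) * sum_i (X_i - mu_MLE)^2 per trial

print(mle_vars.mean())    # roughly 2.67, i.e. (n - 1)/n * 4.0: biased low
print(true_var)           # 4.0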

Page 27

• Maximum Likelihood Estimators are generally:

§ Consistent: for ε > 0,
$$\lim_{n \to \infty} P\!\left(|\hat{\theta} - \theta| < \varepsilon\right) = 1$$

§ Potentially biased (though asymptotically less so)

§ Asymptotically optimal
o Has the smallest variance of "good" estimators for large samples

§ Often used in practice where sample size is large relative to the parameter space
o But be careful, there are some very large parameter spaces

Properties of MLE

Page 28


Maximum Likelihood Estimation

$$L(\theta) = \prod_{i=1}^{n} f(X_i \mid \theta)$$

$$LL(\theta) = \sum_{i=1}^{n} \log f(X_i \mid \theta)$$

$$\hat{\theta} = \underset{\theta}{\operatorname{argmax}} \; LL(\theta)$$

Page 29

Page 30

Argmax 2: Gradient Ascent

Page 31

Argmax

Option #1: Straight optimization

Page 32

Argmax

Option #2: Gradient Ascent

Page 33

Gradient Ascent

Walk uphill and you will find a local maximum (if your step size is small enough)

[Figure: the likelihood p(samples | θ) plotted as a function of θ, with the argmax marked]


Page 35

Gradient Ascent

Walk uphill and you will find a local maximum (if your step size is small enough)

Especially good if the function is concave

[Figure: the likelihood p(samples | θ) plotted as a function of θ, with points θ1 and θ2 marked]

Page 36

$$\theta_j^{\text{new}} = \theta_j^{\text{old}} + \eta \cdot \frac{\partial LL(\theta^{\text{old}})}{\partial \theta_j^{\text{old}}}$$

Gradient Ascent

Repeat many times

Walk uphill and you will find a local maximum (if your step size is small enough)

This is some profound life philosophy

Page 37


Gradient ascent is your bread-and-butter algorithm for optimization (e.g., computing an argmax)

Page 38

Gradient Ascent

Initialize: θj = 0 for all 0 ≤ j ≤ m

Calculate all θj

Page 39

Gradient Ascent

Initialize: θj = 0 for all 0 ≤ j ≤ m
Repeat many times:
    gradient[j] = 0 for all 0 ≤ j ≤ m
    Calculate all gradient[j]'s based on data
    θj += η * gradient[j] for all 0 ≤ j ≤ m
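Here is a runnable Python sketch of this loop for a single parameter, using the Bernoulli log-likelihood derived earlier as the function being climbed (the sample data and the step size η are made up; p is initialized to 0.5 rather than 0 because log p is undefined at 0):

import numpy as np

samples = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])   # made-up Bernoulli data
n, Y = len(samples), samples.sum()

eta = 0.001   # step size
p = 0.5       # initialize the single parameter theta = p

for _ in range(10_000):                    # repeat many times
    gradient = Y / p - (n - Y) / (1 - p)   # dLL/dp, derived earlier
    p += eta * gradient                    # walk uphill

print(p)   # converges to Y / n = 0.7, matching the closed-form MLE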

Page 40

Review: Maximum Likelihood Algorithm

1. Decide on a model for the likelihood of your samples. This is often done using a PMF or PDF.

2. Write out the log likelihood function.

3. State that the optimal parameters are the argmax of the log likelihood function.

4. Use an optimization algorithm to calculate the argmax.

Page 41

Review: Maximum Likelihood Algorithm

1. Decide on a model for the likelihood of your samples. This is often done using a PMF or PDF.

2. Write out the log likelihood function.

3. State that the optimal parameters are the argmax of the log likelihood function.

4. Calculate the derivative of LL with respect to theta.

5. Use an optimization algorithm to calculate the argmax.

Page 42

Linear Regression Lite

Page 43

Predicting Warriors

X1 = Opposing team ELO

X2 = Points in last game

X3 = Curry playing?

X4 = Playing at home?

Y = Warriors points

Page 44

Predicting CO2 (simple)

(x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(n), y^(n))

N training datapoints

Linear Regression Lite Model

$$Y = \theta \cdot X + Z, \quad Z \sim N(0, \sigma^2) \;\Longrightarrow\; Y \mid X \sim N(\theta X, \sigma^2)$$

X = CO2 level

Y = Average Global Temperature

Page 45

1) Write Likelihood Fn

(x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(n), y^(n))

N training datapoints

Model: Y | X ~ N(θX, σ²)

First, calculate Likelihood of the data

Shorthand for:

Page 46

1) Write Likelihood Fn

(x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(n), y^(n))

N training datapoints

Model: Y | X ~ N(θX, σ²)

First, calculate Likelihood of the data
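Under the model Y | X ~ N(θX, σ²) stated above, the likelihood of the n independent datapoints (the quantity this step writes out) works out to a product of Gaussian densities:

$$L(\theta) = \prod_{i=1}^{n} f\!\left(y^{(i)} \mid x^{(i)}, \theta\right) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^{2}}} \exp\!\left(-\frac{\left(y^{(i)} - \theta x^{(i)}\right)^{2}}{2\sigma^{2}}\right)$$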

Page 47

2) Write Log Likelihood Fn

Second, calculate Log Likelihood of the data

Likelihood function:

N training datapoints: (x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(n), y^(n))
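Taking the log of that product gives the log-likelihood to be maximized:

$$LL(\theta) = \sum_{i=1}^{n}\left[-\log\!\left(\sigma\sqrt{2\pi}\right) - \frac{\left(y^{(i)} - \theta x^{(i)}\right)^{2}}{2\sigma^{2}}\right]$$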

Page 48

3) State MLE as Optimization

Third, celebrate!

Log Likelihood:

N training datapoints: (x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(n), y^(n))
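Since the first term and the constant 1/(2σ²) do not depend on θ, maximizing LL(θ) is equivalent to maximizing the negative sum of squared errors, which matches the objective used later for full linear regression:

$$\theta_{MLE} = \underset{\theta}{\operatorname{argmax}} \; LL(\theta) = \underset{\theta}{\operatorname{argmax}} \; -\sum_{i=1}^{n}\left(y^{(i)} - \theta x^{(i)}\right)^{2}$$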

Page 49

4) Find derivative

Fourth, optimize!

Goal:

N training datapoints: (x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(n), y^(n))
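Differentiating that simplified objective with respect to θ gives the gradient used in the update loops below:

$$\frac{\partial LL(\theta)}{\partial \theta} = \sum_{i=1}^{n} 2\left(y^{(i)} - \theta x^{(i)}\right)x^{(i)}$$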

Page 50

5) Run optimization code

N training datapoints: (x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(n), y^(n))

Page 51

Gradient Ascent

Initialize: θj = 0 for all 0 ≤ j ≤ m
Repeat many times:
    gradient[j] = 0 for all 0 ≤ j ≤ m
    Calculate all gradient[j]'s based on data and current setting of theta
    θj += η * gradient[j] for all 0 ≤ j ≤ m

Page 52

Gradient Ascent

Initialize: θ = 0
Repeat many times:
    gradient = 0
    Calculate gradient based on data
    θ += η * gradient

Linear Regression (simple)

Page 53

Initialize: θ = 0
Repeat many times:
    gradient = 0
    For each training example (x, y):
        Update gradient for current training example
    θ += η * gradient

Linear Regression (simple)

Page 54

Initialize: θ = 0
Repeat many times:
    gradient = 0
    For each training example (x, y):
        gradient += 2(y - θx)(x)
    θ += η * gradient

Linear Regression (simple)
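A runnable Python version of this loop (the toy dataset and the step size η are invented for illustration); it fits θ for the no-intercept model Y ≈ θX by walking uphill on the log-likelihood:

import numpy as np

xs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # made-up training inputs
ys = np.array([3.1, 5.9, 9.2, 11.8, 15.3])    # made-up outputs, roughly y = 3x

eta = 0.001   # step size
theta = 0.0   # initialize

for _ in range(1000):            # repeat many times
    gradient = 0.0
    for x, y in zip(xs, ys):     # for each training example (x, y)
        gradient += 2 * (y - theta * x) * x
    theta += eta * gradient      # walk uphill

print(theta)   # roughly 3.02, the slope that best fits the toy data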

Page 55

Linear Regression

Page 56

Predicting CO2

X1 = Temperature

X2 = Elevation

X3 = CO2 level yesterday

X4 = GDP of region

X5 = Acres of forest growth

Y = CO2 levels

Page 57

Training: Gradient ascent to choose the best thetas to describe your data

$$\theta_{MLE} = \underset{\theta}{\operatorname{argmax}} \; -\sum_{i=1}^{n}\left(Y^{(i)} - \theta^{T} x^{(i)}\right)^{2}$$

Problem: Predict real value Y based on observing variable X

Linear Regression

Model: Linearly weight every feature

$$Y = \theta_1 X_1 + \cdots + \theta_m X_m + \theta_{m+1} = \theta^{T} X$$

Page 58

Initialize: θj = 0 for all 0 ≤ j ≤ m
Repeat many times:
    gradient[j] = 0 for all 0 ≤ j ≤ m
    For each training example (x, y):
        For each parameter j:
            gradient[j] += (y - θTx) * x[j]
    θj += η * gradient[j] for all 0 ≤ j ≤ m

Linear Regression
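A vectorized Python sketch of the same loop for m features (the dataset, feature count, and step size are all invented); a constant 1 is appended to each x so that θm+1 plays the role of the intercept in the θTX model above:

import numpy as np

rng = np.random.default_rng(0)
true_theta = np.array([1.5, -2.0, 3.0, 0.5])     # invented weights; last entry is the intercept

X = rng.normal(size=(100, 3))
X = np.hstack([X, np.ones((100, 1))])            # append the constant-1 feature
y = X @ true_theta + rng.normal(scale=0.5, size=100)

eta = 0.001                      # step size
theta = np.zeros(X.shape[1])     # initialize: theta_j = 0 for all j

for _ in range(3000):            # repeat many times
    gradient = np.zeros_like(theta)
    for x_i, y_i in zip(X, y):   # for each training example (x, y)
        gradient += (y_i - theta @ x_i) * x_i
    theta += eta * gradient      # walk uphill

print(theta)   # close to true_theta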

Page 59

Predicting Warriors

Y = Warriors points

X1 = Opposing team ELO

X2 = Points in last game

X3 = Curry playing?

X4 = Playing at home?

θ1 = -2.3
θ2 = +1.2
θ3 = +10.2
θ4 = +3.3
θ5 = +95.4

$$Y = \theta_1 X_1 + \cdots + \theta_m X_m + \theta_{m+1} = \theta^{T} X$$
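For instance, a prediction is just the weighted sum Y = θTX; plugging a made-up (already-scaled) feature vector into the weights above:

theta = [-2.3, 1.2, 10.2, 3.3, 95.4]   # weights from the slide; theta[4] is the intercept θ5
x = [2.0, 1.5, 1.0, 1.0]               # invented, scaled features: opposing Elo, last-game points, Curry playing, at home

y_pred = sum(t * xi for t, xi in zip(theta[:4], x)) + theta[4]
print(y_pred)   # roughly 106.1 predicted Warriors points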

Page 60

§ Training data: set of N pre-classified data instances
o N training pairs: (x^(1), y^(1)), (x^(2), y^(2)), …, (x^(n), y^(n))
• Use superscripts to denote the i-th training instance

§ Learning algorithm: method for determining g(X)
o Given a new input observation x = x1, x2, …, xm
o Use g(x) to compute a corresponding output (prediction)

[Diagram: Training data → Learning algorithm → g(X) (Classifier); a new input X fed to g(X) produces an Output (Class)]

The Machine Learning Process