18-661 Introduction to Machine Learning

Linear Regression – I

Spring 2020

ECE – Carnegie Mellon University


Outline

1. Recap of MLE/MAP

2. Linear Regression

   • Motivation
   • Algorithm
   • Univariate solution
   • Multivariate solution
   • Probabilistic interpretation
   • Computational and numerical optimization

Recap of MLE/MAP


Dogecoin

• Scenario: You find a coin on the ground.

• You ask yourself: Is this a fair or biased coin? What is the probability that I will flip heads?

• You flip the coin 10 times . . .

• It comes up as 'H' 8 times and 'T' 2 times

• Can we learn from this data?

Machine Learning Pipeline

[Pipeline diagram: data → feature extraction → ML method (model & parameters, optimization, evaluation) → intelligence]

Two approaches that we discussed:

• Maximum Likelihood Estimation (MLE)

• Maximum a Posteriori Estimation (MAP)

Maximum Likelihood Estimation (MLE)

• Data: Observed set D of nH heads and nT tails

• Model: Each flip follows a Bernoulli distribution

  P(H) = θ,   P(T) = 1 − θ,   θ ∈ [0, 1]

  Thus, the likelihood of observing sequence D is

  P(D | θ) = θ^nH (1 − θ)^nT

• Question: Given this model and the data we've observed, can we calculate an estimate of θ?

• MLE: Choose θ that maximizes the likelihood of the observed data

  θ_MLE = arg max_θ P(D | θ) = arg max_θ log P(D | θ) = nH / (nH + nT)

MAP for Dogecoin

  θ_MAP = arg max_θ P(θ | D) = arg max_θ P(D | θ) P(θ)

• Recall that P(D | θ) = θ^nH (1 − θ)^nT

• How should we set the prior, P(θ)?

• Common choice for a binomial likelihood is to use the Beta distribution, θ ∼ Beta(α, β):

  P(θ) = (1 / B(α, β)) θ^(α−1) (1 − θ)^(β−1)

• Interpretation: α = number of expected heads, β = number of expected tails. A larger value of α + β denotes more confidence (and smaller variance).

Putting it all together

  θ_MLE = nH / (nH + nT)

  θ_MAP = (α + nH − 1) / (α + β + nH + nT − 2)

• Suppose θ* := 0.5 and we observe D = {H, H, T, T, T, T}

• Scenario 1: We assume θ ∼ Beta(4, 4). Which is more accurate – θ_MLE or θ_MAP?

  • θ_MAP = 5/12, θ_MLE = 1/3

• Scenario 2: We assume θ ∼ Beta(1, 7). Which is more accurate – θ_MLE or θ_MAP?

  • θ_MAP = 1/6, θ_MLE = 1/3
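As a quick sanity check on these formulas, here is a minimal sketch (not part of the original slides) that computes both estimates for the example above:

def theta_mle(n_heads, n_tails):
    # MLE for a Bernoulli parameter: fraction of heads observed
    return n_heads / (n_heads + n_tails)

def theta_map(n_heads, n_tails, alpha, beta):
    # MAP estimate under a Beta(alpha, beta) prior (mode of the posterior)
    return (alpha + n_heads - 1) / (alpha + beta + n_heads + n_tails - 2)

n_heads, n_tails = 2, 4                    # D = {H, H, T, T, T, T}
print(theta_mle(n_heads, n_tails))         # 0.333... = 1/3
print(theta_map(n_heads, n_tails, 4, 4))   # 0.416... = 5/12
print(theta_map(n_heads, n_tails, 1, 7))   # 0.166... = 1/6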

Linear Regression

Motivation

Task 1: Regression

How much should you sell your house for?

[Figure: three panels plotting price ($) against house size, labelled "input: houses & features", "learn: x → y relationship", and "predict: y (continuous)"]

Course Covers: Linear/Ridge Regression, Loss Function, SGD, Feature Scaling, Regularization, Cross Validation

Supervised Learning

Supervised learning: In a supervised learning problem, you have access to input variables (X) and outputs (Y), and the goal is to predict an output given an input.

• Examples:

  • Housing prices (Regression): predict the price of a house based on features (size, location, etc.)

  • Cat vs. Dog (Classification): predict whether a picture is of a cat or a dog

Regression

Predicting a continuous outcome variable:

• Predicting a company's future stock price using its profit and other financial info

• Predicting annual rainfall based on local flora and fauna

• Predicting distance from a traffic light using LIDAR measurements

Magnitude of the error matters:

• We can measure 'closeness' of predictions and labels, leading to different ways to evaluate prediction errors.

  • Predicting stock price: better to be off by $1 than by $20

  • Predicting distance from a traffic light: better to be off by 1 m than by 10 m

• We should choose learning models and algorithms accordingly.

Ex: predicting the sale price of a house

Retrieve historical sales records

(This will be our training data)


Features used to predict


Correlation between square footage and sale price


Roughly linear relationship

Sale price ≈ price per sqft × square footage + fixed expense


Data Can be Compactly Represented by Matrices

[Figure: price ($) vs. house size, with the fitted orange line]

• Learn parameters (w0, w1) of the orange line y = w1 x + w0

  House 1 (1000 sq.ft): 1000 × w1 + w0 = 200,000
  House 2 (2000 sq.ft): 2000 × w1 + w0 = 350,000

• Can represent compactly in matrix notation

  [ 1000  1 ] [ w1 ]   [ 200,000 ]
  [ 2000  1 ] [ w0 ] = [ 350,000 ]

Some Concepts That You Should Know

• Invertibility of Matrices and Computing Inverses

• Vector Norms (L2, Frobenius, etc.) and Inner Products

• Eigenvalues and Eigenvectors

• Singular Value Decomposition

• Covariance Matrices and Positive Semi-definiteness

Excellent Resources:

• Essence of Linear Algebra YouTube Series

• Prof. Gilbert Strang's course at MIT

Matrix Inverse

• Let us solve the house-price prediction problem

  [ 1000  1 ] [ w1 ]   [ 200,000 ]
  [ 2000  1 ] [ w0 ] = [ 350,000 ]                                  (1)

  [ w1 ]   [ 1000  1 ]^(−1) [ 200,000 ]
  [ w0 ] = [ 2000  1 ]      [ 350,000 ]                             (2)

         = (1 / −1000) [    1     −1   ] [ 200,000 ]
                       [ −2000   1000  ] [ 350,000 ]                (3)

         = (1 / −1000) [ −150,000  ]
                       [ −5 × 10^7 ]                                (4)

  [ w1 ]   [ 150    ]
  [ w0 ] = [ 50,000 ]                                               (5)
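The same 2×2 solve can be checked numerically; a small sketch (not from the slides, assuming NumPy is available):

import numpy as np

A = np.array([[1000.0, 1.0],
              [2000.0, 1.0]])            # rows are [sqft, 1] for the two houses
y = np.array([200_000.0, 350_000.0])

w = np.linalg.solve(A, y)                # solves A w = y without forming the inverse explicitly
print(w)                                 # ≈ [150., 50000.], i.e. w1 = $150/sqft, w0 = $50,000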

You could have data from many houses

• Sale price = price per sqft × square footage + fixed expense + unexplainable stuff

• Want to learn the price per sqft and fixed expense

• Training data: past sales records.

  sqft    sale price
  2000    800K
  2100    907K
  1100    312K
  5500    2,600K
  ...     ...

Problem: there isn't a w = [w1, w0]^T that will satisfy all the equations

Want to predict the best price per sqft and fixed expense

• Sale price = price per sqft × square footage + fixed expense + unexplainable stuff

• Want to learn the price per sqft and fixed expense

• Training data: past sales records.

  sqft    sale price    prediction
  2000    810K          720K
  2100    907K          800K
  1100    312K          350K
  5500    2,600K        2,600K
  ...     ...           ...

Reduce prediction error

How to measure errors?

• absolute difference: |prediction − sale price|

• squared difference: (prediction − sale price)^2 [differentiable!]

  sqft    sale price    prediction    abs error    squared error (K^2)
  2000    810K          720K          90K          90^2 = 8100
  2100    907K          800K          107K         107^2 = 11449
  1100    312K          350K          38K          38^2 = 1444
  5500    2,600K        2,600K        0            0
  ...     ...           ...           ...          ...

Geometric Illustration: Each house corresponds to one line

[Figure: each training house n contributes one linear equation c_n^T w = y_n; stacking the per-house errors gives the residual vector

  r = [ c_1^T w − y_1 ;  c_2^T w − y_2 ;  ... ;  c_4^T w − y_4 ] = Aw − y,

where A stacks the rows c_n^T (the design matrix, written X below).]

• Want to find w that minimizes the difference between Xw and y

• But since this is a vector, we need an operator that can map the residual vector r(w) = y − Xw to a scalar

Norms and Loss Functions

• A vector norm is any function f : R^n → R with

  • f(x) ≥ 0, and f(x) = 0 ⇔ x = 0

  • f(ax) = |a| f(x) for a ∈ R

  • triangle inequality: f(x + y) ≤ f(x) + f(y)

• e.g., ℓ2 norm: ‖x‖_2 = √(x^T x) = √(Σ_{i=1}^n x_i^2)

• e.g., ℓ1 norm: ‖x‖_1 = Σ_{i=1}^n |x_i|

• e.g., ℓ∞ norm: ‖x‖_∞ = max_i |x_i|

[Figure: from inside to outside: the ℓ1, ℓ2, ℓ∞ norm balls.]
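For concreteness, a small sketch (not from the slides) evaluating the three norms with NumPy on a toy vector:

import numpy as np

x = np.array([3.0, -4.0])

l2 = np.linalg.norm(x)                 # sqrt(3^2 + 4^2) = 5.0
l1 = np.linalg.norm(x, ord=1)          # |3| + |-4| = 7.0
linf = np.linalg.norm(x, ord=np.inf)   # max(|3|, |-4|) = 4.0

print(l1, l2, linf)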

Minimize squared errors

Our model:

  Sale price = price per sqft × square footage + fixed expense + unexplainable stuff

Training data:

  sqft    sale price    prediction    error    squared error (K^2)
  2000    810K          720K          90K      90^2 = 8100
  2100    907K          800K          107K     107^2 = 11449
  1100    312K          350K          38K      38^2 = 1444
  5500    2,600K        2,600K        0        0
  ...     ...           ...           ...      ...
  Total                                        8100 + 11449 + 1444 + 0 + ...

Aim:

Adjust the price per sqft and fixed expense so that the sum of the squared errors is minimized, i.e., the unexplainable stuff is minimized.

Algorithm

Linear regression

Setup:

• Input: x ∈ R^D (covariates, predictors, features, etc.)

• Output: y ∈ R (responses, targets, outcomes, outputs, etc.)

• Model: f : x → y, with f(x) = w0 + Σ_{d=1}^D w_d x_d = w0 + w^T x

  • w = [w1 w2 ... wD]^T: weights, parameters, or parameter vector

  • w0 is called the bias.

  • Sometimes we also call w = [w0 w1 w2 ... wD]^T the parameters.

• Training data: D = {(x_n, y_n), n = 1, 2, ..., N}

Minimize the residual sum of squares:

  RSS(w) = Σ_{n=1}^N [y_n − f(x_n)]^2 = Σ_{n=1}^N [y_n − (w0 + Σ_{d=1}^D w_d x_{nd})]^2

Univariate solution

A simple case: x is just one-dimensional (D = 1)

Residual sum of squares:

  RSS(w) = Σ_n [y_n − f(x_n)]^2 = Σ_n [y_n − (w0 + w1 x_n)]^2

What kind of function is this? CONVEX (it has a unique global minimum)

Stationary points: take the derivative with respect to each parameter and set it to zero

  ∂RSS(w)/∂w0 = 0  ⇒  −2 Σ_n [y_n − (w0 + w1 x_n)] = 0

  ∂RSS(w)/∂w1 = 0  ⇒  −2 Σ_n [y_n − (w0 + w1 x_n)] x_n = 0

Simplify these expressions to get the "Normal Equations":

  Σ_n y_n = N w0 + w1 Σ_n x_n

  Σ_n x_n y_n = w0 Σ_n x_n + w1 Σ_n x_n^2

Solving this system, we obtain the least squares coefficient estimates:

  w1 = Σ_n (x_n − x̄)(y_n − ȳ) / Σ_n (x_n − x̄)^2    and    w0 = ȳ − w1 x̄

where x̄ = (1/N) Σ_n x_n and ȳ = (1/N) Σ_n y_n.

Example

  sqft (1000's)    sale price (100k)
  1                2
  2                3.5
  1.5              3
  2.5              4.5

Residual sum of squares:

  RSS(w) = Σ_n [y_n − f(x_n)]^2 = Σ_n [y_n − (w0 + w1 x_n)]^2

The w1 and w0 that minimize this are given by:

  w1 = Σ_n (x_n − x̄)(y_n − ȳ) / Σ_n (x_n − x̄)^2    and    w0 = ȳ − w1 x̄

where x̄ = (1/N) Σ_n x_n and ȳ = (1/N) Σ_n y_n.

For this data, the minimizers are w1 ≈ 1.6 and w0 ≈ 0.45.
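A minimal sketch (not from the slides) that plugs the example table into the closed-form estimates above:

import numpy as np

x = np.array([1.0, 2.0, 1.5, 2.5])     # sqft (1000's)
y = np.array([2.0, 3.5, 3.0, 4.5])     # sale price (100k's)

x_bar, y_bar = x.mean(), y.mean()
w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
w0 = y_bar - w1 * x_bar

print(w1, w0)                          # ≈ 1.6 0.45 (up to floating-point rounding)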

Multivariate solution

Least Mean Squares when x is D-dimensional

  sqft (1000's)    bedrooms    bathrooms    sale price (100k)
  1                2           1            2
  2                2           2            3.5
  1.5              3           2            3
  2.5              4           2.5          4.5

RSS(w) in matrix form:

  RSS(w) = Σ_n [y_n − (w0 + Σ_d w_d x_{nd})]^2 = Σ_n [y_n − w^T x_n]^2,

where we have redefined some variables (by augmenting)

  x ← [1 x_1 x_2 ... x_D]^T,    w ← [w0 w1 w2 ... wD]^T

which leads to

  RSS(w) = Σ_n (y_n − w^T x_n)(y_n − x_n^T w)

         = Σ_n { w^T x_n x_n^T w − 2 y_n x_n^T w } + const.

         = { w^T (Σ_n x_n x_n^T) w − 2 (Σ_n y_n x_n^T) w } + const.

RSS(w) in new notations

From the previous slide:

  RSS(w) = { w^T (Σ_n x_n x_n^T) w − 2 (Σ_n y_n x_n^T) w } + const.

Design matrix and target vector:

  X = [ x_1^T ; x_2^T ; ... ; x_N^T ] ∈ R^{N×(D+1)},    y = [ y_1 ; y_2 ; ... ; y_N ] ∈ R^N

Compact expression:

  RSS(w) = ‖Xw − y‖_2^2 = { w^T X^T X w − 2 (X^T y)^T w } + const.


Example: RSS(w) in compact form

  sqft (1000's)    bedrooms    bathrooms    sale price (100k)
  1                2           1            2
  2                2           2            3.5
  1.5              3           2            3
  2.5              4           2.5          4.5

Design matrix and target vector:

      [ 1   1     2   1   ]         [ 2   ]
  X = [ 1   2     2   2   ],    y = [ 3.5 ]
      [ 1   1.5   3   2   ]         [ 3   ]
      [ 1   2.5   4   2.5 ]         [ 4.5 ]

Compact expression:

  RSS(w) = ‖Xw − y‖_2^2 = { w^T X^T X w − 2 (X^T y)^T w } + const.

Solution in matrix form

Compact expression:

  RSS(w) = ‖Xw − y‖_2^2 = { w^T X^T X w − 2 (X^T y)^T w } + const.

Gradients of linear and quadratic functions:

• ∇_x (b^T x) = b

• ∇_x (x^T A x) = 2Ax (for symmetric A)

Normal equation:

  ∇_w RSS(w) = 2 X^T X w − 2 X^T y = 0

This leads to the least-mean-squares (LMS) solution

  w_LMS = (X^T X)^(−1) X^T y

Example: RSS(w) in compact form

  sqft (1000's)    bedrooms    bathrooms    sale price (100k)
  1                2           1            2
  2                2           2            3.5
  1.5              3           2            3
  2.5              4           2.5          4.5

Write the least-mean-squares (LMS) solution

  w_LMS = (X^T X)^(−1) X^T y

We can use solvers in Matlab, Python, etc. to compute this for any given X and y (see the sketch below).
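A sketch of this computation in Python (not from the slides); np.linalg.lstsq is used instead of forming the inverse explicitly, which is numerically preferable:

import numpy as np

X = np.array([[1, 1.0, 2, 1.0],      # rows are [1, sqft, bedrooms, bathrooms]
              [1, 2.0, 2, 2.0],
              [1, 1.5, 3, 2.0],
              [1, 2.5, 4, 2.5]])
y = np.array([2.0, 3.5, 3.0, 4.5])   # sale price (100k)

# Equivalent to (X^T X)^(-1) X^T y when X^T X is invertible.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)                             # [w0, w_sqft, w_bedrooms, w_bathrooms]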

Exercise: RSS(w) in compact form

Using the general least-mean-squares (LMS) solution

  w_LMS = (X^T X)^(−1) X^T y

recover the univariate solution that we computed earlier:

  w1 = Σ_n (x_n − x̄)(y_n − ȳ) / Σ_n (x_n − x̄)^2    and    w0 = ȳ − w1 x̄

where x̄ = (1/N) Σ_n x_n and ȳ = (1/N) Σ_n y_n.

Exercise: RSS(w) in compact form

For the 1-D case, X has rows [1, x_n], so the least-mean-squares solution is

  w_LMS = (X^T X)^(−1) X^T y

with

  X^T X = [ N,  N x̄ ;  N x̄,  Σ_n x_n^2 ]    and    X^T y = [ Σ_n y_n ;  Σ_n x_n y_n ],

which gives

  [ w0 ; w1 ] = ( 1 / Σ_n (x_n − x̄)^2 ) [ ȳ Σ_n (x_n − x̄)^2 − x̄ Σ_n (x_n − x̄)(y_n − ȳ) ;  Σ_n (x_n − x̄)(y_n − ȳ) ]

where x̄ = (1/N) Σ_n x_n and ȳ = (1/N) Σ_n y_n.

Probabilistic interpretation

Why is minimizing RSS sensible?

[Figure: each training house n contributes one linear equation c_n^T w = y_n; stacking the per-house errors gives the residual vector

  r = [ c_1^T w − y_1 ;  c_2^T w − y_2 ;  ... ;  c_4^T w − y_4 ] = Aw − y,

where A stacks the rows c_n^T (the design matrix, written X below).]

• Want to find w that minimizes the difference between Xw and y

• But since this is a vector, we need an operator that can map the residual vector r(w) = y − Xw to a scalar

• We take the sum of the squares of the elements of r(w)

Why is minimizing RSS sensible?

Probabilistic interpretation

• Noisy observation model:

  Y = w0 + w1 X + η

  where η ∼ N(0, σ^2) is a Gaussian random variable

• Conditional likelihood of one training sample:

  p(y_n | x_n) = N(w0 + w1 x_n, σ^2) = (1 / (√(2π) σ)) exp( −[y_n − (w0 + w1 x_n)]^2 / (2σ^2) )

Probabilistic interpretation (cont’d)

Log-likelihood of the training data D (assuming i.i.d. samples):

  log P(D) = log Π_{n=1}^N p(y_n | x_n) = Σ_n log p(y_n | x_n)

           = Σ_n { −[y_n − (w0 + w1 x_n)]^2 / (2σ^2) − log(√(2π) σ) }

           = −(1 / 2σ^2) Σ_n [y_n − (w0 + w1 x_n)]^2 − (N/2) log σ^2 − N log √(2π)

           = −(1/2) { (1/σ^2) Σ_n [y_n − (w0 + w1 x_n)]^2 + N log σ^2 } + const.

What is the relationship between minimizing RSS and maximizing the log-likelihood?

Maximum likelihood estimation

Estimating σ, w0, and w1 can be done in two steps

• Maximize over w0 and w1:

  max log P(D)  ⇔  min Σ_n [y_n − (w0 + w1 x_n)]^2   ← This is RSS(w)!

• Maximize over s = σ^2:

  ∂ log P(D) / ∂s = −(1/2) { −(1/s^2) Σ_n [y_n − (w0 + w1 x_n)]^2 + N (1/s) } = 0

  →  σ*^2 = s* = (1/N) Σ_n [y_n − (w0 + w1 x_n)]^2
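A small sketch (not from the slides) that fits the earlier univariate example and then computes the MLE noise estimate σ*^2 as the mean squared residual:

import numpy as np

x = np.array([1.0, 2.0, 1.5, 2.5])
y = np.array([2.0, 3.5, 3.0, 4.5])

w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()

residuals = y - (w0 + w1 * x)
sigma2_mle = np.mean(residuals ** 2)   # (1/N) * sum of squared residuals
print(w0, w1, sigma2_mle)              # ≈ 0.45 1.6 0.0125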

How does this probabilistic interpretation help us?

• It gives a solid footing to our intuition: minimizing RSS(w) is a sensible thing to do under reasonable modeling assumptions.

• Estimating σ* tells us how much noise there is in our predictions. For example, it allows us to place confidence intervals around our predictions.

Computational and numerical optimization

Computational complexity of the Least Squares Solution

Bottleneck of computing the solution?

  w = (X^T X)^(−1) X^T y

• Matrix multiplication to form X^T X ∈ R^{(D+1)×(D+1)}

• Inverting the matrix X^T X

How many operations do we need?

• O(ND^2) for the matrix multiplication

• O(D^3) (e.g., using Gauss-Jordan elimination) or O(D^2.373) (recent theoretical advances) for the matrix inversion

• Impractical for very large D or N

Alternative method: Batch Gradient Descent

(Batch) Gradient descent

• Initialize w to w^(0) (e.g., randomly); set t = 0; choose η > 0

• Loop until convergence

  1. Compute the gradient: ∇RSS(w) = X^T (X w^(t) − y)

  2. Update the parameters: w^(t+1) = w^(t) − η ∇RSS(w)

  3. t ← t + 1

What is the complexity of each iteration? O(ND)
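A minimal sketch of this loop (not from the slides); the step size η = 0.05 and the iteration count are illustrative choices for the toy univariate data:

import numpy as np

X = np.array([[1, 1.0], [1, 2.0], [1, 1.5], [1, 2.5]])   # augmented rows [1, sqft]
y = np.array([2.0, 3.5, 3.0, 4.5])

w = np.zeros(X.shape[1])     # w^(0)
eta = 0.05                   # learning rate
for t in range(5000):
    grad = X.T @ (X @ w - y)   # gradient as written above (the factor of 2 is absorbed into eta)
    w = w - eta * grad         # parameter update
print(w)                       # approaches [0.45, 1.6], i.e. w0 ≈ 0.45, w1 ≈ 1.6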

Why would this work?

If gradient descent converges, it will converge to the same solution as using matrix inversion.

This is because RSS(w) is a convex function of its parameters w.

Hessian of RSS:

  RSS(w) = w^T X^T X w − 2 (X^T y)^T w + const

  ⇒  ∂^2 RSS(w) / ∂w ∂w^T = 2 X^T X

X^T X is positive semidefinite, because for any v

  v^T X^T X v = ‖Xv‖_2^2 ≥ 0
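A quick numerical check of this claim (a sketch, not from the slides): the eigenvalues of the Hessian 2 X^T X are all non-negative on the toy design matrix.

import numpy as np

X = np.array([[1, 1.0], [1, 2.0], [1, 1.5], [1, 2.5]])   # augmented design matrix
H = 2 * X.T @ X                                          # Hessian of RSS
print(np.linalg.eigvalsh(H))                             # symmetric eigensolver; all values >= 0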


Stochastic gradient descent (SGD)

Widrow-Hoff rule: update parameters using one example at a time

• Initialize w to some w^(0); set t = 0; choose η > 0

• Loop until convergence

  1. Randomly choose a training sample x_t

  2. Compute its contribution to the gradient: g_t = (x_t^T w^(t) − y_t) x_t

  3. Update the parameters: w^(t+1) = w^(t) − η g_t

  4. t ← t + 1

How does the complexity per iteration compare with gradient descent?

• O(ND) for gradient descent versus O(D) for SGD
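A minimal SGD sketch (not from the slides) on the same toy data; the constant learning rate and number of updates are illustrative choices:

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[1, 1.0], [1, 2.0], [1, 1.5], [1, 2.5]])
y = np.array([2.0, 3.5, 3.0, 4.5])

w = np.zeros(X.shape[1])
eta = 0.02
for t in range(20000):
    n = rng.integers(len(y))             # pick one training sample at random
    g = (X[n] @ w - y[n]) * X[n]         # that sample's contribution to the gradient
    w = w - eta * g                      # Widrow-Hoff update
print(w)                                 # noisy, but close to [0.45, 1.6]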

SGD versus Batch GD

• SGD reduces per-iteration complexity from O(ND) to O(D)

• But it is noisier and can take longer to converge


How to Choose Learning Rate η in practice?

• Try 0.0001, 0.001, 0.01, 0.1, etc. on a validation dataset (more on this later) and choose the one that gives the fastest, stable convergence

• Reduce η by a constant factor (e.g., 10) when learning saturates, so that we can get closer to the true minimum (a small sketch of this decay heuristic follows below)

• More advanced learning rate schedules such as AdaGrad, Adam, and AdaDelta are used in practice
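One way to read the "decay η when learning saturates" heuristic as code (a rough sketch, not from the course materials; the tolerance, decay factor, and stopping threshold are arbitrary illustrative choices):

import numpy as np

X = np.array([[1, 1.0], [1, 2.0], [1, 1.5], [1, 2.5]])
y = np.array([2.0, 3.5, 3.0, 4.5])

w = np.zeros(X.shape[1])
eta, tol = 0.05, 1e-9
prev_rss = np.inf
for t in range(10000):
    grad = X.T @ (X @ w - y)
    w = w - eta * grad
    rss = np.sum((X @ w - y) ** 2)
    if prev_rss - rss < tol:     # learning has saturated at this eta
        eta /= 10.0              # decay the learning rate
        if eta < 1e-8:           # stop once eta is negligibly small
            break
    prev_rss = rss
print(w, eta)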

Mini-Summary

• Linear regression is a linear combination of features:

  f : x → y, with f(x) = w0 + Σ_d w_d x_d = w0 + w^T x

• If we minimize the residual sum of squares as our learning objective, we get a closed-form solution for the parameters

• Probabilistic interpretation: minimizing RSS is maximum likelihood estimation if we assume the residuals are Gaussian distributed

• Gradient Descent and mini-batch SGD can overcome the computational issues