
Linear Regression: Model and Algorithms
CSE 446: Machine Learning
Emily Fox
University of Washington
January 9, 2017
©2017 Emily Fox

Linear regression: The model


How much is my house worth?

I want to list my house for sale.


$$ ????


Data

(x1 = sq.ft., y1 = $)
(x2 = sq.ft., y2 = $)
(x3 = sq.ft., y3 = $)
(x4 = sq.ft., y4 = $)
(x5 = sq.ft., y5 = $)

Input vs. output:
• x (sq.ft.) is the input; y ($) is the output
• y is the quantity of interest
• assume y can be predicted from x

Model – How we assume the world works

[Figure: scatter plot of price y ($) vs. square feet x (sq.ft.)]

Regression model: y = f(x) + ε


Model – How we assume the world works

[Figure: price y ($) vs. square feet x (sq.ft.) scatter with a fitted curve]

“Essentially, all models are wrong, but some are useful.”
– George Box, 1987

Simple linear regression model

yi = w0 + w1 xi + εi

f(x) = w0 + w1 x

[Figure: price y ($) vs. square feet x (sq.ft.) with the fitted line f(x)]
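To make this concrete, here is a minimal NumPy sketch (not from the lecture; the data values are invented) that fits w0 and w1 by least squares:

import numpy as np

# Hypothetical (sq.ft., price) training data, invented for illustration
x = np.array([1000.0, 1500.0, 2000.0, 2500.0, 3000.0])
y = np.array([250e3, 330e3, 410e3, 490e3, 580e3])

# Least-squares fit of f(x) = w0 + w1 x
A = np.column_stack([np.ones_like(x), x])  # one column per feature: 1, x
w0, w1 = np.linalg.lstsq(A, y, rcond=None)[0]
print("f(x) = %.1f + %.1f x" % (w0, w1))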


Parameters of the simple linear regression model yi = w0 + w1 xi + εi: the regression coefficients w0 (intercept) and w1 (slope).

What about a quadratic function?

f(x) = w0 + w1 x + w2 x^2

[Figure: price y ($) vs. square feet x (sq.ft.) with a fitted quadratic]


Even higher order polynomial

f(x) = w0 + w1 x + w2 x^2 + … + wp x^p

[Figure: price y ($) vs. square feet x (sq.ft.) with a high-order polynomial fit]

Polynomial regression

Model: yi = w0 + w1 xi + w2 xi^2 + … + wp xi^p + εi

Treat the powers of x as different features:
feature 1 = 1 (constant)   →  parameter 1 = w0
feature 2 = x              →  parameter 2 = w1
feature 3 = x^2            →  parameter 3 = w2
…
feature p+1 = x^p          →  parameter p+1 = wp
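A sketch of this featurization, assuming NumPy (the helper name poly_features is mine, not the course's):

import numpy as np

def poly_features(x, p):
    """Feature matrix with one row per data point and columns 1, x, x^2, ..., x^p."""
    return np.vander(x, N=p + 1, increasing=True)

# e.g., rows [1, x_i, x_i^2] for a quadratic fit
H = poly_features(np.array([1.0, 2.0, 3.0]), p=2)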


Generic basis expansion

Model: yi = w0 h0(xi) + w1 h1(xi) + … + wD hD(xi) + εi
          = Σj=0..D wj hj(xi) + εi

wj is the jth regression coefficient or weight; hj is the jth feature:
feature 1 = h0(x) … often 1 (constant)
feature 2 = h1(x) … e.g., x
feature 3 = h2(x) … e.g., x^2 or sin(2πx/12)
…
feature D+1 = hD(x) … e.g., x^p

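One natural way to code a generic basis expansion is as a list of callables, one per hj; this is an illustrative sketch, not the course's implementation:

import numpy as np

# Example basis functions h_j; any functions of x would do
basis = [
    lambda x: np.ones_like(x),             # h0: constant
    lambda x: x,                           # h1: identity
    lambda x: x**2,                        # h2: quadratic
    lambda x: np.sin(2 * np.pi * x / 12),  # h3: seasonal term from the slide
]

def feature_matrix(x, basis):
    """H[i, j] = h_j(x_i): one row per data point, one column per feature."""
    return np.column_stack([h(x) for h in basis])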


Predictions just based on house size

[Figure: price y ($) vs. square feet x (sq.ft.), predictions from size alone]

“Only 1 bathroom!” “Only 1 bathroom!” “Not the same as my 3 bathrooms!”

Add more inputs

f(x) = w0 + w1 sq.ft. + w2 #bath

[Figure: price y ($) vs. square feet x[1] and # bathrooms x[2], fitted plane]


Many possible inputs

- Square feet

- # bathrooms

- # bedrooms

- Lot size

- Year built

- …


General notation

Output: y (scalar)
Inputs: x = (x[1], x[2], …, x[d]) (d-dim vector)

Notational conventions:
x[j] = jth input (scalar)
hj(x) = jth feature (scalar)
xi = input of ith data point (vector)
xi[j] = jth input of ith data point (scalar)


Generic linear regression model

Model: yi = w0 h0(xi) + w1 h1(xi) + … + wD hD(xi) + εi
          = Σj=0..D wj hj(xi) + εi

feature 1 = h0(x) … e.g., 1
feature 2 = h1(x) … e.g., x[1] = sq. ft.
feature 3 = h2(x) … e.g., x[2] = #bath, or log(x[7]) · x[2] = log(#bed) × #bath
…
feature D+1 = hD(x) … some other function of x[1], …, x[d]

Fitting the linear regression model


Step 1: Rewrite the regression model

Rewrite in matrix notation

For observation i:

yi = Σj=0..D wj hj(xi) + εi
   = h(xi)ᵀw + εi

where w = [w0, w1, …, wD]ᵀ and h(xi) = [h0(xi), h1(xi), …, hD(xi)]ᵀ.


Rewrite in matrix notation

For all observations together:

y = Hw + ε

where y = [y1, …, yN]ᵀ, ε = [ε1, …, εN]ᵀ, and H is the N × (D+1) matrix whose ith row is h(xi)ᵀ.
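A quick NumPy check (my illustration; the dimensions and weights are made up) that the per-observation form and the matrix form agree:

import numpy as np

rng = np.random.default_rng(0)
N, D = 5, 2
# H: one row per observation; columns are h0 = 1, h1, h2
H = np.column_stack([np.ones(N), rng.uniform(500, 3000, size=(N, D))])
w = np.array([10e3, 100.0, 50.0])  # [w0, w1, w2], invented values
eps = rng.normal(0.0, 1.0, size=N)

y = H @ w + eps          # all observations at once: y = Hw + eps
y0 = H[0] @ w + eps[0]   # observation i = 0: sum_j wj hj(xi) + eps_i
assert np.isclose(y[0], y0)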

Step 2: Compute the cost


“Cost” of using a given line

[Figure: price y ($) vs. square feet x (sq.ft.), residuals from a candidate line]

Residual sum of squares (RSS):

RSS(w0, w1) = Σi (yi − [w0 + w1 xi])^2

RSS for multiple regression

RSS(w) = Σi (yi − h(xi)ᵀw)^2
       = (y − Hw)ᵀ(y − Hw)

[Figure: price y ($) vs. square feet x[1] and # bathrooms x[2], fitted plane]
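A sketch of both ways of computing RSS, assuming NumPy (the helper name rss is mine):

import numpy as np

def rss(w, H, y):
    """Residual sum of squares, computed two equivalent ways."""
    resid = y - H @ w
    elementwise = np.sum(resid**2)  # sum_i (yi - h(xi)^T w)^2
    matrix_form = resid @ resid     # (y - Hw)^T (y - Hw)
    assert np.isclose(elementwise, matrix_form)
    return matrix_form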


Step 3: Take the gradient

Gradient of RSS

∇RSS(w) = ∇[(y − Hw)ᵀ(y − Hw)]
        = −2Hᵀ(y − Hw)

Why? By analogy to the 1D case: d/dw (y − hw)^2 = −2h(y − hw).
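The matrix gradient is easy to sanity-check numerically; this finite-difference sketch is my addition, not part of the lecture:

import numpy as np

def rss_grad(w, H, y):
    """Gradient of RSS(w) = (y - Hw)^T (y - Hw)."""
    return -2 * H.T @ (y - H @ w)

rng = np.random.default_rng(1)
H = rng.normal(size=(20, 3))
y = rng.normal(size=20)
w = rng.normal(size=3)
rss = lambda w: (y - H @ w) @ (y - H @ w)

# Central differences along each coordinate should match the gradient
delta = 1e-6
for j in range(3):
    e = np.zeros(3)
    e[j] = delta
    fd = (rss(w + e) - rss(w - e)) / (2 * delta)
    assert np.isclose(fd, rss_grad(w, H, y)[j], atol=1e-4)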


Step 4, Approach 1: Set the gradient = 0

Closed-form solution

∇RSS(w) = −2Hᵀ(y − Hw) = 0

Solve for w: HᵀHw = Hᵀy (the normal equations).


Closed-form solution

ŵ = (HᵀH)⁻¹ Hᵀy

Invertible if: HᵀH is full rank, i.e., the columns of H (the features) are linearly independent; in particular N ≥ D+1.

Complexity of inverse: O(D³) for the (D+1) × (D+1) matrix HᵀH.
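In code one usually solves the normal equations rather than forming the inverse explicitly; a minimal sketch assuming NumPy (fit_closed_form is my name):

import numpy as np

def fit_closed_form(H, y):
    """Solve the normal equations H^T H w = H^T y for w.

    np.linalg.solve is cheaper and numerically more stable than
    computing (H^T H)^{-1} and multiplying; the cost is still O(D^3).
    """
    return np.linalg.solve(H.T @ H, H.T @ y)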

Step 4, Approach 2: Gradient descent


Gradient descent

while not converged:
  w(t+1) ← w(t) − η ∇RSS(w(t))
         = w(t) + 2η Hᵀ(y − Hw(t))

Interpreting elementwise

Update to the jth feature weight:

wj(t+1) ← wj(t) + 2η Σi hj(xi)(yi − ŷi(w(t)))

If feature j is positively correlated with the current residuals (predictions too low where hj(xi) is large), the update increases wj, and vice versa.

[Figure: price y ($) vs. square feet x[1] and # bathrooms x[2]]


Summary of gradient descentfor multiple regression

©2017 Emily Fox

init w(1)=0 (or randomly, or smartly), t=1

while || RSS(w(t))|| > εfor j=0,…,D

partial[j] =-2 hj(xi)(yi-ŷi(w(t)))

wj(t+1) wj

(t) – η partial[j]

t t + 1

Δ
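A direct NumPy translation of this pseudocode (a sketch: the step size η and tolerance ε are illustrative and problem dependent, and in practice features are often rescaled first so that a single step size works):

import numpy as np

def fit_gradient_descent(H, y, eta=1e-3, eps=1e-3, max_iter=100_000):
    """Gradient descent on RSS(w) = (y - Hw)^T (y - Hw)."""
    w = np.zeros(H.shape[1])  # init w(1) = 0
    for _ in range(max_iter):
        grad = -2 * H.T @ (y - H @ w)  # stacks partial[j] over j = 0..D
        if np.linalg.norm(grad) <= eps:  # converged
            break
        w = w - eta * grad
    return w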

Why min RSS?


Assuming Gaussian noise

[Figure: price y ($) vs. square feet x (sq.ft.) with Gaussian noise around the fitted line]

Model for εi: εi ~ N(0, σ^2)

Implied distribution on yi: yi ~ N(w0 + w1 xi, σ^2)

Maximum likelihood estimate of params

Maximize log-likelihood wrt w
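Filling in the standard derivation (not verbatim from the slides): with εi ~ N(0, σ²), the log-likelihood is

\ell(w) = \log \prod_{i=1}^{N} \mathcal{N}\!\left(y_i \,\middle|\, h(x_i)^\top w, \sigma^2\right)
        = -\frac{N}{2}\log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}\left(y_i - h(x_i)^\top w\right)^2 ,

and since the first term does not depend on w,

\hat{w}_{\mathrm{MLE}} = \arg\max_w \ell(w) = \arg\min_w \mathrm{RSS}(w).

That is, under the Gaussian noise model, maximizing the likelihood is exactly minimizing RSS.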


Interpreting the fitted function

Interpreting the coefficients – Simple linear regression

ŷ = ŵ0 + ŵ1 x

ŵ1 = predicted change in price ($) per 1 sq. ft.

[Figure: price y ($) vs. square feet x (sq.ft.) with fitted line]


Interpreting the coefficients – Two linear features

ŷ = ŵ0 + ŵ1 x[1] + ŵ2 x[2]

ŵ2 = predicted change in price ($) per 1 bathroom, for fixed # sq.ft.!

[Figure: price y ($) vs. # bathrooms x[2] for a fixed number of sq.ft.; the slope of the fitted line is ŵ2]


Interpreting the coefficients – Multiple linear features

ŷ = ŵ0 + ŵ1 x[1] + … + ŵj x[j] + … + ŵd x[d]

ŵj = predicted change in the output per unit change in x[j], holding all other features fixed.

[Figure: price y ($) vs. square feet x[1] and # bathrooms x[2]]

Interpreting the coefficients – Polynomial regression

ŷ = ŵ0 + ŵ1 x + … + ŵj x^j + … + ŵp x^p

Can't hold the other features fixed! All of the features are functions of the same input x, so the individual coefficients have no such interpretation.

[Figure: price y ($) vs. square feet x (sq.ft.) with polynomial fit]


Recap of concepts


What you can do now…
• Describe polynomial regression
• Write a regression model using multiple inputs or features thereof
• Cast both polynomial regression and regression with multiple inputs as regression with multiple features
• Calculate a goodness-of-fit metric (e.g., RSS)
• Estimate model parameters of a general multiple regression model to minimize RSS:
  - In closed form
  - Using an iterative gradient descent algorithm
• Interpret the coefficients of a non-featurized multiple regression fit
• Exploit the estimated model to form predictions