Jarrar © 2018
Mustafa Jarrar: Lecture Notes on Linear Regression, Birzeit University, 2018
Machine Learning: Linear Regression
Version 1
Watch this lecture and download the slides.
More Online Courses at: http://www.jarrar.info
Course Page: http://www.jarrar.info/courses/AI/
Acknowledgement: This lecture is based on (but not limited to) Andrew Ng's course about Machine Learning: https://www.youtube.com/channel/UCMoXOGX9mgrYNEwpcIQUcag
In this lecture:
- Part 1: Motivation (Regression Problems)
- Part 2: Linear Regression
- Part 3: The Cost Function
- Part 4: The Gradient Descent Algorithm
- Part 5: The Normal Equation
- Part 6: Linear Algebra Overview
- Part 7: Using Octave
- Part 8: Using R
Motivation
Given the following housing prices, how much does a house of 1250 ft² cost?
[Figure: Price (in 1000s of dollars) vs. Size (ft²), with a line fitted through the data points]
If we assume a linear curve fits the data, we can read off the fitted line that a house of 1250 ft² costs about $230K.
Motivation
Given the following housing prices, we are also given the "right answer" (the actual price) for each example in the data: this is our training data.
This is supervised learning, and specifically a regression problem: we predict a real-valued output. (Recall that classification, not regression, refers to predicting a discrete-valued output.)
[Figure: Price (in 1000s of dollars) vs. Size (ft²)]
Motivation
Given the following training set of housing prices:

Size in feet² (x) | Price ($) in 1000s (y)
2104 | 460
1416 | 232
1534 | 315
852  | 178
…    | …

Notation:
m = number of training examples
x's = "input" variables/features
y's = output variables/target variables
(x, y): a training example
(xⁱ, yⁱ): the i-th training example

For example: x¹ = 2104, y¹ = 460.
Our job is to learn from this data how to predict prices.
Part 2: Linear Regression
Linear Regression
[Figure: scatter plot of y vs. x with a fitted line]
Training Set → Learning Algorithm → Hypothesis h
Size of house (x) → h → Estimated price (y); h maps x's to y's.
How do we represent h?

h(x) = θ0 + θ1x

- This is a linear function.
- This model is also called linear regression with one variable.
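The one-variable hypothesis can be written directly in code. This is a sketch, not part of the original slides; the θ values used here (θ0 = 1, θ1 = 0.5) are just illustrative:

```python
def h(x, theta0, theta1):
    """Univariate linear regression hypothesis: h(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

# Illustrative parameters: intercept 1, slope 0.5.
print(h(2, 1.0, 0.5))  # 2.0
```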
Linear Regression with Multiple Features
Suppose we have the following features:

Size ft² (x1) | Bedrooms (x2) | Floors (x3) | Age (x4) | Price (y)
2104 | 5 | 1 | 45 | 460
1416 | 3 | 2 | 40 | 232
1534 | 3 | 2 | 30 | 315
852  | 2 | 1 | 36 | 178
…    | … | … | …  | …

A hypothesis function h(x) might be:

h(x) = θ0 + θ1x1 + θ2x2 + θ3x3 + … + θnxn

or, with x0 = 1:

h(x) = θ0x0 + θ1x1 + θ2x2 + θ3x3 + … + θnxn

e.g., h(x) = 80 + 0.1·x1 + 0.01·x2 + 3·x3 − 2·x4

Linear regression with multiple features is also called multiple linear regression.
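As a sketch (not from the slides), the multi-feature hypothesis is just a weighted sum; below it is evaluated with the slide's example parameters θ = (80, 0.1, 0.01, 3, −2) on the first training row:

```python
def h(x, theta):
    """Multi-feature hypothesis: h(x) = θ0*x0 + θ1*x1 + ... + θn*xn, with x0 = 1."""
    x = [1.0] + list(x)          # prepend the bias feature x0 = 1
    return sum(t * xi for t, xi in zip(theta, x))

theta = [80, 0.1, 0.01, 3, -2]   # example parameters from the slide
house = [2104, 5, 1, 45]         # size, bedrooms, floors, age (first training row)
print(h(house, theta))           # 80 + 0.1*2104 + 0.01*5 + 3*1 - 2*45 ≈ 203.45
```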
Polynomial Regression
Another hypothesis function h(x) might be polynomial:

h(x) = θ0x0 + θ1x1 + θ2x2² + θ3x3³ + … + θnxnⁿ  (x0 = 1)

Suppose we have the same housing features as before (size, bedrooms, floors, age, and price).
Polynomial Regression
[Figure: Price (y) vs. Size (x), with three different curves fitted to the data]

h(x) = θ0 + θ1x + θ2x²
h(x) = θ0 + θ1x + θ2x² + θ3x³
h(x) = θ0 + θ1x + θ2√x

- We may combine features (e.g., size = width × depth).
- We have the option of which features and which models (quadratic, cubic, …) to use.
- Deciding which features and models best fit our data and application is beyond the scope of this course, but there are several algorithms for this.
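A polynomial hypothesis in one variable can be sketched as below; the coefficient values are illustrative assumptions, not fitted parameters:

```python
def h_poly(x, theta):
    """Polynomial hypothesis: h(x) = θ0 + θ1*x + θ2*x^2 + ... + θn*x^n."""
    return sum(t * x**i for i, t in enumerate(theta))

# A pure quadratic, h(x) = x^2:
print(h_poly(3, [0, 0, 1]))     # 9
# A cubic with illustrative coefficients all equal to 1:
print(h_poly(2, [1, 1, 1, 1]))  # 1 + 2 + 4 + 8 = 15
```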
Part 3: The Cost Function
Understanding θ Values
[Figure: scatter plot of y vs. x with a fitted line]

h(x) = θ0 + θ1x

θi's: parameters.
How do we choose the θi's?
Understanding θ Values
[Figure: three example lines plotted on the same axes]

h(x) = 1 + 0.5x    (θ0 = 1, θ1 = 0.5)
h(x) = 0 + 0.5x    (θ0 = 0, θ1 = 0.5)
h(x) = 1.5 + 0·x   (θ0 = 1.5, θ1 = 0)
The Cost Function
[Figure: training examples (x, y) with a candidate line h(x)]
Idea: choose θ0, θ1 so that h(x) is close to y for our training examples (x, y).
y is the actual value; h(x) is the estimated value.

J(θ0, θ1) = 1/(2m) · ∑ᵢ₌₁ᵐ ( h(xⁱ) − yⁱ )²

This cost function is also called a squared error function. Our aim is to find the θ0, θ1 that minimize the error J.
Understanding the Cost Function
Let's try different values of θ1 (keeping θ0 = 0) and see how the cost function J(θ0, θ1) behaves. Suppose the training set is (1, 1), (2, 2), (3, 3):

J(1)   = 1/(2·3) [0² + 0² + 0²] = 0
J(0.5) = 1/(2·3) [(0.5−1)² + (1−2)² + (1.5−3)²] ≈ 0.58
J(0)   = 1/(2·3) [1² + 2² + 3²] ≈ 2.3

[Figure: the lines h(x) for θ1 = 1, 0.5, 0, and the resulting bowl-shaped curve of J(θ1)]

Given our training data, θ1 = 1 is the best value: it minimizes J(θ1). But this figure plots only θ1; the problem becomes more complicated when we also vary θ0, i.e., J(θ0, θ1).
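The values above can be reproduced with a short script. This is a sketch assuming the toy training set (1, 1), (2, 2), (3, 3) that the slide's numbers imply:

```python
def cost_j(theta1, data):
    """Squared-error cost J(θ1) = 1/(2m) * Σ (h(x) - y)^2, with θ0 fixed at 0."""
    m = len(data)
    return sum((theta1 * x - y) ** 2 for x, y in data) / (2 * m)

data = [(1, 1), (2, 2), (3, 3)]
print(cost_j(1.0, data))   # 0.0 (perfect fit)
print(cost_j(0.5, data))   # 3.5 / 6 ≈ 0.58
print(cost_j(0.0, data))   # 14 / 6 ≈ 2.33
```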
Cost Function Intuition II
Hypothesis: h(x) = θ0 + θ1x
Parameters: θ0, θ1
Cost Function: J(θ0, θ1) = 1/(2m) · ∑ᵢ₌₁ᵐ ( h(xⁱ) − yⁱ )²
Our Goal: minimize J(θ0, θ1)
Remember that our goal is to find the values of θ0 and θ1 that minimize J.
Cost Function Intuition II
Given any dataset, when we plot the cost function J(θ0, θ1), we may get a 3D bowl-shaped surface.
Our goal is to find the values of θ0, θ1 at which J(θ0, θ1) is minimal.
Cost Function Intuition II
We may also draw the cost function using contour plots.
Cost Function Intuition II
[Figure: a point (θ0, θ1) on the contour plot and the corresponding line on the data]
h(x) = 800 + 0.15·x
Cost Function Intuition II
[Figure: another point (θ0, θ1) on the contour plot and its corresponding line]
h(x) = 360 + 0·x
Cost Function Intuition II
[Figure: a point (θ0, θ1) near the minimum of the contour plot]
Is there any way/algorithm to find the θs automatically? Yes, e.g., the Gradient Descent Algorithm.
Part 4: The Gradient Descent Algorithm
The Gradient Descent Algorithm

Repeat until convergence {
    temp0 := θ0 − α · ∂/∂θ0 J(θ0, θ1)
    temp1 := θ1 − α · ∂/∂θ1 J(θ0, θ1)
    θ0 := temp0
    θ1 := temp1
}

α is the learning rate; ∂/∂θⱼ J is the partial derivative of the cost, for j = 0 or 1. The two parameters are updated simultaneously.
Start with some initial values of θ0 and θ1, and keep changing θ0 and θ1 to reduce J(θ0, θ1), until we hopefully end up at a minimum.
The Problems of the Gradient Descent Algorithm
- If α is too small, gradient descent can be slow.
- If α is too large, gradient descent can overshoot the minimum; it may fail to converge, or even diverge.
- Depending on where it starts, gradient descent may converge to the global minimum or to a local minimum.
The Gradient Descent Algorithm for Linear Regression
For linear regression, the partial derivatives can be written out, giving this simplified version (updating θj for j = 0 and 1 simultaneously):

Repeat until convergence {
    θ0 := θ0 − α · (1/m) · ∑ᵢ₌₁ᵐ ( h(xⁱ) − yⁱ )
    θ1 := θ1 − α · (1/m) · ∑ᵢ₌₁ᵐ ( h(xⁱ) − yⁱ ) · xⁱ
}

Next, we will use linear algebra to compute the θs that minimize J directly (the Normal Equation), without needing an iterative algorithm like Gradient Descent. However, Gradient Descent scales better to bigger datasets.
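These update rules can be sketched in plain Python. The dataset, learning rate, and iteration count below are illustrative assumptions, not from the slides; the data points lie on y = 1 + 2x, so gradient descent should recover θ ≈ (1, 2):

```python
def gradient_descent(data, alpha=0.1, iters=5000):
    """Batch gradient descent for h(x) = θ0 + θ1*x, with simultaneous updates."""
    theta0, theta1 = 0.0, 0.0
    m = len(data)
    for _ in range(iters):
        # Errors h(x^i) - y^i under the current parameters, paired with x^i.
        errs = [(theta0 + theta1 * x - y, x) for x, y in data]
        grad0 = sum(e for e, _ in errs) / m
        grad1 = sum(e * x for e, x in errs) / m
        # Simultaneous update of θ0 and θ1.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

t0, t1 = gradient_descent([(0, 1), (1, 3), (2, 5), (3, 7)])
print(round(t0, 3), round(t1, 3))  # 1.0 2.0
```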
Part 5: The Normal Equation
Minimizing θ Numerically
The Normal Equation is another way to estimate the optimal values of θ numerically, using linear algebra. Given the following features:

Size ft² (x1) | Bedrooms (x2) | Floors (x3) | Age (x4) | Price (y)
2104 | 5 | 1 | 45 | 460
1416 | 3 | 2 | 40 | 232
1534 | 3 | 2 | 30 | 315
852  | 2 | 1 | 36 | 178
…    | … | … | …  | …

convert the features and the target values into matrices:

X = [ 2104  5  1  45
      1416  3  2  40
      1534  3  2  30
       852  2  1  36 ]

y = [ 460
      232
      315
      178 ]
Minimizing θ Numerically
Add a column x0 = 1, so as to represent θ0:

X = [ 1  2104  5  1  45
      1  1416  3  2  40
      1  1534  3  2  30
      1   852  2  1  36 ]

y = [ 460
      232
      315
      178 ]
The Normal Equation
To obtain the optimal (minimizing) values of the θs, use the following Normal Equation:

θ = (Xᵀ · X)⁻¹ · Xᵀ · y

In Octave: pinv(X'*X)*X'*y

Where:
θ: the set of θs we want to minimize
X: the set of features in the training set
y: the output values we try to predict
Xᵀ: the transpose of X
X⁻¹: the inverse of X
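The normal equation can be sketched in plain Python without a linear algebra library, by solving the system (XᵀX)θ = Xᵀy with Gaussian elimination. The two-column dataset below is an illustrative assumption (points on the line y = 1 + 2x, with a bias column x0 = 1):

```python
def solve(A, b):
    """Solve A·x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix [A | b]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def normal_equation(X, y):
    """θ = (XᵀX)⁻¹ Xᵀ y, computed by solving (XᵀX)θ = Xᵀy."""
    n = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(n)]
    return solve(XtX, Xty)

# Bias column plus one feature; targets lie on y = 1 + 2x.
X = [[1, 1], [1, 2], [1, 3]]
y = [3, 5, 7]
print([round(t, 6) for t in normal_equation(X, y)])  # [1.0, 2.0]
```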
Gradient Descent vs. the Normal Equation
(m training examples, n features)

Gradient Descent:
- Needs to choose the learning rate α
- Needs many iterations
- Works well even when n is large

Normal Equation:
- No need to choose α
- No need to iterate
- Needs to compute (XᵀX)⁻¹, which takes about O(n³)
- Slow if n is very large
- Some matrices (e.g., singular ones) are non-invertible

Recommendation: use the Normal Equation if the number of features is less than about 1000; otherwise use Gradient Descent.
Part 6: Linear Algebra Overview
Matrix Vector Multiplication

[ 1 3 ; 4 0 ; 2 1 ] × [ 1 ; 5 ] = [ 16 ; 4 ; 7 ]

1×1 + 3×5 = 16
4×1 + 0×5 = 4
2×1 + 1×5 = 7
Matrix Matrix Multiplication

[ 1 3 2 ; 4 0 1 ] × [ 1 3 ; 0 1 ; 5 2 ] = [ 11 10 ; 9 14 ]

Multiply with the first column:
[ 1 3 2 ; 4 0 1 ] × [ 1 ; 0 ; 5 ] = [ 11 ; 9 ]

Multiply with the second column:
[ 1 3 2 ; 4 0 1 ] × [ 3 ; 1 ; 2 ] = [ 10 ; 14 ]
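Both products above can be checked with a small helper. This is a sketch in Python rather than Octave; matrices are lists of rows, and a vector is an n×1 matrix:

```python
def matmul(A, B):
    """Multiply an m×n matrix A by an n×p matrix B (both as lists of rows)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

# Matrix-vector product from the previous slide:
print(matmul([[1, 3], [4, 0], [2, 1]], [[1], [5]]))              # [[16], [4], [7]]
# Matrix-matrix product from this slide:
print(matmul([[1, 3, 2], [4, 0, 1]], [[1, 3], [0, 1], [5, 2]]))  # [[11, 10], [9, 14]]
```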
Matrix Inverse
If A is an m×m matrix, and it has an inverse, then A×A⁻¹ = A⁻¹×A = I.
The multiplication of a matrix with its inverse produces an identity matrix:

[ 3 4 ; 2 16 ] × [ 0.4 −0.1 ; −0.05 0.075 ] = [ 1 0 ; 0 1 ]

Some matrices do not have inverses, e.g., if all cells are zeros.
For more, please review Linear Algebra.
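The example above can be verified numerically: multiplying A by the given A⁻¹ yields the identity matrix, up to floating-point rounding. This sketch redefines a plain-Python matrix multiply so the snippet is self-contained:

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

A     = [[3, 4], [2, 16]]
A_inv = [[0.4, -0.1], [-0.05, 0.075]]

I = matmul(A, A_inv)
print(I)  # the 2x2 identity matrix, up to tiny floating-point error
```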
Matrix Transpose
Let A be an m×n matrix, and let B = Aᵀ. Then B is an n×m matrix, and Bᵢⱼ = Aⱼᵢ.

Example:
A = [ 1 2 0 ; 3 5 9 ]    Aᵀ = [ 1 3 ; 2 5 ; 0 9 ]
Part 7: Using Octave
Octave
- Download from https://www.gnu.org/software/octave/
- A high-level scientific programming language
- A free alternative to (and compatible with) Matlab
- Helps in solving linear and nonlinear problems numerically

Online tutorial: https://www.youtube.com/playlist?list=PLnnr1O8OWc6aAjSc50lzzPVWgjzucZsSD
Compute the normal equation using Octave
Given:

X = [ 1  2104  5  1  45
      1  1416  3  2  40
      1  1534  3  2  30
      1   852  2  1  36 ]

y = [ 460
      232
      315
      178 ]

In Octave:
load X.txt
load y.txt
C = pinv(X'*X)*X'*y
save θ.txt

θ = [ 188.4
        0.4
      −56
      −93
       −3.7 ]

These are the values of the θs we need to use in our hypothesis function h(x):
h(x) = θ0 + θ1x1 + θ2x2 + θ3x3 + θ4x4
h(x) = 188.4 + 0.4·x1 − 56·x2 − 93·x3 − 3.7·x4
Part 8: Using R
R (programming language)
R is a free software environment for statistical computing and graphics, supported by the R Foundation for Statistical Computing. R comes with many functions that can do sophisticated things, and the ability to install additional packages to do much more.
Download R: https://cran.r-project.org/
A very good IDE for R is RStudio: https://www.rstudio.com/products/rstudio/download/
R basics tutorial: https://www.youtube.com/playlist?list=PLjgj6kdf_snYBkIsWQYcYtUZiDpam7ygg
Compute the normal equation using R
Given:

X = [ 1  2104  5  1  45
      1  1416  3  2  40
      1  1534  3  2  30
      1   852  2  1  36 ]

y = [ 460
      232
      315
      178 ]

In R:
X = as.matrix(read.table("~/x.txt", header=F, sep=","))
Y = as.matrix(read.table("~/y.txt", header=F, sep=","))
thetas = solve( t(X) %*% X ) %*% t(X) %*% Y
write.table(thetas, file="~/thetas.txt", row.names=F, col.names=F)

t(X) is the transpose of matrix X; solve(X) is the inverse of matrix X.
References
[1] Andrew Ng’s course about Machine Learning https://www.youtube.com/channel/UCMoXOGX9mgrYNEwpcIQUcag
[2] Sami Ghawi, Mustafa Jarrar: Lecture Notes on Introduction to Machine Learning, Birzeit University, 2018
[3] Mustafa Jarrar: Lecture Notes on Decision Tree Machine Learning, Birzeit University, 2018
[4] Mustafa Jarrar: Lecture Notes on Linear Regression Machine Learning, Birzeit University, 2018
Extra Example
Given the following melon prices in previous years:

Year | Grape Price | Watermelon Price | Melon Price
2013 | 5 | 3 | 4
2014 | 6 | 2 | 3
2015 | 7 | 2 | 3

X = [ 1 5 3 ; 1 6 2 ; 1 7 2 ]
Y = [ 4 ; 3 ; 3 ]
X = [ 1 5 3 ; 1 6 2 ; 1 7 2 ]
Xᵀ = [ 1 1 1 ; 5 6 7 ; 3 2 2 ]

X·Xᵀ = [ 35 37 42 ; 37 41 47 ; 42 47 54 ]

inv(X·Xᵀ) = [ 5 −24 17 ; −24 126 −91 ; 17 −91 66 ]

In Octave: C = pinv(X'*X)*(X'*Y)