Regression Analysis: Intro to OLS Linear Regression
DESCRIPTION

Statistical Relationships - A Warning: Be aware that, as with correlation and other measures of statistical association, a relationship does not guarantee or even imply causality between the variables. Also be aware of the difference between a mathematical or functional relationship based upon theory and a statistical relationship based upon data and its imperfect fit to a mathematical model.

TRANSCRIPT
Regression Analysis
Intro to OLS Linear Regression
Regression Analysis
Defined as the analysis of the statistical relationship among variables. In its simplest form there are only two variables:
- Dependent or response variable (labeled as Y)
- Independent or predictor variable (labeled as X)
Statistical Relationships - A Warning
Be aware that, as with correlation and other measures of statistical association, a relationship does not guarantee or even imply causality between the variables. Also be aware of the difference between a mathematical or functional relationship based upon theory and a statistical relationship based upon data and its imperfect fit to a mathematical model.
Simple Linear Regression
The basic function for linear regression is Y = f(X), but the equation typically takes the following form:

Y = α + βX + ε

- α (alpha) – an intercept component to the model that represents the model's value for Y when X = 0
- β (beta) – a coefficient that loosely denotes the nature of the relationship between Y and X, and more specifically denotes the slope of the linear equation that specifies the model
- ε (epsilon) – a term that represents the errors associated with the model
Example
i in this case is a "counter" representing the ith observation in the data set.

observation (i) | number of ice cream cones sold (Y) | cost of ice cream cones (X)
1  |  84 | $2.50
2  |  89 | $3.00
3  |  92 | $3.25
4  |  96 | $2.25
5  |  98 | $1.75
6  | 102 | $2.75
7  | 113 | $2.00
8  | 114 | $1.50
9  | 122 | $1.25
10 | 127 | $1.00
Accompanying Scatterplot
[Scatterplot: "Ice Cream Demand" – ice cream cones sold (80–130) plotted against ice cream cone cost ($0.75–$3.50)]
Accompanying Scatterplot with Regression Equation
[Scatterplot: "Ice Cream Demand" – ice cream cones sold (80–130) plotted against ice cream cone cost ($0.75–$3.50), with fitted line y = -16.094x + 137.81 and R² = 0.7078]
What does the additional info mean?
- α (alpha) – 138 cones
- β (beta) – -16 cones per $1 increase in cost
- ε (epsilon) – still present, and evidenced by the fact that the model does not fit the data perfectly
- R² – a new term, the Coefficient of Determination. A value of 0.71 is pretty good, considering that the value is scaled between 0 and 1, with 1 being a model in perfect agreement with the data.
Coefficient of Determination
In this simple example R² is indeed the square of r. Recall that r is often the symbol for the Pearson Product Moment Correlation (PPMC), which is a parametric measure of association between two variables. Here r(X, Y) = -0.84, and (-0.84)^2 = 0.71.
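As a quick check on the slide's figures, the ice cream data can be fit with a standard library (a sketch using NumPy; the exact decimals differ slightly from the chart's, presumably because of rounding in the original slide):

```python
import numpy as np

# Ice cream data from the table above
cost = np.array([2.50, 3.00, 3.25, 2.25, 1.75, 2.75, 2.00, 1.50, 1.25, 1.00])  # X
cones = np.array([84, 89, 92, 96, 98, 102, 113, 114, 122, 127])                # Y

# Least-squares fit: cones = beta * cost + alpha
beta, alpha = np.polyfit(cost, cones, 1)

# Pearson's r; its square is the coefficient of determination
r = np.corrcoef(cost, cones)[0, 1]

print(f"beta ~ {beta:.2f}")    # about -16 cones per $1 increase in cost
print(f"alpha ~ {alpha:.1f}")  # about 138 cones
print(f"r ~ {r:.2f}, R^2 ~ {r**2:.2f}")  # about -0.84 and 0.71
```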
A Digression into History
Adrien-Marie Legendre – the original author of "the method of least squares," published in 1805.
The guy that got the credit: Carl Friedrich Gauss – the "giant" of early statistics – published the theory of least squares in 1821.
Back on Topic – a recap of PPMC or r
From last semester: the PPMC coefficient is essentially the sum of the products of the z-scores for each variable divided by the degrees of freedom. Its computation can take on a number of forms depending on your resources.
What it looks like in equation form:

r = Σ(zx · zy) / (n − 1)

r = Σ(x − x̄)(y − ȳ) / [(n − 1) · sx · sy],  where sx = √[Σ(x − x̄)² / (n − 1)]

Mathematically simplified:

r = Σ(x − x̄)(y − ȳ) / √[Σ(x − x̄)² · Σ(y − ȳ)²]

Computationally easier:

r = [ΣXY − (ΣX)(ΣY)/n] / √{[ΣX² − (ΣX)²/n] · [ΣY² − (ΣY)²/n]}

The sample covariance is the second equation's numerator without the sample standard deviations in the denominator. Covariance measures how two variables covary, and it is this measure that serves as the numerator in Pearson's r.
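The z-score form of Pearson's r described above can be sketched directly (a minimal example on the ice cream data, using the sample n − 1 divisor throughout):

```python
import math

x = [2.50, 3.00, 3.25, 2.25, 1.75, 2.75, 2.00, 1.50, 1.25, 1.00]  # cone cost
y = [84, 89, 92, 96, 98, 102, 113, 114, 122, 127]                  # cones sold
n = len(x)

def mean(v):
    return sum(v) / len(v)

def sample_sd(v):
    m = mean(v)
    return math.sqrt(sum((vi - m) ** 2 for vi in v) / (len(v) - 1))

# z-scores for each variable
zx = [(xi - mean(x)) / sample_sd(x) for xi in x]
zy = [(yi - mean(y)) / sample_sd(y) for yi in y]

# r = sum of the products of the z-scores, divided by the degrees of freedom
r = sum(a * b for a, b in zip(zx, zy)) / (n - 1)

# The sample covariance is the same idea without standardizing
cov_xy = sum((xi - mean(x)) * (yi - mean(y)) for xi, yi in zip(x, y)) / (n - 1)

print(round(r, 3))  # about -0.84 for the ice cream data
```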
Take home message
Correlation is a measure of association between two variables. Covariance is a measure of how the two variables vary with respect to one another. Both of these are parametrically based statistical measures – note that PPMC is based upon z-scores. Z-scores are based upon the normal or Gaussian distribution; thus these measures, as well as linear regression based upon the method of least squares, are predicated upon the assumption of normality and other parametric assumptions.
OLS Defined
OLS stands for Ordinary Least Squares. This is a method of estimation that is used in linear regression. Its defining and nominal criterion is that it minimizes the errors associated with predicting values for Y. It uses a least squares criterion because a simple "least" criterion would allow positive and negative deviations from the model to cancel each other out (using the same logic that is used for computations of variance and a host of other statistical measures).

min Σ(Yi − Ŷi)²,  summed over i = 1 to n
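The cancellation point can be seen with a tiny illustration (hypothetical residuals, just to show why a plain sum of errors is a useless criterion):

```python
# Two hypothetical residuals from a model: one over-prediction, one under-prediction
errors = [5.0, -5.0]

plain_sum = sum(errors)                        # the errors cancel, hiding the misfit
sum_of_squares = sum(e ** 2 for e in errors)   # squaring keeps both deviations visible

print(plain_sum, sum_of_squares)  # 0.0 50.0
```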
The math behind OLS
Recall that the linear regression equation for a single independent variable takes this form: Y = α + βX + ε

min Σ(Yi − α − βXi)²,  summed over i = 1 to n
Since Y and X are known for all i and the error term is immutable, minimizing the model errors is really based upon our choice of alpha and beta:

min Σ(Yi − α − βXi)²

This holds under the condition that S is the total sum of squared deviations from i = 1 to n for all Y and X, for a given alpha and beta:

S(α, β) = Σ(Yi − α − βXi)²
The correct alpha and beta to minimize S can be found by taking the partial derivative of S with respect to each of alpha and beta and setting it equal to zero, yielding

Σ(Yi − α − βXi) = 0  for alpha, and

ΣXi(Yi − α − βXi) = 0  for beta,

which can be further simplified to

ΣYi = nα + βΣXi  for alpha, and

ΣXiYi = αΣXi + βΣXi²  for beta
Refer to page 436 for the text's more detailed description of the computations for solving for alpha and beta.

ΣXiYi = αΣXi + βΣXi²

ΣYi = nα + βΣXi
Given these, we can easily solve for the simpler alpha via algebra:

ΣYi = nα + βΣXi

is

α = ΣYi/n − β·ΣXi/n

and since X̄ is the sum of all Xi from 1 to n divided by n, and the same can be said for Ȳ, we are left with

α = Ȳ − βX̄

Since the mean of both X and Y can be obtained from the data, we can calculate the intercept or alpha very simply if we know the slope or beta.
Once we have a simple equation for alpha, we can plug it into the equation for beta and then solve for the slope of the regression equation. Substituting α = ΣYi/n − β·ΣXi/n into

ΣXiYi = αΣXi + βΣXi²

gives

ΣXiYi = (ΣYi/n − β·ΣXi/n)·ΣXi + βΣXi²

which rearranges to

ΣXiYi − (ΣXi)(ΣYi)/n = β·[ΣXi² − (ΣXi)²/n]
Multiply by n and you get

nΣXiYi − (ΣXi)(ΣYi) = β·[nΣXi² − (ΣXi)²]

Isolate beta and we have

β = [nΣXiYi − (ΣXi)(ΣYi)] / [nΣXi² − (ΣXi)²]
Alpha, or the regression intercept:

α = Ȳ − βX̄

Beta, or the regression slope:

β = [nΣXiYi − (ΣXi)(ΣYi)] / [nΣXi² − (ΣXi)²]
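The derived formulas can be applied directly to the ice cream data (a sketch implementing the sums exactly as written above; small differences from the chart's decimals come from rounding in the original slide):

```python
# Ice cream data from the earlier table
x = [2.50, 3.00, 3.25, 2.25, 1.75, 2.75, 2.00, 1.50, 1.25, 1.00]  # cost (X)
y = [84, 89, 92, 96, 98, 102, 113, 114, 122, 127]                  # cones sold (Y)
n = len(x)

sum_x = sum(x)
sum_y = sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)

# beta = [n*Sum(XiYi) - Sum(Xi)*Sum(Yi)] / [n*Sum(Xi^2) - (Sum(Xi))^2]
beta = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

# alpha = Ybar - beta * Xbar
alpha = sum_y / n - beta * sum_x / n

print(f"beta ~ {beta:.2f}, alpha ~ {alpha:.2f}")  # roughly -16 and 138
```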
Given this info, let's head over to the lab and get some hands-on practice using the small and relatively simple ice cream sales data set. We will cover the math behind the coefficient of determination on Thursday and introduce regression with multiple independent variables.