Simple Linear Regression

I want to start this section with a story. Imagine we take everyone in the class and line them up from shortest to tallest. As you look to the front of the class from your seat the shortest will be on the left and the tallest will be on the right.

In fact, in a face-to-face class we will line you up. Compare yourself to other people: if you are taller than someone else move to the right, if shorter move to the left.

Now, imagine we have everyone lined up in order from shortest to tallest. If you are back in your seat and you look down at the line-up (you have to use your imagination because you can not be both in the line-up and in your seat) I bet the line-up looks like the following (when thinking about the height of the people):

[Figure: the line-up by height, with the height axis marked at 5’6” and 6’1”]

On the previous screen you see most people are between 5’6” and 6’1”. There are some that are shorter and some that are taller. This is not rocket science, right? From the line-up we could calculate the average height for the group.

Now, instead of looking at the height of people, let’s look at the size of their feet. With everyone in the same order as for height, I would venture to say that the size of the feet gets larger as we go from left to right in the room. Imagine you are walking across the room looking down at people’s feet. It probably looks like the following (this may not be perfect, but it is a tendency):

Overview

Consider an example about a group of college graduates. Each graduate does not have the same dollar amount of starting salary.

Since each graduate does not have the same starting salary amount, an investigation might occur as to why not. In the investigation one might think about other variables that might influence starting salary. Starting salaries could be influenced by, among other things, the GPA of the student, the number of student groups the student was in, or even the work experience of the graduate. The GPA variable might be important because the higher the GPA, the higher the starting salary tends to be.

Overview

In the example so far, starting salary is called the dependent variable because the values for starting salary are thought to depend on the values for the other variables. The dependent variable is often called the y variable and in a graph is put on the vertical axis.

GPA, student group and work experience are all examples of independent variables. When we use just one independent variable with the dependent variable we have a situation where we can conduct SIMPLE LINEAR REGRESSION. The independent variable would be called the x variable and put on the horizontal axis. When two or more independent variables are used we could do MULTIPLE REGRESSION. For now we stick with simple linear regression.

The Model

If we investigated each graduate and noted the starting salary and GPA of each, we could create the model

y = β0 + β1x + ε.

y represents starting salary, β0 is the y intercept for the population (note the book uses a Greek letter here, beta, which looks like a capital B), β1 is the slope for the population, x is the GPA, and ε is the random error in y. The error term is not always zero because we know not every point in the scatter plot is on a straight line.

Note the y intercept is the mean value of y when x is 0. The slope β1 is a number that represents the expected change in y when x increases by 1 unit.

Using a Sample to Estimate the Model

On the next slide I show some data and a scatter plot for the example we have been developing. Note that a sample of 7 graduates has been taken, and each row of the table gives the GPA and starting salary of one graduate. Each point in the scatter plot is a (GPA, starting salary) pair for a graduate.

With a sample of data we can estimate the regression model (meaning we find point estimates of β0 and β1), and in general notation the predicted line is written

ŷ = b0 + b1x,

where b0 is the point estimate for β0, b1 is the point estimate for β1, and ŷ is the value of y that will be on the line above an x value. When x is the value for observation i, ŷ is called the predicted value of y for observation i.

Graduate   GPA    Start Salary (thousands of dollars)
1          3.26   33.8
2          2.60   29.8
3          3.35   33.5
4          2.86   30.4
5          3.82   36.4
6          2.21   27.6
7          3.47   35.3

[Scatter plot: Start Salary (vertical axis, 25 to 37) against GPA (horizontal axis, 2 to 4), with the linear trend line f(x) = 5.70656898050416 x + 14.8156152986465]

Least Squares Method

[Scatter plot: Y (vertical axis) against X (horizontal axis), with three lines labeled Line 1, Line 2 and Line 3]

Least Squares

On the previous slide I show a more generic scatter plot and I put three lines in the graph.

All three lines are decent in the sense that with the upward slope they all show the same basic idea as the dots in the graph: as x rises, y rises (meaning x and y are positively related). In theory we could find the equation for each line by algebra, or something like that. Then for each line we would have a b0 and b1 value.

Now line 1 is bad because it is too high. What I mean here is that if we used the line to predict y we would always predict too high a number. Similarly with line 3 we would be too low all the time.

Least Squares

Line 2 is “among” the data points and when you make predictions with the line sometimes you will be too high and sometimes too low. But no straight line can be exactly perfect (unless all the points are truly on a straight line, which will likely not happen in business and social research).

Line 2 is my interpretation of the line that would be picked by what is called the least squares method. Call the y value on the line ŷ. The least squares line is placed in such a way that the sum of the squared vertical differences between each dot and the line is minimized. Since each dot has a y, the least squares method picks the b0 and b1 such that the differences y minus ŷ, when squared and then summed across all data points, are as small as possible.

Least Squares

For now we will assume Microsoft Excel or some other program can show us the estimated regression line using least squares. We just want to use what we get. On the Excel output I posted on the internet, note that in cell B25 you see the word Coefficients. In cells A26:A27 you see the words Intercept and GPA, and the numbers 14.8156153 and 5.706568981 are in cells B26:B27. This means

ŷ = b0 + b1x

has been estimated to be

Starting salary = 14.816 + 5.7066 GPA.

Note the data had starting salary measured in thousands of dollars. This means, for example, the data had 29.8 but the real value is $29,800.
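If you are curious where those two numbers come from, here is a minimal Python sketch (not part of the Excel handout) that applies the usual textbook least squares formulas to the 7 graduates in the table. The function names least_squares and sse are my own, chosen for illustration. The last lines shift the fitted line up and down by 2, a rough stand-in for lines that are too high or too low, to show the sum of squared differences gets larger.

```python
# GPA and starting salary (thousands of dollars) for the 7 sampled graduates.
gpa = [3.26, 2.60, 3.35, 2.86, 3.82, 2.21, 3.47]
salary = [33.8, 29.8, 33.5, 30.4, 36.4, 27.6, 35.3]


def least_squares(x, y):
    """Return (b0, b1) that minimize the sum of squared differences y - y_hat."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx            # estimated slope
    b0 = y_bar - b1 * x_bar   # estimated intercept
    return b0, b1


def sse(x, y, b0, b1):
    """Sum of squared differences between each y and the line b0 + b1*x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


b0, b1 = least_squares(gpa, salary)
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")  # should agree with the Excel coefficients

best = sse(gpa, salary, b0, b1)
too_high = sse(gpa, salary, b0 + 2, b1)  # same slope, shifted up by 2
too_low = sse(gpa, salary, b0 - 2, b1)   # same slope, shifted down by 2
print(f"SSE: least squares {best:.3f}, shifted up {too_high:.3f}, shifted down {too_low:.3f}")
```

Any other choice of intercept and slope gives a sum of squared differences at least as large as the least squares line, which is the whole idea of the method.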

Prediction with least squares

Remember our estimated line is Starting salary = 14.816 + 5.7066 GPA.

Say we want to predict salary if the GPA is 2.7.

Starting salary = 14.816 + 5.7066(2.7) = 30.22382.

Since salary is measured in thousands, this predicted starting salary is $30,223.82.
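The same arithmetic can be written as a small Python function. This is only a sketch: the name predict_salary is made up for illustration, and the two constants are the rounded coefficients from the estimated line.

```python
B0 = 14.816   # estimated intercept (thousands of dollars), rounded as in the slides
B1 = 5.7066   # estimated slope (thousands of dollars per GPA point), rounded


def predict_salary(gpa):
    """Predicted starting salary, in thousands of dollars, for a given GPA."""
    return B0 + B1 * gpa


pred = predict_salary(2.7)
print(f"Predicted salary: {pred:.5f} thousand, or ${pred * 1000:,.2f}")
```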

Interpolation and Extrapolation

You will notice in our example data set that the smallest value for x was 2.21 and the largest value was 3.82.

When we want to predict a value of y, ŷ, for a given x, if the x is within the range of the data values for x (2.21 to 3.82 in our example) then we are interpolating. But if an x is outside our range for x we are extrapolating. Extrapolation should be used with a great deal of caution. Maybe the relationship between x and y is different outside the range of our data. If so, and we use the estimated line, we may be way off in our predictions. Note the intercept has to be interpreted with similar caution: unless the range of x values in our data includes zero, the relationship between x and y could be very different in the x = 0 neighborhood than the one suggested by least squares.
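One simple habit is to check whether an x is inside the range of the sample before trusting a prediction. Here is a small Python sketch of that check, using the GPA values from our sample; the function name prediction_type is hypothetical.

```python
# GPA values observed in the sample of 7 graduates.
gpa = [3.26, 2.60, 3.35, 2.86, 3.82, 2.21, 3.47]
X_MIN, X_MAX = min(gpa), max(gpa)   # 2.21 and 3.82 in our example


def prediction_type(x):
    """Say whether predicting at x is interpolation or extrapolation."""
    if X_MIN <= x <= X_MAX:
        return "interpolation"
    return "extrapolation"


for x in (2.7, 4.0, 0.0):
    print(f"GPA {x}: {prediction_type(x)}")
```

Note x = 0 comes back as extrapolation, which is why the intercept needs the same caution.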

Standard error of the estimate

Recall earlier in the term we talked about the measure of variability for a variable called the standard deviation. We looked at xi minus xbar. For the dependent variable we have the actual data values and the predicted values. The standard error of the estimate is a measure of variability for the actual values of y around the predicted values. We look at y – ŷ. The formula is in the book and programs like Excel print the value.

From earlier notes I had SSE, the sum of the squared differences y minus ŷ. The standard error of the estimate is the square root of SSE/(n-2).

Standard error

In the regression model the error term at each x is assumed to have a normal distribution, zero mean, and constant variance σ². To do hypothesis testing in a regression setting when the constant variance is unknown, it will be estimated. The estimate is s² = SSE/(n-2) and is called the mean square error. The square root of s² is called the standard error, s.
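Programs like Excel print these values, but as a rough sketch here is how SSE, the mean square error and the standard error could be computed in Python for the GPA and starting salary sample. The variable names are my own, and the line is fitted with the usual least squares formulas.

```python
import math

# GPA and starting salary (thousands of dollars) for the 7 sampled graduates.
gpa = [3.26, 2.60, 3.35, 2.86, 3.82, 2.21, 3.47]
salary = [33.8, 29.8, 33.5, 30.4, 36.4, 27.6, 35.3]

n = len(gpa)
x_bar = sum(gpa) / n
y_bar = sum(salary) / n

# Least squares estimates of the slope and intercept.
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(gpa, salary))
      / sum((x - x_bar) ** 2 for x in gpa))
b0 = y_bar - b1 * x_bar

# Differences between actual and predicted values, y minus y_hat.
residuals = [y - (b0 + b1 * x) for x, y in zip(gpa, salary)]

sse = sum(e ** 2 for e in residuals)   # sum of squared errors
mse = sse / (n - 2)                    # mean square error, s squared
std_error = math.sqrt(mse)             # standard error of the estimate, s

print(f"SSE = {sse:.4f}, s^2 = {mse:.4f}, s = {std_error:.4f}")
```

Because salary is in thousands of dollars, s is in thousands of dollars too.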

Hypothesis test of the slope

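[Two scatter plots of Y against X: the graph on the left is labeled Null hypothesis and the graph on the right is labeled Alternative hypothesis. In each graph an oval is drawn around some of the points.]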

On the previous slide I have two graphs with some points in each (ignore the ovals for now, please). Imagine there are more points in the same basic area as those shown.

Now, we use stats to help us understand the world and in the context of regression we think variables are related. Examples would be that income depends on years of school, or weight depends on net calories consumed, or GPA depends on hours studied per week.

In each example the thinking is that one variable changes value from one person to the next because each person does not have the same value of another variable. For instance, not all people have the same income because not all people have the same years of schooling.

There is a tradition in statistics to say initially that there is no relationship between two variables (even if our research and theorizing suggests there is). The null hypothesis is then that the slope of a regression line between the two variables is zero. This would mean the population data look like the graph on the left.

In stats we take a sample from a population and make calculations – here we calculate the regression coefficients. We take a random sample, which means every data point has an equal chance of being picked.

Now look at the graphs again, and this time look at the ovals. Say the ovals represent the data points that make it into our sample. If you just focus on the ovals, could you tell which graph the data came from? No, both samples suggest a positive relationship between x and y.

Now in regression we assume the slope in the population is zero and use the sample slope as a basis for a test of hypothesis about the population slope. Under the hypothesis of a zero slope, if the slope we get has a low probability of occurring then we reject the null and conclude the population slope is not zero.

Look back at the graph on the left. Could we get a random sample that would only include points in the oval? Yes we could, but it seems more likely the random sample would include other points, like those in the upper left. So we have a low probability of getting that sample and thus that slope. When we have a low probability result (.05 is often chosen as low) we reject the null and conclude the population is probably more like the alternative.

t Test for slope

Hypothesis test about the population slope β1.

Remember we have taken a sample of data. In this context we have taken a sample and estimated the unknown population regression.

Our real point in a study like this is to see if a relationship exists between the two variables in the population. If the slope is not zero in the population, then the x variable has an influence on the outcome of y. Now, in a sample, the estimated slope may or may not be zero. But the sample provides a basis for a test of the true unknown population slope being zero.

For the test we will use the t distribution. The degrees of freedom in the context of simple regression equals the sample size minus 2.

t Test for slope

Back to our hypothesis test about the slope. The null hypothesis is that β1 = 0, and the alternative is that β1 is not equal to zero. Since the alternative is not equal to zero we have a two-tailed test.

If we have 7 data points (pairs of points in regression) the df = 5, and if we want alpha = .05 we divide that in half because of the two-tailed test, so our critical values are -2.571 and 2.571. If the sample based statistic, tstat, is between the two critical values we can not reject the null, and we would conclude the data are consistent with no relationship between x and y. If the tstat is outside the critical values we reject the null and go with the alternative and say the data support that a relationship exists between the variables.
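To get a feel for what alpha = .05 means here, the following Python sketch simulates many samples of 7 from a made-up population where the slope really is zero, and counts how often the tstat lands outside the critical values. Everything about the simulated population is an assumption for illustration: y is drawn from a normal distribution with mean 32.4 and standard deviation 3, no matter what x is. The tstat is the sample slope divided by its standard error, s divided by the square root of the sum of squared deviations of x, which is the standard textbook formula.

```python
import math
import random


def slope_t_stat(x, y):
    """t statistic for testing that the population slope is zero."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))        # standard error of the estimate
    return b1 / (s / math.sqrt(sxx))    # slope divided by its standard error


random.seed(0)                           # fixed seed so the run is repeatable
x = [3.26, 2.60, 3.35, 2.86, 3.82, 2.21, 3.47]   # GPA values from the sample
CRITICAL = 2.571                         # two-tailed, alpha = .05, df = 5
trials = 5000

rejections = 0
for _ in range(trials):
    # Hypothetical population with zero slope: y does not depend on x at all.
    y = [random.gauss(32.4, 3.0) for _ in x]
    if abs(slope_t_stat(x, y)) > CRITICAL:
        rejections += 1

rejection_rate = rejections / trials
print(f"Share of samples rejecting a true null: {rejection_rate:.3f}")
```

The share should come out close to .05: when the null is true, only about 1 sample in 20 gives a tstat that far from zero.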

t Test for the Slope

In a class such as ours the point is usually not to do a lot of calculations in regression, but to interpret results. See the Excel printout for the GPA, starting salary example handed out in class and at the class web site. Note we have the calculated tstat for the problem. About two-thirds down the page you see the variable GPA listed below the word Intercept. Note in the GPA row the coefficient is 5.7066. This is the point estimate of β1. The estimate is not zero, but we wonder if in the population the value is zero. The tstat is 14.4358. Since it is outside the critical values we reject H0 and go with the alternative. This means the data support a relationship between x and y.

The p-value approach is that if the p-value < alpha we reject the null. The p-value printed in Excel in this area is a two-tail p-value. Since we have a p-value of essentially 0 we can reject the null. (Note the P-value in Excel has E-05 in it; this means move the decimal 5 places to the left.)
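For anyone who wants to see how a program could arrive at that tstat, here is a hedged Python sketch using the 7 graduates in the sample. It uses the standard formula for the standard error of the slope (s divided by the square root of the sum of squared deviations of x), which is not shown in these slides, and it takes the critical value 2.571 as given rather than computing it.

```python
import math

# GPA and starting salary (thousands of dollars) for the 7 sampled graduates.
gpa = [3.26, 2.60, 3.35, 2.86, 3.82, 2.21, 3.47]
salary = [33.8, 29.8, 33.5, 30.4, 36.4, 27.6, 35.3]

n = len(gpa)
x_bar = sum(gpa) / n
y_bar = sum(salary) / n
sxx = sum((x - x_bar) ** 2 for x in gpa)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(gpa, salary))

b1 = sxy / sxx               # point estimate of the population slope
b0 = y_bar - b1 * x_bar      # point estimate of the population intercept

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(gpa, salary))
s = math.sqrt(sse / (n - 2))       # standard error of the estimate
se_b1 = s / math.sqrt(sxx)         # standard error of the slope
t_stat = b1 / se_b1                # test statistic for H0: slope = 0

CRITICAL = 2.571                   # two-tailed, alpha = .05, df = n - 2 = 5
reject_null = abs(t_stat) > CRITICAL

print(f"b1 = {b1:.4f}, tstat = {t_stat:.4f}, reject H0: {reject_null}")
```

The tstat printed should agree with the Excel printout up to rounding.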
