8/7/2015slide 1 simple linear regression is an appropriate model of the relationship between two...

Upload: susanna-payne

Post on 23-Dec-2015

TRANSCRIPT

Page 1: 8/7/2015Slide 1 Simple linear regression is an appropriate model of the relationship between two quantitative variables provided: the data satisfies the

04/19/23 Slide 1

Simple linear regression is an appropriate model of the relationship between two quantitative variables provided:
• the data satisfies the assumption of linearity in the scatterplot of the raw data and the residual plot,
• the spread of the residuals is equal for all of the predicted values in the residual plot, and
• there are no outliers impacting the linear model.

When the relationship we are analyzing does not meet these criteria, the use of regression analysis can still be justified if re-expressing one or both variables:
• reduces the non-linear pattern in the scatterplot,
• equalizes the variance in the residual plot, and
• reduces the distance of outliers from the other cases in the distributions.

04/19/23 Slide 2

Clues that re-expression might be effective in linearizing the relationship are: the identification of influential cases, severe skewing of one or both variables (skewness outside the range from -1.0 to +1.0), and a Spearman's rho that is greater than Pearson's r.
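The skewness and correlation clues above can be checked numerically. The following is a minimal sketch, not the course script: it assumes the simple moment-based skewness formula (statistical packages often report an adjusted version), compares the magnitudes of rho and r, and uses made-up data. All function names are hypothetical.

```python
import math

def skewness(values):
    """Moment-based skewness m3 / m2**1.5 (assumed formula; packages may adjust it)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def pearson_r(x, y):
    """Pearson's product-moment correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman_rho(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))

def reexpression_clues(x, y):
    """Collect the skewness and rho-versus-r clues for one pair of variables."""
    skew_x, skew_y = skewness(x), skewness(y)
    r, rho = pearson_r(x, y), spearman_rho(x, y)
    return {
        "skew_x": skew_x,
        "skew_y": skew_y,
        "pearson_r": r,
        "spearman_rho": rho,
        "severe_skew": abs(skew_x) > 1.0 or abs(skew_y) > 1.0,
        "rho_exceeds_r": abs(rho) > abs(r),  # magnitudes compared (an assumption)
    }

# Made-up data: y grows exponentially with x, so the relationship is
# monotone but curved.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [math.exp(v / 2) for v in x]
clues = reexpression_clues(x, y)
for name, value in clues.items():
    print(name, value)
```

When both flags are true, a log re-expression of the skewed variable is a reasonable first thing to try, though, as noted on this slide, nothing guarantees it will work.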

There is no guarantee that re-expression will produce a scatterplot that satisfies the assumptions of linear regression. When it does not, we are left with the choice of either deciding that the violations are not of serious consequence or choosing an alternative strategy for modeling the relationship.

To solve these problems, we will first assess the conformity of the relationship to regression assumptions. Second, we will examine the criteria that suggest that re-expression might be effective. Third, we will examine the model using re-expressed variables to assess conformity to regression assumptions.

04/19/23 Slide 3

We will use a new strategy for identifying outliers that we may consider omitting from the analysis – Cook’s distance. Cook’s distance combines information about standardized residuals and leverage for independent variables so we can use one measure instead of three.

Cook’s distance is a measure of the influence which a case has on the regression solution, i.e., how different the solution would be if this case were omitted. Larger values of Cook’s distance indicate a greater effect on the regression analysis.

There are different criteria for what constitutes an outlier on Cook’s distance:
• Cook’s original criterion was 1.0.
• Fox proposed 4 / (number of cases – number of IVs – 1).
• We will use 0.5, which is about halfway between the other two.
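To make the measure concrete, here is a minimal sketch of Cook's distance for a simple linear regression, using the standard formula D = (e² / (p · MSE)) · h / (1 − h)², where e is the residual, h is the leverage, and p = 2 is the number of estimated coefficients. The function names and the data are hypothetical and are not taken from the course script.

```python
def cooks_distances(x, y):
    """Cook's distance for each case in a simple linear regression of y on x."""
    n = len(x)
    p = 2  # estimated coefficients: intercept and slope
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    mse = sum(e * e for e in residuals) / (n - p)
    # Leverage of each case in simple regression.
    leverages = [1 / n + (a - mean_x) ** 2 / sxx for a in x]
    return [
        (e * e / (p * mse)) * (h / (1 - h) ** 2)
        for e, h in zip(residuals, leverages)
    ]

def fox_cutoff(n_cases, n_ivs=1):
    """Fox's criterion: 4 / (number of cases - number of IVs - 1)."""
    return 4 / (n_cases - n_ivs - 1)

# Made-up data: nine cases near y = 2x + 1, plus one case far to the right
# that falls well below that line.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9, 15.2, 16.8, 19.1, 20.0]

distances = cooks_distances(x, y)
threshold = 0.5  # the criterion adopted in these slides
flagged = [i for i, d in enumerate(distances) if d > threshold]
print("flagged case indexes:", flagged)
print("Fox cutoff for this sample:", fox_cutoff(len(x)))
```

Because Fox's cutoff shrinks as the number of cases grows, it flags many more cases in large samples than the fixed values of 0.5 or 1.0 do.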

04/19/23 Slide 4

We will use an updated version of the script for simple linear regression to analyze relationships and test re-expressions.

The script will compute the transformations for both the dependent and independent variables.

The defaults are marked. My preferences are for a scatterplot with boxplots for each variable, and the residual plot.

We will use a combination of fit lines to evaluate linearity.

We have options for the criteria for Cook’s distance and the opportunity to exclude influential cases.

04/19/23 Slide 5

This scatterplot shows that the blue loess fit line fluctuates slightly around the regression line, and stays within the confidence interval.

04/19/23 Slide 6

The residual plot shows that the vertical spread of the residuals is approximately the same height from left to right across the predicted values.

There is no evidence of a pattern or shape that would suggest non-linearity.
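A rough numeric companion to this visual check is to compare the spread of the residuals for the lower and upper halves of the predicted values. The sketch below is only an illustration under stated assumptions: the half-and-half split is an arbitrary choice, the function names are hypothetical, and both data sets are made up.

```python
import statistics

def fit_line(x, y):
    """Least-squares slope and intercept for y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sxx
    return slope, mean_y - slope * mean_x

def residual_spread_by_half(x, y):
    """Standard deviation of residuals for the lower and upper half of predictions."""
    slope, intercept = fit_line(x, y)
    # Pair each predicted value with its residual, then order by prediction.
    cases = sorted(
        (intercept + slope * a, b - (intercept + slope * a)) for a, b in zip(x, y)
    )
    half = len(cases) // 2
    left = [residual for _, residual in cases[:half]]
    right = [residual for _, residual in cases[half:]]
    return statistics.pstdev(left), statistics.pstdev(right)

x = list(range(1, 13))
signs = [1 if v % 2 else -1 for v in x]

# Made-up data with constant scatter around the line y = 2x.
y_equal = [2 * v + 0.5 * s for v, s in zip(x, signs)]
# Made-up data whose scatter grows with x (a funnel shape).
y_funnel = [2 * v + 0.2 * v * s for v, s in zip(x, signs)]

equal_left, equal_right = residual_spread_by_half(x, y_equal)
funnel_left, funnel_right = residual_spread_by_half(x, y_funnel)
print("equal spread:", equal_left, equal_right)
print("funnel:", funnel_left, funnel_right)
```

Similar values for the two halves match the description of this plot. A clearly larger value on one side points to the kind of unequal variance that later examples in these slides show.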

04/19/23 Slide 7

Influential cases are green instead of blue. There are no cases with undue influence in this plot.

This relationship satisfies the criteria for a linear relationship. There is no reason to re-express the data.

04/19/23 Slide 8

The next problem examines the relationship between poverty and per capita GDP.

04/19/23 Slide 9

The loess fit line clearly curves outside the confidence interval.

The boxplot for GDP suggests that we should re-express GDP on a logarithmic scale.

The large positive skew value supports the use of logarithms.

04/19/23 Slide 10

The limited spread on the left side of the plot suggests a problem with homogeneity of variance as well as linearity.

The pattern of the points is U-shaped, supporting the finding of non-linearity.

04/19/23 Slide 11

The limited spread on the left side of the plot suggests a problem with homogeneity of variance as well as linearity.

04/19/23 Slide 12

To re-express GDP as logarithms, mark the option button for scale.

04/19/23 Slide 13

The log transformation improved the linearity of the scatterplot.
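The effect of a log re-expression can be illustrated by fitting the same outcome against the raw and the logged predictor and comparing R². This is a minimal sketch with synthetic numbers, not the poverty and GDP data analyzed in these slides. The base-10 logarithm is an assumption, and the variable and function names are hypothetical.

```python
import math

def r_squared(x, y):
    """R-squared of a simple linear regression of y on x (the squared Pearson r)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

# Synthetic predictor spanning several orders of magnitude (strong positive skew).
gdp = [500, 1000, 2000, 4000, 8000, 16000, 32000]
# Synthetic outcome built to decline with log10(gdp), plus small fixed disturbances.
disturbance = [0.5, -0.5, 0.3, -0.3, 0.4, -0.4, 0.0]
poverty = [60 - 10 * math.log10(g) + d for g, d in zip(gdp, disturbance)]

log_gdp = [math.log10(g) for g in gdp]
r2_raw = r_squared(gdp, poverty)
r2_log = r_squared(log_gdp, poverty)
print("R-squared with raw predictor:", round(r2_raw, 3))
print("R-squared with log predictor:", round(r2_log, 3))
```

A higher R² after re-expression is not enough by itself. The scatterplot and residual plot of the re-expressed model still need to be checked against the regression assumptions, which is the next step in these slides.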

The loess fit line moves slightly outside the confidence interval, but it is more a fluctuation than a well-defined curve.

04/19/23 Slide 14

The residual plot shows that the vertical spread is somewhat reduced at the left side of the plot. It is not so pronounced as to be treated as a non-linear relationship.

I would interpret the relationship between poverty and the log of GDP as linear.

04/19/23 Slide 15

The skewness of poverty (0.563) was not as pronounced as the skewness of GDP, but we can still re-express the data to see its impact on the relationship.

04/19/23 Slide 16

Including the log of poverty increased the non-linearity shown at the middle of the loess line, though R² increased from 0.456 to 0.503.

04/19/23 Slide 17

I think I see evidence of a curve emerging in the residual plot rather than up and down fluctuations.

I think a case could be made for including the log of poverty based on the higher R², as well as a case for using the raw data for poverty, since that relationship is more linear.

04/19/23 Slide 18

04/19/23 Slide 19

The loess fit line clearly curves outside the confidence interval.

04/19/23 Slide 20

Both non-linearity and unequal variance are evident in the residual plot.

04/19/23 Slide 21

We will first re-express deathrat as logarithms.

04/19/23 Slide 22

The curve looks more evident after the transformation.

04/19/23 Slide 23

The curve looks more evident after the transformation.

04/19/23 Slide 24

We will re-express poverty as logarithms as well, but I am not optimistic that it will help.

04/19/23 Slide 25

The second transformation did not help either. We can try the transformation of poverty with the raw data for deathrat.

04/19/23 Slide 26

Nor does it help to use the logarithm of poverty with the raw data for deathrat.

We should be very cautious about reporting this relationship as linear.

04/19/23 Slide 27

04/19/23 Slide 28

In addition to being non-linear, this relationship shows one influential case.

04/19/23 Slide 29

In addition to being non-linear, this relationship shows one influential case.

04/19/23 Slide 30

Since the skewness for both variables is greater than 1.0, we will try a log transformation for both.

04/19/23 Slide 31

The loess line has a very shallow curve to it, though without the loess line, I would judge this to be linear.

The influential case is not as distant from the other cases in the scatterplot and is no longer colored green as an influential case.

04/19/23 Slide 32

The lower left-hand corner looks suspicious for equality of variance, but this may be the result of lower bounds for the variables, i.e., values stop at zero and cannot be negative.