8/7/2015slide 1 simple linear regression is an appropriate model of the relationship between two...

Post on 23-Dec-2015

TRANSCRIPT

04/19/23 Slide 1

Simple linear regression is an appropriate model of the relationship between two quantitative variables provided:
• the data satisfies the assumption of linearity in the scatterplot of the raw data and the residual plot,
• the spread of the residuals is equal for all of the predicted values in the residual plot, and
• there are no outliers impacting the linear model.

When the relationship we are analyzing does not meet these criteria, the use of regression analysis can still be justified if re-expressing one or both variables:
• reduces the non-linear pattern in the scatterplot,
• equalizes the variance in the residual plot, and
• reduces the distance of outliers from the other cases in the distributions.

04/19/23 Slide 2

Clues that re-expression might be effective in linearizing the relationship are: the identification of influential cases, severe skewness in one or both variables (skewness outside the range from -1.0 to +1.0), and a Spearman's rho that is greater than Pearson's r.
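The two numeric clues above can be checked by hand. The following is a minimal sketch, not the script used in these slides: it uses the simple moment-based skewness formula (statistical packages often apply a small-sample correction, so their values may differ slightly), and the data are made up purely for illustration.

```python
import math

def skewness(values):
    """Moment-based skewness m3 / m2**1.5 (no small-sample correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def pearson_r(x, y):
    """Pearson product-moment correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman_rho(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))

# Hypothetical data: y grows exponentially with x, so the relationship
# is monotonic but strongly curved.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [math.exp(0.5 * v) for v in x]

print("skewness of y:", skewness(y))
print("Pearson r:", pearson_r(x, y))
print("Spearman rho:", spearman_rho(x, y))
```

For a monotonic but curved relationship like this one, rho exceeds r and the skewness of y falls outside the -1.0 to +1.0 range, which is the pattern the clues describe.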

There is no guarantee that re-expression will produce a scatterplot that satisfies the assumptions of linear regression. When it does not we are left with the choice of determining that the violations are not of serious consequence, or choosing an alternative strategy for modeling the relationship.

To solve these problems, we will first assess the conformity of the relationship to regression assumptions. Second, we will examine the criteria that suggest that re-expression might be effective. Third, we will examine the model using re-expressed variables to assess conformity to regression assumptions.

04/19/23 Slide 3

We will use a new strategy for identifying outliers that we may consider omitting from the analysis – Cook’s distance. Cook’s distance combines information about standardized residuals and leverage for independent variables so we can use one measure instead of three.

Cook’s distance is a measure of the influence that a case has on the regression solution, i.e., how different the solution would be if the case were omitted. Larger values of Cook’s distance indicate a greater effect on the regression analysis.

There are different criteria for what constitutes an outlier on Cook’s distance.
• Cook’s original criterion was 1.0.
• Fox proposed 4 / (number of cases - number of IVs - 1).
• We will use 0.5, which is about halfway between the other two.
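To make the cutoffs concrete, here is a minimal sketch of Cook's distance for a regression with one independent variable. It uses the standard textbook formula (squared residual, scaled by the mean squared error and the number of estimated coefficients, times a leverage term). It is not the script referred to in these slides, and the data are invented for illustration.

```python
def cooks_distances(x, y):
    """Cook's distance for each case in a simple linear regression."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    p = 2  # estimated coefficients: intercept and slope
    mse = sum(e * e for e in residuals) / (n - p)
    distances = []
    for a, e in zip(x, residuals):
        leverage = 1 / n + (a - mx) ** 2 / sxx
        distances.append(e * e / (p * mse) * leverage / (1 - leverage) ** 2)
    return distances

def fox_cutoff(n_cases, n_ivs):
    """Fox's criterion: 4 / (number of cases - number of IVs - 1)."""
    return 4 / (n_cases - n_ivs - 1)

# Hypothetical data: nine cases close to a line, plus one case far to the
# right that does not follow the trend.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.2, 7.9, 9.1, 5.0]

distances = cooks_distances(x, y)
flagged = [i for i, d in enumerate(distances) if d > 0.5]
print("Cook's distances:", [round(d, 3) for d in distances])
print("cases above 0.5:", flagged)
print("Fox cutoff for these data:", fox_cutoff(len(x), 1))
```

With only ten cases, Fox's cutoff happens to equal 0.5. With larger samples it becomes much smaller, which is why it flags more cases than the other two criteria.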

04/19/23 Slide 4

We will use an updated version of the script for simple linear regression to analyze relationships and test re-expressions.

The script will compute the transformations for both the dependent and independent variables.

The defaults are marked. My preferences are for a scatterplot with boxplots for each variable, and the residual plot.

We will use a combination of fit lines to evaluate linearity.

We have options for the criteria for Cook’s distance and the opportunity to exclude influential cases.

04/19/23 Slide 5

This scatterplot shows that the blue loess fit line fluctuates slightly around the regression line, and stays within the confidence interval.

04/19/23 Slide 6

The residual plot shows that the vertical spread of the residuals is approximately the same height from left to right across the predicted values.

There is no evidence of a pattern or shape that would suggest non-linearity.
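The visual judgment about equal spread can be backed up with a rough numeric check. The sketch below is an assumption of mine rather than part of the slides' script: it splits the residuals at the midpoint of the predicted values and compares the standard deviations of the two halves. It is a crude aid to reading the residual plot, not a formal test of homogeneity of variance, and the data are made up.

```python
import statistics

def fit_line(x, y):
    """Least-squares slope and intercept for one independent variable."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return slope, my - slope * mx

def spread_ratio(x, y):
    """SD of residuals in the upper half of predicted values divided by
    the SD in the lower half. Values near 1 suggest similar spread."""
    slope, intercept = fit_line(x, y)
    pairs = sorted(
        (intercept + slope * a, b - (intercept + slope * a))
        for a, b in zip(x, y)
    )
    half = len(pairs) // 2
    left = [residual for _, residual in pairs[:half]]
    right = [residual for _, residual in pairs[half:]]
    return statistics.pstdev(right) / statistics.pstdev(left)

# Hypothetical data: the same trend with constant noise and with noise
# that grows as x grows (a fan shape).
x = list(range(1, 13))
y_even = [2 * v + (0.5 if v % 2 == 0 else -0.5) for v in x]
y_fan = [2 * v + (0.1 * v if v % 2 == 0 else -0.1 * v) for v in x]

print("even spread ratio:", spread_ratio(x, y_even))
print("fan-shaped ratio:", spread_ratio(x, y_fan))
```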

04/19/23 Slide 7

Influential cases are green instead of blue. There are no cases with undue influence in this plot.

This relationship satisfies the criteria for a linear relationship. There is no reason to re-express the data.

04/19/23 Slide 8

The next problem examines the relationship between poverty and per capita GDP.

04/19/23 Slide 9

The loess fit line clearly curves outside the confidence interval.

The boxplot for GDP suggests that we should re-express GDP on a logarithmic scale.

The large positive skew value supports the use of logarithms.

04/19/23 Slide 10

The limited spread in the left side of the plot suggests a problem with homogeneity of variance as well as linearity.

The pattern of the points is U-shaped, supporting the finding of non-linearity.

04/19/23 Slide 11

The limited spread on the left side of the plot suggests a problem with homogeneity of variance as well as linearity.

04/19/23 Slide 12

To re-express GDP as logarithms, mark the option button for scale.

04/19/23 Slide 13

The log transformation improved the linearity of the scatterplot.
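As a rough numeric companion to the plots, the sketch below compares R² for a raw and a log-transformed independent variable. The numbers are invented for illustration and are not the GDP and poverty data used in the slides. R² is only a summary of fit: the linearity judgment in this presentation rests on the loess line and the residual plot, so a higher R² by itself does not establish linearity.

```python
import math

def r_squared(x, y):
    """R-squared for a simple linear regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

# Hypothetical values: the dependent variable falls steadily each time the
# independent variable doubles, which produces a curve on the raw scale.
gdp = [500, 1000, 2000, 4000, 8000, 16000, 32000]
poverty = [62, 51, 43, 30, 22, 11, 3]

log_gdp = [math.log10(v) for v in gdp]

print("R-squared, raw GDP:", round(r_squared(gdp, poverty), 3))
print("R-squared, log GDP:", round(r_squared(log_gdp, poverty), 3))
```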

The loess fit line moves slightly outside the confidence interval, but it is more a fluctuation than a well-defined curve.

04/19/23 Slide 14

The residual plot shows that the vertical spread is somewhat reduced at the left side of the plot. It is not so pronounced that the relationship should be treated as non-linear.

I would interpret the relationship between poverty and the log of GDP as linear.

04/19/23 Slide 15

The skewness of poverty (0.563) was not as pronounced as the skewness of GDP, but we can still re-express the data to see its impact on the relationship.

04/19/23 Slide 16

Including the log of poverty increased the non-linearity shown at the middle of the loess line, though R² increased from 0.456 to 0.503.

04/19/23 Slide 17

I think I see evidence of a curve emerging in the residual plot rather than up and down fluctuations.

I think a case could be made for including the log of poverty based on the higher R², as well as a case for using the raw data for poverty since the relationship is more linear.

04/19/23 Slide 18

04/19/23 Slide 19

The loess fit line clearly curves outside the confidence interval.

04/19/23 Slide 20

Both non-linearity and unequal variance are evident in the residual plot.

04/19/23 Slide 21

We will first re-express deathrat as logarithms.

04/19/23 Slide 22

The curve looks more evident after the transformation.

04/19/23 Slide 23

The curve looks more evident after the transformation.

04/19/23 Slide 24

We will re-express poverty as logarithms as well, but I am not optimistic that it will help.

04/19/23 Slide 25

The second transformation did not help either. We can try the transformation of poverty with the raw data for deathrat.

04/19/23 Slide 26

Nor does it help to use the logarithm of poverty with the raw data for deathrat.

We should be very cautious about reporting this relationship as linear.

04/19/23 Slide 27

04/19/23 Slide 28

In addition to being non-linear, this relationship shows one influential case.

04/19/23 Slide 29

In addition to being non-linear, this relationship shows one influential case.

04/19/23 Slide 30

Since the skewness for both variables is greater than 1.0, we will try a log transformation for both.

04/19/23 Slide 31

The loess line has a very shallow curve to it, though without the loess line, I would judge this to be linear.

The influential case is not as distant from the other cases in the scatterplot and is no longer colored green as an influential case.

04/19/23 Slide 32

The lower left-hand corner looks suspicious for equality of variance, but this may be the result of lower bounds for the variables, i.e., values stop at zero and cannot be negative.
