Review: Correlation vs. Regression

TRANSCRIPT

Pages 1–3:

Review: Correlation vs. Regression

What are the main differences between correlation & regression?

What are the data requirements for each?

What are their principal vulnerabilities?

How do we establish causality?

Page 4:

Pearson correlation: a measure of the linear association between two quantitative variables.

Page 5:

Data Requirements

A probability sample—if the analysis will be inferential, as opposed to descriptive.

For OLS regression, the outcome variable must be quantitative (interval or ratio); the explanatory variables may be quantitative or categorical (nominal or ordinal).

Page 6:

What are the disadvantages of using correlation to study the relationships between two or more variables?

See Moore/McCabe, chapter 2.

Page 7:

Hypothesis tests for correlation

We can use Pearson correlation not only descriptively but also inferentially.

To use it inferentially, use a scatterplot to check the bivariate relationship for linearity.

If the relationship is sufficiently linear, test Ho: ρxy = 0 against Ha: ρxy ≠ 0.

Page 8:

In Stata: ‘pwcorr’ vs. ‘corr’. ‘correlate’ (corr) uses ‘listwise,’ or ‘casewise,’ deletion—i.e. any observation (i.e. individual, case) for which any of the correlated variables has missing data is not used. That is, ‘corr’ only uses observations with complete data for the examined variables.

If, for the relationship between math and reading scores, observation #27 has, say, a missing math score, then ‘corr’ or ‘regress’ will automatically drop observation #27.

This is how regression works, so ‘corr’ corresponds to regression.

Moreover, ‘corr’ does not permit hypothesis tests.

Page 9:

pwcorr: (‘pairwise’) uses all of the non-missing observations for the examined variables (e.g., it would use observation #27’s reading score, even though #27’s math score is missing).

This does not correspond to the way that regression works.

Moreover, pwcorr permits hypothesis tests

Note: There is a way to use ‘pwcorr’ so that, like regression analysis, it is based on casewise (i.e. listwise) deletion of missing observations. We’ll demonstrate this later.

Page 10:

Use a Bonferroni or other multiple-test adjustment when simultaneously testing multiple correlation hypotheses:

. pwcorr read write math science socst, obs sig star(.05) bonf

Why is the multiple-test adjustment important?
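A rough sketch of why (standard reasoning, not from the slides): if each of k hypothesis tests is run at α = .05, the chance of at least one false rejection is 1 − (1 − .05)^k. With the 10 pairwise correlations among the five variables above, that is 1 − .95^10 ≈ .40, so without an adjustment there is roughly a 40% chance of at least one spurious “significant” correlation. The Bonferroni adjustment (testing each pair at .05/10 = .005) keeps the family-wise error rate near .05.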

Page 11:

If the data have no missing values, then there’s no problem using pwcorr.

Page 12:

Contingency Table vs. Pearson Correlation

What if the premises of parametric statistics don’t hold?

E.g., what if the quantitative variables are based on a small sample (say, <30)?

Or what if the relationship is non-linear, and possible transformations (such as logarithmic) aren’t applicable or don’t work, and/or it doesn’t make sense to eliminate extreme outliers?

What if the quantitative variables are ordinal rather than interval in measurement?

Page 13:

In such cases it may be useful to: (1) use non-parametric procedures (such as Spearman rho [in Stata: ‘spearman x y’]); or (2) categorize the data & assess the bivariate association via a contingency table.

Non-parametric procedures are not premised on the approximate normality of the sample distribution of sample means (see the Moore/McCabe chap. 7 & CD-Rom chapter).

As for contingency tables, here’s an example of how to do a contingency table in response to violated parametric assumptions: What’s the association between science & reading scores?

Page 14:

. xtile xsci=science, nq(4)

. xtile xread=read, nq(4)

. tab1 xsci xread

. bys xsci: su science

. bys xread: su read

. tab xread xsci, col chi2

Always have a good reason for how you categorize a quantitative variable.

Check # observations per cell for the validity of the Chi-square hypothesis test.
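One way to inspect the cell counts directly in Stata (a sketch, assuming the quartile variables created above):

. tab xread xsci, expected chi2 [adds each cell’s expected frequency under independence; the chi-square test becomes suspect when expected counts are small, e.g. below about 5]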

Page 15:

Measures of Correlation & Association Involving Categorical Variables

Here are some alternatives to Pearson correlation (including non-parametric [‘rank’] statistics):

Page 16:

There are other correlation coefficients & measures of association (besides ttest & prtest) for categorical variables & for combinations of categorical & continuous variables. E.g.:

Spearman correlation (i.e. ‘rank correlation’; see Moore/McCabe chap. 7 and CD-Rom chapter, Hamilton, chap. 6 & Stata Manual) (It is also an outlier-resistant alternative to ‘corr’ or ‘pwcorr’ for quantitative variables.):

. spearman ordinalscore ses

Kendall’s tau (like Spearman, but can be slower in Stata; see Hamilton, chap. 6 & Stata Manual)

. ktau ordinalscore ses

Page 17:

eta-squared: when one variable is quantitative continuous & the other is multi-level categorical (see Moore/McCabe, chap. 12, ‘ANOVA’; Hamilton, chap. 5, ‘ANOVA’; Stata Manual, ‘oneway’).

. oneway read ses, tabulate bonf [Bartlett test must be insignificant; see also ANOVA and loneway.]

biserial correlation & point biserial correlation: when one variable is quantitative continuous & the other is binary. Just use ‘corr’ or ‘pwcorr’—Stata or any other major software automatically makes the adjustment:

. pwcorr read female, obs sig star(.05) [same result as ttest read,by(female)]

Page 18:

phi coefficient: two categorical binary variables.

. pwcorr female white, obs sig star(.05) [same result as tab female white, col/row/cell chi2]

. tab female ses, all [output includes ‘Cramer’s V,’ which is an adaptation of phi coefficient for tables that are larger than two-by-two, but it likewise works in two-by-two tables]

Caution: Recall the ramifications of restricted-range data and ecological data for correlation results, & recall the need to consider lurking variables.

Page 19:

Non-parametric: rank data.

Parametric: premised on approximately normal sampling distribution of sample means (i.e. Central Limit Theorem).

Page 20:

CI’s for Pearson & Spearman correlations:

. findit ci2 [& download]

. ci2 read write, corr
Confidence interval for Pearson's product-moment correlation of read and write, based on Fisher's transformation. Correlation = 0.597 on 200 observations (95% CI: 0.499 to 0.679)

. ci2 read ses, corr spearman
Confidence interval for Spearman's rank correlation of read and ses, based on Fisher's transformation. Correlation = 0.280 on 200 observations (95% CI: 0.147 to 0.403)
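For reference, a sketch of the Fisher transformation that ci2 reports using (standard formulas, not shown on the slides): z = ½ ln[(1 + r)/(1 − r)], with approximate standard error 1/√(n − 3); the interval z ± 1.96/√(n − 3) is computed on the z scale and then transformed back to the correlation scale. With r = 0.597 and n = 200: z ≈ 0.688 and 1.96/√197 ≈ 0.140, giving 0.549 to 0.828 on the z scale, which transforms back to roughly 0.50 to 0.68—matching the reported 0.499 to 0.679 up to rounding.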

Page 21:

Regression Analysis

Page 22:

What are regression analysis’s major advantages over the alternatives for examining the relationships between two or more variables (see Moore/McCabe, chapter 2)?

Regression: examines how the values of an outcome variable y depend on the values of one or more explanatory variables x (i.e. the slope & direction of the y/x straight line).

Page 23:

On average, how does risk of heart disease (y) change with every unit of increase or decrease in amount of fat consumption (x1) & in amount of exercise (x2)?

On average, how does earnings level (y) change with every unit of increase or decrease in years of education (x2) & in a person’s age in years (x1)?

Page 24:

Recall the problems of causality that we discussed in Chapter Two.

Always ask: What is the conceptual basis of the hypothesized or implied causal relationship? What if it were reversed?

See, e.g., King et al., Designing Social Inquiry; McClendon, Multiple Regression and Causal Analysis; Berk, Regression Analysis: A Constructive Critique.

Page 25:

Let’s start by interpreting the following simple regression model (i.e. regression model with one explanatory variable x).

. use hsb2, clear

. for varlist math read: kdensity X, norm \ more

Page 26:

[Figure: kernel density estimates with normal-density overlays for math score (roughly 30–80) and reading score (roughly 20–80).]

Page 27:

. gr box math read

[Figure: box plots of math score and reading score.]

Page 28:

. su math read, d

math score: Obs = 200; Mean = 52.645; Std. Dev. = 9.368; Variance = 87.768; Skewness = 0.284; Kurtosis = 2.337. Percentiles: 1% 36, 5% 39, 10% 40, 25% 45, 50% 52, 75% 59, 90% 65.5, 95% 70.5, 99% 74 (smallest 33, largest 75).

reading score: Obs = 200; Mean = 52.23; Std. Dev. = 10.253; Variance = 105.123; Skewness = 0.195; Kurtosis = 2.363. Percentiles: 1% 32.5, 5% 36, 10% 39, 25% 44, 50% 50, 75% 60, 90% 67, 95% 68, 99% 74.5 (smallest 28, largest 76).

Page 29:

. scatter math read || qfit math read

[Figure: scatterplot of math score against reading score (both roughly 30–80) with fitted values overlaid.]

Page 30:

. corr math read
(obs=200)

             |     math     read
    ---------+------------------
        math |   1.0000
        read |   0.6623   1.0000

. pwcorr math read, obs sig

             |     math     read
-------------+------------------
        math |   1.0000
             |      200
             |
        read |   0.6623   1.0000
             |   0.0000
             |      200      200

Page 31:

. reg math read

Source SS df MS Number of obs = 200

F( 1, 198) = 154.70

Model 7660.75905 1 7660.75905 Prob > F = 0.0000

Residual 9805.03595 198 49.5203836 R-squared = 0.4386

Adj R-squared = 0.4358

Total 17465.795 199 87.7678141 Root MSE = 7.0371

math Coef. Std. Err. t P>t [95% Conf. Interval]

read .6051473 .0486538 12.44 0.000 .509201 .7010935

_cons 21.03816 2.58945 8.12 0.000 15.93172 26.1446

Interpretation?

Page 32:

The simple linear regression model assumes that the mean of the outcome variable is a linear function of one explanatory variable.

The multiple linear regression model—as we’ll later see—assumes that the mean of the outcome variable is a linear function of multiple explanatory variables: implication?

Page 33:

Regression Analysis

Does Not Assume the Following!

Regression analysis does not assume that the sample values of the outcome & explanatory variables have normal distributions!

It does assume an approximately linear y/x relationship.

And it does assume that the distribution of the residuals is approximately normal and is constant across the values of each explanatory variable, with an expected value of zero.

More on this later…

Page 34:

Simple linear regression model:  y = β0 + β1x + ε

Multiple linear regression model:  y = β0 + β1x1 + β2x2 + … + βpxp + ε

Page 35:

Regression model: a set of variables & their hypothesized relationships.

Basic research strategy in multiple regression: compare models—which model provides the best explanation (or prediction) for a research question & the data?

Page 36:

What’s the advantage of multiple regression over simple regression?

Multiple regression allows us to examine how values of an outcome variable vary in association with changes in the values of more than one explanatory variable.

The effect of each x on the value of y is measured holding the other x’s constant at their means.

Thus a given x may perform differently within differing sets of x’s.

Page 37:

In multiple regression, the value of each explanatory x’s slope (i.e. beta or regression) coefficient is its partial (i.e. net) effect, holding the other explanatory variables constant.

So, in multiple regression, the value of a slope (i.e. regression) coefficient may vary according to which other explanatory variables are included in the model.

Page 38:

What is accomplished by holding the model’s other variables constant?

How does this compare to experimental design?

Page 39:

Thus, when interpreting the effect of any one explanatory variable on y, consider the model’s other explanatory variables as held constant.

On average, how does risk of heart disease (y) change with every unit of increase in amount of fat consumption (x1), holding constant amount of exercise (x2)?

On average, how does earnings level (y) change with each additional year of a person’s age (x2), holding constant years of education (x1)?

Page 40:

The characteristics of scatterplots & correlations don’t necessarily predict whether an explanatory variable will test significant in multiple regression.

That’s because multiple regression expresses the joint, linear effect of a set of explanatory variables on an outcome variable y.

That is, the regression model’s whole is more than the sum of its parts.

Page 41:

In fact, significant bivariate relationships may become insignificant in multiple regression.

Or insignificant bivariate relationships may become significant.

Or positive bivariate relationships may become negative, & vice versa (‘Simpson’s Paradox’).

Page 42:

On such complexities within multiple regression models, see McClendon, Multiple Regression and Causal Analysis.

And see Agresti/Finlay, Statistical Methods for the Social Sciences, chapter 10.

Page 43:

Regression model: a set of variables & their hypothesized relationships.

Basic research strategy: compare models—which model provides the best explanation (or prediction) for a research question & the data?

To repeat:

Page 44:

The estimated (i.e. probabilistic) regression line contains a component of uncertainty, or error: the deviations between the observed values of y & the estimated values of y.

y = b0 + b1x1 + b2x2 + … + bkxk + e

Page 45:

The most important statistical assumptions of the linear regression model: the distribution of residuals (i.e. prediction ‘errors’) is (1) approximately normal & (2) is constant for all the values of each explanatory variable x, with an expected value of zero.

These are the principal assumptions that we check in our diagnostic graphs after estimating a regression model.

Page 46:

Why are the assumptions of constant, normal distribution of residuals & zero expected value of residuals so important?

(1) The expected value of the residuals equals zero: guarantees that the estimates of the y-intercept & the slope coefficients are unbiased estimates of the corresponding population values.

Page 47:

(2) Constant spread of residuals: minimizes the standard errors of the estimates of the y-intercept & the slope coefficients, which is necessary for the usefulness of confidence intervals & tests of significance.

Page 48:

What if the assumption of normal, constant distribution of residuals with an expected value of zero does not hold for a given estimated regression model?

Violations of assumptions are a matter of degree. Assessing the degree of violation & taking proper corrective action are advanced topics of regression diagnostics.

Page 49:

How to check if there’s a constant, normal distribution of residuals & zero expected value of residuals?

. reg math read

. predict e, resid [e = ‘errors’]

. hist e, norm

. rvfplot, yline(0)
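Two further checks that are often added at this point (a sketch; these commands are not in the original slides):

. qnorm e [normal quantile plot of the residuals]

. estat hettest [Breusch–Pagan/Cook–Weisberg test for non-constant residual variance, run after ‘regress’]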

Page 50:

[Figure: histogram of the residuals from ‘reg math read’ with a normal-density overlay; residuals range roughly from −20 to 20.]

If the distribution is approximately normal, as it roughly is above (note some negative skewness), then the assumption basically holds. In this specific case, the slight negative skewness does alert us to possible problems with other diagnostics.

Page 51:

[Figure: residual-versus-fitted plot (rvfplot); residuals roughly −20 to 20 against fitted values roughly 40 to 70.]

If the distribution of the residuals is approximately random, which it roughly is above (note the degree of rightward expansion), then the assumption basically holds. In this specific case, we would want to check other diagnostics to confirm that there are no serious violations of the assumptions.

Page 52:

Another potential problem we check in regression diagnostics is that of x-outliers.

x-outliers that fall far from the mean in the y/x scatterplot may be influential: i.e. they may exert an excessive effect on the slope coefficients.

Always ask: Why are there outliers? What is their effect? How should we deal with them?

Page 53:

How to check for regression outliers in Stata? Here’s the preliminary way:

. scatter math read, ml(id)

(In this data set, id identifies each subject or observation.)

Look for x-observations that fall far away from the pack to the left or right.

Page 54:

There are no notable outliers.

[Figure: scatterplot of math score against reading score (both roughly 30–80), with each of the 200 observations labeled by its id.]

Page 55:

If there were notable outliers, we would estimate the regression model both with & without the outliers, then compare the models.

. reg math read

. reg math read if id~=15 & id~=167

More important, however, are post-estimation diagnostics that assess the influence of outliers within a regression model (e.g., lvr2plot, avplots, dfbeta). E.g.:

. reg math read

Page 56:

. lvr2plot, ml(id)

Notable outliers would be located in the right-hand quadrant.

[Figure: leverage versus normalized residual squared, with each of the 200 observations labeled by its id.]

Page 57:

In any case, we use the estimated regression line—which must be based on a random sample of observations measured on the same individuals or subjects—to estimate the population regression line.

Sampling error (as well as non-sampling error) causes uncertainty in the estimated regression line, as in inferential statistics in general.

Page 58:

Regression measures a linear association: non-linearity—which if present emerges in non-normal distributions of residuals—creates misleading results.

Page 59:

Before we proceed, recall that there are two regression lines.

What are the two regression lines? Why are there two lines? How does this distinguish regression from correlation? What are the pitfalls of this?

Page 60:

As with correlation, beware of lurking variables in regression analysis.

As we’ll see, multiple (rather than simple) regression addresses the problem of lurking variables, though not as effectively as experimental design.

Page 61:

How to do simple (i.e. one explanatory variable) linear regression in Stata?

Let’s assume that the preliminary graphical & numerical descriptions have been carried out, & that the scatterplot indicates an approximately linear y/x relationship.

Page 62:

. reg math read

Source SS df MS Number of obs = 200

F( 1, 198) = 154.70

Model 7660.75905 1 7660.75905 Prob > F = 0.0000

Residual 9805.03595 198 49.5203836 R-squared = 0.4386

Adj R-squared = 0.4358

Total 17465.795 199 87.7678141 Root MSE = 7.0371

math Coef. Std. Err. t P>t [95% Conf. Interval]

read .6051473 .0486538 12.44 0.000 .509201 .7010935

_cons 21.03816 2.58945 8.12 0.000 15.93172 26.1446

Interpretation?

Page 63:

The sample-estimated variation of response variable y around its x-based predicted values is measured by the residuals (i.e. deviations or errors): the deviations between each observed y & each predicted y (‘yhat’).

Page 64:

REGRESSION DATA = FIT + RESIDUALS

Fit: the model’s estimate of y’s average value for each level of an x-variable.

Residuals: the deviations (‘errors’) of the predicted y values (yhat) from the observed y values (e.g., the deviations of the predicted math scores from the observed math scores)

SST = SSM + SSE

Page 65:

REGRESSION DATA = FIT + RESIDUALS

SST = SSM + SSE

This formula underpins various diagnostic measures of a model’s explanatory/ predictive worth.

Page 66:

. reg math read

      Source |        SS      df        MS        Number of obs =    200
       Model | 7660.75905      1  7660.75905      F(1, 198)     = 154.70
    Residual | 9805.03595    198  49.5203836      Prob > F      = 0.0000
       Total |  17465.795    199  87.7678141      R-squared     = 0.4386
                                                  Adj R-squared = 0.4358
                                                  Root MSE      = 7.0371

        math |     Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
        read |  .6051473    .0486538   12.44   0.000     .509201    .7010935
       _cons |  21.03816     2.58945    8.12   0.000    15.93172     26.1446

Page 67:

Sum of Squares Total (SST): take each observed y minus the mean of y; square each deviation; sum the squared deviations.

Sum of Squares for Model (SSM): take each predicted y minus the mean of y; square each deviation; sum the squared deviations.

Sum of Squares for Errors (SSE) (i.e. Sum of Squares for Residuals): take each observed y minus its predicted y; square each deviation; sum the squared deviations.
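As a check against the ‘reg math read’ output on the next page: SSM + SSE = 7660.759 + 9805.036 = 17465.795 = SST.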

Page 68:

. reg math read

      Source |        SS      df        MS        Number of obs =    200
       Model | 7660.75905      1  7660.75905      F(1, 198)     = 154.70
    Residual | 9805.03595    198  49.5203836      Prob > F      = 0.0000
       Total |  17465.795    199  87.7678141      R-squared     = 0.4386
                                                  Adj R-squared = 0.4358
                                                  Root MSE      = 7.0371

        math |     Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
        read |  .6051473    .0486538   12.44   0.000     .509201    .7010935
       _cons |  21.03816     2.58945    8.12   0.000    15.93172     26.1446

Page 69:

Linear regression is called ‘ordinary least squares’ (OLS) because the equation chooses the straight line that minimizes the squared deviations between the observed values of y & the model’s predicted values of y (yhat; e.g., predicted math test scores).

Page 70:

. reg math read

      Source |        SS      df        MS        Number of obs =    200
       Model | 7660.75905      1  7660.75905      F(1, 198)     = 154.70
    Residual | 9805.03595    198  49.5203836      Prob > F      = 0.0000
       Total |  17465.795    199  87.7678141      R-squared     = 0.4386
                                                  Adj R-squared = 0.4358
                                                  Root MSE      = 7.0371

        math |     Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
        read |  .6051473    .0486538   12.44   0.000     .509201    .7010935
       _cons |  21.03816     2.58945    8.12   0.000    15.93172     26.1446

Page 71:

We use the sample data—i.e. the observations on outcome variable y & explanatory variables x—to estimate the following:

Page 72:

(1) The slope coefficients. For simple regression:

b = r · (sy / sx)

where r = the correlation of x & y, and s = standard deviation.
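As a check with the descriptive statistics and correlation reported earlier (regressing math on read): b = 0.6623 × (9.368/10.253) ≈ 0.605, which matches the read coefficient of .6051 in the regression output.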

Page 73:

. reg math read

      Source |        SS      df        MS        Number of obs =    200
       Model | 7660.75905      1  7660.75905      F(1, 198)     = 154.70
    Residual | 9805.03595    198  49.5203836      Prob > F      = 0.0000
       Total |  17465.795    199  87.7678141      R-squared     = 0.4386
                                                  Adj R-squared = 0.4358
                                                  Root MSE      = 7.0371

        math |     Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
        read |  .6051473    .0486538   12.44   0.000     .509201    .7010935
       _cons |  21.03816     2.58945    8.12   0.000    15.93172     26.1446

Page 74:

Here’s how to graph the confidence interval of a regression coefficient:

. twoway qfitci math read

[Figure: fitted values of math score on reading score with 95% confidence band; reading score roughly 30–80, math score roughly 30–70.]

See the course document ‘Graphing confidence intervals in Stata’ for other options.

Page 75:

(2) Y-intercept (the formula being for simple regression):

b0 = ȳ − b1·x̄

The value of predicted y when the explanatory x’s are zero.

It typically has no substantive meaning. Why not?
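As a check with the sample means reported earlier: b0 = 52.645 − 0.6051 × 52.23 ≈ 21.04, matching _cons in the output. The substantive point: this is the predicted math score for a student with a reading score of 0, a value far outside the observed range (reading scores run from 28 to 76).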

Page 76:

. reg math read

      Source |        SS      df        MS        Number of obs =    200
       Model | 7660.75905      1  7660.75905      F(1, 198)     = 154.70
    Residual | 9805.03595    198  49.5203836      Prob > F      = 0.0000
       Total |  17465.795    199  87.7678141      R-squared     = 0.4386
                                                  Adj R-squared = 0.4358
                                                  Root MSE      = 7.0371

        math |     Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
        read |  .6051473    .0486538   12.44   0.000     .509201    .7010935
       _cons |  21.03816     2.58945    8.12   0.000    15.93172     26.1446

_cons: y-intercept

Page 77:

(3) The residuals:

ŷi = b0 + b1x1i + b2x2i + … + bkxki

ei = yi − ŷi

Page 78:

The residuals ei correspond to the deviations of each predicted y (i.e. each ŷ, or yhat) from each observed y.

The residuals ei must have an approximately normal distribution with an expected value of zero (over an infinite number of observations).

Page 79:

How to Obtain the Predicted Values of Y & the Residuals in Stata

. reg math read

. predict yhat [yhat=predicted values of y]

. predict e, resid [e=residuals]

. for varlist read yhat e: hist X, norm

Page 80:

The method of least squares chooses the values of the regression coefficients that make the sum of the squares of the residuals as small as possible:

Σ (yi − (b0 + b1x1i + b2x2i + … + bkxki))²

Page 81:

Software regression output typically refers to the residual terms—which go into the computation of model-fit indicators—more or less as follows:

s2 = Mean Square Error (MSE: the sum of squared residuals divided by its degrees of freedom, i.e. the residual variance)

s = Root Mean Square Error (Root MSE: the square root of MSE, i.e. the standard error of the estimate)

Page 82:

. reg math read

      Source |        SS      df        MS        Number of obs =    200
       Model | 7660.75905      1  7660.75905      F(1, 198)     = 154.70
    Residual | 9805.03595    198  49.5203836      Prob > F      = 0.0000
       Total |  17465.795    199  87.7678141      R-squared     = 0.4386
                                                  Adj R-squared = 0.4358
                                                  Root MSE      = 7.0371

        math |     Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
        read |  .6051473    .0486538   12.44   0.000     .509201    .7010935
       _cons |  21.03816     2.58945    8.12   0.000    15.93172     26.1446

MS Residual = SS Residual / df = 9805.04/198 = 49.52

Root MSE = sqrt(49.52) = 7.0371

Page 83:

Stata labels the Residuals, Mean Square Error & Root Mean Square Error as follows:

Top-left table

SS for Residual: sum of squared errors (i.e. sum of squared residuals)

MS for Residual: SS for Residual divided by its degrees of freedom (the residual variance)

Top-right column

Root MSE: square root of MS for Residual (the standard error of the estimate)

Page 84:

. reg math read

      Source |        SS      df        MS        Number of obs =    200
       Model | 7660.75905      1  7660.75905      F(1, 198)     = 154.70
    Residual | 9805.03595    198  49.5203836      Prob > F      = 0.0000
       Total |  17465.795    199  87.7678141      R-squared     = 0.4386
                                                  Adj R-squared = 0.4358
                                                  Root MSE      = 7.0371

        math |     Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
        read |  .6051473    .0486538   12.44   0.000     .509201    .7010935
       _cons |  21.03816     2.58945    8.12   0.000    15.93172     26.1446

Page 85:

To repeat:

Top-left table

SS for Residual: sum of squared errors

MS for Residual: SS for Residual divided by its degrees of freedom (the residual variance)

Top-right column

Root MSE: square root of MS for Residual (the standard error of the estimate)

Page 86:

r-squared: A Measure of Model Fit

Two basic ways of assessing model fit: (1) the slope of the regression line & (2) the amount of cluster around the regression line.

The regression coefficient=slope of the regression line (higher coefficient=greater slope).

r-squared=the degree of cluster around the regression line (higher r-squared=greater cluster).

r-squared=what percentage of the variance of y is explained by the explanatory variables?

That is, how much of the variance of y is explained by the model versus how much is explained by merely using the mean of y?

Page 87:

r-squared = the square of the correlation of y & x in simple regression.

Note: r-squared for multiple regression is more complicated to compute, as later discussed.

Regression output table:

r-squared = SS Model/SS Total
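Check with the earlier output: the math–read correlation is 0.6623, and 0.6623² ≈ 0.4386, which equals the R-squared reported for ‘reg math read’.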

Page 88:

Sum of Squares Total (SST): take each observed y minus the mean of y; square each deviation; sum the squared deviations.

Sum of Squares for Model (SSM): take each predicted y minus the mean of y; square each deviation; sum the squared deviations.

Sum of Squares for Errors (SSE): take each observed y minus its predicted y; square each deviation; sum the squared deviations.

Page 89:

. reg math read

      Source |        SS      df        MS        Number of obs =    200
       Model | 7660.75905      1  7660.75905      F(1, 198)     = 154.70
    Residual | 9805.03595    198  49.5203836      Prob > F      = 0.0000
       Total |  17465.795    199  87.7678141      R-squared     = 0.4386
                                                  Adj R-squared = 0.4358
                                                  Root MSE      = 7.0371

        math |     Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
        read |  .6051473    .0486538   12.44   0.000     .509201    .7010935
       _cons |  21.03816     2.58945    8.12   0.000    15.93172     26.1446

r-squared=7660.759/17465.79=0.4386

This is the percentage of variance in y that the model explains.

Page 90:

Multiple Regression

What did we say was the advantage of multiple regression over simple regression?

Multiple regression allows us to examine how values of response variable y vary according to changes in the values of more than one explanatory variable x.

The relationship of each x to the value of y is measured holding the other x’s constant. A given x may therefore perform differently within differing sets of x’s.

Page 91:

Multiple regression, then, allows us not only to examine more than one explanatory variable but, in doing so, to control—that is, hold constant—otherwise lurking variables.

This approximates experimental control for lurking variables. Why is it not as effective as experimental design in controlling lurking variables?

And what are the intrinsic problems regarding causality?

Page 92:

. reg math read

      Source |        SS      df        MS        Number of obs =    200
       Model | 7660.75905      1  7660.75905      F(1, 198)     = 154.70
    Residual | 9805.03595    198  49.5203836      Prob > F      = 0.0000
       Total |  17465.795    199  87.7678141      R-squared     = 0.4386
                                                  Adj R-squared = 0.4358
                                                  Root MSE      = 7.0371

        math |     Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
        read |  .6051473    .0486538   12.44   0.000     .509201    .7010935
       _cons |  21.03816     2.58945    8.12   0.000    15.93172     26.1446

Page 93:

. reg math read write science

      Source |        SS      df        MS        Number of obs =    200
       Model | 9687.25053      3  3229.08351      F(3, 196)     =  81.36
    Residual | 7778.54447    196  39.6864514      Prob > F      = 0.0000
       Total |  17465.795    199  87.7678141      R-squared     = 0.5546
                                                  Adj R-squared = 0.5478
                                                  Root MSE      = 6.2997

        math |     Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
        read |  .3064101    .0604347    5.07   0.000    .1872243    .4255959
       write |  .2609042    .0618005    4.22   0.000    .1390248    .3827836
     science |  .2543792    .0611411    4.16   0.000    .1338003    .3749582
       _cons |   9.68242    2.817438    3.44   0.001    4.126034    15.23881

What happened to the coefficient for ‘read’? Why?

Page 94:

How do we evaluate how well an estimated regression model fits the data?

(1) F-test: overall significance of the model

(2) t-tests of each slope coefficient

(3) r2: overall explanatory/predictive effectiveness of the model

(4) Post-estimation diagnostics: to assess residuals for non-linearity & to assess the influence of outliers.

Page 95:

What is the problem with conducting several hypothesis tests of slope coefficients in a single equation?

The probability of Type I errors.

Page 96:

Begin, then, with an F-test for overall model significance, before testing the slope coefficients.

Ho: β1 = β2 = … = βp = 0

Ha: at least one βj ≠ 0

F = Mean Square for Model / Mean Square for Error

The F-test examines if at least one of the regression coefficients has a statistically significant relationship with the outcome variable y.

Only if the F-test is significant do we then test for t-significance of the individual slope coefficients.

Page 97:

. reg math read write science

      Source |        SS      df        MS        Number of obs =    200
       Model | 9687.25053      3  3229.08351      F(3, 196)     =  81.36
    Residual | 7778.54447    196  39.6864514      Prob > F      = 0.0000
       Total |  17465.795    199  87.7678141      R-squared     = 0.5546
                                                  Adj R-squared = 0.5478
                                                  Root MSE      = 6.2997

        math |     Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
        read |  .3064101    .0604347    5.07   0.000    .1872243    .4255959
       write |  .2609042    .0618005    4.22   0.000    .1390248    .3827836
     science |  .2543792    .0611411    4.16   0.000    .1338003    .3749582
       _cons |   9.68242    2.817438    3.44   0.001    4.126034    15.23881

Interpretation?

Page 98:

. reg math read write science

      Source |        SS      df        MS        Number of obs =    200
       Model | 9687.25053      3  3229.08351      F(3, 196)     =  81.36
    Residual | 7778.54447    196  39.6864514      Prob > F      = 0.0000
       Total |  17465.795    199  87.7678141      R-squared     = 0.5546
                                                  Adj R-squared = 0.5478
                                                  Root MSE      = 6.2997

        math |     Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
        read |  .3064101    .0604347    5.07   0.000    .1872243    .4255959
       write |  .2609042    .0618005    4.22   0.000    .1390248    .3827836
     science |  .2543792    .0611411    4.16   0.000    .1338003    .3749582
       _cons |   9.68242    2.817438    3.44   0.001    4.126034    15.23881

F-statistic = 3229.08/39.69 = 81.36

Pages 99–100:

Like other tests of significance, F is the magnitude of the effect divided by the error term.

Page 101:

To repeat, conduct the F-test for overall model significance before testing the slope coefficients.

Ho: β1 = β2 = … = βp = 0

Ha: at least one βj ≠ 0

F = Mean Square for Model / Mean Square for Error

The F-test examines if at least one of the regression coefficients has a statistically significant relationship with the outcome variable y.

Only if the F-test is significant do we then test for t-significance of the individual slope coefficients.

Page 102:

t-tests of Slope Coefficients

Conduct a significance test (one or two-sided) for each slope coefficient.

Ho: β1 = 0

Ha: β1 ≠ 0 (or one-sided test: > or <)

Beware of Type I errors.

Page 103:

. reg math read write science

      Source |        SS      df        MS        Number of obs =    200
       Model | 9687.25053      3  3229.08351      F(3, 196)     =  81.36
    Residual | 7778.54447    196  39.6864514      Prob > F      = 0.0000
       Total |  17465.795    199  87.7678141      R-squared     = 0.5546
                                                  Adj R-squared = 0.5478
                                                  Root MSE      = 6.2997

        math |     Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
        read |  .3064101    .0604347    5.07   0.000    .1872243    .4255959
       write |  .2609042    .0618005    4.22   0.000    .1390248    .3827836
     science |  .2543792    .0611411    4.16   0.000    .1338003    .3749582
       _cons |   9.68242    2.817438    3.44   0.001    4.126034    15.23881

t-value for read = .306/.060 = 5.07

Page 104:

R2

R2: the squared multiple correlation (capital R, vs. the previous lower-case r) measures the proportion of the variance of the outcome variable that is explained by the explanatory variables (i.e. degree of cluster around the regression line).

R2 is the square of the correlation of the predicted values of y with the observed values of y.

It tells us what percentage of y’s variance is accounted for by the model (i.e. the explanatory variables): higher R2, greater fit.

Page 105:

. reg math read write science

      Source |        SS      df        MS        Number of obs =    200
       Model | 9687.25053      3  3229.08351      F(3, 196)     =  81.36
    Residual | 7778.54447    196  39.6864514      Prob > F      = 0.0000
       Total |  17465.795    199  87.7678141      R-squared     = 0.5546
                                                  Adj R-squared = 0.5478
                                                  Root MSE      = 6.2997

        math |     Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
        read |  .3064101    .0604347    5.07   0.000    .1872243    .4255959
       write |  .2609042    .0618005    4.22   0.000    .1390248    .3827836
     science |  .2543792    .0611411    4.16   0.000    .1338003    .3749582
       _cons |   9.68242    2.817438    3.44   0.001    4.126034    15.23881

R2: Model SS/Total SS = 9687.25/17465.79 = 0.55

Page 106:

R-squared, however, is much less preferred than the slope coefficients as an indicator of model fit.

Recall the difference between ‘historicist’ & ‘generalizing’ explanation. Simply adding more explanatory variables—even if they’re meaningless—will increase R-squared.

Page 107:

How to do multiple regression in Stata?

Page 108:

Example: How much do science achievement test scores depend on reading achievement test scores & math achievement test scores in a random sample of 200 high school students?

Page 109:

If there are any missing observations:

. mark complete

. markout complete science read math

Alternatively: perhaps save a working data set that excludes the missing observations.

Page 110:

Regarding mark & markout:

‘complete’ is a binary, or dummy, variable:

1=complete data

0=incomplete data

. tab complete

Page 111:

. gr matrix science read math if complete==1, half

[Figure: scatterplot matrix (lower half) of science score, reading score & math score, axes roughly 20–80.]

Page 112:

. kdensity science if complete==1, norm

. gr box science if complete==1

. kdensity read if complete==1, norm

. gr box read if complete==1

. kdensity math if complete==1, norm

. gr box math if complete==1

‘complete’ is a binary, or dummy variable: 1=complete data, 0=incomplete data

Page 113:

. su science read math if complete==1, detail

Page 114:

. pwcorr science read math if complete==1, obs sig bonf star(.05)

             |  science     read     math
-------------+---------------------------
     science |   1.0000
             |      200
             |
        read |   0.6302*  1.0000
             |   0.0000
             |      200      200
             |
        math |   0.6307*  0.6623*  1.0000
             |   0.0000   0.0000
             |      200      200      200

Note this way of using pwcorr (‘if complete==1’) corresponds to how regression uses the observations.

Page 115:

Correlation ci’s

. ci2 read write if complete==1, corr

. ci2 read science if complete==1, corr

. ci2 write science if complete==1, corr

Confidence interval for Pearson's product-moment correlation of read and write, based on Fisher's transformation.Correlation = 0.597 on 200 observations (95% CI: 0.499 to 0.679)

Confidence interval for Pearson's product-moment correlation of write and science, based on Fisher's transformation. Correlation = 0.570 on 200 observations (95% CI: 0.469 to 0.657)


Page 116:

‘Partial correlations’—correlations of the outcome variable with each explanatory variable, holding the other explanatory variables constant—are also helpful:

. pcorr science read math if complete==1

Partial correlation of science with

Variable | Corr. Sig.

-------------+------------------

read | 0.3654 0.000

math | 0.3668 0.000

Compare partial correlations to correlations.

Page 117: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

Here are the hypotheses that we’ll test.

Page 118: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

Overall model (check the F-test for significance):

Ho: β1 = β2 = ... = βp = 0

Ha: at least one βj ≠ 0

Each slope coefficient (check the t-test for significance):

Ho: βj = 0

Ha: βj ≠ 0
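For reference, the overall F-test appears in the regression header once the model is fit, & the test command reproduces the joint test of the listed coefficients (a minimal sketch using the model fit below):

. reg science read math if complete==1

. test read math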

Page 119: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

. reg science read if complete==1

     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   .6085207   .0532864    11.42   0.000      .503439    .7136024
       _cons |   20.06696   2.836003     7.08   0.000     14.47432    25.65961
------------------------------------------------------------------------------

. reg science math if complete==1

     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        math |     .66658   .0582822    11.44   0.000     .5516466    .7815135
       _cons |   16.75789   3.116229     5.38   0.000     10.61264    22.90315
------------------------------------------------------------------------------

Page 120: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

. reg science read math if complete==1

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  2,   197) =   90.27
       Model |  9328.73944     2  4664.36972           Prob > F      =  0.0000
    Residual |  10178.7606   197  51.6688353           R-squared     =  0.4782
-------------+------------------------------           Adj R-squared =  0.4729
       Total |     19507.5   199  98.0276382           Root MSE      =  7.1881

     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   .3654205   .0663299     5.51   0.000     .2346128    .4962282
        math |   .4017207   .0725922     5.53   0.000     .2585632    .5448782
       _cons |    11.6155   3.054262     3.80   0.000     5.592255    17.63875

What might happen to the slope coefficients if we add other explanatory variables? Why?

Page 121: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

Here’s a quick, simple way to graph regression coefficient ci’s in multiple regression:

. reg science read math if complete==1

. gorciv read math

Page 122: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

[Graph: estimated coefficients for read & math with confidence intervals]

See ‘Graphing confidence intervals in Stata’ for the commands ‘parmby,’ ‘sencode,’ & ‘ecplot’ to make more useful graphs.
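Another option, assuming the user-written coefplot package is installed (ssc install coefplot):

. reg science read math if complete==1

. coefplot, drop(_cons) xline(0)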

Page 123: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

Check that the number of observations (N) in the model is correct.

. predict yhat if e(sample)

. predict e if e(sample), resid

. hist yhat, norm

. hist e, norm

. rvfplot, yline(0)

. sort science [to order low-high values]

. list science yhat e

. lvr2plot, ml(id)

. lincom _cons + read*45 + math*45

. lincom _cons + read*65 + math*65
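In more recent versions of Stata, margins can produce the same predicted values as the lincom lines above (a hedged sketch using the same illustrative scores of 45 & 65):

. margins, at(read=45 math=45)

. margins, at(read=65 math=65)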

Page 124: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

[Kernel density/histogram of the residuals with a normal curve overlaid]

The residuals are approximately normal in distribution.

Page 125: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

[rvfplot: residuals vs. fitted values]

There might be some rightward tilt worth exploring. We can use ‘rvpplot’ to explore individual variables.

Page 126: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

[rvpplot graphs: residuals vs. reading score & residuals vs. math score]

. rvpplot read, yline(0)

. rvpplot math, yline(0)

There might be some minor problem with read’s relationship with math.

Page 127: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

[lvr2plot: leverage vs. normalized residual squared, with observations labeled by id 1-200]

lvr2plot: let’s estimate the model with & without id=167 to see if there’s a notable difference.

Page 128: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

. reg science read math if complete==1

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  2,   197) =   90.27
       Model |  9328.73944     2  4664.36972           Prob > F      =  0.0000
    Residual |  10178.7606   197  51.6688353           R-squared     =  0.4782
-------------+------------------------------           Adj R-squared =  0.4729
       Total |     19507.5   199  98.0276382           Root MSE      =  7.1881

     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   .3654205   .0663299     5.51   0.000     .2346128    .4962283
        math |   .4017207   .0725922     5.53   0.000     .2585632    .5448782
       _cons |    11.6155   3.054262     3.80   0.000     5.592255    17.63875

. reg science read math if id~=167

      Source |       SS       df       MS              Number of obs =     199
-------------+------------------------------           F(  2,   196) =   93.95
       Model |  9449.46198     2  4724.73099           Prob > F      =  0.0000
    Residual |  9856.80938   196  50.2898438           R-squared     =  0.4895
-------------+------------------------------           Adj R-squared =  0.4842
       Total |  19306.2714   198   97.506421           Root MSE      =  7.0915

     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   .3280934   .0670811     4.89   0.000     .1958001    .4603867
        math |   .4475735   .0738742     6.06   0.000     .3018831    .5932638
       _cons |   11.05814    3.02127     3.66   0.000     5.099772    17.01651

There’s some difference, but nothing that will change the interpretation of the results.

Page 129: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

Results: the regression model provides a good fit: there are significant relationships between y & the explanatory variables, with meaningful magnitudes.

The residuals are more or less properly distributed; & the one influential outlier doesn’t make any major difference.

Page 130: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

Let’s try to improve the model by adding a new explanatory variable, ‘white,’ which is coded 0=nonwhite 1=white.

A binary categorical variable coded 0/1 is called an indicator, or dummy, variable.
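A minimal sketch of creating such a dummy from a categorical variable (the variable race & the code used for white are hypothetical; check the actual coding with tab race first):

. gen white = (race==4) if !missing(race)

. tab race white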

Page 131: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

. tab white

. su science, d

. ttest science, by(white) [exploratory hypothesis test]

Page 132: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

. ttest science, by(white)

Two-sample t test with equal variances

Group Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]

nonwhite 55 45.65455 1.255887 9.313905 43.13664 48.17245

white 145 54.2 .7552879 9.09487 52.70712 55.69288

combined 200 51.85 .7000987 9.900891 50.46944 53.23056

diff -8.545455 1.44982 -11.40452 -5.686385

Degrees of freedom: 198

Ho: mean(nonwhite) - mean(white) = diff = 0

   Ha: diff < 0                  Ha: diff != 0                 Ha: diff > 0
    t =  -5.8941                  t =  -5.8941                  t =  -5.8941
Pr(T < t) = 0.0000        Pr(|T| > |t|) = 0.0000        Pr(T > t) = 1.0000

Page 133: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

. reg science read math white if complete==1

Source | SS df MS Number of obs = 200

------------+------------------------------ F( 3, 196) = 70.84

Model | 10148.3965 3 3382.79884 Prob > F = 0.0000

Residual | 9359.10349 196 47.750528 R-squared = 0.5202

------------+------------------------------ Adj R-squared = 0.5129

Total | 19507.50 199 98.0276382 Root MSE = 6.9102

------------------------------------------------------------------------------

science | Coef. Std. Err. t P>|t| [95% Conf. Interval]

------------+----------------------------------------------------------------

read | .3227622 .0645911 5.00 0.000 .1953793 .450145

math | .380627 .0699709 5.44 0.000 .2426345 .5186194

white | 4.719807 1.139193 4.14 0.000 2.473158 6.966456

_cons | 11.53217 2.936238 3.93 0.000 5.741491 17.32284

------------------------------------------------------------------------------

Interpretation?

Page 134: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

Interpretation of the dummy variable: the science scores of ‘whites’ are 4.7 points higher than those of ‘nonwhites,’ on average, other explanatory variables held constant.

Page 135: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

. drop yhat e

Check that the number of observations (N) in the model is correct.

. predict yhat if e(sample)

. predict e if e(sample), resid

. hist yhat, norm

. hist e, norm

. sort science

. l science yhat e

. rvfplot, yline(0)

. lvr2plot, ml(id)

. lincom _cons + math*45 + white

. lincom _cons + math*45 - white

Go through the model-assessment steps: Is this an improved model or not, & why?

Page 136: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

Our discussion of linear regression brings us back to an earlier topic: linear transformations (Moore & McCabe, pages 51-55).

Page 137: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

Recall that multiplying each observation by a positive number b multiplies measures of center & the standard deviation by b (& the variance by b²), thereby increasing dispersion (i.e. inequality) when b > 1.

Recall also that adding or subtracting the same number a to observations adds a to measures of center & to quartiles & other percentiles but does not change measures of spread.

Page 138: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

. gen xsci=5*science

. univar science xsci

Variable n Mean S.D. Min .25 Mdn .75 Max

-------------------------------------------------------------------------------

science 200 51.85 9.90 26.00 44.00 53.00 58.00 74.00

xsci 200 259.25 49.50 130.00 220.00 265.00 290.00 370.00

. gen asci=5 + science

. univar science asci

Variable n Mean S.D. Min .25 Mdn .75 Max

-------------------------------------------------------------------------------

science 200 51.85 9.90 26.00 44.00 53.00 58.00 74.00

asci 200 56.85 9.90 31.00 49.00 58.00 63.00 79.00

-------------------------------------------------------------------------------

Page 139: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

Other kinds of transformations, however, are not linear but, rather, nonlinear (see Moore & McCabe, pages 187-203).

In contrast to linear transformations, nonlinear transformations potentially can normalize a variable’s distribution & straighten a curved relationship between two variables.

Page 140: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

Why might we need to straighten out a skewed univariate distribution or a curved relationship between two variables?

(1) A skewed distribution may be difficult to examine because many observations may be piled up in one place or some observations may be hidden; & summary measures such as the mean & standard deviation are distorted by skewed distributions.

Page 141: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

(2) Linear relationships are easier to interpret; statistical theory is better developed for linear relationships; & nonparametric statistics are not as insightful as parametric statistics.

(3) A curvilinear relationship yields invalid results for Pearson correlation & linear regression, which assume linearity.

Page 142: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

The nonlinear transformations that we’ll briefly consider will consist of logarithms, powers & roots.

Remember: for correlation and regression, what really matters are not univariate distributions but rather bivariate relationships as displayed in scatterplots.

So, for correlation and regression, don’t make decisions about transformations until you’ve inspected the scatterplots.
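One simple way to inspect a bivariate relationship before deciding on a transformation (y & x are placeholder variable names):

. twoway scatter y x || lowess y x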

Page 143: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

Keeping the preceding point in mind, Stata makes choosing among potential non-linear transformations relatively easy.

Using data on household consumption per capita in Tegucigalpa:

. kdensity dhc, norm

. qladder dhc

Page 144: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

[Kernel density estimate of dollars household consumption per capita, with a normal density overlaid]

. kdensity dhc, norm scheme(economist)

Page 145: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

[qladder dhc: quantile-normal plots of the dollar-weighted welfare measure under the ladder-of-powers transformations (cubic, square, identity, sqrt, log, 1/sqrt, inverse, 1/square, 1/cubic)]

. qladder dhc

Page 146: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

qladder dhc suggests that a log transformation of dhc will make its distribution more normal.

. gen ldhc = ln(dhc)

. su dhc ldhc

. for varlist ldhc-dhc: kdensity X, norm \ more

Page 147: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

. kdensity dhc, norm

[Kernel density estimate of dhc (dollars household consumption per capita) with a normal density overlaid: right-skewed]

Page 148: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

. kdensity ldhc, norm

[Kernel density estimate of ldhc with a normal density overlaid: approximately normal]

Page 149: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

The log transformation considerably normalized the variable’s distribution (by dampening the effect of the right-skewed distribution’s high-end values).

Page 150: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

The ‘qladder’ command is based on John Tukey’s ‘ladder of power transformations’ (see Moore & McCabe, pages 191-95; & see Stata’s command ‘ladder’).

See the ‘ladder’ command for a hypothesis-test based approach to selection.

The ladder of transformations recommends particular non-linear transformations for particular non-linear relationships.
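For instance (a brief sketch with the same variable; ladder reports a normality test for each rung of the ladder, & gladder draws the corresponding histograms):

. ladder dhc

. gladder dhc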

Page 151: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

While the ‘ladder’ does help, normalizing a variable’s distribution & linearizing a bivariate relationship generally involve trial & error.

And not all skewed distributions or non-linear bivariate relationships can be straightened out in a satisfactory way.

Page 152: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

There’s always a trade-off, moreover: A nonlinear transformation may indeed normalize a variable or linearize a relationship between variables, but a significant cost may be diminished clarity of interpretation.

Remember: what really matters is the scatterplot, not the univariate frequency distributions.

So don’t make decisions about transformations until you’ve inspected the scatterplot.

Page 153: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

There’ll be lots more transforming next semester.

Page 154: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

Interaction Effects

One more thing: what if the relationship of y to x varies according to the level of another variable, z?

This is an ‘interaction effect’.

E.g., do not drink alcoholic beverages while taking medication X.

E.g., the effect of an educational intervention varies according to the gender of students.
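A minimal sketch of how an interaction term enters a regression model (y, x, & z are placeholder names; the product term is created by hand, as with mvcXmill later in these slides):

. gen xz = x*z

. reg y x z xz

The coefficient on xz estimates how the slope of y on x changes per one-unit increase in z.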

Page 155: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

A regression example: the outcome is the log annual household living standard measure per capita for a stratified random sample of households in several collective agricultural communities (‘ejidos’) in Quintana Roo, Mexico.

The outcome variable increases with higher farm levels of mahogany production & with the presence (vs. absence) of a community saw mill.

Is there an interaction effect between mahogany production level & mill presence/absence?

Page 156: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

. univar lsm
                                        -------------- Quantiles --------------
Variable       n     Mean     S.D.      Min      .25      Mdn      .75      Max
-------------------------------------------------------------------------------
     lsm     199     3.98     1.97     0.01     2.49     3.67     5.49     8.69
-------------------------------------------------------------------------------

. tabstat lsm, s(mean sd med iqr min max) by(mvc) f(%9.2f)

Summary for variables: lsm
by categories of: mvc (Mahogany volume categories)

     mvc      mean        sd       p50       iqr       min       max
---------------------------------------------------------------------
       0      3.61      1.74      3.50      1.78      1.16      8.27
       1      3.33      1.77      3.16      2.36      0.01      8.15
       2      4.09      1.70      3.70      2.14      1.33      7.88
       3      2.83      0.93      2.64      1.27      1.53      5.47
       4      6.59      1.26      6.75      1.93      4.10      8.69
---------------------------------------------------------------------
   Total      3.98      1.97      3.67      3.01      0.01      8.69
---------------------------------------------------------------------

Page 157: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

. twoway scatter lsm mvc || lfit lsm mvc, by(mill)

[Scatterplots of living standard measure vs. mahogany volume categories with fitted lines, by mill (No mill vs. Mill)]

Page 158: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

. xi3: reg lsm hway i.mill*mvc nmaya resworks

i.mill            _Imill_0-1          (naturally coded; _Imill_0 omitted)
(1 missing value generated)

      Source |       SS       df       MS              Number of obs =     199
-------------+------------------------------           F(  6,   192) =   20.87
       Model |  302.499837     6  50.4166394           Prob > F      =  0.0000
    Residual |  463.861506   192  2.41594534           R-squared     =  0.3947
-------------+------------------------------           Adj R-squared =  0.3758
       Total |  766.361343   198  3.87051183           Root MSE      =  1.5543

         lsm |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        hway |   .6115082   .3280433     1.86   0.064    -.0355233     1.25854
    _Imill_1 |  -2.725266   .6579602    -4.14   0.000    -4.023024   -1.427508
         mvc |   -.019945   .1623813    -0.12   0.902    -.3402253    .3003354
    _Imi1Xmv |   1.236629   .2460735     5.03   0.000     .7512742    1.721983
       nmaya |   .9817913   .4365363     2.25   0.026     .1207687    1.842814
    resworks |   .6022761   .2480852     2.43   0.016     .1129537    1.091598
       _cons |   2.369156   .4068352     5.82   0.000     1.566716    3.171596

Interaction: coef=1.24, p=.000

Page 159: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

postgr3 mvc, by(mill) table: graph of saw mill X mahogany volume categories (see next slide)

[Graph: predicted lsm (yhat) by mahogany volume categories, for No mill vs. Mill]

Page 160: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

. postgr3 mvc, by(mill) table [predicted average values of lsm by millXmvc]

Variables left asis: mvc _Imill_1 _Imi1Xmv
Holding hway constant at .79396985
Holding nmaya constant at .46231156
Holding resworks constant at .71356784

------------------------------
Mahogany   |
volume     |         Mill
categories |   No mill      Mill
-----------+------------------
         0 |  3.738333
         1 |  3.718388
         2 |  3.446435
         3 |  3.678499
         4 |  5.879803
------------------------------

Page 161: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

. findit xi3 [download]

. help xi3

. findit postgr3 [download]

. help postgr3

. xi3: reg y x i.xcat*z

. postgr3 x, by(xcat)
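In newer versions of Stata, factor-variable notation with margins & marginsplot offers a similar workflow without user-written commands (a hedged sketch using the same placeholder names; the at() range is illustrative):

. reg y x c.z##i.xcat

. margins xcat, at(z=(0(1)4))

. marginsplot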

Page 162: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

[Graph: fitted average living standard measure vs. mahogany volume categories, separate curves for No mill & Mill]

. twoway scatter yhat mvc if mill==0 || mspline yhat mvc if mill==0, clpatt(solid) || scatter yhat mvc if mill==1, ms(oh) || mspline yhat mvc if mill==1, clpatt(solid) || , ytitle("Ave. Living Standard Measure")

Another way to graph the interaction

Page 163: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

As an alternative to ‘postgr3, table’, or to complement interaction graphs in general, use lincom to explore predicted outcomes.

See Paul Allison, Multiple Regression: A Primer.

See the next slide…

. reg lsm hway mvc mill mvcXmill nmaya resworks

Page 164: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

. lincom mvc*1 + mvcXmill*1

 ( 1)  mvc + mvcXmill = 0

         lsm |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   1.216684    .185311     6.57   0.000      .851177    1.582191

. lincom mill*1 + mvcXmill*2

 ( 1)  mill + 2 mvcXmill = 0

         (1) |  -.2520085   .5373451    -0.47   0.640    -1.311866    .8078492

. lincom mill*1 + mvcXmill*3

 ( 1)  mill + 3 mvcXmill = 0

         (1) |   .9846203   .6311184     1.56   0.120    -.2601954    2.229436

. lincom mill*1 + mvcXmill*4

 ( 1)  mill + 4 mvcXmill = 0

         (1) |   2.221249    .793086     2.80   0.006      .656969    3.785529

Page 165: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

You’ll spend lots of time analyzing interaction effects next semester.

Page 166: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

On making correlation & regression tables, see the class document ‘Making working & publication-style tables in Stata’.

Page 167: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

. findit esttab [download]

. findit eststo [download]

. reg lsm hway mill mvc nmaya resworks

. eststo

. reg lsm hway mill mvc millXmvc nmaya resworks

. eststo

. esttab, se starlevels(+ .10 * .05 ** .01 *** .001) r2 nodepvars nomtitles compress

Page 168: Review: Correlation vs. Regression  What are the main differences between correlation & regression?  What are the data requirements for each?  What

----------------------------------------
                      (1)          (2)
----------------------------------------
hway                0.946**      0.612+
                   (0.341)      (0.328)
mill               -0.753       -2.725***
                   (0.560)      (0.658)
mvc                 0.517***    -0.0199
                   (0.130)      (0.162)
millXmvc                         1.237***
                                (0.246)
nmaya               1.517***     0.982*
                   (0.449)      (0.437)
resworks            0.636*       0.602*
                   (0.263)      (0.248)
_cons               1.477***     2.369***
                   (0.388)      (0.407)
----------------------------------------
N                     199          199
R-sq                0.315        0.395
----------------------------------------
Standard errors in parentheses
+ p<.10, * p<.05, ** p<.01, *** p<.001