Regression Basics
Predicting a DV with a Single IV
Questions
• What are predictors and criteria?
• Write an equation for the linear regression. Describe each term.
• How do changes in the slope and intercept affect (move) the regression line?
• What does it mean to test the significance of the regression sum of squares? R-square?
• What is R-square?
• What does it mean to choose a regression line to satisfy the loss function of least squares?
• How do we find the slope and intercept for the regression line with a single independent variable? (Either formula for the slope is acceptable.)
• Why does testing for the regression sum of squares turn out to have the same result as testing for R-square?
Basic Ideas
• Jargon
  – IV = X = Predictor (pl. predictors)
  – DV = Y = Criterion (pl. criteria)
  – Regression of Y on X, e.g., GPA on SAT
• Linear Model = relations between IV and DV represented by straight line.
• A score on Y has 2 parts – (1) linear function of X and (2) error.
Yi = α + βXi + εi (population values)
Basic Ideas (2)
• Sample value:
• Intercept – place where X=0
• Slope – change in Y if X changes 1 unit. Rise over run.
• If error is removed, we have a predicted value for each person at X (the line):
Yi = a + bXi + ei
Y' = a + bX

Suppose on average houses are worth about $75.00 a square foot. Then the equation relating price to size would be Y' = 0 + 75X. The predicted price for a 2000 square foot house would be $150,000.
Linear Transformation
• 1 to 1 mapping of variables via line
• Permissible operations are addition and multiplication (interval data)
Changing the Y Intercept
Y = 5 + 2X
Y = 10 + 2X
Y = 15 + 2X
Add a constant
Changing the Slope
Y = 5 + .5X
Y = 5 + X
Y = 5 + 2X
Multiply by a constant
Y' = a + bX
Linear Transformation (2)
• Centigrade to Fahrenheit
• Note 1 to 1 map
• Intercept?
• Slope?
[Graph: Degrees F plotted against Degrees C]
32 degrees F, 0 degrees C
212 degrees F, 100 degrees C
Intercept is 32. When X (Cent) is 0, Y (Fahr) is 32.
Slope is 1.8. When Cent goes from 0 to 100 (run), Fahr goes from 32 to 212 (rise): 212 - 32 = 180, and 180/100 = 1.8, which is rise over run, the slope. Y = 32 + 1.8X, i.e., F = 32 + 1.8C.
Y' = a + bX
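The conversion above is easy to check numerically. A minimal sketch in Python (the function name is illustrative, not from the slides):

```python
def c_to_f(c):
    # Linear transformation from the slide: intercept a = 32, slope b = 1.8
    return 32 + 1.8 * c

print(c_to_f(0))    # 32.0 (freezing point)
print(c_to_f(100))  # 212.0 (boiling point)
```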
Review
• What are predictors and criteria?
• Write an equation for the linear regression with 1 IV. Describe each term.
• How do changes in the slope and intercept affect (move) the regression line?
Regression of Weight on Height
Ht Wt
61 105
62 120
63 120
65 160
65 120
68 145
69 175
70 160
72 185
75 210
N=10 N=10
M=67 M=150
SD=4.57 SD=33.99
[Scatterplot: Weight in Lbs vs. Height in Inches]

Regression of Weight on Height
Slope = rise over run = 6.97
Correlation (r) = .94
Regression equation: Y' = -316.86 + 6.97X
Y' = a + bX
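The slope, intercept, and correlation for the weight-on-height regression can be recovered from the raw data. A sketch in plain Python (variable names are illustrative):

```python
from math import sqrt

# Height (inches) and weight (lbs) from the slide's table
ht = [61, 62, 63, 65, 65, 68, 69, 70, 72, 75]
wt = [105, 120, 120, 160, 120, 145, 175, 160, 185, 210]

n = len(ht)
mx, my = sum(ht) / n, sum(wt) / n  # means: 67 and 150

# Sums of squares and cross-products about the means
sxx = sum((x - mx) ** 2 for x in ht)
syy = sum((y - my) ** 2 for y in wt)
sxy = sum((x - mx) * (y - my) for x, y in zip(ht, wt))

b = sxy / sxx               # slope (rise over run)
a = my - b * mx             # intercept
r = sxy / sqrt(sxx * syy)   # correlation

print(round(b, 2), round(a, 2), round(r, 2))  # 6.97 -316.86 0.94
```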
Illustration of the Linear Model. This concept is vital!
[Figure: scatterplot of Weight on Height with the regression line. The point (65, 120) is marked; relative to the mean of X and the mean of Y, its deviation from the mean of Y (y) splits into a linear part (Y' - Ȳ) and an error part (e = Y - Y').]

Regression of Weight on Height

Yi = α + βXi + εi
Yi = a + bXi + ei
Consider Y as a deviation from the mean.
Part of that deviation can be associated with X (the linear part) and part cannot (the error).
Y' = a + bX
ei = Yi - Y'i
Predicted Values & Residuals

N  Ht  Wt   Y'     Resid
1 61 105 108.19 -3.19
2 62 120 115.16 4.84
3 63 120 122.13 -2.13
4 65 160 136.06 23.94
5 65 120 136.06 -16.06
6 68 145 156.97 -11.97
7 69 175 163.94 11.06
8 70 160 170.91 -10.91
9 72 185 184.84 0.16
10 75 210 205.75 4.25
M 67 150 150.00 0.00
SD 4.57 33.99 31.85 11.89
V 20.89 1155.56 1014.37 141.32
Numbers for the linear part and the error. Note the mean of Y' equals the mean of Y (150) and the mean of the residuals is 0. Note the variance of Y is V(Y') + V(res): 1155.56 = 1014.37 + 141.32 (within rounding).
Y' = a + bX
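These notes can be verified numerically: the predictions average to the mean of Y, the residuals average to zero, and the variance of Y splits into the variance of Y' plus the variance of the residuals. A sketch that recomputes the fit from the raw data rather than the rounded coefficients:

```python
ht = [61, 62, 63, 65, 65, 68, 69, 70, 72, 75]
wt = [105, 120, 120, 160, 120, 145, 175, 160, 185, 210]

n = len(ht)
mx, my = sum(ht) / n, sum(wt) / n
b = sum((x - mx) * (y - my) for x, y in zip(ht, wt)) / sum((x - mx) ** 2 for x in ht)
a = my - b * mx

pred = [a + b * x for x in ht]              # Y', the linear part
resid = [y - p for y, p in zip(wt, pred)]   # e = Y - Y'

def var(v):
    # Sample variance (N-1 denominator, matching the SDs on the slides)
    m = sum(v) / len(v)
    return sum((x - m) ** 2 for x in v) / (len(v) - 1)

print(round(sum(pred) / n, 2))             # 150.0: mean of Y' = mean of Y
print(round(sum(resid) / n, 6))            # 0.0: residuals average to zero
print(round(var(wt), 2))                   # 1155.56
print(round(var(pred) + var(resid), 2))    # 1155.56: V(Y) = V(Y') + V(res)
```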
Finding the Regression Line
Need to know the correlation, SDs and means of X and Y. The correlation is the slope when both X and Y are expressed as z scores. To translate to raw scores, just bring back original SDs for both.
r_XY = Σ(z_X z_Y) / N

b = r_XY (SD_Y / SD_X)   (rise over run)

To find the intercept, use: a = M_Y - b·M_X

Suppose r = .50, SD_X = .5, M_X = 10, SD_Y = 2, M_Y = 5.

b = .50(2/.5) = 2
a = 5 - 2(10) = -15
Y' = -15 + 2X
Slope Intercept Equation
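The recipe (slope from the correlation and SDs, intercept from the means) takes only a couple of lines. A sketch with the numbers from the worked example above:

```python
# Given summary statistics: r = .50, SD_X = .5, M_X = 10, SD_Y = 2, M_Y = 5
r, sd_x, m_x, sd_y, m_y = 0.50, 0.5, 10.0, 2.0, 5.0

b = r * sd_y / sd_x  # slope: bring back the raw-score SDs
a = m_y - b * m_x    # intercept: the line passes through (M_X, M_Y)

print(b, a)  # 2.0 -15.0, i.e., Y' = -15 + 2X
```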
Line of Least Squares
[Figure: the same scatterplot of Weight on Height, showing the regression line, the point (65, 120), and its linear and error parts.]

Regression of Weight on Height
We have some points. Assume a linear relation is reasonable, so the two variables can be represented by a line. Where should the line go? Place the line so the errors (residuals) are small. The line we calculate has a sum of errors equal to 0, and a sum of squared errors that is as small as possible; the line provides the smallest sum of squared errors, or least squares.
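The least-squares property can be checked directly: compute the sum of squared errors for the fitted line, then for a few perturbed lines; every perturbation does worse. A sketch using the height/weight data (the exact minimum, 1271.8, differs slightly from the slides' 1271.91 because the slides round the coefficients):

```python
ht = [61, 62, 63, 65, 65, 68, 69, 70, 72, 75]
wt = [105, 120, 120, 160, 120, 145, 175, 160, 185, 210]

def sse(a, b):
    # Sum of squared errors for the line Y' = a + bX
    return sum((y - (a + b * x)) ** 2 for x, y in zip(ht, wt))

# Least-squares fit
n = len(ht)
mx, my = sum(ht) / n, sum(wt) / n
b = sum((x - mx) * (y - my) for x, y in zip(ht, wt)) / sum((x - mx) ** 2 for x in ht)
a = my - b * mx

best = sse(a, b)
for da, db in [(5, 0), (-5, 0), (0, 0.5), (0, -0.5), (3, -0.2)]:
    assert sse(a + da, b + db) > best  # any nearby line has a larger SSE
print(round(best, 1))  # 1271.8
```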
Least Squares (2)
Review
• What does it mean to choose a regression line to satisfy the loss function of least squares?
• What are predicted values and residuals?
Suppose r = .25, SD_X = 1, M_X = 10, SD_Y = 2, M_Y = 5. What is the regression equation (line)?
Partitioning the Sum of Squares
Y = a + bX + e
Y' = a + bX
Y = Y' + e
e = Y - Y'
Definitions
Y - Ȳ = (Y' - Ȳ) + (Y - Y') = y, deviation from mean
Σ(Y - Ȳ)² = Σ[(Y' - Ȳ) + (Y - Y')]²   Sum of squares
Σy² = Σ(Y' - Ȳ)² + Σ(Y - Y')²   (cross products drop out)
Sum of squared deviations from the mean
= Sum of squares due to regression (reg)
+ Sum of squared residuals (error)

Analog: SS_tot = SS_B + SS_W
Partitioning SS (2)
SSY=SSReg + SSRes Total SS is regression SS plus residual SS. Can also get proportions of each. Can get variance by dividing SS by N if you want. Proportion of total SS due to regression = proportion of total variance due to regression = R2 (R-square).
SS_Y/SS_Y = SS_Reg/SS_Y + SS_Res/SS_Y

1 = R² + (1 - R²)
Partitioning SS (3)    (Wt = Y, M = 150)

Wt (Y)  (Y-Ȳ)²  Y'      Y'-Ȳ    (Y'-Ȳ)²    Resid (Y-Y')  Resid²
105     2025    108.19  -41.81  1748.076    -3.19          10.1761
120      900    115.16  -34.84  1213.826     4.84          23.4256
120      900    122.13  -27.87   776.7369   -2.13           4.5369
160      100    136.06  -13.94   194.3236   23.94         573.1236
120      900    136.06  -13.94   194.3236  -16.06         257.9236
145       25    156.97    6.97    48.5809  -11.97         143.2809
175      625    163.94   13.94   194.3236   11.06         122.3236
160      100    170.91   20.91   437.2281  -10.91         119.0281
185     1225    184.84   34.84  1213.826     0.16           0.0256
210     3600    205.75   55.75  3108.063     4.25          18.0625
Sum:
1500    10400   1500.01    0.01 9129.307    -0.01        1271.907

Variance: V(Y) = 1155.56, V(Y') = 1014.37, V(resid) = 141.32
Partitioning SS (4)

          Total     Regress   Residual
SS        10400     9129.31   1271.91
Variance  1155.56   1014.37   141.32

Proportion of SS: 10400/10400 = 9129.31/10400 + 1271.91/10400, i.e., 1 = .88 + .12
Proportion of Variance: 1155.56/1155.56 = 1014.37/1155.56 + 141.32/1155.56, i.e., 1 = .88 + .12

R² = .88

Note Y' is a linear function of X, so r_XY' = 1 and r_YY' = r_XY = .94.
R² = r²_YY' = .94² = .88; the residuals are uncorrelated with Y' (r_Y'e = 0), and r²_Ye = 1 - R² = .12 (r_Ye = .35).
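The partition and R² can be reproduced from the raw data (exact arithmetic gives SS values a hair off the slides', which round Y' to two decimals, but the same proportions):

```python
from math import sqrt

ht = [61, 62, 63, 65, 65, 68, 69, 70, 72, 75]
wt = [105, 120, 120, 160, 120, 145, 175, 160, 185, 210]

n = len(ht)
mx, my = sum(ht) / n, sum(wt) / n
sxx = sum((x - mx) ** 2 for x in ht)
syy = sum((y - my) ** 2 for y in wt)
sxy = sum((x - mx) * (y - my) for x, y in zip(ht, wt))

ss_tot = syy              # total SS: 10400
ss_reg = sxy ** 2 / sxx   # SS due to regression
ss_res = ss_tot - ss_reg  # SS residual

r2 = ss_reg / ss_tot
r = sxy / sqrt(sxx * syy)

print(round(r2, 2))      # 0.88: proportion of SS due to regression
print(round(r ** 2, 2))  # 0.88: same as the squared X-Y correlation
print(round(ss_reg / ss_tot + ss_res / ss_tot, 6))  # 1.0: proportions sum to one
```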
Significance Testing
Testing for the SS due to regression = testing for the variance due to regression = testing the significance of R². All are the same.

H0: R²_population = 0

F = (SS_reg / df_reg) / (SS_res / df_res) = (SS_reg / k) / (SS_res / (N - k - 1))

k = number of IVs (here it's 1) and N is the sample size (# people). F with k and (N - k - 1) df.
F = (SS_reg / df_reg) / (SS_res / df_res) = (9129.31 / 1) / (1271.91 / (10 - 1 - 1)) = 57.42
Equivalent test using R-square instead of SS:

F = (R² / k) / ((1 - R²) / (N - k - 1))

F = (.88 / 1) / ((1 - .88) / (10 - 1 - 1)) = 58.67

Results will be the same within rounding error.
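The two F formulas are the same ratio in disguise: divide the numerator and denominator of the SS version by SS_tot and it becomes the R² version. With unrounded inputs they agree exactly; the 57.42 vs. 58.67 gap above comes from rounding R² to .88. A sketch using the exact SS from the data:

```python
ht = [61, 62, 63, 65, 65, 68, 69, 70, 72, 75]
wt = [105, 120, 120, 160, 120, 145, 175, 160, 185, 210]

n = len(ht)
mx, my = sum(ht) / n, sum(wt) / n
sxx = sum((x - mx) ** 2 for x in ht)
sxy = sum((x - mx) * (y - my) for x, y in zip(ht, wt))
ss_tot = sum((y - my) ** 2 for y in wt)
ss_reg = sxy ** 2 / sxx
ss_res = ss_tot - ss_reg

k = 1  # number of IVs
f_ss = (ss_reg / k) / (ss_res / (n - k - 1))  # F from sums of squares
r2 = ss_reg / ss_tot
f_r2 = (r2 / k) / ((1 - r2) / (n - k - 1))    # F from R-square

print(round(f_ss, 2), round(f_r2, 2))  # 57.42 57.42 -- identical
```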
Review
• What does it mean to test the significance of the regression sum of squares? R-square?
• What is R-square?
• Why does testing for the regression sum of squares turn out to have the same result as testing for R-square?