
Correlation/Regression - part 2

• Consider Example 2.12 in section 2.3. Look at the scatterplot… Example 2.13 shows that the prediction line is given by the linear equation (nea = "nonexercise activity")

predicted fat gain = 3.505 - 0.00344 × nea

• The intercept (3.505 kg) equals the predicted fat gain when the increase in nonexercise activity is 0, and the slope (-0.00344) equals the rate of change of fat gain per calorie increase in nea; i.e., the predicted fat gain decreases by about 0.00344 kg for each one-calorie increase in nea.

• So to get the predicted value of fat gain for an nea of say 400 calories, you can either estimate it graphically from the line (next page) or numerically by evaluating the equation at nea = 400;

pred. fat gain at nea of 400 = 3.505 - 0.00344 (400) = 2.129 kg

• But be careful about extrapolation! Predictions for nea values outside the range of the data used to fit the line are not trustworthy.

The graphical method: find nea=400 on the x-axis, draw a vertical line to intersect the regression line, then draw a horizontal line to intersect the y-axis - the place of intersection will be the predicted y for that value of x.
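The numerical method can be written out in a few lines. This is only a sketch: the two coefficients are the ones quoted from Example 2.13, and the function name `predicted_fat_gain` is made up for illustration.

```python
# Coefficients of the prediction line quoted in the notes (Example 2.13).
INTERCEPT_KG = 3.505          # predicted fat gain (kg) when the nea increase is 0
SLOPE_KG_PER_CAL = -0.00344   # change in predicted fat gain per calorie of nea


def predicted_fat_gain(nea_calories):
    """Evaluate the prediction line at a given nea increase (in calories)."""
    return INTERCEPT_KG + SLOPE_KG_PER_CAL * nea_calories


print(round(predicted_fat_gain(400), 3))  # 2.129
```

The printed value matches the hand calculation above. The function will happily evaluate the line at any nea, so it is up to you to stay within the range of the data.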

The least squares line makes the errors (or residuals) as small as possible by minimizing their sum of squares.

• The least squares process finds the values of b0 and b1 that minimize the sum of the squares of the errors to give y-hat = b0 + b1 x, where

b1 = r (sy/sx) and b0 = ybar - b1 xbar

• As we've noted before, use software to do these calculations for you - but notice a couple of things from these equations:

– b1 and r have the same sign (since sy and sx are > 0)

– the prediction line always passes through the point (xbar, ybar)
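To see the two formulas at work, here is a minimal sketch that computes r, sx, and sy from scratch and then builds b1 and b0 from them. The data values and the function name `least_squares_line` are made up for illustration; in practice you would let JMP or other software do this.

```python
from math import sqrt


def least_squares_line(xs, ys):
    """Return (b0, b1, r) using b1 = r * (sy / sx) and b0 = ybar - b1 * xbar."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    # Sample standard deviations (divide by n - 1).
    sx = sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    sy = sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))
    # Correlation coefficient.
    r = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)
    b1 = r * sy / sx
    b0 = ybar - b1 * xbar
    return b0, b1, r


# Made-up illustrative data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1, r = least_squares_line(xs, ys)
xbar = sum(xs) / len(xs)
ybar = sum(ys) / len(ys)

print(b0, b1, r)
# The line passes through (xbar, ybar): these two values agree.
print(b0 + b1 * xbar, ybar)
```

For this data the slope comes out at about 1.99 and the intercept at about 0.05. The slope and r are both positive, as the first observation above says they must be.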

• Besides having the same sign as the slope of the regression line, the correlation coefficient (r) has the property that its square, r^2, gives the proportion of the total variability in y explained by the regression line of y on x.
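One way to check the "proportion of variability explained" reading is to compute r^2 directly and compare it with 1 - SSE/SST, where SSE is the sum of squared residuals and SST is the total sum of squares of y about ybar. The sketch below uses made-up data, and the function name `r_squared_two_ways` is hypothetical.

```python
def r_squared_two_ways(xs, ys):
    """Return (r**2, 1 - SSE/SST) for the least-squares line of ys on xs."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)      # SST: total variability in y
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    r = sxy / (sxx * syy) ** 0.5
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    # SSE: variability in y left over after fitting the line.
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return r ** 2, 1 - sse / syy


# Made-up illustrative data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.5, 4.0, 7.5, 8.0, 11.0]

r_sq, explained = r_squared_two_ways(xs, ys)
print(r_sq, explained)  # the two numbers agree up to rounding error
```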

• Another important idea is that if you regress y on x (i.e., treat y as the response) you will get a different line than if you regress x on y (treat x as the response), even though the value of r is the same in both cases! See Figure 2.15 on the following slide, and read about this important set of data in Example 2.17 on page 116.

• Regress velocity on distance (solid line) and distance on velocity (dashed line) to get two distinct lines - however, r = .7842 in both cases…
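The same effect shows up with any data that are not perfectly linear. The sketch below uses made-up numbers (not the velocity and distance data of Example 2.17), and the helper name `slope_and_r` is hypothetical.

```python
from math import sqrt


def slope_and_r(xs, ys):
    """Least-squares slope for regressing ys on xs, plus the correlation r."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    return sxy / sxx, sxy / sqrt(sxx * syy)


# Made-up illustrative data.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.5, 3.0, 2.5, 5.0, 4.0, 6.5]

slope_y_on_x, r_yx = slope_and_r(x, y)   # y is the response
slope_x_on_y, r_xy = slope_and_r(y, x)   # x is the response

# Drawn on the same axes (y vertical), the x-on-y line has slope 1 / slope_x_on_y.
print(slope_y_on_x, 1 / slope_x_on_y)    # two different slopes
print(r_yx, r_xy)                        # the same correlation
```

The product of the two fitted slopes equals r^2, so the two lines coincide only when r = ±1.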

• Cautions about regression and correlation:

– Always look at the plot of the residuals (recall that for every observed point on the scatterplot, we have: residual at xi = observed yi - predicted yi). A plot of the residuals against the explanatory variable should show a "random" scatter around zero - see Fig. 2.20. There should be no pattern to the residuals. Go over Ex. 2.20, p. 128.
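Here is a small sketch of the residual calculation, using made-up data and a hypothetical helper `fit_line`. It lists each residual next to its x value, which is the information a residual plot displays.

```python
def fit_line(xs, ys):
    """Return (b0, b1) for the least-squares line of ys on xs."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    b1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
          / sum((x - xbar) ** 2 for x in xs))
    return ybar - b1 * xbar, b1


# Made-up illustrative data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.2, 3.8, 6.1, 8.3, 9.7, 12.2]

b0, b1 = fit_line(xs, ys)

# residual at xi = observed yi - predicted yi
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

for x, e in zip(xs, residuals):
    print(f"x = {x:4.1f}   residual = {e:+.3f}")

# Residuals from a least-squares line always sum to zero (up to rounding).
print(sum(residuals))
```

Because the residuals always sum to zero, what matters in the plot is whether they show a pattern (a curve, a fan shape) and not where they are centered.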

– Look out for outliers (in either the explanatory or response variable) and influential values (in the explanatory variable). Go over examples 2.21-2.22 (2.4, 2/5) carefully… note that #18 is influential and #15 is an outlier in the y-direction.

Note that outliers in the y-direction can have large residuals, while outliers in the x-direction (possible influential values) might not have large residuals.
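The contrast can be seen with a small experiment. The sketch below starts from five made-up points lying close to a line, then adds either an outlier in the y-direction or a point far out in the x-direction. None of these numbers come from Examples 2.21-2.22, and the helper names are hypothetical.

```python
def fit_line(xs, ys):
    """Return (b0, b1) for the least-squares line of ys on xs."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    b1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
          / sum((x - xbar) ** 2 for x in xs))
    return ybar - b1 * xbar, b1


def residuals(xs, ys):
    """Residuals (observed - predicted) from the least-squares line."""
    b0, b1 = fit_line(xs, ys)
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]


# Five made-up points close to a straight line.
base_x = [1.0, 2.0, 3.0, 4.0, 5.0]
base_y = [1.2, 1.9, 3.1, 3.8, 5.2]
_, slope_base = fit_line(base_x, base_y)

# Case 1: add an outlier in the y-direction (x is typical, y is far too high).
x1, y1 = base_x + [3.0], base_y + [8.0]
_, slope_y_outlier = fit_line(x1, y1)
res1 = residuals(x1, y1)

# Case 2: add a point far out in the x-direction that is off the trend.
x2, y2 = base_x + [15.0], base_y + [6.0]
_, slope_x_outlier = fit_line(x2, y2)
res2 = residuals(x2, y2)

print("slopes:", slope_base, slope_y_outlier, slope_x_outlier)
print("residual of added point, case 1:", res1[-1])
print("residual of added point, case 2:", res2[-1])
```

In case 1 the slope is unchanged and the added point has by far the largest residual. In case 2 the added point drags the slope from about 0.99 down to about 0.30, yet its own residual is smaller than those of some of the original points, so a residual plot alone would not flag it.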

• All HW for Chapter 2:

– section 2.1: #2.6-2.9, 2.11, 2.13-2.15, 2.18, 2.19, 2.21, 2.26

– section 2.2: #2.29-2.32, 2.35, 2.39, 2.41, 2.43, 2.46, 2.50, 2.51

– section 2.3: #2.57-2.58, 2.62, 2.64, 2.66, 2.68, 2.73, 2.74

– section 2.4: #2.85, 2.87, 2.89, 2.94, 2.96, 2.97, 2.101

– section 2.5: #2.111-2.113, 2.119, 2.121

– section 2.6: #2.122, 2.125, 2.127, 2.129, 2.130

– Chapter 2 Exercises: Do several of those on p. 161-169

• Transformations:

– Look at the dataset of 62 mammals' brain and body weights ("Beyond the Basics: Transformations" after section 2.3 in eBook)… copy link location from my website. Analyze it with JMP… what are the difficulties?

– Fix this by transforming the data with an appropriate mathematical function - easy to do in JMP: create a new column to contain the transformed data.

– Try the log-log transform and then re-analyze…
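The notes do the transformation in JMP; the same idea can be sketched in Python. The numbers below are made up to mimic the shape of such data (a few very large values, many small ones, roughly a power-law relationship). They are not the 62-mammal dataset, and the helper name `correlation` is hypothetical.

```python
from math import log10, sqrt


def correlation(xs, ys):
    """Pearson correlation coefficient r."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


# Made-up "body weight" values spanning several orders of magnitude.
body = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 5000.0]
# Made-up "brain weight": a power law (exponent 0.75) times an arbitrary scatter factor.
scatter = [1.5, 0.7, 1.2, 0.8, 1.4, 1.6, 0.6]
brain = [f * b ** 0.75 for f, b in zip(scatter, body)]

# "New columns" holding the transformed data, as you would create in JMP.
log_body = [log10(b) for b in body]
log_brain = [log10(w) for w in brain]

r_raw = correlation(body, brain)
r_log = correlation(log_body, log_brain)
print("r on the original scale:", r_raw)
print("r after log-log transform:", r_log)
```

On the original scale the two largest values dominate the fit and everything else is crowded near the origin. After taking logs of both variables the points spread out along a line and r is noticeably closer to 1.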

• Look ahead to two-way tables…

Two-way tables organize data about two categorical variables (factors) obtained from a two-way design. (There are now two ways to group the data.) See below from US Census data…

First factor: age (group by age)

Second factor: education (record education)

Marginal distributions

We can look at each categorical variable separately in a two-way table by studying the row totals and the column totals. They represent the marginal distributions, expressed in counts or percentages. (They are written as if in a margin.)
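As a sketch of the row-total and column-total calculation, the block below uses a small table of hypothetical counts. These are not the census figures; the category labels and variable names are made up for illustration.

```python
# Hypothetical counts (NOT the census figures): rows = age group, columns = education.
counts = {
    "25-34": {"no HS diploma": 5, "HS diploma": 15, "college degree": 20},
    "35-54": {"no HS diploma": 10, "HS diploma": 30, "college degree": 20},
    "55+": {"no HS diploma": 15, "HS diploma": 25, "college degree": 10},
}

# Row totals: one total per age group.
row_totals = {age: sum(row.values()) for age, row in counts.items()}

# Column totals: one total per education level.
col_totals = {}
for row in counts.values():
    for edu, n in row.items():
        col_totals[edu] = col_totals.get(edu, 0) + n

grand_total = sum(row_totals.values())

# Marginal distributions as percentages of the grand total.
age_marginal = {age: 100 * t / grand_total for age, t in row_totals.items()}
edu_marginal = {edu: 100 * t / grand_total for edu, t in col_totals.items()}

print(row_totals, col_totals, grand_total)
print(age_marginal)
print(edu_marginal)
```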

2000 U.S. census

The marginal distributions summarize each categorical variable independently. But the two-way table actually describes the relationship between both categorical variables.

The cells of a two-way table represent the intersection of a given level of one categorical factor with a given level of the other categorical factor. Get the joint distribution by computing percentages of each cell with respect to the grand total of the table.
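The joint distribution is a one-line calculation once the grand total is known. The counts below are hypothetical (not the census figures), and the names are made up for illustration.

```python
# Hypothetical counts (NOT the census figures): rows = age group, columns = education.
counts = {
    "25-34": {"no HS diploma": 5, "HS diploma": 15, "college degree": 20},
    "35-54": {"no HS diploma": 10, "HS diploma": 30, "college degree": 20},
    "55+": {"no HS diploma": 15, "HS diploma": 25, "college degree": 10},
}

grand_total = sum(n for row in counts.values() for n in row.values())

# Joint distribution: each cell as a proportion of the grand total.
joint = {
    age: {edu: n / grand_total for edu, n in row.items()}
    for age, row in counts.items()
}

for age, row in joint.items():
    print(age, {edu: round(100 * p, 1) for edu, p in row.items()})
```

The proportions in the joint distribution add up to 1 over the whole table, not within each row or column.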

Because counts can be misleading (for instance, one level of one factor might be much less represented than the other levels), we prefer to calculate percents or proportions for the cells within each row or within each column. These make up what are called the conditional distributions. We can compute them using either the row totals or the column totals, and the two choices give different conditional distributions… Try it with the Education data…
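Here is a sketch of both sets of conditional distributions, again on hypothetical counts and not the Education data itself. The names `given_age` and `given_edu` are made up for illustration.

```python
# Hypothetical counts (NOT the census figures): rows = age group, columns = education.
counts = {
    "25-34": {"no HS diploma": 5, "HS diploma": 15, "college degree": 20},
    "35-54": {"no HS diploma": 10, "HS diploma": 30, "college degree": 20},
    "55+": {"no HS diploma": 15, "HS diploma": 25, "college degree": 10},
}

row_totals = {age: sum(row.values()) for age, row in counts.items()}
col_totals = {}
for row in counts.values():
    for edu, n in row.items():
        col_totals[edu] = col_totals.get(edu, 0) + n

# Conditional distribution of education given age group (divide by row totals).
given_age = {
    age: {edu: n / row_totals[age] for edu, n in row.items()}
    for age, row in counts.items()
}

# Conditional distribution of age group given education (divide by column totals).
given_edu = {
    edu: {age: counts[age][edu] / col_totals[edu] for age in counts}
    for edu in col_totals
}

print(given_age["25-34"]["college degree"])   # share of 25-34 year olds with a degree
print(given_edu["college degree"]["25-34"])   # share of degree holders aged 25-34
```

The two printed values differ (0.5 versus 0.4 here) even though they come from the same cell, which is why you need to say which total you are conditioning on.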

• Here's a summary:

– Two-way tables consist of counts obtained by cross-tabulating two categorical variables - the goal is to understand the relationship or association between these two variables.

– The first method of looking for the relationship is to compute percentages - there are three types: those based on the grand total in the table (the joint distribution of the two variables); those based on the column totals and those based on the row totals (the conditional distributions)

• To look for association, consider all of the percentages above, but usually compute percentages with respect to the explanatory variable's totals and compare them across the levels of the explanatory variable.
