correlation/regression - part 2 consider example 2.12 in section 2.3. look at the scatterplot…...
TRANSCRIPT
Correlation/Regression - part 2• Consider Example 2.12 in section 2.3. Look at the
scatterplot… Example 2.13 shows that the prediction line is given by the linear equation (nea="nonexercise activity")
predicted fat gain = 3.505 - 0.00344 x nea.• The intercept (3.505 kg) equals the fat gain when non-
exercise activity increase = 0 and the slope (-0.00344) equals the rate of change of fat gain per calorie increase in nea; i.e., the predicted fat gain decreases by about 0.00344 kg for each calorie increase in nea.
• So to get the predicted value of fat gain for an nea of say 400 calories, you can either estimate it graphically from the line (next page) or numerically by evaluating the equation at nea = 400;
pred. fat gain at nea of 400 = 3.505 - 0.00344 (400) = 2.129 kg
• But be careful about extrapolation!
The graphical method: find nea=400 on the x-axis, draw a vertical line to intersect the regression line, then draw a horizontal line to intersect the y-axis - the place of intersection will be the predicted y for that value of x.
The least squares line makes the errors (or residuals) as small as possible by minimizing their sum of squares.
• The least squares process finds the values of b0 and b1 that minimize the sums of the squares of the errors to give y-hat = b0 + b1 x , where
b1 = r (sy/sx) and b0 = ybar - b1 xbar
• As we've noted before, use software to do these calculations for you - but notice a couple of things from these equations:– b1 and r have the same sign (since sy and sx are >0)– the prediction line always passes through the point
(xbar,ybar)
• Besides the correlation coefficient (r) having the same sign as the slope of the regression line it also has the property that its square r2 gives the proportion of total variability in y explained by the regression line of y on x.
• Another important idea to mention is that if you regress y on x (i.e., treat y as the response) you will get a different line than if you regress x on y (treat x as the response), even though the value of r will be the same in both cases! See the Figure 2.15 on the following slide - read about this important set of data in Example 2.17 on page 116.
• Regress velocity on distance (solid line) and distance on velocity (dashed line) to get two distinct lines - however, r = .7842 in both cases…
• Cautions about regression and correlation:– always look at the plot of the residuals (recall that for
every observed point on the scatterplot, we have:
residual at xi = observed yi - predicted yi)
A plot of the residuals against the explanatory variable should show a "random" scatter around zero - see Fig.2.20. There should
be no pattern
to the resids. Go over Ex.
2.20, p.128
– Look out for outliers (in either the explanatory or response variable) and influential values (in the explanatory variable). Go over examples 2.21-2.22 (2.4, 2/5) carefully…note #18 is influential and #15 is an outlier in the y-direction.
Note that outliers in the y-direction can have large residuals, while outliers in the x-direction (possible influential values) might not have large residuals.
• All HW for Chapter 2:– section 2.1:
#2.6-2.9,2.11,2.13-2.15,2.18,2.19,2.21,2.26– section 2.2:
#2.29-2.32,2.35,2.39,2.41,2.43,2.46,2.50,2.51– section 2.3: #2.57-2.58,2.62,2.64,2.66,2.68,2.73,2.74– section 2.4: #2.85,2.87,2.89,2.94,2.96,2.97,2.101– section 2.5: #2.111-2.113,2.119, 2.121– section 2.6: #2.122, 2.125, 2.127, 2.129, 2.130– Chapter 2 Exercises: Do several of those on p. 161-
169
• Transformations:– Look at the dataset of 62 mammals' brain and body
weights ("Beyond the Basics:Transformations" after section 2.3 in eBook)…. copy link location from my website. Analyze it with JMP… what are the difficulties?
– Fix this by transforming the data with an appropriate mathematical function - easy to do in JMP: create a new column to contain the transformed data.
– Try the log-log transform and then re-analyze…
• Look ahead to two-way tables…
Two-way tables organize data about two
categorical variables (factors) obtained from a two-
way design. (There are now two ways to group the
data). See below from US Census data…
First factor: ageGroup by age
Second factor: education
Record education
Marginal distributionsWe can look at each categorical variable separately in a
two-way table by studying the row totals and the column
totals. They represent the marginal distributions,
expressed in counts or percentages (They are written as if
in a margin.)
2000 U.S. census
The marginal distributions summarize each categorical variable
independently. But the two-way table actually describes the
relationship between both categorical variables.
The cells of a two-way table represent the intersection of a given level of
one categorical factor with a given level of the other categorical factor.
Get the joint distribution by computing percentages of each cell with
respect to the grand total of the table.
Because counts can be misleading (for instance, one level of one factor
might be much less represented than the other levels), we prefer to
calculate percents or proportions for the corresponding cells. These
make up what are called the conditional distributions. We can
compute these using either the row or column totals and get different
conditionals… Try it with the Education data…
• Here's a summary:– Two-way tables consist of counts obtained by crosstabulating
two categorical variables - the goal is to understand the relationship or association between these two variables.
– The first method of looking for the relationship is to compute percentages - there are three types: those based on the grand total in the table (the joint distribution of the two variables); those based on the column totals and those based on the row totals (the conditional distributions)
• To look for association, consider all the percentages above but usually percent with respect to the explanatory variable's totals and compare across levels of the explanatory variable.