Chapter 3 – Examining Relationships

Scatterplots and Correlation - 3.1

Scatterplots: Show the relationship between two quantitative variables.

Explanatory Variable: Variable on the x-axis. Influences the response.

Response Variable: Variable on the y-axis. Responds to changes in the explanatory variable.

Looking at Scatterplots:

• Direction: Positive: as x increases, y increases. Negative: as x increases, y decreases.

• Form: Is there a linear relationship between the two variables?

• Strength: Do the points follow a single stream that is tight to the line or is there considerable spread (or variability) around the line?

(DFS!) Describe For Scatter!

Can the NOAA predict where a hurricane will go?

• The example in the text shows a negative association between central pressure and maximum wind speed.

• As the central pressure increases, the maximum wind speed decreases.

Calculator Tip: Diagnostics On!

Catalog – Alpha “D” – Diagnostics On - Enter

Calculator Tip: Scatterplots

L1: Explanatory Variable

L2: Response Variable

Use statplot to graph

Scientists are interested in seeing if global temperature has been increasing. They measured the average global temperature per year (in Celsius). What graph should they make?

Histogram of Global Temp.

What does the scatterplot tell us that the histogram didn’t?

Are female Oscar winners getting older?

Example #1: Suppose you were to collect data for each pair of variables below. Which variable is the explanatory and which is the response? Determine the likely direction and strength of the relationship.

1. T-shirts at a store: Price of each, Number Sold

Explanatory: price of shirt (x-axis, $5 to $50). Response: number sold.

Direction: negative. Strength: strong.

2. Drivers: Reaction Time, Blood Alcohol Level

Explanatory: blood alcohol level (x-axis, .01 to .5). Response: reaction time.

Direction: positive. Strength: strong.

3. Cars: Age of Owner, Weight of the Car

Makes no sense!!!

Example #2: “I have never found a quantifiable predictor in 25 years of grading that was anywhere as strong as this one. If you just graded them based on length without ever reading them, you’d be right over 90 percent of the time.” The table below shows the data set that Dr. Perlman used to draw his conclusions.

Carry out your own analysis of the data. Then write a few sentences in response to each of Dr. Perlman’s conclusions.

Essay score and length for a sample of SAT essays

Words 460 422 402 365 357 278 236 201 168 156 133

Score 6 6 5 5 6 5 4 4 4 3 2

Words 114 108 100 403 401 388 320 258 236 189 128

Score 2 1 1 5 6 6 5 4 4 3 2

Words 67 697 387 355 337 325 272 150 135 73

Score 1 6 6 5 5 4 4 2 3 1

D: positive   F: linear, one unusual point   S: strong

Example #3: Regraph #2 with score as the dependent variable now. Do you see any differences in the graph?

**You may want to store these lists for tomorrow…

Correlation:

Measures the direction and strength of the linear relationship (D and S only)

Denoted “r”

Both variables must be quantitative

Attributes of the Correlation

1. The correlation coefficient is a unit-less measurement, denoted with the letter r, and has values between -1 and 1.

2. When r = 1 all the data points form a perfect straight line relationship with a positive slope.

3. When r = -1 all the data points form a perfect straight line relationship with a negative slope.

4. Correlation treats x and y symmetrically: – The correlation of x with y is the same as the correlation of y with x.

5. Correlation is not affected by changes in the center or scale of either variable.

– Correlation depends only on the z-scores, and they are unaffected by changes in center or scale.

6. Values of r close to 0 mean that the linear relationship is weak. There may be a general linear trend, but there is a lot of variability around that trend.

7. When r = 0 there is no linear relationship between the two variables. In other words, the best fitting line has a slope of zero.

8. Outliers have a large influence on the correlation coefficient. The correlation is NOT resistant to outliers.

9. Correlation does not describe curved relationships! (ONLY LINEAR)
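A quick numerical check can make attributes 4, 5, and 8 concrete. The sketch below is only an illustration: the height and weight values are made up (they are not data from these notes), and `pearson_r` is a hand-written helper, not a calculator or library function.

```python
import math

def pearson_r(xs, ys):
    """Correlation r as the sum of z-score products divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data: heights in inches, weights in pounds.
heights_in = [62, 64, 66, 68, 70, 72, 74]
weights_lb = [115, 130, 128, 150, 155, 172, 180]

r_original = pearson_r(heights_in, weights_lb)

# Attribute 4: swapping x and y gives the same r.
r_swapped = pearson_r(weights_lb, heights_in)

# Attribute 5: changing units (inches -> cm, pounds -> kg) leaves r unchanged.
heights_cm = [h * 2.54 for h in heights_in]
weights_kg = [w * 0.45359237 for w in weights_lb]
r_metric = pearson_r(heights_cm, weights_kg)

# Attribute 8: one unusual point (short but very heavy) can move r a lot.
r_with_outlier = pearson_r(heights_in + [60], weights_lb + [250])

print(round(r_original, 4), round(r_swapped, 4), round(r_metric, 4))
print(round(r_with_outlier, 4))
```

The first three printed values agree, while the fourth is noticeably smaller, which is what "not resistant to outliers" means in practice.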

Guidelines: How strong is the linear relationship?

0 < r < 0.3 = weak positive            -0.3 < r < 0 = weak negative
0.4 < r < 0.7 = moderate positive      -0.7 < r < -0.4 = moderate negative
0.8 < r < 1 = strong positive          -1 < r < -0.8 = strong negative

Data collected from students in Statistics classes included their heights (in inches) and weights (in pounds):

• If we had to put a number on the strength, we would not want it to depend on the units we used.

• A scatterplot of heights (in centimeters) and weights (in kilograms) doesn’t change the shape of the pattern:

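Types of Correlation:

[Figure: six scatterplots showing r = 0, r = -0.3, r = 0.5, r = -0.7, r = 0.9, and r = -0.99]

Example #4

[Figure: five scatterplots to match with correlation values. Answers: r = -0.7, r = 0.5, r = -0.99, r = -0.3, r = 0.9]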

• Don’t assume the relationship is linear just because the correlation coefficient is high.

Here the correlation is 0.979, but the relationship is actually bent.

Example #5: What is wrong with the following statements?

a. There is a strong correlation between the gender of American workers and their income.

Gender is categorical

b. We found a high correlation (r = 1.09) between students’ rating of faculty teaching and ratings made by other faculty members.

r can’t be bigger than 1

c. We found a very weak correlation (r = -0.95) which suggests little relationship between income and hours spent at casinos.

r = -0.95 is a strong negative relationship

d. We found a very weak correlation (r = 0.01) which suggests little relationship between age and death rate.

Should be a very strong relationship!

HOW TO CALCULATE THE CORRELATION COEFFICIENT

Remember how to calculate the z-score? We used this calculation to determine how many standard deviations an observation was from the mean.

RECALL:

$z\text{-score} = z = \frac{x - \bar{x}}{s}$

In this case, we were only concerned with one variable.

Now, we are considering two variables and each must be standardized.

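Notation:

$s_x$ = sample standard deviation of the x's

$x_i$ = the i-th observation of the x's

$\bar{x}$ = sample mean of the x's

$r$ = correlation

$s_y$ = sample standard deviation of the y's

$y_i$ = the i-th observation of the y's

$\bar{y}$ = sample mean of the y's

$n$ = total number of observations

FORMULA:

$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$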

Example #6:

Speed (x)   20   30   40

MPG (y)     25   35   45

Step #1: Find the following summary statistics:

n = ___

SPEED: $\bar{x}$ = _____   $s_x$ = _____

MPG: $\bar{y}$ = _____   $s_y$ = _____

Step #2: Calculate z-scores

SPEED Z(x1) = Z(x2) = Z(x3) =

MPG Z(y1) = Z(y2) = Z(y3) =

PRODUCT Z(x1)Z(y1) = Z(x2)Z(y2) = Z(x3)Z(y3) =

Step #3: Calculate the Correlation
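The three steps can be checked by hand or with a few lines of code. The sketch below follows the same order (summary statistics, z-scores, then r) for the speed and MPG values in this example; the variable names are chosen here for illustration.

```python
import math

speed = [20, 30, 40]   # x
mpg = [25, 35, 45]     # y
n = len(speed)

# Step 1: summary statistics (sample standard deviation divides by n - 1).
x_bar = sum(speed) / n
y_bar = sum(mpg) / n
s_x = math.sqrt(sum((x - x_bar) ** 2 for x in speed) / (n - 1))
s_y = math.sqrt(sum((y - y_bar) ** 2 for y in mpg) / (n - 1))

# Step 2: z-scores and their products.
z_x = [(x - x_bar) / s_x for x in speed]
z_y = [(y - y_bar) / s_y for y in mpg]
products = [zx * zy for zx, zy in zip(z_x, z_y)]

# Step 3: r is the sum of the products divided by n - 1.
r = sum(products) / (n - 1)

print(x_bar, s_x, y_bar, s_y)   # 30.0 10.0 35.0 10.0
print(z_x)                      # [-1.0, 0.0, 1.0]
print(z_y)                      # [-1.0, 0.0, 1.0]
print(r)                        # 1.0
```

Every MPG value is exactly 5 more than the matching speed, so the points lie on a perfect line with positive slope and r comes out to 1.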

Calculator Tip: Correlation

L1: Explanatory Variable

L2: Response Variable

Stat-calc-LinReg(a+bx), L1, L2

(make sure your diagnostics are on!!!)

Example #7: Use your calculator to find the correlation for #2. Comment on what it means.

Words 460 422 402 365 357 278 236 201 168 156 133

Score 6 6 5 5 6 5 4 4 4 3 2

Words 114 108 100 403 401 388 320 258 236 189 128

Score 2 1 1 5 6 6 5 4 4 3 2

Words 67 697 387 355 337 325 272 150 135 73

Score 1 6 6 5 5 4 4 2 3 1

r = 0.888 D: positive S: strong
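For readers without the calculator at hand, the same correlation can be reproduced in code. This is a sketch, not the calculator's method: `pearson_r` is a helper written here, using the sums-of-squares form of the formula, which is algebraically the same as averaging z-score products.

```python
import math

# Essay length (words) and score, copied from the table above.
words = [460, 422, 402, 365, 357, 278, 236, 201, 168, 156, 133,
         114, 108, 100, 403, 401, 388, 320, 258, 236, 189, 128,
         67, 697, 387, 355, 337, 325, 272, 150, 135, 73]
score = [6, 6, 5, 5, 6, 5, 4, 4, 4, 3, 2,
         2, 1, 1, 5, 6, 6, 5, 4, 4, 3, 2,
         1, 6, 6, 5, 5, 4, 4, 2, 3, 1]

def pearson_r(xs, ys):
    """Correlation from sums of squared deviations and cross products."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(words, score)
print(round(r, 3))  # 0.888
```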

3.2 – Least-Squares Regression

Regression line: straight line that describes the linear relationship between an explanatory variable and a response variable.

LEAST SQUARES REGRESSION LINE:

• This is the best-fitting line to the data.

• The goal is to minimize the (vertical) distances of your observations (data) from your line.

• Again, we must square the distances (like the calculation of the variance) because some data points lie above the line (positive distance) and some lie below it (negative distance), so the raw distances would cancel each other out. To compensate, they are squared.

We can use this line to predict a response, y, from a given explanatory variable, x.

Remember graphing??

Slope-Intercept formula for a line:

y = mx + b where m = slope

and b = y-intercept

Do you remember the SLOPE?

In statistics, we write it

$\hat{y} = a + bx$

1. Slope: $b = r \frac{s_y}{s_x}$

Calculate this first!

2. Y-intercept: $a = \bar{y} - b\bar{x}$

The same line is also written $\hat{y} = b_0 + b_1 x$

Facts about Least Squares Regression:

1. The distinction between explanatory and response variables is essential (which variable is used to predict which?).

2. It always passes through the point $(\bar{x}, \bar{y})$.

3. Correlation ‘r’ describes the direction and strength of the straight-line relationship, but tells us nothing more about the slope than whether it is positive, negative, or zero.

Extrapolation: Predicting outside the range of the x values

• Here is a timeplot of the Energy Information Administration (EIA) predictions and the actual price of a barrel of oil. How did forecasters do?

Example #8: Wildlife researchers monitor many wildlife populations by taking aerial photographs in order to estimate the weights of alligators. Here is the regression line of the weights of adult alligators (in pounds) and their lengths (in inches) based on the data collected from captured alligators.

Predicted Weight = –393 + 5.9(length)

a. What is the slope of the line? What does it mean?

Slope = 5.9

For each additional inch of length, the predicted weight increases by 5.9 pounds.

b. What is the y-intercept of the line? What does it mean?

y-intercept = -393

If an alligator is 0 inches long, then it is predicted to weigh -393 lbs. This makes no sense!!!

c. Describe the relationship between weight and length of alligators.

As the length increases, the weight increases.

d. What is the predicted weight for an alligator 90 inches long?

Predicted Weight = -393 + 5.9(90)

= -393 + 531

= 138 lbs

Slope formula:

$\hat{y} = b_0 + b_1 x$, with $b_1 = r \frac{s_y}{s_x}$

Find slope first!

Our slope is always in units of y per unit of x

Y-intercept formula:

$\hat{y} = b_0 + b_1 x$, with $b_0 = \bar{y} - b_1 \bar{x}$

Our intercept is always in units of y

Fat Versus Protein

• The regression line for the Burger King data fits the data well:
– The equation is: predicted fat = 6.8 + 0.97(protein)

The predicted fat content for a BK Broiler chicken sandwich (with 30 g of protein) is 6.8 + 0.97(30) = 35.9 grams of fat.

Example #9: Is there a relationship between wine consumption (in liters) and yearly deaths from heart disease (deaths per 100,000)? Here are the summary statistics:

Mean wine consumption: 3.026           SD of wine consumption: 2.510
Mean deaths from heart disease: 191.053   SD of heart disease deaths: 68.396

Correlation coefficient between wine consumption and yearly deaths from heart disease = -0.843

a. Interpret the value of the correlation coefficient in the context of the problem.

There is a strong, negative linear association: as wine consumption increases, yearly deaths from heart disease tend to decrease.

b. Calculate the least-squares regression line predicting death rate from wine consumption.

1. Slope: $b = r \frac{s_y}{s_x}$ = -0.843(68.396/2.510) = -22.9712

2. Y-intercept: $a = \bar{y} - b\bar{x}$ = 191.053 – (-22.9712 × 3.026) = 260.5639

$\hat{y}$ = 260.5639 – 22.9712x

c. Use your line to predict death rate for an average adult who consumes 4 liters of wine.

$\hat{y}$ = 260.5639 – 22.9712x

$\hat{y}$ = 260.5639 – 22.9712(4)

$\hat{y}$ = 168.679 deaths per 100,000

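Example #10: Consider n pairs of numbers. Suppose $\bar{x} = 4$, $s_x = 3$, $\bar{y} = 2$, and $s_y = 5$.

Of the following, which could be the least squares regression line?
(A) y = 2 + x
(B) y = -6 + 2x
(C) y = -10 + 3x
(D) y = 5/3 – x
(E) y = 6 – x

Slope: $b = r \frac{s_y}{s_x} = r \cdot \frac{5}{3}$

r can be between -1 and 1, so the slope is between -5/3 and 5/3. This rules out (B) and (C).

Passes through $(\bar{x}, \bar{y}) = (4, 2)$:

(A) 2 = 2 + 4? No, 2 ≠ 6

(D) 2 = 5/3 - 4? No, 2 ≠ -2.33

(E) 2 = 6 - 4? Yes

The answer is (E).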
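The two checks used in this example (the slope cannot exceed s_y/s_x in size, and the line must pass through the point of means) can be applied mechanically to all five choices. The sketch below reads the given summary values as x̄ = 4, s_x = 3, ȳ = 2, s_y = 5; the dictionary layout is just one convenient way to hold the answer choices.

```python
import math

x_bar, s_x = 4, 3
y_bar, s_y = 2, 5

# Candidate lines as (intercept, slope) pairs, copied from the answer choices.
candidates = {
    "A": (2, 1),
    "B": (-6, 2),
    "C": (-10, 3),
    "D": (5 / 3, -1),
    "E": (6, -1),
}

# |b| = |r| * s_y / s_x and |r| <= 1, so the slope cannot exceed s_y / s_x in size.
max_slope = s_y / s_x

possible = []
for label, (a, b) in candidates.items():
    slope_ok = abs(b) <= max_slope
    through_means = math.isclose(a + b * x_bar, y_bar)
    if slope_ok and through_means:
        possible.append(label)

print(possible)  # ['E']
```

Lines (B) and (C) do pass through (4, 2) but fail the slope check, which is why both conditions are needed.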

Calculator Tip: LSRL

L1: Explanatory Variable

L2: Response Variable

Stat-calc-LinReg(a+bx), L1, L2, vars/y-vars/Function/Y1

Calculator Tip: Tables

2nd – window, then 2nd - graph

Example #11: It's easy to measure the circumference of a tree's trunk, but not so easy to measure its height. Foresters need to develop a model for ponderosa pines that they use to predict the tree's height (in feet) from the circumference of its trunk (in inches):

Trunk Diameter

8 9 7 6 13 7 11 12

Tree Height

35 49 27 33 60 21 45 51

a. Make a scatterplot of the data and find the LSRL. Define any variables used in this equation.

$\hat{y}$ = -1.31467 + 4.54133x

where x = trunk diameter (in inches) and $\hat{y}$ = predicted tree height (in feet)

b. How strong of an association is there?

Strong, positive correlation, r = 0.88

c. They need to cut a tree down that is 10 inches in diameter. What is the predicted height of the tree?

$\hat{y}$ = -1.31467 + 4.54133(10) = 44.10 ft

d. Oops! When they cut it down, it was actually 50 ft tall. How much were they off?

Residual = 50 – 44.10 = 5.9 ft. The tree was 5.9 ft taller than predicted.
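The calculator results for this example can be reproduced directly from the eight data pairs. The sketch below computes the least-squares slope and intercept from sums of deviations, then the prediction and the prediction error for the 10-inch tree; variable names are chosen here for illustration.

```python
trunk = [8, 9, 7, 6, 13, 7, 11, 12]          # x, inches
height = [35, 49, 27, 33, 60, 21, 45, 51]    # y, feet

n = len(trunk)
x_bar = sum(trunk) / n
y_bar = sum(height) / n

# Least-squares slope and intercept.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(trunk, height))
sxx = sum((x - x_bar) ** 2 for x in trunk)
b = sxy / sxx
a = y_bar - b * x_bar

predicted = a + b * 10        # tree with a 10-inch trunk
residual = 50 - predicted     # observed minus predicted

print(round(a, 5), round(b, 5))   # -1.31467 4.54133
print(round(predicted, 2))        # 44.1
print(round(residual, 2))         # 5.9
```

The positive difference means the actual tree was taller than the line predicted.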

Residual: How close is the data to the line?

Residual = observed y – predicted $\hat{y}$

• The linear model assumes that the relationship between the two variables is a perfect straight line. The residuals are the part of the data that hasn’t been modeled.

Data = Model + Residual

or (equivalently)

Residual = Data – Model

• A negative residual means the predicted value’s too big (an overestimate).

• A positive residual means the predicted value’s too small (an underestimate).

• In the figure, the estimated fat of the BK Broiler chicken sandwich is 36 g, while the true value of fat is 25 g, so the residual is –11 g of fat.

• Some residuals are positive, others are negative, and, on average, they cancel each other out.

• Similar to what we did with deviations, we square the residuals and add the squares.

• The smaller the sum, the better the fit.

• The line of best fit is the line for which the sum of the squared residuals is smallest, the least squares line.
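The "smallest sum of squared residuals" idea can be seen numerically. The sketch below reuses the tree trunk and height values from Example #11 (the Burger King data are not reproduced in these notes) and compares the least-squares line with two other lines picked arbitrarily for illustration.

```python
trunk = [8, 9, 7, 6, 13, 7, 11, 12]
height = [35, 49, 27, 33, 60, 21, 45, 51]

def sum_squared_residuals(a, b, xs, ys):
    """Sum of (observed - predicted)^2 for the line y-hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

a, b = least_squares(trunk, height)
residuals = [y - (a + b * x) for x, y in zip(trunk, height)]

best = sum_squared_residuals(a, b, trunk, height)
# Two other lines, chosen arbitrarily for comparison.
steeper = sum_squared_residuals(a, b + 0.5, trunk, height)
shifted = sum_squared_residuals(a + 2, b, trunk, height)

print(abs(sum(residuals)) < 1e-9)   # True: the residuals cancel out
print(best < steeper, best < shifted)
```

Any change to the slope or intercept makes the sum of squares larger, which is what makes the least-squares line the line of best fit.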

Residual Plot: A plot that shows the residuals for all the data. A good line has no pattern in the residual plot.

Calculator Tip: Residual Plot

1. Calculate the LSRL

2. Graph L1 and RESID (in list)

Example of linear residual plots

Example of curved residual plots

Not a linear model, curved

Example of fanning residual plots

Less accurate for larger x values (fanning)

Remember BK?

Example #12: Graph the residual plot of #2 and comment on what the graph tells you.

Slight curve, might not be a linear model, one unusual point

Reading Computer Output:

Predictor    Coef        StDev   T   P
Constant     (y-int)
x-variable   (slope)

S =      R-Sq =      R-Sq(adj) =

Example #13: The number of students taking AP Statistics at a high school during the years of 2000-2007 is fitted with a least squares regression line. The graph of the residuals and some computer output is as follows.

How many students took AP Statistics in the year 2003?

Dependent variable is: Students
Variable   Coeff     s.e.     t      p
Constant   11        6.299    1.75   0.1313
Years      13.9286   1.0506   9.25   0.0001
s = 9.758   R-sq = 93.4%   R-sq(adj) = 92.4%

Predicted # Students = 11 + 13.9286(Years), where Years counts years since 2000

Predicted # Students = 11 + 13.9286(3)

Predicted # Students = 11 + 41.7858

Predicted # Students = 52.7858

Residual = actual – predicted (the residual plot shows a residual of 5 for 2003)

5 = actual – 52.7858, so actual = 57.7858

About 58 students took AP Stats in 2003

Example #14: An important factor in the amount of gasoline a car uses is the size of the engine. Called “displacement”, engine size measures the volume of the cylinders in cubic inches. The regression analysis is shown.

Dependent variable is: MPG
89 total cases of which 0 are missing
R-squared = 60.9%   R-squared (adjusted) = 60%
s = 3.056 with 89 – 2 = 87 degrees of freedom
Variable        Coefficient   s.e. of Coeff   t-ratio   prob
Constant        34.9799       1.231           28.4      0.0001
Eng. Displcmt   -0.066196     0.0077          -8.64     0.0001

A car you are thinking of buying is available in two different size engines, 190 cubic inches or 240 cubic inches. How much difference might this make in your gas mileage?

240 – 190 = 50

50(-0.066196) = -3.3098

The larger engine is predicted to get about 3 fewer miles per gallon

Standard Deviation of the residuals:

Used to measure the prediction error of the line

$s = \sqrt{\frac{\sum \text{residuals}^2}{n - 2}}$

The Residual Standard Deviation

• The standard deviation of the residuals measures how much the points spread around the regression line.

• Check to make sure the residual plot has about the same amount of scatter throughout.

• We don’t need to subtract the mean because the mean of the residuals = 0

• Make a histogram or normal probability plot of the residuals. It should look unimodal and roughly symmetric.

• Then we can apply the 68-95-99.7 Rule to see how well the regression model describes the data.

• The variation in the residuals is the key to assessing how well the model fits.

• In the BK menu items example, total fat has a standard deviation of 16.4 grams. The standard deviation of the residuals is 9.2 grams.
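As a worked illustration of the residual standard deviation, the sketch below uses the tree data from Example #11, since the Burger King data are not listed in these notes. It assumes the usual textbook convention of dividing the sum of squared residuals by n – 2, and compares the result with the standard deviation of the heights themselves.

```python
import math

trunk = [8, 9, 7, 6, 13, 7, 11, 12]
height = [35, 49, 27, 33, 60, 21, 45, 51]
n = len(trunk)

# Least-squares line y-hat = a + b*x.
x_bar, y_bar = sum(trunk) / n, sum(height) / n
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(trunk, height))
     / sum((x - x_bar) ** 2 for x in trunk))
a = y_bar - b * x_bar

residuals = [y - (a + b * x) for x, y in zip(trunk, height)]

# No mean is subtracted because the residuals already average to 0.
# Assumption: divide by n - 2, the usual convention for regression.
s_e = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# Standard deviation of the heights themselves, for comparison.
s_y = math.sqrt(sum((y - y_bar) ** 2 for y in height) / (n - 1))

print(round(s_e, 2), round(s_y, 2))  # 6.63 13.26
```

The spread around the line is about half the spread of the heights alone, the same kind of comparison as 9.2 grams against 16.4 grams in the BK example.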

http://bcs.whfreeman.com/tps3e/

Two-variable Statistical Calculator

Exercise

3.2 & 3.3 – Coefficient of Determination, Lurking Variables

Coefficient of Determination: (r²)

How much of the variation in y is explained by x

Assessing the Predictive Power of the Equation:

1. Coefficient of Determination: r² = the correlation coefficient, squared

2. It is the fraction (or percent) of the variation in the values of y that is explained by the least-squares regression of y on x.

3. The closer r² is to 1, the better the regression line describes the connection between x and y – in particular, predictions made with the equation will be more accurate.

Example #15: The correlation between alcohol consumption and yearly deaths from heart disease was -0.843. What percent of the variation in the yearly deaths from heart disease can be explained by the regression of yearly deaths on alcohol consumption?

r = -0.843

r² = 0.710649

About 71% of the variation in yearly deaths from heart disease can be explained by the regression on alcohol consumption.

Example #16: Is there a linear relationship between marijuana consumption and other drug usage? For this regression, the percent of variability in other drug usage explained by the regression of other drugs on marijuana use was 66.5%. What is the correlation coefficient?

r² = 0.665

r = √0.665 = 0.815475

Moderately strong, positive relationship
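Going from r to r² is a single squaring, but going back needs one extra piece of information: the square root gives only the size of r, and the sign must come from the direction of the association. The sketch below illustrates both directions with the values from the two examples above; the `slope_is_positive` flag is an assumption added here to make that point explicit.

```python
import math

# From r to r^2: fraction of the variation in y explained by the regression.
r = -0.843
r_squared = r ** 2
print(round(r_squared, 6))        # 0.710649

# From r^2 back to r: the square root gives only the size of r.
# The sign has to come from the direction of the association (the slope).
r_squared_given = 0.665
slope_is_positive = True          # assumption: the association is positive
r_size = math.sqrt(r_squared_given)
r_recovered = r_size if slope_is_positive else -r_size
print(round(r_recovered, 6))      # 0.815475
```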

Example #17: Fast Food Sandwiches: The mean serving size for fast food sandwiches is 7.557 ounces with a standard deviation of 2.008 ounces. The mean number of calories per sandwich is 446.9 with a standard deviation of 143. The correlation between serving size and calories is 0.849.

a. Calculate the LSRL.

1. Slope: $b = r \frac{s_y}{s_x}$ = 0.849(143/2.008) = 60.46165339

2. Y-intercept: $a = \bar{y} - b\bar{x}$ = 446.9 – (60.46165339 × 7.557) = -10.00871464

$\hat{y}$ = -10.0087 + 60.4617x

where $\hat{y}$ is the predicted number of calories and x is the serving size in ounces.

b. What percent of the variability in calories is explained by the least squares line with serving size?

r² = 0.849² = 0.720801

72% of the variability in calories is explained by serving size

c. Use this regression line to predict the average number of calories in a 35-ounce serving. Explain if the least squares would be appropriate to use in this situation.

$\hat{y}$ = -10.0087 + 60.4617(35)

$\hat{y}$ = 2106.1508

No. A 35-ounce serving is far outside the range of typical serving sizes, so this prediction is an extrapolation.
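Because the line here is built only from summary statistics, the whole calculation fits in a small helper. The sketch below is an illustration with names invented for this purpose (`lsrl_from_summary`, `predict`). The extrapolation flag uses an arbitrary cutoff of 3 standard deviations from the mean, which is an assumption made here because the actual range of serving sizes is not given.

```python
def lsrl_from_summary(r, x_bar, s_x, y_bar, s_y):
    """Return (intercept a, slope b) of y-hat = a + b*x from summary statistics."""
    b = r * s_y / s_x          # slope first
    a = y_bar - b * x_bar      # the line passes through (x-bar, y-bar)
    return a, b

# Serving size (ounces) predicting calories.
a, b = lsrl_from_summary(r=0.849, x_bar=7.557, s_x=2.008, y_bar=446.9, s_y=143)
print(round(a, 4), round(b, 4))      # -10.0087 60.4617

def predict(x, a, b, x_bar, s_x, max_z=3):
    """Predict y-hat and flag x-values far from the mean of x.

    max_z = 3 is an arbitrary illustrative cutoff, not a rule from the notes.
    """
    is_extrapolation = abs(x - x_bar) / s_x > max_z
    return a + b * x, is_extrapolation

y_hat, flagged = predict(35, a, b, x_bar=7.557, s_x=2.008)
print(round(y_hat, 1), flagged)      # 2106.1 True
```

A 35-ounce serving sits more than 13 standard deviations above the mean serving size, so the prediction is flagged.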

Example #18: Find the coefficient of determination and correlation coefficient for #13 and explain their meaning.

Dependent variable is: Students
Variable   Coeff     s.e.     t      p
Constant   11        6.299    1.75   0.1313
Years      13.9286   1.0506   9.25   0.0001
s = 9.758   R-sq = 93.4%   R-sq(adj) = 92.4%

93.4% of the variation in the number of students that take AP Stats is explained by the year.

r = √0.934 = 0.9664. Strong, positive association between the number of AP Stats students and the year.

Cautions in Making Predictions with Regression Lines:

1. If the correlation is not strong, predictions will not be accurate.

2. Extrapolation: Do not make predictions outside of the range for which you have data.

3. Correlation simply does not imply causation

• The correlation may be a coincidence

• Both variables might be directly influenced by some common underlying cause

Lurking Variables:

A lurking variable is a variable that is not among the explanatory or response variables, but influences the interpretation of the relationship.

[Figure: diagrams of X and Y illustrating causation (z = lurking variable)]

Are you looking hard enough?

Example #19: There is a positive correlation between the number of deaths by drowning and the number of ice cream cones sold. Is this evidence that people are not heeding the old advice to wait 2 hours after eating before swimming and are paying the price for it?

No! Summer is the lurking variable

Example #20: Smoke Causes Coughs: A strong relationship is found between weekly sales of firewood and weekly sales of cough drops from September to March. Can we conclude that smoke from the fires causes coughs?

No! Winter is the lurking variable

Outlier: Observation away from the other data points

Influential Point:

Observation that drastically changes the LSRL

• The following scatterplot shows that something was awry in Palm Beach County, Florida, during the 2000 presidential election…

• The red line shows the effects that one unusual point can have on a regression:

• The extraordinarily large shoe size gives the data point high leverage. Wherever the IQ is, the line will follow!

http://bcs.whfreeman.com/tps3e/

Two-variable Statistical Calculator

Outlier vs. Influential
