Chapter 3 – Examining Relationships
Scatterplots and Correlation - 3.1
Scatterplot: Shows the relationship between two quantitative variables.
Explanatory Variable: Variable on the x-axis. Influences the response.
Response Variable: Variable on the y-axis. Responds to the explanatory variable.
Looking at Scatterplots:
• Direction: Positive: as x increases, y increases. Negative: as x increases, y decreases.
• Form: Is there a linear relationship between the two variables?
• Strength: Do the points follow a single stream that is tight to the line or is there considerable spread (or variability) around the line?
Calculator Tip: Scatterplots
L1: Explanatory Variable
L2: Response Variable
Use statplot to graph
Example #1: Suppose you were to collect data for each pair of variables below. Which variable is the explanatory and which is the response? Determine the likely direction and strength of the relationship.
1. T-shirts at a store: Price of each, Number Sold
Explanatory (x): Price of shirt
Response (y): # sold
D: negative
S: strong
[Sketch: scatterplot with price of shirt ($5 to $50) on the x-axis and # sold (1 to 100) on the y-axis]
2. Drivers: Reaction Time, Blood Alcohol Level
Explanatory (x): BAC
Response (y): Reaction time
D: positive
S: strong
[Sketch: scatterplot with BAC (.01 to .5) on the x-axis and reaction time (1 to 10) on the y-axis]
3. Cars: Age of Owner, Weight of the Car
Makes no sense!!! There is no reason to expect either variable to explain the other.
Example #2: In a study of whether a relationship exists between a child's aptitude and the age at which he/she first speaks, researchers recorded the age (in months) of a child's first speech and the child's score on an aptitude test. The data for these 21 children follow:
Make a scatterplot and describe the relationship in the context of the problem.
D: positive
F: curved
S: moderate
Correlation (“r”):
Measures the direction and strength of the linear relationship between two variables.
Both variables must be quantitative.
Attributes of the Correlation
1. The correlation coefficient is a unit-less measurement, denoted with the letter r, and has values between -1 and 1.
2. When r = 1 all the data points form a perfect straight line relationship with a positive slope.
3. When r = -1 all the data points form a perfect straight line relationship with a negative slope.
4. Values of r close to 0 mean that the linear relationship is weak. There may be a general linear trend, but there is a lot of variability around that trend.
5. When r = 0 there is no linear relationship between the two variables. In other words, the best fitting line has a slope of zero.
6. Outliers have a large influence on the correlation coefficient. The correlation is NOT resistant to outliers.
7. Correlation does not describe curved relationships! (ONLY LINEAR)
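Attribute 6 can be illustrated with a small sketch. The data below are made up purely for illustration (they are not from the slides): five points on a perfect line, then the same points with one outlier added. The helper name `correlation` is an assumption of this sketch.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation r as the sum of z-score products divided by (n - 1)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data: a perfect positive line
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
r_clean = correlation(xs, ys)

# Same data with a single outlier at (6, 0)
r_outlier = correlation(xs + [6], ys + [0])

print(round(r_clean, 3), round(r_outlier, 3))
```

One unusual point is enough to pull r from 1 down to roughly 0.14 here, which is what "not resistant" means in practice.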
Types of Correlation:
[Figure: six scatterplots illustrating r = 0, r = -0.3, r = 0.5, r = -0.7, r = 0.9, and r = -0.99]
Example #3: What is wrong with the following statements?
1. There is a strong correlation between the gender of American workers and their income.
Gender is categorical; correlation requires two quantitative variables.
2. We found a high correlation (r = 1.09) between students’ rating of faculty teaching and ratings made by other faculty members.
r can’t be bigger than 1.
3. We found a very weak correlation (r = -0.95) which suggests little relationship between income and hours spent at casinos.
r = -0.95 is a strong negative relationship, not a weak one.
4. We found a very weak correlation (r = 0.01) which suggests little relationship between age and death rate.
Age and death rate should have a very strong relationship!
Guidelines: How strong is the linear relationship?
0 < r < 0.3 = weak positive          -0.3 < r < 0 = weak negative
0.4 < r < 0.7 = moderate positive    -0.7 < r < -0.4 = moderate negative
0.8 < r < 1 = strong positive        -1 < r < -0.8 = strong negative
HOW TO CALCULATE THE CORRELATION COEFFICIENT
Remember how to calculate the z-score? We used this calculation to determine how many standard deviations an observation was from the mean.
RECALL:
z-score: $z = \frac{x - \bar{x}}{s}$
In this case, we were only concerned with one variable.
Now, we are considering two variables and each must be standardized.
Notation:
$r$ = correlation
$\bar{x}$ = sample mean of the x's
$x_i$ = the i-th observation of the x's
$S_x$ = sample standard deviation of the x's
$\bar{y}$ = sample mean of the y's
$y_i$ = the i-th observation of the y's
$S_y$ = sample standard deviation of the y's
$n$ = total number of observations
FORMULA:
$r = \frac{1}{n-1} \sum \left(\frac{x_i - \bar{x}}{S_x}\right)\left(\frac{y_i - \bar{y}}{S_y}\right)$
Calculator Tip: Correlation
L1: Explanatory Variable
L2: Response Variable
Stat-calc-LinReg(a+bx), L1, L2
(make sure your diagnostic is on!!!)
Example #4:
Speed (x): 20 30 40
MPG (y): 25 35 45
Step #1: Find the following summary statistics:
n = 3
SPEED: $\bar{x} = 30$, $S_x = 10$
MPG: $\bar{y} = 35$, $S_y = 10$
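Step #2: Calculate z-scores
SPEED: $Z(x_1) = \frac{20 - 30}{10} = -1$, $Z(x_2) = \frac{30 - 30}{10} = 0$, $Z(x_3) = \frac{40 - 30}{10} = 1$
MPG: $Z(y_1) = \frac{25 - 35}{10} = -1$, $Z(y_2) = \frac{35 - 35}{10} = 0$, $Z(y_3) = \frac{45 - 35}{10} = 1$
PRODUCT: $Z(x_1)Z(y_1) = 1$, $Z(x_2)Z(y_2) = 0$, $Z(x_3)Z(y_3) = 1$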
Step #3: Calculate the Correlation
$r = \frac{1}{3 - 1}(1 + 0 + 1)$
$r = \frac{1}{2}(2)$
$r = 1$
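The three steps of Example #4 can be reproduced in a few lines of Python. This is a minimal sketch using only the standard library; the function name `correlation` is an assumption of this sketch, not something from the slides or the calculator.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Standardize both variables, multiply the z-scores, divide by (n - 1)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)      # Step 1: summary statistics
    s_x, s_y = stdev(xs), stdev(ys)        # sample standard deviations
    z_x = [(x - x_bar) / s_x for x in xs]  # Step 2: z-scores
    z_y = [(y - y_bar) / s_y for y in ys]
    products = [zx * zy for zx, zy in zip(z_x, z_y)]
    return sum(products) / (n - 1)         # Step 3: correlation

speed = [20, 30, 40]
mpg = [25, 35, 45]
r = correlation(speed, mpg)
print(r)
```

Because the three points lie exactly on a line with positive slope, the result agrees with the hand calculation, r = 1.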
3.2 – Least-Squares Regression
Regression line: straight line that describes the linear relationship between an explanatory variable and a response variable.
LEAST SQUARES REGRESSION LINE:
• This is the best-fitting line to the data.
• The goal is to minimize the (vertical) distances of your observations (data) from your line.
• Again, we must square the distances (as in the calculation of the variance) because some observations lie above the line (positive distances) and some lie below it (negative distances), so the unsquared distances would cancel each other out.
We can use this line to predict a response, y, from a given explanatory variable, x.
Remember graphing??
Slope-Intercept formula for a line:
y = mx + b where m = slope and b = y-intercept
Do you remember the SLOPE?
$\text{slope} = \frac{\text{rise}}{\text{run}} = \frac{\Delta y}{\Delta x}$
In statistics, we write it
$\hat{y} = a + bx$
1. Slope: $b = r\frac{S_y}{S_x}$ (Calculate this first!)
2. Y-intercept: $a = \bar{y} - b\bar{x}$
Example #1: Wildlife researchers monitor many wildlife populations by taking aerial photographs in order to estimate the weights of alligators. Here is the regression line of the weights of adult alligators (in pounds) on their lengths (in inches), based on the data collected from captured alligators.
Predicted Weight = -393 + 5.9(length)
1. What is the slope of the line? What does it mean?
Slope = 5.9
For each additional inch of length, the predicted weight increases by 5.9 pounds.
2. What is the y-intercept of the line? What does it mean?
Y-intercept = -393
If an alligator is 0 inches long, then it is predicted to weigh -393 lbs. This makes no sense!!!
3. Describe the relationship between weight and length of alligators.
As length increases, predicted weight increases (a positive linear relationship).
4. What is the predicted weight for an alligator 90 inches long?
Predicted Weight = -393 + 5.9(90)
= -393 + 531
= 138 lbs
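As a quick check of questions 1 and 4, the regression line can be written as a one-line function. This is only a sketch; the name `predicted_weight` is made up for illustration.

```python
def predicted_weight(length_in):
    """Predicted alligator weight (lbs) from length (inches), per the given line."""
    return -393 + 5.9 * length_in

weight_90 = predicted_weight(90)

# The slope is the change in predicted weight per additional inch of length
per_inch = predicted_weight(91) - predicted_weight(90)

print(round(weight_90, 1), round(per_inch, 1))
```

The first value reproduces the 138 lbs prediction, and the second shows the slope interpretation: one more inch adds about 5.9 lbs to the prediction.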
CALCULATION:
$\hat{y} = a + bx$
1. Slope: $b = r\frac{S_y}{S_x}$ (Calculate this first!)
2. Y-intercept: $a = \bar{y} - b\bar{x}$
Facts about Least Squares Regression:
1. The distinction between explanatory and response variables is essential (which variable is used to predict which?).
2. It always passes through the point $(\bar{x}, \bar{y})$.
3. Correlation ‘r’ describes the direction and strength of the straight-line relationship, but it tells us nothing more about the slope than whether it is positive, negative, or zero.
Extrapolation: Predicting outside the range of the x values
Calculator Tip: LSRL
L1: Explanatory Variable
L2: Response Variable
Stat-calc-LinReg(a+bx), L1, L2, vars/y-vars/Function/ Y1
Example #2: Is there a relationship between wine consumption (in liters) and yearly deaths from heart disease (deaths per 100,000)? Here are the summary statistics:
Mean wine consumption: 3,026          SD of wine consumption: 2,510
Mean deaths from heart disease: 191,053          SD of heart disease deaths: 68,396
Correlation coefficient between wine consumption and yearly deaths from heart disease = -.0843
a. Interpret the value of the correlation coefficient in the context of the problem.
The correlation is negative: as wine consumption increases, yearly deaths from heart disease tend to decrease.
b. Calculate the least-squares regression line predicting death rate from wine consumption.
Slope: $b = r\frac{S_y}{S_x} = -0.0843(68,396/2,510) = -2.2971$
Y-intercept: $a = \bar{y} - b\bar{x} = 191,053 - (-2.2971 \times 3,026) = 198,004.0991$ (using the unrounded slope)
$\hat{y} = 198,004.0991 - 2.2971x$
c. Use your line to predict death rate for an average adult who consumes 4 liters of wine.
$\hat{y} = 198,004.0991 - 2.2971x$
$\hat{y} = 198,004.0991 - 2.2971(4)$
$\hat{y} = 197,994.9107$
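The slope-then-intercept recipe used in parts b and c can be sketched as a small helper that works from summary statistics alone. The function name `lsrl_from_summary` is an assumption of this sketch, and the inputs are the values exactly as printed in the example.

```python
def lsrl_from_summary(r, x_bar, s_x, y_bar, s_y):
    """Return (a, b) for y-hat = a + b*x from summary statistics."""
    b = r * s_y / s_x      # slope first
    a = y_bar - b * x_bar  # then the y-intercept
    return a, b

# Values as printed in Example #2
a, b = lsrl_from_summary(r=-0.0843, x_bar=3026, s_x=2510,
                         y_bar=191053, s_y=68396)

# Part c: prediction at x = 4
y_hat_4 = a + b * 4

print(round(b, 4), round(a, 4), round(y_hat_4, 4))
```

The prediction differs from the slide's 197,994.9107 only in the last decimal place, because the slide rounds the slope to -2.2971 before multiplying.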
Example #3: The following data describe the relationship between a tree trunk's diameter and its height. Make a scatterplot of the data and find the LSRL. Define any variables used in this equation. How strong of an association is there?
Trunk Diameter: 8 9 7 6 13 7 11 12
Tree Height: 35 49 27 33 60 21 45 51
$\hat{y} = -1.31467 + 4.54133x$
Where x = trunk diameter and $\hat{y}$ = predicted tree height
Strong correlation, r = 0.89
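For readers without the calculator, the same LSRL can be computed from the raw data. This is a minimal standard-library sketch; the function name `lsrl` is made up for illustration.

```python
from math import sqrt

def lsrl(xs, ys):
    """Return (a, b, r) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    r = sxy / sqrt(sxx * syy)   # correlation
    s_x = sqrt(sxx / (n - 1))   # sample standard deviations
    s_y = sqrt(syy / (n - 1))
    b = r * s_y / s_x           # slope
    a = y_bar - b * x_bar       # y-intercept
    return a, b, r

diameter = [8, 9, 7, 6, 13, 7, 11, 12]
height = [35, 49, 27, 33, 60, 21, 45, 51]
a, b, r = lsrl(diameter, height)
print(round(a, 5), round(b, 5), round(r, 3))
```

The intercept and slope agree with the equation above, and r comes out at about 0.886.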
Residual: How close is the data to the line?
Residual = observed y - predicted y
$\text{residual} = y - \hat{y}$
Residual Plot: A plot that shows the residuals for all the data. If the line is a good fit, the residual plot shows no pattern.
Calculator Tip: Residual Plot
Calculate the LSRL
L3: vars / y-vars / function / Y1(L1)
L4: L2 - L3
Scatterplot: L1, L4
Example of random residual plots
Example of curved residual plots
Not a linear model.
Example of fanning residual plots
Less accurate for larger x values.
Standard Deviation of the residuals:
Used to measure the prediction error of the line
$s = \sqrt{\frac{\sum \text{residuals}^2}{n - 2}}$
Calculator Tip: SD of residuals
Find residuals / in L5: L4² / 2nd List / math / sum(L5)
Example #4: The ages (in years) of seven men and their systolic blood pressures are given below:
Age (x): 16 25 39 45 49 64 70
Systolic BP (y): 100 120 140 160 165 185 200
Predicted Pressure ($\hat{y}$): 102.2 118.5 143.8 154.7 161.9 189 199.8
Regression Equation: $\hat{y} = 73.3589 + 1.8068x$
Residuals: -2.27 1.47 -3.82 5.34 3.11 -3.99 .17
Residual Plot:
No apparent pattern.
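Standard deviation of the residuals:
Residuals: -2.27 1.47 -3.82 5.34 3.11 -3.99 .17
$s = \sqrt{\frac{\sum \text{residuals}^2}{n - 2}}$
$s = \sqrt{\frac{(-2.27)^2 + (1.47)^2 + (-3.82)^2 + (5.34)^2 + (3.11)^2 + (-3.99)^2 + (.17)^2}{7 - 2}}$
$s = \sqrt{\frac{76.03275905}{5}}$
$s = 3.899557899$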
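The residuals and their standard deviation for Example #4 can be checked with a short sketch. It takes the regression equation as given in the example (with its rounded coefficients) rather than refitting the line; the variable names are assumptions of this sketch.

```python
from math import sqrt

age = [16, 25, 39, 45, 49, 64, 70]
bp = [100, 120, 140, 160, 165, 185, 200]

def predict(x):
    """Regression equation from the example: y-hat = 73.3589 + 1.8068x."""
    return 73.3589 + 1.8068 * x

# residual = observed y - predicted y
residuals = [y - predict(x) for x, y in zip(age, bp)]

# s = sqrt(sum of squared residuals / (n - 2))
n = len(age)
s = sqrt(sum(e ** 2 for e in residuals) / (n - 2))

print([round(e, 2) for e in residuals])
print(round(s, 2))
```

On average, then, the line's predictions miss the observed systolic pressure by about 3.9 units.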
Assessing the Predictive Power of the Equation:
1. Coefficient of Determination: $r^2$ = the correlation coefficient, squared
2. It is the fraction (or percent) of the variation in the values of y that is explained by the least-squares regression of y on x.
3. The closer $r^2$ is to 1, the better the regression line describes the connection between x and y. In particular, predictions made with the equation will be more accurate.
3.2 & 3.3 – Coefficient of Determination, Lurking Variables
Coefficient of Determination ($r^2$):
How much of the variation in the y values is explained by the x values
Reading Computer Output:
Predictor     Coef             StDev   T   P
Constant      (y-intercept)
x-variable    (slope)
S =           R-Sq = ($r^2$)   R-Sq(adj) =
Example #1: The correlation between alcohol consumption and yearly deaths from heart disease was -0.843. What percent of the variation in the yearly deaths from heart disease can be explained by the regression of yearly deaths on alcohol consumption?
r = -0.843
$r^2 = 0.710649$
About 71% of the variation in yearly deaths from heart disease can be explained by the regression on alcohol consumption.
Example #2: Is there a linear relationship between marijuana consumption and other drug usage? For this regression, the percent of variability in other drug usage explained by the regression of other drugs on marijuana use is 66.5%. What is the correlation coefficient?
$r^2 = .665$
$r = \sqrt{.665} = 0.815475$
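Both directions of the conversion in Examples #1 and #2 are one-liners. One caveat worth making explicit in this sketch: taking a square root only recovers the magnitude of r, so the sign has to come from the direction of the relationship (assumed positive in Example #2, as in the slide's answer).

```python
from math import sqrt

# Example #1: from r to r^2
r_alcohol = -0.843
r_squared_alcohol = r_alcohol ** 2

# Example #2: from r^2 back to r
r_squared_drugs = 0.665
r_drugs = sqrt(r_squared_drugs)  # magnitude only; sign assumed positive here

print(round(r_squared_alcohol, 6), round(r_drugs, 6))
```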
Example #3: Fast Food Sandwiches: The mean serving size for fast food sandwiches is 7.557 ounces with a standard deviation of 2.008 ounces. The mean number of calories per sandwich is 446.9 with a standard deviation of 143. The correlation between serving size and calories is 0.849.
a. Calculate the LSRL.
Slope: $b = r\frac{S_y}{S_x} = 0.849(143/2.008) = 60.46165339$
Y-intercept: $a = \bar{y} - b\bar{x} = 446.9 - (60.46165339 \times 7.557) = -10.00871464$
$\hat{y} = -10.0087 + 60.4617x$
$\hat{y}$ is the predicted number of calories and x is the serving size.
b. What percent of the variability in calories is explained by the least squares line with serving size?
$r^2 = 0.849^2 = 0.720801$
72% of the variability in calories is explained by serving size.
c. Use this regression line to predict the average number of calories in a 35-ounce serving. Explain if the least squares line would be appropriate to use in this situation.
$\hat{y} = -10.0087 + 60.4617x$
$\hat{y} = -10.0087 + 60.4617(35)$
$\hat{y} = 2106.1508$
No. This is extrapolation: a 35-ounce serving is far outside the range of typical serving sizes.
Example #3: Commercial airlines need to know the operating cost per hour of flight for each plane in their fleet. In a study of the relationship between operating cost per hour and number of passenger seats, investigators computed the regression of operating cost per hour on the number of passenger seats. The 12 sample aircraft used in the study included planes with as few as 126 passenger seats and planes with as many as 410 passenger seats. Operating cost per hour ranged between $3,600 and $7,800. Some computer output from a regression analysis of these data is shown below.
a. What is the equation of the least squares regression line that describes the relationship between operating cost per hour and number of passenger seats in the plane? Define any variables used in this equation.
$\hat{y} = 1136 + 14.673x$
$\hat{y}$ is the predicted operating cost per hour and x is the number of passenger seats.
b. What is the value of the correlation coefficient for operating cost per hour and number of passenger seats in the plane? Interpret this correlation.
$r = \sqrt{0.57} = 0.75498$
There is a strong positive correlation between the number of passenger seats and operating cost per hour.
c. Suppose that you want to describe the relationship between operating cost per hour and number of passenger seats in the planes only in the range of 250 to 350 seats. Does the line shown in the scatterplot still provide the best description of the relationship for data in this range? Why or why not?
No. Between 250 and 350 seats, the direction looks negative.
Cautions in Making Predictions with Regression Lines:
1. If the correlation is not strong, predictions will not be accurate.
2. Extrapolation: Do not make predictions outside of the range for which you have data.
3. Correlation simply does not imply causation
• The correlation may be a coincidence
• Both correlated variables might be directly influenced by some common underlying cause
Lurking Variable: A variable that is not among the explanatory or response variables, but influences the interpretation of the relationship.
[Diagram: Causation (X → Y) versus Common Response (Z → X and Z → Y, where Z = lurking variable)]
Example #4: There is a positive correlation between the number of deaths by drowning and the number of ice cream cones sold. Is this evidence that people are not heeding the old advice to wait 2 hours after eating before swimming and are paying the price for it?
No! Summer is the lurking variable
Example #5: Smoke Causes Coughs: A strong relationship is found between weekly sales of firewood and weekly sales of cough drops from September to March. Can we conclude that smoke from the fires causes coughs?
No! Winter is the lurking variable
Outlier: Observation away from the other data points
Influential Point: Observation that drastically changes the LSRL
Applet: http://bcs.whfreeman.com/tps3e/pages/bcs-main.asp?v=category&s=00020&n=99000&i=99020.01&o=|00510|00520|00530|00010|00020|00030|00040|00050|00060|00070|00080|00110|00120|00300|0P000|01000|02000|03000|04000|05000|06000|07000|08000|09000|10000|11000|12000|13000|14000|15000|99000|