chapter 3 – examining relationships

Post on 16-Jan-2016

Chapter 3 – Examining Relationships

Scatterplots and Correlation - 3.1

Scatterplot: Shows the relationship between two quantitative variables.

Explanatory Variable: The variable on the x-axis. It influences the response.

Response Variable: The variable on the y-axis. It responds to changes in the explanatory variable.

Looking at Scatterplots:

• Direction: Positive: as x increases, y increases. Negative: as x increases, y decreases.

• Form: Is there a linear relationship between the two variables?

• Strength: Do the points follow a single stream that is tight to the line or is there considerable spread (or variability) around the line?

Calculator Tip: Scatterplots

L1: Explanatory Variable

L2: Response Variable

Use statplot to graph

Example #1: Suppose you were to collect data for each pair of variables below. Which variable is the explanatory and which is the response? Determine the likely direction and strength of the relationship.

1. T-shirts at a store: Price of each, Number Sold

Explanatory (x): Price of shirt

Response (y): # sold

D: negative

S: strong

[Sketch: price of shirt on the x-axis, from $5 to $50; # sold on the y-axis, from 1 to 100]

2. Drivers: Reaction Time, Blood Alcohol Level

Explanatory (x): BAC

Response (y): Reaction time

D: positive

S: strong

[Sketch: BAC on the x-axis, from .01 to .5; reaction time on the y-axis, from 1 to 10]

3. Cars: Age of Owner, Weight of the Car

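Makes no sense!!! There is no reason to expect the owner's age to explain the weight of the car, so neither variable is a sensible explanatory variable.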

Example #2: In a study of whether a relationship exists between a child's aptitude and the age at which he/she first speaks, researchers recorded the age (in months) of a child's first speech and the child's score on an aptitude test. The data for these 21 children follow:

Make a scatterplot and describe the relationship in the context of the problem.

D: positive

F: curved

S: moderate

Correlation (“r”):

Measures the direction and strength of the linear relationship between two variables.

Both variables must be quantitative.

Attributes of the Correlation

1. The correlation coefficient is a unitless measurement, denoted with the letter r, and has values between -1 and 1.

2. When r = 1 all the data points form a perfect straight line relationship with a positive slope.

3. When r = -1 all the data points form a perfect straight line relationship with a negative slope.

4. Values of r close to 0 mean that the linear relationship is weak. There may be a general linear trend, but there is a lot of variability around that trend.

5. When r = 0 there is no linear relationship between the two variables. In other words, the best-fitting line has a slope of zero.

6. Outliers have a large influence on the correlation coefficient. The correlation is NOT resistant to outliers.

7. Correlation does not describe curved relationships! (ONLY LINEAR)
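Attributes 5 to 7 can be checked numerically. The sketch below is only an illustration: the data sets are made up (they are not from the slides) and `pearson_r` is a hand-written helper. Five points on a perfect line give r = 1, adding one outlier drops r to about 0.14, and a perfectly curved (parabolic) pattern gives r = 0 even though x and y are clearly related.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation as the sum of z-score products divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data: five points exactly on the line y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
r_clean = pearson_r(xs, ys)

# Same data plus one hypothetical outlier far below the line.
r_outlier = pearson_r(xs + [6], ys + [0])

# Hypothetical curved data: y = x^2 on x values symmetric around 0.
r_curved = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])

print(round(r_clean, 3), round(r_outlier, 3), round(r_curved, 3))
```

A single point is enough to change the picture, which is why a scatterplot should always be inspected before r is interpreted.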

Types of Correlation:

[Scatterplot panels: r = 0, r = -0.3, r = 0.5, r = -0.7, r = 0.9, r = -0.99]

Example #3: What is wrong with the following statements?

1. There is a strong correlation between the gender of American workers and their income.

Gender is categorical; correlation requires two quantitative variables.

2. We found a high correlation (r = 1.09) between students’ rating of faculty teaching and ratings made by other faculty members.

r can’t be bigger than 1; it always lies between -1 and 1.

3. We found a very weak correlation (r = -0.95) which suggests little relationship between income and hours spent at casinos.

r = -0.95 indicates a strong negative relationship, not a weak one.

4. We found a very weak correlation (r = 0.01) which suggests little relationship between age and death rate.

Age and death rate should have a very strong relationship, so r = 0.01 is not believable.

Guidelines: How strong is the linear relationship?

0 < r < 0.3 = weak positive            -0.3 < r < 0 = weak negative

0.4 < r < 0.7 = moderate positive      -0.7 < r < -0.4 = moderate negative

0.8 < r < 1 = strong positive          -1 < r < -0.8 = strong negative

HOW TO CALCULATE THE CORRELATION COEFFICIENT

Remember how to calculate the z-score? We used this calculation to determine how many standard deviations an observation was from the mean.

RECALL:

z-score: $z = \frac{x - \bar{x}}{s}$

In this case, we were only concerned with one variable.

Now, we are considering two variables and each must be standardized.

Notation:

$r$ = correlation

$\bar{x}$ = sample mean of the x's

$x_i$ = the i-th observation of the x's

$S_x$ = sample standard deviation of the x's

$\bar{y}$ = sample mean of the y's

$y_i$ = the i-th observation of the y's

$S_y$ = sample standard deviation of the y's

$n$ = total number of observations

FORMULA:

$r = \frac{1}{n-1} \sum \left( \frac{x_i - \bar{x}}{S_x} \right) \left( \frac{y_i - \bar{y}}{S_y} \right)$

Calculator Tip: Correlation

L1: Explanatory Variable

L2: Response Variable

Stat-calc-LinReg(a+bx), L1, L2

(make sure your diagnostic is on!!!)

Example #4:

Speed (x) 20 30 40

MPG (y) 25 35 45

Step #1: Find the following summary statistics:

n = 3

SPEED: $\bar{x}$ = 30, $S_x$ = 10

MPG: $\bar{y}$ = 35, $S_y$ = 10

Step #2: Calculate z-scores

SPEED: Z(x1) = (20 - 30)/10 = -1, Z(x2) = (30 - 30)/10 = 0, Z(x3) = (40 - 30)/10 = 1

MPG: Z(y1) = (25 - 35)/10 = -1, Z(y2) = (35 - 35)/10 = 0, Z(y3) = (45 - 35)/10 = 1

PRODUCT: Z(x1)Z(y1) = 1, Z(x2)Z(y2) = 0, Z(x3)Z(y3) = 1

Step #3: Calculate the Correlation

$r = \frac{1}{3-1}(1 + 0 + 1)$

$r = \frac{1}{2}(2)$

$r = 1$
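The three steps of Example #4 can be reproduced in a few lines of Python. This is a minimal sketch using only the speed and MPG values from the example; the helper names `mean` and `sample_sd` are illustrative choices, not part of the slides.

```python
from math import sqrt

speed = [20, 30, 40]   # x, from Example #4
mpg = [25, 35, 45]     # y, from Example #4
n = len(speed)

def mean(values):
    return sum(values) / len(values)

def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

# Step 1: summary statistics
x_bar, s_x = mean(speed), sample_sd(speed)
y_bar, s_y = mean(mpg), sample_sd(mpg)

# Step 2: z-scores and their products
z_x = [(x - x_bar) / s_x for x in speed]
z_y = [(y - y_bar) / s_y for y in mpg]
products = [zx * zy for zx, zy in zip(z_x, z_y)]

# Step 3: r = (sum of the products) / (n - 1)
r = sum(products) / (n - 1)

print("x-bar, Sx:", x_bar, s_x)
print("y-bar, Sy:", y_bar, s_y)
print("z-scores:", z_x, z_y)
print("r =", r)
```

Because the three points lie exactly on a line with positive slope, the result is r = 1, matching attribute 2.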

3.2 – Least-Squares Regression

Regression line: straight line that describes the linear relationship between an explanatory variable and a response variable.

LEAST SQUARES REGRESSION LINE:

• This is the best-fitting line to the data.

• The goal is to minimize the (vertical) distances of your observations (data) from your line.

• Again, we must square the distances (as in the calculation of the variance) because some data points lie above the line (positive distances) and some lie below it (negative distances), and unsquared they would cancel each other out.

We can use this line to predict a response, y, from a given explanatory variable, x.

Remember graphing??

Slope-Intercept formula for a line:

y = mx + b where m = slope and b = y-intercept

Do you remember the SLOPE?

$\text{slope} = \frac{\text{rise}}{\text{run}} = \frac{\Delta y}{\Delta x}$

In statistics, we write it

$\hat{y} = a + bx$

1. Slope: $b = r \frac{S_y}{S_x}$ (Calculate this first!)

2. Y-intercept: $a = \bar{y} - b\bar{x}$

Example #1: Wildlife researchers monitor many wildlife populations by taking aerial photographs in order to estimate the weights of alligators. Here is the regression line of the weights of adult alligators (in pounds) and their lengths (in inches) based on the data collected from captured alligators.

Predicted Weight = – 393 + 5.9(length)

1. What is the slope of the line? What does it mean?

Slope: b = 5.9

For each additional inch of length, the predicted weight increases by 5.9 pounds.

2. What is the y-intercept of the line? What does it mean?

Y-intercept: a = -393

An alligator 0 inches long would be predicted to weigh -393 lbs. This makes no sense!!!

3. Describe the relationship between weight and length of alligators.

As the length of an alligator increases, its predicted weight increases: a positive linear relationship.

4. What is the predicted weight for an alligator 90 inches long?

Predicted Weight = -393 + 5.9(90)

= -393 + 531

= 138 lbs
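The same prediction can be written as a small function. This is a sketch built only from the regression line given in the example; the function name `predicted_weight` is an illustrative choice.

```python
def predicted_weight(length_in):
    """Predicted weight (lb) for a given length (in), using the line from Example #1."""
    a = -393   # y-intercept
    b = 5.9    # slope: pounds per additional inch of length
    return a + b * length_in

# Question 4: a 90-inch alligator
print(round(predicted_weight(90), 1))

# Question 1: the slope is the change in prediction for one extra inch
print(round(predicted_weight(91) - predicted_weight(90), 1))

# Question 2: plugging in length 0 returns the (meaningless) y-intercept
print(predicted_weight(0))
```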

CALCULATION:

$\hat{y} = a + bx$

1. Slope: $b = r \frac{S_y}{S_x}$ (Calculate this first!)

2. Y-intercept: $a = \bar{y} - b\bar{x}$

Facts about Least Squares Regression:

1. The distinction between explanatory and response variables is essential (which variable is used to predict which?).

2. It always passes through the point ($\bar{x}$, $\bar{y}$).

3. Correlation ‘r’ describes the direction and strength of the straight-line relationship, but tells us nothing more about the slope than whether it is positive, negative, or zero.

Extrapolation: Predicting outside the range of the x values

Calculator Tip: LSRL

L1: Explanatory Variable

L2: Response Variable

Stat-calc-LinReg(a+bx), L1, L2, vars/y-vars/Function/ Y1

Example #2: Is there a relationship between wine consumption (in liters) and yearly deaths from heart disease (deaths per 100,000)? Here are the summary statistics:

Mean wine consumption: 3,026

SD of wine consumption: 2,510

Mean deaths from heart disease: 191,053

SD of heart disease deaths: 68,396

Correlation coefficient between wine consumption and yearly deaths from heart disease = -.0843

a. Interpret the value of the correlation coefficient in the context of the problem.

The correlation is negative: as wine consumption increases, yearly deaths from heart disease tend to decrease.

b. Calculate the least-squares regression line predicting death rate from wine consumption.

1. Slope: $b = r \frac{S_y}{S_x}$ = -0.0843(68,396/2,510) = -2.2971

2. Y-intercept: $a = \bar{y} - b\bar{x}$ = 191,053 – (-2.2971 × 3,026) = 198,004.0991

$\hat{y}$ = 198,004.0991 – 2.2971x

c. Use your line to predict death rate for an average adult who consumes 4 liters of wine.

$\hat{y}$ = 198,004.0991 – 2.2971x

$\hat{y}$ = 198,004.0991 – 2.2971(4)

$\hat{y}$ = 197,994.9107

Example #3: The following data describe the relationship between a tree trunk's diameter and its height. Make a scatterplot of the data and find the LSRL. Define any variables used in this equation. How strong is the association?

Trunk Diameter: 8 9 7 6 13 7 11 12

Tree Height: 35 49 27 33 60 21 45 51

$\hat{y}$ = -1.31467 + 4.54133x

Where x = trunk diameter and $\hat{y}$ = predicted tree height

Strong correlation, r = 0.88
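The LSRL for the tree data can be checked without a calculator. The sketch below uses only the eight data pairs from the example and the slope and intercept formulas from this section; the variable names are illustrative.

```python
from math import sqrt

diameter = [8, 9, 7, 6, 13, 7, 11, 12]      # x: trunk diameter
height = [35, 49, 27, 33, 60, 21, 45, 51]   # y: tree height
n = len(diameter)

x_bar = sum(diameter) / n
y_bar = sum(height) / n
s_x = sqrt(sum((x - x_bar) ** 2 for x in diameter) / (n - 1))
s_y = sqrt(sum((y - y_bar) ** 2 for y in height) / (n - 1))

# Correlation: sum of z-score products divided by n - 1
r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
        for x, y in zip(diameter, height)) / (n - 1)

b = r * s_y / s_x        # slope (calculate this first)
a = y_bar - b * x_bar    # y-intercept

print(f"y-hat = {a:.5f} + {b:.5f}x")
print(f"r = {r:.3f}")
```

The computed line agrees with the equation above, and r comes out near 0.886, which the slide reports as 0.88.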

Residual: How close is the data to the line?

residual = observed y – predicted y = $y - \hat{y}$

Residual Plot: A plot that shows the residuals for all the data. A good line has no pattern.

Calculator Tip: Residual Plot

Calculate the LSRL

L3: vars/ y-vars/ function/ Y1(L1)

L4: L2 – L3

Scatterplot: L1, L4

Example of random residual plots

Example of curved residual plots

Not a linear model.

Example of fanning residual plots

Less accurate for larger x values.

Standard Deviation of the residuals:

Used to measure the prediction error of the line

$s = \sqrt{\frac{\sum \text{residuals}^2}{n-2}}$

Calculator Tip: SD of residuals

Find residuals / in L5: L4² / 2nd List / math / sum(L5)

Example #4: The ages (in years) of seven men and their systolic blood pressures are given below:

Age (x):                           16     25     39     45     49     64     70

Systolic BP (y):                   100    120    140    160    165    185    200

Predicted Pressure ($\hat{y}$):    102.2  118.5  143.8  154.7  161.9  189    199.8

Regression Equation: $\hat{y} = 73.3589 + 1.8068x$

Residuals: -2.27 1.47 -3.82 5.34 3.11 -3.99 .17

Residual Plot:

No apparent pattern.

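Standard deviation of the residuals:

-2.27 1.47 -3.82 5.34 3.11 -3.99 .17

$s = \sqrt{\frac{\sum \text{residuals}^2}{n-2}}$

$s = \sqrt{\frac{(-2.27)^2 + (1.47)^2 + (-3.82)^2 + (5.34)^2 + (3.11)^2 + (-3.99)^2 + (.17)^2}{7-2}}$

$s = \sqrt{\frac{76.03275905}{5}}$

$s = 3.899557899$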
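The residuals and their standard deviation can be recomputed from the data and the regression equation in the example. This is a sketch under the assumption that the printed coefficients (73.3589 and 1.8068) are used as given; because they are rounded, the result matches the slide's value only to about three decimal places.

```python
from math import sqrt

age = [16, 25, 39, 45, 49, 64, 70]               # x
systolic = [100, 120, 140, 160, 165, 185, 200]   # y

a, b = 73.3589, 1.8068   # regression equation from the example

predicted = [a + b * x for x in age]
residuals = [y - y_hat for y, y_hat in zip(systolic, predicted)]

# s = sqrt( sum of squared residuals / (n - 2) )
n = len(age)
s = sqrt(sum(e ** 2 for e in residuals) / (n - 2))

print([round(e, 2) for e in residuals])
print(round(s, 3))
```

The division is by n - 2 rather than n - 1, as in the formula on the slide.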

Assessing the Predictive Power of the Equation:

1. Coefficient of Determination: r² = the correlation coefficient, squared

2. It is the fraction (or percent) of the variation in the values of y that is explained by the least-squares regression of y on x.

3. The closer r² is to 1, the better the regression line describes the connection between x and y – in particular, predictions made with the equation will be more accurate.

3.2 & 3.3 – Coefficient of Determination, Lurking Variables

Coefficient of Determination: (r²)

How much of the variation in y is explained by x

Reading Computer Output:

Predictor      Coef        StDev    T    P

Constant       (y-int)

x-variable     (Slope)

S =        R-Sq = (r²)        R-Sq(adj) =

Example #1: The correlation between alcohol consumption and yearly deaths from heart disease was -0.843. What percent of the variation in the yearly deaths from heart disease can be explained by the regression of yearly deaths on alcohol consumption?

r = -0.843

r² = 0.710649

About 71% of the variation in yearly deaths from heart disease can be explained by the regression on alcohol consumption.

Example #2: Is there a linear relationship between marijuana consumption and other drug usage? For this regression, the percent of variability in other drug usage explained by the regression of other drugs on marijuana use is 66.5%. What is the correlation coefficient?

r² = .665

r = √.665 = 0.815475
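Both conversions in Examples #1 and #2 are one-line calculations. A minimal sketch, using only the numbers from the two examples:

```python
from math import sqrt

# Example #1: from r to r^2 (fraction of variation in y explained)
r = -0.843
r_squared = r ** 2
print(f"r^2 = {r_squared:.6f}")

# Example #2: from r^2 back to r. The square root gives only the size of r;
# the sign has to come from the direction of the relationship
# (taken as positive here, as in the slide's answer).
r_squared_drugs = 0.665
r_drugs = sqrt(r_squared_drugs)
print(f"r = {r_drugs:.6f}")
```

Squaring discards the sign of r, so r² alone never tells you whether the relationship is positive or negative.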

Example #3: Fast Food Sandwiches: The mean serving size for fast food sandwiches is 7.557 ounces with a standard deviation of 2.008 ounces. The mean number of calories per sandwich is 446.9 with a standard deviation of 143. The correlation between serving size and calories is 0.849.

a. Calculate the LSRL.

1. Slope: $b = r \frac{S_y}{S_x}$ = 0.849(143/2.008) = 60.46165339

2. Y-intercept: $a = \bar{y} - b\bar{x}$ = 446.9 – (60.46165339 × 7.557) = -10.00871464

$\hat{y}$ = -10.0087 + 60.4617x

where $\hat{y}$ is the predicted number of calories and x is the serving size.

b. What percent of the variability in calories is explained by the least squares line with serving size?

r² = 0.849² = 0.720801

72% of the variability in calories is explained by serving size

c. Use this regression line to predict the average number of calories in a 35-ounce serving. Explain if the least squares would be appropriate to use in this situation.

$\hat{y}$ = -10.0087 + 60.4617x

$\hat{y}$ = -10.0087 + 60.4617(35)

$\hat{y}$ = 2106.1508

No. This is extrapolation: 35 ounces is too far away from normal serving sizes.
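The whole fast food example follows from the five summary statistics. The sketch below wraps the two formulas in a helper named `lsrl_from_summary` (an illustrative name, not from the slides). The "more than 3 standard deviations from the mean" check is only a rough stand-in for an extrapolation warning, since the slides do not give the actual range of serving sizes.

```python
def lsrl_from_summary(r, x_bar, s_x, y_bar, s_y):
    """Return (a, b) for y-hat = a + bx from summary statistics."""
    b = r * s_y / s_x        # slope first
    a = y_bar - b * x_bar    # then the y-intercept
    return a, b

# Summary statistics from Example #3 (fast food sandwiches)
r, x_bar, s_x, y_bar, s_y = 0.849, 7.557, 2.008, 446.9, 143
a, b = lsrl_from_summary(r, x_bar, s_x, y_bar, s_y)

serving = 35                    # ounces, from part (c)
calories = a + b * serving

# Assumption for illustration: treat x values more than 3 SDs from the
# mean as likely extrapolation (the real data range is not given).
z = (serving - x_bar) / s_x
is_extrapolation = abs(z) > 3

print(f"y-hat = {a:.4f} + {b:.4f}x")
print(f"r^2 = {r ** 2:.6f}")
print(f"predicted calories at {serving} oz: {calories:.2f}")
print(f"z-score of serving size: {z:.1f}, extrapolation: {is_extrapolation}")
```

The prediction differs from the slide's 2106.1508 only in the later decimal places, because the slide rounds a and b before substituting.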

Example #3: Commercial airlines need to know the operating cost per hour of flight for each plane in their fleet. In a study of the relationship between operating cost per hour and number of passenger seats, investigators computed the regression of operating cost per hour on the number of passenger seats. The 12 sample aircraft used in the study included planes with as few as 126 passenger seats and planes with as many as 410 passenger seats. Operating cost per hour ranged between $3,600 and $7,800. Some computer output from a regression analysis of these data is shown below.

a. What is the equation of the least squares regression line that describes the relationship between operating cost per hour and number of passenger seats in the plane? Define any variables used in this equation.

$\hat{y} = 1136 + 14.673x$

where $\hat{y}$ is the predicted operating cost per hour and x is the number of passenger seats

b. What is the value of the correlation coefficient for operating cost per hour and number of passenger seats in the plane? Interpret this correlation.

r² = 0.57, so r = √0.57 = 0.75498

There is a strong positive correlation between the number of passenger seats and the operating cost per hour.

c. Suppose that you want to describe the relationship between operating cost per hour and number of passenger seats in the planes only in the range of 250 to 350 seats. Does the line shown in the scatterplot still provide the best description of the relationship for data in this range? Why or why not?

No. Between 250 and 350 seats, the direction looks negative.

Cautions in Making Predictions with Regression Lines:

1. If the correlation is not strong, predictions will not be accurate.

2. Extrapolation: Do not make predictions outside of the range for which you have data.

3. Correlation simply does not imply causation

• The correlation may be a coincidence

• Both variables might be directly influenced by some common underlying cause

Lurking Variables:

A lurking variable is a variable that is not among the explanatory or response variables, but influences the interpretation of the relationship.

Causation: X → Y

Common Response (z = lurking variable): Z → X and Z → Y

Example #4: There is a positive correlation between the number of deaths by drowning and the number of ice cream cones sold. Is this evidence that people are not heeding the old advice to wait 2 hours after eating before swimming and are paying the price for it?

No! Summer is the lurking variable

Example #5: Smoke Causes Coughs: A strong relationship is found between weekly sales of firewood and weekly sales of cough drops from September to March. Can we conclude that smoke from the fires causes coughs?

No! Winter is the lurking variable
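Common response is easy to simulate. The sketch below is entirely hypothetical: the temperature range, the coefficients, and the noise levels are invented for illustration. Temperature (the lurking variable z) drives both ice cream sales and drownings, and neither of those two variables has any effect on the other, yet they come out strongly correlated.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Correlation, written as Sxy / sqrt(Sxx * Syy) (same value as the z-score formula)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical model: z = temperature influences both variables.
temperature = [random.uniform(0, 30) for _ in range(200)]
ice_cream = [2.0 * t + random.gauss(0, 5) for t in temperature]
drownings = [0.5 * t + random.gauss(0, 2) for t in temperature]

r_xy = pearson_r(ice_cream, drownings)
print(round(r_xy, 2))
```

A large r here says nothing about ice cream causing drownings; the association exists only because both respond to the same third variable.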

Outlier: Observation that lies away from the other data points

Influential Point: Observation that drastically changes the LSRL

Applet: http://bcs.whfreeman.com/tps3e/pages/bcs-main.asp?v=category&s=00020&n=99000&i=99020.01&o=|00510|00520|00530|00010|00020|00030|00040|00050|00060|00070|00080|00110|00120|00300|0P000|01000|02000|03000|04000|05000|06000|07000|08000|09000|10000|11000|12000|13000|14000|15000|99000|
