ap statistics - independence high school - home · 12.1 inference for linear regression – day2...

AP Statistics Chapter 11: Chi-Square Inference Chapter 12: Regression Inference

Name:__________________Period: _____

20 A Day 21 B Day 22 A Day 23 B Day 24 A Day

11.1 Chi-Square Tests for Goodness of Fit (2-4) Page 693: 1, 3, 5, 7, 9, 11, 15, 17, 19–24

11.1 Skill Check 11.2 Inference for Two-Way Tables (5-8) Page 724: 27, 29, 31, 33, 35, 37, 39, 45, 47, 49, 51–55

11.2 Skill Check 12.1 Inference for Linear Regression – Day1 Ex: 1, 3, 5, 7, 9

27 B Day 28 A Day 29 B Day 30 A Day 31 B Day

11.2 Skill Check 12.1 Inference for Linear Regression – Day1 Page 759: 1, 3, 5, 7, 9

12.1 Day 1 Skill Check 12.1 Inference for Linear Regression – Day2 12.1 Exercises: Page 759: 11, 13, 15, 17, 19–24

12.1 Day 2 Skill Check 12.2 Transforming to Achieve Linearity 12.2 Exercises: Page 785: 31, 33, 35, 37, 39, 41, 43, 45, 47–50

3 A Day 4 B Day 5 A Day 6 B Day 7 A Day

Ch. 11– FRAPPY! Ch. 12– FRAPPY! Chapter 11 Review Problems Chapter 12 Review Problems

Chapters 11 and 12 Test

p2

Chapter 11 Notes: Chi-Square (χ2) Test

What is the χ² test used for?

Used to test the __________ of __________________ data

Types of χ² tests:

Type1: Goodness of fit univariate data _____ variable _____ sample

Type2: Independence bivariate data _____ variables _____ sample

Type3: Homogeneity univariate data with 2 samples _____ variable _____ samples

χ² Distributions:

• Different _____ have different curves
• Skewed _______
• As df increases, curve shifts toward right & becomes more like a _______________

χ² Assumptions for all 3 types of χ² tests:

(1) _________ – reasonably random sample
(2) All expected counts are at least 5. <Must list expected counts to demonstrate this assumption is met!>

Note: This 2nd assumption is a combination of the following 2 statements:

- Have ___________ of categorical data & we expect each category to happen at least once
- The _________________ is large enough we should expect at least five in each category.

χ² Calculations for all 3 types of χ² tests:
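As a reminder of the calculation shared by all three tests, the standard chi-square statistic is sketched below. The symbols O (observed count) and E (expected count) are notation chosen for this sketch and are not taken from the packet; the sum runs over every category or table cell.

```latex
\chi^2 = \sum \frac{(O - E)^2}{E}
```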

p3

Type1: χ² Goodness-of-fit test

• Uses univariate data: _________ variable _______ sample

• Use when we want to see how well the ____________ counts “fit” what we _____________ the counts to be

• Use χ²cdf function on the calculator to find p-value

• df = number of categories – 1

Hypothesis Statement for χ² Goodness-of-fit test

Hypothesis statement is written in words - be sure to write in context
Overall question: are the proportions of each variable as expected, or is at least one not as expected?
H0: The proportions of <in context> are equal (OR are the same, are as expected, are the same as the theoretical model)
Ha: At least one of the proportions of <in context> is not the same (OR is different, is not as expected, is not the same as the theoretical model)

Ex1) Does your zodiac sign determine how successful you will be? Fortune magazine collected the zodiac signs of 256 heads of the largest 400 companies. Is there sufficient evidence to claim that successful people are more likely to be born under some signs than others?

Aries 23        Aquarius 24     Leo 20
Taurus 20       Scorpio 21      Virgo 19
Gemini 18       Sagittarius 19  Libra 18
Cancer 23       Capricorn 22    Pisces 29

# variables?
# samples?
Type of Chi-Square Test?
How many would you expect in each sign if there were no difference between them?
How many df?

State

Plan

Do

Conclude
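The "Do" step for Ex1 can be checked with a short script. This is a minimal sketch, not the packet's own method: `chi2_sf` is a hypothetical helper written for this example that plays the role of the calculator's χ²cdf upper-tail area, built from the standard recurrence for the chi-square tail using only the Python standard library.

```python
import math

def chi2_sf(x, df):
    """Upper-tail area P(X >= x) for a chi-square variable with integer df >= 1."""
    half = x / 2.0
    if df % 2 == 0:
        q, k = math.exp(-half), 2                  # exact tail area for df = 2
    else:
        q, k = math.erfc(math.sqrt(half)), 1       # exact tail area for df = 1
    while k < df:
        # move from k to k + 2 degrees of freedom
        q += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q

# Observed zodiac counts from Ex1, in the order listed in the example
observed = [23, 24, 20, 20, 21, 19, 18, 19, 18, 23, 22, 29]

n = sum(observed)                       # 256 executives
expected = n / len(observed)            # same expected count for every sign under H0
chi_sq = sum((o - expected) ** 2 / expected for o in observed)
df = len(observed) - 1
p_value = chi2_sf(chi_sq, df)

print(f"expected per sign = {expected:.2f}, chi-square = {chi_sq:.3f}, df = {df}")
print(f"p-value = {p_value:.3f}")
```

A large p-value here would mean the observed counts are consistent with all signs being equally likely; the conclusion still has to be written in context.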

p4

Ex2) A company says its premium mixture of nuts contains 10% Brazil nuts, 20% cashews, 20% almonds, 10% hazelnuts and 40% peanuts. You buy a large can and separate the nuts. Upon weighing them, you find there are 112 g of Brazil nuts, 183 g of cashews, 207 g of almonds, 71 g of hazelnuts, and 446 g of peanuts. You wonder whether your mix is significantly different from what the company advertises.
Why is the chi-square goodness-of-fit test NOT appropriate here?
What might you do instead of weighing the nuts in order to use chi-square?

Ex3) Offspring of certain fruit flies may have yellow or ebony bodies and normal wings or short wings. Genetic theory predicts that these traits will appear in the ratio 9:3:3:1 (yellow & normal, yellow & short, ebony & normal, ebony & short). A researcher checks 100 such flies and finds the distribution of traits to be 59, 20, 11, and 10, respectively. Are the results consistent with the theoretical distribution predicted by the genetic model?
First ask:
# variables?
# samples?
Type of Chi-Square Test?
What are the expected counts?
df?
Now to answer the question: Are the results consistent with the theoretical distribution predicted by the genetic model?

State

Plan

Do

Conclude
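For Ex3 the expected counts are not all equal, because they come from the 9:3:3:1 ratio. A minimal sketch of that step follows; the critical value 7.815 (df = 3, 5% level) is read from a standard chi-square table and is an assumption about the significance level, which the example does not state.

```python
# Observed fruit-fly counts, in the order given in Ex3
observed = [59, 20, 11, 10]
ratio = [9, 3, 3, 1]          # yellow&normal : yellow&short : ebony&normal : ebony&short

n = sum(observed)                                   # 100 flies
expected = [n * part / sum(ratio) for part in ratio]
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

critical_5pct = 7.815   # chi-square table value for df = 3, alpha = 0.05 (assumed level)

print("expected counts:", expected)
print(f"chi-square = {chi_sq:.3f} with df = {df}")
print("all expected counts at least 5:", all(e >= 5 for e in expected))
print("exceeds 5% critical value:", chi_sq > critical_5pct)
```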

p5

Type2: χ² Test for Independence

• Used with _____________________________ data from ONE sample (____ variables, _____ sample)
• Used to see if the two categorical variables are associated (____________) or not associated (__________)
• Assumptions & Calculations remain the same!

Hypothesis Statement for χ² Test for Independence

Hypothesis statement is written in words - be sure to write in context
Overall question: are the variables independent from each other, or do they depend on each other (one influences another)?
H0: <Two variables in context> are independent
Ha: <Two variables in context> are dependent

Ex4) A beef distributor wishes to determine whether there is a relationship between geographic region and cut of meat preferred. If there is no relationship, we will say that beef preference is independent of geographic region. Suppose that, in a random sample of 500 customers, 300 are from the North and 200 from the South. Also, 150 prefer cut A, 275 prefer cut B, and 75 prefer cut C. Suppose that in the actual sample of 500 consumers the observed numbers were as follows. Is there sufficient evidence to suggest that geographic regions and beef preference are not independent? (Is there a difference between the expected and observed counts?)

Observed Counts:

         North   South   Total
Cut A      100      50     150
Cut B      150     125     275
Cut C       50      25      75
Total      300     200     500

First ask:
# variables?
# samples?
Type of Chi-Square Test?
What are the expected counts?
df?

If beef preference is independent of geographic region, how would we expect this table to be filled in?

Expected Counts:

         North   South   Total
Cut A
Cut B
Cut C
Total

Trick to easily calculate expected counts from a two-way table: Assuming H0 is true,

expected count = (row total × column total) / table total

         North   South   Total
Cut A
Cut B
Cut C
Total

Degrees of freedom: df = (r - 1)(c – 1), or cover up one row & one column & count the number of cells remaining!
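The expected-count rule for a two-way table (row total times column total, divided by the table total) can be sketched with the Ex4 beef data. The variable names are illustrative only. The p-value line uses the fact that for df = 2 the chi-square upper-tail area is exactly exp(-χ²/2); that shortcut does not apply to other df.

```python
import math

# Observed counts from Ex4: each row is [North, South]
observed = {
    "Cut A": [100, 50],
    "Cut B": [150, 125],
    "Cut C": [50, 25],
}

row_totals = {cut: sum(counts) for cut, counts in observed.items()}
col_totals = [sum(counts[j] for counts in observed.values()) for j in range(2)]
table_total = sum(row_totals.values())

# expected count = row total * column total / table total
expected = {
    cut: [row_totals[cut] * col / table_total for col in col_totals]
    for cut in observed
}

chi_sq = sum(
    (o - e) ** 2 / e
    for cut in observed
    for o, e in zip(observed[cut], expected[cut])
)
df = (len(observed) - 1) * (len(col_totals) - 1)
p_value = math.exp(-chi_sq / 2)      # exact only because df = 2

for cut in observed:
    print(cut, expected[cut])
print(f"chi-square = {chi_sq:.3f}, df = {df}, p-value = {p_value:.4f}")
```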

p6

Now back to the question in Ex4: Is there sufficient evidence to suggest that geographic regions and beef preference are not independent? (Is there a difference between the expected and observed counts?)

State

Plan

Do

Conclude

p7

Type3: χ² Test for Homogeneity

Used with _____________________ variable from ________________________________________.

____ variables ____ samples;

Used to see if the two populations are the __________ as to the categorical variable. (homogeneous)

Assumptions & formula (calculations) remain the same!

Expected counts & df are found the same way as test for independence.

Only change is the hypotheses!

Hypothesis Statement for χ² Test for Homogeneity

Hypothesis statement is written in words - be sure to write in context
Overall question: Is the variable distributed the same between the 2 samples, or do the 2 samples differ for at least one value of the variable?
H0: the true category proportions are the same for all the populations
Ha: At least one of the category proportions is different among the populations
Note: the Null Hypothesis here is called “homogeneity of populations” - the populations are distributed the same as to the categorical variable.

Ex5) The following data is on drinking behavior for independently chosen random samples of male and female students. Does there appear to be a gender difference with respect to drinking behavior? (Note: low = 1-7 drinks/wk, moderate = 8-24 drinks/wk, high = 25 or more drinks/wk)

State

Plan

Do

Conclude

Men Women Total

None 140 186 326

Low 478 661 1139

Moderate 300 173 473

High 63 16 79

Total 981 1036 2017
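Before running the test in Ex5, it can help to compare the distribution of drinking level within each sample, since homogeneity asks whether those distributions match. The sketch below prints the two conditional distributions and then the chi-square statistic, with expected counts found the same way as for independence. Names such as `counts` are illustrative.

```python
# Ex5 counts as (men, women) for each drinking level
counts = {
    "None": (140, 186),
    "Low": (478, 661),
    "Moderate": (300, 173),
    "High": (63, 16),
}

men_total = sum(m for m, _ in counts.values())
women_total = sum(w for _, w in counts.values())
grand_total = men_total + women_total

# Proportion of each sample falling in each category
print(f"{'Level':<10}{'Men':>8}{'Women':>8}")
for level, (m, w) in counts.items():
    print(f"{level:<10}{m / men_total:8.3f}{w / women_total:8.3f}")

# Chi-square statistic: expected = row total * column total / table total
chi_sq = 0.0
for m, w in counts.values():
    row_total = m + w
    for obs, col_total in ((m, men_total), (w, women_total)):
        exp = row_total * col_total / grand_total
        chi_sq += (obs - exp) ** 2 / exp

df = (len(counts) - 1) * (2 - 1)
print(f"chi-square = {chi_sq:.2f} with df = {df}")
```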

p8

Summary of χ² Tests

No. of Samples

No. of Variables

df =

Expected Counts

Null and Alternative Hypotheses:

Goodness-of-Fit

Independence

Homogeneity

Assumptions: Calculations:

p9

Chapter 12 Day1 Notes: Review of Summarizing Bivariate Data

Go back and review your Unit3 notes – you will be tested on this material again! Make sure to review all topics including Topic4 on Transforming Data with nonlinear data.

Equation for Regression Line:

Ex1) It is difficult to accurately determine a person’s body fat percentage without immersing him or her in water. Researchers hoping to find ways to make a good estimate immersed 20 male subjects, and then measured their weights. (Bock, Velleman, DeVeaux, p. 160)
a) Find the LSRL, correlation coefficient, and coefficient of determination.
b) Explain the meaning of slope in the context of the problem.
c) Explain the meaning of the coefficient of determination in context.
d) Create a scatter plot and residual plot for the data.
e) What is the residual for the observation (173, 21)?
f) Is there sufficient evidence that weight can be used to predict body fat? Here we will need to use INFERENCE…. So let’s examine the sampling distribution for the regression slope, then look at the formulas for confidence intervals and hypothesis tests for the slope of a regression line.

Weight (lb) Body Fat (%)

175 6

181 21

200 15

159 6

196 22

192 31

205 32

173 21

187 25

188 30

188 10

240 20

175 22

168 9

246 38

160 10

215 27

159 12

146 10

219 28

Summary of Interpretations for a Regression: (replace the underlined items with correct values or words in context)
Slope: For each unit increase in x, there is an approximate increase/decrease of b in y.
y-intercept: y is approximately a when x is zero.
Correlation coefficient: There is a <direction>, <strength>, linear association between x and y.
Coefficient of determination: Approximately r²% of the variation in y can be explained by the LSRL of x & y.
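Parts a) and e) of Ex1 can be checked away from the calculator. The following is a minimal sketch using the textbook least-squares formulas (slope = Sxy/Sxx, intercept = ȳ − b·x̄); the variable names are illustrative and the data are typed from the table above.

```python
import math

# Ex1 data: weight (lb) and body fat (%) for the 20 subjects, in table order
weight = [175, 181, 200, 159, 196, 192, 205, 173, 187, 188,
          188, 240, 175, 168, 246, 160, 215, 159, 146, 219]
body_fat = [6, 21, 15, 6, 22, 31, 32, 21, 25, 30,
            10, 20, 22, 9, 38, 10, 27, 12, 10, 28]

n = len(weight)
x_bar = sum(weight) / n
y_bar = sum(body_fat) / n

sxx = sum((x - x_bar) ** 2 for x in weight)
syy = sum((y - y_bar) ** 2 for y in body_fat)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(weight, body_fat))

b = sxy / sxx                      # slope of the LSRL
a = y_bar - b * x_bar              # y-intercept of the LSRL
r = sxy / math.sqrt(sxx * syy)     # correlation coefficient
r_sq = r ** 2                      # coefficient of determination

# Residual = observed - predicted, for the observation (173, 21)
residual = 21 - (a + b * 173)

print(f"body fat-hat = {a:.4f} + {b:.4f} * weight")
print(f"r = {r:.4f}, r^2 = {r_sq:.4f}")
print(f"residual at (173, 21) = {residual:.3f}")
```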

p10

Chapter 12 Activity: Sampling Distributions for the Slope of the LSRL

Ex2) Suppose we have a population of adult female faculty. We record the height (in inches) and weight (in pounds) of each female. Below is the true regression line for the population in Table1 on the next page. Weight = -39.4211 + 2.8543997 Height

What is the true slope of the regression line for weight on height? To simulate the sampling distribution of the slope:

Randomly select 9 females (without replacement) from the population given in Table1 on the next page by using Randint(1,90) and put your results in the table below

Calculate the LSRL for the sample

Place the slope of the regression line for your sample on the class dot plot.

The above dot plot is the distribution of slopes from size n=9 samples – each dot represents the slope from a randomly generated sample of size 9 from the same population in Table1.

What is the mean and standard deviation of all slopes from your class sample of n=9 samples? How many samples of size n=9 was this based on?

Based on everything you’ve learned about sampling distributions, if you took all possible samples of size n=9 from the population in Table1, what would the mean of the sampling distribution of slopes be?
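The class activity can also be simulated. The sketch below is an assumption-laden stand-in for the calculator steps: it uses Python's `random.Random.sample` in place of Randint(1,90), a fixed seed chosen arbitrarily, and the Table1 values as transcribed here. Because it works from the table as printed, the population slope it computes may differ slightly from the stated 2.8543997.

```python
import random

# Table1 as (height, weight), persons 1-90 in order
population = [
    (64, 130), (63, 121), (62, 128), (65, 118), (67, 145), (68, 138),
    (67, 150), (63, 145), (68, 181), (64, 175), (59, 105), (65, 122),
    (62, 125), (67.5, 140), (64, 150), (63, 110), (70, 155), (71, 165),
    (64, 125), (62, 154), (64, 145), (66, 135), (67, 200), (66, 138),
    (66, 220), (65, 136), (63.5, 140), (66.5, 150), (59, 100), (62, 148),
    (64.5, 146), (68, 140), (66, 120), (61, 180), (65, 145), (67.5, 165),
    (63, 123), (67, 175), (66, 153), (64, 186), (61, 140), (66, 127),
    (66, 128), (67, 150), (66, 129), (60.5, 105), (64, 121), (68, 140),
    (67, 135), (67, 128), (68, 129), (65, 148), (67, 132), (66, 135),
    (61, 122), (62, 150), (66, 155), (65, 195), (61, 160), (64, 137),
    (65.5, 142), (67, 129), (64, 143), (63, 160), (65, 142), (66, 210),
    (68, 155), (64, 120), (67, 118), (64, 112), (68, 117), (66, 160),
    (67, 178), (65.5, 155), (70, 140), (62, 136), (65, 112), (64, 108),
    (65.5, 155), (68, 137), (66, 168), (65, 200), (64, 160), (68, 234),
    (60, 193), (66, 150), (68, 165), (69, 220), (66, 143), (67, 151),
]

def lsrl_slope(pairs):
    """Least-squares slope of weight on height for a list of (height, weight)."""
    n = len(pairs)
    x_bar = sum(h for h, _ in pairs) / n
    y_bar = sum(w for _, w in pairs) / n
    sxx = sum((h - x_bar) ** 2 for h, _ in pairs)
    sxy = sum((h - x_bar) * (w - y_bar) for h, w in pairs)
    return sxy / sxx

rng = random.Random(12)          # arbitrary fixed seed so the run is repeatable
slopes = []
while len(slopes) < 2000:
    sample = rng.sample(population, 9)        # 9 people, without replacement
    if len({h for h, _ in sample}) > 1:       # slope is undefined if all heights match
        slopes.append(lsrl_slope(sample))

mean_slope = sum(slopes) / len(slopes)
sd_slope = (sum((s - mean_slope) ** 2 for s in slopes) / (len(slopes) - 1)) ** 0.5
pop_slope = lsrl_slope(population)

print(f"population slope (from table as printed) = {pop_slope:.4f}")
print(f"mean of {len(slopes)} sample slopes = {mean_slope:.4f}, SD = {sd_slope:.4f}")
```

Each simulated slope corresponds to one dot on the class dot plot; the mean of the simulated slopes should land near the population slope.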

Number

Height

Weight

p11

Table1: Data set of female heights and weights:

Person # Ht Wt Person # Ht Wt Person # Ht Wt

1 64 130 31 64.5 146 61 65.5 142

2 63 121 32 68 140 62 67 129

3 62 128 33 66 120 63 64 143

4 65 118 34 61 180 64 63 160

5 67 145 35 65 145 65 65 142

6 68 138 36 67.5 165 66 66 210

7 67 150 37 63 123 67 68 155

8 63 145 38 67 175 68 64 120

9 68 181 39 66 153 69 67 118

10 64 175 40 64 186 70 64 112

11 59 105 41 61 140 71 68 117

12 65 122 42 66 127 72 66 160

13 62 125 43 66 128 73 67 178

14 67.5 140 44 67 150 74 65.5 155

15 64 150 45 66 129 75 70 140

16 63 110 46 60.5 105 76 62 136

17 70 155 47 64 121 77 65 112

18 71 165 48 68 140 78 64 108

19 64 125 49 67 135 79 65.5 155

20 62 154 50 67 128 80 68 137

21 64 145 51 68 129 81 66 168

22 66 135 52 65 148 82 65 200

23 67 200 53 67 132 83 64 160

24 66 138 54 66 135 84 68 234

25 66 220 55 61 122 85 60 193

26 65 136 56 62 150 86 66 150

27 63.5 140 57 66 155 87 68 165

28 66.5 150 58 65 195 88 69 220

29 59 100 59 61 160 89 66 143

30 62 148 60 64 137 90 67 151

Summary statistics:

Column Mean n Std. Dev. Median Min Max Q1 Q3

Height 65.21667 90 2.4904876 65.5 59 71 64 67

Weight 146.73334 90 26.972645 142.5 100 234 128 155

p12

Chapter 12 Notes: Inference for Slopes

Consider the height and weight of adult females (for example Table1). Think about what values of weight are possible for a height of 60 inches? For 64 inches? For 68 inches? What would the graph of this data look like?

Using different colors:
Draw the distribution of females who are 60” tall.
Draw the distribution of females who are 64” tall.
Draw the distribution of females who are 68” tall.

If each distribution (at each x value) was normally distributed, what would that look like on this graph?
We will want the standard deviations σ for each distribution (at each x-value) to be equal – this is called ____________________________. How will this affect the way you draw the normal distributions at each x?
Where is the LSRL in this graph? The LSRL should go through the center of each distribution.

Regression Model:

• The _______________________ μy has a _______________ relationship with x:

$\mu_y = \alpha + \beta x$

Where: slope β and intercept α are ____________ parameters

• For any fixed value of x, the __________ y varies according to a _____________________.

• Repeated responses of y are independent of each other.

• The ___________________________ of y (σy), at any given x, is the _________ for all values of x.

σy is also an _____________ parameter

Suppose we look at part of a population of adult women (shown in Table1). Let’s look at the women who are all 64 inches tall. Make a boxplot and normal probability plot (QQ plot). What distribution does their weight appear to have? So for any given height, or given x variable, we would like to assume that the y-values are approximately normally distributed.

[Blank graph: Weight (vertical axis) vs. Height (horizontal axis)]

p13

Now, how do we estimate the TRUE slope (TRUE regression line)?

To estimate $\mu_y = \alpha + \beta x$, we use $\hat{y} = a + bx$.

• The slope ______ of the LSRL is an unbiased estimator of the true slope _____.
• The intercept ______ of the LSRL is an unbiased estimator of the true intercept _______.
• The standard error ______ is an unbiased estimator of the true standard deviation of y values around the true regression line (_____) – see picture on previous page.

$$s = \sqrt{\frac{\sum \text{residuals}^2}{n-2}} = \sqrt{\frac{\sum (y - \hat{y})^2}{n-2}}$$

Note: df = n-2

• The estimated standard deviation of the statistic b ___________ is an unbiased estimator of the true standard deviation of the statistic b (_____). This formula is on the AP Statistics formula chart.

$$s_{b_1} = \frac{\sqrt{\dfrac{\sum (y_i - \hat{y}_i)^2}{n-2}}}{\sqrt{\sum (x_i - \bar{x})^2}} = \frac{s}{\sqrt{\sum (x_i - \bar{x})^2}}$$

Taking a closer look at this denominator, recall from the formula chart the formula for sx, the standard deviation of the x values found using 1-Var Stats.

So what is the difference between these two errors?
s or se is the standard deviation of the residuals about the LSRL.
sb or sb1 is the standard deviation of the slope.

[Figure: graph of Weight vs. Height]

Suppose you took many samples of the same size from this population & calculated the LSRL for each. Using the slope from each of these LSRLs – we can create a sampling distribution for the slope of the true LSRL.

Sampling Distribution of the LSRL Slope

[Figure: graph of Weight vs. Height]

σ is the true standard deviation of y-values (at each x) around the true regression line. s or se is the estimate of the true standard deviation σ. s is the magnitude of a typical sample deviation (residual) from the least-squares line (LSRL). The smaller the value of s, the closer the points in the sample fall to the line and the better the line does in predicting y from x.

s is the standard error about the line, a measure of the typical size of a residual.

We want the standard deviations σ at each x-value to be equal – called ___________________.

What shape will this distribution have? What is the mean of this sampling distribution? What is the standard deviation of this sampling distribution?

$$s_x = \sqrt{\frac{1}{n-1}\sum (x_i - \bar{x})^2} = \frac{\sqrt{\sum (x_i - \bar{x})^2}}{\sqrt{n-1}} \quad\Rightarrow\quad \sqrt{\sum (x_i - \bar{x})^2} = s_x\sqrt{n-1}$$

$$s_{b_1} = \frac{s}{s_x\sqrt{n-1}}$$
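The two routes to the standard error of the slope can be compared numerically. This is a minimal sketch using the Ex1 body fat data from the Day1 notes; it computes s from the residuals with df = n − 2, then sb both from the sum of squared x-deviations and from the calculator-friendly form s / (sx·√(n−1)). Variable names are illustrative.

```python
import math

# Ex1 data: weight (lb) and body fat (%)
weight = [175, 181, 200, 159, 196, 192, 205, 173, 187, 188,
          188, 240, 175, 168, 246, 160, 215, 159, 146, 219]
body_fat = [6, 21, 15, 6, 22, 31, 32, 21, 25, 30,
            10, 20, 22, 9, 38, 10, 27, 12, 10, 28]

n = len(weight)
x_bar = sum(weight) / n
y_bar = sum(body_fat) / n
sxx = sum((x - x_bar) ** 2 for x in weight)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(weight, body_fat))

b = sxy / sxx
a = y_bar - b * x_bar

# s: typical size of a residual, using df = n - 2
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(weight, body_fat))
s = math.sqrt(sse / (n - 2))

# sx: sample standard deviation of the x values (what 1-Var Stats reports)
s_x = math.sqrt(sxx / (n - 1))

sb_from_sum = s / math.sqrt(sxx)               # formula-chart form
sb_from_sx = s / (s_x * math.sqrt(n - 1))      # calculator-friendly form

print(f"s = {s:.5f}, sx = {s_x:.4f}")
print(f"sb (formula chart) = {sb_from_sum:.6f}, sb (using sx) = {sb_from_sx:.6f}")
```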

p14

Let’s summarize what we know about the regression model:
(1) For any x-value, the responses (y) are ____________________
(2) x & y have a _________ relationship with the _______ LSRL going thru ___________________
(3) _____ is the same for each x

Assumptions for Slope Inference (Regression Inference):

1. The observations are _________________. Check that you have an SRS
2. The true relationship is _______________. Check the scatter plot & residual plot
3. The __________________ of the response is constant. Check the scatter plot & residual plot
4. The responses vary ___________ about the true regression line. Check a histogram or boxplot of residuals

Hypotheses for Slope Inference (Regression Inference):
H0: β = 0, where β is the true slope of the LSRL of <in context>
Ha: β ≠ 0 (or β > 0 or β < 0, depending on the question)
Note that β = 0 implies that there is no relationship between x & y or that x should not be used to predict y.
What would the slope equal if there were a perfect relationship between x & y?

Note about Horizontal Lines

[Figure: graph of Weight vs. Height]

Suppose the LSRL is a horizontal line – would height be useful in predicting weight? What is the slope of a horizontal line? A slope of zero means that there is ______ relationship between x & y! If this was your LSRL, it means there’s _____ relationship between x and y, and x should ________ be used to predict y!

p15

Calculations - Confidence Interval for β:

$$b \pm t^* s_b$$

where t* is based on df = n-2 because there are 2 unknowns, α and β

CALCULATOR: STAT TEST LinRegTInt
Note: Calculator output only gives s, NOT sb. To find sb, use the following formula developed earlier

$$s_b = \frac{s}{\sqrt{\sum (x - \bar{x})^2}} = \frac{s}{s_x\sqrt{n-1}}$$

OR use

$$\text{upper CI} = b + t^*(s_b) \quad\Rightarrow\quad s_b = \frac{\text{upper CI} - b}{t^*}$$

Calculations – Hypothesis Testing for β: (H0: β = 0)

$$t = \frac{b - \beta}{s_b}$$

where, under H0: β = 0, this reduces to t = b / sb, with df = n-2

CALCULATOR: STAT TEST LinRegTTest
Note: Calculator output only gives s, NOT sb. To find sb, use the following formula developed earlier

$$s_b = \frac{s}{\sqrt{\sum (x - \bar{x})^2}} = \frac{s}{s_x\sqrt{n-1}}$$

OR use

$$s_b = \frac{b - \beta}{t}$$

Now let’s go back to Ex1) It is difficult to accurately determine a person’s body fat percentage without immersing him or her in water. Researchers hoping to find ways to make a good estimate immersed 20 male subjects, and then measured their weights. (Bock, Velleman, DeVeaux, p. 160) (Type the data in again if needed). Previously we found the LSRL, r, r², the equation of the regression line, and made scatter plots and residual plots – these are things you learned in Unit3. (Don’t forget to review this unit!). We can now answer the following question:
f) Is there sufficient evidence that weight can be used to predict body fat?

State

Plan

Do

Conclude

p16

g) Give a 95% confidence interval for the true slope of the LSRL.

State and Plan

Do

Conclude

h) Here is the computer-generated result from the data:
Find the following in the computer output:
What is the df?
What is the correlation coefficient?
What does “s” represent (in context)?
What do the numbers in the Estimate column represent?
What is sb and what does it represent (in context)?

Parameter Estimate Std. Err.

Intercept -27.376263 11.547428

Weight 0.24987414 0.060653996

Sample size: 20

R-square = 48.53%

s = 7.0491323
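The test statistic and the 95% confidence interval for the slope follow directly from the Estimate and Std. Err. entries in this output. A minimal sketch is below; the critical value t* = 2.101 is read from a t table for df = 18 and is the only number not taken from the output.

```python
# Values read from the computer output for Ex1
b = 0.24987414          # Estimate for Weight (sample slope)
s_b = 0.060653996       # Std. Err. for Weight (standard error of the slope)
n = 20                  # Sample size

df = n - 2

# Test statistic for H0: beta = 0
t = (b - 0) / s_b

# 95% confidence interval: b +/- t* * sb
t_star = 2.101          # from a t table, df = 18, 95% confidence (assumed lookup)
margin = t_star * s_b
lower, upper = b - margin, b + margin

print(f"df = {df}, t = {t:.3f}")
print(f"95% CI for the true slope: ({lower:.4f}, {upper:.4f})")
```

An interval that lies entirely above zero agrees with rejecting H0: β = 0 in a two-sided test at the 5% level.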

p17

Ex3) The decline of salmon fisheries along the Columbia River in Oregon has caused great concern among commercial and recreational fishermen. The paper “Feeding of Predaceous Fishes on Out-Migrating Juvenile Salmonids in the John Day Reservoir, Columbia River” (Trans. Amer. Fisheries Soc. (1991): 405-420) gave the accompanying data on y = maximum size of salmonids consumed by northern squawfish (the most abundant salmonid predator) and x = squawfish length, both in mm. Use the computer output to answer the questions.

LSRL:______________________
r = _______ r² = _______
s = _______ sb = _______
t = _______ df = _______ p-value = ________

Calculate a 95% confidence interval for the slope of the true regression line:

What is the maximum size for salmonids consumed by a squawfish whose length is 375 mm, and what is the residual corresponding to the observation (375,165)?

Can you use squawfish length to estimate the maximum size of salmonids consumed by northern squawfish? Explain.

Predictor Coef Stdev t-ratio p

Constant -89.09 16.83 -5.29 0.000

Length 0.72907 0.04778 15.26 0.000

s = 12.56 R-sq = 96.3% R-sq(adj) = 95.9%

Analysis of Variance

Source DF SS MS F p

Regression 1 36736 36736 232.87 0.000

Error 9 1420 158

Total 10 38156
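Several of the blanks in Ex3 can be read or derived from the output above. The sketch below shows the arithmetic for the prediction at 375 mm, the residual for (375, 165), r from R-sq, and a check that the t-ratio is the coefficient divided by its standard deviation. All inputs are copied from the output; the variable names are illustrative.

```python
import math

# Values copied from the Ex3 computer output
intercept = -89.09
slope = 0.72907
se_slope = 0.04778      # "Stdev" entry for Length, i.e. sb
r_sq = 0.963            # R-sq = 96.3%
df_error = 9            # Error row of the Analysis of Variance table

n = df_error + 2        # because df = n - 2

# Prediction and residual for a 375 mm squawfish, observed y = 165
predicted = intercept + slope * 375
residual = 165 - predicted

# r takes the sign of the slope
r = math.copysign(math.sqrt(r_sq), slope)

# t-ratio for the slope; its square should be close to the F statistic
t = slope / se_slope

print(f"n = {n}, predicted = {predicted:.2f} mm, residual = {residual:.2f} mm")
print(f"r = {r:.4f}, t = {t:.2f}, t^2 = {t ** 2:.1f}")
```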

p18

AP Statistics Name: ____________________
12.2 Transforming Data Date: ____________ Per: ____

Recall from other math classes:
Linear Function: y = mx + b
Quadratic Function: y = x²
Cubic Function: y = x³
Exponential Function: y = b^x
Power Function: y = x^b

What to do if data is non-linear (flowchart):
1. Transform data
• x & log y
• log x & log y
• x & √y
• x & 1/y
• others
2. Calculate LSRL
3. Is residual plot scattered? Yes: Appropriate model! No: go back and transform the data another way.

Exponential Models: (x, y) → (x, ln y) or (x, y) → (x, log y)
Logarithmic Models: (x, y) → (ln x, y) or (x, y) → (log x, y)
Power Models: (x, y) → (ln x, ln y) or (x, y) → (log x, log y)

Ex1) The population of Mexico, in millions of people, from 1921 through 1990 is given in the following table:

Year    Population (in millions)
1921    14.355
1930    16.553
1940    19.654
1950    25.791
1960    34.923
1970    48.225
1980    66.847
1990    81.250

a) Draw a scatterplot. Does the overall pattern suggest a linear relationship?

b) What is the least squares regression line and correlation coefficient?

c) Sketch the residual plot. What does it suggest?

d) Now use transformation to “linearize” the exponential model: Exponential model: (x, y) → (x, ln y)

e) What does the scatterplot, the least squares regression line and correlation coefficient show?

f) Examine the residual plot. What patterns do you observe?

p19

g) Use the new model to estimate the population of Mexico in 1993. How accurate do you think your prediction is going to be? Explain.

h) The 1993 Information Please Environmental Almanac reports the population of Mexico to be 88,598,000. How close is your estimate?

i) The Almanac also reports the growth rate to be between 2.2% and 2.3%. What percentage growth rate does the exponential model represent? What does this tell you about using the exponential model to make predictions?
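Parts d), g) and i) can be worked through with a short script. This is a sketch under two assumptions: the table is read as Year and Population (in millions) pairs, and the fitted model is ln(population) = a + b·year, so the back-transformed prediction is e^(a + b·year) and the implied annual growth rate is e^b − 1.

```python
import math

# Mexico data from Ex1: year and population in millions
years = [1921, 1930, 1940, 1950, 1960, 1970, 1980, 1990]
population = [14.355, 16.553, 19.654, 25.791, 34.923, 48.225, 66.847, 81.250]

# Transform (x, y) -> (x, ln y) to linearize an exponential pattern
ln_pop = [math.log(p) for p in population]

n = len(years)
x_bar = sum(years) / n
y_bar = sum(ln_pop) / n
sxx = sum((x - x_bar) ** 2 for x in years)
syy = sum((y - y_bar) ** 2 for y in ln_pop)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, ln_pop))

b = sxy / sxx                       # slope of ln(population) on year
a = y_bar - b * x_bar               # intercept
r = sxy / math.sqrt(sxx * syy)      # correlation on the transformed scale

# Back-transform to predict 1993 and to find the implied yearly growth rate
pred_1993 = math.exp(a + b * 1993)
growth_rate = math.exp(b) - 1

print(f"ln(pop)-hat = {a:.3f} + {b:.5f} * year, r = {r:.4f}")
print(f"predicted 1993 population = {pred_1993:.1f} million")
print(f"implied annual growth rate = {growth_rate:.2%}")
```

Comparing the implied growth rate with the Almanac's reported range is one way to judge how far the model can be trusted for prediction.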

Ex2) Atlantic Ocean Rockfish example. Find an appropriate model (examine x vs. y; x vs. log y; log x vs. log y). Using the chosen model, predict the weight of a fish that is 20 cm long.

Length (cm)   Weight (g)
5.2           2
8.5           8
11.5          21
14.3          38
16.8          69
19.2          117
21.3          148
23.3          190
25            264
26.7          293
28.2          318
29.6          371
30.8          455
32            504
33            518
34            537
34.9          651
36.4          719
37.1          726
37.7          810
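The three candidate models in Ex2 can be compared side by side. The sketch below fits an LSRL to each version of the data and reports r², then uses the log x vs. log y (power) fit to predict the weight at 20 cm. The helper `fit_line` is a hypothetical name introduced for this example, and r² alone does not settle the choice: the residual plot for each model still needs to be examined.

```python
import math

# Rockfish data from Ex2
length = [5.2, 8.5, 11.5, 14.3, 16.8, 19.2, 21.3, 23.3, 25, 26.7,
          28.2, 29.6, 30.8, 32, 33, 34, 34.9, 36.4, 37.1, 37.7]
weight = [2, 8, 21, 38, 69, 117, 148, 190, 264, 293,
          318, 371, 455, 504, 518, 537, 651, 719, 726, 810]

def fit_line(xs, ys):
    """Return (intercept, slope, r squared) of the least-squares line of ys on xs."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope, sxy ** 2 / (sxx * syy)

log_len = [math.log10(x) for x in length]
log_wt = [math.log10(y) for y in weight]

models = {
    "x vs. y": fit_line(length, weight),
    "x vs. log y": fit_line(length, log_wt),
    "log x vs. log y": fit_line(log_len, log_wt),
}
for name, (a, b, r_sq) in models.items():
    print(f"{name:<16} intercept = {a:9.4f}  slope = {b:8.4f}  r^2 = {r_sq:.4f}")

# Prediction at 20 cm from the power model: log y-hat = a + b * log x
a_pow, b_pow, r_sq_pow = models["log x vs. log y"]
pred_20 = 10 ** (a_pow + b_pow * math.log10(20))
print(f"power-model prediction at 20 cm = {pred_20:.0f} g")
```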
