ap statistics - independence high school - home · 12.1 inference for linear regression – day2...
AP Statistics Chapter 11: Chi-Square Inference Chapter 12: Regression Inference
Name:__________________Period: _____
20 A Day 21 B Day 22 A Day 23 B Day 24 A Day
11.1 Chi-Square Tests for Goodness of Fit (2-4) Page 693: 1, 3, 5, 7, 9, 11, 15, 17, 19–24
11.1 Skill Check 11.2 Inference for Two-Way Tables (5-8) Page 724: 27, 29, 31, 33, 35, 37, 39, 45, 47, 49, 51–55
11.2 Skill Check 12.1 Inference for Linear Regression – Day1 Ex: 1, 3, 5, 7, 9
27 B Day 28 A Day 29 B Day 30 A Day 31 B Day
11.2 Skill Check 12.1 Inference for Linear Regression – Day1 Page 759: 1, 3, 5, 7, 9
12.1 Day 1 Skill Check 12.1 Inference for Linear Regression – Day2 12.1 Exercises:
Page 759: 11, 13, 15, 17, 19–24
12.1 Day 2 Skill Check 12.2 Transforming to Achieve Linearity 12.2 Exercises: Page 785: 31, 33, 35, 37, 39, 41, 43, 45, 47–50
3 A Day 4 B Day 5 A Day 6 B Day 7 A Day
Ch. 11– FRAPPY! Ch. 12– FRAPPY! Chapter 11 Review Problems Chapter 12 Review Problems
Chapters 11 and 12 Test
p2
Chapter 11 Notes: Chi-Square (χ²) Test
What is the χ² test used for?
Used to test the __________ of __________________ data
Types of χ² tests:
Type1: Goodness of fit univariate data _____ variable _____ sample
Type2: Independence bivariate data _____ variables _____ sample
Type3: Homogeneity univariate data with 2 samples _____ variable _____ samples
χ² Distributions:
• Different _____ have different curves
• Skewed _______
• As df increases, curve shifts toward right & becomes more like a _______________
χ² Assumptions for all 3 types of χ² tests:
(1) _________ – reasonably random sample
(2) All expected counts are at least 5. <Must list expected counts to demonstrate this assumption is met!>
Note: This 2nd assumption is a combination of the following 2 statements:
- Have ___________ of categorical data & we expect each category to happen at least once
- The _________________ is large enough we should expect at least five in each category.
χ² Calculations for all 3 types of χ² tests:
p3
Type1: χ² Goodness-of-fit test
• Uses univariate data: _________ variable _______ sample
• Use when you want to see how well the ____________ counts “fit” what we _____________ the counts to be
• Use χ²cdf function on the calculator to find p-value
• df = number of categories – 1
Hypothesis Statement for χ² Goodness-of-fit test
Hypothesis statement is written in words - be sure to write in context
Overall question: are the proportions of each variable as expected, or is at least one not as expected?
H0: The proportions of <in context> are equal (OR are the same, are as expected, are the same as the theoretical model)
Ha: At least one of the proportions of <in context> is not the same (OR is different, is not as expected, is not the same as the theoretical model)
Ex1) Does your zodiac sign determine how successful you will be? Fortune magazine collected the zodiac signs of 256 heads of the largest 400 companies. Is there sufficient evidence to claim that successful people are more likely to be born under some signs than others?
Aries 23       Aquarius 24      Leo 20
Taurus 20      Scorpio 21       Virgo 19
Gemini 18      Sagittarius 19   Libra 18
Cancer 23      Capricorn 22     Pisces 29
# variables? # samples? Type of Chi-Square Test?
How many would you expect in each sign if there were no difference between them?
How many df?
State
Plan
Do
Conclude
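The "Do" step for Ex1 can be sketched in a few lines of Python. This is only an illustration of the goodness-of-fit arithmetic, not the worksheet's required calculator method. The critical value 19.675 (df = 11, α = 0.05) is taken from a standard χ² table and is an assumption not stated in these notes; the variable names are made up for the example.

```python
# Observed counts for the 12 zodiac signs in Ex1.
observed = {
    "Aries": 23, "Aquarius": 24, "Leo": 20, "Taurus": 20,
    "Scorpio": 21, "Virgo": 19, "Gemini": 18, "Sagittarius": 19,
    "Libra": 18, "Cancer": 23, "Capricorn": 22, "Pisces": 29,
}

n = sum(observed.values())      # total number of executives
k = len(observed)               # number of categories
expected = n / k                # under H0 every sign has the same expected count

# Chi-square statistic: sum over categories of (observed - expected)^2 / expected.
chi_square = sum((count - expected) ** 2 / expected for count in observed.values())
df = k - 1                      # df = number of categories - 1

# Assumption: table critical value for df = 11 at alpha = 0.05.
CRITICAL_VALUE = 19.675
reject_h0 = chi_square > CRITICAL_VALUE

print(f"expected per sign = {expected:.2f}, chi-square = {chi_square:.3f}, df = {df}")
print("reject H0" if reject_h0 else "fail to reject H0")
```

Comparing the statistic with a critical value gives the same decision as comparing the p-value from χ²cdf with α.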
p4
Ex2) A company says its premium mixture of nuts contains 10% Brazil nuts, 20% cashews, 20% almonds, 10% hazelnuts and 40% peanuts. You buy a large can and separate the nuts. Upon weighing them, you find there are 112 g of Brazil nuts, 183 g of cashews, 207 g of almonds, 71 g of hazelnuts, and 446 g of peanuts. You wonder whether your mix is significantly different from what the company advertises.
Why is the chi-square goodness-of-fit test NOT appropriate here?
What might you do instead of weighing the nuts in order to use chi-square?
Ex3) Offspring of certain fruit flies may have yellow or ebony bodies and normal wings or short wings. Genetic theory predicts that these traits will appear in the ratio 9:3:3:1 (yellow & normal, yellow & short, ebony & normal, ebony & short). A researcher checks 100 such flies and finds the distribution of traits to be 59, 20, 11, and 10, respectively. Are the results consistent with the theoretical distribution predicted by the genetic model?
First ask: # variables? # samples? Type of Chi-Square Test? What are the expected counts? df?
Now to answer the question: Are the results consistent with the theoretical distribution predicted by the genetic model?
State
Plan
Do
Conclude
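For Ex3 the expected counts are not equal, so the calculation is worth sketching separately. The helper `chi2_upper_tail_df3` below is a hypothetical name; it uses a closed-form expression for the upper-tail area that holds only when df = 3, standing in for the calculator's χ²cdf.

```python
import math

observed = [59, 20, 11, 10]     # yellow & normal, yellow & short, ebony & normal, ebony & short
ratio = [9, 3, 3, 1]            # ratio predicted by the genetic model

n = sum(observed)
expected = [n * part / sum(ratio) for part in ratio]

# Every expected count must be at least 5 for the chi-square test to be appropriate.
assumption_met = all(e >= 5 for e in expected)

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1


def chi2_upper_tail_df3(x):
    """P(chi-square > x) for exactly 3 degrees of freedom (closed form)."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


p_value = chi2_upper_tail_df3(chi_square)

print("expected counts:", expected)
print(f"chi-square = {chi_square:.3f}, df = {df}, p-value = {p_value:.4f}")
```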
p5
Type2: χ² Test for Independence
• Used with _____________________________ data from ONE sample (____ variables, _____ sample)
• Used to see if the two categorical variables are associated (____________) or not associated (__________)
• Assumptions & Calculations remain the same!
Hypothesis Statement for χ² Test for Independence
Hypothesis statement is written in words - be sure to write in context
Overall question: are the variables independent from each other, or do they depend on each other (one influences another)?
H0: <Two variables in context> are independent
Ha: <Two variables in context> are dependent
Ex4) A beef distributor wishes to determine whether there is a relationship between geographic region and cut of meat preferred. If there is no relationship, we will say that beef preference is independent of geographic region. Suppose that, in a random sample of 500 customers, 300 are from the North and 200 from the South. Also, 150 prefer cut A, 275 prefer cut B, and 75 prefer cut C. Suppose that in the actual sample of 500 consumers the observed numbers were as follows.
        North  South  Total
Cut A   100    50     150
Cut B   150    125    275
Cut C   50     25     75
Total   300    200    500
Is there sufficient evidence to suggest that geographic regions and beef preference are not independent? (Is there a difference between the expected and observed counts?)
First ask: # variables? # samples? Type of Chi-Square Test? What are the expected counts? df?
If beef preference is independent of geographic region, how would we expect this table to be filled in?
        North  South  Total
Cut A
Cut B
Cut C
Total
Expected Counts:
        North  South  Total
Cut A
Cut B
Cut C
Total
Trick to easily calculate expected counts from a two-way table:
Assuming H0 is true, expected count = (row total × column total) / table total
Degrees of freedom: df = (r - 1)(c – 1), or cover up one row & one column & count the number of cells remaining!
p6
Now back to the question in Ex4: Is there sufficient evidence to suggest that geographic regions and beef preference are not independent? (Is there a difference between the expected and observed counts?)
State
Plan
Do
Conclude
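The expected-count rule (row total times column total, divided by table total) can be sketched for the Ex4 table as follows. The dictionary layout and names are illustrative choices, and the p-value line uses a shortcut that is valid only for df = 2, where the upper-tail area of χ² equals exp(-x/2).

```python
import math

# Observed counts from the Ex4 table.
observed = {
    "Cut A": {"North": 100, "South": 50},
    "Cut B": {"North": 150, "South": 125},
    "Cut C": {"North": 50, "South": 25},
}
regions = ["North", "South"]

row_totals = {cut: sum(row.values()) for cut, row in observed.items()}
col_totals = {reg: sum(observed[cut][reg] for cut in observed) for reg in regions}
table_total = sum(row_totals.values())

# Expected count = (row total * column total) / table total, assuming H0 (independence).
expected = {
    cut: {reg: row_totals[cut] * col_totals[reg] / table_total for reg in regions}
    for cut in observed
}

chi_square = sum(
    (observed[cut][reg] - expected[cut][reg]) ** 2 / expected[cut][reg]
    for cut in observed
    for reg in regions
)
df = (len(observed) - 1) * (len(regions) - 1)

# Shortcut valid only because df = 2 here.
p_value = math.exp(-chi_square / 2)

for cut in observed:
    print(cut, expected[cut])
print(f"chi-square = {chi_square:.3f}, df = {df}, p-value = {p_value:.4f}")
```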
p7
Type3: χ² Test for Homogeneity
Used with _____________________ variable from ________________________________________.
____ variables ____ samples;
Used to see if the two populations are the __________ as to the categorical variable. (homogeneous)
Assumptions & formula (calculations) remain the same!
Expected counts & df are found the same way as test for independence.
Only change is the hypotheses!
Hypothesis Statement for χ² Test for Homogeneity
Hypothesis statement is written in words - be sure to write in context
Overall question: Is the variable distributed the same between the 2 samples, or do the 2 samples differ for at least one value of the variable?
H0: the true category proportions are the same for all the populations
Ha: At least one of the category proportions is different among the populations
Note: the Null Hypothesis here is called “homogeneity of populations” - the populations are distributed the same as to the categorical variable.
Ex5) The following data is on drinking behavior for independently chosen random samples of male and female students. Does there appear to be a gender difference with respect to drinking behavior? (Note: low = 1-7 drinks/wk, moderate = 8-24 drinks/wk, high = 25 or more drinks/wk)
State
Plan
Do
Conclude
Men Women Total
None 140 186 326
Low 478 661 1139
Moderate 300 173 473
High 63 16 79
Total 981 1036 2017
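Because the homogeneity test reuses the independence calculations, one small function can handle any two-way table. The sketch below applies it to the Ex5 counts. `chi_square_two_way` and `chi2_upper_tail_df3` are hypothetical helper names, and the p-value formula is a closed form that applies only to df = 3.

```python
import math


def chi_square_two_way(table):
    """Return (statistic, df, expected counts) for a two-way table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df, expected


def chi2_upper_tail_df3(x):
    """P(chi-square > x) for exactly 3 degrees of freedom (closed form)."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


# Rows: None, Low, Moderate, High. Columns: Men, Women.
drinking = [[140, 186], [478, 661], [300, 173], [63, 16]]

chi_square, df, expected = chi_square_two_way(drinking)
smallest_expected = min(min(row) for row in expected)   # check the "at least 5" assumption
p_value = chi2_upper_tail_df3(chi_square)

print(f"chi-square = {chi_square:.2f}, df = {df}, smallest expected = {smallest_expected:.1f}")
print(f"p-value = {p_value:.3g}")
```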
p8
Summary of χ² Tests
                                   Goodness-of-Fit    Independence    Homogeneity
No. of Samples
No. of Variables
df =
Expected Counts
Null and Alternative Hypotheses:
Assumptions:
Calculations:
p9
Chapter 12 Day1 Notes: Review of Summarizing Bivariate Data
Go back and review your Unit3 notes – you will be tested on this material again! Make sure to review all topics including Topic4 on Transforming Data with nonlinear data.
Equation for Regression Line:
Ex1) It is difficult to accurately determine a person’s body fat percentage without immersing him or her in water. Researchers hoping to find ways to make a good estimate immersed 20 male subjects, and then measured their weights. (Bock, Velleman, DeVeaux, p. 160)
a) Find the LSRL, correlation coefficient, and coefficient of determination.
b) Explain the meaning of slope in the context of the problem.
c) Explain the meaning of the coefficient of determination in context.
d) Create a scatter plot and residual plot for the data.
e) What is the residual for the observation (173, 21)?
f) Is there sufficient evidence that weight can be used to predict body fat?
Here we will need to use INFERENCE…. So let’s examine the sampling distribution for the regression slope, then look at the formulas for confidence intervals and hypothesis tests for the slope of a regression line.
Weight (lb) Body Fat (%)
175 6
181 21
200 15
159 6
196 22
192 31
205 32
173 21
187 25
188 30
188 10
240 20
175 22
168 9
246 38
160 10
215 27
159 12
146 10
219 28
Summary of Interpretations for a Regression: (replace the underlined items with correct values or words in context)
Slope: For each unit increase in x, there is an approximate increase/decrease of b in y.
y-intercept: y is approximately a when x is zero.
Correlation coefficient: There is a direction, strength, linear association between x and y.
Coefficient of determination: Approximately r²% of the variation in y can be explained by the LSRL of x & y.
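Parts a) and e) of Ex1 can be checked with a short script instead of the calculator. The sketch below computes the LSRL from the sums of squares; the variable names are illustrative, and only the Python standard library is used.

```python
import math
import statistics

# Ex1 data: weight (lb) and body fat (%) for the 20 subjects.
weight = [175, 181, 200, 159, 196, 192, 205, 173, 187, 188,
          188, 240, 175, 168, 246, 160, 215, 159, 146, 219]
body_fat = [6, 21, 15, 6, 22, 31, 32, 21, 25, 30,
            10, 20, 22, 9, 38, 10, 27, 12, 10, 28]

x_bar = statistics.fmean(weight)
y_bar = statistics.fmean(body_fat)

sxx = sum((x - x_bar) ** 2 for x in weight)
syy = sum((y - y_bar) ** 2 for y in body_fat)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(weight, body_fat))

b = sxy / sxx                       # slope of the LSRL
a = y_bar - b * x_bar               # y-intercept of the LSRL
r = sxy / math.sqrt(sxx * syy)      # correlation coefficient
r_squared = r ** 2                  # coefficient of determination

# Residual = observed y - predicted y for the observation (173, 21).
residual = 21 - (a + b * 173)

print(f"body fat-hat = {a:.4f} + {b:.4f} * weight")
print(f"r = {r:.4f}, r^2 = {r_squared:.4f}, residual at (173, 21) = {residual:.3f}")
```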
p10
Chapter 12 Activity: Sampling Distributions for the Slope of the LSRL
Ex2) Suppose we have a population of adult female faculty. We record the height (in inches) and weight (in pounds) of each female. Below is the true regression line for the population in Table1 on the next page. Weight = -39.4211 + 2.8543997 Height
What is the true slope of the regression line for weight on height? To simulate the sampling distribution of the slope:
Randomly select 9 females (without replacement) from the population given in Table1 on the next page by using Randint(1,90) and put your results in the table below
Calculate the LSRL for the sample
Place the slope of the regression line for your sample on the class dot plot.
The above dot plot is the distribution of slopes from size n=9 samples – each dot represents the slope from a randomly generated sample of size 9 from the same population in Table1.
What is the mean and standard deviation of all slopes from your class sample of n=9 samples? How many samples of size n=9 was this based on?
Based on everything you’ve learned about sampling distributions, if you took all possible samples of size n=9 from the population in Table1, what would the mean of the sampling distribution of slopes be?
Number    Height    Weight
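The class activity can also be imitated in code. The sketch below does NOT use Table1: it builds a hypothetical stand-in population of 90 women scattered around the stated line Weight = -39.4211 + 2.8543997 Height, with an assumed height spread of 2.5 in and an assumed scatter of 25 lb about the line. It then draws many samples of size 9 without replacement and compares the mean of the sample slopes with the slope of that stand-in population.

```python
import random
import statistics


def lsrl_slope(xs, ys):
    """Slope of the least-squares regression line of ys on xs."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(12)   # fixed seed so the run is repeatable

# Hypothetical population (assumed spreads, not the real Table1 data).
population = []
for _ in range(90):
    height = rng.gauss(65.2, 2.5)
    weight = -39.4211 + 2.8543997 * height + rng.gauss(0, 25)
    population.append((height, weight))

pop_slope = lsrl_slope([h for h, _ in population], [w for _, w in population])

# Each pass mimics one student: pick 9 people without replacement, fit the LSRL.
sample_slopes = []
for _ in range(2000):
    sample = rng.sample(population, 9)
    sample_slopes.append(lsrl_slope([h for h, _ in sample], [w for _, w in sample]))

mean_slope = statistics.fmean(sample_slopes)
sd_slope = statistics.stdev(sample_slopes)

print(f"slope of the stand-in population: {pop_slope:.3f}")
print(f"mean of {len(sample_slopes)} sample slopes: {mean_slope:.3f}, SD: {sd_slope:.3f}")
```

Individual sample slopes vary a lot with n = 9, but their mean should land close to the population slope, which is the point of the dot plot.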
p11
Table1: Data set of female heights and weights:
Person # Ht Wt Person # Ht Wt Person # Ht Wt
1 64 130 31 64.5 146 61 65.5 142
2 63 121 32 68 140 62 67 129
3 62 128 33 66 120 63 64 143
4 65 118 34 61 180 64 63 160
5 67 145 35 65 145 65 65 142
6 68 138 36 67.5 165 66 66 210
7 67 150 37 63 123 67 68 155
8 63 145 38 67 175 68 64 120
9 68 181 39 66 153 69 67 118
10 64 175 40 64 186 70 64 112
11 59 105 41 61 140 71 68 117
12 65 122 42 66 127 72 66 160
13 62 125 43 66 128 73 67 178
14 67.5 140 44 67 150 74 65.5 155
15 64 150 45 66 129 75 70 140
16 63 110 46 60.5 105 76 62 136
17 70 155 47 64 121 77 65 112
18 71 165 48 68 140 78 64 108
19 64 125 49 67 135 79 65.5 155
20 62 154 50 67 128 80 68 137
21 64 145 51 68 129 81 66 168
22 66 135 52 65 148 82 65 200
23 67 200 53 67 132 83 64 160
24 66 138 54 66 135 84 68 234
25 66 220 55 61 122 85 60 193
26 65 136 56 62 150 86 66 150
27 63.5 140 57 66 155 87 68 165
28 66.5 150 58 65 195 88 69 220
29 59 100 59 61 160 89 66 143
30 62 148 60 64 137 90 67 151
Summary statistics:
Column Mean n Std. Dev. Median Min Max Q1 Q3
Height 65.21667 90 2.4904876 65.5 59 71 64 67
Weight 146.73334 90 26.972645 142.5 100 234 128 155
p12
Chapter 12 Notes: Inference for Slopes
Consider the height and weight of adult females (for example Table1). Think about what values of weight are possible for a height of 60 inches? For 64 inches? For 68 inches?
What would the graph of this data look like?
Using different colors:
Draw the distribution of females who are 60” tall.
Draw the distribution of females who are 64” tall.
Draw the distribution of females who are 68” tall.
If each distribution (at each x value) was normally distributed, what would that look like on this graph?
We will want the standard deviations σ for each distribution (at each x-value) to be equal – this is called ____________________________. How will this affect the way you draw the normal distributions at each x?
Where is the LSRL in this graph? The LSRL should go through the center of each distribution.
Regression Model:
• The _______________________ μy has a _______________ relationship with x: μy = α + βx
Where: slope β and intercept α are ____________ parameters
• For any fixed value of x, the __________ y varies according to a _____________________.
• Repeated responses of y are independent of each other.
• The ___________________________ of y (σy), at any given x, is the _________ for all values of x.
σy is also an _____________ parameter
Suppose we look at part of a population of adult women (shown in Table1). Let’s look at the women who are all 64 inches tall. Make a boxplot and normal probability plot (QQ plot). What distribution does their weight appear to have?
So for any given height, or given x variable, we would like to assume that the y-values are approximately normally distributed.
[Graph: Weight vs. Height]
p13
Now, how do we estimate the TRUE slope (TRUE regression line)?
To estimate μy = α + βx, we use ŷ = a + bx.
• The slope ______ of the LSRL is an unbiased estimator of the true slope _____.
• The intercept ______ of the LSRL is an unbiased estimator of the true intercept _______.
• The standard error ______ is an unbiased estimator of the true standard deviation of y values around the true regression line (_____) – see picture on previous page.
$s = \sqrt{\frac{\sum \text{residuals}^2}{n-2}} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n-2}}$
Note: df = n-2
• The estimated standard deviation of the statistic b ___________ is an unbiased estimator of the true standard deviation of the statistic b (_____).
$s_b = \frac{s}{\sqrt{\sum (x_i - \bar{x})^2}} = \frac{\sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n-2}}}{\sqrt{\sum (x_i - \bar{x})^2}}$
This formula is on the AP Statistics formula chart.
Taking a closer look at this denominator, recall from the formula chart:
$s_x = \sqrt{\frac{1}{n-1}\sum (x_i - \bar{x})^2} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}$
where sx is the standard deviation of the x variables found using 1-Var Stats, so
$\sqrt{\sum (x_i - \bar{x})^2} = s_x\sqrt{n-1}$ and $s_{b_1} = \frac{s}{s_x\sqrt{n-1}}$
So what is the difference between these two errors?
s or se is the standard deviation of the LSRL
sb or sb1 is the standard deviation of the slope.
Suppose you took many samples of the same size from this population & calculated the LSRL for each. Using the slope from each of these LSRLs – we can create a sampling distribution for the slope of the true LSRL.
[Graph: Sampling Distribution of the LSRL Slope; Weight vs. Height]
What shape will this distribution have?
What is the mean of this sampling distribution?
What is the standard deviation of this sampling distribution?
σ is the true standard deviation of y-values (at each x) around the true regression line.
s or se is the estimate of the true standard deviation σ.
s is the magnitude of a typical sample deviation (residual) from the least-squares line (LSRL). The smaller the value of s, the closer the points in the sample fall to the line and the better the line does in predicting y from x.
s is the standard error about the line, a measure of the typical size of a residual.
We want the standard deviations σ at each x-value to be equal – called ___________________.
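To see the two standard errors side by side, the sketch below computes s (about the line) and sb (of the slope) for the Ex1 body fat data, using both the formula-chart form and the sx√(n-1) shortcut. Variable names are illustrative; `statistics.stdev` plays the role of sx from 1-Var Stats.

```python
import math
import statistics

# Ex1 data: weight (lb) and body fat (%) for the 20 subjects.
weight = [175, 181, 200, 159, 196, 192, 205, 173, 187, 188,
          188, 240, 175, 168, 246, 160, 215, 159, 146, 219]
body_fat = [6, 21, 15, 6, 22, 31, 32, 21, 25, 30,
            10, 20, 22, 9, 38, 10, 27, 12, 10, 28]

n = len(weight)
x_bar = statistics.fmean(weight)
y_bar = statistics.fmean(body_fat)
sxx = sum((x - x_bar) ** 2 for x in weight)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(weight, body_fat))

b = sxy / sxx
a = y_bar - b * x_bar

# s: typical size of a residual, with df = n - 2.
residuals = [y - (a + b * x) for x, y in zip(weight, body_fat)]
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# s_b two ways: formula-chart denominator, and the s_x * sqrt(n - 1) shortcut.
s_b_chart = s / math.sqrt(sxx)
s_x = statistics.stdev(weight)
s_b_shortcut = s / (s_x * math.sqrt(n - 1))

print(f"s = {s:.4f}")
print(f"s_b (formula chart) = {s_b_chart:.6f}, s_b (shortcut) = {s_b_shortcut:.6f}")
```

Both versions of sb agree because sx√(n-1) is just another way of writing the square root of Σ(xi - x̄)².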
p14
Let’s summarize what we know about the regression model:
(1) For any x-value, the responses (y) are ____________________
(2) x & y have a _________ relationship with the _______ LSRL going thru ___________________
(3) _____ is the same for each x
Assumptions for Slope Inference (Regression Inference):
1. The observations are _________________. Check that you have an SRS
2. The true relationship is _______________. Check the scatter plot & residual plot
3. The __________________ of the response is constant. Check the scatter plot & residual plot
4. The responses vary ___________ about the true regression line. Check a histogram or boxplot of residuals
Hypotheses for Slope Inference (Regression Inference):
H0: β = 0, where β is the true slope of the LSRL of <in context>
Ha: β ≠ 0 (or β > 0 or β < 0, depending on the question)
Note that β = 0 implies that there is no relationship between x & y or that x should not be used to predict y
What would the slope equal if there were a perfect relationship between x & y?
Note about Horizontal Lines
[Graph: Weight vs. Height]
Suppose the LSRL is a horizontal line – would height be useful in predicting weight?
What is the slope of a horizontal line?
A slope of zero – means that there is ______ relationship between x & y!
If this was your LSRL, it means there’s _____ relationship between x and y, and x should ________ be used to predict y!
p15
Calculations - Confidence Interval for β:
$b \pm t^* s_b$
where t* is based on df = n-2 because there are 2 unknowns, α and β
CALCULATOR: STAT TEST LinRegTInt
Note: Calculator output only gives s, NOT sb. To find sb, use the following formula developed earlier
$s_b = \frac{s}{\sqrt{\sum (x - \bar{x})^2}} = \frac{s}{s_x\sqrt{n-1}}$
OR use
$\text{upper CI} = b + t^*(s_b)$, so $s_b = \frac{\text{upper CI} - b}{t^*}$
Calculations – Hypothesis Testing for β:
$t = \frac{b - \beta}{s_b}$
where: β is the slope stated in H0 (H0: β = 0)
CALCULATOR: STAT TEST LinRegTTest
Note: Calculator output only gives s, NOT sb. To find sb, use the following formula developed earlier
$s_b = \frac{s}{\sqrt{\sum (x - \bar{x})^2}} = \frac{s}{s_x\sqrt{n-1}}$
OR use
$s_b = \frac{b - \beta}{t}$
Now let’s go back to Ex1) It is difficult to accurately determine a person’s body fat percentage without immersing him or her in water. Researchers hoping to find ways to make a good estimate immersed 20 male subjects, and then measured their weights. (Bock, Velleman, DeVeaux, p. 160) (Type the data in again if needed).
Previously we found the LSRL, r, r², the equation of the regression line, and made scatter plots and residual plots – these are things you learned in Unit3. (Don’t forget to review this unit!).
We can now answer the following question:
f) Is there sufficient evidence that weight can be used to predict body fat?
State
Plan
Do
Conclude
p16
g) Give a 95% confidence interval for the true slope of the LSRL.
State and Plan
Do
Conclude
h) Here is the computer-generated result from the data:
Find the following in the computer output:
What is the df?
What is the correlation coefficient?
What does “s” represent (in context)?
What do the numbers in the Estimate column represent?
What is sb and what does it represent (in context)?
Parameter Estimate Std. Err.
Intercept -27.376263 11.547428
Weight 0.24987414 0.060653996
Sample size: 20
R-square = 48.53%
s = 7.0491323
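Parts f) and g) can be worked straight from the Estimate and Std. Err. columns of this output. In the sketch below, the critical value t* = 2.101 (df = 18, 95% confidence) is taken from a standard t table and is an assumption not printed in these notes; the variable names are illustrative.

```python
# Values read from the computer output for the Weight row.
b = 0.24987414          # estimated slope
s_b = 0.060653996       # standard error of the slope
n = 20                  # sample size

df = n - 2

# Test statistic for H0: beta = 0.
t_stat = (b - 0) / s_b

# Assumption: t* for df = 18 and 95% confidence, from a t table.
T_STAR = 2.101

# 95% confidence interval for the true slope: b +/- t* * s_b.
margin = T_STAR * s_b
ci_low, ci_high = b - margin, b + margin

print(f"df = {df}, t = {t_stat:.2f}")
print(f"95% CI for the slope: ({ci_low:.4f}, {ci_high:.4f})")
print("reject H0 at alpha = 0.05" if abs(t_stat) > T_STAR else "fail to reject H0")
```

Since the interval does not contain 0, it leads to the same conclusion as the two-sided test at α = 0.05.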
p17
Ex3) The decline of salmon fisheries along the Columbia River in Oregon has caused great concern among commercial and recreational fishermen. The paper “Feeding of Predaceous Fishes on Out-Migrating Juvenile Salmonids in the John Day Reservoir, Columbia River” (Trans. Amer. Fisheries Soc. (1991): 405-420) gave the accompanying data on y = maximum size of salmonids consumed by northern squawfish (the most abundant salmonid predator) and x = squawfish length, both in mm. Use the computer output to answer the questions.
LSRL:______________________ r = _______ r² = _______
s = _______ sb = _______ t = _______ df = _______ p-value = ________
Calculate a 95% confidence interval for the slope of the true regression line:
What is the predicted maximum size for salmonids consumed by a squawfish whose length is 375 mm, and what is the residual corresponding to the observation (375,165)?
Can you use squawfish length to estimate the maximum size of salmonids consumed by northern squawfish? Explain.
Predictor Coef Stdev t-ratio p
Constant -89.09 16.83 -5.29 0.000
Length 0.72907 0.04778 15.26 0.000
s = 12.56 R-sq = 96.3% R-sq(adj) = 95.9%
Analysis of Variance
Source DF SS MS F p
Regression 1 36736 36736 232.87 0.000
Error 9 1420 158
Total 10 38156
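Most of the Ex3 blanks can be read from the output, and the rest follow from a few lines of arithmetic. The sketch below assumes t* = 2.262 (df = 9, 95% confidence) from a standard t table, which is not given in these notes; names are illustrative.

```python
import math

# Values read from the computer output above.
intercept = -89.09
slope = 0.72907
s_b = 0.04778           # Stdev of the Length coefficient
r_squared = 0.963       # R-sq = 96.3%
df = 9                  # Error DF, so n = 11

# r takes the sign of the slope, which is positive here.
r = math.sqrt(r_squared)

t_stat = slope / s_b    # should match the printed t-ratio

# Assumption: t* for df = 9 and 95% confidence, from a t table.
T_STAR = 2.262
margin = T_STAR * s_b
ci_low, ci_high = slope - margin, slope + margin

# Prediction and residual for a 375 mm squawfish with observed y = 165.
predicted = intercept + slope * 375
residual = 165 - predicted

print(f"r = {r:.4f}, t = {t_stat:.2f}")
print(f"95% CI for the slope: ({ci_low:.4f}, {ci_high:.4f})")
print(f"predicted = {predicted:.2f} mm, residual = {residual:.2f} mm")
```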
p18
AP Statistics Name: ____________________
12.2 Transforming Data Date: ____________ Per: ____
Recall from other math classes
Linear Function y = mx + b
Quadratic Function y = x²
Cubic Function y = x³
Exponential Function y = b^x
Power Function y = x^b
What to do if data is non-linear
Transform data
• x & log y
• log x & log y
• x & √y
• x & 1/y
• others
Calculate LSRL
Is residual plot scattered?
No → go back to Transform data
Yes → Appropriate model!
Exponential Models: (x, y) → (x, ln y) or (x, y) → (x, log y)
Logarithmic Models: (x, y) → (ln x, y) or (x, y) → (log x, y)
Power Models: (x, y) → (ln x, ln y) or (x, y) → (log x, log y)
Ex1) The population of Mexico, in millions of people, from 1921 through 1990 is given in the following table:
Year    Population (in millions)
1921    14.355
1930    16.553
1940    19.654
1950    25.791
1960    34.923
1970    48.225
1980    66.847
1990    81.250
a) Draw a scatterplot. Does the overall pattern suggest a linear relationship?
b) What is the least squares regression line and correlation coefficient?
c) Sketch the residual plot. What does it suggest?
d) Now use transformation to “linearize” the exponential model: Exponential model: (x, y) → (x, ln y)
Year    Population (in millions)
1921    14.355
1930    16.553
1940    19.654
1950    25.791
1960    34.923
1970    48.225
1980    66.847
1990    81.250
e) What does the scatterplot, the least squares regression line and correlation coefficient show?
f) Examine the residual plot. What patterns do you observe?
p19
g) Use the new model to estimate the population of Mexico in 1993. How accurate do you think your prediction is going to be? Explain.
h) The 1993 Information Please Environmental Almanac reports the population of Mexico to be 88,598,000. How close is your estimate?
i) The Almanac also reports the growth rate to be between 2.2% and 2.3%. What percentage growth rate does the exponential model represent? What does this tell you about using the exponential model to make predictions?
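Parts d), g) and i) can be sketched in code by fitting an LSRL to (year, ln population) and then undoing the logarithm. The population values below are this editor's reading of the table as (year, millions) pairs and should be checked against the original handout; `lsrl` is a hypothetical helper name.

```python
import math
import statistics


def lsrl(xs, ys):
    """Return (intercept, slope) of the least-squares line of ys on xs."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Assumed reading of the Mexico table (population in millions).
years = [1921, 1930, 1940, 1950, 1960, 1970, 1980, 1990]
population = [14.355, 16.553, 19.654, 25.791, 34.923, 48.225, 66.847, 81.250]

# Linearize the exponential model: (x, y) -> (x, ln y).
ln_population = [math.log(p) for p in population]
a, b = lsrl(years, ln_population)

# Back-transform: ln(y-hat) = a + b*x, so y-hat = e^(a + b*x).
predicted_1993 = math.exp(a + b * 1993)

# Each extra year multiplies the prediction by e^b, so the yearly growth rate is e^b - 1.
growth_rate = math.exp(b) - 1

print(f"ln(population)-hat = {a:.3f} + {b:.5f} * year")
print(f"predicted 1993 population: {predicted_1993:.1f} million")
print(f"implied yearly growth rate: {growth_rate:.2%}")
```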
Ex2) Atlantic Ocean Rockfish example. Find an appropriate model (examine x vs. y; x vs. log y; log x vs. log y). Using the chosen model, predict the weight of a fish that is 20 cm long.
Length (cm)    Weight (g)
5.2            2
8.5            8
11.5           21
14.3           38
16.8           69
19.2           117
21.3           148
23.3           190
25             264
26.7           293
28.2           318
29.6           371
30.8           455
32             504
33             518
34             537
34.9           651
36.4           719
37.1           726
37.7           810
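One way to compare the three candidate models is to compute the correlation for each transformed data set and then predict from the most linear one. The sketch below does this with natural logs; it compares r values only, so residual plots should still be checked by hand. The helper names `correlation` and `lsrl` are hypothetical.

```python
import math
import statistics

length = [5.2, 8.5, 11.5, 14.3, 16.8, 19.2, 21.3, 23.3, 25, 26.7,
          28.2, 29.6, 30.8, 32, 33, 34, 34.9, 36.4, 37.1, 37.7]
weight = [2, 8, 21, 38, 69, 117, 148, 190, 264, 293,
          318, 371, 455, 504, 518, 537, 651, 719, 726, 810]


def sums_of_squares(xs, ys):
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return x_bar, y_bar, sxx, syy, sxy


def correlation(xs, ys):
    _, _, sxx, syy, sxy = sums_of_squares(xs, ys)
    return sxy / math.sqrt(sxx * syy)


def lsrl(xs, ys):
    """Return (intercept, slope) of the least-squares line of ys on xs."""
    x_bar, y_bar, sxx, _, sxy = sums_of_squares(xs, ys)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


ln_length = [math.log(x) for x in length]
ln_weight = [math.log(y) for y in weight]

r_linear = correlation(length, weight)            # x vs. y
r_exponential = correlation(length, ln_weight)    # x vs. ln y
r_power = correlation(ln_length, ln_weight)       # ln x vs. ln y

# Power model: ln(weight-hat) = a + b * ln(length), so weight-hat = e^a * length^b.
a, b = lsrl(ln_length, ln_weight)
predicted_weight_20cm = math.exp(a + b * math.log(20))

print(f"r: linear {r_linear:.4f}, exponential {r_exponential:.4f}, power {r_power:.4f}")
print(f"power model exponent b = {b:.3f}")
print(f"predicted weight at 20 cm: {predicted_weight_20cm:.0f} g")
```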