Statistics Cheat Sheet
For UBC Commerce class
Measures of Central Tendency
Mean (Average): x̄ = ∑x / n
Mean vs. Median
Mean > Median: skewed to the right
Mean < Median: skewed to the left
Long right tail: skewed to the right
Long left tail: skewed to the left
Boxplot: 1. Q1, Q2, Q3  2. Fences – outliers  3. Max, min
Q1 – First Quartile: 25th percentile
Q2 – Median: 50th percentile
Q3 – Third Quartile: 75th percentile
Upper Fence: Q3 + 1.5·IQR
Lower Fence: Q1 − 1.5·IQR
Range = x_max − x_min
Interquartile Range (IQR) = Q3 − Q1: spread of the middle 50% of the distribution
Sensitivity to Outliers
Sensitive to outliers – use when roughly symmetric (not skewed): mean, range, variance, SD
Resistant to outliers – use when skewed or outliers present: median, IQR
Population Standard Deviation: σ = √( ∑(x − μ)² / n ) or σ = √( ∑x²/n − μ² )
Sample Standard Deviation: s = √( ∑(x − x̄)² / (n − 1) ) or s = √( (∑x² − (∑x)²/n) / (n − 1) )
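A quick way to check these formulas is Python's standard-library statistics module; the data below is a made-up sample used only for illustration.

```python
# Hypothetical sample, for illustration only.
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)        # x̄ = ∑x / n
median = statistics.median(data)    # Q2
pop_sd = statistics.pstdev(data)    # σ: divides by n
sample_sd = statistics.stdev(data)  # s: divides by n − 1

q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles
iqr = q3 - q1                                 # spread of the middle 50%
```

Note that statistics.quantiles uses the "exclusive" method by default, so quartiles can differ slightly from hand/textbook methods.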
Linear Transformation
Adding a constant (a): center (mean, median) shifts by the constant; spread does not change
Multiplying by a constant (b): center and spread (SD, IQR) are multiplied by |b|; variance is multiplied by b²
Linear Model: ŷ = b0 + b1·x    New variable: x_new = a + b·x
o Center (c): c_new = a + b·c: plug the original c into the new formula
o Spread (d): d_new = |b|·d: plug the original d into the new formula
o Variance: s_new² = b²·s²
Normal Distribution
Empirical Rule: 68% of data within 1 SD of the mean, 95% within 2 SD, 99.7% within 3 SD
Normal Distribution: N(μ, σ); Standard Normal Distribution: N(0, 1)
For data: z = (y − ȳ)/s
For a model: z = (x − μ)/σ
For a percentile: x = μ + z·σ
P(x ≤ a) = P(z ≤ (a − μ)/σ)
P(x ≥ a) = 1 − P(x < a)
P(a < x ≤ b) = P(x ≤ b) − P(x ≤ a)
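These identities can be sketched with statistics.NormalDist; μ = 100, σ = 15 and the cutoffs a, b are made-up values.

```python
# Sketch: normal probabilities via z-scores; μ, σ, a, b are made up.
from statistics import NormalDist

mu, sigma = 100, 15
X = NormalDist(mu, sigma)
Z = NormalDist(0, 1)

a, b = 85, 115
p_left = X.cdf(a)                         # P(x ≤ a)
same_via_z = Z.cdf((a - mu) / sigma)      # P(z ≤ (a−μ)/σ): identical
p_right = 1 - X.cdf(a)                    # P(x ≥ a) = 1 − P(x < a)
p_between = X.cdf(b) - X.cdf(a)           # P(a < x ≤ b)
x_for_pct = mu + Z.inv_cdf(0.95) * sigma  # x = μ + zσ for a given percentile
```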
Sum of Random Variables
If independent: N(μ1, σ1) + N(μ2, σ2) = N(μ1 + μ2, √(σ1² + σ2²))
Sampling Distributions for Proportions: Categorical Data
p: true proportion, center of the histogram
p̂: sample proportion, varies from one sample to the next
SD(p̂) = √(pq/n)
p̂ ~ N(p, √(pq/n))
For CI: SE(p̂) = √(p̂q̂/n)
Sampling Distributions for Means: Quantitative Data
μ: population mean; ȳ: sample mean
SD(ȳ) = σ/√n (σ: population SD, known)
For CI: SE(ȳ) = s/√n (s: sample SD, used when σ unknown)
Statistical Inference
Properties of the Sampling Distribution of Sample Means
1. μ_x̄ = μ: mean of the sample means = population mean
2. σ_x̄ = σ/√n: SD of the sample means = population SD divided by √n
Central Limit Theorem
The relationship between the sampling distribution of sample means and the population the samples are taken from
If samples of n ≥ 25 are drawn from any population with mean μ and SD σ, then the sampling distribution of sample means approximates a normal distribution; the greater the sample size, the better the approximation
If the population is normally distributed, the sampling distribution of sample means is normally distributed for any sample size n
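The CLT can be simulated directly: sample means drawn from a non-normal (uniform) population center on μ with spread close to σ/√n. The population and sample size below are made up for the demo.

```python
# Sketch: CLT simulation with a uniform(0, 1) population (μ = 0.5, σ = 1/√12).
import random
import statistics

random.seed(1)
n = 25                      # sample size
means = [statistics.mean(random.uniform(0, 1) for _ in range(n))
         for _ in range(5000)]

emp_center = statistics.mean(means)   # ≈ μ = 0.5
emp_spread = statistics.stdev(means)  # ≈ σ/√n ≈ 0.2887/5 ≈ 0.0577
```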
If asked about one individual, use: z = (x − μ)/σ
If asked about the mean of a sample of n individuals, use: z = (x̄ − μ)/(σ/√n)
Hypothesis Testing Math for 1 Sample
Z-Test for a Population Mean
H0: μ = μ0
Ha: μ <, ≠, > μ0
σ known (from process/population); test statistic ~ N(0, 1)
z = (x̄ − μ0) / (σ/√n)
P-Value
Ha: μ < μ0: P(z < z0) = %
Ha: μ > μ0: P(z > z0) = 1 − P(z < z0)
Ha: μ ≠ μ0: 2 × P(z > |z0|)
Conclusion: based on α / CI
P-value < α: reject H0, accept Ha – sufficient evidence; significant; data unlikely under H0
P-value > α: fail to reject H0 – not sufficient evidence; not significant; data likely under H0
CI: x̄ ± z*·σ/√n
Sample size for a target ME: n = (z*·σ/ME)²
z*: critical value for the desired confidence level; ME: desired margin of error
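A worked sketch of the z-test and CI above, with NormalDist supplying the tail areas; μ0 = 50, σ = 10, x̄ = 53, n = 36 are made-up numbers.

```python
# Sketch: one-sample z-test for a mean, σ known; all numbers made up.
from math import sqrt
from statistics import NormalDist

mu0, sigma, xbar, n = 50, 10, 53, 36
z = (xbar - mu0) / (sigma / sqrt(n))   # z = (x̄ − μ0)/(σ/√n)

Z = NormalDist()
p_right = 1 - Z.cdf(z)                 # Ha: μ > μ0
p_two = 2 * (1 - Z.cdf(abs(z)))        # Ha: μ ≠ μ0

zstar = Z.inv_cdf(0.975)               # 95% critical value ≈ 1.960
ci = (xbar - zstar * sigma / sqrt(n),  # x̄ ± z*·σ/√n
      xbar + zstar * sigma / sqrt(n))
```

Here p_two ≈ 0.07 > 0.05 and the 95% CI contains μ0 = 50: the two ways of deciding agree.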
One-Sample t-Test
H0: μ = μ0
Ha: μ <, ≠, > μ0
σ unknown (use s from data/sample); test statistic ~ t(n−1)
t = (x̄ − μ0) / (s/√n)
P-Value (bracketed from the t-table)
Ha: μ < μ0: P(t(n−1) < t0): # < p-value < #
Ha: μ > μ0: P(t(n−1) > t0): # < p-value < #
Ha: μ ≠ μ0: double the bracket: 2·# < p-value < 2·#
Conclusion: based on α / CI
P-value < α: reject H0, accept Ha – sufficient evidence; significant; data unlikely under H0
P-value > α: fail to reject H0 – not sufficient evidence; not significant; data likely under H0
CI: x̄ ± t*(n−1)·s/√n
CI     z*
90%    1.645
95%    1.960
99%    2.576
Z-Test for a Population Proportion
H0: p = p0
Ha: p <, ≠, > p0
Test statistic ~ N(0, 1)
z = (p̂ − p0) / √(p0·q0/n)
P-Value
Ha: p < p0: P(z < z0) = %
Ha: p > p0: P(z > z0) = 1 − P(z < z0)
Ha: p ≠ p0: 2 × P(z > |z0|)
Conclusion: based on α / CI
P-value < α: reject H0, accept Ha – sufficient evidence; significant; data unlikely under H0
P-value > α: fail to reject H0 – not sufficient evidence; not significant; data likely under H0
CI: p̂ ± z*·√(p̂q̂/n)
SE(p̂) = √(p̂q̂/n)
ME = z*·√(p̂q̂/n)
Sample size for a target ME: n = (z*)²·p̂q̂ / (ME)²
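The same recipe for a proportion, including the sample-size formula; x = 58 successes out of n = 100 against p0 = 0.5, and the 3% target ME, are made up.

```python
# Sketch: one-proportion z-test, CI, and sample size; numbers made up.
from math import sqrt, ceil
from statistics import NormalDist

x, n, p0 = 58, 100, 0.5
phat = x / n
q0 = 1 - p0
z = (phat - p0) / sqrt(p0 * q0 / n)   # test statistic uses p0, q0

p_two = 2 * (1 - NormalDist().cdf(abs(z)))  # Ha: p ≠ p0

zstar = 1.960                          # 95% critical value
se = sqrt(phat * (1 - phat) / n)       # CI uses p̂, q̂
ci = (phat - zstar * se, phat + zstar * se)

# n = (z*)²·p̂q̂ / ME² for a 3% margin of error
n_needed = ceil(zstar**2 * phat * (1 - phat) / 0.03**2)
```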
Hypothesis Testing for 1 Sample: Qualitative
Null Hypothesis (H0): states a value of the population model parameter
  Skeptical claim: nothing is different; assumed true by default
Alternative Hypothesis (Ha): value of the population parameter we consider plausible if H0 is rejected
Two-Sided: H0: p = p0; Ha: p ≠ p0
  o P-value: probability of deviating in either direction from H0
One-Sided: H0: p = p0; Ha: p < p0 or p > p0
  o P-value: probability of deviating only in the direction of Ha away from H0
Alpha Levels: threshold for the P-value
Statistically significant/insignificant depending on the alpha level
  o Depends on how large the sample is
α                 One-Sided   Two-Sided
0.05 (95% CI)     1.645       1.960
0.01 (99% CI)     2.33        2.576
0.001 (99.9% CI)  3.09        3.29
P-Value: the value on which we base our decision; measures how likely the observed data are if H0 is true; the ultimate goal of the calculation
Conclusion: reject or fail to reject H0
Confidence Interval
CI: estimated range of values, calculated from the sample data, that is likely to include the unknown population parameter
Higher confidence → must capture the true value more often → wider interval
Smaller interval (less variability) → choose a larger sample
Estimate ± ME, where ME = z* × SE: extent of the interval on either side of the middle value
  o ME < 5% is acceptable
Level of Confidence: probability that the interval estimate contains the population parameter
  o 95% confidence level: one can be 95% confident that the population parameter is contained in the interval
Critical Value (z*): number of SEs we must stretch out on either side of the middle value
Types of Errors and Level of Significance
            Accept H0        Reject H0
H0 True     Correct          Type I error
H0 False    Type II error    Correct
Type I Error (false positive): H0 is rejected when it is true (drew an unusual sample)
  A healthy person is diagnosed with the disease; a jury convicts an innocent person; money is invested in a project that turns out not to be profitable
Type II Error (false negative): H0 is not rejected (fail to reject) when it is false
  An infected person is diagnosed healthy; a jury fails to convict a guilty person; money won't be invested in a project that would have been profitable
Power (1 − β): probability of detecting a false H0
Descriptive Statistics
Categorical Variable: descriptive responses
Quantitative Variable: measure of quantity (units)
Time-Series: variable measured at regular intervals over time
  Consistent time interval (months, weeks, …)
Cross-Sectional Data: several variables measured at the same point in time
  Exact time (e.g. every February at Starbucks)
Stem-Plot: shows the distribution of the data while keeping the specific data points
  Can calculate mean, quartiles, median, shape
Scatterplot: plots 2 quantitative variables
Histogram: shows the distribution of the data by breaking the range of values of a variable into intervals and displaying the count or proportion of observations that fall into each interval
Shapes of Distributions: symmetric, unimodal, bimodal, uniform, skewed
Data Collection: from direct observation or produced through experiments
Survey (response rate): personal interview, telephone interview, self-administered questionnaire
Sampling Plans
SRS: sample selected so that every possible sample with the same number of observations is equally likely to be selected
Stratified RS: separating the population into mutually exclusive sets, or strata, and drawing a SRS from each
  o Strata are internally homogeneous and different from one another
Cluster S: SRS of groups or clusters of elements
  o Clusters are internally heterogeneous and similar to one another (Vancouver vs. Toronto)
Systematic S: sample every kth unit in the population
Multi-Stage S: randomly choose clusters and randomly sample individuals within each cluster
Errors
Sampling Error: difference between sample and population that exists only because of the observations that happened to be selected for the sample
Non-Sampling Error: due to mistakes made in acquiring data or sample observations being selected improperly (more serious because it cannot be corrected by increasing the sample size)
  o Errors in acquiring data: recording incorrect responses
  o Nonresponse error: error (or bias) introduced when responses are not obtained from some members of the sample
  o Response bias: anything that influences responses
  o Voluntary response bias: a large group is invited to respond and all who do are counted
  o Selection bias: the sampling plan is such that some members of the target population cannot be selected for inclusion in the sample
  o Convenience sampling: include the individuals who are convenient
  o Under-coverage: some portion of the population is not sampled at all or has smaller representation
  o Answer phrasing / measurement errors: inaccurate responses
  o Pilot test: small trial run of the study to check that the method is okay
Probability and Random Variables
Probability: likelihood of a random phenomenon or chance behavior
0 ≤ P(A) ≤ 1
P(S) = P(certain event) = 1
P(A1 or A2 or …) = P(A1) + P(A2) + P(A3) + … if mutually exclusive
Interpreting Probability
Relative frequency approach: P(A) = (# of times event A occurred) / (# of times the experiment was run)
Classical approach: P(A) = (# of ways the event can occur) / (# of ways the experiment can occur)
Random Variables: variable that assigns a numerical result to an outcome of an event that is associated with chance
Discrete: can take only a finite or countably infinite number of values; Continuous: not discrete
Discrete conditions: 0 ≤ P(x) ≤ 1 and ∑P(x) = 1
Expected Value: mean of the probability distribution: μ = E(x) = ∑ x·P(x)
Variance: σ² = ∑(x − μ)²·P(x) or ∑x²·P(x) − μ²
If linear h(x) = ax + b: E(h(x)) = ∑h(x)·P(x); E(ax + b) = a·E(x) + b; Var(ax + b) = a²·Var(x)
If independent random variables: E(x + y) = E(x) + E(y); Var(x ± y) = Var(x) + Var(y)
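The discrete-variable formulas above can be sketched directly; the distribution below is made up.

```python
# Sketch: E(x), Var(x), and a linear transform for a made-up discrete RV.
xs = [0, 1, 2]            # possible values
ps = [0.25, 0.50, 0.25]   # P(x); must satisfy ∑P(x) = 1

assert abs(sum(ps) - 1) < 1e-12

mu = sum(x * p for x, p in zip(xs, ps))                  # E(x) = ∑ x·P(x)
var = sum((x - mu) ** 2 * p for x, p in zip(xs, ps))     # σ² = ∑(x−μ)²·P(x)
var_alt = sum(x**2 * p for x, p in zip(xs, ps)) - mu**2  # ∑x²P(x) − μ²

# Linear transform h(x) = ax + b: E = aμ + b, Var = a²σ²
a, b = 3, 2
e_h = a * mu + b
var_h = a**2 * var
```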
Hypothesis Testing for Comparing Two Means
Two Independent Means
Case 1: σ1² = σ2² (check: 1/2 ≤ s1/s2 ≤ 2)
Two Independent Means
Case 2: σ1² ≠ σ2²
Two Population Means
Matched Pairs Experiments
Hypothesis Testing for Comparing Two Means: Qualitative
Null Hypothesis (H0): there is no difference between the means or proportions (difference = 0)
Alternative Hypothesis (Ha): there is a difference between the means or proportions
P-Value: the value on which we base our decision; measures how likely the observed data are if H0 is true; the ultimate goal of the calculation
Conclusion: reject or fail to reject H0
If # ≤ P-value ≤ # and P-value ≥ α: do not reject H0: we cannot conclude there has been a significant reduction/increase/difference in the mean under study
If −z* < test statistic < z* (inside the critical values): do not reject H0: there is not sufficient evidence at the 5% level of significance of a reduction/increase/difference between the two means
Confidence Interval
Higher confidence → must capture the true value more often → wider interval
Smaller interval (less variability) → choose a larger sample
Level of Confidence
90% CI contains 0, e.g. (−#, #): we cannot conclude at the 10% level of significance that the means of x1 and x2 differ
90% CI: we are 90% confident that the interval (#, #) contains the true difference in means/proportions under study
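For the unequal-variance case (Case 2 above), the standard form is Welch's t statistic with the Satterthwaite df; the formula is supplied here as a sketch since the sheet does not spell it out, and the two samples are made up.

```python
# Sketch: Welch's t statistic for two independent means, σ1² ≠ σ2².
import statistics
from math import sqrt

g1 = [12.0, 14.0, 11.0, 13.0, 15.0]   # made-up sample 1
g2 = [10.0, 9.0, 11.0, 10.0, 12.0, 8.0]  # made-up sample 2

m1, m2 = statistics.mean(g1), statistics.mean(g2)
v1, v2 = statistics.variance(g1), statistics.variance(g2)
n1, n2 = len(g1), len(g2)

se = sqrt(v1 / n1 + v2 / n2)          # SE of (x̄1 − x̄2)
t = (m1 - m2) / se                    # H0: μ1 − μ2 = 0

# Welch–Satterthwaite df (round down when using tables)
df = (v1/n1 + v2/n2) ** 2 / ((v1/n1) ** 2 / (n1 - 1) + (v2/n2) ** 2 / (n2 - 1))
```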
Hypothesis Testing for Comparing Population Proportions
Two Separate Samples
H0: p1 − p2 = 0
Ha: p1 − p2 <, ≠, > 0
Estimate: p̂1 − p̂2, where p̂1 = x1/n1 and p̂2 = x2/n2
P-Value:
  P(z ≤ z0) = % (lower tail)
  P(z ≥ z0) = % (upper tail)
  Two-sided: 2 × (tail %)
Conclusion: based on α:
P-value < α: reject H0, accept Ha – sufficient evidence, significant, unlikely
P-value > α: fail to reject H0 – not sufficient evidence, not significant, likely
CI: (p̂1 − p̂2) ± z*·√( p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 )
CI (#, #) contains the hypothesized value: fail to reject H0
CI (#, #) does not contain the hypothesized value: reject H0
CI (−#, #) contains 0: cannot conclude a difference
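A sketch of the two-proportion test and CI; the counts are made up, and the pooled p̂ used in the test statistic is the standard practice under H0 (the CI uses the unpooled SE matching the formula above).

```python
# Sketch: two-proportion z-test and CI; x1/n1, x2/n2 are made up.
from math import sqrt
from statistics import NormalDist

x1, n1 = 45, 100
x2, n2 = 30, 100
p1, p2 = x1 / n1, x2 / n2

# Pooled proportion for the test (H0: p1 − p2 = 0)
pp = (x1 + x2) / (n1 + n2)
z = (p1 - p2) / sqrt(pp * (1 - pp) * (1 / n1 + 1 / n2))
p_two = 2 * (1 - NormalDist().cdf(abs(z)))

# CI: (p̂1 − p̂2) ± z*·√(p̂1q̂1/n1 + p̂2q̂2/n2), unpooled SE
zstar = 1.960
se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
ci = (p1 - p2 - zstar * se, p1 - p2 + zstar * se)
contains_zero = ci[0] < 0 < ci[1]   # True → cannot conclude a difference
```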
Chi-Square Tests: Analysis of Two-Way Tables
Goodness of Fit Test
Used to measure how well observed data fit what would be expected under specified conditions
H0: p1 = p2 = … = pk (the proportions specified under H0)
Ha: not all proportions are equal
χ² = ∑ (f_o − f_e)² / f_e = ∑ (Obs − Exp)² / Exp
  f_o = observed frequency
  f_e = expected frequency = n × p_i specified under H0
df = k − 1, where k is the number of categories/cells specified under H0
n: total number of observations
Conclusion: P-value:
P-value < α: reject H0, accept Ha – sufficient evidence, significant, unlikely
P-value > α: fail to reject H0 – not sufficient evidence, not significant, likely
χ² > χ²(α, k−1): reject H0
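A sketch of the goodness-of-fit statistic; the observed counts and the equal-thirds H0 are made up, and the critical value is read from a chi-square table.

```python
# Sketch: chi-square goodness-of-fit statistic; counts and H0 made up.
obs = [44, 56, 50]
n = sum(obs)
p0 = [1/3, 1/3, 1/3]                  # H0: p1 = p2 = p3
exp = [n * p for p in p0]             # f_e = n·p_i

chi2 = sum((o - e) ** 2 / e for o, e in zip(obs, exp))
df = len(obs) - 1                     # k − 1

# χ²(0.05, df = 2) = 5.991 from a chi-square table
reject = chi2 > 5.991
```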
Test of Independence
Used to determine whether the row and column variables in a two-way contingency table are independent or related
H0: no association between the row and column variables (independent)
Ha: the row and column variables are associated (not independent)
χ² = ∑ (f_o − f_e)² / f_e = ∑ (Obs − Exp)² / Exp
f_e = (row total × column total) / n
n: total number of observations in the table
df = (r − 1)(c − 1)
  r: number of rows; c: number of columns
Conclusion: P-value:
P-value < α: reject H0, accept Ha – sufficient evidence, significant, unlikely
P-value > α: fail to reject H0 – not sufficient evidence, not significant, likely
χ² > χ²(α, (r−1)(c−1)): reject H0
Homogeneity Test
Compares observed counts from 2 or more populations
Examines the samples to see whether they have the same proportions of some characteristic
H0: the populations have the same proportion of the characteristic
Ha: at least one of the populations has a different proportion
χ² = ∑ (f_o − f_e)² / f_e = ∑ (Obs − Exp)² / Exp
f_e = (row total × column total) / n
n: total number of observations in the table
df = (r − 1)(c − 1)
  r: number of rows; c: number of columns
Conclusion: P-value:
P-value < α: reject H0, accept Ha – sufficient evidence, significant, unlikely
P-value > α: fail to reject H0 – not sufficient evidence, not significant, likely
χ² > χ²(α, (r−1)(c−1)): reject H0
Random Sample   A            B            Total
A               (R1)(C1)/n   (R1)(C2)/n   R1
B               (R2)(C1)/n   (R2)(C2)/n   R2
Total           C1           C2           n
Example: Does the gender of a survey interviewer have an effect on survey responses by men?
If we reject H0, we conclude there is significant evidence that the proportion of men's responses differs between the two interviewer genders
Reject H0: we conclude that the variables are not independent; they are associated
Fail to reject H0: there is no evidence of an association
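A sketch of the independence test on a 2×2 table, building the expected counts from row/column totals as above; the counts are made up and the critical value comes from a chi-square table.

```python
# Sketch: chi-square test of independence on a made-up 2×2 table.
table = [[20, 30],    # row 1 observed counts
         [40, 10]]    # row 2 observed counts

row_tot = [sum(r) for r in table]
col_tot = [sum(c) for c in zip(*table)]
n = sum(row_tot)

# f_e = row total × column total / n
exp = [[rt * ct / n for ct in col_tot] for rt in row_tot]
chi2 = sum((table[i][j] - exp[i][j]) ** 2 / exp[i][j]
           for i in range(2) for j in range(2))
df = (2 - 1) * (2 - 1)                # (r−1)(c−1)

# χ²(0.05, df = 1) = 3.841 from a table
reject = chi2 > 3.841                 # reject → variables are associated
```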
Linear Regression (Scatterplots)
Correlation:
Relationship between two variables
Only measures the strength of a linear relationship/association
x – independent/explanatory/predictor variable; y – dependent/response/predicted variable
r = ∑(x − x̄)(y − ȳ) / √( ∑(x − x̄)² · ∑(y − ȳ)² )
OR r = ( n∑xy − (∑x)(∑y) ) / ( √(n∑x² − (∑x)²) · √(n∑y² − (∑y)²) )
Properties of the Correlation Coefficient
1. −1 ≤ r ≤ 1
2. rxy = ryx
3. Positive values = positive correlation; negative values = negative correlation
4. Strong correlation/linear relationship: closer to −1 or 1
5. Weak correlation/linear relationship: closer to 0
6. Only used on two quantitative variables
7. Calculated using means, SDs, z-scores
8. Not resistant to outliers
Describing Association
1. Form: is it linear, bell-shaped, curved, a cloud?
2. Direction: positive or negative?
3. Strength: spread
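Both formulas for r give the same value, which a small made-up dataset confirms.

```python
# Sketch: both forms of r on made-up data.
from math import sqrt

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# Definition form: ∑(x−x̄)(y−ȳ) / √(∑(x−x̄)²·∑(y−ȳ)²)
num = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
den = sqrt(sum((xi - xbar) ** 2 for xi in x)
           * sum((yi - ybar) ** 2 for yi in y))
r = num / den

# Computational form: (n∑xy − ∑x∑y) / (√(n∑x²−(∑x)²)·√(n∑y²−(∑y)²))
sx, sy = sum(x), sum(y)
sxy = sum(xi * yi for xi, yi in zip(x, y))
sxx, syy = sum(xi * xi for xi in x), sum(yi * yi for yi in y)
r2 = (n * sxy - sx * sy) / (sqrt(n * sxx - sx**2) * sqrt(n * syy - sy**2))
```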
Linear Model: ŷ = b0 + b1·x; ŷ: predicted y value for a given x value
Slope: y increases/decreases by the slope (y units) per 1 x unit
  Gets its sign from the correlation
  Gets its units from the ratio of the two SDs, so the units of the slope are a ratio of the units of the variables
b1 = r·Sy/Sx OR b1 = ( n∑xy − (∑x)(∑y) ) / ( n∑x² − (∑x)² )
Intercept: b0 is the predicted y when x = 0; the starting value for predictions
b0 = ȳ − b1·x̄
To find the slope and intercept we need:
  Correlation (r): tells us the strength of the linear association
  Means: tell us where to locate the line
  SDs: tell us the units
Predicting in SDs: for each SD above/below the mean in x, predict y to be r·SD above/below the mean in y
Correlation and the line: in a plot of the standardized variables, b1 = r and b0 = 0
ẑy = r·zx: for every SD above/below the mean we are in x, we predict y to be r SDs above/below the mean of y
Residual: observed (point) − predicted (line): e = y − ŷ
  Does the model make sense? How well does the line fit the data?
  o How much variation in y does our model explain? – coefficient of determination R²
  Negative e: ŷ is big (overestimate); positive e: ŷ is small (underestimate)
  Residuals vs. predicted values should show no pattern, no direction, no shape, mean = 0
Point Prediction: the value of ŷ obtained by plugging a value x* into the regression equation
  o We can only make predictions within the range of our data, not beyond it – going beyond is EXTRAPOLATION, which is unreliable
Coefficient of Determination
Measures the proportion of the variation in y that is explained by the variation in x
R²: fraction of the data's variation accounted for by the model; about R²% of the variation in y is explained by the variation in x
1 − R²: fraction of the variation left in the residuals
R² = 1 − ∑(y − ŷ)² / ∑(y − ȳ)² = r²
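The slope, intercept, and R² formulas can be sketched end to end on made-up data.

```python
# Sketch: least-squares line and R² from the formulas above; data made up.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

xbar = sum(xs) / n
ybar = sum(ys) / n

b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
     / sum((x - xbar) ** 2 for x in xs)   # slope (equals r·Sy/Sx)
b0 = ybar - b1 * xbar                     # intercept: b0 = ȳ − b1·x̄

yhat = [b0 + b1 * x for x in xs]          # predictions
sse = sum((y - yh) ** 2 for y, yh in zip(ys, yhat))
sst = sum((y - ybar) ** 2 for y in ys)
r_squared = 1 - sse / sst                 # fraction of variation explained
```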
Lurking Variable: a variable that is not among the explanatory or response variables but can influence the interpretation of relationships among them
  Ex. lung cancer and stained fingernails are associated, but the lurking variable is smoking
Rule of thumb for a realistic value of the SD: SD ≈ Range/6
Simple Linear Regression (Inference): Qualitative
First-Order Model: y = β0 + β1·x + ε
where: y = dependent variable; x = independent variable
β0 = y-intercept; β1 = slope of the line (rise/run); ε = error variable
Measures of Variation
(x, y) = data point; ȳ = sample mean; ŷ = predicted y-value
Total deviation: vertical distance y − ȳ
  SST = ∑(y − ȳ)² = ∑y² − (∑y)²/n
Explained deviation: vertical distance ŷ − ȳ
  SSR = ∑(ŷ − ȳ)² = b0∑y + b1∑xy − (∑y)²/n
Unexplained deviation: vertical distance y − ŷ
  SSE = ∑(y − ŷ)² = ∑y² − (b0∑y + b1∑xy)
Coefficient of Determination (R²): the amount of the variation in y that is explained by the regression line; the ratio of the explained variation to the total variation
  R² = SSR/SST = (SST − SSE)/SST = 1 − SSE/SST
Standard Error of Estimate (Se): a measure of the differences between the observed sample y-values and the predicted ŷ obtained from the regression equation
Sum of Squares for X (SSxx): the sum of the squared deviations of x
ASSESSING THE MODEL
Standard Deviation of the Error Variable (σε)
  If σε is large: some of the errors will be large, which implies that the model's fit is poor
  If σε is small: the errors tend to be close to their mean (which is 0), so the model fits well
Sum of Squares for Regression (SSR): measures the amount of variation in y that is explained by the variation in the independent variable x
  Variation in y = SSE + SSR, so SSE is the amount of variation in y that remains unexplained
R² = 1 − SSE/∑(yi − ȳ)² = (∑(yi − ȳ)² − SSE)/∑(yi − ȳ)² = SSR/∑(yi − ȳ)² = explained variation in y / variation in y
The greater the explained variation (the greater the SSR or R²), the better the model
Simple Linear Regression (Inference): Testing
Significance of Regression
Predictor Coefficient / Slope:
H0: β1 = 0: y is not linearly related to x, so the regression line is horizontal (slope 0)
Ha: β1 ≠ 0: a linear relationship exists
t = (b1 − β1) / s_b1
s_b1 = sε / √SSxx = sε / √((n−1)·sx²)
sε = √( SSE / (n−2) )
Conclusion: P-value:
P-value < α: reject H0, accept Ha – sufficient evidence, significant, unlikely
P-value > α: fail to reject H0 – not sufficient evidence, not significant, likely
CI: b1 ± t*(n−2)·s_b1: uses the t(n−2) distribution
df: ν = n − 2
|t| outside ±t*(n−2) from the chart: reject H0
Regression Equation
Prediction Interval: determines how closely ŷ matches the true value of y; predicting a single observation (individual value) – wider interval
  ŷ ± t*(n−2)·sε·√( 1 + 1/n + (xg − x̄)²/SSxx )
Confidence Interval: estimator of E(y); narrower than the prediction interval because there is less error in estimating a mean value than in predicting an individual value
  ŷ ± t*(n−2)·sε·√( 1/n + (xg − x̄)²/SSxx )
SSxx = (n−1)·sx²
(xg − x̄)²/SSxx: estimated error (grows as xg moves away from x̄)
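A sketch comparing the two half-widths at a given xg; the data are made up and t*(n−2) = 3.182 (95%, df = 3) is read from a t-table.

```python
# Sketch: prediction vs. confidence interval half-widths; data made up.
from math import sqrt

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
     / sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s_eps = sqrt(sse / (n - 2))                 # sε = √(SSE/(n−2))
ssxx = sum((xi - xbar) ** 2 for xi in x)    # SSxx = (n−1)·sx²

tstar = 3.182                               # t*(df = 3), 95%, from table
xg = 4
pi_half = tstar * s_eps * sqrt(1 + 1/n + (xg - xbar)**2 / ssxx)
ci_half = tstar * s_eps * sqrt(1/n + (xg - xbar)**2 / ssxx)
```

The extra "1 +" under the square root is exactly why the prediction interval is always wider than the confidence interval.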
Coefficient of Correlation
Data are observational; the two variables are bivariate normally distributed
Can test for a linear association between the 2 variables using a t-test
ρ: population coefficient of correlation; its estimate is the sample coefficient of correlation r
H0: ρ = 0: there is no linear relationship between the two variables
Ha: ρ ≠ 0
t = r·√( (n−2) / (1 − r²) )
df: ν = n − 2
Conclusion: P-value:
P-value < α: reject H0, accept Ha – sufficient evidence, significant, unlikely
P-value > α: fail to reject H0 – not sufficient evidence, not significant, likely
|t| outside ±t*(n−2) from the chart: reject H0
Multiple Regressions
y = β0 + β1·x1 + β2·x2 + … + βk·xk + ε
where: k = number of independent variables potentially related to the dependent variable
y = dependent variable; x1, x2, …, xk = independent variables; β0, β1, …, βk = coefficients; ε = error variable
Independent variables may be functions of other variables: x2 = x1², x5 = x3·x4, x7 = log(x6)
Meaning of a Regression Coefficient
  xi: with all other variables held constant, if xi (or the function of it in the model) increases by 1, the expected y increases/decreases by the xi coefficient
Adjusted R²: takes into account the sample size and the number of independent variables
  If k is large relative to n: unadjusted R² may be unrealistically high
Adjusted R² = 1 − [ SSE/(n−k−1) ] / [ ∑(yi − ȳ)²/(n−1) ]
CI / Test for Each Variable
  If P-value < α: we conclude that βi is greater than / smaller than / different from 0
Multiple Regressions Tests
Significance of Regression
Testing the validity of the model: F-test
  The F-test combines the t-tests into a single test
  Not affected by the problem of multicollinearity, which is when the independent variables are correlated with one another
H0: β1 = β2 = … = βk = 0: if true, none of the independent variables x1, x2, …, xk is linearly related to y, so the model is invalid
Ha: at least one βi ≠ 0: the model has some validity
F = [ (∑(yi − ȳ)² − SSE)/k ] / [ SSE/(n−k−1) ] = (SSR/k) / (SSE/(n−k−1)) = ((SST − SSE)/k) / (SSE/(n−k−1)) = MSR/MSE
sε = √( SSE/(n−k−1) )
Conclusion: P-value:
P-value < α: reject H0, accept Ha – sufficient evidence, significant, unlikely
P-value > α: fail to reject H0 – not sufficient evidence, not significant, likely
df (numerator): ν = k; df (denominator): ν = n − k − 1
F > F(α, k, n−k−1): reject H0 – the model is valid, the regression is significant
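The F statistic reduces to a few divisions once SST and SSE are known; the summary numbers below are made up, and the critical value is an approximate table lookup.

```python
# Sketch: F statistic from made-up summary numbers.
sst, sse = 120.0, 40.0   # total and unexplained sums of squares
k, n = 3, 30             # 3 predictors, 30 observations

ssr = sst - sse          # explained sum of squares
msr = ssr / k            # MSR = SSR/k
mse = sse / (n - k - 1)  # MSE = SSE/(n−k−1)
F = msr / mse

# F(0.05, 3, 26) ≈ 2.98 from an F-table
reject = F > 2.98        # reject → model is valid
```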
Significance of Each Variable
Testing the coefficients: t-tests
  t-tests of individual coefficients allow us to determine whether βi ≠ 0 (for i = 1, 2, …, k), i.e. whether a linear relationship exists between xi and y
  Using many t-tests instead of the F-test to test the validity of the model increases the probability of a Type I error
H0: βi = 0: no linear relationship between xi and y
Ha: βi ≠ 0
t = (bi − βi) / s_bi
df: ν = n − k − 1
|t| > t(α/2, n−k−1): reject H0 – the βi parameter is significant
Analysis of Variance (ANOVA)
One-Way ANOVA: one-way analysis of variance because we use a single property, or characteristic, for categorizing the populations
H0: μ1 = μ2 = μ3

Excel regression output:
Regression Statistics
  Multiple R: √R² = |r|, where r = correlation coefficient
  R Square: r² = SSR/SST
  Adjusted R Square: 1 − (1 − R²)·[(n−1)/(n−k−1)]
  Standard Error: Se = √MSE
  Observations: n

ANOVA        df     SS                MS                F         Significance F
Regression   1      SSR = SST − SSE   MSR = SSR/1       MSR/MSE   P(F > Fstat), Fstat = MSR/MSE
Residual     n−2    SSE = ∑(y − ŷ)²   se² = SSE/(n−2)
Total        n−1    SST = (n−1)·Sy²

             Coefficients    Standard Error                       t-Stat    P-value                  Lower 95%            Upper 95%
Intercept    b0 = ȳ − b1·x̄  SE(b0) = Se·√(1/n + x̄²/∑(xi−x̄)²)  Coef/SE   2×P(t(n−2) > |t-stat|)   b0 − t(n−2)·SE(b0)   b0 + t(n−2)·SE(b0)
X Variable   b1 = r·Sy/Sx    SE(b1) = Se/√∑(xi−x̄)²              Coef/SE   2×P(t(n−2) > |t-stat|)   b1 − t(n−2)·SE(b1)   b1 + t(n−2)·SE(b1)
ANOVA
A method of testing the equality of three or more population means by analyzing sample variances
We test the hypothesis by determining whether the variation between groups is larger than the variation within groups
H0: μ1 = μ2 = … = μk
Ha: at least one μj is different (not all the same)
F = MST/MSE: the test statistic involves the variation within groups and the variation among groups
If the differences among sample means are very large relative to the variation within groups, the numerator of the test statistic becomes larger than the denominator; large values of the test statistic suggest unequal means
F > F(k−1, N−k): reject H0

ANOVA                   SS                 df     MS                F         Significance F
Between (treatments)    SSG = SST − SSE    g−1    MSG = SSG/(g−1)   MSG/MSE   P(F > Fstat), Fstat = MSG/MSE
Within (error)          SSE = ∑(y − ȳj)²   N−g    MSE = SSE/(N−g)
Total                   SST                N−1

P-value = P(F(g−1, N−g) > F0); P-value < α: reject H0
g = number of groups; N = total number of sample observations
SST: total sum of squares
SSG: sum of squares between groups (treatments): represents the variation between the means of the groups
SSE: sum of squares within groups (error): represents the variation within a group due to random error
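The ANOVA table above can be sketched on three made-up groups; the critical value is an F-table lookup.

```python
# Sketch: one-way ANOVA F statistic for three made-up groups.
import statistics

groups = [[6.0, 8.0, 7.0, 7.0],
          [9.0, 11.0, 10.0, 10.0],
          [5.0, 4.0, 6.0, 5.0]]

g = len(groups)                          # number of groups
N = sum(len(grp) for grp in groups)      # total observations
grand = statistics.mean([v for grp in groups for v in grp])

# Between-groups (treatments) and within-groups (error) sums of squares
ssg = sum(len(grp) * (statistics.mean(grp) - grand) ** 2 for grp in groups)
sse = sum(sum((v - statistics.mean(grp)) ** 2 for v in grp) for grp in groups)

msg = ssg / (g - 1)                      # MSG = SSG/(g−1)
mse = sse / (N - g)                      # MSE = SSE/(N−g)
F = msg / mse

# F(0.05, 2, 9) ≈ 4.26 from an F-table
reject = F > 4.26                        # reject → not all means equal
```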
Conditions + Assumptions
Regression Lines (Correlation)
  Quantitative Variables Condition
  Linearity Condition
  Outlier Condition
  Equal Spread Condition: check that the spread is about the same throughout

Model for Sampling Distribution of Proportions (68-95-99.7)
  Independence Assumption
    o Randomization Condition
    o 10% Condition: n < 10% of the population
  Sample Size Assumption: the sample size n must be large enough
    o Success/Failure Condition: np > 10 and nq > 10

Model for Sampling Distribution of Means: z-score
  Independence Assumption
    o Randomization Condition
  Sample Size Assumption
    o 10% Condition
    o Large Enough Sample Condition: depends on the shape of the original data distribution

Confidence Intervals for One Proportion (one-proportion z-interval, one-proportion z-test)
  Independence Assumption
    o Randomization Condition
    o 10% Condition
  Sample Size Assumption (inference – CLT: need a large enough sampling model)
    o Success/Failure Condition: np̂ ≥ 10 and nq̂ ≥ 10

Sampling Distribution for a Mean: t-score
  Independence Assumption
    o Randomization Condition
    o 10% Condition
  Normal Population Assumption: Student's t-model won't work for data that are badly skewed
    o Nearly Normal Condition:
      n < 15: data should follow the normal model
      15 < n < 40: t-methods work well as long as the data are unimodal and symmetric
      n > 40: t-methods safe to use unless the data are very skewed; even very skewed data are fine if n is large enough (CLT)
ANOVA
  1. All populations are normally distributed
  2. The population variances are equal
  3. The observations are independent of one another

Multiple Regression: required conditions for the error variable ε
  1. The probability distribution of ε is normal
  2. The mean of ε is 0
  3. The standard deviation of ε is σε, which is constant for each value of x
  4. The errors are independent

Simple Linear Regression (Inference): required conditions for the error variable
  1. The probability distribution of ε is normal
  2. The mean of the distribution is 0; that is, E(ε) = 0
  3. The standard deviation of ε is σε, which is constant regardless of the value of x
  4. The value of ε associated with any particular value of y is independent of the ε associated with any other value of y

Chi-Square Tests
  1. Expected Cell Frequency Condition: all expected cell counts are at least 5, so that χ² is reliable

Comparing Two Means
  1. Independence Assumption: Randomization, 10%
  2. Normal Population Assumption
     a. n < 15: do not use Student's t if skewed
     b. n ≈ 40: okay if mildly skewed
     c. n > 40: CLT works unless data very skewed
  3. Independent Groups Assumption
     a. Two independent samples

Paired t-Test
  1. Paired Data Assumption
  2. Independence Assumption