Statistics Cheat Sheet
For UBC Commerce class
Measures of Central Tendency
Mean (Average): x̄ = ∑x / n
Mean vs. Median
Mean > Median: skewed to the right
Mean < Median: skewed to the left
Long right tail: skewed to the right
Long left tail: skewed to the left
Boxplot: 1. Q1, Q2, Q3  2. Fences – outliers  3. Max, min
Q1 – First Quartile: 25th percentile
Q2 – Median: 50th percentile
Q3 – Third Quartile: 75th percentile
Upper Fence: Q3 + 1.5·IQR
Lower Fence: Q1 − 1.5·IQR
Range = x_max − x_min
Interquartile Range (IQR) = Q3 − Q1: spread of the middle 50% of the distribution
Sensitivity to Outliers
Sensitive to outliers – use when roughly symmetric (not skewed): mean, range, variance, SD
Resistant to outliers – use when skewed or outliers present: median, IQR
Population Standard Deviation: σ = √( ∑(x − μ)² / n ) or σ = √( ∑x²/n − μ² )
Sample Standard Deviation: s = √( ∑(x − x̄)² / (n − 1) ) or s = √( (∑x² − (∑x)²/n) / (n − 1) )
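A quick way to check these formulas is Python's standard-library statistics module; the data below is a made-up sample used only for illustration.

```python
# Hypothetical sample, for illustration only.
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)        # x̄ = ∑x / n
median = statistics.median(data)    # Q2
pop_sd = statistics.pstdev(data)    # σ: divides by n
sample_sd = statistics.stdev(data)  # s: divides by n − 1

q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles
iqr = q3 - q1                                 # spread of the middle 50%
```

Note that statistics.quantiles uses the "exclusive" method by default, so quartiles can differ slightly from hand/textbook methods.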
Linear Transformation
Adding a constant (a): center (mean, median) shifts by the constant; spread does not change
Multiplying by a constant (b): center and spread (SD, IQR) are multiplied by |b|; variance is multiplied by b²
Linear Model: ŷ = b0 + b1·x    New variable: x_new = a + b·x
o Center (c): c_new = a + b·c: plug the original c into the new formula
o Spread (d): d_new = |b|·d: plug the original d into the new formula
o Variance: s_new² = b²·s²
Normal Distribution
Empirical Rule: 68% of data within 1 SD of the mean, 95% within 2 SD, 99.7% within 3 SD
Normal Distribution: N(μ, σ); Standard Normal Distribution: N(0, 1)
For data: z = (y − ȳ)/s
For a model: z = (x − μ)/σ
For a percentile: x = μ + z·σ
P(x ≤ a) = P(z ≤ (a − μ)/σ)
P(x ≥ a) = 1 − P(x < a)
P(a < x ≤ b) = P(x ≤ b) − P(x ≤ a)
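These identities can be sketched with statistics.NormalDist; μ = 100, σ = 15 and the cutoffs a, b are made-up values.

```python
# Sketch: normal probabilities via z-scores; μ, σ, a, b are made up.
from statistics import NormalDist

mu, sigma = 100, 15
X = NormalDist(mu, sigma)
Z = NormalDist(0, 1)

a, b = 85, 115
p_left = X.cdf(a)                         # P(x ≤ a)
same_via_z = Z.cdf((a - mu) / sigma)      # P(z ≤ (a−μ)/σ): identical
p_right = 1 - X.cdf(a)                    # P(x ≥ a) = 1 − P(x < a)
p_between = X.cdf(b) - X.cdf(a)           # P(a < x ≤ b)
x_for_pct = mu + Z.inv_cdf(0.95) * sigma  # x = μ + zσ for a given percentile
```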
Sum of Random Variables
If independent: N(μ1, σ1) + N(μ2, σ2) = N(μ1 + μ2, √(σ1² + σ2²))
Sampling Distributions for Proportions: Categorical Data
p: true proportion, center of the histogram
p̂: sample proportion, varies from one sample to the next
SD(p̂) = √(pq/n)
p̂ ~ N(p, √(pq/n))
For CI: SE(p̂) = √(p̂q̂/n)
Sampling Distributions for Means: Quantitative Data
μ: population mean; ȳ: sample mean
SD(ȳ) = σ/√n (σ: population SD, known)
For CI: SE(ȳ) = s/√n (s: sample SD, used when σ unknown)
Statistical Inference
Properties of the Sampling Distribution of Sample Means
1. μ_x̄ = μ: mean of the sample means = population mean
2. σ_x̄ = σ/√n: SD of the sample means = population SD divided by √n
Central Limit Theorem
The relationship between the sampling distribution of sample means and the population the samples are taken from
If samples of n ≥ 25 are drawn from any population with mean μ and SD σ, then the sampling distribution of sample means approximates a normal distribution; the greater the sample size, the better the approximation
If the population is normally distributed, the sampling distribution of sample means is normally distributed for any sample size n
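The CLT can be simulated directly: sample means drawn from a non-normal (uniform) population center on μ with spread close to σ/√n. The population and sample size below are made up for the demo.

```python
# Sketch: CLT simulation with a uniform(0, 1) population (μ = 0.5, σ = 1/√12).
import random
import statistics

random.seed(1)
n = 25                      # sample size
means = [statistics.mean(random.uniform(0, 1) for _ in range(n))
         for _ in range(5000)]

emp_center = statistics.mean(means)   # ≈ μ = 0.5
emp_spread = statistics.stdev(means)  # ≈ σ/√n ≈ 0.2887/5 ≈ 0.0577
```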
If asked about one individual, use: z = (x − μ)/σ
If asked about the mean of a sample of n individuals, use: z = (x̄ − μ)/(σ/√n)
Hypothesis Testing Math for 1 Sample
Z-Test for a Population Mean
H0: μ = μ0
Ha: μ <, ≠, > μ0
σ known (from process/population); test statistic ~ N(0, 1)
z = (x̄ − μ0) / (σ/√n)
P-Value
Ha: μ < μ0: P(z < z0) = %
Ha: μ > μ0: P(z > z0) = 1 − P(z < z0)
Ha: μ ≠ μ0: 2 × P(z > |z0|)
Conclusion: based on α / CI
P-value < α: reject H0, accept Ha – sufficient evidence; significant; data unlikely under H0
P-value > α: fail to reject H0 – not sufficient evidence; not significant; data likely under H0
CI: x̄ ± z*·σ/√n
Sample size for a target ME: n = (z*·σ/ME)²
z*: critical value for the desired confidence level; ME: desired margin of error
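A worked sketch of the z-test and CI above, with NormalDist supplying the tail areas; μ0 = 50, σ = 10, x̄ = 53, n = 36 are made-up numbers.

```python
# Sketch: one-sample z-test for a mean, σ known; all numbers made up.
from math import sqrt
from statistics import NormalDist

mu0, sigma, xbar, n = 50, 10, 53, 36
z = (xbar - mu0) / (sigma / sqrt(n))   # z = (x̄ − μ0)/(σ/√n)

Z = NormalDist()
p_right = 1 - Z.cdf(z)                 # Ha: μ > μ0
p_two = 2 * (1 - Z.cdf(abs(z)))        # Ha: μ ≠ μ0

zstar = Z.inv_cdf(0.975)               # 95% critical value ≈ 1.960
ci = (xbar - zstar * sigma / sqrt(n),  # x̄ ± z*·σ/√n
      xbar + zstar * sigma / sqrt(n))
```

Here p_two ≈ 0.07 > 0.05 and the 95% CI contains μ0 = 50: the two ways of deciding agree.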
One-Sample t-Test
H0: μ = μ0
Ha: μ <, ≠, > μ0
σ unknown (use s from data/sample); test statistic ~ t(n−1)
t = (x̄ − μ0) / (s/√n)
P-Value (bracketed from the t-table)
Ha: μ < μ0: P(t(n−1) < t0): # < p-value < #
Ha: μ > μ0: P(t(n−1) > t0): # < p-value < #
Ha: μ ≠ μ0: double the bracket: 2·# < p-value < 2·#
Conclusion: based on α / CI
P-value < α: reject H0, accept Ha – sufficient evidence; significant; data unlikely under H0
P-value > α: fail to reject H0 – not sufficient evidence; not significant; data likely under H0
CI: x̄ ± t*(n−1)·s/√n
CI     z*
90%    1.645
95%    1.960
99%    2.576
Z-Test for a Population Proportion
H0: p = p0
Ha: p <, ≠, > p0
Test statistic ~ N(0, 1)
z = (p̂ − p0) / √(p0·q0/n)
P-Value
Ha: p < p0: P(z < z0) = %
Ha: p > p0: P(z > z0) = 1 − P(z < z0)
Ha: p ≠ p0: 2 × P(z > |z0|)
Conclusion: based on α / CI
P-value < α: reject H0, accept Ha – sufficient evidence; significant; data unlikely under H0
P-value > α: fail to reject H0 – not sufficient evidence; not significant; data likely under H0
CI: p̂ ± z*·√(p̂q̂/n)
SE(p̂) = √(p̂q̂/n)
ME = z*·√(p̂q̂/n)
Sample size for a target ME: n = (z*)²·p̂q̂ / (ME)²
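The same recipe for a proportion, including the sample-size formula; x = 58 successes out of n = 100 against p0 = 0.5, and the 3% target ME, are made up.

```python
# Sketch: one-proportion z-test, CI, and sample size; numbers made up.
from math import sqrt, ceil
from statistics import NormalDist

x, n, p0 = 58, 100, 0.5
phat = x / n
q0 = 1 - p0
z = (phat - p0) / sqrt(p0 * q0 / n)   # test statistic uses p0, q0

p_two = 2 * (1 - NormalDist().cdf(abs(z)))  # Ha: p ≠ p0

zstar = 1.960                          # 95% critical value
se = sqrt(phat * (1 - phat) / n)       # CI uses p̂, q̂
ci = (phat - zstar * se, phat + zstar * se)

# n = (z*)²·p̂q̂ / ME² for a 3% margin of error
n_needed = ceil(zstar**2 * phat * (1 - phat) / 0.03**2)
```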
Hypothesis Testing for 1 Sample: Qualitative
Null Hypothesis (H0): states a value of the population model parameter
  Skeptical claim: nothing is different; assumed true by default
Alternative Hypothesis (Ha): value of the population parameter we consider plausible if H0 is rejected
Two-Sided: H0: p = p0; Ha: p ≠ p0
  o P-value: probability of deviating in either direction from H0
One-Sided: H0: p = p0; Ha: p < p0 or p > p0
  o P-value: probability of deviating only in the direction of Ha away from H0
Alpha Levels: threshold for the P-value
Statistically significant/insignificant depending on the alpha level
  o Depends on how large the sample is
α                 One-Sided   Two-Sided
0.05 (95% CI)     1.645       1.960
0.01 (99% CI)     2.33        2.576
0.001 (99.9% CI)  3.09        3.29
P-Value: the value on which we base our decision; measures how likely the observed data are if H0 is true; the ultimate goal of the calculation
Conclusion: reject or fail to reject H0
Confidence Interval
CI: estimated range of values, calculated from the sample data, that is likely to include the unknown population parameter
Higher confidence → must capture the true value more often → wider interval
Smaller interval (less variability) → choose a larger sample
Estimate ± ME, where ME = z* × SE: extent of the interval on either side of the middle value
  o ME < 5% is acceptable
Level of Confidence: probability that the interval estimate contains the population parameter
  o 95% confidence level: one can be 95% confident that the population parameter is contained in the interval
Critical Value (z*): number of SEs we must stretch out on either side of the middle value
Types of Errors and Level of Significance
            Accept H0        Reject H0
H0 True     Correct          Type I error
H0 False    Type II error    Correct
Type I Error (false positive): H0 is rejected when it is true (drew an unusual sample)
  A healthy person is diagnosed with the disease; a jury convicts an innocent person; money is invested in a project that turns out not to be profitable
Type II Error (false negative): H0 is not rejected (fail to reject) when it is false
  An infected person is diagnosed healthy; a jury fails to convict a guilty person; money won't be invested in a project that would have been profitable
Power (1 − β): probability of detecting a false H0
Descriptive Statistics
Categorical Variable: descriptive responses
Quantitative Variable: measure of quantity (units)
Time-Series: variable measured at regular intervals over time
  Consistent time interval (months, weeks, …)
Cross-Sectional Data: several variables measured at the same point in time
  Exact time (e.g. every February at Starbucks)
Stem-Plot: shows the distribution of the data while keeping the specific data points
  Can calculate mean, quartiles, median, shape
Scatterplot: plots 2 quantitative variables
Histogram: shows the distribution of the data by breaking the range of values of a variable into intervals and displaying the count or proportion of observations that fall into each interval
Shapes of Distributions: symmetric, unimodal, bimodal, uniform, skewed
Data Collection: from direct observation or produced through experiments
Survey (response rate): personal interview, telephone interview, self-administered questionnaire
Sampling Plans
SRS: sample selected so that every possible sample with the same number of observations is equally likely to be selected
Stratified RS: separating the population into mutually exclusive sets, or strata, and drawing a SRS from each
  o Strata are internally homogeneous and different from one another
Cluster S: SRS of groups or clusters of elements
  o Clusters are internally heterogeneous and similar to one another (Vancouver vs. Toronto)
Systematic S: sample every kth unit in the population
Multi-Stage S: randomly choose clusters and randomly sample individuals within each cluster
Errors
Sampling Error: difference between sample and population that exists only because of the observations that happened to be selected for the sample
Non-Sampling Error: due to mistakes made in acquiring data or sample observations being selected improperly (more serious because it cannot be corrected by increasing the sample size)
  o Errors in acquiring data: recording incorrect responses
  o Nonresponse error: error (or bias) introduced when responses are not obtained from some members of the sample
  o Response bias: anything that influences responses
  o Voluntary response bias: a large group is invited to respond and all who do are counted
  o Selection bias: the sampling plan is such that some members of the target population cannot be selected for inclusion in the sample
  o Convenience sampling: include the individuals who are convenient
  o Under-coverage: some portion of the population is not sampled at all or has smaller representation
  o Answer phrasing / measurement errors: inaccurate responses
  o Pilot test: small trial run of the study to check that the method is okay
Probability and Random Variables
Probability: likelihood of a random phenomenon or chance behavior
0 ≤ P(A) ≤ 1
P(S) = P(certain event) = 1
P(A1 or A2 or …) = P(A1) + P(A2) + P(A3) + … if mutually exclusive
Interpreting Probability
Relative frequency approach: P(A) = (# of times event A occurred) / (# of times the experiment was run)
Classical approach: P(A) = (# of ways the event can occur) / (# of ways the experiment can occur)
Random Variables: variable that assigns a numerical result to an outcome of an event that is associated with chance
Discrete: can take only a finite or countably infinite number of values; Continuous: not discrete
Discrete conditions: 0 ≤ P(x) ≤ 1 and ∑P(x) = 1
Expected Value: mean of the probability distribution: μ = E(x) = ∑ x·P(x)
Variance: σ² = ∑(x − μ)²·P(x) or ∑x²·P(x) − μ²
If linear h(x) = ax + b: E(h(x)) = ∑h(x)·P(x); E(ax + b) = a·E(x) + b; Var(ax + b) = a²·Var(x)
If independent random variables: E(x + y) = E(x) + E(y); Var(x ± y) = Var(x) + Var(y)
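The discrete-variable formulas above can be sketched directly; the distribution below is made up.

```python
# Sketch: E(x), Var(x), and a linear transform for a made-up discrete RV.
xs = [0, 1, 2]            # possible values
ps = [0.25, 0.50, 0.25]   # P(x); must satisfy ∑P(x) = 1

assert abs(sum(ps) - 1) < 1e-12

mu = sum(x * p for x, p in zip(xs, ps))                  # E(x) = ∑ x·P(x)
var = sum((x - mu) ** 2 * p for x, p in zip(xs, ps))     # σ² = ∑(x−μ)²·P(x)
var_alt = sum(x**2 * p for x, p in zip(xs, ps)) - mu**2  # ∑x²P(x) − μ²

# Linear transform h(x) = ax + b: E = aμ + b, Var = a²σ²
a, b = 3, 2
e_h = a * mu + b
var_h = a**2 * var
```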
Hypothesis Testing for Comparing Two Means
Two Independent Means
Case 1: σ1² = σ2² (check: 1/2 ≤ s1/s2 ≤ 2)
Two Independent Means
Case 2: σ1² ≠ σ2²
Two Population Means
Matched Pairs Experiments
Hypothesis Testing for Comparing Two Means: Qualitative
Null Hypothesis (H0): there is no difference between the means or proportions (difference = 0)
Alternative Hypothesis (Ha): there is a difference between the means or proportions
P-Value: the value on which we base our decision; measures how likely the observed data are if H0 is true; the ultimate goal of the calculation
Conclusion: reject or fail to reject H0
If # ≤ P-value ≤ # and P-value ≥ α: do not reject H0: we cannot conclude there has been a significant reduction/increase/difference in the mean under study
If −z* < test statistic < z* (inside the critical values): do not reject H0: there is not sufficient evidence at the 5% level of significance of a reduction/increase/difference between the two means
Confidence Interval
Higher confidence → must capture the true value more often → wider interval
Smaller interval (less variability) → choose a larger sample
Level of Confidence
90% CI contains 0, e.g. (−#, #): we cannot conclude at the 10% level of significance that the means of x1 and x2 differ
90% CI: we are 90% confident that the interval (#, #) contains the true difference in means/proportions under study
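For the unequal-variance case (Case 2 above), the standard form is Welch's t statistic with the Satterthwaite df; the formula is supplied here as a sketch since the sheet does not spell it out, and the two samples are made up.

```python
# Sketch: Welch's t statistic for two independent means, σ1² ≠ σ2².
import statistics
from math import sqrt

g1 = [12.0, 14.0, 11.0, 13.0, 15.0]   # made-up sample 1
g2 = [10.0, 9.0, 11.0, 10.0, 12.0, 8.0]  # made-up sample 2

m1, m2 = statistics.mean(g1), statistics.mean(g2)
v1, v2 = statistics.variance(g1), statistics.variance(g2)
n1, n2 = len(g1), len(g2)

se = sqrt(v1 / n1 + v2 / n2)          # SE of (x̄1 − x̄2)
t = (m1 - m2) / se                    # H0: μ1 − μ2 = 0

# Welch–Satterthwaite df (round down when using tables)
df = (v1/n1 + v2/n2) ** 2 / ((v1/n1) ** 2 / (n1 - 1) + (v2/n2) ** 2 / (n2 - 1))
```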
Hypothesis Testing for Comparing Population Proportions
Two Separate Samples
H0: p1 − p2 = 0
Ha: p1 − p2 <, ≠, > 0
Estimate: p̂1 − p̂2, where p̂1 = x1/n1 and p̂2 = x2/n2
P-Value:
  P(z ≤ z0) = % (lower tail)
  P(z ≥ z0) = % (upper tail)
  Two-sided: 2 × (tail %)
Conclusion: based on α:
P-value < α: reject H0, accept Ha – sufficient evidence, significant, unlikely
P-value > α: fail to reject H0 – not sufficient evidence, not significant, likely
CI: (p̂1 − p̂2) ± z*·√( p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 )
CI (#, #) contains the hypothesized value: fail to reject H0
CI (#, #) does not contain the hypothesized value: reject H0
CI (−#, #) contains 0: cannot conclude a difference
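A sketch of the two-proportion test and CI; the counts are made up, and the pooled p̂ used in the test statistic is the standard practice under H0 (the CI uses the unpooled SE matching the formula above).

```python
# Sketch: two-proportion z-test and CI; x1/n1, x2/n2 are made up.
from math import sqrt
from statistics import NormalDist

x1, n1 = 45, 100
x2, n2 = 30, 100
p1, p2 = x1 / n1, x2 / n2

# Pooled proportion for the test (H0: p1 − p2 = 0)
pp = (x1 + x2) / (n1 + n2)
z = (p1 - p2) / sqrt(pp * (1 - pp) * (1 / n1 + 1 / n2))
p_two = 2 * (1 - NormalDist().cdf(abs(z)))

# CI: (p̂1 − p̂2) ± z*·√(p̂1q̂1/n1 + p̂2q̂2/n2), unpooled SE
zstar = 1.960
se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
ci = (p1 - p2 - zstar * se, p1 - p2 + zstar * se)
contains_zero = ci[0] < 0 < ci[1]   # True → cannot conclude a difference
```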
Chi-Square Tests: Analysis of Two-Way Tables
Goodness of Fit Test
Used to measure how well observed data fit what would be expected under specified conditions
H0: p1 = p2 = … = pk (the proportions specified under H0)
Ha: not all proportions are equal
χ² = ∑ (f_o − f_e)² / f_e = ∑ (Obs − Exp)² / Exp
  f_o = observed frequency
  f_e = expected frequency = n × p_i specified under H0
df = k − 1, where k is the number of categories/cells specified under H0
n: total number of observations
Conclusion: P-value:
P-value < α: reject H0, accept Ha – sufficient evidence, significant, unlikely
P-value > α: fail to reject H0 – not sufficient evidence, not significant, likely
χ² > χ²(α, k−1): reject H0
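A sketch of the goodness-of-fit statistic; the observed counts and the equal-thirds H0 are made up, and the critical value is read from a chi-square table.

```python
# Sketch: chi-square goodness-of-fit statistic; counts and H0 made up.
obs = [44, 56, 50]
n = sum(obs)
p0 = [1/3, 1/3, 1/3]                  # H0: p1 = p2 = p3
exp = [n * p for p in p0]             # f_e = n·p_i

chi2 = sum((o - e) ** 2 / e for o, e in zip(obs, exp))
df = len(obs) - 1                     # k − 1

# χ²(0.05, df = 2) = 5.991 from a chi-square table
reject = chi2 > 5.991
```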
Test of Independence
Used to determine whether the row and column variables in a two-way contingency table are independent or related
H0: no association between the row and column variables (independent)
Ha: the row and column variables are associated (not independent)
χ² = ∑ (f_o − f_e)² / f_e = ∑ (Obs − Exp)² / Exp
f_e = (row total × column total) / n
n: total number of observations in the table
df = (r − 1)(c − 1)
  r: number of rows; c: number of columns
Conclusion: P-value:
P-value < α: reject H0, accept Ha – sufficient evidence, significant, unlikely
P-value > α: fail to reject H0 – not sufficient evidence, not significant, likely
χ² > χ²(α, (r−1)(c−1)): reject H0
Homogeneity Test
Compares observed counts from 2 or more populations
Examines the samples to see whether they have the same proportions of some characteristic
H0: the populations have the same proportion of the characteristic
Ha: at least one of the populations has a different proportion
χ² = ∑ (f_o − f_e)² / f_e = ∑ (Obs − Exp)² / Exp
f_e = (row total × column total) / n
n: total number of observations in the table
df = (r − 1)(c − 1)
  r: number of rows; c: number of columns
Conclusion: P-value:
P-value < α: reject H0, accept Ha – sufficient evidence, significant, unlikely
P-value > α: fail to reject H0 – not sufficient evidence, not significant, likely
χ² > χ²(α, (r−1)(c−1)): reject H0
Random Sample   A            B            Total
A               (R1)(C1)/n   (R1)(C2)/n   R1
B               (R2)(C1)/n   (R2)(C2)/n   R2
Total           C1           C2           n
Example: Does the gender of a survey interviewer have an effect on survey responses by men?
If we reject H0, we conclude there is significant evidence that the proportion of men's responses differs between the two interviewer genders
Reject H0: we conclude that the variables are not independent; they are associated
Fail to reject H0: there is no evidence of an association
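A sketch of the independence test on a 2×2 table, building the expected counts from row/column totals as above; the counts are made up and the critical value comes from a chi-square table.

```python
# Sketch: chi-square test of independence on a made-up 2×2 table.
table = [[20, 30],    # row 1 observed counts
         [40, 10]]    # row 2 observed counts

row_tot = [sum(r) for r in table]
col_tot = [sum(c) for c in zip(*table)]
n = sum(row_tot)

# f_e = row total × column total / n
exp = [[rt * ct / n for ct in col_tot] for rt in row_tot]
chi2 = sum((table[i][j] - exp[i][j]) ** 2 / exp[i][j]
           for i in range(2) for j in range(2))
df = (2 - 1) * (2 - 1)                # (r−1)(c−1)

# χ²(0.05, df = 1) = 3.841 from a table
reject = chi2 > 3.841                 # reject → variables are associated
```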
Linear Regression (Scatterplots)
Correlation:
Relationship between two variables
Only measures the strength of a linear relationship/association
x – independent/explanatory/predictor variable; y – dependent/response/predicted variable
r = ∑(x − x̄)(y − ȳ) / √( ∑(x − x̄)² · ∑(y − ȳ)² )
OR r = ( n∑xy − (∑x)(∑y) ) / ( √(n∑x² − (∑x)²) · √(n∑y² − (∑y)²) )
Properties of the Correlation Coefficient
1. −1 ≤ r ≤ 1
2. rxy = ryx
3. Positive values = positive correlation; negative values = negative correlation
4. Strong correlation/linear relationship: closer to −1 or 1
5. Weak correlation/linear relationship: closer to 0
6. Only used on two quantitative variables
7. Calculated using means, SDs, z-scores
8. Not resistant to outliers
Describing Association
1. Form: is it linear, bell-shaped, curved, a cloud?
2. Direction: positive or negative?
3. Strength: spread
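Both formulas for r give the same value, which a small made-up dataset confirms.

```python
# Sketch: both forms of r on made-up data.
from math import sqrt

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# Definition form: ∑(x−x̄)(y−ȳ) / √(∑(x−x̄)²·∑(y−ȳ)²)
num = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
den = sqrt(sum((xi - xbar) ** 2 for xi in x)
           * sum((yi - ybar) ** 2 for yi in y))
r = num / den

# Computational form: (n∑xy − ∑x∑y) / (√(n∑x²−(∑x)²)·√(n∑y²−(∑y)²))
sx, sy = sum(x), sum(y)
sxy = sum(xi * yi for xi, yi in zip(x, y))
sxx, syy = sum(xi * xi for xi in x), sum(yi * yi for yi in y)
r2 = (n * sxy - sx * sy) / (sqrt(n * sxx - sx**2) * sqrt(n * syy - sy**2))
```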
Linear Model: ŷ = b0 + b1·x; ŷ: predicted y value for a given x value
Slope: y increases/decreases by the slope (y units) per 1 x unit
  Gets its sign from the correlation
  Gets its units from the ratio of the two SDs, so the units of the slope are a ratio of the units of the variables
b1 = r·Sy/Sx OR b1 = ( n∑xy − (∑x)(∑y) ) / ( n∑x² − (∑x)² )
Intercept: b0 is the predicted y when x = 0; the starting value for predictions
b0 = ȳ − b1·x̄
To find the slope and intercept we need:
  Correlation (r): tells us the strength of the linear association
  Means: tell us where to locate the line
  SDs: tell us the units
Predicting in SDs: for each SD above/below the mean in x, predict y to be r·SD above/below the mean in y
Correlation and the line: in a plot of the standardized variables, b1 = r and b0 = 0
ẑy = r·zx: for every SD above/below the mean we are in x, we predict y to be r SDs above/below the mean of y
Residual: observed (point) − predicted (line): e = y − ŷ
  Does the model make sense? How well does the line fit the data?
  o How much variation in y does our model explain? – coefficient of determination R²
  Negative e: ŷ is big (overestimate); positive e: ŷ is small (underestimate)
  Residuals vs. predicted values should show no pattern, no direction, no shape, mean = 0
Point Prediction: the value of ŷ obtained by plugging a value x* into the regression equation
  o We can only make predictions within the range of our data, not beyond it – going beyond is EXTRAPOLATION, which is unreliable
Coefficient of Determination
Measures the proportion of the variation in y that is explained by the variation in x
R²: fraction of the data's variation accounted for by the model; about R²% of the variation in y is explained by the variation in x
1 − R²: fraction of the variation left in the residuals
R² = 1 − ∑(y − ŷ)² / ∑(y − ȳ)² = r²
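The slope, intercept, and R² formulas can be sketched end to end on made-up data.

```python
# Sketch: least-squares line and R² from the formulas above; data made up.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

xbar = sum(xs) / n
ybar = sum(ys) / n

b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
     / sum((x - xbar) ** 2 for x in xs)   # slope (equals r·Sy/Sx)
b0 = ybar - b1 * xbar                     # intercept: b0 = ȳ − b1·x̄

yhat = [b0 + b1 * x for x in xs]          # predictions
sse = sum((y - yh) ** 2 for y, yh in zip(ys, yhat))
sst = sum((y - ybar) ** 2 for y in ys)
r_squared = 1 - sse / sst                 # fraction of variation explained
```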
Lurking Variable: a variable that is not among the explanatory or response variables but can influence the interpretation of relationships among them
  Ex. lung cancer and stained fingernails are associated, but the lurking variable is smoking
Rule of thumb for a realistic value of the SD: SD ≈ Range/6
Simple Linear Regression (Inference): Qualitative
First-Order Model: y = β0 + β1·x + ε
where: y = dependent variable; x = independent variable
β0 = y-intercept; β1 = slope of the line (rise/run); ε = error variable
Measures of Variation
(x, y) = data point; ȳ = sample mean; ŷ = predicted y-value
Total deviation: vertical distance y − ȳ
  SST = ∑(y − ȳ)² = ∑y² − (∑y)²/n
Explained deviation: vertical distance ŷ − ȳ
  SSR = ∑(ŷ − ȳ)² = b0∑y + b1∑xy − (∑y)²/n
Unexplained deviation: vertical distance y − ŷ
  SSE = ∑(y − ŷ)² = ∑y² − (b0∑y + b1∑xy)
Coefficient of Determination (R²): the amount of the variation in y that is explained by the regression line; the ratio of the explained variation to the total variation
  R² = SSR/SST = (SST − SSE)/SST = 1 − SSE/SST
Standard Error of Estimate (Se): a measure of the differences between the observed sample y-values and the predicted ŷ obtained from the regression equation
Sum of Squares for X (SSxx): the sum of the squared deviations of x
ASSESSING THE MODEL
Standard Deviation of the Error Variable (σε)
  If σε is large: some of the errors will be large, which implies that the model's fit is poor
  If σε is small: the errors tend to be close to their mean (which is 0), so the model fits well
Sum of Squares for Regression (SSR): measures the amount of variation in y that is explained by the variation in the independent variable x
  Variation in y = SSE + SSR, so SSE is the amount of variation in y that remains unexplained
R² = 1 − SSE/∑(yi − ȳ)² = (∑(yi − ȳ)² − SSE)/∑(yi − ȳ)² = SSR/∑(yi − ȳ)² = explained variation in y / variation in y
The greater the explained variation (the greater the SSR or R²), the better the model
Simple Linear Regression (Inference): Testing
Significance of Regression
Predictor Coefficient / Slope:
H0: β1 = 0: y is not linearly related to x, so the regression line is horizontal (slope 0)
Ha: β1 ≠ 0: a linear relationship exists
t = (b1 − β1) / s_b1
s_b1 = sε / √SSxx = sε / √((n−1)·sx²)
sε = √( SSE / (n−2) )
Conclusion: P-value:
P-value < α: reject H0, accept Ha – sufficient evidence, significant, unlikely
P-value > α: fail to reject H0 – not sufficient evidence, not significant, likely
CI: b1 ± t*(n−2)·s_b1: uses the t(n−2) distribution
df: ν = n − 2
|t| outside ±t*(n−2) from the chart: reject H0
Regression Equation
Prediction Interval: determines how closely ŷ matches the true value of y; predicting a single observation (individual value) – wider interval
  ŷ ± t*(n−2)·sε·√( 1 + 1/n + (xg − x̄)²/SSxx )
Confidence Interval: estimator of E(y); narrower than the prediction interval because there is less error in estimating a mean value than in predicting an individual value
  ŷ ± t*(n−2)·sε·√( 1/n + (xg − x̄)²/SSxx )
SSxx = (n−1)·sx²
(xg − x̄)²/SSxx: estimated error (grows as xg moves away from x̄)
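A sketch comparing the two half-widths at a given xg; the data are made up and t*(n−2) = 3.182 (95%, df = 3) is read from a t-table.

```python
# Sketch: prediction vs. confidence interval half-widths; data made up.
from math import sqrt

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
     / sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s_eps = sqrt(sse / (n - 2))                 # sε = √(SSE/(n−2))
ssxx = sum((xi - xbar) ** 2 for xi in x)    # SSxx = (n−1)·sx²

tstar = 3.182                               # t*(df = 3), 95%, from table
xg = 4
pi_half = tstar * s_eps * sqrt(1 + 1/n + (xg - xbar)**2 / ssxx)
ci_half = tstar * s_eps * sqrt(1/n + (xg - xbar)**2 / ssxx)
```

The extra "1 +" under the square root is exactly why the prediction interval is always wider than the confidence interval.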
Coefficient of Correlation
Data are observational; the two variables are bivariate normally distributed
Can test for a linear association between the 2 variables using a t-test
ρ: population coefficient of correlation; its estimate is the sample coefficient of correlation r
H0: ρ = 0: there is no linear relationship between the two variables
Ha: ρ ≠ 0
t = r·√( (n−2) / (1 − r²) )
df: ν = n − 2
Conclusion: P-value:
P-value < α: reject H0, accept Ha – sufficient evidence, significant, unlikely
P-value > α: fail to reject H0 – not sufficient evidence, not significant, likely
|t| outside ±t*(n−2) from the chart: reject H0
Multiple Regressions
y = β0 + β1·x1 + β2·x2 + … + βk·xk + ε
where: k = number of independent variables potentially related to the dependent variable
y = dependent variable; x1, x2, …, xk = independent variables; β0, β1, …, βk = coefficients; ε = error variable
Independent variables may be functions of other variables: x2 = x1², x5 = x3·x4, x7 = log(x6)
Meaning of a Regression Coefficient
  xi: with all other variables held constant, if xi (or the function of it in the model) increases by 1, the expected y increases/decreases by the xi coefficient
Adjusted R²: takes into account the sample size and the number of independent variables
  If k is large relative to n: unadjusted R² may be unrealistically high
Adjusted R² = 1 − [ SSE/(n−k−1) ] / [ ∑(yi − ȳ)²/(n−1) ]
CI / Test for Each Variable
  If P-value < α: we conclude that βi is greater than / smaller than / different from 0
Multiple Regressions Tests
Significance of Regression
Testing the validity of the model: F-test
  The F-test combines the t-tests into a single test
  Not affected by the problem of multicollinearity, which is when the independent variables are correlated with one another
H0: β1 = β2 = … = βk = 0: if true, none of the independent variables x1, x2, …, xk is linearly related to y, so the model is invalid
Ha: at least one βi ≠ 0: the model has some validity
F = [ (∑(yi − ȳ)² − SSE)/k ] / [ SSE/(n−k−1) ] = (SSR/k) / (SSE/(n−k−1)) = ((SST − SSE)/k) / (SSE/(n−k−1)) = MSR/MSE
sε = √( SSE/(n−k−1) )
Conclusion: P-value:
P-value < α: reject H0, accept Ha – sufficient evidence, significant, unlikely
P-value > α: fail to reject H0 – not sufficient evidence, not significant, likely
df (numerator): ν = k; df (denominator): ν = n − k − 1
F > F(α, k, n−k−1): reject H0 – the model is valid, the regression is significant
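The F statistic reduces to a few divisions once SST and SSE are known; the summary numbers below are made up, and the critical value is an approximate table lookup.

```python
# Sketch: F statistic from made-up summary numbers.
sst, sse = 120.0, 40.0   # total and unexplained sums of squares
k, n = 3, 30             # 3 predictors, 30 observations

ssr = sst - sse          # explained sum of squares
msr = ssr / k            # MSR = SSR/k
mse = sse / (n - k - 1)  # MSE = SSE/(n−k−1)
F = msr / mse

# F(0.05, 3, 26) ≈ 2.98 from an F-table
reject = F > 2.98        # reject → model is valid
```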
Significance of Each Variable
Testing the coefficients: t-tests
  t-tests of individual coefficients allow us to determine whether βi ≠ 0 (for i = 1, 2, …, k), i.e. whether a linear relationship exists between xi and y
  Using many t-tests instead of the F-test to test the validity of the model increases the probability of a Type I error
H0: βi = 0: no linear relationship between xi and y
Ha: βi ≠ 0
t = (bi − βi) / s_bi
df: ν = n − k − 1
|t| > t(α/2, n−k−1): reject H0 – the βi parameter is significant
Analysis of Variance (ANOVA)
One-Way ANOVA: one-way analysis of variance because we use a single property, or characteristic, for categorizing the populations
H0: μ1 = μ2 = μ3

Excel regression output:
Regression Statistics
  Multiple R: √R² = |r|, where r = correlation coefficient
  R Square: r² = SSR/SST
  Adjusted R Square: 1 − (1 − R²)·[(n−1)/(n−k−1)]
  Standard Error: Se = √MSE
  Observations: n

ANOVA        df     SS                MS                F         Significance F
Regression   1      SSR = SST − SSE   MSR = SSR/1       MSR/MSE   P(F > Fstat), Fstat = MSR/MSE
Residual     n−2    SSE = ∑(y − ŷ)²   se² = SSE/(n−2)
Total        n−1    SST = (n−1)·Sy²

             Coefficients    Standard Error                       t-Stat    P-value                  Lower 95%            Upper 95%
Intercept    b0 = ȳ − b1·x̄  SE(b0) = Se·√(1/n + x̄²/∑(xi−x̄)²)  Coef/SE   2×P(t(n−2) > |t-stat|)   b0 − t(n−2)·SE(b0)   b0 + t(n−2)·SE(b0)
X Variable   b1 = r·Sy/Sx    SE(b1) = Se/√∑(xi−x̄)²              Coef/SE   2×P(t(n−2) > |t-stat|)   b1 − t(n−2)·SE(b1)   b1 + t(n−2)·SE(b1)
ANOVA
A method of testing the equality of three or more population means by analyzing sample variances
We test the hypothesis by determining whether the variation between groups is larger than the variation within groups
H0: μ1 = μ2 = … = μk
Ha: at least one μj is different (not all the same)
F = MST/MSE: the test statistic involves the variation within groups and the variation among groups
If the differences among sample means are very large relative to the variation within groups, the numerator of the test statistic becomes larger than the denominator; large values of the test statistic suggest unequal means
F > F(k−1, N−k): reject H0

ANOVA                   SS                 df     MS                F         Significance F
Between (treatments)    SSG = SST − SSE    g−1    MSG = SSG/(g−1)   MSG/MSE   P(F > Fstat), Fstat = MSG/MSE
Within (error)          SSE = ∑(y − ȳj)²   N−g    MSE = SSE/(N−g)
Total                   SST                N−1

P-value = P(F(g−1, N−g) > F0); P-value < α: reject H0
g = number of groups; N = total number of sample observations
SST: total sum of squares
SSG: sum of squares between groups (treatments): represents the variation between the means of the groups
SSE: sum of squares within groups (error): represents the variation within a group due to random error
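The ANOVA table above can be sketched on three made-up groups; the critical value is an F-table lookup.

```python
# Sketch: one-way ANOVA F statistic for three made-up groups.
import statistics

groups = [[6.0, 8.0, 7.0, 7.0],
          [9.0, 11.0, 10.0, 10.0],
          [5.0, 4.0, 6.0, 5.0]]

g = len(groups)                          # number of groups
N = sum(len(grp) for grp in groups)      # total observations
grand = statistics.mean([v for grp in groups for v in grp])

# Between-groups (treatments) and within-groups (error) sums of squares
ssg = sum(len(grp) * (statistics.mean(grp) - grand) ** 2 for grp in groups)
sse = sum(sum((v - statistics.mean(grp)) ** 2 for v in grp) for grp in groups)

msg = ssg / (g - 1)                      # MSG = SSG/(g−1)
mse = sse / (N - g)                      # MSE = SSE/(N−g)
F = msg / mse

# F(0.05, 2, 9) ≈ 4.26 from an F-table
reject = F > 4.26                        # reject → not all means equal
```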
Conditions + Assumptions
Regression Lines (Correlation)
  Quantitative Variables Condition
  Linearity Condition
  Outlier Condition
  Equal Spread Condition: check that the spread is about the same throughout

Model for Sampling Distribution of Proportions (68-95-99.7)
  Independence Assumption
    o Randomization Condition
    o 10% Condition: n < 10% of the population
  Sample Size Assumption: the sample size n must be large enough
    o Success/Failure Condition: np > 10 and nq > 10

Model for Sampling Distribution of Means: z-score
  Independence Assumption
    o Randomization Condition
  Sample Size Assumption
    o 10% Condition
    o Large Enough Sample Condition: depends on the shape of the original data distribution

Confidence Intervals for One Proportion (one-proportion z-interval, one-proportion z-test)
  Independence Assumption
    o Randomization Condition
    o 10% Condition
  Sample Size Assumption (inference – CLT: need a large enough sampling model)
    o Success/Failure Condition: np̂ ≥ 10 and nq̂ ≥ 10

Sampling Distribution for a Mean: t-score
  Independence Assumption
    o Randomization Condition
    o 10% Condition
  Normal Population Assumption: Student's t-model won't work for data that are badly skewed
    o Nearly Normal Condition:
      n < 15: data should follow the normal model
      15 < n < 40: t-methods work well as long as the data are unimodal and symmetric
      n > 40: t-methods safe to use unless the data are very skewed; even very skewed data are fine if n is large enough (CLT)
ANOVA
  1. All populations are normally distributed
  2. The population variances are equal
  3. The observations are independent of one another

Multiple Regression: required conditions for the error variable ε
  1. The probability distribution of ε is normal
  2. The mean of ε is 0
  3. The standard deviation of ε is σε, which is constant for each value of x
  4. The errors are independent

Simple Linear Regression (Inference): required conditions for the error variable
  1. The probability distribution of ε is normal
  2. The mean of the distribution is 0; that is, E(ε) = 0
  3. The standard deviation of ε is σε, which is constant regardless of the value of x
  4. The value of ε associated with any particular value of y is independent of the ε associated with any other value of y

Chi-Square Tests
  1. Expected Cell Frequency Condition: all expected cell counts are at least 5, so that χ² is reliable

Comparing Two Means
  1. Independence Assumption: Randomization, 10%
  2. Normal Population Assumption
     a. n < 15: do not use Student's t if skewed
     b. n ≈ 40: okay if mildly skewed
     c. n > 40: CLT works unless data very skewed
  3. Independent Groups Assumption
     a. Two independent samples

Paired t-Test
  1. Paired Data Assumption
  2. Independence Assumption