Foundation Statistics
Copyright Douglas L. Dean, 2015
Types of Variables
1. Input variables: might explain the outcome variable
2. Output variable: the variable you want to explain or predict
Outline
1. One-way ANOVA: Does a category make a significant difference?
2. Bivariate statistics: Are numeric variables related to each other?
3. Plotting bivariate relationships and lines of best fit
4. Simple linear regression
One-way ANOVA
[Figure: annotated box-whisker plot marking the interquartile range and the outlier thresholds at 1.5 × the interquartile range]
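The 1.5 × interquartile range outlier thresholds can be sketched in Python. This is a minimal illustration, not the course's own procedure: the data values are made up, and the quartiles come from the standard library's `statistics.quantiles`, whose default convention may differ slightly from the one a given statistics package uses.

```python
from statistics import quantiles

# Made-up sample for illustration only.
values = [2, 4, 4, 5, 6, 7, 8, 9, 30]

# Quartiles: q1 is the 25th percentile, q3 the 75th.
q1, _median, q3 = quantiles(values, n=4)
iqr = q3 - q1  # interquartile range (the width of the box)

# Outlier thresholds sit 1.5 x IQR beyond each end of the box.
lower_threshold = q1 - 1.5 * iqr
upper_threshold = q3 + 1.5 * iqr

outliers = [v for v in values if v < lower_threshold or v > upper_threshold]
print(outliers)
```

Points outside the two thresholds are the ones a box-whisker plot draws individually beyond the whiskers.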
Bivariate descriptive stats
• Slope (m)
• Correlation (r)
• Coefficient of Determination (R²)
Slope
Every straight line can be represented by an equation: y = mx + b
The slope ‘m’ describes both the direction and the steepness of the line.
Slope Tree
Zero Slope (horizontal line)
Undefined Slope (vertical line)
How slope is calculated
A positive slope example
[Figure: line through the points (0,1) and (3,3)]
Another example
[Figure: line through the points (0,1) and (3,4)]
Negative Slope example
[Figure: line through the points (3,0) and (1,4)]
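The slope between two points is the rise over the run: m = (y2 - y1) / (x2 - x1). As a small sketch, the three example point pairs above can be checked in Python. The function name `slope` is a hypothetical helper chosen for illustration.

```python
def slope(p1, p2):
    """Rise over run between two (x, y) points."""
    (x1, y1), (x2, y2) = p1, p2
    if x2 == x1:
        # A vertical line has no run, so its slope is undefined.
        raise ValueError("vertical line: slope is undefined")
    return (y2 - y1) / (x2 - x1)

positive = slope((0, 1), (3, 3))  # (3 - 1) / (3 - 0) = 2/3
another = slope((0, 1), (3, 4))   # (4 - 1) / (3 - 0) = 1
negative = slope((3, 0), (1, 4))  # (4 - 0) / (1 - 3) = -2

print(positive, another, negative)
```

The sign gives the direction of the line and the magnitude gives its steepness.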
Correlation Coefficient
• The Pearson Correlation Coefficient (r) is a measure of the strength of the linear relationship between two numeric variables.
• The value of r ranges from -1 to 1
Correlation Coefficient
[Figure: scale of r from -1 to 1; the correlation is stronger toward either end]
Which correlation is stronger? r = -.80 or r = .80?
Neither. They are the same strength.
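A short sketch can make this concrete. It computes Pearson's r from its definition for a made-up data set and for its mirror image (every y negated). The helper name `pearson_r` and the data are assumptions for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: co-variation divided by the product of spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5, 6]
rising = [2, 1, 4, 3, 6, 5]        # made-up values that trend upward
falling = [-y for y in rising]     # the same pattern flipped downward

r_up = pearson_r(xs, rising)
r_down = pearson_r(xs, falling)
print(round(r_up, 2), round(r_down, 2))
```

The two results differ only in sign. The sign gives the direction of the relationship, and the absolute value gives its strength.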
Examples of Perfect Correlation
Examples of Strong Correlation
Examples of Weak Correlation
No Correlation
Ways to get r = 0
• Pure randomness
• Perfectly horizontal straight line (slope is zero; strictly, r is undefined here because y does not vary)
Correlation
• Often the first thing we check to see whether a relationship between variables may exist
• Good place to start, not to finish
• Is a standardized value
– Units of measure are factored out
– So you can have one variable on a small scale and the related variable on a large scale. No matter the scales, the value of r will be between negative one and positive one.
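As a rough check of the "units are factored out" point, the sketch below computes r for the same made-up measurements expressed in two different sets of units. The variable names, the data, and the `pearson_r` helper are all illustrative assumptions.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation computed from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up measurements for illustration only.
heights_m = [1.50, 1.60, 1.70, 1.80, 1.90]
weights_kg = [52, 61, 66, 79, 85]

# The same measurements on much larger scales.
heights_mm = [h * 1000 for h in heights_m]
weights_g = [w * 1000 for w in weights_kg]

r_small_units = pearson_r(heights_m, weights_kg)
r_large_units = pearson_r(heights_mm, weights_g)
print(round(r_small_units, 4), round(r_large_units, 4))
```

Both calls give the same r (up to floating-point rounding), because rescaling a variable by a positive constant changes the numerator and the denominator of r by the same factor.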
Limitations of Correlation
• Correlation measures linear association, not causality
• Correlation is only one important measure of a possible linear relationship
• Correlations lack statistical control for other possibly related variables
Correlation ≠ Causality
The Japanese eat very little fat and drink little red wine and suffer fewer heart attacks than the British or Americans.
The French eat a lot of fat and drink a lot of red wine and suffer fewer heart attacks than the British or Americans.
The Germans drink a lot of beer and eat a lot of sausages and fat and suffer fewer heart attacks than the British or Americans.
Conclusion:
Eat and drink what you like. Apparently it is speaking English that kills you.
Ambiguities in causality abound…
When Y and Z are correlated, the direction of causality might be
Y → Z or Z → Y
Ambiguities in causality abound…
When a correlation exists between Y and Z, causality might be
[Figure: diagrams of possible causal arrangements among X, Y, and Z]
Statistical control
• The purpose of statistical control is to find the degree of association between two variables after removing the effects of other variables.
• Correlation lacks statistical control
• Many variables may exert influence on the variable being predicted. You cannot control for multiple influences with r alone
• Some forms of statistics and data mining methods give you statistical control
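One simple form of statistical control is a partial correlation: remove the linear effect of a third variable from both variables, then correlate what is left. The sketch below is an illustration under stated assumptions, not necessarily the method used later in the course. The data are simulated so that X drives both Y and Z, and the helper names `pearson_r` and `residuals` are hypothetical.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation computed from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def residuals(ys, xs):
    """What remains of ys after subtracting its least-squares line on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - m * mean_x
    return [y - (m * x + b) for x, y in zip(xs, ys)]

# Simulated data (an assumption): X influences both Y and Z,
# but Y and Z have no direct effect on each other.
random.seed(0)
x = [random.gauss(0, 1) for _ in range(500)]
y = [2 * xi + random.gauss(0, 0.5) for xi in x]
z = [3 * xi + random.gauss(0, 0.5) for xi in x]

r_raw = pearson_r(y, z)                                      # no control
r_controlled = pearson_r(residuals(y, x), residuals(z, x))   # X removed
print(round(r_raw, 2), round(r_controlled, 2))
```

In this simulation the raw correlation between Y and Z is strong, yet it falls to roughly zero once X is controlled for. That is the kind of situation r alone cannot reveal.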
Sign (+ or -) of Basic statistics
• Slope and r can be positive or negative
– Slope and r have the same sign
› If one is positive, so is the other
› If one is negative, so is the other
• R² is never negative (it ranges from 0 to 1)
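The shared sign follows from the formulas: the least-squares slope and r have the same numerator (the co-variation of x and y), and both denominators are positive. A minimal sketch with made-up data, using a hypothetical helper `slope_and_r`:

```python
from math import sqrt

def slope_and_r(xs, ys):
    """Return the least-squares slope and Pearson r for paired data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    # Same numerator (sxy) in both; only the positive denominators differ.
    return sxy / sxx, sxy / sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]
m_up, r_up = slope_and_r(xs, [1, 3, 2, 5, 4])      # upward trend
m_down, r_down = slope_and_r(xs, [5, 3, 4, 1, 2])  # downward trend

print(m_up, r_up, r_up ** 2)
print(m_down, r_down, r_down ** 2)
```

Squaring r discards the sign, which is why the squared value cannot be negative in either case.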
R-Squared (R²)
• R² is the coefficient of determination
• The proportion of the variance in Y attributable to the variance in X (if there is only one X), or to the set of X variables if more than one predictor is included in the model.
Calculation of R²
If only one predictor: R² = r²
Example: r = .80, so R² = .80² = .64
With multiple input variables, the math to calculate R² is more complex and is beyond the scope of this course.
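For the one-predictor case, the equality can be checked numerically. The sketch below fits a least-squares line to made-up data, computes R² as the share of variance the line explains (1 minus residual variation over total variation), and compares it with r squared. All names and data are illustrative assumptions.

```python
from math import sqrt

# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)

r = sxy / sqrt(sxx * syy)     # Pearson correlation

m = sxy / sxx                 # slope of the least-squares line
b = mean_y - m * mean_x       # intercept of the least-squares line
ss_residual = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
ss_total = syy                # total variation of y about its mean

r_squared = 1 - ss_residual / ss_total
print(round(r, 4), round(r ** 2, 4), round(r_squared, 4))

# The slide's own example: r = .80 gives R² = .64.
slide_example = round(0.80 ** 2, 2)
```

With a single predictor the two routes agree; with several predictors only the variance-explained route applies.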
Why we need more than just slope
• If we have the slope, why do we need r and R²?
– Relationships are rarely perfectly linear in their ability to predict.
– Slopes of “best-fit” lines do not give us a measure of variability in how x and y relate to each other
• Correlation (r) measures how closely the points follow a straight line (the strength of the linear association)
• R² measures the proportion of variance in y attributable to all of the input variables included in the model
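To illustrate why slope alone is not enough, the sketch below builds two made-up data sets whose best-fit lines have the same slope but whose points scatter very differently around the line. The offsets were chosen by hand so that they leave the fitted slope unchanged; the helper name `slope_and_r` is hypothetical.

```python
from math import sqrt

def slope_and_r(xs, ys):
    """Return the least-squares slope and Pearson r for paired data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sxx, sxy / sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5, 6]

# Every point lies exactly on the line y = 2x.
tight = [2 * x for x in xs]

# Same underlying line, plus hand-picked offsets that sum to zero and
# are symmetric about the middle, so the fitted slope does not move.
offsets = [3, -4, 1, 1, -4, 3]
scattered = [2 * x + e for x, e in zip(xs, offsets)]

m_tight, r_tight = slope_and_r(xs, tight)
m_scattered, r_scattered = slope_and_r(xs, scattered)
print(m_tight, round(r_tight, 2), round(r_tight ** 2, 2))
print(m_scattered, round(r_scattered, 2), round(r_scattered ** 2, 2))
```

Both fits report a slope of 2, but r and R² are noticeably lower for the scattered data, so x predicts y far less reliably there.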