Foundation Statistics

Copyright Douglas L. Dean, 2015

Types of Variables

1. Input variables

– Might explain the output variable

2. Output variable

– Variable you want to explain or predict

Outline

1. One-way ANOVA

– Does a category make a significant difference?

2. Bivariate statistics

– Are numeric variables related to each other?

3. Plotting bivariate relationships and lines of best fit

4. Simple linear regression

One-way ANOVA
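To make the question "does a category make a significant difference?" concrete, here is a minimal sketch of the F statistic that one-way ANOVA is built on. The function name and the three small groups are hypothetical, invented for illustration. The sketch stops at F itself: judging significance would still require comparing F against an F distribution with (k - 1, n - k) degrees of freedom.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of numeric groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k = len(groups)       # number of categories
    n = len(all_values)   # total number of observations

    # Between-group sum of squares: how far each group mean is from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of values around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within


# Hypothetical data: one numeric outcome measured in three categories.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat = one_way_anova_f(groups)
print(f_stat)
```

A large F means the group means differ by much more than the spread inside each group would suggest.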

Outlier Thresholds

1.5 × the interquartile range (IQR) beyond the first and third quartiles

Interquartile range: the distance between the first and third quartiles
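The 1.5 × IQR rule can be sketched in a few lines. The data below are hypothetical, and the "inclusive" quartile method is an assumption: tools differ in how they compute quartiles, so thresholds may vary slightly from one package to another.

```python
from statistics import quantiles

# Hypothetical sample with one unusually large value.
data = [2, 4, 4, 5, 6, 7, 8, 9, 30]

# Quartile conventions vary between tools; "inclusive" is one common choice.
q1, _median, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Thresholds sit 1.5 x IQR below the first quartile and above the third.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = [v for v in data if v < lower or v > upper]

print(lower, upper, outliers)
```

Values outside the two thresholds are the points a box-whisker plot draws individually beyond the whiskers.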

Annotated Box-whisker Plot

Bivariate descriptive stats

• Slope (m)

• Correlation (r)

• Coefficient of Determination (R²)

Slope

Every straight line can be represented by an equation: y = mx + b

The slope ‘m’ describes both the direction and the steepness of the line.

Slope Tree

Zero Slope

Undefined Slope

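How slope is calculated

Slope is rise over run: m = (y2 - y1) / (x2 - x1) for any two points (x1, y1) and (x2, y2) on the line.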

A positive slope example

(0,1)

(3,3)

Another example

(0,1)

(3,4)

Negative Slope example

(3,0)

(1,4)
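The three examples can be checked with a short sketch. It assumes each pair of points is given as (x, y) and uses rise over run, m = (y2 - y1) / (x2 - x1). The function name and the two extra points for the zero and undefined cases are hypothetical, added for illustration.

```python
def slope(p1, p2):
    """Slope of the line through two (x, y) points; None if the line is vertical."""
    (x1, y1), (x2, y2) = p1, p2
    if x2 == x1:
        return None  # vertical line: run is zero, so the slope is undefined
    return (y2 - y1) / (x2 - x1)


positive = slope((0, 1), (3, 3))   # rises 2 over a run of 3
another = slope((0, 1), (3, 4))    # rises 3 over a run of 3
negative = slope((3, 0), (1, 4))   # rises 4 while x decreases by 2

# Hypothetical points for the two special cases.
zero = slope((0, 2), (5, 2))       # horizontal line
undefined = slope((2, 0), (2, 5))  # vertical line

print(positive, another, negative, zero, undefined)
```

The order of the two points does not matter: swapping them flips the sign of both the rise and the run, so the ratio is unchanged.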

Correlation Coefficient

• The Pearson Correlation Coefficient (r) is a measure of the strength of the linear relationship between two numeric variables.

• The value of r ranges from -1 to 1
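A minimal sketch of how r can be computed from its definition: the sum of cross-deviations divided by the square root of the product of the two sums of squared deviations. The function name and the small data sets are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


xs = [1, 2, 3, 4, 5]
r_up = pearson_r(xs, [2, 4, 6, 8, 10])    # points on a rising line
r_down = pearson_r(xs, [10, 8, 6, 4, 2])  # points on a falling line
r_mixed = pearson_r(xs, [2, 1, 4, 3, 5])  # rising trend with scatter

print(r_up, r_down, r_mixed)
```

Points exactly on a rising line give r = 1, points exactly on a falling line give r = -1, and scattered points fall in between.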

Correlation Coefficient

Which correlation is stronger? r = -.80 or r = .80?

Neither. They are the same strength: strength is the magnitude of r, and the sign gives only the direction.

Examples of Perfect Correlation

Examples of Strong Correlation

Examples of Weak Correlation

No Correlation

Ways to get r = 0

• Pure randomness

• Perfectly horizontal straight line (strictly, r is undefined when y is constant, but there is no linear association to measure)

Correlation

• Often the first thing we check to see whether a relationship between variables may exist

• Good place to start, not to finish

• Is a standardized value

– Units of measure are factored out

– So you can have one variable on a small scale and the related variable on a large scale. Whatever the scales, the value of r always falls between -1 and 1.

Limitations of Correlation

• Correlation measures linear association, not causality

• Correlation is only one important measure of a possible linear relationship

• Correlations lack statistical control for other possibly related variables

Correlation ≠ Causality

The Japanese eat very little fat and drink little red wine and suffer fewer heart attacks than the British or Americans.

The French eat a lot of fat and drink a lot of red wine and suffer fewer heart attacks than the British or Americans.

The Germans drink a lot of beer and eat a lot of sausages and fat and suffer fewer heart attacks than the British or Americans.

Conclusion:

Eat and drink what you like. Apparently it is speaking English that kills you.

Ambiguities in causality abound…

When Y and Z are correlated, the direction of causality might be

Y → Z or Z → Y

Ambiguities in causality abound…

When a correlation exists between Y and Z, causality might involve a third variable X.

[Figure: diagrams of possible causal paths among X, Y, and Z]

Statistical control

• The purpose of statistical control is to find the degree of association between two variables after removing the effects of other variables.

• Correlation lacks statistical control

• Many variables may exert influence on the variable being predicted. You cannot control for multiple influences with r alone.

• Some statistical and data mining methods give you statistical control

Sign (+ or -) of basic statistics

• Slope and r can be positive or negative

– Slope and r have the same sign

› If one is positive, so is the other

› If one is negative, so is the other

• R² is never negative: it ranges from 0 to 1
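The shared sign follows from how the slope of the least-squares line relates to r. A short sketch of the reasoning, where s_x and s_y denote the sample standard deviations of x and y (symbols introduced here, not used in the slides) and the last relation assumes a single predictor:

```latex
m = r \,\frac{s_y}{s_x}, \qquad s_x > 0,\ s_y > 0
\;\Longrightarrow\; \operatorname{sign}(m) = \operatorname{sign}(r),
\qquad R^2 = r^2 \ge 0 .
```

Because the ratio of standard deviations is positive, multiplying r by it cannot change the sign, and squaring r removes the sign altogether.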

R-Squared (R²)

• R² is the coefficient of determination

• The proportion of the variance in Y attributable to X (if there is only one predictor) or to the set of X variables (if the model includes more than one predictor).

Calculation of R²

If only one predictor: R² = r²

If r = .80, then R² = .80² = .64

With multiple input variables

– The math to calculate R² is more complex

– Math beyond the scope of this course

Why we need more than just slope

• If we have the slope, why do we need r and R²?

– Relationships are rarely perfectly linear in their ability to predict.

– Slopes of “best-fit” lines do not give us a measure of variability in how x and y relate to each other.

• Correlation (r) measures how closely the points follow a straight line

• R² measures the proportion of variance in y attributable to all of the input variables included in the model