Post on 17-Jan-2016

Foundation Statistics

Copyright Douglas L. Dean, 2015

Types of Variables

1. Input variables
– Might explain the outcome variable

2. Output variable
– The variable you want to explain or predict

Outline

1. One-way ANOVA – Does a category make a significant difference?

2. Bivariate statistics – Are numeric variables related to each other?

3. Plotting bivariate relationships and lines of best fit

4. Simple linear regression

One-way ANOVA

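Annotated Box-and-Whisker Plot

• Interquartile range (IQR): the distance between the first and third quartiles

• Outlier thresholds: 1.5 × the interquartile range beyond the first and third quartiles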
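
The outlier thresholds on a box-and-whisker plot can be sketched in a few lines of Python. This is only an illustration: the data are made up, the function name `outlier_thresholds` is hypothetical, and the "inclusive" quartile method is an assumption (statistics packages use several slightly different quartile conventions, so the thresholds can differ a little between tools).

```python
import statistics

def outlier_thresholds(values):
    """Return (lower, upper) thresholds placed 1.5 x IQR beyond the quartiles."""
    # quantiles(..., n=4) returns the three cut points Q1, median, Q3.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # interquartile range
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical sample, invented for this illustration.
data = [2, 4, 5, 6, 7, 8, 9, 10, 30]
lower, upper = outlier_thresholds(data)
outliers = [v for v in data if v < lower or v > upper]
print(lower, upper, outliers)
```

Any value outside the two thresholds would be drawn as an individual outlier point rather than inside the whiskers.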

Bivariate descriptive stats

• Slope (m)

• Correlation (r)

• Coefficient of Determination (R²)

Slope

Every straight line can be represented by an equation: y = mx + b

The slope ‘m’ describes both the direction and the steepness of the line.

Slope Tree

Zero Slope

Undefined Slope

How slope is calculated

A positive slope example

(0,1)

(3,3)

Another example

(0,1)

(3,4)

Negative Slope example

(3,0)

(1,4)
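
The slope through two points is the rise over the run, m = (y2 - y1) / (x2 - x1). A minimal Python sketch follows, assuming the pairs of points listed in the three examples above each define the line on that slide; the function name `slope` is chosen here for illustration.

```python
def slope(p1, p2):
    """Slope m of the line through two (x, y) points: rise over run."""
    (x1, y1), (x2, y2) = p1, p2
    if x2 == x1:
        # A vertical line has no run, so its slope is undefined.
        raise ValueError("vertical line: slope is undefined")
    return (y2 - y1) / (x2 - x1)

print(slope((0, 1), (3, 3)))  # positive slope example: rises 2 over a run of 3
print(slope((0, 1), (3, 4)))  # another example: rises 3 over a run of 3
print(slope((3, 0), (1, 4)))  # negative slope example: rises 4 over a run of -2
print(slope((0, 2), (5, 2)))  # a horizontal line has zero slope
```

The order of the two points does not matter, as long as the same order is used for the rise and the run.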

Correlation Coefficient

• The Pearson correlation coefficient (r) is a measure of the strength of the linear relationship between two numeric variables.

• The value of r ranges from -1 to 1
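
As a rough sketch of how r can be computed, the function below uses the standard Pearson formula: the sum of cross-products of deviations divided by the square root of the product of the two sums of squared deviations. The function name `pearson_r` and the hours/score data are invented for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: hours studied and exam score.
hours = [1, 2, 3, 4, 5]
score = [52, 60, 61, 70, 75]
print(round(pearson_r(hours, score), 3))   # strong positive relationship

print(pearson_r([1, 2, 3], [2, 4, 6]))     # points exactly on a rising line
print(pearson_r([1, 2, 3], [6, 4, 2]))     # points exactly on a falling line
```

The last two calls sit at the two ends of the range: a perfect rising line and a perfect falling line.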

Correlation Coefficient

[Figure: scale of r from -1 to 1, labeled "Stronger" toward both ends]

Which correlation is stronger? r = -.80 or r = .80?

Neither. They are the same strength.

Examples of Perfect Correlation

Examples of Strong Correlation

Examples of Weak Correlation

No Correlation

Ways to get r = 0

• Pure randomness

• Perfectly horizontal straight line (y never changes, so strictly r is undefined; in practice this is treated as no correlation)

Correlation

• Often the first thing we check to see whether a relationship between variables may exist

• Good place to start, not to finish

• Is a standardized value
– Units of measure are factored out

– So you can have one variable on a small scale and the related variable on a large scale. No matter the scales, r always falls between -1 and 1.
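
The point that units are factored out can be checked with a small sketch: rescaling either variable (metres to millimetres, dollars to thousands of dollars) leaves r unchanged. The data and variable names below are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data on very different scales.
height_m = [1.50, 1.60, 1.70, 1.80, 1.90]        # metres
income = [31000, 42000, 39000, 55000, 58000]     # dollars

# The same measurements expressed in other units.
height_mm = [h * 1000 for h in height_m]         # millimetres
income_k = [d / 1000 for d in income]            # thousands of dollars

r_a = pearson_r(height_m, income)
r_b = pearson_r(height_mm, income_k)
print(r_a, r_b)  # the two values agree up to floating-point rounding
```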

Limitations of Correlation

• Correlation measures linear association, not causality

• Correlation is only one important measure of a possible linear relationship

• Correlations lack statistical control for other possibly related variables

Correlation ≠ Causality

The Japanese eat very little fat and drink little red wine and suffer fewer heart attacks than the British or Americans.

The French eat a lot of fat and drink a lot of red wine and suffer fewer heart attacks than the British or Americans.

The Germans drink a lot of beer and eat a lot of sausages and fat and suffer fewer heart attacks than the British or Americans.

Conclusion:

Eat and drink what you like. Apparently it is speaking English that kills you.

Ambiguities in causality abound…

When Y and Z are correlated, the direction of causality might be

Y → Z or Z → Y

Ambiguities in causality abound…

When a correlation exists between Y and Z, causality might be

[Figure: alternative causal diagrams linking X, Y, and Z]

Statistical control

• The purpose of statistical control is to find the degree of association between two variables after removing the effects of other variables.

• Correlation lacks statistical control

• Many variables may exert influence on the variable being predicted. You cannot control for multiple influences with r alone.

• Some statistical and data mining methods give you statistical control
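
One simple illustration of statistical control is the first-order partial correlation, which estimates the association between two variables after removing the linear effect of a third. It is only one of several possible methods, and the slides do not name it. In this sketch the data are constructed so that X drives both Y and Z; the names `pearson_r` and `partial_r` are chosen here for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def partial_r(ys, zs, xs):
    """Correlation of ys and zs after removing the linear effect of xs."""
    r_yz = pearson_r(ys, zs)
    r_yx = pearson_r(ys, xs)
    r_zx = pearson_r(zs, xs)
    return (r_yz - r_yx * r_zx) / math.sqrt((1 - r_yx ** 2) * (1 - r_zx ** 2))

# Constructed data: Y and Z are each X plus their own unrelated noise.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [xi + e for xi, e in zip(x, [0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5])]
z = [xi + e for xi, e in zip(x, [0.5, 0.5, -0.5, -0.5, 0.5, 0.5, -0.5, -0.5])]

print(round(pearson_r(y, z), 3))     # Y and Z look strongly related
print(round(partial_r(y, z, x), 3))  # close to zero once X is controlled for
```

The plain correlation between Y and Z is high only because both follow X; with X held constant, little association remains.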

Sign (+ or -) of basic statistics

• Slope and r can be positive or negative
– Slope and r have the same sign

› If one is positive, so is the other

› If one is negative, so is the other

• R² is never negative (it ranges from 0 to 1)
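
Why the signs agree can be seen from the formulas: the least-squares slope is Sxy / Sxx and r is Sxy / sqrt(Sxx × Syy), so both take their sign from the same numerator Sxy, while squaring r removes the sign. A small sketch with made-up data follows; the function name `slope_and_r` is hypothetical.

```python
import math

def slope_and_r(xs, ys):
    """Return (least-squares slope, Pearson r) for paired data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    # Both results share the numerator sxy; the denominators are positive.
    return sxy / sxx, sxy / math.sqrt(sxx * syy)

# Hypothetical data: one rising and one falling relationship.
rising = slope_and_r([1, 2, 3, 4], [2, 3, 5, 6])
falling = slope_and_r([1, 2, 3, 4], [9, 7, 4, 3])

for m, r in (rising, falling):
    print(f"slope={m:.2f}  r={r:.3f}  r squared={r ** 2:.3f}")
```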

R-Squared (R²)

• R² is the coefficient of determination

• It is the proportion of the variance in Y attributable to the variance in X (if there is only one predictor), or to the set of X variables if more than one predictor is included in the model.

Calculation of R²

If only one predictor: R² = r²

r = .80, so R² = .80² = .64

With multiple input variables:
– The math to calculate R² is more complex
– That math is beyond the scope of this course
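
For the one-predictor case, the identity between the squared correlation and the coefficient of determination can be checked numerically. The sketch below fits a least-squares line, computes the coefficient of determination as 1 - SS_residual / SS_total, and compares it with r squared. The data and function names are made up for illustration.

```python
import math

def fit_line(xs, ys):
    """Least-squares slope m and intercept b for y = m*x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    return m, mean_y - m * mean_x

def r_squared(xs, ys, m, b):
    """Proportion of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]   # hypothetical predictor values
ys = [2, 4, 5, 4, 5]   # hypothetical outcome values

m, b = fit_line(xs, ys)
r = pearson_r(xs, ys)
r2_from_fit = r_squared(xs, ys, m, b)
print(round(r ** 2, 6), round(r2_from_fit, 6))  # the two values match
```

Here both routes give 0.6: the fitted line accounts for 60% of the variance in the outcome.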

Why we need more than just slope

• If we have the slope, why do we need r and R²?
– Relationships are rarely perfectly linear in their ability to predict.

– Slopes of “best-fit” lines do not give us a measure of variability in how x and y relate to each other

• Correlation (r) measures the strength of the linear association, that is, how tightly the points cluster around the line

• R² measures the proportion of variance in y attributable to all of the input variables included in the model
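
To make the point concrete, the sketch below builds two made-up data sets whose best-fit lines have the same slope but whose points scatter very differently around the line. The slope alone cannot tell them apart; r and its square can. All names and numbers are invented for illustration.

```python
import math

def slope_and_r(xs, ys):
    """Return (least-squares slope, Pearson r) for paired data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sxx, sxy / math.sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5, 6]

# Both outcomes follow y = 2x, with small and large scatter respectively.
# The scatter was chosen so that it does not change the fitted slope.
tight = [2.1, 3.8, 6.1, 8.1, 9.8, 12.1]
loose = [5, -2, 9, 11, 4, 15]

m_tight, r_tight = slope_and_r(xs, tight)
m_loose, r_loose = slope_and_r(xs, loose)

print(f"tight: slope={m_tight:.2f}  r={r_tight:.3f}  r squared={r_tight ** 2:.3f}")
print(f"loose: slope={m_loose:.2f}  r={r_loose:.3f}  r squared={r_loose ** 2:.3f}")
```

Both lines rise about 2 units of y per unit of x, yet predictions from the second data set would be far less reliable.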