Bivariate Data analysis

Upload: dwayne-shepherd

Post on 05-Jan-2016


Page 1: Bivariate Data analysis. Bivariate Data In this PowerPoint we look at sets of data which contain two variables. Scatter plotsCorrelation OutliersCausation

Bivariate Data analysis


Bivariate Data

In this PowerPoint we look at sets of data which contain two variables.

• Scatter plots
• Correlation
• Outliers
• Causation


Variables

• Quantitative (numerical): measurements and counts
    • Discrete
    • Continuous
• Qualitative (categorical): define groups
    • Ordinal (fall in natural order)
    • Categorical (no idea of order)

We are only going to consider quantitative variables in this AS.


Quantitative

Discrete
• Many repeated values
• Age groups
• Marks

Continuous
• Few repeated values
• Height
• Length
• Weight


Qualitative

Categorical
• Gender
• Religious denomination
• Blood types
• Sports numbers (e.g. he wears the number ‘8’ jersey)

Ordinal
• Grades
• Places in a race (e.g. 1st, 2nd, 3rd)


We often want to know if there is a relationship between two numerical variables.

A scatter plot, which gives a visual display of the relationship between two variables, provides a good starting point.


In a relationship involving two variables, if the values of one variable ‘depend’ on the values of another variable, then the former variable is referred to as the dependent (or response) variable and the latter variable is referred to as the independent (or explanatory) variable.

y - axis dependent (response) variable

x - axis independent (explanatory) variable


Consider data on ‘hours of study’ vs ‘test score’

Hours  Score    Hours  Score    Hours  Score
18     59       14     54       17     59
16     67       17     72       16     76
22     74       14     63       14     59
27     90       19     72       29     89
15     62       20     58       30     93
28     89       10     47       30     96
18     71       28     85       23     82
19     60       25     75       26     35
22     84       18     63       22     78
30     98       19     61


We may want to see if we could predict the test score (response variable) based on the hours of study (explanatory variable).

y - axis: Test score

x - axis: Hours of study


We look for a pattern in the way the points lie.

Certain patterns tell us about the relationship.

This is called correlation.

This point is an outlier.


We could describe the rest of the data as having a linear form.
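
One way to back up what the eye sees is to fit a straight line to the hours/score data and look for the point that sits furthest from it. The sketch below is only an illustration: the helper name fit_line and the "largest residual" rule are assumptions made here, not something the slides prescribe. The point it picks out is presumably the one marked as the outlier on the plot.

```python
# Hours of study and test scores, transcribed from the table above.
DATA = [
    (18, 59), (14, 54), (17, 59), (16, 67), (17, 72), (16, 76),
    (22, 74), (14, 63), (14, 59), (27, 90), (19, 72), (29, 89),
    (15, 62), (20, 58), (30, 93), (28, 89), (10, 47), (30, 96),
    (18, 71), (28, 85), (23, 82), (19, 60), (25, 75), (26, 35),
    (22, 84), (18, 63), (22, 78), (30, 98), (19, 61),
]


def fit_line(points):
    """Least-squares line y = intercept + slope * x (hypothetical helper)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(DATA)

# The point with the largest vertical distance from the fitted line.
outlier = max(DATA, key=lambda p: abs(p[1] - (intercept + slope * p[0])))
print("Furthest point from the line (hours, score):", outlier)
```

A large residual is a prompt to go back and look at the plot, not a rule for deleting data.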


Scatter plots
• Use hollow circles for points
• Label axes correctly with units
• What you want to predict goes on the y-axis (response variable)
• Title of graph
• No background; no gridlines
• Unless you need to show categories, no legend
• Show different categories on a single graph in different colours rather than on separate graphs.
• Adjust scale and size of font (14pt for pasting)
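
If you draw your plots with software, the checklist above could be followed along these lines. This is a sketch only: it assumes the matplotlib library is installed, the function name plot_study_scores and the axis wording are made up for illustration, and SAMPLE is just the first row of the hours/score table.

```python
SAMPLE = [(18, 59), (14, 54), (17, 59)]  # first row of the hours/score table


def plot_study_scores(points):
    """Scatter plot of test score against hours of study (hypothetical helper)."""
    import matplotlib.pyplot as plt  # assumed to be installed; imported only when called

    hours = [h for h, _ in points]
    scores = [s for _, s in points]

    fig, ax = plt.subplots()
    # Hollow circles: no fill colour, black outline.
    ax.scatter(hours, scores, facecolors="none", edgecolors="black")
    # The explanatory variable goes on the x-axis, the response on the y-axis.
    ax.set_xlabel("Hours of study (hours)", fontsize=14)
    ax.set_ylabel("Test score", fontsize=14)
    ax.set_title("Test score vs hours of study", fontsize=14)
    ax.grid(False)  # no gridlines
    return fig
```

Calling plot_study_scores(SAMPLE).savefig("scatter.png") would write the picture to a file; the file name is arbitrary.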


What to look for in your plot?

• Direction of the relationship - positive or negative
• Form of the graph - linear or curved
• The strength - whether it is strong, moderate or weak
• Scatter - constant scatter, a fan effect…
• Outliers
• Groupings


Page 22


What do you see in this scatter plot?

• There appears to be a linear trend.

• There appears to be moderate constant scatter.

• Negative Association.

• No outliers or groupings visible.

[Scatter plot: Mean January Air Temperatures for 30 New Zealand Locations. x-axis: Latitude (°S); y-axis: Temperature (°C)]


What do you see in this scatter plot?

• There appears to be a non-linear trend.

• There appears to be non-constant scatter about the trend line.

• Positive Association.

• One possible outlier (Large GDP, low % Internet Users).

[Scatter plot: % of population who are Internet Users vs GDP per capita for 202 Countries. x-axis: GDP per capita (thousands of dollars); y-axis: Internet Users (%)]


What do you see in this scatter plot?

• Two non-linear trends (Male and Female).

• Very little scatter about the trend lines

• Negative association until about 1970, then a positive association.

• Gap in the data collection (Second World War).

[Scatter plot: Average Age New Zealanders are First Married. x-axis: Year; y-axis: Age]


Rank these relationships from weakest (1) to strongest (4):

[Four scatter plots, numbered 1 to 4]


Describe these relationships
• Perfect, negative, linear relationship
• Perfect, positive, linear relationship
• No relationship
• Moderate, negative, linear relationship
• Weak, positive, linear relationship


Describe this relationship.


As the hours of study increase, the test score . . . .? . . .


Pearson’s product-moment correlation coefficient, r

Correlation measures the strength of the linear association between two quantitative variables.

r = -1    r = -0.7    r = -0.4    r = 0    r = 0.3    r = 0.8    r = 1

• r = -1: points fall exactly on a straight line
• r = 0: no linear relationship (uncorrelated)
• r = 1: points fall exactly on a straight line

The correlation coefficient may take any value between -1.0 and +1.0
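
To see the two ends of this scale in action, r can be worked out for points that lie exactly on a line. The function below is a minimal sketch of the usual product-moment calculation; the name pearson_r and the small data sets are made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson's product-moment correlation coefficient for paired lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [1, 2, 3, 4, 5]
rising = [2 * x + 1 for x in xs]    # points exactly on a rising line
falling = [10 - 3 * x for x in xs]  # points exactly on a falling line

print(pearson_r(xs, rising))   # at the top of the scale
print(pearson_r(xs, falling))  # at the bottom of the scale
```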


r - what does it tell you?

How close the points in the scatter plot come to lying on the line.

[Two scatter plots of y against x: one with r = 0.99, one with r = 0.57]


Interpreting r

• 0.75 to 1: Strong positive linear association
• 0.5 to 0.75: Moderate positive linear association
• 0.25 to 0.5: Weak positive linear association
• -0.25 to 0.25: No association or weak linear association
• -0.5 to -0.25: Weak negative linear association
• -0.75 to -0.5: Moderate negative linear association
• -1 to -0.75: Strong negative linear association
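
The bands above can be turned into a small lookup. This is a sketch, not an official rule: the function name describe_r is made up, and because neighbouring bands share their end values (0.75 appears in two of them), the code assumes a boundary value belongs to the stronger band.

```python
def describe_r(r):
    """Word description of r using the bands listed above (hypothetical helper)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    size = abs(r)
    if size < 0.25:
        return "no association or weak linear association"
    direction = "positive" if r > 0 else "negative"
    if size < 0.5:
        strength = "weak"
    elif size < 0.75:
        strength = "moderate"
    else:
        strength = "strong"  # assumption: 0.75 itself counts as strong
    return f"{strength} {direction} linear association"


print(describe_r(0.85))
print(describe_r(-0.6))
```

The words are only a guide; always check the scatter plot as well.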


Useful websites

• http://www.ruf.rice.edu/~lane/stat_sim/reg_by_eye/index.html Regression by eye

• http://istics.net/stat/Correlations/ Guessing

• http://illuminations.nctm.org/LessonDetail.aspx?ID=L455#whatif effect of outliers


Assumptions

• linear relationship between x and y
• continuous random variables
• the residuals must be normally distributed
• the observations (pairs of x and y values) must be independent of each other
• all individuals must be selected at random from the population
• all individuals must have equal chance of being selected


What is correlation?

A measure of the strength of a LINEAR association between two quantitative variables.


Sure, you can calculate a correlation coefficient for any pair of variables, but correlation measures the strength only of the linear association and will be misleading if the relationship is not linear.


Do you know that:

• Correlation applies only to quantitative variables. Check you know the units and what they measure.

• Outliers can distort the correlation dramatically.


Some facts about the correlation coefficient

• The sign gives the direction of the association.
• Correlation is always between -1 and 1.
• Correlation treats x and y symmetrically. The correlation of x and y is the same as the correlation of y with x.
• Correlation has no units and is generally given as a decimal.
• r is a multiple of the slope.
• Note: variables can have a strong association but still have a small correlation if the association isn’t linear.
• Correlation is sensitive to outliers. A single outlying value can make a small correlation large or make a large one small.
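
Two of these facts are easy to check numerically: swapping x and y leaves r unchanged, and so does changing the units. The sketch below uses made-up heights and foot sizes (not the class data) and a hypothetical helper called pearson_r.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for paired lists (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up measurements for illustration only.
foot_cm = [22, 24, 24, 26, 27]
height_cm = [150, 160, 165, 172, 180]
height_m = [h / 100 for h in height_cm]  # same heights in metres

r_xy = pearson_r(foot_cm, height_cm)
r_yx = pearson_r(height_cm, foot_cm)  # x and y swapped
r_m = pearson_r(foot_cm, height_m)    # units changed

print(math.isclose(r_xy, r_yx), math.isclose(r_xy, r_m))
```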


The sign gives the direction of the association.

Positive Negative


Correlation treats x and y symmetrically. The correlation of x and y is the same as the correlation of y with x.


r is a multiple of the slope
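
One standard way to read this statement: for the least-squares line, r equals the slope multiplied by the ratio of the spread in x to the spread in y, so r and the slope always share the same sign. The sketch below checks that relationship on a small made-up data set; the function name slope_and_r is an assumption for illustration.

```python
import math


def slope_and_r(xs, ys):
    """Return (least-squares slope, r, s_x / s_y) for paired lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    r = sxy / math.sqrt(sxx * syy)
    spread_ratio = math.sqrt(sxx / syy)  # s_x / s_y; the (n - 1) divisors cancel
    return slope, r, spread_ratio


# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 6]

slope, r, spread_ratio = slope_and_r(xs, ys)
print(math.isclose(r, slope * spread_ratio))
```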


Variables can have a strong association but still have a small correlation if the association isn’t linear.

Always plot the data before looking at the correlation!
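
A quick illustration of why: in the made-up data below, y is completely determined by x (y is x squared), yet the correlation coefficient comes out at zero because the pattern is a curve, not a line. The helper name pearson_r is hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for paired lists (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]  # a perfect curved (U-shaped) relationship

r = pearson_r(xs, ys)
print(r)  # essentially zero, even though y depends entirely on x
```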


Would it be OK to use a correlation coefficient to describe the strength of the relationship?

[Scatter plot: Distances of Planets from the Sun. x-axis: Position Number; y-axis: Distance (million miles)]

[Scatter plot: Reaction Times (seconds) for 30 Year 10 Students. x-axis: Non-dominant Hand; y-axis: Dominant Hand]

[Scatter plot: Mean January Air Temperatures for 30 New Zealand Locations. x-axis: Latitude (°S); y-axis: Temperature (°C)]

[Scatter plot: Average Weekly Income for Employed New Zealanders in 2001. x-axis: Female ($); y-axis: Male ($)]


Correlation is sensitive to outliers. A single outlying value can make a small correlation large or make a large one small.


You should be cautious in interpreting the correlation - these graphs all have the same correlation coefficient (0.817).


Data set 1


Data set 2


Data set 3


Data set 4
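
The transcript does not say which four data sets were plotted. A well-known group with this property is Anscombe's quartet, and it is assumed here purely for illustration: its four sets look completely different when plotted, yet each gives r of roughly 0.816, close to the value quoted above. The helper name pearson_r is hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for paired lists (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Anscombe's quartet (assumed stand-in for the four data sets on the slides).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
QUARTET = {
    "Data set 1": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "Data set 2": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "Data set 3": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "Data set 4": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

for name, (xs, ys) in QUARTET.items():
    print(name, round(pearson_r(xs, ys), 3))
```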


Outliers can distort the correlation dramatically. An outlier can make an otherwise small correlation look big or hide a large correlation. It can even give an otherwise positive association a negative correlation coefficient (and vice versa).
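
The sign flip can be shown with a tiny made-up example: five points on a rising line, then the same five points plus one extreme point far from the rest. The data and the helper name pearson_r are invented for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for paired lists (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up data: a perfectly rising pattern...
xs = [1, 2, 3, 4, 5]
ys = [1, 2, 3, 4, 5]
r_clean = pearson_r(xs, ys)

# ...and the same points with one extreme outlier added at (20, -20).
r_with_outlier = pearson_r(xs + [20], ys + [-20])

print(r_clean > 0, r_with_outlier < 0)
```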


What do you see in this scatterplot?

[Scatter plot: Height and Foot Size for 30 Year 10 Students. x-axis: Foot size (cm); y-axis: Height (cm)]

• Appears to be a linear trend, with a possible outlier (tall person with a small foot size).

• Appears to be constant scatter.

• Positive association.


What will happen to the correlation coefficient if the tallest Year 10 student is removed?

[Scatter plot: Height and Foot Size for 30 Year 10 Students. x-axis: Foot size (cm); y-axis: Height (cm)]

• It will get smaller
• It will get bigger
• It will stay the same


What do you see in this scatter plot?

• Appears to be a strong linear trend.

• Outlier in X (the elephant).

• Appears to be constant scatter.

• Positive association.

[Scatter plot: Life Expectancies and Gestation Period for a sample of non-human Mammals. x-axis: Gestation (Days); y-axis: Life Expectancy (Years); the elephant is labelled]


[Scatter plot: Life Expectancies and Gestation Period for a sample of non-human Mammals. x-axis: Gestation (Days); y-axis: Life Expectancy (Years); the elephant is labelled]

What will happen to the correlation coefficient if the elephant is removed?

• It will get smaller
• It will get bigger
• It will stay the same


How does the outlier affect the r - value?



When you see an outlier, it’s often a good idea to report the correlations with and without the point.
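
As a sketch of what reporting both could look like, the code below works out r for the hours of study and test score data twice: once with every student, and once without the student who studied 26 hours and scored 35 (assumed here to be the outlier marked on the earlier plot). The helper name pearson_r is hypothetical.

```python
import math

# Hours of study and test scores, transcribed from the earlier table.
DATA = [
    (18, 59), (14, 54), (17, 59), (16, 67), (17, 72), (16, 76),
    (22, 74), (14, 63), (14, 59), (27, 90), (19, 72), (29, 89),
    (15, 62), (20, 58), (30, 93), (28, 89), (10, 47), (30, 96),
    (18, 71), (28, 85), (23, 82), (19, 60), (25, 75), (26, 35),
    (22, 84), (18, 63), (22, 78), (30, 98), (19, 61),
]
OUTLIER = (26, 35)  # assumption: the point flagged as an outlier on the plot


def pearson_r(points):
    """Pearson's correlation coefficient for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    return sxy / math.sqrt(sxx * syy)


r_with = pearson_r(DATA)
r_without = pearson_r([p for p in DATA if p != OUTLIER])

print(f"r with the outlier:    {r_with:.2f}")
print(f"r without the outlier: {r_without:.2f}")
```

Here the single low score pulls the correlation down, so r is noticeably larger once that point is left out.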


Don’t confuse correlation with causation. Scatter plots and correlation never prove causation.


Using the information in the plot, can you suggest what needs to be done in a country to increase the life expectancy? Explain.

[Scatter plot: Life Expectancy and Availability of Doctors for a Sample of 40 Countries. x-axis: People per Doctor; y-axis: Life Expectancy]

Perhaps if you have fewer people per doctor (i.e. more doctors per person), then the life expectancy will increase.


Using the information in this plot, can you make another suggestion as to what needs to be done in a country to increase life expectancy?

[Scatter plot: Life Expectancy and Availability of Televisions for a Sample of 40 Countries. x-axis: People per Television; y-axis: Life Expectancy]

It looks like if you decrease the number of people per television (i.e. have more TVs per person), then the life expectancy will increase!


Can you suggest another variable that is linked to life expectancy and the availability of doctors (and televisions) which explains the association between the life expectancy and the availability of doctors (and televisions)?

Some measure of wealth of a country, e.g. average income per person or GDP.
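
A small made-up illustration of how a third variable can create such an association: in the sketch below, both the number of televisions and the life expectancy are built from a wealth index plus a little unrelated noise, and neither is built from the other. All numbers and names are invented; they are not the 40-country data from the plots.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for paired lists (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Synthetic "countries": a wealth index from 1 to 40 (invented for illustration).
wealth = list(range(1, 41))

# Each variable depends on wealth plus a small repeating wobble, not on the other.
tvs_per_1000 = [10 * w + (7 * i) % 11 - 5 for i, w in enumerate(wealth)]
life_expectancy = [50 + 0.7 * w + (3 * i) % 7 - 3 for i, w in enumerate(wealth)]

r = pearson_r(tvs_per_1000, life_expectancy)
print(f"r between TVs and life expectancy: {r:.2f}")
```

The correlation is strong even though, by construction, buying televisions does nothing for life expectancy here.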


Damaged for life by too much TV


• Watching too much television as a child causes serious health problems years later, and raises the risk of heart disease, a New Zealand study of 1000 children has found….

• It links the amount of time spent in front of the box as a child with obesity, high cholesterol, poor fitness and smoking….

[Scatter plot: Health Score against TV watching, r = -0.93]


Causal relationships

• Two general types of studies: experiments and observational studies

• In an experiment, the experimenter determines which experimental units receive which treatments.

• In an observational study, we simply compare units that happen to have received each of the treatments.


• Only properly designed and carefully executed experiments can reliably demonstrate causation.

• An observational study is often useful for identifying possible causes of effects, but it cannot reliably establish causation


• In observational studies, strong relationships are not necessarily causal relationships.

• Correlation does not imply causation.

• Be aware of the possibility of lurking variables.


Watch out for lurking variables. Damage ($) vs number of firemen would show a strong correlation, but damage doesn’t cause firemen and firemen do seem to cause damage (spraying water and chopping holes). The underlying variable is the size of the blaze.


Although there was plenty of evidence that increased smoking was associated with increased levels of lung cancer, it took years to provide evidence that smoking actually causes lung cancer.


It would be a good idea to read the two pages of notes you have that discuss correlation and causation!


So now you want to know how to calculate the correlation coefficient, r. Here is one version of the formula!
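
The formula on the original slide did not survive in this transcript. One standard version of Pearson's product-moment formula, which may differ in layout from the one shown on the slide, is given below. Here n is the number of (x, y) pairs, and x̄ and ȳ are the means of the x-values and y-values.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \; \sum_{i=1}^{n} (y_i - \bar{y})^2}}
```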


Luckily the computer will calculate R² and you can take the square root of this to get the size of r. Give r the same sign as the slope of the trend line. Remember: only when the association is linear.
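
Because a square root is never negative, the direction has to come from the trend line. A minimal sketch of that step, with a made-up function name and made-up numbers:

```python
import math


def r_from_r_squared(r_squared, slope):
    """Recover r from R² for a straight-line fit, taking the sign from the slope."""
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("R² must lie between 0 and 1")
    # copysign keeps the size of the square root and borrows the sign of the slope.
    return math.copysign(math.sqrt(r_squared), slope)


# Made-up example: R² = 0.64 from a trend line that slopes downwards.
print(r_from_r_squared(0.64, slope=-2.5))
```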


r measures the strength of the relationship, NOT R²!


The words you use

• There is a strong, positive, linear relationship between ‘x’ and ‘y’: when the x-values increase, the y-values increase also. This is indicated by the value of the correlation coefficient, i.e. r = 0.85, which is close to 1.

• (Note: Do not use ‘x’ and ‘y’; use what they represent.)