Bivariate Data

Marc H. Mehlman

University of New Haven
Table of Contents

1 Bivariate Data

2 Scatterplots

3 Correlation

4 Two–Way Tables

5 Chapter #2 R Assignment



Bivariate Data

Bivariate data comes from measuring two aspects of the same item or individual. For instance,

(70, 178), (72, 192), (74, 184), (68, 181)

is a random sample of size four obtained from four male college students. The bivariate data gives the height in inches and the weight in pounds of each of the four students. The third student sampled is 74 inches tall and weighs 184 pounds.

Can one variable be used to predict the other? Do tall people tend to weigh more?

Definition

A response (or dependent) variable measures the outcome of a study. The explanatory (or independent) variable is the one that predicts the response variable.
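As a minimal sketch, the four-student sample can be entered in R as a data frame with one row per individual. The names `students`, `height` and `weight` are illustrative choices, and treating height as the explanatory variable is an assumption suggested by the question "do tall people tend to weigh more?".

```r
# Height (inches) and weight (pounds) of the four sampled students.
students <- data.frame(
  height = c(70, 72, 74, 68),
  weight = c(178, 192, 184, 181)
)

# Each row is one individual; row 3 is the third student sampled.
third <- students[3, ]
print(third)

# Response (weight) on the vertical axis, explanatory (height) on the horizontal axis.
plot(weight ~ height, data = students,
     xlab = "Height (inches)", ylab = "Weight (pounds)")
```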



Scatterplots

Student ID   Number of Beers   Blood Alcohol Content
     1              5                 0.10
     2              2                 0.03
     3              9                 0.19
     4              8                 0.12
     5              3                 0.04
     6              7                 0.095
     7              3                 0.07
     8              5                 0.06
     9              3                 0.02
    10              5                 0.05
    11              4                 0.07
    12              6                 0.10
    13              5                 0.085
    14              7                 0.09
    15              1                 0.01
    16              4                 0.05

Here we have two quantitative variables recorded for each of 16 students:

1. how many beers they drank

2. their resulting blood alcohol content (BAC)

Bivariate data

For each individual studied, we record data on two variables.

We then examine whether there is a relationship between these two variables: Do changes in one variable tend to be associated with specific changes in the other variable?



A scatterplot is used to display quantitative bivariate data.

Each variable makes up one axis. Each individual is a point on the graph.
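One way to draw this scatterplot in R is sketched below, using the beer and BAC values from the table (listed here by student ID, 1 through 16). The object name `drinks` and the axis labels are illustrative choices.

```r
# Beers and blood alcohol content (BAC) for students 1 to 16.
drinks <- data.frame(
  student = 1:16,
  beers   = c(5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4),
  bac     = c(0.10, 0.03, 0.19, 0.12, 0.04, 0.095, 0.07, 0.06,
              0.02, 0.05, 0.07, 0.10, 0.085, 0.09, 0.01, 0.05)
)

# One point per student: explanatory variable (beers) on the horizontal axis,
# response variable (BAC) on the vertical axis.
plot(bac ~ beers, data = drinks,
     xlab = "Number of beers", ylab = "Blood alcohol content",
     main = "BAC vs beers")
```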



> plot(trees$Girth~trees$Height,main="girth vs height")

[Figure: scatterplot titled "girth vs height"; horizontal axis trees$Height (about 65 to 85), vertical axis trees$Girth (about 8 to 20).]



How to scale a scatterplot

[Figure: four plots. Same data in all four plots.]

Both variables should be given a similar amount of space: the plot should be roughly square, and the points should occupy all the plot space (no blank space).



Interpreting scatterplots

After plotting two variables on a scatterplot, we describe the overall pattern of the relationship. Specifically, we look for …

Form: linear, curved, clusters, no pattern

Direction: positive, negative, no direction

Strength: how closely the points fit the “form”

… and clear deviations from that pattern:

Outliers of the relationship



Form

[Figure: example scatterplots labeled Linear, Nonlinear, and No relationship.]



Direction

Positive association: High values of one variable tend to occur together with high values of the other variable.

Negative association: High values of one variable tend to occur together with low values of the other variable.



Strength

The strength of the relationship between the two variables can be seen by how much variation, or scatter, there is around the main form.



Outliers

An outlier is a data value that has a very low probability of occurrence (i.e., it is unusual or unexpected).

In a scatterplot, outliers are points that fall outside of the overall pattern of the relationship.



Adding categorical variables to scatterplots

Two or more relationships can be compared on a single scatterplot when we use different symbols for groups of points on the graph.

The graph compares the association between thorax length and longevity of male fruit flies that are allowed to reproduce (green) or not (purple).

The pattern is similar in both groups (linear, positive association), but male fruit flies not allowed to reproduce tend to live longer than reproducing male fruit flies of the same size.
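The fruit fly data are not reproduced here, so the sketch below uses R's built-in `iris` dataset as a stand-in (an assumption made purely for illustration). It shows one way to give each category its own plotting symbol and color.

```r
# iris ships with R; Species is a categorical variable with three levels.
groups  <- iris$Species
palette <- c("purple", "darkgreen", "orange")

# Map each observation's category to a plotting symbol and a color.
symbols <- c(1, 2, 3)[as.integer(groups)]
colours <- palette[as.integer(groups)]

plot(Petal.Width ~ Petal.Length, data = iris, pch = symbols, col = colours,
     main = "Petal width vs petal length by species")
legend("topleft", legend = levels(groups), pch = 1:3, col = palette)
```

Plotting the groups on shared axes makes it possible to compare form, direction and strength across categories at a glance.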



Correlation

Definition

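Given the bivariate data $(x_1, y_1), \dots, (x_n, y_n)$, the sample correlation coefficient (sample Pearson product-moment correlation coefficient) is

$$ r \overset{\text{def}}{=} \frac{1}{n-1} \sum_{j=1}^{n} \left(\frac{x_j - \bar{x}}{s_x}\right) \left(\frac{y_j - \bar{y}}{s_y}\right). $$

The population correlation coefficient is denoted as

$$ \rho \overset{\text{def}}{=} \frac{1}{N} \sum_{j=1}^{N} \left(\frac{x_j - \mu_X}{\sigma_X}\right) \left(\frac{y_j - \mu_Y}{\sigma_Y}\right), $$

where the above sum is summed over the entire population of size $N$.

One thinks of $r$ as an estimator of $\rho$.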
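The definition of the sample correlation coefficient can be checked numerically. The sketch below applies it term by term to the built-in `trees` data and compares the result with R's `cor()`; the variable names `zx`, `zy` and `r_def` are illustrative.

```r
x <- trees$Height
y <- trees$Girth
n <- length(x)

# Standardize each variable with its sample mean and sample standard deviation.
# sd() uses the n - 1 denominator, matching s_x and s_y in the definition.
zx <- (x - mean(x)) / sd(x)
zy <- (y - mean(y)) / sd(y)

# Average product of standardized values, with n - 1 in the denominator.
r_def <- sum(zx * zy) / (n - 1)

print(r_def)
print(cor(x, y))   # built-in function, for comparison
```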



One can also use the formula

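$$ r = \frac{n\left(\sum_{j=1}^{n} x_j y_j\right) - \left(\sum_{j=1}^{n} x_j\right)\left(\sum_{j=1}^{n} y_j\right)}{\sqrt{\left[n\sum_{j=1}^{n} x_j^2 - \left(\sum_{j=1}^{n} x_j\right)^2\right] \left[n\sum_{j=1}^{n} y_j^2 - \left(\sum_{j=1}^{n} y_j\right)^2\right]}}. $$

R command: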

> cor(trees$Girth,trees$Height)

[1] 0.5192801
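As a sketch, the computational formula for r can be evaluated directly from the raw sums and compared with the `cor()` result shown above. The names `numerator`, `denominator` and `r_shortcut` are illustrative.

```r
x <- trees$Girth
y <- trees$Height
n <- length(x)

# Numerator: n * sum(x*y) - sum(x) * sum(y)
numerator <- n * sum(x * y) - sum(x) * sum(y)

# Denominator: square root of the product of the two bracketed terms
denominator <- sqrt((n * sum(x^2) - sum(x)^2) * (n * sum(y^2) - sum(y)^2))

r_shortcut <- numerator / denominator
print(r_shortcut)
```

This form needs only running sums, which is convenient for hand calculation, although `cor()` is the practical choice in R.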



The correlation coefficient measures the strength of any linear relationship between X and Y.

Properties of Correlation:

cor(X, Y) = cor(Y, X).

−1 ≤ r ≤ 1, and r is scale invariant: changing the units of measurement of X or Y does not change r.

If r is positive, there is a positive linear relationship between the two variables.

If r is negative, there is a negative linear relationship between the two variables.

The closer |r| is to one, the stronger the linear relationship between the two variables.

If |r| = 1 (i.e., r = 1 or −1), all the data points lie on a straight line.
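These properties can be illustrated numerically. The sketch below uses the built-in `trees` data for symmetry and scale invariance, and a small made-up set of points lying exactly on a line for the |r| = 1 case. The unit conversions (inches to centimeters, feet to meters) are an illustrative choice.

```r
r <- cor(trees$Girth, trees$Height)

# Symmetry: swapping the two variables leaves r unchanged.
r_swapped <- cor(trees$Height, trees$Girth)

# Scale invariance: convert girth from inches to centimeters
# and height from feet to meters.
r_metric <- cor(trees$Girth * 2.54, trees$Height * 0.3048)

# Made-up points exactly on a line: the sign of r follows the slope.
x <- 1:10
r_up   <- cor(x,  3 * x + 2)
r_down <- cor(x, -3 * x + 2)

print(c(r = r, swapped = r_swapped, metric = r_metric, up = r_up, down = r_down))
```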



r has no unit

[Figure: scatterplots with r = -0.75, including one with axes "standardized value of x (unitless)" and "standardized value of y (unitless)".]



r is not resistant to outliers

Correlations are calculated using means and standard deviations, and thus are NOT resistant to outliers.

Just moving one point away from the linear pattern here weakens the correlation from −0.91 to −0.75 (closer to zero).
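The effect of a single point can be reproduced with a small experiment. The data below are made up for illustration and are not the data behind the slide's figure, so the resulting values of r will differ from −0.91 and −0.75.

```r
# Made-up data with a strong negative linear pattern.
x <- 1:10
y <- c(20, 18, 17, 15, 13, 12, 10, 8, 7, 5)
r_before <- cor(x, y)

# Move a single point far away from the linear pattern.
y_moved <- y
y_moved[10] <- 20
r_after <- cor(x, y_moved)

print(c(before = r_before, after = r_after))
```

Only one of the ten points changed, yet r moves noticeably toward zero, because that point shifts both the mean and the standard deviation of y.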



Caution: Correlation is not Causation

Definition

When calculating correlation, a lurking variable is a third factor that explains the relationship between the two correlated variables.

Example (Lurking Variables)

There is a strong correlation between shoe size and reading skills among elementary school children. The lurking variable is · · ·

There is a strong correlation between the number of firefighters at a fire site and the amount of damage. The lurking variable is · · ·

Caution: Beware correlations based on averaged data. While there is a strong correlation between average age and average height among children, the correlation between age and height for individual children is much, much lower.



Definition

Two variables are confounded when their effects on the response variable cannot be distinguished from each other. The confounded variables can be either explanatory or lurking variables (or only work in the presence of each other).

The only way to distinguish between two confounded variables is to redesign the experiment.

Example

When I’m stressed, I get muscle cramps. However, when I’m stressed, I also drink lots of coffee and lose sleep. Are the cramps caused by stress, or coffee, or lack of sleep, or some combination of the above?

Example

A classic example of confounding: A study suggests that people who carry matches are more likely to develop lung cancer. Is it the matches or is there confounding here with a lurking variable?



Establishing causation

Establishing causation from an observed association can be done if:

1) The association is strong.

2) The association is consistent.

3) Higher doses are associated with stronger responses.

4) The alleged cause precedes the effect.

5) The alleged cause is plausible.

Lung cancer is clearly associated with smoking. What if a genetic mutation (lurking variable) caused people to both get lung cancer and become addicted to smoking? It took years of research and accumulated indirect evidence to reach the conclusion that smoking causes lung cancer.



Two–Way Tables

Given two random variables (the row variable and the column variable) that are categorical, data can be organized into an r × c two–way table. The number of categories for the row variable is r and the number of categories for the column variable is c. The grand total is the total number of bivariate data points considered. For instance, recording each individual’s gender (the column variable) and their perceived chances of getting rich (the row variable), one gets the following 5 × 2 two–way table:

                                 Female    Male
Almost no chance                     96      98
Some chance, but probably not       426     286
A 50–50 chance                      696     720
A good chance                       663     758
Almost certain                      486     597

The (i, j)th cell corresponds to a tally of all the individuals who gave the ith answer to the row variable question and the jth answer to the column variable question.



Given a two–way table, marginal distributions are just the distributions of the row and column random variables. The adjective “marginal” comes from adding row and column totals to the two–way table to make it easy to calculate the row and column distributions. For instance:

                                 Female    Male    Total
Almost no chance                     96      98      194
Some chance, but probably not       426     286      712
A 50–50 chance                      696     720    1,416
A good chance                       663     758    1,421
Almost certain                      486     597    1,083
Total                             2,367   2,459    4,826

allows us to see that the distribution of the row variable is

194/4,826, 712/4,826, 1,416/4,826, 1,421/4,826, 1,083/4,826.

The distribution of the column variable is 2,367/4,826, 2,459/4,826.
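A minimal sketch of the same calculation in R, using the counts from the two-way table. The object name `counts` and the dimension labels `Chance` and `Gender` are illustrative choices.

```r
counts <- as.table(matrix(
  c( 96,  98,
    426, 286,
    696, 720,
    663, 758,
    486, 597),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    Chance = c("Almost no chance", "Some chance, but probably not",
               "A 50-50 chance", "A good chance", "Almost certain"),
    Gender = c("Female", "Male")
  )
))

# Two-way table with row and column totals added in the margins.
print(addmargins(counts))

# Marginal distributions: row (or column) totals divided by the grand total.
row_marginal <- rowSums(counts) / sum(counts)
col_marginal <- colSums(counts) / sum(counts)
print(row_marginal)
print(col_marginal)
```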



The joint distribution of two categorical random variables is again the proportions of the total that correspond to each joint result. For instance, the joint distribution from the previous example is:

                                   Female        Male
Almost no chance                  96/4,826     98/4,826
Some chance, but probably not    426/4,826    286/4,826
A 50–50 chance                   696/4,826    720/4,826
A good chance                    663/4,826    758/4,826
Almost certain                   486/4,826    597/4,826
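In R, the joint distribution is each cell count divided by the grand total, which `prop.table()` computes when called without a margin. This is a sketch; the object names are illustrative.

```r
counts <- as.table(matrix(
  c( 96,  98,
    426, 286,
    696, 720,
    663, 758,
    486, 597),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    Chance = c("Almost no chance", "Some chance, but probably not",
               "A 50-50 chance", "A good chance", "Almost certain"),
    Gender = c("Female", "Male")
  )
))

# Joint distribution: every cell divided by the grand total (4,826).
joint <- prop.table(counts)
print(round(joint, 4))
```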



The conditional distribution of one random variable, given that the other random variable takes on a particular value, is the distribution of the first variable among only those outcomes for which the other variable equals that particular value. For instance, given:

                                 Female    Male    Total
Almost no chance                     96      98      194
Some chance, but probably not       426     286      712
A 50–50 chance                      696     720    1,416
A good chance                       663     758    1,421
Almost certain                      486     597    1,083
Total                             2,367   2,459    4,826

the conditional distribution of the row variable given that the column variable equals male is

98/2,459, 286/2,459, 720/2,459, 758/2,459, 597/2,459.

Similarly, the conditional distribution of the column variable given that the row variable equals “A good chance” is 663/1,421, 758/1,421.
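Conditional distributions can be sketched in R with the `margin` argument of `prop.table()`: `margin = 2` divides each column by its column total, and `margin = 1` divides each row by its row total. Object names are illustrative.

```r
counts <- as.table(matrix(
  c( 96,  98,
    426, 286,
    696, 720,
    663, 758,
    486, 597),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    Chance = c("Almost no chance", "Some chance, but probably not",
               "A 50-50 chance", "A good chance", "Almost certain"),
    Gender = c("Female", "Male")
  )
))

# Distribution of the row variable given Gender = Male.
given_male <- prop.table(counts, margin = 2)[, "Male"]

# Distribution of the column variable given Chance = "A good chance".
given_good <- prop.table(counts, margin = 1)["A good chance", ]

print(given_male)
print(given_good)
```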



Simpson’s Paradox

When studying the relationship between two variables, a lurking variable can reverse the direction of the relationship: the direction seen when the lurking variable is ignored is the opposite of the direction seen when it is taken into account.

The lurking variable creates subgroups, and failure to take these subgroups into consideration can lead to misleading conclusions regarding the association between the two variables.

An association or comparison that holds for all of several groups can reverse direction when the data are combined to form a single group. This reversal is called Simpson’s paradox.



Consider the acceptance rates for the following groups of men and women who applied to college.

[Table: acceptance of men and women applicants.]

A higher percentage of men were accepted: Is there evidence of discrimination?



Consider the acceptance rates when broken down by type of school.

[Tables: acceptance of men and women applicants, shown separately for the Business School and the Art School.]



Lurking variable: Applications were split between the Business School (240) and the Art School (320).

Within each school a higher percentage of women were accepted than men.

There is not any discrimination against women!

This is an example of Simpson’s paradox.

When the lurking variable (Type of School: Business or Art) is ignored, the data seem to suggest discrimination against women.

However, when the type of school is considered, the association is reversed and suggests discrimination against men.
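The reversal can be reproduced with a small numerical sketch. The counts below are hypothetical, chosen only to exhibit the paradox; they are not the counts from the slides, and the school labels "A" and "B" are placeholders.

```r
# Hypothetical counts (NOT the slide data), chosen to show the reversal.
applicants <- data.frame(
  school   = c("A", "A", "B", "B"),
  gender   = c("Men", "Women", "Men", "Women"),
  applied  = c(100, 400, 400, 100),
  accepted = c( 20, 100, 280,  75)
)

# Acceptance rate within each school (lurking variable taken into account).
applicants$rate <- applicants$accepted / applicants$applied
print(applicants)

# Acceptance rate with the schools combined (lurking variable ignored).
totals <- aggregate(cbind(applied, accepted) ~ gender, data = applicants, FUN = sum)
totals$rate <- totals$accepted / totals$applied
print(totals)
```

In this made-up example women have the higher acceptance rate within each school, yet men have the higher rate overall, because most women applied to the school that accepts fewer applicants.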



Chapter #2 R Assignment






1 Create a scatterplot of weight versus quarter mile times for the dataset “mtcars”. Assume that the independent variable is the quarter mile times and the dependent variable is the weight.

2 Find the correlation of weight versus quarter mile times for the dataset “mtcars”.

