Bivariate Data - University of New Haven (math.newhaven.edu/mhm/courses/estat/slides/bivardat.pdf)
TRANSCRIPT
Marc Mehlman
Bivariate Data
Marc H. Mehlman [email protected]
University of New Haven
Marc Mehlman (University of New Haven) Bivariate Data 1 / 36
Table of Contents
1 Bivariate Data
2 Scatterplots
3 Correlation
4 Two–Way Tables
5 Chapter #2 R Assignment
Bivariate Data
Bivariate data comes from measuring two aspects of the same item/individual. For instance,
(70, 178), (72, 192), (74, 184), (68, 181)
is a random sample of size four obtained from four male college students. The bivariate data gives the height in inches and the weight in pounds of each of the four students. The third student sampled is 74 inches tall and weighs 184 pounds.
Can one variable be used to predict the other? Do tall people tend to weigh more?
Definition
A response (or dependent) variable measures the outcome of a study. The explanatory (or independent) variable is the one that predicts the response variable.
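As a minimal sketch, the four (height, weight) pairs above could be stored in R as a data frame with one row per student. The names `students`, `height_in`, and `weight_lb` are illustrative and do not come from the slides; treating height as the explanatory variable and weight as the response is likewise an assumption suggested by the question "Do tall people tend to weigh more?".

```r
# Heights (inches) and weights (pounds) of the four sampled students.
# Column names are illustrative, not from the slides.
students <- data.frame(
  height_in = c(70, 72, 74, 68),
  weight_lb = c(178, 192, 184, 181)
)

# Each row is one individual; row 3 is the 74-inch, 184-pound student.
third <- students[3, ]
print(third)

# Assumed roles: height is explanatory (x-axis), weight is the response (y-axis).
plot(weight_lb ~ height_in, data = students,
     xlab = "Height (inches)", ylab = "Weight (pounds)")
```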
Scatterplots
Student ID   Number of Beers   Blood Alcohol Content
1            5                 0.1
2            2                 0.03
3            9                 0.19
6            7                 0.095
7            3                 0.07
9            3                 0.02
11           4                 0.07
13           5                 0.085
4            8                 0.12
5            3                 0.04
8            5                 0.06
10           5                 0.05
12           6                 0.1
14           7                 0.09
15           1                 0.01
16           4                 0.05
Here we have two quantitative variables recorded for each of 16 students:
1. how many beers they drank
2. their resulting blood alcohol content (BAC)
Bivariate data
For each individual studied, we record data on two variables.
We then examine whether there is a relationship between these two variables: Do changes in one variable tend to be associated with specific changes in the other variable?
A scatterplot is used to display quantitative bivariate data.
Each variable makes up one axis. Each individual is a point on the graph.
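The beer and blood alcohol data from the table can be plotted this way. The following is a sketch only: the data frame name `bac_data` and its column names are illustrative, and putting the number of beers on the horizontal axis assumes it is the explanatory variable.

```r
# Beer / BAC data for the 16 students, in the order listed in the table.
# Object and column names are illustrative, not from the slides.
bac_data <- data.frame(
  student = c(1, 2, 3, 6, 7, 9, 11, 13, 4, 5, 8, 10, 12, 14, 15, 16),
  beers   = c(5, 2, 9, 7, 3, 3, 4, 5, 8, 3, 5, 5, 6, 7, 1, 4),
  bac     = c(0.1, 0.03, 0.19, 0.095, 0.07, 0.02, 0.07, 0.085,
              0.12, 0.04, 0.06, 0.05, 0.1, 0.09, 0.01, 0.05)
)

# One point per student: beers on the x-axis, BAC on the y-axis.
plot(bac ~ beers, data = bac_data,
     xlab = "Number of beers", ylab = "Blood alcohol content",
     main = "BAC vs beers")
```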
> plot(trees$Girth~trees$Height,main="girth vs height")
[Figure: scatterplot titled "girth vs height", with trees$Height on the horizontal axis (ticks from 65 to 85) and trees$Girth on the vertical axis (ticks from 8 to 20).]
How to scale a scatterplot
Same data in all four plots.
Both variables should be given a similar amount of space:
Plot is roughly square.
Points should occupy all the plot space (no blank space).
Interpreting scatterplots
After plotting two variables on a scatterplot, we describe the overall
pattern of the relationship. Specifically, we look for …
Form: linear, curved, clusters, no pattern
Direction: positive, negative, no direction
Strength: how closely the points fit the “form”
… and clear deviations from that pattern
Outliers of the relationship
Form
Linear
Nonlinear
No relationship
Direction
Positive association: High values of one variable tend to occur together with high values of the other variable.
Negative association: High values of one variable tend to occur together with low values of the other variable.
Strength
The strength of the relationship between the two variables can be seen
by how much variation, or scatter, there is around the main form.
Outliers
An outlier is a data value that has a very low probability of occurrence
(i.e., it is unusual or unexpected).
In a scatterplot, outliers are points that fall outside of the overall pattern
of the relationship.
Adding categorical variables to scatterplots
Two or more relationships can be compared on a single scatterplot
when we use different symbols for groups of points on the graph.
The graph compares the association
between thorax length and longevity
of male fruit flies that are allowed to
reproduce (green) or not (purple).
The pattern is similar in both groups
(linear, positive association), but male
fruit flies not allowed to reproduce
tend to live longer than reproducing
male fruit flies of the same size.
Correlation
Definition
Given the bivariate data $(x_1, y_1), \dots, (x_n, y_n)$, the sample correlation coefficient (sample Pearson product-moment correlation coefficient) is

$$r \overset{\text{def}}{=} \frac{1}{n-1} \sum_{j=1}^{n} \left(\frac{x_j - \bar{x}}{s_x}\right)\left(\frac{y_j - \bar{y}}{s_y}\right).$$

The population correlation coefficient is denoted as

$$\rho \overset{\text{def}}{=} \frac{1}{N} \sum_{j=1}^{N} \left(\frac{x_j - \mu_X}{\sigma_X}\right)\left(\frac{y_j - \mu_Y}{\sigma_Y}\right),$$

where the sum is taken over the entire population of size N.
One thinks of r as an estimator of ρ.
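The definition of r can be checked numerically. The sketch below is illustrative only: the helper name `sample_cor` is hypothetical, and it uses R's built-in `trees` dataset. It standardizes each variable with the sample mean and sample standard deviation, sums the products, and divides by n − 1.

```r
# Sample correlation computed directly from the definition.
# `sample_cor` is a hypothetical helper name, not part of base R.
sample_cor <- function(x, y) {
  n  <- length(x)
  zx <- (x - mean(x)) / sd(x)   # standardized x values (sd() divides by n - 1)
  zy <- (y - mean(y)) / sd(y)   # standardized y values
  sum(zx * zy) / (n - 1)
}

r_def     <- sample_cor(trees$Girth, trees$Height)
r_builtin <- cor(trees$Girth, trees$Height)

# The two values should agree up to floating-point rounding.
print(c(definition = r_def, builtin = r_builtin))
```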
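One can also use the formula

$$r = \frac{n\left(\sum_{j=1}^{n} x_j y_j\right) - \left(\sum_{j=1}^{n} x_j\right)\left(\sum_{j=1}^{n} y_j\right)}{\sqrt{\left[n\sum_{j=1}^{n} x_j^2 - \left(\sum_{j=1}^{n} x_j\right)^2\right]\left[n\sum_{j=1}^{n} y_j^2 - \left(\sum_{j=1}^{n} y_j\right)^2\right]}}$$

R command: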
> cor(trees$Girth,trees$Height)
[1] 0.5192801
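As a rough check, the computational formula can be evaluated term by term on the same `trees` data and compared with `cor()`. The variable names `num`, `den`, and `r_short` are illustrative.

```r
# Evaluate the computational ("shortcut") formula on the trees data.
x <- trees$Girth
y <- trees$Height
n <- length(x)

# Numerator: n * sum(x*y) - sum(x) * sum(y)
num <- n * sum(x * y) - sum(x) * sum(y)

# Denominator: square root of the product of the two bracketed terms
den <- sqrt((n * sum(x^2) - sum(x)^2) * (n * sum(y^2) - sum(y)^2))

r_short <- num / den
print(r_short)   # should match cor(trees$Girth, trees$Height)
```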
The correlation coefficient measures the strength of any linear relationship between X and Y.
Properties of Correlation:
cor(X, Y) = cor(Y, X).
−1 ≤ r ≤ 1, and r is scale invariant.
if r is positive there is a positive linear relationship between the two variables.
if r is negative there is a negative linear relationship between the two variables.
the closer |r| is to one, the stronger the linear relationship between the two variables.
if |r| = 1 (i.e., r = 1 or −1), all the data points lie on a straight line.
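Several of these properties can be illustrated in R. The sketch below uses the built-in `trees` data, where girth is recorded in inches and height in feet, plus small made-up vectors for the straight-line case. All object names are illustrative.

```r
x <- trees$Girth    # inches
y <- trees$Height   # feet

# Symmetry: cor(X, Y) = cor(Y, X)
r_xy <- cor(x, y)
r_yx <- cor(y, x)

# Scale invariance: converting inches to cm and feet to m leaves r unchanged
r_scaled <- cor(2.54 * x, 0.3048 * y)

# |r| = 1 when all points lie on a straight line (made-up data)
t <- 1:10
r_line_up   <- cor(t, 3 + 2 * t)   # increasing line
r_line_down <- cor(t, 3 - 2 * t)   # decreasing line

print(c(r_xy, r_yx, r_scaled, r_line_up, r_line_down))
```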
r has no unit
[Figure: two scatterplots of the same data, each with r = -0.75; the second has axes "standardized value of x (unitless)" and "standardized value of y (unitless)".]
r is not resistant to outliers
Correlations are calculated using means and standard deviations, and thus are NOT resistant to outliers.
Just moving one point away from the linear pattern here weakens the correlation from −0.91 to −0.75 (closer to zero).
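The same effect can be reproduced with a small synthetic example. The data below are made up for illustration and are not the data behind the slide's figure: ten points lie exactly on a decreasing line, and then one point is moved off the line.

```r
# Synthetic data: ten points exactly on a decreasing line.
x <- 1:10
y <- -x
r_before <- cor(x, y)      # perfectly linear, negative direction

# Move only the last point away from the linear pattern.
y_moved <- y
y_moved[10] <- 0
r_after <- cor(x, y_moved)

# The correlation is pulled toward zero by the single moved point.
print(c(before = r_before, after = r_after))
```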
Caution: Correlation is not Causation
Definition
When calculating correlation, a lurking variable is a third factor that explains the relationship between the two correlated variables.
Example (Lurking Variables)
There is a strong correlation between shoe size and reading skills among elementary school children. The lurking variable is · · ·
There is a strong correlation between the number of firefighters at a fire site and the amount of damage. The lurking variable is · · ·
Caution: Beware correlations based on averaged data. While there is a strong correlation between average age and average height among children, the correlation between age and height for individual children is much, much lower.
Definition
Two variables are confounded when their effects on the response variable cannot be distinguished from each other. The confounded variables can be either explanatory or lurking variables (or only work in the presence of each other).
The only way to distinguish between two confounded variables is to redesign the experiment.
Example
When I’m stressed, I get muscle cramps. However, when I’m stressed, I also drink lots of coffee and lose sleep. Are the cramps caused by stress, or coffee, or lack of sleep, or some combination of the above?
Example
A classic example of confounding: A study suggests that people who carry matches are more likely to develop lung cancer. Is it the matches or is there confounding here with a lurking variable?
Establishing causation
Establishing causation from an observed association can be done if:
1) The association is strong.
2) The association is consistent.
3) Higher doses are associated with stronger responses.
4) The alleged cause precedes the effect.
5) The alleged cause is plausible.
Lung cancer is clearly associated with smoking.
What if a genetic mutation (lurking variable) caused people to both get lung cancer and become addicted to smoking?
It took years of research and accumulated indirect evidence to reach the conclusion that smoking causes lung cancer.
Two–Way Tables
Given two random variables (the row variable and the column variable) that are categorical, data can be organized into an r × c two–way table. The number of categories for the row variable is r and the number of categories for the column variable is c. The grand total is the total number of bivariate data points considered. For instance, recording an individual’s gender (the column variable) and their perceived chances of getting rich (the row variable), one gets the following 5 × 2 two–way table:

                                 Female   Male
Almost no chance                 96       98
Some chance, but probably not    426      286
A 50–50 chance                   696      720
A good chance                    663      758
Almost certain                   486      597

The ij-th cell corresponds to a tally of all the individuals who gave the i-th answer to the row variable question and the j-th answer to the column variable question.
Given a two–way table, marginal distributions are just the distributions of the row and column random variables. The adjective “marginal” comes from adding row and column totals to the two–way table to make it easy to calculate the row and column distributions. For instance:

                                 Female   Male    Total
Almost no chance                 96       98      194
Some chance, but probably not    426      286     712
A 50–50 chance                   696      720     1,416
A good chance                    663      758     1,421
Almost certain                   486      597     1,083
Total                            2,367    2,459   4,826

allows us to see the distribution of the row variable is
194/4,826, 712/4,826, 1,416/4,826, 1,421/4,826, 1,083/4,826.
The distribution of the column variable is 2,367/4,826, 2,459/4,826.
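The marginal distributions can be computed in R from the table of counts. This is a sketch under assumptions: the object name `rich` and the dimension names are illustrative, and the row label uses a plain hyphen in "50-50" for convenience.

```r
# Two-way table of counts (rows: perceived chance, columns: gender).
rich <- matrix(
  c(96, 426, 696, 663, 486,     # Female column
    98, 286, 720, 758, 597),    # Male column
  ncol = 2,
  dimnames = list(
    Chance = c("Almost no chance", "Some chance, but probably not",
               "A 50-50 chance", "A good chance", "Almost certain"),
    Gender = c("Female", "Male")
  )
)

grand_total  <- sum(rich)                    # 4,826 respondents
row_marginal <- rowSums(rich) / grand_total  # distribution of the row variable
col_marginal <- colSums(rich) / grand_total  # distribution of the column variable

print(row_marginal)
print(col_marginal)
```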
The joint distribution of two categorical random variables is again the proportions of the total that correspond to each joint result. For instance, the joint distribution from the previous example is:

                                 Female      Male
Almost no chance                 96/4,826    98/4,826
Some chance, but probably not    426/4,826   286/4,826
A 50–50 chance                   696/4,826   720/4,826
A good chance                    663/4,826   758/4,826
Almost certain                   486/4,826   597/4,826
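In R, the joint distribution is each cell count divided by the grand total, which `prop.table()` computes when no margin is given. The object names below are illustrative.

```r
# Two-way table of counts (rows: perceived chance, columns: gender).
rich <- matrix(
  c(96, 426, 696, 663, 486,     # Female column
    98, 286, 720, 758, 597),    # Male column
  ncol = 2,
  dimnames = list(
    Chance = c("Almost no chance", "Some chance, but probably not",
               "A 50-50 chance", "A good chance", "Almost certain"),
    Gender = c("Female", "Male")
  )
)

# Joint distribution: every cell divided by the grand total (4,826).
joint <- prop.table(rich)
print(round(joint, 4))
```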
The conditional distribution of one of the random variables, given that the other random variable takes on a particular value, is the proportion of outcomes the random variable takes on given that the other random variable is that particular value. For instance, given:

                                 Female   Male    Total
Almost no chance                 96       98      194
Some chance, but probably not    426      286     712
A 50–50 chance                   696      720     1,416
A good chance                    663      758     1,421
Almost certain                   486      597     1,083
Total                            2,367    2,459   4,826

the conditional distribution of the row variable given that the column variable equals male is
98/2,459, 286/2,459, 720/2,459, 758/2,459, 597/2,459.
Similarly, the conditional distribution of the column variable given that the row variable equals “A good chance” is 663/1,421, 758/1,421.
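Both conditional distributions can be obtained with `prop.table()` by conditioning on a margin: `margin = 2` divides each column by its column total, and `margin = 1` divides each row by its row total. The object names are illustrative.

```r
# Two-way table of counts (rows: perceived chance, columns: gender).
rich <- matrix(
  c(96, 426, 696, 663, 486,     # Female column
    98, 286, 720, 758, 597),    # Male column
  ncol = 2,
  dimnames = list(
    Chance = c("Almost no chance", "Some chance, but probably not",
               "A 50-50 chance", "A good chance", "Almost certain"),
    Gender = c("Female", "Male")
  )
)

# Row variable given gender = Male: Male column divided by 2,459.
given_male <- prop.table(rich, margin = 2)[, "Male"]

# Column variable given "A good chance": that row divided by 1,421.
given_good <- prop.table(rich, margin = 1)["A good chance", ]

print(given_male)
print(given_good)
```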
Simpson’s Paradox
When studying the relationship between two variables, there may exist a lurking variable that creates a reversal in the direction of the relationship when the lurking variable is ignored as opposed to the direction of the relationship when the lurking variable is considered.
The lurking variable creates subgroups, and failure to take these subgroups into consideration can lead to misleading conclusions regarding the association between the two variables.
An association or comparison that holds for all of several groups can reverse direction when the data are combined to form a single group. This reversal is called Simpson’s paradox.
Consider the acceptance rates for the following groups of men and women who applied to college.
A higher percentage of men were accepted: Is there evidence of discrimination?
Consider the acceptance rates when broken down by type of school.
BUSINESS SCHOOL
ART SCHOOL
Lurking variable: Applications were split between the Business School (240) and the Art School (320).
Within each school a higher percentage of women were accepted than men.
There is not any discrimination against women!
This is an example of Simpson’s paradox.
When the lurking variable (Type of School: Business or Art) is ignored the data seem to suggest discrimination against women.
However, when the type of school is considered, the association is reversed and suggests discrimination against men.
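The reversal can be reproduced numerically. The acceptance counts below are hypothetical: the tables from the slides did not survive in this transcript, so the numbers were chosen only to be consistent with the stated totals of 240 Business School and 320 Art School applications. All object names are illustrative.

```r
# HYPOTHETICAL counts, chosen to match the stated school totals (240 and 320).
# Filled in column-major order: Sex varies fastest, then Decision, then School.
apps <- array(
  c(18, 24, 102, 96,      # Business: accepted (men, women), rejected (men, women)
    180, 64, 60, 16),     # Art:      accepted (men, women), rejected (men, women)
  dim = c(2, 2, 2),
  dimnames = list(Sex      = c("Men", "Women"),
                  Decision = c("Accepted", "Rejected"),
                  School   = c("Business", "Art"))
)

# Acceptance rate within each school (rows: sex, columns: school).
rate_by_school <- apps[, "Accepted", ] / apply(apps, c(1, 3), sum)

# Acceptance rate when the lurking variable (school) is ignored.
combined     <- apply(apps, c(1, 2), sum)
overall_rate <- combined[, "Accepted"] / rowSums(combined)

print(rate_by_school)   # women have the higher rate in each school
print(overall_rate)     # men have the higher rate overall
```

With these assumed counts, most men applied to the school with the higher acceptance rate, which is what drives the reversal when the schools are combined.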
Chapter #2 R Assignment
1 Create a scatterplot of weight versus quarter mile times for the dataset, “mtcars”. Assume the independent variable is the quarter mile times and the dependent variable is the weight.
2 Find the correlation of weight versus quarter mile times for the dataset, “mtcars”.