Chapter 3. Bidimensional statistics

3. Data analysis in two variables

Upload: ngodan · Post on 05-Feb-2017

Page 1: Chapter 3. Bidimensional statistics

3. Data analysis in two variables

Page 2: Chapter 3. Bidimensional statistics

Data analysis of two variables

The previous chapter finished with the comparison of two variables (or databases). In this chapter we take this analysis one step further: given two variables,

- Is there any relationship between the variables?

- If so, of which type?

- If so, how can we measure such a relationship?

- If one variable changes, what changes can we predict in the other variable?


Page 4: Chapter 3. Bidimensional statistics

Data analysis of two variables

Here you can see some examples of variables that can be related:

1.  Color of the eyes of a mother and color of the eyes of her children.

2.  Monthly income and spending on leisure activities.

3.  Gender and salary.

4.  Choice of a degree and professional success.

This is not about causality!!!

The goal is to study whether two variables are related, not to deduce whether such a relationship goes in one direction or the other.

Page 5: Chapter 3. Bidimensional statistics

An example

We ask three hundred people about

1.  How many hours per week, on average, they studied during their degrees.

2.  The color of their hair (options: blond, brown, black, red).

3.  Their salaries.

Essentially, we want to know if either of the first two variables (hours of study per week, color of hair) has had an effect on the third one (salary). What do you think?

Page 6: Chapter 3. Bidimensional statistics

An example

Hours

Salary        [0,12)  [12,24)  [24,36)  TOTAL
[0,800)          118       24        6    148
[800,1600)         9       70       10     89
[1600,2400)        3       11       49     63
TOTAL            130      105       65    300

Hair

Salary        Blond  Brown  Black  Red  TOTAL
[0,800)          36     65     39    8    148
[800,1600)       12     28     42    7     89
[1600,2400)       8     35     16    4     63
TOTAL            56    128     97   19    300

Page 7: Chapter 3. Bidimensional statistics

An example

These tables are called two-way tables.

- Each cell contains the number (absolute frequency) of people corresponding to both the column and the row.

- You can think of these tables as “double frequency tables”. They are called (absolute) joint frequency distribution tables. The second table contains a categorical variable (hair color), and it’s sometimes called a contingency table.

- Dividing each value by the size of the sample, we obtain the relative joint frequency distribution (next slide).

Exercise: extract (and complete) the frequency tables for each variable (hours per week, color of hair, salary). Analyze each sample. Can you deduce the joint frequency table for the variables “hours per week” and “hair color”?
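As a sketch of how these tables are handled in practice, the joint table, its relative version, and the marginal totals can all be computed from the raw counts. This is a hedged Python example using the hours/salary counts from the slides:

```python
# Sketch: joint, relative and marginal frequencies from the observed
# two-way table of counts (hours/salary example from the slides).
table = [
    [118, 24, 6],   # salary [0,800)    vs hours [0,12), [12,24), [24,36)
    [9, 70, 10],    # salary [800,1600)
    [3, 11, 49],    # salary [1600,2400)
]
n = sum(sum(row) for row in table)                 # sample size

# Relative joint frequencies: divide each cell by n.
rel = [[cell / n for cell in row] for row in table]

# Marginal distributions: row sums (salary) and column sums (hours).
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]

print(n)                    # 300
print(row_totals)           # [148, 89, 63]
print(col_totals)           # [130, 105, 65]
print(round(rel[0][0], 2))  # 0.39
```

The same three lines of list manipulation answer most of the exercise for the other table as well.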

Page 8: Chapter 3. Bidimensional statistics

An example

Hours

Salary        [0,12)  [12,24)  [24,36)  TOTAL
[0,800)         0.39     0.08     0.02   0.49
[800,1600)      0.03     0.23     0.03   0.30
[1600,2400)     0.01     0.04     0.16   0.21
TOTAL           0.43     0.35     0.22      1

Hair

Salary        Blond  Brown  Black   Red  TOTAL
[0,800)        0.12   0.22   0.13  0.03   0.49
[800,1600)     0.04   0.09   0.14  0.02   0.30
[1600,2400)    0.03   0.12   0.05  0.01   0.21
TOTAL          0.19   0.43   0.32  0.06      1

Page 9: Chapter 3. Bidimensional statistics

An example

The columns and rows labeled “TOTAL” form the marginal distributions (because they appear in the margins of the tables).

Salary        Marginal abs. freq.  Marginal rel. freq.
[0,800)                       148                 0.49
[800,1600)                     89                 0.30
[1600,2400)                    63                 0.21

Hours         Marginal abs. freq.  Marginal rel. freq.
[0,12)                        130                 0.43
[12,24)                       105                 0.35
[24,36)                        65                 0.22

Hair color    Marginal abs. freq.  Marginal rel. freq.
Blond                          56                 0.19
Brown                         128                 0.43
Black                          97                 0.32
Red                            19                 0.06

Page 10: Chapter 3. Bidimensional statistics

An example

Each of the other columns or rows corresponds to one variable conditioned on a specific value of the other variable: these are the conditional frequency distributions.

Salary (hours = [12,24))        Cond. freq.
[0,800)                                0.23
[800,1600)                             0.67
[1600,2400)                            0.10

Hours (salary = [1600,2400))    Cond. freq.
[0,12)                                 0.05
[12,24)                                0.17
[24,36)                                0.78

Hair color (salary = [0,800))   Cond. freq.
Blond                                  0.24
Brown                                  0.44
Black                                  0.26
Red                                    0.05

Page 11: Chapter 3. Bidimensional statistics

Data analysis of two variables

The following is a first criterion to detect a potential relationship between the variables. For each cell, compare nij/n with (ni/n) × (nj/n), where nij is the joint absolute frequency of the cell and ni, nj are the marginal frequencies of its row and its column:

- If nij/n and (ni/n) × (nj/n) are (approximately) equal in every cell (so the marginal and conditional distributions are similar), then probably there is no relationship.

- If nij/n and (ni/n) × (nj/n) are different (the marginal and conditional distributions are different), then probably there is some relationship between the variables.

Exercise: try the above criterion on the previous example.
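As a small illustration of the criterion (a sketch, using the hours/salary counts from the example), we can measure the gap between nij/n and (ni/n) × (nj/n) in every cell:

```python
# Sketch of the slide's criterion: compare nij/n with (ni/n) * (nj/n)
# cell by cell in the hours/salary table.
table = [[118, 24, 6], [9, 70, 10], [3, 11, 49]]
n = sum(map(sum, table))
row_tot = [sum(r) for r in table]
col_tot = [sum(c) for c in zip(*table)]

max_gap = 0.0
for i, row in enumerate(table):
    for j, n_ij in enumerate(row):
        joint = n_ij / n                               # nij / n
        product = (row_tot[i] / n) * (col_tot[j] / n)  # (ni/n) x (nj/n)
        max_gap = max(max_gap, abs(joint - product))

print(round(max_gap, 2))  # 0.18 -> a large gap, so probably some relationship
```

If the variables were unrelated, every gap would be close to zero; here the largest gap is sizable, which points to a relationship between hours of study and salary.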

Page 12: Chapter 3. Bidimensional statistics

Contingency coefficient

When the number of rows equals the number of columns, there is a better criterion to determine whether there is a relationship between two variables: the contingency coefficient. To compute the contingency coefficient, follow these steps (in the joint frequency table):

1.  Form a new table, where each cell contains the expected value

    EV = (Totrow × Totcolumn) / n

2.  Compute the associated chi-square index

    χ² = Σ (EV − OV)² / EV

    where OV is the observed value in the corresponding cell of the original table.

Page 13: Chapter 3. Bidimensional statistics

Contingency coefficient

3.  Compute the contingency coefficient

    CC = √( χ² / (χ² + n) )

The coefficient works as follows:

- If CC is close to zero, then there is no relationship.
- If CC is far from zero, then there is some relationship.

Page 14: Chapter 3. Bidimensional statistics

An example

Observed values:

Hours

Salary        [0,12)  [12,24)  [24,36)  TOTAL
[0,800)          118       24        6    148
[800,1600)         9       70       10     89
[1600,2400)        3       11       49     63
TOTAL            130      105       65    300

Expected values:

Hours

Salary        [0,12)  [12,24)  [24,36)
[0,800)        64.13     51.8    32.07
[800,1600)     38.57    31.15    19.28
[1600,2400)     27.3    22.05    13.65

Page 15: Chapter 3. Bidimensional statistics

An example

The second table in the previous slide contains the expected values, and now we can compute the associated chi-square index:

χ² = Σ (EV − OV)² / EV = 275.66

And finally, we compute the contingency coefficient:

CC = √( χ² / (χ² + n) ) = 0.69

In view of the value of the contingency coefficient, it’s not very clear if there is a relationship...
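The computation above can be reproduced step by step. A sketch in plain Python, rebuilding the expected values, the chi-square index, and the contingency coefficient for the hours/salary table:

```python
from math import sqrt

# Sketch reproducing the expected values, the chi-square index and the
# contingency coefficient for the hours/salary table.
observed = [[118, 24, 6], [9, 70, 10], [3, 11, 49]]
n = sum(map(sum, observed))
row_tot = [sum(r) for r in observed]
col_tot = [sum(c) for c in zip(*observed)]

chi2 = 0.0
for i in range(len(observed)):
    for j in range(len(observed[0])):
        ev = row_tot[i] * col_tot[j] / n         # expected value
        chi2 += (ev - observed[i][j]) ** 2 / ev  # (EV - OV)^2 / EV

cc = sqrt(chi2 / (chi2 + n))                     # contingency coefficient

print(round(chi2, 2))  # 275.66
print(round(cc, 2))    # 0.69
```

Both numbers match the values on the slide.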

Page 16: Chapter 3. Bidimensional statistics

Statistical dependence

Let’s formalize some of the concepts mentioned in previous slides. Along the way we shall see better techniques to detect relationships between variables.

Let X and Y be variables on the same sample. What does it mean that Y depends on X? Essentially, it means that there is some function f such that Y = f(X) (up to some small error).

In practice, there are so many different types of functions that, from that point of view, any two variables should have some sort of relationship (it’s just a matter of finding the right function!).

We’ll restrict ourselves to a certain type of function, which is very easy to handle...

Page 17: Chapter 3. Bidimensional statistics

Statistical dependence

The simplest type of functions that we may find are called linear functions: they describe straight lines.

The formula of a linear function is f(x) = ax + b, where

- “a” is the slope (the inclination of the line);
- “b” is the independent term (the intercept).

In this chapter we’ll focus on detecting when the variable Y has a linear relationship with the variable X.

Page 18: Chapter 3. Bidimensional statistics

Scatter plot

Where do we start? As we did in the previous chapter, we can start by plotting the data and getting some visual insight on the distribution. There is a type of graph that best suits the situation we study in this chapter: the scatter plot.

We have two variables, X and Y, on the same sample. Let’s assume that Y depends on X (we want to describe the variable Y in terms of the variable X). Thus, for each individual of the sample we have two observations, x and y. Arrange the two observations as a pair (x,y) (the order is important!).

We can take the pair (x,y) as the coordinates of a point on a plane. The scatter plot basically plots each pair (x,y), for all the individuals in the sample.

Page 19: Chapter 3. Bidimensional statistics

Scatter plot

Some examples of scatter plots:

- No relationship (at all)
- No linear relationship
- Positive linear relationship
- Negative linear relationship

Page 20: Chapter 3. Bidimensional statistics

Scatter plot

The scatter plot gives us an idea about the following features:

- Direction of the relationship: positive (if high values of X coincide with high values of Y) or negative (if high values of X coincide with low values of Y).

- Form of the relationship: in many cases we can tell if there is a linear relationship (if the dots are arranged more or less on a line). In other cases we may be able to see other forms (like a parabola...).

- Strength of the relationship: once we detect the form, we can check how close the dots are to the imaginary line (or form).

- Atypical observations: points lying very far away from the cloud of dots. These may correspond to errors in the observations.

Page 21: Chapter 3. Bidimensional statistics

Covariance and correlation coefficient

At the beginning of the chapter we saw some criteria to decide when two variables are related or not (under any type of relationship, not necessarily linear).

By paying attention only to linear relationships, we can improve the criteria to decide the existence of relationships. The covariance is a good measure for linear relationship:

SXY = (1/n) Σi (xi − x̄)(yi − ȳ) = (1/n) Σi xi yi − x̄ ȳ

(the sums run over i = 1, ..., n, and x̄, ȳ denote the means of X and Y)

Page 22: Chapter 3. Bidimensional statistics

Covariance and correlation coefficient

This way:

- If SXY > 0, then there is a positive linear relationship.
- If SXY < 0, then there is a negative linear relationship.
- If SXY = 0, then there is no linear relationship.

However, the covariance by itself is not as good a measure as it could be: the covariance is not bounded, which makes it difficult to decide whether a given value of SXY is close to zero or not...
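A minimal sketch (with data made up purely for illustration) showing that the two expressions for the covariance agree, and that the sign reflects the direction of the relationship:

```python
# Sketch: the two covariance formulas agree, and the sign reflects
# the direction of the relationship (data made up for illustration).
x = [1, 2, 3, 4]
y = [2, 4, 6, 8]   # y grows with x, so we expect a positive covariance
n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Definition: average product of the deviations from the means.
s_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n

# Shortcut: average of the products minus the product of the averages.
s_xy_alt = sum(xi * yi for xi, yi in zip(x, y)) / n - mx * my

print(s_xy, s_xy_alt)  # 2.5 2.5 -> positive linear relationship
```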

Page 23: Chapter 3. Bidimensional statistics

Covariance and correlation coefficient

This problem is solved with the (linear) correlation coefficient:

rXY = SXY / (SX × SY)

where SX and SY are the standard deviations of X and Y respectively.

The correlation coefficient works like magic!

- It is always a number between -1 and 1.
- If rXY is close to 1, then there is a positive linear relationship.
- If rXY is close to -1, then there is a negative linear relationship.
- If rXY is close to 0, then there is no linear relationship.

If rXY = 1 or rXY = -1 then there is a perfect linear relationship (the dots in the scatter plot will be precisely arranged along a line!!).

Page 24: Chapter 3. Bidimensional statistics

An example

The following table contains the height and weight of 10 people.

Height (cm)  161  166  168  171  171  176  177  180  181  183
Weight (kg)   57   59   58   62   67   63   67   73   78   82

Let’s see if the weight has a linear relationship with the height. This means that the height will be the independent variable (or predictor) and the weight will be the dependent variable.

Page 25: Chapter 3. Bidimensional statistics

An example

It’s hard to see from the scatter plot if there is a linear relationship between the variables (note also the different scaling of the axes!).

We have to compute the standard deviations for the height and the weight, as well as the covariance. To simplify, set X = Height and Y = Weight. Then:

SX = 6.8     SY = 8.16     SXY = 50.06

Finally, we can compute the correlation coefficient: rXY = 0.9

This can be interpreted as follows: there is a strong positive linear relationship between the height and the weight.
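The figures above can be checked with a short computation. A sketch, where the standard deviations are the population ones (dividing by n), as in the slides:

```python
from math import sqrt

# Sketch checking the slide's figures for the height/weight sample.
# Standard deviations are the population ones (dividing by n).
x = [161, 166, 168, 171, 171, 176, 177, 180, 181, 183]  # height (cm)
y = [57, 59, 58, 62, 67, 63, 67, 73, 78, 82]            # weight (kg)
n = len(x)
mx, my = sum(x) / n, sum(y) / n

s_x = sqrt(sum((xi - mx) ** 2 for xi in x) / n)
s_y = sqrt(sum((yi - my) ** 2 for yi in y) / n)
s_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n
r = s_xy / (s_x * s_y)

print(round(s_x, 1), round(s_y, 2), round(s_xy, 2))  # 6.8 8.16 50.06
print(round(r, 1))                                   # 0.9
```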

Page 26: Chapter 3. Bidimensional statistics

An example

The next table shows the number of child deaths per year and the number of doctors in 10 cities of some country. Study the possible correlation of the variables.

Number of deaths    10   20   30   35   45   55   65   75   90  100
Number of doctors  100   90   80   70   60   50   40   30   20   10

As an extra, form a two-way table with the above data (you are free to group the data in whichever intervals you find appropriate).

Page 27: Chapter 3. Bidimensional statistics

Linear regression

We know now how to tell if there is a linear relationship between two variables, X and Y. Remember that this means that, up to some error, we should have

Y = aX + b for some coefficients a and b.

The next step is to compute these coefficients, so we can completely describe the relationship between the variables.

The idea is to find the line that best approximates all the data that we already have: clearly it is almost impossible that a line will pass through all the points in the scatter plot (unless they are already arranged in a perfect line, which seldom happens).

Page 28: Chapter 3. Bidimensional statistics

Linear regression

Let a and b be any numbers (for now). For each observation xi of the variable X, we have

Observed value of Y    Approximation for Y
yi                     yi* = a·xi + b

In general, yi and yi* are different numbers, which means that the approximation given by the line Y = aX + b has an error

ei = yi − yi*

The regression line of Y over X is the line that minimizes the sum of the squares of all the errors ei. Equivalently, it’s the line that minimizes the variance of these errors. The regression line is y = ax + b, where:

a = SXY / SX²

b = ȳ − (SXY / SX²) × x̄

Page 29: Chapter 3. Bidimensional statistics

An example

Back to the example on the weight and the height of 10 people:

Height (cm)  161  166  168  171  171  176  177  180  181  183
Weight (kg)   57   59   58   62   67   63   67   73   78   82

Previously we saw that the correlation coefficient is rXY = 0.90, so we can safely assume that the variables follow a linear relationship.

We also computed the covariance and the standard deviations, so it is very easy now to find the equation of the regression line:

y = 1.083 x – 121.125
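The coefficients can be recomputed directly from the data. A sketch, which also uses the line to predict the weight for a height of 182 cm (the unrounded coefficients give about 75.91; the rounded line y = 1.083x − 121.125 gives the 75.98 quoted on the next slide):

```python
# Sketch computing the regression coefficients a = SXY / SX^2 and
# b = mean(Y) - a * mean(X) for the height/weight sample, then using
# the line to predict the weight of a 182 cm person.
x = [161, 166, 168, 171, 171, 176, 177, 180, 181, 183]
y = [57, 59, 58, 62, 67, 63, 67, 73, 78, 82]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

s_x2 = sum((xi - mx) ** 2 for xi in x) / n                     # variance of X
s_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n  # covariance

a = s_xy / s_x2   # slope
b = my - a * mx   # intercept

print(round(a, 3), round(b, 3))  # 1.083 -121.125
print(round(a * 182 + b, 2))     # 75.91 (with the unrounded coefficients)
```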

Page 30: Chapter 3. Bidimensional statistics

An example

Here’s the scatter plot, together with the regression line:

We can use the regression line to estimate the weight of new individuals in the population, not included in the original sample. For example, for someone with height 182 cm, the expected weight is 75.98 kg (just replace x by 182 in the equation of the regression line).

Page 31: Chapter 3. Bidimensional statistics

Coefficient of determination

The regression line is the linear function that best describes the linear relationship between the variables X and Y. Still, there is a question to consider: how good is the regression line in describing this relationship?

The idea is the following: if Y = aX + b, then the variance of X (the dispersion of the data in X) explains part of the variance of Y, but not all of it. We want to measure which part of the variance of Y is explained by X (the more the better).

We use the coefficient of determination, which is (rXY)2. The closer to 1, the better!

Page 32: Chapter 3. Bidimensional statistics

An example

Study the linear relationship between the variables in this database:

X   1.1   1.5   1.7   1.4   1.8   1.9   2.0   1.3   1.2   1.6
Y  11.4  13.0  11.8  10.6  14.2  12.6  13.0  10.2  11.8  13.4

X   2.0   1.8   1.3   1.6   1.2   1.1   1.5   1.7   1.9   1.4
Y  15.0  12.2  12.2  11.4   9.8   9.4  11.0  13.8  14.6  12.6

A quick computation shows that the correlation coefficient is

rXY = 0.754

so it seems fair to assume that the variables are linearly related. The regression line is

y = 4x + 6
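A sketch reproducing these numbers (correlation coefficient, coefficient of determination, and regression coefficients) from the 20 observations above:

```python
from math import sqrt

# Sketch reproducing the correlation coefficient, the coefficient of
# determination and the regression line for the 20 observations.
x = [1.1, 1.5, 1.7, 1.4, 1.8, 1.9, 2, 1.3, 1.2, 1.6,
     2, 1.8, 1.3, 1.6, 1.2, 1.1, 1.5, 1.7, 1.9, 1.4]
y = [11.4, 13, 11.8, 10.6, 14.2, 12.6, 13, 10.2, 11.8, 13.4,
     15, 12.2, 12.2, 11.4, 9.8, 9.4, 11, 13.8, 14.6, 12.6]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

s_x = sqrt(sum((xi - mx) ** 2 for xi in x) / n)
s_y = sqrt(sum((yi - my) ** 2 for yi in y) / n)
s_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n

r = s_xy / (s_x * s_y)
a = s_xy / s_x ** 2
b = my - a * mx

print(round(r, 3))         # 0.754
print(round(r ** 2, 3))    # 0.569 (coefficient of determination)
print(round(a), round(b))  # 4 6
```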

Page 33: Chapter 3. Bidimensional statistics

An example

However, the coefficient of determination is 0.569, which means that the regression line does not do such a great job... It is actually not too bad, but it isn’t great either.

Let’s have a look at the scatter plot:

Page 34: Chapter 3. Bidimensional statistics

Residual plot

If you really, really want to do a thorough linear correlation analysis, then there is one last step to take care of: the residual plot.

The idea is to study the distribution of the errors given by the regression line:

- If the errors are randomly distributed, then linear regression is fine.
- If the errors are distributed following some obvious pattern, then most likely the variables X and Y do not follow a linear relationship.

Page 35: Chapter 3. Bidimensional statistics

An example

In the previous example we saw how the regression line does not quite do a good job in approximating the variable Y. Let’s see what happens with the errors:

X             1.1  1.5  1.7  1.4  1.8  1.9  2.0  1.3  1.2  1.6
Y − (4X + 6)    1    1   -1   -1    1   -1   -1   -1    1    1

X             2.0  1.8  1.3  1.6  1.2  1.1  1.5  1.7  1.9  1.4
Y − (4X + 6)    1   -1    1   -1   -1   -1   -1    1    1    1

Note that the error is always either 1 or -1!
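This claim can be checked programmatically. A sketch (the rounding just removes floating-point noise):

```python
# Sketch checking the claim that every residual y - (4x + 6) equals 1 or -1
# (rounding just removes floating-point noise).
x = [1.1, 1.5, 1.7, 1.4, 1.8, 1.9, 2, 1.3, 1.2, 1.6,
     2, 1.8, 1.3, 1.6, 1.2, 1.1, 1.5, 1.7, 1.9, 1.4]
y = [11.4, 13, 11.8, 10.6, 14.2, 12.6, 13, 10.2, 11.8, 13.4,
     15, 12.2, 12.2, 11.4, 9.8, 9.4, 11, 13.8, 14.6, 12.6]

residuals = [round(yi - (4 * xi + 6), 6) for xi, yi in zip(x, y)]
print(sorted(set(abs(e) for e in residuals)))  # [1.0]
```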

Page 36: Chapter 3. Bidimensional statistics

An example

The scatter plot for the errors shows that there is indeed something going on, as we do not obtain a random cloud of points!

The fact that this plot shows a clear pattern indicates that, in assuming a linear relationship between the variables, there is some important aspect that we are not considering.

Page 37: Chapter 3. Bidimensional statistics

Bidimensional analysis & categorical variables

Whenever categorical variables are involved, our options are more limited. For example, the regression line makes little sense if one of the variables (either X or Y) is categorical.

In this case we have to go back to using two-way tables, as we saw at the beginning of this chapter. Let’s see one more tool that we may use in order to analyze the relationship between variables in this case.

To this end, let’s revisit the example from the beginning of the chapter, where we have information about people’s salaries and their hair color...

Page 38: Chapter 3. Bidimensional statistics

An example

Below is the two-way table from the previous example:

Hair

Salary        Blond  Brown  Black  Red  TOTAL
[0,800)          36     65     39    8    148
[800,1600)       12     28     42    7     89
[1600,2400)       8     35     16    4     63
TOTAL            56    128     97   19    300

A scatter plot makes no sense here, but we can try a different strategy...

Remember that our assumption was that the color of hair has some effect on the salary (we are actually trying to decide whether there is such a relationship or not).

Page 39: Chapter 3. Bidimensional statistics

An example

By dividing each column by the corresponding total frequency (of the column!), we get a table of relative frequencies (proportions). We can ignore the rightmost column of the original table (the row totals), as we will not need it.

Hair

Salary        Blond  Brown  Black    Red
[0,800)       0.643  0.508  0.402  0.421
[800,1600)    0.214  0.219  0.433  0.368
[1600,2400)   0.143  0.273  0.165  0.211
TOTAL             1      1      1      1
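The column-wise proportions can be reproduced with a few lines. A sketch in plain Python:

```python
# Sketch computing the salary distribution within each hair color
# (each column of the contingency table divided by its own total).
table = [[36, 65, 39, 8],   # salary [0,800)
         [12, 28, 42, 7],   # salary [800,1600)
         [8, 35, 16, 4]]    # salary [1600,2400)
col_tot = [sum(c) for c in zip(*table)]  # totals per hair color

props = [[round(cell / tot, 3) for cell, tot in zip(row, col_tot)]
         for row in table]

print(props[0])  # [0.643, 0.508, 0.402, 0.421]
```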

Page 40: Chapter 3. Bidimensional statistics

An example

Finally, we can form a bar chart for each category: blond, brown, black, and red. If there were any relationship between the categories, we should see some pattern in the graphs:

As there is no pattern, we can conclude that the color of hair and the salary are unrelated (independent) variables.

Page 41: Chapter 3. Bidimensional statistics

An example

We are running a little business producing music bands. The table below shows the data for the bands we are producing: number of albums produced, the career year of the band, and the number of registered followers of the band.

Band  Albums  Year  Followers
   1       2     3    7790134
   2       4     6     879650
   3       3     7     457070
   4       1     2    1590205
   5       3     5      79937
   6       1     2     110627
   7       2     2      69704
   8       1     2      26458
   9       9    14    8248993
  10       4     4     388013
  11      16    11      20922
  12       1     1       9597
  13      25    23     230761
  14      10    16      73008

Analyze the linear relationship between the number of albums and the number of followers, and between the career year and the number of followers.
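A possible starting point for the exercise (a sketch, not a full solution): compute the correlation coefficient between the number of albums and the number of followers, using the same population formulas as in the chapter.

```python
from math import sqrt

# A possible starting point for the exercise (a sketch, not a full
# solution): the correlation coefficient between albums and followers.
albums    = [2, 4, 3, 1, 3, 1, 2, 1, 9, 4, 16, 1, 25, 10]
followers = [7790134, 879650, 457070, 1590205, 79937, 110627, 69704,
             26458, 8248993, 388013, 20922, 9597, 230761, 73008]

def corr(xs, ys):
    """Linear correlation coefficient rXY = SXY / (SX * SY)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    s_xy = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / n
    s_x = sqrt(sum((a - mx) ** 2 for a in xs) / n)
    s_y = sqrt(sum((b - my) ** 2 for b in ys) / n)
    return s_xy / (s_x * s_y)

r = corr(albums, followers)
print(round(r, 3))  # some value in [-1, 1]
```

The same `corr` helper, applied to the career year and the followers, answers the second half of the exercise.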