marshall university school of medicine department of biochemistry and microbiology bms 617 lecture 9...

Marshall University Genomics Core Facility

Marshall University School of Medicine, Department of Biochemistry and Microbiology

BMS 617

Lecture 9 – Correlation and linear regression


Correlation

• Correlation describes the propensity for one variable to vary in the same (or opposite) way as another variable

• Example (from Motulsky): Borkman et al. measured the insulin sensitivity and the percentage of polyunsaturated fatty acids with 20 to 22 carbon atoms (%C20-22) in 13 healthy men

• Both variables show a degree of variation:


Scatterplot

• Scatterplot of insulin sensitivity against %C20-22:

• The plot seems to show a relationship, or correlation, between the variables

• The higher the %C20-22, the higher the insulin sensitivity


Correlation Coefficient

• The correlation coefficient between two sets of values $x_1, \dots, x_n$ and $y_1, \dots, y_n$ is computed as follows:
– Calculate the standardized values of x and y: $z_{x,i} = (x_i - \mathrm{mean}(x)) / \mathrm{sd}(x)$ and $z_{y,i} = (y_i - \mathrm{mean}(y)) / \mathrm{sd}(y)$
– Compute the products of all the standardized values, add them up, and divide by n-1: $r = (z_{x,1} z_{y,1} + z_{x,2} z_{y,2} + \dots + z_{x,n} z_{y,n}) / (n-1)$
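The two steps above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `correlation` and the five data points are made up for the example and are not the insulin sensitivity data.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation coefficient computed from standardized scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length with at least 2 points")
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)  # sample standard deviations (n-1 denominator)
    zx = [(x - mx) / sx for x in xs]  # standardized x values
    zy = [(y - my) / sy for y in ys]  # standardized y values
    # Add up the products of the standardized values and divide by n-1
    return sum(a * b for a, b in zip(zx, zy)) / (n - 1)

# Made-up illustrative values, NOT the data from the lecture
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]
r = correlation(x, y)
print(round(r, 2))  # 0.8
```

Points that lie exactly on a rising straight line give r = 1, and points on a falling line give r = -1.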


Why the correlation coefficient works

• If a value is bigger than the mean, its standardized score is positive, otherwise its standardized score is negative

• The product of two standardized scores will be positive if both scores are positive, or both scores are negative i.e. if both scores are bigger than the mean, or both are less than the mean

• So if one variable tends to increase when the other tends to increase, the bulk of the products of standardized scores will be positive, and the correlation coefficient will be positive (close to 1 for a strong relationship)

• On the other hand, if one variable tends to decrease when the other increases, the bulk of the products of standardized scores will be negative, and the correlation coefficient will be negative (close to -1 for a strong relationship)

• If there is no relationship, positive and negative products will occur about equally often and will tend to cancel out, giving a correlation coefficient near 0


Correlation coefficient for the insulin sensitivity data

• The correlation coefficient for the insulin sensitivity data is r=0.77

• The square of this value is r² = 0.59

• r² is always between 0 and 1

• r² is easier to interpret than r:
– 59% of the variation in insulin sensitivity can be "explained" by the variation in %C20-22
– We will make this more precise later


Confidence Intervals for Correlation Coefficients

• Most statistical software will compute a confidence interval for a correlation coefficient:

• 95% confidence interval for these data is [0.38, 0.93]

• We are 95% confident the interval from 0.38 to 0.93 includes the true correlation coefficient for insulin sensitivity and %C20-22 fatty acid content
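The slides do not say how the software computes this interval. One common approach, assumed here, is the Fisher z-transformation: transform r with the inverse hyperbolic tangent, build a normal-theory interval with standard error 1/sqrt(n-3), and transform back. The function name `correlation_ci` is hypothetical.

```python
import math

def correlation_ci(r, n, z_crit=1.96):
    """Approximate confidence interval for a correlation coefficient.

    Assumes the Fisher z-transformation; z_crit=1.96 gives a 95% interval.
    """
    if n <= 3:
        raise ValueError("the Fisher approximation needs n > 3")
    z = math.atanh(r)                # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)      # approximate standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

low, high = correlation_ci(0.77, 13)
print(f"[{low:.2f}, {high:.2f}]")  # [0.38, 0.93]
```

With r = 0.77 and n = 13 this reproduces the interval quoted above. Note that the interval is not symmetric around r, because r cannot exceed 1.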


GRHL2 and Epithelial-Mesenchymal Transition

• Epithelial-Mesenchymal Transition (EMT) is a process cancer cells must undergo before metastasis can occur

• Mani et al (Cell 2008; 133; 704-15) published a gene signature for cells which have undergone EMT

– Relative expression of a set of 251 genes indicative of EMT

• Cieply et al. (Cancer Research, 2012) attempted to induce EMT in GRHL2-overexpressing cells and profiled the resulting gene expression by microarray

• Hypothesized that GRHL2 would suppress EMT

• Compared expression of core EMT genes in their assay to that of Mani et al.


Expression of Core EMT genes in GRHL2 overexpressed cells

• Expression patterns show a strong negative correlation

• Suggests that GRHL2 has suppressed EMT


p-values for correlation coefficients

• It is possible to compute a p-value for a correlation coefficient

• The null hypothesis is that there is no correlation
– i.e. that the true correlation coefficient is zero

• So the p-value is the probability of getting a correlation coefficient at least as far from zero as the one observed, from a random sample of the same size as the one used, assuming there is no correlation in the population

• Note that with large samples, p-values for correlation coefficients tend to be very small, even when the correlation is weak

• For the insulin sensitivity example (n=13), p=0.0021

• For the GRHL2-EMT example (n=216), p < 10^-16

• It is important to look at the r or r² value to determine if the result is of biological importance
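The slides do not show how the p-value is obtained. A standard route, assumed here, converts r to a t statistic with n-2 degrees of freedom, t = r·sqrt((n-2)/(1-r²)), and takes the two-sided tail area of the t distribution. The sketch below uses only the standard library and integrates the t density numerically with Simpson's rule; all function names are hypothetical.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    c = math.exp(log_c) / math.sqrt(df * math.pi)
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=2000):
    """Two-sided p-value: 1 minus the area under the density between -|t| and |t|."""
    t = abs(t)
    if t == 0:
        return 1.0
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_t = total * h / 3
    return max(0.0, 1 - 2 * area_0_to_t)

def correlation_p_value(r, n):
    """p-value for the null hypothesis that the true correlation is zero."""
    if abs(r) >= 1:
        return 0.0
    df = n - 2
    t = r * math.sqrt(df / (1 - r * r))
    return two_sided_p(t, df)

p = correlation_p_value(0.77, 13)
print(round(p, 4))
```

For r = 0.77 and n = 13 the result is close to the p = 0.0021 quoted above. Numerical integration of this kind cannot resolve extremely small p-values such as the one in the GRHL2-EMT example; statistical software uses more accurate tail formulas for those.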


Correlation and Causality

• A very common error is to assume that correlation implies causality

• In the insulin sensitivity example, it would be wrong to conclude from the correlation alone that high lipid content caused high insulin sensitivity

• The possible reasons for the correlation in this example are:
– Lipid content determines insulin sensitivity
– Insulin sensitivity determines lipid content
– Both lipid content and insulin sensitivity are determined by a common factor
– There is a complex network of interacting factors of which lipid content and insulin sensitivity are two components
– It is a coincidence

• The p-value tells you how rare a coincidence would be, under the null hypothesis

• To determine among the other possibilities, further experimentation is needed


Correlation and Causality in the Examples

• In the first example (insulin sensitivity), the investigators performed further experiments in which they manipulated the variables

• They concluded that lipid content determined insulin sensitivity (to some extent)

• In the second example, the data come from the same genes under different sets of conditions
– There is no mechanism for the expression under one condition to affect the expression under another condition
– They are both determined by a common factor (the extent to which the cell has undergone EMT)

• In the first example, it makes sense to investigate the nature of the influence of lipid content on insulin sensitivity further


Simple Linear Regression

• Correlation asks the question "To what extent is there a linear relationship between two variables?"

• Linear regression asks the question "What is the linear relationship between two variables?"

• Correlation is symmetric
– The correlation coefficient between x and y is the same as the correlation coefficient between y and x

• Linear regression is not symmetric:
– One variable must be designated as independent and one must be designated as dependent
– It assumes a model of causality
– Switching the roles of independent and dependent variables will produce different results
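The asymmetry can be seen on a small example. The sketch below fits y on x and then x on y for the same made-up points (not data from the lecture) and compares the two lines in the same coordinates. The helper `slope_intercept` is a hypothetical name and uses the usual least-squares formulas.

```python
from statistics import mean

def slope_intercept(xs, ys):
    """Least-squares line predicting ys from xs; returns (slope, intercept)."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

# Made-up illustrative values
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 6.0, 4.0, 10.0, 8.0]

b_yx, a_yx = slope_intercept(x, y)  # y treated as dependent: y = a_yx + b_yx * x
b_xy, a_xy = slope_intercept(y, x)  # x treated as dependent: x = a_xy + b_xy * y

# Rearranging the second line into "y as a function of x" gives slope 1 / b_xy
print(b_yx, 1 / b_xy)
```

Here the first fit has slope 1.6, while the second fit, redrawn on the same axes, has slope 2.5. The two lines coincide only when the points lie exactly on a line, whereas the correlation coefficient is the same whichever variable is called x.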


What does linear regression do?

• Linear regression calculates the straight line that gives the best prediction of the y values from the x values

• It finds the values of a and b in the equation y=a+bx to do this

• This is done by minimizing the sum of the squares of the vertical distance from each point to the line

• Note that:
– The roles of x and y are predetermined, and affect the result
– We can only estimate a and b based on our data sample
– We cannot know the true population values for a and b
– Usually helpful to calculate a confidence interval for these

• Does it make sense to perform linear regression on
– The insulin sensitivity data?
– The GRHL2-EMT expression data?
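To make "minimizing the sum of the squares" concrete, the following sketch computes a and b with the standard closed-form least-squares formulas and then checks that nudging either parameter makes the sum of squared vertical distances larger. The data are made up for illustration, and the names `least_squares` and `sum_sq_error` are hypothetical.

```python
def least_squares(xs, ys):
    """Return (a, b) for the line y = a + b*x that minimizes squared vertical error."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx  # the fitted line always passes through (mean x, mean y)
    return a, b

def sum_sq_error(a, b, xs, ys):
    """Sum of squared vertical distances from each point to the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Made-up illustrative values, NOT the insulin sensitivity data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares(x, y)
best = sum_sq_error(a, b, x, y)

# Any small change to a or b gives a worse (larger) sum of squares
nudges = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]
worse = [sum_sq_error(a + da, b + db, x, y) for da, db in nudges]
print(a, b, best, all(w > best for w in worse))
```

For these points the fitted line is approximately y = 0.05 + 1.99x.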


Linear Regression for Insulin Sensitivity


Linear Regression results for Insulin Sensitivity


Interpreting linear regression results

• The best-fit values show the slope and intercepts of the line, along with their standard errors

• So the estimate of the slope is 37.21 with standard error 9.3
– For each 1 percentage point increase in %C20-22 (the percentage of polyunsaturated fatty acids with 20-22 carbon atoms), the insulin sensitivity increases on average by 37.21 mg/m²/min

• The 95% confidence interval for the slope ranges from 16.75 to 57.67
– This is easier to interpret than the standard error
– We are 95% confident the range 16.75 to 57.67 includes the true value of the slope

• The intercepts give the value of the insulin sensitivity when %C20-22 is 0, and the value of %C20-22 that would yield an insulin sensitivity of 0
– Are these meaningful?

• The R² value is 0.5929. This means that 59% of the variance in insulin sensitivity can be accounted for by the variation in C20-22 polyunsaturated fatty acids, and the remaining 41% is the result of other factors.
– We will discuss R² in more detail later
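The confidence interval for the slope follows from the estimate and its standard error. As a rough check, assuming the usual formula (estimate ± t critical value × standard error, with n-2 = 11 degrees of freedom), the reported numbers can be reproduced as below. The critical value 2.201 is taken from a standard t table, and the function name `slope_ci` is hypothetical.

```python
def slope_ci(slope, se, t_crit):
    """Confidence interval: estimate plus or minus t_crit standard errors."""
    margin = t_crit * se
    return slope - margin, slope + margin

slope = 37.21   # slope estimate from the regression results
se = 9.3        # standard error as rounded on the slide
t_crit = 2.201  # 95% two-sided critical value of t with 11 degrees of freedom (table value)

low, high = slope_ci(slope, se, t_crit)
print(f"{low:.1f} to {high:.1f}")
```

The result agrees with the reported 16.75 to 57.67 up to the rounding of the standard error.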


p-value for linear regression

• The linear regression results give a p-value of 0.0021

• To interpret this, we need to know the null hypothesis

• The null hypothesis is that there really is no linear relationship between insulin sensitivity and %C20-22
– If this were true, the best fit line would have a slope of zero

• If the null hypothesis were true (there is no linear relationship between the two), the chances of seeing a best fit line with a slope at least this steep would be 0.21%

• Note that the null hypothesis for correlation is essentially equivalent to the null hypothesis for linear regression
– Hence the p-values are equal
– However, the interpretations are different
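The equality of the two p-values can be checked with the numbers already given. Assuming the standard t tests for a regression slope and for a correlation coefficient (the slides do not name the tests), both statistics have n-2 = 11 degrees of freedom and come out the same, up to rounding of the inputs:

```latex
\[
t_{\mathrm{slope}} = \frac{b}{SE(b)} = \frac{37.21}{9.3} \approx 4.00,
\qquad
t_{r} = r\sqrt{\frac{n-2}{1-r^{2}}} = 0.77\sqrt{\frac{11}{1-0.5929}} \approx 4.00
\]
```

The same t statistic with the same degrees of freedom gives the same p-value, 0.0021.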


Assumptions for linear regression

• Linear regression is based on the following assumptions:
– There is a linear relationship between the two quantities
– The residuals are normally distributed (the residuals are the vertical distances of each point from the line; the random scatter)
– The variability is the same all the way along the line
– Data points are independent
– The x and y values are measured independently
– The x values are known precisely

• Be careful of the following:
– Do not try to interpret the linear regression for values far from the data
– In our example, the %C20-22 values were all between 17 and 25. The linear regression is unlikely to be meaningful for values far from this. In particular, the intercept value (%C20-22=0) is likely to be meaningless.


Common mistakes with linear regression

• Be careful of the following traps when using linear regression:
– Not all relationships are linear! If the R² value for linear regression is low, consider the possibility there may be another relationship between the variables
– Don't use linear regression on smoothed data: this violates the assumption that data points are independent
– Don't use linear regression if y is (partly) calculated from x, for example if y is the change in a measurement before and after treatment and x is the value before treatment: this violates the assumption that x and y are measured independently

• Always carefully consider which variable is x and which is y
– If you can't decide, you probably shouldn't be using regression

• Always plot the data


Summary: Correlation

• Correlation determines the extent to which two variables share a linear relationship

• Makes no assumptions and draws no conclusions about causality

• The correlation coefficient is between -1 and 1, with ±1 being a perfect linear relationship

• The square of the correlation coefficient is the percentage of variability in one variable which is "explained by" the variability in the other variable
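The "explained by" statement can be made precise: for a least-squares line, r² equals 1 minus the ratio of the residual sum of squares to the total sum of squares of y. The sketch below checks this on made-up points (not data from the lecture); all variable names are illustrative.

```python
from statistics import mean

# Made-up illustrative values
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 6.0, 4.0, 10.0, 8.0]

mx, my = mean(x), mean(y)
sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)  # total variation in y around its mean
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

# Squared correlation coefficient
r_squared = sxy ** 2 / (sxx * syy)

# Least-squares line and the variation left over around it
b = sxy / sxx
a = my - b * mx
ss_residual = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
explained_fraction = 1 - ss_residual / syy

print(round(r_squared, 2), round(explained_fraction, 2))  # 0.64 0.64
```

In this example 64% of the variation in y is accounted for by the line and 36% remains as scatter around it.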


Summary: Linear Regression

• Linear Regression provides the best prediction of one variable from another variable, assuming they have a linear relationship

• Causal direction is built into the model

• Results give estimates for two parameters:
– Intercept and slope
– and confidence intervals for each
