suhasini/teaching301/stat301_linear_regression.pdf

Page 1

Objectives

2.3 Least-squares regression

- Regression lines
- Prediction and extrapolation
- Correlation and r²
- Transforming relationships

Adapted from authors’ slides © 2012 W.H. Freeman and Company

Page 2

Linear regression

- Correlation measures the linear relationship between two quantitative variables.
- It gives information about the direction (negative or positive).
- And about the strength of the relationship (how close the correlation is to zero, -1, or 1).
- Placing a line through the plot allows us to visualize the correlation.
- Correlation helps us to understand the relationship between the variables.
- If we want to know how one variable may influence the value of the other variable, then we need to use a linear regression line.
- A regression line allows us to understand the relationship between two variables when one (explanatory) variable helps explain or predict the other (response) variable.
- The response variable is the outcome of the study.
- The explanatory variable helps explain changes in the response variable (see page 82, IPS).

Page 3

Explanatory and response variables

A response variable measures or records an outcome of a study. An explanatory variable explains changes in the response variable.

Typically, the explanatory variable is plotted on the x axis, and the response variable is plotted on the y axis.

Explanatory variable: number of beers. Response variable: blood alcohol content.

[Figure: "Blood Alcohol as a function of Number of Beers", a scatterplot of Blood Alcohol Level (mg/ml) against Number of Beers. Two numerical variables were recorded for each of 16 students.]

We are interested in the relationship between the two variables: how is one affected by changes in the other one?

Page 4

Examples of explanatory and response variables

- In a study of the influence of calcium on the growth of young people, researchers gave the subjects a controlled diet, where everything they ate was identical except for the amount of calcium. At the end of the study the growth of the bones was measured.
  - It is clear that calcium here is the explanatory variable (since this is controlled by the researchers) and the response variable is the growth of the subjects’ bones.
- The rental price of apartments and the number of bedrooms.
  - The number of bedrooms is the explanatory variable; the rental price is the response variable.
- The amount of sugar added to tea and how sweet it tastes.
  - The amount of sugar is the explanatory variable (since we control this) and how sweet it tastes is the response variable.
- High school English and high school mathematics grades.
  - Here is an example where it is impossible to say that one is the explanatory and the other is the response.

Page 5

The equation of a line

- Once it has been established which is the explanatory and which is the response variable, we need to decide how to put a line through the data.
- The equation of a line is given as

y = b0 + b1x

- b0 is the intercept: it gives the value where the line crosses the y-axis.
- b1 is the slope: it measures the rate of change in the response as the explanatory variable changes by ‘one unit’.
- A pictorial description is given on the next page.

A statistical problem is: given a scatterplot, what line (determined by b0 and b1) does one choose to fit through the data?

Page 6

Straight line regression

- A regression is a formula that describes how a response variable y changes (on average) as an explanatory variable x changes.
- We often use a regression line to predict the value of y for a given value of x. The predicted value for y is denoted ŷ.
- In regression, the distinction between explanatory and response variables is important.
- A straight line regression has the form

ŷ = b0 + b1x

where y is the observed value, ŷ is the predicted y value (“y hat”), b1 is the slope, and b0 is the y-intercept.

Page 7

The least squares regression line

The least-squares regression line is the unique line such that the sum of the squared vertical (y) differences between the data points and the line is as small as possible. (These differences are called residuals.) Distances between the points and the line are squared so that they are all positive values.

This is the line that best predicts y from x (not the other way around).

Page 8

How to compute the slope and intercept

First we calculate the slope of the line, b1, from statistics we have already computed. Interpretation: b1 is the (average) change in the value of y when x is changed by 1 unit.

b1 = r × (sy / sx)

where r is the correlation, sy is the standard deviation of the response variable y, and sx is the standard deviation of the explanatory variable x.

Once we know b1, we can calculate b0, the y-intercept. Interpretation: b0 is the predicted value when x = 0 (although this value of x is not always meaningful).

b0 = ȳ − b1x̄

where x̄ and ȳ are the sample means of the x and y variables.

Page 9

Calculation: weight gain and NEA

The correlation coefficient is −0.778, so using the formula the slope is

b1 = −0.778 × (1.13 / 257) ≈ −0.0034

and the intercept is

b0 = 2.38 − (−0.0034) × 329 ≈ 3.5

The equation of the line on the left is fat = 3.5 − 0.0034 × NEA.
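The slope and intercept formulas can be checked numerically. A minimal Python sketch, using approximate summary statistics from the slide's NEA example (r ≈ −0.778, sy ≈ 1.13 kg, sx ≈ 257 cal, ȳ ≈ 2.38, x̄ ≈ 329; treat these as illustrative values, not the exact study data):

```python
# Computing the least-squares slope and intercept from summary statistics.
# The summary values below are approximate/illustrative, taken from the
# slide's NEA example.
r = -0.778      # correlation between NEA and fat gain
s_y = 1.13      # standard deviation of the response (fat gain, kg)
s_x = 257.0     # standard deviation of the explanatory variable (NEA, cal)
y_bar = 2.38    # mean fat gain
x_bar = 329.0   # mean NEA increase

b1 = r * s_y / s_x        # slope: b1 = r * (s_y / s_x)
b0 = y_bar - b1 * x_bar   # intercept: b0 = y_bar - b1 * x_bar

print(round(b1, 4))  # -0.0034
print(round(b0, 1))  # 3.5
```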

Page 10

StatCrunch: weight gain and NEA

Page 11

Example: predicting weight gain from NEA

- If someone’s increase in NEA is 200 calories, we predict their weight gain will be 3.5 − 0.0034 × 200 = 2.82 kilos.
- If someone’s NEA decreases by 20 calories, we predict their weight gain will be 3.5 + 0.0034 × 20 ≈ 3.57 kilos.
- What happens if someone’s increase in NEA is 800 calories? What do we predict their weight gain will be?
  - In this case we have to be very cautious about using the linear equation, because we have no information about weight gain beyond 690 calories. The whole relationship between NEA and weight gain could change beyond this point; we just don’t have any idea.
- This is a case of extrapolation, and we need to be extremely cautious about using our prediction equation in this case.
- We illustrate why on the next slide.
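The caution above can be built directly into a prediction routine. A small hypothetical helper (not from the slides): it applies the fitted line but warns when the requested x lies outside the observed NEA range; the upper bound 690 is from the slide, and the lower bound 0 is an assumption for illustration.

```python
# A prediction helper that refuses silent extrapolation: it warns whenever
# the requested NEA value lies outside the range observed in the data.
import warnings

def predict_fat_gain(nea, b0=3.5, b1=-0.0034, x_min=0.0, x_max=690.0):
    # x_max = 690 comes from the slide; x_min = 0 is an assumed lower bound.
    if not (x_min <= nea <= x_max):
        warnings.warn(f"NEA={nea} is outside the observed range "
                      f"[{x_min}, {x_max}]: this is extrapolation.")
    return b0 + b1 * nea

print(round(predict_fat_gain(200), 2))  # 2.82 (interpolation)
predict_fat_gain(800)                   # triggers an extrapolation warning
```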

Page 12

Problems with extrapolation

The following Brief Communication is from NATURE | VOL 431 | 30 SEPTEMBER 2004 | www.nature.com/nature, p. 525.


Momentous sprint at the 2156 Olympics? Women sprinters are closing the gap on men and may one day overtake them.

[Figure 1: The winning Olympic 100-metre sprint times (seconds) for men (blue points) and women (red points) plotted against year, with superimposed best-fit linear regression lines (solid black lines) and coefficients of determination (r² = 0.789 and r² = 0.882). The regression lines are extrapolated (broken blue and red lines for men and women, respectively) and 95% confidence intervals (dotted black lines) based on the available points are superimposed. The projections intersect just before the 2156 Olympics, when the winning women’s 100-metre sprint time of 8.079 s will be faster than the men’s at 8.098 s.]

The 2004 Olympic women’s 100-metre sprint champion, Yuliya Nesterenko, is assured of fame and fortune. But we show here that — if current trends continue — it is the winner of the event in the 2156 Olympics whose name will be etched in sporting history forever, because this may be the first occasion on which the race is won in a faster time than the men’s event.

The Athens Olympic Games could be viewed as another giant experiment in human athletic achievement. Are women narrowing the gap with men, or falling further behind? Some argue that the gains made by women in running events between the 1930s and the 1980s are decreasing as the women’s achievements plateau [1]. Others contend that there is no evidence that athletes, male or female, are reaching the limits of their potential [1,2].

In a limited test, we plot the winning times of the men’s and women’s Olympic finals over the past 100 years (ref. 3; for data set, see supplementary information) against the competition date (Fig. 1). A range of curve-fitting procedures were tested (for methods, see supplementary information), but there was no evidence that the addition of extra parameters improved the model fit significantly over the simple linear relationships shown here. The remarkably strong linear trends that were first highlighted over ten years ago [2] persist for the Olympic 100-metre sprints. There is no indication that a plateau has been reached by either male or female athletes in the Olympic 100-metre sprint record.

Extrapolation of these trends to the 2008 Olympiad indicates that the women’s 100-metre race could be won in a time of 10.57 ± 0.232 seconds and the men’s event in 9.73 ± 0.144 seconds. Should these trends continue, the projections will intersect at the 2156 Olympics, when — for the first time ever — the winning women’s 100-metre sprint time of 8.079 seconds will be lower than the men’s winning time of 8.098 seconds (Fig. 1). The 95% confidence intervals, estimated through Markov chain Monte Carlo simulation [4] (see supplementary information), indicate that this could occur as early as the 2064 Games or as late as the 2788 Games.

This simple analysis overlooks numerous confounding influences, such as timing accuracy, environmental variations, national boycotts and the use of legal and illegal stimulants. But it is also defended by the limited amount of variance that remains unexplained by these linear relationships.

So will these trends continue, and can women really close the gap on men? Those who contend that the gender gap is widening say that drug use explains why women’s times were improving faster than men’s, particularly as that improvement slowed after the introduction of drug testing [1]. However, no evidence for this is found here. By contrast, those who maintain that there could be a continuing decrease in the gender gap point out that only a minority of the world’s female population has been given the opportunity to compete (O. Anderson, www.pponline.co.uk/encyc/0151.htm).

Whether these trends will continue at the Beijing Olympics in 2008 remains to be seen. Sports, biological and medical sciences should enable athletes to continue to improve on Olympic and world records, by fair means or foul [5]. But only time will tell whether in the 66th Olympiad the fastest human on the planet will be female.

Andrew J. Tatem*, Carlos A. Guerra*, Peter M. Atkinson†, Simon I. Hay*‡
*TALA Research Group, Department of Zoology, University of Oxford, Oxford OX1 3PS, UK; e-mail: [email protected]
†School of Geography, University of Southampton, Highfield, Southampton SO17 1BJ, UK
‡Public Health Group, KEMRI/Wellcome Trust Research Laboratories, PO Box 43640, 00100 GPO, Nairobi, Kenya

1. Holden, C. Science 305, 639–640 (2004).
2. Whipp, B. J. & Ward, S. A. Nature 355, 25 (1992).
3. Rendell, M. (ed.) The Olympics: Athens to Athens 1896–2004, 338–340 (Weidenfeld and Nicolson, London, 2003).
4. Gilks, W. R., Thomas, A. & Spiegelhalter, D. J. Statistician 43, 169–178 (1994).
5. Vogel, G. Science 305, 632–635 (2004).

Supplementary information accompanies this communication on Nature’s website. Competing financial interests: declared none.


© 2004 Nature Publishing Group

Have a look at the article above, which predicts that in 2156 women may be faster than men. Using the same graph, if we extrapolate a few thousand years, then women (and men) may run a hundred metres in negative time! The extrapolations done here need to be taken with a large pinch of salt.

Page 13

Warnings about scale changes

- Unlike the correlation coefficient, the slope and the intercept can change according to the scale that you use.
- It is important to note that if we change the scale of either the x or y axis, then the slope and intercept will also change. For example, if we change the fat gain from kilograms to pounds, the slope and intercept will change.
- We calculate these changes using the formulas for the changes to the mean and standard deviation given a linear change (see the formulas at the start of class).
- Do not be fooled by the size of the slope. If it is ‘small’, it does not mean it is insignificant, or that the correlation is small. The size of a slope depends on the scaling that we use. Whether it is ‘statistically significant’ depends on the standard error (more on this later), not on the scale.
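This scale dependence is easy to demonstrate. A sketch with a small hypothetical data set (fat gains in kg vs NEA in calories, invented for illustration): converting the response from kilograms to pounds multiplies both the slope and the intercept by the conversion factor, while the correlation is unchanged.

```python
# How rescaling the response changes the fitted line but not the correlation.
# The data below are hypothetical fat gains (kg) against NEA (calories).
from statistics import mean, stdev

def corr(x, y):
    # Sample correlation coefficient.
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (
        (len(x) - 1) * stdev(x) * stdev(y))

def fit(x, y):
    b1 = corr(x, y) * stdev(y) / stdev(x)   # slope = r * s_y / s_x
    b0 = mean(y) - b1 * mean(x)             # intercept = y_bar - b1 * x_bar
    return b0, b1

nea = [100.0, 200.0, 300.0, 400.0, 500.0]
fat_kg = [3.1, 2.9, 2.4, 2.1, 1.8]
fat_lb = [y * 2.2046 for y in fat_kg]       # same response, now in pounds

(b0_kg, b1_kg), (b0_lb, b1_lb) = fit(nea, fat_kg), fit(nea, fat_lb)

print(round(b1_lb / b1_kg, 4))              # 2.2046: slope scales with y
print(round(b0_lb / b0_kg, 4))              # 2.2046: so does the intercept
print(abs(corr(nea, fat_kg) - corr(nea, fat_lb)) < 1e-12)  # True: r unchanged
```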

Page 14

Example: Midterm 1 and Midterm 2

Page 15

Example: Midterm 1 and Midterm 3

Page 16

Example: Midterm 2 and Midterm 3

Page 17

Efficiency of a biofilter, by temperature

In StatCrunch: Stat-Regression-Simple Linear.

ŷ = 97.5 + 0.0757x

For every degree that temperature goes up, the efficiency can be expected to increase by b1 = 0.0757 units. The predicted efficiency when temperature equals 10 is

ŷ = 97.5 + 0.0757 × 10 = 98.26.

Page 18

Relationship between ozone and carbon pollutants

In StatCrunch: Stat-Regression-Simple Linear.

ŷ = 0.0515 + 0.005708x

For each unit that carbon goes up, ozone can be expected to increase by b1 = 0.005708 units. The predicted ozone level when carbon equals 15 is

ŷ = 0.0515 + 0.005708 × 15 = 0.1371.

However, the relationship is not strong, so the prediction may not be all that accurate.

Page 19

Differences: correlation versus regression

In regression we examine the variation in the response variable (y) given the explanatory variable (x).

The correlation is a measure of spread (scatter) in both the x and y directions in the linear relationship.

Page 20

What is gained by fitting a linear model?

- A large component of statistics is measuring how much is ‘gained’ by fitting a more complex model to the data over a simpler model.
- How can we justify the extra complexity?
- In the case of linear regression, we need to ask: how much is gained by placing a line through the data over just a flat line?
- This gain is measured using the so-called R².

R² = [Σᵢ(yᵢ − ȳ)² − Σᵢ(yᵢ − b0 − b1xᵢ)²] / Σᵢ(yᵢ − ȳ)²

- The formula is hard to follow (and in practice you never have to calculate it by hand). But it compares the residuals of each observation from the flat line ȳ with the residuals from the fitted line.
- It is best understood with a picture (given in class).
- R² = 1 means the line fits the data perfectly.
- R² = 0 means the line does not help to explain any of the variation; you may as well use a flat line. In practice you will never get R² = 0 exactly, even if the line has no meaning.
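The formula is easier to see in code. A sketch on a small hypothetical data set: compute the sum of squared residuals about the flat line ȳ and about the fitted line, form R², and check that for a straight-line fit it equals the squared correlation r².

```python
# Computing R^2 from the definition and checking that, for a straight-line
# least-squares fit, it equals the squared correlation r^2. Data are invented.
from statistics import mean, stdev

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 3.6, 4.4, 4.9, 6.1]

mx, my = mean(x), mean(y)
r = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (
    (len(x) - 1) * stdev(x) * stdev(y))
b1 = r * stdev(y) / stdev(x)   # slope
b0 = my - b1 * mx              # intercept

ss_flat = sum((b - my) ** 2 for b in y)                          # flat-line residuals
ss_line = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))    # fitted-line residuals
r_squared = (ss_flat - ss_line) / ss_flat

print(round(r_squared, 3))             # 0.998 for this data set
print(abs(r_squared - r ** 2) < 1e-9)  # True: R^2 = r^2 for a straight-line fit
```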

Page 21

Coefficient of determination, R²

Another way of understanding the formula is that R² is the percentage of the variation in y that can be explained by the prediction from x. (That is, it is the amount of vertical scatter from the regression line relative to the overall vertical scatter.)

R² is meaningful for any fit of the response variable to one or more explanatory variables. In the case of a straight line fit only, however, R² = r², where r is the correlation coefficient (positive or negative).

Page 22

Efficiency of a biofilter, by temperature

ŷ = 97.5 + 0.0757x

R² = 79.4% is the proportion of the variation in efficiency that is explained by the straight line regression on temperature.

Page 23

Relationship between ozone and carbon pollutants

In StatCrunch: Stat-Regression-Simple Linear.

ŷ = 0.0515 + 0.005708x

R² = 44.7% is the proportion of the variation in ozone level that is explained by the straight line regression on carbon pollutant level.

Page 24

Definition: residuals

The distances from each point to the least-squares regression line give us potentially useful information about the contribution of individual data points to the overall pattern of scatter.

residual = observed y − predicted ŷ = y − ŷ

Points above the line have a positive residual; points below the line have a negative residual.

These distances are called residuals because they are what is “left over” after fitting the line. The sum of the residuals is always 0.
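That last fact can be verified numerically. A sketch on a small hypothetical data set: fit the least-squares line, compute the residuals, and check that they sum to (numerically) zero.

```python
# Residuals from a least-squares line with an intercept always sum to zero
# (up to floating-point rounding). Data are invented for illustration.
from statistics import mean, stdev

x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.2, 1.9, 3.1, 3.8, 5.2]

mx, my = mean(x), mean(y)
r = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (
    (len(x) - 1) * stdev(x) * stdev(y))
b1 = r * stdev(y) / stdev(x)   # slope
b0 = my - b1 * mx              # intercept

residuals = [b - (b0 + b1 * a) for a, b in zip(x, y)]  # observed - predicted
print(abs(sum(residuals)) < 1e-9)  # True
```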

Page 25

Residual plots

Residuals are the differences between y-observed and y-predicted. We plot them in a residual plot, which plots residuals vs. x. If the data are well predicted by a straight line, then the residuals will be scattered randomly above and below 0.

Page 26

The x-axis in a residual plot is the same as on the scatterplot.

Only the y-axis is different.

Page 27

Constant mean and spread: residuals are randomly scattered (good!). Non-constant mean: a curved pattern means the relationship you are fitting (e.g., a straight line) is not the right one.

Non-constant spread: a change in variability across the plot indicates that the response variable is less predictable for some values of x than for others. This can affect the accuracy of statistical inference.

Page 28

Outliers and influential points

Outlier: an observation that lies outside the overall pattern of observations.

Influential observation: an observation that markedly changes the regression if removed. This is often an outlier on the x-axis.

Child 19 is an outlier in the y direction; child 18 is an outlier in the x direction.

Child 19 is an outlier of the relationship. Child 18 is only an outlier in the x direction and thus might be an influential point.

Page 29

Are these points influential? [Three fits compared: all data, without child 18 (influential), and without child 19 (outlier in the y-direction).]

Page 30

Always plot your data

A correlation coefficient and a regression line can be calculated for any relationship between two quantitative variables. However, outliers greatly influence the results, and running a linear regression on a nonlinear association is not only meaningless but misleading.

So make sure to always plot your data before you run a correlation or regression analysis.

Page 31

Examples of when the line of best fit may not be appropriate

The four data sets below were constructed so that they each have correlation r = 0.816, and the regression lines are all approximately ŷ = 3 + 0.5x. For all four sets, we would predict ŷ = 8 when x = 10.
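These are the well-known Anscombe data sets. A sketch using what is usually listed as data set I (the values below are the standard published ones; verify against your own source): it reproduces the shared summary statistics, even though only a scatterplot reveals whether a line is appropriate.

```python
# Data set I of Anscombe's quartet. All four sets share r ~ 0.816 and fitted
# line y-hat ~ 3 + 0.5x, yet their scatterplots look completely different.
from statistics import mean, stdev

x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

mx, my = mean(x), mean(y)
r = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (
    (len(x) - 1) * stdev(x) * stdev(y))
b1 = r * stdev(y) / stdev(x)   # slope
b0 = my - b1 * mx              # intercept

print(round(r, 3), round(b1, 2), round(b0, 2))  # 0.816 0.5 3.0
print(round(b0 + b1 * 10, 1))                   # predicted y at x = 10: 8.0
```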

Page 32

The four scatterplots show that correlation or regression analysis is not appropriate for just any data set with two numerical variables.

- A moderate linear association: a straight line regression is OK, and statistical inference for SLR is OK.
- An obviously nonlinear relationship: a straight line regression is not OK; fit a different curve.
- One point deviates from the highly linear pattern: this influential outlier must be examined closely before proceeding.
- Just one very influential point, while all other points have the same x value: what experiment was conducted here?

Page 33

Vocabulary: lurking vs. confounding

- We recall that a lurking variable is a variable that is not among the explanatory or response variables in the analysis and yet, if observed and considered, may influence the interpretation of relationships among those variables (an example is given on the next slide).
- Two variables are confounded when their effects on a response variable cannot be distinguished (statistically) from each other. The confounded variables can be explanatory variables or lurking variables.
- Association is not causation. Even if a statistical association is very strong, this is not by itself good evidence that a change in x will cause a change in y. The association would be just as strong if we reversed the roles of x and y.

Page 34

Lurking variable and regression

Often, things are not simple and one-dimensional. We need to group the data into categories to reveal the correct relationships.

What may look like a positive linear relationship is in fact a series of negative linear associations.

Here, the habitat for each observation is a lurking variable.

Plotting data points from different habitats in different colors allows us to make that important distinction.

Page 35

Comparison of racing records over time for men and women. Each group shows a very strong negative linear relationship that would not be apparent without the gender categorization.

Relationship between lean body mass and metabolic rate in men and women. Both men and women follow the same positive linear trend, but women show a stronger association. As a group, males typically have larger values for both variables.

Page 36

Cautions before rushing into a correlation or a regression analysis

- Do not use a regression on inappropriate data:
  - a clear pattern in the residuals,
  - the presence of large outliers,
  - clustered data falsely appearing linear.
  Use residual plots for help in seeing these.
- Beware of lurking variables (see the previous slide).
- Avoid extrapolating (predicting beyond values in the data set).
- Recognize when the correlation/regression is being performed on values that are averages of another variable.
- An observed relationship in the data, however strong, does not imply causation on its own.

Page 37

Warnings

The distinction between explanatory and response variables is crucial in regression. If you exchange y for x when calculating the regression line, you will get a different line, which is a predictor of x for a given value of y.

This is because the least squares regression of y on x is concerned with the distance of all points from the line in the y direction only.

Here is a plot of Hubble telescope data about galaxies moving away from Earth. The solid line is the best prediction of y = velocity from x = distance. The dotted line is the best prediction of x = distance from y = velocity.
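The two lines differ in a predictable way. A sketch on a small hypothetical data set: drawn in the same x-y plane, the y-on-x line has slope r·sy/sx, the x-on-y line has slope sy/(r·sx), and their ratio is exactly r², so they coincide only when |r| = 1.

```python
# The regression of y on x and the regression of x on y give different lines.
# Drawn in the same axes, their slopes are r*s_y/s_x and s_y/(r*s_x), whose
# ratio is r^2. Data are invented for illustration.
from statistics import mean, stdev

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 2.4, 2.9, 4.4, 4.8]

mx, my = mean(x), mean(y)
r = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (
    (len(x) - 1) * stdev(x) * stdev(y))

slope_y_on_x = r * stdev(y) / stdev(x)             # predicts y from x
slope_x_on_y_inverted = stdev(y) / (r * stdev(x))  # x-on-y line, in x-y axes

print(slope_y_on_x < slope_x_on_y_inverted)  # True whenever 0 < r < 1
```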