Statistics for Social and Behavioral Sciences Session #6: The Regression Line C’ted (Agresti and Finlay, Chapter 9) Prof. Amine Ouazad

Upload: gervase-jennings

Post on 16-Jan-2016


Page 1

Statistics for Social and Behavioral Sciences

Session #6: The Regression Line C’ted (Agresti and Finlay, Chapter 9)

Prof. Amine Ouazad

Page 2

Statistics Course Outline

• PART I. INTRODUCTION AND RESEARCH DESIGN (Week 1)
• PART II. DESCRIBING DATA (Weeks 2-4)
• PART III. DRAWING CONCLUSIONS FROM DATA: INFERENTIAL STATISTICS (Weeks 5-9)
• PART IV. CORRELATION AND CAUSATION: REGRESSION ANALYSIS (Weeks 10-14)

Slide callouts:
• This is where we talk about Zmapp and Ebola!
• Firenze or Lebanese Express?
• Where we are right now! Describing associations between two variables

Page 3

Last Session

• From a scatter plot to a linear relationship
– A linear relationship is a model, imperfect.
– A linear relationship implies constant gradients.
– A linear relationship helps predict/extrapolate, interpolate to fill missing statistics.

• Finding the regression line
– The regression line minimizes the sum of squared errors.
– The formulas for a and b are essential to learn.

Page 4

Outline

1. The Regression Line (C’ted)
– Last time’s recap
– Why we call it regression

2. Warning: Correlation is not causation
– Spurious relationships
– Being agnostic about causality: correlation

3. How well does the linear model perform?

Next session: Bivariate analysis Chapter 9 of A&F, continued

Page 5

Finding the regression line

• Which line is the right one?
• A line is entirely determined by the choice of a and b.

An essential formula.

Notice the difference between b and β, between a and α.

x is the explanatory variable; y is the response variable.

If y increases when x increases, then b > 0.
If y decreases when x increases, then b < 0.
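The formula image from this slide did not survive in the transcript. As a minimal sketch, assuming the standard least-squares formulas (b is the sum of products of deviations divided by the sum of squared deviations of x, and a is the mean of y minus b times the mean of x), the computation could look like this. The function name `fit_line` and the toy data are illustrative assumptions, not course material.

```python
def fit_line(x, y):
    """Return (a, b) for the least-squares line y_hat = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations, and sum of squared deviations of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # intercept: the line passes through the means
    return a, b

# Made-up toy data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
a, b = fit_line(x, y)
print(f"a = {a:.2f}, b = {b:.2f}")  # prints "a = 2.20, b = 0.60"
```

Here y tends to increase with x, and the fitted slope b is positive, as the slide states.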

Page 6

Why do we call this regression?

• “Regression towards mediocrity in Hereditary Stature”, Sir Francis Galton, 1886. What are y, x, and b here?

Sir F. Galton

Page 7

Understanding Galton: Questions

• A little exercise to understand Sir Francis:
1. What is the data? How many observations? What is y? What is x?
2. Write the assumed linear relationship between y and x.
3. Can you express the mean of y (as a function of the mean of x)?
4. Take the difference between child i’s height and children’s mean height.
5. How does it relate to the difference between child i’s parents’ midheight and the mean of parents’ midheight?

I use mean and average interchangeably in this course. Same formula.

Sir F. Galton
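One possible way through questions 2 to 5 is sketched here. It assumes that y is child i's height, x is the parents' midheight, and the line is fitted by least squares, so the errors average to zero. The error term e_i and the bar notation for means are my own additions, not the slide's.

```latex
\begin{align*}
y_i &= a + b\,x_i + e_i
  && \text{(assumed linear relationship, with error } e_i\text{)} \\
\bar{y} &= a + b\,\bar{x}
  && \text{(average both sides; the errors average to zero)} \\
y_i - \bar{y} &= b\,(x_i - \bar{x}) + e_i
  && \text{(subtract the second line from the first)}
\end{align*}
```

Under these assumptions, if 0 < b < 1, a child's predicted deviation from the children's mean is smaller than the parents' deviation from the parents' mean. That shrinking toward the mean is the "regression" in the name.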

Page 9

“More than a fifth of people on unemployment benefits have a criminal record, government figures have revealed.

The new data showed an estimated 22 per cent of all people claiming out of work claimants - such as Jobseeker’s Allowance - were made by people who had been to prison or convicted of an offence in the previous 12 years.”

Chris Grayling, the Justice Secretary, is pushing through reforms which aim to provide more support to offenders who are released from jail back into the community.

Jeremy Wright, the justice minister, said: “We are committed to delivering long-needed changes that will see all offenders released from prison receive targeted support to finally turn themselves around and start contributing to society.”

Page 10

Unemployment and Crime

“The figures also showed 44 per cent of offenders were claiming benefits a month after being convicted, cautioned or released from jail.”

“More than half of offenders - 54 per cent - released from prison were claiming out-of-work benefits one month later, gradually decreasing to 42 per cent two years after.”

“In all, 214,000 people claiming out-of-work benefits had been to prison at least once in the previous 12 years, or 4 per cent of the total.”

“Previous data published in 2011 estimated the proportion of criminal claimants was slightly higher, at 26 per cent, but a Ministry of Justice spokesman said the sets of figures were not directly comparable.”

Chris Grayling, Justice Secretary (UK)

Page 11

Association is not causation

What Drives Obesity?

Is higher obesity due to the rise in driving? Perhaps. It’s an intriguing hypothesis. But our friends at The Economist should know better than to report nonsensical correlations. Here’s the evidence they cite (drawn from this entirely unconvincing research paper published in Transport Policy):

Looks impressive, right? (Well, apart from putting the explanatory variable on the vertical axis.) But before concluding that there’s anything here, let’s try a different variable, instead—my age:

Page 12

Reading is an important skill, and elementary school teachers have observed that the reading ability of their students tends to increase with their shoe size. To help boost reading skills, should policymakers offer prizes to scientists to devise methods to increase the shoe size of elementary school children? Obviously, the tendency for shoe size and reading ability to increase together does not mean that big feet cause improvements in reading skills. Older children have bigger feet, but they also have more developed brains. This natural development of children explains the simple observation that shoe size and reading ability have a tendency to increase together; that is, they are positively correlated. But clearly there is no causal relationship: bigger shoe size does not cause better reading ability.

In economics, correlations are common. But identifying whether the correlation between two or more variables represents a causal relationship is rarely so easy. Countries that trade more with the rest of the world also have higher income levels—but does this mean that trade raises income levels? People with more education tend to have higher earnings, but does this imply that education results in higher earnings? Knowing precise answers to these questions is important. If additional years of schooling caused higher earnings, then policymakers could reduce poverty by providing more funding for education. If an extra year of education resulted in a $20,000 a year increase in earnings, then the benefits of spending on education would be a lot larger than if an extra year of education caused only a $2 a year increase.

Economists need statisticians

Page 13

Association is not causation

• The response variable may be the explanatory variable and vice versa (reverse causation).
• There may be other factors that affect the response variable, other than the explanatory variable. ☞ Multivariate statistics coming up in week 12.

Univariate statistics (Weeks 1 and 2)
Inspecting the distribution of one variable.
Am I taller than the average? Than the median? What percentile of the distribution do I belong to?

Bivariate statistics (Now and next week)
Discovering associations between 2 variables.
What is the relationship between parents’ height and children’s height? What is the relationship between unemployment and crime?

Multivariate statistics (Week 12)
Uncovering causality: looking at the impact of multiple explanatory variables on one response variable.
What factors cause crime? Poverty, unemployment, guns, police headcounts?
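The "other factors" point can be illustrated with a small simulation in the spirit of the shoe-size and reading example. This is a sketch on made-up numbers: age drives both shoe size and reading score, and neither affects the other. The coefficients, noise levels, and the helper names `corr` and `residuals` are all assumptions chosen for illustration.

```python
import random

def corr(x, y):
    """Correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def residuals(y, x):
    """Errors left over after a least-squares regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
        (xi - mx) ** 2 for xi in x
    )
    a = my - b * mx
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

rng = random.Random(0)  # fixed seed so the run is reproducible

# Hypothetical children: age is the common cause of both variables.
age = [rng.uniform(6, 11) for _ in range(2000)]
shoe = [20 + 1.5 * t + rng.gauss(0, 1) for t in age]   # depends on age only
reading = [10 * t + rng.gauss(0, 5) for t in age]      # depends on age only

raw = corr(shoe, reading)
# Remove the part of each variable explained by age, then correlate what is left.
adjusted = corr(residuals(shoe, age), residuals(reading, age))

print(f"raw correlation:        {raw:.2f}")
print(f"after removing age:     {adjusted:.2f}")
```

In this setup the raw correlation is strongly positive, while the correlation after removing the effect of age is close to zero. Accounting for other factors in this way is what the multivariate statistics in week 12 are about.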

Page 14

The correlation of two variables

• The correlation of two variables is:

A sum of N observations: fortunately a computer will usually do it (Stata).

• The correlation does not make an assumption about the direction of causality (the slope does).

• It is, however, related to the slope: $r = \frac{s_x}{s_y}\, b$, where r is the correlation, b is the slope, $s_x$ is the standard deviation of x, and $s_y$ is the standard deviation of y.
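The correlation formula itself is missing from the transcript. As a sketch, assuming the standard (Pearson) definition, r is the sum of products of deviations divided by the square root of the product of the two sums of squared deviations. The code below computes r that way on made-up toy data and checks that it matches the slope times the ratio of standard deviations.

```python
from statistics import mean, stdev

# Made-up toy data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

mx, my = mean(x), mean(y)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)

r = sxy / (sxx * syy) ** 0.5           # correlation from its definition
b = sxy / sxx                          # least-squares slope of y on x
r_from_slope = b * stdev(x) / stdev(y)  # slope rescaled by the standard deviations

print(f"r = {r:.4f}, b * s_x / s_y = {r_from_slope:.4f}")
```

Swapping x and y leaves r unchanged but changes the slope, which is one way to see that the correlation takes no stand on the direction of causality.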

Page 15

An Example: Unemployment and Murders – The Sequel

• Standard deviation of Unemployed Persons: 5,901.259
• Standard deviation of Murders: 20.44
• Regression line: we find b = 0.00285 and a = -1.96
• The correlation r(Unemployed, Murders) is: 0.83.
• Self-check?
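One way to run the self-check is to rebuild the correlation from the slope and the two standard deviations, using the relation between slope and correlation. This sketch uses only the numbers on the slide; the variable names are mine.

```python
# Figures reported on the slide.
sd_unemployed = 5901.259   # standard deviation of x (unemployed persons)
sd_murders = 20.44         # standard deviation of y (murders)
b = 0.00285                # slope of the regression line

# Correlation = slope * (standard dev. of x / standard dev. of y).
r_check = b * sd_unemployed / sd_murders
print(f"r from slope: {r_check:.3f}")
```

The result is about 0.82, close to the reported 0.83. The small gap is presumably due to rounding in the reported slope.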

Page 16

Properties of the correlation

• The correlation is a number between -1 and 1, sometimes (but rarely) expressed as a percentage.

• If two variables have a correlation of 1, we say that they are perfectly correlated…
– Example: student expenses in USD are perfectly correlated with student expenses in AED.
– y is exactly a + b x, with b > 0.

• If two variables have a correlation of -1, the two variables are exactly such that y = a + b x, with b < 0.
– Example: Number of days to New Year’s eve, Number of days from New Year’s eve.
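Both extreme cases can be checked numerically. The sketch below uses made-up expense amounts, an assumed fixed exchange rate of 3.6725 AED per USD, and a day count that assumes a 365-day year. All of these are illustrative choices, not figures from the course.

```python
def corr(x, y):
    """Correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Correlation of +1: AED is USD times a positive constant (y = a + b x, b > 0).
AED_PER_USD = 3.6725                      # assumed fixed rate, for illustration
expenses_usd = [100.0, 250.0, 400.0, 800.0]
expenses_aed = [AED_PER_USD * u for u in expenses_usd]
r_currency = corr(expenses_usd, expenses_aed)

# Correlation of -1: days to go falls by one as days elapsed rises by one (b < 0).
days_from = [10, 50, 120, 300]            # days since the last New Year's Eve
days_to = [365 - d for d in days_from]    # days until the next one
r_days = corr(days_from, days_to)

print(f"USD vs AED:        {r_currency:.6f}")
print(f"days from vs to:   {r_days:.6f}")
```

Up to floating-point rounding, the two correlations come out as 1 and -1.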

Page 18

How “good” are our predictions?

[Figure: scatter plot of y against x with the regression line.]

Ouch: we make errors. The actual $y_i$ and the predicted $y_i$, noted $\hat{y}_i$.

The regression line minimizes the sum of the squared errors: $\sum_{i=1}^{N} (y_i - \hat{y}_i)^2$.

Remember the formula for b and a.

When does a model predict y perfectly?
When does the model have no predictive power?
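As a small numerical illustration of "minimizes the sum of the squared errors", the sketch below fits the least-squares line on made-up toy data and compares its sum of squared errors with that of a few nearby lines. The data and the helper name `sse` are assumptions for illustration.

```python
# Made-up toy data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

def sse(a, b):
    """Sum of squared errors of the line y_hat = a + b * x on the toy data."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Least-squares intercept and slope.
n = len(x)
mx, my = sum(x) / n, sum(y) / n
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
    (xi - mx) ** 2 for xi in x
)
a = my - b * mx

best = sse(a, b)
# Any other choice of a and b should do no better than the fitted line.
nearby = [sse(a + da, b + db) for da in (-0.2, 0.0, 0.2) for db in (-0.1, 0.0, 0.1)]

print(f"SSE of the regression line: {best:.3f}")
print(f"smallest SSE among nearby lines: {min(nearby):.3f}")
```

The grid of nearby lines includes the fitted line itself (both shifts equal to zero), so the two printed numbers coincide and every other line on the grid has a larger sum of squared errors.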

Page 19

Playing with the R Squared

• The R Squared is: $R^2 = \frac{\text{variance}(\hat{y})}{\text{variance}(y)}$

• Answers the question(s):
– “What fraction of the variance of the response variable is explained by the explanatory variable?”
– “What percentage of the variance of the response variable is explained by the explanatory variable?”

• Measures the fit of the linear model.

• The R squared is also the square of the correlation between x and y! $R^2 = r^2$
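The claim that the R squared equals the squared correlation can be checked on a small example. This sketch assumes R squared is the variance of the predictions divided by the variance of the actual values, the ratio used in the unemployment and murders example. The toy data are made up.

```python
from statistics import mean, pvariance

# Made-up toy data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

mx, my = mean(x), mean(y)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)

b = sxy / sxx
a = my - b * mx
predicted = [a + b * xi for xi in x]

r_squared = pvariance(predicted) / pvariance(y)  # explained share of the variance
r = sxy / (sxx * syy) ** 0.5                     # correlation between x and y

print(f"R squared = {r_squared:.4f}, r squared = {r ** 2:.4f}")
```

On this toy data both quantities equal 0.6, so the line accounts for 60 percent of the variance of y.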

Page 22

An Example: Unemployment and Murders – The Sequel

• The variance of the predicted number of murders is: 284.3
• The variance of the actual number of murders: 417.8
• The R Squared is:

Not bad !!

• Side question: what is the variance of the errors (residuals)?

Follow my lead, it’s easier.
Remember: variance(y) = variance(prediction) + variance(error)
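The R squared value did not survive in the transcript, but it can be recomputed from the two variances on the slide, assuming R squared is the ratio of predicted to actual variance. The side question follows from the decomposition stated above. Variable names are mine.

```python
# Figures reported on the slide.
var_predicted = 284.3   # variance of the predicted number of murders
var_actual = 417.8      # variance of the actual number of murders

# R squared: share of the variance of y captured by the predictions.
r_squared = var_predicted / var_actual

# variance(y) = variance(prediction) + variance(error), so:
var_error = var_actual - var_predicted

print(f"R squared = {r_squared:.2f}")            # prints "R squared = 0.68"
print(f"variance of the errors = {var_error:.1f}")  # prints "variance of the errors = 133.5"
```

The square root of 0.68 is about 0.82, in line with the correlation of 0.83 reported earlier for the same data, allowing for rounding.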

Page 23

Wrap up

• Finding the regression line (Sir Galton)
– The regression line minimizes the sum of squared errors.
– The formulas for a and b are essential.

• Association is not causation
– Does x cause y or does y cause x?
– Is there any other factor that may cause y?
– Being agnostic about the direction of causality: the correlation r.

• How good are my predictions? How good is my model?
– Use the R Squared, know its formula.
– The variance is the square of the standard deviation.

Page 24

Next session: Minority Report continues

Don’t forget:
• Midterm 1 coming up in week 5 (exact date coming soon from the Registrar Mary Downes).
• Online Quiz #3 starting tonight at 9pm, due Tuesday at 9am.
• Sunday recitation on: “The Regression Line: ‘Education and Economic Growth.’”
• In chapter 9, read everything except Section 9.5 (Inferences for the Slope).

For help:

• Amine Ouazad
Office 1135, Social Science [email protected] hour: Wednesday from 4 to 5pm.

• GAF: Irene [email protected] recitations. At the Academic Resource Center, Monday from 2 to 4pm.