statistics for social and behavioral sciences session #9: linear regression and conditional...

Statistics for Social and Behavioral Sciences

Session #9: Linear Regression and Conditional Distribution; Probabilities

(Agresti and Finlay, Chapter 9)

Prof. Amine Ouazad

Statistics Course Outline

PART I. INTRODUCTION AND RESEARCH DESIGN (Week 1)

PART II. DESCRIBING DATA (Weeks 2-4)

PART III. DRAWING CONCLUSIONS FROM DATA: INFERENTIAL STATISTICS (Weeks 5-9)

PART IV. CORRELATION AND CAUSATION: REGRESSION ANALYSIS (Weeks 10-14)

This is where we talk about Zmapp and Ebola!

Firenze or Lebanese Express?

Where we are right now! Describing associations between two variables

Last session

• How good are my predictions? How good is my model?
– Use the R squared = ESS/TSS.
– TSS = ESS + SSE.
– The notations TSS, ESS, SSE are widespread.
– The variance is the square of the standard deviation.
– The R squared is also the square of the correlation of the predicted value and the actual value.
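The identities recalled above can be checked numerically. The following is a minimal sketch on invented toy data (the numbers are not from the lecture); it fits a least-squares line and verifies that TSS = ESS + SSE and that the R squared equals the squared correlation between predicted and actual values.

```python
# Toy data, invented for illustration (not from the lecture).
years = [10, 12, 12, 14, 16, 16, 18]                        # x: years of schooling
income = [21000, 30000, 26000, 35000, 41000, 47000, 52000]  # y: annual income


def mean(values):
    return sum(values) / len(values)


def correlation(u, v):
    """Pearson correlation of two equally long lists."""
    mu, mv = mean(u), mean(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5


# Least-squares line fitted to the toy data.
mx, my = mean(years), mean(income)
slope = sum((x - mx) * (y - my) for x, y in zip(years, income)) / sum(
    (x - mx) ** 2 for x in years
)
intercept = my - slope * mx
predicted = [intercept + slope * x for x in years]

tss = sum((y - my) ** 2 for y in income)                    # total sum of squares
ess = sum((p - my) ** 2 for p in predicted)                 # explained sum of squares
sse = sum((y - p) ** 2 for y, p in zip(income, predicted))  # sum of squared errors

r_squared = ess / tss
print(abs(tss - (ess + sse)) < 1e-6 * tss)                          # TSS = ESS + SSE
print(abs(r_squared - correlation(predicted, income) ** 2) < 1e-9)  # R squared = corr squared
```

Both checks hold up to floating-point rounding, which is why they are written with small tolerances rather than exact equality.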

Outline

1. Conditional distribution
– What wage will I earn after graduation?

2. Probabilities (Chapter 4)

After the Break: Probability Distributions Chapter 4 of A&F

WHEN LaTisha Styles graduated from Kennesaw State University in Georgia in 2006 she had $35,000 of student debt. This obligation would have been easy to discharge if her Spanish degree had helped her land a well-paid job. But there is no shortage of Spanish-speakers in a nation that borders Latin America. So Ms Styles found herself working in a clothes shop and a fast-food restaurant for no more than $11 an hour.

Frustrated, she took the gutsy decision to go back to the same college and study something more pragmatic. She majored in finance, and now has a good job at an investment consulting firm. Her debt has swollen to $65,000, but she will have little trouble paying it off.

A Contingency Table (From Previous Session)

• But can I do a regression analysis here?

We will learn how to produce this later in the course. For now, let’s interpret/understand this.

Shows the average weekly earnings for each year of education.

What wage will I earn after graduation?

• Data: Census of Population 2010.
• The United States Census is a decennial census mandated by Article I, Section 2 of the United States Constitution, which states: “Representatives and direct Taxes shall be apportioned among the several States ... according to their respective Numbers ... . The actual Enumeration shall be made within three Years after the first Meeting of the Congress of the United States, and within every subsequent Term of ten Years.”
• Variables:
– Number of years of education completed.
– Wage income.

• We can only perform regression analysis on quantitative variables.

Linear Relationship anybody?

• We can postulate that there is a linear relationship between wage income (y) and years of schooling (x):

y = α + β x + ε

• Using Greek letters here. True relationship.
• Notice the importance of residuals (aka errors).
• Units of measurement matter. Make sure you read the fine print.
– y is annual income in dollars. x is in years.
• Also, with a linear relationship, an additional year of education leads to the same increase in income at any stage of your education process.
– Makes sense? Check the contingency table.
• We keep the linear relationship as a convenient model.

Estimation of a and b

• We estimate the true intercept and slope by computing the values of a and b. We only have a sample, not the entire population.
• So, what earnings can we expect?
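For reference, a and b are presumably the usual least-squares estimates. In standard notation (the symbols \bar{x} and \bar{y} for the sample means and N for the sample size are assumptions about notation, not taken from the slide), they can be written as:

```latex
b = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2},
\qquad
a = \bar{y} - b\,\bar{x}
```

The second expression says that the fitted line passes through the point of sample means.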

Linear? A contingency table

• Here, years of schooling (x) is quantitative discrete, so we can do both regression analysis and a contingency table!

(… Continued …)

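[Figure: regression line of wage income (y) against years of schooling, with intercept a and slope b (the change in y for 1 additional year of schooling).]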

The unconditional distribution of income and education

• We find that the mean and the standard deviation of the variables are as follows:
– Annual income y: mean $41,550, SD $48,659
– Years of schooling x: mean 12.25 years, SD 1.6 years
• Assuming a bell-shaped distribution:
– Most earnings will fall between: mean ± 3 sd
– 95% of earnings will fall between: mean ± 2 sd
– 68% of earnings will fall between: mean ± 1 sd
• Interesting: could do a risk analysis with that data:
– What is the probability that you earn more than mean + 2 sd?

• But the unconditional distribution of annual income mixes both individuals with high and low levels of education…

• So instead of using the unconditional distribution of income (aka marginal distribution), we use the conditional distribution of income.

• “What is the distribution of income given that an individual studied for x years?”
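The risk question raised for the unconditional distribution (the probability of earning more than mean + 2 sd) can be sketched as follows. This assumes, as the slide does, a bell-shaped (normal) distribution; the mean and SD are the slide's figures, and `NormalDist` comes from Python's standard library.

```python
from statistics import NormalDist

mean_income = 41_550  # unconditional mean of annual income, from the slide
sd_income = 48_659    # unconditional standard deviation, from the slide

# Assumption: income is bell-shaped, so a normal distribution is used.
income = NormalDist(mu=mean_income, sigma=sd_income)

# Share of incomes within k standard deviations of the mean.
for k in (1, 2, 3):
    share = income.cdf(mean_income + k * sd_income) - income.cdf(mean_income - k * sd_income)
    print(k, round(share, 4))

# Probability of earning more than mean + 2 sd.
threshold = mean_income + 2 * sd_income
p_above = 1 - income.cdf(threshold)
print(threshold, round(p_above, 4))
```

Under the normal assumption the probability comes out at about 2.3%, in line with the empirical rule's rough figure of 2.5% in each tail beyond 2 sd.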

The conditional distribution of income and education

[Figure: earnings (y) plotted against education (x).]

Understanding the mean of income given x

• After x years of education, the predicted (mean) annual income will be:

a + b x = -123.610 + 2,689.936 x

• With x = 16, we find $42,915!
• Good or bad?
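A small sketch of the prediction for the mean of income given x, using the intercept and slope reported on the slide. The function name `predicted_income` is chosen here for illustration.

```python
A = -123.610   # intercept a, from the slide (dollars)
B = 2_689.936  # slope b, from the slide (dollars per additional year of schooling)


def predicted_income(years_of_schooling):
    """Predicted (mean) annual income after the given years of schooling."""
    return A + B * years_of_schooling


print(round(predicted_income(16)))  # 42915

# With a linear model, one more year adds the same amount at any level of education.
print(round(predicted_income(13) - predicted_income(12), 3))
print(round(predicted_income(17) - predicted_income(16), 3))
```

The last two lines print the same value, the slope b, which is the "same increase at any stage" property of the linear model.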

Understanding the risks: Approach #1

• Use the fact that TSS = ESS + SSE.
– The ESS measures how education explains the variance of earnings.
• From this we find that Var(y) = Var(predictions) + Var(error).
– How do we go from TSS = ESS + SSE to this? Divide each term by N.
• But that is the variance of the unconditional distribution of y.
• How can we find the variance of earnings given a level of education?
• In such a case Var(y) given a level of education is Var(y given x) = Var(error).
• And thus the standard deviation of earnings given a level of education is:
– SD(residuals) = square root of (SSE/N) = sqrt(513012622113699.1/1460042) = $18,744
• Applying the empirical rule… we find that most annual incomes will lie between:

$42,915 - 3 x $18,744 and $42,915 + 3 x $18,744, that is, between $0 (the computed lower bound, -$13,317, is negative) and $99,147.
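The arithmetic of Approach #1 can be reproduced in a few lines. This is a sketch using the SSE, N and predicted income reported on the slides; truncating the lower bound at zero is an assumption made here because income cannot be negative.

```python
import math

SSE = 513012622113699.1  # sum of squared errors, from the slide
N = 1_460_042            # number of observations, from the slide
predicted = 42_915       # predicted income at x = 16, from the slide

# Standard deviation of the residuals, with N in the denominator.
sd_residuals = math.sqrt(SSE / N)

# Empirical rule: most incomes lie within 3 standard deviations of the mean.
lower = max(0.0, predicted - 3 * sd_residuals)  # floor at 0 (assumption)
upper = predicted + 3 * sd_residuals

print(round(sd_residuals, 1), lower, round(upper))
```

The standard deviation comes out within a dollar of the slide's $18,744, so the upper bound differs from the hand calculation only by rounding.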

Understanding the risks: Approach #2

• Use our beautiful formula:

Correlation = Slope x (Standard dev. of x) / (Standard dev. of y)

with slope $2,689.936, standard dev. of x 1.6 (here), and standard dev. of y 19,233.75.

• Hence the correlation between earnings and education is: 0.2240
• It is lower than 1 because the linear relationship doesn’t hold exactly.
• The r² is thus: 0.050176
• Notice the variance of the error: Var(error) = (1 - R²) x Var(y)
• And thus sd(residuals) = sqrt(1 - R²) x SD(y)
• We find: $18,744! Same as before!
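Approach #2 can be sketched the same way. The relation used below, correlation = slope x SD(x) / SD(y), is the standard one linking the slope and the correlation and is assumed to be the formula the slide refers to; all numbers are the slide's.

```python
import math

slope = 2_689.936  # b, from the slide
sd_x = 1.6         # standard deviation of years of schooling, from the slide
sd_y = 19_233.75   # standard deviation of income, from the slide

# Correlation implied by the slope (assumed form of the slide's formula).
r_from_slope = slope * sd_x / sd_y

# The slide reports the correlation as 0.2240; use that value from here on.
r = 0.2240
r_squared = r ** 2

# Var(error) = (1 - R squared) * Var(y), so:
sd_residuals = math.sqrt(1 - r_squared) * sd_y

print(round(r_from_slope, 4), round(r_squared, 6), round(sd_residuals))
```

The correlation recovered from the slope agrees with the reported 0.2240 up to rounding, and the residual standard deviation lands within a couple of dollars of the $18,744 found with Approach #1.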

Where will your earnings lie with 95% probability?

The Empirical Rule

[Figure: frequency of earnings under the unconditional distribution and the conditional distribution.]

The conditional distribution has a lower standard deviation… and a higher mean than the unconditional distribution.
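The question of where earnings will lie with 95% probability can be answered with the empirical rule applied to the conditional distribution. A minimal sketch, using the slides' predicted mean at x = 16 and residual standard deviation; the helper `empirical_range` is a name introduced here for illustration.

```python
def empirical_range(mean, sd, k):
    """Interval from mean - k*sd to mean + k*sd (empirical rule: k=2 covers about 95%)."""
    return mean - k * sd, mean + k * sd


predicted_mean = 42_915  # mean of income given x = 16, from the slides
sd_given_x = 18_744      # standard deviation of income given x, from the slides

low, high = empirical_range(predicted_mean, sd_given_x, k=2)
print(low, high)  # 5427 80403
```

So, under the bell-shape assumption, about 95% of incomes at 16 years of schooling would fall between roughly $5,400 and $80,400.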

Wrap up

• With a linear relationship y = a + b x + e:
– The unconditional (i.e. marginal) distribution of y has a larger variance than the conditional distribution of y given x.
• The mean of the conditional distribution of y given x is a + b x.
• And the standard deviation is the standard deviation of the errors e_i.
• Such standard deviation is equal to: square root of (SSE/N).

Again, N in the denominator. Proper discussion of this to follow.

Outline

1. Conditional distribution
– What wage will I earn after graduation?

2. Probabilities (Chapter 4)

After the Break: Probability Distributions Chapter 4 of A&F

Probability and Luck

• We play a game together…
– Heads you win 1 dirham.
– Tails I win 1 dirham.
• We play the game a very large number of times.
• Should you play this game?
• P(heads) = 0.5, P(tails) = 0.5
• P(heads) = 1 – P(not heads)
• P(heads) is read as “probability of heads”.
• Game sequence:
– In the long run, with a balanced coin, 0.5 of the trials will lead to heads, 0.5 of the trials will lead to tails.
– The probability of heads is the ratio of the number of heads to the number of trials, with an infinite number of draws…


Perform the game for a very large number of draws.

… the longer the game the closer the ratio will be to 0.5
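The long-run behaviour of the game can be illustrated by simulation. This is a sketch only: the seed and the numbers of tosses are arbitrary choices, and the helper `share_of_heads` is introduced here for illustration.

```python
import random


def share_of_heads(n_tosses, rng):
    """Fraction of heads in n_tosses tosses of a balanced coin."""
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses


rng = random.Random(9)  # arbitrary seed, only to make the run reproducible
for n_tosses in (10, 100, 10_000, 100_000):
    share = share_of_heads(n_tosses, rng)
    # You win 1 dirham per head and lose 1 dirham per tail.
    average_gain = share * 1 + (1 - share) * (-1)
    print(n_tosses, round(share, 4), round(average_gain, 4))
```

As the number of tosses grows, the share of heads tends to settle near 0.5 and the average gain per round near 0 dirham, which is one way to think about whether the game is worth playing.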

• What is the probability that you win twice in a row?
– P(heads in the first round) * P(heads in the second round) = 0.5 * 0.5 = 0.25
– Because the draws in the first and the second round are independent events.
• What is the probability that you win k times in a row?
– P(heads in the first round) * P(heads in the second round) * … * P(heads in the kth round) = 0.5^k
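Because the rounds are independent, the probability of a winning streak is a product of identical terms. A minimal sketch, with the function name `prob_heads_in_a_row` chosen here for illustration:

```python
def prob_heads_in_a_row(k, p_heads=0.5):
    """Probability of k heads in a row when the tosses are independent."""
    return p_heads ** k


print(prob_heads_in_a_row(2))   # 0.25
print(prob_heads_in_a_row(10))  # 0.0009765625
```

Winning ten times in a row with a balanced coin therefore happens in roughly one sequence out of a thousand.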


Sometimes we can’t repeat our choices

Life is full of random events… but
• We only draw one job at the end of university.
– Hard to know what other incomes/jobs we would have gotten.
• We only draw one marriage.
– Subsequent marriages are not identical to the first one.
– What is the probability of divorce?
• We only die once at a particular age.
– What is the probability of death at age 50?
• In such a case we define the probability of an event as the ratio of the number of such events over the number of individuals in identical circumstances.
– … for a very large number of such individuals.

• Example: number of individuals with the same degree, same age as me:

• What is the probability of earning more than $45,000 in my first job?
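When the draw cannot be repeated, the probability is estimated as a relative frequency among comparable individuals. The sketch below uses a short invented list of first-job incomes (hypothetical values, not census data) purely to show the counting.

```python
# Hypothetical first-job incomes (dollars) of individuals with the same degree and age.
incomes = [31_000, 38_500, 42_000, 44_900, 45_000,
           47_250, 52_000, 58_000, 61_500, 75_000]

threshold = 45_000

# Number of individuals earning strictly more than the threshold.
n_above = sum(income > threshold for income in incomes)

# Relative frequency: events divided by individuals in identical circumstances.
probability = n_above / len(incomes)
print(probability)  # 0.5
```

With real data the list would contain a very large number of such individuals, as the definition requires.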


Wrap Up

• What is the conditional distribution of y given x?
– Use the relationship y = a + b x + e to find the mean of y given x.
  • We compute a and b using our formulas.
– Use the relationship TSS = ESS + SSE:
  • the variance of the error is the variance of the y minus the variance of the prediction.
– The standard deviation of y given x is the standard deviation of the errors (residuals).
– Apply the empirical rule.
  • 95% of the y given x will lie between a + b x ± 2 sd(y given x)

• Beginning probability distributions (chapter 4)

Coming up: Don’t forget:
• Break of Statistics for 2 weeks.
• Only one week break for recitations.

For help:

• Amine Ouazad, Office 1135, Social Science [email protected] hour: Wednesday from 4 to 6pm.

• GAF: Irene [email protected] recitations. At the Academic Resource Center, Monday from 2 to 4pm.
