
Chapter 1: What is a Regression?

Section 1.0: What we need to know when we finish this chapter
Section 1.1: Why are we doing this?
Section 1.2: Education and earnings
Section 1.3: What does a regression look like?
Section 1.4: Where do we begin?
Section 1.5: Where’s the explanation?
Section 1.6: What do we look for in this explanation?
Section 1.7: How do we interpret the explanation?
Section 1.8: How do we evaluate the explanation?
Section 1.9: R2 and the F-statistic
Section 1.10: Have we put this together in a responsible way?
Section 1.11: Do regressions always look like this?
Section 1.12: How to read this book
Section 1.13: Conclusion
Exercises

Section 1.0: What we need to know when we finish this chapter

This chapter explains what a regression is and how to interpret it. Here are the essentials:

1. Section 1.4: The dependent or endogenous variable measures the behavior that we want to explain with regression analysis.

2. Section 1.5: The explanatory, independent or exogenous variables measure things that

we think might determine the behavior that we want to explain. We usually think of them as

pre-determined.

3. Section 1.5: The slope estimates the effect of a change in the explanatory variable on the

value of the dependent variable.

4. Section 1.5: The t-statistic indicates whether the associated slope is reliable. The slope is

reliable if the prob-value associated with the t-statistic is .05 or less. In this case, we say

that the associated slope is statistically significant. This generally corresponds to an

absolute value of approximately two or greater for the t-statistic, itself. If the t-statistic has a

prob-value that is greater than .05, the associated slope coefficient is insignificant. This

means that it isn’t reliable.

5. Section 1.6: The intercept is usually uninteresting. It represents what everyone has in

common, rather than characteristics that might cause individuals to be different.

6. Section 1.6: We usually interpret only the slopes that are statistically significant. They

indicate the effect of their associated explanatory variables on the dependent variable

ceteris paribus, or holding constant all other characteristics that are included in the

regression.

7. Section 1.6: Continuous variables take on a wide range of values. Their slopes indicate

the change that would occur in the dependent variable if the value of the associated explanatory variable increased by one unit.

8. Section 1.6: Discrete variables, sometimes called categorical variables, indicate the

presence or absence of a particular characteristic. Their slopes indicate the change that

would occur in the dependent variable if an individual who did not have that characteristic

were given it.

9. Section 1.7: Regression interpretation requires three steps. The first is to identify the

reliable slopes. The second is to understand their magnitudes. The third is to use this

understanding to verify or modify the behavioral intuition that motivated the regression in

the first place.

10. Section 1.7: Statistical significance is necessary in order to have interesting results, but not

sufficient. Important effects are those that are both statistically significant and substantively

large. Slopes that are statistically significant but substantively small indicate that the effects

of the associated explanatory variable can be reliably interpreted as unimportant.

11. Section 1.7: A proxy is a variable that is related to, but not exactly the variable we really

want. We use proxies when the variables we really want aren’t available. Sometimes this

makes interpretation difficult.

12. Section 1.8: If the prob-value associated with the F-statistic is .05 or less, the collective

effect of the ensemble of explanatory variables on the dependent variable is statistically

significant.

13. Section 1.8: Observations are the individual examples of the behavior under examination, upon which the regression is based. All of the observations together constitute the sample.

14. Section 1.8: The R2, or coefficient of determination, represents the proportion of the

variation in the dependent variable that is explained by the explanatory variables. The

adjusted R2 modifies the R2 in order to take account of the numbers of explanatory variables

and observations. However, neither measures directly the reliability of the regression

results.

15. Section 1.9: F-statistics can be used to evaluate the contribution of a subset of explanatory

variables, as well as the collective statistical significance of all explanatory variables. In

both cases, the F-statistic is a transformation of R2 values.

16. Section 1.10: Regression results are useful only to the extent that the choice of variables in

the regression, variable construction and sample design are appropriate.

17. Section 1.11: Regression results may be presented in one of several different formats.

However, they all have to contain the same substantive information.

Section 1.1: Why are we doing this?

This book presents an introduction to econometrics. In other words, it’s our second look at

statistical analysis. For most of us, the first look took place in a prior course, which focused on

probability and statistics. The principal theme of that course was how we should think about a

single random variable.

In the probability part, we studied the probability distributions that describe some common

random variables. They included the binomial distribution, the normal distribution, the t-

distribution, perhaps the χ2-distributions or the F-distributions. We learned about their expected values and standard deviations. For practical purposes, the most important thing we learned may

have been how to read probabilities for these distributions from the tables in the back of the book.

In the statistics section, we learned how to estimate the expected value and standard

deviation for a random variable, based on available data. With these estimates, we constructed

confidence intervals for expected values and tested hypotheses regarding their magnitudes. We

probably repeated these analyses for random variables constructed from linear combinations of

other random variables.

All of this may have been interesting in its own right. However, none of it addressed the

fundamental question that underlies most of science: How does one thing affect another? It was all

about understanding the nature of a single thing. The discussion didn’t include a second thing.

There was nothing to affect or to be affected by.

This doesn’t mean that the material from our previous course regarding single random

variables can be ignored or discarded. To the contrary, that material is a foundation for everything

that follows. We’ll review it as necessary, especially in chapters 5 through 7.

However, the central theme of this book is econometrics. This theme begins where our

conversations regarding probability and statistics left off. It is devoted to answering the question of

how one thing affects another. This is the sort of question that we ask ourselves all the time.

Whenever we wonder whether our grade will go up if we study more, whether we’re more likely

to get into graduate school if our grades are better, or whether we’ll get a better job if we go to

graduate school, we are asking questions that econometrics can answer with elegance and

precision.

Of course, we probably think we have answers to these questions already. We almost

surely do. However, they’re casual and even sloppy. Moreover, our confidence in them is almost

certainly exaggerated. Econometrics not only teaches how to answer questions like this more

accurately. It also helps us understand what is necessary in order to obtain an answer that we can

legitimately treat as accurate.

We begin in this chapter with a primer on how to interpret regression results. This will allow us to read work based on regression and even to begin to perform our own analyses. We

might think that this would be enough.

However, this chapter will not explain why the interpretations it presents are valid. That

requires a much more thorough investigation. We prepare for this investigation in chapter 2. There

we review the summation sign, the most important mathematical tool for the purposes of this book.

We actually embark on this investigation in chapter 3. We consider the precursors to

regression, the covariance and correlation. These are basic statistics that measure the association

between two variables, without regard for causation. We may well have been introduced to them

towards the end of our course in probability and statistics. We return to them in detail because they

are the mathematical building blocks from which regressions are constructed.

However, our primary focus will be on the basics of regression analysis. Regression is the

principal tool that economists use to assess the responsiveness of some outcome to changes in its

determinants. Our previous course may have contained a short introduction to this subject. Here we

devote chapters 4 through 13 to a thorough discussion. The remainder of the book introduces a

selection of more advanced topics that have particularly important applications.

Section 1.2: Education and earnings

Few of us will be interested in econometrics purely for its abstract, theoretical beauty. In fact, this

book is designed on the premise that what will interest us most is how econometrics can help us

organize the quantitative information that we observe all around us. Obviously, we’ll need

examples.

There are two ways to approach the selection of examples. Econometric analysis has

probably been applied to virtually all aspects of human behavior. This means that there is

something for everyone. Why not provide it?

Well, this strategy would involve a lot of examples. Most readers wouldn’t need that many

to get the hang of things, and wouldn’t be interested in a lot of them. In addition, they could make the book a lot bigger, which is counterproductive if students are inclined to be intimidated, anyway.

The alternative is to focus principally on one example which may have relatively broad

appeal, and develop it throughout the book. That’s the choice here. We will still sample a variety of

applications over the course of the entire text. However, our running example returns, in a larger

sense, to the question of section 1.1: What are we doing here? Except now, let’s talk about college,

not this course.

Presumably, at least some of the answer to that question is that we believe college prepares

us in an important way for adulthood. Part of that preparation is for jobs and careers. In other

words, we probably believe that education has some important effect on our ability to support

ourselves.

This is the example that we’ll pursue throughout this book. In the rest of this chapter, we’ll

interpret a fairly complicated regression which takes earnings as affected by several determinants,

with education among them. In chapter 3, we’ll return to basics, and simply ask whether there’s an

association between education and earnings. Starting in chapter 4, we’ll assume that education

affects earnings, and ask: By how much? In chapter 11 we’ll examine whether this causal

assumption is satisfactory, and what can be done if it’s not.

As we can see, we’ll ask this question with increasing sophistication as we proceed through

this book.1 The answers will demonstrate the power of econometric tools to address important

quantitative questions. They will also serve as illustrations for applications to other questions.

Lastly, we can hope that they will confirm our commitment to higher education.

Footnote 1: By the end of this book, we will be reasonably sophisticated. However, we won’t reach the frontier. For a look at that, try Card (1999).

Section 1.3: What does a regression look like?

Figure 1.1 is a typical presentation of a regression:


Figure 1.1

Our first regression

Earnings = −19427. × Intercept + 3624.3 × Years of school
           (−2.65)               (9.45)

    + 378.60 × Age − 17847. × Female
      (3.51)         (−7.17)

    − 10130. × Black − 2309.9 × Hispanic
      (−2.02)          (−.707)

    − 8063.9 × American Indian or Alaskan Native
      (−.644)

    − 4035.6 × Asian − 3919.1 × Native Hawaiian, other Pacific Islander
      (−.968)          (−.199)

R2 = .1652

Adjusted R2 = .1584

F-statistic = 24.5, prob-value < .0001

Observations = 1000

Note: Parentheses contain t-statistics.

Does that answer the question?

Superficially, yes. But what does it all mean?

This question can be answered on two different levels. In this chapter, we’ll talk about how

to interpret the information in figure 1.1. This should put us in a position to read and understand

other work based on regression analysis. It should also allow us to interpret regressions of our own.

In the rest of the book, we’ll talk about why the interpretations we offer here are valid.

We’ll also talk about the circumstances under which these interpretations may have to be modified, or may even be untrustworthy. There will be a lot to say about these matters. But for the moment, it

will be enough to work through the mystery of what figure 1.1 could possibly be trying to reveal.
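Before working through that mystery, readers who want to see how a display like figure 1.1 gets produced in practice can generate one with any statistical package. Here is a minimal sketch in Python using the statsmodels library; the data are simulated and the coefficients are invented for illustration, so the output will resemble figure 1.1 only in format, not in values.

# A minimal sketch, not the data behind figure 1.1: simulate 1,000
# hypothetical workers and estimate an earnings regression by least squares.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
school = rng.integers(0, 22, n)    # years of school, 0 through 21
age = rng.integers(18, 66, n)      # age, 18 through 65
female = rng.integers(0, 2, n)     # 1 = female, 0 = male

# An invented "true" relationship plus noise, chosen only for illustration.
earnings = (-20000 + 3600 * school + 380 * age
            - 18000 * female + rng.normal(0, 25000, n))

X = sm.add_constant(np.column_stack([school, age, female]))
result = sm.OLS(earnings, X).fit()

# The summary reports the intercept, slopes, t-statistics, R-squared,
# adjusted R-squared, F-statistic and observation count, as in figure 1.1.
print(result.summary(xname=["Intercept", "Years of school", "Age", "Female"]))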

Section 1.4: Where do we begin?

The first thing to understand about regression is the very first thing in figure 1.1. The word

“earnings” identifies the dependent variable in the regression. It is also occasionally referred to as

the endogenous variable.

The dependent variable is the primary characteristic of the entities whose behavior we are

trying to understand. These entities might be people, companies, governments, countries, or any

other choice-making unit whose behavior might be interesting. In the case of figure 1.1, we might

imagine that “earnings” implies that we’re trying to understand the behavior of companies.

However, “earnings” here refers to the payments that individuals get in return for their labor. So the

entities of interest here are workers or individuals who might potentially be workers.

“Dependent” and “endogenous” indicate the role that “earnings” plays in the regression of

figure 1.1. We want to explain how it gets determined. “Dependent” suggests that “earnings”

depends on other things. “Endogenous” implies the same thing, though it may be less familiar. It

means that the value of “earnings” is determined by other, related pieces of information.

The question of what it means to “explain” something statistically can actually be quite

subtle. We will have some things to say about this in chapters 4 and 11. Initially, we can proceed as

if we believe that the things that we use to “explain” earnings actually “cause” earnings.

Section 1.5: Where’s the explanation?

Most of the rest of figure 1.1, from the equality sign to “(−.199)”, presents our explanation of

earnings. The equality sign indicates that we’re going to represent this explanation in the form of an

equation. On the right-hand side of the equation we’re going to combine a number of things

algebraically. Because of the equality, it looks as though the result of these mathematical operations


will be “earnings”. Actually, as we’ll learn in chapter 4, it will be more accurate to call this result

“predicted earnings”.

The material to the right of the equality sign in figure 1.1 is organized into terms. The terms

are separated by signs for either addition, “+”, or subtraction, “−”. Each term consists of a number

followed by the sign for multiplication, “×”, a word or group of words, and a second number in

parentheses below the first.

In each term, the word or group of words identifies an explanatory variable. An

explanatory variable is a characteristic of the entities in question that we think may cause, or help to

create, the value that we observe for the dependent variable.

Explanatory variables are also referred to as independent variables. This indicates that

they are not “dependent”. For present purposes, this means that they do not depend on the value of

the dependent variable.

Explanatory variables are also referred to as exogenous variables. This indicates that they

are not “endogenous”. The variables listed in the terms to the right of the equality can be thought of

as causing the dependent variable, but not the other way around. We often summarize this

assumption as “causality runs in only one direction”.

This same idea is sometimes conveyed by the assertion that the explanatory variables are

pre-determined. This means that their values are already known at the moment when the value of

the dependent variable is determined. They have been established at an earlier point in time. The

point is that, as a first approximation, behavior that occurs later, historically, can’t influence

behavior that preceded it.2

Footnote 2: There are some subtleties associated with this point that we will explore in chapter 11.

This proposition is easy to accept in the case of the regression of figure 1.1. Earnings accrue

during work. Work, or at least career, typically starts a couple of decades into life. Racial or ethnic

identity and sex are usually established long before then. Age accrues automatically, starting at

birth. Schooling is usually over before earnings begin as well. Therefore, it would be hard to make


an argument at this stage that the dependent variable, earnings, causes any of the explanatory

variables.3

Footnote 3: Again, we’ll consider an argument to this effect in chapter 11.

In each term of the regression, the number that multiplies the explanatory variable is its

slope. The reason for this name will become apparent in chapter 4.4 The slope estimates the

magnitude of the effect that the explanatory variable has on the dependent variable.

Footnote 4: This number may be referred to elsewhere as the coefficient. We won’t do this. Instead, we’ll reserve “coefficient” for use in chapter 5, where we first distinguish between the sample and the population.

Finally, the number in parentheses measures the reliability of the slope. In figure 1.1, these

numbers are t-statistics. What usually matters most with regard to interpretations of the t-statistic is

its prob-value. However, prob-values don’t appear in figure 1.1. This is because they don’t

appear in the most common presentations of regression results, which is what our discussion of

figure 1.1 is preparing us for.

We’ll offer an initial explanation of prob-values and their interpretations in section 1.8.

We’ll present the prob-values for figure 1.1 in table 1.3 of section 1.11. Lastly, we’ll explore the

calculation and interpretation of t-statistics at much greater length in chapter 6.

In the presentation of figure 1.1, what matters to us most is the absolute value of the t-

statistic. If it is approximately two or greater, we can be pretty sure that the associated explanatory

variable has a reliable effect on the dependent variable. In this case, we usually refer to the

associated slope as being statistically significant, or just significant. It is our best guess of how

big this effect is.

If the absolute value of the t-statistic is less than approximately two, regression has not been

able to identify a reliable effect of the explanatory variable on the dependent variable, according to

conventional standards. There just isn’t enough evidence to support the claim that the explanatory

variable actually affects the dependent variable. In this case, we often refer to the associated slope

as statistically insignificant, or just insignificant.


As we’ll see in chapter 6, this is not an absolute judgement. t-statistics that are less than two

in absolute value, but not by much, indicate that regression has identified an effect that is almost

reliable according to conventional standards. In contrast, t-statistics that have absolute values of less

than, say, one, indicate that there’s hardly even a hint of any reliable relationship.5 Nevertheless, for

reasons that will become clearer in chapter 6, we usually take two to be the approximate threshold

distinguishing explanatory variables which have effects worth discussing from those that don’t.

Footnote 5: The algebraic sign of the t-statistic is unimportant, because the absolute value is all that matters. Why, then, does figure 1.1 report the sign? Mainly to demonstrate that just because something appears in print doesn’t mean that it’s worth paying attention to. In this case, and in the many examples in other publications where these signs appear, they do not convey any useful information. Efficient presentations omit them at no substantive cost. We’ll omit them in all subsequent chapters.

As we can see in figure 1.1, regression calculates a value for the slope regardless of the

value of the associated t-statistic. However, this discussion demonstrates that not all of these slopes

have the same claim on our attention. If a t-statistic is less than two in absolute value, and especially

if it’s a lot less than two, it’s best to assume, for practical purposes, that the associated explanatory

variable has no important effect on the dependent variable.
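To see the threshold rule in action, the sketch below screens the slopes reported in figure 1.1 by the absolute value of their t-statistics. The numbers are transcribed from the figure; the rule of thumb itself is justified in chapter 6.

# Slopes and t-statistics as reported in figure 1.1.
results = {
    "Years of school": (3624.3, 9.45),
    "Age": (378.60, 3.51),
    "Female": (-17847.0, -7.17),
    "Black": (-10130.0, -2.02),
    "Hispanic": (-2309.9, -0.707),
    "American Indian or Alaskan Native": (-8063.9, -0.644),
    "Asian": (-4035.6, -0.968),
    "Native Hawaiian, other Pacific Islander": (-3919.1, -0.199),
}

# Step 1 of interpretation: keep only slopes whose t-statistics have
# absolute value of roughly two or more.
for name, (slope, t) in results.items():
    verdict = "significant" if abs(t) >= 2 else "insignificant"
    print(f"{name}: slope {slope:+.1f}, t = {t:+.2f} -> {verdict}")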

Section 1.6: What do we look for in this explanation?

The regression in figure 1.1 contains nine terms. Eight contain true explanatory variables, and are

potentially interesting.

The first term, which figure 1.1 calls the intercept, does not contain a true explanatory

variable, and is ordinarily uninteresting.6 It measures a part of the dependent variable that is

common to all entities under examination. In other words, it measures a part that doesn’t depend on

the other characteristics of these entities that are included as explanatory variables in the regression.

Footnote 6: There are occasional exceptions. Chapter 9 will present an example where the intercept actually estimates the effect of a particular explanatory variable. However, this only occurs in problems that are much more complicated than those which we’re considering here. The intercept is often referred to as the constant. As with “coefficient”, we’ll reserve that word to refer to the population in chapter 5.


This isn’t usually interesting because we’re typically concerned with explaining why

different people or organizations are different from each other. The intercept usually tells us only

about what they share. Consequently, it isn’t informative about the relevant question.

Of course, if the intercept has a t-statistic that is less than two in absolute value, then we can

ignore it on the grounds that it is unreliable in any case. However, if its t-statistic is greater than two

in absolute value, as in figure 1.1, then we have to acknowledge that it’s reliably different from

zero.

That still doesn’t make it interesting. In the case of figure 1.1, the interesting question is

why some people have higher earnings than others. The intercept in the regression there tells us that

everyone starts out with negative $19,427 in annual earnings, regardless of who they are.

This can’t be literally true. It’s probably best to take the intercept as simply a mechanical

device. As we’ll see in chapter 4, its purpose is just to provide an appropriate starting point from

which to gauge the effects of the genuine explanatory variables.

The eight true explanatory variables are potentially interesting because they measure

specific characteristics of each person. Regression can attempt to estimate their contributions to

earnings because they appear explicitly in the regression. The first question that must be asked with

regard to any of them is whether the regression contains any evidence that they actually affect the

dependent variable.

As we learned in the last section, the t-statistic answers this question. Therefore, the first

number to look at with regard to any of the explanatory variables in figure 1.1 is the number in

parentheses. If it has an absolute value that is greater than two, then the regression estimates a

reliable effect of the associated explanatory variable on the dependent variable. These variables

deserve further attention.

In figure 1.1, t-statistics indicate that four explanatory variables are statistically significant,

or have reliable effects: years of school, age, female and Black. With regard to these four, the next

question is how big are these effects? As we said in the last section, the answer is in the slopes.


Two of these explanatory variables, years of school and age, are continuous. This means

that they can take on a wide range of values. This regression is based on individuals whose years of

school range from zero to 21. Age varies from 18 to 65.7

Footnote 7: In strict mathematical terms, these variables are not exactly continuous. They cover limited ranges. In the example of figure 1.1, they take on only integer values within those ranges. Nevertheless, they can be treated as continuous for the purposes here without introducing any important distortions.

For these variables, the simplest interpretation of their slopes is that they estimate how

earnings would change if years of school or age increased by a year. For example, the slope for

years of schooling is 3624.3. This indicates that earnings could be expected to increase by

$3,624.30 for each additional year devoted to study. Similarly, the slope for age is 378.60. Earnings

would increase by $378.60 annually simply as an individual grows older.

This interpretation is based on the image of following individuals as their lives evolve. This

sort of image will usually be helpful and not grossly inappropriate. However, it’s not exactly what’s

going on in figure 1.1. That regression is not comparing different moments in the life of the same

individual. Instead, it’s comparing many different individuals, of different ages, to each other.

This suggests a more correct, though more subtle interpretation of, for example, the slope

associated with years of schooling. It actually compares the earnings of two individuals who have

the same values for all of the other explanatory variables, but who differ by one year in their

schooling. In other words, the regression of figure 1.1 tells us that if we had two individuals who

had the same racial and ethnic identification, were of the same sex and age, but differed by one year

in schooling attainment, we would expect the individual with greater schooling to have annual

earnings that exceeded those of the other individual by $3,624.30.

Chapter 12 will demonstrate formally why this interpretation is appropriate. Until then, it’s

enough to summarize the interpretation of the preceding paragraph as follows: Any slope estimates

the effect of the associated explanatory variable on the dependent variable, holding constant all

other variables.
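A minimal sketch makes this comparison concrete. The function below simply evaluates the equation in figure 1.1; the two profiles being compared are invented, and the second comparison previews the discrete variables discussed below.

# Evaluate the figure 1.1 equation for a hypothetical individual.
def predicted_earnings(school, age, female, black=0, hispanic=0,
                       am_indian=0, asian=0, pacific=0):
    return (-19427 + 3624.3 * school + 378.60 * age - 17847 * female
            - 10130 * black - 2309.9 * hispanic - 8063.9 * am_indian
            - 4035.6 * asian - 3919.1 * pacific)

# Two otherwise similar men who differ by one year of school.
a = predicted_earnings(school=12, age=40, female=0)
b = predicted_earnings(school=13, age=40, female=0)
print(round(b - a, 2))   # 3624.3: the slope, ceteris paribus

# An otherwise similar woman: the difference is the Female slope.
c = predicted_earnings(school=12, age=40, female=1)
print(round(c - a, 2))   # -17847.0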


This interpretation is often conveyed by the Latin phrase ceteris paribus.8 It’s important to

remember that, regardless of the language in which we state this interpretation, it means that we are

holding constant only the other variables that actually appear as explanatory in the

regression.9 We will often summarize this condition by stating that we are comparing individuals

or entities that are otherwise similar, except for the explanatory variable whose slope is under

discussion.

Footnote 8: According to the Oxford English Dictionary, Second Edition on CD-ROM, the first word in this phrase is pronounced approximately as “set-er-us”, “chet-er-us” or “ket-er-us”. The second word is pronounced approximately as “pear-ih-bus”. Accents are on the first syllable of both words. The phrase means “(o)ther things being equal, other conditions corresponding”.

Footnote 9: Chapter 12 will discuss at length how our interpretations are affected by, and can account for, variables that should be explanatory but are omitted from the regression.

The ceteris paribus interpretation of the slope for the age variable is analogous to that of

years of schooling. Again, if we compared two individuals who had identical racial and ethnic

identities, the same sex and level of education, but who differed by one year in age, the older’s

earnings would exceed those of the younger by $378.60.

The two remaining statistically significant explanatory variables are female and Black. Both

are discrete variables, or categorical variables, meaning that they identify the presence of a

characteristic that we ordinarily think of as indivisible. In this case, the variable Female

distinguishes women from men, and the variable Black distinguishes individuals who reported

themselves as at least partially Black or African American from those who did not.10

Footnote 10: It might be argued that both sex and racial and ethnic identities are “divisible”. Gender can encompass biology, social behaviors and sexual preferences. Race and ethnic heritage can include ancestors from many different geographic locations and physical archetypes. In some contexts, these distinctions may be central to the relevant questions. We acknowledge them here and discuss some possible implications in section 1.10. However, we ignore them for the purposes of our examples.

For this reason, the interpretation of the slopes associated with discrete explanatory

variables differs somewhat from that of slopes associated with continuous explanatory variables. In

the latter case, the slope indicates the effect of a marginal change in the explanatory variable. In


the former case, the slope indicates the effect of changing from one category to another.

At the same time, the interpretation of slopes associated with discrete variables is ceteris

paribus, as is that of slopes associated with continuous variables. In other words, if we compare a

man and a woman who have the same age, the same amount of schooling and the same racial and

ethnic identities, we expect their incomes to differ by the amount indicated by the slope associated

with the Female variable.

In figure 1.1, this slope indicates that the woman would have annual earnings that are

$17,847 less than those of the man. Similarly, the slope for Blacks indicates that, if we compare two

individuals of the same age, schooling and sex, annual earnings for the Black person will be

$10,130 less than those of the otherwise similar white person.

These must seem like very large differences. We’ll talk about this in the next section. We

conclude this section by noting that the slopes of the variables identifying individuals who are

Hispanic, American Indian or Alaskan Native and Asian are all statistically insignificant.

Nevertheless, they also seem to be large. Even the smallest of them, that for Hispanics, indicates

that their annual earnings might be $2,309.90 less than those of otherwise similar whites.

Although the magnitudes of the slopes for these variables might seem large and even

alarming, it’s inappropriate to take them seriously. Not only are their t-statistics less than two,

they’re a lot less than two. This means that, even though regression has calculated slopes for these

variables, it really can’t identify their effects, if any, with any confidence. In later chapters, we’ll

discuss what we might do if we wanted to identify them more reliably.

Section 1.7: How do we interpret the explanation?

Regression interpretation proceeds through three steps. We’ve already taken the first. It was to

identify the explanatory variables that have statistically significant slopes. For the most part,

regression is only informative about these explanatory variables. They are the only variables for

which regression can estimate reliable effects.


We’re also halfway through the second step, which is to interpret the magnitude of these

effects. Effects that are both statistically significant and substantively large are the ones that really

catch our attention.

The slope on the categorical variable for females is an example. Not only is it

estimated very reliably, it indicates that women have annual earnings that are almost $18,000 less

than those of otherwise similar men. In economic terms, this difference seems huge.11

Footnote 11: Exercise 1.4 further explores the magnitude of this effect.

The slope associated with Blacks is similar. Its t-statistic is smaller, as is its magnitude.

However, the t-statistic is still large enough to indicate statistical significance. The magnitude is

still big enough to be shocking.

It takes a little more work to evaluate the effect of years of school. Its t-statistic indicates

that it is very reliable. However, its magnitude is markedly smaller than that of the slopes for

females and Blacks.

Nevertheless, this slope indicates that a worker with one more year of schooling than an

otherwise similar worker will have earnings that are greater by $3,624.30 in every year that they are

of working age. If a typical working career lasts for, perhaps, 40 years, this advantage accumulates

to something quite substantial.

Another way to think of this is to calculate the earnings advantage conferred by completing

an additional level of schooling. People with college degrees have approximately four more years of

school than those who end their formal education with high school graduation. Relative to these

people, an individual with a college degree will get the $3,624.30 annual earnings premium for each

of their four additional years of schooling.

This amounts to a total annual earnings premium of $14,497.20. This premium is again

quite large. It explains why so many people continue on to college after high school and why there

is so much concern regarding the career prospects for those who don’t.
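The premium arithmetic is easy to verify directly; the 40-year career below is the rough length suggested in the text.

# Annual earnings premium per year of school, from figure 1.1.
per_year = 3624.30
print(round(4 * per_year, 2))    # 14497.2: the college-versus-high-school premium
print(round(40 * per_year, 2))   # what one extra year accumulates to over a 40-year career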

It also presents an interesting comparison to the slopes for women and Blacks. The slope for


women is larger than the earnings premium for four years of school. This suggests that, in order for

women to have the same annual earnings as otherwise similar men, they would have to have more

than four additional years of education. Similarly, Blacks would have to have nearly three

additional years of education in order to attain the same earnings level as otherwise similar whites.
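These education-equivalents come from dividing one slope by another, using the figure 1.1 values:

# Years of school needed to offset each gap, at $3,624.30 per year.
print(round(17847 / 3624.3, 2))   # about 4.9 years for the female gap
print(round(10130 / 3624.3, 2))   # about 2.8 years for the Black-white gap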

The remaining explanatory variable with a statistically significant effect is age. Its slope is

about one-tenth of that associated with years of school, so its effect is substantively much smaller.

Two otherwise similar people would have to differ in age by about 27 years in order to have an

earnings difference similar to that between an otherwise similar Black and white individual of the

same age.12

Footnote 12: This comparison is based only on the slopes in the regression of figure 1.1. Exercise 1.1 invites us to figure out how it is calculated.

This raises a very interesting point. It is possible for an explanatory variable to have an

effect which is statistically significant, but economically, or substantively, unimportant. In this

case, regression identifies an effect which is reliably small. While this may be of moderate interest,

it can’t be nearly as intriguing as a reliably large effect.

In other words, statistical significance is necessary in order to have interesting results, but

not sufficient. Any analysis that aspires to be interesting therefore has to go beyond identifying

statistical significance to consider the behavioral implications of the significant effects. If these

implications all turn out to be substantively trivial, their significance will be of limited value.

This second interpretive step reveals another useful insight. The slope for females in figure

1.1 is given as −$17,847. The computer program that calculated this slope added several digits

the right of the decimal point. However, the most important question that we’ve asked of this slope

is whether it is substantively large or small. Our answer, just above, was that it is “huge”. Which of

the digits in this slope provide this answer?

It can’t be the digits to the right of the decimal point. They aren’t even presented in figure

1.1. It also can’t be the digits in the ones’ or tens’ place in the slope as it is presented there. If this


slope had been −$17,807 instead of −$17,847, would we have concluded that it wasn’t huge, after

all? Hardly. In fact, this effect would have arguably looked huge regardless of what number was in

the hundreds’ place.

In other words, the substantive interpretation that we applied to this variable really

depended almost entirely on the first two digits. The rest of the digits did not convey any really

useful information, except for holding their places.

At the same time, they didn’t do much harm. They wouldn’t, unless they’re presented in

such profusion that they distract us from what’s important. Unfortunately, that happens a lot. We

should be careful not to invest too much effort into either interpreting these digits in other people’s

work, or into presenting them in our own.13

Footnote 13: We’ll revisit this issue in chapters 3, 6 and 7.

The third interpretive step is to formulate a plausible explanation of the slope magnitudes,

based on our understanding of economic and social behavior. In fact, this is something we should

have done already. Why did we construct the regression in figure 1.1 in the first place? Presumably,

because we had reasons to believe that the explanatory variables were important influences on

earnings.

It’s now time to revisit those reasons. We compare them to the actual regression results.

Where they are consistent, our original beliefs are confirmed and strengthened. Where they are

inconsistent, we have to consider revising our original beliefs. This is the step in which we

consolidate what we have learned from our regression. Without it, the first two steps aren’t of much

value.

We begin this step by simply asking “Why?” For example, why does education have such a

large positive effect on income? It seems reasonable to believe that people with additional

education might be more adept at more sophisticated tasks. It seems reasonable to believe that more

sophisticated tasks might be more profitable for employers. Therefore, it may be reasonable to

expect that employers will offer higher levels of pay to workers with higher levels of education.


The regression in figure 1.1 confirms these expectations. This in itself is valuable.

Moreover, our expectations had very little to say about exactly how much employers would be

willing to pay for an additional year of schooling. The slope estimates this quantity for us, in this

case with a relatively high level of reliability.

Of course, there was no guarantee that the regression would be so congenial. How would we

have responded to a slope for years of education that was inconsistent with our expectations, either

too low or too high?14 What would we have done if the data, representing actual experience, was

inconsistent with our vision about what that experience should be like?

Footnote 14: Exercise 1.2 explores this question more thoroughly.

A contradiction of this sort raises two possibilities. Either our expectations were wrong, or

something was wrong with the regression that we constructed in order to represent them. It would

be our obligation to review both in order to reconcile expectations and experience.

Ordinarily, we would begin with the issue where we were least confident in our initial

choices. If we were deeply committed to our expectations, we would suspect the regression. If we

believed that the regression was appropriate, we would wonder first what was faulty in our

expectations.

In the case of years of schooling, its estimated effect probably leaves most of us in the

following position: We were already fairly certain that more education would increase earnings.

Our certainty is now confirmed: We have an estimate of this effect which is generally consistent

with our expectations. In addition, we have something much more concrete than our own intuition

to point to for support when someone else takes a contrary position.

The slope for age presents a different explanatory problem. Why might we expect that

earnings would change with age? We can certainly hope that workers become more productive as

they learn more about their work. This would suggest that earnings should be greater for older

workers.

At the same time, at some point in the aging process workers become less vigorous, both


physically and mentally. This should reduce their productivity, and therefore their wages.

This might be an explanation for why the slope for age is relatively small in magnitude.

Perhaps it combines an increase in earnings that comes from greater work experience, and a

reduction in earnings that comes from lower levels of activity? The first effect might be a little

stronger than the second, so that the net effect is positive but modest.

This is superficially plausible. However, a little more thought suggests that it is problematic.

For example, is it plausible that these two effects should cancel each other to the same degree,

regardless of worker age?

It may seem more reasonable to expect that the effects of experience should be particularly

strong when the worker has very little, at the beginning of the career. At the ages when most people

begin to work, it’s hard to believe that vigor noticeably deteriorates from one year to the next. If so,

then productivity and, therefore, earnings, should increase rapidly with age for young workers.

Conversely, old workers may have little more to learn about the work that they do. At the

same time, the effects of aging on physical and mental vigor may be increasingly noticeable. This

implies that, on net, productivity and earnings might decline with age among old workers.


Figure 1.2

Potential effects of age on earnings

Figure 1.2 illustrates these ideas. As shown there, a more thorough understanding of the

underlying behavior suggests that the effects of increasing age on earnings should depend on what

age we’re at. We’ll learn how to incorporate this understanding into a regression analysis in chapter

13. For the moment, it suggests that we should be cautious about relying on the slope for age in the

regression of figure 1.1. It’s estimated reliably but it’s not clear what it represents.
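Chapter 13 will develop the standard remedy, which is to let the age effect itself vary with age. As a minimal sketch of the idea, with invented coefficients: when age enters as a quadratic, the one-year effect is large and positive early in the career, near zero in mid-career, and negative late, which is the shape drawn in figure 1.2.

# Invented coefficients, purely to illustrate the quadratic idea.
def age_profile(age, b1=2000.0, b2=-25.0):
    return b1 * age + b2 * age ** 2

for a in (25, 40, 55):
    effect = age_profile(a + 1) - age_profile(a)   # one more year of age
    print(a, round(effect))                        # 725, -25, -775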

This difficulty is actually a symptom of a deeper issue. The confusion arises because,

according to our explanation, the single explanatory variable for age is being forced to do two

different jobs. The first is to serve as a rough approximation, or proxy, for work experience. The

second is to serve, again as only a proxy, for effort.

In other words, the two variables that matter, according to our explanation, aren’t in the

regression. Why? Because they aren’t readily available. The one variable that is included in the

regression, age, has the advantage that it is available. Unfortunately, it doesn’t reproduce either of


the variables that we care about exactly. It struggles to do a good job of representing both

simultaneously. We’ll return to this set of issues in chapters 12 and 13.

What about the large negative effects of being female or Black? We might be tempted to

explain them by surmising that women and Blacks have lower earnings than do white males

because they have less education, but that would be wrong. Education is entered as an explanatory

variable in its own right. This means that, as we said in section 1.6, it’s already held constant. The

slopes for female and Black compare women and Blacks to males and whites that are otherwise

similar, including having the same years of schooling.

The explanation must lie elsewhere. One possibility is that women and Blacks differ from

white males in some other way that is important for productivity, but not included in the regression

of figure 1.1. We’ll talk about this possibility in the next section, and at length in chapter 12.

A second possibility is that the quality of the education or work experience that women and

Blacks get is different from that of white males. We’ll talk about how we might address this in

chapter 13. Yet another possibility is that women and Blacks suffer from discrimination in the labor

market. This is the most disturbing explanation, and the hardest to pin down. We will return to it in

chapter 12 and subsequently.

Section 1.8: How do we evaluate the explanation?

At this point, we know what explanatory variables seem to be important in the determination of

earnings, and we have explanations for why they might be so. Can we say anything about how good

these explanations are, as a whole?

The answer, perhaps not surprisingly, is yes. The remaining information in figure 1.1 allows

us to address this question from a statistical perspective. The most important piece of additional

information in figure 1.1 is the prob-value (sometimes referred to as the “p-value”) associated

with the F-statistic. The F-statistic tests whether the whole ensemble of explanatory variables has a

reliable collective effect on the dependent variable. In other words, the F-statistic essentially


answers the question of whether the regression has anything at all useful to say about the dependent

variable.

Chapter 12 will explain how it does this in detail. For the moment, the answer is most

clearly articulated by the prob-value, rather than by the F-statistic with which it is associated. If the

prob-value is .05 or less, then the joint effect of all explanatory variables on the dependent variable

is statistically significant.15

Footnote 15: This threshold is not always absolute. In some contexts, again addressed in chapter 6, prob-values of up to .1 are taken as evidence of a statistically significant joint effect.

If the prob-value is larger than .05, the ensemble of explanatory variables does not have a

jointly reliable effect on the dependent variable. We’ll explore the implications of this in chapter 12.

It could be that a subgroup of explanatory variables really does have a reliable effect that is being

obscured by the subgroup of all other explanatory variables. But it could also be that the regression

just doesn’t tell us anything useful.

In the case of figure 1.1, the prob-value associated with the F-statistic is so small that the

computer doesn’t calculate a precise value. It simply tells us that the prob-value is less than .0001.

Further precision isn’t necessary, because this information alone indicates that the true p-value is

not even one five-hundredth of the threshold value of .05. This means that there can be almost no

doubt that the joint effect of the collection of explanatory variables is statistically significant.

What’s left in figure 1.1 are two R2 measures. The first, “R2”, is sometimes written as the

“R-square” or “R-squared” value. It is sometimes referred to as the coefficient of determination.

The R2 represents the proportion of the variation in the dependent variable that is explained by the

explanatory variables. The natural interpretation is that if this proportion is larger, the explanatory

variables are a more dominant influence on the dependent variable. So bigger is generally better.

The question of how big the R2 should be is difficult. First, the value of the R2 depends

heavily on the context. For example, the R2 value in figure 1.1 is approximately .17. This implies

that the eight explanatory variables explain a little less than 17% of the variation in annual earnings.


This may not seem like much. However, experience shows that this is more or less typical

for regressions that are comparing incomes of different individuals. Other kinds of comparisons can

yield much higher R2 values, or even lower values.

The second reason why the magnitude of the R2 statistic is difficult to evaluate is that it

depends on how many explanatory variables the regression contains and how many individuals it is

comparing. If the first is big and the second is small, the R2 can seem large even if the regression

doesn’t provide a very good explanation of the dependent variable.16

Footnote 16: Chapter 4 defines the R2 statistic and analyzes these issues. Chapter 12 revisits them.

The adjusted R2 is an attempt to correct for the possibility that R2 is distorted in this way.

Chapter 4 gives the formula for this correction, which naturally depends on the number of

explanatory variables and the number of individuals compared, and explains it in detail. The

adjusted R2 is always less than R2. If it’s a lot less, it suggests that the R2 is misleading because it is,

in a sense, trying to identify a relatively large number of effects off of a relatively small number of

examples.

The number of individuals involved in the regression is given in figure 1.1 as the number of

observations. “Observation” is a generic term for a single example or instance of the entities or

behavior under study in a regression analysis. In the case of figure 1.1, each individual represents

an observation. All of the observations together constitute the sample upon which the regression is

based.

According to figure 1.1, the regression there is based on 1,000 observations. That is, it

compares the value of earnings to the values of the eight explanatory variables for 1,000 different

individuals. This is big enough, and the number of explanatory variables is small enough, that the

R2 should not be appreciably distorted. As figure 1.1 reports, the adjusted R2 correction doesn’t

reduce R2 by much.
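The adjustment can be checked from the numbers reported in figure 1.1. One standard form of the correction is 1 − (1 − R2)(n − 1)/(n − k − 1), with k counting the eight true explanatory variables; assuming that is the form chapter 4 derives, the reported value is reproduced up to rounding.

r2, n, k = 0.1652, 1000, 8   # values reported in figure 1.1
adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
# Prints 0.1585; the reported .1584 differs only because the published
# R2 is itself rounded to four digits.
print(round(adj, 4))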

We’ve examined the R2 statistic in some detail because it gets a lot of attention. The reason

for its popularity is that it seems to be easy to interpret. However, nothing in its interpretation


addresses the question of whether any of the regression results are reliable. For this reason, the

attention that the R2 gets is misplaced. It has real value, not because it answers this question

directly, but because it is the essential ingredient in the answer.

Section 1.9: R2 and the F-statistic

The F-statistic, which we discussed in the last section, addresses the question of reliability directly.

That’s why we’ll emphasize it in preference to the R2. Ironically, as chapter 12 will prove, the F-

statistic is just a transformation of the R2. It’s the R2 dressed up, so to speak, so as to be more

presentable. For this reason, the R2 is necessary even though it’s not sufficient to serve our interests.

Moreover, there are other transformations of the R2 that are also very useful.

For example, suppose that we wondered whether the whole category of racial and ethnic

identity has any relevance to earnings. The evidence in figure 1.1 is mixed. The slope for Blacks is

negative and statistically significant. However, the slopes for the four other racial and ethnic

categories are not statistically significant. Is it possible that we could responsibly discard all of the

information on racial and ethnic identity?


Figure 1.3

Our first regression, restricted

Earnings = −22756. × Intercept + 3682.8 × Years of school
           (−3.83)               (10.8)

    + 396.94 × Age − 18044. × Female
      (3.79)         (−7.27)

R2 = .1611

Adjusted R2 = .1586

F-statistic = 63.8, prob-value < .0001

Observations = 1000

Note: Parentheses contain t-statistics.

Figure 1.3 attempts exactly this. It presents a modification of the regression in figure 1.1.

All five of the variables indicating racial and ethnic identity have been omitted. Is there any way to

tell whether this regression is better or worse than that of figure 1.1?

There are two considerations. The regression of figure 1.3 has fewer explanatory variables.

Therefore, it’s simpler. If nothing else were to change, this would be a good thing. However, its R2

is lower. It has less explanatory power. If nothing else were to change, this would be a bad thing.17

The question, then, of whether the regression of figure 1.3 is better than that of figure 1.1 turns on

whether the advantages of a simpler explanation outweigh the disadvantages of reduced explanatory

power.

Footnote 17: As we’ll see in chapter 13, the R2 value for the regression in figure 1.3 has to be lower, or at least no higher, than the value for the regression in figure 1.1. Exercise 1.3 compares the slopes from the two regressions.

This is the first question that we’ve asked of the data that can’t be answered with a single

number already present in figures 1.1 or 1.3. Most statistical software packages will provide the


necessary number if instructed to do so. Here, let’s take the opportunity to introduce a formula for

the first time. We’ll talk about this formula at some length in chapter 12. For the moment, it’s

sufficient to note that it depends on the comparison between the R2 values for the regressions with

and without the variables measuring racial and ethnic identity.

Let’s distinguish them by referring to the R2 value for the regression of figure 1.3 as the

restricted R2, written R^2_restricted in equation (1.1) below. This is because that regression is restricted to have fewer explanatory

variables. In addition, let’s designate the size of the sample in both regressions as n, the number of

explanatory variables in the regression of figure 1.1 as k, and the number of explanatory variables

that are omitted from the regression of figure 1.3 as j. The general form for the comparison that we

need is then

\[
\frac{\left(R^2 - R^2_{\text{restricted}}\right)/j}{\left(1 - R^2\right)/(n - k)} \qquad (1.1)
\]

With the relevant values from figures 1.1 and 1.3 (n = 1000, k = 9 counting the intercept term, and j = 5 omitted variables), this becomes

\[
\frac{(.1652 - .1611)/5}{(1 - .1652)/(1000 - 9)} = .9734
\]

The value for this comparison, .9734, is another F-statistic. It has its own prob-value, in this case,


.567.18

Footnote 18: This prob-value comes from Appendix table 3 for the F-statistic. We will explain how to find it in section 13.6.

This is a lot bigger than the threshold of .05 that we discussed at the beginning of this

section. Using the language there, this indicates that the ensemble of explanatory variables

representing racial or ethnic identity does not have a collectively reliable effect on the dependent

variable. In other words, the regression of figure 1.3 provides a more effective summary of the

evidence in this sample than does that of figure 1.1.
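Equation (1.1) is simple to evaluate directly. Here is a minimal sketch using only the R2 values and counts reported in the two figures; the prob-value would then be read from the F distribution with j and n − k degrees of freedom.

# R-squared values reported in figures 1.1 and 1.3.
r2_full, r2_restricted = 0.1652, 0.1611
n, k, j = 1000, 9, 5   # k counts all nine terms; j variables were omitted

f_stat = ((r2_full - r2_restricted) / j) / ((1 - r2_full) / (n - k))
print(round(f_stat, 4))   # 0.9734, as computed in the text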

Section 1.10: Have we put this together in a responsible way?

Section 1.8 describes how information contained in figure 1.1 can help evaluate the statistical

performance of the regression there. In section 1.9, we ask if the regression in that figure should be

modified by omitting some explanatory variables. The answer takes us beyond figure 1.1 because it

requires the additional regression of figure 1.3.

Here we ask more fundamental questions about the construction of the regression in figure

1.1. These questions address whether the regression analysis is constructed appropriately, in the

first place. The answers to these questions again require information that isn’t in figure 1.1. This

information would typically appear in associated text rather than any formal presentation. In this

case, the answers themselves would not take the form of precise statistical or mathematical

statements. Nevertheless, they are essential to establish the credibility of the statistical analysis.

There are three main areas of potential concern. The first is the choice of variables, or

model specification. We have already raised this issue in connection with our interpretation of the

variable for age. Our conclusion there was that the variables we would ideally like to have were

probably work experience and work effort.

Naturally, this is an example of a more general issue. It is frequently the case that relevant

explanatory variables aren’t available. We may be able to replace them, as in the case of work


experience and work effort, with a somewhat plausible proxy. In these cases, as we saw in the

example of the age variable in figure 1.1, interpretation can get complicated.

Sometimes, even a plausible proxy won’t be available. This makes interpretation even more

complicated. We’ll discuss the issue of interpretation when relevant variables are missing at length

in chapter 12.

The second area of concern is in the construction of the included variables. We’ve only had

a hint of this so far, in the repeated references to the dependent variable as representing “annual”

earnings. We might wonder whether that is the best measure of the behavior that we want to

investigate.

The issue is that annual earnings depend on both what an individual’s labor is worth and

how much labor that individual provides. The first might be measured by something like an hourly

wage, representing the value that an employer places on an hour, a single “unit”, of labor from that

individual. The second might be measured by the number of annual hours that this individual

works.

In the regression of figure 1.1, individuals who work a lot of annual hours at relatively low

wages, and individuals who work a few hours at relatively high wages, can end up with similar

values for annual earnings, the dependent variable. Consequently, this regression treats them as if

they are the same. For some purposes this may be fine.
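A minimal sketch of the arithmetic behind this concern, with two hypothetical profiles:

    # Annual earnings = hourly wage x annual hours, so very different
    # workers can be indistinguishable to the regression of figure 1.1.
    low_wage_long_hours = 15.00 * 2000   # $30,000 from many hours at a low wage
    high_wage_short_hours = 50.00 * 600  # $30,000 from few hours at a high wage
    print(low_wage_long_hours == high_wage_short_hours)  # True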

However, both of these variables might be interesting in their own right. For other

purposes, the differences between hours worked and wages paid may be important or even primary.

This implies that the purpose of the regression may have to be clarified in order to ensure that the

dependent variable is appropriate.

The construction of the explanatory variables merits the same kind of examination. We’ve

already talked about the variable for age. What about the others?

Superficially, there isn’t anything offensive about using years of schooling as a measure of

skill. At the same time, a little reflection might suggest that all years of schooling are probably not

equal. For example, employers probably take graduation as an especially important skill signal.

Therefore, school years which culminate in graduation, such as the twelfth grade, may be more

valuable than school years that don’t, such as the eleventh grade.19

19 In the empirical literature that examines returns to schooling, the additional value associated with gaining a degree is referred to as a “sheepskin effect”.

It’s also possible that different levels of schooling have different value. For example, the

difference between the productivity of an illiterate worker and a literate worker is probably very

large. This suggests that the effect of literacy on earnings may also be large. Consequently, the

return to primary school, where literacy is acquired, may be especially big.

More generally, economic theory suggests to us that the economic value of additional years

of schooling has to go down eventually as the number of years of schooling already completed goes

up. If, to the contrary, each additional year of schooling was worth more than the previous year, it

would be very hard to stop going. The evidence that almost everyone eventually does implies that,

at some point, the return to additional education has to decline. In other words, as we may have

learned in our courses on principles of economics, education is probably subject to diminishing

marginal returns.20

20 If education does not have diminishing returns, then the costs of education would have to increase even faster than the returns in order to make people finish their educations.

All of this suggests that the regression in figure 1.1 might be improved if it identified

different levels of schooling attainment separately. We’ll investigate this refinement in chapter 13.

The remaining explanatory variables in the regression of figure 1.1 represent racial and

ethnic identity. How was that done?

The data upon which the regression is based comes from the U.S. Census conducted in

2000. The Census survey allowed individuals to identify themselves as members of many different

racial groups simultaneously. In addition, they could separately designate themselves as Hispanic,

from one of many geographic heritages.21

21 The appendix to chapter 3 discusses the construction of some of our other variables in detail.


The regression of figure 1.1 identifies only five racial and ethnic identities. It assigns

individuals to each of the identities that they reported for themselves. Consequently, the regression

would assign the effects of all chosen identities to an individual who reported more than one.

This strategy makes the judgement that the more detailed variation in identity available in

the Census data isn’t useful for the purpose of the regression in figure 1.1. This, in itself, may be

incorrect. In contrast, it treats the five major non-white identity categories symmetrically. It doesn’t

make any judgements as to which of them might be most important.

However, this treatment implies that the five major categories are additive. That is, if an

individual who chooses to identify as white and another individual who chooses to identify as

Hispanic also both choose to identify as Black, the regression assigns them both the −$10,130 earnings

discount associated with the latter identity. This might not be reasonable.

For example, imagine that some of this discount is attributable to discrimination in the labor

market. The first individual probably doesn’t suffer discrimination as a consequence of identifying,

at least partially, as white. Therefore, to the extent that this individual also identifies as Black, that

individual might experience the full force of discrimination against this group.

Imagine that, in contrast, Hispanics also experience labor market discrimination. It’s

plausible that the second individual doesn’t suffer much additional discrimination as a consequence

of also identifying as Black. In this case, the assumption in the regression of figure 1.1, that all

individuals who share a particular identity are treated similarly, regardless of whether they differ in

other identifications, would be incorrect.
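To see the additivity concretely, here is a minimal sketch using two of the slopes from figure 1.1; the profiles compared are hypothetical:

    # Under the additivity assumption, each chosen identity contributes its
    # full slope, with no interaction between identities.
    SLOPE_BLACK = -10130.0
    SLOPE_HISPANIC = -2309.9

    def identity_effect(black, hispanic):
        return SLOPE_BLACK * black + SLOPE_HISPANIC * hispanic

    # Identifying as Black shifts predicted earnings by the full -$10,130,
    # regardless of any other identification:
    print(identity_effect(1, 0) - identity_effect(0, 0))  # -10130.0
    print(identity_effect(1, 1) - identity_effect(0, 1))  # -10130.0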

This discussion demonstrates that variables that seem straightforward may often embody a

great deal of concealed complexity. Alternative definitions of the same concept may yield results

that differ. This doesn’t mean that one definition would be wrong and another correct. It means that

results must be interpreted with constant reference to the specific definitions upon which they

are based.

The third area of general concern in the evaluation of regression results is the sample

design. All we know from figure 1.1 is that the sample contains observations on 1,000 individuals.

Who are they?

In the case of figure 1.1, they are all between the ages of 18 and 65, and not enrolled in

school. In other words, they are what we conventionally call working-age adults. There are two

potential problems with this. First, individuals who are older than 65 are increasingly likely to

continue to work. Second, just because individuals are working-age doesn’t mean that they’re

actually working. In fact, 210 of the ones in this sample don’t seem to be, because they report no

earnings.22

22 Exercise 1.5 presents the regression of figure 1.1 as calculated using only the subsample of individuals with non-zero earnings.

Does it make sense to exclude older workers? Probably, because Social Security and many

pension programs treat people differently once they reach the age of 65. This is not to say that these

people might not be working. The point, instead, is that the circumstances under which their

earnings are determined are sufficiently different from those of younger workers that they should

probably be analyzed separately.

Does it make sense to include non-working adults of working-age? The question of why

working-age adults have no earnings and aren’t at work is certainly interesting. However, figure 1.1

might not be the right place in which to ask it. If a purpose of the regression in figure 1.1 is to

understand why some workers are worth more to employers than are others, it doesn’t make sense

to include those who don’t work. There’s no information about what employers would be willing to

pay them if they did.

As with variable definitions, the point here is not that one sample design is correct and

another is not. The point is rather that different sample designs support answers to different

questions. The issue when reviewing results such as those in figure 1.1 is whether the chosen

sample design answers the question at issue.


Section 1.11: Do regressions always look like this?

Not exactly. Regression presentations always have to convey the same information, but it’s not

always, or even most frequently, organized in the form of figure 1.1. Table 1.1 presents the same

information in a format that is more common:

Table 1.1

Our first regression, revisited

Dependent variable: Annual earnings

Explanatory variable                       Slope     t-statistic
Intercept                                  -19427    -2.65
Years of school                            3624.3    9.45
Age                                        378.60    3.51
Female                                     -17847    -7.17
Black                                      -10130    -2.02
Hispanic                                   -2309.9   -.707
American Indian or Alaskan Native          -8063.9   -.644
Asian                                      -4035.6   -.968
Native Hawaiian, other Pacific Islander    -3919.1   -.199

R2            .165
Adjusted R2   .158
F-statistic   24.5, prob-value <.0001
Observations  1000

As we can see, table 1.1 makes no attempt to represent everything as an explicit equation.

However, all of the information that was in figure 1.1 is reproduced here. It’s just arranged

differently.

The explanatory variables are listed in the leftmost column. Each row contains all of the

information for that explanatory variable. The second column reports the associated slope, and the

last column the t-statistic for that slope.


Table 1.2

Our first regression, revisited for the second time

Dependent variable: Annual earnings

Explanatory variable                       Slope        Standard error
Intercept                                  -19427. *    7331.8
Years of school                            3624.3 *     383.66
Age                                        378.60 *     107.84
Female                                     -17847 *     2488.3
Black                                      -10130 *     5021.2
Hispanic                                   -2309.9      3266.7
American Indian or Alaskan Native          -8063.9      12526.
Asian                                      -4035.6      4169.7
Native Hawaiian, other Pacific Islander    -3919.1      19671.

R2            .165
Adjusted R2   .158
F-statistic   24.5, prob-value <.0001
Observations  1000

Note: * indicates that the slope is statistically significant.

Table 1.2 is an example of a second common format. It’s identical to table 1.1, except that

the last column presents standard errors instead of t-statistics or prob-values. This is often less

useful because the standard errors are only part of the t-statistics of table 1.1. Chapter 3 will explain

what standard errors are, and chapter 6 will explain how they are related to statistical significance.

Chapter 6 will also discuss the one advantage of this presentation.
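Anticipating that relationship: a t-statistic is simply the slope divided by its standard error. A minimal sketch, using the first three rows of tables 1.1 and 1.2:

    # Dividing each slope in table 1.2 by its standard error recovers the
    # corresponding t-statistic in table 1.1.
    rows = [("Intercept", -19427.0, 7331.8),
            ("Years of school", 3624.3, 383.66),
            ("Age", 378.60, 107.84)]
    for name, slope, se in rows:
        print(name, round(slope / se, 2))  # -2.65, 9.45, 3.51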

Nevertheless, the format in table 1.2 doesn’t immediately indicate which explanatory

variables have slopes that are statistically significant. Instead, presentations of this type usually add

asterisk superscripts next to slopes that achieve this level of reliability, as indicated in the note to

table 1.2.


Table 1.3

Our first regression, revisited for the third time

Dependent variable: Annual earnings

Explanatory variable                       Slope     prob-value
Intercept                                  -19427    .0082
Years of school                            3624.3    <.0001
Age                                        378.60    .0005
Female                                     -17847    <.0001
Black                                      -10130    .0439
Hispanic                                   -2309.9   .480
American Indian or Alaskan Native          -8063.9   .520
Asian                                      -4035.6   .333
Native Hawaiian, other Pacific Islander    -3919.1   .842

R2            .165
Adjusted R2   .158
F-statistic   24.5, prob-value <.0001
Observations  1000

Table 1.3 presents an alternative format that is becoming more popular. It’s again similar to

table 1.1, except that the last column in table 1.3 reports the prob-value associated with the t-

statistic for each explanatory variable rather than the t-statistic itself. The standard for interpreting

these prob-values is the same as the one we discussed in section 1.8 with regard to F-statistics:

values of .05 or less indicate unambiguous statistical significance.

Comparing tables 1.1 and 1.3, we notice that all of the slopes that have t-statistics in excess

of approximately two in table 1.1 have prob-values that are less than .05 in table 1.3. Conversely,

none of the slopes that have t-statistics that are less than approximately two in table 1.1 have prob-

values that are less than .05 in table 1.3.

This consistency is no accident. As chapter 6 will demonstrate, the t-statistic and its prob-


value are just different ways to say the same thing. The prob-value is actually the more informative

way, because the threshold of .05 is absolute. In contrast, the exact value for the t-statistic that

corresponds to this prob-value varies a bit from regression to regression, depending on the numbers

of observations and explanatory variables. For this reason, presentations like table 1.1 may add the

notations that we saw in table 1.2, specifically identifying the explanatory variables whose slopes

are statistically significant.
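A minimal sketch of this equivalence in Python, assuming scipy is available and using n − k = 991 degrees of freedom, the same denominator as in equation (1.1):

    # Converting the t-statistics of table 1.1 into the prob-values of
    # table 1.3: the two-sided tail area of the t distribution.
    from scipy import stats

    df = 1000 - 9
    for name, t in [("Intercept", -2.65), ("Black", -2.02), ("Hispanic", -0.707)]:
        prob_value = 2 * stats.t.sf(abs(t), df)
        print(name, round(prob_value, 4))
    # Approximately .0082, .0437 and .4797; the small differences from
    # table 1.3 reflect rounding of the reported t-statistics.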

Section 1.12: How to read this book

Was this fun? Let’s hope so. This book aspires to keep us interested in the material. If it succeeds,

we won’t have to worry about study strategies. We’ll read because we want to.

This is, of course, somewhat ambitious for an econometrics text. If it doesn’t rise to the

level of entertainment, we’ll have to treat it more like work. From that perspective, it’s inefficient to

read many pages in a single sitting. The material is too dense to absorb in large doses. Anyone who

claims to have read a chapter straight through probably can’t identify a single thing that occurred in

the last third of it. Moreover, they’re probably not fit for much else unless they’ve had a nap.

Instead, we should plan to read one section at a time. But we should really read them. That

doesn’t mean just noticing the words. That means examining the equations closely, understanding

how the accompanying text describes the equations, verifying the derivations by returning to the

precursor equations, and, most importantly, working through the associated exercises.

Then we take a break. We’ll need to read slightly more than one chapter per week in order

to complete the text in a semester. Depending on the chapter, this could mean from one to three

sections a day. This shouldn’t be very burdensome. Done carefully, as described here, this will

train us to be relatively high-functioning applied econometricians in commercial and research

applications by the time the semester is over. Or, if we’re so inclined, we’ll be ready to take a more

advanced course in econometrics.

In other words, the experience of reading this book is not intended to replicate that of


reading a work of fiction for leisure purposes. In that situation we usually read the book straight

through. In this situation, if we’re only turning the right-hand page, we’re probably not learning it

well enough. If, instead, we find ourselves going backwards nearly as often as we go forwards, we

know we’re really studying.

This may sound like an arduous process. It is. But the brain is a muscle. It only gets smarter

through exercise. We don’t go to the gym, sniff the air, stroke the weights, and announce that we’ve

had a workout. Similarly, we don’t casually leaf through an econometrics text and announce that

we’ve been educated.

Section 1.13: Conclusion

This chapter has covered a lot of territory. We should now be able to interpret regression results

with some degree of sophistication. Moreover, it should be apparent that econometrics allows us

to assess how much one quantity responds to changes in another, how reliable that response is,

and whether that response, as revealed in observed experience, conforms to or contradicts our

expectations. Section 1.0, “What we need to know when we finish this chapter”, provides a

convenient summary of the most important points.

At the same time, we know nothing of how the numbers in figure 1.1 are calculated, or why

the interpretations that we have offered are valid. That should make us cautious about practicing our

newly-learned skills too aggressively. The purpose of the rest of this text is to put us in a position to

know not only why the interpretations in this chapter may be appropriate, but also to recognize the

circumstances under which they might not be, and perhaps to remedy them.


Exercises

1.1 Section 1.5 asserts that, based on the regression in figure 1.1, “Two otherwise similar

individuals would have to differ in age by about 27 years in order to achieve an earnings

differential similar to that between an otherwise similar Black and white individual of the

same age.”

a. Reproduce this calculation, based on the slopes in figure 1.1.

b. How many years older would a woman have to be in order to have the same income

as an otherwise similar man?

1.2 Suppose that the value of the slope on years of schooling in figure 1.1 was different.

a. Imagine that this slope was negative and statistically significant. What would that

indicate about the effects of education on earnings? Would this be surprising?

Disturbing? Why or why not?

b. Imagine that this slope was statistically significant and had a value of $10,000. What

would this imply about the earnings of a typical high school graduate? A typical

college graduate? Would this be surprising? Disturbing? Why or why not?

1.3 Compare the slopes and t-statistics in the regression of figure 1.3 to the slopes for the same

explanatory variables in the regression of figure 1.1.

a. Does the statistical significance of any of the explanatory variables in the regression

of figure 1.3 differ importantly from that of the same explanatory variable in the

regression of figure 1.1? If yes, which ones and why? If no, why not?

b. Do any of the statistically significant slopes in figure 1.3 differ substantively from

the corresponding slope in figure 1.1? If yes, which ones and why? If no, why not?


Table 1.4

Our first regression, revised

Dependent variable: Annual earnings

Explanatory variable                       Slope     t-statistic
Intercept                                  -26057.   3.19
Years of school                            3596.1    9.38
Age for males                              558.03    3.82
Age for females                            177.34    1.15
Female                                     -2537.4   .29
Black                                      -10194.   2.03
Hispanic                                   -2636.1   .81
American Indian or Alaskan Native          -8001.8   .64
Asian                                      -4298.6   1.03
Native Hawaiian, other Pacific Islander    -5374.5   .27

R2            .168
Adjusted R2   .160
F-statistic   22.2, prob-value <.0001
Observations  1000

1.4 Table 1.4 presents a revised version of the regression first presented in figure 1.1. In this

revision, the single explanatory variable for age is replaced by two variables, one measuring

age for men and the other measuring age for women.

a. Are the effects of age for men and women on earnings each statistically significant?

Why or why not?

b. Interpret and explain the magnitude of the effect of age for men on earnings.

c. Interpret and explain the magnitude of the effect of age for women on earnings.

d. Does the effect on earnings of age differ for men and for women? In what way?

Why might this be so?


e. What has happened to the slope associated with the female variable in table 1.4

compared to the same slope in table 1.1? Might this be related to the difference

between the effect of age in table 1.1 and the effect of age for females in table 1.4?

If yes, how? If no, why not?

Table 1.5

Our first regression, revised for the second time

Dependent variable: Annual earnings

Explanatory variable                       Slope     t-statistic
Intercept                                  -31973.   3.72
Years of school                            4036.7    8.88
Age                                        720.42    5.48
Female                                     -17021.   5.83
Black                                      -11396.   1.88
Hispanic                                   -3192.1   .85
American Indian or Alaskan Native          -7825.3   .54
Asian                                      -3063.6   .63
Native Hawaiian, other Pacific Islander    -8757.1   .37

R2            .189
Adjusted R2   .181
F-statistic   22.8, prob-value <.0001
Observations  790

1.5 Table 1.5 reproduces the regression of figure 1.1, with the exception that the sample for

table 1.5 omits all individuals with zero earnings.

a. Does the change in sample substantially change the statistical significance of any of

the explanatory variables?

b. Does the change in sample substantially change the magnitude of any of the

statistically significant explanatory variables?


c. What can we conclude about the effect of including individuals with zero earnings

on the regression of figure 1.1?

Figure 1.4

Our second regression

Rent = 155.61 × Intercept (1.76)

     + 80.223 × Number of rooms (9.69)

     + 292.99 × Unit is a single-family structure (3.50)

     + 292.92 × Unit is in an apartment building (3.51)

     + 257.19 × Unit was built in 1999 or 2000 (2.27)

     + 200.72 × Unit was built between 1995 and 1998 (2.50)

     + 130.83 × Unit was built between 1990 and 1994 (2.35)

     + 54.495 × Occupant moved in after 1998 (2.09)

     − 118.48 × Lot is one acre or larger (−1.60)

R2 =.1362

Adjusted R2 =.1292

F-statistic=19.5, prob-value <.0001

Observations=1000

Note: Parentheses contain t-statistics.


1.6 Figure 1.4 presents the results of a new regression. The observations are dwelling units in

California that are occupied by renters or available for rent. The dependent variable is the

monthly rent for the unit.

a. Which explanatory variables are statistically significant?

b. What do the magnitudes of the slopes indicate regarding the effects of each

explanatory variable?

c. Are there any explanatory variables that have statistically significant slopes, but

estimated effects that are substantively unimportant? If yes, which ones and why?

d. Examine each explanatory variable in turn. Is there a good reason why that

variable might determine housing rents? If yes, what is it? If no, why not? Are there

any questions regarding the way each variable might be measured?

e. Are there other variables, not included in the regression of figure 1.4, that may have

important effects on rent? If yes, what are they?

f. Based on the description above, are the observations in the sample appropriate for

the purpose of identifying the determinants of rents? If not, why not, and how

might we redesign the sample in order to address this question more successfully?