
Chapter 5 - Introduction to Probability and Statistics∗

Simona Helmsmueller†

These lecture notes are meant to be used by students entering the University of Mannheim

Master program in Economics. They constitute the base for a pre-course in mathematics;

that is, they summarize elementary concepts with which all of our econ grad students must

be familiar. More advanced concepts will be introduced later on in the regular coursework.

A thorough knowledge of these basic notions will be assumed in later

coursework.

Although the wording is my own, the definitions of concepts and the ways to approach them

is strongly inspired by various sources, which are mentioned explicitly in the text or at the

end of the chapter.

Simona Helmsmueller

∗ This version: 2017.
† Center for Doctoral Studies in Economic and Social Sciences.


Contents

1 Introduction

2 Probability theory

2.1 Introduction

2.2 Random variables, distributions and their features

2.3 Joint variables, distributions and their features

3 Statistics

3.1 Introduction

3.2 Finite sample properties

3.3 Large sample properties


Preview

What you should take away from this chapter:

1. Introduction

• You should grasp the importance of stochastics and statistics in all areas of economic science.

2. Probability theory

• You should know what a random variable, its distribution function and corresponding density are, and know the calculation rules that apply.

• You should be able to draw some pictures illustrating cdfs and their densities,

and be able to show how the picture changes with changing mean and variance.

• You should know the definition and properties of mean and variance.

• You should have a good intuition for the terms joint distribution, marginal distribution and conditional distribution.

• You should be able to check for independence of two variables, and be able to

calculate the covariance and correlation between the two.

3. Statistics

• You should understand that statistics is about samples and estimation of some

quantities of interest (in our case: parameters of the distribution function).

• You should know the difference between the terms population variance, variance

estimator, and variance of the estimator.

• You should know criteria for assessing the quality of an estimator.

• You should have a graphical and analytical understanding of the terms bias, variance and consistency of an estimator.

• You should have an intuition and a graphical understanding of the law of large numbers and the central limit theorem, and be able to write both of them down in your sleep!


1 Introduction

Modern economics is the study of uncertain behavior in response to uncertain events under incomplete information. Understanding this uncertainty, or randomness, and the mathematical concepts

associated with it is therefore an integral prerequisite for any economist. I could easily have

filled another week of preparatory classes on stochastics and statistics. However, we do not

have the time for this, and I am therefore forced to reduce this to a very short introduction of

the key concepts. The following leans heavily on Appendices B and C of Wooldridge's Introductory Econometrics: A Modern Approach (2013). All graphs included in these lecture

notes are taken from the book. I strongly urge you to read the roughly 80 pages in the book.

They are a relatively easy read and a thorough understanding of the concepts laid out there

will save you a lot of future trouble. This is not only valid for the following econometric

classes but also for any micro and macro lecture. I doubt you will have any lecture where

no expected value pops up at least once. Making sure you understand the language of these

classes will allow you to focus on the plot!

Here are a couple of examples where probability theory or statistics come into play:

1. In a duopoly, two companies A and B set their prices at the same time. Whether or not

company A gets most (or all) of the customers depends not only on the price of that

company but also on the price the competitor offers. The price of company B on the

other hand depends on that company’s production cost and pricing strategy. While

company A might have some information about the production technology of company

B (e.g. by considering the historic prices or some insider knowledge), some uncertainty

remains. For its own pricing strategy, company A therefore considers different scenarios

and, as a rational agent, will decide for the one which promises the highest expected

revenues.

2. A state considering raising its unemployment benefits will also consider the impact this has on the labor market. Will wages rise? Will the required higher tax rate to

finance the benefits lead to less consumption? To answer these questions, the state has

to form expectations about the behavior of firms and job-seekers.

3. A consumer considering buying insurance needs to weigh the policy's premium against the expected loss incurred in the adverse event.

4. An economist wants to know whether some policy intervention, e.g. the introduction

of a mandatory health insurance scheme, really has an impact on the beneficiaries’

lives, e.g. on their health status. Because we cannot compare the health status of

the insured people with the health status of the same people but in a parallel universe

without the insurance, we must rely on estimation techniques by first finding a suitable

control group, e.g. the same population before the introduction of a health insurance,

or the population of a different state. Also, the impact is likely to be different for


different people, e.g. depending on their pre-insurance health status, their life style or

available health infrastructure. Instead of estimating an effect for each individual, we compare means over samples from the treatment and control populations and use statistics to infer causally about the impact of the insurance on the health status.

2 Probability theory

2.1 Introduction

Probability theory lays the foundation for statistical analysis by providing the formal mathematical framework. If we were to introduce this framework rigorously, I would need to start

with an introduction to measure theory. This is a beautiful subject, with unexpected twists

and turns and if you ever feel that you want to dig deeper into higher mathematics, I highly

recommend you get a book on measure theory. Now, however, this preparatory course is

restricted in time and therefore, I will only be able to give you a very brief introduction

into probability theory, mainly consisting of definitions and examples, and leaving out the

general concepts and proofs.

Let me, however, try to give you some intuition on measures and their importance for

probability theory. In mathematics, a measure is nothing else than a function describing

some sort of magnitude of mathematical objects, here: sets. Analogously to our derivation of the definition of vector spaces, there are some intuitive criteria which you would require such a measure function to fulfill. For example, it would be reasonable to require that the measure function maps into the nonnegative real numbers, and that if there is nothing inside the

considered object (whatever that may mean), then its measure is 0. You might also want to

ensure that an object which is fully contained within another object has a smaller measure.

There are other things to consider: what would you expect from the measure of the union

of two (disjoint) objects? Can sets that contain elements also have a zero measure? Can the

measure take the value infinity and under what circumstances could this be the case?1

To illustrate this further, let us consider some examples. If the mathematical objects

considered are closed intervals [a, b], a measure could be the length b−a and you might want

to require that this is the same for an open interval (a, b). As another example, if the sets

you consider are countable, then it is straightforward to let the measure function assign each

set the number of elements in it. These examples might sound trivial, but measure theory

extends these basic notions to more general cases. It can even be shown that there exist subsets of the real numbers to which one cannot assign a measure with the desired properties described above.

Why is this important to probability theory? Because it sets the frame for the formal

definition of what one calls probability in everyday language. This is in fact a highly critical,

1 The questions asked here are similar to those used for introducing metrics and norms on vector spaces. The difference here is that we consider general sets without algebraic structure, i.e. addition and scalar multiplication might not be defined on the set's elements.


even philosophical question. We use terms like probability and likelihood many times each

day and we might feel that it is somehow associated with relative frequencies. So why bother

about a formal definition?

Consider the following question:

In the US, an individual has been described by a neighbor as follows: Steve

is very shy and withdrawn, invariably helpful but with very little interest

in people or in the world of reality. A meek and tidy soul, he has a need

for order and structure, and a passion for detail. Is Steve more likely to

be a librarian or a farmer?

As Kahneman and Tversky observe, the majority of respondents assign a higher likelihood

to Steve being a librarian. We do not take into consideration the relative frequencies: in

the US there are five times more farmers than librarians - and this ratio is even higher for

male farmers compared to male librarians. In the absence of a formal concept, we are liable to be misled by unrelated information.

Now look at the following story known as the Linda problem by Kahneman and Tversky.

Answer quickly what comes to your mind!

Linda is thirty-one years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations... Please rank in order of likelihood various scenarios: Linda is

1. an elementary school teacher,

2. active in the feminist movement,

3. a bank teller,

4. an insurance salesperson, or

5. a bank teller also active in the feminist movement.

Independent of the other rankings, most respondents assign (5) a higher rank than (3). Yet there are certainly more bank tellers than bank tellers who are also active in the feminist movement, as the latter group is a proper subset of the former!

These examples show that it is easy to go astray in a world with a lot of information if we

are asked to navigate through this solely with our intuition. Mathematical definitions can

illuminate patterns in this information and this might help us decide on which information

to take seriously and which to discard. This is why we need some probability theory to

successfully master statistics!2

2 The examples also quite clearly highlight why the rational agent models used in classical economics are not always a good depiction of reality. Behavioral economics provides an alternative, and if you are interested in this I recommend you read up on prospect theory by Kahneman and Tversky. Also, Kahneman's book Thinking, Fast and Slow is an entertaining and informative read on this strand of economics. Nevertheless, it is unwise to criticize something you do not firmly understand - so make sure you pay attention to expected utility theory in your micro lecture. It does have its merits and applications in the real world!


Nevertheless, a typical procedure in econometric classes is to skip the definition of a prob-

ability measure and start with the definition of probability distributions, where probability

is again taken to be a term that needs no further definition. Let me at this stage at least

provide you with the formal definition of a general probability measure. It is more general in that it covers the discrete and the continuous as well as any mixed case.

Definition 2.1. (Probability measure)

Let S be a set and P(S) its power set, i.e. the set of all subsets of S. A function P from P(S) to [0, 1] is called a probability measure if

1. Non-negativity: for all A ∈ P(S): P(A) ≥ 0.

2. Certain event: P(S) = 1.

3. Countable additivity: for all countable collections {S_i}_{i=1}^∞ of pairwise disjoint sets in P(S):

P(∪_{k=1}^∞ S_k) = ∑_{k=1}^∞ P(S_k).

You should confirm that the intuitive meaning of probability as a limit of relative frequencies fulfills this definition. A probability measure defined as such has a useful additional

property: If A ⊂ B, then P (A) ≤ P (B). As an exercise, you should prove this using the

above axioms! This property is what we would have needed to avoid the pitfall in the Linda

problem.
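(If you get stuck with the exercise, here is a sketch: write B = A ∪ (B \ A). The two sets are disjoint, so additivity gives P(B) = P(A) + P(B \ A), and P(B \ A) ≥ 0 by non-negativity, hence P(B) ≥ P(A).)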

2.2 Random variables, distributions and their features

Having in mind that there is a formal definition of a probability measure which includes but

generalizes our intuitive understanding, we now proceed as in most econometric classes. A

crucial concept is that of a random variable and this can be quite confusing. I copy from

wikipedia to show you what this term encompasses:

In probability and statistics, a random variable [...] is a variable whose value

is subject to variations due to chance [...]. A random variable can take on a set

of possible different values (similarly to other mathematical variables), each with

an associated probability, in contrast to other mathematical variables.

A random variable’s possible values might represent the possible outcomes

of a yet-to-be-performed experiment, or the possible outcomes of a past exper-

iment whose already-existing value is uncertain (for example, due to imprecise


measurements [...]). They may also conceptually represent either the results of an "objectively" random process (such as rolling a die) or the "subjective" randomness that results from incomplete knowledge of a quantity.

A more formal definition would again require some measure theory and then we would

introduce a random variable as a measurable function. Although you do not know the

meaning of the term measurable, it might be worthwhile to remember that a random variable

is a function (usually) mapping into the real numbers! It is not a set and it is also not the

probability measure. Instead, it is a function X that takes a possible outcome and assigns to it a value on the real line describing some numerical property that this outcome may have. The important property is that we can assign a probability to statements about the values of the random variable, i.e., it makes sense to write something like P(X < 2), by which we mean

P ({ω : X(ω) < 2}). As an example, consider as outcome the gender of a newborn baby.

The set of possible outcomes is { boy, girl }. A random variable could be defined as follows:

X(ω) = 1 if ω = boy, and X(ω) = 0 if ω = girl.

Why would we need this? Because on the real line we can add, subtract and multiply

at will, whereas this might not be defined (and more lengthy in notation) on the set of

possible outcomes. As such, we can define another random variable Y which looks at the genders of two newborn babies by defining Y((ω1, ω2)) = X(ω1) + X(ω2); this gives you the number of boys in the pair. And then we can concisely write P(Y ≥ 1) instead of P({(boy, girl), (girl, boy), (boy, boy)}) or, even less concisely, "the probability that at least one child is a boy".
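To make this concrete, here is a minimal Python sketch (my own illustration, not part of the original notes) that enumerates the four equally likely gender pairs and computes P(Y ≥ 1):

```python
from itertools import product

# Outcome space for two newborns: all ordered pairs of genders,
# assumed equally likely (probability 1/4 each).
outcomes = list(product(["boy", "girl"], repeat=2))

def X(omega):
    # Random variable for one child: 1 if boy, 0 if girl.
    return 1 if omega == "boy" else 0

def Y(pair):
    # Number of boys among the two children.
    return X(pair[0]) + X(pair[1])

# P(Y >= 1): share of outcome pairs with at least one boy.
p = sum(1 for pair in outcomes if Y(pair) >= 1) / len(outcomes)
print(p)  # 0.75
```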

This last sort of quantity arises so often as object of study that we give it its own name

and definition:

Definition 2.2. (Cumulative distribution function)

The CDF of a real-valued random variable X is the function F_X which assigns to each x the probability that X takes a value less than or equal to x:

F_X : R → [0, 1], x ↦ P(X ≤ x).

Theorem 2.1. (Properties of the CDF and definition density)

For any CDF we have the following properties:

1. P(a < X ≤ b) = F_X(b) − F_X(a).

2. F_X is non-decreasing.

3. F_X is continuous from the right.


4. If X is a discrete random variable attaining the values x_i with probability p_i, then

F_X(x) = ∑_{i: x_i ≤ x} p_i.

5. If F_X is differentiable, then the derivative f_X is called the probability density function, and it holds that

• f_X(x) ≥ 0 for all x,

• ∫_{−∞}^{∞} f_X(x) dx = 1,

• P(a < X ≤ b) = F_X(b) − F_X(a) = ∫_a^b f_X(x) dx.

The following figure illustrates the last property:

Figure 1: The probability that X lies between a and b (Source: Wooldridge)
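As a quick numerical sanity check of this last property (a sketch of mine, using the standard normal as the example distribution), the CDF difference and the integral of the density agree:

```python
from scipy.stats import norm
from scipy.integrate import quad

a, b = -1.0, 2.0

# P(a < X <= b) via the CDF ...
p_cdf = norm.cdf(b) - norm.cdf(a)

# ... and via numerical integration of the density.
p_int, _ = quad(norm.pdf, a, b)

print(p_cdf, p_int)  # both approximately 0.8186
```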

Distribution functions hold a lot of information. Some of this information can be aggregated into single numbers which allow us to compare two different random variables. These

numbers are measures of location (e.g. mean, median) and dispersion (e.g. variance).

Definition 2.3. (Mean)

If the random variable X is discrete (i.e. has a discrete CDF) and takes on the values x_i with probability p_i, then the mean or expected value is defined as the weighted average over the outcome set with weights equal to the probabilities p_i:

µ = E(X) = ∑ x_i p_i.


If X has a continuous distribution with density function f, then we define the mean or expected value by

µ = E(X) = ∫_{−∞}^{∞} x f(x) dx.

Remark 1. It follows from the definition in the discrete case that the expected value of rolling a die is

(1/6)·1 + (1/6)·2 + (1/6)·3 + (1/6)·4 + (1/6)·5 + (1/6)·6 = 3.5.

Yet, the value 3.5 will never be the result of a single roll of the die. This example illustrates that the expected value need not be the value that you expect with the greatest probability (in terms of econometrics: it need not be the maximum likelihood estimator)!

Remark 2. In the notation, E is the expectation operator. As explained in a previous chapter,

an operator is a function that takes as argument another function and maps it (in this case)

into R. This highlights again that a random variable is a function!

Theorem 2.2. (Properties of the mean)

Let X and Y be two random variables defined on the same outcome space and let a, b, c ∈ R.

Then

1. E(c) = c

2. If X ≤ Y , then E(X) ≤ E(Y ).

3. E(aX + bY ) = aE(X) + bE(Y ).

Proof. All results follow directly from the calculation rules for sums and integrals, see chapter

3 (Multivariate calculus).

A further central measure is the variance, which tells us how far a random variable is on

average from the mean.

Definition 2.4. (Variance and standard deviation)

The variance σ² of a CDF is defined as follows:

σ² = E[(X − µ)²],

where µ is the expected value of the CDF. It follows that for a discrete distribution function we have

σ² = ∑ (x_i − µ)² p_i,

and in the continuous case

σ² = ∫ (x − µ)² f(x) dx.

The standard deviation σ is the square root of the variance.

The graph below illustrates the definition.

The following is a very useful characterization of the variance:


Figure 2: Random variables with the same mean but different variance (Source: Wooldridge, figure B.4)

Theorem 2.3. (Characterization of variance)

For any random variable X with E(X²) < ∞, it holds that

σ²(X) = E(X²) − (E(X))².
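The proof is a one-line expansion worth knowing: by linearity of the expectation and with µ = E(X),

σ²(X) = E[(X − µ)²] = E[X² − 2µX + µ²] = E(X²) − 2µE(X) + µ² = E(X²) − µ².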

Theorem 2.4. (Properties of the variance)

1. σ²(X) = 0 ⇔ ∃ c ∈ R: P(X = c) = 1.

2. For any a, b ∈ R: σ²(aX + b) = a²σ²(X) and σ(aX + b) = |a|σ(X).

2.3 Joint variables, distributions and their features

In econometrics as in other subjects applying statistical methods, one usually is confronted

with more than one random variable. For example, in health economics you might be

interested in the probability that a person visits a hospital conditional on that person being

insured.

Definition 2.5. (Joint distribution of two random variables)

Let X and Y be discrete random variables which take on the values x_i and y_j respectively. Then the joint distribution is given by p_ij = P(X = x_i, Y = y_j) and the corresponding CDF is described by

F_{X,Y}(x, y) = P(X ≤ x, Y ≤ y) = ∑_{(i,j): x_i ≤ x, y_j ≤ y} p_ij.

If X and Y are continuous variables, then their joint cdf is also defined as

F_{X,Y}(x, y) = P(X ≤ x, Y ≤ y),


and if the mixed second derivative of this function exists, then the corresponding joint density is defined by

f_{X,Y}(x, y) = ∂²F(x, y) / ∂x∂y.

Remark 1. As in the one-variable case, the joint density has the properties that f(x, y) ≥ 0 and ∫∫ f(x, y) dx dy = 1 (remember Fubini for evaluating such double integrals?).

Remark 2. Two important related concepts are those of the marginal distribution and the conditional distribution. Here I aim only to explain the concept: imagine you have two

random variables, e.g. X measuring the sex of a person and Y the height of a person. The

marginal distribution of Y then describes the distribution of height in the whole population,

and the marginal distribution of X the distribution of sex in the whole population. In

contrast, the conditional distribution gives the distribution of one variable contingent on the

value of the other variable. For example, the distribution of Y conditional on X = 1 gives the

distribution of height amongst the male population. Marginal and conditional distributions

can coincide, and then the random variables are said to be independent.

Definition 2.6. (Independence)

Two discrete random variables X and Y are said to be independent if for all (x, y) it holds

that

P (X = x, Y = y) = P (X = x)P (Y = y).

For continuous distributions, this translates into

f_{X,Y}(x, y) = f_X(x) f_Y(y).
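The following small sketch (my own, with a made-up joint pmf) shows how marginals are obtained from a joint distribution and how the product criterion for independence can be checked cell by cell:

```python
import numpy as np

# A made-up joint pmf for X (rows) and Y (columns); entries sum to 1.
p_xy = np.array([[0.12, 0.18],
                 [0.28, 0.42]])

# Marginal distributions: sum out the other variable.
p_x = p_xy.sum(axis=1)  # P(X = x_i)
p_y = p_xy.sum(axis=0)  # P(Y = y_j)

# Independence requires P(X = x_i, Y = y_j) = P(X = x_i) P(Y = y_j)
# in every cell of the table.
independent = np.allclose(p_xy, np.outer(p_x, p_y))
print(independent)  # True for this particular table
```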

Let us conclude this section with one more important definition, which defines a measure

of how, on average, two variables vary with each other.

Definition 2.7. (Covariance and correlation)

The covariance of two random variables X and Y is defined as

Cov(X, Y) = E[(X − µ_X)(Y − µ_Y)],

where µ_X, µ_Y are the means of the random variables. In the continuous case, this is equal to

∫∫ (x − µ_X)(y − µ_Y) f_{X,Y}(x, y) dx dy.

The correlation coefficient between X and Y is then given by

ρ_{X,Y} = Cov(X, Y) / (σ_X σ_Y),

where σ_X, σ_Y are the standard deviations of the variables.


Remark 3. To get an intuition for the concept of covariance, it is useful to consider the following: let Cov(X, Y) > 0. Then, if X is above its mean, then, on average, Y is also above its mean, and vice versa (think: height and weight of people). This relationship is also often formulated as the covariance measuring the amount of linear relationship between the two variables. A positive covariance shows that the two variables move, on average, in the same direction, and a negative covariance shows that the two move in opposite directions.

Theorem 2.5. (Properties of covariance and correlation)

Let X, Y be two random variables with means µ_X, µ_Y.

1. Cov(X, Y) = E(XY) − µ_X µ_Y.

2. X, Y independent ⇒ Cov(X, Y) = 0 (the converse is not true!).

3. Cov(a_1X + b_1, a_2Y + b_2) = a_1 a_2 Cov(X, Y) and, for a_1 a_2 > 0, ρ_{a_1X+b_1, a_2Y+b_2} = ρ_{X,Y}.

4. −1 ≤ ρ_{X,Y} ≤ 1, and if |ρ_{X,Y}| = 1, then there exist a, b such that Y = a + bX.

Theorem 2.6. (Variance and covariance)

For two random variables X, Y and numbers a, b ∈ R it holds that

Var(aX + bY) = a² Var(X) + b² Var(Y) + 2ab Cov(X, Y).
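This follows by expanding the definition. Write Z = aX + bY, so that µ_Z = aµ_X + bµ_Y and

Var(Z) = E[(a(X − µ_X) + b(Y − µ_Y))²] = a² E[(X − µ_X)²] + b² E[(Y − µ_Y)²] + 2ab E[(X − µ_X)(Y − µ_Y)],

which is exactly a² Var(X) + b² Var(Y) + 2ab Cov(X, Y).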

Remark 4. Further very elementary concepts are those of the conditional mean and the conditional variance. Make sure you read up on these if you are unsure about the concepts.

3 Statistics

3.1 Introduction

Imagine you want to know what the average income of people in Germany is. It would be

quite cumbersome to ask every person individually. You are better off conducting a survey,

in which you only ask a number of people (your sample) about their income. You hope that

the information obtained from your sample is somehow representative of all of Germany.

For example, you might want to infer from the average income of your sample to the mean

income of all people in Germany.

What you are in fact doing is a parameter estimation, which can be modeled as follows:

There is a true distribution of income in Germany, called fθ. You must assume some func-

tional form of your distribution, e.g. that it is a normal distribution, but you do not know

the parameter θ, which could for example be θ = (µ, σ²), the mean and variance of the normal

distribution. Instead, your aim is to estimate θ based on the observations y1, .., yn in your

sample, which are realizations of the random variable ”income”.

A straightforward approach is the so-called method of moments. If you can express θ as a

function of population moments, then you would simply replace them by the sample moments


and obtain your estimate θ̂. In the above example, you could estimate the mean income in

Germany by the average income in your sample. There are other methods, such as least

squares and maximum likelihood estimation, which you will encounter in your econometrics

class. You could also define the income of person number 10 in your sample as your estimate.

Would this be a good choice? It is the aim of this section to find criteria which we expect a

good estimator to fulfill.
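As an illustration (my own sketch, with made-up log-normally distributed incomes; the functional form is purely an assumption for the example), the method of moments amounts to two lines of code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up income data for illustration: log-normal draws (an assumption,
# not part of the notes; real incomes are of course not handed to us).
incomes = rng.lognormal(mean=10.0, sigma=0.5, size=1_000)

# Method of moments for theta = (mu, sigma^2) of the income distribution:
# replace the population moments by the corresponding sample moments.
mu_hat = incomes.mean()                        # first sample moment
sigma2_hat = ((incomes - mu_hat) ** 2).mean()  # second central sample moment

print(mu_hat, sigma2_hat)
```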

Now, imagine you conduct the above described survey on income multiple times, each

time drawing a new sample but asking the same question. Do you think you would also

get the same average income? Probably not. Your estimate is by itself a random variable

because its value depends on the sample you draw. Therefore, we can also look at the sampling distribution of the sample mean, the sampling variance of the mean, and other properties.

The graph below nicely illustrates the concept.

Figure 3: Distribution of variable y from sample realizations and the sampling distributionof the mean (Source: Groves et al.: Survey methodology (2nd edition), figure 4.2)

Warning: It is easy to get confused with all the different variances. There are at least three types: 1. the unknown variance σ² of the population, which we might like to estimate; 2. the estimate σ̂², which we obtain from the one sample we drew; 3. the unknown variance of the distribution of our estimator, which describes how the estimate would change if we were to draw the sample again and again. Clearly distinguishing between all of these will make it a lot easier to understand the difference between the concepts in the next two subsections.

In the terms of econometrics, we consider a random sample as a series of random variables

Y1, ..., Yn which are independent and identically distributed (i.i.d.). As an example, think of

rolling a die n times. Typically, the actually measured outcomes are denoted in small letters:

y1, ..., yn. These are the realizations of the random variables. We are interested in some

parameter θ0 of the unknown distribution of the Yi and we estimate this by applying an


estimator:

θ̂ = h(Y1, ..., Yn).

Once we plug the observed values y1, ..., yn into the function h, we obtain the estimate of θ0. As discussed in the above chapter, applying a function to random variables delivers another random variable with a new distribution. As the distribution of the Yi is unknown, so is the distribution of θ̂.

3.2 Finite sample properties

In the following, let us for simplicity assume that θ contains only one unknown parameter

of interest.

The most fundamental properties of an estimator are its bias and variance. These are nicely illustrated in the following dart-throwing graphs.

Figure 4: Illustration of bias and variance

Variance and bias together make up the mean squared error, which is a measure of how far off our estimate is from the true parameter on average:

MSE(θ̂) = E[(θ̂ − θ0)²] = Var(θ̂) + (E[θ̂] − θ0)².

The first term on the right-hand side is the variance and the second is the squared bias. (To derive this equality, note that θ0 is not a random variable, but a constant.)
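For completeness, the derivation is a short computation: add and subtract E[θ̂] inside the square,

E[(θ̂ − θ0)²] = E[(θ̂ − E[θ̂] + E[θ̂] − θ0)²] = E[(θ̂ − E[θ̂])²] + 2(E[θ̂] − θ0)E[θ̂ − E[θ̂]] + (E[θ̂] − θ0)²,

and note that the middle term vanishes because E[θ̂ − E[θ̂]] = 0, leaving Var(θ̂) + (E[θ̂] − θ0)².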

This gives us two criteria to evaluate the quality of an estimator.

Definition 3.1. (Unbiased and efficient estimators)

An estimator θ̂ of θ0 is said to be unbiased if for all possible values of θ0

E(θ̂) = θ0.


Furthermore, the estimator is said to be efficient within a class of estimators if it minimizes the variance within this class.

Remark 1. One might be tempted to directly discard any biased estimator. However, when comparing an estimator with a nonzero but small bias and a small variance to an unbiased estimator with a large variance, the former might still be the better choice. This is because we only have one sample at hand to arrive at an estimate, and the outcome of the biased estimator is then in general closer to θ0 (i.e. it has a smaller mean squared error).

Example 3.1. We show that the sample average Ȳ is indeed an unbiased estimator of the mean µ of a random variable, even if we do not know its distribution:

E(Ȳ) = E((1/n) ∑_{i=1}^n Y_i) = (1/n) ∑_{i=1}^n E(Y_i) = (1/n) ∑_{i=1}^n µ = (1/n)·nµ = µ.

Example 3.2. Let us calculate the variance of the estimator Ȳ for µ, using the independence of the Y_i:

Var(Ȳ) = Var((1/n) ∑_{i=1}^n Y_i) = (1/n²) ∑_{i=1}^n Var(Y_i) = (1/n²)·nσ² = σ²/n.
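Both results are easy to check by simulation. The following sketch (mine, using die rolls, so µ = 3.5 and σ² = 35/12) approximates the sampling distribution of Ȳ by drawing many samples:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate the sampling distribution of the sample average: draw many
# samples of size n from a die roll and average each sample.
n, replications = 50, 100_000
samples = rng.integers(1, 7, size=(replications, n))
averages = samples.mean(axis=1)

print(averages.mean())  # close to mu = 3.5 (unbiasedness)
print(averages.var())   # close to sigma^2 / n = (35/12)/50 ≈ 0.0583
```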

3.3 Large sample properties

The last example showed that the variance of an estimator can depend on the sample size n.

This already hints at the importance of considering the properties of an estimator as the sample size tends towards infinity. To this aim, we define another important quality criterion,

which broadly speaking ensures that if we were to have infinitely many observations at hand,

we would indeed obtain the true value.

Definition 3.2. (Consistency)

Let θ̂_n be an estimator of θ0 which makes use of the random variables Y1, ..., Yn. Then θ̂_n is said to be a consistent estimator for θ0 if for every ε > 0

P(|θ̂_n − θ0| > ε) → 0 as n → ∞.

To introduce a concise notation: the estimator is consistent iff

plim_{n→∞}(θ̂_n) = θ0.

Remark 1. A consistent estimator can be thought of as an (asymptotically) unbiased estimator whose variance shrinks to zero with increasing sample size.

The sample average considered above is a consistent estimator of the mean. This result is so important that it is in fact a theorem with a well-known name:


Figure 5: The sampling distributions of a consistent estimator for three sample sizes (Source: Wooldridge, Figure C.3)

Theorem 3.1. (Law of large numbers)

Let Y1, Y2, ..., Yn be i.i.d. random variables with mean µ. Then

plim_{n→∞}(Ȳ_n) = µ.

Proof. (Idea only)

We have shown that Var(Ȳ_n) = σ²/n, hence Var(Ȳ_n) → 0 for n → ∞. A more detailed proof would require Chebyshev's inequality; you can look it up on wikipedia. Note that the theorem does not require the Y_i to have finite variance, so another proof is necessary in the case of infinite variance. This more involved form can also be found on wikipedia.

Remark 2. The above is the so-called weak law of large numbers. There is also the strong law, which asserts almost sure convergence. The difference is the following: under the weak law, the event |Ȳ_n − µ| > ε may still occur infinitely many times (at irregular intervals), while the strong law guarantees that, with probability one, for every ε > 0 there is an n0 such that |Ȳ_n − µ| < ε for all n > n0.
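A quick simulation (my own sketch, again with die rolls) makes the convergence visible:

```python
import numpy as np

rng = np.random.default_rng(7)

# Law of large numbers in action: the running average of die rolls
# drifts towards the expected value 3.5 as n grows.
rolls = rng.integers(1, 7, size=100_000)
running_avg = np.cumsum(rolls) / np.arange(1, rolls.size + 1)

for n in (10, 100, 1_000, 100_000):
    print(n, running_avg[n - 1])
```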

The following property of the plim allows the application of the law of large numbers to

a wide variety of problems:

Theorem 3.2. (Continuous functions and the plim)

Let g be a continuous function and θ0 a parameter which we estimate with the consistent estimator θ̂_n, i.e. plim(θ̂_n) = θ0. Then we can consistently estimate the parameter g(θ0) with the estimator G_n := g(θ̂_n), i.e. plim(G_n) = plim(g(θ̂_n)) = g(plim(θ̂_n)) = g(θ0).

Consistency is an important property which tells us that the distribution of the estimator

gets more and more concentrated around the parameter as we increase sample size. Indeed,

it is possible to know even more about the distribution of the estimator and this is important

for interval estimation (such as confidence intervals) and hypothesis testing.


Definition 3.3. (Asymptotic normal distribution)

If {Y1, Y2, ...} is an infinite sequence of random variables such that for all z ∈ R we have

P(Y_n ≤ z) → Φ(z) for n → ∞,

where Φ is the standard normal distribution function, then Y_n is said to be asymptotically standard normally distributed, Y_n ∼ᵃ N(0, 1).

Let us look one final time at our estimator of the sample average for the mean. The

last result of our lecture states that a linear transformation of this estimator is indeed

asymptotically normally distributed. This is a fundamental result and deserves its own

name:

Theorem 3.3. (Central limit theorem)

Let {Y1, ..., Yn} be a random sample with mean µ and variance σ². Then

Z_n := (Ȳ_n − µ)/(σ/√n) ∼ᵃ N(0, 1).
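To get a feeling for how strong this result is, the following sketch (mine; the exponential distribution with µ = σ = 1 is just a deliberately non-normal example) standardizes sample means and compares their quantiles with those of N(0, 1):

```python
import numpy as np

rng = np.random.default_rng(123)

# Central limit theorem: standardized sample means of a decidedly
# non-normal distribution (exponential, mu = sigma = 1) look normal.
n, replications = 200, 50_000
samples = rng.exponential(scale=1.0, size=(replications, n))
z = (samples.mean(axis=1) - 1.0) / (1.0 / np.sqrt(n))

# Compare a few empirical quantiles with the standard normal's.
print(np.quantile(z, [0.025, 0.5, 0.975]))  # ≈ [-1.96, 0.00, 1.96]
```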

There is a pretty good app which simulates random draws and the sampling distributions of different estimators. I recommend you play around with it for a few minutes to get a feeling for the central limit theorem: http://onlinestatbook.com/stat_sim/sampling_dist/index.html
