
Page 1: Stat Methods

Some definitions

◦ Individual: each object described by a set of data

◦ Variable: any characteristic of an individual

⋄ Categorical variable: places an individual into one of several

groups or categories.

⋄ Quantitative variable: takes numerical values on which we can

do arithmetic.

◦ Distribution of a variable: tells what values it takes and how often

it takes these values.

Example:

The following data set consists of five variables about 20 individuals.

ID  Age  Education  Sex  Total income  Job class
 1   43      4       1      18526         5
 2   35      3       2       5400         7
 3   43      2       1       3900         7
 4   33      3       1      28003         5
 5   38      3       2      43900         7
 6   53      4       1      53000         5
 7   64      6       1      51100         6
 8   27      4       2      44000         5
 9   34      4       1      31200         5
10   27      3       2      26030         5
11   47      6       1       6000         6
12   48      3       1       8145         5
13   39      2       1      37032         5
14   30      3       2      30000         5
15   35      3       2      17874         5
16   47      4       2        400         5
17   51      4       2      22216         5
18   56      5       1      26000         6
19   57      6       1     100267         7
20   34      1       1      15000         5

Age: age in years
Education: 1=no high school, 2=some high school, 3=high school diploma, 4=some college, 5=bachelor’s degree, 6=postgraduate degree
Sex: 1=male, 2=female
Total income: income from all sources
Job class: 5=private sector, 6=government, 7=self employed

Variables Age and Total income are quantitative, variables Education, Sex,

and Job class are categorical.

Graphical Description of Data, Jan 5, 2004 - 1 -

Page 2: Stat Methods

Categorical variable analysis

Questions to ask about a categorical variable:

◦ How many categories are there?

◦ In each category, how many observations are there?

Bar graphs and pie charts

Categorical data can be displayed by bar graphs or pie charts.

◦ In a bar graph, the horizontal axis lists the categories, in any order.

The height of the bars can be either counts or percentages.

◦ For better comparison of the frequencies, the categories can be ordered from most frequent to least frequent.

◦ In a pie chart, the area of each slice is proportional to the percentage

of individuals who fall into that category.

Example: Education of people aged 25 to 34

[Figures: bar graph of education level (y-axis: percent of people aged 25 to 34, 0-30; x-axis: education level) with the categories in their original order; the same bar graph with the categories ordered from most to least frequent (HS diploma, Bachelor’s, some college, some HS, postgrad, no HS); and a pie chart of the same data (no HS 3.6%, some HS 7.5%, HS diploma 30.4%, Bachelor’s 29.1%, some college 22.7%, postgrad 6.7%).]

Graphical Description of Data, Jan 5, 2004 - 2 -

Page 3: Stat Methods

Categorical variable analysis

Example: Education of people aged 25 to 34

STATA commands:

. infile ID AGE EDUC SEX EARN JOB using individuals.txt, clear

. drop if AGE<25 | AGE>34

. label values EDUC Education

. label define Education 1 "no HS" 2 "some HS" 3 "HS diploma" 4 "Bachelor’s"

> 5 "some college" 6 "postgrad"

. set scheme s1mono

. gen COUNT=100/_N

. graph bar (sum) COUNT, over(EDUC) ytitle("Percent of people aged 25 to 34")

> b1title("Education level")

. translate @Graph bar1.eps, translator(Graph2eps) replace

. graph bar (sum) COUNT, over(EDUC, sort(1) descending)

> ytitle("Percent of people aged 25 to 34") b1title("Education level")

. translate @Graph bar2.eps, translator(Graph2eps) replace

. set scheme s1color

. graph pie COUNT, over(EDUC) plabel(_all perc, format(%4.1f) gap(-5))

. translate @Graph pie.eps, translator(Graph2eps) replace

Graphical Description of Data, Jan 5, 2004 - 3 -

Page 4: Stat Methods

Quantitative variables: stemplots

Example: Sammy Sosa home runs

Producing stemplots in STATA:

. infile YEAR HR using sosa.dat

. stem HR

Stem-and-leaf plot for HR

0* | 48

1* | 05

2* | 5

3* | 366

4* | 009

5* | 0

6* | 346

Year  Home runs
1989       4
1990      15
1991      10
1992       8
1993      33
1994      25
1995      36
1996      40
1997      36
1998      66
1999      63
2000      50
2001      64
2002      49
2003      40

How to make a stemplot

1. Separate each observation into a stem and a leaf.

e.g. 15 → stem 1, leaf 5; and 4 → stem 0, leaf 4.

2. Write the stems in a vertical column in increasing order.

3. Write each leaf next to its stem, in increasing order out from the stem.

How to choose the stem

◦ Rounding: each leaf should have exactly one digit, so rounding long

numbers before producing the stemplot can help produce a more com-

pact and informative plot.

◦ Splitting: if each stem (or many stems) have a large number of leaves,

all stems can be split, with leaves of 0-4 going to the first stem and 5-9

going to the second.
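In Stata, splitting is available through the stem command’s lines() option (a minimal sketch, assuming the HR variable from the Sosa example is in memory; lines(2) puts leaves 0-4 on the first of two stems and 5-9 on the second):

. stem HR, lines(2)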

Graphical Description of Data, Jan 5, 2004 - 4 -

Page 5: Stat Methods

Quantitative variables: histograms

How to make a histogram

1. Group the observations into “bins” according to their value. Choose the bins carefully: too few bins hide detail; too many fragment the overall pattern.

2. Count the individuals in each bin.

3. Draw the histogram

◦ Leave no space between bars.

◦ Label the axes with units of measurement.

◦ The y-axis can show counts or percentages (per unit).

Example: Sammy Sosa home runs

Year  Home runs
1989       4
1990      15
1991      10
1992       8
1993      33
1994      25
1995      36
1996      40
1997      36
1998      66
1999      63
2000      50
2001      64
2002      49
2003      40

[Figure: density histogram of the home runs (x-axis: Home runs, 0-70; y-axis: Density).]

The area of each bar is proportional to the percentage of data in that range.

We care about the area, not the height, but when the bars have equal width,

area is determined by the height.

For simplicity, use equally spaced bins.

Graphical Description of Data, Jan 5, 2004 - 5 -

Page 6: Stat Methods

Quantitative variables: histograms

Example: Sammy Sosa home runs

Histograms with different bin widths:

[Figures: four histograms of the Sosa home runs drawn with different bin widths (x-axis: Home Runs, 0-70; y-axis: Percentage, 0.00-0.07).]

Producing histograms in STATA:

. infile YEAR HR using sosa.dat

. hist HR, start(0.1) width(10) xlabel(0(10)70) xtitle(Home runs)

. translate @Graph hist1.eps, translator(Graph2eps) replace

. hist HR, start(0.1) width(10) xlabel(0(10)70) xtitle(Home runs) freq

. translate @Graph hist2.eps, translator(Graph2eps) replace

[Figures: the two resulting histograms, one on the density scale and one on the frequency scale (x-axis: Home runs, 0-70).]

Why is a histogram not a bar graph?

◦ Frequencies are represented by area, not height.

◦ There is no space between the bars.

◦ The horizontal axis represents a numerical quantity, with an inherent

order.

Graphical Description of Data, Jan 5, 2004 - 6 -

Page 7: Stat Methods

Interpreting histograms

◦ Describe the overall pattern and any significant deviations from that

pattern.

◦ Shape: Is the distribution (approximately) symmetric or skewed?

[Figure: frequency histogram of a right-skewed variable x (x-axis: 0.0-2.0; y-axis: Frequency, 0-2000).]

This distribution is skewed right

because it has a long right-hand

tail.

◦ Center: Where is the “middle” of the distribution?

◦ Spread: What are the smallest and largest values?

◦ Outliers: Are there any observations that lie outside the overall pat-

tern? They could be unusual observations, or they could be mistakes.

Check them!

Example: Newcomb’s measurements of the passage time of light (IPS Table 1.1)

[Figure: histogram of Newcomb’s measurements (x-axis: Time, −60 to 60; y-axis: Frequency, 0-25).]

Graphical Description of Data, Jan 5, 2004 - 7 -

Page 8: Stat Methods

Time plots

Example: Average retail price of gasoline from Jan 1988 to Apr 2001

[Figure: time plot of the average retail gasoline price (y-axis: Retail gasoline price, 0.9-1.8; x-axis: Year, 1988-2000).]

Note: Whenever data are collected over time, it is a good idea to have

a time plot. Stemplots and histograms ignore time order, which can be

misleading when systematic change over time exists.

Producing a time plot in STATA:

. infile PRICE using gasoline.txt, clear
. gen T = _n - 1    // month index (0 = Jan 1988), assumed; needed by the plot below
. graph twoway line PRICE T, ylabel(0.9(0.1)1.8, format(%3.1f)) xtick(0(12)159)

> xlabel(0 "1988" 24 "1990" 48 "1992" 72 "1994" 96 "1996" 120 "1998" 144 "2000")

> xtitle(Year) ytitle(Retail gasoline price)

Graphical Description of Data, Jan 5, 2004 - 8 -

Page 9: Stat Methods

Measures of center

The mean

The mean of a distribution is the arithmetic average of the observations:

\bar{x} = \frac{x_1 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i

The median

The median is the midpoint of a distribution: the number M

such that

◦ half the observations are smaller and

◦ half are larger.

How to find the median

Suppose the observations are x1, x2, . . . , xn.

1. Arrange the data in increasing order and let x(i) denote the ith

smallest observation.

2. If the number of observations n is odd, the median is the center observation in the ordered list:

M = x_{((n+1)/2)}

3. If the number of observations n is even, the median is the average of the two center observations in the ordered list:

M = \frac{x_{(n/2)} + x_{(n/2+1)}}{2}

Numerical Description of Data, Jan 7, 2004 - 1 -

Page 10: Stat Methods

Measures of center

Examples:

Data set 1:

x1 x2 x3 x4 x5 x6 x7 x8 x9

2 4 3 4 6 5 4 -6 5

Arrange in increasing order:

x(1) x(2) x(3) x(4) x(5) x(6) x(7) x(8) x(9)

-6 2 3 4 4 4 5 5 6

There is an odd number of observations, so the median is

M = x_{((n+1)/2)} = x_{(5)} = 4.

The mean is given by

\bar{x} = \frac{2 + 4 + 3 + 4 + 6 + 5 + 4 + (-6) + 5}{9} = \frac{27}{9} = 3.

Data set 2:

x1 x2 x3 x4 x5 x6 x7 x8 x9 x10

2.3 8.8 3.9 4.1 6.4 5.9 4.2 2.9 1.3 5.1

Arrange in increasing order:

x(1) x(2) x(3) x(4) x(5) x(6) x(7) x(8) x(9) x(10)

1.3 2.3 2.9 3.9 4.1 4.2 5.1 5.9 6.4 8.8

There is an even number of observations, so the median is

M = \frac{x_{(n/2)} + x_{(n/2+1)}}{2} = \frac{x_{(5)} + x_{(6)}}{2} = \frac{4.1 + 4.2}{2} = 4.15.

The mean is given by

\bar{x} = \frac{2.3 + 8.8 + 3.9 + 4.1 + 6.4 + 5.9 + 4.2 + 2.9 + 1.3 + 5.1}{10} = \frac{44.9}{10} = 4.49.
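As a quick check, both values for data set 2 can be reproduced in Stata (a minimal sketch; summarize with the detail option reports the mean and the median, labelled 50%):

. clear
. input x
2.3
8.8
3.9
4.1
6.4
5.9
4.2
2.9
1.3
5.1
end
. summarize x, detail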

Numerical Description of Data, Jan 7, 2004 - 2 -

Page 11: Stat Methods

Mean versus median

◦ The mean is easy to work with algebraically, while the median

is not.

◦ The mean is sensitive to extreme observations, while the median

is more robust.

Example:

[Dotplot on a number line from 0 to 10: original observations at 0, 1, 2; in the modified data the observation at 2 is moved out to 10.]

The original mean and median are

\bar{x} = \frac{0 + 1 + 2}{3} = 1 \quad\text{and}\quad M = x_{((n+1)/2)} = 1

The modified mean and median are

\bar{x} = \frac{0 + 1 + 10}{3} = 3\tfrac{2}{3} \quad\text{and}\quad M = x_{((n+1)/2)} = 1

◦ If the distribution is exactly symmetric, then mean=median.

◦ In a skewed distribution, the mean is further out in the longer

tail than the median.

◦ The median is preferable for strongly skewed distributions, or

when outliers are present.

Numerical Description of Data, Jan 7, 2004 - 3 -

Page 12: Stat Methods

Measures of spread

Example: Monthly returns on two stocks

[Figures: histograms of the returns (in %) on Stock A and Stock B (x-axis: −10 to 20; y-axis: Frequency, 0-40). Stock B’s distribution is visibly more spread out.]

         Stock A   Stock B
Mean       4.95      4.82
Median     4.99      4.68

The distributions of the two stocks have approximately the same

mean and median, but stock B is more volatile and thus more risky.

◦ Measures of center alone are an insufficient description of a

distribution and can be misleading

◦ The simplest useful numerical description of a distribution con-

sists of both a measure of center and a measure of spread.

Common measures of spread are

◦ the quartiles and the interquartile range

◦ the standard deviation

Numerical Description of Data, Jan 7, 2004 - 4 -

Page 13: Stat Methods

Quartiles

Quartiles divide data into 4 even parts

◦ Lower (or first) quartile QL:

median of all observations less than the median M

◦ Middle (or second) quartile M = QM :

median of all observations

◦ Upper (or third) quartile QU: median of all observations greater than the median M

◦ Interquartile range IQR = QU − QL: the distance between the upper and lower quartiles

How to find the quartiles

1. Arrange the data in increasing order and find the median M

2. Find the median of the observations to the left of M; that is the lower quartile, QL.

3. Find the median of the observations to the right of M; that is the upper quartile, QU.

Examples:

Data set:

x1 x2 x3 x4 x5 x6 x7 x8 x9

2 4 3 4 6 5 4 -6 5

Arrange in increasing order:

x(1) x(2) x(3) x(4) x(5) x(6) x(7) x(8) x(9)

-6 2 3 4 4 4 5 5 6

◦ QL is the median of {−6, 2, 3, 4}: QL = 2.5

◦ QU is the median of {4, 5, 5, 6}: QU = 5

◦ IQR = 5 − 2.5 = 2.5
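In Stata, the quartiles appear as the 25% and 75% values of the detailed summary (a sketch, assuming the data are in a variable x; note that Stata’s percentile definition can differ slightly from the rule above for small samples):

. summarize x, detail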

Numerical Description of Data, Jan 7, 2004 - 5 -

Page 14: Stat Methods

Percentiles

More generally we might be interested in the value which is ex-

ceeded only by a certain percentage of observations:

The pth percentile of a set of observations is the value such that

◦ p% of the observations are less than or equal to it and

◦ (100 − p)% of the observations are greater than or equal to it.

How to find the percentiles

1. Arrange the data into increasing order.

2. If np/100 is not an integer, then x_{(k+1)} is the pth percentile, where k is the largest integer less than np/100.

3. If np/100 is an integer, the pth percentile is the average of x_{(np/100)} and x_{(np/100+1)}.
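The rule above translates into a short Stata do-file (a sketch; x holds the data and the local macro p holds the desired percentile, here 30 as an illustration):

sort x
local p = 30
local np = _N*`p'/100
if mod(`np',1) != 0 {
    display "percentile: " x[floor(`np')+1]
}
else {
    display "percentile: " (x[`np'] + x[`np'+1])/2
}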

Five-number summary

A numerical summary of a distribution {x1, . . . , xn} is given by

x_{(1)} \quad Q_L \quad M \quad Q_U \quad x_{(n)}

A simple boxplot is a graph of the five-number summary.

Numerical Description of Data, Jan 7, 2004 - 6 -

Page 15: Stat Methods

Boxplots

A common “rule” for discovering outliers is the 1.5 × IQR rule:

An observation is a suspected outlier if it falls more than 1.5 × IQR below QL or above QU.

How to draw a boxplot (box-and-whisker plot)

1. A box (the box) is drawn from the lower to

the upper quartile (QL and QU).

2. The median of the data is shown by a line in

the box.

3. Lines (the whiskers) are drawn from the ends

of the box to the most extreme observations

within a distance of 1.5 IQR (Interquartile

range).

4. Measurements falling outside 1.5 IQR from

the ends of the box are potential outliers and

marked by ◦ or ∗.

[Figure: side-by-side boxplots of Stock A and Stock B (y-axis: −10 to 20).]

Plotting a boxplot with STATA:

. infile A B using stocks.txt, clear

. label var A "Stock A"

. label var B "Stock B"

. graph box A B, xsize(2) ysize(5)

Numerical Description of Data, Jan 7, 2004 - 7 -

Page 16: Stat Methods

Boxplots

Interpretation of Box Plots

◦ The IQR is a measure of the sample’s variability.

◦ If the whiskers differ in length the distribution of the data is

probably skewed in the direction of the longer whisker.

◦ Very extreme observations (more than 3 × IQR below the lower or above the upper quartile) are outliers, with one of the following

explanations:

a) The measurement is incorrect (error in measurement process

or data processing).

b) The measurement belongs to a different population.

c) The measurement is correct, but represents a rare (chance)

event.

We accept the last explanation only after carefully ruling out

all others.

Numerical Description of Data, Jan 7, 2004 - 8 -

Page 17: Stat Methods

Variance and standard deviation

Suppose there are n observations x1, x2, . . . , xn.

The variance of the n observations is:

s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n-1} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2

This is (approximately) the average of the squared distances of the

observations from the mean.

The standard deviation is:

s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}

Why n − 1?

Division by n − 1 instead of n in the variance calculation is a

common cause of confusion. Why n − 1? Note that

\sum_{i=1}^{n} (x_i - \bar{x}) = 0

Thus, if you know any n − 1 of the differences, the last difference

can be determined from the others. The number of “freely varying”

observations, n − 1 in this case, is called the “degrees of freedom”.
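A minimal Stata sketch of the computation (assuming the data are in a variable x; summarize leaves the sum of a variable in r(sum)):

. quietly summarize x
. gen dev2 = (x - r(mean))^2
. quietly summarize dev2
. display "s^2 = " r(sum)/(r(N)-1) "   s = " sqrt(r(sum)/(r(N)-1))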

Numerical Description of Data, Jan 7, 2004 - 9 -

Page 18: Stat Methods

Properties of s

◦ Measures spread around the mean =⇒ use only if the mean

is used as a measure of center.

◦ s = 0 ⇔ all observations are the same

◦ s is in the same units as the measurements, while s2 is in the

square of these units.

◦ s, like x̄, is not resistant to outliers.

Five-number summary versus standard deviation

◦ The 5-number summary is better for describing skewed distri-

butions, since each side has a different spread.

◦ x̄ and s are preferred for symmetric distributions with no outliers.

Numerical Description of Data, Jan 7, 2004 - 10 -

Page 19: Stat Methods

Histograms and density curves

What’s in our toolkit so far?

◦ Plot the data: histogram (or stemplot)

◦ Look for the overall pattern and identify deviations and outliers

◦ Numerical summary to briefly describe center and spread

A new idea:

If the pattern is sufficiently regular, approximate it with a

smooth curve.

Any curve that is always on or above the horizontal axis and has

total area underneath equal to one is a density curve.

◦ Area under the curve in a range of values indicates the propor-

tion of values in that range.

◦ Density curves come in a variety of shapes, but the “normal” family of familiar

bell-shaped densities is commonly used.

◦ Remember the density is only an approximation, but it sim-

plifies analysis and is generally accurate enough for practical

use.

The Normal Distribution, Jan 9, 2004 - 1 -

Page 20: Stat Methods

Examples

[Figures: histogram of sulfur oxide (in tons) with a superimposed density curve (x-axis: 0-40; y-axis: Density, 0.00-0.07); the same range of values is shaded in the histogram and under the curve.]

Shaded area of histogram: 0.29

Shaded area under the curve: 0.30

[Figure: histogram of the waiting time between eruptions (min) with a fitted density curve (x-axis: 40-100; y-axis: Density, 0.00-0.04).]

The Normal Distribution, Jan 9, 2004 - 2 -

Page 21: Stat Methods

Median and mean of a density curve

Median:

The equal-areas point with 50% of the “mass” on either side.

Mean:

The balancing point of the curve, if it were a solid mass.

Note:

◦ The mean and median of a symmetric density curve are equal.

◦ The mean of a skewed curve is pulled away from the median in

the direction of the long tail.

The mean and standard deviation of a density are denoted µ and

σ, rather than x and s, to indicate that they refer to an idealized

model, and not actual data.

The Normal Distribution, Jan 9, 2004 - 3 -

Page 22: Stat Methods

Normal distributions: N (µ, σ)

The normal distribution is

◦ symmetric,

◦ single-peaked,

◦ bell-shaped.

The density curve is given by

f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).

It is determined by two parameters µ and σ:

◦ µ is the mean (also the median)

◦ σ is the standard deviation

Note: The point where the curve changes from concave to convex

is σ units from µ in either direction.

The Normal Distribution, Jan 9, 2004 - 4 -

Page 23: Stat Methods

The 68-95-99.7 rule

◦ About 68% of the data fall inside (µ − σ, µ + σ).

◦ About 95% of the data fall inside (µ − 2σ, µ + 2σ).

◦ About 99.7% of the data fall inside (µ − 3σ, µ + 3σ).

The Normal Distribution, Jan 9, 2004 - 5 -

Page 24: Stat Methods

Example

Scores on the Wechsler Adult Intelligence Scale (WAIS) for the 20

to 34 age group are approximately N(110, 25).

◦ About what percent of people in this age group have scores

above 110?

◦ About what percent have scores above 160?

◦ In what range do the middle 95% of all scores lie?
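A minimal Stata check of these questions (normal() is the standard normal cdf; the parameters are those of the N(110, 25) model above):

. display 1 - normal((110-110)/25)    // about 50% score above 110
. display 1 - normal((160-110)/25)    // z = 2: about 2.3% (the 68-95-99.7 rule gives 2.5%)
. display (110 - 2*25) " to " (110 + 2*25)    // middle 95%: about 60 to 160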

The Normal Distribution, Jan 9, 2004 - 6 -

Page 25: Stat Methods

Standardization and z-scores

Linear transformation of normal distributions:

X ∼ N(µ, σ)  ⇒  aX + b ∼ N(aµ + b, |a|σ)

In particular it follows that

\frac{X - \mu}{\sigma} \sim N(0, 1).

N(0, 1) is called the standard normal distribution.

For a real number x the standardized value or z-score

z = \frac{x - \mu}{\sigma}

tells how many standard deviations x is from µ, and in what di-

rection.

Standardization enables us to use a standard normal table to find

probabilities for any normal variable.

For example:

◦ What is the proportion of N(0, 1) observations less than 1.2?

◦ What is the proportion of N(3, 1.5) observations greater than 5?

◦ What is the proportion of N(10, 5) observations between 3 and 9?
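These three proportions can be checked directly in Stata (a sketch; normal() is the standard normal cdf):

. display normal(1.2)
. display 1 - normal((5-3)/1.5)
. display normal((9-10)/5) - normal((3-10)/5)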

The Normal Distribution, Jan 9, 2004 - 7 -

Page 26: Stat Methods

Normal calculations

Standard normal calculations

1. State the problem in terms of x.

2. Standardize: z = (x − µ)/σ.

3. Look up the required value(s) on the standard normal table.

4. Reality check: Does the answer make sense?

Backward normal calculations

We can also calculate the values, given the probabilities:

If MPG ∼ N (25.7, 5.88), what is the minimum MPG required to be in the

top 10%?

“Backward” normal calculations

1. State the problem in terms of the probability of being less

than some number.

2. Look up the required value(s) on the standard normal table.

3. “Unstandardize,” i.e. solve z = (x − µ)/σ for x, giving x = µ + zσ.
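For the MPG example above, the backward calculation is a one-liner in Stata (invnormal() is the standard normal quantile function; the top 10% corresponds to the 90th percentile):

. display 25.7 + invnormal(0.90)*5.88    // about 33.2 MPG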

The Normal Distribution, Jan 9, 2004 - 8 -

Page 27: Stat Methods

Example

Suppose X ∼ N (0, 1).

◦ P(X ≤ 2) = ?

◦ P(X > 2) = ?

◦ P(−1 ≤ X ≤ 2) = ?

◦ Find the value z such that

⋄ P(X ≤ z) = 0.95

⋄ P(X > z) = 0.99

⋄ P(−z ≤ X < z) = 0.68

⋄ P(−z ≤ X < z) = 0.95

⋄ P(−z ≤ X < z) = 0.997

Suppose X ∼ N (10, 5).

◦ P(X < 5) = ?

◦ P(−3 < X < 5) = ?

◦ P(−x < X < x) = 0.95

The Normal Distribution, Jan 9, 2004 - 9 -

Page 28: Stat Methods

Assessing Normality

How to make a normal quantile plot

1. Arrange the data in increasing order.

2. Record the percentiles (1/n, 2/n, . . . , n/n).

3. Find the z-scores for these percentiles.

4. Plot x on the vertical axis against z on the horizontal axis.

Use of normal quantile plots

◦ If the data are (approximately) normal, the plot will be close

to a straight line.

◦ Systematic deviations from a straight line indicate a nonnormal

distribution.

◦ Outliers appear as points that are far away from the overall pattern of the plot.
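In Stata, a normal quantile plot for a variable x is produced by the qnorm command (a minimal sketch; qnorm plots the sample quantiles of x against the corresponding normal quantiles):

. qnorm x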

[Figures: normal quantile plots (x-axis: Theoretical Quantiles, −3 to 3; y-axis: Sample Quantiles) for samples from N(0, 1), Exp(1), and U(0, 1); only the N(0, 1) sample follows a straight line.]

The Normal Distribution, Jan 9, 2004 - 10 -

Page 29: Stat Methods

Density Estimation

The normal density is just one possible density curve. There are

many others, some with compact mathematical formulas and many

without.

Density estimation software fits an arbitrary density to data to give

a smooth summary of the overall pattern.

[Figure: density estimate for the velocity of galaxies (x-axis: Velocity of galaxy (1000 km/s), 0-40; y-axis: Density, 0.0-0.2).]

The Normal Distribution, Jan 9, 2004 - 11 -

Page 30: Stat Methods

Histogram

How to scale a histogram?

◦ Easiest way to draw a histogram:

⋄ equally spaced bins

⋄ counts on the vertical axis

[Figure: frequency histogram of the Sosa home runs (x-axis: 0-70; y-axis: Frequency, 0-5).]

Disadvantage: Scaling depends on number of observations and

bin width.

◦ Scale histogram such that area of each bar corresponds to pro-

portion of data:

\text{height} = \frac{\text{counts}}{\text{width} \cdot \text{total number}}

[Figure: density-scaled histogram of the Sosa home runs (x-axis: 0-70; y-axis: Density, 0.00-0.04).]

Proportion of data in interval (0, 10]:

height · width = 0.02 · 10 = 0.2 = 20%

Since n = 15 this corresponds to 3 observations.

The Normal Distribution, Jan 9, 2004 - 12 -

Page 31: Stat Methods

Density curves

[Figures: density-scaled histograms of samples of size n = 250, n = 2500, and n = 250000 from the same distribution, together with the limiting density curve for n → ∞ (x-axis: −4 to 4; y-axis: Density, 0.0-0.5).]

Proportion of data in (1, 2]:

\frac{\#\{x_i : 1 < x_i \le 2\}}{n} \;\longrightarrow\; \int_1^2 f(x)\,dx \qquad (n \to \infty)

Probability that a new observation X falls into [a, b]:

P(a \le X \le b) = \int_a^b f(x)\,dx = \lim_{n \to \infty} \frac{\#\{x_i : a < x_i \le b\}}{n}

The Normal Distribution, Jan 9, 2004 - 13 -

Page 32: Stat Methods

Relationships between data

Example: Smoking and mortality

◦ Data from 25 occupational groups

(condensed from data on thousands of individual men)

◦ Smoking (100 = average number of cigarettes per day)

◦ Mortality ratio for deaths from lung cancer

(100 = average ratio for all English men)

Scatter plot of the data:

[Figure: scatter plot of mortality (index, 60-140) against smoking (index, 70-130) for the 25 occupational groups.]

In STATA:

. insheet using smoking.txt

. graph twoway scatter mortality smoking

Scatterplots and correlation, Jan 12, 2004 - 1 -

Page 33: Stat Methods

Relationship between data

Assessing a scatter plot:

◦ What is the overall pattern?

⋄ form of the relationship?

⋄ direction of the relationship?

⋄ strength of the relationship?

◦ Are there any deviations (e.g. outliers) from these patterns?

Direction of relationship/association:

◦ positive association: above-average values of both variables

tend to occur together, and the same for below-average values

◦ negative association: above-average values of one variable

tend to occur with below-average values of the other, and vice

versa.

Strength of relationship/association:

◦ determined by how closely the points follow the overall pattern

◦ difficult to assess by eye, which motivates a numerical measure

Scatterplots and correlation, Jan 12, 2004 - 2 -

Page 34: Stat Methods

Correlation

Correlation is a numerical measure of the direction and strength

of the linear relationship between two quantitative variables.

The sample correlation r is defined as

r_{xy} = \frac{s_{xy}}{\sqrt{s_x\, s_y}}.

where

s_x = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad
s_y = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad
s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}).

Properties:

◦ dimensionless quantity

◦ not affected by linear transformations: for x′_i = a x_i + b and y′_i = c y_i + d with a, c > 0,

r_{x′y′} = r_{xy}

◦ −1 ≤ r_{xy} ≤ 1

◦ |r_{xy}| = 1 if and only if y_i = a x_i + b for some a ≠ 0 (with r_{xy} = 1 for a > 0 and r_{xy} = −1 for a < 0)

◦ measures linear association between x_i and y_i
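A minimal Stata sketch of this computation (assuming variables x and y in memory; r equals the average product of the z-scores, and the built-in correlate command serves as a check):

. quietly summarize x
. gen zx = (x - r(mean))/r(sd)
. quietly summarize y
. gen zy = (y - r(mean))/r(sd)
. gen zxzy = zx*zy
. quietly summarize zxzy
. display "r = " r(sum)/(r(N)-1)
. correlate x y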

Scatterplots and correlation, Jan 12, 2004 - 3 -

Page 35: Stat Methods

Correlation

[Figures: scatter plots of y against x for correlations ρ = −0.9, −0.6, −0.3, 0, 0.3, 0.6, 0.9, and 0.99.]

Scatterplots and correlation, Jan 12, 2004 - 4 -

Page 36: Stat Methods

Introduction to regression

Regression describes how one variable (response) depends on

another variable (explanatory variable).

◦ Response variable: variable of interest, measures the out-

come of a study

◦ Explanatory variable: explains (or even causes) changes in

response variable

Examples:

◦ Hearing difficulties:

response - sound level (decibels), explanatory - age (years)

◦ Real estate market:

response - listing price ($), explanatory - house size (sq. ft.)

◦ Salaries:

response - salary ($), explanatory - experience (years), educa-

tion, sex

Least squares regression, Jan 14, 2004 - 1 -

Page 37: Stat Methods

Introduction to regression

Example: Food expenditures and income

Data: Sample of 20 households

[Figure: scatter plot of food expenditure against income for the 20 households (x-axis: income, 0-120; y-axis: food expenditure, 0-20).]

Questions:

◦ How does food expenditure (Y ) depend on income (X)?

◦ Suppose we know that X = x0, what can we tell about Y ?

Linear regression:

If the response Y depends linearly on the explanatory variable

X , we can use a straight line (regression line) to predict Y

from X .

Least squares regression, Jan 14, 2004 - 2 -

Page 38: Stat Methods

Least squares regression

How to find the regression line

[Figures: scatter plot of food expenditure against income with a candidate regression line, and a close-up marking an observed y, the predicted ŷ, and the difference y − ŷ.]

Since we intend to predict Y from X , the errors of interest are

mispredictions of Y for fixed X .

The least squares regression line of Y on X is the line that

minimizes the sum of squared errors.

For observations (x_1, y_1), . . . , (x_n, y_n), the regression line is given by

\hat{Y} = a + bX

where

b = r \frac{s_y}{s_x} \quad\text{and}\quad a = \bar{y} - b\bar{x}

(r: correlation coefficient; s_x, s_y: standard deviations; \bar{x}, \bar{y}: means)

Least squares regression, Jan 14, 2004 - 3 -

Page 39: Stat Methods

Least squares regression

Example: Food expenditure and income

X   28    26   32   24    54   59   44   30   40    82
Y  5.2   5.1  5.6  4.6  11.3  8.1  7.8  5.8  5.1  18.0

X   42    58   28   20    42   47  112   85   31    26
Y  4.9  11.8  5.2  4.8   7.9  6.4 20.0 13.7  5.1   2.9

The summary statistics are:

◦ \bar{x} = 45.50
◦ \bar{y} = 7.97
◦ s_x = 23.96
◦ s_y = 4.66
◦ r = 0.946

The regression coefficients are:

b = r \frac{s_y}{s_x} = 0.946 \cdot \frac{4.66}{23.96} = 0.184

a = \bar{y} - b\bar{x} = 7.97 - 0.184 \cdot 45.5 = -0.402

[Figure: scatter plot of food expenditure against income with the fitted regression line.]

Least squares regression, Jan 14, 2004 - 4 -

Page 40: Stat Methods

Interpreting the regression model

◦ The response in the model is denoted Ŷ to indicate that these are predicted Y values, not the true Y values. The “hat” denotes prediction.

◦ The slope of the line indicates how much Y changes for a unit

change in X .

◦ The intercept is the value of Ŷ for X = 0. It may or may not have

a physical interpretation, depending on whether or not X can

take values near 0.

◦ To make a prediction for an unobserved X , just plug it in and

calculate Y .

◦ Note that the line need not pass through the observed data

points. In fact, it often will not pass through any of them.

Least squares regression, Jan 14, 2004 - 5 -

Page 41: Stat Methods

Regression and correlation

Correlation analysis:

We are interested in the joint distribution of two (or more)

quantitative variables.

Example: Heights of 1,078 fathers and sons

[Figure: scatter plot of son’s height against father’s height (inches, 58-80) for the 1,078 pairs.]

Points are scattered around the SD line, which

◦ satisfies (y - \bar{y}) = \frac{s_y}{s_x}(x - \bar{x})

◦ goes through the center (\bar{x}, \bar{y})

◦ has slope s_y/s_x

The correlation r measures how much the points spread around

the SD line.

Least squares regression, Jan 14, 2004 - 6 -

Page 42: Stat Methods

Regression and correlation

Regression analysis:

We are interested how the distribution of one response variable

depends on one (or more) explanatory variables.

Example: Heights of 1,078 fathers and sons

[Figures: scatter plot of son’s height against father’s height (inches, 58-80), with density histograms of son’s height for fathers of height 64, 68, and 72 inches, and the scatter plot with the regression line.]

In each vertical strip, the

points are distributed

around the regression

line.

Least squares regression, Jan 14, 2004 - 7 -

Page 43: Stat Methods

Properties of least squares regression

◦ The distinction between explanatory and response variables is

essential. Looking at vertical deviations means that changing

the axes would change the regression line.

[Figure: scatter plot of son’s height against father’s height with the two regression lines y = a + bx (y on x) and x = a′ + b′y (x on y), which differ.]

◦ A change of 1 sd in X corresponds to a change of r sds in Y .

◦ The least squares regression line always passes through the

point (\bar{x}, \bar{y}).

◦ r2 (the square of the correlation) is the fraction of the variation

in the values of y that is explained by the least squares regres-

sion on x.

When reporting the results of a linear regression,

you should report r2.

These properties depend on the least-squares fitting criterion and

are one reason why that criterion is used.

Least squares regression, Jan 14, 2004 - 8 -

Page 44: Stat Methods

The regression effect

Regression effect

In virtually all test-retest situations, the bottom group on the

first test will on average show some improvement on the sec-

ond test - and the top group will on average fall back. This is

the regression effect. The statistician and geneticist Sir Fran-

cis Galton (1822-1911) called this effect “regression to medi-

ocrity”.

[Figure: scatter plot of son’s height against father’s height (inches, 58-80) with the regression line.]

Regression fallacy

Thinking that the regression effect must be due to something

important, not just the spread around the line, is the regression

fallacy.

Least squares regression, Jan 14, 2004 - 9 -

Page 45: Stat Methods

Regression in STATA

. infile food income size using food.txt

. graph twoway scatter food income || lfit food income, legend(off)
> ytitle(food)

. regress food income

      Source |       SS       df       MS              Number of obs =      20
-------------+------------------------------           F(  1,    18) =  151.97
       Model |  369.572965     1  369.572965           Prob > F      =  0.0000
    Residual |  43.7725361    18  2.43180756           R-squared     =  0.8941
-------------+------------------------------           Adj R-squared =  0.8882
       Total |  413.345502    19  21.7550264           Root MSE      =  1.5594

------------------------------------------------------------------------------
        food |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      income |   .1841099   .0149345    12.33   0.000     .1527336    .2154862
       _cons |  -.4119994   .7637666    -0.54   0.596    -2.016613    1.192615
------------------------------------------------------------------------------

[Figure: the resulting scatter plot with fitted line (y-axis: Food expenditure, 0-20; x-axis: Income, 0-120).]

This graph has been generated using the graphical user interface of STATA.

The complete command is:

. twoway (scatter food income, msymbol(circle) msize(medium) mcolor(black))
> (lfit food income, range(0 120) clcolor(black) clpat(solid) clwidth(medium)),
> ytitle(Food expenditure, size(large)) ylabel(, valuelabel angle(horizontal)
> labsize(medlarge)) xtitle(Income, size(large)) xscale(range(0 120))
> xlabel(0(20)120, labsize(medlarge)) legend(off) ysize(2) xsize(3)

Least squares regression, Jan 14, 2004 - 10 -

Page 46: Stat Methods

Residual plots

Residuals: difference of observed and predicted values

e_i = \text{observed } y - \text{predicted } y = y_i - \hat{y}_i = y_i - (a + b x_i)

For a least squares regression, the residuals always have mean zero.
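In Stata, residuals and fitted values are obtained with predict after regress (a sketch for the food expenditure example; FITTED and RES are illustrative variable names):

. quietly regress food income
. predict FITTED, xb
. predict RES, residuals
. summarize RES    // mean is (numerically) zero
. graph twoway scatter RES income, yline(0)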

Residual plot

A residual plot is a scatterplot of the residuals against the

explanatory variable. It is a diagnostic tool to assess the fit of

the regression line.

Patterns to look for:

◦ Curvature indicates that the relationship is not linear.

◦ Increasing or decreasing spread indicates that the prediction

will be less accurate in the range of explanatory variables where

the spread is larger.

◦ Points with large residuals are outliers in the vertical direc-

tion.

◦ Points that are extreme in the x direction are potential high

influence points.

Influential observations are individuals with extreme x values

that exert a strong influence on the position of the regression line.

Removing them would significantly change the regression line.

Least squares regression, Jan 14, 2004 - 11 -

Page 47: Stat Methods

Regression Diagnostics

Example: First data set

[Figures: scatter plot of Y against X with the fitted line; residuals against fitted values; residuals against X.]

residuals are regularly distributed

Least squares regression, Jan 14, 2004 - 12 -

Page 48: Stat Methods

Regression Diagnostics

Example: Second data set

[Figures: scatter plot of Y against X with the fitted line; residuals against fitted values; residuals against X.]

functional relationship other than linear

Least squares regression, Jan 14, 2004 - 13 -

Page 49: Stat Methods

Regression Diagnostics

Example: Third data set

[Figures: scatter plot of Y against X with the fitted line; residuals against fitted values; residuals against X.]

outlier, regression line misfits majority of data

Least squares regression, Jan 14, 2004 - 14 -

Page 50: Stat Methods

Regression Diagnostics

Example: Fourth data set

[Figures: scatter plot of Y against X with the fitted line; residuals against fitted values; residuals against X.]

heteroscedasticity

Least squares regression, Jan 14, 2004 - 15 -

Page 51: Stat Methods

Regression Diagnostics

Example: Fifth data set

[Figures: scatter plot of Y against X with the fitted line; residuals against fitted values; residuals against X.]

one separate point in direction of x, highly influential

Least squares regression, Jan 14, 2004 - 16 -

Page 52: Stat Methods

The Question of Causation

Example: Are babies brought by the stork?

◦ Data from 54 countries

◦ Variables:

⋄ Birth rate (newborns per 1000 women)

⋄ Number of storks (per 1000 women)

[Figure: scatter plot of birth rate (0-21) against number of storks per 1000 women (0-5) for the 54 countries.]

Model: Birth rate (Y) is proportional to the number of storks (X)

Y = b X + ε

Least squares regression yields for the slope of the regression line

b = 4.3 ± 0.2.

Can we conclude that babies are brought by the stork?

Causation, Jan 16, 2004 - 1 -

Page 53: Stat Methods

The Question of Causation

A more serious example:

Variables:

◦ Income Y - response

◦ level of education X - explanatory variable

There is a positive association between income and education.

Question: Does better education increase income?

[Diagrams: three causal structures relating X and Y — (a) a causal effect X → Y; (b) confounding, with a third variable Z affecting both X and Y; (c) an undetermined relationship (marked with “?”).]

Possible alternative explanation: Confounding

◦ People from prosperous homes are likely to receive many years of edu-

cation and are more likely to have high earnings.

◦ Education and income might both be affected by personal attributes such as self-assurance. On the other hand, the level of education could have an impact on, e.g., self-assurance. The effects of education and self-assurance cannot be separated.

Confounding:

Response and explanatory variable both depend on a third

(hidden) variable.

Causation, Jan 16, 2004 - 2 -

Page 54: Stat Methods

Establishing Causal Relationships

Controlled experiments:

A cause-effect relationship between two variables X and Y can be

established by conducting an experiment where

◦ the values of X are manipulated and

◦ the effect on Y is observed.

Problem: Often such experiments are not possible.

If we cannot establish a causal relationship by a controlled experi-

ment, we can still collect evidence from observational studies:

◦ The association is strong.

◦ The association is consistent across multiple studies.

◦ Higher doses are associated with stronger responses.

◦ The alleged cause precedes the effect in time.

◦ The alleged cause is plausible.

Example: Smoking and lung cancer

Causation, Jan 16, 2004 - 3 -

Page 55: Stat Methods

Caution about Causation

Association is not causation

Two variables may be correlated because both are affected

by some other (measured or unmeasured) variable.

Unmeasured confounding variables can influence the in-

terpretation of relationships among the measured vari-

ables. They

◦ may suggest a relationship where there is none or

◦ may mask a real relationship.

No causation in - no causation out

Causation is - unlike association - not a statistical concept.

For inference on cause-effect relationships, we need some

knowledge about the causal relationships between the vari-

ables in the study.

Randomized experiments guarantee the absence of any

confounding variables. Any relationship between the ma-

nipulated variable and the response must be due to a

cause-effect relationship.

Causation, Jan 16, 2004 - 4 -

Page 56: Stat Methods

Experiments and Observational Studies

Two major types of statistical studies

◦ Observational study - observes individuals/objects and mea-

sures variables of interest but does not attempt to interfere with

the natural process.

◦ Designed experiment - deliberately imposes some treatment

on individuals to observe their responses.

Remarks:

◦ Sample surveys are an example of an observational study.

◦ In economics, most studies are observational.

◦ Clinical studies are often designed experiments.

◦ Designed experiments allow statements about causal relationships between treatment and response.

◦ Observational studies have no control over variables. Thus the effect of the explanatory variable on the response variable might be confounded (mixed up) with the effect of some other variables. Such variables are called confounders and are a major source of bias.

Experiments and Observational Studies, Jan 16, 2004 - 5 -

Page 57: Stat Methods

Designed Experiments

• In controlled experiments, the subjects are assigned to one of

two groups,

◦ treatment group and

◦ control group (which does not receive treatment).

• A controlled experiment is randomized if the subjects are ran-

domly assigned to one of the two groups.

• One precaution in designed experiments is the use of a placebo, which is made of a completely neutral substance. The subjects do not know whether they receive the treatment or a placebo, so any difference in the response cannot be attributed to psychological or psychosomatic effects.

• In a double blind experiment, neither the subjects nor the

treatment administrators know who is assigned to the two

groups.

Example: The Salk polio vaccine field trial

◦ Randomized controlled double-blind experiment in 11 states

◦ 200,000 children in treatment group

◦ 200,000 children in control group treated with placebo

The difference between the responses of the two groups shows that

the vaccine reduces the risk of polio infection.

Experiments and Observational Studies, Jan 16, 2004 - 6 -

Page 58: Stat Methods

Confounding

Confounding means a difference between the treatment and control groups—other than the treatment—which affects the responses being studied. A confounder is a third variable, associated with both the exposure and the disease.

Example: Lanarkshire Milk Experiment

The purpose of the experiment was to study the effect of pasteur-

ized milk on the health of children.

◦ The subjects of the experiment were school children.

◦ The children in the treatment group got a daily portion of pas-

teurized milk.

◦ The children in the control group did not receive any extra milk.

◦ The teachers assigned poorer children to the treatment group so that they would get extra milk.

The effect of pasteurized milk on the health of children is con-

founded with the effect of wealth: Poorer children are more exposed

to diseases.

Experiments and Observational Studies, Jan 16, 2004 - 7 -

Page 59: Stat Methods

Observational Studies

Confounding is a major problem in observational studies.

Association is NOT Causation

Example: Does smoking cause cancer?

• Designed experiment not possible (cannot make people

smoke).

• Observation: Smokers have higher cancer rates

• Tobacco industry: There might be a gene which

◦ makes people smoke and

◦ causes cancer

In that case stopping smoking would not prevent cancer since

it is caused by the gene. The observed high association could

be attributed to the confounding effect of such a gene.

• However: Studies with identical twins—one smoker and one nonsmoker—cast serious doubt on the gene theory.

Experiments and Observational Studies, Jan 16, 2004 - 8 -

Page 60: Stat Methods

Example

Do screening programs speed up detection of breast cancer?

◦ Large-scale trial run by the Health Insurance Plan of Greater

New York, starting in 1963

◦ 62,000 women age 40 to 64 (all members of the plan)

◦ Randomly assigned to two equal groups

◦ Treatment group:

⋄ women were encouraged to come in for annual screening

⋄ 20,200 women did come in for screening

⋄ 10,800 refused.

◦ Control group:

⋄ was offered usual health care

◦ All the women were followed for many years.

Epidemiologists who worked on the study found that

◦ screening had little impact on diseases other than breast cancer;

◦ poorer women were less likely to accept screening than richer

ones; and

◦ most diseases fall more heavily on the poor than the rich.

Experiments and Observational Studies, Jan 16, 2004 - 9 -

Page 61: Stat Methods

Example

Deaths in the first five years of the screening trial, by cause. Rates per

1,000 women.

                                        Cause of death
                                  Breast cancer      All other
                 Number of persons  Number  Rate   Number  Rate
Treatment group       31,000          39     1.3     837    27
   Examined           20,200          23     1.1     428    21
   Refused            10,800          16     1.5     409    38
Control group         31,000          63     2.0     879    28

Questions:

◦ Does screening save lives?

◦ Why is the death rate from all other causes in the whole treatment

group (“examined” and “refused” combined) about the same as the

rate in the control group?

◦ Why is the death rate from all other causes higher for the “refused”

group than the “examined” group?

◦ Breast cancer (like polio, but unlike most other diseases) affects the

rich more than the poor. Which numbers in the table confirm this

association between breast cancer and income?

◦ The death rate (from all causes) among women who accepted screening

is about half the death rate among women who refused. Did screening

cut the death rate in half? If not, what explains the difference in death

rates?

◦ To show that screening reduces the risk from breast cancer, someone

wants to compare 1.1 and 1.5. Is this a good comparison? Is it biased

against screening? For screening?

Experiments and Observational Studies, Jan 16, 2004 - 10 -

Page 62: Stat Methods

Survey Sampling

Situation:

Population of N individuals (or items)

e.g. ◦ students at this university

◦ light bulbs produced by a company on one day

Seek information about population

e.g. ◦ amount of money students spent on books this quarter

◦ percentage of students who bought more than 10 books

in this quarter

◦ lifetime of light bulbs

Full data collection is often not possible because it is e.g.

◦ too expensive

◦ too time consuming

◦ not sensible (e.g. testing every produced light bulb for its lifetime)

Statistical approach:

◦ collect information from part of the population (sample)

◦ use information on sample to draw conclusions on whole pop-

ulation

Questions:

◦ How to choose a sample?

◦ What conclusions can be drawn?

Survey Sampling, Jan 19, 2004 - 1 -

Page 63: Stat Methods

Survey Sampling

Objective of a sample survey:

Gather information on some variable for population of N individ-

uals:

xi value of interest for ith individual

x1, . . . , xN values for population

Sample of length n:

x1, . . . , xn values obtained from sampling

Parameter - a number that describes the population, e.g.

\mu_{\text{pop}} = \frac{1}{N} \sum_{j=1}^{N} x_j \qquad \text{(population mean)}

\sigma^2_{\text{pop}} = \frac{1}{N} \sum_{j=1}^{N} (x_j - \mu_{\text{pop}})^2 \qquad \text{(population variance)}

Estimate population parameters from the sampled values:

\hat{\mu}_{\text{pop}} = \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad \text{(sample mean)}

\hat{\sigma}^2_{\text{pop}} = s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \qquad \text{(sample variance)}

A function of the sample x1, . . . , xn is called a statistic.

Survey Sampling, Jan 19, 2004 - 2 -

Page 64: Stat Methods

Sampling Distribution

Suppose we are interested in the amount of money students at this

university have spent on books this quarter.

Idea: Ask 20 students about the amount they have spent and take

the average.

The value we obtain will vary from sample to sample, that is, if we

asked another 20 students we would get a different answer.

Sampling distribution

The sampling distribution of a statistic is the distribution of

all values taken by the statistic if evaluated for all possible

samples of size n taken from the same population.

In our example, the sampling distribution of the average amount

obtained from the sample depends on the way we choose the sample

from the population:

◦ Ask 20 students in this class.

◦ Ask 20 students in your department.

◦ Ask 20 students in the University bookshop.

◦ Select randomly 20 students from the register of the university.

The design of a sample refers to the method used to choose the

sample from the population.

Survey Sampling, Jan 19, 2004 - 3 -

Page 65: Stat Methods

Sampling Distribution

Example:

Consider a population of 15 students who spent the following amounts on books:

x1   x2   x3   x4   x5   x6   x7   x8   x9   x10  x11  x12  x13  x14  x15
100  120  150  180  200  220  220  240  260  280  290  300  310  350  400

[Figures: sampling distribution of the sample mean x̄ = (1/n) ∑ᵢ xᵢ, shown as frequency histograms (in %) for sample sizes (a) n = 2 (σ = 55.4247), (b) n = 3 (σ = 43.38302), and (c) n = 4 (σ = 35.96526); the spread decreases as n grows.]

Survey Sampling, Jan 19, 2004 - 4 -

Page 66: Stat Methods

Bias

Example:

Suppose we are interested in the amount of money students at this

university have spent on books last quarter.

Sample: 20 students in the University bookshop

Do we get a good estimate for the average amount spent on books

last quarter by UofC students?

◦ Students who buy more books and spend more money on books are more likely to be found in bookshops than students who buy fewer books.

◦ The sample mean might overestimate the true amount spent

on books.

◦ The sample is not representative for the population of all stu-

dents.

Careful: A poor sample design can produce misleading conclu-

sions.

The design of a study is biased if it systematically favors some

parts of the population over others.

A statistic is unbiased if the mean of its sampling distribution

is equal to the parameter being estimated. Otherwise we say the

statistic is biased.

Survey Sampling, Jan 19, 2004 - 5 -

Page 67: Stat Methods

Bias

Examples: Biased Sampling

◦ Midway Airlines Ads in the New York Times and the Wall Street Jour-

nal stated that “84 percent of frequent business travelers to Chicago

prefer Midway Metrolink to American, United, and TWA.”

The survey was “conducted among Midway Metrolink passengers between New York and Chicago.”

◦ A 1992 Roper poll asked “Does it seem possible or does it seem im-

possible to you that the Nazi extermination of Jews never happened?”

22% of the American respondents said “seems possible.”

A reworded 1994 poll asked “Does it seem possible to you that the Nazi

extermination of Jews never happened, or do you feel certain that it

happened?” This time only 1% of the respondents said it was “possible

it never happened.”

◦ ABC network program Nightline once asked whether the United Na-

tions should continue to have its headquarters in the United States.

More than 186,000 callers responded, and 67% said “No.”

A properly designed sample survey showed that 72% of adults want the

UN to stay.

◦ A call-in poll conducted by USA Today concluded that Americans love

Donald Trump.

USA Today later reported that 5,640 of the 7,800 calls for the poll came

from the offices owned by one man, Cincinnati financier Carl Lindner.

Survey Sampling, Jan 19, 2004 - 6 -

Page 68: Stat Methods

Caution about Sample Surveys

• Undercoverage

◦ occurs when some groups in the population are left out of

the process of choosing the sample

◦ no accurate list of the population

◦ results in bias if this group differs from the rest of the

population

• Nonresponse

◦ occurs when a chosen individual cannot be contacted or

does not cooperate

◦ results in bias if this group differs from the rest of the

population

• Response bias

◦ subjects may not want to admit illegal or unpopular be-

haviour

◦ subjects may be affected by the interviewer’s appearance or

tone

◦ subjects may not remember correctly

• Question wording

◦ confusing or leading questions can introduce strong bias

◦ do not trust sample survey results unless you have read the

exact questions posed

Survey Sampling, Jan 19, 2004 - 7 -

Page 69: Stat Methods

Simple Random Sampling

A simple random sample (SRS) of size n consists of n indi-

viduals chosen from the population in such a way that every set of

n individuals is equally likely to be selected.

◦ Every possible sample has an equal chance of being selected.

◦ Every individual has an equal chance of being selected.

◦ Random selection eliminates bias in sampling.
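In Stata, an SRS of, say, 20 observations from the data in memory can be drawn with the sample command (a minimal sketch; set seed makes the draw reproducible, and the count option requests a fixed number of observations rather than a percentage):

. set seed 20040119
. sample 20, count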

SRS or Not?

Is each of the following samples an SRS or not?

◦ A deck of cards is shuffled, and the top five dealt.

◦ A sample of Illinois residents is drawn by choosing all the resi-

dents in each of 100 census blocks (in such a way that each set

of 100 blocks is equally likely to be chosen)

◦ A telephone survey is conducted by dialing telephone numbers

at random (i.e. each valid phone number is equally likely).

◦ A sample of 10% of all students at the University of Chicago is

chosen by numbering the students 1, . . . , N , drawing a random

integer i from 1 to 10, and drawing every tenth student begin-

ning with i.

(E.g. if i = 5, students 5, 15, 25, . . . are chosen.)

Survey Sampling, Jan 19, 2004 - 8 -

Page 70: Stat Methods

Stratified Sampling

Example:

◦ Population: Students at this university

◦ Objective: Amount of money spent on books this quarter

◦ Knowledge: Students in e.g. humanities spend more money on

books

Use knowledge to build sample:

◦ divide the population into groups of similar individuals, called strata

◦ choose a simple random sample within each group

◦ make the sample size in each group e.g. proportional to the size of the group

This can reduce the variability of the estimate significantly.

Survey Sampling, Jan 19, 2004 - 9 -

Page 71: Stat Methods

Summary

◦ A number which describes a population is a parameter.

◦ A number computed from the data is a statistic.

◦ Use statistics to make inferences about unknown population

parameters.

◦ A Simple random sample (SRS) of size n consists of n in-

dividuals from the population sampled without replacement,

that is, every set of n individuals has an equal chance to be the

sample actually selected.

◦ A statistic from a random sample has a sampling distribution

that describes how the statistic varies in repeated data produc-

tion.

◦ A statistic as an estimator of a parameter may suffer from bias

or from high variability. Bias means that the mean of the

sampling distribution is not equal to the true value of the pa-

rameter. The variability of the statistic is described by the

spread of its sampling distribution.

Survey Sampling, Jan 19, 2004 - 10 -

Page 72: Stat Methods

First Step Towards Probability

Experiment:

Toss a die and observe the number on the face up.

What is the chance

◦ of getting a six?
  Event of interest: 6
  All possible events: 1, 2, 3, 4, 5, 6
  ⇒ 1/6 (one out of six)

◦ of getting an even number?
  Event of interest: 2, 4, 6
  All possible events: 1, 2, 3, 4, 5, 6
  ⇒ 1/2 (three out of six)

The classical probability concept:

If there are N equally likely possibilities, of which one must occur

and s are regarded favorable, or as a “success”, then the probability

of a “success” is

s

N.

Counting, Jan 21, 2004 - 1 -

Page 73: Stat Methods

First Step Towards Probability

Example:

Suppose that of 100 applicants for a job 50 were women and 50

were men, all equally qualified. Further suppose that the company

hired 2 women and 8 men.

How likely is this outcome under the assumption that

the company does not discriminate?

How many ways are there to choose

◦ 10 out of 100 applicants? (⇒ N)

◦ 2 out of 50 female applicants and 8 out of 50 male applicants?

(⇒ s)

To compute such probabilities we need a way to count the num-

ber of possibilities (favorable and total).

Counting, Jan 21, 2004 - 2 -

Page 74: Stat Methods

The Multiplicative Rule

Suppose you have k choices to make, with N_1, . . . , N_k possibilities, respectively. Then the total number of possibilities is

the product

N1 · · ·Nk.

Sampling in order with replacement

If you sample n times in order with replacement from a set of N

elements, then the total number of possible sequences (x1, . . . , xn)

is N^n.

Example:

If you toss a die 5 times, the number of possible results is 6^5 = 7776.

Sampling in order without replacement

If you sample n times in order without replacement from a set of N

elements, then the total number of possible sequences (x1, . . . , xn)

is

N(N − 1) · · · (N − n + 1) = N!/(N − n)!.

Example:

If you select 5 cards in order from a card deck of 64, the number of possible results is 64 · 63 · 62 · 61 · 60 = 914,941,440.
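Both counts are quick to verify with Python's standard library (a sketch):

import math

# Sampling in order with replacement: N^n sequences.
print(6 ** 5)            # 7776 possible results of 5 die tosses

# Sampling in order without replacement: N!/(N - n)! sequences.
print(math.perm(64, 5))  # 914941440 ordered 5-card draws from 64 cards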

Counting, Jan 21, 2003 - 3 -


Permutations and Combinations

Example:

If you select 5 cards from a card deck of 64, you are typically only

interested in the cards you have, not in the order in which you

received them. How many different combinations of 5 cards out

of 64 are there?

To answer this question we first address the question of how many

different sequences of the same 5 cards exist.

Permutation:

Let (x1, . . . , xn) be a sequence. A permutation of this sequence is

any rearrangement of the elements without losing or adding any

elements, that is, any new sequence

(xi1, . . . , xin)

with permuted indices {i1, . . . , in} = {1, . . . , n}. The trivial per-

mutation does not change the order, i.e. ij = j.

How many permutations of n distinct elements are there? The

multiplicative rule yields

n · (n − 1) · · · 1 = n!.

Example (contd):

The number of different sequences of 5 fixed cards is 5! = 5 · 4 · 3 · 2 · 1 = 120.

Counting, Jan 21, 2003 - 4 -


Permutations and Combinations

How many different combinations of n elements chosen from

N distinct elements are there?

Recall that

◦ The number of different sequences of length n that can be chosen from N distinct elements is N!/(N − n)!.

◦ The number of permutations of any sequence of length n is n!.

Thus the number of combinations of n elements chosen from N distinct elements is

N!/(n! (N − n)!) = C(N, n) = C(N, N − n).

The numbers C(N, n), read "N choose n", are referred to as binomial coefficients.

Since two permuted (ordered) sequences (x1, . . . , xn) lead to the same (un-

ordered) combination {x1, . . . , xn} we divide the number of ordered se-

quences by the number of permutations.

Counting, Jan 21, 2003 - 5 -


Examples

Example:

If you select 5 cards from a card deck of 64, you are typically only

interested in the cards you have, not in the order in which you

received them. How many different combinations of 5 cards out

of 64 are there?

The answer is

C(64, 5) = (64 · 63 · 62 · 61 · 60)/(5 · 4 · 3 · 2 · 1) = 914,941,440/120 = 7,624,512.

Example:

Recall the example with the 100 applicants for a job. The number

of ways to choose

◦ 2 women out of 50 is C(50, 2).

◦ 8 men out of 50 is C(50, 8).

◦ 10 applicants out of 100 is C(100, 10).

Thus the chance of this event is

C(50, 2) · C(50, 8) / C(100, 10) ≈ 0.038.

Moreover, the chance of this or a more extreme event (only one or

no woman is hired) is 0.046.
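These probabilities can be checked with Python's math.comb (a sketch):

from math import comb

total = comb(100, 10)
p2 = comb(50, 2) * comb(50, 8) / total   # exactly 2 women hired
p1 = comb(50, 1) * comb(50, 9) / total   # exactly 1 woman hired
p0 = comb(50, 10) / total                # no woman hired

print(round(p2, 3))             # ~0.038
print(round(p2 + p1 + p0, 3))   # ~0.046 (this or a more extreme event)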

Counting, Jan 21, 2003 - 6 -


Summary

The number of possibilities to sample, with or without replacement, in order or unordered, n elements from a set of N distinct elements is summarized in the following table:

Sampling              in order        without order
without replacement   N!/(N − n)!     C(N, n)
with replacement      N^n             C(N + n − 1, n)

Counting, Jan 21, 2003 - 7 -


Introduction to Probability

Classical Concept:

◦ requires finitely many and equally likely outcomes

◦ probability of an event is defined as the number of favorable outcomes (s) divided by the number of total outcomes (N):

Probability of event = s/N

◦ can be determined by counting outcomes

In many practical situations the different outcomes are not equally

likely:

◦ Success of treatment

◦ Chance to die of a heart attack

◦ Chance of snowfall tomorrow

It is not immediately clear how to measure chance in each of these

cases.

Three Concepts of Probability

◦ Frequency interpretation

◦ Subjective probabilities

◦ Mathematical probability concept

Elements of Probability, Jan 23, 2003 - 1 -


The Frequentist Approach

In the long run, we are all dead.

John Maynard Keynes (1883-1946)

The Frequency Interpretation of Probability

The probability of an event is the proportion of time that events

of the same kind (repeated independently and under the same

conditions) will occur in the long run.

Example:

Suppose we collect data on the weather in Chicago on Jan 21 and we note that in the past 124 years it snowed in 34 years on Jan 21, that is, 34/124 · 100% = 27.4% of the time.

Thus we would estimate the probability of snowfall on Jan 21 in Chicago as 0.274.

The frequency interpretation of probability is based on the follow-

ing theorem:

The Law of Large Numbers

If a situation, trial, or experiment is repeated again and again, the proportion of successes will converge to the probability of any one outcome being a success.
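The convergence is easy to watch in a short simulation (a sketch, assuming a fair coin):

import random

random.seed(1)  # for reproducibility
for n in (100, 10_000, 1_000_000):
    heads = sum(random.random() < 0.5 for _ in range(n))
    print(n, heads / n)   # relative frequency settles toward 0.5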

Elements of Probability, Jan 23, 2003 - 2 -

The Frequentist Approach

[Figure: relative frequency of heads against the number of tosses, in three panels: tosses 1−1000, tosses 1000−100000 (in 1000s), and tosses 100000−1000000 (in 100000s); the relative frequency settles near 0.5.]

Elements of Probability, Jan 23, 2003 - 3 -

Page 82: Stat Methods

The Subjectivist (Bayesian) Approach

Not all events are repeatable:

◦ Will it snow tomorrow?

◦ Will Mr Jones, 42, live to 65?

◦ Will the Dow Jones rise tomorrow?

◦ Does Iraq have weapons of mass destruction?

To all these questions the answer is either “yes” or “no”, but we

are uncertain about the right answer.

Need to quantify our uncertainty about an event A:

Game with two players:

◦ 1st player determines p such that he will "win" $c · (1 − p) if event A occurs and otherwise he will "lose" $c · p.

◦ 2nd player chooses c which can be positive or negative.

The Bayesian interpretation of probability is that probability

measures the personal (subjective) uncertainty of an event.

Example: Weather forecast

Meteorologist says that the probability of snowfall tomorrow is

90%.

He should be willing to bet $90 against $10 that it snows tomorrow

and $10 against $90 that it does not snow.

Elements of Probability, Jan 23, 2003 - 4 -


The Elements of Probability

A (statistical) experiment is a process of observation or mea-

surement. For a mathematical treatment we need:

Sample Space S - set of possible outcomes

Example: An urn contains five balls, numbered from 1 through

5. We choose two at random and at the same time. What is the

sample space?

S = { {1, 2}, {1, 3}, {1, 4}, {1, 5}, {2, 3}, {2, 4}, {2, 5}, {3, 4}, {3, 5}, {4, 5} }.

Events A ⊆ S - an event is a subset of the sample space S

Example: In the example above, the event A that two balls with odd numbers are chosen is

A = { {1, 3}, {1, 5}, {3, 5} }.

Probability Function P - assigns each A a value in [0, 1]

Example: Assuming that all outcomes are equally likely we obtain P(A) = 3/10.

Elements of Probability, Jan 23, 2003 - 5 -


The Elements of Probability

Why not assign probabilities to outcomes?

Example: Spinner labeled from 0 to 1.

◦ Suppose that all outcomes s ∈ S = [0, 1) are equally likely.

◦ Assign probabilities uniformly on S.

◦ P({s}) = c > 0 ⇒ P(S) = ∞
◦ P({s}) = 0 ⇒ P(S) = 0

Solution: Assign to each subset of S a probability equal to the

“length” of that subset:

◦ Probability that the spinner lands in [0, 1/4) is 1/4.

◦ Probability that the spinner lands in [1/2, 3/4) is 1/4.

◦ Probability that the spinner lands on 1/2 is 0.

In integral notation we have

P(spinner lands in [a, b]) = ∫_a^b dx = b − a.

Remark:

Strictly speaking, we can define the above probability only on a collection A of subsets A ⊆ S; this collection, however, covers all subsets relevant for this class.

In the case of finite or countably infinite sample spaces S there are no such exceptions

and A covers all subsets of S.

Elements of Probability, Jan 23, 2003 - 6 -


A Set Theory Primer

A set is “a collection of definite, well distinguished objects of our perception

or of our thought”. (Georg Cantor, 1845-1918)

Some important sets:

◦ N = {1, 2, 3, . . .}, the set of natural numbers

◦ Z = {. . . ,−2,−1, 0, 1, 2, . . .}, the set of integers

◦ R = (−∞,∞), the set of real numbers

Intervals are denoted as follows:

[0, 1] the interval from 0 to 1 including 0 and 1

[0, 1) the interval from 0 to 1 including 0 but not 1

(0, 1) the interval from 0 to 1 not including 0 and 1

If a is an element of the set A then we write a ∈ A.

If a is not an element of the set A then we write a /∈ A.

Suppose that A and B are subsets of S (denoted as A, B ⊆ S).

The empty set is denoted by ∅ (Note: ∅ ⊆ A for all subsets A of S).

Difference of A and B (A\B): Set of all elements in A which are not in B.

Intersection of A and B (A ∩ B): Set of all elements in S which are both

in A and in B.

Union of A and B (A∪B): Set of all elements in S that are in A or in B.

Complement of A (A∁ or A′): Set of all elements in S that are not in A.

Note that A ∩ A∁ = ∅ and A ∪ A∁ = S

A and B are disjoint if A and B have no common elements, that is A∩B =

∅. Two events A and B with this property are said to be mutually

exclusive.

Elements of Probability, Jan 23, 2003 - 7 -


The Postulates of Probability

A probability on a sample space S (and a set A of events) is a

function which assigns each subset A a value in [0, 1] and satisfies

the following rules:

Axiom 1: All probabilities are nonnegative: P(A) ≥ 0 for all events A.

Axiom 2: The probability of the whole sample space is 1: P(S) = 1.

Axiom 3 (Addition Rule): If two events A and B are mutually exclusive then

P(A ∪ B) = P(A) + P(B),

that is, the probability that one or the other occurs is the sum of their probabilities.

More generally, if countably many events Ai, i ∈ N, are mutually exclusive (i.e. Ai ∩ Aj = ∅ whenever i ≠ j) then

P(⋃_{i=1}^∞ Ai) = Σ_{i=1}^∞ P(Ai).

Elements of Probability, Jan 23, 2003 - 8 -


The Postulates of Probability

Classical Concept of Probability

The probability of an event A is defined as

P(A) = #A / #S,

where #A denotes the number of elements (outcomes) in A.

It satisfies

◦ P(A) ≥ 0

◦ P(S) = #S/#S = 1

◦ If A and B are mutually exclusive then

P(A ∪ B) = #(A ∪ B)/#S = #A/#S + #B/#S = P(A) + P(B).

Elements of Probability, Jan 23, 2003 - 9 -


The Postulates of Probability

Frequency Interpretation of Probability

The probability of an event A is defined as

P(A) = lim_{n→∞} n(A)/n,

where n(A) is the number of times event A occurred in n repetitions.

It satisfies

◦ P(A) ≥ 0

◦ P(S) = lim_{n→∞} n/n = 1

◦ If A and B are mutually exclusive then n(A ∪ B) = n(A) + n(B). Hence

P(A ∪ B) = lim_{n→∞} n(A ∪ B)/n
         = lim_{n→∞} (n(A)/n + n(B)/n)
         = lim_{n→∞} n(A)/n + lim_{n→∞} n(B)/n
         = P(A) + P(B).

Elements of Probability, Jan 23, 2003 - 10 -


The Postulates of Probability

Example: Toss of one die

The events A = {1} and B = {4, 5} are mutually exclusive.

Since all outcomes are equiprobable we obtain

P(A) = 1/6 and P(B) = 2/6 = 1/3.

The addition rule yields

P(A ∪ B) = 1/6 + 1/3 = 3/6 = 1/2.

On the other hand, we get for C = A ∪ B = {1, 4, 5}

P(C) = 3/6 = 1/2.

The first two axioms can be summarized by the

Cardinal Rule: For any subset A of S

0 ≤ P(A) ≤ 1.

In particular

◦ P(∅) = 0

◦ P(S) = 1

Elements of Probability, Jan 23, 2003 - 11 -


The Calculus of Probability

Let A and B be events in a sample space S.

Partition rule: P(A) = P(A ∩ B) + P(A ∩ B∁)

Example: Roll a pair of fair dice.

P(Total of 10) = P(Total of 10 and double) + P(Total of 10 and no double)
              = 1/36 + 2/36 = 3/36 = 1/12

Complementation rule: P(A∁) = 1 − P(A)

Example: Often useful for events of the type "at least one":

P(At least one even number) = 1 − P(No even number) = 1 − 9/36 = 3/4

Containment rule: P(A) ≤ P(B) for all A ⊆ B

Example: Compare two aces with doubles:

1/36 = P(Two aces) ≤ P(Doubles) = 6/36 = 1/6

Calculus of Probability, Jan 26, 2003 - 1 -


The Calculus of Probability

Inclusion and exclusion formula

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

Example: Roll a pair of fair dice.

P(Total of 10 or double) = P(Total of 10) + P(Double) − P(Total of 10 and double)
                        = 3/36 + 6/36 − 1/36 = 8/36 = 2/9

The two events are

Total of 10 = {46, 55, 64} and

Double = {11,22,33,44,55,66}

The intersection is

Total of 10 and double = {55}.

Adding the probabilities for the two events, the probability for the

event 55 is added twice.

Calculus of Probability, Jan 26, 2003 - 2 -


Conditional Probability

Probability gives chances for events in sample space S.

Often: Have partial information about event of interest.

Example: Number of Deaths in the U.S. in 1996

Cause All ages 1-4 5-14 15-24 25-44 45-64 ≥ 65

Heart 733,125 207 341 920 16,261 102,510 612,886

Cancer 544,161 440 1,035 1,642 22,147 132,805 386,092

HIV 32,003 149 174 420 22,795 8,443 22

Accidents1 92,998 2,155 3,521 13,872 26,554 16,332 30,564

Homicide2 24,486 395 513 6,548 9,261 7,717 52

All causes 2,171,935 5,947 8,465 32,699 148,904 380,396 1,717,218

1 Accidents and adverse effects, 2 Homicide and legal intervention

measure probability with respect to a subset of S

Conditional probability of A given B:

P(A|B) = P(A ∩ B)/P(B), if P(B) > 0.

If P(B) = 0 then P(A|B) is undefined.

Conditional probabilities for causes of death:

◦ P(accident) = 0.04282

◦ P(age=10) = 0.00390

◦ P(accident|age=10) = 0.42423

◦ P(accident|age=40) = 0.17832

Calculus of Probability, Jan 26, 2003 - 3 -


Conditional Probability

Example: Select two cards from 32 cards

◦ What is the probability that the second card is an ace?

P(2nd card is an ace) = 1/8

◦ What is the probability that the second card is an ace if the first was an ace?

P(2nd card is an ace | 1st card was an ace) = 3/31

Calculus of Probability, Jan 26, 2003 - 4 -


Multiplication rules

Example: Death Rates (per 100,000 people)

All Ages 1-4 5-14 15-24 25-44 45-64 ≥ 65

872.5 38.3 22.0 90.3 177.8 708.0 5071.4

Can we combine these rates with the table on causes of death?

◦ What is the probability to die from an accident (HIV)?

◦ What is the probability to die from an accident at age 10 (40)?

Know P(accident|die) = P(die from accident)/P(die)

⇒ P(die from accident) = P(accident|die)P(die)

Calculate probabilities:

◦ P(die from accident) = 0.04281 · 0.00873 = 0.00037

◦ P(die from accident|age = 10) = 0.42423 · 0.00090 = 0.00038

◦ P(die from accident|age = 40) = 0.17832 · 0.00178 = 0.00031

◦ P(die from HIV) = 0.01473 · 0.00873 = 0.00013

◦ P(die from HIV|age = 10) = 0.02055 · 0.00090 = 0.00002

◦ P(die from HIV|age = 40) = 0.15308 · 0.00178 = 0.00027

General multiplication rule: P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A)

Calculus of Probability, Jan 26, 2003 - 5 -


Independence

Example: Roll two dice

◦ What is the probability that the second die shows 1?

P(2nd die = 1) = 1/6

◦ What is the probability that the second die shows 1 if the first die already shows 1?

P(2nd die = 1 | 1st die = 1) = 1/6

◦ What is the probability that the second die shows 1 if the first does not show 1?

P(2nd die = 1 | 1st die ≠ 1) = 1/6

The chances of getting 1 with the second die are the same, no

matter what the first die shows. Such events are called indepen-

dent:

The event A is independent of the event B if its chances are not affected by the occurrence of B,

P(A|B) = P(A).

Equivalently, A and B are independent if

P(A ∩ B) = P(A)P(B).

Otherwise we say A and B are dependent.

Calculus of Probability, Jan 26, 2003 - 6 -


Let’s Make a Deal

The Rules:

◦ Three doors - one prize, two blanks

◦ Candidate selects one door

◦ The host reveals one losing door

◦ Candidate may switch doors


Would YOU change?

Can probability theory help you?

◦ What is the probability of winning if candidate switches doors?

◦ What is the probability of winning if candidate does not switch

doors?

Calculus of Probability, Jan 26, 2003 - 7 -


The Rule of Total Probability

Events of interest:

◦ A - choose the winning door at the beginning

◦ W - win the prize

Strategy: Switch doors (S)

Know: ◦ PS(W|A) = 0
      ◦ PS(W|A∁) = 1
      ◦ PS(A) = 1/3
      ◦ PS(A∁) = 2/3

Probability of interest, PS(W):

PS(W) = PS(W ∩ A) + PS(W ∩ A∁)
      = PS(W|A)PS(A) + PS(W|A∁)PS(A∁)
      = 0 · 1/3 + 1 · 2/3 = 2/3

Strategy: Do not switch doors (N)

Know: ◦ PN(W|A) = 1
      ◦ PN(W|A∁) = 0
      ◦ PN(A) = 1/3
      ◦ PN(A∁) = 2/3

Probability of interest, PN(W):

PN(W) = PN(W ∩ A) + PN(W ∩ A∁)
      = PN(W|A)PN(A) + PN(W|A∁)PN(A∁)
      = 1 · 1/3 + 0 · 2/3 = 1/3
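Both values are easy to confirm by simulation (a sketch, not part of the original notes; door labels are arbitrary):

import random

def play(switch, trials=100_000):
    wins = 0
    for _ in range(trials):
        prize = random.randrange(3)
        choice = random.randrange(3)
        # The host opens a losing door that is neither chosen nor winning.
        opened = next(d for d in range(3) if d != choice and d != prize)
        if switch:
            choice = next(d for d in range(3) if d != choice and d != opened)
        wins += (choice == prize)
    return wins / trials

print(play(switch=True))   # ~2/3
print(play(switch=False))  # ~1/3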

Calculus of Probability, Jan 26, 2003 - 8 -


The Rule of Total Probability

Rule of Total Probability

If B1, . . . , Bk are mutually exclusive and B1 ∪ . . . ∪ Bk = S, then

P(A) = P(A|B1)P(B1) + . . . + P(A|Bk)P(Bk)

Example:

Suppose an applicant for a job has been invited for an interview.

The chance that

◦ he is nervous is P(N) = 0.7,

◦ the interview is successful if he is nervous is P(S|N) = 0.2,

◦ the interview is successful if he is not nervous is P(S|N∁) = 0.9.

What is the probability that the interview is successful?

P(S) = P(S|N)P(N) + P(S|N∁)P(N∁) = 0.2 · 0.7 + 0.9 · 0.3 = 0.41

Calculus of Probability, Jan 26, 2003 - 9 -


The Rule of Total Probability

Example:

Suppose we have two unfair coins:

◦ Coin 1 comes up heads with probability 0.8

◦ Coin 2 comes up heads with probability 0.35

Choose a coin at random and flip it. What is the probability that it comes up heads?

Events: H = "heads comes up", C1 = "1st coin", C2 = "2nd coin"

P(H) = P(H|C1)P(C1) + P(H|C2)P(C2) = (1/2)(0.8 + 0.35) = 0.575

Calculus of Probability, Jan 26, 2003 - 10 -


Bayes’ Theorem

Example: O.J. Simpson

"Only about 1/10 of one percent of wife-batterers actually murder their wives"

Lawyer of O.J. Simpson on TV

Fact: Simpson pleaded no contest to beating his wife in 1988.

So he murdered his wife with probability 0.001?

◦ Sample space S - married couples in U.S. in which the husband

beat his wife in 1988

◦ Event H - all couples in S in which the husband has since

murdered his wife

◦ Event M - all couples in S in which the wife has been murdered

since 1988

We have ◦ P(H) = 0.001

◦ P(M |H) = 1 since H ⊆ M

◦ P(M |H∁) = 0.0001 at most in the U.S.

Then

P(H|M) = P(M|H)P(H) / P(M)
       = P(M|H)P(H) / [P(M|H)P(H) + P(M|H∁)P(H∁)]
       = 0.001 / (0.001 + 0.0001 · 0.999) ≈ 0.91

Calculus of Probability, Jan 26, 2003 - 11 -


Bayes’ Theorem

Reversal of conditioning (general multiplication rule):

P(B|A)P(A) = P(A|B)P(B)

Rewriting P(A) using the rule of total probability we obtain

Bayes' Theorem

P(B|A) = P(A|B)P(B) / [P(A|B)P(B) + P(A|B∁)P(B∁)]

If B1, . . . , Bk are mutually exclusive and B1 ∪ . . . ∪ Bk = S, then

P(Bi|A) = P(A|Bi)P(Bi) / [P(A|B1)P(B1) + . . . + P(A|Bk)P(Bk)]

(General form of Bayes' Theorem)

Calculus of Probability, Jan 26, 2003 - 12 -


Bayes’ Theorem

Example: Testing for AIDS

Enzyme immunoassay test for HIV:

◦ P(T+|I+) = 0.98 (sensitivity - positive for infected)

◦ P(T-|I-) = 0.995 (specificity - negative for noninfected)

◦ P(I+) = 0.0003 (prevalence)

What is the probability that the tested person is infected if the test was positive?

P(I+|T+) = P(T+|I+)P(I+) / [P(T+|I+)P(I+) + P(T+|I-)P(I-)]
         = (0.98 · 0.0003) / (0.98 · 0.0003 + 0.005 · 0.9997)
         = 0.05556

Consider a different population with P(I+) = 0.1 (greater risk):

P(I+|T+) = (0.98 · 0.1) / (0.98 · 0.1 + 0.005 · 0.9) = 0.956

testing on large scale not sensible (too many false positives)

Repeat test (Bayesian updating):

◦ P(I+|T++) = 0.92 in 1st population

◦ P(I+|T++) = 0.9998 in 2nd population
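The updates follow from one application of Bayes' theorem per test; a small sketch with the sensitivity and specificity from above:

def posterior(prior, sens=0.98, spec=0.995):
    # P(infected | positive test) by Bayes' theorem
    return sens * prior / (sens * prior + (1 - spec) * (1 - prior))

p = posterior(0.0003)     # ~0.056 after one positive test
print(p, posterior(p))    # ~0.92 after a second positive test

q = posterior(0.1)        # ~0.956 in the higher-risk population
print(q, posterior(q))    # ~0.9998 after a second positive test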

Calculus of Probability, Jan 26, 2003 - 13 -


Random Variables

Aim: ◦ Learn about population

◦ Available information: observed data x1, . . . , xn

Problem: ◦ Data affected by chance variation

◦ New set of data would look different

Suppose we observe/measure some characteristic (variable) of n

individuals. The actual observed values x1, . . . , xn are the outcome

of a random phenomenon.

Random variable: a variable whose value is a numerical out-

come of a random phenomenon

Remark: Mathematically, a random variable is a real-valued function on the sample space S:

X : S → R, ω ↦ x = X(ω)

◦ SX = X(S) is the sample space of the random variable.

◦ The outcome x = X(ω) is called a realisation of X.

◦ X induces a probability P(B) = P(X ∈ B) on SX, the probability distribution of X.

Example: Roll one die

Outcome ω        1 2 3 4 5 6
Realization X(ω) 1 2 3 4 5 6

Random Variables, Jan 28, 2003 - 1 -


Random Variables

Example: Roll two dice

◦ X1 - number on the first die

◦ X2 - number on the second die

◦ Y = X1 + X2 - total number of points

(a function of random variables is again a random variable)

Table of outcomes (all 36 equally likely pairs (X1, X2) and the total Y):

X1\X2   1  2  3  4  5  6
1       2  3  4  5  6  7
2       3  4  5  6  7  8
3       4  5  6  7  8  9
4       5  6  7  8  9 10
5       6  7  8  9 10 11
6       7  8  9 10 11 12

Random Variables, Jan 28, 2003 - 2 -


Random Variables

Two important types of random variables:

• Discrete random variable

◦ takes values in a finite or countable set

• Continuous random variable

◦ takes values in a continuum, or uncountable set

◦ probability of any particular outcome x is zero: P(X = x) = 0 for all x ∈ SX

Example: Ten tosses of a coin

Suppose we toss a coin ten times. Let

◦ X be the number of heads in ten tosses of a coin

◦ Y be the time it takes to toss ten times

Random Variables, Jan 28, 2003 - 3 -


Discrete Random Variables

Suppose X is a discrete random variable with values x1, x2, . . ..

Example: Roll two dice

Y = X1 + X2 total number of points

y        2    3    4    5    6    7    8    9    10   11   12
P(Y = y) 1/36 2/36 3/36 4/36 5/36 6/36 5/36 4/36 3/36 2/36 1/36

Frequency function: The function

p(x) = P(X = x) = P({ω ∈ S|X(ω) = x})

is called the frequency function or probability mass function.

Note: p defines a probability on SX = {x1, x2, . . .}:

P(B) = Σ_{x∈B} p(x) = P(X ∈ B).

We call P the (probability) distribution of X .

Properties of a discrete probability distribution

◦ p(x) ≥ 0 for all values of X

◦ Σ_i p(xi) = 1

Random Variables, Jan 28, 2003 - 4 -


Discrete Random Variables

Example: Roll one die

Let X denote the number of points on the face turned up. Since all numbers are equally likely we obtain

p(x) = P(X = x) = 1/6 if x ∈ {1, . . . , 6}, and 0 otherwise.

Example: Roll two dice

The probability mass function of the total number of points

Y = X1 + X2

can be written as

p(y) = P(Y = y) = (1/36)(6 − |y − 7|) if y ∈ {2, . . . , 12}, and 0 otherwise.

Example: Three tosses of a coin

Let X be the number of heads in three tosses of a coin. There are C(3, x) outcomes with x heads and 3 − x tails, thus

p(x) = C(3, x) · 1/8.

Random Variables, Jan 28, 2003 - 5 -


Continuous Random Variables

For a continuous random variable X, the probability that X falls in the interval (a, b] is given by

P(a < X ≤ b) = ∫_a^b f(x) dx,

where f is the density function of X .

Note: The density defines a probability on R:

P([a, b]) = ∫_a^b f(x) dx = P(X ∈ [a, b])

We call P the (probability) distribution of X .

Remark: The definition of P can be extended to (almost) all B ⊆ R.

Example: Spinner

Consider a spinner that turns freely on its axis and slowly comes to a stop.

◦ X is the stopping point on the circle marked from 0 to 1.

◦ X can take any value in SX = [0, 1).

◦ The outcomes of X are uniformly distributed over the interval [0, 1).

Then the density function of X is

f(x) = 1 if 0 ≤ x < 1, and 0 otherwise.

Consequently

P(X ∈ [a, b]) = b − a.

Note that for all possible outcomes x ∈ [0, 1) we have

P(X ∈ [x, x]) = x − x = 0.

Random Variables, Jan 28, 2003 - 6 -


Independence of Random Variables

Recall: Two events A and B are independent if P(A ∩ B) = P(A)P(B).

Independence of Random Variables

Two discrete random variables X and Y are independent if P(X ∈ A, Y ∈ B) = P(X ∈ A)P(Y ∈ B)

for all A ⊆ SX and B ⊆ SY .

Remark: It is sufficient to show that P(X = x, Y = y) = pX(x) pY(y) = P(X = x)P(Y = y)

for all x ∈ SX and y ∈ SY .

More generally, X1, X2, . . . are independent if for all n ∈ N

P(X1 ∈ A1, . . . , Xn ∈ An) = P(X1 ∈ A1) · · · P(Xn ∈ An)

for all Ai ⊆ SXi.

Example: Toss coin three times

Consider

Xi = 1 if head in the ith toss of the coin, and 0 otherwise.

X1, X2, and X3 are independent:

P(X1 = x1, X2 = x2, X3 = x3) = 1/8 = P(X1 = x1)P(X2 = x2)P(X3 = x3)

Random Variables, Jan 28, 2003 - 7 -


Multivariate Distributions: Discrete Case

Discrete Case

Let X and Y be discrete random variables.

Joint frequency function of X and Y

pXY (x, y) = P(X = x, Y = y) = P({X = x} ∩ {Y = y})

Marginal frequency function of X:

pX(x) = Σ_i pXY(x, yi)

Marginal frequency function of Y:

pY(y) = Σ_i pXY(xi, y)

The random variables X and Y are independent if and only if

pXY (x, y) = pX(x) pY (y)

for all possible values x ∈ SX and y ∈ SY .

Conditional probability of X = x given Y = y:

P(X = x|Y = y) = pX|Y(x|y) = pXY(x, y)/pY(y) = P(X = x, Y = y)/P(Y = y),

where pX|Y(x|y) is the conditional frequency function.

Random Variables, Jan 28, 2003 - 8 -


Multivariate Distributions

Discrete Case

Example: Three Tosses of a Coin

◦ X - number of heads on the first toss (values in {0, 1})

◦ Y - total number of heads (values in {0, 1, 2, 3})

The joint frequency function pXY(x, y) is given by the following table (the margins give pX and pY):

x\y    0    1    2    3   | pX(x)
0      1/8  2/8  1/8  0   | 1/2
1      0    1/8  2/8  1/8 | 1/2
pY(y)  1/8  3/8  3/8  1/8 | 1

Marginal frequency function of Y:

pY(0) = P(Y = 0) = P(Y = 0, X = 0) + P(Y = 0, X = 1) = 1/8 + 0 = 1/8

pY(1) = P(Y = 1) = P(Y = 1, X = 0) + P(Y = 1, X = 1) = 2/8 + 1/8 = 3/8

...

Random Variables, Jan 28, 2003 - 9 -


Multivariate Distributions

Continuous Case

Let X and Y be continuous random variables.

Joint density function of X and Y: fXY such that

∫_A ∫_B fXY(x, y) dy dx = P(X ∈ A, Y ∈ B)

Marginal density function of X:

fX(x) = ∫ fXY(x, y) dy

Marginal density function of Y:

fY(y) = ∫ fXY(x, y) dx

The random variables X and Y are independent if and only if

fXY (x, y) = fX(x) fY (y)

for all possible values x ∈ SX and y ∈ SY .

Conditional density function of X given Y = y:

fX|Y(x|y) = fXY(x, y)/fY(y)

Conditional probability of X ∈ A given Y = y:

P(X ∈ A|Y = y) = ∫_A fX|Y(x|y) dx

Random Variables, Jan 28, 2003 - 10 -


Bernoulli Distribution

Example: Toss of coin

Define X = 1 if head comes up and X = 0 if tail comes up.

Both realizations are equally likely: P(X = 1) = P(X = 0) = 1/2

Examples:

Often: Two outcomes which are not equally likely:

◦ Success of medical treatment

◦ Interviewed person is female

◦ Student passes exam

◦ Transmittance of a disease

Bernoulli distribution (with parameter θ)

◦ X takes the two values 0 and 1 with probabilities 1 − θ and θ, respectively

◦ Frequency function of X:

p(x) = θ^x (1 − θ)^(1−x) for x ∈ {0, 1}, and 0 otherwise

◦ Often:

X = 1 if event A has occurred, and 0 otherwise

Example: A = blood pressure above 140/90 mm Hg.

Distributions, Jan 30, 2003 - 1 -


Bernoulli Distribution

Let X1, . . . , Xn be independent Bernoulli random variables with

the same parameter θ.

Frequency function of X1, . . . , Xn:

p(x1, . . . , xn) = p(x1) · · · p(xn) = θ^(x1+...+xn) (1 − θ)^(n−x1−...−xn)

for xi ∈ {0, 1} and i = 1, . . . , n

Example: Paired-Sample Sign Test

◦ Study success of new elaborate safety program

◦ Record average weekly losses in hours of labor due to accidents before

and after installation of the program in 10 industrial plants

Plant 1 2 3 4 5 6 7 8 9 10

Before 45 73 46 124 33 57 83 34 26 17

After 36 60 44 119 35 51 77 29 24 11

Define for the ith plant

Xi = 1 if the first value is greater than the second, and 0 otherwise

Result: 1 1 1 1 0 1 1 1 1 1

The Xi’s are independently Bernoulli distributed with unknown

parameter θ.

Distributions, Jan 30, 2003 - 2 -


Binomial Distribution

Let X1, . . . , Xn be independent Bernoulli random variables

◦ Often only interested in number of successes

Y = X1 + . . . + Xn

Example: Paired Sample Sign Test (contd)

Define for the ith plant

Xi = 1 if the first value is greater than the second, and 0 otherwise,

and let

Y = Σ_{i=1}^n Xi.

Y is the number of plants for which the number of lost hours has

decreased after the installation of the safety program

We know:

◦ Xi is Bernoulli distributed with parameter θ

◦ Xi’s are independent

What is the distribution of Y ?

◦ Probability of a realization x1, . . . , xn with y successes:

p(x1, . . . , xn) = θ^y (1 − θ)^(n−y)

◦ Number of different realizations with y successes: C(n, y)

Distributions, Jan 30, 2003 - 3 -


Binomial Distribution

Binomial distribution (with parameters n and θ)

Let X1, . . . , Xn be independent and Bernoulli distributed with parameter θ and

Y = Σ_{i=1}^n Xi.

Y has frequency function

p(y) = C(n, y) θ^y (1 − θ)^(n−y) for y ∈ {0, . . . , n}

Y is binomially distributed with parameters n and θ. We write Y ∼ Bin(n, θ).

Note that

◦ the number of trials is fixed,

◦ the probability of success is the same for each trial, and

◦ the trials are independent.

Example: Paired Sample Sign Test (contd)

Let Y be the number of plants for which the number of lost hours

has decreased after the installation of the safety program. Then

Y ∼ Bin(n, θ)
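For a first feel of what this model buys us: under θ = 1/2 (the program has no effect), the observed 9 improvements out of 10 plants would be quite unlikely. A sketch of that computation (formal tests come later in the course):

from math import comb

def binom_pmf(y, n, theta):
    return comb(n, y) * theta**y * (1 - theta)**(n - y)

# P(Y >= 9) for Y ~ Bin(10, 1/2)
print(sum(binom_pmf(y, 10, 0.5) for y in (9, 10)))  # 11/1024, about 0.011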

Distributions, Jan 30, 2003 - 4 -

Binomial Distribution

Binomial distribution for n = 10

[Figure: frequency functions p(x) of Bin(10, θ) for θ = 0.1, 0.3, 0.5, 0.8.]

Distributions, Jan 30, 2003 - 5 -


Geometric Distribution

Consider a sequence of independent Bernoulli trials.

◦ On each trial, a success occurs with probability θ.

◦ Let X be the number of trials up to the first success.

What is the distribution of X?

◦ Probability of no success in x − 1 trials: (1 − θ)^(x−1)

◦ Probability of one success in the xth trial: θ

The frequency function of X is

p(x) = θ(1 − θ)^(x−1), x = 1, 2, 3, . . .

X is geometrically distributed with parameter θ.

Example:

Suppose a batter has probability 1/3 to hit the ball. What is the chance that he misses the ball less than 3 times?

The number X of balls up to the first hit is geometrically distributed with parameter 1/3. Thus

P(X ≤ 3) = 1/3 + (1/3)(2/3) + (1/3)(2/3)² = 19/27 ≈ 0.7037.
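A two-line check of this value; the complement rule (no hit in the first three balls) gives the same answer:

theta = 1 / 3

p = sum(theta * (1 - theta) ** (x - 1) for x in (1, 2, 3))
print(p, 1 - (1 - theta) ** 3)   # both ~0.7037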

Distributions, Jan 30, 2003 - 6 -


Hypergeometric Distribution

Example: Quality Control

Quality control - sample and examine fraction of produced units

◦ N produced units

◦ M defective units

◦ n sampled units

What is the probability that the sample contains x defective units?

The frequency function of X is

p(x) = C(M, x) C(N − M, n − x) / C(N, n), x = 0, 1, . . . , n.

X is a hypergeometric random variable with parameters N , M ,

and n.

Example:

Suppose that of 100 applicants for a job 50 were women and 50 were men,

all equally qualified. If we select 10 applicants at random what is the

probability that x of them are female?

The number of chosen female applicants is hypergeometrically distributed with parameters 100, 50, and 10. The frequency function is

p(x) = C(50, x) C(50, 10 − x) / C(100, 10) for x = 0, 1, . . . , 10.

Distributions, Jan 30, 2003 - 7 -


Poisson Distribution

Often we are interested in the number of events which occur in a specific period of time or in a specific area or volume:

◦ Number of alpha particles emitted from a radioactive source during a

given period of time

◦ Number of telephone calls coming into an exchange during one unit of

time

◦ Number of diseased trees per acre of a certain woodland

◦ Number of death claims received per day by an insurance company

Characteristics

Let X be the number of times a certain event occurs during a given

unit of time (or in a given area, etc).

◦ The probability that the event occurs in a given unit of time is

the same for all the units.

◦ The number of events that occur in one unit of time is inde-

pendent of the number of events in other units.

◦ The mean (or expected) rate is λ.

Then X is a Poisson random variable with parameter λ and

frequency function

p(x) = (λ^x / x!) e^(−λ), x = 0, 1, 2, . . .

Distributions, Jan 30, 2003 - 8 -


Poisson Approximation

The Poisson distribution is often used as an approximation for

binomial probabilities when n is large and θ is small:

p(x) = C(n, x) θ^x (1 − θ)^(n−x) ≈ (λ^x / x!) e^(−λ)

with λ = nθ.

Example: Fatalities in Prussian cavalry

Classical example from von Bortkiewicz (1898).

◦ Number of fatalities resulting from being kicked by a horse

◦ 200 observations (10 corps over a period of 20 years)

Statistical model:

◦ Each soldier is kicked to death by a horse with probability θ.

◦ Let Y be the number of such fatalities in one corps. Then

Y ∼ Bin(n, θ)

where n is the number of soldiers in one corps.

Observation: The data are well approximated by a Poisson distribution

with λ = 0.61

Deaths per Year Observed Rel. Frequency Poisson Prob.

0 109 0.545 0.543

1 65 0.325 0.331

2 22 0.110 0.101

3 3 0.015 0.021

4 1 0.005 0.003
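The Poisson column of this table can be reproduced directly from the frequency function (a quick sketch):

from math import exp, factorial

lam = 0.61
rel_freq = {0: 0.545, 1: 0.325, 2: 0.110, 3: 0.015, 4: 0.005}

for x, observed in rel_freq.items():
    poisson = lam**x / factorial(x) * exp(-lam)
    print(x, observed, round(poisson, 3))   # matches the table above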

Distributions, Jan 30, 2003 - 9 -

Poisson Approximation

Poisson approximation of Bin(40, θ)

[Figure: frequency functions of Bin(40, θ) for θ = 1/4, 1/8, 1/40, 1/400 compared with Poisson frequency functions with λ = nθ = 10, 5, 1, 1/10.]

Distributions, Jan 30, 2003 - 10 -


Continuous Distributions

Uniform distribution U(0, θ): Range (0, θ)

f(x) = (1/θ) 1_(0,θ)(x), E(X) = θ/2, var(X) = θ²/12

Exponential distribution Exp(λ): Range [0, ∞)

f(x) = λ exp(−λx) 1_[0,∞)(x), E(X) = 1/λ, var(X) = 1/λ²

Normal distribution N(µ, σ²): Range R

f(x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)), E(X) = µ, var(X) = σ²

[Figure: histograms of samples from the U(0, θ), Exp(λ), and N(µ, σ²) distributions, with boxplots comparing the three.]

Distributions, Jan 30, 2003 - 11 -


Expected Value

Let X be a discrete random variable which takes values in SX = {x1, x2, . . . , xn}.

Expected Value or Mean of X:

E(X) = Σ_{i=1}^n xi p(xi)

Example: Roll one die

Let X be the outcome of rolling one die. The frequency function is

p(x) = 1/6, x = 1, . . . , 6,

and hence

E(X) = Σ_{x=1}^6 x/6 = 7/2 = 3.5

Example: Bernoulli random variable

Let X ∼ Bin(1, θ), so p(x) = θ^x (1 − θ)^(1−x). Thus the mean of X is

E(X) = 0 · (1 − θ) + 1 · θ = θ.

Expected Value and Variance, Feb 2, 2003 - 1 -


Expected Value

Linearity of the expected value

Let X and Y be two discrete random variables. Then

E(aX + bY) = aE(X) + bE(Y)

for any constants a, b ∈ R. Note: No independence is required.

Proof:

E(aX + bY) = Σ_{x,y} (ax + by) p(x, y)
           = a Σ_{x,y} x p(x, y) + b Σ_{x,y} y p(x, y)
           = a Σ_x x p(x) + b Σ_y y p(y)    (using Σ_y p(x, y) = p(x) and Σ_x p(x, y) = p(y))
           = aE(X) + bE(Y)

Example: Binomial distribution

Let X ∼ Bin(n, θ). Then X = X1 + . . . + Xn with Xi ∼ Bin(1, θ):

E(X) = Σ_{i=1}^n E(Xi) = Σ_{i=1}^n θ = nθ

Expected Value and Variance, Feb 2, 2003 - 2 -


Expected Value

Example: Poisson distribution

Let X be a Poisson random variable with parameter λ.

E(X) = Σ_{x=0}^∞ x (λ^x / x!) e^(−λ)
     = λ e^(−λ) Σ_{x=1}^∞ λ^(x−1) / (x − 1)!
     = λ e^(−λ) e^λ
     = λ

Remarks:

◦ For most distributions some “advanced” knowledge of calculus

is required to find the mean.

◦ Use tables for the means of commonly used distributions.

Expected Value and Variance, Feb 2, 2003 - 3 -


Expected Value

Example: European Call Options

Agreement that gives an investor the right (but not the obligation) to buy a stock, bond, commodity, or other instrument at a specific time T at a specific price K (the strike price).

What is a fair price P for a European call option?

If ST is the price of the stock at time T, the profit will be

Profit = (ST − K)+ − P.

Profit is a random variable.

[Figure: profit (ST − K)+ − P of the option as a function of the stock price ST.]

The fair price P for this option is the expected value

P = E(ST − K)+.

Expected Value and Variance, Feb 2, 2003 - 4 -


Expected Value

Example: European Call Options (contd)

Consider the following simple model:

◦ St = St−1 + εt, t = 1, . . . , T

◦ P(εt = 1) = p and P(εt = −1) = 1 − p.

St is also called a random walk.

The distribution of ST is given by (s0 known at time 0)

ST = s0 + 2 Y − T, with Y ∼ Bin(T, p)

Therefore the price P is (assuming s0 = 0 without loss of generality)

P = E(ST − K)+ = Σ_{y=0}^T (2y − T − K) pθ(y) 1{y > (K + T)/2},

where pθ is the Bin(T, θ) frequency function with θ = P(εt = 1). For T = 20, K = 10, θ = 0.6 this gives P = 2.75.

[Figure: frequency function of the profit.]

Expected Value and Variance, Feb 2, 2003 - 5 -


Expected Value

Example: Group testing

Suppose that a large number of blood samples are to be screened for a rare

disease with prevalence 1 − p.

• If each sample is assayed individually, n tests will be required.

• Alternative scheme:

◦ n samples, m groups of k samples each (n = mk)

◦ Split each sample in half and pool all samples in one group

◦ Test pooled sample for each group

◦ If the pooled test is positive, test all samples in the group separately

What is the expected number of tests N under this alternative scheme?

Let Xi be the number of tests in group i. The frequency function of Xi is

p(x) = p^k if x = 1, and 1 − p^k if x = k + 1.

The expected number of tests in each group is

E(Xi) = p^k + (k + 1)(1 − p^k) = k + 1 − k p^k

Hence

E(N) = Σ_{i=1}^m E(Xi) = m(k + 1 − k p^k) = n(1 + 1/k − p^k)
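A sketch of the minimization; p = 0.99 (prevalence 1%) is an assumption chosen here for illustration, and it reproduces the minimum at k = 11 shown in the plot below:

p = 0.99

def tests_per_sample(k):
    # expected number of tests per sample, E(N)/n = 1 + 1/k - p**k
    return 1 + 1 / k - p ** k

best = min(range(2, 30), key=tests_per_sample)
print(best, round(tests_per_sample(best), 4))   # 11, about 0.1956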

Plot of E(N): the mean is minimized for groups of size 11.

[Figure: expected proportion of tests E(N)/n as a function of the group size k.]

Expected Value and Variance, Feb 2, 2003 - 6 -


Variance

Let X be a random variable.

Variance of X:

var(X) = E(X − E(X))².

The variance of X is the expected squared distance of X from its

mean.

Suppose X is a discrete random variable with SX = {x1, . . . , xn}. Then the variance of X can be written as

var(X) = Σ_{i=1}^n (xi − Σ_{j=1}^n xj p(xj))² p(xi)

Example: Roll one die

X takes values in {1, 2, 3, 4, 5, 6} with frequency function p(x) = 1/6.

E(X) = Σ_{x=1}^6 x · (1/6) = 7/2

var(X) = Σ_{x=1}^6 (x − 7/2)² · (1/6) = (1/6)(25/4 + 9/4 + 1/4 + 1/4 + 9/4 + 25/4) = 35/12

We often denote the variance of a random variable X by σ²X,

σ²X = var(X),

and its standard deviation by σX.

Expected Value and Variance, Feb 2, 2003 - 7 -


Properties of the Variance

The variance can also be written as

var(X) = E(X²) − (E(X))²

To see this (using linearity of the mean):

var(X) = E(X − E(X))²
       = E[X² − 2X E(X) + (E(X))²]
       = E(X²) − 2E(X)E(X) + (E(X))²
       = E(X²) − (E(X))²

Example: Let X ∼ Bin(1, θ). Then

var(X) = E(X²) − (E(X))² = E(X) − (E(X))² = θ − θ² = θ(1 − θ)

Rules for the variance:

◦ For constants a and b

var(aX + b) = a²var(X).

◦ For independent random variables X and Y

var(X + Y ) = var(X) + var(Y ).

Example: Let X ∼ Bin(n, θ). Then

var(X) = n θ (1 − θ)

Expected Value and Variance, Feb 2, 2003 - 8 -


Covariance

For independent random variables X and Y we have

var(X + Y ) = var(X) + var(Y ).

Question: What about dependent random variables?

It can be shown that

var(X + Y ) = var(X) + var(Y ) + 2 cov(X, Y )

where

cov(X, Y) = E[(X − E(X))(Y − E(Y))]

is the covariance of X and Y .

Properties of the covariance

◦ cov(X, Y ) = E(XY ) − E(X)E(Y )

◦ cov(X, X) = var(X)

◦ cov(X, 1) = 0

◦ cov(X, Y ) = cov(Y, X)

◦ cov(a X1 + b X2, Y ) = a cov(X1, Y ) + b cov(X2, Y )

Expected Value and Variance, Feb 2, 2003 - 9 -


Covariance

Important:

cov(X, Y ) = 0 does NOT imply that X and Y are independent.

Example:

Suppose X ∈ {−1, 0, 1} with probabilities P(X = x) = 13

for

x = −1, 0, 1. Then E(X) = 0 and

cov(X, X2) = E(X3) = E(X) = 0

On the other handP(X = 1, X2 = 0) = 0 6= 19 = P(X = 1)P(X2 = 0),

that is, X and Y are not independent!

Note: The covariance of X and Y measures only linear depen-

dence.

Expected Value and Variance, Feb 2, 2003 - 10 -


Correlation

The correlation coefficient ρ is defined as

ρXY = corr(X, Y) = cov(X, Y) / √(var(X) var(Y)).

Properties:

◦ dimensionless quantity

◦ not affected by linear transformations, i.e.

corr(a X + b, c Y + d) = corr(X, Y )

◦ −1 ≤ ρXY ≤ 1

◦ |ρXY| = 1 if and only if P(Y = a + bX) = 1 for some a and b

◦ measures linear association between X and Y

Example: Three boxes: pp, pd, and dd (Ex 3.6)

Let Xi = 1{penny on ith draw}. Then Xi ∼ Bin(1, p) with p = 1/2 and joint frequency function p(x1, x2):

x1\x2   0    1
0       1/3  1/6
1       1/6  1/3

Thus:

cov(X1, X2) = E[(X1 − p)(X2 − p)] = (1/4)(1/3) + (1/4)(1/3) − 2 · (1/4)(1/6) = 1/12

corr(X1, X2) = (1/12)/(1/4) = 4 · 1/12 = 1/3

Expected Value and Variance, Feb 2, 2003 - 11 -


Prediction

An instructor standardizes his midterm and final so the class aver-

age is µ = 75 and the SD is σ = 10 on both tests. The correlation

between the tests is always around ρ = 0.50.

◦ X - score of student on the first examination

◦ Y - score of student on the second examination

Since X and Y are dependent we should be able to predict the

score in the final from the midterm score.

Approach:

◦ Predict Y from linear function a + b X

◦ Minimize the mean squared error

MSE = E(Y − a − bX)² = var(Y − bX) + [E(Y − a − bX)]²

Solution:

a = µ − bµ and b = σXY/σ²X = ρ

Thus the best linear predictor is

Ŷ = µ + ρ(X − µ)

Note:

We expect the student’s score on the final to differ from the mean

only by half the difference observed in the midterm (regression to

the mean).

Expected Value and Variance, Feb 2, 2003 - 12 -


Summary

Bernoulli distribution - Bin(1, θ):
p(x) = θ^x (1 − θ)^(1−x), E(X) = θ, var(X) = θ(1 − θ)

Binomial distribution - Bin(n, θ):
p(x) = C(n, x) θ^x (1 − θ)^(n−x), E(X) = nθ, var(X) = nθ(1 − θ)

Poisson distribution - Poiss(λ):
p(x) = (λ^x / x!) e^(−λ), E(X) = λ, var(X) = λ

Geometric distribution:
p(x) = θ(1 − θ)^(x−1), E(X) = 1/θ, var(X) = (1 − θ)/θ²

Hypergeometric distribution - H(N, M, n):
p(x) = C(M, x) C(N − M, n − x) / C(N, n), E(X) = nM/N

Expected Value and Variance, Feb 2, 2003 - 13 -


Properties of the Sample Mean

Consider X1, . . . , Xn independent and identically distributed (iid) with mean µ and variance σ².

X̄ = (1/n) Σ_{i=1}^n Xi (sample mean)

Then

E(X̄) = (1/n) Σ_{i=1}^n µ = µ

var(X̄) = (1/n²) Σ_{i=1}^n σ² = σ²/n

Remarks:

◦ The sample mean is an unbiased estimate of the true mean.

◦ The variance of the sample mean decreases as the sample size

increases.

◦ Law of Large Numbers: It can be shown that for n → ∞

X̄ = (1/n) Σ_{i=1}^n Xi → µ.

Question:

◦ How close to µ is the sample mean for finite n?

◦ Can we answer this without knowing the distribution of X?

Central Limit Theorem, Feb 4, 2004 - 1 -


Properties of the Sample Mean

Chebyshev’s inequality

Let X be a random variable with mean µ and variance σ². Then for any ε > 0

P(|X − µ| > ε) ≤ σ²/ε².

Proof (for discrete X): Let 1{|xi − µ| > ε} = 1 if |xi − µ| > ε, and 0 otherwise. Then

P(|X − µ| > ε) = Σ_i 1{|xi − µ| > ε} p(xi) = Σ_i 1{(xi − µ)²/ε² > 1} p(xi)
              ≤ Σ_i ((xi − µ)²/ε²) p(xi) = σ²/ε²

Application to the sample mean:

P(µ − 3σ/√n ≤ X̄ ≤ µ + 3σ/√n) ≥ 1 − 1/9 ≈ 0.889

However, the bound is known to be not very precise.

Example: Xi iid ∼ N(0, 1). Then

X̄ = (1/n) Σ_{i=1}^n Xi ∼ N(0, 1/n).

Therefore

P(−3/√n ≤ X̄ ≤ 3/√n) = 0.997

Central Limit Theorem, Feb 4, 2004 - 2 -


Central Limit Theorem

Let X1, X2, . . . be a sequence of random variables

◦ independent and identically distributed

◦ with mean µ and variance σ2.

For n ∈ N define

Zn = √n (X̄ − µ)/σ = (1/√n) Σ_{i=1}^n (Xi − µ)/σ.

Zn has mean 0 and variance 1.

Central Limit Theorem

For large n, the distribution of Zn can be approximated by the standard normal distribution N(0, 1). More precisely,

lim_{n→∞} P(a ≤ √n (X̄ − µ)/σ ≤ b) = Φ(b) − Φ(a),

where Φ(z) is the standard normal probability

Φ(z) = ∫_{−∞}^z f(x) dx,

that is, the area under the standard normal curve to the left of z.

Example:

◦ U1, . . . , U12 uniformly distributed on [0, 12), so µ = 6, σ² = 12²/12 = 12, and σ/√n = √12/√12 = 1.

◦ What is the probability that the sample mean exceeds 9?

P(Ū > 9) = P((Ū − 6)/1 > 3) ≈ 1 − Φ(3) = 0.0013

Central Limit Theorem, Feb 4, 2004 - 3 -

Central Limit Theorem

[Figure: densities of the standardized sample mean for U[0, 1] samples and for Exp(1) samples, each with n = 1, 2, 6, 12, 100; both sequences approach the N(0, 1) density, the skewed exponential case more slowly.]

Central Limit Theorem, Feb 4, 2004 - 4 -


Central Limit Theorem

Example: Shipping packages

Suppose a company ships packages that vary in weight:

◦ Packages have mean 15 lb and standard deviation 10 lb.

◦ They come from a large number of customers, i.e. packages are independent.

Question: What is the probability that 100 packages will have a

total weight exceeding 1700 lb?

Let Xi be the weight of the ith package and

T = Σ_{i=1}^{100} Xi.

Then E(T) = 100 · 15 lb = 1500 lb and sd(T) = √100 · 10 lb = 100 lb, so

P(T > 1700 lb) = P((T − 1500 lb)/(100 lb) > (1700 lb − 1500 lb)/(100 lb))
              = P((T − 1500 lb)/(100 lb) > 2)
              ≈ 1 − Φ(2) = 0.023
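The same tail probability in Python, using the exact identity Φ(z) = (1 + erf(z/√2))/2 (a sketch):

from math import erf, sqrt

def phi(z):
    # standard normal cumulative distribution function
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sd, n = 15, 10, 100
z = (1700 - n * mu) / (sqrt(n) * sd)   # standardized total weight, = 2
print(1 - phi(z))                      # ~0.023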

Central Limit Theorem, Feb 4, 2004 - 5 -


Central Limit Theorem

Remarks

• How fast approximation becomes good depends on distribution

of Xi’s:

◦ If it is symmetric and has tails that die off rapidly, n can

be relatively small.

Example: If Xiiid∼ U [0, 1], the approximation is good for

n = 12.

◦ If it is very skewed or if its tails die down very slowly, a

larger value of n is needed.

Example: Exponential distribution.

• Central limit theorems are very important in statistics.

• There are many central limit theorems covering many situa-

tions, e.g.

◦ for not identically distributed random variables or

◦ for dependent, but not “too” dependent random variables.

Central Limit Theorem, Feb 4, 2004 - 6 -


The Normal Approximation to the Binomial

Let X be binomially distributed with parameters n and p.

Recall that X is the sum of n iid Bernoulli random variables,

X = Σ_{i=1}^n Xi, Xi iid ∼ Bin(1, p).

Therefore we can apply the Central Limit Theorem:

Normal Approximation to the Binomial Distribution

For n large enough, X is approximately N(np, np(1 − p)) distributed:

P(a ≤ X ≤ b) ≈ P(a − 1/2 ≤ Z ≤ b + 1/2)

where Z ∼ N(np, np(1 − p)).

Rule of thumb for n: np > 5 and n(1 − p) > 5.

In terms of the standard normal distribution we get

P(a ≤ X ≤ b) ≈ P((a − 1/2 − np)/√(np(1 − p)) ≤ Z′ ≤ (b + 1/2 − np)/√(np(1 − p)))
            = Φ((b + 1/2 − np)/√(np(1 − p))) − Φ((a − 1/2 − np)/√(np(1 − p)))

where Z′ ∼ N(0, 1).

Central Limit Theorem, Feb 4, 2004 - 7 -

The Normal Approximation to the Binomial

[Figure: frequency functions of Bin(n, 0.5) for n = 1, 2, 5, 10, 20 and of Bin(n, 0.1) for n = 1, 5, 10, 20, 50; the shapes approach the normal curve as n grows.]

Central Limit Theorem, Feb 4, 2004 - 8 -

Page 145: Stat Methods

The Normal Approximation to the Binomial

Example: The random walk of a drunkard

Suppose a drunkard executes a “random” walk in the following

way:

◦ Each minute he takes a step north or south, with probability 1/2 each.

◦ His successive step directions are independent.

◦ His step length is 50 cm.

How likely is he to have advanced 10 m north after one hour?

◦ Position after one hour: X · 1 m − 30 m, where X is the number of steps north

◦ X is binomially distributed with parameters n = 60 and p = 1/2

◦ X is approximately normal with mean 30 and variance 15:

P(X · 1 m − 30 m ≥ 10 m) = P(X ≥ 40)
                         ≈ P(Z > 39.5), with Z ∼ N(30, 15)
                         = P((Z − 30)/√15 > 9.5/√15)
                         = 1 − Φ(2.452) = 0.007

How does the probability change if he has some idea of where he wants to go and steps north with probability p = 2/3 and south with probability 1/3?
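A sketch that evaluates the normal approximation for both values of p; the second call answers the closing question:

from math import erf, sqrt

def phi(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

def prob_advance(p, steps=60, target=40):
    # P(X >= target) with continuity correction, X ~ Bin(steps, p)
    mean, var = steps * p, steps * p * (1 - p)
    return 1 - phi((target - 0.5 - mean) / sqrt(var))

print(prob_advance(1 / 2))   # ~0.007, as above
print(prob_advance(2 / 3))   # ~0.55 with a sense of direction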

Central Limit Theorem, Feb 4, 2004 - 9 -


Estimation

Example: Cholesterol levels of heart-attack patients

Data: Observational study at a Pennsylvania medical center

◦ blood cholesterol levels of patients treated for heart attacks

◦ measurements 2, 4, and 14 days after the attack

Id Y1 Y2 Y3 Id Y1 Y2 Y3

1 270 218 156 15 294 240 264

2 236 234 193 16 282 294 220

3 210 214 242 17 234 220 264

4 142 116 120 18 224 200 213

5 280 200 181 19 276 220 188

6 272 276 256 20 282 186 182

7 160 146 142 21 360 352 294

8 220 182 216 22 310 202 214

9 226 238 248 23 280 218 170

10 242 288 298 24 278 248 198

11 186 190 168 25 288 278 236

12 266 236 236 26 288 248 256

13 206 244 238 27 244 270 280

14 318 258 200 28 236 242 204

Aim: Make inference on the distribution of

◦ cholesterol level 14 days after the attack: Y3

◦ decrease in cholesterol level: D = Y1 − Y3

◦ relative decrease in cholesterol level: R = (Y1 − Y3)/Y3

Confidence intervals I, Feb 11, 2004 - 1 -


Estimation

Data:

d1, . . . , d28 observed decrease in cholesterol level

In this example, parameters of interest might be

µD = E(D), the mean decrease in cholesterol level,

σ²D = var(D), the variance of the decrease in cholesterol level,

pD = P(D ≤ 0), the probability of no decrease in cholesterol level.

These parameters are naturally estimated by the following sample

statistics:

µ̂D = (1/n) Σ_{i=1}^n di (sample mean)

σ̂²D = (1/n) Σ_{i=1}^n (di − d̄)² (sample variance)

p̂D = #{di | di ≤ 0} / n (sample proportion)

Such statistics are point estimators since they estimate the corre-

sponding parameter by a single numerical value.

◦ Point estimates provide no information about their chance vari-

ation.

◦ Estimates without an indication of their variability are of lim-

ited value.

Confidence intervals I, Feb 11, 2004 - 2 -


Confidence Intervals for the Mean

Recall:

◦ CLT for the sample mean: For large n we have

X̄ ≈ N(µ, σ²/n)

◦ 68-95-99 rule: With 95% probability the sample mean differs from its mean µ by less than two of its standard deviations.

More precisely, we have

P(µ − 1.96 σ/√n ≤ X̄ ≤ µ + 1.96 σ/√n) = 0.95,

or equivalently, after rearranging the terms,

P(X̄ − 1.96 σ/√n ≤ µ ≤ X̄ + 1.96 σ/√n) = 0.95.

Interpretation: There is 95% probability that the random interval

[X̄ − 1.96 σ/√n, X̄ + 1.96 σ/√n]

will cover the mean µ.

Example: Cholesterol levels

d̄ = 36.89, σ̂ = 51.00, n = 28.

Therefore, the 95% confidence interval for µ is [18.00, 55.78].
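The interval is a one-liner to compute (a sketch):

from math import sqrt

def z_interval(mean, sigma, n, z=1.96):
    # large-sample (1 - alpha) confidence interval for the mean
    half = z * sigma / sqrt(n)
    return mean - half, mean + half

print(z_interval(36.89, 51.00, 28))   # ~(18.00, 55.78)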

Confidence intervals I, Feb 11, 2004 - 3 -


Confidence Intervals for the Mean

Assumption: The population standard deviation σ is known.

◦ In the next lecture, we will drop this unrealistic assumption.

◦ Assumption is approximately satisfied for large sample sizes,

since then σ̂ ≈ σ by the law of large numbers.

Definition: Confidence interval for µ (σ known)

The interval

[X̄ − z_{α/2} σ/√n, X̄ + z_{α/2} σ/√n]

is called a 1 − α confidence interval for the population mean µ. (1 − α) is the confidence level.

For large sample sizes n, an approximate (1 − α) confidence interval for µ is given by the same formula with σ replaced by its estimate σ̂:

[X̄ − z_{α/2} σ̂/√n, X̄ + z_{α/2} σ̂/√n].

Here, zα is the α-critical value of the standard normal distribution:

◦ zα has area α to its right

◦ Φ(zα) = 1 − α

[Figure: standard normal density with the area α to the right of zα shaded.]

Confidence intervals I, Feb 11, 2004 - 4 -


Confidence Interval for the Mean

Example: Community banks

◦ Community banks are banks with less than a billion dollars of assets.

◦ Approximately 7500 such banks in the United States.

Annual survey of the Community Bankers Council of the American Bankers

Association (ABA)

◦ Population: Community banks in the United States.

◦ Variable of interest: Total assets of community banks.

◦ Sample size: n = 110

◦ Sample mean: X = 220 millions of dollars

◦ Sample standard deviation: SD = 161 millions of dollars

◦ Histogram of sampled values:

[Figure: "Assets of Community Banks in the U.S.": histogram of assets (in millions of dollars) for the sample of 110 community banks.]

Suppose we want to give a 95% confidence interval for the mean total assets

of all community banks in the United States.

◦ α = 0.05, z_{α/2} = 1.96

A 95% confidence interval for the mean assets (in millions of dollars) is

[220 − 1.96 · 161/√110, 220 + 1.96 · 161/√110] ≈ [190, 250].

Confidence intervals I, Feb 11, 2004 - 5 -


Sample Size

Example: Cholesterol levels

Suppose we want a 99% confidence interval for the decrease in

cholesterol level:

◦ α = 0.01, z_{0.005} = 2.58

◦ The 99% confidence interval for µD is

[36.89 − 2.58 · 50.93/√28, 36.89 + 2.58 · 50.93/√28] ≈ [12.06, 61.72].

Note: If we raise the confidence level, the confidence interval

becomes wider.

Suppose we want to increase the confidence level without increasing the error of estimation (indicated by the half-width of the confidence interval). For this we have to increase the sample size n.

Question: What sample size n is needed to estimate the mean

decrease in cholesterol with error e = 20 and confidence level 99%?

The error (half-width of the confidence interval) is

e = z_{α/2} σ/√n.

Therefore the sample size ne needed is given by

ne ≥ (z_{α/2} σ / e)² = (2.58 · 50.93 / 20)² = 43.16,

that is, a sample of 44 patients is needed to estimate µD with error e = 20 and 99% confidence.
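Solving e = z_{α/2} σ/√n for n and rounding up (a sketch):

from math import ceil

def sample_size(sigma, e, z=2.58):
    # smallest n whose half-width z * sigma / sqrt(n) is at most e
    return ceil((z * sigma / e) ** 2)

print(sample_size(50.93, 20))   # 44 patients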

Confidence intervals I, Feb 11, 2004 - 6 -


Estimation of the Mean

Example: Banks’ loan-to-deposit ratio

The ABA survey of community banks also asked about the loan-to-deposit

ratio (LTDR), a bank’s total loans as a percent of its total deposits.

Sample statistics:

◦ n = 110

◦ µLTDR = 76.7

◦ σLTDR = 12.3

[Figure: histogram of the loan-to-deposit ratio LTDR (in %) for the sample of 110 community banks.]

Construction of a 95% confidence interval:

◦ α = 0.05, z_{α/2} = 1.96

◦ Standard error σ̂X̄ = σ̂LTDR/√n = 1.17

◦ 95% confidence interval for µLTDR:

[X̄ − z_{α/2} σ̂LTDR/√n, X̄ + z_{α/2} σ̂LTDR/√n] = [74.4, 79.0]

◦ To get an estimate with error e = 3.0 (half-width of the confidence interval) it suffices to sample ne banks,

ne ≥ (z_{α/2} σ̂LTDR / e)² = (1.96 · 12.3 / 3.0)² = 64.6.

Thus a sample of ne = 65 banks is sufficient.

Confidence intervals I, Feb 11, 2004 - 7 -


Confidence intervals

Definition: Confidence interval

A (1 − α) confidence interval for a parameter is an interval that

◦ depends only on sample statistics and

◦ covers the parameter with probability (1 − α)

Note:

◦ Confidence intervals are random while the estimated parameter

is fixed.

◦ For repeated samples, only 95% of the confidence intervals will cover the true parameter µ:

[Figure: confidence intervals from repeated samples plotted around the true mean µ; most, but not all, cover µ]

Confidence intervals II, Feb 13, 2004 - 1 -


Confidence Intervals for the Mean

Suppose that X1, . . . , Xn iid∼ N(µ, σ²). Then

  (X̄ − µ)/(σ/√n) ∼ N(0, 1)   (*)

Assuming that σ is known, we obtain

  [X̄ − z_{α/2}·σ/√n , X̄ + z_{α/2}·σ/√n]

as a (1 − α) confidence interval for µ.

More realistic situation: σ is unknown.

Approach: Replace σ by the estimate σ̂ = s.

This approach leads to the t statistic

  T = (X̄ − µ)/(s/√n) ∼ t_{n−1}.

It is t distributed with n − 1 degrees of freedom.

[Figure: densities of the t1, t3, and t10 distributions compared with N(0, 1)]

Confidence interval for the mean µ (σ unknown)

The interval

  [X̄ − t_{n−1,α/2}·s/√n , X̄ + t_{n−1,α/2}·s/√n]

is a (1 − α) confidence interval for the mean µ.

Notation: Critical values of distributions

  zα        standard normal distribution
  t_{n,α}   t distribution with n degrees of freedom

Confidence intervals II, Feb 13, 2004 - 2 -


Confidence Intervals for the Mean

Example: Cholesterol levels

In the study on cholesterol levels, the standard deviation of the decrease

of cholesterol level was unknown.

◦ µ̂_D = 36.89, σ̂_D = 50.94

◦ t_{27,0.025} = 2.05

◦ Then

  [36.89 − 2.05·50.94/√28 , 36.89 + 2.05·50.94/√28] = [17.16, 56.62]

is a 95% confidence interval for µ_D.

◦ The large-sample confidence interval based on (*) was [18.00, 55.78].

Example: Level of vitamin C

The following data are the amounts of vitamin C, measured in milligrams

per 100 grams (mg/100 g) of corn soy blend, for a random sample of size 8

from a production run:

26 31 23 22 11 22 14 31

What is the 95% confidence interval for µ, the mean vitamin C content of

the CSB produced during this run?

◦ µ̂ = 22.5, σ̂ = 7.2, t_{7,0.025} = 2.36

◦ The 95% confidence interval for µ is

  [22.5 − 2.36·7.2/√8 , 22.5 + 2.36·7.2/√8] = [16.5, 28.5].

◦ The large-sample CI would be [17.5, 27.5].
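The same interval can be computed from the summary statistics with the immediate command cii (a sketch; in older STATA releases the syntax for means is cii #obs #mean #sd):

. cii 8 22.5 7.2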

Confidence intervals II, Feb 13, 2004 - 3 -


Confidence Intervals for the Variance

For normally distributed data X1, . . . , Xn iid∼ N(µ, σ²), the ratio

  (n − 1)·s²/σ²

has a χ² distribution with n − 1 degrees of freedom.

The (1 − α) confidence interval for σ² is

  [(n − 1)·s²/χ²_{n−1,α/2} , (n − 1)·s²/χ²_{n−1,1−α/2}],

where χ²_{n−1,α} is the upper α critical value of the χ²_{n−1} distribution.

Caution: This confidence interval is not robust against depar-

tures from normality regardless of the sample size.

Example: Cholesterol levels

Suppose we are interested in the variance of Y3, the cholesterol level 14

days after the attack.

◦ Normal probability plot:

[Figure: normal probability plot of cholesterol level against normal quantiles; the points lie close to a straight line]

Data seem to be normally distributed.

◦ s² = 2030.55, χ²_{27,0.975} = 14.57, χ²_{27,0.025} = 43.19

◦ The 95% confidence interval for σ² is

  [27·2030.55/43.19 , 27·2030.55/14.57] = [1269.26, 3761.99]
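A minimal STATA check, using invchi2tail() for the χ² critical values:

. display 27*2030.55/invchi2tail(27, 0.025)
. display 27*2030.55/invchi2tail(27, 0.975)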

Confidence intervals II, Feb 13, 2004 - 4 -


Statistical Tests

Example:

Suppose that of 100 applicants for a job 50 were women and 50 were men,

all equally qualified. Further suppose that the company hired 2 women

and 8 men.

Question:

◦ Does the company discriminate against female job applicants?

◦ How likely is this outcome under the assumption that the company

does not discriminate?

Example:

◦ Study success of new elaborate safety program

◦ Record average weekly losses in hours of labor due to accidents before

and after installation of the program in 10 industrial plants

Plant 1 2 3 4 5 6 7 8 9 10

Before 45 73 46 124 33 57 83 34 26 17

After 36 60 44 119 35 51 77 29 24 11

Question:

◦ Has the safety program an effect on the loss of labour due to accidents?

◦ In 9 out of 10 plants the average weekly losses have decreased after

implementation of the safety program. How likely is this (or a more

extreme) outcome under the assumption that there is no difference

before and after implementation of the safety program.

Testing Hypotheses I, Feb 16, 2004 - 1 -


Statistical Tests

Example: Fair coin

Suppose we have a coin. We suspect it might be unfair. We devise a

statistical experiment:

◦ Toss coin 100 times

◦ Conclude that coin is fair if we see between 40 and 60 heads

◦ Otherwise decide that the coin is not fair

Let θ be the probability that the coin lands heads, that is, P(Xi = 1) = θ and P(Xi = 0) = 1 − θ.

Our suspicion (“coin not fair”) is a hypothesis about the population parameter θ (θ ≠ 1/2) and thus about P. We emphasize this dependence of P on θ by writing P_θ.

Decision problem:

  Null hypothesis H0:        X ∼ Bin(100, 1/2)
  Alternative hypothesis Ha: X ∼ Bin(100, θ), θ ≠ 1/2

The null hypothesis represents the default belief (here: the coin is fair).

The alternative is the hypothesis we accept in view of evidence against the

null hypothesis.

The data-based decision rule

  reject H0 if X ∉ [40, 60]
  do not reject H0 if X ∈ [40, 60]

is called a statistical test for the test problem H0 vs. Ha.

Testing Hypotheses I, Feb 16, 2004 - 2 -


Statistical Tests

Example: Fair coin (contd)

Note: It is possible to obtain e.g. X = 55 (or X = 65)

◦ with probability 0.048 (resp. 0.0009) if p = 0.5

◦ with probability 0.048 (resp. 0.048) if p = 0.6

◦ with probability 0.0005 (resp. 0.047) if p = 0.7

[Figure: probability mass functions of Bin(100, 0.5), Bin(100, 0.6), and Bin(100, 0.7); in each panel the acceptance region X ∈ [40, 60] and the rejection region X ∉ [40, 60] for H0: p = 0.5 are marked]

Testing Hypotheses I, Feb 16, 2004 - 3 -


Types of errors

Example: Fair coin (contd)

It is possible that the test (decision rule) gives a wrong answer:

◦ If θ = 0.7 and x = 55, we do not reject the null hypothesis that the

coin is fair although the coin in fact is not fair.

◦ If θ = 0.5 and x = 65, we reject the null hypothesis that the coin is fair

although the coin in fact is fair.

The following table lists the possibilities:

Decision H0 true H0 false

Reject H0 type I error correct decision

Accept H0 correct decision type II error

Definition (Types of error)

◦ If we reject H0 when in fact H0 is true, this is a Type I error.

◦ If we do not reject H0 when in fact H0 is false, this is a Type II error.

Testing Hypotheses I, Feb 16, 2004 - 4 -


Types of errors

Question: How good is our decision rule?

For a good decision rule, the probability of committing an error of either

type should be small.

Probability of type I error: α

If the null hypothesis is true, i.e. θ = 1/2, then

  P_θ(reject H0) = P_θ(X ∉ [40, 60])
                 = 1 − P_θ(X ∈ [40, 60])
                 = 1 − Σ_{x=40}^{60} C(100, x)·(1/2)^100
                 = 0.035,

where C(100, x) denotes the binomial coefficient. Thus the probability of a type I error, denoted as α, is 3.5%.
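The tail sum can be evaluated with STATA's cumulative binomial function binomial(n, k, θ) = P(X ≤ k):

. display 1 - (binomial(100, 60, 0.5) - binomial(100, 39, 0.5))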

Probability of type II error: β(θ)

If the null hypothesis is false and the true probability of observing “head” is θ with θ ≠ 1/2, then

  P_θ(accept H0) = P_θ(X ∈ [40, 60]) = Σ_{x=40}^{60} C(100, x)·θ^x·(1 − θ)^{n−x}.

Thus, the probability of an error of type II depends on θ. It will be denoted as β(θ).

Testing Hypotheses I, Feb 16, 2004 - 5 -


Power of Tests

Question: How good is our test in detecting the alternative?

Consider the probability of rejecting H0:

  P_θ(reject H0) = P_θ(X ∉ [40, 60]) = 1 − P_θ(accept H0) = 1 − β(θ).

Note:

◦ If θ = 1/2 this is the probability of committing an error of type I:

  1 − β(1/2) = α

◦ If θ ≠ 1/2 this is the probability of correctly rejecting H0.

Definition (Power of a test)

We call 1 − β(θ) the power of the test as it measures the ability to

detect that the null hypothesis is false.

[Figure: power function 1 − β(θ) over θ ∈ [0, 1] for the test that rejects if X ∉ [40, 60]]
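For instance, the power against θ = 0.6 can be computed the same way as α above (a sketch using the cumulative binomial function):

. display 1 - (binomial(100, 60, 0.6) - binomial(100, 39, 0.6))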

Testing Hypotheses I, Feb 16, 2004 - 6 -


Significance Tests

Idea: minimize the probabilities of committing errors of type I and type II

Different probabilities of type I error:

[Figure: power curves 1 − β(θ) for the tests that reject if X ∉ [40, 60], X ∉ [38, 62], and X ∉ [42, 58]]

Note: If we decrease the probability of a type I error,

◦ the power of the test, 1 − β(θ), decreases as well and

◦ the probability of a type II error increases.

Problem: cannot minimize both errors simultaneously

Solution:

◦ choose fixed level α for probability of a type I error

◦ under this restriction find test with small probability of a type II error

Remark:

◦ you do not have to do this minimization yourself.

◦ all tests taught in this course are of this kind.

Definition

A test of this kind is called a significance test with significance level α.

Testing Hypotheses I, Feb 16, 2004 - 7 -


Statistical Hypotheses

A statistical hypothesis is an assertion or conjecture about a population,

which may be expressed in terms of

◦ some parameter: mean is zero;

◦ some parameters: mean and median are identical; or

◦ some sampling distribution: this sample is normally distributed.

Test problem - decide between two hypotheses

◦ the null hypothesis H0 and

◦ the alternative hypothesis Ha.

Popperian approach to scientific theories

◦ Scientific theories are subject to falsification.

◦ It is impossible to verify a scientific theory.

Null hypothesis H0

default (current) theory which we try to falsify

Alternative hypothesis Ha

alternative to adopt if null hypothesis is rejected

Examples:

◦ Clinical study of new drug - H0 : drug has no effect

◦ Criminal case - H0 : suspect is not guilty

◦ Safety test of nuclear power station - H0 : power station is not safe

◦ Chances of new investment - H0 : project not profitable

◦ Testing for independence - H0 : random variables are independent

Testing Hypotheses II, Feb 18, 2004 - 1 -


Statistical Tests

Example: Testing for pesticide in discharge water

Suppose the Environmental Protection Agency takes 10 readings on the

amount of pesticide in the discharge water of a chemical company.

Question: Does the concentration cP of pesticide in the water exceed the

allowed maximum concentration c0?

◦ Before taking action against the company, the agency must have some

evidence that the concentration cP exceeds the allowed level.

◦ Without evidence the agency assumes that the pesticide concentration

cP is within the limits of the law.

Consequently, the null hypothesis of the agency is that the pesticide con-

centration cP does not exceed c0. Thus the question corresponds to the

test problem

H0 : cP ≤ c0 vs Ha : cP > c0.

Suppose that the company regularly also runs tests on the amount of pes-

ticide in the discharge water.

Question: Does the concentration cP of pesticide in the water exceed the

allowed maximum concentration c0?

◦ The aim of the company is to avoid fines for exceeding the allowed

level. Thus the company wants to make sure that the concentration

stays within the allowed limits.

Thus, the null hypothesis of the company should be that the pesticide

concentration cP exceeds c0. The question now corresponds to the test

problem

H0 : cP ≥ c0 vs Ha : cP < c0.

Testing Hypotheses II, Feb 18, 2004 - 2 -


Six Steps of Conducting a Test

Steps of a significance test

1. Determine null hypothesis H0 and alternative Ha.

2. Decide on probability of type I error, the significance level α.

3. Find an appropriate test statistic T .

4. Based on the sampling distribution of T , formulate a criterion for

testing H0 against Ha.

5. Calculate value of the test statistic T .

6. Decide whether or not to reject the null hypothesis H0.

Example: Fair coin (contd)

We want to decide from 100 tosses of a coin whether it is fair or not. Let

θ be the probability of heads.

1. Test problem:

   H0 : θ = 1/2 vs Ha : θ ≠ 1/2

2. Significance level:

α = 0.05 (most commonly used significance level)

3. Test statistic:

T = X (number of heads in 100 tosses of the coin)

4. Rejection criterion:

   reject H0 if T ∉ [40, 60]

5. Observed value of test statistic: Suppose after 100 tosses we obtain

t = 55

6. Decision: Since 55 does not lie in the rejection region, we

do not reject H0.

Testing Hypotheses II, Feb 18, 2004 - 3 -


One and Two-sided Hypotheses

Example: Blood cholesterol after a heart attack

Suppose we are interested in whether the blood cholesterol level two days

after a heart attack differs from the average cholesterol level in the (general)

population (µ0 = 193).

Two cases:

◦ We are interested in any difference from the population mean µ0. Then we have a two-sided test problem

  H0 : µ_Y1 = µ0 vs Ha : µ_Y1 ≠ µ0.

◦ We suspect that the cholesterol level after a heart attack might be higher than in the general population. In this case, we have a one-sided test problem

  H0 : µ_Y1 = µ0 vs Ha : µ_Y1 > µ0.

Remark:

◦ More generally, we might be interested in one-sided test problems of the form

  H0 : µ_Y1 ≤ µ0 vs Ha : µ_Y1 > µ0,

which accounts for the possibility that µ might be smaller than µ0.

◦ For all common test situations (in particular those discussed in this

course), the form of the test does not depend on the form of H0, but

only on the parameter value in H0 that is closest to Ha, that is µ0.

Testing Hypotheses II, Feb 18, 2004 - 4 -


Test Statistic

Let θ be the parameter of interest.

Two-sided test problem

  H0 : θ = θ0 against Ha : θ ≠ θ0

One-sided test problem

  H0 : θ = θ0 against Ha : θ > θ0 (or Ha : θ < θ0)

Suppose that θ̂ is an estimate for θ.

◦ If θ = θ0 (null hypothesis), we expect the estimate θ̂ to take a value near θ0.

◦ Large deviations from θ0 are evidence against H0.

This suggests the following decision rules:

◦ Ha : θ > θ0: reject H0 if θ̂ − θ0 is much larger than zero

◦ Ha : θ < θ0: reject H0 if θ̂ − θ0 is much smaller than zero

◦ Ha : θ ≠ θ0: reject H0 if |θ̂ − θ0| is much larger than zero

Problem: Often the sampling distribution of the estimate θ̂ depends on the unknown parameter θ.

Definition (Test statistic)

A test statistic is a random variable

◦ that measures the compatibility between the null hypothesis and the

data and

◦ has a sampling distribution which we know (under H0).

Testing Hypotheses II, Feb 18, 2004 - 5 -


Test Statistic

Example: Blood cholesterol after a heart attack

Data: X1, . . . , X28

◦ blood cholesterol level of 28 patients two days after a heart attack

◦ assumed to be normally distributed with mean µ_X and variance σ²_X

The parameter µ_X can be estimated by the sample mean

  X̄ = (1/28)·Σ_{i=1}^{28} Xi ∼ N(µ_X , σ²_X/28).

This suggests using the standardized sample mean as a test statistic:

  (X̄ − µ0)/(σ/√28) ∼ N(0, 1)   (under H0).

Test H0 : µ ≤ 193 vs Ha : µ > 193 at significance level α = 0.05

◦ Test statistic: Assume σ = 47.7 to be known.

  T = (X̄ − µ0)/(σ/√28)

◦ Rejection criterion: Reject H0 if T > z_{0.05} = 1.645

◦ Outcome of test: Since the observed value of T is

  t = (253.9 − 193)/(47.7/√28) = 6.76,

we reject the null hypothesis that µ = 193.
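A quick numerical check of the observed statistic and the critical value (invnormal() is STATA's normal quantile function):

. display (253.9 - 193)/(47.7/sqrt(28))
. display invnormal(0.95)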

Testing Hypotheses II, Feb 18, 2004 - 6 -


Tests for the Mean

Tests for the mean µ (σ² known):

◦ Test statistic:

  T = (X̄ − µ0)/(σ/√n)

◦ Two-sided test:

  H0 : µ = µ0 against Ha : µ ≠ µ0

  reject H0 if |T| > z_{α/2}

◦ One-sided tests:

  H0 : µ = µ0 against Ha : µ > µ0 (µ < µ0)

  reject H0 if T > zα (T < −zα)

Tests for the mean µ (σ² unknown):

◦ Test statistic:

  T = (X̄ − µ0)/(s/√n)

◦ Two-sided test:

  H0 : µ = µ0 against Ha : µ ≠ µ0

  reject H0 if |T| > t_{n−1,α/2}

◦ One-sided tests:

  H0 : µ = µ0 against Ha : µ > µ0 (µ < µ0)

  reject H0 if T > t_{n−1,α} (T < −t_{n−1,α})

Example: Blood cholesterol after a heart attack

Estimating the standard deviation from the data, we obtain the test statistic

  T = (X̄ − µ0)/(s/√28) ∼ t_{27}.

Noting that t_{27,0.05} = 1.703 and t = 6.76, we still reject H0.
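From the summary statistics, STATA's immediate one-sample t test reproduces this (a sketch which treats 47.7 as the estimated standard deviation s):

. ttesti 28 253.9 47.7 193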

Testing Hypotheses II, Feb 18, 2004 - 7 -


Tests and Confidence Intervals

Consider a level α significance test for the two-sided test problem

  H0 : θ = θ0 vs Ha : θ ≠ θ0.

Let

◦ T = T_{θ0}(X) be the test statistic of the test (depends on θ0)

◦ R be the critical region of the test

Then

  C(X) = {θ : T_θ(X) ∉ R}

is a (1 − α) confidence interval for θ: If θ is the true parameter, then

  P_θ(θ ∈ C(X)) = P_θ(T_θ(X) ∉ R) = 1 − P_θ(T_θ(X) ∈ R) = 1 − α.

We have

  θ0 ∈ C(X) ⇔ T_{θ0}(X) ∉ R ⇔ H0 is not rejected

Result A level α two-sided significance test rejects the null hypothesis

H0 : θ = θ0 if and only if the parameter θ0 falls outside a (1 − α)

confidence interval for θ.

Example: Normal distribution

Let X1, . . . , Xn iid∼ N(µ, σ²). We reject H0 : µ = µ0 if

  |X̄ − µ0|/(s/√n) > t_{n−1,α/2}

or equivalently

  |X̄ − µ0| > t_{n−1,α/2}·s/√n.

Rearranging terms, we find that we reject if

  µ0 ∉ [X̄ − t_{n−1,α/2}·s/√n , X̄ + t_{n−1,α/2}·s/√n].

Testing Hypotheses II, Feb 18, 2004 - 8 -


The P-value

Definition (P-value)

The probability that under the null hypothesis H0 the test statistic would take a value as extreme as or more extreme than that actually observed is called the P-value of the test.

The P-value is often interpreted as a measure of the strength of evidence against the null hypothesis: the smaller the P-value, the stronger the evidence.

However:

◦ The P-value is a random variable (under H0 uniformly distributed on [0, 1]).

◦ Without a measure of its variability it is not safe to interpret the actually observed P-value.

◦ If the P-value is smaller than the chosen significance level α, we reject the null hypothesis H0.

Three approaches to deciding on a test problem:

◦ reject if θ0 ∉ C(X)

◦ reject if T(X) ∈ R

◦ reject if the P-value p ≤ α

Example: Blood cholesterol after a heart attack

The observed value for the test statistic

T =X − µ0

s/√

28∼ t27.

is t = 6.76. The corresponding P -value isP(T > 6.76) = 1.47 · 10−07.

We thus reject the null hypothesis.

Equivalently, the confidence interval for µ is [235.43, 272.42]. Since it does

not contain µ0 = 193 we reject H0 (for the third and last time!).
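Both numbers are easy to verify; ttail() returns the upper-tail probability of the t distribution (small differences are due to rounding of the summary statistics):

. display ttail(27, 6.76)
. display 253.9 - invttail(27, 0.025)*47.7/sqrt(28)
. display 253.9 + invttail(27, 0.025)*47.7/sqrt(28)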

Testing Hypotheses II, Feb 18, 2004 - 9 -


Example

Data: Banks’ net income

◦ percent change in net income between first half of last year and first

half of this year

◦ sample mean x̄ = 8.1%

◦ sample standard deviation s = 26.4%

Test problem: H0 : µ = 0 against Ha : µ ≠ 0

. ttesti 110 8.1 26.4 0

One-sample t test
------------------------------------------------------------------------------
Variable |    Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       x |    110         8.1    2.517141        26.4    3.111108    13.08889
------------------------------------------------------------------------------
Degrees of freedom: 109

Ho: mean(x) = 0

 Ha: mean < 0            Ha: mean != 0           Ha: mean > 0
   t = 3.2179              t = 3.2179              t = 3.2179
 P < t = 0.9991         P > |t| = 0.0017         P > t = 0.0009

Critical value of the t distribution with 109 degrees of freedom:

  t_{109,0.025} = 1.982

Result:

◦ |t| > t_{109,0.025}, therefore the test rejects H0 at significance level α = 0.05.

◦ Equivalently, µ0 = 0 ∉ [3.11, 13.09] and thus the test rejects H0.

◦ Equivalently, the P-value is less than α = 0.05 and thus the test rejects H0.

Testing Hypotheses II, Feb 18, 2004 - 10 -


Exact Binomial Test

Example: Fair coin

Data: 100 tosses of a coin which we suspect might be unfair.

Modelling:

◦ θ is the probability that the coin lands heads up

◦ X is the number of heads in 100 tosses of the coin

◦ X is binomially distributed with parameters n and θ.

Decision problem:

◦ Null hypothesis H0: coin is fair

◦ Alternative hypothesis Ha: coin is unfair

Test problem:

  H0 : θ = 1/2 vs Ha : θ ≠ 1/2.

Under the null hypothesis H0, the distribution of X is known,

  X ∼ Bin(100, 1/2).

Reject the null hypothesis if

  X ∉ [b_{100,0.5,0.975} , b_{100,0.5,0.025}] = [40, 60],

where b_{n,θ,α} is the upper α critical value of Bin(n, θ).

Note:

◦ Exact binomial tests typically have smaller significance level α due to

discreteness of distribution.

◦ In the above example, the probability of a type I error isP(reject H0) = α = 0.035.

Testing Hypotheses III, Feb 20, 2004 - 1 -


Sign Test

Example: Safety program

◦ Study success of new elaborate safety program

◦ Record average weekly losses in hours of labor due to accidents before

and after installation of the program in 10 industrial plants

Plant 1 2 3 4 5 6 7 8 9 10

Before 45 73 46 124 33 57 83 34 26 17

After 36 60 44 119 35 51 77 29 24 11

Question:

◦ Has the safety program an effect on the loss of labour due to accidents?

The Sign Test for matched pairs

◦ Ignore pairs with difference 0

◦ Number of trials n is the count of the remaining pairs

◦ The test statistic is the count X of pairs with positive difference

◦ X is binomially distributed with parameters n and θ

◦ Null hypothesis H0: θ = 1/2 (i.e. the median of the differences is zero)

Example:

For the safety program data, we find

◦ n = 10, X = 9

◦ Test H0 : θ = 1/2 against Ha : θ > 1/2

◦ The P-value of the observed count X is

  P(X ≥ 9) = C(10, 9)·(1/2)^10 + C(10, 10)·(1/2)^10 = 11/1024 = 0.0107

Since the P -value is smaller than α = 0.05 we reject the null hypothesis H0

that the safety program has no effect on the loss of labour due to accidents.
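STATA's immediate binomial test gives the same one-sided P-value (a sketch; the syntax is bitesti #N #succ #p):

. bitesti 10 9 0.5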

Testing Hypotheses III, Feb 20, 2004 - 2 -


Tests for Proportions

Example: Blood cholesterol after a heart attack

Suppose we are interested in the proportion p of patients who show a

decrease of cholesterol level between the second and the 14th day after a

heart attack.

The proportion p can be estimated by the sample proportion

  p̂ = X/n

where X is the number of patients whose cholesterol level decreased.

Question: Does a decrease occur more often than an increase?

Test problem: H0 : p = 1/2 vs Ha : p > 1/2

Exact tests:

Since X is binomially distributed, we can use exact binomial tests.

Large sample approximations:

Facts:

◦ E(p̂) = p

◦ var(p̂) = p(1 − p)/n

◦ (p̂ − p)/√(p(1 − p)/n) ≈ N(0, 1)   (for large n)

Under the null hypothesis H0, we get

  T = (p̂ − p0)/√(p0(1 − p0)/n) ≈ N(0, 1).

Hence, we reject H0 if T > zα.

Example: Blood cholesterol after a heart attack

◦ n = 28, x = 22, p̂ = 0.79, α = 0.05, z_{0.05} = 1.645

◦ t = (0.79 − 0.5)/√(0.5·0.5/28) = 3.07

◦ P-value: P(T > t) = 0.0011.
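The immediate command prtesti should reproduce this z statistic from the summary numbers (a sketch, assuming the one-sample syntax prtesti #obs #phat #p0):

. prtesti 28 0.79 0.5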

Testing Hypotheses III, Feb 20, 2004 - 3 -


Confidence Intervals for Proportions

Exact binomial confidence intervals

◦ difficult to compute

◦ use statistics software

Example: Blood cholesterol after a heart attack

◦ 28 patients in the study

◦ 22 showed a decrease in cholesterol level between second and 14th day

after the attack

Computation of an exact binomial confidence interval in STATA:

. cii 28 22

                                       -- Binomial Exact --
Variable |   Obs       Mean    Std. Err.   [95% Conf. Interval]
---------+------------------------------------------------------
         |    28   .7857143     .0775443     .590469    .9170394

Testing Hypotheses III, Feb 20, 2004 - 4 -


Confidence Intervals for Proportions

Large sample approximations

The CLT states that for large n, p̂ is approximately normally distributed:

  p̂ ≈ N(p , p(1 − p)/n)

Problems:

◦ the variance is unknown

◦ the estimate p̂(1 − p̂)/n is zero if p̂ = 0 or p̂ = 1

Example: What is the proportion of HIV+ students at the UofC?

◦ Random sample of 100 students

◦ None test positive for HIV

Are you absolutely sure that there are no HIV+ students at the UofC?

Idea: Estimate p by

  p̃ = (X + 2)/(n + 4)   (Wilson estimate)

and use

  [p̃ − z_{α/2}·√(p̃(1 − p̃)/(n + 4)) , p̃ + z_{α/2}·√(p̃(1 − p̃)/(n + 4))]

as a (1 − α) confidence interval for p.

Example: Blood cholesterol after a heart attack

. cii 28 22, wilson

                                        ------ Wilson ------
Variable |   Obs       Mean    Std. Err.   [95% Conf. Interval]
---------+------------------------------------------------------
         |    28   .7857143     .0775443    .6046141    .8978754

Testing Hypotheses III, Feb 20, 2004 - 5 -


Paired Samples

Example: Safety program

◦ Study success of new elaborate safety program

◦ Record average weekly losses in hours of labor due to accidents before

and after installation of the program in 10 industrial plants

Plant 1 2 3 4 5 6 7 8 9 10

Before 45 73 46 124 33 57 83 34 26 17

After 36 60 44 119 35 51 77 29 24 11

Question: Does the safety program have a positive effect?

Approach:

◦ Consider the differences before and after implementation of the program:

  Di = X_i^(before) − X_i^(after)   (the decrease in losses)

◦ The Di's are approximately normal,

  Di iid∼ N(µ, σ²)

◦ H0 : µ = 0 against Ha : µ > 0

◦ Significance level α = 0.01

◦ One-sample t test:

  T = D̄/(s/√n)

  Reject if T > t_{n−1,α}

[Figure: normal quantile plot of the decrease in losses of work]

Result:

◦ d̄ = 10.27, s = 7.98, n = 10

◦ t = 4.07 and t_{9,0.01} = 2.82, P-value: 0.0014

◦ The test rejects H0 at significance level α = 0.01
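From the summary statistics (the relevant one-sided P-value is the one reported for Ha: mean > 0):

. ttesti 10 10.27 7.98 0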

Testing Hypotheses III, Feb 20, 2004 - 6 -


Paired Sample t Test

Data: (X1, Y1), . . . , (Xn, Yn)

Assumptions:

◦ Pairs are independent

◦ Di = Xi − Yi iid∼ N(µ, σ²)

◦ Apply the one-sample t test

Paired sample t test

◦ Test statistic

  T = (D̄ − µ0)/(s/√n)

◦ Two-sided test:

  H0 : µ = µ0 against Ha : µ ≠ µ0

  reject H0 if |T| > t_{n−1,α/2}

◦ One-sided test:

  H0 : µ = µ0 against Ha : µ > µ0

  reject H0 if T > t_{n−1,α}

Power of the paired sample t test and the paired sign test:

[Figure: power curves 1 − β(δ) of the t test and the sign test as functions of the shift δ]

Testing Hypotheses III, Feb 20, 2004 - 7 -


Sign and t Test

t test:

◦ based on the Central Limit Theorem

◦ reasonably robust against departures from normality

◦ do not use if n is small and

  ⋄ data are strongly skewed or

  ⋄ data have clear outliers

Sign test:

◦ uses much less information than the t test

◦ for normal data less powerful than the t test

◦ makes no assumption on the distribution and keeps its significance level regardless of the distribution

◦ preferable for very small data sets

Remark:

◦ The two-step procedure

1. assess normality by normal quantile plot

2. conduct either t test or sign test depending on result in step 1

does not attain the chosen significance level α (two tests!).

◦ The sign test is rarely used since there are more powerful distribution-

free tests.

Testing Hypotheses III, Feb 20, 2004 - 8 -


Two Sample Problems

Two sample problems

◦ The goal of inference is to compare the responses in two groups.

◦ Each group is a sample from a different population.

◦ The responses in each group are independent of those in the other

group.

Example: Effects of ozone

Study the effects of ozone by controlled randomized experiment

◦ 45 70-day-old rats were randomly assigned to treatment or control

◦ Treatment group: 22 rats were kept in an environment containing ozone

◦ Control group: 23 rats were kept in an ozone-free environment

◦ Data: weight gains after 7 days

We are interested in the difference in weight gain between the treatment and control group.

Question: Do the weight gains differ between groups?

◦ x1, . . . , x22 - weight gains for treatment group

◦ y1, . . . , y23 - weight gains for control group

◦ Test problem:

  H0 : µX = µY vs Ha : µX ≠ µY

◦ Idea: Reject the null hypothesis if x̄ − ȳ is large.

[Figure: boxplots of weight gain (in grams) for the treatment and control groups]

Two Sample Tests, Feb 23, 2004 - 1 -


Comparing Means

Let X1, . . . , Xm and Y1, . . . , Yn be two independent normally distributed samples. Then

  X̄ − Ȳ ∼ N(µX − µY , σ²_X/m + σ²_Y/n)

Two-sample t test

◦ Two-sample t statistic

  T = (X̄ − Ȳ)/√(s²_X/m + s²_Y/n)

  The distribution of T can be approximated by a t distribution.

◦ Two-sided test:

  H0 : µX = µY against Ha : µX ≠ µY

  reject H0 if |T| > t_{df,α/2}

◦ One-sided test:

  H0 : µX = µY against Ha : µX > µY

  reject H0 if T > t_{df,α}

◦ Degrees of freedom:

  ⋄ Approximations for df provided by statistical software

  ⋄ Satterthwaite approximation

    df = (s²_X/m + s²_Y/n)² / [ (s²_X/m)²/(m−1) + (s²_Y/n)²/(n−1) ]

    commonly used, conservative approximation

  ⋄ Otherwise: use df = min(m − 1, n − 1)

Two Sample Tests, Feb 23, 2004 - 2 -


Comparing Means

Example: Effects of ozone

Data:

◦ Treatment group: x̄ = 11.01, sX = 19.02, m = 22

◦ Control group: ȳ = 22.43, sY = 10.78, n = 23

Test problem:

◦ H0 : µX = µY vs Ha : µX ≠ µY

◦ α = 0.05, df = min(m − 1, n − 1) = 21, t_{21,0.025} = 2.08

The value of the test statistic is

  t = (x̄ − ȳ)/√(s²_X/m + s²_Y/n) = −2.46

The corresponding P-value is

  P(|T| ≥ |t|) = P(|T| ≥ 2.46) = 0.023

Thus we reject the hypothesis that ozone has no effect on weight gain.
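The same test can be run from the summary statistics alone (a sketch using the immediate form; the sign of t depends on which group is entered first, and the full-data version follows below):

. ttesti 22 11.01 19.02 23 22.43 10.78, unequal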

Two-sample t test with STATA:

. ttest weight, by(group) unequal

Two-sample t test with unequal variances
------------------------------------------------------------------------------
   Group |  Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |   23    22.42609    2.247108    10.77675    17.76587     27.0863
       1 |   22    11.00909    4.054461    19.01711    2.577378     19.4408
---------+--------------------------------------------------------------------
combined |   45    16.84444    2.422057    16.24765    11.96311    21.72578
---------+--------------------------------------------------------------------
    diff |         11.417      4.635531                1.985043    20.84895
------------------------------------------------------------------------------
Satterthwaite's degrees of freedom: 32.9179

Ho: mean(0) - mean(1) = diff = 0

 Ha: diff < 0            Ha: diff != 0           Ha: diff > 0
   t = 2.4629              t = 2.4629              t = 2.4629
 P < t = 0.9904         P > |t| = 0.0192         P > t = 0.0096

Two Sample Tests, Feb 23, 2004 - 3 -


Comparing Means

Suppose that σ²_X = σ²_Y = σ². Then

  σ²/m + σ²/n = σ²·(1/m + 1/n).

Estimate σ² by the pooled sample variance

  s²_p = [(m − 1)·s²_X + (n − 1)·s²_Y] / (m + n − 2).

Pooled two-sample t test

◦ Two-sample t statistic

  T = (X̄ − Ȳ)/(s_p·√(1/m + 1/n))

  T is t distributed with m + n − 2 degrees of freedom.

◦ Two-sided test:

  H0 : µX = µY against Ha : µX ≠ µY

  reject H0 if |T| > t_{m+n−2,α/2}

◦ One-sided test:

  H0 : µX = µY against Ha : µX > µY

  reject H0 if T > t_{m+n−2,α}

Remarks:

◦ If m ≈ n, the test is reasonably robust against

  ⋄ nonnormality and

  ⋄ unequal variances.

◦ If the sample sizes differ a lot, the test is very sensitive to unequal variances.

◦ Tests for differences in variances are sensitive to nonnormality.

Two Sample Tests, Feb 23, 2004 - 4 -


Comparing Means

Example: Parkinson’s disease

Study on Parkinson’s disease

◦ Parkinson's disease, among other things, affects a person's ability to speak

◦ Overall condition can be improved by an operation

◦ How does the operation affect the ability to speak?

◦ Treatment group: Eight patients received the operation

◦ Control group: Fourteen patients

◦ Data:

  ⋄ scores on several tests

  ⋄ high scores indicate problems with speaking

[Figure: boxplots of speaking ability for the treatment and control groups]

Pooled two-sample t test with STATA:

. infile ability group using parkinson.txt
. ttest ability, by(group)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |  Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |   14    1.821429     .148686    .5563322    1.500212    2.142645
       1 |    8        2.45      .14516    .4105745    2.106751    2.793249
---------+--------------------------------------------------------------------
combined |   22        2.05    .1249675    .5861497    1.790116    2.309884
---------+--------------------------------------------------------------------
    diff |        -.6285714    .2260675               -1.10014    -.1570029
------------------------------------------------------------------------------
Degrees of freedom: 20

Ho: mean(0) - mean(1) = diff = 0

 Ha: diff < 0            Ha: diff != 0           Ha: diff > 0
   t = -2.7805             t = -2.7805             t = -2.7805
 P < t = 0.0058         P > |t| = 0.0115         P > t = 0.9942

Two Sample Tests, Feb 23, 2004 - 5 -


Comparing Variances

Example: Parkinson’s disease

In order to apply the pooled two-sample t test, the variances of the two

groups have to be equal. Are the data compatible with this assumption?

F test for equality of variances

The F test statistic

  F = s²_X/s²_Y

is F distributed with m − 1 and n − 1 degrees of freedom.

. sdtest ability, by(group)

Variance ratio test
------------------------------------------------------------------------------
   Group |  Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |   14    1.821429     .148686    .5563322    1.500212    2.142645
       1 |    8        2.45      .14516    .4105745    2.106751    2.793249
---------+--------------------------------------------------------------------
combined |   22        2.05    .1249675    .5861497    1.790116    2.309884
------------------------------------------------------------------------------

Ho: sd(0) = sd(1)

  F(13,7) observed   = F_obs         = 1.836
  F(13,7) lower tail = F_L = 1/F_obs = 0.545
  F(13,7) upper tail = F_U = F_obs   = 1.836

 Ha: sd(0) < sd(1)       Ha: sd(0) != sd(1)            Ha: sd(0) > sd(1)
 P < F_obs = 0.7865      P < F_L + P > F_U = 0.3767    P > F_obs = 0.2135

Result: We cannot reject the null hypothesis that the variances are equal.

Problem: Are the data normally distributed?

[Figure: normal quantile plots of speaking ability for the treatment group and the control group]

Two Sample Tests, Feb 23, 2004 - 6 -


Comparing Proportions

Suppose we have two populations with unknown proportions p1 and p2.

◦ Random samples of size n1 and n2 are drawn from the two populations

◦ p̂1 is the sample proportion for the first population

◦ p̂2 is the sample proportion for the second population

Question: Are the two proportions p1 and p2 different?

Test problem:

  H0 : p1 = p2 vs Ha : p1 ≠ p2

Idea: Reject H0 if p̂1 − p̂2 is large.

Note that

  p̂1 − p̂2 ≈ N(p1 − p2 , p1(1 − p1)/n1 + p2(1 − p2)/n2)

This suggests the test statistic

  T = (p̂1 − p̂2)/√(p̂(1 − p̂)·(1/n1 + 1/n2))

where p̂ is the combined proportion of successes in both samples,

  p̂ = (X1 + X2)/(n1 + n2) = (n1·p̂1 + n2·p̂2)/(n1 + n2),

with X1 and X2 denoting the number of successes in each sample.

Under H0, the test statistic is approximately standard normally distributed.

Two Sample Tests, Feb 23, 2004 - 7 -


Comparing Proportions

Example: Question wording

The ability of question wording to affect the outcome of a survey can be a

serious issue. Consider the following two questions:

1. Would you favor or oppose a law that would require a person to obtain

a police permit before purchasing a gun?

2. Would you favor or oppose a law that would require a person to obtain

a police permit before purchasing a gun, or do you think such a law

would interfere too much with the right of citizens to own guns?

In two surveys, the following results were obtained:

Question Yes No Total

1 463 152 615

2 403 182 585

Question: Is the true proportion of people favoring the permit law the

same in both groups or not?

. prtesti 615 0.753 585 0.689

Two-sample test of proportion            x: Number of obs =   615
                                         y: Number of obs =   585
------------------------------------------------------------------------------
Variable |      Mean   Std. Err.      z     P>|z|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
       x |      .753    .0173904                      .7189155    .7870845
       y |      .689    .0191387                      .6514889    .7265111
---------+--------------------------------------------------------------------
    diff |      .064    .0258595                      .0133163    .1146837
         | under Ho:    .0258799    2.47   0.013
------------------------------------------------------------------------------

Ho: proportion(x) - proportion(y) = diff = 0

 Ha: diff < 0            Ha: diff != 0           Ha: diff > 0
   z = 2.473               z = 2.473               z = 2.473
 P < z = 0.9933         P > |z| = 0.0134         P > z = 0.0067

Two Sample Tests, Feb 23, 2004 - 8 -


Final Remarks

Statistical theory focuses on the significance level, the probability of a type

I error.

In practice, a discussion of the power of the test is also important:

Example: Efficient Market Hypothesis

“Efficient market hypothesis” for stock prices:

◦ future stock prices show only random variation

◦ market incorporates all information available now in present prices

◦ no information available now will help to predict future stock prices

Testing of the efficient market hypothesis:

◦ Many studies tested

H0: Market is efficient

Ha: Prediction is possible

◦ Almost all studies failed to find good evidence against H0.

◦ Consequently the efficient market hypothesis became quite popular.

Problem:

◦ Power was generally low in the significance tests employed in the stud-

ies.

◦ Failure to reject H0 is no evidence that H0 is true.

◦ More careful studies showed that the size of a company and measures

of value such as ratio of stock price to earnings do help predict future

stock prices.

Two Sample Tests, Feb 23, 2004 - 9 -


Final Remarks

Example

◦ IQ of 1000 women and 1000 men

◦ µ̂w = 100.68, σ̂w = 14.91

◦ µ̂m = 98.90, σ̂m = 14.68

◦ Pooled two-sample t test: T = −2.7009

◦ Reject H0 : µw = µm since |T| > t_{1998,0.005} = 2.58.

◦ The difference in the IQ is statistically significant at the 0.01 level.

◦ However we might conclude that the difference is scientifically irrele-

vant.

Note: Significance at a low level does not mean that there is a large difference, only that there is strong evidence that there is some difference.

Two Sample Tests, Feb 23, 2004 - 10 -


Final Remarks

Example: Is radiation from cell phones harmful?

◦ Observational study

◦ Comparison of brain cancer patients and similar group without brain

cancer

◦ No statistically significant association between cell phone use and a

group of brain cancers known as gliomas.

◦ A separate analysis for 20 types of gliomas found an association between phone use and one rare form.

◦ The risk seemed to decrease with greater mobile phone use.

Think for a moment:

◦ Suppose all 20 null hypotheses are true.

◦ Each test has a 5% chance of being significant; the outcome is Bernoulli distributed with parameter 0.05.

◦ The number of false positive tests is binomially distributed:

  N ∼ Bin(20, 0.05)

◦ The probability of getting one or more positive results is

  P(N ≥ 1) = 1 − P(N = 0) = 1 − 0.95^20 = 0.64.

We therefore might have expected at least one significant association.
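A one-line check:

. display 1 - 0.95^20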

Beware of searching for significance

Two Sample Tests, Feb 23, 2004 - 11 -


Final Remarks

Problem: If several tests are performed, the probability of a type I error

increases.

Idea: Adjust significance level of each single test.

Bonferroni procedure:

◦ Perform k tests

◦ Use significance level α/k for each of the k tests

◦ If all null hypotheses are true, the probability is α that any of the tests rejects its null hypothesis.

Example

Suppose we perform k = 6 tests at level α = 0.05, so that α/k = 0.0083, and obtain the following P-values:

  0.476   0.032   0.241   0.008*   0.010   0.001*

Only the two tests marked (*) are significant at the overall 0.05 level.

Two Sample Tests, Feb 23, 2004 - 12 -


Two-Way Tables

Example: Depression and marital status

Question: Does severity of depression depend on marital status?

◦ Study of 159 depression patients

◦ Patients were categorized by

⋄ severity of depression (severe, normal, mild)

⋄ marital status (single, married, widowed/divorced)

The following two-way table summarizes the data:

Depression Marital Status Total

Single Married Wid/Div

Severe 16 22 19 57

Normal 29 33 14 76

Mild 9 14 3 26

Total 54 69 36 159

◦ Each combination of values defines a cell.

◦ The severity of depression is the row variable.

◦ The marital status is the column variable.

Inference for Two-Way Tables, Feb 25, 2004 - 1 -


Two-Way Tables

From this table of counts, the sample distribution can be obtained

by dividing each cell by the total sample size n = 159:

Depression Marital Status Total

Single Married Wid/Div

Severe 0.101 0.138 0.119 0.358

Normal 0.182 0.208 0.088 0.478

Mild 0.057 0.088 0.019 0.164

Total 0.340 0.434 0.226 1.000

◦ Joint distribution: proportion for each combination of values

◦ Marginal distribution: distribution of the row and column

variables separately.

◦ Conditional distribution: distribution of one variable at a

given level of the other variable

Inference for Two-Way Tables, Feb 25, 2004 - 2 -


Test for Independence

Example: Depression and marital status

Conditional distributions of severity of depression given marital

status:

[Figure: bar chart of the sample proportions of severe, normal, and mild depression within each marital status group (single, married, widowed/divorced)]

Question: Is there a relationship between the row variable (depression) and the column variable (marital status)?

◦ The distribution for widowed/divorced patients seems to differ

from the distributions for single or married patients.

◦ Are these differences significant or can they be attributed to

chance variation?

◦ How likely are differences as large as or larger than those observed if the two variables were indeed independent (and thus the conditional distributions were the same)?

A statistical test will be required to answer these questions.

Inference for Two-Way Tables, Feb 25, 2004 - 3 -


Test for Independence

Test problem:

H0: the row and the column variables are independent

Ha: the row and the column variables are dependent

How can we measure evidence against the null hypothesis?

◦ What counts would we expect to observe if the null hypothesis

were true?

  Expected cell count = (row total × column total) / total count

Recall: For two independent events A and B, P(A∩B) = P(A)P(B).

If the null hypothesis H0 is true, then the table of expected

counts should be “close” to the observed table of counts.

◦ We need a statistic that measures the difference between the

tables.

◦ And we need to know what is the distribution of the statistic

to make statistical inference.

Inference for Two-Way Tables, Feb 25, 2004 - 4 -


Test for Independence

Idea of the test:

◦ construct table of expected counts

◦ compare expected with observed counts

◦ if the null hypothesis is true, the difference between the tables

should be “small”

The χ² (Chi-Squared) Statistic

To measure how far the expected table is from the observed table, we use the following test statistic:

  X = Σ_{all cells} (Observed − Expected)² / Expected

◦ Under the null hypothesis, X is approximately χ² distributed with (r − 1)(c − 1) degrees of freedom.

Why (r − 1)(c − 1)?

Recall that our “expected” table is based on some quantities estimated from the data: namely the row and column totals. Once these totals are known, filling in any (r − 1)(c − 1) undetermined table entries actually gives us the whole table. Thus, there are only (r − 1)(c − 1) freely varying quantities in the table.

◦ We reject H0 if the observed and expected counts are very different and hence X is large. Consequently we reject H0 at significance level α if

  X ≥ χ²_{(r−1)(c−1),α}.

Inference for Two-Way Tables, Feb 25, 2004 - 5 -


The χ2 Distribution

What does the χ2 distribution look like?

[Figure: χ² densities for 1, 5, 10, 20, and 30 degrees of freedom]

◦ Unlike the Normal or t distributions, the χ2 distribution takes

values in (0,∞).

◦ As with the t distribution, the exact shape of the χ2 distribution

depends on its degrees of freedom.

Recall that X has only an approximate χ²_{(r−1)(c−1)} distribution. When is the approximation valid?

◦ For any two-way table larger than 2 × 2, we require that the average expected cell count is at least 5 and each expected count is at least 1.

◦ For 2 × 2 tables, we require that each expected count be at least 5.

Inference for Two-Way Tables, Feb 25, 2004 - 6 -


Test for Independence

Example: Depression and marital status

The following table show the observed counts and expected counts

(in brackets):

Depression Marital Status Total

Single Married Wid/Div

Severe 16 22 19 57

(19.36) (24.74) (12.90)

Normal 29 33 14 76

(25.81) (32.98) (17.21)

Mild 9 14 3 26

(8.83) (11.28) (5.89)

Total 54 69 36 159
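The expected counts follow directly from the marginal totals; for instance, for the (Severe, Single) cell:

. display 57*54/159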

◦ The table is 3 × 3, so there are (r − 1)(c − 1) = 2 × 2 = 4 degrees of freedom.

◦ The critical value (significance level α = 0.05) is χ²_{4,0.05} = 9.49.

◦ The observed value of the χ² test statistic is

  x = (16 − 19.36)²/19.36 + (22 − 24.74)²/24.74 + . . . + (3 − 5.89)²/5.89
    = 6.83 ≤ χ²_{4,0.05}

  Thus we do not reject the null hypothesis of independence.

◦ The corresponding P-value is

  P(X ≥ x) = P(X ≥ 6.83) = 0.145 ≥ α

  Again we do not reject H0.

Inference for Two-Way Tables, Feb 25, 2004 - 7 -


Test for Independence

The χ² test in STATA:

. insheet using depression.txt, clear
(3 vars, 159 obs)

. tabulate depression marital, chi2

           |              Marital
Depression |  Married    Single   Wid/Div |     Total
-----------+-------------------------------+----------
      Mild |       14         9         3 |        26
    Normal |       33        29        14 |        76
    Severe |       22        16        19 |        57
-----------+-------------------------------+----------
     Total |       69        54        36 |       159

    Pearson chi2(4) = 6.8281   Pr = 0.145

The same result can be obtained by the command

. tabi 16 22 19 \ 29 33 14 \ 9 14 3, chi2

           |              col
       row |        1         2         3 |     Total
-----------+-------------------------------+----------
         1 |       16        22        19 |        57
         2 |       29        33        14 |        76
         3 |        9        14         3 |        26
-----------+-------------------------------+----------
     Total |       54        69        36 |       159

    Pearson chi2(4) = 6.8281   Pr = 0.145

Inference for Two-Way Tables, Feb 25, 2004 - 8 -


Models for Two-Way Tables

The χ2-test for the presence of a relationship between two distributions

in a two-way table is valid for data produced by several different study

designs, although the exact null hypothesis varies.

◦ Examining independence between variables

⋄ Select random sample of size n from a population.

⋄ Classify each individual according to two categorical variables.

Question: Is there a relationship between the two variables?

Test problem:

H0: The two variables are independent

Ha: The two variables are not independent

Example: Suppose we collect an SRS of 114 college students, and categorize each by major and GPA (e.g. (0, 0.5], . . . , (3.5, 4]). Then, we can use the χ²-test to ascertain whether grades and major are independent.

◦ Comparing several populations

  ⋄ Select independent random samples from each of c populations, of sizes n1, . . . , nc.

  ⋄ Classify each individual according to a categorical response variable with r possible values (the same across populations).

  ⋄ This yields an r × c table.

Question: Does the distribution of the response variable differ between populations?

Test problem:

H0: The distribution is the same in all populations.

Ha: The distribution is not the same.

Example: Suppose we select independent SRSs of Psychology, Biology

and Math majors, of sizes 40, 39, 35, and classify each individual by

GPA range. Then, we can use a χ²-test to ascertain whether or not the

distribution of grades is the same in all three populations.

Inference for Two-Way Tables, Feb 25, 2004 - 9 -


Models for Two-Way Tables

Example: Literary Analysis (Rice, 1995)

When Jane Austen died, she left the novel Sanditon only partially completed, but she left a summary of the remainder. A highly literate admirer finished the novel, attempting to emulate Austen's style, and the hybrid was published. Someone counted the occurrences of various words in several chapters from various works.

                        Austen                          Imitator
Word        Sense and      Emma     Sanditon I      Sanditon II
            Sensibility
a                 147       186            101               83
an                 25        26             11               29
this               32        39             15               15
that               94       105             37               22
with               59        74             28               43
without            18        10             10                4
TOTAL             375       440            202              196

Questions:

◦ Is there consistency in Austen's work (do the frequencies with which Austen used these words change from work to work)?

  Answer: X = 12.27, df = ?, P-value = ?

◦ Was the imitator successful (are the frequencies of the words the same in Austen's work and the imitator's work)?
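For the first question the table has 6 rows and the 3 Austen columns, so df = (6 − 1)(3 − 1) = 10; the P-value is then the upper tail probability, approximately 0.27 (a sketch using STATA's chi2tail()):

. display chi2tail(10, 12.27)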

Inference for Two-Way Tables, Feb 25, 2004 - 10 -


Simpson’s Paradox

Example: Medical study

◦ contact randomly chosen people in a district in England

◦ data on 1314 women contacted

◦ either current smoker or who had never smoked

Question: Survival rate after 20 years?

Smoker Not

Dead 139 230

Alive 438 502

Result: A higher percent of smokers stayed alive!

Here are the same data classified by their age at time of the survey:

Age 18 to 44

Smoker Not

Dead 19 13

Alive 269 327

Age 45 to 64

Smoker Not

Dead 78 52

Alive 162 147

Age 65+

Smoker Not

Dead 42 165

Alive 7 28

Age at the time of the study is a confounding variable; in each age group a higher percent of nonsmokers survive.
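The reversal is easy to see from the survival rates, overall and within an age group:

. display 438/(139+438)
. display 502/(230+502)
. display 269/(19+269)
. display 327/(13+327)

The first two give .759 and .686 (smokers vs. nonsmokers overall), while the last two give .934 and .962 within ages 18 to 44.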

Simpson’s Paradox

An association or comparison that holds for each of several groups can reverse direction when the data are combined to form a single group.

Inference for Two-Way Tables, Feb 25, 2004 - 11 -


Simple Linear Regression

Example: Body density

Aim: Measure body density (weight per unit volume of the body)

(Body density indicates the fat content of the human body.)

Problem:

◦ Body density is difficult to measure directly.

◦ Research suggests that skinfold thickness can accurately predict body density.

◦ Skinfold thickness is measured by pinching a fold of skin between calipers.

[Figure: scatter plot of body density (10³ kg/m³) against skinfold thickness (mm)]

Questions:

◦ Are body density and skinfold thickness related?

◦ How accurately can we predict body density from skinfold thickness?

Regression: predict response variable for fixed value of explanatory variable

◦ describe linear relationship in data by regression line

◦ fitted regression line is affected by chance variation in observed data

Statistical inference: accounts for chance variation in data

Simple Linear Regression, Feb 27, 2004 - 1 -


Population Regression Line

Simple linear regression studies the relationship between

◦ a response variable Y and

◦ a single explanatory variable X.

We expect that different values of X will produce different mean responses

of Y .

For given X = x, we consider the subpopulation with X = x:

◦ this subpopulation has mean

  µ_{Y|X=x} = E(Y|X = x)   (conditional mean of Y given X = x)

◦ and variance

  σ²_{Y|X=x} = var(Y|X = x)   (conditional variance of Y given X = x)

Linear regression model with constant variance:

  E(Y|X = x) = µ_{Y|X=x} = a + b·x   (population regression line)

  var(Y|X = x) = σ²_{Y|X=x} = σ²

◦ The population regression line connects the conditional means of the

response variable for fixed values of the explanatory variable.

◦ This population regression line tells how the mean response of Y varies

with X.

◦ The variance (and standard deviation) does not depend on x.

Simple Linear Regression, Feb 27, 2004 - 2 -


Conditional Mean

[Figure: three density plots illustrating the construction: the joint density f(x, y) of a sample (x1, y1), . . . , (xn, yn); the slice f(x0, y) obtained by fixing x = x0; and the conditional density obtained by rescaling by fX(x0)]

Conditional probability density:

  f(y|x0) = f_{XY}(x0, y) / f_X(x0)

Conditional mean:

  E(Y|X = x0) = ∫ y·f_{Y|X}(y|x0) dy

Simple Linear Regression, Feb 27, 2004 - 3 -


The Linear Regression Model

Simple linear regression

  Yi = a + b·xi + εi,   i = 1, . . . , n

where

  Yi   response (also dependent variable)
  xi   predictor (also independent variable)
  εi   error

Assumptions:

◦ Predictor xi is deterministic (fixed values, not random).

◦ Errors have zero mean, E(εi) = 0.

◦ Variation about the mean does not depend on xi, i.e. var(εi) = σ².

◦ Errors εi are independent.

Often we additionally assume:

◦ The errors are normally distributed,

  εi iid∼ N(0, σ²).

For fixed x the response Y is then normally distributed,

  Y ∼ N(a + b·x, σ²).
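A tiny simulation sketch of this model (the intercept 2, slope 0.5, and error sd 0.3 are made-up values; invnormal(uniform()) is the classic STATA idiom for standard normal draws):

. clear
. set obs 100
. set seed 2004
. generate x = _n/10
. generate y = 2 + 0.5*x + 0.3*invnormal(uniform())
. regress y x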

Simple Linear Regression, Feb 27, 2004 - 4 -


Least Squares Estimation

Data: (Y1, x1), . . . , (Yn, xn)

Aim: Find the straight line which fits the data best:

  Ŷi = â + b̂·xi   fitted values for coefficients â and b̂

  â - intercept
  b̂ - slope

Least Squares Approach:

Minimize the squared distance between observed Yi and fitted Ŷi:

  L(a, b) = Σ_{i=1}^n (Yi − Ŷi)² = Σ_{i=1}^n (Yi − a − b·xi)²

Set the partial derivatives to zero (normal equations):

  ∂L/∂a = 0 ⇔ Σ_{i=1}^n (Yi − a − b·xi) = 0

  ∂L/∂b = 0 ⇔ Σ_{i=1}^n (Yi − a − b·xi)·xi = 0

Solution: Least squares estimators

  â = Ȳ − (S_XY/S_XX)·x̄

  b̂ = S_XY/S_XX

where

  S_XY = Σ_{i=1}^n (Yi − Ȳ)(xi − x̄)   (sum of cross products)

  S_XX = Σ_{i=1}^n (xi − x̄)²

Simple Linear Regression, Feb 27, 2004 - 5 -


Least Squares Estimation

Least squares predictor Ŷ:

  Ŷi = â + b̂·xi

Residuals ε̂i:

  ε̂i = Yi − Ŷi = Yi − â − b̂·xi

Residual sum of squares (SSResidual):

  SSResidual = Σ_{i=1}^n ε̂i² = Σ_{i=1}^n (Yi − Ŷi)²

Estimation of σ²:

  σ̂² = (1/(n − 2))·Σ_{i=1}^n (Yi − Ŷi)² = SSResidual/(n − 2)

Regression standard error:

  se = σ̂ = √(SSResidual/(n − 2))

Variation accounting:

  SSTotal    = Σ_{i=1}^n (Yi − Ȳ)²   total variation

  SSModel    = Σ_{i=1}^n (Ŷi − Ȳ)²   variation explained by linear model

  SSResidual = Σ_{i=1}^n (Yi − Ŷi)²   remaining variation

Simple Linear Regression, Feb 27, 2004 - 6 -


Least Squares Estimation

Example: Body density

Scatter plot with least squares regression line:

[Figure: scatter plot of body density (10³ kg/m³) against skinfold thickness (mm) with the fitted least squares line]

Calculation of least squares estimates:

  x̄ = 1.064   ȳ = 1.568   S_XX = 0.0235   S_XY = −0.2679   S_YY = 4.244   SSResidual = 1.187

  b̂ = S_XY/S_XX = −0.2679/0.0235 = −11.40

  â = ȳ − b̂·x̄ = 1.568 + 11.40·1.064 = 13.70

  σ̂² = SSResidual/(n − 2) = 1.187/90 = 0.0132

  se = √σ̂² = √0.0132 = 0.1149

Simple Linear Regression, Feb 27, 2004 - 7 -


Least Squares Estimation

Example: Body density

Using STATA:

. infile ID BODYD SKINT using bodydens.txt, clear
(92 observations read)

. regress BODYD SKINT

      Source |       SS       df       MS           Number of obs =      92
-------------+------------------------------       F(  1,    90) =  231.89
       Model |  3.05747739     1  3.05747739       Prob > F      =  0.0000
    Residual |  1.18663025    90  .013184781       R-squared     =  0.7204
-------------+------------------------------       Adj R-squared =  0.7173
       Total |  4.24410764    91  .046638546       Root MSE      =  .11482

------------------------------------------------------------------------------
       BODYD |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       SKINT |  -11.41345   .7494999   -15.23   0.000    -12.90246   -9.924433
       _cons |   13.71221   .7975822    17.19   0.000     12.12768    15.29675
------------------------------------------------------------------------------

. twoway (lfitci BODYD SKINT, range(1 1.1)) (scatter BODYD SKINT), xtitle(Skin thickness) ytitle(Body density) scheme(s1color) legend(off)

[Figure: fitted least squares line with pointwise confidence band and the scatter of body density against skin thickness]

Simple Linear Regression, Feb 27, 2004 - 8 -


Properties of Estimators

Statistical properties of â and b̂

Mean and variance of b̂:

  E(b̂) = b,   var(b̂) = σ²/S_XX

Distribution of b̂:

  b̂ ∼ N(b , σ²/S_XX)

Mean and variance of â:

  E(â) = a,   var(â) = (1/n + x̄²/S_XX)·σ²

Distribution of â:

  â ∼ N(a , (1/n + x̄²/S_XX)·σ²)

Recall that

  S_XX = Σ_{i=1}^n (xi − x̄)²

Inference for Regression, Mar 1, 2004 - 1 -


Confidence Intervals

Note that b̂ ∼ N(b, σ²/S_XX). Thus

  (b̂ − b)/(σ/√S_XX) ∼ N(0, 1)

Substituting se for σ, we obtain

  (b̂ − b)/(se/√S_XX) ∼ t_{n−2}

(1 − α) confidence interval for b:

  b̂ ± t_{n−2,α/2}·se/√S_XX

Similarly

  (â − a)/(σ·√(1/n + x̄²/S_XX)) ∼ N(0, 1)

Substituting se for σ, we obtain

  (â − a)/(se·√(1/n + x̄²/S_XX)) ∼ t_{n−2}

(1 − α) confidence interval for a:

  â ± t_{n−2,α/2}·se·√(1/n + x̄²/S_XX)

Inference for Regression, Mar 1, 2004 - 2 -


Tests on the Coefficients

Question: Is b equal to some value b0?

The corresponding testing problem is

$$H_0: b = b_0 \quad \text{versus} \quad H_a: b \neq b_0.$$

The test statistic is given by

$$T_b = \frac{b - b_0}{s_e/\sqrt{S_{XX}}} \sim t_{n-2} \quad \text{under } H_0.$$

The null hypothesis $H_0: b = b_0$ is rejected if

$$|T_b| > t_{n-2,\alpha/2}$$

Question: Is a equal to some value a0?

The corresponding testing problem is

$$H_0: a = a_0 \quad \text{versus} \quad H_a: a \neq a_0.$$

The test statistic is given by

$$T_a = \frac{a - a_0}{s_e\sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{S_{XX}}}} \sim t_{n-2} \quad \text{under } H_0.$$

The null hypothesis $H_0: a = a_0$ is rejected if

$$|T_a| > t_{n-2,\alpha/2}$$
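The regress output already reports exactly these statistics for the null values $b_0 = 0$ and $a_0 = 0$. A test against some other null value, say $b_0 = -10$ (an arbitrary value, chosen only for illustration), can be run with Stata's built-in test command after the regression:

. quietly regress BODYD SKINT
. test SKINT = -10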

Inference for Regression, Mar 1, 2004 - 3 -


Inference for the Coefficients

Example: Body density

The confidence interval for $b$ is given by

$$b \pm t_{n-2,\alpha/2} \cdot \frac{s_e}{\sqrt{S_{XX}}} = -11.41 \pm 1.99 \cdot \frac{\sqrt{0.0132}}{\sqrt{0.0235}} = [-12.90,\ -9.92]$$

The confidence interval for $a$ is given by

$$a \pm t_{n-2,\alpha/2} \cdot s_e \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{XX}}} = 13.71 \pm 1.99 \cdot \sqrt{0.0132} \cdot \sqrt{\frac{1}{92} + \frac{1.064^2}{0.0235}} = [12.12,\ 15.30]$$

Furthermore we find

$$|T_b| = \left|\frac{b}{s_e/\sqrt{S_{XX}}}\right| = 15.22 > t_{90,0.025} = 1.99$$

Thus we reject $H_0: b = 0$ at significance level 0.05: the coefficient $b$ is statistically significantly different from zero.

Similarly

$$|T_a| = \left|\frac{a}{s_e\sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{S_{XX}}}}\right| = 17.26 > t_{90,0.025} = 1.99$$

Thus we reject $H_0: a = 0$ at significance level 0.05: the coefficient $a$ is statistically significantly different from zero.

The corresponding $P$-values are

◦ $P(|T_b| \geq 15.22) \approx 0$

◦ $P(|T_a| \geq 17.26) \approx 0$

Inference for Regression, Mar 1, 2004 - 4 -


Estimating the Mean

In the linear regression model, the mean of $Y$ at $x = x_0$ is given by

$$E(Y) = a + b\,x_0$$

Our estimate for the mean of $Y$ at $x = x_0$ is

$$\hat{Y}_{x_0} = a + b\,x_0.$$

Question: How precise is this estimate?

Note that

$$\hat{Y}_{x_0} = a + b\,x_0 = \bar{Y} + b\,(x_0 - \bar{x}).$$

Hence we obtain

$$E(\hat{Y}_{x_0}) = a + b\,x_0, \qquad \operatorname{var}(\hat{Y}_{x_0}) = \left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{XX}}\right)\sigma^2$$

$(1-\alpha)$ confidence interval for $E(Y)$ at $x = x_0$:

$$(a + b\,x_0) \pm t_{n-2,\alpha/2} \cdot s_e \cdot \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{XX}}}$$

Inference for Regression, Mar 1, 2004 - 5 -


Estimating the Mean

Example: Body density

Suppose the measured skin thickness is x0 = 1.1 mm.

What is the mean body density for this value of skin thickness?

◦ Point estimate:

$$\hat{Y}_{x_0} = a + b\,x_0 = 13.71 - 11.41 \cdot 1.1 = 1.159$$

The estimated mean body density is $1.159 \cdot 10^3$ kg/m$^3$.

◦ Confidence interval:

$$(a + b\,x_0) \pm t_{n-2,\alpha/2} \cdot s_e \cdot \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{XX}}}$$

$$= (13.71 - 11.41 \cdot 1.1) \pm 1.99 \cdot \sqrt{0.0132} \cdot \sqrt{\frac{1}{92} + \frac{(1.1 - 1.06)^2}{0.023}} = [1.09,\ 1.22]$$

In STATA, the standard error for estimating the mean of $Y$ is calculated by passing the option stdp to predict (with $n - 2 = 90$ degrees of freedom for these data):

. predict BDH

. predict SE, stdp

. generate low=BDH-invttail(90,.025)*SE

. generate high=BDH+invttail(90,.025)*SE

. sort SKINT

. graph twoway line low high BDH SKINT, clpattern(dash dash solid) clcolor(black bla
> ck black) || scatter BODYD SKINT, legend(off) scheme(s1color)

[Figure: fitted values BDH with dashed confidence limits for the mean response, plus the data scatter; body density (1 to 2) against SKINT (1.02 to 1.1).]

Inference for Regression, Mar 1, 2004 - 6 -


Prediction

Suppose we want to predict Y at x = x0.

Aim: (1 − α) confidence interval for Y

Note that

$$a + b\,x_0 - Y \sim N\!\left(0,\ \sigma^2\left(1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{XX}}\right)\right)$$

Thus the desired $(1-\alpha)$ confidence interval for $Y_{x_0}$ is given by

$$a + b\,x_0 \pm t_{n-2,\alpha/2} \cdot s_e \cdot \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{XX}}}$$
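The extra 1 under the square root, compared with the interval for the mean, has a simple source (a remark added here for clarity): the prediction error combines the noise in the new observation with the estimation error of the fitted line,

$$\operatorname{var}(a + b\,x_0 - Y) = \underbrace{\sigma^2}_{\text{new observation}} + \underbrace{\left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{XX}}\right)\sigma^2}_{\text{estimation of } a + b\,x_0},$$

since the new $Y$ is independent of the data used to fit the line.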

Inference for Regression, Mar 1, 2004 - 7 -


Prediction

Example: Body density

Suppose the measured skin thickness is x0 = 1.1 mm.

What is the predicted body density for this value of skin thickness?

◦ Point estimate: $\hat{Y}_{x_0} = a + b\,x_0 = 13.71 - 11.41 \cdot 1.1 = 1.159$

The predicted body density is $1.159 \cdot 10^3$ kg/m$^3$.

◦ Confidence interval:

$$(a + b\,x_0) \pm t_{n-2,\alpha/2} \cdot s_e \cdot \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{XX}}}$$

$$= (13.71 - 11.41 \cdot 1.1) \pm 1.99 \cdot \sqrt{0.0132} \cdot \sqrt{1 + \frac{1}{92} + \frac{(1.1 - 1.06)^2}{0.023}} = [0.92,\ 1.40]$$

In STATA, the standard error for predicting $Y$ is calculated by passing the option stdf to predict:

. drop SE low high

. predict SE, stdf

. generate low=BDH-invttail(90,.025)*SE

. generate high=BDH+invttail(90,.025)*SE

. graph twoway line low high BDH SKINT, clpattern(dash dash solid) clcolor(black bla
> ck black) || scatter BODYD SKINT, legend(off) scheme(s1color)

Alternatively, we can use the following command:

. twoway (lfitci BODYD SKINT, range(1 1.1) stdf) (scatter BODYD SKINT),
> xtitle(Skin thickness) ytitle(Body density) scheme(s1color) legend(off)

[Figures: prediction bands from the stdf standard errors (left) and from lfitci with the stdf option (right); body density (1 to 2.5) against skin thickness (1 to 1.1).]

Inference for Regression, Mar 1, 2004 - 8 -


Multiple Regression

Example: Food expenditure and family income

Data:

◦ Sample of 20 households

◦ Food expenditure (response variable)

◦ Family income and family size

. regress food income

-------------------------------------------------------------------------
    food |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------+---------------------------------------------------------------
  income |   .1841099   .0149345    12.33   0.000     .1527336    .2154862
   _cons |  -.4119994   .7637666    -0.54   0.596    -2.016613    1.192615
-------------------------------------------------------------------------

. regress food number

-------------------------------------------------------------------------
    food |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------+---------------------------------------------------------------
  number |   2.287334   .4224493     5.41   0.000     1.399801    3.174867
   _cons |   1.217365   1.410627     0.86   0.399    -1.746252    4.180981
-------------------------------------------------------------------------

[Figures: scatter plots of food expenditure (0 to 20) against income (0 to 120, left) and against family size (0 to 6, right).]

Multiple Regression, Mar 3, 2004 - 1 -


Multiple Regression

Multiple regression model

$$Y_i = b_0 + b_1\,x_{1,i} + b_2\,x_{2,i} + \ldots + b_p\,x_{p,i} + \varepsilon_i, \qquad i = 1, \ldots, n$$

where

◦ $Y_i$ response variable

◦ $x_{1,i}, \ldots, x_{p,i}$ predictor variables (fixed, nonrandom)

◦ $b_0, \ldots, b_p$ regression coefficients

◦ $\varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)$ error variables

Example: Food expenditure and family income

Fitting multiple regression models in STATA:

. regress food income number

  Source |       SS       df       MS           Number of obs =      20
---------+------------------------------       F(  2,    17) =  121.47
   Model |  386.312865     2  193.156433       Prob > F      =  0.0000
  Resid. |  27.0326365    17  1.59015509       R-squared     =  0.9346
---------+------------------------------       Adj R-squared =  0.9269
   Total |  413.345502    19  21.7550264       Root MSE      =   1.261

-------------------------------------------------------------------------
    food |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------+---------------------------------------------------------------
  income |   .1482117   .0163786     9.05   0.000     .1136558    .1827676
  number |   .7931055   .2444411     3.24   0.005     .2773798    1.308831
   _cons |  -1.118295   .6548524    -1.71   0.106    -2.499913    .2633232
-------------------------------------------------------------------------
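Reading the coefficients off this output, the fitted plane (rounded to three decimals) is

$$\widehat{\text{Food}} = -1.118 + 0.148 \cdot \text{Income} + 0.793 \cdot \text{Number}.$$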

Multiple Regression, Mar 3, 2004 - 2 -


Multiple Regression

Example: Food expenditure and family income

Data: $(\text{Food}_i, \text{Income}_i, \text{Number}_i)$, $i = 1, \ldots, 20$

Fitted regression model:

$$\widehat{\text{Food}} = b_0 + b_1\,\text{Income} + b_2\,\text{Number}$$

[Figure: 3D scatter plot of the data with the fitted regression plane; income (0 to 120), family size (0 to 6), and food expenditure (0 to 20), with observed $Y_i$ and fitted $\hat{Y}_i$ indicated.]

The fitted model is a two-dimensional plane, which is difficult to visualize.

Multiple Regression, Mar 3, 2004 - 3 -


Inference for Multiple Regression

Multiple regression model (matrix notation)

$$Y = X\,b + \varepsilon$$

where

◦ $Y$: $n$-dimensional vector

◦ $X$: $n \times (1+p)$-dimensional matrix

◦ $b$: $(1+p)$-dimensional vector

◦ $\varepsilon$: $n$-dimensional vector

Thus the model can be written as

$$\begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{1,1} & \cdots & x_{p,1} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{1,n} & \cdots & x_{p,n} \end{pmatrix} \begin{pmatrix} b_0 \\ \vdots \\ b_p \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$

Least squares approach: Minimize

$$\|Y - \hat{Y}\|^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$

Results:

$$b = (X^T X)^{-1} X^T Y \sim N\!\left(b,\ \sigma^2 (X^T X)^{-1}\right)$$

$$\hat{Y} = X (X^T X)^{-1} X^T Y \sim N\!\left(X b,\ \sigma^2 X (X^T X)^{-1} X^T\right)$$

$$\hat{\varepsilon} = Y - \hat{Y} = \left(I - X (X^T X)^{-1} X^T\right) Y \sim N\!\left(0,\ \sigma^2 \left(I - X (X^T X)^{-1} X^T\right)\right)$$

$$\hat{\sigma}^2 = s_e^2 = \frac{\|Y - \hat{Y}\|^2}{n - p - 1} = \frac{1}{n - p - 1} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$

Here $I$ is the $n \times n$ identity matrix; the residual degrees of freedom are $n - p - 1$ because $X$ has $1 + p$ columns.
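These matrix-form quantities are available after regress as stored results; a short sketch for the food expenditure model (the matrix names bhat and V are mine, not from the original log):

. quietly regress food income number
. matrix bhat = e(b)
. matrix V = e(V)
. matrix list bhat
. matrix list V

Here e(b) holds the least squares estimates $(X^T X)^{-1} X^T Y$ and e(V) the estimated covariance matrix $\hat{\sigma}^2 (X^T X)^{-1}$.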

Details: see a course in regression analysis (STAT 22200) or econometrics.

Multiple Regression, Mar 3, 2004 - 4 -


Inference for Multiple Regression

Example: Food expenditure and family income

Interpretation of regression coefficients

. quietly regress food income

. predict e_food1, residuals

. quietly regress number income

. predict e_num, residuals

. regress e_food1 e_num

------------------------------------------------------------------------
 e_food1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------+--------------------------------------------------------------
   e_num |   .7931055   .2375541     3.34   0.004     .2940229    1.292188
------------------------------------------------------------------------

. quietly regress food number

. predict e_food2, residuals

. quietly regress income number

. predict e_inc, residuals

. regress e_food2 e_inc

------------------------------------------------------------------------
 e_food2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------+--------------------------------------------------------------
   e_inc |   .1482117   .0159172     9.31   0.000      .114771    .1816525
------------------------------------------------------------------------

Result:

◦ $b_j$ measures the dependence of $Y$ on $x_j$ after removing the linear effects of all other predictors $x_k$, $k \neq j$. Note that the slopes above (.7931055 for e_num and .1482117 for e_inc) match the coefficients of number and income in the multiple regression exactly.

◦ $b_j = 0$ if $x_j$ provides no information for the prediction of $Y$ beyond the information given by the other predictor variables.

Multiple Regression, Mar 3, 2004 - 5 -


Multiple Regression

Example: Heart catheterization

Description: A Teflon tube (catheter) 3 mm in diameter is passed into a major vein or artery at the femoral region and pushed up into the heart to obtain information about the heart's physiology and functional ability. The length of the catheter is typically determined by a physician's educated guess.

Data:

◦ Study with 12 children with congenital heart defects

◦ Exact required catheter length was measured using a fluoroscope

◦ Patient's height and weight were recorded

Question: How accurately can catheter length be determined by height and weight?

[Figures: scatter plots of distance to the pulmonary artery (20 to 50 cm) against height (30 to 60 in, left) and against weight (20 to 80 lb, right).]

Multiple Regression, Mar 3, 2004 - 6 -


Multiple Regression

Example: Heart catheterization (contd)

Regression model:

$$Y = b_0 + b_1\,x_1 + b_2\,x_2 + \varepsilon$$

where

◦ $Y$ - distance to pulmonary artery

◦ $x_1$ - height

◦ $x_2$ - weight

STATA regression output:

. regress distance height weight

      Source |       SS       df       MS           Number of obs =      12
-------------+------------------------------       F(  2,     9) =   18.62
       Model |   578.81613     2  289.408065       Prob > F      =  0.0006
    Residual |  139.913037     9   15.545893       R-squared     =  0.8053
-------------+------------------------------       Adj R-squared =  0.7621
       Total |  718.729167    11  65.3390152       Root MSE      =  3.9428

------------------------------------------------------------------------------
    distance |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      height |   .1963566   .3605845     0.54   0.599    -.6193422    1.012056
      weight |   .1908278    .165164     1.16   0.278    -.1827991    .5644547
       _cons |     21.0084   8.751156     2.40   0.040     1.211907    40.80489
------------------------------------------------------------------------------

Note:

◦ Neither height nor weight appears to be significant for predicting the distance to the pulmonary artery.

◦ Nevertheless, the regression on both variables explains about 81% of the variation of the response (catheter length).

Multiple Regression, Mar 3, 2004 - 7 -


Multiple Regression

Example: Heart catheterization (contd)

Consider predicting the length by height alone and by weight alone:

. regress distance height
                                                    R-squared     =  0.7765
------------------------------------------------------------------------------
    distance |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      height |   .5967612   .1012558     5.89   0.000     .3711492    .8223732
       _cons |    12.12405   4.247174     2.85   0.017     2.660752    21.58734
------------------------------------------------------------------------------

. regress distance weight
                                                    R-squared     =  0.7989
------------------------------------------------------------------------------
    distance |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |   .2772687   .0439881     6.30   0.000     .1792571    .3752804
       _cons |    25.63746   2.004207    12.79   0.000     21.17181    30.10311
------------------------------------------------------------------------------

Note:

◦ In a simple regression of $Y$ on either height or weight, the explanatory variable is highly significant for predicting $Y$.

◦ In a multiple regression of $Y$ on height and weight, the coefficients of both height and weight are not significantly different from zero.

Problem: The explanatory variables are highly linearly dependent (collinear).

[Figure: scatter plot of weight (20 to 80 lb) against height (20 to 70 in), showing the strong linear relationship between the two predictors.]

Multiple Regression, Mar 3, 2004 - 8 -


Analysis of Variance

Decomposition of variation:

◦ $SS_{\text{Total}} = \sum_i (Y_i - \bar{Y})^2$ - total variation

◦ $SS_{\text{Residual}} = \sum_i (Y_i - \hat{Y}_i)^2$ - variation remaining in the regression model

◦ $SS_{\text{Model}} = SS_{\text{Total}} - SS_{\text{Residual}} = \sum_i (\hat{Y}_i - \bar{Y})^2$ - variation explained by the regression

Coefficient of determination: The ratio

$$R^2 = \frac{SS_{\text{Model}}}{SS_{\text{Total}}}$$

indicates how well the regression model predicts the response. $R^2$ is also the squared multiple correlation coefficient; in a simple linear regression we have

$$R^2 = \rho_{XY}^2.$$

Example: Heart catheterization

      Source |       SS       df       MS           Number of obs =      12
-------------+------------------------------       F(  2,     9) =   18.62
       Model |   578.81613     2  289.408065       Prob > F      =  0.0006
    Residual |  139.913037     9   15.545893       R-squared     =  0.8053
-------------+------------------------------       Adj R-squared =  0.7621
       Total |  718.729167    11  65.3390152       Root MSE      =  3.9428

The coefficient of determination for these data is

$$R^2 = \frac{578.82}{718.73} = 0.81.$$

Regression on height and weight explains 81% of the variation of distance.
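$R^2$ is also among Stata's stored results after regress; a one-line check (a sketch):

. quietly regress distance height weight
. display e(r2)

which returns 0.8053, matching the R-squared entry in the output above.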

Multiple Regression, Mar 3, 2004 - 9 -


Analysis of Variance

Question: Is the improvement in prediction (the decrease in variation) significant?

Our null hypothesis is that none of the explanatory variables helps to predict the response, that is,

$$H_0: b_1 = \ldots = b_p = 0 \quad \text{versus} \quad H_a: b_j \neq 0 \text{ for some } j \in \{1, \ldots, p\}.$$

Under the null hypothesis $H_0$, the $F$ statistic

$$F = \frac{n-p-1}{p} \cdot \frac{SS_{\text{Model}}}{SS_{\text{Residual}}} = \frac{n-p-1}{p} \cdot \frac{SS_{\text{Total}} - SS_{\text{Residual}}}{SS_{\text{Residual}}}$$

is $F$ distributed with $p$ and $n-p-1$ degrees of freedom.

The null hypothesis $H_0$ is rejected at level $\alpha$ if $F > F_{p,n-p-1,\alpha}$.

Example: Heart catheterization

      Source |       SS       df       MS           Number of obs =      12
-------------+------------------------------       F(  2,     9) =   18.62
       Model |   578.81613     2  289.408065       Prob > F      =  0.0006
    Residual |  139.913037     9   15.545893       R-squared     =  0.8053
-------------+------------------------------       Adj R-squared =  0.7621
       Total |  718.729167    11  65.3390152       Root MSE      =  3.9428

The value of the $F$ statistic is

$$F = \frac{9}{2} \cdot \frac{578.82}{139.91} = 18.62.$$

The critical value for rejecting $H_0: b_1 = b_2 = 0$ is $F_{2,9,0.05} = 4.26$. Thus the null hypothesis $H_0$ that both coefficients $b_1$ and $b_2$ are zero is rejected at significance level $\alpha = 0.05$.
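Since $SS_{\text{Model}}/SS_{\text{Residual}} = R^2/(1 - R^2)$, the $F$ statistic can equivalently be written in terms of $R^2$ (a standard identity, added here for reference):

$$F = \frac{n-p-1}{p} \cdot \frac{R^2}{1-R^2} = \frac{9}{2} \cdot \frac{0.8053}{1-0.8053} \approx 18.6.$$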

Multiple Regression, Mar 3, 2004 - 10 -


Comparing Models

Example: Cobb-Douglas production function

$$Y = t \cdot K^a \cdot L^b \cdot M^c$$

where

◦ $Y$ - output

◦ $K$ - capital

◦ $L$ - labour

◦ $M$ - materials

Regression model:

$$\log Y = \log t + a \log K + b \log L + c \log M$$

[Figures: scatter plots of $Y$ against $K$ (0 to 1), against $L$ ($-0.2$ to 0.6), and against $M$ ($-0.2$ to 1).]

Multiple Regression, Mar 3, 2004 - 11 -


Comparing Models

Example: Cobb-Douglas production function (contd)

Regression model $M_0$ for the Cobb-Douglas function:

$$\log Y = \log t + a \log K + b \log L + c \log M$$

. regress LY LK LM LL

  Source |       SS       df       MS           Number of obs =      25
---------+------------------------------       F(  3,    21) =  138.98
   Model |  1.35136742     3  .450455808       Prob > F      =  0.0000
Residual |  .068065609    21  .003241219       R-squared     =  0.9520
---------+------------------------------       Adj R-squared =  0.9452
   Total |  1.41943303    24  .059143043       Root MSE      =  .05693

-------------------------------------------------------------------------
      LY |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------+---------------------------------------------------------------
      LK |   .0718626   .1543912     0.47   0.646    -.2492114    .3929366
      LM |   .7072231   .3004146     2.35   0.028     .0824768    1.331969
      LL |   .2117778   .4248755     0.50   0.623    -.6717991    1.095355
   _cons |   .0347117   .0374354     0.93   0.364    -.0431395    .1125629

The two variables $\log K$ and $\log L$ do not significantly improve the prediction of $\log Y$.

Alternative model $M_1$:

$$\log Y = \log t + c \log M$$

. regress LY LM

  Source |       SS       df       MS           Number of obs =      25
---------+------------------------------       F(  1,    23) =  445.69
   Model |  1.34977753     1  1.34977753       Prob > F      =  0.0000
Residual |  .069655501    23    .0030285       R-squared     =  0.9509
---------+------------------------------       Adj R-squared =  0.9488
   Total |  1.41943303    24  .059143043       Root MSE      =  .05503

-------------------------------------------------------------------------
      LY |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------+---------------------------------------------------------------
      LM |   .9086794   .0430421    21.11   0.000       .81964    .9977188
   _cons |   .0512244   .0189767     2.70   0.013      .011968    .0904808

Question: Is model M0 significantly better than model M1?

Multiple Regression, Mar 3, 2004 - 12 -


Comparing Models

Consider the multiple regression model with $p$ explanatory variables

$$Y_i = b_0 + b_1\,x_{1,i} + \ldots + b_p\,x_{p,i} + \varepsilon_i.$$

Problem:

Test the null hypothesis

$H_0$: the $q$ specified explanatory variables all have zero coefficients

versus

$H_a$: at least one of these $q$ explanatory variables has a nonzero coefficient.

Solution:

◦ Regress $Y$ on all $p$ explanatory variables and read $SS^{(1)}_{\text{Residual}}$ from the output.

◦ Regress $Y$ on the $p - q$ explanatory variables that remain after you remove the $q$ variables from the model. Read $SS^{(2)}_{\text{Residual}}$ from the output.

◦ The test statistic is

$$F = \frac{n-p-1}{q} \cdot \frac{SS^{(2)}_{\text{Residual}} - SS^{(1)}_{\text{Residual}}}{SS^{(1)}_{\text{Residual}}}.$$

Under the null hypothesis, $F$ is $F$ distributed with $q$ and $n-p-1$ degrees of freedom.

◦ Reject if $F > F_{q,n-p-1,\alpha}$.

Multiple Regression, Mar 3, 2004 - 13 -


Comparing Models

Example: Cobb-Douglas production function

Comparison of models M0 and M1:

◦ $M_0$ (full model): $SS^{(1)}_{\text{Residual}} = 0.06807$, with $n - p - 1 = 21$.

◦ $M_1$ (reduced model): $SS^{(2)}_{\text{Residual}} = 0.06966$, with $q = 2$ variables removed.

$$F = \frac{21}{2} \cdot \frac{0.06966 - 0.06807}{0.06807} = 0.2453$$

◦ Since $F < F_{2,21,0.05} = 3.47$, we cannot reject $H_0: a = b = 0$.

Using STATA:

. test LK LL

 ( 1)  LK = 0
 ( 2)  LL = 0

       F(  2,    21) =    0.25
            Prob > F =    0.7847

. test LK LL _cons

 ( 1)  LK = 0
 ( 2)  LL = 0
 ( 3)  _cons = 0

       F(  3,    21) =    2.43
            Prob > F =    0.0934

Multiple Regression, Mar 3, 2004 - 14 -


Case Study

Example: Headaches and pain reliever

◦ 24 patients with a common type of headache were treated with a new pain reliever

◦ Medication was given to each patient at one of four dosage levels: 2, 5, 7, or 10 grams

◦ Response variable: time until noticeable relief (in minutes)

◦ Other explanatory variables:

⋄ sex (0=female, 1=male)

⋄ blood pressure (0.25=low, 0.50=medium, 0.75=high)

Box plots

[Figure: box plots of time to relief (0 to 60 minutes) by sex within each dosage level (2, 5, 7, and 10 grams).]

Multiple Regression II, Mar 5, 2004 - 1 -


Case Study

. regress time dose bp if sex==0
                                                    R-squared     =  0.8861
--------------------------------------------------------------------------
    time |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------+----------------------------------------------------------------
    dose |  -5.519608   .6608907    -8.35   0.000    -7.014646   -4.024569
      bp |         -5   9.439407    -0.53   0.609    -26.35342    16.35342
   _cons |   61.11765   6.458495     9.46   0.000     46.50752    75.72778
--------------------------------------------------------------------------

. predict YHf
(option xb assumed; fitted values)

. twoway line YHf dose if bp==0.25||line YHf dose if bp==0.5||
> line YHf dose if bp==0.75||scatter time dose if(sex==0), saving(a, replace)
(file a.gph saved)

. regress time dose bp if sex==1
                                                    R-squared     =  0.5765
--------------------------------------------------------------------------
    time |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------+----------------------------------------------------------------
    dose |  -3.343137   .9564492    -3.50   0.007    -5.506776   -1.179499
      bp |       -2.5   13.66083    -0.18   0.859    -33.40294    28.40294
   _cons |   51.39216   9.346814     5.50   0.000      30.2482    72.53612
--------------------------------------------------------------------------

. predict YHm
(option xb assumed; fitted values)

. twoway line YHm dose if bp==0.25||line YHm dose if bp==0.5||
> line YHm dose if bp==0.75||scatter time dose if(sex==1), saving(b, replace)
(file b.gph saved)

. graph combine a.gph b.gph

[Figure: fitted lines by blood pressure level and observed times (0 to 60 minutes) against dose (2 to 10 grams), separately for females (left) and males (right).]

Multiple Regression II, Mar 5, 2004 - 2 -


Case Study

Model:

Time = Dose + Sex + Sex · Dose + BP + ε

. infile time dose sex bp using headache.dat
(24 observations read)

. generate sexdose=sex*dose

. regress time dose sex sexdose bp

    Source |       SS       df       MS           Number of obs =      24
----------+------------------------------        F(  4,    19) =   16.78
     Model |  4387.65319     4   1096.9133       Prob > F      =  0.0000
  Residual |  1242.30515    19  65.3844814       R-squared     =  0.7793
----------+------------------------------        Adj R-squared =  0.7329
     Total |  5629.95833    23  244.780797       Root MSE      =  8.0861

---------------------------------------------------------------------------
      time |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
----------+----------------------------------------------------------------
      dose |  -5.519608   .8006399    -6.89   0.000    -7.195367   -3.843849
       sex |   -8.47549   7.553222    -1.12   0.276    -24.28457    7.333585
   sexdose |   2.176471   1.132276     1.92   0.070      -.19341    4.546351
        bp |      -3.75   8.086067    -0.46   0.648    -20.67433    13.17433
     _cons |   60.49265   6.698634     9.03   0.000     46.47224    74.51305
---------------------------------------------------------------------------

Note that the fitted dose slope is $-5.52$ for females and $-5.52 + 2.18 = -3.34$ for males, matching the two separate regressions on the previous slide.

. predict YH
(option xb assumed; fitted values)

. predict E, residuals

Residual plots: residuals against dose, and a normal Q-Q plot of the residuals:

[Figures: residuals ($-10$ to 15 minutes) against dose (2 to 10 grams, left) and a normal Q-Q plot of the residuals (right).]

Multiple Regression II, Mar 5, 2004 - 3 -


Case Study

Model:

Time = Dose + Dose² + Sex + Sex · Dose + BP + ε

. drop YH E

. generate dosesq=dose^2

. regress time dose sex sexdose dosesq bp

    Source |       SS       df       MS           Number of obs =      24
----------+------------------------------        F(  5,    18) =   24.20
     Model |  4901.02819     5  980.205637       Prob > F      =  0.0000
  Residual |  728.930147    18  40.4961193       R-squared     =  0.8705
----------+------------------------------        Adj R-squared =  0.8346
     Total |  5629.95833    23  244.780797       Root MSE      =  6.3637

---------------------------------------------------------------------------
      time |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
----------+----------------------------------------------------------------
      dose |  -12.91961   2.171775    -5.95   0.000    -17.48234   -8.356878
       sex |   -8.47549   5.944312    -1.43   0.171    -20.96403    4.013047
   sexdose |   2.176471   .8910901     2.44   0.025     .3043598    4.048581
    dosesq |   .6166667   .1731968     3.56   0.002     .2527937    .9805396
        bp |      -3.75   6.363656    -0.59   0.563    -17.11955    9.619545
     _cons |   77.45098   7.104701    10.90   0.000     62.52456     92.3774
---------------------------------------------------------------------------

. predict E, residuals

[Figures: residuals ($-10$ to 10 minutes) against dose (2 to 10 grams, left) and a normal Q-Q plot of the residuals (right).]

. test sex bp

 ( 1)  sex = 0
 ( 2)  bp = 0

       F(  2,    18) =    1.19
            Prob > F =    0.3270

Multiple Regression II, Mar 5, 2004 - 4 -


Case Study

Model:

Time = Dose + Dose² + Sex · Dose + ε

. regress time dose sexdose dosesq

    Source |       SS       df       MS           Number of obs =      24
----------+------------------------------        F(  3,    20) =   38.81
     Model |  4804.63916     3  1601.54639       Prob > F      =  0.0000
  Residual |  825.319178    20  41.2659589       R-squared     =  0.8534
----------+------------------------------        Adj R-squared =  0.8314
     Total |  5629.95833    23  244.780797       Root MSE      =  6.4239

---------------------------------------------------------------------------
      time |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
----------+----------------------------------------------------------------
      dose |  -12.34823   2.154675    -5.73   0.000     -16.8428   -7.853653
   sexdose |   1.033708   .3931338     2.63   0.016     .2136452    1.853771
    dosesq |   .6166667   .1748353     3.53   0.002     .2519667    .9813667
     _cons |   71.33824   5.667294    12.59   0.000     59.51647       83.16
---------------------------------------------------------------------------

. predict YH
(option xb assumed; fitted values)

. twoway line YH dose if sex==0|| line YH dose if sex==1,
> legend(label(1 "female") label(2 "male"))

[Figure: fitted time to relief (0 to 60 minutes) against dose (2 to 10 grams), one line per sex.]

Multiple Regression II, Mar 5, 2004 - 5 -


Comparing Several Means

Example: Comparison of laboratories

◦ Task: Measure amount of chlorpheniramine maleate in tablets

◦ Seven laboratories were asked to make 10 determinations of one tablet

◦ Study consistency between labs and variability of measurements

Box plot

[Figure: box plots of the measured amount of chlorpheniramine (3.80 to 4.10 mg) for Labs 1 through 7.]

One-Way Analysis of Variance, Mar 8, 2004 - 1 -


Comparing Several Means

Example: Comparison of drugs

◦ Experimental study of drugs to relieve itching

◦ Five drugs were compared to a placebo and no drug

◦ Ten volunteer male subjects

◦ Each subject underwent one treatment per day (randomized order)

◦ Drug or placebo were given intravenously

◦ Itching was induced on forearms with cowage

◦ Subjects recorded duration of itching

Box plot

[Figure: box plots of duration of itching (100 to 400 seconds) for the seven treatments: no drug, placebo, papaverine, morphine, aminophylline, pentobarbital, and tripelennamine.]

One-Way Analysis of Variance, Mar 8, 2004 - 2 -


Comparing Several Means

. infile amount lab using labs.txt
(70 observations read)

. graph box amount, over(lab)

. oneway amount lab, bonferroni tabulate

            |      Summary of amount
        lab |      Mean   Std. Dev.      Freq.
------------+------------------------------------
          1 |     4.062   .03259178         10
          2 |     3.997   .08969706         10
          3 |     4.003   .02311808         10
          4 |     3.920   .03333330         10
          5 |     3.957   .05716445         10
          6 |     3.955   .06704064         10
          7 |     3.998   .08482662         10
------------+------------------------------------
      Total | 3.9845715   .07184294         70

                      Analysis of Variance
    Source            SS        df       MS          F     Prob > F
------------------------------------------------------------------------
Between groups     .1247371      6   .020789517     5.66     0.0001
 Within groups    .231400073    63   .003673017
------------------------------------------------------------------------
    Total         .356137173    69   .005161408

Bartlett's test for equal variances:  chi2(6) = 24.3697  Prob>chi2 = 0.000
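For reference, the F statistic in this table is simply the ratio of the two mean squares (a check added here, not part of the original output):

$$F = \frac{MS_{\text{between}}}{MS_{\text{within}}} = \frac{0.020789517}{0.003673017} = 5.66.$$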

Comparison of amount by lab (Bonferroni)

Row Mean-|
Col Mean |        1        2        3        4        5        6
---------+------------------------------------------------------------------
       2 |    -.065
         |    0.408
         |
       3 |    -.059     .006
         |    0.698    1.000
         |
       4 |    -.142    -.077    -.083
         |    0.000    0.127    0.068
         |
       5 |    -.105     -.04    -.046     .037
         |    0.005    1.000    1.000    1.000
         |
       6 |    -.107    -.042    -.048     .035    -.002
         |    0.004    1.000    1.000    1.000    1.000
         |
       7 |    -.064     .001    -.005     .078     .041     .043
         |    0.448    1.000    1.000    0.115    1.000    1.000

One-Way Analysis of Variance, Mar 8, 2004 - 3 -


Comparing Several Means

. oneway duration drug, bonferroni tabulate

            |     Summary of duration
       drug |      Mean   Std. Dev.      Freq.
------------+------------------------------------
          1 |     191.0   54.861442         10
          2 |     204.8  105.723750         10
          3 |     118.2   52.809511         10
          4 |     148.0   44.738748         10
          5 |     144.3   42.076782         10
          6 |     176.5   68.856130         10
          7 |     167.2   67.499465         10
------------+------------------------------------
      Total | 164.28571   68.463709         70

                      Analysis of Variance
    Source            SS        df       MS          F     Prob > F
------------------------------------------------------------------------
Between groups    53012.8857     6   8835.48095     2.06     0.0708
 Within groups    270409.4      63   4292.2127
------------------------------------------------------------------------
    Total         323422.286    69   4687.2795

Bartlett's test for equal variances:  chi2(6) = 11.3828  Prob>chi2 = 0.077

Comparison of duration by drug (Bonferroni)

Row Mean-|
Col Mean |        1        2        3        4        5        6
---------+------------------------------------------------------------------
       2 |     13.8
         |    1.000
         |
       3 |    -72.8    -86.6
         |    0.328    0.092
         |
       4 |      -43    -56.8     29.8
         |    1.000    1.000    1.000
         |
       5 |    -46.7    -60.5     26.1     -3.7
         |    1.000    0.904    1.000    1.000
         |
       6 |    -14.5    -28.3     58.3     28.5     32.2
         |    1.000    1.000    1.000    1.000    1.000
         |
       7 |    -23.8    -37.6       49     19.2     22.9     -9.3
         |    1.000    1.000    1.000    1.000    1.000    1.000

One-Way Analysis of Variance, Mar 8, 2004 - 4 -