
Page 1: Stat Methods

Some definitions

◦ Individual: each object described by a set of data

◦ Variable: any characteristic of an individual

⋄ Categorical variable: places an individual into one of several

groups or categories.

⋄ Quantitative variable: takes numerical values on which we can

do arithmetic.

◦ Distribution of a variable: tells what values it takes and how often

it takes these values.

Example:

The following data set consists of five variables about 20 individuals.

ID  Age  Education  Sex  Total income  Job class
 1   43      4       1      18526         5
 2   35      3       2       5400         7
 3   43      2       1       3900         7
 4   33      3       1      28003         5
 5   38      3       2      43900         7
 6   53      4       1      53000         5
 7   64      6       1      51100         6
 8   27      4       2      44000         5
 9   34      4       1      31200         5
10   27      3       2      26030         5
11   47      6       1       6000         6
12   48      3       1       8145         5
13   39      2       1      37032         5
14   30      3       2      30000         5
15   35      3       2      17874         5
16   47      4       2        400         5
17   51      4       2      22216         5
18   56      5       1      26000         6
19   57      6       1     100267         7
20   34      1       1      15000         5

Age: age in years
Education: 1=no high school, 2=some high school, 3=high school diploma, 4=some college, 5=bachelor’s degree, 6=postgraduate degree
Sex: 1=male, 2=female
Total income: income from all sources
Job class: 5=private sector, 6=government, 7=self employed

Variables Age and Total income are quantitative, variables Education, Sex,

and Job class are categorical.

Graphical Description of Data, Jan 5, 2004 - 1 -

Page 2: Stat Methods

Categorical variable analysis

Questions to ask about a categorical variable:

◦ How many categories are there?

◦ In each category, how many observations are there?

Bar graphs and pie charts

Categorical data can be displayed by bar graphs or pie charts.

◦ In a bar graph, the horizontal axis lists the categories, in any order.

The height of the bars can be either counts or percentages.

◦ For better comparison of the frequencies, the categories can be ordered from most frequent to least frequent.

◦ In a pie chart, the area of each slice is proportional to the percentage

of individuals who fall into that category.

Example: Education of people aged 25 to 34

[Figures: bar graph of education level (y-axis: percent of people aged 25 to 34, 0-30; x-axis: education level) with the categories in their original order; the same bar graph with the categories ordered from most to least frequent (HS diploma, Bachelor’s, some college, some HS, postgrad, no HS); and a pie chart of the same data (no HS 3.6%, some HS 7.5%, HS diploma 30.4%, Bachelor’s 29.1%, some college 22.7%, postgrad 6.7%).]

Graphical Description of Data, Jan 5, 2004 - 2 -

Page 3: Stat Methods

Categorical variable analysis

Example: Education of people aged 25 to 34

STATA commands:

. infile ID AGE EDUC SEX EARN JOB using individuals.txt, clear

. drop if AGE<25 | AGE>34

. label values EDUC Education

. label define Education 1 "no HS" 2 "some HS" 3 "HS diploma" 4 "Bachelor’s"

> 5 "some college" 6 "postgrad"

. set scheme s1mono

. gen COUNT=100/_N

. graph bar (sum) COUNT, over(EDUC) ytitle("Percent of people aged 25 to 34")

> b1title("Education level")

. translate @Graph bar1.eps, translator(Graph2eps) replace

. graph bar (sum) COUNT, over(EDUC, sort(1) descending)

> ytitle("Percent of people aged 25 to 34") b1title("Education level")

. translate @Graph bar2.eps, translator(Graph2eps) replace

. set scheme s1color

. graph pie COUNT, over(EDUC) plabel(_all perc, format(%4.1f) gap(-5))

. translate @Graph pie.eps, translator(Graph2eps) replace

Graphical Description of Data, Jan 5, 2004 - 3 -

Page 4: Stat Methods

Quantitative variables: stemplots

Example: Sammy Sosa home runs

Producing stemplots in STATA:

. infile YEAR HR using sosa.dat

. stem HR

Stem-and-leaf plot for HR

0* | 48

1* | 05

2* | 5

3* | 366

4* | 009

5* | 0

6* | 346

Year  Home runs
1989       4
1990      15
1991      10
1992       8
1993      33
1994      25
1995      36
1996      40
1997      36
1998      66
1999      63
2000      50
2001      64
2002      49
2003      40

How to make a stemplot

1. Separate each observation into a stem and a leaf.

e.g. 15 → stem 1, leaf 5; and 4 → stem 0, leaf 4.

2. Write the stems in a vertical column in increasing order.

3. Write each leaf next to its stem, in increasing order out from the stem.

How to choose the stem

◦ Rounding: each leaf should have exactly one digit, so rounding long

numbers before producing the stemplot can help produce a more com-

pact and informative plot.

◦ Splitting: if each stem (or many stems) have a large number of leaves,

all stems can be split, with leaves of 0-4 going to the first stem and 5-9

going to the second.
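In Stata, splitting is available through the stem command’s lines() option (a minimal sketch, assuming the HR variable from the Sosa example is in memory; lines(2) puts leaves 0-4 on the first of two stems and 5-9 on the second):

. stem HR, lines(2)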

Graphical Description of Data, Jan 5, 2004 - 4 -

Page 5: Stat Methods

Quantitative variables: histograms

How to make a histogram

1. Group the observations into “bins” according to their value. Choose the bins carefully: too few bins hide detail; too many fragment the overall pattern.

2. Count the individuals in each bin.

3. Draw the histogram

◦ Leave no space between bars.

◦ Label the axes with units of measurement.

◦ The y-axis can show counts or percentages (per unit).

Example: Sammy Sosa home runs

Year  Home runs
1989       4
1990      15
1991      10
1992       8
1993      33
1994      25
1995      36
1996      40
1997      36
1998      66
1999      63
2000      50
2001      64
2002      49
2003      40

[Figure: density histogram of the home runs (x-axis: Home runs, 0-70; y-axis: Density).]

The area of each bar is proportional to the percentage of data in that range.

We care about the area, not the height, but when the bars have equal width,

area is determined by the height.

For simplicity, use equally spaced bins.

Graphical Description of Data, Jan 5, 2004 - 5 -

Page 6: Stat Methods

Quantitative variables: histograms

Example: Sammy Sosa home runs

Histograms with different bin widths:

[Figures: four histograms of the Sosa home runs drawn with different bin widths (x-axis: Home Runs, 0-70; y-axis: Percentage, 0.00-0.07).]

Producing histograms in STATA:

. infile YEAR HR using sosa.dat

. hist HR, start(0.1) width(10) xlabel(0(10)70) xtitle(Home runs)

. translate @Graph hist1.eps, translator(Graph2eps) replace

. hist HR, start(0.1) width(10) xlabel(0(10)70) xtitle(Home runs) freq

. translate @Graph hist2.eps, translator(Graph2eps) replace

[Figures: the two resulting histograms, one on the density scale and one on the frequency scale (x-axis: Home runs, 0-70).]

Why is a histogram not a bar graph?

◦ Frequencies are represented by area, not height.

◦ There is no space between the bars.

◦ The horizontal axis represents a numerical quantity, with an inherent

order.

Graphical Description of Data, Jan 5, 2004 - 6 -

Page 7: Stat Methods

Interpreting histograms

◦ Describe the overall pattern and any significant deviations from that

pattern.

◦ Shape: Is the distribution (approximately) symmetric or skewed?

[Figure: frequency histogram of a right-skewed variable x (x-axis: 0.0-2.0; y-axis: Frequency, 0-2000).]

This distribution is skewed right

because it has a long right-hand

tail.

◦ Center: Where is the “middle” of the distribution?

◦ Spread: What are the smallest and largest values?

◦ Outliers: Are there any observations that lie outside the overall pat-

tern? They could be unusual observations, or they could be mistakes.

Check them!

Example: Newcomb’s measurements of the passage time of light (IPS Table 1.1)

[Figure: histogram of Newcomb’s measurements (x-axis: Time, −60 to 60; y-axis: Frequency, 0-25).]

Graphical Description of Data, Jan 5, 2004 - 7 -

Page 8: Stat Methods

Time plots

Example: Average retail price of gasoline from Jan 1988 to Apr 2001

[Figure: time plot of the average retail gasoline price (y-axis: Retail gasoline price, 0.9-1.8; x-axis: Year, 1988-2000).]

Note: Whenever data are collected over time, it is a good idea to have

a time plot. Stemplots and histograms ignore time order, which can be

misleading when systematic change over time exists.

Producing a time plot in STATA:

. infile PRICE using gasoline.txt, clear
. gen T = _n - 1    // month index (0 = Jan 1988), assumed; needed by the plot below
. graph twoway line PRICE T, ylabel(0.9(0.1)1.8, format(%3.1f)) xtick(0(12)159)

> xlabel(0 "1988" 24 "1990" 48 "1992" 72 "1994" 96 "1996" 120 "1998" 144 "2000")

> xtitle(Year) ytitle(Retail gasoline price)

Graphical Description of Data, Jan 5, 2004 - 8 -

Page 9: Stat Methods

Measures of center

The mean

The mean of a distribution is the arithmetic average of the observations:

\bar{x} = \frac{x_1 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i

The median

The median is the midpoint of a distribution: the number M

such that

◦ half the observations are smaller and

◦ half are larger.

How to find the median

Suppose the observations are x1, x2, . . . , xn.

1. Arrange the data in increasing order and let x(i) denote the ith

smallest observation.

2. If the number of observations n is odd, the median is the center observation in the ordered list:

M = x_{((n+1)/2)}

3. If the number of observations n is even, the median is the average of the two center observations in the ordered list:

M = \frac{x_{(n/2)} + x_{(n/2+1)}}{2}

Numerical Description of Data, Jan 7, 2004 - 1 -

Page 10: Stat Methods

Measures of center

Examples:

Data set 1:

x1 x2 x3 x4 x5 x6 x7 x8 x9

2 4 3 4 6 5 4 -6 5

Arrange in increasing order:

x(1) x(2) x(3) x(4) x(5) x(6) x(7) x(8) x(9)

-6 2 3 4 4 4 5 5 6

There is an odd number of observations, so the median is

M = x_{((n+1)/2)} = x_{(5)} = 4.

The mean is given by

\bar{x} = \frac{2 + 4 + 3 + 4 + 6 + 5 + 4 + (-6) + 5}{9} = \frac{27}{9} = 3.

Data set 2:

x1 x2 x3 x4 x5 x6 x7 x8 x9 x10

2.3 8.8 3.9 4.1 6.4 5.9 4.2 2.9 1.3 5.1

Arrange in increasing order:

x(1) x(2) x(3) x(4) x(5) x(6) x(7) x(8) x(9) x(10)

1.3 2.3 2.9 3.9 4.1 4.2 5.1 5.9 6.4 8.8

There is an even number of observations, so the median is

M = \frac{x_{(n/2)} + x_{(n/2+1)}}{2} = \frac{x_{(5)} + x_{(6)}}{2} = \frac{4.1 + 4.2}{2} = 4.15.

The mean is given by

\bar{x} = \frac{2.3 + 8.8 + 3.9 + 4.1 + 6.4 + 5.9 + 4.2 + 2.9 + 1.3 + 5.1}{10} = \frac{44.9}{10} = 4.49.
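As a quick check, both values for data set 2 can be reproduced in Stata (a minimal sketch; summarize with the detail option reports the mean and the median, labelled 50%):

. clear
. input x
2.3
8.8
3.9
4.1
6.4
5.9
4.2
2.9
1.3
5.1
end
. summarize x, detail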

Numerical Description of Data, Jan 7, 2004 - 2 -

Page 11: Stat Methods

Mean versus median

◦ The mean is easy to work with algebraically, while the median

is not.

◦ The mean is sensitive to extreme observations, while the median

is more robust.

Example:

[Dotplot on a number line from 0 to 10: original observations at 0, 1, 2; in the modified data the observation at 2 is moved out to 10.]

The original mean and median are

\bar{x} = \frac{0 + 1 + 2}{3} = 1 \quad\text{and}\quad M = x_{((n+1)/2)} = 1

The modified mean and median are

\bar{x} = \frac{0 + 1 + 10}{3} = 3\tfrac{2}{3} \quad\text{and}\quad M = x_{((n+1)/2)} = 1

◦ If the distribution is exactly symmetric, then mean=median.

◦ In a skewed distribution, the mean is further out in the longer

tail than the median.

◦ The median is preferable for strongly skewed distributions, or

when outliers are present.

Numerical Description of Data, Jan 7, 2004 - 3 -

Page 12: Stat Methods

Measures of spread

Example: Monthly returns on two stocks

[Figures: histograms of the returns (in %) on Stock A and Stock B (x-axis: −10 to 20; y-axis: Frequency, 0-40). Stock B’s distribution is visibly more spread out.]

         Stock A   Stock B
Mean       4.95      4.82
Median     4.99      4.68

The distributions of the two stocks have approximately the same

mean and median, but stock B is more volatile and thus more risky.

◦ Measures of center alone are an insufficient description of a

distribution and can be misleading

◦ The simplest useful numerical description of a distribution con-

sists of both a measure of center and a measure of spread.

Common measures of spread are

◦ the quartiles and the interquartile range

◦ the standard deviation

Numerical Description of Data, Jan 7, 2004 - 4 -

Page 13: Stat Methods

Quartiles

Quartiles divide data into 4 even parts

◦ Lower (or first) quartile QL:

median of all observations less than the median M

◦ Middle (or second) quartile M = QM :

median of all observations

◦ Upper (or third) quartile QU: median of all observations greater than the median M

◦ Interquartile range IQR = QU − QL: the distance between the upper and lower quartiles

How to find the quartiles

1. Arrange the data in increasing order and find the median M

2. Find the median of the observations to the left of M; that is the lower quartile, QL.

3. Find the median of the observations to the right of M; that is the upper quartile, QU.

Examples:

Data set:

x1 x2 x3 x4 x5 x6 x7 x8 x9

2 4 3 4 6 5 4 -6 5

Arrange in increasing order:

x(1) x(2) x(3) x(4) x(5) x(6) x(7) x(8) x(9)

-6 2 3 4 4 4 5 5 6

◦ QL is the median of {−6, 2, 3, 4}: QL = 2.5

◦ QU is the median of {4, 5, 5, 6}: QU = 5

◦ IQR = 5 − 2.5 = 2.5
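In Stata, the quartiles appear as the 25% and 75% values of the detailed summary (a sketch, assuming the data are in a variable x; note that Stata’s percentile definition can differ slightly from the rule above for small samples):

. summarize x, detail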

Numerical Description of Data, Jan 7, 2004 - 5 -

Page 14: Stat Methods

Percentiles

More generally we might be interested in the value which is ex-

ceeded only by a certain percentage of observations:

The pth percentile of a set of observations is the value such that

◦ p% of the observations are less than or equal to it and

◦ (100 − p)% of the observations are greater than or equal to it.

How to find the percentiles

1. Arrange the data into increasing order.

2. If np/100 is not an integer, then x_{(k+1)} is the pth percentile, where k is the largest integer less than np/100.

3. If np/100 is an integer, the pth percentile is the average of x_{(np/100)} and x_{(np/100+1)}.
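The rule above translates into a short Stata do-file (a sketch; x holds the data and the local macro p holds the desired percentile, here 30 as an illustration):

sort x
local p = 30
local np = _N*`p'/100
if mod(`np',1) != 0 {
    display "percentile: " x[floor(`np')+1]
}
else {
    display "percentile: " (x[`np'] + x[`np'+1])/2
}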

Five-number summary

A numerical summary of a distribution {x1, . . . , xn} is given by

x_{(1)} \quad Q_L \quad M \quad Q_U \quad x_{(n)}

A simple boxplot is a graph of the five-number summary.

Numerical Description of Data, Jan 7, 2004 - 6 -

Page 15: Stat Methods

Boxplots

A common “rule” for discovering outliers is the 1.5 × IQR rule:

An observation is a suspected outlier if it falls more than 1.5 × IQR below QL or above QU.

How to draw a boxplot (box-and-whisker plot)

1. A box (the box) is drawn from the lower to

the upper quartile (QL and QU).

2. The median of the data is shown by a line in

the box.

3. Lines (the whiskers) are drawn from the ends

of the box to the most extreme observations

within a distance of 1.5 IQR (Interquartile

range).

4. Measurements falling outside 1.5 IQR from

the ends of the box are potential outliers and

marked by ◦ or ∗.

[Figure: side-by-side boxplots of Stock A and Stock B (y-axis: −10 to 20).]

Plotting a boxplot with STATA:

. infile A B using stocks.txt, clear

. label var A "Stock A"

. label var B "Stock B"

. graph box A B, xsize(2) ysize(5)

Numerical Description of Data, Jan 7, 2004 - 7 -

Page 16: Stat Methods

Boxplots

Interpretation of Box Plots

◦ The IQR is a measure of the sample’s variability.

◦ If the whiskers differ in length the distribution of the data is

probably skewed in the direction of the longer whisker.

◦ Very extreme observations (more than 3 × IQR below the lower or above the upper quartile) are outliers, with one of the following

explanations:

a) The measurement is incorrect (error in measurement process

or data processing).

b) The measurement belongs to a different population.

c) The measurement is correct, but represents a rare (chance)

event.

We accept the last explanation only after carefully ruling out

all others.

Numerical Description of Data, Jan 7, 2004 - 8 -

Page 17: Stat Methods

Variance and standard deviation

Suppose there are n observations x1, x2, . . . , xn.

The variance of the n observations is:

s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n-1} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2

This is (approximately) the average of the squared distances of the

observations from the mean.

The standard deviation is:

s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}

Why n − 1?

Division by n − 1 instead of n in the variance calculation is a

common cause of confusion. Why n − 1? Note that

\sum_{i=1}^{n} (x_i - \bar{x}) = 0

Thus, if you know any n − 1 of the differences, the last difference

can be determined from the others. The number of “freely varying”

observations, n − 1 in this case, is called the “degrees of freedom”.
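A minimal Stata sketch of the computation (assuming the data are in a variable x; summarize leaves the sum of a variable in r(sum)):

. quietly summarize x
. gen dev2 = (x - r(mean))^2
. quietly summarize dev2
. display "s^2 = " r(sum)/(r(N)-1) "   s = " sqrt(r(sum)/(r(N)-1))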

Numerical Description of Data, Jan 7, 2004 - 9 -

Page 18: Stat Methods

Properties of s

◦ Measures spread around the mean =⇒ use only if the mean

is used as a measure of center.

◦ s = 0 ⇔ all observations are the same

◦ s is in the same units as the measurements, while s2 is in the

square of these units.

◦ s, like x̄, is not resistant to outliers.

Five-number summary versus standard deviation

◦ The 5-number summary is better for describing skewed distri-

butions, since each side has a different spread.

◦ x̄ and s are preferred for symmetric distributions with no outliers.

Numerical Description of Data, Jan 7, 2004 - 10 -

Page 19: Stat Methods

Histograms and density curves

What’s in our toolkit so far?

◦ Plot the data: histogram (or stemplot)

◦ Look for the overall pattern and identify deviations and outliers

◦ Numerical summary to briefly describe center and spread

A new idea:

If the pattern is sufficiently regular, approximate it with a

smooth curve.

Any curve that is always on or above the horizontal axis and has

total area underneath equal to one is a density curve.

◦ Area under the curve in a range of values indicates the propor-

tion of values in that range.

◦ Density curves come in a variety of shapes, but the “normal” family of familiar

bell-shaped densities is commonly used.

◦ Remember the density is only an approximation, but it sim-

plifies analysis and is generally accurate enough for practical

use.

The Normal Distribution, Jan 9, 2004 - 1 -

Page 20: Stat Methods

Examples

[Figures: histogram of sulfur oxide (in tons) with a superimposed density curve (x-axis: 0-40; y-axis: Density, 0.00-0.07); the same range of values is shaded in the histogram and under the curve.]

Shaded area of histogram: 0.29

Shaded area under the curve: 0.30

[Figure: histogram of the waiting time between eruptions (min) with a fitted density curve (x-axis: 40-100; y-axis: Density, 0.00-0.04).]

The Normal Distribution, Jan 9, 2004 - 2 -

Page 21: Stat Methods

Median and mean of a density curve

Median:

The equal-areas point with 50% of the “mass” on either side.

Mean:

The balancing point of the curve, if it were a solid mass.

Note:

◦ The mean and median of a symmetric density curve are equal.

◦ The mean of a skewed curve is pulled away from the median in

the direction of the long tail.

The mean and standard deviation of a density are denoted µ and

σ, rather than x and s, to indicate that they refer to an idealized

model, and not actual data.

The Normal Distribution, Jan 9, 2004 - 3 -

Page 22: Stat Methods

Normal distributions: N (µ, σ)

The normal distribution is

◦ symmetric,

◦ single-peaked,

◦ bell-shaped.

The density curve is given by

f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).

It is determined by two parameters µ and σ:

◦ µ is the mean (also the median)

◦ σ is the standard deviation

Note: The point where the curve changes from concave to convex

is σ units from µ in either direction.

The Normal Distribution, Jan 9, 2004 - 4 -

Page 23: Stat Methods

The 68-95-99.7 rule

◦ About 68% of the data fall inside (µ − σ, µ + σ).

◦ About 95% of the data fall inside (µ − 2σ, µ + 2σ).

◦ About 99.7% of the data fall inside (µ − 3σ, µ + 3σ).

The Normal Distribution, Jan 9, 2004 - 5 -

Page 24: Stat Methods

Example

Scores on the Wechsler Adult Intelligence Scale (WAIS) for the 20

to 34 age group are approximately N(110, 25).

◦ About what percent of people in this age group have scores

above 110?

◦ About what percent have scores above 160?

◦ In what range do the middle 95% of all scores lie?
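A minimal Stata check of these questions (normal() is the standard normal cdf; the parameters are those of the N(110, 25) model above):

. display 1 - normal((110-110)/25)    // about 50% score above 110
. display 1 - normal((160-110)/25)    // z = 2: about 2.3% (the 68-95-99.7 rule gives 2.5%)
. display (110 - 2*25) " to " (110 + 2*25)    // middle 95%: about 60 to 160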

The Normal Distribution, Jan 9, 2004 - 6 -

Page 25: Stat Methods

Standardization and z-scores

Linear transformation of normal distributions:

X ∼ N(µ, σ)  ⇒  aX + b ∼ N(aµ + b, |a|σ)

In particular it follows that

\frac{X - \mu}{\sigma} \sim N(0, 1).

N(0, 1) is called the standard normal distribution.

For a real number x the standardized value or z-score

z = \frac{x - \mu}{\sigma}

tells how many standard deviations x is from µ, and in what di-

rection.

Standardization enables us to use a standard normal table to find

probabilities for any normal variable.

For example:

◦ What is the proportion of N(0, 1) observations less than 1.2?

◦ What is the proportion of N(3, 1.5) observations greater than 5?

◦ What is the proportion of N(10, 5) observations between 3 and 9?
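These three proportions can be checked directly in Stata (a sketch; normal() is the standard normal cdf):

. display normal(1.2)
. display 1 - normal((5-3)/1.5)
. display normal((9-10)/5) - normal((3-10)/5)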

The Normal Distribution, Jan 9, 2004 - 7 -

Page 26: Stat Methods

Normal calculations

Standard normal calculations

1. State the problem in terms of x.

2. Standardize: z = (x − µ)/σ.

3. Look up the required value(s) on the standard normal table.

4. Reality check: Does the answer make sense?

Backward normal calculations

We can also calculate the values, given the probabilities:

If MPG ∼ N (25.7, 5.88), what is the minimum MPG required to be in the

top 10%?

“Backward” normal calculations

1. State the problem in terms of the probability of being less

than some number.

2. Look up the required value(s) on the standard normal table.

3. “Unstandardize,” i.e. solve z = (x − µ)/σ for x, giving x = µ + zσ.
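For the MPG example above, the backward calculation is a one-liner in Stata (invnormal() is the standard normal quantile function; the top 10% corresponds to the 90th percentile):

. display 25.7 + invnormal(0.90)*5.88    // about 33.2 MPG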

The Normal Distribution, Jan 9, 2004 - 8 -

Page 27: Stat Methods

Example

Suppose X ∼ N (0, 1).

◦ P(X ≤ 2) = ?

◦ P(X > 2) = ?

◦ P(−1 ≤ X ≤ 2) = ?

◦ Find the value z such that

⋄ P(X ≤ z) = 0.95

⋄ P(X > z) = 0.99

⋄ P(−z ≤ X < z) = 0.68

⋄ P(−z ≤ X < z) = 0.95

⋄ P(−z ≤ X < z) = 0.997

Suppose X ∼ N (10, 5).

◦ P(X < 5) = ?

◦ P(−3 < X < 5) = ?

◦ P(−x < X < x) = 0.95

The Normal Distribution, Jan 9, 2004 - 9 -

Page 28: Stat Methods

Assessing Normality

How to make a normal quantile plot

1. Arrange the data in increasing order.

2. Record the percentiles (1/n, 2/n, . . . , n/n).

3. Find the z-scores for these percentiles.

4. Plot x on the vertical axis against z on the horizontal axis.

Use of normal quantile plots

◦ If the data are (approximately) normal, the plot will be close

to a straight line.

◦ Systematic deviations from a straight line indicate a nonnormal

distribution.

◦ Outliers appear as points that are far away from the overall pattern of the plot.
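In Stata, a normal quantile plot for a variable x is produced by the qnorm command (a minimal sketch; qnorm plots the sample quantiles of x against the corresponding normal quantiles):

. qnorm x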

[Figures: normal quantile plots (x-axis: Theoretical Quantiles, −3 to 3; y-axis: Sample Quantiles) for samples from N(0, 1), Exp(1), and U(0, 1); only the N(0, 1) sample follows a straight line.]

The Normal Distribution, Jan 9, 2004 - 10 -

Page 29: Stat Methods

Density Estimation

The normal density is just one possible density curve. There are

many others, some with compact mathematical formulas and many

without.

Density estimation software fits an arbitrary density to data to give

a smooth summary of the overall pattern.

[Figure: density estimate for the velocity of galaxies (x-axis: Velocity of galaxy (1000 km/s), 0-40; y-axis: Density, 0.0-0.2).]

The Normal Distribution, Jan 9, 2004 - 11 -

Page 30: Stat Methods

Histogram

How to scale a histogram?

◦ Easiest way to draw a histogram:

⋄ equally spaced bins

⋄ counts on the vertical axis

[Figure: frequency histogram of the Sosa home runs (x-axis: 0-70; y-axis: Frequency, 0-5).]

Disadvantage: Scaling depends on number of observations and

bin width.

◦ Scale histogram such that area of each bar corresponds to pro-

portion of data:

\text{height} = \frac{\text{counts}}{\text{width} \cdot \text{total number}}

[Figure: density-scaled histogram of the Sosa home runs (x-axis: 0-70; y-axis: Density, 0.00-0.04).]

Proportion of data in interval (0, 10]:

height · width = 0.02 · 10 = 0.2 = 20%

Since n = 15 this corresponds to 3 observations.

The Normal Distribution, Jan 9, 2004 - 12 -

Page 31: Stat Methods

Density curves

[Figures: density-scaled histograms of samples of size n = 250, n = 2500, and n = 250000 from the same distribution, together with the limiting density curve for n → ∞ (x-axis: −4 to 4; y-axis: Density, 0.0-0.5).]

Proportion of data in (1, 2]:

\frac{\#\{x_i : 1 < x_i \le 2\}}{n} \;\longrightarrow\; \int_1^2 f(x)\,dx \qquad (n \to \infty)

Probability that a new observation X falls into [a, b]:

P(a \le X \le b) = \int_a^b f(x)\,dx = \lim_{n \to \infty} \frac{\#\{x_i : a < x_i \le b\}}{n}

The Normal Distribution, Jan 9, 2004 - 13 -

Page 32: Stat Methods

Relationships between data

Example: Smoking and mortality

◦ Data from 25 occupational groups

(condensed from data on thousands of individual men)

◦ Smoking (100 = average number of cigarettes per day)

◦ Mortality ratio for deaths from lung cancer

(100 = average ratio for all English men)

Scatter plot of the data:

[Figure: scatter plot of mortality (index, 60-140) against smoking (index, 70-130) for the 25 occupational groups.]

In STATA:

. insheet using smoking.txt

. graph twoway scatter mortality smoking

Scatterplots and correlation, Jan 12, 2004 - 1 -

Page 33: Stat Methods

Relationship between data

Assessing a scatter plot:

◦ What is the overall pattern?

⋄ form of the relationship?

⋄ direction of the relationship?

⋄ strength of the relationship?

◦ Are there any deviations (e.g. outliers) from these patterns?

Direction of relationship/association:

◦ positive association: above-average values of both variables

tend to occur together, and the same for below-average values

◦ negative association: above-average values of one variable

tend to occur with below-average values of the other, and vice

versa.

Strength of relationship/association:

◦ determined by how closely the points follow the overall pattern

◦ difficult to assess by eye, which motivates a numerical measure

Scatterplots and correlation, Jan 12, 2004 - 2 -

Page 34: Stat Methods

Correlation

Correlation is a numerical measure of the direction and strength

of the linear relationship between two quantitative variables.

The sample correlation r is defined as

r_{xy} = \frac{s_{xy}}{\sqrt{s_x\, s_y}}.

where

s_x = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad
s_y = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad
s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}).

Properties:

◦ dimensionless quantity

◦ not affected by linear transformations: for x′_i = a x_i + b and y′_i = c y_i + d with a, c > 0,

r_{x′y′} = r_{xy}

◦ −1 ≤ r_{xy} ≤ 1

◦ |r_{xy}| = 1 if and only if y_i = a x_i + b for some a ≠ 0 (with r_{xy} = 1 for a > 0 and r_{xy} = −1 for a < 0)

◦ measures linear association between x_i and y_i
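A minimal Stata sketch of this computation (assuming variables x and y in memory; r equals the average product of the z-scores, and the built-in correlate command serves as a check):

. quietly summarize x
. gen zx = (x - r(mean))/r(sd)
. quietly summarize y
. gen zy = (y - r(mean))/r(sd)
. gen zxzy = zx*zy
. quietly summarize zxzy
. display "r = " r(sum)/(r(N)-1)
. correlate x y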

Scatterplots and correlation, Jan 12, 2004 - 3 -

Page 35: Stat Methods

Correlation

[Figures: scatter plots of y against x for correlations ρ = −0.9, −0.6, −0.3, 0, 0.3, 0.6, 0.9, and 0.99.]

Scatterplots and correlation, Jan 12, 2004 - 4 -

Page 36: Stat Methods

Introduction to regression

Regression describes how one variable (response) depends on

another variable (explanatory variable).

◦ Response variable: variable of interest, measures the out-

come of a study

◦ Explanatory variable: explains (or even causes) changes in

response variable

Examples:

◦ Hearing difficulties:

response - sound level (decibels), explanatory - age (years)

◦ Real estate market:

response - listing price ($), explanatory - house size (sq. ft.)

◦ Salaries:

response - salary ($), explanatory - experience (years), educa-

tion, sex

Least squares regression, Jan 14, 2004 - 1 -

Page 37: Stat Methods

Introduction to regression

Example: Food expenditures and income

Data: Sample of 20 households

[Figure: scatter plot of food expenditure against income for the 20 households (x-axis: income, 0-120; y-axis: food expenditure, 0-20).]

Questions:

◦ How does food expenditure (Y ) depend on income (X)?

◦ Suppose we know that X = x0, what can we tell about Y ?

Linear regression:

If the response Y depends linearly on the explanatory variable

X , we can use a straight line (regression line) to predict Y

from X .

Least squares regression, Jan 14, 2004 - 2 -

Page 38: Stat Methods

Least squares regression

How to find the regression line

[Figures: scatter plot of food expenditure against income with a candidate regression line, and a close-up marking an observed y, the predicted ŷ, and the difference y − ŷ.]

Since we intend to predict Y from X , the errors of interest are

mispredictions of Y for fixed X .

The least squares regression line of Y on X is the line that

minimizes the sum of squared errors.

For observations (x_1, y_1), . . . , (x_n, y_n), the regression line is given by

\hat{Y} = a + bX

where

b = r \frac{s_y}{s_x} \quad\text{and}\quad a = \bar{y} - b\bar{x}

(r: correlation coefficient; s_x, s_y: standard deviations; \bar{x}, \bar{y}: means)

Least squares regression, Jan 14, 2004 - 3 -

Page 39: Stat Methods

Least squares regression

Example: Food expenditure and income

X   28    26   32   24    54   59   44   30   40    82
Y  5.2   5.1  5.6  4.6  11.3  8.1  7.8  5.8  5.1  18.0

X   42    58   28   20    42   47  112   85   31    26
Y  4.9  11.8  5.2  4.8   7.9  6.4 20.0 13.7  5.1   2.9

The summary statistics are:

◦ \bar{x} = 45.50
◦ \bar{y} = 7.97
◦ s_x = 23.96
◦ s_y = 4.66
◦ r = 0.946

The regression coefficients are:

b = r \frac{s_y}{s_x} = 0.946 \cdot \frac{4.66}{23.96} = 0.184

a = \bar{y} - b\bar{x} = 7.97 - 0.184 \cdot 45.5 = -0.402

[Figure: scatter plot of food expenditure against income with the fitted regression line.]

Least squares regression, Jan 14, 2004 - 4 -

Page 40: Stat Methods

Interpreting the regression model

◦ The response in the model is denoted Ŷ to indicate that these are predicted Y values, not the true Y values. The “hat” denotes prediction.

◦ The slope of the line indicates how much Y changes for a unit

change in X .

◦ The intercept is the value of Ŷ for X = 0. It may or may not have

a physical interpretation, depending on whether or not X can

take values near 0.

◦ To make a prediction for an unobserved X , just plug it in and

calculate Y .

◦ Note that the line need not pass through the observed data

points. In fact, it often will not pass through any of them.

Least squares regression, Jan 14, 2004 - 5 -

Page 41: Stat Methods

Regression and correlation

Correlation analysis:

We are interested in the joint distribution of two (or more)

quantitative variables.

Example: Heights of 1,078 fathers and sons

[Figure: scatter plot of son’s height against father’s height (inches, 58-80) for the 1,078 pairs.]

Points are scattered around the SD line, which

◦ satisfies (y - \bar{y}) = \frac{s_y}{s_x}(x - \bar{x})

◦ goes through the center (\bar{x}, \bar{y})

◦ has slope s_y/s_x

The correlation r measures how much the points spread around

the SD line.

Least squares regression, Jan 14, 2004 - 6 -

Page 42: Stat Methods

Regression and correlation

Regression analysis:

We are interested how the distribution of one response variable

depends on one (or more) explanatory variables.

Example: Heights of 1,078 fathers and sons

[Figures: scatter plot of son’s height against father’s height (inches, 58-80), with density histograms of son’s height for fathers of height 64, 68, and 72 inches, and the scatter plot with the regression line.]

In each vertical strip, the

points are distributed

around the regression

line.

Least squares regression, Jan 14, 2004 - 7 -

Page 43: Stat Methods

Properties of least squares regression

◦ The distinction between explanatory and response variables is

essential. Looking at vertical deviations means that changing

the axes would change the regression line.

[Figure: scatter plot of son’s height against father’s height with the two regression lines y = a + bx (y on x) and x = a′ + b′y (x on y), which differ.]

◦ A change of 1 sd in X corresponds to a change of r sds in Y .

◦ The least squares regression line always passes through the

point (\bar{x}, \bar{y}).

◦ r2 (the square of the correlation) is the fraction of the variation

in the values of y that is explained by the least squares regres-

sion on x.

When reporting the results of a linear regression,

you should report r2.

These properties depend on the least-squares fitting criterion and

are one reason why that criterion is used.

Least squares regression, Jan 14, 2004 - 8 -

Page 44: Stat Methods

The regression effect

Regression effect

In virtually all test-retest situations, the bottom group on the

first test will on average show some improvement on the sec-

ond test - and the top group will on average fall back. This is

the regression effect. The statistician and geneticist Sir Fran-

cis Galton (1822-1911) called this effect “regression to medi-

ocrity”.

[Figure: scatter plot of son’s height against father’s height (inches, 58-80) with the regression line.]

Regression fallacy

Thinking that the regression effect must be due to something

important, not just the spread around the line, is the regression

fallacy.

Least squares regression, Jan 14, 2004 - 9 -

Page 45: Stat Methods

Regression in STATA

. infile food income size using food.txt

. graph twoway scatter food income || lfit food income, legend(off)
> ytitle(food)

. regress food income

      Source |       SS       df       MS              Number of obs =      20
-------------+------------------------------           F(  1,    18) =  151.97
       Model |  369.572965     1  369.572965           Prob > F      =  0.0000
    Residual |  43.7725361    18  2.43180756           R-squared     =  0.8941
-------------+------------------------------           Adj R-squared =  0.8882
       Total |  413.345502    19  21.7550264           Root MSE      =  1.5594

------------------------------------------------------------------------------
        food |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      income |   .1841099   .0149345    12.33   0.000     .1527336    .2154862
       _cons |  -.4119994   .7637666    -0.54   0.596    -2.016613    1.192615
------------------------------------------------------------------------------

[Figure: the resulting scatter plot with fitted line (y-axis: Food expenditure, 0-20; x-axis: Income, 0-120).]

This graph has been generated using the graphical user interface of STATA.

The complete command is:

. twoway (scatter food income, msymbol(circle) msize(medium) mcolor(black))
> (lfit food income, range(0 120) clcolor(black) clpat(solid) clwidth(medium)),
> ytitle(Food expenditure, size(large)) ylabel(, valuelabel angle(horizontal)
> labsize(medlarge)) xtitle(Income, size(large)) xscale(range(0 120))
> xlabel(0(20)120, labsize(medlarge)) legend(off) ysize(2) xsize(3)

Least squares regression, Jan 14, 2004 - 10 -

Page 46: Stat Methods

Residual plots

Residuals: difference of observed and predicted values

e_i = \text{observed } y - \text{predicted } y = y_i - \hat{y}_i = y_i - (a + b x_i)

For a least squares regression, the residuals always have mean zero.
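In Stata, residuals and fitted values are obtained with predict after regress (a sketch for the food expenditure example; FITTED and RES are illustrative variable names):

. quietly regress food income
. predict FITTED, xb
. predict RES, residuals
. summarize RES    // mean is (numerically) zero
. graph twoway scatter RES income, yline(0)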

Residual plot

A residual plot is a scatterplot of the residuals against the

explanatory variable. It is a diagnostic tool to assess the fit of

the regression line.

Patterns to look for:

◦ Curvature indicates that the relationship is not linear.

◦ Increasing or decreasing spread indicates that the prediction

will be less accurate in the range of explanatory variables where

the spread is larger.

◦ Points with large residuals are outliers in the vertical direc-

tion.

◦ Points that are extreme in the x direction are potential high

influence points.

Influential observations are individuals with extreme x values

that exert a strong influence on the position of the regression line.

Removing them would significantly change the regression line.

Least squares regression, Jan 14, 2004 - 11 -

Page 47: Stat Methods

Regression Diagnostics

Example: First data set

[Figures: scatter plot of Y against X with the fitted line; residuals against fitted values; residuals against X.]

residuals are regularly distributed

Least squares regression, Jan 14, 2004 - 12 -

Page 48: Stat Methods

Regression Diagnostics

Example: Second data set

[Figures: scatter plot of Y against X with the fitted line; residuals against fitted values; residuals against X.]

functional relationship other than linear

Least squares regression, Jan 14, 2004 - 13 -

Page 49: Stat Methods

Regression Diagnostics

Example: Third data set

[Figures: scatter plot of Y against X with the fitted line; residuals against fitted values; residuals against X.]

outlier, regression line misfits majority of data

Least squares regression, Jan 14, 2004 - 14 -

Page 50: Stat Methods

Regression Diagnostics

Example: Fourth data set

[Figures: scatter plot of Y against X with the fitted line; residuals against fitted values; residuals against X.]

heteroscedasticity

Least squares regression, Jan 14, 2004 - 15 -

Page 51: Stat Methods

Regression Diagnostics

Example: Fifth data set

[Figures: scatter plot of Y against X with the fitted line; residuals against fitted values; residuals against X.]

one separate point in direction of x, highly influential

Least squares regression, Jan 14, 2004 - 16 -

Page 52: Stat Methods

The Question of Causation

Example: Are babies brought by the stork?

◦ Data from 54 countries

◦ Variables:

⋄ Birth rate (newborns per 1000 women)

⋄ Number of storks (per 1000 women)

[Figure: scatter plot of birth rate (0-21) against number of storks per 1000 women (0-5) for the 54 countries.]

Model: Birth rate (Y) is proportional to the number of storks (X)

Y = b X + ε

Least squares regression yields for the slope of the regression line

b = 4.3 ± 0.2.

Can we conclude that babies are brought by the stork?

Causation, Jan 16, 2004 - 1 -

Page 53: Stat Methods

The Question of Causation

A more serious example:

Variables:

◦ Income Y - response

◦ level of education X - explanatory variable

There is a positive association between income and education.

Question: Does better education increase income?

[Diagrams: three causal structures relating X and Y — (a) a causal effect X → Y; (b) confounding, with a third variable Z affecting both X and Y; (c) an undetermined relationship (marked with “?”).]

Possible alternative explanation: Confounding

◦ People from prosperous homes are likely to receive many years of edu-

cation and are more likely to have high earnings.

◦ Education and income might both be affected by personal attributes such as self-assurance. On the other hand, the level of education could have an impact on, e.g., self-assurance. The effects of education and self-assurance cannot be separated.

Confounding:

Response and explanatory variable both depend on a third

(hidden) variable.

Causation, Jan 16, 2004 - 2 -

Page 54: Stat Methods

Establishing Causal Relationships

Controlled experiments:

A cause-effect relationship between two variables X and Y can be

established by conducting an experiment where

◦ the values of X are manipulated and

◦ the effect on Y is observed.

Problem: Often such experiments are not possible.

If we cannot establish a causal relationship by a controlled experi-

ment, we can still collect evidence from observational studies:

◦ The association is strong.

◦ The association is consistent across multiple studies.

◦ Higher doses are associated with stronger responses.

◦ The alleged cause precedes the effect in time.

◦ The alleged cause is plausible.

Example: Smoking and lung cancer

Causation, Jan 16, 2004 - 3 -

Page 55: Stat Methods

Caution about Causation

Association is not causation

Two variables may be correlated because both are affected

by some other (measured or unmeasured) variable.

Unmeasured confounding variables can influence the in-

terpretation of relationships among the measured vari-

ables. They

◦ may suggest a relationship where there is none or

◦ may mask a real relationship.

No causation in - no causation out

Causation is - unlike association - not a statistical concept.

For inference on cause-effect relationships, we need some

knowledge about the causal relationships between the vari-

ables in the study.

Randomized experiments guarantee the absence of any

confounding variables. Any relationship between the ma-

nipulated variable and the response must be due to a

cause-effect relationship.

Causation, Jan 16, 2004 - 4 -

Page 56: Stat Methods

Experiments and Observational Studies

Two major types of statistical studies

◦ Observational study - observes individuals/objects and mea-

sures variables of interest but does not attempt to interfere with

the natural process.

◦ Designed experiment - deliberately imposes some treatment

on individuals to observe their responses.

Remarks:

◦ Sample surveys are an example of an observational study.

◦ In economics, most studies are observational.

◦ Clinical studies are often designed experiments.

◦ Designed experiments allow statements about causal relationships between treatment and response.

◦ Observational studies have no control over variables. Thus the effect of the explanatory variable on the response variable might be confounded (mixed up) with the effect of some other variables. Such variables are called confounders and are a major source of bias.

Experiments and Observational Studies, Jan 16, 2004 - 5 -

Page 57: Stat Methods

Designed Experiments

• In controlled experiments, the subjects are assigned to one of

two groups,

◦ treatment group and

◦ control group (which does not receive treatment).

• A controlled experiment is randomized if the subjects are ran-

domly assigned to one of the two groups.

• One precaution in designed experiments is the use of a placebo, which is made of a completely neutral substance. The subjects do not know whether they receive the treatment or a placebo, so any difference in the response cannot be attributed to psychological or psychosomatic effects.

• In a double blind experiment, neither the subjects nor the

treatment administrators know who is assigned to the two

groups.

Example: The Salk polio vaccine field trial

◦ Randomized controlled double-blind experiment in 11 states

◦ 200,000 children in treatment group

◦ 200,000 children in control group treated with placebo

The difference between the responses of the two groups shows that

the vaccine reduces the risk of polio infection.

Experiments and Observational Studies, Jan 16, 2004 - 6 -

Page 58: Stat Methods

Confounding

Confounding means a difference between the treatment and control groups—other than the treatment—which affects the responses being studied. A confounder is a third variable, associated with both the exposure and the disease.

Example: Lanarkshire Milk Experiment

The purpose of the experiment was to study the effect of pasteur-

ized milk on the health of children.

◦ The subjects of the experiment were school children.

◦ The children in the treatment group got a daily portion of pas-

teurized milk.

◦ The children in the control group did not receive any extra milk.

◦ The teachers assigned poorer children to the treatment group so that they would get extra milk.

The effect of pasteurized milk on the health of children is con-

founded with the effect of wealth: Poorer children are more exposed

to diseases.

Experiments and Observational Studies, Jan 16, 2004 - 7 -

Page 59: Stat Methods

Observational Studies

Confounding is a major problem in observational studies.

Association is NOT Causation

Example: Does smoking cause cancer?

• Designed experiment not possible (cannot make people

smoke).

• Observation: Smokers have higher cancer rates

• Tobacco industry: There might be a gene which

◦ makes people smoke and

◦ causes cancer

In that case stopping smoking would not prevent cancer since

it is caused by the gene. The observed high association could

be attributed to the confounding effect of such a gene.

• However: Studies with identical twins—one smoker and one nonsmoker—cast serious doubt on the gene theory.

Experiments and Observational Studies, Jan 16, 2004 - 8 -

Page 60: Stat Methods

Example

Do screening programs speed up detection of breast cancer?

◦ Large-scale trial run by the Health Insurance Plan of Greater

New York, starting in 1963

◦ 62,000 women age 40 to 64 (all members of the plan)

◦ Randomly assigned to two equal groups

◦ Treatment group:

⋄ women were encouraged to come in for annual screening

⋄ 20,200 women did come in for screening

⋄ 10,800 refused.

◦ Control group:

⋄ was offered usual health care

◦ All the women were followed for many years.

Epidemiologists who worked on the study found that

◦ screening had little impact on diseases other than breast cancer;

◦ poorer women were less likely to accept screening than richer

ones; and

◦ most diseases fall more heavily on the poor than the rich.

Experiments and Observational Studies, Jan 16, 2004 - 9 -

Page 61: Stat Methods

Example

Deaths in the first five years of the screening trial, by cause. Rates per

1,000 women.

                                        Cause of death
                                  Breast cancer      All other
                 Number of persons  Number  Rate   Number  Rate
Treatment group       31,000          39     1.3     837    27
   Examined           20,200          23     1.1     428    21
   Refused            10,800          16     1.5     409    38
Control group         31,000          63     2.0     879    28

Questions:

◦ Does screening save lives?

◦ Why is the death rate from all other causes in the whole treatment

group (“examined” and “refused” combined) about the same as the

rate in the control group?

◦ Why is the death rate from all other causes higher for the “refused”

group than the “examined” group?

◦ Breast cancer (like polio, but unlike most other diseases) affects the

rich more than the poor. Which numbers in the table confirm this

association between breast cancer and income?

◦ The death rate (from all causes) among women who accepted screening

is about half the death rate among women who refused. Did screening

cut the death rate in half? If not, what explains the difference in death

rates?

◦ To show that screening reduces the risk from breast cancer, someone

wants to compare 1.1 and 1.5. Is this a good comparison? Is it biased

against screening? For screening?

Experiments and Observational Studies, Jan 16, 2004 - 10 -

Page 62: Stat Methods

Survey Sampling

Situation:

Population of N individuals (or items)

e.g. ◦ students at this university

◦ light bulbs produced by a company on one day

Seek information about population

e.g. ◦ amount of money students spent on books this quarter

◦ percentage of students who bought more than 10 books

in this quarter

◦ lifetime of light bulbs

Full data collection is often not possible because it is e.g.

◦ too expensive

◦ too time consuming

◦ not sensible (e.g. testing every produced light bulb for its lifetime)

Statistical approach:

◦ collect information from part of the population (sample)

◦ use information on sample to draw conclusions on whole pop-

ulation

Questions:

◦ How to choose a sample?

◦ What conclusions can be drawn?

Survey Sampling, Jan 19, 2004 - 1 -

Page 63: Stat Methods

Survey Sampling

Objective of a sample survey:

Gather information on some variable for population of N individ-

uals:

xi value of interest for ith individual

x1, . . . , xN values for population

Sample of length n:

x1, . . . , xn values obtained from sampling

Parameter - a number that describes the population, e.g.

\mu_{\text{pop}} = \frac{1}{N} \sum_{j=1}^{N} x_j \qquad \text{(population mean)}

\sigma^2_{\text{pop}} = \frac{1}{N} \sum_{j=1}^{N} (x_j - \mu_{\text{pop}})^2 \qquad \text{(population variance)}

Estimate population parameters from the sampled values:

\hat{\mu}_{\text{pop}} = \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad \text{(sample mean)}

\hat{\sigma}^2_{\text{pop}} = s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \qquad \text{(sample variance)}

A function of the sample x1, . . . , xn is called a statistic.

Survey Sampling, Jan 19, 2004 - 2 -

Page 64: Stat Methods

Sampling Distribution

Suppose we are interested in the amount of money students at this

university have spent on books this quarter.

Idea: Ask 20 students about the amount they have spent and take

the average.

The value we obtain will vary from sample to sample, that is, if we

asked another 20 students we would get a different answer.

Sampling distribution

The sampling distribution of a statistic is the distribution of

all values taken by the statistic if evaluated for all possible

samples of size n taken from the same population.

In our example, the sampling distribution of the average amount

obtained from the sample depends on the way we choose the sample

from the population:

◦ Ask 20 students in this class.

◦ Ask 20 students in your department.

◦ Ask 20 students in the University bookshop.

◦ Select randomly 20 students from the register of the university.

The design of a sample refers to the method used to choose the

sample from the population.

Survey Sampling, Jan 19, 2004 - 3 -

Page 65: Stat Methods

Sampling Distribution

Example:

Consider a population of 15 students who spent the following amounts on books:

x1   x2   x3   x4   x5   x6   x7   x8   x9   x10  x11  x12  x13  x14  x15
100  120  150  180  200  220  220  240  260  280  290  300  310  350  400

[Figures: sampling distribution of the sample mean x̄ = (1/n) ∑ᵢ xᵢ, shown as frequency histograms (in %) for sample sizes (a) n = 2 (σ = 55.4247), (b) n = 3 (σ = 43.38302), and (c) n = 4 (σ = 35.96526); the spread decreases as n grows.]

Survey Sampling, Jan 19, 2004 - 4 -

Page 66: Stat Methods

Bias

Example:

Suppose we are interested in the amount of money students at this

university have spent on books last quarter.

Sample: 20 students in the University bookshop

Do we get a good estimate for the average amount spent on books

last quarter by UofC students?

◦ Students who buy more books and spend more money on books are more likely to be found in bookshops than students who buy fewer books.

◦ The sample mean might overestimate the true amount spent

on books.

◦ The sample is not representative for the population of all stu-

dents.

Careful: A poor sample design can produce misleading conclu-

sions.

The design of a study is biased if it systematically favors some

parts of the population over others.

A statistic is unbiased if the mean of its sampling distribution

is equal to the parameter being estimated. Otherwise we say the

statistic is biased.

Survey Sampling, Jan 19, 2004 - 5 -

Page 67: Stat Methods

Bias

Examples: Biased Sampling

◦ Midway Airlines Ads in the New York Times and the Wall Street Jour-

nal stated that “84 percent of frequent business travelers to Chicago

prefer Midway Metrolink to American, United, and TWA.”

The survey was “conducted among Midway Metrolink passengers between New York and Chicago.”

◦ A 1992 Roper poll asked “Does it seem possible or does it seem im-

possible to you that the Nazi extermination of Jews never happened?”

22% of the American respondents said “seems possible.”

A reworded 1994 poll asked “Does it seem possible to you that the Nazi

extermination of Jews never happened, or do you feel certain that it

happened?” This time only 1% of the respondents said it was “possible

it never happened.”

◦ ABC network program Nightline once asked whether the United Na-

tions should continue to have its headquarters in the United States.

More than 186,000 callers responded, and 67% said “No.”

A properly designed sample survey showed that 72% of adults want the

UN to stay.

◦ A call-in poll conducted by USA Today concluded that Americans love

Donald Trump.

USA Today later reported that 5,640 of the 7,800 calls for the poll came

from the offices owned by one man, Cincinnati financier Carl Lindner.

Survey Sampling, Jan 19, 2004 - 6 -

Page 68: Stat Methods

Caution about Sample Surveys

• Undercoverage

◦ occurs when some groups in the population are left out of

the process of choosing the sample

◦ no accurate list of the population

◦ results in bias if this group differs from the rest of the

population

• Nonresponse

◦ occurs when a chosen individual cannot be contacted or

does not cooperate

◦ results in bias if this group differs from the rest of the

population

• Response bias

◦ subjects may not want to admit illegal or unpopular be-

haviour

◦ subjects may be affected by the interviewer’s appearance or

tone

◦ subjects may not remember correctly

• Question wording

◦ confusing or leading questions can introduce strong bias

◦ do not trust sample survey results unless you have read the

exact questions posed

Survey Sampling, Jan 19, 2004 - 7 -

Page 69: Stat Methods

Simple Random Sampling

A simple random sample (SRS) of size n consists of n indi-

viduals chosen from the population in such a way that every set of

n individuals is equally likely to be selected.

◦ Every possible sample has an equal chance of being selected.

◦ Every individual has an equal chance of being selected.

◦ Random selection eliminates bias in sampling.
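In Stata, an SRS of, say, 20 observations from the data in memory can be drawn with the sample command (a minimal sketch; set seed makes the draw reproducible, and the count option requests a fixed number of observations rather than a percentage):

. set seed 20040119
. sample 20, count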

SRS or Not?

Is each of the following samples an SRS or not?

◦ A deck of cards is shuffled, and the top five dealt.

◦ A sample of Illinois residents is drawn by choosing all the resi-

dents in each of 100 census blocks (in such a way that each set

of 100 blocks is equally likely to be chosen)

◦ A telephone survey is conducted by dialing telephone numbers

at random (i.e. each valid phone number is equally likely).

◦ A sample of 10% of all students at the University of Chicago is

chosen by numbering the students 1, . . . , N , drawing a random

integer i from 1 to 10, and drawing every tenth student begin-

ning with i.

(E.g. if i = 5, students 5, 15, 25, . . . are chosen.)

Survey Sampling, Jan 19, 2004 - 8 -

Page 70: Stat Methods

Stratified Sampling

Example:

◦ Population: Students at this university

◦ Objective: Amount of money spent on books this quarter

◦ Knowledge: Students in e.g. humanities spend more money on

books

Use knowledge to build sample:

◦ divide the population into groups of similar individuals, called strata

◦ choose a simple random sample within each group

◦ make the sample size in each group e.g. proportional to the size of the group

This can reduce the variability of the estimate significantly.

Survey Sampling, Jan 19, 2004 - 9 -

Page 71: Stat Methods

Summary

◦ A number which describes a population is a parameter.

◦ A number computed from the data is a statistic.

◦ Use statistics to make inferences about unknown population

parameters.

◦ A Simple random sample (SRS) of size n consists of n in-

dividuals from the population sampled without replacement,

that is, every set of n individuals has an equal chance to be the

sample actually selected.

◦ A statistic from a random sample has a sampling distribution

that describes how the statistic varies in repeated data produc-

tion.

◦ A statistic as an estimator of a parameter may suffer from bias

or from high variability. Bias means that the mean of the

sampling distribution is not equal to the true value of the pa-

rameter. The variability of the statistic is described by the

spread of its sampling distribution.

Survey Sampling, Jan 19, 2004 - 10 -

Page 72: Stat Methods

First Step Towards Probability

Experiment:

Toss a die and observe the number on the face up.

What is the chance

◦ of getting a six?
  Event of interest: 6
  All possible events: 1, 2, 3, 4, 5, 6
  ⇒ 1/6 (one out of six)

◦ of getting an even number?
  Event of interest: 2, 4, 6
  All possible events: 1, 2, 3, 4, 5, 6
  ⇒ 1/2 (three out of six)

The classical probability concept:

If there are N equally likely possibilities, of which one must occur

and s are regarded favorable, or as a “success”, then the probability

of a “success” is

s

N.

Counting, Jan 21, 2004 - 1 -

Page 73: Stat Methods

First Step Towards Probability

Example:

Suppose that of 100 applicants for a job 50 were women and 50

were men, all equally qualified. Further suppose that the company

hired 2 women and 8 men.

How likely is this outcome under the assumption that

the company does not discriminate?

How many ways are there to choose

◦ 10 out of 100 applicants? (⇒ N)

◦ 2 out of 50 female applicants and 8 out of 50 male applicants?

(⇒ s)

To compute such probabilities we need a way to count the num-

ber of possibilities (favorable and total).

Counting, Jan 21, 2004 - 2 -

Page 74: Stat Methods

The Multiplicative Rule

Suppose you have k choices to make, with N_1, . . . , N_k possibilities, respectively. Then the total number of possibilities is

the product

N1 · · ·Nk.

Sampling in order with replacement

If you sample n times in order with replacement from a set of N

elements, then the total number of possible sequences (x1, . . . , xn)

is N^n.

Example:

If you toss a die 5 times, the number of possible results is 6^5 = 7776.

Sampling in order without replacement

If you sample n times in order without replacement from a set of N

elements, then the total number of possible sequences (x1, . . . , xn)

is

N(N − 1) · · · (N − n + 1) = N!/(N − n)!.

Example:

If you select 5 cards in order from a card deck of 64, the number of possible results is 64 · 63 · 62 · 61 · 60 = 914,941,440.
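Both counts are quick to verify with Python's standard library (a sketch):

import math

# Sampling in order with replacement: N^n sequences.
print(6 ** 5)            # 7776 possible results of 5 die tosses

# Sampling in order without replacement: N!/(N - n)! sequences.
print(math.perm(64, 5))  # 914941440 ordered 5-card draws from 64 cards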

Counting, Jan 21, 2003 - 3 -


Permutations and Combinations

Example:

If you select 5 cards from a card deck of 64, you are typically only

interested in the cards you have, not in the order in which you

received them. How many different combinations of 5 cards out

of 64 are there?

To answer this question we first address the question of how many

different sequences of the same 5 cards exist.

Permutation:

Let (x1, . . . , xn) be a sequence. A permutation of this sequence is

any rearrangement of the elements without losing or adding any

elements, that is, any new sequence

(xi1, . . . , xin)

with permuted indices {i1, . . . , in} = {1, . . . , n}. The trivial per-

mutation does not change the order, i.e. ij = j.

How many permutations of n distinct elements are there? The

multiplicative rule yields

n · (n − 1) · · · 1 = n!.

Example (contd):

The number of different sequences of 5 fixed cards is 5! = 5 · 4 · 3 · 2 · 1 = 120.

Counting, Jan 21, 2003 - 4 -


Permutations and Combinations

How many different combinations of n elements chosen from

N distinct elements are there?

Recall that

◦ The number of different sequences of length n that can be chosen from N distinct elements is N!/(N − n)!.

◦ The number of permutations of any sequence of length n is n!.

Thus the number of combinations of n elements chosen from N distinct elements is

N!/(n! (N − n)!) = C(N, n) = C(N, N − n).

The numbers C(N, n), read "N choose n", are referred to as binomial coefficients.

Since two permuted (ordered) sequences (x1, . . . , xn) lead to the same (un-

ordered) combination {x1, . . . , xn} we divide the number of ordered se-

quences by the number of permutations.

Counting, Jan 21, 2003 - 5 -


Examples

Example:

If you select 5 cards from a card deck of 64, you are typically only

interested in the cards you have, not in the order in which you

received them. How many different combinations of 5 cards out

of 64 are there?

The answer is

C(64, 5) = (64 · 63 · 62 · 61 · 60)/(5 · 4 · 3 · 2 · 1) = 914,941,440/120 = 7,624,512.

Example:

Recall the example with the 100 applicants for a job. The number

of ways to choose

◦ 2 women out of 50 is C(50, 2).

◦ 8 men out of 50 is C(50, 8).

◦ 10 applicants out of 100 is C(100, 10).

Thus the chance of this event is

C(50, 2) · C(50, 8) / C(100, 10) ≈ 0.038.

Moreover, the chance of this or a more extreme event (only one or

no woman is hired) is 0.046.
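These probabilities can be checked with Python's math.comb (a sketch):

from math import comb

total = comb(100, 10)
p2 = comb(50, 2) * comb(50, 8) / total   # exactly 2 women hired
p1 = comb(50, 1) * comb(50, 9) / total   # exactly 1 woman hired
p0 = comb(50, 10) / total                # no woman hired

print(round(p2, 3))             # ~0.038
print(round(p2 + p1 + p0, 3))   # ~0.046 (this or a more extreme event)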

Counting, Jan 21, 2003 - 6 -


Summary

The number of possibilities to sample, with or without replacement, in order or unordered, n elements from a set of N distinct elements is summarized in the following table:

Sampling              in order        without order
without replacement   N!/(N − n)!     C(N, n)
with replacement      N^n             C(N + n − 1, n)

Counting, Jan 21, 2003 - 7 -


Introduction to Probability

Classical Concept:

◦ requires finitely many and equally likely outcomes

◦ probability of an event is defined as the number of favorable outcomes (s) divided by the number of total outcomes (N):

Probability of event = s/N

◦ can be determined by counting outcomes

In many practical situations the different outcomes are not equally

likely:

◦ Success of treatment

◦ Chance to die of a heart attack

◦ Chance of snowfall tomorrow

It is not immediately clear how to measure chance in each of these

cases.

Three Concepts of Probability

◦ Frequency interpretation

◦ Subjective probabilities

◦ Mathematical probability concept

Elements of Probability, Jan 23, 2003 - 1 -


The Frequentist Approach

In the long run, we are all dead.

John Maynard Keynes (1883-1946)

The Frequency Interpretation of Probability

The probability of an event is the proportion of time that events

of the same kind (repeated independently and under the same

conditions) will occur in the long run.

Example:

Suppose we collect data on the weather in Chicago on Jan 21 and we note that in the past 124 years it snowed in 34 years on Jan 21, that is, 34/124 · 100% = 27.4% of the time.

Thus we would estimate the probability of snowfall on Jan 21 in Chicago as 0.274.

The frequency interpretation of probability is based on the follow-

ing theorem:

The Law of Large Numbers

If a situation, trial, or experiment is repeated again and again, the proportion of successes will converge to the probability of any one outcome being a success.
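The convergence is easy to watch in a short simulation (a sketch, assuming a fair coin):

import random

random.seed(1)  # for reproducibility
for n in (100, 10_000, 1_000_000):
    heads = sum(random.random() < 0.5 for _ in range(n))
    print(n, heads / n)   # relative frequency settles toward 0.5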

Elements of Probability, Jan 23, 2003 - 2 -

The Frequentist Approach

[Figure: relative frequency of heads against the number of tosses, in three panels: tosses 1−1000, tosses 1000−100000 (in 1000s), and tosses 100000−1000000 (in 100000s); the relative frequency settles near 0.5.]

Elements of Probability, Jan 23, 2003 - 3 -

Page 82: Stat Methods

The Subjectivist (Bayesian) Approach

Not all events are repeatable:

◦ Will it snow tomorrow?

◦ Will Mr Jones, 42, live to 65?

◦ Will the Dow Jones rise tomorrow?

◦ Does Iraq have weapons of mass destruction?

To all these questions the answer is either “yes” or “no”, but we

are uncertain about the right answer.

Need to quantify our uncertainty about an event A:

Game with two players:

◦ 1st player determines p such that he will "win" $c · (1 − p) if event A occurs and otherwise he will "lose" $c · p.

◦ 2nd player chooses c which can be positive or negative.

The Bayesian interpretation of probability is that probability

measures the personal (subjective) uncertainty of an event.

Example: Weather forecast

Meteorologist says that the probability of snowfall tomorrow is

90%.

He should be willing to bet $90 against $10 that it snows tomorrow

and $10 against $90 that it does not snow.

Elements of Probability, Jan 23, 2003 - 4 -


The Elements of Probability

A (statistical) experiment is a process of observation or mea-

surement. For a mathematical treatment we need:

Sample Space S - set of possible outcomes

Example: An urn contains five balls, numbered from 1 through

5. We choose two at random and at the same time. What is the

sample space?

S = { {1, 2}, {1, 3}, {1, 4}, {1, 5}, {2, 3}, {2, 4}, {2, 5}, {3, 4}, {3, 5}, {4, 5} }.

Events A ⊆ S - an event is a subset of the sample space S

Example: In the example above, the event A that two balls with odd numbers are chosen is

A = { {1, 3}, {1, 5}, {3, 5} }.

Probability Function P - assigns each A a value in [0, 1]

Example: Assuming that all outcomes are equally likely we obtain P(A) = 3/10.

Elements of Probability, Jan 23, 2003 - 5 -


The Elements of Probability

Why not assign probabilities to outcomes?

Example: Spinner labeled from 0 to 1.

◦ Suppose that all outcomes s ∈ S = [0, 1) are equally likely.

◦ Assign probabilities uniformly on S.

◦ P({s}) = c > 0 ⇒ P(S) = ∞
◦ P({s}) = 0 ⇒ P(S) = 0

Solution: Assign to each subset of S a probability equal to the

“length” of that subset:

◦ Probability that the spinner lands in [0, 1/4) is 1/4.

◦ Probability that the spinner lands in [1/2, 3/4) is 1/4.

◦ Probability that the spinner lands on 1/2 is 0.

In integral notation we have

P(spinner lands in [a, b]) = ∫_a^b dx = b − a.

Remark:

Strictly speaking, we can define the above probability only on a collection A of subsets A ⊆ S; this collection, however, covers all subsets relevant for this class.

In the case of finite or countably infinite sample spaces S there are no such exceptions

and A covers all subsets of S.

Elements of Probability, Jan 23, 2003 - 6 -


A Set Theory Primer

A set is “a collection of definite, well distinguished objects of our perception

or of our thought”. (Georg Cantor, 1845-1918)

Some important sets:

◦ N = {1, 2, 3, . . .}, the set of natural numbers

◦ Z = {. . . ,−2,−1, 0, 1, 2, . . .}, the set of integers

◦ R = (−∞,∞), the set of real numbers

Intervals are denoted as follows:

[0, 1] the interval from 0 to 1 including 0 and 1

[0, 1) the interval from 0 to 1 including 0 but not 1

(0, 1) the interval from 0 to 1 not including 0 and 1

If a is an element of the set A then we write a ∈ A.

If a is not an element of the set A then we write a /∈ A.

Suppose that A and B are subsets of S (denoted as A, B ⊆ S).

The empty set is denoted by ∅ (Note: ∅ ⊆ A for all subsets A of S).

Difference of A and B (A\B): Set of all elements in A which are not in B.

Intersection of A and B (A ∩ B): Set of all elements in S which are both

in A and in B.

Union of A and B (A∪B): Set of all elements in S that are in A or in B.

Complement of A (A∁ or A′): Set of all elements in S that are not in A.

Note that A ∩ A∁ = ∅ and A ∪ A∁ = S

A and B are disjoint if A and B have no common elements, that is A∩B =

∅. Two events A and B with this property are said to be mutually

exclusive.

Elements of Probability, Jan 23, 2003 - 7 -


The Postulates of Probability

A probability on a sample space S (and a set A of events) is a

function which assigns each subset A a value in [0, 1] and satisfies

the following rules:

Axiom 1: All probabilities are nonnegative: P(A) ≥ 0 for all events A.

Axiom 2: The probability of the whole sample space is 1: P(S) = 1.

Axiom 3 (Addition Rule): If two events A and B are mutually exclusive then

P(A ∪ B) = P(A) + P(B),

that is, the probability that one or the other occurs is the sum of their probabilities.

More generally, if countably many events Ai, i ∈ N, are mutually exclusive (i.e. Ai ∩ Aj = ∅ whenever i ≠ j) then

P(⋃_{i=1}^∞ Ai) = Σ_{i=1}^∞ P(Ai).

Elements of Probability, Jan 23, 2003 - 8 -


The Postulates of Probability

Classical Concept of Probability

The probability of an event A is defined as

P(A) = #A / #S,

where #A denotes the number of elements (outcomes) in A.

It satisfies

◦ P(A) ≥ 0

◦ P(S) = #S/#S = 1

◦ If A and B are mutually exclusive then

P(A ∪ B) = #(A ∪ B)/#S = #A/#S + #B/#S = P(A) + P(B).

Elements of Probability, Jan 23, 2003 - 9 -


The Postulates of Probability

Frequency Interpretation of Probability

The probability of an event A is defined as

P(A) = lim_{n→∞} n(A)/n,

where n(A) is the number of times event A occurred in n repetitions.

It satisfies

◦ P(A) ≥ 0

◦ P(S) = lim_{n→∞} n/n = 1

◦ If A and B are mutually exclusive then n(A ∪ B) = n(A) + n(B). Hence

P(A ∪ B) = lim_{n→∞} n(A ∪ B)/n
         = lim_{n→∞} (n(A)/n + n(B)/n)
         = lim_{n→∞} n(A)/n + lim_{n→∞} n(B)/n
         = P(A) + P(B).

Elements of Probability, Jan 23, 2003 - 10 -


The Postulates of Probability

Example: Toss of one die

The events A = {1} and B = {4, 5} are mutually exclusive.

Since all outcomes are equiprobable we obtain

P(A) = 1/6 and P(B) = 2/6 = 1/3.

The addition rule yields

P(A ∪ B) = 1/6 + 1/3 = 3/6 = 1/2.

On the other hand, we get for C = A ∪ B = {1, 4, 5}

P(C) = 3/6 = 1/2.

The first two axioms can be summarized by the

Cardinal Rule: For any subset A of S

0 ≤ P(A) ≤ 1.

In particular

◦ P(∅) = 0

◦ P(S) = 1

Elements of Probability, Jan 23, 2003 - 11 -


The Calculus of Probability

Let A and B be events in a sample space S.

Partition rule: P(A) = P(A ∩ B) + P(A ∩ B∁)

Example: Roll a pair of fair dice.

P(Total of 10) = P(Total of 10 and double) + P(Total of 10 and no double)
              = 1/36 + 2/36 = 3/36 = 1/12

Complementation rule: P(A∁) = 1 − P(A)

Example: Often useful for events of the type "at least one":

P(At least one even number) = 1 − P(No even number) = 1 − 9/36 = 3/4

Containment rule: P(A) ≤ P(B) for all A ⊆ B

Example: Compare two aces with doubles:

1/36 = P(Two aces) ≤ P(Doubles) = 6/36 = 1/6

Calculus of Probability, Jan 26, 2003 - 1 -


The Calculus of Probability

Inclusion and exclusion formula

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

Example: Roll a pair of fair dice.

P(Total of 10 or double) = P(Total of 10) + P(Double) − P(Total of 10 and double)
                        = 3/36 + 6/36 − 1/36 = 8/36 = 2/9

The two events are

Total of 10 = {46, 55, 64} and

Double = {11,22,33,44,55,66}

The intersection is

Total of 10 and double = {55}.

Adding the probabilities for the two events, the probability for the

event 55 is added twice.

Calculus of Probability, Jan 26, 2003 - 2 -


Conditional Probability

Probability gives chances for events in sample space S.

Often: Have partial information about event of interest.

Example: Number of Deaths in the U.S. in 1996

Cause All ages 1-4 5-14 15-24 25-44 45-64 ≥ 65

Heart 733,125 207 341 920 16,261 102,510 612,886

Cancer 544,161 440 1,035 1,642 22,147 132,805 386,092

HIV 32,003 149 174 420 22,795 8,443 22

Accidents1 92,998 2,155 3,521 13,872 26,554 16,332 30,564

Homicide2 24,486 395 513 6,548 9,261 7,717 52

All causes 2,171,935 5,947 8,465 32,699 148,904 380,396 1,717,218

1 Accidents and adverse effects, 2 Homicide and legal intervention

measure probability with respect to a subset of S

Conditional probability of A given B:

P(A|B) = P(A ∩ B)/P(B), if P(B) > 0.

If P(B) = 0 then P(A|B) is undefined.

Conditional probabilities for causes of death:

◦ P(accident) = 0.04282

◦ P(age=10) = 0.00390

◦ P(accident|age=10) = 0.42423

◦ P(accident|age=40) = 0.17832

Calculus of Probability, Jan 26, 2003 - 3 -


Conditional Probability

Example: Select two cards from 32 cards

◦ What is the probability that the second card is an ace?

P(2nd card is an ace) = 1/8

◦ What is the probability that the second card is an ace if the first was an ace?

P(2nd card is an ace | 1st card was an ace) = 3/31

Calculus of Probability, Jan 26, 2003 - 4 -


Multiplication rules

Example: Death Rates (per 100,000 people)

All Ages 1-4 5-14 15-24 25-44 45-64 ≥ 65

872.5 38.3 22.0 90.3 177.8 708.0 5071.4

Can we combine these rates with the table on causes of death?

◦ What is the probability to die from an accident (HIV)?

◦ What is the probability to die from an accident at age 10 (40)?

Know P(accident|die) = P(die from accident)/P(die)

⇒ P(die from accident) = P(accident|die)P(die)

Calculate probabilities:

◦ P(die from accident) = 0.04281 · 0.00873 = 0.00037

◦ P(die from accident|age = 10) = 0.42423 · 0.00090 = 0.00038

◦ P(die from accident|age = 40) = 0.17832 · 0.00178 = 0.00031

◦ P(die from HIV) = 0.01473 · 0.00873 = 0.00013

◦ P(die from HIV|age = 10) = 0.02055 · 0.00090 = 0.00002

◦ P(die from HIV|age = 40) = 0.15308 · 0.00178 = 0.00027

General multiplication rule: P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A)

Calculus of Probability, Jan 26, 2003 - 5 -


Independence

Example: Roll two dice

◦ What is the probability that the second die shows 1?

P(2nd die = 1) = 1/6

◦ What is the probability that the second die shows 1 if the first die already shows 1?

P(2nd die = 1 | 1st die = 1) = 1/6

◦ What is the probability that the second die shows 1 if the first does not show 1?

P(2nd die = 1 | 1st die ≠ 1) = 1/6

The chances of getting 1 with the second die are the same, no

matter what the first die shows. Such events are called indepen-

dent:

The event A is independent of the event B if its chances are not affected by the occurrence of B,

P(A|B) = P(A).

Equivalently, A and B are independent if

P(A ∩ B) = P(A)P(B).

Otherwise we say A and B are dependent.

Calculus of Probability, Jan 26, 2003 - 6 -


Let’s Make a Deal

The Rules:

◦ Three doors - one prize, two blanks

◦ Candidate selects one door

◦ The host reveals one losing door

◦ Candidate may switch doors


Would YOU change?

Can probability theory help you?

◦ What is the probability of winning if candidate switches doors?

◦ What is the probability of winning if candidate does not switch

doors?

Calculus of Probability, Jan 26, 2003 - 7 -


The Rule of Total Probability

Events of interest:

◦ A - choose the winning door at the beginning

◦ W - win the prize

Strategy: Switch doors (S)

Know: ◦ PS(W|A) = 0
      ◦ PS(W|A∁) = 1
      ◦ PS(A) = 1/3
      ◦ PS(A∁) = 2/3

Probability of interest, PS(W):

PS(W) = PS(W ∩ A) + PS(W ∩ A∁)
      = PS(W|A)PS(A) + PS(W|A∁)PS(A∁)
      = 0 · 1/3 + 1 · 2/3 = 2/3

Strategy: Do not switch doors (N)

Know: ◦ PN(W|A) = 1
      ◦ PN(W|A∁) = 0
      ◦ PN(A) = 1/3
      ◦ PN(A∁) = 2/3

Probability of interest, PN(W):

PN(W) = PN(W ∩ A) + PN(W ∩ A∁)
      = PN(W|A)PN(A) + PN(W|A∁)PN(A∁)
      = 1 · 1/3 + 0 · 2/3 = 1/3
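Both values are easy to confirm by simulation (a sketch, not part of the original notes; door labels are arbitrary):

import random

def play(switch, trials=100_000):
    wins = 0
    for _ in range(trials):
        prize = random.randrange(3)
        choice = random.randrange(3)
        # The host opens a losing door that is neither chosen nor winning.
        opened = next(d for d in range(3) if d != choice and d != prize)
        if switch:
            choice = next(d for d in range(3) if d != choice and d != opened)
        wins += (choice == prize)
    return wins / trials

print(play(switch=True))   # ~2/3
print(play(switch=False))  # ~1/3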

Calculus of Probability, Jan 26, 2003 - 8 -


The Rule of Total Probability

Rule of Total Probability

If B1, . . . , Bk are mutually exclusive and B1 ∪ . . . ∪ Bk = S, then

P(A) = P(A|B1)P(B1) + . . . + P(A|Bk)P(Bk)

Example:

Suppose an applicant for a job has been invited for an interview.

The chance that

◦ he is nervous is P(N) = 0.7,

◦ the interview is successful if he is nervous is P(S|N) = 0.2,

◦ the interview is successful if he is not nervous is P(S|N∁) = 0.9.

What is the probability that the interview is successful?

P(S) = P(S|N)P(N) + P(S|N∁)P(N∁) = 0.2 · 0.7 + 0.9 · 0.3 = 0.41

Calculus of Probability, Jan 26, 2003 - 9 -


The Rule of Total Probability

Example:

Suppose we have two unfair coins:

◦ Coin 1 comes up heads with probability 0.8

◦ Coin 2 comes up heads with probability 0.35

Choose a coin at random and flip it. What is the probability that it comes up heads?

Events: H = "heads comes up", C1 = "1st coin", C2 = "2nd coin"

P(H) = P(H|C1)P(C1) + P(H|C2)P(C2) = (1/2)(0.8 + 0.35) = 0.575

Calculus of Probability, Jan 26, 2003 - 10 -


Bayes’ Theorem

Example: O.J. Simpson

"Only about 1/10 of one percent of wife-batterers actually murder their wives"

Lawyer of O.J. Simpson on TV

Fact: Simpson pleaded no contest to beating his wife in 1988.

So he murdered his wife with probability 0.001?

◦ Sample space S - married couples in U.S. in which the husband

beat his wife in 1988

◦ Event H - all couples in S in which the husband has since

murdered his wife

◦ Event M - all couples in S in which the wife has been murdered

since 1988

We have ◦ P(H) = 0.001

◦ P(M |H) = 1 since H ⊆ M

◦ P(M |H∁) = 0.0001 at most in the U.S.

Then

P(H|M) = P(M|H)P(H) / P(M)
       = P(M|H)P(H) / [P(M|H)P(H) + P(M|H∁)P(H∁)]
       = 0.001 / (0.001 + 0.0001 · 0.999) ≈ 0.91

Calculus of Probability, Jan 26, 2003 - 11 -


Bayes’ Theorem

Reversal of conditioning (general multiplication rule):

P(B|A)P(A) = P(A|B)P(B)

Rewriting P(A) using the rule of total probability we obtain

Bayes' Theorem

P(B|A) = P(A|B)P(B) / [P(A|B)P(B) + P(A|B∁)P(B∁)]

If B1, . . . , Bk are mutually exclusive and B1 ∪ . . . ∪ Bk = S, then

P(Bi|A) = P(A|Bi)P(Bi) / [P(A|B1)P(B1) + . . . + P(A|Bk)P(Bk)]

(General form of Bayes' Theorem)

Calculus of Probability, Jan 26, 2003 - 12 -


Bayes’ Theorem

Example: Testing for AIDS

Enzyme immunoassay test for HIV:

◦ P(T+|I+) = 0.98 (sensitivity - positive for infected)

◦ P(T-|I-) = 0.995 (specificity - negative for noninfected)

◦ P(I+) = 0.0003 (prevalence)

What is the probability that the tested person is infected if the test was positive?

P(I+|T+) = P(T+|I+)P(I+) / [P(T+|I+)P(I+) + P(T+|I-)P(I-)]
         = (0.98 · 0.0003) / (0.98 · 0.0003 + 0.005 · 0.9997)
         = 0.05556

Consider a different population with P(I+) = 0.1 (greater risk):

P(I+|T+) = (0.98 · 0.1) / (0.98 · 0.1 + 0.005 · 0.9) = 0.956

testing on large scale not sensible (too many false positives)

Repeat test (Bayesian updating):

◦ P(I+|T++) = 0.92 in 1st population

◦ P(I+|T++) = 0.9998 in 2nd population
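The updates follow from one application of Bayes' theorem per test; a small sketch with the sensitivity and specificity from above:

def posterior(prior, sens=0.98, spec=0.995):
    # P(infected | positive test) by Bayes' theorem
    return sens * prior / (sens * prior + (1 - spec) * (1 - prior))

p = posterior(0.0003)     # ~0.056 after one positive test
print(p, posterior(p))    # ~0.92 after a second positive test

q = posterior(0.1)        # ~0.956 in the higher-risk population
print(q, posterior(q))    # ~0.9998 after a second positive test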

Calculus of Probability, Jan 26, 2003 - 13 -


Random Variables

Aim: ◦ Learn about population

◦ Available information: observed data x1, . . . , xn

Problem: ◦ Data affected by chance variation

◦ New set of data would look different

Suppose we observe/measure some characteristic (variable) of n

individuals. The actual observed values x1, . . . , xn are the outcome

of a random phenomenon.

Random variable: a variable whose value is a numerical out-

come of a random phenomenon

Remark: Mathematically, a random variable is a real-valued function on the sample space S:

X : S → R, ω ↦ x = X(ω)

◦ SX = X(S) is the sample space of the random variable.

◦ The outcome x = X(ω) is called a realisation of X.

◦ X induces a probability P(B) = P(X ∈ B) on SX, the probability distribution of X.

Example: Roll one die

Outcome ω        1 2 3 4 5 6
Realization X(ω) 1 2 3 4 5 6

Random Variables, Jan 28, 2003 - 1 -


Random Variables

Example: Roll two dice

◦ X1 - number on the first die

◦ X2 - number on the second die

◦ Y = X1 + X2 - total number of points

(a function of random variables is again a random variable)

Table of outcomes (all 36 equally likely pairs (X1, X2) and the total Y):

X1\X2   1  2  3  4  5  6
1       2  3  4  5  6  7
2       3  4  5  6  7  8
3       4  5  6  7  8  9
4       5  6  7  8  9 10
5       6  7  8  9 10 11
6       7  8  9 10 11 12

Random Variables, Jan 28, 2003 - 2 -


Random Variables

Two important types of random variables:

• Discrete random variable

◦ takes values in a finite or countable set

• Continuous random variable

◦ takes values in a continuum, or uncountable set

◦ probability of any particular outcome x is zero: P(X = x) = 0 for all x ∈ SX

Example: Ten tosses of a coin

Suppose we toss a coin ten times. Let

◦ X be the number of heads in ten tosses of a coin

◦ Y be the time it takes to toss ten times

Random Variables, Jan 28, 2003 - 3 -


Discrete Random Variables

Suppose X is a discrete random variable with values x1, x2, . . ..

Example: Roll two dice

Y = X1 + X2 total number of points

y        2    3    4    5    6    7    8    9    10   11   12
P(Y = y) 1/36 2/36 3/36 4/36 5/36 6/36 5/36 4/36 3/36 2/36 1/36

Frequency function: The function

p(x) = P(X = x) = P({ω ∈ S|X(ω) = x})

is called the frequency function or probability mass function.

Note: p defines a probability on SX = {x1, x2, . . .}:

P(B) = Σ_{x∈B} p(x) = P(X ∈ B).

We call P the (probability) distribution of X .

Properties of a discrete probability distribution

◦ p(x) ≥ 0 for all values of X

◦ Σ_i p(xi) = 1

Random Variables, Jan 28, 2003 - 4 -


Discrete Random Variables

Example: Roll one die

Let X denote the number of points on the face turned up. Since all numbers are equally likely we obtain

p(x) = P(X = x) = 1/6 if x ∈ {1, . . . , 6}, and 0 otherwise.

Example: Roll two dice

The probability mass function of the total number of points

Y = X1 + X2

can be written as

p(y) = P(Y = y) = (1/36)(6 − |y − 7|) if y ∈ {2, . . . , 12}, and 0 otherwise.

Example: Three tosses of a coin

Let X be the number of heads in three tosses of a coin. There are C(3, x) outcomes with x heads and 3 − x tails, thus

p(x) = C(3, x) · 1/8.

Random Variables, Jan 28, 2003 - 5 -


Continuous Random Variables

For a continuous random variable X, the probability that X falls in the interval (a, b] is given by

P(a < X ≤ b) = ∫_a^b f(x) dx,

where f is the density function of X .

Note: The density defines a probability on R:

P([a, b]) = ∫_a^b f(x) dx = P(X ∈ [a, b])

We call P the (probability) distribution of X .

Remark: The definition of P can be extended to (almost) all B ⊆ R.

Example: Spinner

Consider a spinner that turns freely on its axis and slowly comes to a stop.

◦ X is the stopping point on the circle marked from 0 to 1.

◦ X can take any value in SX = [0, 1).

◦ The outcomes of X are uniformly distributed over the interval [0, 1).

Then the density function of X is

f(x) = 1 if 0 ≤ x < 1, and 0 otherwise.

Consequently

P(X ∈ [a, b]) = b − a.

Note that for all possible outcomes x ∈ [0, 1) we have

P(X ∈ [x, x]) = x − x = 0.

Random Variables, Jan 28, 2003 - 6 -


Independence of Random Variables

Recall: Two events A and B are independent if P(A ∩ B) = P(A)P(B).

Independence of Random Variables

Two discrete random variables X and Y are independent if P(X ∈ A, Y ∈ B) = P(X ∈ A)P(Y ∈ B)

for all A ⊆ SX and B ⊆ SY .

Remark: It is sufficient to show that P(X = x, Y = y) = pX(x) pY(y) = P(X = x)P(Y = y)

for all x ∈ SX and y ∈ SY .

More generally, X1, X2, . . . are independent if for all n ∈ N

P(X1 ∈ A1, . . . , Xn ∈ An) = P(X1 ∈ A1) · · · P(Xn ∈ An)

for all Ai ⊆ SXi.

Example: Toss coin three times

Consider

Xi = 1 if head in the ith toss of the coin, and 0 otherwise.

X1, X2, and X3 are independent:

P(X1 = x1, X2 = x2, X3 = x3) = 1/8 = P(X1 = x1)P(X2 = x2)P(X3 = x3)

Random Variables, Jan 28, 2003 - 7 -


Multivariate Distributions: Discrete Case

Discrete Case

Let X and Y be discrete random variables.

Joint frequency function of X and Y

pXY (x, y) = P(X = x, Y = y) = P({X = x} ∩ {Y = y})

Marginal frequency function of X:

pX(x) = Σ_i pXY(x, yi)

Marginal frequency function of Y:

pY(y) = Σ_i pXY(xi, y)

The random variables X and Y are independent if and only if

pXY (x, y) = pX(x) pY (y)

for all possible values x ∈ SX and y ∈ SY .

Conditional probability of X = x given Y = y:

P(X = x|Y = y) = pX|Y(x|y) = pXY(x, y)/pY(y) = P(X = x, Y = y)/P(Y = y),

where pX|Y(x|y) is the conditional frequency function.

Random Variables, Jan 28, 2003 - 8 -


Multivariate Distributions

Discrete Case

Example: Three Tosses of a Coin

◦ X - number of heads on the first toss (values in {0, 1})

◦ Y - total number of heads (values in {0, 1, 2, 3})

The joint frequency function pXY(x, y) is given by the following table (the margins give pX and pY):

x\y    0    1    2    3   | pX(x)
0      1/8  2/8  1/8  0   | 1/2
1      0    1/8  2/8  1/8 | 1/2
pY(y)  1/8  3/8  3/8  1/8 | 1

Marginal frequency function of Y:

pY(0) = P(Y = 0) = P(Y = 0, X = 0) + P(Y = 0, X = 1) = 1/8 + 0 = 1/8

pY(1) = P(Y = 1) = P(Y = 1, X = 0) + P(Y = 1, X = 1) = 2/8 + 1/8 = 3/8

...

Random Variables, Jan 28, 2003 - 9 -


Multivariate Distributions

Continuous Case

Let X and Y be continuous random variables.

Joint density function of X and Y: fXY such that

∫_A ∫_B fXY(x, y) dy dx = P(X ∈ A, Y ∈ B)

Marginal density function of X:

fX(x) = ∫ fXY(x, y) dy

Marginal density function of Y:

fY(y) = ∫ fXY(x, y) dx

The random variables X and Y are independent if and only if

fXY (x, y) = fX(x) fY (y)

for all possible values x ∈ SX and y ∈ SY .

Conditional density function of X given Y = y:

fX|Y(x|y) = fXY(x, y)/fY(y)

Conditional probability of X ∈ A given Y = y:

P(X ∈ A|Y = y) = ∫_A fX|Y(x|y) dx

Random Variables, Jan 28, 2003 - 10 -


Bernoulli Distribution

Example: Toss of coin

Define X = 1 if head comes up and X = 0 if tail comes up.

Both realizations are equally likely: P(X = 1) = P(X = 0) = 1/2

Examples:

Often: Two outcomes which are not equally likely:

◦ Success of medical treatment

◦ Interviewed person is female

◦ Student passes exam

◦ Transmittance of a disease

Bernoulli distribution (with parameter θ)

◦ X takes the two values 0 and 1 with probabilities 1 − θ and θ, respectively

◦ Frequency function of X:

p(x) = θ^x (1 − θ)^(1−x) for x ∈ {0, 1}, and 0 otherwise

◦ Often:

X = 1 if event A has occurred, and 0 otherwise

Example: A = blood pressure above 140/90 mm Hg.

Distributions, Jan 30, 2003 - 1 -


Bernoulli Distribution

Let X1, . . . , Xn be independent Bernoulli random variables with

the same parameter θ.

Frequency function of X1, . . . , Xn:

p(x1, . . . , xn) = p(x1) · · · p(xn) = θ^(x1+...+xn) (1 − θ)^(n−x1−...−xn)

for xi ∈ {0, 1} and i = 1, . . . , n

Example: Paired-Sample Sign Test

◦ Study success of new elaborate safety program

◦ Record average weekly losses in hours of labor due to accidents before

and after installation of the program in 10 industrial plants

Plant 1 2 3 4 5 6 7 8 9 10

Before 45 73 46 124 33 57 83 34 26 17

After 36 60 44 119 35 51 77 29 24 11

Define for the ith plant

Xi = 1 if the first value is greater than the second, and 0 otherwise

Result: 1 1 1 1 0 1 1 1 1 1

The Xi’s are independently Bernoulli distributed with unknown

parameter θ.

Distributions, Jan 30, 2003 - 2 -


Binomial Distribution

Let X1, . . . , Xn be independent Bernoulli random variables

◦ Often only interested in number of successes

Y = X1 + . . . + Xn

Example: Paired Sample Sign Test (contd)

Define for the ith plant

Xi = 1 if the first value is greater than the second, and 0 otherwise,

and let

Y = Σ_{i=1}^n Xi.

Y is the number of plants for which the number of lost hours has

decreased after the installation of the safety program

We know:

◦ Xi is Bernoulli distributed with parameter θ

◦ Xi’s are independent

What is the distribution of Y ?

◦ Probability of a realization x1, . . . , xn with y successes:

p(x1, . . . , xn) = θ^y (1 − θ)^(n−y)

◦ Number of different realizations with y successes: C(n, y)

Distributions, Jan 30, 2003 - 3 -


Binomial Distribution

Binomial distribution (with parameters n and θ)

Let X1, . . . , Xn be independent and Bernoulli distributed with parameter θ and

Y = Σ_{i=1}^n Xi.

Y has frequency function

p(y) = C(n, y) θ^y (1 − θ)^(n−y) for y ∈ {0, . . . , n}

Y is binomially distributed with parameters n and θ. We write Y ∼ Bin(n, θ).

Note that

◦ the number of trials is fixed,

◦ the probability of success is the same for each trial, and

◦ the trials are independent.

Example: Paired Sample Sign Test (contd)

Let Y be the number of plants for which the number of lost hours

has decreased after the installation of the safety program. Then

Y ∼ Bin(n, θ)
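For a first feel of what this model buys us: under θ = 1/2 (the program has no effect), the observed 9 improvements out of 10 plants would be quite unlikely. A sketch of that computation (formal tests come later in the course):

from math import comb

def binom_pmf(y, n, theta):
    return comb(n, y) * theta**y * (1 - theta)**(n - y)

# P(Y >= 9) for Y ~ Bin(10, 1/2)
print(sum(binom_pmf(y, 10, 0.5) for y in (9, 10)))  # 11/1024, about 0.011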

Distributions, Jan 30, 2003 - 4 -

Binomial Distribution

Binomial distribution for n = 10

[Figure: frequency functions p(x) of Bin(10, θ) for θ = 0.1, 0.3, 0.5, 0.8.]

Distributions, Jan 30, 2003 - 5 -


Geometric Distribution

Consider a sequence of independent Bernoulli trials.

◦ On each trial, a success occurs with probability θ.

◦ Let X be the number of trials up to the first success.

What is the distribution of X?

◦ Probability of no success in x − 1 trials: (1 − θ)^(x−1)

◦ Probability of one success in the xth trial: θ

The frequency function of X is

p(x) = θ(1 − θ)^(x−1), x = 1, 2, 3, . . .

X is geometrically distributed with parameter θ.

Example:

Suppose a batter has probability 1/3 to hit the ball. What is the chance that he misses the ball less than 3 times?

The number X of balls up to the first hit is geometrically distributed with parameter 1/3. Thus

P(X ≤ 3) = 1/3 + (1/3)(2/3) + (1/3)(2/3)² = 19/27 ≈ 0.7037.
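A two-line check of this value; the complement rule (no hit in the first three balls) gives the same answer:

theta = 1 / 3

p = sum(theta * (1 - theta) ** (x - 1) for x in (1, 2, 3))
print(p, 1 - (1 - theta) ** 3)   # both ~0.7037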

Distributions, Jan 30, 2003 - 6 -


Hypergeometric Distribution

Example: Quality Control

Quality control - sample and examine fraction of produced units

◦ N produced units

◦ M defective units

◦ n sampled units

What is the probability that the sample contains x defective units?

The frequency function of X is

p(x) = C(M, x) C(N − M, n − x) / C(N, n), x = 0, 1, . . . , n.

X is a hypergeometric random variable with parameters N , M ,

and n.

Example:

Suppose that of 100 applicants for a job 50 were women and 50 were men,

all equally qualified. If we select 10 applicants at random what is the

probability that x of them are female?

The number of chosen female applicants is hypergeometrically distributed with parameters 100, 50, and 10. The frequency function is

p(x) = C(50, x) C(50, 10 − x) / C(100, 10) for x = 0, 1, . . . , 10.

Distributions, Jan 30, 2003 - 7 -


Poisson Distribution

Often we are interested in the number of events which occur in a specific period of time or in a specific area or volume:

◦ Number of alpha particles emitted from a radioactive source during a

given period of time

◦ Number of telephone calls coming into an exchange during one unit of

time

◦ Number of diseased trees per acre of a certain woodland

◦ Number of death claims received per day by an insurance company

Characteristics

Let X be the number of times a certain event occurs during a given

unit of time (or in a given area, etc).

◦ The probability that the event occurs in a given unit of time is

the same for all the units.

◦ The number of events that occur in one unit of time is inde-

pendent of the number of events in other units.

◦ The mean (or expected) rate is λ.

Then X is a Poisson random variable with parameter λ and

frequency function

p(x) = (λ^x / x!) e^(−λ), x = 0, 1, 2, . . .

Distributions, Jan 30, 2003 - 8 -


Poisson Approximation

The Poisson distribution is often used as an approximation for

binomial probabilities when n is large and θ is small:

p(x) = C(n, x) θ^x (1 − θ)^(n−x) ≈ (λ^x / x!) e^(−λ)

with λ = nθ.

Example: Fatalities in Prussian cavalry

Classical example from von Bortkiewicz (1898).

◦ Number of fatalities resulting from being kicked by a horse

◦ 200 observations (10 corps over a period of 20 years)

Statistical model:

◦ Each soldier is kicked to death by a horse with probability θ.

◦ Let Y be the number of such fatalities in one corps. Then

Y ∼ Bin(n, θ)

where n is the number of soldiers in one corps.

Observation: The data are well approximated by a Poisson distribution

with λ = 0.61

Deaths per Year Observed Rel. Frequency Poisson Prob.

0 109 0.545 0.543

1 65 0.325 0.331

2 22 0.110 0.101

3 3 0.015 0.021

4 1 0.005 0.003
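The Poisson column of this table can be reproduced directly from the frequency function (a quick sketch):

from math import exp, factorial

lam = 0.61
rel_freq = {0: 0.545, 1: 0.325, 2: 0.110, 3: 0.015, 4: 0.005}

for x, observed in rel_freq.items():
    poisson = lam**x / factorial(x) * exp(-lam)
    print(x, observed, round(poisson, 3))   # matches the table above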

Distributions, Jan 30, 2003 - 9 -

Poisson Approximation

Poisson approximation of Bin(40, θ)

[Figure: frequency functions of Bin(40, θ) for θ = 1/4, 1/8, 1/40, 1/400 compared with Poisson frequency functions with λ = nθ = 10, 5, 1, 1/10.]

Distributions, Jan 30, 2003 - 10 -


Continuous Distributions

Uniform distribution U(0, θ): Range (0, θ)

f(x) = (1/θ) 1_(0,θ)(x), E(X) = θ/2, var(X) = θ²/12

Exponential distribution Exp(λ): Range [0, ∞)

f(x) = λ exp(−λx) 1_[0,∞)(x), E(X) = 1/λ, var(X) = 1/λ²

Normal distribution N(µ, σ²): Range R

f(x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)), E(X) = µ, var(X) = σ²

[Figure: histograms of samples from the U(0, θ), Exp(λ), and N(µ, σ²) distributions, with boxplots comparing the three.]

Distributions, Jan 30, 2003 - 11 -


Expected Value

Let X be a discrete random variable which takes values in SX = {x1, x2, . . . , xn}.

Expected Value or Mean of X:

E(X) = Σ_{i=1}^n xi p(xi)

Example: Roll one die

Let X be the outcome of rolling one die. The frequency function is

p(x) = 1/6, x = 1, . . . , 6,

and hence

E(X) = Σ_{x=1}^6 x/6 = 7/2 = 3.5

Example: Bernoulli random variable

Let X ∼ Bin(1, θ), so p(x) = θ^x (1 − θ)^(1−x). Thus the mean of X is

E(X) = 0 · (1 − θ) + 1 · θ = θ.

Expected Value and Variance, Feb 2, 2003 - 1 -


Expected Value

Linearity of the expected value

Let X and Y be two discrete random variables. Then

E(aX + bY) = aE(X) + bE(Y)

for any constants a, b ∈ R. Note: No independence is required.

Proof:

E(aX + bY) = Σ_{x,y} (ax + by) p(x, y)
           = a Σ_{x,y} x p(x, y) + b Σ_{x,y} y p(x, y)
           = a Σ_x x p(x) + b Σ_y y p(y)    (using Σ_y p(x, y) = p(x) and Σ_x p(x, y) = p(y))
           = aE(X) + bE(Y)

Example: Binomial distribution

Let X ∼ Bin(n, θ). Then X = X1 + . . . + Xn with Xi ∼ Bin(1, θ):

E(X) = Σ_{i=1}^n E(Xi) = Σ_{i=1}^n θ = nθ

Expected Value and Variance, Feb 2, 2003 - 2 -


Expected Value

Example: Poisson distribution

Let X be a Poisson random variable with parameter λ.

E(X) = Σ_{x=0}^∞ x (λ^x / x!) e^(−λ)
     = λ e^(−λ) Σ_{x=1}^∞ λ^(x−1) / (x − 1)!
     = λ e^(−λ) e^λ
     = λ

Remarks:

◦ For most distributions some “advanced” knowledge of calculus

is required to find the mean.

◦ Use tables for the means of commonly used distributions.

Expected Value and Variance, Feb 2, 2003 - 3 -


Expected Value

Example: European Call Options

Agreement that gives an investor the right (but not the obligation) to buy a stock, bond, commodity, or other instrument at a specific time T at a specific price K (the strike price).

What is a fair price P for a European call option?

If ST is the price of the stock at time T, the profit will be

Profit = (ST − K)+ − P.

Profit is a random variable.

[Figure: profit (ST − K)+ − P of the option as a function of the stock price ST.]

The fair price P for this option is the expected value

P = E(ST − K)+.

Expected Value and Variance, Feb 2, 2003 - 4 -


Expected Value

Example: European Call Options (contd)

Consider the following simple model:

◦ St = St−1 + εt, t = 1, . . . , T

◦ P(εt = 1) = p and P(εt = −1) = 1 − p.

St is also called a random walk.

The distribution of ST is given by (s0 known at time 0)

ST = s0 + 2 Y − T, with Y ∼ Bin(T, p)

Therefore the price P is (assuming s0 = 0 without loss of generality)

P = E(ST − K)+ = Σ_{y=0}^T (2y − T − K) pθ(y) 1{y > (K + T)/2},

where pθ is the Bin(T, θ) frequency function with θ = P(εt = 1). For T = 20, K = 10, θ = 0.6 this gives P = 2.75.

[Figure: frequency function of the profit.]

Expected Value and Variance, Feb 2, 2003 - 5 -


Expected Value

Example: Group testing

Suppose that a large number of blood samples are to be screened for a rare

disease with prevalence 1 − p.

• If each sample is assayed individually, n tests will be required.

• Alternative scheme:

◦ n samples, m groups of k samples each (n = mk)

◦ Split each sample in half and pool all samples in one group

◦ Test pooled sample for each group

◦ If the pooled test is positive, test all samples in the group separately

What is the expected number of tests N under this alternative scheme?

Let Xi be the number of tests in group i. The frequency function of Xi is

p(x) = p^k if x = 1, and 1 − p^k if x = k + 1.

The expected number of tests in each group is

E(Xi) = p^k + (k + 1)(1 − p^k) = k + 1 − k p^k

Hence

E(N) = Σ_{i=1}^m E(Xi) = m(k + 1 − k p^k) = n(1 + 1/k − p^k)
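A sketch of the minimization; p = 0.99 (prevalence 1%) is an assumption chosen here for illustration, and it reproduces the minimum at k = 11 shown in the plot below:

p = 0.99

def tests_per_sample(k):
    # expected number of tests per sample, E(N)/n = 1 + 1/k - p**k
    return 1 + 1 / k - p ** k

best = min(range(2, 30), key=tests_per_sample)
print(best, round(tests_per_sample(best), 4))   # 11, about 0.1956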

Plot of E(N): the mean is minimized for groups of size 11.

[Figure: expected proportion of tests E(N)/n as a function of the group size k.]

Expected Value and Variance, Feb 2, 2003 - 6 -


Variance

Let X be a random variable.

Variance of X:

var(X) = E(X − E(X))².

The variance of X is the expected squared distance of X from its

mean.

Suppose X is a discrete random variable with SX = {x1, . . . , xn}. Then the variance of X can be written as

var(X) = Σ_{i=1}^n (xi − Σ_{j=1}^n xj p(xj))² p(xi)

Example: Roll one die

X takes values in {1, 2, 3, 4, 5, 6} with frequency function p(x) = 1/6.

E(X) = Σ_{x=1}^6 x · (1/6) = 7/2

var(X) = Σ_{x=1}^6 (x − 7/2)² · (1/6) = (1/6)(25/4 + 9/4 + 1/4 + 1/4 + 9/4 + 25/4) = 35/12

We often denote the variance of a random variable X by σ²X,

σ²X = var(X),

and its standard deviation by σX.

Expected Value and Variance, Feb 2, 2003 - 7 -


Properties of the Variance

The variance can also be written as

var(X) = E(X²) − (E(X))²

To see this (using linearity of the mean):

var(X) = E(X − E(X))²
       = E[X² − 2X E(X) + (E(X))²]
       = E(X²) − 2E(X)E(X) + (E(X))²
       = E(X²) − (E(X))²

Example: Let X ∼ Bin(1, θ). Then

var(X) = E(X²) − (E(X))² = E(X) − (E(X))² = θ − θ² = θ(1 − θ)

Rules for the variance:

◦ For constants a and b

var(aX + b) = a²var(X).

◦ For independent random variables X and Y

var(X + Y ) = var(X) + var(Y ).

Example: Let X ∼ Bin(n, θ). Then

var(X) = n θ (1 − θ)

Expected Value and Variance, Feb 2, 2003 - 8 -


Covariance

For independent random variables X and Y we have

var(X + Y ) = var(X) + var(Y ).

Question: What about dependent random variables?

It can be shown that

var(X + Y ) = var(X) + var(Y ) + 2 cov(X, Y )

where

cov(X, Y) = E[(X − E(X))(Y − E(Y))]

is the covariance of X and Y .

Properties of the covariance

◦ cov(X, Y ) = E(XY ) − E(X)E(Y )

◦ cov(X, X) = var(X)

◦ cov(X, 1) = 0

◦ cov(X, Y ) = cov(Y, X)

◦ cov(a X1 + b X2, Y ) = a cov(X1, Y ) + b cov(X2, Y )

Expected Value and Variance, Feb 2, 2003 - 9 -


Covariance

Important:

cov(X, Y ) = 0 does NOT imply that X and Y are independent.

Example:

Suppose X ∈ {−1, 0, 1} with probabilities P(X = x) = 13

for

x = −1, 0, 1. Then E(X) = 0 and

cov(X, X2) = E(X3) = E(X) = 0

On the other handP(X = 1, X2 = 0) = 0 6= 19 = P(X = 1)P(X2 = 0),

that is, X and Y are not independent!

Note: The covariance of X and Y measures only linear depen-

dence.

Expected Value and Variance, Feb 2, 2003 - 10 -


Correlation

The correlation coefficient ρ is defined as

ρXY = corr(X, Y) = cov(X, Y) / √(var(X) var(Y)).

Properties:

◦ dimensionless quantity

◦ not affected by linear transformations, i.e.

corr(a X + b, c Y + d) = corr(X, Y )

◦ −1 ≤ ρXY ≤ 1

◦ |ρXY| = 1 if and only if P(Y = a + bX) = 1 for some a and b

◦ measures linear association between X and Y

Example: Three boxes: pp, pd, and dd (Ex 3.6)

Let Xi = 1{penny on ith draw}. Then Xi ∼ Bin(1, p) with p = 1/2 and joint frequency function p(x1, x2):

x1\x2   0    1
0       1/3  1/6
1       1/6  1/3

Thus:

cov(X1, X2) = E[(X1 − p)(X2 − p)] = (1/4)(1/3) + (1/4)(1/3) − 2 · (1/4)(1/6) = 1/12

corr(X1, X2) = (1/12)/(1/4) = 4 · 1/12 = 1/3

Expected Value and Variance, Feb 2, 2003 - 11 -


Prediction

An instructor standardizes his midterm and final so the class aver-

age is µ = 75 and the SD is σ = 10 on both tests. The correlation

between the tests is always around ρ = 0.50.

◦ X - score of student on the first examination

◦ Y - score of student on the second examination

Since X and Y are dependent we should be able to predict the

score in the final from the midterm score.

Approach:

◦ Predict Y from linear function a + b X

◦ Minimize the mean squared error

MSE = E(Y − a − bX)² = var(Y − bX) + [E(Y − a − bX)]²

Solution:

a = µ − bµ and b = σXY/σ²X = ρ

Thus the best linear predictor is

Ŷ = µ + ρ(X − µ)

Note:

We expect the student’s score on the final to differ from the mean

only by half the difference observed in the midterm (regression to

the mean).

Expected Value and Variance, Feb 2, 2003 - 12 -


Summary

Bernoulli distribution - Bin(1, θ):
p(x) = θ^x (1 − θ)^(1−x), E(X) = θ, var(X) = θ(1 − θ)

Binomial distribution - Bin(n, θ):
p(x) = C(n, x) θ^x (1 − θ)^(n−x), E(X) = nθ, var(X) = nθ(1 − θ)

Poisson distribution - Poiss(λ):
p(x) = (λ^x / x!) e^(−λ), E(X) = λ, var(X) = λ

Geometric distribution:
p(x) = θ(1 − θ)^(x−1), E(X) = 1/θ, var(X) = (1 − θ)/θ²

Hypergeometric distribution - H(N, M, n):
p(x) = C(M, x) C(N − M, n − x) / C(N, n), E(X) = nM/N

Expected Value and Variance, Feb 2, 2003 - 13 -


Properties of the Sample Mean

Consider X1, . . . , Xn independent and identically distributed (iid) with mean µ and variance σ².

X̄ = (1/n) Σ_{i=1}^n Xi (sample mean)

Then

E(X̄) = (1/n) Σ_{i=1}^n µ = µ

var(X̄) = (1/n²) Σ_{i=1}^n σ² = σ²/n

Remarks:

◦ The sample mean is an unbiased estimate of the true mean.

◦ The variance of the sample mean decreases as the sample size

increases.

◦ Law of Large Numbers: It can be shown that for n → ∞

X̄ = (1/n) Σ_{i=1}^n Xi → µ.

Question:

◦ How close to µ is the sample mean for finite n?

◦ Can we answer this without knowing the distribution of X?

Central Limit Theorem, Feb 4, 2004 - 1 -


Properties of the Sample Mean

Chebyshev’s inequality

Let X be a random variable with mean µ and variance σ². Then for any ε > 0

P(|X − µ| > ε) ≤ σ²/ε².

Proof (for discrete X): Let 1{|xi − µ| > ε} = 1 if |xi − µ| > ε, and 0 otherwise. Then

P(|X − µ| > ε) = Σ_i 1{|xi − µ| > ε} p(xi) = Σ_i 1{(xi − µ)²/ε² > 1} p(xi)
              ≤ Σ_i ((xi − µ)²/ε²) p(xi) = σ²/ε²

Application to the sample mean:

P(µ − 3σ/√n ≤ X̄ ≤ µ + 3σ/√n) ≥ 1 − 1/9 ≈ 0.889

However, the bound is known to be not very precise.

Example: Xi iid ∼ N(0, 1). Then

X̄ = (1/n) Σ_{i=1}^n Xi ∼ N(0, 1/n).

Therefore

P(−3/√n ≤ X̄ ≤ 3/√n) = 0.997

Central Limit Theorem, Feb 4, 2004 - 2 -


Central Limit Theorem

Let X1, X2, . . . be a sequence of random variables

◦ independent and identically distributed

◦ with mean µ and variance σ2.

For n ∈ N define

Zn = √n (X̄ − µ)/σ = (1/√n) Σ_{i=1}^n (Xi − µ)/σ.

Zn has mean 0 and variance 1.

Central Limit Theorem

For large n, the distribution of Zn can be approximated by the standard normal distribution N(0, 1). More precisely,

lim_{n→∞} P(a ≤ √n (X̄ − µ)/σ ≤ b) = Φ(b) − Φ(a),

where Φ(z) is the standard normal probability

Φ(z) = ∫_{−∞}^z f(x) dx,

that is, the area under the standard normal curve to the left of z.

Example:

◦ U1, . . . , U12 uniformly distributed on [0, 12), so µ = 6, σ² = 12²/12 = 12, and σ/√n = √12/√12 = 1.

◦ What is the probability that the sample mean exceeds 9?

P(Ū > 9) = P((Ū − 6)/1 > 3) ≈ 1 − Φ(3) = 0.0013

Central Limit Theorem, Feb 4, 2004 - 3 -

Central Limit Theorem

[Figure: densities of the standardized sample mean for U[0, 1] samples and for Exp(1) samples, each with n = 1, 2, 6, 12, 100; both sequences approach the N(0, 1) density, the skewed exponential case more slowly.]

Central Limit Theorem, Feb 4, 2004 - 4 -


Central Limit Theorem

Example: Shipping packages

Suppose a company ships packages that vary in weight:

◦ Packages have mean 15 lb and standard deviation 10 lb.

◦ They come from a large number of customers, i.e. packages are independent.

Question: What is the probability that 100 packages will have a

total weight exceeding 1700 lb?

Let Xi be the weight of the ith package and

T = Σ_{i=1}^{100} Xi.

Then E(T) = 100 · 15 lb = 1500 lb and sd(T) = √100 · 10 lb = 100 lb, so

P(T > 1700 lb) = P((T − 1500 lb)/(100 lb) > (1700 lb − 1500 lb)/(100 lb))
              = P((T − 1500 lb)/(100 lb) > 2)
              ≈ 1 − Φ(2) = 0.023
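The same tail probability in Python, using the exact identity Φ(z) = (1 + erf(z/√2))/2 (a sketch):

from math import erf, sqrt

def phi(z):
    # standard normal cumulative distribution function
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sd, n = 15, 10, 100
z = (1700 - n * mu) / (sqrt(n) * sd)   # standardized total weight, = 2
print(1 - phi(z))                      # ~0.023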

Central Limit Theorem, Feb 4, 2004 - 5 -


Central Limit Theorem

Remarks

• How fast approximation becomes good depends on distribution

of Xi’s:

◦ If it is symmetric and has tails that die off rapidly, n can

be relatively small.

Example: If Xiiid∼ U [0, 1], the approximation is good for

n = 12.

◦ If it is very skewed or if its tails die down very slowly, a

larger value of n is needed.

Example: Exponential distribution.

• Central limit theorems are very important in statistics.

• There are many central limit theorems covering many situa-

tions, e.g.

◦ for not identically distributed random variables or

◦ for dependent, but not “too” dependent random variables.

Central Limit Theorem, Feb 4, 2004 - 6 -


The Normal Approximation to the Binomial

Let X be binomially distributed with parameters n and p.

Recall that X is the sum of n iid Bernoulli random variables,

X = Σ_{i=1}^n Xi, Xi iid ∼ Bin(1, p).

Therefore we can apply the Central Limit Theorem:

Normal Approximation to the Binomial Distribution

For n large enough, X is approximately N(np, np(1 − p)) distributed:

P(a ≤ X ≤ b) ≈ P(a − 1/2 ≤ Z ≤ b + 1/2)

where Z ∼ N(np, np(1 − p)).

Rule of thumb for n: np > 5 and n(1 − p) > 5.

In terms of the standard normal distribution we get

P(a ≤ X ≤ b) ≈ P((a − 1/2 − np)/√(np(1 − p)) ≤ Z′ ≤ (b + 1/2 − np)/√(np(1 − p)))
            = Φ((b + 1/2 − np)/√(np(1 − p))) − Φ((a − 1/2 − np)/√(np(1 − p)))

where Z′ ∼ N(0, 1).

Central Limit Theorem, Feb 4, 2004 - 7 -

The Normal Approximation to the Binomial

[Figure: frequency functions of Bin(n, 0.5) for n = 1, 2, 5, 10, 20 and of Bin(n, 0.1) for n = 1, 5, 10, 20, 50; the shapes approach the normal curve as n grows.]

Central Limit Theorem, Feb 4, 2004 - 8 -

Page 145: Stat Methods

The Normal Approximation to the Binomial

Example: The random walk of a drunkard

Suppose a drunkard executes a “random” walk in the following

way:

◦ Each minute he takes a step north or south, with probability 1/2 each.

◦ His successive step directions are independent.

◦ His step length is 50 cm.

How likely is he to have advanced 10 m north after one hour?

◦ Position after one hour: X · 1 m − 30 m, where X is the number of steps north

◦ X is binomially distributed with parameters n = 60 and p = 1/2

◦ X is approximately normal with mean 30 and variance 15:

P(X · 1 m − 30 m ≥ 10 m) = P(X ≥ 40)
                         ≈ P(Z > 39.5), with Z ∼ N(30, 15)
                         = P((Z − 30)/√15 > 9.5/√15)
                         = 1 − Φ(2.452) = 0.007

How does the probability change if he has some idea of where he wants to go and steps north with probability p = 2/3 and south with probability 1/3?
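A sketch that evaluates the normal approximation for both values of p; the second call answers the closing question:

from math import erf, sqrt

def phi(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

def prob_advance(p, steps=60, target=40):
    # P(X >= target) with continuity correction, X ~ Bin(steps, p)
    mean, var = steps * p, steps * p * (1 - p)
    return 1 - phi((target - 0.5 - mean) / sqrt(var))

print(prob_advance(1 / 2))   # ~0.007, as above
print(prob_advance(2 / 3))   # ~0.55 with a sense of direction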

Central Limit Theorem, Feb 4, 2004 - 9 -


Estimation

Example: Cholesterol levels of heart-attack patients

Data: Observational study at a Pennsylvania medical center

◦ blood cholesterol levels of patients treated for heart attacks

◦ measurements 2, 4, and 14 days after the attack

Id Y1 Y2 Y3 Id Y1 Y2 Y3

1 270 218 156 15 294 240 264

2 236 234 193 16 282 294 220

3 210 214 242 17 234 220 264

4 142 116 120 18 224 200 213

5 280 200 181 19 276 220 188

6 272 276 256 20 282 186 182

7 160 146 142 21 360 352 294

8 220 182 216 22 310 202 214

9 226 238 248 23 280 218 170

10 242 288 298 24 278 248 198

11 186 190 168 25 288 278 236

12 266 236 236 26 288 248 256

13 206 244 238 27 244 270 280

14 318 258 200 28 236 242 204

Aim: Make inference on the distribution of

◦ cholesterol level 14 days after the attack: Y3

◦ decrease in cholesterol level: D = Y1 − Y3

◦ relative decrease in cholesterol level: R = (Y1 − Y3)/Y3

Confidence intervals I, Feb 11, 2004 - 1 -


Estimation

Data:

d1, . . . , d28 observed decrease in cholesterol level

In this example, parameters of interest might be

µD = E(D), the mean decrease in cholesterol level,

σ²D = var(D), the variance of the decrease in cholesterol level,

pD = P(D ≤ 0), the probability of no decrease in cholesterol level.

These parameters are naturally estimated by the following sample

statistics:

µ̂D = (1/n) Σ_{i=1}^n di (sample mean)

σ̂²D = (1/n) Σ_{i=1}^n (di − d̄)² (sample variance)

p̂D = #{di | di ≤ 0} / n (sample proportion)

Such statistics are point estimators since they estimate the corre-

sponding parameter by a single numerical value.

◦ Point estimates provide no information about their chance vari-

ation.

◦ Estimates without an indication of their variability are of lim-

ited value.

Confidence intervals I, Feb 11, 2004 - 2 -


Confidence Intervals for the Mean

Recall:

◦ CLT for the sample mean: For large n we have

X̄ ≈ N(µ, σ²/n)

◦ 68-95-99 rule: With 95% probability the sample mean differs from its mean µ by less than two of its standard deviations.

More precisely, we have

P(µ − 1.96 σ/√n ≤ X̄ ≤ µ + 1.96 σ/√n) = 0.95,

or equivalently, after rearranging the terms,

P(X̄ − 1.96 σ/√n ≤ µ ≤ X̄ + 1.96 σ/√n) = 0.95.

Interpretation: There is 95% probability that the random interval

[X̄ − 1.96 σ/√n, X̄ + 1.96 σ/√n]

will cover the mean µ.

Example: Cholesterol levels

d̄ = 36.89, σ̂ = 51.00, n = 28.

Therefore, the 95% confidence interval for µ is [18.00, 55.78].
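The interval is a one-liner to compute (a sketch):

from math import sqrt

def z_interval(mean, sigma, n, z=1.96):
    # large-sample (1 - alpha) confidence interval for the mean
    half = z * sigma / sqrt(n)
    return mean - half, mean + half

print(z_interval(36.89, 51.00, 28))   # ~(18.00, 55.78)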

Confidence intervals I, Feb 11, 2004 - 3 -


Confidence Intervals for the Mean

Assumption: The population standard deviation σ is known.

◦ In the next lecture, we will drop this unrealistic assumption.

◦ Assumption is approximately satisfied for large sample sizes,

since then σ̂ ≈ σ by the law of large numbers.

Definition: Confidence interval for µ (σ known)

The interval

[X̄ − z_{α/2} σ/√n, X̄ + z_{α/2} σ/√n]

is called a 1 − α confidence interval for the population mean µ. (1 − α) is the confidence level.

For large sample sizes n, an approximate (1 − α) confidence interval for µ is given by the same formula with σ replaced by its estimate σ̂:

[X̄ − z_{α/2} σ̂/√n, X̄ + z_{α/2} σ̂/√n].

Here, zα is the α-critical value of the standard normal distribution:

◦ zα has area α to its right

◦ Φ(zα) = 1 − α

[Figure: standard normal density with the area α to the right of zα shaded.]

Confidence intervals I, Feb 11, 2004 - 4 -


Confidence Interval for the Mean

Example: Community banks

◦ Community banks are banks with less than a billion dollars of assets.

◦ Approximately 7500 such banks in the United States.

Annual survey of the Community Bankers Council of the American Bankers

Association (ABA)

◦ Population: Community banks in the United States.

◦ Variable of interest: Total assets of community banks.

◦ Sample size: n = 110

◦ Sample mean: X = 220 millions of dollars

◦ Sample standard deviation: SD = 161 millions of dollars

◦ Histogram of sampled values:

[Figure: "Assets of Community Banks in the U.S.": histogram of assets (in millions of dollars) for the sample of 110 community banks.]

Suppose we want to give a 95% confidence interval for the mean total assets

of all community banks in the United States.

◦ α = 0.05, z_{α/2} = 1.96

A 95% confidence interval for the mean assets (in millions of dollars) is

[220 − 1.96 · 161/√110, 220 + 1.96 · 161/√110] ≈ [190, 250].

Confidence intervals I, Feb 11, 2004 - 5 -


Sample Size

Example: Cholesterol levels

Suppose we want a 99% confidence interval for the decrease in

cholesterol level:

◦ α = 0.01, z_{0.005} = 2.58

◦ The 99% confidence interval for µD is

[36.89 − 2.58 · 50.93/√28, 36.89 + 2.58 · 50.93/√28] ≈ [12.06, 61.72].

Note: If we raise the confidence level, the confidence interval

becomes wider.

Suppose we want to increase the confidence level without increasing the error of estimation (indicated by the half-width of the confidence interval). For this we have to increase the sample size n.

Question: What sample size n is needed to estimate the mean

decrease in cholesterol with error e = 20 and confidence level 99%?

The error (half-width of the confidence interval) is

e = z_{α/2} σ/√n.

Therefore the sample size ne needed is given by

ne ≥ (z_{α/2} σ / e)² = (2.58 · 50.93 / 20)² = 43.16,

that is, a sample of 44 patients is needed to estimate µD with error e = 20 and 99% confidence.
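Solving e = z_{α/2} σ/√n for n and rounding up (a sketch):

from math import ceil

def sample_size(sigma, e, z=2.58):
    # smallest n whose half-width z * sigma / sqrt(n) is at most e
    return ceil((z * sigma / e) ** 2)

print(sample_size(50.93, 20))   # 44 patients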

Confidence intervals I, Feb 11, 2004 - 6 -


Estimation of the Mean

Example: Banks’ loan-to-deposit ratio

The ABA survey of community banks also asked about the loan-to-deposit

ratio (LTDR), a bank’s total loans as a percent of its total deposits.

Sample statistics:

◦ n = 110

◦ µLTDR = 76.7

◦ σLTDR = 12.3

[Figure: histogram of the loan-to-deposit ratio LTDR (in %) for the sample of 110 community banks.]

Construction of a 95% confidence interval:

◦ α = 0.05, z_{α/2} = 1.96

◦ Standard error σ̂X̄ = σ̂LTDR/√n = 1.17

◦ 95% confidence interval for µLTDR:

[X̄ − z_{α/2} σ̂LTDR/√n, X̄ + z_{α/2} σ̂LTDR/√n] = [74.4, 79.0]

◦ To get an estimate with error e = 3.0 (half-width of the confidence interval) it suffices to sample ne banks,

ne ≥ (z_{α/2} σ̂LTDR / e)² = (1.96 · 12.3 / 3.0)² = 64.6.

Thus a sample of ne = 65 banks is sufficient.

Confidence intervals I, Feb 11, 2004 - 7 -


Confidence intervals

Definition: Confidence interval

A (1 − α) confidence interval for a parameter is an interval that

◦ depends only on sample statistics and

◦ covers the parameter with probability (1 − α)

Note:

◦ Confidence intervals are random while the estimated parameter

is fixed.

◦ For repeated samples, only 95% of the confidence intervals will cover the true parameter µ:

[Figure: confidence intervals from repeated samples plotted around the true mean µ; most, but not all, cover µ]

Confidence intervals II, Feb 13, 2004 - 1 -


Confidence Intervals for the Mean

Suppose that X1, . . . , Xn iid∼ N(µ, σ²). Then

  (X̄ − µ)/(σ/√n) ∼ N(0, 1)   (*)

Assuming that σ is known, we obtain

  [X̄ − z_{α/2}·σ/√n , X̄ + z_{α/2}·σ/√n]

as a (1 − α) confidence interval for µ.

More realistic situation: σ is unknown.

Approach: Replace σ by the estimate σ̂ = s.

This approach leads to the t statistic

  T = (X̄ − µ)/(s/√n) ∼ t_{n−1}.

It is t distributed with n − 1 degrees of freedom.

[Figure: densities of the t1, t3, and t10 distributions compared with N(0, 1)]

Confidence interval for the mean µ (σ unknown)

The interval

  [X̄ − t_{n−1,α/2}·s/√n , X̄ + t_{n−1,α/2}·s/√n]

is a (1 − α) confidence interval for the mean µ.

Notation: Critical values of distributions

  zα        standard normal distribution
  t_{n,α}   t distribution with n degrees of freedom

Confidence intervals II, Feb 13, 2004 - 2 -


Confidence Intervals for the Mean

Example: Cholesterol levels

In the study on cholesterol levels, the standard deviation of the decrease

of cholesterol level was unknown.

◦ µ̂_D = 36.89, σ̂_D = 50.94

◦ t_{27,0.025} = 2.05

◦ Then

  [36.89 − 2.05·50.94/√28 , 36.89 + 2.05·50.94/√28] = [17.16, 56.62]

is a 95% confidence interval for µ_D.

◦ The large-sample confidence interval based on (*) was [18.00, 55.78].

Example: Level of vitamin C

The following data are the amounts of vitamin C, measured in milligrams

per 100 grams (mg/100 g) of corn soy blend, for a random sample of size 8

from a production run:

26 31 23 22 11 22 14 31

What is the 95% confidence interval for µ, the mean vitamin C content of

the CSB produced during this run?

◦ µ̂ = 22.5, σ̂ = 7.2, t_{7,0.025} = 2.36

◦ The 95% confidence interval for µ is

  [22.5 − 2.36·7.2/√8 , 22.5 + 2.36·7.2/√8] = [16.5, 28.5].

◦ The large-sample CI would be [17.5, 27.5].
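The same interval can be computed from the summary statistics with the immediate command cii (a sketch; in older STATA releases the syntax for means is cii #obs #mean #sd):

. cii 8 22.5 7.2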

Confidence intervals II, Feb 13, 2004 - 3 -


Confidence Intervals for the Variance

For normally distributed data X1, . . . , Xn iid∼ N(µ, σ²), the ratio

  (n − 1)·s²/σ²

has a χ² distribution with n − 1 degrees of freedom.

The (1 − α) confidence interval for σ² is

  [(n − 1)·s²/χ²_{n−1,α/2} , (n − 1)·s²/χ²_{n−1,1−α/2}],

where χ²_{n−1,α} is the upper α critical value of the χ²_{n−1} distribution.

Caution: This confidence interval is not robust against depar-

tures from normality regardless of the sample size.

Example: Cholesterol levels

Suppose we are interested in the variance of Y3, the cholesterol level 14

days after the attack.

◦ Normal probability plot:

[Figure: normal probability plot of cholesterol level against normal quantiles; the points lie close to a straight line]

Data seem to be normally distributed.

◦ s² = 2030.55, χ²_{27,0.975} = 14.57, χ²_{27,0.025} = 43.19

◦ The 95% confidence interval for σ² is

  [27·2030.55/43.19 , 27·2030.55/14.57] = [1269.26, 3761.99]
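A minimal STATA check, using invchi2tail() for the χ² critical values:

. display 27*2030.55/invchi2tail(27, 0.025)
. display 27*2030.55/invchi2tail(27, 0.975)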

Confidence intervals II, Feb 13, 2004 - 4 -


Statistical Tests

Example:

Suppose that of 100 applicants for a job 50 were women and 50 were men,

all equally qualified. Further suppose that the company hired 2 women

and 8 men.

Question:

◦ Does the company discriminate against female job applicants?

◦ How likely is this outcome under the assumption that the company

does not discriminate?

Example:

◦ Study success of new elaborate safety program

◦ Record average weekly losses in hours of labor due to accidents before

and after installation of the program in 10 industrial plants

Plant 1 2 3 4 5 6 7 8 9 10

Before 45 73 46 124 33 57 83 34 26 17

After 36 60 44 119 35 51 77 29 24 11

Question:

◦ Has the safety program an effect on the loss of labour due to accidents?

◦ In 9 out of 10 plants the average weekly losses have decreased after

implementation of the safety program. How likely is this (or a more

extreme) outcome under the assumption that there is no difference

before and after implementation of the safety program.

Testing Hypotheses I, Feb 16, 2004 - 1 -


Statistical Tests

Example: Fair coin

Suppose we have a coin. We suspect it might be unfair. We devise a

statistical experiment:

◦ Toss coin 100 times

◦ Conclude that coin is fair if we see between 40 and 60 heads

◦ Otherwise decide that the coin is not fair

Let θ be the probability that the coin lands heads, that is, P(Xi = 1) = θ and P(Xi = 0) = 1 − θ.

Our suspicion (“coin not fair”) is a hypothesis about the population parameter θ (θ ≠ 1/2) and thus about P. We emphasize this dependence of P on θ by writing P_θ.

Decision problem:

  Null hypothesis H0:        X ∼ Bin(100, 1/2)
  Alternative hypothesis Ha: X ∼ Bin(100, θ), θ ≠ 1/2

The null hypothesis represents the default belief (here: the coin is fair).

The alternative is the hypothesis we accept in view of evidence against the

null hypothesis.

The data-based decision rule

  reject H0 if X ∉ [40, 60]
  do not reject H0 if X ∈ [40, 60]

is called a statistical test for the test problem H0 vs. Ha.

Testing Hypotheses I, Feb 16, 2004 - 2 -


Statistical Tests

Example: Fair coin (contd)

Note: It is possible to obtain e.g. X = 55 (or X = 65)

◦ with probability 0.048 (resp. 0.0009) if p = 0.5

◦ with probability 0.048 (resp. 0.048) if p = 0.6

◦ with probability 0.0005 (resp. 0.047) if p = 0.7

[Figure: probability mass functions of Bin(100, 0.5), Bin(100, 0.6), and Bin(100, 0.7); in each panel the acceptance region X ∈ [40, 60] and the rejection region X ∉ [40, 60] for H0: p = 0.5 are marked]

Testing Hypotheses I, Feb 16, 2004 - 3 -


Types of errors

Example: Fair coin (contd)

It is possible that the test (decision rule) gives a wrong answer:

◦ If θ = 0.7 and x = 55, we do not reject the null hypothesis that the

coin is fair although the coin in fact is not fair.

◦ If θ = 0.5 and x = 65, we reject the null hypothesis that the coin is fair

although the coin in fact is fair.

The following table lists the possibilities:

Decision H0 true H0 false

Reject H0 type I error correct decision

Accept H0 correct decision type II error

Definition (Types of error)

◦ If we reject H0 when in fact H0 is true, this is a Type I error.

◦ If we do not reject H0 when in fact H0 is false, this is a Type II error.

Testing Hypotheses I, Feb 16, 2004 - 4 -


Types of errors

Question: How good is our decision rule?

For a good decision rule, the probability of committing an error of either

type should be small.

Probability of type I error: α

If the null hypothesis is true, i.e. θ = 1/2, then

  P_θ(reject H0) = P_θ(X ∉ [40, 60])
                 = 1 − P_θ(X ∈ [40, 60])
                 = 1 − Σ_{x=40}^{60} C(100, x)·(1/2)^100
                 = 0.035,

where C(100, x) denotes the binomial coefficient. Thus the probability of a type I error, denoted as α, is 3.5%.
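The tail sum can be evaluated with STATA's cumulative binomial function binomial(n, k, θ) = P(X ≤ k):

. display 1 - (binomial(100, 60, 0.5) - binomial(100, 39, 0.5))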

Probability of type II error: β(θ)

If the null hypothesis is false and the true probability of observing “head” is θ with θ ≠ 1/2, then

  P_θ(accept H0) = P_θ(X ∈ [40, 60]) = Σ_{x=40}^{60} C(100, x)·θ^x·(1 − θ)^{n−x}.

Thus, the probability of an error of type II depends on θ. It will be denoted as β(θ).

Testing Hypotheses I, Feb 16, 2004 - 5 -


Power of Tests

Question: How good is our test in detecting the alternative?

Consider the probability of rejecting H0:

  P_θ(reject H0) = P_θ(X ∉ [40, 60]) = 1 − P_θ(accept H0) = 1 − β(θ).

Note:

◦ If θ = 1/2 this is the probability of committing an error of type I:

  1 − β(1/2) = α

◦ If θ ≠ 1/2 this is the probability of correctly rejecting H0.

Definition (Power of a test)

We call 1 − β(θ) the power of the test as it measures the ability to

detect that the null hypothesis is false.

[Figure: power function 1 − β(θ) over θ ∈ [0, 1] for the test that rejects if X ∉ [40, 60]]
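For instance, the power against θ = 0.6 can be computed the same way as α above (a sketch using the cumulative binomial function):

. display 1 - (binomial(100, 60, 0.6) - binomial(100, 39, 0.6))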

Testing Hypotheses I, Feb 16, 2004 - 6 -


Significance Tests

Idea: minimize the probabilities of committing errors of type I and type II

Different probabilities of type I error:

[Figure: power curves 1 − β(θ) for the tests that reject if X ∉ [40, 60], X ∉ [38, 62], and X ∉ [42, 58]]

Note: If we decrease the probability of a type I error,

◦ the power of the test, 1 − β(θ), decreases as well and

◦ the probability of a type II error increases.

Problem: cannot minimize both errors simultaneously

Solution:

◦ choose fixed level α for probability of a type I error

◦ under this restriction find test with small probability of a type II error

Remark:

◦ you do not have to do this minimization yourself.

◦ all tests taught in this course are of this kind.

Definition

A test of this kind is called a significance test with significance level α.

Testing Hypotheses I, Feb 16, 2004 - 7 -


Statistical Hypotheses

A statistical hypothesis is an assertion or conjecture about a population,

which may be expressed in terms of

◦ some parameter: mean is zero;

◦ some parameters: mean and median are identical; or

◦ some sampling distribution: this sample is normally distributed.

Test problem - decide between two hypotheses

◦ the null hypothesis H0 and

◦ the alternative hypothesis Ha.

Popperian approach to scientific theories

◦ Scientific theories are subject to falsification.

◦ It is impossible to verify a scientific theory.

Null hypothesis H0

default (current) theory which we try to falsify

Alternative hypothesis Ha

alternative to adopt if null hypothesis is rejected

Examples:

◦ Clinical study of new drug - H0 : drug has no effect

◦ Criminal case - H0 : suspect is not guilty

◦ Safety test of nuclear power station - H0 : power station is not safe

◦ Chances of new investment - H0 : project not profitable

◦ Testing for independence - H0 : random variables are independent

Testing Hypotheses II, Feb 18, 2004 - 1 -


Statistical Tests

Example: Testing for pesticide in discharge water

Suppose the Environmental Protection Agency takes 10 readings on the

amount of pesticide in the discharge water of a chemical company.

Question: Does the concentration cP of pesticide in the water exceed the

allowed maximum concentration c0?

◦ Before taking action against the company, the agency must have some

evidence that the concentration cP exceeds the allowed level.

◦ Without evidence the agency assumes that the pesticide concentration

cP is within the limits of the law.

Consequently, the null hypothesis of the agency is that the pesticide con-

centration cP does not exceed c0. Thus the question corresponds to the

test problem

H0 : cP ≤ c0 vs Ha : cP > c0.

Suppose that the company regularly also runs tests on the amount of pes-

ticide in the discharge water.

Question: Does the concentration cP of pesticide in the water exceed the

allowed maximum concentration c0?

◦ The aim of the company is to avoid fines for exceeding the allowed

level. Thus the company wants to make sure that the concentration

stays within the allowed limits.

Thus, the null hypothesis of the company should be that the pesticide

concentration cP exceeds c0. The question now corresponds to the test

problem

H0 : cP ≥ c0 vs Ha : cP < c0.

Testing Hypotheses II, Feb 18, 2004 - 2 -


Six Steps of Conducting a Test

Steps of a significance test

1. Determine null hypothesis H0 and alternative Ha.

2. Decide on probability of type I error, the significance level α.

3. Find an appropriate test statistic T .

4. Based on the sampling distribution of T , formulate a criterion for

testing H0 against Ha.

5. Calculate value of the test statistic T .

6. Decide whether or not to reject the null hypothesis H0.

Example: Fair coin (contd)

We want to decide from 100 tosses of a coin whether it is fair or not. Let

θ be the probability of heads.

1. Test problem:

   H0 : θ = 1/2 vs Ha : θ ≠ 1/2

2. Significance level:

α = 0.05 (most commonly used significance level)

3. Test statistic:

T = X (number of heads in 100 tosses of the coin)

4. Rejection criterion:

   reject H0 if T ∉ [40, 60]

5. Observed value of test statistic: Suppose after 100 tosses we obtain

t = 55

6. Decision: Since 55 does not lie in the rejection region, we

do not reject H0.

Testing Hypotheses II, Feb 18, 2004 - 3 -


One and Two-sided Hypotheses

Example: Blood cholesterol after a heart attack

Suppose we are interested in whether the blood cholesterol level two days

after a heart attack differs from the average cholesterol level in the (general)

population (µ0 = 193).

Two cases:

◦ We are interested in any difference from the population mean µ0. Then we have a two-sided test problem

  H0 : µ_Y1 = µ0 vs Ha : µ_Y1 ≠ µ0.

◦ We suspect that the cholesterol level after a heart attack might be higher than in the general population. In this case, we have a one-sided test problem

  H0 : µ_Y1 = µ0 vs Ha : µ_Y1 > µ0.

Remark:

◦ More generally, we might be interested in one-sided test problems of the form

  H0 : µ_Y1 ≤ µ0 vs Ha : µ_Y1 > µ0,

which accounts for the possibility that µ might be smaller than µ0.

◦ For all common test situations (in particular those discussed in this

course), the form of the test does not depend on the form of H0, but

only on the parameter value in H0 that is closest to Ha, that is µ0.

Testing Hypotheses II, Feb 18, 2004 - 4 -


Test Statistic

Let θ be the parameter of interest.

Two-sided test problem

  H0 : θ = θ0 against Ha : θ ≠ θ0

One-sided test problem

  H0 : θ = θ0 against Ha : θ > θ0 (or Ha : θ < θ0)

Suppose that θ̂ is an estimate for θ.

◦ If θ = θ0 (null hypothesis), we expect the estimate θ̂ to take a value near θ0.

◦ Large deviations from θ0 are evidence against H0.

This suggests the following decision rules:

◦ Ha : θ > θ0: reject H0 if θ̂ − θ0 is much larger than zero

◦ Ha : θ < θ0: reject H0 if θ̂ − θ0 is much smaller than zero

◦ Ha : θ ≠ θ0: reject H0 if |θ̂ − θ0| is much larger than zero

Problem: Often the sampling distribution of the estimate θ̂ depends on the unknown parameter θ.

Definition (Test statistic)

A test statistic is a random variable

◦ that measures the compatibility between the null hypothesis and the

data and

◦ has a sampling distribution which we know (under H0).

Testing Hypotheses II, Feb 18, 2004 - 5 -


Test Statistic

Example: Blood cholesterol after a heart attack

Data: X1, . . . , X28

◦ blood cholesterol level of 28 patients two days after a heart attack

◦ assumed to be normally distributed with mean µ_X and variance σ²_X

The parameter µ_X can be estimated by the sample mean

  X̄ = (1/28)·Σ_{i=1}^{28} Xi ∼ N(µ_X , σ²_X/28).

This suggests using the standardized sample mean as a test statistic:

  (X̄ − µ0)/(σ/√28) ∼ N(0, 1)   (under H0).

Test H0 : µ ≤ 193 vs Ha : µ > 193 at significance level α = 0.05

◦ Test statistic: Assume σ = 47.7 to be known.

  T = (X̄ − µ0)/(σ/√28)

◦ Rejection criterion: Reject H0 if T > z_{0.05} = 1.645

◦ Outcome of test: Since the observed value of T is

  t = (253.9 − 193)/(47.7/√28) = 6.76,

we reject the null hypothesis that µ = 193.
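A quick numerical check of the observed statistic and the critical value (invnormal() is STATA's normal quantile function):

. display (253.9 - 193)/(47.7/sqrt(28))
. display invnormal(0.95)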

Testing Hypotheses II, Feb 18, 2004 - 6 -


Tests for the Mean

Tests for the mean µ (σ² known):

◦ Test statistic:

  T = (X̄ − µ0)/(σ/√n)

◦ Two-sided test:

  H0 : µ = µ0 against Ha : µ ≠ µ0

  reject H0 if |T| > z_{α/2}

◦ One-sided tests:

  H0 : µ = µ0 against Ha : µ > µ0 (µ < µ0)

  reject H0 if T > zα (T < −zα)

Tests for the mean µ (σ² unknown):

◦ Test statistic:

  T = (X̄ − µ0)/(s/√n)

◦ Two-sided test:

  H0 : µ = µ0 against Ha : µ ≠ µ0

  reject H0 if |T| > t_{n−1,α/2}

◦ One-sided tests:

  H0 : µ = µ0 against Ha : µ > µ0 (µ < µ0)

  reject H0 if T > t_{n−1,α} (T < −t_{n−1,α})

Example: Blood cholesterol after a heart attack

Estimating the standard deviation from the data, we obtain the test statistic

  T = (X̄ − µ0)/(s/√28) ∼ t_{27}.

Noting that t_{27,0.05} = 1.703 and t = 6.76, we still reject H0.
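From the summary statistics, STATA's immediate one-sample t test reproduces this (a sketch which treats 47.7 as the estimated standard deviation s):

. ttesti 28 253.9 47.7 193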

Testing Hypotheses II, Feb 18, 2004 - 7 -


Tests and Confidence Intervals

Consider a level α significance test for the two-sided test problem

  H0 : θ = θ0 vs Ha : θ ≠ θ0.

Let

◦ T = T_{θ0}(X) be the test statistic of the test (depends on θ0)

◦ R be the critical region of the test

Then

  C(X) = {θ : T_θ(X) ∉ R}

is a (1 − α) confidence interval for θ: If θ is the true parameter, then

  P_θ(θ ∈ C(X)) = P_θ(T_θ(X) ∉ R) = 1 − P_θ(T_θ(X) ∈ R) = 1 − α.

We have

  θ0 ∈ C(X) ⇔ T_{θ0}(X) ∉ R ⇔ H0 is not rejected

Result A level α two-sided significance test rejects the null hypothesis

H0 : θ = θ0 if and only if the parameter θ0 falls outside a (1 − α)

confidence interval for θ.

Example: Normal distribution

Let X1, . . . , Xn iid∼ N(µ, σ²). We reject H0 : µ = µ0 if

  |X̄ − µ0|/(s/√n) > t_{n−1,α/2}

or equivalently

  |X̄ − µ0| > t_{n−1,α/2}·s/√n.

Rearranging terms, we find that we reject if

  µ0 ∉ [X̄ − t_{n−1,α/2}·s/√n , X̄ + t_{n−1,α/2}·s/√n].

Testing Hypotheses II, Feb 18, 2004 - 8 -


The P-value

Definition (P-value)

The probability that under the null hypothesis H0 the test statistic would take a value as extreme as or more extreme than that actually observed is called the P-value of the test.

The P-value is often interpreted as a measure of the strength of evidence against the null hypothesis: the smaller the P-value, the stronger the evidence.

However:

◦ The P-value is a random variable (under H0 uniformly distributed on [0, 1]).

◦ Without a measure of its variability it is not safe to interpret the actually observed P-value.

◦ If the P-value is smaller than the chosen significance level α, we reject the null hypothesis H0.

Three approaches to deciding on a test problem:

◦ reject if θ0 ∉ C(X)

◦ reject if T(X) ∈ R

◦ reject if the P-value p ≤ α

Example: Blood cholesterol after a heart attack

The observed value for the test statistic

T =X − µ0

s/√

28∼ t27.

is t = 6.76. The corresponding P -value isP(T > 6.76) = 1.47 · 10−07.

We thus reject the null hypothesis.

Equivalently, the confidence interval for µ is [235.43, 272.42]. Since it does

not contain µ0 = 193 we reject H0 (for the third and last time!).
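Both numbers are easy to verify; ttail() returns the upper-tail probability of the t distribution (small differences are due to rounding of the summary statistics):

. display ttail(27, 6.76)
. display 253.9 - invttail(27, 0.025)*47.7/sqrt(28)
. display 253.9 + invttail(27, 0.025)*47.7/sqrt(28)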

Testing Hypotheses II, Feb 18, 2004 - 9 -


Example

Data: Banks’ net income

◦ percent change in net income between first half of last year and first

half of this year

◦ sample mean x̄ = 8.1%

◦ sample standard deviation s = 26.4%

Test problem: H0 : µ = 0 against Ha : µ ≠ 0

. ttesti 110 8.1 26.4 0

One-sample t test
------------------------------------------------------------------------------
Variable |    Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       x |    110         8.1    2.517141        26.4    3.111108    13.08889
------------------------------------------------------------------------------
Degrees of freedom: 109

Ho: mean(x) = 0

 Ha: mean < 0            Ha: mean != 0           Ha: mean > 0
   t = 3.2179              t = 3.2179              t = 3.2179
 P < t = 0.9991         P > |t| = 0.0017         P > t = 0.0009

Critical value of the t distribution with 109 degrees of freedom:

  t_{109,0.025} = 1.982

Result:

◦ |t| > t_{109,0.025}, therefore the test rejects H0 at significance level α = 0.05.

◦ Equivalently, µ0 = 0 ∉ [3.11, 13.09] and thus the test rejects H0.

◦ Equivalently, the P-value is less than α = 0.05 and thus the test rejects H0.

Testing Hypotheses II, Feb 18, 2004 - 10 -


Exact Binomial Test

Example: Fair coin

Data: 100 tosses of a coin which we suspect might be unfair.

Modelling:

◦ θ is the probability that the coin lands heads up

◦ X is the number of heads in 100 tosses of the coin

◦ X is binomially distributed with parameters n and θ.

Decision problem:

◦ Null hypothesis H0: coin is fair

◦ Alternative hypothesis Ha: coin is unfair

Test problem:

  H0 : θ = 1/2 vs Ha : θ ≠ 1/2.

Under the null hypothesis H0, the distribution of X is known,

  X ∼ Bin(100, 1/2).

Reject the null hypothesis if

  X ∉ [b_{100,0.5,0.975} , b_{100,0.5,0.025}] = [40, 60],

where b_{n,θ,α} is the upper α critical value of Bin(n, θ).

Note:

◦ Exact binomial tests typically have smaller significance level α due to

discreteness of distribution.

◦ In the above example, the probability of a type I error isP(reject H0) = α = 0.035.

Testing Hypotheses III, Feb 20, 2004 - 1 -


Sign Test

Example: Safety program

◦ Study success of new elaborate safety program

◦ Record average weekly losses in hours of labor due to accidents before

and after installation of the program in 10 industrial plants

Plant 1 2 3 4 5 6 7 8 9 10

Before 45 73 46 124 33 57 83 34 26 17

After 36 60 44 119 35 51 77 29 24 11

Question:

◦ Has the safety program an effect on the loss of labour due to accidents?

The Sign Test for matched pairs

◦ Ignore pairs with difference 0

◦ Number of trials n is the count of the remaining pairs

◦ The test statistic is the count X of pairs with positive difference

◦ X is binomially distributed with parameters n and θ

◦ Null hypothesis H0: θ = 1/2 (i.e. the median of the differences is zero)

Example:

For the safety program data, we find

◦ n = 10, X = 9

◦ Test H0 : θ = 1/2 against Ha : θ > 1/2

◦ The P-value of the observed count X is

  P(X ≥ 9) = C(10, 9)·(1/2)^10 + C(10, 10)·(1/2)^10 = 11/1024 = 0.0107

Since the P -value is smaller than α = 0.05 we reject the null hypothesis H0

that the safety program has no effect on the loss of labour due to accidents.
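STATA's immediate binomial test gives the same one-sided P-value (a sketch; the syntax is bitesti #N #succ #p):

. bitesti 10 9 0.5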

Testing Hypotheses III, Feb 20, 2004 - 2 -


Tests for Proportions

Example: Blood cholesterol after a heart attack

Suppose we are interested in the proportion p of patients who show a

decrease of cholesterol level between the second and the 14th day after a

heart attack.

The proportion p can be estimated by the sample proportion

  p̂ = X/n

where X is the number of patients whose cholesterol level decreased.

Question: Does a decrease occur more often than an increase?

Test problem: H0 : p = 1/2 vs Ha : p > 1/2

Exact tests:

Since X is binomially distributed, we can use exact binomial tests.

Large sample approximations:

Facts:

◦ E(p̂) = p

◦ var(p̂) = p(1 − p)/n

◦ (p̂ − p)/√(p(1 − p)/n) ≈ N(0, 1)   (for large n)

Under the null hypothesis H0, we get

  T = (p̂ − p0)/√(p0(1 − p0)/n) ≈ N(0, 1).

Hence, we reject H0 if T > zα.

Example: Blood cholesterol after a heart attack

◦ n = 28, x = 22, p̂ = 0.79, α = 0.05, z_{0.05} = 1.645

◦ t = (0.79 − 0.5)/√(0.5·0.5/28) = 3.07

◦ P-value: P(T > t) = 0.0011.
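The immediate command prtesti should reproduce this z statistic from the summary numbers (a sketch, assuming the one-sample syntax prtesti #obs #phat #p0):

. prtesti 28 0.79 0.5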

Testing Hypotheses III, Feb 20, 2004 - 3 -


Confidence Intervals for Proportions

Exact binomial confidence intervals

◦ difficult to compute

◦ use statistics software

Example: Blood cholesterol after a heart attack

◦ 28 patients in the study

◦ 22 showed a decrease in cholesterol level between second and 14th day

after the attack

Computation of an exact binomial confidence interval in STATA:

. cii 28 22

                                       -- Binomial Exact --
Variable |   Obs       Mean    Std. Err.   [95% Conf. Interval]
---------+------------------------------------------------------
         |    28   .7857143     .0775443     .590469    .9170394

Testing Hypotheses III, Feb 20, 2004 - 4 -


Confidence Intervals for Proportions

Large sample approximations

The CLT states that for large n, p̂ is approximately normally distributed:

  p̂ ≈ N(p , p(1 − p)/n)

Problems:

◦ the variance is unknown

◦ the estimate p̂(1 − p̂)/n is zero if p̂ = 0 or p̂ = 1

Example: What is the proportion of HIV+ students at the UofC?

◦ Random sample of 100 students

◦ None test positive for HIV

Are you absolutely sure that there are no HIV+ students at the UofC?

Idea: Estimate p by

  p̃ = (X + 2)/(n + 4)   (Wilson estimate)

and use

  [p̃ − z_{α/2}·√(p̃(1 − p̃)/(n + 4)) , p̃ + z_{α/2}·√(p̃(1 − p̃)/(n + 4))]

as a (1 − α) confidence interval for p.

Example: Blood cholesterol after a heart attack

. cii 28 22, wilson

                                        ------ Wilson ------
Variable |   Obs       Mean    Std. Err.   [95% Conf. Interval]
---------+------------------------------------------------------
         |    28   .7857143     .0775443    .6046141    .8978754

Testing Hypotheses III, Feb 20, 2004 - 5 -


Paired Samples

Example: Safety program

◦ Study success of new elaborate safety program

◦ Record average weekly losses in hours of labor due to accidents before

and after installation of the program in 10 industrial plants

Plant 1 2 3 4 5 6 7 8 9 10

Before 45 73 46 124 33 57 83 34 26 17

After 36 60 44 119 35 51 77 29 24 11

Question: Does the safety program have a positive effect?

Approach:

◦ Consider the differences before and after implementation of the program:

  Di = X_i^(before) − X_i^(after)   (the decrease in losses)

◦ The Di's are approximately normal,

  Di iid∼ N(µ, σ²)

◦ H0 : µ = 0 against Ha : µ > 0

◦ Significance level α = 0.01

◦ One-sample t test:

  T = D̄/(s/√n)

  Reject if T > t_{n−1,α}

[Figure: normal quantile plot of the decrease in losses of work]

Result:

◦ d̄ = 10.27, s = 7.98, n = 10

◦ t = 4.07 and t_{9,0.01} = 2.82, P-value: 0.0014

◦ The test rejects H0 at significance level α = 0.01
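From the summary statistics (the relevant one-sided P-value is the one reported for Ha: mean > 0):

. ttesti 10 10.27 7.98 0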

Testing Hypotheses III, Feb 20, 2004 - 6 -


Paired Sample t Test

Data: (X1, Y1), . . . , (Xn, Yn)

Assumptions:

◦ Pairs are independent

◦ Di = Xi − Yi iid∼ N(µ, σ²)

◦ Apply the one-sample t test

Paired sample t test

◦ Test statistic

  T = (D̄ − µ0)/(s/√n)

◦ Two-sided test:

  H0 : µ = µ0 against Ha : µ ≠ µ0

  reject H0 if |T| > t_{n−1,α/2}

◦ One-sided test:

  H0 : µ = µ0 against Ha : µ > µ0

  reject H0 if T > t_{n−1,α}

Power of the paired sample t test and the paired sign test:

[Figure: power curves 1 − β(δ) of the t test and the sign test as functions of the shift δ]

Testing Hypotheses III, Feb 20, 2004 - 7 -


Sign and t Test

t test:

◦ based on the Central Limit Theorem

◦ reasonably robust against departures from normality

◦ do not use if n is small and

  ⋄ data are strongly skewed or

  ⋄ data have clear outliers

Sign test:

◦ uses much less information than the t test

◦ for normal data less powerful than the t test

◦ makes no assumption on the distribution and keeps its significance level regardless of the distribution

◦ preferable for very small data sets

Remark:

◦ The two-step procedure

1. assess normality by normal quantile plot

2. conduct either t test or sign test depending on result in step 1

does not attain the chosen significance level α (two tests!).

◦ The sign test is rarely used since there are more powerful distribution-

free tests.

Testing Hypotheses III, Feb 20, 2004 - 8 -


Two Sample Problems

Two sample problems

◦ The goal of inference is to compare the responses in two groups.

◦ Each group is a sample from a different population.

◦ The responses in each group are independent of those in the other

group.

Example: Effects of ozone

Study the effects of ozone by controlled randomized experiment

◦ 45 70-day-old rats were randomly assigned to treatment or control

◦ Treatment group: 22 rats were kept in an environment containing ozone

◦ Control group: 23 rats were kept in an ozone-free environment

◦ Data: weight gains after 7 days

We are interested in the difference in weight gain between the treatment and control group.

Question: Do the weight gains differ between groups?

◦ x1, . . . , x22 - weight gains for treatment group

◦ y1, . . . , y23 - weight gains for control group

◦ Test problem:

  H0 : µX = µY vs Ha : µX ≠ µY

◦ Idea: Reject the null hypothesis if x̄ − ȳ is large.

[Figure: boxplots of weight gain (in grams) for the treatment and control groups]

Two Sample Tests, Feb 23, 2004 - 1 -


Comparing Means

Let X1, . . . , Xm and Y1, . . . , Yn be two independent normally distributed samples. Then

  X̄ − Ȳ ∼ N(µX − µY , σ²_X/m + σ²_Y/n)

Two-sample t test

◦ Two-sample t statistic

  T = (X̄ − Ȳ)/√(s²_X/m + s²_Y/n)

  The distribution of T can be approximated by a t distribution.

◦ Two-sided test:

  H0 : µX = µY against Ha : µX ≠ µY

  reject H0 if |T| > t_{df,α/2}

◦ One-sided test:

  H0 : µX = µY against Ha : µX > µY

  reject H0 if T > t_{df,α}

◦ Degrees of freedom:

  ⋄ Approximations for df provided by statistical software

  ⋄ Satterthwaite approximation

    df = (s²_X/m + s²_Y/n)² / [ (s²_X/m)²/(m−1) + (s²_Y/n)²/(n−1) ]

    commonly used, conservative approximation

  ⋄ Otherwise: use df = min(m − 1, n − 1)

Two Sample Tests, Feb 23, 2004 - 2 -


Comparing Means

Example: Effects of ozone

Data:

◦ Treatment group: x̄ = 11.01, sX = 19.02, m = 22

◦ Control group: ȳ = 22.43, sY = 10.78, n = 23

Test problem:

◦ H0 : µX = µY vs Ha : µX ≠ µY

◦ α = 0.05, df = min(m − 1, n − 1) = 21, t_{21,0.025} = 2.08

The value of the test statistic is

  t = (x̄ − ȳ)/√(s²_X/m + s²_Y/n) = −2.46

The corresponding P-value is

  P(|T| ≥ |t|) = P(|T| ≥ 2.46) = 0.023

Thus we reject the hypothesis that ozone has no effect on weight gain.
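The same test can be run from the summary statistics alone (a sketch using the immediate form; the sign of t depends on which group is entered first, and the full-data version follows below):

. ttesti 22 11.01 19.02 23 22.43 10.78, unequal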

Two-sample t test with STATA:

. ttest weight, by(group) unequal

Two-sample t test with unequal variances
------------------------------------------------------------------------------
   Group |  Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |   23    22.42609    2.247108    10.77675    17.76587     27.0863
       1 |   22    11.00909    4.054461    19.01711    2.577378     19.4408
---------+--------------------------------------------------------------------
combined |   45    16.84444    2.422057    16.24765    11.96311    21.72578
---------+--------------------------------------------------------------------
    diff |         11.417      4.635531                1.985043    20.84895
------------------------------------------------------------------------------
Satterthwaite's degrees of freedom: 32.9179

Ho: mean(0) - mean(1) = diff = 0

 Ha: diff < 0            Ha: diff != 0           Ha: diff > 0
   t = 2.4629              t = 2.4629              t = 2.4629
 P < t = 0.9904         P > |t| = 0.0192         P > t = 0.0096

Two Sample Tests, Feb 23, 2004 - 3 -


Comparing Means

Suppose that σ²_X = σ²_Y = σ². Then

  σ²/m + σ²/n = σ²·(1/m + 1/n).

Estimate σ² by the pooled sample variance

  s²_p = [(m − 1)·s²_X + (n − 1)·s²_Y] / (m + n − 2).

Pooled two-sample t test

◦ Two-sample t statistic

  T = (X̄ − Ȳ)/(s_p·√(1/m + 1/n))

  T is t distributed with m + n − 2 degrees of freedom.

◦ Two-sided test:

  H0 : µX = µY against Ha : µX ≠ µY

  reject H0 if |T| > t_{m+n−2,α/2}

◦ One-sided test:

  H0 : µX = µY against Ha : µX > µY

  reject H0 if T > t_{m+n−2,α}

Remarks:

◦ If m ≈ n, the test is reasonably robust against

  ⋄ nonnormality and

  ⋄ unequal variances.

◦ If the sample sizes differ a lot, the test is very sensitive to unequal variances.

◦ Tests for differences in variances are sensitive to nonnormality.

Two Sample Tests, Feb 23, 2004 - 4 -


Comparing Means

Example: Parkinson’s disease

Study on Parkinson’s disease

◦ Parkinson's disease, among other things, affects a person's ability to speak

◦ Overall condition can be improved by an operation

◦ How does the operation affect the ability to speak?

◦ Treatment group: Eight patients received the operation

◦ Control group: Fourteen patients

◦ Data:

  ⋄ scores on several tests

  ⋄ high scores indicate problems with speaking

[Figure: boxplots of speaking ability for the treatment and control groups]

Pooled two-sample t test with STATA:

. infile ability group using parkinson.txt
. ttest ability, by(group)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |  Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |   14    1.821429     .148686    .5563322    1.500212    2.142645
       1 |    8        2.45      .14516    .4105745    2.106751    2.793249
---------+--------------------------------------------------------------------
combined |   22        2.05    .1249675    .5861497    1.790116    2.309884
---------+--------------------------------------------------------------------
    diff |        -.6285714    .2260675               -1.10014    -.1570029
------------------------------------------------------------------------------
Degrees of freedom: 20

Ho: mean(0) - mean(1) = diff = 0

 Ha: diff < 0            Ha: diff != 0           Ha: diff > 0
   t = -2.7805             t = -2.7805             t = -2.7805
 P < t = 0.0058         P > |t| = 0.0115         P > t = 0.9942

Two Sample Tests, Feb 23, 2004 - 5 -


Comparing Variances

Example: Parkinson’s disease

In order to apply the pooled two-sample t test, the variances of the two

groups have to be equal. Are the data compatible with this assumption?

F test for equality of variances

The F test statistic

  F = s²_X/s²_Y

is F distributed with m − 1 and n − 1 degrees of freedom.

. sdtest ability, by(group)

Variance ratio test
------------------------------------------------------------------------------
   Group |  Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |   14    1.821429     .148686    .5563322    1.500212    2.142645
       1 |    8        2.45      .14516    .4105745    2.106751    2.793249
---------+--------------------------------------------------------------------
combined |   22        2.05    .1249675    .5861497    1.790116    2.309884
------------------------------------------------------------------------------

Ho: sd(0) = sd(1)

  F(13,7) observed   = F_obs         = 1.836
  F(13,7) lower tail = F_L = 1/F_obs = 0.545
  F(13,7) upper tail = F_U = F_obs   = 1.836

 Ha: sd(0) < sd(1)       Ha: sd(0) != sd(1)            Ha: sd(0) > sd(1)
 P < F_obs = 0.7865      P < F_L + P > F_U = 0.3767    P > F_obs = 0.2135

Result: We cannot reject the null hypothesis that the variances are equal.

Problem: Are the data normally distributed?

[Figure: normal quantile plots of speaking ability for the treatment group and the control group]

Two Sample Tests, Feb 23, 2004 - 6 -


Comparing Proportions

Suppose we have two populations with unknown proportions p1 and p2.

◦ Random samples of size n1 and n2 are drawn from the two populations

◦ p̂1 is the sample proportion for the first population

◦ p̂2 is the sample proportion for the second population

Question: Are the two proportions p1 and p2 different?

Test problem:

  H0 : p1 = p2 vs Ha : p1 ≠ p2

Idea: Reject H0 if p̂1 − p̂2 is large.

Note that

  p̂1 − p̂2 ≈ N(p1 − p2 , p1(1 − p1)/n1 + p2(1 − p2)/n2)

This suggests the test statistic

  T = (p̂1 − p̂2)/√(p̂(1 − p̂)·(1/n1 + 1/n2))

where p̂ is the combined proportion of successes in both samples,

  p̂ = (X1 + X2)/(n1 + n2) = (n1·p̂1 + n2·p̂2)/(n1 + n2),

with X1 and X2 denoting the number of successes in each sample.

Under H0, the test statistic is approximately standard normally distributed.

Two Sample Tests, Feb 23, 2004 - 7 -


Comparing Proportions

Example: Question wording

The ability of question wording to affect the outcome of a survey can be a

serious issue. Consider the following two questions:

1. Would you favor or oppose a law that would require a person to obtain

a police permit before purchasing a gun?

2. Would you favor or oppose a law that would require a person to obtain

a police permit before purchasing a gun, or do you think such a law

would interfere too much with the right of citizens to own guns?

In two surveys, the following results were obtained:

Question Yes No Total

1 463 152 615

2 403 182 585

Question: Is the true proportion of people favoring the permit law the

same in both groups or not?

. prtesti 615 0.753 585 0.689

Two-sample test of proportion            x: Number of obs =   615
                                         y: Number of obs =   585
------------------------------------------------------------------------------
Variable |      Mean   Std. Err.      z     P>|z|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
       x |      .753    .0173904                      .7189155    .7870845
       y |      .689    .0191387                      .6514889    .7265111
---------+--------------------------------------------------------------------
    diff |      .064    .0258595                      .0133163    .1146837
         | under Ho:    .0258799    2.47   0.013
------------------------------------------------------------------------------

Ho: proportion(x) - proportion(y) = diff = 0

 Ha: diff < 0            Ha: diff != 0           Ha: diff > 0
   z = 2.473               z = 2.473               z = 2.473
 P < z = 0.9933         P > |z| = 0.0134         P > z = 0.0067

Two Sample Tests, Feb 23, 2004 - 8 -


Final Remarks

Statistical theory focuses on the significance level, the probability of a type

I error.

In practice, a discussion of the power of the test is also important:

Example: Efficient Market Hypothesis

“Efficient market hypothesis” for stock prices:

◦ future stock prices show only random variation

◦ market incorporates all information available now in present prices

◦ no information available now will help to predict future stock prices

Testing of the efficient market hypothesis:

◦ Many studies tested

H0: Market is efficient

Ha: Prediction is possible

◦ Almost all studies failed to find good evidence against H0.

◦ Consequently the efficient market hypothesis became quite popular.

Problem:

◦ Power was generally low in the significance tests employed in the stud-

ies.

◦ Failure to reject H0 is no evidence that H0 is true.

◦ More careful studies showed that the size of a company and measures

of value such as ratio of stock price to earnings do help predict future

stock prices.

Two Sample Tests, Feb 23, 2004 - 9 -


Final Remarks

Example

◦ IQ of 1000 women and 1000 men

◦ µ̂w = 100.68, σ̂w = 14.91

◦ µ̂m = 98.90, σ̂m = 14.68

◦ Pooled two-sample t test: T = −2.7009

◦ Reject H0 : µw = µm since |T| > t_{1998,0.005} = 2.58.

◦ The difference in the IQ is statistically significant at the 0.01 level.

◦ However we might conclude that the difference is scientifically irrele-

vant.

Note: Significance at a low level does not mean that there is a large difference, only that there is strong evidence that there is some difference.

Two Sample Tests, Feb 23, 2004 - 10 -


Final Remarks

Example: Is radiation from cell phones harmful?

◦ Observational study

◦ Comparison of brain cancer patients and similar group without brain

cancer

◦ No statistically significant association between cell phone use and a

group of brain cancers known as gliomas.

◦ A separate analysis for 20 types of gliomas found an association between phone use and one rare form.

◦ The risk seemed to decrease with greater mobile phone use.

Think for a moment:

◦ Suppose all 20 null hypotheses are true.

◦ Each test has a 5% chance of being significant; the outcome is Bernoulli distributed with parameter 0.05.

◦ The number of false positive tests is binomially distributed:

  N ∼ Bin(20, 0.05)

◦ The probability of getting one or more positive results is

  P(N ≥ 1) = 1 − P(N = 0) = 1 − 0.95^20 = 0.64.

We therefore might have expected at least one significant association.
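A one-line check:

. display 1 - 0.95^20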

Beware of searching for significance

Two Sample Tests, Feb 23, 2004 - 11 -


Final Remarks

Problem: If several tests are performed, the probability of a type I error

increases.

Idea: Adjust significance level of each single test.

Bonferroni procedure:

◦ Perform k tests

◦ Use significance level α/k for each of the k tests

◦ If all null hypotheses are true, the probability is α that any of the tests rejects its null hypothesis.

Example

Suppose we perform k = 6 tests at level α = 0.05, so that α/k = 0.0083, and obtain the following P-values:

  0.476   0.032   0.241   0.008*   0.010   0.001*

Only the two tests marked (*) are significant at the overall 0.05 level.

Two Sample Tests, Feb 23, 2004 - 12 -


Two-Way Tables

Example: Depression and marital status

Question: Does severity of depression depend on marital status?

◦ Study of 159 depression patients

◦ Patients were categorized by

⋄ severity of depression (severe, normal, mild)

⋄ marital status (single, married, widowed/divorced)

The following two-way table summarizes the data:

Depression Marital Status Total

Single Married Wid/Div

Severe 16 22 19 57

Normal 29 33 14 76

Mild 9 14 3 26

Total 54 69 36 159

◦ Each combination of values defines a cell.

◦ The severity of depression is the row variable.

◦ The marital status is the column variable.

Inference for Two-Way Tables, Feb 25, 2004 - 1 -


Two-Way Tables

From this table of counts, the sample distribution can be obtained

by dividing each cell by the total sample size n = 159:

Depression Marital Status Total

Single Married Wid/Div

Severe 0.101 0.138 0.119 0.358

Normal 0.182 0.208 0.088 0.478

Mild 0.057 0.088 0.019 0.164

Total 0.340 0.434 0.226 1.000

◦ Joint distribution: proportion for each combination of values

◦ Marginal distribution: distribution of the row and column

variables separately.

◦ Conditional distribution: distribution of one variable at a

given level of the other variable

Inference for Two-Way Tables, Feb 25, 2004 - 2 -


Test for Independence

Example: Depression and marital status

Conditional distributions of severity of depression given marital

status:

[Figure: bar chart of the sample proportions of severe, normal, and mild depression within each marital status group (single, married, widowed/divorced)]

Question: Is there a relationship between the row variable (depression) and the column variable (marital status)?

◦ The distribution for widowed/divorced patients seems to differ

from the distributions for single or married patients.

◦ Are these differences significant or can they be attributed to

chance variation?

◦ How likely are differences as large as or larger than those observed if the two variables were indeed independent (and thus the conditional distributions were the same)?

A statistical test will be required to answer these questions.

Inference for Two-Way Tables, Feb 25, 2004 - 3 -


Test for Independence

Test problem:

H0: the row and the column variables are independent

Ha: the row and the column variables are dependent

How can we measure evidence against the null hypothesis?

◦ What counts would we expect to observe if the null hypothesis

were true?

  Expected cell count = (row total × column total) / total count

Recall: For two independent events A and B, P(A∩B) = P(A)P(B).

If the null hypothesis H0 is true, then the table of expected

counts should be “close” to the observed table of counts.

◦ We need a statistic that measures the difference between the

tables.

◦ And we need to know what is the distribution of the statistic

to make statistical inference.

Inference for Two-Way Tables, Feb 25, 2004 - 4 -


Test for Independence

Idea of the test:

◦ construct table of expected counts

◦ compare expected with observed counts

◦ if the null hypothesis is true, the difference between the tables

should be “small”

The χ² (Chi-Squared) Statistic

To measure how far the expected table is from the observed table, we use the following test statistic:

  X = Σ_{all cells} (Observed − Expected)² / Expected

◦ Under the null hypothesis, X is approximately χ² distributed with (r − 1)(c − 1) degrees of freedom.

Why (r − 1)(c − 1)?

Recall that our “expected” table is based on some quantities estimated from the data: namely the row and column totals. Once these totals are known, filling in any (r − 1)(c − 1) undetermined table entries actually gives us the whole table. Thus, there are only (r − 1)(c − 1) freely varying quantities in the table.

◦ We reject H0 if the observed and expected counts are very different and hence X is large. Consequently we reject H0 at significance level α if

  X ≥ χ²_{(r−1)(c−1),α}.

Inference for Two-Way Tables, Feb 25, 2004 - 5 -


The χ2 Distribution

What does the χ2 distribution look like?

[Figure: χ² densities for 1, 5, 10, 20, and 30 degrees of freedom]

◦ Unlike the Normal or t distributions, the χ2 distribution takes

values in (0,∞).

◦ As with the t distribution, the exact shape of the χ2 distribution

depends on its degrees of freedom.

Recall that X has only an approximate χ²_{(r−1)(c−1)} distribution. When is the approximation valid?

◦ For any two-way table larger than 2 × 2, we require that the average expected cell count is at least 5 and each expected count is at least 1.

◦ For 2 × 2 tables, we require that each expected count be at least 5.

Inference for Two-Way Tables, Feb 25, 2004 - 6 -


Test for Independence

Example: Depression and marital status

The following table show the observed counts and expected counts

(in brackets):

Depression Marital Status Total

Single Married Wid/Div

Severe 16 22 19 57

(19.36) (24.74) (12.90)

Normal 29 33 14 76

(25.81) (32.98) (17.21)

Mild 9 14 3 26

(8.83) (11.28) (5.89)

Total 54 69 36 159
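The expected counts follow directly from the marginal totals; for instance, for the (Severe, Single) cell:

. display 57*54/159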

◦ The table is 3 × 3, so there are (r − 1)(c − 1) = 2 × 2 = 4 degrees of freedom.

◦ The critical value (significance level α = 0.05) is χ²_{4,0.05} = 9.49.

◦ The observed value of the χ² test statistic is

  x = (16 − 19.36)²/19.36 + (22 − 24.74)²/24.74 + . . . + (3 − 5.89)²/5.89
    = 6.83 ≤ χ²_{4,0.05}

  Thus we do not reject the null hypothesis of independence.

◦ The corresponding P-value is

  P(X ≥ x) = P(X ≥ 6.83) = 0.145 ≥ α

  Again we do not reject H0.

Inference for Two-Way Tables, Feb 25, 2004 - 7 -


Test for Independence

The χ² test in STATA:

. insheet using depression.txt, clear
(3 vars, 159 obs)

. tabulate depression marital, chi2

           |              Marital
Depression |  Married    Single   Wid/Div |     Total
-----------+-------------------------------+----------
      Mild |       14         9         3 |        26
    Normal |       33        29        14 |        76
    Severe |       22        16        19 |        57
-----------+-------------------------------+----------
     Total |       69        54        36 |       159

    Pearson chi2(4) = 6.8281   Pr = 0.145

The same result can be obtained by the command

. tabi 16 22 19 \ 29 33 14 \ 9 14 3, chi2

           |              col
       row |        1         2         3 |     Total
-----------+-------------------------------+----------
         1 |       16        22        19 |        57
         2 |       29        33        14 |        76
         3 |        9        14         3 |        26
-----------+-------------------------------+----------
     Total |       54        69        36 |       159

    Pearson chi2(4) = 6.8281   Pr = 0.145

Inference for Two-Way Tables, Feb 25, 2004 - 8 -


Models for Two-Way Tables

The χ2-test for the presence of a relationship between two distributions

in a two-way table is valid for data produced by several different study

designs, although the exact null hypothesis varies.

◦ Examining independence between variables

⋄ Select random sample of size n from a population.

⋄ Classify each individual according to two categorical variables.

Question: Is there a relationship between the two variables?

Test problem:

H0: The two variables are independent

Ha: The two variables are not independent

Example: Suppose we collect an SRS of 114 college students, and categorize each by major and GPA (e.g. (0, 0.5], . . . , (3.5, 4]). Then, we can use the χ²-test to ascertain whether grades and major are independent.

◦ Comparing several populations

  ⋄ Select independent random samples from each of c populations, of sizes n1, . . . , nc.

  ⋄ Classify each individual according to a categorical response variable with r possible values (the same across populations).

  ⋄ This yields an r × c table.

Question: Does the distribution of the response variable differ between populations?

Test problem:

H0: The distribution is the same in all populations.

Ha: The distribution is not the same.

Example: Suppose we select independent SRSs of Psychology, Biology

and Math majors, of sizes 40, 39, 35, and classify each individual by

GPA range. Then, we can use a χ²-test to ascertain whether or not the

distribution of grades is the same in all three populations.

Inference for Two-Way Tables, Feb 25, 2004 - 9 -


Models for Two-Way Tables

Example: Literary Analysis (Rice, 1995)

When Jane Austen died, she left the novel Sanditon only partially completed, but she left a summary of the remainder. A highly literate admirer finished the novel, attempting to emulate Austen's style, and the hybrid was published. Someone counted the occurrences of various words in several chapters from various works.

                        Austen                          Imitator
Word        Sense and      Emma     Sanditon I      Sanditon II
            Sensibility
a                 147       186            101               83
an                 25        26             11               29
this               32        39             15               15
that               94       105             37               22
with               59        74             28               43
without            18        10             10                4
TOTAL             375       440            202              196

Questions:

◦ Is there consistency in Austen's work (do the frequencies with which Austen used these words change from work to work)?

  Answer: X = 12.27, df = ?, P-value = ?

◦ Was the imitator successful (are the frequencies of the words the same in Austen's work and the imitator's work)?
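For the first question the table has 6 rows and the 3 Austen columns, so df = (6 − 1)(3 − 1) = 10; the P-value is then the upper tail probability, approximately 0.27 (a sketch using STATA's chi2tail()):

. display chi2tail(10, 12.27)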

Inference for Two-Way Tables, Feb 25, 2004 - 10 -


Simpson’s Paradox

Example: Medical study

◦ contact randomly chosen people in a district in England

◦ data on 1314 women contacted

◦ either current smoker or who had never smoked

Question: Survival rate after 20 years?

Smoker Not

Dead 139 230

Alive 438 502

Result: A higher percent of smokers stayed alive!

Here are the same data classified by their age at time of the survey:

Age 18 to 44

Smoker Not

Dead 19 13

Alive 269 327

Age 45 to 64

Smoker Not

Dead 78 52

Alive 162 147

Age 65+

Smoker Not

Dead 42 165

Alive 7 28

Age at the time of the study is a confounding variable; in each age group a higher percent of nonsmokers survive.
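The reversal is easy to see from the survival rates, overall and within an age group:

. display 438/(139+438)
. display 502/(230+502)
. display 269/(19+269)
. display 327/(13+327)

The first two give .759 and .686 (smokers vs. nonsmokers overall), while the last two give .934 and .962 within ages 18 to 44.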

Simpson’s Paradox

An association or comparison that holds for each of several groups can reverse direction when the data are combined to form a single group.

Inference for Two-Way Tables, Feb 25, 2004 - 11 -


Simple Linear Regression

Example: Body density

Aim: Measure body density (weight per unit volume of the body)

(Body density indicates the fat content of the human body.)

Problem:

◦ Body density is difficult to measure directly.

◦ Research suggests that skinfold thickness can accurately predict body density.

◦ Skinfold thickness is measured by pinching a fold of skin between calipers.

[Figure: scatter plot of body density (10³ kg/m³) against skinfold thickness (mm)]

Questions:

◦ Are body density and skinfold thickness related?

◦ How accurately can we predict body density from skinfold thickness?

Regression: predict response variable for fixed value of explanatory variable

◦ describe linear relationship in data by regression line

◦ fitted regression line is affected by chance variation in observed data

Statistical inference: accounts for chance variation in data

Simple Linear Regression, Feb 27, 2004 - 1 -


Population Regression Line

Simple linear regression studies the relationship between

◦ a response variable Y and

◦ a single explanatory variable X.

We expect that different values of X will produce different mean responses

of Y .

For given X = x, we consider the subpopulation with X = x:

◦ this subpopulation has mean

  µ_{Y|X=x} = E(Y|X = x)   (conditional mean of Y given X = x)

◦ and variance

  σ²_{Y|X=x} = var(Y|X = x)   (conditional variance of Y given X = x)

Linear regression model with constant variance:

  E(Y|X = x) = µ_{Y|X=x} = a + b·x   (population regression line)

  var(Y|X = x) = σ²_{Y|X=x} = σ²

◦ The population regression line connects the conditional means of the

response variable for fixed values of the explanatory variable.

◦ This population regression line tells how the mean response of Y varies

with X.

◦ The variance (and standard deviation) does not depend on x.

Simple Linear Regression, Feb 27, 2004 - 2 -


Conditional Mean

[Figure: three density plots illustrating the construction: the joint density f(x, y) of a sample (x1, y1), . . . , (xn, yn); the slice f(x0, y) obtained by fixing x = x0; and the conditional density obtained by rescaling by fX(x0)]

Conditional probability density:

  f(y|x0) = f_{XY}(x0, y) / f_X(x0)

Conditional mean:

  E(Y|X = x0) = ∫ y·f_{Y|X}(y|x0) dy

Simple Linear Regression, Feb 27, 2004 - 3 -


The Linear Regression Model

Simple linear regression

  Yi = a + b·xi + εi,   i = 1, . . . , n

where

  Yi   response (also dependent variable)
  xi   predictor (also independent variable)
  εi   error

Assumptions:

◦ Predictor xi is deterministic (fixed values, not random).

◦ Errors have zero mean, E(εi) = 0.

◦ Variation about the mean does not depend on xi, i.e. var(εi) = σ².

◦ Errors εi are independent.

Often we additionally assume:

◦ The errors are normally distributed,

  εi iid∼ N(0, σ²).

For fixed x the response Y is then normally distributed,

  Y ∼ N(a + b·x, σ²).
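A tiny simulation sketch of this model (the intercept 2, slope 0.5, and error sd 0.3 are made-up values; invnormal(uniform()) is the classic STATA idiom for standard normal draws):

. clear
. set obs 100
. set seed 2004
. generate x = _n/10
. generate y = 2 + 0.5*x + 0.3*invnormal(uniform())
. regress y x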

Simple Linear Regression, Feb 27, 2004 - 4 -


Least Squares Estimation

Data: (Y1, x1), . . . , (Yn, xn)

Aim: Find the straight line which fits the data best:

  Ŷi = â + b̂·xi   fitted values for coefficients â and b̂

  â - intercept
  b̂ - slope

Least Squares Approach:

Minimize the squared distance between observed Yi and fitted Ŷi:

  L(a, b) = Σ_{i=1}^n (Yi − Ŷi)² = Σ_{i=1}^n (Yi − a − b·xi)²

Set the partial derivatives to zero (normal equations):

  ∂L/∂a = 0 ⇔ Σ_{i=1}^n (Yi − a − b·xi) = 0

  ∂L/∂b = 0 ⇔ Σ_{i=1}^n (Yi − a − b·xi)·xi = 0

Solution: Least squares estimators

  â = Ȳ − (S_XY/S_XX)·x̄

  b̂ = S_XY/S_XX

where

  S_XY = Σ_{i=1}^n (Yi − Ȳ)(xi − x̄)   (sum of cross products)

  S_XX = Σ_{i=1}^n (xi − x̄)²

Simple Linear Regression, Feb 27, 2004 - 5 -


Least Squares Estimation

Least squares predictor Ŷ:

  Ŷi = â + b̂·xi

Residuals ε̂i:

  ε̂i = Yi − Ŷi = Yi − â − b̂·xi

Residual sum of squares (SSResidual):

  SSResidual = Σ_{i=1}^n ε̂i² = Σ_{i=1}^n (Yi − Ŷi)²

Estimation of σ²:

  σ̂² = (1/(n − 2))·Σ_{i=1}^n (Yi − Ŷi)² = SSResidual/(n − 2)

Regression standard error:

  se = σ̂ = √(SSResidual/(n − 2))

Variation accounting:

  SSTotal    = Σ_{i=1}^n (Yi − Ȳ)²   total variation

  SSModel    = Σ_{i=1}^n (Ŷi − Ȳ)²   variation explained by linear model

  SSResidual = Σ_{i=1}^n (Yi − Ŷi)²   remaining variation

Simple Linear Regression, Feb 27, 2004 - 6 -


Least Squares Estimation

Example: Body density

Scatter plot with least squares regression line:

[Figure: scatter plot of body density (10³ kg/m³) against skinfold thickness (mm) with the fitted least squares line]

Calculation of least squares estimates:

  x̄ = 1.064   ȳ = 1.568   S_XX = 0.0235   S_XY = −0.2679   S_YY = 4.244   SSResidual = 1.187

  b̂ = S_XY/S_XX = −0.2679/0.0235 = −11.40

  â = ȳ − b̂·x̄ = 1.568 + 11.40·1.064 = 13.70

  σ̂² = SSResidual/(n − 2) = 1.187/90 = 0.0132

  se = √σ̂² = √0.0132 = 0.1149

Simple Linear Regression, Feb 27, 2004 - 7 -


Least Squares Estimation

Example: Body density

Using STATA:

. infile ID BODYD SKINT using bodydens.txt, clear
(92 observations read)

. regress BODYD SKINT

      Source |       SS       df       MS           Number of obs =      92
-------------+------------------------------       F(  1,    90) =  231.89
       Model |  3.05747739     1  3.05747739       Prob > F      =  0.0000
    Residual |  1.18663025    90  .013184781       R-squared     =  0.7204
-------------+------------------------------       Adj R-squared =  0.7173
       Total |  4.24410764    91  .046638546       Root MSE      =  .11482

------------------------------------------------------------------------------
       BODYD |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       SKINT |  -11.41345   .7494999   -15.23   0.000    -12.90246   -9.924433
       _cons |   13.71221   .7975822    17.19   0.000     12.12768    15.29675
------------------------------------------------------------------------------

. twoway (lfitci BODYD SKINT, range(1 1.1)) (scatter BODYD SKINT), xtitle(Skin thickness) ytitle(Body density) scheme(s1color) legend(off)

[Figure: fitted least squares line with pointwise confidence band and the scatter of body density against skin thickness]

Simple Linear Regression, Feb 27, 2004 - 8 -


Properties of Estimators

Statistical properties of â and b̂

Mean and variance of b̂:

  E(b̂) = b,   var(b̂) = σ²/S_XX

Distribution of b̂:

  b̂ ∼ N(b , σ²/S_XX)

Mean and variance of â:

  E(â) = a,   var(â) = (1/n + x̄²/S_XX)·σ²

Distribution of â:

  â ∼ N(a , (1/n + x̄²/S_XX)·σ²)

Recall that

  S_XX = Σ_{i=1}^n (xi − x̄)²

Inference for Regression, Mar 1, 2004 - 1 -


Confidence Intervals

Note that b̂ ∼ N(b, σ²/S_XX). Thus

  (b̂ − b)/(σ/√S_XX) ∼ N(0, 1)

Substituting se for σ, we obtain

  (b̂ − b)/(se/√S_XX) ∼ t_{n−2}

(1 − α) confidence interval for b:

  b̂ ± t_{n−2,α/2}·se/√S_XX

Similarly

  (â − a)/(σ·√(1/n + x̄²/S_XX)) ∼ N(0, 1)

Substituting se for σ, we obtain

  (â − a)/(se·√(1/n + x̄²/S_XX)) ∼ t_{n−2}

(1 − α) confidence interval for a:

  â ± t_{n−2,α/2}·se·√(1/n + x̄²/S_XX)

Inference for Regression, Mar 1, 2004 - 2 -


Tests on the Coefficients

Question: Is b equal to some value b0?

The corresponding testing problem is

$$H_0: b = b_0 \quad \text{versus} \quad H_a: b \neq b_0.$$

The test statistic is given by

$$T_b = \frac{b - b_0}{s_e/\sqrt{S_{XX}}} \sim t_{n-2} \quad \text{under } H_0.$$

The null hypothesis $H_0: b = b_0$ is rejected if

$$|T_b| > t_{n-2,\alpha/2}$$

Question: Is a equal to some value a0?

The corresponding testing problem is

$$H_0: a = a_0 \quad \text{versus} \quad H_a: a \neq a_0.$$

The test statistic is given by

$$T_a = \frac{a - a_0}{s_e\sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{S_{XX}}}} \sim t_{n-2} \quad \text{under } H_0.$$

The null hypothesis $H_0: a = a_0$ is rejected if

$$|T_a| > t_{n-2,\alpha/2}$$
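The regress output already reports exactly these statistics for the null values $b_0 = 0$ and $a_0 = 0$. A test against some other null value, say $b_0 = -10$ (an arbitrary value, chosen only for illustration), can be run with Stata's built-in test command after the regression:

. quietly regress BODYD SKINT
. test SKINT = -10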

Inference for Regression, Mar 1, 2004 - 3 -


Inference for the Coefficients

Example: Body density

The confidence interval for $b$ is given by

$$b \pm t_{n-2,\alpha/2} \cdot \frac{s_e}{\sqrt{S_{XX}}} = -11.41 \pm 1.99 \cdot \frac{\sqrt{0.0132}}{\sqrt{0.0235}} = [-12.90,\ -9.92]$$

The confidence interval for $a$ is given by

$$a \pm t_{n-2,\alpha/2} \cdot s_e \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{XX}}} = 13.71 \pm 1.99 \cdot \sqrt{0.0132} \cdot \sqrt{\frac{1}{92} + \frac{1.064^2}{0.0235}} = [12.12,\ 15.30]$$

Furthermore we find

$$|T_b| = \left|\frac{b}{s_e/\sqrt{S_{XX}}}\right| = 15.22 > t_{90,0.025} = 1.99$$

Thus we reject $H_0: b = 0$ at significance level 0.05: the coefficient $b$ is statistically significantly different from zero.

Similarly

$$|T_a| = \left|\frac{a}{s_e\sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{S_{XX}}}}\right| = 17.26 > t_{90,0.025} = 1.99$$

Thus we reject $H_0: a = 0$ at significance level 0.05: the coefficient $a$ is statistically significantly different from zero.

The corresponding $P$-values are

◦ $P(|T_b| \geq 15.22) \approx 0$

◦ $P(|T_a| \geq 17.26) \approx 0$

Inference for Regression, Mar 1, 2004 - 4 -


Estimating the Mean

In the linear regression model, the mean of $Y$ at $x = x_0$ is given by

$$E(Y) = a + b\,x_0$$

Our estimate for the mean of $Y$ at $x = x_0$ is

$$\hat{Y}_{x_0} = a + b\,x_0.$$

Question: How precise is this estimate?

Note that

$$\hat{Y}_{x_0} = a + b\,x_0 = \bar{Y} + b\,(x_0 - \bar{x}).$$

Hence we obtain

$$E(\hat{Y}_{x_0}) = a + b\,x_0, \qquad \operatorname{var}(\hat{Y}_{x_0}) = \left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{XX}}\right)\sigma^2$$

$(1-\alpha)$ confidence interval for $E(Y)$ at $x = x_0$:

$$(a + b\,x_0) \pm t_{n-2,\alpha/2} \cdot s_e \cdot \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{XX}}}$$

Inference for Regression, Mar 1, 2004 - 5 -


Estimating the Mean

Example: Body density

Suppose the measured skin thickness is x0 = 1.1 mm.

What is the mean body density for this value of skin thickness?

◦ Point estimate:

$$\hat{Y}_{x_0} = a + b\,x_0 = 13.71 - 11.41 \cdot 1.1 = 1.159$$

The estimated mean body density is $1.159 \cdot 10^3$ kg/m$^3$.

◦ Confidence interval:

$$(a + b\,x_0) \pm t_{n-2,\alpha/2} \cdot s_e \cdot \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{XX}}}$$

$$= (13.71 - 11.41 \cdot 1.1) \pm 1.99 \cdot \sqrt{0.0132} \cdot \sqrt{\frac{1}{92} + \frac{(1.1 - 1.06)^2}{0.023}} = [1.09,\ 1.22]$$

In STATA, the standard error for estimating the mean of $Y$ is calculated by passing the option stdp to predict (with $n - 2 = 90$ degrees of freedom for these data):

. predict BDH

. predict SE, stdp

. generate low=BDH-invttail(90,.025)*SE

. generate high=BDH+invttail(90,.025)*SE

. sort SKINT

. graph twoway line low high BDH SKINT, clpattern(dash dash solid) clcolor(black bla
> ck black) || scatter BODYD SKINT, legend(off) scheme(s1color)

[Figure: fitted values BDH with dashed confidence limits for the mean response, plus the data scatter; body density (1 to 2) against SKINT (1.02 to 1.1).]

Inference for Regression, Mar 1, 2004 - 6 -


Prediction

Suppose we want to predict Y at x = x0.

Aim: (1 − α) confidence interval for Y

Note that

$$a + b\,x_0 - Y \sim N\!\left(0,\ \sigma^2\left(1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{XX}}\right)\right)$$

Thus the desired $(1-\alpha)$ confidence interval for $Y_{x_0}$ is given by

$$a + b\,x_0 \pm t_{n-2,\alpha/2} \cdot s_e \cdot \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{XX}}}$$
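The extra 1 under the square root, compared with the interval for the mean, has a simple source (a remark added here for clarity): the prediction error combines the noise in the new observation with the estimation error of the fitted line,

$$\operatorname{var}(a + b\,x_0 - Y) = \underbrace{\sigma^2}_{\text{new observation}} + \underbrace{\left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{XX}}\right)\sigma^2}_{\text{estimation of } a + b\,x_0},$$

since the new $Y$ is independent of the data used to fit the line.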

Inference for Regression, Mar 1, 2004 - 7 -


Prediction

Example: Body density

Suppose the measured skin thickness is x0 = 1.1 mm.

What is the predicted body density for this value of skin thickness?

◦ Point estimate: $\hat{Y}_{x_0} = a + b\,x_0 = 13.71 - 11.41 \cdot 1.1 = 1.159$

The predicted body density is $1.159 \cdot 10^3$ kg/m$^3$.

◦ Confidence interval:

$$(a + b\,x_0) \pm t_{n-2,\alpha/2} \cdot s_e \cdot \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{XX}}}$$

$$= (13.71 - 11.41 \cdot 1.1) \pm 1.99 \cdot \sqrt{0.0132} \cdot \sqrt{1 + \frac{1}{92} + \frac{(1.1 - 1.06)^2}{0.023}} = [0.92,\ 1.40]$$

In STATA, the standard error for predicting $Y$ is calculated by passing the option stdf to predict:

. drop SE low high

. predict SE, stdf

. generate low=BDH-invttail(90,.025)*SE

. generate high=BDH+invttail(90,.025)*SE

. graph twoway line low high BDH SKINT, clpattern(dash dash solid) clcolor(black bla
> ck black) || scatter BODYD SKINT, legend(off) scheme(s1color)

Alternatively, we can use the following command:

. twoway (lfitci BODYD SKINT, range(1 1.1) stdf) (scatter BODYD SKINT),
> xtitle(Skin thickness) ytitle(Body density) scheme(s1color) legend(off)

[Figures: prediction bands from the stdf standard errors (left) and from lfitci with the stdf option (right); body density (1 to 2.5) against skin thickness (1 to 1.1).]

Inference for Regression, Mar 1, 2004 - 8 -


Multiple Regression

Example: Food expenditure and family income

Data:

◦ Sample of 20 households

◦ Food expenditure (response variable)

◦ Family income and family size

. regress food income

-------------------------------------------------------------------------
    food |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------+---------------------------------------------------------------
  income |   .1841099   .0149345    12.33   0.000     .1527336    .2154862
   _cons |  -.4119994   .7637666    -0.54   0.596    -2.016613    1.192615
-------------------------------------------------------------------------

. regress food number

-------------------------------------------------------------------------
    food |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------+---------------------------------------------------------------
  number |   2.287334   .4224493     5.41   0.000     1.399801    3.174867
   _cons |   1.217365   1.410627     0.86   0.399    -1.746252    4.180981
-------------------------------------------------------------------------

[Figures: scatter plots of food expenditure (0 to 20) against income (0 to 120, left) and against family size (0 to 6, right).]

Multiple Regression, Mar 3, 2004 - 1 -


Multiple Regression

Multiple regression model

$$Y_i = b_0 + b_1\,x_{1,i} + b_2\,x_{2,i} + \ldots + b_p\,x_{p,i} + \varepsilon_i, \qquad i = 1, \ldots, n$$

where

◦ $Y_i$ response variable

◦ $x_{1,i}, \ldots, x_{p,i}$ predictor variables (fixed, nonrandom)

◦ $b_0, \ldots, b_p$ regression coefficients

◦ $\varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)$ error variables

Example: Food expenditure and family income

Fitting multiple regression models in STATA:

. regress food income number

  Source |       SS       df       MS           Number of obs =      20
---------+------------------------------       F(  2,    17) =  121.47
   Model |  386.312865     2  193.156433       Prob > F      =  0.0000
  Resid. |  27.0326365    17  1.59015509       R-squared     =  0.9346
---------+------------------------------       Adj R-squared =  0.9269
   Total |  413.345502    19  21.7550264       Root MSE      =   1.261

-------------------------------------------------------------------------
    food |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------+---------------------------------------------------------------
  income |   .1482117   .0163786     9.05   0.000     .1136558    .1827676
  number |   .7931055   .2444411     3.24   0.005     .2773798    1.308831
   _cons |  -1.118295   .6548524    -1.71   0.106    -2.499913    .2633232
-------------------------------------------------------------------------
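Reading the coefficients off this output, the fitted plane (rounded to three decimals) is

$$\widehat{\text{Food}} = -1.118 + 0.148 \cdot \text{Income} + 0.793 \cdot \text{Number}.$$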

Multiple Regression, Mar 3, 2004 - 2 -


Multiple Regression

Example: Food expenditure and family income

Data: $(\text{Food}_i, \text{Income}_i, \text{Number}_i)$, $i = 1, \ldots, 20$

Fitted regression model:

$$\widehat{\text{Food}} = b_0 + b_1\,\text{Income} + b_2\,\text{Number}$$

[Figure: 3D scatter plot of the data with the fitted regression plane; income (0 to 120), family size (0 to 6), and food expenditure (0 to 20), with observed $Y_i$ and fitted $\hat{Y}_i$ indicated.]

The fitted model is a two-dimensional plane, which is difficult to visualize.

Multiple Regression, Mar 3, 2004 - 3 -


Inference for Multiple Regression

Multiple regression model (matrix notation)

$$Y = X\,b + \varepsilon$$

where

◦ $Y$: $n$-dimensional vector

◦ $X$: $n \times (1+p)$-dimensional matrix

◦ $b$: $(1+p)$-dimensional vector

◦ $\varepsilon$: $n$-dimensional vector

Thus the model can be written as

$$\begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{1,1} & \cdots & x_{p,1} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{1,n} & \cdots & x_{p,n} \end{pmatrix} \begin{pmatrix} b_0 \\ \vdots \\ b_p \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$

Least squares approach: Minimize

$$\|Y - \hat{Y}\|^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$

Results:

$$b = (X^T X)^{-1} X^T Y \sim N\!\left(b,\ \sigma^2 (X^T X)^{-1}\right)$$

$$\hat{Y} = X (X^T X)^{-1} X^T Y \sim N\!\left(X b,\ \sigma^2 X (X^T X)^{-1} X^T\right)$$

$$\hat{\varepsilon} = Y - \hat{Y} = \left(I - X (X^T X)^{-1} X^T\right) Y \sim N\!\left(0,\ \sigma^2 \left(I - X (X^T X)^{-1} X^T\right)\right)$$

$$\hat{\sigma}^2 = s_e^2 = \frac{\|Y - \hat{Y}\|^2}{n - p - 1} = \frac{1}{n - p - 1} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$

Here $I$ is the $n \times n$ identity matrix; the residual degrees of freedom are $n - p - 1$ because $X$ has $1 + p$ columns.
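These matrix-form quantities are available after regress as stored results; a short sketch for the food expenditure model (the matrix names bhat and V are mine, not from the original log):

. quietly regress food income number
. matrix bhat = e(b)
. matrix V = e(V)
. matrix list bhat
. matrix list V

Here e(b) holds the least squares estimates $(X^T X)^{-1} X^T Y$ and e(V) the estimated covariance matrix $\hat{\sigma}^2 (X^T X)^{-1}$.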

Details: see a course in regression analysis (STAT 22200) or econometrics.

Multiple Regression, Mar 3, 2004 - 4 -


Inference for Multiple Regression

Example: Food expenditure and family income

Interpretation of regression coefficients

. quietly regress food income

. predict e_food1, residuals

. quietly regress number income

. predict e_num, residuals

. regress e_food1 e_num

------------------------------------------------------------------------
 e_food1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------+--------------------------------------------------------------
   e_num |   .7931055   .2375541     3.34   0.004     .2940229    1.292188
------------------------------------------------------------------------

. quietly regress food number

. predict e_food2, residuals

. quietly regress income number

. predict e_inc, residuals

. regress e_food2 e_inc

------------------------------------------------------------------------
 e_food2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------+--------------------------------------------------------------
   e_inc |   .1482117   .0159172     9.31   0.000      .114771    .1816525
------------------------------------------------------------------------

Result:

◦ $b_j$ measures the dependence of $Y$ on $x_j$ after removing the linear effects of all other predictors $x_k$, $k \neq j$. Note that the slopes above (.7931055 for e_num and .1482117 for e_inc) match the coefficients of number and income in the multiple regression exactly.

◦ $b_j = 0$ if $x_j$ provides no information for the prediction of $Y$ beyond the information given by the other predictor variables.

Multiple Regression, Mar 3, 2004 - 5 -


Multiple Regression

Example: Heart catheterization

Description: A Teflon tube (catheter) 3 mm in diameter is passed into a major vein or artery at the femoral region and pushed up into the heart to obtain information about the heart's physiology and functional ability. The length of the catheter is typically determined by a physician's educated guess.

Data:

◦ Study with 12 children with congenital heart defects

◦ Exact required catheter length was measured using a fluoroscope

◦ Patient's height and weight were recorded

Question: How accurately can catheter length be determined by height and weight?

[Figures: scatter plots of distance to the pulmonary artery (20 to 50 cm) against height (30 to 60 in, left) and against weight (20 to 80 lb, right).]

Multiple Regression, Mar 3, 2004 - 6 -


Multiple Regression

Example: Heart catheterization (contd)

Regression model:

$$Y = b_0 + b_1\,x_1 + b_2\,x_2 + \varepsilon$$

where

◦ $Y$ - distance to pulmonary artery

◦ $x_1$ - height

◦ $x_2$ - weight

STATA regression output:

. regress distance height weight

      Source |       SS       df       MS           Number of obs =      12
-------------+------------------------------       F(  2,     9) =   18.62
       Model |   578.81613     2  289.408065       Prob > F      =  0.0006
    Residual |  139.913037     9   15.545893       R-squared     =  0.8053
-------------+------------------------------       Adj R-squared =  0.7621
       Total |  718.729167    11  65.3390152       Root MSE      =  3.9428

------------------------------------------------------------------------------
    distance |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      height |   .1963566   .3605845     0.54   0.599    -.6193422    1.012056
      weight |   .1908278    .165164     1.16   0.278    -.1827991    .5644547
       _cons |     21.0084   8.751156     2.40   0.040     1.211907    40.80489
------------------------------------------------------------------------------

Note:

◦ Neither height nor weight appears to be significant for predicting the distance to the pulmonary artery.

◦ Nevertheless, the regression on both variables explains about 81% of the variation of the response (catheter length).

Multiple Regression, Mar 3, 2004 - 7 -


Multiple Regression

Example: Heart catheterization (contd)

Consider predicting the length by height alone and by weight alone:

. regress distance height
                                                    R-squared     =  0.7765
------------------------------------------------------------------------------
    distance |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      height |   .5967612   .1012558     5.89   0.000     .3711492    .8223732
       _cons |    12.12405   4.247174     2.85   0.017     2.660752    21.58734
------------------------------------------------------------------------------

. regress distance weight
                                                    R-squared     =  0.7989
------------------------------------------------------------------------------
    distance |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |   .2772687   .0439881     6.30   0.000     .1792571    .3752804
       _cons |    25.63746   2.004207    12.79   0.000     21.17181    30.10311
------------------------------------------------------------------------------

Note:

◦ In a simple regression of $Y$ on either height or weight, the explanatory variable is highly significant for predicting $Y$.

◦ In a multiple regression of $Y$ on height and weight, the coefficients of both height and weight are not significantly different from zero.

Problem: The explanatory variables are highly linearly dependent (collinear).

[Figure: scatter plot of weight (20 to 80 lb) against height (20 to 70 in), showing the strong linear relationship between the two predictors.]

Multiple Regression, Mar 3, 2004 - 8 -


Analysis of Variance

Decomposition of variation:

◦ $SS_{\text{Total}} = \sum_i (Y_i - \bar{Y})^2$ - total variation

◦ $SS_{\text{Residual}} = \sum_i (Y_i - \hat{Y}_i)^2$ - variation remaining in the regression model

◦ $SS_{\text{Model}} = SS_{\text{Total}} - SS_{\text{Residual}} = \sum_i (\hat{Y}_i - \bar{Y})^2$ - variation explained by the regression

Coefficient of determination: The ratio

$$R^2 = \frac{SS_{\text{Model}}}{SS_{\text{Total}}}$$

indicates how well the regression model predicts the response. $R^2$ is also the squared multiple correlation coefficient; in a simple linear regression we have

$$R^2 = \rho_{XY}^2.$$

Example: Heart catheterization

      Source |       SS       df       MS           Number of obs =      12
-------------+------------------------------       F(  2,     9) =   18.62
       Model |   578.81613     2  289.408065       Prob > F      =  0.0006
    Residual |  139.913037     9   15.545893       R-squared     =  0.8053
-------------+------------------------------       Adj R-squared =  0.7621
       Total |  718.729167    11  65.3390152       Root MSE      =  3.9428

The coefficient of determination for these data is

$$R^2 = \frac{578.82}{718.73} = 0.81.$$

Regression on height and weight explains 81% of the variation of distance.
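$R^2$ is also among Stata's stored results after regress; a one-line check (a sketch):

. quietly regress distance height weight
. display e(r2)

which returns 0.8053, matching the R-squared entry in the output above.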

Multiple Regression, Mar 3, 2004 - 9 -


Analysis of Variance

Question: Is the improvement in prediction (the decrease in variation) significant?

Our null hypothesis is that none of the explanatory variables helps to predict the response, that is,

$$H_0: b_1 = \ldots = b_p = 0 \quad \text{versus} \quad H_a: b_j \neq 0 \text{ for some } j \in \{1, \ldots, p\}.$$

Under the null hypothesis $H_0$, the $F$ statistic

$$F = \frac{n-p-1}{p} \cdot \frac{SS_{\text{Model}}}{SS_{\text{Residual}}} = \frac{n-p-1}{p} \cdot \frac{SS_{\text{Total}} - SS_{\text{Residual}}}{SS_{\text{Residual}}}$$

is $F$ distributed with $p$ and $n-p-1$ degrees of freedom.

The null hypothesis $H_0$ is rejected at level $\alpha$ if $F > F_{p,n-p-1,\alpha}$.

Example: Heart catheterization

      Source |       SS       df       MS           Number of obs =      12
-------------+------------------------------       F(  2,     9) =   18.62
       Model |   578.81613     2  289.408065       Prob > F      =  0.0006
    Residual |  139.913037     9   15.545893       R-squared     =  0.8053
-------------+------------------------------       Adj R-squared =  0.7621
       Total |  718.729167    11  65.3390152       Root MSE      =  3.9428

The value of the $F$ statistic is

$$F = \frac{9}{2} \cdot \frac{578.82}{139.91} = 18.62.$$

The critical value for rejecting $H_0: b_1 = b_2 = 0$ is $F_{2,9,0.05} = 4.26$. Thus the null hypothesis $H_0$ that both coefficients $b_1$ and $b_2$ are zero is rejected at significance level $\alpha = 0.05$.
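Since $SS_{\text{Model}}/SS_{\text{Residual}} = R^2/(1 - R^2)$, the $F$ statistic can equivalently be written in terms of $R^2$ (a standard identity, added here for reference):

$$F = \frac{n-p-1}{p} \cdot \frac{R^2}{1-R^2} = \frac{9}{2} \cdot \frac{0.8053}{1-0.8053} \approx 18.6.$$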

Multiple Regression, Mar 3, 2004 - 10 -


Comparing Models

Example: Cobb-Douglas production function

$$Y = t \cdot K^a \cdot L^b \cdot M^c$$

where

◦ $Y$ - output

◦ $K$ - capital

◦ $L$ - labour

◦ $M$ - materials

Regression model:

$$\log Y = \log t + a \log K + b \log L + c \log M$$

[Figures: scatter plots of $Y$ against $K$ (0 to 1), against $L$ ($-0.2$ to 0.6), and against $M$ ($-0.2$ to 1).]

Multiple Regression, Mar 3, 2004 - 11 -


Comparing Models

Example: Cobb-Douglas production function (contd)

Regression model $M_0$ for the Cobb-Douglas function:

$$\log Y = \log t + a \log K + b \log L + c \log M$$

. regress LY LK LM LL

  Source |       SS       df       MS           Number of obs =      25
---------+------------------------------       F(  3,    21) =  138.98
   Model |  1.35136742     3  .450455808       Prob > F      =  0.0000
Residual |  .068065609    21  .003241219       R-squared     =  0.9520
---------+------------------------------       Adj R-squared =  0.9452
   Total |  1.41943303    24  .059143043       Root MSE      =  .05693

-------------------------------------------------------------------------
      LY |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------+---------------------------------------------------------------
      LK |   .0718626   .1543912     0.47   0.646    -.2492114    .3929366
      LM |   .7072231   .3004146     2.35   0.028     .0824768    1.331969
      LL |   .2117778   .4248755     0.50   0.623    -.6717991    1.095355
   _cons |   .0347117   .0374354     0.93   0.364    -.0431395    .1125629

The two variables $\log K$ and $\log L$ do not significantly improve the prediction of $\log Y$.

Alternative model $M_1$:

$$\log Y = \log t + c \log M$$

. regress LY LM

  Source |       SS       df       MS           Number of obs =      25
---------+------------------------------       F(  1,    23) =  445.69
   Model |  1.34977753     1  1.34977753       Prob > F      =  0.0000
Residual |  .069655501    23    .0030285       R-squared     =  0.9509
---------+------------------------------       Adj R-squared =  0.9488
   Total |  1.41943303    24  .059143043       Root MSE      =  .05503

-------------------------------------------------------------------------
      LY |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------+---------------------------------------------------------------
      LM |   .9086794   .0430421    21.11   0.000       .81964    .9977188
   _cons |   .0512244   .0189767     2.70   0.013      .011968    .0904808

Question: Is model M0 significantly better than model M1?

Multiple Regression, Mar 3, 2004 - 12 -


Comparing Models

Consider the multiple regression model with $p$ explanatory variables

$$Y_i = b_0 + b_1\,x_{1,i} + \ldots + b_p\,x_{p,i} + \varepsilon_i.$$

Problem:

Test the null hypothesis

$H_0$: the $q$ specified explanatory variables all have zero coefficients

versus

$H_a$: at least one of these $q$ explanatory variables has a nonzero coefficient.

Solution:

◦ Regress $Y$ on all $p$ explanatory variables and read $SS^{(1)}_{\text{Residual}}$ from the output.

◦ Regress $Y$ on the $p - q$ explanatory variables that remain after you remove the $q$ variables from the model. Read $SS^{(2)}_{\text{Residual}}$ from the output.

◦ The test statistic is

$$F = \frac{n-p-1}{q} \cdot \frac{SS^{(2)}_{\text{Residual}} - SS^{(1)}_{\text{Residual}}}{SS^{(1)}_{\text{Residual}}}.$$

Under the null hypothesis, $F$ is $F$ distributed with $q$ and $n-p-1$ degrees of freedom.

◦ Reject if $F > F_{q,n-p-1,\alpha}$.

Multiple Regression, Mar 3, 2004 - 13 -


Comparing Models

Example: Cobb-Douglas production function

Comparison of models M0 and M1:

◦ $M_0$ (full model): $SS^{(1)}_{\text{Residual}} = 0.06807$, with $n - p - 1 = 21$.

◦ $M_1$ (reduced model): $SS^{(2)}_{\text{Residual}} = 0.06966$, with $q = 2$ variables removed.

$$F = \frac{21}{2} \cdot \frac{0.06966 - 0.06807}{0.06807} = 0.2453$$

◦ Since $F < F_{2,21,0.05} = 3.47$, we cannot reject $H_0: a = b = 0$.

Using STATA:

. test LK LL

 ( 1)  LK = 0
 ( 2)  LL = 0

       F(  2,    21) =    0.25
            Prob > F =    0.7847

. test LK LL _cons

 ( 1)  LK = 0
 ( 2)  LL = 0
 ( 3)  _cons = 0

       F(  3,    21) =    2.43
            Prob > F =    0.0934

Multiple Regression, Mar 3, 2004 - 14 -


Case Study

Example: Headaches and pain reliever

◦ 24 patients with a common type of headache were treated with a new pain reliever

◦ Medication was given to each patient at one of four dosage levels: 2, 5, 7, or 10 grams

◦ Response variable: time until noticeable relief (in minutes)

◦ Other explanatory variables:

⋄ sex (0=female, 1=male)

⋄ blood pressure (0.25=low, 0.50=medium, 0.75=high)

Box plots

[Figure: box plots of time to relief (0 to 60 minutes) by sex within each dosage level (2, 5, 7, and 10 grams).]

Multiple Regression II, Mar 5, 2004 - 1 -


Case Study

. regress time dose bp if sex==0
                                                    R-squared     =  0.8861
--------------------------------------------------------------------------
    time |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------+----------------------------------------------------------------
    dose |  -5.519608   .6608907    -8.35   0.000    -7.014646   -4.024569
      bp |         -5   9.439407    -0.53   0.609    -26.35342    16.35342
   _cons |   61.11765   6.458495     9.46   0.000     46.50752    75.72778
--------------------------------------------------------------------------

. predict YHf
(option xb assumed; fitted values)

. twoway line YHf dose if bp==0.25||line YHf dose if bp==0.5||
> line YHf dose if bp==0.75||scatter time dose if(sex==0), saving(a, replace)
(file a.gph saved)

. regress time dose bp if sex==1
                                                    R-squared     =  0.5765
--------------------------------------------------------------------------
    time |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------+----------------------------------------------------------------
    dose |  -3.343137   .9564492    -3.50   0.007    -5.506776   -1.179499
      bp |       -2.5   13.66083    -0.18   0.859    -33.40294    28.40294
   _cons |   51.39216   9.346814     5.50   0.000      30.2482    72.53612
--------------------------------------------------------------------------

. predict YHm
(option xb assumed; fitted values)

. twoway line YHm dose if bp==0.25||line YHm dose if bp==0.5||
> line YHm dose if bp==0.75||scatter time dose if(sex==1), saving(b, replace)
(file b.gph saved)

. graph combine a.gph b.gph

[Figure: fitted lines by blood pressure level and observed times (0 to 60 minutes) against dose (2 to 10 grams), separately for females (left) and males (right).]

Multiple Regression II, Mar 5, 2004 - 2 -


Case Study

Model:

Time = Dose + Sex + Sex · Dose + BP + ε

. infile time dose sex bp using headache.dat
(24 observations read)

. generate sexdose=sex*dose

. regress time dose sex sexdose bp

    Source |       SS       df       MS           Number of obs =      24
----------+------------------------------        F(  4,    19) =   16.78
     Model |  4387.65319     4   1096.9133       Prob > F      =  0.0000
  Residual |  1242.30515    19  65.3844814       R-squared     =  0.7793
----------+------------------------------        Adj R-squared =  0.7329
     Total |  5629.95833    23  244.780797       Root MSE      =  8.0861

---------------------------------------------------------------------------
      time |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
----------+----------------------------------------------------------------
      dose |  -5.519608   .8006399    -6.89   0.000    -7.195367   -3.843849
       sex |   -8.47549   7.553222    -1.12   0.276    -24.28457    7.333585
   sexdose |   2.176471   1.132276     1.92   0.070      -.19341    4.546351
        bp |      -3.75   8.086067    -0.46   0.648    -20.67433    13.17433
     _cons |   60.49265   6.698634     9.03   0.000     46.47224    74.51305
---------------------------------------------------------------------------

Note that the fitted dose slope is $-5.52$ for females and $-5.52 + 2.18 = -3.34$ for males, matching the two separate regressions on the previous slide.

. predict YH
(option xb assumed; fitted values)

. predict E, residuals

Residual plots: residuals against dose, and a normal Q-Q plot of the residuals:

[Figures: residuals ($-10$ to 15 minutes) against dose (2 to 10 grams, left) and a normal Q-Q plot of the residuals (right).]

Multiple Regression II, Mar 5, 2004 - 3 -


Case Study

Model:

Time = Dose + Dose² + Sex + Sex · Dose + BP + ε

. drop YH E

. generate dosesq=dose^2

. regress time dose sex sexdose dosesq bp

    Source |       SS       df       MS           Number of obs =      24
----------+------------------------------        F(  5,    18) =   24.20
     Model |  4901.02819     5  980.205637       Prob > F      =  0.0000
  Residual |  728.930147    18  40.4961193       R-squared     =  0.8705
----------+------------------------------        Adj R-squared =  0.8346
     Total |  5629.95833    23  244.780797       Root MSE      =  6.3637

---------------------------------------------------------------------------
      time |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
----------+----------------------------------------------------------------
      dose |  -12.91961   2.171775    -5.95   0.000    -17.48234   -8.356878
       sex |   -8.47549   5.944312    -1.43   0.171    -20.96403    4.013047
   sexdose |   2.176471   .8910901     2.44   0.025     .3043598    4.048581
    dosesq |   .6166667   .1731968     3.56   0.002     .2527937    .9805396
        bp |      -3.75   6.363656    -0.59   0.563    -17.11955    9.619545
     _cons |   77.45098   7.104701    10.90   0.000     62.52456     92.3774
---------------------------------------------------------------------------

. predict E, residuals

[Figures: residuals ($-10$ to 10 minutes) against dose (2 to 10 grams, left) and a normal Q-Q plot of the residuals (right).]

. test sex bp

 ( 1)  sex = 0
 ( 2)  bp = 0

       F(  2,    18) =    1.19
            Prob > F =    0.3270

Multiple Regression II, Mar 5, 2004 - 4 -


Case Study

Model:

Time = Dose + Dose² + Sex · Dose + ε

. regress time dose sexdose dosesq

    Source |       SS       df       MS           Number of obs =      24
----------+------------------------------        F(  3,    20) =   38.81
     Model |  4804.63916     3  1601.54639       Prob > F      =  0.0000
  Residual |  825.319178    20  41.2659589       R-squared     =  0.8534
----------+------------------------------        Adj R-squared =  0.8314
     Total |  5629.95833    23  244.780797       Root MSE      =  6.4239

---------------------------------------------------------------------------
      time |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
----------+----------------------------------------------------------------
      dose |  -12.34823   2.154675    -5.73   0.000     -16.8428   -7.853653
   sexdose |   1.033708   .3931338     2.63   0.016     .2136452    1.853771
    dosesq |   .6166667   .1748353     3.53   0.002     .2519667    .9813667
     _cons |   71.33824   5.667294    12.59   0.000     59.51647       83.16
---------------------------------------------------------------------------

. predict YH
(option xb assumed; fitted values)

. twoway line YH dose if sex==0|| line YH dose if sex==1,
> legend(label(1 "female") label(2 "male"))

[Figure: fitted time to relief (0 to 60 minutes) against dose (2 to 10 grams), one line per sex.]

Multiple Regression II, Mar 5, 2004 - 5 -


Comparing Several Means

Example: Comparison of laboratories

◦ Task: Measure amount of chlorpheniramine maleate in tablets

◦ Seven laboratories were asked to make 10 determinations of one tablet

◦ Study consistency between labs and variability of measurements

Box plot

[Figure: box plots of the measured amount of chlorpheniramine (3.80 to 4.10 mg) for Labs 1 through 7.]

One-Way Analysis of Variance, Mar 8, 2004 - 1 -


Comparing Several Means

Example: Comparison of drugs

◦ Experimental study of drugs to relieve itching

◦ Five drugs were compared to a placebo and no drug

◦ Ten volunteer male subjects

◦ Each subject underwent one treatment per day (randomized order)

◦ Drug or placebo were given intravenously

◦ Itching was induced on forearms with cowage

◦ Subjects recorded duration of itching

Box plot

[Figure: box plots of duration of itching (100 to 400 seconds) for the seven treatments: no drug, placebo, papaverine, morphine, aminophylline, pentobarbital, and tripelennamine.]

One-Way Analysis of Variance, Mar 8, 2004 - 2 -


Comparing Several Means

. infile amount lab using labs.txt
(70 observations read)

. graph box amount, over(lab)

. oneway amount lab, bonferroni tabulate

            |      Summary of amount
        lab |      Mean   Std. Dev.      Freq.
------------+------------------------------------
          1 |     4.062   .03259178         10
          2 |     3.997   .08969706         10
          3 |     4.003   .02311808         10
          4 |     3.920   .03333330         10
          5 |     3.957   .05716445         10
          6 |     3.955   .06704064         10
          7 |     3.998   .08482662         10
------------+------------------------------------
      Total | 3.9845715   .07184294         70

                      Analysis of Variance
    Source            SS        df       MS          F     Prob > F
------------------------------------------------------------------------
Between groups     .1247371      6   .020789517     5.66     0.0001
 Within groups    .231400073    63   .003673017
------------------------------------------------------------------------
    Total         .356137173    69   .005161408

Bartlett's test for equal variances:  chi2(6) = 24.3697  Prob>chi2 = 0.000
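For reference, the F statistic in this table is simply the ratio of the two mean squares (a check added here, not part of the original output):

$$F = \frac{MS_{\text{between}}}{MS_{\text{within}}} = \frac{0.020789517}{0.003673017} = 5.66.$$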

Comparison of amount by lab (Bonferroni)

Row Mean-|
Col Mean |        1        2        3        4        5        6
---------+------------------------------------------------------------------
       2 |    -.065
         |    0.408
         |
       3 |    -.059     .006
         |    0.698    1.000
         |
       4 |    -.142    -.077    -.083
         |    0.000    0.127    0.068
         |
       5 |    -.105     -.04    -.046     .037
         |    0.005    1.000    1.000    1.000
         |
       6 |    -.107    -.042    -.048     .035    -.002
         |    0.004    1.000    1.000    1.000    1.000
         |
       7 |    -.064     .001    -.005     .078     .041     .043
         |    0.448    1.000    1.000    0.115    1.000    1.000

One-Way Analysis of Variance, Mar 8, 2004 - 3 -


Comparing Several Means

. oneway duration drug, bonferroni tabulate

            |     Summary of duration
       drug |      Mean   Std. Dev.      Freq.
------------+------------------------------------
          1 |     191.0   54.861442         10
          2 |     204.8  105.723750         10
          3 |     118.2   52.809511         10
          4 |     148.0   44.738748         10
          5 |     144.3   42.076782         10
          6 |     176.5   68.856130         10
          7 |     167.2   67.499465         10
------------+------------------------------------
      Total | 164.28571   68.463709         70

                      Analysis of Variance
    Source            SS        df       MS          F     Prob > F
------------------------------------------------------------------------
Between groups    53012.8857     6   8835.48095     2.06     0.0708
 Within groups    270409.4      63   4292.2127
------------------------------------------------------------------------
    Total         323422.286    69   4687.2795

Bartlett's test for equal variances:  chi2(6) = 11.3828  Prob>chi2 = 0.077

Comparison of duration by drug (Bonferroni)

Row Mean-|
Col Mean |        1        2        3        4        5        6
---------+------------------------------------------------------------------
       2 |     13.8
         |    1.000
         |
       3 |    -72.8    -86.6
         |    0.328    0.092
         |
       4 |      -43    -56.8     29.8
         |    1.000    1.000    1.000
         |
       5 |    -46.7    -60.5     26.1     -3.7
         |    1.000    0.904    1.000    1.000
         |
       6 |    -14.5    -28.3     58.3     28.5     32.2
         |    1.000    1.000    1.000    1.000    1.000
         |
       7 |    -23.8    -37.6       49     19.2     22.9     -9.3
         |    1.000    1.000    1.000    1.000    1.000    1.000

One-Way Analysis of Variance, Mar 8, 2004 - 4 -