ST 544 © D. Zhang
ST 544: Applied Categorical Data Analysis
Daowen Zhang
http://www4.stat.ncsu.edu/~dzhang2
Slide 1
TABLE OF CONTENTS ST 544, D. Zhang
Contents
1 Introduction 3
2 Contingency Tables 40
3 Generalized Linear Models (GLMs) 122
4 Logistic Regression 189
5 Building and Applying Logistic Regression Models 248
6 Multicategory Logit Models 299
8 Models for Matched Pairs 366
9 Modeling Correlated, Clustered, Longitudinal Categorical Data 435
10 Random Effects: Generalized Linear Mixed Models (GLMMs) 480
Slide 2
CHAPTER 1 ST 544, D. Zhang
1 Introduction
I. Categorical Data
Definition
• A categorical variable is a (random) variable that can only take finite
or countably many values (categories).
• Types of categorical variables:
⋆ Gender: F/M or 0/1; Race: White, Black, Others – Nominal
⋆ Patient's Health Status: Excellent, Good, Fair, Bad – Ordinal
⋆ # of car accidents in next Jan in Wake County – Interval
Slide 3
• Application of math operations:

  Type           | Nominal       | Ordinal                  | Interval            | Continuous
  Example        | Gender, Race  | Patient's Health Status  | # of car accidents  | Height
  Math Operation | None          | >, <                     | >, <, ±             | Any
• Response (Dependent) Variable: Y
Explanatory (Independent, Covariate) Variable: X.
• We focus on the cases where Y is categorical.
Slide 4
II. Common Distributions
II.1 Binomial distribution
• We have a Bernoulli process:
1. n independent trials, n > 0 – fixed integer
2. Each trial produces 1 of 2 outcomes: S for success & F for failure
3. Success probability at each trial is the same (π ∈ (0, 1))
• Y = total # of successes out of n trials, Y ∼ Bin(n, π) and has a
probability mass function (pmf):
p(y) = P[Y = y] = n!/(y!(n − y)!) · π^y (1 − π)^(n−y),  y = 0, 1, 2, ..., n.

The coefficient n!/(y!(n − y)!) is usually denoted (n choose y), and is the nCr function on your calculator.
• The above pmf is useful in calculating probabilities associated with a
binomial distribution (for a known π).
Slide 5
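The pmf above is easy to evaluate directly. A minimal Python sketch, stdlib only (the function name binom_pmf is ours, not from any package; the R equivalent is dbinom):

```python
from math import comb

def binom_pmf(y, n, pi):
    # P[Y = y] for Y ~ Bin(n, pi), straight from the pmf on this slide
    return comb(n, y) * pi**y * (1 - pi)**(n - y)

# Sanity check: the pmf sums to 1 over y = 0, 1, ..., n, as any pmf must.
total = sum(binom_pmf(y, 10, 0.3) for y in range(11))
```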
Slide 6
• Examples: Suppose two people (A and B) are to play n = 10 chess games with no ties. Assume the games are independent of each other and π = P[A wins B in a single game] = 0.6.
1. Find the prob that A wins 4 games.
P[Y = 4] = (10 choose 4) × 0.6^4 × (1 − 0.6)^(10−4) = 0.1115
2. Find the prob that A wins at least 4 games.
P [Y ≥ 4] = 1− P [Y ≤ 3] = 1− 0.0548 = 0.9452.
3. Find the prob that B wins more than A.
P [10− Y > Y ] = P [Y < 10/2 = 5] = P [Y ≤ 4] = 0.1662.
Slide 7
In R: dbinom(y, n, pi) gives P(Y = y); pbinom(y, n, pi) gives P(Y ≤ y); qbinom(q, n, pi) returns the smallest y with P(Y ≤ y) ≥ q.
1) dbinom(4, 10, 0.6); 2) 1 - pbinom(3, 10, 0.6); 3) pbinom(4, 10, 0.6)
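As a cross-check of the three chess-game answers above (and of the R calls in the annotation), a stdlib-only Python version:

```python
from math import comb

def binom_pmf(y, n, pi):
    # binomial pmf: C(n, y) * pi^y * (1 - pi)^(n - y)
    return comb(n, y) * pi**y * (1 - pi)**(n - y)

n, pi = 10, 0.6
p1 = binom_pmf(4, n, pi)                             # 1) P[Y = 4]
p2 = 1 - sum(binom_pmf(k, n, pi) for k in range(4))  # 2) P[Y >= 4]
p3 = sum(binom_pmf(k, n, pi) for k in range(5))      # 3) P[Y <= 4]
```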
• Properties of a binomial distribution Y ∼ Bin(n, π):
1. Y = Y1 + Y2 + · · · + Yn, where Yi = 1/0 indicates success in the ith trial, and Yi is independent of Yj for i ≠ j.
2. Mean, variance and standard deviation of Y:
E(Y) = nπ
var(Y) = nπ(1 − π)
σ = √var(Y) = √(nπ(1 − π))
3. Y has smaller variation when π is closer to 0 or 1.
• When n is large, Bin(n, π) can be well approximated by a normal dist.
Requirement: nπ ≥ 5 & n(1− π) ≥ 5.
Slide 8
For example: π = 0.5 requires n ≥ 10; π = 0.1 (or π = 0.9) requires n ≥ 50.
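The quality of the normal approximation can be checked numerically. A sketch comparing the exact Bin(12, 0.5) CDF with its normal approximation; the continuity correction (evaluating at 7.5 rather than 7) is our addition, not on the slide:

```python
from math import comb, erf, sqrt

def binom_cdf(k, n, pi):
    # exact P[Y <= k] for Y ~ Bin(n, pi)
    return sum(comb(n, j) * pi**j * (1 - pi)**(n - j) for j in range(k + 1))

def norm_cdf(x, mu, sigma):
    # standard normal CDF shifted/scaled, via the error function
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

n, pi = 12, 0.5
mu, sigma = n * pi, sqrt(n * pi * (1 - pi))
exact = binom_cdf(7, n, pi)
approx = norm_cdf(7.5, mu, sigma)  # continuity-corrected normal approximation
```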
Normal Approximation to Bin(12, 0.5)
Slide 9
The binomial distribution is symmetric when π = 0.50. For fixed n, it becomes more bell-shaped as π gets closer to 0.50. For fixed π, it becomes more bell-shaped as n increases. When n is large, it can be approximated by a normal distribution with μ = nπ and σ = √(nπ(1 − π)). A guideline is that the expected numbers of outcomes of the two types, nπ and n(1 − π), should both be at least about 5. For π = 0.50 this requires only n ≥ 10, whereas π = 0.10 (or π = 0.90) requires n ≥ 50. When π gets nearer to 0 or 1, larger samples are needed before a symmetric, bell shape occurs.
II.2 Multinomial distribution (for nominal or ordinal categorical variables)
Y      1    2    · · ·    c
Prob   π1   π2   · · ·    πc

where πj = P[Y = j] > 0 and Σ_{j=1}^c πj = 1.
• Each of the n trials results in an outcome in one (and only one) of the c categories, represented by the vector
Ỹi = (Yi1, Yi2, ..., Yic)^T, i = 1, 2, ..., n. For example, Ỹi = (0, 1, ..., 0)^T.
Only one of {Yij}_{j=1}^c is 1; the others are 0; πj = P[Yij = 1].
• Prob of observing Ỹi: π1^{Yi1} π2^{Yi2} · · · πc^{Yic}
Slide 10
• Often we may not have the individual outcomes. Instead, we have the following summary:
ñ = (n1, n2, ..., nc)^T,
where nj is the # of trials resulting in an outcome in the jth category. That is, nj = Σ_{i=1}^n Yij.
• The probability of observing ñ is
p(n1, n2, ..., nc) = n!/(n1! n2! · · · nc!) π1^{n1} π2^{n2} · · · πc^{nc}.
• We often denote n˜ ∼ multinomial(n, (π1, ..., πc)).
Slide 11
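The multinomial pmf above can be coded up directly. A stdlib-only sketch; the example counts and probabilities are illustrative numbers of ours, not from the slides:

```python
from math import factorial, prod

def multinom_pmf(counts, probs):
    # p(n1, ..., nc) = n!/(n1! ... nc!) * pi1^n1 * ... * pic^nc
    n = sum(counts)
    coef = factorial(n)
    for nj in counts:
        coef //= factorial(nj)
    return coef * prod(p**nj for p, nj in zip(probs, counts))

# Illustrative: n = 5 trials over c = 3 categories
p_example = multinom_pmf([2, 2, 1], [0.5, 0.3, 0.2])
# With c = 2 the formula reduces to the binomial pmf
p_binom = multinom_pmf([4, 6], [0.6, 0.4])
```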
• In practice, we want to keep the data in the original form of Ỹi, i.e., the category the ith observation fell into, together with other covariate information if such information is available. This is especially the case if each i represents a subject and we would like to use the covariate information to predict which category individual i most likely falls into (regression setting).
Slide 12
• Properties of a multinomial distribution:
1. nj ∼ Bin(n, πj) ⇒ E(nj) = nπj, var(nj) = nπj(1 − πj).
2. ni and nj (i ≠ j) are negatively associated:
cov(ni, nj) = −nπiπj, i ≠ j.
• ñ can be written:
ñ = (n1, n2, ..., nc)^T = Σ_{i=1}^n Ỹi.
By the CLT, ñ approximately has a (multivariate) normal distribution when n is large.
Slide 13
III. Large-Sample Inference on π in a Binomial Distribution
III.1 Likelihood function and maximum likelihood estimation (MLE)
• The parameter π in Bin(n, π) is usually unknown and we would like to
learn about π based on data y from Bin(n, π).
• An intuitive estimate of π is the sample proportion
p = y/n = (y1 + y2 + ... + yn)/n.
1. p is an unbiased estimator (as a random variable): E(p) = π.
2. p becomes more accurate as n gets larger: var(p) = π(1 − π)/n.
3. When n is large, p has an approximate normal distribution (sampling distribution).
Slide 14
http://people.uncw.edu/chenc/STT215/PPT/Stt210%20chapter07.pptx
https://istats.shinyapps.io/SampDist_Prop/
• Sample proportion p is the MLE of π:
1. Given data y ∼ Bin(n, π), we exchange the roles of y and π in the
pmf and treat it as a function of π:
L(π) = (n choose y) π^y (1 − π)^(n−y).
This function is called the likelihood function of π for given data y.
2. For example, if y = 6 out of n = 10 Bernoulli trials, the likelihood
function of π is
L(π) = (10 choose 6) π^6 (1 − π)^(10−6) = 210 π^6 (1 − π)^4.
3. Intuitively, the best estimate of π would be the one that maximizes
this likelihood or the log-likelihood:
ℓ(π) = const + y log(π) + (n − y) log(1 − π).
Note that we use natural log here.
4. It can be shown that the MLE π̂ of π is p = y/n.
Slide 15
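The claim that the maximizer is p = y/n is easy to see numerically. A grid-search sketch over the log-likelihood for y = 6, n = 10 (the grid resolution is an arbitrary choice of ours):

```python
from math import log

n, y = 10, 6

def loglik(pi):
    # log-likelihood up to a constant: y*log(pi) + (n - y)*log(1 - pi)
    return y * log(pi) + (n - y) * log(1 - pi)

# Grid search over (0, 1); the maximizer should be the sample proportion y/n
grid = [i / 10000 for i in range(1, 10000)]
pi_hat = max(grid, key=loglik)
```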
Slide 16
• In general, the MLE of a parameter has many good statistical
properties:
1. When sample size n is large, an MLE is approximately unbiased.
2. When sample size n is large, the variance of an MLE → 0.
3. When sample size n is large, an MLE has an approximate normal
distribution.
4. Under some conditions, the MLE is the most efficient estimator.
• We will use the ML method most of the time in this course.
Slide 17
III.2 Significance test on π
• Test H0 : π = π0 v.s. Ha : π ≠ π0 based on data y ∼ Bin(n, π).
• The MLE π̂ = p = y/n has properties:
E(p) = π, σ(p) = √(π(1 − π)/n) (standard error).
• Three classical tests:
1. Wald test (less reliable):
Z = (p − π0)/√(p(1 − p)/n), or Z² = [(p − π0)/√(p(1 − p)/n)]².
Compare Z to N(0, 1), or compare Z² to χ²_1 if n is large.
That is, if |Z| ≥ z_{α/2} or Z² ≥ χ²_{1,α}, then we reject H0 at the significance level α.
Large-sample p-value = 2P[Z ≥ |z|] = P[χ²_1 ≥ z²].
Slide 18
For two-sided: p-value = 2*pnorm(-abs(z)) = 2*(1-pnorm(abs(z))), or p-value = 1-pchisq(z^2, 1).
2. Score test (more reliable):
Z = (p − π0)/√(π0(1 − π0)/n), or Z² = [(p − π0)/√(π0(1 − π0)/n)]².
Compare Z to N(0, 1), or compare Z² to χ²_1 if n is large.
That is, if |Z| ≥ z_{α/2} or Z² ≥ χ²_{1,α}, then we reject H0 at the significance level α.
Large-sample p-value = 2P[Z ≥ |z|] = P[χ²_1 ≥ z²].
Slide 19
For two-sided: p-value=2*pnorm(-abs(z)) = 2* (1-pnorm(abs(z)))
or p-value = 1-pchisq(z^2, 1)
3. Likelihood ratio test (LRT):
ℓ0 = y log π0 + (n − y) log(1 − π0)
ℓ1 = y log p + (n − y) log(1 − p)

G² = 2(ℓ1 − ℓ0)
   = 2[y(log p − log π0) + (n − y){log(1 − p) − log(1 − π0)}]
   = 2[y log{p/π0} + (n − y) log{(1 − p)/(1 − π0)}]
   = 2[y log{np/(nπ0)} + (n − y) log{n(1 − p)/(n(1 − π0))}]
   = 2[y log{y/(nπ0)} + (n − y) log{(n − y)/(n − nπ0)}]
   = 2 Σ_{2 cells} obs. × log(obs./exp.)
Slide 20
Compare G² to χ²_1.
That is, if G² ≥ χ²_{1,α}, then we reject H0 at the significance level α.
Large-sample p-value = P[χ²_1 ≥ G²].
Slide 21
• Example: In 2002 GSS, 400 out of 893 responded yes to “...for a
pregnant woman to obtain a legal abortion if ...”
• Test H0 : π = 0.5 v.s. Ha : π ≠ 0.5 at significance level 0.05.
• p = y/n = 400/893 = 0.448.
1. Wald test:
z = (p − π0)/√(p(1 − p)/n) = (0.448 − 0.5)/√(0.448 × (1 − 0.448)/893) = −3.12.
Since z < −1.96, reject H0 at 0.05 significance level.
Large sample p-value = 2P[Z ≥ |−3.12|] = 0.0018.
Slide 22
2. Score test:
z = (p − π0)/√(π0(1 − π0)/n) = (0.448 − 0.5)/√(0.5 × (1 − 0.5)/893) = −3.11.
Since z < −1.96, reject H0 at 0.05 significance level.
Large sample p-value = 2P[Z ≥ |−3.11|] = 0.0019.
Slide 23
For two-sided: p-value = 2*pnorm(-abs(-3.11)) = 0.001870873, or p-value = 2*(1-pnorm(abs(-3.11))) = 0.001870873,
or p-value = 1-pchisq((-3.11)^2, 1) = 0.001870873.
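A numeric check of both z statistics for the GSS example (the slides round p to 0.448, so full-precision values differ slightly in the third digit):

```python
from math import sqrt, erfc

y, n, pi0 = 400, 893, 0.5
p = y / n

z_wald = (p - pi0) / sqrt(p * (1 - p) / n)       # SE estimated at p
z_score = (p - pi0) / sqrt(pi0 * (1 - pi0) / n)  # SE evaluated at pi0
# two-sided large-sample p-value: 2*(1 - Phi(|z|)) = erfc(|z|/sqrt(2))
pval_score = erfc(abs(z_score) / sqrt(2))
```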
3. LRT:
G² = 2 Σ_{2 cells} obs. × log(obs./exp.)
   = 2[400 × log{400/(893 × 0.5)} + (893 − 400) × log{(893 − 400)/(893 − 893 × 0.5)}]
   = 9.7 > 1.96² = 3.84,
⇒ Reject H0 at 0.05 significance level.
Large sample p-value = P [χ21 ≥ 9.7] = 0.0018.
• Note: These three tests can be extended to test other parameters.
Slide 24
For two-sided: p-value = 1-pchisq(9.7, 1) = 0.00184268
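The same G² can be checked in a few lines; for 1 df the chi-square tail probability equals 2(1 − Φ(√G²)), which the stdlib gives via erfc:

```python
from math import log, sqrt, erfc

y, n, pi0 = 400, 893, 0.5
exp_yes, exp_no = n * pi0, n * (1 - pi0)  # expected counts under H0

G2 = 2 * (y * log(y / exp_yes) + (n - y) * log((n - y) / exp_no))
# For 1 df: P[chi2_1 >= G2] = 2*(1 - Phi(sqrt(G2))) = erfc(sqrt(G2/2))
pval = erfc(sqrt(G2 / 2))
```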
III.3 Large-Sample Confidence Interval (CI) for π
• Wald CI of π: For given confidence level 1 − α, solve the following inequality for π0:
|p − π0|/√(p(1 − p)/n) ≤ z_{α/2}
⇒ [p − z_{α/2}√(p(1 − p)/n), p + z_{α/2}√(p(1 − p)/n)].
Note: √(p(1 − p)/n) is called the estimated standard error (SE) of p.
The Wald CI has the form: Est. ± z_{α/2} SE.
For the 2002 GSS example, a 95% Wald CI for π is:
[0.448 − 1.96√(0.448(1 − 0.448)/893), 0.448 + 1.96√(0.448(1 − 0.448)/893)] = [0.415, 0.481]
Slide 25
Critical value: z_{α/2} = qnorm(0.975) = 1.959964
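The Wald CI for the GSS example in a few lines of Python (stdlib only):

```python
from math import sqrt

y, n, z = 400, 893, 1.959964  # z = qnorm(0.975)
p = y / n
se = sqrt(p * (1 - p) / n)    # estimated standard error of p
wald = (p - z * se, p + z * se)
```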
Note. The Wald CI is not very reliable for small n and p ≈ 0 or 1.
Remedy for 95% CI: add 2 successes and 2 failures to the data and
re-construct the 95% Wald CI.
For example, y = 2, n = 10, 95% Wald CI:
[0.2 − 1.96 × √(0.2 × 0.8/10), 0.2 + 1.96 × √(0.2 × 0.8/10)] = [−0.048, 0.448].
With the remedy, y* = 4, n* = 14, p* = 4/14 = 0.286, and the 95% Wald CI is
[0.286 − 1.96 × √(0.286 × 0.714/14), 0.286 + 1.96 × √(0.286 × 0.714/14)] = [0.049, 0.523].
Slide 26
This is called the Agresti–Coull confidence interval.
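A sketch of the remedy (add 2 successes and 2 failures, then recompute the Wald CI):

```python
from math import sqrt

def wald_ci(y, n, z=1.96):
    # ordinary Wald CI: p +/- z * sqrt(p(1-p)/n)
    p = y / n
    half = z * sqrt(p * (1 - p) / n)
    return p - half, p + half

plain = wald_ci(2, 10)           # lower limit dips below 0
fixed = wald_ci(2 + 2, 10 + 4)   # Agresti-Coull "plus 4" remedy
```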
• Score CI of π: For given confidence level 1 − α, solve the following inequality for π0:
|p − π0|/√(π0(1 − π0)/n) ≤ z_{α/2}
For the 2002 GSS example, a 95% score CI solves
|0.448 − π0|/√(π0(1 − π0)/893) ≤ 1.96
⇒ [0.416, 0.481].
Note: Here the sample size n is very large, so the Wald CI and the score CI are very close.
Slide 27
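Squaring the score inequality gives a quadratic in π0, whose roots yield a closed form (often called the Wilson interval); a sketch:

```python
from math import sqrt

def score_ci(y, n, z=1.959964):
    # roots of (p - pi0)^2 = z^2 * pi0*(1 - pi0)/n, a quadratic in pi0
    p = y / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

lo, hi = score_ci(400, 893)
```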
Absolute values of the score statistic as a function of π0
Slide 28
• Likelihood ratio CI: For given confidence level 1 − α, solve for π0:
2[y log{y/(nπ0)} + (n − y) log{(n − y)/(n − nπ0)}] ≤ z²_{α/2}.
• For the 2002 GSS example, a 95% LR CI solves:
2[400 log{400/(893π0)} + (893 − 400) log{(893 − 400)/(893 − 893π0)}] ≤ 1.96²
⇒ [0.415, 0.481].
Slide 29
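There is no closed form here, but G²(π0) is 0 at π0 = p and increases toward each side, so each endpoint can be found by bisection; a sketch:

```python
from math import log

y, n = 400, 893
crit = 1.959964**2  # chi-square(1) critical value = z_{0.025}^2

def G2(pi0):
    # LR statistic as a function of pi0
    return 2 * (y * log(y / (n * pi0)) + (n - y) * log((n - y) / (n - n * pi0)))

def bisect(f, a, b, tol=1e-9):
    # simple bisection; assumes f(a) and f(b) have opposite signs
    while b - a > tol:
        m = (a + b) / 2
        if f(a) * f(m) <= 0:
            b = m
        else:
            a = m
    return (a + b) / 2

p = y / n
lo = bisect(lambda t: G2(t) - crit, 1e-6, p)
hi = bisect(lambda t: G2(t) - crit, p, 1 - 1e-6)
```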
LRT statistic as a function of π0
Slide 30
• Note: We see from the GSS example that, for large sample size n, the Wald, score, and LR CIs are all very close. However, if n is not large, there will be some discrepancy among them.
• For example, if y = 9, n = 10, then:
1. Wald CI: [0.714, 1.086] = [0.714, 1]
2. Score CI: [0.596, 0.982]
3. LR CI: [0.628, 0.994]
Slide 31
IV. Other Inference Approaches
IV.1 Small-sample inference for π in Bin(n, π)
1. One-sided test: H0 : π = π0 v.s. Ha : π > π0.
Given data y ∼ Bin(n, π), the testing procedure would be: Reject H0
if y is large.
Exact p-value = P [Y ≥ y|H0].
For example, H0 : π = 0.5 v.s. Ha : π > 0.5, and y = 6, n = 10. Then
exact p-value = P [Y ≥ 6|π = 0.5] = 0.377.
Slide 32
2. Two-sided test: H0 : π = π0 v.s. Ha : π ≠ π0.
Given data y ∼ Bin(n, π), the testing procedure would be: Reject H0
if |y − nπ0| is large.
Exact p-value = P[|Y − nπ0| ≥ |y − nπ0| | H0].
For example, H0 : π = 0.5 v.s. Ha : π ≠ 0.5, and y = 6, n = 10. Then
exact p-value = P[|Y − 10 × 0.5| ≥ |6 − 10 × 0.5| | H0]
= P [|Y − 5| ≥ 1|H0]
= P [Y − 5 ≥ 1|H0] + P [Y − 5 ≤ −1|H0]
= P [Y ≥ 6|H0] + P [Y ≤ 4|H0]
= 0.377 + 0.377 = 0.754.
Using exact p-value can be conservative!
Slide 33
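Both exact p-values above can be verified directly from the Bin(10, 0.5) pmf:

```python
from math import comb

def pmf(k, n, pi):
    return comb(n, k) * pi**k * (1 - pi)**(n - k)

n, pi0, y = 10, 0.5, 6

p_one = sum(pmf(k, n, pi0) for k in range(y, n + 1))  # P[Y >= 6 | H0]
# two-sided: P[|Y - 5| >= 1 | H0] = P[Y >= 6] + P[Y <= 4]
p_two = p_one + sum(pmf(k, n, pi0) for k in range(0, 5))
```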
Slide 34
• Using exact p-value is conservative!
For example, suppose we are testing H0 : π = 0.5 v.s. Ha : π > 0.5 at significance level α = 0.05 using data y from Bin(n = 10, π). Then based on Table 1.2, we should reject H0 only if y = 9 or y = 10. However, the actual type I error probability is 0.011 < α = 0.05. Conservative!
Slide 35
IV.2 Inference based on the mid p-value
• For testing H0 : π = 0.5 v.s. Ha : π > 0.5 with data y from Bin(n, π),
we calculate the
mid p-value = 0.5 P[Y = y|H0] + P[Y = y + 1|H0] + · · · + P[Y = n|H0].
For example, suppose y = 9, n = 10, then
mid p-value = 0.5P [Y = 9|H0] + P[Y = 10|H0] = 0.006.
With the use of mid p-value, we will reject H0 : π = 0.5 in favor of
Ha : π > 0.5 if y = 8, 9, 10. The actual type I error probability is
0.055, much closer to the significance level α = 0.05.
Slide 36
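Both numbers on this slide (the mid p-value 0.006 and the type I error 0.055 of the rule "reject when y = 8, 9, 10") are quick to verify:

```python
from math import comb

def pmf(k, n, pi):
    return comb(n, k) * pi**k * (1 - pi)**(n - k)

n, pi0, y = 10, 0.5, 9
mid_p = 0.5 * pmf(y, n, pi0) + sum(pmf(k, n, pi0) for k in range(y + 1, n + 1))

# actual type I error of the mid-p rule "reject when y = 8, 9, 10"
type1 = sum(pmf(k, n, pi0) for k in range(8, n + 1))
```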
IV.3 Exact confidence interval for π using exact p-value
• For given confidence level 1 − α and observed y ∼ Bin(n, π), solve
P_π[Y ≥ y] = Σ_{i=y}^n (n choose i) π^i (1 − π)^(n−i) = α/2
to get the lower limit π̂_L; if y = 0, then set π̂_L = 0.
Solve
P_π[Y ≤ y] = Σ_{i=0}^y (n choose i) π^i (1 − π)^(n−i) = α/2
to get the upper limit π̂_U; if y = n, then set π̂_U = 1.
⇒ [π̂_L, π̂_U] is an exact (1 − α) CI for π.
• For example, y = 3, n = 10, an exact 95% CI is [0.07, 0.65]. That is,
Pπ=0.07[Y ≥ 3] = 0.025, Pπ=0.65[Y ≤ 3] = 0.025.
This exact CI is conservative, that is, too wide.
Slide 37
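Each tail equation is monotone in π, so both limits can be found by bisection. A sketch of this Clopper–Pearson-style construction (the function name exact_ci is ours):

```python
from math import comb

def pmf(k, n, pi):
    return comb(n, k) * pi**k * (1 - pi)**(n - k)

def exact_ci(y, n, alpha=0.05, tol=1e-10):
    # exact CI: solve each tail equation P = alpha/2 by bisection
    def upper_tail(pi):  # P_pi[Y >= y], increasing in pi
        return sum(pmf(k, n, pi) for k in range(y, n + 1))
    def lower_tail(pi):  # P_pi[Y <= y], decreasing in pi
        return sum(pmf(k, n, pi) for k in range(0, y + 1))
    def solve(f, increasing):
        a, b = 0.0, 1.0
        while b - a > tol:
            m = (a + b) / 2
            if (f(m) < alpha / 2) == increasing:
                a = m
            else:
                b = m
        return (a + b) / 2
    lo = 0.0 if y == 0 else solve(upper_tail, True)
    hi = 1.0 if y == n else solve(lower_tail, False)
    return lo, hi

lo, hi = exact_ci(3, 10)
```

The endpoints round to the [0.07, 0.65] reported on the slide.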
P [Y ≥ 3|π] (—) and P [Y ≤ 3|π] (...) as functions of π
Slide 38
IV.4 Exact confidence interval for π using exact mid p-value
• For given confidence level 1 − α and observed y ∼ Bin(n, π), solve
(1/2) P_π[Y = y] + P_π[Y > y] = α/2
to get the lower limit π̂_L; if y = 0, then π̂_L = 0.
Solve
(1/2) P_π[Y = y] + P_π[Y < y] = α/2
to get the upper limit π̂_U; if y = n, then π̂_U = 1.
⇒ [π̂_L, π̂_U] is an exact (1 − α) CI for π using the mid p-value.
• For example, y = 3, n = 10, an exact 95% CI is [0.08, 0.62]. That is,
(1/2) P_{π=0.08}[Y = 3] + P_{π=0.08}[Y > 3] = 0.025
(1/2) P_{π=0.62}[Y = 3] + P_{π=0.62}[Y < 3] = 0.025.
Slide 39
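The same bisection idea works for the mid-p version, with half weight on P[Y = y] in each tail equation; a sketch (the function name midp_ci is ours):

```python
from math import comb

def pmf(k, n, pi):
    return comb(n, k) * pi**k * (1 - pi)**(n - k)

def midp_ci(y, n, alpha=0.05, tol=1e-10):
    # mid-p CI: half weight on P[Y = y] in each tail equation
    def upper(pi):  # 0.5*P[Y = y] + P[Y > y], increasing in pi
        return 0.5 * pmf(y, n, pi) + sum(pmf(k, n, pi) for k in range(y + 1, n + 1))
    def lower(pi):  # 0.5*P[Y = y] + P[Y < y], decreasing in pi
        return 0.5 * pmf(y, n, pi) + sum(pmf(k, n, pi) for k in range(0, y))
    def solve(f, increasing):
        a, b = 0.0, 1.0
        while b - a > tol:
            m = (a + b) / 2
            if (f(m) < alpha / 2) == increasing:
                a = m
            else:
                b = m
        return (a + b) / 2
    lo = 0.0 if y == 0 else solve(upper, True)
    hi = 1.0 if y == n else solve(lower, False)
    return lo, hi

lo, hi = midp_ci(3, 10)
```

The endpoints round to the [0.08, 0.62] on this slide, a shorter interval than the exact CI's [0.07, 0.65], as claimed.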