ST 544 © D. Zhang
ST 544: Applied Categorical Data Analysis
Daowen Zhang
http://www4.stat.ncsu.edu/~dzhang2
Slide 1
TABLE OF CONTENTS ST 544, D. Zhang
Contents
1 Introduction 3
2 Contingency Tables 40
3 Generalized Linear Models (GLMs) 122
4 Logistic Regression 189
5 Building and Applying Logistic Regression Models 248
6 Multicategory Logit Models 299
8 Models for Matched Pairs 366
9 Modeling Correlated, Clustered, Longitudinal Categorical Data 435
10 Random Effects: Generalized Linear Mixed Models (GLMMs) 480
Slide 2
CHAPTER 1 ST 544, D. Zhang
1 Introduction
I. Categorical Data
Definition
• A categorical variable is a (random) variable that can only take finite
or countably many values (categories).
• Types of categorical variables:
⋆ Gender: F/M or 0/1; Race: White, Black, Others – Nominal
⋆ Patient's Health Status: Excellent, Good, Fair, Bad – Ordinal
⋆ # of car accidents in next Jan in Wake County – Interval
Slide 3
• Application of math operations:

  Type           | Nominal       | Ordinal                  | Interval            | Continuous
  Example        | Gender, Race  | Patient's Health Status  | # of car accidents  | Height
  Math Operation | None          | >, <                     | >, <, ±             | Any
• Response (Dependent) Variable: Y
Explanatory (Independent, Covariate) Variable: X.
• We focus on the cases where Y is categorical.
Slide 4
II. Common Distributions
II.1 Binomial distribution
• We have a Bernoulli process:
1. n independent trials, n > 0 – fixed integer
2. Each trial produces 1 of 2 outcomes: S for success & F for failure
3. Success probability at each trial is the same (π ∈ (0, 1))
• Y = total # of successes out of n trials, Y ∼ Bin(n, π) and has a
probability mass function (pmf):
p(y) = P[Y = y] = n!/(y!(n − y)!) · π^y (1 − π)^(n−y),  y = 0, 1, 2, ..., n.

The coefficient n!/(y!(n − y)!) is usually denoted (n choose y), and is the nCr function on your calculator.
• The above pmf is useful in calculating probabilities associated with a
binomial distribution (for a known π).
Slide 5
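The pmf above is easy to evaluate directly. A minimal Python sketch, stdlib only (the function name binom_pmf is ours, not from any package; the R equivalent is dbinom):

```python
from math import comb

def binom_pmf(y, n, pi):
    # P[Y = y] for Y ~ Bin(n, pi), straight from the pmf on this slide
    return comb(n, y) * pi**y * (1 - pi)**(n - y)

# Sanity check: the pmf sums to 1 over y = 0, 1, ..., n, as any pmf must.
total = sum(binom_pmf(y, 10, 0.3) for y in range(11))
```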
Slide 6
• Examples: Suppose two people (A and B) are to play n = 10 chess games with no ties. Assume the games are independent of each other and π = P[A wins B in a single game] = 0.6.
1. Find the prob that A wins 4 games.
P[Y = 4] = (10 choose 4) × 0.6^4 × (1 − 0.6)^(10−4) = 0.1115
2. Find the prob that A wins at least 4 games.
P [Y ≥ 4] = 1− P [Y ≤ 3] = 1− 0.0548 = 0.9452.
3. Find the prob that B wins more than A.
P [10− Y > Y ] = P [Y < 10/2 = 5] = P [Y ≤ 4] = 0.1662.
Slide 7
In R: dbinom(y, n, pi) gives P(Y = y); pbinom(y, n, pi) gives P(Y ≤ y); qbinom(q, n, pi) returns the smallest y with P(Y ≤ y) ≥ q.
1) dbinom(4, 10, 0.6); 2) 1 - pbinom(3, 10, 0.6); 3) pbinom(4, 10, 0.6)
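As a cross-check of the three chess-game answers above (and of the R calls in the annotation), a stdlib-only Python version:

```python
from math import comb

def binom_pmf(y, n, pi):
    # binomial pmf: C(n, y) * pi^y * (1 - pi)^(n - y)
    return comb(n, y) * pi**y * (1 - pi)**(n - y)

n, pi = 10, 0.6
p1 = binom_pmf(4, n, pi)                             # 1) P[Y = 4]
p2 = 1 - sum(binom_pmf(k, n, pi) for k in range(4))  # 2) P[Y >= 4]
p3 = sum(binom_pmf(k, n, pi) for k in range(5))      # 3) P[Y <= 4]
```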
• Properties of a binomial distribution Y ∼ Bin(n, π):
1. Y = Y1 + Y2 + · · · + Yn, where Yi = 1/0 indicates success in the ith trial, and Yi is independent of Yj for i ≠ j.
2. Mean, variance and standard deviation of Y:
E(Y) = nπ
var(Y) = nπ(1 − π)
σ = √var(Y) = √(nπ(1 − π))
3. Y has smaller variation when π is closer to 0 or 1.
• When n is large, Bin(n, π) can be well approximated by a normal dist.
Requirement: nπ ≥ 5 & n(1− π) ≥ 5.
Slide 8
For example: π = 0.5 requires n ≥ 10; π = 0.1 (or π = 0.9) requires n ≥ 50.
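The quality of the normal approximation can be checked numerically. A sketch comparing the exact Bin(12, 0.5) CDF with its normal approximation; the continuity correction (evaluating at 7.5 rather than 7) is our addition, not on the slide:

```python
from math import comb, erf, sqrt

def binom_cdf(k, n, pi):
    # exact P[Y <= k] for Y ~ Bin(n, pi)
    return sum(comb(n, j) * pi**j * (1 - pi)**(n - j) for j in range(k + 1))

def norm_cdf(x, mu, sigma):
    # standard normal CDF shifted/scaled, via the error function
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

n, pi = 12, 0.5
mu, sigma = n * pi, sqrt(n * pi * (1 - pi))
exact = binom_cdf(7, n, pi)
approx = norm_cdf(7.5, mu, sigma)  # continuity-corrected normal approximation
```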
Normal Approximation to Bin(12, 0.5)
Slide 9
The binomial distribution is symmetric when π = 0.50. For fixed n, it becomes more bell-shaped as π gets closer to 0.50. For fixed π, it becomes more bell-shaped as n increases. When n is large, it can be approximated by a normal distribution with μ = nπ and σ = √(nπ(1 − π)). A guideline is that the expected numbers of outcomes of the two types, nπ and n(1 − π), should both be at least about 5. For π = 0.50 this requires only n ≥ 10, whereas π = 0.10 (or π = 0.90) requires n ≥ 50. When π gets nearer to 0 or 1, larger samples are needed before a symmetric, bell shape occurs.
II.2 Multinomial distribution (for nominal or ordinal categorical variables)
Y      1    2    · · ·    c
Prob   π1   π2   · · ·    πc

where πj = P[Y = j] > 0 and Σ_{j=1}^c πj = 1.
• Each of the n trials results in an outcome in one (and only one) of the c categories, represented by the vector
Ỹi = (Yi1, Yi2, ..., Yic)^T, i = 1, 2, ..., n. For example, Ỹi = (0, 1, ..., 0)^T.
Only one of {Yij}_{j=1}^c is 1; the others are 0; πj = P[Yij = 1].
• Prob of observing Ỹi: π1^{Yi1} π2^{Yi2} · · · πc^{Yic}
Slide 10
• Often we may not have the individual outcomes. Instead, we have the following summary:
ñ = (n1, n2, ..., nc)^T,
where nj is the # of trials resulting in an outcome in the jth category. That is, nj = Σ_{i=1}^n Yij.
• The probability of observing ñ is
p(n1, n2, ..., nc) = n!/(n1! n2! · · · nc!) π1^{n1} π2^{n2} · · · πc^{nc}.
• We often denote n˜ ∼ multinomial(n, (π1, ..., πc)).
Slide 11
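The multinomial pmf above can be coded up directly. A stdlib-only sketch; the example counts and probabilities are illustrative numbers of ours, not from the slides:

```python
from math import factorial, prod

def multinom_pmf(counts, probs):
    # p(n1, ..., nc) = n!/(n1! ... nc!) * pi1^n1 * ... * pic^nc
    n = sum(counts)
    coef = factorial(n)
    for nj in counts:
        coef //= factorial(nj)
    return coef * prod(p**nj for p, nj in zip(probs, counts))

# Illustrative: n = 5 trials over c = 3 categories
p_example = multinom_pmf([2, 2, 1], [0.5, 0.3, 0.2])
# With c = 2 the formula reduces to the binomial pmf
p_binom = multinom_pmf([4, 6], [0.6, 0.4])
```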
• In practice, we want to keep the data in the original form of Ỹi, i.e., the category the ith observation fell into, together with other covariate information if such information is available. This is especially the case if each i represents a subject and we would like to use the covariate information to predict which category individual i most likely falls into (regression setting).
Slide 12
• Properties of a multinomial distribution:
1. nj ∼ Bin(n, πj) ⇒ E(nj) = nπj, var(nj) = nπj(1 − πj).
2. ni and nj (i ≠ j) are negatively associated:
cov(ni, nj) = −nπiπj, i ≠ j.
• ñ can be written:
ñ = (n1, n2, ..., nc)^T = Σ_{i=1}^n Ỹi.
By the CLT, ñ approximately has a (multivariate) normal distribution when n is large.
Slide 13
III. Large-Sample Inference on π in a Binomial Distribution
III.1 Likelihood function and maximum likelihood estimation (MLE)
• The parameter π in Bin(n, π) is usually unknown and we would like to
learn about π based on data y from Bin(n, π).
• An intuitive estimate of π is the sample proportion
p = y/n = (y1 + y2 + ... + yn)/n.
1. p is an unbiased estimator (as a random variable): E(p) = π.
2. p becomes more accurate as n gets larger: var(p) = π(1 − π)/n.
3. When n is large, p has an approximate normal distribution (sampling distribution).
Slide 14
http://people.uncw.edu/chenc/STT215/PPT/Stt210%20chapter07.pptx
https://istats.shinyapps.io/SampDist_Prop/
• Sample proportion p is the MLE of π:
1. Given data y ∼ Bin(n, π), we exchange the roles of y and π in the
pmf and treat it as a function of π:
L(π) = (n choose y) π^y (1 − π)^(n−y).
This function is called the likelihood function of π for given data y.
2. For example, if y = 6 out of n = 10 Bernoulli trials, the likelihood
function of π is
L(π) = (10 choose 6) π^6 (1 − π)^(10−6) = 210 π^6 (1 − π)^4.
3. Intuitively, the best estimate of π would be the one that maximizes
this likelihood or the log-likelihood:
ℓ(π) = const + y log(π) + (n − y) log(1 − π).
Note that we use natural log here.
4. It can be shown that the MLE π̂ of π is p = y/n.
Slide 15
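The claim that the maximizer is p = y/n is easy to see numerically. A grid-search sketch over the log-likelihood for y = 6, n = 10 (the grid resolution is an arbitrary choice of ours):

```python
from math import log

n, y = 10, 6

def loglik(pi):
    # log-likelihood up to a constant: y*log(pi) + (n - y)*log(1 - pi)
    return y * log(pi) + (n - y) * log(1 - pi)

# Grid search over (0, 1); the maximizer should be the sample proportion y/n
grid = [i / 10000 for i in range(1, 10000)]
pi_hat = max(grid, key=loglik)
```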
Slide 16
• In general, the MLE of a parameter has many good statistical
properties:
1. When sample size n is large, an MLE is approximately unbiased.
2. When sample size n is large, the variance of an MLE → 0.
3. When sample size n is large, an MLE has an approximate normal
distribution.
4. Under some conditions, the MLE is the most efficient estimator.
• We will use the ML method most of the time in this course.
Slide 17
III.2 Significance test on π
• Test H0 : π = π0 v.s. Ha : π ≠ π0 based on data y ∼ Bin(n, π).
• The MLE π̂ = p = y/n has properties:
E(p) = π, σ(p) = √(π(1 − π)/n) (standard error).
• Three classical tests:
1. Wald test (less reliable):
Z = (p − π0)/√(p(1 − p)/n), or Z² = [(p − π0)/√(p(1 − p)/n)]².
Compare Z to N(0, 1), or compare Z² to χ²_1 if n is large.
That is, if |Z| ≥ z_{α/2} or Z² ≥ χ²_{1,α}, then we reject H0 at the significance level α.
Large-sample p-value = 2P[Z ≥ |z|] = P[χ²_1 ≥ z²].
Slide 18
For two-sided: p-value = 2*pnorm(-abs(z)) = 2*(1-pnorm(abs(z))), or p-value = 1-pchisq(z^2, 1).
2. Score test (more reliable):
Z = (p − π0)/√(π0(1 − π0)/n), or Z² = [(p − π0)/√(π0(1 − π0)/n)]².
Compare Z to N(0, 1), or compare Z² to χ²_1 if n is large.
That is, if |Z| ≥ z_{α/2} or Z² ≥ χ²_{1,α}, then we reject H0 at the significance level α.
Large-sample p-value = 2P[Z ≥ |z|] = P[χ²_1 ≥ z²].
Slide 19
For two-sided: p-value=2*pnorm(-abs(z)) = 2* (1-pnorm(abs(z)))
or p-value = 1-pchisq(z^2, 1)
3. Likelihood ratio test (LRT):
ℓ0 = y log π0 + (n − y) log(1 − π0)
ℓ1 = y log p + (n − y) log(1 − p)

G² = 2(ℓ1 − ℓ0)
   = 2[y(log p − log π0) + (n − y){log(1 − p) − log(1 − π0)}]
   = 2[y log{p/π0} + (n − y) log{(1 − p)/(1 − π0)}]
   = 2[y log{np/(nπ0)} + (n − y) log{n(1 − p)/(n(1 − π0))}]
   = 2[y log{y/(nπ0)} + (n − y) log{(n − y)/(n − nπ0)}]
   = 2 Σ_{2 cells} obs. × log(obs./exp.)
Slide 20
Compare G² to χ²_1.
That is, if G² ≥ χ²_{1,α}, then we reject H0 at the significance level α.
Large-sample p-value = P[χ²_1 ≥ G²].
Slide 21
• Example: In 2002 GSS, 400 out of 893 responded yes to “...for a
pregnant woman to obtain a legal abortion if ...”
• Test H0 : π = 0.5 v.s. Ha : π ≠ 0.5 at significance level 0.05.
• p = y/n = 400/893 = 0.448.
1. Wald test:
z = (p − π0)/√(p(1 − p)/n) = (0.448 − 0.5)/√(0.448 × (1 − 0.448)/893) = −3.12.
Since z < −1.96, reject H0 at 0.05 significance level.
Large sample p-value = 2P[Z ≥ |−3.12|] = 0.0018.
Slide 22
2. Score test:
z = (p − π0)/√(π0(1 − π0)/n) = (0.448 − 0.5)/√(0.5 × (1 − 0.5)/893) = −3.11.
Since z < −1.96, reject H0 at 0.05 significance level.
Large sample p-value = 2P[Z ≥ |−3.11|] = 0.0019.
Slide 23
For two-sided: p-value = 2*pnorm(-abs(-3.11)) = 0.001870873, or p-value = 2*(1-pnorm(abs(-3.11))) = 0.001870873,
or p-value = 1-pchisq((-3.11)^2, 1) = 0.001870873.
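A numeric check of both z statistics for the GSS example (the slides round p to 0.448, so full-precision values differ slightly in the third digit):

```python
from math import sqrt, erfc

y, n, pi0 = 400, 893, 0.5
p = y / n

z_wald = (p - pi0) / sqrt(p * (1 - p) / n)       # SE estimated at p
z_score = (p - pi0) / sqrt(pi0 * (1 - pi0) / n)  # SE evaluated at pi0
# two-sided large-sample p-value: 2*(1 - Phi(|z|)) = erfc(|z|/sqrt(2))
pval_score = erfc(abs(z_score) / sqrt(2))
```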
3. LRT:
G² = 2 Σ_{2 cells} obs. × log(obs./exp.)
   = 2[400 × log{400/(893 × 0.5)} + (893 − 400) × log{(893 − 400)/(893 − 893 × 0.5)}]
   = 9.7 > 1.96² = 3.84,
⇒ Reject H0 at 0.05 significance level.
Large sample p-value = P [χ21 ≥ 9.7] = 0.0018.
• Note: These three tests can be extended to test other parameters.
Slide 24
For two-sided: p-value = 1-pchisq(9.7, 1) = 0.00184268
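The same G² can be checked in a few lines; for 1 df the chi-square tail probability equals 2(1 − Φ(√G²)), which the stdlib gives via erfc:

```python
from math import log, sqrt, erfc

y, n, pi0 = 400, 893, 0.5
exp_yes, exp_no = n * pi0, n * (1 - pi0)  # expected counts under H0

G2 = 2 * (y * log(y / exp_yes) + (n - y) * log((n - y) / exp_no))
# For 1 df: P[chi2_1 >= G2] = 2*(1 - Phi(sqrt(G2))) = erfc(sqrt(G2/2))
pval = erfc(sqrt(G2 / 2))
```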
III.3 Large-Sample Confidence Interval (CI) for π
• Wald CI of π: For given confidence level 1 − α, solve the following inequality for π0:
|p − π0|/√(p(1 − p)/n) ≤ z_{α/2}
⇒ [p − z_{α/2}√(p(1 − p)/n), p + z_{α/2}√(p(1 − p)/n)].
Note: √(p(1 − p)/n) is called the estimated standard error (SE) of p.
The Wald CI has the form: Est. ± z_{α/2} SE.
For the 2002 GSS example, a 95% Wald CI for π is:
[0.448 − 1.96√(0.448(1 − 0.448)/893), 0.448 + 1.96√(0.448(1 − 0.448)/893)] = [0.415, 0.481]
Slide 25
Critical value: z_{α/2} = qnorm(0.975) = 1.959964
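The Wald CI for the GSS example in a few lines of Python (stdlib only):

```python
from math import sqrt

y, n, z = 400, 893, 1.959964  # z = qnorm(0.975)
p = y / n
se = sqrt(p * (1 - p) / n)    # estimated standard error of p
wald = (p - z * se, p + z * se)
```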
Note. The Wald CI is not very reliable for small n and p ≈ 0 or 1.
Remedy for 95% CI: add 2 successes and 2 failures to the data and
re-construct the 95% Wald CI.
For example, y = 2, n = 10, 95% Wald CI:
[0.2 − 1.96 × √(0.2 × 0.8/10), 0.2 + 1.96 × √(0.2 × 0.8/10)] = [−0.048, 0.448].
With the remedy, y* = 4, n* = 14, p* = 4/14 = 0.286, and the 95% Wald CI is
[0.286 − 1.96 × √(0.286 × 0.714/14), 0.286 + 1.96 × √(0.286 × 0.714/14)] = [0.049, 0.523].
Slide 26
This is called the Agresti–Coull confidence interval.
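A sketch of the remedy (add 2 successes and 2 failures, then recompute the Wald CI):

```python
from math import sqrt

def wald_ci(y, n, z=1.96):
    # ordinary Wald CI: p +/- z * sqrt(p(1-p)/n)
    p = y / n
    half = z * sqrt(p * (1 - p) / n)
    return p - half, p + half

plain = wald_ci(2, 10)           # lower limit dips below 0
fixed = wald_ci(2 + 2, 10 + 4)   # Agresti-Coull "plus 4" remedy
```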
• Score CI of π: For given confidence level 1 − α, solve the following inequality for π0:
|p − π0|/√(π0(1 − π0)/n) ≤ z_{α/2}
For the 2002 GSS example, a 95% score CI solves
|0.448 − π0|/√(π0(1 − π0)/893) ≤ 1.96
⇒ [0.416, 0.481].
Note: Here the sample size n is very large, so the Wald CI and the score CI are very close.
Slide 27
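Squaring the score inequality gives a quadratic in π0, whose roots yield a closed form (often called the Wilson interval); a sketch:

```python
from math import sqrt

def score_ci(y, n, z=1.959964):
    # roots of (p - pi0)^2 = z^2 * pi0*(1 - pi0)/n, a quadratic in pi0
    p = y / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

lo, hi = score_ci(400, 893)
```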
Absolute values of the score statistic as a function of π0
Slide 28
• Likelihood ratio CI: For given confidence level 1 − α, solve for π0:
2[y log{y/(nπ0)} + (n − y) log{(n − y)/(n − nπ0)}] ≤ z²_{α/2}.
• For the 2002 GSS example, a 95% LR CI solves:
2[400 log{400/(893π0)} + (893 − 400) log{(893 − 400)/(893 − 893π0)}] ≤ 1.96²
⇒ [0.415, 0.481].
Slide 29
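There is no closed form here, but G²(π0) is 0 at π0 = p and increases toward each side, so each endpoint can be found by bisection; a sketch:

```python
from math import log

y, n = 400, 893
crit = 1.959964**2  # chi-square(1) critical value = z_{0.025}^2

def G2(pi0):
    # LR statistic as a function of pi0
    return 2 * (y * log(y / (n * pi0)) + (n - y) * log((n - y) / (n - n * pi0)))

def bisect(f, a, b, tol=1e-9):
    # simple bisection; assumes f(a) and f(b) have opposite signs
    while b - a > tol:
        m = (a + b) / 2
        if f(a) * f(m) <= 0:
            b = m
        else:
            a = m
    return (a + b) / 2

p = y / n
lo = bisect(lambda t: G2(t) - crit, 1e-6, p)
hi = bisect(lambda t: G2(t) - crit, p, 1 - 1e-6)
```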
LRT statistic as a function of π0
Slide 30
• Note: We see from the GSS example that, for large sample size n, the Wald, score, and LR CIs are all very close. However, if n is not large, there will be some discrepancy among them.
• For example, if y = 9, n = 10, then:
1. Wald CI: [0.714, 1.086] = [0.714, 1]
2. Score CI: [0.596, 0.982]
3. LR CI: [0.628, 0.994]
Slide 31
IV. Other Inference Approaches
IV.1 Small-sample inference for π in Bin(n, π)
1. One-sided test: H0 : π = π0 v.s. Ha : π > π0.
Given data y ∼ Bin(n, π), the testing procedure would be: Reject H0
if y is large.
Exact p-value = P [Y ≥ y|H0].
For example, H0 : π = 0.5 v.s. Ha : π > 0.5, and y = 6, n = 10. Then
exact p-value = P [Y ≥ 6|π = 0.5] = 0.377.
Slide 32
2. Two-sided test: H0 : π = π0 v.s. Ha : π ≠ π0.
Given data y ∼ Bin(n, π), the testing procedure would be: Reject H0
if |y − nπ0| is large.
Exact p-value = P[|Y − nπ0| ≥ |y − nπ0| | H0].
For example, H0 : π = 0.5 v.s. Ha : π ≠ 0.5, and y = 6, n = 10. Then
exact p-value = P[|Y − 10 × 0.5| ≥ |6 − 10 × 0.5| | H0]
= P [|Y − 5| ≥ 1|H0]
= P [Y − 5 ≥ 1|H0] + P [Y − 5 ≤ −1|H0]
= P [Y ≥ 6|H0] + P [Y ≤ 4|H0]
= 0.377 + 0.377 = 0.754.
Using exact p-value can be conservative!
Slide 33
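Both exact p-values above can be verified directly from the Bin(10, 0.5) pmf:

```python
from math import comb

def pmf(k, n, pi):
    return comb(n, k) * pi**k * (1 - pi)**(n - k)

n, pi0, y = 10, 0.5, 6

p_one = sum(pmf(k, n, pi0) for k in range(y, n + 1))  # P[Y >= 6 | H0]
# two-sided: P[|Y - 5| >= 1 | H0] = P[Y >= 6] + P[Y <= 4]
p_two = p_one + sum(pmf(k, n, pi0) for k in range(0, 5))
```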
Slide 34
• Using exact p-value is conservative!
For example, suppose we are testing H0 : π = 0.5 v.s. Ha : π > 0.5 at significance level α = 0.05 using data y from Bin(n = 10, π). Then based on Table 1.2, we should reject H0 only if y = 9 or y = 10. However, the actual type I error probability is 0.011 < α = 0.05. Conservative!
Slide 35
IV.2 Inference based on the mid p-value
• For testing H0 : π = 0.5 v.s. Ha : π > 0.5 with data y from Bin(n, π),
we calculate the
mid p-value = 0.5 P[Y = y|H0] + P[Y = y + 1|H0] + · · · + P[Y = n|H0].
For example, suppose y = 9, n = 10, then
mid p-value = 0.5P [Y = 9|H0] + P[Y = 10|H0] = 0.006.
With the use of mid p-value, we will reject H0 : π = 0.5 in favor of
Ha : π > 0.5 if y = 8, 9, 10. The actual type I error probability is
0.055, much closer to the significance level α = 0.05.
Slide 36
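Both numbers on this slide (the mid p-value 0.006 and the type I error 0.055 of the rule "reject when y = 8, 9, 10") are quick to verify:

```python
from math import comb

def pmf(k, n, pi):
    return comb(n, k) * pi**k * (1 - pi)**(n - k)

n, pi0, y = 10, 0.5, 9
mid_p = 0.5 * pmf(y, n, pi0) + sum(pmf(k, n, pi0) for k in range(y + 1, n + 1))

# actual type I error of the mid-p rule "reject when y = 8, 9, 10"
type1 = sum(pmf(k, n, pi0) for k in range(8, n + 1))
```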
IV.3 Exact confidence interval for π using exact p-value
• For given confidence level 1 − α and observed y ∼ Bin(n, π), solve
P_π[Y ≥ y] = Σ_{i=y}^n (n choose i) π^i (1 − π)^(n−i) = α/2
to get the lower limit π̂_L; if y = 0, then set π̂_L = 0.
Solve
P_π[Y ≤ y] = Σ_{i=0}^y (n choose i) π^i (1 − π)^(n−i) = α/2
to get the upper limit π̂_U; if y = n, then set π̂_U = 1.
⇒ [π̂_L, π̂_U] is an exact (1 − α) CI for π.
• For example, y = 3, n = 10, an exact 95% CI is [0.07, 0.65]. That is,
Pπ=0.07[Y ≥ 3] = 0.025, Pπ=0.65[Y ≤ 3] = 0.025.
This exact CI is conservative, that is, too wide.
Slide 37
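Each tail equation is monotone in π, so both limits can be found by bisection. A sketch of this Clopper–Pearson-style construction (the function name exact_ci is ours):

```python
from math import comb

def pmf(k, n, pi):
    return comb(n, k) * pi**k * (1 - pi)**(n - k)

def exact_ci(y, n, alpha=0.05, tol=1e-10):
    # exact CI: solve each tail equation P = alpha/2 by bisection
    def upper_tail(pi):  # P_pi[Y >= y], increasing in pi
        return sum(pmf(k, n, pi) for k in range(y, n + 1))
    def lower_tail(pi):  # P_pi[Y <= y], decreasing in pi
        return sum(pmf(k, n, pi) for k in range(0, y + 1))
    def solve(f, increasing):
        a, b = 0.0, 1.0
        while b - a > tol:
            m = (a + b) / 2
            if (f(m) < alpha / 2) == increasing:
                a = m
            else:
                b = m
        return (a + b) / 2
    lo = 0.0 if y == 0 else solve(upper_tail, True)
    hi = 1.0 if y == n else solve(lower_tail, False)
    return lo, hi

lo, hi = exact_ci(3, 10)
```

The endpoints round to the [0.07, 0.65] reported on the slide.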
P [Y ≥ 3|π] (—) and P [Y ≤ 3|π] (...) as functions of π
Slide 38
IV.4 Exact confidence interval for π using exact mid p-value
• For given confidence level 1 − α and observed y ∼ Bin(n, π), solve
(1/2) P_π[Y = y] + P_π[Y > y] = α/2
to get the lower limit π̂_L; if y = 0, then π̂_L = 0.
Solve
(1/2) P_π[Y = y] + P_π[Y < y] = α/2
to get the upper limit π̂_U; if y = n, then π̂_U = 1.
⇒ [π̂_L, π̂_U] is an exact (1 − α) CI for π using the mid p-value.
• For example, y = 3, n = 10, an exact 95% CI is [0.08, 0.62]. That is,
(1/2) P_{π=0.08}[Y = 3] + P_{π=0.08}[Y > 3] = 0.025
(1/2) P_{π=0.62}[Y = 3] + P_{π=0.62}[Y < 3] = 0.025.
Slide 39
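The same bisection idea works for the mid-p version, with half weight on P[Y = y] in each tail equation; a sketch (the function name midp_ci is ours):

```python
from math import comb

def pmf(k, n, pi):
    return comb(n, k) * pi**k * (1 - pi)**(n - k)

def midp_ci(y, n, alpha=0.05, tol=1e-10):
    # mid-p CI: half weight on P[Y = y] in each tail equation
    def upper(pi):  # 0.5*P[Y = y] + P[Y > y], increasing in pi
        return 0.5 * pmf(y, n, pi) + sum(pmf(k, n, pi) for k in range(y + 1, n + 1))
    def lower(pi):  # 0.5*P[Y = y] + P[Y < y], decreasing in pi
        return 0.5 * pmf(y, n, pi) + sum(pmf(k, n, pi) for k in range(0, y))
    def solve(f, increasing):
        a, b = 0.0, 1.0
        while b - a > tol:
            m = (a + b) / 2
            if (f(m) < alpha / 2) == increasing:
                a = m
            else:
                b = m
        return (a + b) / 2
    lo = 0.0 if y == 0 else solve(upper, True)
    hi = 1.0 if y == n else solve(lower, False)
    return lo, hi

lo, hi = midp_ci(3, 10)
```

The endpoints round to the [0.08, 0.62] on this slide, a shorter interval than the exact CI's [0.07, 0.65], as claimed.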