TRANSCRIPT
Likelihood Ratio Test in High-Dimensional Logistic Regression Is Asymptotically a Rescaled Chi-Square
Yuxin Chen
Electrical Engineering, Princeton University
Coauthors
Pragya Sur, Stanford Statistics
Emmanuel Candès, Stanford Statistics & Math
2/ 26
In memory of Tom Cover (1938 - 2012)
Tom @ Stanford EE
“We all know the feeling that follows when one investigates a problem, goes through a large amount of algebra, and finally investigates the answer to find that the entire problem is illuminated not by the analysis but by the inspection of the answer”
3/ 26
Inference in regression problems

Example: logistic regression

y_i ∼ logistic-model(x_i^T β), 1 ≤ i ≤ n

One wishes to determine which covariate is of importance, i.e.

β_j = 0 vs. β_j ≠ 0 (1 ≤ j ≤ p)

4/ 26
Classical tests

β_j = 0 vs. β_j ≠ 0 (1 ≤ j ≤ p)

Standard approaches (widely used in R, Matlab, etc.): use asymptotic distributions of certain statistics

• Wald test: Wald statistic → χ²
• Likelihood ratio test: log-likelihood ratio statistic → χ²
• Score test: score → N(0, Fisher Info)
• ...

5/ 26
Example: logistic regression in R (n = 100, p = 30)

> fit = glm(y ~ X, family = binomial)
> summary(fit)

Call:
glm(formula = y ~ X, family = binomial)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.7727  -0.8718   0.3307   0.8637   2.3141

Coefficients:
             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)  0.086602    0.247561    0.350   0.72647
X1           0.268556    0.307134    0.874   0.38190
X2           0.412231    0.291916    1.412   0.15790
X3           0.667540    0.363664    1.836   0.06642 .
X4          -0.293916    0.331553   -0.886   0.37536
X5           0.207629    0.272031    0.763   0.44531
X6           1.104661    0.345493    3.197   0.00139 **
...
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 138.47 on 99 degrees of freedom
Residual deviance: 104.33 on 69 degrees of freedom
AIC: 166.33

Number of Fisher Scoring iterations: 5
Can these inference calculations (e.g. p-values) be trusted?
6/ 26
This talk: likelihood ratio test (LRT)

β_j = 0 vs. β_j ≠ 0 (1 ≤ j ≤ p)

Log-likelihood ratio (LLR) statistic

LLR_j := ℓ(β̂) − ℓ(β̂(−j))

• ℓ(·): log-likelihood
• β̂ = arg max_β ℓ(β): unconstrained MLE
• β̂(−j) = arg max_{β: β_j = 0} ℓ(β): constrained MLE

7/ 26
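As a concrete illustration, the LLR statistic above can be computed by fitting the unconstrained and constrained MLEs directly. The following is a minimal sketch, not from the slides: the data, the dimensions, and the use of scipy's BFGS optimizer are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def neg_loglik(beta, X, y):
    # negative log-likelihood for labels y_i in {-1, +1}:
    # -ell(beta) = sum_i log(1 + exp(-y_i * X_i^T beta))
    return np.sum(np.logaddexp(0.0, -y * (X @ beta)))

rng = np.random.default_rng(0)
n, p, j = 200, 5, 0
X = rng.standard_normal((n, p))
y = rng.choice([-1.0, 1.0], size=n)        # global null: beta = 0

# unconstrained MLE
full = minimize(neg_loglik, np.zeros(p), args=(X, y), method="BFGS")

# constrained MLE with beta_j = 0: simply drop column j
restr = minimize(neg_loglik, np.zeros(p - 1),
                 args=(np.delete(X, j, axis=1), y), method="BFGS")

# LLR_j = ell(full) - ell(restricted) = restr.fun - full.fun >= 0
llr_j = restr.fun - full.fun
print(2.0 * llr_j)
```

In the classical regime (p fixed, n large), repeating this over many draws would produce 2 LLR_j values distributed approximately as χ²_1.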
Wilks’ phenomenon ’1938

Samuel Wilks, Princeton

assess significance of coefficients

β_j = 0 vs. β_j ≠ 0 (1 ≤ j ≤ p)

LRT asymptotically follows a chi-square distribution (under the null):

2 LLR_j →d χ²_1 (p fixed, n → ∞)

8/ 26
Classical LRT in high dimensions

Linear regression: y = Xβ + η, with η i.i.d. Gaussian

p = 1200, n = 4000

[figure: histogram of p-values, approximately uniform]

For linear regression (with Gaussian noise) in high dimensions, 2 LLR_j ∼ χ²_1 (classical test always works)

9/ 26
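This claim is easy to check numerically: with known noise variance σ² = 1, 2 LLR_j equals the drop in residual sum of squares between the restricted and full least-squares fits, which is exactly χ²_1-distributed under the null for any n > p. A small sketch (dimensions and seed are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, j = 400, 120, 0
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)
beta[j] = 0.0                           # null hypothesis beta_j = 0 holds
y = X @ beta + rng.standard_normal(n)   # noise variance sigma^2 = 1

def rss(A, y):
    # residual sum of squares of the least-squares fit of y on A
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ coef
    return r @ r

# with sigma = 1:  2*LLR_j = RSS(restricted) - RSS(full)
two_llr = rss(np.delete(X, j, axis=1), y) - rss(X, y)
print(two_llr)   # a single draw from chi^2_1 (exactly, not just asymptotically)
```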
Classical LRT in high dimensions

Logistic regression: y ∼ logistic-model(Xβ)

p = 1200, n = 4000

[figure: histogram of classical p-values, highly non-uniform]

Wilks’ theorem seems inadequate in accommodating logistic regression in high dimensions

10/ 26
Bartlett correction? (n = 4000, p = 1200)

[figure: p-value histograms under three calibrations: classical Wilks, Bartlett-corrected, rescaled χ²]

• Bartlett correction (a finite-sample effect): 2 LLR_j / (1 + α_n/n) ∼ χ²_1
  ◦ p-values are still non-uniform → this is NOT a finite-sample effect
• What happens in high dimensions?
• A glimpse of our theory: LRT follows a rescaled χ² distribution

11/ 26
Problem formulation (formal)

[diagram: unknown signal β, design X (n rows, p columns), observation y, with likelihood f_{y|X}(y | x)]
• Gaussian design: X_i i.i.d. ∼ N(0, Σ)
• Logistic model: for 1 ≤ i ≤ n,
  y_i = +1 with prob. 1/(1 + exp(−X_i^T β)),
  y_i = −1 with prob. 1/(1 + exp(X_i^T β))
• Proportional growth: p/n → constant
• Global null: β = 0

12/ 26
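The setup above can be written out directly. A minimal sketch of the data-generating model; Σ = I and the specific dimensions are illustrative choices, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 400, 80                      # proportional growth: p/n = 0.2
X = rng.standard_normal((n, p))     # rows X_i i.i.d. ~ N(0, I)
beta = np.zeros(p)                  # global null: beta = 0

# logistic model: P(y_i = +1 | X_i) = 1 / (1 + exp(-X_i^T beta))
prob_pos = 1.0 / (1.0 + np.exp(-(X @ beta)))
y = np.where(rng.random(n) < prob_pos, 1.0, -1.0)
```

Under the global null, prob_pos is identically 1/2, so the labels are pure coin flips independent of X.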
When does MLE exist?

(MLE)  maximize_β  ℓ(β) = −∑_{i=1}^n log{1 + exp(−y_i X_i^T β)}   (note ℓ(β) ≤ 0)

MLE is unbounded if ∃ a perfect separating hyperplane

If ∃ a hyperplane that perfectly separates the {y_i}, i.e.

∃ β s.t. y_i X_i^T β > 0 for all i,

then the MLE is unbounded:

lim_{a→∞} ℓ(aβ) = 0   (with aβ unbounded)

13/ 26
When does MLE exist?

Separating capacity (Tom Cover, Ph.D. thesis ’1965)

[figure: labeled point configurations for n = 2, 4, 12: small samples are separable, larger ones are not]

number of samples n increases ⇒ more difficult to find a separating hyperplane

Theorem 1 (Cover ’1965)
Under i.i.d. Gaussian design, a separating hyperplane exists with high prob. iff n/p < 2 (asymptotically)

14/ 26
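Cover's threshold can be probed empirically by casting perfect separation as a linear-program feasibility problem: y_i X_i^T b > 0 for all i has a solution iff y_i X_i^T b ≥ 1 does (rescale b). A Monte Carlo sketch, assuming scipy's linprog is available; the dimensions are illustrative:

```python
import numpy as np
from scipy.optimize import linprog

def separable(X, y):
    # Feasibility LP with zero objective: does b exist with
    # y_i * X_i^T b >= 1 for all i?
    n, p = X.shape
    res = linprog(c=np.zeros(p),
                  A_ub=-(y[:, None] * X), b_ub=-np.ones(n),
                  bounds=[(None, None)] * p, method="highs")
    return res.status == 0          # 0 = feasible optimum, 2 = infeasible

rng = np.random.default_rng(3)
p = 40
results = {}
for n in (40, 160):                 # n/p = 1 (< 2) vs. n/p = 4 (> 2)
    X = rng.standard_normal((n, p))
    y = rng.choice([-1.0, 1.0], size=n)
    results[n] = separable(X, y)

print(results)   # separable when n/p < 2, not separable when n/p > 2 (w.h.p.)
```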
Main result: asymptotic distribution of LRT

Theorem 2 (Sur, Chen, Candès ’2017)
Suppose n/p > 2. Under i.i.d. Gaussian design and the global null,

2 LLR_j →d α(p/n) χ²_1   (a rescaled χ²)

• α(p/n) can be determined by solving a system of 2 nonlinear equations in 2 unknowns (τ, b):

  τ² = (n/p) · E[(Ψ(τZ; b))²]
  p/n = E[Ψ′(τZ; b)]

  where Z ∼ N(0, 1), Ψ is some operator, and α(p/n) = τ²/b
  ◦ α(·) depends only on the aspect ratio p/n
  ◦ It is not a finite-sample effect
  ◦ α(0) = 1: matches classical theory

15/ 26
Our adjusted LRT theory in practice

[figures: rescaling constant α(p/n) for the logistic model, increasing from 1 as κ = p/n grows; histogram of adjusted p-values ≈ Unif(0, 1)]

Empirically, LRT ≈ rescaled χ²_1 (as predicted)
16/ 26
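In practice, the adjustment amounts to referring 2 LLR_j to α(p/n) · χ²_1 instead of χ²_1. A sketch of the p-value computation; both the observed statistic and the value of alpha below are hypothetical placeholders (a real alpha comes from solving the two-equation system in Theorem 2 for the given aspect ratio p/n):

```python
import numpy as np
from scipy.stats import chi2

two_llr = 5.2   # hypothetical observed value of 2 * LLR_j
alpha = 1.5     # placeholder for alpha(p/n); NOT a value from the slides

p_classical = chi2.sf(two_llr, df=1)           # Wilks: 2*LLR ~ chi^2_1
p_adjusted = chi2.sf(two_llr / alpha, df=1)    # rescaled: 2*LLR ~ alpha * chi^2_1

print(p_classical, p_adjusted)   # adjusted p-value is larger whenever alpha > 1
```

Since α(p/n) > 1 in high dimensions, classical p-values are systematically too small, which is exactly the non-uniformity seen in the histograms.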
Validity of tail approximation

[figure: empirical CDF of adjusted p-values on t ∈ [0, 0.01], n = 4000, p = 1200]

The empirical CDF is in near-perfect agreement with the diagonal, suggesting that our theory is remarkably accurate even when we zoom in on the tail
17/ 26
Efficacy under moderate sample sizes

[figure: empirical CDF of adjusted p-values, n = 200, p = 60, close to the diagonal]

Our theory seems adequate for moderately large samples
18/ 26
Universality: non-Gaussian covariates

[figure: p-value histograms under three calibrations: classical Wilks, Bartlett-corrected, rescaled χ²]

i.i.d. Bernoulli design, n = 4000, p = 1200
19/ 26
Connection to convex geometry

[figure: separating hyperplane exists vs. no separating hyperplane]

For ‘p/n > 1/2’, MLE does not exist

• Class labels: y_i ∈ {0, 1} or y_i ∈ {−1, 1}
• Perfect separation: ∃ b s.t. sign(y_i X_i^T b) = 1 ∀ i
• Under the global null, sign(y_i X_i^T b) =d sign(X_i^T b), and sign(X_i^T b) = 1 ∀ i ⟺ Xb > 0
• WLOG, if y_1 = · · · = y_n = 1, then
  separability = {range(X) ∩ R^n_+ ≠ {0}},
  which can be analyzed via convex geometry (e.g. Amelunxen et al.)
• Prob. that a random subspace intersects the positive orthant: P({Xb | b ∈ R^p} ∩ R^n_+ ≠ {0})?
  High-dim. geometry: this prob. is 1 + o(1) if
  δ({Xβ | β ∈ R^p}) + δ(R^n_+) > n + o(n)  ⟺  p/n > 1/2 + o(1)
  (Amelunxen et al. ’14)

20/ 26
Connection to robust M-estimation

Since y_i = ±1 and X_i i.i.d. ∼ N(0, Σ),

maximize_β ℓ(β) = −∑_{i=1}^n log{1 + exp(−y_i X_i^T β)}
               =d −∑_{i=1}^n log{1 + exp(−X_i^T β)} =: −∑_{i=1}^n ρ(−X_i^T β)

⇒ MLE β̂ = arg min_β ∑_{i=1}^n ρ(y_i − X_i^T β) with y = 0   (robust M-estimation)

21/ 26
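The reduction can be verified numerically: after the distributional step that sets every label to +1, the negative log-likelihood coincides, term by term, with the M-estimation objective ∑_i ρ(y_i − X_i^T β) evaluated at pseudo-responses y = 0 with ρ(t) = log(1 + e^t). A small sketch (dimensions and seed are illustrative):

```python
import numpy as np

def neg_loglik(beta, X):
    # logistic negative log-likelihood with all labels equal to +1
    return np.sum(np.logaddexp(0.0, -(X @ beta)))

def m_objective(beta, X, y):
    rho = lambda t: np.logaddexp(0.0, t)   # rho(t) = log(1 + e^t)
    return np.sum(rho(y - X @ beta))

rng = np.random.default_rng(4)
n, p = 50, 10
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)

a = neg_loglik(beta, X)
b = m_objective(beta, X, np.zeros(n))      # pseudo-responses y = 0
print(abs(a - b))
```

The two objectives are the same function of β, so the logistic MLE inherits the behavior of robust M-estimators in high dimensions.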
Connection to robust M-estimation

MLE β̂ = arg min_β ∑_{i=1}^n ρ(y_i − X_i^T β)   (robust M-estimation)

Variance inflation as p/n grows (El Karoui et al. ’13; Donoho, Montanari ’13)

[figure: rescaling constant α(p/n), increasing in κ = p/n]

variance inflation → increasing rescaling factor
22/ 26
This is not just about logistic regression

Our theory is applicable to
• logistic model
• probit model
• linear model (under Gaussian noise)
  ◦ rescaling constant α(p/n) ≡ 1 (consistent with classical theory)
• linear model (under non-Gaussian noise)
• · · ·
23/ 26
Key ingredients in establishing our theory

Key step is to realize that

2 LLR_j →d (p / b(p/n)) · β̂_j²   (a rescaled χ²)

where b(·) depends only on p/n, and β̂_j ∼ N(0, √(α(p/n) b(p/n)) / √p) (second argument: standard deviation)

1. Convex geometry: show ‖β̂‖ = O(1)
2. Approximate message passing: determine the asymptotic distribution of ‖β̂‖
3. Leave-one-out analysis: characterize the marginal distribution of β̂_j (rescaled Gaussian) and ensure higher-order terms vanish
24/ 26
A small sample of prior work
• Candès, Fan, Janson, Lv ’16: observed empirically the nonuniformity of p-values in logistic regression
• Fan, Demirkaya, Lv ’17: classical asymptotic normality of the MLE (basis of the Wald test) fails to hold in logistic regression when p ≫ n^{2/3}
• El Karoui, Bean, Bickel, Lim, Yu ’13; El Karoui ’13; Donoho, Montanari ’13; El Karoui ’15: robust M-estimation for linear models (assuming strong convexity)
25/ 26
Summary
• Caution needs to be exercised when applying classical statistical procedures in the high-dimensional regime, a regime of growing interest and relevance
• What shall we do under the non-null (β ≠ 0)?

Paper: “The likelihood ratio test in high-dimensional logistic regression is asymptotically a rescaled chi-square”, Pragya Sur, Yuxin Chen, Emmanuel Candès, 2017.
26/ 26