TRANSCRIPT
Likelihood Ratio Test in High-Dimensional Logistic Regression Is Asymptotically a Rescaled Chi-Square
Yuxin Chen
Electrical Engineering, Princeton University
Coauthors
Pragya Sur, Stanford Statistics
Emmanuel Candès, Stanford Statistics & Math
2/ 26
In memory of Tom Cover (1938 - 2012)
Tom @ Stanford EE
“We all know the feeling that follows when one investigates a problem, goes through a large amount of algebra, and finally investigates the answer to find that the entire problem is illuminated not by the analysis but by the inspection of the answer”
3/ 26
Inference in regression problems

Example: logistic regression

y_i ∼ logistic-model(x_i^T β), 1 ≤ i ≤ n

One wishes to determine which covariate is of importance, i.e.

β_j = 0 vs. β_j ≠ 0 (1 ≤ j ≤ p)

4/ 26
Classical tests

β_j = 0 vs. β_j ≠ 0 (1 ≤ j ≤ p)

Standard approaches (widely used in R, Matlab, etc.): use asymptotic distributions of certain statistics

• Wald test: Wald statistic → χ²
• Likelihood ratio test: log-likelihood ratio statistic → χ²
• Score test: score → N(0, Fisher Info)
• ...

5/ 26
Example: logistic regression in R (n = 100, p = 30)

> fit = glm(y ~ X, family = binomial)
> summary(fit)

Call:
glm(formula = y ~ X, family = binomial)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.7727  -0.8718   0.3307   0.8637   2.3141

Coefficients:
             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)  0.086602    0.247561    0.350   0.72647
X1           0.268556    0.307134    0.874   0.38190
X2           0.412231    0.291916    1.412   0.15790
X3           0.667540    0.363664    1.836   0.06642 .
X4          -0.293916    0.331553   -0.886   0.37536
X5           0.207629    0.272031    0.763   0.44531
X6           1.104661    0.345493    3.197   0.00139 **
...
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 138.47 on 99 degrees of freedom
Residual deviance: 104.33 on 69 degrees of freedom
AIC: 166.33

Number of Fisher Scoring iterations: 5
Can these inference calculations (e.g. p-values) be trusted?
6/ 26
This talk: likelihood ratio test (LRT)

β_j = 0 vs. β_j ≠ 0 (1 ≤ j ≤ p)

Log-likelihood ratio (LLR) statistic

LLR_j := ℓ(β̂) − ℓ(β̂(−j))

• ℓ(·): log-likelihood
• β̂ = arg max_β ℓ(β): unconstrained MLE
• β̂(−j) = arg max_{β: β_j = 0} ℓ(β): constrained MLE

7/ 26
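As a concrete illustration, the LLR statistic above can be computed by fitting the unconstrained and constrained MLEs directly. The following is a minimal sketch, not from the slides: the data, the dimensions, and the use of scipy's BFGS optimizer are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def neg_loglik(beta, X, y):
    # negative log-likelihood for labels y_i in {-1, +1}:
    # -ell(beta) = sum_i log(1 + exp(-y_i * X_i^T beta))
    return np.sum(np.logaddexp(0.0, -y * (X @ beta)))

rng = np.random.default_rng(0)
n, p, j = 200, 5, 0
X = rng.standard_normal((n, p))
y = rng.choice([-1.0, 1.0], size=n)        # global null: beta = 0

# unconstrained MLE
full = minimize(neg_loglik, np.zeros(p), args=(X, y), method="BFGS")

# constrained MLE with beta_j = 0: simply drop column j
restr = minimize(neg_loglik, np.zeros(p - 1),
                 args=(np.delete(X, j, axis=1), y), method="BFGS")

# LLR_j = ell(full) - ell(restricted) = restr.fun - full.fun >= 0
llr_j = restr.fun - full.fun
print(2.0 * llr_j)
```

In the classical regime (p fixed, n large), repeating this over many draws would produce 2 LLR_j values distributed approximately as χ²_1.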
Wilks’ phenomenon ’1938

Samuel Wilks, Princeton

assess significance of coefficients

β_j = 0 vs. β_j ≠ 0 (1 ≤ j ≤ p)

LRT asymptotically follows a chi-square distribution (under the null):

2 LLR_j →d χ²_1 (p fixed, n → ∞)

8/ 26
Classical LRT in high dimensions

Linear regression: y = Xβ + η, with η i.i.d. Gaussian

p = 1200, n = 4000

[figure: histogram of p-values, approximately uniform]

For linear regression (with Gaussian noise) in high dimensions, 2 LLR_j ∼ χ²_1 (classical test always works)

9/ 26
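This claim is easy to check numerically: with known noise variance σ² = 1, 2 LLR_j equals the drop in residual sum of squares between the restricted and full least-squares fits, which is exactly χ²_1-distributed under the null for any n > p. A small sketch (dimensions and seed are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, j = 400, 120, 0
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)
beta[j] = 0.0                           # null hypothesis beta_j = 0 holds
y = X @ beta + rng.standard_normal(n)   # noise variance sigma^2 = 1

def rss(A, y):
    # residual sum of squares of the least-squares fit of y on A
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ coef
    return r @ r

# with sigma = 1:  2*LLR_j = RSS(restricted) - RSS(full)
two_llr = rss(np.delete(X, j, axis=1), y) - rss(X, y)
print(two_llr)   # a single draw from chi^2_1 (exactly, not just asymptotically)
```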
Classical LRT in high dimensions

Logistic regression: y ∼ logistic-model(Xβ)

p = 1200, n = 4000

[figure: histogram of classical p-values, highly non-uniform]

Wilks’ theorem seems inadequate in accommodating logistic regression in high dimensions

10/ 26
Bartlett correction? (n = 4000, p = 1200)

[figure: p-value histograms under three calibrations: classical Wilks, Bartlett-corrected, rescaled χ²]

• Bartlett correction (a finite-sample effect): 2 LLR_j / (1 + α_n/n) ∼ χ²_1
  ◦ p-values are still non-uniform → this is NOT a finite-sample effect
• What happens in high dimensions?
• A glimpse of our theory: LRT follows a rescaled χ² distribution

11/ 26
Problem formulation (formal)

[diagram: unknown signal β, design X (n rows, p columns), observation y, with likelihood f_{y|X}(y | x)]
• Gaussian design: X_i i.i.d. ∼ N(0, Σ)
• Logistic model: for 1 ≤ i ≤ n,
  y_i = +1 with prob. 1/(1 + exp(−X_i^T β)),
  y_i = −1 with prob. 1/(1 + exp(X_i^T β))
• Proportional growth: p/n → constant
• Global null: β = 0

12/ 26
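The setup above can be written out directly. A minimal sketch of the data-generating model; Σ = I and the specific dimensions are illustrative choices, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 400, 80                      # proportional growth: p/n = 0.2
X = rng.standard_normal((n, p))     # rows X_i i.i.d. ~ N(0, I)
beta = np.zeros(p)                  # global null: beta = 0

# logistic model: P(y_i = +1 | X_i) = 1 / (1 + exp(-X_i^T beta))
prob_pos = 1.0 / (1.0 + np.exp(-(X @ beta)))
y = np.where(rng.random(n) < prob_pos, 1.0, -1.0)
```

Under the global null, prob_pos is identically 1/2, so the labels are pure coin flips independent of X.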
When does MLE exist?

(MLE)  maximize_β  ℓ(β) = −∑_{i=1}^n log{1 + exp(−y_i X_i^T β)}   (note ℓ(β) ≤ 0)

MLE is unbounded if ∃ a perfect separating hyperplane

If ∃ a hyperplane that perfectly separates the {y_i}, i.e.

∃ β s.t. y_i X_i^T β > 0 for all i,

then the MLE is unbounded:

lim_{a→∞} ℓ(aβ) = 0   (with aβ unbounded)

13/ 26
When does MLE exist?

Separating capacity (Tom Cover, Ph.D. thesis ’1965)

[figure: labeled point configurations for n = 2, 4, 12: small samples are separable, larger ones are not]

number of samples n increases ⇒ more difficult to find a separating hyperplane

Theorem 1 (Cover ’1965)
Under i.i.d. Gaussian design, a separating hyperplane exists with high prob. iff n/p < 2 (asymptotically)

14/ 26
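Cover's threshold can be probed empirically by casting perfect separation as a linear-program feasibility problem: y_i X_i^T b > 0 for all i has a solution iff y_i X_i^T b ≥ 1 does (rescale b). A Monte Carlo sketch, assuming scipy's linprog is available; the dimensions are illustrative:

```python
import numpy as np
from scipy.optimize import linprog

def separable(X, y):
    # Feasibility LP with zero objective: does b exist with
    # y_i * X_i^T b >= 1 for all i?
    n, p = X.shape
    res = linprog(c=np.zeros(p),
                  A_ub=-(y[:, None] * X), b_ub=-np.ones(n),
                  bounds=[(None, None)] * p, method="highs")
    return res.status == 0          # 0 = feasible optimum, 2 = infeasible

rng = np.random.default_rng(3)
p = 40
results = {}
for n in (40, 160):                 # n/p = 1 (< 2) vs. n/p = 4 (> 2)
    X = rng.standard_normal((n, p))
    y = rng.choice([-1.0, 1.0], size=n)
    results[n] = separable(X, y)

print(results)   # separable when n/p < 2, not separable when n/p > 2 (w.h.p.)
```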
Main result: asymptotic distribution of LRT

Theorem 2 (Sur, Chen, Candès ’2017)
Suppose n/p > 2. Under i.i.d. Gaussian design and the global null,

2 LLR_j →d α(p/n) χ²_1   (a rescaled χ²)

• α(p/n) can be determined by solving a system of 2 nonlinear equations in 2 unknowns (τ, b):

  τ² = (n/p) · E[(Ψ(τZ; b))²]
  p/n = E[Ψ′(τZ; b)]

  where Z ∼ N(0, 1), Ψ is some operator, and α(p/n) = τ²/b
  ◦ α(·) depends only on the aspect ratio p/n
  ◦ It is not a finite-sample effect
  ◦ α(0) = 1: matches classical theory

15/ 26
Our adjusted LRT theory in practice

[figures: rescaling constant α(p/n) for the logistic model, increasing from 1 as κ = p/n grows; histogram of adjusted p-values ≈ Unif(0, 1)]

Empirically, LRT ≈ rescaled χ²_1 (as predicted)
16/ 26
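In practice, the adjustment amounts to referring 2 LLR_j to α(p/n) · χ²_1 instead of χ²_1. A sketch of the p-value computation; both the observed statistic and the value of alpha below are hypothetical placeholders (a real alpha comes from solving the two-equation system in Theorem 2 for the given aspect ratio p/n):

```python
import numpy as np
from scipy.stats import chi2

two_llr = 5.2   # hypothetical observed value of 2 * LLR_j
alpha = 1.5     # placeholder for alpha(p/n); NOT a value from the slides

p_classical = chi2.sf(two_llr, df=1)           # Wilks: 2*LLR ~ chi^2_1
p_adjusted = chi2.sf(two_llr / alpha, df=1)    # rescaled: 2*LLR ~ alpha * chi^2_1

print(p_classical, p_adjusted)   # adjusted p-value is larger whenever alpha > 1
```

Since α(p/n) > 1 in high dimensions, classical p-values are systematically too small, which is exactly the non-uniformity seen in the histograms.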
Validity of tail approximation

[figure: empirical CDF of adjusted p-values on t ∈ [0, 0.01], n = 4000, p = 1200]

The empirical CDF is in near-perfect agreement with the diagonal, suggesting that our theory is remarkably accurate even when we zoom in on the tail
17/ 26
Efficacy under moderate sample sizes

[figure: empirical CDF of adjusted p-values, n = 200, p = 60, close to the diagonal]

Our theory seems adequate for moderately large samples
18/ 26
Universality: non-Gaussian covariates

[figure: p-value histograms under three calibrations: classical Wilks, Bartlett-corrected, rescaled χ²]

i.i.d. Bernoulli design, n = 4000, p = 1200
19/ 26
Connection to convex geometry

[figure: separating hyperplane exists vs. no separating hyperplane]

For ‘p/n > 1/2’, MLE does not exist

• Class labels: y_i ∈ {0, 1} or y_i ∈ {−1, 1}
• Perfect separation: ∃ b s.t. sign(y_i X_i^T b) = 1 ∀ i
• Under the global null, sign(y_i X_i^T b) =d sign(X_i^T b), and sign(X_i^T b) = 1 ∀ i ⟺ Xb > 0
• WLOG, if y_1 = · · · = y_n = 1, then
  separability = {range(X) ∩ R^n_+ ≠ {0}},
  which can be analyzed via convex geometry (e.g. Amelunxen et al.)
• Prob. that a random subspace intersects the positive orthant: P({Xb | b ∈ R^p} ∩ R^n_+ ≠ {0})?
  High-dim. geometry: this prob. is 1 + o(1) if
  δ({Xβ | β ∈ R^p}) + δ(R^n_+) > n + o(n)  ⟺  p/n > 1/2 + o(1)
  (Amelunxen et al. ’14)

20/ 26
Connection to robust M-estimation

Since y_i = ±1 and X_i i.i.d. ∼ N(0, Σ),

maximize_β ℓ(β) = −∑_{i=1}^n log{1 + exp(−y_i X_i^T β)}
               =d −∑_{i=1}^n log{1 + exp(−X_i^T β)} =: −∑_{i=1}^n ρ(−X_i^T β)

⇒ MLE β̂ = arg min_β ∑_{i=1}^n ρ(y_i − X_i^T β) with y = 0   (robust M-estimation)

21/ 26
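The reduction can be verified numerically: after the distributional step that sets every label to +1, the negative log-likelihood coincides, term by term, with the M-estimation objective ∑_i ρ(y_i − X_i^T β) evaluated at pseudo-responses y = 0 with ρ(t) = log(1 + e^t). A small sketch (dimensions and seed are illustrative):

```python
import numpy as np

def neg_loglik(beta, X):
    # logistic negative log-likelihood with all labels equal to +1
    return np.sum(np.logaddexp(0.0, -(X @ beta)))

def m_objective(beta, X, y):
    rho = lambda t: np.logaddexp(0.0, t)   # rho(t) = log(1 + e^t)
    return np.sum(rho(y - X @ beta))

rng = np.random.default_rng(4)
n, p = 50, 10
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)

a = neg_loglik(beta, X)
b = m_objective(beta, X, np.zeros(n))      # pseudo-responses y = 0
print(abs(a - b))
```

The two objectives are the same function of β, so the logistic MLE inherits the behavior of robust M-estimators in high dimensions.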
Connection to robust M-estimation

MLE β̂ = arg min_β ∑_{i=1}^n ρ(y_i − X_i^T β)   (robust M-estimation)

Variance inflation as p/n grows (El Karoui et al. ’13; Donoho, Montanari ’13)

[figure: rescaling constant α(p/n), increasing in κ = p/n]

variance inflation → increasing rescaling factor
22/ 26
This is not just about logistic regression

Our theory is applicable to
• logistic model
• probit model
• linear model (under Gaussian noise)
  ◦ rescaling constant α(p/n) ≡ 1 (consistent with classical theory)
• linear model (under non-Gaussian noise)
• · · ·
23/ 26
Key ingredients in establishing our theory

Key step is to realize that

2 LLR_j →d (p / b(p/n)) · β̂_j²   (a rescaled χ²)

where b(·) depends only on p/n, and β̂_j ∼ N(0, √(α(p/n) b(p/n)) / √p) (second argument: standard deviation)

1. Convex geometry: show ‖β̂‖ = O(1)
2. Approximate message passing: determine the asymptotic distribution of ‖β̂‖
3. Leave-one-out analysis: characterize the marginal distribution of β̂_j (rescaled Gaussian) and ensure higher-order terms vanish
24/ 26
A small sample of prior work
• Candès, Fan, Janson, Lv ’16: observed empirically the nonuniformity of p-values in logistic regression
• Fan, Demirkaya, Lv ’17: classical asymptotic normality of the MLE (basis of the Wald test) fails to hold in logistic regression when p ≫ n^{2/3}
• El Karoui, Bean, Bickel, Lim, Yu ’13; El Karoui ’13; Donoho, Montanari ’13; El Karoui ’15: robust M-estimation for linear models (assuming strong convexity)
25/ 26
Summary
• Caution needs to be exercised when applying classical statistical procedures in the high-dimensional regime, a regime of growing interest and relevance
• What shall we do under the non-null (β ≠ 0)?

Paper: “The likelihood ratio test in high-dimensional logistic regression is asymptotically a rescaled chi-square”, Pragya Sur, Yuxin Chen, Emmanuel Candès, 2017.
26/ 26