TRANSCRIPT
Generalized inferential models: basics and beyond
Ryan Martin1
North Carolina State University & Researchers.One
Statistics Seminar, Northwestern Polytechnical University, China
November 19, 2021
1 www4.stat.ncsu.edu/~rmartin
1 / 40
Introduction
Statistics aims to give reliable/valid uncertainty quantification about unknowns based on data, models, etc.
Two dominant schools of thought:
frequentist
Bayesian
Both have familiar pros and cons...
Do we have to accept the cons? Can’t we just have all pros?
e.g., Efron (2013):
Perhaps the most important unresolved problem in
statistical inference is the use of Bayes theorem in the
absence of prior information.
2 / 40
Introduction, cont.
Chuanhai Liu2 and I developed a pros-focused approach.
Objectives:
data-dependent “probabilities,” without priors
calibration properties to make inference reliable
Our framework: inferential models (IMs).
Some similarities to what Fisher & others did.
Key difference:
reliability requires “probabilities” to be imprecise
2 https://www.stat.purdue.edu/people/faculty/chuanhai.html
3 / 40
This talk
Background / inferential models (IMs)
Generalized IMs
easier construction
still valid
Applications:
meta-analysis
survival analysis
Next-generation generalized IMs...
4 / 40
Inferential models
Observable data: Y in sample space 𝕐
Statistical model: Y ∼ P_{Y|θ}, θ in parameter space Θ
Goal: learn about unknown θ from observed data, y
That is, quantify uncertainty about θ based on y
Bayes, Fisher, and others use probability distributions
Dempster & Shafer use “belief functions”
These are special cases of IMs...
5 / 40
IMs, cont.
Mathematically, an IM is a mapping that takes data y to a pair of lower and upper probabilities.3,4
Π̲_y(A) = degree of belief in “θ ∈ A”
Π̄_y(A) = degree of plausibility of “θ ∈ A”
→ Probabilities are additive: Π̲_y = Π̄_y.
→ Belief functions, etc., are non-additive: Π̲_y ≤ Π̄_y.
Clearly there are lots of options; how to choose?
Recommend an IM that’s “statistically reliable”
3 Technically, these are super/sub-additive and monotone capacities.
4 Linked via the duality Π̄_y(A) = 1 − Π̲_y(Aᶜ).
6 / 40
IMs, cont.
“Reliable” in what sense?
Basic principle: if Π̲_Y(A) is large, infer A.
Reid & Cox: it is unacceptable if a procedure . . . of representing
uncertain knowledge would, if used repeatedly, give
systematically misleading conclusions.
We don’t want, e.g., Π̲_Y(A) to be large if A is false.
Idea: require that y ↦ Π̲_y(·) satisfy
{θ ∉ A and Y ∼ P_{Y|θ}} ⟹ Π̲_Y(A) tends to be small.
7 / 40
IMs, cont.
Definition.
An IM y ↦ (Π̲_y, Π̄_y) is valid if
sup_{θ∉A} P_{Y|θ}{Π̲_Y(A) > 1 − α} ≤ α, for all A ⊆ Θ and α ∈ [0, 1].
Validity controls the frequency at which the IM assigns relatively high beliefs to false assertions.
There’s an equivalent statement in terms of Π̄_y:
sup_{θ∈A} P_{Y|θ}{Π̄_Y(A) ≤ α} ≤ α, for all A ⊆ Θ and α ∈ [0, 1].
False confidence theorem:5 additive IMs can’t be valid.
5 Balch, M., and Ferson, arXiv:1706.08565
8 / 40
IMs, cont.
Theorem.
If (Π̲_Y, Π̄_Y) is valid, then the derived procedures control error rates:
“reject H0 : θ ∈ A if Π̄_y(A) ≤ α” is a size-α test,
the 100(1 − α)% plausibility region {ϑ : Π̄_y({ϑ}) > α} has coverage probability ≥ 1 − α.
IM validity =⇒ usual frequentist validity
Connection is mutually beneficial:
IMs help with interpretation of frequentist output
calibration makes the IM’s (Π̲_y, Π̄_y) real-world relevant
9 / 40
IMs, cont.
How to construct a valid IM?
A. Association: Y = a(θ, U), U ∼ P_U.
P. Predict: use a random set 𝒰 to “guess” the unobserved U.
C. Combine: form the data-dependent random set
Θ_y(𝒰) = ⋃_{u ∈ 𝒰} {ϑ : y = a(ϑ, u)},
which leads to the lower and upper probabilities
Π̲_y(A) = P_𝒰{Θ_y(𝒰) ⊆ A}
Π̄_y(A) = P_𝒰{Θ_y(𝒰) ∩ A ≠ ∅}.
This is what the book is about!
10 / 40
IMs, cont.
Problems first considered by Fisher:
scalar Y ∼ P_{Y|θ} and scalar θ
continuous distribution function F_θ
range of F_θ(y) unconstrained by fixed y or fixed θ
IM construction:6
A. Y = F_θ⁻¹(U), U ∼ P_U = Unif(0, 1)
P. 𝒰 = {u ∈ [0, 1] : |u − 0.5| ≤ |Unif(0, 1) − 0.5|}
C. Θ_y(𝒰) = {ϑ : F_ϑ(y) ∈ 𝒰}, with plausibility contour
π_y(ϑ) = P_𝒰{Θ_y(𝒰) ∋ ϑ} = 1 − |2F_ϑ(y) − 1|.
Lots of examples can be covered by this analysis.
6 For original details, see M. and Liu, arXiv:1206.4091
11 / 40
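The contour formula above is easy to check by simulation. Below is a minimal sketch (my own example, not from the slides: the N(θ, 1) location model, with scipy assumed available) verifying that π_Y(θ) is Unif(0, 1) under the true θ, which is exactly the validity property.

```python
import numpy as np
from scipy.stats import norm

def contour(y, theta):
    # pi_y(theta) = 1 - |2 F_theta(y) - 1| for the N(theta, 1) model
    return 1 - np.abs(2 * norm.cdf(y - theta) - 1)

# under Y ~ N(theta, 1), pi_Y(theta) = 1 - |2U - 1| with U ~ Unif(0, 1),
# so P{pi_Y(theta) <= alpha} = alpha for every alpha
rng = np.random.default_rng(0)
theta_true = 2.0
y = rng.normal(theta_true, 1.0, size=100_000)
pl = contour(y, theta_true)
print(np.mean(pl <= 0.05))  # close to 0.05, as validity requires
```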
IMs, cont.
Two examples: Cauchy(θ, 1) and Gamma(θ, 1)
Plots below show the plausibility contour, πy (ϑ).
How is this used?
confidence interval {ϑ : π_y(ϑ) > α}
upper probabilities: Π̄_y(A) = sup_{ϑ ∈ A} π_y(ϑ)
[Figure: plausibility contours π_y(ϑ) vs. ϑ: (a) y = 0, Cauchy; (b) y = 1, gamma]
12 / 40
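The Cauchy panel can be reproduced numerically. A sketch (my own code, assuming scipy's `cauchy`) that also extracts the 95% plausibility interval {ϑ : π_y(ϑ) > 0.05}; the heavy Cauchy tails make this interval quite wide with a single observation.

```python
import numpy as np
from scipy.stats import cauchy

def contour(y, theta):
    # plausibility contour for the Cauchy(theta, 1) model
    return 1 - np.abs(2 * cauchy.cdf(y, loc=theta) - 1)

y = 0.0
grid = np.linspace(-30, 30, 6001)
pl = contour(y, grid)

# 95% plausibility interval: {theta : pi_y(theta) > 0.05}
inside = grid[pl > 0.05]
lo, hi = inside.min(), inside.max()
# analytically the endpoints are at +/- tan(0.475 * pi), about +/- 12.7
```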
IMs, cont.
Despite the IM's nice features, practical challenges can arise.
Basic issue: the A-step must determine P_{Y|θ}
Challenges:
efficiency-motivated auxiliary variable dimension reduction7
eliminating nuisance parameters8
(big) the construction requires a fully-specified statistical model...
Formal remedies are difficult to carry out.
Idea: do these dimension-reduction-related tasks less formally before starting the IM construction.
Leads to a generalized IM...
7 M. and Liu, arXiv:1211.1530
8 M. and Liu, arXiv:1306.3092
13 / 40
Generalized IMs
The idea9 is to connect some function of (Y, θ) to an auxiliary variable with known distribution.
Let Ty ,ϑ be a real-valued function of (y , ϑ).
Good example to keep in mind:
T_{y,ϑ} = L_y(ϑ) / L_y(θ̂_y), the relative likelihood,
where θ̂_y is the maximum likelihood estimate.
Note: the value of T_{Y,θ} does not determine (Y, θ).
So, an association in terms of T_{Y,θ} amounts to a “loss of information” in a sense that turns out to be irrelevant.
9M., arXiv:1203.6665 and arXiv:1511.06733
14 / 40
Generalized IMs, cont.
Generalized association:
T_{Y,θ} = F_θ⁻¹(U), U ∼ Unif(0, 1),
where F_θ(t) = P_{Y|θ}(T_{Y,θ} ≤ t), t ∈ ℝ.
Unlike before, the generalized association doesn’t determine the distribution of Y, but that’s not important.
Key benefits:
U is a scalar, so no dimension reduction is needed!
the ordering in ϑ ↦ T_{y,ϑ} suggests a particular random set 𝒰
15 / 40
Generalized IMs, cont.
Generalized IM construction.
A. T_{Y,θ} = F_θ⁻¹(U) for U ∼ Unif(0, 1).
P. Introduce a suitable random set 𝒰 on [0, 1].
C. Combine to get a new random set on Θ:
Θ_y(𝒰) = {ϑ : F_ϑ(T_{y,ϑ}) ∈ 𝒰}.
For the special case 𝒰 = [U′, 1] with U′ ∼ Unif(0, 1), some simplification is possible:
π_y(ϑ) = P_𝒰{Θ_y(𝒰) ∋ ϑ} = F_ϑ(T_{y,ϑ}), ϑ ∈ Θ.
Immediately gives valid, prior-free probabilistic inference across a wide range of problems!
16 / 40
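To make the construction concrete, here is a sketch with the relative likelihood as T_{y,ϑ} in an exponential-rate model (the model is my choice of example); F_ϑ is estimated by naive Monte Carlo at each ϑ.

```python
import numpy as np

rng = np.random.default_rng(1)

def rel_lik(y, theta):
    # relative likelihood for an i.i.d. Exponential(rate theta) sample:
    # L(theta) / L(theta_hat), with MLE theta_hat = n / sum(y)
    n, s = len(y), y.sum()
    mle = n / s
    return (theta / mle) ** n * np.exp(n - theta * s)

def gim_contour(y, theta, M=2000):
    # pi_y(theta) = F_theta(T_{y,theta}), estimated by simulating T under theta
    t_obs = rel_lik(y, theta)
    sims = rng.exponential(1 / theta, size=(M, len(y)))
    t_sim = np.array([rel_lik(row, theta) for row in sims])
    return np.mean(t_sim <= t_obs)
```

At ϑ equal to the MLE the observed relative likelihood is 1, its maximum, so the contour equals 1 there and decays as ϑ moves away.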
Generalized IMs, cont.
Simple binomial example.
Left: plot of π_y(ϑ) based on (n, y) = (25, 15)
Right: GIM’s and Clopper–Pearson’s coverage probability.
17 / 40
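Because Y has finite support, the binomial contour on this slide can be computed exactly rather than by Monte Carlo. A sketch, assuming T is the relative likelihood as in the running example:

```python
import numpy as np
from scipy.stats import binom
from scipy.special import xlogy

def rel_loglik(ys, n, theta):
    # log relative likelihood; xlogy handles 0 * log(0) = 0 at the boundaries
    ll = lambda p: xlogy(ys, p) + xlogy(n - ys, 1 - p)
    return ll(theta) - ll(ys / n)

def contour(y, n, theta):
    # exact pi_y(theta) = P_theta{ T_{Y,theta} <= T_{y,theta} },
    # summing the binomial pmf over the qualifying Y values
    ys = np.arange(n + 1)
    t_all = rel_loglik(ys, n, theta)
    t_obs = rel_loglik(np.array([y]), n, theta)[0]
    return binom.pmf(ys, n, theta)[t_all <= t_obs + 1e-12].sum()
```

With (n, y) = (25, 15), the contour peaks at 1 at the MLE ϑ = 0.6 and is essentially zero at implausible values like ϑ = 0.1.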
Generalized IMs, cont.
Too good to be true?
Computation of Fθ can be challenging.
Lots of sampling-based methods available for this.
To evaluate π_y(ϑ) on a grid:
do a separate Monte Carlo run at each grid point
Monte Carlo + importance sampling adjustments
other things...10
Better/general strategies for GIM computation would be an interesting and welcome contribution!
10 Syring and M., arXiv:2103.02659
18 / 40
Generalized IMs, cont.
Often we are only interested in some feature of θ.
Split θ = (φ, λ), interest in φ.
Now the idea is to connect a function of (Y, φ) to an auxiliary variable with known distribution.
Let Ty ,ϕ be a real-valued function of (y , ϕ).
Good example to keep in mind:
T_{y,ϕ} = L_y(ϕ, λ̂_ϕ) / L_y(φ̂, λ̂), the relative profile likelihood,
where λ̂_ϕ maximizes λ ↦ L_y(ϕ, λ) and (φ̂, λ̂) is the global MLE.
19 / 40
Generalized IMs, cont.
Generalized association:
T_{Y,φ} = F_{φ,λ}⁻¹(U), U ∼ Unif(0, 1),
where F_{φ,λ}(t) = P_{Y|φ,λ}(T_{Y,φ} ≤ t).
TY ,φ doesn’t directly depend on λ, but its distribution does.
If λ were known, or if the dependence on λ dropped out,11
then this would be exactly like before.
That is, we end up with
“π_y(ϕ)” = F_{ϕ,λ}(T_{y,ϕ}), ϕ ∈ φ(Θ).
11 e.g., bivariate normal with φ the correlation and λ the means and variances
20 / 40
Generalized IMs, cont.
Natural idea is to use a plug-in estimate
Define λ̂_ϕ = argmax_λ L_y(ϕ, λ).
Generalized IM has (plug-in) plausibility contour
π_y(ϕ) = F_{ϕ,λ̂_ϕ}(T_{y,ϕ}).
Plug-in means it can’t be exactly valid, but one can usually prove asymptotic validity, i.e.,
lim_{n→∞} P_{Yⁿ|φ,λ}{π_{Yⁿ}(φ) ≤ α} = α.
Open question: Empirically, this convergence is very fast, but is there a built-in “higher-order accuracy”?
21 / 40
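A sketch of the plug-in recipe for a concrete case (my example, not from the slides: φ the mean of a normal model, with the variance λ profiled out); the contour is estimated by Monte Carlo under the plug-in parameter.

```python
import numpy as np

rng = np.random.default_rng(2)

def rel_profile_loglik(y, mu):
    # log relative profile likelihood for the mean of N(mu, sigma^2),
    # profiling out the unknown variance
    n = len(y)
    s2_hat = np.var(y)              # unrestricted MLE of the variance
    s2_mu = np.mean((y - mu) ** 2)  # profile MLE of the variance given mu
    return -0.5 * n * (np.log(s2_mu) - np.log(s2_hat))

def plugin_contour(y, mu, M=2000):
    # plug-in GIM contour: simulate under (mu, s2_mu) and compare T values
    n = len(y)
    s2_mu = np.mean((y - mu) ** 2)
    t_obs = rel_profile_loglik(y, mu)
    sims = rng.normal(mu, np.sqrt(s2_mu), size=(M, n))
    t_sim = np.array([rel_profile_loglik(row, mu) for row in sims])
    return np.mean(t_sim <= t_obs + 1e-9)
```

The contour peaks at 1 at the sample mean and falls toward zero for distant values of the mean.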
Applications
Two recent applications of generalized IMs:
1 Meta-analysis with few studies.12
2 Survival analysis.13
Both involve nuisance parameters and non-trivial computation.
Generalized IM methods outperform existing methods.
Below are some details for each in turn.
12 Cahoon and M., arXiv:1910.00533
13 Cahoon and M., arXiv:1912.00037
22 / 40
Meta-analysis
It’s natural in science for multiple researchers to carry out their own analyses of the same question.
Pool these separate analyses into a “meta-analysis”?
K independent studies produce data (Y_k, σ_k²)
Y_k is the estimate of µ from study-k data
σ_k is the study-k standard error, treated as fixed
Basic model: Y_k ∼ N(µ, ν + σ_k²), k = 1, …, K.
ν > 0 is the across-study variance, unknown.
Goal is inference on µ.
23 / 40
Meta-analysis, cont.
Estimating µ is easy, valid inference on µ is difficult.
Challenge comes from the nuisance parameter ν.
Asymptotic confidence intervals for µ are available,14 but these require large K.
I was unsuccessful trying to work out the basic IM details for this, but the aforementioned generalized IM works.
Take-away messages:
Probabilistic inference on µ
Asymptotically valid, empirically accurate for small K
Outperforms other methods we tried
14 DerSimonian & Laird is the classic
24 / 40
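For reference, the classic DerSimonian-Laird moment estimator mentioned in the footnote is only a few lines. The sketch below simulates from the basic model above; the large K (unlike the small-K setting of interest in the talk) is just so the moment estimate stabilizes.

```python
import numpy as np

rng = np.random.default_rng(3)

# simulate the basic meta-analysis model: Y_k ~ N(mu, nu + sigma_k^2)
K, mu, nu = 200, 1.0, 0.5
sigma2 = rng.uniform(0.1, 1.0, size=K)   # within-study variances, known
y = rng.normal(mu, np.sqrt(nu + sigma2))

# DerSimonian-Laird moment estimator of the across-study variance nu,
# truncated at zero
w = 1 / sigma2
ybar = np.sum(w * y) / np.sum(w)
q = np.sum(w * (y - ybar) ** 2)
nu_hat = max(0.0, (q - (K - 1)) / (np.sum(w) - np.sum(w ** 2) / np.sum(w)))
```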
Meta-analysis, cont.
Left: 3 individual plausibility contours & combined
Right: empirical CDF of π_Y(µ) for K = 5 studies.
25 / 40
Meta-analysis, cont.
Simulation comparison of GIM against competitors.
As functions of K , compare 95% CIs for µ
coverage probability (left)
average length (right)
e.g., GIM (black), oracle (green), DL (purple)
26 / 40
Survival analysis
Data may be incomplete in some applications.
e.g., in time-to-event studies, event times may be censored.
Survival analysis deals with such things.
Basic right-censoring model:
X_i ∼ H_φ and C_i ∼ G, with θ = (φ, G)
T_i = X_i ∧ C_i, D_i = 1(X_i ≤ C_i)
Data: Y_i = (T_i, D_i) for i = 1, …, n.
Goal is inference on φ
G is an (infinite-dim) nuisance parameter.
27 / 40
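The data-generating mechanism on this slide is easy to simulate. A sketch with Weibull event times and exponential censoring (the particular censoring distribution G is my arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(4)

# right-censored data: observe T_i = min(X_i, C_i) and D_i = 1(X_i <= C_i)
n = 100
x = 2.0 * rng.weibull(1.5, size=n)   # event times X_i ~ H_phi, Weibull(shape 1.5, scale 2)
c = rng.exponential(3.0, size=n)     # censoring times C_i ~ G
t = np.minimum(x, c)                 # observed times
d = (x <= c).astype(int)             # event indicators
data = np.column_stack([t, d])       # Y_i = (T_i, D_i)
```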
Survival analysis, cont.
There’s a likelihood function, hence MLEs
Asymptotic normality of φ̂, the MLE, can be used for inference.
Bayesian methods are also available.
I was unsuccessful trying to work out the basic IM details for this, but the aforementioned generalized IM works.
Take-away messages:
Probabilistic inference on φ
Asymptotically valid, empirically accurate for small n
Outperforms other methods we tried
28 / 40
Survival analysis, cont.
Some computational details:
Monte Carlo to evaluate π_y(ϕ)
to simulate censoring times for the Monte Carlo, use the plug-in Kaplan–Meier Ĝ, the nonparametric MLE
Some theoretical details:
challenging to handle the infinite-dimensional plug-in Ĝ
works because Ĝ is root-n consistent
empirical results suggest higher-order accuracy...
29 / 40
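A minimal Kaplan–Meier implementation (my own sketch, assuming distinct observation times; ties would need grouping) shows what the plug-in step computes:

```python
import numpy as np

def kaplan_meier(t, d):
    # Kaplan-Meier survival estimate at the sorted observed times;
    # d = 1 marks an observed event, d = 0 a censored time
    order = np.argsort(t)
    t, d = t[order], d[order]
    at_risk = len(t) - np.arange(len(t))     # risk-set size at each time
    factors = np.where(d == 1, 1 - 1 / at_risk, 1.0)
    return t, np.cumprod(factors)            # S-hat at each sorted time
```

To get the plug-in Ĝ for the censoring distribution, flip the indicator (pass 1 − d), since a censored observation is an "event" for G.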
Survival analysis, cont.
H_φ is Weibull with φ = (shape α, scale β)
Compare coverage probability with existing methods
GIM (black), MLE (red), Bayes (green)
Not an extensive simulation...
30 / 40
Survival analysis, cont.
Real-data example:
chemical concentrations in soil
left-censored: the measuring instrument has limited precision
H_φ is log-normal, φ = (µ, σ²)
Plots:
Left: joint plausibility contour for (µ, σ²)
Right: derived marginal for ψ = exp(µ + σ²/2)
31 / 40
Next-gen GIMs
Generalized IMs relaxed the basic IM’s construction by not requiring the association to determine the model for Y.
But still requires a statistical model.
In machine learning applications, it’s common to work without a statistical model.
This has been a barrier for IMs and other probabilistic inference frameworks.
Turns out the generalized IM can be generalized even further to cover certain no-model cases.
I’ll talk briefly about two such extensions.
32 / 40
No-model inference
Often the quantity of interest isn’t a model parameter.
Common situation: θ = argmin_ϑ E{ℓ_ϑ(Y)}
Loss could be squared error, classification error, etc.
Analogue to the relative likelihood
T_{y,ϑ} = exp{−[R_n(ϑ) − R_n(θ̂_y)]}, where R_n(ϑ) = n⁻¹ ∑_{i=1}^n ℓ_ϑ(y_i)
and θ̂_y is the empirical risk minimizer.
Similar in principle to the generalized IM before, but there are several new challenges.15
e.g., the bootstrap is needed to compute the distribution of T_{Y,θ}
Asymptotically valid under regularity conditions.
15 Cella and M., almost done...
33 / 40
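A sketch of the no-model contour under squared-error loss (so the risk minimizer θ is a mean and θ̂ the sample mean; the loss choice is mine), with the bootstrap standing in for the unknown sampling distribution of T:

```python
import numpy as np

rng = np.random.default_rng(5)

def t_stat(y, theta):
    # analogue of the relative likelihood under squared-error loss:
    # exp{-(R_n(theta) - R_n(theta_hat))}, with theta_hat the sample mean
    risk = lambda th: np.mean((y - th) ** 2)
    return np.exp(-(risk(theta) - risk(y.mean())))

def contour(y, theta, B=2000):
    # bootstrap the distribution of T, with theta_hat playing the
    # role of the true theta in the bootstrap world
    t_obs = t_stat(y, theta)
    theta_hat = y.mean()
    idx = rng.integers(0, len(y), size=(B, len(y)))
    t_boot = np.array([t_stat(y[i], theta_hat) for i in idx])
    return np.mean(t_boot <= t_obs + 1e-12)
```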
No-model inference, cont.
Quantile regression is an important example.
Conditional τ-th quantile: Q_τ(Y | x) = xᵀβ_τ
“Check loss” defines the risk minimization problem.
34 / 40
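The check loss and its risk-minimization property can be demonstrated directly. A sketch (τ = 0.5, so the minimizer is the median; the Exp(1) population is my choice):

```python
import numpy as np

rng = np.random.default_rng(6)

def check_loss(u, tau):
    # the "check" (pinball) loss: tau*u for u >= 0, (tau - 1)*u for u < 0
    return np.maximum(tau * u, (tau - 1) * u)

# the tau-th quantile minimizes the expected check loss; verify on a grid
y = rng.exponential(1.0, size=200_000)
grid = np.linspace(0.1, 2.0, 400)
risks = [np.mean(check_loss(y - g, 0.5)) for g in grid]
best = float(grid[int(np.argmin(risks))])
# best should sit near the median of Exp(1), log 2, about 0.693
```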
No-model prediction
Prediction of future observations is important.
Can do model-based IM prediction.16
Model assumptions are restrictive in applications, so what about a no-model version?
Recent development using a new type of generalized IM.17
Only assumes exchangeability, provably valid.
Close connections to conformal prediction.18
16 M. and Lingham, arXiv:1403.7589
17 Cella and M., https://researchers.one/articles/20.01.00010
18 Vovk, Shafer, etc.
35 / 40
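To give a flavor of the conformal connection, here is a rough conformal-style prediction interval for an exchangeable scalar sample. This simplified version reuses the data for the center, whereas proper split conformal would compute the center on a separate fold; it is a sketch, not the method from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(7)

# conformal-style prediction interval for an exchangeable scalar sample
alpha = 0.1
y = rng.normal(0.0, 1.0, size=200)
center = np.median(y)
scores = np.abs(y - center)                   # nonconformity scores
k = int(np.ceil((len(y) + 1) * (1 - alpha)))  # conformal rank
q = np.sort(scores)[min(k, len(y)) - 1]
interval = (center - q, center + q)           # targets ~90% coverage
```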
No-model prediction, cont.
Latest developments for supervised learning, e.g., regression.19
Very general relationships between response and predictors.
Leads to valid probabilistic inference!
19 Cella and M., almost done...
36 / 40
Conclusion
IM framework is a promising potential solution to Efron’s “most important unresolved problem”
The necessary breakthrough for getting past the old Bayesian-vs-frequentist debates is the importance of non-additivity/imprecision
I didn’t fully appreciate how fundamental imprecision was until after the book was finished.
So I’ve written quite a bit about this recently...
37 / 40
Conclusion, cont.
Two recent papers about validity and the importance of
imprecision/non-additivity
38 / 40
Conclusion, cont.
The generalized IM framework is also powerful; in some sense it might be the “right” way to do IMs.
I’m excited about the latest developments.
Tons of potential applications!
Open questions/problems:
General computational strategies?
Possible higher-order accuracy?
Anything lost in the move IM → GIM?
Likelihood-based inference is “optimal,” so can GIMs shed light on “optimal IM constructions”?
39 / 40
The end
Thanks for your attention!
Questions? [email protected]
Links to papers: www4.stat.ncsu.edu/~rmartin/
100% open peer-review & publication
https://researchers.one
www.twitter.com/ResearchersOne
40 / 40