
Uncertain Inference and Artificial Intelligence

Chuanhai Liu¹

March 3, 2011

¹Prepared for a Purdue Machine Learning Seminar


Acknowledgement

◮ Prof. A. P. Dempster for intensive collaborations on the Dempster-Shafer theory.

◮ Jianchun Zhang, Ryan Martin, Duncan Ermini Leaf, Zouyi Zhang, Huiping Xu, Jing-Shiang Hwang, Jun Xie, and Hyokun Yun for collaborations on a variety of IM research projects.

◮ NSF support for a joint project with Jun Xie on large-scale multinomial inference and its applications in genome-wide association studies.


References

Martin, R. and Liu, C. (2011, Inferential Models) and the references therein.

A possible textbook (Liu and Martin, 2012+, Inferential Models: Reasoning with Uncertainty) having the following features:

◮ A prior-free and valid probabilistic inference system, which is promising for serious applications of statistics.

◮ Fully developed valid probabilistic inferential methods for textbook problems.

◮ A large collection of applications to modern, challenging, and large-scale statistical problems.

◮ Deeper understanding of existing schools of thought and their strengths and weaknesses.

◮ Satisfactory solutions to well-known benchmark problems, including Stein's paradox and the Behrens-Fisher problem.

◮ A direct attack on the source of uncertainty, which makes learning and teaching easier and more enjoyable.


Abstract

It is difficult, perhaps, to believe that artificial intelligence can be made intelligent enough without a valid probabilistic inferential system as a critical module. After a brief review of existing schools of thought on uncertain inference, we introduce a valid probabilistic inferential framework termed inferential models (IMs). With several simple and benchmark examples, we discuss potential applications of IMs in artificial intelligence in general and machine learning in particular.


What is it? An answer from the web

Artificial Intelligence (AI) is the area of computer science focusing on creating machines that can engage in behaviors that humans consider intelligent. The ability to create intelligent machines has intrigued humans since ancient times, and today, with the advent of the computer and 50 years of research into AI programming techniques, the dream of smart machines is becoming a reality. Researchers are creating systems which can mimic human thought, understand speech, beat the best human chess player, and countless other feats never before possible.


Is the answer precise?

If not, blame Google's machine learning algorithms.


What is it? An answer from the web

Machine learning has been central to AI research from the beginning. Unsupervised learning is the ability to find patterns in a stream of input. Supervised learning includes both classification and numerical regression. Classification is used to determine what category something belongs in, after seeing a number of examples of things from several categories. Regression takes a set of numerical input/output examples and attempts to discover a continuous function that would generate the outputs from the inputs. In reinforcement learning the agent is rewarded for good responses and punished for bad ones. These can be analyzed in terms of decision theory, using concepts like utility. The mathematical analysis of machine learning algorithms and their performance is a branch of theoretical computer science known as computational learning theory.


The inference problem

◮ Input:

1. Data x — observed values of the observable quantities X ∈ X.
2. Assertion A — statements on θ ∈ Θ, the unknown quantities.
3. Association between X and θ. For example, x is a sample from the population characterized by the cdf Fθ(·).

◮ Output:

1. Probabilistic uncertainty assessments on the truth or the falsity of A given X = x.
2. Plausible regions for θ and its functions.
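Purely as an illustration (not from the slides), the input/output contract above can be phrased as a small interface; all names and types here are mine:

    from dataclasses import dataclass
    from typing import Any, Callable

    @dataclass
    class InferenceProblem:
        """Input side: data, an assertion on theta, and an association."""
        x: Any                                   # observed data, x in X
        assertion: Callable[[Any], bool]         # A: returns True iff theta in A
        association: Callable[[Any, Any], Any]   # links X and theta, e.g., via F_theta

    @dataclass
    class InferenceOutput:
        """Output side: probabilistic assessments of A given X = x."""
        evidence_for: float       # e_x(A)
        evidence_against: float   # e_x(A^c)

        @property
        def plausibility(self) -> float:
            return 1.0 - self.evidence_against   # plausibility of A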


Uncertain inference is critical to AI — No?


One (simple) kind of uncertain inference

Probability models: A probability model has a meaningful/valid probability distribution assumed to be adequate for everything. In particular, θ has a valid marginal distribution that can be operated on via the usual probability calculus to derive, e.g., valid marginal and conditional posterior distributions.

Subjective Bayesian: Philosophically, every Bayesian is subjective.

◮ Bayes was not Bayesian.
◮ What's wrong? Nothing is wrong — you make the decision and (you or your clients) should take the consequences.


Statistical models

Statistical models: In what follows, we consider the cases where you don't have valid distributions for everything, which we refer to as statistical models. Here, θ is taken to be unknown.


“Objective” Bayesian — a personal view

The idea can be viewed as using magic priors to approximate (ideal) frequentist results.

Remarks:

◮ Assertion-specific priors: Certain priors can work for certain assertions on θ.

◮ Large-sample theory: It really concerns the case when uncertainty goes away; think about both normality and vanishing variances in very-high-dimensional problems.

◮ Robust Bayesian: The "worst case scenario" thinking ultimately leads the Bayesian to a non-Bayesian school.


Existing schools of thought

◮ Bayes: for it to work, it really requires valid priors.

◮ Fiducial: it is very interesting. It is wrong (but better than Bayes[?]).

◮ Dempster-Shafer: as an extension of both Bayes and fiducial, it requires valid independent individual components that are probabilistically meaningful.

For example, individual components are specified with fiducial probabilities.

◮ Frequentist: starting with specified rules and criteria, it invites the "guess and check" approach to uncertain inference. If so, is it very appealing?

For example, 24+ methods for 2 × 2 tables and penalty-based methods.


Remarks

◮ These existing methods are useful.

◮ All these schools of thought fail for many "benchmark" examples, such as the many-normal-means, Behrens-Fisher, and constrained-parameter problems.

◮ Thinking outside the box may be necessary for new generations.


The likelihood insufficiency principle

◮ Likelihood alone is not sufficient for probabilistic inference.

◮ An unobserved but predictable quantity, called the auxiliary (a-)variable, must be introduced for predictive/probabilistic inference.

Remark: Bayes makes θ predictable. Is it credible/valid?


The “No Validity, No Probability” principle?

◮ Notation: denote by Px(A) the probability for the truth of A given the observed data x.

◮ Definition (validity). An inferential framework is said to be valid if, ∀A ⊂ Θ, PX(A), as a function of X, satisfies

PX(A) ≤ Unif(0, 1)  (stochastically)

under the falsity of A, i.e., under the truth of Ac, the negation of A.
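As a sanity check, here is a minimal Monte Carlo sketch (Python with numpy/scipy) of this definition, using the X ∼ N(θ, 1) IM developed later in this section; the choice of assertion and the closed-form evidence are my own illustration, derived from that example's C-step:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    theta_truth = 0.0
    X = rng.normal(theta_truth, 1.0, size=200_000)

    # A = {theta != theta_truth} is FALSE at the truth.  Its evidence is
    # e_X(A) = P(Theta_X(S) excludes theta_truth) = 2*Phi(|X - theta_truth|) - 1.
    e = 2 * norm.cdf(np.abs(X - theta_truth)) - 1

    # Validity: P(e_X(A) <= a) >= a for every a, i.e., e is stochastically
    # no larger than Unif(0, 1).  Here equality holds: e is exactly uniform.
    for a in (0.01, 0.05, 0.10, 0.50):
        print(f"P(e <= {a:.2f}) = {(e <= a).mean():.3f}")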


The Inferential Model (IM) framework

IM is valid and consists of three steps:

Association-step: Associate X and θ with an a-variable z to obtain the mapping

ΘX(z) ⊆ Θ  (z ∼ πz)

consisting of candidate values of θ given X and z.

Prediction-step: Predict z with a credible predictive random set (PRS) Sθ, i.e.,

P(Sθ ∌ z) ≤ Unif(0, 1)  (stochastically), where z ∼ πz.

Combination-step: Combine x and Sθ to obtain Θx(Sθ) = ∪z∈Sθ Θx(z) and compute the evidence

ex(A) = P(Θx(Sθ) ⊆ A)  and  ex(Ac) = P(Θx(Sθ) ⊆ Ac),

with ēx(A) = 1 − ex(Ac) called the plausibility.


X ∼ N(θ, 1)

A-step. X = θ + z, where z ∼ N(0, 1).
P-step. Sθ = [−|Z|, |Z|], where Z ∼ N(0, 1).
C-step. ex(A) and ex(Ac) with Θx(Sθ) = [x − |Z|, x + |Z|].

Example

[Figure: Plausibility ēx(θ0) of the assertion A = {θ : θ = θ0}, plotted against θ0 ∈ [−2, 6], given x = 1.96. Note ex(θ0) = 0.]
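The curve in the figure is available in closed form: θ0 ∈ [x − |Z|, x + |Z|] iff |Z| ≥ |x − θ0|, so ēx(θ0) = 2(1 − Φ(|x − θ0|)). A minimal sketch (Python with scipy; the derivation is mine, from the C-step above):

    import numpy as np
    from scipy.stats import norm

    def plausibility(theta0, x=1.96):
        # Plausibility of A = {theta = theta0}: the random interval
        # [x - |Z|, x + |Z|] covers theta0 iff |Z| >= |x - theta0|.
        return 2 * norm.sf(np.abs(x - theta0))

    theta0 = np.linspace(-2, 6, 201)
    pl = plausibility(theta0)       # peaks at 1 when theta0 = x = 1.96
    print(plausibility(0.0))        # 0.05: theta0 = 0 is barely plausible at x = 1.96
    # Evidence e_x(theta0) is 0: the nondegenerate random interval is
    # contained in the singleton {theta0} with probability 0.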


X ∼ Binomial(n, θ)

This is a homework problem for Stat 598D.


Efficiency

See the Stat 598D lecture notes on Statistical Inference. Let b(z) be a continuous function and define

S = {z : b(z) ≤ b(Z)}  (Z ∼ πz).

Then

P(S ∌ z) ∼ Unif(0, 1)  (z ∼ πz).

We can use this result to construct credible PRSs.
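A quick simulation illustrating the construction (my choice b(z) = |z| with πz = N(0, 1), which recovers the default PRS used above):

    import numpy as np
    from scipy.stats import norm, kstest

    rng = np.random.default_rng(1)
    z = rng.normal(size=100_000)                # z ~ pi_z

    # For b(z) = |z|: P(S does not contain z) = P(|Z| < |z|) = 2*Phi(|z|) - 1.
    non_coverage = 2 * norm.cdf(np.abs(z)) - 1

    # Credibility: as a function of z ~ pi_z this should be ~ Unif(0, 1).
    print(kstest(non_coverage, "uniform"))      # large p-value, as claimed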


Combining information: Conditional IMs

Example (a textbook example). Consider the association model

Xi = θ + zi  (zi iid∼ N(0, 1), i = 1, ..., n).

Write

X̄ = θ + z̄  and  Xi − X̄ = zi − z̄  (i = 1, ..., n).

Predict z̄ conditional on the observed a-quantities {zi − z̄}, i = 1, ..., n. This leads to the simplified conditional IM:

A-step. X̄ = θ + u/√n, where u ∼ N(0, 1).
P-step. S = [−|U|, |U|], where U ∼ N(0, 1).
C-step. Θx(S) = [X̄ − |U|/√n, X̄ + |U|/√n].
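A short sketch of the resulting plausibility for a point assertion on θ (Python; my reading of the C-step: θ0 is retained iff |U| ≥ √n |x̄ − θ0|):

    import numpy as np
    from scipy.stats import norm

    def pl_theta(x, theta0):
        # Conditional IM plausibility of {theta = theta0} from n observations.
        n, xbar = len(x), np.mean(x)
        return 2 * norm.sf(np.sqrt(n) * np.abs(xbar - theta0))

    rng = np.random.default_rng(2)
    x = rng.normal(3.0, 1.0, size=25)
    print(pl_theta(x, 3.0))   # large near the truth
    # The 95% plausibility interval {theta0 : pl >= 0.05} is xbar +- 1.96/sqrt(n).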


Efficient inference: Marginal IMs

Example (another textbook example). Consider the association model

Xi = η + σzi  (zi iid∼ N(0, 1), i = 1, ..., n).

Let θ = (η, σ²) ∈ Θ = R × R+ and write

X̄ = η + σz̄,  s²x = σ²s²z,  and  (X − X̄1)/sx = (z − z̄1)/sz.

Predict z̄ and s²z conditional on the observed a-quantities (z − z̄1)/sz. This leads to the simplified conditional IM:

A-step. X̄ = η + sx u/√n and s²x = σ²s²z, where u ∼ tn−1 ⊥ s²z ∼ χ²n−1.
P-step. S = [−|U|, |U|] × [0, ∞), where U ∼ tn−1.
C-step. Θx(S) = [X̄ − |U|sx/√n, X̄ + |U|sx/√n] × [0, ∞).
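Analogously, the marginal plausibility for η has a closed form; a sketch under my reading of the C-step (η0 is retained iff |U| ≥ √n |x̄ − η0|/sx):

    import numpy as np
    from scipy.stats import t

    def pl_eta(x, eta0):
        # Marginal IM plausibility of {eta = eta0}; sigma^2 is marginalized out.
        n, xbar, sx = len(x), np.mean(x), np.std(x, ddof=1)
        return 2 * t.sf(np.sqrt(n) * np.abs(xbar - eta0) / sx, df=n - 1)

    rng = np.random.default_rng(3)
    x = rng.normal(3.0, 2.0, size=15)
    print(pl_eta(x, 3.0))
    # {eta0 : pl >= 0.05} is the classical t-interval xbar +- t_{.975,n-1} sx/sqrt(n).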


Model selection via AI (or by AS — Artificial Statistician)?

Consider choosing a model from a collection of models, including, e.g., normal for simplicity (and efficiency) and non-parametric for robustness.

See Jianchun Zhang’s PhD thesis for an IM-based method.


2 × 2 tables

Example (kidney stone treatment, Steven et al. (1994))

Table 1. Small stones
Treatment  Success  Failure
A          81       6
B          234      26

Table 2. Large stones
Treatment  Success  Failure
A          192      71
B          55       25

For making an intelligent decision, there are (at least) two things to consider.

Prediction: condition on the stone type.
Estimation: combine data if possible.

Thus, check the homogeneity of each of the two tables:

Table 3. Treatment A
Stone type  Success  Failure
Small       81       6
Large       192      71

Table 4. Treatment B
Stone type  Success  Failure
Small       234      26
Large       55       25
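The numbers above already exhibit Simpson's paradox, which is worth verifying directly (plain Python; data copied from Tables 1 and 2):

    # Success/failure counts from Tables 1 and 2.
    small = {"A": (81, 6), "B": (234, 26)}
    large = {"A": (192, 71), "B": (55, 25)}

    def rate(s, f):
        return s / (s + f)

    for trt in ("A", "B"):
        ss, sf = small[trt]
        ls, lf = large[trt]
        print(trt,
              f"small={rate(ss, sf):.2f}",
              f"large={rate(ls, lf):.2f}",
              f"pooled={rate(ss + ls, sf + lf):.2f}")
    # A: small=0.93 large=0.73 pooled=0.78
    # B: small=0.90 large=0.69 pooled=0.85
    # A wins within each stone type, yet B wins after pooling: condition on
    # stone type for prediction, and pool only when homogeneity holds.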


Evidence for and against homogeneity of treatments

For each of Table 3 and Table 4, compute

1. e(homogeneous),

2. ē(homogeneous), and

3. the 95% plausibility interval for the odds ratio.

Remarks.

1. Simpson's paradox is related more to wrong statistical analysis, i.e., modeling, than to the inferential method(?). How can this be done in AI?

2. Some relevant statistical thoughts:
◮ Increase precision of prediction via conditioning, and
◮ Increase precision of estimation via pooling.

Can some basics like these be integrated into AI?


Numerical results

[Figure: Plausibilities for the log odds ratios of Tables 3 and 4, which show that pooling makes no sense in this example.]


Comparing two normal means with unknown variances

This is a common textbook, controversial, and practically useful example (Bayes and fiducial do not work well); see Martin, Hwang, and Liu (2010b).


Many-normal-means

The association model:

Xi = µi + zi  (zi iid∼ N(0, 1), i = 1, ..., n).

The problem of interest is to infer ‖µ‖.

A very important example for understanding inference (Bayes and fiducial do not work); see Martin, Hwang, and Liu (2010b).


Many-normal-means

The usual model for the observables X1, ..., Xn:

µi iid∼ N(θ, σ²)  (i = 1, ..., n)

and

Xi | µ ind∼ N(µi, s²i)  (i = 1, ..., n),

with known positive s²1, ..., s²n, where µ = (µ1, ..., µn) and (θ, σ²) ∈ R × R+ are unknown. Here, we are interested in inference about σ².

Since there is rarely meaningful prior knowledge in practice, there has been tremendous interest in choosing Bayesian priors.


Many-normal-means

The sampling model for the observable quantities is

Xi ind∼ N(θ, σ² + s²i)  (i = 1, ..., n).

For simplicity, to motivate ideas, consider the case with known θ = 0, that is,

Xi ind∼ N(0, σ² + s²i)  (i = 1, ..., n).

An association model is given by

∑i=1..n X²i/(σ² + s²i) = V

and

[∑i=1..n X²i/(σ² + s²i)]^(−1/2) (X1/√(σ² + s²1), ..., Xn/√(σ² + s²n)) = U,

where V ∼ χ²n ⊥ U ∼ Unif(On).


Many-normal-means

Specify the predictive random set, which predicts v alone:

S = {(v, u) : |Fn(v) − 0.5| ≤ |Fn(V) − 0.5|},

where Fn is the χ²n cdf.

This is a constrained-parameter inference problem.

Remark. Validity is not a problem, but efficient inference is not straightforward: it requires considering generalized conditional IMs — a challenging topic under investigation!
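For what it is worth, this PRS yields a closed-form plausibility for σ²; a minimal sketch (Python with scipy; my derivation, using the generic b(·) construction from the Efficiency slide with b(v) = |Fn(v) − 0.5|):

    import numpy as np
    from scipy.stats import chi2

    def pl_sigma2(x, s2, sigma2_0):
        # Plausibility of {sigma^2 = sigma2_0}: with v0 = sum x_i^2/(sigma2_0 + s_i^2)
        # and F_n the chi^2_n cdf, pl = P(|F_n(V) - .5| >= |F_n(v0) - .5|)
        #                             = 1 - |2 F_n(v0) - 1|.
        v0 = np.sum(x**2 / (sigma2_0 + s2))
        return 1 - np.abs(2 * chi2.cdf(v0, df=len(x)) - 1)

    rng = np.random.default_rng(4)
    s2 = rng.uniform(0.5, 2.0, size=50)       # known s_i^2
    x = rng.normal(0.0, np.sqrt(1.0 + s2))    # truth: theta = 0, sigma^2 = 1
    for sig2 in (0.0, 0.5, 1.0, 2.0):
        print(sig2, round(pl_sigma2(x, s2, sig2), 3))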
