

    A Generalization of Bayesian Inference

    Arthur P. Dempster

Abstract. Procedures of statistical inference are described which generalize Bayesian inference in specific ways. Probability is used in such a way that in general only bounds may be placed on the probabilities of given events, and probability systems of this kind are suggested both for sample information and for prior information. These systems are then combined using a specified rule. Illustrations are given for inferences about trinomial probabilities, and for inferences about a monotone sequence of binomial pi. Finally, some comments are made on the general class of models which produce upper and lower probabilities, and on the specific models which underlie the suggested inference procedures.

    1 Introduction

Reduced to its mathematical essentials, Bayesian inference means starting with a global probability distribution for all relevant variables, observing the values of some of these variables, and quoting the conditional distribution of the remaining variables given the observations. In the generalization of this paper, something less than a global probability distribution is required, while the basic device of conditioning on observed data is retained. Actually, the generalization is more specific. The term Bayesian commonly implies a global probability law given in two parts, first the marginal distribution of a set of parameters, and second a family of conditional distributions of a set of observable variables given potential sets of parameter values. The first part, or prior distribution, summarizes a set of beliefs or state of knowledge in hand before any observations are taken. The second part, or likelihood function, characterizes the information carried by the observations. Specific generalizations are suggested in this paper for both parts of the common Bayesian model, and also for the method of combining the two parts.

Read at a Research Methods Meeting of the Society, February 14th, 1968, Professor R.L. Plackett in the Chair.


The components of these generalizations are built up gradually in Sect. 2, where they are illustrated on a model for trinomial sampling.

Inferences will be expressed as probabilities of events defined by unknown values, usually unknown parameter values, but sometimes the values of observables not yet observed. It is not possible here to go far into the much-embroiled questions of whether probabilities are or are not objective, are or are not degrees of belief, are or are not frequencies, and so on. But a few remarks may help to set the stage. I feel that the proponents of different specific views of probability generally share more attitudes rooted in the common sense of the subject than they outwardly profess, and that careful analysis renders many of the basic ideas more complementary than contradictory. Definitions in terms of frequencies or equally likely cases do illustrate clearly how reasonably objective probabilities arise in practice, but they fail in themselves to say what probabilities mean or to explain the pervasiveness of the concept of probability in human affairs. Another class of definitions stresses concepts like degree of confidence or degree of belief or degree of knowledge, sometimes in relation to betting rules and sometimes not. These convey the flavour and motivation of the science of probability, but they tend to hide the realities which make it both possible and important for cognizant people to agree when assigning probabilities to uncertain outcomes. The possibility of agreement arises basically from common perceptions of symmetries, such as symmetries among cases counted to provide frequencies, or symmetries which underlie assumptions of exchangeability or of equally likely cases. The importance of agreement may be illustrated by the statistician who expresses his inferences about an unknown parameter value in terms of a set of betting odds. If this statistician accepts any bet proposed at his stated odds, and if he wagers with colleagues who consistently have more information, perhaps in the form of larger samples, then he is sure to suffer disaster in the long run. The moral is that probabilities can scarcely be fair for business deals unless both parties have approximately the same probability assessments, presumably based on similar knowledge or information. Likewise, probability inferences can contribute little to public science unless they are as objective as the web of generally accepted fact on which they are based. While knowledge may certainly be personal, the communication of knowledge is one of the most fundamental of human endeavours. Statistical inference can be viewed as the science whose formulations make it possible to communicate partial knowledge in the form of probabilities.

Generalized Bayesian inference seeks to permit improvement on classical Bayesian inference through a complex trade-off of advantages and disadvantages. On the credit side, the requirement of a global probability law is dropped and it becomes possible to work with only those probability assumptions which are based on readily apparent symmetry conditions and are therefore reasonably objective. For example, in a wide class of sampling models, including the trinomial sampling model analysed in Sect. 2, no probabilities are assumed except the familiar and non-controversial representation of a sample as n independent and identically distributed random elements from a population.


Beyond this, further assumptions like specific parametric forms or prior distributions for parameters need be put in only to the extent that they appear to command a fair degree of assent.

The new inference procedures do not in general yield exact probabilities for desired inferences, but only bounds for such probabilities. While it may count as a debit item that inferences are less precise than one might have hoped, it is a credit item that greater flexibility is allowed in the representation of a state of knowledge. For example, a state of total ignorance about an uncertain event T is naturally represented by an upper probability P*(T) = 1 and a lower probability P_*(T) = 0. The new flexibility thus permits a simple resolution of the old controversy about how to represent total ignorance via a probability distribution. In real life, ignorance is rarely so total that (0, 1) bounds are justified, but ignorance is likely to be such that a precise numerical probability is difficult to justify. I believe that experience and familiarity will show that the general range of bounds 0 ≤ P_*(T) ≤ P*(T) ≤ 1 provides a useful tool for representing degrees of knowledge.

Upper and lower probabilities apparently originated with Boole (1854) and have reappeared after a largely dormant period in Good (1962) and Smith (1961, 1965). In this paper upper and lower probabilities are generated by a specific mathematical device whereby a well-defined probability measure over one sample space becomes diffused in its application to directly interesting events. In order to illustrate the idea simply, consider a map showing regions of land and water. Suppose that 0.80 of the area of the map is visible and that the visible area divides in the proportions 0.30 to 0.70 of water area to land area. What is the probability that a point drawn at random from the whole map falls in a region of water? Since the visible water area is 0.24 of the total area of the map, while the unobserved 0.20 of the total area could be water or land, it can be asserted only that the desired probability lies between 0.24 and 0.44. The model supposes a well-defined uniform distribution over the whole map. Of the total measure of unity, the fraction 0.24 is associated with water, the fraction 0.56 is associated with land, and the remaining fraction 0.20 is ambiguously associated with water or land. Note the implication of total ignorance of the unobserved area. There would be no objection to introducing other sources of information about the unobserved area. Indeed, if such information were appropriately expressed in terms of an upper and lower probability model, it could be combined with the above information using a rule of combination defined within the mathematical system. A correct analogy can be drawn with prior knowledge of parameter values, which can likewise be formally incorporated into inferences based on sample data, using the same rule of combination. The general mathematical system, as given originally in Dempster (1967a), will be unfolded in Sect. 2 and will be further commented upon in Sect. 4.
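The arithmetic of the map illustration can be set out mechanically. The following sketch is not part of the original paper: it simply represents the three portions of the map as masses attached to subsets of S = {water, land} and recomputes the bounds quoted above; the function names and the mass bookkeeping are illustrative assumptions only.

    masses = {
        frozenset({"water"}): 0.24,          # visible water
        frozenset({"land"}): 0.56,           # visible land
        frozenset({"water", "land"}): 0.20,  # unobserved area: could be either
    }

    def upper(T, masses):
        # P*(T): mass of portions that could lie in T, normalized by the non-empty mass.
        T = frozenset(T)
        total = sum(m for A, m in masses.items() if A)
        return sum(m for A, m in masses.items() if A and A & T) / total

    def lower(T, masses):
        # P_*(T): mass of non-empty portions that must lie in T.
        T = frozenset(T)
        total = sum(m for A, m in masses.items() if A)
        return sum(m for A, m in masses.items() if A and A <= T) / total

    print(lower({"water"}, masses), upper({"water"}, masses))   # 0.24 0.44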

If the inference procedures suggested in this paper are somewhat speculative in nature, the reason lies, I believe, not in a lack of objectivity in the probability assumptions, nor in the upper and lower probability feature.


Rather, the source of the speculative quality is to be found in the logical relationships between population members and their observable characteristics which are postulated in each model set up to represent sampling from a population. These logical relationships are conceptual devices, which are not regarded as empirically checkable even in principle, and they are somewhat arbitrary. Their acceptability will be analysed in Sect. 5, where it will be argued that the arbitrariness may correspond to something real in the nature of an uncertainty principle.

A degree of arbitrariness does not in itself rule out a method of statistical inference. For example, confidence statements are widely used in practice despite the fact that many confidence procedures are often available within the same model and for the same question, and there is no well-established theory for automatic choice among available confidence procedures. In part, therefore, the usefulness of generalized Bayesian inference procedures will require that practitioners experiment with them and come to feel comfortable with them. Relatively few procedures are as yet analytically tractable, but two examples are included, namely, the trinomial sampling inference procedures of Sect. 2, and a procedure for distinguishing between monotone upward and monotone downward sequences of binomial pi as given in Sect. 3. Another model is worked through in detail in Dempster (1967b).

Finally, an acknowledgement is due to R.A. Fisher, who announced with characteristic intellectual boldness, nearly four decades ago, that probability inferences were indeed possible outside of the Bayesian formulation. Fisher compiled a list of examples and guide-lines which seemed to him to lead to acceptable inferences in terms of probabilities which he called fiducial probabilities. The mathematical formulation of this paper is broad enough to include the fiducial argument in addition to standard Bayesian methods. But the specific models which Fisher advocated, depending on ingenious but often controversial pivotal quantities, are replaced here by models which start further back at the concept of a population explicitly represented by a mathematical space. Fisher did not consider models which lead to separated upper and lower probabilities, and indeed went to some lengths, using sufficiency and ancillarity, and arranging that the spaces of pivotal quantities and of parameters be of the same dimension, in order to ensure that ambiguity did not appear. This paper is largely an exploration of fiducial-like arguments in a more relaxed mathematical framework. But, since Bayesian methods are more in the main stream of development, and since I do explicitly provide for the incorporation of prior information, I now prefer to describe my methods as extensions of Bayesian methods rather than alternative fiducial methods. I believe that Fisher too regarded fiducial inference as being very close to Bayesian inference in spirit, differing primarily in that fiducial inference did not make use of prior information.


2 Upper and Lower Probability Inferences Illustrated on a Model for Trinomial Sampling

A pair of sample spaces X and S underlie the general form of mathematical model appearing throughout this work. The first space X carries an ordinary probability measure μ, but interest centres on events which are identified with subsets of S. A bridge is provided from X to S by a logical relationship which asserts that, if x is the realized sample point in X, then the realized sample point s in S must belong to a subset Γx of S. Thus a basic component of the model is a mathematical transformation Γ which associates a subset Γx of S with each point x of X. Since the Γx determined by a specific x contains in general many points (or branches or values), the transformation x → Γx may be called a multivalued mapping. Apart from measurability considerations, which are ignored in this paper, the general model is defined by the elements introduced above and will be labelled (X, S, μ, Γ) for convenient reference. Given (X, S, μ, Γ), upper and lower probabilities P*(T) and P_*(T) are determined for each subset T of S.

In the cartographical example of Sect. 1, X is defined by the points of the map, S is defined by two points labelled water and land, μ is the uniform distribution of probability over the map, and Γ is the mapping which associates the single point water or land in S with the appropriate points of the visible part of X and associates both points of S with the points of the unseen part of X. For set-theoretic consistency, Γx should be regarded as a single point subset of S, rather than a single point itself, over the visible part of X, but the meaning is the same either way.

The general definitions of P*(T) and P_*(T) as given in Dempster (1967a) are repeated below in more verbal form. For any subset T of S, define T* to be the set of points x in X for which Γx has a non-empty intersection with T, and define T_* to be the set of points x in X for which Γx is contained in T but is not empty. In particular, the sets S* and S_* coincide. The complement X − S* of S* consists of those x for which Γx is the empty set. Now define the upper probability of T to be

    P*(T) = μ(T*)/μ(S*)    (1)

and the lower probability of T to be

    P_*(T) = μ(T_*)/μ(S*).    (2)

Note that, since T_* ⊂ T* ⊂ S*, one has

    0 ≤ P_*(T) ≤ P*(T) ≤ 1.    (3)

Also, if S − T denotes the complement of T in S, then (S − T)* and (S − T)_* are respectively the complements of T_* and T* in S*, so that

    P*(T) = 1 − P_*(S − T) and P_*(T) = 1 − P*(S − T).    (4)
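As a hedged illustration of definitions (1)-(4), and not anything prescribed by the paper, the short sketch below evaluates P_* and P* for a small finite model (X, S, μ, Γ) in which one point of X has an empty image, so that the divisor μ(S*) of (1) and (2) comes into play; the particular masses and sets are invented for the example.

    def upper_lower(T, mu, Gamma):
        # mu: dict of point masses on X; Gamma[x]: subset of S (possibly empty).
        T = frozenset(T)
        S_star  = sum(m for x, m in mu.items() if Gamma[x])                    # mu(S*)
        T_star  = sum(m for x, m in mu.items() if Gamma[x] & T)                # mu(T*)
        T_lower = sum(m for x, m in mu.items() if Gamma[x] and Gamma[x] <= T)  # mu(T_*)
        return T_lower / S_star, T_star / S_star                               # (lower, upper)

    S = frozenset({"a", "b", "c"})
    mu = {1: 0.2, 2: 0.3, 3: 0.4, 4: 0.1}
    Gamma = {1: frozenset({"a"}), 2: frozenset({"a", "b"}),
             3: frozenset({"b", "c"}), 4: frozenset()}       # point 4 carries no information

    T = {"a"}
    lo, up = upper_lower(T, mu, Gamma)                        # about (0.222, 0.556)
    lo_c, up_c = upper_lower(S - frozenset(T), mu, Gamma)     # complement of T
    assert abs(up - (1 - lo_c)) < 1e-12                       # property (4)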


Other formal consequences of the above definitions are explored in Dempster (1967a).

The heuristic conception which motivates (1) and (2) is the idea of carrying probability elements dμ from X to S along the branches of the mapping Γx. The ambiguity in the consequent probability measure over S occurs because the probability element dμ(x) associated with x in X may be carried along any branch of Γx or, more generally, may be distributed over the different branches of Γx for each x. Part of the measure, namely the measure of the set X − S* consisting of points x such that Γx is empty, cannot be moved from X at all. Since there is an implicit assumption that some s in S is actually realized, it is appropriate to condition by S* when defining relevant probabilities. This explains the divisor μ(S*) appearing in (1) and (2). Among all the ways of transferring the relevant probability μ(S*) from X to S along branches of Γx, the largest fraction which can possibly follow branches into T is P*(T), while the smallest possible fraction is P_*(T). Thus conservative probability judgements may be rendered by asserting only that the probability of T lies between the indicated upper and lower bounds.

It may also be illuminating to view Γx as a random set in S generated by the random point x in X, subject to the condition that Γx is not empty. After conditioning on S*, P*(T) is the probability that the random set Γx intersects the fixed set T, while P_*(T) is the probability that the random set Γx is contained in the fixed set T.

A probability model like (X, S, μ, Γ) may be modified into other probability models of the same general type by conditioning on subsets of S. Such conditioning on observed data defines the generalized Bayesian inferences of this paper. Beyond and generalizing the concept of conditioning, there is a natural rule for combining or multiplying several independent models of the type (X, S, μ, Γ) to obtain a single product model of the same type. For example, the models for n independent sample observations may be put together by the product rule to yield a single model for a sample of size n, and the model defining prior information may be combined with the model carrying sample information by the same rule. The rules for conditioning and multiplying will be transcribed below from Dempster (1967a) and will be illustrated on a model for trinomial sampling. First, however, the elements of the trinomial sampling model will be introduced for a sample of size one.

Each member of a large population, shortly to be idealized as an infinite population, is supposed known to belong to one of three identifiable categories c1, c2 and c3, where the integer subscripts do not indicate a natural ordering of the categories. Thus the individuals of the population could be balls in an urn, identical in appearance apart from their colours which are red (c1) or white (c2) or blue (c3). A model will be defined which will ultimately lead to procedures for drawing inferences about unknown population proportions of c1, c2 and c3, given the categories of a random sample of size n from the population. Following Dempster (1966), the individuals of the population will be explicitly represented by the points of a space U, and the randomness


is defined to be the subset of U consisting of points u for which Bu contains (ci, π). The subsets U1, U2, U3 defined by a given π in Π are illustrated in Fig. 1.

It is easily checked with the help of Fig. 1 that

    μ(Ui) = πi and μ(Ui ∩ Uj) = 0    (9)

for i, j = 1, 2, 3 and i ≠ j. It will be shown later that the property (9) is a basic requirement for the mapping B defined in (7). Other choices of U and B could be made which would also satisfy (9). Some of these choices amount to little more than adopting different coordinate systems for U, but other possible choices differ in a more fundamental way. Thus an element of arbitrariness enters the model for trinomial sampling at the point of choosing U and B. The present model was introduced in Dempster (1966) under the name structure of the second kind. Other possibilities will be mentioned in Sect. 5.

All of the pieces of the model (U, C × Π, μ, B) are now in place, so that upper and lower probabilities may be computed for subsets T of C × Π. It turns out, however, that P*(T) = 1 and P_*(T) = 0 for interesting choices of T, and that interesting illustrations of upper and lower probabilities are apparent only after conditioning. For example, take T to be the event that category c1 will be observed in a single drawing from the population, i.e. T = C1 × Π, where C1 is the subset of C consisting of c1 only. To check that P*(T) = 1 and P_*(T) = 0, note (i) that T* = U because every u in U lies in U1 of Fig. 1 for some (c1, π) in C1 × Π, and (ii) that T_* is empty because no u in U lies in U1 for all (c1, π) in C1 × Π. In general, any non-trivial event governed by C alone or by Π alone will have upper probability unity and lower probability zero. Such a result is sensible, for if no information about π is put into the system no information about a sample observation should be available, while if no sample observation is in hand there should be no available information about π. (Recall the interpretation suggested in Sect. 1 that P*(T) = 1 and P_*(T) = 0 should convey a state of complete ignorance about whether or not the real world outcome s will prove to lie in T.)

Turning now to the concept of upper and lower conditional probabilities, the definition which fits naturally with the general model (X, S, μ, Γ) arises as follows. If information is received to the effect that sample points in S − T are ruled out of consideration, then the logical assertion that x in X must correspond to s in Γx ⊂ S is effectively altered to read that x in X must correspond to s in Γx ∩ T ⊂ S. Thus the original model (X, S, μ, Γ) is conditioned on T by altering (X, S, μ, Γ) to (X, S, μ, Γ̃), where the multivalued mapping Γ̃ is defined by

    Γ̃x = Γx ∩ T.    (10)

Under the conditioned model, an outcome in S − T is regarded as impossible, and indeed the set S − T has upper and lower conditional probabilities both zero.


It is sufficient for practical purposes, therefore, to take the conditional model to be (X, T, μ, Γ̃) and to consider upper and lower conditional probabilities only for subsets of T.

Although samples of size one are of little practical interest, the model for a single trinomial observation provides two good illustrations of the definition of a conditioned model. First, it will be shown that conditioning on a fixed value of π = (π1, π2, π3) results in πi being both the upper and lower conditional probability of an observation ci, for i = 1, 2, 3. This result is equivalent to (9) and explains the importance of (9), since any reasonable model should require that the population proportions be the same as the probabilities of the different possible outcomes in a single random drawing when the population proportions are known. Second, it will be shown that non-trivial inferences about π may be obtained by conditioning on the observed category c of a single individual randomly drawn from U.

In precise mathematical terms, to condition the trinomial sampling model (U, C × Π, μ, B) on a fixed π is to condition on T = C × Π0, where Π0 is the subset of Π consisting of the single point π. T itself consists of the three points (c1, π), (c2, π) and (c3, π), which in turn define single point subsets T1, T2 and T3 of T. The conditioned model may be written (U, T, μ, B̃) where B̃u = Bu ∩ T for all u. By referring back to the definition of B as illustrated in Figs. 1 and 2, it is easily checked that the set of u in U such that B̃u intersects Ti is the closed triangle Ui appearing in Fig. 1, while the set of u in U such that B̃u is contained in Ti is the open triangle Ui, for i = 1, 2, 3. Whether open or closed, the triangle Ui has measure πi, and it follows easily from (9) that the upper and lower conditional probabilities of Ti given T are

    P*(Ti | T) = P_*(Ti | T) = πi,    (11)

for i = 1, 2, 3. Note that B̃u is not empty for any u in U, so that the denominators in (1) and (2) are both unity in the application (11).
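Property (11) can be checked numerically. The sketch below is not from the paper; it assumes the structure-of-the-second-kind rule that reappears as (21) below: u is drawn uniformly from the simplex and assigned to the region Ui for which πi/ui is largest, and the empirical frequencies should then approach the chosen π.

    import random

    def draw_u():
        # Uniform point on the 2-simplex via normalized exponentials.
        e = [random.expovariate(1.0) for _ in range(3)]
        s = sum(e)
        return [x / s for x in e]

    def category(u, pi):
        # Index i maximizing pi_i / u_i, i.e. the region U_i containing u.
        ratios = [p / x for p, x in zip(pi, u)]
        return ratios.index(max(ratios))

    pi = (0.2, 0.3, 0.5)
    N = 200_000
    counts = [0, 0, 0]
    for _ in range(N):
        counts[category(draw_u(), pi)] += 1

    print([c / N for c in counts])   # close to (0.2, 0.3, 0.5), as (11) asserts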

Consider next the details of conditioning the trinomial model on a fixed observation c1. The cases where a single drawing produces c2 or c3 may be handled by permuting indices. Observing c1 is formally represented by conditioning on T = C1 × Π, where C1 as above is the subset of C consisting of c1 alone. In the conditional model (U, T, μ, B̃), the space T is represented by the first level in Fig. 2 while B̃u is represented by the closed shaded region in that first level. Since B̃u is non-empty for all u in U, the measure μ may be used directly without renormalization to compute upper and lower conditional probabilities given T. An event R defined as a subset of Π is equivalently represented by the subset C1 × R of T. The upper conditional probability of C1 × R given T is the probability that the random region B̃u intersects C1 × R, where (c1, u) is uniformly distributed over C1 × Π. See Fig. 3. Similarly, the lower conditional probability of C1 × R given T is the probability that the random region B̃u is contained in C1 × R. For example, if R is the lower portion of the triangle Π where 0 ≤ π1 ≤ π1', then


Fig. 3. The triangle T = C1 × Π for the model conditioned on the observation c1. Horizontal shading covers the region B̃u, while vertical shading covers a general fixed region C1 × R. (The vertices of the triangle are labelled (c1, 1, 0, 0), (c1, 0, 1, 0) and (c1, 0, 0, 1); a general point (c1, u1, u2, u3) is marked.)

    P*(C1 × R | T) = 1 − (1 − π1')^2 = π1'(2 − π1') and P_*(C1 × R | T) = 0.

Or, in more colloquial notation,

    P*(0 ≤ π1 ≤ π1' | c = c1) = π1'(2 − π1') and P_*(0 ≤ π1 ≤ π1' | c = c1) = 0.

More generally, it can easily be checked that

    P*(π1' ≤ π1 ≤ π1'' | c = c1) = π1''(2 − π1''),    (12)

while

    P_*(π1' ≤ π1 ≤ π1'' | c = c1) = 0 if π1'' < 1,
                                  = (1 − π1')^2 if π1'' = 1
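The first of these expressions can also be checked by brute force. The sketch below is an illustration, not part of the paper: assuming the structure-of-the-second-kind rule (21), it draws u uniformly on the simplex, builds the region B̃u = {π : π1/u1 ≥ π2/u2 and π1/u1 ≥ π3/u3} on a grid, and estimates the probability that this random region meets {π1 ≤ t}; the grid resolution and sample size are arbitrary choices.

    import random

    def draw_u():
        e = [random.expovariate(1.0) for _ in range(3)]
        s = sum(e)
        return [x / s for x in e]

    def grid(step=0.02):
        # Lattice of points (pi1, pi2, pi3) on the simplex.
        pts, k = [], int(round(1 / step))
        for i in range(k + 1):
            for j in range(k + 1 - i):
                pts.append((i * step, j * step, 1 - (i + j) * step))
        return pts

    PTS = grid()
    t = 0.4
    N = 20_000
    hits = 0
    for _ in range(N):
        u = draw_u()
        region = [p for p in PTS
                  if p[0] * u[1] >= p[1] * u[0] and p[0] * u[2] >= p[2] * u[0]]
        if any(p[0] <= t for p in region):
            hits += 1

    print(hits / N, t * (2 - t))   # the two numbers should be close (here about 0.64)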


A collection of n models (X(i), S, μ(i), Γ(i)) for i = 1, 2, ..., n may be combined or multiplied to obtain a product model (X, S, μ, Γ). The formal definition of (X, S, μ, Γ) is given by

    X = X(1) × X(2) × ... × X(n),
    μ = μ(1) × μ(2) × ... × μ(n)

and

    Γx = Γ(1)x(1) ∩ Γ(2)x(2) ∩ ... ∩ Γ(n)x(n),    (16)

where x = (x(1), x(2), ..., x(n)) denotes a general point of the product space X. The product model is appropriate where the realized values x(1), x(2), ..., x(n) are regarded as independently random according to the probability measures μ(1), μ(2), ..., μ(n), while the logical relationships implied by Γ(1), Γ(2), ..., Γ(n) are postulated to apply simultaneously to a common realized outcome s in S.

It may be helpful to view the models (X(i), S, μ(i), Γ(i)) as separate sources of information about the unknown s in S. In such a view, if the n sources are genuinely independent, then the product rule (16) represents the legitimate way to pool their information.

The concept of a product model actually includes the concept of a conditioned model which was introduced earlier. Proceeding formally, the information that T occurs with certainty may be represented by a degenerate model (Y, S, μ0, Γ0), where Y consists of a single point y, while Γ0 y = T and y carries μ0 measure unity. Multiplying a general model (X, S, μ, Γ) by (Y, S, μ0, Γ0) produces essentially the same result as conditioning the general model (X, S, μ, Γ) on T. For X × Y and μ × μ0 are isomorphic in an obvious way to X and μ, while Γx ∩ Γ0 y = Γx ∩ T = Γ̃x as in (10). Thus the objective of taking account of information in the special form of an assertion that T must occur may be reached either through the rule of conditioning or through the rule of multiplication, with identical results. In particular, when T = S the degenerate model (Y, S, μ0, Γ0) conveys no information about the uncertain outcome s in S, both in the heuristic sense that upper and lower probabilities of non-trivial events are unity and zero, and in the formal sense that combining such a (Y, S, μ0, Γ0) with any information source (X, S, μ, Γ) leaves the latter model essentially unaltered.
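For finite models the product rule (16) is easy to mechanize. The sketch below is illustrative only (the dictionaries and function names are assumptions): it multiplies two sources by forming product masses and intersecting their images, and then exhibits conditioning on T as the special case of combining with a one-point model whose image is constantly T.

    def combine(model1, model2):
        # Each model is a dict {x: (mass, frozenset Gamma_x)}; returns the product model.
        out = {}
        for x1, (m1, g1) in model1.items():
            for x2, (m2, g2) in model2.items():
                out[(x1, x2)] = (m1 * m2, g1 & g2)
        return out

    def upper_lower(T, model):
        T = frozenset(T)
        norm   = sum(m for m, g in model.values() if g)
        hit    = sum(m for m, g in model.values() if g & T)
        inside = sum(m for m, g in model.values() if g and g <= T)
        return inside / norm, hit / norm

    source1 = {1: (0.6, frozenset({"a", "b"})), 2: (0.4, frozenset({"c"}))}
    source2 = {1: (0.5, frozenset({"a"})), 2: (0.5, frozenset({"b", "c"}))}
    print(upper_lower({"a"}, combine(source1, source2)))

    # Conditioning on T = {"a", "b"}, expressed as a product with a degenerate model:
    degenerate = {0: (1.0, frozenset({"a", "b"}))}
    print(upper_lower({"a"}, combine(source1, degenerate)))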

Product models are widely used in mathematical statistics to represent random samples of size n from infinite populations, and they apply directly to provide the general sample size extension of the trinomial sampling model (U, C × Π, μ, B). A random sample of size n from the population U is represented by u(1), u(2), ..., u(n) independently drawn from U according to the same uniform probability measure μ. More precisely, the sample (u(1), u(2), ..., u(n)) is represented by a single random point drawn from the product space

    Un = U(1) × U(2) × ... × U(n)    (17)

according to the product measure

    μn = μ(1) × μ(2) × ... × μ(n),    (18)


where the pairs (U(1), μ(1)), (U(2), μ(2)), ..., (U(n), μ(n)) are n identical mathematical copies of the original pair (U, μ). In a similar way, the observable categories of the n sample individuals are represented by a point in the product space

    Cn = C(1) × C(2) × ... × C(n),    (19)

where C(i) is the three-element space from which the observable category c(i) of the sample individual u(i) is taken. The interesting unknowns before sampling are c(1), c(2), ..., c(n) and π, which define a point in the space Cn × Π. Accordingly, the model which represents a random sample of size n from a trinomial population is of the form (Un, Cn × Π, μn, Bn), where it remains only to define Bn. In words, Bn is the logical relationship which requires that (7) shall hold for each u(i). In symbols,

    Bn(u(1), u(2), ..., u(n)) = B(1)u(1) ∩ B(2)u(2) ∩ ... ∩ B(n)u(n),    (20)

where B(i)u(i) consists of those points (c(1), c(2), ..., c(n), π) in Cn × Π such that

    πk/u(i)k = max(π1/u(i)1, π2/u(i)2, π3/u(i)3)    (21)

holds for the k with c(i) = ck, k = 1, 2, 3.

The model (Un, Cn × Π, μn, Bn), now completely defined, provides in itself an illustration of the product rule. For (17), (18) and (20) are instances of the three lines of (16), and hence show that (Un, Cn × Π, μn, Bn) is the product of the n models (U(i), Cn × Π, μ(i), B(i)) for i = 1, 2, ..., n, each representing an individual sample member.

As in the special case n = 1, the model (Un, Cn × Π, μn, Bn) does not in itself provide interesting upper and lower probabilities. Again, conditioning may be illustrated either by fixing π and asking for probability judgments about c(1), c(2), ..., c(n), or conversely by fixing c(1), c(2), ..., c(n) and asking for probability judgments (i.e. generalized Bayesian inferences) about π. Conditioning on fixed π leads easily to the expected generalization of (11). Specifically, if T' is the event that π has a specified value, while T'' is the event that c(1), c(2), ..., c(n) are fixed, with ni observations in category ci for i = 1, 2, 3, then

    P*(T'' | T') = P_*(T'' | T') = π1^n1 π2^n2 π3^n3.    (22)

The converse approach of conditioning on T'' leads to more difficult mathematics.

Before c(1), c(2), ..., c(n) are observed, the relevant sample space Cn × Π consists of 3^n triangles, each a copy of Π. Conditioning on a set of recorded observations c(1), c(2), ..., c(n) reduces the relevant sample space to the single triangle associated with those observations. Although this triangle is actually a subset of Cn × Π, it is essentially the same as Π and will be formally identified with Π for the remainder of this discussion. Conditioning the model (Un, Cn × Π, μn, Bn) on c(1), c(2), ..., c(n) leads therefore to the model (Un, Π, μn, B̃n), where B̃n is defined by restricting Bn to the appropriate copy of Π.


The important random subset B̃n(u(1), u(2), ..., u(n)) of Π defined by the random sample u(1), u(2), ..., u(n) will be denoted by V for short. V determines the desired inferences, that is, the upper and lower probabilities of a fixed subset R of Π are respectively the probability that V intersects R and the probability that V is contained in R, both conditional on V being non-empty.

V is the intersection of the n random regions B̃(i)u(i) for i = 1, 2, ..., n, where each B̃(i)u(i) is one of the three types illustrated on the three levels of Fig. 2, the type and level depending on whether the observation c(i) is c1, c2 or c3. Figure 4 illustrates one such region for n = 4. It is easily discovered by experimenting with pictures like Fig. 4 that the shaded region V may have 3, 4, 5 or 6 sides, but most often is empty. It is shown in Appendix A that V is non-empty with probability n1!n2!n3!/n! under independent uniformly distributed u(1), u(2), ..., u(n). Moreover, conditional on non-empty V, six random vertices of V are shown in Appendix A to have Dirichlet distributions. Specifically, define W(i) for i = 1, 2, 3 to be the point in V with maximum coordinate πi and define Z(i) for i = 1, 2, 3 to be the point in V with minimum coordinate πi. These six vertices of V need not be distinct, but are distinct with positive probability and so have different distributions. Their distributions are

    W(1) : D(n1 + 1, n2, n3),        Z(1) : D(n1, n2 + 1, n3 + 1),
    W(2) : D(n1, n2 + 1, n3),        Z(2) : D(n1 + 1, n2, n3 + 1),
    W(3) : D(n1, n2, n3 + 1),        Z(3) : D(n1 + 1, n2 + 1, n3),    (23)

Fig. 4. The triangle Π representing the sample space of unknowns after n = 4 observations c(1) = c1, c(2) = c3, c(3) = c1, c(4) = c2 have been taken. The shaded region is the realization of V determined by the illustrated realization of u(1), u(2), u(3) and u(4). (The vertices of the triangle are (1, 0, 0), (0, 1, 0) and (0, 0, 1).)


where D(r1, r2, r3) denotes the Dirichlet distribution over the triangle Π whose probability density function is proportional to

    π1^(r1−1) π2^(r2−1) π3^(r3−1).

The Dirichlet distribution is defined as a continuous distribution over Π if ri > 0 for i = 1, 2, 3. Various conventions, not listed here, are required to cover the distributions of the six vertices when some of the ni are zero.
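The claim that V is non-empty with probability n1!n2!n3!/n! lends itself to simulation. In the sketch below, which is not from the paper, each observation contributes the linear constraints implied by (21), and a linear-programming feasibility check decides whether the random polytope V is empty; scipy's linprog is used purely as a feasibility test, and the sample size is an arbitrary choice.

    from math import factorial
    import numpy as np
    from scipy.optimize import linprog

    def draw_u():
        e = np.random.exponential(size=3)
        return e / e.sum()

    def nonempty(obs, us):
        # obs: list of observed category indices k; us: matching draws u(i).
        A_ub, b_ub = [], []
        for k, u in zip(obs, us):
            for j in range(3):
                if j != k:
                    # pi_k/u_k >= pi_j/u_j  <=>  pi_j*u_k - pi_k*u_j <= 0
                    row = [0.0, 0.0, 0.0]
                    row[j] = u[k]
                    row[k] = -u[j]
                    A_ub.append(row)
                    b_ub.append(0.0)
        res = linprog(c=[0, 0, 0], A_ub=A_ub, b_ub=b_ub,
                      A_eq=[[1, 1, 1]], b_eq=[1], bounds=[(0, 1)] * 3)
        return res.status == 0      # feasible, i.e. V is non-empty

    obs = [0, 0, 1, 2]              # counts n1 = 2, n2 = 1, n3 = 1, as in the Fig. 4 example
    N = 2000
    hits = sum(nonempty(obs, [draw_u() for _ in obs]) for _ in range(N))
    expected = factorial(2) * factorial(1) * factorial(1) / factorial(4)
    print(hits / N, expected)       # both near 1/12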

Many interesting upper and lower probabilities follow from the distributions (23). For example, the upper probability that π1 exceeds π1' is the probability that V intersects the region where π1 ≥ π1', which is, in turn, the probability that the first coordinate of W(1) exceeds π1'. In symbols,

    P*(π1 ≥ π1' | n1, n2, n3) = ∫_{π1'}^{1} ∫_{0}^{1−π1} [n!/(n1!(n2−1)!(n3−1)!)] π1^n1 π2^(n2−1) π3^(n3−1) dπ2 dπ1

                              = ∫_{π1'}^{1} [n!/(n1!(n2+n3−1)!)] π1^n1 (1−π1)^(n2+n3−1) dπ1    (24)

if n2 > 0 and n3 > 0. Similarly, P_*(π1 ≥ π1' | n1, n2, n3) is the probability that the first coordinate of Z(1) exceeds π1', that is,

    P_*(π1 ≥ π1' | n1, n2, n3) = ∫_{π1'}^{1} [(n+1)!/((n1−1)!(n2+n3+1)!)] π1^(n1−1) (1−π1)^(n2+n3+1) dπ1,    (25)

again assuming no prior information about π. Two further analogues of the pair (24) and (25) may be obtained by permuting the indices so that the role of π1 is played successively by π2 and π3. In a hypothetical numerical example with n1 = 2, n2 = 1, n3 = 1 as used in Fig. 4, it is inferred that the probability of at least half the population belonging in c1 lies between 3/16 and 11/16. In passing, note that the upper and lower probabilities (24) and (25) are formally identical with Bayes posterior probabilities corresponding to the pseudo-prior distributions D(1, 0, 0) and D(0, 1, 1), respectively. This appears to be a mathematical accident with a limited range of applicability, much like the relations between fiducial and Bayesian results pointed out by Lindley (1958). In the present situation, it could be shown that the relations no longer hold for events of the form (π1' ≤ π1 ≤ π1'').
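The 3/16 and 11/16 bounds follow from (24) and (25) as Beta tail probabilities, since the first coordinate of a Dirichlet vector has a Beta marginal. A quick numerical check, offered as an illustration rather than part of the paper:

    from scipy.stats import beta

    n1, n2, n3 = 2, 1, 1
    upper = beta.sf(0.5, n1 + 1, n2 + n3)      # P*(pi1 >= 1/2)  = 11/16
    lower = beta.sf(0.5, n1, n2 + n3 + 2)      # P_*(pi1 >= 1/2) =  3/16
    print(lower, upper)                        # 0.1875 0.6875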

The model (Un, Cn × Π, μn, Bn) has the illuminating feature of remaining a product model after conditioning on the sample observations. Recall that the original model (Un, Cn × Π, μn, Bn) is expressible as the product of the n models (U(i), Cn × Π, μ(i), B(i)) for i = 1, 2, ..., n. Conditioning the original model on the observations yields (Un, T'', μn, B̃n) where, as above, T'' is the subset of Cn × Π with c(1), c(2), ..., c(n) fixed at their observed values and

    B̃n(u(1), u(2), ..., u(n)) = Bn(u(1), u(2), ..., u(n)) ∩ T''.    (26)


Conditioning the ith component model on the ith sample observation yields (U(i), T''(i), μ(i), B̃(i)), where T''(i) is the subset of Cn × Π with c(i) fixed at its observed value, and

    B̃(i)u(i) = B(i)u(i) ∩ T''(i),    (27)

for i = 1, 2, ..., n. It is clear that

    T'' = T''(1) ∩ T''(2) ∩ ... ∩ T''(n),    (28)

and from (20), (26), (27) and (28) it follows that

    B̃n(u(1), u(2), ..., u(n)) = B̃(1)u(1) ∩ B̃(2)u(2) ∩ ... ∩ B̃(n)u(n).    (29)

From (28) and (29) it is immediate that the model (Un, T'', μn, B̃n) is the product of the n models (U(i), T''(i), μ(i), B̃(i)) for i = 1, 2, ..., n. The meaning of this result is that inferences about π may be calculated by traversing two equivalent routes. First, as above, one may multiply the original n models and condition the product on T''. Alternatively, one may condition the original n models on their associated T''(i) and then multiply the conditioned models. The availability of the second route is conceptually interesting, because it shows that the information from the ith sample observation c(i) may be isolated and stored in the form (U(i), T''(i), μ(i), B̃(i)), and when the time comes to assemble all the information one need only pick up the pieces and multiply them. This basic result clearly holds for a wide class of choices of U and B, not just the particular trinomial sampling model illustrated here.

The separability of sample information suggests that prior information about π should also be stored as a model of the general type (X, Π, μ, Γ) and should be combined with sample information according to the product rule. Such prior information could be regarded as the distillation of previous empirical data. This proposal brings out the full dimensions of the generalized Bayesian inference scheme. Not only does the product rule show how to combine individual pieces of sample information: it handles the incorporation of prior information as well. Moreover, the sample information and the prior information are handled symmetrically by the product rule, thus banishing the asymmetric appearance of standard Bayesian inference. At the same time, if the prior information is given in the standard form of an ordinary probability distribution, the methods of generalized Bayesian inference reproduce exactly the standard Bayesian inferences.

A proof of the last assertion will now be sketched in the context of trinomial sampling. An ordinary prior distribution for an unknown π is represented by a model of the form (X, Π, μ, Γ) where Γ is single-valued and hence no ambiguity is allowed in the computed probabilities. Without loss of generality, the model (X, Π, μ, Γ) may be specialized to (Π, Π, μ, I), where I is the identity mapping and μ is the ordinary prior distribution over Π. For simplicity, assume that μ is a discrete distribution with probabilities p1, p2, ..., pd assigned to points π1, π2, ..., πd in Π.


From (16) it follows that the mapping associated with a product of models is single-valued if the mapping associated with any component model is single-valued. If a component model not only has a single-valued mapping, but has a discrete measure as well, then the product model is easily seen to reduce to another discrete distribution over the same carriers π1, π2, ..., πd. Indeed the second line of (16) shows that the product model assigns probabilities P(πi) to πi which are proportional to pi li, where li is the probability that the random region V includes the point πi. Setting πi = (πi1, πi2, πi3), it follows from the properties of the random region V that

    li = πi1^n1 πi2^n2 πi3^n3,    (30)

which is just the probability that all of the independent random regions whose intersection is V include πi. Normalizing the product model as indicated in (1) or (2) leads finally to

    P(πi) = pi li / (p1 l1 + p2 l2 + ... + pd ld)    (31)

for i = 1, 2, ..., d, which is the standard form of Bayes's theorem. This result holds for any choices of U and B satisfying (9). Note that li is identical with the likelihood of πi.
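A minimal numerical sketch of (30)-(31), with invented support points and prior weights, shows the reduction to ordinary Bayes's theorem:

    candidates = [(0.6, 0.3, 0.1), (0.3, 0.4, 0.3), (0.1, 0.2, 0.7)]   # hypothetical support points
    prior = [0.5, 0.3, 0.2]
    n1, n2, n3 = 2, 1, 1                                                # observed counts

    lik = [p1**n1 * p2**n2 * p3**n3 for (p1, p2, p3) in candidates]     # (30)
    norm = sum(p * l for p, l in zip(prior, lik))
    posterior = [p * l / norm for p, l in zip(prior, lik)]              # (31)
    print(posterior)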

Generalized Bayesian inference permits the use of sample information alone, which is mathematically equivalent to adopting the informationless prior model in which all upper probabilities are unity and all lower probabilities are zero. At another extreme, it permits the incorporation of a familiar Bayesian prior distribution (if it is a genuine distribution) and then yields the familiar Bayesian inferences. Between these extremes a wide range of flexibility exists. For example, a prior distribution could be introduced for the coordinate π1 alone, while making no prior judgment about the ratio π2/π3. Alternatively, one could specify prior information to be the same as that contained in a sample of size m which produced mi observations in category ci for i = 1, 2, 3. In the analysis of quite small samples, it would be reasonable to attempt to find some characterization of prior information which could reflect tolerably well public notions about π. In large samples, the inferences clearly resemble Bayesian inferences and are insensitive to prior information over a wide range.

    3 A Second Illustration

Consider a sequence of independent Bernoulli trials represented by zi with

    P(zi = 1 | pi) = pi and P(zi = 0 | pi) = 1 − pi, for i = 1, 2, ..., n,    (32)

where it is suspected that the sequence pi is subject to a monotone upward drift. In this situation, the common approach to a sequence of observations


It is clear that T12* = T1* ∩ T2* and that T12*, T1* − T12* and T2* − T12* are disjoint sets whose union is S*.

Un may be decomposed into n! geometrically similar simplexes, each characterized by a particular ordering of the values of the coordinates (u1, u2, ..., un). These simplexes are in one-to-one correspondence with the permutations

    (1, 2, ..., n) → (γ1, γ2, ..., γn),

where for every (u1, u2, ..., un) in a given simplex the corresponding permutation obeys u_{γ1} ≤ u_{γ2} ≤ ... ≤ u_{γn}. Since the characterizations (41), (42) and (43) involve only order relations among the coordinates ui, each of the simplexes is either included or excluded as a unit from T1* or T2* or T12*. And since each of the n! simplexes has μn measure 1/n!, the μn measures of T1*, T2* and T12* may be found by counting the appropriate number of simplexes and dividing by n!. Or, instead of counting simplexes, one may count the permutations to which they correspond. The permutation

    (1, 2, ..., n) → (γ1, γ2, ..., γn)

carries the observed sequence (z1, z2, ..., zn) of zeros and ones into another sequence (z'1, z'2, ..., z'n) of zeros and ones. According to the definition of T1*, a simplex is contained in T1* if and only if its corresponding permutation has the property that γi < γj for all i < j such that zi = 1 and zj = 0, i.e. any pair ordered (1, 0) extracted from (z1, z2, ..., zn) must retain the same order in the permuted sequence (z'1, z'2, ..., z'n). Similarly, to satisfy T2* any pair ordered (0, 1) extracted from (z1, z2, ..., zn) must have its order reversed in the permuted sequence, while to satisfy T12* = T1* ∩ T2* the sequence (z'1, z'2, ..., z'n) must consist of all ones followed by all zeros.

If (z1, z2, ..., zn) contains n1 ones and n2 zeros, then a simple counting of permutations yields

    μn(T12*) = n1! n2! / n!.    (44)

A simple iterative procedure for computing μn(T1*) or μn(T2*) is derived in Appendix B by Herbert Weisberg. The result is quoted below and illustrated on a numerical example.

For a given sequence of observations z1, z2, ... of indefinite length, define N(n) to be the number of permutations of the restricted type counted in T1*. N(n) may be decomposed into

    N(n) = Σ_{k=0}^{r} N(k, n),    (45)

where N(k, n) counts the subset of permutations such that (z'1, z'2, ..., z'n) has k zeros preceding the rightmost one. Since no zero which follows the rightmost one in the original sequence (z1, z2, ..., zn) can be permuted to the left of any one under any allowable permutation, the upper limit r in (45) may be taken as the number of zeros preceding the rightmost one in the original sequence (z1, z2, ..., zn).


In the special case of a sequence consisting entirely of zeros, all of the zeros will be assumed to follow the rightmost one, so that N(k, n) = 0 for k > 0 and indeed N(n) = N(0, n) = n!. Weisberg's iterative formula is

    N(k, n+1) = Σ_{j=0}^{k−1} N(j, n) + (n1 + 1 + k) N(k, n)   if z_{n+1} = 1,
              = (n2 + 1 − k) N(k, n)                            if z_{n+1} = 0,    (46)

where n1 and n2 denote, as above, the numbers of ones and zeros, respectively, in (z1, z2, ..., zn).

Formula (46) has the pleasant feature that the counts for the sequences (z1), (z1, z2), (z1, z2, z3), ... may be built up successively, and further observations may be easily incorporated as they arrive. Consider, for example, the hypothetical observations

    (z1, z2, ..., z7) = (0, 0, 1, 1, 0, 1, 1).

Table 1 shows zn, N(0, n), ..., N(r, n) on line n, for n = 1, 2, ..., 7, from which N(7) = 1680. The number of permutations consistent with T2* is found by applying the same iterative process to the sequence (1, 1, 0, 0, 1, 0, 0), i.e. to the original sequence with zeros and ones interchanged. This yields Table 2, from which N(7) = 176. The number of permutations common to T1* and T2* is 3!4! = 144. Thus μn(T1*) = 1680/7!, μn(T2*) = 176/7!, μn(T12*) = 144/7!, and μn(S*) = (1680 + 176 − 144)/7! = 1712/7!. Consequently, the upper and lower probabilities of T1, T2 and T12 conditional on S* and (z1, z2, ..., z7) = (0, 0, 1, 1, 0, 1, 1) are

    P*(T1) = 1680/1712,   P_*(T1) = 1536/1712,
    P*(T2) = 176/1712,    P_*(T2) = 32/1712,
    P*(T12) = 144/1712,   P_*(T12) = 0.

Table 1.

    n   zn   N(0, n)   N(1, n)   N(2, n)   N(3, n)
    1   0    1
    2   0    2
    3   1    2         2         2
    4   1    4         8         12
    5   0    12        16        12
    6   1    36        76        88        40
    7   1    144       416       640       480


Table 2.

    n   zn   N(0, n)   N(1, n)   N(2, n)
    1   1    1
    2   1    2
    3   0    2
    4   0    4
    5   1    12        4         4
    6   0    36        8         4
    7   0    144       24        8

Since more than 10% of the measure could apply to a monotone non-increasing sequence, the evidence for an increasing sequence is not compelling.
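Weisberg's recursion (46) is easily programmed. The sketch below is not the appendix derivation itself; it simply rebuilds the counts of Tables 1 and 2 for the sequence (0, 0, 1, 1, 0, 1, 1) and reproduces the upper and lower probabilities quoted above, with the function name and data layout as illustrative choices.

    from math import factorial

    def counts(z):
        # N(k, n) of (45)-(46) for the full sequence z, returned as a dict {k: count}.
        N = {0: 1}              # n = 1: one permutation, no zeros before the rightmost one
        n1, n2 = 0, 0
        for i in range(1, len(z)):
            n1 += z[i - 1] == 1
            n2 += z[i - 1] == 0
            if z[i] == 1:
                N = {k: sum(N.get(j, 0) for j in range(k)) + (n1 + 1 + k) * N.get(k, 0)
                     for k in range(n2 + 1)}
            else:
                N = {k: (n2 + 1 - k) * v for k, v in N.items()}
        return {k: v for k, v in N.items() if v}

    z = [0, 0, 1, 1, 0, 1, 1]
    t1 = sum(counts(z).values())                          # 1680, as in Table 1
    t2 = sum(counts([1 - x for x in z]).values())         # 176, as in Table 2
    t12 = factorial(sum(z)) * factorial(len(z) - sum(z))  # 144
    s = t1 + t2 - t12                                     # 1712
    print(t1 / s, (s - t2) / s)                           # P*(T1), P_*(T1)
    print(t2 / s, (s - t1) / s)                           # P*(T2), P_*(T2)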

For the extended sequence of observations 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, ..., the lower and upper probabilities of a monotone downward sequence after n observations are exhibited in Table 3.

Table 3.

    n    P_*(T2)   P*(T2)
    1    0         1
    2    0         1
    3    0         0.333
    4    0         0.167
    5    0.167     0.417
    6    0.048     0.190
    7    0.019     0.103
    8    0.188     0.319
    9    0.065     0.148
    10   0.028     0.080
    11   0.014     0.047

4 Comments on the Method of Generating Upper and Lower Probabilities

Although often notationally convenient, it is unnecessary to use models (X, S, μ, Γ) outside of the subclass where the inverse of Γ is single-valued. For the model (X, S̄, μ, Γ̄) with

    S̄ = X × S    (47)

and

    Γ̄x = {x} × Γx    (48)


does belong to the stated subclass, and yields

    (P*(T), P_*(T)) = (P*(X × T), P_*(X × T))    (49)

for any T ⊂ S, where the left side of (49) refers to any original model (X, S, μ, Γ) and the right side refers to the corresponding model (X, S̄, μ, Γ̄). Moreover, the model (X, S̄, μ, Γ̄) provides upper and lower probabilities for all subsets of X × S, not just those of the form X × T. On the other hand, it was assumed in applying the original form (X, S, μ, Γ) that the outcome x in X is conceptually unobservable, so that no operational loss is incurred by the restriction to subsets of the form X × T ⊂ S̄.

Underlying the formalism of (X, S, μ, Γ) or its equivalent (X, S̄, μ, Γ̄) is the idea of a probability model which assigns a distribution only over a partition of a complete sample space, specifically the distribution over the partition of S̄ = X × S defined by X. Thus the global probability law of an ordinary probability measure space is replaced by a marginal distribution or what might be called a partial probability law. The aim therefore is to establish a useful probability calculus on marginal or partial assumptions.

I believe that the most serious challenges to applications of the new calculus will come not from criticism of the logic but from the strong form of ignorance which is necessarily built into less-than-global probability laws. To illustrate, consider a simple example where w1 denotes a measured weight, w2 denotes a true weight, and x = w1 − w2 denotes a measurement error. Assume that ample relevant experience is available to justify assigning a specific error distribution μ over the space X of possible values of x. The situation may be represented by the model (X, W, μ, Γ) with X and μ as defined, with W = {(w1, w2); w1 ≥ 0, w2 ≥ 0}, and Γ defined by the relation x = w1 − w2. Conditioning the model on an observed w1 leaves one with the same measure applied to w1 − w2, except for renormalization which restricts the measure to w2 ≥ 0. The result is very much in the spirit of the fiducial argument (although there is some doubt about Fisher's attitude to renormalization). I am unable to fault the logic of this fiducial-like argument. Rather, some discomfort is produced by distrust of the initial model, in particular by its implication that every uncertain event governed by the true weight w2 has initial upper and lower probabilities one and zero. It would be hard to escape a feeling in most real situations that a good bit of information about a parameter is available, even if difficult to formalize objectively, and that such information should clearly alter the fiducial-like inference if it could be incorporated. One way to treat this weakness is openly to eschew the use of prior information, while not necessarily denying its existence, that is, to assert that the statistician should summarize only that information which relies on the observation w1 and the objectively based error distribution.


Because of the conservatism implicit in the definition of upper and lower probabilities, the approach of rejecting soft information seems likely to provide conservative inferences on an average, but I have not proved theorems to this effect. The difficulty is that the rejection of all soft information, including even information about parametric forms, may lead to unrealistically weak inferences. The alternative approach is to promote vague information into as precise a model as one dares and combine it in the usual way with sample information.
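The weighing example can be made concrete with a toy discrete error distribution. The sketch below is an invented illustration, not the paper's: it conditions on an observed w1, carries the error masses over to w2 = w1 − x, and renormalizes after discarding w2 < 0, in the fiducial-like manner described above.

    error_dist = {-2: 0.1, -1: 0.2, 0: 0.4, 1: 0.2, 2: 0.1}   # hypothetical error law for x

    def posterior_w2(w1, error_dist):
        # Map each error mass to w2 = w1 - x, drop w2 < 0, renormalize.
        masses = {w1 - x: m for x, m in error_dist.items() if w1 - x >= 0}
        total = sum(masses.values())
        return {w2: m / total for w2, m in masses.items()}

    print(posterior_w2(1.0, error_dist))   # w2 = 3.0, 2.0, 1.0, 0.0 kept; w2 = -1.0 dropped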

Some comments on the mathematics of upper and lower probabilities are appropriate. A very general scheme for assigning upper and lower probabilities to the subsets of a sample space S is to define a family C of measures P over S and to set

    P*(T) = sup_{P∈C} P(T),   P_*(T) = inf_{P∈C} P(T).    (50)

Within the class of systems of upper and lower probabilities achieved in this way for different C, there is a hierarchical scheme of shrinking subclasses ending with the class of systems defined by models like (X, S, μ, Γ). (See Dempster, 1967a.) The family C corresponding to a given (X, S, μ, Γ) consists of all measures P which for each x distribute the probability element dμ(x) in some way over Γx. Some readers may feel that all systems should be allowed, not just the subclass of this paper. In doing so, however, one loses the conception of a source of information as being a single probability measure. For, in the unrestricted formulation of (50), the class C consists of conceptually distinct measures such as might be adopted by a corresponding class of personalist statisticians, and the conservatism in the bounds of (50) amounts to an attempt to please both extremes in the class of personalist statisticians. I believe that the symmetry arguments underlying probability assignments do not often suggest hypothetical families C demanding simultaneous satisfaction. Also, the rules of conditioning and, more generally, of combination of independent sources of information do not extend to the unrestricted system (50), and without these rules the spirit of the present approach is lost.

The aim of this short section has been to suggest that upper and lower probabilities generated by multivalued mappings provide a flexible means of characterizing limited amounts of information. They do not solve the difficult problems of what information should be used, and of what model appropriately represents that information. They do not provide the only way to discuss meaningful upper and lower probabilities. But they do provide an approach with a well-rounded logical structure which applies naturally in the statistical context of drawing inferences from samples to populations.

    5 Comments on the Models Used for Inference

The models used here for the representation of sampling from a population take as their point of departure a space whose elements correspond to the members of the population. In addition to the complex of observable characteristics usually postulated in mathematical statistics, each population member is given an individual identity. In conventional mathematical statistics the term hypothesis is often used for an unknown population distribution of observable characteristics, but the presence of the population space in the model leads directly to the more fundamental question of how each hypothesized population distribution applies to the elements of the population space, that is, under a given hypothesis what are the observable characteristics of each population member?


In the trinomial sampling model of Sect. 2, the question is answered by the multivalued mapping B defined in (7). As illustrated in Fig. 1, B asserts that for each hypothesis π the population space U partitions into three regions U1, U2, U3 corresponding to the observable characteristics c1, c2, c3. More generally, the observable characteristics may be multinomial with k categories c1, c2, ..., ck and the population space U may be any space with an associated random sampling measure μ. For a given hypothesis π = (π1, π2, ..., πk) the question is answered by determining subsets U1, U2, ..., Uk of U which specify that a population member in Ui is permitted to have characteristic ci under π, for i = 1, 2, ..., k. Having reached this point in building the model, it seems reasonable to pose the restriction which generalizes (9), namely,

    μ(Ui) = πi and μ(Ui ∩ Uj) = 0    (51)

for i, j = 1, 2, ..., k and i ≠ j. The reason for (51), as with (9), is simply to have πi represent both upper and lower probabilities of ci for a single drawing with a given π.

Now it is evident that the above information by no means uniquely determines a model for multinomial sampling. Indeed, one may start from any continuous space U with measure μ, and for each π specify a partition U1, U2, ..., Uk satisfying (51) but otherwise completely arbitrary. In other words, there is a huge bundle of available models. In Dempster (1966), two choices were offered which I called models of the first kind and models of the second kind. The former assumes that the multinomial categories c1, c2, ..., ck have a meaningful order, and is uniquely determined by the assumption that the population members have an order consistent with the order of their observable characteristics under any hypothesis π. (See Dempster, 1967b.) The restriction to ordered categories implies essentially a univariate characteristic, and because that restriction is so severe the following discussion is mostly aimed at a general multinomial situation with no mathematical structure assumed on the space of k categories. The general model of the second kind is defined by extending (5), (6) and (7) in the obvious way from k = 3 to general k. This model treats the k categories with complete symmetry, but it is not the only model to do so, for one can define the partition U1, U2, ..., Uk arbitrarily for π such that π1 ≥ π2 ≥ ... ≥ πk, and define it for other π by symmetry. But the general model of the second kind is strikingly simple, and I recommend it because I can find no competitor with comparable aesthetic appeal.

The status of generalized Bayesian inference resembles that of Bayesian inference in the time of Bayes, by which I mean that Bayes must have adopted a uniform prior distribution because no aesthetically acceptable competitor came to mind. The analogy should be carried further, for even the principles by which competitors should be judged were not formulated by Bayes, nor have the required principles been well formulated for the models discussed here.

have the required principles been well formulated for the models discussed here. I believe that the principles required by the two situations are not at all analogous, for the nature and meaning of a prior distribution has become quite clear over the last two centuries and the concept may be carried more or less whole over to generalized Bayesian inference. The choice of a model satisfying (51), on the other hand, has no obvious connection with prior information as the term is commonly applied relative to information about postulated unknowns. In the case of generalized Bayesian inference, I believe the principles for choosing a model to be closely involved with an uncertainty principle which can be stated loosely as: the more information which one extracts from each sample individual in the form of observable characteristics, the less information about any given aspect of the population distribution may be obtained from a random sample of fixed size. For example, a random sample of size n = 1000 from a binomial population yields quite precise and nearly objective inferences about the single binomial parameter p involved. On the other hand, if a questionnaire given to a sample of n = 1000 has been sufficient to identify each individual with one of 1,000,000 categories, then it may be foolhardy to put much stock in the sample information about a binomial p chosen arbitrarily from among the 2^1,000,000 − 2 non-trivial available possibilities. Conceptually, at least, most real binomial situations are of the latter kind, for a single binomial categorization can be achieved only at the expense of suppressing a large amount of observable information about each sample individual. The uncertainty principle is therefore a specific instance of the general scientific truism that an investigator must carefully delimit and specify his area of investigation if he is to learn anything precise.

Generalized Bayesian inference makes possible precise formulations of the uncertainty principle. For example, the model of the second kind with k = 2 and n = 1000 yields inferences which most statisticians would find nearly acceptable for binomial sampling. On the other hand, it is a plausible conjecture that the model of the second kind with k = 1,000,000 and n = 1000 would yield widely separated upper and lower probabilities for most events. The high degree of uncertainty in each inference compensates for the presence of a large number of nuisance parameters, and protects the user against selection effects which would produce many spurious inferences. Use of the model of the first kind with k = 1,000,000 and n = 1000 would very likely lead to closer bounds than the model of the second kind for binomial inferences relating to population splits in accord with the given order of population members. And it is heuristically clear that models could be constructed which would place each point of U in each of U1, U2, . . . , Uk as π varies over an arbitrarily small neighbourhood about any given point. Such a model would present an extreme of uncertainty, for all upper and lower probability inferences would turn out to be one and zero, respectively. It is suggested here that the choice of a model can only be made with some understanding of the specific reflections of the uncertainty principle which it provides. For the time being, I judge that the important task is to learn more about the inferences yielded by the
aesthetically pleasing models of the second kind. Eventually, intuition and experience may suggest a broader range of plausible models.

Models of the second kind were introduced above for sampling from a general multinomial population with k categories and unknown 1 × k parameter vector π. But the range of application of these models is much wider. First, one may restrict π to parametric hypotheses of the general form π = π(θ, φ, . . .). More important, the multinomial may be allowed to have an infinite number of categories, as explained in Dempster (1966), so that general spaces of discrete and continuous observable characteristics are permissible. It is possible therefore to handle the standard parametric hypotheses of mathematical statistics. Very few of these have as yet proved analytically tractable.

At present, mainly qualitative insights are available into the overview of statistical inference which the sampling models of generalized Bayesian inference make possible. Some of these insights have been mentioned above, such as the symmetric handling of prior and sample information, and the uncertainty principle by which upper and lower probabilities reflect the degree of confusion produced by small samples from complex situations. It is interesting to note also that parametric hypotheses and prior distributions, which are viewed as quite different in conventional statistical theory, play indistinguishable roles in the logical machinery of generalized Bayesian inference. For a parametric hypothesis such as π = π(θ, φ, . . .) may be represented by a model of the general type (X, S, μ, Γ), which assigns all of its probability ambiguously over the subset of π values allowed by π(θ, φ, . . .) as θ, φ, . . . range over their permitted values, and this model combines naturally with sample information using the rule of combination defined in Sect. 2 and suggested there to be appropriate for the introduction of prior information.

Concepts which appear in standard theories of inference may reappear with altered roles in generalized Bayesian inference. Likelihood is a prime example. The ordinary likelihood function L(π) based on a sample from a general multinomial population is proportional to the upper probability of the hypothesis π. This may be verified in the trinomial example of Sect. 2 by checking that the random region illustrated in Fig. 4 covers the point π with probability π1^n1 π2^n2 π3^n3 (a Monte Carlo check of this claim is sketched at the end of this paragraph). The general result is hardly more difficult to prove. Now the upper probabilities of the individual hypotheses π, taken over all π, do not contain all the sample information under generalized Bayesian inference. Thus the likelihood principle fails in general, and the usual sets of sufficient statistics under exponential families of parametric hypotheses no longer contain all of the sample information. The exception occurs in the special case of ordinary Bayesian inference with an ordinary prior distribution, as illustrated in (31). Thus the failure of the likelihood principle is associated with the uncertainty which enters when upper and lower probabilities differ. In passing, note that marginal likelihoods are defined in the general system, that is, the upper probabilities of specific values of θ from a set of parameters θ, φ, . . . are well defined and yield a function L(θ) which may be called the marginal likelihood of θ alone. If the prior information consists of an ordinary prior distribution of θ alone, with
no prior information about the nuisance parameters, then L(θ) contains all of the sample information about θ.
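The claim above, that the random region of Fig. 4 covers a fixed point π with probability π1^n1 π2^n2 π3^n3, is easy to check numerically. The sketch below assumes the description of the per-observation regions given in the Appendix (an observation in category j constrains π through the ratios πi/πj); the parameter value and counts are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)

def region_covers(pi, counts, rng):
    """One realization of the random region R: returns True if R covers pi.

    An observation in category j contributes the constraint set
    {p : p_i/p_j <= u_i/u_j for all i != j}, where u is an auxiliary point
    drawn uniformly (Dirichlet(1,1,1)) on the simplex U.
    """
    for j, nj in enumerate(counts):
        u = rng.dirichlet([1.0, 1.0, 1.0], size=nj)   # n_j auxiliary points
        for i in range(3):
            if i != j and np.any(pi[i] / pi[j] > u[:, i] / u[:, j]):
                return False                           # some constraint excludes pi
    return True

pi = np.array([0.5, 0.3, 0.2])        # hypothetical parameter value
counts = (3, 2, 1)                    # hypothetical observed counts n1, n2, n3
trials = 200_000
hits = sum(region_covers(pi, counts, rng) for _ in range(trials))
print("Monte Carlo coverage :", hits / trials)
print("claimed likelihood   :", float(np.prod(pi ** np.array(counts))))
```

The empirical coverage should approach π1^n1 π2^n2 π3^n3 (here 0.5^3 × 0.3^2 × 0.2) as the number of trials grows.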

Unlike frequency methods, which relate to sequences of trials rather than to specific questions, the generalized Bayesian inference framework permits direct answers to specific questions in the form of probability inferences. I find that significance tests are inherently awkward and unsatisfying for questions like that posed in the example of Sect. 4, and the main reason that Bayesian inference has not replaced most frequency procedures has been the stringent requirement of a precise prior distribution. I hope that I have helped to reduce the stringency of that requirement.

    References

Boole, G. (1854). An Investigation of the Laws of Thought. Reprinted by Dover, New York (1958).

Dempster, A. P. (1966). New methods for reasoning towards posterior distributions based on sample data. Ann. Math. Statist., 37, 355–374.

Dempster, A. P. (1967a). Upper and lower probabilities induced by a multivalued mapping. Ann. Math. Statist., 38, 325–339.

Dempster, A. P. (1967b). Upper and lower probability inferences based on a sample from a finite univariate population. Biometrika, 54, 515–528.

Good, I. J. (1962). The measure of a non-measurable set. In Logic, Methodology and Philosophy of Science (edited by Ernest Nagel, Patrick Suppes and Alfred Tarski), pp. 319–329. Stanford University Press.

Lindley, D. V. (1958). Fiducial distributions and Bayes' theorem. J. R. Statist. Soc. B, 20, 102–107.

Smith, C. A. B. (1961). Consistency in statistical inference and decision (with discussion). J. R. Statist. Soc. B, 23, 1–25.

Smith, C. A. B. (1965). Personal probability and statistical analysis (with discussion). J. R. Statist. Soc. A, 128, 469–499.

    Appendix A

A derivation is sketched here for the distributions (23) relating to specific vertices of the random region R defined by (20). R is the intersection of n regions, one determined by each u(i), for i = 1, 2, . . . , n, as illustrated in Fig. 4. The region corresponding to a u(i) which gives rise to an observation c1 consists of points u such that u3/u1 ≤ u3(i)/u1(i) and u2/u1 ≤ u2(i)/u1(i). The intersection of the n1 regions corresponding to the n1 observations c1 is a region R1 consisting of points u such that

u3/u1 ≤ c13 and u2/u1 ≤ c12,   (A.1)

where c13 = min(u3(i)/u1(i)) and c12 = min(u2(i)/u1(i)), the minimization being over the subset of i corresponding to observations c1. Note that R1 together
with the n1 regions which define it are all of the type pictured on level 1 of Fig. 2. By permuting subscripts, define the analogous regions R2 with coordinates c23, c21 and R3 with coordinates c31, c32, where R2 and R3 are of the types pictured on levels 2 and 3 of Fig. 2, respectively. One is led thus to the representation

R = R1 ∩ R2 ∩ R3.   (A.2)

Any particular instance of the region R which contains at least one point is a closed polygon whose sides are characterized by fixed ratios of pairs of coordinates ui, uj. Thus R may be described by a set of six coordinates

bij = max {uj/ui : u ∈ R}   (A.3)

for i ≠ j. From (A.1), (A.2) and (A.3) it follows that

bij ≤ cij   (A.4)

for i ≠ j. Moreover, equality holds if the corresponding side of Ri is also a side of R, while the inequality is strict if the side of Ri misses R entirely. The reader may wish to satisfy himself that R may have 3, 4, 5 or 6 sides, in which case equality in (A.4) holds for 3, 4, 5 or 6 pairs i, j (with probability one).

If R is considered a random region, while R0 is a fixed region of the same type with coordinates b0ij, then

P(R ⊇ R0) = P(bij ≥ b0ij for all i ≠ j)
          = (1 + b012 + b013)^(−n1) (1 + b021 + b023)^(−n2) (1 + b031 + b032)^(−n3).   (A.5)

To prove (A.5) note first that the three events

{b12 ≥ b012, b13 ≥ b013},  {b21 ≥ b021, b23 ≥ b023},  {b31 ≥ b031, b32 ≥ b032}

are equivalent respectively to the three events

{c12 ≥ b012, c13 ≥ b013},  {c21 ≥ b021, c23 ≥ b023},  {c31 ≥ b031, c32 ≥ b032}.

In the latter form, the three events are clearly independent, for they depend on disjoint sets of independent u(i), and their three probabilities are the three factors in (A.5). For example, the first event says that the n1 points u(i) corresponding to observations c1 fall in the subtriangle u2/u1 ≥ b012, u3/u1 ≥ b013, whose area is the fraction (1 + b012 + b013)^(−1) of the area of the whole triangle U.

It will be convenient to denote the right side of (A.5) by F(b012, b013, b021, b023, b031, b032), which defines, as the b0ij vary, a form of the joint cumulative distribution function of the bij. This c.d.f. should be handled with care. First, it is defined only over the subset of the positive orthant in six dimensions such that the b0ij define a non-empty R0. Many points in the orthant are ruled out by

which, after multiplying by the Jacobian ∂(b21, b31)/∂(u1, u2) = u1 u2^(−2) u3^(−2), gives the density contribution

n2 n3 u1^(n1+1) u2^(n2−1) u3^(n3−1)   (A.10)

expressed in terms of u1, u2 and of course u3 = 1 − u1 − u2. The contributions from cases (ii) and (iii) may be found similarly to be

n2 n3 u1^n1 u2^(n2−1) u3^n3 and n2 n3 u1^n1 u2^n2 u3^(n3−1).   (A.11)

Since u1 + u2 + u3 = 1, the sum of the three parts is

n2 n3 u1^n1 u2^(n2−1) u3^(n3−1),

or

{n1! n2! n3!/n!} {n!/[n1! (n2−1)! (n3−1)!]} u1^n1 u2^(n2−1) u3^(n3−1),   (A.12)

where the first factor is the probability that u is anywhere, i.e. that R is not empty, while the second is the Dirichlet density given in (23).

The density of the point with minimum first coordinate may be found by

    a similar argument. The analogue of (A.6) is

(i) u2/u1 = c12 and u3/u1 = c13,
(ii) u2/u3 = c32 and u3/u1 = c13, or
(iii) u2/u1 = c12 and u3/u2 = c23,   (A.13)

and the corresponding three components of density turn out to be

n1(n1+1) u1^(n1−1) u2^n2 u3^n3,   n1 n3 u1^(n1−1) u2^n2 u3^n3,   and   n1 n2 u1^(n1−1) u2^n2 u3^n3,   (A.14)

which sum to

{n1! n2! n3!/n!} {(n+1)!/[(n1−1)! n2! n3!]} u1^(n1−1) u2^n2 u3^n3,   (A.15)

which, like (A.12), is the product of the probability that R is not empty and the Dirichlet density specified in (23).
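As a check on the arithmetic, the factorizations asserted in (A.12) and (A.15) reduce to simple cancellations, and the second identity also accounts for the three coefficients appearing in (A.14):

```latex
\[
\frac{n_1!\,n_2!\,n_3!}{n!}\cdot\frac{n!}{n_1!\,(n_2-1)!\,(n_3-1)!}
  \;=\; \frac{n_2!}{(n_2-1)!}\cdot\frac{n_3!}{(n_3-1)!} \;=\; n_2 n_3 ,
\]
\[
\frac{n_1!\,n_2!\,n_3!}{n!}\cdot\frac{(n+1)!}{(n_1-1)!\,n_2!\,n_3!}
  \;=\; \frac{n_1!}{(n_1-1)!}\cdot\frac{(n+1)!}{n!} \;=\; n_1(n+1)
  \;=\; n_1(n_1+1) + n_1 n_2 + n_1 n_3 .
\]
```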

The remaining four lines of (23) follow by symmetry. The probability that R is not empty may be obtained directly by an argument whose gist is that, for any set of n points in U, there is exactly one way to assign them to three cells of sizes n1, n2, n3 corresponding to observations c1, c2, c3 in such a way that R is not empty. This latter assertion will not be proved here.
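The assertion can at least be checked numerically. One way to test whether R is empty, assumed here but not taken from the paper, is the standard cycle criterion for systems of ratio constraints: the set {u : uj/ui ≤ cij for all i ≠ j} is non-empty exactly when every cycle product of the bounds cij is at least one. Under that assumption, the following sketch estimates P(R non-empty) and compares it with n1! n2! n3!/n! for illustrative counts.

```python
import numpy as np
from math import factorial

rng = np.random.default_rng(3)
n1, n2, n3 = 2, 2, 1                       # illustrative counts

def draw_c(rng):
    """c[(i, j)] = min over the observations in category i of u_j/u_i."""
    c = {}
    for i, ni in zip((0, 1, 2), (n1, n2, n3)):
        u = rng.dirichlet([1.0, 1.0, 1.0], size=ni)
        for j in range(3):
            if j != i:
                c[(i, j)] = np.min(u[:, j] / u[:, i])
    return c

def r_nonempty(c):
    """Cycle criterion (an assumption of this sketch): R = {u : u_j/u_i <= c_ij}
    is non-empty iff every cycle product of the bounds is at least one
    (three 2-cycles and two 3-cycles)."""
    return (c[(0, 1)] * c[(1, 0)] >= 1 and
            c[(0, 2)] * c[(2, 0)] >= 1 and
            c[(1, 2)] * c[(2, 1)] >= 1 and
            c[(0, 1)] * c[(1, 2)] * c[(2, 0)] >= 1 and
            c[(0, 2)] * c[(2, 1)] * c[(1, 0)] >= 1)

trials = 100_000
hits = sum(r_nonempty(draw_c(rng)) for _ in range(trials))
claimed = factorial(n1) * factorial(n2) * factorial(n3) / factorial(n1 + n2 + n3)
print("Monte Carlo P(R non-empty):", hits / trials, "  claimed:", claimed)
```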