

April 1978

To be presented at the Maximum Entropy Formalism Conference, Massachusetts Institute of Technology, May 2-4, 1978.

WHERE DO WE STAND ON MAXIMUM ENTROPY?

E. T. Jaynes *

A. HISTORICAL BACKGROUND

The First Line: Bernoulli to Cox
The Second Line: Maxwell to Shannon
Some Personal Recollections

B. PRESENT FEATURES

The Formalism
Brandeis Dice Problem
Probability and Frequency
Relation to Bayes' Theorem
The Rowlinson Criticism

Answer to the Rhetorical Question
Wolf's Dice Data

The Constraint Rule
Forney's Question

C. SPECULATIONS FOR THE FUTURE

Fluctuations
Biology
Basic Statistical Theory
Marginalization

D. AN APPLICATION: IRREVERSIBLE STATISTICAL MECHANICS

Background
The Gibbs Algorithm
Discussion
Perturbation Theory
Near-Equilibrium Ensembles
Relation to Wiener Prediction Theory
Space-Time Variations

* Wayman Crow Professor of Physics, Washington University, St. Louis, Mo. 63130



WHERE DO WE STAND ON MAXIMUM ENTROPY?

Edwin T. Jaynes

A. Historical Background

B. Present Features and Applications

C. Speculations for the Future

D. An Application: Irreversible Statistical Mechanics

Summary. In Part A we place the principle of Maximum Entropy in its historical perspective as a natural extension and unification of two separate lines of development, both of which had long used special cases of it. The first line is identified with the names Bernoulli, Laplace, Jeffreys, Cox; the second with Maxwell, Boltzmann, Gibbs, Shannon.

Part B considers some general properties of the present maximum entropy formalism, stressing its consistency and interderivability with the other principles of probability theory. In this connection we answer some published criticisms of the principle.

In Part C we try to view the principle in the wider context of Statistical Decision Theory in general, and speculate on possible future applications and further theoretical developments. The Principle of Maximum Entropy, together with the seemingly disparate principles of Group Invariance and Marginalization, may in time be seen as special cases of a still more general principle for translating information into a probability assignment.

Part D, which should logically precede C, is relegated to the end because it is of a more technical nature, requiring also the full formalism of quantum mechanics. Readers not familiar with this will find the first three Sections a self-contained exposition.

In Part D we present some of the details and results of what is at present the most highly developed application of the Principle of Maximum Entropy; the extension of Gibbs' formalism to irreversible processes. Here we consider the most general application of the principle, without taking advantage of any special features (such as interest in only a subspace of states, or a subset of operators) that might be found in particular problems. An alternative formulation, which does take such


advantage--and is thus closer to the spirit of previous "kinetic equation" approaches at the cost of some generality--appears in the presentation of Dr. Baldwin Robertson.

A. Historical Background

The ideas to be discussed at this Symposium are found clearly expressed already in ancient sources, particularly the Old Testament, Herodotus, and Ovennus. All note the virtue of making wise decisions by taking into account all possibilities, i.e., by not presuming more information than we possess. But probability theory, in the form which goes beyond these moral exhortations and considers actual numerical values of probabilities and expectations, begins with the Ludo aleae of Gerolamo Cardano, some time in the mid-sixteenth century. Wilks (1961) places this "around 1520," although Cardano's Section "On Luck in Play" contains internal evidence that shows the date of its writing to be 1564, still 90 years before the Pascal-Fermat correspondence.

Already in these earliest works, special cases of the Principle of Maximum Entropy are recognized intuitively and, of necessity, used. For there is no application of probability theory in which one can evade that all-important first step: assigning some initial numerical values of probabilities so that the calculation can get started. Even in the most elementary homework problems, such as "Find the probability of getting at least two heads in four tosses of a coin," we have no basis for the calculation until we make some initial judgment, usually that "heads" shall have the probability 1/2 independently at each toss. But by what reasoning does one arrive at this initial assignment? If it is questioned, how shall we defend it?
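
Once that initial judgment is made, the homework answer follows mechanically. The short sketch below (my own illustration, not part of the original text) simply enumerates the sixteen equally probable outcomes under the usual assignment of 1/2 to "heads" at each toss.

    # Sketch: "at least two heads in four tosses," assuming each toss
    # independently has probability 1/2 of heads (the initial judgment at issue).
    from itertools import product

    outcomes = list(product("HT", repeat=4))                 # 16 equally probable sequences
    favorable = [seq for seq in outcomes if seq.count("H") >= 2]
    print(len(favorable), "/", len(outcomes))                # -> 11 / 16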

The basis underlying such initial assignments was stated as an explicit formal principle in the Ars Conjectandi of James (= Jacob) Bernoulli (1713). Unfortunately, it was given the curious name: Principle of Insufficient Reason, which has had, ever since, a psychologically repellant quality that prevents many from seeing the positive merit of the idea itself. Keynes (1921) helped somewhat by renaming it the Principle of Indifference; but by then the damage had been done. Had Bernoulli called his principle, more appropriately, the Desideratum of Consistency, nobody would have ventured to deprecate it, and today statistical theory would be in considerably better shape than it is.

The essence of the principle is just: (1) we recognize that a probability assignment is a means of describing a certain state of knowledge. (2) if the available evidence gives us no reason to consider proposition A1 either more or


less likely than A2, then the only honest way we can describe that state of knowledge is to assign them equal probabilities: p1 = p2. Any other procedure would be inconsistent in the sense that, by a mere interchange of the labels (1, 2) we could then generate a new problem in which our state of knowledge is the same but in which we are assigning different probabilities. (3) Extending this reasoning, one arrives at the rule

p(A) = M/N = (Number of cases favorable to A) / (Total number of equally possible cases)          (A1)

which served as the basic definition of probability for the next 150 years.

The only valid criticism of this principle, it seems to me, is that in the original form (enumeration of the "equally possible" cases) it cannot be applied to all problems. Indeed, nobody could have emphasized this more strongly than Bernoulli himself. After noting its use where applicable, he adds, "But here, finally, we seem to have met our problem, since this may be done only in a very few cases and almost nowhere other than in games of chance the inventors of which, in order to provide equal chances for the players, took pains to set up so that the numbers of cases would be known and --- so that all these cases could happen with equal ease." After citing some examples, Bernoulli continues in the next paragraph, "But what mortal will ever determine, for example, the number of diseases --- these and other such things depend upon causes completely hidden from us ---."

It was for the explicitly stated purpose of finding probabilities when the number of "equally possible" cases is infinite or beyond our powers to determine, that Bernoulli turns next to his celebrated theorem, today called the weak law of large numbers. His idea was that, if a probability p cannot be calculated in the manner p = M/N by direct application of the Principle of Insufficient Reason, then in some cases we may still reason backwards and estimate the ratio M/N approximately by observing frequencies in many trials.

That there ought to be some kind of connection between a theoretical probability and an observable frequency was a vaguely seen intuition in the earlier works; but Bernoulli, seeing clearly the distinction between the concepts, recognized that the existence of a connection between them cannot be merely postulated; it requires mathematical demonstration. If in a binary experiment we assign a constant probability of success p, independently at each trial, then we find for the probability of seeing m successes in n trials the binomial distribution


P(m|n,p) = \binom{n}{m} p^m (1-p)^{n-m}          (A2)

Bernoulli then shows that as n → ∞, the observed frequency f = m/n of successes tends to the probability p in the sense that, for all ε > 0,

P(p-ε < f < p+ε | p,n) → 1          (A3)


and thus (in a sense made precise only in the later work of Bayes and Laplace) for sufficiently large n, the observed frequency is practically certain to be close to the number p sought.

But Bernoulli's result does not tell us how large n must be for a given accuracy. For this, one needs the more detailed limit theorem; as n increases, f may be considered a continuous variable, and the probability that (f < m/n < f + df) goes into a continuous distribution, the normal or gaussian law

P(f|p,n) df → [n / 2πp(1-p)]^{1/2} exp[-n(f-p)^2 / 2p(1-p)] df          (A4)

in the sense of the leading term of an asymptotic expansion.

For example, if p = 2/3, then from (A4), in n = 1000 trials, there is a 99% probability that the observed f will lie in the interval 0.667 ± 0.038, and an even chance that it will fall in 0.667 ± 0.010. The result (A4) was first given in this generality by Laplace; it had been found earlier by de Moivre for the case p = 1/2. And in turn, the de Moivre-Laplace theorem (A4) became the ancestor of our present Central Limit Theorem.
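
These figures are easy to verify numerically from (A4); the sketch below (added here as a check, not part of the original text) uses the normal distribution with standard deviation [p(1-p)/n]^{1/2} implied by (A4).

    # Sketch: checking the de Moivre-Laplace intervals quoted above,
    # with p = 2/3 and n = 1000 as in the text.
    from math import sqrt
    from statistics import NormalDist

    p, n = 2/3, 1000
    sigma = sqrt(p * (1 - p) / n)          # standard deviation of the observed frequency f

    for coverage in (0.99, 0.50):
        # two-sided interval: half-width = z * sigma, z the (1 + coverage)/2 quantile
        z = NormalDist().inv_cdf((1 + coverage) / 2)
        print(f"{coverage:.0%} interval: {p:.3f} +/- {z * sigma:.3f}")
    # -> 99% interval: 0.667 +/- 0.038
    #    50% interval: 0.667 +/- 0.010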

Since these limit theorems are sometimes held to be the most important and sophisticated fruits of probability theory, we note that they depend crucially on the assumption of independence of different trials. The slightest positive correlation between trials i and j, if it persists for arbitrarily large |i - j|, will render these theorems qualitatively incorrect.

Laplace's contributions to probability theory go rather far beyond mere analytical refinements of other peoples' results. Most important for statistical theory today, he saw the general principle needed to solve problems of the type formulated by Bernoulli, but left unfinished by the Bernoulli and de Moivre-Laplace limit theorems. These results concern only the so-called "sampling distribution." That is, given p = M/N, what is the probability that we shall see particular sample numbers (m,n)? The results (A1)-(A4) describe a state of knowledge in which the "population numbers" (M,N) are known, the sample numbers unknown. But in the problem Bernoulli tried to solve, the sample is known and the population is not only unknown--


its very existence is only a tentative hypothesis (what mortal will ever determine the number of diseases, etc.).

We have, therefore, an inversion problem. The above theorems show that, given (M,N) and the correctness of the whole conceptual model, then it is likely that in many trials the observed frequency f will be close to the probability p. Presumably, then, given the observed f in many trials, it is likely that p is close to f. But can this be made into a

precise theorem like (A4)? The binomial law (A2) gives the probability of m, given (M,N,n). Can we turn this around and find a formula for the probability of M, given (m,N,n)? This is the problem of inverse probabilities.

A particular inversion of the binomial distribution was offered by a British clergyman and amateur mathematician, Thomas Bayes (1763) in what has become perhaps the most famous and controversial work in probability theory. His reasoning was obscure and hard to describe; but his actual result is easy to state. Given the data (m,n), he finds for the probability that M/N lies in the interval p < (M/N) < p + dp,

P(dp|m,n) = [(n+1)! / m!(n-m)!] p^m (1-p)^{n-m} dp          (A5)

today called a Beta distribution. It is not a binomial distribution because the variable is p rather than m and the numerical coefficient is different, but it is a trivial mathematical exercise [expand the logarithm of (A5) in a power series about its peak] to show that, for large n, (A5) goes asymptotically into just (A4) with f and p everywhere interchanged. Thus, if in n = 1000 trials we observe m = 667 successes, then on this evidence there would be a 99% probability that p lies in (0.667 ± 0.038), etc.
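
The same numbers can be checked against the exact Beta distribution (A5); the sketch below (my own, assuming SciPy is available) computes the central 99% interval of the posterior for p given m = 667, n = 1000.

    # Sketch: the Beta posterior (A5) for p, given m = 667 successes in n = 1000 trials.
    from scipy import stats

    m, n = 667, 1000
    posterior = stats.beta(m + 1, n - m + 1)      # (A5) is a Beta(m+1, n-m+1) density in p

    lo, hi = posterior.ppf(0.005), posterior.ppf(0.995)
    print(f"99% interval for p: [{lo:.3f}, {hi:.3f}]")   # roughly 0.667 +/- 0.038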

In the gaussian approximation, according to Bayes' solution, there is complete mathematical symmetry between the probability of f given p, and of p given f. This would certainly seem to be the neatest and simplest imaginable solution to Bernoulli's inversion problem.

Laplace, in his famous memoir of 1774 on the "probabilities of causes," perceived the principle underlying inverse probabilities in far greater generality. Let E stand for some observable event and {C1 ... CN} the set of its conceivable causes. Suppose that we have found, according to some conceptual model, the "sampling distribution" or "direct" probabilities of E for each cause: p(E|Ci), i = 1,2,...,N. Then, says Laplace, if initially the causes Ci are considered equally likely, then having seen the event E, the different causes are indicated with probability proportional to p(E|Ci). That is,


with uniform prior probabilities, the posterior probabilities are

p(Ci|E) = p(E|Ci) / Σj p(E|Cj)          (A6)

This is a tremendous generalization of the Bernoulli-Bayes results (A2), (A5). If the event E consists in finding m successes in n trials, and the causes Ci correspond to the possible

values of M in the Bernoulli model, then p(E|Ci) is the binomial distribution (A2); and in the limit N → ∞, (A6) goes into Bayes' result (A5).

Later, Laplace generalized (A6) further by noting that, if initially the Ci are not considered equally likely, but have prior probabilities p(Ci|I), where I stands for the prior information, then the terms in (A6) should be weighted according to p(Ci|I):

p(Ci|E,I) = p(E|Ci) p(Ci|I) / Σj p(E|Cj) p(Cj|I)          (A7)

but, following long-established custom, it is Laplace's result (A7) that is always called, in the modern literature, "Bayes' theorem."
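
For readers who like to see the arithmetic, here is a minimal sketch (my own illustration, not from the paper) of Laplace's rule (A7) as a computation over a finite set of causes; the example likelihoods are arbitrary.

    # Sketch: Laplace's rule (A7) -- posterior over causes C_i given an event E,
    # obtained by normalizing likelihood * prior.
    def posterior(likelihoods, priors):
        """likelihoods[i] = p(E|C_i), priors[i] = p(C_i|I); returns p(C_i|E,I)."""
        weighted = [L * q for L, q in zip(likelihoods, priors)]
        total = sum(weighted)
        return [w / total for w in weighted]

    # Hypothetical example with three causes; equal priors reduce (A7) to (A6).
    print(posterior([0.8, 0.3, 0.1], [1/3, 1/3, 1/3]))   # -> [0.666..., 0.25, 0.0833...]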

Laplace proceeded to apply (A6) to a variety of problems that arise in astronomy, meteorology, geodesy, population statistics, etc. He would use it typically as follows. Comparing experimental observations with some existing theory, or calculation, one will never find perfect agreement. Are the discrepancies so small that they might reasonably be attributed to measurement errors, or are they so large that they indicate, with high probability, the existence of some new systematic cause? If so, Laplace would undertake to find that cause. Such uses of inverse probability--what would be called today "significance tests" by statisticians, and "detection of signals in noise" by electrical engineers--led him to some of the most important discoveries in celestial mechanics.

Yet there were difficulties that prevented others from following Laplace's path, in spite of its demonstrated usefulness. In the first place, Laplace simply stated the results (A6), (A7) as intuitive, ad hoc recipes without any derivation from compelling desiderata; and this left room for much agonizing over their logical justification and uniqueness. For an account of this, see Keynes (1921). However, we now know that Laplace's result (A7) is, in fact, the entirely correct and unique solution to the inversion problem.


More importantly, it became apparent that, in spite of first appearances, the results of Bayes and Laplace did not, after all, solve the problem that Bernoulli had set out to deal with. Recall, Bernoulli's original motivation was that the Principle of Insufficient Reason is inapplicable in so many real problems, because we are unable to break things down into an enumeration of "equally possible" cases. His hope--left unrealized at his death in 1705--had been that, by inversion of his theorem one could avoid having to use Insufficient Reason. Yet when the inversion problem was finally solved by Bayes and Laplace, the prior probabilities p(Ci|I) that Bernoulli had sought to avoid, intruded themselves inevitably right back into the picture!

The only useful results Laplace got came from (A6), based on the uniform prior probabilities p(Ci|I) = 1/N from the Principle of Insufficient Reason. That is, of course, not because Laplace failed to understand the generalization (A7) as some have charged--it was Laplace who, in his Essai Philosophique, pointed out the need for that generalization. Rather, Laplace did not have any principle for finding prior probabilities in cases where the prior information fails to render the possibilities "equally likely."

At this point, the history of statistical theory takes a sharp 90° turn away from the original goal, and we are only slowly straightening out again today. One might have thought, particularly in view of the great pragmatic success achieved by Laplace with (A6), that the next workers would try to build constructively on the foundations laid down by him. The next order of business should have been seeking new and more general principles for determining prior probabilities, thus extending the range of problems where probability theory is useful to (A7). Instead, only fifteen years after Laplace's death, there started a series of increasingly violent attacks on his work. Totally ignoring the successful results they had yielded, Laplace's methods based on (A6) were rejected and ridiculed, along with the whole conception of probability theory expounded by Bernoulli and Laplace. The main early references to this counter-stream of thought are Ellis (1842), Boole (1854), Venn (1866), and von Mises (1928).

As already emphasized, Bernoulli's definition of probability (A1) was developed for the purpose of representing mathematically a particular state of knowledge; and the equations of probability theory then represent the process of plausible, or inductive, reasoning in cases where there is not enough information at hand to permit deductive reasoning. In particular, Laplace's result (A7) represents the process of "learning by experience," the prior probability p(Ci|I) changing to the


posterior probability p(Ci|E,I) as a result of obtaining new evidence E.

This counter-stream of thought, however, rejected the notion of probability as describing a state of knowledge, and insisted that by "probability" one must mean only "frequency in a random experiment." For a time this viewpoint dominated the field so completely that those who were students in the period 1930-1960 were hardly aware that any other conception had ever existed.

If anyone wishes to study the properties of frequencies in random experiments he is, of course, perfectly free to do so; and we wish him every success. But if he wants to talk about frequencies, why can't he just use the word "frequency?" Why does he insist on appropriating the word "probability," which had already a long-established and very different technical meaning?

Most of the debate that has been in progress for over a century on "frequency vs. non-frequency definitions of probability" seems to me not concerned with any substantive issue at all; but merely arguing over who has the right to use a word. Now the historical priority belongs clearly to Bernoulli and Laplace. Therefore, in the interests not only of responsible scholarship, but also of clear exposition and to avoid

becoming entangled in semantic irrelevancies, we ought to use the word "probability" in the original sense of Bernoulli and Laplace; and if we mean something else, call it something else.

With the usage just recommended, the term "frequency theory of probability" is a pure incongruity; just as much so as "theory of square circles." One might speak properly of a "frequency theory of inference," or the better term "sampling theory," now in general use among statisticians (because the only distributions admitted are the ones we have called sampling distributions). This stands in contrast to the "Bayesian theory" developed by Laplace, which admits the notion of probability of an hypothesis.

Having two opposed schools of thought about how to handle problems of inference, the stage is set for an interesting contest. The sampling theorists, forbidden by their ideology to use Bayes' theorem as Laplace did in the form (A6), must seek other methods for dealing with Laplace's problems. What methods, then, did they invent? How do their procedures and results compare with Laplace's?

The sampling theory developed slowly over the first half of this Century by the labors of many, prominent names being Fisher, "Student," Pearson, Neyman, Kendall, Cramer, Wald. They proceeded through a variety of ad hoc intuitive principles, each appearing reasonable at first glance, but for which defects or limitations on generality always appeared. For


example, the Chi-squared test, maximum likelihood, unbiased and/or efficient estimators, confidence intervals, fiducial distributions, conditioning on ancillary statistics, power functions and sequential methods for hypothesis testing. Certain technical difficulties ("nuisance" parameters, non-existence of sufficient or ancillary statistics, inability to take prior information into account) remained behind as isolated pockets of resistance which sampling theory has never been able to overcome. Nevertheless, there was discernible progress over the years, accompanied by an unending stream of attacks on Laplace's ideas and methods, sometimes degenerating into personal attacks on Laplace himself [see, for example, the biographical sketch by E. T. Bell (1937), entitled "From Peasant to Snob"].

Enter Jeffreys. After 1939, the sampling theorists had another target for their scorn. Sir Harold Jeffreys, finding in geophysics some problems of "extracting signals from noise" very much like those treated by Laplace, found himself unconvinced by Fisher's arguments, and produced a book in which the methods of Laplace were reinstated and applied, in the precise, compact modern notation that did not exist in the time of Laplace, to a mass of current scientific problems. The result was a vastly more comprehensive treatment of inference than Laplace's, but with two points in common: (A) the applications worked out beautifully, encountering no such technical difficulties as the "nuisance parameters" noted above; and yielding the same or demonstrably better results than those found by sampling theory methods. For many specific examples, see Jaynes (1976). (B) Unfortunately, like Laplace, Jeffreys did not derive his principles as necessary consequences of any compelling desiderata; and thus left room to continue the same old arguments over their justification.

The sampling theorists, seizing eagerly upon point (B) while again totally ignoring point (A), proceeded to give Jeffreys the same treatment as Laplace, which he had to endure for some thirty years before the tide began to turn.

As a student in the mid-1940's, I discovered the book of Jeffreys (1939) and was enormously impressed by the smooth, effortless way he was able to derive the useful results of the theory, as well as the sensible philosophy he expressed. But I too felt that something was missing in the exposition of fundamentals in the first Chapter and, learning about the attacks on Jeffreys' methods by virtually every other writer on statistics, felt some mental reservations.

But just at the right moment there appeared a work that removed all doubts and set the direction of my own life's


work. An unpretentious little article by Professor R. T. Cox (1946) turned the problem under debate around and, for the first time, looked at it in a constructive way. Instead of making dogmatic assertions that it is or is not legitimate to use probability in the sense of degree of plausibility rather than frequency, he had the good sense to ask a question: Is it possible to construct a consistent set of mathematical rules for carrying out plausible, rather than deductive, reasoning? He found that, if we try to represent degrees of plausibility by real numbers, then the conditions of consistency can be stated in the form of functional equations, whose general solutions can be found. The results were: out of all possible monotonic functions which might in principle serve our purpose, there exists a particular scale on which to measure degrees of plausibility, which we henceforth call probability, with particularly simple properties. Denoting various propositions by A, B, etc., and using the notation AB ≡ "both A and B are true,"

Ā ≡ "A is false," p(A|B) ≡ probability of A given B, the consistent rules of combination take the form of the familiar product rule and sum rule:


p(AB|C) = p(A|BC) p(B|C)          (A8)

p(A|B) + p(Ā|B) = 1          (A9)

By mathematical transformations we can, of course, alter the form of these rules; but what Cox proved was that any alteration of their content will enable us to exhibit inconsistencies (in the sense that two methods of calculation, each permitted by the rules, will yield different results). But (A8), (A9) are, in fact, the basic rules of probability theory; all other equations needed for applications can be derived from them. Thus, Cox proved that any method of inference in which we represent degrees of plausibility by real numbers is necessarily either equivalent to Laplace's, or inconsistent.
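
As a one-line illustration of that last claim (my own gloss, not part of the original text), Laplace's form of Bayes' theorem follows directly from writing the product rule (A8) in its two symmetric orders:

    p(AB|C) = p(A|BC) p(B|C) = p(B|AC) p(A|C)

so that, whenever p(B|C) ≠ 0,

    p(A|BC) = p(B|AC) p(A|C) / p(B|C)

Reading A as a cause Ci, B as the observed event E, and C as the prior information I, and expanding p(E|I) as the sum over causes, this is just Laplace's result (A7).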

For me, this was exactly the argument needed to clinch matters; for Cox's analysis makes no reference whatsoever to frequencies or random experiments. From the day I first read Cox's article I have never for a moment doubted the basic soundness and inevitability of the Laplace-Jeffreys methods, while recognizing that the theory needs further development to extend its range of applicability.

Indeed, such further development was started by Jeffreys. Recall, in our narrative we left Laplace (or rather, Laplace left us) at Eq. (A6), seeing the need but not the means to make the transition to (A7), which would open up an enormously wider range of applications for Bayesian inference. Since the


function of the prior probabilities is to describe the prior information, we need to develop new or more general principles for determination of those priors by logical analysis of prior information when it does not consist of frequencies; just what should have been the next order of business after Laplace.

Recognizing this, Jeffreys resumed the constructive development of this theory at the point where Laplace had left off. If we need to convert prior information into a prior probability assignment, perhaps we should start at the beginning and learn first how to express "complete ignorance" of a continuously variable parameter, where Bernoulli's principle will not apply.

Bayes and Laplace had used uniform prior densities, as the most obvious analog of the Bernoulli uniform discrete assignment. But it was clear, even in the time of Laplace, that this rule is ambiguous because it is not invariant under a change of parameters. A uniform density for θ does not correspond to a uniform density for α = θ^3, or β = log θ; so for which choice of parameters should the uniform density apply?

In the first (1939) Edition of his book, Jeffreys made a tentative start on this problem, in which he found his now famous rule: to express ignorance of a scale parameter σ, whose possible domain is 0 < σ < ∞, assign uniform prior density to its logarithm: p(dσ|I) = dσ/σ. The first arguments advanced in support of this rule were not particularly clear or convincing to others (including this writer). But other desiderata were found; and we have now succeeded in proving via the integral equations of marginalization theory (Jaynes, 1979) that Jeffreys' prior dσ/σ is, in fact, uniquely determined as the only prior for a scale parameter that is "completely uninformative" in the sense that it leads us to the same conclusions about other parameters θ as if the parameter σ had been removed from the model [see Eq. (C33) below].
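
A brief aside (mine, not in the original) on why dσ/σ deserves to be called the ignorance prior for a scale parameter: the element is unchanged by a change of units. Under σ' = cσ with c > 0 fixed,

    dσ'/σ' = (c dσ)/(cσ) = dσ/σ

so the same probability is assigned to corresponding intervals whatever scale σ is measured in; equivalently, log σ is uniformly distributed, and σ and 1/σ receive priors of the same form.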

In the second (1948) Edition, Jeffreys gave a much more general "Invariance Theory" for determining ignorance priors, which showed amazing prevision by coming within a hair's breadth of discovering both the principles of Maximum Entropy and Transformation Groups. He wrote down the actual entropy expression (note the date!), but then used it only to generate a quadratic form by expansion about its peak. Jeffreys' invariance theory is still of great importance today, and the question of its relation to other methods that have been proposed is still under study.

In the meantime, what had been happening in the sampling theory camp? The culmination of this approach came in the late 1940's when, for the first time, Abraham Wald succeeded in removing all ad hockeries and presenting general rules of


conduct for making decisions in the face of uncertainty, that he proved to be uniquely optimal by certain very simple and compelling desiderata of reasonable behavior. But quickly a number of people--including I. J. Good (1950), L. J. Savage (1954), and the present writer--realized independently that, if we just ignore Wald's entirely different vocabulary and diametrically opposed philosophy, and look only at the specific mathematical steps that were now to be used in solving specific problems, they were identical with the rules given by Laplace

in the eighteenth century, which generations of statisticians had rejected as metaphysical nonsense!

It is one of those ironies that make the history of science so interesting, that the missing Bayes-optimality proofs, which Laplace and Jeffreys had failed to supply, were at last found inadvertently, while trying to prove the opposite, by an early ardent disciple of the von Mises "collective" approach. It is also a tribute to Wald's intellectual honesty that he was able to recognize this, and in his final work (Wald, 1950) he called these optimal rules "Bayes strategies."

Thus came the "Bayesian Revolution" in statistics, which is now all but over. This writer's recent polemics (Jaynes, 1976) will probably be one of the last battles waged. Today, most active research in statistics is Bayesian, a good deal of it directed to the above problem of determining priors by logical analysis; and the parts of sampling theory which do not lie in ruins are just the ones (such as sufficient statistics and sequential analysis) that can be justified in Bayesian terms.

This history of basic statistical theory, showing how developments over more than two centuries set the stage naturally for the Principle of Maximum Entropy, has been recounted at some length because it is unfamiliar to most scientists and engineers. Although the second line converging on this principle is much better known to this audience, our account can be no briefer because there is so much to be unlearned.

The Second Line: Maxwell, Boltzmann, Gibbs, Shannon. Over the past 120 years another line of development was taking place, which had astonishingly little contact with the "statistical inference" line just described. In the 1850's James Clerk Maxwell started the first serious work on the application of probability analysis to the kinetic theory of gases. He was confronted immediately with the problem of assigning initial probabilities to various positions and velocities of molecules. To see how he dealt with it, we quote his first (1859) words on the problem of finding the probability distribution for velocity direction of spherical molecules after an impact: "In order that a collision may take place, the line of motion of one of the balls must pass the center of the other at a


distance less than the sum of their radii; that is, it must pass through a circle whose centre is that of the other ball, and radius the sum of the radii of the balls. Within this circle every position is equally probable, and therefore ..."

Here again, as that necessary first step in a probability analysis, Maxwell had to apply the Principle of Indifference; in this case to a two-dimensional continuous variable. But

already at this point we see a new feature. As long as we talk about some abstract quantity θ without specifying its physical meaning, we see no reason why we could not as well work with α = θ^3, or β = log θ; and there is an unresolved ambiguity. But as soon as we learn that our quantity has the physical meaning of position within the circular collision cross-section, our intuition takes over with a compelling force and tells us that the probability of impinging on any particular region should be taken proportional to the area of that region; and not to the cube of the area, or the logarithm of the area. If we toss pennies onto a wooden floor, something inside us convinces us that the probability of landing on any one plank should be taken proportional to the width of the plank; and not to the cube of the width, or the logarithm of the width.

In other words, merely knowing the physical meaning of our parameters already constitutes highly relevant prior information which our intuition is able to use at once; in favorable cases its effect is to give us an inner conviction that there is no ambiguity after all in applying the Principle of Indifference. Can we analyze how our intuition does this, extract the essence, and express it as a formal mathematical principle that might apply in cases where our intuition fails us? This problem is not completely solved today, although I believe we have made a good start on it in the principle of transformation groups (Jaynes, 1968, 1973, 1979). Perhaps these remarks will encourage others to try their hand at resolving these puzzles; this is an area where important new results might turn up with comparatively little effort, given the right inspiration on how to approach them.

Maxwell built a lengthy, highly non-trivial, and needless to say, successful analysis on the foundation just quoted. He was able to predict such things as the equation of state, velocity distribution law, diffusion coefficient, viscosity, and thermal conductivity of the gas. The case of viscosity was particularly interesting because Maxwell's theory led to the prediction that viscosity is independent of density, which seemed to contradict common sense. But when the experiments were performed, they confirmed Maxwell's prediction; and what had seemed a difficulty with his theory became its greatest triumph.


Enter Boltzmann. So far we have considered only the problem of expressing initial ignorance by a probability assignment. This is the first fundamental problem, since "complete initial ignorance" is the natural and inevitable starting point from which to measure our positive knowledge; just as zero is the natural and inevitable starting point when we add a column of numbers. But in most real problems we do not have initial ignorance about the questions to be answered. Indeed, unless we had some definite prior knowledge about the parameters to be measured or the hypotheses to be tested, we would seldom have either the means or the motivation to plan an experiment to get more knowledge. But to express positive initial knowledge by a probability assignment is just the problem of getting from (A6) to (A7), bequeathed to us by Laplace.

The first step toward finding an explicit solution to this problem was made by Boltzmann, although it was stated in very different terms at the time. He wanted to find how molecules will distribute themselves in a conservative force field (say, a gravitational or centrifugal field; or an electric field acting on ions). The force acting on a molecule at position x is then F = -grad φ, where φ(x) is its potential energy. A molecule with mass m, position x, velocity v thus has energy E = (1/2)mv^2 + φ(x). We neglect the interaction energy of molecules with each other and suppose they are enclosed in a container of volume V, whose walls are rigid and impermeable to both molecules and heat. But Boltzmann was not completely ignorant about how the molecules are distributed, because he knew that however they move, the total number N of molecules present cannot change, and the total energy

E = Σ_{i=1}^{N} [ (1/2) m vi^2 + φ(xi) ]          (A10)

must remain constant. Because of the energy constraint, evidently, all positions and velocities are not equally likely.

At this point, Boltzmann found it easier to think about discrete distributions than continuous ones (a kind of prevision of quantum theory); and so he divided the phase space (position-momentum space) available to the molecules into discrete cells. In principle, these could be defined in any way; but let us think of the k'th cell as being a region Rk so small that the energy Ek of a molecule does not vary appreciably within it; but also so large that it can accommodate a large number, Nk >> 1, of molecules. The cells {Rk, 1 ≤ k ≤ s} are to fill up the accessible phase space (which because of the energy constraint has a finite volume) without overlapping.

The problem is then: given N, E, and φ(x), what is the best prediction we can make of the number Nk of molecules


in Rk? In Boltzmann's reasoning at this point, we have the beginning of the Principle of Maximum Entropy. He asked first: In how many ways could a given set of occupation numbers Nk be realized? The answer is the multinomial coefficient

W(Nk) = N! / (N1! N2! ··· Ns!)          (A11)

This particular distribution will have total energy

E = Σ_{k=1}^{s} Nk Ek          (A12)

and of course, the Nk are also constrained by

N = Σ_{k=1}^{s} Nk          (A13)

Now any set {Nk} of occupation numbers for which E, N agree with the given information, represents a possible distribution, compatible with all that is specified. Out of the millions of such possible distributions, which is most likely to be realized? Boltzmann's answer was that the "most probable" distribution is the one that can be realized in the greatest number of ways; i.e., the one that maximizes (A11) subject to the constraints

(A12), (A13), if the cells are equally large (phase volume). Since the Nk are large, we may use the Stirling approximation for the factorials, whereupon (A11) can be written

log W = -N Σ_{k=1}^{s} (Nk/N) log(Nk/N)          (A14)

The mathematical solution by Lagrange multipliers is straightforward, and the result is: the "most probable" value of Nk is

Nk = (N / Z(β)) exp(-β Ek)          (A15)

where

Z(β) = Σ_{k=1}^{s} exp(-β Ek)          (A16)

and the parameter β is to be chosen so that the energy constraint (A12) is satisfied.
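
To make the recipe concrete, here is a small numerical sketch (my own, with arbitrary made-up energy levels, assuming SciPy for the root-finder) of choosing β in (A15)-(A16) so that the constraint (A12) is met.

    # Sketch: most-probable occupation numbers (A15)-(A16) subject to the
    # energy constraint (A12).  Energy levels and totals are illustrative only.
    import math
    from scipy.optimize import brentq

    E_levels = [0.0, 1.0, 2.0, 3.0]     # cell energies E_k (arbitrary units)
    N = 1000                            # total number of molecules, constraint (A13)
    E_total = 800.0                     # prescribed total energy, constraint (A12)

    def mean_energy(beta):
        Z = sum(math.exp(-beta * Ek) for Ek in E_levels)              # partition sum (A16)
        return sum(Ek * math.exp(-beta * Ek) for Ek in E_levels) / Z  # energy per molecule

    # Choose beta so that N * <E> matches the prescribed total energy.
    beta = brentq(lambda b: N * mean_energy(b) - E_total, -10.0, 10.0)

    Z = sum(math.exp(-beta * Ek) for Ek in E_levels)
    Nk = [N * math.exp(-beta * Ek) / Z for Ek in E_levels]            # occupation numbers (A15)
    print(round(beta, 3), [round(x, 1) for x in Nk])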

This simple result contains a great deal of physical information. Let us choose a particular set of cells Rk as follows. Divide up the coordinate space V and the velocity space into cells Xa, Yb respectively, such that the potential and kinetic energies φ(x), (1/2)mv^2 do not vary appreciably within


them, and take Rk = Xa ⊗ Yb. Then, writing Nk = Nab, Boltzmann's prediction of the number of molecules in Xa, irrespective of their velocity, is from (A15)

Na = Σb Nab = A(β) exp(-β φa)          (A17)

where the normalization constant A(β) is determined from Σa Na = N. This is the famous Boltzmann distribution law. In a gravitational field, φ(x) = mgz, it gives the usual "barometric formula" for decrease of the atmospheric density with height:


ρ(z) = ρ(0) exp(-β mgz)          (A18)

Now this can be deduced also from the macroscopic equation of state: for one mole, PV = RT, or P(z) = (RT/mN0) ρ(z), where N0 is Avogadro's number. But hydrostatic equilibrium requires -dP/dz = g ρ(z), which gives on integration, for uniform temperature, ρ(z) = ρ(0) exp(-N0 mgz/RT). Comparing with (A18), we find the meaning of the parameter: β = (kT)^{-1}, where T is the Kelvin temperature and k ≡ R/N0 is Boltzmann's constant.
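
As a concrete illustration (added here; the numbers are standard textbook values, not from the paper), the exponent in (A18) defines an isothermal scale height RT/(N0 mg) = RT/(Mg) of roughly 8 km for air.

    # Sketch: scale height of the isothermal barometric formula (A18),
    # rho(z) = rho(0) * exp(-z / H) with H = R*T / (M*g).
    R = 8.314     # J / (mol K), gas constant
    g = 9.81      # m / s^2
    M = 0.029     # kg / mol, approximate mean molar mass of air
    T = 288.0     # K, a representative surface temperature

    H = R * T / (M * g)
    print(f"scale height H ~ {H/1000:.1f} km")   # about 8.4 km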

We can, equally well, sum (A15) over the space cells Xa and find the predicted number of molecules with velocity in the cell Yb, irrespective of their position in space; but a far more interesting result is contained already in (A15) without this summation. Let us ask, instead: What fraction of the molecules in the space cell Xa are predicted to have velocity in the cell Yb? This is, from (A15) and (A17),

Nab / Na = B(β) exp(-(1/2) β m vb^2)          (A20)

This is, of course, just the Maxwellian velocity distribution law; but with the new and at first sight astonishing feature that it is independent of position in space. Even though the force field is accelerating and decelerating molecules as they move from one region to another, when they arrive at their new location they have exactly the same mean square velocity as when they started! If this result is correct (as indeed it proved to be) it means that a Maxwellian velocity distribution, once established, is maintained automatically, without any help from collisions, as the molecules move about in any conservative force field.

From Boltzmann's reasoning, then, we get a very unexpected and nontrivial dynamical prediction by an analysis that, seemingly, ignores the dynamics altogether! This is only the first of many such examples where it appears that we are "getting something for nothing," the answer coming too easily to believe. Poincaré, in his essays on "Science and Method,"


felt this paradox very keenly, and wondered how by exploiting our ignorance we can make correct predictions in a few lines of calculation, that would be quite impossible to obtain if we attempted a detailed calculation of the 10^23 individual trajectories.

It requires very deep thought to understand why we are not, in this argument and others to come, getting something for nothing. In fact, Boltzmann's argument does take the dynamics into account, but in a very efficient manner. Information about the dynamics entered his equations at two places: (1) the conservation of total energy; and (2) the fact that he defined his cells in terms of phase volume, which is conserved in the dynamical motion (Liouville's theorem). The fact that this was enough to predict the correct spatial and velocity distribution of the molecules shows that the millions of intricate dynamical details that were not taken into account were actually irrelevant to the predictions, and would have cancelled out anyway if he had taken the trouble to calculate them.

Boltzmann's reasoning was super efficient; far more so than he ever realized. Whether by luck or inspiration, he put into his equations only the dynamical information that happened to be relevant to the questions he was asking. Obviously, it would be of some importance to discover the secret of how this came

about, and to understand it so well that we can exploit it in other problems.

If we can learn how to recognize and remove irrelevant information at the beginning of a problem, we shall be spared having to carry out immense calculations, only to discover at the end that practically everything we calculated was irrelevant to the question we were asking. And that is exactly what we are after by applying Information Theory [actually, the secret was revealed in my second paper (Jaynes, 1957b); but to the best of my knowledge no other person has yet noticed it there; so I will explain it again in Section D below. The point is that Boltzmann was asking only questions about experimentally reproducible equilibrium properties].

In Boltzmann's "method of the most probable distribution," we have already the essential mathematical content of the Principle of Maximum Entropy. But in spite of the conventional name, it did not really involve probability. Boltzmann was not trying to calculate a probability distribution; he was estimating some physically real occupation numbers Nk, by a criterion (value of W) that counts the number of real physical possibilities; a definite number that has nothing to do with anybody's state of knowledge. The transition from this to our present more abstract Principle of Maximum Entropy, although mathematically trivial, was so difficult conceptually that it required


almost another Century to bring about. In fact, this required three more steps and even today the development of irreversible Statistical Mechanics is being held up as much by conceptual difficulties as by mathematical ones.

Enter Gibbs. Curiously, the ideas that we associate today with the name of Gibbs were stated briefly in an early work of Boltzmann (1871); but were not pursued as Boltzmann became occupied with his more specialized H-theorem. Further development of the general theory was therefore left to Gibbs (1902). The Boltzmann argument just given will not work when the molecules have appreciable interactions, since then the total energy cannot be written in the additive form (A12). So we go to a much more abstract picture. Whereas the preceding argument was applied to an actually existing large collection of molecules, we now let the entire macroscopic system of interest become, in effect, a "molecule," and imagine a large collection of copies of it.

This idea, and even the term "phase" to stand for the collection of all coordinates and momenta, appears also in a work of Maxwell (1876). Therefore, when Gibbs adopted this notion, which he called an "ensemble," it was not, as is apparently thought by those who use the term "Gibbs ensemble," an innovation on his part. He used ensemble language rather as a concession to an already established custom. The idea became associated later with the von Mises "Kollektiv" but was actually much older, dating back to Venn (1866); and Fechner's book Kollektivmasslehre appeared in 1897.

It is important for our purposes to appreciate this little historical fact and to note that, far from having invented the notion of an ensemble, Gibbs himself (loc. cit., p. 17) de-emphasized its importance. We can detect a hint of cynicism in his words when he states: "It is in fact customary in the discussion of probabilities to describe anything which is imperfectly known as something taken at random from a great number of things which are completely described." He continues that, if we prefer to avoid any reference to an ensemble of systems, we may recognize that we are merely talking about "the probability that the phase of a system falls within certain limits at a certain time."

In other words, even in 1902 it was customary to talk about a probability as if it were a frequency; even if it is a frequency only in an imaginary ad hoc collection invented just for that purpose. Of course, any probability whatsoever can be thought of in this way if one wishes to; but Gibbs recognized that in fact we are only describing our imperfect knowledge about a single system.


The reason it is important to appreciate this is that we then understand Gibbs' later treatment of several topics, one of which had been thought to be a serious omission on his part. If we are describing only a state of knowledge about a single system, then clearly there can be nothing physically real about frequencies in the ensemble; and it makes no sense to ask, "which ensemble is the correct one?" In other words: different ensembles are not in 1:1 correspondence with different physical

situations; they correspond only to different states of knowledge about a single physical situation. Gibbs understood this clearly; and that, I suggest, is the reason why he does not say a word about ergodic theorems, or hypotheses, but instead gives a totally different reason for his choice of the canonical ensembles.

Technical details of Gibbs' work will be deferred to Sec. D below, where we generalize his algorithm. Suffice it to say here that Gibbs introduces his canonical ensemble, and works out its properties, without explaining why he chooses that particular distribution. Only in Chap. XII, after its properties--including its maximum entropy property--have been set forth, does he note that the distribution with the minimum expectation of log p (i.e., maximum entropy) for a prescribed distribution of the constants of the motion has certain desirable properties. In fact, this criterion suffices to generate all the ensembles--canonical, grand canonical, microcanonical, and rotational--discussed by Gibbs.

This is, clearly, just a generalized form of the Principle of Indifference. The possibility of a different justification in the frequency sense, via ergodic theorems, had been discussed by Maxwell, Boltzmann, and others for some thirty years; as noted in more detail before (Jaynes, 1967), if Gibbs thought that any such further justification was needed, it is certainly curious that he neglected to mention it.

After Gibbs' work, however, the frequency view of probability took such absolute control over men's minds that the ensemble became something physically real, to the extent that the following phraseology appears. Thermal equilibrium is defined as the situation where the system is "in a canonical distribution." Assignment of uniform prior probabilities was considered to be not a mere description of a state of knowledge, but a basic postulate of physical fact, justified by the agreement of our predictions with experiment.

In my student days this was the kind of language always used, although it seemed to me absurd; the individual system is not "in a distribution;" it is in a state. The experiments, moreover, do not verify "equal a priori probabilities" or "random a priori phases;" they verify only the predicted macroscopic


equation of state, heat capacity, etc., and the predictions for these would have been the same for many ensembles, uniform or nonuniform microscopically. Therefore, the reason for the success of Statistical Mechanics must be altogether different from our having found the "correct" ensemble.

Intuitively, it must be true that use of the canonical ensemble, while sufficient to predict thermal equilibrium properties, is very far from necessary; in some sense, "almost every" member of a very wide class of ensembles would lead to the same predictions for the particular macroscopic quantities actually observed. But I did not have any hint as to exactly what that class is; and needless to say, had not the faintest success in persuading anyone else of such heretical views.

We stress that, on this matter of the exact status of ensembles, you have to read Gibbs' own words in order to know accurately what his position was. For example, Ter Haar (1954, p. 128) tells us that "Gibbs introduced ensembles in order to use them for statistical considerations rather than to illustrate the behavior of physical systems ---." But Gibbs himself (loc. cit. p. 150) says, "... our ensembles are chosen to illustrate the probabilities of events in the real world."

It might be thought that such questions are only matters of personal taste, and a scientist ought to occupy himself with more serious things. But one's personal taste determines which research problems he believes to be the important ones in need of attention; and the total domination by the frequency view caused all attention to be directed instead to the aforementioned "ergodic" problems; to justify the methods of Statistical Mechanics by proving from the dynamic equations of motion that the canonical ensemble correctly represents the frequencies with which, over a long time, an individual system coupled to a heat bath finds itself in various states.

This problem metamorphosed from the original conception of Boltzmann and Maxwell that the phase point of an isolated (system + heat bath) ultimately passes through every state compatible with the total energy, to the statement that the time average of any phase function f(p,q) for a single system is equal to the ensemble average of f; and this statement in turn was reduced (by von Neumann and Birkhoff in the 1930's) to the condition of metric transitivity (i.e., the full phase space shall have no subspace of positive measure that is invariant under the motion). But here things become extremely complicated, and there is little further progress. For example,

even if one proves that in a certain sense "almost every" continuous flow is metrically transitive, one would still have to prove that the particular flows generated by a Hamiltonian are not exceptions.


Such a proof certainly cannot be given in generality, since counter-examples are known. One such is worth noting: in the writer's "Neoclassical Theory" of electrodynamics (Jaynes, 1973) we write a complete classical Hamiltonian system of equations for an atom (represented as a set of harmonic oscillators) interacting with light. But we find [loc. cit. Eq. (52)] that not only is the total energy a constant of the motion; the quantity Σ_n W_n/ν_n is also conserved, where W_n, ν_n are the energy and frequency of the n'th normal mode of oscillation of the atom.

Setting this new constant of the motion equal to Planck's constant h, we have a classical derivation of the E = hν law usually associated with quantum theory! Indeed, quantum theory simply takes this as a basic empirically justified postulate, and never makes any attempt to explain why such a relation exists. In Neoclassical Theory it is explained as a consequence of a new uniform integral of the motion, of a type never suspected in classical Statistical Mechanics. Because of it, for example, there is no Liouville theorem in the "action shell" subspace of states actually accessible to the system, and statistical properties of the motion are qualitatively different from those of the usual classical Statistical Mechanics. But all this emerges from a simple, innocent-looking classical Hamiltonian, involving

only harmonic oscillators with a particular coupling law (linear in the field oscillators, bilinear in the atom oscillators). Having seen this example, who can be sure that the same thing is not happening more generally?

This was recognized by Truesdell (1960) in a work that I recommend as by far the clearest exposition, carried to the most far-reaching physical results, of any discussion of ergodic theory. He comes up against "an old problem, one of the ugliest which the student of statistical mechanics must face: What can be said about the integrals of a dynamical system?" The answer is, "Practically nothing." In view of such simple counter-examples as that provided by Neoclassical theory, confident statements to the effect that real systems are almost certainly ergodic seem like so much whistling in the dark.

Nevertheless, ergodic theory, considered as a topic in its own right, does contain some important results. Unlike some others, Truesdell does not confuse the issue by trying to mix up probability notions and dynamical ones. Instead, he states unequivocally that his purpose is to calculate time averages. This is a definite, well-posed dynamical problem, having nothing to do with any probability considerations; and Truesdell proceeds to show, in greater depth than any other writer known to me, exactly what implications the Birkhoff theorem has for this question. Since we cannot prove, and in view of counter-examples have no valid reason to expect, that the flow is metrically transitive over the entire phase space S, the original hopes of Boltzmann and Maxwell must remain unrealized; but in return for this we get something far more valuable, which just misses being noticed.

The flow will be metrically transitive on some (unknown) subspace S' determined by the (unknown) uniform integrals of the motion; and the time average of any phase function f(p,q) will, by the Birkhoff theorem, be equal to its phase-space average over that subspace. Furthermore, the fraction of time that the system spends in any particular region of S' is equal to the ratio of the corresponding phase volumes.

These are just the properties that Boltzmann and Maxwell wanted; but they apply only to some subspace S' which cannot be known until we have determined all the uniform integrals of the motion. That is the purely dynamical theorem; and I think that if today we could resurrect Maxwell and tell it to him, his reaction would be: "Of course, that is obviously right and it is just what I was trying to say. The trouble was that I was groping for words, because in my day we did not have the mathematical vocabulary, arising out of measure theory and the theory of transformation groups, that is needed to state it precisely."

That more valuable result is tantalizingly close when Truesdell considers "--- the idea that however many integrals a system has, generally we shall not know the value of any but the energy, so we should assign equal a priori probability to the possible values of the rest, which amounts to disregarding the rest of them. Now an idea of this sort, by itself, is just unsound." It is indeed unsound, in the context of Truesdell's purpose to calculate correct time averages from the dynamics; for those time averages must in general depend on all the integrals of the motion, whether or not we happen to know about them.

The point that he just fails to see is that if, nevertheless, we only have the courage to go ahead and do the calculation he rejects as unsound, we can then compare its results with experimental time averages. If they disagree, then we have obtained experimental evidence of the existence of new integrals of the motion, and the nature of the deviation gives a clue as to what they may be. So, if our calculation should indeed prove to be "unsound," the result would be far more valuable to physics than a "successful" calculation!

To all this, however, one proviso must be added. Even if one could prove transitivity for the entire phase space, this result would not explain the success of equilibrium statistical mechanics, for reasons expounded in great detail before (Jaynes, 1967). These theorems apply only to time averages over enormous (strictly, infinite) time; and an average over a finite time T will approach its limiting value for T → ∞ only if T is so long that the phase point of the system has explored a "representative sample" of the accessible phase volume. But the very existence of time-dependent irreversible processes shows that the "representative sampling time" must be very long compared to the time in which our measurements are made. So the equality of phase space averages with infinite time averages fails, on two counts, to explain the equality of canonical ensemble averages and experimental values. We can conclude only that the "ergodic" attempts to justify Gibbs' statistical mechanics foundered not only on impossibly difficult technical problems of integrals of the motion; but also on a basic logical defect arising from the impossibly long averaging times.

Enter Shannon. It was the work of Claude Shannon (1948) on Information Theory which showed us the way out of this dilemma. Like all major advances, it had many precursors, whose full significance could be seen only later. One finds them not only in the work of Boltzmann and Gibbs just noted, but also in that of G. N. Lewis, L. Szilard, J. von Neumann, and W. Elsasser, to mention only the most obvious examples.

Shannon's articles appeared just at the time when I was taking a course in Statistical Mechanics from Professor Eugene Wigner; and my mind was occupied with the difficulties, which he always took care to stress, faced by the theory at that time; the short sketch above notes only a few of them. Reading Shannon filled me with the same admiration that all readers felt, for the beauty and importance of the material; but also with a growing uneasiness about its meaning. In a communication process, the message M_i is assigned probability p_i, and the entropy H = -Σ p_i log p_i is a measure of "information." But whose information? It seems at first that if information is being "sent," it must be possessed by the sender. But the sender knows perfectly well which message he wants to send; what could it possibly mean to speak of the probability that he will send message M_i?

We take a step in the direction of making sense out of this if we suppose that H measures, not the information of the sender, but the ignorance of the receiver, that is removed by receipt of the message. Indeed, many subsequent commentators appear to adopt this interpretation. Shannon, however, proceeds to use H to determine the channel capacity C required to transmit the message at a given rate. But whether a channel can or cannot transmit message M in time T obviously depends only on properties of the message and the channel--and not at all on the prior ignorance of the receiver! So this interpretation will not work either.

Agonizing over this, I was driven to conclude that the different messages considered must be the set of all those that will, or might be, sent over the channel during its useful life; and therefore Shannon's H measures the degree of ignorance of the communication engineer when he designs the technical equipment in the channel. Such a viewpoint would, to say the least, seem natural to an engineer employed by the Bell Telephone Laboratories--yet it is curious that nowhere does Shannon see fit to tell the reader explicitly whose state of knowledge he is considering, although the whole content of the theory depends crucially on this.

It is the obvious importance of Shannon's theorems that first commands our attention and respect; but as I realized only later, it was just his vagueness on these conceptual questions--allowing every reader to interpret the work in his own way--that made Shannon's writings, like those of Niels Bohr, so eminently suited to become the Scriptures of a new Religion, as they so quickly did in both cases.

Of course, we do not for a moment suggest that Shannon was deliberately vague; indeed, on other matters few writers have achieved such clarity and precision. Rather, I think, a certain amount of caution was forced on him by a growing paradox that Information Theory generates within the milieu of probability theory as it was then conceived--a paradox only vaguely sensed by those who had been taught only the strict frequency definition of probability, and clearly visible only to those familiar with the work of Jeffreys and Cox. What do the probabilities p_i mean? Do they stand for the frequencies with which the different messages are sent?

Think, for a moment, about the last telegram you sent orreceived. If the Western Union Company remains in businessfor another ten thousand years, how many times do you thinkit will be asked to transmit that identical message?

The situation here is not really different from that in statistical mechanics, where our first job is to assign probabilities to the various possible quantum states of a system. In both cases the number of possibilities is so great that a time millions of times the age of the universe would not suffice to realize all of them. But it seems to be much easier to think clearly about messages than quantum states. Here at last, it seemed to me, was an example where the absurdity of a frequency interpretation is so obvious that no one can fail to see it; but the usefulness of the probability approach was equally clear. The probabilities assigned to individual messages are not measurable frequencies; they are only a means of describing a state of knowledge; just the original sense in which Laplace and Jeffreys interpreted a probability distribution.

The reason for the vagueness is then apparent; to a person who has been trained to think of probability only in the sense of frequency in a random experiment (as was surely the case for anyone educated at M.I.T. in the 1930's!), the idea that a probability distribution represents a mere state of knowledge is strictly taboo. A probability distribution would not be "objective" unless it represents a real physical situation. The question, "Whose information are we describing?" doesn't make sense, because the notion of a probability to a person with a certain state of knowledge just doesn't exist. So Shannon is forced to do the most careful egg-walking, speaking of a probability as if it were a real, measurable frequency, while using it in a way that shows clearly that it is not.

For example, Shannon considers the entropies H_1 calculated from single-letter frequencies, H_2 from digram frequencies, H_3 from trigram frequencies, etc., as a sequence of successive approximations to the "true" entropy of the source, which is H = lim H_n as n → ∞. Application of his theorems presupposes that all this is known. But suppose we try to determine the "true" ten-gram frequencies of English text. The number of different ten-grams is about 1.4 x 10^14; to determine them all to something like five percent accuracy, we should need a sample of English text containing about 10^17 ten-grams. That is thousands of times greater than all the English text in the Library of Congress, and indeed much greater than all the English text recorded since the invention of printing.

If we had overcome that difficulty, and could measure those ten-gram frequencies (by scanning the entire text) at the rate of 1000 per second, it would require about 4400 years to take the data; and to record it on paper at a rate of 1000 entries per sheet would require a stack of paper about 7000 miles high. Evidently, then, we are destined never to know the "true" entropy of the English language; and in the application of Shannon's theorems to real communication systems we shall have to accept some compromise.
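These orders of magnitude are easy to check for oneself. The little computation below is not part of the original argument, only a back-of-envelope verification; it assumes a 26-letter alphabet, roughly 400 counts per ten-gram for five-percent accuracy, and about 0.1 mm per sheet of paper:

    # Back-of-envelope check of the ten-gram estimates (illustrative assumptions only)
    ngrams = 26 ** 10                      # distinct ten-grams over a 26-letter alphabet
    print(f"distinct ten-grams ~ {ngrams:.1e}")                    # ~1.4e14

    sample = ngrams * (1 / 0.05) ** 2      # ~400 counts each for ~5% relative accuracy
    print(f"sample needed      ~ {sample:.0e} ten-grams")          # of order 10^17

    years = ngrams / 1000 / 3.15e7         # tabulating 1000 of the frequencies per second
    print(f"time to take data  ~ {years:.0f} years")               # ~4500 years, consistent with the rough 4400 above

    miles = (ngrams / 1000) * 1e-4 / 1609  # 1000 entries per sheet, ~0.1 mm per sheet
    print(f"stack of paper     ~ {miles:.0f} miles")               # several thousand miles (depends on the paper)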

Now, our story reaches its climax. Shannon discusses the problem of encoding a message, say English text, into binary digits in the most efficient way. The essential step is to assign probabilities to each of the conceivable messages in a way which incorporates the prior knowledge we have about the structure of English. Having this probability assignment, a construction found independently by Shannon and R. M. Fano yields the encoding rules which minimize the expected transmission time of a message.


But, as noted, we shall never know the "true" probabilities of English messages; and so Shannon suggests the principle by which we may construct the distribution p_i actually used for applications: "--- we may choose to use some of our statistical knowledge of English in constructing a code, but not all of it. In such a case we consider the source with the maximum entropy subject to the statistical conditions we wish to retain. The entropy of this source determines the channel capacity, which is necessary and sufficient." [emphasis mine].

Shannon does not follow up this suggestion with the equations,but turns at this point to other matters. But if you start tosolve this problem of maximizing the entropy subject to certainconstraints, you will soon discover that you are writing downsome very familiar equations. The probability distributionover messages is just the Gibbs canonical distribution withcertain parameters. To find the values of the parameters, youmust evaluate a certain partition function, etc.
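For the reader who wants to see those familiar equations at once, here is the calculation in skeleton form (my addition; the same equations are derived more formally as (B1)-(B4) in Part B below). Maximizing H = -Σ p_i log p_i subject to normalization and to prescribed mean values <f_k> = F_k of given functions f_k(x) yields

$$
p_i \;=\; \frac{1}{Z(\lambda_1,\ldots,\lambda_m)}\,\exp\Big[-\sum_{k}\lambda_k f_k(x_i)\Big],
\qquad
Z \;=\; \sum_{i}\exp\Big[-\sum_{k}\lambda_k f_k(x_i)\Big],
$$

which is precisely the Gibbs canonical distribution and its partition function, the multipliers λ_k being fixed by the constraint data.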

Here was a problem of statistical inference--or what is the same thing, statistical decision theory--in which we are to decide on the best way of encoding a message, making use of certain partial information about the message. The solution turns out to be mathematically identical with the Gibbs formalism of statistical mechanics, which physicists had been trying, long and unsuccessfully, to justify in an entirely different way.

The conclusion, it seemed to me, was inescapable. We can have our justification for the rules of statistical mechanics, in a way that is incomparably simpler than anyone had thought possible, if we are willing to pay the price. The price is simply that we must loosen the connections between probability and frequency, by returning to the original viewpoint of Bernoulli and Laplace. The only new feature is that their Principle of Insufficient Reason is now generalized to the Principle of Maximum Entropy. Once this is accepted, the general formalism of statistical mechanics--partition functions, grand canonical ensemble, laws of thermodynamics, fluctuation laws--can be derived in a few lines without wasting a minute on ergodic theory. The pedagogical implications are clear.

The price we have paid for this simplification is that we cannot interpret the canonical distribution as giving the frequencies with which a system goes into the various states. But nobody had ever justified or needed that interpretation anyway. In recognizing that the canonical distribution represents only our state of knowledge when we have certain partial information derived from macroscopic measurements, we are not losing anything we had before, but only frankly admitting the situation that has always existed; and indeed, which Gibbs had recognized.

On the other hand, what we have gained by this change in interpretation is far more than we bargained for. Even if one had been completely successful in proving ergodic theorems, and had continued to ignore the difficulty about length of time over which the averages have to be taken, this still would have given a justification for the methods of Gibbs only in the equilibrium case. But the principle of maximum entropy, being entirely independent of the equations of motion, contains no such restriction. If one grants that it represents a valid method of reasoning at all, one must grant that it gives us also the long-hoped-for general formalism for treatment of irreversible processes!

The last statement above breaks into new ground, and claims for statistical mechanics based on Information Theory a far wider range of validity and applicability than was ever claimed for conventional statistical mechanics. Just for that reason, the issue is no longer one of mere philosophical preference for one viewpoint or another; the issue is now one of definite mathematical fact. For the assertion just made can be put to the test by carrying out specific calculations, and will prove to be either right or wrong.

Some Personal Recollections. All this was clear to me by 1951;nevertheless, no attempt at publication was made for anotherfive years. There were technical problems of extending theformalism to continuous distributions and the density matrix,that were not solved for many years; but the reason for theinitial delay was quite different.

In the Summer of 1951, Professor G. Uhlenbeck gave his famous course on Statistical Mechanics at Stanford, and following the lectures I had many conversations with him, over lunch, about the foundations of the theory and current progress on it. I had expected, naively, that he would be enthusiastic about Shannon's work, and as eager as I to exploit these ideas for Statistical Mechanics. Instead, he seemed to think that the basic problems were, in principle, solved by the then recent work of Bogoliubov and van Hove (which seemed to me filling in details, but not touching at all on the real basic problems)--and adamantly rejected all suggestions that there is any connection between entropy and information.

His initial reaction to my remarks was exactly like my initial reaction to Shannon's: "Whose information?" His position, which I never succeeded in shaking one iota, was: "Entropy cannot be a measure of 'amount of ignorance,' because different people have different amounts of ignorance; entropy is a definite physical quantity that can be measured in the laboratory with thermometers and calorimeters." Although the answer to this was clear in my own mind, I was unable, at the time, to convey that answer to him. In trying to explain a new idea I was, like Maxwell, groping for words because the way of thinking and habits of language then current had to be broken before I could express a different way of thinking.

Today, it seems trivially easy to answer Professor Uhlenbeck's objection as follows: "Certainly, different people have different amounts of ignorance. The entropy of a thermodynamic system is a measure of the degree of ignorance of a person whose sole knowledge about its microstate consists of the values of the macroscopic quantities X_i which define its thermodynamic state. This is a completely 'objective' quantity, in the sense that it is a function only of the X_i, and does not depend on anybody's personality. There is then no reason why it cannot be measured in the laboratory."

It was my total inability to communicate this argument to Professor Uhlenbeck that caused me to spend another five years thinking over these matters, trying to write down my thoughts more clearly and explicitly, and making sure in my own mind that I could answer all the objections that Uhlenbeck and others had raised. Finally, in the Summer of 1956 I collected this into two papers, sending the first off to the Physical Review on August 29.

Now another irony takes place; it is left to the Reader toguess to whom the Editor (S. Goudsmit) sent it for refereeing.That Unknown Referee's comments (now framed on my office wallas an encouragement to young men who today have to fight fornew ideas against an Establishment that wants only new mathe­matics) opine that the work is clearly written, but since itexpounds only a certain philosophy of interpretation and hasno application whatsoever in Physics, it is out of place in aPhysics journal. But a second referee thought differently,and so the papers were accepted after all, appearing in 1957.Within a year there were over 2000 requests for reprints.

Needless to say, my own understanding of the technical problems continued to evolve for many years afterward. A schoolboy, having just learned the rules of arithmetic, does not see immediately how to apply them to the extraction of cube roots, although he has in his grasp all the principles needed for this. Similarly, I did not see how to set down the explicit equations for irreversible processes because I simply could not believe that the solution to such a complicated problem could be as simple as the Maximum Entropy Principle was giving; and spent six more years (1956-1962) trying to mutilate the principle by grafting new and more complicated rococo embellishments onto it. In my Brandeis lectures of 1962, tongue and pen somehow managed to state the right rule [Eq. (50)]; but the inner mind did not fully assent; it still seemed like getting something for nothing.

The final breakthrough came in the Christmas vacation period of 1962 when, after all else had failed, I finally had the courage to sit down and work out all the details of the calculations that result from using only the Maximum Entropy Principle, and nothing else. Within three days the new formalism was in hand, masses of the known correct results of Onsager, Wiener, Kirkwood, Callen, Kubo, Mori, MacLennan were pouring out as special cases, just as fast as I could write them down; and it was clear that this was it. Two months later, my students were the first to have assigned homework problems to predict irreversible processes by solving Wiener-Hopf integral equations.

As it turned out, no more principles were needed beyond those stated in my first paper; one has merely to take them absolutely literally and apply them, putting into the equations the macroscopic information that one does, in fact, have about a nonequilibrium state; and all else follows inevitably.

From this the reader will understand why I have considerable sympathy for those who today have difficulty in accepting the Principle of Maximum Entropy, because (1) the results seem to come too easily to believe; and (2) it seems at first glance as if the dynamics has been ignored. In fact, I struggled for eleven years with exactly the same feeling, before seeing clearly not only why, but also in detail how the formalism is able to function so efficiently.

The point is that we are not ignoring the dynamics, and weare not getting something for nothing, because we are askingof the formalism only some extremely simple questions; we areasking only for predictions of experimentally reproduciblethings; and for these all circumstances that are not under theexperimenter's control must, of necessity, be irrelevant.

If certain macroscopically controlled conditions are found, in the laboratory, to be sufficient to determine a reproducible outcome, then it must follow that information about those macroscopic conditions tells us everything about the microscopic state that is relevant for theoretical prediction of that outcome. It may seem at first "unsound" to assign equal a priori probabilities to all other details, as the Maximum Entropy Principle does; but in fact we are assigning uniform probabilities only to details that are irrelevant for questions about reproducible phenomena.

To assume further information by putting some additional fine-grained structure into our ensembles would, in all probability, not lead to incorrect predictions; it would only force us to calculate intricate details that would, in the end, cancel out of our final predictions. Solution by the Maximum Entropy Principle is so unbelievably simple just because it eliminates those irrelevant details right at the beginning of the calculation by averaging over them.

To discover this argument requires only that one think, very carefully, about why Boltzmann's method of the most probable distribution was able to predict the correct spatial and velocity distribution of the molecules; and this could have been done at any time in the past 100 years. Whether or not one wishes to recognize it, this--and not ergodic properties--is the real reason why all Statistical Mechanics works. But once the argument is understood, it is clear that it applies equally well whether the macroscopic state is equilibrium or nonequilibrium, and whether the observed phenomenon is reversible or irreversible.

I hope that this historical account will also convey to the reader that the Principle of Maximum Entropy, although a powerful tool, is hardly a radical innovation. Its philosophy was clearly foreshadowed by Laplace and Jeffreys; its mathematics by Boltzmann and Gibbs.

B. Present Features and Applications.

Let us set down, for reference, a bit of the basic Maximum Entropy formalism for the finite discrete case, putting off generalizations until they are needed. There are n different possibilities, which would be distinguished adequately by a single index (i = 1, 2, ..., n). Nevertheless we find it helpful, both for notation and for the applications we have in mind, to introduce in addition a real variable x, which can take on the discrete values (x_i, 1 ≤ i ≤ n), defined in any way and not necessarily all distinct. If we have certain information I about x, the problem is to represent this by a probability distribution {p_i} which has maximum entropy while agreeing with I.

Clearly, such a problem cannot be well-posed for arbitrary information; I must be such that, given any proposed distribution {p_i}, we can determine unambiguously whether I does or does not agree with {p_i}. Such information will be called testable. For example, consider:

I1 ≡ "It is certain that tanh x < 0.7."

I2 ≡ "There is at least a 90% probability that tanh x < 0.7."

I3 ≡ "The mean value of tanh x is 0.675."

I4 ≡ "The mean value of tanh x is probably less than 0.7."

I5 ≡ "There is some reason to believe that tanh x ≈ 0.675."

Statements I1, I2, I3 are testable, and may be used as constraints in maximizing the entropy. I4 and I5, although clearly relevant to inference about x, are too vague to be testable, and we have at present no formal principle by which such information can be used in a mathematical theory. However, the fact that our intuitive common sense does make use of nontestable information suggests that new principles for this, as yet undiscovered, must exist.

Since n is finite, the entropy has an absolute maximum value log n, and any constraint can only lower this. If we think of the {p_i} as cartesian coordinates of a point P in an n-dimensional space, P is constrained by p_i ≥ 0, Σ p_i = 1 to lie on a domain D which is a "triangular" segment of an (n-1)-dimensional hyperplane. On D the entropy varies continuously, taking on all values in 0 ≤ H ≤ log n and reaching its absolute maximum at the center. Any testable information will restrict P to some subregion D' of D, and clearly the entropy has some least upper bound H' ≤ log n on D'. So the maximum entropy problem must have a solution if D' is a closed set.

There may be more than one solution: for example, the information I6 ≡ "The entropy of the distribution {p_i} is not greater than log(n-1)" is clearly testable, and if n > 2 it yields an infinite number of solutions. Furthermore, strictly speaking, if D' is an open set there may not be any solution, the upper bound being approached but not actually reached on D'. Such a case is generated by I7 ≡ "p_1² + p_2² < 1/n²". However, since we are concerned with physical problems where the distinction between open and closed sets cannot matter, we would accept a point on the closure of D' (in this example, on its boundary) as a valid solution, although corresponding strictly only to I8 ≡ "p_1² + p_2² ≤ 1/n²".

But these considerations are mathematical niceties that one has to mention only because he will be criticized if he does not. In the real applications that matter, we have not yet found a case which does not have a unique solution.

In principle, every different kind of testable information will generate a different kind of mathematical problem. But there is one important class of problems for which the general solution was given once and for all, by Gibbs. If the constraints consist of specifying mean values of certain functions {f_1(x), f_2(x), ..., f_m(x)}:

    <f_k(x)> = Σ_{i=1..n} p_i f_k(x_i) = F_k ,    1 ≤ k ≤ m        (B1)

where the {F_k} are numbers given in the statement of the problem, then if m < n, entropy maximization is a standard variational problem solvable by stationarity using the Lagrange multiplier technique. It has the formal solution

    p_i = Z(λ_1, ..., λ_m)^{-1} exp[-λ_1 f_1(x_i) - ... - λ_m f_m(x_i)]        (B2)

where

    Z(λ_1, ..., λ_m) ≡ Σ_{i=1..n} exp[-λ_1 f_1(x_i) - ... - λ_m f_m(x_i)]        (B3)

is the partition function and the {λ_k} are the Lagrange multipliers, which are chosen so as to satisfy the constraints (B1). This is the case if

    F_k = -(∂/∂λ_k) log Z ,    1 ≤ k ≤ m        (B4)

a set of m simultaneous equations for m unknowns. The value of the entropy maximum then attained is, as noted in my reminiscences, a function only of the given data:

    S(F_1, ..., F_m) = log Z + Σ_k λ_k F_k        (B5)

and if this function were known, the explicit solution of (B4) would be

    λ_k = ∂S/∂F_k ,    1 ≤ k ≤ m        (B6)

Given this distribution, the best prediction we can make (in the sense of minimizing the expected square of the error) of any quantity q(x) is then

    <q(x)> = Σ_{i=1..n} p_i q(x_i)

and numerous covariance and reciprocity rules are contained in the identity

    ∂<q>/∂λ_k = <q><f_k> - <q f_k>        (B7)

[note the special cases q(x) = f_j(x), and j = k]. The functions f_k(x) may contain also some parameters α_j:

    f_k = f_k(x; α_1, ..., α_s)

(which in physical applications might have the meaning of volume, magnetic field intensity, angular velocity, etc.); and we have an important variational property: if we make an arbitrary small change in all the data of the problem {δF_k, δα_r}, we may compare two slightly different maximum-entropy solutions. The difference in their entropies is found, after some calculation, to be

    δS = Σ_k λ_k δQ_k        (B8)

where

    δQ_k ≡ δF_k - <δf_k> = δF_k - Σ_r <∂f_k/∂α_r> δα_r        (B9)

The meaning of this identity has a familiar ring: there is no such function as Q_k(F_1 ... F_m; α_1 ... α_s), because δQ_k is not an exact differential. However, the Lagrange multiplier λ_k is an integrating factor such that Σ_k λ_k δQ_k is the exact differential of a "state function" S(F_1 ... F_m; α_1 ... α_s).

I believe that Clausius would recognize here an interesting echo of his work, although we have only stated some general rules for plausible reasoning, making no necessary reference to physics. This is enough of the bare skeleton of the formalism to serve as the basis for some examples and discussion.
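Since everything in (B1)-(B7) is finite and explicit, the whole skeleton can be put on a computer in a few lines. The sketch below is my own illustration, not part of the original text; it assumes numpy and scipy are available, and the function name maxent is arbitrary. It finds the multipliers λ_k for given data F_k by minimizing the convex dual function log Z + Σ_k λ_k F_k, whose gradient is F_k - <f_k>, and then returns the distribution (B2):

    import numpy as np
    from scipy.optimize import minimize

    def maxent(x, fs, F):
        """Maximum-entropy distribution on the points x subject to <f_k> = F_k."""
        A = np.array([[f(xi) for xi in x] for f in fs])   # m x n matrix of f_k(x_i)
        F = np.asarray(F, dtype=float)

        def dual(lam):             # log Z + sum_k lam_k F_k, convex in lam
            return np.log(np.exp(-lam @ A).sum()) + lam @ F

        def grad(lam):             # F_k - <f_k>
            w = np.exp(-lam @ A)
            return F - A @ (w / w.sum())

        lam = minimize(dual, np.zeros(len(fs)), jac=grad, method="BFGS").x
        w = np.exp(-lam @ A)
        return lam, w / w.sum()

    # Example: the Brandeis dice constraint <i> = 4.5 of the next subsection.
    lam, p = maxent(np.arange(1, 7), [lambda i: i], [4.5])
    print(lam.round(5), p.round(5))   # lam ~ -0.37105; p reproduces (B13) below

Used this way, the routine is just a numerical stand-in for solving (B4); for the dice example the same answer is obtained analytically below.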

The Brandeis Dice Problem. First, we illustrate the formalism by working out the numerical solution to a problem which was used in the Introduction to my 1962 Brandeis lectures merely as a qualitative illustration of the ideas, but has since become a cause celebre as some papers have been written attacking the Principle of Maximum Entropy on the grounds of this very example. So a close look at it will take us straight to the heart of some of the most common misconceptions and, I hope, give us some appreciation of what the Principle of Maximum Entropy does and does not (indeed, should not) accomplish for us.


When a die is tossed, the number of spots up can have any value i in 1 ≤ i ≤ 6. Suppose a die has been tossed N times and we are told only that the average number of spots up was not 3.5, as we might expect from an "honest" die, but 4.5. Given this information, and nothing else, what probability should we assign to i spots on the next toss? The Brandeis lectures started with a qualitative graphical discussion of this problem, which showed (or so I thought) how ordinary common sense forces us to a result with the qualitative properties of the maximum-entropy solution.

Let us see what solution the Principle of Maximum Entropygives for this problem, if we interpret the data as imposingthe mean value constraint

    Σ_{i=1..6} i p_i = 4.5        (B10)

The partition function is

    Z(λ) = Σ_{i=1..6} e^{-λi} = x(1 - x^6)/(1 - x)        (B11)

where x ≡ e^{-λ}. The constraint (B10) then becomes

    -(∂/∂λ) log Z = (1 - 7x^6 + 6x^7)/[(1 - x)(1 - x^6)] = 4.5

or

    3x^7 - 5x^6 + 9x - 7 = 0        (B12)

By computer, the desired root of this is x = 1.44925, which yields λ = -0.37105, Z = 26.66365, log Z = 3.28330. The maximum-entropy probabilities are p_i = Z^{-1} x^i, or

    {p_1, ..., p_6} = {0.05435, 0.07877, 0.11416, 0.16545, 0.23977, 0.34749}        (B13)

From (B5), the entropy of this distribution is

    S = 1.61358 natural units        (B14)

as compared to the maximum of log_e 6 = 1.79176, corresponding to no constraints and a uniform distribution.
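As a check on the arithmetic, the root of (B12) and the numbers (B13)-(B14) are easy to reproduce. The following few lines are my own verification sketch (using numpy; the variable names are incidental):

    import numpy as np

    i = np.arange(1, 7)
    # Roots of 3x^7 - 5x^6 + 9x - 7 = 0.  x = 1 is a spurious (double) root introduced
    # by clearing denominators, so keep the real positive root whose distribution
    # actually satisfies <i> = 4.5.
    cands = [r.real for r in np.roots([3, -5, 0, 0, 0, 0, 9, -7])
             if abs(r.imag) < 1e-8 and r.real > 0]

    def mean(x):
        p = x ** i
        return (i * p).sum() / p.sum()

    x = min(cands, key=lambda r: abs(mean(r) - 4.5))
    p = x ** i / (x ** i).sum()
    print(round(x, 5), round(-np.log(x), 5))    # 1.44925  -0.37105
    print(p.round(5))                           # reproduces (B13)
    print(-(p * np.log(p)).sum())               # reproduces (B14) to within rounding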

Now, what does this result mean? In the first place, it is a distribution {p_r, 1 ≤ r ≤ 6} on a space of only six points; the sample space S of a single trial. Therefore, our result as it stands is only a means of describing a state of knowledge about the outcome of a single trial. It represents a state of knowledge in which one has only (1) the enumeration of the six possibilities; and (2) the mean value constraint (B10); and no other information. The distribution is "maximally noncommittal" with respect to all other matters; it is as uniform (by the criterion of the Shannon information measure) as it can get without violating the given constraint.

Any probability distribution over some sample space S enables us to make statements about (i.e., assign probabilities to) propositions or events defined within that space. It does not--and by its very nature cannot--make statements about any event lying outside that space. Therefore, our maximum-entropy distribution does not, and cannot, make any statement about frequencies.

Anything one says about a frequency in n tosses is a statement about an event in the n-fold extension space S^n = S × S × ... × S of n tosses, containing 6^n points (and of course, in any higher space which has S^n as a subspace).

It may be common practice to jump to the conclusion that aprobability in one space is the same as a frequency in a dif­ferent space; and indeed, the level of many expositions issuch that the distinction is not recognized at all. But thefirst thing one has to learn about using the Principle ofMaximum Entropy in real problems is that the mathematicalrules of probability theory must be obeyed strictly; all con­ceptual sloppiness of this sort must be recognized and expunged.

There is, indeed, a connection between a probability p_i in space S and a frequency g_i in S^n; but we are justified in using only those connections which are deducible from the mathematical rules of probability theory. As we shall see in connection with fluctuation theory, some common attempts to identify probability and frequency actually stand in conflict with the rules of probability theory.

Probability and Frequency. To derive the simplest and most general connection, the sample space S^n of n trials may be labeled by {r_1, r_2, ..., r_n}, where 1 ≤ r_k ≤ 6, and r_k is the number of spots up on the k'th toss. The most general probability assignment on S^n is a set of non-negative real numbers P(r_1, ..., r_n) such that

    Σ_{r_1=1..6} ··· Σ_{r_n=1..6} P(r_1, ..., r_n) = 1        (B15)

In any given sequence {r_1, ..., r_n} of results, the frequency with which i spots occurs is

    g_i = n^{-1} Σ_{k=1..n} δ(r_k, i)        (B16)

and its expectation over the distribution P is

    <g_i> = n^{-1} Σ_{k=1..n} p_k(i)        (B17)

where p_k(i) is the probability of getting i spots on the k'th toss, regardless of what happens in other tosses. The expected frequency of an event is always equal to its average probability over the different trials.

Many experiments fall into the category of exchangeable sequences; i.e., it is clear that the underlying "mechanism" of the experiment, although unknown, is not changing from one trial to another. The probability of any particular sequence of results {r_1, ..., r_n} should then depend only on how many times a particular outcome r = i happened, and not on which particular trials. Then the probability distribution P(r_1, ..., r_n) is invariant under permutations of the labels k. In this case, the probability of i spots is the same at each trial: p_1(i) = p_2(i) = ... = p_n(i) = p_i, and (B17) becomes

    <g_i> = p_i        (B18)

In an exchangeable sequence, the probability of an event at one trial is not the same as its frequency in many trials; but it is numerically equal to the expectation of that frequency; and this connection holds whatever correlations may exist between different trials.

The probability is therefore the "best" estimate of the frequency, in the sense that it minimizes the expected square of the error. But the result (B18) tells us nothing whatsoever about whether this is a reliable estimate; and indeed nothing in the space S of a single trial can tell us anything about the reliability of (B18).

To investigate this, note that by a similar calculation, the expected product of two frequencies is

    <g_i g_j> = n^{-2} Σ_{k,m=1..n} p_k(i) p(j,m|i,k)        (B19)

where p(j,m|i,k) is the conditional probability that the m'th trial gives the result j, given that the k'th trial had the outcome i. Of course, if m = k we have simply p(j,k|i,k) = δ_ij.


In an exchangeable sequence p(j,m|i,k) is independent of m, k for m ≠ k; and so p_k(i) p(j,m|i,k) = p_ij, the probability of getting the outcomes i, j respectively at any two different tosses. The covariance of g_i, g_j then reduces to

    <g_i g_j> - <g_i><g_j> = (p_ij - p_i p_j) + (1/n)(δ_ij p_i - p_ij)        (B20)

If the probabilities are not independent, p_ij ≠ p_i p_j, this does not go to zero for large n.

Let us examine the case i = j more closely. Writing p_ii = α_i p_i, α_i is the conditional probability that, having obtained the result i on one toss, we shall get it at some other specified toss. The variance of g_i is, from (B20), dropping the index i,

    <g²> - <g>² = p(α - p) + (1/n) p(1 - α)        (B21)

Two extreme cases of inter-trial correlations are contained in (B21). For complete independence, α = p, the variance reduces to n^{-1} p(1 - p), just the result of the de Moivre-Laplace limit theorem (A4). But as cautioned before, in any other case the variance does not tend to zero at all; there is no "law of large numbers." For complete dependence, α = 1 (i.e., having seen the result of one toss, the die is certain to give the same result at all others), (B21) reduces to p(1 - p), which again makes excellent sense; in this case our uncertainty about the frequency in any number of tosses must be just our uncertainty about the first toss.
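Both limits of (B21) are easy to see numerically. The following simulation is my own illustration, not part of the original text; it estimates the variance of the frequency g over many repetitions of n = 100 trials with p = 1/3, once for completely independent trials (α = p) and once for completely dependent ones (α = 1):

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, reps = 100, 1 / 3, 50_000

    # Independent trials: each toss yields the outcome with probability p.
    freq = (rng.random((reps, n)) < p).mean(axis=1)
    print(freq.var(), p * (1 - p) / n)    # both ~ 0.0022   (B21 with alpha = p)

    # Completely dependent trials: one draw repeated n times, so the
    # frequency is either 0 or 1 (an extreme exchangeable sequence, alpha = 1).
    freq = (rng.random(reps) < p).astype(float)
    print(freq.var(), p * (1 - p))        # both ~ 0.22     (B21 with alpha = 1)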

Note that the variance (B21) becomes zero for a slightnegative correlation:

    α - p = -(1 - p)/(n - 1)        (B22)

Due to the permutation invariance of P(r_1, ..., r_n) it is not possible to have a negative correlation stronger than this; as n → ∞ it is not possible to have any negative correlation in an exchangeable sequence. This corresponds to the famous de Finetti (1937) representation theorem; in the literature of pure mathematics it is called the Hausdorff moment problem. An almost unbelievably simple proof has just been found by Heath and Sudderth (1976).

To summarize: given any probability assignment P(r_1, ..., r_n) on the space S^n, we can determine the probability distribution w_i(t) for the frequency g_i to take on any of its possible values g_i = t/n, 0 ≤ t ≤ n. The (mean) ± (standard deviation) over this distribution then provides a reasonable statement of our "best" estimate of g_i and its accuracy. In the case of an exchangeable sequence, this estimate is

    (g_i)_est = p_i ± {p_i(1 - p_i)[R_i + (1 - R_i)/n]}^{1/2}        (B23)

where R_i ≡ (α_i - p_i)/(1 - p_i) is a measure of the inter-trial correlation, ranging from R = 0 for complete independence to R = 1 for complete dependence.

Evidently, then, to suppose that a probability assignment at a single trial is also an assertion about a frequency in many trials in the sense of the Bernoulli and de Moivre-Laplace limit theorems, is in general unjustified unless (1) the successive trials form an exchangeable sequence, and (2) the correlation of different trials is strictly zero. However, there are other kinds of connections between probability and frequency; and maximum-entropy distributions have an exact and close relation to frequencies after all, as we shall see presently.

Relation to Bayes' Theorem. To prepare us to deal with some objections to the maximum entropy solution (B13) we turn back to the basic product and sum rules of probability theory (A8), (A9) derived by Cox from requirements of consistency. Just as any argument of deductive logic can be resolved ultimately into many syllogisms, so any calculation of inductive logic (i.e., probability theory) is reducible to many applications of these rules.

We stress that these rules make no reference to frequencies, or to any random experiment. The numbers p(A|B) are simply a convenient numerical scale on which to represent degrees of plausibility. As noted at the beginning of this work, it is the problem of determining initial numerical values by logical analysis of the prior information in more general cases than solved by Bernoulli and Laplace, that underlies our study.

Furthermore, in neither the statement nor the derivation of these rules is there any reference to the notion of a sample space. In a formally qualitative sense, therefore, they may be applied to any propositions A, B, C, ... with unambiguous meanings. Their complete qualitative correspondence with ordinary common sense was demonstrated in exhaustive detail by Polya (1954).

But in quantitative applications we find at once that merely defining two propositions, A, B is not sufficient to determine any numerical value for p(A|B). This numerical value depends not only on A, B, but also on which alternative propositions A', A'', etc. are to be considered if A should be false; and the problem is mathematically indeterminate until those alternatives are fully specified. In other words, we must define our "sample space" or "hypothesis space" before we have any mathematically well-posed problem.

In statistical applications (parameter estimation, hypothesis testing), the most important constructive rule is just the statement that the product rule is consistent; i.e., p(AB|C) is symmetric in A and B, so p(A|BC)p(B|C) = p(B|AC)p(A|C). If p(B|C) ≠ 0, we thus obtain

    p(A|BC) = p(A|C) p(B|AC)/p(B|C)        (B24)

in which we may call C the prior information, B the conditioning information. In typical applications, C represents the general background knowledge or assumptions used to formulate the problem, B is the new data of some experiment, and A is some hypothesis being tested. For example, in the Millikan oil-drop experiment, we might take A as the hypothesis: "the electronic charge lies in the interval 4.802 < e < 4.803," while C represents the general assumed known laws of electrostatics and viscous hydrodynamics and the results of previous measurements, while B stands for the new data being used to find a revised "best" value of e. Equation (B24) then shows how the prior probability p(A|C) is changed to the posterior probability p(A|BC) as a result of acquiring the new information B.

In this kind of application, p(B|AC) is a "direct" or "sampling" probability, since we reason in the direction of the causal influence, from an assumed cause A to a presumed observable result B; and p(A|BC) is an "inverse" probability, in which we reason from an observed result B to an assumed cause A. On comparing with (A7) we see that (B24) is a more general form of Laplace's rule, in which we need not have an exhaustive set of possible causes. Therefore, since (A7) is always called "Bayes' theorem," we may as well apply the same name to (B24).
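A toy numerical illustration of (B24) may help fix ideas (my addition; the numbers are invented purely for the example, and for simplicity the two hypotheses are taken to be exhaustive so that p(B|C) can be computed by summing over them):

    # Toy update by Eq. (B24): p(A|BC) = p(A|C) p(B|AC) / p(B|C)
    prior = {"A": 0.5, "A'": 0.5}        # p(A|C), p(A'|C): equally plausible beforehand
    likelihood = {"A": 0.3, "A'": 0.1}   # p(B|AC), p(B|A'C): datum B favors A three to one

    p_B = sum(prior[h] * likelihood[h] for h in prior)               # p(B|C)
    posterior = {h: prior[h] * likelihood[h] / p_B for h in prior}
    print(posterior)                     # A: 0.75, A': 0.25 (up to rounding)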

At the risk--or rather the certainty--of belaboring it, we stress again that we are concerned here with inductive reasoning of any kind, not necessarily related to random experiments or any repetitive process. On the other hand, nothing prevents us from applying the theory to a repetitive situation (i.e., n tosses of a die); and propositions about frequencies g_i are then just as legitimate pieces of data or objects of inquiry as any other propositions. Various kinds of connection between probability and frequency then appear, as mathematical consequences of (A8), (A9). We have just seen one of them.

But now, could we have solved the Brandeis dice problem by applying Bayes' theorem instead of maximum entropy? If so, how do the results compare? Friedman and Shimony (1971) (hereafter denoted FS) claimed to exhibit an inconsistency in the Principle of Maximum Entropy (hereafter denoted PME) by an argument which introduced a proposition d_s, so ill-defined that they tried to use it as (1) a constraint in PME, (2) a conditioning statement in Bayes' theorem; and (3) an hypothesis whose posterior probability is calculated. Therefore, let us note the following.

If a statement d referring to a probability distribution inspace S is testable (for example, if it specifies a mean value<f> for some function f(i) defined on S), then it can be usedas a constraint in PME; but it cannot be used as a conditioningstatement in Bayes' theorem because it is not a statement aboutany event in S or any other space.

Conversely, a statement D about an event in the space S^n (for example, an observed frequency) can be used as a conditioning statement in applying Bayes' theorem, whereupon it yields a posterior distribution on S^n which may be contracted to a marginal distribution on S; but D cannot be used as a constraint in applying PME in space S, because it is not a statement about any event in S, or about any probability distribution over S; i.e., it is not testable information in S.

At this point, informed students of statistical mechanics will be astonished at the suggestion that there is any inconsistency between application of PME in space S and of Bayes' theorem in S^n, since the former yields a canonical distribution, while the latter is just the Darwin-Fowler method, originally introduced as a rigorous way of justifying the canonical distribution! The mathematical fact shown by this well-known calculation (Schrodinger, 1948) is that, whether we use maximum entropy in space S with a constraint fixing an average <f> over a probability distribution, or apply Bayes' theorem in S^n with a conditioning statement fixing a numerically equal average f over sample values, we obtain for large n identical distributions in the space S. The result generalizes at once to the case of several simultaneous mean-value constraints.

This not only illustrates--contrary to the claims of FS--the consistency of PME with the other principles of probability theory, but it shows what a powerful tool PME is; i.e., how much simpler and more convenient mathematically it is to use PME in statistical calculations if the distribution on S is what we are seeking. PME leads us directly to the same final result, without any need to go into a higher space S^n and carry out passage to the limit n → ∞ by saddle-point integration.

Of course, it is as true in probability theory as in carpentry that introduction of more powerful tools brings with it the obligation to exercise a higher level of understanding and judgment in using them. If you give a carpenter a fancy new power tool, he may use it to turn out more precise work in greater quantity; or he may just cut off his thumb with it. It depends on the carpenter.

The FS article led to considerably more discussion (see the references collected with the FS one) in which severed thumbs proliferated like hydras; but the level of confusion about the points already noted is such that it would be futile to attempt any analysis of the FS arguments.

FS suggest that a possible way of resolving all this is to deny that the probability of d_s can be well-defined. Of course it cannot be; however, to understand the situation we need no "deep and systematic analysis of the concept of reasonable degree of belief." We need only raise our standards of exposition to the same level that is required in any other application of probability theory; i.e., we must define our propositions and sample spaces with enough precision to make a determinate mathematical problem.

There is a more serious difficulty in trying to reply to these criticisms. If FS dislike the maximum-entropy solution (B13) to this problem strongly enough to write three articles attacking it, then it would seem to follow that they prefer a different solution. But what different solution? One cannot form any clear idea of what is really troubling them, because in all these publications FS give no hint as to how, in their view, a more acceptable solution ought to differ from (B13).

The Rowlinson Criticism. In sharp contrast to the FS criticisms is that of J. S. Rowlinson (1970), who considers the same dice problem but does offer an alternative solution. For this reason, it is easy to give a precise quantitative reply to his criticism.

He starts with the all too familiar line: "Most scientists would say that the probability of an event is (or represents) the frequency with which it occurs in a given situation." Likewise, a critic of Columbus could have written (after he had returned from his first voyage): "Most geographers would say that the earth is flat."

Clarification of the centuries-old confusion about probability and frequency will not be achieved by taking votes; much less by quoting the philosophical writings of Leslie Ellis (1842). Rather, we must examine the mathematical facts concerning the rules of probability theory and the different sample spaces in which probabilities and frequencies are defined. We have seen, in the discussion following (B14) above, that anyone who glibly supposes that a probability in one space can be equated to a frequency in another, is assuming something which is not only not generally deducible from the principles of probability theory; it may stand in conflict with those principles.

There is no stranger experience than seeing printed criticisms which accuse one of saying the exact opposite of what he has said, explicitly and repeatedly. Thus my bewilderment at Rowlinson's statement that I reject "the methods used by Gibbs to establish the rules of statistical mechanics." I believe I can lay some claim to being the foremost current advocate and defender of Gibbs' methods! Anyone who takes the trouble to read Gibbs will see that, far from rejecting Gibbs' methods, I have adopted them enthusiastically and (thanks to the deeper understanding from Shannon) extended their range of application.

One of the major unsolved riddles of probability theory is: how to explain to another person exactly what is the problem being solved? It is well established that merely stating this in words does not suffice; repeatedly, starting with Laplace, writers have given the correct solution to a problem, only to have it attacked on the grounds that it is not the solution to some entirely different problem. This is at least the tenth time it has happened to me. As I tried to stress, the maximum-entropy solution (B13) describes the state of knowledge in which we are given the enumeration of the six possibilities, the mean value <i> = 4.5, and nothing else. But Rowlinson proceeds to introduce models with an urn containing seven white and three black balls (or a population of urns with varying contents) from which one makes various numbers of random draws with replacement. One expects that different problems will have different solutions.

In Rawlinson's Urn model, we perform Bernoulli trials fivetimes, with constant probability of success p = 0.7. Then thenumbers s of successes is in 0 $. s $. 5, and the expected numberis <s>=5xO.7=3.5. Setting <i>::s+l, we have 1.::;.i.::;.6,<i> "" 4.5, the conditions stated in my dice problem. Thus heoffers as a counter-proposal the binomial distribution

p_i' = \binom{5}{i-1} p^{i-1} (1-p)^{6-i}, \qquad 1 \le i \le 6        (B25)

These numbers are

\{p_1' \ldots p_6'\} = \{0.00243,\ 0.02835,\ 0.1323,\ 0.3087,\ 0.36015,\ 0.16807\}.        (B26)

and they yield an entropy S' = 1.413615, 0.2 unit lower than that of (B13). This lower entropy indicates that the urn model puts further constraints on the solution beyond that used in (B13). We see that these consist in the extreme values (i = 1, 6) receiving less probability than before (only one of

2^5 = 32 possible outcomes can lead to i = 1, while ten of them yield i = 3, etc.).
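As a numerical check on these entropies, here is a minimal sketch (Python; the variable names and the root-finding route are mine, not the paper's) that recovers the maximum-entropy distribution for <i> = 4.5 and compares its entropy with that of the urn counter-proposal (B25)-(B26):

    import numpy as np
    from math import comb
    from scipy.optimize import brentq

    faces = np.arange(1, 7)

    # Maximum-entropy assignment for a mean-value constraint: p_i proportional to x**i,
    # with x chosen so the mean comes out to 4.5 (cf. the solution (B13)).
    def mean_of(x):
        p = x**faces
        p = p / p.sum()
        return (faces * p).sum()

    x = brentq(lambda x: mean_of(x) - 4.5, 1.0001, 10.0)
    p_maxent = x**faces / (x**faces).sum()

    # Rowlinson's urn proposal, Eq. (B25): i = s + 1 with s ~ Binomial(5, 0.7)
    p_urn = np.array([comb(5, i - 1) * 0.7**(i - 1) * 0.3**(6 - i) for i in faces])

    H = lambda p: -(p * np.log(p)).sum()
    print(H(p_maxent), H(p_urn))   # roughly 1.614 and 1.414: the 0.2-unit gap discussed above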

Now if we knew that the experiment consisted of drawing five times from an urn with just the composition specified by Rowlinson, the result (B25) would indeed be the correct solution. But by what right does one assume this elaborate model structure when it is not given in the statement of the problem? One could, with equal right, assume any one of a hundred other specific models, leading to a hundred other counter-proposals. But it is just the point of the maximum-entropy principle that it achieves "objectivity" of our inferences, in the sense that we base our predictions only on the information that we do, in fact, have; and carefully avoid introducing any such gratuitous assumptions not warranted by our data. Any such assumption is far more likely to impose false constraints than to happen, by luck, onto an unknown correct one (which would be like guessing the combination to a safe).

At this point, Rowlinson says, "Those who favour the automatic use of the principle of maximum entropy would observe that the entropy of [our Eq. (B25)], 1.4136, is smaller than that of [B13], and so say that in proposing [B25] as a solution, 'information' has been assumed for which there is no justification." We do indeed say this, although Rowlinson simply rejects it out of hand without giving a reason. So to sustain our claim, let us calculate explicitly just how much Rowlinson's solution assumes without justification.

To clarify what is meant by "assuming information," suppose that an economist, Mr. A, is trying to forecast future price trends for some commodity. The condition of next week's market cannot be known with certainty, because it depends on intentions to buy or sell hidden in the minds of many different individuals. Evidently, a rational method of forecasting must somehow take account of all these unknown possibilities. Suppose that Mr. A's data are found to be equally compatible with 100 different possibilities. If he arbitrarily picked out 10 of these which happened to suit his fancy, and based his forecast only on them, ignoring the other 90, we should certainly consider that Mr. A was guilty of an egregious case of assuming information without justification. Our present problem is similar in concept, but quite different in numerical values.

We have stressed that, fundamentally, the maximum-entropy solution (B13) describes only a state of knowledge about a single trial, and is not an assertion about frequencies. But Rowlinson, as noted, also rejects this distinction and wants to judge the issue on the grounds of frequencies. Very well; let us now bring out the frequency connection that a maximum-entropy distribution does, after all, have (and which, incidentally,

was pointed out in my Brandeis lectures, from which Rowlinson got this dice problem).

In N tosses, a set of observed frequencies {g_i} ≡ {N_i/N} (called g to avoid collision with previous notation) can be realized in

W = \frac{N!}{(Ng_1)!\,(Ng_2)! \cdots (Ng_6)!}        (B27)

different ways. As we noted from Boltzmann's work, Eq. (A14), the Stirling approximation to the factorials yields an asymptotic formula

\log W \sim N S        (B28)

where

S = -\sum_{i=1}^{6} g_i \log g_i        (B29)

is the entropy of the observed frequency distribution. Given two different sets of frequencies {g_i} and {g_i'}, the ratio (number of ways g_i can be realized)/(number of ways g_i' can be realized) is given by the asymptotic formula

\frac{W}{W'} \sim A \exp[N(S - S')]\left\{1 + \frac{B}{N} + O(N^{-2})\right\}        (B30)

where

(B31)

(B32)

are independent of N, and represent corrections from the higher terms in the Stirling approximation. We write them down only to allay any doubts about the accuracy of the numbers to follow. In all cases considered here it is easily seen that they have no effect on our conclusions, and only the exponential factor matters.

Rowlinson mentions an experiment involving 20,000 throws of a die, to which we shall return later; but in the present comparison this leads to numbers beyond human comprehension. To keep the results more modest, let us assume only N = 1000 throws. If we take {g_i} as the maximum-entropy distribution (B13) and {g_i'} as Rowlinson's solution (B26), we find A ≈ 0.2, B = 50, S - S' = 0.200; and thus, with N = 1000,

\frac{W}{W'} \approx 1.52 \times 10^{86}        (B34)

Both distributions agree with the datum <i> = 4.5; but for every way in which Rowlinson's distribution can be realized, there are over 10^86 ways in which the maximum entropy distribution can be realized (the age of the universe is less than 10^18 seconds). It appears that information was indeed "assumed for which there is no justification."

This example should help to give us a proper respect for just what we are accomplishing when we maximize entropy. It shows the magnitude of the indiscretion we commit if we accept a distribution whose entropy is 0.2 unit less than the maximum value compatible with our data. In this example, to accept any distribution whose entropy is as much as 0.005 below the maximum value, would be to ignore over 99 percent of all possible ways in which the average <i> = 4.5 could be realized.
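The order of magnitude is easy to verify. A two-line check (Python; the prefactor A and the B/N correction of (B30) are left out, so only the exponential factor appears):

    import math
    N, dS = 1000, 0.200
    print(math.exp(N * dS))   # about 7e86; multiplying by A ~ 0.2 gives the ~1.5e86 quoted above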

For reasons unexplained, Rowlinson seizes upon the particular value p_1 = 0.05435 from the maximum-entropy solution (B13), and asks: "But what basis is there for trusting in this last number?" but fails to ask the same question about his own very different result p_1' = 0.00243. Since it is so seldom that one is able to give a quantitative reply to a rhetorical question, we should not pass up this opportunity.

Answer to the Rhetorical Question. Let us, as before, count up the number of possibilities compatible with the given data. In the original problem we were to find {p_1 ... p_n} so as to maximize H = -∑ p_i log p_i subject to the constraints ∑ p_i = 1, <i> = ∑ i p_i, a specified numerical value. If now we impose the additional constraint that p_1 is specified, we can define conditional probabilities

p_i' = \frac{p_i}{1 - p_1}, \qquad i = 2, 3, \ldots, n        (B35)

with entropy

H' = -\sum_{i=2}^{n} p_i' \log p_i'        (B36)

These quantities are related by Shannon's basic functional equation

H(p_1, \ldots, p_n) = H(p_1, 1-p_1) + (1 - p_1)\, H'(p_2', \ldots, p_n')        (B37)

and so, maximizing H with p_1 held fixed is equivalent to maximizing H'. We have the reduced maximum entropy problem:


maximize H' subject to

\sum_{i=2}^{n} p_i' = 1        (B38)

\langle i\rangle' = \sum_{i=2}^{n} i\, p_i' = 1 + \frac{\langle i\rangle - 1}{1 - p_1}        (B39)

The solution proceeds as before, but now the maximum attainable entropy is a function H_max = S(p_1, <i>) of the specified value of p_1 as well as <i>. The maximum of S(p_1, 4.5) is of course the previous value (B14) of 1.61358, attained at the maximum-entropy value p_1 = 0.05435. Evaluating this also for Rowlinson's p_1', I find S(p_1', 4.5) = 1.55716, lower by 0.05642 units. By (B30) this means that, in 1000 tosses, for every way in which Rowlinson's value could be realized, regardless of all other frequencies except for the constraint <i> = 4.5, there are over 10^24 ways in which the maximum-entropy frequency could be realized.
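The function S(p_1, 4.5) is easy to evaluate numerically. A sketch (Python; not the closed-form route the text follows, and the helper names are mine):

    import numpy as np
    from scipy.optimize import brentq

    def S_given_p1(p1, mean=4.5):
        # Maximum attainable entropy when p1 is held fixed, cf. the reduced problem (B35)-(B39)
        rest = np.arange(2, 7)
        def mean_err(lam):
            w = np.exp(-lam * rest)
            q = w / w.sum()                      # conditional probabilities p_i' of (B35)
            return p1 + (1 - p1) * (rest * q).sum() - mean
        lam = brentq(mean_err, -10, 10)
        w = np.exp(-lam * rest)
        p = np.concatenate(([p1], (1 - p1) * w / w.sum()))
        return -(p * np.log(p)).sum()

    print(S_given_p1(0.05435))   # about 1.614, the unconstrained maximum (B14)
    print(S_given_p1(0.00243))   # about 1.557, roughly 0.056 lower, as quoted above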

We may give a more detailed answer: expanding S(p_1, 4.5) about its peak, we find that as we depart from 0.05435, the number of ways in which the frequency g_1 could be realized drops off like

        (B40)

and so, for example, for 99% of all possible ways in which the average <i> = 4.5 can be realized, g_1 lies in the interval (0.05435 ± 0.0153).

This would seem to be an adequate answer to the question, "But what basis is there for trusting in this number?" I stress that the numerical results just given are theorems, involving only a straightforward counting of the possibilities allowed by the given data. Therefore they stand independently of anybody's personal opinions about either dice or probability theory.

However, it is necessary that we understand very clearly the meaning of these frequency connections. They concern only the number of possible ways in which certain frequencies {g_i} could be realized, compatible with our constraints. They do not assert that the maximum-entropy frequencies will be observed in a real experiment; indeed, neither the Principle of Maximum Entropy nor any other principle of probability theory can predict with certainty what will happen in a real experiment. The correct statement is rather: the frequency distribution {g_i} with maximum entropy calculated from certain constraints is overwhelmingly the most likely one to be observed in a real

experiment, provided that the physical constraints operative in the experiment are the same as those assumed in the calculation.

In our mathematical formalism, a "constraint" is some piece of information that leads us to modify a probability distribution; in the case of a mean value constraint, by inserting an exponential factor exp[-λf(x)] with an adjustable Lagrange multiplier λ. It is perhaps not yet clear just what we mean by "constraints" in a physical experiment. Of course, by these we do not mean the gross constraining linkages by levers, cables, and gears of a mechanics textbook, but something more subtle. In our applications, a "physical constraint" is any physical influence that exerts a systematic tendency--however slight--on the outcome of an experiment. We give some specific examples of physical constraints in die tossing below.

From the above numbers we can understand the success of the work of J. C. Keck and R. D. Levine reported here. I am sure that their results must seem like pure magic to those who have not understood the maximum-entropy formalism. To find a distribution of populations over 20 molecular energy levels might seem to require 19 independent pieces of data. But if one knows, from approximate rate coefficients or from past experience, which constraints exist (in practice, even if only the one or two most important ones are taken into account), one can make quite confident predictions of distributions over many levels simply by maximizing the entropy.
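To make that concrete, here is a hypothetical miniature of such a calculation (Python; the level energies and the measured mean are invented for illustration, not taken from Keck and Levine):

    import numpy as np
    from scipy.optimize import brentq

    levels = np.arange(20)            # hypothetical level energies, arbitrary units
    target_mean = 6.0                 # hypothetical measured mean energy

    def mean_for(lam):
        w = np.exp(-lam * levels)
        p = w / w.sum()
        return (levels * p).sum()

    lam = brentq(lambda L: mean_for(L) - target_mean, -5.0, 5.0)
    p = np.exp(-lam * levels); p /= p.sum()
    # One Lagrange multiplier fixes all 20 populations; any further known
    # constraint would simply contribute another exponential factor.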

In fact, most frequency distributions produced in real experiments are maximum-entropy distributions, simply because these can be realized in so many more ways than can any other. As N → ∞, the combinatorial factors become so sharply peaked at the maximum entropy point that to produce any appreciably different distribution would require very effective physical constraints. Any statistically significant departure from a maximum-entropy prediction then constitutes strong--and if it persists, conclusive--evidence of the existence of new constraints that were not taken into account in the calculation. Thus the maximum-entropy formalism has the further "magical" property that it provides the most efficient procedure by which, if unknown constraints exist, they can be discovered. But this is only an updated version of the process noted in Section A by which Laplace discovered new systematic effects.

It is, perhaps, sufficiently clear from this how much a Physical Chemist has to gain by understanding, rather than attacking, maximum entropy methods.

But we still have not dealt with the most fundamental misunderstandings in the Rowlinson article. He turns next to the shape of the maximum-entropy distribution (B13), with another

rhetorical question: "... is there anything in the mechanics of throwing dice which suggests that if a die is not true the probabilities of scores 1, 2, ..., 6 should form the geometrical progression [our Eq. (B13)]?" He then cites some data of Wolf on 20,000 throws of a die which gave an average <i> = 3.5983, plots the observed frequencies against the maximum-entropy distribution based on that constraint, and concludes that "departures from the random value of 1/6 bear no resemblance to those calculated from the rule of maximum entropy. What is clearly wrong with the indiscriminate use of this rule, and of the older rules from which it stems, is that they ignore the physics of the problem."

We have here a total, absolute misconception about every point I have been trying to explain above. If Wolf's data depart significantly from the maximum-entropy distribution based only on the constraint <i> = 3.5983, then the proper conclusion is not that maximum entropy methods "ignore the physics" but rather that the maximum entropy method brings out the physics by showing us that another physical constraint exists beyond that used in the calculation. Unable to see the new physical information here revealed, he lashes out blindly against the principle that has revealed it.

Therefore, let us now give an analysis of Wolf's dice data showing just what things maximum entropy can give us here, if we only open our eyes to them.

Wolf's Dice Data. In the period roughly 1850-1890, the Zurich astronomer R. Wolf conducted and reported a mass of "random experiments." An account is given by Czuber (1908). Our present concern is with a particular die (identified as "Weiszer Würfel" in Czuber's two-way table, loc. cit., p. 149) that was tossed 20,000 times and yielded the aforementioned mean value <i> = 3.5983. We shall look at all details of the data presently, but first let us note a few elementary things about that "ignored" physics.

We all feel intuitively that a perfectly symmetrical die, fairly tossed, ought to show all faces equally often (but that statement is really circular, since there is no other way to define a "fair" method of tossing; so, suppose that by experimenting on a die known to be true, we have found such a fair method, and we continue to use it). The uniform frequency distribution {g_i = 1/6, 1 ≤ i ≤ 6} then represents the nominal "unconstrained" situation of maximum possible entropy S = log 6. Any imperfection in the die may then give rise to a "physical constraint" as we have defined that term. A little physical common sense can anticipate what these imperfections are likely to be.


The most obvious imperfection is that different faces have different numbers of spots. This affects the center of gravity, because the weight of ivory removed from a spot is obviously not (in any die I have seen) compensated by the paint then applied. Now the numbers of spots on opposite faces add up to seven. Thus the center of gravity is moved toward the "3" face, away from "4", by a small distance ε corresponding to the one-spot discrepancy. The effect of this must be a slight frequency difference which is surely, for very small ε, proportional to ε:

g_4 - g_3 = \alpha\varepsilon        (B41)

where the coefficient α would be very difficult to calculate, but could be measured by experiments on dice with known ε. But the (2-5) face direction has a discrepancy of three spots, and (1-6) of five. Therefore, we anticipate the ratios:

(g_4 - g_3) : (g_5 - g_2) : (g_6 - g_1) = 1 : 3 : 5        (B42)

But this says only that the spot frequencies vary linearly with i:

g_i = \frac{1}{6} + \alpha\varepsilon\, f_1(i), \qquad 1 \le i \le 6        (B43)

where

f_1(i) \equiv i - 3.5        (B44)

The spot imperfections should then lead to a small linear skewing favoring the "6." This is the most obvious "physical constraint," and it changes the expected number of spots to

\langle i\rangle = \sum_{i=1}^{6} i\, g_i = 3.5 + 17.5\,\alpha\varepsilon        (B45)

or, to state it more suggestively, the function f_1(i) acquires a non-zero expectation

\langle f_1\rangle = 17.5\,\alpha\varepsilon        (B46)

Now, what is the next most obvious imperfection to be ex­pected? Evidently, it will involve departure from a perfectcube, the specific kind depending on the manufacturing methods;but let us consider only the highest quality die that afactory would be likely to make. If you were assigned the jobof making a perfect cube of ivory, how would you do it withequipment likely to be available in a Physics Department shopor a small factory?


I think you would head for the milling machine, and mount your lump of ivory on a dividing head clamped to the work table with axis vertical. The first cut would be with an end mill, making the "top" face of the die. The construction of the machine guarantees that this will be accurately plane. Then you use side cutters to make the four side faces. For the finish cuts you will move the work table only in the direction of the cut, rotating the dividing head 90° from one face to the next. The accuracy of the equipment guarantees that you now have five of the faces of your cube, all very accurately plane and all angles accurately 90°, the top face accurately square.

But now the trouble begins; to make the final "bottom" face you have to remove the work from its mount, place it upside down on the table, and go over it with the end mill. Again, the construction of the machine guarantees that this final face will be accurately plane and parallel to the "top"; but it will be practically impossible to adjust the work table height so accurately that the final dimension is exactly equal to the other two. Of course, a skilled artisan with a great deal more time and equipment could do better; but this would run up the cost of manufacture for something that would never be detected in use. For factory production, there would be no motivation to do better than we have described.

Thus, the most likely geometrical imperfection in a highquality die is not lack of parallelism or of true 90° angles,but rather that one dimension will be slightly different fromthe other two.

Again, it is clear what kind of effect this will have on frequencies. Suppose the die comes out slightly "oblate," the (1-6) dimension being shorter than the (2-5) and (3-4) by some small amount δ. If the die were otherwise perfect, this would evidently increase the frequencies g_1, g_6 by some small amount βδ, and decrease the other four to keep the sum equal to unity, where β is another coefficient hard to calculate but measurable.

The result can be stated thus: the function

f_3(i) = \begin{cases} +2, & i = 1, 6 \\ -1, & i = 2, 3, 4, 5 \end{cases}        (B47)

defined on the sample space, acquires a non-zero expectation

\langle f_3\rangle = 6\,\beta\delta        (B48)

and the frequencies are

g_i = \frac{1}{6} + \frac{1}{2}\,\beta\delta\, f_3(i) = \frac{1}{6}\left[1 + 3\beta\delta\, f_3(i)\right]        (B49)


If now both imperfections are present, since the perturbations are so small we can in first approximation just superpose their effects:

g_i = \frac{1}{6}\left[1 + 6\alpha\varepsilon\, f_1(i)\right]\left[1 + 3\beta\delta\, f_3(i)\right]        (B50)

But this is hardly different from

g_i = \frac{1}{6}\exp\left[6\alpha\varepsilon\, f_1(i) + 3\beta\delta\, f_3(i)\right]        (B51)

and so a few elementary physical common-sense arguments haveled us to something which begins to look familiar.

If we had done maximum entropy using the constraints (B46), (B48), we would find a distribution proportional to exp[-λ_1 f_1(i) - λ_3 f_3(i)], so that (B51) is a maximum-entropy distribution based on those constraints. We see that the Lagrange multiplier by which any information constraint is coupled into our probability distribution, is just a measure of the strength of the physical constraint required to realize a numerically equal frequency distribution:

\lambda_1 = -6\,\alpha\varepsilon        (B52)

\lambda_3 = -3\,\beta\delta        (B53)

and if our die has no other imperfections beyond the two noted, then it is overwhelmingly more likely to produce the distribution (B51) than any other.

If the observed frequencies show any statistically significant departure from (B51), then we have extracted from the data evidence of a third imperfection, which probably would have been totally invisible in the raw data; i.e., only when we have used the maximum entropy principle to "subtract off" the effect of the stronger influences, can we hope to detect a weaker one.

Our program for the maximum-entropy analysis of the die--or any other random experiment--is now defined except for the final step: how do we decide whether a discrepancy is "statistically significant"?

The reader is cautioned that in all this discussion relating to Rowlinson we are being careless about distinctions between probability and frequency, because Rowlinson himself makes no distinction between them, and trying to correct this at every point quickly became tedious. The following analysis should be restated much more carefully to bring out the fact that it is only a very special case, although to the "frequentist" it appears to be the general case.


We have some "null hypothesis" H_0 about our die, that leads us to assign the probabilities {p_1 ... p_6}. We obtain data from N tosses, in which the observed frequencies are {g_i = N_i/N, 1 ≤ i ≤ 6}. If the numbers {g_1 ... g_6} are sufficiently close to {p_1 ... p_6} we shall say the fit is satisfactory; the null hypothesis is consistent with our data, and so there is no need, as far as this experiment indicates, to seek a better hypothesis. But how close is "close"? How do we measure the "distance" between the two distributions; and how large may that distance be before we begin to doubt the null hypothesis?

Early in this century, Karl Pearson invented an intuitive, ad hoc procedure, called the Chi-squared test, to deal with this problem, which has since been widely adopted. Here we calculate the quantity

\chi^2 = N \sum_{i=1}^{6} \frac{(g_i - p_i)^2}{p_i}        (B54)

and if it is greater than a certain "critical value" given in Tables, we reject the null hypothesis. In the present case (six categories, five "degrees of freedom" after normalization), the critical value at the conventional 5% significance level is

\chi_c^2 = 11.07        (B55)

which means that, if the null hypothesis is true, there is only a 5% chance of seeing a value greater than χ_c². The critical value is independent of N, because for a frequentist who believes that p_i is an assertion of a limiting frequency in the sense of the de Moivre-Laplace limit theorem (A4), if H_0 is true, then the deviations should fall off as |g_i - p_i| = O(N^{-1/2}). A more careful approach shows that this holds only if our model is an exchangeable sequence with zero correlations; and even in this case the χ² criterion of "closeness" has no theoretical justification (i.e., no uniqueness property) in the basic principles of probability theory.

In fact, for the case of independent exchangeable trials, there is a criterion with a direct information-theory justification (Kullback, 1959) in the "minimum discrimination information statistic"

\psi = 2N \sum_{i=1}^{6} g_i \log(g_i/p_i)        (B56)

and the numerical value of ψ, rather than χ², will lead us to inferences directly justifiable by Bayes' theorem. If the deviations (g_i - p_i) are large, these criteria can be very different.


However, by a lucky mathematical accident, if the deviations are small (as we already know them to be for our dice problem) an expansion in powers of (g_i - p_i) [in the logarithm, write g/p = 1 + (g-p)/g + (g-p)²/gp] yields

\psi = \chi^2 + O(N^{-1/2})        (B57)

the neglected terms falling off as indicated, provided that |g_i - p_i| = O(N^{-1/2}). The result is that in our problem, from a pragmatic standpoint it doesn't matter whether we use χ² or ψ.

So I shall apply the X2 test to Wolf's data, because it is somuch more familiar to most people.

Wolf's empirical frequencies {g_i} are given in the second column of Table 1. As a first orientation, let us test them against the null hypothesis {H_0: p_i = 1/6, 1 ≤ i ≤ 6} of a uniform die. We find the result

\chi_0^2 = 271        (B58)

over twenty times the critical value (B55). The hypothesis H_0 is decisively rejected.

Next, let us follow Rowlinson by considering a new hypothesis H_1 which prescribes the maximum-entropy solution based on Wolf's average <i> = 3.5983, or

F_1 \equiv \langle f_1\rangle = \langle i\rangle - 3.5 = 0.0983        (B59)

This will give us a distribution p_i ∝ exp[-λ f_1(i)]. From the partition function (B11) with this new datum we find λ = 0.03373 and the probabilities given in the third column of Table 1. The fourth column gives the differences Δ_i = g_i - p_i, while in the fifth we list the partial contributions to Chi-squared:

c_i = 20{,}000\,\frac{(g_i - p_i)^2}{p_i}

which add up to the value

\chi_1^2 = 199.4        (B60)

The fit is improved only slightly; and H_1 is also decisively rejected.


Table 1. One Constraint

  i      g_i        p_i        Δ_i       c_i
  1    0.16230    0.15294    +0.0094    11.46
  2    0.17245    0.15818    +0.0143    25.75
  3    0.14485    0.16361    -0.0188    43.02
  4    0.14205    0.16922    -0.0272    87.25
  5    0.18175    0.17502    +0.0067     5.18
  6    0.19660    0.18103    +0.0156    26.78
                                       199.43
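The single-constraint fit and both Chi-squared values are easy to reproduce. A sketch (Python; the root-finding route is mine, the frequencies are Wolf's from Table 1):

    import numpy as np
    from scipy.optimize import brentq

    N = 20000
    faces = np.arange(1, 7)
    g = np.array([0.16230, 0.17245, 0.14485, 0.14205, 0.18175, 0.19660])  # Wolf's frequencies

    chi2 = lambda p: N * ((g - p)**2 / p).sum()
    print(chi2(np.full(6, 1/6)))            # about 271, Eq. (B58)

    # H1: maximum entropy subject only to the measured mean, p_i proportional to exp(-lam*i)
    mean = (faces * g).sum()                # 3.5983
    def mean_err(lam):
        w = np.exp(-lam * faces)
        return (faces * w).sum() / w.sum() - mean
    lam = brentq(mean_err, -1.0, 1.0)       # magnitude about 0.0337
    p1 = np.exp(-lam * faces); p1 /= p1.sum()
    print(chi2(p1))                         # about 199, Eq. (B60)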

At this point, Rowlinson wants to reject not only H_1, but also the whole principle of maximum entropy. But now I stress still another time what the principle is really telling us: a statistically significant deviation is evidence of a new physical constraint; and the nature of the deviation gives us a clue as to what that constraint is. After subtracting off, by maximum entropy, the deviation attributable to the first constraint, the nature of the most important remaining one is revealed. Indeed, from a glance at the deviations Δ_i = g_i - p_i the answer leaps out at us; Wolf's die was slightly "prolate," the (3-4) dimension being greater than the (2-5) and (1-6) ones. So, instead of (B47), the new constraint is

f_2(i) = \begin{cases} +1, & i = 1, 2, 5, 6 \\ -2, & i = 3, 4 \end{cases}        (B61)

and Wolf's data yield the result

F_2 \equiv \langle f_2\rangle = 0.1393        (B62)

So now let us subtract off, by maximum entropy, the effect of both of these constraints; and thus discover whether Wolf's die had a third imperfection.

With the two constraints (B59), (B62) we have two Lagrange multipliers and a partition function

Z(\lambda_1, \lambda_2) = \sum_{i=1}^{6} \exp[-\lambda_1 f_1(i) - \lambda_2 f_2(i)]
                = x^{-5/2}\, y\,(1 + x)(1 + x^2 y^{-3} + x^4)        (B63)

where x = exp(-λ_1), y = exp(-λ_2). The maximum-entropy probabilities are then

p_i \propto y\,\{1,\ x,\ x^2 y^{-3},\ x^3 y^{-3},\ x^4,\ x^5\}        (B64)


Writing out the constraint equations (B4) and eliminating y from them, we find that x is determined by

(6F_1 - 4F_2 - 11)x^5 + (6F_1 - 4F_2 - 5)x^4 + (6F_1 + 4F_2 + 5)x + (6F_1 + 4F_2 + 11) = 0        (B65)

or, with Wolf's numerical values (B59), (B62),

5.4837\,x^5 + 2.4837\,x^4 - 3.0735\,x - 6.0735 = 0        (B66)

This has only one real root, at x = 1.032233, from which we have λ_1 = -0.0317244, y = 1.074415, λ_2 = -0.0717764. The new maximum-entropy probabilities are given in Table 2, which contains the same information as Table 1, but for the new hypothesis H_2.

Table 2. Two Constraints

  i      g_i        p_i        Δ_i       c_i
  1    0.16230    0.16433    -0.0020    0.502
  2    0.17245    0.16963    +0.0028    0.938
  3    0.14485    0.14117    +0.0037    1.919
  4    0.14205    0.14573    -0.0037    1.859
  5    0.18175    0.18656    -0.0048    2.480
  6    0.19660    0.19258    +0.0040    1.678
                                        9.375

We see that the second constraint has greatly improved the fit. Chi-squared has been reduced to

\chi_2^2 = 9.375        (B67)

This is less than the critical value 11.07, so there is now no statistically significant evidence for any further imperfections; i.e., if the given p_i were the "exact" values, it is reasonably likely that the distribution g_i would deviate from p_i by the observed amount, by chance alone. Or, to put it in a way perhaps more appropriate to this problem, if the die were tossed another 20,000 times, we would not expect the frequencies g_i to be repeated exactly; the new frequencies g_i' might reasonably be expected to deviate from the first set g_i by about as much as the distributions g_i, p_i differ.
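As a check, the two-constraint fit can also be obtained by direct numerical root-finding instead of the quintic (B66). A sketch (Python; the function and variable names are mine):

    import numpy as np
    from scipy.optimize import fsolve

    N = 20000
    faces = np.arange(1, 7)
    g = np.array([0.16230, 0.17245, 0.14485, 0.14205, 0.18175, 0.19660])
    f1 = faces - 3.5                                    # Eq. (B44)
    f2 = np.where((faces == 3) | (faces == 4), -2, 1)   # Eq. (B61)
    F1, F2 = (f1 * g).sum(), (f2 * g).sum()             # 0.0983 and 0.1393

    def probs(lams):
        w = np.exp(-lams[0] * f1 - lams[1] * f2)
        return w / w.sum()

    def constraint_err(lams):
        p = probs(lams)
        return [(f1 * p).sum() - F1, (f2 * p).sum() - F2]

    lams = fsolve(constraint_err, [0.0, 0.0])           # about (-0.0317, -0.0718)
    p2 = probs(lams)
    print(N * ((g - p2)**2 / p2).sum())                 # about 9.4, Eq. (B67)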

That this is reasonable can be seen directly without calculating Chi-squared. For if the result i is obtained n_i times in N tosses, we might expect this to fluctuate in successive repetitions of the whole experiment by about ±√n_i. Thus the


observed frequencies g_i = n_i/N should fluctuate by about Δg_i ≈ ±√n_i/N; for g_i ≈ 1/6, N = 20,000, this gives Δg_i ≈ 0.0029. But this is just of the order of the observed deviations Δ_i. Therefore, it would be futile to search the {Δ_i} of Table 2 for a third imperfection. Not only does their distribution fail to suggest any simple hypothesis; if the die were tossed another 20,000 times, in all probability the new Δ_i would be entirely different. With our two-parameter hypothesis H_2 we are down "in the noise" of random variations, and any further systematic influences are too small to be seen unless we go up to a million tosses, by which time the die will be changed anyway by wear.

A technical point might be raised by Statisticians: "You have estimated two parameters λ_1, λ_2 from the data; therefore you should use the test for three degrees of freedom rather than five." This reduction is appropriate if the parameters are chosen by the criterion of minimizing χ². That is, if we choose them for the express purpose of making χ² small and still fail to do so, it does not speak well for the hypothesis and a penalty is in order. But our parameters were chosen by a criterion that took no note of χ²; and therefore the proper question is only: "How well does the result fit the data?" and not: "How did you find the parameters?" Had we chosen our parameters to minimize χ², we would have found a still lower value; but one that is not relevant to the point being made here, which is the performance of the maximum entropy criterion, as advocated long before this die problem was thought of.

The maximum entropy method with two Lagrange multipliers thus successfully determines a distribution with five independent quantities. The "ensemble" canonical with respect to the constraints f_1(i), f_2(i) describing the two imperfections that common sense leads us to expect in a die, agrees with Wolf's data about as well as can be hoped for in a statistical problem.

It was stressed above that in this theory the connections between probability and frequency are loosened and we noted, in the discussion following (B40), that the connections remaining are now theorems rather than conjectures. As we now see, they are not loosened enough to hamper us in dealing with real random experiments. If we had been given only the two constraints (B59), (B62) we could have reproduced, by maximum entropy, all of Wolf's frequency data.

This is an interesting caricature of the results of Keck and Levine, and shows again how much our critics would gain by understanding, rather than attacking, this principle. Far from "ignoring the physics," it leads us to concentrate our attention on the part of the physics that is relevant. Success in using


it does not require that we take into account all dynamical details; it is enough if we can recognize, whether by common-sense analysis or by inspection of data, what are the systematic influences at work, that represent the "physical constraints." If by any means we can recognize these, maximum entropy then takes over and supplies the rest of the solution, which does not depend on dynamical details but only on counting the possibilities.

In effect, then, by subtracting off the systematic effects we reduce the problem to Bernoulli's "equally possible" cases; the deviations Δ_i from the canonical distribution that remain in Table 2 are the same as the deviations from p_i = 1/6 that we would expect if the die had no imperfections.

Success of our predictions is not guaranteed in advance, as Rowlinson supposed it should be when he wanted to reject the entire principle at the stage of Table 1. But this supposition merely reflects his rejection, at the very outset, of the distinction between probability and frequency that I keep stressing. If one is not moved by theoretical arguments for that distinction, we now see a pragmatic reason for it. The probabilities p_i in Table 1 are an entirely correct description of our state of knowledge about a single toss, when we know about only the constraint f_1(i). It is a theorem that they are also numerically equal to the frequencies which could happen in the greatest number of ways if no other physical constraint existed. But our probabilities will agree with measured frequencies only when we have recognized and put into our equations the constraints representing all the systematic influences at work in the real experiment.

This, I submit, is exactly as it should be in a statistical theory; at no point are we ever justified in claiming that our predictions must be right; only that, in order to make any better ones we should need more information than was given. It is when a theory purports to do more than this (by failing to recognize the distinction between probability and frequency) that it may be charged with promising us something for nothing.

Since the fit is now satisfactory, the above values of λ_1, λ_2 give us the numerical values of the systematic influences in Wolf's experiment: from (B52), (B53) we have

\alpha\varepsilon = \frac{0.031726}{6} = 0.0053        (B68)

\beta\delta = \frac{0.071783}{3} = 0.024        (B69)

So, if today some enterprising person at Monte Carlo or Las Vegas will undertake to measure for us the coefficients α, β, then we can determine--100 years after the fact--just how far


(in terms of its nominal dimensions) the center of gravity of Wolf's die was displaced (presumably by excavation of the spots), and how much longer it was in the (3-4) direction than in the (2-5) or (1-6). We can also certify that it had no other significant imperfections (at least, none that affected its frequencies). Note, however, that α, β are not, strictly speaking, physical constants only of the die; a little further common-sense reasoning makes it clear that they must depend also on how the die was tossed; for example, tossing it with a large angular momentum about a (3-4) axis will decrease the effect of the f_1(i) constraint, while if it spins about the (1-6) axis the effect of f_2(i) will be less; and with a (2-5) spin axis both constraints will be weakened.

Indeed, as soon as the die is unsymmetrical, all sorts of physical conditions that were irrelevant for a perfectly symmetrical one, become relevant. The frequencies will surely depend not only on its center of gravity but also on all the second moments of its mass distribution, the sharpness of its edges, the smoothness, elasticity, and coefficient of friction of the table, etc.

However, we conjecture that α, β depend very little on these factors within the small range of conditions usually employed (i.e., small angular momentum in tossing, etc.); and suspect that in that range the coefficient α is already well known to those who deal with loaded dice.

I really must thank Rowlinson for giving us (albeit unintentionally) such a magnificent test case by which the nature and power of the Principle of Maximum Entropy can be demonstrated, in a context entirely removed from the conceptual problems of quantum theory. And indeed, all the criticisms he made were richly deserved; for he was not, after all, criticizing the Principle of Maximum Entropy, only a gross misunderstanding of it. Rowlinson's criticisms were, however, taken up and extended by Lindhard (1974); in view of the long commentary above we may leave it as an exercise for the reader to deal with his arguments.

The Constraint Rule. There is a further point of logic about our use of maximum entropy that has troubled some who are able to see the distinction between probability and frequency. In imposing the mean-value constraint (B1) we are simply appropriating a sample average obtained from N measurements that yielded f_j on the j'th observation:

F = \bar{f} = \frac{1}{N}\sum_{j=1}^{N} f_j        (B70)

and equating it to a probability average


\langle f\rangle = \sum_{i=1}^{n} p_i\, f(x_i)        (B71)

Is there not an element of arbitrariness about this? A cynic might say that after all these exhortations about the distinction between probability and frequency, we proceed to confuse them after all, by using the word "average" in two quite different senses.

Our rule can be justified in more than one way; in Section D below we argue in terms of what it means to say that certain information is "contained" in a probability distribution. Let us ask now whether the constraint rule (B1) is consistent with, or derivable from, the usual principles of Bayesian inference.

If we decide to use maximum entropy based on expectations of certain specified functions {f_1(x) ... f_m(x)}, then we know in advance that our final distribution will have the mathematical form

p(x_i|H) = \frac{1}{Z(\lambda_1 \ldots \lambda_m)}\exp[-\lambda_1 f_1(x_i) - \cdots - \lambda_m f_m(x_i)]        (B72)

and nothing prevents us from thinking of this as defining a class of sampling distributions parameterized by the Lagrange multipliers λ_k, the parameter space consisting of all values of {λ_1 ... λ_m} which lead to normalizable distributions (B72). Choosing a specific distribution from this class is then equivalent to making an estimate of the parameters λ_k. But parameter estimation is a standard problem of statistical inference.

The class C of hypotheses being considered is thus specified; any particular choice of the {λ_1 ... λ_m} may be regarded as defining a particular hypothesis H ∈ C. However, the class C does not determine any particular choice of the functions {f_1(x), ..., f_m(x)}. For, if A is any nonsingular (m × m) matrix, we can carry out a linear transformation

\sum_{k=1}^{m} \lambda_k f_k(x) = \sum_{j=1}^{m} \lambda_j^* f_j^*(x)        (B73)

where

\lambda_j^* = \sum_{k=1}^{m} \lambda_k A_{kj}        (B74a)

f_j^*(x) = \sum_{k=1}^{m} (A^{-1})_{jk}\, f_k(x)        (B74b)

and the class of distributions (B72) can be written equally well as


p(x_i|H) = \frac{1}{Z(\lambda_1^* \ldots \lambda_m^*)}\exp\{-\lambda_1^* f_1^*(x_i) - \cdots - \lambda_m^* f_m^*(x_i)\}        (B75)

As the {λ_1^* ... λ_m^*} vary over their range, we generate exactly the same family of probability distributions as (B72). The class C is therefore characteristic, not of any particular choice of the {f_1(x) ... f_m(x)}, but of the linear manifold M(C) spanned by them.

If the f_k(x) are linearly independent, the manifold M(C) has dimensionality m. Otherwise, M(C) is of some lower dimensionality m' < m; the set of functions {f_1(x) ... f_m(x)} is then redundant, in the sense that at least one of them could be removed without changing the class C. While the presence of redundant functions f_k(x) proves to be harmless in that it does not affect the actual results of entropy maximization (Jaynes, 1968), it is a nuisance for present purposes [Eq. (B81) below]. In the following we assume that any redundant functions have been removed, so that m' = m.

Suppose now that x_i is the result of some random experiment that has been repeated r times, and we have obtained the data

D ≡ {x_1 true r_1 times, x_2 true r_2 times, ..., x_n true r_n times}        (B76)

Of course, ∑ r_i = r. Out of all hypotheses H ∈ C, which is most strongly supported by the data D according to the Bayesian, or likelihood, criterion? To answer this, choose any particular hypothesis H_0 ≡ {λ_1^(0) ... λ_m^(0)} as the "null hypothesis" and test it against any other hypothesis H ≡ {λ_1 ... λ_m} in C by Bayes' theorem. The log-likelihood ratio in favor of H over H_0 is

\log\frac{p(D|H)}{p(D|H_0)} = \sum_{i=1}^{n} r_i \log\frac{p(x_i|H)}{p(x_i|H_0)} = -r\left[\log\frac{Z(\lambda_1 \ldots \lambda_m)}{Z(\lambda_1^{(0)} \ldots \lambda_m^{(0)})} + \sum_{k=1}^{m}(\lambda_k - \lambda_k^{(0)})\,\bar{f}_k\right]        (B77)

where

\bar{f}_k = \frac{1}{r}\sum_{i=1}^{n} r_i\, f_k(x_i)        (B78)

is the measured average of f_k(x), as found in the experiment. Out of all hypotheses in class C the one most strongly supported by the data D is the one for which the first variation vanishes:


\delta\log\frac{p(D|H)}{p(D|H_0)} = -r\sum_{k=1}^{m}\left[\frac{\partial}{\partial\lambda_k}\log Z + \bar{f}_k\right]\delta\lambda_k = 0        (B79)

But from (B4), this yields just our constraint rule (B1):

\langle f_k\rangle = \bar{f}_k, \qquad 1 \le k \le m        (B80)

To show that this yields a true maximum, form the second variation and note that the covariance matrix

K_{jk} = \langle f_j f_k\rangle - \langle f_j\rangle\langle f_k\rangle        (B81)

is positive definite almost everywhere on the parameter space if the f_k(x) are linearly independent.

Evidently, this result is invariant under the aforementioned linear transformations (B74); i.e., we shall be led to the same final distribution satisfying (B80) however the f_k(x) are defined. Therefore, we can state our conclusion as follows:

    Out of all hypotheses in class C, the data D support most strongly that one for which the expectation <f(x)> is equal to the measured average \bar{f}(x) for every function f(x) in the linear manifold M(C).        (B82)

This appears to the writer as a rather complete answer to some objections that have been raised to the constraint rule. We are not, after all, confusing two averages; it is a derivable consequence of probability theory that we should set them equal. Maximizing the entropy subject to the constraints (B80), is equivalent to (i.e., it leads to the same result as) maximizing the likelihood over the manifold of sampling distributions picked out by maximum entropy.
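A small numerical illustration of this equivalence (Python; the sample space, constraint function, and data are invented for the purpose): maximizing the likelihood over the one-parameter family (B72) lands on the distribution whose expectation matches the sample average, as (B80) asserts.

    import numpy as np
    from scipy.optimize import minimize

    x = np.arange(1, 7)
    f = x.astype(float)                     # a single constraint function f(x) = x
    rng = np.random.default_rng(0)
    data = rng.choice(x, size=500, p=np.array([1, 1, 2, 2, 3, 3]) / 12.0)

    def neg_log_like(lam):
        logZ = np.log(np.exp(-lam[0] * f).sum())
        return len(data) * logZ + lam[0] * data.sum()

    lam_hat = minimize(neg_log_like, [0.0]).x[0]
    p = np.exp(-lam_hat * f); p /= p.sum()
    print((f * p).sum(), data.mean())       # the two averages agree to rounding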

Forney's Question. An interesting question related to this was put to me by G. David Forney in 1963. The procedure (B1) uses only the numerical value of F, and it seems to make no difference whether this was a measured average over 20 observations, or 20,000. Yet there is surely a difference in our state of knowledge--our degree of confidence in the accuracy of F--that depends on N. The maximum-entropy method seems to ignore this. Shouldn't our final distribution depend on N as well as F?

It is better to answer a question 15 years late than not atall. We can do this on both the philosophical and the technicallevel. Philosophically, we are back to the question: "What isthe specific problem being solved?"


In the problem I am considering, F is simply a number given to us in the statement of the problem. Within the context of that problem, F is exact by definition and it makes no difference how it was obtained. It might, for example, be only the guess of an idiot, and not obtained from any measurement at all. Nevertheless, that is the number given to us, and our job is not to question it, but to do the best we can with it.

This may seem like an inflexible, cavalier attitude; I am convinced that nothing short of it can ever remove the ambiguity of "What is the problem?" that has plagued probability theory for two centuries.

Just as Rowlinson was impelled to invent an Urn Model that was not specified in the statement of the problem, you and I might, in some cases, feel the urge to put more structure into this problem than I have used. Indeed, we demand the right to do this. But then, let us recognize that we are considering a different problem than pure "classical" maximum entropy; and it becomes a technical question, not a philosophical one, whether with some new model structure we shall get different results. Clearly, the answer must be sometimes yes, sometimes no, depending on the specific model structure assumed. But it turns out that the answer is "no" far more often than one might have expected.

Perhaps the first thought that comes to one's mind is that any uncertainty as to the value of F ought to be allowed for by averaging the maximum-entropy distribution p_i(F) over the possible values of F. But the maximum-entropy distribution is, by construction, already as "uncertain" as it can get for the stated mean value. Any averaging can only result in a distribution with still higher entropy, which will therefore necessarily violate the mean value number given to us. This hardly seems to take us in the direction wanted; i.e., we are already up against the wall from having maximized the entropy in the first place.

But such averaging was only an ad hoc suggestion; and in fact the Principle of Maximum Entropy already provides the proper means by which any testable information can be built into our probability assignments. If we wish only to incorporate information about the accuracy with which f is known, no new model structure is needed; the way to do this is to impose another constraint. In addition to <f> we may specify <f²>; or indeed, any number of moments <f^n> or more general functions <h(f)>. Each such constraint will be accompanied by its Lagrange multiplier λ, and the general maximum-entropy formalism already allows for this.

Of course, whenever information of this kind is available it should in principle be taken into account in this way. I would


"hold it to be self-evident" that for any problem of inference, the ideal toward which we should aim is that all the relevant information we have ought to be incorporated explicitly into our equations; while at the same time, "objectivity" requires that we carefully avoid assuming any information that we do not possess. The Principle of Maximum Entropy, like Ockham, tells us to refrain from inventing Urn Models when we have no Urn.

But in practice, some kinds of information prove to be farmore relevant than others, and this extra information aboutthe accuracy of F usually affects our actual conclusions solittle that it is hardly worth the effort. This is particu­larly true in statistical mechanics, due to the enormouslyhigh dimensionality of the phase space. Here the effect ofspecifying any reasonable accuracy in F is usually completelynegligible. However, there are occasional exceptions; andwhenever this extra information does make an appreciable dif­ference it would, of course, be wrong to ignore it.

C. Speculations for the Future

The field of statistical inference--in or out of physics--is so wide that there is no hope of guessing every area in which new advances might be made. But we can indicate a few areas where progress may be predicted rather safely because it is already underway, with useful results being found at a rate proportional to the amount of effort invested.

Current progress is taking place at several different levels:

    I.   Application of existing techniques to existing problems.
    II.  Extension of present theory to new problems.
    III. More powerful mathematical methods.
    IV.  Further development of the basic theory of inference.

However, I shall concentrate on I and IV, because II is so enveloped in fog that nothing can be seen clearly, and III seems to be rather stagnant except for development of new specialized computer techniques, which I am not competent even to describe, much less predict.

There are important current areas that seem rather desperately in need of the same kind of house-cleaning that statistical mechanics has received. What they all have in common is: (a) long-standing problems, still unsolved after decades of mathematical efforts, (b) domination by a mental outlook that leads one to concentrate all attention on the analogs of ergodic theory. That is, in the belief that a probability is not respectable unless it is also a frequency, one attempts a direct calculation of frequencies, or tries to guess the right "statistical assumption" about frequencies, even though the available information does not consist of frequencies, but consists rather of partial knowledge of certain "macroscopic" parameters


{α_i}; and the predictions desired are not frequencies, but estimates of certain other parameters {θ_i}. It is not yet realized that, by looking at the problems this way, one is not making efficient use of probability theory; by restricting its meaning one is denying himself nearly all its real power.

The real problem is not to determine frequencies, but to describe one's state of knowledge by a probability distribution. If one does this correctly, he will find that whatever frequency connections are relevant will appear automatically, not as "statistical assumptions" but as mathematical consequences of probability theory.

Examples are the theory of hydrodynamic turbulence, optical coherence, quantum field theory, and surprisingly, communication theory, which after thirty years has hardly progressed beyond the stage of theorems which presuppose all the ten-gram frequencies known in advance.

In early 1978 I attended a seminar talk by one of the current experts on turbulence theory. He noted that the basic theory is in a quandary because "Nobody knows what statistical assumptions to make." Yet the objectives of turbulence theory are such things as: given the density, compressibility, and viscosity of a fluid, predict the conditions for onset of turbulence, the pressure difference required to maintain turbulent flow, the rate of heat transfer in a turbulent fluid, the distortion and scattering of sound waves in a turbulent medium, the forces exerted on a body in the fluid, etc. Even if one's objective were only to predict some frequencies g related to turbulence, statements about the best estimate of g and the reliability of that estimate can only be derived from probabilities that are not themselves frequencies.

We indicated a little of this above [Equations (B15)-(B23)]; now let us see in a more realistic case why the frequencies with which various things happen in a time-dependent process are not the same as their probabilities; but that, nevertheless, there are always definite connections between probability and frequency, derivable as consequences of probability theory.

Fluctuations. Consider some physical quantity f(t). What follows will generalize at once to field quantities f(x,t); but to make the present point it is sufficient to consider only time variations. Therefore, we may think of f(t) as the net force exerted on an area A by some pressure P(x,t):

f(t) = \int_A P(x,t)\, dA        (C1)

or the net force in the x-direction exerted by an electric field on a charge distributed with density ρ(x): f(t) = ∫ E_x(x,t) ρ(x) d³x;


or as the total magnetic flux passing through an area A, or the number of molecules in an observed volume V; or the difference in magnetic and electrostatic energy stored in V:

f(t) = \frac{1}{8\pi}\int_V \left[H^2(x,t) - E^2(x,t)\right] d^3x        (C2)

and so on! For any such physical meaning, the following con­siderations will apply.

Given any probability distribution (which we henceforth call, for brevity, an ensemble) for f(t), the best prediction of f(t) that we can make from it--"best" in the sense of minimizing the expected square of the error--is the ensemble average

\langle f(t)\rangle = \langle f\rangle        (C3)

which is independent of t if it is an equilibrium ensemble, as we henceforth assume. But this may or may not be a reliable prediction of f(t) at any particular time. The mean square expected deviation from the prediction (C3) is the variance

(\Delta f)^2 \equiv \langle f^2\rangle - \langle f\rangle^2        (C4)

again independent of t by our assumption. Only if |Δf/<f>| << 1 is the ensemble making a sharp prediction of the measurable value of f.

Basically, the quantity Δf just defined represents only the uncertainty of the prediction; i.e., the degree of ignorance about f expressed by the ensemble. Yet Δf is held, almost universally in the literature of fluctuation theory, to represent also the measurable RMS fluctuations in f. Clearly, this is an additional assumption, which might or might not be true; for, obviously, the mere fact that I know f only to ±1% accuracy, is not enough to make it fluctuate by ±1%! Therefore, we note there is logically no room for any postulate that Δf is the measurable RMS fluctuation; whether this is or is not true is mathematically determined by the probability distribution. To understand this we need a more careful analysis of the relation between <f>, Δf, and experimentally measurable quantities.

More generally, we can consider a large class of functionals of f(t) in some time interval (0 < t < T); for example,

K[f(t)] = T^{-n}\int_0^T dt_1 \cdots \int_0^T dt_n\; G[f(t_1) \ldots f(t_n)]        (C5)

with G(f_1 ... f_n) a real function. For any such functional, the ensemble will determine some probability distribution P(K)dK, and the best prediction we can make by the mean-square-error criterion is its expectation <K>. What is the necessary and

sufficient condition that, as T → ∞, the ensemble predicts a sharp value for K? It is, as always, that

(\Delta K)^2 \equiv \langle K^2\rangle - \langle K\rangle^2 \to 0        (C6)

For any such functional, this condition may be written outexplicitly; let us give two examples that will surprise somereaders.

One of the sources of confusion in this field is that the word "average" is used in several different senses. We try to avoid this by using different notations for different kinds of average. For the single system that exists in the laboratory, the observable average is not the ensemble average <f>, but a time average, which we denote by a bar (reserving the angular brackets to mean only ensemble averages):

\bar{f} = \frac{1}{T}\int_0^T f(t)\, dt        (C7)

which corresponds to (C5) with G = f(t_1). The averaging time T is left arbitrary for the time being because the results (C8), (C11), (C18), to be derived next, being exact for any T, then provide a great deal of insight that would be lost if we pass to the limit too soon.

In the state of knowledge represented by the ensemble, the best prediction of \bar{f} by the mean square error criterion is

\langle\bar{f}\rangle = \left\langle\frac{1}{T}\int_0^T f(t)\,dt\right\rangle = \frac{1}{T}\int_0^T \langle f\rangle\, dt

or, for an equilibrium. ensemble,

<f> = <f> (ca)

an example of a very general rule of probability theory; anensemble average <f> i~ not the same as a measured value f(t)or a measured average f; but it is equal to the expectationsof both of those quantities.

But (C8), like (C3), tells us nothing about whether the pre­diction is a reliable one; to answer this we must again con­sider the variance

(C9)


Only if |Δf̄/⟨f⟩| ≪ 1 is the ensemble making a sharp prediction of the measured average f̄. Now, however, the time averaging can help us; for Δf̄ may become very small compared to Δf, if we average over a long enough time.

Now in an equilibrium ensemble the integrand of (C9) is a function of (t_2 − t_1) only, and defines the covariance function


φ(τ) ≡ ⟨f(t)f(t+τ)⟩ − ⟨f(t)⟩⟨f(t+τ)⟩ = ⟨f(0)f(τ)⟩ − ⟨f⟩²   (C10)

from which (C9) reduces to a single integral:

(Δf̄)² = (2/T²) ∫_0^T (T − τ) φ(τ) dτ   (C11)

A sufficient (stronger than necessary) condition for Δf̄ to tend to zero is that the integrals

∫_0^∞ τ φ(τ) dτ,   ∫_0^∞ φ(τ) dτ   (C12)

converge; and then the characteristic correlation time

τ_c ≡ [∫_0^∞ φ(τ) dτ]^{-1} [∫_0^∞ τ φ(τ) dτ]   (C13)

is finite, and we have asymptotically

(Δf̄)² ≈ (2/T) ∫_0^∞ φ(τ) dτ   (C14)

Δf̄ then tends to zero like 1/√T, and the situation is very much as if successive samples of the function over non-overlapping intervals of length τ_c were independent. However, the slightest positive correlation, if it persists indefinitely, will prevent any sharp prediction of f̄. For, if φ(τ) → φ(∞) > 0, then from (C11) we have

(Δf̄)² → φ(∞) > 0,   T → ∞   (C15)

and the ensemble can never make a sharp prediction of the measured average; i.e., any postulate that the ensemble average equals the time average violates the mathematical rules of probability theory. These results correspond to (B23).
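The following sketch (an illustration added here, assuming an exponential covariance φ(τ) = σ²e^{−τ/τ_c}, optionally plus a persistent part φ(∞)) evaluates (C11) numerically and compares it with the asymptotic form (C14) and the limit (C15).

import numpy as np
from scipy.integrate import quad

sigma2, tau_c = 1.0, 2.0          # assumed covariance parameters

def var_fbar(T, phi):
    """(C11): variance of the time average over (0, T)."""
    val, _ = quad(lambda tau: (T - tau) * phi(tau), 0.0, T)
    return 2.0 * val / T**2

phi = lambda tau: sigma2 * np.exp(-tau / tau_c)
for T in (10.0, 100.0, 1000.0):
    print(T, var_fbar(T, phi), 2.0 * sigma2 * tau_c / T)   # exact (C11) vs asymptotic (C14)

# A persistent correlation phi(inf) > 0 keeps (Delta f-bar)^2 from vanishing, as in (C15):
phi_pers = lambda tau: sigma2 * np.exp(-tau / tau_c) + 0.3
print(var_fbar(1000.0, phi_pers))   # close to 0.3, not 0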

Now everything we have said about measurable values of f can be repeated mutatis mutandis for the measurable fluctuations of f(t); we need only take a step up the hierarchy of successively higher order correlations. For, over the observation time T, the measured mean-square fluctuation in f(t)--i.e., deviation from the measured mean--is

(δf)² ≡ (1/T) ∫_0^T [f(t) − f̄]² dt   (C16)
      = (1/T) ∫_0^T f²(t) dt − f̄²   (C17)

which corresponds to the choice G = f²(t_1) − f(t_1)f(t_2) in (C5). The "best" prediction we can make of this from the ensemble is its expectation, which reduces to

⟨(δf)²⟩ = (Δf)² − (Δf̄)²   (C18)

as a short calculation using (C4), (C11) will verify. This is in itself a very interesting (and I am sure to many surprising) result. The predicted measurable fluctuation δf is not the same as the ensemble fluctuation Δf unless the ensemble is such, and the averaging time so long, that Δf̄ is negligible compared to Δf.
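Equation (C18) is easy to verify by simulation; the sketch below (added here, using an assumed stationary Gaussian AR(1) ensemble) compares the ensemble average of the measured fluctuation (δf)² with (Δf)² − (Δf̄)².

import numpy as np

rng = np.random.default_rng(2)
n_sys, n_t, a1, sig, mu = 4000, 400, 0.95, 1.0, 3.0    # AR(1) parameters (arbitrary)

# Stationary AR(1) ensemble: f_{t+1} - mu = a1 (f_t - mu) + noise
f = np.empty((n_sys, n_t))
f[:, 0] = mu + rng.normal(scale=sig / np.sqrt(1 - a1**2), size=n_sys)
for t in range(1, n_t):
    f[:, t] = mu + a1 * (f[:, t - 1] - mu) + rng.normal(scale=sig, size=n_sys)

Delta_f2    = f.var(axis=0).mean()        # (Delta f)^2: ensemble variance, (C4)
Delta_fbar2 = f.mean(axis=1).var()        # (Delta f-bar)^2: variance of the time average, (C9)
delta_f2    = f.var(axis=1).mean()        # <(delta f)^2>: expected measured fluctuation, (C16)

print(delta_f2, Delta_f2 - Delta_fbar2)   # (C18): the two numbers agree within sampling error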

But (C18) tells us nothing about whether the prediction ⟨(δf)²⟩ is a reliable one; to answer this we must, once more, examine the variance

V ≡ ⟨[(δf)²]²⟩ − ⟨(δf)²⟩²   (C19)

Unless (C19) is small compared to the square of (C18), the ensemble is not making any definite prediction of (δf)². After some computation we find that (C19) can be written in the form

V = T^{-4} ∫_0^T dt_1 ∫_0^T dt_2 ∫_0^T dt_3 ∫_0^T dt_4 ψ(t_1,t_2,t_3,t_4)   (C20)

where ψ is a four-point correlation function:

ψ(t_1,t_2,t_3,t_4) = ⟨f(t_1)f(t_2)f(t_3)f(t_4)⟩ − 2⟨f(t_1)f(t_2)f²(t_3)⟩ + ⟨f²(t_1)f²(t_2)⟩ − [(Δf)² − (Δf̄)²]²   (C21)

which we have written in reduced form, taking advantage of the symmetry of the domain of integration in (C20).

As we see, the person who supposes that the RMS fluctuation Δf in the ensemble is also the experimentally measurable RMS fluctuation δf is inadvertently supposing some rather nontrivial mathematical properties of that ensemble, which would seem to require some nontrivial justification! Yet to the best of my knowledge, no existing treatment of fluctuation theory even recognizes the distinction between Δf and δf.


In almost all discussions of random functions in the existing literature concerned with physical applications, it is taken for granted that (C6) holds for all functionals. One can hardly avoid this if one postulates, with Rowlinson, that "the probability of an event is the frequency with which it occurs in a given situation." But if it requires the computation (C20) to justify this for the mean-square fluctuation, what would it take to justify it in general? That is just the "ergodic" problem for this model.

Future progress in a number of areas will, I think, require that the relation between ensembles and physical systems be more carefully defined. The issue is not merely one of "philosophy of interpretation" that practical people may ignore; for not only the quantitative details, but even the qualitative kinds of physical predictions that a theory can make, depend on how these conceptual problems are resolved. For example, as was pointed out in my 1962 Brandeis Lectures [loc. cit., Eqs. (83)-(93)], one cannot even state, in terms of the underlying ensemble, the criterion for a phase transition, or distinguish between laminar and turbulent flow, until the meaning of that ensemble is recognized.

A striking example of the need for clarifications in fluctuation theory is provided by quantum electrodynamics. Here one may calculate the expectation of an electric field at a point: ⟨E(x,t)⟩ = 0, but the expectation of its square diverges: ⟨E²(x,t)⟩ = ∞. Thus ΔE = ∞; in present quantum theory one interprets this as indicating that empty space is filled with "vacuum fluctuations," yielding an infinite "zero-point" energy density. But when we see the distinction between ΔE and δE, a different interpretation suggests itself. If ΔE = ∞ that does not have to mean that any physical quantity is infinite; it means only that the present theory is totally unable to predict the field at a point, i.e., the only thing which is infinite is the uncertainty of the prediction.

It had been thought for 30 years that these vacuum fluctuations had to be real, because they were the physical cause of the Lamb shift; however it has been shown (Jaynes, 1978) that a classical calculation leads to just the same formula for this frequency shift without invoking any field fluctuations. Therefore, it appears that a reinterpretation of the "fluctuation laws" of quantum theory along these lines might clear up at least some of the paradoxes of present quantum theory.

The situation just noted is only one of a wide class of connections that might be called "generalized fluctuation-dissipation theorems" or "fluctuation-response theorems." These include all of the Kubo-type theorems relating transport coefficients to various "thermal fluctuations." I believe that relations of this type will become more general and more useful with a better understanding of fluctuation theory.

Biology. Perhaps the largest and most obvious beckoning new field for application of statistical thermodynamics is biology. At present, we do not have the input information needed for a useful theory, we do not know what simplifying assumptions are appropriate; and indeed we do not know what questions to ask.

Nevertheless, molecular biology has advanced to the point where some preliminary useful results do not seem any further beyond us now than the achievement of an integrated circuit computer chip did thirty years ago.

In the case of the simplest organism for which a great deal of biochemical information exists, the bacterium E. coli, Watson (1965) estimated that "one-fifth to one-third of the chemical reactions in E. coli are known," and noted that additions to the list were coming at such a rate that by perhaps 1985 it might be possible to describe "essentially all the metabolic reactions involved in the life of an E. coli cell."

As a pure speculation, then, let us try to anticipate a problem that might just possibly be amenable to the biochemical knowledge and computer technology of the year 2000: Given the structure and chemical composition of E. coli, predict its experimentally reproducible properties, i.e., the range of environmental conditions (temperature, pH, concentrations of food and other chemicals) under which a cell can stay alive; the rate of growth as a function of these factors. Given a specific mutation (change in the DNA code), predict whether it can survive and what the reproducible properties of the new form will be.

Such a program would be a useful first step. It seems, in my very cloudy crystal ball, that (1) its realization might be a matter of decades rather than centuries, (2) success in one instance would bring about a rapid increase in our ability to deal with more complicated problems, because it would reveal what simplifying assumptions are permissible.

At present one could think of several thousand factors that might, as far as we know, be highly relevant for these predictions. If a single cell contains 20,000 ribosomes where protein synthesis is taking place, are they performing 20,000 different functions, each one essential to the life of the cell? This just seems unlikely. I would conjecture that of all the complicated detail that can be seen in a cell, the overwhelmingly greatest part is--like every detail of the hair on our heads, or our fingerprints--accidental to the history of that particular individual; and not at all essential for its biological function.


The problem seems terribly complicated at present, because in all this detail we do not know what is relevant, what is irrelevant. But success in one instance would show us how to judge this. It might turn out that prediction of biological activity requires information about only a dozen separate factors, instead of a million. If so, then one would have both the courage and the insight needed to attack more complicated problems.

This has been stated so as to bring out the close analogy with what has happened in the theory of irreversible processes. In the early 1950's the development of a general formalism for irreversible processes appeared to be a hopelessly complicated program, not to be thought of in the next thousand years, if ever. Thus, van Hove (1956) stated: "... in view of the unlimited diversity of possible nonequilibrium situations, the existence of such a set of equations seems rather doubtful." Yet, as noted in Section A above, the principle which has solved this problem already existed, unrecognized, at that time. And today it seems that our major problem is not the complications of detail, but the conceptual difficulty in understanding how such a complicated problem could have such a (formally) simple solution. The answer is that, while full dynamical information is extremely complicated, the relevant information is not.

Perhaps there is a general principle--which we are conceptually unprepared to recognize today because it is too simple--that would tell us which features of an organism are its essential, relevant biological features; and which are not.

Of course, applications of statistical mechanics to biology may be imagined at many different levels, so widely separated that they have nothing to do with each other. Thus, while I have been speculating about complexities within a single cell, the contribution of E. H. Kerner to this Symposium goes after the opposite extreme, interaction of many organisms. At that level the relevant information is now so much simpler and more easily obtained that many interesting results are already available.

Basic Statistical Theory. From the standpoint of statistical theory in general, the principle of maximum entropy is only one detail, which arose in connection with the problem of generalizing Laplace's statistical practice from (A6), and we have examined it above only in the finite discrete case. As n → ∞ a new feature is that for some kinds of testable information there is no upper bound to the entropy. For mean-value constraints, the partition function may diverge for all real λ, or the constraint equations (B4) may not have a solution. In this way, the theory signals back to us that we have not put enough information into the problem to determine any definite inferences. In the finite case, the mere enumeration of the possibilities {i = 1, 2, …, n} specifies enough to ensure that a solution exists. If n → ∞, we have specified far less in the enumeration, and it is hardly surprising that this must be compensated by specifying more information in our constraints.

Rowlinson quotes Leslie Ellis (1842) to the effect that "Mere ignorance is no ground for any inference whatever. Ex nihilo nihil." I am bewildered as to how Rowlinson can construe this as an argument against maximum entropy, since as we see the maximum entropy principle immediately tells us the same thing. Indeed, it is the principle of maximum entropy--and not Leslie Ellis--that tells us precisely how much information must be specified before we have a normalizable distribution so that rational inferences are possible. Once this is recognized, I believe that the case n → ∞ presents no difficulties of mathematics or of principle.

It is very different when we generalize to continuous distributions. We noted that Boltzmann was obliged to divide his phase space into discrete cells in order to get the entropy expression (A14) from the combinatorial factor (A11). Likewise, Shannon's uniqueness proof establishing −Σ p_i log p_i as a consistent information measure goes through only for a discrete distribution. We therefore approach the continuous case as the limit of a discrete one. This leads (Jaynes, 1963b, 1968) to the continuous entropy expression

S = −∫ p(x) log [p(x)/m(x)] dx   (C22)

where the "measure function" m(x) is proportional to the limiting density of discrete points (all this theory is readily restated in the notation of measure theory and Stieltjes integrals; but we have never yet found a problem that needs it). So, it is the entropy relative to some "measure" m(x) that is to be maximized. Under a change of variables, the functions p(x), m(x) transform in the same way, so that the entropy so defined is invariant; and in consequence it turns out that the Lagrange multipliers and all our conclusions from entropy maximization are independent of our choice of variables. The maximum-entropy probability density for prescribed averages


⟨f_k(x)⟩ = ∫ p(x) f_k(x) dx = F_k,   k = 1, 2, …, m   (C23)


is


p(x) = Z^{-1} m(x) exp[−Σ_k λ_k f_k(x)]   (C24)

with the partition function

Z(λ_1, …, λ_m) = ∫ dx m(x) exp[−Σ_k λ_k f_k(x)]   (C25)
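As a minimal numerical sketch of (C23)-(C25) (an illustration added here, not an example from the text): take x on (0, ∞), m(x) = 1, and a single constraint ⟨x⟩ = F; solving for the multiplier recovers the exponential density p(x) = (1/F) e^{−x/F}, i.e. λ = 1/F.

import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

F_target = 2.5                        # hypothetical prescribed average <x>
m  = lambda x: 1.0                    # measure function on (0, infinity)
f1 = lambda x: x                      # single constrained function f_1(x) = x

def Z(lam):                           # (C25) with one multiplier
    val, _ = quad(lambda x: m(x) * np.exp(-lam * f1(x)), 0.0, np.inf)
    return val

def mean_x(lam):                      # <x> under the maximum-entropy density (C24)
    val, _ = quad(lambda x: x * m(x) * np.exp(-lam * f1(x)), 0.0, np.inf)
    return val / Z(lam)

lam = brentq(lambda l: mean_x(l) - F_target, 1e-3, 100.0)
print(lam, 1.0 / F_target)            # lambda = 1/F, i.e. p(x) = (1/F) exp(-x/F)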

An interesting fact, which may have some deep significance not yet seen, is that the class of maximum-entropy functions (C24) is, by the Pitman-Koopman theorem, identical with the class of functions admitting sufficient statistics; that is, if as in (B72)-(B81) we think of (C24) as defining a class of sampling distributions from which, given data D ≡ "N measurements of x yielded the results {x_1, …, x_N}," we are to estimate the {λ_1, …, λ_m} by applying Bayes' theorem, we find that the posterior distribution of the λ's depends on the data only through the observed averages

f̄_k = (1/N) Σ_{r=1}^N f_k(x_r)   (C26)

all other aspects of the data being irrelevant. This seems to strengthen the point of view noted before in (B72)-(B81). For many more details, see Huzurbazar (1976).

But now let us return to our usual viewpoint, that (C24) is not a sampling distribution but a prior distribution from which we are to make inferences about x, which incorporates any testable prior information. If the space S_x in which the continuous variable x is defined is not the result of any obvious limiting process, there seems to be an ambiguity; for what now determines m(x)?

This problem was discussed in some detail before (Jaynes, 1968). If there are no constraints, maximization of (C22) leads to p(x) = A m(x), where A is a normalization constant; thus m(x) has the intuitive meaning that it is the distribution representing "complete ignorance," and we are back, essentially, to Bernoulli's problem from where it all started. In the continuous case, then, before we can even apply maximum entropy we must deal with the problem of complete ignorance.

Suppose a man is lost in a rowboat in the middle of the ocean. What does he mean by saying that he is "completely ignorant" of his position? He means that, if he were to row a mile in any direction he would still be lost; he would be just as ignorant as before. In other words, ignorance of one's location is a state of knowledge which is not changed by a small change in that location. Mathematically, "complete ignorance" is an invariance property.

The set of all possible changes of location forms a group of translations. More generally, in a space S_x of any structure, one can define precisely what he means by "complete ignorance" by specifying some group of transformations of S_x onto itself under which an element of probability m(x)dx is to be invariant. If the group is transitive on S_x (i.e., from any point x, any other point x′ can be reached by some element of the group), this determines m(x), to within an irrelevant multiplicative constant, on S_x.

This criterion follows naturally from the basic desideratum of consistency: in two problems where we have the same state of knowledge, we should assign the same probabilities. Any transformation of the group defines a new problem in which a "completely ignorant" person would have the same state of prior knowledge. If we can recognize a group of transformations that clearly has this property of transforming the problem into one that is equivalent in this sense, then the ambiguity in m(x) has been removed. Quite a few useful "ignorance priors" have been found in this way; and in fact for most real problems that arise the solutions are now well--if not widely--known.
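Two standard examples, spelled out here for illustration (they are not worked at this point in the text): for a location parameter the group is x → x + a, and invariance of m(x)dx for every a forces m(x) = constant; for a scale parameter the group is x → ax with a > 0, and invariance requires

m(x) dx = m(ax) d(ax),   i.e.   m(x) = a m(ax) for all a > 0,

whose only solution is m(x) ∝ 1/x, the Jeffreys prior that reappears in (C34) below.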

But while the notion of transformation groups greatly reduces the ambiguity in m(x), it does not entirely remove it in all cases. In some problems no appropriate group may be apparent; or there may be more than one, and we do not see how to choose between them. Therefore, still more basic principles are needed; and active research is now underway and is yielding promising results.

One of these new approaches, and the one on which there is most to report, is the method of marginalization (Jaynes, 1979). The basic facts pointing to it were given already by Jeffreys (1939; §3.8), but it was not realized until 1976 that this provides a new, constructive method for defining what is meant by "ignorance," with the advantage that everything follows from the basic rules (A8), (A9) of probability theory, with no need for any such desiderata as entropy or group invariance. We indicate the basic idea briefly, using a bare skeleton notation to convey only the structure of the argument.

Marginalization. We have a sampling distribution p(x|θ) for some observable quantity x, depending on a parameter θ, both multidimensional. From an observed value x we can make inferences about θ; with prior information I, prior probability distribution p(θ|I), Bayes' theorem (B24) yields the posterior distribution

p(θ|x,I) = A p(θ|I) p(x|θ)   (C27)

In the following, A always stands for a normalization constant, not necessarily the same in all equations.

But now we learn that the parameter θ can be separated into two components: θ = (ζ,η), and we are to make inferences only about ζ. Then we discover that the data x may also be separated: x = (y,z), in such a way that the sampling distribution of z depends only on ζ:

p(z|ζ,η) = ∫ p(y,z|ζ,η) dy = p(z|ζ)   (C28)

Then, writing the prior distribution as p(θ|I) = π(ζ)π(η), (C27) gives for the desired marginal posterior distribution

p(ζ|y,z,I) = A π(ζ) ∫ p(y,z|ζ,η) π(η) dη   (C29)

which must in general depend on our prior information about η. This is the solution given by a "conscientious Bayesian" B1.

At this point there arrives on the scene an "ignorant Bayesian" B2, whose knowledge of the experiment consists only of the sampling distribution (C28); i.e., he is unaware of the existence of the components (y,η). When told to make inferences about ζ, he confidently applies Bayes' theorem to (C28), getting the result

p(ζ|z,I) = A π(ζ) p(z|ζ)   (C30)

This is what was called a "pseudoposterior distribution" by Geisser and Cornfield (1963).

B1 and B2 will in general come to different conclusions because B1 is taking into account extra information about (y,η). But now suppose that for some particular prior π(η), B1 and B2 happen to agree after all; what does that mean? Clearly, B2 is not incorporating any information about η; he doesn't even know it exists. If, nevertheless, they come to the same conclusions, then it must be that B1 was not incorporating any information about η either. In other words, a prior π(η) that leaves B1 and B2 in agreement must be, within the context of this model, a completely uninformative prior; it contains no information relevant to questions about ζ.

Now the condition for equality of (C29), (C30) is just a Fredholm integral equation:

∫ p(y,z|ζ,η) π(η) dη = A(y,z) p(z|ζ)   (C31)


where A(y,z) is a function to be determined from (C31). Therefore, the rules of probability theory already contain the criterion for defining what is meant by "completely uninformative."

Mathematical analysis of (C31) proves to be quite involved; and we do not yet know a necessary and sufficient condition on p(y,z|ζ,η) for it to have solutions, or unique solutions, although a number of isolated results are now in (Jaynes, 1979). We indicate one of them.

Suppose y and η are positive real, and η is a scale parameter for y; i.e., we have the functional form

p(y,z|ζ,η) = η^{-1} f(z,ζ; y/η)   (C32)

for the sampling density function. Then (C31) reduces to

y^{-1} ∫ f(z,ζ;u) [(y/u) π(y/u)] du = A(y,z) ∫ f(z,ζ;u) du   (C33)

where we have used (C28). It is apparent from this that the Jeffreys prior

π(η) = η^{-1}   (C34)

is always a solution, leading to A(y,z) = y^{-1}. Thus (C34) is "completely uninformative" for all models in which η appears as a scale parameter; and it is easily shown (Jaynes, 1979) that one can invent specific models for which it is unique.
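A quick numerical check (a sketch added here, for the special case in which the (z,ζ)-dependence factors out, so the scale model reduces to p(y|η) = η^{-1} e^{−y/η}): with the Jeffreys prior π(η) = η^{-1}, the left-hand side of (C31) is exactly y^{-1}, confirming A(y,z) = y^{-1}.

import numpy as np
from scipy.integrate import quad

def lhs(y):
    # integral of p(y|eta) * pi(eta) d(eta), with p(y|eta) = (1/eta) exp(-y/eta)
    # and the Jeffreys prior pi(eta) = 1/eta
    val, _ = quad(lambda eta: (1.0 / eta) * np.exp(-y / eta) * (1.0 / eta), 0.0, np.inf)
    return val

for y in (0.5, 1.0, 2.0, 7.0):
    print(y, lhs(y), 1.0 / y)     # the last two columns agree: A(y,z) = 1/y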

We have, therefore, the result that the Jeffreys prior is uniquely determined as the only prior for a scale parameter that is "completely uninformative" without qualifications. We can hardly avoid the inference that it represents, uniquely, the condition of "complete ignorance" for a scale parameter.

This example shows how marginalization is able to give results consistent with those found before, but in a way that springs directly out of the principles of probability theory without any additional appeal to intuition (as is involved in choosing a transformation group). At the moment, this approach seems very promising as a means of rigorizing and extending the basic theory. However, there are enough complicated technical details, not noted here, so that it will require quite a bit more research before we can assess its full scope and power. In fact, the chase is at present quite exciting, because it is still mathematically an open question whether the integral equations may in some cases become overdetermined, so that no uninformative prior exists. If so, this would call for some deep clarification, and perhaps revision, of present basic statistical theory.


D. An Application: Irreversible Statistical Mechanics

The calculation of an irreversible process usually involves three distinct stages: (1) setting up an "ensemble," i.e., choosing a density matrix ρ(0), or an N-particle distribution function, which is to describe our initial knowledge about the system of interest; (2) solving the dynamical problem, i.e., applying the microscopic equations of motion to obtain the time-evolution of the system ρ(t); (3) extracting the final physical predictions from the time-developed ensemble ρ(t).

Stage (3) has never presented any procedural difficulty; to predict the quantity F from the ensemble ρ, one follows the practice of equilibrium theory, and computes the expectation value ⟨F⟩ = Tr(ρF). While the ultimate justification of this rule has been much discussed (ergodic theory), no alternative procedure has been widely used.

In this connection, we note the following. Suppose we are to choose a number f, representing our estimate of the physical quantity F, based on the ensemble ρ. A reasonable criterion for the "best" estimate is that the expected square of the error, ⟨(F − f)²⟩, shall be made a minimum. The solution of this simple variational problem is: f = ⟨F⟩. Thus, if we regard statistical mechanics, not in the "classical" sense of a means for calculating time averages in terms of ensemble averages, but rather as an example of statistical estimation theory based on the mean square error criterion, the usual procedure is uniquely determined as the optimal one, independently of ergodic theory. A justification not depending on ergodic theory is in any event necessary as soon as we try to predict the time variation of some quantity F(t); for the physical phenomenon of interest then consists just of the fact that the ensemble average ⟨F(t)⟩ is not equal to a time average.
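Explicitly (a one-line check added here): ⟨(F − f)²⟩ = ⟨F²⟩ − 2f⟨F⟩ + f² is a parabola in f with its minimum at f = ⟨F⟩, where its value is the variance ⟨F²⟩ − ⟨F⟩².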

The dynamical problem of stage (2) is the most difficult to carry out, but it is also the one in which most recent progress has been booked (Green's function methods). While the present work is not primarily concerned with these techniques, they are available, and needed, in carrying out the calculations indicated here for all but the simplest problems.

It is curious that stage (1), which must logically precede all the others, has received such scant attention since the pioneering work of Gibbs, in which the problem of ensemble construction was first recognized. Most recent discussions of irreversible processes concentrate all attention on stage (2); many fail to note even the existence of stage (1). One consequence of this is that the resulting theories apply unambiguously only to the case of "response functions," in which the nonequilibrium state is one resulting from a dynamical perturbation (i.e., an explicitly given term in the Hamiltonian), starting from thermal equilibrium at some time in the past; the initial density matrix is then given by conventional equilibrium theory, and so the problem of ensemble construction is evaded.

If, however, the nonequilibrium state is defined (as it usually is from the experimental standpoint) in terms of temperature or concentration gradients, rate of heat flow, shearing stress, sound wave amplitudes, etc., such a procedure does not apply, and one has resorted to various ad hoc devices. An extreme example is provided by some problems in astrophysics, in which it is clear that the system of interest has never, in the age of the universe, been in a state approximating thermal equilibrium. Such cases have been well recognized as presenting special difficulties of principle.

We show here that recognition of the existence of the stage (1) problem, and that its general solution is available, can remove such ambiguities and reduce the labor of stage (2). In the case of the nonequilibrium steady state, stage (2) can be dispensed with entirely if stage (1) has been properly treated.

Background. To achieve a certain unity within the present volume, we shall take the review article of Mori, Oppenheim, and Ross (1962)--hereafter denoted MOR--as indicating the level to which nonequilibrium theory had been brought before the introduction of Information Theory notions. This work is virtually unique in that the Stage 1 problem, and even the term "ensemble construction," appear explicitly. The earlier work of Kirkwood, Green, Callen, Kubo and others, directly related to ours, is noted in MOR, Sec. 6.

To fix ideas, consider the calculation of transport properties in systems close to equilibrium (although our final results will be far more general). In the treatments discussed by MOR, dissipative-irreversible effects did not appear in the ensemble initially set up. For example, a system of N particles of mass m, distributed with macroscopic density ρ(x), local temperature T(x), is often described in classical theory by an N-particle distribution function, or Liouville function, of the form

W_N(x_1 p_1, …, x_N p_N) = ∏_{i=1}^N [ρ(x_i)/(Nm)] [2πmkT(x_i)]^{-3/2} exp{−p_i²/[2mkT(x_i)]}   (D1)

where x_i, p_i denote the (vector) position and momentum of the i'th particle. But, although this distribution represents nonvanishing density and temperature gradients ∇ρ, ∇T, the diffusion current or heat flow computed from (D1) is zero.

Likewise, in quantum theory MOR described such a physical situation by the "local equilibrium," or "frozen-state," density matrix

ρ_t = Z^{-1} exp{−∫ d³x β(x)[H(x) − μ(x)n(x)]}   (D2)

where H(x), n(x) are the Hamiltonian density and number density operators. Again, although (D2) describes gradients of temperature, concentration, and chemical potential, the fluxes computed from (D2) are zero.

Mathematically, it was found that dissipative effects appear in the equations only after one has carried out the following operations: (a) approximate forward integration of the equations of motion for a short "induction time," and (b) time smoothing or other coarse-graining of distribution functions or Heisenberg operators.

Physically, it has always been somewhat of a mystery why either of these operations is needed; for one can argue that, in most experimentally realizable cases, irreversible flows (A) are already "in progress" at the time the experiment is started, and (B) take place slowly, so that the low-order distribution functions and expectation values of measurable quantities must be already slowly-varying functions of time and position; and thus not affected by coarse-graining. In cases where this is not true, coarse-graining would result in loss of the physical effects of interest.

The real nature of the forward integration and coarse-graining operations is therefore obscure; in a correctly formulated theory neither should be required. We are led to suspect the choice of initial ensemble; i.e., that ensembles such as (D1) and (D2) do not fully describe the conditions under which irreversible phenomena are observed, and therefore do not represent the correct solution of the stage (1) problem. [We note that (D1) and (D2) were not "derived" from anything more fundamental; they were written down intuitively, by analogy with the grand canonical ensemble of equilibrium theory.] The forward integration and coarse-graining operations would, on this view, be regarded as corrective measures which in some way compensate for the error in the initial ensemble.

This conclusion is in agreement with that of MOR. These authors never claimed that ρ_t in (D2) was the correct density matrix, but supposed that it differed by only a small amount from another matrix ρ(t), which they designate as the "actual distribution." They further supposed that after a short induction time, ρ_t relaxes into ρ(t), which would explain the need for forward integration.

Such relaxation undoubtedly takes place in the low-order distribution functions derived from ρ, as was first suggested by Bogoliubov for the analogous classical problem. However, this is not possible for the full "global" density matrix; if ρ_t and ρ(t) differ at t = 0 and undergo the same unitary transformation in their time development, they cannot be equal at any other time. Furthermore, ρ(t) was never uniquely defined; given two different candidates ρ_1(t), ρ_2(t) for this role, MOR give no criterion by which one could decide which is indeed the "actual" distribution.

For reasons already explained in earlier Sections and in Jaynes (1967), we believe that such criteria do not exist; i.e., that the notion of an "actual distribution" is illusory, since different density matrices connote only different states of knowledge. In the following Section we approach the problem in a different way, which yields a definite procedure for constructing a density matrix which is to replace ρ_t, and will play approximately the same role in our theory as the ρ(t) of MOR.

The Gibbs Algorithm. If the above reasoning is correct, a re-examination of the procedures by which ensembles are set up in statistical mechanics is indicated. If we can find an algorithm for constructing density matrices which fully describe nonequilibrium conditions, we should find that transport and other dissipative effects are obtainable by direct quadratures over the initial ensemble.

This algorithm, we suggest, was given already by Gibbs (1902). The great power and scope of the methods he introduced have not been generally appreciated to this day; until recently it was scarcely possible to understand the rationale of his method for constructing ensembles. This was (loc. cit., p. 143) to assign that probability distribution which, while agreeing with what is known, "gives the least value of the average index of probability of phase," or as we would describe it today, maximizes the entropy. This process led Gibbs to his canonical ensemble for describing closed systems in thermal equilibrium, the grand canonical ensemble for open systems, and (loc. cit., p. 38) an ensemble to represent a system rotating at angular velocity ω, in which the probability density is proportional to

exp[−β(H − ω·M)]   (D3)

where H, M are the phase functions representing Hamiltonian and total angular momentum.

Ten years later, the Ehrenfests (1912) dismissed these ensembles as mere "analytical tricks," devoid of any real significance, and asserted the physical superiority of Boltzmann's methods, thereby initiating a school of thought which dominated statistical mechanics for decades. It is one of the major tragedies of science that Gibbs did not live long enough to answer these objections, as he could have so easily.

The mathematical superiority of the canonical and grand canonical ensembles for calculating equilibrium properties has since become firmly established. Furthermore, although Gibbs gave no applications of the rotational ensemble (D3), it was shown by Heims and Jaynes (1962) that this ensemble provides a straightforward method of calculating the gyromagnetic effects of Barnett and Einstein-de Haas. At the present time, therefore, the Gibbs methods--like the Laplace methods and the Jeffreys methods--stand in a position of proven success in applications, independently of all the conceptual problems regarding their justification, which are still being debated.

The development of Information Theory made it possible to see the method of Gibbs as a general procedure for inductive reasoning, independent of ergodic theory or any other physical hypotheses, and whose range of validity is therefore not restricted to equilibrium problems; or indeed to physics. In the following we show that the Principle of Maximum Entropy is sufficient to construct ensembles representing a wide variety of nonequilibrium conditions, and that these new ensembles yield transport coefficients by direct quadratures. Indeed, we shall claim--for reasons already explained in Jaynes (1957b)--that this is the only principle needed to construct ensembles which predict any experimentally reproducible effect, reversible or irreversible.

The general rule for constructing ensembles is as follows. The available information about the state of a system consists of results of various macroscopic measurements. Let the quantities measured be represented by the operators F_1, F_2, …, F_m. The results of the measurements are, of course, simply a set of numbers: {f_1, …, f_m}. These numbers make no reference to any probability distribution. The ensemble is then a mental construct which we invent in order to describe the range of possible microscopic states compatible with those numbers, in the following sense.

If we say that a density matrix ρ "contains" or "agrees with" certain information, we mean by this that, if we communicate the density matrix to another person, he must be able, by applying the usual procedure of stage (3) above, to recover this information from it. In this sense, evidently, the density matrix agrees with the given information if and only if it is adjusted to yield expectation values equal to the measured numbers:

⟨F_k⟩ = Tr(ρF_k) = f_k,   k = 1, …, m   (D4)


and in order to ensure that the density matrix describes the full range of possible microscopic states compatible with (D4), and not just some arbitrary subset of them (in other words, that it describes only the information given, and contains no hidden arbitrary assumptions about the microscopic state), we demand that, while satisfying the constraints (D4), it shall maximize the quantity

S_I = −Tr(ρ log ρ)   (D5)

A great deal of confusion has resulted from the fact that, for decades, the single word "entropy" has been used interchangeably to stand for either the quantity (D5) or the quantity measured experimentally (in the case of closed systems) by the integral of dQ/T over a reversible path. We shall try to maintain a clear distinction here by following the usage introduced in my 1962 Brandeis lectures (Jaynes, 1963b), referring to S_I as the "information entropy" and denoting the experimentally measured entropy by S_E. These quantities are different in general; in the equilibrium case (the only one for which S_E is defined in conventional thermodynamics) the relation between them was shown (loc. cit.) to be: for all density matrices ρ which agree with the macroscopic information that defines the thermodynamic state, i.e., which satisfy (D4),

S_E ≥ k S_I   (D6)

where k is Boltzmann's constant, with equality in (D6) if and only if S_I is computed from the canonical density matrix

ρ = Z^{-1} exp(λ_1 F_1 + ⋯ + λ_m F_m)   (D7)

and

Z(λ_1, …, λ_m) = Tr exp(λ_1 F_1 + ⋯ + λ_m F_m)   (D8)

which quantity will be called the partition function. It remains only to choose the λ_k [which appear as Lagrange multipliers in the derivation of (D7) from a variational principle] so that (D4) is satisfied. This is the case if

f_k = ⟨F_k⟩ = ∂ log Z / ∂λ_k,   k = 1, 2, …, m   (D9)
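As a concrete illustration of (D4)-(D9), here is a minimal numerical sketch (added here; the 2 × 2 observable and target value are arbitrary assumptions of the example): given one measured number f, we solve for λ so that ⟨F⟩ matches it, and verify that the resulting ρ reproduces the datum.

import numpy as np
from scipy.linalg import expm
from scipy.optimize import brentq

F = np.array([[0.0, 0.5],
              [0.5, 1.0]])          # a single Hermitian observable (arbitrary choice)
f_target = 0.8                      # hypothetical measured value <F> = f

def rho_of(lam):                    # canonical density matrix (D7) with one multiplier
    R = expm(lam * F)
    return R / np.trace(R).real

def mean_F(lam):
    return np.trace(rho_of(lam) @ F).real

lam = brentq(lambda l: mean_F(l) - f_target, -50.0, 50.0)   # solve (D9) for lambda
rho = rho_of(lam)

print("lambda:", lam)
print("<F>   :", np.trace(rho @ F).real)                    # reproduces f_target
evals = np.linalg.eigvalsh(rho)
print("S_I = -Tr(rho log rho):", float(-(evals * np.log(evals)).sum()))   # (D5) at its maximum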


If enough constraints are specified to determine a normalizable density matrix, it will be found that these relations are just sufficient to determine the unknowns λ_k in terms of the given data {f_1, …, f_m}; indeed, we can then solve (D9) explicitly for the λ_k as follows. The maximum attainable value of S_I is, from (D7), (D8),

(S_I)_max = log Z − Σ_{k=1}^m λ_k ⟨F_k⟩   (D10)

If this quantity is expressed as a function of the given data, S(f_1, …, f_m), it is easily shown from the above relations that

λ_k = −∂S/∂f_k   (D11)

It has been shown (Jaynes, 1963b, 1965) that the second law of thermodynamics, and a generalization thereof that tells which nonequilibrium states are accessible reproducibly from others, follow as simple consequences of the inequality (D6) and the dynamical invariance of S_I.

We note an important property of the maximum entropy ensemble, which is helpful in gaining an intuitive understanding of this theory. Given any density matrix ρ and any ε in 0 < ε < 1, one can define a "high-probability linear manifold" (HPM) of finite dimensionality W(ε), spanned by all eigenvectors of ρ which have probability greater than a certain amount δ(ε), and such that the eigenvectors of ρ spanning the complementary manifold have total probability less than ε. Viewed in another way, the HPM consists of all state vectors ψ to which ρ assigns an "array probability," as defined in Jaynes (1957b), Sec. 7, greater than δ(ε). Specifying the density matrix ρ thus amounts to asserting that, with probability (1 − ε), the state vector of the system lies somewhere in this HPM. As ε varies, any density matrix ρ thus defines a nested sequence of HPM's.

For a macroscopic system, the information entropy S_I may be related to the dimensionality W(ε) of the HPM in the following sense: if N is the number of particles in the system, then as N → ∞ with the intensive parameters held constant, N^{-1} S_I and N^{-1} log W(ε) approach the same limit independently of ε. This is a form of the asymptotic equipartition theorem of Information Theory, and generalizes Boltzmann's S = k log W. The process of entropy maximization therefore amounts, for all practical purposes, to the same thing as finding the density matrix which, while agreeing with the available information, defines the largest possible HPM; this is the basis of the remark following (D4). An analogous result holds in classical theory (Jaynes, 1965), in which W(ε) becomes the phase volume of the "high-probability region" of phase space, as defined by the N-particle distribution function.

The above procedure is sufficient to construct the density matrix representing equilibrium conditions, provided the quantities F_k are chosen to be constants of the motion. The extension to nonequilibrium cases, and to equilibrium problems in which we wish to incorporate information about quantities which are not intrinsic constants of the motion (such as stress or magnetization), requires mathematical generalization which we give in two steps.

It is a common experience that the course of a physical process does not in general depend only on the present values of the observed macroscopic quantities; it depends also on the past history of the system. The phenomena of magnetic hysteresis and spin echoes are particularly striking examples of this. Correspondingly, we must expect that, if the F_k are not constants of the motion, an ensemble constructed as above using only the present values of the ⟨F_k⟩ will not in general suffice to predict either equilibrium or nonequilibrium behavior. As we will see presently, it is just this fact which causes the error in the "local equilibrium" density matrix (D2).

In order to describe time variations, we extend the F_k to the Heisenberg operators

F_k(t) = U†(t) F_k U(t)   (D12)

in which the time-development matrix U(t) is the solution of the Schrodinger equation

iħ ∂U(t)/∂t = H(t) U(t)   (D13)

with U(0) = I, and H(t) is the Hamiltonian. If we are given data fixing the ⟨F_k(t_i)⟩ at various times t_i, then each of these must be considered a separate piece of information, to be given its Lagrange multiplier λ_{ki} and included in the sum of (D7). In the limit where we imagine information given over a continuous time interval, −τ ≤ t ≤ 0, the summation over the time index i becomes an integration and the canonical density matrix (D7) becomes

ρ = Z^{-1} exp{ Σ_{k=1}^m ∫_{−τ}^0 λ_k(t) F_k(t) dt }   (D14)

where the partition function has been generalized to a partition functional

Z[λ_1(t), …, λ_m(t)] = Tr exp{ Σ_{k=1}^m ∫_{−τ}^0 λ_k(t) F_k(t) dt }   (D15)


and the unknown Lagrange multiplier functions λ_k(t) are determined from the condition that the density matrix agree with the given data ⟨F_k(t)⟩ over the "information-gathering" time interval:

⟨F_k(t)⟩ = Tr[ρ F_k(t)] = f_k(t),   −τ ≤ t ≤ 0   (D16)

By the perturbation methods developed below, we find that (D16) reduces to the natural generalization of (D9):

f_k(t) = δ log Z / δλ_k(t),   −τ ≤ t ≤ 0   (D17)

The extension to local densities is immediate; the F_k become Heisenberg operators F_k(x,t) depending on position as well as time,

F_k(x,t) = U†(t) F_k(x) U(t)   (D18)

and the values of these quantities at each point of space and time now constitute the independent pieces of information, which are coupled into the density matrix via the Lagrange multiplier functions λ_k(x,t). If we are given macroscopic information about F_k(x,t) throughout a space-time region R_k (which can be a different region for different quantities F_k), the ensemble which incorporates all this information, while locating the largest possible HPM of microscopic states, is

ρ = Z^{-1} exp{ Σ_k ∫_{R_k} dt d³x λ_k(x,t) F_k(x,t) }   (D19)

with the partition functional

Z = Tr exp{ Σ_k ∫_{R_k} dt d³x λ_k(x,t) F_k(x,t) }   (D20)

the λ_k(x,t) determined from

⟨F_k(x,t)⟩ = δ log Z / δλ_k(x,t),   (x,t) in R_k   (D21)

and the prediction of any quantity of interest J(x,t) given by

⟨J(x,t)⟩ = Tr[ρ J(x,t)]   (D22)

The form of equations (D19)-(D22) makes it appear that stages (1) and (2), discussed in the Introduction, are now


fused into a single stage. However, this is only a consequence of our using the Heisenberg representation. According to the usual conventions, the Schrodinger and Heisenberg representations coincide at time t = 0; thus we may regard the steps (D19)-(D21) equally well as determining the density matrix ρ(0) in the Schrodinger representation, i.e., as solving the stage (1) problem. If, having found this initial ensemble, we switch to the Schrodinger representation, Eq. (D22) is then replaced by

⟨J(x)⟩_t = Tr[J(x) ρ(t)]   (D23)

in which the problem of stage (2) now appears explicitly as that of finding the time-evolution of ρ(t). The form (D23) will be more convenient if several different quantities J_1, J_2, … are to be predicted.

Discussion. In equations (D19)-(D23) we have the generalized Gibbs algorithm for calculating irreversible processes. They represent the three stages: (1) finding the ensemble which has maximum entropy subject to the given information about certain quantities {F_k(x,t)}; (2) utilizing the full dynamics by working out the time evolution from the microscopic equations of motion; (3) making those predictions of certain other quantities of interest {J_i(x,t)} which take all the above into account, and minimize the expected square of the error. We do not claim that the resulting predictions must be correct; only that they are the best (by the mean-square-error criterion) that could have been made from the information given; to do any better we would require more initial information.

Of course this algorithm will break down, as it should, and refuse to give us any solution if we ask a foolish, unanswerable question; for example, if we fail to specify enough information to determine any normalizable density matrix, if we specify logically contradictory constraints, or if we specify space-time variations incompatible with the Hamiltonian of the system.

The reader may find it instructive to work this out in detail for a very simple system involving only a (2 × 2) matrix: a single particle of spin 1/2, gyromagnetic ratio γ, placed in a constant magnetic field B in the z-direction, Hamiltonian H = −(1/2)ħγ(σ·B). Then the only dynamically possible behavior is uniform precession about B at the Larmor frequency ω_0 = γB. If we specify any time variation for ⟨σ_x⟩ other than sinusoidal at this frequency, the above equations will break down; while if we specify ⟨σ_x(t)⟩ = a cos ω_0 t + b sin ω_0 t, we find a unique solution whenever (a² + b²) ≤ 1.
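For readers who want to see the dynamics of this toy example explicitly, here is a small numerical sketch (added here; ħ = 1 and the numbers are arbitrary): starting from ρ(0) = ½(I + aσ_x + bσ_y + cσ_z), the evolved ⟨σ_x(t)⟩ is exactly a cos ω_0 t + b sin ω_0 t, so only data of that sinusoidal form can be matched by any density matrix.

import numpy as np
from scipy.linalg import expm

sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2, dtype=complex)

w0 = 1.3                          # Larmor frequency omega_0 = gamma*B (arbitrary value, hbar = 1)
H = -0.5 * w0 * sz                # H = -(1/2) gamma (sigma . B), with B along z
a, b, c = 0.3, -0.5, 0.4          # Bloch vector of rho(0); a^2 + b^2 + c^2 <= 1
rho0 = 0.5 * (I2 + a * sx + b * sy + c * sz)

ts = np.linspace(0.0, 20.0, 400)
sx_t = np.array([np.trace(expm(-1j * H * t) @ rho0 @ expm(1j * H * t) @ sx).real for t in ts])

predicted = a * np.cos(w0 * ts) + b * np.sin(w0 * ts)      # uniform precession at omega_0
print("max deviation:", np.abs(sx_t - predicted).max())    # ~1e-14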


Mathematically, whether the ensemble ρ is or is not making a sharp prediction of some quantity J is determined by whether the variance ⟨J²⟩ − ⟨J⟩² is sufficiently small. In general, information about a quantity F would not suffice to predict some other quantity J with deductive certainty (unless J is a function of F). But in inductive reasoning, Information Theory tells us the precise extent to which information about F is relevant to predictions of J. In practice, due to the enormously high dimensionality of the spaces involved, the variance ⟨J²⟩ − ⟨J⟩² usually turns out to be very small compared to any reasonable mean-square experimental error; and therefore the predictions are, for all practical purposes, deterministic.

Experimentally, we impose various constraints (volume, pressure, magnetic field, gravitational or centrifugal forces, sound level, light intensity, chemical environment, etc.) on a system and observe how it behaves. But only when we reach the degree of control where reproducible response is observed do we record our data and send it off for publication. Because of this sociological convention, it is not the business of statistical mechanics to predict everything that can be observed in nature; only what can be observed reproducibly. But the experimentally imposed macroscopic constraints surely do not determine any unique microscopic state; they ensure only that the state vector is somewhere in the HPM. If effect A is, nevertheless, reproducible, then it must be that A is characteristic of each of the overwhelming majority of possible states in the HPM; and so averaging over those states will not change the prediction.

To put it another way, the macroscopic experimental conditions still leave billions of microscopic details undetermined. If, nevertheless, some result is reproducible, then those details must have been irrelevant to the phenomenon; and so with proper understanding we ought to be able to eliminate them mathematically. This is just what Information Theory does for us; it removes irrelevant details by averaging over them, while retaining what is relevant to the particular question being asked [i.e., the particular quantity J(x,t) that we want to predict].

It is clear, then, why the maximum entropy prescription works in such generality. If the constraints used in the calculation are the same as those actually operative in the experiment, then the maximum-entropy density matrix will locate the same HPM as did the experimental conditions; and will therefore make sharp predictions of any reproducible effect, provided that our assumed microscopic physics (enumeration of possible states, equations of motion) is correct.


For these reasons--as stressed in Jaynes (1957b)--if the class of phenomena predictable from the maximum entropy principle is found to differ in any way from the class of reproducible phenomena, that would constitute evidence for new microscopic laws of physics, not presently known. Indeed (Jaynes, 1968) this is just what did happen early in this Century; the failure of Gibbs' classical statistical mechanics to predict the correct heat capacities and vapor pressures provided the first clues pointing to the quantum theory. Any successes make this theory useful in an "engineering" sense; but for a research physicist its failures would be far more valuable than its successes.

We emphasize that the basic physical and conceptual formulation of the theory is complete at this point; what follows represents only the working out of various mathematical consequences of this algorithm.

Perturbation Theory. For systems close to thermal equilibrium, the following general theorems are useful. We denote an "unperturbed" density matrix ρ_0 by

ρ_0 = e^A / Z_0,   Z_0 = Tr(e^A)   (D24)

and a "perturbed" one by

ρ = e^{A+εB} / Z,   Z = Tr(e^{A+εB})   (D25)

where A, B are Hermitian. The expectation values of any operator C over these ensembles are respectively

⟨C⟩_0 = Tr(ρ_0 C),   ⟨C⟩ = Tr(ρC)   (D26)

The cumulant expansion of ⟨C⟩ to all orders in ε is derived in Heims and Jaynes (1962), Appendix B. The n'th order term may be written as a covariance in the unperturbed ensemble:

⟨C⟩ − ⟨C⟩_0 = Σ_{n=1}^∞ ε^n [⟨Q_n C⟩_0 − ⟨Q_n⟩_0 ⟨C⟩_0]   (D27)

Here Q_n is defined by Q_1 ≡ S_1, and

Q_n ≡ S_n − Σ_{k=1}^{n−1} ⟨Q_k⟩_0 S_{n−k},   n > 1   (D28)

in which the S_n are the operators appearing in the well-known expansion

e^{A+εB} = e^A [1 + Σ_{n=1}^∞ ε^n S_n]   (D29)


More explicitly,

S_n = ∫_0^1 dx_1 ∫_0^{x_1} dx_2 ⋯ ∫_0^{x_{n−1}} dx_n B(x_1) B(x_2) ⋯ B(x_n)   (D30)

where

B(x) ≡ e^{−xA} B e^{xA}   (D31)

The first-order term is thus

⟨C⟩ − ⟨C⟩_0 = ε ∫_0^1 dx [⟨e^{−xA} B e^{xA} C⟩_0 − ⟨B⟩_0 ⟨C⟩_0]   (D32)

and it will appear below that all relations of linear transport theory are special cases of (D32).

For a more condensed notation, define the average of any operator B over the sequence of similarity transformations as

B̄ ≡ ∫_0^1 dx e^{−xA} B e^{xA}   (D33)

which we will call the Kubo transform of B. Then (D32) becomes

⟨C⟩ − ⟨C⟩_0 = ε K_{CB}   (D34)

in which, for various choices of C, B, the quantities

K_{CB} ≡ ⟨B̄C⟩_0 − ⟨B⟩_0 ⟨C⟩_0   (D35)

are the basic covariance functions of the linear theory.
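A direct numerical check of (D33)-(D35) (a sketch added here, using small random Hermitian matrices and an arbitrary ε; the quadrature for the Kubo transform is a simple trapezoidal rule):

import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
def rand_herm(n):
    M = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
    return (M + M.conj().T) / 2

n, eps = 6, 1e-4
A, B, C = rand_herm(n), rand_herm(n), rand_herm(n)

def density(M):
    R = expm(M)
    return R / np.trace(R).real

def expect(rho, Op):
    return np.trace(rho @ Op).real

rho0, rho = density(A), density(A + eps * B)       # (D24), (D25)

# Kubo transform (D33) by composite trapezoidal quadrature over x in (0, 1)
xs = np.linspace(0.0, 1.0, 201)
vals = np.array([expm(-x * A) @ B @ expm(x * A) for x in xs])
dx = xs[1] - xs[0]
Bbar = (0.5 * (vals[0] + vals[-1]) + vals[1:-1].sum(axis=0)) * dx

K_CB = expect(rho0, Bbar @ C) - expect(rho0, B) * expect(rho0, C)   # (D35)
print(expect(rho, C) - expect(rho0, C), eps * K_CB)                 # (D34): agree to O(eps^2)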

cases, the result is proved easily by writing out the expres­sions in the representation where A is diagonal. Let F. C beany two operators; then

<F> • <F>0 0

~G • KGF

If F, C are Hermitian. then

~G is real • ~F ~ 0

(D36)

(D3?)

(D38)

If Po is a projection operator representing a pure state. thenKFC ::: O. If P is not a pure state density matrix. then with

oHermitian F. G,

Page 91: D:larryOutput.prntlusty/courses/InfoInBio/Papers/JaynesStandOn... · to irreversible processes. Here we consider the.most general application of the principle, without taking advantage

Jaynes

90

(D39)

with equality if and only if F'" qG, where q is a real number.If G is of the form.

-uA uAG(u) = e G(O) e

then

(D40)

(D41)d

du K,G = <[F,G]>o

This identity, with u interpreted as a time, provides a generalconnection between statistical and dynamical problems.

Near-Equilibrium Ensembles. A closed system in thermal equilibrium is described, as usual, by the density matrix

ρ_0 = e^{−βH} / Z_0(β)   (D42)

which maximizes S_I for prescribed ⟨H⟩, and is a very special case of (D19). The thermal equilibrium prediction for any quantity F is, as usual,

⟨F⟩_0 = Tr(ρ_0 F)   (D43)

But suppose we are now given the value of ⟨F(t)⟩ throughout the "information-gathering" interval −τ ≤ t ≤ 0. The ensemble which includes this new information is of the form (D19), which maximizes S_I for prescribed ⟨H⟩ and ⟨F(t)⟩. It corresponds to the partition functional

Z[β, λ(t)] = Tr exp[−βH + ∫_{−τ}^0 λ(t) F(t) dt]   (D44)

If, during the information-gathering interval, this new in­formation was simply <F(t» '" <F> , it is easily shown from(DI7) that we have identically 0

Jo

A(t) F(t) dt = 0 (D45)-TIn words: if the new information is redundant (in the sensethat it is only what we would have predicted from the oldinformation), then it will drop out of the equations and theensemble is unchanged. This is a general property of theformalism here presented. In applications it means that thereis never any need, when setting up an ensemble, to ascertain


whether the different pieces of information used are independent; any redundant parts will drop out automatically.
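A discrete toy version of this property may make it vivid; the following sketch is mine, not the paper's, and the particular observables and numbers are arbitrary. One constraint ⟨f₁⟩ is imposed by maximum entropy; the resulting distribution's own prediction for ⟨f₂⟩ is then fed back in as a second "datum," and its Lagrange multiplier duly comes out zero.

```python
# Hedged illustration of the redundancy property stated after (D45).
import numpy as np
from scipy.optimize import fsolve

x = np.arange(1, 7)                 # a six-valued variable
f1, f2 = x, x**2                    # two "observables"

def p(lams):                        # maxent distribution p_i ~ exp(-l1*f1 - l2*f2)
    w = np.exp(-lams[0] * f1 - lams[1] * f2)
    return w / w.sum()

F1 = 4.5                            # the given datum  <f1> = 4.5
l1 = fsolve(lambda l: [p([l[0], 0.0]) @ f1 - F1], [0.0])[0]
F2 = p([l1, 0.0]) @ f2              # what that ensemble already predicts for <f2>

# now impose both <f1> = F1 and the redundant <f2> = F2
lams = fsolve(lambda l: [p(l) @ f1 - F1, p(l) @ f2 - F2], [0.0, 0.0])
print(lams)                         # second multiplier is numerically zero
```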

If, therefore, we treat the integral in (D44) as a small perturbation, we are expanding in powers of the departure from equilibrium. For validity of the perturbation scheme it is not necessary that λ(t)F(t) be everywhere small; it is sufficient if the integral is small. First-order effects in the departure from equilibrium, such as linear diffusion or heat flow, are then predicted using the general formula (D32), with the choices A = −βH, and

$$ \varepsilon B = \int_{-\tau}^{0} \lambda(t)\, F(t)\, dt \qquad (D46) $$

For constant H, the Heisenberg operator F(t) reduces to

$$ F(t) = \exp(iHt/\hbar)\, F(0)\, \exp(-iHt/\hbar) \qquad (D47) $$

and its Kubo transform (D33) becomes

$$ \bar F(t) = \frac{1}{\beta} \int_0^{\beta} du\; F(t - i\hbar u) \qquad (D48) $$

the characteristic quantity of the Kubo (1957, 1958) theory. In the notation of (D34), the first-order expectation value of any quantity C(t) will then be given by

$$ \langle C(t)\rangle - \langle C\rangle_0 = \int_{-\tau}^{0} K_{CF}(t,t')\, \lambda(t')\, dt' \qquad (D49) $$

where K_CF is now indicated as a function of the parameters t, t' contained in the operators:

$$ K_{CF}(t,t') = \langle \bar F(t')\, C(t)\rangle_0 - \langle F\rangle_0\,\langle C\rangle_0 \qquad (D50) $$

Remembering that the parameters t, t' are part of the operators C, F, the general reciprocity law (D37) now becomes

$$ K_{CF}(t,t') = K_{FC}(t',t) \qquad (D51) $$

When H is constant, it follows also from (D47) that

$$ K_{CF}(t,t') = K_{CF}(t-t') \qquad (D52) $$

and (D41) becomes

$$ i\hbar\, \frac{\partial}{\partial t}\, K_{CF}(t,t') = \langle [C(t), F(t')]\rangle_0 \qquad (D53) $$


Integral Equations for the Lagrange Multipliers. We wish to find the λ_k(x,t) to first order in the given departures from equilibrium, ⟨F_k(x,t)⟩ − ⟨F_k(x,t)⟩₀. This could be done by direct application of the formalism: by finding the perturbation expansion of log Z to second order in the λ's and taking the functional derivative explicitly according to (D21). It will be sufficient to do this for the simpler case described by Equations (D44)-(D53); but on carrying through this calculation we discover that the result is already contained in our perturbation-theory formula (D49). This is valid for any operator C(t), and therefore in particular for the choice C(t) ≡ F(t). Then (D49) becomes

$$ \int_{-\tau}^{0} K_{FF}(t,t')\, \lambda(t')\, dt' = \langle F(t)\rangle - \langle F\rangle_0 \qquad (D54) $$

If t is in the "information-gathering intervart' (-t ~ t ~ 0)this is identical with what we get on writing out (D17) ex­plicitly with log Z expanded to second order. In lowest order,then, taking the functional derivative of log Z has the effectof constructing a linear Fredholm integral equation for A(t),in which the "driving fercel! is the given departure fromequilibrium.

However, from that direct manner of derivation it would appear that (D54) applies only when −τ ≤ t ≤ 0; while the derivation of (D49) makes it clear that (D54) has a definite meaning for all t. When t is in −τ ≤ t ≤ 0, it represents the integral equation from which λ(t) is to be determined; when t > 0, it represents the predicted future of F(t); and when t < −τ, it represents the retrodicted past of F(t).

If the information about ⟨F(t)⟩ is given in the entire past, τ = ∞, (D54) becomes a Wiener-Hopf equation. Techniques for the solution, involving matching functions on strips in the complex Fourier transform space, are well known; we remark only that the solution λ(t) will in general contain a δ-function singularity at t = 0; it is essential to retain this in order to get correct physical predictions. In other words, the upper limit of integration in (D54) must be taken as (0+). The case of finite τ, where we generally have δ-functions at both end-points, is discussed by Middleton (1960).

For example, with a Lorentzian correlation function

$$ K_{FF}(t-t') = \frac{\alpha}{2}\, e^{-\alpha |t-t'|} \qquad (D55) $$

and τ = ∞, the solution is found to be

$$ \lambda(t) = \Bigl(1 - \frac{1}{\alpha^{2}}\frac{d^{2}}{dt^{2}}\Bigr) f(t) + \frac{1}{\alpha^{2}}\, f(0)\,\bigl[\alpha\,\delta(t) - \delta'(t)\bigr] \qquad (D56) $$


where

$$ f(t) = \begin{cases} \langle F(t)\rangle - \langle F\rangle_0\,, & t < 0 \\ 0\,, & t > 0 \end{cases} \qquad (D57) $$

Then we find

$$ \int_{-\infty}^{0^{+}} K_{FF}(t-t')\,\lambda(t')\, dt' = \begin{cases} f(t)\,, & t < 0 \\ f(0)\, e^{-\alpha t}\,, & t > 0 \end{cases} \qquad (D58) $$

in which it is necessary to note the δ-functions in f″(t) at the upper limit. The nature and need for these δ-functions becomes clear if we approach the solution as the limit of the solutions for a sequence {fₙ} of "good" driving functions, each of which satisfies the same boundary conditions as K_FF at the upper limit:

$$ \Bigl[\, f_n'(t')\, K_{FF}(t-t') - f_n(t')\, \frac{\partial}{\partial t'} K_{FF}(t-t')\Bigr]_{t'=0} = 0\,, \qquad t < 0 \qquad (D59) $$

The result (D58) thus predicts the usual exponential approach back to equilibrium, with a relaxation time τ = α⁻¹.

The particular correlation function (D55) is "Markoffian" in that the predicted future decay depends only on the specified departure from equilibrium at t = 0, and not on information about its past history. With other forms of correlation function we get a more complicated prediction, with in general more than one relaxation time.
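As a numerical illustration (not from the paper; the kernel, grid, and driving function below are arbitrary assumptions), one can discretize (D54) on [−τ, 0], solve for λ(t'), and confirm the Markoffian prediction (D58) that the future decay depends only on f(0):

```python
# Hedged sketch: discretize (D54) with the kernel (D55), solve for lambda,
# and compare the predicted future of F(t) with f(0)*exp(-alpha*t) from (D58).
import numpy as np

alpha, tau, N = 1.0, 10.0, 800
tp = np.linspace(-tau, 0.0, N)            # t' grid over the data interval
dt = tp[1] - tp[0]
K = lambda s: 0.5 * alpha * np.exp(-alpha * np.abs(s))   # K_FF(t-t'), (D55)

f = np.exp(0.3 * tp)                      # a given past departure, with f(0) = 1
w = np.full(N, dt); w[[0, -1]] = dt / 2   # trapezoid weights
M = K(tp[:, None] - tp[None, :]) * w[None, :]
lam = np.linalg.solve(M, f)               # discrete lambda(t'), cf. (D54)

t_future = np.linspace(0.1, 3.0, 8)
pred = (K(t_future[:, None] - tp[None, :]) * w[None, :]) @ lam
print(np.max(np.abs(pred - np.exp(-alpha * t_future))))   # should be small
```

Refining the grid should drive the printed discrepancy toward zero; the large value of the discrete λ in its last bin is the finite-grid image of the δ-function singularity discussed above.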

Relation to the Wiener Prediction Theory. This problem is so similar conceptually to Wiener's (1949) problem of optimal prediction of the future of a random function whose past is given, that one would guess them to be mathematically related. However, this is not obvious from the above, because the Wiener theory was stated in entirely different terms. In particular, it contained no quantity such as λ(t) which enables us to express both the given past and predicted future of F(t) in a single equation (D54). To establish the connection between these two theories, and to exhibit an alternative form of our theory, we may eliminate λ(t) by the following purely formal manipulations.

If the resolvent K_FF⁻¹(t,t') of the integral equation (D54) can be found, so that

$$ \int_{-\tau}^{0} K_{FF}(t,t'')\, K_{FF}^{-1}(t'',t')\, dt'' = \delta(t-t')\,, \qquad -\tau \le t,\, t' \le 0 \qquad (D60) $$

$$ \int_{-\tau}^{0} K_{FF}^{-1}(t,t'')\, K_{FF}(t'',t')\, dt'' = \delta(t-t')\,, \qquad -\tau \le t,\, t' \le 0 \qquad (D61) $$


then

$$ \lambda(t) = \int_{-\tau}^{0} K_{FF}^{-1}(t,t')\,\bigl[\langle F(t')\rangle - \langle F\rangle_0\bigr]\, dt'\,, \qquad -\tau \le t \le 0 \qquad (D62) $$

and the predicted value (D49) of any quantity C(t) can be expressed directly as a linear combination of the given departures of F from equilibrium:

$$ \langle C(t)\rangle - \langle C\rangle_0 = \int_{-\tau}^{0} R_{CF}(t,t')\,\bigl[\langle F(t')\rangle - \langle F\rangle_0\bigr]\, dt' \qquad (D63) $$

in which

$$ R_{CF}(t,t') \equiv \int_{-\tau}^{0} K_{CF}(t,t'')\, K_{FF}^{-1}(t'',t')\, dt'' \qquad (D64) $$

will be called the relevance function.

In consequence of (D61), the relevance function is itself the solution of an integral equation:

$$ K_{CF}(t,t') = \int_{-\tau}^{0} R_{CF}(t,t'')\, K_{FF}(t'',t')\, dt''\,, \qquad -\infty < t < \infty \qquad (D65) $$

so that, in some cases, the final prediction formula (D63) can be obtained directly from (D65) without the intermediate step of calculating λ(t).

In the Wiener theory we have a random function f(t) whose past is known. For any "lead time" h > 0, we are to try to predict the value of f(t+h) by a linear operation on the past of f(t), i.e., the prediction is

$$ \hat f(t+h) = \int_0^{\infty} f(t-t')\, W(t')\, dt' \qquad (D66) $$

and the problem is to find that W(t) which minimizes the mean square error of the prediction:

$$ I[W] = \lim_{T\to\infty} \frac{1}{2T} \int_{-T}^{T} \bigl|\hat f(t+h) - f(t+h)\bigr|^{2}\, dt \qquad (D67) $$

We readily find that the optimal W satisfies the Wiener-Hopf integral equation

$$ \phi(t+h) = \int_0^{\infty} \phi(t-t')\, W(t')\, dt'\,, \qquad t > 0 \qquad (D68) $$

where

$$ \phi(t) \equiv \lim_{T\to\infty} \frac{1}{2T} \int_{-T}^{T} f(t+t')\, f(t')\, dt' \qquad (D69) $$

is the autocorrelation function of f(t), assumed known.

Evidently, the response function W(t) corresponds to our relevance function R_FF(t,t'); and to establish the formal


identity of the two theories, we need only show that R also satisfies the integral equation (D68). But, with the choice C(t) = F(t), this is included in (D65), making the appropriate changes in notation, our "quantum covariance function" K_FF(t) corresponding to Wiener's autocorrelation function φ(t). In the early stages of this work, the discovery of the formal identity between Wiener's prediction theory and this special case of the maximum-entropy prediction theory was an important reassurance.
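As a concrete check, added here and not worked out in the text: for the Markoffian example above, take φ(t) = e^{−α|t|}. Then W(t') = e^{−αh} δ(t') satisfies (D68), since for t > 0

$$ \int_0^{\infty} e^{-\alpha|t-t'|}\, e^{-\alpha h}\,\delta(t')\, dt' = e^{-\alpha h}\, e^{-\alpha t} = \phi(t+h)\,, $$

so the optimal Wiener predictor is f̂(t+h) = e^{−αh} f(t), the same exponential regression toward equilibrium obtained in (D58).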

The relevance function R_CF(t,t') summarizes the precise extent to which information about F at time t' is relevant to prediction of C at time t. It is entirely different from the physical impulse-response function φ_CF(t−t') discussed, for example, by Kubo (1958), Eq. (2.18). The latter represents the dynamical response ⟨C(t)⟩ − ⟨C⟩₀ at a time t > t', to an impulsive force term in the Hamiltonian applied at t = t': H(t) = H₀ + F δ(t−t'); while in (D63) the "input" ⟨F(t')⟩ − ⟨F⟩₀ consists only of information concerning what the system, with a fixed Hamiltonian but in a nonequilibrium state, was doing in the interval −τ ≤ t' ≤ 0. This distinction is perhaps brought out most clearly by emphasizing again that (D63) is valid for an arbitrary time t, which may be before, within, or after this information-gathering interval. Thus, while our conception of causality is based on the postulate that a force applied at time t' can exert physical influences only at later times, there is no such limitation in (D63). It therefore represents an explicit statement of the fact that, while physical influences propagate only forward in time, logical inferences propagate equally well in either direction; i.e., new information about the present affects our knowledge of the past as well as the future. Although relations such as (D63) have been rather rare in physics, the situation is, of course, commonplace in other fields; sciences such as geology depend on logical connections of this type.

Space-Time Variations. Suppose the particle density n(x,t) departs slightly from its equilibrium value in a space-time region R. Defining δn(x,t) ≡ n(x,t) − ⟨n(x)⟩₀, the ensemble containing this information corresponds to the partition functional

$$ Z[\beta, \lambda(x,t)] = \mathrm{Tr}\,\exp\Bigl[-\beta H + \int_R \lambda(x,t)\,\delta n(x,t)\, d^{3}x\, dt\Bigr] \qquad (D70) $$

and (D34) becomes

$$ \langle \delta n(x,t)\rangle = \int_R K_{nn}(x-x';\, t-t')\, \lambda(x',t')\, d^{3}x'\, dt' \qquad (D71) $$


When (x,t) are in R this represents the integral equation determining λ(x',t'); when (x,t) are outside R it gives the predicted nonequilibrium behavior of ⟨n(x,t)⟩ based on this information, and the predicted departure from equilibrium of any other quantity J(x,t) is

$$ \langle J(x,t)\rangle - \langle J\rangle_0 = \int_R K_{Jn}(x-x';\, t-t')\, \lambda(x',t')\, d^{3}x'\, dt' \qquad (D72) $$

To emphasize the generality of (D72), note that it contains no limitation on time scale or space scale. Thus it encompasses both diffusion and ultrasonic propagation.

In (D71) we see the deviation ⟨δn⟩ expressed as a linear superposition of basic relaxation functions K_nn(x,t) = ⟨δn̄(0,0) δn(x,t)⟩₀, with λ(x',t') as the "source" function. The class of different nonequilibrium ensembles based on information about ⟨δn⟩ is in 1:1 correspondence with different functions λ(x,t). In view of the linearity, we may superpose elementary solutions in any way; and while to solve a problem with specific given information would require that we solve an integral equation, we can extract the general laws of nonequilibrium behavior from (D71), (D72) without this, by considering λ(x,t) as the independent variable.

For example, let J_α be the α-component of particle current, and for brevity write the covariance function in (D72) as

$$ \langle \delta\bar n(x',t')\, J_\alpha(x,t)\rangle_0 = K_\alpha(x-x';\, t-t') \qquad (D73) $$

Now choose R as all space and all time t < 0, and take

$$ \lambda(x',t') = \mu(x')\, q(t')\,, \qquad t' < 0 \qquad (D74) $$

where μ(x), q(t) are "arbitrary" functions (but of course, sufficiently well-behaved so that what we do with them makes sense mathematically). In this ensemble, the current is

$$ \langle J_\alpha(x,t)\rangle = \int d^{3}x'\, \mu(x') \int_{-\infty}^{0} dt'\, q(t')\, K_\alpha(x-x',\, t-t') \qquad (D75) $$

Integrate by parts on t' and use the identity ṅ + ∇·J = 0; the RHS of (D75) becomes

$$ \int d^{3}x'\, \mu(x')\Bigl[\, q(0)\, K_\alpha(x-x',\, 0) + \frac{\partial}{\partial x'_\beta} \int_{-\infty}^{0} dt'\, q(t')\, K_{\alpha\beta}(x-x',\, t-t')\Bigr] \qquad (D76) $$

where K_αβ is the current-current covariance:

$$ K_{\alpha\beta}(x-x',\, t-t') \equiv \langle \bar J_\beta(x',t')\, J_\alpha(x,t)\rangle_0 \qquad (D77) $$

But from symmetry, K_α(x−x', 0) ≡ 0. Another integration by parts then yields


$$ \langle J_\alpha(x,t)\rangle = -\int_{-\infty}^{0} dt'\, q(t') \int d^{3}x'\, K_{\alpha\beta}(x-x',\, t-t')\, \frac{\partial \mu(x')}{\partial x'_\beta} \qquad (D78) $$

and thus far no approximations have been made.

Now let us pass to the "long wavelength" limit by supposing μ(x) so slowly varying that ∂μ/∂x'_β is essentially a constant over distances in which K_αβ is appreciable:

$$ \langle J_\alpha(x,t)\rangle \simeq -\frac{\partial \mu}{\partial x_\beta} \int_{-\infty}^{0} dt'\, q(t') \int d^{3}x'\, K_{\alpha\beta}(x-x',\, t-t') \qquad (D79) $$

and in the same approximation (D71) becomes

$$ \langle \delta n(x,t)\rangle \simeq q(0)\, \mu(x) \int d^{3}x'\, K_{nn}(x',\, 0) \qquad (D80) $$

Therefore, the theory predicts the relation

$$ \langle J_\alpha\rangle = -D_{\alpha\beta}\, \frac{\partial \langle \delta n\rangle}{\partial x_\beta} \qquad (D81) $$

with

$$ D_{\alpha\beta} = \frac{\displaystyle \int_{-\infty}^{0} dt'\, q(t') \int d^{3}x'\, K_{\alpha\beta}(x-x',\, t-t')}{\displaystyle q(0) \int d^{3}x'\, K_{nn}(x-x',\, 0)} \qquad (D82) $$

If the ensemble is also quasi-stationary, q(t) very slowly varying, only the value of q(t) near the upper limit matters, and the choice q(t) = exp(εt) is as good as any. This leads to just the Kubo expression for the diffusion coefficient.

If instead of taking the long-wavelength limit we choose a plane wave, μ(x) = exp(ik·x), then (D75), (D71) become

$$ \langle J_\alpha(x,t)\rangle = e^{ik\cdot x} \int_{-\infty}^{0} dt'\, q(t')\, K_\alpha(k;\, t-t') $$

$$ \langle \delta n(x,t)\rangle = e^{ik\cdot x} \int_{-\infty}^{0} dt'\, q(t')\, K_{nn}(k;\, t-t') \qquad (D84) $$

where K_α(k,t), K_nn(k,t) are the space Fourier transforms. These represent the decay of sound waves as linear superpositions of many characteristic decays ~ K_α(k,t), K_nn(k,t) with various "starting times" t'. If we take time Fourier transforms, (D84) becomes

$$ \langle \delta n(x,t)\rangle = e^{ik\cdot x} \int \frac{d\omega}{2\pi}\, K_{nn}(k,\omega)\, Q(\omega)\, e^{-i\omega t} \qquad (D85) $$

which shows how the exact details of the decay depend on the method of preparation (past history) as summarized by Q(ω). Now, however, we find that K_nn(k,ω) usually has a sharp peak at


some frequency ω = ω₀(k), which arises mathematically from a pole near the real axis in the complex ω-plane. Thus if ω₁ = ω₀ − iα and K_nn(k,ω) has the form

$$ K_{nn}(k,\omega) = \frac{-iK_1}{\omega - \omega_1} + \tilde K(k,\omega) \qquad (D86) $$

where K̃(k,ω) is analytic in a neighborhood of ω₁, this pole will give a contribution to the integral (D85) of

$$ K_1\, Q(\omega_1)\, e^{i(k\cdot x - \omega_0 t)}\, e^{-\alpha t}\,, \qquad t > 0 \qquad (D87) $$

Terms which arise from parts of K_nn(k,ω)Q(ω) that are not sharply peaked as a function of ω decay rapidly and represent short transient effects that depend on the exact method of preparation. If α is small, the contribution (D87) will quickly dominate them, leading to a long-term attenuation and propagation velocity essentially independent of the method of preparation.

Thus, the laws of ultrasonic dispersion and attenuation are contained in the location and width of the sharp peaks in K_nn(k,ω).

Other Forms of the Theory. Thus far we have considered the application of maximum entropy in its most general form: given some arbitrary initial information, to answer an arbitrary question about reproducible effects. Of course, we may ask any question we please; but maximum entropy can make sharp predictions only of reproducible things (that is in itself a useful property, for maximum entropy can tell us which things are and are not reproducible, by the sharpness of its predictions). Maximum entropy separates out what is relevant for predicting reproducible phenomena, and discards what is irrelevant (we saw this even in the example of Wolf's die where, surely, the only reproducible events in his sample space of 6^20,000 points were the six face frequencies or functions of them; just the things that maximum entropy predicted).

Likewise, in the stage 2 techniques of prediction from the maximum entropy distribution, if we are not interested in every question about reproducible effects, but only some "relevant subset" of them, we may seek a further elimination of details that are irrelevant to those particular questions. But this kind of problem has come up before in mathematical physics; and Dicke (1946) introduced an elegant projection operator technique for calculating desired external elements of a scattering matrix while discarding irrelevant internal details. Our present problem, although entirely different in physical and mathematical


details, is practically identical formally; and so this same technique must be applicable.

Zwanzig (1962) introduced projection operators for dealing with two interacting systems, only one of which is of interest, the other serving only as its "heat bath." Robertson (1964) recognized that this will work equally well for any kind of separation, not necessarily spatial; i.e., if we want to predict only the behavior of a few physical quantities {F₁, ..., F_m}, we can introduce a projection operator P which throws away everything that is irrelevant for predicting those particular things, allowing, in effect, everything else to serve as a kind of "heat bath" for them.

In the statistical theory this dichotomy may be viewed in another way: instead of "relevant" and "irrelevant" read "systematic" and "random." Then, referring to Robertson's presentation in this volume, it is seen that Eq. (9.3), which could be taken as the definition of his projection operator P(t), is formally just the same as the solution of the problem of "reduction of equations of condition" given by Laplace for the optimal estimate of systematic effects. A modern version can be found in statistics textbooks, under the heading "multiple regression." Likewise, his formulas (9.8), (9.9) for the "subtracted correlation functions" have a close formal correspondence to Dicke's final formulas.

Of course, this introduction of projection operators is not absolutely required by basic principles; it is in the realm of art, and any work of art may be executed in more than one way. All kinds of changes in detail may still be thought of; but needless to say, most of them have been thought of already, investigated, and quietly dropped. Seeing how far Robertson has now carried this approach, and how many nice results he has uncovered, it is pretty clear that anyone who wants to do it differently has his work cut out for him.

Finally, I should prepare the reader for his contribution. When Baldwin was a student of mine in the early 1960's, I learned that he has the same trait that Lagrange and Fermi showed in their early works: he takes delight in inventing tricky variational arguments which seem at first glance totally wrong. After long, deep thought it always developed that what he did was correct after all. A beautiful example is his derivation of (4.3), where most readers would leave the track without this hint from someone with experience in reading his works: you are not allowed to take the trace and thus prove that a = 1, invalidating (4.4), because this is a formal argument in which the symbol ⟨F⟩ stands for Tr(Fσ) even when σ is not normalized to Tr(σ) = 1. For a similar reason, you are not allowed to protest that if F₀ = 1, then δ⟨F₀⟩ = 0. One of my


mathematics professors once threw our class into an uproar by the same trick: evaluating an integral, correctly, by differentiating with respect to n. For those with a taste for subtle trickery, variational mathematics is the most fun of all.

References

T. Bayes (1763), Phil. Trans. Roy. Soc. 330-418. Reprint, with biographical note by G. A. Barnard, in Biometrika 45, 293 (1958) and in Studies in the History of Statistics and Probability, E. S. Pearson and M. G. Kendall, editors, C. Griffin and Co. Ltd., London (1970). Also reprinted in Two Papers by Bayes with Commentaries, W. E. Deming, editor, Hafner Publishing Co., New York (1963).

E. T. Bell (1937), Men of Mathematics, Simon and Schuster, N.Y.

L. Boltzmann (1871), Wien. Ber. 63, 397, 679, 712.

G. Boole (1854), The Laws of Thought, Reprinted by DoverPublications, Inc., New York.

R. T. Cox (1946), Am. J. Phys. 14, 1. See also The Algebra of Probable Inference, Johns Hopkins University Press (1961); reviewed by E. T. Jaynes in Am. J. Phys. 31, 66 (1963).

E. Czuber (1908), Wahrscheinlichkeitsrechnung und ihre Anwendung auf Fehlerausgleichung, Teubner, Berlin. Two volumes.

P. and T. Ehrenfest (1912), Encykl. Math. Wiss. English translation by M. J. Moravcsik, The Conceptual Foundations of the Statistical Approach in Mechanics, Cornell University Press, Ithaca, N.Y. (1959).

R. L. Ellis (1842), "On the Foundations of the Theory of Probability," Camb. Phil. Soc. Vol. viii. The quotation given by Rawlinson actually appears in a later work: Phil. Mag. Vol. xxxvii (1850). Both are reprinted in his Mathematical and Other Writings (1863).

K. Friedman and A. Shimony (1971), J. Stat. Phys. 3, 381. See also M. Tribus and H. Motroni, ibid. 4, 227; A. Hobson, ibid. 4, 189; D. Gage and D. Hestenes, ibid. 7, 89; A. Shimony, ibid. 9, 187; K. Friedman, ibid. 9, 265.


J. W. Gibbs (1902), Elementary Principles in Statistical Mechanics. Reprinted in Collected Works and Commentary, Yale University Press (1936), and by Dover Publications, Inc. (1960).

I. J. Good (1950), Probability and the Weighing of Evidence, C. Griffin and Co., London.

D. Heath and Wm. Sudderth (1976), "de Finetti's Theorem on Exchangeable Variables," The American Statistician 30, 188.

D. ter Haar (1954), Elements of Statistical Mechanics, Rinehart and Co., New York.

S. P. Heims and E. T. Jaynes (1962), Rev. Mod. Phys. 34, 143.

V. S. Huzurbazar (1976), Sufficient Statistics (Volume 19 in Statistics Textbooks and Monographs Series), Marcel Dekker, Inc., New York.

E. T. Jaynes (1957a,b), Phys. Rev. 106, 620; 108, 171.

E. T. Jaynes (1963a), "New Engineering Applications of Information Theory," in Proceedings of the First Symposium on Engineering Applications of Random Function Theory and Probability, J. L. Bogdanoff and F. Kozin, editors; J. Wiley and Sons, New York, pp. 163-203.

E. T. Jaynes (1963b), "Information Theory and Statistical Mechanics," in Statistical Physics (1962 Brandeis Lectures), K. W. Ford, ed., W. A. Benjamin, Inc., New York, pp. 181-218.

E. T. Jaynes (1965), "Gibbs vs. Boltzmann Entropies," Am. J. Phys. 33, 391.

E. T. Jaynes (1967), "Foundations of Probability Theory and Statistical Mechanics," in Delaware Seminar in the Foundations of Physics, M. Bunge, ed., Springer-Verlag, Berlin; pp. 77-101.

E. T. Jaynes (1968), "Prior Probabilities," IEEE Trans. on Systems Science and Cybernetics, SSC-4, pp. 227-241. Reprinted in Concepts and Applications of Modern Decision Models, V. M. Rao Tummala and R. C. Henshaw, eds. (Michigan State University Business Studies Series, 1976).

E. T. Jaynes (1971), "Violation of Boltzmann's H-Theorem in Real Gases," Phys. Rev. A 4, 747.


E. T. Jaynes (1973), "The Well-Posed Problem," Found. Phys. 3, 477.

E. T. Jaynes (1976), "Confidence Intervals vs. Bayesian Intervals," in Foundations of Probability Theory, Statistical Inference, and Statistical Theories of Science, W. L. Harper and C. A. Hooker, eds., D. Reidel Publishing Co., Dordrecht-Holland, pp. 175-257.

E. T. Jaynes (1978), "Electrodynamics Today," in Proceedings of the Fourth Rochester Conference on Coherence and Quantum Optics, L. Mandel and E. Wolf, eds., Plenum Press, New York.

E. T. Jaynes (1979), "Marginalization and Prior Probabilities," in Studies in Bayesian Econometrics and Statistics, A. Zellner, ed., North-Holland Publishing Co., Amsterdam (in press).

H. Jeffreys (1939), Theory of Probability, Oxford University Press.

J. M. Keynes (1921), A Treatise on Probability, MacMillan and Co., London.

R. Kubo (1957), J. Phys. Soc. (Japan) 12, 570.

R. Kubo (1958), "Some Aspects of the Statistical-Mechanical Theory of Irreversible Processes," in Lectures in Theoretical Physics, Vol. 1, W. E. Brittin and L. G. Dunham, eds., Interscience Publishers, Inc., New York; pp. 120-203.

S. Kullback (1959), Information Theory and Statistics, J. Wiley and Sons, Inc.

J. Lindhard (1974), "On the Theory of Measurement and Its Consequences in Statistical Dynamics," Kgl. Danske Vid. Selskab Mat.-Fys. Medd. 29, 1.

J. C. Maxwell (1859), "Illustrations of the Dynamical Theory of Gases," Collected Works, W. D. Niven, ed., London (1890), Vol. I, pp. 377-409.

J. C. Maxwell (1876), "On Boltzmann's Theorem on the Average Distribution of Energy in a System of Material Points," ibid., Vol. II, pp. 713-741.

D. Middleton (1960), Introduction to Statistical Communication Theory, McGraw-Hill Book Company, Inc., New York, pp. 1082-1102.


H. Mori, I. Oppenheim, and J. Ross (1962), "Some Topics in Quantum Statistics," in Studies in Statistical Mechanics, Volume 1, J. de Boer and G. E. Uhlenbeck, eds., North-Holland Publishing Company, Amsterdam; pp. 213-298.

G. Polya (1954), Mathematics and Plausible Reasoning, Princeton University Press, two volumes.

B. Robertson (1964), Stanford University, Ph.D. Thesis.

J. S. Rawlinson (1970), "Probability, Information, and Entropy,"Nature, 225, 1196.

L. J. Savage (1954), The Foundations of Statistics, John Wiley and Sons, Inc.; Revised Edition by Dover Publications, Inc., 1972.

E. Schrödinger (1948), Statistical Thermodynamics, Cambridge University Press.

C. Shannon (1948), Bell System Tech. J. 27, 379, 623. Reprinted in C. E. Shannon and W. Weaver, The Mathematical Theory of Communication, University of Illinois Press, Urbana (1949).

C. Truesdell (1960), "Ergodic Theory in Classical Statistical Mechanics," in Ergodic Theories, Proceedings of the International School of Physics "Enrico Fermi," P. Caldirola, ed., Academic Press, New York, pp. 21-56.

J. Venn (1866), The Logic of Chance

R. von Mises (1928), Probability, Statistics and Truth, J. Springer. Revised English Edition, H. Geiringer, ed., George Allen and Unwin Ltd., London (1957).

A. Wald (1950), Statistical Decision Functions, John Wiley and Sons, Inc., New York.

J. D. Watson (1965), Molecular Biology of the Gene, W. A.Benjamin, Inc., New York. See particularly pp. 73-100.

N. Wiener (1949), Extrapolation, Interpolation, and Smoothing of Stationary Time Series, Technology Press, M.I.T.

S. S. Wilks (1961), Foreword to Gerolamo Cardano, The Book on Games of Chance, translated by S. H. Gould, Holt, Rinehart and Winston, New York.


R. Zwanzig (1960), J. Chem. Phys. 33, 1338; Phys. Rev. 124,983 (1961).
