
Inference in Infinite Dimensional Inverse Problems: Discretization and Duality

By Philip B. Stark

Department of Statistics, University of California, Berkeley, California 94720

Technical Report No. 261, July 1990



July 3, 1990

Abstract

Many techniques for solving inverse problems involve approximating the unknown model, a function, by a finite-dimensional "discretization" or parametric representation. The uncertainty in the computed solution is sometimes taken to be the uncertainty within the parametrization; this can result in unwarranted confidence. This paper presents a method for overcoming the limitations of discretization within the "strict bounds" formalism, a technique for constructing confidence intervals for functionals of the unknown model which can incorporate certain types of prior information. The usual computational approach to strict bounds approximates the "primal" problem in such a way that the resulting confidence intervals are at most long enough to have the nominal coverage probability. There is another approach based on "dual" optimization problems which gives confidence intervals with at least the nominal coverage probability. The pair of intervals derived by the two approaches bracket the correct confidence interval. Gravimetric, seismic, and geomagnetic illustrations are given, and a numerical example in seismology.

Keywords. Bounds on functionals, strict bounds, extremal bounds, confidence set inference, inverse theory, inference, discretization error, conjugate duality.

Acknowledgements. I am grateful to George E. Backus, Peter J. Bickel, Mikhail A. Brodsky, David L. Donoho, Douglas O. Gough, Niklaus Hengartner, Robert L. Parker, and Naum Z. Shor for helpful and encouraging discussions. Michael Stewart gave valuable computational assistance on the Berkeley Cray X-MP/14. This work was supported by National Science Foundation Fellowship DMS-8705843, NSF Presidential Young Investigator award DMS-8957573, NSF Grant DMS-8810192, and NASA Grant NAG 5-883.


0 Preamble

Figure 1 presents two confidence regions for seismic velocity in Earth's core, based on travel-time data reduced by Johnson and Lee [30] and the assumption that velocity increases monotonically in the core (see [64]). Both regions have a nominal 99.9% confidence level. The inner (dashed) envelope was computed by Stark and Parker [61] using a technique that approximates the space of velocity models by a finite-dimensional subspace (discretization of the model space). The envelope is therefore optimistic: its coverage probability may be less than the nominal level, since many elements of the space are not available to the algorithm. The wider (solid) envelope was computed by a technique described below in detail, especially in section 7.3.2 and appendix A. That technique, based on a dual problem, does not involve discretizing an infinite-dimensional space. The second envelope is conservative: its coverage probability is at least the nominal value, provided the assumptions about data errors and monotonicity constraints hold.

Stark [59] proves that as the number of functions used to approximate the space of velocity models increases (under mild conditions), the results based on discretization converge to the correct bounds. Stark and Parker [61] tested for convergence numerically by doubling the number of approximating functions: the computed bounds did not change visibly. Apparently, the bounds had converged. The solid bounds here rigorously limit the extent to which the approach based on discretization might be misleading. The two envelopes together bracket a true 99.9% confidence region for the inverse problem of inferring seismic velocity in the core from these data and prior information.

This paper gives a systematic treatment of this bracketing procedure for making inferences about functionals of the unknown model in infinite-dimensional constrained linear inverse problems.

1 Introduction

A fundamental difficulty in inverse problems is that generally an infinite number of models adequately account for our measurements (see, for example, Backus [2]). Strict bounds are a way to make rigorous inferences in the face of this nonuniqueness. Loosely, strict bounds determine shared properties of the set of models that satisfy our assumptions and fit the observations adequately. Here a "property" is the value of a functional, and for a property to be "shared" means that the range of values the functional takes on the set satisfies an inequality. More precisely, strict bounds are confidence intervals for functionals of the unknown model. Backus [7] calls this idea "Confidence Set Inference." Sabatier [53, 50] and Cuer [16] call the approach "well-posed questions." While Bayesian inversion, stochastic inversion, maximum entropy and other regularizing methods all solve the problem of construction, finding a plausible model that fits the data, Backus provides persuasive reasons for preferring strict bounds or confidence set inference for many problems of inference, characterizing the nonuniqueness in inverse problems. This is especially true for problems in whole-Earth geophysics, where there is no observational basis for establishing a prior distribution on the (infinite-dimensional) model space.

We compute strict bounds by using our prior knowledge about the model and the distribution of the data errors to define a confidence region for the true model in model space: a set of "acceptable" models. We then pose optimization problems: find the smallest and largest values some scientifically interesting functional takes on that set. The solutions to these optimization problems give a confidence interval for the functional of the true model. Many different functionals may be considered, corresponding to different properties of the model. If the range of a functional on the set of acceptable models is small, then the data and our prior information allow us to determine that parameter well. If the confidence interval is wide, the available information is insufficient to constrain the property we have chosen. Strict bounds techniques are also useful for experimental design. By computing strict bounds with different prior constraints and hypothetical data, we can determine what measurements, signal-to-noise ratios and prior information would be sufficient to constrain properties we want to know.

Not every functional is bounded above or below on the set of acceptable models, but often even one-sided bounds are informative. On the basis of linear observations without special prior information, norms and seminorms can be bounded only from below. With suitable norm definitions, Shure et al. [56] use geomagnetic observations to find lower bounds on the energy stored in the magnetic field of the Earth's core, and on the Ohmic dissipation within the core. Parker and others [46, 43] use norm and seminorm minimization to construct confidence regions for the paleopole position from measurements of seamount magnetism.

The strict bounds approach has been used in many branches of geophysics, including:

* Gravimetry: Parker [42, 44]; Safon et al. [54]; Huestis and Ander [28]

* Core magnetism: Shure et al. [56, 57]; Backus [7]

* Crustal magnetism: Parker and others [46, 43]; Huestis and Parker [29]; Bayer and Cuer [9]

* Magnetotellurics: Oldenburg [38]

* Seismic tomography: Grasso et al. [24]

* Travel-time seismology: Bessonova and others [10, 11]; Garmany et al. [22]; Orcutt [39]; Stark and others [64, 63, 62]; Wiggins et al. [65]

* Thermal history of the crust: McNutt and Royden [37]

Strict bounds have also been used to estimate power spectral density (Lang and Marzetta [32, 33, 36]). Lang [31], Oldenburg [38], Rust, Burris and others [49, 15, 47], and Sabatier [51] discuss formulations of the strict bounds problem for bounds on linear functionals when the prior information about the unknown is a pointwise linear inequality. Backus [7] discusses confidence intervals for linear functionals when the prior information is a bound on a quadratic form of the model. His treatment covers systematic errors and the effect of approximating the model by a finite-dimensional discretization, by means of a clever reduction of the infinite-dimensional problem to the standard statistical problem of finding confidence intervals in finite-dimensional problems. Lang's paper [31] is a particularly clear exposition of strict bounds, with a careful treatment of errors introduced in approximating infinite-dimensional models, and a discussion of the dual problem and approximations to it. Donoho [18] treats the problem of estimating a linear functional of a model in $\ell_2$ from linear data when the model is constrained to lie in a convex subset of $\ell_2$ and the data errors are Gaussian. His results include as a special case optimally short confidence intervals for the problem addressed by Backus [7], in the case the data errors are Gaussian. See section 9.2.

An advantage of the Backus-Gilbert approach to inference [2, 3, 4] and of minimum Hilbert-space norm problems is that the only models entering the computations are linear combinations of a finite number of known functions: the problems are really finite-dimensional. This is also the case with the approach of Donoho [18], at least when the prior information is a ball in $\ell_2$. When there is prior information about the model in the form of linear inequalities, when one wishes to look at other norms, or when the space of acceptable models is particularly nasty, the Hilbert-space setting is often unnatural, leading one to formulate the problem in a more general Banach space (however, cf. Donoho [18], who gives a theoretical treatment of arbitrary convex sets in a Hilbert space setting). The analogue of the projection theorem, which so readily allows the Hilbert-space minimum-norm problem to be reduced to a finite number of dimensions, can lead to nonlinear equations in these spaces. It can also be somewhat awkward to incorporate pointwise linear inequality constraints on the model in the Hilbert space framework: the point-evaluator cannot be represented as an inner product except in reproducing kernel Hilbert spaces [1], which demand a tremendous amount of smoothness of the model. Hilbert space is ideal when the information about the unknown is an upper bound on a quadratic form or a linear equality, but not a more general "ball" nor a cone (again, compare with [18]). The problems we must solve to make inferences if we wish to use, say, pointwise linear inequalities, have appeared to be relentlessly infinite-dimensional. In order to compute with real data, we seem to be faced with the problem of representing an infinite-dimensional model on a computer using only a finite number of parameters. It is clear that no matter what scheme we choose to represent the model space with only a finite number of parameters, we can always find some element of the space that is not approximated well. This may result in optimistic bounds on the functionals: the computed bounds may be narrower than the true bounds, giving a false impression of confidence. In some problems, it is possible to show that as the discretization improves, the bounds approach the correct value (e.g. Lang [31]; Stark [59]). There is always a practical limit to the number of parameters we can store and compute. How good are our answers at that practical limit?

Lang [31] studies a finite-dimensional dual problem for inferences about linear functionals of a model in the space of continuous functions, with a priori pointwise upper and lower bounds on the unknown and a 2-norm measure of misfit to the data. He shows that a discretized and smoothed version of the dual can be used to bracket rigorously the strict bounds. (He also shows that the solution to the discretized primal with a larger value of the acceptable misfit can be used to bracket the true primal value. That approach does not seem helpful in more general settings; see section 8.) This paper owes much to his work, and to the excellent book by Luenberger [35].

It turns out that there are finite-dimensional dual problems for strict bounds in linear inverse problems, even in quite general linear vector spaces. In the dual problem, the objects we encounter are linear combinations of the data functionals, but the functionals are no longer necessarily elements of the model space, as they are in the Hilbert-space setting, since the model space is not in general its own dual. The dual problem has another significant advantage beyond the fact that it is finite-dimensional: its constraints, which may be infinite-dimensional, involve known functions, often much nicer than many elements of the set of models we must consider in the primal problem. This can make it much easier to determine the effect of discretization on the dual problem: we shall see below that the discretized dual for many strict bounds problems can be implemented in a way that gives bounds guaranteed to be pessimistic (too wide). Since the usual approach to discretizing the primal problem gives bounds that are too narrow (as the entire model space is not considered), the discretized primal and dual can be used in complementary ways to bracket the correct values of the optimization problems.

Several of the papers mentioned above use a dual approach, for example, Lang [31], Parker [42, 44], and Bessonova et al. [10, 11]. The aim of this paper is to develop dual problems for strict bounds in linear inverse problems with a variety of prior constraints in a more inclusive formulation, and to show how to implement the dual so that it can be used in conjunction with the primal to bracket rigorously the correct values of the strict bounds optimization problems.

The paper is organized as follows: Section 2 sets up the strict bounds primal problem and touches very briefly on statistical and computational considerations for the primal approach. Section 3 introduces the dual in an intuitive fashion; section 4 gives an algebraic treatment. Section 5 develops the dual further for strict bounds on linear functionals; section 6 for norms and seminorms. Section 7 gives examples of bounding discretization error in some problems in gravimetry, seismology and geomagnetism, including a numerical example (Figure 1) in seismology, via dual optimization problems. Section 8 discusses primal approaches to bounding discretization error, including Lang's [31] and Backus' [7] techniques. Section 9 discusses confidence intervals for a single linear functional, including the methods of Backus [7] and Donoho [18]. Section 10 discusses the incorporation of systematic errors into the computations, with an illustration from helioseismology. Section 11 briefly discusses duality gaps and Lagrangian duality. It shows that for strict bounds problems, Lagrangian duality provides a more satisfactory set of sufficient conditions for the absence of a duality gap than Fenchel's duality theorem, which is the usual tool for conjugate duality. Section 12 summarizes and indicates some unanswered questions and directions for further research. Appendix A develops the computations needed for the numerical example in seismology.


2 Strict Bounds Primal Problems

2.1 Notation

Lowercase Greek letters, $\alpha, \beta, \ldots$ are elements of the real line $\mathbb{R}$, as are the letters $s$ and $t$, which will be used occasionally as independent variables. Boldface lowercase Greek letters, $\boldsymbol{\delta}, \boldsymbol{\lambda}, \ldots$ are finite-dimensional vectors (elements of $\mathbb{R}^n$). Lowercase Greek letters may also denote functions into the reals, or into $\mathbb{R}^n$ (boldface). Subsets of $\mathbb{R}$ and $\mathbb{R}^n$ are denoted by uppercase and bold uppercase Greek letters ($\Lambda$, $\boldsymbol{\Xi}$). Matrices will also be denoted by bold uppercase Greek letters ($\boldsymbol{\Gamma}$); elements of a matrix are denoted by the same letter, doubly indexed ($\Gamma_{ij}$). The lowercase italic Roman letters $i$ through $q$ are reserved for integers (except in sections 7.2, 7.3.2, 10.2 and appendix A as noted). The unknown model is $x_0$, which is assumed to be an element of the real linear vector space $X$. Lowercase italic Roman letters from the end of the alphabet, such as $x_1$, $x'$, $y$, are other elements of $X$. Subsets of $X$ are denoted by sans-serif uppercase Roman letters (e.g. $\mathsf{C}$). The dual space of $X$, the set of all linear functionals on $X$, is denoted $X^*$. Elements of $X^*$ are denoted by starred lowercase Roman letters such as $f^*$, $x^*$. Functionals may also be recognized by the fact that their arguments are enclosed by square brackets [ ]. Linearity of $x^*$ of course means that if $\{x_i\}_{i=1}^n$ are elements of $X$, and $\{\alpha_i\}_{i=1}^n$ are any $n$ real numbers, then

$$x^*\Bigl[\sum_{i=1}^n \alpha_i x_i\Bigr] = \sum_{i=1}^n \alpha_i x^*[x_i]. \tag{1}$$

Subsets of $X^*$ are denoted by starred sans-serif uppercase Roman letters ($\mathsf{C}^*$). Boldface starred lowercase Roman letters ($\mathbf{f}^*$) denote ordered $n$-tuples of linear functionals on $X$:

$$\mathbf{f}^*: X \to \mathbb{R}^n, \qquad x \mapsto \mathbf{f}^*[x] = (f_1^*[x], \ldots, f_n^*[x]).$$

General, possibly nonlinear functionals on $X$ (or subsets of $X$) are denoted by italic uppercase Roman letters ($H[x]$); and general, possibly nonlinear functionals on $X^*$ or subsets of it are denoted by starred italic uppercase Roman letters ($C^*[x^*]$). Primal and dual optimization problems (defined below) are denoted $\mathcal{P}$ and $\mathcal{D}$. $P\{\cdot\}$ is the probability that the event in braces occurs. The notation $C[a, b]$ and $C^0[a, b]$ will be used to denote the usual spaces of continuous functions ($C$ is normed with the "sup" norm; $C^0$ is not).

Our prior information may or may not endow $X$ with a natural norm. When $X$ is normed, it will be assumed that $X$ is complete with respect to the norm topology, so $X$ is a Banach space. In that case $X^*$ will be taken to be the normed dual of $X$ with the induced norm

$$\|x^*\| \equiv \sup_{\|x\| \le 1} x^*[x]. \tag{2}$$

It is well known that $X^*$ with this norm is also a Banach space [35]. No notational distinction will be made between the norm on $X$ and the norm on $X^*$.

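We possess $n$ real data $\delta_j$:

$$\delta_j = f_j^*[x_0] + \epsilon_j, \qquad j = 1, \ldots, n, \tag{3}$$

where the $f_j^*$ are linear functionals on $X$. If $X$ is normed, the $f_j^*$ will be assumed to lie in the normed dual $X^*$. The $\epsilon_j$ are observational and modeling errors associated with the $j$th measurement. The data relations will usually be written in vector form:

$$\boldsymbol{\delta} = \mathbf{f}^*[x_0] + \boldsymbol{\epsilon}. \tag{4}$$

We will frequently take the "dot" product of $\mathbf{f}^*$ with an $n$-vector $\boldsymbol{\lambda}$ of real numbers:

$$(\boldsymbol{\lambda} \cdot \mathbf{f}^*)[x] \equiv \Bigl(\sum_{j=1}^n \lambda_j f_j^*\Bigr)[x] \equiv \sum_{j=1}^n \lambda_j f_j^*[x], \qquad x \in X. \tag{5}$$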
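The data mapping and the "dot" product of functionals can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the text: models are ordinary functions of one variable, each data functional is an integral against a kernel, and the integral is approximated by a midpoint rule. The names `integral_functional`, `data_map` and `dot_functional` are hypothetical.

```python
from typing import Callable, List, Sequence

Model = Callable[[float], float]        # a model x is a function of one variable
Functional = Callable[[Model], float]   # a functional maps a model to a real number


def integral_functional(kernel: Model, a: float, b: float, steps: int = 2000) -> Functional:
    """Return the functional x -> integral over [a, b] of kernel(t) * x(t) dt.

    The integral is approximated with the midpoint rule (an assumption of this sketch).
    """
    width = (b - a) / steps

    def apply(x: Model) -> float:
        total = 0.0
        for k in range(steps):
            t = a + (k + 0.5) * width
            total += kernel(t) * x(t)
        return total * width

    return apply


def data_map(functionals: Sequence[Functional]) -> Callable[[Model], List[float]]:
    """The n-tuple f*: x -> (f_1*[x], ..., f_n*[x])."""
    return lambda x: [f(x) for f in functionals]


def dot_functional(lam: Sequence[float], functionals: Sequence[Functional]) -> Functional:
    """The linear functional (lambda . f*) of equation (5)."""
    return lambda x: sum(l * f(x) for l, f in zip(lam, functionals))
```

For example, with kernels $1$ and $t$ on $[0, 1]$ and the model $x(t) = t^2$, `data_map` returns approximately $(1/3, 1/4)$, and `dot_functional([2.0, 3.0], ...)` returns approximately $2/3 + 3/4$.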

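2.2 The Primal Problem $\mathcal{P}$

Let $P_{\boldsymbol{\epsilon}}$ be the joint distribution of the data errors $\boldsymbol{\epsilon}$. All we need assume formally about the errors is that we can find a $1 - \alpha$ confidence set $\boldsymbol{\Xi} \subset \mathbb{R}^n$ for the errors $\boldsymbol{\epsilon}$; i.e. any set such that

$$P\{\boldsymbol{\epsilon} \in \boldsymbol{\Xi}\} \ge 1 - \alpha, \tag{6}$$

or equivalently

$$\int_{\boldsymbol{\Xi}} dP_{\boldsymbol{\epsilon}} \ge 1 - \alpha.$$

The set can be used to define a confidence region for $x_0$: let

$$\mathsf{D} \equiv \{x \in X : \boldsymbol{\delta} - \mathbf{f}^*[x] \in \boldsymbol{\Xi}\} = \{x \in X : \mathbf{f}^*[x] \in \boldsymbol{\delta} - \boldsymbol{\Xi}\}, \tag{7}$$

where the set

$$\boldsymbol{\delta} - \boldsymbol{\Xi} \equiv \{\boldsymbol{\gamma} \in \mathbb{R}^n : \boldsymbol{\gamma} = \boldsymbol{\delta} - \boldsymbol{\xi} \text{ for some } \boldsymbol{\xi} \in \boldsymbol{\Xi}\}. \tag{8}$$

Let $\boldsymbol{\delta}_0$ denote the noise-free predictions of the true model, i.e. $\boldsymbol{\delta}_0 = \mathbf{f}^*[x_0]$. By 4 and the definition of $\boldsymbol{\Xi}$, the probability is at least $1 - \alpha$ that $\boldsymbol{\delta} - \boldsymbol{\delta}_0 \in \boldsymbol{\Xi}$. Thus the probability is at least $1 - \alpha$ that $\mathsf{D}$, the preimage of $\boldsymbol{\delta} - \boldsymbol{\Xi}$, contains the true model $x_0$:

$$P\{\mathsf{D} \ni x_0\} \ge 1 - \alpha. \tag{9}$$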

The set $\mathsf{D}$ is a $1 - \alpha$ confidence region in model space for $x_0$.

Let $\mathsf{C}$ be a subset of $X$ (possibly all of $X$) in which we are sure a priori that $x_0$ lies. The set $\mathsf{C}$ represents our prior information; for example, we may believe that surely $\|x_0\| \le 1$ (see, for example, the geomagnetic problem in Backus [7], where this constraint captures the physical requirement that the energy stored in Earth's main magnetic field is less than Earth's mass). Since there is no chance that $x_0 \notin \mathsf{C}$, the probability that $\mathsf{C} \cap \mathsf{D} \ni x_0$ is the same as the probability that $\mathsf{D} \ni x_0$: at least $1 - \alpha$:

$$P\{\mathsf{C} \cap \mathsf{D} \ni x_0\} = P\{\mathsf{D} \ni x_0\} - P\{\mathsf{C}^c \cap \mathsf{D} \ni x_0\} = P\{\mathsf{D} \ni x_0\} \ge 1 - \alpha, \tag{10}$$

where $\mathsf{C}^c$ is the complement of $\mathsf{C}$ in $X$.

The set $\mathsf{C} \cap \mathsf{D}$ is a $1 - \alpha$ confidence set for $x_0$. It contains, in general, infinitely many models that are acceptable on the basis of their fit to the data, and their satisfaction of the prior constraints available. Any element in $\mathsf{C} \cap \mathsf{D}$ might be the truth. The problem of construction in inverse problems is to find an element $x \in \mathsf{C} \cap \mathsf{D}$. The problem of inference in inverse problems is to study the entire set. The set $\mathsf{C} \cap \mathsf{D}$ contains many models: are they similar? In what respect? With the strict bounds technique, we focus on one functional (property of the set) at a time, and measure the "width" of the set with respect to that property by the range of values that property may have over all $\mathsf{C} \cap \mathsf{D}$.

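For example, suppose we are interested in the functional

$$H: X \to \mathbb{R}, \qquad x \mapsto H[x].$$

$H$ might be the value of $x$ at a point, a local average of $x$ over some region, a norm of $x$, or any other functional.

From 10 we find

$$P\{H[\mathsf{C} \cap \mathsf{D}] \ni H[x_0]\} \ge 1 - \alpha, \tag{11}$$

where

$$H[\mathsf{C} \cap \mathsf{D}] \equiv \{\beta \in \mathbb{R} : \beta = H[x] \text{ for some } x \in \mathsf{C} \cap \mathsf{D}\}.$$

Define $\gamma^- \equiv \inf_{x \in \mathsf{C} \cap \mathsf{D}} H[x]$ and $\gamma^+ \equiv \sup_{x \in \mathsf{C} \cap \mathsf{D}} H[x]$. Then 11 implies (there may be slack in this inequality)

$$P\{\gamma^- \le H[x_0] \le \gamma^+\} \ge 1 - \alpha; \tag{12}$$

that is, the interval $[\gamma^-, \gamma^+]$ is a $1 - \alpha$ confidence interval for $H[x_0]$. Figure 2 sketches the relationships defined so far.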
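The construction of the interval as the range of a functional over the set of acceptable models can be illustrated with a deliberately tiny example. Everything in it is an assumption made for illustration: the "model" has only two parameters, there is a single datum predicted by $x_1 + x_2$, the error set is an interval of half-width `chi`, the prior set is the unit square, and the functional is $H[x] = x_1$. A grid search stands in for the optimization, so the sketch is itself a discretization rather than a rigorous bound.

```python
def toy_strict_bounds(delta: float = 1.5, chi: float = 0.1, grid: int = 100):
    """Range of H[x] = x1 over the intersection of a prior set and a data set.

    Prior set:  0 <= x1 <= 1 and 0 <= x2 <= 1.
    Data set:   |x1 + x2 - delta| <= chi  (models that fit the single datum).
    Returns the smallest and largest value of x1 found on the grid.
    """
    tolerance = 1e-12                      # guards against floating-point round-off
    values = []
    for i in range(grid + 1):
        for j in range(grid + 1):
            x1, x2 = i / grid, j / grid    # every grid point satisfies the prior
            if abs(x1 + x2 - delta) <= chi + tolerance:
                values.append(x1)          # the functional of interest, H[x] = x1
    return min(values), max(values)
```

With the default arguments the search returns the interval from 0.4 to 1.0. The lower end comes from combining the datum with the prior bound on $x_2$; without the prior bound the datum alone would not bound $x_1$ from below, which mirrors the remark that some functionals are unbounded on the acceptable set.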

Depending on $\boldsymbol{\Xi}$, $\mathsf{C}$, $\mathbf{f}^*$, and $H$, it can happen that the infimum $\gamma^-$, the supremum $\gamma^+$, or both fail to be finite. In these cases, only one-sided confidence intervals, or trivial confidence intervals, result. See sections 2.3 and 9 for short discussions of the choice of $\boldsymbol{\Xi}$. Note that confidence intervals for any number of functionals may be found this way, with simultaneous $1 - \alpha$ coverage probability. The bounds are in general conservative, i.e.

$$P\{v(\mathcal{P}) \le H[x_0]\} > 1 - \alpha,$$

since there is some chance that $x_0$ could be anywhere in $\mathsf{C}$, and $H[x]$ may be between $\gamma^-$ and $\gamma^+$ in much of $\mathsf{C}$ that is not included in $\mathsf{D}$.

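We are now faced with a pair of infinite-dimensional optimization problems. Note that $-\inf_{\mathsf{C} \cap \mathsf{D}} \{-H\} = \sup_{\mathsf{C} \cap \mathsf{D}} H = \gamma^+$, so it suffices to consider the minimization of a functional $H$ over $\mathsf{C} \cap \mathsf{D}$. The strict bounds primal problem $\mathcal{P}$ is to find the value of $\mathcal{P}$,

$$v(\mathcal{P}) \equiv \inf_{x \in \mathsf{C} \cap \mathsf{D}} H[x], \tag{13}$$

where $H$ is a functional defined on $\mathsf{C} \cap \mathsf{D}$. A feasible point of $\mathcal{P}$ is a point $x \in \mathsf{C} \cap \mathsf{D}$.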

I know of only two approaches to finding $v(\mathcal{P})$: discretization and duality, the subject of this paper. Discretization is the approximation of the primal problem by a problem in a finite-dimensional space. Duality leads to certain finite-dimensional problems in the dual space $X^*$, with the same value as the primal.

In this paper, the only functionals $H$ considered in detail are linear functionals, norms and seminorms. The sorts of prior information treated here are

1. none

2. bounds on the distance of the true model $x_0$ from a preferred model $x'$, and

3. the constraint that $x_0$ lies in a cone.

I hope to treat other nonlinear functionals and different prior information in a future paper.

The current set-up covers quite a range of problems. For example, for geomagnetic problems it can incorporate a "hard quadratic bound" on the energy stored in the Earth's magnetic field or the Ohmic dissipation in the core (see Backus [6, 7]); it can incorporate pointwise upper and lower bounds (such as in Lang [31]), upper and lower bounds on the derivative (as in Stark et al. [64]), and positivity or monotonicity constraints (as in Stark and Parker [62]). All of the problems listed in the introduction (within the approximations customarily made for their solution) are instances of problems addressed here.

Deterministic Interpretation of Strict Bounds. As noted by Lang [31], strict bounds can be interpreted deterministically as well as stochastically. Let $\boldsymbol{\epsilon}$ be a deterministic rather than stochastic error. We believe that $\boldsymbol{\epsilon} \in \boldsymbol{\Xi}$. The computed strict bounds on $H[x_0]$ are valid provided that supposition is correct. If $\boldsymbol{\epsilon} \notin \boldsymbol{\Xi}$, the computed bounds might not be valid (although, depending on $\boldsymbol{\Xi}$, $H$, $\mathsf{C}$, and $\mathbf{f}^*$, they might still be valid). This relates to the connection between statistical estimation and optimal recovery developed by Donoho [18].


2.3 Choice of the Data Confidence Region

See section 10 for a discussion of systematic errors. The strict bounds we calculate depend on the way we choose the set $\boldsymbol{\Xi}$. For a particular collection of functionals $H$ we may wish to make inferences about, we can try to choose $\boldsymbol{\Xi}$ optimally to minimize the maximum of some weighted sum of the lengths of the intervals over all models in $\mathsf{C}$. This problem appears to be unsolved, even for bounds on a single linear functional (see section 9). In this paper we will consider two types of data confidence regions $\boldsymbol{\Xi}$: balls in $\mathbb{R}^n$ in a weighted $p$-norm (discussed in this section); and "slabs," the region between two parallel hyperplanes (balls in a projection of the data space onto $\mathbb{R}^1$; these degenerate balls are discussed in section 9).

The restriction to balls is not as severe as it first appears, because by defining new data $\boldsymbol{\delta}' \equiv \boldsymbol{\Gamma} \cdot \boldsymbol{\delta}$, where $\boldsymbol{\Gamma}$ is an $m$ by $n$ matrix, we can get as confidence regions families of hypercylinders with various orientations, depending on $\boldsymbol{\Gamma}$, and various shapes, depending on the norm used. The slab discussed by Backus [7] and in section 9 is the case $m = 1$.

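Define the $p$-norms, $p \ge 1$, on $\mathbb{R}^n$ as follows:

$$\|\boldsymbol{\epsilon}\|_p \equiv \Bigl(\sum_{j=1}^n |\epsilon_j|^p\Bigr)^{1/p}, \qquad 1 \le p < \infty, \tag{14}$$

and

$$\|\boldsymbol{\epsilon}\|_\infty \equiv \max_j |\epsilon_j|. \tag{15}$$

Let $\boldsymbol{\Sigma}$ be a symmetric positive-definite $n$ by $n$ matrix. We define the $\boldsymbol{\Sigma}$-weighted $p$-norms to be

$$\|\boldsymbol{\epsilon}\|_p^{\boldsymbol{\Sigma}} \equiv \|\boldsymbol{\Sigma} \cdot \boldsymbol{\epsilon}\|_p, \tag{16}$$

where $\boldsymbol{\Sigma} \cdot \boldsymbol{\epsilon}$ is the matrix-vector dot product

$$(\boldsymbol{\Sigma} \cdot \boldsymbol{\epsilon})_j \equiv \sum_{k=1}^n \Sigma_{jk} \epsilon_k. \tag{17}$$

The norms $\|\boldsymbol{\epsilon}\|_p^{\boldsymbol{\Sigma}^{-1}}$ are defined analogously, with the matrix $\boldsymbol{\Sigma}^{-1}$ in place of $\boldsymbol{\Sigma}$.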

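We will need a weighted form of Hölder's inequality. Let $q = p/(p - 1)$, $1 < p < \infty$, or $q = 1$ if $p = \infty$. Then for any two $n$-vectors $\boldsymbol{\beta}$, $\boldsymbol{\gamma}$, and symmetric, positive definite $\boldsymbol{\Sigma}$,

$$|\boldsymbol{\beta} \cdot \boldsymbol{\gamma}| = |\boldsymbol{\beta} \cdot \boldsymbol{\Sigma} \cdot \boldsymbol{\Sigma}^{-1} \cdot \boldsymbol{\gamma}| \le \|\boldsymbol{\Sigma} \cdot \boldsymbol{\beta}\|_q \, \|\boldsymbol{\Sigma}^{-1} \cdot \boldsymbol{\gamma}\|_p = \|\boldsymbol{\beta}\|_q^{\boldsymbol{\Sigma}} \, \|\boldsymbol{\gamma}\|_p^{\boldsymbol{\Sigma}^{-1}}. \tag{18}$$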
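The weighted norms and the weighted Hölder inequality can be checked numerically. The sketch below is an illustration only: the 2 by 2 weight matrix `SIGMA`, its hand-computed inverse `SIGMA_INV`, and all function names are hypothetical choices, and a random search is a sanity check rather than a proof.

```python
import random

INF = float("inf")


def mat_vec(matrix, vector):
    """Matrix-vector product, as in equation (17)."""
    return [sum(row[k] * vector[k] for k in range(len(vector))) for row in matrix]


def p_norm(vector, p):
    """Ordinary p-norm, equations (14) and (15)."""
    if p == INF:
        return max(abs(c) for c in vector)
    return sum(abs(c) ** p for c in vector) ** (1.0 / p)


def weighted_norm(vector, weight, p):
    """Weighted p-norm: the p-norm of weight . vector, equation (16)."""
    return p_norm(mat_vec(weight, vector), p)


def conjugate_exponent(p):
    """The exponent q paired with p in Hoelder's inequality."""
    if p == INF:
        return 1.0
    if p == 1:
        return INF
    return p / (p - 1.0)


# An arbitrary symmetric positive-definite example (determinant 1.75) and its inverse.
SIGMA = [[2.0, 0.5], [0.5, 1.0]]
SIGMA_INV = [[1.0 / 1.75, -0.5 / 1.75], [-0.5 / 1.75, 2.0 / 1.75]]


def holder_gap(beta, gamma, p):
    """Right side minus left side of inequality (18); should never be negative."""
    q = conjugate_exponent(p)
    lhs = abs(sum(b * g for b, g in zip(beta, gamma)))
    rhs = weighted_norm(beta, SIGMA, q) * weighted_norm(gamma, SIGMA_INV, p)
    return rhs - lhs


def smallest_gap(p, trials=500, seed=1):
    """Smallest gap observed over random pairs of 2-vectors."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(trials):
        beta = [rng.uniform(-5.0, 5.0) for _ in range(2)]
        gamma = [rng.uniform(-5.0, 5.0) for _ in range(2)]
        gaps.append(holder_gap(beta, gamma, p))
    return min(gaps)
```

Calling `smallest_gap(p)` for several values of `p` should return a value that is nonnegative up to round-off. The identity in the middle of 18 relies on the symmetry of the weight matrix, which is why the example matrix is symmetric.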


We assume enough is known about the joint distribution of the $\epsilon_j$'s so that for fixed $p$, $\boldsymbol{\Sigma}$ and $\alpha \in [0, 1]$ we may find $\chi$ such that

$$P\{\|\boldsymbol{\epsilon}\|_p^{\boldsymbol{\Sigma}^{-1}} \le \chi\} \ge 1 - \alpha. \tag{19}$$

In general, finding $\chi$ for parametric distributions is a nontrivial calculation. If the joint distribution of the $\epsilon_j$ is Gaussian with nonsingular covariance matrix $\boldsymbol{\Sigma}^2$, then $(\|\boldsymbol{\epsilon}\|_2^{\boldsymbol{\Sigma}^{-1}})^2$ has the chi-squared distribution with $n$ degrees of freedom. Percentage points for the 1-norm of independent Gaussians are tabulated in Parker and McNutt [45]. Percentage points for the infinity-norm of independent, identically distributed Gaussians are easy to calculate, but other than the two-norm, none of these is much fun even for Gaussian random variables if the $\epsilon_j$ are correlated. (Of course, if one is not attached to a particular coordinate system, in the Gaussian case one may always rotate the data to form uncorrelated linear combinations provided the covariance matrix is nonsingular.) For other distributions, I do not know what results are extant. When there are sufficiently many repeated observations, it may be possible to find $\chi$ by simulation from the empirical distribution of the data.

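For most of the paper we take $\boldsymbol{\Xi}$ to be the $\boldsymbol{\Sigma}^{-1}$-weighted $p$-norm ball

$$\boldsymbol{\Xi} = \{\boldsymbol{\epsilon} \in \mathbb{R}^n : \|\boldsymbol{\epsilon}\|_p^{\boldsymbol{\Sigma}^{-1}} \le \chi\}, \tag{20}$$

where $\chi$ is chosen so that $P\{\boldsymbol{\epsilon} \in \boldsymbol{\Xi}\} = 1 - \alpha$. (Implicitly, we are assuming that the $\epsilon_j$ have mean zero; otherwise it would make sense to center the ball at the mean. Backus [6, 7] discusses removing the effect of known nonzero means from the data.)

Then $\mathsf{D}$ is the set of models that satisfy the data within $\chi$ in the $\boldsymbol{\Sigma}^{-1}$-weighted $p$-norm:

$$\mathsf{D} \equiv \{x \in X : \boldsymbol{\delta} - \mathbf{f}^*[x] \in \boldsymbol{\Xi}\} = \{x \in X : \|\mathbf{f}^*[x] - \boldsymbol{\delta}\|_p^{\boldsymbol{\Sigma}^{-1}} \le \chi\}. \tag{21}$$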
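When no table of percentage points is at hand, the radius of the ball can be estimated by Monte Carlo. The sketch below is an assumption-laden illustration, not the procedure of the text: it draws from an assumed parametric model (independent, zero-mean Gaussian errors with known standard deviations, so the weight matrix is diagonal) instead of resampling an empirical distribution, and the function name `simulate_chi` is hypothetical.

```python
import math
import random


def simulate_chi(sigmas, p, alpha, draws=100_000, seed=0):
    """Monte Carlo estimate of the radius chi of the weighted p-norm ball.

    Assumes independent zero-mean Gaussian errors with standard deviations
    `sigmas`; each error is divided by its standard deviation before the
    p-norm is taken.  Returns the empirical (1 - alpha) quantile of the norm.
    """
    rng = random.Random(seed)
    norms = []
    for _ in range(draws):
        scaled = [abs(rng.gauss(0.0, s)) / s for s in sigmas]
        if p == math.inf:
            norms.append(max(scaled))
        else:
            norms.append(sum(c ** p for c in scaled) ** (1.0 / p))
    norms.sort()
    # Index of the ceil((1 - alpha) * draws)-th order statistic, zero-based.
    index = max(0, min(draws - 1, math.ceil((1.0 - alpha) * draws) - 1))
    return norms[index]
```

For two data and $p = 2$ the estimate can be compared with the chi-squared result quoted above: with two degrees of freedom the 95% point of the squared norm is $-2 \ln 0.05$, so the estimated radius should be close to its square root, about 2.45. The estimate carries sampling error, so a bound built on it is only approximately at the nominal level.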

There are both computational and statistical considerations for the choice of norms in defining $\boldsymbol{\Xi}$. The 1, 2 and infinity norms often lead to discretized primal problems that can be solved by linear or quadratic programming, rather than general nonlinear optimization; this is a clear advantage. The 2-norm is often considered the most natural when the errors are thought to be Gaussian, but it has several difficulties. First, for small $n$, the percentage points of the resulting distribution are very sensitive to departures from Gaussianity: if the true distribution is long-tailed (e.g. if there are outliers), the 2-norm ball with diameter chosen on the basis of the chi-squared distribution may have an actual coverage probability much smaller than the nominal one. Second, for each linear functional we wish to bound, solving the discretized primal optimization problem for the two-norm measure of misfit involves a sequence of quadratic programs, each of which may consist of many least-squares problems. Both the one-norm and infinity-norm measures of misfit lead to a single linear programming problem for a bound on a linear functional, which can result in a substantial computational savings. The one-norm measure of misfit leads to a smaller tableau than the infinity-norm, and is more robust in the presence of outliers than the two-norm or infinity-norm. Since we are often just making an educated guess about the distribution of errors, it is desirable to decrease the sensitivity of our results to the accuracy of our guess as much as possible. Finally, there are some cases where the error is thought to be an absolute tolerance rather than a stochastic disturbance (for example instrumental limits or rounding errors in digitization; see also section 10). In such problems, the infinity-norm measure of misfit may be natural.

The geometry of the unit balls $\{\boldsymbol{\gamma} \in \mathbb{R}^n : \|\boldsymbol{\gamma}\|_p^{\boldsymbol{\Sigma}^{-1}} \le 1\}$ and the radii of the smallest balls that contain $1 - \alpha$ of the mass of the distribution of $\boldsymbol{\epsilon}$ are different in these various norms, so the choice of norms will affect the bounds we obtain (although to a smaller extent than the choice of a different sort of region, such as a slab). The details depend on the data functionals, the measured data, the prior information, and the functionals of interest.

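2.4 Discretizing the Primal Problem

The most common approach to solving the optimization problem $\mathcal{P}$ is to approximate $X$ by some finite-dimensional subspace $\mathsf{S} \subset X$, i.e. to pick a set of $m$ linearly independent vectors $x_i \in X$, and solve the discretized primal problem $\hat{\mathcal{P}}$: Find

$$v(\hat{\mathcal{P}}) \equiv \inf_{\{\boldsymbol{\alpha} \in \mathbb{R}^m : \sum_i \alpha_i x_i \in \mathsf{C} \cap \mathsf{D}\}} H\Bigl[\sum_{i=1}^m \alpha_i x_i\Bigr]. \tag{22}$$

There are several goals in choosing $\mathsf{S}$:

1. the possibility of computing $\mathbf{f}^*[x]$ and $H[x]$ analytically for $x \in \mathsf{S}$

2. simplicity of the description of $\mathsf{C} \cap \mathsf{S}$ in terms of the coefficients $\boldsymbol{\alpha}$

3. sufficient flexibility in $\mathsf{S}$ that $v(\hat{\mathcal{P}})$ is close to $v(\mathcal{P})$.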

Typical choices of $\mathsf{S}$ are piecewise constant functions, and the first $m$ elements of a basis. For example, in whole-earth tomography, most researchers either divide Earth into a set of spherical "boxes," and assume that seismic velocity is constant in these boxes (the first sort of approximation); or they use a truncated spherical harmonic expansion in the angular variables, tensored with some sort of polynomial in radius (the second sort of approximation). In geomagnetism, it is customary to use the first $m$ spherical harmonics (the second sort of discretization). Both sorts of finite-dimensional approximation will be called discretization.


For many choices of $\mathsf{C}$, $H$ and $p$, the problem $\hat{\mathcal{P}}$ can be solved on a computer by standard optimization techniques. For example, if $\mathsf{C}$ is a cone or a set of models satisfying an infinity-norm bound, $H$ is linear, and $p$ is 1 or infinity, then the discretized primal can be solved by linear programming (e.g. [54, 29, 22, 64]). If $p = 2$ and the rest remain the same, the problem can be solved by a series of quadratic programs [49, 63]. If $H$ is the one or infinity norm on a bounded domain, $\mathsf{C}$ is a cone or a ball in the infinity norm, and $p$ is 1 or $\infty$, $\hat{\mathcal{P}}$ can also be solved by linear programming. If $H$ is a quadratic norm or seminorm, $\mathsf{C}$ is a cone, and $p$ is 1 or $\infty$, $\hat{\mathcal{P}}$ can be solved by quadratic programming (e.g. [62]). It is clear that

$$v(\hat{\mathcal{P}}) \ge v(\mathcal{P}) \tag{23}$$

since the infimum in $\hat{\mathcal{P}}$ is over $\mathsf{S} \cap \mathsf{C} \cap \mathsf{D}$, rather than $\mathsf{C} \cap \mathsf{D}$. How much smaller than the approximation could the true value be? We shall see that there is another finite-dimensional optimization problem, the dual problem $\hat{\mathcal{D}}$, with the property that

$$v(\hat{\mathcal{P}}) \ge v(\mathcal{P}) \ge v(\hat{\mathcal{D}}).$$

The pair of finite-dimensional optimization problems $\hat{\mathcal{P}}$ and $\hat{\mathcal{D}}$ rigorously bracket the number we wish to know, $v(\mathcal{P})$.


3 Naive Duality

The idea of the dual approach is to get information about $H[x_0]$ directly from the data and prior constraints. Even the fact that a datum has a finite value tells us something about the model: its component in the direction "aligned" (see [35] for a precise definition) with the data kernel is (almost surely) finite.

In this section we take $X$ to be the space of continuous real-valued functions on the interval $[\alpha, \beta]$, we take $H$ to be a linear functional, and we assume that $H[x]$ and $\mathbf{f}^*[x]$ can be written as integrals against "kernels:"

$$H[x] = \int_\alpha^\beta h(\tau) x(\tau) \, d\tau,$$

$$\mathbf{f}^*[x] = \int_\alpha^\beta \mathbf{f}(\tau) x(\tau) \, d\tau,$$

where $\mathbf{f}(\tau)$ is an $n$-tuple of functions on $[\alpha, \beta]$.

3.1 No Prior InformationSuppose we can find a vector A E Rn such that

nh(r) = A * f(r) =E Ajfj(7-)

j=1

for all 7 E [a, 3]. Consider the minimization problem infXED H[x]. We have

H[x] = J. f(Tr)x(7T)dr= A * (6+ a).

(This observation is basic to Backus-Gilbert resolution theory [2, 3, 4, 5, 8].)Now for any x E D, I1-yI I < X. It follows from 18, the weighted form ofHolder's inequality, that JA * < xIlAlE for all x E D, so for all A E Rn suchthat h(r) = A * 7,

H[x] > A.6-xIIAIq * (24)If the functions fj(r) are linearly independent, then there is at most one Asatisfying A f(-r) = h(r), and we shall see below that for that A,

inf H[x] = A -* c- XIA11 qE.xED

In any case, we have the "dual" problem D:

v(D) = sup A.6-xi 1A\II (25)AER :A-f(r)=h(r)

and v(D) < v(P) = inf.ED H[x].


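3.2 Positivity Constraints

Suppose now we are given the prior information that $x_0$ lies in the positive cone $C = \{x(\tau) : x(\tau) \ge 0, \tau \in [\alpha, \beta]\}$. This information allows us to relax the constraint that $\lambda \cdot f(\tau) = h(\tau)$ to the constraint $\lambda \cdot f(\tau) \le h(\tau)$, because we can still bound the integral of $(h(\tau) - \lambda \cdot f(\tau))x(\tau)$:

$$\begin{aligned}
H[x] &= \lambda \cdot f^*[x] + (H[x] - \lambda \cdot f^*[x]) \\
&= \lambda \cdot (\delta + \gamma) + \int (h - \lambda \cdot f) x \\
&\ge \lambda \cdot \delta - \chi \|\lambda\|_q^E + \int_{\{h(\tau) > \lambda \cdot f(\tau)\}} (h - \lambda \cdot f) \cdot 0 \\
&= \lambda \cdot \delta - \chi \|\lambda\|_q^E,
\end{aligned}$$

for $x \in C \cap D$. Since this relation holds for all $\lambda$ such that $\lambda \cdot f(\tau) \le h(\tau)$, this inequality leads to the dual problem

$$v(\mathcal{D}) = \sup_{\lambda : \lambda \cdot f(\tau) \le h(\tau)} \lambda \cdot \delta - \chi \|\lambda\|_q^E \qquad (26)$$

and we have $v(\mathcal{D}) \le v(\mathcal{P}) = \inf_{x \in C \cap D} H[x]$. This case and the one that follows are discussed in Lang [31].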

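3.3 Positivity Constraints and Upper Bounds

Suppose we have in addition an upper bound on $x_0(\tau)$, i.e.

$$x_0 \in C \equiv \{x \in X : 0 \le x(\tau) \le u(\tau)\}.$$

We may then completely relax the constraint that $\lambda \cdot f(\tau) = h(\tau)$, since

$$\begin{aligned}
H[x] &= \lambda \cdot f^*[x] + (H[x] - \lambda \cdot f^*[x]) \\
&\ge \lambda \cdot \delta - \chi \|\lambda\|_q^E + \int_{\{h(\tau) > \lambda \cdot f(\tau)\}} (h - \lambda \cdot f) \cdot 0 + \int_{\{h(\tau) < \lambda \cdot f(\tau)\}} (h - \lambda \cdot f) u,
\end{aligned}$$

for $x \in C \cap D$. We thus arrive at a third dual problem

$$v(\mathcal{D}) = \sup_{\lambda \in R^n} \lambda \cdot \delta - \chi \|\lambda\|_q^E + \int_{\{h(\tau) < \lambda \cdot f(\tau)\}} (h - \lambda \cdot f) u. \qquad (27)$$

Of course as before

$$v(\mathcal{D}) \le v(\mathcal{P}) = \inf_{x \in C \cap D} H[x],$$

where $C$ is now the set of positive, continuous functions on $[\alpha, \beta]$ that are everywhere less than or equal to $u(\tau)$.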
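As a small numerical illustration of the dual problem 27, the sketch below evaluates the dual functional on a grid of $\lambda$ for a toy problem that is not taken from the paper: one datum $\delta = 0.5$ with kernel $f(\tau) = 1$ on $[0, 1]$, misfit bound $\chi = 0.1$ (so the dual norm of the scalar $\lambda$ is just $|\lambda|$), upper bound $u(\tau) = 1$, and $h(\tau) = \tau$. The function name `dual_value` and all numerical values are assumptions made for the illustration. For this toy problem the primal value can be found by hand: the least admissible mass, 0.4, is placed on $[0, 0.4]$, giving $H = 0.08$.

```python
import numpy as np

def dual_value(lam, t, f, h, u, delta, chi):
    """Dual functional of (27) for a single datum, by trapezoid quadrature on the grid t."""
    gap = h - lam * f                      # h(t) - lambda f(t)
    integrand = np.minimum(gap, 0.0) * u   # only the region where h < lambda f contributes
    integral = np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(t))
    return lam * delta - chi * abs(lam) + integral

# Hypothetical toy problem (not from the paper).
t = np.linspace(0.0, 1.0, 20001)
f = np.ones_like(t)       # data kernel: the datum is the integral of x
h = t.copy()              # functional of interest: first moment of x
u = np.ones_like(t)       # prior bound 0 <= x <= 1
delta, chi = 0.5, 0.1

lams = np.linspace(-1.0, 2.0, 301)
values = np.array([dual_value(lam, t, f, h, u, delta, chi) for lam in lams])
best = lams[np.argmax(values)]
```

Every entry of `values` is a lower bound on the primal value; in this example the largest one, attained near $\lambda = 0.4$, coincides with the hand-computed primal value 0.08, so there is no duality gap here.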

The three dual problems 25, 26, 27 are representative of the dual problems we will encounter for bounds on linear functionals over abstract vector spaces with no prior information, prior cones, and prior balls. The dual problems all involve maximizing a concave functional of $\lambda \in R^n$ over convex regions of $R^n$, and so are instances of general convex programming problems. Note that no particular properties of the space of continuous functions on $[\alpha, \beta]$ were used in deriving the first dual problem (with no prior information); to derive the second dual problem for a prior cone we needed only a lower bound on $(H - \lambda \cdot f^*)[x]$ for $x \in C$; and for the prior infinity-norm ball $0 \le x(\tau) \le u(\tau)$, we used upper and lower bounds on $(H - \lambda \cdot f^*)[x]$ for $x \in C$. Such bounds may be available even in much more abstract settings, allowing us, for example, to avoid the assumption that we are in a space of continuous functions, unless our prior knowledge supports it. We turn now to a more abstract framework.


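4 Algebraic Duality

Let us proceed in a little more generality to derive a lower bound on the value of the primal problem $\mathcal{P}$. Let $X$ be a real linear vector space, $f^*$ an $n$-tuple of linear functionals on $X$, $H$ a functional on $X$, and let $C \subset X$. Let $x^*$ be any element of $X^*$.

$$\begin{aligned}
v(\mathcal{P}) &\equiv \inf_{x \in C \cap D} H[x] \\
&= \inf_{x \in C \cap D} \{x^*[x] + (H - x^*)[x]\} \\
&\ge \inf_{x \in C \cap D} x^*[x] + \inf_{x \in C \cap D} \{(H - x^*)[x]\} \\
&\ge \inf_{x \in D} x^*[x] + \inf_{x \in C} \{(H - x^*)[x]\}.
\end{aligned}$$

But $x^*$ was arbitrary, so

$$\begin{aligned}
\inf_{x \in C \cap D} H[x] &\ge \sup_{x^* \in X^*} \Big\{ \inf_{x \in D} x^*[x] + \inf_{x \in C} \{(H - x^*)[x]\} \Big\} \\
&\equiv \sup_{x^* \in X^*} \{D^*[x^*] + C^*[x^*]\}, \qquad (28)
\end{aligned}$$

where the functionals $C^*[\cdot]$ and $D^*[\cdot]$ on $X^*$ are defined by

$$C^*[x^*] \equiv \inf_{x \in C} (H - x^*)[x] \qquad (29)$$

$$D^*[x^*] \equiv \inf_{x \in D} x^*[x]. \qquad (30)$$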

Remarkably, equality is possible in 28, even though we seem to have thrown away a great deal of information by breaking the optimization problem into pieces and removing constraints from each of the pieces. When equality does not hold, there is said to be a "duality gap." Fenchel's Duality Theorem and Lagrangian Duality [35] give sufficient conditions for equality in 28; these are discussed in section 11 below.

The only functionals $x^*$ it behooves us to consider are those for which $C^*[x^*]$ and $D^*[x^*]$ are greater than minus infinity. Define the conjugate sets

$$C^* \equiv \{x^* \in X^* : C^*[x^*] > -\infty\} \qquad (31)$$

and

$$D^* \equiv \{x^* \in X^* : D^*[x^*] > -\infty\}. \qquad (32)$$

The functional $x^*$ provides a nontrivial bound via 28 if and only if $x^* \in C^* \cap D^*$. The set $D^*$ is easily characterized. By an argument implicit in Backus [2], the only linear functionals bounded below on $D$ are linear combinations of the data functionals $f_j^*$; i.e. functionals of the form

$$x^* = \lambda \cdot f^*, \quad \lambda \in R^n.$$

We lose nothing by restricting the supremum in 28 to this (at most) $n$-dimensional subspace of $X^*$.

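We may now compute $D^*[x^*]$. For $x \in D$, $f^*[x] = \delta + \gamma$, where $\|\gamma\| \le \chi$. Thus

$$\begin{aligned}
\lambda \cdot f^*[x] &= \lambda \cdot \delta + \lambda \cdot \gamma \\
&\ge \lambda \cdot \delta - \|\lambda\|_q^E \|\gamma\|
\end{aligned}$$

by Hölder's inequality. Since $\|\gamma\| \le \chi$,

$$\lambda \cdot f^*[x] \ge \lambda \cdot \delta - \chi \|\lambda\|_q^E. \qquad (33)$$

If the $f_j^*$ are linearly independent, equality is possible: By an argument not reproduced here (see Backus [6]), if the $f_j^*$ are linearly independent, there exist $\{x_j\}_{j=1}^n$ in $X$ such that

$$f_j^*[x_k] = \begin{cases} 1, & j = k \\ 0, & \text{otherwise.} \end{cases}$$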

Define the scalars aj as follows:

ai=ai-X ( ~lAjl )l 'Aj

and let $x' \equiv \sum_j a_j x_j$. It is easily verified that

$$\|f^*[x'] - \delta\| = \chi,$$

so $x' \in D$, and

$$H[x'] = \lambda \cdot f^*[x'] = \lambda \cdot \delta - \chi \|\lambda\|_q^E. \qquad (34)$$

Thus the bound in equation 33 is sharp at least when the $f_j^*$ are linearly independent. Substituting 33 into 28, we arrive at the fundamental inequality of the paper:

$$\inf_{x \in C \cap D} H[x] \ge v(\mathcal{D}) \equiv \sup_{\lambda \in R^n} \{\lambda \cdot \delta - \chi \|\lambda\|_q^E + C^*[\lambda \cdot f^*]\} \qquad (35)$$

Note that the infimum in the primal problem has been replaced by a supremum (over $R^n$) in the dual. However, there is still a hidden infinite-dimensional infimum on the right hand side, in $C^*[\lambda \cdot f^*]$.

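It is easily verified that $C^*[\cdot]$ is a concave functional: suppose $\alpha, \beta \ge 0$ and $\alpha + \beta = 1$. We have

$$\begin{aligned}
C^*[\alpha x^* + \beta y^*] &= \inf_{x \in C} \{(H - \alpha x^* - \beta y^*)[x]\} \\
&= \inf_{x \in C} \{\alpha (H - x^*)[x] + \beta (H - y^*)[x]\} \\
&\ge \alpha \inf_{x \in C} \{(H - x^*)[x]\} + \beta \inf_{x \in C} \{(H - y^*)[x]\} \\
&= \alpha C^*[x^*] + \beta C^*[y^*]. \quad \Box \qquad (36)
\end{aligned}$$

One may show similarly that $D^*[\cdot]$ is concave. Since $\lambda \cdot f^*$ is linear in $\lambda$, it follows that $C^*[\lambda \cdot f^*]$ is concave in $\lambda$. As $\lambda \cdot \delta$ and $-\|\lambda\|_q^E$ are also concave in $\lambda$, the dual functional (the right hand side of 35) is concave in $\lambda$. The concavity of the functionals $C^*[\cdot]$ and $D^*[\cdot]$ yields immediately the convexity of the sets $C^*$, $D^*$ and $C^* \cap D^*$. Thus the dual problem is always the maximization of a concave functional over a convex set.

The following sections explore some functionals $H$ and sets $C$ where the set $C^*$ and the functional $C^*[\lambda \cdot f^*]$ can be characterized usefully, as the set $D^*$ and the functional $D^*[\cdot]$ were. In the next section we take $H$ to be a linear functional, and look at some sets $C$ where the set $C^*$ and the functional $C^*[\cdot]$ are simple.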


5 Bounds on Linear Functionals

Within this section $H[x]$ is always a linear functional. We will consider three kinds of prior information: $C = X$, $C$ is a cone, and $C$ is a ball.

5.1 No Prior Information

Here $C = X$. Since $H$ is linear, by the standard argument,

$$C^*[x^*] \equiv \inf_{x \in X} (H - x^*)[x] = -\infty,$$

unless $H - x^* = 0$. Thus the set $C^* = \{H\}$ and for all $x^* \in C^*$, $C^*[x^*] = 0$. If there is no $\lambda$ such that $\lambda \cdot f^* = H$, the value of the dual problem is defined to be $-\infty$. Otherwise, it is

$$v(\mathcal{D}) = \sup_{\{\lambda \in R^n : \lambda \cdot f^* = H\}} \{\lambda \cdot \delta - \chi \|\lambda\|_q^E\}.$$

This is the maximization of a concave functional subject to an infinite set of linear equality constraints. For $q = 1$ or $\infty$, the dual problem can be written as a semi-infinite equality constrained least-squares problem. If $q = 2$, it is a semi-infinite least-squares problem. If the $f_j^*$ are linearly independent, there is at most one $\lambda$ satisfying the constraints, so there is no maximization to perform, and by the argument leading to equation 34, there is no duality gap. This is analogous to section 3.1.

5.2 Prior Cones

A set $P \subset X$ is a cone with vertex at the origin if $x \in P$ implies that $\gamma x \in P$ for all $\gamma \ge 0$. A set $C$ is a cone with vertex $x'$ if

$$C = x' + P \equiv \{x' + y : y \in P\},$$

for some point $x' \in X$ and some cone $P$ with vertex at the origin. $C$ is a convex cone if $C$ is a cone and $C$ is convex. We now consider prior information of the form $x \in C$, where $C$ is a convex cone. Examples of information of this kind are pointwise upper or lower bounds on the model, monotonicity constraints, and so on. Such constraints arise in seismology, gravimetry, medical tomography, magnetotellurics, diffraction-limited optics and radio astronomy, as seismic velocity, mass density, electrical conductivity, and the number of photons coming from an object are all nonnegative.

It is easily verified that $C^*$ is a convex cone in $X^*$. The convexity of $C^*$ follows from the more general inequality 36. That $C^*$ is a cone can be shown using the following characterization of $C^*$: recall that

$$C^* \equiv \{x^* \in X^* : \inf_{x \in C} (H - x^*)[x] > -\infty\}.$$

We show that

$$C^* = \{x^* \in X^* : \inf_{x \in C} (H - x^*)[x] = (H - x^*)[x']\}. \qquad (37)$$

It is obvious that the set on the right of 37 is contained in $C^*$; we need to show that $C^*$ is contained in it. Suppose $x^*$ is not in the set on the right of 37, so for some $y \in C$, $(H - x^*)[y] < (H - x^*)[x']$. Since $C$ is a cone, $y_\alpha \equiv x' + \alpha (y - x') \in C$ for all $\alpha \ge 0$. Define $x^*_H \equiv H - x^*$. By the linearity of $x^*_H$,

$$x^*_H[y_\alpha] = x^*_H[x'] + \alpha (x^*_H[y] - x^*_H[x']).$$

As $\alpha \to \infty$,

$$x^*_H[y_\alpha] = (H - x^*)[y_\alpha] \to -\infty,$$

so $x^* \notin C^*$. Now if $x^* \in C^*$, $x^*[x] \ge x^*[x']$, $\forall x \in C$. Then $\alpha x^*[x] \ge \alpha x^*[x']$ $\forall x \in C$ and all $\alpha \ge 0$. Therefore by the argument supporting 37, $\alpha x^* \in C^*$, and thus $C^*$ is a cone with vertex at the origin. $\Box$

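Now if $C$ is a cone with vertex $x'$, we have

$$\begin{aligned}
v(\mathcal{D}) &\equiv \sup_{\lambda \in R^n} \{\lambda \cdot \delta - \chi \|\lambda\|_q^E + \inf_{x \in C} (H - \lambda \cdot f^*)[x]\} \\
&= \sup_{\{\lambda : \lambda \cdot f^* \in C^*\}} \{\lambda \cdot \delta - \chi \|\lambda\|_q^E + \inf_{x \in C} (H - \lambda \cdot f^*)[x]\} \\
&= \sup_{\{\lambda : \lambda \cdot f^* \in C^*\}} \{\lambda \cdot \delta - \chi \|\lambda\|_q^E + (H - \lambda \cdot f^*)[x']\} \qquad (38)
\end{aligned}$$

by 37. Equation 38 becomes useful if there is an easy way to characterize the set of $\lambda \in R^n$ such that $\lambda \cdot f^* \in C^*$.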

Example. Let $X = C^0[0, 1]$, the space of continuous functions on $[0, 1]$, and let $C$ be the positive cone:

$$C \equiv \{x \in X : x(\tau) \ge 0, \tau \in [0, 1]\}.$$

Then $X^*$ is the space of measures on $[0, 1]$ with

$$x^*[x] = \int_0^1 x(\tau)\, dx^*(\tau),$$

and $C^*$ is the set of positive measures on $[0, 1]$. If $H$ and $f_j^*$ are measures that are continuous with respect to Lebesgue measure, i.e. if

$$H[x] = \int_0^1 h(\tau) x(\tau)\, d\tau$$

and

$$f_j^*[x] = \int_0^1 f_j(\tau) x(\tau)\, d\tau$$

for some real-valued functions $h(\tau)$ and $f_j(\tau)$, then $\lambda \cdot f^* \in C^*$ if and only if

$$h(\tau) - \sum_{j=1}^n \lambda_j f_j(\tau) \ge 0, \quad \tau \in [0, 1].$$

This infinite set of constraints on $\lambda$ can be discretized by imposing them at a finite set of points in $[0, 1]$. If something is known about the continuity of $h(\tau)$ and $f_j(\tau)$, it may be possible (as we shall see below in section 7) to impose constraints in the discretized dual problem so that the set of admissible $\lambda$ is a subset of the set in the original infinitely constrained problem.

5.3 Prior Balls

Here we take $C$ to be the set $\{x \in X : \|x - x'\| \le \nu\}$. A quadratic norm might be induced by a "hard quadratic bound," such as a constraint on the amount of energy stored in Earth's magnetic field (Backus [6, 7]). An infinity norm bound could be induced by pointwise upper and lower bounds on the unknown, such as arise in gravimetry, travel-time tomography, magnetotellurics, spectroscopy, etc. In this section we assume that $X$ is complete with respect to the norm (so it is a Banach space) and take $X^*$ to be its normed dual. We assume that $H$ and $f_j^*$ are in $X^*$; since $X^*$ is a linear space, $\lambda \cdot f^* \in X^*$ for all $\lambda \in R^n$. From the definition of the norm on $X^*$ and the linearity of $x^*$, it is immediate that

$$|x^*[x]| \le \|x^*\| \|x\|. \qquad (39)$$

Consider the conjugate set $C^*$ for this problem:

$$C^* \equiv \{x^* \in X^* : \inf_{x \in C} (H - \lambda \cdot f^*)[x] > -\infty\}.$$

Now

$$\begin{aligned}
\inf_{x \in C} (H - \lambda \cdot f^*)[x] &= \inf_{\|x - x'\| \le \nu} (H - \lambda \cdot f^*)[x] \\
&= \inf_{\|y\| \le \nu} \{(H - \lambda \cdot f^*)[y] + (H - \lambda \cdot f^*)[x']\} \\
&\ge -\nu \|H - \lambda \cdot f^*\| + (H - \lambda \cdot f^*)[x'] > -\infty \qquad (40)
\end{aligned}$$

by equation 39. Thus $C^* = X^*$, the entire normed dual space. This leads to the dual problem: find

$$\begin{aligned}
v(\mathcal{D}) &= \sup_{\lambda \in R^n} \{\lambda \cdot \delta - \chi \|\lambda\|_q^E - \nu \|H - \lambda \cdot f^*\| + (H - \lambda \cdot f^*)[x']\} \\
&= H[x'] + \sup_{\lambda \in R^n} \{\lambda \cdot (\delta - f^*[x']) - \chi \|\lambda\|_q^E - \nu \|H - \lambda \cdot f^*\|\}, \qquad (41)
\end{aligned}$$

which is the unconstrained maximization of a concave functional of $\lambda$.

Example 1. Let $X = l_2^w$ be the Hilbert space of sequences $x \equiv (x_j)_{j=1}^\infty$ that are square-summable with the positive weight sequence $w$:

$$X = \Big\{x = (x_j)_{j=1}^\infty : \sum_{j=1}^\infty w_j x_j^2 < \infty\Big\}.$$

Let $f_j^*$ and $H$ denote both the functionals and the elements of $l_2^w$ that "represent" them. Suppose we can compute analytically the infinite sums $f_j^*[f_k^*]$, $f_j^*[H]$, $f_j^*[x']$, $H[H]$ and $H[x']$. Let $\Gamma$ be the Gram matrix with elements $\Gamma_{jk} = f_j^*[f_k^*]$, $j, k \in \{1, \ldots, n\}$. Then equation 41 reduces to

$$v(\mathcal{D}) = H[x'] + \sup_{\lambda \in R^n} \Big\{\lambda \cdot (\delta - f^*[x']) - \chi \|\lambda\|_q^E - \nu \big(\lambda \cdot \Gamma \cdot \lambda - 2 \lambda \cdot (f^*[H]) + H[H]\big)^{1/2}\Big\}$$

and there is no need to discretize the dual problem. This example relates to the geomagnetic problem addressed by Backus [7], where the model $x$ is a sequence of coefficients in the spherical harmonic expansion of the scalar magnetic potential of Earth's main magnetic field.

Example 2. Suppose our prior information is that the unknown is a continuous function $x_0(\tau)$ and that

$$0 \le x_0(\tau) \le u(\tau). \qquad (42)$$

It is then natural to take $X$ to be the Banach space of continuous functions with finite weighted sup-norm

$$\|x\| \equiv \sup_\tau \left| \frac{2 x(\tau)}{u(\tau)} \right|.$$

With the norm defined this way, 42 can be written

$$x_0 \in C \equiv \{x \in X : \|x - x'\| \le 1\},$$

where $x'(\tau) = u(\tau)/2$. The normed dual of $X$ is the set of finite signed measures; i.e. $x^* \in X^*$ can be written

$$x^*[x] = \int x(\tau)\, dx^*(\tau),$$

and the norm on $X^*$ is

$$\|x^*\| = \int \frac{u(\tau)}{2} |dx^*(\tau)|.$$

The dual problem 41 becomes

$$v(\mathcal{D}) = H[x'] + \sup_{\lambda \in R^n} \Big\{\lambda \cdot (\delta - f^*[x']) - \chi \|\lambda\|_q^E - \int \frac{u}{2} |d(H - \lambda \cdot f^*)|\Big\}.$$


6 Bounds on Norms and Seminorms

In this section we construct dual problems for lower bounds on norms and seminorms of the difference between the set of acceptable models and some "preferred" model $x'$. We shall not consider any special prior information: in this section $C = X$. Two classes of problems that are thus not treated are lower bounds on norms or seminorms when the unknown $x_0$ is known to lie in some cone, and bounds on norms or seminorms when $x_0$ is known to lie in a ball in a different norm.

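6.1 Norms

Here we take $X$ to be a Banach space, $X^*$ to be its normed dual, and assume $f_j^* \in X^*$, $j = 1, \ldots, m$. The primal problem is to find $v(\mathcal{P}) = \inf_{x \in C \cap D} \|x - x'\|$. The conjugate functional is $C^*[x^*] = \inf_{x \in X} \{\|x - x'\| - x^*[x]\}$. The conjugate set is

$$C^* = \{x^* \in X^* : \inf_{x \in X} \{\|x - x'\| - x^*[x]\} > -\infty\}.$$

If we translate the origin to $x'$,

$$\begin{aligned}
C^* &= \{x^* \in X^* : \inf_{x \in X} \{\|x\| - x^*[x] - x^*[x']\} > -\infty\} \\
&= \{x^* \in X^* : \|x^*\| \le 1\},
\end{aligned}$$

as we may easily see: Suppose $\|x^*\| > 1$, and let $0 < \epsilon < \|x^*\| - 1$. Pick any $y_0 \in X$ such that

$$\frac{|x^*[y_0]|}{\|y_0\|} \ge \|x^*\| - \epsilon > 1,$$

and let

$$y \equiv \frac{y_0}{\|y_0\|} \frac{x^*[y_0]}{|x^*[y_0]|}.$$

Then for $\gamma > 0$, $\|\gamma y\| = \gamma$ and

$$x^*[\gamma y] \ge \gamma \|x^*\| - \gamma \epsilon = \gamma (\|x^*\| - \epsilon).$$

Thus

$$\begin{aligned}
C^*[x^*] &\le \inf_{\gamma > 0} \{\|\gamma y\| - x^*[\gamma y] - x^*[x']\} \\
&\le \inf_{\gamma > 0} \{\gamma (1 - \|x^*\| + \epsilon)\} - x^*[x'] = -\infty.
\end{aligned}$$

So if $\|x^*\| > 1$, $x^* \notin C^*$. Suppose $\|x^*\| \le 1$. Then for all $x \in X$,

$$\begin{aligned}
\|x\| - x^*[x] - x^*[x'] &\ge \|x\| - \|x^*\| \|x\| - x^*[x'] \\
&= \|x\| (1 - \|x^*\|) - x^*[x'] \\
&\ge -x^*[x'], \qquad (43)
\end{aligned}$$

so $x^* \in C^*$. $\Box$

By 43 we see immediately that $C^*[x^*] = -x^*[x']$ for all $x^* \in C^*$. Thus the dual problem is to find

$$v(\mathcal{D}) = \sup_{\lambda \in R^m : \|\lambda \cdot f^*\| \le 1} \lambda \cdot (\delta - f^*[x']) - \chi \|\lambda\|_q^E \qquad (44)$$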

Example: One-sided confidence interval for Ohmic dissipation in the core.

As shown by the work of Shure, Parker and Backus [56] (SPB in the following), the Ohmic dissipation in the core can be lower-bounded by a norm of the magnetic scalar potential at the core-mantle boundary; thus measurements of components of Earth's magnetic field can be used to infer a one-sided confidence interval for the dissipation. SPB focus on constructing models achieving the minimum; here we are concerned primarily with the confidence interval. They use a two-norm measure of misfit to the data; for contrast let's use the infinity norm. We assume that the observational errors are due to crustal magnetization, the magnitude of which can be bounded using knowledge of rock properties, so that

$$|\epsilon_j| \le \chi.$$

(That is, we treat the errors as systematic rather than random; see also section 10.) SPB represent the main field in $l_2^w$ by the sequence of coefficients in the spherical harmonic expansion (referred to Earth's core) of the magnetic scalar potential. Here

$$l_2^w \equiv \Big\{(x_0, x_1, x_2, \ldots) : \sum_j w_j x_j^2 < \infty\Big\},$$

where the positive weight sequence $(w_j)$ is determined by the norm that measures Ohmic dissipation. We have

$$v(\mathcal{D}) = \sup_{\lambda : \|\lambda \cdot f^*\| \le 1} \lambda \cdot \delta - \chi \|\lambda\|_1.$$

Now for the norms in SPB, one may evaluate the Gram matrix $\Gamma$:

$$\Gamma_{jk} \equiv \langle f_j^*, f_k^* \rangle,$$

and the constraint $\|\lambda \cdot f^*\| \le 1$ becomes

$$\lambda \cdot \Gamma \cdot \lambda \le 1.$$

Thus

$$v(\mathcal{D}) = \sup_{\lambda \cdot \Gamma \cdot \lambda \le 1} \{\lambda \cdot \delta - \chi \|\lambda\|_1\},$$

which can be solved as a sequence of quadratic programming problems.
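To make the last problem concrete, here is a minimal sketch of one way it might be solved numerically. It is not the procedure of the paper: it uses SciPy's SLSQP routine (a sequential quadratic programming method) as a stand-in for the sequence of quadratic programs, and it splits $\lambda = \pi - \nu$ with $\pi, \nu \ge 0$ so that the 1-norm becomes a linear term. The function name `ohmic_dual` and the one-datum Gram matrix, datum, and misfit bound are hypothetical values chosen so the answer can be checked by hand.

```python
import numpy as np
from scipy.optimize import minimize

def ohmic_dual(gram, delta, chi):
    """Maximize lam.delta - chi*||lam||_1 subject to lam.Gram.lam <= 1.

    lam is split as pi - nu with pi, nu >= 0, which turns the 1-norm into
    the linear term sum(pi + nu) and leaves a smooth constrained problem.
    """
    n = len(delta)

    def neg_objective(z):
        pi, nu = z[:n], z[n:]
        return -(delta @ (pi - nu) - chi * np.sum(pi + nu))

    def ball(z):
        # nonnegative exactly when lam.Gram.lam <= 1
        lam = z[:n] - z[n:]
        return 1.0 - lam @ gram @ lam

    res = minimize(neg_objective, np.zeros(2 * n), method="SLSQP",
                   bounds=[(0.0, None)] * (2 * n),
                   constraints=[{"type": "ineq", "fun": ball}])
    lam = res.x[:n] - res.x[n:]
    return lam, -res.fun

# Hypothetical one-datum check: maximize 3*lam - |lam| subject to 4*lam**2 <= 1.
lam, value = ohmic_dual(np.array([[4.0]]), np.array([3.0]), 1.0)
```

For the one-datum check the constraint forces $|\lambda| \le 1/2$, so the maximizer is $\lambda = 1/2$ with dual value 1; the solver should reproduce this to within its tolerance.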


6.2 Seminorms

We write $|x|$ for the seminorm of $x$. The primal problem is to find

$$\inf_{x \in C \cap D} |x - x'|.$$

We shall consider only the case $C = X$, no prior information. Then $C^*[x^*] = \inf_{x \in X} \{|x - x'| - x^*[x]\}$. Recall (e.g. [35]) that the elements of a real linear vector space that have zero seminorm form a linear subspace

$$M \equiv \{x \in X : |x| = 0\}.$$

We define the quotient space $X/M$, $X$ modulo $M$, by identifying vectors $x$ and $y$ if there exists $m \in M$ such that $y = x + m$. On the space of equivalence classes set up this way, $|\cdot|$ is a norm: the entire subspace $M$ is identified with the zero element. We now have a minimum-norm problem on the space $X/M$, and the results of the previous section can be used.

Let

$$M^\perp \equiv \{x^* \in X^* : x^*[x] = 0 \ \forall x \in M\}.$$

The set $M^\perp$, the orthogonal complement of $M$ in $X^*$, is a linear subspace of $X^*$. It can be thought of as the dual space to the space of equivalence classes we have constructed: it consists of all linear functionals on the original space that ignore "components" in $M$. A linear functional $x^* \in M^\perp$ assigns the same value to all vectors that are equivalent modulo $M$ (assigning the value 0 to all elements of $M$). It is clear that unless $x^*[x]$ vanishes for all $x \in M$, $C^*[x^*] = -\infty$: $x^*$ provides a nontrivial bound only if $x^* \in M^\perp$. Define the functional $|\cdot|$ on $M^\perp$ as follows:

$$|x^*| \equiv \sup_{x \in X} \frac{|x^*[x]|}{|x|}.$$

Then $|x^*|$ is a norm on $M^\perp$, dual to the norm $|x|$ on $X/M$. The same argument as before gives $C^* = \{x^* \in M^\perp : |x^*| \le 1\}$, and $C^*[x^*] = -x^*[x']$. Thus the dual problem is

$$v(\mathcal{D}) = \sup_{\{\lambda \in R^n : \lambda \cdot f^* \in M^\perp \text{ and } |\lambda \cdot f^*| \le 1\}} \lambda \cdot (\delta - f^*[x']) - \chi \|\lambda\|_q^E \qquad (45)$$

Example: Distance from an m-dimensional subspace.

Here we define

$$|x| \equiv \inf_{\gamma \in R^m} \|x - \gamma \cdot x_0\|$$

where $x_0$ is a fixed $m$-tuple $(x_1, x_2, \ldots, x_m)$ of elements of $X$, and the dot product is defined in the obvious way. We then have the dual problem

$$v(\mathcal{P}) \equiv \inf_{x \in D} |x| \ge v(\mathcal{D}) \equiv \sup_{\{\lambda : \lambda \cdot f^*[x_i] = 0,\ i = 1, \ldots, m,\ \text{and}\ |\lambda \cdot f^*| \le 1\}} \lambda \cdot \delta - \chi \|\lambda\|_q^E.$$

7 Discretization Error and the Dual

Lang [31] gives a treatment of discretization error for the dual when $X = C[a, b]$, the Banach space of continuous functions on $[a, b]$ with the "sup" norm; $C = \{x : x^l(\tau) \le x(\tau) \le x^u(\tau)\}$, where $x^l$ and $x^u$ are known piecewise constant functions; $H$ is linear and has a piecewise constant kernel; and the kernels for the data functionals $f_j(\tau)$ are differentiable with bounded derivative. Here we develop a more general treatment, illustrated with geophysical examples.

There are two places where discretization is an issue in the dual problem: in the dual functional, and in the dual constraints ($\lambda \cdot f^* \in C^*$). It is fairly straightforward to treat errors in the dual functional from discretization in concrete cases where the primal functionals are integrals against known functions (or sums with known weights). The integrability or moduli of continuity of the functions involved (or expressions for the weights in the sums) let one control the maximum possible error. If that is subtracted from the value of the discretized dual, the adjusted value of the discretized dual can not exceed the value of the exact dual. Discretized constraints are just as easy to treat, but conceptually are quite different. The constraints in the dual, when there are any, are infinite-dimensional. Discretizing the constraint set generally admits $\lambda$'s that are not in the continuous constraint set. In order to ensure that the value of the discretized dual is conservative, we must impose the discrete constraints in such a way that the feasible set for the discretized dual is a subset of the feasible set for the continuous problem (or at least that the optimal $\lambda$ in the discretized problem is an element of the original feasible set).

Note that any feasible point of the properly discretized dual gives a rigorous bound on the value of the primal problem. Optimizing the dual functional gives tighter and tighter bounds, but it may be that the computational expense involved in optimizing the dual functional is not worthwhile, if a reasonable (useful) feasible point can be obtained inexpensively and a complete optimization is onerous. This is particularly likely to be the case when the dual functional is not differentiable, unless the dual problem can be written as a linear programming problem. The numerical example in section 7.3.2 below illustrates this. It is also the case that when there are many data, and hence the dimension of the dual problem is large, it can be computationally advantageous to use only a subset of the data: the results remain conservative.

The sections that follow describe in greater detail the approach advocated in this paper: "squeezing" the value of the primal between the discretized primal, and a discretized dual adjusted for the two sources of error just mentioned. The approach for minimum norm problems, bounds on linear functionals of a model in a cone, and linear functionals of a model in a ball are illustrated with examples from gravimetry, seismology, and geomagnetism, respectively. A numerical example is given for linear functionals of a model in an infinity-norm ball in a seismic problem in section 7.3.2.


7.1 Ideal Bodies for Gravity Interpretation: a Minimum Infinity-Norm Problem

This section illustrates the approach with the "ideal bodies" technique for the interpretation of gravity anomalies (Parker [42, 44]). The problem is as follows: we make $n$ gravity anomaly measurements $\delta_j$ at points $\{\tau_j\}_{j=1}^n$, $\tau_j \in R^3$, at a height $\beta$ above the surface ($z = 0$) of the Earth $E$, a bounded subset of $R^3$. Let $\hat{z}$ be the vertical unit vector. The measured gravity anomalies are assumed to be due to a buried anomalous mass $x_0(r)$, plus observational noise. The data relations are

$$\delta_j = G \int_E \frac{\hat{z} \cdot (\tau_j - r)}{\|\tau_j - r\|_2^3}\, x_0(r)\, dr + \epsilon_j \qquad (46)$$

where $G$ is the gravitational constant and $\|\tau\|_2 = (\sum_{k=1}^3 \tau_k^2)^{1/2}$ is the Euclidean 2-norm on $R^3$. We wish to use the data and the (assumed) known distribution of $\epsilon$ to get a one-sided confidence interval for the maximum density contrast of the anomalous body $x_0$. This problem is the basis for inferences about the depth of burial and other interesting properties of the anomalous mass; see Parker [42, 44] for details.

In the discretized primal, we minimize the norm of the solution over a subspace of the entire model space. In the dual problem, we wish to discretize the constraint $\|\lambda \cdot f^*\| \le 1$; in order to do so, we need some control over the discrepancy between the functionals $f^*$ and their discrete approximations.

Here $X$ is the space of functions on the bounded domain $E \subset R^3$ with finite "sup" norm; the normed dual space of $X$ is the linear space of finitely additive signed measures with finite (unsigned) mass:

$$X^* = \Big\{\mu : \int_E |d\mu| < \infty\Big\}.$$

The measures representing $f_j^*$ are absolutely continuous with respect to Lebesgue measure $dr$, and $f^*[x]$ may be written

$$f^*[x] = \int_E f(r) x(r)\, dr,$$

where $f$ is the $n$-vector of continuous real-valued functions

$$f_j(r) = G\, \frac{\hat{z} \cdot (\tau_j - r)}{\|\tau_j - r\|_2^3}$$

and $dr$ is Lebesgue measure on $R^3$.


7.1.1 Discretizing the Primal

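We may discretize the primal problem by partitioning the region $E$ into $m$ "boxes" $\Delta_i$, and defining the indicator functions

$$\chi_i(r) = \begin{cases} 1, & r \in \Delta_i \\ 0, & \text{otherwise;} \end{cases}$$

and the matrix $\Phi$ with elements

$$\Phi_{ji} = f_j^*[\chi_i].$$

Let $\alpha \in R^m$. The discrete primal problem is then to find

$$\min \{\sup_i |\alpha_i|\} \text{ such that } \|\Phi \cdot \alpha - \delta\|_1 \le \chi.$$

This may be written as a linear programming problem:

$$\min \sigma$$

such that

$$\begin{aligned}
\sigma &\ge 0 && (R) \\
0 \le \pi, \nu &\le \sigma 1 && (R^m) \\
\Phi \cdot (\pi - \nu) - \delta + \eta - \zeta &= 0 && (R^n) \\
\eta, \zeta &\ge 0 && (R^n) \\
1 \cdot (\eta + \zeta) &\le \chi && (R)
\end{aligned}$$

where $\alpha = \pi - \nu$, $1 = (1, 1, \ldots, 1)^T$ and $0 = (0, 0, \ldots, 0)^T$.

See Huestis and Ander [28] and Huestis [27] for techniques when the fit to the data is measured in the infinity-norm.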
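The linear program above can be handed to an off-the-shelf solver. The sketch below is one possible encoding using SciPy's `linprog`, not the code used for the paper; the function name `ideal_body_primal`, the variable ordering, and the tiny one-datum, two-box test problem are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import linprog

def ideal_body_primal(phi, delta, chi):
    """min sup_i |alpha_i| subject to ||phi @ alpha - delta||_1 <= chi, as an LP.

    Variable order: [sigma, pi (m), nu (m), eta (n), zeta (n)], alpha = pi - nu.
    """
    n, m = phi.shape
    nvar = 1 + 2 * m + 2 * n
    cost = np.zeros(nvar)
    cost[0] = 1.0                                    # minimize sigma

    # pi_i - sigma <= 0 and nu_i - sigma <= 0
    box = np.zeros((2 * m, nvar))
    box[:, 0] = -1.0
    box[:, 1:1 + 2 * m] = np.eye(2 * m)
    # 1 . (eta + zeta) <= chi
    misfit = np.zeros((1, nvar))
    misfit[0, 1 + 2 * m:] = 1.0
    a_ub = np.vstack([box, misfit])
    b_ub = np.concatenate([np.zeros(2 * m), [chi]])

    # phi (pi - nu) + eta - zeta = delta
    a_eq = np.hstack([np.zeros((n, 1)), phi, -phi, np.eye(n), -np.eye(n)])

    res = linprog(cost, A_ub=a_ub, b_ub=b_ub, A_eq=a_eq, b_eq=delta,
                  bounds=[(0, None)] * nvar, method="highs")
    alpha = res.x[1:1 + m] - res.x[1 + m:1 + 2 * m]
    return res.fun, alpha

# Hypothetical check: one datum equal to alpha_1 + alpha_2, delta = 2, chi = 0.5.
sigma, alpha = ideal_body_primal(np.array([[1.0, 1.0]]), np.array([2.0]), 0.5)
```

In the check, the datum only requires $\alpha_1 + \alpha_2 \ge 1.5$, so the smallest possible sup-norm is 0.75, attained with both coefficients equal to 0.75.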

7.1.2 Discretizing the Dual

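Recall that the dual problem is to find

$$\sup_{\{\lambda : \|\lambda \cdot f^*\|_1 \le 1\}} \lambda \cdot \delta - \chi \|\lambda\|_\infty.$$

The norm of the functional $\lambda \cdot f^*$ is

$$\|\lambda \cdot f^*\|_1 = \int_E |\lambda \cdot f(r)|\, dr.$$

We may discretize the dual problem by dividing $E$ into $m$ boxes $\Delta_i$ as in the discretized primal. Let $f^i$ be the $n$-vector with components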

f}-l= jT


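let $\tilde{f}(r)$ be the corresponding piecewise constant function

$$\tilde{f}(r) = f^i, \quad r \in \Delta_i,$$

and let $\tilde{f}^*$ be the corresponding vector of functionals:

$$\tilde{f}^*[x] \equiv \int_E \tilde{f}(r) x(r)\, dr.$$

We wish to compare

$$\|\lambda \cdot f^*\|_1 = \int_E |\lambda \cdot f(r)|\, dr$$

with its discrete approximation

$$\|\lambda \cdot \tilde{f}^*\|_1 = \sum_i |\Delta_i|\, |\lambda \cdot f^i|.$$

Denote the magnitude of their difference by $\mathcal{E}$:

$$\begin{aligned}
\mathcal{E} &\equiv \big|\, \|\lambda \cdot f^*\|_1 - \|\lambda \cdot \tilde{f}^*\|_1 \big| \\
&\le \|\lambda \cdot (f^* - \tilde{f}^*)\|_1 \\
&\le |\lambda| \cdot (\|f_j^* - \tilde{f}_j^*\|_1)_{j=1}^n \equiv |\lambda| \cdot \varrho. \qquad (47)
\end{aligned}$$

Now the rightmost term is

$$\varrho = (\|f_j^* - \tilde{f}_j^*\|_1)_{j=1}^n = \Big(\sum_i \int_{\Delta_i} |f_j(r) - f_j^i|\, dr\Big)_{j=1}^n. \qquad (48)$$

For some sorts of prismatic $\Delta_i$, this can be computed analytically. (See section 7.3.2 for a similar result.) For others, it will be necessary to bound the components of $\varrho$, which can be achieved using the modulus of continuity: Define $\rho^i$ to be the $n$-vector with components:

$$\rho_j^i \equiv \sup_{r, r' \in \Delta_i} |f_j(r) - f_j(r')|.$$

Clearly

$$\varrho_j \le \sum_i |\Delta_i|\, \rho_j^i,$$

where $|\Delta_i|$ is the volume of $\Delta_i$. The monotonicity of the functions $f_j(r)$ with distance from $\tau_j$ makes computation of $\rho^i$ straightforward.

Since the $f_j(r)$ are differentiable, there is another inequality we may use as a bound:

$$\rho_j^i \le |\Delta_i| \sum_{k=1}^3 \sup_{r \in \Delta_i} |\partial_{r_k} f_j(r)|.$$

These two final techniques rely on the fact that the observations are made above the surface, at height at least $\beta > 0$; otherwise, the $f_j$ have singularities in $E$ and the moduli of continuity are infinite.

By one of these methods or some other, we can find a vector $\varrho$ satisfying

$$\mathcal{E} \le |\lambda| \cdot \varrho.$$

Thus we have

$$\|\lambda \cdot f^*\|_1 \le \sum_i |\Delta_i|\, |\lambda \cdot f^i| + |\lambda| \cdot \varrho. \qquad (49)$$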
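The following sketch checks a one-dimensional analogue of inequality 49 numerically. It is only an illustration under stated assumptions: the domain is the interval $[0, 1]$ rather than a region of $R^3$, the samples $f_j^i$ are taken at box midpoints, the modulus of continuity in each box is bounded by the box width times a bound on $|f_j'|$, and the two kernels ($\cos t$ and $t$) and the vector $\lambda$ are hypothetical. The function name `discretization_bound` is likewise invented for the example.

```python
import numpy as np

def discretization_bound(lam, kernels, dkernel_sup, edges):
    """One-dimensional analogue of (49) with midpoint samples of the kernels.

    kernels     : list of n vectorized callables f_j(t)
    dkernel_sup : list of n numbers bounding sup |f_j'(t)| on the interval
    edges       : m+1 increasing box edges
    Returns the discrete norm and the conservative upper bound on the exact norm.
    """
    widths = np.diff(edges)
    mids = 0.5 * (edges[:-1] + edges[1:])
    samples = np.array([f(mids) for f in kernels])       # n x m array of f_j^i
    discrete = np.sum(widths * np.abs(lam @ samples))    # sum_i |Delta_i| |lam . f^i|
    # modulus of continuity in box i is at most width_i * sup |f_j'|
    rho = np.outer(dkernel_sup, widths)                  # n x m
    err = rho @ widths                                   # bound on the error vector, one entry per datum
    return discrete, discrete + np.abs(lam) @ err

# Hypothetical kernels and multiplier vector.
kernels = [np.cos, lambda t: t]
lam = np.array([1.0, -2.0])
discrete, upper = discretization_bound(lam, kernels, [1.0, 1.0],
                                       np.linspace(0.0, 1.0, 11))

# Reference value of the exact norm by fine trapezoid quadrature.
fine = np.linspace(0.0, 1.0, 200001)
vals = np.abs(np.cos(fine) - 2.0 * fine)
exact = np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(fine))
```

With ten equal boxes the error-bound vector has entries 0.1, so `upper` exceeds `discrete` by 0.3, and the finely integrated norm `exact` falls below `upper`, as inequality 49 requires. Midpoint sampling makes the actual error far smaller than the bound, which shows how conservative the modulus-of-continuity estimate can be.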

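Let $\tilde{\Phi}$ be the $n$ by $m$ matrix whose elements are given by

$$\tilde{\Phi}_{ji} = |\Delta_i|\, f_j^i.$$

Write $\lambda$ as the difference between two positive $n$-vectors $\pi$ and $\nu$, and let $\eta$ and $\zeta$ be two positive $m$-vectors. Equation 49 shows that the set

$$\{\lambda : \|\lambda \cdot f^*\|_1 \le 1\}$$

contains the set

$$\{\lambda = \pi - \nu : \pi, \nu \ge 0 \text{ and } (\pi - \nu) \cdot \tilde{\Phi} + \eta - \zeta = 0 \text{ and } 1 \cdot (\eta + \zeta) + \varrho \cdot (\pi + \nu) \le 1\}$$

where $1$ is the vector $(1, \ldots, 1)^T$. The conservatively discretized dual problem is thus the (large) linear programming problem

$$\min\ (-\delta + f^*[x']) \cdot (\pi - \nu) + \sigma \chi$$

such that

$$\begin{aligned}
\sigma &\ge 0 && (R) \\
0 \le \pi, \nu &\le \sigma 1 && (R^n) \\
\eta, \zeta &\ge 0 && (R^m) \\
(\pi - \nu) \cdot \tilde{\Phi} + \eta - \zeta &= 0 && (R^m) \\
1 \cdot (\eta + \zeta) + \varrho \cdot (\pi + \nu) &\le 1.
\end{aligned}$$

The value of this discretized dual is less than the value of the exact dual

$$\max_{\{\lambda : \|\lambda \cdot f^*\|_1 \le 1\}} \{\lambda \cdot (\delta - f^*[x']) - \chi \|\lambda\|_\infty\}$$

which is less than or equal to the value of the primal

$$\inf_{x \in C \cap D} \{\sup_r |x(r)|\}$$

where

$$C = X = \{\text{functions on } E \text{ with finite sup norm}\},$$

and

$$D \equiv \{x \in X : \|f^*[x] - \delta\|_1 \le \chi\}.$$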

7.2 Linear Functionals of a Model in a Cone: Seismic Velocity Bounds from $\tau(p)$ and $X(p)$

This section illustrates bounding discretization error in the dual for bounds on linear functionals of a model in a cone, with the seismic inverse problem of inferring bounds on velocity as a function of depth from $\tau(p)$ (vertical delay time as a function of ray parameter) and $X(p)$ (horizontal distance as a function of ray parameter) data from a halfspace in which velocity varies only with depth, and increases monotonically with depth. Section 7.3.2 covers an extension, derived from a problem in a spherical geometry, in which an upper bound on the rate of increase of velocity with respect to depth is available. Section 10.2 covers an extension to the case where it is not known precisely which data functionals were measured. The approach taken here originated with Garmany [21] and is discussed and developed in Garmany et al. [22], Orcutt [39], and Stark and others [64, 63, 61, 59, 60].

In this section, section 7.3.2, and section 10.2, $p$ stands for ray parameter and takes on real, rather than integer values. We model the Earth in terms of depth $x$ as a function of velocity $v$, rather than velocity as a function of depth. The data mapping functionals for this problem are

$$f_j^*[x] = \int_a^b 2\, \mathrm{Re}\{(v^{-2} - p_j^2)^{1/2}\}\, dx(v), \quad j = 1, \ldots, n_\tau \qquad (50)$$

$$f_j^*[x] = \int_a^b 2 p_j\, \mathrm{Re}\{(v^{-2} - p_j^2)^{-1/2}\}\, dx(v), \quad j = n_\tau + 1, \ldots, n \qquad (51)$$

where $a$ is the velocity at the surface of the Earth, $b \ge \max_j \{p_j^{-1}\}$, and $x$ is a measure on $[a, b]$. The equations 50 are for the measurements of $\tau(p)$; the equations 51 are for $X(p)$ measurements. To ensure that $x(v)$ assigns a single velocity to each depth (excluding a countable number of discontinuities) we must ask that $x$ be a positive measure on $[a, b]$. Perhaps the most interesting linear functional to consider is

$$H_v[x] \equiv \int_a^v dx(v')$$

which gives the depth to the velocity $v$. It is the equivalent of the point evaluator for this problem: by finding strict bounds on $H_v$ for many values of $v$, we can construct a "corridor" containing the models consistent with the data (in the sense that $\|f^*[x] - \delta\| \le \chi$) and the prior constraint $x(dv) \ge 0$.
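As a concrete reading of the data functionals 50 and 51, the sketch below evaluates them for a model made of constant-velocity layers, which corresponds to a measure $x$ that places a point mass equal to the layer thickness at each layer velocity. This is an illustration only; the function name `tau_and_x` and the two-layer model are hypothetical. Layers in which $v^{-2} - p^2$ is negative lie below the turning point of the ray, and the real parts in 50 and 51 vanish there.

```python
import numpy as np

def tau_and_x(p, velocities, thicknesses):
    """tau(p) and X(p) of (50)-(51) when the measure x(dv) is a sum of point masses.

    velocities  : layer velocities v_i
    thicknesses : layer thicknesses, the mass x assigns to each v_i
    """
    v = np.asarray(velocities, dtype=float)
    dx = np.asarray(thicknesses, dtype=float)
    q2 = v ** -2.0 - p ** 2        # squared vertical slowness in each layer
    above = q2 > 0                 # layers above the turning point; the rest contribute zero
    tau = np.sum(2.0 * np.sqrt(q2[above]) * dx[above])
    dist = np.sum(2.0 * p / np.sqrt(q2[above]) * dx[above])
    return tau, dist

# Hypothetical two-layer model: velocities 2 and 4, unit thicknesses, p = 0.3.
tau, dist = tau_and_x(0.3, [2.0, 4.0], [1.0, 1.0])
```

For this model only the first layer contributes, since $1/4^2 < 0.3^2$: the vertical slowness there is $(0.25 - 0.09)^{1/2} = 0.4$, so $\tau = 0.8$ and $X = 1.5$.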

7.2.1 Discretizing the Primal

There are numerous references for discretizing the problem of finding bounds on linear functionals in positivity-constrained problems using numerical programming, including Garmany, Orcutt, and Parker [22], Lang [31], Oldenburg [38], Rust and Burris [49], Sabatier [51, 52], Safon, Vasseur and Cuer [54], and Stark and others [59, 64, 63, 62]. Of these, Lang [31], Rust and Burris [49], Stark and Parker [63], and Stark [59] address the two-norm measure of misfit to the data.

As usual, the primal problem may be discretized by parametrizing the depth-velocity model in terms of a linear combination of a known finite set of functions. In this problem, piecewise constant, piecewise linear, piecewise quadratic, and piecewise reciprocal quadratic representations have been used. Once the approximating functions have been chosen it is straightforward to discretize the primal problem; the reader is referred to the references above.

7.2.2 Discretizing the Dual

It seems to be considerably more difficult to discretize the dual to this problem in a sharp way than it is for prior balls; even positivity constraints seem to lead to mixed-integer linear programming, a combinatorial optimization problem. It is still possible to guarantee conservative answers to the discretized dual, but in general the answers will be very conservative. The fundamental difficulty is that positivity constraints in the primal lead to the constraint in the discretized dual that the absolute value of some quantities be greater than a constant. Linear programming can readily impose the constraint $|x_i| \le c$, but not $|x_i| \ge c$. The constraint $|x_i| \le c$ can be written $x_i \le c$ and $-x_i \le c$, whereas $|x_i| \ge c$ involves a disjunction: $x_i \ge c$ or $-x_i \ge c$. Such disjunctions can, however, be treated with mixed-integer linear programming (see, for example, Papadimitriou and Steiglitz [41]). It is also possible to use a stronger constraint instead so that linear programming can solve the discretized dual at the cost of a less sharp, but still conservative result.
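The disjunction $x \ge c$ or $-x \ge c$ can be encoded with one binary variable and a "big-M" constant. The sketch below shows this standard device on a toy one-variable problem using SciPy's `milp`; it is not taken from the paper, and the function name `min_with_abs_lower_bound`, the bounds, and the value of the big-M constant are assumptions chosen for illustration.

```python
import numpy as np
from scipy.optimize import Bounds, LinearConstraint, milp

def min_with_abs_lower_bound(thresh, lo, hi, big_m):
    """Minimize x subject to |x| >= thresh and lo <= x <= hi, with one binary b.

    b = 0 selects the branch x >= thresh; b = 1 selects the branch -x >= thresh.
    big_m must exceed thresh + max(|lo|, |hi|) so the unselected branch is slack.
    """
    # variables: [x, b]
    rows = np.array([[1.0, big_m],       #  x + M b >= thresh
                     [-1.0, -big_m]])    # -x - M b >= thresh - M
    constraints = LinearConstraint(rows, lb=[thresh, thresh - big_m],
                                   ub=[np.inf, np.inf])
    res = milp(c=[1.0, 0.0], integrality=[0, 1],
               bounds=Bounds([lo, 0.0], [hi, 1.0]),
               constraints=constraints)
    return res.x[0], int(round(res.x[1]))

# With x confined to [-0.5, 5], only the branch x >= 1 is feasible.
x_right, b_right = min_with_abs_lower_bound(1.0, -0.5, 5.0, 10.0)
# With x allowed down to -5, the branch -x >= 1 gives the smaller minimum.
x_left, b_left = min_with_abs_lower_bound(1.0, -5.0, 5.0, 10.0)
```

Each such constraint adds a binary variable, which is why the sharp discretized dual becomes a combinatorial problem as the number of sample points grows.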

Finally, there are other methods for solving semi-infinite linear programming problems, such as imposing the positivity constraints at a finite grid without the "fudge factors" mentioned above, then a posteriori testing to see if the constraint is violated between grid points. If so, more points are added to the grid, etc. To do this rigorously again requires the use of the modulus of continuity of the functions to determine how fine a grid one must use when performing the test for violations of the positivity constraint. See Hettich [25] and Hu [26] for more details, and other algorithms.
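The grid-and-test idea just described can be sketched as follows. This is a non-rigorous illustration, not one of the algorithms of Hettich [25] or Hu [26]: the a posteriori test is done on a fixed fine grid rather than one sized from moduli of continuity, the objective is a generic linear function of $\lambda$, and the function name `refine_grid_lp`, its parameters, and the toy constraint function are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def refine_grid_lp(cost, f, h, a, b, n_start=3, n_check=2001, tol=1e-7, max_iter=50):
    """Maximize cost.lam subject to lam.f(t) <= h(t) on [a, b] by grid exchange.

    f(t) returns an (n, len(t)) array of kernel values, h(t) a len(t) array.
    The test on the fine grid is itself only a finite check; a rigorous version
    would choose that grid from the moduli of continuity of f and h.
    """
    grid = np.linspace(a, b, n_start)
    check = np.linspace(a, b, n_check)
    n = len(cost)
    for _ in range(max_iter):
        res = linprog(-np.asarray(cost), A_ub=f(grid).T, b_ub=h(grid),
                      bounds=[(None, None)] * n, method="highs")
        lam = res.x
        slack = h(check) - lam @ f(check)
        worst = np.argmin(slack)
        if slack[worst] >= -tol:          # no violation found between grid points
            return lam, grid
        grid = np.sort(np.append(grid, check[worst]))
    raise RuntimeError("grid refinement did not converge")

# Hypothetical example: maximize lam subject to lam <= 1 + (t - 0.37)**2 on [0, 1].
lam, grid = refine_grid_lp([1.0],
                           lambda t: np.ones((1, len(t))),
                           lambda t: 1.0 + (t - 0.37) ** 2,
                           0.0, 1.0)
```

In the example the coarse three-point grid first yields a $\lambda$ that violates the constraint near $t = 0.37$; adding the worst violator to the grid brings $\lambda$ down to 1, the minimum of $h$.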

Here is a sketch of what is involved in following through for positivity constraints. Take $X$ to be functions on $[a, b]$, $C$ the set of functions pointwise positive on $[a, b]$. Assume as before that the functionals $f^*[x]$ can be written $f^*[x] = \int_a^b f(t) x(t)\, dt$ for $n$ continuous functions $f_j(t)$. We may readily see that $C^* \cap D^*$ is the set of linear functionals formed by linear combinations of the kernels $f_j(t)$ that are everywhere less than or equal to $h(t)$:

$$C^* \cap D^* = \Big\{x^* \in X^* : x^*[x] = \int_a^b \lambda \cdot f(t) x(t)\, dt \text{ and } h(t) - \lambda \cdot f(t) \ge 0,\ t \in [a, b]\Big\}.$$

The difficulty in discretizing this problem is somehow to characterize $C^* \cap D^*$ with only a finite number of constraints. Again, we resort to the modulus of continuity of the functions $f(t)$ and $h(t)$, and express the modulus of continuity of $h(t) - \lambda \cdot f(t)$ in terms of them. We want to be sure that the discrete samples of $h - \lambda \cdot f$ are sufficiently large that between samples the function can not become negative. There are two factors that "use up" the available modulus of continuity: the function can't go negative if its values at both endpoints are sufficiently large, nor if the difference between the values at the endpoints is sufficiently large (the function "consumes" its modulus climbing from one endpoint to the other). It is straightforward but tedious to turn this idea into an expression for how large the linear combination must be at its sample points to ensure it does not go negative between samples.

7.3 Linear Functionals of a Model in a Ball

In this section we consider discretizing the problem of finding strict bounds on a linear functional of a model where the prior information is that the model lies in a ball in some norm around a known point. We have

$$v(P) = \inf \left\{ h^*[x] : \|f^*[x] - \delta\|_{\Sigma^{-1}} \le \chi \ \text{ and } \ \|x - x'\| \le \nu \right\}$$

$$\ge v(D) = \max_{\lambda} \left\{ \lambda \cdot (\delta - f^*[x']) - \chi \|\lambda\|_{\Sigma} - \nu \|\lambda \cdot f^* - h^*\| + h^*[x'] \right\}.$$

See Safon et al. [54], Parker [42, 44], Stark and others [64, 63, 59], Lang and Marzetta [31, 32, 33] for examples where the ball is in the infinity norm. See Backus [7, 6] for a two-norm example.

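We shall assume that the vector $f^*[x']$ and the scalar $h^*[x']$ can be evaluated analytically, but that $\|\lambda \cdot f^* - h^*\|$ can not in general. We will discretize this term; we wish to bound the error committed. In examples, this amounts to bounding the error in a quadrature scheme, or in a truncated infinite sum. Denote by $\tilde f^*$ and $\tilde h^*$ the discrete approximations to $f^*$ and $h^*$. In analogy to equation 47, let

$$\begin{aligned}
E &\equiv \left| \, \|\lambda \cdot f^* - h^*\| - \|\lambda \cdot \tilde f^* - \tilde h^*\| \, \right| \\
&\le \|\lambda \cdot f^* - h^* - (\lambda \cdot \tilde f^* - \tilde h^*)\| \\
&= \|\lambda \cdot (f^* - \tilde f^*) - (h^* - \tilde h^*)\| \\
&\le \|\lambda \cdot (f^* - \tilde f^*)\| + \|h^* - \tilde h^*\| \\
&\le |\lambda| \cdot \left( \|f_j^* - \tilde f_j^*\| \right)_{j=1}^n + \|h^* - \tilde h^*\| \\
&\le \|\lambda\|_q \, \left\| \left( \|f_j^* - \tilde f_j^*\| \right)_{j=1}^n \right\|_p + \|h^* - \tilde h^*\|, \qquad (52)
\end{aligned}$$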

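where in the penultimate equation, the absolute value of $\lambda$ is taken componentwise. If we can find numbers $\varrho_f$ and $\varrho_h$ such that

$$\varrho_f \ge \left\| \left( \|f_j^* - \tilde f_j^*\| \right)_{j=1}^n \right\|_p \qquad (53)$$

and

$$\varrho_h \ge \|h^* - \tilde h^*\|, \qquad (54)$$

then

$$E \le \varrho_f \|\lambda\|_q + \varrho_h,$$

and hence

$$v(D) \ge \max_{\lambda} \left\{ \lambda \cdot (\delta - f^*[x']) - (\chi + \nu \varrho_f) \|\lambda\|_{\Sigma} - \nu \|\lambda \cdot \tilde f^* - \tilde h^*\| + h^*[x'] \right\} - \nu \varrho_h. \qquad (55)$$

Note that this amounts to solving the dual with an increased value of $\chi$, plus a constant term $-\nu \varrho_h$.

We now set out to find numbers $\varrho_f$ and $\varrho_h$ satisfying 53 and 54 using different discrete approximations to $f^*$ and $h^*$.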

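7.3.1 Discretizing the Dual in Lp Spaces

We specialize to $X = L_{p'}[a,b]$ ($1 \le p' < \infty$), $X^* = L_{q'}[a,b]$, $f_j^*[x] = \int_a^b f_j(t)\,x(t)\,dt$, and $h^*[x] = \int_a^b h(t)\,x(t)\,dt$. An example where $X = L_\infty$ appears in section 7.3.2. Divide the interval [a, b] as before into m subintervals

$$\Delta_i \equiv \begin{cases} [t_i, t_{i+1}), & 1 \le i < m \\ [t_m, t_{m+1}], & i = m \end{cases}$$

where $t_1 = a$ and $t_{m+1} = b$, and let $\Delta_i = t_{i+1} - t_i$ denote the length of the ith subinterval as well.

We will discretize $f_j$ and h by functions constant in each $\Delta_i$. Let these approximations have the values $\tilde f_j^i$ and $\tilde h^i$ in $\Delta_i$. Define

$$\mathcal{I}_i^j(f) \equiv \int_{\Delta_i} |f_j(t) - \tilde f_j^i|^{q'}\, dt \qquad (56)$$

$$\mathcal{I}_i(h) \equiv \int_{\Delta_i} |h(t) - \tilde h^i|^{q'}\, dt. \qquad (57)$$

We can then write

$$\|f_j^* - \tilde f_j^*\| = \left( \sum_{i=1}^m \mathcal{I}_i^j(f) \right)^{1/q'}$$

$$\|h^* - \tilde h^*\| = \left( \sum_{i=1}^m \mathcal{I}_i(h) \right)^{1/q'}.$$

These lead obviously to values of $\varrho_f$ and $\varrho_h$ that work in equation 55. The fact that $f_j, h \in L_{q'}$ implies that with any sensible choice of $\tilde f_j^*$ and $\tilde h^*$ these can be made as small as desired by choosing $\Delta_i$ sufficiently short.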

We will look at two different schemes for assigning $\tilde f_j^i$ and $\tilde h^i$ and finding the corresponding $\varrho_f$ and $\varrho_h$. The first is based directly on the integrability of $f_j$ and h; the second uses continuity properties of those functions that might hold in some problems. The first approach will be illustrated numerically in section 7.3.2 for a seismic problem.

Prototype 1. Define

$$\tilde f_j^i \equiv \frac{1}{\Delta_i} \int_{\Delta_i} f_j(t)\, dt \qquad (58)$$

$$\tilde h^i \equiv \frac{1}{\Delta_i} \int_{\Delta_i} h(t)\, dt. \qquad (59)$$

Since $f_j, h \in L_{q'}$ and $L_{q'}[a,b] \subset L_r[a,b]$ for $r \le q'$, the integrals exist. (If the original $L_{p'}$ space was weighted, weights can be introduced as needed.) This is in some sense the "correct" way to discretize problems in Lp spaces since it assumes only that the kernels $f_j(t)$ and h(t) are integrable.

Assuming we can bound $\mathcal{I}_i^j(f)$ and $\mathcal{I}_i(h)$, we are done: let $\rho_f$ be the n-vector with components

$$(\rho_f)_j \ge \left( \sum_{i=1}^m \mathcal{I}_i^j(f) \right)^{1/q'}.$$

Then $\varrho_f = \|\rho_f\|_p$ and $\varrho_h = \left( \sum_{i=1}^m \mathcal{I}_i(h) \right)^{1/q'}$ work. See section 7.3.2.
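The following is a minimal sketch of Prototype 1 for a single kernel, under assumptions chosen purely for illustration: the kernel is f(t) = t on [0, 1], its antiderivative is known in closed form, the cells are uniform, and q' = 2. The cell integrals of equation 56 are approximated here by a midpoint rule, which estimates them but is not itself a rigorous bound; a rigorous value of $\varrho_f$ would need an analytic bound or a quadrature error estimate. The function names are hypothetical.

```python
import math

def cell_averages(antiderivative, grid):
    """Equation 58: the mean of the kernel over each cell, from an antiderivative."""
    return [(antiderivative(b) - antiderivative(a)) / (b - a)
            for a, b in zip(grid, grid[1:])]

def lq_error(f, averages, grid, q, substeps=200):
    """Approximate (sum_i I_i(f))^(1/q) using a midpoint rule inside each cell."""
    total = 0.0
    for a, b, avg in zip(grid, grid[1:], averages):
        step = (b - a) / substeps
        total += step * sum(abs(f(a + (k + 0.5) * step) - avg) ** q
                            for k in range(substeps))
    return total ** (1.0 / q)

def error_for(m, q=2):
    """Discretization error for f(t) = t on [0, 1] with m uniform cells."""
    grid = [i / m for i in range(m + 1)]
    averages = cell_averages(lambda t: 0.5 * t * t, grid)
    return lq_error(lambda t: t, averages, grid, q)

err10 = error_for(10)   # analytically 1 / (10 * sqrt(12))
err20 = error_for(20)   # halving the cell width halves the error for this kernel
```

For this kernel the error can be worked out by hand as $1/(m\sqrt{12})$, which gives a check on the quadrature and shows the first-order decrease with cell width.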

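Prototype 2. This approach requires continuity of $f_j(t)$ and h(t). We discretize f by sampling at arbitrary points within the intervals $\Delta_i$. Let $\bar t_i \in \Delta_i$, $\tilde f_j^i \equiv f_j(\bar t_i)$, and $\tilde h^i \equiv h(\bar t_i)$. Define $\rho_f(\Delta)$ to be the n-vector whose jth component is

$$(\rho_f)_j(\Delta) \equiv \sup_{|t - t'| \le \Delta} |f_j(t) - f_j(t')|$$

and define

$$\rho_h(\Delta) \equiv \sup_{|t - t'| \le \Delta} |h(t) - h(t')|.$$

Then clearly

$$\|f_j^* - \tilde f_j^*\| = \left( \sum_{i=1}^m \int_{\Delta_i} |f_j(t) - \tilde f_j^i|^{q'}\, dt \right)^{1/q'} \le \left( \sum_{i=1}^m \Delta_i \left[ (\rho_f)_j(\Delta_i) \right]^{q'} \right)^{1/q'}$$

and similarly

$$\|h^* - \tilde h^*\| \le \left( \sum_{i=1}^m \Delta_i \left[ \rho_h(\Delta_i) \right]^{q'} \right)^{1/q'},$$

from which values of $\varrho_f$ and $\varrho_h$ follow directly by inspection of equations 53 and 54. As in the previous subsections, it may be easier to use different bounds on the change of the f(t) than to use $\rho_f$, or to use Prototype 1.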
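As an illustration of the Prototype 2 bound, here is a short sketch assuming a Lipschitz kernel, so that the modulus of continuity can be taken as $\rho(\Delta) = L\Delta$ (an assumption; any valid modulus could be passed instead). The example kernel f(t) = t, sampled at the left end of each cell, and the function name are hypothetical choices.

```python
import math

def prototype2_bound(grid, modulus, q):
    """(sum_i Delta_i * modulus(Delta_i)^q)^(1/q), a bound on the L_q sampling error."""
    widths = [b - a for a, b in zip(grid, grid[1:])]
    return sum(w * modulus(w) ** q for w in widths) ** (1.0 / q)

grid = [i / 10 for i in range(11)]

# f(t) = t has Lipschitz constant 1, so modulus(width) = width.
bound = prototype2_bound(grid, lambda width: 1.0 * width, q=2)

# Exact L_2 error of left-endpoint sampling for f(t) = t:
# each cell contributes the integral of s^2 over [0, Delta], i.e. Delta^3 / 3.
exact_left = math.sqrt(sum((b - a) ** 3 / 3.0 for a, b in zip(grid, grid[1:])))
```

In this example the bound is valid but not tight, since the supremum over each cell is attained only at one end; this is the price of using continuity information alone.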

7.3.2 Numerical Example: Seismic Velocity in Earth's Core

See section 7.2 for the problem set-up; the difference here is that we will incorporate the constraint, developed in [64], that $0 \le x_0(dv) \le R\,dv/v$, where R is the radius of Earth's core. The addition of the upper bound on the measure forces $x_0$ to be continuous with respect to Lebesgue measure; thus dx/dv exists and is at most R/v. This means that our model $x_0$ is in the space of continuous functions possessing finite first derivatives. In a slight stretch of notation, we will denote this space by $C^{1-}[a,b]$. This is to draw a distinction from the usual notation, in which $C^1[a,b]$ is the space of continuously differentiable functions on [a, b]. Thus $x_0$ lies in the ball

$$C = \left\{ x \in C^{1-}[a,b] : 0 \le \frac{dx}{dv} \le \frac{R}{v} \right\}.$$

Now since we are interpreting x as a measure, the following functional is really a norm:

$$\|x\| \equiv \sup_{v \in [a,b]} \left| \frac{2v}{R} \frac{dx(v)}{dv} \right|. \qquad (60)$$


Now endow $C^{1-}[a,b]$ with that norm. Note that it is unimportant for the subsequent analysis whether $C^{1-}[a,b]$ is complete with respect to the norm 60. We may write $C = \{x \in C^{1-}[a,b] : \|x - x'\| \le 1\}$, where x' is defined by $dx'/dv = R/(2v)$. Now the norm on the dual space X* is by definition

$$\|x^*\| \equiv \sup_{\|x\| \le 1} x^*[x].$$

If $x^*[x]$ can be written $x^*[x] = \int_a^b x^*(v)\, dx(v)$, then it is easy to see that

$$\|x^*\| = \int_a^b \frac{R}{2v} |x^*(v)|\, dv, \qquad (61)$$

which is a weighted 1-norm of the function $x^*(v)$. We are interested only in norms of $\lambda \cdot f^* - h^*$, which can be written in that form, as integrals of dx/dv against ordinary Lebesgue integrable functions. This is important, as the dual of $L_\infty$ is rather nastier than $L_1$, which is not the dual of any normed space.

This example is particularly interesting since some of the data functionals (those corresponding to X(p) data; see equation 63) have integrable singularities. Recall from section 7.2 that the functionals h* of interest in this problem are of the form

$$h^*[x] = \int_a^b 1_{\{v \le v_h\}}\, dx(v),$$

where $1_Q$ is the indicator function of the set Q. The integrals $f^*[x']$ and $h^*[x']$ can be performed analytically; the results are in appendix A. Thus as indicated above in section 7.3.1, we are concerned only with the discrete approximation of $\|\lambda \cdot f^* - h^*\|$. For our choice of h* in this problem, the second term in 55 can be made to equal zero by including $v_h$ among the points in the discretization of [a, b]. Hence the only approximation we need to control is $\|\lambda \cdot (f^* - \tilde f^*)\|$. The kernels $f_j(v)$ are as follows (see equations 50 and 51):

$$f_j(v) = 2\,\mathrm{Re}\{(v^{-2} - p_j^2)^{1/2}\}, \quad j = 1, \ldots, n_\tau \qquad (62)$$

$$f_j(v) = 2 p_j\,\mathrm{Re}\{(v^{-2} - p_j^2)^{-1/2}\}, \quad j = n_\tau + 1, \ldots, n \qquad (63)$$

As the X(p) kernels 63 are singular at $v = p_j^{-1}$, their moduli of continuity are infinite, and Prototype 2 in section 7.3.1 fails. We use the approach in Prototype 1, which requires only integrability of the $\tau(p)$ and X(p) kernels.

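Define $\tilde f_j^i$ as in equation 58. In analogy to equation 56, but with weights, let

$$\mathcal{I}_i^j \equiv \int_{\Delta_i} \frac{R}{2v} \left| f_j(v) - \tilde f_j^i \right| dv = \int_{\Delta_i} \frac{R}{2v} \left| f_j(v) - \frac{1}{\Delta_i} \int_{\Delta_i} f_j(t)\, dt \right| dv. \qquad (64)$$

Then

$$\|f_j^* - \tilde f_j^*\| = \sum_{i=1}^m \mathcal{I}_i^j \qquad (65)$$

which can be made as small as desired by making $\Delta_i$ sufficiently short. Let $\rho_f$ be the n-vector whose jth component is

$$(\rho_f)_j \ge \|f_j^* - \tilde f_j^*\|. \qquad (66)$$

If we let

$$\varrho_f \ge \|\rho_f\|_p \qquad (67)$$

equation 55 will hold. Appendix A shows how to compute $\varrho_f$ numerically.

Figure 3 shows the $\tau(p)$ and X(p) kernels for the data we will analyze. The data are from the work of Johnson and Lee [30]; strict bounds in the discretized primal were found for this data set by Stark and others [64, 63].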

Johnson and Lee analyze the data using the strict bounds technique of Bessonova and others [10, 11], which requires interpolation of the finite set of data available. Stark et al. [64] demonstrate that the numerical programming approach followed here is superior; see also the argument in Orcutt et al. [40].

It is clear from Figure 3 that for $\varrho_f$ to be small, the discretization must be very fine near the origin, where the $\tau(p)$ kernels change rapidly, and in the range 25-35 km s$^{-1}$, near the singularities of the X(p) kernels.

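Once we have computed $\varrho_f$, we need to solve the discretized dual optimization problem. We will take p = 2 and $\Sigma$ diagonal following Stark and Parker [63, 61]. Let $\omega$ be the m-vector of integrals of R/(2v) over the intervals of discretization:

$$\omega_i \equiv \int_{\Delta_i} \frac{R}{2v}\, dv = \frac{R}{2} \ln \frac{v_{i+1}}{v_i}.$$

Let $\Phi$ be the matrix of the discretized values of f, weighted by $\omega$:

$$\Phi_{ji} \equiv \tilde f_j^i \omega_i; \qquad (68)$$

let $\gamma$ be the vector of the discretized values of h weighted by $\omega$:

$$\gamma_i \equiv \tilde h^i \omega_i; \qquad (69)$$

and let

$$\delta' \equiv \delta - f^*[x'] \qquad (70)$$

$$\chi' \equiv \chi + \varrho_f. \qquad (71)$$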


Then, provided we include $v_h$ in the discretization of [a, b], the adjusted discrete dual is

$$\begin{aligned}
v(\tilde D) &= \max_{\lambda} \left\{ \lambda \cdot \delta' - \chi' \|\lambda\|_{\Sigma} - \|\lambda \cdot \Phi - \gamma\|_1 + h^*[x'] \right\} \\
&= h^*[x'] - \min_{\lambda} \left\{ -\lambda \cdot \delta' + \chi' \|\lambda\|_{\Sigma} + \|\lambda \cdot \Phi - \gamma\|_1 \right\} \qquad (72) \\
&= h^*[x'] - \min \left\{ \chi' \|\lambda\|_{\Sigma} + \mathbf{1} \cdot (\pi, \sigma) - \lambda \cdot \delta' \ : \ \lambda \in R^n; \ \pi, \sigma \in R^m; \ \pi, \sigma \ge 0; \ \lambda \cdot \Phi + \pi - \sigma = \gamma \right\}
\end{aligned}$$

where $\mathbf{1}$ is a 2m-vector of ones. This final optimization is a quadratic programming problem in n + 2m variables, with positivity constraints on 2m of those variables, and m linear equality constraints. Unfortunately, the value of m needed to get $\varrho_f$ down to 2.29, which is about 4% of $\chi = 59.7$ in this problem, is about 19,000. Since n is only 30, it is substantially more efficient to use unconstrained nonlinear optimization and equation 72. Although it is continuous, the objective functional is not differentiable everywhere. However, since the functional is concave, subgradients of the negative of the objective functional do exist; appendix A gives such a subgradient. The objective functionals were minimized using a two step procedure involving the Nelder-Mead simplex method (see [48, 20]) and Stanford Systems Optimization Laboratory algorithm NPSOL. See section A.4 for details.
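To make the unconstrained form of equation 72 concrete, here is a small sketch that evaluates the objective and one supergradient (the negative of a subgradient of the negated objective). It is not the appendix A computation: it assumes $\Sigma = I$ and p = 2, so the dual norm is the ordinary Euclidean norm, and it uses made-up toy values for the nodes, kernels, data, and $\chi'$. All function names are hypothetical.

```python
import math

def log_weights(nodes, radius):
    """omega_i = (R/2) ln(v_{i+1} / v_i) for consecutive nodes v_i."""
    return [0.5 * radius * math.log(hi / lo) for lo, hi in zip(nodes, nodes[1:])]

def dual_objective(lam, delta_p, chi_p, Phi, gamma, h0=0.0):
    """Value and one supergradient of
    lam.delta' - chi' * ||lam||_2 - ||lam.Phi - gamma||_1 + h0   (Sigma = I assumed)."""
    n, m = len(Phi), len(gamma)
    resid = [sum(lam[j] * Phi[j][i] for j in range(n)) - gamma[i] for i in range(m)]
    norm_lam = math.sqrt(sum(x * x for x in lam))
    value = (sum(l * d for l, d in zip(lam, delta_p))
             - chi_p * norm_lam - sum(abs(r) for r in resid) + h0)
    # sign(0) = 0 is a valid choice at the kinks of the 1-norm.
    signs = [(r > 0) - (r < 0) for r in resid]
    supergrad = []
    for j in range(n):
        s = delta_p[j] - sum(Phi[j][i] * signs[i] for i in range(m))
        if norm_lam > 0.0:              # at lam = 0, zero is a valid choice
            s -= chi_p * lam[j] / norm_lam
        supergrad.append(s)
    return value, supergrad

# Toy problem (all numbers are illustrative assumptions): n = 2 data, m = 3 cells.
nodes = [1.0, 2.0, 4.0, 8.0]
w = log_weights(nodes, radius=2.0)          # each weight equals ln 2
f_cells = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
h_cells = [1.0, 1.0, 0.0]
Phi = [[f_cells[j][i] * w[i] for i in range(3)] for j in range(2)]
gamma = [h_cells[i] * w[i] for i in range(3)]
lam0 = [0.2, -0.1]
value, supergrad = dual_objective(lam0, [0.3, -0.2], 0.5, Phi, gamma)
```

Because the objective is concave, the returned vector s satisfies $g(\mu) \le g(\lambda) + s \cdot (\mu - \lambda)$ for every $\mu$, which is the property a nonsmooth optimizer relies on. Each evaluation costs on the order of nm operations, which is why the unconstrained form remains practical when m is in the tens of thousands and n is small.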

Figure 1 compares the solutions to this problem for 89 values of $v_h$ (178 optimization problems, solid lines) with the discretized primal results of Stark and Parker [61] (dashed lines), using the data from Johnson and Lee [30], with uncertainties assigned to the X(p) data by Stark and Parker [63], and 19,118 points in the discretized dual. The primal results were obtained using 100 basis functions, equally spaced in 1/v, plus extra basis functions on either side of $v_h$. The results were essentially unchanged when 200 basis functions were used. The computational algorithm used solves the problem as a sequence of linearly constrained, quadratic programming problems with upper and lower bounds on the variables. The central algorithm of the method in [61] was modeled after the phenomenally robust NNLS algorithm of Lawson and Hanson [34].

The bounds in figure 1 from the dual are extra-conservative, as the algorithmic procedure used to solve the dual may not find the global optimal values. Note, however, that any value of the discretized dual is a rigorous bound on the value of the primal: by spending more computer money to find better values of the discrete dual functional, we get tighter results, but no extra rigor.

Note that there is no duality gap for this problem, by conditions given in section 11: the feasibility of the primal problem guarantees that the dual is finite, and solutions in C exist with misfit less than $\chi$. In cases where there is known to be no duality gap, there is another way to bracket the true confidence interval: the value of the discretized, corrected dual is less than the value of the true dual (which is equal to the true primal), and the value of the true dual is less than the value of the discrete dual without correction terms. Thus by solving the dual problem with two different values of $\chi'$ ($\chi + \varrho_f$ and $\chi$) we can bracket a confidence interval with at least its nominal coverage probability. For this procedure to work, one must have a computational algorithm for the dual that is guaranteed to converge to the global extremum. In spite of the convexity of the problem, the procedure used here may not achieve the global optimal value; thus using a smaller $\chi'$ would not necessarily bracket the correct results from the "inside."

7.3.3 Separable Hilbert Spaces: Core Magnetism

Backus [7] shows that the problem of inferring the value of a single coefficient in the spherical harmonic expansion of the core field, or of inferring the average magnetization of a disc on the core, using observations of Earth's magnetic field from satellites and ground observatories and the prior information that the equivalent mass of the energy stored in the field is less than the mass of Earth, can be posed in the infinite-dimensional space of coefficients in the spherical harmonic expansion of the magnetic potential on the core. The constraint that the energy in the field is bounded by a particular constant constrains the model to lie in a (weighted) two-norm ball. The functionals that evaluate a spherical harmonic component or the average field in a disc are obviously linear.

All separable Hilbert spaces are isomorphic to $l_2$, so without loss of generality, we assume some countable orthonormal basis has been chosen, and consider only $X = l_2$, the space of coefficients in that basis. We have assumed that $h^*$ and $f_j^*$ are bounded, so by the Riesz representation theorem there exist sequences $(h^*_k)_{k=1}^\infty$ and $(f^*_{jk})_{k=1}^\infty$, $j = 1, 2, \ldots, n$, such that the linear functionals can be written

$$h^*[x] = \sum_{k=1}^\infty h^*_k x_k,$$

$$f_j^*[x] = \sum_{k=1}^\infty f^*_{jk} x_k,$$

and we have $\|x\| = \left( \sum_k x_k^2 \right)^{1/2}$. As Hilbert space is self-dual, $\|f_j^*\| = \left( \sum_k (f^*_{jk})^2 \right)^{1/2}$.

Note that, as in example 1 of section 5.3, if the inner products $f_j^*[f_k^*]$ and $f_j^*[h^*]$ can be evaluated, there is no need to discretize the dual problem. In fact, there is no need to discretize the primal problem, as we shall see: It is intuitively clear on geometric grounds (the projection theorem) that the relevant subset of X for this problem is the subspace M spanned by the set $\{h^*, f_1^*, \ldots, f_n^*\}$. Any component not in M makes $\|x\|$ larger, and does not affect $\|f^*[x] - \delta\|$ nor $h^*[x]$.

(Or consider this more rigorous justification: M is closed since it is a finite-dimensional subspace; therefore X can be written as the direct sum $M \oplus M^\perp$, where $M^\perp$ is the orthogonal complement of M (see for example [35]). That is, any $x \in X$ can be written uniquely as x = y + z, where $y \in M$ and $z \in M^\perp$. Consider any sequence $(x_j)_{j=1}^\infty$ of feasible points for the primal problem for which $h^*[x_j] \to v(P)$. By direct substitution one may verify that the sequence $(y_j)_{j=1}^\infty$ of components of the $x_j$ in M is also feasible, and that $h^*[x_j] = h^*[y_j]$. Thus the value of the primal restricted to M is the same as for all X.)

Therefore, it suffices to consider $x \in M$. Let's assume that the set $\{h^*, f_1^*, \ldots, f_n^*\}$ is linearly independent, i.e. that M is (n + 1)-dimensional. Rename $f_{n+1}^* \equiv h^*$. We may write $x \in M$ uniquely as $x = \sum_{j=1}^{n+1} \beta_j f_j^*$. We have

$$\|x\|^2 = \beta \cdot \Gamma \cdot \beta,$$

where $\Gamma$ is the "augmented" Gram matrix with elements

$$\Gamma_{jk} = f_j^*[f_k^*], \quad j, k = 1, \ldots, n + 1.$$

Similarly, for $x \in M$,

$$\|f^*[x] - \delta\|^2 = \beta \cdot \Lambda \cdot \beta - 2 \delta \cdot \Gamma_f \cdot \beta + \|\delta\|^2,$$

where $\Lambda$ is the n + 1 by n + 1 matrix

$$\Lambda = \Gamma_f^T \Gamma_f,$$

and $\Gamma_f$ is the n by n + 1 submatrix of $\Gamma$ consisting of its first n rows (those corresponding to $f^*$). (This can be generalized easily to $\Sigma \ne I$ by the introduction of a weight matrix; the inner products needed are the same.) For $x \in M$ we also have $h^*[x] = \gamma \cdot \beta$, where $\gamma$ is the (n + 1)st row of $\Gamma$.

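Thus the primal is equivalent to finding

$$\min \left\{ \gamma \cdot \beta \ : \ \beta \cdot \Gamma \cdot \beta \le \nu^2 \ \text{ and } \ \beta \cdot \Lambda \cdot \beta - 2 \delta \cdot \Gamma_f \cdot \beta + \|\delta\|^2 \le \chi^2 \right\}.$$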

This can be achieved by means of two Lagrange multipliers, and implemented computationally by imposing $\gamma \cdot \beta = \zeta$ as a constraint, and minimizing $\mu \|x\|^2 + \lambda \|f^*[x] - \delta\|^2$. The argument of Stark and Parker [63] can be adapted to show that the minimum value of $\mu \|x\|^2 + \lambda \|f^*[x] - \delta\|^2$ is quasiconvex in $\zeta$. One may show then that there exist $\zeta$, $\mu \ge 0$, and $\lambda \ge 0$ such that

116112+ min A(/3 A.3-26If.r/)±p( .F./r3)= v(P)

i.e. the problem can be solved as a sequence of linearly constrained least-squaresproblems. See Lawson and Hanson [34] for details.
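The reduction to the (n + 1)-dimensional subspace M can be checked numerically. The sketch below uses short vectors in $R^4$ as stand-ins for the Riesz representers (an assumption made only so that the inner products are trivial to evaluate; in an application they would be the analytically computed inner products $f_j^*[f_k^*]$). It verifies that the norm, the predicted data, and the functional of interest are all recovered from the augmented Gram matrix and the coefficients $\beta$ alone.

```python
def dot(u, v):
    """Euclidean inner product, standing in for the Hilbert-space inner product."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical representers of f_1*, f_2* and h* (n = 2).
f = [[1.0, 0.5, 0.0, 0.0],
     [0.0, 1.0, 1.0, 0.0]]
h = [1.0, 0.0, 0.0, 2.0]
reps = f + [h]                       # rename f_{n+1}* = h*

# Augmented Gram matrix Gamma_jk = f_j*[f_k*], j, k = 1, ..., n + 1.
Gamma = [[dot(u, v) for v in reps] for u in reps]

# An element of M: x = sum_j beta_j f_j*.
beta = [0.3, -1.2, 0.7]
x = [sum(b * r[k] for b, r in zip(beta, reps)) for k in range(4)]

norm_sq_direct = dot(x, x)
norm_sq_gram = sum(beta[j] * Gamma[j][k] * beta[k]
                   for j in range(3) for k in range(3))

data_direct = [dot(fj, x) for fj in f]                   # f*[x]
data_gram = [dot(Gamma[j], beta) for j in range(2)]      # Gamma_f . beta

h_direct = dot(h, x)                                     # h*[x]
h_gram = dot(Gamma[2], beta)                             # gamma . beta
```

Only the (n + 1)(n + 2)/2 distinct entries of the Gram matrix involve the infinite-dimensional space; everything after that is finite-dimensional linear algebra.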

The Discrete Primal. Problems in separable Hilbert spaces are usually discretized by truncation, that is, by representing x, $f_j^*$, $h^*$ by their first m components $(x_1, x_2, \ldots, x_m)$, etc. Let $X_m$ denote the m-dimensional subspace of X where all components in the chosen basis other than the first m equal zero. Let $P_m$ denote the operator that projects an element of X onto $X_m$:

$$P_m x \equiv (x_1, x_2, \ldots, x_m, 0, 0, \ldots)$$

and let $P_{R^m}$ map X onto $R^m$ similarly:

$$P_{R^m} x \equiv (x_1, x_2, \ldots, x_m).$$

Let $P_m$ and $P_{R^m}$ act componentwise on n-tuples of elements of X; e.g. $P_m f^* = (P_m f_1^*, P_m f_2^*, \ldots, P_m f_n^*)$. Note that $f_j^*[P_m x] = (P_m f_j^*)[P_m x]$. Thus

$$D \cap X_m = \{x \in X_m : f^*[x] - \delta \in \Xi\} = \{x \in X_m : (P_m f^*)[x] - \delta \in \Xi\}. \qquad (73)$$

Let $\Phi = P_{R^m} f^*$ be the n by m matrix with elements

$$\Phi_{jk} = f^*_{jk}, \quad j = 1, \ldots, n; \ k = 1, \ldots, m,$$

and let

$$\gamma = P_{R^m} h^* = (h^*_1, h^*_2, \ldots, h^*_m).$$

The discretized primal is to find

$$\inf \left\{ \gamma \cdot \alpha \ : \ \alpha \in R^m, \ \|\Phi \cdot \alpha - \delta\|_{\Sigma^{-1}} \le \chi, \ \|\alpha\|_2 \le \nu \right\},$$

which for p = 1, 2, or $\infty$ can be solved using quadratic programming iteratively.

The Discrete Dual. We discretize the dual by truncation as well, again assuming that $f^*[x']$ and $h^*[x']$ can be found analytically. If necessary, it is straightforward to account for their approximate computation too. As in section 7.3, the error E in the approximation of $\|\lambda \cdot f^* - h^*\|$ by $\|\lambda \cdot \Phi - \gamma\|$ is bounded by:

$$E \le \|\lambda\|_q \, \left\| \left( \|f_j^* - P_m f_j^*\| \right)_{j=1}^n \right\|_p + \|h^* - P_m h^*\|.$$

Here $\|f_j^* - P_m f_j^*\|$ and $\|h^* - P_m h^*\|$ are known. Let $\varrho_f \equiv \left\| \left( \|f_j^* - P_m f_j^*\| \right)_{j=1}^n \right\|_p$ and $\varrho_h \equiv \|h^* - P_m h^*\|$. Then 55 holds, and the value of the exact dual is at least

$$\max_{\lambda \in R^n} \left\{ \lambda \cdot (\delta - f^*[x']) - (\chi + \nu \varrho_f) \|\lambda\|_q - \nu \|\lambda \cdot \Phi - \gamma\| + h^*[x'] - \nu \varrho_h \right\},$$

which is the unconstrained maximization of a concave functional and is a completely finite-dimensional problem.
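The truncation errors are simply tail norms of the coefficient sequences. As a minimal sketch, assume (purely for illustration) two functionals with coefficients $f_{1k} = 1/k$ and $f_{2k} = 2/k$; the tail can then be computed from the known sum $\sum_{k \ge 1} k^{-2} = \pi^2/6$, and checked against the elementary bounds $1/(m+1) \le \sum_{k > m} k^{-2} \le 1/m$. The function names are hypothetical.

```python
import math

def tail_norm(m, scale=1.0):
    """||f - P_m f|| in l_2 for f_k = scale / k, using sum_{k>=1} 1/k^2 = pi^2 / 6."""
    head = sum(1.0 / (k * k) for k in range(1, m + 1))
    return scale * math.sqrt(max(math.pi ** 2 / 6.0 - head, 0.0))

def rho_f(m, scales, p=2.0):
    """p-norm of the vector of tail norms, one entry per data functional."""
    tails = [tail_norm(m, s) for s in scales]
    return sum(t ** p for t in tails) ** (1.0 / p)

r10 = rho_f(10, [1.0, 2.0])
r1000 = rho_f(1000, [1.0, 2.0])
```

For such slowly decaying coefficients the tail norm falls only like $m^{-1/2}$, so a hundredfold increase in m buys one decimal digit in the correction term.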


8 Primal Approaches to Discretization Error

Backus [7] gives an intriguing treatment of discretization error for confidence intervals on a single linear functional of a model in a two-norm Hilbert-space ball, based upon adding a "systematic error" term that can be bounded using the Cauchy-Schwarz inequality and the prior constraint $\|x_0\| \le 1$. Backus advocates using as the discrete basis a subset of the singular functions of the mapping $f^*$ restricted to the span of the data functionals ($\{x \in X : x = \lambda \cdot f^*\}$). His technique for constructing confidence intervals can produce intervals that are longer when more basis functions are used; he proposes a method of choosing the number of functions optimally using the singular values in that decomposition.

Lang [31] proposes a very appealing method of bounding discretization error: solve the discrete primal problem again with an increased value of $\chi$ (analogous to the term $\nu \varrho_f$ in section 7.3). One of the attractions of this approach is that the same software can be used to bound v(P) above and below, with the addition of a small amount of code to determine the necessary increase to $\chi$. Lang works out how much $\chi$ must be increased for bounds on linear functionals in a model problem where X is the space of continuous functions on an interval, C is an infinity-norm ball ($0 \le x(t) \le u(t)$, where u is piecewise constant); and the linear functional is given by integration against a piecewise constant weight function. His basic idea can be extended to deal with prior balls in other norms, and to linear functionals not given by piecewise constant weights (this necessitates the introduction of another correction term analogous to $\nu \varrho_h$ in section 7.3). However, it does not appear to extend to cases where there is less prior information (e.g. if $x_0$ is not known to lie in a ball), and it does not work if the linear functionals $f^*$ and $h^*$ are not in the normed dual space.

The basic idea is to use the bound on the norm of $x_0$ to determine an increase to $\chi$ that guarantees the discretized set $C \cap D'$ is at least as "wide" in the direction normal to the hyperplane that represents $h^*$ as is the set $C \cap D$. One wants to incorporate the errors induced by the approximations to x and $f^*$ into the acceptable misfit. This can be done by considering the norm of the operator that is the difference $f_{j,\Delta}$ between $f_j^*$ and its discrete approximation $\tilde f_j^*$: the error is bounded by

$$\|f_{j,\Delta}\| \, \|x\| \le \nu \|f_{j,\Delta}\|.$$

By making the discretization sufficiently fine, one can ensure that $\|f_{j,\Delta}\|$ is small. Then the contribution of the discretization to the misfit is bounded by $\nu \|\varrho_f\|_{\Sigma^{-1}}$, where the n-vector $\varrho_f \equiv (\|f_{j,\Delta}\|)_{j=1}^n$.

Similarly, one needs to allow for possible errors in $h^*[x]$, which can be bounded in the same way using the norm of the difference between $h^*$ and its discrete approximation. Denote the norm of this operator by $\varrho_{h^*}$. Then the value of the primal is bounded below by

$$\inf_{x \in C \cap D'} \tilde h^*[x] - \nu \varrho_{h^*},$$

where

$$D' \equiv \left\{ \alpha : \|\delta - \tilde f^*[\alpha]\|_{\Sigma^{-1}} \le \chi + \nu \|\varrho_f\|_{\Sigma^{-1}} \right\}.$$

In the one practical case the author tried to use this approach, the seismic problem in section 7.3.2, $\nu \|\varrho_f\|_{\Sigma^{-1}}$ was greater than the misfit of the zero model; i.e., greater than the norm of the data vector $\delta$. This is an indication that the discretization in the problem was too crude, but figure 1 shows that the discretization problem is not as severe as indicated by such a large value of $\nu \|\varrho_f\|_{\Sigma^{-1}}$. Thus this approach may not be sharp enough to give practical results in all problems where a bound on the norm of $x_0$ is available. A reason for this is that (for example, in Lp spaces with $p \ne \infty$) a model that achieves its norm bound in each little interval of discretization in general does not fit the data.


9 Bounds on a Single Linear Functional

So far, we have chosen the data space confidence region $\Xi$ without regard to the functional H or the prior information C. Our procedure allows simultaneous confidence intervals for any number of functionals to be computed, but in principle, $\Xi$ could be chosen optimally; for example, to give the smallest expected weighted sum of the lengths of the confidence intervals for a fixed finite collection of functionals. Let $S(\alpha)$ denote the set of $1 - \alpha$ confidence regions for $\delta$:

$$S(\alpha) \equiv \{ \Xi \subset R^n : P\{\epsilon \in \Xi\} \ge 1 - \alpha \}.$$

Using a ball in the data space as the confidence region leads to wider confidence intervals than necessary to guarantee the nominal coverage probability if one is interested in only a finite number of linear functionals. Since we can never compute bounds on more than a finite number of functionals, it may be of primary interest to get the bounds on those functionals as tight as possible while maintaining their (simultaneous) coverage probability. Backus [7] argues that one should stick to a single functional at a time; I believe there are cases where it is natural to consider more than one. For example, one might be interested in a confidence region for the paleopole position from paleomagnetic data (R.L. Parker, personal communication). Since two numbers (a three-dimensional unit vector) are needed to specify the paleopole position, we need a two-dimensional confidence set. In that case, it makes more sense to look at the two numbers as a vector and find, say, the region on the surface of the sphere with smallest area, rather than to put separate confidence intervals on the components.

If one wished to construct a simultaneously valid envelope such as that in Figure 1, using one functional at a time and a set $\Xi$ tailored to each functional, the coverage probability decreases to something like $1 - N\alpha_0$, where N is the number of functionals and $1 - \alpha_0$ is the nominal coverage probability for each functional. Clearly as N grows, $\alpha_0$ must approach zero to guarantee simultaneous coverage probability $1 - \alpha$; it is easier and probably more efficient to use the approach here, where a single $\Xi$ is used for all the functionals. To get the tightest bounds on any particular collection of functionals, one needs to tailor the data confidence set $\Xi$ to the functionals and the available prior information C.
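A small calculation shows how quickly the per-functional level must shrink. The sketch assumes the simple union (Bonferroni-type) bound $1 - N\alpha_0$ described above, takes N = 89 to match the number of values of $v_h$ in Figure 1, and, as a further assumption for illustration only, converts each level into the half-width of a two-sided interval for a single standard Gaussian datum. The function names are hypothetical.

```python
from statistics import NormalDist

def per_functional_alpha(alpha, n_functionals):
    """Level alpha_0 for each functional so that 1 - N * alpha_0 = 1 - alpha."""
    return alpha / n_functionals

def two_sided_z(alpha0):
    """Half-width, in standard deviations, of a two-sided 1 - alpha0 Gaussian interval."""
    return NormalDist().inv_cdf(1.0 - alpha0 / 2.0)

alpha0 = per_functional_alpha(0.05, 89)
z_single = two_sided_z(0.05)         # one functional at the 95% level
z_simultaneous = two_sided_z(alpha0) # each of 89 functionals, for 95% simultaneous coverage
```

The half-width grows only slowly with N in the Gaussian case, but it does grow without bound, which is the cost of tailoring a separate confidence set to each functional.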

As far as I know, the optimal choice of $\Xi$ is an unsolved problem, even for a single linear functional. For example, even the following simpler statistical problem appears to be unsolved: Suppose we observe $\delta = \mu + \epsilon$, where $\epsilon \sim N(0, 1)$, and $\mu \in [-\tau, \tau]$. For a fixed coverage probability $1 - \alpha$, what procedure for assigning a confidence interval to $\mu$ from $\delta$ produces, for the least favorable $\mu \in [-\tau, \tau]$, the shortest confidence interval in expectation? One must be able to answer such a question before tackling the problem of optimal confidence regions for bounds on functionals. The approach taken implicitly in this paper is to use the confidence interval $[\delta - 1.96, \delta + 1.96] \cap [-\tau, \tau]$. It is easy to convince oneself that this is suboptimal, but further analysis is needed to determine just how bad it is.
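The truncated interval is easy to simulate. The sketch below, with hypothetical function names and an arbitrary choice of $\mu = 0.5$ and $\tau = 1$, estimates the coverage and the expected length by Monte Carlo. Because $\mu$ lies in $[-\tau, \tau]$ by assumption, intersecting with the prior interval never removes $\mu$, so the coverage stays at the nominal 95% while the length can only decrease.

```python
import random

def truncated_interval(delta, tau, z=1.96):
    """[delta - z, delta + z] intersected with [-tau, tau]; None if the intersection is empty."""
    lo, hi = max(delta - z, -tau), min(delta + z, tau)
    return (lo, hi) if lo <= hi else None

def simulate(mu, tau, trials=20000, seed=0):
    """Monte Carlo estimate of (coverage, expected length) for delta = mu + N(0, 1) noise."""
    rng = random.Random(seed)
    covered, total_length = 0, 0.0
    for _ in range(trials):
        interval = truncated_interval(mu + rng.gauss(0.0, 1.0), tau)
        if interval is not None:
            lo, hi = interval
            covered += lo <= mu <= hi
            total_length += hi - lo
    return covered / trials, total_length / trials

coverage, mean_length = simulate(mu=0.5, tau=1.0)
```

Repeating the simulation over a grid of $\mu$ values gives the worst-case expected length of this procedure, which is the quantity an optimal procedure would have to beat.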

9.1 Confidence Slabs

In this section we examine a particular choice of $\Xi$ that can give a tighter confidence interval for a single linear functional than do balls. The choice made here is the same as that used by Backus [7] and by Donoho [18], but the confidence intervals we get differ due to the way the information $x_0 \in C$ is used. The set we shall pick is a "slab," the region between two parallel hyperplanes in data space. See also Stark [58]. One way to think of this region is in terms of projections of the data space $R^n$ into R. This amounts to taking a particular linear combination of the data functionals, say $\gamma \cdot f^*$, and the corresponding datum $\gamma \cdot \delta$ and solving the resulting single-datum inverse problem. Thus slabs are a special case (n = 1) of the previous results. For R, all the $l_p$ norms are the same, $\|\gamma \cdot \epsilon\| = |\gamma \cdot \epsilon|$. Using the knowledge of the distribution of $\epsilon$, it is easy to compute the distribution of $\gamma \cdot \epsilon$. Let us sketch a motivation for this choice of data confidence regions, and develop the dual problem directly.

Suppose for the moment that X* is normed. Define

$$\gamma \equiv \arg \inf_{\beta \in R^n} \|H - \beta \cdot f^*\|,$$

$$\hat\gamma \equiv \frac{\gamma}{\|\gamma\|},$$

$$g^* \equiv \hat\gamma \cdot f^*, \qquad (75)$$

and pick $\chi$ so that

$$P\{|\hat\gamma \cdot \epsilon| \le \chi\} \ge 1 - \alpha.$$

(If X* is not normed, pick any $\gamma \in R^n$.) Let

$$D \equiv \{x \in X : |\hat\gamma \cdot \delta - g^*[x]| \le \chi\}.$$

Then

$$P\{C \cap D \ni x_0\} = P\{D \ni x_0\} \ge 1 - \alpha.$$

The resulting primal optimization problem to get the lower endpoint of a $1 - \alpha$ confidence interval for $H[x_0]$ is to find

$$v(P) = \inf_{x \in C \cap D} H[x].$$

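The dual functionals C* and D* for this problem are

$$D^*[x^*] \equiv \inf_{x \in D} x^*[x] = \begin{cases} \beta \hat\gamma \cdot \delta - \chi |\beta|, & x^* = \beta g^* \text{ for some } \beta \in R \\ -\infty, & \text{otherwise,} \end{cases}$$

and

$$C^*[x^*] \equiv \inf_{x \in C} (H[x] - x^*[x]) = -\nu \|H - x^*\|,$$

so

$$v(D) = \sup_{\beta \in R} \left\{ \beta \hat\gamma \cdot \delta - \chi |\beta| - \nu \|H - \beta g^*\| \right\}.$$

Note that this can be derived directly from the n-dimensional case by taking linear combinations as mentioned above, and setting n = 1. Section 11 discusses conditions for equality v(P) = v(D).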

9.2 Hilbert Spaces: Backus and Donoho

Suppose that X is a Hilbert space so that X* = X. Then it is easy to see that $\gamma = \Gamma^{-1} \cdot f^*[H]$, where $\Gamma$ is the Gram matrix with elements

$$\Gamma_{ij} = \langle f_i^*, f_j^* \rangle.$$

A bit of algebra yields

$$v(D) = \sup_{\beta} \left\{ \beta\, \delta \cdot \Gamma^{-1} \cdot f^*[H] - \chi |\beta| - \nu \left( \|H\|^2 + \beta(\beta - 2)\, f^*[H] \cdot \Gamma^{-1} \cdot f^*[H] \right)^{1/2} \right\}.$$

The expression can be arranged so that evaluating the dual functional does not involve inverting the Gram matrix, just solving the system of equations

$$\Gamma \cdot \gamma = f^*[H]. \qquad (76)$$

Now if the distribution of errors $\epsilon$ is symmetric, the value of $\chi$ needed for $1 - \alpha$ coverage probability is independent of the direction $\hat\gamma$: any unit vector gives the same value. Thus if we choose to solve 9.1 only approximately (say by a singular value truncation procedure, as advocated by Backus [7]) we still get a valid confidence interval for $H[x_0]$.

It is always possible to get a bound on the improvement in going from the ball in $R^n$ to the slab by computing bounds with an $R^n$ ball whose diameter is chosen on the basis of the $\chi^2_1$ distribution rather than the $\chi^2_n$ distribution. The proof is obvious: the chi-squared 1 ball is a subset of the slab. Let $H_\parallel$ denote the approximation to H in the linear span of the data functionals.

In our notation, the endpoints of Backus' confidence intervals are given by

$$\pm \inf_{x \in D} \pm H_\parallel[x] \ \pm \inf_{x \in C} \pm (H - H_\parallel)[x].$$

Examine the lower endpoint (plus signs throughout):

$$\begin{aligned}
\inf_{x \in D} H_\parallel[x] + \inf_{x \in C} (H - H_\parallel)[x] &\le \inf_{x \in C \cap D} H_\parallel[x] + \inf_{x \in C \cap D} (H - H_\parallel)[x] \\
&\le \inf_{x \in C \cap D} (H_\parallel + H - H_\parallel)[x] \\
&= \inf_{x \in C \cap D} H[x] \qquad (77)
\end{aligned}$$

which is the lower endpoint our procedure gives. Similarly, Backus' upper endpoint is always larger, and so, at the same level of approximation used to find A, his intervals are always longer, to obtain the same nominal coverage probability. However, Backus performs a discrete optimization over linear combinations of the singular functions of the Gram matrix to determine the approximation to A that yields the shortest interval. With our approach, that optimization is impermissible, since for each A, the width of the computed interval depends on the measured data. However, it is permissible to use Backus' or Donoho's (see below) a priori computation of the approximation that is optimal, then to use strict bounds to find the confidence interval with that approximation to A. This will yield shorter intervals than either Backus' or Donoho's technique used by themselves.

By a different approach, based on the modulus of continuity of H, Donoho [18] shows how to construct optimally short (minimax) fixed-length confidence intervals based on affine estimates for linear functionals of models in $l_2$ constrained to lie in arbitrary convex subsets C of X when the errors $\epsilon$ are Gaussian. These intervals are shorter than those computed by Backus: they are optimal in a much larger class. Donoho also shows that these intervals based on affine estimates are nearly optimal among all measurable estimates, including nonlinear ones. For the present problem, where $C = \{x \in l_2 : \|x\| \le 1\}$, and H is linear, Donoho's approach leads to the computation:

$$\begin{aligned}
\omega(\nu) &\equiv \sup \left\{ |H[x_1] - H[x_{-1}]| : \|f^*[x_1] - f^*[x_{-1}]\|_2 \le \nu \ \text{ and } \ x_i \in C \right\} \\
&= \sup \left\{ H[x] : \|f^*[x]\|_2 \le \nu \ \text{ and } \ \|x\| \le 2 \right\} \qquad (78)
\end{aligned}$$

as one may show without difficulty. The optimization problem 78 is in fact a strict bounds problem with data misfit measured in the two-norm, a data vector identically equal to zero, and prior information that the unknown lies in a Hilbertian ball. Thus the development in section 7.3.3 and example 1 of section 5.3 is applicable to the computation; only (n + 1)(n + 2)/2 infinite-dimensional computations, $f_j^*[f_k^*]$ and $H[f_j^*]$, and $\|H\|^2$, are required.

Donoho's minimax affine confidence interval has length

$$\sup_{\nu} \left( \frac{\omega(\nu)}{\nu} \right) \chi_\alpha(\nu/2, \sigma),$$

where $\chi_\alpha(\cdot, \cdot)$ is a universal function for such problems and must be evaluated numerically, and $\sigma$ is the standard deviation of the (Gaussian) observational noise. Once the function $\chi_\alpha$ is tabulated once and for all, Donoho's approach is computationally less taxing than Backus', as it requires no secondary optimization over truncation levels. Donoho and Stark [19] give a numerical comparison of Backus' and Donoho's confidence intervals for the geomagnetic problem, and show that the optimal truncation-based confidence interval in a problem of this sort is no more than twice as long as the minimax affine-based fixed-length confidence interval.


10 Systematic Errors

It is straightforward to incorporate the effects of known systematic errors into the confidence intervals constructed by strict bounds. In order to end up with solvable optimization problems, however, it is necessary that the systematic error sets be convex; otherwise the optimization problems are over nonconvex sets, local optima may not be global optima, and there may be duality gaps. Of course, the convex hull of the systematic error set may be used, but this can enlarge the confidence intervals substantially. The following treatment of systematic errors in the data, or of modelling errors, is similar to the treatment by Backus [7]. Backus' treatment of systematic errors in the evaluation of H carries over to our approach unchanged: one merely "tacks on" the maximum systematic error in the obvious way after evaluating the confidence interval.

We model the data relations with systematic errors as follows:

$$\delta = f^*[x_0] + \epsilon + \omega, \qquad (79)$$

where $\epsilon$ is random error as before, and $\omega \in \Omega$ is a systematic error drawn from the known set $\Omega$. Clearly, we may get a $1 - \alpha$ confidence region for $\delta_0 = f^*[x_0]$ by translating any of our previous regions $\Xi$ by the set $\Omega$:

$$P\{\delta - \Xi - \Omega \ni f^*[x_0]\} \ge 1 - \alpha,$$

where

$$\delta - \Xi - \Omega \equiv \{y \in R^n : y = \delta - \xi - \omega \ \text{ for some } \xi \in \Xi \text{ and some } \omega \in \Omega\}.$$

We may thus proceed as before if we define D to be

$$D \equiv \{x \in X : f^*[x] \in \delta - \Xi - \Omega\}.$$

If the set $\Omega$ is a ball in the norm we are using to measure data misfit, the resulting optimization problem can be solved by the same techniques as before.

10.1 Example: Ball-Type Constraints

Suppose

$$\Omega = \{\omega \in R^n : \|\omega - \omega_0\|_{p', \Sigma'} \le \zeta\}$$

for some $\omega_0 \in R^n$ and $\zeta \ge 0$. For p = p' and $\Sigma = \Sigma'$ it is obvious that this additional uncertainty can be incorporated by setting

$$D = \{x \in X : \|f^*[x] - (\delta - \omega_0)\|_{p, \Sigma} \le \chi + \zeta\}$$

and proceeding as before. If $\Sigma$ and $\Sigma'$ are both diagonal and p = p', results are easily obtained.


Suppose $p \ne p'$. The following "quick and dirty" approach is straightforward and conservative, but often not sharp. Let $p'' = \max(p, p')$. Since the unit ball in $l_p$ is contained in the unit ball in $l_{p''}$ for $p'' \ge p$, one may conservatively incorporate systematic errors by expanding $\Omega$ or $\Xi$ to the $p''$-norm, then using the same technique as before for p = p'.
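The containment used here is the familiar monotonicity of the $l_p$ norms: for a fixed vector, $\|y\|_{p''} \le \|y\|_p$ whenever $p'' \ge p$, so any point inside the unit $l_p$ ball is also inside the unit $l_{p''}$ ball. A short numerical check, with hypothetical function names and an arbitrary test vector:

```python
import math

def lp_norm(y, p):
    """The l_p norm of a finite vector; p may be math.inf."""
    if math.isinf(p):
        return max(abs(v) for v in y)
    return sum(abs(v) ** p for v in y) ** (1.0 / p)

def conservative_exponent(p, p_prime):
    """The exponent p'' = max(p, p') whose unit ball contains both unit balls."""
    return max(p, p_prime)

y = [3.0, -4.0, 1.0]
norms = [lp_norm(y, p) for p in (1.0, 2.0, math.inf)]   # nonincreasing in p
```

The gap between the norms is what makes the approach conservative: in n dimensions the $l_\infty$ norm can be smaller than the $l_1$ norm by a factor of n, so the enlarged ball may be much bigger than the set it replaces.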

Obviously, the situation is very simple if a slab is used as the confidence region, provided the systematic error set is an interval. When it is not, it is sometimes possible to do something with mixed integer linear programming, which can treat disjunctive constraints (see Papadimitriou and Steiglitz, [41]).

10.2 Uncertain Data Functionals: an example from Helioseismology

The mathematical model of an inverse problem we adopted in equation 4, where the data functionals $f_j$ were assumed to be known, is too restrictive for some inverse problems. In many cases, including notably curved-ray travel-time tomography, we do not know precisely what functionals we have measured: we merely know that $f^* \in F$, with some probability, for a known class F of functionals. In curved-ray tomography this comes about because we do not know the raypaths. In other problems it can come about because we do not know precisely the characteristics of our measuring instruments. In some circumstances, this uncertainty in what was measured can be cast as a systematic error.

The following is an illustration from the asymptotic inversion of free-oscillation data; there is an analogous problem for the inversion of $\tau(p)$ and $X(p)$ data for Earth as well (see sections 7.2 and 7.3.2), although it has been ignored in the literature so far. The asymptotic inversion of free-oscillation data via Abel's equation was first done for Earth by Brodsky and Levshin (the first paper in English is [12]). The technique has also been used for the inversion of solar free-oscillation data by Christensen-Dalsgaard and others [14], and by Brodsky and Vorontsov [13]. The following development will implicitly be for Sun, because the additional constraint of monotonicity of soundspeed with depth does not hold for Earth, except arguably in the core [64].

In this section we abandon some of our previous notation for consistency with current usage. Let $\omega_{kl}$ be the eigenfrequency with radial number k and angular number l. We shall assume that we observe triples $(\delta_{kl}, k, l)$, where $\delta_{kl} = \omega_{kl} + \epsilon_{kl}$, and as usual $\epsilon_{kl}$ is observational error. It can be shown (see [12]) that

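$$\frac{\pi(k+\alpha)}{\omega_{kl}} = \int_{r_t}^{R} \left( \frac{r^2}{c^2(r)} - \left(\frac{l+1/2}{\omega_{kl}}\right)^2 \right)^{1/2} r^{-1}\,dr + O(\omega_{kl}^{-2}). \tag{80}$$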

Here R is the solar radius, r is distance from the center of Sun, c(r) is the soundspeed at radius r, and $\alpha$ is a phase shift factor that captures the effect of the boundary condition at r = R; $\alpha$ is assumed to be known (techniques for its estimation are the subject of work in progress with M. Brodsky, and uncertainties in $\alpha$ are easily incorporated into the subsequent analysis). Define

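$$d_{kl} \equiv \frac{\pi(k+\alpha)}{\omega_{kl}} \tag{81}$$

$$p_{kl} \equiv \frac{(l+1/2)\,c(R)}{\omega_{kl} R} \tag{82}$$

$$x \equiv \frac{R \ln(R/r)}{c(R)} \tag{83}$$

$$v(x) \equiv \frac{c(R \exp\{-x c(R)/R\})}{c(R)\exp\{-x c(R)/R\}} \tag{84}$$

where the last two equations are the Earth-flattening transformations of Gerver and Markushevich [23]. If we substitute 81-84 into 80, we find to first order in $\omega^{-1}$

$$d_{kl} = \int (v^{-2} - p_{kl}^2)^{1/2}\,dx. \tag{85}$$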

Under the assumed monotonicity of c(r)/r (according to D.O. Gough, personal communication, this assumption is not usually disputed for Sun, and all current solar models satisfy it), we have monotonicity of v(x) and hence can change variables to x(v). (See Stark et al. [64] for a more rigorous justification.) Then

$$d_{kl} = \int_1^{p_{kl}^{-1}} (v^{-2} - p_{kl}^2)^{1/2}\,dx(v).$$

The reader might notice this is equation 50 with $p_j = p_{kl}$. However, in this case (as, in fact, for the travel-time problem) the kernel in the integral equation and the limit of integration are not known precisely: they depend upon $\omega_{kl}$, while we observe only $\delta_{kl} = \omega_{kl} + \epsilon_{kl}$.

We will address this complication by finding a systematic error term to add to the data uncertainties. The basic idea is to look at how different the integrals

$$I(p; x) \equiv \int_1^{p^{-1}} (v^{-2} - p^2)^{1/2}\,dx(v) \tag{86}$$

can be for p ranging throughout confidence intervals for each of the $p_{kl}$ derived from the observed $\delta_{kl}$, and x ranging over the set of a priori acceptable solutions. The possibility of bounding this error depends on the following constraint, which is generally accepted by the helioseismological community, but by no means certain (D.O. Gough, personal communication): c(r) increases monotonically with depth in the sun, down to a radius of about 0.2R. As in the case for $\tau(p)$ and $X(p)$ data mentioned in section 7.3.2, and in detail in [64], monotonicity of c(r) translates into the upper bound $x(dv) \le R\,dv/v$. We may then take $X = C^-[1, b]$, where the space $C^-$ is defined in section 7.3.2, and b satisfies $b > \max_{(k,l)} p_{kl}^{-1}$.

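Suppose we can find from the statistics of $\epsilon$ a set of numbers $\chi_{kl}$ such that

$$P\{\delta_{kl} - \chi_{kl} \le \omega_{kl} \le \delta_{kl} + \chi_{kl}\} \ge 1 - \beta,$$

for fixed $\beta \in [0, 1]$. Let

$$d^-_{kl} \equiv \frac{\pi(k+\alpha)}{\delta_{kl} + \chi_{kl}}, \qquad d^+_{kl} \equiv \frac{\pi(k+\alpha)}{\delta_{kl} - \chi_{kl}},$$

and

$$p^-_{kl} \equiv \frac{(l+1/2)\,c(R)}{(\delta_{kl} + \chi_{kl})R}, \qquad p^+_{kl} \equiv \frac{(l+1/2)\,c(R)}{(\delta_{kl} - \chi_{kl})R}.$$

Then

$$P\{d^-_{kl} \le d_{kl} \le d^+_{kl}\} \ge 1 - \beta$$

and

$$P\{p^-_{kl} \le p_{kl} \le p^+_{kl}\} \ge 1 - \beta.$$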
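The brackets on $d_{kl}$ and $p_{kl}$ follow directly from the frequency interval, because both quantities are decreasing functions of $\omega_{kl}$. A minimal sketch of the computation is given below; the function name, argument names and the numbers in the usage line are hypothetical and are not taken from any data set.

```python
import math

def bracket_d_and_p(delta_kl, chi_kl, k, l, alpha, c_R, R):
    """Map the frequency interval [delta - chi, delta + chi] to intervals for d_kl and p_kl.

    d = pi (k + alpha) / omega and p = (l + 1/2) c(R) / (omega R) both decrease
    with omega, so the upper frequency limit gives the lower limits d^- and p^-.
    """
    if not 0.0 <= chi_kl < delta_kl:
        raise ValueError("need 0 <= chi_kl < delta_kl so both frequency limits are positive")
    omega_lo = delta_kl - chi_kl
    omega_hi = delta_kl + chi_kl
    d_minus = math.pi * (k + alpha) / omega_hi
    d_plus = math.pi * (k + alpha) / omega_lo
    p_minus = (l + 0.5) * c_R / (omega_hi * R)
    p_plus = (l + 0.5) * c_R / (omega_lo * R)
    return (d_minus, d_plus), (p_minus, p_plus)

# Hypothetical values in arbitrary consistent units.
(d_lo, d_hi), (p_lo, p_hi) = bracket_d_and_p(
    delta_kl=10.0, chi_kl=0.1, k=5, l=20, alpha=1.5, c_R=0.2, R=1.0)
print(d_lo < d_hi, p_lo < p_hi)
```

Because both intervals are images of the same frequency interval, they cover the true $d_{kl}$ and $p_{kl}$ simultaneously whenever the frequency interval covers $\omega_{kl}$.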

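Let us fix k and l and drop them from the notation. Let

$$\hat{d} \equiv \frac{d^+ + d^-}{2}, \qquad \hat{p} \equiv \frac{p^+ + p^-}{2}.$$

We wish to bound how bad an approximation $I(\hat{p}; x_0)$ is to $I(p; x_0)$, where $x_0$ is the true model and $p = p_{kl}$ is the true value.

With probability $1 - \beta$, $p \in [p^-, p^+]$, and hence with probability $1 - \beta$,

$$|I(\hat{p}; x_0) - I(p; x_0)| \le \sup_{q \in [p^-, p^+]} |I(q; x_0) - I(\hat{p}; x_0)|$$

$$\le \sup_{q, r \in [p^-, p^+]} |I(q; x_0) - I(r; x_0)|$$

$$\le \sup_{x \in C}\ \sup_{q, r \in [p^-, p^+]} |I(q; x) - I(r; x)|$$

$$= R\left[p^+\{\cos^{-1} p^+ - \tan(\cos^{-1} p^+)\} - p^-\{\cos^{-1} p^- - \tan(\cos^{-1} p^-)\}\right]$$

$$\equiv \sigma_{kl}$$

provided $(p^-)^{-1}$ and $(p^+)^{-1}$ are velocities that occur shallower than 0.3R. Thus

$$P\{d_{kl} - \sigma_{kl} \le I(\hat{p}_{kl}; x_0) \le d_{kl} + \sigma_{kl}\} \ge 1 - \beta.$$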
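The closed form for the systematic error term can be cross-checked by quadrature. The sketch below assumes that the supremum over C is attained by the model that saturates the monotonicity bound, taken here to be $x(dv) = R\,dv/v$ on $[1, p^{-1}]$; for that model $I(p; x) = R[(1 - p^2)^{1/2} - p\cos^{-1} p]$, which decreases with p, so the largest difference over $[p^-, p^+]$ is $I(p^-; x) - I(p^+; x)$. Function names and test values are illustrative only.

```python
import math

def I_extremal(p, R):
    """Closed form of the integral of (v^-2 - p^2)^(1/2) R dv / v over [1, 1/p], 0 < p <= 1."""
    return R * (math.sqrt(1.0 - p * p) - p * math.acos(p))

def I_quadrature(p, R, n=100000):
    """Midpoint-rule evaluation of the same integral, for comparison."""
    a, b = 1.0, 1.0 / p
    h = (b - a) / n
    total = 0.0
    for i in range(n):
        v = a + (i + 0.5) * h
        total += math.sqrt(max(v ** -2 - p * p, 0.0)) * R / v
    return total * h

def sigma_bound(p_minus, p_plus, R):
    """Bracketed expression R[p+{acos p+ - tan(acos p+)} - p-{acos p- - tan(acos p-)}]."""
    def term(p):
        return p * (math.acos(p) - math.tan(math.acos(p)))
    return R * (term(p_plus) - term(p_minus))

print(abs(I_extremal(0.5, 1.0) - I_quadrature(0.5, 1.0)) < 1e-5)
print(sigma_bound(0.4, 0.5, 2.0) > 0.0)
```

The bracketed expression agrees with the difference of the two closed-form integrals, and it shrinks to zero as the interval $[p^-, p^+]$ collapses.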

Now

$$P\{d^-_{kl} - \sigma_{kl} \le I(\hat{p}_{kl}; x_0) \le d^+_{kl} + \sigma_{kl},\ \forall (k, l) \in K\} \ge 1 - |K|\beta,$$

where K is the set of all pairs (k, l) in the data set, and |K| is the cardinality of K. If the errors $\epsilon_{kl}$ are independent, $1 - |K|\beta$ can be replaced by $(1 - \beta)^{|K|}$. Define $\beta'$ to be $|K|\beta$ or $1 - (1 - \beta)^{|K|}$, as appropriate. Now let

$$D \equiv \{x \in C^-[1, b] : d^-_{kl} - \sigma_{kl} \le I(\hat{p}_{kl}; x) \le d^+_{kl} + \sigma_{kl},\ \forall (k, l) \in K\}.$$

Then $P\{D \ni x_0\} \ge 1 - \beta'$. As in section 7.3.2,

$$C \equiv \{x \in C^-[1, b] : 0 \le x(dv) \le R\,dv/v\}.$$

As usual, $P\{C \cap D \ni x_0\} \ge 1 - \beta'$, so we may use this set for strict bounds with coverage probability at least $1 - \beta'$: we have succeeded in capturing the uncertainty in the data functionals.
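The two choices of $\beta'$ differ very little when $\beta$ is small. The following sketch computes both; the function name and the cap at 1 for the union bound are additions made here for illustration.

```python
def simultaneous_beta(beta, n_pairs, independent=False):
    """Overall miscoverage beta' for n_pairs intervals, each with miscoverage beta.

    independent=False uses the union bound |K| * beta (capped at 1);
    independent=True uses 1 - (1 - beta)^|K|.
    """
    if not 0.0 <= beta <= 1.0 or n_pairs < 1:
        raise ValueError("need 0 <= beta <= 1 and at least one pair")
    if independent:
        return 1.0 - (1.0 - beta) ** n_pairs
    return min(1.0, n_pairs * beta)

print(simultaneous_beta(0.001, 30))
print(simultaneous_beta(0.001, 30, independent=True))
```

For 30 pairs at $\beta = 0.001$, the union bound gives $\beta' = 0.03$ while independence gives a value slightly below 0.03, so little is lost by not assuming independence.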


11 Duality Gaps

If there is a duality gap, that is if v(P) > v(D), then the bound we get by solving the dual problem is still valid, but not tight. We cannot "squeeze in" on the value of P indefinitely using the discretized dual problem: the discretized dual always gives pessimistic results. It is thus of interest to determine when there might be a duality gap, to know if a large difference between the values of the discretized primal and dual problems indicates that the true bound on H may be much smaller than v(P). One heuristic check one may perform is to solve the discretized dual problem without the correction terms for discretization, and see if the discrete primal and dual results then overlap. Here we pursue a set of rigorous sufficient conditions.

The fundamental result for the algebraic approach to conjugate duality we have taken here is Fenchel's Duality Theorem (see Luenberger [35]), which gives the following sufficient conditions for the absence of a duality gap:

1. C n D has points in the relative interior of C and D,

2. the epigraph of H has nonempty interior, or D has nonempty interior, and

3. v(P) is finite.

We obviously need a topology on X to apply the Theorem. We would like to be able to treat cases where no topology is available, for instance when all we know is that $x_0$ is in a cone. Fortunately, there is another approach through Lagrangian duality (see Luenberger [35] for the general theory, Stark [58] for application to this problem).

The sufficient conditions for the absence of a duality gap in Lagrangian duality require no topology on X. What makes Lagrangian duality fly here is that we determine membership in D via a mapping into R, with positive cone $R^+$, such that $x \in D$ if and only if some convex functional $G^*$ of x is less than or equal to 0. In most of the paper, the functional is

$$G^*[x] \equiv \|f^*[x] - \delta\| - \chi.$$

Now Lagrangian duality ([35], Ch. 8.6, Th. 1) says, in our language, that if C is convex, H is convex, v(P) is finite, and there exists $x \in C$ with $G^*[x] < 0$, then

$$v(P) = v(D_L) \equiv \max_{\gamma \ge 0}\ \inf_{x \in C} \{H[x] + \gamma G^*[x]\}. \tag{87}$$

We shall see momentarily that the expression on the right hand side of 87 is equivalent to the conjugate dual derived algebraically in section 4. We remark that while Lagrangian duality is useful here to establish the absence of duality gaps, the actual computation of the value of the dual is easier in the original conjugate dual formulation, and the algebraic development of the conjugate dual in the first place is more transparent.
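A two-dimensional toy problem makes the statement concrete. The sketch below is not the inverse problem of the paper: it takes X = C = $R^2$, a linear $H[x] = h \cdot x$, the identity as data mapping, and the two-norm for misfit, all of which are assumptions chosen so that the inner infimum has a closed form. The Slater condition holds because $x = \delta$ gives $G^*[x] = -\chi < 0$. For this problem $\inf_x \{h \cdot x + \gamma(\|x - \delta\| - \chi)\}$ equals $h \cdot \delta - \gamma\chi$ when $\gamma \ge \|h\|$ and is $-\infty$ otherwise.

```python
import math

def primal_value(h, delta, chi, n_angles=20000):
    """min h.x subject to ||x - delta||_2 <= chi, by sampling the boundary circle.

    A linear objective attains its minimum over a disc on the boundary.
    """
    best = math.inf
    for i in range(n_angles):
        t = 2.0 * math.pi * i / n_angles
        x = (delta[0] + chi * math.cos(t), delta[1] + chi * math.sin(t))
        best = min(best, h[0] * x[0] + h[1] * x[1])
    return best

def dual_function(gamma, h, delta, chi):
    """inf over x of h.x + gamma (||x - delta|| - chi), in closed form for this toy problem."""
    if gamma < math.hypot(h[0], h[1]):
        return -math.inf
    return h[0] * delta[0] + h[1] * delta[1] - gamma * chi

def dual_value(h, delta, chi, gamma_max=10.0, n_grid=10001):
    """max over a grid of gamma in [0, gamma_max] of the dual function."""
    return max(dual_function(gamma_max * i / (n_grid - 1), h, delta, chi)
               for i in range(n_grid))

h, delta, chi = (3.0, 4.0), (1.0, 2.0), 0.5
print(primal_value(h, delta, chi), dual_value(h, delta, chi))
```

Both values equal $h \cdot \delta - \chi\|h\| = 8.5$ up to discretization error, so there is no duality gap in this example, as (87) predicts.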

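The conjugate dual is (from 35)

$$v(D) = \sup_{\lambda \in R^n} \{\lambda \cdot \delta - \chi\|\lambda\|_\Sigma + \inf_{x \in C} (H[x] - \lambda \cdot f^*[x])\}$$

$$= \sup_{\lambda} \{-\chi\|\lambda\|_\Sigma + \inf_{x \in C} (H[x] - \lambda \cdot (f^*[x] - \delta))\}$$

$$\ge \sup_{\lambda} \{-\chi\|\lambda\|_\Sigma + \inf_{x \in C} (H[x] - \|\lambda\|_\Sigma \|f^*[x] - \delta\|)\}$$

$$= \sup_{\alpha \ge 0} \{-\alpha\chi + \inf_{x \in C} (H[x] - \alpha\|f^*[x] - \delta\|)\} \tag{88}$$

which is the Lagrangian dual 87 defined above. Now since $v(P) \ge v(D) \ge v(D_L)$, where $D_L$ is the Lagrangian dual, it follows from 88 that if there is no duality gap for the Lagrangian dual, i.e. if $v(P) = v(D_L)$, there can be no gap for the conjugate dual D.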

Thus the absence of duality gaps depends on

1. the primal P having finite value v(P), and

2. the existence of an $x_1 \in C$ with $G^*[x_1] < 0$.

The second condition is easier to discuss, so let's begin with it. If there is a feasible point for the discretized primal, which can be established computationally by numerical programming, then either the second item is satisfied, or an infinitesimal increase to $\chi$ will satisfy it. Since the value of $\chi$ is estimated from suppositions about the distribution of $\epsilon$, one might well be disconcerted if the particular value of $\chi$ chosen happens to be the smallest for which the data can be fit, and might not hesitate to increase $\chi$ a smidgen. The first condition can be checked a priori, but it depends on the prior information C, and the nature of the functional H. We explore the various cases treated in the paper:

11.1 No Prior Information, H linear

We have seen in the case of bounds on linear functionals with C = X, no prior information, there is no duality gap provided the data functionals are linearly independent in X*. This result is purely algebraic.

11.2 Prior Convex Cone, H linear

Let K* be the conjugate cone (also called the polar cone) of C:

$$K^* = \{x^* \in X^* : \forall x \in C,\ x^*[x] \ge 0\},$$

and let F* be the span of the $f_j^*$. Then, for $\Xi$ a ball, v(P) is finite provided $H \in F^* + K^*$, and so there will be no duality gap provided (2) is satisfied.

11.3 Prior Ball, H linear

Clearly, if H is a bounded linear functional (the only case we consider when C is a ball), v(P) is finite if the problem is feasible at all, since $H[x] \ge -\|H\|\nu + H[x']$ for $x \in C$.

11.4 No Prior Information, H a norm

For H a norm, C = X, and linearly independent $f_j^*$ in X*, v(P) is finite: it is bounded below by zero, and above by a finite value depending on the norms of the individual $f_j^*$'s and $\delta$.

For H a seminorm, we may pose the problem in X/M, where H is a norm, and then use the previous result. There is a (surmountable) difficulty if there exists $y \in M$ with $f_j^*[y] \ne 0$ for some j.

Thus under these reasonable regularity conditions, which often appear to obtain, duality gaps are not in practice a problem, and the discrete-primal/discrete-dual bracketing procedure should allow the true value v(P) to be "squeezed" as accurately as desired.


12 Discussion

The discretized dual with correction terms provides a method for rigorous inferences about certain functionals in some constrained infinite-dimensional linear inverse problems. In conjunction with the discretized primal problem, or by itself without correction terms when there is known to be no duality gap, it enables a true $1 - \alpha$ confidence interval for the functional to be bracketed computationally.

The dual approach is probably attractive computationally as long as the number of data is relatively small (compared with the number of model elements) and the data functionals are reasonably well behaved (in contrast to the $X(p)$ functionals in section 7.3.2). In whole Earth travel-time tomography, where the data number millions, current computer technology would require one to choose a subset of the data kernels to work with to limit the dimensionality of $\lambda$ to a reasonable figure. It is probably not too difficult to determine a priori a good subset to pick to bound a fixed linear functional: one wants kernels that are most nearly aligned with H so that a good approximation to H is possible within their span. One is probably not going to do too badly by picking the subset so that the support of the functions that represent those components of f* and H roughly match.

I expect that it is possible to treat more general convex functionals via the Hahn-Banach Theorem [35]. The ideas are similar to the development for minimum-norm problems, but the set C* has a more complicated description than $\|x^*\| \le 1$. Donoho [17] gives situations in which lower bounds on nonlinear functionals of a probability density can be computed. His work could likely be generalized to inverse problems. It also should be possible to treat more general convex constraints than $L_p$ balls, and to get a bound on problems with nonconvex constraints by using the convex hull of the constraint set.

In the primal problem, it is often expensive to incorporate constraints; when is additional knowledge likely to shorten the confidence intervals significantly? This question is not explored here, but it seems that the dual problem is the natural vehicle for estimating a priori the power of a priori constraints.

Relation to Backus' work and Donoho's work. The approach to constructing confidence intervals taken here is fundamentally different from that of Backus [7], and Donoho [18]. Both Backus' and Donoho's procedures produce fixed-length confidence intervals: the length of the confidence interval depends on the data functionals, the functional to be bounded, assumptions about the noise, and so on, but not on the measured data values. The measured data merely determine the center of the interval. Backus' procedure does not in general produce the shortest fixed-length confidence interval; among procedures based on affine transformations of the data, Donoho's procedure is optimal. Furthermore, he shows that the resulting intervals are within a small fraction of optimal among all procedures for producing fixed-length intervals. With, for example, Gaussian errors, both procedures have positive probability of producing confidence intervals that do not intersect H[C], the set of feasible values of H.

The approach taken here gives data-dependent confidence intervals: not only the location, but also the length of the confidence intervals depend on the measured data. With this approach, one can get shorter intervals than with the fixed-length approach (although they can be longer), and the confidence intervals include only values of H that could arise from elements of C. Again, as mentioned in section 9, one way to ensure that they are always shorter is to use as the data confidence region $\Xi$ the slab given by Backus' or Donoho's analysis, then use the techniques here to bound h*. The problem of finding the shortest (in expectation) data-dependent confidence interval for a restricted parameter appears to be unsolved.

Another fundamental difference is that both of these approaches are geared to finding confidence intervals for a single linear functional. In contrast, the method described here can produce simultaneously valid confidence intervals for any number of functionals, including some nonlinear functionals. Both Backus' and Donoho's techniques are developed only for models in $l_2$, in contrast to the present work, which does not even require that X be a topological space. However, Donoho's technique allows C to be an arbitrary convex subset of X, while this paper discusses only balls (in arbitrary norms) and convex cones.

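A Numerical Computations for Seismic Bounds

This appendix covers the computations needed in section 7.3.2.

A.1 Computation of h*[x'] and f*[x']

These can be done analytically:

$$h^*[x'] = \int_a^{v_h} \frac{R}{2v}\,dv = \frac{R}{2}\ln\left(\frac{v_h}{a}\right). \tag{89}$$

Let

$$I_j(v) \equiv \int f_j(v)\frac{R}{2v}\,dv = \begin{cases} -R\left(p_j \sin^{-1}(p_j v) + (v^{-2} - p_j^2)^{1/2}\right), & 1 \le j \le n_\tau \\ R\sin^{-1}(p_j v), & n_\tau < j \le n \end{cases} \tag{90}$$

for $v \le p_j^{-1}$, and 0 otherwise; and where the $f_j(v)$ are defined in equations 62 and 63. Then

$$f_j^*[x'] = I_j(p_j^{-1}) - I_j(a) = \begin{cases} R\left[(a^{-2} - p_j^2)^{1/2} + p_j\left(\sin^{-1}(a p_j) - \frac{\pi}{2}\right)\right], & 1 \le j \le n_\tau \\ R\left(\frac{\pi}{2} - \sin^{-1}(a p_j)\right), & n_\tau < j \le n. \end{cases}$$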

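A.2 Computation of $f_{ji}$ and $h_i$

These computations will also be needed below to compute $\epsilon_f$. The functions $f_j(v)$ are defined in equations 62 and 63.

$$E_j(v) \equiv \int f_j(v)\,dv = \begin{cases} 2\left[(1 - v^2 p_j^2)^{1/2} - \ln\left((v p_j)^{-1} + ((v p_j)^{-2} - 1)^{1/2}\right)\right], & 1 \le j \le n_\tau \\ -2(p_j^{-2} - v^2)^{1/2}, & n_\tau < j \le n \end{cases}$$

for $v \le p_j^{-1}$, and 0 otherwise. Let

$$v_{i,j} \equiv \min\{v_i, p_j^{-1}\}.$$

Then

$$f_{ji} \equiv \frac{1}{|\Delta_i|}\int_{\Delta_i} f_j(v)\,dv = \frac{E_j(v_{i+1,j}) - E_j(v_{i,j})}{v_{i+1} - v_i}.$$

We have trivially

$$h_i = \frac{1}{|\Delta_i|}\int_{\Delta_i} 1_{v \le v_h}\,dv = \begin{cases} 1, & v_i < v_h \\ 0, & \text{otherwise} \end{cases}$$

since we have included $v_h$ in the discretization.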
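The antiderivatives used for the cell averages can be verified by quadrature. Equations 62 and 63 are not reproduced in this appendix, so the sketch below assumes the kernel forms implied by the antiderivatives themselves: $f_j(v) = 2(v^{-2} - p_j^2)^{1/2}$ for the $\tau(p)$ data and $f_j(v) = 2p_j(v^{-2} - p_j^2)^{-1/2}$ for the $X(p)$ data. The symbol names and the test values are illustrative.

```python
import math

def f_tau(v, p):
    """Assumed tau(p) kernel: 2 (v^-2 - p^2)^(1/2)."""
    return 2.0 * math.sqrt(v ** -2 - p * p)

def f_X(v, p):
    """Assumed X(p) kernel: 2 p (v^-2 - p^2)^(-1/2)."""
    return 2.0 * p / math.sqrt(v ** -2 - p * p)

def E_tau(v, p):
    """Antiderivative of f_tau with respect to v, valid for v < 1/p."""
    u = v * p
    return 2.0 * (math.sqrt(1.0 - u * u) - math.log(1.0 / u + math.sqrt(u ** -2 - 1.0)))

def E_X(v, p):
    """Antiderivative of f_X with respect to v, valid for v < 1/p."""
    return -2.0 * math.sqrt(p ** -2 - v * v)

def midpoint_integral(f, a, b, p, n=20000):
    """Midpoint-rule integral of f(v, p) over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h, p) for i in range(n))

def cell_average(E, v_lo, v_hi, p):
    """Average of the kernel over [v_lo, v_hi], truncating the cell at v = 1/p."""
    top = min(v_hi, 1.0 / p)
    bottom = min(v_lo, 1.0 / p)
    return (E(top, p) - E(bottom, p)) / (v_hi - v_lo)

p, a, b = 0.2, 1.0, 4.0
print(abs(E_tau(b, p) - E_tau(a, p) - midpoint_integral(f_tau, a, b, p)) < 1e-6)
print(abs(E_X(b, p) - E_X(a, p) - midpoint_integral(f_X, a, b, p)) < 1e-6)
```

The function `cell_average` mirrors the formula for $f_{ji}$, with the truncation at $p_j^{-1}$ playing the role of $v_{i,j}$.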

A.3 Computation of $\epsilon_f$

As figure 3 shows, the $f_j(v)$ are nonnegative and monotonic, which is helpful: the integrals in equation 64 can each be broken into integrals on two intervals, one where $f_j < f_{ji}$ and one where $f_j > f_{ji}$. The endpoint of these intervals in the interior of $\Delta_i$ can be found explicitly since the monotonicity of the $f_j$ guarantees that they can be inverted. Let

f((2)pj2+ ,1/2,

2J(4 + ) sn < j < nr

Then $f_j(\hat{w}_{i,j}) = f_{ji}$. Define $w_{i,j} = \min(\hat{w}_{i,j}, p_j^{-1})$. Recalling equation 90 we can now find the integrals of equation 64.

For $1 \le j \le n_\tau$, the $\tau(p)$ kernels are monotonic decreasing, so

ii= JAtIfi-fhI'2dv1 I (fi- f)'IRdv+J( f)-dWfRli _ RRf R

- i fi dv-fi' -2 dv= ZJ( 2j)-(v' lJ) j +f 22)Wi' R-v Vgi,j, R

=dv-jfj j j 2-dv

Vi1R J ,+1,, R

_fJ -dv+ fj2-dv

2v ,~. v

- -2I(wi,j) + Ij(v,i+,j) +-Ij(v;,,j) -fj In (v2+iv ) (95)

Then we can computeX(p =kerl 2>, and

ef= II(ij).2=1lP


A.4 Numerical Solution of the Dual

Define the functional

AY(A)--A.= ' + x'IIIIE-1 + IIA - V111,where #, y, 6' and X' and are defined in equations 68, 69, 70 and 71. To findthe minimum of h* [x] over C n D we compute

h*[x'] infiyy(A)IA

and to find the maximum of h * [x] we compute

h* [xt] + inf ii1(A).A

There is obviously positive scalar homogeneity for the optimization of linear functionals; to enhance numerical stability in the computations, h* was normalized by the ratio of its norm to the norm of the largest data functional.

The actual numerical solution of the optimization problem is nontrivial, as the objective functional, while convex and hence subdifferentiable, is not differentiable everywhere. Both the two-norm piece and the one-norm piece fail to have derivatives on a set of zero measure. Here is a subgradient of the objective functional:

{_X + iXXE + E'-m C1sgn(X *~ -_ z)? IIXIIX $ o

- + -jl?kisgn(A. , --ti), otherwise,

where $\phi_i$ is the ith column of $\Phi$, and

$$\mathrm{sgn}\,a = \begin{cases} 1, & a > 0 \\ 0, & a = 0 \\ -1, & a < 0. \end{cases}$$

The lack of smoothness in this problem is sufficient to disrupt algorithms that assume two continuous derivatives, for example the Stanford Systems Optimization Laboratory algorithm NPSOL. Shor [55] and Fletcher [20] give examples showing that steepest descent methods can converge to nonstationary points for nonsmooth convex functions.
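To illustrate the structure of such a subgradient, consider a simplified objective of the same general shape: a linear term, a two-norm term, and a one-norm term. The form $-\lambda \cdot d + \chi\|\lambda\|_2 + \sum_i |\lambda \cdot \phi_i - g_i|$ used below is an assumption made for illustration (in particular it ignores any weighting matrix in the norms), and the names `d`, `chi`, `columns` and `g` are hypothetical. The sketch builds one subgradient using the sign convention sgn 0 = 0 and checks the defining inequality $\mu(\eta) \ge \mu(\lambda) + s \cdot (\eta - \lambda)$ on random points.

```python
import math
import random

def sgn(a):
    """Sign function with sgn(0) = 0."""
    return (a > 0) - (a < 0)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def objective(lam, d, chi, columns, g):
    """Assumed simplified objective: -lam.d + chi ||lam||_2 + sum_i |lam.phi_i - g_i|."""
    two_norm = math.sqrt(dot(lam, lam))
    one_norm = sum(abs(dot(lam, phi) - gi) for phi, gi in zip(columns, g))
    return -dot(lam, d) + chi * two_norm + one_norm

def subgradient(lam, d, chi, columns, g):
    """One element of the subdifferential of the objective at lam."""
    s = [-di for di in d]
    two_norm = math.sqrt(dot(lam, lam))
    if two_norm > 0.0:
        # Gradient of the two-norm away from the origin; at the origin 0 is a valid choice.
        s = [si + chi * li / two_norm for si, li in zip(s, lam)]
    for phi, gi in zip(columns, g):
        sign = sgn(dot(lam, phi) - gi)
        s = [si + sign * ph for si, ph in zip(s, phi)]
    return s

def min_subgradient_slack(trials=500, seed=0):
    """Smallest value of mu(eta) - mu(lam) - s.(eta - lam) over random pairs; should be >= 0."""
    rng = random.Random(seed)
    n, m, chi = 4, 6, 1.3
    d = [rng.gauss(0.0, 1.0) for _ in range(n)]
    columns = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
    g = [rng.gauss(0.0, 1.0) for _ in range(m)]
    worst = math.inf
    for _ in range(trials):
        lam = [rng.gauss(0.0, 1.0) for _ in range(n)]
        eta = [rng.gauss(0.0, 1.0) for _ in range(n)]
        s = subgradient(lam, d, chi, columns, g)
        step = [e - l for e, l in zip(eta, lam)]
        slack = (objective(eta, d, chi, columns, g)
                 - objective(lam, d, chi, columns, g) - dot(s, step))
        worst = min(worst, slack)
    return worst

print(min_subgradient_slack() >= -1e-9)
```

The inequality holds at every sampled pair, which is what distinguishes a subgradient from an arbitrary descent direction.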

The 178 optimization problems for the 89 upper and lower bounds on velocities were solved by a two step procedure. In the first step, the Nelder-Mead simplex algorithm (see, e.g. [20, 48]) was used to find an initial solution. This method is inefficient but robust, requiring from 8,000 to 40,000 function evaluations to find each minimum (i.e. terminate); each function evaluation involves a matrix multiply with dimension about 20,000 by 30, and 178 functionals must be optimized, so this is rather expensive. After the Nelder-Mead algorithm found a reasonably good point, Stanford Systems Optimization Laboratory algorithm NPSOL was used to improve it, supplied with the subgradient in place of the gradient. I found empirically that the algorithm could decrease the objective value when a sufficiently good starting point was used. This may indicate that the objective functional is differentiable in a neighborhood of the optimal solution. Typically, NPSOL required 200-400 major iterations, and about 500 function and gradient evaluations to "converge," i.e. terminate. In the event the resulting bounds appeared too pessimistic (for example, if the bounds were non-monotonic), the Nelder-Mead algorithm was restarted from the best point found by NPSOL, and then NPSOL was run again. It is clear on reflection that the actual bounds themselves must be monotonic (since each point on the optimal bounds is contained in some monotonic model fitting the data), so this is a diagnostic for sharp solutions.

In addition, I tried the subgradient algorithm (see Shor [55]), with step size $h_k = \text{constant} \times (0.999)^k$. This choice of step length is illegitimate from a theoretical point of view as $\sum_k h_k < \infty$, but substantial numerical experience suggests it works well (N.Z. Shor, personal communication). (For that matter, unless the algorithm finds a point where the subgradient vanishes, no finite number of steps is sufficient.) To try to ensure that the method descends at each step, a bisection was performed up to 31 times in each step to find a point at which the value of the functional was less than the current value. My experience with the generalized gradient algorithm was that it was even slower than the two-stage procedure described above. I am hopeful that the ellipsoid algorithm [41] or the method of space-dilation in the direction of the gradient (SDG) [55] would be more efficient.
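A minimal sketch of a subgradient iteration with a geometrically decaying step is given below. It is not the code used for the computations reported here: it normalizes the subgradient, omits the bisection safeguard, and instead records the best point visited, all of which are choices made for this illustration. The test objective, a sum of absolute deviations from a made-up target, is likewise hypothetical.

```python
import math

def subgradient_method(f, subgrad, x0, h0=1.0, decay=0.999, n_steps=10000):
    """Iterate x <- x - h_k g / ||g|| with h_k = h0 * decay^k, keeping the best point seen."""
    x = list(x0)
    best_x, best_f = list(x), f(x)
    for k in range(n_steps):
        g = subgrad(x)
        g_norm = math.sqrt(sum(gi * gi for gi in g))
        if g_norm == 0.0:
            break  # zero subgradient: x is a minimizer of a convex function
        step = h0 * decay ** k
        x = [xi - step * gi / g_norm for xi, gi in zip(x, g)]
        fx = f(x)
        if fx < best_f:
            best_x, best_f = list(x), fx
    return best_x, best_f

# Hypothetical nonsmooth convex test problem: f(x) = sum_i |x_i - target_i|.
target = [1.0, -2.0, 0.5]

def f_abs(x):
    return sum(abs(xi - ti) for xi, ti in zip(x, target))

def g_abs(x):
    return [(xi > ti) - (xi < ti) for xi, ti in zip(x, target)]

best_x, best_f = subgradient_method(f_abs, g_abs, [0.0, 0.0, 0.0])
print(best_f < 1e-2)
```

The total distance the iterates can travel is at most $h_0/(1 - 0.999) = 1000\,h_0$, which is the sense in which the step rule is theoretically illegitimate: a starting point farther than that from the minimizer can never reach it.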

The computations were coded in Fortran77 and performed on a set of 5 Sun SPARCstation 1's, and a Cray X-MP/14. The velocity interval was discretized increasingly finely until $\epsilon_f \approx 2.29$, about 4% of the value $\chi = 59.7$. That required 19118 points in the interval [a, b]; 15000 equally spaced in slowness (reciprocal velocity) from a to the reciprocal of the smallest $p_j$ in the $X(p)$ data, and 4000 equally spaced in slowness from there to b; plus samples at $1/p_j$ for each j, and samples at each velocity $v_h$. The value $\chi = 59.7$ is appropriate for 99.9% confidence ($\alpha = 0.001$) for the two-norm of 30 independent, zero-mean, unit-variance Gaussian random variables. Stark and Parker [63] show empirically that the primal results are not sensitive to the precise value of $\chi$ to within about 10-20% at that level, so a coarser discretization could probably have been used without appreciably changing the results.
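The quoted misfit level can be reproduced from the chi-square distribution. The sum of squares of 30 independent standard Gaussian variables is chi-square with 30 degrees of freedom, and for an even number of degrees of freedom 2m the upper tail probability has the closed form $e^{-x/2}\sum_{j=0}^{m-1}(x/2)^j/j!$. The sketch below inverts that tail by bisection; the function names are illustrative. The 0.999 quantile comes out at about 59.70, which suggests that the value 59.7 refers to the squared two-norm (the square root is about 7.73).

```python
import math

def chi2_upper_tail_even(x, dof):
    """P(X > x) for a chi-square variable with an even number of degrees of freedom."""
    if dof <= 0 or dof % 2:
        raise ValueError("closed form used here needs a positive even dof")
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, dof // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total

def chi2_quantile_even(prob, dof, lo=0.0, hi=1000.0):
    """Quantile by bisection on the (decreasing) upper tail probability."""
    target = 1.0 - prob
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_upper_tail_even(mid, dof) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

q = chi2_quantile_even(0.999, 30)
print(round(q, 1), round(math.sqrt(q), 2), round(2.29 / 59.7, 3))
```

The last printed ratio, about 0.038, is the "about 4%" figure for the discretization term relative to the misfit level.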

References

[1] N. Aronszajn. Theory of reproducing kernels. Trans. Amer. Math. Soc., 68:337-404, 1950.

[2] G. Backus. Inference from inadequate and inaccurate data, I. Proc. Natl.Acad. Sci., 65:1-7,1970.

[3] G. Backus. Inference from inadequate and inaccurate data, II. Proc. Natl.Acad. Sci., 65:281-287, 1970.

[4] G. Backus. Inference from inadequate and inaccurate data, III. Proc. Natl.Acad. Sci., 67:282-289, 1970.

[5] G. Backus and F. Gilbert. The resolving power of gross Earth data. Geo-phys. J. Roy. astr. Soc., 16:169-205, 1968.

[6] G.E. Backus. Comparing hard and soft prior bounds in geophysical inverseproblems. Geophys. J., 94:249-261, 1988.

[7] G.E. Backus. Confidence set inference with a prior quadratic bound. Geo-phys. J., 97:119-150,1989.

[8] G.E. Backus and F. Gilbert. Uniqueness in the inversion of inaccurate grossEarth data. Phil. Trans. Roy. Soc. Lon. A., 266:123-192, 1970.

[9] R. Bayer and M. Cuer. Inversion tri-dimensionnelle des donneesaeromagnetiques sur le massif volcanique du Mont-Dore: Implicationsstructurales et geothermiques. Ann. Geophys., 37:347-365, 1981.

[10] E.N. Bessonova, V.M. Fishman, V.Z. Ryaboyi, and G.A. Sitnikova. Thetau method for inversion of travel times, I. Deep seismic sounding data.Geophys. J. Roy. astr. Soc., 36:377-398, 1974.

[11] E.N. Bessonova, V.M. Fishman, M.G. Shnirman, G.A. Sitnikova, and L.R.Johnson. The tau method for inversion of travel times, II. Earthquake data.Geophys. J. Roy. astr. Soc., 46:87-108, 1976.

[12] Michael Brodskii and Anatoly Levshin. An asymptotic approach to theinversion of free oscillation data. Geophys. J. Roy. astr. Soc., 58:631-654,1979.

[13] M.A. Brodsky and S.V. Vorontsov. On the technique of the inversion of helioseismological data. In J. Christensen-Dalsgaard and S. Frandsen, editors, Advances in Helio- and Asteroseismology, pages 137-140, IAU, 1988.

[14] J. Christensen-Dalsgaard, T.L. Duvall, D.O. Gough, J.W. Harvey, andE.J. Rhodes Jr. Speed of sound in the solar interior. Nature, 378-382,1985.


[15] J.E. Cope and B.W. Rust. Bounds on solutions of linear systems withinaccurate data. SIAM J. Numer. Anal., 16:950-963, 1979.

[16] M. Cuer. Des Questions bien posees dans des Problemes Inverses de Gravimetrie et Geomagnetisme: une Nouvelle Application de la Programmation Lineaire. PhD thesis, Academie de Montpellier, Universite des Sciences et Techniques du Languedoc, 1984.

[17] D.L. Donoho. Nonparametric inference about functionals of a density, I. Technical Report, Department of Statistics, UC Berkeley, 1984.

[18] D.L. Donoho. Statistical estimation and optimal recovery. Technical ReportNo. 214, Department of Statistics, UC Berkeley, 1989.

[19] D.L. Donoho and P.B. Stark. Minimax confidence intervals in geomag-netism. Technical Report, Department of Statistics, UC Berkeley, 1990.

[20] R. Fletcher. Practical Methods of Optimization. John Wiley and Sons,1987.

[21] J.D. Garmany. On the inversion of travel times. Geophys. Res. Lett., 6:277-279,1979.

[22] J.D. Garmany, J.A. Orcutt, and R.L. Parker. Travel-time inversion: ageometrical approach. J. Geophys. Res., 84:3615-3622, 1979.

[23] M.L. Gerver and V.M. Markushevich. On the characteristic properties oftravel-time curves. Geophys. J. Roy. astr. Soc., 13:241-246, 1967.

[24] J.R. Grasso, M. Cuer, and G. Pascal. Use of two inverse techniques. Ap-plication to a local structure in the New Hebrides island arc. Geophys. J.Roy. astr. Soc., 75:437-472, 1983.

[25] R. Hettich, editor. Semi-Infinite Programming, Springer-Verlag, New York,1979.

[26] Hui Hu. Semi-Infinite Programming. PhD thesis, Stanford University, 1988.

[27] S.P. Huestis. Uniform norm minimization in three dimensions. Geophysics,51:1141-1145, 1986.

[28] S.P. Huestis and M.E. Ander. IDB2-a Fortran program for computingextremal bounds in gravity data interpretation. Geophysics, 48:999-1010,1983.

[29] S.P. Huestis and R.L. Parker. Bounding the thickness of the oceanic mag-netized layer. J. Geophys. Res., 82:5293-5303, 1977.


[30] L. R. Johnson and R. C. Lee. Extremal bounds on the p velocity in theearth's core. Bull. Seismol. Soc. Am., 75:115-130, 1985.

[31] S.W. Lang. Bounds from noisy linear measurements. IEEE Trans. Info.Th., IT-31:498-508, 1985.

[32] S.W. Lang and T.L. Marzetta. Bounds from noisy linear measurements. In H.W. Schüssler, editor, Signal Processing II: Theories and Applications, pages 491-494, Elsevier Science Publishers B.V. (North-Holland), 1983.

[33] S.W. Lang and T.L. Marzetta. A linear programming approach to bounding spectral power. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 847-850, IEEE, April 1983.

[34] C.W. Lawson and R.J. Hanson. Solving Least Squares Problems. JohnWiley and Sons, Inc., New York, 1974.

[35] D.G. Luenberger. Optimization by Vector Space Methods. John Wiley andSons, Inc., New York, 1969.

[36] T.L. Marzetta and S.W. Lang. Power spectral density bounds. IEEE Trans.Info. Th., IT-30:117-122, 1983.

[37] M. McNutt and L. Royden. Extremal bounds on geotherms in erodingmountain belts from metamorphic pressure-temperature conditions. Geo-phys. J. Roy. astr. Soc., 88:81-95, 1987.

[38] D.W. Oldenburg. Funnel functions in linear and nonlinear appraisal. J.Geophys. Res., 88:7387-7398, 1983.

[39] J.A. Orcutt. Joint linear, extremal inversion of seismic kinematic data. J.Geophys. Res., 85:2649-2660, 1980.

[40] J.A. Orcutt, R.L. Parker, P.B. Stark, and J.D. Garmany. Comment con-cerning 'A method of obtaining a velocity-depth envelope from wide-angleseismic data' by R. Mithal and J.B. Diebold. Geophysical Journal, 95:209-212, 1988.

[41] C.H. Papadimitriou and K. Steiglitz. Combinatorial Optimization: Algorithms and Complexity. Prentice-Hall, Inc., New Jersey, 1982.

[42] R.L. Parker. Best bounds on density and depth from gravity data. Geo-physics, 39:644-649, 1974.

[43] R.L. Parker. A statistical theory of seamount magnetism. J. Geophys. Res.,93:3105-3115, 1988.


[44] R.L. Parker. The theory of ideal bodies for gravity interpretation. Geophys.J. Roy. astr. Soc., 42:315-334, 1975.

[45] R.L. Parker and M.K. McNutt. Statistics for the one-norm misfit measure.J. Geophys. Res., 85:4429-2230, 1980.

[46] R.L. Parker, L. Shure, and J. Hildebrand. The application of inverse theoryto seamount magnetism. Reviews of Geophysics, 25:17-40, 1987.

[47] J.E. Pierce and B.W. Rust. Constrained least squares interval estimation.SIAM J. Sci. Stat. Comput., 6:670-683, 1985.

[48] W.H. Press, B.P. Flannery, S.A. Teukolsky, and W.T. Vetterling. NumericalRecipes in C. Cambridge University Press, 1988.

[49] B.W. Rust and W.R. Burrus. Mathematical Programming and the Numer-ical Solution of Linear Equations. American Elsevier Pub. Co., Inc., NewYork, 1972.

[50] P.C. Sabatier. Basic concepts and methods of inverse problems. In P.C.Sabatier, editor, Basic Methods of Tomography and Inverse Problems,pages 471-667, Adam Hilger, 1987.

[51] P.C. Sabatier. Positivity constraints in linear inverse problems-I. Generaltheory. Geophys. J. Roy. astr. Soc., 48:415-441, 1977.

[52] P.C. Sabatier. Positivity constraints in linear inverse problems-II. Appli-cations. Geophys. J. Roy. astr. Soc., 48:443-459, 1977.

[53] P.C. Sabatier. Well-posed questions and exploration of the space of parameters in linear and non-linear inversion. In F. Santosa, Y.-H. Pao, W.W. Symes, and C. Holland, editors, Inverse Problems of Acoustic and Elastic Waves, pages 82-103, SIAM, 1984.

[54] C. Safon, G. Vasseur, and M. Cuer. Some applications of linear program-ming to the inverse gravity problem. Geophysics, 42:1215-1229, 1977.

[55] N.Z. Shor. Minimization Methods for Non-Differentiable Functions.Springer-Verlag, 1985.

[56] L. Shure, R.L. Parker, and G.E. Backus. Harmonic splines for geomagneticmodelling. Phys. Earth Planet. Int., 28:215-229, 1982.

[57] L. Shure, R.L. Parker, and R.A. Langel. A preliminary harmonic splinemodel from Magsat data. J. Geophys. Res., 90:11,505-11,512, 1985.

[58] P.B. Stark. Rigorous computer solutions for infinite-dimensional inverseproblems. In P.C. Sabatier, editor, Inverse Problems in Action, Springer-Verlag, 1990.

[59] P.B. Stark. Rigorous velocity bounds from soft tau(p) and X(p) data. Geophys. J. Roy. astr. Soc., 89:987-996, 1987.

[60] P.B. Stark. Strict bounds and applications. In P.C. Sabatier, editor, SomeTopics on Inverse Problems, World Scientific, Singapore, 1988.

[61] P.B. Stark and R.L. Parker. Correction to 'Velocity bounds from statistical estimates of tau(p) and X(p)' by Philip B. Stark and Robert L. Parker. J. Geophys. Res., 93:13,821-13,822, 1988.

[62] P.B. Stark and R.L. Parker. Smooth profiles from tau(p) and X(p) data. Geophys. J. Roy. astr. Soc., 89:997-1010, 1987.

[63] P.B. Stark and R.L. Parker. Velocity bounds from statistical estimates of tau(p) and X(p). J. Geophys. Res., 92:2713-2719, 1987.

[64] P.B. Stark, R.L. Parker, G. Masters, and J.A. Orcutt. Strict bounds onseismic velocity in the spherical Earth. J. Geophys. Res., 91:13,892-13,902,1986.

[65] R.A. Wiggins, G.A. McMechan, and M.N. Toksoz. Range of Earth structure nonuniqueness implied by body wave observations. Rev. Geophys., 11:87-114, 1973.

Figure 3: Kernels for 25 tau(p) and 5 X(p) data (horizontal axis: v (km/s)).

Figure 2: Geometrical Schematic of Strict Bounds; Illustration of the Notation in Section 2.
