Optimum quantization for detector fusion: some proofs, examples, and pathology

Douglas Warren (a,1), Peter Willett (b,*,2)

(a) Department of Electrical Engineering and Computer Science, University of Illinois at Chicago, Chicago, IL 60657, USA
(b) Department of Electrical/Systems Engineering, University of Connecticut, U-157, Storrs, CT 06269, USA

Journal of the Franklin Institute 336 (1999) 323–359

* Corresponding author. Tel.: 860-486-2195; e-mail: willett@engr.uconn.edu
1 Present address: Citicorp, New York City.
2 Supported in part by ONR/BMDO-IST grant N00014-91-J-1950 and AFOSR grant F49620-95-1-0299.

S0016-0032/98/$ – See front matter © 1998 The Franklin Institute. Published by Elsevier Science Ltd. PII: S0016-0032(98)00024-6

Abstract

We consider optimal decentralized (or, equivalently, quantized) detection for the Neyman–Pearson, Bayes, Ali–Silvey distance, and mutual (Shannon) information criteria. In all cases, it is shown that the optimal sensor decision rules are quantizers that operate on the likelihood ratio of the observations. We further show that randomized fusion rules are suboptimal for the mutual information criterion. We also show that if the processes observed at the sensors are conditionally independent and identically distributed, and the criterion for optimization is either a member of a subclass of the Ali–Silvey distances or local mutual information, then the quantizers used at all of the sensors are identical. We give an example to show that for the Neyman–Pearson and Bayes criteria this is not generally true. We go into some detail with respect to this last, and derive necessary conditions for an assumption of identical sensor quantizer maps to be reasonable. © 1998 The Franklin Institute. Published by Elsevier Science Ltd.

1. Introduction

The initial paper on the subject of distributed detection, by Tenney and Sandell [1], showed that under a fixed fusion rule, for two sensors with one-bit outputs, the optimal Bayes sensor decision rule is a likelihood ratio test. In [2], it is shown that the optimal fusion rule for N sensors is a likelihood ratio test on the data received from the sensors. Reibman and Nolte [3] and Hoballah and Varshney [4] have generalized




the results in [1] to N sensors with optimal fusion, again with the restriction of one-bit sensor outputs. Hoballah and Varshney [5] have also investigated system optimization under the aforementioned conditions for a mutual information criterion. The restriction of the sensor outputs to single bits seems unduly harsh since it implies either very rapid decision rates or extremely narrowband channels. In this paper, we remove this restriction and assume instead that the ith sensor produces a $\log_2 M_i$-bit output. The authors have derived some results concerning this more general case in [6]; using different techniques, Tsitsiklis has come to conclusions similar to those presented in this paper [7, 8], as have, for example, Thomopoulos, Viswanathan et al. [9, 10].

As always, when speaking of optimality, it is necessary to impose a criterion for judgment. For detectors, several such measures have been used. In this paper, we study the structure of optimal decentralized detectors for the Neyman–Pearson, Bayes, Ali–Silvey, and mutual (Shannon) information criteria. We assume that, conditioned on the actual hypothesis (state of nature), the random processes observed at any two different sensors are independent. Given this assumption, we show that for each criterion the optimal strategy is to quantize the local likelihood ratio at the sensors (to the maximum number of bits allowable) and transmit this result to the fusion center. The fusion center then performs a likelihood ratio test on the received data.

That quantizing the likelihood ratio is the optimal thing to do is scarcely a surprising result. Even prior to a proof of its optimality it was used by a number of researchers, and an excellent and rigorous proof is available in [7]. In that work, however, reference is made to the different and equally rigorous proof in [6], and since [6] has never been published archivally, it seems appropriate to offer it in the present paper. Certainly the field has not stood still since [6, 7], and some papers we particularly admire are [11–13]; but to present a complete bibliography is not the aim of this offering.

Following presentation of background material in Section 2, Sections 3 and 5, demonstrating optimality of likelihood ratio quantization under Neyman–Pearson, Bayes, and Ali–Silvey distance criteria, are more detailed versions of the material in [6]. We show in Section 4 that the detector that maximizes mutual information is identical to the minimum Bayes cost detector for some set of prior probabilities. We also demonstrate that randomization at the fusion center reduces mutual information. Section 6 provides discussion and a number of examples and comparisons. One of these will deal with a case in which sensors observe (conditionally) iid data, yet optimally should not use an identical quantization; in Section 7 we discuss this pathology in some depth. Finally, in Section 8 we summarize.

2. Statement of the problem

We consider the binary detection problem with N sensors. The purpose of the detector is to discriminate optimally between two states of nature $H_0$ and $H_1$. For example, $H_0$ may represent a noise-only hypothesis whereas $H_1$ represents signal plus additive noise. For the Bayes and mutual information criteria, we consider the state of


nature to be a random variable; for the Neyman–Pearson and Ali–Silvey criteria, it is assumed to be deterministic. The ith sensor input is a realization of the random vector $X_i \in (\Omega_i, \mathcal{B}_i)$ (of arbitrary dimension). The statistical behavior of this random vector is defined by
$$H_0: X_i \sim P_{0i}, \qquad H_1: X_i \sim P_{1i}$$
where $H_0$ and $H_1$ are the null and alternative hypotheses, respectively, and the $P_{ki}$, $i = 1, 2, \ldots, N$, $k = 0, 1$, are probability distributions. We assume throughout that $P_{0i}$ and $P_{1i}$ are absolutely continuous with respect to each other; hence, the same $\sigma$-algebra is used under both hypotheses.

If the channels between the sensors and the fusion center are not bandlimited, the decentralized detection problem becomes, in effect, centralized. The sensors will then transmit the received data directly. Let $x = \{x_1, x_2, \ldots, x_N\}$ be a realization of the complete (at all sensors) random data $X = \{X_1, X_2, \ldots, X_N\}$ with distribution $P_k$ under hypothesis $H_k$. The optimal centralized test for the Neyman–Pearson and Bayes criteria is well known [14]. As we show in Section 4 (see also [15]), the same test is also optimal for the mutual information criterion [16]. This test is given by
$$\phi(x) = \begin{cases} 1 & \text{if } L(x) > \tau \\ \gamma & \text{if } L(x) = \tau \\ 0 & \text{if } L(x) < \tau \end{cases} \tag{1}$$
where $\phi(x)$ is the probability of deciding for $H_1$ when $x$ is observed, $L$ is the likelihood ratio $dP_1/dP_0$, and $\gamma$ and $\tau$ are chosen to conform to the specific performance goals of the system designer. For Bayes detection, $\gamma$ is irrelevant and for mutual information it is optimally 0 or 1 (see Section 4). If we assume no bandwidth constraint and that the data at different sensors are conditionally independent, then the sensors can transmit the local likelihood ratio (since it is then a sufficient statistic for the detection problem [17, pp. 145–147]), rather than the data itself. We assume henceforth that the sensor data are, in fact, conditionally independent.
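As a toy numerical illustration (our own construction, not from the paper), the randomized test of Eq. (1) can be sketched directly; the Gaussian mean-shift likelihood ratio $L(x) = e^{x - 1/2}$ for $H_1: \mathcal{N}(1,1)$ versus $H_0: \mathcal{N}(0,1)$ is an assumed example.

```python
import math
import random

def randomized_lrt(L_x, tau, gamma):
    """Randomized likelihood ratio test of Eq. (1): return 1 (decide H1)
    with probability 1, gamma, or 0 according as L(x) >, =, < tau."""
    if L_x > tau:
        return 1
    if L_x == tau:
        return 1 if random.random() < gamma else 0
    return 0

# Assumed example: H0 ~ N(0,1), H1 ~ N(1,1), so L(x) = exp(x - 1/2).
def L_gauss(x):
    return math.exp(x - 0.5)

decision = randomized_lrt(L_gauss(2.0), tau=1.0, gamma=0.5)  # L(2.0) > 1, so decide H1
```

Randomization enters only on the (often null) event $L(x) = \tau$; for continuous observations this branch is rarely taken.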

In general, for decentralized detection, the channels between the sensors and the fusion center will be bandlimited. Suppose that, for a given decision, the maximum number of bits that can be transmitted by sensor i is $b_i = \log_2 M_i$. Suppose also that the number of possible values of $L_i$ (the local likelihood ratio at the ith sensor) is greater than $M_i$. Under these circumstances, we must transmit a non-sufficient statistic $U_i$ from the ith sensor to the fusion center. The major subject of this paper is the optimal form of such a statistic. Since the channel bit rate is limited to $b_i$, it is apparent that $U_i$ must belong to a set containing at most $M_i$ elements. The elements themselves are irrelevant; for convenience we assume that they form the set


$\{0, 1, \ldots, M_i - 1\}$ and we define the probabilities $a_{ij} = \Pr\{U_i = j \mid H_0\}$ and $b_{ij} = \Pr\{U_i = j \mid H_1\}$. Hence, the situation we are studying is the following: the ith sensor receives the random data $X_i$; according to some yet-to-be-determined decision rule, it produces the statistic $U_i$, which takes on one of the values in the set $\{0, 1, \ldots, M_i - 1\}$ in accordance with the probabilities $\{a_{ij}\}$ and $\{b_{ij}\}$; the set of statistics $\{U_i\}$, $i = 1, 2, \ldots, N$, is transmitted to the fusion center, where the final binary decision $U_0$ is made. The objective is to have a procedure (at both the sensors and the fusion center) that selects the value for $U_0$ in an optimal way.

We assume throughout this paper that $X_i$ has a density. This can be done without loss of generality since we are concerned with the relative probabilities under the two hypotheses of sets of values for $X_i$ but never with the actual values of $X_i$ itself. Hence, for any $X_i$ that does not have a density and any partition $U_i$ of the set $\Omega_i$, we can substitute the random variable $X'_i$ and the partition $U'_i$ so that $X'_i$ has a density and $U'_i$ has the same distribution as $U_i$ under both hypotheses. We do not, however, assume that the likelihood ratio of $X_i$ has a density, since this excludes many situations of interest (for example, additive signals in Laplace noise). One of the main points to be demonstrated in this paper is that it is the distributions of the likelihood ratio with which we are most interested.

Lastly, as a notational convenience, we let $I_P^+$ be the set of positive integers less than or equal to $P$ and we let $I_Q$ be the set of non-negative integers less than or equal to $Q$.

3. Neyman–Pearson and Bayes criteria

A binary detection system makes a decision that can take on one of two values. The selection of this value amounts to deciding, based on the available data, that one of the two possible states of nature $H_0$ and $H_1$ is true. Since the data used in the decision process are random, there is generally a non-zero probability of error associated with the system decision. Throughout this paper, we will refer to the probability of deciding $H_1$ when $H_0$ is actually true as the false alarm probability, level, or size of the test. We will refer to the probability of correctly deciding $H_1$ as the probability of detection or the power of the test. These two probabilities exhaustively define the statistical properties of the test.

3.1. Decentralized Neyman–Pearson detection

The objective for systems designed under the Neyman–Pearson criterion is to achieve maximum power for fixed test level. For an arbitrary fusion rule (with given sensor decision rules for the N sensors) and maximum overall level $\alpha_0$, the ultimate decision $U_0$ is determined by a partition of the observation space $D$ of the vector $\{U_1, U_2, \ldots, U_N\}$. The space $D$ is the Cartesian product of the sets $I_{M_i - 1}$ for $i \in I_N^+$. In the following, we denote a decision for $H_k$ by $U_0 = k$. Let $u = \{u_1, u_2, \ldots, u_N\}$ be a realization of $U$. If the fusion rule is chosen according to the Neyman–Pearson criterion (that is, the rule which maximizes the resulting power of the test), then we use


a likelihood ratio test given by
$$\phi(u) = \begin{cases} 1 & \text{if } L(u) > \tau \\ \gamma & \text{if } L(u) = \tau \\ 0 & \text{if } L(u) < \tau \end{cases} \tag{2}$$
where $\phi(u) = \Pr\{U_0 = 1 \mid U = u\}$,
$$L(u) = \prod_{n=1}^{N} b_{n u_n} \left[ \prod_{n=1}^{N} a_{n u_n} \right]^{-1} \tag{3}$$
and $\gamma \in [0, 1]$. If we define $D_1 = \{u : L(u) > \tau\}$, $D_c = \{u : L(u) = \tau\}$, and $D_0 = D \setminus (D_1 \cup D_c)$, then for the fixed test size $\alpha_0$, $\gamma$ and $\tau$ are chosen so that
$$\alpha_0 = \sum_{D_1} \prod_{n=1}^{N} a_{n u_n} + \gamma \sum_{D_c} \prod_{n=1}^{N} a_{n u_n} \tag{4}$$
The power of this test is
$$\beta = \sum_{D_1} \prod_{n=1}^{N} b_{n u_n} + \gamma \sum_{D_c} \prod_{n=1}^{N} b_{n u_n} \tag{5}$$
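For small systems, Eqs. (3)–(5) can be evaluated exhaustively. The following sketch (a made-up two-sensor, one-bit example; all pmf values $a_{ij}$, $b_{ij}$ are our own assumptions) enumerates the space $D$, computes $L(u)$, and evaluates the level and power of the fusion rule at a threshold $\tau$.

```python
from itertools import product

# Assumed toy pmfs: a[i][j] = Pr{U_i = j | H0}, b[i][j] = Pr{U_i = j | H1}.
a = [[0.7, 0.3], [0.6, 0.4]]
b = [[0.3, 0.7], [0.2, 0.8]]

def fused_L(u):
    """Eq. (3): likelihood ratio of the received vector u."""
    num = den = 1.0
    for i, ui in enumerate(u):
        num *= b[i][ui]
        den *= a[i][ui]
    return num / den

def level_and_power(tau, gamma=0.0):
    """Eqs. (4)-(5): accumulate product probabilities over D1 and Dc."""
    alpha0 = beta = 0.0
    for u in product(range(2), range(2)):
        pa = a[0][u[0]] * a[1][u[1]]
        pb = b[0][u[0]] * b[1][u[1]]
        L = fused_L(u)
        if L > tau:
            alpha0 += pa
            beta += pb
        elif L == tau:
            alpha0 += gamma * pa
            beta += gamma * pb
    return alpha0, beta

alpha0, beta = level_and_power(tau=1.0)  # here only u = (1,1) has L(u) > 1
```

With these numbers the fusion rule decides $H_1$ only when both sensors report 1, giving level $0.12$ and power $0.56$.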

It is well known [18] that the optimal Bayes test, with $\gamma = 0$ and an appropriately chosen value for $\tau$, can be written in the form of Eq. (2). Hence, for a given set of priors, the optimal Bayes test can be found by choosing an appropriate Neyman–Pearson test. Thus, it is clear that the optimal Bayes decentralized test will be determined by the set of ordered pairs $(\alpha_0, \beta^*)$ where $\alpha_0 \in (0, 1)$ is the test size and $\beta^*$ is the maximum power possible at that size. Which particular pair is used will depend on the prior probabilities of $H_0$ and $H_1$.

Let $\phi_{i,j}(x_i) = \Pr(U_i = j \mid X_i = x_i)$ for each $j \in I_{M_i - 1}$ and each $i \in I_N^+$. If, for each $x_i$, $\sum_{j=0}^{M_i - 1} \phi_{i,j}(x_i) = 1$, then the set $\{\phi_{i,j}\}$ defines a mapping of $\Omega_i$ into $I_{M_i - 1}$ and can be used in a decentralized detector as a sensor decision rule. We call such a mapping an $M_i$-level mapping. With $b_{ij}$ and $a_{ij}$ as previously defined, we have the following lemma.

Lemma 1. If $b_{ij}/a_{ij} = b_{ik}/a_{ik}$ for $j \neq k$, then the performance of a decentralized detection system using an $M_i$-level mapping as the ith sensor decision rule is equivalent to that of a system using an $(M_i - 1)$-level mapping.

Proof. Let $Q_n$ be the probability mass function of $U_i$ under the hypothesis $H_n$, $n = 0, 1$. We define $u^i = \{u_1, \ldots, u_{i-1}, u_{i+1}, \ldots, u_N\}$ and use the notation $u = (u^i, u_i)$. Assume that $X_p$ and $X_q$ are conditionally independent (given the hypothesis) for $p \in I_N^+$, $q \in I_N^+$, and $p \neq q$. We then have
$$L(u) = L(u^i, u_i) = L(u^i) L(u_i)$$


For $\tau' = \tau / L(u^i)$, the fusion rule of Eq. (2) can be rewritten as
$$\phi(u) = \begin{cases} 1 & \text{if } L(u_i) > \tau' \\ \gamma & \text{if } L(u_i) = \tau' \\ 0 & \text{if } L(u_i) < \tau' \end{cases}$$
Then $(u^i, u_i = k)$ and $(u^i, u_i = j)$ are either both in $D_0$, both in $D_1$, or both in $D_c$. The probabilities of these two points are added together in Eqs. (4) and (5). Hence, without affecting the level or power of the test, we can conjoin the decision regions for $u_i = j$ and $u_i = k$ into one region with probability $Q_n(j) + Q_n(k)$, $n = 0, 1$. □

Generally, there are uncountably many possible decision rules for the sensors. A decision rule of particular interest is given by the following definition.

Definition 1. A monotone partition of the set $\Omega$ by the likelihood ratio $L$ is a collection of disjoint, non-trivial sets $\{R_0, \ldots, R_N\}$ with $\bigcup_{i=0}^{N} R_i = \Omega$ such that if $x \in R_j$, $x' \in R_j$, $y \in R_k$, $y' \in R_k$, and $L(x) > L(y)$, then $L(x') \geq L(y')$ a.e. $(P_0)$.

We henceforth refer to such a partition as a likelihood ratio partition. Let $L_i = dP_{1i}/dP_{0i}$ be the likelihood ratio of $X_i$. We show later that the optimal decision rule for the ith sensor is a likelihood ratio partition of $\Omega_i$ by $L_i$.

Lemma 2. For fixed test size, there exists an $M_i$-level mapping such that the power of a decentralized detection system using this $M_i$-level mapping as the ith sensor decision rule is no smaller than that of a system using any $(M_i - 1)$-level mapping.

Proof. Subdivide the probability in the jth level by a likelihood ratio partition into levels $j_1$ and $j_2$. Assume $j_2$ includes the $x_i$ with larger likelihood ratios. We then have
$$b_{ij_2}/a_{ij_2} \geq b_{ij}/a_{ij} \geq b_{ij_1}/a_{ij_1} \tag{6}$$
and hence
$$L(u^i, u_i = j_2) \geq L(u^i, u_i = j) \geq L(u^i, u_i = j_1)$$
for all $u^i$. Assign $(u^i, u_i = j_1)$ and $(u^i, u_i = j_2)$ to the same $D_k$, $k \in \{0, c, 1\}$, as that to which the original fusion rule assigned $(u^i, u_i = j)$. Then the power and the level of the test remain unchanged. If strict inequality holds in Eq. (6), it may be necessary to reformulate the fusion rule so that it performs a Neyman–Pearson test on $U$. Under these circumstances, the performance may improve. □
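Lemma 2 can be checked numerically for a single sensor by exhaustive search: quantize a small discrete observation into a given number of levels and compare the best achievable Neyman–Pearson power. Everything below (the four-atom pmfs, the level $\alpha$) is an assumed toy example, not from the paper.

```python
from itertools import product

# Assumed toy observation: 4 atoms with pmfs under H0 and H1.
p0 = [0.4, 0.3, 0.2, 0.1]
p1 = [0.1, 0.2, 0.3, 0.4]

def np_power(q0, q1, alpha):
    """Power of the randomized Neyman-Pearson test of level alpha on a
    discrete statistic with pmfs q0 (under H0) and q1 (under H1)."""
    order = sorted(range(len(q0)), key=lambda j: q1[j] / q0[j], reverse=True)
    power, budget = 0.0, alpha
    for j in order:
        if q0[j] <= budget:
            power += q1[j]
            budget -= q0[j]
        else:
            power += q1[j] * budget / q0[j]  # randomize at the boundary
            break
    return power

def best_quantized_power(levels, alpha):
    """Best power over all surjective atom -> level assignments."""
    best = 0.0
    for assign in product(range(levels), repeat=len(p0)):
        if set(assign) != set(range(levels)):
            continue  # every level must actually be used
        q0 = [0.0] * levels
        q1 = [0.0] * levels
        for atom, lvl in enumerate(assign):
            q0[lvl] += p0[atom]
            q1[lvl] += p1[atom]
        best = max(best, np_power(q0, q1, alpha))
    return best

# Adding a level never hurts (Lemma 2):
p2 = best_quantized_power(2, alpha=0.1)
p3 = best_quantized_power(3, alpha=0.1)
```

In this example the two-level quantizer already attains the centralized power, so the three-level search merely matches it; in general it can strictly improve.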

Lemma 3. Assume that for any finite set $F \subset \mathbb{R}^+$ containing $M_i$ or fewer elements, $P_{0i}\{L_i(X_i) \in F\} < 1$. Then a decentralized detector using a likelihood ratio partition of


$\Omega_i$ is at best equivalent to one that uses some sensor decision rule of the form
$$\Pr(U_i = j \mid X_i = x_i) = \phi_{i,j}(x_i) = \begin{cases} 1 & \text{if } t_{i,j} < L_i(x_i) < t_{i,j+1} \\ c_{i,j} & \text{if } L_i(x_i) = t_{i,j} \\ 1 - c_{i,j+1} & \text{if } L_i(x_i) = t_{i,j+1} \\ 0 & \text{otherwise} \end{cases} \tag{7}$$
where $j \in I_{M_i - 1}$, $t_{i,0} = 0$, $t_{i,M_i} = \infty$, $t_{i,j} \leq t_{i,j+1}$, and $c_{i,k} \in [0, 1]$, $k \in I_{M_i}$. Furthermore, there always exists a likelihood ratio partition for which detector performance will be equivalent to that achieved using the decision rule of Eq. (7).
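In the deterministic case (all $c_{i,j} = 1$), the rule of Eq. (7) is simply a threshold quantizer on the local likelihood ratio, assigning level $j$ when $t_{i,j} \le L_i(x_i) < t_{i,j+1}$. A minimal sketch (the threshold values are our own assumptions):

```python
import bisect

def quantize_likelihood_ratio(L_xi, thresholds):
    """Deterministic special case of Eq. (7): thresholds = [t_0 = 0, t_1, ...,
    t_{M-1}], with t_M = infinity implicit; returns the level j such that
    thresholds[j] <= L_xi < thresholds[j+1]."""
    return bisect.bisect_right(thresholds, L_xi) - 1

# Assumed 3-level quantizer with thresholds t = (0, 0.5, 2.0):
levels = [quantize_likelihood_ratio(L, [0.0, 0.5, 2.0]) for L in (0.25, 1.0, 3.0)]
```

A boundary value such as $L_i(x_i) = 0.5$ lands in the upper of the two adjacent levels, matching the $c_{i,j} = 1$ convention.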

Proof. Note that Eq. (7) is not always equivalent to a likelihood ratio partition since $x_i$ may be assigned to more than one level with non-zero probability. If, however, $L_i(X_i)$ has a density with respect to Lebesgue measure, the assertion of equivalence is trivially true with $c_{i,k} = 1$ for all $k$. Assume that this is not the case. Without loss of generality, we can reorder the levels at the ith sensor so that
$$\frac{b_{ij}}{a_{ij}} \geq \frac{b_{ik}}{a_{ik}} \quad \text{for } j > k \tag{8}$$
Assume, also without loss of generality, that $X_i$ has a density with respect to Lebesgue measure under both hypotheses. Let $R_{i,j}$ be the set of $x_i$ to which the reordered likelihood ratio partition assigns the value $j$. If $L_i(x_i)$ has the same value a.e. $(P_{0i})$ in $R_{i,j}$ and $R_{i,k}$, $j \neq k$, then by Lemma 1, performance is not degraded by conjoining these sets. Because $L_i(X_i)$ cannot place all of its mass in $F$, there must be some $R_{i,m}$, $m \neq j$, $m \neq k$, in which $L_i(X_i)$ is not constant. By Lemma 2, we can then add another level, again without degrading performance. Hence, no more than three $R_{i,j}$ will contain points with the same value of $L_i$. Lastly, assume that $L_i(x_i) = c$ for all $x_i \in R_{i,j}$ and that there is a subset $T$ of $R_{i,m}$, $m \neq j$, on which $L_i$ is identically equal to $c$. By first applying Lemma 2 (making $T$ a separate partition set) and then Lemma 1 (conjoining $T$ to $R_{i,j}$), no degradation occurs. Hence, no more than two $R_{i,j}$ will contain points with the same value of $L_i$. This is accounted for in Eq. (7) by the randomization at the boundaries. A further implication is that we may say that $t_{i,j}$ is strictly less than $t_{i,j+1}$.

Because of the assumption that $X_i$ has a density, it is possible to partition the set $S = \{x_i : L_i(x_i) = t_{i,j}\}$ into $S_j$ and $S_{j-1}$ such that $\Pr\{S_j \mid H_0\} = c_{i,j} \Pr\{S \mid H_0\}$ and $\Pr\{S_{j-1} \mid H_0\} = (1 - c_{i,j}) \Pr\{S \mid H_0\}$. We then redefine our sensor decision rule so that $\phi_{i,j}(x_i) = 1$ for $x_i \in S_j$ and $\phi_{i,j-1}(x_i) = 1$ for $x_i \in S_{j-1}$. Since the likelihood ratio is constant over $S$, the $H_1$ probabilities of $U_i$ are also unaffected and, hence, so is the test performance. With this partition, the sets $R_{i,j}$ and $R_{i,j-1}$ will be disjoint. Hence, the boundary randomization of Eq. (7) never improves performance and is used merely as a mathematical convenience. We conclude, then, that test performance using the decision rule of Eq. (7) is always equivalent to that achieved when some likelihood ratio partition is used. □


We note in passing that if $X_i$ does not have a density, the definition of a likelihood ratio partition may be modified by omitting the word disjoint. In that event, Lemma 3 will hold. We also note that if $F$ is any finite set of real numbers and if $L_i(X_i) \in F$ with probability one, then the performance of a decentralized detector using likelihood ratio partitions cannot be improved by increasing the number of levels at the ith sensor beyond the number of elements in $F$.

Theorem 1. The Neyman–Pearson optimal sensor decision rule is a likelihood ratio partition.

Proof. We assume that $b_{ij}/a_{ij} \geq b_{ik}/a_{ik}$. Then $L(u^i, j) \geq L(u^i, k)$. Hence $(u^i, u_i = k) \in D_1$ implies that $(u^i, u_i = j) \in D_1$, and $(u^i, u_i = k) \in D_c$ implies that $(u^i, u_i = j) \in D_c \cup D_1$; that is, if there is a non-zero probability that we would decide for $H_1$ when $u_i = k$, then the probability of so deciding is at least as large if we change $u_i$ from $k$ to $j$. Assume that the levels are ordered as in Eq. (8). Then for each $(N-1)$-dimensional vector $u^i$ there is some integer $b(u^i)$ such that for $j \geq b(u^i)$, $(u^i, u_i = j) \in D_1$. Furthermore, there is some integer $a(u^i) < b(u^i)$ such that for $a(u^i) < j < b(u^i)$, $(u^i, u_i = j) \in D_c$.

Suppose we are given an arbitrary decentralized detection system which satisfies Eqs. (2) and (8). Let $\phi_{i,j}(x_i) = \Pr(U_i = j \mid X_i = x_i)$ define the sensor partition rules for this arbitrary system. Here $i \in I_N^+$ and $x_i \in \Omega_i$. For this partition, let $Q_{ki}(j) = \int_{\Omega_i} \phi_{i,j}(x_i)\, dP_{ki}(x_i)$ be the probability that $U_i = j$ under the hypothesis $H_k$, $k = 0, 1$. Also construct a likelihood ratio partition for the ith sensor with the probability functions $Q^*_{ki}$, $k = 0, 1$. We choose the latter partition so that
$$Q_{0i}(j) = Q^*_{0i}(j), \quad j = 0, 1, \ldots, M_i - 1 \tag{9}$$
and assume that it can be described by Eq. (7) with $t_{i,0} = 0$, $t_{i,M_i} = \infty$, $t_{i,j} < t_{i,j+1}$, and the $c_{i,k} \in [0, 1]$, $k \in I_{M_i}$. The $\{t_{i,j}\}$, $j \in I^+_{M_i - 1}$, and $\{c_{i,j}\}$ are chosen to satisfy the equality constraints of Eq. (9). Using the Neyman–Pearson fusion rule used for the arbitrary partition, $Q^*_{0i}$ and $Q^*_{1i}$ will also satisfy Eq. (8) and the two tests will have the same level.

For the likelihood ratio partition we have, for any $n \in I_{M_i - 1}$,
$$Q^*_{1i}\{U_i \geq n\} = P_{1i}\{L(X_i) > t_{i,n}\} + c_{i,n} P_{1i}\{L(X_i) = t_{i,n}\}$$

For Neyman–Pearson fusion and by Eq. (8), the power of the test using the arbitrary partition is
$$\beta = \sum_{u^i} \Pr\{u^i \mid H_1\} \left[ Q_{1i}\{u_i \geq b(u^i)\} + \gamma Q_{1i}\{a(u^i) < u_i < b(u^i)\} \right] \tag{10}$$
where $\gamma \in [0, 1]$ is a randomization chosen so that the test level may be exactly specified. Suppose that $v^i$ is some arbitrary value of $u^i$ and that $b(v^i) = b$ and $a(v^i) = a$. Then for $u^i = v^i$ the bracketed term in Eq. (10) can be written as
$$Q_{1i}\{u_i \geq b\} + \gamma Q_{1i}\{a < u_i < b\} = p + \gamma \rho$$

330 Douglas Warren, Peter Willett / Journal of the Franklin Institute 336 (1999) 323—359

Page 9: Optimum quantization for detector fusion: some proofs, examples, and pathology

where
$$p = \int_{\Omega_i} \sum_{j \geq b} \phi_{i,j}(x_i)\, dP_{1i}(x_i)$$
and
$$\rho = \int_{\Omega_i} \sum_{a < j < b} \phi_{i,j}(x_i)\, dP_{1i}(x_i)$$

For the likelihood ratio partition with Neyman–Pearson fusion, the power is
$$\beta^* = \sum_{u^i} \Pr\{u^i \mid H_1\} \left[ Q^*_{1i}\{u_i \geq b(u^i)\} + \gamma Q^*_{1i}\{a(u^i) < u_i < b(u^i)\} \right] \tag{11}$$
For $u^i = v^i$ the bracketed term in Eq. (11) can be written as
$$Q^*_{1i}\{u_i \geq b\} + \gamma Q^*_{1i}\{a < u_i < b\} = p^* + \gamma \rho^* \tag{12}$$

where
$$p^* = P_{1i}\{L_i(x_i) > t_{i,b}\} + c_{i,b} P_{1i}\{L_i(x_i) = t_{i,b}\}$$
and
$$\rho^* = P_{1i}\{t_{i,a+1} \leq L_i(x_i) < t_{i,b}\} + (1 - c_{i,b}) P_{1i}\{L_i(x_i) = t_{i,b}\} - (1 - c_{i,a+1}) P_{1i}\{L_i(x_i) = t_{i,a+1}\}$$

Since $v^i$ is arbitrary, to complete the proof it suffices to show that the quantity in Eq. (12) is at least as large as $p + \gamma\rho$, or
$$p^* - p - \gamma(\rho - \rho^*) \geq 0 \tag{13}$$
By Eq. (9), the $H_0$ probability corresponding to $p$ is
$$P_{0i}\{L_i(x_i) > t_{i,b}\} + c_{i,b} P_{0i}\{L_i(x_i) = t_{i,b}\}$$

and the $H_0$ probability corresponding to $\rho$ is
$$P_{0i}\{t_{i,a+1} \leq L_i(x_i) < t_{i,b}\} + (1 - c_{i,b}) P_{0i}\{L_i(x_i) = t_{i,b}\} - (1 - c_{i,a+1}) P_{0i}\{L_i(x_i) = t_{i,a+1}\}$$

Hence, by the Neyman–Pearson lemma for randomized tests [19], we know that $p^* \geq p$. Then a necessary condition for Eq. (13) not to hold is $\rho^* < \rho$. In that case, the left side of Eq. (13) is decreasing in $\gamma$ and
$$p^* - p - \gamma(\rho - \rho^*) \geq p^* + \rho^* - (p + \rho) \tag{14}$$
But
$$p^* + \rho^* = P_{1i}\{L_i(x_i) > t_{i,a+1}\} + c_{i,a+1} P_{1i}\{L_i(x_i) = t_{i,a+1}\}$$


and, by Eq. (9),
$$\int_{\Omega_i} \sum_{j > a} \phi_{i,j}\, dP_{0i}(x_i) = P_{0i}\{L_i(x_i) > t_{i,a+1}\} + c_{i,a+1} P_{0i}\{L_i(x_i) = t_{i,a+1}\}$$
so by the Neyman–Pearson lemma, the right side of Eq. (14) is non-negative and Eq. (13) holds. □

3.2. Decentralized Bayes detection

We now turn to the optimal decentralized Bayes test. Let $\pi_0$ and $\pi_1$ be the prior probabilities of $H_0$ and $H_1$, respectively. Also let $C_{mn}$ be the cost assigned to choosing hypothesis $H_m$ when $H_n$ is the true hypothesis. The average or Bayes cost of the detection system is then
$$C = \sum_{n=0}^{1} \pi_n C_{0n} + \sum_{n=0}^{1} \pi_n (C_{1n} - C_{0n}) P_n(D_1)$$
where $D_1$ is the fusion center decision region for $H_1$. The objective of Bayesian detection is to minimize this cost. If, given a true hypothesis, the cost of making an error is greater than the cost of making the correct decision (as it logically should be), then the optimal fusion rule for the Bayes test is given by the following:

$$\phi(u) = \begin{cases} 1 & \text{if } L(u) > \tau \\ 0 & \text{if } L(u) < \tau \end{cases} \tag{15}$$
where $\phi(u)$ is the probability of choosing $H_1$ given that $U = u$. The optimal threshold is
$$\tau = \frac{\pi_0 (C_{10} - C_{00})}{\pi_1 (C_{01} - C_{11})} \tag{16}$$
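Eq. (16) is immediate to compute. A minimal sketch (the cost matrix layout is our own convention: C[m][n] is the cost of choosing $H_m$ when $H_n$ is true):

```python
def bayes_threshold(pi0, pi1, C):
    """Optimal Bayes fusion threshold of Eq. (16).
    C[m][n] is the cost of deciding H_m when H_n is actually true."""
    return (pi0 * (C[1][0] - C[0][0])) / (pi1 * (C[0][1] - C[1][1]))

# Uniform priors with 0-1 costs give tau = 1 (minimum probability of error);
# a prior weighted toward H0 raises the threshold.
tau_uniform = bayes_threshold(0.5, 0.5, [[0.0, 1.0], [1.0, 0.0]])
tau_skewed = bayes_threshold(0.8, 0.2, [[0.0, 1.0], [1.0, 0.0]])
```

As the text notes, the condition $C_{10} > C_{00}$ and $C_{01} > C_{11}$ (errors cost more than correct decisions) keeps the threshold positive.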

Note the similarity between Eqs. (2) and (15). Having specified the fusion rule, we now turn to the sensor quantizers.

Theorem 2. The sensor decision rules for an optimal Bayes detector are likelihood ratio partitions of the sensor observation spaces.

Proof. The Bayes test uses, for each set of priors $(\pi_0, \pi_1)$, a particular level and the maximum power that can be achieved for a test at that level. Eq. (2) gives the maximum power test for the level achieved by the threshold–randomization pair $\gamma$ and $\tau$. For Bayes tests, we specify that $\gamma = 0$ and $\tau$ is as given by Eq. (16). Hence, by application of Theorem 1, the theorem is proved. □

We have therefore shown that the optimal decentralized detector under the Neyman–Pearson and Bayes criteria uses a quantized version of the local likelihood ratio as the sensor decision rule. Hence, in the design of such a system, it is necessary only to partition the likelihood ratio rather than the raw data. This is intuitively satisfying since the likelihood ratio is a sufficient statistic for the detection problem.


4. Mutual information

Mutual information can be thought of as the amount by which observation of a random variable, $Y$, reduces the uncertainty about the value of a related (dependent) random variable, $X$. The concept is relatively simple: the uncertainty about $X$ before $Y$ is observed is measured by its entropy $H(X)$, whereas the uncertainty after observation is the conditional entropy $H(X \mid Y)$. Since mutual information is the difference $H(X) - H(X \mid Y)$, it is a measure of the amount of uncertainty resolved by observing $Y$. In the detection problem, a heuristic goal may be stated as the elimination of as much entropy as possible; that is, if $X$ is the hypothesis and $Y$ is the decision, we would like the value of $Y$ to limit the uncertainty about $X$ to the greatest possible extent. This perspective is intuitively appealing (although only applicable when the a priori probabilities of $H_0$ and $H_1$ are known); the detector designed to maximize mutual information will produce decisions that minimize the remaining uncertainty about the true hypothesis. Since the purpose of detection is to do precisely this, mutual information is a reasonable criterion for evaluating detector performance.

In [15], the relationship between test power maximization and the mutual information between hypothesis and decision was studied. It was found that for any value of the false alarm rate, the mutual information increases as the power of the test increases. We summarize this result in the following lemma.

Lemma 4. For a given pair of prior probabilities for hypotheses $H_0$ and $H_1$, the unbiased test that maximizes the mutual information between the random state of nature and the detector decision is a likelihood ratio test.

Proof. In [15], the lemma was proven by differentiating $I(U_0; H)$ (the mutual information between the hypothesis $H$ and the detector decision $U_0$) with respect to test power. It was shown that if the test level and the prior probabilities of the hypotheses were held constant, then this derivative was positive. Because of its later usefulness, we use a different technique here to prove the lemma. Define a transition matrix $Q$ with elements $Q_{i+1,j+1} = \Pr(U_0 = j \mid H_i)$, $i = 0, 1$, $j = 0, 1$. Suppose that we initially make decisions purely by chance so that the probability that $U_0 = 1$ is $\alpha$ regardless of the actual hypothesis. The transition matrix for this test is
$$Q_r = \begin{bmatrix} 1 - \alpha & \alpha \\ 1 - \alpha & \alpha \end{bmatrix}$$
Let $\bar{\pi} = (\pi_0, \pi_1)$ be the vector of priors ($\pi_i = \Pr(H_i)$) and, for notational convenience, let $I(\bar{\pi}; Q) = I(U_0; H)$. Then it is easily shown that $I(\bar{\pi}; Q_r) = 0$. This should be readily apparent since the test itself has added nothing to our knowledge of $H$; that is, we knew in advance that the chance of deciding for $H_1$ did not depend on whether or not $H_1$ was true. Now suppose that the maximum possible power for a test of level $\alpha$ is $\beta^*$. A test with this power is by definition a likelihood ratio test and


its transition matrix is
$$Q^* = \begin{bmatrix} 1 - \alpha & \alpha \\ 1 - \beta^* & \beta^* \end{bmatrix}$$
For an unbiased test with power $\beta$ and level $\alpha$ we have $\beta \geq \alpha$. Furthermore, we must have $\beta \leq \beta^*$. For this test the transition matrix can be written as
$$Q = \lambda Q_r + (1 - \lambda) Q^*$$
with $\lambda = (\beta^* - \beta)/(\beta^* - \alpha) \in [0, 1]$ (taking $\lambda = 0$ when $\beta^* = \alpha$). By the convexity of mutual information [16, Theorem 5.2.5], we have
$$I(\bar{\pi}; Q) \leq \lambda I(\bar{\pi}; Q_r) + (1 - \lambda) I(\bar{\pi}; Q^*)$$
or, since $I(\bar{\pi}; Q_r) = 0$,
$$I(\bar{\pi}; Q) \leq (1 - \lambda) I(\bar{\pi}; Q^*)$$
Hence, the test that maximizes mutual information is the maximum power test for some false alarm rate. □
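The convexity argument above is easy to verify numerically. The sketch below (priors, matrices, and $\lambda$ are assumed toy values) computes $I(\bar\pi; Q)$ for a 2×2 transition matrix and checks both $I(\bar\pi; Q_r) = 0$ and the bound $I(\bar\pi; Q) \le (1-\lambda)\, I(\bar\pi; Q^*)$.

```python
from math import log

def mutual_info(pi, Q):
    """I(U0; H) for priors pi = (pi0, pi1) and transition matrix Q,
    where Q[i][j] = Pr(U0 = j | H_i).  Natural-log units (nats)."""
    I = 0.0
    for i in range(2):
        for j in range(2):
            p_j = pi[0] * Q[0][j] + pi[1] * Q[1][j]  # marginal Pr(U0 = j)
            if Q[i][j] > 0.0:
                I += pi[i] * Q[i][j] * log(Q[i][j] / p_j)
    return I

pi = (0.5, 0.5)
alpha, beta_star, lam = 0.1, 0.8, 0.4
Qr = [[1 - alpha, alpha], [1 - alpha, alpha]]             # pure-chance test
Qstar = [[1 - alpha, alpha], [1 - beta_star, beta_star]]  # max-power test
Qmix = [[lam * Qr[i][j] + (1 - lam) * Qstar[i][j] for j in range(2)]
        for i in range(2)]

I_r = mutual_info(pi, Qr)     # 0: the decision is independent of H
I_mix = mutual_info(pi, Qmix)
I_star = mutual_info(pi, Qstar)
```

The mixture's information never exceeds the $(1-\lambda)$-scaled information of the maximum-power test, exactly as the convexity bound predicts.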

As we have seen, the mutual information for a test operating by pure chance is zero. It is well known that the mutual information between any two random variables is at least zero, and must be strictly positive if the variables are dependent. For the transition matrix $Q_r$ the decision $U_0$ is made without reference to any observable characteristic of $H$; that is, the test is completely randomized. With respect to the tests specified by Eqs (1) and (2), the question then arises as to whether partial randomization (at the threshold only) reduces the mutual information. The following lemma addresses this question.

Lemma 5. The maximum mutual information test is a likelihood ratio test with no randomization at the threshold.

Proof. Suppose that $U_0$ is randomized with probability $q_i$ under hypothesis $H_i$. Then $q_i$ is the probability that the likelihood ratio equals the threshold when $H_i$ is the true hypothesis. Further assume that $U_0$ is chosen by a likelihood ratio test with randomization parameter $c$, and that the probabilities of exceeding the threshold are $\alpha$ and $\beta$ under $H_0$ and $H_1$, respectively. Since the test will choose $H_1$ with probability $c$ when randomization occurs, the transition matrix $Q_c$ may be written as

$$Q_c = \begin{pmatrix} 1-\alpha-cq_0 & \alpha+cq_0 \\ 1-\beta-cq_1 & \beta+cq_1 \end{pmatrix}$$

Then for $c = 0$ we have

$$Q_0 = \begin{pmatrix} 1-\alpha & \alpha \\ 1-\beta & \beta \end{pmatrix}$$

and for $c = 1$ we have

$$Q_1 = \begin{pmatrix} 1-\alpha-q_0 & \alpha+q_0 \\ 1-\beta-q_1 & \beta+q_1 \end{pmatrix}$$

We can then write

$$Q_c = cQ_1 + (1-c)Q_0$$

As in the proof of Lemma 4, we can make use of the fact that mutual information is convex in the transition matrix and conclude

$$I(\bar{\pi}; Q_c) \leq cI(\bar{\pi}; Q_1) + (1-c)I(\bar{\pi}; Q_0)$$

Hence, mutual information is maximized for either $c = 0$ or $c = 1$. For these values of $c$ no randomization occurs. □

From Lemma 4 it is apparent that the optimal fusion rule will be as specified by Eq. (2); that is, the maximum mutual information fusion rule will be a likelihood ratio test. Lemma 5 specifies that this test use no randomization in the fusion rule. The intuitive interpretation of the latter result is that more information can be gleaned from the observed data if the decision is based solely on the data. The use of randomization in the fusion rule assigns a portion of the decision process to chance; hence, we find that mutual information is reduced. This result may be contrasted with the use of randomization in optimal Neyman–Pearson tests. The Neyman–Pearson lemma specifies a false alarm rate for the test, and probability is added to the decision region for $H_1$ by ranking points in the observation space by their likelihood ratio. Randomization occurs when the addition of all of the points with the same likelihood ratio would increase the false alarm rate beyond the specified value. Our result on mutual information restricts the false alarm rate for the maximum mutual information test to those values that require no randomization; that is, if there are points assigned to $H_0$ and $H_1$ that have the same likelihood ratio, then under the mutual information criterion the test cannot be optimal. Since the statistic of interest is the likelihood ratio, it seems reasonable that a test that makes different decisions for two points with the same likelihood ratio will be suboptimal.

Theorem 3. The maximum mutual information test uses likelihood ratio partitions for the sensor decision rules.

Proof. From Lemma 4, the maximum mutual information test is the maximum power test at some unspecified level. From Theorem 4, we know that the maximum power test for a given false alarm rate uses likelihood ratio partitions at the sensors. Hence, the maximum mutual information test uses likelihood ratio partitioning for the sensor decision rules. □

Summarizing the foregoing results, we have: (1) the fusion rule for a maximum mutual information test is a non-randomized likelihood ratio test on the sensor output vector $U$, and (2) the sensor decision rules for a maximum mutual information test are likelihood ratio partitions. The false alarm rate for such a test will vary according to the $H_0$ and $H_1$ probability distributions of the $L_i(X_i)$ and the prior probabilities of the hypotheses. Finally, we note that since $U_0$ is a deterministic function of the vector $U$, we can also consider maximization of $I(U; H)$. A discussion of this quantity and its relation to $I(U_0; H)$ may be found in Section 6.

5. Ali–Silvey distances

It is often difficult to calculate the error probabilities for optimal systems; as a result, the literature includes numerous examples of metrics that have been used as detector performance measures. These metrics include asymptotic relative efficiency, Chernoff bounds, rate distortion, mutual information, signal-to-noise ratio, and distributional distance. As a result of either convexity or asymptotic normality, performance evaluation and system design may be significantly easier using these suboptimal measures than if error performance is used. Among the measures of distributional difference that are frequently cited are the members of the class of Ali–Silvey distances [20, 21]. This class includes the J-divergence, the Bhattacharyya distance (closely related to Chernoff bounds), and the Matusita distance, among others. An Ali–Silvey distance that is frequently encountered in the detection literature (because of its relationship to signal-to-noise ratio) is the likelihood ratio variance [22]. This "distance" is the signal-to-noise ratio for a likelihood ratio detector. Applications of Ali–Silvey distances to signal selection are discussed in [23, 24].

In detection, Ali–Silvey distances are used in the following manner. The distance between the $H_0$ and $H_1$ probability distributions is maximized at the input to the detector. This may involve the selection of signals to be used in the system (in digital communications and active radar and sonar) or processing performed at the receiver prior to the application of the signal to the decision device. The detector then uses a likelihood ratio test to make the decision. It is assumed that the relationship between error probability criteria and Ali–Silvey distance is monotonic; see [23, 24] for a more detailed discussion of this relationship.

In the context of decentralized detection, we are concerned with maximization of the distance between the respective distributions of $U$, the vector of inputs to the fusion center. Hence, in keeping with the previously mentioned procedure, the fusion rule is a likelihood ratio test on $U$. The distributional distance is maximized through the selection of appropriate sensor decision rules. In [25], Poor and Thomas discuss the quantization of data for centralized detectors. The quantizers they discuss are of the form

$$Q(x) = k, \quad \text{if } x \in (t_k, t_{k+1}], \qquad k = 0, 1, \ldots, M-1$$ (17)

where $Q(x)$ and $x$ are the output of and input to the quantizer, respectively. It is shown that the quantizer that maximizes Ali–Silvey distance is found by choosing the breakpoints for the level intervals $(t_0, t_1, \ldots, t_M)$ through use of the likelihood ratio. A slight (and easily corrected) problem with this work is that more levels than necessary must be used if the likelihood ratio is not monotonic in $x$. To avoid this problem in our discussion, we use a slightly different form for the quantizer. In notation analogous to that of Eq. (17), we use quantizers of the form

$$Q(x) = k, \quad \text{if } x \in S_k, \qquad k \in I_{M-1}$$ (18)

This less constrained quantizer format allows for more efficient use of the limited bits available.
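To illustrate why the form of Eq. (18) matters when the likelihood ratio is not monotonic, here is a small Python sketch (ours; the likelihood ratio $L(x) = x^2$ is hypothetical). The quantizer cells are intervals on the likelihood-ratio axis, but the induced sets $S_k$ in the observation space need not be intervals:

```python
import bisect

def lr_partition(L, thresholds):
    """Likelihood-ratio partition quantizer, in the spirit of Eq. (18):
    x is mapped to the number of thresholds strictly below L(x), so level k
    corresponds to L(x) in the cell (t_{k-1}, t_k] of the LR axis."""
    def Q(x):
        return bisect.bisect_left(thresholds, L(x))
    return Q

# hypothetical non-monotone likelihood ratio, symmetric about x = 0
L = lambda x: x * x
Q = lr_partition(L, thresholds=[1.0, 4.0])   # M = 3 levels

# x = -3 and x = 3 share a level: S_2 is a union of two disjoint intervals,
# which the interval quantizer of Eq. (17) could not express with M = 3.
print(Q(-3.0), Q(-0.5), Q(0.5), Q(3.0))      # 2 0 0 2
```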

The general form of an arbitrary Ali–Silvey distance between the distributions $P_0$ and $P_1$ is given by

$$d_0(P_0, P_1) = f\left( \int C[L(x)] \, dP_0(x) \right)$$ (19)

where $f$ is an increasing function, $C$ is convex, and $L = dP_1/dP_0$ is the likelihood ratio. For a decentralized detection system with $N$ sensors, Eq. (19) can be rewritten as

$$d_0 = f\left( \sum_{u_1=0}^{M_1-1} \sum_{u_2=0}^{M_2-1} \cdots \sum_{u_N=0}^{M_N-1} C[L(u)] \, P(u \mid H_0) \right)$$ (20)

where $L(u)$ is given by Eq. (3) and $P(u \mid H_0) = \prod_{n=1}^{N} a_{n u_n}$ is the $H_0$ probability of the vector $u$. Using the notation of Sections 2 and 3, Eq. (20) becomes

$$d_0 = f\left( \sum_{u^i} \sum_{j=0}^{M_i-1} C[L(u_i = j) L(u^i)] \, a_{ij} \, P(u^i \mid H_0) \right)$$ (21)

where the outer sum is over $u^i$, the vector of sensor outputs excluding the $i$th. In the sequel, we will omit $f$, since it is irrelevant to the discussion; that is, maximizing $d = E_0 C[L(U)]$ is equivalent to maximizing $d_0$. Furthermore, we confine the class to those distances for which $C$ is differentiable. We now state and prove the following theorem.

Theorem 4. The sensor decision rules that maximize the Ali–Silvey distance between the distributions of $U$ under $H_0$ and $H_1$ are likelihood ratio partitions.

Proof. Without loss of generality, we assume that the levels of an arbitrary decision rule at the $i$th sensor are ordered as in Eq. (8). Suppose that under this arbitrary decision rule $\Pr(U_i = n \mid X_i = x_i) = \phi_{i,n}(x_i)$. If $\{\phi_{i,n}, n \in I_{M_i-1}\}$ is not equivalent to a likelihood ratio partition, there must be sets $S_j$ and $S_k$, $k < j$, of non-zero $P_{0i}$ measure such that for each $x_1 \in S_j$ and each $x_2 \in S_k$, $\phi_{i,j}(x_1) > 0$, $\phi_{i,k}(x_2) > 0$, and $L_i(x_2) > L_i(x_1)$. Suppose further, again without loss of generality, that

$$\int_{S_k} \phi_{i,k}(x_i) \, dP_{0i}(x_i) = \int_{S_j} \phi_{i,j}(x_i) \, dP_{0i}(x_i) = \epsilon$$

and

$$\int_{S_k} \phi_{i,k}(x_i) \, dP_{1i}(x_i) - \int_{S_j} \phi_{i,j}(x_i) \, dP_{1i}(x_i) = \delta$$

From the construction thus far, both $\epsilon$ and $\delta$ will be greater than zero. We now change the partition in the following manner. Let

$$\phi^{(1)}_{i,j}(x_i) = \begin{cases} \phi_{i,j}(x_i) + \phi_{i,k}(x_i) & \text{if } x_i \in S_k \\ 0 & \text{if } x_i \in S_j \\ \phi_{i,j}(x_i) & \text{otherwise} \end{cases}$$

$$\phi^{(1)}_{i,k}(x_i) = \begin{cases} 0 & \text{if } x_i \in S_k \\ \phi_{i,j}(x_i) + \phi_{i,k}(x_i) & \text{if } x_i \in S_j \\ \phi_{i,k}(x_i) & \text{otherwise} \end{cases}$$

and

$$\phi^{(1)}_{i,n}(x_i) = \phi_{i,n}(x_i) \quad \text{for } n \neq j, \; n \neq k$$

For the new partition all $a_{in}$ are unchanged, $b_{ij}$ increases by $\delta$, $b_{ik}$ decreases by $\delta$, and the remaining $b_{in}$ are unchanged. Then $\Delta d$, the resulting change in the distance $d$, is

$$\Delta d = \sum_{u^i} P_0(u^i) \left[ C\!\left( \frac{b_{ij}+\delta}{a_{ij}} L(u^i) \right) a_{ij} + C\!\left( \frac{b_{ik}-\delta}{a_{ik}} L(u^i) \right) a_{ik} \right] - \sum_{u^i} P_0(u^i) \left[ C\!\left( \frac{b_{ij}}{a_{ij}} L(u^i) \right) a_{ij} + C\!\left( \frac{b_{ik}}{a_{ik}} L(u^i) \right) a_{ik} \right]$$ (22)

where we have adopted the notation $P_0(u^i) = \Pr(u^i \mid H_0)$. By the mean value theorem and the differentiability of $C$ we have

$$C\!\left( \frac{b_{ij}+\delta}{a_{ij}} L(u^i) \right) = C\!\left( \frac{b_{ij}}{a_{ij}} L(u^i) \right) + \delta\, C'\!\left( \frac{b_{ij}+\delta_j}{a_{ij}} L(u^i) \right) \frac{L(u^i)}{a_{ij}}$$ (23)

and

$$C\!\left( \frac{b_{ik}-\delta}{a_{ik}} L(u^i) \right) = C\!\left( \frac{b_{ik}}{a_{ik}} L(u^i) \right) - \delta\, C'\!\left( \frac{b_{ik}-\delta_k}{a_{ik}} L(u^i) \right) \frac{L(u^i)}{a_{ik}}$$ (24)

where $0 \leq \delta_n(u^i) \leq \delta$, $n = j, k$. The difference in $d$ due to the arbitrary vector $u^i$, after substituting Eqs (23) and (24) into Eq. (22), is

$$\Delta d(u^i) = \delta\, P_1(u^i) \left[ C'\!\left( \frac{b_{ij}+\delta_j(u^i)}{a_{ij}} L(u^i) \right) - C'\!\left( \frac{b_{ik}-\delta_k(u^i)}{a_{ik}} L(u^i) \right) \right]$$ (25)

Since $C$ is convex, $\Delta d(u^i)$ is non-negative if

$$\frac{b_{ij}+\delta_j(u^i)}{a_{ij}} \geq \frac{b_{ik}-\delta_k(u^i)}{a_{ik}}$$ (26)

Recall that, by construction, $b_{in}/a_{in} \geq b_{im}/a_{im}$ for $n > m$. Since $j > k$, $\delta_j(u^i) \geq 0$, and $\delta_k(u^i) \geq 0$, Eq. (26) holds. The partition can be further modified until, for any non-trivial sets $R_j$ and $R_k$ with $\phi_{i,j}(x_1) > 0$ for $x_1 \in R_j$ and $\phi_{i,k}(x_2) > 0$ for $x_2 \in R_k$, we have $P(R_j \mid H_1)/P(R_j \mid H_0) \geq P(R_k \mid H_1)/P(R_k \mid H_0)$. The levels can then be reordered to conform with Eq. (8) and the process repeated for some other pair $j_1$ and $k_1$. Since $j$ and $k$ in the foregoing were chosen arbitrarily, the distance $d$ will be maximized when the sensor decision rules are given by Eq. (7). □

For the J-divergence, $C[L(u)] = [L(u) - 1] \log L(u)$. From the conditional independence of the sensor data, the J-divergence can be written as $J = \sum_{i=1}^{N} [E_1 \log L(u_i) - E_0 \log L(u_i)]$, or $J = \sum_{i=1}^{N} J(u_i)$, where $J(u_i)$ is the J-divergence of the $H_1$ and $H_0$ distributions at the output of the $i$th sensor. In this case (and in the case of the Bhattacharyya distance $B$, where $B = -\log E_0 \sqrt{L(u)}$), the distance at the output of the sensors can be individually maximized in order to maximize the overall distance. In general, if the sensor inputs are conditionally independent and $C[L(u)]$ can be written as either $\sum_{i=1}^{N} C[L(u_i)]$ or $\prod_{i=1}^{N} C[L(u_i)]$, then the optimal system can be found by maximizing the distances at the individual sensor outputs. Furthermore, it is apparent that if the sensor inputs are also identically distributed, then all of the optimal sensor decision rules are the same. This fact can significantly reduce the amount of work required to determine the thresholds for the $N$ quantizers. It is shown in [25] that if $C$ has a continuous second derivative, then $t_{i,j}$ will depend on $L(u_i = j)$, $L(u_i = j-1)$, $C$, and $C'$. In our notation this threshold is given by

$$t_{i,j} = \frac{ [L_{i,j} C'(L_{i,j}) - C(L_{i,j})] - [L_{i,j-1} C'(L_{i,j-1}) - C(L_{i,j-1})] }{ C'(L_{i,j}) - C'(L_{i,j-1}) }$$

where $L_{i,n} = L(u_i = n)$.

6. Discussion and examples

It is intuitively appealing that an optimal sensor mapping strategy should be a quantization of the minimal sufficient statistic for detection, the likelihood ratio. Although the result appears to have been widely assumed, up to now it has been proven only for a number of special cases. We mention several here, the first two of which have been cited previously:

• For the Bayes criterion with single-bit channels ($M_i = 2$), Reibman and Nolte have shown that sensors perform local likelihood ratio tests [3].

• Poor and Thomas have derived Ali–Silvey distance optimal thresholds for the case in which the local likelihood ratios are monotonically increasing functions of the data.

• For the "weak-signal" case, Kassam [26] has demonstrated that the sensor mapping function that maximizes the efficacy (or asymptotic signal-to-noise ratio) is a quantization of the differential log-likelihood ratio.

• For discrimination $E_{H_0}[\log(P_1(u)/P_0(u))]$ (an Ali–Silvey distance measure closely related to Shannon information) and discrete-valued sensor data, it can be inferred from the refinement lemma [16, Theorem 4.2.3] that the optimal sensor mapping is a quantization of the local likelihood ratio.

• For the Chernoff function $-\log\{E_{H_0}[(P_1(u)/P_0(u))^s]\}$, $0 < s < 1$, it has been shown in [27] that the optimal mapping is a quantization of the local likelihood ratio. Note that the Bhattacharyya distance is a special case ($s = \tfrac{1}{2}$) of the Chernoff function.

Which measure is most appropriate is largely a function of the application. If prior probabilities for the hypotheses are known, then Shannon information may be most appealing; if, in addition, error costs can be assumed, then a Bayes criterion should be used. Without prior probabilities the Neyman–Pearson criterion is most natural; however, the resulting scheme is in general optimal only at a particular false-alarm rate, and its derivation can require extensive computation. The Ali–Silvey distance criteria, which utilize the sensor outputs $U$ rather than the final decision $U_0$, generally work well over a broad range of false-alarm rates; unfortunately, there is no guarantee of any meaningful optimality. We present the following examples for clarification.

Example 1. Suppose we are to design a constant false-alarm rate (CFAR) system for Swerling II targets. The system is constrained to have three sensors ($N = 3$); each sensor can send one of three possible messages ($M_i = 3$, $i = 1, 2, 3$) to the fusion center. A cell-averaging (CA) scheme on 20 homogeneous reference cells $\{\{Y_{ij}\}_{j=1}^{20}\}_{i=1}^{3}$ is used to estimate the ambient noise level. For each $i$ and $j$, $Y_{ij}$ is exponentially distributed with parameter $\lambda$; that is, the density function of the reference value is

$$f(x) = \begin{cases} \lambda e^{-\lambda x} & x \geq 0 \\ 0 & \text{else} \end{cases}$$

The output of the test cell $Z_i$ is exponentially distributed with parameter $\lambda/2$ under $H_1$, and with parameter $\lambda$ under $H_0$. All random variables are assumed to be independent. The sensors quantize the normalized data $\{X_i\}_{i=1}^{3}$, where

$$X_i = \frac{Z_i}{(1/20) \sum_{j=1}^{20} Y_{ij}}$$

It can be shown that the $X_i$ are independent and identically distributed (i.i.d.) with density

$$f_X(x) = \frac{\theta}{(1 + \theta x/20)^{21}}$$

where $\theta = 1$ under $H_0$ and $\theta = 0.5$ under $H_1$. Since the local likelihood ratios

$$L_i(X_i) = \frac{1}{2} \left( \frac{1 + X_i/20}{1 + X_i/40} \right)^{21}$$

are monotone-increasing functions of $X_i$, quantizing the data directly is equivalent to a likelihood-ratio partition.

As we discuss later, it is not always true that the optimal sensor mappings are identical likelihood ratio partitions. Obviously, this will not in general be the case if the channel capacity constraints differ between sensors. However, it still may not be true even if all of the sensors observe i.i.d. random processes and face identical communications constraints (i.e. $M_i = M$). However, in the case of Example 1, the partitions are in fact identical.

Fig. 1. Plot of Neyman–Pearson optimal thresholds versus $\alpha$ for Example 1 (CA-CFAR).

The Neyman–Pearson optimal thresholds are plotted against $\alpha$ in Fig. 1. Note the discontinuity at $\alpha \approx 0.7\%$. This can be seen more clearly in Fig. 2, where we have expanded the $\alpha$ axis for $\alpha < 0.01$. The discontinuity is due to a transition in the optimal fusion rule: for $\alpha > 0.007$, $D_1 = \{(2,3,3),\, (3,2,3),\, (3,3,2),\, (3,3,3)\}$; for $\alpha < 0.007$, $D_1$ is any permutation of $\{(1,3,3),\, (2,2,3),\, (2,3,3),\, (3,3,3)\}$. Here $(u_1, u_2, u_3)$ represents the outputs of the three sensors, and each output is assumed to be ordered as in Eq. (8). As is true for many quantized detection systems, there is no fusion rule that is optimal for all $\alpha$.
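The monotonicity of the local likelihood ratio in Example 1 is easy to confirm numerically; a short Python sketch (ours) evaluates $L_i(x)$ on a grid:

```python
def lr(x):
    """Local likelihood ratio of Example 1 (CA-CFAR, normalized data):
    L(x) = (1/2) * ((1 + x/20) / (1 + x/40)) ** 21."""
    return 0.5 * ((1.0 + x / 20.0) / (1.0 + x / 40.0)) ** 21

vals = [lr(0.5 * i) for i in range(41)]            # grid on [0, 20]
print(all(b > a for a, b in zip(vals, vals[1:])))  # True: monotone increasing
```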

Optimal thresholds under a number of other criteria are shown in Table 1. In all cases the thresholds are the same at each of the sensors. Receiver operating characteristics (ROCs) for detectors optimized for the Neyman–Pearson, efficacy, and Bhattacharyya distance criteria are shown in Fig. 3. Because optimization for the latter two criteria results in only one set of thresholds for each sensor (i.e. the thresholds do not change as a function of false alarm rate), the resulting fusion rule will use randomization and the detector ROC curves are piecewise linear. In Fig. 4 we show the ROC curve for a quantizer optimized for a fixed false alarm rate (10%). Note that here again fixed thresholds result in randomization and a piecewise linear ROC.

Fig. 2. Plot of Neyman–Pearson optimal thresholds versus $\alpha$ for Example 1. This is an expanded version of the low-$\alpha$ behavior in Fig. 1.

Table 1. Optimal thresholds (Example 1)

Criterion        $t_1$    $t_2$
Bhattacharyya    1.33     3.65
J-divergence     1.46     4.05
Efficacy         0.58     1.60

Fig. 3. Performance comparison of thresholds optimized under Neyman–Pearson, efficacy, and Bhattacharyya measures, in the situation of Example 1. Note that the NP-optimal thresholds vary with $\alpha$, but those under the other two criteria cannot; hence the piecewise-linear ROCs. The vertical lines denote the region of this ROC obtainable by maximizing mutual information; see Fig. 4.

Fig. 4. Performance comparison of thresholds optimized under the Neyman–Pearson criterion in the situation of Example 1. The "optimal" thresholds vary with $\alpha$, but those optimized for a fixed $\alpha = 10\%$ naturally do not.

Figure 5 shows the values of $\alpha$ and $\beta$, as a function of $\pi_0$ ($= \Pr(H_0)$), achieved with the quantizer that maximizes the information measure $I(U_0; H)$. By comparing this to Figs 3–5, we see that any information-optimal system (for $0 < \pi_0 < 1$) is identical to a Neyman–Pearson optimal scheme with $0.1 < \alpha < 0.26$ (approximately). This is consistent with Theorem 3; however, we also note that maximum mutual information is achieved for some test level in the interval (0.1, 0.26) for all values of $\pi_0$. Thus, the mapping between $\pi_0$ and the information-maximizing false alarm rate need not be one-to-one.

The thresholds used by the information-optimal system are plotted in Fig. 6. Here "global" refers to $I(U_0; H)$, the mutual information between the hypothesis and the ultimate decision. This measure was discussed in Section 4. A system optimized with respect to

$$I(U; H) = \sum_{i=1}^{N} I(U_i; H)$$ (27)

is referred to as "local"; this is the information between the vector of sensor outputs and the hypothesis random variable. In Fig. 7, we show the maximum local and global information that can be achieved. Since $U_0$ is the result of combining the elements of $U$, it is not surprising that the local information is greater than the global information. This can also be deduced from the following equation:

$$I(U_0; H) = I(U; H) - I(U; H \mid U_0)$$

The last term can be interpreted as "additional information about $H$ contained in $U$ when $U_0$ is already known", and can be considered as a measure of the efficiency with which information in $U$ is absorbed into $U_0$; in a Neyman–Pearson framework this information is useless. It should be noted that for local information, as in Eq. (27), if the sensor observations are i.i.d. and the communication channel constraints are identical, the sensor mapping rules will all be the same. As we demonstrate in Example 4 below, this is not always the case for global mutual information. We also note in passing that we can write

$$J = \left( \frac{d}{d\pi_0} [I(U; H)] \Big|_{\pi_0 = 0} \right) - \left( \frac{d}{d\pi_0} [I(U; H)] \Big|_{\pi_0 = 1} \right)$$

where $J$ is the J-divergence. Hence, the thresholds chosen to maximize the J-divergence will also maximize the difference between the slopes of the mutual information versus prior probability curve at its endpoints.

Example 2. Suppose we have ten sensors ($N = 10$) that take one observation each. The observations are i.i.d. realizations of a unit Cauchy random variable with density

$$f(x) = \frac{1}{\pi (1 + (x - \theta)^2)}$$

Under the noise-only hypothesis $H_0$, $\theta = 0$; under $H_1$, $\theta = 2$. Each of the sensors uses a five-level quantizer.

Fig. 5. Information optimal $\alpha$ and $\beta$ for Example 1.

In Fig. 8, we show the log-likelihood ratio, the quantizer that is optimal for J-divergence, and the Neyman–Pearson optimal quantizer for a false-alarm rate of 1%. The resulting probabilities of detection (for $\alpha = 0.01$) are $\beta = 0.907$ and $\beta = 0.925$, respectively. Note that in both cases the sensor mappings are likelihood ratio quantizations.
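For the Cauchy shift of Example 2 the likelihood ratio works out to $L(x) = (1 + x^2)/(1 + (x-2)^2)$, which is not monotone in $x$ (it tends to 1 as $x \to \pm\infty$). A sketch (ours) of why the optimal cells can therefore be unions of disjoint intervals of the data:

```python
def lr(x):
    """Likelihood ratio for unit Cauchy, location 0 (H0) versus 2 (H1):
    the 1/pi normalizations cancel, leaving (1 + x^2) / (1 + (x-2)^2)."""
    return (1.0 + x * x) / (1.0 + (x - 2.0) ** 2)

# Non-monotone: L dips below 1 near x = 0 but approaches 1 again for large
# negative x, so one likelihood-ratio cell can cover disjoint x-intervals.
print(lr(-10.0) > lr(0.0))   # True
print(lr(2.0))               # 5.0
```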

Example 3. Suppose we have ten sensors ($N = 10$) that are capable of transmitting one of five possible messages ($M_i = 5$) each. With $Y_i$ representing the observation at the $i$th sensor, the detection problem can be written as

$$H_0: Y_i = (1 - Z) N_{1i} + Z N_{2i}$$
$$H_1: Y_i = S_i + (1 - Z) N_{1i} + Z N_{2i}$$

Here $\{S_i\}_{i=1}^{10}$, $\{N_{1i}\}_{i=1}^{10}$, and $\{N_{2i}\}_{i=1}^{10}$ are all independent, zero-mean, and Gaussian, with $E(S_i^2) = E(N_{1i}^2) = 1$ and $E(N_{2i}^2) = 10$. The binary random variable $Z$ takes on values 0 and 1 with respective probabilities 90% and 10%, and has the same value at all of the sensors.

This example is used to demonstrate the effect of dependence among the sensors. Without the high-power noise process, the model conforms to the unknown (Gaussian) signal in Gaussian noise problem. When the $N_{2i}$ process is included, we have added the possibility of jamming: there is a 10% probability that all of the sensors will be jammed, and a 90% probability of no jamming.

Fig. 6. Information optimal thresholds for Example 1. This should be compared to Fig. 1 for the values of $\alpha$ available in Fig. 5. As indicated in the text, "local" refers to optimization of the information between the hypothesis and the outputs of the quantizers, while "global" is between the hypothesis and the decision.

For each sensor the statistic $X_i = Y_i^2$ is sufficient. Since $Y_i$ is always Gaussian, $X_i$, conditioned on the hypothesis and the presence or absence of jamming, will have a Rayleigh density given by

$$f_R(x; \sigma^2) = \begin{cases} \dfrac{x}{\sigma^2} e^{-x^2/2\sigma^2} & x \geq 0 \\ 0 & \text{else} \end{cases}$$

Here, $\sigma^2$ will depend on both the hypothesis and the value of $Z$. Hence, the univariate density of $X_i$ is

$$f(x_i) = (1 - \epsilon) f_R(x_i; 1 + \theta) + \epsilon f_R(x_i; 11 + \theta)$$ (28)

and the joint density of the observations is

$$f(x_1, x_2, \ldots, x_{10}) = (1 - \epsilon) \prod_{i=1}^{10} f_R(x_i; 1 + \theta) + \epsilon \prod_{i=1}^{10} f_R(x_i; 11 + \theta)$$

Here $\epsilon = 0.1$, while $\theta = 0$ under $H_0$ and $\theta = 1$ under $H_1$.

Fig. 7. Maximized information for Example 1, where the optimized thresholds are in Fig. 6, and where "local" and "global" are there defined.

We consider two cases of quantizers optimal under the Bhattacharyya criterion. First, we maximize the Bhattacharyya distance of $U_i$, the (quantized) output of each sensor; this is equivalent to the assumption that the sensor observations are independent with the density given by Eq. (28). Second, we maximize the Bhattacharyya distance for $U$; that is, for the ensemble input to the fusion center. In Fig. 9, we plot the logarithm of the local likelihood ratio and the quantizer that results from the first maximization, and we indicate the thresholds that are used to quantize the data for the latter maximization. As expected, the independence assumption results in a quantization of the local likelihood ratio as specified by Eq. (28). The fact that inclusion of the dependence results in a quantization of the data $X_i$ is not surprising; a large value of $X_i$ is indicative of jamming at the $i$th sensor, which in turn implies jamming at all sensors. Note that due to dependence, the quantizer levels in the second case are meaningless and are not shown. The ROC curves under both assumptions are shown in Fig. 10. Since the dependence assumption more accurately models the problem, the resulting ROC dominates that found when independence is assumed.

Fig. 8. Quantizer optimized under Neyman–Pearson ($\alpha = 0.01$) and J-divergence, for the additive signal in Cauchy noise situation of Example 2.

Example 4. Suppose that three sensors ($N = 3$) observe mutually independent data and that the likelihood ratio of the data has density

$$f_{H_0}(x) = \begin{cases} \dfrac{c}{1 + [k(x - \frac{1}{2})]^2} + \dfrac{c}{1 + [k(x - \frac{3}{2})]^2} & 0 \leq x \leq 2 \\ 0 & \text{else} \end{cases}$$

under $H_0$, and $x f_{H_0}(x)$ under $H_1$. The constant $c$ is chosen to normalize the density, and $k = 100$. Each sensor uses binary quantization ($M_i = 2$).

This example is discussed in [28, 29]. It can be shown that if identical thresholds are used at all of the sensors, the optimal test (given the identical-threshold constraint) will require a randomized fusion rule. Such tests were posited to be sub-optimal in [29], although an excellent recent article [30] indicates that this is true over a more restricted class than was previously believed. In any case, when the constraint is removed, randomization is no longer necessary. The optimal thresholds for the unconstrained case are shown in Fig. 11 as a function of $\alpha$. In the figure, the partitions labeled AND, OR, and MAJORITY refer to the optimal fusion rule logic for the test levels within the partitions. Note that the three thresholds are not in general equal.

Fig. 9. Quantizers optimized under the Bhattacharyya criterion, for the dependent situation of Example 3. Under an (incorrect) assumption of independence the result is a likelihood ratio quantization; correctly assuming dependence results in a direct data quantization.

Intuition suggests that if the sensors are quantizing i.i.d. random variables, then the quantizers used should be identical; this, however, is not always true. It should also be noted that for any prior probabilities, the operating points for a system optimized with respect to either global mutual information or Bayes cost must lie on the ROC curve for Neyman–Pearson optimal detectors. Hence, for i.i.d. sensor observations and identical channel constraints, the optimal detectors for the Bayes and mutual information criteria may use different sensor quantizers. From this somewhat pathological example we draw some conclusions in Section 7.

7. Identical quantization for identical sensors

In the previous Example 4, independent and identically distributed sensor data with local likelihood ratio densities differentiable to arbitrary order give rise, optimally, to different sensor mapping functions [29]. When does this happen, and when can we reasonably ignore the possibility? Certainly it is known to occur when sensor likelihood ratios contain point masses of probability; yet Example 4 does not contain point masses.

Fig. 10. Performance of quantizers for the dependent situation of Example 3. The thresholding scheme and an explanation of terms are in Fig. 9.

Here we provide some answer. It will turn out that there is a threshold in terms of point-mass behavior of the likelihood ratio densities; once these densities are sufficiently "peaky", non-identical sensor quantization functions ought to be used.

7.1. The necessary conditions

With the local decision thresholds equal, it is clear that the fusion rule must be of the "k-out-of-n" variety; that is,

$$D = \begin{cases} H_1 & \sum_{i=1}^{n} U_i \geq k \\ H_0 & \sum_{i=1}^{n} U_i < k \end{cases}$$ (29)

This is a familiar form, particularly for $k = 1$ (the OR rule), for $k = n$ (the AND rule), and for $k = (n+1)/2$ (majority logic). If an optimized system is to use identical local decision thresholds, then certainly optimizing under the constraint of a k-out-of-n fusion rule must again result in identical thresholds. Let us check this.
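With identical sensor thresholds, the error probabilities of the k-out-of-n rule in Eq. (29) are binomial tail sums; a Python sketch (ours, with an arbitrary per-sensor firing probability):

```python
from math import comb

def k_of_n_power(n, k, p):
    """Pr(sum of the U_i >= k) when each of n sensors independently
    declares U_i = 1 with probability p: the detection probability
    (or, with p the per-sensor level, the false-alarm rate) of the
    k-out-of-n fusion rule in Eq. (29)."""
    return sum(comb(n, l) * p**l * (1.0 - p)**(n - l) for l in range(k, n + 1))

# OR (k = 1), MAJORITY (k = 2), and AND (k = 3) rules, three sensors, p = 0.8
print([round(k_of_n_power(3, k, 0.8), 4) for k in (1, 2, 3)])  # [0.992, 0.896, 0.512]
```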

Fig. 11. Thresholds for the "pathological" case of Example 4. Here the sensors are independent (conditioned on the hypothesis) and identical, yet the three sensors employ different thresholds. The fusion rules which happen to be optimal are also given.

Following standard optimization theory, we shall minimize the function

$$C(t_1, \ldots, t_n) \triangleq 1 - \beta + \lambda(\alpha - \alpha_d)$$ (30)

subject to the constraint $\alpha = \alpha_d$. We shall assume for now that $1 < k < n$. Writing $\beta$ in terms of $t_i$ we get

$$\beta = F_1(t_i) \Pr\left( \sum_{j=1, j \neq i}^{n} U_j \geq k \,\Big|\, H_1 \right) + (1 - F_1(t_i)) \Pr\left( \sum_{j=1, j \neq i}^{n} U_j \geq k-1 \,\Big|\, H_1 \right)$$ (31)

or

$$\beta = \Pr\left( \sum_{j=1, j \neq i}^{n} U_j \geq k \,\Big|\, H_1 \right) + (1 - F_1(t_i)) \Pr\left( \sum_{j=1, j \neq i}^{n} U_j = k-1 \,\Big|\, H_1 \right)$$ (32)

with a similar expression for $\alpha$, and in which we have defined

$$F_l(t) \triangleq \Pr(L_i \leq t \mid H_l)$$ (33)

Douglas Warren, Peter Willett / Journal of the Franklin Institute 336 (1999) 323—359 351

Page 30: Optimum quantization for detector fusion: some proofs, examples, and pathology

for l"0,1. The first-order necessary conditions for a minimum are that the gradient ofC with respect to the thresholds is zero, or that

f_1(t_i)\,\Pr\Big(\sum_{j=1, j\ne i}^{n} U_j = k-1 \,\Big|\, H_1\Big) = \lambda\, f_0(t_i)\,\Pr\Big(\sum_{j=1, j\ne i}^{n} U_j = k-1 \,\Big|\, H_0\Big) \qquad (34)

for all i \in \{1, \ldots, n\}, and where f_l(t) \triangleq (d/dt)F_l(t) are the likelihood ratio densities. It is clear that Eq. (34) is satisfied for any set of identical thresholds; hence that value t^* which satisfies the false-alarm constraint is at least a critical point of the cost function. Evaluating the probabilities explicitly we have

\lambda^* = \frac{f_1(t^*)}{f_0(t^*)} \left(\frac{1 - F_1(t^*)}{1 - F_0(t^*)}\right)^{k-1} \left(\frac{F_1(t^*)}{F_0(t^*)}\right)^{n-k} \qquad (35)

corresponding to t_1 = \cdots = t_n = t^*.
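To make Eqs. (33) and (35) concrete, the sketch below evaluates \lambda^* for a hypothetical known-signal-in-Gaussian-noise problem with unit-variance noise and signal level s, where the likelihood ratio L(x) = exp(sx - s^2/2) is monotone in x so that F_l(t) has a closed form; the densities f_l are obtained by central differences. The signal level and threshold values are illustrative assumptions, not taken from the paper.

```python
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

s = 1.0  # assumed signal level; L(x) = exp(s*x - s**2/2) is monotone in x

def F(t, l):
    """F_l(t) = Pr(L <= t | H_l), Eq. (33), in closed form for this example."""
    x = math.log(t) / s + s / 2.0      # the x at which L(x) = t
    return Phi(x - s * l)              # under H_l the observation has mean s*l

def f(t, l, h=1e-6):
    """Likelihood-ratio density f_l(t) = (d/dt) F_l(t), by central difference."""
    return (F(t + h, l) - F(t - h, l)) / (2.0 * h)

def lambda_star(t, n, k):
    """Multiplier of Eq. (35) at identical thresholds t_1 = ... = t_n = t."""
    return (f(t, 1) / f(t, 0)
            * ((1.0 - F(t, 1)) / (1.0 - F(t, 0))) ** (k - 1)
            * (F(t, 1) / F(t, 0)) ** (n - k))

# sanity check: f_1(t)/f_0(t) = t is a general identity for LR densities
assert abs(f(1.3, 1) / f(1.3, 0) - 1.3) < 1e-3
assert lambda_star(1.3, n=5, k=3) > 0.0
```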

Now let us express \beta in terms of two thresholds t_i and t_j. We have

\beta = (1 - F_1(t_i))(1 - F_1(t_j)) \sum_{l=k-2}^{n-2} \binom{n-2}{l} (1 - F_1(t^*))^l F_1(t^*)^{n-l-2}
\; + \; (1 - F_1(t_i)) F_1(t_j) \sum_{l=k-1}^{n-2} \binom{n-2}{l} (1 - F_1(t^*))^l F_1(t^*)^{n-l-2}
\; + \; F_1(t_i)(1 - F_1(t_j)) \sum_{l=k-1}^{n-2} \binom{n-2}{l} (1 - F_1(t^*))^l F_1(t^*)^{n-l-2}
\; + \; F_1(t_i) F_1(t_j) \sum_{l=k}^{n-2} \binom{n-2}{l} (1 - F_1(t^*))^l F_1(t^*)^{n-l-2} \qquad (36)

or

\beta = F_1(t_i) F_1(t_j) \binom{n-2}{k-2} (1 - F_1(t^*))^{k-2} F_1(t^*)^{n-k}
\; - \; (F_1(t_i) + F_1(t_j)) \binom{n-2}{k-2} (1 - F_1(t^*))^{k-2} F_1(t^*)^{n-k}
\; - \; F_1(t_i) F_1(t_j) \binom{n-2}{k-1} (1 - F_1(t^*))^{k-1} F_1(t^*)^{n-k-1}
\; + \; \sum_{l=k-2}^{n-2} \binom{n-2}{l} (1 - F_1(t^*))^l F_1(t^*)^{n-l-2} \qquad (37)

Taking the derivative with respect to t_i we get

\frac{\partial \beta}{\partial t_i} = -f_1(t_i)(1 - F_1(t_j)) \binom{n-2}{k-2} (1 - F_1(t^*))^{k-2} F_1(t^*)^{n-k} - f_1(t_i) F_1(t_j) \binom{n-2}{k-1} (1 - F_1(t^*))^{k-1} F_1(t^*)^{n-k-1} \qquad (38)


from which we get the second derivatives

\frac{\partial^2 \beta}{\partial t_i^2} = -\dot{f}_1(t^*)(1 - F_1(t^*))^{k-1} F_1(t^*)^{n-k} \left[\binom{n-2}{k-2} + \binom{n-2}{k-1}\right] \qquad (39)

and

\frac{\partial^2 \beta}{\partial t_i \partial t_j} = f_1(t^*)^2 \binom{n-2}{k-2} (1 - F_1(t^*))^{k-2} F_1(t^*)^{n-k} - f_1(t^*)^2 \binom{n-2}{k-1} (1 - F_1(t^*))^{k-1} F_1(t^*)^{n-k-1} \qquad (40)

The first and second derivatives of \alpha are similar. A necessary (and sufficient) condition for a set of thresholds to be a local minimum of the cost function C is that for any vector y such that (\nabla\alpha)^T y = 0 we have y^T L y > 0, where L is the Hessian matrix of C [31]. Since with all thresholds equal we have \nabla\alpha \propto (1, 1, \ldots, 1)^T, and since all off-diagonal terms of L are the same, this is equivalent to requiring

0 < -\frac{\partial^2 \beta}{\partial t_i^2} + \lambda \frac{\partial^2 \alpha}{\partial t_i^2} + \frac{\partial^2 \beta}{\partial t_i \partial t_j} - \lambda \frac{\partial^2 \alpha}{\partial t_i \partial t_j} \qquad (41)

or

0 < \left[(n-k)\frac{d}{dt^*}\ln\left(\frac{f_1(t^*)/f_0(t^*)}{F_1(t^*)/F_0(t^*)}\right) + (k-1)\frac{d}{dt^*}\ln\left(\frac{f_1(t^*)/f_0(t^*)}{(1-F_1(t^*))/(1-F_0(t^*))}\right)\right] \times \frac{(n-2)!}{(n-k)!\,(k-1)!}\, f_1(t^*)(1 - F_1(t^*))^{k-1} F_1(t^*)^{n-k} \qquad (42)

Thus a necessary condition for t_1 = \cdots = t_n = t^* is that

(n-k)\frac{d}{dt^*}\ln\left(\frac{f_1(t^*)/f_0(t^*)}{F_1(t^*)/F_0(t^*)}\right) + (k-1)\frac{d}{dt^*}\ln\left(\frac{f_1(t^*)/f_0(t^*)}{(1-F_1(t^*))/(1-F_0(t^*))}\right) > 0 \qquad (43)

While this has been proven only for 1 < k < n, it can be shown via a parallel development that Eq. (43) is also a necessary condition for the and (k = n) and or (k = 1) fusion rules. It is interesting that similar ROC slope conditions have been used as predictors of pathology [32], in which it is shown that under certain such conditions the probability of error does not go to zero as the number of sensors grows without bound.
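As a quick numerical illustration (our own, not the authors'), the sketch below evaluates the left side of Eq. (43) for the known-signal-in-Gaussian-noise problem, where the condition should hold at every threshold and every k (cf. Example 5). It uses the identity f_1(t)/f_0(t) = t for likelihood-ratio densities, so the ln(f_1/f_0) terms differentiate to 1/t; the signal level s and threshold are assumed values.

```python
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

s = 1.0  # assumed signal level; L(x) = exp(s*x - s**2/2) is monotone in x

def F(t, l):
    """F_l(t) = Pr(L <= t | H_l) of Eq. (33) for the Gaussian example."""
    x = math.log(t) / s + s / 2.0      # the x at which L(x) = t
    return Phi(x - s * l)

def lhs_43(t, n, k, h=1e-5):
    """Left side of Eq. (43) at identical threshold t. Since f_1/f_0 = t,
    d/dt ln(f_1/f_0) = 1/t; the CDF-ratio terms are differenced numerically."""
    dln = lambda g: (math.log(g(t + h)) - math.log(g(t - h))) / (2.0 * h)
    term1 = 1.0 / t - dln(lambda u: F(u, 1) / F(u, 0))
    term2 = 1.0 / t - dln(lambda u: (1.0 - F(u, 1)) / (1.0 - F(u, 0)))
    return (n - k) * term1 + (k - 1) * term2

for k in range(1, 6):
    assert lhs_43(1.3, 5, k) > 0.0  # identical thresholds pass the test here
```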

7.2. Interpretation

Let us recapitulate the results of the previous section. We have noted that if an identical threshold is to be used at each sensor, then a k-out-of-n fusion rule must be


employed. With such thresholds and fusion rule the first-order necessary conditions for optimality are indeed satisfied; in addition we have presented (second-order) conditions necessary for optimality, and sufficient for the cost function C to be at least a local minimum.

Consider the functions

g_k(x) = \sum_{l=k}^{n} \binom{n}{l} x^l (1 - x)^{n-l} \qquad (44)

for which g_k(0) = 0, g_k(1) = 1, and \dot{g}_k(x) > 0 for 0 \le x \le 1. With \beta_0(\alpha_0) representing the receiver operating characteristic (ROC) of an individual sensor, a system optimal under the assumption of identical thresholds uses

k^* = \arg\max_k \{ g_k(\beta_0(g_k^{-1}(\alpha_d))) \} \qquad (45)

for its fusion rule. If the operating point on the individual-sensor ROC corresponding to the design false-alarm rate \alpha_d does not satisfy Eq. (43), then an optimal design does not use identical thresholds. If on the other hand Eq. (43) is satisfied, then the design produces at least a local minimum of the cost function C; unfortunately there do not appear to be any general statements one can make about convexity, and there may still be a better design using non-identical thresholds, or even a different (not k-out-of-n) fusion rule.
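The selection rule of Eqs. (44) and (45) is easy to mechanize. The sketch below does so for a hypothetical Gaussian-shift single-sensor ROC \beta_0(\alpha) = \Phi(\Phi^{-1}(\alpha) + d); the ROC form, the deflection d, and the bisection inversions are our illustrative assumptions, not part of the paper.

```python
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bisect_inverse(fn, y, lo, hi, iters=60):
    """Invert an increasing function on [lo, hi] by bisection."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if fn(mid) < y:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def g(k, n, x):
    """g_k(x) of Eq. (44): probability that at least k of n independent
    events, each of probability x, occur."""
    return sum(math.comb(n, l) * x**l * (1.0 - x)**(n - l)
               for l in range(k, n + 1))

def beta0(alpha, d=1.0):
    """Hypothetical single-sensor ROC (Gaussian shift, deflection d)."""
    return Phi(bisect_inverse(Phi, alpha, -10.0, 10.0) + d)

def best_k(n, alpha_d, d=1.0):
    """k* of Eq. (45): each k-out-of-n rule runs the sensors at the local
    false-alarm rate g_k^{-1}(alpha_d); pick the k with largest detection."""
    return max(range(1, n + 1),
               key=lambda k: g(k, n, beta0(bisect_inverse(
                   lambda x: g(k, n, x), alpha_d, 0.0, 1.0), d)))

assert abs(g(1, 3, 0.2) - (1.0 - 0.8**3)) < 1e-12  # "or" rule, n = 3
assert 1 <= best_k(5, 0.01) <= 5
```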

The and and or fusion rules are exceptional. In these cases Eq. (34) may be written as

\frac{f_1(t_i)}{1 - F_1(t_i)} \prod_{j=1}^{n} (1 - F_1(t_j)) = \lambda\, \frac{f_0(t_i)}{1 - F_0(t_i)} \prod_{j=1}^{n} (1 - F_0(t_j)) \qquad (46)

and

\frac{f_1(t_i)}{F_1(t_i)} \prod_{j=1}^{n} F_1(t_j) = \lambda\, \frac{f_0(t_i)}{F_0(t_i)} \prod_{j=1}^{n} F_0(t_j) \qquad (47)

respectively, for which it is clear that Eq. (43) guarantees unique solutions. In fact the and and or rules, since they represent inequality (43) in its purest forms, offer the most straightforward interpretation. The necessary conditions for the and rule may be written in terms of the individual-sensor ROC as

\frac{d}{d\alpha}\ln\left(\frac{d\beta}{d\alpha}\right) < \frac{d}{d\alpha}\ln\left(\frac{\beta}{\alpha}\right) \qquad (48)

while for the or rule we have

\frac{d}{d\alpha}\ln\left(\frac{d\beta}{d\alpha}\right) < \frac{d}{d\alpha}\ln\left(\frac{1 - \beta}{1 - \alpha}\right) \qquad (49)

Geometrically the left sides of Eqs. (48) and (49) represent the relative rate of change of the slope of the individual-sensor ROC at a given operating point. The right sides represent the relative rates of change of the slopes of secants drawn between the operating point and (0, 0) and (1, 1) respectively; this is shown in Fig. 12. Here the


Fig. 12. Interpretation of second-order necessary conditions.

slope of a is d\beta/d\alpha, that of b is \beta/\alpha, and that of c is (1 - \beta)/(1 - \alpha). Since all quantities are negative, we see that the second-order necessary conditions will be violated for a sufficiently flat (in the sense of constant slope) portion of the ROC; this is indicative of a large (but not necessarily a point) mass of probability.

Example 5. For the known-signal in Gaussian noise hypothesis-testing problem, inequalities (48) and (49) are always satisfied.
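A reader can check Example 5 numerically. The sketch below evaluates both sides of Eq. (48) by finite differences on the Gaussian-shift ROC \beta(\alpha) = \Phi(\Phi^{-1}(\alpha) + d) (our parameterization; the deflection d is an assumed value) and confirms the inequality at several operating points.

```python
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def Phi_inv(p):
    """Invert the increasing function Phi by bisection."""
    lo, hi = -10.0, 10.0
    for _ in range(80):
        mid = (lo + hi) / 2.0
        if Phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

d = 1.0  # assumed deflection (signal-to-noise measure) of the Gaussian problem

def beta(alpha):
    """Single-sensor ROC of the known-signal-in-Gaussian-noise problem."""
    return Phi(Phi_inv(alpha) + d)

def deriv(fn, a, h=1e-5):
    return (fn(a + h) - fn(a - h)) / (2.0 * h)

def lhs(a):
    """d/dalpha ln(dbeta/dalpha): the left side of Eq. (48)."""
    return deriv(lambda x: math.log(deriv(beta, x)), a)

def rhs(a):
    """d/dalpha ln(beta/alpha): the right side of Eq. (48)."""
    return deriv(lambda x: math.log(beta(x) / x), a)

for a in (0.1, 0.3, 0.5, 0.7, 0.9):
    assert lhs(a) < rhs(a)  # Eq. (48) holds, as Example 5 asserts
```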

Example 6. For the single-sensor ROC described by \beta_0 = \alpha_0^s (0 < s < 1), which may arise from an exponential-density hypothesis-testing problem, the left and right sides of Eq. (48) are equal. In fact, the performance of such a test using an n-sensor and rule is completely insensitive to the thresholds used, and actually is equal to the single-sensor performance. Not surprisingly the and rule is never optimal, and Eq. (43) is satisfied for any k > 1.
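The threshold-insensitivity claim of Example 6 is easy to verify: for the and rule the system false-alarm and detection probabilities are products of the per-sensor rates (by conditional independence), so with \beta_0 = \alpha_0^s the system ROC collapses to the single-sensor ROC however the thresholds are split. A minimal sketch, with an assumed exponent s:

```python
import math
import random

s = 0.5  # assumed ROC exponent: beta0 = alpha0**s, with 0 < s < 1

def and_rule_point(alphas):
    """System (false alarm, detection) of an n-sensor "and" rule when
    sensor j operates at (alpha_j, alpha_j**s); by conditional independence
    the system rates are the products of the per-sensor rates."""
    a = math.prod(alphas)
    b = math.prod(aj**s for aj in alphas)
    return a, b

# however the per-sensor rates are chosen, beta = alpha**s:
# the and-rule system ROC equals the single-sensor ROC
random.seed(1)
alphas = [random.uniform(0.01, 0.99) for _ in range(4)]
a, b = and_rule_point(alphas)
assert abs(b - a**s) < 1e-12
```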

Example 7 (Same as Example 4). The null-hypothesis density f_0 is shown in Fig. 13 for k = 10 and k = 100. It was observed that for k = 10 the thresholds optimally should be identical, while they may differ for k = 100, the latter case more closely resembling point masses. The actual thresholds for k = 100 and three sensors are shown in


Fig. 13. Likelihood ratio density functions for the "pathological" Example 4, for different k parameters.

Fig. 11. It turns out that Eq. (43) begins to fail for some t when k > 11. For example, if an and rule is to be used, only the second term in Eq. (43) must be checked, since k = n. This logarithm (that is, without the derivative) is plotted in Fig. 14. It is clear that with k = 10 the slope is always positive, and there is no doubt that sensor quantizations should be identical; yet with k = 100 this is not true, and the behavior of Fig. 11 results. The same conclusion can be drawn directly from the ROC of Fig. 15 using Eq. (48) instead of Eq. (43).

In summary, there is, for the binary quantization case, an easy-to-check necessary condition, given in terms of the local ROC, for the sensor mapping functions to be identical. It turns out that the condition is violated when the local likelihood ratio densities become "point-mass like" in that their probability is sharply concentrated; they need not lose their continuity and differentiability, however. The condition was derived in the context of Neyman-Pearson detection systems, but the proof is valid for Bayesian schemes as well. The necessary condition is known to be sufficient only for and and or fusion rules. Excellent further results on the relative performance of identical versus non-identical quantization are available in [12]; generally the difference is not great.


Fig. 14. The version of Eq. (43) appropriate for Example 7 (same as Example 4) and an and rule. Note that with k = 10 this curve is monotone (the derivative is always positive, and hence Eq. (43) is always satisfied), while for k = 100 this is not so.

8. Summary

In this paper, we have shown that under many performance criteria the optimal decentralized detection system uses a likelihood ratio partition of the sensor observations. The criteria studied were Neyman-Pearson, Bayes, mutual information, and the Ali-Silvey distances. Unlike past work on the subject, we did not assume that the bandlimited channels between the sensors and the fusion center restrict communication to one bit per decision, nor did we assume that the sensor decision rules were identical. In all cases, we have shown that the sensors optimally compute and quantize the local likelihood ratio and transmit the quantizer level to the fusion center. We also showed that the optimal decentralized detector for global mutual information uses no randomization at the fusion center.

Conditions were given for the use of identical sensor decision rules for certain Ali-Silvey distances and for the local mutual information. It was also shown that even with i.i.d. sensor observations and equal communication channel constraints, the sensor decision rules for the Neyman-Pearson, Bayes, and global mutual information criteria may vary between sensors. Examples were given illustrating the effects of jamming (and hence dependence), the relationship between local and global mutual


Fig. 15. The single-sensor ROC for Example 7 (same as Example 4), for use with Eq. (48) to determine whether the sensors use identical quantizations. The same conclusions can be drawn as in the alternative approach of Fig. 14: when k = 100, this ROC is sufficiently flat to violate Eq. (48).

information, and situations in which identical sensor decision rules will and will not be optimal.

As this last example is, we hope, intriguing, we have devoted some effort to its explication. Specifically, we are able to derive necessary (and sometimes sufficient) conditions for identical sensors optimally to use identical quantizers. These conditions amount to a check on the nearness of a likelihood ratio density to point-mass behavior.

References

[1] R.R. Tenney, N.R. Sandell Jr., Detection with distributed sensors, IEEE Trans. Aerospace Electron. Systems AES-17 (1981) 501-510.
[2] Z. Chair, P.K. Varshney, Optimal data fusion in multiple sensor detection systems, IEEE Trans. Aerospace Electron. Systems AES-22 (1986) 98-101.
[3] A.R. Reibman, L.W. Nolte, Optimal detection and performance of distributed sensor systems, IEEE Trans. Aerospace Electron. Systems AES-23 (1987) 24-30.
[4] I.Y. Hoballah, P.K. Varshney, Distributed Bayesian signal detection, IEEE Trans. Inform. Theory IT-35 (1989) 995-1000.
[5] I.Y. Hoballah, P.K. Varshney, An information theoretic approach to the distributed detection problem, IEEE Trans. Inform. Theory IT-35 (1989) 988-994.
[6] D.J. Warren, P.K. Willett, Optimal decentralized detection for conditionally independent sensors, Proc. 1989 Amer. Control Conf., vol. 2, Pittsburgh, June 1989, pp. 1326-1329.
[7] J.N. Tsitsiklis, Decentralized detection, in: H.V. Poor, J.B. Thomas (Eds.), Advances in Statistical Signal Processing, vol. 2: Signal Detection, JAI Press, Greenwich, CT, 1990.
[8] J.N. Tsitsiklis, Extremal properties of likelihood ratio quantizers, Tech. Report LIDS-P-1923, Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA, November 1989.
[9] S. Thomopoulos, R. Viswanathan, D. Bougoulias, Optimal distributed decision fusion, IEEE Trans. Aerospace Electron. Systems 25 (5) (1989) 761-765.
[10] R. Viswanathan, A. Ansari, S. Thomopoulos, Optimal partitioning of observations in distributed detection, Proc. Int. Symp. on Information Theory, 1988, p. 197.
[11] R. Blum, Quantization in multi-sensor random signal detection, IEEE Trans. Inform. Theory 41 (1) (1995) 204-215.
[12] P.-N. Chen, A. Papamarcou, New asymptotic results in parallel distributed detection, IEEE Trans. Inform. Theory 39 (6) (1993) 1847-1863.
[13] R. Blum, Necessary conditions for optimum distributed detectors under the Neyman-Pearson criterion, IEEE Trans. Inform. Theory 42 (3) (1996) 990-994.
[14] H.L. Van Trees, Detection, Estimation, and Modulation Theory, vol. 1, Wiley, New York, 1968.
[15] A. Martinez, Detector optimization using Shannon's information, Proc. 20th Annual Conf. Information Sciences and Systems, Princeton University, Princeton, NJ, March 1986.
[16] R. Blahut, Principles and Practice of Information Theory, Addison-Wesley, Reading, MA, 1987.
[17] S.D. Silvey, Statistical Inference, Chapman & Hall, London, 1974.
[18] H.V. Poor, An Introduction to Signal Detection and Estimation, Springer, New York, 1988.
[19] E.L. Lehmann, Testing Statistical Hypotheses, Wiley, New York, 1986.
[20] S.M. Ali, S.D. Silvey, A general class of coefficients of divergence of one distribution from another, J. Roy. Stat. Soc., Ser. B 28 (1966) 131-142.
[21] H. Kobayashi, J.B. Thomas, Distance measures and related criteria, Proc. 5th Annual Allerton Conf. Circuit and System Theory, Monticello, IL, October 1967, pp. 491-500.
[22] H.V. Poor, Robust decision design using a distance criterion, IEEE Trans. Inform. Theory IT-26 (1980) 575-587.
[23] T.L. Grettenberg, Signal selection in communication and radar systems, IEEE Trans. Inform. Theory IT-9 (1963) 265-275.
[24] T. Kailath, The divergence and Bhattacharyya distance measures in signal selection, IEEE Trans. Commun. Technol. COM-15 (1967) 52-60.
[25] H.V. Poor, J.B. Thomas, Applications of Ali-Silvey distance measures in the design of generalized quantizers for binary decision systems, IEEE Trans. Commun. COM-25 (1977) 893-900.
[26] S. Kassam, Signal Detection in Non-Gaussian Noise, Springer, Berlin, 1987.
[27] D. Kazakos, V. Vannicola, M. Wicks, On optimal quantization for detection, Proc. 1989 Conf. on Information Sciences and Systems, March 1989.
[28] P. Willett, D. Warren, Randomization and scheduling in decentralized detection, Proc. 27th Annual Allerton Conf. on Communications, Control, and Computing, September 1989.
[29] P. Willett, D. Warren, The suboptimality of randomized tests in quantized detection systems, IEEE Trans. Inform. Theory 39 (2) (1992) 355-361.
[30] Y. Han, T. Kim, Randomized fusion rules can be optimal in distributed Neyman-Pearson detectors, IEEE Trans. Inform. Theory 43 (4) (1997) 1281-1288.
[31] D. Luenberger, Linear and Nonlinear Programming, Addison-Wesley, Reading, MA, 1984.
[32] R. Viswanathan, V. Aalo, On counting rules in distributed detection, IEEE Trans. Acoust. Speech Signal Process. 37 (5) (1989) 772-775.
