
3954 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 57, NO. 10, OCTOBER 2009

Optimal Threshold Policies for Multivariate POMDPs in Radar Resource Management

Vikram Krishnamurthy, Fellow, IEEE, and Dejan V. Djonin

Abstract—This paper deals with the management of multimode sensors such as multifunction radars. We consider the problem of multitarget radar scheduling formulated as a multivariate partially observed Markov decision process (POMDP). The aim is to compute the scheduling policy that determines which target to choose and how long to continue with this choice so as to minimize a cost function. We give sufficient conditions on the cost function, the dynamics of the Markov chain targets, and the observation probabilities so that the optimal scheduling policy has a threshold structure with respect to the multivariate TP2 ordering. This implies that the optimal parameterized policy can be estimated efficiently. We then present stochastic approximation algorithms for estimating the best multilinear threshold policy.

Index Terms—Bayesian sensor scheduling, multitarget tracking, radar resource management, stochastic approximation algorithms, threshold policies, totally positive (TP2) ordering.

I. INTRODUCTION

THIS paper deals with the management of multimode sensors such as multifunction radars. Consider dynamical targets tracked by an agile beam multifunction radar. How should the radar manager decide which target to track with high priority during the time slot and for how long? Given Bayesian track estimates of the underlying targets, the goal is to devise a sensor management strategy that at each time dynamically decides how much radar resource to invest in each target. Several recent papers in statistical signal processing [1]–[6] study the problem as a multivariate partially observed Markov decision process (POMDP). Such sensor scheduling problems have recently received much attention in the statistical signal processing literature. They are used in the context of joint target tracking and sensor management. A major concern with the POMDP formulation is that in most realistic cases, POMDPs are numerically intractable. They are PSPACE hard [7], requiring exponential computational cost (in sample path length) and memory.

The main aim of this paper is to show that by introducing structural assumptions on multivariate POMDPs, the optimal scheduling policy can be characterized by a simple structure and computed efficiently. We formulate the dynamic sensor management problem as a two-level optimization where the two levels are cross coupled. The inner level optimization problem, termed sensor micromanagement, deals with how long to maintain a given priority allocation of targets. It is formulated as a multivariate POMDP. The outer level optimization, termed sensor macromanagement, deals with picking a new priority allocation. We introduce a novel target priority parameter optimization at the outer level that links the two levels. At each scheduling slot, the macromanager decides which target should be given priority during the slot. Given this priority allocation, the micromanager coordinates the tracking of the targets and decides how long to maintain the current priority allocation. The radar devotes maximum priority (time) or the best sensors to track the high priority target and less time or poorer quality sensors to track the remaining lower priority targets. When the micromanager decides to terminate the current priority, the scheduling interval is completed and control is returned to the macromanager. The micromanagement optimization is formulated as a multivariate POMDP in this paper. We refer to Fig. 1 for a schematic description of the setup. At the end of the scheduling interval, the outer level macromanager chooses the next high priority target based on the target priority allocation vector, the cost accrued at the micromanager, and other extrinsic information.

Manuscript received January 05, 2009; accepted March 30, 2009. First published May 12, 2009; current version published September 16, 2009. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Antonio Napolitano. This work was supported in part by the NSERC and by the Defense Research Development Canada (DRDC) Ottawa. A drastically reduced version of this paper appears in the European Conference on Symbolic and Quantitative Approaches to Reasoning with Uncertainty, July 2009.

The authors are with the Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, BC V6T 1Z4, Canada (e-mail: [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TSP.2009.2022915

A. Main Results and Outline

In the above context, there are three main results in this paper:

1. Structural Results for Sensor Micromanagement: The sensor micromanagement in both of the above problems can be formulated as a multivariate POMDP. The main goal of this paper is to prove that under reasonable conditions, the multivariate POMDP has a remarkable structure: the optimal scheduling policy is a simple threshold. Therefore, the optimal policy can be computed efficiently. Showing this result requires using the TP2 (totally positive of order 2) multivariate stochastic ordering. We extend this TP2 stochastic ordering to multilinear curves and lines within the state space of the POMDP to prove our threshold result. We also give a novel necessary and sufficient condition for the optimal threshold policy to be approximated by the best linear hyperplane or multilinear curve. We then present stochastic approximation algorithms to compute these parameterized thresholds.

2. By introducing a priority allocation vector in the sensor micromanagement POMDP, we show that the macromanagement problem (outer level optimization) can be formulated as the optimization of a concave objective over a convex polytope. Therefore the optimal target priority allocation vector is one of the extreme points of the convex polytope.

3. Finally, we illustrate the main assumptions and results in the examples of radar resource management. These problems have been formulated in recent work as multivariate POMDPs. Therefore, the structural results and algorithms presented in this paper are directly applicable to these applications.

B. Context and Other Works

The main difference between this paper and other recent papers in sensor management is that our work focuses on analysis of the structure of multivariate POMDPs rather than brute force numerical solution or myopic solutions based on heuristics. We give sufficient conditions under which the optimal policy has a provable threshold structure. Such structural results allow us to devise efficient numerical algorithms for multivariate POMDPs with several hundred states, which would otherwise be impossibly complex to solve. Also, the structural results we provide are "class" type results – that is, for parameters belonging to a set, the results hold. Hence there is an inherent robustness in these results: even if the underlying POMDP parameters are not exactly specified but still belong to the appropriate sets, the structural results still hold.

This paper substantially generalizes our previous paper [3], which dealt with structural results for scalar POMDPs with a single target. In that paper we used the univariate monotone likelihood ratio (MLR) ordering. For the multivariate POMDPs considered in the current paper, the problem is significantly harder. First, one needs to use the TP2 multivariate stochastic ordering of the Bayesian estimate, which is not necessarily reflexive (unlike the univariate MLR ordering). We show in Section IV-A-5) that it is not possible to use the univariate MLR ordering of [3] for the multitarget problems considered in this paper. Moreover, depending on whether or not the multiple targets evolve independently, it is necessary to construct different types of threshold policies. Second, unlike in [3], we present in this paper an outer level optimization for sensor macromanagement. We show that the outer level macromanagement problem is coupled with the inner level POMDP but can be solved efficiently. Third, the results of the current paper naturally apply to multitarget sensor management. The results of this paper are also related to [8], where conditions are given for a POMDP to have a TP2 monotone increasing policy. However, to make those results useful from a practical point of view, one needs to translate monotonicity of the policy to the existence of a threshold policy. A major contribution of the current paper is to develop the properties of a specialized version of the TP2 stochastic ordering on multilinear curves and lines. We show that this specialized TP2 order requires far less restrictive conditions on the costs compared to [8]. Moreover, we present necessary and sufficient conditions for the best linear and multilinear approximation to the optimal threshold policy. This allows us to estimate the optimal linear and multilinear approximation to the threshold policy via stochastic approximation algorithms.

II. SENSOR SCHEDULING SIGNAL MODEL AND DYNAMIC PROGRAMMING FORMULATION

In this section we describe a generic two-time scale sensor management scheme that will be used for radar resource management. We use three time indices:

1. The targets evolve over the fast time scale.

2. The slow time scale indexes random length intervals of time over the fast time scale. These random length intervals are called scheduling intervals.

3. Finally, on the fast time scale we will use an index to denote the time within a scheduling interval. This denotes relative time, while the other fast time index denotes absolute time on the fast time scale.

In Section II-A, the dynamics of the targets are introduced. In Section II-B, the two-time scale sensor management involving macromanagement at the slow time scale and micromanagement at the fast time scale is described. In Section II-C, the micromanagement problem is formulated as a multivariate POMDP. Section II-D deals with the special case of independent targets. In Section II-E, the optimization of the priority allocation vector in the macromanagement problem is formulated as a convex optimization problem. Finally, the entire protocol is described in pseudocode form in Section II-F.

A. Target Dynamics

Consider a radar with an agile beam tracking moving targets (e.g., aircraft). Each target is modeled as a finite state random process evolving over discrete time. To simplify notation, assume each process has the same finite state space. Each process models a specific attribute of the target. For example, in [2], [3], [9], it models the distance of the target to the base-station. In other radar resource management examples [10], it models the track covariance (uncertainty) of the target over time. In either case, the radar resource manager uses this information to micromanage the radar by adapting the target dwell time, waveform, aperture, etc.

The composite state random process comprising all individual target processes has composite state space

(1)

where the composite state space is the Cartesian product of the individual state spaces. We index the states of the composite process by a vector index with one generic element per target.

Assume the composite process evolves according to a finite state Markov chain, with transition matrix

(2)


The composite process can model dependent targets flying in formation; see Section IV. The special case where the targets are mutually independent is discussed in Section II-D.

B. Two-Time Scale Sensor Management Architecture

1) Macromanagement: Target Selection: At each instant on the slow time scale (the th scheduling interval), the sensor manager picks one target to track/estimate with high priority, while the other targets are tracked/estimated with lower priority. The choice is based on a policy rule which maps the current Bayesian track estimates of the targets (denoted below) to the high priority target. This policy is termed macrolevel sensor management. While macromanagement is not the main focus of the paper, we discuss its interaction with the microlevel management (described below) in Section II-E.

2) Micromanagement: Scheduling Control: Once the action is chosen, the micromanager is initiated for the th scheduling interval. The clock on the fast time scale is reset and commences ticking. At this fast time scale, the targets are tracked/estimated by a Bayesian tracker. The chosen target is given highest priority and thus allocated the best quality sensors (or more time within the scheduling interval) for measuring its state. The remaining targets are given lower priority and tracked using lower quality sensors (or given less time in the scheduling interval). Micromanagement is the main focus of this paper. The question we seek to answer is: How long should the micromanager track the chosen target with high priority before returning control to the macromanager to pick a new high priority target? Instead of such target dwell time management, we can also formulate radar waveform selection or aperture selection as similar problems.

C. Formulation of Sensor Micromanagement as a Multivariate POMDP

The aim below is to formulate the sensor micromanagement problem for the targets as a multivariate POMDP. This POMDP comprises the following six ingredients:

1) Markov Chain: The composite Markov chain defined in Section II-A models the evolving targets.

2) Action Space: At each fast time instant within the current scheduling interval, the micromanager picks an action (continue or stop) as a function of the Bayesian estimates (defined in (8)) of all targets:

(3)

• If the action is to continue, the micromanager continues with the current target priority allocation, the fast time index increments, and the targets are tracked with the chosen target given the highest priority.

• If the action is to stop, the micromanager ends the current scheduling interval and returns control to the macromanager to determine a new high priority target for the next scheduling interval. The precise logical flow is given in Protocol 1 below.

3) Multivariate Target Measurements: Given the composite state of the targets, a measurement vector is recorded at each fast time instant:

(4)

Assume each target's observation is finite valued, i.e.,

(5)

For example, in [2], [3], and [9], the observation models the noisy observed distance of the target to the base-station. In other types of radar resource management [10], it models the estimated track covariance of the target. Since one target is designated high priority, its measurement is more accurate than the measurements of the other targets. Note that since the targets are correlated, observing one target yields information about other targets. Also, since the elements of the observation vector can be correlated, measurements of one target give information about another target.

4) Multitarget Bayesian Tracking and Information State: In a scheduling interval, with a given priority allocation, denote the history of past observations and actions as

(6)

Here the initial element of the history is the a posteriori distribution of the targets provided by the macromanager at the end of the previous scheduling interval. Based on the history, the Bayesian tracker computes the a posteriori state distribution of the targets, defined as

(7)

This column vector is computed recursively via the hidden Markov Bayesian filter:

(8)

Here the normalization uses the vector of ones. The estimate is referred to as the information state, since (see [7]) it is a sufficient statistic to describe the history in (6). The composite Bayesian estimate in (8) lives in a unit simplex

(9)

The Bayesian a posteriori distribution of each individual target can be computed by marginalizing the joint distribution. For each target, the marginal estimate in (17) lives in a lower dimensional unit simplex

(10)

Let the unit vectors with 1 in a single position denote the corners of this simplex; these corner points represent the elements of the state space, so

(11)
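As a concrete illustration of the filter update (8), the following is a minimal NumPy sketch of the standard hidden Markov model filter for the composite chain; the function and variable names are ours, and the normalization convention is the standard one rather than necessarily the paper's exact notation.

import numpy as np

def update_info_state(pi, A, B, y):
    # pi : current composite information state (probability vector over the composite state space)
    # A  : composite transition matrix (2), A[i, j] = P(next state j | current state i)
    # B  : observation likelihoods for the current action, B[j, y] = P(observation y | state j)
    # y  : index of the observation vector recorded at this fast time instant
    unnormalized = B[:, y] * (A.T @ pi)        # one-step prediction followed by Bayes correction
    return unnormalized / unnormalized.sum()   # renormalize so the estimate stays on the simplex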

5) Tracking Cost: At each fast time instant, with the given current composite state of the targets [see (1)], if an action is taken, then the micromanager accrues an instantaneous cost, defined by

the cost of terminating the current priority allocation given the composite state (stop action);
the cost of continuing the current priority allocation given the composite state (continue action).    (12)

In (12), the nonnegative vector of target priority allocations is set by the macromanager. As described in Section II-E, this vector links the micro and macromanagement. The costs are chosen as decreasing functions (elementwise) of the priority allocation, since higher priority targets should incur lower tracking costs. The cost of stopping can also be viewed as a switching cost incurred by the micromanager when it decides to terminate the current target priority allocation and revert back to the macromanager. In such a case, the higher the target priority, the more it should cost to switch back to the macromanager to request a new high priority target.

If the stop action is chosen, control reverts back to the macromanager to determine a new target priority allocation. Let the stopping time denote the time (in the th scheduling interval) at which the stop action is chosen. This random variable is a stopping time, i.e., the event that it equals any positive integer is a function of the history, or equivalently of the information state (more formally, it is measurable with respect to the sigma-algebra generated by the history or the information state). With a user-defined economic discount factor, the sample path cost incurred during this interval is

(13)

6) Discounted Cost Stopping Time Problem Formulation: The final ingredient in the multivariate POMDP formulation is the optimization objective. Recall that at the fast time scale in a scheduling interval, the aim is to determine whether to continue or stop with the current target priority. So the objective is to compute the optimal sensor schedule that minimizes the expected discounted cost over the set of admissible control laws defined in (3). That is, with the instantaneous costs defined in (12), and with the expectation taken over sample paths,

(14)

7) Summary: The six ingredients (2), (3), (4), (8), (12), (14) constitute a multivariate POMDP for the sensor micromanagement problem for the targets. The dependence of the costs (12) on the priority allocation vector results in the discounted optimal cost of the POMDP (14) depending on that vector. Indeed, the priority allocation links the micro and macromanagement and is optimized in the macromanager optimization described below. In subsequent sections, we introduce additional structure to the multivariate POMDP so that the optimal policy for micromanagement has a monotone structure.

D. Special Case: Independent Targets With Independent Observations

In the special case that the targets are mutually independent, each target evolves as a Markov chain with its own transition matrix

(15)

The joint initial distribution then factors into per-target initial distributions. If the measurements of the targets are also mutually independent, then we observe

(16)

Note that the individual observation distributions can be multivariate; for example, position and velocity measurements of a target. When (15) and (16) hold, the a posteriori distribution for each target is computed via the per-target hidden Markov Bayesian filter

(17)

Moreover, the joint information state is the Kronecker productof the individual information states

(18)

where is the set of product information states.
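As an illustration of (17)-(18) (with our own, assumed variable names), each independent target can be filtered separately with its own transition and observation matrices, and the composite information state recovered as a Kronecker product:

import numpy as np

def joint_info_state(per_target_estimates):
    # per_target_estimates: list of individual information states, one per target, from (17)
    pi = np.array([1.0])
    for pi_l in per_target_estimates:
        pi = np.kron(pi, pi_l)   # composite estimate is the Kronecker product of the marginals, (18)
    return pi

Each per-target estimate can be propagated with the same update as in the sketch following (11), applied to that target's own transition and observation matrices.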

E. Macromanagement: Target Priority Allocation

Having formulated the micromanagement problem as a multivariate POMDP, we now return briefly to the formulation of the macromanagement policy. Given that the th interval has been completed by the micromanagement (inner level) policy, control is returned to the macromanager (operating on the slow time scale), which decides the next target to assign highest priority; see Fig. 1.

Macromanagement algorithms often involve meta-level rules and are not the focus of this paper. However, what is important for our purposes is that macromanagement rules invariably involve the performance (cost) of the micromanager from the inner level optimization.

The high priority target is chosen according to a meta-level rule based on several extrinsic factors. For example, the macromanager can pick the high priority target based on the track variances (uncertainty), threat levels, and priority allocation vector of the targets

(19)

It is the priority allocation vector that couples the micro and macromanagement problems. For example, the macromanager can optimize the above rule (19) with respect to it as follows:

(20)

(21)

Here the bounds are user-defined macromanager parameters that constrain the priority allocation. In (20), one term is the optimal cost from the micromanager's multivariate POMDP; see (14). The priority function in (20) determines how the different targets are allocated priority. Naturally it should be an increasing function of its argument. The following result is proved in Appendix B-1.

Lemma 1: If the instantaneous cost in (12) is concave and decreasing (elementwise) in the priority allocation, then the optimal micromanagement cost is concave and decreasing in the priority allocation. Therefore, if the priority function in (20) is concave, then (20), (21) is equivalent to optimizing a concave function over the convex set (21).

Therefore, optimizing the priority allocation boils down to optimizing a concave function (20) over a convex set (21). We can then apply the result in [11, Theorem 3, p. 181] that a concave function defined on a bounded closed convex set achieves its global minimum at an extreme point of the convex set. (Recall from convex analysis [11, p. 470] that an extreme point of a convex set is one that cannot be written as a strict convex combination of two distinct points of the set.) Thus it suffices to check the corner points of the convex polytope (21) to compute the optimal target priority allocation vector. The optimal micromanagement cost appearing in (20) will be computed using the POMDP structural results described in Section III.
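The corner-point search can be sketched as follows. Since the exact constraint set (21) is not reproduced here, the sketch assumes, purely for illustration, a box constraint on each priority; it enumerates the extreme points (the box corners) and keeps the one with the smallest micromanager cost, which is where a concave objective over a bounded convex set attains its minimum. The callable micromanager_cost is a stand-in for the optimal POMDP cost appearing in (20).

from itertools import product

def best_priority_allocation(micromanager_cost, num_targets, r_min=0.1, r_max=1.0):
    # micromanager_cost(r): concave, decreasing objective from (20) (Lemma 1)
    # Enumerate the 2^L corners of the assumed box constraint r_min <= r_l <= r_max.
    best_r, best_cost = None, float("inf")
    for r in product([r_min, r_max], repeat=num_targets):
        cost = micromanager_cost(r)
        if cost < best_cost:
            best_r, best_cost = r, cost
    return best_r, best_cost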

1) Respite: Due to the priority allocation vector, the macromanagement and micromanagement policies interact. Note that the macromanagement rule (19) or the priority function in (20) can be quite general – their precise form is unimportant for our purposes. Also, instead of the linear constraints in (21), any constraint that restricts the priority allocation to a closed bounded convex region will work – although then one would need to compute the extreme points of the convex set. In the rest of the paper, we focus on efficient structural solutions for the micromanagement POMDP problem.

F. Two-Time Scale Integrated Radar Manager and Tracking Protocol

To conclude this section, the various steps of the macro and micromanagement described above are summarized in Protocol 1. These steps are depicted pictorially in Fig. 1.

Fig. 1. Schematic of the two-time scale sensor management scheme proposed in the paper. At the end of the scheduling interval, the outer level macromanager chooses the next high priority target based on the priority allocation vector, the cost accrued at the micromanager, and other extrinsic information. The inner micromanager then coordinates the tracking of the targets and decides how long to maintain the current priority allocation. When the micromanager decides to terminate the current priority, the scheduling interval is completed and control is returned to the macromanager. The micromanagement optimization is formulated as a multivariate POMDP. The goal of this paper is to show that (i) the optimal micromanagement policy for the multivariate POMDP has a simple monotone (threshold) structure that can be efficiently computed, and (ii) optimizing with respect to the priority allocation vector is equivalent to optimizing a concave function over a convex polytope as described in Section II-E.

Protocol 1: Two-Time Scale Radar Resource Management

Step 0. Initialization: Initialize the a posteriori probabilities of the multitarget distribution and set the scheduling interval index to its initial value.

Step 1. Macromanagement: Target Selection: At each scheduling interval, given the previous track estimates of all targets, the sensor manager:
1) Selects the priority allocation action using the macromanagement policy, see (19).
2) Initializes the fast time scale clock and the information state for the new interval.

Step 2. Micromanagement multivariate POMDP (fast time scale): Given the current Bayesian a posteriori estimate defined in (7):
1) Scheduling action: Choose the scheduling action (continue or stop), see (3).



2) Instantaneous Cost: An instantaneous cost is incurred, see (12).
3) Target Evolution: Each target's state trajectory evolves to the next fast time instant according to (2) or (15).
4) Target Observations and Tracking: Based on the action chosen in Step 2.1:
• If the action is to continue, then continue tracking with the current target priority allocation:
— On the fast time scale within the scheduling interval, for all targets:
• Record the multivariate observation vector, see (4) or (16).
• Update the Bayesian track estimate using (8) or (17).
— Increment the fast time index and go to Step 2.1.
• If the action is to stop, then end the current scheduling interval:
— Record the Bayesian estimate at the end of the interval.
— The total cost incurred for the scheduling interval is defined in (13).
— Increment the scheduling interval index and go back to the sensor manager in Step 1 for a new priority allocation.
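The control flow of Protocol 1 can be summarized by the following sketch. All of the callables (macro_policy, micro_policy, step_env, update, stop_cost) are placeholders for the quantities defined above, not functions defined in the paper.

def run_protocol(pi0, macro_policy, micro_policy, step_env, update, stop_cost,
                 num_intervals, rho):
    # pi0: initial composite information state (Step 0); rho: discount factor in (13)
    # macro_policy(pi) -> priority allocation          (Step 1, e.g. rule (19))
    # micro_policy(pi, r) -> "continue" or "stop"      (Step 2.1, e.g. a threshold policy)
    # step_env(r) -> (observation, running cost)       (target evolution and measurement)
    # update(pi, r, y) -> new information state        (Bayesian filter (8) or (17))
    # stop_cost(pi, r) -> cost of the stop action      (terminal term in (13))
    pi, total = pi0, 0.0
    for _ in range(num_intervals):                     # slow time scale: scheduling intervals
        r, discount = macro_policy(pi), 1.0
        while micro_policy(pi, r) == "continue":       # fast time scale within the interval
            y, c = step_env(r)                         # Steps 2.2-2.4: cost, evolution, observation
            pi = update(pi, r, y)
            total += discount * c
            discount *= rho
        total += discount * stop_cost(pi, r)           # interval ends; control returns to Step 1
    return total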

III. MICROMANAGEMENT AS A MULTIVARIATE POMDP WITH THRESHOLD POLICY

Consider the micromanagement problem represented by the multivariate POMDP with objective function (14). For a fixed priority allocation vector, the optimal stationary policy and the associated optimal cost are the solution to Bellman's equation

(22)

Recall the sample path cost is defined in (13). Since the information state space of a POMDP is an uncountable set, the dynamic programming recursion (22) does not translate into a practical solution methodology, as the value function needs to be evaluated at every point of an uncountable set; see [7] for details. In our multivariate POMDP, the state space dimension is exponential in the number of targets, so applying value iteration is numerically intractable. The rest of this section focuses on the structure of the POMDP. Theorem 1 shows that under suitable conditions, the optimal scheduling policy is a simple threshold policy. We then develop novel parameterizations of this threshold curve and give stochastic optimization algorithms to compute them efficiently. For the reader interested in practical implementation, Part (iii) of Theorem 1 is important (it establishes the existence of a threshold curve for the optimal policy), along with Theorems 2, 3, 4, and Algorithm 2.
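As a sketch of the standard dynamic programming equation for a discounted optimal-stopping POMDP of the form (22) (the symbols below are generic names, not necessarily the paper's notation):

\[
V(\pi) = \min\Big\{ C(\pi,\mathrm{stop}),\; C(\pi,\mathrm{continue}) + \rho \sum_{y} \sigma(\pi,y)\, V\big(T(\pi,y)\big) \Big\},
\qquad
\mu^{*}(\pi) = \arg\min\{\cdot\},
\]

where T(pi, y) denotes the Bayesian filter update (8), sigma(pi, y) is the probability of observing y given the current information state, and rho is the discount factor.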

A. Main Result: Existence of Threshold Policy for Multivariate POMDP

First we list the assumptions for correlated targets with correlated measurements. Assume that, for any fixed priority allocation, the following conditions hold for the multivariate POMDP (14).

Authorized licensed use limited to: The University of British Columbia Library. Downloaded on November 12, 2009 at 20:35 from IEEE Xplore. Restrictions apply.

Page 7: 3954 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 57, …vikramk/KD09.pdf · 3954 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 57, NO. 10, OCTOBER 2009 OptimalThresholdPoliciesforMultivariatePOMDPs

3960 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 57, NO. 10, OCTOBER 2009

A1) The instantaneous costs are decreasing in the composite state with respect to the componentwise partial order, i.e., a componentwise larger composite state incurs a cost no larger. (Here one composite state dominates another in the componentwise partial order if every coordinate is at least as large.)

A2) The composite transition probability matrix in (2) of the targets is MTP2; see Definition 2.

A3) For each action and observation, the multivariate observation probabilities are MTP2 (jointly in the state and observation).

S) The costs are submodular: the advantage of stopping over continuing is monotone in the composite state (componentwise order).

A special case of the above assumptions is independent targets (15), independent target-wise observations (16), and separable costs. For convenience we list these assumptions.

A1') Assume the cost is a separable cost function for the targets of the form

(23)

where each term denotes the cost for an individual target, and assume each per-target cost is decreasing in that target's state,

(24)

A2') The transition probability matrix of each target defined in (15) is MTP2 (see Definition 1).

A3') Each target's observation probability matrix is MTP2.

S') For the separable costs in A1'), the per-target costs are submodular for each target.

Examples of the above conditions in radar management are discussed in Section IV. The main result below shows that the optimal micromanagement policy is monotonically increasing in the information state with respect to the TP2 ordering on lines in the information state simplex; see Appendix A for definitions.

Theorem 1 (Existence of Threshold Policy for Sensor Micromanagement): Consider the multivariate POMDP (2), (3), (4), (8), (12), (14) and any fixed target priority allocation. Then:

i) Dependent Targets: Under A1), A2), A3), S), the optimal policy is TP2 increasing on lines in the information state simplex (see Definition 5). That is, if one information state dominates another in this ordering, the optimal action at the dominating state is at least as large.

ii) Independent Targets: Under A1'), A2'), A3'), S'), the optimal policy is TP2 increasing on curves in the product information state space (see Definition 5).

iii) As a consequence, in both cases there exists a curve (which we call a "threshold curve") that partitions the information state space into two individually connected regions such that the optimal action is to continue on one region and to stop on the other:

(25)

Moreover, one of these regions is convex, and therefore the threshold curve is continuous and differentiable almost everywhere (the set of points where it is not differentiable has measure zero).

Under the conditions of Theorem 1, the optimal scheduling policy for the multivariate POMDP is a threshold policy with a threshold curve that partitions the information state space. Note that without these conditions, the optimal policy of the multivariate POMDP can be an arbitrarily complex partition of the simplex – and solving such a multivariate POMDP is computationally intractable. The convexity claim in statement (iii) of the theorem follows from the clever but elementary observation in [12, Lemma 1]. It is well known [11] that a convex function is continuous and differentiable almost everywhere.

B. Characterization of Best Linear and Multilinear Threshold

Due to the existence of a threshold curve, computing the optimal policy reduces to estimating this threshold curve. In general, any user-defined basis function approximation can be used to parameterize this curve. However, any such approximation needs to capture the essential feature of Theorem 1: the parameterized optimal policy needs to be TP2 increasing.

In this section, we derive linear and multilinear approximations to the threshold curve. Such linear/multilinear thresholds have two attractive properties: i) estimating them is computationally efficient, and ii) we give novel conditions on the threshold coefficients that are necessary and sufficient for the resulting linear/multilinear threshold policy to be TP2 increasing on lines. Due to the necessity and sufficiency of the condition, optimizing over the space of linear/multilinear thresholds yields the "best" linear/multilinear approximation to the threshold curve.

1) Dependent Targets: We start with the following definition of a linear threshold policy. For any fixed target priority allocation, we define the linear threshold policy as the two-case rule

(26)

which continues if the information state lies on one side of the hyperplane defined by the threshold coefficients (with the right-hand side normalized to 1) and stops otherwise. Here the threshold vector (with nonnegative elements) denotes the vector of coefficients of the linear threshold policy.

Theorem 2 characterizes the best linear hyperplane approximation to the threshold curve.

Theorem 2 (Best Linear Threshold Policy): Assume conditions A1), A2), A3), S) hold for the multivariate POMDP (14). Then for any fixed target priority allocation:

i) The linear threshold policy defined in (26) is TP2 increasing on lines if and only if

(27)

ii) Therefore, the optimal linear threshold approximation to the threshold curve of Theorem 1 is the solution of the following constrained optimization problem:

(28)

Authorized licensed use limited to: The University of British Columbia Library. Downloaded on November 12, 2009 at 20:35 from IEEE Xplore. Restrictions apply.

Page 8: 3954 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 57, …vikramk/KD09.pdf · 3954 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 57, NO. 10, OCTOBER 2009 OptimalThresholdPoliciesforMultivariatePOMDPs

KRISHNAMURTHY AND DJONIN: MULTIVARIATE POMDPS IN RADAR RESOURCE MANAGEMENT 3961

where the cost in (28) is obtained as in (14) by applying the threshold policy in (26).

Remarks:

i) Best linear threshold: Note that the constraints in (28) are necessary and sufficient for the linear threshold policy (26) to be TP2 increasing on lines. That is, (28) defines the set of all TP2 increasing linear threshold policies—it does not leave out any TP2 increasing policies, nor does it include any non-TP2 increasing policies. Therefore, optimizing the POMDP over the space of TP2 increasing linear threshold policies yields the best linear approximation to the threshold curve.

ii) Elements of the threshold vector: To make the threshold vector parametrization unique, we have incorporated the following steps. The term '1' on the right-hand side (RHS) of (26) [and also in (30)] is without loss of generality; otherwise one could scale both sides of these equations, resulting in nonuniqueness. The requirement that the threshold vector is nonnegative is without loss of generality since a positive vector with identical elements can always be added to make it nonnegative.
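A minimal sketch of the linear threshold policy (26) is given below. Because the inequality direction in (26) and the exact feasibility constraints (27) depend on the paper's notation, the orientation chosen here (stop once the linear combination reaches 1) and the nonnegativity check are illustrative assumptions only.

import numpy as np

def linear_threshold_policy(pi, theta):
    # pi    : composite information state; theta : nonnegative threshold coefficient vector
    # The right-hand side is fixed to 1 to make the parameterization unique (Remark ii).
    return "stop" if float(np.dot(theta, pi)) >= 1.0 else "continue"

def is_admissible(theta):
    # Nonnegative coefficients, as required by the parameterization in (26).
    return bool(np.all(np.asarray(theta) >= 0))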

2) Independent Targets: The main point in Theorem 3 below is that for independent but nonidentical targets, we can construct a lower dimensional threshold as the best multilinear approximation of the threshold curve. Define the threshold vector as the concatenation of per-target sub-vectors,

(29)

The elements of each sub-vector are associated with one target. The dimension of this threshold is much smaller than the dimension of the linear threshold for dependent targets in Theorem 2 above. Then for any fixed target priority allocation, define the multilinear threshold policy as the two-case rule

(30)

which continues or stops depending on which side of the multilinear threshold surface the product information state lies. To make the threshold parameterization unique, we need to disallow multiplying the sub-vector of one target by a constant and dividing the sub-vector of another target by the same constant (otherwise the policy would remain the same for different threshold vectors). So we assume that

(31)

Theorem 3 (Best Multilinear Threshold Policy for Heterogeneous Independent Targets): Assume conditions A1'), A2'), A3'), and S') hold for the multivariate POMDP (14). Then for any fixed target priority allocation:

i) The multilinear threshold policy defined in (30) is TP2 increasing on lines if and only if

(32)

ii) Therefore, the optimal multilinear threshold approximation to the threshold curve of Theorem 1 is the threshold vector that solves the constrained optimization problem

(33)

In (33), the cost is obtained as in (14) by applying the multilinear threshold policy in (30).

Remark. Independent Homogeneous Targets: In contrast to the above two theorems, Theorem 4 below deals with the extreme case of independent and identical targets, measurements, and costs. That is,

(34)

Assume all elements of the priority allocation vector are equal. Since all targets are identical, the multilinear threshold policy coefficients in (30) are the same for all targets. Therefore, it suffices to pick a single per-target threshold vector

(35)

Compare this with the larger threshold dimension for independent nonidentical targets.

Theorem 4 (Best Multilinear Threshold Policy for Identical Independent Targets): Assume conditions A1'), A2'), A3'), and S') hold for the multivariate POMDP (14) and that the target dynamics, observations, and costs are identical (34). Then Theorem 3 holds. The optimal threshold vector is the solution of the constrained optimization problem

(36)

C. Algorithm to Compute the Optimal Multilinear Threshold Policy

Having proved the existence of a threshold policy, Algorithm 2 below presents a simulation-based optimization method to estimate the threshold vector. The computational cost of the algorithm at each iteration is linear in the dimension of the threshold and is independent of the size of the observation alphabet in (5). Recall that, in comparison, for an unstructured multivariate POMDP (i.e., a POMDP that does not satisfy the conditions of Theorem 1) with exponentially many states and observations, the problem is completely intractable.

We focus on estimating the best multilinear threshold policy of Theorem 4 for independent identical targets. Estimating the best linear threshold for the dependent case of Theorem 2 and the independent heterogeneous case of Theorem 3 is similar, and details are omitted. It is impossible to analytically compute the expected cost (36) of a POMDP, so we resort to sample-path based simulation optimization to estimate the threshold: for each batch, evaluate the sample path cost by simulating the multivariate POMDP in Step 2 of Protocol 1 (see (13)). The aim is:

(37)

The above constrained stochastic optimization problem (37) can be converted into an equivalent unconstrained one using the following parameterization. Consider an unconstrained vector with component sub-vectors of dimensions identical to those in (29), and set the threshold coefficients as

(38)

Since the square of a number is nonnegative, the resulting coefficients automatically satisfy constraint (36). The equivalent unconstrained stochastic optimization problem is

(39)

where the cost is computed using the multilinear threshold policy with coefficients computed from (38). Algorithm 2 below presents a policy gradient algorithm that generates a sequence of estimates which converges to a local minimum of the best multilinear threshold or, equivalently (using transformation (38)), of the unconstrained parameterization.

Algorithm 2: Policy Gradient Algorithm for Sensor Micromanagement and Independent Targets

Assume the multivariate POMDP parameters satisfy A1'), A2'), A3'), S'), so that the optimal policy is a multilinear threshold via Theorem 3.

Step 1: Choose initial threshold coefficients and the corresponding multilinear threshold policy.

Step 2: For each iteration:
• Evaluate the sample cost using Step 2 of Protocol 1. Compute the gradient estimate by perturbing the threshold coefficients along a random direction whose components take each of two opposite values with probability 0.5, and differencing the two resulting sample costs.
• Update the threshold coefficients via the gradient step (with a user-chosen step size)

(40)

In Step 2, the initial information state can be chosen arbitrarily, since by definition a stationary policy does not depend on the initial condition (but of course, the cost does). We use the Simultaneous Perturbation Stochastic Approximation (SPSA) algorithm [13] due to its computational efficiency. The SPSA algorithm [13] picks a single random direction along which the derivative is evaluated at each batch. The main advantage of SPSA is that evaluating the gradient estimate in (40) requires only two POMDP simulations, i.e., the number of evaluations is independent of the dimension of the parameter vector. Because the stochastic gradient algorithm (40) converges to local optima, it is necessary to try several initial conditions.
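A sketch of one SPSA iteration of Algorithm 2 is shown below under our own naming; sample_cost is assumed to simulate Step 2 of Protocol 1 and return the sample-path cost (13), and the fixed step sizes are placeholders for the parameters in (40).

import numpy as np

def spsa_step(phi, sample_cost, delta=0.1, step=0.01, rng=None):
    # phi : current unconstrained threshold parameter vector, cf. (38)-(39)
    rng = np.random.default_rng() if rng is None else rng
    d = rng.choice([-1.0, 1.0], size=phi.shape)                   # simultaneous perturbation, prob. 0.5 each
    cost_plus = sample_cost(phi + delta * d)                      # only two POMDP simulations per iteration,
    cost_minus = sample_cost(phi - delta * d)                     # independent of the dimension of phi
    grad = (cost_plus - cost_minus) / (2.0 * delta) * (1.0 / d)   # SPSA gradient estimate
    return phi - step * grad                                      # descent update, cf. (40)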

Theorem 5: The sequence generated by the threshold-based policy gradient Algorithm 2 converges, as the number of iterations grows, to a local optimum of the cost defined in (37) with probability one (w.p.1).

Note that for a fixed threshold, the sample costs in (37) are simulated independently and have identical distribution. The proof of convergence for stochastic gradient algorithms in Theorem 5 is straightforward; see [14].

IV. DISCUSSION OF ASSUMPTIONS

This section discusses the assumptions required for Theorems 1, 2, 3, 4 to work in radar resource management. Next, we show that univariate stochastic orderings (such as the MLR order in [3]) cannot deal with the multivariate POMDP. Finally, we give some elementary conditions to avoid degeneracy.

1) Assumption A1), A1’): Suppose the states de-notes decreasing distance of the target to a base-station, sois the closest and 1 is the farthest distance. Then the closer thetarget, the higher the threat level, and there is more incentiveto track it. Then A1) means that the reward (negative of cost)for tracking the target is smallest when it is at maximum dis-tance. This is natural since the further away the target the lowerthe threat. Similarly, if states denote increasing co-variances of the target estimate from the tracker, then the largerthe covariance, the higher the incentive of the radar manager totrack the target.

2) Assumption A2) and A2’): For dependent targets, A2) canmodel convoy behavior. For simplicity consider targets,so , . Define the convoy transition prob-ability matrix

whereifif orifotherwise.

(41)

Here are probabilities such that to be a validtransition probability matrix. It is easily shown that issufficient for to be MTP2.

For independent targets, if a target is at some state at the current time, it is reasonable to assume that at the next time it is either still in that state or, with a lesser probability, in one of the two neighboring states. Each target's dynamics can then be modeled as a finite state Markov chain with a tridiagonal transition probability matrix. As shown in [15, pp. 99–100], a necessary and sufficient condition for a tridiagonal transition matrix to be TP2 is that the product of each pair of adjacent diagonal entries dominates the product of the corresponding off-diagonal entries. Such a diagonally dominant tridiagonal matrix satisfies Assumption A2').

We can consider more complex target dynamics that include independent deviations about a convoy. For example, take the composite transition matrix to be the product of the convoy matrix (41) and the composite transition probability matrix of independent targets. This models targets that move in a convoy but independently deviate from this correlated trend. Then if both factor matrices are MTP2, so is their product (since the product of MTP2 matrices is MTP2 [16, Proposition 3.4]). Thus A2) holds.
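The TP2 property of a transition (or observation) matrix can be checked numerically from its definition: every 2x2 minor taken from rows i < j and columns k < l must be nonnegative. The brute-force check below (our own sketch) can be applied to the tridiagonal and convoy examples above.

import numpy as np
from itertools import combinations

def is_tp2(A, tol=1e-12):
    # A is TP2 if A[i,k]*A[j,l] - A[i,l]*A[j,k] >= 0 for all row pairs i < j and column pairs k < l.
    A = np.asarray(A, dtype=float)
    for i, j in combinations(range(A.shape[0]), 2):
        for k, l in combinations(range(A.shape[1]), 2):
            if A[i, k] * A[j, l] - A[i, l] * A[j, k] < -tol:
                return False
    return True

# Example: a diagonally dominant tridiagonal chain of the kind discussed for A2')
P = np.array([[0.8, 0.2, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8]])
assert is_tp2(P)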

3) Assumption A3): Several observation probability models satisfy the TP2 ordering A3); see [16]. For example, suppose the targets are independent and each sensor measures the target in quantized Gaussian noise. Then the observation probabilities are [see (17)]

(42)

Here the noise variance of the sensor reflects the quality of its measurements, and the observation denotes, for example, the quantized distance of the target to the base-station. It is easily verified that A3') holds. The ordering is consistent with our discussion in A1'), where state 1 was the farthest distance and the largest state was the closest distance.

As another example [9], consider independent targets and suppose that the distance the sensors report is never more than one discrete location away from the true distance. The observation probabilities then place mass on the true state and its two neighbors according to a target mobility parameter, and are zero otherwise:

(43)

Then it is easily shown that any admissible choice of the mobility parameter implies A3') holds.
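As an illustration of the quantized-Gaussian model referenced in (42) (with our own illustrative parameterization and names), one can build an observation matrix whose rows are indexed by the true quantized distance and whose columns are indexed by the reported distance; a smaller noise variance corresponds to the higher quality sensor used for the high priority target. Its TP2 property (A3')) can be verified numerically with the is_tp2 check above.

import numpy as np

def quantized_gaussian_obs(num_states, sigma):
    # B[x, y] proportional to exp(-(y - x)^2 / (2 sigma^2)), rows normalized to sum to one.
    grid = np.arange(num_states)
    B = np.exp(-(grid[None, :] - grid[:, None]) ** 2 / (2.0 * sigma ** 2))
    return B / B.sum(axis=1, keepdims=True)

B_high_priority = quantized_gaussian_obs(5, sigma=0.5)   # accurate sensor for the high priority target
B_low_priority = quantized_gaussian_obs(5, sigma=2.0)    # coarser sensor for the remaining targets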

4) Assumption S), S’): The difference in rewards betweendeploying an accurate estimator versus a less accurate estimatorshould increase as the threat level goes up. This gives economicincentive for the radar manager scheduler to pick the more ac-curate action when the target is close or the threat is high. So if

denote decreasing distance of the target, this immedi-ately translates to S’). Similarly, if denote increasingtrack covariance of the target , then more incentive should begiven to track an uncertain target with an accurate sensor. As-sumption S) is less restrictive than that in [8] where it is requiredthat , .

5) Why Univariate Orderings Will Not Work: Consider the following highly simplified version of our multivariate POMDP formulation. Suppose we wish to schedule amongst two sensors for a single target with two states. If the measurements were univariate, then as described in Section I, one can use the results of [3] and [17] to prove the existence of a threshold policy for a univariate POMDP. In such a case, the multivariate TP2 assumption A3) specializes to the univariate MLR ordering of the observation likelihoods.

Can we use the univariate MLR ordering (see Appendix, Definition 1) of [3], [17] for multivariate POMDPs? The answer, even for the following highly simplified case, is "No." Consider the simplified model above, but suppose now that at each time the measurement at each sensor is two dimensional, with the two components conditionally independent given the state. Then even if the individual components satisfy the corresponding univariate ordering, it is straightforward to check that this does not imply the required ordering for the two-dimensional measurement (which, due to the assumed conditional independence, factors into the product of the component likelihoods). Therefore, even in the highly simplified setting of a single target and conditionally independent multivariate measurements, the univariate MLR ordering does not apply, and the results in [3], [17] cannot be used. Hence for multiple targets with multivariate observations, we have no choice but to work with the TP2 multivariate ordering.

6) Conditions to Avoid Degeneracy: We wish to avoid the uninteresting case where the micromanager always picks the continue action indefinitely for a target. In such a case the scheduling interval in Protocol 1 does not terminate and the macromanager is irrelevant. By A1), a simple sufficient condition on the costs (involving the discount factor) avoids this case.

Also, we wish to avoid the trivial case where it is always optimal to stop at the first time instant (see Step 4 of Protocol 1); then the micromanagement problem would be degenerate. In light of A1), a sufficient condition ruling this out can also be stated in terms of the costs. To summarize, the following conditions on the cost function avoid degeneracy:

(44)

V. NUMERICAL RESULTS

We present two examples below, one with independent targets, the other with targets in a convoy.

Example 1: Independent Targets: We consider independent Markovian targets, each with a finite number of states. The states correspond to the quantized distance of each target. So the composite state space is enormous and, without structural results, the resulting POMDP is completely intractable. Below we construct the POMDP to satisfy structural assumptions A1'), A2'), A3'), and S') of Theorems 1 and 3. Since the targets are independent and identical, we chose a simple macromanagement policy of choosing the high priority target as the target with the largest covariance; see (19).

Target Dynamics: Assume all targets have the same tridiagonal transition matrix. This satisfies the simple sufficient condition we gave in Section IV for condition A2') to hold.

Multivariate Target Measurements: Denote the target observation set as in (5). For each target we chose the observation probabilities in (16) as a tridiagonal matrix whose diagonal entry is 0.99 for the high priority target and 0.6 for the other targets. As described in Section II-C, this means that in each scheduling interval, the high priority target is tracked with detection probability 0.99 while the remaining two low priority targets are tracked with detection probability 0.6 each. It is easily checked that A3') holds.

Tracking Cost: Given the large number of possible choicesthat satisfy assumption A1’), to give a succinct representation ofthe cost and provide some intuition, we parameterize it in termsof two scale factors and . For fixed target priority allocation

, we chose the tracking cost for each target as

(45)

For the remaining targets, we chose, where . As motivated in

Section IV, we chose to decrease with andsubmodular so that A1’) and S’) hold. Finally we chose thePOMDP discount factor .

The scale factor determines how much more expensive it isto track a high priority target compared to a low priority target.

determines how much more expensive it is to stop the currentpriority allocation compared to continuing. The choice of in(45) leads to interesting interaction between the macro and mi-cromanagers. By choosing smaller , the macromanager allowsfor more frequent target priority re-assignment and increasedexploration of other target trajectories. We chose ,

.Computing Best Multilinear Threshold Policy: Given that the

POMDP satisfies A1’), A2’), A3’), S’), Theorem 1 implies theexistence of an optimal threshold policy. Also because the tar-gets are independent and identical, Theorem 4 applies implyingthat the best multilinear policy approximation of dimension

to the optimal threshold curve can be constructed. Next we use the policy gradient Algorithm 2 to estimate this multilinear threshold policy. The sample path cost of the POMDP was evaluated according to (13). The SPSA algorithm parameters in Algorithm 2 were chosen as , , . As shown in Fig. 2, the SPSA algorithm converges to the optimal multilinear threshold within 1500 iterations.
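For readers unfamiliar with SPSA, the following is a minimal sketch of a generic simultaneous perturbation stochastic approximation step of the kind used to tune a parameterized threshold policy. It is not a reproduction of Algorithm 2: the gain constants, decay exponents, and the cost simulator sample_path_cost are placeholders.

    import numpy as np

    def spsa_step(theta, sample_path_cost, k, a=0.05, c=0.1, rng=np.random):
        # One SPSA iteration: estimate the gradient of the simulated
        # discounted cost from two perturbed policy evaluations.
        a_k = a / (k + 1) ** 0.602          # decreasing step size
        c_k = c / (k + 1) ** 0.101          # decreasing perturbation size
        delta = rng.choice([-1.0, 1.0], size=theta.shape)  # Bernoulli +/- 1
        cost_plus = sample_path_cost(theta + c_k * delta)
        cost_minus = sample_path_cost(theta - c_k * delta)
        grad_est = (cost_plus - cost_minus) / (2.0 * c_k * delta)
        return theta - a_k * grad_est       # gradient descent on the cost

    # Hypothetical usage: theta parameterizes the multilinear threshold and
    # sample_path_cost simulates Protocol 1, returning the discounted cost.
    # theta = np.zeros(dim)
    # for k in range(1500):
    #     theta = spsa_step(theta, sample_path_cost, k)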

How well does the optimal multilinear policy compare with simple heuristic policies? We implemented Protocol 1 with the optimal multilinear micromanagement policy and compared it with the performance of a myopic micromanagement policy and a periodic micromanagement policy. To estimate the expected cost, the simulation was run for 500 scheduling intervals (slow

Fig. 2. Performance of the policy gradient Algorithm 2 for the multivariate POMDP of Example 1 comprising states. The POMDP satisfies the structural assumptions of Theorem 1 and therefore the optimal policy is a threshold. The optimal multilinear policy (characterized by a dimension threshold vector given by Theorem 3) was computed using the SPSA algorithm in Algorithm 2. This policy is compared with a heuristic myopic policy and a periodic policy with period . The improvement is substantial.

time-scale). The myopic policy picks scheduling intervals and the macromanager chooses the target that has minimum

expected cost as the high priority target . The periodic policy with period picks scheduling intervals and

reverts to the macromanager, which picks using (19). As can be seen from Fig. 2, the performance of the optimal multilinear policy is significantly better than the myopic and periodic policies (up to a threefold reduction in total cost).

Example 2: Correlated Targets in Convoy: In this example, we solve the micromanagement POMDP for dependent targets in a convoy together with optimization of the target priority allocation in (20), (21) by the macromanager. We consider correlated targets with states each corresponding to 5 levels of quantized variance of the target tracking algorithm; such an approach is used in [10]. (This variance is straightforwardly computed via a Riccati-type equation if a Kalman filter-based tracker is used; otherwise it can be estimated via simulation.) The composite state space has 25 elements. We now verify conditions A1), A2), A3), S) so that Theorem 1 and Theorem 2 apply.

The composite transition probability matrix (described in Section IV-A-2) is chosen as

follows. The components of are chosen as tridiagonal with for target 1 and for target 2. The elements of in (41) are , ,

. It is easily checked that is MTP2 (see discussion following (41)). Therefore is MTP2 and A2) holds.

We chose the observation probabilities for each target as when for and 0.6

for , implying that A3) holds. The immediate costs of applying the

actions are chosen as follows. If the macromanager chooses target to track with high priority, then for


state of target 1, and for target 2, the costs are

Similarly, the costs incurred when target is given priority are

We chose , (see discussion in Example 1). It is easily checked that A1) and S) hold.

Given that A1), A2), A3), and S) hold, Theorem 1 implies the optimal policy is a threshold, and Theorem 2 states that the optimal linear approximation to the threshold is of dimension

. We used the procedure outlined in Protocol 1 together with the policy gradient Algorithm 2 to solve the micromanagement POMDP for the optimal linear threshold. We then optimized the priority allocation vector for the macromanager. The constraints (21) on the target priority vector are chosen as

(46)

and define a convex polytope. This target priority region effectively constrains the priority allocation of any given target to be no less than 0.2.

For illustrative purposes, we chose a macromanagement policy that picks the high priority target with probability

. Therefore, the higher the priority of a target, the more likely it is to be chosen by the macromanagement policy, but it is also more costly to apply the precise sensor to that target. As explained in Lemma 1, the optimal priority choice is straightforwardly computed by checking the corners of the convex polytope (46). Let denote the optimal discounted cost of the combined macro- and micromanagement policies, i.e., the RHS of (20). (The discounted cost was computed by averaging over a uniformly distributed initial information state within the simplex and using the optimal linear threshold for the POMDP for each initial state.) Our simulations show that for the corner point the optimized discounted cost is , for ,

and for , . Therefore, the best target priority vector among the three corner points is and, according to Lemma 1, that is also the best choice within the convex target priority region (46). Our simulations also show that for any target priority vector, the optimized discounted cost using the optimal linear threshold policy in Theorem 2 is improved by more than over the discounted cost for a randomly chosen linear threshold.
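Since the macromanager's objective is concave over the polytope (46) (Lemma 1), the search reduces to evaluating the discounted cost at each vertex and keeping the minimum. The following minimal sketch illustrates this; evaluate_discounted_cost is a hypothetical routine that simulates Protocol 1 for a given priority vector, and the corner list is a placeholder rather than the actual constraint set (46).

    def best_priority_vector(corners, evaluate_discounted_cost):
        # Evaluate the simulated discounted cost at each corner of the
        # feasible polytope and return the minimizing priority vector.
        costs = {tuple(r): evaluate_discounted_cost(r) for r in corners}
        best = min(costs, key=costs.get)
        return best, costs

    # Placeholder corner list for a two-target priority region with each
    # priority at least 0.2 (illustrative only):
    # corners = [(0.2, 0.2), (0.8, 0.2), (0.2, 0.8)]
    # best, costs = best_priority_vector(corners, evaluate_discounted_cost)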

In Fig. 3, we present a contour plot of the optimized discounted cost over the feasible polytope (46). Each point in the plot was obtained by simulating Protocol 1 (which involved solving the multivariate POMDP using Algorithm 2). As expected from Lemma 1, the optimal target priority allocation vector is one of the corner points of (46).

Fig. 3. Contour plots of the optimized discounted cost over the feasible region (46). The optimal target priority vector is .

VI. CONCLUSION

We presented a stochastic control framework for radar resource management. More specifically, the micromanagement problem deals with scheduling the optimal Bayesian filter while the macromanagement problem deals with allocating target priority. The main result of the paper was to show that micromanagement can be formulated as a multivariate POMDP whose solution has a special threshold structure. As a result, linear and multilinear approximations to this threshold curve can be computed efficiently via stochastic approximation algorithms. We also showed that the macromanagement optimization problem involves minimizing a concave objective over a convex polytope, which can be done efficiently. In characterizing the threshold policy for the multivariate POMDP, we developed novel variations of the multivariate TP2 stochastic order. The structural results of this paper are class-type results; that is, the results hold for all parameters belonging to a specified set. Hence, there is an inherent robustness to these results: even if the underlying POMDP parameters are not exactly specified but still belong to the appropriate sets, the structural results still hold. In recent work, we have used results similar to those in this paper to show that the Gittins index in POMDP multiarmed bandits has a monotone structure. We refer to [18] for related results.

Due to their generality, the results in this paper can be straightforwardly applied to other applications such as dynamic spectrum access in cognitive radio systems. In this setup, denotes the quality of the spectral gaps at time ; this quality evolves according to a finite state Markov chain and reflects the activity of primary and secondary users. Also, denotes the measured quality as a result of spectrum sensing. Since we allow for correlated components in and , our formulation allows for correlated spectrum sensing and quality of spectral gaps. Then the structural results in the paper readily apply to determine the optimal spectrum access policy for a


cognitive radio; see also [22], where a partially observed global game-theoretic formulation is presented.

APPENDIX

A. Multivariate TP2 Stochastic Ordering and Submodularity

A crucial step in the proof of the main result, Theorem 1, involves proving that the optimal policy is monotonically increasing in the information state . In order to compare multivariate information states and , we will use the multivariate totally positive of order 2 (TP2) stochastic ordering. The multivariate TP2 stochastic order introduced below and its univariate version, the MLR stochastic order, are ideal for Bayesian information states since these orders are preserved after conditioning on any information [17]; see also [16], [19], [20] for comprehensive treatments.

To introduce the definition of the TP2 order, let and denote the indices of two

-variate probability mass functions. Denote the element-wise minimum and maximum vectors

(47)

Definition 1 (TP2 Ordering and MLR Ordering): Let

and denote any two -variate probability mass functions. Then if . If and

are univariate, then this definition is equivalent to the MLR ordering, denoted .

MLR is a partial order, and it is not always possible to order any two information states . Also, unlike the MLR order, in general the TP2 order is not reflexive, i.e.,

does not hold. This introduces additional complications when dealing with multivariate POMDPs.

Definition 2 (MTP2 Reflexive Distributions): A multivariate distribution is said to be multivariate TP2 (MTP2) if holds, i.e., . If

are scalar indices, this is equivalent to saying that a matrix is MTP2 if all second-order minors are nonnegative.
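To make Definition 2 concrete, the following minimal sketch checks the MTP2 inequality, namely that the probability of the element-wise maximum times the probability of the element-wise minimum dominates the product of the two probabilities, for a probability mass function stored on a multidimensional grid. When the pmf is bivariate (a matrix), this reduces to the second-order minor test mentioned above. The example pmf is arbitrary and for illustration only.

    import itertools
    import numpy as np

    def is_mtp2(p, tol=1e-12):
        # Check p(x v y) * p(x ^ y) >= p(x) * p(y) for all index pairs,
        # where v / ^ are the element-wise max / min on the index lattice.
        idx = list(itertools.product(*[range(n) for n in p.shape]))
        for x, y in itertools.combinations(idx, 2):
            hi = tuple(np.maximum(x, y))
            lo = tuple(np.minimum(x, y))
            if p[hi] * p[lo] < p[x] * p[y] - tol:
                return False
        return True

    # For a matrix (bivariate pmf) this is the second-order minor test:
    p = np.array([[0.3, 0.1], [0.1, 0.5]])
    print(is_mtp2(p))  # True for this example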

We are interested in the following two subsets of the composite information state space : the product state space

[defined in (18)] and

(48)

Note: Because every product state is TP2 reflexive

. The following lemma summarizes several properties of the

TP2 order that we will use in our proofs.

Lemma 2:

a) For all , .

b) If , then under A2), A3), the information state trajectory , computed via the Bayesian estimator (8), satisfies .

c) , implies that all information states on the line connecting to are reflexive and TP2 orderable. That is, for any , is reflexive and

satisfies . Therefore

if .
d) Any product information state is

TP2.
e) If , , then

.

TP2 Ordering Over Lines and Multilinear Curves:

Although the TP2 ordering over is used in [8], it is a stronger condition than we require and it does not yield a constructive procedure to implement a threshold scheduling policy for a multivariate POMDP. Below we define two novel versions of the TP2 ordering, namely, TP2 ordering over lines and TP2 ordering over multilinear curves. These TP2 orderings lie at the heart of our assumptions and lead to threshold policies for the multivariate POMDP.

TP2 Ordering on Lines: For the dependent target case, define the dimensional simplex comprising

with last element . That is

(49)

For each , construct the line that connects to

. Thus, comprises information states of the form

(50)

Definition 3 (TP2 Ordering on Lines): is greater

than with respect to the TP2 ordering on the line , denoted , if for some ,

i.e., , are on the same line connected to , and .

A nice property of is that if is TP2

reflexive, then all points on the line between and are TP2 orderable and TP2 reflexive; see Lemma 2.

TP2 Ordering on Curves: This will be used for our proofs when dealing with independent targets. Let denote the set of product information states where, for at least one of the individual information states, the th element is zero. That is

(51)

Next, for each define the curve (parameter-

ized by the scalar variable ) as

(52)

The nice property of the above representation is that we use a single parameter for each , i.e., we do not require

different variables . This considerably simplifies our analysis.

Definition 4 (TP2 Ordering on Curves): is greater

than with respect to the TP2 ordering on the curve , denoted , if for some

, i.e., , are on the same curve connected to , and.

Lemma 3: By construction, each curve has two useful properties:


1. It comprises only product information states, i.e., . It is easily shown that the union of all

such curves covers every product information state in the simplex, i.e., .

2. is a TP2 increasing trajectory, i.e., is TP2 increasing

as goes from zero to one. (The proof follows directly from property (v) of Lemma 2.)

Definition 5 (TP2 Increasing Function on Lines or Curves):

Let denote two composite information states. Let , denote the TP2 order on lines (see Definitions

3 and 4). A function is said to be TP2 increasing on lines (curves) in if (respectively, )

implies .

Definition 6 (Submodular Function):

or is said to be submodular if , for ,

or .

In simple terms, a submodular function of two variables , has decreasing differences [21]. The most important feature of a submodular function is that

increases in its argument . Such “complementarity” is widely used in economics [21].
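In standard notation (writing $f(\pi,u)$ for a generic cost with a two-valued action, rather than the paper's symbols), submodularity can be expressed as the decreasing-differences condition

$$ f(\pi, u=2) - f(\pi, u=1) \ \text{ is decreasing in } \pi, $$

so that if action 2 is optimal at some $\pi$, it remains optimal at any larger $\pi$; hence the minimizer $u^{*}(\pi) = \arg\min_{u} f(\pi,u)$ is increasing in $\pi$ [21].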

Result 1 ([21]): If or is submodular, then

is TP2 increasing on , i.e., , or

Remark: To motivate the use of submodularity, if we show that in (22) is submodular, then Result 1 implies is TP2 increasing in on the line segments or curves

.

B. Proof of Theorems

1) Proof of Lemma 1: The proof is by mathematical induction on the value iteration algorithm (53). Assume

is concave and decreasing in . Then, since the sum of concave decreasing functions is concave and decreasing,

and are concave and decreasing. Since minimization preserves concavity and the decreasing property,

is concave and decreasing. Thus is concave and decreasing in . Finally, since (via elementary calculus) the product of two nonnegative concave functions is concave provided one of them is increasing and the other is decreasing, the lemma is proved.
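For instance (a simple illustrative check, not from the original text): on $[0,1]$, $f(x)=x$ is nonnegative, concave, and increasing, and $g(x)=1-x$ is nonnegative, concave, and decreasing; their product $f(x)g(x)=x-x^{2}$ has second derivative $-2<0$ and is indeed concave.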

2) Key Theorem: We start with the following key theorem proved on lines . The proof for curves is similar and omitted. For notational convenience we omit the target index subscript .

Theorem 6: The following properties hold for the multivariate POMDP.

1) Under A1), is TP2 decreasing.
2) Under A1), A2), A3), is TP2 decreasing.
3) Under A1), A2), A3), S), is submodular with respect to .

Thus the optimal policy is TP2 increasing on lines.

Proof of Part 2: The proof is by mathematical induction on the value iteration recursion:

(53)

It can be shown [17] that the value iteration algorithm converges, i.e., uniformly in .

Choose as an arbitrary TP2 decreasing function of in (53). Consider (53) at any stage . Assume that

is TP2 decreasing in . Consider . Denote the

optimal actions for these states as and . From [8, Theorem 4.2] it follows that under A2) and A3), the term

is TP2 decreasing in . From Part 1, under A1), is TP2 decreasing. Since the sum of decreasing functions is decreasing, the result follows.

Proof of Part 3: From Definition 6, showing that is submodular requires showing that is TP2 decreasing on lines. From Part 2, is TP2 decreasing over lines if A1), A2), A3) hold. So to prove is submodular, we only need to show that is TP2 decreasing over lines. By a proof similar to Part 1, is decreasing over lines if S) holds. Thus is submodular on lines, and Result 1 implies that the optimal policy is TP2 increasing on lines.

3) Proof of Theorem 1: With the above key theorem, we can now prove Theorem 1.

Part 3 in the above proof establishes the first claim of Theorem 1 for dependent targets. The proof of the second claim for independent targets is similar and omitted. We now prove the third claim of Theorem 1. For each (49), construct the line segment connecting to as in (50). Part 2 of Theorem 6 says that is monotone for . There are two possibilities:

i) There is at least one reflexive information state on the line apart from . In this case, pick the reflexive state with the smallest ; call this state . Then, by Lemma 2, on the line segment connecting

, all information states are TP2 orderable and reflexive. Moving along this line segment towards , pick the largest for which the . The information state corresponding to this is the threshold information state; denote it by , where .

ii) There is no reflexive state on apart from . In this case, define the threshold information state arbitrarily. It is irrelevant since, from Lemma 2, the trajectory of all information states is TP2 reflexive. The above construction implies that on , there is a unique threshold point . Note that the entire simplex can be covered by considering all pairs of lines , for , i.e.,

. Combining all points for all pairs of lines , , yields a unique threshold curve in denoted .


4) Proof of Theorem 2: Given with

, we need to prove that the linear threshold policy satisfies iff , . Note

that means that ,

and .

Necessity: We show that if for ,

then is TP2 increasing on lines . Note that from (26), is of the same sign as

for all . Therefore, implies . That is, implies , which

implies that is TP2 increasing on lines .

Sufficiency: Suppose is TP2 increasing on lines. We

need to prove . From (26), for , since

is TP2 increasing, it follows that . But this is equivalent to for all . Since , the previous expression is positive iff

for all . This implies for .

5) Proof of Theorem 3: Here we use the TP2 ordering over

curves of (52) and Lemma 3. In particular, for defined in (51) and , consider the product information states

Necessity: For information states of the above form, we show that if for , then the multilinear threshold policy in (30) is TP2 increasing on curves . Note that from (30)

where is a -order polynomial function of . Since , all the coefficients of the polynomial are

nonnegative. Therefore implies . But from Lemma 3, implies . That is,

implies , which implies that is TP2 increasing on curves .

Sufficiency: Suppose is TP2 increasing on curves . We need to prove . From (30), for

, since is TP2 increasing, it follows that . This is equivalent to

(54)

We show that it is impossible to satisfy the above inequality for all if . Therefore, (54) implies that

.

REFERENCES

[1] J. P. Le Cadre and S. Laurent-Michel, “Optimizing the receiver maneuvers for bearings only tracking,” Automatica, vol. 35, no. 4, pp. 591–606, Apr. 1999.

[2] V. Krishnamurthy, “Algorithms for optimal scheduling and management of hidden Markov model sensors,” IEEE Trans. Signal Process., vol. 50, no. 6, pp. 1382–1397, Jun. 2002.

[3] V. Krishnamurthy and D. Djonin, “Structured threshold policies for dynamic sensor scheduling-A partially observed Markov decision process approach,” IEEE Trans. Signal Process., vol. 55, no. 10, pp. 4938–4957, Oct. 2007.

[4] R. Evans, V. Krishnamurthy, and G. Nair, “Networked sensor management and data rate control for tracking maneuvering targets,” IEEE Trans. Signal Process., vol. 53, no. 6, pp. 1979–1991, Jun. 2005.

[5] S. Ji, R. Parr, and L. Carin, “Non-myopic multiaspect sensing with partially observed Markov decision processes,” IEEE Trans. Signal Process., vol. 55, no. 6, pp. 2720–2730, Jun. 2007.

[6] W. Moran, S. Suvorova, and S. Howard, “Application of sensor scheduling concepts to radar,” in Foundations and Applications for Sensor Management, A. Hero, D. Castanon, D. Cochran, and K. Kastella, Eds. New York: Springer-Verlag, 2006, pp. 221–256.

[7] W. S. Lovejoy, “A survey of algorithmic methods for partially observed Markov decision processes,” Ann. Operat. Res., vol. 28, pp. 47–66, 1991.

[8] U. Rieder, “Structural results for partially observed control models,” Methods Models of Operat. Res., vol. 35, pp. 473–490, 1991.

[9] A. R. Cassandra, “Exact and approximate algorithms for partially observed Markov decision process,” Ph.D. dissertation, Brown Univ., Providence, RI, 1998.

[10] S. Blackman and R. Popoli, Design and Analysis of Modern Tracking Systems. Norwood, MA: Artech House, 1999.

[11] D. G. Luenberger, Linear and Nonlinear Programming, 2nd ed. Reading, MA: Addison-Wesley, 1984.

[12] W. S. Lovejoy, “On the convexity of policy regions in partially observed systems,” Operat. Res., vol. 35, no. 4, pp. 619–621, Jul.–Aug. 1987.

[13] J. Spall, Introduction to Stochastic Search and Optimization. New York: Wiley, 2003.

[14] G. Pflug, Optimization of Stochastic Models: The Interface between Simulation and Optimization. Boston, MA: Kluwer Academic, 1996.

[15] F. R. Gantmacher, Matrix Theory. New York: Chelsea, 1960, vol. 2.

[16] S. Karlin and Y. Rinott, “Classes of orderings of measures and related correlation inequalities. I. Multivariate totally positive distributions,” J. Multivar. Anal., vol. 10, pp. 467–498, 1980.

[17] W. S. Lovejoy, “Some monotonicity results for partially observed Markov decision processes,” Operat. Res., vol. 35, no. 5, pp. 736–743, Sep.–Oct. 1987.

[18] V. Krishnamurthy and B. Wahlberg, “POMDP multiarmed bandits – Structural results,” Math. Operat. Res., May 2009.

[19] W. Whitt, “Multivariate monotone likelihood ratio and uniform conditional stochastic order,” J. Appl. Probabil., vol. 19, pp. 695–701, 1982.

[20] A. Muller and D. Stoyan, Comparison Methods for Stochastic Models and Risk. New York: Wiley, 2002.

[21] D. M. Topkis, Supermodularity and Complementarity. Princeton, NJ: Princeton Univ. Press, 1998.

[22] V. Krishnamurthy, “Decentralized spectrum access amongst cognitive radios—An interacting multivariate global game-theoretic approach,” IEEE Trans. Signal Process., vol. 57, no. 10, Oct. 2009.

Vikram Krishnamurthy (S’90–M’91–SM’99–F’05) was born in 1966. He received the Bachelor’s degree from the University of Auckland, New Zealand, in 1988, and the Ph.D. degree from the Australian National University, Canberra, in 1992.

He currently is a professor and Canada Research Chair with the Department of Electrical Engineering, University of British Columbia, Vancouver, Canada. Prior to 2002, he was a Chaired Professor with the Department of Electrical and Electronic Engineering, University of Melbourne, Australia, where he also

served as Deputy Head of that Department. His current research interests include computational game theory and stochastic control in sensor networks, and stochastic dynamical systems for modeling of biological ion channels and biosensors.


Dr. Krishnamurthy has served as an Associate Editor for several journals, including the IEEE TRANSACTIONS AUTOMATIC CONTROL, IEEE TRANSACTIONS ON SIGNAL PROCESSING, IEEE TRANSACTIONS AEROSPACE AND ELECTRONIC SYSTEMS, IEEE TRANSACTIONS CIRCUITS AND SYSTEMS B, IEEE TRANSACTIONS NANOBIOSCIENCE, and Systems and Control Letters. During 2009–2010, he is serving as a distinguished lecturer for the Signal Processing Society. Starting in 2010, he will serve as Editor-in-Chief of the IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING.

Dejan V. Djonin received the B.Sc. and M.Sc. degrees from the University of Belgrade, Serbia, in 1996 and 1999, respectively, and the Ph.D. degree from the University of Victoria, Victoria, BC, Canada, in 2003.

He held an NSERC Postdoctoral Fellowship with the Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, in 2005 and 2006. Currently, he is with Dypative Systems, Vancouver, working on testing equipment for Ev-Do base stations. He is also currently an

Adjunct Professor at the University of British Columbia. His research interests span applications of control theory in multimedia wireless communication systems, and information theory.
