
A CONVEX ANALYTIC APPROACH TO RISK-AWARE MARKOV DECISION PROCESSES*

WILLIAM B. HASKELL AND RAHUL JAIN†

Abstract. In classical Markov decision process (MDP) theory, we search for a policy that, say, minimizes the expected infinite horizon discounted cost. Expectation is, of course, a risk-neutral measure, which does not suffice in many applications, particularly in finance. We replace the expectation with a general risk functional, and call such models risk-aware MDP models. We consider minimization of such risk functionals in two cases: the expected utility framework, and Conditional Value-at-Risk, a popular coherent risk measure. Later, we consider risk-aware MDPs wherein the risk is expressed in the constraints. This includes stochastic dominance constraints and the classical chance-constrained optimization problems. In each case, we develop a convex analytic approach to solve such risk-aware MDPs. In most cases, we show that the problem can be formulated as an infinite-dimensional linear program in occupation measures when we augment the state space. We provide a discretization method and finite approximations for solving the resulting LPs. A striking result is that the chance-constrained MDP problem can be posed as a linear program via the convex analytic method.

Keywords: Markov decision processes, stochastic optimization, risk measures, conditional value-at-risk, stochastic dominance constraints, convex analytic approach.

AMS Subject Classification: 90C40, 91B30, 90C34, 49N15.

1. Introduction. Consider a Markov decision process on a state space $S$, action space $A$, cost function $c(s,a)$, discount factor $\gamma \in (0,1)$, and a transition kernel $Q$. Typically, we want to find an optimal policy $\pi$ that solves $\inf_{\pi} E^{\pi}\left(\sum_{t=0}^{\infty} \gamma^{t} c(s_t, a_t)\right)$. Of course, as we well know, this problem can be solved by dynamic programming. The key is that one can show that there exists an optimal policy that is stationary and Markovian.

However, dynamic programming is not the only method to solve the problem. Alternatives include posing the optimization problem as a linear program as in the 'convex analytic approach'. The convex analytic approach is developed for finite state and action spaces in [6, 13, 25, 31], and for Borel state and action spaces in [22, 23, 34]. In this approach, we introduce an occupation measure $\mu(s,a)$, which for a fixed stationary policy $\pi$ can be interpreted as the discounted empirical frequency of visiting state $s$ and taking action $a$. Thus, we can pose a linear program
$$\inf_{\mu}\left\{ \sum_{(s,a) \in S \times A} \mu(s,a)\, c(s,a) \ \text{ s.t. linear constraints} \right\}$$
whose solution gives the optimal occupation measure from which the optimal policy can be derived.
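To make the occupation-measure viewpoint concrete, the following sketch sets up such a linear program for a small finite MDP and solves it with an off-the-shelf LP solver. The transition matrix, costs, and discount factor are illustrative placeholders, not data from the paper; the balance constraints are the standard ones for discounted occupation measures.

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative finite MDP (2 states, 2 actions); all numbers are made up.
S, A = 2, 2
gamma = 0.9
c = np.array([[1.0, 4.0], [2.0, 0.5]])            # c[s, a]
Q = np.array([[[0.8, 0.2], [0.1, 0.9]],           # Q[s, a, s'] transition kernel
              [[0.5, 0.5], [0.3, 0.7]]])
nu = np.array([0.5, 0.5])                         # initial distribution

# Variables: mu[s, a] flattened; objective <mu, c>.
obj = c.flatten()

# Balance constraints: sum_a mu(j, a) - gamma * sum_{s,a} Q(j|s,a) mu(s,a) = nu(j).
A_eq = np.zeros((S, S * A))
for j in range(S):
    for s in range(S):
        for a in range(A):
            A_eq[j, s * A + a] = (1.0 if s == j else 0.0) - gamma * Q[s, a, j]

res = linprog(obj, A_eq=A_eq, b_eq=nu, bounds=[(0, None)] * (S * A))
mu = res.x.reshape(S, A)
policy = mu / mu.sum(axis=1, keepdims=True)       # conditional distribution mu(a | s)
print("occupation measure:\n", mu, "\npolicy:\n", policy)
```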

Often, however, in multi-stage decision-making, the risk-neutral objective is not enough. A decision maker may be risk-averse and may want to explicitly model his risk aversion. Thus, we are faced with a different optimization problem:
$$\inf_{\pi}\ \rho\left(\sum_{t=0}^{\infty} \gamma^{t} c(s_t, a_t)\right), \qquad (1.1)$$

*This research is supported by the Office of Naval Research Young Investigator Award #N000141210766 and the National Science Foundation CAREER Award #0954116.

†William Haskell is an assistant professor in the ISE department at National University of Singapore. Rahul Jain is an associate professor and the Kenneth C. Dahlberg Early Career Chair in the Departments of EE, CS and ISE at the University of Southern California.




where $\rho$ is a coherent risk measure such as conditional value-at-risk (CVaR). Problem (1.1) presents significant challenges since it need not be convex (depending on the risk measure) and the optimal policy will in general depend on history (rather than just the current state). It follows that the 'principle of optimality' used in writing down the dynamic programming equations does not hold, and thus Problem (1.1) cannot be solved via the dynamic programming method.

Problem (1.1) is not artificial or abstract. There is tremendous interest in taking risk into account in sequential decision-making in areas like finance and insurance, operations management, smart-grid power networks, and even robotics.

In finance, for example, portfolio optimization is really a sequential decision-making problem. Furthermore, investors are rarely risk-neutral. The mean-variance approach of Markowitz [32, 42] has been popular with risk-averse investors. Yet, the financial crisis of 2008 has raised the need to consider other direct measures of risk. The problem of single-stage portfolio optimization with such risk measures (the conditional value-at-risk) was first addressed in [37]. How to do dynamic portfolio optimization with such risk measures has been an open problem. Our convex analytic framework provides a solution.

As mentioned above, Problem (1.1) does not admit an optimal stationary policy, nor a dynamic programming solution. On the other hand, introducing an occupation measure via the convex analytic method ([6, 7]) leads to static optimization problems in a space of occupation measures. We develop this approach for Problem (1.1) and other related problems. This opens up a variety of new modeling options for decision makers seeking risk-aware policies.

We make the following main contributions in this paper. We treat four different risk-aware sequential stochastic optimization formulations. Of these, the expected utility, stochastic dominance, and chance-constrained formulations lead to linear programming problems. We also give a treatment for optimization of risk functionals such as conditional value-at-risk (CVaR), and briefly discuss mean-deviation and mean semi-deviation risk measures. These lead to non-convex optimization problems, albeit over convex feasible regions. In this case, we give a sequence of linear programming approximations that asymptotically yield an optimal solution. Thus, in each case the convex analytic methodology that we use yields a tractable solution. This is based on a state space augmentation approach akin to that used in [4, 5] and finite approximation methods for infinite-dimensional linear programs adapted from [21, 29, 30]. This, when combined with a discretization scheme, allows us to provide a convergence rate.

A striking observation we make is that, unlike the single-stage chance-constrained optimization problem, which is generally non-convex and very difficult to solve [35], the sequential chance-constrained optimization problem can actually be reformulated as a linear programming problem via the convex analytic approach.

Throughout this paper, we encounter a dichotomy between sequential and single-stage risk management. In the static setting, we optimize over a random variable, while in the sequential setting, we have to optimize over a measure. This leads to serious technical difficulties in the sequential setting as compared to the static setting.

Related Literature. There is a substantial body of work on risk measures in the static setting [17, 28, 39]. Risk measures have also been considered in [26, 27], where the expected utility of countable state, finite action MDPs is minimized. In [15], variance penalties for MDPs are considered. In [41], the mean-variance trade-off in MDPs is further explored. [14] shows how to solve the Bellman equation for risk-sensitive control of MDPs. In [5], a finite horizon MDP with a conditional value-at-risk constraint on total cost is considered. Both an offline and an online stochastic approximation algorithm are developed.

The most closely related works to this paper are the following. In [38], the class of Markov risk measures is proposed. This class of risk measures leads to tractable dynamic programming formulations. However, we note that most common risk measures, e.g., CVaR, are not Markov. [3] shows how to minimize the average value-at-risk of costs in MDPs. [18] applies stochastic dominance constraints to the long-run average and infinite horizon discounted reward distributions in MDPs. [43] minimizes the conditional value-at-risk of discounted total cost and provides a static nonlinear programming formulation. Existence of a solution to such a problem is established, though no tractable method is given for its solution. [4] considers minimization of the certainty equivalent of MDP costs, where the classical risk-sensitive MDP is a special case. In [9], numerical methods are developed for risk-aware MDPs with one-step conditional risk measures. Specifically, value and policy iteration algorithms are developed and their convergence is established. Our work is distinct from the above in considering a range of risk-aware sequential optimization formulations, using a convex analytic approach, and in most cases yielding a linear programming problem, or at least an approximation via a sequence of linear programs.

2. Preliminaries. This section reviews standard notation for MDPs and then discusses risk-aware MDPs.

2.1. Discounted MDPs. Consider a discrete time MDP given by the 5-tuple $(S, A, \{A(s) : s \in S\}, Q, c)$. The state space $S$ and the action space $A$ are Borel spaces, measurable subsets of complete and separable metric spaces, with corresponding Borel $\sigma$-algebras $\mathcal{B}(S)$ and $\mathcal{B}(A)$. We use $\mathcal{P}(S)$ to denote the space of probability measures over $S$ with respect to $\mathcal{B}(S)$, and we define $\mathcal{P}(A)$ analogously. For each state $s \in S$, the set $A(s) \subset A$ is a measurable set in $\mathcal{B}(A)$ and it indicates the set of feasible actions available in state $s$. We assume that the multifunction $s \to A(s)$ permits a measurable map $\phi : S \to A$ such that $\phi(s) \in A(s)$.

The set of feasible state-action pairs is given by $\mathbb{K} = \{(s,a) \in S \times A : a \in A(s)\}$, with corresponding Borel $\sigma$-algebra $\mathcal{B}(\mathbb{K})$. The transition law $Q$ governs the system evolution. For $B \in \mathcal{B}(S)$, $Q(B \,|\, s, a)$ is the probability of next visiting the set $B$ given the current state-action pair $(s,a)$. Finally, $c : \mathbb{K} \to \mathbb{R}$ is a measurable cost function that depends on state-action pairs. We will emphasize cost throughout (rather than reward) because risk functions are typically defined for losses.

We now describe two classes of policies for MDPs. Let $H_t$ be the set of histories at time $t$, with $H_0 = S$, $H_1 = \mathbb{K} \times S$, and $H_t = \mathbb{K}^{t} \times S$ for all $t \geq 2$. A specific history $h_t \in H_t$ records the state-action pairs visited at times $0, 1, \ldots, t-1$ as well as the current state $s_t$. Define $\Pi$ to be the set of all history-dependent randomized policies: collections of mappings $\pi = (\pi_t)_{t \geq 0}$ where $\pi_t : H_t \to \mathcal{P}(A)$ for all $t \geq 0$. Given a history $h_t \in H_t$ and a set $B \in \mathcal{B}(A)$, $\pi_t(B \,|\, h_t)$ is the probability of selecting an action in $B$. Define $\Pi_0$ to be the class of stationary randomized Markov policies: mappings $\pi : S \to \mathcal{P}(A)$ which only depend on history through the current state. For a policy $\pi \in \Pi_0$, a given state $s \in S$, and a set $B \in \mathcal{B}(A)$, $\pi(B \,|\, s)$ is the probability of choosing an action in $B$. The class $\Pi_0$ is a subset of $\Pi$, and we explicitly assume that $\Pi$ and $\Pi_0$ only include feasible policies that respect the constraints $\mathbb{K}$.

The canonical measurable space of MDP trajectories is $(\Omega, \mathcal{B}) = (\mathbb{K}^{\infty}, \mathcal{B}(\mathbb{K})^{\infty})$, and specific trajectories are written as $\omega \in \Omega$. The state and action at time $t$ are denoted $s_t$ and $a_t$, respectively. Formally, for a trajectory $\omega \in \Omega$, $s_t(\omega)$ and $a_t(\omega)$ are the state and action at time $t$ along this trajectory. With respect to an initial distribution $\nu \in \mathcal{P}(S)$, any policy $\pi \in \Pi$ determines a probability measure $P^{\pi}$ on $(\Omega, \mathcal{B})$ and a corresponding stochastic process $\{(s_t, a_t)\}_{t \geq 0}$. The resulting probability space is $(\Omega, \mathcal{B}, P^{\pi})$. The expectation operator with respect to $P^{\pi}$ is denoted $E^{\pi}[\cdot]$. For discount factor $\gamma \in (0,1)$, consider $C(\pi, \nu) = E^{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} c(s_t, a_t)\right]$. We assume that costs $c$ are bounded above and below throughout this paper, which is one way of ensuring that $C(\pi, \nu)$ is well-defined.

Assumption 2.1. There exist $\underline{c}$ and $\bar{c}$ such that $0 < \underline{c} \leq c(s,a) \leq \bar{c} < \infty$ for all $(s,a) \in \mathbb{K}$.

This assumption streamlines our presentation and is reasonable in practice. For example, in a real newsvendor problem there are limits on how much we can order in any one period, and thus the maximum possible order cost is bounded in each period. Under Assumption 2.1, the inequalities $0 < \underline{c}/(1-\gamma) \leq \sum_{t=0}^{\infty} \gamma^{t} c(s_t, a_t) \leq \bar{c}/(1-\gamma) < \infty$ hold for all trajectories $\omega \in \Omega$. We use $\mathbb{Y} := [0, \bar{c}/(1-\gamma)]$ to denote the interval in which the running costs $\sum_{t=0}^{T} \gamma^{t} c(s_t, a_t)$ lie for all finite horizons $T$. This interval will appear quite often.

The classical infinite horizon discounted cost minimization problem is
$$\inf_{\pi \in \Pi} C(\pi, \nu). \qquad (2.1)$$
It is well known that a stationary policy in $\Pi_0$ is optimal for Problem (2.1) under suitable conditions (for example, this result is found in [36] for finite and countable state spaces, and in [22, 23] for general Borel state and action spaces).

2.2. Risk-aware MDPs. Problem (2.1) has a risk-neutral objective, but sometimes decision makers may be risk-averse or risk-seeking. We now introduce risk-aware extensions of Problem (2.1). It is convenient to introduce a fixed reference probability space, since the underlying probability space $(\Omega, \mathcal{B}, P^{\pi})$ for MDP trajectories is changing as $\pi$ varies. Consider the probability space $([0,1], \mathcal{B}([0,1]), P)$ with primitive uncertainties denoted by $\xi \in [0,1]$, where $\mathcal{B}([0,1])$ is the Borel $\sigma$-algebra on $[0,1]$ and $P$ is the uniform distribution on $[0,1]$. We define $\mathcal{L} = \mathcal{L}_{\infty}([0,1], \mathcal{B}([0,1]), P)$ to be the space of essentially bounded random variables on $([0,1], \mathcal{B}([0,1]), P)$. Random variables defined on $[0,1]$ with support in $\mathbb{Y}$, such as the infinite horizon discounted cost $\sum_{t=0}^{\infty} \gamma^{t} c(s_t, a_t)$, are included in $\mathcal{L}$. Recall that two random variables $X$ and $Y$ are equal in distribution, written $X =_d Y$, if $\Pr\{X \leq \eta\} = \Pr\{Y \leq \eta\}$ for all $\eta \in \mathbb{R}$. Let $C^{\pi} \in \mathcal{L}$ be defined by
$$\Pr\{C^{\pi} \leq \eta\} = P^{\pi}\left\{ \sum_{t=0}^{\infty} \gamma^{t} c(s_t, a_t) \leq \eta \right\}, \quad \forall \eta \in \mathbb{R},$$
i.e., $C^{\pi}$ is a random variable that is equal in distribution to $\sum_{t=0}^{\infty} \gamma^{t} c(s_t, a_t)$ on $\mathbb{R}$ when the underlying probability distribution on trajectories is $P^{\pi}$. All $C^{\pi}$ have support contained in the interval $\mathbb{Y}$.

Risk functions $\rho$ will be mappings $\rho : \mathcal{L} \to \mathbb{R}$. Now, consider a fixed risk function, a mapping $\rho : \mathcal{L} \to \mathbb{R}$ which associates a scalar with random variables in $\mathcal{L}$. We will make the following assumption about the risk measures.

Assumption 2.2. Risk measures are law invariant, i.e., $\rho(X) = \rho(Y)$ for all $X =_d Y$.

Most common risk measures, and all of the risk measures we consider in this paper, are law invariant. By assuming law invariance, we are simply saying that we only care about the distribution of costs on $\mathbb{R}$. We do not care about properties of the underlying probability space on which these random variables are defined.



Under this assumption, the random variables $\{C^{\pi}\}_{\pi \in \Pi}$ can be used in place of the measurable mapping $\sum_{t=0}^{\infty} \gamma^{t} c(s_t, a_t)$, since a law invariant risk function $\rho$ will not distinguish between them because they have the same distribution on $\mathbb{R}$ by definition. A natural risk-aware extension of Problem (2.1) is then
$$\inf_{\pi \in \Pi} \rho(C^{\pi}). \qquad (2.2)$$

3. State space augmentation. This section describes our general approach for solving Problem (2.2) based on state space augmentation and convex analytic methods. Note that history dependence shows up in Problem (2.2): it is to be expected that some information about the history $h_t$ will have to be appended to the current state $s_t$ at all times to get a near-optimal risk-aware policy. The idea behind state space augmentation is to partially or totally capture the history (see [3, 4, 5]), so that this information is available for decision-making.

In this section, we first augment the state space to keep track of the running cost over the entire time horizon. Then, we argue that the infinite horizon problem can be approximated with arbitrary precision by a finite horizon problem. Furthermore, the finite horizon problem can be solved exactly with convex analytic methods. In particular, we will estimate $C^{\pi}$ for a given $\pi$ with another random variable that is distributed according to an occupation measure. With this method we can, in principle, approximate any instance of Problem (2.2) with a static optimization problem.

In [4], $S$ is augmented with two new state variables to keep track of the running cost and the discounting. We borrow the augmented state space from [4] and denote it as $\widehat{S} := S \times \mathbb{Y} \times (0,1]$, where the first component is the state in the original MDP, the second component keeps track of the running cost (recall $\mathbb{Y} = [0, \bar{c}/(1-\gamma)]$ is the support of the running cost $y_t = \sum_{i=0}^{t} \gamma^{i} c(s_i, a_i)$ for all $t \geq 0$), and the third component adjusts for the discounting. A state in $\widehat{S}$ looks like $(s_t, y_t, z_t)$, where we continue to use $s_t$ to denote the original state at time $t$, $y_t$ is the running cost at the beginning of time $t$, and $z_t$ is the discount factor for time $t$. The set of feasible actions in state $(s, y, z) \in \widehat{S}$ only depends on $s$, so we define the augmented set of feasible state-action pairs to be $\widehat{\mathbb{K}} := \{(s, y, z, a) \in \widehat{S} \times A : a \in A(s)\}$, along with its Borel $\sigma$-algebra $\mathcal{B}(\widehat{\mathbb{K}})$, and we assume $\widehat{\mathbb{K}}$ is closed in $\widehat{S} \times A$. Occupation measures for the solution of Problem (2.2) will be defined on $\widehat{\mathbb{K}}$.

The evolution of the augmented state $\{(s_t, y_t, z_t)\}_{t \geq 0}$ is as follows. The process $\{s_t\}_{t \geq 0}$ still evolves as per the original transition kernel $Q$: $s_{t+1} \sim Q(\cdot \,|\, s_t, a_t)$ for all $t \geq 0$, and its evolution does not depend on the running costs or the discounting. The running costs $\{y_t\}_{t \geq 0}$ evolve deterministically according to $y_{t+1} = z_t\, c(s_t, a_t) + y_t$ for all $t \geq 0$. The discount factors $\{z_t\}_{t \geq 0}$ also evolve deterministically according to $z_{t+1} = \gamma\, z_t$ for all $t \geq 0$. The state variable $\{z_t\}_{t \geq 0}$ is just a geometric series in place to make sure the running costs are updated correctly. We initialize $y_0 = 0$ since no costs have been assessed before time $t = 0$. Also, we initialize $z_0 = 1$ so that the costs $c(s_0, a_0)$ at time $t = 0$ are not discounted. Note then that $y_t = \sum_{i=0}^{t-1} \gamma^{i} c(s_i, a_i)$ for all $t \geq 1$. We emphasize that the augmented state variables $y_t$ and $z_t$ are completely determined by the history $h_t \in H_t$. We let $\widehat{Q}$ be the transition kernel for the augmented state $\{(s_t, y_t, z_t)\}_{t \geq 0}$, defined by
$$\widehat{Q}\left(B,\ z\, c(s,a) + y,\ \gamma z \,\middle|\, s, y, z, a\right) = Q(B \,|\, s, a), \quad \forall B \in \mathcal{B}(S),\ \forall (s, y, z, a) \in \widehat{\mathbb{K}}.$$
We emphasize again that the augmented state variables $(y_{t+1}, z_{t+1})$ are deterministic functions of $(s_t, y_t, z_t, a_t)$ for all $t \geq 0$.
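The augmented dynamics are simple enough to simulate directly. The following sketch (with an arbitrary placeholder policy and the same illustrative MDP data as before) rolls out the augmented state $(s_t, y_t, z_t)$; it is only meant to show that $y_t$ accumulates the discounted running cost and $z_t$ tracks the current discount weight.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, T = 2, 2, 0.9, 50
c = np.array([[1.0, 4.0], [2.0, 0.5]])
Q = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])

def rollout(policy, s0):
    """Simulate the augmented state (s_t, y_t, z_t); policy maps (s, y, z) -> action probs."""
    s, y, z = s0, 0.0, 1.0                   # y_0 = 0, z_0 = 1
    for t in range(T):
        a = rng.choice(A, p=policy(s, y, z))
        y, z = z * c[s, a] + y, gamma * z    # deterministic updates of (y, z)
        s = rng.choice(S, p=Q[s, a])         # s_{t+1} ~ Q(. | s_t, a_t)
    return y                                  # y_T approximates the infinite horizon cost

uniform = lambda s, y, z: np.ones(A) / A      # placeholder augmented-state policy
print(rollout(uniform, s0=0))
```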



Now we describe a new class of policies for use in tandem with the augmented state space. Let $\Pi_1$ be the class of augmented stationary randomized Markov policies: mappings $\pi : \widehat{S} \to \mathcal{P}(A)$ which depend on history only through the current augmented state $(s, y, z)$. Policies $\pi \in \Pi_1$ are allowed to use the running cost and the discount level to make decisions. Because the running cost and discount factor are functions of the history $h_t$, we consider $\Pi_1$ to be a subset of the set of all policies $\Pi$, and we see that $\Pi_0 \subset \Pi_1 \subset \Pi$, where we view the set of stationary Markov policies on the original state space $\Pi_0$ as a subset of $\Pi_1$.

On a trajectory $\omega \in \Omega$, $y_t(\omega)$ and $z_t(\omega)$ are the running cost and discount factor at time $t$. Our earlier initial state distribution $\nu$ used to define Problem (2.1) can be extended to an initial state distribution on the augmented state space. We denote this initial distribution on the augmented state space by $\nu$ as well, since the initial conditions on $y_0$ and $z_0$ are deterministic, i.e., $\nu(S \times \{0\} \times \{1\}) = 1$, corresponding to $y_0 = 0$ and $z_0 = 1$. Along with the (augmented) initial distribution $\nu$, a policy $\pi \in \Pi$ gives a probability measure $P^{\pi}$ on $(\Omega, \mathcal{B})$ that determines a corresponding stochastic process $\{(s_t, y_t, z_t, a_t)\}_{t \geq 0}$ on the augmented set of state-action pairs.

Now that we have the augmented state space in place, we will discuss a general method for approximating Problem (2.2). Our approximation scheme is based on the intuition, confirmed in the following lemma, that the running cost $y_t$ is a good estimate of the infinite horizon cost $\sum_{t=0}^{\infty} \gamma^{t} c(s_t, a_t)$ on every trajectory $\omega \in \Omega$ for large enough $t$. The following result uses boundedness of costs.

Lemma 3.1. For any $\epsilon > 0$, there is a $T = T(\epsilon)$ such that
$$\left| y_t(\omega) - \sum_{i=0}^{\infty} \gamma^{i} c(s_i(\omega), a_i(\omega)) \right| < \epsilon, \quad \forall \omega \in \Omega,$$
for all $t \geq T$.

Lemma 3.1 justifies our interest in the running cost $y_T = \sum_{t=0}^{T} \gamma^{t} c(s_t, a_t)$ at time $T$ for large enough $T$. We now compare the risk of the finite horizon cost $y_T$ versus the infinite horizon cost $\sum_{t=0}^{\infty} \gamma^{t} c(s_t, a_t)$ in terms of $\rho$. In the next assumption, we will abuse notation and write $\rho(y_T)$ and $\rho\left(\sum_{t=0}^{\infty} \gamma^{t} c(s_t, a_t)\right)$ to denote the risk function $\rho$ from Problem (2.2) evaluated at $y_T$ and $\sum_{t=0}^{\infty} \gamma^{t} c(s_t, a_t)$ when the underlying probability space is $(\Omega, \mathcal{B}, P^{\pi})$, i.e., the underlying policy is implicit.

Assumption 3.2. For any $\epsilon > 0$, there is a $T$ such that
$$\left| \rho(y_T) - \rho\left(\sum_{t=0}^{\infty} \gamma^{t} c(s_t, a_t)\right) \right| \leq \epsilon, \quad \forall \pi \in \Pi.$$

Assumption 3.2 amounts to uniform continuity of $\rho$ in the running cost, across policies. It means that when $T$ is large and $y_T$ is close to $\sum_{t=0}^{\infty} \gamma^{t} c(s_t, a_t)$ almost surely across policies $\pi \in \Pi$, then the risk $\rho(y_T)$ is close to $\rho\left(\sum_{t=0}^{\infty} \gamma^{t} c(s_t, a_t)\right)$ across policies $\pi \in \Pi$. The key in Assumption 3.2 is that the error guarantee does not depend on the policy $\pi \in \Pi$. We will show that it is easy to guarantee that Assumption 3.2 holds for the specific risk-aware MDPs that we consider in Section 5.

Under Assumption 3.2, we can justify working with a truncation $\sum_{t=0}^{T} \gamma^{t} c(s_t, a_t)$ of the infinite horizon discounted cost. In analogy with $C^{\pi}$, we define $C^{\pi}_{\nu, T} \in \mathcal{L}$ to satisfy $\Pr\{C^{\pi}_{\nu, T} \leq \eta\} = P^{\pi}\left\{\sum_{t=0}^{T} \gamma^{t} c(s_t, a_t) \leq \eta\right\}$ for all $\eta \in \mathbb{R}$, so that $C^{\pi}_{\nu, T}$ has the same distribution on $\mathbb{R}$ as the finite horizon cost $\sum_{t=0}^{T} \gamma^{t} c(s_t, a_t)$ at time $T$ with respect to $P^{\pi}$. The next lemma considers the quality of $\rho(C^{\pi}_{\nu, T})$ versus $\rho(C^{\pi})$; it follows immediately from law invariance and Assumption 3.2.



Lemma 3.3. Suppose Assumption 3.2 holds. Then, for any $\epsilon > 0$, there is a $T$ such that $|\rho(C^{\pi}_{\nu, T}) - \rho(C^{\pi})| \leq \epsilon$ for all $\pi \in \Pi$.

The preceding Lemma 3.3 shows how we are using Assumption 3.2 and why we assumed law invariance of $\rho$. We have defined $C^{\pi}_{\nu, T}$ to be equal in distribution to $y_T$ when the underlying probability distribution is $P^{\pi}$. Lemma 3.3 confirms that the risk of $C^{\pi}_{\nu, T}$, namely $\rho(C^{\pi}_{\nu, T})$, is close to the risk of the infinite horizon discounted cost $C^{\pi}$, namely $\rho(C^{\pi})$, uniformly across policies. We needed the error guarantee on $|\rho(C^{\pi}_{\nu, T}) - \rho(C^{\pi})|$ to be independent of $\pi \in \Pi$.

Since $C^{\pi}_{\nu, T}$ approximates $C^{\pi}$ well for large $T$, we can approximate Problem (2.2) with the truncated problem
$$\inf_{\pi \in \Pi} \rho\left(C^{\pi}_{\nu, T}\right). \qquad (3.1)$$
Since we have truncated the planning horizon, Problem (3.1) turns out to be exactly solvable with convex analytic methods, because we can compute the distribution of $y_T$ exactly for any finite $T$, as we will show. We can use a solution of Problem (3.1) to get a near-optimal solution for Problem (2.2), as confirmed in the following lemma. However, the best we can hope for is a near-optimal policy in $\Pi_1$, but not in $\Pi_0$ (which is defined over the unaugmented state space), for general $\rho$. Denote $\rho^{*} := \inf_{\pi \in \Pi} \rho(C^{\pi})$.

Lemma 3.4. Choose any $\epsilon > 0$. Then, there is a $T$ such that:
(i) $\inf_{\pi \in \Pi} \rho\left(C^{\pi}_{\nu, T}\right) < \rho^{*} + \epsilon$;
(ii) For $\pi$ with $\rho\left(C^{\pi}_{\nu, T}\right) \leq \inf_{\pi \in \Pi} \rho\left(C^{\pi}_{\nu, T}\right) + \epsilon$, we have $\rho(C^{\pi}) < \rho^{*} + 3\epsilon$.

Proof. (i) Choose $T$ such that $|\rho(C^{\pi}_{\nu, T}) - \rho(C^{\pi})| \leq \epsilon/2$ for all $\pi \in \Pi$, and then choose $\pi'$ such that $\rho\left(C^{\pi'}\right) < \rho^{*} + \epsilon/2$. We are guaranteed that such a $\pi'$ exists by the definition of infimum. It follows that
$$\rho\left(C^{\pi'}_{\nu, T}\right) \leq \rho\left(C^{\pi'}\right) + \left|\rho\left(C^{\pi'}_{\nu, T}\right) - \rho\left(C^{\pi'}\right)\right| \leq \rho^{*} + \epsilon.$$
(ii) Now, for $\pi$ with $\rho\left(C^{\pi}_{\nu, T}\right) \leq \inf_{\pi \in \Pi} \rho\left(C^{\pi}_{\nu, T}\right) + \epsilon$, it follows that
$$\rho\left(C^{\pi}\right) \leq \rho\left(C^{\pi}_{\nu, T}\right) + \left|\rho\left(C^{\pi}\right) - \rho\left(C^{\pi}_{\nu, T}\right)\right| \leq \inf_{\pi \in \Pi} \rho\left(C^{\pi}_{\nu, T}\right) + 2\epsilon \leq \rho^{*} + 3\epsilon,$$

where the second inequality uses part (i).

Next we show that the distribution of $C^{\pi}_{\nu, T}$ can be computed exactly with convex analytic methods for any $T$. The idea is to modify the transition kernel so that it no longer updates the running cost after time $T$, i.e., $y_{t+1} = y_t$ for all $t \geq T$ and thus $y_t = y_T$ for all $t \geq T$. Then, we use the conditional distribution of $\mu^{\pi}$ on $\{t \geq T\}$ to compute the distribution of $C^{\pi}_{\nu, T}$ on $\mathbb{R}$. Note that we can use the augmented state variable $\{z_t\}_{t \geq 0}$ to know when we have reached time $T$, i.e., when $z_t = \gamma^{T}$, so that we do not have to explicitly keep track of time and augment the state even further. The modified transition kernel is
$$\widehat{Q}_T\left(B, y', z' \,\middle|\, s, y, z, a\right) = \begin{cases} Q(B \,|\, s, a), & y' = z\, c(s,a) + y,\ z' = \gamma z,\ z > \gamma^{T}, \\ Q(B \,|\, s, a), & y' = y,\ z' = z,\ z = \gamma^{T}, \\ 0, & \text{otherwise}, \end{cases}$$
for any $B \in \mathcal{B}(S)$. The transition kernel $\widehat{Q}_T$ is very close to $\widehat{Q}$; however, it stops updating the running cost and the discount factor once $z_t = \gamma^{T}$ is reached at time $T$.
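As a minimal illustration of this truncation, the helper below applies the deterministic part of the modified kernel: it keeps updating $(y, z)$ while $z > \gamma^T$ and freezes them once $z = \gamma^T$. The function name and tolerance are our own; the paper only specifies the update rule itself.

```python
def truncated_update(y, z, cost, gamma, T, tol=1e-12):
    """Deterministic (y, z) part of the modified kernel: freeze once z reaches gamma**T."""
    if z > gamma ** T + tol:           # before time T: accumulate cost and keep discounting
        return z * cost + y, gamma * z
    return y, z                         # at time T and afterwards: (y, z) stay frozen
```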



We now introduce the machinery for occupation measures in order to compute the distribution of $C^{\pi}_{\nu, T}$, starting with the necessary functional spaces. Let $\mathcal{M}(\widehat{\mathbb{K}})$ be the space of finite signed measures on $(\widehat{\mathbb{K}}, \mathcal{B}(\widehat{\mathbb{K}}))$ in the total variation norm $\|\mu\|_{\mathcal{M}(\widehat{\mathbb{K}})} = \int_{\widehat{\mathbb{K}}} |\mu|(d(s,y,z,a))$, and let $\mathcal{M}_{+}(\widehat{\mathbb{K}})$ be the set of positive measures. The upcoming occupation measures for our convex analytic approach will belong to $\mathcal{M}(\widehat{\mathbb{K}})$. Additionally, we will need the space dual to $\mathcal{M}(\widehat{\mathbb{K}})$: let $\mathcal{F}(\widehat{\mathbb{K}})$ be the space of bounded measurable functions $f : \widehat{\mathbb{K}} \to \mathbb{R}$ in the supremum norm $\|f\|_{\mathcal{F}(\widehat{\mathbb{K}})} = \sup_{(s,y,z,a) \in \widehat{\mathbb{K}}} |f(s,y,z,a)|$. For a measure $\mu \in \mathcal{M}(\widehat{\mathbb{K}})$ and a function $f \in \mathcal{F}(\widehat{\mathbb{K}})$, we define the duality pairing $\langle \mu, f \rangle = \int_{\widehat{\mathbb{K}}} f(s,y,z,a)\, \mu(d(s,y,z,a))$, the integral of $f$ with respect to the measure $\mu$. When $\mu \in \mathcal{M}(\widehat{\mathbb{K}})$ is a probability measure and $f \in \mathcal{F}(\widehat{\mathbb{K}})$, then $\langle \mu, f \rangle$ can be interpreted as an expectation.

Now, let $I_B$ be the indicator function of a measurable set $B \in \mathcal{B}(\widehat{\mathbb{K}})$. We can define the augmented infinite horizon discounted occupation measure $\mu^{\pi} \in \mathcal{M}_{+}(\widehat{\mathbb{K}})$ of a policy $\pi$ on $\widehat{\mathbb{K}}$ as
$$\mu^{\pi}(B) = \sum_{t=0}^{\infty} \gamma^{t} E^{\pi}\left[I_B(s_t, y_t, z_t, a_t)\right] = \sum_{t=0}^{\infty} \gamma^{t} P^{\pi}\left\{(s_t, y_t, z_t, a_t) \in B\right\}, \quad \forall B \in \mathcal{B}(\widehat{\mathbb{K}}).$$
We can interpret $\mu^{\pi}(B)$ as the expected discounted number of visits to state-action pairs in the set $B \subset \widehat{\mathbb{K}}$ when following the policy $\pi$.
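For intuition, the occupation measure of a fixed policy can be estimated by Monte Carlo: accumulate discounted visit indicators along simulated augmented trajectories. The sketch below does this for the small illustrative MDP used earlier, with the running cost binned onto a coarse grid purely so that the empirical measure has finitely many atoms; none of these choices come from the paper.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(1)
S, A, gamma, T, n_rollouts = 2, 2, 0.9, 40, 2000
c = np.array([[1.0, 4.0], [2.0, 0.5]])
Q = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])

def estimate_occupation(policy, s0=0, grid=0.25):
    """Estimate mu^pi over atoms (s, y, z, a) by discounted empirical visit frequencies."""
    mu = defaultdict(float)
    for _ in range(n_rollouts):
        s, y, z = s0, 0.0, 1.0
        for t in range(T):
            a = rng.choice(A, p=policy(s, y, z))
            key = (s, round(y / grid) * grid, round(z, 6), a)   # bin y for finitely many atoms
            mu[key] += gamma**t / n_rollouts
            y, z = z * c[s, a] + y, gamma * z
            s = rng.choice(S, p=Q[s, a])
    return mu

uniform = lambda s, y, z: np.ones(A) / A
mu_hat = estimate_occupation(uniform)
print(sum(mu_hat.values()))   # total mass equals (1 - gamma**T) / (1 - gamma)
```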

The next theorem expresses the distribution of $C^{\pi}_{\nu, T}$ in terms of $\mu^{\pi}$. This theorem is the foundation of the convex analytic approach for Problem (3.1).

Theorem 3.5. For any $\eta \in \mathbb{R}$,
$$\Pr\left\{C^{\pi}_{\nu, T} \leq \eta\right\} = \frac{1-\gamma}{\gamma^{T}} \int_{\widehat{\mathbb{K}}} I\left\{y \leq \eta,\ z = \gamma^{T}\right\} \mu^{\pi}(d(s,y,z,a)).$$

Proof. Compute
$$\Pr\left\{C^{\pi}_{\nu, T} \leq \eta\right\} = \frac{1-\gamma}{\gamma^{T}}\, E^{\pi}\left[\sum_{t=T}^{\infty} \gamma^{t}\, I\{y_t \leq \eta\}\right],$$
using the fact that $I\{y_T \leq \eta\} = I\{y_t \leq \eta\}$ for all $t \geq T$. Finally,
$$E^{\pi}\left[\sum_{t=T}^{\infty} \gamma^{t}\, I\{y_t \leq \eta\}\right] = \int_{\widehat{\mathbb{K}}} I\left\{y \leq \eta,\ z = \gamma^{T}\right\} \mu^{\pi}(d(s,y,z,a)).$$

Based on the preceding theorem, we construct a random variable in $\mathcal{L}$ that is determined by $\mu^{\pi}$ and that is equal in distribution to $C^{\pi}_{\nu, T}$. For a measure $\mu \in \mathcal{M}(\widehat{\mathbb{K}})$, define the random variable $X(\mu)$ in $\mathcal{L}$ to have the distribution
$$\Pr\{X(\mu) \leq \eta\} = \frac{1-\gamma}{\gamma^{T}} \int_{\widehat{\mathbb{K}}} I\left\{y \leq \eta,\ z = \gamma^{T}\right\} \mu(d(s,y,z,a)), \quad \forall \eta \in \mathbb{R}.$$



By Theorem 3.5, $X(\mu^{\pi})$ is equal in distribution to $C^{\pi}_{\nu, T}$.

Occupation measures $\mu^{\pi}$ for policies $\pi \in \Pi$ have special properties that can be conveniently expressed as a linear mapping. Define $\mathcal{M}(\widehat{S})$ to be the space of finite signed measures on $(\widehat{S}, \mathcal{B}(\widehat{S}))$ in the total variation norm. Introduce the linear mapping $L_{0,T} : \mathcal{M}(\widehat{\mathbb{K}}) \to \mathcal{M}(\widehat{S})$ defined by
$$[L_{0,T}\mu](B) := \mu(B \times A) - \gamma \int_{\widehat{\mathbb{K}}} \widehat{Q}_T(B \,|\, s, y, z, a)\, \mu(d(s,y,z,a)), \quad \forall B \in \mathcal{B}(\widehat{S}), \qquad (3.2)$$
where $\mu(B \times A) = \int_{B \times A} \mu(d(s,y,z,a))$, for all $B \in \mathcal{B}(\widehat{S})$, is the marginal distribution of the measure $\mu$ on $\widehat{S}$. With this notation in place, we can write the convex analytic form of Problem (3.1),
$$\inf_{\mu \in \mathcal{M}(\widehat{\mathbb{K}})} \left\{ \rho(X(\mu)) : L_{0,T}\mu = \nu \right\}. \qquad (3.3)$$
It is worth mentioning that when $\rho(X(\mu))$ is concave in $\mu$ (in particular, when it is linear in $\mu$), then Problem (3.3) has an optimal solution at an extreme feasible measure $\mu$ (see [34, Theorem 19]). Since extreme feasible measures correspond to deterministic policies, it follows that randomized policies are not needed.

We formally justify the equivalence between Problem (3.1) and Problem (3.3) in the next lemma. Specifically, we show that an optimal solution for either problem can be used to construct an optimal solution for the other problem. The intuition is that the constraint $L_{0,T}\mu = \nu$ defines all feasible occupation measures that can be produced by policies in $\Pi$. Further, since we are using the modified transition kernel $\widehat{Q}_T$, we know that $\rho(X(\mu))$ is equal to $\rho(C^{\pi}_{\nu, T})$ for some policy $\pi$, based on Theorem 3.5.

We note that given an occupation measure $\mu \in \mathcal{M}(\widehat{\mathbb{K}})$, we get a policy $\pi_{\mu} \in \Pi$ defined by the conditional distribution of $\mu$ on $A$: $\pi_{\mu}(B \,|\, s, y, z) = \mu(B \,|\, s, y, z)$ for all $B \in \mathcal{B}(A)$, for each state $(s, y, z) \in \widehat{S}$.

Lemma 3.6. If $\pi$ is $\epsilon$-optimal for Problem (3.1), then $\mu^{\pi}$ is $\epsilon$-optimal for Problem (3.3). Conversely, if $\mu$ is $\epsilon$-optimal for Problem (3.3), then $\pi_{\mu}$ is $\epsilon$-optimal for Problem (3.1).

Proof. All occupation measures $\mu^{\pi}$ for $\pi \in \Pi$ must satisfy $L_{0,T}\mu^{\pi} = \nu$. If $\mu$ satisfies the equality $L_{0,T}\mu = \nu$, then $\pi_{\mu} \in \Pi$. Thus, any feasible $\mu$ for Problem (3.3) gives a feasible policy $\pi \in \Pi$, and vice versa.

We know that $\inf_{\pi \in \Pi} \rho(C^{\pi}_{\nu, T}) = \inf_{\pi \in \Pi} \rho(X(\mu^{\pi}))$, since $\rho(X(\mu^{\pi})) = \rho(C^{\pi}_{\nu, T})$ by construction of $X(\mu^{\pi})$ and by Assumption 2.2. Further, $\inf_{\pi \in \Pi} \rho(X(\mu^{\pi})) = \inf_{\mu \in \mathcal{M}(\widehat{\mathbb{K}})} \{\rho(X(\mu)) : L_{0,T}\mu = \nu\}$, since $\mu^{\pi}$ is feasible for Problem (3.3) for any $\pi \in \Pi$, and $\pi_{\mu} \in \Pi$ for any $\mu$ feasible for Problem (3.3). So, the optimal values of Problem (3.1) and Problem (3.3) are equal. Note that $\rho(X(\mu)) = \rho\left(C^{\pi_{\mu}}_{\nu, T}\right)$, again by Assumption 2.2 and the definition of $\pi_{\mu}$, to get the desired result.

To get cleaner notation throughout the paper, we introduce the additional linear mapping $L_{1,T} : \mathcal{M}(\widehat{\mathbb{K}}) \to \mathcal{M}(\mathbb{Y})$, where $\mathcal{M}(\mathbb{Y})$ is the space of finite signed measures on $\mathbb{Y} = [0, \bar{c}/(1-\gamma)]$, defined by
$$[L_{1,T}\mu](B) := \frac{1-\gamma}{\gamma^{T}} \int_{\widehat{\mathbb{K}}} I\left\{y \in B,\ z = \gamma^{T}\right\} \mu(d(s,y,z,a)), \quad \forall B \in \mathcal{B}(\mathbb{Y}). \qquad (3.4)$$



Note that $\theta = L_{1,T}\mu$ is proportional to the marginal distribution of $\mu$ on $\mathbb{Y}$ conditioned on the event $\{z = \gamma^{T}\}$ that appears in the statement of Theorem 3.5. Equivalently, $\theta = L_{1,T}\mu$ is the marginal distribution of $\mu$ on $\mathbb{Y}$ conditioned on the event $\{t \geq T\}$, i.e., we have passed the time $T$ and the running costs are no longer updated. We are introducing this shorthand to get a cleaner statement of Problem (3.3), since it is difficult to write out $\rho(X(\mu))$ directly for specific forms of $\rho$.

Given a distribution $\theta \in \mathcal{M}(\mathbb{Y})$, we abuse notation and let $X(\theta)$ be the random variable in $\mathcal{L}$ defined by $\Pr\{X(\theta) \leq \eta\} = \theta\{y \leq \eta\}$ for all $\eta \in \mathbb{R}$. Under this definition, $X(\theta)$ is equal in distribution to $X(\mu)$ where $\theta = L_{1,T}\mu$. Problem (3.3) is then equivalent to
$$\rho^{*}_{T} \triangleq \inf_{\mu \in \mathcal{M}(\widehat{\mathbb{K}}),\ \theta \in \mathcal{M}(\mathbb{Y})} \left\{ \rho(X(\theta)) : L_{0,T}\mu = \nu,\ L_{1,T}\mu = \theta \right\}, \qquad (3.5)$$
using the fact that $\rho(X(\theta)) = \rho(X(\mu))$ when $\theta = L_{1,T}\mu$. The additional measure $\theta$ helps us write the objective $\rho(X(\theta))$ more cleanly once we choose specific functional forms for $\rho$. For this reason, we will focus on Problem (3.5) rather than Problem (3.3) in the remainder of the paper.
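In the finite setting, the map $\mu \mapsto \theta = L_{1,T}\mu$ is just a weighted marginalization over the atoms of $\mu$ with $z = \gamma^T$. The helper below applies it to a dictionary-valued discrete measure over atoms $(s, y, z, a)$; the data layout, function name, and tolerance are our own conventions, not the paper's.

```python
from collections import defaultdict

def marginal_theta(mu, gamma, T, tol=1e-6):
    """Compute theta = L_{1,T} mu for a discrete measure mu over (s, y, z, a) atoms."""
    scale = (1.0 - gamma) / gamma**T
    theta = defaultdict(float)
    for (s, y, z, a), mass in mu.items():
        if abs(z - gamma**T) < tol:          # only atoms with z = gamma**T contribute
            theta[y] += scale * mass
    return theta                              # a distribution on the running-cost values
```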

4. Finite approximations. Problem (3.5) is generally an infinite-dimensional optimization problem: it has infinitely many variables and constraints. Such problems are extremely hard to solve directly, but they can be approximated by finite-dimensional optimization problems, i.e., problems with finitely many variables and constraints. In this section we develop two approaches for making finite approximations of Problem (3.5). First, we explain how the aggregation-relaxation-inner approximation method can give finite approximations for Problem (3.5). This method works in full generality, and has been studied for infinite-dimensional linear programming problems (see [21, 23]). Second, we develop a discretization scheme for the augmented state variables. This discretization scheme is intuitive and leads easily to a convergence rate analysis.

4.1. Aggregation-relaxation-inner approximation. We now elucidate finite approximations for the infinite-dimensional linear programming problems that arise in the classical convex analytic approach for risk-neutral MDPs. These finite approximations are based on an aggregation of the constraints, a relaxation of the aggregate constraints, and then an inner approximation of the decision variables. We note similar developments in [21, 23] for infinite-dimensional linear programming problems. In our case, we have to take extra care because Problem (3.5) has a nonlinear objective in general. We make the following assumption about the objective.

Assumption 4.1. $\rho(X(\cdot)) : \mathcal{M}(\mathbb{Y}) \to \mathbb{R}$ is weakly continuous.

Assumption 4.1 is needed to establish asymptotic convergence of our approximation. In earlier work on approximation of infinite-dimensional linear programs [21, 23], there was no need for Assumption 4.1 because the objective function was linear. It was possible to use Fatou's lemma to show that a sequence of approximate solutions converges to an optimal solution asymptotically. Our Problem (3.5) has a nonlinear objective and thus needs special consideration. We will show that this assumption is met for all of the risk-aware optimization formulations that we study in the next section.

We also make the following assumption about the augmented state space and the augmented set of state-action pairs.

Assumption 4.2. $\widehat{S}$ and $\widehat{\mathbb{K}}$ are locally compact separable metric spaces.



This assumption is met under many circumstances, for instance if $\widehat{S}$ and $\widehat{\mathbb{K}}$ are Euclidean spaces, which is usually the case in practice. Assumption 4.2 is needed so that we can approximate probability measures in $\mathcal{M}(\widehat{\mathbb{K}})$ and $\mathcal{M}(\widehat{S})$ with probability measures that have finite support.

We begin by describing aggregation of the constraints. Let $C(\widehat{S})$ be the space of continuous functions on $\widehat{S}$. The constraint $L_{0,T}\mu = \nu$ is equivalent to $\langle L_{0,T}\mu - \nu, f \rangle = 0$, $\forall f \in C_0(\widehat{S})$, where $C_0(\widehat{S}) \subset C(\widehat{S})$ is any countable dense subset of $C(\widehat{S})$, by [23, Lemma 12.5.2]. Similarly, let $C(\mathbb{Y})$ be the space of continuous functions on $\mathbb{Y}$, and let $C_1(\mathbb{Y})$ be a countable dense subset of $C(\mathbb{Y})$. The constraint $L_{1,T}\mu = \theta$ is equivalent to $\langle L_{1,T}\mu - \theta, f \rangle = 0$, $\forall f \in C_1(\mathbb{Y})$. We will now approximate these two new representations of the equality constraints in Problem (3.5). Let $\{C_{0,k}\}_{k \geq 0}$ be an increasing sequence of finite sets with $C_{0,k} \uparrow C_0(\widehat{S})$, and let $\{C_{1,k}\}_{k \geq 0}$ be an increasing sequence of finite sets with $C_{1,k} \uparrow C_1(\mathbb{Y})$. Now we discuss the inner approximation of the infinite-dimensional decision variables $\mu$ and $\theta$. Let $S_0 \subset S$, $Y_0 \subset \mathbb{Y}$, $Z_0 \subset (0,1]$, and $A_0 \subset A$ be countable dense sets, with increasing sequences $\{S_k\}_{k \geq 0}$, $\{Y_k\}_{k \geq 0}$, $\{Z_k\}_{k \geq 0}$, and $\{A_k\}_{k \geq 0}$ such that $S_k \uparrow S_0$, $Y_k \uparrow Y_0$, $Z_k \uparrow Z_0$, and $A_k \uparrow A_0$. Finally, take $\Delta_{0,k} = \mathcal{P}(S_k \times Y_k \times Z_k \times A_k)$ and $\Delta_{1,k} = \mathcal{P}(Y_k)$. The sets $\Delta_0 = \cup_{k=1}^{\infty} \Delta_{0,k}$ and $\Delta_1 = \cup_{k=1}^{\infty} \Delta_{1,k}$ are then dense in the spaces of probability measures on $\widehat{\mathbb{K}}$ and $\mathbb{Y}$, respectively, under Assumption 4.2.

The resulting finite approximation of Problem (3.5) is then
$$P(C_{0,k}, C_{1,k}, \epsilon_k, \Delta_{0,l}, \Delta_{1,l}): \quad \inf_{\mu \in \mathcal{M}(\widehat{\mathbb{K}}),\ \theta \in \mathcal{M}(\mathbb{Y})} \rho(X(\theta))$$
$$\text{s.t.} \quad |\langle L_{0,T}\mu - \nu, f \rangle| \leq \epsilon_k, \quad \forall f \in C_{0,k},$$
$$|\langle L_{1,T}\mu - \theta, f \rangle| \leq \epsilon_k, \quad \forall f \in C_{1,k},$$
$$\mu \in \Delta_{0,l},\ \theta \in \Delta_{1,l}.$$
Problem $P(C_{0,k}, C_{1,k}, \epsilon_k, \Delta_{0,l}, \Delta_{1,l})$ has finitely many constraints indexed by $C_{0,k}$ and $C_{1,k}$, and finitely many variables since we have restricted the support of $\mu$ and $\theta$. We continue to view $\mu \in \Delta_{0,l}$ and $\theta \in \Delta_{1,l}$ as elements of $\mathcal{M}(\widehat{\mathbb{K}})$ and $\mathcal{M}(\mathbb{Y})$, respectively. Note that the constraints in $P(C_{0,k}, C_{1,k}, \epsilon_k, \Delta_{0,l}, \Delta_{1,l})$ include an error tolerance of $\epsilon_k$, since we cannot expect to satisfy them exactly with discretized decision variables. The next theorem considers the behavior of $P(C_{0,k}, C_{1,k}, \epsilon_k, \Delta_{0,l}, \Delta_{1,l})$ as $k, l \to \infty$, and its proof is similar to the preceding two theorems.

Theorem 4.3. Let $\rho^{*}_{kl}$ be the optimal value of Problem $P(C_{0,k}, C_{1,k}, \epsilon_k, \Delta_{0,l}, \Delta_{1,l})$, and let $\{(\mu_{kl}, \theta_{kl})\}_{k,l \geq 0}$ be a sequence of solutions of it. Then,
(i) Problem $P(C_{0,k}, C_{1,k}, \epsilon_k, \Delta_{0,l}, \Delta_{1,l})$ is solvable for each $k$, for all sufficiently large $l$.
(ii) $\rho^{*}_{kl} \to \rho^{*}_{T}$ as $k \to \infty$ and $l \to \infty$. Every weak accumulation point of $\{(\mu_{kl}, \theta_{kl})\}_{k,l \geq 0}$ is an optimal solution of Problem (3.5).

Proof. This result is similar to the proof of [23, Theorem 12.5.3]. Only now, we use weak continuity of the objective $\rho(X(\cdot))$ in $\theta$ to get $\liminf_{i \to \infty} \rho(X(\theta_i)) \geq \rho(X(\theta))$ whenever $\theta_i \to \theta$ in the weak topology. In [23, Theorem 12.5.3], Fatou's lemma was used to establish $\liminf_{i \to \infty} \langle c, \mu_i \rangle \geq \langle c, \mu \rangle$ when $\mu_i \to \mu$ in the weak topology.

Because we worked in a general setting (we have made no assumptions on the state and action spaces other than Assumption 4.2), the convergence results in this subsection are asymptotic. Next we look at a finite approximation scheme that has convergence rate guarantees when $S$ and $A$ are finite.

4.2. Discretization. In this subsection, we consider the special case when the original state and action spaces are finite. Here, we only need to discretize the augmenting state variables defined on $\mathbb{Y} \times (0,1]$. Although the running cost can only take finitely many values when $S$ and $A$ are finite, we still discretize $\mathbb{Y}$ because the number of possible values for the running cost can be quite large. We propose a natural discretization scheme for the running cost and the discounting that readily leads to convergence rate estimates. We will use $\bar{Y} \subset \mathbb{Y}$ to denote a general finite discretization of $\mathbb{Y}$. The choice of a discretization $\bar{Z} \subset (0,1]$ is automatic once $T$ has been fixed; specifically, it is $\bar{Z} = \{\gamma^{t}\}_{t=0}^{T}$.

We will discretize $\mathbb{Y}$ into the finite set $\bar{Y} \subset \mathbb{Y}$, where the granularity of $\bar{Y}$ is $\sup_{y \in \mathbb{Y}} \inf_{\bar{y} \in \bar{Y}} |y - \bar{y}|$, the set distance between $\mathbb{Y}$ and $\bar{Y}$. Given $\epsilon > 0$, a set $\bar{Y}$ with granularity $\epsilon$ exists with not more than $\lceil \bar{c}/(\epsilon(1-\gamma)) \rceil$ elements. We introduce a new stochastic process $\{\bar{y}_t\}_{t \geq 0}$ for the discretized running costs on $\bar{Y}$ to differentiate it from the original continuous running costs $\{y_t\}_{t \geq 0}$. The system dynamic for $\{\bar{y}_t\}_{t \geq 0}$ is
$$\bar{y}_{t+1} = \operatorname*{arg\,min}_{y \in \bar{Y}} \left| y - \left(z_t\, c(s_t, a_t) + \bar{y}_t\right) \right|, \quad \forall t \geq 0,$$
which simply assigns $\bar{y}_{t+1}$ to be the closest point in $\bar{Y}$ to the original update given by $z_t\, c(s_t, a_t) + \bar{y}_t$. We will show that the error of this scheme increases linearly with time. We will also introduce $\{\bar{z}_t\}_{t \geq 0}$ to denote the discretized discounting process, whose system dynamic is given by
$$\bar{z}_{t+1} = \begin{cases} \gamma\, \bar{z}_t, & \bar{z}_t > \gamma^{T}, \\ \bar{z}_t, & \bar{z}_t = \gamma^{T}. \end{cases}$$
The process $\{\bar{z}_t\}_{t \geq 0}$ is the same as $\{z_t\}_{t \geq 0}$ up until time $T$.
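A minimal sketch of this discretization, under the same illustrative cost bounds as before: build a uniform grid on $\mathbb{Y}$ with a prescribed granularity and project each running-cost update onto the nearest grid point. The grid construction and the per-step projection below are our own illustration of the scheme, not code from the paper.

```python
import numpy as np

gamma, c_bar, eps = 0.9, 4.0, 0.05
y_max = c_bar / (1.0 - gamma)

# Uniform grid on Y = [0, c_bar / (1 - gamma)] with granularity at most eps.
n_points = int(np.ceil(y_max / eps)) + 1
Y_grid = np.linspace(0.0, y_max, n_points)

def project(y):
    """Return the closest grid point to y (the argmin in the discretized dynamics)."""
    return Y_grid[np.argmin(np.abs(Y_grid - y))]

def discretized_update(y_bar, z, cost):
    """One step of the discretized running-cost dynamics."""
    return project(z * cost + y_bar)

print(discretized_update(0.0, 1.0, 1.37))   # projection of the first-period cost
```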

The stochastic processes $\{\bar{y}_t\}_{t \geq 0}$ and $\{\bar{z}_t\}_{t \geq 0}$ are defined on $(\Omega, \mathcal{B})$ along with $\{(s_t, a_t)\}_{t \geq 0}$. By construction, $\{\bar{y}_t\}_{t \geq 0}$ is a function of $\{(s_t, a_t)\}_{t \geq 0}$, and $\{\bar{z}_t\}_{t \geq 0}$ is a deterministic process. We denote the discretized augmented state space as $\bar{S} = S \times \bar{Y} \times \bar{Z}$, and the corresponding discretized set of state-action pairs is $\bar{\mathbb{K}} = \{(s, y, z, a) \in S \times \bar{Y} \times \bar{Z} \times A : a \in A(s)\}$. Corresponding to the stochastic process $\{(s_t, \bar{y}_t, \bar{z}_t, a_t)\}_{t \geq 0}$, we introduce the transition kernel $\bar{Q}_T$, which accounts for the system dynamics of $\{\bar{y}_t\}_{t \geq 0}$ and $\{\bar{z}_t\}_{t \geq 0}$. We let $\bar{L}_{0,T}$ be the linear operator $L_{0,T}$ with $\bar{Q}_T$ in place of $\widehat{Q}_T$,
$$[\bar{L}_{0,T}\mu](j) := \mu(j) - \gamma \sum_{(s,y,z,a) \in \bar{\mathbb{K}}} \bar{Q}_T(j \,|\, s, y, z, a)\, \mu(s, y, z, a), \quad \forall j \in \bar{S}, \qquad (4.1)$$
where $\mu(j) = \sum_{a \in A} \mu(j, a)$ for all $j = (s, y, z) \in \bar{S}$. Equation (4.1) is the discretized analog of $L_{0,T}$ defined in equation (3.2). Similarly, we define $\bar{L}_{1,T}$ to be the linear operator $L_{1,T}$ suitably modified for use on $\bar{Y}$,
$$[\bar{L}_{1,T}\mu](j) := \frac{1-\gamma}{\gamma^{T}} \sum_{(s,y,z,a) \in \bar{\mathbb{K}}} I\left\{y = j,\ z = \gamma^{T}\right\} \mu(s, y, z, a), \quad \forall j \in \bar{Y}. \qquad (4.2)$$
Again, equation (4.2) is the discretized analog of $L_{1,T}$ defined in equation (3.4). For a measure $\theta \in \mathcal{M}(\bar{Y})$, define the random variable $\bar{X}(\theta)$ in $\mathcal{L}$ to have the distribution $\Pr\{\bar{X}(\theta) = \eta\} = \theta(\eta)$, $\forall \eta \in \bar{Y}$.



The convex analytic formulation for the discretized MDP is then
$$\inf_{\mu \in \mathbb{R}^{|\bar{\mathbb{K}}|},\ \theta \in \mathbb{R}^{|\bar{Y}|}} \left\{ \rho(\bar{X}(\theta)) : \bar{L}_{0,T}\mu = \nu,\ \bar{L}_{1,T}\mu = \theta \right\}. \qquad (4.3)$$
Problem (4.3) has finitely many variables and constraints by construction. We now want to compare solutions of the discretized Problem (4.3) and the original Problem (3.5).

Theorem 4.4. Let $\rho^{*} = \inf_{\pi \in \Pi} \rho(C^{\pi})$ and choose any $\epsilon > 0$ and $T = T(\epsilon)$. If the granularity of $\bar{Y}$ is smaller than $\epsilon/T$, then:
(i) $|y_T(\omega) - \bar{y}_T(\omega)| < \epsilon$ for all $\omega \in \Omega$;
(ii) Under Assumption 3.2, $|\rho(\bar{y}_T) - \rho(y_T)| \leq \epsilon$;
(iii) For $\mu \in \mathcal{M}(\bar{\mathbb{K}})$ with
$$\rho(\bar{X}(\theta)) \leq \inf_{\mu \in \mathcal{M}(\bar{\mathbb{K}})} \left\{ \rho(\bar{X}(\theta)) : \bar{L}_{0,T}\mu = \nu,\ \bar{L}_{1,T}\mu = \theta \right\} + \epsilon,$$
we have $\rho(C^{\pi}) < \rho^{*} + 3\epsilon$, where $\pi$ is the policy generated by $\mu$.

Proof. (i) By definition, $y_0 = \bar{y}_0$, so $|y_0 - \bar{y}_0| = 0$. Now, $|y_1 - \bar{y}_1| < \epsilon$ by construction of $\{\bar{y}_t\}_{t \geq 0}$ and the fact that $\bar{Y}$ has granularity $\epsilon$. For the inductive step, suppose $|y_t - \bar{y}_t| < t\epsilon$. Then, $y_{t+1} = z_t\, c(s_t, a_t) + y_t$ and $\bar{y}_{t+1} = \operatorname*{arg\,min}_{y \in \bar{Y}} |y - (z_t\, c(s_t, a_t) + \bar{y}_t)|$. Then,
$$|y_{t+1} - \bar{y}_{t+1}| \leq |y_{t+1} - (z_t\, c(s_t, a_t) + \bar{y}_t)| + |(z_t\, c(s_t, a_t) + \bar{y}_t) - \bar{y}_{t+1}| = |y_t - \bar{y}_t| + |(z_t\, c(s_t, a_t) + \bar{y}_t) - \bar{y}_{t+1}| < (t+1)\epsilon,$$
using the update for $\bar{y}_{t+1}$ and the induction hypothesis.
(ii) Follows immediately from part (i).
(iii) For $\pi \in \Pi$, we have that $\rho(\bar{X}(\mu^{\pi})) = \rho(\bar{y}_T)$ by construction.

Problem (4.3) can be solved exactly, at least in principle, because it is a finite-dimensional optimization problem. When there is additional problem structure, we can extend this discretization scheme to solve Problem (4.3) when $S$ and $A$ are infinite. Such a situation occurs in the dynamic risk-averse newsvendor problem, which we will report on in the future.

5. Optimization of risk functionals. Section 3 shows how to cast risk-aware MDPs as static optimization problems, and Section 4 shows how to construct tractable approximations. In this section, we apply our general methodology to two specific risk-aware MDPs. First, we minimize expected utility, and then we minimize CVaR. Notably, the expected utility minimizing MDPs lead to linear programming problems in occupation measures. The CVaR minimizing MDPs lead to nonconvex problems in occupation measures, but these problems can be solved with a sequence of linear programming problems.

5.1. Expected utility risk functional. Utility functions can be used to express a decision maker's risk preferences. Often, decision makers have increasing marginal costs. Hence, we focus on utility functions that are increasing and convex. For a fixed increasing, convex, and continuous utility function $u : \mathbb{R} \to \mathbb{R}$, we can replace the risk-neutral expectation $E^{\pi}[\cdot]$ with the expected utility $E^{\pi}[u(\cdot)]$. The resulting risk-aware MDP is
$$\inf_{\pi \in \Pi} E[u(C^{\pi})] = \inf_{\pi \in \Pi} E^{\pi}\left[u\left(\sum_{t=0}^{\infty} \gamma^{t} c(s_t, a_t)\right)\right]. \qquad (5.1)$$



Since we are focusing on costs rather than rewards, we prefer lower expected utility to higher expected utility. It would be more correct to refer to $u$ as a "disutility" function since it measures costs, but we continue to use the more common term "utility". Problem (5.1) has been studied with state space augmentation in [4] for concave and convex utility functions $u$. In [4], it is shown that Problem (5.1) can be solved with value iteration and policy iteration on the augmented state space.

We first approximate Problem (5.1) with a finite horizon problem, and then we formulate the resulting problem as a static optimization problem in occupation measures. The resulting static optimization problem turns out to be a linear programming problem. Next, we use linear programming duality to reveal the dual problem in value functions, from which we can recover dynamic programming equations.

Next we confirm that the expected utility objective $E[u(\cdot)]$ for Problem (5.1) satisfies Assumption 3.2, so that our earlier error bounds from Lemma 3.4 apply.

Lemma 5.1. The risk function $\rho(\cdot) = E[u(\cdot)]$ satisfies Assumption 3.2.

Proof. Since $u$ is increasing, convex, and continuous on $\mathbb{Y}$, it is also Lipschitz continuous on this interval. Without loss of generality, we can take the Lipschitz constant to be one by appropriately scaling $u$ by a positive constant. Now compute
$$\left| E^{\pi}\left[u\left(\sum_{t=0}^{T} \gamma^{t} c(s_t, a_t)\right)\right] - E^{\pi}\left[u\left(\sum_{t=0}^{\infty} \gamma^{t} c(s_t, a_t)\right)\right] \right| \leq E^{\pi}\left[\left|u\left(\sum_{t=0}^{T} \gamma^{t} c(s_t, a_t)\right) - u\left(\sum_{t=0}^{\infty} \gamma^{t} c(s_t, a_t)\right)\right|\right] \leq E^{\pi}\left[\left|\sum_{t=T+1}^{\infty} \gamma^{t} c(s_t, a_t)\right|\right],$$
where the last term can be made arbitrarily small by taking $T \to \infty$. In the second inequality, we are using Lipschitz continuity, i.e., $|u(x) - u(y)| \leq |x - y|$ for all $x$ and $y$.

As for the general Problem (2.2), we fix the time horizon $T$ and consider the truncated problem
$$\inf_{\pi \in \Pi} E\left[u\left(C^{\pi}_{\nu, T}\right)\right], \qquad (5.2)$$
which can be solved exactly with convex analytic methods. Problem (5.2) is equivalent to the following optimization problem:
$$\inf_{\mu \in \mathcal{M}(\widehat{\mathbb{K}}),\ \theta \in \mathcal{M}(\mathbb{Y})} \left\{ E[u(X(\theta))] : L_{0,T}\mu = \nu,\ L_{1,T}\mu = \theta \right\}, \qquad (5.3)$$
where $L_{0,T}$ is defined in (3.2) and $L_{1,T}$ is defined in (3.4). By definition of $X(\theta)$, the objective $E[u(X(\theta))] = \int u(y)\, \theta(dy) = \langle \theta, u \rangle$ is linear in $\theta$. Thus, we see that Problem (5.3) is actually a linear programming problem. We remind the reader that Problem (5.3) gives rise to a deterministic optimal policy by [34, Theorem 19] because it is linear.

Remark 5.2. The discretized finite-dimensional version of the expected utility minimizing MDP is
$$\inf_{\mu, \theta} \sum_{y \in \bar{Y}} u(y)\, \theta(y) \qquad (5.4)$$
$$\text{s.t.} \quad \nu(j) = \sum_{a \in A} \mu(j, a) - \gamma \sum_{(s,y,z,a) \in \bar{\mathbb{K}}} \bar{Q}_T(j \,|\, s, y, z, a)\, \mu(s, y, z, a), \quad \forall j \in \bar{S},$$
$$\theta(\xi) = \frac{1-\gamma}{\gamma^{T}} \sum_{(s,y,z,a) \in \bar{\mathbb{K}}} I\left\{y = \xi,\ z = \gamma^{T}\right\} \mu(s, y, z, a), \quad \forall \xi \in \bar{Y},$$
where we are using $\mu(j) = \sum_{a \in A} \mu(j, a)$ for all $j \in \bar{S}$.
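To illustrate Remark 5.2, the sketch below assembles the discretized LP (5.4) for a tiny finite MDP and solves it with a generic LP solver. The enumeration of the augmented grid, the placeholder disutility $u(y) = y^2$, and the MDP data are all our own illustrative choices; only the constraint structure follows the remark.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

# Illustrative finite MDP and horizon; all numbers are placeholders.
S, A, gamma, T = 2, 2, 0.8, 3
c = np.array([[1.0, 2.0], [0.5, 1.5]])
Q = np.array([[[0.7, 0.3], [0.2, 0.8]],
              [[0.4, 0.6], [0.9, 0.1]]])
u = lambda y: y ** 2                      # placeholder convex disutility

# Discretized augmented grids; z = gamma**k is tracked through its exponent k <= T.
Y = np.linspace(0.0, c.max() / (1 - gamma), 41)
proj = lambda y: Y[np.argmin(np.abs(Y - y))]   # nearest-point projection onto the grid

states = list(itertools.product(range(S), Y, range(T + 1)))     # (s, y, k)
sa = [(s, y, k, a) for (s, y, k) in states for a in range(A)]
s_idx = {x: i for i, x in enumerate(states)}
y_idx = {y: i for i, y in enumerate(Y)}
n_mu, n_th = len(sa), len(Y)                                     # variables: [mu, theta]

def next_yk(s, y, k, a):
    """Deterministic (y, z) part of the truncated kernel: freeze once k == T."""
    return (proj(gamma ** k * c[s, a] + y), k + 1) if k < T else (y, k)

A_eq = np.zeros((len(states) + n_th, n_mu + n_th))
b_eq = np.zeros(len(states) + n_th)
for col, (s, y, k, a) in enumerate(sa):
    A_eq[s_idx[(s, y, k)], col] += 1.0                   # mu(j) term of the balance equation
    yn, kn = next_yk(s, y, k, a)
    for s2 in range(S):                                   # -gamma * Q_T(j | s, y, k, a) term
        A_eq[s_idx[(s2, yn, kn)], col] -= gamma * Q[s, a, s2]
    if k == T:                                            # theta is the marginal on Y at k == T
        A_eq[len(states) + y_idx[y], col] -= (1 - gamma) / gamma ** T
for s in range(S):
    b_eq[s_idx[(s, Y[0], 0)]] = 0.5                       # nu uniform over S, y0 = 0, z0 = 1
for i in range(n_th):
    A_eq[len(states) + i, n_mu + i] = 1.0                 # +theta(y) in the marginal constraint

obj = np.concatenate([np.zeros(n_mu), [u(y) for y in Y]])
res = linprog(obj, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * n_mu + [(None, None)] * n_th)
print("approximate optimal expected disutility:", res.fun)
```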

The next step in our analysis is to take the dual of Problem (5.3). Duality is helpful here in two ways. First, it enhances our understanding of Problem (5.3) by providing a certificate of optimality. Second, it reveals a linear programming problem in value functions for Problem (5.3). To proceed, we define the adjoint of the linear operator $L_{1,T}$. Let $\mathcal{F}(\mathbb{R})$ be the space of bounded measurable functions $f : \mathbb{R} \to \mathbb{R}$.

Lemma 5.3. The adjoint of $L_{1,T}$ is $L^{*}_{1,T} : \mathcal{F}(\mathbb{R}) \to \mathcal{F}(\widehat{\mathbb{K}})$ defined by
$$\left[L^{*}_{1,T} f\right](s, y, z, a) := \frac{1-\gamma}{\gamma^{T}}\, f(y)\, I\left\{z \leq \gamma^{T}\right\}, \quad \forall (s, y, z, a) \in \widehat{\mathbb{K}}.$$

Proof. The adjoint is defined by $\langle f, L_{1,T}\mu \rangle = \langle L^{*}_{1,T}f, \mu \rangle$, so we compute
$$\langle f, L_{1,T}\mu \rangle = \int_{\mathbb{R}} f(y)\, [L_{1,T}\mu](dy) = \int_{\mathbb{R}} f(y') \left[ \frac{1-\gamma}{\gamma^{T}} \int_{\widehat{\mathbb{K}}} I\left\{y = y',\ z \leq \gamma^{T}\right\} \mu(d(s,y,z,a)) \right] dy'$$
$$= \int_{\widehat{\mathbb{K}}} \left[ \int_{\mathbb{R}} \frac{1-\gamma}{\gamma^{T}}\, f(y')\, I\left\{y = y',\ z \leq \gamma^{T}\right\}\, dy' \right] \mu(d(s,y,z,a)) = \int_{\widehat{\mathbb{K}}} \frac{1-\gamma}{\gamma^{T}}\, f(y)\, I\left\{z \leq \gamma^{T}\right\} \mu(d(s,y,z,a)).$$

We are now ready to compute the Lagrangian dual of Problem (5.3). This derivation follows from the infinite-dimensional linear programming theory (see [2]). We are assured that the Lagrangian dual will be a linear programming problem because the primal Problem (5.3) is a linear programming problem. Define $\mathcal{F}(\widehat{S})$ to be the space of bounded measurable functions $f : \widehat{S} \to \mathbb{R}$. The value function for Problem (5.2) exists in the space $\mathcal{F}(\widehat{S})$, and the following dual problem is an optimization problem in value functions.



Theorem 5.4. The dual to Problem (5.3) is
$$\sup_{v \in \mathcal{F}(\widehat{S})} \langle v, \nu \rangle \qquad (5.5)$$
$$\text{s.t.} \quad v(s, y, z) \leq \gamma \int_{\widehat{S}} v(\xi)\, \widehat{Q}_T(d\xi \,|\, s, y, z, a) + \frac{1-\gamma}{\gamma^{T}}\, u(y)\, I\left\{z \leq \gamma^{T}\right\}, \quad \forall (s, y, z, a) \in \widehat{\mathbb{K}}. \qquad (5.6)$$

Proof. Let the Lagrange multiplier for the constraint $L_{0,T}\mu = \nu$ be $v \in \mathcal{F}(\widehat{S})$, and let the Lagrange multiplier for the constraint $L_{1,T}\mu = \theta$ be $w \in \mathcal{F}(\mathbb{R})$; then the Lagrangian for Problem (5.3) is
$$\Lambda(\mu, \theta, v, w) = E[u(X(\theta))] + \langle v, L_{0,T}\mu - \nu \rangle + \langle w, L_{1,T}\mu - \theta \rangle.$$
Problem (5.3) is equivalent to
$$\inf_{\theta,\ \mu \geq 0}\ \sup_{v, w}\ \Lambda(\mu, \theta, v, w),$$
so the dual problem is defined to be
$$\sup_{v, w}\ \inf_{\theta,\ \mu \geq 0}\ \Lambda(\mu, \theta, v, w).$$
Rearranging the Lagrangian,
$$\Lambda(\mu, \theta, v, w) = \langle \theta, u \rangle + \langle v, L_{0,T}\mu - \nu \rangle + \langle w, L_{1,T}\mu - \theta \rangle = \langle \theta, u - w \rangle + \langle \mu, L^{*}_{0,T}v + L^{*}_{1,T}w \rangle - \langle v, \nu \rangle,$$
we see that the dual problem is
$$\sup_{v, w}\ -\langle v, \nu \rangle \quad \text{s.t.} \quad u - w \geq 0, \quad L^{*}_{0,T}v + L^{*}_{1,T}w \geq 0.$$
The adjoint of $L_{0,T}$ is $L^{*}_{0,T} : \mathcal{F}(\widehat{S}) \to \mathcal{F}(\widehat{\mathbb{K}})$ defined by
$$\left[L^{*}_{0,T} h\right](s, y, z, a) := h(s, y, z) - \gamma \int_{\widehat{S}} h(\xi)\, \widehat{Q}_T(d\xi \,|\, s, y, z, a)$$
(see [19], for example). We have the adjoint of $L_{1,T}$ from Lemma 5.3, which gives the form of the dual stated above after switching the sign of the unrestricted value function ($v \mapsto -v$).

Problem (5.5)-(5.6) is a maximization problem which drives the components of $v$ to be as large as possible. Thus, constraint (5.6) must be binding at optimality for some action $a \in A$ for every state $(s, y, z) \in \widehat{S}$, or else we could increase $v$ further. We then see that the dynamic programming equations on the augmented state space $\widehat{S}$ are
$$v(s, y, z) = \inf_{a \in A(s)} \left\{ \frac{1-\gamma}{\gamma^{T}}\, u(y)\, I\left\{z \leq \gamma^{T}\right\} + \gamma \int_{\widehat{S}} v(\xi)\, \widehat{Q}_T(d\xi \,|\, s, y, z, a) \right\}, \quad \forall (s, y, z) \in \widehat{S}. \qquad (5.7)$$



The term $\frac{1-\gamma}{\gamma^{T}}\, u(y)\, I\{z \leq \gamma^{T}\}$ appears as a cost function on the augmented state space; the original cost function $c$ is absorbed into the transition kernel. We emphasize that the preceding Bellman equations are stationary on the augmented state space.
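One way to use the stationary Bellman equation (5.7) numerically is plain value iteration on the discretized augmented state space. The sketch below does this for the same illustrative MDP and grids as in the earlier LP sketch; it is only meant to show the shape of the recursion, with the augmented cost $\frac{1-\gamma}{\gamma^{T}} u(y) I\{z \leq \gamma^{T}\}$ charged once $z$ has reached $\gamma^{T}$.

```python
import itertools
import numpy as np

S, A, gamma, T = 2, 2, 0.8, 3
c = np.array([[1.0, 2.0], [0.5, 1.5]])
Q = np.array([[[0.7, 0.3], [0.2, 0.8]],
              [[0.4, 0.6], [0.9, 0.1]]])
u = lambda y: y ** 2
Y = np.linspace(0.0, c.max() / (1 - gamma), 41)
proj_i = lambda y: int(np.argmin(np.abs(Y - y)))     # index of the nearest grid point

# Value function indexed by (s, y-index, k) with z = gamma**k, k = 0..T.
V = np.zeros((S, len(Y), T + 1))
for _ in range(300):                                  # fixed number of sweeps for illustration
    V_new = np.empty_like(V)
    for s, yi, k in itertools.product(range(S), range(len(Y)), range(T + 1)):
        stage = (1 - gamma) / gamma**T * u(Y[yi]) if k == T else 0.0
        best = np.inf
        for a in range(A):
            yi2, k2 = (proj_i(gamma**k * c[s, a] + Y[yi]), k + 1) if k < T else (yi, k)
            best = min(best, stage + gamma * Q[s, a] @ V[:, yi2, k2])
        V_new[s, yi, k] = best
    V = V_new
print("value at (s=0, y=0, z=1):", V[0, 0, 0])
```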

Using the dynamic programming equations (5.7), we can write the value function for Problem (5.2) as
$$v(s, 0, 1) = \inf_{\pi \in \Pi} E^{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\, \frac{1-\gamma}{\gamma^{T}}\, u(y_t)\, I\left\{z_t \leq \gamma^{T}\right\}\right], \quad \forall s \in S.$$
On the augmented state space, we have a cost function $\frac{1-\gamma}{\gamma^{T}}\, u(y)\, I\{z \leq \gamma^{T}\}$ that depends on $y$ and $z$ but not on $s$ and $a$, since the original cost function was absorbed into the transition kernel. The new cost function $\frac{1-\gamma}{\gamma^{T}}\, u(y)\, I\{z \leq \gamma^{T}\}$ is always zero up until time $T$, after which it accounts for the running cost.

Recall that a problem is 'solvable' when its optimal value is attained; there is 'no duality gap' between a primal and its dual when both problems have the same optimal value; and 'strong duality' holds between a primal and its dual when both problems are solvable. To conclude this section, we turn to the issues of solvability and strong duality for Problems (5.3) and (5.5)-(5.6). The following technical conditions ensure that Problem (5.3) is solvable, and that strong duality holds.

Assumption 5.5. (i) The cost function $\frac{1-\gamma}{\gamma^{T}}\, u(y)\, I\{z \leq \gamma^{T}\} : \widehat{\mathbb{K}} \to \mathbb{R}$ is inf-compact, i.e., the level sets
$$\left\{ (s, y, z, a) \in \widehat{\mathbb{K}} : \frac{1-\gamma}{\gamma^{T}}\, u(y)\, I\left\{z \leq \gamma^{T}\right\} \leq \epsilon \right\}$$
are compact for all $\epsilon \geq 0$.
(ii) The transition law $\widehat{Q}_T$ is weakly continuous, i.e.,
$$(s, y, z, a) \mapsto \int_{\widehat{S}} v(\xi)\, \widehat{Q}_T(d\xi \,|\, s, y, z, a) \in C_b(\widehat{\mathbb{K}}), \quad \forall v \in C_b(\widehat{S}),$$
where $C_b(\widehat{\mathbb{K}})$ and $C_b(\widehat{S})$ denote the spaces of continuous and uniformly bounded functions on $\widehat{\mathbb{K}}$ and $\widehat{S}$, respectively.
(iii) There exists a uniformly bounded minimizing sequence $\{v_i\}_{i \geq 0} \subset \mathcal{F}(\widehat{S})$ for Problem (5.5)-(5.6).

The preceding assumptions are standard in the literature on the convex analytic approach to MDPs (see the monographs [22, 23] for a summary). The next theorem summarizes well-known solvability and duality results for infinite-dimensional LPs, as applied to Problems (5.3) and (5.5)-(5.6). As a reminder, a primal and its dual have no duality gap when their optimal values are equal, and strong duality holds when both optimal values are attained.

Theorem 5.6. (i) ([19, Theorem 3.2]) Under Assumptions 2.1 and 5.5(i)(ii), Problem (5.3) is solvable.
(ii) ([19, Theorem 4.6]) Under Assumptions 2.1 and 5.5(i)(ii), there is no duality gap between Problem (5.3) and Problem (5.5)-(5.6).
(iii) ([2, Theorem 3.9]) Under Assumptions 2.1 and 5.5(i)(ii)(iii), strong duality holds between Problems (5.3) and (5.5)-(5.6).

When strong duality holds, the optimal values of Problems (5.3) and (5.5)-(5.6) are equal. In this situation, we can recover an optimal policy for Problem (5.2) by solving either the primal problem in occupation measures or the dual problem in value functions.


If $\mu^*$ is an optimal solution to Problem (5.3), then $\pi^* \in \Pi$ defined by
$$\pi^*(B \mid s,y,z) = \mu^*(B \mid s,y,z), \quad \forall B \in \mathcal{B}(A),$$
is an optimal policy for Problem (5.2). Conversely, if $v^*$ is an optimal solution to Problem (5.5)-(5.6), then a greedy policy with respect to $v^*$,
$$\pi^*(s,y,z) \in \arg\min_{a \in A(s)} \left\{ \frac{1-\gamma}{\gamma^{T}}\, u(y)\, \mathbb{I}\left\{ z \le \gamma^{T} \right\} + \gamma \int_{\bar{S}} v^*(\xi)\, Q_T(d\xi \mid s,y,z,a) \right\},$$
is an optimal policy for Problem (5.2).

Remark 5.7. As we mentioned at the beginning of this subsection, Problem (5.1) is studied with dynamic programming methods in [4]. Our results here are of a very different flavor. First, we solve Problem (5.1) with the convex analytic approach, while [4] develops value iteration and policy iteration algorithms for Problem (5.1). Our present paper and [4] share the same type of history augmentation, which necessitates an uncountable state space. However, we are able to provide a finite approximation and error guarantees, while the algorithms in [4] are purely conceptual.
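On a finite discretization of the augmented state space, the greedy-policy recovery above is a direct computation. The following Python sketch (ours, for illustration only) assumes the dual value function and the transition kernel are available as hypothetical arrays `v_star` and `Q_T` indexed by the discretized grids.

```python
import numpy as np

def greedy_policy(v_star, Q_T, y_grid, z_grid, u, gamma, T):
    """Greedy action at every discretized augmented state (s, y, z).
    v_star: array (nS, nY, nZ); Q_T: array (nS, nY, nZ, nA, nS, nY, nZ) of
    transition probabilities Q_T(xi | s, y, z, a); u: utility function."""
    nS, nY, nZ = v_star.shape
    nA = Q_T.shape[3]
    policy = np.zeros((nS, nY, nZ), dtype=int)
    for s in range(nS):
        for iy, y in enumerate(y_grid):
            for iz, z in enumerate(z_grid):
                # stage cost (1-gamma)/gamma**T * u(y) * 1{z <= gamma**T}
                stage = (1 - gamma) / gamma**T * u(y) * (z <= gamma**T + 1e-12)
                q = [stage + gamma * np.sum(Q_T[s, iy, iz, a] * v_star)
                     for a in range(nA)]
                policy[s, iy, iz] = int(np.argmin(q))
    return policy
```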

5.2. Conditional value-at-risk. CVaR is among the most popular risk functions. The CVaR-minimizing policy at level $\alpha \in (0,1)$ solves
$$\inf_{\pi \in \Pi}\ \inf_{\eta \in \mathbb{R}}\ \left\{ \eta + \frac{1}{1-\alpha}\, \mathbb{E}\left[ (C^{\pi} - \eta)^{+} \right] \right\} \qquad (5.8)$$
$$= \inf_{\pi \in \Pi}\ \inf_{\eta \in \mathbb{R}}\ \left\{ \eta + \frac{1}{1-\alpha}\, \mathbb{E}^{\pi}\left[ \left( \sum_{t=0}^{\infty} \gamma^{t} c(s_t,a_t) - \eta \right)^{+} \right] \right\}.$$

The dependence on $\pi$ enters through the expectation inside the minimization problem. Problem (5.8) is solved with state space augmentation and dynamic programming algorithms in [3]. In contrast, we provide a solution via the convex analytic approach.

Problem (5.1) naturally leads to a linear programming problem in occupation measures. Here we will see that the convex analytic formulation for Problem (5.8) is nonconvex. This fact is in contrast to stochastic optimization with CVaR, which gives convex optimization problems.

The next lemma shows that we do not have to minimize over all $\eta \in \mathbb{R}$ when evaluating the CVaR of $C^{\pi}_{\nu,T}$; we only need to minimize over $\eta \in Y$.

Lemma 5.8. For any $X \in L$ with support contained in $Y$,
$$\inf_{\eta \in \mathbb{R}} \left\{ \eta + \frac{1}{1-\alpha}\, \mathbb{E}\left[ (X-\eta)^{+} \right] \right\} = \inf_{\eta \in Y} \left\{ \eta + \frac{1}{1-\alpha}\, \mathbb{E}\left[ (X-\eta)^{+} \right] \right\}.$$

Proof. For $\eta < 0$, we have
$$\eta + \frac{1}{1-\alpha}\, \mathbb{E}\left[ (X-\eta)^{+} \right] = \eta + \frac{1}{1-\alpha}\, \mathbb{E}[X-\eta] = \left( \frac{-\alpha}{1-\alpha} \right)\eta + \frac{1}{1-\alpha}\, \mathbb{E}[X],$$
which is increasing as $\eta \to -\infty$. For $\eta > c/(1-\gamma)$, we have
$$\eta + \frac{1}{1-\alpha}\, \mathbb{E}\left[ (X-\eta)^{+} \right] = \eta,$$
which is increasing as $\eta \to \infty$.
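As a small numerical illustration of Lemma 5.8 (ours, not part of the analysis above): for a discrete distribution $\theta$ supported on a grid of nonnegative cost values, as arises after discretization, the CVaR objective can be minimized over the support grid alone, a discrete analogue of the restriction to $Y$. The grid and the distribution below are hypothetical.

```python
import numpy as np

def cvar_objective(eta, y_grid, theta, alpha):
    """The Rockafellar-Uryasev objective eta + E[(X - eta)^+] / (1 - alpha)."""
    return eta + np.sum(np.maximum(y_grid - eta, 0.0) * theta) / (1 - alpha)

rng = np.random.default_rng(1)
y_grid = np.linspace(0.0, 1.0, 23)           # support of the (discretized) running cost
theta = rng.dirichlet(np.ones(y_grid.size))  # hypothetical distribution of the cost
alpha = 0.9

# Minimize over the support grid only ...
cvar_support = min(cvar_objective(eta, y_grid, theta, alpha) for eta in y_grid)
# ... and over a much finer grid of real numbers, for comparison.
cvar_fine = min(cvar_objective(eta, y_grid, theta, alpha)
                for eta in np.linspace(-1.0, 2.0, 30001))
print(cvar_support, cvar_fine)   # the two values agree up to the fine-grid resolution
```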


Now we check Assumption 3.2 for CVaR to get error bounds.

Lemma 5.9. ([16, Lemma 4.3]) The risk function $\rho(\cdot) = \inf_{\eta \in \mathbb{R}} \left\{ \eta + \frac{1}{1-\alpha}\, \mathbb{E}\left[ (\cdot - \eta)^{+} \right] \right\}$ satisfies Assumption 3.2.

For a fixed time horizon $T$, the truncated problem is
$$\inf_{\pi \in \Pi}\ \inf_{\eta \in \mathbb{R}}\ \left\{ \eta + \frac{1}{1-\alpha}\, \mathbb{E}\left[ \left( C^{\pi}_{\nu,T} - \eta \right)^{+} \right] \right\}. \qquad (5.9)$$
Problem (5.9) can be expressed as a static problem in occupation measures:
$$\inf_{\mu \in M(\bar{K}),\, \theta \in M(Y)}\ \inf_{\eta \in \mathbb{R}}\ \left\{ \eta + \frac{1}{1-\alpha}\, \mathbb{E}\left[ (X(\theta)-\eta)^{+} \right]\ :\ L_{0,T}\mu = \nu,\ L_{1,T}\mu = \theta \right\}, \qquad (5.10)$$
by Theorem 3.5 and Lemma 3.6. By definition, $\mathbb{E}\left[ (X(\theta)-\eta)^{+} \right] = \int_{Y} (y-\eta)^{+}\, \theta(dy)$, so we define
$$g(\theta) := \inf_{\eta \in \mathbb{R}}\ \left\{ \eta + \frac{1}{1-\alpha} \int_{Y} (y-\eta)^{+}\, \theta(dy) \right\}$$
to be the CVaR objective. Then Problem (5.10) can be written simply as
$$\inf_{\mu \in M(\bar{K}),\, \theta \in M(Y)}\ \left\{ g(\theta)\ :\ L_{0,T}\mu = \nu,\ L_{1,T}\mu = \theta \right\}.$$

Problem (5.10) is a convex analytic formulation for Problem (5.8), but it is not a convex optimization problem: it has a convex feasible region but a concave minimization objective. However, since the objective of Problem (5.10) is concave, we are still guaranteed the existence of a deterministic optimal policy by [34, Theorem 19].

Remark 5.10. When $S$ and $A$ are finite, we can use the discretization from Subsection 4.2 to approximate Problem (5.9) with
$$\inf_{\mu,\,\theta}\ \inf_{\eta \in Y}\ \left\{ \eta + \frac{1}{1-\alpha} \sum_{y \in Y} (y-\eta)^{+}\, \theta(y) \right\} \qquad (5.11)$$
$$\text{s.t.}\quad \nu(j) = \sum_{a \in A} \mu(j,a) - \gamma \sum_{(s,y,z,a) \in \bar{K}} Q_T(j \mid s,y,z,a)\, \mu(s,y,z,a), \quad \forall j \in \bar{S},$$
$$\theta(\xi) = \frac{1-\gamma}{\gamma^{T}} \sum_{(s,y,z,a) \in \bar{K}} \mathbb{I}\left\{ y = \xi,\ z = \gamma^{T} \right\} \mu(s,y,z,a), \quad \forall \xi \in Y.$$
Since the preceding problem is finite-dimensional, we can apply the algorithm from [30] directly, rather than applying the algorithm to each finite approximation in problems $P(C_{0,k}, C_{1,k}, \epsilon_k, \Theta_{0,l}, \Theta_{1,l})$ and taking $k, l \to \infty$.

The next lemma concerns properties of $g$: its concavity and its continuity with respect to the weak topology on $M(Y)$. Part (i) holds by definition; the proof of part (ii) follows by an approximation argument with a finite set $N \subset Y$.

Lemma 5.11. (i) $g(\cdot)$ is concave in $\theta$.
(ii) $g(\cdot)$ is weakly continuous.

In this case, Problem $P(C_{0,k}, C_{1,k}, \epsilon_k, \Theta_{0,l}, \Theta_{1,l})$ takes the form:


$$\inf_{\mu \in M(\bar{K}),\, \theta \in M(Y)}\ g(\theta) \qquad (5.12)$$
$$\text{s.t.}\quad |\langle L_{0,T}\mu - \nu,\, f \rangle| \le \epsilon_k, \quad \forall f \in C_{0,k}, \qquad (5.13)$$
$$|\langle L_{1,T}\mu - \theta,\, f \rangle| \le \epsilon_k, \quad \forall f \in C_{1,k}, \qquad (5.14)$$
$$\mu \in \Theta_{0,l},\quad \theta \in \Theta_{1,l}, \qquad (5.15)$$
where $\Theta_{0,l} \subset M(\bar{K})$ and $\Theta_{1,l} \subset M(Y)$ are the sets of probability measures with finite support defined in Subsection 4.1. The preceding problem has finitely many decision variables and finitely many constraints. We can use the successive linear approximation method to solve Problem (5.12)-(5.15) exactly; see [30]. In the finite setting, the subgradient is
$$\partial g(\theta) = \mathrm{conv}\left\{ \left( (y-\eta)^{+} \right)_{y \in \Theta_{1,k}}\ :\ g(\theta) = \eta + \frac{1}{1-\alpha} \sum_{y \in \Theta_{1,k}} (y-\eta)^{+}\, \theta(y) \right\} \subset \mathbb{R}^{|\Theta_{1,k}|},$$
where we are viewing $g$ as a function on probability distributions supported on the grid underlying $\Theta_{1,k}$.

The idea is that at each candidate point we linearize the objective of Problem (5.12)-(5.15) and then solve the resulting LP to get the next candidate point. This procedure is justified since the optimal solution of a concave minimization problem is found at an extreme point of the feasible region. We emphasize that successive linear approximation is applied to the discretized Problem (5.12)-(5.15), not to the infinite-dimensional Problem (5.10). Let $(\mu^{i}, \theta^{i})$ be the $i$-th candidate solution to Problem (5.12)-(5.15); then the linearization of Problem (5.12)-(5.15) at $(\mu^{i}, \theta^{i})$ is
$$\inf_{\mu \in M(\bar{K}),\, \theta \in M(Y)}\ \langle s^{i},\, \theta - \theta^{i} \rangle$$
$$\text{s.t.}\quad |\langle L_{0,T}\mu - \nu,\, f \rangle| \le \epsilon_k, \quad \forall f \in C_{0,k},$$
$$|\langle L_{1,T}\mu - \theta,\, f \rangle| \le \epsilon_k, \quad \forall f \in C_{1,k},$$
$$\mu \in \Theta_{0,k},\quad \theta \in \Theta_{1,k},$$
where $s^{i} \in \partial g(\theta^{i})$. The solution of the preceding problem becomes the next candidate solution $(\mu^{i+1}, \theta^{i+1})$. It is shown in [30] that this procedure converges to the optimal solution of Problem (5.12)-(5.15); a sketch of the iteration is given at the end of this subsection.

Remark 5.12. Instead of minimizing the CVaR of costs, we could maximize the CVaR of rewards. For a reward function $r : K \to \mathbb{R}$, we can let $y_t = \sum_{i=0}^{t} \gamma^{i} r(s_i, a_i)$ be the running reward. Using our same truncation argument, we get the optimization problem
$$\sup_{\mu \in M(\bar{K}),\, \theta \in M(Y)}\ \left\{ \inf_{\eta \in \mathbb{R}} \left\{ \eta + \frac{1}{1-\alpha} \int (y-\eta)^{+}\, \theta(dy) \right\}\ :\ L_{0,T}\mu = \nu,\ L_{1,T}\mu = \theta \right\}.$$
We have already established that CVaR is a concave function of $\theta$; thus this problem is automatically convex, since it maximizes a concave function.

Remark 5.13. We can treat mean-deviation and mean-semideviation in the same manner as above. The resulting static problems in occupation measures are nonconvex.


The situation is mitigated somewhat by the fact that the feasible regions are determined by linear constraints and are therefore convex; the only nonconvexity is in the objectives.
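The following Python sketch (ours, under stated assumptions) implements one possible version of the successive linear approximation loop described above for a discretized CVaR problem of the form (5.12)-(5.15). For simplicity the feasible region is represented as a polytope $\{x \ge 0 : A_{\mathrm{eq}} x = b_{\mathrm{eq}}\}$ in the stacked variable $x = (\mu, \theta)$; the constraint data `A_eq`, `b_eq` and the cost grid `y_grid` are hypothetical placeholders, and `scipy.optimize.linprog` stands in for any LP solver.

```python
import numpy as np
from scipy.optimize import linprog

def cvar_of(theta, y_grid, alpha):
    """CVaR objective value and a minimizing eta restricted to the support grid."""
    vals = [eta + np.maximum(y_grid - eta, 0.0) @ theta / (1 - alpha) for eta in y_grid]
    i = int(np.argmin(vals))
    return vals[i], y_grid[i]

def successive_linear_approximation(A_eq, b_eq, y_grid, alpha, n_mu,
                                    max_iter=50, tol=1e-9):
    """Linearize the concave CVaR objective at the current iterate, solve the LP,
    and repeat (feasibility of the LP is assumed)."""
    n_theta = y_grid.size
    # initial candidate: any feasible point (LP with a zero objective)
    x = linprog(np.zeros(n_mu + n_theta), A_eq=A_eq, b_eq=b_eq, bounds=(0, None)).x
    best = np.inf
    for _ in range(max_iter):
        theta = x[n_mu:]
        val, eta_star = cvar_of(theta, y_grid, alpha)
        if val > best - tol:            # no further improvement
            break
        best = val
        # subgradient of g at theta; the active eta_star piece gives the gradient
        grad_theta = np.maximum(y_grid - eta_star, 0.0) / (1 - alpha)
        cost = np.concatenate([np.zeros(n_mu), grad_theta])   # linearized objective
        x = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None)).x
    return x, best
```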

6. Risk-constrained optimization. We have so far addressed MDP models that minimize some risk function of the infinite horizon discounted cost. In this section, we extend our development to risk-constrained MDPs. For an additional risk function $\vartheta : L \to \mathbb{R}$ and a constant $\kappa$, we can add a constraint to Problem (2.2) to get
$$\inf_{\pi \in \Pi}\ \left\{ \rho(C^{\pi})\ :\ \vartheta(C^{\pi}) \le \kappa \right\}. \qquad (6.1)$$
We will study two specific instances of Problem (6.1) in this section: one based on stochastic dominance and the other on chance constraints.

6.1. Stochastic dominance constraints. Stochastic dominance relations (or stochastic orders) are partial orders on the space of random variables (see [33, 40]). They have major relevance to risk management because they allow us to express risk preferences for an entire class of decision makers, as opposed to Problem (5.1), which represents only a single decision maker. Optimization with stochastic dominance constraints was addressed in [10, 11, 12, 18].

Definition 6.1. For real-valued random variables $X, Y$, $X$ is dominated by $Y$ in the increasing convex stochastic order, written $X \preceq_{icx} Y$, if $\mathbb{E}[u(X)] \le \mathbb{E}[u(Y)]$ for all increasing convex functions $u : \mathbb{R} \to \mathbb{R}$ such that both expectations exist.

If $X \preceq_{icx} Y$, then any risk-averse decision maker with an increasing convex utility function would prefer the random variable $X$ to $Y$. We will assume that there is a reference random variable $Y$ that serves as a benchmark. The benchmark $Y$ expresses the user's desiderata regarding the properties of a favorable cost distribution.

Fortunately, a computationally tractable representation of $\preceq_{icx}$ is available. Denote $(x)^{+} = \max\{x, 0\}$. It is known (see [10]) that $X \preceq_{icx} Y$ is equivalent to
$$\mathbb{E}\left[ (X-\eta)^{+} \right] \le \mathbb{E}\left[ (Y-\eta)^{+} \right], \quad \forall \eta \in \mathbb{R}.$$
The above parametric representation of $\preceq_{icx}$ is easier to implement than the original definition, since it reduces to a continuum of constraints indexed by a single scalar parameter, whereas the original definition is a continuum of constraints indexed by a function space. We can also write $X \preceq_{icx} Y$ as the system
$$\mathbb{E}\left[ (X-\eta)^{+} \right] \le \mathbb{E}\left[ (Y-\eta)^{+} \right], \quad \forall \eta \in \mathrm{supp}\, Y,$$
where $\mathrm{supp}\, Y$ is the support of $Y$. Furthermore, when $\mathrm{supp}\, Y = \{\eta_i\}_{i \in I}$ for a finite index set $I$, then $X \preceq_{icx} Y$ is equivalent to
$$\mathbb{E}\left[ (X-\eta_i)^{+} \right] \le \mathbb{E}\left[ (Y-\eta_i)^{+} \right], \quad \forall i \in I.$$
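For intuition, the following short Python check (ours, illustrative only) applies the finite-support test above to two hypothetical discrete distributions.

```python
import numpy as np

def expected_excess(support, probs, eta):
    """E[(X - eta)^+] for a discrete random variable given by support and probabilities."""
    return np.sum(np.maximum(support - eta, 0.0) * probs)

def dominated_icx(x_supp, x_probs, y_supp, y_probs):
    """Check E[(X - eta)^+] <= E[(Y - eta)^+] at every support point of the benchmark Y."""
    return all(expected_excess(x_supp, x_probs, eta) <= expected_excess(y_supp, y_probs, eta)
               for eta in y_supp)

# Example: a benchmark Y and a candidate cost distribution X with the same mean.
y_supp, y_probs = np.array([0.0, 1.0]), np.array([0.5, 0.5])
x_supp, x_probs = np.array([0.25, 0.75]), np.array([0.5, 0.5])
print(dominated_icx(x_supp, x_probs, y_supp, y_probs))   # True: X is preferred to Y
```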

The next assumption about the boundedness of the benchmark streamlines our analysis, and is not unreasonable in practice.

Assumption 6.2. The support of the benchmark $Y$ is contained in $[\eta_1, \eta_2]$.


With these ingredients, we propose the stochastic dominance-constrained MDP:
$$\inf_{\pi \in \Pi}\ \mathbb{E}^{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} c(s_t,a_t) \right] \qquad (6.2)$$
$$\text{s.t.}\quad \mathbb{E}^{\pi}\left[ \left( \sum_{t=0}^{\infty} \gamma^{t} c(s_t,a_t) - \eta \right)^{+} \right] \le \mathbb{E}\left[ (Y-\eta)^{+} \right], \quad \forall \eta \in [\eta_1, \eta_2].$$
Equivalently,
$$\inf_{\pi \in \Pi}\ \left\{ \mathbb{E}[C^{\pi}]\ :\ \mathbb{E}\left[ (C^{\pi}-\eta)^{+} \right] \le \mathbb{E}\left[ (Y-\eta)^{+} \right],\ \forall \eta \in [\eta_1, \eta_2] \right\}.$$
In our earlier work [18], we applied stochastic dominance constraints to the steady state in the classical convex analytic formulation. Problem (6.2) differs substantially because it has stochastic dominance constraints on the discounted cost over the entire time horizon.

We will see that some of the results and proofs for Problem (6.2) in this section are similar to those for Problem (5.1). The connection between Problem (6.2) and Problem (5.1) is natural: stochastic dominance constraints are defined in terms of a continuum of constraints on the expected utility of cost, while Problem (5.1) minimizes the expected utility for a single utility function. Our overall plan is the same: we approximate Problem (6.2) with a truncated problem that can be solved exactly with convex analytic methods.

We will see that some of the results and proofs for Problem (6.2) in this sectionare similar to those for Problem (5.1). The connection between Problem (6.2) andProblem (5.1) is natural: stochastic dominance constraints are defined in terms of acontinuum of constraints on the expected utility of cost, while Problem (5.1) minimizesthe expected utility for a single utility function. Our overall plan is the same, we willapproximate Problem (6.2) with a truncated problem that can be solved exactly withconvex analytic methods.

Problem (6.2) is constrained, so we cannot apply Lemma 3.4 to get error bounds.We also need to consider satisfaction of the constraints. The following result is im-mediate since (x� ⌘)+ is Lipschitz continuous for all ⌘. Recall y

T

=P

T

t=0 �t

c (st

, a

t

)is the running cost at time T .Lemma 6.3. For any ✏ > 0, there is a T such that

| (yT

� ⌘)+ � 1X

t=0

t

c (st

, a

t

)� ⌘

!

+

| ✏, 8⌘ 2 [⌘1, ⌘2] ,

for all ⇡ 2 ⇧.We will approximate Problem (6.2) with the truncation

inf⇡2⇧

n

E⇥

C

⌫,T

: Eh

C

⌫,T

� ⌘

+

i

E⇥

(Y � ⌘)+⇤

, 8⌘ 2 [⌘1, ⌘2]o

. (6.3)

We use this to get an estimate on the quality (both in terms of optimality and feasi-bility) of a solution to Problem (6.3) versus Problem (6.2).

The next lemma states that a near optimal solution for Problem (6.3) will benearly optimal and nearly feasible for Problem (6.2). Let �

SD

and �SD, T

denote thefeasible regions of Problems (6.2) and (6.3), respectively. The next proof is immediate.Lemma 6.4. Choose T as in the statement of Lemma 6.3, then:

(i) |Eh

C

⌫, T

� ⌘

+

i

� E⇥

(C⇡

� ⌘)+⇤ | < ✏, 8⌘ 2 [⌘1, ⌘2];

(ii) For ⇡ 2 �SD, T

with E⇥

C

⌫, T

⇤ inf⇡2�SD, T E

C

⌫, T

+ ✏ we have E⇥

C

<

inf⇡2�SD E [C⇡

] + 3 ✏ and Eh

C

� ⌘

+

i

E⇥

(Y � ⌘)+⇤

+ ✏ for all ⌘ 2 [⌘1, ⌘2].

Note that Lemma 6.4 guarantees near-feasibility of an optimal solution to Problem(6.3) with respect to Problem (6.2), rather than exact feasibility.


To succinctly express the stochastic dominance constraint, let $C([\eta_1, \eta_2])$ be the space of continuous functions on $[\eta_1, \eta_2]$ with the supremum norm. We define a new linear operator $L_2 : M(Y) \to C([\eta_1, \eta_2])$ via
$$[L_2\theta](\eta) := \int_{Y} (y-\eta)^{+}\, \theta(dy), \quad \forall \eta \in [\eta_1, \eta_2],$$
which gives the vector of expected utilities $\mathbb{E}\left[ (X(\theta)-\eta)^{+} \right]$ as $\eta$ ranges over $[\eta_1, \eta_2]$. Finally, let $g \in C([\eta_1, \eta_2])$ be defined by
$$g(\eta) := \mathbb{E}\left[ (Y-\eta)^{+} \right], \quad \forall \eta \in [\eta_1, \eta_2],$$
the vector of expected utilities of the benchmark over $\eta \in [\eta_1, \eta_2]$. Using $L_2$, we can write Problem (6.3) in convex analytic terms as
$$\inf_{\mu \in M(\bar{K}),\, \theta \in M(Y)}\ \left\{ \mathbb{E}[X(\theta)]\ :\ L_{0,T}\mu = \nu,\ L_{1,T}\mu = \theta,\ L_2\theta \le g \right\}. \qquad (6.4)$$
By the same reasoning as for Problem (5.3), Problem (6.4) is a linear programming problem.

Remark 6.5. The discretized finite-dimensional version of the stochastic dominance-constrained MDP is given by

By the same reasoning as for Problem (5.3), Problem (6.4) is a linear programmingproblem.Remark 6.5. The discretized finite-dimensional version of the stochastic dominance-constrained MDP is given by

infµ, ✓

X

y2Yy ✓ (y) (6.5)

s.t. ⌫ (j) =X

a2Aµ (j, a)� �

X

(s,y,z,a)2KQ

T

(j | s, y, z, a)µ (s, y, z, a) , 8j 2 S,

✓ (⇠) =1� �

T

X

(s,y,z,a)2KI

y = ⇠, z = �

T

µ (s, y, z, a) , 8⇠ 2 Y,

X

y2Y(y � ⌘

i

)+ ✓ (y) E⇥

(Y � ⌘

i

)+⇤

, 8i 2 I,

where we are assuming the benchmark has finite support indexed by I.The following lemma connects Problem (6.4) with Problem (6.3).
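The dominance constraints in (6.5) are ordinary linear inequalities in $\theta$. The following Python sketch (ours, illustrative only) assembles the corresponding constraint rows; the cost grid and the benchmark below are hypothetical placeholders.

```python
import numpy as np

def dominance_rows(y_grid, bench_supp, bench_probs):
    """Return (A_ub, b_ub) so that A_ub @ theta <= b_ub encodes
    sum_y (y - eta_i)^+ theta(y) <= E[(Y - eta_i)^+] for every benchmark point eta_i."""
    A_ub = np.array([np.maximum(y_grid - eta, 0.0) for eta in bench_supp])
    b_ub = np.array([np.maximum(bench_supp - eta, 0.0) @ bench_probs
                     for eta in bench_supp])
    return A_ub, b_ub

# Example usage with a two-point benchmark on a grid of 23 cost levels:
y_grid = np.linspace(0.0, 1.0, 23)
A_ub, b_ub = dominance_rows(y_grid, np.array([0.2, 0.8]), np.array([0.5, 0.5]))
print(A_ub.shape, b_ub)   # (2, 23) constraint rows and their right-hand sides
```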

The following lemma connects Problem (6.4) with Problem (6.3).

Lemma 6.6. If $\pi$ is optimal for Problem (6.3), then $\mu_{\pi}$ is optimal for Problem (6.4). Conversely, if $\mu$ is optimal for Problem (6.4), then $\pi_{\mu}$ is optimal for Problem (6.3).

Proof. We use the fact that $\mathbb{E}\left[ \left( C^{\pi}_{\nu,T} - \eta \right)^{+} \right] = \mathbb{E}\left[ (X(\mu_{\pi})-\eta)^{+} \right]$ and also that $\mathbb{E}\left[ \left( C^{\pi_{\mu}}_{\nu,T} - \eta \right)^{+} \right] = \mathbb{E}\left[ (X(\mu)-\eta)^{+} \right]$ for all $\eta \in [\eta_1, \eta_2]$. Since
$$\inf_{\pi \in \Pi_{SD,T}} \mathbb{E}\left[ C^{\pi}_{\nu,T} \right] = \inf_{\pi \in \Pi_{SD,T}} \mathbb{E}\left[ X(\mu_{\pi}) \right],$$
it follows that
$$\inf_{\pi \in \Pi_{SD,T}} \mathbb{E}\left[ X(\mu_{\pi}) \right] = \inf_{\mu \in M(\bar{K}),\, \theta \in M(Y)}\ \left\{ \mathbb{E}[X(\theta)]\ :\ L_{0,T}\mu = \nu,\ L_{1,T}\mu = \theta,\ L_2\theta \le g \right\},$$
since $\mu_{\pi}$ is feasible for Problem (6.4) for any $\pi \in \Pi_{SD,T}$, and $\pi_{\mu} \in \Pi_{SD,T}$ for any $\mu$ feasible for Problem (6.4). So, the optimal values of Problem (6.3) and Problem (6.4) are equal.


We now derive the dual to Problem (6.4). As for Problem (5.3), duality can be used to obtain a certificate of optimality for Problem (6.4). Additionally, dynamic programming equations will emerge, although there is a significant difference between the dynamic programming equations for the unconstrained Problem (5.2) and those for the constrained Problem (6.4). We first compute the adjoint of $L_2$, since it will appear in the dual to Problem (6.4).

Lemma 6.7. The adjoint of $L_2$ is $L_2^{*} : M([\eta_1, \eta_2]) \to F(\mathbb{R})$ defined by
$$[L_2^{*}\Lambda](y) := \int_{\eta_1}^{\eta_2} (y-\eta)^{+}\, \Lambda(d\eta), \quad \forall y \in \mathbb{R}.$$

Proof. Take $\Lambda \in M([\eta_1, \eta_2])$; then
$$\langle \Lambda,\, L_2\theta \rangle = \int_{\eta_1}^{\eta_2} \left( \int_{Y} (y-\eta)^{+}\, \theta(dy) \right) \Lambda(d\eta) = \int_{Y} \left( \int_{\eta_1}^{\eta_2} (y-\eta)^{+}\, \Lambda(d\eta) \right) \theta(dy),$$
by Fubini's theorem.

We report the dual of Problem (6.4) in the next theorem. It is an optimization problem in value functions, and it has some similarities to Problem (5.5)-(5.6). However, the Lagrange multiplier of the stochastic dominance constraint will now appear in the dual.

Theorem 6.8. The dual to Problem (6.4) is
$$\sup_{v \in F(\bar{S}),\, \Lambda \in M([\eta_1, \eta_2])}\ \langle v, \nu \rangle - \langle \Lambda, g \rangle \qquad (6.6)$$
$$\text{s.t.}\quad v(s,y,z) \le \gamma \int_{\bar{S}} v(\xi)\, Q_T(d\xi \mid s,y,z,a) + \frac{1-\gamma}{\gamma^{T}}\, y\, \mathbb{I}\left\{ z \le \gamma^{T} \right\} + \frac{1-\gamma}{\gamma^{T}} \int_{\eta_1}^{\eta_2} (y-\eta)^{+}\, \Lambda(d\eta)\, \mathbb{I}\left\{ z \le \gamma^{T} \right\},$$
$$\forall (s,y,z,a) \in \bar{K}. \qquad (6.7)$$

Proof. Let the Lagrange multiplier for the constraint $L_{0,T}\mu = \nu$ be $v \in F(\bar{S})$, the Lagrange multiplier for the constraint $L_{1,T}\mu = \theta$ be $w \in F(\mathbb{R})$, and the Lagrange multiplier for the constraint $L_2\theta \le g$ be $\Lambda \in M([\eta_1, \eta_2])$. The Lagrangian for Problem (6.4) is then
$$\mathcal{L}(\mu,\theta,v,w,\Lambda) = \mathbb{E}[X(\theta)] + \langle v,\, L_{0,T}\mu - \nu \rangle + \langle w,\, L_{1,T}\mu - \theta \rangle + \langle \Lambda,\, L_2\theta - g \rangle.$$
Problem (6.4) is then equivalent to
$$\inf_{\theta,\,\mu \ge 0}\ \sup_{v,\,w,\,\Lambda \ge 0}\ \mathcal{L}(\mu,\theta,v,w,\Lambda),$$
so the dual problem is defined to be
$$\sup_{v,\,w,\,\Lambda \ge 0}\ \inf_{\theta,\,\mu \ge 0}\ \mathcal{L}(\mu,\theta,v,w,\Lambda).$$
Rearranging the Lagrangian,
$$\mathcal{L}(\mu,\theta,v,w,\Lambda) = \langle \theta, y \rangle + \langle v,\, L_{0,T}\mu - \nu \rangle + \langle w,\, L_{1,T}\mu - \theta \rangle + \langle \Lambda,\, L_2\theta - g \rangle = \langle \theta,\, y - w + L_2^{*}\Lambda \rangle + \langle \mu,\, L_{0,T}^{*}v + L_{1,T}^{*}w \rangle - \langle v, \nu \rangle - \langle \Lambda, g \rangle,$$
we see that the dual problem is
$$\sup_{v,\,w,\,\Lambda \ge 0}\ \left\{ -\langle v, \nu \rangle - \langle \Lambda, g \rangle\ :\ y - w + L_2^{*}\Lambda \ge 0,\ \ L_{0,T}^{*}v + L_{1,T}^{*}w \ge 0 \right\}.$$
Switch the sign of $v$, and take $w(y) = y + \int_{\eta_1}^{\eta_2} (y-\eta)^{+}\, \Lambda(d\eta)$ to obtain the desired form.

Remark 6.9. The dual Problem (6.6)-(6.7) is naturally a linear programming problem, since the primal Problem (6.4) is an LP. Problem (6.6)-(6.7) is significant for two reasons. First, it reveals the role that utility functions play as the Lagrange multipliers of the stochastic dominance constraints. This result was already discovered in stochastic optimization in [10, 11], and it is unsurprising that it holds for MDPs as well. Basically, the Lagrange multiplier $\Lambda$ induces an increasing convex function of $y$ through the expression
$$\int_{\eta_1}^{\eta_2} (y-\eta)^{+}\, \Lambda(d\eta)\, \mathbb{I}\left\{ z \le \gamma^{T} \right\}.$$
Each $(y-\eta)^{+}$ is increasing and convex in $y$, $\Lambda$ is a nonnegative measure, the sum of increasing convex functions is increasing and convex, and we view $z$ as fixed so $\mathbb{I}\{ z \le \gamma^{T} \}$ is a constant. Second, Problem (6.6)-(6.7) reveals a new form of dynamic programming equations.

The resulting dynamic programming equations for Problem (6.3) are
$$v(s,y,z) = \inf_{a \in A(s)} \left\{ \frac{1-\gamma}{\gamma^{T}}\left[ y + \int_{\eta_1}^{\eta_2} (y-\eta)^{+}\, \Lambda(d\eta) \right] \mathbb{I}\left\{ z \le \gamma^{T} \right\} + \gamma \int_{\bar{S}} v(\xi)\, Q_T(d\xi \mid s,y,z,a) \right\},$$
$$\forall (s,y,z) \in \bar{S}. \qquad (6.8)$$
Notice that the dual variable $\Lambda$ appears in the preceding objective function: $\Lambda$ is present because the original problem was constrained. The entire expression
$$\frac{1-\gamma}{\gamma^{T}}\left[ y + \int_{\eta_1}^{\eta_2} (y-\eta)^{+}\, \Lambda(d\eta) \right] \mathbb{I}\left\{ z \le \gamma^{T} \right\}$$
acts as a cost function on the augmented state space. Equation (6.8) thus does not represent Bellman iteration in the traditional sense, since there is an external tuning parameter $\Lambda$ that is not determined by value iteration. The appearance of such an external tuning parameter is typical for constrained MDPs (see [1]): in general, constraints in optimization and control problems cause Lagrange multipliers to appear.

We make the following assumptions to guarantee solvability and strong duality.

Assumption 6.10. There exists a bounded minimizing sequence $\left\{ (v_i, \Lambda_i) \right\}_{i \ge 0}$ for Problem (6.6)-(6.7).

This type of assumption is common in the literature on infinite-dimensional LPs (see [2, 21, 19, 20], for example). It ensures that the dual optimal value is attained by exhibiting a minimizing sequence that attains this value. The next theorem summarizes solvability and strong duality results for Problem (6.3), and is based on linear programming duality.


Theorem 6.11. Suppose Assumptions 2.1, 5.5, and 6.10 hold. Also suppose Problem

(6.3) is feasible. Then:

(i) Problem (6.4) is solvable;

(ii) There is no duality gap between Problems (6.4) and (6.6) - (6.7);

(iii) Strong duality holds between Problems (6.4) and (6.6) - (6.7).

Remark 6.12. Our development in this section extends to multivariate stochastic dominance constraints with minor modifications. Suppose we have a vector-valued cost function $d : K \to \mathbb{R}^{n}$ in addition to $c$. We are interested in constraining the distribution of the vector-valued random variable $\sum_{t=0}^{\infty} \gamma^{t} d(s_t,a_t)$. However, for $n \ge 2$ there is no parametric representation of $\preceq_{icv}$; see [24, 8]. In general, we will take $\mathcal{U}$ to be a collection of increasing concave functions from $\mathbb{R}^{n}$ to $\mathbb{R}$, and the associated relaxed dominance constraints are
$$\mathbb{E}^{\pi}\left[ u\left( \sum_{t=0}^{\infty} \gamma^{t} d(s_t,a_t) \right) \right] \ge \mathbb{E}[u(Y)], \quad \forall u \in \mathcal{U}.$$
The development for this case is largely similar to the one in this section, at the expense of more complicated notation and some further technical assumptions on $d$ and $\mathcal{U}$.

6.2. Chance constraints. We now consider chance-constrained MDPs in this subsection. Chance-constrained optimization problems are usually nonconvex and very difficult to solve. Fortunately, in our framework they lead to linear programming problems, because we are optimizing over measures rather than random variables. The development here is actually quite similar to the one in the preceding subsection.

If we view the indicator function as a type of utility step function, then the probability
$$\mathbb{P}^{\pi}\left\{ \sum_{t=0}^{\infty} \gamma^{t} c(s_t,a_t) \le \eta \right\} = \mathbb{E}^{\pi}\left[ \mathbb{I}\left\{ \sum_{t=0}^{\infty} \gamma^{t} c(s_t,a_t) \le \eta \right\} \right]$$
is an expected utility. A collection of chance constraints is thus similar to a collection of constraints on expected utilities, like stochastic dominance constraints. The general chance-constrained MDP is
$$\inf_{\pi \in \Pi}\ \left\{ \mathbb{E}[C^{\pi}]\ :\ \mathbb{P}\left\{ C^{\pi} \le \eta_i \right\} \ge 1-\alpha_i,\ i \in I \right\} \qquad (6.9)$$
$$= \inf_{\pi \in \Pi}\ \left\{ \mathbb{E}^{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} c(s_t,a_t) \right]\ :\ \mathbb{P}^{\pi}\left\{ \sum_{t=0}^{\infty} \gamma^{t} c(s_t,a_t) \le \eta_i \right\} \ge 1-\alpha_i,\ i \in I \right\},$$
where $\eta_i \ge 0$ and $\alpha_i \in (0,1)$ for all $i \in I$ are given constants. We use the next lemma to justify truncation of Problem (6.9).

Lemma 6.13. For any $\eta \ge 0$ and $\epsilon > 0$, there is a $T$ such that
$$\left| \mathbb{P}^{\pi}\left\{ \sum_{t=0}^{T} \gamma^{t} c(s_t,a_t) \le \eta \right\} - \mathbb{P}^{\pi}\left\{ \sum_{t=0}^{\infty} \gamma^{t} c(s_t,a_t) \le \eta \right\} \right| \le \epsilon$$
for all $\pi \in \Pi$.

Proof. We already know that $y_T$ converges to $\sum_{t=0}^{\infty} \gamma^{t} c(s_t,a_t)$ on all trajectories as $T \to \infty$, and thus
$$\mathbb{P}^{\pi}\left\{ \sum_{t=0}^{T} \gamma^{t} c(s_t,a_t) \le \eta \right\} \to \mathbb{P}^{\pi}\left\{ \sum_{t=0}^{\infty} \gamma^{t} c(s_t,a_t) \le \eta \right\}$$


as $T \to \infty$, since almost sure convergence implies convergence in distribution.

We now consider the truncation
$$\inf_{\pi \in \Pi}\ \left\{ \mathbb{E}\left[ C^{\pi}_{\nu,T} \right]\ :\ \mathbb{P}\left\{ C^{\pi}_{\nu,T} \le \eta_i \right\} \ge 1-\alpha_i,\ i \in I \right\}. \qquad (6.10)$$
Since we have the distribution of $C^{\pi}_{\nu,T}$, we can compute $\mathbb{P}\{ C^{\pi}_{\nu,T} \le \eta_i \}$ with Theorem 3.5 to get
$$\mathbb{P}\left\{ C^{\pi}_{\nu,T} \le \eta_i \right\} = \frac{1-\gamma}{\gamma^{T}} \int \mathbb{I}\left\{ y \le \eta_i,\ z = \gamma^{T} \right\} \mu_{\pi}\big(d(s,y,z,a)\big) = \int \mathbb{I}\{ y \le \eta_i \}\, \theta(dy),$$
where $\theta = L_{1,T}\mu_{\pi}$. In particular, $\mathbb{P}\{ C^{\pi}_{\nu,T} \le \eta_i \}$ is a linear function of $\mu_{\pi}$, and thus it is a linear function of $\theta$. We introduce a linear operator $L_3 : M(\mathbb{R}) \to \mathbb{R}^{|I|}$ defined by
$$[L_3\theta]_i := \int_{\mathbb{R}} \mathbb{I}\{ y \le \eta_i \}\, \theta(dy), \quad \forall i \in I.$$
We also define $g = (1-\alpha_i)_{i \in I}$ to get
$$\inf_{\mu \in M(\bar{K}),\, \theta \in M(Y)}\ \left\{ \mathbb{E}[X(\theta)]\ :\ L_{0,T}\mu = \nu,\ L_{1,T}\mu = \theta,\ L_3\theta \ge g \right\}. \qquad (6.11)$$

Problem (6.11) is a linear programming problem. We will compute its dual to obtain dynamic programming equations, which are similar to those for Problem (6.3). First, we derive the adjoint of $L_3$.

Lemma 6.14. The adjoint of $L_3$ is $L_3^{*} : \mathbb{R}^{|I|} \to F(\mathbb{R})$ defined by
$$[L_3^{*}\lambda](y) := \sum_{i \in I} \lambda_i\, \mathbb{I}\{ y \le \eta_i \}, \quad \forall y \in \mathbb{R}.$$

Proof. Take $\lambda \in \mathbb{R}^{|I|}$; then
$$\langle \lambda,\, L_3\theta \rangle = \sum_{i \in I} \lambda_i \int \mathbb{I}\{ y \le \eta_i \}\, \theta(dy) = \int \left[ \sum_{i \in I} \lambda_i\, \mathbb{I}\{ y \le \eta_i \} \right] \theta(dy).$$

The computation of the dual of Problem (6.11) is similar to the one for the dual of Problem (6.4), and we omit the detailed computation. Furthermore, strong duality holds under similar sufficient conditions.

Theorem 6.15. The dual to Problem (6.11) is
$$\sup_{v \in F(\bar{S}),\, \lambda \in \mathbb{R}^{|I|}}\ \langle v, \nu \rangle - \langle \lambda, g \rangle \qquad (6.12)$$
$$\text{s.t.}\quad v(s,y,z) \le \gamma \int_{\bar{S}} v(\xi)\, Q_T(d\xi \mid s,y,z,a) + \frac{1-\gamma}{\gamma^{T}}\, y\, \mathbb{I}\left\{ z \le \gamma^{T} \right\} + \frac{1-\gamma}{\gamma^{T}} \sum_{i \in I} \lambda_i\, \mathbb{I}\{ y \le \eta_i \}\, \mathbb{I}\left\{ z \le \gamma^{T} \right\},$$
$$\forall (s,y,z,a) \in \bar{K}. \qquad (6.13)$$

As in the case of the stochastic dominance-constrained MDPs, we can infer dynamic programming equations for Problem (6.10) from the dual Problem (6.12)-(6.13) as well. Such equations have an external tuning parameter $\lambda$, and are thus not true dynamic programming equations. Instead, it is better to solve Problem (6.11) and its dual by linear programming.


Remark 6.16. We now consider $S$ and $A$ to be finite so that we can apply the discretization technique from Subsection 4.2 to get finite-dimensional versions of Problems (6.10) and (6.12)-(6.13). This explicitly brings out the linear programming structure of the chance-constrained problem.

First, the discretized primal problem in occupation measures is
$$\inf_{\mu,\,\theta}\ \sum_{y \in Y} y\, \theta(y) \qquad (6.14)$$
$$\text{s.t.}\quad \nu(j) = \sum_{a \in A} \mu(j,a) - \gamma \sum_{(s,y,z,a) \in \bar{K}} Q_T(j \mid s,y,z,a)\, \mu(s,y,z,a), \quad \forall j \in \bar{S},$$
$$\theta(\xi) = \frac{1-\gamma}{\gamma^{T}} \sum_{(s,y,z,a) \in \bar{K}} \mathbb{I}\left\{ y = \xi,\ z = \gamma^{T} \right\} \mu(s,y,z,a), \quad \forall \xi \in Y,$$
$$\sum_{y \in Y} \mathbb{I}\{ y \le \eta_i \}\, \theta(y) \ge 1-\alpha_i, \quad \forall i \in I.$$
The dual problem is now simply
$$\sup_{v,\,\lambda}\ \sum_{j \in \bar{S}} v(j)\, \nu(j) - \sum_{i \in I} \lambda_i (1-\alpha_i)$$
$$\text{s.t.}\quad v(s,y,z) \le \gamma \sum_{\xi \in \bar{S}} v(\xi)\, Q_T(\xi \mid s,y,z,a) + \frac{1-\gamma}{\gamma^{T}}\, y\, \mathbb{I}\left\{ z \le \gamma^{T} \right\} + \frac{1-\gamma}{\gamma^{T}} \sum_{i \in I} \lambda_i\, \mathbb{I}\{ y \le \eta_i \}\, \mathbb{I}\left\{ z \le \gamma^{T} \right\}, \quad \forall (s,y,z,a) \in \bar{K}.$$
Both of these problems are finite-dimensional linear programming problems. When either problem is feasible, the other is feasible and we automatically have strong duality. A sketch of how the primal LP can be assembled and solved numerically is given at the end of this remark.

Remark 6.17. It is striking that our chance-constrained MDPs give rise to linear programming problems in occupation measures, since chance-constrained stochastic optimization problems are typically nonconvex. This observation is due to the fact that we are optimizing over measures rather than over random-variable-valued mappings. Specifically, the probability of the event $\left\{ \sum_{t=0}^{T} \gamma^{t} c(s_t,a_t) \le \eta \right\}$ is a linear function of the occupation measure.
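To illustrate Remark 6.16, the following Python sketch (ours, under stated assumptions) assembles and solves a version of the discretized primal LP (6.14) with `scipy.optimize.linprog`. All model data are hypothetical placeholders: `Q[k, a, j]` holds the augmented transition probabilities $Q_T(j \mid k, a)$, `nu` the initial distribution, `y_of[k]` the $y$-grid index of augmented state $k$, `at_T[k]` the indicator that $z = \gamma^T$ in state $k$, and `etas`/`alphas` the chance-constraint data.

```python
import numpy as np
from scipy.optimize import linprog

def chance_constrained_lp(Q, nu, y_of, at_T, y_grid, gamma, T, etas, alphas):
    """Assemble and solve the discretized chance-constrained primal LP (6.14)."""
    nSbar, nA, _ = Q.shape
    nY = y_grid.size
    n_mu = nSbar * nA
    cost = np.concatenate([np.zeros(n_mu), y_grid])     # objective: sum_y y * theta(y)

    A_eq = np.zeros((nSbar + nY, n_mu + nY))
    b_eq = np.concatenate([nu, np.zeros(nY)])
    for k in range(nSbar):
        for a in range(nA):
            col = k * nA + a
            A_eq[k, col] += 1.0                          # sum_a mu(k, a)
            A_eq[:nSbar, col] -= gamma * Q[k, a]         # -gamma * Q_T(j | k, a) mu(k, a)
            if at_T[k]:                                  # theta(xi) balance rows
                A_eq[nSbar + y_of[k], col] -= (1 - gamma) / gamma**T
    A_eq[nSbar:, n_mu:] = np.eye(nY)

    # chance constraints: -sum_{y <= eta_i} theta(y) <= -(1 - alpha_i)
    A_ub = np.zeros((len(etas), n_mu + nY))
    for i, eta in enumerate(etas):
        A_ub[i, n_mu:] = -(y_grid <= eta).astype(float)
    b_ub = -(1.0 - np.asarray(alphas))

    return linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
```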

7. Numerical illustration. We now illustrate the dependence of the size of the discretized static optimization problems on the desired accuracy. We fix an error tolerance $\epsilon > 0$ throughout this section and compute the truncation $T = T(\epsilon)$. In particular, if we choose
$$T(\epsilon) = \left\lceil \frac{\log(\epsilon) + \log(1-\gamma) - \log(c)}{\log(\gamma)} \right\rceil, \qquad (7.1)$$
it ensures that $\sum_{t=T+1}^{\infty} |\gamma^{t} c(s_t,a_t)| < \epsilon$. Then, for the expected utility minimizing MDP, if the utility function $u(\cdot)$ is Lipschitz continuous with constant 1, then for such a $T(\epsilon)$ we have
$$\left| \mathbb{E}^{\pi}\left[ u\left( \sum_{t=0}^{T} \gamma^{t} c(s_t,a_t) \right) \right] - \mathbb{E}^{\pi}\left[ u\left( \sum_{t=0}^{\infty} \gamma^{t} c(s_t,a_t) \right) \right] \right| < \epsilon.$$


                          # of variables    # of constraints
  Expected utility            484,022            48,422
  Stochastic dominance        484,022            48,522
  Chance constraints          484,022            48,522
  CVaR                        484,022            48,422

Figure 7.1. Size of optimization problems for various discretized optimization problems.

Similar calculations for the stochastic dominance-constrained and the chance-constrained MDPs show that truncation at $T(\epsilon)$ will result in an error of at most $\epsilon$. The CVaR-minimizing MDP differs slightly, and here $T(\epsilon)$ will result in an error of at most $\epsilon/(1-\alpha)$. Once the threshold $T(\epsilon)$ is chosen, we discretize $Y = [0, c]$ with granularity $\epsilon / T(\epsilon)$ to obtain $Y$ such that
$$|Y| = \left\lceil \frac{c}{\epsilon / T(\epsilon)} \right\rceil = \left\lceil \frac{c\, T(\epsilon)}{\epsilon} \right\rceil,$$
and we also obtain $|Z| = T(\epsilon)$. Each of the discretized primal problems (Problems 5.4, 5.11, 6.5 and 6.14) then has
$$|\bar{K}| + |Y| = |S|\,|Y|\,|Z|\,|A| + |Y| = |Y|\,(|S|\,|Z|\,|A| + 1) = \left\lceil \frac{c\, T(\epsilon)}{\epsilon} \right\rceil \big( |S|\,|A|\,T(\epsilon) + 1 \big)$$
variables. Expected utility and CVaR both have
$$|\bar{S}| + |Y| = |S|\,|Y|\,|Z| + |Y| = |Y|\,(|S|\,|Z| + 1) = \left\lceil \frac{c\, T(\epsilon)}{\epsilon} \right\rceil \big( |S|\,T(\epsilon) + 1 \big)$$
constraints. The stochastic dominance and chance-constrained MDPs both have
$$|\bar{S}| + |Y| + |I| = |Y|\,(|S|\,|Z| + 1) + |I| = \left\lceil \frac{c\, T(\epsilon)}{\epsilon} \right\rceil \big( |S|\,T(\epsilon) + 1 \big) + |I|$$
constraints. We point out that the discretized CVaR problem is a nonlinear optimization problem, while all the others are linear programming problems.

Let us now consider what these numbers look like for a particular MDP. Suppose the state and action space sizes are $|S| = 100$ and $|A| = 10$, the discount factor is $\gamma = 0.9$, and the upper bound on the costs is $c = 1$. Let us take $|I| = 100$. For CVaR, we use $\alpha = 0.9$, and take $\epsilon = 1$. With this, $T(\epsilon) = 22$.
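The bookkeeping above is easy to reproduce. The short Python script below (ours) computes $T(\epsilon)$ from (7.1) and the resulting problem sizes for the example values just given.

```python
import math

def truncation_horizon(eps, gamma, cbar):
    # smallest T with sum_{t > T} gamma^t * cbar <= eps, per (7.1)
    return math.ceil((math.log(eps) + math.log(1 - gamma) - math.log(cbar))
                     / math.log(gamma))

S, A, I = 100, 10, 100
gamma, cbar, eps = 0.9, 1.0, 1.0

T = truncation_horizon(eps, gamma, cbar)        # 22
Y = math.ceil(cbar * T / eps)                   # |Y| = 22 cost levels
Z = T                                           # |Z| = T

variables = Y * (S * Z * A + 1)                 # 484,022 for all four problems
constraints_eu_cvar = Y * (S * Z + 1)           # 48,422 (expected utility, CVaR)
constraints_sd_cc = constraints_eu_cvar + I     # 48,522 (dominance, chance constraints)
print(T, variables, constraints_eu_cvar, constraints_sd_cc)
```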

From the table in Figure 7.1, we see that these are large linear programs, but they are solvable on standard laptops available today.

8. Conclusion. In this paper, we introduce a framework of risk-aware MDPs that generalizes the classical MDP and constrained MDP models. In the classical models, the expectation of the infinite horizon discounted cost is considered; we replace this expectation, a risk-neutral measure, with a general risk functional. This framework encompasses many popular ways of expressing risk: expected (dis-)utility models, coherent risk measures such as Conditional Value-at-Risk, stochastic dominance constraints, and chance constraints. Prior attempts have focused on developing dynamic programming algorithms, albeit with limited success. This is because in such problems optimal policies are not stationary and, by the nature of the problem, are history dependent, whereas dynamic programming methods are successful when the underlying stochastic processes are Markovian. We thus develop a convex analytic approach. We augment the state space and introduce an occupation measure on it, which then yields optimization formulations that are in most cases linear programs. These are indeed infinite-dimensional LPs, and hence we give methods for successive finite approximation of such LPs. A striking result here is that, unlike in static optimization, the chance-constrained MDP can be solved via a linear program. This is very promising. The methods and techniques we have developed are quite general and can be used to solve other risk-aware MDPs beyond those treated here.

REFERENCES

[1] Eitan Altman. Constrained Markov Decision Processes, volume 7. CRC Press, 1999.
[2] Edward J. Anderson and Peter Nash. Linear Programming in Infinite-Dimensional Spaces. John Wiley & Sons, 1987.
[3] Nicole Bauerle and Jonathan Ott. Markov decision processes with average-value-at-risk criteria. Mathematical Methods of Operations Research, 74(3):361-379, 2011.
[4] Nicole Bauerle and Ulrich Rieder. More risk-sensitive Markov decision processes. Mathematics of Operations Research, 39(1):105-120, 2014.
[5] V. Borkar and R. Jain. Risk-constrained Markov decision processes. In Proc. of the IEEE Control and Decision Conference, 2010.
[6] Vivek S. Borkar. A convex analytic approach to Markov decision processes. Probability Theory and Related Fields, 78(4):583-602, 1988.
[7] Vivek S. Borkar. Convex analytic methods in Markov decision processes. In Eugene A. Feinberg, Adam Shwartz, and Frederick S. Hillier, editors, Handbook of Markov Decision Processes, volume 40 of International Series in Operations Research & Management Science, pages 347-375. Springer US, 2002.
[8] E. M. Bronshtein. Extremal convex functions. Sibirskii Matematicheskii Zhurnal, 19(1):10-18, 1978.
[9] Ozlem Cavus and Andrzej Ruszczynski. Computational methods for risk-averse undiscounted transient Markov models. Operations Research, 2014.
[10] Darinka Dentcheva and Andrzej Ruszczynski. Optimization with stochastic dominance constraints. SIAM Journal on Optimization, 14(2):548-566, 2003.
[11] Darinka Dentcheva and Andrzej Ruszczynski. Optimality and duality theory for stochastic optimization problems with nonlinear dominance constraints. Mathematical Programming, 99:329-350, 2004.
[12] Darinka Dentcheva and Andrzej Ruszczynski. Optimization with multivariate stochastic dominance constraints. Mathematical Programming, 117:111-127, 2009.
[13] Cyrus Derman. Finite State Markovian Decision Processes. Academic Press, Inc., Orlando, FL, USA, 1970.
[14] G. Di Masi and L. Stettner. Risk-sensitive control of discrete-time Markov processes with infinite horizon. SIAM Journal on Control and Optimization, 38(1):61-78, 1999.
[15] Jerzy A. Filar, L. C. M. Kallenberg, and Huey-Miin Lee. Variance-penalized Markov decision processes. Mathematics of Operations Research, 14(1):147-161, 1989.
[16] Hans Follmer and Alexander Schied. Stochastic Finance: An Introduction in Discrete Time. Walter de Gruyter, 2004.
[17] M. Frittelli, M. Maggis, and I. Peri. Risk measures on P(R) and value at risk with probability/loss function. To appear in Mathematical Finance, 2012.
[18] William B. Haskell and Rahul Jain. Stochastic dominance-constrained Markov decision processes. SIAM Journal on Control and Optimization, 51(1):273-303, 2013.
[19] Onesimo Hernandez-Lerma and Juan Gonzalez-Hernandez. Constrained Markov control processes in Borel spaces: the discounted case. Mathematical Methods of Operations Research, 52:271-285, 2000.
[20] Onesimo Hernandez-Lerma, Juan Gonzalez-Hernandez, and Raquiel R. Lopez-Martinez. Constrained average cost Markov control processes in Borel spaces. SIAM Journal on Control and Optimization, 42(2):442-468, 2003.
[21] Onesimo Hernandez-Lerma and Jean B. Lasserre. Approximation schemes for infinite linear programs. SIAM Journal on Optimization, 8(4):973-988, 1998.
[22] Onesimo Hernandez-Lerma and Jean Bernard Lasserre. Discrete-Time Markov Control Processes: Basic Optimality Criteria. Springer-Verlag New York, Inc., 1996.
[23] Onesimo Hernandez-Lerma and Jean Bernard Lasserre. Further Topics on Discrete-Time Markov Control Processes. Springer-Verlag New York, Inc., 1999.
[24] Soren Johansen. The extremal convex functions. Mathematica Scandinavica, 34:61-68, 1974.
[25] Lodewijk Cornelis Maria Kallenberg. Linear programming and finite Markovian control problems. MC Tracts, 148:1-245, 1983.
[26] David M. Kreps. Decision problems with expected utility criteria, I: upper and lower convergent utility. Mathematics of Operations Research, 2(1):45-53, 1977.
[27] David M. Kreps. Decision problems with expected utility criteria, II: stationarity. Mathematics of Operations Research, 2(3):266-274, 1977.
[28] Shigeo Kusuoka. On law invariant coherent risk measures. Advances in Mathematical Economics, 3(1):83-95, 2001.
[29] Jean-Bernard Lasserre. Moments, Positive Polynomials and Their Applications, volume 1. World Scientific, 2009.
[30] Olvi L. Mangasarian. Solution of general linear complementarity problems via nondifferentiable concave minimization. Acta Mathematica Vietnamica, 22(1):199-205, 1997.
[31] Alan S. Manne. Linear programming and sequential decisions. Management Science, 6(3):259-267, 1960.
[32] Harry Markowitz. Portfolio selection. The Journal of Finance, 7(1):77-91, 1952.
[33] Alfred Muller and Dietrich Stoyan. Comparison Methods for Stochastic Models and Risks. John Wiley and Sons, Inc., 2002.
[34] A. B. Piunovskiy. Optimal Control of Random Sequences in Problems with Constraints, volume 410. Kluwer Academic Publishers, 1997.
[35] Andras Prekopa. On probabilistic constrained programming. In Proceedings of the Princeton Symposium on Mathematical Programming, pages 113-138. Princeton University Press, Princeton, New Jersey, 1970.
[36] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2005.
[37] R. Tyrrell Rockafellar and Stanislav Uryasev. Optimization of conditional value-at-risk. Journal of Risk, 2:21-42, 2000.
[38] A. Ruszczynski. Risk-averse dynamic programming for Markov decision processes. Mathematical Programming, 125(2):235-261, 2010.
[39] Andrzej Ruszczynski and Alexander Shapiro. Optimization of convex risk functions. Mathematics of Operations Research, 31(3):433-452, 2006.
[40] Moshe Shaked and J. George Shanthikumar. Stochastic Orders. Springer, 2007.
[41] Matthew J. Sobel. Mean-variance tradeoffs in an undiscounted MDP. Operations Research, 42(1):175-183, 1994.
[42] G. Peter Todd. Mean-Variance Analysis in Portfolio Choice and Capital Markets, volume 66. John Wiley & Sons, 2000.
[43] Buheeerdun Yang. Conditional value-at-risk minimization in finite state Markov decision processes: Continuity and compactness. Journal of Uncertain Systems, 7(1):50-57, 2013.