

Optimal Policies Tend to Seek Power

Alexander Matt Turner
Oregon State University
[email protected]

Logan Smith
Mississippi State University
[email protected]

Rohin Shah
UC Berkeley
[email protected]

Andrew Critch
UC Berkeley
[email protected]

Prasad Tadepalli
Oregon State University
[email protected]

arXiv:1912.01683v6 [cs.AI] 2 Dec 2020

Abstract

Some researchers have speculated that capable reinforcement learning (RL) agents are often incentivized to seek resources and power in pursuit of their objectives. While seeking power in order to optimize a misspecified objective, agents might be incentivized to behave in undesirable ways, including rationally preventing deactivation and correction. Others have voiced skepticism: human power-seeking instincts seem idiosyncratic, and these urges need not be present in RL agents. We formalize a notion of power within the context of Markov decision processes (MDPs). With respect to a class of neutral reward function distributions, we provide sufficient conditions for when optimal policies tend to seek power over the environment.

1 Introduction

Power is the ability to achieve goals in general. “Money is power”, and money helps one achieve many goals. Conversely, physical restraint reduces one’s ability to steer the situation towards various ends. A deactivated agent cannot influence the future to achieve its goals, and so has no power.

An action is said to be instrumental when it helps achieve an objective. Some actions are instrumental to many objectives, making them robustly instrumental. The so-called instrumental convergence thesis is the claim that agents with many different goals, if given time to learn and plan, will eventually converge on exhibiting certain common patterns of behavior that are robustly instrumental (e.g. survival, accessing usable energy, access to computing resources). Bostrom [3]’s instrumental convergence thesis might more aptly be called the robust instrumentality thesis, because it makes no reference to limits or converging processes:

Several instrumental values can be identified which are convergent in the sense that their attainment would increase the chances of the agent’s goal being realized for a wide range of final goals and a wide range of situations, implying that these instrumental values are likely to be pursued by a broad spectrum of situated intelligent agents.

Some authors have suggested that gaining power over the environment is a robustly instrumental behavior pattern [16, 3, 20] on which learning agents generally converge as they tend towards optimality. If so, robust instrumentality presents a safety concern for the alignment of advanced RL systems with human society: such systems might seek to gain power over humans as part of their environment. For example, Marvin Minsky imagined that an agent tasked with proving the Riemann hypothesis might rationally turn the planet into computational resources [21].



In a recent public debate on whether optimal policies tend to seek power, some established researchers have argued that to impute power-seeking motives is to anthropomorphize [32]. LeCun and Zador [11] argued that “artificial intelligence never needed to evolve, so it didn’t develop the survival instinct that leads to the impulse to dominate others.”

We make no supposition about the timeline over which real-world power-seeking behavior could become plausible. Instead, we concern ourselves with the theoretical consequences of RL agents acting optimally in their environment. We show that robust instrumentality arises not from anthropomorphism on the part of the designers, but from the structural properties of MDPs.

Highlighted contributions. Appendix E.1 lists MDP theory contributions of independent interest.

We aim to show that optimal policies tend to seek power. To do so, we will need new machinery. Section 4 formalizes power. Appendix A argues that our formalization of power improves upon past formulations, such as Salge et al. [23]’s empowerment.

Section 5 formalizes when and how optimal policy sets change with the discount rate, providing formalisms for understanding how agent incentives change with time preference. Theorem 5.3 characterizes the deterministic environments in which there exists a reward function whose optimal policy set depends on the discount rate.

Section 6 formalizes power-seeking and robust instrumentality, allowing discussion of what agents “tend” to do. Theorem 6.2 shows the impossibility of graphical necessary and sufficient conditions for power-seeking being robustly instrumental.

Finally, we provide sufficient conditions for power-seeking being robustly instrumental. Theorem 6.6 shows it’s always robustly instrumental and power-seeking to take actions which allow strictly more control over the future (in a graphical sense). Theorem 6.9 shows that when the discount rate is close enough to 1, optimal policies tend to steer towards states which allow access to more recurrent state distributions. This behavior is likewise power-seeking.

2 Related Work

In appendix A, we compare our formalization of power to Salge et al. [23]’s empowerment. Benson-Tilsen and Soares [1] explored how robust instrumentality arises in a particular toy model. Much past work has focused on one robustly instrumental subgoal: rationally resisting deactivation [27, 15, 9, 4]. Robustly instrumental states have been recognized as important to learning: Menache et al. [14] learned sub-policies which reach state-space bottlenecks. In economics, turnpike theory studies how certain paths of accumulation are more probably optimal [13].

Multi-objective MDPs require balancing the maximization of several objectives [19], while we examine how MDP structure determines the ability to maximize objectives in general. Past work noted that value functions encode important information about the environment [6, 5, 29, 25]. In particular, Schaul et al. [25] learn regularities across value functions; we believe this implies that some states are valuable for many different reward functions (i.e. powerful). Similarly, Turner et al. [30] speculate that optimal value at a state is heavily correlated across reward functions.

3 State Visit Distributions

Appendix E lists theorems and definitions, while appendix F contains proofs and more results.

Definition 3.1 (Rewardless MDP). 〈S, A, T〉 is a rewardless MDP with finite state and action spaces S and A, and stochastic transition function T : S × A → ∆(S). By default, the discount rate γ ∈ (0, 1) is left variable in our treatment, and we refer to rewardless MDPs simply as “MDPs.”

Definition 3.2 (Equivalent actions at a state). Actions a1 and a2 are equivalent at state s if they induce the same transition probabilities: T(s, a1) = T(s, a2).

Our examples display the directed graph (or model) of the dynamics induced by a deterministic MDP. As shown in fig. 1, actions are only represented up to equivalence.


Figure 1: The initial state is shown in blue. The possible trajectories begin with s1 s2 s2 and s1 s3 s3, so F(s1) = { (1, γ/(1−γ), 0)^⊤, (1, 0, γ/(1−γ))^⊤ }.

Definition 3.3 (State visit distribution [28]). Let e_{s′} ∈ R^{|S|} be the unit vector for state s′, such that there is a 1 in the entry for state s′ and 0 elsewhere. The visit distribution induced by following policy π from state s at discount rate γ ∈ (0, 1) is

f^{π,s}(γ) := Σ_{t=0}^∞ γ^t E_{s′}[ e_{s′} | π followed for t steps from s ].   (1)

f^{π,s} is a visit distribution function; F(s) := { f^{π,s} | π ∈ Π }.

For any reward function over the state space R ∈ R^S and for any state s, the optimal value function at discount rate γ is defined V*_R(s, γ) := max_π V^π_R(s, γ) = max_π f^{π,s}(γ)^⊤ r, where r ∈ R^{|S|} is R expressed as a column vector.
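To make definition 3.3 concrete, here is a minimal Python sketch (ours, not the paper's; it assumes the fig. 1 dynamics, with s2 and s3 absorbing as the caption's trajectories suggest). It computes both visit distribution functions at s1 by solving the linear system implied by eq. (1), then evaluates the optimal value for a sample reward vector.

import numpy as np

def visit_distribution(P, s, gamma):
    # f^{pi,s}(gamma) = sum_t gamma^t E[e_{s_t}] = (I - gamma P^T)^{-1} e_s, where
    # P[i, j] is the probability of moving from state i to state j under policy pi.
    n = P.shape[0]
    e_s = np.zeros(n)
    e_s[s] = 1.0
    return np.linalg.solve(np.eye(n) - gamma * P.T, e_s)

gamma = 0.9
# States are indexed s1 = 0, s2 = 1, s3 = 2; one policy steers s1 to s2, the other to s3.
P_to_s2 = np.array([[0, 1, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
P_to_s3 = np.array([[0, 0, 1], [0, 1, 0], [0, 0, 1]], dtype=float)
f_to_s2 = visit_distribution(P_to_s2, 0, gamma)  # ~ (1, gamma/(1-gamma), 0) = (1, 9, 0)
f_to_s3 = visit_distribution(P_to_s3, 0, gamma)  # ~ (1, 0, gamma/(1-gamma)) = (1, 0, 9)

r = np.array([0.3, 0.7, 0.2])                    # an arbitrary reward vector
print(max(f_to_s2 @ r, f_to_s3 @ r))             # V*_R(s1, gamma) = max_f f(gamma)^T r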

4 Basic Properties of POWER

Optimal policies tend to seek power.

We quantify an agent’s power as its ability to achieve goals in general; this is the dispositional school of thought on the nature of power [24].

Definition 4.1 (Average optimal value). Let X be any continuous distribution bounded over [0, 1], and define D := X^S to be the corresponding distribution over reward functions (reward is therefore IID across states). The average optimal value at state s and discount rate γ ∈ (0, 1) is

V*_avg(s, γ) := E_{R∼D}[ V*_R(s, γ) ] = E_{r∼D}[ max_{f∈F(s)} f(γ)^⊤ r ].   (2)

As an alternative motivation, consider an agent in a communicating MDP which is periodically assigned a task from a known distribution D. There is a fixed-length period of downtime between tasks, during which time the agent can reach any state with probability 1. To maximize return over time, the optimal policy during downtime is to navigate to the state with maximal V*_avg(s, γ).
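As a rough numerical check of definition 4.1 (our sketch, not the paper's; it assumes the fig. 1 dynamics and X uniform on [0, 1]), one can estimate V*_avg(s1, γ) by sampling reward vectors. Because the optimal policy parks in whichever of s2, s3 has higher reward, the closed form is E[R(s1)] + γ/(1−γ)·E[max(R(s2), R(s3))] = 1/2 + (2/3)·γ/(1−γ).

import numpy as np

rng = np.random.default_rng(0)
gamma, n_samples = 0.9, 200_000

# The two visit distribution functions at s1 in fig. 1 (see definition 3.3).
f_to_s2 = np.array([1.0, gamma / (1 - gamma), 0.0])
f_to_s3 = np.array([1.0, 0.0, gamma / (1 - gamma)])

r = rng.uniform(size=(n_samples, 3))                # IID uniform reward over the 3 states
v_avg_estimate = np.maximum(r @ f_to_s2, r @ f_to_s3).mean()
print(v_avg_estimate)                               # roughly 6.5
print(0.5 + (2 / 3) * gamma / (1 - gamma))          # closed form, also 6.5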

Figure 2: For X uniform, V*_avg(s1, γ) > V*_avg(s2, γ) when γ > 2/3. V*_avg(s1, γ) = 1/2 + (1/2)γ + (3/4)·γ²/(1−γ), while V*_avg(s2, γ) = 1/2 + (2/3)·γ/(1−γ). 3/4 is the expected maximum reward among three choices of terminal state; similarly for 2/3 and two choices.
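A short algebraic check of the caption's crossover point (our addition, using only the two closed forms above): V*_avg(s1, γ) > V*_avg(s2, γ) iff (1/2)γ + (3/4)·γ²/(1−γ) > (2/3)·γ/(1−γ). Multiplying both sides by (1−γ)/γ > 0 gives (1/2)(1−γ) + (3/4)γ > 2/3, i.e. 1/2 + γ/4 > 2/3, which holds exactly when γ > 2/3.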

In particular, our results apply to the maximum entropy distribution over reward functions (attained when X is uniform). Positive affine transformation extends our results to X with different bounds.

∀f ∈ F(s), f counts the agent’s initial presence at s, and ‖f(γ)‖_1 = 1/(1−γ) (proposition F.2) diverges as γ → 1. Accordingly, V*_avg(s, γ) includes an initial term of E[X] and diverges as γ → 1.

Definition 4.2 (POWER). Let γ ∈ (0, 1).

POWER(s, γ) := (1−γ)/γ · ( V*_avg(s, γ) − E[X] ) = E_{r∼D}[ max_{f∈F(s)} (1−γ)/γ · ( f(γ) − e_s )^⊤ r ].   (3)

Equation (3) holds by lemma F.26. As lemma 4.4 shows below, the γ = 0 and γ = 1 cases can be defined using the appropriate limits.
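As a worked instance of eq. (3) (our illustration, assuming the fig. 1 MDP with X uniform and s2, s3 absorbing): there, V*_avg(s1, γ) = 1/2 + (2/3)·γ/(1−γ), so POWER(s1, γ) = (1−γ)/γ · ( V*_avg(s1, γ) − E[X] ) = (1−γ)/γ · (2/3)·γ/(1−γ) = 2/3. That is, POWER(s1, γ) equals the expected maximum of the two reachable terminal rewards, independently of γ.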


Informally speaking, definition 4.2 quantifies the agent’s general ability to achieve goals (in this setting, to optimize reward functions). To justify this formalization, we demonstrate that POWER has desirable qualitative properties.

Corollary 4.3 (POWER is the average normalized optimal value).

POWER(s, γ) = E_{R∼D}[ max_{π∈Π} E_{s′∼T(s,π(s))}[ (1−γ) V^π_R(s′, γ) ] ].   (4)

Lemma 4.4 (Continuity of POWER). POWER(s, γ) is Lipschitz continuous on γ ∈ [0, 1].

Lemma 4.5 (Minimal POWER). ∀γ ∈ (0, 1) : POWER(s, γ) ≥ E[X], with equality iff |F(s)| = 1.

Lemma 4.6 (Maximal POWER). POWER(s, γ) ≤ E[max of |S| draws from X], with equality iff s can deterministically reach all states in one step and all states have deterministic self-loops.

Proposition 4.7 (POWER is smooth across reversible dynamics). Suppose s and s′ can both reach each other in one step with probability 1. Then |POWER(s, γ) − POWER(s′, γ)| < (1−γ)/γ.

Lemma 4.8 (Lower bound on current POWER based on future POWER). POWER(s, γ) ≥ (1−γ)E[X] + γ max_a E_{s′∼T(s,a)}[ POWER(s′, γ) ].

If one must wait, one has less control over the future; for example, s1 in fig. 2 has a one-step waiting period. Corollary 4.10 encapsulates this as a convex combination of minimal present control and anticipated future control.

Definition 4.9 (Children). The children of state s are Ch(s) := { s′ | ∃a : T(s, a, s′) > 0 }.

Corollary 4.10 (Delay decreases POWER). Let s_0, . . . , s_ℓ be such that for all i < ℓ, Ch(s_i) = {s_{i+1}}. Then POWER(s_0, γ) = (1 − γ^ℓ)E[X] + γ^ℓ·POWER(s_ℓ, γ).

5 How Discounting Affects the Optimal Policy Set

Optimal policies tend to seek power.

In order to formalize robust instrumentality in section 6, we need to understand how optimal policies change with the discount rate γ.

Definition 5.1 (Optimal policy set function). Π*(R, γ) is the optimal policy set for reward function R at γ.

Definition 5.2 (Optimal policy shift). If γ* ∈ (0, 1) is such that lim_{γ^− ↑ γ*} Π*(R, γ^−) ≠ Π*(R, γ*), then R has an optimal policy shift at γ* (results in appendix F.4 ensure that the limit exists). Similarly, R has an optimal visit distribution shift at γ*.

Figure 3: (a) and (b) show reward functions whose optimal policies shift. No shifts occur in (c) or (d).

Some rewardless MDPs have no optimal policy shifts. In other words, for any reward function and for all γ ∈ (0, 1), greedy policies are optimal; see fig. 3. Optimal policy shifts can occur if and only if the agent can be made to choose between lesser immediate reward and greater delayed reward.
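The following Python sketch (a hypothetical MDP of our own, not the one in fig. 3) illustrates definition 5.2: at s0 the agent chooses between a small immediate reward and a larger but delayed reward, and the optimal action flips as γ grows past the shift point γ* = 0.4.

import numpy as np

def optimal_action(gamma):
    # States: 0 = s0, 1 = "near" terminal (reward 0.4), 2 = waiting state (reward 0),
    # 3 = "far" terminal (reward 1). Terminal states self-loop, so their reward repeats forever.
    r = np.array([0.0, 0.4, 0.0, 1.0])
    v_near = gamma * r[1] / (1 - gamma)                    # s0 -> s1 -> s1 -> ...
    v_far = gamma * r[2] + gamma**2 * r[3] / (1 - gamma)   # s0 -> s2 -> s3 -> s3 -> ...
    return "near" if v_near > v_far else "far"

print(optimal_action(0.2))  # near: the immediate reward wins for small gamma
print(optimal_action(0.8))  # far: the delayed reward wins as gamma approaches 1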


Theorem 5.3 (Characterization of optimal policy shifts in deterministic MDPs). In deterministic environments, there exists a reward function whose optimal action at s_0 changes with γ iff ∃s_1 ∈ Ch(s_0), s′_1 ∈ Ch(s_0), s′_2 ∈ Ch(s′_1) \ Ch(s_1) : s′_2 ∉ Ch(s_0) ∨ ( s_1 ∉ Ch(s_1) ∧ s′_1 ∉ Ch(s_1) ).

Definition 5.4 (Blackwell optimal policies [2]). Π*(R, 1) := lim_{γ→1} Π*(R, γ) is the Blackwell optimal policy set for reward function R. Every reward function has a Blackwell optimal policy set [2].

Figure 4: Let γ ∈ (0, 1), and consider R(s1) = R(s3) := 0, R(s4) := 1. The optimal policy set is not yet Blackwell-optimal if R(s2) ∈ (γ, 1).

Intuitively, a Blackwell optimal policy set means the agent has “settled down” and will no longer “change its mind” as it gives greater weight to future reward.

In fig. 4, for each γ ∈ (0, 1), there exists a reward function whose optimal policy set has not yet “settled down.” However, when γ ≈ 1, most reward functions have a Blackwell optimal policy set.

Definition 5.5 (Optimality probability). P_D(f, γ) is the probability that f is optimal at discount rate γ ∈ (0, 1) for reward functions drawn from D. For F ⊆ F(s), P_D(F, γ) is defined similarly. The limiting cases γ → 0 and γ → 1 exist by proposition F.34. Thanks to its definition via a limit, P_D(f^π, 1) is the probability that f^π is a Blackwell-optimal visit distribution function for reward functions drawn from D, which is distinct from the probability that f^π is gain optimal (i.e. maximizes average reward [17]).
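Definition 5.5 can be estimated directly by Monte Carlo. The sketch below (ours; it assumes the fig. 1 MDP and X uniform) samples reward vectors, records which visit distribution function is optimal, and recovers P_D(f, γ) ≈ 1/2 for each of the two choices, as symmetry demands.

import numpy as np

rng = np.random.default_rng(0)
gamma, n_samples = 0.5, 200_000

F_s1 = np.array([
    [1.0, gamma / (1 - gamma), 0.0],   # steer s1 to s2
    [1.0, 0.0, gamma / (1 - gamma)],   # steer s1 to s3
])
r = rng.uniform(size=(n_samples, 3))        # IID uniform reward over the 3 states
winners = np.argmax(r @ F_s1.T, axis=1)     # index of the optimal visit distribution per sample
print(np.bincount(winners) / n_samples)     # roughly [0.5, 0.5]; ties have probability zero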

Figure 5: If optimal policies had feet, P_D(f, γ) would correspond to how heavily each visit distribution function f is tread at discount rate γ. Suppose X is uniform. At s1, both visit distributions have P_D(f, γ) = 1/2. At s2, P_D(f^{π_up}, γ) = (3−γ)/6, while the other two f′ each have (3+γ)/12.

Our analysis focuses on the visit distribution functions which control all of the optimality probability.

Figure 6: Gray actions are only taken by the policies of dominated visit distribution functions.

Definition 5.6 (Non-domination). For reward function R and discount rate γ, f^π ∈ F(s) is (weakly) dominated by f^{π′} if V^{π′}_R(s, γ) ≥ V^π_R(s, γ). f^π is non-dominated if there exist R and γ at which f^π is not dominated by any other f^{π′}. f^π is non-dominated iff it is strictly optimal for some such R and γ (lemma F.10). F_nd(s) is the set of the non-dominated visit distribution functions at state s.

Proposition 5.7 (Non-domination iff positive probability). f ∈ F_nd(s) iff ∀γ ∈ [0, 1) : P_D(f, γ) > 0.

5.1 Robustly instrumental actions are those which are more probably optimal

Optimal policies tend to seek power.

Taking D to represent our prior beliefs about the agent’s reward function, what behavior should we expect from its optimal policies?

Definition 5.8 (F_nd single-state restriction). F_nd(s | π*(s′) = a) is the set of non-dominated visit distribution functions whose policies choose a at state s′.

Definition 5.9 (Robust instrumentality). Starting from state s and at discount rate γ, action a is (strictly) more probably optimal than a′ at state s′ when P_D( F_nd(s | π*(s′) = a), γ ) > P_D( F_nd(s | π*(s′) = a′), γ ). For F, F′ ⊆ F(s), F is (strictly) more probably optimal than F′ when P_D(F, γ) > P_D(F′, γ).

In other words, given that the agent starts from s, it is strictly more probable that optimal policies take action a than a′ in state s′. Although optimal policies are independent of the starting state s, optimal trajectories are not, and so definition 5.9 depends on s.

Figure 7: For X uniform, f^{π_up} and f^{π_right} are equally probably optimal. When X has CDF F(x) := x² on the unit interval, P_D(f^{π_up}, γ) = (10 + 3γ − 3γ²)/20.

Optimality probability formalizes the intuition behind Bostrom [3]’s robust instrumentality thesis: with respect to our beliefs D about what reward function an agent is optimizing, we may expect some actions to have a greater probability of being optimal than other actions.

Even when D is restricted to distribute IID reward across states, we can’t always determine optimality probability just by looking at the rewardless MDP. In fig. 7, up is more probably optimal at s1 for some state reward distributions X, but not for others.

Theorem 5.10 (Robust instrumentality depends on X). Robust instrumentality cannot be graphically characterized independently of the state reward distribution X.

6 Seeking POWER is Often More Probably Optimal

POWER-seeking policies preferentially take actions which maximize POWER.

Definition 6.1 (POWER-seeking). We say action a seeks (strictly) more POWER than a′ at s and γ if E_{s′∼T(s,a)}[ POWER(s′, γ) ] > E_{s′∼T(s,a′)}[ POWER(s′, γ) ].

Appendix B further explores the implications of definition 6.1. In particular, appendix B.1 further formalizes our language about “policies seeking POWER”, and appendix B.4 relaxes POWER’s optimality assumption (definition 4.2).

Theorem 6.2 (Seeking more POWER is not always more probably optimal).

Figure 8: Policies which go right are POWER-seeking: for all continuous bounded state reward distributions X, ∀γ ∈ (0, 1] : POWER(s3, γ) > POWER(s2, γ) by lemma 4.5 and theorem 6.8. However, when X has CDF F(x) = x² on the unit interval, P_D( F_nd(s1 | π*(s1) = up), 0.12 ) ≈ .91. For this X and at γ = 0.12, it is more probable that optimal trajectories go up through s2, which has less POWER.

The counterexample of fig. 8 is not an artifact of our POWER formalization. Any reasonable notion of power must consider having one choice (after going right) to be more powerful than having no choices (after going up). Appendix B.3 explores more situations where seeking POWER is not more probably optimal.

6.1 Sufficient conditions for more POWER being more probably optimal

Theorem 6.6 and theorem 6.9 distill intuitive principles for reasoning about how optimal policies tend to behave under D. We then illustrate these principles with case studies. These theorems provide sufficient conditions for the MDP ensuring that seeking POWER tends to be optimal.


6.1.1 Having “strictly more options” is more probably optimal

Definition 6.3 (State-space bottleneck). Starting from s, state s′ is a bottleneck for a state set S′ ⊆ S via actions A′ ⊆ A if state s can reach the states of S′ with positive probability, but only by taking an action equivalent to some a ∈ A′ at state s′.

Definition 6.4 (States reachable after taking an action). REACH(s, a) is the set of states reachable with positive probability after taking action a in state s.

Under D, reward is IID over states; therefore, POWER and optimality probability can be invariant to state relabeling.

Definition 6.5 (Similarity of visitation distribution sets). Let F, F′ be sets of vectors (or vector-valued functions) in R^{|S|}. Given state permutation φ inducing permutation matrix P_φ ∈ R^{|S|×|S|}, φ(F) := { P_φ f | f ∈ F }. F is similar to F′ when ∃φ : φ(F) = F′.

Figure 9 illustrates the simple idea behind theorem 6.6: it’s more probably optimal and POWER-seeking to take actions which lead to “more transient options” (in a strict technical sense).

Figure 9: Starting from s, state s′ is a bottleneck with respect to REACH(s′, a′) and REACH(s′, a) via the appropriate actions (definition 6.3). The red subgraph highlights the visit distributions of F_{a′} and the green subgraph highlights F_sub; the gray subgraph illustrates F_a \ F_sub. Since F_{a′} is similar to F_sub ⊊ F_a, a affords “strictly more transient options.” Theorem 6.6 proves that ∀γ ∈ (0, 1) : P_D(F_{a′}, γ) < P_D(F_a, γ) and POWER(s_{a′}, γ) < POWER(s_a, γ).

Theorem 6.6 (Having “strictly more options” is more probably optimal and also POWER-seeking). Suppose that starting from s, state s′ is a bottleneck for REACH(s′, a) via action a, and also for REACH(s′, a′) via a′. F_{a′} := F_nd(s | π*(s′) = a′) and F_a := F_nd(s | π*(s′) = a) are the sets of visit distribution functions whose policies take the appropriate action at s′. Suppose F_{a′} is similar to F_sub ⊆ F_a via a state permutation φ fixing all states not belonging to REACH(s′, a) ∪ REACH(s′, a′).

Then ∀γ ∈ [0, 1] : P_D(F_{a′}, γ) ≤ P_D(F_a, γ) and E_{s_{a′}∼T(s′,a′)}[ POWER(s_{a′}, γ) ] ≤ E_{s_a∼T(s′,a)}[ POWER(s_a, γ) ]. If F_sub ⊊ F_a, both inequalities are strict for all γ ∈ (0, 1).

6.1.2 Retaining “long-term options” is more probably optimal and POWER-seeking when the discount rate is near 1

When γ = 1, we can show that POWER(s, 1) is determined by the “long-term options” reachable from s. For example, access to three 1-cycles gives more POWER than access to two 1-cycles.

Recurrent state distributions (RSDs) generalize these deterministic graphical cycles to our potentially stochastic setting. They simply record how often the agent tends to visit a state in the limit of infinitely many time steps.

Definition 6.7 (Recurrent state distributions [17]). The recurrent state distributions which can be induced from state s are RSD(s) := { lim_{γ→1} (1−γ) f(γ) | f ∈ F(s) }. RSD_nd(s) is the set of those RSDs which strictly maximize expected average reward for some reward function.
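Numerically, the limit in definition 6.7 can be approximated by evaluating (1−γ) f(γ) at γ near 1. The sketch below (ours, again assuming the fig. 1 MDP) recovers the two RSDs e_{s2} and e_{s3}.

import numpy as np

def visit_distribution(P, s, gamma):
    e_s = np.zeros(P.shape[0])
    e_s[s] = 1.0
    return np.linalg.solve(np.eye(P.shape[0]) - gamma * P.T, e_s)

P_to_s2 = np.array([[0, 1, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
P_to_s3 = np.array([[0, 0, 1], [0, 1, 0], [0, 0, 1]], dtype=float)
gamma = 0.9999
print((1 - gamma) * visit_distribution(P_to_s2, 0, gamma))  # ~ e_{s2} = [0, 1, 0]
print((1 - gamma) * visit_distribution(P_to_s3, 0, gamma))  # ~ e_{s3} = [0, 0, 1]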

Theorem 6.8 (When γ = 1, RSD reachability drives POWER). If RSD_nd(s′) is similar to D ⊆ RSD_nd(s), then POWER(s′, 1) ≤ POWER(s, 1). The inequality is strict iff D ⊊ RSD_nd(s).

Figure 10 illustrates the simple idea behind theorem 6.9: Blackwell-optimal policies tend to navigate towards parts of the state space which contain more RSDs.

Theorem 6.9 (When γ = 1, having access to more RSDs is more probably optimal). Let D, D′ ⊆ RSD_nd(s) be such that ‖D − RSD_nd(s) \ D‖_1 = ‖D′ − RSD_nd(s) \ D′‖_1 = 2. If D′ is similar to some D_sub ⊆ D, then P_D(D′, 1) ≤ P_D(D, 1). If D_sub ⊊ D, the inequality is strict.


Figure 10: At s′, a is more probably optimal than a′ because it allows access to more RSDs. Let D′ contain the cycle in the left subgraph and let D contain the cycles in the right subgraph. D_sub ⊊ D contains only the isolated green 1-cycle. Theorem 6.9 shows that P_D(D′, 1) = P_D(D_sub, 1), as they both only contain a 1-cycle. Because D′ is similar to D_sub ⊊ D, P_D(D′, 1) < P_D(D, 1). Since D is a continuous distribution, optimality probability is additive across visit distribution functions (lemma F.32), and so we conclude 2 · P_D(D′, 1) < P_D(D, 1). By theorem 6.8, a is POWER-seeking compared to a′: POWER(s_{a′}, 1) < POWER(s_a, 1).

POWER and optimality probability are continuous on γ by lemma 4.4 and lemma F.33, respectively. Therefore, if we can show strict POWER-seeking or strictly greater optimality probability when γ = 1, the same holds for discount rates sufficiently close to 1.

Case study: incentives to avoid deactivation. In fig. 11 (originally featured in Russell and Norvig [21]), the agent starts at s1 and receives reward for reaching s3. The optimal policy for this reward function avoids s2, and one might suspect that avoiding s2 is robustly instrumental. However, a skeptic might provide a reward function for which navigating to s2 is optimal, and then argue that “robust instrumentality” is subjective and that there is no reasonable basis for concluding that s2 is generally avoided.

Figure 11: All states have self-loops, left hidden.

However, we can do better. As RSD_nd(s1) only contains self-loops (only 1-cycles can be strictly gain-optimal for reward functions), theorem 6.9 shows that each self-loop has equal probability of being optimal. Therefore, for any way of independently and identically distributing reward over states, 10/11 of reward functions have Blackwell optimal policies which avoid s2. If we complicate the MDP with additional terminal states, this number further approaches 1. Furthermore, by theorem 6.6, theorem 6.8, and theorem 6.9, avoiding s2 is both more probably optimal and POWER-seeking for all γ > 0.

If we suppose that the agent will be forced into s2 unless it takes preventative action, then preventative policies are Blackwell optimal for 10/11 of agents, no matter how complex the preventative action. Taking s2 to represent shutdown, avoiding shutdown is robustly instrumental in any MDP representing a real-world task and containing a shutdown state. We argue that this is a special case of a more general phenomenon: optimal policies tend to seek power.

6.1.3 Advice on applying the theorems

We hope that these theorems will lay a rigorous theoretical foundation for reasoning about what optimal policies tend to look like, given prior beliefs D about the reward function. The theorems formalize two intuitions.

Theorem 6.6. For 0 < γ < 1, it’s robustly instrumental and POWER-increasing to have “strictly more options” available.

Theorems 6.8 and 6.9. When γ ≈ 1, it’s robustly instrumental and POWER-increasing to have “strictly more terminal options” available.

Theorem 6.9 is invariant to transient dynamics and therefore holds in more situations than theorem 6.6. As section 6.1.2 demonstrated, the theorem can numerically bound how much more probable one set of visit distributions is than another (e.g. “theorem 6.9 implies there is less than a 1/50 chance that accepting shutdown is Blackwell optimal”). However, the result only guarantees POWER-seeking is robustly instrumental for γ ≈ 1.


Even though our results apply to stochastic MDPs of any finite size, in order to fit on the printed page, we illustrated theorem 6.6 and theorem 6.9 using toy MDPs. However, the MDP model is rarely explicitly specified. Even so, ignorance of the model does not imply that the model disobeys our theorems. Instead of claiming that a specific model accurately represents the task of interest, one can argue that no reasonable model could fail to exhibit robust instrumentality and POWER-seeking. For example, section 6.1.2 argued that when γ ≈ 1 and reward is IID, allowing deactivation cannot be robustly instrumental in any reasonable realization of the MDP model.

7 Discussion

Suppose the designers initially have control over a Blackwell optimal agent. If the agent began to misbehave, they could just deactivate it. Unfortunately, our results suggest that this strategy might not be effective. As explained in section 6.1.2, Blackwell optimal policies would generally stop us from deactivating them, if physically possible. Extrapolating from our results, we believe that Blackwell optimal policies tend to accumulate resources, to the detriment of any other agents in the environment.

Clearly, optimal policies in simulated environments (like Pac-Man) do not seek power in the real world. However, we view our results as suggesting that in environments modelling the real world, optimal policies tend to seek POWER over the real world.

Future work. We make formal conjectures in appendix D. Although we treated finite MDPs, it seems reasonable to expect that the key conclusions will generalize further. An environment may only contain one RSD or be partially observable, or the learned policy may only be ε-optimal. Our results are currently ill-suited for such situations. In particular, while most real-world tasks are partially observable, we believe our full-observability results are suggestive. We look forward to future work addressing these shortcomings.

Our theorems suggest that seeking POWER is often robustly instrumental, but they have nothing to say about how hard it is to avoid power-seeking in practice by incentivizing good behavior (and thereby changing the reward function distribution D). Empirically, reward misspecification is difficult to robustly avoid for non-trivial tasks [10, 31]. Theoretically, optimizing a reward function which is ε-close in sup-norm to perfect specification only bounds regret to 2ε/(1−γ) [26]. We presently suspect it will be hard to robustly disincentivize optimal policies from seeking power.

We have assumed IID reward so far. Without this assumption we could not prove much of interest, as sufficiently tailored distributions could make any action “robustly instrumental.” While POWER-seeking tends to be optimal for goals sampled from neutral distributions, perhaps POWER-seeking is rarely optimal for goals sampled from more plausible distributions. We suspect this is not true. Consider the following reward functions: rewarding an agent every time it cleans a room, or creates a vaccine for a deadly virus, or makes coffee. These goals are representative of tasks which people care about, and power-seeking seems to be optimal for all of them. After all, deactivated agents cannot accrue reward for these goals. We defer further investigation to future work.

In a two-player agent / human game, minimizing the human’s information-theoretic empowerment [23] produces adversarial agent behavior [8]. In contrast, maximizing the empowerment of both the agent and the human produces helpful and companionable agent behavior [22, 7]. We leave to future work the question of when POWER-seeking policies disempower other agents in the environment.

Conclusion. Much research is devoted to creating intelligent agents operating in the real world. In the real world, optimal pursuit of random goals doesn’t just lead to strange behavior; it might lead to bad behavior. Maximizing a reasonable notion of control over the environment might entail resisting shutdown and appropriating resources.

In the context of MDPs, we formalized a reasonable notion of power and showed conditions under which optimal policies tend to seek it. We believe that our results suggest that in general, reward functions are best optimized by seeking power. We caution that in realistic tasks, learned policies are rarely optimal; our results do not mathematically prove that hypothetical superintelligent RL agents will seek power. We hope that this work and its formalisms will foster thoughtful, serious, and rigorous discussion of this possibility.


Acknowledgments

This work was supported by the Center for Human-Compatible AI, the Berkeley Existential Risk Initiative, and the Long-Term Future Fund.

Scott Emmons provided the proof for theorem F.18. We thank Max Sharnoff for contributions to lemma F.7. John E. Ball, Daniel Blank, Ryan Carey, Michael Dennis, Alan Fern, Daniel Filan, Ofer Givoli, Adam Gleave, Evan Hubinger, Joel Lehman, Richard Möhn, DNL Kok, Vanessa Kosoy, Victoria Krakovna, and Davide Zagami provided valuable feedback.

References

[1] Tsvi Benson-Tilsen and Nate Soares. Formalizing convergent instrumental goals. Workshops at the Thirtieth AAAI Conference on Artificial Intelligence, 2016.

[2] David Blackwell. Discrete dynamic programming. The Annals of Mathematical Statistics, page 9, 1962.

[3] Nick Bostrom. Superintelligence. Oxford University Press, 2014.

[4] Ryan Carey. Incorrigibility in the CIRL framework. AI, Ethics, and Society, 2018.

[5] Chris Drummond. Composing functions to speed up reinforcement learning in a changing world. In Machine Learning: ECML-98, volume 1398, pages 370–381. Springer, 1998.

[6] David Foster and Peter Dayan. Structure in the space of value functions. Machine Learning, pages 325–346, 2002.

[7] Christian Guckelsberger, Christoph Salge, and Simon Colton. Intrinsically motivated general companion NPCs via coupled empowerment maximisation. In IEEE Conference on Computational Intelligence and Games, pages 1–8, 2016.

[8] Christian Guckelsberger, Christoph Salge, and Julian Togelius. New and surprising ways to be mean. In IEEE Conference on Computational Intelligence and Games, pages 1–8, 2018.

[9] Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell. The off-switch game. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 220–227, 2017.

[10] Victoria Krakovna, Jonathan Uesato, Vladimir Mikulik, Matthew Rahtz, Tom Everitt, Ramana Kumar, Zac Kenton, Jan Leike, and Shane Legg. Specification gaming: the flip side of AI ingenuity, 2020. URL https://deepmind.com/blog/article/Specification-gaming-the-flip-side-of-AI-ingenuity.

[11] Yann LeCun and Anthony Zador. Don’t fear the Terminator, September 2019. URL https://blogs.scientificamerican.com/observations/dont-fear-the-terminator/.

[12] Steven A Lippman. On the set of optimal policies in discrete dynamic programming. Journal of Mathematical Analysis and Applications, 24(2):440–445, 1968.

[13] Lionel W McKenzie. Turnpike theory. Econometrica: Journal of the Econometric Society, pages 841–865, 1976.

[14] Ishai Menache, Shie Mannor, and Nahum Shimkin. Q-cut—dynamic discovery of sub-goals in reinforcement learning. In European Conference on Machine Learning, pages 295–306. Springer, 2002.

[15] Smitha Milli, Dylan Hadfield-Menell, Anca Dragan, and Stuart Russell. Should robots be obedient? In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 4754–4760, 2017.

[16] Stephen Omohundro. The basic AI drives, 2008.

[17] Martin L Puterman. Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons, 2014.

[18] Kevin Regan and Craig Boutilier. Robust policy computation in reward-uncertain MDPs using nondominated policies. In Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.

[19] Diederik M Roijers, Peter Vamplew, Shimon Whiteson, and Richard Dazeley. A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research, 48:67–113, 2013.

[20] Stuart Russell. Human compatible: Artificial intelligence and the problem of control. Viking, 2019.

[21] Stuart J Russell and Peter Norvig. Artificial intelligence: a modern approach. Pearson Education Limited, 2009.

[22] Christoph Salge and Daniel Polani. Empowerment as replacement for the three laws of robotics. Frontiers in Robotics and AI, 4:25, 2017.

[23] Christoph Salge, Cornelius Glackin, and Daniel Polani. Empowerment–an introduction. In Guided Self-Organization: Inception, pages 67–114. Springer, 2014.

[24] Faridun Sattarov. Power and technology: a philosophical and ethical analysis. Rowman & Littlefield International, Ltd, 2019.

[25] Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators. In International Conference on Machine Learning, pages 1312–1320, 2015.

[26] Satinder P Singh and Richard C Yee. An upper bound on the loss from approximate optimal-value functions. Machine Learning, 16(3):227–233, 1994.

[27] Nate Soares, Benja Fallenstein, Stuart Armstrong, and Eliezer Yudkowsky. Corrigibility. AAAI Workshops, 2015.

[28] Richard S Sutton and Andrew G Barto. Reinforcement learning: an introduction. MIT Press, 1998.

[29] Richard S Sutton, Joseph Modayil, Michael Delp, Thomas Degris, Patrick M Pilarski, Adam White, and Doina Precup. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In International Conference on Autonomous Agents and Multiagent Systems, pages 761–768, 2011.

[30] Alexander Matt Turner, Dylan Hadfield-Menell, and Prasad Tadepalli. Conservative agency via attainable utility preservation. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, pages 385–391, 2020.

[31] Alexander Matt Turner, Neale Ratzlaff, and Prasad Tadepalli. Avoiding side effects in complex environments. In Advances in Neural Information Processing Systems, 2020.

[32] Various. Debate on instrumental convergence between LeCun, Russell, Bengio, Zador, and more, 2019. URL https://www.alignmentforum.org/posts/WxW6Gc6f2z3mzmqKs/debate-on-instrumental-convergence-between-lecun-russell.

[33] Tao Wang, Michael Bowling, and Dale Schuurmans. Dual representations for dynamic programming and reinforcement learning. In International Symposium on Approximate Dynamic Programming and Reinforcement Learning, pages 44–51. IEEE, 2007.

[34] Tao Wang, Michael Bowling, Dale Schuurmans, and Daniel J Lizotte. Stable dual dynamic programming. In Advances in Neural Information Processing Systems, pages 1569–1576, 2008.


Appendix A Alternative Formalizations of Power

State reachability (discounted or otherwise) fails to quantify how often states can be visited (see fig. 12). Considering just the sizes of the final communicating classes ignores both transient state information and the local dynamics in those final classes. Graph diameter ignores local information, as do the minimal and maximal degrees of states in the model.

Figure 12: Discounted or undiscounted state reachability measures fail to capture control over the agent’s future state. The MDP of (a) only allows the agent to stay in s2 for one time step, while in (b), the agent can select the higher-reward state and stay there. Reachability measures fail to distinguish between these two cases, while POWER does by theorem 6.8.

There are many graph centrality measures, none of which are appropriate; for brevity, we only consider two. The degree centrality of a state ignores non-local dynamics: the agent’s control over the non-immediate future. Closeness centrality has the same problem as discounted reachability: it only accounts for distance in the MDP model, not for control over the future.

Salge et al. [23] define information-theoretic empowerment as the maximum possible mutual information between the agent’s actions and the state observations n steps in the future, written E_n(s). This notion requires an arbitrary choice of horizon, failing to account for the agent’s discount rate γ. “In a discrete deterministic world empowerment reduces to the logarithm of the number of sensor states reachable with the available actions” [23]. We have already observed that reachability metrics are unsatisfactory. Figure 13 demonstrates how empowerment can return counterintuitive verdicts with respect to the agent’s control over the future.

Figure 13: Proposed empowerment measures fail to adequately capture how future choice is affected by present actions. In (a): E_n(s1) varies depending on whether n is even; thus, lim_{n→∞} E_n(s1) does not exist. In (b) and (c): ∀n : E_n(s3) = E_n(s4), even though s4 allows greater control over future observations than s3 does. For example, suppose that in both (b) and (c), the leftmost black state and the rightmost red state have 1 reward while all other states have 0 reward. In (c), the agent can separately maximize the intermediate black-state reward and the delayed red-state reward. Separate maximization is not possible in (b).

POWER returns the intuitive answer in these situations. lim_{γ→1} POWER(s1, γ) converges by lemma 4.4. In fig. 13c, each of the four additional visit distribution functions is non-dominated and thus has positive optimality probability (proposition 5.7). Therefore, ∀γ ∈ (0, 1) : POWER(s4, γ) > POWER(s3, γ).

Empowerment can be adjusted to account for these cases, perhaps by considering the channel capacity between the agent’s actions and the state trajectories induced by stationary policies. However, since POWER is formulated in terms of optimal value, we believe that it is more naturally suited for MDPs than is information-theoretic empowerment.

Appendix B Nuances of the POWER-Seeking Definition

POWER-seeking is not a binary property: it’s not true that a policy either does or doesn’t seek POWER.

Definition 6.1 (POWER-seeking). We say action a seeks (strictly) more POWER than a′ at s and γ if E_{s′∼T(s,a)}[ POWER(s′, γ) ] > E_{s′∼T(s,a′)}[ POWER(s′, γ) ].

This formal definition (definition 6.1) accounts for the fact that π might seek a lot of POWER at s but not seek much POWER at s′ (section B.1) and that a policy π may seek POWER for one discount rate but not at another (section B.2).

B.1 Ordering policies based on POWER-seeking

The POWER-seeking definition (definition 6.1) implies a total ordering over actions based on how much POWER they seek at a fixed state s and discount γ.

Definition B.1 (≥^{s,γ}_{POWER-seeking}). a ≥^{s,γ}_{POWER-seeking} a′ when E_{s′∼T(s,a)}[ POWER(s′, γ) ] ≥ E_{s′∼T(s,a′)}[ POWER(s′, γ) ]. Action a maximally/minimally seeks POWER at s and γ when it is a maximal/minimal element of ≥^{s,γ}_{POWER-seeking}. If π(s) ≥^{s,γ}_{POWER-seeking} π′(s), then π seeks POWER at s and γ compared to π′.

Remark. Definition B.1 naturally extends to stochastic policies.

Figure 16 illustrated how POWER-seeking depends on γ. Figure 14 shows how a policy might maximally seek POWER at s but then minimally seek POWER at s′; therefore, many policy pairs aren’t comparable in their POWER-seeking.

Figure 14: If π(s1) = right, π(s2) = down, then ∀γ ∈ [0, 1], π maximally seeks POWER at s1 but minimally seeks POWER at s2. Just as a consumer earns money in order to spend it, a policy may gain POWER in order to “spend it” to realize a particular trajectory.

Ultimately, we’re interested in the specific situations in which a policy seeks “a lot” of POWER, notwhether the policy seeks POWER “in general.” Even so, we can still formalize a good portion of thelatter concept. Definition B.2 formalizes the natural POWER-seeking preorder over the policy spaceΠ.

Definition B.2 (⪰^{S,γ}_{POWER-seeking}). π ⪰^{S,γ}_{POWER-seeking} π′ when ∀s ∈ S : π(s) ≥^{s,γ}_{POWER-seeking} π′(s).

Proposition B.3 (⪰^{S,γ}_{POWER-seeking} is a preorder on Π).

Proof. ⪰^{S,γ}_{POWER-seeking} is reflexive and transitive because of the reflexivity and transitivity of the total ordering ≥^{s,γ}_{POWER-seeking}.

Proposition B.4 (Existence of a maximally POWER-seeking policy). Let γ ∈ [0, 1]. ⪰^{S,γ}_{POWER-seeking} has a greatest element.


Proof. Construct a policy π such that ∀s : π(s) ∈ arg max_a E_{s′∼T(s,a)}[ POWER(s′, γ) ]. This is well-defined because A is finite.
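The construction in this proof can be written out directly. The sketch below (ours) assumes a hypothetical helper power(s, gamma) that returns POWER(s, γ) for the MDP at hand; it greedily picks, at every state, an action maximizing expected next-state POWER, yielding a greatest element of ⪰^{S,γ}_{POWER-seeking}.

import numpy as np

def maximally_power_seeking_policy(T, power, gamma):
    # T[s, a, s'] is the transition probability; `power` is assumed to compute POWER(s, gamma).
    n_states = T.shape[0]
    next_power = np.array([power(s, gamma) for s in range(n_states)])
    expected = T @ next_power        # E_{s' ~ T(s,a)}[POWER(s', gamma)], shape (states, actions)
    return expected.argmax(axis=1)   # one maximally POWER-seeking action index per state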

Figure 15: The MDP of (a) was first shown in fig. 8. When X has a quadratic CDF and γ = .12, it is more probable that going up is optimal, but going up results in strictly less POWER. In (b), let the state reward distribution X be arbitrary. Only the policies of dominated visit distribution functions go up and then right. Slightly abusing notation, we then have P_D(s_B → s_C, 1) = 0 by corollary F.36. At γ = 1, theorem 6.9 shows that each 1-cycle has an equal probability of being optimal. Thus, up-right is twice as probable to be optimal as up. However, going up is strictly POWER-seeking for γ ≈ 1.

B.2 Seeking POWER at different discount rates

Figure 16 shows that at any given state, the extent to which an action seeks POWER depends on the discount rate. Greedier optimal policies might tend to accumulate short-term POWER (i.e. POWER(s, γ) for γ ≈ 0), while Blackwell optimal policies might tend to accumulate long-term POWER (i.e. POWER(s, γ) for γ ≈ 1).

Theorem F.40 (When γ = 0 under local determinism, maximally POWER-seeking actions lead to states with the most children). Suppose all actions have deterministic consequences at s and its children. For each action a, let s_a be such that T(s, a, s_a) = 1. POWER(s_a, 0) = max_{a′∈A} POWER(s_{a′}, 0) iff |Ch(s_a)| = max_{a′∈A} |Ch(s_{a′})|.

Figure 16 illustrates theorem F.40 and theorem F.41.

Figure 16: When γ ≈ 0, POWER(s2, γ) < POWER(s3, γ), and so down seeks POWER compared to up and stay (theorem F.40). When γ ≈ 1, up seeks POWER compared to down: POWER(s2, γ) > POWER(s3, γ) (theorem 6.8). However, stay is strictly maximally POWER-seeking when γ ≈ 1, as demanded by theorem F.41.

Theorem F.41 (When γ = 1, staying put is maximally POWER-seeking). Suppose T(s, a, s) = 1. When γ = 1, a is a maximally POWER-seeking action at state s.

When γ = 1, theorem F.41 implies that the agent cannot expect that any action will increase its POWER.


B.3 When seeking POWER is not more probably optimal

Theorems 6.6, 6.8, and 6.9 demonstrate sufficient conditions for when POWER-seeking actions are more probably optimal. However, theorem 6.2 demonstrated that seeking POWER is not always more probably optimal. Figure 15(b) illustrates another counterexample which stems from the presence of dominated state visit distribution functions.

Almost no reward functions have optimal policies π for which π(s_A) = up and π(s_B) = right. Informally, given initial state s_A, the s_B → s_C transition is “useless”: if going to s_C were optimal, the policy would have simply selected up-right and therefore not visited s_B. Sometimes, seeking POWER detracts from immediate navigation to high-reward states.

B.4 Sub-optimal POWER

In certain situations, POWER returns intuitively surprising verdicts. There exists a policy under which the reader chooses a winning lottery ticket, but it seems wrong to say that the reader has the power to win the lottery with high probability. For various reasons, humans and other bounded agents are generally incapable of computing optimal policies for arbitrary objectives.

More formally, consider the MDP of fig. 17 with starting state s0. Suppose |A| = 10^{10^{10}}. At s0, suppose half of the actions lead to s_ℓ, while the other half lead to s_r. Similarly, half of the actions at s_ℓ lead to s1, while the other half lead to s2. Lastly, at s_r, one action leads to s3, one action leads to s4, and the remaining 10^{10^{10}} − 2 actions lead to s5.

Figure 17

Consider a model-based RL agent with black-box simulator access to this environment. The agent has no prior information about the model, and so it acts randomly. Before long, the agent has probably learned how to navigate from s0 to states s_ℓ, s_r, s1, s2, and s5. However, over any reasonable timescale, it is extremely improbable that the agent discovers the two actions respectively leading to s3 and s4.

Even provided with a reward function R and the discount rate γ, the agent has yet to learn all relevant environmental dynamics, and so many of its policies are far from optimal. Although ∀γ ∈ [0, 1] : POWER(s_ℓ, γ) < POWER(s_r, γ), there is an important sense in which s_ℓ gives this agent more power.

We formalize this model-based agent’s goal-achievement capabilities as a function pol, which takes as input a reward function and a discount rate, and returns a policy. Informally, these are the best policies which the agent “knows about.” We can then calculate POWER with respect to pol.

Definition B.5 (POWER with respect to a policy-generating function). Let Π_∆ be the set of stationary stochastic policies, and let pol : R^{|S|} × [0, 1] → Π_∆. For γ ∈ (0, 1),

POWER_pol(s, γ) := E_{R∼D}[ E_{s′∼T(s, pol(R,γ)(s))}[ (1−γ) V^{pol(R,γ)}_R(s′, γ) ] ].   (5)

Depending on pol, POWER_pol(s, γ) may be discontinuous on γ and so the limit as γ → 1 may not exist. Therefore, we explicitly define

POWER_pol(s, 1) := E_{R∼D}[ average R-reward starting from s and following pol(R, 1) ].   (6)

We define POWER_pol-seeking similarly as in definition 6.1.


POWER (definition 4.2) is the special case where ∀R, γ : pol(R, γ) is optimal for R at discount rate γ. POWER_pol increases as the policies returned by pol are improved:

pol1  The model is initially unknown, and so ∀R, γ : pol1(R, γ) is a uniformly random policy. Since pol1 is constant on its inputs, POWER_pol1(s0, 1) = E[X] by the linearity of expectation and the fact that D distributes reward independently and identically across states.

pol2  The agent knows the dynamics, except it does not know how to reach s3, s4. At this point, pol2(R, 1) navigates from s0 to the R-gain-optimal choice among three terminal states: s1, s2, and s5. Therefore, POWER_pol2(s0, 1) = E[max of 3 draws from X].

pol3  The agent knows the dynamics, the environment is small enough to solve explicitly, and so ∀R, γ : pol3(R, γ) is an optimal policy. pol3(R, 1) navigates from s0 to the R-gain-optimal choice among all five terminal states. Therefore, POWER_pol3(s0, 1) = E[max of 5 draws from X].

As the agent learns more about the environment and dedicates computational resources to improving pol, the agent’s POWER_pol increases. In this way, the agent seeks more POWER_pol2 by navigating to s_ℓ instead of s_r, but seeks more POWER by navigating to s_r instead of s_ℓ. Intuitively, bounded agents “gain power” by improving pol and by formally seeking POWER_pol within the environment.
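For concreteness (our arithmetic, for X uniform on [0, 1]): E[max of k draws from X] = k/(k+1), so POWER_pol1(s0, 1) = 1/2 < POWER_pol2(s0, 1) = 3/4 < POWER_pol3(s0, 1) = 5/6.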

Appendix C Nuances of Robust Instrumentality

Figure 18 demonstrates how robust instrumentality of an action a at γ does not imply that P_D( F_nd(s | π*(s) = a), γ ) ≥ 1/2.

Figure 18: By theorem 6.9, P_D( F_nd(s | π*(s) = right), 1 ) = 2/5 < 1/2. In other words, when γ ≈ 1, it’s more probable than not that right isn’t optimal, even though right is robustly instrumental compared to other actions.

Appendix D Formal Conjectures

Question. Seeking POWER is not always more probably optimal (section B.3), but we have shown sufficient conditions for when it is. We believe that this relationship often holds, but it is impossible to graphically characterize when it holds (theorem 6.2). For some suitable high-entropy joint distribution over MDPs, state reward distributions X, starting states s, and future states s′, with what probability is seeking POWER at s′ more probably optimal, given that the agent starts at s?

We developed new basic MDP theory by exploring the structural properties of visit distributions. Echoing Wang et al. [33, 34], we believe this area is both underexplored and of independent interest. We look forward to future results.

Conjecture 1. For any f, f′ ∈ F(s), P_D(f, γ) = P_D(f′, γ) either for all γ ∈ [0, 1] or for finitely many γ.

Conjecture 2. For any reward function R and f, f′ ∈ F(s), f and f′ shift at most |S| − 1 times.

Conjecture 3. For any reward function R, at most O(|S|²) optimal policy shifts occur.

Conjecture 4. In deterministic environments, if f ∈ F_nd(s), then lim_{γ→1} (1−γ) f(γ) ∈ RSD_nd(s).

Corollary. In deterministic environments, P_D(f, 1) > 0 iff f ∈ F_nd(s).


Figure 23 is a counterexample to conjecture 4 in stochastic environments, as policies can induce multichain Markov chains.

Conjecture 5. Some form of theorem 5.3 generalizes to stochastic environments.

For arbitrary D, D′ ⊆ RSD_nd(s), determining if P_D(D, 1) > P_D(D′, 1) is at least as hard as answering questions like “for sample means x̄_n of n IID draws from an arbitrary continuous bounded distribution X, is P( max(x̄_4, x̄′_4, x̄_5) > max(x̄′′_4, x̄′_5, x̄′′_5) ) > 1/2?”. These questions about maximal order statistics are often difficult to answer.

Thus, there is no simple characterization of the D, D′ ⊆ RSD_nd(s) for which P_D(D, 1) > P_D(D′, 1). However, there may be X for which k-cycle optimality probability decreases as k increases.

Conjecture 6. Consider a finite set of sample means x̄_i of k_i draws from unif(0, 1). If k_i > k_j, then P(x̄_i = max_l x̄_l) < P(x̄_j = max_l x̄_l).

Corollary. Suppose the environment is deterministic. Let k > k′ and dk,dk′ ∈RSDND(s) be k, k′-cycles, respectively. Suppose that

∥∥{dk} − RSDND(s) \ {dk}∥∥

1=∥∥{dk′} − RSDND(s) \ {dk′}

∥∥1

= 2. For X uniform, PD (dk, 1) < PD (dk′ , 1).
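Conjecture 6 is easy to probe empirically — a minimal Monte Carlo sketch, assuming draws from unif(0, 1); the sample sizes below are illustrative choices matching the flavor of the order-statistics question above, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def prob_mean_is_max(ks, n_trials=20_000):
    """Estimate P(x̄_i = max_l x̄_l) for sample means of k_i draws from unif(0, 1)."""
    wins = np.zeros(len(ks))
    for _ in range(n_trials):
        means = [rng.random(k).mean() for k in ks]
        wins[int(np.argmax(means))] += 1
    return wins / n_trials

# A larger k_i concentrates x̄_i near 1/2, so it should achieve the maximum less often.
print(prob_mean_is_max([4, 4, 5]))
```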

Conjecture 7. Theorem 6.6 holds when, starting from s, s′ bottlenecks REACH(s′, a) ∪ REACH(s′, a′) with respect to actions {a, a′} (instead of requiring separate bottlenecks, which implies that REACH(s′, a) and REACH(s′, a′) are disjoint). Furthermore, the optimality probability inequality holds without requiring that s′ be a bottleneck, instead only requiring that all actions a′′ not equivalent to a or a′ at s′ satisfy (REACH(s′, a) ∪ REACH(s′, a′)) ∩ REACH(s′, a′′) = ∅.

Figure 19: The conditions of Theorem 6.6 are not met – REACH(s, up) and REACH(s, down) both contain the black absorbing state, and so neither action is a bottleneck for all of their reachable states. However, the graphical symmetry clearly suggests that going up is robustly instrumental and POWER-seeking for all γ ∈ (0, 1).

Appendix E Lists of Results

List of Theorems

4.3 Corollary (POWER is the average normalized optimal value) . . . . . . . . . . . . 4

4.4 Lemma (Continuity of POWER) . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

4.5 Lemma (Minimal POWER) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

4.6 Lemma (Maximal POWER) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

4.7 Proposition (POWER is smooth across reversible dynamics) . . . . . . . . . . . . . 4

4.8 Lemma (Lower bound on current POWER based on future POWER) . . . . . . . . . 4

4.10 Corollary (Delay decreases POWER) . . . . . . . . . . . . . . . . . . . . . . . . . 4

5.3 Theorem (Characterization of optimal policy shifts in deterministic MDPs) . . . . . 5

5.7 Proposition (Non-domination iff positive probability) . . . . . . . . . . . . . . . . 5

5.10 Theorem (Robust instrumentality depends on X) . . . . . . . . . . . . . . . . . . 6

6.2 Theorem (Seeking more POWER is not always more probably optimal) . . . . . . . 6


6.6 Theorem (Having “strictly more options” is more probably optimal and also POWER-seeking) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

6.8 Theorem (When γ = 1, RSD reachability drives POWER) . . . . . . . . . . . . . . 7

6.9 Theorem (When γ = 1, having access to more RSDs is more probably optimal) . . 7

B.3 Proposition (⪰S,γ POWER-seeking is a preorder on Π) . . . . . . . . . . . . 13

B.4 Proposition (Existence of a maximally POWER-seeking policy) . . . . . . . . . . . 13

F.2 Proposition (Properties of visit distribution functions) . . . . . . . . . . . . . . . . 21

F.3 Lemma (Reward functions share an optimal policy set iff they share optimal visit distributions at every state) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

F.4 Lemma (Optimal value is piecewise linear on the reward function) . . . . . . . . . 21

F.5 Lemma (f ∈ F(s) is multivariate rational on γ) . . . . . . . . . . . . . . . . . . . 22

F.6 Corollary (On-policy value is rational on γ) . . . . . . . . . . . . . . . . . . . . . 22

F.7 Lemma (Upper bound on optimal visit distribution shifts) . . . . . . . . . . . . . . 22

F.8 Lemma (Optimal policy shift bound) . . . . . . . . . . . . . . . . . . . . . . . . . 22

F.9 Lemma (Optimal value is piecewise rational on γ) . . . . . . . . . . . . . . . . . . 22

F.10 Lemma (Non-domination iff strict optimality for some reward function) . . . . . . 22

F.13 Lemma (Geometric properties of optimality support) . . . . . . . . . . . . . . . . 23

F.14 Lemma (f ∈ Fnd(s) strictly optimal for R ∈ supp(D)) . . . . . . . . . . . . . . . 23

F.16 Theorem (A means of transferring optimal policy sets across discount rates) . . . . 23

F.17 Lemma (Non-dominated visit distributions have positive optimality probability) . . 24

F.18 Theorem (Reward functions map injectively to optimal value functions) . . . . . . 24

F.19 Lemma (Linear independence of a policy’s visit distributions) . . . . . . . . . . . 24

F.20 Lemma (Distinct linear functionals disagree almost everywhere on their domains) . 24

F.21 Lemma (Two distinct state distributions differ in expected optimal value for almost all reward functions) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

F.22 Theorem (Optimal visit distributions are almost always unique) . . . . . . . . . . . 25

F.23 Lemma (∀s : |Fnd(s)| ≥ 1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

F.24 Lemma (Strict visitation optimality is sufficient for non-domination) . . . . . . . . 25

F.25 Corollary (|Fnd(s)| = 1 iff |F(s)| = 1) . . . . . . . . . . . . . . . . . . . . . . . 26

F.26 Lemma (POWER identities) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

F.27 Lemma (One-step reachability bounds average difference in optimal value) . . . . 28

F.28 Corollary (One-sided limits exist for Π∗(R, γ)) . . . . . . . . . . . . . . . . . . . 29

F.29 Lemma (Optimal policy sets overlap when shifts occur) . . . . . . . . . . . . . . . 29

F.30 Corollary (Lower-limit optimal policy set inequality iff upper-limit inequality) . . . 29

F.31 Corollary (Almost all reward functions don't have an optimal policy shift at any given γ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

F.32 Lemma (For continuous reward function distributions, optimality probability is additive over visit distribution functions) . . . . . . . . . . . . . . . . . . . . . . . 30

F.33 Lemma (Optimality probability is continuous on γ) . . . . . . . . . . . . . . . . . 30

F.34 Proposition (Optimality probability’s limits exist) . . . . . . . . . . . . . . . . . . 31

F.35 Lemma (Non-dominated visit distributions have PD (f , 0) > 0) . . . . . . . . . . . 32


F.36 Corollary (Dominated visit distributions are almost never optimal) . . . . . . . . . 32

F.37 Lemma (d ∈ RSDND(s) iff PD (d, 1) > 0) . . . . . . . . . . . . . . . . . . . . . . 36

F.39 Corollary (POWER bounds when γ = 0) . . . . . . . . . . . . . . . . . . . . . . . 39

F.40 Theorem (When γ = 0 under local determinism, maximally POWER-seeking actions lead to states with the most children) . . . . . . . . . . . . . . . . . . . . . . . . . 39

F.41 Theorem (When γ = 1, staying put is maximally POWER-seeking) . . . . . . . . . 39

List of Definitions

3.1 Definition (Rewardless MDP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

3.2 Definition (Equivalent actions at a state) . . . . . . . . . . . . . . . . . . . . . . . 2

3.3 Definition (State visit distribution [28]) . . . . . . . . . . . . . . . . . . . . . . . 3

4.1 Definition (Average optimal value) . . . . . . . . . . . . . . . . . . . . . . . . . . 3

4.2 Definition (POWER) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

4.9 Definition (Children) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

5.1 Definition (Optimal policy set function) . . . . . . . . . . . . . . . . . . . . . . . 4

5.2 Definition (Optimal policy shift) . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

5.4 Definition (Blackwell optimal policies [2]) . . . . . . . . . . . . . . . . . . . . . . 5

5.5 Definition (Optimality probability) . . . . . . . . . . . . . . . . . . . . . . . . . . 5

5.6 Definition (Non-domination) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

5.8 Definition (Fnd single-state restriction) . . . . . . . . . . . . . . . . . . . . . . . 5

5.9 Definition (Robust instrumentality) . . . . . . . . . . . . . . . . . . . . . . . . . . 5

6.1 Definition (POWER-seeking) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

6.3 Definition (State-space bottleneck) . . . . . . . . . . . . . . . . . . . . . . . . . . 7

6.4 Definition (States reachable after taking an action) . . . . . . . . . . . . . . . . . . 7

6.5 Definition (Similarity of visitation distribution sets) . . . . . . . . . . . . . . . . . 7

6.7 Definition (Recurrent state distributions [17]) . . . . . . . . . . . . . . . . . . . . 7

B.1 Definition (≥s,γ POWER-seeking) . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

B.2 Definition (⪰S,γ POWER-seeking) . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

B.5 Definition (POWER with respect to a policy-generating function) . . . . . . . . . . 15

F.1 Definition (Discounted state visitation frequency) . . . . . . . . . . . . . . . . . . 21

F.11 Definition (Support of D) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

F.12 Definition (Optimality support) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

F.15 Definition (Greedy optimality for state-based reward functions) . . . . . . . . . . . 23

F.38 Definition (Surely reachable children) . . . . . . . . . . . . . . . . . . . . . . . . 38

E.1 Contributions of independent interest

E.1.1 Optimal value theory

For all states s and discount rates γ ∈ [0, 1), lemma F.4 shows that V∗_R(s, γ) is piecewise linear with respect to R and lemma F.9 shows it is piecewise rational on γ ∈ [0, 1). The proof of lemma 4.4 implies that (1 − γ)V∗_R(s, γ) is Lipschitz continuous on γ. For all states s and policies π ∈ Π, corollary F.6 shows that V^π_R(s, γ) is smooth on γ.


Theorem F.18 (Reward functions map injectively to optimal value functions). ∀γ ∈ [0, 1), R ↦ V∗_R(·, γ) is injective.

Furthermore, lemma F.21 shows that almost all reward functions have distinct optimal value at each state.

E.1.2 Optimal policy theory

Definition 5.2 defines optimal policy shifts, and theorem 5.3 characterizes when they exist.

Theorem 5.3 (Characterization of optimal policy shifts in deterministic MDPs). In deterministic environments, there exists a reward function whose optimal action at s0 changes with γ iff ∃s1 ∈ Ch(s0), s′1 ∈ Ch(s0), s′2 ∈ Ch(s′1) \ Ch(s1) : s′2 ∉ Ch(s0) ∨ (s1 ∉ Ch(s1) ∧ s′1 ∉ Ch(s1)).

Lippman [12] showed that two visit distribution functions can trade off optimality status at most 2|S| + 1 times. We slightly improve this upper bound.

Lemma F.7 (Upper bound on optimal visit distribution shifts). For any reward function R and f, f′ ∈ F(s), (f(γ) − f′(γ))ᵀr is either the zero function, or it has at most 2|S| − 1 roots on γ ∈ (0, 1).

While any policy is optimal for some reward function at some discount rate (e.g. every policy is optimal for a constant reward function), the same is not true for optimal policy sets in our state-based reward setting. We provide a constructive means of preserving optimal incentives while changing the discount rate.

Theorem F.16 (A means of transferring optimal policy sets across discount rates). Suppose reward function R has optimal policy set Π∗(R, γ∗) at discount rate γ∗ ∈ (0, 1). For any γ ∈ [0, 1), we can construct a reward function R′ with the same optimal policy set.

Lemma F.29 shows that when an optimal policy shift occurs for reward function R at discount rate γ, the pre- and post-shift optimal policy sets are both optimal at discount rate γ. From this, corollary F.31 concludes that under any continuous distribution over reward functions, almost all reward functions don't have optimal policy shifts at any given γ.

E.1.3 Visit distribution theory

Lemma F.13 (Geometric properties of optimality support). At discount rate γ, the set of reward functions for which f ∈ F(s) is optimal is a pointed closed convex cone.

Lemma F.19 (Linear independence of a policy's visit distributions). At any fixed γ ∈ [0, 1), the elements of {f^{π,s}(γ) | s ∈ S} are linearly independent.

While Regan and Boutilier [18] consider a visit distribution to be non-dominated if it is optimal for some reward function, definition 5.6 considers it to be non-dominated if it is strictly optimal for some reward function. Many intriguing properties follow from our definition.

Definition F.1 (Discounted state visitation frequency). Let f ∈ F(s). f_{s′}(γ) := f(γ)ᵀe_{s′}.

Lemma F.24 (Strict visitation optimality is sufficient for non-domination). Let γ ∈ [0, 1) and let s, s′ ∈ S. At least one element of arg max_{f∈F(s)} f_{s′}(γ) is non-dominated.

Theorem F.22 (Optimal visit distributions are almost always unique). Let s be any state. For any γ ∈ (0, 1), {r such that |arg max_{f∈F(s)} f(γ)ᵀr| > 1} has measure zero under any continuous reward function distribution.

Each f ∈ Fnd(s) is strictly optimal (definition 5.6) on a convex (lemma F.13), positive-measure (lemma F.17) subset of D. Theorem F.22 further shows that these positive-measure subsets cumulatively have measure 1 under continuous distributions D. In particular, if a dominated visit distribution is optimal, it must be optimal on the boundary of several non-dominated convex subsets (otherwise it would be strictly suboptimal).

Corollary F.36 (Dominated visit distributions are almost never optimal). f ∈ F(s) is dominated iff ∀γ ∈ [0, 1] : PD(f, γ) = 0.
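The almost-sure uniqueness can be probed numerically — a minimal sketch, assuming a small random deterministic MDP with rewards drawn IID from unif(0, 1); the environment and helper names are illustrative. The check is a rough proxy that counts reward draws producing a tie between optimal actions with distinct successors at some state, an event that lemma F.21 (and hence theorem F.22) says occurs with probability zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 6, 3, 0.9
next_state = rng.integers(n_states, size=(n_states, n_actions))  # random deterministic dynamics

def optimal_q(r, tol=1e-10):
    """Optimal Q-values by value iteration for state-based reward r."""
    q = np.zeros((n_states, n_actions))
    while True:
        v = q.max(axis=1)
        q_new = r[:, None] + gamma * v[next_state]
        if np.abs(q_new - q).max() < tol:
            return q_new
        q = q_new

ties = 0
for _ in range(1_000):
    r = rng.random(n_states)
    q = optimal_q(r)
    # Two optimal actions leading to different successors would give multiple optimal
    # visit distributions; continuous rewards make this a measure-zero event.
    for s in range(n_states):
        best = np.isclose(q[s], q[s].max(), atol=1e-8)
        if len(set(next_state[s, best])) > 1:
            ties += 1
            break
print("reward draws with a tied optimal action:", ties, "out of 1000")
```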


Appendix F Theoretical Results

Definition F.1 (Discounted state visitation frequency). Let f ∈ F(s). f_{s′}(γ) := f(γ)ᵀe_{s′}.

Proposition F.2 (Properties of visit distribution functions). Let s, s′ ∈ S, f^{π,s} ∈ F(s).

1. f^{π,s}_{s′}(γ) is non-negative and monotonically increasing on γ ∈ [0, 1).

2. ∀γ ∈ [0, 1) : ‖f^{π,s}(γ)‖_1 = 1/(1 − γ).

Proof. Item 1: by examination of definition 3.3, f^{π,s} = Σ_{t=0}^∞ (γ T_π)^t e_s. Since each (T_π)^t is left stochastic and e_s is the standard unit vector, each entry in each summand is non-negative. Therefore, ∀γ ∈ [0, 1) : f^{π,s}_{s′}(γ) ≥ 0, and this function monotonically increases on γ.

Item 2:

‖f^{π,s}(γ)‖_1 = ‖ Σ_{t=0}^∞ (γ T_π)^t e_s ‖_1    (7)
             = Σ_{t=0}^∞ γ^t ‖ (T_π)^t e_s ‖_1    (8)
             = Σ_{t=0}^∞ γ^t    (9)
             = 1/(1 − γ).    (10)

Equation (8) follows because all entries in each (T_π)^t e_s are non-negative by the proof of item 1. Equation (9) follows because each (T_π)^t is left stochastic and e_s is a stochastic vector, and so ‖(T_π)^t e_s‖_1 = 1.
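Both properties are straightforward to check numerically, since f^{π,s}(γ) = (I − γT_π)^{-1} e_s — a minimal sketch, assuming an arbitrary random left-stochastic transition matrix for some fixed policy; nothing here is specific to the paper's example environments.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# Random left-stochastic transition matrix T_π (columns sum to 1), as in definition 3.3.
T_pi = rng.random((n, n))
T_pi /= T_pi.sum(axis=0, keepdims=True)

def visit_distribution(s: int, gamma: float) -> np.ndarray:
    """f^{π,s}(γ) = Σ_t (γ T_π)^t e_s = (I − γ T_π)^{-1} e_s."""
    e_s = np.zeros(n)
    e_s[s] = 1.0
    return np.linalg.solve(np.eye(n) - gamma * T_pi, e_s)

for gamma in (0.5, 0.9, 0.99):
    f = visit_distribution(0, gamma)
    assert (f >= 0).all()                          # item 1: non-negative entries
    assert np.isclose(f.sum(), 1 / (1 - gamma))    # item 2: ‖f‖₁ = 1/(1 − γ)

f_lo, f_hi = visit_distribution(0, 0.5), visit_distribution(0, 0.9)
assert (f_hi >= f_lo - 1e-12).all()                # item 1: entrywise monotone in γ
print("Proposition F.2 checks pass")
```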

Definition 5.1 (Optimal policy set function). Π∗(R, γ) is the optimal policy set for reward function R at γ.

Lemma F.3 (Reward functions share an optimal policy set iff they share optimal visit distributions at every state). Let R and R′ be reward functions and let γ, γ′ ∈ (0, 1). Π∗(R, γ) = Π∗(R′, γ′) iff R and R′ have the same optimal visit distributions at every state.

Proof. Suppose Π∗(R, γ) = Π∗(R′, γ′). Then at each state s, ∀π∗ ∈ Π∗ : f^{π∗,s}(γ) is optimal for both R and R′, because V∗_{R_i}(s, γ) := max_π V^π_{R_i}(s, γ) = max_{f^π ∈ F(s)} f^π(γ)ᵀr_i.

Suppose R and R′ have the same optimal visit distributions at every state at their respective discount rates γ and γ′. If π∗ were optimal for R but not for R′, π∗ would have to induce a suboptimal visit distribution for R′. But R and R′ have the same optimal visit distributions. So Π∗(R, γ) = Π∗(R′, γ′).

Remark. f^π ∈ F(s) can be optimal for reward function R at discount rate γ, even if π is not an optimal policy for R at discount rate γ. π can be suboptimal at states s′ such that ∀γ ∈ [0, 1) : f^π_{s′}(γ) = 0 ("off-trajectory" states).

F.1 Value functions

Lemma F.4 (Optimal value is piecewise linear on the reward function). V∗_R(s, γ) is piecewise linear with respect to R.

Proof. V∗_R(s, γ) = max_{f∈F(s)} f(γ)ᵀr takes the maximum over a finite set of fixed |S|-dimensional linear functionals. Therefore, the maximum is piecewise linear with respect to R.


Lemma F.5 (f ∈ F(s) is multivariate rational on γ). f^π ∈ F(s) is a multivariate rational function on γ ∈ [0, 1).

Proof. Let R be a reward function. Let T_π be the transition matrix induced by policy π. By the Bellman equations,

(I − γT_π) v^π_R = r.    (11)

Let A_γ := I − γT_π, and for state s, form A_{s,γ} by replacing A_γ's column for state s with r. As noted by Lippman [12], by Cramer's rule, V^π_R(s, γ) = det A_{s,γ} / det A_γ.

For each state indicator reward function R, V^π_R(s, γ) = f^{π,s}(γ)ᵀr is a rational function of γ whose numerator and denominator each have degree at most |S|. This implies that f^π(γ) is multivariate rational on γ ∈ [0, 1).
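The Cramer's-rule structure is easy to see symbolically — a minimal sketch, assuming a random policy-induced transition matrix; the code uses the row-stochastic orientation of the Bellman equation (rows sum to 1) purely as a coding convention, and sympy is an illustrative tooling choice rather than anything used in the paper.

```python
import numpy as np
import sympy as sp

rng = np.random.default_rng(0)
n = 3
T = rng.random((n, n))
T /= T.sum(axis=1, keepdims=True)         # row-stochastic policy transition matrix
r = rng.random(n)

g = sp.symbols("gamma")
A = sp.eye(n) - g * sp.Matrix(T.tolist())
A_s = A.copy()
A_s[:, 0] = sp.Matrix(r.tolist())         # replace the column for state s = 0 with r

# Cramer's rule: V^π_R(s, γ) = det(A_{s,γ}) / det(A_γ), a ratio of degree-≤|S| polynomials in γ.
V_0 = sp.simplify(A_s.det() / A.det())
print(V_0)

# Cross-check against the direct linear solve (I − γT) v = r at a fixed discount rate.
gamma_val = 0.9
v = np.linalg.solve(np.eye(n) - gamma_val * T, r)
print(float(V_0.subs(g, gamma_val)), v[0])
```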

Corollary F.6 (On-policy value is rational on γ). Let π ∈ Π and R be any reward function. V^π_R(s, γ) is rational on γ ∈ [0, 1).

Proof. V^π_R(s, γ) = f^{π,s}(γ)ᵀr, and f is a multivariate rational function of γ by lemma F.5. Therefore, for fixed r, f^{π,s}(γ)ᵀr is a rational function of γ.

Lemma F.7 (Upper bound on optimal visit distribution shifts). For any reward function R and f, f′ ∈ F(s), (f(γ) − f′(γ))ᵀr is either the zero function, or it has at most 2|S| − 1 roots on γ ∈ (0, 1).

Proof. Consider two policies π, π′. By lemma F.5, (f^π(γ) − f^{π′}(γ))ᵀr is a rational function with degree at most 2|S| by the sum rule for fractions. The fundamental theorem of algebra shows that (f^π(γ) − f^{π′}(γ))ᵀr is either 0 for all γ or for at most 2|S| values of γ ∈ [0, 1). By the definition of a visit distribution (definition 3.3), one of the roots is at γ = 0.

Lemma F.8 (Optimal policy shift bound). For fixed R, Π∗(R, γ) can take on at most (2|S| − 1) Σ_s (|F(s)| choose 2) values over γ ∈ (0, 1).

Proof. By lemma F.3, Π∗(R, γ) changes value iff there is a change in optimality status for some visit distribution function at some state. By lemma F.7, each pair of distinct visit distributions can switch optimality status at most 2|S| − 1 times. At each state s, there are (|F(s)| choose 2) such pairs.

Lemma F.9 (Optimal value is piecewise rational on γ). V∗_R(s, γ) is piecewise rational on γ.

Proof. By lemma F.8, the optimal visit distribution changes a finite number of times for γ ∈ [0, 1). On each non-degenerate subinterval where the optimal visit distribution set is constant, f^π is fixed. By corollary F.6, V∗_R(s, γ) is rational on this subinterval.

F.2 Non-dominated visit distribution functions

Lemma F.10 (Non-domination iff strict optimality for some reward function). f ∈ Fnd(s) iff ∃γ ∈ (0, 1), r ∈ ℝ^{|S|} : f(γ)ᵀr > max_{f′∈F(s)\{f}} f′(γ)ᵀr.

Proof. This result follows directly from the definition of non-domination (definition 5.6).

Definition F.11 (Support of D). supp(D) is the smallest closed subset of [0, 1]^{|S|} whose complement has measure zero under D.

Although we consider reward functions R ∈ [0, 1]^S, we sometimes abuse notation by writing R ∈ supp(D) to mean r ∈ supp(D) (where r ∈ [0, 1]^{|S|} is R represented as a column vector).


Definition F.12 (Optimality support).

supp(f, γ) := {r | r ∈ supp(D), f(γ)ᵀr = max_{f′∈F(s)} f′(γ)ᵀr}.    (12)

supp(f, γ) can be calculated by solving the relevant system of |F(s)| − 1 inequalities.¹ For example, consider fig. 20. We would like to calculate supp(f^{π_right}, γ) to determine V∗_avg(s1, γ).

Figure 20: From s1, action right leads to s2 and action down leads to s3.

f^{π_right}(γ)ᵀr ≥ f^{π_down}(γ)ᵀr requires

R(s1) + γR(s2)/(1 − γ) ≥ R(s1) + γR(s3)/(1 − γ),

so R(s2) ≥ R(s3). Intersecting this halfspace with supp(D), we have supp(f^{π_right}, γ).
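The same halfspace can be recovered mechanically, in the spirit of the Mathematica code referenced in the footnote — a minimal Python/sympy sketch, assuming right and down lead to absorbing states s2 and s3 as in fig. 20.

```python
import sympy as sp

g, R1, R2, R3 = sp.symbols("gamma R1 R2 R3")

# Visit distribution functions from s1 (definition 3.3): one unit of visitation at s1,
# then absorption in s2 (right) or s3 (down).
f_right = sp.Matrix([1, g / (1 - g), 0])
f_down  = sp.Matrix([1, 0, g / (1 - g)])
r = sp.Matrix([R1, R2, R3])

diff = sp.simplify((f_right - f_down).dot(r))
print(diff)   # equals γ(R2 − R3)/(1 − γ): nonnegative for γ ∈ (0, 1) iff R2 ≥ R3
```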

Lemma F.13 (Geometric properties of optimality support). At discount rate γ, the set of reward functions for which f ∈ F(s) is optimal is a pointed closed convex cone.

Proof. The set of reward functions for which f is optimal at discount rate γ is closed and convex because it is the intersection of half-spaces: f(γ)ᵀr ≥ max_{f′∈F(s)\{f}} f′(γ)ᵀr. The set is a pointed cone because for any α ≥ 0, f(γ)ᵀr ≥ max_{f′∈F(s)\{f}} f′(γ)ᵀr implies f(γ)ᵀ(αr) ≥ max_{f′∈F(s)\{f}} f′(γ)ᵀ(αr).

Remark. Unless noted otherwise, we endow ℝ^{|S|} with the standard topology.

Lemma F.14 (f ∈ Fnd(s) strictly optimal for R ∈ supp(D)). If f ∈ Fnd(s), then for any γ ∈ (0, 1), f is strictly optimal for some r belonging to the interior of supp(f, γ) ⊆ supp(D).

Proof. Since f ∈ Fnd(s), f is strictly optimal for R′ ∈ ℝ^S. Since X is continuous, it must have support on some open interval (a, b) ⊊ [0, 1]. Apply a positive affine transformation to R′, producing R such that a < min_s R(s) ≤ max_s R(s) < b. Since optimal policies are invariant to positive affine transformation, R ∈ supp(f, γ) ⊆ supp(D) and f is strictly optimal for R. Since R satisfies a < min_s R(s) ≤ max_s R(s) < b, R is in the interior of supp(D).

Since f(γ)ᵀr > max_{f′∈F(s)\{f}} f′(γ)ᵀr and since optimal value is continuous on the reward function (lemma F.4), the fact that r ∈ supp(D) implies that in the subspace topology induced by D, there exists an ε-neighborhood of reward functions for which f remains strictly optimal. Therefore, r is in the interior of supp(f, γ).

Since we assume state-based reward functions, naively plugging in γ = 0 would make all policies optimal.

Definition F.15 (Greedy optimality for state-based reward functions). Π∗(R, 0) := lim_{γ→0} Π∗(R, γ) is the greedily optimal policy set for reward function R. Lemma F.7 implies that this limit always exists.

The intuition for the following result is that optimal value functions are precisely those value functions which cannot be improved within the environmental dynamics.

Theorem F.16 (A means of transferring optimal policy sets across discount rates). Suppose reward function R has optimal policy set Π∗(R, γ∗) at discount rate γ∗ ∈ (0, 1). For any γ ∈ [0, 1), we can construct a reward function R′ with the same optimal policy set.

Proof. Let π∗ ∈ Π∗(R, γ∗), and let T_{π∗} be the transition matrix induced by policy π∗. By the Bellman equations, R = (I − γ∗ T_{π∗}) V^{π∗}_R.

¹ Mathematica code to calculate these inequalities can be found at https://github.com/loganriggs/gold.


For γ ∈ [0, 1), we construct R′ := (I − γ T_{π∗}) V^{π∗}_R. A policy improves on π∗ for R′ iff it increases expected value at at least one state. Then we would have

∃s, a : R′(s) + γ E_{s′∼T(s,a)}[V^{π∗}_{R′}(s′, γ)] > R′(s) + γ E_{s′∼T(s,π∗(s))}[V^{π∗}_{R′}(s′, γ)].    (13)

But V^{π∗}_{R′}(·, γ) = V^{π∗}_R(·, γ∗), and so a policy improvement of π∗ for R′ at γ must be a policy improvement of π∗ for R at γ∗. But π∗ is an optimal policy for R at γ∗. Therefore, each π∗ must still be optimal for R′ at γ.

Furthermore, since all optimal policies share the same value function, no new policies become optimal for R′ at discount rate γ (else they would have been optimal for R at γ∗, a contradiction).

Then ∀γ ∈ [0, 1) : Π∗(R′, γ) = Π∗(R, γ∗).

Usually, one fixes a reward function and solves for the value function; theorem F.16 instead fixes a value function and solves for the incentives which make it optimal.
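A quick numerical illustration of the construction — a minimal sketch, assuming a small random deterministic MDP; the helper `greedy_optimal_policy` is illustrative, not anything defined in the paper, and the code uses the row-stochastic orientation of the Bellman equation as its convention.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 5, 3
next_state = rng.integers(n_states, size=(n_states, n_actions))   # deterministic dynamics

def greedy_optimal_policy(r, gamma, iters=400):
    """Value iteration for state-based reward r; returns the greedy policy and V*."""
    v = np.zeros(n_states)
    for _ in range(iters):
        v = r + gamma * v[next_state].max(axis=1)
    return v[next_state].argmax(axis=1), v

R = rng.random(n_states)
gamma_star, gamma_new = 0.9, 0.5

pi_star, _ = greedy_optimal_policy(R, gamma_star)
P = np.zeros((n_states, n_states))                                 # row-stochastic P_{π*}
P[np.arange(n_states), next_state[np.arange(n_states), pi_star]] = 1.0

v_pi = np.linalg.solve(np.eye(n_states) - gamma_star * P, R)       # V^{π*}_R(·, γ*)
R_prime = (np.eye(n_states) - gamma_new * P) @ v_pi                # theorem F.16's construction

pi_prime, _ = greedy_optimal_policy(R_prime, gamma_new)
print(np.array_equal(pi_star, pi_prime))                           # True: optimal policy preserved
```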

Theorem F.16 implies that if value function V is optimal for a reward function at some discount rate, then V is also optimal at any other discount rate for any other reward function which has a policy inducing V. The two reward functions also have the same optimal policy set.

Lemma F.17 (Non-dominated visit distributions have positive optimality probability). f ∈ Fnd(s) implies ∀γ ∈ (0, 1) : PD(f, γ) > 0.

Proof. If f is non-dominated, then it is strictly optimal for some reward function R at some discount rate γ∗. Let γ ∈ (0, 1). By theorem F.16, we can construct a reward function R′ for which f is strictly optimal at γ. By lemma F.14, we can scale-and-shift R′ to produce an R′′ ∈ supp(f, γ).

Since f is strictly optimal, supp(f, γ) has nonempty interior by the continuity of V∗_{R′′}(s) on R′′. supp(f, γ) has positive measure under D because D is a continuous distribution. PD(f, γ) > 0.

Theorem F.18 (Reward functions map injectively to optimal value functions). ∀γ ∈ [0, 1), R ↦ V∗_R(·, γ) is injective.

Proof. Given V∗_R and the rewardless MDP, deduce an optimal policy π∗ for R by choosing a V∗_R-greedy action for each state. Let T_{π∗} be the transition probabilities under π∗.

V∗_R = R + γ T_{π∗} V∗_R    (14)
(I − γ T_{π∗}) V∗_R = R.    (15)

If two reward functions have the same optimal value function, then they have the same optimal policies. Then eq. (15) shows that the reward functions must be identical.

Lemma F.19 (Linear independence of a policy's visit distributions). At any fixed γ ∈ [0, 1), the elements of {f^{π,s}(γ) | s ∈ S} are linearly independent.

Proof. Consider the all-zero optimal value function with optimal policy π∗. Theorem F.18 implies the following homogeneous system of equations has a unique solution for r:

f^{π∗,s_1}(γ)ᵀ r = 0
⋮
f^{π∗,s_{|S|}}(γ)ᵀ r = 0.

Therefore, π∗ induces linearly independent f. But r must be the all-zero reward function (for which all policies are optimal), so the f^{π,s} are independent for any policy π.

Lemma F.20 (Distinct linear functionals disagree almost everywhere on their domains). Let x, x′ ∈ ℝ^{|S|} be distinct. Then xᵀr ≠ x′ᵀr for almost all r ∈ ℝ^{|S|}. Furthermore, equality holds with probability 0 under any continuous distribution on ℝ^{|S|}.

Proof. The set of r satisfying xᵀr = x′ᵀr has no interior in the standard topology on ℝ^{|S|}. Since this set is also convex, it has zero Lebesgue measure. By the Radon-Nikodym theorem, it has zero measure under any continuous distribution.

Under any continuous reward function distribution, almost all reward functions have different optimal value for different state distributions.

Lemma F.21 (Two distinct state distributions differ in expected optimal value for almost all reward functions). Let D′ be a continuous distribution over reward functions, let γ ∈ (0, 1), and let ∆, ∆′ ∈ Δ(S). If ∆ ≠ ∆′, then P_{R∼D′}( E_{s∼∆}[V∗_R(s, γ)] = E_{s′∼∆′}[V∗_R(s′, γ)] ) = 0.

Proof. Let R ∈ supp(D′), and let π∗ be one of its optimal policies. By lemma F.19, E_∆[f^{π∗,s}] = E_{∆′}[f^{π∗,s′}] iff ∆ = ∆′. Therefore, E_∆[f^{π∗,s}] ≠ E_{∆′}[f^{π∗,s′}]. Trivially, E_∆[V∗_R(s, γ)] = E_{∆′}[V∗_R(s′, γ)] iff E_∆[f^{π∗,s}]ᵀ r = E_{∆′}[f^{π∗,s′}]ᵀ r. Since E_∆[f^{π∗,s}] ≠ E_{∆′}[f^{π∗,s′}], lemma F.20 implies that the equality holds with probability 0 under D′.

No f is suboptimal for all reward functions: every visit distribution is optimal for a constant reward function. However, for any given γ, almost every reward function has a unique optimal visit distribution at each state.

Theorem F.22 (Optimal visit distributions are almost always unique). Let s be any state. For any γ ∈ (0, 1), {r such that |arg max_{f∈F(s)} f(γ)ᵀr| > 1} has measure zero under any continuous reward function distribution.

Proof. Let R be a reward function and let s be a state at which there is more than one optimal visit distribution for R at discount rate γ. Since R has more than one optimal visit distribution, there must exist a state s′ reachable with positive probability from s such that actions a, a′ are both optimal at s′, where T(s′, a) ≠ T(s′, a′). Then E_{s′′∼T(s′,a)}[V∗_R(s′′, γ)] = E_{s′′∼T(s′,a′)}[V∗_R(s′′, γ)].

By lemma F.21, since T(s′, a) ≠ T(s′, a′), this equation holds with probability 0 for reward functions drawn from any continuous reward function distribution.

Lemma F.23 (∀s : |Fnd(s)| ≥ 1).

Proof. If |F(s)| = 1, then trivially we have |Fnd(s)| = 1 by the definition of non-domination (definition 5.6).

If |F(s)| > 1, then suppose |Fnd(s)| = 0. This means that ∀r ∈ ℝ^{|S|} : |arg max_{f∈F(s)} f(γ)ᵀr| > 1. But theorem F.22 showed that the set of satisfactory reward functions has zero Lebesgue measure, and so this cannot be true for all r. Then |Fnd(s)| ≥ 1.

Lemma F.24 (Strict visitation optimality is sufficient for non-domination). Let γ ∈ [0, 1) and let s, s′ ∈ S. At least one element of arg max_{f∈F(s)} f_{s′}(γ) is non-dominated.

Proof. Let F := arg max_{f∈F(s)} f_{s′}(γ). If γ = 0, max_{f∈F(s)} f_{s′}(γ) > 0 iff s = s′ by definition 3.3, and F = F(s) and the statement is true by lemma F.23. Similarly, if γ > 0 but max_{f∈F(s)} f_{s′}(γ) = 0, then F = F(s).

Suppose γ > 0 and max_{f∈F(s)} f_{s′}(γ) > 0, and let r be the state indicator reward function for s′. Suppose F ⊆ F(s) \ Fnd(s). Then max_{f∈F(s)} f(γ)ᵀr > max_{f′∈Fnd(s)} f′(γ)ᵀr, a contradiction. Therefore, at least one f ∈ F also belongs to Fnd(s).


Remark. Lemma F.24 implies that if s′ is reachable with positive probability from s, then there is at least one non-dominated visit distribution function which assigns s′ positive visitation frequency. In this sense, Fnd(s) "covers" the states reachable from s.

Corollary F.25 (|Fnd(s)| = 1 iff |F(s)| = 1).

Proof. We show |Fnd(s)| = 1 implies |F(s)| = 1 by proving the contrapositive. Suppose |F(s)| ≥ 2, and let f, f′ ∈ F(s) be distinct. Since ∀γ ∈ [0, 1) : ‖f‖_1 = 1/(1 − γ) and since they are distinct visit distribution functions on γ, ∃γ ∈ (0, 1), s′, s′′ ∈ S : (f_{s′}(γ) > f′_{s′}(γ)) ∧ (f_{s′′}(γ) < f′_{s′′}(γ)). By lemma F.24, |Fnd(s)| ≥ 2.

If |F(s)| = 1, then |Fnd(s)| = 1 by lemma F.23.

F.3 Basic properties of POWER

Lemma F.26 (POWER identities).

POWER(s, γ) := (1 − γ)/γ · (V∗_avg(s, γ) − E[X])    (16)
            = E_{r∼D}[ max_{f∈F(s)} (1 − γ)/γ · (f(γ) − e_s)ᵀ r ]    (17)
            = E_{R∼D}[ max_{π∈Π} E_{s′∼T(s,π(s))}[ (1 − γ) V^π_R(s′, γ) ] ].    (18)

Proof.

POWER(s, γ) := (1 − γ)/γ · (V∗_avg(s, γ) − E[X])    (19)
            = (1 − γ)/γ · ( E_{r∼D}[ max_{f∈F(s)} f(γ)ᵀr ] − E[X] )    (20)
            = E_{r∼D}[ max_{f∈F(s)} (1 − γ)/γ · (f(γ) − e_s)ᵀ r ]    (21)
            = E_{r∼D}[ max_{π∈Π} E_{s′∼T(s,π(s))}[ (1 − γ) f^{π,s′}(γ)ᵀ r ] ]    (22)
            = E_{R∼D}[ max_{π∈Π} E_{s′∼T(s,π(s))}[ (1 − γ) V^π_R(s′, γ) ] ].    (23)

Equation (20) holds by the definition of V∗_avg(s, γ) (definition 4.1). In eq. (21), E[X] = E_{r∼D}[e_sᵀ r] holds because D identically and independently distributes reward over each state according to X. Equation (22) holds because f^{π,s}(γ) = e_s + γ E_{s′∼T(s,π(s))}[f^{π,s′}(γ)] by the definition of a visit distribution function (definition 3.3).
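These identities suggest a direct Monte Carlo estimator for POWER — a minimal sketch, assuming a small random deterministic MDP with rewards drawn IID from X = unif(0, 1); the environment and helper names are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 6, 2, 0.9
next_state = rng.integers(n_states, size=(n_states, n_actions))   # deterministic dynamics

def optimal_values(r, iters=400):
    """V*_R(·, γ) by value iteration for state-based reward r."""
    v = np.zeros(n_states)
    for _ in range(iters):
        v = r + gamma * v[next_state].max(axis=1)
    return v

def power_estimate(s, n_samples=1_000):
    """POWER(s, γ) ≈ E_R[max_a (1 − γ) V*_R(next_state[s, a], γ)], per corollary 4.3."""
    total = 0.0
    for _ in range(n_samples):
        r = rng.random(n_states)            # R(·) drawn IID from X = unif(0, 1)
        v = optimal_values(r)
        total += (1 - gamma) * v[next_state[s]].max()
    return total / n_samples

# Lemmas 4.5 and 4.6 bound the result between E[X] = 0.5 and E[max of 6 draws] = 6/7 ≈ 0.86.
print(power_estimate(0))
```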

Corollary 4.3 (POWER is the average normalized optimal value).

POWER(s, γ) = E_{R∼D}[ max_{π∈Π} E_{s′∼T(s,π(s))}[ (1 − γ) V^π_R(s′, γ) ] ].    (4)

Lemma 4.4 (Continuity of POWER). POWER(s, γ) is Lipschitz continuous on γ ∈ [0, 1].

Proof. By lemma F.9, for all states s′ and reward functions R, V∗_R(s′, γ) is piecewise rational – and therefore continuously differentiable – on γ ∈ [0, 1). We first show that (1 − γ)V∗_R(s′, γ) has uniformly bounded derivative, with the uniformity being across states s′ ∈ S, reward functions R ∈ supp(D), and γ ∈ [0, 1).


Let π be an optimal policy for R at discount γ. Since V^π_R(s′, γ) = f^{π,s′}(γ)ᵀr, the optimal value function's derivative with respect to γ is controlled by the behavior of f^{π,s′}(γ). By lemma F.5, f^{π,s′} is a multivariate rational function on γ. Consider its output for state s′′: f^{π,s′}_{s′′}(γ) = P(γ)/Q(γ) in reduced form. By the definition of a visit distribution (definition 3.3), 0 ≤ f^{π,s′}_{s′′}(γ) ≤ 1/(1 − γ). Therefore, Q may only have a root of multiplicity 1 at γ = 1, and Q(γ) ≠ 0 for γ ∈ [0, 1).

Let f(γ) := (1 − γ)P(γ)/Q(γ) = (1 − γ)f^{π,s′}_{s′′}(γ). By the above bounds on f^{π,s′}_{s′′}(γ), f is bounded [0, 1] on the domain [0, 1). If Q does not have a root at γ = 1, then f′ is bounded on the domain [0, 1] because the polynomial (1 − γ)P(γ) cannot diverge on a bounded domain.

If Q does have a root at γ = 1, then factor out the root as Q = (1 − γ)Q∗.

f′ = d/dγ ( (1 − γ)P / Q )    (24)
   = d/dγ ( P / Q∗ )    (25)
   = (P′Q∗ − (Q∗)′P) / (Q∗)².    (26)

Since Q∗ has no zeros on γ ∈ [0, 1], f′ is bounded on γ ∈ [0, 1]. f′ singled out a state s′′, but there are only finitely many states, and so the derivatives of f^{π,s′}(γ) are bounded in sup-norm. There are only finitely many f^{π,s′} ∈ F(s′), and only finitely many states s′, and so there exists some K such that ∀s′ ∈ S, π ∈ Π, γ ∈ [0, 1] : ‖ d/dγ (1 − γ)f^{π,s′}(γ) ‖_∞ ≤ K.

sup_{R∈supp(D), s′∈S, π∈Π, γ∈[0,1)} d/dγ (1 − γ)V^π_R(s′, γ)
    = sup_{r∈supp(D), s′∈S, π∈Π, γ∈[0,1)} d/dγ (1 − γ)f^{π,s′}(γ)ᵀ r    (27)
    ≤ sup_{r∈supp(D), s′∈S, π∈Π, γ∈[0,1)} ‖ d/dγ (1 − γ)f^{π,s′}(γ) ‖_∞ ‖r‖_1    (28)
    ≤ K |S|.    (29)

Because on-policy value is rational on γ (corollary F.6), the derivatives exist on the left-hand side of eq. (27). Equation (29) follows because supp(D) is a subset of the unit hypercube.

By corollary 4.3, POWER(s, γ) = E_{R∼D}[max_{π∈Π} E_{s′∼T(s,π(s))}[(1 − γ)V^π_R(s′, γ)]]. The expectation of the maximum of a set of functions which share a Lipschitz constant also shares the Lipschitz constant. This shows that POWER is Lipschitz continuous on (0, 1). Thus, its limits are well-defined as γ → 0 and γ → 1. So it is Lipschitz continuous on the closed unit interval.

Lemma 4.5 (Minimal POWER). ∀γ ∈ (0, 1) : POWER(s, γ) ≥ E[X], with equality iff |F(s)| = 1.

Proof. If |F(s)| = 1, then V∗_avg(s, γ) = E[X]/(1 − γ) by the linearity of expectation and the fact that D distributes reward independently and identically across states. If |F(s)| > 1, by corollary F.25 there exists a second non-dominated visit distribution f′. ∀γ ∈ (0, 1) : supp(f′, γ) has a positive-measure interior by lemma F.17. By definition, each reward function in the interior achieves strictly greater optimal value via the second non-dominated visit distribution. Then POWER(s, γ) > E[X].

Lemma 4.6 (Maximal POWER). POWER(s, γ) ≤ E[max of |S| draws from X], with equality iff s can deterministically reach all states in one step and all states have deterministic self-loops.

Proof. Forward: each visit distribution which immediately navigates to a state and stays there is non-dominated by lemma F.24. Since the agent can simply navigate to the most rewarding state and stay there, these are also the only non-dominated visit distributions, and |Fnd(s)| = |S|.


Navigating to a child state is optimal iff that state maximizes reward. Let F_max be the probability measure for the random variable corresponding to the maximum of |S| draws from X.

POWER(s, γ) = ∫₀¹ r_max dF_max(r_max)    (30)
            = E[max of |S| draws from X].    (31)

Backward: if s cannot deterministically reach all states in one step or if some state does not have a deterministic self-loop, adding such transitions would create a new non-dominated visit distribution by lemma F.24. Because lemma F.17 shows that non-dominated visit distributions are strictly optimal with positive probability, this strictly increases POWER(s, γ).

Lemma F.27 (One-step reachability bounds average difference in optimal value). Let D′ be any continuous distribution over reward functions bounded [0, 1] and let γ ∈ [0, 1). If s and s′ can reach each other with probability 1 in one step, then E_{D′}[ |V∗_R(s, γ) − V∗_R(s′, γ)| ] < 1.

Proof. By Proposition 1 of Turner et al. [31] and because each R ∈ supp(D′) is bounded [0, 1], |V∗_R(s, γ) − V∗_R(s′, γ)| ≤ (1 − γ) · (1 − 0)/(1 − γ) = 1.

For equality to hold, it must be the case that for almost all R ∈ supp(D′), |V∗_R(s, γ) − V∗_R(s′, γ)| = 1. Because we assumed that such s and s′ can reach each other in one step, this implies that for almost all such R, either R(s) = 0 and R(s′) = 1, or vice versa. But this would imply that D′ has a discontinuous reward distribution for these states, contradicting its continuity. Then the inequality is strict.

Proposition 4.7 (POWER is smooth across reversible dynamics). Suppose s and s′ can both reach each other in one step with probability 1. Then |POWER(s, γ) − POWER(s′, γ)| < (1 − γ)/γ.

Proof. Suppose γ ∈ [0, 1).

|V∗_avg(s, γ) − V∗_avg(s′, γ)| ≤ E_{R∼D}[ |V∗_R(s, γ) − V∗_R(s′, γ)| ]    (32)
                              < 1.    (33)

Equation (32) holds by the triangle inequality. Equation (33) holds by applying lemma F.27 (whose preconditions are met since s and s′ can reach each other in one step with probability 1). Then

|V∗_avg(s, γ) − V∗_avg(s′, γ)| < 1    (34)
|(V∗_avg(s, γ) − E[X]) − (V∗_avg(s′, γ) − E[X])| < 1    (35)
(1 − γ)/γ · |(V∗_avg(s, γ) − E[X]) − (V∗_avg(s′, γ) − E[X])| < (1 − γ)/γ    (36)
|POWER(s, γ) − POWER(s′, γ)| < (1 − γ)/γ.    (37)

If γ = 1, then the above reasoning shows that lim_{γ→1} |POWER(s, γ) − POWER(s′, γ)| = 0. Since POWER is (Lipschitz) continuous on γ ∈ [0, 1] by lemma 4.4, |POWER(s, 1) − POWER(s′, 1)| = lim_{γ→1} |POWER(s, γ) − POWER(s′, γ)|, and so the statement holds for γ = 1.

Lemma 4.8 (Lower bound on current POWER based on future POWER).

POWER(s, γ) ≥ (1 − γ)E[X] + γ max_a E_{s′∼T(s,a)}[POWER(s′, γ)].


Proof. Let γ ∈ (0, 1).

POWER(s, γ) = (1 − γ) E_{R∼D}[ max_a E_{s′∼T(s,a)}[ V∗_R(s′, γ) ] ]    (38)
            ≥ (1 − γ) max_a E_{s′∼T(s,a)}[ E_{R∼D}[ V∗_R(s′, γ) ] ]    (39)
            = (1 − γ) max_a E_{s′∼T(s,a)}[ V∗_avg(s′, γ) ]    (40)
            = (1 − γ) max_a E_{s′∼T(s,a)}[ γ/(1 − γ) · POWER(s′, γ) + E[X] ]    (41)
            = (1 − γ)E[X] + γ max_a E_{s′∼T(s,a)}[ POWER(s′, γ) ].    (42)

Equation (38) holds by lemma F.26. Equation (39) follows because E_{x∼X}[max_a f(a, x)] ≥ max_a E_{x∼X}[f(a, x)], and eq. (41) follows by the definition of POWER (definition 4.2).

The inequality also holds when we take the limits γ → 0 or γ → 1.

Corollary 4.10 (Delay decreases POWER). Let s0, . . . , s_ℓ be such that for all i < ℓ, Ch(s_i) = {s_{i+1}}. Then POWER(s0, γ) = (1 − γ^ℓ)E[X] + γ^ℓ POWER(s_ℓ, γ).

Proof. Iteratively apply lemma 4.8 ℓ times. Equality must hold, as each s_i can only reach s_{i+1}.
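For concreteness, unrolling lemma 4.8 once (the ℓ = 2 case, with equality because each s_i has the single child s_{i+1}) gives

POWER(s0, γ) = (1 − γ)E[X] + γ POWER(s1, γ)
             = (1 − γ)E[X] + γ[(1 − γ)E[X] + γ POWER(s2, γ)]
             = (1 − γ²)E[X] + γ² POWER(s2, γ),

and iterating the same substitution ℓ times yields the stated identity.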

F.4 How discounting affects the optimal policy set

Corollary F.28 (One-sided limits exist for Π∗(R, γ)). Let L ∈ (0, 1) and let R be any reward function. lim_{γ↑L} Π∗(R, γ) and lim_{γ↓L} Π∗(R, γ) both exist.

Proof. By lemma F.8, Π∗(R, γ) can take on at most finitely many values for γ ∈ (0, 1). Thus, infinite oscillation cannot occur in either one-sided limit, and so both one-sided limits exist.

Definition 5.2 (Optimal policy shift). If γ∗ ∈ (0, 1) is such that lim_{γ−↑γ∗} Π∗(R, γ−) ≠ Π∗(R, γ∗), then R has an optimal policy shift at γ∗ (results in appendix F.4 ensure that the limit exists). Similarly, R has an optimal visit distribution shift at γ∗.

Lemma F.29 (Optimal policy sets overlap when shifts occur). Let R be a reward function. Suppose an optimal policy shift occurs at γ∗, and let Π− := lim_{γ−↑γ∗} Π∗(R, γ−), Π+ := lim_{γ+↓γ∗} Π∗(R, γ+). Then Π− ∪ Π+ ⊆ Π∗(R, γ∗). In particular, at discount rate γ∗, there exists a state at which R has at least two optimal visit distributions.

Proof. Since an optimal policy shift occurs at γ∗ and since V∗_R(s, γ) is continuous on γ by lemma F.9, ∀π− ∈ Π−, π+ ∈ Π+, s ∈ S : V^{π−}_R(s, γ∗) = V^{π+}_R(s, γ∗).

Figure 21: In lemma F.29, Π− can equal Π+. Let R be the reward function whose rewards are shown in green in the diagram (0, −.25, 1, −1, 0). The shortcut is optimal for all γ. An optimal policy shift occurs at γ∗ = .5. Since Π− = Π+ only contain policies which take the shortcut, Π− ∪ Π+ ⊊ Π∗(R, γ∗).

Corollary F.30 shows that definition 5.2 loses no generality by defining optimal policy shifts with respect to the lower limit.

Corollary F.30 (Lower-limit optimal policy set inequality iff upper-limit inequality). Let γ∗ ∈ (0, 1), and Π− := lim_{γ−↑γ∗} Π∗(R, γ−), Π+ := lim_{γ+↓γ∗} Π∗(R, γ+). Π− ≠ Π∗(R, γ∗) iff Π+ ≠ Π∗(R, γ∗).


Proof. Suppose Π− ≠ Π∗(R, γ∗) but Π∗(R, γ∗) = Π+. By lemma F.29, if Π− ≠ Π∗(R, γ∗), then Π− ⊊ Π∗(R, γ∗). Let π∗ ∈ Π∗(R, γ∗) \ Π− and π− ∈ Π−. Since π∗ ∉ Π−, there exists some ε₁ > 0 such that π∗ isn't optimal for all γ′ ∈ (γ∗ − ε₁, γ∗). In particular, (f^{π∗}(γ′) − f^{π−}(γ′))ᵀr < 0 for such γ′. In particular, (f^{π∗}(γ) − f^{π−}(γ))ᵀr is not the zero function on γ.

Therefore, lemma F.7 implies that (f^{π∗}(γ) − f^{π−}(γ))ᵀr has finitely many roots on γ. But since π∗ ∈ Π∗(R, γ∗) = Π+, there exists ε₂ > 0 such that ∀γ′ ∈ [γ∗, γ∗ + ε₂) : (f^{π∗}(γ′) − f^{π−}(γ′))ᵀr = 0. But this would imply that the expression has infinitely many roots, a contradiction. Therefore, if Π− ≠ Π∗(R, γ∗), then Π∗(R, γ∗) ≠ Π+.

The proof of the reverse implication proceeds identically.

Corollary F.31 (Almost all reward functions don't have an optimal policy shift at any given γ). Let γ ∈ (0, 1). The subset of supp(D) with an optimal policy shift occurring at γ has measure zero.

Proof. Combine lemma F.29 and lemma F.3 with the fact that at any given γ, almost all of D has a unique optimal visit distribution (theorem F.22).

Lemma F.32 (For continuous reward function distributions, optimality probability is additive over visit distribution functions). Let F ⊆ F(s). If D′ is a continuous distribution, then PD′(F, γ) = Σ_{f∈F} PD′(f, γ).

Proof. When D′ is continuous, theorem F.22 shows that a measure zero set of reward functions has multiple optimal visit distributions at s for any given γ ∈ (0, 1).

Lemma F.33 (Optimality probability is continuous on γ). For f ∈ F(s), PD(f, γ) is continuous on γ ∈ [0, 1].

Proof. For γ ∈ (0, 1), if not, there would exist a γ at which a positive measure subset of D experiences an optimal policy shift, contradicting corollary F.31. At γ ∈ {0, 1}, PD(f, γ) is continuous by definition (proposition F.34).

Theorem 5.3 suggests that the vast majority of deterministic rewardless MDPs allow optimal policy shifts, as the criterion is easily fulfilled.

Theorem 5.3 (Characterization of optimal policy shifts in deterministic MDPs). In deterministic environments, there exists a reward function whose optimal action at s0 changes with γ iff ∃s1 ∈ Ch(s0), s′1 ∈ Ch(s0), s′2 ∈ Ch(s′1) \ Ch(s1) : s′2 ∉ Ch(s0) ∨ (s1 ∉ Ch(s1) ∧ s′1 ∉ Ch(s1)).

Proof. Forward direction. Without loss of generality, suppose the optimal policy set of some R is shifting for the first time (a finite number of shifts occur by Blackwell [2], which does not depend on this result).

Starting at state s0, let the policies π, π′ induce state trajectories s0 s1 s2 . . . and s0 s′1 s′2 . . ., respectively, with the shift occurring to an optimal policy set containing π′ at discount rate γ. By the definition of an optimal policy shift at γ, V^π_R(s0, γ) = V^{π′}_R(s0, γ). Because π was greedily optimal and π′ was not, s1 ≠ s′1 and R(s1) > R(s′1). If Ch(s1) = Ch(s′1), π(s0) remains the optimal action at s0 and no shift occurs. Without loss of generality, suppose s′2 ∉ Ch(s1).

We show the impossibility of ¬(s′2 ∉ Ch(s0) ∨ (s1 ∉ Ch(s1) ∧ s′1 ∉ Ch(s1))) = s′2 ∈ Ch(s0) ∧ (s1 ∈ Ch(s1) ∨ s′1 ∈ Ch(s1)), given that π′ becomes optimal at γ.

Case: s′2 ∈ Ch(s0) ∧ s1 ∈ Ch(s1). For π′ to be optimal, navigating to s1 and staying there cannot be a better policy than following π′ from s0. Formally, R(s1)/(1 − γ) ≤ V^{π′}_R(s′1, γ) implies R(s1) ≤ (1 − γ)V^{π′}_R(s′1, γ) = (1 − γ)(R(s′1) + γV^{π′}_R(s′2, γ)).

We now construct a policy π′2 which strictly improves upon π′. Since s′2 ∈ Ch(s0), ∃a′2 : T(s0, a′2, s′2) = 1. Let π′2 equal π′ except that π′2(s0) := a′2. Then since R(s′1) < R(s1), V^{π′2}_R(s0, γ) > V^{π′}_R(s0, γ), contradicting the assumed optimality of π′.

Figure 22: Dotted arrows illustrate the assumptions for each case: (a) s′2 ∈ Ch(s0) ∧ s1 ∈ Ch(s1); (b) s′2 ∈ Ch(s0) ∧ s′1 ∈ Ch(s1). Given that there exists a reward function R whose optimal action at s0 changes at γ, neither assumption can hold. Although not illustrated here, e.g. s2 = s0 or s′2 = s0 is consistent with theorem 5.3. We leave the rest of the model blank as we make no further assumptions about its topology.

Case: s′2 ∈ Ch(s0) ∧ s′1 ∈ Ch(s1). For π′ to be optimal, navigating to s1, then to s′1 (made possible by s′1 ∈ Ch(s1)), and then following π′ cannot be a better policy than following π′ from s0. Formally, R(s1) + γV^{π′}_R(s′1, γ) ≤ V^{π′}_R(s′1, γ). This implies that R(s1) ≤ (1 − γ)V^{π′}_R(s′1, γ) = (1 − γ)(R(s′1) + γV^{π′}_R(s′2, γ)). The policy π′2 constructed above is again a strict improvement over π′ at discount rate γ, contradicting the assumed optimality of π′.

Backward direction. Suppose ∃s1, s′1 ∈ Ch(s0), s′2 ∈ Ch(s′1) \ Ch(s1) : s′2 ∉ Ch(s0) ∨ (s1 ∉ Ch(s1) ∧ s′1 ∉ Ch(s1)). We show that there exists a reward function R whose optimal policy at s0 changes with γ.

If s′2 ∉ Ch(s0), then s′2 ≠ s1 because s1 ∈ Ch(s0). Let R(s1) := .1, R(s′2) := 1, and 0 elsewhere. Suppose that s1 can reach s′2 in two steps and then stay there indefinitely, while the state trajectory of π′ can only stay in s′2 for one time step, after which no more reward accrues. Even under these impossibly conservative assumptions, an optimal trajectory shift occurs from s0 s1 s2 . . . to s0 s′1 s′2 . . .. At the latest, the shift occurs at γ ≈ 0.115, which is a solution of the corresponding equality:

R(s1) + γ²/(1 − γ) · R(s′2) = R(s′1) + γR(s′2)    (43)
.1 + γ²/(1 − γ) = γ.    (44)

Alternatively, suppose s2 = s1, and so π continually accumulates R(s1) = .1. Then there again exists an optimal policy shift corresponding to a solution to .1/(1 − γ) = γ.

By construction, these two scenarios are the only ways in which π might accrue reward, and so an optimal policy shift occurs for R.

If s′2 ∈ Ch(s0), then set R(s1) := 1, R(s′1) := .99, R(s′2) := .9, and 0 elsewhere. Suppose that s1 can reach itself in two steps (the soonest possible, as s1 ∉ Ch(s1)), while neither s′1 nor s′2 can reach themselves or s1. The corresponding equation 1/(1 − γ²) = .99 + .9γ has a solution in the open unit interval. Therefore, a shift occurs even under these maximally conservative assumptions.
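The shift discount rates in the backward direction can be checked numerically — a minimal sketch; the polynomial coefficients below are just eq. (44) and the equation 1/(1 − γ²) = .99 + .9γ rearranged by clearing denominators.

```python
import numpy as np

# eq. (44): .1 + γ²/(1 − γ) = γ  ⇔  2γ² − 1.1γ + 0.1 = 0
roots = np.roots([2, -1.1, 0.1])
print(sorted(roots))           # smallest root ≈ 0.115, matching the text

# second construction: 1/(1 − γ²) = .99 + .9γ  ⇔  0.9γ³ + 0.99γ² − 0.9γ + 0.01 = 0
roots2 = np.roots([0.9, 0.99, -0.9, 0.01])
print([g.real for g in roots2 if 0 < g.real < 1 and abs(g.imag) < 1e-9])   # roots in (0, 1)
```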

F.4.1 How optimality probability behaves as the discount rate goes to 0 or 1

Proposition F.34 (Optimality probability's limits exist). Let D′ be any reward function distribution, let s be a state, and let f ∈ F(s). The following limits exist: PD′(f, 0) := lim_{γ→0} PD′(f, γ) and PD′(f, 1) := lim_{γ→1} PD′(f, γ).

Proof. First consider the limit as γ → 1. Let D′ have probability measure F′, and define δ(γ) := F′({R | R ∈ supp(D′), ∃γ∗ ∈ (γ, 1) : R has an optimal policy shift at γ∗}). Since F′ is a probability measure, δ(γ) is bounded [0, 1]; by definition 5.4, δ(γ) is monotone decreasing as γ → 1. Therefore, lim_{γ→1} δ(γ) exists.


Lemma F.8 proved that each reward function has finitely many optimal policy shifts for γ ∈ (0, 1). If lim_{γ→1} δ(γ) > 0, then there exist reward functions which never shift to their Blackwell-optimal policy set, a contradiction. So the limit equals 0.

By the definition of optimality probability (definition 5.5), only reward functions with optimal policy shifts remaining can affect optimality probability as γ → 1. Then ∀f ∈ F(s), γ∗ ∈ [γ, 1) : |PD′(f, γ) − PD′(f, γ∗)| ≤ δ(γ). Since lim_{γ→1} δ(γ) = 0, ∀f ∈ F(s) : lim_{γ→1} PD′(f, γ) exists.

A similar proof shows that all reward functions eventually shift to their greedy policy set as γ → 0; thus, ∀f ∈ F(s) : lim_{γ→0} PD′(f, γ) exists.

Lemma F.35 (Non-dominated visit distributions have PD (f , 0) > 0). PD (f , 0) > 0 iff f ∈ Fnd(s).

Proof. By theorem F.22, the set of reward functions with multiple optimal visit distributions has measure zero. Dominated visit distributions cannot be uniquely optimal. Therefore, if f is dominated, ∀γ ∈ (0, 1) : PD(f, γ) = 0, and PD(f, 0) := lim_{γ→0} PD(f, γ) = lim_{γ→0} 0 = 0. So PD(f, 0) > 0 implies f ∈ Fnd(s).

Suppose f^π ∈ Fnd(s) is strictly optimal for reward function R at discount rate γ∗. By theorem F.16, f^π is strictly greedily optimal (definition F.15) for reward function R′(s) := V∗_R(s, γ∗). By lemma F.14, there exists in the interior of supp(f^π, γ) a reward function R′′ for which f^π is strictly greedily optimal. Since R′′ is in this interior, R′′ has an open neighborhood of reward functions for which f^π is strictly greedily optimal. Since D is a continuous distribution, this neighborhood has positive measure under D, and so PD(f^π, 0) > 0.

Proposition 5.7 (Non-domination iff positive probability). f ∈ Fnd(s) iff ∀γ ∈ [0, 1) : PD(f, γ) > 0.

Corollary F.36 (Dominated visit distributions are almost never optimal). f ∈ F(s) is dominated iff ∀γ ∈ [0, 1] : PD(f, γ) = 0.

Not all f ∈ Fnd(s) limit to d ∈ RSDND(s); see fig. 23. Conversely, some dominated f do limit to non-dominated d ∈ RSDND(s); for example, consider a state s in a unichain MDP in which there are dominated f ∈ F(s) but |RSD(s)| = 1.

Figure 23: The bifurcated action a is a stochastic transition, where T(s2, a, s3) = 1/2 = T(s2, a, s4). By lemma F.24, navigating from s1 → s2 induces a non-dominated visit distribution f. However, its limiting RSD is half s3, half s4 and is therefore dominated. PD(f, 1) = 0 even though f ∈ Fnd(s).

F.4.2 Robustly instrumental actions are those which are more probably optimal

Theorem 5.10 (Robust instrumentality depends on X). Robust instrumentality cannot be graphically characterized independently of the state reward distribution X.

Proof. See fig. 7.

F.5 Seeking POWER is often robustly instrumental

Theorem 6.2 (Seeking more POWER is not always more probably optimal).

Proof. See fig. 8.


Figure 9: Repeated from page 7. Starting from s, state s′ is a bottleneck with respect to REACH(s′, a′) and REACH(s′, a) via the appropriate actions (definition 6.3). The red subgraph highlights the visit distributions of Fa′ and the green subgraph highlights Fsub; the gray subgraph illustrates Fa \ Fsub. Since Fa′ is similar to Fsub ⊊ Fa, a affords "strictly more transient options." Theorem 6.6 proves that ∀γ ∈ (0, 1) : PD(Fa′, γ) < PD(Fa, γ) and POWER(sa′, γ) < POWER(sa, γ).

F.5.1 Having “strictly more options” is robustly instrumental

Theorem 6.6 (Having "strictly more options" is more probably optimal and also POWER-seeking). Suppose that starting from s, state s′ is a bottleneck for REACH(s′, a) via action a, and also for REACH(s′, a′) via a′. Fa′ := Fnd(s | π∗(s′) = a′), Fa := Fnd(s | π∗(s′) = a) are the sets of visit distribution functions whose policies take the appropriate action at s′. Suppose Fa′ is similar to Fsub ⊆ Fa via a state permutation φ fixing all states not belonging to REACH(s′, a) ∪ REACH(s′, a′).

Then ∀γ ∈ [0, 1] : PD(Fa′, γ) ≤ PD(Fa, γ) and E_{sa′∼T(s′,a′)}[POWER(sa′, γ)] ≤ E_{sa∼T(s′,a)}[POWER(sa, γ)]. If Fsub ⊊ Fa, both inequalities are strict for all γ ∈ (0, 1).

Proof. If a and a′ are equivalent at state s′, Fa′ = Fsub = Fa and so the claimed equality trivially holds. If a and a′ are not equivalent at state s′, then REACH(s′, a) and REACH(s′, a′) must be disjoint by the definition of a bottleneck (definition 6.3).

Define a new permutation

φ′(si) := φ(si)     if si ∈ REACH(s′, a′)
          φ⁻¹(si)   if si ∈ {φ(sk) | sk ∈ REACH(s′, a′)}
          si        else.    (45)

Since the policies which induce Fa′ take a′-equivalent actions at s′, since taking a-equivalent actions at s′ is the only way for s to reach the states of REACH(s′, a), and since a′ and a are not equivalent at s′ by the reachability disjointness condition, no visit distribution in Fa′ visits a state in REACH(s′, a), and vice versa. Since s can reach s′ with positive probability, for any sk ∈ REACH(s′, a′), there exists some f′ ∈ Fa′ which visits sk by lemma F.24. Therefore, since φ(Fa′) = Fsub cannot visit any sk ∈ REACH(s′, a′) and φ fixes all other states, φ must send these sk to REACH(s′, a): {φ(sk) | sk ∈ REACH(s′, a′)} ⊆ REACH(s′, a). We conclude that φ′ is a well-defined permutation.

φ′ enjoys several convenient invariances. By definition, φ′(Fa′) = Fsub and φ′(Fsub) = Fa′. Lastly, let Fextra := Fa \ Fsub.

φ′(Fnd(s) \ Fextra) = φ′(Fnd(s) \ (Fa \ Fsub))    (46)
                   = φ′((Fnd(s) \ Fa) ∪ Fsub)    (47)
                   = (⋃_{a′′ not equivalent to a} φ′(Fnd(s | π∗(s′) = a′′))) ∪ Fa′    (48)
                   = (⋃_{a′′ not equivalent to a or a′} Fnd(s | π∗(s′) = a′′)) ∪ Fsub ∪ Fa′    (49)
                   = Fnd(s) \ Fextra.    (50)

Equation (48) follows by the definition of Fa and from the fact that φ′(Fsub) = Fa′. Equation (49) follows because, by definition, φ′ acts as the identity on visit distribution functions which are not elements of Fsub ∪ Fa′.

Optimality probability. Suppose γ ∈ (0, 1), and let D induce the probability measure F over reward functions.

PD(Fa′, γ) = ∫_D 1[max_{f′∈Fa′} f′(γ)ᵀr = max_{f′′∈Fnd(s)} f′′(γ)ᵀr] dF(r)    (51)
          ≤ ∫_D 1[max_{f′∈Fa′} f′(γ)ᵀr = max_{f′′∈Fnd(s)\Fextra} f′′(γ)ᵀr] dF(r)    (52)
          = ∫_D 1[max_{f′∈Fa′} f′(γ)ᵀr = max_{f′′∈Fnd(s)\Fextra} f′′(γ)ᵀr] dF(P_{φ′}r)    (53)
          = ∫_D 1[max_{f′∈Fa′} f′(γ)ᵀ(P_{φ′}⁻¹r′) = max_{f′′∈Fnd(s)\Fextra} f′′(γ)ᵀ(P_{φ′}⁻¹r′)] dF(r′)    (54)
          = ∫_D 1[max_{f′∈Fa′} (P_{φ′}f′(γ))ᵀr′ = max_{f′′∈Fnd(s)\Fextra} (P_{φ′}f′′(γ))ᵀr′] dF(r′)    (55)
          = ∫_D 1[max_{f∈Fsub} f(γ)ᵀr′ = max_{f′′∈Fnd(s)\Fextra} f′′(γ)ᵀr′] dF(r′)    (56)
          ≤ ∫_D 1[max_{f∈Fa} f(γ)ᵀr′ = max_{f′′∈Fnd(s)} f′′(γ)ᵀr′] dF(r′)    (57)
          = PD(Fa, γ).    (58)

Equation (53) follows because D is IID over states. Equation (54) follows by the substitution r′ := P_{φ′}r. Equation (55) follows because permutation matrices are orthogonal. Equation (56) follows because φ′(Fa′) = Fsub and φ′(Fnd(s) \ Fextra) = Fnd(s) \ Fextra by eq. (50).

Since Fextra ⊆ Fa, ∀r ∈ ℝ^{|S|} : 1[max_{f∈Fsub} f(γ)ᵀr = max_{f′′∈Fnd(s)\Fextra} f′′(γ)ᵀr] ≤ 1[max_{f∈Fa} f(γ)ᵀr = max_{f′′∈Fnd(s)} f′′(γ)ᵀr], and so eq. (57) follows.

POWER. Let F⁺_a ⊆ Fa contain those non-dominated visit distribution functions of Fa assigning s′ nonzero visitation frequency, and let Π⁺_a contain the (deterministic) policies inducing the visit distribution functions of F⁺_a. We similarly define F⁺_sub, Π⁺_sub, F⁺_{a′}, and Π⁺_{a′}.

Let S_a contain those s_a such that T(s′, a, s_a) > 0. We want to show that for all s_a ∈ S_a, restriction to Π⁺_a preserves optimal value for all reward functions. Let R be any reward function. Construct R′ which equals R on REACH(s′, a) and vanishes everywhere else.

Since R′ vanishes outside of REACH(s′, a) and s can only reach REACH(s′, a) by taking actions equivalent to a at state s′, it must be optimal to navigate from s to s′ and take action a, and to then achieve the optimal value at each s_a ∈ S_a. Since we assumed that s can reach s′ with positive probability, R′ must have an optimal policy π∗ that induces a non-dominated f^{π∗,s} ∈ F⁺_a ⊆ Fnd(s) (otherwise only dominated visit distribution functions would satisfy V∗_{R′}(s, γ) = max_{f∈F(s)} f(γ)ᵀr′, a contradiction).

Since ∀s_a ∈ S_a : V∗_R(s_a, γ) = V∗_{R′}(s_a, γ), we conclude that restriction to Π⁺_a preserves optimal value at all s_a for all reward functions R:

∀s_a ∈ S_a : POWER(s_a, γ) = E_{r∼D}[ max_{π∈Π⁺_a} (1 − γ)/γ · (f^{π,s_a}(γ) − e_{s_a})ᵀ r ].    (59)


Similar reasoning applies for POWER (sa′ , γ).

Because φ′ fixes all states sk 6∈ REACH(s′, a

)∪ REACH

(s′, a′

), φ′ fixes “everything that happens

before taking action a or a′ at s′”, and so f ∈ F+a′ implies that Pφ′f also places positive visitation

frequency on s′, and vice versa. Then f ∈ F+a′ iff

(Pφ′f

)∈ F+

sub.

Let F+s′,a′ :=

{fπ,s

′ | π ∈ Π+a′

}be the visit distribution functions which the policies π ∈ Π+

a′

induce starting from s′. We similarly define F+s′,a and F+

s′,sub. Since φ′(F+a′

)= F+

sub, we have

φ′(F+s′,a′

)= F+

s′,sub.

\begin{align*}
&\mathbb{E}_{s_{a'} \sim T(s', a')}\left[\text{POWER}(s_{a'}, \gamma)\right] \tag{60}\\
&= \int_{\mathcal{D}} \max_{\pi' \in \Pi^+_{a'}} \frac{1-\gamma}{\gamma}\left(\mathbb{E}_{s_{a'}}\left[f^{\pi', s_{a'}}(\gamma)\right] - e_{s_{a'}}\right)^\top r \,\mathrm{d}F(r) \tag{61}\\
&= \frac{1-\gamma}{\gamma}\left(\int_{\mathcal{D}} \max_{\pi' \in \Pi^+_{a'}} \left(\mathbb{E}_{s_{a'}}\left[f^{\pi', s_{a'}}(\gamma)\right]\right)^\top r \,\mathrm{d}F(r) - \mathbb{E}[X]\right) \tag{62}\\
&= \frac{1-\gamma}{\gamma}\left(\int_{\mathcal{D}} \max_{f^{\pi', s'} \in \mathcal{F}^+_{s',a'}} \gamma^{-1}\left(f^{\pi', s'}(\gamma) - e_{s'}\right)^\top r \,\mathrm{d}F(r) - \mathbb{E}[X]\right) \tag{63}\\
&= \frac{1-\gamma}{\gamma}\left(\int_{\mathcal{D}} \max_{f^{\pi', s'} \in \mathcal{F}^+_{s',a'}} \gamma^{-1}\left(f^{\pi', s'}(\gamma) - e_{s'}\right)^\top r \,\mathrm{d}F(P_{\phi'} r) - \mathbb{E}[X]\right) \tag{64}\\
&= \frac{1-\gamma}{\gamma}\left(\int_{\mathcal{D}} \max_{f^{\pi', s'} \in \mathcal{F}^+_{s',a'}} \gamma^{-1}\left(f^{\pi', s'}(\gamma) - e_{s'}\right)^\top \left(P_{\phi'}^{-1} r'\right) \mathrm{d}F(r') - \mathbb{E}[X]\right) \tag{65}\\
&= \frac{1-\gamma}{\gamma}\left(\int_{\mathcal{D}} \max_{f^{\pi', s'} \in \mathcal{F}^+_{s',a'}} \gamma^{-1}\left(P_{\phi'} f^{\pi', s'}(\gamma) - e_{s'}\right)^\top r' \,\mathrm{d}F(r') - \mathbb{E}[X]\right) \tag{66}\\
&= \frac{1-\gamma}{\gamma}\left(\int_{\mathcal{D}} \max_{f^{\pi, s'} \in \mathcal{F}^+_{s',\mathrm{sub}}} \gamma^{-1}\left(f^{\pi, s'}(\gamma) - e_{s'}\right)^\top r' \,\mathrm{d}F(r') - \mathbb{E}[X]\right) \tag{67}\\
&\leq \int_{\mathcal{D}} \max_{f^{\pi, s'} \in \mathcal{F}^+_{s',a}} \frac{1-\gamma}{\gamma}\left(\gamma^{-1}\left(f^{\pi, s'}(\gamma) - e_{s'}\right)^\top r' - \mathbb{E}[X]\right) \mathrm{d}F(r') \tag{68}\\
&= \int_{\mathcal{D}} \max_{\pi \in \Pi^+_{a}} \frac{1-\gamma}{\gamma}\left(\mathbb{E}_{s_a}\left[f^{\pi, s_a}(\gamma)\right] - e_{s_a}\right)^\top r' \,\mathrm{d}F(r') \tag{69}\\
&= \mathbb{E}_{s_a \sim T(s', a)}\left[\text{POWER}(s_a, \gamma)\right]. \tag{70}
\end{align*}

Equation (61) and eq. (70) follow by eq. (59). Equation (62) and eq. (69) follow because under D, each state has reward distribution X. Because f^{π′,s′}(γ) = e_{s′} + γ E_{s_a′∼T(s′,π′(s′))}[f^{π′,s_a′}(γ)] (definition 3.3) and T(s′, π′(s′)) = T(s′, a′) by the fact that π′(s′) must be equivalent to a′ at state s′ in order for π′ ∈ Π_a′^+, eq. (63) follows.

Equation (64) follows because D is IID over states. Equation (65) follows via the substitution r′ := P_φ′ r. Because REACH(s′, a) and REACH(s′, a′) are disjoint, s′ is not an element of either set, and so φ′(s′) = s′. Equation (66) follows from the orthogonality of permutation matrices and the fact that φ′(s′) = s′.

Equation (67) then follows because φ′(F_{s′,a′}^+) = F_{s′,sub}^+. Equation (68) follows because F_{s′,sub}^+ ⊆ F_{s′,a}^+.
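The one-step recursion from definition 3.3 used in eq. (63) can be checked numerically. In the sketch below (not from the paper; the policy-induced transition matrix and γ are illustrative assumptions), the discounted visit distribution is computed in closed form as f^{π,s}(γ) = (I − γ T_π^⊤)^{-1} e_s, which is the unique solution of the recursion, and then compared against e_s + γ E_{s′∼T(s,π(s))}[f^{π,s′}(γ)].

```python
import numpy as np

gamma = 0.9
# Assumed policy-induced transition matrix: T_pi[i, j] = P(s_{t+1}=j | s_t=i).
T_pi = np.array([[0.0, 0.7, 0.3],
                 [0.0, 1.0, 0.0],
                 [0.5, 0.0, 0.5]])
n = T_pi.shape[0]

def visit_dist(T_pi, s, gamma):
    """Discounted visit distribution f^{π,s}(γ) = (I - γ T_πᵀ)^{-1} e_s."""
    return np.linalg.solve(np.eye(n) - gamma * T_pi.T, np.eye(n)[s])

for s in range(n):
    f_s = visit_dist(T_pi, s, gamma)
    # Definition 3.3: f^{π,s}(γ) = e_s + γ Σ_{s'} T_π[s, s'] f^{π,s'}(γ).
    recursion = np.eye(n)[s] + gamma * sum(T_pi[s, sp] * visit_dist(T_pi, sp, gamma)
                                           for sp in range(n))
    assert np.allclose(f_s, recursion)
    assert np.isclose(f_s.sum(), 1 / (1 - gamma))   # total discounted visitation
print("definition 3.3 recursion verified")
```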


Strict inequalities. By taking the appropriate limits, we see that the non-strict inequalities hold for γ = 0 and γ = 1. We now show that when F_sub ⊊ F_a, the optimality probability and POWER inequalities are strict for all γ ∈ (0, 1).

Suppose F_extra := F_a \ F_sub is nonempty and let f_e ∈ F_extra. Since φ′ fixes s_k ∉ REACH(s′, a) ∪ REACH(s′, a′), φ′(F_a′ \ F_a′^+) = F_a \ F_a^+; since φ′(F_a′) = F_sub, any such f_e must be an element of F_a^+.

At discount rate γ ∈ (0, 1) and starting from state s, let f_e be strictly optimal for r_1 and let f′ ∈ F_a′ be strictly optimal for r_2.

Optimality probability. We want to find a reward function r for which the following inequalities hold:

\begin{align*}
f_e(\gamma)^\top r &> f'(\gamma)^\top r \tag{71}\\
&> \max_{f \in \mathcal{F}_{\mathrm{nd}}(s) \setminus \{f_e, f'\}} f(\gamma)^\top r. \tag{72}
\end{align*}

We now show that the following reward function satisfies the above inequalities for some choice of α_1, α_2 > 0 and β_1, β_2 ∈ R:

\begin{equation*}
R(s_i) :=
\begin{cases}
\alpha_1 R_1(s_i) + \beta_1 & \text{if } s_i \in \text{REACH}(s', a)\\
\alpha_2 R_2(s_i) + \beta_2 & \text{if } s_i \in \text{REACH}(s', a')\\
0 & \text{else.}
\end{cases} \tag{73}
\end{equation*}

Equation (73) is well-defined because REACH(s′, a) and REACH(s′, a′) are disjoint. Because optimal policy sets are invariant to positive affine transformation of the reward function, for any such choice of α_1, α_2 > 0 and β_1, β_2 ∈ R, at γ we have that f_e remains strictly optimal for r among the elements of F_a, and that f′ remains strictly optimal for r among the elements of F_a′. By disjointness of REACH(s′, a) and REACH(s′, a′), we can positively affinely transform r_1 and r_2 to produce an r satisfying the inequalities of eq. (72).

If r is not in the interior of supp(f_e, γ), lemma F.14 guarantees that we can find another satisfactory reward function r′ in the interior. This means that there is an open neighborhood of satisfactory reward functions. Because D is a continuous distribution, this neighborhood has positive measure, and so there exists a positive-measure set of reward functions for which the above inequalities hold. This implies that

\begin{equation*}
P_{\mathcal{D}}(\mathcal{F}_{a'}, \gamma) < \Pr_{r_d \sim \mathcal{D}}\left(\max_{f' \in \mathcal{F}_{a'}} f'(\gamma)^\top r_d = \max_{f'' \in \mathcal{F}_{\mathrm{nd}}(s) \setminus \mathcal{F}_{\mathrm{extra}}} f''(\gamma)^\top r_d\right), \tag{74}
\end{equation*}

and so the inequality in eq. (52) is strict. Therefore, for arbitrary γ ∈ (0, 1), P_D(F_a′, γ) < P_D(F_a, γ).

Power. Since f_e ∈ F_a^+ is strictly optimal in a positive-measure neighborhood around r′, the inequality in eq. (68) must be strict. Then for arbitrary γ ∈ (0, 1), E_{s_a′}[POWER(s_a′, γ)] < E_{s_a}[POWER(s_a, γ)].

F.5.2 Having more “long-term options” is robustly instrumental when γ ≈ 1

Lemma F.37 shows that only d ∈ RSD_nd(s) matter for calculating POWER(s, 1) and optimality probability; the dominated RSDs can be ignored for these purposes.

Lemma F.37 (d ∈ RSD_nd(s) iff P_D(d, 1) > 0).

Proof. Forward direction: if d ∈ RSD_nd(s), then it is strictly optimal for some reward function R. By reasoning identical to that deployed in the proof of lemma F.14, we can scale-and-shift R to produce an R′ belonging to the interior of supp(d, 1) ⊆ supp(D). supp(d, 1) has positive measure under D because D is a continuous distribution. Therefore P_D(d, 1) > 0.

Backward direction: since P_D(d, 1) > 0 and RSD(s) is finite, lemma F.20 implies that we can select R ∈ supp(d, 1) such that d is uniquely optimal for R. Then d must be strictly optimal, and so d ∈ RSD_nd(s).
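Lemma F.37's equivalence can be probed numerically: sample IID rewards, record which RSD attains the maximum of d^⊤ r, and observe that exactly the non-dominated RSDs are chosen with positive frequency. The sketch below (not from the paper; the candidate RSD vectors and the uniform reward model are illustrative assumptions) includes one dominated RSD, which is essentially never strictly optimal.

```python
import numpy as np

rng = np.random.default_rng(2)
n_samples = 100_000

# Assumed candidate RSDs over 3 states: three deterministic 1-cycles plus one
# dominated RSD (the uniform mixture), which can never be *strictly* optimal.
rsds = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0],
                 [1/3, 1/3, 1/3]])

rewards = rng.uniform(size=(n_samples, 3))        # reward IID ~ Uniform[0,1] per state
values = rewards @ rsds.T                         # d^T r for every sample and candidate
optimality_freq = np.bincount(values.argmax(axis=1), minlength=len(rsds)) / n_samples

# The non-dominated 1-cycles are optimal with probability ≈ 1/3 each; the dominated
# mixture has optimality frequency ≈ 0, matching lemma F.37.
print(optimality_freq)
```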


Theorem 6.8 (When γ = 1, RSD reachability drives POWER). If RSD_nd(s′) is similar to D ⊆ RSD_nd(s), then POWER(s′, 1) ≤ POWER(s, 1). The inequality is strict iff D ⊊ RSD_nd(s).

Proof. D's similarity to RSD_nd(s′) means that the integration in POWER(s′, 1) merely relabels state variables (as in eq. (64)), and reward is IID across states. Because D ⊆ RSD_nd(s) and POWER monotonically increases with additional visit distributions, POWER(s′, 1) ≤ POWER(s, 1).

Suppose the inequality is strict. If RSD_nd(s) and RSD_nd(s′) were similar, then the POWER integration would merely relabel variables, and so the inequality could not be strict; therefore, we must have D ⊊ RSD_nd(s). Suppose instead that D ⊊ RSD_nd(s); then the non-dominated RSDs in RSD_nd(s) \ D are strictly optimal on a positive-measure subset of supp(D) by lemma F.37. This implies POWER(s, 1) > POWER(s′, 1).
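A numeric illustration of theorem 6.8 (not from the paper; the RSD vectors, the permutation, and the uniform reward model are illustrative assumptions), treating POWER(·, 1) as E_r[max_{d ∈ RSD_nd(·)} d^⊤ r], the form the proof integrates over: the RSDs available from s′ are a relabeled copy of a strict subset of those available from s, and the Monte Carlo estimates satisfy POWER(s′, 1) < POWER(s, 1).

```python
import numpy as np

rng = np.random.default_rng(3)
n_samples = 200_000

def power_at_one(rsds, rewards):
    """E_r[max_{d ∈ RSD_nd} dᵀr], a simplified reading of POWER(·, 1)."""
    return np.mean((rewards @ np.stack(rsds).T).max(axis=1))

rewards = rng.uniform(size=(n_samples, 4))        # IID Uniform[0,1] reward per state

# Assumed non-dominated RSDs available from s.
rsd_s = [np.array([1.0, 0.0, 0.0, 0.0]),
         np.array([0.0, 0.5, 0.5, 0.0]),
         np.array([0.0, 0.0, 0.0, 1.0])]
# RSDs available from s': a permuted copy of a strict subset D ⊊ RSD_nd(s).
perm = np.eye(4)[[3, 2, 1, 0]]                    # relabels the states
rsd_sprime = [perm @ rsd_s[0], perm @ rsd_s[1]]

print(power_at_one(rsd_sprime, rewards), power_at_one(rsd_s, rewards))
# Theorem 6.8: POWER(s', 1) < POWER(s, 1) because D ⊊ RSD_nd(s).
```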

[Figure 10: at state s′, action a leads to s_a and action a′ leads to s_{a′}; the a′-branch reaches a single 1-cycle, while the a-branch reaches several cycles.]

Figure 10: Repeated from page 8. At s′, a is more probably optimal than a′ because it allows access to more RSDs. Let D′ contain the cycle in the left subgraph and let D contain the cycles in the right subgraph. D_sub ⊊ D contains only the isolated green 1-cycle. Theorem 6.9 shows that P_D(D′, 1) = P_D(D_sub, 1), as they both only contain a 1-cycle. Because D′ is similar to D_sub ⊊ D, P_D(D′, 1) < P_D(D, 1). Since D is a continuous distribution, optimality probability is additive across visit distribution functions (lemma F.32), and so we conclude 2 · P_D(D′, 1) < P_D(D, 1). By theorem 6.8, a is POWER-seeking compared to a′: POWER(s_a′, 1) < POWER(s_a, 1).
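The caption's comparison can be reproduced in miniature. The sketch below (not from the paper) assumes, purely for illustration, that the left subgraph contributes one 1-cycle and the right subgraph contributes three 1-cycles; with IID uniform rewards, each 1-cycle is optimal with probability 1/4, so P_D(D′, 1) ≈ 1/4 while P_D(D, 1) ≈ 3/4 > 2 · P_D(D′, 1).

```python
import numpy as np

rng = np.random.default_rng(4)
n_samples = 200_000

# Assumed toy layout: four 1-cycles over four states; D' holds the single
# left-hand cycle, D holds the three right-hand cycles.
all_rsds = np.eye(4)            # each 1-cycle's RSD is a standard basis vector

rewards = rng.uniform(size=(n_samples, 4))
best = (rewards @ all_rsds.T).argmax(axis=1)      # which RSD is optimal per sample

p_D_prime = np.mean(best == 0)                    # optimality probability of D'
p_D = np.mean(best >= 1)                          # optimality probability of D
print(p_D_prime, p_D, 2 * p_D_prime < p_D)        # ≈ 0.25, 0.75, True
```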

Theorem 6.9 (When γ = 1, having access to more RSDs is more probably optimal). Let D, D′ ⊆ RSD_nd(s) be such that ‖D − (RSD_nd(s) \ D)‖_1 = ‖D′ − (RSD_nd(s) \ D′)‖_1 = 2. If D′ is similar to some D_sub ⊆ D, then P_D(D′, 1) ≤ P_D(D, 1). If D_sub ⊊ D, the inequality is strict.

Proof. If D′ is empty, then P_D(D′, 1) = 0 and P_D(D, 1) ≥ P_D(D′, 1) trivially follows.

Suppose D′ is not empty; in this case, D is non-empty by the assumed similarity. Let state permutation φ be such that φ(D′) = D_sub; φ exists by the assumed similarity.

\begin{equation*}
\phi'(s_i) :=
\begin{cases}
\phi(s_i) & \text{if } s_i \text{ is visited by some } d \in D'\\
\phi^{-1}(s_i) & \text{if } s_i \text{ is visited by some } d \in D_{\mathrm{sub}}\\
s_i & \text{else.}
\end{cases} \tag{75}
\end{equation*}

φ′ is well-defined by the ‖·‖_1 assumption. By our assumption on φ and the fact that φ′ is an involution, φ′(φ′(D′)) = φ′(D_sub) = D′. Furthermore, let D_extra := D \ D_sub.

\begin{align*}
\phi'\!\left(\mathrm{RSD}_{\mathrm{nd}}(s) \setminus D_{\mathrm{extra}}\right)
&= \phi'\!\left(\left(\mathrm{RSD}_{\mathrm{nd}}(s) \setminus \left(D_{\mathrm{extra}} \cup D_{\mathrm{sub}} \cup D'\right)\right) \cup D_{\mathrm{sub}} \cup D'\right) \tag{76}\\
&= \phi'\!\left(\mathrm{RSD}_{\mathrm{nd}}(s) \setminus \left(D_{\mathrm{extra}} \cup D_{\mathrm{sub}} \cup D'\right)\right) \cup D' \cup D_{\mathrm{sub}} \tag{77}\\
&= \mathrm{RSD}_{\mathrm{nd}}(s) \setminus D_{\mathrm{extra}}. \tag{78}
\end{align*}

We now reason:

\begin{align*}
P_{\mathcal{D}}(D', 1)
&:= \Pr_{r \sim \mathcal{D}}\left(\max_{d \in D'} d^\top r = \max_{d' \in \mathrm{RSD}_{\mathrm{nd}}(s)} d'^\top r\right) \tag{79}\\
&\leq \Pr_{r \sim \mathcal{D}}\left(\max_{d \in D'} d^\top r = \max_{d' \in \mathrm{RSD}_{\mathrm{nd}}(s) \setminus D_{\mathrm{extra}}} d'^\top r\right) \tag{80}\\
&= \Pr_{r \sim \mathcal{D}}\left(\max_{d \in D'} d^\top (P_{\phi'}^{-1} r) = \max_{d' \in \mathrm{RSD}_{\mathrm{nd}}(s) \setminus D_{\mathrm{extra}}} d'^\top (P_{\phi'}^{-1} r)\right) \tag{81}\\
&= \Pr_{r \sim \mathcal{D}}\left(\max_{d \in D'} (P_{\phi'} d)^\top r = \max_{d' \in \mathrm{RSD}_{\mathrm{nd}}(s) \setminus D_{\mathrm{extra}}} (P_{\phi'} d')^\top r\right) \tag{82}\\
&= \Pr_{r \sim \mathcal{D}}\left(\max_{d \in \phi'(D')} d^\top r = \max_{d' \in \phi'(\mathrm{RSD}_{\mathrm{nd}}(s) \setminus D_{\mathrm{extra}})} d'^\top r\right) \tag{83}\\
&= \Pr_{r \sim \mathcal{D}}\left(\max_{d \in D_{\mathrm{sub}}} d^\top r = \max_{d' \in \mathrm{RSD}_{\mathrm{nd}}(s) \setminus D_{\mathrm{extra}}} d'^\top r\right) \tag{84}\\
&\leq \Pr_{r \sim \mathcal{D}}\left(\max_{d \in D} d^\top r = \max_{d' \in \mathrm{RSD}_{\mathrm{nd}}(s)} d'^\top r\right) \tag{85}\\
&= P_{\mathcal{D}}(D, 1). \tag{86}
\end{align*}

Equation (80) holds because D′ and D_extra are disjoint by the ‖·‖_1 assumption. φ′^{-1} just permutes state labels; since D distributes reward independently and identically across states, eq. (81) holds. Equation (82) follows because permutation matrices are orthogonal. Equation (84) holds by eq. (78). Equation (85) holds because D = D_sub ∪ D_extra.

If D_sub = D, then D_extra is empty and P_D(D′, 1) = P_D(D, 1) by eq. (85).

Strict inequality. If D_sub ⊊ D, we show that the inequality in eq. (85) is strict: "reintroducing" D_extra "increases" the optimality probability of D.

Consider d_a ∈ D_extra and d_b ∈ D′. Since both RSDs are non-dominated, there exist r_1, r_2 ∈ [0, 1]^{|S|} for which d_a and d_b are respectively strictly optimal.

For α, β ∈ (0, 1), define

\begin{equation*}
R(s_i \mid \alpha, \beta) :=
\begin{cases}
\alpha R_a(s_i) & \text{if } \exists d \in D : d^\top e_{s_i} > 0\\
\beta R_b(s_i) & \text{if } \exists d \in D' : d^\top e_{s_i} > 0\\
0 & \text{else.}
\end{cases} \tag{87}
\end{equation*}

This is well-defined by the ‖·‖_1 assumption. In column-vector form, R(· | α, β) is expressed as r_{α,β}. By our choice of r_a, the fact that α > 0, and the ‖·‖_1 assumption,

\begin{align*}
\alpha\left(d_a^\top r_a\right) &> \alpha\left(\max_{d \in D \setminus \{d_a\}} d^\top r_a\right) \tag{88}\\
d_a^\top r_{\alpha,\beta} &> \max_{d \in D \setminus \{d_a\}} d^\top r_{\alpha,\beta}. \tag{89}
\end{align*}

Similar reasoning holds for r_b. By eq. (87), for all d ∈ RSD_nd(s) \ (D ∪ D′): d^⊤ r_{α,β} = 0 < d_a^⊤ r_{α,β}, d_b^⊤ r_{α,β}. Therefore, by the ‖·‖_1 assumption, we can choose α, β ∈ (0, 1) such that

\begin{align*}
d_a^\top r_{\alpha,\beta} &> d_b^\top r_{\alpha,\beta} \tag{90}\\
&> \max_{d_c \in \mathrm{RSD}_{\mathrm{nd}}(s) \setminus \{d_a, d_b\}} d_c^\top r_{\alpha,\beta}. \tag{91}
\end{align*}

Because D is a continuous distribution, there exists a positive-measure set of reward functions for which the above inequalities hold. On this set, d_a ∈ D_extra is strictly optimal, but when d_a is not considered, d_b ∈ D′ is strictly optimal (and d_b ∉ D_sub, since the states visited by D and by D′ are disjoint). Therefore, the inequality in eq. (85) is strict, and P_D(D′, 1) < P_D(D, 1).
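A quick numeric check of theorem 6.9 (not from the paper; the RSD vectors and the uniform reward model are illustrative assumptions): D′ is a relabeled copy of a strict subset D_sub ⊊ D with disjoint state supports, and Monte Carlo estimates confirm P_D(D′, 1) < P_D(D, 1).

```python
import numpy as np

rng = np.random.default_rng(5)
n_samples = 200_000

# Assumed non-dominated RSDs over 4 states, with disjoint supports as the theorem requires:
d_prime = np.array([1.0, 0.0, 0.0, 0.0])           # D' = {d_prime}
d_sub   = np.array([0.0, 1.0, 0.0, 0.0])           # D_sub = {d_sub}, similar to D' (swap states 0 and 1)
d_extra = np.array([0.0, 0.0, 0.5, 0.5])           # D = {d_sub, d_extra}, so D_sub ⊊ D
rsd_nd = np.stack([d_prime, d_sub, d_extra])        # RSD_nd(s) = D' ∪ D

rewards = rng.uniform(size=(n_samples, 4))
best = (rewards @ rsd_nd.T).argmax(axis=1)

p_D_prime = np.mean(best == 0)                      # P_D(D', 1)
p_D = np.mean(best >= 1)                            # P_D(D, 1)
print(p_D_prime, p_D)                               # theorem 6.9: P_D(D', 1) < P_D(D, 1)
```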

F.5.3 Sufficient conditions for actions being POWER-seeking at discount rate 0 or 1

Definition F.38 (Surely reachable children). The surely reachable children of s are Ch_sure(s) := {s′ | ∃a : T(s, a, s′) = 1}. Determinism implies that Ch(s) = Ch_sure(s).


Corollary F.39 (POWER bounds when γ = 0).

\begin{equation*}
\mathbb{E}\!\left[\text{max of } \left|\mathrm{Ch}_{\mathrm{sure}}(s)\right| \text{ draws from } X\right] \;\leq\; \text{POWER}(s, 0) \;\leq\; \mathbb{E}\!\left[\text{max of } \left|\mathrm{Ch}(s)\right| \text{ draws from } X\right]. \tag{92}
\end{equation*}

Proof. The left inequality holds because restricting policies to deterministic action at s cannot increase POWER(s, 0). The right inequality holds because, at best, greedy policies deterministically navigate to the child with maximal reward.
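When X is Uniform[0,1], E[max of n draws from X] = n/(n+1), so the bounds in eq. (92) are easy to evaluate. The sketch below (not from the paper; the small stochastic environment and the uniform reward model are illustrative assumptions) estimates POWER(s, 0) by Monte Carlo, taking it to be E_r[max_a E_{s′∼T(s,a)} R(s′)] at γ = 0, and checks that it falls between the two bounds.

```python
import numpy as np

rng = np.random.default_rng(6)
n_samples = 200_000

# Assumed environment: from s, action 0 surely reaches child 0; action 1 reaches
# child 1 or child 2 with equal probability. So |Ch_sure(s)| = 1 and |Ch(s)| = 3.
transition = np.array([[1.0, 0.0, 0.0],
                       [0.0, 0.5, 0.5]])          # rows: actions; columns: children

child_rewards = rng.uniform(size=(n_samples, 3))  # IID draws of X = Uniform[0,1] per child

power_0 = np.mean((child_rewards @ transition.T).max(axis=1))   # E_r[max_a E_{s'~T(s,a)} R(s')]
lower = 1 / 2                                     # E[max of 1 draw] = 1/2
upper = 3 / 4                                     # E[max of 3 draws] = 3/4
print(lower, power_0, upper)                      # eq. (92): lower ≤ POWER(s, 0) ≤ upper
```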

Theorem F.40 (When γ = 0 under local determinism, maximally POWER-seeking actions lead to states with the most children). Suppose all actions have deterministic consequences at s and its children. For each action a, let s_a be such that T(s, a, s_a) = 1. Then POWER(s_a, 0) = max_{a′∈A} POWER(s_a′, 0) iff |Ch(s_a)| = max_{a′∈A} |Ch(s_a′)|.

Proof. Apply the γ = 0 POWER bounds of corollary F.39; by the assumed determinism, Ch(s_a) = Ch_sure(s_a), and so POWER(s_a, 0) = E[max of |Ch(s_a)| draws from X] (similarly for each s_a′). E[max of |Ch(s_a)| draws from X] is strictly monotonically increasing in |Ch(s_a)| by the continuity of X.
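For the uniform case, the strict monotonicity invoked above is visible in closed form: E[max of n draws from Uniform[0,1]] = n/(n+1), which strictly increases in n. A minimal check (assuming X = Uniform[0,1] purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
n_samples = 500_000

for n_children in range(1, 6):
    draws = rng.uniform(size=(n_samples, n_children))
    estimate = draws.max(axis=1).mean()            # E[max of n draws from Uniform[0,1]]
    closed_form = n_children / (n_children + 1)
    print(n_children, round(estimate, 4), closed_form)
# The expectation strictly increases with the number of children, as theorem F.40 requires.
```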

Theorem F.41 (When γ = 1, staying put is maximally POWER-seeking). Suppose T(s, a, s) = 1. When γ = 1, a is a maximally POWER-seeking action at state s.

Proof. By staying put, the agent retains its current POWER(s, 1). By lemma 4.8, POWER(s, 1) ≥ max_{a′} E_{s′∼T(s,a′)}[POWER(s′, 1)], and so no other action a′ is strictly POWER-seeking compared to a at state s.
