
Optimal sequential decision-making under uncertainty

Martin Peron

Queensland University of Technology

Science and Engineering Faculty

School of Mathematical Sciences

2018

Submitted in fulfilment of the requirements of the degree of Doctor of Philosophy


Abstract

This thesis is motivated by the management of the invasive tiger mosquito Aedes albopictus across the Torres Strait Islands, an archipelago at the doorstep of the Australian mainland. Mosquitoes can randomly colonise new islands through human-mediated pathways. At the same time, decision-makers can implement management actions at regular time intervals to eradicate mosquitoes on a few selected islands under a budgetary constraint. Hence, it is necessary to prioritise management when making sequential decisions. Inspired by the management of Aedes albopictus, this thesis develops novel mathematical models and tailored solution methods to make optimal sequential decisions under uncertainty.

A Markov decision process (MDP) is a framework that is well suited to model this type of sequential decision problem. A popular algorithm called stochastic dynamic programming (sometimes simply referred to as dynamic programming) can then be used to optimise these decisions. However, MDPs applied to this type of spatial problem inevitably fall prey to the 'curse of dimensionality': they are computationally very demanding, or intractable, for all but small problems. In Chapters 3 and 4 we propose two novel methods to scale MDPs to larger problems.

Another key aspect of our case study is the uncertainty in the system dynamics of the mosquitoes, e.g. the probability of colonising a new island from a given island. This uncertainty, together with the management problem itself, is referred to as adaptive management in environmental sciences. An adaptive management problem can be modelled by a partially observable Markov decision process (POMDP), but at a much higher computational cost than MDPs. In Chapters 5 and 6 we propose two approaches to solve adaptive management problems faster, and on larger problems. We now describe our contributions in more detail.

In Chapter 3, we develop a new approach to assist decision-makers when actions are simultaneous and of different durations. This approach modifies time constraints to reduce the model size by several orders of magnitude to obtain bounds on the unknown exact performance, for problems too large for dynamic programming to compute the exact solution. Applied to our case study, the bounds provide a narrow range guaranteed to contain the performance of the exact optimal policy. This research impacts meta-populations and network management problems in biosecurity, health and ecology when the budget allows the implementation of simultaneous actions.

In Chapter 4, we propose two new approximate dynamic programming algorithms adapted to Susceptible-Infected-Susceptible networks. We show that these two algorithms have a lower computational complexity than the standard version of dynamic programming. These approaches are tractable on the management of Aedes albopictus (17 islands), as opposed to the standard version of dynamic programming, and rival its performance on simpler problems (10 islands). This work can be re-used on Susceptible-Infected-Susceptible networks or graph MDPs in various fields, e.g. to deal with individuals or locations in a network, or products in an inventory problem.

In Chapter 5, we propose a method to improve the initialisation of POMDP solvers that are used when solving adaptive management problems. We show that our approach, which consists of solving a number of Markov decision processes, generates a lower bound on the optimal value function that is optimal in the corners of the belief space. This simple and inexpensive initial lower bound can be used as an initialisation to POMDP solvers. Tested on two state-of-the-art POMDP solvers, our approach shows significant computational gains in our case study and on a previously published data challenge. This research is relevant for managing systems where the system response is partially unknown, in fields as varied as natural resource management, medical science, or machine or infrastructure maintenance.

In Chapter 6, we introduce a novel optimal control approach to address adaptive management problems, starting with a stylised continuous-time problem. The variable representing our knowledge of the unknown parameter is shown to follow a differential equation. All states are replaced by their expected values, which leads to a deterministic model that is solved with an optimal control algorithm. This algorithm rivals dynamic programming on small problems and remains tractable on larger problems, in contrast to dynamic programming. It achieves the right balance between aggressive and smoothly varying controls. This approach can be beneficial for continuous-time real-world problems, such as stock portfolio optimization or flight trajectory planning, or for problems where the state is multidimensional or near-continuous.

Together, these four papers make an original and substantial contribution to knowledge from both theoretical and applied standpoints. We provide insights on the mathematical properties of sequential decision problems and propose novel optimisation techniques that circumvent the curse of dimensionality and allow solving larger instances. Finally, we apply these techniques on a novel computational sustainability case study, leading to valuable management recommendations to decision makers.


Acknowledgements

My PhD journey would have been quite different without the support and help that I received, both at QUT and CSIRO. All of my supervisors have found a perfect balance by providing outstanding guidance whilst leaving me room to go my own way. In particular, I would like to express my sincere gratitude to:

My external supervisor in CSIRO, Iadine Chades, for introducing me to this challenging problem during my Master's internship. You have provided me with constant and invaluable guidance on many levels, as well as support and encouragement, throughout my PhD.

My former Principal supervisor, Kai Helge Becker, who supported me and helped me grow as a mathematician and as a scientist, in particular by greatly improving my writing skills.

My Principal supervisor, Kate Helmstedt, for accompanying me through the end of my PhD journey, in particular through her fruitful comments on my thesis and on Chapter 4.

My external supervisor Peter Bartlett, for very insightful and stimulating ideas, in particular on approximate approaches on SIS MDPs.

My associate supervisor Kerrie Mengersen for her very insightful comments on my thesis.

Those who contributed to my main PhD chapters as co-authors: Cassie Jansen, Nancy Schellhorn, Chrystal Mantyka-Pringle, Sam Nicol, Christopher Baker, Barry Hughes.

The many scientists who have helped me develop ideas, discussed my research and/or proof-read my documents. These include members of the Conservation Decisions Team (Iadine Chades, Sam Nicol, Yann Dujardin, Jean-Baptiste Pichancourt, Josie Carwardine, Rocio Ponce Reyes, Cameron Fletcher, Moreno Di Marco, John Dwyer, Pirashanth Ratnamogan), but also Marco Kienzle, Regis Sabbadin, Nathalie Peyrard.

The researchers with whom I have had countless interesting lunch discussions in CSIRO, which were a sometimes much-needed escape from my PhD topic.

Finally, I would like to thank my family for their love, support and encouragement; without them I wouldn't be the person that I am today.

My PhD was supported by an Industry Doctoral Training Centre scholarship and a CSIRO top-up scholarship, which gave me financial support for attending conferences, courses and workshops in Australia and overseas.


Statement of original authorship

The work contained in this joint thesis undertaken between QUT and CSIRO has not been previously submitted to meet the requirements for an award at these or any other higher education institution. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person except where due reference is made.

Lodgement of the Thesis for Examination

Signature: QUT Verified Signature

Date: 22/03/2018

Submission of the Final Thesis

Signature: QUT Verified Signature

Date: 31/07/2018



Contents

Abstract

Acknowledgements

Statement of original authorship

1 Introduction
1.1 Motivating problem: The tiger mosquito Aedes albopictus
1.1.1 Aedes albopictus
1.1.2 Situation in Australia
1.1.3 A general problem
1.2 Aim
1.3 Research problems
1.4 Research gaps
1.5 Research questions
1.6 Outcomes of the study
1.7 Outline of the thesis

2 Literature Review
2.1 Sequential decision problems on large networks
2.1.1 Background
2.1.2 The problem of simultaneous asynchronous actions of different durations
2.1.3 Solving large SIS-based MDPs
2.2 Accounting for structural uncertainty with adaptive management
2.2.1 General background: a brief history of addressing structural uncertainty
2.2.2 Technical background: solving structural uncertainty with Markovian processes
2.2.3 The limitation of current MOMDP solvers
2.2.4 A change of paradigm
2.3 Summary

3 Selecting simultaneous actions of different durations to optimally manage an ecological network

4 Two approximate dynamic programming algorithms for managing complete SIS networks

5 Fast-tracking Stationary MOMDPs for Adaptive Management Problems

6 Continuous-time dual control

7 Conclusions
7.1 Summary
7.2 Management recommendations against the invasive mosquito Aedes albopictus
7.3 Future work
7.3.1 Current limitations
7.3.2 Restrictive assumptions
7.3.3 Future avenues of research

Bibliography

Appendix A Optimization methods to solve adaptive management problems

Appendix B Appendix to Chapter 3

Appendix C Proof of Theorems in Chapter 5
C.1 Proof of Theorem 1
C.2 Proof of Theorem 2


Chapter 1

Introduction

In this introduction we describe the motivation, aim, research problems, research gaps, research questions and outcomes of this thesis. Figure 1.1 summarises how the chapters of this thesis relate to each other and to these different concepts.

1.1 Motivating problem: The tiger mosquito Aedes albopictus

Invasive species, i.e. 'species, subspecies or lower taxon, introduced outside its natural past or present distribution [...] and whose introduction and/or spread threaten biological diversity' (Convention on Biological Diversity, 2002), can have a profound impact on ecosystems. They disrupt the structure of ecosystems, may compete with indigenous species (sometimes leading to their extinction; Gherardi, 2007), and may threaten human health by transmitting diseases or infections, causing wounds and deaths (Mazza et al., 2014). The damage from invasive species has been estimated at US$1.4 trillion per year (Pimentel et al., 2001), not including the extinction of native species.

1.1.1 Aedes albopictus

This research project focuses on the Asian tiger mosquito, Aedes albopictus (first described by Skuse, 1894), an aggressive, daytime-biting insect that is among the most dangerous invasive species in the world. It is a known vector of several pathogens such as the dengue and chikungunya viruses. These life-threatening viruses cause a variety of symptoms, such as high fever, headache, skin rash and muscle and joint pains. Native to south-east Asia, Aedes albopictus has dispersed rapidly and extensively across the world, colonising every continent except Antarctica over the last 30-40 years (Bonizzoni et al., 2013). Found mainly in temperate and tropical regions (Paupy et al., 2009), Aedes albopictus continues to invade new, cooler areas, such as Northern Europe (Bonizzoni et al., 2013). It moves primarily through human-related transport such as used tyres, machinery and other containers (Ritchie et al., 2006).

Figure 1.1: Diagram summarising how the chapters relate to the different research problems.

1.1.2 Situation in Australia

Although the Australian mainland is currently not infested, concerns about its biosecurity can be raised for two reasons. First, Aedes albopictus has already been intercepted several times in Australian ports since 1988 (Ritchie et al., 2006), but active surveillance and appropriate quarantine measures mean the mosquito has never managed to establish permanently. Second, Aedes albopictus was detected in 2005 in the Torres Strait Islands (Fig. 1.2; Ritchie et al., 2006), an Australian archipelago. Many of these islands are still infested today, in particular some islands adjacent to mainland Australia, such as Thursday and Horn islands (Beebe et al., 2013). These islands constitute potential sources for the introduction of Aedes albopictus into mainland Australia through numerous human-mediated pathways between the islands and towards north-east Australia (Fig. 1.3). A conservative estimate for the total willingness of Australian households to pay for reducing the probability of invasion from 50% to 5% is A$349 (US$284) (Mwebaze et al., 2017). The authors conclude that the benefits of reducing the probability of an incursion outweigh the costs.

Consequently, the Australian Health authorities funded the Aedes albopictus Eradication Program (AAEP) (Hill et al., 2008), whose aims are to eradicate the insect from the Torres Strait Islands and to protect mainland Australia from infestation (Beebe et al., 2013). The management includes the treatment of containers and mosquitoes with diverse insecticides to reduce larval habitat, and the monitoring of uninfested islands (Hill et al., 2008). The ideal outcome would be to eradicate Aedes albopictus completely on the Torres Strait Islands. However, the tight budget does not allow the decision makers to manage all islands simultaneously. Decision makers must prioritise which islands to manage in order to protect the Australian mainland optimally.

Figure 1.2: Location of the Torres Strait (circled), between Papua New Guinea and mainland Australia.

Figure 1.3: Map of the Torres Strait showing Papua New Guinea, the 17 populated Torres Strait Islands and mainland Australia as red squares. Blue lines illustrate possible invasion pathways of Aedes albopictus between nodes via human-mediated transport including local boats, airplanes or ferries. Pathways with a small transmission probability are not shown for clarity.

Two aspects of this problem are noteworthy. First, the decisions to be implemented are sequential, i.e. the management effort can be shifted towards different islands at regular time steps. Second, the dispersal of the mosquitoes is uncertain. Not only do they spread randomly (stochastically), but their population dynamics is also not perfectly known. It is not possible to know for sure which islands will be infested or not at a given point in time.

Although we will focus on managing Aedes albopictus to some extent, it is important to note that the problem of making sequential decisions under high levels of uncertainty is not limited to this particular application.

1.1.3 A general problem

As we will see in this thesis, these two essential aspects of the problem arise in many different fields of research and applications. The management of dynamic ecological systems, such as weed control, disease management and fire regime management to name a few, is often constrained by limited resources, leading managers to use mathematical methods to make cost-effective decisions (Duke et al., 2013). Besides, imperfect ecological models mean it is uncertain how the system will evolve. Finally, such systems often require implementing regular management actions over time, which in turn involve sequential decisions, much like our motivating problem described above. The underlying mathematical structure of our motivating problem is very similar, if not identical, to that of many other environmental problems.

This is also true for non-environmental problems. As we will see throughout this thesis, many challenging problems involving sequential decisions under uncertainty can be found in fields as diverse as advertising, flight control, finance, machine maintenance or military applications.

Throughout this thesis, we will be using the mosquito problem both as an inspiration and as a case study. However, the potential for further application of the work presented here is much broader than this problem itself.

1.2 Aim

The aim of this thesis is to use mathematical analysis and computational techniques to optimise sequential decisions under uncertainty, inspired by the management of invasive Aedes albopictus.

1.3 Research problems

As we will see in the Literature Review, a Markov decision process (MDP) is an appropriate framework to solve sequential decision problems with known system dynamics (or population dynamics). A partially observable Markov decision process (POMDP) is suitable for problems with uncertain system dynamics. Using these two frameworks comes with caveats, which led to the identification of three research problems:

1. Up to two or three islands can be managed simultaneously on the 17 Torres Strait Islands. Besides, we will see that the different actions available are not of the same duration. Hence, we may need to make decisions while actions are in progress, which is difficult to account for when dealing with Markov decision processes.

2. Even when disregarding the issue of simultaneous actions, each of the 17 Torres Strait Islands can be either infested or not (this type of network is called Susceptible-Infected-Susceptible or SIS). Hence, there are 2^17 possible states in the system. Using a Markov decision process requires computing all the probabilities of transitioning from any of the 2^17 states to any other state. This number is prohibitively high from a computational perspective.

3. Using a Markov decision process requires knowing the system dynamics, i.e. the action- and state-dependent probabilities that drive the system. Despite all the effort devoted to evaluating these probabilities, they are often imprecise. POMDPs can be harnessed to address such problems, accounting both for the management problem to be solved and the uncertainty on the system dynamics, but will prove very slow in our case study and intractable for all but small problems.

1.4 Research gaps

Our literature review led us to identify four research gaps:

1. When dealing with simultaneous actions of different durations, a brute-force application of dynamic programming can only address a few islands at a time. Approaches for dealing with more islands are lacking, and only up to 5-10 nodes are tractable (Chapter 3).

2. Some methods exist to solve large MDPs but few are tailored to SIS networks, limiting the number of nodes to 10-15 (Chapter 3).

3. Very little attention has been devoted to improving POMDP solvers for adaptive management problems, and these solvers can be very slow (Chapter 5).


4. To date, adaptive management has always been modelled in discrete time, which comes with two limitations. First, there are many real-world problems that require continuous attention, and discretising the time frame involves non-trivial trade-offs between accuracy and computational burden. Second, this often leads to using dynamic programming, which is not well suited to multidimensional or unbounded states. In contrast, solution techniques from continuous-time optimal control may help deal with larger state spaces because they have very different strengths and weaknesses from dynamic programming. There has been little research to bridge the gap between optimal control and adaptive management.

1.5 Research questions

The following research questions arise, in order, from each of the four research gaps above:

1. Can an approximate approach successfully deal with simultaneous actions of different durations?

2. Can an approximate approach successfully deal with large SIS networks?

3. Can the computation time of adaptive management solvers be reduced?

4. Can optimal control be used to handle adaptive management problems?

1.6 Outcomes of the study

This is a thesis by publication. The research conducted during this PhD has led to four articles (two published, one accepted and one submitted) which address, in order, the four questions listed above:

1. Peron, M., Jansen, C. C., Mantyka-Pringle, C., Nicol, S., Schellhorn, N. A., Becker, K. H., and Chades, I. (2017b). Selecting simultaneous actions of different durations to optimally manage an ecological network. Methods in Ecology and Evolution, 8(10):1332–1341 (Chapter 3);

2. Peron, M., Bartlett, P. L., Becker, K. H., Helmstedt, K. J., and Chades, I. (2018). Two approximate dynamic programming algorithms for managing complete SIS networks (Chapter 4); In Press.


3. Peron, M., Becker, K. H., Bartlett, P., and Chades, I. (2017a). Fast-Tracking Stationary MOMDPs for Adaptive Management Problems. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), pages 4531–4537 (Chapter 5);

4. Peron, M., Baker, C. M., Hughes, B. D., Chades, I. Continuous-time dual control (Chapter 6). Submitted to Optimal Control Applications and Methods.

We also include in Appendix A the following co-authored paper:

Chades, I., Nicol, S., Rout, T. M., Peron, M., Dujardin, Y., Pichancourt, J.-B., Hastings, A., and Hauser, C. E. (2017). Optimization methods to solve adaptive management problems. Theoretical Ecology, 1(1):1–20.

1.7 Outline of the thesis

This thesis contains an introduction (Chapter 1), a literature review (Chapter 2), core chapters (Chapters 3–6) and a conclusion chapter (Chapter 7).

In our literature review (Chapter 2), we will present how Markov decision processes and some of their extensions can be used to deal with the different types of uncertainties on sequential decision problems, with a particular emphasis on structural uncertainty, i.e. uncertainty on the system dynamics. We outline the limitations of the current state-of-the-art approaches and link to the chapters addressing them.

In Chapter 3, we develop a new approach to assist decision-makers when actions are simultaneous and of different durations (Objective 1). This approach modifies time constraints to reduce the model size by several orders of magnitude to obtain bounds on the unknown exact performance, for problems too large for dynamic programming to compute the exact solution. Applied to our case study, the bounds provide a narrow range guaranteed to contain the performance of the exact optimal policy. This research impacts metapopulations and network management problems in biosecurity, health and ecology when the budget allows the implementation of simultaneous actions.

In Chapter 4, we propose two new approximate dynamic programming algorithms adapted to Susceptible-Infected-Susceptible networks (Objective 2). We show that these two algorithms have a lower computational complexity than the standard version of dynamic programming. These approaches are tractable on the management of Aedes albopictus (17 islands), as opposed to standard dynamic programming, and rival its performance on simpler problems (10 islands). This work can be re-used on Susceptible-Infected-Susceptible networks or graph MDPs in various fields, to deal with individuals or locations in a network or products in an inventory problem.

In Chapter 5, we propose a method to improve the initialisation of POMDP solvers that are used when solving adaptive management problems (Objective 3). We show that our approach, which consists of solving a number of Markov decision processes, generates a lower bound that is optimal in the corners of the belief space. With an additional assumption about the optimal policy, we demonstrate that this lower bound is also a linear approximation to the value function. Tested on two state-of-the-art POMDP solvers, our approach shows significant computational gains in our case study and on a previously published data challenge. This simple and inexpensive initial lower bound can be used as an initialisation to POMDP solvers. It is relevant for managing systems where the system response is partially unknown, in fields as varied as natural resource management, medical science, or machine, network or infrastructure maintenance.

In Chapter 6, we introduce a novel optimal control approach to address adaptive management problems, starting with a stylised continuous-time problem (Objective 4). The variable representing our knowledge of the unknown parameter is shown to follow a differential equation. All states are replaced by their expected values, which leads to a deterministic model that is solved with an optimal control algorithm. This algorithm rivals dynamic programming on small problems and remains tractable on larger problems, in contrast to dynamic programming. It achieves the right balance between aggressive and smoothly varying controls. This approach can be beneficial for continuous-time real-world problems, such as stock portfolio optimization or flight trajectory planning, and also for problems where the state is multidimensional or near-continuous. This is the only manuscript that is not directly applied to the management of Aedes albopictus.

In Chapter 7, we conclude by briefly summarising our contributions and their significance. We then summarise the implications for managing Aedes albopictus and outline the future directions that would be worth exploring.


Chapter 2

Literature Review

This literature review first introduces Markov decision processes and stochastic dynamic programming, and how spatial decision problems can be framed as MDPs using the Susceptible-Infected-Susceptible framework (Section 2.1.1). We describe in Sections 2.1.2 and 2.1.3 the two issues that we encounter and resolve in Chapters 3 and 4. We then briefly depict in Section 2.2.1 how different fields have addressed structural uncertainty, i.e. an uncertain system dynamics on top of the stochasticity of the system. Finally, we describe how MDPs, partially observable Markov decision processes and their variants can be used to solve such problems (Section 2.2.2), but have limitations which we aim to resolve in Chapters 5 and 6 (Sections 2.2.3 and 2.2.4).

2.1 Sequential decision problems on large networks

In this section we introduce Markov decision processes and show how to solve them. We then show how to adapt them to Susceptible-Infected-Susceptible networks.

2.1.1 Background

Markov decision processes

A Markov decision process (MDP) is a convenient mathematical framework to model the impact of sequential decisions on a probabilistic system (Bellman, 1957). In environmental sciences, it can for example be used to model the random dispersal of a species and how the species responds to our management actions. It is the foundation of more complex frameworks described in further sections. For simplicity in the notation in the rest of this thesis, all variables referring to time step t + 1 will be followed by a prime symbol ('), as opposed to variables of time step t. Four components define an MDP 〈S, A, P, r〉 (Puterman, 1994):

1. The state space, noted S: the system is in exactly one state s ∈ S at every time step. The initial state of the system is denoted by s_0.

2. The action space, noted A: the policy-maker chooses one action a ∈ A to implement at each time step t after observing the current state.

3. The transition probability or transition function P (sometimes called transition matrix), where P(s'|s, a) is the probability of transitioning to the state s' from the state s after implementing action a. The process follows the Markov property in that this probability does not depend on past states and actions. Although we use the notation P as a probability within the MDP framework, we will use the notation Pr for probabilities in the general case.

4. The reward function describing the immediate reward r(s, a) that the policy-maker receives for each state and action.

The state and action spaces are usually finite. A Markov decision process unfolds as follows: it starts at time t = 0 at the initial state s_0. Observing s_0, the decision maker chooses an action a_0 and receives the reward r(s_0, a_0). The state s_1 corresponding to t = 1 is drawn according to the probability P(.|s_0, a_0). The decision maker observes s_1, selects an action a_1 and receives the reward r(s_1, a_1), and so on (Fig. 2.1).

Figure 2.1: Decision diagram of a Markov decision process. Full arrows show relations of dependence, e.g. the value of s_{t+1} (or s') depends on s_t (or s) through the probability P. Dashed arrows illustrate what factors the agent bases their decision upon; here, the best action is based on the current state only (Markov property).
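To make this loop concrete, the short sketch below simulates a tiny, hypothetical two-state, two-action MDP under a fixed stationary policy. The state names, actions, probabilities and rewards are illustrative assumptions only and do not come from the case study.

import random

# Hypothetical two-state, two-action MDP (illustrative values only).
S = ["susceptible", "infested"]                       # state space S
A = ["do_nothing", "treat"]                           # action space A
P = {                                                 # P[(s, a)][s2] = P(s2 | s, a)
    ("susceptible", "do_nothing"): {"susceptible": 0.7, "infested": 0.3},
    ("susceptible", "treat"):      {"susceptible": 0.9, "infested": 0.1},
    ("infested",    "do_nothing"): {"susceptible": 0.1, "infested": 0.9},
    ("infested",    "treat"):      {"susceptible": 0.6, "infested": 0.4},
}
r = {("susceptible", "do_nothing"): 1.0, ("susceptible", "treat"): 0.8,
     ("infested",    "do_nothing"): 0.0, ("infested",    "treat"): -0.2}
policy = {"susceptible": "do_nothing", "infested": "treat"}   # a stationary policy pi: S -> A

def simulate(s0, T, gamma=0.95):
    """Unroll the MDP for T time steps from s0 and return the discounted sum of rewards."""
    s, total = s0, 0.0
    for t in range(T):
        a = policy[s]                                  # observe the state, choose an action
        total += (gamma ** t) * r[(s, a)]              # receive the immediate reward r(s, a)
        nxt = P[(s, a)]
        s = random.choices(list(nxt), weights=list(nxt.values()))[0]   # draw s' ~ P(.|s, a)
    return total

print(simulate("susceptible", T=50))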

Markov decision problems

A Markov decision problem is a Markov decision process together with an optimisation criterion. Note that the term 'Markov decision process' is also used to designate a Markov decision problem. A criterion accounts for present and future rewards and represents the objective decision makers try to achieve. We differentiate between the four most commonly used optimisation criteria (Puterman, 1994):

• The finite criterion is the expected sum of rewards over a finite time horizon (finite number of time steps, from t = 0 to t = T). A discount factor γ, with 0 ≤ γ ≤ 1, may reduce (if γ < 1) the impact of rewards received further in time, similarly to an inflation rate.



• The γ-discounted infinite criterion is the expected discounted sum of rewards over an infinite time horizon, with 0 ≤ γ < 1.

• The total reward criterion is the expected sum of rewards over an infinite time horizon.

• The average criterion is the average expected rewards over an infinite time horizon.

A policy maps every state to an action. A policy π is either stationary, i.e. independent of time (π : S → A), or non-stationary (π_t : S → A, for all t ∈ {1, 2, . . . , T}). Policies in the case of infinite time horizons are always stationary.

The goal of an MDP solver is to find a policy π* that maximises the selected criterion, starting from s_0. We focus on two of these criteria throughout this thesis: the γ-discounted infinite criterion (used in Chapters 3, 4 and 5) and the finite criterion (used in Chapter 6). Table 2.1 describes these two criteria and popular solution techniques solving them. For example, for the γ-discounted infinite criterion, π* satisfies (Sigaud and Buffet, 2010):

\pi^* = \arg\max_{\pi} \mathbb{E}\Big[ \sum_{t=0}^{\infty} \gamma^t r(s_t, \pi(s_t)) \,\Big|\, s_0 \Big] \qquad (2.1)


Optimisation criterion | Value function V^\pi(s) | Algorithms
Finite | \mathbb{E}\big[ \sum_{t=0}^{T} \gamma^t r(s_t, \pi_t(s_t)) \,\big|\, s_0 \big] | Backwards induction
γ-discounted infinite (γ < 1) | \mathbb{E}\big[ \sum_{t=0}^{\infty} \gamma^t r(s_t, \pi(s_t)) \,\big|\, s_0 \big] | Linear programming, value iteration, policy iteration

Table 2.1: Summary of the different optimisation criteria that define optimisation objectives, a value function to maximise and a set of solution methods (algorithms) (Chades et al., 2014).

The expectation is taken with respect to the random variables s_t, drawn according to the transition function P.

Stochastic dynamic programming

In the following, we present the backwards induction algorithm for the finite criterion with a discount factor, and show how to extend it to the discounted infinite criterion. The principle of backwards induction underlies other algorithms like value iteration and policy iteration, thus allowing for solving other criteria with only little adjustment (Table 2.1).

We define the value V_j(s) as the expected performance of the policy π applied in a state s after j time steps:

V_j(s) = \mathbb{E}\Big[ \sum_{t=j}^{T} \gamma^t r(s_t, \pi_t(s_t)) \,\Big|\, s_j = s \Big] \qquad (2.2)

with π_j(s) being the action at time step j in the state s. The following recursive formula holds:

V_j(s) = r(s, \pi_j(s)) + \gamma \sum_{s' \in S} P(s'|s, \pi_j(s)) V_{j+1}(s') \qquad (2.3)

To find the optimal policy and value, we can use Bellman's equation (Bellman, 1957, Puterman, 1994):

V_j(s) = r(s, \pi_j^*(s)) + \gamma \sum_{s' \in S} P(s'|s, \pi_j^*(s)) V_{j+1}(s') = \max_{a \in A} \Big[ r(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) V_{j+1}(s') \Big] \qquad (2.4)


The optimal policy satisfies:

\pi_j^*(s) = \arg\max_{a \in A} \Big[ r(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) V_{j+1}(s') \Big] \qquad (2.5)

The Bellman update operator H, defined on the set of value functions, is a convenient notation:

(HV)(s) = \max_{a \in A} \Big[ r(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) V(s') \Big] \qquad (2.6)

It follows that V_j = H V_{j+1}. To maximise the finite criterion, the operator H is applied T times to obtain the successive values V_T, V_{T-1}, down to V_0. This process, sometimes referred to as backwards induction, returns the optimal value V_j^*(s) and also the optimal policy π_j^*(s) for any time step j and state s (Puterman, 1994).

When optimising with regard to the infinite discounted criterion (γ < 1), we can also apply H repeatedly to any initial value function (value iteration). H is a γ-contraction, i.e. \|HV_1 - HV_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty for any value functions V_1, V_2, which implies H^k V \to V^* as k \to \infty, with V^* = HV^*. In practice, this process is stopped when H^k V is close enough to V^*. The following result allows obtaining an ε-optimal policy, for any desired convergence threshold ε:

\|H^{k+1}V - H^k V\|_\infty \le \frac{\varepsilon(1-\gamma)}{\gamma} \;\Rightarrow\; \|H^{k+1}V - V^*\|_\infty \le \varepsilon \qquad (2.7)

In the case of an infinite horizon, an algorithm similar to value iteration is policy iteration. The peculiarity of policy iteration is that it puts the emphasis on the policy instead of the value: starting from any initial policy, the current policy is evaluated and improved until it is optimal. Whether it is better than value iteration is still an open question (Littman, 1996). Backwards induction, value iteration and policy iteration are also referred to as stochastic dynamic programming (SDP). Papadimitriou and Tsitsiklis (1987) have shown that finite-horizon and discounted infinite-horizon MDPs are P-complete, i.e. they can be considered efficiently solvable. For this reason MDPs are widely used across many disciplines.
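As a minimal sketch (not the thesis implementation), the following value iteration routine applies the Bellman operator H of Eq. 2.6 until the stopping rule of Eq. 2.7 is met, using the same dictionary-based MDP representation as the earlier simulation example; the layout of S, A, P and r is an assumption carried over from that sketch.

def value_iteration(S, A, P, r, gamma=0.95, eps=1e-4):
    """Repeatedly apply the Bellman operator H (Eq. 2.6) until the eps-optimality test of Eq. 2.7 holds."""
    V = {s: 0.0 for s in S}                             # any initial value function
    while True:
        # (HV)(s) = max_a [ r(s, a) + gamma * sum_{s'} P(s'|s, a) V(s') ]
        HV = {s: max(r[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
                     for a in A)
              for s in S}
        delta = max(abs(HV[s] - V[s]) for s in S)
        V = HV
        if delta <= eps * (1.0 - gamma) / gamma:        # stopping rule of Eq. 2.7
            break
    # Greedy policy extracted from V, as in Eq. 2.5.
    policy = {s: max(A, key=lambda a: r[(s, a)] +
                     gamma * sum(p * V[s2] for s2, p in P[(s, a)].items()))
              for s in S}
    return V, policy

The returned policy is then ε-optimal for the γ-discounted infinite criterion.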

In the conservation literature, the use of MDPs has increased in the last decades, with applications in biological invasions (Firn et al., 2008, Regan et al., 2006), disease management (Chades et al., 2011), release of biocontrol agents (Shea and Possingham, 2000), harvesting (Williams, 1996), migratory species (Nicol et al., 2013), recovery of interacting species (Chades et al., 2012b) and fire regimes (McCarthy et al., 2001, Possingham, 1997, Richards et al., 1999). We now describe Susceptible-Infected-Susceptible networks and the steps necessary to apply MDPs to them.


Susceptible-Infected-Susceptible (SIS) networks

SIS networks are used to model spatial systems where a species can spread over a network (Chades et al., 2011). Each node in the network can be either infested or susceptible (i.e. at risk of being infested). The species can infest new nodes by spreading. Infested nodes can be cured and reinfested. SIS networks have also been used to model infectious diseases infecting individuals (Sahneh et al., 2012) or viruses infecting computers (Pastor-Satorras and Vespignani, 2001).

Numbering each node from 1 to N, we denote by s_i the status of node number i: s_i = 1 if node i is infested, 0 otherwise. A colonisation probability matrix (p_{ij}) describes the probability for any susceptible node i to be colonised from any infested node j. The probability for node i to remain 'susceptible' is then given by:

Pr(s'_i = 0 \mid s_i = 0) = \prod_{j=1}^{N} (1 - s_j p_{ij}) \qquad (2.8)

So, the probability of transitioning from 'susceptible' to 'infested' is 1 - \prod_{j=1}^{N} (1 - s_j p_{ij}). We are interested in decision problems, where actions on nodes can be implemented. The effectiveness of the sub-action implemented on node i at a given time is denoted a_i. It is defined as the probability of eradicating the disease (or curing the virus) over one timestep:

Pr(s'_i = 1 \mid s_i = 1) = 1 - a_i \qquad (2.9)

This transition is independent of the states of the other nodes, and thus neglects the possibility of eradication and reinfestation within a single time step. In this thesis we address two common management objectives for SIS models. In the eradication objective, the goal is to maximise the number of susceptible nodes (Chades et al., 2011), so the reward is defined as

r(s, a) = |\{ i \mid s_i = 0, \; 1 \le i \le N \}| \qquad (2.10)

In the containment objective, the goal is to stop the spread of the species (Sahneh et al., 2012), so, if node i is the node to be protected, the reward can be along the lines of:

r(s, a) = \begin{cases} 0 & \text{if node } i \text{ is infested} \\ 1 & \text{otherwise} \end{cases} \qquad (2.11)

MDPs can be adapted to SIS networks as follows. An MDP state s describes the situation on all nodes and is of the form s = (s_1, s_2, . . . , s_N). The transition function is calculated as follows: P(s'|s, a) = \prod_{i=1}^{N} Pr(s'_i | s_i, a). Then, stochastic dynamic programming can be applied to SIS-based MDPs to find the optimal policy.
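As an illustrative sketch only, the code below assembles these per-node transitions (Eqs. 2.8 and 2.9) into the full transition probability P(s'|s, a) for a hypothetical three-node SIS network; the colonisation matrix and action effectiveness values are made up for the example and are not the case-study estimates.

from itertools import product

N = 3
# Hypothetical colonisation matrix: p[i][j] is the probability that infested node j colonises susceptible node i.
p = [[0.0, 0.2, 0.1],
     [0.2, 0.0, 0.3],
     [0.1, 0.3, 0.0]]

def node_probs(i, s, a):
    """Per-node transition probabilities: Eq. 2.8 if node i is susceptible, Eq. 2.9 if it is infested.
    s is a tuple of 0/1 node states; a[i] is the eradication probability applied to node i."""
    if s[i] == 0:
        stay = 1.0
        for j in range(N):
            stay *= 1.0 - s[j] * p[i][j]          # Eq. 2.8
        return {0: stay, 1: 1.0 - stay}
    return {1: 1.0 - a[i], 0: a[i]}               # Eq. 2.9

def transition(s_next, s, a):
    """P(s'|s, a) as the product of independent per-node transitions."""
    prob = 1.0
    for i in range(N):
        prob *= node_probs(i, s, a)[s_next[i]]
    return prob

s = (1, 0, 1)                                      # nodes 1 and 3 infested (1-based numbering as in the text)
a = (0.8, 0.0, 0.5)                                # manage nodes 1 and 3 with different effectiveness
print(transition((0, 0, 0), s, a))                 # probability of full eradication in one step
print(sum(transition(s2, s, a) for s2 in product((0, 1), repeat=N)))   # sanity check: sums to 1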

2.1.2 The problem of simultaneous asynchronous actions of different durations

To achieve a management objective faster, several actions can be implemented simultaneously. In particular, for spatial problems, simultaneous actions in different locations must be optimised. To date, little research has focused on simultaneous actions (Boutilier and Brafman, 1997). In artificial intelligence, simultaneous actions have become important when several decision problems merge (Singh and Cohn, 1998), or when actions have random durations (Rohanimanesh and Mahadevan, 2002). Accommodating simultaneous actions of different durations is challenging because they terminate at different timesteps (Barto and Mahadevan, 2003) and, thus, computing an exact SDP solution to find an optimal policy requires large amounts of memory and computation time. A workaround is to use approximate methods. Approximate algorithms focus on maximising the value of policies but, in practice, policies that cannot be explained in ecological terms will not be applied by managers (Walters, 1986). Identifying sensible rules of thumb, i.e. simplified versions of more complex policies, is often preferred (Chades et al., 2008, Grechi et al., 2014). However, this simplification causes a loss of value that is often unknown to managers (Pichancourt et al., 2012).

In Chapter 3 we propose a new approach to assist decision-makers when actions are simultaneous and of different durations. This approach modifies time constraints (by synchronising actions and their durations) to reduce the model size by several orders of magnitude and obtain upper and lower bounds on the unknown optimal performance, for problems too large for dynamic programming to compute the exact solution. Applied to the management of Aedes albopictus in the Torres Strait Islands, our case study, the bounds provide a narrow range guaranteed to contain the performance of the exact optimal policy. This research could be applied in a range of disciplines, including forestry (Forsell et al., 2011) and invasive or threatened species management (Mantyka-Pringle et al., 2016, Monterrubio et al., 2015), but also in non-spatial ecological problems (Pelizza et al., 2010, Phelan et al., 1996). Applications in other fields of research also exist, e.g. maintaining and structuring planktonic ecosystems (Poulin and Franks, 2010), optimising a multi-agent agricultural model (Piorr et al., 2009) or modelling team cooperation in animal societies (Anderson and Franks, 2001).

2.1.3 Solving large SIS-based MDPs

Although the model size has been reduced significantly, our case study is still intractable. As a reminder, there are N = 17 islands, each of which can be either infested or susceptible. Our goal is to select two or three of the infested islands to be managed, at every time step, in order to maximise a certain criterion (here, contain the mosquitoes as long as possible). The number of states in this reduced model, i.e. when disregarding the different durations of actions, is |S| = 2^N. This number is computationally prohibitive when N grows: it is intractable for more than 13 islands (Chapter 3). Approximate approaches are required if we are to solve the 17-island problem. The same issue occurs in many real-world applications, where the system under study has too many features or locations (Hoey et al., 1999, Sahneh et al., 2012, Chades et al., 2011).

Our case study has two noteworthy properties. First, the network is 'complete': every node can cause every other node to switch state, i.e. from susceptible to infected or vice-versa. This complicates the application of local optimisation approaches. Second, the probability for each node to switch state is small. This implies that the next state will likely be the same as the current state.

In the last decades, the application of approximate dynamic programming (ADP) approaches has allowed researchers to solve large MDPs. These approaches can be classified into three groups (Powell, 2007), all of which are relevant to our case study. First, simulation-optimisation approaches consist of using simulations to optimise the policy, which is then assumed to depend on a set of parameters. These approaches aim at finding the best set of parameters, but do not try to predict how actions might impact the future (myopic policies; Spall, 2005). Note that simulation-optimisation approaches are not dynamic programming per se, although they are presented as ADP because they can address large MDPs (Powell, 2007). These are inspiring for our case study, because nodes do not switch states frequently, so predicting the future might not be of prime importance. Second, rolling horizon procedures use a prediction of the near future to save on potentially costly long-term predictions. Typical approaches include model predictive control and Monte Carlo tree search (Ho et al., 2015). Third, dynamic programming approaches explicitly estimate the values of states to derive optimal actions. These include approximate linear programming and mean field approximate policy iteration (Ho et al., 2015, Forsell et al., 2011, Forsell and Sabbadin, 2006), both of which imply local optimisation and are unlikely to work because our actions are global. Hoey et al. (1999) introduced SPUDD, an MDP solver that uses algebraic decision diagrams (similar to decision trees) to represent policies and value functions, and is probably a more appropriate dynamic programming approach. However, these algorithms are not particularly adapted to highly connected networks like the one in our case study. Also, they do not exploit the small switching probabilities for each island.

In Chapter 4, we propose two new approximate dynamic programming algorithms adapted to Susceptible-Infected-Susceptible networks. We show that these two algorithms have a lower computational complexity than the standard version of dynamic programming (Eqs. 2.6 and 2.7). These approaches are tractable on the management of Aedes albopictus (17 islands), as opposed to dynamic programming (and to SPUDD on most instances). They are also near-optimal on some of the largest problems for which we can compute the exact solution (10 islands). This work could be applied in multiple fields. There are many environmental spatial problems requiring effective MDP solvers on networks (Forsell et al., 2011, Nicol et al., 2013, Firn et al., 2008). Other interconnected systems would benefit from this work, e.g. when a system administrator tries to keep as many machines as possible running in a network (Poupart, 2005) or when maximising the reliability of information in a military sensor network (Gillies et al., 2009). Even though our two approximate approaches are not guaranteed to be optimal, the resulting policies can still be used as an initial policy or a basis of comparison by other algorithms.

This concludes the section on solving standard sequential decision problems on SIS networks. Another key aspect in managing Aedes albopictus is the uncertainty about the system dynamics, to which the standard version of dynamic programming is not adapted.

2.2 Accounting for structural uncertainty with adaptive management

When designing optimisation models to solve environmental decision problems, decision makers and modellers often assume a perfect knowledge of the system dynamics. When the system dynamics is not perfectly known, it is often much harder to decide which action will best improve the system state. This imperfect knowledge is called structural uncertainty, which can be informally described as an uncertainty about 'where we are heading'. In an MDP, this is equivalent to an uncertain transition function.


Note that this could include uncertain rewards as well, but this thesis focuses solely on uncertain transition functions. In Section 2.2.1, we will give a brief account of how researchers from different disciplines have approached structural uncertainty. In Section 2.2.2, we will provide more technical details about the relevant solution techniques, and describe some limitations of the current work that we will attempt to overcome in Chapters 5 and 6.

2.2.1 General background: a brief history of addressing structural uncertainty

Three fields of literature have looked at this problem from different perspectives. It is called 'dual control' in control theory, 'Bayesian reinforcement learning' in machine learning and 'adaptive management' in environmental sciences. In the following we briefly describe the progress these three fields have made in the last decades.

In control theory: dual control

In the early 1950s, control theory researchers aimed at designing autopilots for aircraft in a field called adaptive control, with limited tools such as linear feedback controllers (Astrom and Wittenmark, 2008). Bellman (1961, 1957) then introduced dynamic programming, which helped deal with more complex stochastic processes.

Then, researchers started to look at problems with uncertain and 'learnable' environments, the first example of which was the two-armed bandit problem (Bellman, 1961, Yakowitz, 1969). Only two actions can be implemented (one for each arm), and each arm yields a stochastic return of uncertain expected reward. The objective is to maximise the total reward after a fixed number of plays, so the player wants to know which arm has the highest expected reward. At each time step, the player has to trade off between greedy, rewarding decisions (play the arm that has produced the highest average reward so far) and exploration (try the other arm to refine our estimation of its average reward). The bandit problem captures this 'exploration-exploitation trade-off' well, which we will find again in these three fields of research.
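As a toy illustration only (not from the thesis), the sketch below plays a hypothetical two-armed bandit with an ε-greedy rule: with probability ε the player explores an arm at random, otherwise it exploits the arm with the best empirical mean. The true arm means are made-up values that the player does not see.

import random

TRUE_MEANS = [0.4, 0.6]        # hypothetical true expected rewards, unknown to the player

def pull(arm):
    """Stochastic Bernoulli return of the chosen arm."""
    return 1.0 if random.random() < TRUE_MEANS[arm] else 0.0

def epsilon_greedy(T=10000, eps=0.1):
    """Average reward of an epsilon-greedy player over T plays."""
    counts, sums, total = [0, 0], [0.0, 0.0], 0.0
    for _ in range(T):
        if random.random() < eps or 0 in counts:
            arm = random.randrange(2)                                  # explore
        else:
            arm = max((0, 1), key=lambda k: sums[k] / counts[k])       # exploit
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total / T

print(epsilon_greedy())   # roughly (1 - eps) * 0.6 + eps * 0.5 once the better arm is identified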

Since many real-world control problems are stochastic, decision makers are often uncertain about how their actions affect the system (Astrom and Wittenmark, 2008). Structural uncertainty is usually modelled by uncertain parameters governing the system over time, and we can learn about these parameters, just like the average reward of each arm. However, learning these parameters is not the objective; rather, the objective depends only on the state and control. Thus there exists a trade-off between better understanding the system and guiding it towards a better state: probing controls (i.e. improving knowledge) should only be chosen over more cautious, rewarding controls if the long-term benefits of learning outweigh the short-term loss of performance. This trade-off is called dual control (Astrom and Wittenmark, 2008) and is equivalent to the exploration-exploitation trade-off.

Solving this trade-off, i.e. achieving the balance between probing and caution, is not an easy task. Although typical control applications are in continuous time, e.g. flight trajectories, researchers have mostly focussed on solving discrete-time problems (Astrom and Wittenmark, 2008, Wittenmark, 2002), as they are deemed easier to solve. There exist continuous-time exceptions but with no attempt at finding the optimal trade-off (Naik et al., 1992): the control follows the certainty equivalence principle, where the control is chosen as if the current estimate of the uncertain parameter were true (Bertsekas, 1995). In contrast, we will see in Section 2.2.2 that one can find the optimal discrete-time control by modelling the problem as a Markovian process and solving it by using stochastic dynamic programming (Puterman, 1994).

Applications of dual control include route optimisation with uncertain traffic, medical drugs with uncertain efficiency (Astrom and Wittenmark, 2008) and wood chip refining, with an uncertain gap between essential mechanical components (Dumont and Astrom, 1988). Many other applications in industry can be found in Astrom (1983).

In machine learning: Bayesian reinforcement learning

Researchers from other fields have also looked at variations of the same problems. An area called reinforcement learning (RL) explores algorithms to make good sequential decisions in an uncertain environment (Sutton and Barto, 1998). For the purpose of this thesis, we focus on the uncertainty about the transition function, but it can be other features of MDPs that are uncertain, such as the rewards. The reinforcement learning literature falls in two categories: non-Bayesian reinforcement learning and Bayesian reinforcement learning.

Non-Bayesian reinforcement learning consists in ‘manually’ solving theexploration-exploitation trade-off, by selecting actions recommended by aheuristic. At every time step, the action maximising a certain criterion isselected. This criterion is typically the combination of an exploitation score(e.g. the average reward obtained from this action so far) and an exploration



score (e.g. one over the number of times this action has been tried). These algorithms can be model-based, i.e. maintaining an explicit estimate of the transition probabilities, or model-free, which shortcuts this step and focusses on the actions yielding the best value. Examples of model-free RL are Q-learning (Watkins, 1989) and temporal-difference learning (Sutton, 1988). Some of these methods provably converge towards the optimal value and policy (Watkins, 1989), but little emphasis has been placed on the time needed to converge. In contrast, model-based methods are better suited to estimating the convergence time. Given parameters ε and ρ, Kakade (2003) aims at finding, in a minimum amount of time, a probably approximately correct policy, i.e. a policy that is ε-optimal with probability at least ρ. Further, by estimating the loss of performance (regret) accumulated during this learning time, Auer et al. (2009) propose an algorithm minimising the total regret (see also Bartlett and Tewari (2009)). However, the performance of these algorithms is evaluated asymptotically, i.e. with an emphasis on long-term trends. In our environmental context, we cannot solely focus on long-term decisions because time scales are relatively short (a few years or decades). Non-Bayesian reinforcement learning is inspiring, but Bayesian reinforcement learning is closer to what we want to achieve.
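As an illustration of such a heuristic criterion (the exact scores below are our own illustrative choice, not a formula prescribed by the thesis), the snippet combines an empirical mean reward with an exploration bonus; the 'inverse-count' rule mirrors the example in the text, while a UCB1-style bonus is a common alternative.

import math

def heuristic_score(mean_reward, n_tried, n_total, c=1.0, rule="inverse-count"):
    """Exploitation score plus an exploration bonus for a single action."""
    if n_tried == 0:
        return float("inf")                 # untried actions are explored first
    if rule == "inverse-count":
        bonus = c / n_tried                 # exploration score: one over the trial count
    else:                                   # 'ucb1': bonus shrinks as the action is tried more
        bonus = c * math.sqrt(2.0 * math.log(n_total) / n_tried)
    return mean_reward + bonus

# At each time step, the action maximising this score would be selected.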

Bayesian reinforcement learning consists of using a distribution over various unknown quantities (Vlassis et al., 2012). In particular, model-based Bayesian reinforcement learning explicitly estimates a model of the system dynamics, which makes it possible to find the optimal exploration-exploitation trade-off. Silver (1963) introduced a formal model of unknown transition functions, called a multi-matrix Markov process. When one of several transition functions is true, Silver (1963) proposed to maintain a probability distribution over transition functions and to update it via Bayes' theorem. When no prior information is available, multinomial Beta (also referred to as Dirichlet) distributions can be used advantageously to represent the uncertainty on the transition probabilities, because their conjugacy makes them convenient to update. Duff (2002) calls this problem a Bayes-adaptive MDP and incorporates the structural uncertainty in the MDP state, just like in adaptive management and dual control. The author also points out that a Bayes-adaptive MDP can be cast as a partially observable Markov decision process (POMDP), and that the properties of POMDPs (see Section 2.2.2) can thus be exploited to solve Bayes-adaptive MDPs. In theory, dynamic programming can be applied to this augmented MDP to solve the exploration-exploitation trade-off optimally. When the problem is too large, one can use approximation techniques to find near-optimal solutions (Duff, 2002, 2003).
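As a minimal sketch of the Bayesian update underlying this model-based approach (toy dimensions and a uniform prior, chosen purely for illustration), Dirichlet parameters over the next-state distribution are simply incremented with observed transition counts:

import numpy as np

def dirichlet_update(alpha, s, a, s_next):
    """Posterior update of Dirichlet parameters over transition probabilities.

    alpha[s, a] holds the Dirichlet parameters of the next-state distribution
    for the pair (s, a); observing a transition increments the matching count
    (this is the conjugacy property mentioned above).
    """
    alpha = alpha.copy()
    alpha[s, a, s_next] += 1.0
    return alpha

def expected_transition(alpha, s, a):
    """Posterior mean estimate of P(. | s, a)."""
    return alpha[s, a] / alpha[s, a].sum()

# Toy example: 2 states, 1 action, uniform prior.
alpha = np.ones((2, 1, 2))
alpha = dirichlet_update(alpha, s=0, a=0, s_next=1)
print(expected_transition(alpha, 0, 0))   # -> [0.333... 0.666...]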



Likewise, after exploring the structure of the resulting value function and the complexity of the algorithm, Poupart et al. (2006) propose an algorithm and various approximations to circumvent the computational difficulties. Also of note are the works of Kolter and Ng (2009) and Fard and Pineau (2010), who obtain promising results using probably approximately correct algorithms in this Bayesian reinforcement learning context, while Dimitrakakis (2009) evaluates the complexity of stochastic branch-and-bound algorithms in the belief exploration tree.

Bayesian reinforcement learning is a step in the right direction but comes with two caveats. First, solving the decision problem to find the policy falls prey to the curse of dimensionality because the Dirichlet distribution has many parameters: it is intractable for all but very restricted cases (Vlassis et al., 2012, Kolter and Ng, 2009). Second, learning the parameters takes a long time when applying the policy. Both caveats arise because, although the Bayesian reinforcement learning formulations aim at learning about the system dynamics, they assume no prior knowledge about its structure. Both of these problems can be circumvented in an environmental context, because prior scenarios describing the population dynamics of a species usually exist.

In environmental sciences: Adaptive management

Due to the high levels of uncertainty involved when dealing with real-world environmental problems, environmental modellers only have an imperfect model of the system dynamics. This uncertainty can concern the nature of the model (e.g. whether or not a species preys on another one) or a parameter, such as a death or birth rate. Approaches proposed to address structural uncertainty have greatly improved and increased in complexity over time. Applications occur in conservation (Chades et al., 2012a) and natural resource management (Johnson et al., 2002, Frederick and Peterman, 1995).

The deferred action approach consists of observing the system evolution in order to gather sufficient knowledge prior to defining and applying management actions (Walters and Hilborn, 1978). However, it is difficult to estimate how much data is necessary to make satisfactory decisions. Moreover, once management starts, the system will likely behave differently from the undisturbed system observed during the passive observation phase, making these observations incomplete (Walters and Hilborn, 1978). A similar approach is trial and error, where the action thought to be the best at the time is applied until it is proven to be ineffective. However, trial and error is not based on a precise model, does not deliberately explore new actions for learning purposes and does not take account of the uncertainty (Duncan and Wintle, 2008).



For this reason, we will focus on more recent approaches that update the system knowledge, and hence improve the policy, while the policy is applied. Walters and Hilborn (1978) coined this process adaptive management (AM), or ‘learning by doing’. Adaptive management has drawn heavily on adaptive control (Walters and Hilborn, 1978). During the application of the policy, the successive observations provide information on how the system responds. This information can be analysed to increase the knowledge of the system, which can in turn lead to policy improvements. As in dual control and Bayesian reinforcement learning, learning is not part of the objective. Thus there exists the same trade-off between informative and rewarding actions.

Some approaches consist of choosing the action as if the current estimate of the system dynamics were true. This is called passive adaptive management in the environmental sciences literature (Walters and Hilborn, 1978) and is equivalent to the certainty equivalence principle in the dual control literature (Bertsekas, 1995). This approach has a low computational cost but is sub-optimal. In contrast, we will see that it is possible to achieve the best trade-off thanks to a method called active adaptive management (Walters and Hilborn, 1978). This method accounts for both the future rewards and the knowledge improvement yielded by the possible actions. If all parameters are accounted for, active adaptive management yields the best expected outcome. For this reason, we will focus on active adaptive management in the rest of this thesis, and ‘adaptive management’ will always refer to the active version.

Solving such adaptive management problems is not an easy task, because it is equivalent to solving an MDP with a continuous state (see Section 2.2.2). In recent decades, much attention has been devoted to handling this continuous set while maintaining accuracy. Traditionally, this continuous set was discretised (Williams, 1996, Runge, 2013), which has two limitations: it leads to an approximate solution, and setting the discretisation resolution comes down to manually trading off precision against solving time.

Williams (2011) showed that AM problems can be cast as POMDPs but did not evaluate the approach on a case study. Then, Chades et al. (2012a) showed that MOMDPs, a particular case of POMDPs that is easier to solve, can also be used to model structural uncertainty. The authors successfully applied this solution technique to a population of threatened birds (Chades et al., 2012a).

Just as in Bayesian reinforcement learning, framing the problem as a POMDP or MOMDP was an important step in addressing structural uncertainty because it does not involve discretising the state space, thus overcoming



the two limitations mentioned above (that it took a few decades before researchers made that step is probably because POMDPs were originally developed to address observational uncertainty, not structural uncertainty; see Section 2.2.2). The difference with Bayesian reinforcement learning is that the comparatively smaller uncertainty (i.e. only a few possible transition functions) in an adaptive management context makes the POMDP much smaller and tractable by off-the-shelf POMDP or MOMDP solvers (Chades et al., 2012a). Hence, modellers in adaptive management can directly benefit from decades of research on efficient POMDP solving. For example, Nicol et al. (2013) use Symbolic Perseus, a POMDP solver specifically designed for states that are combinations of sub-states (Poupart, 2005), thus handling many more states than standard solvers.

In conclusion, the adaptive management literature has developed tools that are well suited to solving environmental problems, for two reasons. First, the transition function is assumed to be either one of a few candidate functions or to depend on a single continuous parameter (Chades et al., 2017). This suits environmental problems because modellers do have prior information about the transition function, for example different scenarios or opinions about the natural system dynamics. Second, linking structural uncertainty with POMDPs allowed larger real-world problems to be solved faster. For these reasons we will follow the adaptive management literature and use POMDPs to solve our case study.

In the following Section, we provide more technical details on the solution techniques that have been developed to deal with structural uncertainty.

2.2.2 Technical background: solving structural uncertainty with Markovian processes

In this Section, we will first show how to model problems involving structural uncertainty with Markov decision processes (MDPs) through discretisation. Then, we will see that partially observable Markov decision processes (POMDPs) are well suited to modelling this type of problem, and in particular mixed observability Markov decision processes (MOMDPs). Throughout this Section, we will also describe some popular solution techniques for POMDPs and MOMDPs, which will provide a basis for some of the theoretical improvements presented in Chapters 5 and 6.



Solving structural uncertainty with Markov decision processes

The general idea to achieve this is to incorporate the uncertainty about the system dynamics in the MDP state. Each MDP state is now a combination of a system state, which represents the physical state of nature, and a model state, which represents the true system dynamics or transition function. As is traditional in adaptive management, we assume that the model state belongs to a finite set of models or transition functions (Moore and Conroy, 2006, Walters and Hilborn, 1976). Note that structural uncertainty is sometimes modelled by an unknown continuous parameter; such problems are very challenging to solve (Chades et al., 2017), unless they have a special structure allowing the structural uncertainty to be reduced to a finite number of parameters, for example through a beta-binomial conjugate relationship (McCarthy and Possingham, 2007, Runge, 2013).

Often, the system state is assumed perfectly observable and hence not much of a concern to model. By contrast, the model state is not known with certainty in AM. One solution is to calculate the likelihood of each model (the probability of it truly reflecting the system dynamics) and assign these likelihoods to the model state. In other words, the model state is a probability distribution over models. The issue is that, even if the set of models is discrete, the set of all probability distributions is continuous. Hence, this set cannot be trivially handled by computers and requires further insights.

In recent decades, much attention has been devoted to handling this continuous set of all probability distributions while maintaining accuracy. Traditionally, those probabilities were discretised in order to be handled by computers. Each element of the resulting grid is linked to an MDP state. When dynamic programming requires the value of a model state that does not fall on the grid, an interpolation rule applies. Discretising the set of possible probability distributions leads to sub-optimal solutions and implies trading off precision against solving time: the finer the discretisation, the higher the accuracy and the computation time.
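To illustrate this discretisation (a minimal sketch with an arbitrary resolution, not the grid used in any particular study), the function below enumerates a regular grid over the probability simplex for a finite set of models; each grid point would become one discretised ‘model state’ of the augmented MDP.

from itertools import product

def simplex_grid(n_models, resolution):
    """Enumerate belief vectors (p_1, ..., p_K) whose entries are multiples of 1/resolution.

    Each returned tuple sums to 1 and corresponds to one discretised model state.
    """
    points = []
    for counts in product(range(resolution + 1), repeat=n_models):
        if sum(counts) == resolution:
            points.append(tuple(c / resolution for c in counts))
    return points

# Example: 3 candidate models and a step of 0.25 give 15 grid points.
print(len(simplex_grid(3, 4)))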

In the next section, we introduce the general POMDP framework. As we will see, POMDPs are tailored to handle observational uncertainty (when the state is not perfectly observable), but not structural uncertainty (uncertain system dynamics). However, we will see in a subsequent section that MOMDPs, a variant of POMDPs, handle structural uncertainty well and have become a popular decision tool to tackle adaptive management problems.



Addressing observational uncertainty - partially observable MDP

In real-world problems, the state of the system under management might be hard to observe. Rather, the decision maker observes a signal, which partially depends on the last action and the current state. This is called observational uncertainty. Note that it is different from structural uncertainty, which describes uncertainty about the system dynamics but not necessarily about the current state. Observational uncertainty can be informally described as uncertainty about ‘where we are’, as opposed to structural uncertainty, which is about ‘where we are heading’ (but as we will see, one can be used to model the other). Information about the real state is inferred from this observation as well as the process history. If the underlying system dynamics can be modelled by an MDP, the problem can be modelled as a partially observable MDP (POMDP; Sigaud and Buffet (2010)). Typical applications in environmental sciences are for species that are hard to detect (Chades et al., 2008, Regan et al., 2011). There are also numerous applications in artificial intelligence, e.g. to model a robot in a partially observable environment (Pineau et al., 2003).

A POMDP 〈S, A, P, r, Ω, O〉 is defined by six main components. The first four are the same as for MDPs:

1. The state space S;

2. The action space A;

3. The transition function P, where P(s′|s, a) is the probability of transitioning from the state s to s′ when a is implemented;

4. The reward r(s, a) for each state s and action a;

Two additional features relate to observations:

5. The observation space Ω, which is finite;

6. The observation probability O. O(o′|s′, a) is the probability of observing o′ ∈ Ω if the state is s′ after action a.

A POMDP process unfolds in the same way as an MDP: starting from an initial state s0, the process transitions from state to state depending on the actions implemented and governed by the probabilities P. The goal of a POMDP solver is to find a policy π∗ that maximises a given criterion from Table 2.1.
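As a minimal data-structure sketch (a toy two-state, two-action, two-observation POMDP with arbitrary numbers, used purely for illustration), the components above can be stored as plain arrays:

import numpy as np
from typing import NamedTuple

class POMDP(NamedTuple):
    P: np.ndarray   # P[a, s, s']  transition probabilities
    r: np.ndarray   # r[s, a]      rewards
    O: np.ndarray   # O[a, s', o'] observation probabilities
    gamma: float    # discount factor

# Toy instance: 2 states, 2 actions, 2 observations (all numbers are made up).
toy = POMDP(
    P=np.array([[[0.9, 0.1], [0.2, 0.8]],
                [[0.5, 0.5], [0.5, 0.5]]]),
    r=np.array([[1.0, 0.0],
                [0.0, 1.0]]),
    O=np.array([[[0.8, 0.2], [0.3, 0.7]],
                [[0.5, 0.5], [0.5, 0.5]]]),
    gamma=0.95,
)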

The difference with MDPs is that the decision maker does not observe the system state; rather, she receives an observation in the set Ω, generated with the probability distribution O (decision diagram in Fig. 2.2).



Figure 2.2: Decision diagrams of an MDP and a POMDP. Grey areas depict variables that are not completely observable. The different dashed arrows towards a_{t+1} (or a′) illustrate the need for the decision maker to consider more factors than the current observation alone to calculate an optimal solution.

For this reason, POMDPs are non-Markovian (Sigaud and Buffet, 2010), i.e. the current observation is not sufficient for the decision maker to make an optimal decision. Remembering past information is necessary. Defining the history as the initial state, the past actions and the observations, h_t = (s_0, a_0, o_1, a_1, \ldots, o_{t-1}, a_{t-1}, o_t), we define a POMDP policy as a mapping from the set of possible histories to the set of actions. We now present some of the solution techniques that have been proposed to find the best POMDP policy. Since an MOMDP, a variant of a POMDP tailored to handle structural uncertainty (as we will see), is structurally very similar to a POMDP, these techniques can also be used to solve MOMDPs, and hence to address structural uncertainty and adaptive management problems.

Solving POMDPs – general concepts and results

Let us start with a very basic way of solving a POMDP. Over a finite time horizon, the number of possible histories is finite, and so is the number of policies. So, we can solve a POMDP optimally by evaluating all of these POMDP policies and selecting the best one. Moreover, over an infinite time horizon, we can find an approximate policy (ε-optimal, with ε as small as we desire) by solving the POMDP as a finite-time POMDP; we simply need a time horizon long enough to ensure that the error incurred by this approximation



Figure 2.3: Representation of the belief space for 2, 3 and 4 states. Each belief state is associated with one point in the (|S| − 1)-simplex and vice versa. Belief states (0.3, 0.7) and (0.1, 0.6, 0.3) are given as examples.

is below ε (see Eq. 2.7). However, it can be shown that the number of policies (mappings from histories to actions) after t time steps grows doubly exponentially with t (Sigaud and Buffet, 2010): this brute-force approach is computationally intractable for all but tiny problems.

Thankfully, it is possible to make optimal decisions without keeping track of the entire history of actions and observations. A sufficient statistic is a compact representation of a mathematical object that preserves enough information to make optimal decisions. In the following, we introduce the concept of a belief state, a sufficient statistic for POMDPs that will therefore act as a substitute for the history. Note that although belief states have played a central role in solving larger POMDPs (as we will see), POMDPs remain very hard to solve: Papadimitriou and Tsitsiklis (1987) demonstrated that finite-horizon POMDPs are PSPACE-complete, while Madani et al. (1999) further showed that infinite-horizon POMDPs are undecidable.

Belief states

A belief state b is a probability distribution over the states S (Sigaud and Buffet, 2010). It is a useful mathematical concept representing, at any point in time, our imperfect knowledge about the underlying ‘real-world’ states, which cannot be observed perfectly. If S = {s_1, s_2}, a possible belief state is (0.3, 0.7), meaning that the probability of being in s_1 and s_2 is 0.3 and 0.7 respectively. The set of all possible belief states is denoted by B and is a simplex of dimension |S| − 1. Examples of simplices are a point, a segment, a triangle and a tetrahedron for |S| = 1, 2, 3, 4 respectively (see Fig. 2.3).



One essential property of belief states is that they can be updated iteratively, without considering the entire history. All variables referring to time step t+1 are followed by a prime symbol (′), as opposed to variables of time step t. An action a implemented in the belief state b and followed by the observation o′ leads to one fully determined belief state, denoted by b^{b,a,o'}, obtained through Bayes' theorem:

b^{b,a,o'}(s') = \Pr(s' | b, a, o')
              = \frac{\Pr(o' | s', b, a) \, \Pr(s' | b, a)}{\Pr(o' | b, a)}
              = \frac{O(o' | s', a) \sum_{s \in S} P(s' | s, a) \, b(s)}{\sum_{s' \in S} O(o' | s', a) \sum_{s \in S} P(s' | s, a) \, b(s)}        (2.12)

The denominator can be seen as a normalising factor ensuring that the belief state b^{b,a,o'} is a probability distribution. The belief state b^{b,a,o'} is called a successor of b. Eq. 2.12 shows that we can calculate the belief state from the previous belief state, action and observation, i.e. without tracking the complete history. This means that a process ‘built’ on belief states is Markovian; in other words, any POMDP can be cast as a continuous-state belief MDP 〈B, A, τ, R〉 (Sigaud and Buffet, 2010):

1. Belief states b ∈ B are the states of this belief MDP. Note that with this formulation, we can easily account for a potential uncertainty on the initial state s_0 by starting with an initial belief state b_0.

2. Actions are the same as in the POMDP.

3. R(b, a) = \sum_{s \in S} b(s) \, r(s, a) is the expected reward in the belief state b when action a is undertaken.

4. The transition function τ is

\tau(b' | b, a) = \Pr(b' | b, a) = \sum_{o' \in \Omega} \Pr(b' | b, a, o') \, \Pr(o' | b, a)        (2.13)

First, the right factor \Pr(o' | b, a) equals the denominator in Eq. 2.12:

\Pr(o' | b, a) = \sum_{s' \in S} O(o' | s', a) \sum_{s \in S} P(s' | s, a) \, b(s)        (2.14)

Second, the left factor \Pr(b' | b, a, o') can be greatly simplified. Recall that the successor of b after action a and observation o′ is fully and uniquely determined, and denoted by b^{b,a,o'}. Hence, the quantity \Pr(b' | b, a, o'), i.e. the probability that b′ is this successor, equals one if b' = b^{b,a,o'}, and 0 otherwise. This leads us to the following equation for τ:

\tau(b' | b, a) = \sum_{o' \in \Omega, \; b' = b^{b,a,o'}} \sum_{s' \in S} O(o' | s', a) \sum_{s \in S} P(s' | s, a) \, b(s)        (2.15)

Objective

Since any POMDP can be rewritten as a belief MDP, we can now define the objective. In this new way of formalising POMDPs, a POMDP policy is a mapping from belief states to actions. Solving a POMDP means finding the policy that maximises a given optimisation criterion for an initial belief state b_0. For example, for the γ-discounted infinite-horizon criterion, π∗ satisfies:

\pi^* = \arg\max_{\pi} E\!\left[ \sum_{t=0}^{\infty} \gamma^t R(b_t, \pi(b_t)) \,\middle|\, b_0 \right]        (2.16)

POMDP solvers use an update operator that is nearly identical to that of classic MDP solvers (Eq. 2.6), but applied to belief states instead of states:

(HV)(b) = \max_{a \in A} \left[ R(b, a) + \gamma \sum_{b' \in B} \tau(b' | b, a) \, V(b') \right], \quad \text{for all } b \in B        (2.17)

Note that although the belief space B is infinite, the above sum, for a given b ∈ B, has a well-defined value and is tractable. This is because |Ω| is finite and there is, for each observation o′ ∈ Ω, only one belief state b′ for which τ(b′|b, a) ≠ 0 (Eq. 2.15). Hence, it is possible to update the value of any belief state. One might then think that, just like in MDPs, applying this operator recursively to every belief state in B leads to the optimal policy. The issue is that the belief space B is infinite, unlike S, which makes this approach impossible. However, we will see in the following section that the value function has a remarkable property, which allows for designing more efficient approaches.

Piecewise linear convexity of the value function

Smallwood and Sondik (1973) showed that the optimal value function of a finite-horizon POMDP is piecewise linear convex (Sigaud and Buffet, 2010). This means that the value function can be defined as the maximum of a finite number of linear functions, called α-vectors. Fig. 2.4 shows two simple examples of value functions. Over an infinite time horizon, the optimal value



function is convex (not necessarily piecewise linear) but can be approximated arbitrarily closely by a piecewise linear convex function.

Intuitively, the term ‘linear’ in ‘piecewise linear convex’ is due to the fact that only linear operators intervene in the evaluation of a policy (see Eq. 2.16):

V_\pi(b_0) = E\!\left[ \sum_{t=0}^{\infty} \gamma^t R(b_t, \pi(b_t)) \,\middle|\, b_0 \right] = E\!\left[ \sum_{t=0}^{\infty} \gamma^t \sum_{s \in S} b_t(s) \, r(s, \pi(b_t)) \,\middle|\, b_0 \right]        (2.18)

The term ‘piecewise convex’ is due to the fact that we are maximising our objective: we choose, in each belief state, the best policy, i.e. the linear function that has the highest value in this belief state. Another argument to justify convexity is that belief states near the edge of the belief space, e.g. of the form (0, . . . , 0, 1, 0, . . . , 0) or even (0.5, 0.5, 0, . . . , 0), correspond to a better knowledge of the underlying state than belief states of the form (1/|S|, . . . , 1/|S|). This adds credence to the fact that the value in the centre of the belief space cannot be greater than the weighted average of the values near the edge of the belief space.

α-vectors are important because knowing the set of α-vectors is equivalent to knowing the policy. For this to be true, each α-vector is defined as the combination of one action and a linear function defined over the belief space; often the term ‘α-vector’ simply refers to its linear function, without ambiguity. A given set of α-vectors can be ‘applied’ as a policy as follows. Based on the current belief state b, we identify the α-vector that maximises the value in b and implement the action associated with this α-vector. We then receive from the system an observation o, from which we update the belief state through Bayes' theorem (see Eq. 2.12), and the process repeats. Over a finite time horizon, the policy is defined as one set of α-vectors per time step. Over an infinite horizon, the policy is a single set of α-vectors. Throughout the rest of this introduction we will focus on the infinite time horizon.
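A small sketch of how a set of α-vectors is ‘applied’ as a policy (the vectors and action labels are toy values chosen for illustration):

import numpy as np

def best_alpha(b, alphas, actions):
    """Select the alpha-vector maximising the value at belief b.

    alphas  -- list of alpha-vectors (each of length |S|)
    actions -- action attached to each alpha-vector
    Returns (value, action) at belief b.
    """
    values = [float(np.dot(alpha, b)) for alpha in alphas]
    i = int(np.argmax(values))
    return values[i], actions[i]

# Toy example: two alpha-vectors over a 2-state belief space.
alphas = [np.array([10.0, 0.0]), np.array([2.0, 6.0])]
actions = ["action_1", "action_2"]
print(best_alpha(np.array([0.3, 0.7]), alphas, actions))   # -> (4.8, 'action_2')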

This sheds light on a very effective way to solve POMDPs: one needs to store and update, at each time step, the set of α-vectors, which provides both the policy and, by definition, the value. This is how most POMDP solvers work. Given the current set of α-vectors Γ′, the updated set Γ can be calculated as follows (Sigaud and Buffet, 2010):

\Gamma = \left\{ r(\cdot, a) + \gamma \sum_{i=1}^{|\Omega|} \sum_{s' \in S} \alpha'_{o'_i}(s') \, O(o'_i | s', a) \, P(s' | \cdot, a) \;\middle|\; a \in A, \; (\alpha'_{o'_1}, \alpha'_{o'_2}, \ldots, \alpha'_{o'_{|\Omega|}}) \in \Gamma'^{|\Omega|} \right\}        (2.19)

Computing the new value function comes down to computing the set Γ from Γ′.



Figure 2.4: Typical value function for a finite-horizon two-state (left, bold line) and three-state (right) POMDP. The value function is piecewise linear convex, i.e. it equals the maximum of a set of linear functions, referred to as α-vectors ((f_1, f_2, f_3, f_4) with two states; planes with three states).

Since Γ is finite by construction, it can be computed in finite time. The value function defined by the set Γ converges towards the optimal value function (Sigaud and Buffet, 2010). Therefore, we can provably obtain an ε-optimal policy in a finite number of steps by repeatedly updating the set Γ. As in the MDP case, the value function can be initialised with any lower bound on the optimal value function, e.g. with one α-vector of value zero everywhere (if the rewards are non-negative).

An impediment to calculating Γ from Γ′ is that the cardinality of Γ can be up to |A| |Γ′|^{|Ω|}, which can grow dramatically fast. For all but very small problems, we can store only a subset of Γ. In conclusion, for both finite and infinite horizons, an important objective when using α-vectors is to keep the set of α-vectors small by deleting unwanted elements. Many POMDP solvers use this principle, and can be classified as either exact solvers or approximate solvers.

Solving POMDPs with α-vectors: Exact solvers

The way to keep the set of α-vectors small is to remove α-vectors that do not affect the value function at all. We call such α-vectors dominated. An α-vector α_0 ∈ Γ is called dominated if, for all belief states b ∈ B, there exists



an α-vector α ∈ Γ − {α_0} such that:

\alpha_0 \cdot b \le \alpha \cdot b        (2.20)

This implies that, for all belief states b ∈ B:

\max_{\alpha \in \Gamma} \alpha \cdot b = \max_{\alpha \in \Gamma - \{\alpha_0\}} \alpha \cdot b        (2.21)

In other words, the dominated α-vectors can be removed without decreasing the value function. A natural approach to reduce the computational burden is then to prune out the dominated α-vectors and only keep the useful ones (e.g. f_1, f_2 and f_4 in Fig. 2.4). A number of algorithms follow this approach to solve POMDPs exactly (Monahan, 1982, White, 1991, Cassandra et al., 1994). In the following we only describe Monahan's algorithm because all these methods are similar in spirit. We refer the reader to Sigaud and Buffet (2010) for more details on the other algorithms.

Monahan's algorithm consists of generating the entire set Γ described in Eq. 2.19. Then, dominated α-vectors are removed; this can be achieved by solving a simple linear program (CheckDomination) for every α ∈ Γ. The output z, b of CheckDomination(α_check, Γ) satisfies z ≤ 0 if α_check is dominated, and z > 0 if α_check is useful in at least the belief state b.

Algorithm 1 CheckDomination(α_check, Γ)

Input: α_check ∈ Γ
Objective: max z
Variables: b ∈ B, z ∈ R
Constraints: z ≤ (α_check − α) · b, for all α ∈ Γ − {α_check}
Output: z, b
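A sketch of this linear program using scipy (a hedged reading of Algorithm 1: the simplex constraints on b are written out explicitly, and z is maximised by minimising −z; the variable ordering and the toy vectors are our own choices):

import numpy as np
from scipy.optimize import linprog

def check_domination(alpha_check, Gamma):
    """Return (z, b): z <= 0 iff alpha_check is dominated by the other vectors in Gamma."""
    others = [a for a in Gamma if a is not alpha_check]
    n = len(alpha_check)
    c = np.zeros(n + 1)
    c[-1] = -1.0                                   # minimise -z  ==  maximise z
    # One constraint per competing alpha:  z - (alpha_check - alpha).b <= 0
    A_ub = np.array([np.append(-(alpha_check - a), 1.0) for a in others])
    b_ub = np.zeros(len(others))
    A_eq = np.array([np.append(np.ones(n), 0.0)])  # belief sums to one
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:n]

# Toy check: (3, 3) is useful around b = (0.5, 0.5), so z = 1 > 0.
Gamma = [np.array([3.0, 3.0]), np.array([4.0, 0.0]), np.array([0.0, 4.0])]
print(check_domination(Gamma[0], Gamma))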

Monahan's approach turns out to be rather inefficient (Sigaud and Buffet, 2010) because the size of the linear program can be prohibitive. Also, generating the entire set Γ can be computationally costly. Later algorithms managed to reduce computational costs by pruning α-vectors earlier in the algorithm, to avoid computing Γ completely; see for example Cassandra et al. (1994).

In spite of this progress, exact algorithms are still too computationally demanding to address problems with more than a few dozen states, actions and observations (Shani et al., 2013). Approximate approaches were required to deal with larger problems.



Approximate solvers – Point-based algorithms

In the last decade, researchers have investigated approximate methods capable of tackling much larger problems. Exact algorithms focus on the set of α-vectors, thus disregarding the belief states altogether. However, a POMDP objective is specified with an initial belief state b_0. It stands to reason that only those belief states that are ‘reachable’ from b_0 (i.e. are a successor of b_0 at some point in time) are relevant to the policy and the value function. Computing the α-vectors that are optimal only for those reachable belief states comes down to disregarding entire parts of the belief space, thus considerably reducing the computational cost. Algorithms following that approach are called point-based algorithms. We present two such algorithms: the pioneering point-based value iteration and the state-of-the-art SARSOP.

Point-based value iteration

Point-Based Value Iteration (PBVI, Pineau et al. (2003)) repeatedly improves the set Γ of α-vectors by executing two interleaving tasks: value updating and belief exploration (functions Update and Explore in Algorithm 2). The algorithm stops when the value improvement over one time step is less than a pre-set threshold. The set Γ is initialised with a lower bound on the optimal value function, e.g. the minimum possible reward at each time step, i.e.

\alpha_{\min} = \frac{\min_{s,a} r(s, a)}{1 - \gamma} \times (1, 1, \ldots, 1),

where γ is the discount factor; in Chapter 5 we will introduce a way to find a better lower bound. This initialisation guarantees that the value function defined by the α-vectors is always increasing throughout the algorithm.

The set of ‘visited’ belief states is denoted by B. As explained above, a key point in PBVI and all point-based algorithms is that α-vectors should only be calculated if they are ‘attached’ to a visited belief state, i.e. one in B. So, the Update function calculates, through the function Backup, the best α-vector for each belief state b ∈ B. The Explore function is in charge of adding new belief states to B. For each b ∈ B, a set of candidate successors (defined in Eq. 2.12), denoted here by B̃, is built by randomly drawing one successor for each action. The element of B̃ that is the furthest from the existing points of B is added to B. The intuition behind this choice is to spread across the belief space as evenly as possible, in an attempt to minimise the maximum gap between neighbouring belief states in B. This maximum gap linearly upper-bounds the difference between the current value function and the optimal value function (Pineau et al., 2003). Therefore, by reducing this maximum gap, PBVI provably converges to the optimal value function, and it stops when the value improvement over one time step is less than a pre-set threshold.



Algorithm 2 PBVI

PBVI structure
Initialisation: B = {b_0}; Γ = {α_min}
1: repeat
2:   Γ ← Update(B, Γ)
3:   B ← Explore(B)
4: until termination condition

Function Update(B, Γ)
1: repeat
2:   for b ∈ B do
3:     α ← Backup(b, Γ)
4:     Γ = Γ ∪ {α}
5: until Γ stationary

Function Explore(B)
1: for b ∈ B do
2:   B̃ = ∅
3:   for a ∈ A do
4:     o′ ← Draw(O(b, a))
5:     B̃ = B̃ ∪ {b^{b,a,o′}}
6:   b′ = arg max_{b̃ ∈ B̃} ‖b̃ − B‖_2
7:   B = B ∪ {b′}

Function Backup(b, Γ)
1: for a ∈ A do
2:   for o′ ∈ Ω do
3:     α_{a,o′,α′} = \sum_{s ∈ S} b(s) \sum_{s′ ∈ S} α′(s′) O(o′|s′, a) P(s′|s, a)
4:     α_{a,o′} = arg max_{α′ ∈ Γ} α_{a,o′,α′}
5: α = arg max_{a ∈ A} [ r(·, a) + γ \sum_{o′ ∈ Ω} α_{a,o′} ]
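A rough Python sketch of the Backup step above (our own vectorised reading of the pseudocode, in which the chosen α′ for each observation is projected through O and P before being summed; the array layout is an assumption):

import numpy as np

def backup(b, Gamma, P, O, r, gamma):
    """Point-based backup of belief b against the current alpha-vector set Gamma.

    P[a, s, s'] transitions, O[a, s', o'] observations, r[s, a] rewards.
    Gamma is a list of alpha-vectors; returns (new_alpha, best_action).
    """
    n_actions, n_obs = P.shape[0], O.shape[2]
    best_vec, best_val, best_act = None, -np.inf, None
    for a in range(n_actions):
        new_alpha = r[:, a].astype(float).copy()
        for o in range(n_obs):
            # Project each candidate alpha': sum_{s'} alpha'(s') O(o|s',a) P(s'|s,a)
            projections = [P[a] @ (O[a][:, o] * alpha) for alpha in Gamma]
            # Keep the projection with the highest value at b
            best = max(projections, key=lambda v: float(v @ b))
            new_alpha += gamma * best
        val = float(new_alpha @ b)
        if val > best_val:
            best_vec, best_val, best_act = new_alpha, val, a
    return best_vec, best_act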



PBVI manages to solve much larger instances than previous exact algorithms. By construction, all belief states in B are reachable, which ensures that PBVI does not waste computational resources investigating unvisited areas of the belief space. This is the main strength of PBVI and the reason why it performs reasonably well on large instances on which most preceding algorithms were not tractable. Compared to other point-based approaches, PBVI obtains good results on instances where a wide search is needed, especially where stochasticity is high (Shani et al., 2013). On the other hand, PBVI disregards the value function when exploring the belief space, and its backups are implemented for all considered belief states, which induces high computational costs. PBVI paved the way for other point-based algorithms that rely on a variety of exploration and value-updating methods.

SARSOP

One of the best-performing algorithms for large POMDPs is SARSOP, which stands for Successive Approximations of the Reachable Space under Optimal Policies (Kurniawati et al., 2008). Like PBVI, SARSOP does not visit unreachable belief states. It goes further than PBVI in that it aims at visiting only states that are reachable when applying the optimal policy. This seems sensible because belief states only visited after implementing sub-optimal actions are somewhat irrelevant to what we are trying to achieve, and can be discarded.

Of course, the optimal policy is unknown at the beginning of the algorithm, so SARSOP cannot discard any belief states then. However, the algorithm progressively improves, for each belief state b and action a, an upper bound Q̄(a, b) and a lower bound Q(a, b) on the value function. Then, when the upper bound of an action a is lower than (dominated by) the lower bound of another action a′ in a belief state b, i.e. Q̄(a, b) < Q(a′, b), the action a is pruned out of the possible actions in b. All successors of b resulting from executing action a are removed from the reachable space B, which results in a significant computational gain. This process is similar in spirit to the branch-and-bound algorithm for solving mixed-integer linear programming problems, except that SARSOP branches on actions instead of integer variables.

Pruning these dominated actions significantly reduces the size of the reachable space B, which yields great computational savings. SARSOP has been found to outperform previous solvers (especially on large domains), can solve POMDPs with thousands of states optimally in a matter of minutes



(Kurniawati et al., 2008), and is often described as a state-of-the-art optimal solver by experts in the field (Ong et al., 2010, Silver and Veness, 2010).

A word on alternative approximate approaches

Some other approximate algorithms have been proposed to address problems with many more states. Silver and Veness (2010) use Monte Carlo search to explore the set of reachable belief states and obtain promising results on instances with millions of states. This approach could be categorised as model-free reinforcement learning as it focuses on the values of individual belief states without using α-vectors, trading off between exploration and exploitation. Perhaps inspired by approximate dynamic programming (Powell, 2007), Boutilier and Poole (1996) merge the branches of the belief exploration tree that correspond to the same decisions, while Littman (1996) and Poupart (2005) present approaches to compress the size of belief states in order to reduce the computational burden of updating them. Some works are also tailored to exploit the structure of certain POMDPs, namely factored POMDPs. A POMDP is called factored if its states can be naturally decomposed into a combination of several sub-states. Poupart (2005) and Sim et al. (2008) extend the use of algebraic decision diagrams (ADDs, a compact representation of a policy or value function; see Hoey et al. (1999)) to factored POMDPs. How well these approaches scale to large problems very much depends on how independent the sub-states are; Poupart's algorithm has scaled to tens of thousands of states (Nicol et al., 2013) on problems where SARSOP runs out of memory (Peron et al., 2017a). Although all these approaches are useful and inspiring for solving large POMDPs, they have the disadvantage of not having any performance guarantee.

In conclusion, a great deal of progress has been made in designing approximate or exact POMDP solvers to deal with partial observability of states. We now show that a slightly different framework called MOMDP is an efficient way to model structural uncertainty in order to solve adaptive management problems.

Mixed observable MDPs

A mixed observable MDP (MOMDP) is a special case of POMDPs where the state can be decomposed into a fully observable component and a partially observable component (Ong et al., 2010). Alternatively, MOMDPs can be seen as MDPs extended with a non-observable component (Fig. 2.5). MOMDPs can model various decision problems where an agent knows its position but evolves in a partially observable environment, or, as we will see, where the



transition functions or rewards are uncertain. Formally, an MOMDP (Ong et al., 2010) is a tuple 〈X, Y, A, O, T_x, T_y, Z, R, γ〉 in which:

• The state space is of the form X × Y. The current state (x, y) fully specifies the system at every time step. The component x ∈ X is assumed fully observable and y ∈ Y is partially observable;

• A is the finite action space;

• T_x(x, y, a, x′) = Pr(x′|x, y, a) is the probability of transitioning from the state (x, y) to x′ when a is implemented. T_y(x, y, a, x′, y′) = Pr(y′|x, y, a, x′) is the probability of transitioning from y to y′ when a is implemented and the observed component transitions from x to x′. The process respects the Markov property in that these probabilities do not depend on past states or actions;

• The reward function is the immediate reward r(x, y, a) that the policy-maker receives for implementing a in state (x, y);

• O is the finite observation space;

• Z(a, x′, y′, o′) = Pr(o′|a, x′, y′) is the probability of observing o′ ∈ O if the state is (x′, y′) after action a;

• γ is the discount factor (< 1 in infinite time horizon).

MO-SARSOP is a solver based on SARSOP and tailored to MOMDPs (Ong et al., 2010). Since the variable x is fully observable, there is no need to maintain a belief state on x, as opposed to y. So, belief states b(x, y) can be replaced by pairs of the form (x, b(y)), with significant computational gains due to the discreteness of x. Further, the set Γ of α-vectors can favourably be replaced by ∪_{x∈X} Γ(x), where Γ(x) contains α-vectors only for the state x. The resulting increase in the number of α-vectors is more than compensated by the reduction in the size of each α-vector, which now has coordinates over y only. MO-SARSOP outperforms SARSOP on most well-known benchmarks because part of the state is often fully observable.

Addressing structural uncertainty with mixed observable MDPs

Chades et al. (2012a) have shown that structural uncertainty can be modelled as an MOMDP as follows (a small construction sketch is given after this list):

• The fully observable state x corresponds to the physical state of the system, e.g. the population of a threatened or invasive species, which we observe perfectly by assumption.



Figure 2.5: Decision diagrams of a POMDP (A) and an MOMDP (B). The observation of y_t is imperfect and is affected by x_t.

• The partially observable state y represents the true system dynamics. The set Y represents the set of possible transition functions.

• The probabilities T_x(x, y, a, x′) are the probabilities of transitioning from the state x to x′ when a is implemented, if y were the true system dynamics.

• The probabilities T_y(x, y, a, x′, y′) usually do not depend on the physical state x or on the action a; they can be simplified into T_y(y, y′) (see the decision diagram in Fig. 2.6b).

• The reward function r(x, y, a) usually does not depend on the ‘system dynamics state’ y, i.e. obtaining information yields no direct reward. It can be simplified into r(x, a).

• The general MOMDP framework allows for observations on the partially observable state y. Often in adaptive management, however, y is assumed completely unobservable. This can be modelled by having just one observation, i.e. O = {o} and Z(a, x′, y′, o) = 1 for all a, x′ and y′.

• The initial belief state b_0 depends on the prior information on the different transition functions. In the absence of such information, it is usually uniform: b_0 = (1/|Y|, . . . , 1/|Y|).
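The sketch below assembles these components for a toy adaptive management problem with two candidate transition matrices (all numbers are illustrative assumptions and the construction is a simplified reading of the mapping above, not the Torres Strait model):

import numpy as np

def am_as_momdp(candidate_P, rewards, prior=None):
    """Cast an adaptive management problem with a finite set of candidate
    transition functions as the components of an MOMDP.

    candidate_P -- list of arrays P_y[a, x, x'], one per candidate model y
    rewards     -- array r[x, a] (independent of y, as discussed above)
    prior       -- initial belief over models (uniform if None)
    """
    n_models = len(candidate_P)
    Tx = np.stack(candidate_P)      # Tx[y, a, x, x'] = Pr(x'|x, y, a)
    Ty = np.eye(n_models)           # model state assumed stationary: y' = y
    n_obs = 1                       # y unobservable: a single dummy observation
    b0 = np.full(n_models, 1.0 / n_models) if prior is None else np.asarray(prior)
    return Tx, Ty, n_obs, rewards, b0

# Toy example: 2 physical states, 1 action, 2 candidate dynamics.
P_model1 = np.array([[[0.9, 0.1], [0.4, 0.6]]])
P_model2 = np.array([[[0.6, 0.4], [0.1, 0.9]]])
r = np.array([[1.0], [0.0]])
Tx, Ty, n_obs, r_, b0 = am_as_momdp([P_model1, P_model2], r)
print(b0)   # -> [0.5 0.5]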



Figure 2.6: Decision diagrams of a general-case MOMDP (A) and an MOMDP handling structural uncertainty (B). Grey and black areas respectively depict partially observable and unobservable variables. These frameworks require storing the history (indirectly, through belief states) in order to make the best decision (dashed lines).

Off-the-shelf MOMDP solvers like MO-SARSOP can then be used to find the best policy, taking the above MOMDP model as an input. The optimal policy then achieves, by definition, the optimal trade-off between informative and rewarding actions (Chades et al., 2012a). In Chades et al. (2012a), the policy appears to actively try to distinguish between the different system dynamics by selecting discriminating management actions. The true system dynamics is identified after a handful of time steps, after which the system can be managed at maximum efficiency.

In practice, MOMDP solvers such as MO-SARSOP are particularly suited to this kind of problem because they only maintain beliefs on the unobservable variable y. They allow the optimal trade-off between rewarding and informative actions to be found efficiently.

2.2.3 The limitation of current MOMDP solvers

As we have seen, modelling an adaptive management problem as an MOMDP leads to the best policy and benefits from the latest advances in POMDP solvers. In practice, the high complexity of MOMDPs (PSPACE-complete; Chades et al. (2012a)) makes the convergence relatively slow for all but trivial



problems.

In Chapter 5, we focus on stationary MOMDPs, a particular type of MOMDP where the partially observable component y is stationary, i.e. does not change over time. This means that y_{t+1} = y_t at any time t, or equivalently, T_y(y, y′) = 1 if y = y′ and 0 otherwise. In adaptive management, the transition function is typically assumed stationary (Walters and Hilborn, 1978, Chades et al., 2012a, Runge, 2013). This assumption is also frequently satisfied in other disciplines; e.g. a customer's profile or a patient's condition can reasonably be assumed stationary over a short period of time.

In Chapter 5, we present an approach to improve the initialisation of stationary MOMDP solvers, which can be used when solving adaptive management problems. We show that our approach, which consists of solving a number of Markov decision processes, generates a lower bound on the value function that is optimal in the corners of the belief space, i.e. belief states of the form (0, . . . , 0, 1, 0, . . . , 0). With an additional assumption about the optimal policy, we demonstrate that this lower bound is also a linear approximation to the value function. Tested on two state-of-the-art POMDP solvers, our approach shows significant computational gains in our case study and on a previously published data challenge. Our approach could be of use in various contexts, e.g. threatened species management and natural resource management (Runge, 2013, Johnson et al., 2002), medical science (Hauskrecht, 1997), education (Cassandra, 1998), or machine or infrastructure maintenance (Faddoul et al., 2015), to cite a few.

Although this method allows adaptive management problems to be solved faster, it still falls prey to Bellman's curse of dimensionality; in our case study, no more than nine islands can be accommodated in the model (Chapter 5). We discuss this issue in the next section.

2.2.4 A change of paradigm

In the three communities mentioned in Section 2.2.1, researchers have focussed on solving discrete-time problems (Chades et al., 2012a, Astrom and Wittenmark, 2008, Duff, 2003). Discrete-time approaches are arguably easier to tackle than continuous-time ones, but they suffer from two main drawbacks. First, there are many real-world problems that require continuous attention, including stock portfolio optimisation (Dias et al., 2015), flight trajectory planning (Kang and Bedrossian, 2007) or medical sciences (Hauskrecht, 1997). Second, except for some simple linear systems, dynamic programming is the method of choice for solving discrete-time dual control problems (Astrom and Wittenmark, 2008). This implies that a finite set of states needs to be specified,



which makes problems hard to solve when the state is unbounded or multidimensional (the curse of dimensionality) (Bellman, 1957). This is the case in our case study, which is intractable for more than nine islands. Solving larger instances might require changing our paradigm completely.

Tools from continuous-time optimal control can help circumvent these caveats. Continuous-time optimal control problems can be solved by different methods, one of which is the Pontryagin minimum principle (Bertsekas, 1995). This approach leads to differential equations that can be solved numerically to find the optimal control. Although this approach does not naturally deal with stochastic systems, it is of interest to us because it does not suffer from the curse of dimensionality, and control problems with hundreds of states (dimensions in MDPs) can be solved, which is unimaginable with dynamic programming.

In Chapter 6, we address a continuous-time dual control problem. We propose an approach based on optimal control where the variable representing our knowledge of the unknown parameter is shown to follow a differential equation. All states are replaced by their expected values, which leads to a deterministic model that is solved with an optimal control algorithm, namely a forward-backward algorithm (Baker and Bode, 2016, Hackbusch, 1978, Lenhart and Workman, 2007). This algorithm rivals dynamic programming on small problems and remains tractable on problems of higher dimensions, in contrast to dynamic programming. It achieves the right balance between aggressive and smoothly varying controls. Potential applications include uncertain prey-predator dynamics in environmental sciences (see Chades et al. (2012c) and Baker et al. (2017)), finance (Dias et al., 2015) or marketing (Zhang and Cooper, 2009).

2.3 Summary

We have described how to apply MDPs to Susceptible-Infected-Susceptible networks. However, we have identified two limitations in the current literature. First, solution techniques to address simultaneous actions of different durations, which are a feature of our case study, are lacking. In Chapter 3 we develop a new approach to assist decision-makers when actions are simultaneous and of different durations. Second, even with this problem solved, traditional solution techniques fall prey to the curse of dimensionality and do not allow our complete case study to be solved because of the complexity of Susceptible-Infected-Susceptible networks. In Chapter 4, we propose two new approximate dynamic programming algorithms adapted to Susceptible-Infected-Susceptible networks.



Then, we have shown that MOMDPs, an extension of MDPs, can be used to deal with a more complex form of uncertainty present in our case study: structural uncertainty. As we have seen, modelling an adaptive management problem as an MOMDP leads to the best policy and achieves the optimal trade-off between rewarding and informative actions. In practice, the high complexity of stationary MOMDPs leads to very slow convergence for all but trivial problems. In Chapter 5, we propose a method to improve the initialisation of POMDP or MOMDP solvers that are used when solving adaptive management problems. Although this method allows adaptive management problems to be solved faster, it still falls prey to Bellman's curse of dimensionality; in our case study, no more than nine islands can be accommodated in the model. Tools from continuous-time optimal control can help circumvent these caveats: control problems with hundreds and thousands of states (dimensions in MDPs) have been solved. Hence, we investigate in Chapter 6 whether control-theoretic tools, and in particular optimal control tools, can be used to find actively learning policies in a continuous-time, continuous- and unbounded-state setting.



Chapter 3

Selecting simultaneous actions of different durations to optimally manage an ecological network

To achieve a management objective faster, several actions can be implemented simultaneously. In particular, for spatial problems, simultaneous actions in different locations must be optimised. The same issue can occur in a broad range of disciplines, including invasive or threatened species management, forestry or agriculture (Section 2.1.2). Accommodating simultaneous actions of different durations is challenging because they terminate at different time steps and, thus, computing an exact SDP solution to find an optimal policy requires large amounts of memory and computation time.

In this chapter, we address our first research question by developing a new approach to assist decision-makers when actions are simultaneous and of different durations. This novel approach modifies time constraints to reduce the model size by several orders of magnitude and to provably obtain bounds on the unknown exact performance, for problems too large for dynamic programming to compute the exact solution. Applied to the management of Aedes albopictus in the Torres Strait Islands, our case study, the bounds provide a narrow range guaranteed to contain the performance of the exact optimal policy. This chapter was published as:

Peron, M., Jansen, C. C., Mantyka-Pringle, C., Nicol, S., Schellhorn, N. A., Becker, K. H., and Chades, I. (2017b). Selecting simultaneous actions of different durations to optimally manage an ecological network. Methods in Ecology and Evolution, 8(10):1332–1341.

Statement of joint authorship:

• Martin Peron conceived the presented idea, developed the theory, designed and implemented the optimisation models, performed the analysis, drafted most of the manuscript and acted as corresponding author.



• Cassie C. Jansen collected the data, conducted the Bayesian network analysis and edited the manuscript.

• Chrystal Mantyka-Pringle collected the data, conducted the Bayesian network analysis and edited the manuscript.

• Sam Nicol collected the data and edited the manuscript.

• Nancy A. Schellhorn identified the problem, collected the data and edited the manuscript.

• Kai Helge Becker contributed to technical aspects of the paper and edited the manuscript.

• Iadine Chades directed the research, collected the data, conducted the Bayesian network analysis and edited and wrote significant parts of the manuscript.



Selecting simultaneous actions of different durations to optimally manage an ecological network

Martin Peron*,1,2, Cassie C. Jansen2,3, Chrystal Mantyka-Pringle2,4, Sam Nicol2, Nancy A. Schellhorn2, Kai Helge Becker5 and Iadine Chades2,6

1Mathematical School, Queensland University of Technology, Brisbane, Qld 4000, Australia; 2Commonwealth Scientific and Industrial Research Organisation, Dutton Park, Qld 4102, Australia; 3Metro North Public Health Unit, Queensland Health, Windsor, Qld 4030, Australia; 4School of Environment and Sustainability, Global Institute for Water Security, University of Saskatchewan, Saskatoon, SK S7N 5B3, Canada; 5Department of Management Science, University of Strathclyde, Glasgow G1 1XQ, UK; and 6ARC Centre of Excellence for Environmental Decisions, University of Queensland, Brisbane, Qld 4072, Australia

Summary

1. Species management requires decision-making under uncertainty. Given a management objective and limited budget, managers need to decide what to do, and where and when to do it. A schedule of management actions that achieves the best performance is an optimal policy. A popular optimisation technique used to find optimal policies in ecology and conservation is stochastic dynamic programming (SDP). Most SDP approaches can only accommodate actions of equal durations. However, in many situations, actions take time to implement or cannot change rapidly. Calculating the optimal policy of such problems is computationally demanding and becomes intractable for large problems. Here, we address the problem of implementing several actions of different durations simultaneously.

2. We demonstrate analytically that synchronising actions and their durations provides upper and lower bounds on the optimal performance. These bounds provide a simple way to evaluate the performance of any policy, including rules of thumb. We apply this approach to the management of a dynamic ecological network of Aedes albopictus, an invasive mosquito that vectors human diseases. The objective is to prevent mosquitoes from colonising mainland Australia from the nearby Torres Strait Islands, where managers must decide between management actions that differ in duration and effectiveness.

3. We were unable to compute an optimal policy for more than eight islands out of 17, but obtained upper and lower bounds for up to 13 islands. These bounds are within 16% of an optimal policy. We used the bounds to recommend managing highly populated islands as a priority.

4. Our approach calculates upper and lower bounds for the optimal policy by solving simpler problems that are guaranteed to perform better and worse than the optimal policy respectively. By providing bounds on the optimal solution, the performance of policies can be evaluated even if the optimal policy cannot be calculated. Our general approach can be replicated for problems where simultaneous actions of different durations need to be implemented.

Key-words: Aedes albopictus, invasive species, Markov decision processes, mosquito, optimal management, performance bounds, simultaneous actions, susceptible-infected-susceptible, threatened species

Introduction

Managing dynamic ecological systems is often constrained by limited resources, leading managers to use mathematical methods to make cost-effective decisions (Duke, Dundas & Messer 2013). Given a specified management objective, sequential decisions can be optimised with an algorithm called stochastic dynamic programming (SDP; Marescot et al. 2013). When computational resources are sufficient relative to the complexity of the problem, SDP returns an optimal policy, i.e. the action to implement in each state of the system, and the performance (or value) of this policy (Puterman 1994). In behavioural ecology, SDP is used to assess if species optimise their reproductive fitness over time (Houston et al. 1988; Venner et al. 2006). In applied ecology, SDP has become an essential decision-making tool when information is missing, with applications in prioritising global conservation effort (Wilson et al. 2006), weed control (Firn et al. 2008), disease management (Chades et al. 2011), species migration (Nicol et al. 2015), fire regime management (McCarthy, Possingham & Gill 2001) and adaptive management (Walters & Hilborn 1978; Hauser & Possingham 2008).


To achieve a management objective faster, several actions can be implemented simultaneously. In particular, for spatial problems, simultaneous actions in different locations must be optimised, for example in forestry (Forsell et al. 2011) or invasive or threatened species management (Monterrubio, Rioja-Paradela & Carrillo-Reyes 2015; Mantyka-Pringle et al. 2016). Additionally, there are many examples in the ecological literature of actions of different durations (Phelan, Norris & Mason 1996; Pelizza et al. 2010).

To date, little research has focused on simultaneous actions (Boutilier & Brafman 1997). In artificial intelligence, simultaneous actions have become important when several decision problems merge (Singh & Cohn 1998), or when actions have random durations (Rohanimanesh & Mahadevan 2003). Accommodating simultaneous actions of different durations is challenging because they terminate at different timesteps (Barto & Mahadevan 2003) and, thus, computing an exact SDP to find an optimal policy requires high memory demands and computation time. A workaround is to use approximate methods. Approximate algorithms focus on maximising the value of policies but, in practice, policies that cannot be explained in ecological terms will not be applied by managers (Walters 1986). Identifying sensible rules of thumb, i.e. simplified versions of more complex policies, is often preferred (Chades et al. 2008; Grechi et al. 2014). However, this simplification causes a loss of value that is often unknown to managers (Pichancourt et al. 2012).

Here, we introduce two approximate models that provide upper and lower bounds on the optimal performance at an advantageous computational cost and will allow decision-makers to find well-performing rules of thumb. Obtaining an upper bound and lower bound of the unknown optimal performance is useful, as calculating the error in the performance of a rule of thumb relative to the upper bound, e.g. 10%, guarantees that this rule of thumb is within 10% of the optimal performance. We apply our approach to the management of the invasive mosquito Aedes albopictus in the Torres Strait Islands, Australia. This approach can be replicated for large problems when simultaneous actions have different durations to evaluate and increase the reliability of rules of thumb.

Materials and methods

MARKOV DECISION PROBLEMS AND STOCHASTIC DYNAMIC PROGRAMMING

Markov decision processes (MDPs) are mathematical frameworks for modelling sequential decision problems where the outcome is partly stochastic and partly controlled by a decision-maker. An MDP is defined by five components <S, A, P, r, C> (Puterman 1994): (i) a state space S, (ii) an action space A, (iii) a transition function P for each action, (iv) immediate rewards r and (v) a performance criterion C.

The decision-maker aims to direct the process towards rewarding states, motivated by a performance criterion. From a given state s, the decision-maker selects an action a and receives a reward r(s, a). At the next timestep, the system transitions to a subsequent state s′ with probability P(s′|s, a). The performance criterion C specifies the objective (e.g. maximise or minimise a sum of expected future rewards), the time horizon (finite or infinite), the initial state s0 and whether there is a discount rate (γ). A policy π describes which decisions are made in each state, i.e. π: S → A.

Solving an MDP means finding a policy that optimises the performance criterion (an optimal policy). Stochastic dynamic programming denotes a collection of solution methods to solve MDPs, such as policy iteration and value iteration (see Marescot et al. (2013) and Appendix S1, Supporting Information for an overview). SDP is an efficient algorithm since it runs in polynomial time, but may not be tractable when the state or action spaces are very large, thus requiring alternative approaches (Nicol & Chades 2011).
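For readers unfamiliar with SDP, the value iteration variant can be sketched in a few lines. The following is a minimal Python/NumPy illustration with assumed array shapes and names; the case study itself is solved with MATLAB and the MDPSolve package, as described below.

    import numpy as np

    def value_iteration(P, R, gamma=0.95, tol=1e-8, max_iter=10_000):
        """Minimal value iteration for an MDP.

        P: array of shape (nA, nS, nS), P[a, s, s'] = transition probability.
        R: array of shape (nS, nA), immediate rewards r(s, a).
        gamma: discount rate (the case study uses gamma = 1 with an absorbing state).
        Returns the value function V and a greedy policy.
        """
        n_actions, n_states, _ = P.shape
        V = np.zeros(n_states)
        for _ in range(max_iter):
            # Q[s, a] = r(s, a) + gamma * sum_{s'} P(s'|s, a) * V(s')
            Q = R + gamma * np.einsum('ast,t->sa', P, V)
            V_new = Q.max(axis=1)
            if np.max(np.abs(V_new - V)) < tol:
                V = V_new
                break
            V = V_new
        policy = Q.argmax(axis=1)
        return V, policy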

EXTENSION OF MDP FOR SIMULTANEOUS DECISIONS OF DIFFERENT DURATIONS

A limitation of MDPs is that all actions must occur for the same duration. Herein, we provide a method to overcome this limitation. Specifically, we address decision problems where an action a can be decomposed into N sub-actions a1, a2, ..., aN at each timestep, with ai ∈ Ai. As an example of the distinction between actions and sub-actions, consider a management strategy for a network of N connected sites. An action is comprised of N sub-actions applied to the individual sites. Each sub-action may have a different duration d(ai) and must be implemented for its full duration. The transition function and rewards may depend on the sub-actions a1, a2, ..., aN currently implemented.

We propose an MDP model that solves this decision problem optimally, called the exact model. To fit the MDP framework, we need to respect the Markov property, which requires that subsequent states can be predicted using only the current state and action. To ensure that all actions are implemented until completion, we augment each state s ∈ S with information about which sub-actions are currently implemented (a1, a2, ..., aN) and the number of timesteps until each finishes (noted t1, t2, ..., tN, with ti ∈ ℕ). Formally, each state of the exact model becomes (s, a1, t1, a2, t2, ..., aN, tN). The new state space is denoted Sexact. The set of possible actions A(s, a1, t1, ..., aN, tN) that can be implemented depends on the current MDP state: if ai is not finished (ti > 0), then ai must continue; if ai has just terminated (ti = 0), all sub-actions are possible.

The transition function Pexact should not only contain the transition function P on states of S but also update the elements (ai, ti), i.e. initialise (ai, d(ai) − 1) when ai begins, and then subtract 1 from ti at each timestep until ai is completed. The rewards r are the same in the original MDP and the augmented exact model.

We can apply SDP to this exact model <Sexact, A, Pexact, r> to find the optimal policy and its performance, noted V*. However, the state space Sexact is exponential in N:

|Sexact| = |S| ∏_{i=1}^{N} (1 + Σ_{ai ∈ Ai} (d(ai) − 1))   (Appendix S2)

SDP is likely to be intractable for all but trivial values of N (see Results). This motivates us to introduce two approximate models, the lower bound model and the upper bound model.
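To see how quickly Sexact grows, here is a small illustrative Python calculation (ours, not from the paper) for the case-study sub-action durations of 1, 6 and 6 timesteps, using |S| = 2^N + 1 island-status combinations.

    def exact_state_count(n_islands, durations=(1, 6, 6)):
        """|S_exact| = |S| * prod_i (1 + sum_{a_i in A_i} (d(a_i) - 1)), with |S| = 2^N + 1."""
        per_island = 1 + sum(d - 1 for d in durations)   # 1 + 0 + 5 + 5 = 11 for durations (1, 6, 6)
        return (2 ** n_islands + 1) * per_island ** n_islands

    for n in (4, 8, 13):
        print(n, exact_state_count(n))

Even before unaffordable actions are pruned, this count exceeds 10^16 for 13 islands, which illustrates why the exact model runs out of memory well before the full 17-island network.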

These two models are obtained by synchronising sub-actions, which forces sub-actions to finish simultaneously. As a consequence, the performances of the lower and upper bound models will be lower and higher, respectively, than the performance of the exact model. With all actions finishing simultaneously, the number of states can be reduced dramatically and larger problems can thus be addressed.

LOWER BOUND MODEL

The lower bound model is obtained by modifying the exact model in two steps.


First, we add a synchronisation constraint, which forces all sub-actions to be implemented as many times as necessary to end simultaneously (Fig. 1). To do so, we forbid changing any sub-action while at least one sub-action is in progress. The resulting MDP leads to a performance equal to or lower than the exact model (Appendix S3). Intuitively, since the lower bound model is less flexible than the exact model, fewer policies are possible and performance decreases.

Second, we reformulate the state space obtained by synchronisation to remove unnecessary states. The states where at least one sub-action is in progress are unnecessary to obtain the optimal policy. Given an action a, the sub-actions a1, a2, ..., aN will finish simultaneously after the least common multiple (LCM) of the durations d(a1), d(a2), ..., d(aN):

LCM{d(ai), 1 ≤ i ≤ N} = min{d ∈ ℕ : ∀i, 1 ≤ i ≤ N, d = ki d(ai) for some ki ∈ ℕ}

We note this duration LCM(a). We remove states with a duration shorter than LCM(a) and assign the sub-actions of a the duration LCM(a). The new state space is S ⊂ Sexact.

The new action space is the same as in the exact model except that it is defined on the subset S ⊊ Sexact.

Because our aim is to compare the performance of the exact and lower bound models, we define the transition function and rewards such that a policy calculated with the lower bound model will have the same performance as when evaluated using the exact model. Since the new transitions last several timesteps during which the action does not change, a transition over d timesteps is made of d times the same transition: for every action a, Plower is the function (matrix) P raised to the power of the duration of a:

Plower(·|·, a) = P(·|·, a)^{LCM(a)}    (eqn 1)

The rewards rlower(s, a) should account for both the immediate reward r(s, a) and the expected rewards of the states that are not computed in the lower bound model (Fig. 1b):

rlower(s, a) = r(s, a) + Σ_{i=1}^{LCM(a)−1} γ^i Σ_{s′ ∈ S} P^i(s′|s, a) r(s′, a)    (eqn 2)

The resulting model <S, A, Plower, rlower> is a semi-MDP (Bradtke & Duff 1994), an extension of MDPs. Only one action can be implemented at a time; however, different actions can have different durations. Semi-MDPs can be solved efficiently with SDP. Reformulating the model by removing unnecessary states does not affect performance, so the optimal performance in the lower bound model Vlower is a lower bound of the optimal performance in the exact model V*:

Vlower(s) ≤ V*(s), ∀s ∈ S    (eqn 3)

The number of states in this model is greatly reduced, from |Sexact| = |S| ∏_{i=1}^{N} (1 + Σ_{ai ∈ Ai} (d(ai) − 1)) to |S|. This model can be solved for problems of larger sizes than the exact problem.
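The construction of the lower bound semi-MDP in eqns 1 and 2 can be sketched as follows. This is an illustrative Python/NumPy version under our own naming assumptions; the released case-study code is in MATLAB and uses MDPSolve.

    import numpy as np
    from math import lcm   # Python 3.9+

    def lower_bound_model(P_a, r_a, durations, gamma=1.0):
        """Build P_lower and r_lower for a single action a (eqns 1-2).

        P_a: (nS, nS) one-step transition matrix P(.|., a).
        r_a: (nS,) immediate rewards r(s, a).
        durations: durations d(a_i) of the sub-actions composing a.
        """
        L = lcm(*durations)                       # LCM(a): synchronised duration
        # eqn 1: P_lower(.|., a) = P(.|., a)^LCM(a)
        P_lower = np.linalg.matrix_power(P_a, L)
        # eqn 2: r_lower(s, a) = r(s, a) + sum_{i=1}^{L-1} gamma^i sum_{s'} P^i(s'|s, a) r(s', a)
        r_lower = r_a.astype(float).copy()
        P_i = np.eye(P_a.shape[0])
        for i in range(1, L):
            P_i = P_i @ P_a                       # i-step transition P^i
            r_lower += gamma ** i * P_i @ r_a
        return P_lower, r_lower

For the case-study action that applies light or strong management (six timesteps) alongside 'no action' (one timestep), LCM(a) = 6, so each synchronised transition summarises six original timesteps.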

UPPER BOUND MODEL

Like the lower bound model, the upper bound model is built in two steps from the exact model.

First, we allow management actions to be interrupted before completion in order to start different management actions (Fig. 1c). We reduce the duration of all actions to a unique duration (noted GCD for convenience), which equals the greatest common divisor (GCD) of the durations of all sub-actions:

GCD{d(a) : a ∈ ∪_{i=1}^{N} Ai} = max{d ∈ ℕ : ∀a ∈ ∪_{i=1}^{N} Ai, d(a) = d ka for some ka ∈ ℕ}

Note that GCD does not depend on the implemented sub-actions, but rather on the set of all sub-actions available. GCD must evenly divide all durations d(a) to ensure that a management action a applied repeatedly in the upper bound model can last d(a) timesteps, its duration in the exact model. Any policy in the exact model can also be implemented in the upper bound model. Since the upper bound model is more flexible than the exact model due to shorter actions (technically, a relaxation), the resulting MDP leads to an equal or higher performance than the exact problem formulation (Appendix S3).

Second, we reformulate the state space to remove unnecessary states. As per the lower bound model, the states between times t = 0 and t = GCD require no decisions and can be removed.

The state space and the action space of the upper bound model are the same as in the lower bound model (S and A). For each action, the new transition function Pupper is P raised to the power of the duration GCD. The new rewards rupper(s, a) must take into account both the immediate reward r(s, a) and the expected rewards in the removed states, which are no longer computed:

rupper(s, a) = r(s, a) + Σ_{i=1}^{GCD−1} γ^i Σ_{s′ ∈ S} P^i(s′|s, a) r(s′, a), ∀s ∈ S, ∀a ∈ A(s)    (eqn 4)

Fig. 1. Schematic of exact (a), lower bound (b) and upper bound (c) policies over eight timesteps. We use the sub-actions of the case study as an example, with N = 2 islands and A1 = A2 = {no action, light, strong} of durations one, six and six timesteps respectively (i.e. 6 months, and 3 years). Management actions can be asynchronous in the exact model (a). In the lower bound model (b), the curved arrows illustrate the supplementary constraint forcing the management action to continue, as the lowest common multiple of one and six is six. In the upper bound model (c), all actions are interrupted after GCD(1,6,6) = 1 timestep. For models (b) and (c), actions do not change between vertical bars, allowing a decrease in the number of states.

The resulting MDP <S, A, Pupper, rupper> is a semi-MDP whose optimal performance, Vupper, is an upper bound of the exact performance V*:

V*(s) ≤ Vupper(s), for all s ∈ S    (eqn 5)
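A matching sketch for the upper bound semi-MDP of eqn 4, under the same naming assumptions as the lower bound sketch above; the only changes are that the synchronised duration is the global GCD of all sub-action durations and that it applies to every action.

    import numpy as np
    from math import gcd
    from functools import reduce

    def upper_bound_model(P_a, r_a, all_durations, gamma=1.0):
        """Build P_upper and r_upper for a single action a (eqn 4).

        all_durations: durations of every available sub-action (the GCD is global).
        """
        G = reduce(gcd, all_durations)            # GCD of all sub-action durations
        P_upper = np.linalg.matrix_power(P_a, G)  # P(.|., a)^GCD
        r_upper = r_a.astype(float).copy()
        P_i = np.eye(P_a.shape[0])
        for i in range(1, G):
            P_i = P_i @ P_a
            r_upper += gamma ** i * P_i @ r_a     # expected rewards of the removed states
        return P_upper, r_upper

In the case study the sub-action durations are 1, 6 and 6 timesteps, so GCD = 1 and the upper bound model simply reuses the one-step transition function P.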

In conclusion, we have constructed two MDP models which provide lower and upper bounds of the exact performance:

Vlower(s) ≤ V*(s) ≤ Vupper(s), for all s ∈ S    (eqn 6)

These approximate models require |S| states to be solved, many fewer than the exact model. Importantly, unlike the policy of the lower bound model, the policy of the upper bound model cannot be implemented (as it violates the duration constraints of the actions). However, it provides a valuable upper bound against which to compare viable sub-optimal policies.

We provide the MATLAB code solving our case study at https://doi.org/10.6084/m9.figshare.4557565. It uses the MDPSolve package (https://sites.google.com/site/mdpsolve/). The necessary input parameters are provided in Appendix S4.

CASE STUDY: MANAGING A. ALBOPICTUS IN THE TORRES STRAIT ISLANDS

Aedes albopictus is a highly invasive species and a vector of several arboviruses, including chikungunya and dengue viruses (Bonizzoni et al. 2013). Aedes albopictus was first detected in the Torres Strait Islands in 2005 (Ritchie et al. 2006), where it persists today. These islands are potential sources of dispersal between Indonesia, Papua New Guinea (PNG) and mainland Australia (Beebe et al. 2013) via numerous human-mediated pathways including local boats, airplanes and ferries (Fig. 2). Herein, for simplicity, we consider Indonesia and PNG as a single potential source of A. albopictus referred to as PNG.

If A. albopictus were to establish on mainland Australia, its invasion is expected to be widespread and persistent (Hill, Axford & Hoffmann 2014), and extremely challenging to control (Beebe et al. 2013). Further, Australia's main population centres would likely become receptive to dengue transmission (Russell et al. 2005) and subject to significant biting nuisance (Beebe et al. 2013).

Since the detection of A. albopictus in the Torres Strait, several management actions have been implemented. These include community education, insecticide applications to harbourage areas (e.g. vegetation which provides resting habitat for adult mosquitoes) and domestic housing, and chemical treatment or disposal of container larval habitats (e.g. plant pots, sagging tarps, etc.). We distinguish two levels of such management actions: light and strong, the latter being costlier but more effective. Since the budget is limited, not all islands can be managed simultaneously. At each timestep (6 months), decision-makers must decide which of the 17 inhabited islands should be managed to protect mainland Australia. Since we assume that mainland Australia cannot be successfully managed if infested, our objective is to maximise the mean time until A. albopictus invades mainland Australia.

States and transition function of the susceptible-infected-susceptible model

We model the mosquitoes' dispersal over time using a Susceptible-Infected-Susceptible (SIS) network (Pastor-Satorras & Vespignani 2001). The SIS model allows the locations (islands hereafter) to be either infested with, or susceptible to, an invasive species. In our case study, each state s represents the infestation status of all Torres Strait Islands and mainland Australia. Because we assume that mainland Australia cannot be managed, we define 'mainland Australia infested' as an absorbing state (i.e. sink), noted ρ. All other states are of the form (s1, s2, ..., sN), with si the status of island i. Formally, S = {ρ} ∪ {(s1, s2, ..., sN); si ∈ {infested, susceptible}, 1 ≤ i ≤ N}.

In an SIS model, the status of each location may change at each timestep in two ways. First, infested locations can become susceptible when the species goes locally extinct: in our case study, the extinction probability equals the effectiveness of the action currently implemented (see next paragraph). Second, links between locations represent risks of reinfestation of susceptible locations from infested ones. Here, we defined the (re)infestation probability as follows: mainland Australia remains susceptible at the next timestep (i.e. s_{t+1} ≠ ρ) with probability ∏_{i ∈ {1,2,...,N}} (1 − piM), where piM is the infestation probability from an infested island i to mainland Australia. Conversely, the probability of the mainland becoming infested (i.e. s_{t+1} = ρ) is 1 − ∏_{i ∈ {1,2,...,N}} (1 − piM). The (re)infestation probability of all islands from neighbouring islands follows suit.

Fig. 2. Map of the Torres Strait showing the nodes of the model (PNG (infested), the populated Torres Strait Islands (partially infested) and mainland Australia (not infested)) as red squares. Blue lines illustrate possible invasion pathways of Aedes albopictus between nodes via human-mediated transport including local boats, airplanes or ferries. Pathways with a small transmission probability are not shown for clarity.
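The per-timestep dynamics described above can be sketched as follows. This is a hypothetical Python illustration with our own function and variable names; the within-timestep ordering of local extinction and reinfestation is our assumption, and PNG can be treated as an always-infested source node folded into the pairwise probabilities.

    import numpy as np

    rng = np.random.default_rng(0)

    def sis_step(infested, p, p_to_mainland, eradication_prob):
        """One SIS timestep for the island network.

        infested: boolean array (N,), current infestation status of the islands.
        p: (N, N) matrix, p[i, j] = probability that island i infests island j.
        p_to_mainland: (N,) array of p_iM, probability island i infests the mainland.
        eradication_prob: (N,) effectiveness of the sub-action on each island.
        Returns (new_infested, mainland_infested).
        """
        n = len(infested)
        # Mainland becomes infested with probability 1 - prod_i (1 - p_iM) over infested islands.
        p_mainland_safe = np.prod(1.0 - p_to_mainland[infested])
        mainland_infested = rng.random() > p_mainland_safe
        new_infested = infested.copy()
        for j in range(n):
            if infested[j]:
                # Infested islands go locally extinct with the action's effectiveness.
                new_infested[j] = rng.random() >= eradication_prob[j]
            else:
                # Susceptible islands are reinfested from currently infested neighbours.
                p_safe = np.prod(1.0 - p[infested, j])
                new_infested[j] = rng.random() > p_safe
        return new_infested, mainland_infested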

Rewards and actions of the SIS model

The reward r reflects our objective to prevent the infestation of mainland Australia. We set a reward of 0.5 when the mainland is not infested (s ≠ ρ), and 0 if it is infested (s = ρ). With a discount of γ = 1, we receive 0.5 every timestep (6 months) until mainland Australia becomes infested. The performance of any policy (i.e. the expected cumulative reward obtained) therefore equals the mean time until infestation expressed in years, since 0.5 years accrue per 6-month timestep. Although γ = 1, the mean time to infestation is finite because we assume that PNG is an infinite source of mosquitoes.

Two management actions are possible (light and strong), but the budget allows implementation of only one light management and one strong management across all islands, or three light managements, at each timestep. A sub-action ai can take values in Ai = {no action, light, strong}. The maximum budget is accounted for by reducing the set of possible actions A(s) ⊆ ∏_{i=1}^{N} Ai (unaffordable actions and their related states are not computed). The effectiveness p(ai) of sub-action ai is defined as the probability of eradicating the mosquito over one timestep and depends on the characteristics of island i (Appendices S5 and S6). Finally, the durations d(ai) vary: no action lasts one timestep (6 months), and light and strong management last six timesteps each.

Parameters

Data were collected at an expert elicitation workshop to estimate the effectiveness of actions based on characteristics of the Torres Strait Islands (Martin et al. 2012; Appendices S5 and S6). Experts in invasive species, vector biology and ecology, mosquito control, public health management and biosecurity estimated the effectiveness of all three management actions. Estimates accounted for island characteristics including size, vegetation refuge and accessibility (terrain), which influence both the operational feasibility of actions and the habitat suitability for A. albopictus. We used a Bayesian network to calculate the effectiveness of actions depending on these characteristics (Appendix S6, Clark 2005).

We were unable to collect data on the probability of transmission of A. albopictus between islands. In the absence of data, the probability of transmission between islands was derived using Cauchy dispersion kernels (Pitt 2008). Experts agreed that the transmission between two given islands likely depended on the number of inhabitants and the distance between islands; larger populations and proximal islands have higher transmission probabilities. The transmission probability pji from an island i to j depends on the island populations, popi and popj, and the distance between the islands dij:

pji = D popi popj / (1 + (dij / β)^2)    (Cauchy)    (eqn 7)

where D is a constant influencing the speed of transmissions through the network, and β is the shape parameter. We calibrated two sets of parameters arbitrarily, namely low and high transmission probabilities, leading to slow and fast infestations of mainland Australia respectively. The range of mean times to infestation captures the time to infestation estimated by experts (Appendices S7 and S8).
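Eqn 7 translates directly into code. In the sketch below (ours), the constants D and beta are placeholders rather than the calibrated low- and high-transmission values, which are given in Appendix S7.

    import numpy as np

    def transmission_prob(pop, dist, D=1e-6, beta=10.0):
        """Pairwise transmission probabilities p_ji from island i to island j (eqn 7).

        pop: (N,) human population of each island.
        dist: (N, N) distances between islands (same unit as beta).
        D, beta: placeholder calibration constants (transmission speed and kernel shape).
        """
        kernel = 1.0 / (1.0 + (dist / beta) ** 2)     # Cauchy dispersal kernel
        p = D * np.outer(pop, pop) * kernel           # p[i, j] = D * pop_i * pop_j * kernel(d_ij)
        return np.clip(p, 0.0, 1.0)                   # keep values interpretable as probabilities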

Computational experiments

We compared the optimal performances of our three models on our case study. Recall that the performance of a policy equals the mean time until infestation of mainland Australia. It was not necessary to run simulations for these proposed MDP models because both the optimal policy and its performance are direct outputs of SDP. Solving the exact 17-island network problem was computationally intractable (it runs out of the available 1000 GB of memory), so we gradually evaluated the performance of our proposed models on networks including an increasing number of islands (remaining islands do not affect the system). We first added the islands with the highest probability of directly infesting the mainland (rule 'highest transmission', see Appendix S5). We tested the robustness of our approach on two dispersal scenarios, with low and high mosquito transmission probabilities (see Parameters).

We also evaluated several simple rules of thumb, which consist of managing the highest-ranked infested islands according to the following rankings: (i) largest populations; (ii) closest to the mainland; (iii) easiest to manage (islands where actions have the highest probability of success); and (iv) highest transmission probability toward the mainland. We calculated the performance of (v) continuously implementing strong managements on all islands ('all managed', i.e. unlimited budget) and (vi) no actions. We ran 10 000 simulations to assess the performance of these rules of thumb and recorded the mean times of infestation and 90% confidence intervals.
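The rule-of-thumb evaluation amounts to a Monte Carlo estimate of the mean time to infestation. A minimal sketch (ours, assuming a user-supplied simulator such as the sis_step function sketched earlier) is:

    import numpy as np

    def evaluate_rule_of_thumb(ranking, simulate_until_infested, n_runs=10_000, seed=1):
        """Estimate the mean time (years) to mainland infestation and a 90% CI.

        ranking: island priority order used by the rule of thumb.
        simulate_until_infested: function(ranking, rng) -> number of 6-month
            timesteps until the mainland becomes infested (one stochastic run).
        """
        rng = np.random.default_rng(seed)
        times = np.array([simulate_until_infested(ranking, rng) for _ in range(n_runs)])
        years = 0.5 * times                                        # two timesteps per year
        mean = years.mean()
        half_width = 1.645 * years.std(ddof=1) / np.sqrt(n_runs)   # normal-approximation 90% CI
        return mean, (mean - half_width, mean + half_width)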

Results

LOW TRANSMISSION PROBABILITIES

For all models, infestation of mainland Australia happens sooner as more islands are included in the analysis (as expected; Fig. 3). A steep decrease in the mean time until infestation occurs when considering up to five islands (20–30 years/island), followed by a gradual decrease (approximately 1 year/island) until all islands are included in the analysis. This is because we incrementally included the islands with the rule 'highest transmission'.

As expected, the performance of the exact model is between the upper bound model on one side and the lower bound model and all rules of thumb on the other. The exact model runs out of memory for more than eight islands, while the lower and upper bound models are tractable up to 13 islands. This difference is attributable to a higher number of states in the exact model, with a ratio of up to ∏_{i=1}^{N} (1 + Σ_{ai ∈ Ai} (d(ai) − 1)) = 11^N, since for each island 1 + (1 − 1) + (6 − 1) + (6 − 1) = 11 (the ratio is lower in practice since states related to unaffordable actions are disregarded; see Materials and methods).

All model performances are equal when only one island (Thursday Island) is included because there are no simultaneous management actions. For more than one island, all performances remain similar, in particular those of the lower bound and the exact models. We have chosen to display the performance of the best rule of thumb only, 'highest transmission', which performs equally to our lower bound. Other rules of thumb ('largest population', 'closest to mainland' and 'easiest to manage') perform worse than 'highest transmission' (Appendix S9). The performances of 'no actions' and 'all managed' illustrate the worst and best possible outcomes, respectively, but these bounds are less informative about the optimal performance because they are much wider than the lower and upper bound models.

To assess the quality of the upper bound, we calculated the relative errors of all models compared to the upper bound (Table 1). The relative error of the exact model remains less than 14% and shows that this upper bound remains close to the exact performances when islands are added. The relative error of the lower bound remains less than 16% for all numbers of islands considered, guaranteeing that the lower bound equals at least 84% of the exact performance in our case study. Note that the lower bound is a very close approximation of the exact policy (see Discussion). The relative error of 'highest transmission' is similar to that of the lower bound and remains less than 17%.

HIGH TRANSMISSION PROBABILITIES

When assuming high transmission probabilities, the mean time until infestation of mainland Australia under our best rule of thumb is shorter (13 years for 17 islands) than that calculated using low transmission probabilities (50 years for 17 islands; Fig. 4). The differences between all models are smaller than with the low transmission probabilities. The rule of thumb 'highest transmission' performs consistently well, while others (not shown) underperformed.

The lower and upper bounds are closer together than with the low transmission probabilities. This is confirmed by the relative error compared to the upper bound (Table 2). The relative error of the exact model (<7%) shows that this upper bound remains close to the exact performances regardless of the number of islands considered. The relative error of the lower bound model reaches 9% (6% when the exact model is no longer tractable). The relative error of 'highest transmission' is less than 9%.

OPTIMAL POLICIES – LOW AND HIGH TRANSMISSION PROBABILITIES

When running simulations of the policies recommended by the exact, lower bound and upper bound models, some islands appear more important than others. It is therefore possible to identify an order (the prioritisation ranking) in which islands should be managed until eradication. Further, this prioritisation ranking is very similar for all three policies (when tractable) and for both high and low transmission probabilities.

When considering four islands (Appendix S10), all policies prioritise Thursday and Horn Islands (in this order) before Mulgrave and Banks Islands, i.e. this prioritisation ranking matches the 'highest transmission' ranking exactly. These two rankings are not the same when more islands are included (upper bound model for 11 islands; Fig. 5 and Appendix S5) because factors other than 'highest transmission' also affect the optimal policies. One such factor is the effectiveness of management: ineffective management actions on Banks cause it to be ranked eighth on the prioritisation ranking against fourth on the 'highest transmission' ranking. Another factor is the proximity of islands: Jervis Island is fifth on the prioritisation ranking against ninth on the 'highest transmission' ranking. A possible interpretation is that Jervis, when compared to Yam or Coconut for example, is close to critical islands such as Thursday, Horn and Mulgrave.

Discussion and concluding remarks

We developed a new approach to assist decision-makers when actions are simultaneous and of different durations. This approach modifies time constraints to reduce the model size by several orders of magnitude to obtain bounds on the unknown exact performance. We applied this to the spatial management of an invasive mosquito, A. albopictus, modelled as a SIS network.

Fig. 3. Mean time to infestation of mainland Australia (in years) for the exact, lower bound and upper bound models in the case of low transmission probabilities, as islands are progressively included in the analysis (from Thursday Island alone to all 17 islands) until each model is no longer tractable. Only the best rule of thumb, 'highest transmission', is shown, alongside the 'no action' and 'unlimited budget' baselines. The exact model is tractable for up to eight islands, the lower and upper bound models for up to 13 islands. All rules of thumb are tractable up to 17 islands. The 90% confidence intervals are smaller than the symbols displayed in the graph and are not displayed for clarity.

The bounds provide a narrow range guaranteed to contain the performance of the exact optimal policy, for problems too large to compute the exact solution. This research impacts metapopulation and network management problems in biosecurity, health and ecology when the budget allows the implementation of simultaneous actions.

Our two approximate models share a number of advantages when compared to rules of thumb. First, they account for the consequences of actions on future events, which is necessary to select the best immediate action. The sensitivity analysis on low and high transmission probabilities shows that the lower bound model is less likely to underperform than rules of thumb, which are not guaranteed to perform well (Abel 2003). Second, our models can be evaluated exactly with SDP rather than using simulations. Third, the policies generated by our models can be used to derive efficient rules of thumb.

The performances of the lower and upper bound models are sensitive to the LCM and GCD of the durations of management actions. In our case study, the lower bound likely performs well because the LCM is exactly the duration of the management actions. We have run the tool with various durations to evaluate the sensitivity of the bound models to the GCD and the LCM (Appendix S11). When these durations share many divisors, the LCM and GCD are close, which leads to small relative errors between bounds. By contrast, when durations do not share many divisors, the relative errors between bounds increase.

Although the lower and upper bound models can be solved at a reduced computational cost, in our case study the memory size and computation times required still grow exponentially with the number of islands considered (Appendix S12). Here, the number of states is 2^N + 1 because we used a flat representation of states, i.e. each possible combination of island states is accounted for. To optimise the management of SIS networks, Chades et al. (2011) used factored MDPs to take advantage of the network structure, i.e. the independence of conditional probabilities. In our model, all islands are connected (complete network) and using factored MDPs provides no advantage. As data become available, it is likely that small transmission probabilities could be ignored to create a network structure that could be exploited by factored MDPs (Hoey et al. 1999; Forsell et al. 2011). We increased the number of islands managed incrementally, ignoring the influence of other islands. An alternative would be to aggregate the remaining islands as one island. However, this is not a trivial task as it requires aggregating a large number of states (Li, Walsh & Littman 2006). How to do so in the best way possible will be the aim of future research.

Fig. 4. Mean time to infestation of the Australian mainland (in years) for each model with high transmission probabilities, with one to 17 islands included (same models and rules of thumb as in Fig. 3).

Fig. 5. Prioritisation ranking of 11 islands using the upper bound policy. The rankings that emerge from the exact, lower bound and upper bound policies are the same when tractable. At each timestep, only the two infested islands with the highest ranking are managed, due to the limited budget.

MANAGEMENT IMPLICATIONS

All models target Thursday, Horn and Mulgrave Islands as management priorities, in this order, because these islands are highly populated and close to mainland Australia and, hence, have the highest probability of transmission to the mainland. Knowing that these islands are close to each other (favouring transmissions) and that Horn Island is the 'transport hub' of the Torres Strait adds further credence to their high prioritisation. The prioritisation of these three islands is insensitive to the number of islands included (1–13) and to the transmission probabilities (low/high), showing the robustness of this policy. However, the mean time until infestation greatly depends on the dataset: it ranges from 13 to 50 years when calculated using high (Fig. 4) and low (Fig. 3) transmission probabilities respectively. Obtaining more precise estimates of the transmission probabilities will produce a narrower time range estimate. Higher budgets allocated to management can also postpone infestation, more sensitively when transmission probabilities are low (40 years with no budget/80 years with unlimited budget) than high (10/15 years). A comprehensive sensitivity analysis would help the decision-maker set the most suitable budget.

Additional factors may influence our management recommendations. For example, A. albopictus is difficult to detect and decision-makers cannot be certain that an island is susceptible (Hawley 1988). It is possible to provide management recommendations that account for imperfect detection using partially observable MDPs (Chades et al. 2008). However, these models do not yet account for actions of different durations and are even more difficult to solve than MDPs. Other unknown factors may influence management recommendations, such as species interactions, increased migration flow and effects of climate change. Value of information studies could help decision-makers determine whether these unknown factors warrant adapting management recommendations (Canessa et al. 2015).

In our case study, practitioners keep managing A. albopictus for a fixed period of time on the targeted island despite the mosquitoes being undetected. This constraint was motivated by the imperfect detectability of A. albopictus (Hawley 1988), which may occur in other applications. For example, Chades et al. (2008) show that managing a threatened species with imperfect detectability can be optimal, even when we do not observe the species. This typically happens when the species is still deemed very likely to be present. Similarly, Regan, Chades & Possingham (2011) recommend managing invasive plants up to 4 years since the last detection to ensure eradication. Another motivation for having prolonged management in the absence of sightings is to decrease the suitability of mosquito habitat. For instance, managing soil organically for several years significantly reduces the susceptibility of a species of maize to an insect pest (Phelan, Norris & Mason 1996). Furthermore, to control a weed of rice, McIntyre, Mitchell & Ladiges (1989) recommend combining management actions of various durations and starting times, such as delaying flooding and establishing a sward of pasture during the coolest seasons. Our general approach could help optimise the spatial management of these problems. Other reasons for long actions may include operational constraints, such as fixed-length contracts for workers implementing actions.

Authors’ contributions

The project was devised by all authors. The optimisation models were developed and implemented by M.P. These optimisation models include the exact model and the lower and upper bound models. M.P. and I.C. wrote the manuscript; all authors substantially edited the manuscript. Data were collected by I.C., C.J., C.M.P. and N.A.S. Bayesian network analysis was conducted by M.P., C.J., C.M.P. and I.C.

Table 1. Relative errors (%) of model performances compared to the upper bound with low transmission probabilities for an increasing number of islands. When the exact performance is unknown, the relative error of any model to the upper bound specifies the guaranteed percentage difference between the model performance and the optimal performance. For example, a relative error of 10% guarantees that this model is within 10% of the optimal performance. The highest relative errors for each model are shown in bold. Intractability occurred due to memory limits.

No. of islands included    2      3      4      5      6      7      8      9      10     11     12     13
Exact                      3.91   9.42   11.5   12.2   13.0   13.1   13.6   Intractable
Lower bound                8.29   14.2   15.1   15.3   15.8   15.7   15.9   15.9   15.9   15.2   14.9   14.8
Highest transmission       2.82   10.3   14.6   14.7   13.9   16.9   16.6   16.7   16.8   15.7   15.2   15.7

Table 2. Relative errors (%) of model performances compared to the upper bound with high transmission probabilities for an increasing number of islands. The highest relative errors for each model are shown in bold.

No. of islands included    2      3      4      5      6      7      8      9      10     11     12     13
Exact                      3.22   6.34   5.32   4.85   4.96   4.68   4.44   Intractable
Lower bound                7.31   8.74   6.78   5.98   6      5.6    5.24   4.78   4.66   5.42   4.2    4.08
Highest transmission       3.14   8.7    7.78   4.93   6.53   6.65   6.28   4.93   5.61   3.21   5.14   6.37

Acknowledgements

This research is supported by an Industry Doctoral Training Centre scholarship (M.P.) and CSIRO Julius Career Awards (I.C., N.A.S.). We acknowledge the critical contributions of time and expertise provided by the A. albopictus Technical Advisory Group, and other experts who participated in the expert elicitation workshop. We also thank Andrew Higgins for valuable feedback on this manuscript. Computational resources and services used in this work were provided by the HPC and Research Support Group, Queensland University of Technology, Brisbane, Australia.

Data accessibility

All data relevant to this study, i.e. the effectiveness of actions and the probabilities of transmission of mosquitoes, are available at https://doi.org/10.6084/m9.figshare.4557562 (Peron 2017a). The MDP transition matrices are generated using the MATLAB code, which is available at https://doi.org/10.6084/m9.figshare.4557565 (Peron 2017b).

References

Abel, C.F. (2003) Heuristics and problem solving. New Directions for Teaching and Learning, 2003, 53–58.
Barto, A. & Mahadevan, S. (2003) Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13, 341–379.
Beebe, N.W., Ambrose, L., Hill, L.A. et al. (2013) Tracing the tiger: population genetics provides valuable insights into the Aedes (Stegomyia) albopictus invasion of the Australasian Region. PLoS Neglected Tropical Diseases, 7, e2361.
Bonizzoni, M., Gasperi, G., Chen, X. & James, A.A. (2013) The invasive mosquito species Aedes albopictus: current knowledge and future perspectives. Trends in Parasitology, 29, 460–468.
Boutilier, C. & Brafman, R. (1997) Planning with concurrent interacting actions. Proceedings of the American Association of Artificial Intelligence (AAAI-97), 720–726.
Bradtke, S.J. & Duff, M.O. (1994) Reinforcement learning methods for continuous-time Markov decision problems. Advances in Neural Information Processing Systems, 7, 393–400.
Canessa, S., Guillera-Arroita, G., Lahoz-Monfort, J.J., Southwell, D.M., Armstrong, D.P., Chades, I., Lacy, R.C. & Converse, S.J. (2015) When do we need more data? A primer on calculating the value of information for applied ecologists. Methods in Ecology and Evolution, 6, 1219–1228.
Chades, I., McDonald-Madden, E., McCarthy, M.A., Wintle, B., Linkie, M. & Possingham, H.P. (2008) When to stop managing or surveying cryptic threatened species. Proceedings of the National Academy of Sciences of the United States of America, 105, 13936.
Chades, I., Martin, T.G., Nicol, S., Burgman, M.A., Possingham, H.P. & Buckley, Y.M. (2011) General rules for managing and surveying networks of pests, diseases, and endangered species. Proceedings of the National Academy of Sciences of the United States of America, 108, 8323–8328.
Clark, J.S. (2005) Why environmental scientists are becoming Bayesians. Ecology Letters, 8, 2–14.
Duke, J.M., Dundas, S.J. & Messer, K.D. (2013) Cost-effective conservation planning: lessons from economics. Journal of Environmental Management, 125, 126–133.
Firn, J., Rout, T., Possingham, H. & Buckley, Y.M. (2008) Managing beyond the invader: manipulating disturbance of natives simplifies control efforts. Journal of Applied Ecology, 45, 1143–1151.
Forsell, N., Wikström, P., Garcia, F., Sabbadin, R., Blennow, K. & Eriksson, L.O. (2011) Management of the risk of wind damage in forestry: a graph-based Markov decision process approach. Annals of Operations Research, 190, 57–74.
Grechi, I., Chades, I., Buckley, Y., Friedel, M., Grice, A.C., Possingham, H.P., van Klinken, R.D. & Martin, T.G. (2014) A decision framework for management of conflicting production and biodiversity goals for a commercially valuable invasive species. Agricultural Systems, 125, 1–11.
Hauser, C.E. & Possingham, H.P. (2008) Experimental or precautionary? Adaptive management over a range of time horizons. Journal of Applied Ecology, 45, 72–81.
Hawley, W. (1988) The biology of Aedes albopictus. Journal of the American Mosquito Control Association, 1, 1–39.
Hill, M.P., Axford, J.K. & Hoffmann, A.A. (2014) Predicting the spread of Aedes albopictus in Australia under current and future climates: multiple approaches and datasets to incorporate potential evolutionary divergence. Austral Ecology, 39, 469–478.
Hoey, J., St-Aubin, R., Hu, A. & Boutilier, C. (1999) SPUDD: stochastic planning using decision diagrams. Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (ed. K.B. Laskey & G. Mason), pp. 279–288. Morgan Kaufmann, Stockholm, Sweden.
Houston, A., Clark, C., McNamara, J. & Mangel, M. (1988) Dynamic models in behavioural and evolutionary ecology. Nature, 332, 29–34.
Li, L., Walsh, T.J. & Littman, M.L. (2006) Towards a unified theory of state abstraction for MDPs. 9th International Symposium on Artificial Intelligence and Mathematics. ISAIM, Fort Lauderdale, FL, USA.
Mantyka-Pringle, C.S., Martin, T.G., Moffatt, D.B., Udy, J., Olley, J., Saxton, N., Sheldon, F., Bunn, S.E. & Rhodes, J.R. (2016) Prioritizing management actions for the conservation of freshwater biodiversity under changing climate and land-cover. Biological Conservation, 197, 80–89.
Marescot, L., Chapron, G., Chades, I., Fackler, P., Duchamp, C., Marboutin, E. & Gimenez, O. (2013) Complex decisions made simple: a primer on stochastic dynamic programming. Methods in Ecology and Evolution, 4, 872–884.
Martin, T.G., Burgman, M.A., Fidler, F., Kuhnert, P.M., Low-Choy, S., McBride, M. & Mengersen, K. (2012) Eliciting expert knowledge in conservation science. Conservation Biology, 26, 29–38.
McCarthy, M.A., Possingham, H.P. & Gill, A.M. (2001) Using stochastic dynamic programming to determine optimal fire management for Banksia ornata. Journal of Applied Ecology, 38, 585–592.
McIntyre, S., Mitchell, D. & Ladiges, P. (1989) Seedling mortality and submergence in Diplachne fusca: a semi-aquatic weed of rice fields. Journal of Applied Ecology, 26, 537–549.
Monterrubio, C.L., Rioja-Paradela, T. & Carrillo-Reyes, A. (2015) State of knowledge and conservation of endangered and critically endangered lagomorphs worldwide. Therya, 6, 11–30.
Nicol, S. & Chades, I. (2011) Beyond stochastic dynamic programming: a heuristic sampling method for optimizing conservation decisions in very large state spaces. Methods in Ecology and Evolution, 2, 221–228.
Nicol, S., Fuller, R.A., Iwamura, T. & Chades, I. (2015) Adapting environmental management to uncertain but inevitable change. Proceedings of the Royal Society of London B: Biological Sciences, 282, 20142984.
Pastor-Satorras, R. & Vespignani, A. (2001) Epidemic spreading in scale-free networks. Physical Review Letters, 86, 3200.
Pelizza, S.A., Scorsetti, A.C., Bisaro, V., Lastra, C.C.L. & García, J.J. (2010) Individual and combined effects of Bacillus thuringiensis var. israelensis, temephos and Leptolegnia chapmanii on the larval mortality of Aedes aegypti. BioControl, 55, 647–656.
Peron, M. (2017a) Supporting information.docx. figshare. https://doi.org/10.6084/m9.figshare.4557562.v1.
Peron, M. (2017b) Simultaneous actions.rar. figshare. https://doi.org/10.6084/m9.figshare.4557565.v1.
Phelan, P., Norris, K. & Mason, J. (1996) Soil-management history and host preference by Ostrinia nubilalis: evidence for plant mineral balance mediating insect–plant interactions. Environmental Entomology, 25, 1329–1336.
Pichancourt, J.B., Chades, I., Firn, J., van Klinken, R.D. & Martin, T.G. (2012) Simple rules to contain an invasive species with a complex life cycle and high dispersal capacity. Journal of Applied Ecology, 49, 52–62.
Pitt, J.P.W. (2008) Modelling the spread of invasive species across heterogeneous landscapes. Doctor of Philosophy, Lincoln University, Lincoln, New Zealand.
Puterman, M.L. (1994) Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA.
Regan, T.J., Chades, I. & Possingham, H.P. (2011) Optimally managing under imperfect detection: a method for plant invasions. Journal of Applied Ecology, 48, 76–85.
Ritchie, S.A., Moore, P., Carruthers, M. et al. (2006) Discovery of a widespread infestation of Aedes albopictus in the Torres Strait, Australia. Journal of the American Mosquito Control Association, 22, 358–365.
Rohanimanesh, K. & Mahadevan, S. (2003) Learning to take concurrent actions. Proceedings of the Annual Neural Information Processing Systems Conference, NIPS 2002, Vancouver, BC, Canada.
Russell, R.C., Williams, C.R., Sutherst, R.W. & Ritchie, S.A. (2005) Aedes (Stegomyia) albopictus - a dengue threat for southern Australia? Communicable Diseases Intelligence, 29, 296–298.
Singh, S. & Cohn, D. (1998) How to dynamically merge Markov decision processes. Advances in Neural Information Processing Systems (ed. M.I. Jordan, M.J. Kearns & S.A. Solla), pp. 1057–1063. MIT Press, Denver, CO, USA.
Venner, S., Chades, I., Bel-Venner, M.-C., Pasquet, A., Charpillet, F. & Leborgne, R. (2006) Dynamic optimization over infinite-time horizon: web-building strategy in an orb-weaving spider as a case study. Journal of Theoretical Biology, 241, 725–733.
Walters, C.J. (1986) Adaptive Management of Renewable Resources. Macmillan, New York, NY, USA.
Walters, C.J. & Hilborn, R. (1978) Ecological optimization and adaptive management. Annual Review of Ecology and Systematics, 9, 157–188.
Wilson, K.A., McBride, M.F., Bode, M. & Possingham, H.P. (2006) Prioritizing global conservation efforts. Nature, 440, 337–340.

Received 29 September 2016; accepted 22 December 2016

Handling Editor: Ryan Chisholm

Supporting Information

Details of electronic Supporting Information are provided below.

Appendix S1. Description of Markov decision processes.
Appendix S2. Calculation of the number of states of the exact model.
Appendix S3. Proofs of upper and lower bounds.
Appendix S4. Description of the input parameters required for the program.
Appendix S5. Table of the effectiveness of different management actions on all Torres Strait Islands.
Appendix S6. Belief Bayesian network providing the effectiveness of actions depending on four island characteristics.
Appendix S7. Parameters used in the Cauchy formula for the low and high transmissions.
Appendix S8. Human population size and distances between islands.
Appendix S9. Mean time until infestation of mainland Australia for the three models and six rules of thumb.
Appendix S10. Prioritisation ranking on four islands for low and high transmission probabilities.
Appendix S11. Relative errors of model performances compared to the upper bound with different sub-action durations.
Appendix S12. Computational times of the exact, lower bound and upper bound models for low transmission probabilities.
Data S1. MATLAB code used for computational experiments.


Chapter 4

Two approximate dynamic programming algorithms for managing complete SIS networks

In the previous chapter, we showed how to reduce the model size significantly in the case of simultaneous actions of different durations. However, the 17-island problem is still intractable; the problem is intractable even for more than 13 islands. The same issue occurs in many other real-world applications, where the system under study has too many features or locations (Section 2.1.3). Approximate approaches are needed for spatial sequential decision problems.

In this chapter, we address our second research question by proposing two new approximate dynamic programming algorithms adapted to large Susceptible-Infected-Susceptible networks. We demonstrate that the first approach comes with some performance guarantees and is less computationally complex than stochastic dynamic programming. We also prove that our second approximate approach runs in quadratic time in the number of nodes. These approaches are tractable on the management of Aedes albopictus (17 islands), as opposed to dynamic programming. They are also near-optimal on some of the largest problems for which we can compute the exact solution (10 islands). This chapter has been accepted to Compass 2018, a conference on computing and sustainable societies hosted at Facebook, California, from June 20–22, 2018.

Statement of joint authorship:

• Martin Peron developed the theory, designed and implemented the optimisation models, performed the analysis, drafted most of the manuscript and acted as corresponding author.

• Peter L. Bartlett conceived the presented idea by suggesting the use of continuous states and of a limited number of switches and guided the research.

• Kai Helge Becker edited the manuscript.

• Kate Helmstedt edited the manuscript.

• Iadine Chades guided the research and edited the manuscript.


Two approximate dynamic programming algorithms for managing complete SIS networks


ABSTRACT
Inspired by the problem of best managing the invasive mosquito Aedes albopictus across the 17 Torres Strait Islands of Australia, we aim at solving a Markov decision process on large Susceptible-Infected-Susceptible (SIS) networks that are highly connected. While dynamic programming approaches can solve sequential decision-making problems on sparsely connected networks, these approaches are intractable for highly connected networks. Inspired by our case study, we focus on problems where the probability of nodes changing state is low and propose two approximate dynamic programming approaches. The first approach is a modified version of value iteration where only those future states that are similar to the current state are accounted for. The second approach models the state space as continuous instead of binary, with an on-line algorithm that takes advantage of Bellman's adapted equation. We evaluate the resulting policies through simulations and provide a priority order to manage the 17 infested Torres Strait Islands. Both algorithms show promise, with the continuous state approach being able to scale up to high dimensionality (50 nodes). This work provides a successful example of how AI algorithms can be designed to tackle challenging computational sustainability problems.

ACM Classification Keywords
G.1.2 Numerical Analysis: Approximation; G.1.6 Numerical Analysis: Optimization; G.3 Probability and Statistics; I.2.8 Artificial Intelligence: Problem Solving, Control Methods, and Search; I.2.m Artificial Intelligence: Miscellaneous; J.3 Life and Medical Sciences

Author Keywords
Markov decision process; Susceptible-Infected-Susceptible networks; Aedes albopictus; Approximate dynamic programming; Invasive species; Optimal management; Computational sustainability


INTRODUCTION
Markov decision processes (MDPs) are a mathematical framework designed to optimize sequential decisions under uncertainty given a specific objective [1, 27]. MDPs can be solved in polynomial time by a method called stochastic dynamic programming [13]. However, in many real-world applications the states describing the system are factored. That is, states are naturally defined as a combination of sub-states. MDPs with such states are called factored MDPs. Sub-states can correspond to different features of the system [12], individuals in a population [30], spatial locations in a network [3] or products in an inventory problem [26]. An essential aspect of factored MDPs is that their number of states grows exponentially when the number of sub-states increases. So, since stochastic dynamic programming requires listing all reachable states [27], too many sub-states make stochastic dynamic programming intractable. This issue has been termed the curse of dimensionality [1].

There exist some exact MDP solvers tailored to solve factored MDPs, e.g. SPUDD [12]. SPUDD consists of using algebraic decision diagrams to represent policies and value functions, grouping together states that have the same value or optimal action (see also [31]). This approach works well when many sub-states are conditionally independent and poorly otherwise.

In this paper, we aim at optimizing management decisions on a particular type of factored MDP called a Susceptible-Infected-Susceptible (SIS) network. In an SIS network, each sub-state represents a node in an interconnected network that can be either susceptible or infected [3]. SIS networks are commonly used to model the spread of infectious disease or parasites in epidemiology [30, 28], meta-population dynamics of threatened or invasive species in ecology [3, 20] or computer viruses in computer science [21, 14]. Inspired by a real case study, the management of the Asian tiger mosquito Aedes albopictus in Australia [22], we aim at exploiting this particular structure to solve highly connected, large-size SIS-MDPs, thus providing good policies on large networks to decision makers and circumventing the curse of dimensionality.

Case study: managing invasive Aedes albopictus
The Asian tiger mosquito, Aedes albopictus, is a highly invasive species and a vector of several arboviruses that affect humans, including chikungunya and dengue viruses. These invasive mosquitoes were first detected in the Torres Strait Islands, Australia, in 2005 [29], where they persist today despite ongoing management effort. The N = 17 inhabited islands constitute potential sources for the introduction of Aedes albopictus into mainland Australia through numerous human-related pathways between the islands and towards north-east Australia (Figure 1).

Figure 1. The Torres Strait Islands. Connections between islands depict the possibilities of transmission of the mosquitoes towards susceptible islands. Low transmission probabilities are not shown for readability.

Local eradication of the mosquito is possible through management actions on islands such as treating containers and mosquitoes with diverse insecticides. After eradication, re-infestation can occur from connected infested islands. Since the budget is limited, not all islands can be treated simultaneously. The objective is to select islands to manage to maximize the expected time before the mainland becomes infested. Past attempts modeled this problem as an MDP and used stochastic dynamic programming (policy iteration) to find the optimal policy [22]. However, the approach failed to circumvent the curse of dimensionality. Only 13 out of the 17 Torres Strait Islands were accommodated, providing incomplete recommendations to managers. The main motivation of this paper is to provide an approach to accommodate all 17 Torres Strait Islands.

To do so, we have identified two noteworthy properties of this system. First, the network is 'complete', i.e. every node can be infested from any other node of the network. Consequently, local optimization approaches such as graph-based MDPs [6, 18], which only consider potential infestations from a small subset of neighboring nodes, are not well suited to this problem. Second, since local eradication is difficult to achieve and transmission rates are low, the probability for each sub-state (node) to change (either from susceptible to infested or vice versa) is small. This implies that the MDP state at the next timestep will likely be similar to the current state, i.e. a small number of sub-states are likely to change. The two approximate dynamic programming approaches we propose exploit these properties.

Approximate approaches
In the last decade, several approaches have been explored to solve large factored MDPs, with multiple applications in computational sustainability [6, 7, 18, 20]. Generally speaking, these approaches can be classified into three groups [26], all of which are relevant to our case study.

First, simulation-optimization methods consist of evaluating a number of policies through simulations and selecting the best one [34] (see also [16] in conservation biology). These approaches do not anticipate what might happen in the future [26], which is appropriate for our case study problem because states do not change frequently. Some approaches of the same flavor use cascade models to capture SIS dynamics, but do not involve sequential decisions [32].

Second, rolling horizon procedures (roll-out) use a prediction of the near future to save on potentially costly long-term predictions [18]. Typical approaches include model predictive control and Monte Carlo tree search [10]. Roll-out procedures have been used in conservation biology to solve SIS-MDPs that are large but much more weakly connected than the Torres Strait Island system [18, 19]. Finally, some hindsight optimization approaches can help optimize decisions on large networks, but with a focus on exponentially large action spaces [36].

Third, approximate dynamic programming (ADP) approaches explicitly estimate the values of states to derive optimal actions. For example, mean-field approximation algorithms [23, 10, 20] and approximate linear programming methods [6] approximate the value function by decomposing it into a sum of the values of each node. The value function is updated through local optimizations, for example, in our case, assuming that each node is only connected to a limited number of neighbors. Therefore, these approaches are not suited to highly connected networks, e.g. some have been reported to "work best when nodes have fewer than five neighbors" [20].

Inspired by these three classes of approaches, we introduce two new approximate approaches to address large and highly connected SIS-MDP networks. Our first approach is a simplification of Bellman's equation where only a small subset of the future states are considered. We demonstrate that this approach comes with some performance guarantee and is less computationally complex than stochastic dynamic programming. However, its complexity is still exponential in the number of sub-states. In contrast, our second approximate approach is a more radical approximation that runs in linear time in the time horizon and in quadratic time in the number of nodes, but has no performance guarantees. We assess our algorithms on our case study and compare their solutions to SPUDD when possible [12], the reference algorithm to solve factored MDPs.

MATERIAL AND METHODS

Markov decision processes
Markov decision processes (MDPs) are mathematical frameworks for modeling sequential decision problems where the outcome is partly stochastic and partly controlled by a decision-maker [1]. An MDP is defined by five components ⟨S, A, P, r, C⟩ [27]: (i) a state space S, (ii) an action space A, (iii) a transition function P, (iv) an immediate reward function r and (v) a performance criterion C.

The decision-maker aims to direct the process towards rewarding states. From a given state s, the decision-maker selects an action a and receives a reward r(s,a). At the next time step, the system transitions to a subsequent state s' with probability P(s'|s,a). The performance criterion C specifies the objective (e.g. maximize or minimize a sum of expected future rewards), the time horizon (finite or infinite), the initial state s_0 and whether there is a discount rate (γ). Here, we deal with a discounted infinite time horizon, where we maximize

E\left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\middle|\, s_0 \right]. \quad (1)

A policy π describes which decisions are made in each state, i.e. π : S → A. Solving an MDP means finding an optimal policy π* that satisfies, in our case:

\pi^* = \arg\max_{\pi} E\left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, \pi(s_t)) \,\middle|\, s_0 \right]. \quad (2)

Exact algorithms to solve MDPs include linear programming, value iteration, and policy iteration [33]. We choose to use value iteration, because its simplicity makes it easy to adapt to approximately solve large factored MDPs.

Value iteration
Value iteration requires the introduction of a value function V, defined on all states s. The value V(s) corresponds to the sum of future rewards one can expect, starting from the state s. The value function is unknown at the start of the algorithm, and it is customary to start with V = 0. The value function V is repeatedly improved, for each state, with Bellman's equation:

V'(s) = \max_{a \in A} \left[ r(s,a) + \gamma \sum_{s' \in S} P(s'|s,a)\, V(s') \right]. \quad (3)

Once this is evaluated for all states s ∈ S, we set V := V' and the process repeats until some termination condition is met, which can either consist of a maximum number of iterations (our choice in this manuscript) or a threshold ε under which the maximum difference between V and V' must fall (see [27]). The output policy is defined as

\pi(s) = \arg\max_{a \in A} \left[ r(s,a) + \gamma \sum_{s' \in S} P(s'|s,a)\, V(s') \right]. \quad (4)

Provided the reward is non-negative, the sequence of V is monotonic and guaranteed to converge [33]. The outputs of value iteration are a policy and a value function, and the value function is guaranteed to be within 2γε/(1−γ) of the optimal value function with the ε termination criterion [33]. We now describe Susceptible-Infected-Susceptible networks and the steps necessary to apply MDPs to them.
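For concreteness, a minimal sketch of this value iteration loop is given below (Python). It is an illustration only, not the authors' code: the names states, actions, reward and transition are assumed inputs, with transition(s, a) returning a dictionary of successor states and their probabilities.

# Minimal value iteration sketch (illustrative only, small MDPs).
def value_iteration(states, actions, reward, transition, gamma=0.95, n_iter=100):
    V = {s: 0.0 for s in states}                  # start with V = 0
    for _ in range(n_iter):                       # fixed number of Bellman backups (Eq. 3)
        V_new = {}
        for s in states:
            V_new[s] = max(
                reward(s, a) + gamma * sum(p * V[s2] for s2, p in transition(s, a).items())
                for a in actions
            )
        V = V_new
    # extract the greedy policy (Eq. 4)
    policy = {
        s: max(actions, key=lambda a: reward(s, a)
               + gamma * sum(p * V[s2] for s2, p in transition(s, a).items()))
        for s in states
    }
    return V, policy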

Susceptible-Infected-Susceptible (SIS) networks
SIS networks are used to model spatial systems where a species can spread over a network [3]. Each node in the network can be either infested (the terminology for invasive species) or susceptible (i.e. at risk of being infested). The species can infest new nodes by spreading. Infested nodes can be cured and re-infested.

Numbering each node from 1 to N, we denote by s_i the status of node number i: s_i = 1 if node i is infested, 0 otherwise. A transmission probability matrix p_{ji} describes the probability for mosquitoes to be transmitted from any infested node j to any susceptible node i. The probability for node i to remain 'susceptible' is then given by:

\prod_{j=1}^{N} (1 - s_j\, p_{ji}). \quad (5)

So, the probability to transition from 'susceptible' to 'infested' is 1 - \prod_{j=1}^{N} (1 - s_j\, p_{ji}).

All nodes are able to be managed with a sub-action. The effectiveness of a sub-action implemented on node i is denoted a_i. It is defined as the probability of locally eradicating the mosquitoes over one time step, which implies

\Pr(s'_i = 1 \mid s_i = 1) = 1 - a_i. \quad (6)

In this paper we address two common management objectives for SIS models: eradication and containment. In the eradication objective, the goal is to maximize the number of susceptible nodes [3], so the reward is defined as

r(s,a) = \sum_{i=1}^{N} (1 - s_i). \quad (7)

In the containment objective, the goal is to prevent the species from reaching a node i [30], so the reward can be defined as:

r(s,a) = 1 - s_i = \begin{cases} 0 & \text{if node } i \text{ is infested;} \\ 1 & \text{otherwise,} \end{cases} \quad (8)

if node i is to be protected.

SIS-MDPs
Sequential decision problems on SIS networks can be cast into MDPs as follows (we call the resulting MDP an SIS-MDP). Each state s describes the situation on all nodes and is of the form s = (s_1, s_2, ..., s_N). The transition function is defined as P(s'|s,a) = \prod_{i=1}^{N} \Pr(s'_i \mid s_i, a). Any dynamic programming approach, including value iteration, policy iteration or SPUDD, can then be applied to solve these SIS-MDPs [33].
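As a hedged illustration of Eqs. (5)-(6) and of this factored transition function, the sketch below (Python) computes the per-node infestation probability and the full-state probability P(s'|s,a). The variable names p (transmission matrix) and eff (the effectiveness a_i of the sub-action applied to each node) are ours, not part of the published code.

import itertools

def node_infested_prob(i, s, p, eff):
    """Probability that node i is infested at the next time step, given the
    current binary state s, transmission matrix p[j][i] and the eradication
    effectiveness eff[i] of the sub-action applied to node i."""
    if s[i] == 1:                       # infested: survives management with prob 1 - a_i (Eq. 6)
        return 1.0 - eff[i]
    stay_susceptible = 1.0              # susceptible: infested unless no infested node transmits (Eq. 5)
    for j in range(len(s)):
        stay_susceptible *= (1.0 - s[j] * p[j][i])
    return 1.0 - stay_susceptible

def transition_prob(s, s_next, p, eff):
    """P(s'|s,a) as a product of independent per-node transitions."""
    prob = 1.0
    for i in range(len(s)):
        q = node_infested_prob(i, s, p, eff)
        prob *= q if s_next[i] == 1 else (1.0 - q)
    return prob

# Hypothetical usage on a 3-node network:
# p_example = [[0, .1, .2], [.1, 0, .05], [.2, .05, 0]]; eff_example = [.3, 0, 0]
# probs = {s2: transition_prob((1, 0, 1), s2, p_example, eff_example)
#          for s2 in itertools.product([0, 1], repeat=3)}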

The main issue with this approach is that the number of states is |S| = 2^N, which is computationally prohibitive when N grows. For the Torres Strait mosquito network, only up to 13 nodes have been reported to be tractable [22]. In practice, one can distinguish three causes of intractability, called curses of dimensionality [26]. The first curse of dimensionality is the exponential number of states itself, which is prohibitive because the value function must be updated for each state (Eq. (3)). Further, each of these updates hinges upon a sum over future states (r.h.s. in Eq. (3)), which is equally prohibitive: this is the second curse of dimensionality. The third curse of dimensionality is the exponential number of actions, which does not apply in our case study because the total number of actions is limited by a budgetary constraint. Driven by this case study, we introduce two new approximate approaches to address the first two curses of dimensionality in SIS-MDPs.

First approximate approach: the Neighbor algorithm
The Neighbor algorithm is a modified version of value iteration (Algorithm 1). As mentioned in the Introduction, in our case study, the probability for each sub-state or node to change over one time step is low. So, future states will likely differ from the current state by a handful of nodes at most. Indeed, denoting p as an upper bound of the probability that any node changes, the probability that all nodes change is less than p^N. Also, because the node switches are mutually independent, the expected number of node switches is less than Np, so it is unlikely that many more than Np nodes switch. This insight is the basis for the Neighbor algorithm.

The Neighbor algorithm¹ consists of approximating Bellman's equation (Eq. (3)) for each state by limiting the number of sub-states K ∈ {0, ..., N} that can change over the next time step. By using the Kronecker delta (δ_{s'_i s_i} = 1 when s'_i = s_i and 0 otherwise), this can be formulated as the constraint \sum_{i=1}^{N} \delta_{s'_i s_i} \geq N - K on future states s' ∈ S (Lines 6-7). When K is set to N, the Neighbor algorithm is equivalent to the value iteration algorithm. When K is less than N, fewer future states are accounted for in the calculation of the future expected values than in the standard value iteration. This simplification decreases the computational complexity (see Proposition 2) but also decreases the precision (see Proposition 1).

In addition, the total number of iterations is set to the variable H (Line 2). Similarly to K, this variable can be tuned depending on the desired precision and computational complexity of the algorithm (Propositions 1 and 2). For example, in our experiments we chose H = 10 and K = 4. The policy returned by the Neighbor algorithm will be evaluated on the real, complete problem during the simulations.

We aim to find an upper bound for the error incurred by the Neighbor algorithm (Algorithm 1) as opposed to the optimal value iteration. To do so, we denote by:

• π_N the policy returned by the Neighbor algorithm;

• V_π the exact value function of any policy π : S → A, i.e.

V_\pi(s) = E\left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, \pi(s_t)) \,\middle|\, s_0 = s \right]; \quad (9)

Note that V_{π_N} is the exact value function of the policy π_N, and likely differs from the approximate value function V computed in the Neighbor algorithm.

• p is the maximum probability a node will change in one time step:

p := \max\left( \max_{a \in A,\, 1 \leq i \leq N} a_i,\;\; \max_{1 \leq i \leq N} \Big(1 - \prod_{j \neq i} (1 - p_{ji})\Big) \right); \quad (10)

¹The terminology Neighbor does not refer to nearby islands or nodes (geographically), but to MDP states that are similar. In this regard, this algorithm shares similarities with approaches restricting the support of the transition function (see [24] for an example).

• R_max := \max_{s \in S,\, a \in A} r(s,a) the maximum reward (we assume all rewards are nonnegative).

PROPOSITION 1. We assume that K ≥ Np. We have:

\|V_{\pi^*} - V_{\pi_N}\|_\infty \;\leq\; \frac{R_{max}\,\gamma^H}{1-\gamma} + \frac{R_{max}\,\gamma\, \exp\!\big(-2\,\frac{(K+1-Np)^2}{N}\big)}{(1-\gamma)^2}. \quad (11)\text{--}(12)

PROOF. See Appendix.

This proposition shows that increasing K or H will reduce the loss incurred when implementing the policy π_N returned by the Neighbor algorithm instead of the optimal policy π*.
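Purely as an illustration, the bound of Proposition 1 can be evaluated numerically to see how it tightens as K and H grow. The parameter values in the comments below are assumptions of ours, not figures from the paper.

import math

def neighbor_error_bound(R_max, gamma, H, K, N, p):
    """Upper bound of Proposition 1 on ||V_pi* - V_piN||_inf (requires K >= N*p)."""
    assert K >= N * p, "the bound assumes K >= N p"
    term_horizon = R_max * gamma**H / (1 - gamma)
    term_neighbors = R_max * gamma * math.exp(-2 * (K + 1 - N * p)**2 / N) / (1 - gamma)**2
    return term_horizon + term_neighbors

# Illustrative values only: N = 17 nodes, assumed per-node switch probability p = 0.1,
# R_max = 1 (containment reward), gamma = 0.99, H = 10.
# for K in (2, 4, 6, 8):
#     print(K, neighbor_error_bound(1.0, 0.99, 10, K, 17, 0.1))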

PROPOSITION 2. The Neighbor algorithm runs in O(H\, 2^N |A|\, N^K) operations, as opposed to O(H\, 4^N |A|) for value iteration.

Note that the number of actions |A| will likely depend on N as well. Here, because the number of actions grows only polynomially due to the budgetary constraint, we focus on the number of states (first and second curses of dimensionality).

PROOF. The first three for-loops of the algorithm are on H iterations, 2^N states and |A| actions. The number of times the last for-loop is computed equals

\sum_{k=0}^{K} \binom{N}{k} = \sum_{k=0}^{K} O(N^k) = O(N^K). \quad (13)
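To make the saving of Eq. (13) concrete, the short sketch below (Python, illustrative only) counts the future states enumerated by the restricted backup and compares it with the 2^N states of the exact backup.

from math import comb

def n_restricted_future_states(N, K):
    """Number of future states within Hamming distance K of the current state (Eq. 13)."""
    return sum(comb(N, k) for k in range(K + 1))

# For N = 17 nodes and K = 4, only 3,214 future states are summed over
# instead of 2^17 = 131,072 in the exact Bellman backup.
# print(n_restricted_future_states(17, 4), 2**17)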

This proposition shows in particular that increasing the value of K or H will increase the computational complexity of the Neighbor algorithm. Taken together, Propositions 1 and 2 show that one can trade off performance and computational expense by varying the parameters K and H. The complexity is still exponential in the number of current states but not in the number of future states. The second curse of dimensionality is circumvented². This algorithm should run faster than value iteration, but still falls prey to the first curse of dimensionality. The second approximate algorithm avoids this caveat.

The Neighbor algorithm is related to approximate dynamic programming [26] or approximate value iteration [33]. The difference with these classes of algorithms is that the Neighbor algorithm does not use an approximate representation of the value function such as a linear approximation. Instead, the approximation occurs in the probabilities involved in Bellman's equation (Eq. (3)). Also, the Neighbor algorithm uses expected value calculations instead of sampling, which is common in reinforcement learning [2, 35].
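A sketch of the restricted backup at the heart of the Neighbor algorithm (lines 4-9 of Algorithm 1) is given below (Python). It is an illustration under our own naming: trans_prob(s, a, s') is any callable returning P(s'|s,a), for instance built from the per-node probabilities sketched earlier, and V is a dictionary of current value estimates.

from itertools import combinations

def neighbors_within_K(s, K):
    """All states s' that differ from s in at most K nodes."""
    N = len(s)
    for k in range(K + 1):
        for idx in combinations(range(N), k):
            s2 = list(s)
            for i in idx:
                s2[i] = 1 - s2[i]          # flip the selected nodes
            yield tuple(s2)

def neighbor_backup(s, actions, reward, trans_prob, V, gamma, K):
    """One approximate Bellman backup restricted to K sub-state changes."""
    best_a, best_q = None, float("-inf")
    for a in actions:
        q = reward(s, a)
        for s2 in neighbors_within_K(s, K):
            q += gamma * trans_prob(s, a, s2) * V.get(s2, 0.0)
        if q > best_q:
            best_a, best_q = a, q
    return best_a, best_q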

Algorithm 1 Neighbor(K, H)
1: Initialization: V(s) = 0 for all states s ∈ S
2: for iter = 1 : H do
3:   for s ∈ S do
4:     for a ∈ A do
5:       Q(a) = r(s,a)
6:       for s' ∈ S such that \sum_{i=1}^{N} \delta_{s'_i s_i} \geq N - K do
7:         Q(a) = Q(a) + γ P(s'|s,a) V(s')
8:     π(s) = argmax_{a ∈ A} Q(a)
9:     V'(s) = max_{a ∈ A} Q(a)
10: V(s) = V'(s) for all states s ∈ S
Output: Policy π_N

²For increasing values of N, K needs to be increased to ensure that K ≥ Np is satisfied, which also increases the complexity. However, we show in the Appendix that the number of future states computed by the Neighbor algorithm is negligible compared to that of value iteration when N grows to infinity.

Second approximate approach: the Continuous algorithm
Our second approach, which we refer to as the Continuous algorithm, is an online algorithm (Algorithm 2): it only provides an action to implement in the current state. Thus, it avoids listing the states altogether and overcomes the first curse of dimensionality. This stands in contrast with the first approximate approach and dynamic programming, which return the entire policy for all states before implementation. The Continuous algorithm is a rollout algorithm, i.e. the values associated to different actions are evaluated through simulations over a moving time horizon of fixed duration H_c [33, 18].

As in the first approximate approach, this new approach is based on the observation that the probability for sub-states or nodes to change over one time step is small. As a consequence, the future MDP state will likely be similar to the previous state, if not identical. This implies that the same action will likely be applied multiple times. Thus, one can compare different actions by assuming that the action chosen will never change in the future: this establishes a first approximation (Lines 1-3, Algorithm 2). Then, the binary sub-state of each node, i.e. s_i ∈ {0,1} corresponding to susceptible or infested, is replaced by its (continuous) probability of infestation, i.e. s_i ∈ [0,1]. Treating discrete entities as continuous in an SIS context is common in continuous time [4] but is not common when optimizing decisions. When s_i was binary, the calculations of future infestation probabilities were written as

\Pr(s'_i = 1 \mid s_i = 1) = 1 - a_i, \qquad \Pr(s'_i = 1 \mid s_i = 0) = 1 - \prod_{j=1}^{N} (1 - p_{ji}\, s_j), \quad (14)

where p_{ji} is the probability of transmission from j (if infested) to i. It can now be adapted to these continuous sub-states as follows (Line 6):

s'_i = s_i (1 - a_i) + (1 - s_i)\Big(1 - \prod_{j=1}^{N} (1 - p_{ji}\, s_j)\Big). \quad (15)

These continuous sub-states establish a second approximation. They are considerably faster to calculate than the probability of each of the 2^N combinations of sub-states because the number of operations is quadratic in the number of nodes instead of exponential. However, these estimates are based on the probability of infestation of sub-states instead of using the precise conditional probabilistic relations between sub-states. Over many iterations, these estimates will diverge from the discrete case.

Similarly to the discrete case, we define the reward in the continuous case as follows: for the eradication objective, the reward at each time step is \sum_{i=1}^{N} (1 - s_i), which represents the average number of susceptible nodes, to maximize. For the containment objective, the reward is 1 − mainland, i.e. the probability that the mainland is not infested. These rewards are used to calculate Q(a), the cumulative 'score' of action a (Line 4). Note that Line 7 applies to the containment objective only. At the end of the rolling horizon, the action with maximum score is selected (Line 9).

PROPOSITION 3. The Continuous algorithm runs in O(|A|\, N^2 H_c) operations.

PROOF. For each of the |A| actions and H_c iterations, the sub-state of each of the N nodes is updated by multiplying N − 1 numbers.

Algorithm 2 Continuous(s)
1: for a ∈ A do
2:   Initialization: Q(a) = 0, mainland = 0, (s_1, s_2, ..., s_N) := s
3:   for iter = 1 : H_c do
4:     Q(a) = Q(a) + γ^{iter−1} r(s,a)
5:     for i = 1 → N do
6:       s'_i = s_i (1 − a_i) + (1 − s_i)(1 − \prod_{j=1}^{N} (1 − p_{ji} s_j))
7:     mainland = mainland + (1 − mainland)(1 − \prod_{i=1}^{N} (1 − p_{i,mainland} s_i))   (containment case only)
8:     s_i = s'_i for all 1 ≤ i ≤ N
9: a = argmax_{a ∈ A} Q(a)
10: Output: Action a
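A runnable transcription of Algorithm 2 is sketched below (Python). It is our own illustration of the pseudocode above, not the authors' code: the transmission matrix p, the dictionary actions mapping each action to its per-node effectiveness vector, and the mainland transmission column p_mainland are assumed inputs.

def continuous_action(s, actions, p, p_mainland, gamma=0.99, Hc=10, containment=True):
    """Return the action with the highest rolled-out score Q(a) (Algorithm 2 sketch)."""
    N = len(s)
    best_a, best_q = None, float("-inf")
    for a, eff in actions.items():
        state = [float(si) for si in s]            # continuous sub-states in [0, 1]
        mainland, q = 0.0, 0.0
        for it in range(Hc):
            if containment:
                r = 1.0 - mainland                 # prob. the mainland is still susceptible
            else:
                r = sum(1.0 - si for si in state)  # expected number of susceptible nodes
            q += gamma**it * r                     # line 4
            new_state = []
            for i in range(N):                     # Eq. (15), line 6
                reinfest = 1.0 - _prod(1.0 - p[j][i] * state[j] for j in range(N))
                new_state.append(state[i] * (1.0 - eff[i]) + (1.0 - state[i]) * reinfest)
            if containment:                        # mainland update, line 7
                mainland += (1.0 - mainland) * (
                    1.0 - _prod(1.0 - p_mainland[i] * state[i] for i in range(N)))
            state = new_state                      # line 8
        if q > best_q:
            best_a, best_q = a, q
    return best_a

def _prod(values):
    result = 1.0
    for v in values:
        result *= v
    return result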

The complexity of this approach is polynomial in the problem size, which is a considerable improvement as compared to stochastic dynamic programming. It circumvents both curses of dimensionality.

Performance evaluation
We can evaluate the performance of each algorithm through simulations by implementing the recommended action at each time step. Note that this is much faster for the first approach because the algorithm computes the policy for all states before the simulations, potentially at a very high one-off computational cost (that is, it is an offline algorithm). In contrast, the algorithm Continuous only outputs an action for one given state, so needs to be re-run at every time step for the updated observed state (it is an online algorithm).

Algorithm 3 EvaluatePolicy(s_0)
1: t = 0
2: for i = 1 : nSimulations do
3:   Initialization: s = s_0, mainlandInfested = 0   // all islands start infested
4:   while !mainlandInfested do
5:     a = π(s) or a = Continuous(s)
6:     mainlandInfested := DrawMainlandState(s,a)
7:     s := DrawState(s,a)
8:     t := t + 1
Output: Average time: t/nSimulations

Framing the case study as an SIS-MDP
We will aim to find the optimal management of Aedes albopictus. This decision problem is modeled as an SIS-MDP in which:

• The observable component s ∈ S specifies the presence or absence of the mosquitoes across the N = 17 islands (|S| = 2^N + 1 = 131073). The term '+1' corresponds to an absorbing state representing the presence of mosquitoes in the mainland.

• Each action a ∈ A describes which islands should be managed and the type of management (light or strong). Due to a budgetary constraint, only up to three islands can be managed simultaneously. The set A only contains the combinations of management actions that satisfy this constraint (see the sketch after this list).

• The transition probabilities T(s,a,s') account for the possible local eradications and transmissions between islands. In accordance with [22], we investigate two transmission rates: fast and slow.

• In the case of a containment objective, the reward r(s,a) equals 0 if the mainland is infested and 1 otherwise; in the case of an eradication objective, the reward is the sum of susceptible islands (the mainland is disregarded);

• In the containment case, γ should ideally be 1 so the expected cumulative reward (value) equals the expected time before infestation of Australia in years. For ease of comparison with SPUDD, we set γ = 0.99 for containment and γ = 0.95 for eradication.
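The budget-constrained action set can be enumerated explicitly. The sketch below (Python) is a hedged illustration under an assumed cost structure that is ours, not stated in the paper: a light management costs 1 unit, a strong one costs 2 units, and the budget is 3 units per time step. This assumption is consistent with the combinations mentioned in Table 2 (three light managements, or one strong plus one light) and reproduces the reported action-set sizes (276 actions for 10 islands, 1123 for 17).

from itertools import combinations, product

def enumerate_actions(n_islands, budget=3, costs={"light": 1, "strong": 2}):
    """Enumerate management actions under a budget constraint (assumed cost structure).
    An action is a tuple of (island, intensity) pairs; unlisted islands are unmanaged."""
    actions = [()]                                     # the 'do nothing' action
    for k in range(1, budget + 1):                     # number of managed islands
        for islands in combinations(range(n_islands), k):
            for intensities in product(costs, repeat=k):
                if sum(costs[i] for i in intensities) <= budget:
                    actions.append(tuple(zip(islands, intensities)))
    return actions

# len(enumerate_actions(10)) == 276 and len(enumerate_actions(17)) == 1123,
# matching the |A| values reported in Table 1 under the assumed cost structure.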

RESULTS
We show the average result, standard deviation and computational time on various problem instances for both the Neighbor and Continuous algorithms (Table 1) on 10,000 simulations. We set H = 10, K = 4 and H_c = 10 as these parameters achieved a satisfying trade-off between computation time and performance. We compare their performances to SPUDD (version 3.6.2). We run our algorithms with 10 islands (for which the optimal value was calculated using the classic stochastic dynamic programming algorithm policy iteration in [22]), 17 islands (the full-scale problem) and a hypothetical network of 50 islands (to test scalability), and with different transmission parameters (high, low and random) and management objectives (eradication and containment). On 17 islands, SPUDD cannot accommodate a highly connected network, so we allow, for each island, transmissions from only five islands with the highest transmission probabilities for tractability.

Both of our proposed algorithms are tractable for up to 17 islands for all objectives and transmissions. However, SPUDD runs out of memory or time for the eradication objective and the random transmissions. The eradication objective is difficult to solve because all islands contribute to the rewards, and thus the value functions. In contrast, the containment objective is simpler because keeping a few key islands susceptible might be enough to achieve a good performance. As for the random transmission case, it is harder to solve because the islands have random attributes. The 'low' and 'high' transmission probabilities are easier, e.g. Thursday and Horn islands have relatively high transmission probabilities to and from other islands and all solvers manage to rapidly identify them as management priorities.

Figure 2. The prioritization ranking shown on the Torres Strait Islands. The recommendation is to start by managing Thursday Island and follow the arrows if mosquitoes are successfully eradicated. As a general rule of thumb, islands that are closer to the mainland are to be managed with priority. Other factors such as effectiveness of management actions and closeness to other Torres Strait Islands also account for this ranking.

In a limited system of 10 islands all solvers perform near-optimally. This shows that both our approximate algorithms are able to provide good policies and suggests that they might perform well on larger problems as well.

When all 17 islands are included, the three solvers obtain the same value with high transmission probabilities, with SPUDD being much faster than the Neighbor algorithm. However, SPUDD under-performs with 'low' transmission probabilities. This is because SPUDD does not accommodate highly connected networks, and thus only outputs a policy tree depending on a handful of islands. Therefore, it makes no recommendations about other islands, resulting in a loss of performance. Our approximate approaches outperform SPUDD and also show more robustness to the parameters of the problem.

Finally, only the Continuous algorithm is tractable with 50 islands, because it overcomes both curses of dimensionality and has a polynomial complexity (it would also be tractable for much larger problems for the same reasons, but we did not confirm this through experiments). In contrast, the Neighbor algorithm overcomes only one of those issues and still has an exponential complexity.


Instance (#islands / transmission / objective; |S|, |A|) | Neighbor (H = 10, K = 4): value ± 95% confidence, time | Continuous (Hc = 10): value ± 95% confidence, time | SPUDD: value ± 95% confidence, time
10 / high / containment (1025, 276), optimal: 13.6 [22] | 13.5 ± 0.3, 60 s | 13.6 ± 0.3, 2,441 s | 13.5 ± 0.3, 4,631 s
10 / low / containment (1025, 276), optimal: 65.1 [22] | 64.3 ± 1.4, 59 s | 64.9 ± 1.4, 4,406 s | 65.2 ± 1.4, 10,546 s
17 / high / containment (131073, 1123) | 12.4 ± 0.3, 172,222 s | 12.8 ± 0.3, 12,160 s | 12.6 ± 0.3 (*), 207 s (*)
17 / low / containment (131073, 1123) | 54.9 ± 1.2, 183,572 s | 57.2 ± 1.2, 20,151 s | 54.9 ± 1.2 (*), 955 s (*)
17 / random / containment (131073, 1123) | 16.4 ± 0.4, 158,260 s | 16.8 ± 0.4, 13,449 s | Out of memory
17 / high / eradication (131073, 1123) | 94.6 ± 0.3, 156,544 s | 95.0 ± 0.3, 57,100 s | Out of time
17 / low / eradication (131073, 1123) | 124.7 ± 0.4, 156,479 s | 125.1 ± 0.4, 29,378 s | Out of time
17 / random / eradication (131073, 1123) | 78.5 ± 0.4, 172,258 s | 79.8 ± 0.4, 64,242 s | Out of memory
50 / random / containment (2.25×10^15, 23376) | Out of memory | 5.4 ± 0.3, 164,225 s (1,000 simulations) | Out of memory

Table 1. Average values, 95% confidence intervals and computational times of the two approximate approaches and SPUDD on 10,000 simulations. Best values are shown in bold. (*) SPUDD was run on an approximate version of the 17-island problem for tractability (see main text). For the offline solvers (the Neighbor algorithm and SPUDD), the computational time shown does not include the simulation running time, which is very short. In contrast, the Continuous algorithm is online, so the computational time is only that of the simulations (no preprocessing). The memory is set to 2 GB, which is the maximum memory supported by SPUDD. The computational time limit was set to 1 week (604,800 s).

Ranking  Island name        Ranking  Island name
1        Thursday           10       Coconut
2        Horn               11       Yorke
3        Mulgrave           12       Saibai
4        Sue                13       Murray
5        Banks              14       Talbot
6        Yam                15       Darnley
7        Hammond            16       Mt Cornwallis
8        Jervis             17       Stephens
9        Prince of Wales

Table 2. Priority ranking of the 17 Torres Strait Islands for the Continuous algorithm with the containment objective with both low and high transmission probabilities. At each time step, only the two or three infested islands with highest ranking are managed, due to a limited budget. Note that this ranking is not unique: when Thursday, Horn and Mulgrave islands are susceptible, the Continuous algorithm recommends managing Sue, Yam and Jervis islands with 3 light managements. However, if Sue island becomes susceptible, the Continuous algorithm recommends managing Banks and Yam islands with strong and light management respectively. It is then unclear whether Banks is more of a priority than Jervis. Nevertheless, the prioritization ranking we present is accurate for most islands and provides a good idea of which islands should be managed first.

Since the Continuous algorithm performs well, we show which islands should be managed in priority in the containment case according to this approach, in Table 2 and in Figure 2. Under the containment objective, it recommends managing Thursday, Horn and Mulgrave Islands with priority if they are infested. These islands are highly populated and close to mainland Australia, and therefore have the highest probability of directly transmitting mosquitoes to the mainland. This matches the recommendations found in [22].

DISCUSSION
In this manuscript, we aimed to solve an MDP on a large and fully connected SIS network. Given the intractability of stochastic dynamic programming, we propose two new approximate approaches based on the observation that the transition probability for each node is low. The first approach is a modified version of value iteration where only those future states that are similar to the current state are accounted for, with provable performance guarantees. This drastically reduces the computational time of Bellman's equation at little cost on the quality of the policy. The second approach goes further by modeling the sub-states comprising the MDP states as continuous instead of binary, with an adapted Bellman's equation.

Both approaches solve all versions of this case study, which policy iteration and SPUDD could not. The Neighbor algorithm solves the second curse of dimensionality on future states. The Continuous algorithm also circumvents the first curse of dimensionality on current states. Both approaches could handle completely connected SIS-MDP networks of size at least 17 (Neighbor) and 50 nodes (Continuous). While it was not possible to evaluate loss of optimality on the 17-island problem because it is intractable for established techniques, our algorithms achieved near-optimal performance on the 10-island problem. The lower performance of SPUDD is not surprising as SPUDD takes advantage of conditional independence between sub-state variables [12]. In our problem, all sub-state variables are conditionally dependent because we are dealing with a complete network.

Although both of our new proposed approaches share some similarities, they also differ on several points. The advantage of the Neighbor algorithm is that it accounts for a small number of sub-state changes (K), while the Continuous algorithm does not. Additionally, the Neighbor algorithm can trade off computational time for policy quality by increasing or decreasing the number of iterations (H) or the number of following states (K). The extreme case, i.e. setting the number of changes allowed to the total number of nodes (K = N), is equivalent to performing value iteration given H is large enough. It is an offline algorithm, which is easier to communicate to managers since the solution is calculated once. The Continuous algorithm is online and each simulation only takes a few seconds to run. In our case this approach is fast and outperforms the Neighbor algorithm. However, it comes with no performance guarantees and has the disadvantage of not accounting for a change of action in the future: it might perform poorly on systems that rely on changing actions significantly within a short time. Finally, the Neighbor algorithm retains exponential complexity in the number of nodes in the network while the Continuous algorithm is polynomial.

This work provides many avenues for future research. First, we have developed our equations for SIS-MDPs; however, the Neighbor algorithm could be applied to more general factored MDPs with only minor changes. Second, the Continuous algorithm works well when sub-states do not change frequently. To apply this algorithm efficiently to more difficult cases, it may be necessary to allow actions to change in the future. This might be achieved while keeping computational complexity down by considering actions for the most likely future states, using for example smart sampling techniques. Third, the Neighbor algorithm could be converted to an online form by only considering states that can be reached from the current state. This would avoid running the entire algorithm prior to simulations. Also, we have built this algorithm as an approximate version of value iteration but it would be interesting to design and evaluate a policy iteration version. Finally, we acknowledge that there are many algorithms that would be appropriate in this context, e.g. reinforcement learning [35]. However, they do not natively assume or exploit that sub-states do not change frequently. Tailoring these reference algorithms for this property may lead to considerable computational savings.

This work could be applied in multiple fields. There are many environmental spatial problems requiring effective MDP solvers on highly connected networks. Examples include management of forestry at risk of wind damage [7], adaptive management of migratory birds under sea level rise [17] or control of invasive mammals [9] or invasive weeds [5]. Other non-ecological interconnected systems would also benefit from this work. For example, this work could help system administrators keep as many machines as possible running in a network [25], or maximize the reliability of information in a military sensor network [8].

Inspired by the problem of best managing the invasive mosquito Aedes albopictus in Australia, we aimed at solving a Markov decision process on large Susceptible-Infected-Susceptible (SIS) networks that are highly connected. Current exact approaches are intractable for these types of networks. We have proposed two approximate algorithms that can tackle such large-scale problems and achieve promising results, and we have provided some theoretical insights about their performances. Although our two approximate approaches are not guaranteed to be optimal, the resulting policies can still be used as an initial policy or a basis of comparison by other algorithms.

REFERENCES
1. Richard Bellman. 1957. Dynamic programming. Princeton University Press (1957).
2. Dimitri P. Bertsekas and John N. Tsitsiklis. 1995. Neuro-dynamic programming: an overview. In Decision and Control, 1995, Proceedings of the 34th IEEE Conference on, Vol. 1. IEEE, 560–564.
3. Iadine Chadès, Tara G. Martin, Sam Nicol, Mark A. Burgman, Hugh P. Possingham, and Yvonne M. Buckley. 2011. General rules for managing and surveying networks of pests, diseases, and endangered species. Proceedings of the National Academy of Sciences of the United States of America 108 (2011), 8323–8328.
4. Peter G. Fennell, Sergey Melnik, and James P. Gleeson. 2016. Limitations of discrete-time approaches to continuous-time contagion dynamics. Physical Review E 94, 5 (2016), 052125.
5. Jennifer Firn, Tracy Rout, Hugh Possingham, and Yvonne M. Buckley. 2008. Managing beyond the invader: manipulating disturbance of natives simplifies control efforts. Journal of Applied Ecology 45 (2008), 1143–1151. DOI: http://dx.doi.org/10.1111/j.1365-2664.2008.01510.x
6. Nicklas Forsell and Régis Sabbadin. 2006. Approximate linear-programming algorithms for graph-based Markov decision processes. Frontiers in Artificial Intelligence and Applications 141 (2006), 590.
7. Nicklas Forsell, Peder Wikström, Frédérick Garcia, Régis Sabbadin, Kristina Blennow, and Ljusk Ola Eriksson. 2011. Management of the risk of wind damage in forestry: a graph-based Markov decision process approach. Annals of Operations Research 190 (2011), 57–74.
8. Duncan Gillies, David Thornley, and Chatschik Bisdikian. 2009. Probabilistic approaches to estimating the quality of information in military sensor networks. Comput. J. 53, 5 (2009), 493–502.
9. Kate J. Helmstedt, Justine D. Shaw, Michael Bode, Aleks Terauds, Keith Springer, Susan A. Robinson, and Hugh P. Possingham. 2016. Prioritizing eradication actions on islands: it's not all or nothing. Journal of Applied Ecology 53, 3 (2016), 733–741.
10. Christopher Ho, Mykel J. Kochenderfer, Vineet Mehta, and Rajmonda S. Caceres. 2015. Control of epidemics on graphs. In Decision and Control (CDC), 2015 IEEE 54th Annual Conference on. IEEE, 4202–4207.
11. Wassily Hoeffding. 1963. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association 58, 301 (1963), 13–30.
12. Jesse Hoey, Robert St-Aubin, Alan Hu, and Craig Boutilier. 1999. SPUDD: Stochastic planning using decision diagrams. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., 279–288.
13. Michael L. Littman, Thomas L. Dean, and Leslie P. Kaelbling. 1995. On the complexity of solving Markov decision problems. Morgan Kaufmann Publishers Inc., 394–402.
14. Alun L. Lloyd and Robert M. May. 2001. How viruses spread among computers and people. Science 292, 5520 (2001), 1316–1317.
15. László Lovász, József Pelikán, and Katalin L. Vesztergombi. 2003. Discrete Mathematics. Springer, Secaucus, NJ.
16. Marissa F. McBride, Kerrie A. Wilson, Michael Bode, and Hugh P. Possingham. 2007. Incorporating the effects of socioeconomic uncertainty into priority setting for conservation investment. Conservation Biology 21, 6 (2007), 1463–1474.
17. Sam Nicol, Olivier Buffet, Takuya Iwamura, and Iadine Chadès. 2013. Adaptive management of migratory birds under sea level rise. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence. AAAI Press, Beijing, China, 2955–2957.
18. Sam Nicol and Iadine Chadès. 2011. Beyond stochastic dynamic programming: a heuristic sampling method for optimizing conservation decisions in very large state spaces. Methods in Ecology and Evolution 2 (2011), 221–228. DOI: http://dx.doi.org/10.1111/j.2041-210X.2010.00069.x
19. Sam Nicol, Iadine Chadès, Simon Linke, and Hugh P. Possingham. 2010. Conservation decision-making in large state spaces. Ecological Modelling 221, 21 (2010), 2531–2536.
20. Sam Nicol, Régis Sabbadin, Nathalie Peyrard, and Iadine Chadès. 2017. Finding the best management policy to eradicate invasive species from spatial ecological networks with simultaneous actions. Journal of Applied Ecology (2017).
21. Romualdo Pastor-Satorras and Alessandro Vespignani. 2001. Epidemic spreading in scale-free networks. Physical Review Letters 86, 14 (2001), 3200.
22. Martin Péron, Cassie C. Jansen, Chrystal Mantyka-Pringle, Sam Nicol, Nancy A. Schellhorn, Kai Helge Becker, and Iadine Chadès. 2017. Selecting simultaneous actions of different durations to optimally manage an ecological network. Methods in Ecology and Evolution 8, 10 (2017), 1332–1341.
23. Nathalie Peyrard and Régis Sabbadin. 2006. Mean field approximation of the policy iteration algorithm for graph-based Markov decision processes. Frontiers in Artificial Intelligence and Applications 141 (2006), 595.
24. Luis Enrique Pineda and Shlomo Zilberstein. 2014. Planning under uncertainty using reduced models: revisiting determinization. In ICAPS.
25. Pascal Poupart. 2005. Exploiting structure to efficiently solve large scale partially observable Markov decision processes. Ph.D. Dissertation. University of Toronto, Toronto.
26. Warren B. Powell. 2007. Approximate dynamic programming: solving the curses of dimensionality. Vol. 703. John Wiley & Sons, Inc., New York, NY, USA.
27. Martin L. Puterman. 1994. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, Inc., New York, NY, USA.
28. Olivier Restif and Jacob C. Koella. 2003. Shared control of epidemiological traits in a coevolutionary model of host-parasite interactions. The American Naturalist 161, 6 (2003), 827–836.
29. Scott A. Ritchie, Peter Moore, Morven Carruthers, Craig Williams, Brian Montgomery, Peter Foley, Shayne Ahboo, Andrew F. Van Den Hurk, Michael D. Lindsay, and Bob Cooper. 2006. Discovery of a widespread infestation of Aedes albopictus in the Torres Strait, Australia. Journal of the American Mosquito Control Association 22 (2006), 358–365.
30. Faryad Darabi Sahneh, Fahmida N. Chowdhury, and Caterina M. Scoglio. 2012. On the existence of a threshold for preventive behavioral responses to suppress epidemic spreading. Scientific Reports 2 (2012).
31. Scott Sanner and David McAllester. 2005. Affine algebraic decision diagrams (AADDs) and their application to structured probabilistic inference. In IJCAI, Vol. 2005. 1384–1390.
32. Daniel Sheldon, Bistra Dilkina, Adam N. Elmachtoub, Ryan Finseth, Ashish Sabharwal, Jon Conrad, Carla P. Gomes, David Shmoys, William Allen, and Ole Amundsen. 2012. Maximizing the spread of cascades using network design. arXiv preprint arXiv:1203.3514 (2012).
33. Olivier Sigaud and Olivier Buffet. 2010. Markov decision processes in artificial intelligence. John Wiley & Sons, Inc., New York, NY, USA.
34. James C. Spall. 2005. Introduction to stochastic search and optimization: estimation, simulation, and control. Vol. 65. John Wiley & Sons.
35. Richard S. Sutton and Andrew G. Barto. 1998. Introduction to reinforcement learning. MIT Press.
36. Shan Xue, Alan Fern, and Daniel Sheldon. 2014. Dynamic resource allocation for optimizing population diffusion. In Artificial Intelligence and Statistics. 1033–1041.


APPENDIX

Proof of Proposition 1
We prove the following proposition:

PROPOSITION 4. We assume that K ≥ Np. We have:

\|V_{\pi^*} - V_{\pi_N}\|_\infty \;\leq\; \frac{R_{max}\,\gamma^H}{1-\gamma} + \frac{R_{max}\,\gamma\, \exp\!\big(-2\,\frac{(K+1-Np)^2}{N}\big)}{(1-\gamma)^2}. \quad (16)\text{--}(17)

Let us denote by V^N_\pi the value as calculated in the Neighbor algorithm (Algorithm 1) of any policy π : S → A, i.e. V^N_\pi(s) = Q(\pi(s)) for each s ∈ S. Let us first prove the following lemma.

LEMMA 1. For any policy π : S → A and any state s ∈ S, we have:

0 \;\leq\; V_\pi(s) - V^N_\pi(s) \;\leq\; \frac{R_{max}\,\gamma^H}{1-\gamma} + \frac{R_{max}\,\gamma\, \exp\!\big(-2\,\frac{(K+1-Np)^2}{N}\big)}{(1-\gamma)^2}. \quad (18)\text{--}(19)

PROOF.

V_\pi(s) - V^N_\pi(s) \quad (20)
= E\Big[\sum_{t \geq 0} \gamma^t r(s_t, \pi(s_t)) \,\Big|\, s_0 = s\Big] \quad (21)
\;\;- E\Big[\sum_{0 \leq t \leq H-1,\; \sum_{i=1}^{N} \delta_{s_{t+1,i}\, s_{t,i}} \geq N-K} \gamma^t r(s_t, \pi(s_t)) \,\Big|\, s_0 = s\Big] \quad (22)
= E\Big[\sum_{t \geq H} \gamma^t r(s_t, \pi(s_t)) \,\Big|\, s_0 = s\Big] \quad (23)
\;\;+ E\Big[\sum_{0 \leq t \leq H-1,\; \sum_{i=1}^{N} \delta_{s_{t+1,i}\, s_{t,i}} < N-K} \gamma^t r(s_t, \pi(s_t)) \,\Big|\, s_0 = s\Big] \quad (24)

This sum is nonnegative because all rewards are nonnegative, which proves the first inequality of the lemma. For the second inequality, we can write:

V_\pi(s) - V^N_\pi(s) \quad (25)
\leq R_{max}\, E\Big[\sum_{t \geq H} \gamma^t \,\Big|\, s_0 = s\Big] + R_{max}\, E\Big[\sum_{t \geq 0,\; \sum_{i=1}^{N} \delta_{s_{t+1,i}\, s_{t,i}} < N-K} \gamma^t \,\Big|\, s_0 = s\Big] \quad (26)\text{--}(27)
= \frac{R_{max}\,\gamma^H}{1-\gamma} + R_{max}\Big(\sum_{t \geq 0} \gamma^t - E\Big[\sum_{t \geq 0,\; \sum_{i=1}^{N} \delta_{s_{t+1,i}\, s_{t,i}} \geq N-K} \gamma^t \,\Big|\, s_0 = s\Big]\Big) \quad (28)\text{--}(29)

Recall that \sum_{i=1}^{N} \delta_{s_{t+1,i}\, s_{t,i}} is the number of sub-states (nodes) that do not change from time step t to time step t+1. With s_t fixed and s_{t+1} following the Markov process, this sum is a random variable. It is equivalent to a binomial distribution with N independent experiments, each with a probability of success of 1 − p at least. So, the probability that a state s_t and its successor s_{t+1} satisfy \sum_{i=1}^{N} \delta_{s_{t+1,i}\, s_{t,i}} \geq N - K is, by Hoeffding's inequality [11], no less than

P_K := 1 - \exp\!\Big(-2\,\frac{(K+1-Np)^2}{N}\Big). \quad (30)

This result is based on our assumption that K ≥ Np. It implies that (P_K)^t is a lower bound of the probability that all states from s_0 to s_t satisfy this property, which are the states involved in the second sum of Eq. (29). So,

V_\pi(s) - V^N_\pi(s) \quad (31)
\leq \frac{R_{max}\,\gamma^H}{1-\gamma} + R_{max}\Big(\sum_{t \geq 0} \gamma^t - \sum_{t \geq 0} \gamma^t (P_K)^t\Big) \quad (32)
= \frac{R_{max}\,\gamma^H}{1-\gamma} + R_{max}\Big(\frac{1}{1-\gamma} - \frac{1}{1-\gamma P_K}\Big) \quad (33)
\leq \frac{R_{max}\,\gamma^H}{1-\gamma} + \frac{R_{max}\,\gamma\,(1-P_K)}{(1-\gamma)^2} \quad (34)
= \frac{R_{max}\,\gamma^H}{1-\gamma} + \frac{R_{max}\,\gamma\, \exp\!\big(-2\,\frac{(K+1-Np)^2}{N}\big)}{(1-\gamma)^2} \quad (35)

which terminates the proof of the lemma.

Then we have, for each state s ∈ S:

0 \leq V_{\pi^*}(s) - V_{\pi_N}(s) \quad (36)
\leq \big[V_{\pi^*}(s) - V^N_{\pi^*}(s)\big] + \big[V^N_{\pi^*}(s) - V^N_{\pi_N}(s)\big] + \big[V^N_{\pi_N}(s) - V_{\pi_N}(s)\big] \quad (37)

The second term between brackets is non-positive because the policy π_N is optimal with regard to the value function V^N. The first and third terms between brackets are bounded by the right and left inequalities in Lemma 1 respectively, which yields:

V_{\pi^*}(s) - V_{\pi_N}(s) \;\leq\; \frac{R_{max}\,\gamma^H}{1-\gamma} + \frac{R_{max}\,\gamma\, \exp\!\big(-2\,\frac{(K+1-Np)^2}{N}\big)}{(1-\gamma)^2} \quad (38)\text{--}(39)

This completes the proof of Proposition 1.

Future states in the Neighbor algorithm
In this section, we show that the number of future states computed by the Neighbor algorithm is negligible compared to that of value iteration when N grows to infinity, for any desired level of precision, provided that p < 1/2. Let us denote by F(N) the number of future states computed by the Neighbor algorithm with N nodes. Since the number of future states computed by value iteration is 2^N, we want to find an upper bound on F(N)/2^N.

Proposition 1 shows that the loss of performance due to restricting the future states through the variable K is at most:

\frac{R_{max}\,\gamma\, \exp\!\big(-2\,\frac{(K+1-Np)^2}{N}\big)}{(1-\gamma)^2}. \quad (40)


So, we can ensure that this loss is below any given threshold ρ > 0 by setting

K = Np - 1 + \sqrt{\frac{N \log\!\big(\frac{R_{max}\,\gamma}{(1-\gamma)^2 \rho}\big)}{2}} \quad (41)
\;= Np - 1 + C\sqrt{N}, \quad (42)

with the notation C = \sqrt{\frac{1}{2}\log\!\big(\frac{R_{max}\,\gamma}{(1-\gamma)^2 \rho}\big)}.
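As a quick illustration of Eq. (41), the sketch below (Python) returns the smallest integer K guaranteeing a loss below ρ; the parameter values in the comment are ours, chosen only for the example.

import math

def k_for_precision(N, p, gamma, R_max, rho):
    """Smallest integer K such that the loss term of Proposition 1 is below rho (Eq. 41)."""
    C = math.sqrt(math.log(R_max * gamma / ((1 - gamma)**2 * rho)) / 2)
    return math.ceil(N * p - 1 + C * math.sqrt(N))

# Illustrative only: N = 17, assumed p = 0.1, gamma = 0.99, R_max = 1, rho = 1.0
# print(k_for_precision(17, 0.1, 0.99, 1.0, 1.0))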

We have the following upper bound on F(N), based on Theorem 5.3.2 in [15]:

F(N) \leq 2^{N-1}\, e^{-\frac{(N-2K-2)^2}{4(N-K-1)}}. \quad (43)

So,

\frac{F(N)}{2^N} \leq \frac{1}{2}\, e^{-\frac{(N-2K-2)^2}{4(N-K-1)}}. \quad (44)

Further, we have

\frac{(N-2K-2)^2}{4(N-K-1)} \geq \frac{(N-2K-2)^2}{4N} \quad (45)
= \frac{\big(N - 2(Np - 1 + C\sqrt{N}) - 2\big)^2}{4N} \quad (46)
= \frac{\big(\sqrt{N}(1-2p) - 2C\big)^2}{4} \;\xrightarrow{\;N \to \infty\;}\; \infty \quad (47)

under the assumption that p < 1/2. This implies

\frac{F(N)}{2^N} \;\xrightarrow{\;N \to \infty\;}\; 0, \quad (49)

i.e. the number of future states computed by the Neighbor algorithm is negligible compared to that of value iteration when N grows to infinity.


Chapter 5

Fast-tracking Stationary MOMDPs for Adaptive Management Problems

In the previous two chapters, we focussed on 'standard' Markov decision processes, in the sense that all the parameters about the process were known. However, a key aspect in managing Aedes albopictus is the uncertainty about the system dynamics, also called structural uncertainty, to which the standard version of dynamic programming is not adapted. Similar types of problems can occur in various contexts, including threatened species management and natural resource management, medical science, or machine and infrastructure maintenance (Section 2.2.3). Such problems, called adaptive management in environmental sciences, can be modelled as MOMDPs. In practice, however, the high complexity of MOMDPs makes MOMDP solvers slow or intractable for all but trivial problems. This is the focus of the next two chapters.

This chapter addresses our third research question by accelerating POMDP or MOMDP solvers that are used when solving adaptive management problems. More precisely, we propose a method to find a lower bound on the optimal value function, which is used as an initial value function. In the corners of the domain of the value function (belief space), this lower bound is provably equal to the optimal value function. We also show that under further assumptions, it is a linear approximation of the optimal value function in a neighbourhood around the corners. Tested on two state-of-the-art POMDP solvers, our approach shows significant computational gains in our case study and on a previously published data challenge. This chapter was published as

Peron, M., Becker, K. H., Bartlett, P., and Chades, I. (2017a). Fast-Tracking Stationary MOMDPs for Adaptive Management Problems. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), pages 4531–4537.

Statement of joint authorship:

• Martin Peron conceived the presented idea, developed the theory, designed and implemented the optimisation models, performed the analysis, drafted most of the manuscript and acted as corresponding author.

• Kai Helge Becker directed the research and edited the manuscript.

• Peter L. Bartlett edited the manuscript.

• Iadine Chades directed the research and edited the manuscript.


Fast-Tracking Stationary MOMDPs for Adaptive Management Problems

Martin Peron,1,2 Kai Helge Becker,3 Peter Bartlett,1,4 Iadine Chades2
1Queensland University of Technology, Brisbane QLD 4000, Australia ([email protected])
2CSIRO, Dutton Park QLD 4102, Australia ([email protected])
3University of Strathclyde, Glasgow G1 1XQ, United Kingdom ([email protected])
4University of California, Berkeley, CA, United States ([email protected])

Abstract
Adaptive management is applied in conservation and natural resource management, and consists of making sequential decisions when the transition matrix is uncertain. Informally described as 'learning by doing', this approach aims to trade off between decisions that help achieve the objective and decisions that will yield a better knowledge of the true transition matrix. When the true transition matrix is assumed to be an element of a finite set of possible matrices, solving a mixed observability Markov decision process (MOMDP) leads to an optimal trade-off but is very computationally demanding. Under the assumption (common in adaptive management) that the true transition matrix is stationary, we propose a polynomial-time algorithm to find a lower bound of the value function. In the corners of the domain of the value function (belief space), this lower bound is provably equal to the optimal value function. We also show that under further assumptions, it is a linear approximation of the optimal value function in a neighborhood around the corners. We evaluate the benefits of our approach by using it to initialize the solvers MO-SARSOP and Perseus on a novel computational sustainability problem and a recent adaptive management data challenge. Our approach leads to an improved initial value function and translates into significant computational gains for both solvers.

IntroductionAdaptive management is an approach tailored for achievinga management objective in environmental problems whenthe system dynamics is partially unknown (Walters andHilborn 1978; Chades et al. 2016), with applications in con-servation (Chades et al. 2012; Runge 2013), fisheries (Fred-erick and Peterman 1995), natural resource management(Johnson, Kendall, and Dubovsky 2002) and forest manage-ment (Moore and Conroy 2006). Over time, we can learnabout the system dynamics by analyzing how the systemhas responded to our actions so far. Some actions might notseem optimal to achieve the management objective given ourcurrent knowledge but might be more informative about thesystem dynamics than others, potentially resulting in betterdecisions in the future.

The uncertainty about the system dynamics is often mod-eled by a finite set of scenarios (Walters and Hilborn 1976;

Copyright c© 2017, Association for the Advancement of ArtificialIntelligence (www.aaai.org). All rights reserved.

Moore and Conroy 2006). Chades et al. (2012) showedthat this problem can be formulated as a mixed observabil-ity Markov decision process (MOMDP), a special case ofPOMDP (partially observable MDP). An optimal MOMDPpolicy accomplishes the best trade-off between informativeand rewarding actions, with regard to a precise managementobjective (Chades et al. 2012).

Researchers from other fields have also looked at variations of the same problems: model-based Bayesian reinforcement learning aims to find the best trade-off (Vlassis et al. 2012), but does not assume the transition matrix to belong to a finite given set; instead, probabilities are often assumed to follow a Dirichlet distribution (Duff 2003).

In adaptive management, the true transition matrix is commonly assumed to be stationary, i.e. it does not change over time (Walters and Hilborn (1978), Chades et al. (2012), Runge (2013), to cite a few). We will make this assumption too and will refer to the problem as a stationary MOMDP. Most MOMDP solvers are α-vector-based, i.e. they update a piecewise linear value function converging to the optimal value function (Ong et al. 2010). In practice, the high complexity of stationary MOMDPs (PSPACE-complete; Chades et al. 2012) leads to very slow convergence for all but trivial problems.

Based on the properties of stationary MOMDPs, we propose an algorithm generating a lower bound of the value function (Proposition 1). We show that it runs in polynomial time (Proposition 2). Any α-vector-based MOMDP solver can be initialized with this lower bound, with a potentially significant reduction of the computation time. Additionally, our lower bound is provably optimal in the corners of the domain of the value function (Proposition 3). Finally, we demonstrate in Theorems 1 and 2 that, under some assumptions, the derivatives of the optimal value function exist and are equal to those of our lower bound in neighborhoods around the corners of the domain, i.e. our lower bound is a linear approximation of the optimal value function.

The paper is organized as follows: we first introduce MOMDPs formally. We then describe our approach to speed up MOMDP solvers. We illustrate the efficiency of our approach on the management of the invasive mosquito Aedes albopictus in an Australian archipelago and on case studies taken from Nicol et al. (2013). The data is freely available at goo.gl/6f4Rh0. In the last section we discuss our approach and the results obtained.

Figure 1: Relations between various Markovian models.

Mixed observability Markov decision process

A partially observable Markov decision process (POMDP) is a mathematical framework to model the impact of sequential decisions on a probabilistic system under imperfect observation of the states (Sigaud and Buffet 2010). MOMDPs are a special case of POMDPs, where the state can be decomposed into a fully observable component and a partially observable component (Ong et al. 2010). Alternatively, they can be seen as MDPs extended with a non-observable component (Fig. 1). MOMDPs can model various decision problems where an agent knows its position but evolves in a partially observable environment, or when the transition matrices or rewards are uncertain. Formally, a MOMDP (Ong et al. 2010) is a tuple ⟨X, Y, A, O, Tx, Ty, Z, R, γ⟩ in which:

• The state space is of the form X × Y. The current state (x, y) fully specifies the system at every time step. The component x ∈ X is assumed fully observable and y ∈ Y is partially observable;

• A is the finite action space;

• Tx(x, y, a, x′) = p(x′|x, y, a) is the probability of transitioning from the state (x, y) to x′ when a is implemented. Ty(x, y, a, x′, y′) = p(y′|x, y, a, x′) is the probability of transitioning from y to y′ when a is implemented and the observed component transitions from x to x′. The process respects the Markov property in that these probabilities do not depend on past states or actions;

• The reward matrix is the immediate reward r(x, y, a) that the policy-maker receives for implementing a in state (x, y);

• O is the finite observation space;

• Z(a, x′, y′, o′) = p(o′|a, x′, y′) is the probability of observing o′ ∈ O if the state is (x′, y′) after action a;

• γ is the discount factor (< 1 in infinite time horizon).

The sequential decision making process unfolds as follows (Fig. 2a). Starting at time t = 0 in a given initial state (x0, y0), the decision maker chooses an action a0 and receives the reward r(x0, y0, a0). The states x1 and y1 corresponding to t = 1 are drawn according to the probabilities Tx(x0, y0, a0, ·) and Ty(x0, y0, a0, x1, ·). The observation o1 is drawn according to the probability Z(a0, x1, y1, ·). The decision maker then observes x1 and o1, selects a new action a1 and the process repeats.

Figure 2: Illustration of the interdependencies between states, observations and actions in a MOMDP and a stationary MOMDP. The grey area surrounding the variable y indicates that it is partially observed.

The goal of a decision maker is to find a sequence of actions that yields the best expected sum of rewards over time, depending on the selected criterion. Here, we use an infinite time horizon, i.e. the criterion is

$$E\left[\sum_{t=0}^{\infty} \gamma^t r(x_t, y_t, a_t) \,\Big|\, x_0, y_0\right].$$

Because the state yt is not perfectly observable, it is modeled by a probability vector bt, called a belief state, where each component represents a state in the set Y (Astrom 1965). Belief states are sufficient statistics (Bertsekas 1995), i.e. sufficient knowledge about the system is contained in (xt, bt) to make optimal decisions. The set of all belief states is the belief space, denoted B. It is a simplex (e.g. a triangle or a tetrahedron when |Y| = 3 or 4, respectively) whose 'corners' (vertices) correspond to vectors of the form (0, …, 0, 1, 0, …, 0) ∈ B, and where each pair of corners is linked by an edge.

A MOMDP policy π : X × B → A is a mapping from the set of components x and belief states b to the set of actions. A policy π is optimal if it maximizes the selected performance criterion:

$$\pi^* = \arg\max_{\pi}\; E\left[\sum_{t=0}^{\infty} \gamma^t R(x_t, b_t, \pi(x_t, b_t)) \,\Big|\, x_0, b_0\right] \quad (1)$$

with $R(x, b, a) = \sum_{y \in Y} b(y)\, r(x, y, a)$. Any policy π can be assessed through its value function Vπ defined as, for all (x, b) ∈ X × B:

$$V_\pi(x, b) = E\left[\sum_{t=0}^{\infty} \gamma^t R(x_t, b_t, \pi(x_t, b_t)) \,\Big|\, x, b\right], \quad (2)$$

We then have $\pi^* = \arg\max_\pi V_\pi(x_0, b_0)$. Its optimal value function is denoted V*.

An essential property of POMDPs that translates to MOMDPs is that the value function Vπ(x, ·) is piecewise linear convex (PWLC) in the belief state b for finite horizon problems (Smallwood and Sondik 1973). That is, there exists a finite set Γx of |Y|-tuples (called α-vectors hereafter) such that:

$$V_\pi(x, b) = \max_{\alpha \in \Gamma_x} b \cdot \alpha \quad (3)$$

where $b \cdot \alpha = \sum_{y \in Y} b(y)\,\alpha(y)$ is the inner product. In infinite horizon problems, the value function is only guaranteed to be convex, and can be approximated arbitrarily closely by PWLC functions. Initialized with a lower bound of the optimal value function, most MOMDP solvers calculate the policy by updating the sets Γx recursively through Bellman's equation, causing Vπ to increase until it is close enough to the optimal value function. To apply the policy, since each α-vector is associated with an action, the best action to implement at any time step is found by selecting the α-vector that maximizes b · α in Eq. 3. This necessitates knowing the belief state b, which can be calculated recursively. Given the current belief state bt, the current and future states x and x′, the action a and the future observation o′, the future belief state bt+1 is unique and calculated as follows:

$$b_{t+1}(y') = p(y'|x, b_t, a, x', o') = \frac{p(o'|x, b_t, a, x', y')\, p(y'|x, b_t, a, x')}{p(o'|x, b_t, a, x')} = \eta\, Z(a, x', y', o') \sum_{y \in Y} T_x(x, y, a, x')\, T_y(x, y, a, x', y')\, b_t(y) \quad (4)$$

where $\eta = 1/p(o', x'|x, b_t, a)$ is a normalizing term.

Stationary MOMDPs

We call a MOMDP 'stationary' when its partially observable component y is stationary, i.e. it will not change over time. Potential examples include the population dynamics of a species or a patient's condition, which can be reasonably assumed stationary over a short period of time. Regarding adaptive management problems, the partially observable component y represents the transition matrix, while the component x models the observed 'physical' system (Chades et al. 2012). The transition matrix is typically assumed stationary (Walters and Hilborn 1978; Chades et al. 2012; Runge 2013). This means Ty(x, y, a, x′, y′) = 1 if y = y′, and 0 otherwise (Fig. 2b). In this case, the future belief state can be written (Chades et al. 2012):

$$b_{t+1}(y') = \eta\, Z(a, x', y', o')\, T_x(x, y', a, x')\, b_t(y') \quad (5)$$
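To make these updates concrete, here is a minimal sketch of Eq. 4 and of its stationary special case Eq. 5 for tabular MOMDP components; the array index ordering and the function names are illustrative assumptions of this sketch, not part of the published implementation.

```python
import numpy as np

def update_belief(b, x, a, x_next, o_next, Tx, Ty, Z):
    """One-step MOMDP belief update (Eq. 4).

    b  : current belief over Y, shape (|Y|,)
    Tx : Tx[x, y, a, x'] = p(x' | x, y, a)
    Ty : Ty[x, y, a, x', y'] = p(y' | x, y, a, x')
    Z  : Z[a, x', y', o'] = p(o' | a, x', y')
    Returns the normalized belief b_{t+1} over Y.
    """
    # Sum over the current hidden state y for each candidate y'.
    joint = np.einsum('y,y,yz->z', b, Tx[x, :, a, x_next], Ty[x, :, a, x_next, :])
    b_next = Z[a, x_next, :, o_next] * joint
    return b_next / b_next.sum()   # division implements eta = 1 / p(o', x' | x, b, a)

def update_belief_stationary(b, x, a, x_next, o_next, Tx, Z):
    """Stationary case (Eq. 5): Ty is the identity, so y' = y."""
    b_next = Z[a, x_next, :, o_next] * Tx[x, :, a, x_next] * b
    return b_next / b_next.sum()
```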

Proposed approach

In this section we describe how the structure of a stationary MOMDP can be exploited to speed up any α-vector-based MOMDP solver.

Property of stationary MOMDPs

Assume that, at a certain time step t, the transition matrix is known, i.e. bt(y) = 1 for some y ∈ Y and bt(ȳ) = 0 for all ȳ ≠ y. This belief state is a corner of the belief space B and is denoted by the unit vector ey.

In a stationary MOMDP, the corner ey is absorbing (i.e. bt′ = ey for all t′ ≥ t), since for all ȳ ≠ y, bt+1(ȳ) = ηZ(a, x′, ȳ, o′)Tx(x, ȳ, a, x′) × 0 = 0, so bt+1(y) = 1 (the observable component x may still change). From time step t on, the process is a fully observable Markov decision process, with state space X, action space A, transition matrix Tx|y and rewards r|y. The new transition matrix and rewards are the restriction of the MOMDP components to the state y: Tx|y(x, a, x′) = Tx(x, y, a, x′) and r|y(x, a) = r(x, y, a).

Algorithm

Our approach (Algorithm 1) builds on this property to generate a lower bound of the optimal value function. First, the |Y| MDPs that correspond to the corners of the belief space are solved (line 2), providing |Y| optimal MDP policies π*_y and values V*_y. Then, each policy is evaluated on the |Y| − 1 other MDPs (line 6). The combination of these evaluations yields, for each policy, one α-vector per state x ∈ X (line 8). So, there are |X||Y| α-vectors generated in total. The function Init is defined for any (x, b) ∈ X × B as the maximum over these α-vectors.

Algorithm 1 Calculation of the function Init
Input: MOMDP ⟨X, Y, A, O, Tx, Ty, Z, R, γ⟩, Ty = Id
1: for y ∈ Y do
2:   V*_y, π*_y ← SolveMDP(X, A, Tx|y, r|y, γ)
3:   for x ∈ X do
4:     α_{x,y}(y) ← V*_y(x)
5:   for ȳ ∈ Y \ {y} do
6:     V_{y,ȳ} ← PolicyValue(π*_y, X, A, Tx|ȳ, r|ȳ, γ)
7:     for x ∈ X do
8:       α_{x,y}(ȳ) ← V_{y,ȳ}(x)
9: Init: (x, b) ↦ max_{y∈Y} α_{x,y} · b, for (x, b) ∈ X × B
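The following Python sketch mirrors Algorithm 1; `solve_mdp` and `policy_value` stand for any exact MDP solver and policy evaluator (e.g. value iteration and exact policy evaluation) and are assumptions of this illustration rather than the solvers used in the paper.

```python
import numpy as np

def init_lower_bound(Tx, r, gamma, solve_mdp, policy_value):
    """Algorithm 1: one alpha-vector per (x, y) from the |Y| corner MDPs.

    Tx : Tx[x, y, a, x']  transition probabilities of the observed component
    r  : r[x, y, a]       rewards
    Returns alphas[x, y, :] and a function computing Init(x, b).
    """
    n_x, n_y = r.shape[0], r.shape[1]
    alphas = np.zeros((n_x, n_y, n_y))
    for y in range(n_y):
        # Line 2: solve the MDP obtained by restricting the MOMDP to corner y.
        V_y, pi_y = solve_mdp(Tx[:, y, :, :], r[:, y, :], gamma)
        alphas[:, y, y] = V_y                          # lines 3-4
        for y_bar in range(n_y):
            if y_bar == y:
                continue
            # Line 6: evaluate policy pi_y on the MDP of corner y_bar.
            V_cross = policy_value(pi_y, Tx[:, y_bar, :, :], r[:, y_bar, :], gamma)
            alphas[:, y, y_bar] = V_cross              # lines 7-8

    # Line 9: Init(x, b) = max over y of the inner product alphas[x, y] . b
    def init_value(x, b):
        return max(alphas[x, y] @ b for y in range(n_y))
    return alphas, init_value
```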

Theoretical results

Proposition 1. The function Init is a lower bound of the optimal value function V*.

Proof. Let y ∈ Y. By linearity, the linear functions (x, b) ↦ α_{x,y} · b equal the value functions of the MOMDP policy consisting of implementing the action π*_y(x) in state x ∈ X, with no regard to the observations of y and no belief state calculation. Consequently, these functions are lower bounds of V*; so is Init by definition.

Proposition 2. Algorithm 1 runs in polynomial time in the number of states |X|, |Y| and actions |A|.

Proof. Algorithm 1 consists of solving |Y| MDPs, which can be solved in polynomial time in |X| and |A| (Littman, Dean, and Kaelbling 1995). The evaluation of |Y| MDP policies |Y| − 1 times also runs in polynomial time.

So, the lower bound can be quickly computed and used as an initial value function in any α-vector-based solver. A good initial value function (i.e. not too far from the optimal value function) can be critical for solving large stationary MOMDPs rapidly, since the value function is calculated recursively through Bellman's equation. In the following we show that the lower bound is optimal in all corners ey:

Proposition 3. V*_y(x) = V*(x, ey), for all (x, y) ∈ X × Y.

Proof. As discussed above, when bt = ey the MOMDP behaves like a classic MDP. Being the optimal MDP value function, V*_y is by definition no smaller than any other value function, including V*(·, ey). Conversely, since the process is also part of the MOMDP, the optimal MOMDP value function V* satisfies V*(x, ey) ≥ V*_y(x) for all x ∈ X.


Figure 3: Illustration of an optimal value function V*, Init and f = V* − Init between ey and eȳ for a given x ∈ X. In infinite time horizon, the optimal value function V* is convex but is not necessarily piecewise linear (e.g. near eȳ). Under Assumption 1, the derivative of f in ey and eȳ equals zero, i.e. Init is a linear approximation of V* in neighborhoods around the corners of the belief space.

However, optimality in the corners does not imply that the lower bound Init will be close to V* in the center of the belief space. The following property states that Init is a linear approximation of V* in neighborhoods around the corners of the belief space.

We prove that, under some assumptions (Assumptions 1 and 2 below), the directional derivatives of V* in the corners exist and equal those of Init. Formally, the directional derivative of V* in (x, ey) along a vector d is defined as

$$\nabla_d V^*(x, e_y) = \lim_{h \to 0} \frac{V^*(x, e_y + hd) - V^*(x, e_y)}{h}.$$

With Assumption 1, the allowed 'directions' d are along the edges of the belief space (Theorem 1). With Assumptions 1 and 2, all directions are allowed (Theorem 2).

First, satisfying Assumption 1 ensures that the optimal MDP policies are optimal around the corners, and not just in the corners. For all (x, a) ∈ X × A, denote πx,a the policy selecting a in state x and following π*_y in other states.

Assumption 1: There exists y ∈ Y such that, for each (x, a) ∈ X × A, the optimal MDP policy π*_y satisfies either:

• V*_y(x, ey) > V_{πx,a}(x, ey) (i.e. π*_y(x) is strictly better than a in state x);

• or, for all ȳ ∈ Y, Tx(x, ȳ, π*_y(x), ·) = Tx(x, ȳ, a, ·) and r(x, ȳ, π*_y(x)) = r(x, ȳ, a) (i.e. π*_y(x) and a have identical outcomes in state x).

In other words, we do not consider cases where, for some state x, two optimal actions for transition matrix y have a different transition or reward on some transition matrix ȳ ∈ Y. Under Assumption 1, the directional derivatives of V* and Init in the corners towards other corners are equal (Fig. 3):

Theorem 1. We assume that Assumption 1 is satisfied for some y ∈ Y. For all x ∈ X, the directional derivative of the optimal value function in (x, ey) with respect to any ȳ ≠ y equals that of the function Init (obtained with Algorithm 1). Let d = eȳ − ey. For all x ∈ X and ȳ ∈ Y, we have:

$$\nabla_d V^*(x, e_y) = \nabla_d \mathrm{Init}(x, e_y) = \alpha_{x,y} \cdot e_{\bar{y}} - \alpha_{x,y} \cdot e_y \quad (6)$$

Sketch of proof. (Full proof available in Appendix.)

(a) Assumption 1 implies that the optimal MDP policy π*_y is identical to the optimal MOMDP policy π* in a neighborhood of the corner (x, ey).

(b) We show that the belief in transition matrix ȳ does not grow by more than a constant factor from one belief state bt to its successors. The constant equals

$$\max\left\{ \frac{Z(a, x', \bar{y}, o')\, T_x(x, \bar{y}, a, x')}{Z(a, x', y, o')\, T_x(x, y, a, x')} \;\Big|\; x, x' \in X,\ a \in A,\ o' \in O,\ Z(a, x', y, o')\, T_x(x, y, a, x') \neq 0 \right\}.$$

(c) Combining (a) and (b) applied recursively, we deduce that π*_y and π* will be identical for as many time steps as we want, provided bt is close enough to ey.

(d) This implies that the distributions of rewards and belief states for π*_y and π* will be identical for as many time steps as we want. So, the difference between V* and Init will only be due to events happening after a number of time steps t′ which increases as bt converges to ey.

(e) The impact of these future events on V* − Init can be bounded by γ^{t′} C‖bt − ey‖₁ (with C a constant), which implies that the difference V* − Init has derivative zero.

Another assumption on the transition matrices can yield a stronger version of the theorem.

Assumption 2: There exists y ∈ Y such that, for each (x, x′) ∈ X × X, if Z(π*_y(x), x′, y, o′) Tx(x, y, π*_y(x), x′) = 0, then Z(π*_y(x), x′, ȳ, o′) Tx(x, ȳ, π*_y(x), x′) = 0 for all ȳ ∈ Y.

In other words, an event that is impossible to observe for transition matrix y cannot be observed for any other transition matrix. This happens, for example, when all scenarios concur on which events are possible and which are not.

Theorem 2. We assume that Assumptions 1 and 2 are satisfied for some y ∈ Y. Then, for all x ∈ X, the directional derivative of the optimal value function in (x, ey) in any direction equals that of the function Init. For all (x, b) ∈ X × B, denoting d = b − ey, we have:

$$\nabla_d V^*(x, e_y) = \nabla_d \mathrm{Init}(x, e_y) = \alpha_{x,y} \cdot b - \alpha_{x,y} \cdot e_y \quad (7)$$

So, under Assumption 1, the lower bound Init has the same derivative in the corners as the optimal value function along the edges. Under Assumptions 1 and 2, their directional derivatives in the corners are equal along any direction inside the belief space. These theorems state that the lower bound is a linear approximation of the optimal value function in neighborhoods of the corners of the belief space. We now introduce the real-world case study used to evaluate the validity of our approach.

Case study: managing invasive Aedes albopictus

The Asian tiger mosquito Aedes albopictus is a known vector of several pathogens. Although the Australian mainland is currently not infested, the nearby Torres Strait Islands are (Ritchie et al. 2006). The N = 17 inhabited islands constitute potential sources for the introduction of Aedes albopictus into mainland Australia through numerous human-related pathways between the islands and towards north-east Australia (see map in Fig. 4).

Figure 4: Connections between islands depict the possibilities of colonization of the mosquitoes on susceptible islands.

Management actions on islands include the treatment of containers and mosquitoes with diverse insecticides. Since budget is limited, not all islands can be treated simultaneously. The objective is to select islands to manage to maximize the expected time before the mainland becomes infested. The effect of distances and populations on the probability of dispersal between islands and the effectiveness of some of the management actions are partially unknown (i.e. the transition matrix is unknown). A mix of expert data and literature review led us to narrow the number of transition matrices down to eight. As is traditional in adaptive management, these transition matrices are assumed equally likely at t = 0, i.e. the initial belief state equals (1/8, …, 1/8). This decision problem is modeled as a MOMDP in which:

• The observable component x ∈ X specifies the season (wet/dry) and the presence or absence of the mosquitoes across the N islands (|X| = 2^{N+1} + 1). The last '+1' is an absorbing state corresponding to the presence of mosquitoes on the mainland. The component y ∈ Y is the unknown true transition function, with |Y| = 8;

• Each action a ∈ A describes which islands should be managed (up to three simultaneously) and the type of management (light or strong);

• The transition probabilities Tx(x, y, a, x′) account for the possible eradications and transmissions between islands. Also, Ty(x, y, a, x′, y′) = 1_{y=y′} (stationary);

• The reward r(x, y, a) equals 0 if the mainland is infested and 0.5 otherwise (it only depends on x);

• O = X is the finite observation space;

• Z(a, x′, y′, o′) = 1_{o′=x′} (x′ is fully observable);

• γ should ideally be 1 so that the MOMDP value equals the expected time before infestation of Australia (in years, since each time step equals six months). Since most solvers do not support such a setting, we set γ = 0.999.

We also tested our approach on adaptive management

problems of migratory shorebirds taken from Nicol et al. (2013), where we have changed the transition matrix from non-stationary to stationary. We programmed our approach with the MOMDP solver MO-SARSOP (Kurniawati, Hsu, and Lee 2008; Ong et al. 2010), with the MDPSolve package (https://sites.google.com/site/mdpsolve/), and with the POMDP solver Perseus with 500 belief states (Spaan and Vlassis 2005). We compare the modified solvers (marked with a '+') with the original solvers through the quality of their initialization (Table 1) and their convergence speed (Fig. 5). Note that MO-SARSOP has an advanced lower bound implementation, which we have replaced with our lower bound. MO-SARSOP also initializes an upper bound (fast-informed bound), which is optimal in the corners for all case studies. Perseus initializes its value function with the constant $\frac{1}{1-\gamma} \min_{x,y,a} r(x, y, a)$.

Results

We show the computation times of mosquito instances with the number of islands ranging from 7 to 9 (Table 1). Problems for more than 9 islands were not tractable. We show problems Grey-tailed tattler, Red knot pearsonii and Red knot rogersi from Nicol et al. (2013). For problems Lesser sand plover, Terek sandpiper and Bar-tailed godwit m, the initialization is already optimal in MO-SARSOP. Our computer ran out of memory when solving the problems Great knot, Far eastern curlew and Curlew sandpiper.

Table 1: Initial values and initialization times of original and modified (+) MO-SARSOP and Perseus. These are the values of the initial belief state, of the form (1/|Y|, …, 1/|Y|) in all problems. Experiments conducted on a dual 3.46 GHz Intel Xeon X5690 with 96 GB of memory. Each cell shows the initial value, with the initialization time in parentheses.

Instance (|X|/|Y|/|A|)           MO-SARSOP       MO-SARSOP+      Perseus        Perseus+
7 islands (257/8/113)            11.7 (159 s)    16.7 (165 s)    0 (0 s)        16.7 (76 s)
8 islands (513/8/157)            12.2 (740 s)    17.4 (771 s)    0 (0 s)        17.4 (169 s)
9 islands (1025/8/211)           12.5 (3244 s)   17.4 (3316 s)   intractable    intractable
Grey-tailed tattler (972/3/6)    4987 (23 s)     5167 (20 s)     836 (0 s)      5167 (70 s)
Red knot pearsonii (8748/3/8)    6049 (140 s)    6049 (125 s)    4444 (0 s)     6049 (874 s)
Red knot rogersi (8748/3/8)      6906 (717 s)    6947 (592 s)    intractable    intractable

Modified solvers consistently obtain a better initial value than original solvers, with the exception of MO-SARSOP on Red knot pearsonii (equal value). Moreover, MO-SARSOP+ initializes roughly as quickly as MO-SARSOP. The initialization in Perseus is much quicker than in Perseus+ but at the cost of a lower initial value (0 in our case study because min_{x,y,a} r(x, y, a) = 0).

Fig. 5 illustrates the evolution of the value over time for the original and modified solvers, and the upper bound as calculated in MO-SARSOP+. The modified solvers consistently outperform the original solvers. In our case study Aedes albopictus, MO-SARSOP+ obtains much better initial values than MO-SARSOP (7 islands, Fig. 5a). All solvers converge very slowly, which makes this initial value all the more critical. For Red knot rogersi (Fig. 5b), MO-SARSOP+ initializes more rapidly and with a better value than MO-SARSOP, leading to a rapid reduction of the optimality gap (i.e. difference to the upper bound). Regarding Red knot pearsonii, our approach does not improve the initial value, but it significantly accelerates the reduction of the optimality gap (Fig. 5c). This supplements Theorems 1 and 2 in suggesting that the generated α-vectors do not solely yield a good value on the initial belief state but all across the belief space, which allows generating good future α-vectors through Bellman's equations. Finally, for all but small problems, Perseus suffers from a poor initial value and is outperformed by Perseus+ (Fig. 5d).

Figure 5: Values over time of original and modified MO-SARSOP and Perseus on 4 problems: (a) MO-SARSOP, 7 islands; (b) MO-SARSOP, Red knot rogersi; (c) MO-SARSOP, Red knot pearsonii; (d) Perseus, Grey-tailed tattler. Values are plotted against computation time (s); runs were stopped after 3600 s including initialization. We also show the upper bound as calculated in modified MO-SARSOP. The point corresponding to the initialization time and initial value is circled.

Discussion

We proposed a method to improve the initialization of a MOMDP solver in the case where the partially observable component is stationary. We showed that our approach, which consists of solving a number of Markov decision processes, generates a lower bound that is optimal in the corners of the belief space. With an additional assumption about the optimal policy, we demonstrated that this lower bound is also a linear approximation to the value function. This simple and inexpensive initial lower bound can be used as an initialization to any α-vector-based solver. Tested on two state-of-the-art MOMDP and POMDP solvers, our approach showed significant computational gains on a novel computational sustainability case study of management of an invasive species and on a previously published data challenge.

Our approach has several benefits. It quickly identifies the optimal MDP policies and their values, which solvers may take a very long time to match (Fig. 5a). Since α-vectors are updated recursively through Bellman's equation, α-vector-based solvers rely very much on a good initial value function. Our initial lower bound algorithm has proven to trigger a steeper reduction of the gap in the first steps of computation (Fig. 5b, 5c).

Assumption 1 (two non-identical actions cannot both be optimal) may seem like a strong assumption. However, the set of 'degenerate' instances has measure zero, i.e. a random MOMDP instance will satisfy Assumption 1 with probability 1. As meaningful instances are not random and may well be degenerate, one can slightly perturb their rewards to avoid having two optimal actions. The same goes for Assumption 2, where one can perturb the transition matrices to ensure a transition matrix cannot have probability 0 where other transition matrices have non-zero probability. So, with an arbitrarily small impact on the value of any policy, the assumptions can be fulfilled and the property of linear approximation can be guaranteed.

This property can be exploited in various ways. First, a belief state that is close to a corner can be approximated with the initial value, which would save storage space and backup time. Ideally, the error incurred should be controlled and linked to the distance between the belief state and the corner (also guaranteeing that a policy is near optimal for decision makers), perhaps by bounding the second derivative of the optimal value function. This warrants further research.

The magnitude of the optimality gap after our initialization provides precious information to decision makers. A small optimality gap means that some optimal MDP policies are robust to a transition matrix falsely identified as being true, so adaptive approaches might not be necessary. A large gap shows that poor knowledge will be heavily penalized and is an incentive to use adaptive methods to reduce the uncertainty; if the value is a financial cost or benefit, this provides an idea of how much money could be spent to reduce uncertainty.

Our approach could be of use in other contexts of computational sustainability. In medical science, trade-offs may occur between learning about a patient's condition and minimising the risk of death, complications, or discomfort (Hauskrecht 1997). In education, an educator may learn a student's profile while teaching in order to identify the best way of teaching (Cassandra 1998).

Apart from computational sustainability, the maintenance of machines, networks or infrastructures (Faddoul et al. 2015) could benefit from our approach, with the partially observable component containing information about the inner state to be maintained, e.g. deterioration or flaws. In marketing, a company or salesperson can learn about the customer as they are implementing their marketing strategy (Zhang and Cooper 2009). Dias, Vermunt, and Ramos (2015) infer hidden parameters driving stock markets; stationary MOMDPs would allow merging the learning and decision processes.

The method can be extended and improved in several ways. Nicol et al. (2013) extended the traditional adaptive management framework by assuming the transition matrix non-stationary. Our approach does not work under this assumption because in this case the corners of the belief space are not absorbing, and so the optimal values on the corners cannot be obtained by solving MDPs. However, we hope our research will lead to a stronger focus from the artificial intelligence community on improving lower bounds for general-case MOMDPs or POMDPs. Another common assumption is the finite number of transition matrices; by contrast, Merl et al. (2009) sample continuous parameters with a Monte Carlo approach, which could be combined with our algorithm. Finally, we could not solve very large instances that Nicol et al. (2013) solved with Symbolic Perseus, a factored POMDP solver (Poupart 2005). Our approach could be adapted to factored POMDPs by solving factored MDPs (Hoey et al. 1999), also allowing us to solve our case study with a higher number of islands.

Acknowledgments

This research is supported by an Industry Doctoral Training Centre scholarship (MP) and CSIRO Julius Career Awards (IC). We acknowledge the critical contributions of time and expertise provided by the Aedes albopictus Technical Advisory Group, and other experts who participated in the expert elicitation workshop. We thank Cassie C. Jansen, Chrystal Mantyka-Pringle, Sam Nicol and Nancy A. Schellhorn for treatment and analysis of the data. We also thank Yann Dujardin and Andrew Higgins for their valuable feedback.

References

Astrom, K. J. 1965. Optimal control of Markov processes with incomplete state information. Journal of Mathematical Analysis and Applications 10:174–205.
Bertsekas, D. P. 1995. Dynamic programming and optimal control. Athena Scientific, Belmont, MA.
Cassandra, A. R. 1998. A survey of POMDP applications. In Working Notes of AAAI 1998 Fall Symposium on Planning with Partially Observable Markov Decision Processes, 17–24.
Chades, I.; Carwardine, J.; Martin, T. G.; Nicol, S.; Sabbadin, R.; and Buffet, O. 2012. MOMDPs: A Solution for Modelling Adaptive Management Problems. In The Twenty-Sixth AAAI Conference on Artificial Intelligence, 267–273. Toronto, Canada: AAAI Press.
Chades, I.; Nicol, S.; Rout, T. M.; Peron, M.; Dujardin, Y.; Pichancourt, J.-B.; Hastings, A.; and Hauser, C. E. 2016. Optimization methods to solve adaptive management problems. Theoretical Ecology 1–20.
Dias, J. G.; Vermunt, J. K.; and Ramos, S. 2015. Clustering financial time series: New insights from an extended hidden Markov model. European Journal of Operational Research 243(3):852–864.
Duff, M. 2003. Design for an optimal probe. In Proceedings of the 20th International Conference on Machine Learning, 131–138.
Faddoul, R.; Raphael, W.; Soubra, A.-H.; and Chateauneuf, A. 2015. Partially Observable Markov Decision Processes incorporating epistemic uncertainties. European Journal of Operational Research 241(2):391–401.
Frederick, S. W., and Peterman, R. M. 1995. Choosing fisheries harvest policies: when does uncertainty matter? Canadian Journal of Fisheries and Aquatic Sciences 52(2):291–306.
Hauskrecht, M. 1997. Planning and control in stochastic domains with imperfect information. Ph.D. Dissertation, Massachusetts Institute of Technology.
Hoey, J.; St-Aubin, R.; Hu, A.; and Boutilier, C. 1999. SPUDD: Stochastic planning using decision diagrams. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, 279–288. Morgan Kaufmann Publishers Inc.
Johnson, F. A.; Kendall, W. L.; and Dubovsky, J. A. 2002. Conditions and limitations on learning in the adaptive management of mallard harvests. Wildlife Society Bulletin 176–185.
Kurniawati, H.; Hsu, D.; and Lee, W. S. 2008. SARSOP: efficient point-based POMDP planning by approximating optimally reachable belief spaces. In Robotics: Science and Systems (RSS), 65–72.
Littman, M. L.; Dean, T. L.; and Kaelbling, L. P. 1995. On the complexity of solving Markov decision problems, 394–402. Morgan Kaufmann Publishers Inc.
Merl, D.; Johnson, L. R.; Gramacy, R. B.; and Mangel, M. 2009. A statistical framework for the adaptive management of epidemiological interventions. PloS One 4(6):e5807.
Moore, C. T., and Conroy, M. J. 2006. Optimal regeneration planning for old-growth forest: addressing scientific uncertainty in endangered species recovery through adaptive management. Forest Science 52(2):155–172.
Nicol, S.; Buffet, O.; Iwamura, T.; and Chades, I. 2013. Adaptive management of migratory birds under sea level rise. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, 2955–2957. Beijing, China: AAAI Press.
Ong, S. C. W.; Png, S. W.; Hsu, D.; and Lee, W. S. 2010. Planning under uncertainty for robotic tasks with mixed observability. International Journal of Robotics Research 29:1053–1068.
Poupart, P. 2005. Exploiting structure to efficiently solve large scale partially observable Markov decision processes. Ph.D. Dissertation, University of Toronto, Toronto.
Ritchie, S. A.; Moore, P.; Carruthers, M.; Williams, C.; Montgomery, B.; Foley, P.; Ahboo, S.; Van Den Hurk, A. F.; Lindsay, M. D.; and Cooper, B. 2006. Discovery of a widespread infestation of Aedes albopictus in the Torres Strait, Australia. Journal of the American Mosquito Control Association 22:358–365.
Runge, M. C. 2013. Active adaptive management for reintroduction of an animal population. The Journal of Wildlife Management 77(6):1135–1144.
Sigaud, O., and Buffet, O. 2010. Markov Decision Processes in Artificial Intelligence. New York, NY, USA: John Wiley & Sons, Inc.
Smallwood, R. D., and Sondik, E. J. 1973. Optimal control of partially observable Markov processes over a finite horizon. Operations Research 21:1071–1088.
Spaan, M., and Vlassis, N. 2005. Perseus: Randomized point-based value iteration for POMDPs. Journal of Artificial Intelligence Research 24:195–220.
Vlassis, N.; Ghavamzadeh, M.; Mannor, S.; and Poupart, P. 2012. Bayesian reinforcement learning. In Reinforcement Learning, 359–386. Springer.
Walters, C. J., and Hilborn, R. 1976. Adaptive control of fishing systems. Journal of the Fisheries Board of Canada 33(1):145–159.
Walters, C. J., and Hilborn, R. 1978. Ecological optimization and adaptive management. Annual Review of Ecology and Systematics 9:157–188.
Zhang, D., and Cooper, W. L. 2009. Pricing substitutable flights in airline revenue management. European Journal of Operational Research 197(3):848–861.


Chapter 6

Continuous-time dual control

In all three communities that attempted to address structural uncertainty (Section 2.2.1), researchers have focussed on solving discrete-time problems, which leads to the use of dynamic programming. Dynamic programming has its advantages, e.g. the guarantee that the returned policy is optimal (if tractable). However, it also falls prey to Bellman's curse of dimensionality, as we have seen in the previous chapter, where no more than nine islands could be accommodated. To address this problem, approaches from continuous-time optimal control are inspiring because they do not suffer from the curse of dimensionality (Section 2.2.4).

In this chapter, we address our fourth research question by using tools from optimal control theory to handle structural uncertainty, on a stylised multidimensional problem. We show how this problem can be transformed to apply optimal control tools, and compare the performance of this approach to a dynamic programming approach. Our algorithm rivals dynamic programming on small problems and remains tractable on problems of higher dimensions, in contrast to dynamic programming. Besides, it achieves the right balance between aggressive and smoothly varying controls. This chapter has been submitted to Optimal Control Applications and Methods.

Statement of joint authorship:

• Martin Peron conceived the presented idea, developed the theory, designed and implemented the optimisation models, performed the analysis, drafted most of the manuscript and acted as corresponding author.

• Christopher M. Baker guided the research and provided assistance in designing, implementing, and writing on the optimal control approach.

• Barry D. Hughes contributed to building the optimal control model by providing mathematical insights on stochastic processes and edited the manuscript.

• Iadine Chades directed the research and edited the manuscript.



RESEARCH ARTICLE

Continuous-time dual control

Martin Péron*1,2 | Christopher M. Baker1,3 | Barry D. Hughes4 | Iadine Chadès2

1School of Mathematical Sciences, Queensland University of Technology, Brisbane QLD 4000, Australia
2Land and Water, CSIRO, Ecosciences Precinct, Dutton Park, Queensland 4102, Australia
3School of Biological Sciences, University of Queensland, St Lucia, Queensland 4072, Australia
4School of Mathematics and Statistics, University of Melbourne, Parkville, Victoria 3010, Australia

Correspondence: *Martin Péron. Email: [email protected]

Present Address: Queensland University of Technology, 2 George St, Brisbane City QLD 4000. +61 (0) 421 488 778

Summary

Dual control denotes a class of control problems where the parameters governing the system are imperfectly known. The challenge is to find the optimal balance between probing, i.e. exciting the system to understand it more, and caution, i.e. selecting conservative controls based on current knowledge to achieve the control objective. Dynamic programming techniques can achieve this optimal trade-off. However, while dynamic programming performs well with discrete state and time, it is not adapted to problems with continuous time-frames or continuous or unbounded state spaces. Another limitation is that multidimensional states often cause dynamic programming approaches to be intractable. In this paper, we investigate whether continuous-time optimal control tools could help circumvent these caveats whilst still achieving the probing–caution balance. We introduce a stylized problem where the state is governed by one of two differential equations. It is initially unknown which differential equation governs the system, so we must simultaneously determine the 'true' differential equation and control the system to the desired state. We show how this problem can be transformed to apply optimal control tools, and compare the performance of this approach to a dynamic programming approach. Our results suggest that the optimal control algorithm rivals dynamic programming on small problems, achieving the right balance between aggressive and smoothly varying controls. As opposed to dynamic programming, the optimal control approach remains tractable when several states are to be controlled simultaneously.

KEYWORDS: Markov decision processes, dual control, mixed observability Markov decision process, stochastic differential equations, dynamic programming, adaptive control

1 INTRODUCTION

Since many real-world control problems are stochastic, decision makers are often uncertain about how their actions affect the system1. This 'structural uncertainty' must be accounted for to make optimal decisions. In the control literature, where the aim is typically to control a system towards a desired state, this problem is called dual control1,2, while in environmental sciences it is called adaptive management3,4. The problem arises in a broad range of fields, including industry5, conservation6,7 and natural resource management8,9. In both the optimal control and environmental science communities, the problem is modeled by uncertain parameters augmenting the system state. Learning these parameters is not the objective; rather, the objective only depends on the state and control. Thus, the optimal solution trades off decisions to better understand the system and decisions to guide the system towards a better state. Hence, informative controls (i.e. improving knowledge) should only be chosen over more rewarding controls if the long-term benefits of learning outweigh the short-term loss of performance3. In dual control terminology, this is 'finding the optimal balance between probing and caution'2.

Achieving the balance between probing and caution is not an easy task. Researchers have mostly focused on solving discrete-time and discrete-state problems4,2,1,10, as they are deemed easier to solve. There exist continuous-time exceptions but with no attempt at finding the optimal trade-off11: the control follows the certainty equivalence principle, where the control is chosen as if the current estimate of the uncertain parameter were true2. In contrast, it is in theory possible to find the optimal discrete-time control by modeling the problem as a Markov decision problem (or a variant) and solving it by using stochastic dynamic programming12,6.

The traditional discrete-time approach to this problem suffers from two main drawbacks. First, there are many real-world problems that require continuous attention, including stock markets13, flight trajectories14 or medical sciences15. Second, except for some simple linear systems, dynamic programming is the method of choice for solving discrete-time dual control problems1. This implies a finite set of states needs to be specified, which makes it challenging to solve problems where the state is unbounded or multidimensional (curse of dimensionality)16. This is further reinforced by the PSPACE-complete complexity of such problems6. For these reasons, the optimal dual controller has even been said to be 'impossible' to calculate for real-world processes1.

Tools from continuous-time optimal control can help circumvent these caveats. Continuous-time optimal control problems can be solved by different methods, one of which is the Pontryagin minimum principle2. This approach leads to differential equations that can be solved numerically to find the optimal control. Although this approach is efficient, it does not naturally deal with stochastic systems, and even less with dual control problems.

Here, we investigate whether optimal control tools can be used to find actively learning policies in a continuous-time, unbounded- and continuous-state setting. We first introduce a continuous-time, unbounded- and continuous-state problem. Then, we show how to augment the physical state with an unknown information state, which represents the uncertain system dynamics. We present the modeling and solving tools from both dynamic programming and optimal control theory. For the optimal control, we identify the stochastic differential equation that the information state satisfies and we circumvent the stochasticity of our problem by replacing both the information and physical states by their expected values. The resulting deterministic problem can be solved with an optimal control algorithm, namely a forward-backward algorithm17,18,19. We then evaluate both approaches through simulations in the real stochastic problem.

2 MATERIAL AND METHODS

In this section, we introduce the decision problem and the basic elements of solution methods. We then show how this problem can be modeled and solved using mixed observability Markov decision processes (MOMDPs). We then introduce our approach based on optimal control before comparing it to the MOMDP approach.

2.1 The problem: continuous-time dual control

We aim to control a system with state x(t) ∈ ℝ by choosing a certain control u(t) ∈ [−U, U] for time t ∈ [0, T]. We use the concise notation u(t) but the control may implicitly depend on the history of states and controls up to time t. The true state of the system, x, is governed by one of the two following stochastic differential equations:

$$dx(t) = u(t)\,dt + dB_t, \quad (1)$$
$$dx(t) = -u(t)\,dt + dB_t, \quad (2)$$

where $B_t$ is a Wiener process, which satisfies for all t, t′ ∈ [0, T] with t ≤ t′:

$$B_{t'} - B_t \sim \mathcal{N}(0,\, t' - t), \quad (3)$$

independently of past values $B_s$, s < t. In this problem, the state, x(t), is perfectly observable.

We set as our objective that the state x(t) and the control u(t) are both kept small in a suitable mean-square sense over the time interval [0, T]:

$$\min_u\; E\left[\int_0^T \left[x^2(t) + u^2(t)\right] dt\right]. \quad (4)$$
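As an aside, the stochastic system of Eqs. 1–2 and the cost of Eq. 4 can be simulated directly with an Euler–Maruyama scheme; the sketch below is illustrative only, and the time step, horizon and constant probing control are assumptions rather than values used in the paper.

```python
import numpy as np

def simulate_cost(control_fn, true_eq_is_1, T=1.0, dt=0.01, x0=0.0, rng=None):
    """Simulate dx = +/- u dt + dB (Eqs. 1-2) and accumulate the cost of Eq. 4."""
    rng = rng or np.random.default_rng()
    sign = 1.0 if true_eq_is_1 else -1.0      # +u under Eq. 1, -u under Eq. 2
    x, t, cost = x0, 0.0, 0.0
    while t < T:
        u = control_fn(t, x)
        cost += (x**2 + u**2) * dt                              # running cost
        x += sign * u * dt + rng.normal(0.0, np.sqrt(dt))       # Euler-Maruyama step
        t += dt
    return cost

# Example: average cost of a constant probing control when Eq. 1 is true.
costs = [simulate_cost(lambda t, x: 0.5, True) for _ in range(1000)]
print(np.mean(costs))
```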


The effect of the control u(t) on the state x(t) is initially unknown and depends on which of Eq. 1 or 2 is true. However, whether Eq. 1 or 2 is true should become clearer over time, based on past observations of the variation of x(t).

2.2 Information state

A common approach in both dual control (state augmentation2) and adaptive management4 is to create an information state, y, representing the uncertain system dynamics. In our case, the state y equals 1 if Eq. 1 is true and 0 if Eq. 2 is true. Since our knowledge is imperfect and may vary over time, we use the notation y(t) to describe our belief that Eq. 1 is true at any time t. Note that y is binary and should not be confused with y(t), which is continuous within [0, 1]. Our belief that Eq. 2 is true is 1 − y(t). We denote by y0 the initial belief at the start of the problem, so y(0) = y0. We will often choose y0 = 1/2 in the experiments to model the lack of prior information (that is, assume the two state equations to be equally likely at t = 0).

We firstly show how to solve this problem using MOMDPs, where both the time-frame and the state spaces are discretized.

2.3 Mixed observability Markov decision processes

A partially observable Markov decision process (POMDP) is a mathematical framework to optimize sequential decisions on a probabilistic system under imperfect observation of the states20. MOMDPs (mixed observability Markov decision processes) are a special case of POMDPs, where the state can be decomposed into a fully observable component, x, and a partially observable component, y21. MOMDPs can model various decision problems where an agent knows its position but evolves in a partially observable environment, or when the transition functions or rewards are uncertain. Formally, a MOMDP21 is a tuple ⟨X, Y, A, O, Px, Py, Z, R⟩ with the following attributes.

• The state space is of the form X × Y, with both X and Y of finite cardinality. The current state (x, y) fully specifies the system at every time step. The component x ∈ X is assumed fully observable and y ∈ Y is partially observable.

• The action space A is finite.

• Transition probabilities between states are expressed succinctly using the notational conventions that (x, y) denotes the state immediately before action a is implemented and (x′, y′) denotes the state immediately after action a is implemented. Also, where the state y is undetermined after the action a, it is replaced by a bullet point. We define

$$P_x(x, y, a, x') = \Pr\big( (x, y) \to (x', \bullet) \text{ when } a \text{ is implemented} \big),$$

$$P_y(x, y, a, x', y') = \Pr\big( (x, y) \to (x', y') \text{ when } a \text{ is implemented} \;\big|\; (x, y) \to (x', \bullet) \text{ when } a \text{ is implemented} \big).$$

The process satisfies the Markov property in that these probabilities do not depend on past states or actions;

• The reward matrix is the immediate reward r(x, y, a) that the policy-maker receives for implementing a in state (x, y).

• The observation space O is finite.

• The observation probability Z is defined as

$$Z(a, x', y', o') = \Pr\big( \text{observe } o' \in O \;\big|\; \text{state is } (x', y') \text{ after action } a \big). \quad (5)$$

The sequential decision making process unfolds as follows (Figure 1). Starting at time t = 0 in a given initial state (x0, y0), the decision maker chooses an action a0 and receives the reward r(x0, y0, a0). The states x1 and y1 corresponding to t = 1 are drawn according to the probabilities Px(x0, y0, a0, ·) and Py(x0, y0, a0, x1, ·). The observation o1 is drawn according to the probability Z(a0, x1, y1, ·). The decision maker then observes x1 and o1, selects a new action a1 and the process repeats.

The goal of a decision maker is to select actions sequentially to achieve the best expected sum of rewards over time, with respect to a specific optimization criterion. Here, we use a finite time horizon with Nt time steps, and we seek choices of the actions a0, a1, …, a_{Nt−1} to maximize

$$E\left[ \sum_{t=0}^{N_t - 1} r(x_t, y_t, a_t) \,\Big|\, x_0, y_0 \right]. \quad (6)$$


FIGURE 1 Illustration of the interdependencies between states, observations and actions in a MOMDP. The gray area surrounding the variable y indicates that it is partially observed.

In the general case, the choice of action at may depend on the entire history of actions and observations up to time step t21,20. We will present the concept of a policy in the next section more specifically for our context.

2.4 Framing and solving our problem with a MOMDP

In order to frame our problem (Eqs. 1–4) as a MOMDP, we take the following steps:

• We discretize the state x ∈ ℝ into a set X, made of regular intervals of size δx. Concerning the bounds of X, we can only guarantee that x will stay within a given interval with a certain probability, because the state equations contain a white noise. We set the bounds to three Gaussian standard deviations around the initial state x0, capturing 99.7% of the values (if the control remains zero). Values over the bounds will be projected back to the bounds.

• We discretize the time frame into Nt regular time intervals, and we denote by δt = T/Nt the length of each interval. We map these time intervals [0, δt], [δt, 2δt], …, [T − δt, T] to time steps 0, 1, …, Nt − 1 in the MOMDP. The transition probabilities of each action at (time step t) are calculated by assuming that at is implemented continuously on the entire interval [tδt, (t + 1)δt].

• We discretize the control space into a set A, made of regular intervals of size δa. We use the term action for these discretized controls, as in the MOMDP literature.

• MOMDP solvers are well suited to handling the information state y by themselves: they only require the true value (0 or 1) as input, without discretization. However, the best MOMDP solvers are for infinite time horizons and our problem has a finite time horizon; we choose instead to apply dynamic programming on a discretized MOMDP. This comes at the cost of discretizing the different states y ∈ [0, 1]. The resulting set Y is of the form [0, δy, …, 1 − δy, 1] (a construction sketch of these grids is given after this list).
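A minimal sketch of how these grids could be built follows; the grid sizes, horizon and bound U are illustrative assumptions rather than the values used in the paper.

```python
import numpy as np

def build_grids(x0=0.0, T=1.0, U=1.0, Nt=50, n_x=61, n_a=11, n_y=21):
    """Build the discretized state, action, belief and time grids
    described in the list above (all sizes are illustrative)."""
    dt = T / Nt
    # State bounds: three standard deviations of the uncontrolled diffusion
    # over the whole horizon (Var[B_T] = T), centred on the initial state x0.
    x_min = x0 - 3.0 * np.sqrt(T)
    x_max = x0 + 3.0 * np.sqrt(T)
    X = np.linspace(x_min, x_max, n_x)   # discretized physical states
    A = np.linspace(-U, U, n_a)          # discretized controls ("actions")
    Y = np.linspace(0.0, 1.0, n_y)       # discretized information state
    return X, A, Y, dt

X, A, Y, dt = build_grids()
```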

In dynamic programming, we are looking for an optimal policy π : X × Y × {0, 1, …, Nt − 1} → A. It is a mapping from the discretized time and state spaces to the set of actions (or controls) and maximizes the objective criterion. To do so, we evaluate the optimal value function V : X × Y × {0, 1, …, Nt − 1} → ℝ, defined such that V(x, y, t) is the optimal expected sum of rewards received when the system evolves from (x, y) at time t to whatever the final state is at time step Nt − 1. Our goal is to calculate V(x, y, t) and the associated optimal policy π(x, y, t) for all states (x, y) ∈ X × Y and times t ∈ {0, 1, …, Nt − 1}. We can do this by working backwards (Algorithm 1), using the Principle of Optimality16 that all remaining decisions must constitute an optimal policy relative to the current state. Thus

$$V(x, y, N_t - 1) = \max_{a \in A} r(x, y, a), \quad (7)$$
$$\pi(x, y, N_t - 1) = \arg\max_{a \in A} r(x, y, a), \quad (8)$$


and for t < Nt − 1, where P(x′, y′|x, y, a) is the single-step transition probability from state (x, y) to state (x′, y′) after implementing action a, Bellman's equation is satisfied:

$$V(x, y, t) = \max_{a \in A}\left[ r(x, y, a) + \sum_{(x', y') \in X \times Y} P(x', y'|x, y, a)\, V(x', y', t+1) \right], \quad (9)$$

$$\pi(x, y, t) = \arg\max_{a \in A}\left[ r(x, y, a) + \sum_{(x', y') \in X \times Y} P(x', y'|x, y, a)\, V(x', y', t+1) \right]. \quad (10)$$

Note that this policy is optimal in the discretized problem only. It is not guaranteed to be optimal in the real, continuous problem. Its 'real' value will be assessed by simulations and will likely differ from the value predicted by Eq. 9.

Although this dynamic programming approach can work efficiently for small state spaces, it suffers from large or multidimensional state spaces (the curse of dimensionality16). For this reason, we now introduce an alternative approach using tools from optimal control theory.

Algorithm 1 DynamicProgramming(Nt)
1: Initialization: V(x, y, Nt) := 0 for all states x ∈ X, y ∈ Y
2: Initialization: t := Nt − 1
3: while t ≥ 0 do
4:   for x ∈ X, y ∈ Y do
5:     r(x, y, a) := −[a² + x²] δt
6:     V(x, y, t) := max_{a∈A} [ r(x, y, a) + Σ_{x′∈X, y′∈Y} P(x′, y′|x, y, a) V(x′, y′, t + 1) ]
7:     π(x, y, t) := argmax_{a∈A} [ r(x, y, a) + Σ_{x′∈X, y′∈Y} P(x′, y′|x, y, a) V(x′, y′, t + 1) ]
8:   end for
9:   t := t − 1
10: end while
11: Output: The policy π
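A direct Python transcription of this backward recursion might look as follows; the helper `transition_probs`, which must encode the discretized dynamics of Eqs. 1–2 together with the update of the discretized belief y, is left abstract and is an assumption of this sketch, and the negative sign on the reward expresses the cost of Eq. 4 as a reward to be maximized.

```python
import numpy as np

def dynamic_programming(X, Y, A, Nt, dt, transition_probs):
    """Backward dynamic programming (Algorithm 1).

    transition_probs(ix, iy, ia) must return an array P of shape (|X|, |Y|)
    with P[ix2, iy2] = P(x', y' | x, y, a) for the discretized problem.
    Returns the value function V and the policy pi, indexed by (ix, iy, t).
    """
    n_x, n_y, n_a = len(X), len(Y), len(A)
    V = np.zeros((n_x, n_y, Nt + 1))            # terminal values V(., ., Nt) = 0
    pi = np.zeros((n_x, n_y, Nt), dtype=int)
    for t in range(Nt - 1, -1, -1):             # work backwards in time
        for ix in range(n_x):
            for iy in range(n_y):
                q = np.empty(n_a)
                for ia in range(n_a):
                    r = -(A[ia] ** 2 + X[ix] ** 2) * dt      # reward = -cost
                    P = transition_probs(ix, iy, ia)
                    q[ia] = r + np.sum(P * V[:, :, t + 1])   # Bellman backup
                V[ix, iy, t] = q.max()
                pi[ix, iy, t] = q.argmax()
    return V, pi
```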

3 USING TOOLS FROM OPTIMAL CONTROL THEORY

3.1 Continuous-time optimal control

A continuous-time optimal control problem2 aims to control, for time t ∈ [0, T], a system in state x(t) ∈ X by choosing a certain control u(t). The objective is to minimize a certain cost J defined as

$$J(u) = \int_0^T g(x(t), u(t))\, dt + f(x(T)). \quad (11)$$

The state equation describes the evolution of x(t) over time depending on the control u(t) through the differential equation, for all t ∈ [0, T]:

$$\frac{dx(t)}{dt} = h(x(t), u(t)). \quad (12)$$

The control may be bounded by a positive constant U:

$$|u(t)| \leq U. \quad (13)$$

Formally, the objective is:

$$\min_{u(t)} J(u(t)). \quad (14)$$


3.2 The Pontryagin minimum principle

The Pontryagin minimum principle was developed in the 1960s to solve the type of problems defined by Eqs. 11–14. To this end, one introduces the Hamiltonian function

$$H(x(t), u(t), \psi(t)) = g(x(t), u(t)) + \psi(t)\, h(x(t), u(t)), \quad (15)$$

where the adjoint function ψ, which plays a similar role to the Lagrange multiplier in simpler optimization problems, satisfies the differential equation

$$\psi'(t) = -\frac{\partial H(x(t), u(t), \psi(t))}{\partial x} \quad (16)$$

and the boundary condition

$$\psi(T) = f'(x(T)), \quad (17)$$

known as the transversality condition. The optimal solution is found by minimizing the Hamiltonian with respect to the control, e.g. by solving ∂H/∂u = 0 if H is convex. Solving all these equations concurrently gives the optimal control and the corresponding state.

Since optimal control problems can rarely be solved analytically, numerical methods are standard. One common way is known

as the forward-backwards sweep, and is not dissimilar to fixed point iteration18. We start with an initial guess for the controlfunction, and then solve the state equations (Eq. 12) forwards in time.We can then solve the adjoint equations (Eq. 16) backwardsin time, using our guess for the control function and the corresponding state. Finally, at regular times t in the time-frame, thecontrols u(t) are updated so as to minimize H(x(t), u(t), (t)). The optimal u(t) can be found easily when the Hamiltonian isconvex and/or has a simple form—in our case, the Hamiltonian is a one-dimensional quadratic function. This process is repeateduntil convergence.
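As an illustration only, the skeleton below sketches this forward-backward sweep for a scalar problem with `scipy.integrate.solve_ivp`. The dynamics `h`, the Hamiltonian derivative `dH_dx`, the pointwise minimizer `argmin_H` and the transversality map `lam_T` are hypothetical callables standing in for whatever problem is being solved; the damping weight and tolerance are arbitrary choices.

```python
import numpy as np
from scipy.integrate import solve_ivp

def forward_backward_sweep(h, dH_dx, argmin_H, lam_T, x0, T,
                           n=100, omega=0.5, tol=1e-5, max_iter=200):
    """Generic forward-backward sweep for a scalar optimal control problem.

    h(x, u):          state dynamics dx/dt (Eq. 12)
    dH_dx(x, u, lam): partial derivative of the Hamiltonian with respect to x
    argmin_H(x, lam): control minimizing the Hamiltonian at a given (x, lam)
    lam_T(xT):        transversality condition lam(T) = f'(x(T)) (Eq. 17)
    """
    ts = np.linspace(0.0, T, n)
    u = np.zeros(n)                                      # initial guess for the control

    for _ in range(max_iter):
        u_of = lambda t: np.interp(t, ts, u)

        # 1. state equation, integrated forwards in time
        x = solve_ivp(lambda t, s: [h(s[0], u_of(t))], (0.0, T), [x0],
                      t_eval=ts).y[0]
        x_of = lambda t: np.interp(t, ts, x)

        # 2. adjoint equation lam' = -dH/dx, integrated backwards in time
        lam = solve_ivp(lambda t, l: [-dH_dx(x_of(t), u_of(t), l[0])],
                        (T, 0.0), [lam_T(x[-1])], t_eval=ts[::-1]).y[0][::-1]

        # 3. pointwise minimization of the Hamiltonian, with a damped update
        u_new = np.array([argmin_H(xi, li) for xi, li in zip(x, lam)])
        if np.linalg.norm(u_new - u) < tol:
            return ts, x, u_new, lam
        u = omega * u_new + (1.0 - omega) * u
    return ts, x, u, lam
```

Section 3.7 applies the same idea to the specific system of Eqs. 41–50, with a damping weight that decays exponentially over the iterations.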

3.3 Using a deterministic method to solve a stochastic problem

Although the Pontryagin minimum principle is an elegant and efficient way to solve deterministic continuous-time control problems, it is not directly applicable to stochastic differential equations. However, it can provide an approximate policy through the following general approach. First, we search for an (approximate) deterministic optimal control problem which aims at capturing the uncertainty on the states. At regular time intervals, we solve this deterministic problem using the Pontryagin minimum principle and apply the resulting control in the real stochastic problem until the next time interval. Then, the control is re-calculated using the Pontryagin minimum principle with the updated observed states. The new control is then applied until the next time interval, and so on until the end of the time-frame. By repeating this simulation many times, we can evaluate the average performance of this approach on our stochastic problem. We use the same process to evaluate the dynamic programming approach, the only difference being the way the control is found.

3.4 Framing our problem as a deterministic optimal control problem

Let us return to our original problem (Eqs. 1–2). As outlined above, we need to find a deterministic optimal control problem capturing the uncertainty on both x(t) and y(t). A natural approach is to replace the states x(t) and y(t) by their expected values. Let us consider the changes to the state y(t) first, which is our belief that Eq. 1 is true.

3.4.1 Changes to the state y(t)

For t ≥ 0 and Δt > 0 we write Δx = x(t + Δt) − x(t). From Eqs. 1 and 2 we see that the Wiener component of the displacement Δx is

∫_t^{t+Δt} dBt′ = Δx − ∫_t^{t+Δt} u(t′)dt′   if Eq. 1 is true;
                 Δx + ∫_t^{t+Δt} u(t′)dt′   if Eq. 2 is true,   (18)


and so the relative likelihoods of the observed state change Δx are

(1/√(2πΔt)) exp{ −(1/(2Δt)) [ Δx − ∫_t^{t+Δt} u(t′)dt′ ]² }   if Eq. 1 is true;

(1/√(2πΔt)) exp{ −(1/(2Δt)) [ Δx + ∫_t^{t+Δt} u(t′)dt′ ]² }   if Eq. 2 is true.

Based on prior beliefs y(t) and 1 − y(t) for Eqs. 1 and 2, respectively, we can update the information state y(t + Δt) using Bayes' theorem. The updated information state is calculated as the product of the prior belief y(t) and the relative likelihood of the observed state change Δx, and is then normalized:

y(t + Δt) = [ y(t) (1/√(2πΔt)) exp{ −(1/(2Δt)) [ Δx − ∫_t^{t+Δt} u(t′)dt′ ]² } ]
           / [ y(t) (1/√(2πΔt)) exp{ −(1/(2Δt)) [ Δx − ∫_t^{t+Δt} u(t′)dt′ ]² } + (1 − y(t)) (1/√(2πΔt)) exp{ −(1/(2Δt)) [ Δx + ∫_t^{t+Δt} u(t′)dt′ ]² } ]   (19)

          = y(t) exp[ (2Δx/Δt) ∫_t^{t+Δt} u(t′)dt′ ] / ( y(t) exp[ (2Δx/Δt) ∫_t^{t+Δt} u(t′)dt′ ] + 1 − y(t) ).   (20)

It follows from this that

y(t + Δt) − y(t) = h( (2Δx/Δt) ∫_t^{t+Δt} u(t′)dt′ ),   (21)

where for brevity we have written

h(·) = y(t)[1 − y(t)](e^· − 1) / ( y(t)e^· + 1 − y(t) ).   (22)

We note for later use that

h(0) = 0,   h′(0) = y(t)[1 − y(t)],   h″(0) = y(t)[1 − y(t)][1 − 2y(t)].   (23)

We assume that u(t) is right-continuous, which implies that

lim_{Δt↓0} (1/Δt) ∫_t^{t+Δt} u(t′)dt′ = u(t).

We can now infer a stochastic differential equation for y(t). Taking the argument of h in Eq. 21 as 2u(t)dx(t) we have dy(t) = h(2u(t)dx(t)). Further progress requires us to specify which of Eq. 1 or 2 is correct, so we write dx(t) = ±u(t)dt + dBt, where the upper sign is taken if Eq. 1 is correct and the lower sign is taken if Eq. 2 is correct. Thus we have

2u(t)dx(t) = ±2u(t)²dt + 2u(t)dBt,

which is a diffusion with drift μt = ±2u(t)² and standard deviation σt = 2u(t). Following the normal approach in Itô calculus we expand h(2u(t)dx(t)) to second order in the argument of h(·), using the values of h(0), h′(0) and h″(0) noted above, replace (dBt)² by dt and retain only those terms multiplied by a single factor of dt or dBt, giving

dy(t) = [ μt h′(0) + (σt²/2) h″(0) ] dt + σt h′(0) dBt
      = 2y(t)[1 − y(t)][1 ± 1 − 2y(t)]u(t)²dt + 2y(t)[1 − y(t)]u(t)dBt.   (24)


In order to frame the problem as a deterministic optimal control problem, we now calculate the expected value of this derivative by removing the Wiener process:

E[dy(t)/dt] = 4y(t)[1 − y(t)]²u(t)²    if Eq. 1 is true;
             −4[1 − y(t)]y(t)²u(t)²   if Eq. 2 is true.   (25)

This result can be interpreted in the terminology of dynamical systems, if we write a deterministic dy/dt in place of the expectation of the random dy/dt. The states y(t) = 0 and y(t) = 1 (which correspond to certainty about Eq. 2 being true or about Eq. 1 being true, respectively) are equilibria. However, if 0 < y(0) < 1, then if Eq. 1 is true, we have dy/dt > 0, and our confidence in the truth of Eq. 1 increases over time, while if Eq. 2 is true, we have dy/dt < 0, and our confidence in the truth of Eq. 1 decreases over time, while correspondingly our confidence in the truth of Eq. 2 increases. Loosely speaking, in an average sense our evolving confidence is attracted to the fixed point of truth. However, the closer we approach the truth, the more slowly we learn. For example, if Eq. 1 is true, then we have

dy(t)/dt ≤ 4[1 − y(t)]²u(t)²

and we can integrate to deduce that

1 − y(t) ≥ (1 − y(0)) / ( 1 + 4[1 − y(0)] ∫_0^t u(τ)²dτ ).
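For completeness, the integration step can be written out explicitly (our own expansion of the argument, using the substitution w(t) = 1 − y(t)):

dw/dt ≥ −4w(t)²u(t)²   ⟹   d(1/w)/dt = −w′(t)/w(t)² ≤ 4u(t)²   ⟹   1/w(t) ≤ 1/w(0) + 4 ∫_0^t u(τ)²dτ,

and inverting the last inequality gives 1 − y(t) ≥ (1 − y(0)) / ( 1 + 4[1 − y(0)] ∫_0^t u(τ)²dτ ).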

This shows that to achieve high confidence (that is, y(t) ≈ 1 if Eq. 1 is true) from an uncertain initial state, we need very long experience with a weak control (|u(t)| ≪ 1), but a shorter time interval for a stronger control.
The analysis of y(t) has led to the deterministic differential equation 25, which has two different forms depending on which of Eq. 1 or 2 is true. We follow the natural approach that consists of creating two information states y1 and y2 governed by the corresponding cases that arise in Eq. 25:

dy1(t)/dt = 4y1(t)[1 − y1(t)]²u²(t),   (26)

dy2(t)/dt = −4y2(t)²[1 − y2(t)]u²(t).   (27)

The states y1 and y2 will be part of the deterministic optimal control problem. This concludes our analysis of the state y(t).
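As a quick illustration (not part of the original derivation), the snippet below implements the discrete-time belief update of Eq. 20 and simulates it under Eq. 1; the step size, horizon and constant control are arbitrary choices used only to show that the belief drifts towards the truth, as Eqs. 26–27 predict.

```python
import numpy as np

def update_belief(y, dx, u_int, dt):
    """Bayes update of the belief y that Eq. 1 is true (Eq. 20).

    y:     prior belief y(t)
    dx:    observed displacement x(t + dt) - x(t)
    u_int: integral of the control over [t, t + dt]
    """
    w = y * np.exp(2.0 * dx * u_int / dt)
    return w / (w + 1.0 - y)

# Simulate learning under a constant control when Eq. 1 (positive drift) is true.
rng = np.random.default_rng(0)
dt, u, y = 0.01, 1.0, 0.5
for _ in range(int(2.0 / dt)):                    # two units of time
    dx = u * dt + rng.normal(0.0, np.sqrt(dt))    # dx = u dt + dB_t (Eq. 1)
    y = update_belief(y, dx, u * dt, dt)
print(f"belief in Eq. 1 after t = 2: {y:.3f}")    # typically drifts towards 1
```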

3.4.2 Changes to the state x(t)

We now show how to deal with the state x(t). First, we argue that we only need to deal with the case where x(t) is positive, because the problem is symmetric. If x(t) < 0, the two possible true equations can be written

d(−x(t)) = −(u(t)dt + dBt) = (−u(t))dt − dBt,    (if Eq. 1 is true)   (28)
d(−x(t)) = −(−u(t)dt + dBt) = −(−u(t))dt − dBt,   (if Eq. 2 is true)   (29)

Because the Wiener process has no drift, −dBt has the same distribution as dBt. So, the state equation governing −x(t) when applying −u(t) is equal in distribution to the state equation governing x(t) when applying u(t). Also, we can change the sign of u(t) safely because the control space is of the form [−U, U] and the objective depends on u(t)² and thus does not depend on the sign of u(t). Hence, denoting u(t) an optimal control (there might be many) for the states x(t) and y(t), −u(t) is also an optimal control on states −x(t) and y(t). The optimal control is thus of the form

u(t) = sgn(x(t)) fu(|x(t)|, y(t), t),   (30)

where the function fu is to be determined. So, we need only consider z(t) := |x(t)| in the deterministic problem.
We want the deterministic differential equation of z(t) to account for three sources of randomness and uncertainty. Firstly, the true equation is 'selected' randomly at t = 0. Secondly, our knowledge of the true equation is imperfect. Thirdly, each equation is stochastic due to the Wiener process. We treat these three aspects in order.
Firstly, the randomness of the true equation is perhaps the simplest to deal with. Since Eqs. 1 and 2 are respectively true with probabilities y0 and 1 − y0, the derivative of z(t) will be of the form

dz(t)/dt = y0 dz1(t)/dt + (1 − y0) dz2(t)/dt,   (31)


where z1 and z2 represent |x(t)| when Eq. 1 or 2 is true, respectively.
Secondly, let us consider our imperfect knowledge of the true equation. Selecting the optimal control hinges entirely on the information we have, or equivalently, on how we believe the system behaves. So, we define dz1(t)/dt and dz2(t)/dt based on our beliefs of the true equation, rather than on the true equation itself. For example, if we assume that Eq. 1 is true, our beliefs in Eqs. 1 and 2 are y1(t) and 1 − y1(t), which yields:

dz1(t)/dt := y1(t) E[ d|x(t)|/dt | Eq. 1 ] + (1 − y1(t)) E[ d|x(t)|/dt | Eq. 2 ]   (32)
           = sgn(x(t)) ( y1(t) E[ dx(t)/dt | Eq. 1 ] + (1 − y1(t)) E[ dx(t)/dt | Eq. 2 ] )   (33)
           = sgn(x(t)) [2y1(t) − 1] u(t)   (34)
           = [2y1(t) − 1] fu(|x(t)|, y1(t), t).   // from Eq. 30   (35)

Recall that we always want to reduce |x(t)| to minimize costs, so we would like to have dz1(t)/dt ≤ 0. We can do this by setting

sgn(fu(|x(t)|, y1(t), t)) = −sgn(2y1(t) − 1).   (36)

We conjecture that this sign is always optimal, which seems very sensible because there is no incentive to increase x(t) instead of decreasing it. The equation becomes

dz1(t)/dt = −|2y1(t) − 1| |u(t)|.   (37)

For z2, we have

dz2(t)/dt = −|2y2(t) − 1| |u(t)|.   (38)

Finally, combining Eqs. 31, 37 and 38 leads to

dz(t)/dt = −|u| ( y0 |2y1(t) − 1| + (1 − y0) |2y2(t) − 1| ).   (39)

Note that z1 and z2 do not appear in the deterministic model; only z does. This model appears to somehow capture our uncertainty on the variable y. At t = 0, the derivative is −|u||2y0 − 1|, which is small for 'uncertain' values of y0 (around 0.5). This derivative can be seen as 'bridled' in order to penalize poor knowledge. When t increases, the beliefs y1 and y2 converge to 1 and 0 respectively, and the derivative converges to −|u|. Any control u(t) then has a higher impact on z(t) than at t = 0, for the same cost u(t)².
Thirdly, let us address the uncertainty of x(t) due to the Wiener process. The above formula is insufficient: if x0 = 0, then z(0) = 0,

so the optimal control is zero for a total cost of 0. In reality, x(t) varies according to the Wiener process and the total cost is positive almost surely. To account for this, we denote by μ(t) = E|x(t)| the expected modulus of an uncontrolled x(t), with x(0) = 0. Appropriate integration of the Gaussian probability density function for uncontrolled x(t) establishes that μ(t) = (4t/π)^{1/2}, from which it follows that

dμ(t)/dt = 1/(πμ(t)).   (40)

Although this differential equation is an approximation when x(0) ≠ 0, we expect it to remain accurate in our case because a successful control will drive the process towards small values of x.
In order to account for all three aforementioned sources of uncertainty, we define the differential state equation of x(t) in the deterministic problem as the sum of Eqs. 39 and 40:

dx(t)/dt = 1/(πx(t)) − u [ y0 |2y1(t) − 1| + (1 − y0) |2y2(t) − 1| ].   (41)

Note that combining equations this way is also an approximation, because the first term 1/(πx(t)) corresponds to the uncontrolled case. However, this differential equation captures the antagonism between our control under uncertainty (second term), which reduces x(t), and the 'penalty' caused by the stochasticity of the Wiener process (first term),


which increases x(t). The rest of the deterministic problem is

dy1(t)/dt = 4y1(t)(1 − y1(t))²u²(t),   (42)

dy2(t)/dt = −4y2(t)²(1 − y2(t))u²(t),   (43)

min_{u(t)∈[0,U]} ∫_0^T [ x²(t) + u²(t) ] dt,   (44)

with initial conditions

x(0) = x0,   y1(0) = y2(0) = y0.   (45)

Note that we restrict u to non-negative values only, because the choice of sign has been made already. Interestingly, the trade-off on the control u appears perhaps more clearly in this deterministic problem than in the stochastic one, because u² is both a cost to minimize and a linear factor in the derivatives of y1 and y2. In other words, 'extreme' controls are costly but increase our knowledge, potentially leading to better future control. This is a classic trade-off in the dual control and adaptive management literature.
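For concreteness, the deterministic dynamics of Eqs. 41–43 can be written as a single vector field. The sketch below is our own illustration (with the 1/(πx) term as reconstructed above), in a form that could be passed to an ODE solver such as scipy's solve_ivp.

```python
import numpy as np

def deterministic_rhs(t, state, u_of_t, y0):
    """Right-hand side of the deterministic model, Eqs. 41-43.

    state = (x, y1, y2); u_of_t(t) returns the control at time t; y0 is the prior belief.
    Assumes x > 0 (Section 3.4.2 reduces the problem to this case).
    """
    x, y1, y2 = state
    u = u_of_t(t)
    dx = 1.0 / (np.pi * x) - u * (y0 * abs(2.0 * y1 - 1.0)
                                  + (1.0 - y0) * abs(2.0 * y2 - 1.0))   # Eq. 41
    dy1 = 4.0 * y1 * (1.0 - y1) ** 2 * u ** 2                           # Eq. 42
    dy2 = -4.0 * y2 ** 2 * (1.0 - y2) * u ** 2                          # Eq. 43
    return [dx, dy1, dy2]
```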

3.5 Applying the Pontryagin minimum principle

We can now apply the Pontryagin minimum principle. The Hamiltonian is:

H = g + λ · h
  = u(t)² + x(t)² + λx ( 1/(πx(t)) − u(t)y0 |2y1(t) − 1| − u(t)(1 − y0) |2y2(t) − 1| )
    + 4λy1 y1(t)(1 − y1(t))²u²(t) − 4λy2 y2(t)²(1 − y2(t))u²(t).   (46)

The adjoints λx, λy1 and λy2 satisfy the equations:

dλx/dt = −∂H/∂x = 1/(πx(t)²) + 2x(t),   (47)

dλy1/dt = −∂H/∂y1 = 2u(t)λx(t)y0 |2y1(t) − 1|′ − 4λy1(t)(3y1(t)² − 4y1(t) + 1)u(t)²,   (48)

dλy2/dt = −∂H/∂y2 = 2u(t)λx(t)(1 − y0) |2y2(t) − 1|′ + 4λy2(t)y2(t)(2 − 3y2(t))u(t)²,   (49)

λx(T) = 0,   λy1(T) = 0,   λy2(T) = 0.   (50)

The optimal control satisfies:

u(t) = argmin_u H(u).   (51)
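Because H is quadratic in u, the pointwise minimization in Eq. 51 can be done in closed form. The helper below is our own illustration, with the control restricted to [0, U] as in Eq. 44: it evaluates the u-dependent part of Eq. 46 and returns the minimizer.

```python
import numpy as np

def argmin_hamiltonian(lam_x, lam_y1, lam_y2, y0, y1, y2, U):
    """Minimize the u-dependent part of the Hamiltonian (Eq. 46) over u in [0, U].

    H(u) = c2 * u**2 + c1 * u + (terms independent of u), with
      c2 = 1 + 4*lam_y1*y1*(1 - y1)**2 - 4*lam_y2*y2**2*(1 - y2)
      c1 = -lam_x * (y0*|2*y1 - 1| + (1 - y0)*|2*y2 - 1|)
    """
    c2 = 1.0 + 4.0 * lam_y1 * y1 * (1.0 - y1) ** 2 - 4.0 * lam_y2 * y2 ** 2 * (1.0 - y2)
    c1 = -lam_x * (y0 * abs(2.0 * y1 - 1.0) + (1.0 - y0) * abs(2.0 * y2 - 1.0))

    candidates = [0.0, U]
    if c2 > 0.0:                                    # interior stationary point, clipped to [0, U]
        candidates.append(min(max(-c1 / (2.0 * c2), 0.0), U))
    H_u = lambda u: c2 * u ** 2 + c1 * u            # u-dependent part only
    return min(candidates, key=H_u)
```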

3.6 Multidimensional formulation

We can also consider a higher-dimensional version of this problem, where we have N states and N controls. In this case, we assume that one of the following two sets of stochastic differential equations is true:

dxi(t) = ui(t)dt + dBt    for all 1 ≤ i ≤ N,   (52)
dxi(t) = −ui(t)dt + dBt   for all 1 ≤ i ≤ N,   (53)


where Bt is a multidimensional Wiener process. The deterministic problem becomes:

dxi(t)/dt = 1/(πxi(t)) − ui [ y0 |2y1(t) − 1| + (1 − y0) |2y2(t) − 1| ]   for all 1 ≤ i ≤ N,   (54)

dy1(t)/dt = 4y1(t)(1 − y1(t))² Σ_{i=1}^N ui²(t),   (55)

dy2(t)/dt = −4y2(t)²(1 − y2(t)) Σ_{i=1}^N ui²(t),   (56)

min_{u(t)∈[0,U]^N} ∫_0^T [ Σ_{i=1}^N ui²(t) + Σ_{i=1}^N xi²(t) ] dt.   (57)

3.7 Obtaining the optimal deterministic control using the forward-backward sweep

We now outline how the deterministic control is obtained (Algorithm 2). Given the current policy, the trajectories of the states x, y1 and y2 are obtained by solving the ordinary differential equations (Eqs. 41–44, initial conditions in Eq. 45) with the Matlab package ode45 (Line 3). Then, the adjoints λx, λy1 and λy2 are found by solving the corresponding ordinary differential equations (Eqs. 47–49, final conditions in Eq. 50) in Line 4. Note that all states, adjoints and controls are discretized in 100 time steps but correspond to an exact solution of the ordinary differential equations. Based on our experiments, an increase in the number of time steps causes a less-than-linear increase in computational time.
The 'candidate' control uc is then chosen to minimize the Hamiltonian in each time step (Line 6). However, taken as a whole, the sequence of controls uc rarely equals the optimal control sequence because the adjustment can essentially overshoot the optimal point. A more stable update consists of a linear combination of uc with a weight ωt and the previous control u with a weight 1 − ωt (Line 7). We chose for ωt an exponential decay to allow for significant changes at the beginning of the algorithm while avoiding overshoot after a few iterations. We set ωt = e^{−0.15t} as it achieved a stable and fast convergence over the various instances on which we evaluated this algorithm. The process repeats until the previous and new control are close enough (Line 2). We set the associated threshold ε to 10^{−5} as it achieved a good trade-off between accuracy and computational time. The output is the policy for all time steps. The process can be sped up by setting the initial policy (Line 1) as the output policy from the previous time step in the simulation.

Algorithm 2 SolveDeterministic(x0, y0, T)

1: Initialization: u := 0, uc := 1
2: while ||u − uc||₂ ≥ ε do
3:   x, y1, y2 := SolveODE(x0, y0, T, u)
4:   λx, λy1, λy2 := SolveODE(0, 0, 0, T, u, x, y1, y2)
5:   for t = 1 : nTimeSteps do
6:     uc(t) := argmin_u [ u² − λx ( u y0 |2y1(t) − 1| + u (1 − y0) |2y2(t) − 1| ) + 4λy1 y1(t)(1 − y1(t))²u² − 4λy2 y2(t)²(1 − y2(t))u² ]
7:     u(t) := ωt · uc(t) + (1 − ωt) · u(t)
8:   end for
9: end while
return the policy u

We can then evaluate the resulting deterministic policy on the stochastic problem via simulation with a moving window.

3.8 General algorithm

Both the dynamic programming and optimal control approaches are evaluated through simulations (Algorithm 3). At the beginning of each simulation, the true state equation, i.e. the variable y, is drawn randomly (Line 3). The control is obtained from dynamic programming or from the optimal control approach (Line 5), depending on which approach is being evaluated. There is a key difference between these two approaches:


• For dynamic programming, the entire policy is generated offline, i.e. prior to the loop on simulations, thanks to Algorithm 1. The simulations are then fast because the control is simply the policy entry corresponding to the nearest discretized x and y.

• For our optimal control approach, the policy is obtained online: in each time step of each simulation, we run the deterministic algorithm (Algorithm 2) with the current values of x(t) and y(t).

The next state is drawn based on this control and on the true state equation (Eq. 1 or 2, Line 6); the variable y(t) is updated through Bayes' theorem (Eq. 19, Line 7). The time step is updated until the end of the time horizon (Line 4). The output is the average cost per simulation.

Algorithm 3 SolveStochastic(x(0), y(0), T, Δt)

1: for i = 1 : nSimulations do
2:   Initialization: t := 0
3:   y := DrawQ(y(0), 1 − y(0))
4:   while t < T do
5:     u(t) := π(x(t), y(t), t/Δt) or u(t) := SolveDeterministic(x(t), y(t), T − t)   // depending on whether the DP or OC approach is evaluated
6:     x(t + Δt) := DrawNextState(u(t), x(t), y)
7:     y(t + Δt) := UpdateKnowledge(y(t), x(t), x(t + Δt))
8:     Cost := Cost + Δt · (u²(t) + x²(t))
9:     t := t + Δt
10:  end while
11: end for
return the average cost Cost / nSimulations
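To show how Algorithm 3 ties the pieces together, here is a Python sketch of the evaluation loop using an Euler–Maruyama discretization of Eqs. 1–2 and the belief update of Eq. 20; `policy(x, y, t)` is a placeholder for either the dynamic programming lookup or a call to the deterministic solver, and the numerical choices are ours.

```python
import numpy as np

def evaluate_policy(policy, x0, y0, T, dt, n_sims=500, seed=0):
    """Monte Carlo evaluation of a control policy (cf. Algorithm 3).

    policy(x, y, t) -> control u, e.g. a DP lookup table or a call to the
    deterministic solver with the current state as initial condition.
    """
    rng = np.random.default_rng(seed)
    total_cost = 0.0

    for _ in range(n_sims):
        # draw which equation is true: +1 with probability y0 (Eq. 1), -1 otherwise (Eq. 2)
        sign = 1.0 if rng.random() < y0 else -1.0
        x, y, t = x0, y0, 0.0

        while t < T:
            u = policy(x, y, t)
            # Euler-Maruyama step of dx = (+/-) u dt + dB_t
            dx = sign * u * dt + rng.normal(0.0, np.sqrt(dt))
            # Bayes update of the belief (Eq. 20)
            w = y * np.exp(2.0 * dx * u)
            y = w / (w + 1.0 - y)
            total_cost += dt * (u ** 2 + x ** 2)   # accumulate cost at the current state
            x += dx
            t += dt

    return total_cost / n_sims
```

A trivial test policy such as `lambda x, y, t: min(1.0, abs(x))` is enough to run the loop end to end before plugging in either approach.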

3.9 Computational experiments

We aim to compare dynamic programming to the optimal control approach. These two techniques can be hard to compare because dynamic programming is offline while the optimal control approach is online. Dynamic programming calculates the optimal policy for all discretized states before the simulations start (offline), so the simulations are very fast because decisions can be found in a lookup table. In contrast, our optimal control approach calculates the best control during the simulations by solving an optimal control problem (online). There is no preprocessing time but the simulations are likely to be much slower than for dynamic programming.
We show the average result, standard deviation and computational time for both the dynamic programming and optimal control algorithms on various instances. We run 500 simulations for both dynamic programming and optimal control (Table 1). For 1- or 2-dimensional problems, both approaches perform equally at controlling x, but dynamic programming is much faster. With 3 or more dimensions, dynamic programming becomes intractable, and the optimal control approach seems to perform well as the cost grows less than linearly when the dimension of x grows.
Figures 2 and 3 show simulations starting with x0 = 5 and x0 = (0, 5) respectively. The controls are found through dynamic programming, but the optimal control approach selects similar levels of control. The solver selects extreme controls in the first time steps, despite the uncertainty on their consequences, in order to learn quickly. In this simulation the belief converges to the true y after roughly 2 units of time. With two states, only 1 unit of time is needed to learn the value of y with quasi-certainty.

4 DISCUSSION

In this manuscript, we addressed a continuous-time dual control problem. Given the limitations of dynamic programming, we propose an approach based on optimal control, where the unknown parameter is shown to follow a differential equation. All states are replaced by their expected values, which leads to a deterministic model that is solved with an optimal control algorithm.


Problem | DP cost ± 95% conf. | DP time | DP discretization X/Y/A | OC cost ± 95% conf. | OC time
x0 = 0 | 16.7 ± 1.5 | 43 s | 70/45/35 | 16.7 ± 2.4 | 39,600 s
x0 = 5 | 93.1 ± 6.0 | 17 s | 50/20/15 | 96.6 ± 7.9 | 43,200 s
x0 = 15 | 1294.3 ± 37.0 | 15 s | 50/20/15 | 1295.1 ± 38.4 | 43,200 s
x0 = 0, U = 5 | 13.1 ± 0.6 | 32 s | 70/45/35 | 12.8 ± 0.5 | 45,500 s
x0 = 0, y0 = 1 | 11.2 ± 0.5 | 44 s | 70/45/35 | 11.5 ± 0.7 | 30,000 s
x0 = 5, y0 = 1 | 63.6 ± 3.2 | 17 s | 50/20/15 | 60.9 ± 3.1 | 32,400 s
x0 = (0, 0) | 99.4 ± 4.7; 28.2 ± 1.4 | 60 s; 240 s | 10/20/5; 30/20/5 | 26.6 ± 1.4 | 62,100 s
x0 = (0, 0, 0) | 43.0 ± 1.5; Out of memory | 420 s | 20/15/5; 30/20/5 | 35.3 ± 1.3 | 76,200 s
x0 = (0, 0, 0, 0) | Out of memory | — | 20/15/5 | 46.4 ± 1.3 | 92,700 s
x0 = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0) | Out of memory | — | 20/15/5 | 124.4 ± 2.6 | 216,000 s

TABLE 1 Average cost, 95% confidence interval and computational times of both dynamic programming (DP) and optimal control (OC), with y0 = 0.5 and U = 1 unless otherwise stated. For dynamic programming, the numbers of discretized bins of X, Y and A are shown (number per dimension when X is multidimensional). The memory is set to 10 GB. For 1- or 2-dimensional problems, both approaches perform equally at controlling x, but dynamic programming is much faster. With 3 or more dimensions, dynamic programming, when tractable, yields higher costs than optimal control (43.0 vs 35.3 with 3 dimensions). It is interesting to note that the cost and computational time per dimension of optimal control decrease when the state dimension increases (16.7/43,200 s in dimension 1 vs. 12.4/21,600 s in dimension 10). This is because the state y is learned faster with more dimensions, reducing the average cost per dimension. A positive consequence is the decrease in computational time, because better knowledge means the problem is simpler, as confirmed by the test starting with x0 = 0 and y0 = 1.

FIGURE 2 States x(t), control u(t) and knowledge y(t) for T = 10, x0 = 5, y0 = 0.5 and U = 1. The thick lines represent the median values over time (doubled for the control and knowledge depending on the real value of y). The shaded areas contain simulations between the 5th and 95th percentiles over 500 simulations. The thin lines are some randomly drawn individual simulations. The control manages to reduce x until it reaches zero at the end of the time-frame (a). For t ≤ 2, the optimal strategy is aggressive ('bang-bang') and selects extreme controls (b) to improve the knowledge (c). Until t ≈ 6, the improved knowledge yields extreme controls, causing a steep decrease in the state x. Finally, for t ≥ 6, the state becomes small and the controls gradually decrease due to a lack of incentive.


FIGURE 3 States x (a), controls (b) and knowledge y(t) (c) for T = 10, x0 = (0, 5), y0 = 0.5 and U = 1. Both controls u1 and u2 are extreme at t = 0. This allows for a quick convergence of the information state y, within roughly one unit of time. Note that with no aim at actively learning the information state, we would have u1(0) = 0 because x1(0) = 0.

We evaluate this policy through simulations in the real stochastic problem. This algorithm rivals dynamic programming on small problems and remains tractable on larger problems, as opposed to dynamic programming. It achieves the right balance between aggressive and smoothly varying controls.
Both the dynamic programming and optimal control approaches have advantages and drawbacks. Dynamic programming is naturally adapted to stochastic problems. It is also guaranteed to find the optimal solution of discrete-time, discrete-state problems (if tractable), since it explores the entire state space. It is an offline algorithm: it comes at a potentially large preprocessing time, which generates a complete policy. Hence, it suffers from large state spaces, for example when the state space is continuous (therefore infinite) and/or unbounded. It quickly becomes intractable with a multidimensional state (curse of dimensionality [16]). In these cases, more modeling effort is required to trade off the quality of the solution against the tractability and computational time of the solver.
On a side note, using MOMDP solvers instead of applying dynamic programming to a discretized MOMDP might help solve larger problems [22]. However, most solvers would likely suffer from the curse of dimensionality with an exponentially growing number of states. An exception could be online solvers [23,24], which circumvent the need to enumerate states and can thus handle very large state spaces.
In contrast, the optimal control-based approach we proposed has no issues with large and complex state spaces, including multidimensional, unbounded and continuous state spaces, and enjoys a considerable advantage over dynamic programming in this respect. The optimal control approach is online: there is no preprocessing time. Instead the algorithm is called at every time step of a simulation at a small computational cost. However, several limitations must be acknowledged.


Continuous-time optimal control approaches are not naturally adapted to stochastic problems. Framing the problem as a meaningful deterministic problem required significantly more modeling effort than was needed to run dynamic programming. Perhaps the main drawback is the lack of numerical performance guarantee, as the solution might only be a local optimum.
There exist several avenues for improvement. Framing the problem as a meaningful deterministic problem was not straightforward. There might be a way to do this more systematically, or to find a stochastic version of the Pontryagin minimum principle that can be directly applied to our problem. Also, the uncertainty allowed in our problem was limited to just two possible options. It would be beneficial to handle larger uncertainty sets (one of several equations would be true), or perhaps more realistically, a continuous uncertainty about a parameter, e.g. dx(t) = α·u(t)dt + dBt, with α ∈ [−1, 1] to be determined.
There are two principal sources of applications for our optimal control approach. Firstly, big data is becoming more regularly available and can be seen as a nearly continuous-time data flow, increasing the need for continuous-time solutions. Secondly, we do not need to discretize states. With a few exceptions, such as presence/absence models, real-world problems usually have continuous state spaces or, at least, very large discrete state spaces that need to be partitioned if one is to use dynamic programming; optimal control methods have no such limitations. Further, we can model abundances across many spatial locations without causing dimensionality issues. This can be done by considering either a meta-population [25], or by using a spatially explicit optimal control formulation [26,27].
There are many potential applications of our work. In environmental sciences, it is common to deal with two protected species with uncertain prey-predator interactions between them [28]. Also, the predator might prey on other species competing for resources with the protected prey, resulting in an indirect positive impact on the protected prey [29,30,31]. In these cases, either controlling or introducing predators could be the best action, although their outcomes are opposite. In finance [13], where there may be hidden parameters governing stock markets, our methods may be useful due to their continuous-time nature and the usual modeling with Wiener processes. In marketing, trade-offs may occur between learning about the customers while implementing a marketing strategy [32]. In medical science, a doctor can learn about a patient's condition while minimizing the risk of death, complications, or discomfort [15]. More generally, continuous-time dual control may have other interesting applications, for example in Poisson processes, where the frequency of a recurrent event might be unknown.

ACKNOWLEDGMENTS

Christopher Baker is the recipient of a John Stocker Postdoctoral Fellowship from the Science and Industry Endowment Fund. We thank X and X for valuable feedback on this manuscript. Computational resources and services used in this work were provided by the HPC and Research Support Group, Queensland University of Technology, Brisbane, Australia.

References

1. Åström Karl J., Wittenmark Björn. Adaptive control. Courier Corporation; 2008.

2. Bertsekas Dimitri P. Dynamic programming and optimal control. Athena Scientific: Belmont, MA; 1995.

3. Walters Carl J., Hilborn Ray. Ecological optimization and adaptive management. Annual Review of Ecology and Systematics. 1978;9:157–188.

4. Chadès Iadine, Nicol Sam, Rout Tracy M., et al. Optimization methods to solve adaptive management problems. Theoretical Ecology. 2017;1(1):1–20.

5. Åström Karl J. Theory and applications of adaptive control—A survey. Automatica. 1983;19(5):471–486.

6. Chadès Iadine, Carwardine Josie, Martin Tara G., Nicol Samuel, Sabbadin Regis, Buffet Olivier. MOMDPs: A Solution for Modelling Adaptive Management Problems. In: 2012 (pp. 267–273). AAAI Press; Toronto, Canada. Available at http://www.aaai.org/ocs/index.php/AAAI/AAAI12/paper/download/4990/5149.

7. Runge Michael C. Active adaptive management for reintroduction of an animal population. The Journal of Wildlife Management. 2013;77(6):1135–1144.


8. Johnson Fred A., Kendall William L., Dubovsky James A. Conditions and limitations on learning in the adaptive management of mallard harvests. Wildlife Society Bulletin. 2002:176–185.

9. Frederick Shane W., Peterman Randall M. Choosing fisheries harvest policies: when does uncertainty matter?. Canadian Journal of Fisheries and Aquatic Sciences. 1995;52(2):291–306.

10. Wittenmark Björn. Adaptive dual control. In: Unbehauen Heinz, ed. Eolss Publishers Co. Ltd; 2002.

11. Naik Sanjeev M., Kumar P. R., Ydstie B. Erik. Robust continuous-time adaptive control by parameter projection. IEEE Transactions on Automatic Control. 1992;37(2):182–197.

12. Puterman Martin L. Markov decision processes: discrete stochastic dynamic programming. New York, NY, USA: John Wiley & Sons, Inc.; 1994.

13. Dias José G., Vermunt Jeroen K., Ramos Sofia. Clustering financial time series: New insights from an extended hidden Markov model. European Journal of Operational Research. 2015;243(3):852–864.

14. Kang Wei, Bedrossian Naz. Pseudospectral Optimal Control Theory Makes Debut Flight, Saves NASA $1M in Under Three Hours. SIAM News. 2007;40(7).

15. Hauskrecht Milos. Planning and control in stochastic domains with imperfect information. PhD thesis, Massachusetts Institute of Technology; 1997.

16. Bellman Richard. Dynamic programming. Princeton University Press; 1957.

17. Baker Christopher M., Bode Michael. Placing invasive species management in a spatiotemporal context. Ecological Applications. 2016;26(3):712–725.

18. Hackbusch Wolfgang. A numerical method for solving parabolic equations with opposite orientations. Computing. 1978;20(3):229–240.

19. Lenhart Suzanne, Workman John T. Optimal control applied to biological models. CRC Press; 2007.

20. Sigaud Olivier, Buffet Olivier. Markov decision processes in artificial intelligence. New York, NY, USA: John Wiley & Sons, Inc.; 2010.

21. Ong Sylvie CW, Png Shao Wei, Hsu David, Lee Wee Sun. Planning under uncertainty for robotic tasks with mixed observability. The International Journal of Robotics Research. 2010;29(8):1053–1068.

22. Kurniawati Hanna, Hsu David, Lee Wee Sun. SARSOP: efficient point-based POMDP planning by approximating optimally reachable belief spaces. In: 2008 (pp. 65–72).

23. Kurniawati Hanna, Yadav Vinay. An online POMDP solver for uncertainty planning in dynamic environment. In: Springer; 2016 (pp. 611–629).

24. Silver David, Veness Joel. Monte-Carlo planning in large POMDPs. In: 2010 (pp. 2164–2172).

25. Salinas Rene A., Lenhart Suzanne, Gross Louis J. Control of a metapopulation harvesting model for black bears. Natural Resource Modeling. 2005;18(3):307–321.

26. Baker Christopher M. Target the Source: Optimal Spatiotemporal Resource Allocation for Invasive Species Control. Conservation Letters. 2016.

27. Kelly Michael R., Xing Yulong, Lenhart Suzanne. Optimal fish harvesting for a population modeled by a nonlinear parabolic partial differential equation. Natural Resource Modeling. 2016;29(1):36–70.

28. Chadès Iadine, Curtis Janelle MR, Martin Tara G. Setting realistic recovery targets for two interacting endangered species, sea otter and northern abalone. Conservation Biology. 2012;26(6):1016–1025.


29. Baker Christopher M., Gordon Ascelin, Bode Michael. Ensemble ecosystem modeling for predicting ecosystem response to predator reintroduction. Conservation Biology. 2017;31(2):376–384.

30. Bode Michael, Baker Christopher M., Benshemesh Joe, et al. Revealing beliefs: using ensemble ecosystem modelling to extrapolate expert beliefs to novel ecological scenarios. Methods in Ecology and Evolution. 2016.

31. Bode Michael, Baker Christopher M., Plein Michaela. Eradicating down the food chain: optimal multispecies eradication schedules for a commonly encountered invaded island ecosystem. Journal of Applied Ecology. 2015;52(3):571–579.

32. Zhang Dan, Cooper William L. Pricing substitutable flights in airline revenue management. European Journal of Operational Research. 2009;197(3):848–861.


Chapter 7

Conclusions

7.1 Summary

In this thesis we propose solution techniques to solve sequential decision problems under uncertainty, inspired by the management of the invasive tiger mosquito Aedes albopictus. We showed that these decisions can be modelled and optimised using a Markov decision process (MDP). If the transition function is uncertain, a partially observable Markov decision process (POMDP) can be used to find optimal decisions. However, both MDPs and POMDPs fall prey to the curse of dimensionality: they are computationally very demanding, or intractable, for all but small problems. In this thesis, we develop four novel approaches that allow larger problems to be solved, and faster.

In Chapter 3, we address our first research question by developing a new approach to assist decision-makers when actions are simultaneous and of different durations. This approach modifies time constraints to reduce the model size by several orders of magnitude to obtain bounds on the unknown exact performance, for problems too large for dynamic programming to compute the exact solution. Applied to our case study, the bounds provide a narrow range guaranteed to contain the performance of the exact optimal policy. This research impacts metapopulations and network management problems in biosecurity, health and ecology when the budget allows the implementation of simultaneous actions.

In Chapter 4, we address our second research question by proposing two new approximate dynamic programming algorithms adapted to large Susceptible-Infected-Susceptible networks. We show that these two algorithms have a lower computational complexity than the standard version of dynamic programming. These approaches are tractable on the management of Aedes albopictus (17 islands), as opposed to standard dynamic programming, and rival its performance on simpler problems (10 islands). This work can be re-used on Susceptible-Infected-Susceptible networks or graph MDPs in various fields, to deal with individuals or locations in a network, or products in an inventory problem, for example.

In Chapter 5, we address our third research question by proposing a method to improve the initialisation of POMDP solvers that are used when solving adaptive management problems. We show that our approach, which consists of solving a number of Markov decision processes, generates a lower bound that is optimal in the corners of the belief space. With an additional assumption about the optimal policy, we demonstrate that this lower bound is also a linear approximation to the value function. Tested on two state-of-the-art POMDP solvers, our approach shows significant computational gains in our case study and on a previously published data challenge. This simple and inexpensive initial lower bound can be used as an initialisation to POMDP solvers. It is relevant for managing systems where the system response is partially unknown, in fields as varied as natural resource management, medical science, or machine, network or infrastructure maintenance.

In Chapter 6, we address our fourth research question by using tools from optimal control theory to handle structural uncertainty. On a stylised problem, given the limitations of dynamic programming, we propose an approach based on optimal control where the variable representing our knowledge of the unknown parameter is shown to follow a differential equation. All states are replaced by their expected values, which leads to a deterministic model that is solved with an optimal control algorithm. We evaluate this policy through simulations in the real stochastic problem. This algorithm rivals dynamic programming on small problems and remains tractable on larger problems, in contrast to dynamic programming. It achieves the right balance between aggressive and smoothly varying controls. This approach can be beneficial for continuous-time real-world problems, such as stock portfolio optimization or flight trajectory planning, or for problems where the state is multidimensional or near-continuous.

Together, these four chapters make an original and substantial contribution to knowledge by proposing novel optimisation techniques to solve larger problems. These techniques are also applied on a novel computational sustainability case study about the management of an invasive species. We will outline the management recommendations arising from our research in the next section.


7.2 Management recommendations against invasive mosquito Aedes albopictus

We have identified in Chapters 3 and 4 some general rules to manage Aedes albopictus cost-effectively. It is possible to identify an order in which islands should be managed until eradication is achieved. It is reassuring to see that the approaches in Chapters 3 and 4 provide us with nearly identical orders.

Let us summarise here these recommendations. All models target Thursday, Horn and Mulgrave Islands as management priorities in this order because these islands are highly populated and close to mainland Australia. They have the highest probability of leading to a direct colonisation of mainland Australia. Knowing that these islands are close to each other (which favours transmission) and that Horn Island is the 'transport hub' of the Torres Strait adds further credence to their high prioritisation. The prioritisation of these three islands is insensitive to the number of islands included (1–13) and to the transmission probabilities (low/high), showing the robustness of this policy. Finally, it is worth noting that although it is not perfectly known which Torres Strait Islands are infested, it is deemed likely by experts that the three prioritised islands are infested, which makes this management recommendation credible (Beebe et al., 2013).

Many factors affect this prioritisation ranking. The probability of directly colonising the mainland seems like a greedy, short-term consideration but it appears to be the most important factor: Thursday, Horn and Mulgrave Islands score the highest on this criterion and appear at the top of the prioritisation ranking. Another factor is the effectiveness of management: islands where the management actions are effective have an advantage over islands that are difficult to manage. Finally, the layout of islands is paramount in making optimal decisions: an isolated island like Coconut Island has a low priority because mosquitoes have a low chance of colonising other islands from there. In conclusion, our modelling and analysis provide management recommendations by accounting for multiple factors of different natures.

Although the prioritisation ranking is consistent for different levels of transmission probabilities, the mean time until infestation is not: it ranges from 13 to 50 years when calculated using high and low transmission probabilities respectively. Obtaining more precise estimates of the transmission probabilities will produce a narrower time range estimate. Higher budgets allocated to management can also postpone infestation, more sensitively when transmission probabilities are low (40 years with no budget/80 years with unlimited budget) than high (10/15 years). A comprehensive sensitivity analysis would help the decision maker set the most suitable budget.


7.3 Future work

A great deal of work remains to be done. In this section, we outline the work needed to overcome some limitations of our current approaches and then discuss some of the restrictive assumptions we have made. We finish by proposing new avenues for research.

7.3.1 Current limitations

The approaches presented in this thesis come with a number of limitations.

In Chapter 3, we developed a new approach to assist decision-makers when actions are simultaneous and of different durations. The quality of the bounds returned by our approach is very much dependent on the durations of the different actions, with a relative error of up to 21% with 3 actions of durations 2, 5 and 7. The error might be even greater on problems with more simultaneous actions. Further research in this direction is needed to increase robustness and reduce the relative error.

Our most successful approximate approach in Chapter 4, i.e. our approach with continuous states, has the disadvantage of not accounting for changes of action in the future: it might perform poorly on systems that rely on changing actions drastically within a short time. We could potentially improve the accuracy of the approach on harder problems (while remaining tractable) by allowing actions to change in the future only if some important nodes switch, such as Thursday Island in our case study.

The approach introduced in Chapter 5 only accelerates MOMDP solvers at the beginning, through an initial lower bound. As we have seen in the hardest problems, the convergence of the value function is still very slow, although with a sizeable head start. The main issue is probably that we did not exploit the SIS structure of the problem at all. A way to do so is to use a factored POMDP solver, which is tailored for problems where the states can be naturally decomposed as a combination of sub-states (e.g. nodes in a network). We have tried using Symbolic Perseus (Poupart, 2005), a factored version of Perseus (Spaan and Vlassis, 2005), but only a handful of islands could be accommodated. Since Perseus is outperformed by SARSOP (Kurniawati et al., 2008), a factored version of SARSOP might yield better results than Symbolic Perseus. More generally, an efficient factored POMDP solver is in high demand, as applications with factored states are countless and will probably play a more prominent role in the future as larger problems are tackled.

Although we accelerate solvers addressing adaptive management in Chapter 5, we were unable to solve the adaptive management problem on more than 9 islands. In Chapter 6 we propose an approach to deal with problems in high dimensions, but this is only applied to a stylised problem. More work is needed to adapt this to real-world problems. Also, framing the problem as a meaningful deterministic problem was not straightforward. There might be a way to do this more systematically, or to find a stochastic version of the Pontryagin minimum principle that can be directly applied to our problem. Linking back to the previous paragraph, it would also be beneficial to handle larger uncertainty sets (several equations could be true, not just two), or perhaps more realistically, a continuous uncertainty about a parameter, e.g. dx(t) = α·u(t)dt + dBt, with α ∈ [−1, 1] to be determined.

7.3.2 Restrictive assumptions

We have made a number of restrictive assumptions throughout our study. For example, Aedes albopictus is difficult to detect and decision-makers cannot be certain that an island is susceptible, which we have disregarded in Chapters 3, 4 and 5. We have used partially observable MDPs to model structural uncertainty, but their primary use is to address observational uncertainty. It would be interesting to address observational and structural uncertainty at the same time and investigate which type causes the highest computational burden.

An assumption we have made when dealing with structural uncertainty is the stationarity of the transition function (Chapters 5 and 6). This is rather restrictive, as phenomena like climate change may make the transition function inherently non-stationary. Further research is needed to evaluate the cost of disregarding the non-stationarity. If this cost is too high, then better lower bounds or solution techniques are needed for general-case MOMDPs or POMDPs.

Besides, assuming that there is a finite number of transition functions was crucial to solve structural uncertainty because it allowed us to create one state per possible transition function (Chapters 5 and 6). Having different such scenarios is common in environmental sciences but might be less realistic in other fields. For example, the transition function could belong to a continuous set of possible functions. In this case, we should be able to combine the work of Merl et al. (2009), who sample continuous parameters with a Monte Carlo approach, with a Monte Carlo POMDP solver (Silver and Veness, 2010). This sample-based approach is more robust to the curse of dimensionality and allows for more complex state spaces. Nilim and El Ghaoui (2005) have also considered continuous uncertainty sets from a robustness perspective, where 'nature' chooses the worst possible transition function within the allowed set of transition functions. Wiesemann et al. (2013) push boundaries further by allowing more sets to be considered. All these approaches are inspiring to deal with continuous sets of transition functions.

Finally, each of these approaches relies on the fact that the real transition function lies in this pre-defined set of transition functions. The true system dynamics might, however, be completely different from the ones we have imagined possible. In this case, one might need a tool to detect when the observed transitions do not seem to match any of the allowed transitions, potentially leading to timely model and policy adjustments. Yang and Wang (2003), Chandola et al. (2012) and Ye (2000) have proposed inspiring solutions in the context of Markov chains. It would be interesting to see how these approaches fare on real-world environmental problems like our case study.

7.3.3 Future avenues of research

Further improvement of the current model and solution techniques we used should continue to be pursued. First, our techniques have been tested only on the case study of the Asian tiger mosquitoes, with the exception of Chapter 5, where an additional case study from the literature was also used. In order to further reinforce the validity of our approaches, one would need to test them on different case studies.

Also, our four core chapters introduced four different approaches to address sequential decision problems. Can these approaches be coupled, e.g. an approximate approach applied to adaptive management problems? This is left for further study. Also, we have disregarded other factors potentially influencing management recommendations, such as species interactions, increased migration flow and the effects of climate change.

On a more theoretical level, it would be beneficial to obtain some performance guarantees for adaptive management problems. With no prior information on the transition function, some works in reinforcement learning have managed to calculate bounds on the total regret accumulated (Auer et al., 2009, Bartlett and Tewari, 2009). In our formulation with ample prior information (only a few possible transition functions), it may also be possible to calculate such bounds for the optimal policy. This would provide an upper bound on the cost of structural uncertainty (many transition functions instead of one), which is related to the notion of value of information (Chades et al., 2017). Also, this would be a good opportunity to compare the optimal policy with the heuristics used in reinforcement learning.

Finally, little attention has been devoted to the shape of the value function of an MOMDP or a POMDP. In Chapter 5, we studied the derivative of the value function near the corners of the belief space. It is not known, however, in what case this function is differentiable across the entire belief space, except in trivial cases such as only one α-vector, i.e. one action. Note that this could only be true in infinite time horizon because the value function is piecewise linear in finite time horizon (Section 2.2.2), i.e. not differentiable. Then, it would be interesting to explore the connection between the differentiability of the value function and the difficulty of solving the POMDP. For example, non-trivial differentiable value functions are made of an infinite number of α-vectors, which tends to suggest that differentiability implies difficulty. Further, recall that the value function is calculated recursively, i.e. the value of one belief state is expressed as a weighted sum of the value of other, often nearby, belief states. Hence, from this angle, the definition of the value function appears to share similarities with the concept of differential equations. An interesting avenue of research would be to investigate whether the optimal value function of a POMDP can be expressed as, or approximated by, the solution of a differential equation. One could start with a very simple POMDP problem, for instance with two states, two observations and two actions. This might lead to a completely different approach for solving POMDPs, i.e. through solving differential equations, with the benefits of having very different strengths from current approaches, as we have seen in Chapter 6.


Bibliography

Astrom, K. J. (1983). Theory and applications of adaptive control—A survey. Automatica, 19(5):471–486.

Astrom, K. J. and Wittenmark, B. (2008). Adaptive Control. Courier Corporation.

Anderson, C. and Franks, N. R. (2001). Teams in animal societies. Behavioral Ecology, 12(5):534–540.

Auer, P., Jaksch, T., and Ortner, R. (2009). Near-optimal regret bounds for reinforcement learning. In Advances in Neural Information Processing Systems, pages 89–96.

Baker, C. M. and Bode, M. (2016). Placing invasive species management in a spatiotemporal context. Ecological Applications, 26(3):712–725.

Baker, C. M., Gordon, A., and Bode, M. (2017). Ensemble ecosystem modeling for predicting ecosystem response to predator reintroduction. Conservation Biology, 31(2):376–384.

Bartlett, P. L. and Tewari, A. (2009). REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 35–42. AUAI Press.

Barto, A. G. and Mahadevan, S. (2003). Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(4):341–379.

Beebe, N. W., Ambrose, L., Hill, L. A., Davis, J. B., Hapgood, G., Cooper, R. D., Russell, R. C., Ritchie, S. A., Reimer, L. J., Lobo, N. F., Syafruddin, D., and van den Hurk, A. F. (2013). Tracing the tiger: Population genetics provides valuable insights into the Aedes (Stegomyia) albopictus invasion of the Australasian Region. PLoS Neglected Tropical Diseases, 7:e2361.


Bellman, R. (1957). Dynamic programming. Princeton University Press.

Bellman, R. (1961). Adaptive Control Processes: A Guided Tour.

Bertsekas, D. P. (1995). Dynamic Programming and Optimal Control, volume 1. Athena Scientific, Belmont, MA.

Bonizzoni, M., Gasperi, G., Chen, X., and James, A. A. (2013). The invasive mosquito species Aedes albopictus: Current knowledge and future perspectives. Trends in Parasitology, 29:460–468.

Boutilier, C. and Brafman, R. I. (1997). Planning with concurrent interacting actions. In AAAI/IAAI, pages 720–726.

Boutilier, C. and Poole, D. (1996). Computing optimal policies for partially observable decision processes using compact representations. In Proceedings of the National Conference on Artificial Intelligence, pages 1168–1175.

Cassandra, A. R. (1998). A survey of POMDP applications. In Working Notes of AAAI 1998 Fall Symposium on Planning with Partially Observable Markov Decision Processes, pages 17–24.

Cassandra, A. R., Kaelbling, L. P., and Littman, M. L. (1994). Acting optimally in partially observable stochastic domains. Volume 94, pages 1023–1028.

Chades, I., Carwardine, J., Martin, T. G., Nicol, S., Sabbadin, R., and Buffet, O. (2012a). MOMDPs: A Solution for Modelling Adaptive Management Problems. In The Twenty-Sixth AAAI Conference on Artificial Intelligence, pages 267–273, Toronto, Canada. AAAI Press.

Chades, I., Chapron, G., Cros, M.-J., Garcia, F., and Sabbadin, R. (2014). MDPtoolbox: A multi-platform toolbox to solve stochastic dynamic programming problems. Ecography, 37.

Chades, I., Curtis, J. M., and Martin, T. G. (2012b). Setting realistic recovery targets for two interacting endangered species, sea otter and northern abalone. Conservation Biology, 26(6):1016–1025.

Chades, I., Curtis, J. M. R., and Martin, T. G. (2012c). Setting realistic recovery targets for interacting endangered species. Conservation Biology, 26:1016–1025.


Chades, I., Martin, T. G., Nicol, S., Burgman, M. A., Possingham, H. P., and Buckley, Y. M. (2011). General rules for managing and surveying networks of pests, diseases, and endangered species. Proceedings of the National Academy of Sciences of the United States of America, 108:8323–8328.

Chades, I., McDonald-Madden, E., McCarthy, M. A., Wintle, B., Linkie, M., and Possingham, H. P. (2008). When to stop managing or surveying cryptic threatened species. Proceedings of the National Academy of Sciences, 105(37):13936–13940.

Chades, I., Nicol, S., Rout, T. M., Peron, M., Dujardin, Y., Pichancourt, J.-B., Hastings, A., and Hauser, C. E. (2017). Optimization methods to solve adaptive management problems. Theoretical Ecology, 10(1):1–20.

Chandola, V., Banerjee, A., and Kumar, V. (2012). Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 24(5):823–839.

Convention on Biological Diversity (2002). Alien species that threaten ecosystems, habitats or species. COP 6 Decision VI/23.

Dias, J. G., Vermunt, J. K., and Ramos, S. (2015). Clustering financial time series: New insights from an extended hidden Markov model. European Journal of Operational Research, 243(3):852–864.

Dimitrakakis, C. (2009). Complexity of stochastic branch and bound for belief tree search in Bayesian reinforcement learning. In 2nd International Conference on Agents and Artificial Intelligence (ICAART 2010), pages 259–264.

Duff, M. (2002). Optimal Learning: Computational Procedures for Bayes-Adaptive Markov Decision Processes. PhD thesis, University of Massachusetts Amherst.

Duff, M. (2003). Design for an optimal probe. In Proceedings of the 20th International Conference on Machine Learning, pages 131–138.

Duke, J. M., Dundas, S. J., and Messer, K. D. (2013). Cost-effective conservation planning: Lessons from economics. Journal of Environmental Management, 125:126–133.

Dumont, G. A. and Astrom, K. J. (1988). Wood chip refiner control. IEEE Control Systems Magazine, 8(2):38–43.


Duncan, D. H. and Wintle, B. A. (2008). Towards adaptive management of native vegetation in regional landscapes. In Landscape Analysis and Visualisation, pages 159–182. Springer.

Faddoul, R., Raphael, W., Soubra, A.-H., and Chateauneuf, A. (2015). Partially Observable Markov Decision Processes incorporating epistemic uncertainties. European Journal of Operational Research, 241(2):391–401.

Fard, M. M. and Pineau, J. (2010). PAC-Bayesian model selection for reinforcement learning. In Advances in Neural Information Processing Systems, pages 1624–1632.

Firn, J., Rout, T., Possingham, H., and Buckley, Y. M. (2008). Managing beyond the invader: Manipulating disturbance of natives simplifies control efforts. Journal of Applied Ecology, 45:1143–1151.

Forsell, N. and Sabbadin, R. (2006). Approximate linear-programming algorithms for graph-based Markov decision processes. Frontiers in Artificial Intelligence and Applications, 141:590.

Forsell, N., Wikstrom, P., Garcia, F., Sabbadin, R., Blennow, K., and Eriksson, L. O. (2011). Management of the risk of wind damage in forestry: A graph-based Markov decision process approach. Annals of Operations Research, 190:57–74.

Frederick, S. W. and Peterman, R. M. (1995). Choosing fisheries harvest policies: When does uncertainty matter? Canadian Journal of Fisheries and Aquatic Sciences, 52(2):291–306.

Gherardi, F. (2007). Measuring the impact of freshwater NIS: What are we missing? Biological invaders in inland waters: Profiles, distribution, and threats, pages 437–462.

Gillies, D., Thornley, D., and Bisdikian, C. (2009). Probabilistic approaches to estimating the quality of information in military sensor networks. The Computer Journal, 53(5):493–502.

Grechi, I., Chades, I., Buckley, Y., Friedel, M., Grice, A. C., Possingham, H. P., van Klinken, R. D., and Martin, T. G. (2014). A decision framework for management of conflicting production and biodiversity goals for a commercially valuable invasive species. Agricultural Systems, 125:1–11.

Hackbusch, W. (1978). A numerical method for solving parabolic equations with opposite orientations. Computing, 20(3):229–240.


Hauskrecht, M. (1997). Planning and Control in Stochastic Domains with Imperfect Information. PhD thesis, Massachusetts Institute of Technology.

Hill, L. A., Davis, J. B., Hapgood, G., Whelan, P. I., Smith, G. A., Ritchie, S. A., Cooper, R. D., and van den Hurk, A. F. (2008). Rapid identification of Aedes albopictus, Aedes scutellaris, and Aedes aegypti life stages using real-time polymerase chain reaction assays. The American Journal of Tropical Medicine and Hygiene, 79:866–875.

Ho, C., Kochenderfer, M. J., Mehta, V., and Caceres, R. S. (2015). Control of epidemics on graphs. In Decision and Control (CDC), 2015 IEEE 54th Annual Conference on, pages 4202–4207. IEEE.

Hoey, J., St-Aubin, R., Hu, A., and Boutilier, C. (1999). SPUDD: Stochastic planning using decision diagrams. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 279–288. Morgan Kaufmann Publishers Inc.

Johnson, F. A., Kendall, W. L., and Dubovsky, J. A. (2002). Conditions and limitations on learning in the adaptive management of mallard harvests. Wildlife Society Bulletin, pages 176–185.

Kakade, S. M. (2003). On the Sample Complexity of Reinforcement Learning. PhD thesis, University of London.

Kang, W. and Bedrossian, N. (2007). Pseudospectral optimal control theory makes debut flight, saves NASA $1M in under three hours. SIAM News, 40(7).

Kolter, J. Z. and Ng, A. Y. (2009). Near-Bayesian exploration in polynomial time. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 513–520. ACM.

Kurniawati, H., Hsu, D., and Lee, W. S. (2008). SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In Robotics: Science and Systems (RSS), pages 65–72.

Lenhart, S. and Workman, J. T. (2007). Optimal Control Applied to Biological Models. CRC Press.

Littman, M. L. (1996). Algorithms for Sequential Decision Making. PhD thesis, Brown University.


Madani, O., Hanks, S., and Condon, A. (1999). On the undecidability of probabilistic planning and infinite-horizon partially observable Markov decision problems. Pages 541–548.

Mantyka-Pringle, C. S., Martin, T. G., Moffatt, D. B., Udy, J., Olley, J., Saxton, N., Sheldon, F., Bunn, S. E., and Rhodes, J. R. (2016). Prioritizing management actions for the conservation of freshwater biodiversity under changing climate and land-cover. Biological Conservation, 197:80–89.

Mazza, G., Tricarico, E., Genovesi, P., and Gherardi, F. (2014). Biological invaders are threats to human health: An overview. Ethology Ecology and Evolution, 26:112–129.

McCarthy, M. A. and Possingham, H. P. (2007). Active adaptive management for conservation. Conservation Biology, 21(4):956–963.

McCarthy, M. A., Possingham, H. P., and Gill, A. M. (2001). Using stochastic dynamic programming to determine optimal fire management for Banksia ornata. Journal of Applied Ecology, 38:585–592.

Merl, D., Johnson, L. R., Gramacy, R. B., and Mangel, M. (2009). A statistical framework for the adaptive management of epidemiological interventions. PloS One, 4(6):e5807.

Monahan, G. E. (1982). A survey of partially observable Markov decision processes: Theory, models, and algorithms. Management Science, 28:1–16.

Monterrubio, C. L., Rioja-Paradela, T. M., and Carrillo-Reyes, A. (2015). State of knowledge and conservation of endangered and critically endangered lagomorphs worldwide. Therya, 6(1):11–30.

Moore, C. T. and Conroy, M. J. (2006). Optimal regeneration planning for old-growth forest: Addressing scientific uncertainty in endangered species recovery through adaptive management. Forest Science, 52(2):155–172.

Mwebaze, P., Bennett, J., Beebe, N. W., Devine, G. J., and De Barro, P. (2017). Economic Valuation of the Threat Posed by the Establishment of the Asian Tiger Mosquito in Australia. Environmental and Resource Economics, pages 1–23.

Naik, S. M., Kumar, P. R., and Ydstie, B. E. (1992). Robust continuous-time adaptive control by parameter projection. IEEE Transactions on Automatic Control, 37(2):182–197.


Nicol, S., Buffet, O., Iwamura, T., and Chades, I. (2013). Adaptive management of migratory birds under sea level rise. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, pages 2955–2957, Beijing, China. AAAI Press.

Nilim, A. and El Ghaoui, L. (2005). Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798.

Ong, S. C., Png, S. W., Hsu, D., and Lee, W. S. (2010). Planning under uncertainty for robotic tasks with mixed observability. The International Journal of Robotics Research, 29(8):1053–1068.

Papadimitriou, C. H. and Tsitsiklis, J. N. (1987). The complexity of Markov decision processes. Mathematics of Operations Research, 12:441–450.

Pastor-Satorras, R. and Vespignani, A. (2001). Epidemic spreading in scale-free networks. Physical Review Letters, 86(14):3200.

Paupy, C., Delatte, H., Bagny, L., Corbel, V., and Fontenille, D. (2009). Aedes albopictus, an arbovirus vector: From the darkness to the light. Microbes and Infection, 11:1177–1185.

Pelizza, S. A., Scorsetti, A. C., Bisaro, V., Lastra, C. C. L., and García, J. J. (2010). Individual and combined effects of Bacillus thuringiensis var. israelensis, temephos and Leptolegnia chapmanii on the larval mortality of Aedes aegypti. BioControl, 55(5):647–656.

Peron, M., Bartlett, P. L., Becker, K. H., Helmstedt, K. J., and Chades, I. (2018). Two approximate dynamic programming algorithms for managing complete SIS networks.

Peron, M., Becker, K. H., Bartlett, P., and Chades, I. (2017a). Fast-Tracking Stationary MOMDPs for Adaptive Management Problems. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), pages 4531–4537.

Peron, M., Jansen, C. C., Mantyka-Pringle, C., Nicol, S., Schellhorn, N. A., Becker, K. H., and Chades, I. (2017b). Selecting simultaneous actions of different durations to optimally manage an ecological network. Methods in Ecology and Evolution, 8(10):1332–1341.

Phelan, P. L., Norris, K. H., and Mason, J. F. (1996). Soil-management history and host preference by Ostrinia nubilalis: Evidence for plant mineral balance mediating insect–plant interactions. Environmental Entomology, 25(6):1329–1336.

Pichancourt, J. B., Chades, I., Firn, J., van Klinken, R. D., and Martin, T. G. (2012). Simple rules to contain an invasive species with a complex life cycle and high dispersal capacity. Journal of Applied Ecology, 49:52–62.

Pimentel, D., McNair, S., Janecka, J., Wightman, J., Simmonds, C., O'Connell, C., Wong, E., Russel, L., Zern, J., Aquino, T., and Tsomondo, T. (2001). Economic and environmental threats of alien plant, animal, and microbe invasions. Agriculture, Ecosystems & Environment, 84:1–20.

Pineau, J., Gordon, G., and Thrun, S. (2003). Point-based value iteration: An anytime algorithm for POMDPs. Volume 3, pages 1025–1032.

Piorr, A., Ungaro, F., Ciancaglini, A., Happe, K., Sahrbacher, A., Sattler, C., Uthes, S., and Zander, P. (2009). Integrated assessment of future CAP policies: Land use changes, spatial patterns and targeting. Environmental Science & Policy, 12(8):1122–1136.

Possingham, H. P. (1997). Optimal fire management strategies for threatened species: An application of stochastic dynamic programming to state-dependent environmental decision-making. Bulletin of the Ecological Society of America, 78.

Poulin, F. J. and Franks, P. J. (2010). Size-structured planktonic ecosystems: Constraints, controls and assembly instructions. Journal of Plankton Research, page fbp145.

Poupart, P. (2005). Exploiting Structure to Efficiently Solve Large Scale Partially Observable Markov Decision Processes. PhD thesis, University of Toronto, Toronto.

Poupart, P., Vlassis, N., Hoey, J., and Regan, K. (2006). An analytic solution to discrete Bayesian reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 697–704. ACM.

Powell, W. B. (2007). Approximate Dynamic Programming: Solving the Curses of Dimensionality, volume 703. John Wiley & Sons, Inc., New York, NY, USA.

Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA.


Regan, T. J., Chades, I., and Possingham, H. P. (2011). Optimally managing under imperfect detection: A method for plant invasions. Journal of Applied Ecology, 48:76–85.

Regan, T. J., McCarthy, M. A., Baxter, P. W. J., Panetta, F. D., and Possingham, H. P. (2006). Optimal eradication: When to stop looking for an invasive plant. Ecology Letters, 9:759–766.

Richards, S. A., Possingham, H. P., and Tizard, J. (1999). Optimal fire management for maintaining community diversity. Ecological Applications, 9:880–892.

Ritchie, S. A., Moore, P., Carruthers, M., Williams, C., Montgomery, B., Foley, P., Ahboo, S., Van Den Hurk, A. F., Lindsay, M. D., and Cooper, B. (2006). Discovery of a widespread infestation of Aedes albopictus in the Torres Strait, Australia. Journal of the American Mosquito Control Association, 22:358–365.

Rohanimanesh, K. and Mahadevan, S. (2002). Learning to take concurrent actions. In Advances in Neural Information Processing Systems, pages 1619–1626.

Runge, M. C. (2013). Active adaptive management for reintroduction of an animal population. The Journal of Wildlife Management, 77(6):1135–1144.

Sahneh, F. D., Chowdhury, F. N., and Scoglio, C. M. (2012). On the existence of a threshold for preventive behavioral responses to suppress epidemic spreading. Scientific Reports, 2.

Shani, G., Pineau, J., and Kaplow, R. (2013). A survey of point-based POMDP solvers. Autonomous Agents and Multi-Agent Systems, 27:1–51.

Shea, K. and Possingham, H. P. (2000). Optimal release strategies for biological control agents: An application of stochastic dynamic programming to population management. Journal of Applied Ecology, 37:77–86.

Sigaud, O. and Buffet, O. (2010). Markov Decision Processes in Artificial Intelligence. John Wiley & Sons, Inc., New York, NY, USA.

Silver, D. and Veness, J. (2010). Monte-Carlo planning in large POMDPs. Pages 2164–2172.

Silver, E. A. (1963). Markovian decision processes with uncertain transition probabilities or rewards. Technical report, Massachusetts Institute of Technology, Operations Research Center.


Sim, H. S., Kim, K.-E., Kim, J. H., Chang, D.-S., and Koo, M.-W. (2008). Symbolic Heuristic Search Value Iteration for Factored POMDPs. In AAAI, pages 1088–1093.

Singh, S. and Cohn, D. (1998). How to dynamically merge Markov decision processes. In Proceedings of the 1997 Conference on Advances in Neural Information Processing Systems 10, pages 1057–1063. MIT Press.

Smallwood, R. D. and Sondik, E. J. (1973). Optimal control of partially observable Markov processes over a finite horizon. Operations Research, 21:1071–1088.

Spaan, M. and Vlassis, N. (2005). Perseus: Randomized point-based value iteration for POMDPs. Journal of Artificial Intelligence Research, 24:195–220.

Spall, J. C. (2005). Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control, volume 65. John Wiley & Sons.

Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44.

Sutton, R. S. and Barto, A. G. (1998). Introduction to Reinforcement Learning. MIT Press.

Vlassis, N., Ghavamzadeh, M., Mannor, S., and Poupart, P. (2012). Bayesian reinforcement learning. In Reinforcement Learning, pages 359–386. Springer.

Walters, C. J. (1986). Adaptive management of renewable resources. Macmillan, New York, USA.

Walters, C. J. and Hilborn, R. (1976). Adaptive control of fishing systems. Journal of the Fisheries Board of Canada, 33(1):145–159.

Walters, C. J. and Hilborn, R. (1978). Ecological optimization and adaptive management. Annual Review of Ecology and Systematics, 9:157–188.

Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. PhD thesis, King's College, Cambridge.

White, C. C. (1991). A survey of solution techniques for the partially observed Markov decision process. Annals of Operations Research, 32:215–230.


Wiesemann, W., Kuhn, D., and Rustem, B. (2013). Robust Markov decision processes. Mathematics of Operations Research, 38(1):153–183.

Williams, B. K. (1996). Adaptive optimization and the harvest of biological populations. Mathematical Biosciences, 136:1–20.

Williams, B. K. (2011). Resolving structural uncertainty in natural resources management using POMDP approaches. Ecological Modelling, 222:1092–1102.

Wittenmark, B. (2002). Adaptive dual control. In Control Systems, Robotics and Automation, Encyclopedia of Life Support Systems (EOLSS), Developed under the Auspices of the UNESCO. Eolss Publishing Co. Ltd., Heinz Unbehauen edition.

Yakowitz, S. J. (1969). Mathematics of adaptive control processes.

Yang, J. and Wang, W. (2003). CLUSEQ: Efficient and effective sequence clustering. In Data Engineering, 2003. Proceedings. 19th International Conference on, pages 101–112. IEEE.

Ye, N. (2000). A Markov chain model of temporal behavior for anomaly detection. In Proceedings of the 2000 IEEE Systems, Man, and Cybernetics Information Assurance and Security Workshop, volume 166, page 169. West Point, NY.

Zhang, D. and Cooper, W. L. (2009). Pricing substitutable flights in airline revenue management. European Journal of Operational Research, 197(3):848–861.


Appendix A

Optimization methods to solve adaptive management problems


REVIEW PAPER

Optimization methods to solve adaptive management problems

Iadine Chadès1 & Sam Nicol1 & Tracy M. Rout2 & Martin Péron1,3 & Yann Dujardin1 & Jean-Baptiste Pichancourt1 & Alan Hastings4 & Cindy E. Hauser2

Received: 7 December 2015 / Accepted: 22 September 2016 / Published online: 24 October 2016
© Springer Science+Business Media Dordrecht 2016

Abstract Determining the best management actions is challenging when critical information is missing. However, urgency and limited resources require that decisions must be made despite this uncertainty. The best practice method for managing uncertain systems is adaptive management, or learning by doing. Adaptive management problems can be solved optimally using decision-theoretic methods; the challenge for these methods is to represent current and future knowledge using easy-to-optimize representations. Significant methodological advances have been made since the seminal adaptive management work was published in the 1980s, but despite recent advances, guidance for implementing these approaches has been piecemeal and study-specific. There is a need to collate and summarize new work. Here, we classify methods and update the literature with the latest optimal or near-optimal approaches for solving adaptive management problems. We review three mathematical concepts required to solve adaptive management problems: Markov decision processes, sufficient statistics, and Bayes' theorem. We provide a decision tree to determine whether adaptive management is appropriate and then group adaptive management approaches based on whether they learn only from the past (passive) or anticipate future learning (active). We discuss the assumptions made when using existing models and provide solution algorithms for each approach. Finally, we propose new areas of development that could inspire future research. Adaptive management was for a long time limited by the efficiency of the solution methods; recent techniques to efficiently solve partially observable decision problems now allow us to solve more realistic adaptive management problems such as imperfect detection and non-stationarity in systems.

Keywords Adaptive management . Markov decision process . MDP . Partially observable Markov decision process . POMDP . Stochastic dynamic programming . Value of information . Hidden Markov models . Natural resource management . Conservation

Introduction

Resources to manage ecological systems are limited worldwide. Managers have the difficult task of making decisions without perfect knowledge of system dynamics or the consequences of their actions (Wilson et al. 2006). In ecology, uncertainty may arise from measurement error, systematic error, natural variation, inherent randomness, structural uncertainty, and subjective judgment (Regan et al. 2002). In conservation, adaptive management is acknowledged as the principal tool for decision making under structural uncertainty (Keith et al. 2011), and it has the capacity to address most other forms of uncertainty. Decisions are selected to achieve a management objective while simultaneously gaining information to improve future management success (Holling 1978; McCarthy et al. 2012; Walters 1986). Adaptive management is designed to help managers learn about the best suite of management actions to implement by monitoring their effectiveness in complex ecological systems (Westgate et al. 2013).

Electronic supplementary material The online version of this article (doi:10.1007/s12080-016-0313-0) contains supplementary material, which is available to authorized users.

* Iadine Chadès, [email protected]

1 CSIRO, GPO Box 2583, Brisbane QLD 4001, Australia
2 School of BioSciences, University of Melbourne, Parkville Vic 3010, Australia
3 School of Mathematical Sciences, Queensland University of Technology, Brisbane QLD 4000, Australia
4 Department of Environmental Science and Policy, University of California, Davis, CA 95616, USA

Theor Ecol (2017) 10:1–20. DOI 10.1007/s12080-016-0313-0


In this sense, adaptive management is a systematic approach to improving the management process and accommodating changes by learning while doing (Gregory et al. 2006; Holling 1978; Walters 1986). There are two main approaches to adaptive management: decision-theoretic and resilience-based. We provide an overview of the decision-theoretic approaches available for optimizing adaptive management; interested readers can refer to Runge (2011) for a discussion of resilience approaches.

Before solving an adaptive management problem, we need to characterize the type of uncertainty we are facing. The literature on adaptive management refers to four kinds of uncertainty: (1) environmental variation, or process uncertainty, (2) control uncertainty, (3) state uncertainty, or partial observability, and (4) structural uncertainty (Williams et al. 1996). Environmental variation, or process uncertainty, comes from the inherent variability in natural processes. Regardless of how effectively our models describe the behavior of a natural population, we cannot expect these models to predict the exact state of the system at any given time in the future. The future will, at best, be described in probabilistic terms (Parma 1998). Control uncertainty refers to partial controllability and arises because managers cannot perfectly predict the consequences of their management actions (Fackler and Pacifici 2014; Williams et al. 1996). State uncertainty or partial observability results from imperfections in measuring equipment and monitoring techniques. The state of a system must be inferred from imperfect monitoring systems (Fackler and Pacifici 2014; Williams et al. 1996). Adaptive management is particularly tailored to address a last type of uncertainty, structural uncertainty, which corresponds to an imperfect knowledge of the system dynamics. Structural uncertainty is characterized by an uncertainty in parameters (parameter uncertainty) or in the model (model uncertainty) of the system dynamics.

Here, we are emphasizing approaches that have wide applicability, but an example of how these different kinds of uncertainty enter into fisheries management is instructive (Fulton et al. 2011; Sethi et al. 2005). Year-to-year variation in currents and climate (as well as varying impacts of other species) leads to process uncertainty in the dynamics of fish populations. Even if managers set fisheries policy, it is not possible to predict with certainty how fishermen will respond (Fulton et al. 2011), which leads to control uncertainty. Since any assessment of the size of a stock is imperfect, there is clearly state uncertainty. Finally, the true dynamics for fisheries, even without environmental stochasticity, are not known (and may depend on many factors such as age and space that may not be fully included), i.e., there is structural uncertainty. Similar issues clearly arise for other pressing environmental problems, such as control of invasive species (Mehta et al. 2007).

To guide readers, we provide a decision tree that outlines the best order for key questions to be addressed before undertaking an adaptive management approach (Fig. 1).

First, an adaptive management problem should satisfy three prerequisites: a clear management objective, an iterative action/observation process, and uncertain system dynamics. A management objective is required to distinguish adaptive management from post hoc learning, where learning may occur but is not planned as part of a targeted approach to reduce uncertainty. Iterative actions are essential as feedback is required to "learn by doing."

Second, it is essential to identify the type of structural uncertainty. The structural uncertainty can be driven by an uncertain quantity that may take infinitely many values, as in the case of parameter uncertainty. For example, the uncertain parameter can represent the growth rate of a population (Charles 1992), the recovery rate after stock collapse (Hauser and Possingham 2008; Moore et al. 2008), the survival rate of a species (Runge 2013; Springborn and Sanchirico 2013), the mortality rate of a translocated population (Rout et al. 2009), a colonization rate between subpopulations (Southwell et al. 2016), or the probability of success of a management action in forestry (McCarthy and Possingham 2007).

Fig. 1 Decision tree summarizing the main questions to address before undertaking an adaptive management approach


Alternatively, the structural uncertainty may be about finite competing models for the system dynamics. Examples of model uncertainty include whether population dynamics follow a Ricker or Beverton-Holt model (Walters and Hilborn 1976), uncertain population response to harvest and survival (Williams et al. 1996), uncertain growth and aging models of a forest (Moore and Conroy 2006), uncertain consequences of climate change (Nicol et al. 2014, 2015), and different plausible population growth models arising from uncertain disease latency (McDonald-Madden et al. 2010b).

Third, once the type of uncertainty is defined, the value of learning can be calculated using the value of information (Box 1; Canessa et al. 2015; Schlaifer and Raiffa 1961). Learning may not be inherently valuable for management, i.e., it may not result in improved management outcomes (Martin et al. 2016). If the value of learning is high, an adaptive management approach is usually justified. Otherwise, managers do not require an adaptive management approach as reducing uncertainty will not improve the management outcomes.

Box 1: Will reducing uncertainty provide a better outcome?

Principles of value of information analysis

Value of information (VoI) analysis (Schlaifer and Raiffa 1961) determines the critical uncertainties in a problem. VoI quantifies whether reducing uncertainty will improve performance (Canessa et al. 2015; Runge et al. 2011). Note that quantifying the value of information gained by resolving model or parameter uncertainty iteratively is difficult because it requires evaluating the gain of implementing an adaptive policy rather than a single action. Little guidance is available in the literature (Hauser and Possingham 2008; Walters 1986; Williams and Johnson 2015).

Calculation of the expected value of perfect information (EVPI)

Expected value of perfect information is the difference between the expected benefit with perfect information (PI) and the expected benefit given the current level of uncertainty (no learning; NL): EVPI = PI − NL. This value depends on the current knowledge about the system dynamics, given by the belief b (i.e. probability that the uncertain quantity is correct). We provide the equations for model uncertainty; changing the summation to an integral leads to the parameter uncertainty formulation.

The expected benefit with perfect information, PI, depends on the optimal values V*_m for each model m ∈ M (obtained with SDP): PI = Σ_{m∈M} b(m) V*_m.

PI corresponds to the value we would obtain if we knew the true model from the beginning of the process. However, because the true model is unknown, PI is the average of the optimal values for each model weighted by a prior belief that each model is the true model.

The literature provides alternative ways of calculating the expected benefit given the current level of uncertainty for dynamic systems, NL (Hauser and Possingham 2008; Walters 1986). Here, we provide a formulation that is easy to calculate. It requires creating a new MDP with transitions P_NL. P_NL is the average of the transitions of each model P_m weighted by the model beliefs: P_NL(st+1|st, at) = Σ_{m∈M} b(m) P_m(st+1|st, at). The value NL is the optimal value of the MDP with transition function P_NL (states, actions, and rewards are assumed the same in all models). Because the transition function does not change over time, NL refers to the situation where no learning is undertaken. (A small numerical sketch of this EVPI calculation follows this box.)

Intuitive results

If we denote AAM the optimal value obtained when implementing active adaptive management, we have

NL ≤ AAM because AAM anticipates the knowledge improvements brought by actions and trades off informative against reward decisions optimally;

AAM ≤ PI because the knowledge is perfect in PI from the very first time step.

This implies EVPI = PI − NL ≥ AAM − NL ≥ 0: the potential gain of implementing active adaptive management is no greater than the EVPI. A small EVPI (relative to the values) means that an adaptive management approach will bring very little improvement when compared to a "no learning" approach. Note that this very much relies on the current knowledge of the system dynamics (b) and might be misleading if our estimation of b is wrong. A sensitivity analysis on b should be carried out.
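As a hedged numerical sketch of the EVPI calculation described in this box (all transition matrices, rewards and the helper function below are illustrative assumptions on a toy two-state, two-action problem, not values from any cited study), PI averages the model-specific optimal values under the belief, while NL is the optimal value of the single MDP whose transitions are averaged under that belief:

import numpy as np

def value_iteration(P, r, gamma=0.95, tol=1e-8):
    """Optimal value of an MDP: P[a] is an |S| x |S| transition matrix, r[a] an |S| reward vector."""
    V = np.zeros(r.shape[1])
    while True:
        Q = np.array([r[a] + gamma * P[a] @ V for a in range(len(r))])
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Two hypothetical models of the system dynamics (toy numbers).
P_models = [
    np.array([[[0.9, 0.1], [0.4, 0.6]],     # model 0, action 0
              [[0.7, 0.3], [0.2, 0.8]]]),   # model 0, action 1
    np.array([[[0.5, 0.5], [0.3, 0.7]],     # model 1, action 0
              [[0.6, 0.4], [0.1, 0.9]]]),   # model 1, action 1
]
r = np.array([[1.0, 0.0],      # r(s, a) for action 0
              [0.5, 0.2]])     # r(s, a) for action 1
belief = np.array([0.5, 0.5])  # current belief b(m) over the two models
start = np.array([1.0, 0.0])   # initial state distribution

# PI: belief-weighted average of each model's optimal value (each model solved separately).
PI = sum(belief[m] * (start @ value_iteration(P_models[m], r)) for m in range(2))

# NL: optimal value of the MDP with belief-averaged transitions P_NL (no learning).
P_NL = belief[0] * P_models[0] + belief[1] * P_models[1]
NL = start @ value_iteration(P_NL, r)

print(f"PI = {PI:.3f}, NL = {NL:.3f}, EVPI = {PI - NL:.3f}")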

Fourth, once it is established that adaptive management is justified, active and passive adaptive management are the two commonly used approaches to solve adaptive management problems (Walters and Hilborn 1978). Both approaches are iterative learning procedures that provide at each time step an action to implement, given existing knowledge of the system dynamics. The difference between the approaches lies in how the recommended action is calculated. Passive adaptive approaches act as if the current knowledge of the system is correct while expecting some mistakes, which can be used to improve the knowledge as management proceeds over time. Active adaptive approaches explicitly acknowledge that the current knowledge of the system might not be correct, and predict the mistakes that may arise and the future improvement as management proceeds over time. Solutions to active adaptive management problems maximize the chance of achieving the objective by explicitly accounting for future learning opportunities. In contrast, passive adaptive management uses only past experience and does not account for future learning opportunities. From an optimization perspective, passive adaptive management methods are heuristics to solve more complex active adaptive management problems (Bertsekas 1995, p. 293).

In the following section, we introduce three mathematical concepts that are required to solve an adaptive management problem (Fig. 1): Markov decision processes, sufficient statistics, and Bayes' theorem. We then present existing decision models and algorithms that solve active and passive adaptive management problems for model and parameter uncertainty. Finally, we discuss the challenges that impede greater uptake of adaptive management approaches.

Important concepts

Markov decision processes

Solving an adaptive management problem results in a strategy that provides the best action given available information, so that the chance of achieving a management objective is maximized. To provide these best decisions, adaptive management problems are modeled as sequential decision-making problems under uncertainty. Sequential decision-making processes in stochastic systems, including adaptive management, can be modeled using Markov decision processes (MDPs) (Bellman 1957; Marescot et al. 2013) as a theoretical foundation. MDP problems can be solved exactly using stochastic dynamic programming techniques (SDP, Marescot et al. 2013). Continuing the idea of fisheries management introduced earlier, these sequential decisions could be yearly limits on effort or catch (Sethi et al. 2005).

MDPs are controlled stochastic processes satisfying the Markov property and assigning reward values to state transitions (Puterman 1994; Sigaud and Buffet 2010). Formally, MDPs are described by a tuple <S, A, P, r, (T)> where S is the finite state space that describes the possible configurations of the system; A is the finite set of all possible actions or decisions that control the state dynamics; P denotes the state dynamics of the system, i.e., P(st+1|st, at) represents the probability of transitioning to state st+1 given the system is in state st and action at is applied; and r denotes the reward function defined on state transitions: r(st, at). Desirable transitions receive strong rewards. T is the time horizon over which decisions must be made and can be either finite or infinite.

Because MDPs assume that the variables influencing the dynamics of the system are completely observable, a policy is simply defined as a function π: S→A that associates a decision (i.e., action) to each state configuration of the system. A policy provides the rules that a decision maker would follow to perform an optimal action in each state of the system. A policy may be dependent on a time step t or independent of time.

The solution to a Markov decision process is an optimal policy, given an objective that decision makers wish to achieve (Sigaud and Buffet 2010). For fisheries, this objective could be maximizing the net present value of the fishery, which would depend on catch in the upcoming and all future years. More generally, an objective, also called an optimization criterion, may be γ-discounted, where the discount factor γ is a real number: 0 ≤ γ < 1 (Marescot et al. 2013). A value function is used to evaluate the expected performance of a policy π starting from state s. Solving an MDP problem means finding an optimal policy π* such that its value function V*(s) is the best value possible. Linear programming, value iteration, and policy iteration are among the most popular stochastic dynamic programming methods to solve MDPs exactly (Marescot et al. 2013; Puterman 1994). The ready-to-use solvers MDPSOLVE (Fackler 2013) and MDPToolbox (Chadès et al. 2014) now empower users to solve MDPs.
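As an illustrative sketch of one of the exact methods named above, the following policy iteration loop alternates exact policy evaluation with greedy improvement on a toy two-state, two-action MDP (the numbers are hypothetical; in practice the ready-to-use solvers cited above would be preferred):

import numpy as np

# Toy MDP <S, A, P, r> with discount factor gamma (hypothetical numbers).
P = np.array([[[0.8, 0.2], [0.3, 0.7]],    # action 0: P(s'|s, a)
              [[0.5, 0.5], [0.1, 0.9]]])   # action 1
r = np.array([[1.0, 0.0],                  # action 0: r(s, a)
              [0.4, 0.3]])                 # action 1
gamma = 0.95
n_actions, n_states = r.shape

policy = np.zeros(n_states, dtype=int)
while True:
    # Policy evaluation: solve (I - gamma * P_pi) V = r_pi exactly.
    P_pi = P[policy, np.arange(n_states)]
    r_pi = r[policy, np.arange(n_states)]
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    # Policy improvement: act greedily with respect to the current value function.
    Q = r + gamma * np.einsum("ast,t->as", P, V)
    new_policy = Q.argmax(axis=0)
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy

print("pi*(s) =", policy, " V*(s) =", np.round(V, 3))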

MDP applications have ranged from prioritizing global conservation effort (Wilson et al. 2006), weed control (Firn et al. 2008; Pichancourt et al. 2012), metapopulation management (Nicol and Possingham 2010), and fire regime management (McCarthy et al. 2001) to harvest problems (Hauser and Possingham 2008; Walters and Hilborn 1978), to cite a few. In behavioral ecology, MDPs have been used to test if species optimize their reproductive fitness over time (Houston et al. 1988; Venner et al. 2006).

Sufficient statistics

Adaptive management problems differ from classical MDPs because the value of a parameter or the true model is hidden from the decision maker and influences the dynamics of the system (Fig. S1) (Bertsekas 1995; Chadès et al. 2012). Because a state variable that influences the dynamics is hidden (and consequently also influences the best decision), the optimal policy π* depends on both the observable state variable and the value of the hidden variable. The value of the hidden variable must be estimated using the history of observations and actions. Because it is not feasible to remember the complete past history of observations and actions, sufficient statistics are used (Bertsekas 1995, p. 251; Fisher 1922). Sufficient statistics allow us to retain data without losing any important information. To be useful in adaptive management problems, sufficient statistics must obey the Markov property and be easy to represent and update. Finding sufficient statistics that best represent uncertain variables is a central and long standing challenge of adaptive management (Walters 1986). For problems with uncertain variables that can take finite values, belief states are widely used sufficient statistics. Belief states are probability distributions over finite quantities and can be updated using Bayes' theorem (Sigaud and Buffet 2010). When confronted with uncertain variables that can take infinite values, sufficient statistics that take finite values facilitate the use of fast and accurate solution methods. A common example of such convenient sufficient statistics is the number of successes and failures of an experiment to represent an unknown probability of management success (see example 1, Box 2).

Box 2: Active adaptive management examples

Harvesting under parameter uncertainty

In Hauser and Possingham (2008), the optimal strategy recommends whether or not to harvest a population given a current 3-state population size (S = robust, vulnerable, or collapsed) and unknown recovery rate p, the probability of transition from a collapsed to a vulnerable population size. All possible recovery rates between 0 and 1 are plausible, and uncertainty surrounding the parameter p can be represented using a beta distribution. Given a Beta(α, β) prior for p, the posterior is a beta distribution with new parameters α + R and β + N − R, where the population is observed to recover in R out of N years spent in a collapsed state. Consequently, α and β can be used as sufficient statistics. The optimal policy is derived by solving an MDP (Tables 2 and 3) as α and β take finite discrete values. The state space is defined as X = S × α × β. The action space is harvest or no harvest. The transition probabilities from collapsed to vulnerable are derived for all possible values of α and β. Profits are accrued when the population is harvested. The optimal policy matches an action to a population size and values of α and β. In exploring optimal strategies over short-, medium-, and long-term management time horizons, the authors found that active adaptive strategies could be more precautionary than passive strategies depending on the length of time considered.

Climate change mitigation under model uncertainty

In Nicol et al. (2015), the optimal strategy recommends where to invest resources to protect migratory shorebird populations in the East Asian-Australasian (EAA) flyway given uncertain consequences of sea level rise (SLR). The impacts of sea level rise can be mitigated with protective management actions at a single location in each time step; the objective is to find the best location to invest resources at each time step. The consequences of sea level rise on shorebird populations are uncertain and are represented by three alternative SLR scenarios. Because there is a finite number of scenarios we are uncertain about, belief states (probability distributions over the set of scenarios) are used as sufficient statistics. The optimal policy is derived by solving a factored POMDP (Table 5). The states are discrete breeding population sizes and the protection level of each location; actions are the level of protection applied to each location of the EAA flyway. States are fully observable; however, the correct scenario and the expected future states are only partially observable and must be learned by observing the system over time. Transition probabilities are derived based on the SLR scenario and the level of protection of each location. Rewards are a function of the population size and the cost of management actions. The optimal policy matches a protective action to a location, given the current belief in each SLR scenario.

Bayes’ theorem

The application of Bayes' rule is the underlying mechanism for learning in all adaptive management problems. Bayes' rule states that P(B|A), the probability that the system follows model B given the observed outcome A, can be calculated using P(A|B), the likelihood of receiving new information A when the system follows model B, and P(B), the prior probability that B is the best available model to describe the system. Mathematically, P(B|A) = P(A|B)P(B)/P(A). The probability P(A|B) frequently depends on the management action we take, enabling us to learn about the efficiency of management actions (McCarthy 2007).
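To make the update concrete with hypothetical numbers: suppose two models are equally plausible, so P(B) = 0.5; the observed outcome A has likelihood P(A|B) = 0.8 under model B and 0.5 under the alternative, giving P(A) = 0.8 × 0.5 + 0.5 × 0.5 = 0.65. The posterior is then P(B|A) = 0.8 × 0.5 / 0.65 ≈ 0.62, i.e. the observation shifts belief towards model B.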

Expressions of Bayes' theorem are different for unknown quantities that take a finite number of possible values compared with those that take an infinite number of possible values. When an uncertain parameter has an infinite range of possible values, the distribution bt(θ) represents the values of parameter θ at time t as a probability density function and is referred to as the belief in θ. Observing the system response to management actions between times t and t + 1 provides information that can be used to update this belief. Bayes' theorem provides a means of updating distribution bt(θ) as the system is managed (at) in a given configuration (st) and data are gathered (st+1):

b_{t+1}(θ | s_t, a_t, s_{t+1}, b_t) = P_θ(s_{t+1} | s_t, a_t) b_t(θ) / ∫_θ P_θ(s_{t+1} | s_t, a_t) b_t(θ) dθ,    (1)

where Pθ(st+1|st, at) is the state transition probability assuming that the true parameter value is θ. Useful sufficient statistics for bt(θ) can be found when bt(θ) is a conjugate prior for Pθ(st+1|st, at). This approach is elegant but addresses only a limited set of problem structures (Walters 1986, p. 202).

Unknown quantities can also take a finite discrete set of values; this is often the case under model uncertainty. In this case, Eq. 1 is expressed in discrete form:

b_{t+1}(m | s_t, a_t, s_{t+1}, b_t) = P_m(s_{t+1} | s_t, a_t) b_t(m) / Σ_{m∈M} P_m(s_{t+1} | s_t, a_t) b_t(m),    (2)

where Pm(st+1|st, at) is the state transition probability assuming that the true model is m. The discrete belief value bt(m) is interpreted as the probability that m best describes the system dynamics of the available models.
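A minimal sketch of the discrete update in Eq. 2, using two hypothetical models of a two-state system and a single action (the numbers and the function name are illustrative assumptions):

import numpy as np

def update_belief(b, P_models, s, a, s_next):
    """Discrete Bayes update of the belief over models (Eq. 2); P_models[m][a] is |S| x |S|."""
    likelihood = np.array([P_models[m][a][s, s_next] for m in range(len(P_models))])
    posterior = likelihood * b
    return posterior / posterior.sum()

P_models = [
    np.array([[[0.9, 0.1], [0.4, 0.6]]]),   # model 0, one action
    np.array([[[0.5, 0.5], [0.3, 0.7]]]),   # model 1, one action
]
b = np.array([0.5, 0.5])
# Observing a transition from state 0 to state 1 is more likely under model 1.
print(update_belief(b, P_models, s=0, a=0, s_next=1))   # approx. [0.17, 0.83]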

Active adaptive management

Given the mathematical concepts described in the previous section, we are now ready to introduce the two adaptive management solution approaches: active and passive adaptive management.

Introduction

Active management requires "thinking ahead" and calculating the consequences of all possible values of the unknown information before deciding the optimal action. A probability distribution or belief is used to describe the range of plausible values and their relative credibility (Eqs. 1 and 2). Optimal decisions at a given time step depend on current knowledge of the uncertain quantities θ. Formally, a policy is defined as follows: πt: S, bt → A. The optimal value function V* that characterizes the performance of a policy is a function of a probability distribution, bt, which is a potentially continuous variable. An active adaptive manager projects future data generation and belief distribution using a variation of Bellman's optimality equation (Williams et al. 2009):
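A standard way to write this belief-state Bellman recursion, consistent with the notation of Eqs. 1 and 2 (a sketch of what Eq. 3 expresses rather than its exact typeset form in the original article), is

V*_t(s_t, b_t) = max_{a_t ∈ A} [ r(s_t, a_t) + γ Σ_{s_{t+1}} P_{b_t}(s_{t+1} | s_t, a_t) V*_{t+1}(s_{t+1}, b_{t+1}) ],    (3)

where P_{b_t}(s_{t+1} | s_t, a_t) = ∫_θ P_θ(s_{t+1} | s_t, a_t) b_t(θ) dθ (or the corresponding sum over models m) averages the uncertain dynamics under the current belief, and b_{t+1} is obtained from b_t, s_t, a_t and s_{t+1} via Eq. 1 or 2.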

Theor Ecol (2017) 10:1–20 5

Page 132: Optimal sequential decision-making under uncertainty Brice_Peron_Thesis.pdfHence, it is necessary to prioritise management when making sequential de-cisions. Inspired by the management

This optimization requires that the trajectory of belief bt to bt+1 and state transitions Pθ be calculated for all times t considered.

There are two main branches of solution methods for adaptive management that are based on the kind of uncertainty that needs to be resolved (Fig. 1). Parameter uncertainty refers to systems where the values of the parameters driving the system are uncertain. Model uncertainty refers to the lack of understanding about the structure of biological and ecological relations that drive resource dynamics (Williams et al. 2009). While parameter and model refer to different types of uncertainty, in terms of solution methods, the most important question is whether or not the uncertain variable driving the dynamics of the system takes continuous or discrete values (Fig. 2). We first discuss exact continuous methods to solve problems of parameter uncertainty. Then, we discuss solution methods for a discrete number of models (model uncertainty).

Parameter uncertainty

In adaptive management to resolve parameter uncertainty, the task is to manage the system while simultaneously learning the value of the parameter to improve future management decisions. It is assumed that the unknown underlying parameter has a fixed value. The literature usually assumes that parameter uncertainty refers to an unknown quantity that could potentially take one of an infinite number of values. In the case where the unknown quantity takes a finite number of values, optimization methods to solve model uncertainty are applied.

Walters and Hilborn (1976) introduced the concepts of adaptive management from control theory to deal with uncertainty in the management of renewable resources such as fisheries and wildlife. In control theory, adaptive management is referred to as adaptive control (Åström and Wittenmark 2008; Bertsekas 1995). Parameter uncertainty was targeted in the earliest formulations of adaptive management problems. For system models that are perfectly observable and linear in the uncertain parameters with additive, normally distributed natural variation, parameter means and their covariance matrix are sufficient statistics for characterizing uncertainty (Walters 1986). That is, the means and covariance matrix can be carried as knowledge state variables in the dynamic programming equations used to determine the optimal policy (Eq. 3). This technique is known more generally as "adaptive filtering" in control theory (Walters 1986, p. 200–202) and includes extended Kalman filters for nonlinear process models (Walters and Hilborn 1978). Since extended Kalman filters linearize the system around the best estimates, this can be a poor approximation technique for highly nonlinear systems and small data sets (Walters 1986, p. 211). Sufficient statistics can also be developed using principal component analysis (Walters 1986, p. 178) or singular value decomposition (Walters 1986, p. 180).

Fig. 2 Can it be solved using active adaptive management? Decision tree summarizing the choice of optimization approaches available to solve active adaptive management problems


Early studies introduced other extensions of the basic adaptive management framework such as exponential weighting to "forget" older data (Walters 1986, p. 213; Walters and Hilborn 1976), random or systematic shifts in the underlying parameter values over time (Smith and Walters 1981; Walters 1986, p. 212), partial controllability (Walters and Hilborn 1976), partial observability (Walters and Ludwig 1981), and a risk-averse utility function (Walters and Ludwig 1987).

Walters and Hilborn (1976)'s parameter-uncertain Ricker models were the first to take advantage of conjugate distributions describing the prior and posterior to streamline Bayesian updating of uncertainty. In their case, describing parameter uncertainty with a normal distribution in a linear process model with additive normal environmental variation yielded a normally distributed posterior distribution for parameter uncertainty. The advantage of using conjugate distributions is that it is possible to obtain a closed form expression for the posterior, so the distribution can be updated exactly without resorting to numerical simulation methods. A list of some known conjugate distributions is included in Table S1.

Many new applications of adaptive management problems in areas outside of fisheries and harvest management have utilized conjugate distributions. McCarthy and Possingham (2007) posed a general conservation management problem where a manager must choose between implementing two actions, both with unknown probabilities of success. Each probability of success is described by a beta distribution with parameters α and β. After observing s successes and f failures from the trials implemented, sufficient statistics α and β are updated as α + s and β + f. The authors used a case study of choosing between high- or low-density planting for successful revegetation. The beta-binomial conjugate relationship has since been applied to adaptive management of wildlife harvest (Hauser and Possingham 2008; Moore et al. 2008) (Example 1, Box 2), threatened species translocation (Rout et al. 2009; Runge 2013), and conservation of a metapopulation (Southwell et al. 2016).
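A minimal sketch of this beta-binomial update (the uniform Beta(1, 1) prior and the counts are hypothetical numbers chosen for illustration):

# Sufficient statistics of the belief about an unknown probability of success.
alpha, beta = 1.0, 1.0          # Beta(1, 1) prior, i.e. a uniform belief
s, f = 7, 3                     # observed successes and failures of the trials

alpha, beta = alpha + s, beta + f
print(f"Posterior: Beta({alpha:.0f}, {beta:.0f}); "
      f"posterior mean success probability = {alpha / (alpha + beta):.2f}")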

Tables 2 and 3 provide the algorithms required to solve active adaptive management under parameter uncertainty in the case where the uncertain parameter is defined as a beta distribution. First, the optimal policy is calculated for all possible beliefs (Table 2). The procedure then applies the best action given the current state and belief (Table 3). After each implementation, the system is monitored and the belief updated (Table 3, lines 7 to 9). The process repeats for the duration of the time horizon.

Although the use of conjugate distributions reduces the dimension of the optimization state space from a potentially (continuous) infinite state space to a finite state space, there remain computational challenges. In particular, the domain of plausible values can expand over time. For example, a beta-binomial management problem with a known prior at time 0 and n trials per time step could project to any one of n*t + 1 states at time t (Example 1, Box 2).

It is not possible to find an exact conjugate prior for every parameter uncertainty problem, and numerical solutions may be required to update sufficient statistics. Density projection for distributions from the exponential family can provide an alternative (Zhou et al. 2010). This approach calculates the posterior distribution by projecting the continuous belief space of the unknown parameter to the closest (in the sense of the minimum Kullback–Leibler divergence) discrete distribution that matches the family of the prior distribution. This projected belief becomes a continuous state MDP that can be solved in a number of ways, for example using discretization techniques. In resource management, this has been applied to a hierarchical beta distributed model with both continuous action and belief state spaces (Springborn and Sanchirico 2013).

A simpler alternative treatment of parameter uncertainty is to discretize the parameter into a finite number of plausible values and attach a degree of belief to each value (Fig. 2). This is equivalent to the treatment of model uncertainty and can be computed using a discretized belief MDP or partially observable Markov decision processes (POMDP), as described in the next section.

Model uncertainty

In adaptive management, model uncertainty is represented as alternative hypotheses ("models") about how the system dynamics function. Adaptive management tools to reduce model uncertainty were first proposed in the fisheries literature as early as 1978 (Silvert 1978) and were included in Walters' seminal text on adaptive management (Walters 1986). In the mid-1990s, adaptive management under model uncertainty was successfully implemented by the US Fish and Wildlife Service to set harvest quotas for mallards in the USA (Johnson et al. 1997; Nichols et al. 1995), which set the stage for a plethora of other adaptive management studies designed to reduce model uncertainty in conservation and resource management (Johnson et al. 2002; Martin et al. 2009; McDonald-Madden et al. 2010b; Moore and Conroy 2006; Smith et al. 2013; Williams 2011a).

The key prerequisite for an optimal adaptive management system designed to reduce model uncertainty is that plausible alternative hypotheses about system function can be articulated. The hypotheses (models) can take many forms, so long as the transition probabilities between states can be computed under each possible hypothesis (model). This is a key point of difference between the methods used to solve model and parameter uncertainty (which requires either specific parameter distributions with known conjugate priors or other suitable sufficient statistics). Because convenient sufficient statistics may not exist when confronted with parameter uncertainty, many adaptive management studies use the methods of model uncertainty to distinguish between a small number of values of a single parameter (McDonald-Madden et al. 2010b; Moore et al. 2011; Runge 2013), and in these cases, parameter uncertainty could, in principle, be used instead. However, where multiple parameters are uncertain and key hypotheses need to be tested, model uncertainty is currently the only tractable approach (Williams 2009). For example, Moore et al. (2008) posited two alternative models of how burning affects population growth of a threatened plant by varying parameters associated with juvenile survival and reproduction.

As with parameter uncertainty, when modeling an active adaptive management problem under model uncertainty, we must predict how implementing actions will change our future knowledge, so that our chance of achieving our objective is maximized. To do so, one must include in the state space not only the information about the state of the system but also the current and future knowledge, i.e., the probability distribution over possible models (belief). Finding the best action to implement becomes a function of both the state of the system and the belief over the models. The classic approach requires solving an MDP with a continuous belief state space (belief MDP).

Continuous MDPs are computationally hard to solve, and approximate solution techniques must be used to derive solutions. A natural way to overcome this limitation is to discretize the continuous belief state space and solve a discrete state MDP. We describe a simple method that illustrates the required steps (see algorithm, Table 4). In the planning stage, the optimal policy is determined for discrete portions of the belief space. For a problem with M hypotheses or models, the belief state space is discretized into an M-dimensional grid representing the possible belief states of the system. Discrete elements of the grid contain areas of continuous belief space B for which the transition matrix P(st+1, bt+1 | st, at, bt) must be calculated. This can be done by repeatedly simulating each possible model in the proportions indicated by the target belief state (Line 6, Table 4). Simulated results are stored, and the probability of transition is determined by dividing the number of transitions observed by the total number of simulations. The policy can then be calculated and implemented by executing the optimal action for the current state and belief state. The execution stage is similar to the parameter uncertainty case (Table 3), except that the belief is updated using the discrete formulation (Eq. 2). This approach and variants have been applied broadly, from the harvest of natural resources (Martin et al. 2009) to the conservation of threatened species (McDonald-Madden et al. 2010b; Moore et al. 2011) (see Table 1 for additional references).
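A minimal sketch of this simulation step (cf. Table 4), using a hypothetical two-model, two-state, two-action problem; the transition matrices, grid resolution and simulation count are invented for illustration. Candidate models are sampled in proportion to the current belief, the Bayes-updated belief is snapped to the nearest grid point, and transition frequencies are accumulated.

```python
import numpy as np

# hypothetical candidate models: P[m, a, s, s'] transition probabilities
P = np.array([
    [[[0.9, 0.1], [0.6, 0.4]], [[0.5, 0.5], [0.2, 0.8]]],   # model 0
    [[[0.6, 0.4], [0.3, 0.7]], [[0.8, 0.2], [0.5, 0.5]]],   # model 1
])
grid = np.linspace(0.0, 1.0, 11)       # discretized belief in model 0

def bayes_update(b, s, a, s_next):
    """Posterior belief in model 0 after observing s -> s_next under action a (Eq. 2)."""
    like = P[:, a, s, s_next] * np.array([b, 1.0 - b])
    return like[0] / like.sum()

def estimate_transition(b, s, a, n_sims=10_000, rng=np.random.default_rng(1)):
    """Estimate P(s', b' | s, b, a) over the grid points by simulation."""
    counts = np.zeros((2, grid.size))
    for _ in range(n_sims):
        m = rng.choice(2, p=[b, 1.0 - b])        # sample a model from the belief
        s_next = rng.choice(2, p=P[m, a, s])     # simulate that model one step
        b_next = bayes_update(b, s, a, s_next)
        counts[s_next, np.abs(grid - b_next).argmin()] += 1   # snap to nearest grid point
    return counts / n_sims

print(estimate_transition(b=0.5, s=0, a=0))
```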

Despite the simplicity of solving discretized belief MDPs, the computational costs become very high as the dimensionality of the problem increases (Brafman 1997; Zhou and Hansen 2001). The limitations and inefficiency of discretized belief-MDP approaches (fixed or variable grids) are well documented, and this solution technique is inadequate to solve problems with more than a handful of models (Bonet 2002; Lovejoy 1991). Following upon the work of Chadès et al. (2008), MacKenzie (2009) first raised the possibility of using POMDPs to tackle adaptive management problems. Later on, Williams (2011b) recognized that model uncertainty can be modeled using methods developed for dealing with partial observability. However, Williams (2011b) proposed a complex transition function suggesting that POMDPs must account for both observational and state uncertainty (Fackler and Pacifici 2014). Building on these previous works, Chadès et al. (2012) took advantage of a useful simplification: not knowing the correct model is equivalent to not being able to observe the model. This realization allows us to transform model uncertainty into observation uncertainty, making it analogous to a standard POMDP where the hidden variable represents the correct model. Chadès et al. (2012) further showed that where some state variables are perfectly observable and some partially observable, the problem can be modeled as a mixed observability MDP (MOMDP). This observation allows modelers to factor the state space, which exploits the conditional independence of variables within the joint probability of transition to develop even faster solution methods. Indeed, classic non-factored representations need to store the probabilities of transition between all possible states even though a state variable may not affect the state of another variable and vice versa. In this way, the classic algorithms are inefficient because they store information that is not needed to find an optimal solution. A better way is to use the structure of the problem to store information only for the state variables that directly affect other state variables (Chadès et al. 2011, 2012). In the case of adaptive management (Fig. S1), an unknown variable (ht, model or parameter) influences the state of the system (st, e.g., abundance of a population); however, the unknown variable is not influenced by the state of the system or the management actions, i.e., p(ht+1 | st, ht, at) = p(ht+1 | ht). If an optimization problem has many independent variables, we can solve larger problems using factored representations because we have fewer state interactions to consider and store (Boutilier and Dearden 1994).
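To illustrate the storage saving from this factorization (the state and model counts below are hypothetical, and random matrices stand in for real models), the dynamics can be stored as one |S| × |S| matrix per model plus the trivial p(ht+1 | ht), rather than a joint matrix over the factored state (s, h).

```python
import numpy as np

n_states, n_models = 50, 4                    # hypothetical sizes
# P_m(s' | s) for each model m (a single action shown); rows are valid distributions
P_s = np.random.dirichlet(np.ones(n_states), size=(n_models, n_states))
P_h = np.eye(n_models)                        # p(h' | h): the hidden model never changes

# factored storage versus the unfactored joint transition over (s, h)
print(P_s.size + P_h.size)                    # n_models * n_states**2 + n_models**2 entries
print((n_states * n_models) ** 2)             # entries needed without factoring
```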

Partially observable MDPs are MDPs where one or more state variables cannot be observed with certainty. In the ecology literature, examples of state variables that cannot be observed with certainty include the abundance of a population (Nicol and Chadès 2012), the presence of cryptic threatened species (Chadès et al. 2008), and the infected or susceptible status of populations vulnerable to disease (Chadès et al. 2011).

First studied in the operations research literature, POMDPs provide a way of reasoning about trade-offs between actions to gain rewards and actions to gain information (Monahan 1982). To take into account the incomplete observability of the system, POMDP models augment MDP models with a finite set of possible observations O and a corresponding observation function Z that maps each state-action pair to a probability distribution over O.

Table 1  Non-exhaustive list of papers that use a decision-theoretic adaptive management approach

| Reference | Objective of the problem | Uncertain parameter | Uncertain model | Passive/Active | Optimization method |
|---|---|---|---|---|---|
| (Walters 1975) | Maximize harvest value and minimize inter-annual harvest variance | Ricker production parameter with 3 empirically inspired distributions: optimistic, natural and pessimistic shapes | NA | Passive | SDP/MDP |
| (Walters and Hilborn 1976) | Maximize harvest value in a fishery | Stock production parameter, equilibrium stock parameter and covariance matrix. The parameters are updated using recursive least square estimators. | Ricker and Beverton-Holt population model | Active for parameter | Stochastic dynamic programming; solutions for one uncertain parameter at a time are presented. Management scenarios are evaluated by simulation for model uncertainty. |
| (Smith and Walters 1981) | Maximize harvest value in a fishery | Stock production parameter, equilibrium stock parameter and covariance matrix. The parameters are updated using recursive least square estimators. | NA | Active | Approximate stochastic dynamic programming ("wide sense dual control") |
| (Ludwig and Walters 1981) | Maximize harvest value in a fishery | Non-specific | NA | Passive | Numerical optimization at equilibrium |
| (Mangel and Clark 1983) | Maximize fish population discoveries | Average density of detectable fish schools parameter is updated using a gamma distribution | NA | Active | Dynamic programming with gamma parameters as state variables, optimized for one updating time step |
| (Walters 1986) p269-273 | Maximize harvest value in a fishery | NA | Two alternative models of population response to escapement choice | Active | SDP/MDP |
| (Walters 1986) p273-275 | Maximize harvest value in a fishery | Sensitivity of productivity is updated using a Normal distribution | NA | Active | Dynamic programming |
| (Walters 1986) p275-278, 286-291 | Maximize harvest value in a fishery | Normal distributions | NA | Active/Passive | Wide-sense dual control |
| (Charles 1992) | Maximize harvest value in a fishery | Growth rate and maximum sustainable population size (Normal distribution) | NA | Passive | Simulation. In each time step the optimal policy is determined by the mean of the parameters; after the policy is implemented the parameters are updated. |
| (Frederick and Peterman 1995) | Maximize long term harvest value in a fishery | Stock-recruitment parameters (lognormal) | NA | Active/Passive | SDP/MDP |
| (Williams et al. 1996) | Maximize long-term cumulative harvest of waterfowl, above a certain density threshold | NA | Two alternative models of population response to harvest and survival | Active | SDP/MDP |
| (Johnson et al. 2002) | Maximum long-term cumulative harvest of waterfowl, above a certain density threshold | NA | Two models for the influence of kill rate over survival and two models for the influence of the number of ponds over reproduction rate | Active/Passive | SDP/MDP |
| (Moore and Conroy 2006) | Perpetuating a maximum stream of old-growth forest habitat in a national wildlife refuge | NA | 3 models describing the growth and ageing of a forest | Active | Discretized belief MDP |
| (McCarthy and Possingham 2007; Moore and McCarthy 2010) | Maximize the expected number of successes over a specified number of time periods, or maximize the expected number of time periods in which the number of successes is considered acceptable | Probability of success of management action defined as Beta distribution with binomial updating | NA | Active/Passive | SDP/MDP |
| (Hauser and Possingham 2008) | Maximum long term fish stock harvest | Recovery rate after stock collapse, modeled as a Beta distribution with binomial updating | NA | Active/Passive | SDP/MDP |
| (Moore et al. 2008) | Maximum long term fish stock harvest | Beta with binomial updating | NA | Active/Passive | SDP/MDP |
| (Martin et al. 2009) | Maximize socio-economic benefits from harvesting ecosystem, while minimizing the probability that the social and ecological systems cross a given critical but uncertain threshold | NA | Two scenarios for the minimum amount of water in a patch necessary for a species to colonize this patch | Active/Passive | SDP/MDP/discretized belief MDP |
| (Rout et al. 2009) | Translocation of threatened species, choosing between introducing to two sites | Mortality rate at one site represented as Beta distribution with binomial updating | NA | Active/Passive | SDP/MDP |
| (McDonald-Madden et al. 2010b) | Maximizing the population viability of the Tasmanian Devil affected by a facial tumor disease | NA | Uncertain disease latency, yielding different population growth rates under each action | Active | Discretized belief MDP |
| (Martin et al. 2011) | Maximize outflow but provide a minimum flow for habitat | NA | Three models representing alternative water demands under different rates of sea level rise | Passive | SDP/MDP |
| (Moore et al. 2011) | Maximize time-discounted plant population size across years without burning | NA | Two models describing the juvenile plant stage response to burning | Active | Discretized belief MDP |
| (Williams 2011a) | Maximize impoundment productivity and shorebird use | NA | Three models representing alternative responses to drawdown | Active/Passive | SDP/MDP/discretized belief MDP |
| (Williams et al. 2011) | General | NA | Two models representing alternative responses to management | Active | POMDP |
| (Chadès et al. 2012) | Maximize density of a threatened species | NA | Four models representing alternative responses to management | Active | Modeled and solved as POMDP and factored POMDP |
| (Runge 2013) | Maximize introduction success of a threatened species | Survival rate defined as Beta with binomial updating | NA | Active | Discretized belief MDP |
| (Smith et al. 2013) | Maximize harvest profit of a species while maintaining abundance of another species (predator–prey) | NA | Three models representing the effect of weight gain on survival and fecundity | Active | Discretized belief MDP |
| (Springborn and Sanchirico 2013) | Maximize harvest profit | Survivorship rate (fish) follows a beta distribution over the years, whose mean is unknown (known variance) and described by another beta | NA | Active/Passive | Discretized belief MDP |
| (Fackler and Pacifici 2014) | Maximize occurrences of a "good" state | NA | Two alternative models representing an optimistic or a pessimistic outcome | Active | Formulated as an extended POMDP and solved as a discretized belief MDP |
| (Nicol et al. 2014) | Maintain minimum proportion of different habitats under climate change | NA | Three inflow scenarios based on historical rainfall trends | Passive | SDP/MDP |
| (Nicol et al. 2015) | Maximize migratory shorebird populations across space and subject to sea level rise | NA | Three models representing alternative responses to management under sea level rise | Active | Modeled and solved as POMDP and factored POMDP |
| (Southwell et al. 2016) | Maximize probability that at least one patch is occupied at end of program, or maximize total number of patches occupied at end of program | Colonization rate between patches | NA | Active/Passive | SDP/MDP |

In the case of model uncertainty, a factored POMDP is characterized by the tuple ⟨X, A, O, P, Z, r, γ⟩. X = S × M represents the factored state space. S denotes states in the MDP sense, i.e., the possible conditions of the system. We consider the unknown model set M to be state variables. A is the set of available management actions to control the system. O is the set of observations perceived by a manager. If all states in S are observable, then O = S. In adaptive management with model uncertainty, the unknown model is hidden and cannot be observed; we infer the state of the model through observation of the state variables S. P is the transition matrix: elements Pm(st+1 | st, at) represent the probability of observing state st+1 after taking action at, given that the current state is st and the correct model is m. Z is the observation function p(ot+1 | xt+1, at) describing the probability of observing ot+1 from factored state xt+1 after taking action at. r is the reward function and γ is the discount factor as defined for an MDP (section Important concepts). The optimal decision at time t depends on the complete history of past actions and observations. Belief states are sufficient statistics used to summarize and overcome the difficulties of incomplete detection (Åström 1965) (section Important concepts). Solving a factored POMDP means finding a policy π: S × B → A, mapping a decision to a state of the system (s ∈ S) and a current belief over the set of models M (b ∈ B). An optimal policy maximizes the discounted expected sum of rewards over a finite or infinite time horizon. As for MDPs, this expected summation is referred to as the value function. In the context of adaptive management problems subject to model uncertainty but with perfect state observation, the value function equations can be simplified because the model m is the only unknown variable (Chadès et al. 2012):

V(st, bt) = max_{at ∈ A} Σ_{m ∈ M} bt(m) [ r(st, at) + γ Σ_{st+1 ∈ S} Pm(st+1 | st, at) V(st+1, bt+1) ]    (4)

where bt+1 is calculated according to Bayes' rule (Eq. 2). The optimal policy can be obtained from the optimal value function by selecting the action that gives the highest value for each state of the system. While algorithms have been developed over the past years, exact resolution of POMDPs is in general intractable: finite-horizon POMDPs are PSPACE-complete (Papadimitriou and Tsitsiklis 1987), and infinite-horizon POMDPs are undecidable (Madani et al. 2003). Approximate methods have been developed to solve large POMDPs. Among them, point-based approaches approximate the value function (Eq. 4) by updating it only for selected belief states (Pineau et al. 2003; Spaan and Vlassis 2005). Typical point-based methods sample belief states by simulating interactions with the environment and then update the value function and its gradient over a selection of those sampled belief states. In many cases, software is available for these methods. For example, methods such as SARSOP (Kurniawati et al. 2008), Perseus (Spaan and Vlassis 2005), and Symbolic Perseus (Poupart 2005) have been used successfully for adaptive management problems (Nicol et al. 2013) and spatial optimization (Chadès et al. 2011). The size and complexity of the problems that have been solved with these advanced methods are much larger than those solved with discretized belief-MDP approaches (Ong et al. 2010).
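The sketch below makes Eq. 4 concrete on a toy two-model, two-state, two-action problem (all transition matrices, rewards and the belief grid are invented for illustration; real applications would use the point-based solvers cited above rather than this brute-force grid):

```python
import numpy as np

# hypothetical two-model problem: P[m, a, s, s'] transition probabilities, r[a, s] rewards
P = np.array([
    [[[0.9, 0.1], [0.7, 0.3]], [[0.4, 0.6], [0.1, 0.9]]],   # model 0
    [[[0.5, 0.5], [0.3, 0.7]], [[0.8, 0.2], [0.6, 0.4]]],   # model 1
])
r = np.array([[0.0, 1.0], [0.5, 0.0]])
gamma = 0.95
grid = np.linspace(0.0, 1.0, 51)              # belief that model 0 is correct

def bayes(b, s, a, s_next):
    """Eq. 2: update the belief in model 0 after observing s -> s_next under action a."""
    like = P[:, a, s, s_next] * np.array([b, 1.0 - b])
    return like[0] / like.sum()

V = np.zeros((2, grid.size))                  # value function V(s, b) on the belief grid
for _ in range(300):                          # value iteration implementing Eq. 4
    V_new = np.zeros_like(V)
    for s in range(2):
        for bi, b in enumerate(grid):
            q = np.zeros(2)
            for a in range(2):
                total = r[a, s]
                for s_next in range(2):
                    # belief-weighted transition probability (the model is hidden)
                    p = b * P[0, a, s, s_next] + (1 - b) * P[1, a, s, s_next]
                    if p > 0:
                        b_next = bayes(b, s, a, s_next)
                        total += gamma * p * V[s_next, np.abs(grid - b_next).argmin()]
                q[a] = total
            V_new[s, bi] = q.max()            # the argmax over q defines the policy pi(s, b)
    V = V_new
```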

There are costs associated with modeling an adaptive management problem as a POMDP. Using POMDPs requires users to navigate the specialized literature on the topic, but see Chadès et al. (2008), McDonald-Madden et al. (2011), and Regan et al. (2011) for illustrative applications in conservation and discussions about pros and cons. An additional issue arises when exploring the solutions provided by POMDPs. Although a POMDP solution can be represented as a decision graph (Nicol and Chadès 2012), decision graphs are often too detailed to be presented to managers, and simplifications are required. Simplifications have traditionally been restricted to heuristics or rules of thumb (Chadès et al. 2011; Nicol and Chadès 2012). However, a recent approximate method (alpha-min) allows users to set the maximum number of decisions they are willing to consider at each time step with a measure of the performance loss accrued (Dujardin et al. 2015). This approach empowers managers to trade simplicity of the POMDP solution against performance. Alpha-min is able to provide simple, near-optimal policies for POMDPs, which enables improved interpretation of results and better communication of outcomes to decision makers. Although alpha-min does not yet scale well to large problems, this research direction warrants further study.

Perhaps because the solution methods for reducing model uncertainty are more general than those designed to reduce parameter uncertainty, the majority of adaptive management studies published today rely on the methods developed to reduce model uncertainty. Software for model uncertainty methods has also been more available than for parameter uncertainty (Fackler 2013; Lubow 1997), which may have contributed to its relative popularity.

Passive adaptive management

Active adaptive management is the state-of-the-art adaptive management method because it offers guaranteed optimal performance: there is in theory no better way of achieving our management objective. Where practical, active adaptive management should be used. However, active adaptive management requires augmenting the state space with sufficient statistics and projecting the sufficient statistics into the future, generating an important computational cost (this is known as the "curse of dimensionality"). This cost is particularly high when the sufficient statistics take continuous values. Motivated by the need to scale up applications of adaptive management, passive adaptive management methods (Walters 1975, 1986) designate heuristics that calculate the best next decision assuming that the belief will not change into the future (this assumption is referred to as "certainty equivalence" in control theory (Bertsekas 1995)). In many practical cases, it is possible to achieve good performance using passive adaptive management (Rout et al. 2009). Two approaches are commonly used to generate passive adaptive management policies: the weighted average and the most likely value. The weighted average methods to solve adaptive management problems are similar regardless of whether the uncertainty is parameter uncertainty or model uncertainty; so, we address the two uncertainties together.

Recall that when solving passive adaptive management problems, the optimization assumes that the current knowledge of the system will not change over time. Learning occurs once the effect of an implemented action is monitored, not during the optimization procedure. The sufficient statistic used is a belief over the models or parameters and is updated using Bayes' rule (Eqs. 1 and 2). During the implementation phase, the optimal action, at, is selected using weighted averaging. For each potential action, model or parameter responses (i.e., transition probabilities) are averaged across all models or parameter values, with the weights given by the current belief (Williams 2011a). To determine an optimal policy, a passive adaptive manager averages future outcomes over all plausible parameter values (Walters 1975):

P(st+1 | st, at, bt) = ∫ bt(θ) Pθ(st+1 | st, at) dθ    (5)

When the unknown quantity θ takes discrete values, the integral in Eq. 5 is replaced by a summation. The resulting state, st+1, is observed and the belief, bt+1, is updated according to Bayes' rule. Because the transition matrices change as the belief changes, this approach requires that an MDP is solved in every time step (Tables 6 and 7).
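A minimal sketch of this weighted-average step for a discrete unknown parameter (the two candidate transition matrices and the belief are hypothetical, and solve_mdp is only a placeholder for an MDP solver such as those cited in Tables 6 and 7):

```python
import numpy as np

# hypothetical problem: two plausible parameter values, two states, two actions
# P[theta, a, s, s'] are the transition matrices implied by each candidate parameter value
P = np.array([
    [[[0.8, 0.2], [0.5, 0.5]], [[0.3, 0.7], [0.1, 0.9]]],   # first candidate value
    [[[0.6, 0.4], [0.2, 0.8]], [[0.7, 0.3], [0.4, 0.6]]],   # second candidate value
])
b = np.array([0.5, 0.5])                     # current belief over the candidate values

def averaged_transitions(P, b):
    """Discrete form of Eq. 5: belief-weighted average of the transition matrices."""
    return np.tensordot(b, P, axes=1)        # shape (|A|, |S|, |S|)

def bayes_update(b, s, a, s_next):
    """Update the belief after monitoring the transition s -> s_next under action a."""
    post = P[:, a, s, s_next] * b
    return post / post.sum()

Pa = averaged_transitions(P, b)              # re-solve an MDP with Pa at every time step
# pi_t = solve_mdp(Pa, r)                    # placeholder: any MDP solver
b = bayes_update(b, s=0, a=1, s_next=1)      # after implementing pi_t(s) and monitoring
print(Pa.shape, b)
```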

For model uncertainty problems, there exists a more basic procedure for solving passive adaptive management problems: the "most likely value." In this method, actions are selected based on the model with the highest belief (Williams 2011a). In the planning phase, an MDP is solved for each of the |M| candidate models, resulting in |M| policies specifying the optimal action to take for each state of the system (Table 8). During the implementation phase, the optimal action, at, is selected according to the model with the highest belief, i. The resulting state, st+1, is observed and the belief, bt+1, is updated according to Bayes' rule. This approach does not require that an MDP is solved in every time step; so, the computational cost of this approach is less than that of the previous approach.

Numerous approaches based on the certainty equivalence principle have extensively been developed in the adaptive control literature (Filatov and Unbehauen 2000; Wittenmark 1995).

Similarly, in the artificial intelligence literature, heuristics have been developed to solve large POMDPs that could easily be used as passive adaptive management approaches (Cassandra and Kaelbling 1995). Advantages and limitations of these approaches have yet to be assessed in an ecological context.

Methodological challenges and discussion

Recent advances

Adaptive management methods have advanced significantly with the advent of powerful computational techniques for Bayesian updating. In particular, modeling adaptive management problems under model uncertainty as POMDPs allows us to solve previously unsolvable problems. As POMDP methods are more widely adopted in the ecological modeling community, the diversity of ecological challenges that can be managed with adaptive management methods will expand.

Table 2  Pseudocode for solving active adaptive management problems under parameter uncertainty using an exact stochastic dynamic programming approach, assuming that θ follows a Beta distribution

Calculate optimal active adaptive policy under parameter uncertainty assuming θ follows a Beta distribution with sufficient statistics (α,β)

Input: S: set of states;
  P(.) = f(st, at, θ ∼ Beta(α,β)): generates a vector of probability distributions over future states based on the equation of the dynamics of the system as a function of the current state, action and unknown parameter θ ∼ Beta(α,β);
  A: set of actions;
  r: rewards;
  T: time horizon;
  (α,β)t: t finite sets of shape parameters for the Beta distribution;

Output: P: probability of transitions over the set of states S, shape parameters (α,β)t and actions A;
  π1, π2, …, πT: optimal policy for the finite horizon MDP

% calculate transition probabilities P using sufficient statistics to represent the uncertain parameter
1 For all ai in A, all si in S, (αi,βi) in (α,β)t
2 P(. | si, (αi,βi), ai) = f(si, ai, θi ∼ Beta(αi,βi));
3 endFor
% solve finite MDP problem, e.g. (Chadès et al. 2014; Fackler 2013; Marescot et al. 2013)
4 π1, π2, …, πT = solve_MDP(S, (α,β)t=1..T, A, P, r, T);

Table 3  Pseudocode to implement active adaptive management problems. In this case, we use the sufficient statistics (α,β) to represent our knowledge of the uncertain parameter θ.

Implement active adaptive management

Input: sinit: initial state; α0, β0: initial shape parameters defining a Beta distribution; T: time horizon; π1, π2, …, πT: finite horizon optimal policy

Output: a1, s1, α1, β1, …, aT, sT, αT, βT: history of actions implemented, states observed, updated shape parameters at each time step 1 to T;

5 st = sinit; αt = α0; βt = β0;
6 For t = 1:T
7 at = πt(st, αt, βt);
8 st+1 = implement_action(st, at); % implement action using the optimal policy calculated in Table 2, and monitor
9 αt, βt = update_sufficient_statistics(st+1, st, αt, βt, at); % see Eq. 1
10 endFor

Major contemporary ecological issues like climate change (non-stationarity), imperfect detection, and multi-objective management are now being solved using POMDPs. Here, we review some recent advances in these cutting-edge problem domains and discuss how adaptive management optimization methods could be expanded to robust and multi-actor decision-making.

Non-stationary dynamics

The dynamics of the system are commonly assumed to stay the same over time, i.e., to be stationary. In cases where changing dynamics of the system must be accounted for over time, it is possible to calculate non-stationary adaptive management strategies. This is particularly useful for models that accommodate climate change. There are two main approaches to incorporate non-stationary dynamics. First, the suite of models can be composed of different rates of change so that the transition matrix changes every year in a known way (Martin et al. 2011; Nichols et al. 2011; Nicol et al. 2014). This approach uses standard model uncertainty techniques to learn the true rate of change, but the dynamics can only change at the rates specified in the model suite. In the second approach, the suite of models is composed of stationary models so that the transition matrices may change at any rate (Nicol et al. 2013, 2015). Unlike the first approach, which pre-specifies the rate of change, Nicol et al. (2015) specify a small probability of transition between models, allowing any candidate model to be true at a given time. While this approach allows more freedom in the rate of change between models, change is assumed to be less gradual than in the first approach.

Table 4  Pseudocode for solving active adaptive management for model uncertainty using the discretized belief MDP approach

Calculate active adaptive management using the discretized belief MDP approach

Input: S: set of states; A: set of actions; r: rewards;

Output: B: finite set of discretized belief points
  PB: probability of transitions over the set of states S and discretized belief points B
  π: optimal policy for the discretized belief MDP

1 B = discretize_grid(k, distance); % discretize the belief space over models using a grid approach
2 For action ai in A, state sj in S, belief points b in B % calculate PB by means of simulations
3 ns = zeros(B,1);
4 For n = 1 to maxn
5 bn = sample(b, distance); % sample neighboring points
6 [sj', b'] = simulate_action(ai, sj, bn); % b' in B is the nearest belief point
7 ns(b') = ns(b') + 1; % counting
8 endFor
9 PB(sj', b' | sj, b, ai) = ns(b')/maxn; % averaging
10 endFor
11 π = solve_MDP(S, B, A, PB, r); % solve the discretized belief MDP, e.g. (Chadès et al. 2014; Fackler 2013; Marescot et al. 2013)

Table 5  Pseudocode for solving active adaptive management for model uncertainty using a POMDP approach

Calculate optimal active adaptive policy under model uncertainty using belief states as sufficient statistics

Input: S: set of states;
  P(.) = f(st, at, mi): generates a vector of probability distributions over future states based on the equation of the dynamics of the system as a function of the current state st, action at and model mi in M;
  A: set of actions;
  r: rewards;
  T: time horizon;
  M: finite set of models;

Output: P: probability of transitions over the set of states S, models M, actions A;
  π: optimal policy

% calculate transition probabilities P under each candidate model
1 For all ai in A, all si in S, mi in M
2 P(. | si, mi, ai) = f(si, ai, mi);
3 endFor
% solve factored POMDP (Ong et al. 2010; Poupart 2005)
4 π = solve_POMDP(S, M, A, P, r);

This approach requires estimates of the probability of transition between models; however, this assumption can be removed by treating the rate of change between models as a hidden probability that can take discrete values (Nicol et al. 2015). This comes at an additional computational cost because an additional hidden parameter must be learned.
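As a sketch of the second approach (the number of models and the switching probability ε below are illustrative values, not those used by Nicol et al. 2015), the hidden model variable is given a small probability of moving to a different model at each time step, so that p(mt+1 | mt) is no longer the identity matrix:

```python
import numpy as np

n_models, eps = 3, 0.05                       # eps: illustrative switching probability
# p(m_{t+1} | m_t): stay with probability 1 - eps, otherwise move to another model
model_transition = np.full((n_models, n_models), eps / (n_models - 1))
np.fill_diagonal(model_transition, 1.0 - eps)
print(model_transition)                       # rows sum to 1; the identity is recovered when eps = 0
```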

Imperfect detection and monitoring

Adaptive management techniques are tailored to address structural uncertainty; however, they can be extended to address other kinds of uncertainty, including state uncertainty due to measurement error. POMDPs are an obvious candidate to tackle imperfect detection in adaptive management as they have been applied to partially observable problems in robotics for decades (Chadès et al. 2012; Fackler and Haight 2014). In adaptive management, POMDPs can help to decide when to change monitoring effort under imperfect detection (Fackler and Haight 2014). If the cost of adaptive management is an impediment to uptake, accounting for the cost of monitoring in adaptive management can help to minimize the cost of an adaptive management program (Haight and Polasky 2010; Moore and McCarthy 2010; White 2005). Indeed, some species might not require monitoring at each time step, as monitoring might not be needed to inform the next decisions. Monitoring decisions should be part of the optimization problem, or else we risk wasting precious resources (Chadès et al. 2008; MacKenzie 2009; McDonald-Madden et al. 2010a).

Multi-objective approaches

A current challenge in applied ecology is the need to account for multiple objectives when deciding the best management action to implement (Kareiva et al. 2014). In multi-objective problems, the objective is transformed into a vector of objectives. Unlike single-objective problems, multi-objective problems generally admit several optimal vectors of values. Each optimal vector corresponds to a possible "best compromise" between the different objectives. The set of these optimal vectors is called the Pareto frontier (Ehrgott 2005). One way to solve a multi-objective problem consists of generating the entire Pareto frontier. Because the Pareto frontier can be exponentially large even for two objectives, approximate Pareto frontiers are usually sought. Roijers et al. (2015) propose a way of generating the Pareto frontier of multi-objective POMDPs, producing stochastic policies. This work is attractive because it enables multi-objective active adaptive management problems to be solved. However, dealing with stochastic policies is not convenient in applied fields where simplicity of the solution is important. In the case of the passive adaptive management approach, MDPs can be solved in a multi-objective context, providing deterministic policies at a small computational cost (Perny and Weng 2010).

Multi-actor approaches

In the case where several actors manage the same resource, a multi-actor adaptive management problem could be formulated as a sequential decision problem under uncertainty. In artificial intelligence, these types of decision models are known as decentralized MDPs, decentralized POMDPs (Bernstein et al. 2002) or multiagent MDPs (Boutilier 1999; Littman 1994). However, most of these multi-actor problems do not have exact solution methods (Amato and Oliehoek 2015; Chades et al. 2002; Dibangoye et al. 2016). Perhaps the most accessible but also the most constrained model is the multiagent MDP (Boutilier 1999; Chades and Bouteiller 2005). In its simplest form, a multiagent MDP assumes that the action space is factored and actors share a common objective.

Table 6  Pseudocode for solving passive adaptive management problems under parameter uncertainty using the weighted average approach

Implement passive adaptive management policy under parameter uncertainty assuming θ follows a Beta distribution with sufficient statistics (α,β)

Input: S: set of states; sinit: initial state; A: set of actions; r: rewards;
  α0, β0: initial shape parameters defining a Beta distribution;
  T: time horizon

Output: a1, s1, α1, β1, …, aT, sT, αT, βT: actions implemented, states observed, updated shape parameters at each time step 1 to T;

1 st = sinit; αt = α0; βt = β0;
2 For t = 1:T
3 Pαt,βt = calculateP(αt, βt); % calculate P(st+1 | st, at) for the given αt, βt
4 πt = MDPsolve(S, A, Pαt,βt, r, T); % finite horizon or infinite horizon
5 at = πt(st);
6 st+1 = implement_action(st, at); % implement action using policy πt calculated at line 4, and monitor
7 αt, βt = update_sufficient_statistics(st+1, st, αt, βt, at); % see Eq. 1
8 endFor

The complexity of solving a multiagent MDP is the same as that of an MDP; however, the size of the action space is exponential in the number of actors, which would be a strong computational limiting factor. Multiagent MDPs could easily be used in a passive adaptive management context and address some of the most pressing ecological problems.

Robust approaches

In the case where decision makers are interested in risk-averse adaptive management solutions, methods to solve robust Markov decision processes (Givan et al. 2000; Wiesemann et al. 2013) could be adapted directly as passive adaptive management approaches. In operations research, Nilim and El Ghaoui (2005) provide an approach for solving robust MDPs with unknown transition matrices in a passive adaptive management setup. In the active case, we would rely on available methods to solve robust POMDPs (Osogami 2015). None of these approaches has been assessed in an ecological context.

Caveats

Under parameter uncertainty, it is assumed that the uncertain parameter can be modeled using a specified probability distribution (Table S1). Research on the consequences of assuming the wrong probability distribution does not exist, but a poor model selection may result in poor performance.

Table 8  Pseudocode for the passive adaptive management algorithm under model uncertainty using the most likely model approach

Planning stage of passive adaptive management under model uncertainty

Input: S: set of states; A: set of actions; r: rewards; P1-k: probability transitions for models 1 to k

Output: π1, π2, …, πk: optimal policy for models 1 to k

1 For all models (i = 1:k)
2 πi = solve_MDP(S, A, Pi, r); % solve the corresponding MDP
3 endFor

Implement passive adaptive management policy under model uncertainty

Input: sinit: initial state; binit: initial belief over the k models; T: time horizon

Output: a1, s1, b1, …, aT, sT, bT: actions implemented, states observed, updated belief states at each time step 1 to T;

4 st = sinit; bt = binit;
5 For t = 1:T
6 i = argmax(bt); % select the model with the highest belief
7 at = πi(st);
8 st+1 = implement_action(st, at); % implement and monitor
9 bt+1 = update_sufficient_statistics(st+1, st, bt, at); % see Eq. 2
10 endFor

Table 7  Pseudocode for solving passive adaptive management under model uncertainty using the weighted average approach

Implement passive adaptive management policy under model uncertainty

Input: S: set of states; sinit: initial state;
  A: set of actions; binit: initial belief over the k models;
  r: rewards; T: time horizon;
  P1-k: probability transitions for models 1 to k

Output: a1, s1, b1, …, aT, sT, bT: actions implemented, states observed, updated belief states at each time step 1 to T;

1 st = sinit; bt = binit;
2 For t = 1:T
3 Pa = weighted_average(P1-k, bt); % calculate weighted average transition probabilities
4 πa = solve_MDP(S, A, Pa, r); % solve the MDP with the new transition probabilities
5 at = πa(st); % select action at using policy πa
6 st+1 = implement_action(st, at); % implement and monitor
7 bt+1 = update_sufficient_statistics(st+1, st, bt, at); % see Eq. 2
8 endFor

Similarly, under model uncertainty, it is assumed that one of the candidate models representing future dynamics must be close to the true model. If the model set does not approximate the true scenario, then the best solutions may not be optimal. POMDPs can in principle accommodate a large number of models, but as the number of possible models increases, distinguishing between models becomes difficult. Models must be similar enough to provide adequate resolution but different enough to require alternative optimal management strategies; there is no need to distinguish between models if the management response is the same (Nicol et al. 2015). Selecting the minimum set of models to include in adaptive management problems is a modeling decision for which no guidance can be found in the literature. For both parameter and model uncertainty, providing tools to detect when these assumptions about the true model are violated during the adaptive management process would provide further confidence in optimization approaches.

We have assumed that the dynamics of the system, although uncertain, can be modeled as a Markov chain. This assumption is common and rarely discussed. Assuming the Markov property means that adaptive management problems can be modeled as MDPs or POMDPs and solved using stochastic dynamic programming techniques. However, many ecological systems would not fit this property. This is particularly true for systems that exhibit delays in response to management. For example, restoration problems require management actions such as planting trees for which the benefits would only be known in the future. More generally, management of species with complex life cycles, for which management only targets specific stages, ages, or sizes, is unlikely to fit a Markov chain. Unfortunately, there are no off-the-shelf adaptive management optimization methods for non-Markovian problems. In the control theory literature, not assuming the Markov property usually means that the method will rely on sub-optimal Monte Carlo simulation approaches, or the problem formulation would need to be simplified; e.g., linear transitions and a quadratic objective function can be solved using Kalman filters (Grewal 2011).

Conclusions

Despite progress in developing optimal adaptive management strategies, uptake of adaptive management remains low. Financial commitment to long-term monitoring and management is rarely achieved (Keith et al. 2011; Westgate et al. 2013). A major challenge for adaptive management theoreticians is to generate and communicate real-world applications to prove that these methods can work successfully in practice (Canessa et al. 2016). At least one impediment to uptake is that optimal adaptive management is designed for a specific set of pre-conditions (Fig. 1) but has been poorly defined in the past (Runge 2011). In fact, optimal adaptive management may not be the panacea for all management problems that common wisdom suggests. It is perhaps surprising that adaptive management has not had more of an impact even within fisheries management or, more generally, management of coastal ecosystems (Walters 1997). As outlined by Walters, some of the problems may be difficult to overcome, such as cross-scale issues. However, other challenges summarized by Walters are essentially the issues we have addressed, such as how to learn from limited empirical data and how to address structural uncertainty with modeling efforts. By specifying the appropriate decision context and providing detailed methods of the state of the art in optimal adaptive management, we have clarified which adaptive management approach is appropriate (Figs. 1 and 2) and how to implement it (Tables 2, 3, 4, 5, 6, 7, and 8), so that practitioners and modelers may finally bridge the gap between theory and implementation.

Acknowledgments  The authors would like to thank Gwen Iacona and Ayesha Tulloch for commenting on earlier versions of this manuscript. The idea of this review paper emerged at the "Natural resource management" workshop organized by the Mathematical Biosciences Institute, Columbus (2013) and an adaptive management workshop supported by a CSIRO Julius Career Award (IC). TMR was supported by an Australian Research Council Discovery Grant (DP110101499). CEH was supported by the National Environmental Research Program Environmental Decisions Hub.

References

Amato C, Oliehoek FA Scalable Planning and Learning for MultiagentPOMDPs. In: Twenty-Ninth AAAI Conference on ArtificialIntelligence, 2015

Åström KJ (1965) Optimal control of Markov decision processes withincomplete state estimation. J Math Anal Appl 10:174–205

Åström K, Wittenmark B (2008) Adaptive control, 2nd edn. DoverPublications, Mineola

Bellman RE (1957) Dynamic Programming. Princeton University Press,Princeton

Bernstein DS, Givan R, Immerman N, Zilberstein S (2002) The complex-ity of decentralized control of Markov decision processes. MathOper Res 27:819–840

Bertsekas DP (1995) Dynamic programming and optimal control vol 1,vol 2. Athena Scientific Belmont, MA

Bonet B (2002) An epsilon-optimal grid-based algorithm for partiallyobservable Markov decision processes. In: Proceedings of the 19thInternational Conference onMachine Learning (ICML-02), Sydney,Australia. Morgan Kaufman Publishers Inc., pp 51–58

Boutilier C, Dearden R (1994) Using abstractions for decision-theoreticplanningwith time constraints. In: Proceedings of the Twelfth AAAINational Conference on Artificial Intelligence. AAAI Press, pp1016–1022

Boutilier C (1999) Sequential optimality and coordination in multiagentsystems. In: IJCAI. pp 478–485

Brafman R (1997) A heuristic variable grid solution method forPOMDPs. In: Proceedings of the National Conference on ArtificialIntelligence (AAAI-97), Providence, Rhode Island. pp 727–733

Theor Ecol (2017) 10:1–20 17

Page 144: Optimal sequential decision-making under uncertainty Brice_Peron_Thesis.pdfHence, it is necessary to prioritise management when making sequential de-cisions. Inspired by the management

Canessa S et al (2015) When do we need more data? A primer on calcu-lating the value of information for applied ecologists. Methods EcolEvol 6:1219–1228. doi:10.1111/2041-210x.12423

Canessa S et al (2016) Adaptive management for improving species con-servation across the captive-wild spectrum. Biol Conserv 199:123–131. doi:10.1016/j.biocon.2016.04.026

Cassandra AR, Kaelbling LP (1995) Learning policies for partially ob-servable environments: Scaling up. In: Machine LearningProceedings 1995: Proceedings of the Twelfth InternationalConference on Machine Learning, Tahoe City, California. MorganKaufmann, p 362

Chades I, Bouteiller B Solving multiagent Markov decision processes: aforest management example. In: Proceedings of the InternationalCongress on Modelling and Simulation (MODSIM 2005), 2005.pp 1594–1600

Chades I, Scherrer B, Charpillet F (2002) A heuristic approach for solvingdecentralized-pomdp: Assessment on the pursuit problem. In:Proceedings of the 2002 ACM symposium on Applied computing.ACM, pp 57–62

Chadès I, McDonald-Madden E, McCarthy MA, Wintle B, Linkie M,PossinghamHP (2008)When to stopmanaging or surveying crypticthreatened species. Proc Natl Acad Sci U S A 105:13936

Chadès I, Martin TG, Nicol S, Burgman MA, Possingham HP, BuckleyYM (2011) General rules for managing and surveying networks ofpests, diseases, and endangered species. Proc Natl Acad Sci 108:8323–8328. doi:10.1073/pnas.1016846108

Chadès I, Carwardine J,Martin TG,Nicol S, Sabbadin R, Buffet O (2012)MOMDPs: a solution for modelling adaptive management prob-lems. In: The Twenty-Sixth AAAI Conference on ArtificialIntelligence (AAAI-12), Toronto, Canada. pp 267–273

Chadès I, Chapron G, Cros M-J, Garcia F, Sabbadin R (2014)MDPtoolbox: a multi-platform toolbox to solve stochastic dynamicprogramming problems. Ecography 37:916–920

Charles AT (1992) Uncertainty and information in fishery managementmodels: a Bayesian updating algorithm. Am J Math Manag Sci 12:191–225

Dibangoye JS, Amato C, Buffet O, Charpillet F (2016) Optimally solvingDec-POMDPs as continuous-stateMDPs. JArtif Intell Res 55:443–497

Dujardin Y, Dietterich T, Chadès I (2015) alpha-min: a compact POMDPsolver. In: International Joint Conference on Artificial Intelligence(IJCAI-2015), Buenos Aires, Argentina

Ehrgott M (2005) Multicriteria optimization, 2nd edn. Springer, BerlinFackler P (2013) MDPSOLVE Software for Dynamic OptimizationFackler PL, Haight RG (2014) Monitoring as a partially observable de-

cision problem. Resour Energy Econ 37:226–241Fackler P, Pacifici K (2014) Addressing structural and observational un-

certainty in resource management. J Environ Manag 133:27–36.doi:10.1016/j.jenvman.2013.11.004

Filatov NM, Unbehauen H (2000) Survey of adaptive dual controlmethods. IEE Proc - Control Theory Appl 147:118–128.doi:10.1049/ip-cta:20000107

Firn J, Rout T, PossinghamH, Buckley YM (2008)Managing beyond theinvader: manipulating disturbance of natives simplifies control ef-forts. J Appl Ecol 45:1143–1151. doi:10.1111/j.1365-2664.2008.01510.x

Fisher RA (1922) On the Mathematical Foundations of TheoreticalStatistics. Philos Trans R Soc Lond A: Math, Phys Eng Sci 222:309–368. doi:10.1098/rsta.1922.0009

Frederick SW, Peterman RM (1995) Choosing fisheries harvest policies:when does uncertainty matter? Can J Fish Aquat Sci 52:291–306.doi:10.1139/f95-030

Fulton EA, Smith ADM, Smith DC, van Putten IE (2011) Human behav-iour: the key source of uncertainty in fisheries management. FishFish 12:2–17. doi:10.1111/j.1467-2979.2010.00371.x

Givan R, Leach S, Dean T (2000) Bounded-parameter Markov decisionprocesses. Artif Intell 1:71–109

Gregory R, Ohlson D, Arvai J (2006) Deconstructing adaptive manage-ment: citeria for applications to environmental management. EcolAppl 16:2411–2425

Grewal MS (2011) Kalman filtering. SpringerHaight RG, Polasky S (2010) Optimal control of an invasive species with

imperfect information about the level of infestation. Resour EnergyEcon 32:519–533

Hauser CE, Possingham HP (2008) Experimental or precautionary?Adaptive management over a range of time horizons. J Appl Ecol45:72–81. doi:10.1111/j.1365-2664.2007.01395.x

Holling CS (1978) Adaptive environmental assessment and management.John Wiley & Sons, London

Houston A, Clark C,McNamara J, Mangel M (1988) Dynamic models inbehavioural and evolutionary ecology. Nature 332:29–34

Johnson FA, Clinton TM,KendallWL,Dubovsky JA, CaithamerDF,KelleyJR Jr, Byron KW (1997) Uncertainty and the Management of MallardHarvests. J Wildl Manag 61:202–216. doi:10.2307/3802429

Johnson FA, Kendall WL, Dubovsky JA (2002) Conditions and limita-tions on learning in the adaptive management of mallard harvests.Wildl Soc Bull 176–185

Kareiva P, Groves C, Marvier M (2014) REVIEW: The evolving linkagebetween conservation science and practice at The Nature Conservancy.J Appl Ecol 51:1137–1147. doi:10.1111/1365-2664.12259

Keith DA, Martin TG, McDonald-Madden E, Walters C (2011)Uncertainty and adaptive management for biodiversity conserva-tion. Biol Conserv 144:1175–1178

Kurniawati H, Hsu D, Lee W-S (2008) SARSOP: Efficient Point-BasedPOMDP Planning by Approximating Optimally Reachable BeliefSpaces. In: Proceedings of Robotics: Science and Systems IV,Zurich, Switzerland. pp 65–72

Littman ML (1994) Markov games as a framework for multi-agent rein-forcement learning. In: Proceedings of the eleventh internationalconference on machine learning. pp 157–163

Lovejoy W (1991) Computationally feasible bounds for partially ob-served Markov decisions processes. Oper Res 39:162–175

Lubow BC (1997) Adaptive Stochastic Dynamic Programming (ASDP):Supplement to SFP User’s Guide, 20th edn. Colorado CooperativeFish andWildlife ResearchUnit, Colorado State University, Fort collins

Ludwig D, Walters CJ (1981) Measurement Errors and Uncertainty inParameter Estimates for Stock and Recruitment. Can J Fish AquatSci 38:711–720. doi:10.1139/f81-094

MacKenzie DI (2009) Getting the biggest bang for our conservationbuck. Trends Ecol Evol (Personal Ed) 24:175–177

Madani O, Hanks S, Condon A (2003) On the undecidability of proba-bilistic planning and related stochastic optimization problems. ArtifIntell 147:5–34

Mangel M, Clark CW (1983) Uncertainty, search, and information infisheries. J Conseil 41:93–103. doi:10.1093/icesjms/41.1.93

Marescot L, Chapron G, Chadès I, Fackler P, Duchamp C, Marboutin E,Gimenez O (2013) Complex decisions made simple: a primer onstochastic dynamic programming. Methods Ecol Evol 4:872–884

Martin J, Runge MC, Nichols JD, Lubow BC, Kendall WL (2009)Structured decision making as a conceptual framework to identifythresholds for conservation andmanagement. Ecol Appl 19:1079–1090

Martin J et al (2011) Structured decision making as a proactive approachto dealing with sea level rise in Florida. Clim Chang 107:185–202

Martin TG, Camaclang AE, Possingham HP, Maguire LA, Chadès I(2016) Timing of Protection of Critical Habitat Matters. ConservLett:n/a-n/a. doi:10.1111/conl.12266

McCarthy MA (2007) Bayesian methods for ecology. CambridgeUniversity Press, Cambridge

McCarthy MA, Possingham HP (2007) Active adaptive management forconservation. Conserv Biol 21:956–963

McCarthy MA, Possingham HP, Gill AM (2001) Using stochastic dy-namic programming to determine optimal fire management forBanksia ornata. J Appl Ecol 38:585–592

18 Theor Ecol (2017) 10:1–20

Page 145: Optimal sequential decision-making under uncertainty Brice_Peron_Thesis.pdfHence, it is necessary to prioritise management when making sequential de-cisions. Inspired by the management

McCarthyMA,Armstrong DP, RungeMC (2012) AdaptiveManagementof Reintroduction. In: Reintroduction Biology. John Wiley & Sons,Ltd, pp 256–289. doi:10.1002/9781444355833.ch8

McDonald-Madden E et al (2010a) Active adaptive conservation ofthreatened species in the face of uncertainty. Ecol Appl 20:1476–1489. doi:10.1890/09-0647.1

McDonald-Madden E, Baxter PWJ, Fuller RA, Martin TG, Game ET,Montambault J, Possingham HP (2010b) Monitoring does not al-ways count. Trends Ecol Evol 25:547–550. doi:10.1016/j.tree.2010.07.002

McDonald-Madden E, Chadès I, McCarthy MA, Linkie M, PossinghamHP (2011) Allocating conservation resources between areas wherepersistence of a species is uncertain. Ecol Appl 21:844–858.doi:10.1890/09-2075.1

Mehta SV, Haight RG, Homans FR, Polasky S, Venette RC (2007) Optimaldetection and control strategies for invasive species management. EcolEcon 61:237–245. doi:10.1016/j.ecolecon.2006.10.024

Monahan GE (1982) Survey of Partially Observable Markov DecisionProcesses: Theory, Models, and Algorithms. MGMT SCI 28:1–16

Moore CT, Conroy MJ (2006) Optimal regeneration planning for old-growth forest: addressing scientific uncertainty in endangered spe-cies recovery through adaptive management. For Sci 52:155–172

Moore AL, McCarthy MA (2010) On Valuing Information in Adaptive-Management Models. Conserv Biol 24:984–993. doi:10.1111/j.1523-1739.2009.01443.x

Moore AL, Hauser CE, McCarthy MA (2008) How we value the future affects our desire to learn. Ecol Appl 18:1061–1069. doi:10.1890/07-0805.1

Moore CT et al (2011) An Adaptive Decision Framework for the Conservation of a Threatened Plant. J Fish Wildl Manag 2:247–261. doi:10.3996/012011-jfwm-007

Nichols JD, Johnson FA, Byron KW (1995) Managing North American Waterfowl in the Face of Uncertainty. Annu Rev Ecol Syst 26:177–199. doi:10.2307/2097204

Nichols JD et al (2011) Climate change, uncertainty, and natural resource management. J Wildl Manag 75:6–18

Nicol S, Chadès I (2012) Which States Matter? An Application of an Intelligent Discretization Method to Solve a Continuous POMDP in Conservation Biology. PLoS ONE 7:e28993. doi:10.1371/journal.pone.0028993

Nicol SC, Possingham HP (2010) Should metapopulation restoration strategies increase patch area or number of patches? Ecol Appl 20:566–581

Nicol S, Buffet O, Iwamura T, Chadès I (2013) Adaptive Management of Migratory Birds Under Sea Level Rise. In: Proceedings of the 23rd International Joint Conference on Artificial Intelligence, Beijing. pp 2955–2957

Nicol S, Griffith B, Austin J, Hunter CM (2014) Optimal water depth management on river-fed National Wildlife Refuges in a changing climate. Clim Chang 124:271–284

Nicol S, Fuller RA, Iwamura T, Chadès I (2015) Adapting environmental management to uncertain but inevitable change. Proc R Soc B 282. doi:10.1098/rspb.2014.2984

Nilim A, El Ghaoui L (2005) Robust control of Markov decision processes with uncertain transition matrices. Oper Res 53:780–798

Ong SCW, Png SW, Hsu D, Lee S (2010) Planning under Uncertainty for Robotic Tasks with Mixed Observability. Int J Robot Res 29:1053–1068

Osogami T (2015) Robust partially observable Markov decision process. In: Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France. pp 106–115

Papadimitriou CH, Tsitsiklis JN (1987) The complexity of Markov decision processes. Math Oper Res 12:441–450. doi:10.1287/moor.12.3.441

Parma AM (1998) What can adaptive management do for our fish, forests, food, and biodiversity? Integr Biol: Issues, News, Rev 1:16–26

Perny P, Weng P (2010) On finding compromise solutions in multiobjective Markov decision processes. In: European Conference on Artificial Intelligence (ECAI-2010), Lisbonne, Portugal. pp 969–970

Pichancourt JB, Chadès I, Firn J, van Klinken RD, Martin TG (2012) Simple rules to contain an invasive species with a complex life cycle and high dispersal capacity. J Appl Ecol 49:52–62

Pineau J, Gordon G, Thrun S (2003) Point-based value iteration: An anytime algorithm for POMDPs. In: International Joint Conference on Artificial Intelligence. Lawrence Erlbaum Associates LTD, pp 1025–1032

Poupart P (2005) Exploiting structure to efficiently solve large scale partially observable Markov decision processes. University of Toronto

Puterman ML (1994) Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc, New York

Regan HM, Colyvan M, Burgman MA (2002) A taxonomy and treatment of uncertainty for ecology and conservation biology. Ecol Appl 12:618–628. doi:10.1890/1051-0761(2002)012[0618:atatou]2.0.co;2

Regan TJ, Chadès I, Possingham HP (2011) Optimal strategies for managing invasive plants in partially observable systems. J Appl Ecol 48:76–85

Roijers DM, Whiteson S, Oliehoek FA (2015) Point-based planning for multi-objective POMDPs. In: Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI-2015), Buenos Aires, Argentina

Rout TM, Hauser CE, Possingham HP (2009) Optimal adaptive management for the translocation of a threatened species. Ecol Appl 19:515–526. doi:10.1890/07-1989.1

Runge MC (2011) An Introduction to Adaptive Management for Threatened and Endangered Species. J Fish Wildl Manag 2:220–233. doi:10.3996/082011-jfwm-045

Runge MC (2013) Active adaptive management for reintroduction of an animal population. J Wildl Manag 77:1135–1144. doi:10.1002/jwmg.571

Runge MC, Converse SJ, Lyons JE (2011) Which uncertainty? Using expert elicitation and expected value of information to design an adaptive program. Biol Conserv 144:1214–1223

Schlaifer R, Raiffa H (1961) Applied statistical decision theory. Clinton Press, Inc., Boston

Sethi G, Costello C, Fisher A, Hanemann M, Karp L (2005) Fishery management under multiple uncertainty. J Environ Econ Manag 50:300–318. doi:10.1016/j.jeem.2004.11.005

Sigaud O, Buffet O (2010) Markov decision processes in artificial intelligence: MDPs, beyond MDPs and applications. ISTE/Wiley, Hoboken

Silvert W (1978) The Price of Knowledge: Fisheries Management as a Research Tool. J Fish Res Board Can 35:208–212. doi:10.1139/f78-034

Smith ADM, Walters CJ (1981) Adaptive Management of Stock–Recruitment Systems. Can J Fish Aquat Sci 38:690–703. doi:10.1139/f81-092

Smith DR, McGowan CP, Daily JP, Nichols JD, Sweka JA, Lyons JE (2013) Evaluating a multispecies adaptive management framework: must uncertainty impede effective decision-making? J Appl Ecol 50:1431–1440. doi:10.1111/1365-2664.12145

Southwell DM, Hauser CE, McCarthy MA (2016) Learning about colonization when managing metapopulations under an adaptive management framework. Ecol Appl 26:279–294. doi:10.1890/14-2430

Spaan M, Vlassis N (2005) Perseus: Randomized Point-based Value Iteration for POMDPs. J Artif Intell Res 24:195–220

Springborn M, Sanchirico JN (2013) A density projection approach for non-trivial information dynamics: adaptive management of stochastic natural resources. J Environ Econ Manag 66:609–624

Venner S, Chadès I, Bel-Venner M-C, Pasquet A, Charpillet F, Leborgne R (2006) Dynamic optimization over infinite-time horizon: Web-building strategy in an orb-weaving spider as a case study. J Theor Biol 241:725–733

Walters CJ (1975) Optimal Harvest Strategies for Salmon in Relation to Environmental Variability and Uncertain Production Parameters. J Fish Res Board Can 32:1777–1784. doi:10.1139/f75-211

Walters CJ (1986) Adaptive management of renewable resources. McGraw Hill, New York

Walters C (1997) Challenges in adaptive management of riparian and coastal ecosystems. Conserv Ecol 1:1

Walters CJ, Hilborn R (1976) Adaptive Control of Fishing Systems. J Fish Res Board Can 33:145–159. doi:10.1139/f76-017

Walters CJ, Hilborn R (1978) Ecological optimization and adaptive management. Annu Rev Ecol Syst 9:157–188

Walters CJ, Ludwig D (1981) Effects of Measurement Errors on the Assessment of Stock–Recruitment Relationships. Can J Fish Aquat Sci 38:704–710. doi:10.1139/f81-093

Walters CJ, Ludwig D (1987) Adaptive management of harvest rates in the presence of a risk averse utility function. Nat Resour Model 1:321–337

Westgate MJ, Likens GE, Lindenmayer DB (2013) Adaptive management of biological systems: A review. Biol Conserv 158:128–139. doi:10.1016/j.biocon.2012.08.016

White B (2005) An economic analysis of ecological monitoring. Ecol Model 189:241–250

Wiesemann W, Kuhn D, Rustem B (2013) Robust Markov Decision Processes. Math Oper Res 38:153–183. doi:10.1287/moor.1120.0566

Williams BK (2009) Markov decision processes in natural resources management: Observability and uncertainty. Ecol Model 220:830–840. doi:10.1016/j.ecolmodel.2008.12.023

Williams BK (2011a) Passive and active adaptive management: Approaches and an example. J Environ Manag 92:1371–1378. doi:10.1016/j.jenvman.2010.10.039

Williams BK (2011b) Resolving structural uncertainty in natural resources management using POMDP approaches. Ecol Model 222:1092–1102. doi:10.1016/j.ecolmodel.2010.12.015

Williams BK, Johnson FA (2015) Value of information in natural resource management: technical developments and application to pink-footed geese. Ecol Evol 5:466–474. doi:10.1002/ece3.1363

Williams BK, Johnson FA, Wilkins K (1996) Uncertainty and the adaptive management of waterfowl harvests. J Wildl Manag 60:223–232. doi:10.2307/3802220

Williams B, Szaro R, Shapiro C (2009) Adaptive management: the U.S. Department of the Interior technical guide, 2 edn. U.S. Department of the Interior, Washington, D.C. doi:http://www.doi.gov/initiatives/AdaptiveManagement/TechGuide.pdf

Williams BK, Eaton MJ, Breininger DR (2011) Adaptive resource management and the value of information. Ecol Model 222:3429–3436. doi:10.1016/j.ecolmodel.2011.07.003

Wilson KA, McBride MF, Bode M, Possingham HP (2006) Prioritizing global conservation efforts. Nature 440:337–340

Wittenmark B (1995) Adaptive Dual Control Methods: An Overview. In: 5th IFAC symposium on Adaptive Systems in Control and Signal Processing

Zhou R, Hansen E (2001) An improved grid-based approximation algorithm for POMDPs. In: Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI-2001), Seattle, Washington, USA

Zhou E, Fu MC, Marcus S (2010) Solving continuous-state POMDPs via density projection. IEEE Trans Autom Control 55:1101–1116

Appendix B

Appendix to Chapter 3


Appendix S1:

A Markov decision problem (MDP) is a mathematical framework used to model a sequential decision problem. The system dynamics are partly random and partly under the control of a decision maker (Bellman 1957). When modelling an optimisation problem as an MDP, we assume that the Markov property holds, i.e. the process history has no impact on future dynamics (Puterman 1994).

An MDP is defined by five components: (i) a state space $S$, (ii) an action space $A$, (iii) a transition probability matrix $P$ for each action, (iv) immediate rewards $r$ for each state and action, and (v) a performance criterion (Puterman 1994). The set of possible actions can depend on the current state $s \in S$ and is denoted $A(s)$. Solving an MDP means finding a best policy $\pi$ (an action to take in each state, i.e. $\pi: S \to A$) that maximises (or minimises) the sum of expected future rewards. The performance criterion specifies the objective (maximisation or minimisation), the time horizon (finite or infinite), the initial state $s_0$ and the presence of a discount factor ($\gamma$). We focus on the maximisation of the discounted sum over an infinite time horizon:
\[
E_\pi \left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\middle|\, s_0 \right] \qquad (eqn\ 1)
\]

Stochastic dynamic programming (SDP) denotes a collection of solution methods for MDPs, such as policy iteration and value iteration. The peculiarity of policy iteration is that it puts the emphasis on the policy instead of the value: starting from any initial policy $\pi_0$, the current policy $\pi$ is evaluated (step 1) and improved (step 2) repeatedly until it is optimal ($\pi^*$).

1) The evaluation consists of calculating a value (or performance) $V_\pi(s)$ for each state $s \in S$. This value corresponds to the sum of future rewards one can expect, starting from the state $s$, when implementing the policy $\pi$. Formally,
\[
V_\pi(s) = E_\pi \left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\middle|\, s_0 = s \right]
\]
This can be calculated iteratively (backwards induction) or through a matrix inversion, since the value function satisfies the following equation:
\[
V_\pi(s) = r(s, \pi(s)) + \gamma \sum_{s' \in S} P(s' \mid s, \pi(s)) \, V_\pi(s') \qquad \forall s \in S
\]
or, in matrix form, $V_\pi = r_\pi + \gamma P_\pi V_\pi$, with $r_\pi$ and $P_\pi$ the reward vector and transition matrix associated with the policy $\pi$. Denoting by $I$ the identity matrix, this implies
\[
V_\pi = (I - \gamma P_\pi)^{-1} r_\pi \qquad (eqn\ 2)
\]
Note that $I - \gamma P_\pi$ is always invertible when $\gamma < 1$ because $P_\pi$ is a transition matrix. The fact that we deal with undiscounted sums ($\gamma = 1$) in this manuscript is not an issue because the absorbing state is reachable from any state and has reward zero.

2) Once the value $V_\pi$ of the policy $\pi$ has been evaluated, we can improve this policy by applying Bellman's equation in all states:
\[
\pi(s) = \underset{a \in A(s)}{\operatorname{argmax}} \left[ r(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) \, V_\pi(s') \right] \qquad \forall s \in S \qquad (eqn\ 3)
\]
When equations 2 and 3 are applied repeatedly, $\pi$ converges to the optimal policy $\pi^*$. The outputs of policy iteration (and other SDP techniques) are the optimal policy $\pi^*$ and the optimal value $V_{\pi^*}$. In this manuscript, the value is of high importance because it equals the expected time until the mainland becomes infested starting from a given state.
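To make the two steps concrete, here is a minimal sketch of policy iteration as described above, using the matrix-inversion evaluation of eqn 2 and the Bellman improvement of eqn 3. This is not the thesis code: the MDP inputs (P, r, gamma) are a small randomly generated instance used purely for illustration.

```python
# Minimal policy iteration sketch (illustrative inputs, not the case-study model).
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.95
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # P[a, s, s']
r = rng.random((n_states, n_actions))                              # r(s, a)

def evaluate(pi):
    """Policy evaluation: solve V_pi = (I - gamma * P_pi)^(-1) r_pi (eqn 2)."""
    P_pi = np.array([P[pi[s], s] for s in range(n_states)])
    r_pi = np.array([r[s, pi[s]] for s in range(n_states)])
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

def improve(V):
    """Policy improvement: apply Bellman's equation (eqn 3) in every state."""
    Q = r + gamma * np.einsum('ast,t->sa', P, V)   # Q[s, a]
    return Q.argmax(axis=1)

pi = np.zeros(n_states, dtype=int)        # arbitrary initial policy pi_0
while True:
    V = evaluate(pi)
    new_pi = improve(V)
    if np.array_equal(new_pi, pi):        # no further improvement: pi is optimal
        break
    pi = new_pi
print(pi, V)
```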

Appendix S2:

We show how the number of states of the exact model can be calculated. First, note that $t_i = 0$ indicates that the sub-action $a_i$ has just terminated, and thus does not restrict the choice of future sub-actions; therefore, $a_i$ need not be stored in the state, and can be replaced by 'null'. For each $1 \le i \le N$, $(a_i, t_i)$ belongs to
\[
A_i^+ = \{('null', 0)\} \cup \{(a_i, t_i) : a_i \in A_i,\ 1 \le t_i \le d(a_i) - 1\}.
\]
Then, the state space of the exact model is:
\[
S_{exact} = \{(s, a_1, t_1, a_2, t_2, \ldots, a_N, t_N) : s \in S,\ (a_i, t_i) \in A_i^+,\ 1 \le i \le N\} = S \times \prod_{i=1}^{N} A_i^+.
\]
For each $1 \le i \le N$:
\[
\begin{aligned}
|A_i^+| &= \left| \{('null', 0)\} \cup \{(a_i, t_i) : a_i \in A_i,\ 1 \le t_i \le d(a_i) - 1\} \right| \\
&= 1 + \left| \{(a_i, t_i) : a_i \in A_i,\ 1 \le t_i \le d(a_i) - 1\} \right| \\
&= 1 + \sum_{a_i \in A_i} \left| \{(a_i, t_i) : 1 \le t_i \le d(a_i) - 1\} \right| \\
&= 1 + \sum_{a_i \in A_i} (d(a_i) - 1)
\end{aligned}
\]
Finally,
\[
|S_{exact}| = |S| \prod_{i=1}^{N} |A_i^+| = |S| \prod_{i=1}^{N} \left( 1 + \sum_{a_i \in A_i} (d(a_i) - 1) \right)
\]
The number of states is exponential in the number of sub-actions $N$, and the base of the exponential grows with the durations of the actions.

In our case study, the set of possible actions on each island is $A = \{$no action, light management, strong management$\}$, of durations one, six and six timesteps respectively (i.e. six months, three years and three years). The number of states, which also accounts for the absorbing state $\sigma$ ('mainland Australia infested'), equals:
\[
|S_{exact}| = |\{\sigma\}| + |S| \prod_{i=1}^{N} \big( 1 + (d(\text{no action}) - 1) + (d(\text{light}) - 1) + (d(\text{strong}) - 1) \big) = 1 + 11^N |S|
\]
with $|S| = 2^N$ because each of the $N$ islands is either infested or susceptible.
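As a quick illustration of this formula (not part of the original appendix), the following snippet evaluates $|S_{exact}| = 1 + 11^N \cdot 2^N$ for the case-study durations (1, 6, 6), showing how fast the exact model grows with the number of islands $N$.

```python
# State count of the exact model for the case-study durations (illustrative check).
durations = [1, 6, 6]                        # no action, light, strong management
per_island = 1 + sum(d - 1 for d in durations)   # = 11 sub-action states per island
for N in range(1, 9):
    n_states = 1 + (per_island ** N) * (2 ** N)  # +1 for the absorbing state sigma
    print(N, n_states)
```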

Appendix S3:

This appendix includes three proofs. We first prove that reducing the action set reduces the performance in a maximisation problem (a). Based on this, we then prove that the lower bound model has a lower performance than the exact model (b). Finally, we show that the upper bound model has a higher performance than the exact model (c).

a) Let $V$ and $V'$ denote the performances of any MDPs $\langle S, A, P, r \rangle$ and $\langle S, A', P, r \rangle$, and let $\Pi$ and $\Pi'$ be their sets of policies, respectively. The following holds:
\[
[\forall s \in S,\ A'(s) \subseteq A(s)] \Rightarrow \Pi' \subseteq \Pi
\]
It follows, for every state $s$:
\[
V'(s) = \max_{\pi \in \Pi'} E_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, \pi(s_t)) \,\middle|\, s_0 = s\right] \le \max_{\pi \in \Pi} E_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, \pi(s_t)) \,\middle|\, s_0 = s\right] = V(s)
\]
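The following toy example (not from the thesis) illustrates inequality (a) numerically: a two-state MDP is solved by value iteration with the full action set and again with one action removed in one state; the restricted model never outperforms the full one. All numbers are made up for illustration.

```python
# Restricting the action set can only lower the optimal value (illustrative check).
import numpy as np

gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],     # P[a][s][s'] for action 0
              [[0.1, 0.9], [0.6, 0.4]]])    # ... and action 1 (made-up numbers)
r = np.array([[1.0, 0.0],                   # rewards r(s, a), made-up numbers
              [0.5, 2.0]])

def value_iteration(allowed, n_iter=2000):
    """allowed[s] lists the actions permitted in state s."""
    V = np.zeros(2)
    for _ in range(n_iter):
        V = np.array([max(r[s, a] + gamma * P[a, s] @ V for a in allowed[s])
                      for s in range(2)])
    return V

V_full = value_iteration({0: [0, 1], 1: [0, 1]})
V_restricted = value_iteration({0: [0], 1: [0, 1]})   # A'(0) is a subset of A(0)
print(V_full, V_restricted)   # componentwise, V_restricted <= V_full
```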

b) We prove that the lower bound model has a lower performance than the exact model $\langle S_{exact}, A, P_{exact}, r \rangle$. We consider the definition of the lower bound in the first step (see Materials and Methods), i.e. after addition of the synchronisation constraint. In this definition, the lower bound model has the same state space, transition function and rewards as the exact model.

Let $A'$ denote the new action space in the lower bound model. Let $s_{exact} \in S_{exact}$, and let $S_{progress}$ denote the set of states where at least one action is in progress, i.e.
\[
S_{progress} = \{(s, a_1, t_1, a_2, t_2, \ldots, a_N, t_N) \in S_{exact} : t_i > 0 \text{ for some } 1 \le i \le N\}.
\]
If $s_{exact} \in S_{progress}$, the set of possible actions $A'(s_{exact})$ only contains the action applied at the previous timestep (in order to extend that action). The set of allowed actions $A'(s_{exact})$ is then a subset of the action space in the exact model:
\[
A'(s_{exact}) \subseteq A(s_{exact}), \quad \forall s_{exact} \in S_{progress}
\]
For the other states, the set of possible actions is left unchanged:
\[
A'(s_{exact}) = A(s_{exact}), \quad \forall s_{exact} \in S_{exact} \setminus S_{progress}
\]
So, $A'(s_{exact}) \subseteq A(s_{exact})$ for all $s_{exact} \in S_{exact}$. By (a), this implies that the lower bound model has a lower performance than the exact model.

The second step in designing the lower bound model (see Materials and Methods) is a reformulation that relies on the observation that states where only one action is possible can be removed without increasing or decreasing the performance. However, the transition function and rewards have to be modified to account for the states removed.

c) We prove that the upper bound model has a higher performance than the exact model $\langle S_{exact}, A, P_{exact}, r \rangle$. We define the set $S_{stop} \subseteq S_{exact}$, made of states separated by $GCD$ timesteps. Formally,
\[
S_{stop} = \{s_{exact} \in S_{exact} : \forall\, 1 \le i \le N,\ GCD \mid t_i\}
\]
All actions are stopped after $GCD$ timesteps. Equivalently, we can modify the action space by setting
\[
A''(s_{exact}) = \prod_{i=1}^{N} A_i, \quad \forall s_{exact} \in S_{stop}
\]
i.e. all actions are possible in the states of $S_{stop}$. We have:
\[
A(s_{exact}) \subseteq A''(s_{exact}), \quad \forall s_{exact} \in S_{stop}
\]
For the other states, the set of possible actions is left unchanged:
\[
A(s_{exact}) = A''(s_{exact}), \quad \forall s_{exact} \in S_{exact} \setminus S_{stop}
\]
So, $A(s_{exact}) \subseteq A''(s_{exact})$ for all $s_{exact} \in S_{exact}$. By (a), this implies that the upper bound model has a higher performance than the exact model. As in the lower bound model, the reformulation in the second step (see Materials and Methods) does not alter the performance of the upper bound.

Appendix S4:

The input parameters required for the program are:

- The number of islands $N$;
- Two $1 \times |A_1|$ arrays describing the durations and costs of each sub-action;
- The budget received per timestep;
- The effectiveness of each action on each island;
- The colonisation probability between each pair of islands (including Papua New Guinea and mainland Australia);
- The discount factor $\gamma$. The time horizon is infinite.
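For illustration only, one possible way to group these inputs is sketched below; the class and field names are hypothetical and simply mirror the list above.

```python
# Hypothetical container for the program inputs listed in Appendix S4 (a sketch,
# not the thesis code).
from dataclasses import dataclass
from typing import List

@dataclass
class ProblemInputs:
    n_islands: int                        # N
    durations: List[int]                  # 1 x |A1| array, in timesteps
    costs: List[float]                    # 1 x |A1| array
    budget_per_timestep: float
    effectiveness: List[List[float]]      # effectiveness[island][action]
    colonisation_prob: List[List[float]]  # pairwise, incl. PNG and mainland
    discount_factor: float                # gamma; the time horizon is infinite
```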

Appendix S5:

Effectiveness of different management actions on all Torres Strait Islands. The effectiveness of an action is defined as the probability of eradicating the tiger mosquito over one timestep and depends on key characteristics such as vegetation type, number of human dwellings (human population) and terrain type (as a measure of accessibility; Appendix S6). We collected data estimated by experts on management effectiveness for each management action and each island at an expert elicitation workshop held in 2013. Islands are ranked from the highest to lowest probability of infesting the mainland in one timestep (second last column), assuming low transmission probabilities. In our computational experiments, islands are added following this ranking, which is also the rule of thumb 'highest transmission first'. The last column shows the prioritisation ranking that emerges from the upper bound policy for 11 islands.

| Management action | No action | Light management | Strong management |
| Cost | 0 | x | 2x |
| Duration | six months (one timestep) | three years (six timesteps) | three years (six timesteps) |

Budget (every timestep): 3x

| Island | No action | Light management | Strong management | Probability of transmission to mainland | Prioritisation ranking |
| Thursday | 0.020379 | 0.112205 | 0.173365 | 0.019841 | 1 |
| Horn | 0.033678 | 0.114894 | 0.169567 | 0.0053089 | 2 |
| Mulgrave | 0.033678 | 0.114894 | 0.169567 | 0.0030053 | 3 |
| Banks | 0.020379 | 0.036144 | 0.073427 | 0.0020876 | 8 |
| Hammond | 0.033678 | 0.061252 | 0.10015 | 0.0015947 | 6 |
| Sue | 0.033678 | 0.138956 | 0.203725 | 0.00098455 | 4 |
| Prince of Wales | 0.033678 | 0.05848 | 0.096742 | 0.00093403 | 10 |
| Yam | 0.020379 | 0.112205 | 0.173365 | 0.00073332 | 7 |
| Jervis | 0.020379 | 0.112205 | 0.173365 | 0.00063267 | 5 |
| Coconut | 0.033678 | 0.138956 | 0.203725 | 0.00038382 | 9 |
| Saibai | 0.020379 | 0.036144 | 0.073427 | 0.00038163 | 11 |
| Murray | 0.026951 | 0.046704 | 0.08442 | 0.00034181 | – |
| Yorke | 0.033678 | 0.138956 | 0.203725 | 0.00034083 | – |
| Talbot | 0.020379 | 0.036144 | 0.073427 | 0.00026109 | – |
| Darnley | 0.028933 | 0.050074 | 0.087939 | 0.00023087 | – |
| Mt Cornwallis | 0.033678 | 0.061252 | 0.10015 | 0.00017667 | – |
| Stephens | 0.028933 | 0.050074 | 0.087939 | 0.00005999 | – |

Appendix S6:

Belief Bayesian network providing the effectiveness of actions depending on four island characteristics. The 11 participants comprised experts in invasive species, vector biology and ecology, mosquito control, public health management and biosecurity. Experts provided anonymous estimates of the actions' effectiveness, i.e. the probability of eradicating the mosquito over one timestep (six months), for each island and each management action: no action (one timestep), light management (six timesteps) and strong management (six timesteps). The amount of unmanaged area, accessibility/terrain, vegetation refuge and number of dwellings affect the infestation probability. Experts first estimated the operational feasibility and the suitability to mosquitoes for different combinations of these four characteristics (top arrows), and second estimated the probability of mosquito eradication for different combinations of operational feasibility, mosquito suitability and management action (bottom arrows). The effectiveness of any action on any island can then be obtained, provided the four island characteristics are known. Note that the estimates of the actions' effectiveness were presented individually and then discussed as a group; subsequently, experts could revise their estimates (Martin et al. 2012) before an average was calculated.

[Figure: Bayesian network with parent nodes 'Vegetation refuge' (Dense/Sparse), 'Terrain - control access' (Difficult/Easy), 'Number of dwellings' (Low/High) and 'Amount of unmanaged area' (Small/Large); intermediate nodes 'Island mosquito suitability' (Low/High) and 'Operational feasibility' (High/Low); the 'Management action' node (do nothing/light management/strong management); and the outcome node 'Mosquito eradication' (Yes/No).]

Appendix S7:

Parameters used in the Cauchy formula for the low and high transmissions. The Cauchy formula is:
\[
p_{ij} = \frac{C \times pop_i \times pop_j}{1 + \left( \dfrac{d_{ij}}{\beta} \right)^2} \qquad (\text{Cauchy} - eqn\ 13)
\]

| Configuration | Low transmissions | High transmissions |
| Constant | C = 5 × 10^-8 | C = 10^-7 |
| Shape parameter (distances are in km) | β = 50 | β = 50 |
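As an illustration (not the thesis code), the snippet below evaluates the Cauchy formula (eqn 13) with the 'low transmissions' parameters, using the Thursday and Horn Island populations and distance from Appendix S8.

```python
# Cauchy colonisation kernel (eqn 13), 'low transmissions' defaults from the table above.
def colonisation_prob(pop_i, pop_j, d_ij_km, C=5e-8, beta=50.0):
    """p_ij = C * pop_i * pop_j / (1 + (d_ij / beta)**2)."""
    return C * pop_i * pop_j / (1.0 + (d_ij_km / beta) ** 2)

# Example: Thursday Island (pop 2548) to Horn Island (pop 586), 2 km apart (Appendix S8).
print(colonisation_prob(2548, 586, 2))
```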

Appendix S8:

Human population size (bottom row) and distances (shortest coast-to-coast distance in kilometres) between islands:

Distances (km); columns, in order: AUSTRALIAN MAINLAND, THURSDAY, HORN, MULGRAVE, BANKS, HAMMOND, SUE, PRINCE OF WALES, YAM

AUSTRALIAN MAINLAND 0 27 16 66 53 29 61 16 90

THURSDAY 27 0 2 43 35 1 76 1 94

HORN 16 2 0 45 34 4 69 2 89

MULGRAVE 66 43 45 0 2 38 69 46 66

BANKS 53 35 34 2 0 30 52 38 55

HAMMOND 29 1 4 38 30 0 74 2 91

SUE 61 76 69 69 52 74 0 79 33

PRINCE OF WALES 16 1 2 46 38 2 79 0 98

YAM 90 94 89 66 55 91 33 98 0

JERVIS 86 66 67 10 17 61 73 69 62

COCONUT 91 108 101 95 80 105 30 111 34

SAIBAI 140 134 132 85 85 130 87 138 52

MURRAY 181 211 202 203 188 209 136 212 138

YORKE 140 157 150 138 124 154 80 160 69

TALBOT 157 141 142 85 91 136 118 144 86

DARNLEY 179 199 192 180 168 197 122 203 112

MT CORNWALLIS 138 130 128 79 80 125 91 133 57

STEPHENS 171 186 179 161 150 183 110 190 94

PAPUA NEW GUINEA 151 144 142 93 94 139 100 147 65

POPULATION (THURSDAY to YAM): 2548 586 818 439 212 247 103 313

Distances (km), continued; columns, in order: JERVIS, COCONUT, SAIBAI, MURRAY, YORKE, TALBOT, DARNLEY, MT CORNWALLIS, STEPHENS, PAPUA NEW GUINEA

AUSTRALIAN MAINLAND 86 91 140 181 140 157 179 138 171 151

THURSDAY 66 108 134 211 157 141 199 130 186 144

HORN 67 101 132 202 150 142 192 128 179 142

MULGRAVE 10 95 85 203 138 85 180 79 161 93

BANKS 17 80 85 188 124 91 168 80 150 94

HAMMOND 61 105 130 209 154 136 197 125 183 139

SUE 73 30 87 136 80 118 122 91 110 100

PRINCE OF WALES 69 111 138 212 160 144 203 133 190 147

YAM 62 34 52 138 69 86 112 57 94 65

JERVIS 0 95 74 202 133 72 175 67 155 81

COCONUT 95 0 78 106 48 120 90 89 79 92

SAIBAI 74 78 0 148 77 36 107 5 83 5

MURRAY 202 106 148 0 70 205 46 172 70 125

YORKE 133 48 77 70 0 133 41 100 30 80

TALBOT 72 120 36 205 133 0 164 30 140 7

DARNLEY 175 90 107 46 41 164 0 134 24 76

MT CORNWALLIS 67 89 5 172 100 30 134 0 109 11

STEPHENS 155 79 83 70 30 140 24 109 0 58

PAPUA NEW GUINEA

81 92 5 125 80 7 76 11 58 0

POPULATION (JERVIS to STEPHENS): 251 166 337 484 300 284 320 153 76

Appendix S9:

Mean time until infestation of mainland Australia for the three models, and six rules of thumb, when transmissions are low. The best rule of thumb is 'highest transmission first', followed by 'highest population first', 'closest first' and 'easiest first'.

Appendix S10:

Prioritisation ranking on four islands for low and high transmission probabilities. The rankings that emerge from the exact, lower bound and upper bound models are the same. At each timestep, only the two infested islands with the highest ranking are managed, due to the limited budget.

Appendix S11:

Relative errors (%) of model performances compared to the upper bound with different sub-action durations. Recall that in our case study, durations are 1, 6 and 6, for a relative error of up to 16%. With durations 3, 6 and 6, the LCM and GCD are close: GCD(3,6,6) = 3 and LCM(3,6,6) = 6, which leads to relative errors of less than 6%. With durations 2, 5 and 7, we have GCD(2,5,7) = 1 and LCM(2,5,7) = 70. The maximum relative error between the bounds increases but remains under 20%. For durations (3,6,6) and (2,5,7), the bound models are tractable up to 12 islands (compared to 13 islands for durations (1,6,6)) because the transition matrices for durations (1,6,6) are sparser than those for durations (3,6,6) and (2,5,7). For durations (3,6,6) and (2,5,7), the exact model is intractable above six islands (compared to eight islands for durations (1,6,6)) because long durations mean a high number of states in the exact model.

Relative error (%) by number of islands included:

| Durations (no action, light, strong) | Transmission probabilities | Model | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| 3, 6, 6 | Low | Exact | 1.6 | 4.9 | 5.4 | 5.5 | 5.9 | intractable | | | | | |
| 3, 6, 6 | Low | LB | 1.6 | 4.9 | 5.4 | 5.5 | 5.9 | 5.8 | 6 | 6 | 6 | 5.5 | 5.3 |
| 3, 6, 6 | High | Exact | 1.2 | 2.4 | 1.6 | 1.3 | 1.2 | intractable | | | | | |
| 3, 6, 6 | High | LB | 1.2 | 2.4 | 1.6 | 1.3 | 1.2 | 1.1 | 1 | 0.9 | 0.8 | 0.8 | 0.7 |
| 2, 5, 7 | Low | Exact | 4.2 | 10.6 | 12.6 | 13.2 | 13.8 | intractable | | | | | |
| 2, 5, 7 | Low | LB | 4.6 | 18.2 | 18.6 | 18.6 | 19.4 | 19.2 | 19.4 | 19.2 | 19.2 | 18 | 17.6 |
| 2, 5, 7 | High | Exact | 3.2 | 6.6 | 5.5 | 5 | 5.1 | intractable | | | | | |
| 2, 5, 7 | High | LB | 3.2 | 8.6 | 6.6 | 5.8 | 5.9 | 5.5 | 5.1 | 4.7 | 4.6 | 4.2 | 4.1 |

Appendix S12:

Computational times of the exact, lower bound and upper bound models for low transmission probabilities. The computational times for the exact model are several orders of magnitude larger than those of the bound models. The computational times were obtained on a dual 3.46 GHz Intel Xeon X5690, which could not solve the largest instances (eight islands for the exact model and 13 islands for the bound models).

Appendix C

Proof of Theorems in Chapter 5

To prove Theorems 1 and 2, we first need to establish Lemmas 1 and 2. First, recall that for stationary MOMDPs (Bayes' theorem - Eq. 5 of the main manuscript):
\[
b_{t+1}(y') \, p(o', x' \mid x, b_t, a) = Z(a, x', y', o') \, T_x(x, y', a, x') \, b_t(y')
\tag{C.1}
\]
for all belief states $b_t$, $x, x' \in X$, $y' \in Y$, $o' \in O$, and successor $b_{t+1}$.
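For readers who prefer code, the following is a minimal sketch (not the thesis implementation) of the belief update implied by Eq. C.1: given arrays Z[a, x', y, o'] and Tx[x, y, a, x'] and a belief b over the hidden models in Y, it returns the successor belief and the normalising constant p(o', x' | x, b, a). The array names and shapes are assumptions made for illustration.

```python
# MOMDP belief update from Bayes' theorem (Eq. C.1); array layout is hypothetical.
import numpy as np

def update_belief(b, x, a, x2, o, Z, Tx):
    """Return (b_next, p(o, x2 | x, b, a)); x2 plays the role of x'."""
    unnorm = Z[a, x2, :, o] * Tx[x, :, a, x2] * b   # one term per hidden model y'
    norm = unnorm.sum()                             # = p(o, x2 | x, b, a)
    if norm == 0.0:
        raise ValueError("observation (x2, o) has probability zero under belief b")
    return unnorm / norm, norm
```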

Lemma 1. Any belief state $b_t$ is the average of its successors $b_{t+1}$, weighted by the probability of reaching them:
\[
b_t(y) = \sum_{b \in B} p(b_{t+1} = b \mid x, b_t, a) \, b(y) \qquad \text{for all } y \in Y,\ x \in X,\ b_t \in B,\ a \in A
\tag{C.2}
\]
Remark: since the belief space $B$ is a continuous set, the sum is infinite, but only a finite number of terms are non-zero (namely, the successors).

Proof. For all $y \in Y$,
\[
\begin{aligned}
\sum_{b \in B} p(b_{t+1} = b \mid x, b_t, a) \, b(y)
&= \sum_{b \in B,\, x' \in X,\, o' \in O} p(b_{t+1} = b, o', x' \mid x, b_t, a) \, b(y) \\
&= \sum_{x' \in X,\, o' \in O} p(x', o' \mid x, b_t, a) \sum_{b \in B} p(b_{t+1} = b \mid o', x', x, b_t, a) \, b(y)
\end{aligned}
\tag{C.3}
\]
Notice that, given $o', x', x, b_t, a$, the successor $b_{t+1}$ is fully determined, i.e. $p(b_{t+1} = b \mid o', x', x, b_t, a) = 1$ if $b = b_{t+1}$, and $0$ otherwise. The right-hand side can therefore be written
\[
\begin{aligned}
&= \sum_{x' \in X,\, o' \in O} p(x', o' \mid x, b_t, a) \, b_{t+1}(y) \\
&= \sum_{x' \in X,\, o' \in O} Z(a, x', y, o') \, T_x(x, y, a, x') \, b_t(y) \qquad \text{(Eq. C.1)} \\
&= b_t(y) \sum_{x' \in X,\, o' \in O} p(x', o' \mid x, y, a) = b_t(y)
\end{aligned}
\]
which proves Lemma 1. We can apply this equality recursively $i$ times to obtain Lemma 2:

Lemma 2.
\[
b_t(y) = \sum_{b \in B} p(b_{t+i} = b \mid x, b_t, a) \, b(y) \qquad \text{for all } y \in Y,\ x \in X,\ b_t \in B,\ a \in A
\tag{C.4}
\]
where $p(b_{t+i} = b)$ is the probability that the successor $b_{t+i}$ of $b_t$ at time step $t + i$ equals the belief state $b$.

Proof. This generalises Lemma 1 to several time steps. Each successor $b_{t+1}$ can be replaced by the weighted average of its successors $b_{t+2}$, and so on. The equality in Lemma 2 then follows from the linearity of the weighted average.

Now that we have proven these two lemmas, we can prove the theorems. First, let us prove Theorem 1, recalling Assumption 1.

C.1 Proof of Theorem 1

Assumption 1: There exists $\bar{y} \in Y$ such that, for each $(x, a) \in X \times A$, the optimal MDP policy $\pi^*_{\bar{y}}$ satisfies either:

• $V^*_{\bar{y}}(x, e_{\bar{y}}) > V_{\pi_{x,a}}(x, e_{\bar{y}})$ (i.e. $\pi^*_{\bar{y}}(x)$ is strictly better than $a$ in state $x$);

• Or, for all $y \in Y$, $T_x(x, y, \pi^*_{\bar{y}}(x), \cdot) = T_x(x, y, a, \cdot)$ and $r(x, y, \pi^*_{\bar{y}}(x)) = r(x, y, a)$ (i.e. $\pi^*_{\bar{y}}(x)$ and $a$ have identical outcomes in state $x$).

Theorem 1. We assume that Assumption 1 is satisfied for some $\bar{y} \in Y$. For all $x \in X$, the directional derivative of the optimal value function in $(x, e_{\bar{y}})$ with respect to any $y \ne \bar{y}$ equals that of the function Init (obtained with Algorithm 1). Let $d = e_y - e_{\bar{y}}$. For all $x \in X$ and $y \in Y$, we have:
\[
\nabla_d V^*(x, e_{\bar{y}}) = \nabla_d \mathit{Init}(x, e_{\bar{y}}) = \alpha_{x,\bar{y}} \cdot e_y - \alpha_{x,\bar{y}} \cdot e_{\bar{y}}
\tag{C.5}
\]
Proof. We follow the same five-step structure as the sketch of proof in the main manuscript: we first show that the MDP and MOMDP policies are identical on a neighborhood of the corner $(x, e_{\bar{y}})$ (a). Then we show that the belief in transition matrix $y$ does not grow by more than a constant factor from one belief state $b_t$ to its successors (b). Together, (a) and (b) imply that the MDP and MOMDP policies are identical for as many time steps as we want, provided $b_t$ is close enough to the corner (c). This implies that the distribution of rewards and belief states under the MDP and MOMDP policies will be identical for as many time steps as we want (d). Finally, the discounted value of the future events that are not identical can be shown to have derivative zero (e).

(a) Without loss of generality, we can write $e_{\bar{y}} = (1, 0, \ldots, 0)$ and $e_y = (0, 1, 0, \ldots, 0)$. Let $b^\varepsilon_t = (1 - \varepsilon)e_{\bar{y}} + \varepsilon e_y = (1 - \varepsilon, \varepsilon, 0, \ldots, 0)$, with $\varepsilon$ the probability of model $y$.

Let $x \in X$. Assumption 1 states that the optimal MDP action in $x$ (referred to as $a$) is strictly better than the other actions. The only possible exceptions have their transitions and rewards identical to the optimal action: since such actions are interchangeable, we merge them into one optimal action hereafter. Note that they may be merged differently for different values of $x$.

Since the MOMDP is equivalent to an MDP in $\bar{y}$, the value of any policy coincides in the MDP and in the MOMDP. So, both the optimal MOMDP and MDP actions in $(x, e_{\bar{y}})$ are $a$, and $a$ yields a value strictly greater than all other actions. By continuity of the value functions Init and $V^*$ across the belief space $B$, there exists a neighborhood of the corner $(x, e_{\bar{y}})$ in which $a$ yields a value greater than all other actions, for both Init and $V^*$. Note that the 'neighborhoods' in this theorem are one-dimensional, along the edges between $e_{\bar{y}}$ and $e_y$.

We denote by $\bar{\varepsilon} > 0$ the size of the smallest of these neighborhoods over all $x \in X$. We define, for any $0 < \delta < 1$, $\rho_\delta = \bigcup_{x \in X} \{(x, b^\varepsilon_t) : \varepsilon \le \delta\}$, which is made of neighborhoods of the corners $X \times \{e_{\bar{y}}\}$. By construction, the policies $\pi^*_{\bar{y}}$ and $\pi^*$ select the same action anywhere in $\rho_{\bar{\varepsilon}}$.

(b) Let $y \in Y$. We aim at controlling the maximum distance between a belief state and its successors through the following ratio: let
\[
J = \max\Big(1,\ \max\big\{ Z T / (\bar{Z}\bar{T}) \;:\; x, x' \in X,\ a \in A,\ o' \in O,\ \bar{Z}\bar{T} \ne 0 \big\}\Big),
\tag{C.6}
\]

with the simplifying notation $Z = Z(a, x', y, o')$, $T = T_x(x, y, a, x')$, $\bar{Z} = Z(a, x', \bar{y}, o')$ and $\bar{T} = T_x(x, \bar{y}, a, x')$ (unambiguous from context). $J$ is well defined because $X$, $O$ and $A$ are finite sets. After implementing action $a$, the successor $(x', b^\beta_{t+1})$ of the state $(x, b^\varepsilon_t)$ satisfies (Bayes' theorem - Eq. 5 of the main manuscript):
\[
b^\beta_{t+1}(y) = \frac{Z T \, b^\varepsilon_t(y)}{p(o', x' \mid x, b^\varepsilon_t, a)}
\tag{C.7}
\]
which can be written
\[
\beta = \frac{\varepsilon Z T}{(1 - \varepsilon)\bar{Z}\bar{T} + \varepsilon Z T}
\tag{C.8}
\]
The case $\bar{Z}\bar{T} = 0$ is not an issue since the successor is then the corner $e_y = (0, 1, 0, \ldots, 0)$, in which the functions $V^*$ and Init are equal. In the case $\bar{Z}\bar{T} > 0$, we can write:
\[
\begin{aligned}
\beta &= \frac{\varepsilon Z T}{(1 - \varepsilon)\bar{Z}\bar{T} + \varepsilon Z T}
= \frac{\varepsilon \frac{ZT}{\bar{Z}\bar{T}}}{1 - \varepsilon + \varepsilon \frac{ZT}{\bar{Z}\bar{T}}} \\
&\le \frac{\varepsilon J}{1 - \varepsilon + \varepsilon J} \qquad \Big(\text{since } \tfrac{ZT}{\bar{Z}\bar{T}} \le J\Big) \\
&= \frac{\varepsilon J}{1 + \varepsilon (J - 1)} \le \varepsilon J \qquad (J \ge 1)
\end{aligned}
\tag{C.9}
\]
Therefore, if $\bar{Z}\bar{T} > 0$, we have $(x, b^\varepsilon_t) \in \rho_{\bar{\varepsilon}/J} \iff \varepsilon \le \bar{\varepsilon}/J \implies \beta \le \bar{\varepsilon} \iff (x', b^\beta_{t+1}) \in \rho_{\bar{\varepsilon}}$.

(c) Further, if $\varepsilon \le \bar{\varepsilon}/J^i$ for some $i > 0$, any successor of $(x, b^\varepsilon_t)$ after $i$ time steps will still be in $\rho_{\bar{\varepsilon}}$. This means that the optimal MOMDP and MDP policies $\pi^*$ and $\pi^*_{\bar{y}}$ select the same action between time steps $t$ and $t + i$.

(d) Consequently, if $\varepsilon \le \bar{\varepsilon}/J^i$ for some $i > 0$, the probability distributions over the successors of $(x, b^\varepsilon_t)$ between time steps $t$ and $t + i + 1$ are the same whether we apply $\pi^*_{\bar{y}}$ or $\pi^*$. In turn, the rewards are the same between time steps $t$ and $t + i$, whether we apply $\pi^*_{\bar{y}}$ or $\pi^*$.

(e) We introduce the notation $f = V^* - \mathit{Init}$. We first prove an important equality on $f$ (i) and that $f(x, b^\varepsilon_t) \le \varepsilon UB/\bar{\varepsilon}$ for all $0 \le \varepsilon \le 1$ (ii), before completing the proof of the theorem (iii).

(i) If $\varepsilon \le \bar{\varepsilon}/J^i$, we can write:
\[
\begin{aligned}
f(x, b^\varepsilon_t) &= V^*(x, b^\varepsilon_t) - \mathit{Init}(x, b^\varepsilon_t) \\
&= E\!\left[\sum_{t'=t}^{\infty} \gamma^{t'-t} R\big((x_{t'}, b_{t'}), \pi^*(x_{t'}, b_{t'})\big)\right] - E\!\left[\sum_{t'=t}^{\infty} \gamma^{t'-t} R\big((x_{t'}, b_{t'}), \pi^*_{\bar{y}}(x_{t'}, b_{t'})\big)\right] \\
&= E\!\left[\sum_{t'=t+i+1}^{\infty} \gamma^{t'-t} R\big((x_{t'}, b_{t'}), \pi^*(x_{t'}, b_{t'})\big)\right] - E\!\left[\sum_{t'=t+i+1}^{\infty} \gamma^{t'-t} R\big((x_{t'}, b_{t'}), \pi^*_{\bar{y}}(x_{t'}, b_{t'})\big)\right]
\end{aligned}
\tag{C.10}
\]
because the rewards obtained by $\pi^*_{\bar{y}}$ and $\pi^*$ are equal between time steps $t$ and $t + i$ (d). We denote by $p_\pi(x_{t+i+1}, b_{t+i+1})$ the probability of reaching $(x_{t+i+1}, b_{t+i+1})$ at time step $t + i + 1$ after visiting $(x, b^\varepsilon_t)$ at time step $t$, when following the policy $\pi$. By rearranging the terms in the above sums, we notice that they equal the average of the value functions in the states $(x_{t+i+1}, b_{t+i+1})$ weighted by the probability of reaching them:
\[
\begin{aligned}
&= \gamma^{i+1} \sum_{x_{t+i+1} \in X,\, b_{t+i+1} \in B} p_{\pi^*}(x_{t+i+1}, b_{t+i+1}) \, V^*(x_{t+i+1}, b_{t+i+1}) \\
&\quad - \gamma^{i+1} \sum_{x_{t+i+1} \in X,\, b_{t+i+1} \in B} p_{\pi^*_{\bar{y}}}(x_{t+i+1}, b_{t+i+1}) \, \mathit{Init}(x_{t+i+1}, b_{t+i+1}) \\
&= \gamma^{i+1} \sum_{x_{t+i+1} \in X,\, b_{t+i+1} \in B} p(x_{t+i+1}, b_{t+i+1}) \, f(x_{t+i+1}, b_{t+i+1})
\end{aligned}
\tag{C.11}
\]
because the probabilities $p_{\pi^*}(x_{t+i+1}, b_{t+i+1})$ and $p_{\pi^*_{\bar{y}}}(x_{t+i+1}, b_{t+i+1})$ are equal (d). We omit the index $\pi^*$ in the last line for the same reason.

We also know that Init is a lower bound that is optimal in the corners, i.e. $f \ge 0$ and, for all $x \in X$, $f(x, b^0_t) = 0$ and $f(x, b^1_t) = 0$. Because $f$ is continuous and $X$ is finite, $f$ is also bounded: $0 \le f(x, b^\varepsilon_t) \le UB$ for all $x \in X$ and $0 \le \varepsilon \le 1$.

(ii) We now show by contradiction that $f(x, b^\varepsilon_t) \le \varepsilon UB/\bar{\varepsilon}$ for all $0 \le \varepsilon \le 1$ (recall that $b^\varepsilon_t(y) = \varepsilon$). Let us assume that this is not true, which means that $M = \sup_{x \in X,\, 0 \le \varepsilon \le 1} \big[f(x, b^\varepsilon_t) - \varepsilon UB/\bar{\varepsilon}\big] > 0$. By definition of the supremum and since $\gamma M < M$, there exist $x \in X$ and $0 \le \varepsilon \le 1$ such that $f(x, b^\varepsilon_t) - \varepsilon UB/\bar{\varepsilon} > \gamma M$. This implies $\varepsilon/\bar{\varepsilon} < f(x, b^\varepsilon_t)/UB \le 1$, which means $(x, b^\varepsilon_t) \in \rho_{\bar{\varepsilon}}$. So,
\[
\begin{aligned}
\gamma M &< f(x, b^\varepsilon_t) - \varepsilon UB/\bar{\varepsilon} \\
&= \gamma \sum_{x' \in X,\, b^\beta_{t+1} \in B} p(x', b^\beta_{t+1}) \, f(x', b^\beta_{t+1}) - \varepsilon UB/\bar{\varepsilon} && \text{((i) with } i = 0\text{)} \\
&\le \gamma \sum_{x' \in X,\, b^\beta_{t+1} \in B} p(x', b^\beta_{t+1}) \, f(x', b^\beta_{t+1}) - \gamma \varepsilon UB/\bar{\varepsilon} && (\gamma < 1) \\
&= \gamma \sum_{x' \in X,\, b^\beta_{t+1} \in B} p(x', b^\beta_{t+1}) \, f(x', b^\beta_{t+1}) - \gamma \, b^\varepsilon_t(y) \, UB/\bar{\varepsilon} \\
&= \gamma \sum_{x' \in X,\, b^\beta_{t+1} \in B} p(x', b^\beta_{t+1}) \, f(x', b^\beta_{t+1}) - \gamma \sum_{x' \in X,\, b^\beta_{t+1} \in B} p(x', b^\beta_{t+1}) \, b^\beta_{t+1}(y) \, UB/\bar{\varepsilon} && \text{(Lemma 1)} \\
&= \gamma \sum_{x' \in X,\, b^\beta_{t+1} \in B} p(x', b^\beta_{t+1}) \big( f(x', b^\beta_{t+1}) - \beta UB/\bar{\varepsilon} \big) \\
&\le \gamma \sum_{x' \in X,\, b^\beta_{t+1} \in B} p(x', b^\beta_{t+1}) \, M = \gamma M
\end{aligned}
\tag{C.12}
\]
which is false and shows that $f(x, b^\varepsilon_t) \le \varepsilon UB/\bar{\varepsilon}$.

(iii) If $\varepsilon \le \bar{\varepsilon}/J^i$, this implies:
\[
\begin{aligned}
f(x, b^\varepsilon_t) &= \gamma^{i+1} \sum_{x_{t+i+1} \in X,\, b^\beta_{t+i+1} \in B} p(x_{t+i+1}, b^\beta_{t+i+1}) \, f(x_{t+i+1}, b^\beta_{t+i+1}) && \text{(i)} \\
&\le \gamma^{i+1} \sum_{x_{t+i+1} \in X,\, b^\beta_{t+i+1} \in B} p(x_{t+i+1}, b^\beta_{t+i+1}) \, \beta UB/\bar{\varepsilon} && \text{(ii)} \\
&= \gamma^{i+1} \sum_{x_{t+i+1} \in X,\, b^\beta_{t+i+1} \in B} p(x_{t+i+1}, b^\beta_{t+i+1}) \, b^\beta_{t+i+1}(y) \, UB/\bar{\varepsilon} \\
&= \gamma^{i+1} \, b^\varepsilon_t(y) \, UB/\bar{\varepsilon} && \text{(Lemma 2)} \\
&= \gamma^{i+1} \varepsilon UB/\bar{\varepsilon}
\end{aligned}
\tag{C.13}
\]
So, $0 \le f(x, b^\varepsilon_t)/\varepsilon \le \gamma^{i+1} UB/\bar{\varepsilon}$. As $\varepsilon \to 0$, $(x, b^\varepsilon_t) \in \rho_{\bar{\varepsilon}/J^i}$ with $i$ growing to infinity; the right-hand side $\gamma^{i+1} UB/\bar{\varepsilon}$ converges to 0, i.e. $\lim_{\varepsilon \to 0} f(x, b^\varepsilon_t)/\varepsilon = 0$. Recalling that $b^0_t = e_{\bar{y}}$,
\[
\frac{V^*(x, e_{\bar{y}}) - V^*(x, b^\varepsilon_t)}{\varepsilon} = \frac{f(x, e_{\bar{y}}) - f(x, b^\varepsilon_t)}{\varepsilon} + \frac{\mathit{Init}(x, e_{\bar{y}}) - \mathit{Init}(x, b^\varepsilon_t)}{\varepsilon}
\tag{C.14}
\]
Recall that $f(x, e_{\bar{y}}) = 0$ and that $f(x, b^\varepsilon_t)/\varepsilon \to 0$ when $\varepsilon \to 0$. Since Init equals the single $\alpha$-vector $\alpha_{x,\bar{y}}$ in a neighborhood of $(x, e_{\bar{y}})$, the last term on the right-hand side equals $\alpha_{x,\bar{y}} \cdot e_{\bar{y}} - \alpha_{x,\bar{y}} \cdot e_y$ for $\varepsilon$ small enough, which shows that the limit as $\varepsilon \to 0$ is well defined on both sides of the equation:
\[
\lim_{\varepsilon \to 0} \frac{V^*(x, e_{\bar{y}}) - V^*(x, b^\varepsilon_t)}{\varepsilon} = \lim_{\varepsilon \to 0} \frac{\mathit{Init}(x, e_{\bar{y}}) - \mathit{Init}(x, b^\varepsilon_t)}{\varepsilon}
\tag{C.15}
\]
Noticing that $b^\varepsilon_t = e_{\bar{y}} + \varepsilon d$, with $d = e_y - e_{\bar{y}}$, these limits are (up to sign) the directional derivatives of the functions $V^*$ and Init along the vector $d$:
\[
\nabla_d V^*(x, e_{\bar{y}}) = \nabla_d \mathit{Init}(x, e_{\bar{y}}) = \alpha_{x,\bar{y}} \cdot e_y - \alpha_{x,\bar{y}} \cdot e_{\bar{y}}
\tag{C.16}
\]

C.2 Proof of Theorem 2

Assumption 2: There exists $\bar{y} \in Y$ such that, for each $(x, x') \in X \times X$ and each $o' \in O$, if $Z(\pi^*_{\bar{y}}(x), x', \bar{y}, o') \, T_x(x, \bar{y}, \pi^*_{\bar{y}}(x), x') = 0$, then $Z(\pi^*_{\bar{y}}(x), x', y, o') \, T_x(x, y, \pi^*_{\bar{y}}(x), x') = 0$ for all $y \in Y$.

Theorem 2. We assume that Assumptions 1 and 2 are satisfied for some $\bar{y} \in Y$. Then, for all $x \in X$, the directional derivative of the optimal value function in $(x, e_{\bar{y}})$ in any direction equals that of the function Init. For all $(x, b) \in X \times B$, denoting $d = b - e_{\bar{y}}$, we have:
\[
\nabla_d V^*(x, e_{\bar{y}}) = \nabla_d \mathit{Init}(x, e_{\bar{y}}) = \alpha_{x,\bar{y}} \cdot b - \alpha_{x,\bar{y}} \cdot e_{\bar{y}}
\tag{C.17}
\]
Proof. The proof is the same in spirit as for Theorem 1, but in higher dimension.

(a) The value function $V^*$ is Lipschitz continuous for all $x \in X$, with constant $\frac{R_{max} - R_{min}}{1 - \gamma}$ and the $L_1$ distance on belief states (Pineau et al. 2003). So, since the optimal action is strictly better than the others, it is also strictly better in a neighborhood of the corner. There exists $\bar{\varepsilon} > 0$ such that the optimal action is strictly better than the others on balls of $L_1$ radius $\bar{\varepsilon}$, for all $x \in X$. We denote by $\rho_{\bar{\varepsilon}}$ the union of these balls; on $\rho_{\bar{\varepsilon}}$, the policies $\pi^*_{\bar{y}}$ and $\pi^*$ select the same action.

(b) We aim at controlling the $L_1$ distance between the corner $e_{\bar{y}}$ and other belief states when successors are calculated. We re-arrange $Y$ as $\{\bar{y}, y_2, \ldots, y_{|Y|}\}$ and define the notation $Z_i = Z(a, x', y_i, o')$, $T_i = T_x(x, y_i, a, x')$ for $2 \le i \le |Y|$, $\bar{Z} = Z(a, x', \bar{y}, o')$ and $\bar{T} = T_x(x, \bar{y}, a, x')$. We use a slightly different version of the ratio than in Theorem 1: let $J = \max\{ Z_i T_i / (\bar{Z}\bar{T}) : x, x' \in X,\ a \in A,\ o' \in O,\ i \in \{2, \ldots, |Y|\},\ \bar{Z}\bar{T} \ne 0 \}$. After implementing action $a$, the successor $(x', b_{t+1})$ of the state $(x, b_t)$ satisfies, for all $i \in \{2, \ldots, |Y|\}$:
\[
b_{t+1}(y_i) = \eta \, Z_i T_i \, b_t(y_i)
\tag{C.18}
\]
The case $\bar{Z}\bar{T} = 0$ is ruled out because, by Assumption 2, it would imply $Z_i T_i = 0$ for all $i$, i.e. the state $x'$ and observation $o'$ have probability 0. In the case $\bar{Z}\bar{T} > 0$, we can write:

The case ZT = 0 is ruled out because it would imply ZiTi = 0 for all i, i.e.the state x′ and observation o′ have probability 0. In the case ZT > 0, wecan write:

‖ey − bt+1‖1 = (1− ηZTbt(y)) +

|Y |∑

i=2

ηZiTibt(yi)

= ηZT [1

ηZT− bt(y) +

|Y |∑

i=2

ZiTiZT

bt(yi)]

(C.19)

169

Page 172: Optimal sequential decision-making under uncertainty Brice_Peron_Thesis.pdfHence, it is necessary to prioritise management when making sequential de-cisions. Inspired by the management

Recall that

1

η= bt(y)ZT +

|Y |∑

i=2

ZiTibt(yi) (C.20)

which implies

1

ηZT= bt(y) +

|Y |∑

i=2

ZiTiZT

bt(yi) (C.21)

So,

‖ey − bt+1‖1 = 2ηZT

|Y |∑

i=2

ZiTiZT

bt(yi)

= 2

∑|Y |i=2

ZiTiZT

bt(yi)

bt(y) +∑|Y |

i=2ZiTiZT

bt(yi)

≤ 2

∑|Y |i=2 Jbt(yi)

bt(y) +∑|Y |

i=2 Jbt(yi)

= 2J1− bt(y)

bt(y) + J(1− bt(y))

≤ J(2(1− bt(y))

= J((1− bt(y) +

|Y |∑

i=2

bt(yi))

= J‖ey − bt‖1

(C.22)

So, we obtain the same property as in Theorem 1 (b): (x, bt) ∈ ρε/J =⇒(x′, bt+1) ∈ ρε.

The steps (c), (d) and (e-i) are the same as in Theorem 1, leading to
\[
f(x, b_t) = \gamma^{i+1} \sum_{x_{t+i+1} \in X,\, b_{t+i+1} \in B} p(x_{t+i+1}, b_{t+i+1}) \, f(x_{t+i+1}, b_{t+i+1})
\tag{C.23}
\]
Finally, in steps (e-ii) and (e-iii), $\varepsilon$ is replaced by $\|e_{\bar{y}} - b_t\|_1$: the function $f$ satisfies $f(x, b_t) \le \|e_{\bar{y}} - b_t\|_1 \, UB/\bar{\varepsilon}$ for some upper bound $UB$.

Then, $f(x, b_t)/\|e_{\bar{y}} - b_t\|_1 \le \gamma^{i+1} UB/\bar{\varepsilon}$ for all $(x, b_t) \in \rho_{\bar{\varepsilon}/J^i}$. This implies $\lim_{b_t \to e_{\bar{y}}} f(x, b_t)/\|e_{\bar{y}} - b_t\|_1 = 0$ and completes the proof that the functions $V^*$ and Init have equal directional derivatives at $e_{\bar{y}}$ along any vector, i.e.
\[
\nabla_d V^*(x, e_{\bar{y}}) = \nabla_d \mathit{Init}(x, e_{\bar{y}}) = \alpha_{x,\bar{y}} \cdot b - \alpha_{x,\bar{y}} \cdot e_{\bar{y}}
\tag{C.24}
\]
for any $b \in B$ and $d = b - e_{\bar{y}}$.
