

WATER RESOURCES RESEARCH, DOI:10.1029/2009WR008898

Tree-based reinforcement learning for optimal water reservoir operation

A. Castelletti,1 S. Galelli,1 M. Restelli,1 R. Soncini-Sessa1


Abstract. Although it is one of the most popular and extensively studied approaches to design water reservoir operations, Stochastic Dynamic Programming is plagued by a dual curse that makes it unsuitable to cope with large water systems: the computational requirement grows exponentially with the number of state variables considered (curse of dimensionality), and an explicit model must be available to describe every system transition and the associated rewards/costs (curse of modeling). A variety of simplifications and approximations have been devised in the past, which, in many cases, make the resulting operating policies inefficient and of scarce relevance in practical contexts. In this paper a reinforcement-learning approach called fitted Q-iteration ...


1. Introduction

Despite the great progress made in the last decades, optimal operation of water systems still remains a very active research area (see the recent review by Labadie [2004]). The combination of multiple, conflicting water uses, non-linearities in the model and the objectives, strong uncertainties in the inputs, and high dimensional state spaces makes the problem challenging and intriguing (Castelletti et al. [2008] and references therein).

Stochastic Dynamic Programming (SDP) is one of the most suitable methods for designing (Pareto) optimal reservoir operating policies (see, e.g., Soncini-Sessa et al. [2007] and references therein). SDP is based on the formulation of the operating policy design problem ...


... and Shane [1982]; Read [1989]; Hooper et al. [1991]; Piccardi and Soncini-Sessa [1991]; Vasiliadis and Karamouz [1994]; Castelletti and Soncini-Sessa [2007]).

Despite being studied so extensively in the literature, SDP suffers from a dual curse which, de facto, prevents its practical application to even reasonably complex water systems. (i) The computational complexity grows exponentially with the state, decision, and disturbance dimensions (Bellman's curse of dimensionality [Bellman, 1957]), so that SDP cannot be used with water systems where the number of reservoirs is greater than a few (2-3) units. (ii) An explicit model of each component of the water system is required (curse of modeling [Bertsekas and Tsitsiklis, 1996]) to anticipate the effects of system transitions. Any information included into the SDP framework can only ...


... Incremental Dynamic Programming [Larson, 1968], Differential Dynamic Programming [Jacobson and Mayne, 1970], and problem-specific heuristics [Wong and Luenberger; Luenberger, 1971]. However, these methods have been conceived mainly for deterministic problems and are of scarce interest for the optimal operation of reservoir networks, where the uncertainty associated with the underlying hydro-meteorological processes cannot be neglected. Alternative approaches can be classified in two main classes (see Castelletti et al. [2008] and references therein for further details) depending on the strategy they adopt to alleviate the dimensionality burden: methods based on the simplification of the water system model and methods based on the restriction of the degrees of freedom of the policy design problem.


Saad and Turgeon [1988] and Saad et al. [1992] proposed a method based on Principal Component Analysis to reduce the complexity in a five-reservoir hydropower system, rewritten to a four-state variable problem, which was then solvable by SDP. Better performance was obtained on the same system by Saad et al. [1994], who used a disaggregation technique based on neural networks. A major contribution to hierarchical multilevel decomposition comes from Haimes [Haimes, 1977]. The idea behind such approaches is that different decomposition levels are separately modeled and analyzed, but some information is transmitted from lower to higher levels in the decomposition hierarchy.

The second class of approaches to avert the curse of dimensionality is based on the introduction of some hypotheses on the regularity of the SDP optimal value function ...


... is the central idea of Reinforcement Learning (RL), a well-known framework for sequential decision-making (e.g., Barto and Sutton [1998]) that combines concepts of dynamic programming, stochastic approximation via simulation, and function approximation. The learning experience can be acquired on-line, by directly experimenting decisions on the real system, without any model, or generated off-line, either by using an external simulator or historical observations. While the first option is clearly impracticable on real reservoir systems, off-line learning has been already experimented in the operation of water systems: Castelletti et al. [2001] (see also Soncini-Sessa et al. [2007]) proposed a partially model-free version of classical Q-learning [Watkins and Dayan, 1992] to design the daily operation of a multi-purpose regulated lake. The storage dynamics was simulated via the mass conservation equation ...


Lately, a new approach, called fitted Q-iteration, which combines the RL concepts of off-line learning and functional approximation of the value function, has been proposed [Ernst et al., 2005]. Unlike traditional stochastic approximation algorithms [Bellman and Dreyfus, 1963; Bertsekas and Tsitsiklis, 1996; Tsitsiklis and Roy, 1996], which use parametric function approximators and thus require a time consuming parameter estimation at each iteration step, fitted Q-iteration uses tree-based approximation [Breiman et al., 1984]. The use of tree-based regressors offers a twofold advantage: first, a greater modeling flexibility, which is a paramount characteristic in the typical multi-objective context of water reservoir systems with multi-dimensional states, where the value functions to be approximated are unpredictable in shape; second, a higher computational efficiency ...


In this paper, the fitted Q-iteration is demonstrated on Lake Como, a multi-purpose regulated lake in Italy. As originally proposed in Ernst et al. [2005], fitted Q-iteration yields a stationary policy, which is perfectly suited for the artificial systems the algorithm has been conceived for, while it is less conforming to natural resources systems such as the one considered in this paper. An improved version is therefore proposed that includes non-stationary policies, which are more effective in adapting to the natural seasonal variability. The focus of the paper is first on studying the properties of the algorithm, with an analysis of the results' sensitivity to the tree-based method parameters. The potential advantages of the approach are then explored and evaluated against traditional SDP, which is the natural term of comparison.


... denotes the time instant at which such a variable assumes a deterministic value: the lake storage is measured at time t and thus is denoted with s_t, while the disturbance in the interval [t, t+1) is denoted with ε_{t+1}, since it can be deterministically known only at the end of the interval [Piccardi and Soncini-Sessa, 1991].

2.1. Model of the Water System

The reservoir dynamics is governed by the mass conservation equation:

    s_{t+1} = s_t + a_{t+1} - r_{t+1}

where a_{t+1} is the net inflow volume in the time interval [t, t+1), which includes evaporation and other losses, and r_{t+1} is the release volume over the same period ...
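To make the transition concrete, here is a minimal sketch of one step of the mass conservation equation in Python; the feasibility rule on the release is an illustrative placeholder, not the actual Lake Como release function:

    def storage_step(s, a, u, min_release=0.0):
        """One step of s_{t+1} = s_t + a_{t+1} - r_{t+1}.

        The actual release r is the decision u clipped to what is physically
        feasible (illustrative bounds, not a real rating curve)."""
        r = max(u, min_release)
        r = min(r, s + a)        # cannot release more water than is available
        return s + a - r, r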


... balance between compactness and accuracy (e.g., Young [2006]), and are generally preferred over the first in designing optimal reservoir operation. In the most general form, the inflow can be described as

    a_{t+1} = A_t(I_t, ε_{t+1})

where A_t(·) is a periodic function with period T. For example, a_{t+1} can be modeled as a cyclostationary, log-normal autoregressive process of order d (i.e., a log-PAR(d)):

    a_{t+1} = exp(y_{t+1} σ_t + μ_t)
    y_{t+1} = Σ_{i=1}^{d} α_{i,t} y_{t-i+1} + ε_{t+1}

where μ_t and σ_t are the periodic mean and standard deviation of the process ...
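As an illustration, the following sketch simulates a log-PAR(1) process (d = 1); the periodic parameters mu, sigma, and alpha passed to the function are assumptions of the example, not calibrated values:

    import numpy as np

    def simulate_log_par1(mu, sigma, alpha, horizon, seed=None):
        """Simulate a_{t+1} = exp(y_{t+1} * sigma_t + mu_t) with
        y_{t+1} = alpha_t * y_t + eps_{t+1} and period T = len(mu)."""
        rng = np.random.default_rng(seed)
        T, y, inflows = len(mu), 0.0, []
        for t in range(horizon):
            y = alpha[t % T] * y + rng.standard_normal()  # standardized log-inflow
            inflows.append(np.exp(y * sigma[t % T] + mu[t % T]))
        return np.array(inflows)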


... where P can be equal, smaller, or greater than N and n_x = N + L - P. The disturbance vector ε_{t+1} ∈ S_{ε_{t+1}} ⊆ R^{n_ε} is composed of P disturbances ε^l_{t+1} (i.e., n_ε = P) with pdf φ^l_t(·). Finally, the release decision vector u_t ∈ U_t(s_t) ⊆ S_{u_t} ⊆ R^{n_u}, whose components are the release decisions u^j_t from each reservoir j (with j = 1, ..., N and n_u = N), replaces the scalar decision u_t in equation (5).

The presence of multiple, say q, operating objectives, corresponding to different water users and other social and environmental interests, can be formalized by defining a periodic, with period T, step reward function g_{t+1} = g_t(x_t, u_t, ε_{t+1}) associated with the stochastic state transition from x_t to x_{t+1}. According to the multi-objective nature of the problem, this function can be obtained as a weighted sum (Weighting Method ...


... can be critical in most cases. Conversely, when an infinite time horizon is considered, a discount factor must be fixed to ensure the convergence of the policy design algorithm (Total Discounted Cost (TDC) formulation).

For a given value of the weights λ_i, with i = 1, ..., q, the total reward function associated with the operating policy p over an infinite time horizon can be defined as

    J(p) = lim_{h→∞} E_{ε_1,...,ε_h} [ Σ_{t=0}^{h-1} γ^t g_t(x_t, u_t, ε_{t+1}) ]

where 0 < γ < 1 and the expected value is used as the criterion to filter the stochastic disturbances (see Orlovski et al. [1984]; Nardini et al. [1992]; Soncini-Sessa et al. [2007] for details and alternative solutions). The optimal policy p* is obtained by ...
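In practice J(p) can be estimated by Monte Carlo simulation, truncating the infinite horizon where γ^t becomes negligible. A sketch under these assumptions, where step, reward, policy, and sample_disturbance are illustrative stand-ins for the model, the step reward, the operating rule, and the disturbance pdf:

    import numpy as np

    def estimate_J(step, reward, policy, sample_disturbance, x0,
                   gamma=0.99, horizon=2000, n_runs=100):
        """Monte Carlo estimate of J(p) = lim_h E[sum_t gamma^t g_t(...)]."""
        totals = []
        for _ in range(n_runs):
            x, total = x0, 0.0
            for t in range(horizon):       # gamma**horizon ~ 0 truncates the limit
                u = policy(t, x)
                eps = sample_disturbance(t)
                total += gamma**t * reward(t, x, u, eps)
                x = step(t, x, u, eps)
            totals.append(total)
        return float(np.mean(totals))      # the expectation filters the disturbances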


    x_{t+1} = f_t(x_t, u_t, ε_{t+1})    t = 0, ..., h-1
    m_t(x_t) = u_t ∈ U_t(x_t)           t = 0, ..., h-1
    ε_{t+1} ~ φ_t(·|x_t, u_t)           t = 0, ..., h-1
    x_0 given
    p^h ≜ {m_t(·); t = 0, ..., h-1}

By reformulating and solving the problem for some different values of the λ_i (i = 1, ..., q), a finite subset of the generally infinite Pareto optimal policy set is obtained. Since the system (equations (11b-d)) and the total reward function (11a) are ...


2. The disturbance vector is known (equation (11d)) and either the disturbances are independent in time or any dependency upon the past at time t can be accounted for by the value of the state at the same time.

3. The step reward functions are known and separable, i.e., g_t(·) only depends on variables defined for the time interval [t, t+1).

The solution to problems (9) and (11) is computed by recursively solving the Bellman equation formulated according to the TDC framework:

    Q_t(x_t, u_t) = E_{ε_{t+1}} [ g_t(x_t, u_t, ε_{t+1}) + γ max_{u_{t+1}} Q_{t+1}(x_{t+1}, u_{t+1}) ]    ∀(x_t, u_t) ∈ S_{x_t} × S_{u_t}    (12)

where Q_t(·,·) is the so-called Q-function or value function, i.e., the cumulative ...


To determine the right hand side of equation (12), the domains S_{x_t}, S_{u_t}, and S_{ε_{t+1}} of the state, release decision, and disturbance must be discretized and, at each iteration of the resolution process, explored exhaustively. The choice of the domain discretization is essential as it reflects on the algorithm complexity, which is combinatorial in the number of states, release decisions, and disturbances, and in their domain discretization. Let N_{x_t}, N_{u_t}, and N_{ε_{t+1}} be the number of elements in the discretized state, release decision, and disturbance sets S_{x_t} ⊆ R^{n_x}, S_{u_t} ⊆ R^{n_u}, and S_{ε_{t+1}} ⊆ R^{n_ε}: the recursive resolution of equation (12) for kT iteration steps (where k is usually lower than ten) requires

    kT · N_{x_t}^{n_x} · N_{u_t}^{n_u} · N_{ε_{t+1}}^{n_ε}    (15)

evaluations of the operator E[·] ...
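The combinatorial cost is easy to see in a direct sketch of the recursive resolution of equation (12): three nested loops touch every discretized state, decision, and disturbance value at each of the kT steps. The grid indexing and the step/reward helpers below are assumptions of the example:

    import numpy as np

    def sdp_backward(n_x, n_u, disturbances, probs, step, reward, T, k, gamma):
        """Successive approximation of eq. (12) for a periodic system.

        step(t, ix, iu, eps) is assumed to return the index of the successor
        state on the grid; each sweep costs N_x * N_u * N_eps evaluations."""
        Q = np.zeros((n_x, n_u))
        for it in range(k * T):
            t = (-it - 1) % T                   # backward sweep over the period
            Q_new = np.empty_like(Q)
            for ix in range(n_x):
                for iu in range(n_u):
                    value = 0.0
                    for eps, p in zip(disturbances, probs):
                        ix_next = step(t, ix, iu, eps)
                        value += p * (reward(t, ix, iu, eps)
                                      + gamma * Q[ix_next].max())
                    Q_new[ix, iu] = value
            Q = Q_new    # the last T tables would be stored for the periodic policy
        return Q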


... learning from experience with the concept of continuous approximation of the value function developed for large-scale dynamic programming (see for example Gordon [1995]; Tsitsiklis and Roy [1996]). This results in an improved reduction of the computational burden. Indeed, a continuous mapping of the state-decision pairs into the value function should ensure the same level of accuracy as a look-up table representation based on an extremely dense grid, but using a definitely coarser grid for the state-decision space. Further, the learning process is performed off-line, without the need for directly experimenting on the real system, which is a fundamental requirement when dealing with water resources, where experiments would lead to unsustainable costs in terms both of time and of social and economic losses ...


... from a finite set of transition samples, the policy generated by fitted Q-iteration is an approximation of the optimal policy p* that solves problem (11). Precisely, fitted Q-iteration yields an approximation of the optimal Q-functions of the TDC problem by iteratively extending the optimization horizon h, i.e. by iteratively solving problem (11).

The deterministic and stationary (T = 1) case is useful to describe the algorithm. Under these simplifying assumptions the state transition (11b) and the associated reward depend only on the state x_t and decision u_t. It can be shown [Ernst, 1999] that the following sequence of Q_h-functions, defined for all (x_t, u_t) ∈ S_x × S_u ...


In the stochastic case, the right hand side of equation (16b) is a realization of a random variable and Q_h(x_t, u_t) is redefined as its expectation. However, the expectation operator does not need to be operationally applied when Q_h(·) is approximated with a regression function based on the least squares method, because this latter generates an approximation of the conditional expectation of the output variables given the input. Its application to the data-set constructed considering stochastic transitions thus provides a continuous approximation of Q_h(·) over the whole state-decision set.

As originally proposed by Ernst et al. [2005], fitted Q-iteration generates a stationary policy, i.e., just one operating rule of the form u_t = m(x_t), which is the optimal policy for a stationary system. However, natural systems are not stationary and thus a periodic policy ...


Set Q_0(·) = 0 over the whole state-decision space S_x × S_u.

Iterations: repeat until the stopping conditions are met
  Set h = h + 1.
  Build the training set TS = {⟨i^l, o^l⟩, l = 1, ..., #F}, where i^l = ((t, x_t)^l, u^l_t) and o^l = g^l_{t+1} + γ max_{u_{t+1}} Q_{h-1}((t+1, x_{t+1})^l, u_{t+1}).
  Run the regression algorithm on TS to get Q_h(·), from which the policy p ...

Fitted Q-iteration is said to be a batch RL algorithm, because the whole data-set is processed in a batch mode, in contrast to traditional RL algorithms that perform an update of the value function sequentially, one sample at a time ...
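A compact sketch of the iteration above, using scikit-learn's Extra-Trees as the tree-based regressor (one possible implementation, not necessarily the authors' code); the data-set layout and parameter values are illustrative:

    import numpy as np
    from sklearn.ensemble import ExtraTreesRegressor

    def fitted_q_iteration(F, decisions, gamma=0.99, n_iterations=150):
        """F: list of four-tuples ((t, x_t), u_t, (t+1, x_{t+1}), g_{t+1}),
        with (t, x_t) flattened to a 1-D array; decisions: candidate u values."""
        X = np.array([np.append(tx, u) for tx, u, _, _ in F])
        g = np.array([r for _, _, _, r in F])
        tx_next = np.array([txn for _, _, txn, _ in F])

        Q = None                                # Q_0(.) = 0 everywhere
        for h in range(1, n_iterations + 1):
            if Q is None:
                targets = g
            else:                               # o = g + gamma * max_u' Q_{h-1}(., u')
                q_next = np.column_stack([
                    Q.predict(np.column_stack([tx_next, np.full(len(tx_next), u)]))
                    for u in decisions])
                targets = g + gamma * q_next.max(axis=1)
            Q = ExtraTreesRegressor(n_estimators=50, min_samples_split=15)
            Q.fit(X, targets)                   # trees are re-grown at every iteration
        return Q    # greedy policy: m(t, x) = argmax_u Q.predict([(t, x, u)])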


... improve this near-historical policy is to enlarge the exploration of the release decisions by sampling different values around the historical one (see Gaskett [2002]), for each past state (Figure 1a). This is, however, a risky approach: if the state-decision set has been poorly explored during the historical operation (typically, in poorly controlled systems [Tsitsiklis and Roy, 1996]), the informative content of the learning data-set can be low and the resulting policy is very likely to be quite far from optimality. Further, the approach is impracticable if the water system has never been operated before (e.g., in planning problems).

An alternative approach is to explore the behavior of the water system, via simulation, for different state values and under different operating policies, namely to adopt a model-based approach. However, the modeling effort does not need to involve the whole ...


... it would not take any advantage of the continuous approximation of the Q-function provided by fitted Q-iteration. Rather, a coarse grid can exponentially reduce the computing time, by linearly reducing N_{x_t} and N_{u_t} in equation (15). Such a coarse grid can be obtained as a uniform sub-sampling of the SDP dense grid (Figure 1c) or generated with ad hoc discretization methods (Figure 1d), such as orthogonal arrays, Latin hypercubes, and low-discrepancy sequences (see Cervellera et al. [2006] and references therein). Whatever the approach adopted to build the learning data-set, this might contain redundancies, which increase the computational requirements with no advantages in terms of policy performance. A way to reduce the size of the data-set is to adopt active learning techniques [Cohn et al., 1996], based on selecting the samples that mostly improve the performance of the learning algorithm (see ...
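For instance, a coarse state-decision grid could be drawn with a Latin hypercube via SciPy's quasi-Monte Carlo module (available from SciPy 1.7); the bounds below are placeholders, not the Lake Como ones:

    from scipy.stats import qmc

    sampler = qmc.LatinHypercube(d=2, seed=0)   # 2-D: (storage, release decision)
    unit_points = sampler.random(n=80)          # 80 points in [0, 1)^2
    lower, upper = [0.0, 0.0], [260e6, 5e7]     # illustrative domain bounds
    coarse_grid = qmc.scale(unit_points, lower, upper)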


... computational requirements. Further, no human tuning of the function approximator should be needed (fully automated approximation).

Some parametric function approximators can provide a great modeling flexibility: neural networks, for instance, are provably able to approximate any continuous function to any desired degree of accuracy. This modeling flexibility, however, comes at a cost, since it is often reflected in a large number of parameters requiring explicit estimation and strongly affecting the computational efficiency (see Castelletti et al. [2005]), with the risk of over-parameterization. As the problem size scales up, neural networks require more and more neurons, thus increasing the computational cost of the training phase. Non-parametric function approximators, particularly tree-based methods, ensure modeling flexibility ...


Tree-based methods include KD-Trees, Classification and Regression Trees [Breiman et al., 1984], Tree Bagging [Breiman, 1996], Totally Randomized Trees, and Extremely Randomized Trees (Extra-Trees) [Geurts et al., 2006]. These methods basically differ by the splitting criterion, the termination test they adopt, and the number of trees they grow. Extra-Trees (described later) were demonstrated to perform better than other tree-based methods combined with the fitted Q-iteration algorithm [Ernst et al., 2005] and are therefore adopted in this study. Particularly, they provide great scalability by adapting the trees' structure to the data-set at each iteration, thus resulting in a better accuracy of the final policy. The drawback of these continuous changes in the structure is that Extra-Trees do not ensure the convergence of the iteration, and so the algorithm cannot simply be stopped based on the distance between two successive approximations of the Q-function ...


Three parameters are thus associated to Extra-Trees, whose values can be set on the basis of empirical evaluations:

K, the number of alternative cut-directions, can be chosen in the interval [1, n], where n is the number of regressor inputs. When K is equal to n, the choice of the cut-direction is not randomized and the randomization acts only through the choice of the cut-point. On the contrary, low values of K increase the randomization of the trees and weaken the dependence of their structure on the output of the training data-set. Geurts et al. [2006] empirically demonstrated that, for regression problems, the optimal default value for K is n.

n_min, the minimum cardinality for splitting a node. Large values of n_min produce small trees (few leaves) with high bias and small variance. Conversely, low values of ...
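In scikit-learn's ExtraTreesRegressor, taken here only as one concrete implementation, the three parameters map roughly onto max_features (K), min_samples_split (n_min), and n_estimators (M):

    from sklearn.ensemble import ExtraTreesRegressor

    n = 3                          # regressor inputs, e.g. (t, s_t, u_t)
    model = ExtraTreesRegressor(
        n_estimators=50,           # M: number of trees in the ensemble
        max_features=n,            # K = n: only the cut-point is randomized
        min_samples_split=15,      # n_min: minimum cardinality for splitting a node
    )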


... algorithm to the Extra-Trees parameters and the stopping conditions. Second, the size of the system makes the control problem solvable with SDP, and this is key to performing a comparative evaluation of the algorithm. Based on this analysis, the advantages of the fitted Q-iteration over SDP can be easily extrapolated to more complex cases, where the SDP requirements become prohibitive for a comparison.

4.1. Description

Lake Como is the third biggest regulated lake in Italy, with a surface area of ... and an active storage of 260 Mm³. The 4500 km² lake catchment area produces an average inflow of 4.73 Gm³/year, with the typical two-peak (spring and autumn) subalpine regime ...


... release decision u_t is the volume to be released in the next 24 hours from the lake ...; finally, according to (6), the step reward function g_t(·) is a linear combination of penalties (negative rewards) accounting for flood damage and downstream water deficits. The data-set F of four-tuples ⟨(t, x_t), u_t, (t+1, x_{t+1}), g_{t+1}⟩ required by the fitted Q-iteration algorithm was built adopting a partially model-free approach. An 80-point grid was used for the state-decision space (Figure 6); precisely, 10 points for the storage s_t and 8 points for the release decision u_t, the first six of which correspond to downstream demand values, plus two greater values. For the inflow a_t at time t, which plays the role of an exogenous input to the system, 15 years of daily streamflow data (1965-1979) were directly used ... The state transitions were performed by running a one-step simulation of equation ...
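A sketch of this partially model-free construction of F: the storage transition is simulated with the mass conservation equation, the inflow is read directly from the historical series, and the grids, feasibility rule, and reward function are illustrative placeholders:

    def build_dataset(storage_grid, decision_grid, inflows, reward, period=365):
        """Build F = {<(t, x_t), u_t, (t+1, x_{t+1}), g_{t+1}>} by one-step simulations."""
        F = []
        for t in range(len(inflows) - 1):
            a = inflows[t]                  # historical inflow as exogenous input
            for s in storage_grid:
                for u in decision_grid:
                    r = min(u, s + a)       # illustrative feasibility rule
                    s_next = s + a - r      # mass conservation step
                    F.append(((t % period, s), u,
                              ((t + 1) % period, s_next), reward(t, s, u, r)))
        return F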


... where w_t is the aggregated agricultural and hydropower demand and r_{t+1} is the release from the lake given by equation (2).

The policy is designed by solving an equivalent to problem (11) where, according to the nature of the objectives, the operator max is substituted with min, and the aggregated reward g_t(·) in equation (8) is computed as

    g_t(·) = λ g^f_t(·) + (1 - λ) g^w_t(·)

with 0 ≤ λ ≤ 1.
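Operationally, the Weighting Method amounts to sweeping λ and re-solving the design problem, each weight yielding one point of the approximated Pareto front. A sketch, where design_policy stands for the solver (fitted Q-iteration or SDP), g_f and g_w for the two step rewards, and evaluate for a simulation-based assessment:

    import numpy as np

    def pareto_front(design_policy, g_f, g_w, evaluate, n_weights=11):
        front = []
        for lam in np.linspace(0.0, 1.0, n_weights):
            # aggregated step reward: g = lam * g_f + (1 - lam) * g_w
            def g(t, x, u, eps, lam=lam):
                return lam * g_f(t, x, u, eps) + (1 - lam) * g_w(t, x, u, eps)
            policy = design_policy(g)
            front.append((lam,) + tuple(evaluate(policy)))  # (lam, floods, deficit)
        return front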

    5. Analysis of Fitted Q-iteration Properties


... far from optimality, depending on the accuracy of the approximation. However, due to the randomization, as h increases the value of J_h fluctuates. The recursive nature of the algorithm filters the pure random fluctuations (high frequency) of the approximation and produces smooth fluctuations. For small h, the fluctuations are dominated by the performance improvement due to the policy learning process and are not evident. When increasing h adds no useful information for improving the policy, the oscillations become the dominant effect.

To choose the number h̄ of iterations at which to stop the algorithm, it is necessary to resort to some empirical criterion. In principle, h̄ should be the value of h at which the learning process is nearly over and the improvement in performance is so limited as to be hidden by the random fluctuations due to the Extra-Trees approximation. As far as ...


... beginning of the new irrigation season is completely independent from the operation in the previous season and, thus, for what concerns the irrigation component, J_h settles to a constant value. Similar reasoning applies to floods, as they occur at the end of ..., and the lake can be emptied in 15 days. Therefore, to design a receding horizon policy, as the fitted Q-iteration algorithm de facto does, 5 months (about 150 days) are enough, and the value of h̄ in Figure 3 is close to this value. This observation not only supports the empirical criterion proposed above, but also suggests another stopping criterion, to some extent more practical (it does not require computing J_h after each algorithm iteration): whenever the problem can be re-framed as a receding h̄-steps horizon problem, h̄ is the natural stopping limit for the algorithm, since the policy learning process does not improve anymore when the number of iterations exceeds h̄ ...
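The first, performance-based criterion can be written down directly: stop when the smoothed improvement of J_h is hidden by the amplitude of its fluctuations. A sketch, with the window size an arbitrary illustrative choice:

    import numpy as np

    def should_stop(J_history, window=10):
        """Empirical stopping test on the sequence J_1, ..., J_h."""
        if len(J_history) < 2 * window:
            return False
        recent = np.asarray(J_history[-window:])
        previous = np.asarray(J_history[-2 * window:-window])
        # stop when the mean improvement is hidden by the random fluctuations
        return abs(recent.mean() - previous.mean()) <= recent.std()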


The value n_min, the minimum sample size for splitting a node, determines the number of leaves in a tree and thus the ensemble's trade-off between bias and variance. By way of example, in Figure 4 (top panel) the experiment in Figure 3 is replicated for different values of n_min. Reducing n_min decreases the bias (in the average the performance is nearer to the optimum) but negatively affects the variance (higher amplitude of the fluctuations). As anticipated, when dealing with stochastic function approximation, the regression algorithm approximates the conditional expectation of the output given the input. n_min should therefore be comparable to the number of disturbance realizations available for each state-decision pair. Since the learning data-set F of the Lake Como water system was generated using a 15-year inflow series, the best performance is expected to be obtained for n_min ≈ 15. This is confirmed ...


The potential of the fitted Q-iteration was analyzed via comparison with a standard SDP formulation. The learning data-set F of fitted Q-iteration was generated using the partially model-free approach (see Section 4.2). As for SDP, according to the requirement of explicitly modeling the system components, the inflow a_{t+1} was described as a cyclostationary (with period T), log-normal, stochastic process, whose pdf is defined by the parameters μ_t and σ_t; a PAR(0) model was assumed (for more details see Pianosi and Soncini-Sessa ...):

    a_{t+1} = e^{σ_{t mod T} ε_{t+1} + μ_{t mod T}},    ε_{t+1} ~ N(0, 1)

where ε_{t+1} is a Gaussian white noise. The state-decision domain was discretized with a grid of 27,048 points (N_{s_t} = 161; N_{u_t} = 168, see Figure 6), while a 9-point grid ...


... grid, is more accurate than the SDP look-up table, even if this latter is based on a much denser state-decision discretization grid.

By way of demonstration, the policy associated with point A in Figure ..., obtained with fitted Q-iteration, dominates the corresponding policy A′, obtained with SDP, by ... floodings per year and nearly 3.5 × 10⁶ m³ of deficit per year. Both the policies satisfy outright the water demand (front flat area in panels (a) and (c) in Figure 8) over a relatively wide range of storage values and strongly increase the release rate during the flood seasons. In so doing they create a time-varying flood buffer zone, whose dimension is not a priori designed, as it is either learnt from the flood events and the associated effects in the data-set (fitted Q-iteration) or implicitly inferred from the stochastic inflow model (SDP) ...


... does. An example is provided in Figure 11. The same example shows that both policies significantly outperform the historical operation, which appears to be much more conservative.

By moving toward the left extreme of the Pareto front (points B and B′ in ...), i.e. increasing the relative importance of irrigation over floods, SDP performs better than fitted Q-iteration. This is basically due to the approximation error in the tree-based interpolation of the Q-functions. Indeed, as the importance of the irrigation increases, the conflict between the objectives becomes negligible and the optimal policy simply suggests to release the water demand. As anticipated, the water demand values belong to the release decision discretization grid of both the algorithms. However, while the release decision chosen by SDP is necessarily a point of that grid, and thus a water demand value, fitted Q-iteration uses a continuous approximation ...


6.1. Computational Requirements

A comparative analysis of the computational requirements of fitted Q-iteration and SDP can be empirically performed by inferring some general rules from the computing times of the Lake Como case study.

As anticipated, the time t_SDP required to design an operating policy with SDP is proportional to the number of evaluations of the operator E[·], which is given by equation (15). Splitting the state dimensionality (n_x) into n_s storages and n_I hydro-meteorological information, the time t_SDP can be expressed as

    t_SDP = a · kT · ( N_{s_t}^{n_s} · N_{I_t}^{n_I} · N_{u_t}^{n_u} · N_{ε_t}^{n_ε} )


... disturbances. Assuming that the fitted Q-iteration coarse grid is obtained by reducing the grid of an equally performing SDP by a factor r_s and r_u, respectively for state and decision,

    t_Q1 = b · T · (N_{s_t}/r_s)^{n_s} · (N_{u_t}/r_u)^{n_u} · N_a

where b is a constant, machine-dependent parameter, and N_a is the number of inflow realizations (i.e., the number of years in the historical data set used for the inflow and the other hydro-meteorological information). The time t_Q2 grows linearly in the time horizon T, the number of regressors k (i.e., n_s + n_I + n_u + 1) and of trees M, and superlinearly in the number ((N_{s_t}/r_s)^{n_s} · (N_{u_t}/r_u)^{n_u} · N_a · T) of four-tuples in the data-set. Precisely,

    t_Q2 = c · (N_{s_t}/r_s)^{n_s} · (N_{u_t}/r_u)^{n_u} · N_a · T · log( (N_{s_t}/r_s)^{n_s} · (N_{u_t}/r_u)^{n_u} · N_a · T ) · M · (n_s + n_I + n_u + 1)
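These expressions can be evaluated side by side for a given configuration. A sketch with the Lake Como grid sizes and otherwise arbitrary placeholder constants (a, b, c are machine-dependent; r_s and r_u are chosen so that 161/r_s ≈ 10 and 168/r_u = 8, the coarse grid of Section 4.2):

    import math

    a = b = c = 1e-7                   # machine-dependent constants (placeholders)
    k, T, M = 5, 365, 50               # SDP sweeps, period, number of trees
    Ns, NI, Nu, Ne, Na = 161, 1, 168, 9, 15
    ns, nI, nu, neps = 1, 0, 1, 1
    rs, ru = 16, 21                    # coarse-grid reduction factors

    t_sdp = a * k * T * Ns**ns * NI**nI * Nu**nu * Ne**neps
    n_tuples = (Ns / rs)**ns * (Nu / ru)**nu * Na * T
    t_q1 = b * T * (Ns / rs)**ns * (Nu / ru)**nu * Na
    t_q2 = c * n_tuples * math.log(n_tuples) * M * (ns + nI + nu + 1)
    print(t_sdp, t_q1 + t_q2)          # compare the two design times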


... lines & white circles) is evident from both the panels. Fitted Q-iteration goes beyond the computational limits of SDP (i.e., n_s ≤ 2) on complex networks including more reservoirs (top panel). The improvement is even more remarkable when the operating policy uses exogenous information (bottom panel), as the model-free (uncontrolled!) components require nearly no additional computational time for fitted Q-iteration: while SDP requires ... days for a configuration with 1 reservoir and 2 exogenous information, fitted Q-iteration requires hours.

7. Conclusions

One major technical challenge in expanding the scope of water resources ...


The application to the Lake Como water system was used to infer general guidelines on the appropriate setting for the algorithm parameters, to define an empirical stopping criterion, and to demonstrate the potential of the approach compared to traditional SDP. The policy obtained with fitted Q-iteration on an extremely coarse state-decision discretization grid was shown to generally outperform an equivalent SDP-derived policy computed on a very dense grid. The dominance is particularly remarkable on flood events (Figure 9), when the timing component of both the policies, which is key to anticipate and buffer floods when no inflow information is considered, is more effectively exploited by the fitted Q-iteration. Based on the experiments on Lake Como, a general rule was also derived to quantify the computational advantage of fitted Q-iteration over SDP in designing daily operating policies for large water systems ...


... 2007] combined with a policy reconstruction procedure [Schneegaß et al., 2007], where an operating policy is first identified and then iteratively improved by the algorithm.

While the Extra-Trees used by the fitted Q-iteration have been shown to provide a good accuracy/efficiency trade-off, this comes at the price of the lack of a well-defined and robust stopping condition, which, in turn, might negatively affect both the accuracy (the final policy may not be the best one explored) and the efficiency (a better policy could have been found by stopping the algorithm earlier). Strictly, this happens because of the refreshing of the tree structures at each iteration of the fitted Q-algorithm, which is key to building an accurate approximation along the iterations of the algorithm, but prevents the approximated Q-functions from stabilizing when the improvement in the policy performance is marginal. Further investigations ...


... control approaches. These also include the process-based modeling of rainfall-runoff processes to generate climate change scenarios and investigate adaptive management strategies; the combined use of fitted Q-iteration and model reduction techniques [Castelletti et al., 2009] is worth being explored for these purposes.

Acknowledgments. The work was completed while Andrea Castelletti, Stefano Galelli and Rodolfo Soncini-Sessa were on leave at the Centre for Water Research, University of Western Australia. This paper forms CWR reference 2329 AC.

    References



Bhattacharya, A., A. Lobbrecht, and D. Solomatine (2003), Neural networks and reinforcement learning in control of water systems, Journal of Water Resources Planning and Management - ASCE, 129(6), 458-465.

Bonarini, A., A. Lazaric, and M. Restelli (2007), Piecewise constant reinforcement learning for robotic applications, in Proceedings of the 4th International Conference on Informatics in Control, Automation and Robotics (ICINCO 2007).

Breiman, L. (1996), Bagging predictors, Machine Learning, 24(2), 123-140.

Breiman, L. (2001), Random forests, Machine Learning, 45(1), 5-32.

Breiman, L., J. Friedman, R. Olshen, and C. Stone (1984), Classification and Regression Trees, Wadsworth & Brooks, Pacific Grove, CA.


Castelletti, A., M. De Zaiacomo, S. Galelli, M. Restelli, P. Sanavia, R. Soncini-Sessa, and J. Antenucci (2009), An emulation modelling approach to reduce the complexity of a hydrodynamic-ecological model of a reservoir, in Proceedings of the International Symposium on Environmental Software Systems (ISESS 2009), October 2-9, Venice, I.

Cervellera, C., V. Chen, and A. Wen (2006), Optimization of a large-scale water reservoir network by stochastic dynamic programming with efficient state space discretization, European Journal of Operational Research, 171(3), 1139-1151.

Cohn, D., Z. Ghahramani, and M. Jordan (1996), Active learning with statistical models, Journal of Artificial Intelligence Research, 4, 129-145.

Ernst, D. (1999), Near optimal closed-loop control. Application to electric power systems ...


Galelli, S., and R. Soncini-Sessa (2010), Combining metamodelling and stochastic dynamic programming for the design of reservoir release policies, Environmental Modelling & Software, 25(2), 209-222.

Galelli, S., C. Gandolfi, R. Soncini-Sessa, and D. Agostani (2010), Building a metamodel of an irrigation district distributed-parameter model, Agricultural Water Management, ..., 187-200.

Gaskett, C. (2002), Q-learning for robot control, Ph.D. thesis, Australian National University, Canberra, AUS.

Geurts, P., D. Ernst, and L. Wehenkel (2006), Extremely randomized trees, Machine Learning, 63(1), 3-42.


Haimes, Y. (1977), Hierarchical Analyses of Water Resources Systems, McGraw-Hill, NY.

Hall, W., and N. Buras (1961), The dynamic programming approach to water resources development, Journal of Geophysical Research, 66(2), 510-520.

Hall, W., W. Butcher, and A. Esogbue (1968), Optimization of the operation of a multiple-purpose reservoir by dynamic programming, Water Resources Research, 4(3), 471-477.

Heidari, M., V. Chow, P. Kokotovic, and D. Meredith (1971), Discrete differential dynamic programming approach to water resources systems optimization, Water Resources Research, 7(2), 273-282.

Hejazi, M., X. Cai, and B. Ruddell (2008), The role of hydrologic information in reservoir operation - learning from historical releases, Advances in Water Resources, 31(12), 1636-1650.


Kelman, J., J. Stedinger, L. Cooper, E. Hsu, and S. Yuan (1990), Sampling stochastic dynamic programming applied to reservoir operation, Water Resources Research, 26(3), 447-454.

Labadie, J. (2004), Optimal operation of multireservoir systems: State-of-the-art review, Journal of Water Resources Planning and Management - ASCE, 130(2), 93-111.

Larson, R. (1968), State Increment Dynamic Programming, American Elsevier, New York.

Lee, J.-H., and J. W. Labadie (2007), Stochastic optimization of multireservoir systems via reinforcement learning, Water Resources Research, 43(11), 1-16.

Luenberger, D. (1971), Cyclic dynamic programming: a procedure for problems ..., Operations Research, 19(4), 1101-1110.

Nardini, A., C. Piccardi, and R. Soncini-Sessa (1992), On the integration ...


Piccardi, C., and R. Soncini-Sessa (1991), Stochastic dynamic programming for reservoir optimal control: dense discretization and inflow correlation assumption made possible by parallel computing, Water Resources Research, 27(5), 729-741.

Read, E. (1989), A dual approach to stochastic dynamic programming for reservoir release scheduling, in Dynamic Programming for Optimal Water Resources Systems Analysis, Prentice-Hall, Englewood Cliffs.

Saad, M., and A. Turgeon (1988), Application of principal component analysis to long-term reservoir management, Water Resources Research, 24(7), 907-912.

Saad, M., A. Turgeon, and J. Stedinger (1992), Censored-data correlation and principal component dynamic programming, Water Resources Research, 28(8), 2135-2140.
