Control Techniques for Complex Systems
DESCRIPTION
The systems & control research community has developed a range of tools for understanding and controlling complex systems. Some of these techniques are model-based: using a simple model, we obtain insight into the structure of effective policies for control. The talk surveys how this point of view can be applied to resource allocation problems, such as those that will arise in the next-generation energy grid. We also show how insight from this kind of analysis can be used to construct architectures for reinforcement learning algorithms used in a broad range of applications. Much of the talk is a survey of a recent book by the author with a similar title: Control Techniques for Complex Networks, Cambridge University Press, 2007. https://netfiles.uiuc.edu/meyn/www/spm_files/CTCN/CTCN.html
TRANSCRIPT
Control Techniques for Complex Systems
Sean P. Meyn
Department of Electrical & Computer Engineering, University of Florida
Coordinated Science Laboratory and the Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, USA
April 21, 2011
1 / 26
Outline
[Book covers: Control Techniques for Complex Networks, Sean Meyn; and Markov Chains and Stochastic Stability, S. P. Meyn and R. L. Tweedie, the latter annotated with the stability conditions π(f) < ∞, ΔV(x) ≤ −f(x) + b I_C(x), ‖P^n(x, ·) − π‖_f → 0, sup_{x∈C} E_x[S_{τ_C}(f)] < ∞]
1 Control Techniques
2 Complex Networks
3 Architectures for Adaptation & Learning
4 Next Steps
2 / 26
Control Techniques
System model: ???
[Figure: an unidentified physical system, with partially legible dynamics of the form d/dt α = µσ − Cα + · · · , d/dt q = ½ µ I⁻¹(C − · · · ), d/dt θ = q]
Control Techniques?
3 / 26
Control Techniques
Typical steps to control design
• Obtain a simple model that captures essential structure
  – An equilibrium model if the goal is regulation
• Obtain a feedback design, using dynamic programming, LQG, loop shaping, ...
• Design for performance and reliability: test via simulations and experiments, and refine the design
If these steps fail, we may have to re-engineer the system (e.g., introduce new sensors) and start over.
This point of view is unique to control.
4 / 26
Control Techniques
Typical steps to scheduling
A simplified model of a semiconductor manufacturing facility. Similar demand-driven models can be used to model allocation of locational reserves in a power grid.
[Figure: 16-buffer network with two demand streams, demand 1 and demand 2]
Inventory model: controlled work-release, controlled routing, uncertain demand
• Obtain a simple model – frequently based on simple statistics to obtain a Markov model
• Obtain a feedback design based on heuristics, or dynamic programming
• Performance evaluation via computation (e.g., Neuts' matrix-geometric methods)
5 / 26
Control Techniques
Typical steps to scheduling
A simplified model of a semiconductor manufacturing facility. Similar demand-driven models can be used to model allocation of locational reserves in a power grid.
[Figure: 16-buffer network with two demand streams, demand 1 and demand 2]
Inventory model: controlled work-release, controlled routing, uncertain demand
Difficulty: a Markov model is not simple enough!
• Obtain a simple model – frequently based on exponential statistics to obtain a Markov model
• Obtain a feedback design based on heuristics, or dynamic programming
• Performance evaluation via computation (e.g., Neuts' matrix-geometric methods)
With the 16 buffers truncated to 0 ≤ x ≤ 10, policy synthesis reduces to a linear program of dimension 11^16!
6 / 26
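For concreteness, the arithmetic behind that dimension (a reconstruction; the slide states only the final count):

```latex
% Each of the 16 buffers is truncated to x_i \in \{0, 1, \dots, 10\},
% i.e., 11 possible levels per buffer, so the state space has
\[
  |\mathsf{X}| \;=\; 11^{16} \;=\; \bigl(11^{8}\bigr)^{2}
  \;=\; 214{,}358{,}881^{2} \;\approx\; 4.6 \times 10^{16}
\]
% states. An exact linear-programming formulation of policy synthesis has
% (at least) one variable per state, hence "dimension 11^16".
```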
Control Techniques
Control-theoretic approach to scheduling: d/dt q = Bu + α
[Figure: 16-buffer network with two demand streams, demand 1 and demand 2]
Inventory model: controlled work-release, controlled routing, uncertain demand
q: queue lengths, evolving on R^16_+
u: scheduling/routing decisions — convex relaxation
α: mean exogenous arrivals of work
B: captures the network topology
Control-theoretic approach to scheduling: dimension reduced from a linear program of dimension 11^16 ... to an HJB equation of dimension 16.
Does this solve the problem?
7 / 26
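As a concrete illustration (not from the talk), here is a minimal sketch of the fluid model d/dt q = Bu + α under forward-Euler integration, on an invented two-buffer tandem line; the matrix B, the rates, and the non-idling rule are placeholders, far simpler than the 16-buffer network above.

```python
import numpy as np

def simulate_fluid(B, alpha, policy, q0, dt=0.01, T=10.0):
    """Forward-Euler integration of the fluid model dq/dt = B u + alpha,
    projected onto the nonnegative orthant."""
    q = np.asarray(q0, dtype=float)
    traj = [q.copy()]
    for _ in range(int(T / dt)):
        u = policy(q)                                   # allocation chosen by the policy
        q = np.maximum(q + dt * (B @ u + alpha), 0.0)   # queues stay nonnegative
        traj.append(q.copy())
    return np.array(traj)

# Toy two-buffer tandem line: station 1 drains buffer 1 into buffer 2,
# station 2 drains buffer 2 (all rates illustrative).
mu = np.array([3.0, 2.0])                 # service rates
B = np.array([[-mu[0],    0.0],
              [ mu[0], -mu[1]]])          # network topology
alpha = np.array([1.0, 0.0])              # exogenous arrivals to buffer 1

def nonidling(q):
    # Work at full rate at any station whose buffer holds fluid.
    return (q > 0).astype(float)

traj = simulate_fluid(B, alpha, nonidling, q0=[5.0, 5.0])
print(traj[-1])   # both buffers drain, since service capacity exceeds the load
```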
Complex Networks
[Figure: three network snapshots, labeled Uncongested, Congested, and Highly Congested]
Complex Networks
First, a review of some control theory...
8 / 26
Complex Networks
Dynamic Programming Equations
Deterministic model: d/dt x = f(x, u)
Controlled generator:
  D_u h(x) = d/dt h(x(t)) |_{t=0, x(0)=x, u(0)=u} = f(x, u) · ∇h(x)
Minimal total cost:
  J*(x) = inf_U ∫₀^∞ c(x(t), u(t)) dt,  x(0) = x
HJB equation:
  min_u { c(x, u) + D_u J*(x) } = 0
9 / 26
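The slide states the HJB equation without derivation. A sketch of the standard verification argument, added here for completeness:

```latex
% Verification sketch (standard; not spelled out on the slide).
% Suppose J solves the HJB equation  min_u { c(x,u) + D_u J(x) } = 0,
% and u^*(t) attains the minimum along the resulting trajectory x(t). Then
\[
  \frac{d}{dt} J(x(t)) \;=\; D_{u^*(t)} J\,(x(t)) \;=\; -\,c(x(t), u^*(t)),
\]
% so integrating over [0, T] and letting T -> infinity (assuming J(x(T)) -> 0),
\[
  J(x(0)) \;=\; \int_0^\infty c(x(t), u^*(t))\,dt .
\]
% For any other admissible input, the minimum in the HJB equation gives
% (d/dt) J(x(t)) >= -c(x(t), u(t)); the same integration shows J(x(0)) is a
% lower bound on the total cost, so J = J^* and u^* is optimal.
```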
Complex Networks
Dynamic Programming Equations
Diffusion model: dX = f(X, U) dt + σ(X) dN
Controlled generator:
  D_u h(x) = d/dt E[h(X(t))] |_{t=0, x(0)=x, u(0)=u}
           = f(x, u) · ∇h(x) + ½ trace(σ(x)ᵀ ∇²h(x) σ(x))
Minimal average cost:
  η* = inf_U lim_{T→∞} (1/T) ∫₀ᵀ c(X(t), U(t)) dt
ACOE (Average Cost Optimality Equation):
  min_u { c(x, u) + D_u h*(x) } = η*
h* is the relative value function.
10 / 26
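For readers who have not seen it: the second-order term in the generator comes from Itô's formula, sketched here (a standard computation, not on the slide):

```latex
% Ito's formula applied to h(X(t)), with dX = f(X,U) dt + sigma(X) dN
% and N a standard Brownian motion:
\begin{align*}
  dh(X) &= \nabla h(X)\cdot dX + \tfrac{1}{2}\, dX^{\mathsf T}\, \nabla^2 h(X)\, dX \\
        &= \Bigl[\, f(X,U)\cdot\nabla h(X)
           + \tfrac{1}{2}\operatorname{trace}\bigl(\sigma(X)^{\mathsf T}\nabla^2 h(X)\,\sigma(X)\bigr) \Bigr] dt
           + \nabla h(X)^{\mathsf T}\sigma(X)\, dN .
\end{align*}
% Taking expectations removes the dN (martingale) term, which leaves exactly
% the generator stated on the slide.
```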
Complex Networks
Dynamic Programming Equations
MDP model: X(t+1) − X(t) = f(X(t), U(t), N(t+1))
Controlled generator:
  D_u h(x) = E[h(X(1)) − h(X(0))] = E[h(x + f(x, u, N))] − h(x)
Minimal average cost:
  η* = inf_U lim_{T→∞} (1/T) Σ_{t=0}^{T−1} c(X(t), U(t))
ACOE (Average Cost Optimality Equation):
  min_u { c(x, u) + D_u h*(x) } = η*
h* is the relative value function.
11 / 26
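A minimal sketch of how the ACOE can be solved numerically for a small finite MDP, via relative value iteration; the three-state chain and the costs below are invented and unrelated to the network examples:

```python
import numpy as np

# Relative value iteration for the ACOE  min_u { c(x,u) + D_u h(x) } = eta*,
# where D_u h(x) = E[h(X(1))] - h(x) for a finite MDP.
rng = np.random.default_rng(0)
n_x, n_u = 3, 2
P = rng.dirichlet(np.ones(n_x), size=(n_x, n_u))   # P[x, u] is a distribution over next states
c = rng.uniform(size=(n_x, n_u))                   # one-step cost c(x, u)

h = np.zeros(n_x)
for _ in range(500):
    # Bellman operator: (Th)(x) = min_u [ c(x,u) + sum_y P(y | x,u) h(y) ]
    Th = (c + np.einsum('xuy,y->xu', P, h)).min(axis=1)
    eta = Th[0] - h[0]        # running estimate of the optimal average cost
    h = Th - Th[0]            # normalization: relative value function with h(0) = 0

policy = (c + np.einsum('xuy,y->xu', P, h)).argmin(axis=1)
print("eta* ~", eta, "  h* ~", h, "  policy:", policy)
```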
Complex Networks
Approximate Dynamic Programming
ODE model from the MDP model X(t+1) − X(t) = f(X(t), U(t), N(t+1)):
Mean drift: f̄(x, u) = E[X(t+1) − X(t) | X(t) = x, U(t) = u]
Fluid model: d/dt x(t) = f̄(x(t), u(t))
First-order Taylor series approximation:
  D_u h(x) = E[h(x + f(x, u, N))] − h(x) ≈ f̄(x, u) · ∇h(x)
A second-order Taylor series expansion leads to a diffusion model.
12 / 26
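When the dynamics are available only as a simulator, the mean drift can be estimated by plain Monte Carlo; a sketch, with an invented single-queue transition rule standing in for the real model:

```python
import numpy as np

def mean_drift(step, x, u, n_samples=10_000, rng=None):
    """Monte Carlo estimate of the mean drift
    fbar(x, u) = E[X(t+1) - X(t) | X(t) = x, U(t) = u],
    where step(x, u, rng) draws one transition of the MDP."""
    rng = rng or np.random.default_rng()
    return np.mean([step(x, u, rng) - x for _ in range(n_samples)], axis=0)

# Illustrative transition rule: a single queue with Bernoulli arrivals
# (rate 0.4) and Bernoulli service (rate 0.7) when the server works (u = 1).
def step(x, u, rng):
    arrival = rng.random() < 0.4
    service = (u == 1) and (x > 0) and (rng.random() < 0.7)
    return x + arrival - service

print(mean_drift(step, x=5, u=1))   # close to 0.4 - 0.7 = -0.3
```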
Complex Networks
ADP for Stochastic Networks
Conclusions as of April 21, 2011
Stochastic model: Q(t+1) − Q(t) = B(t+1) U(t) + A(t+1)
Fluid model: d/dt q(t) = Bu(t) + α, with cost c(x, u) = |x|
Relative value function h*; total-cost value function J*
[Figure: 16-buffer network with two demand streams, demand 1 and demand 2]
Inventory model: controlled work-release, controlled routing, uncertain demand
q: queue lengths, evolving on R^16_+
u: scheduling/routing decisions — convex relaxation
α: mean exogenous arrivals of work
B: captures the network topology
13 / 26
Complex Networks
ADP for Stochastic Networks
Conclusions as of April 21, 2011
Stochastic model: Q(t+1) − Q(t) = B(t+1) U(t) + A(t+1)
Fluid model: d/dt q(t) = Bu(t) + α, with cost c(x, u) = |x|
Relative value function h*; total-cost value function J*
Key conclusions – analytical
• Stability of q implies stochastic stability of Q [Dai 1995, Dai & M. 1995]
• h*(x) ≈ J*(x) for large |x| [M. 1996–2011]
• In many cases, the translation of the optimal policy for q is approximately optimal, with logarithmic regret [M. 2005 & 2009]
14 / 26
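The first bullet rests on the Foster–Lyapunov machinery of Markov Chains and Stochastic Stability (the conditions glimpsed on the outline slide); stated here for reference:

```latex
% Foster-Lyapunov drift criterion, as in Meyn & Tweedie (condition (V3)):
% for V : X -> [1, infinity), f >= 1, a constant b < infinity, and a small set C,
\[
  \Delta V(x) \;:=\; \mathsf{E}\bigl[\,V(X(t+1)) - V(X(t)) \mid X(t) = x\,\bigr]
  \;\le\; -\,f(x) \;+\; b\,\mathbb{I}_C(x)
\]
% implies positive recurrence with \pi(f) < \infty and
% \| P^n(x,\,\cdot\,) - \pi \|_f \to 0. In the network setting, a Lyapunov
% function V constructed from the fluid value function verifies this drift
% bound, which is one route to "stability of q implies stability of Q".
```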
Complex Networks
ADP for Stochastic Networks
Conclusions as of April 21, 2011
Stochastic model: Q(t+1) − Q(t) = B(t+1) U(t) + A(t+1)
Fluid model: d/dt q(t) = Bu(t) + α, with cost c(x, u) = |x|
Relative value function h*; total-cost value function J*
Key conclusions – engineering
• Stability of q implies stochastic stability of Q
• Simple decentralized policies based on q [Tassiulas, 1995–]
• Workload relaxation for model reduction [M. 2003–, following "heavy traffic" theory: Laws, Kelly, Harrison, Dai, ...]
• Intuition regarding the structure of good policies
15 / 26
Complex Networks
ADP for Stochastic Networks
Workload Relaxations
[Figure: 16-buffer network with two demand streams, demand 1 and demand 2]
Inventory model: controlled work-release, controlled routing, uncertain demand
[Figure: the workload plane (w1, w2), axes roughly −20 to 50, showing the fluid-optimal region R* and the stochastic-optimal region R_STO]
Workload process: W evolves on R²
Relaxation: only lower bounds on rates are preserved
Effective cost: c(w) is the minimum of c(x) over all x consistent with w
Optimal policy for the fluid relaxation: non-idling on the region R*
Optimal policy for the stochastic relaxation: introduce hedging
16 / 26
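To give a feel for what "hedging" means operationally, here is a deliberately crude one-dimensional sketch: a hedging-point policy idles only once the workload falls below a negative threshold, rather than at zero. All dynamics, costs, and thresholds are invented; in the talk, the hedging region R_STO is computed from the stochastic relaxation.

```python
import numpy as np

def hedging_policy(w, threshold):
    """One-dimensional caricature of hedging: keep serving until the workload
    w falls below a negative threshold (the hedging point), instead of idling
    at w = 0 as the fluid-optimal non-idling policy would."""
    return 1.0 if w > threshold else 0.0   # 1.0 = serve at full rate

rng = np.random.default_rng(1)

def average_cost(threshold, T=200_000):
    w, total = 0.0, 0.0
    for _ in range(T):
        u = hedging_policy(w, threshold)
        w += 0.9 - u + rng.normal(0.0, 2.0)        # load 0.9, service u, noise
        total += max(w, 0.0) + 0.1 * max(-w, 0.0)  # backlog cost + small inventory cost
    return total / T

# Hedging trades a small inventory cost against less frequent costly backlog.
for th in (0.0, -2.0, -5.0):
    print(f"threshold {th}: average cost ~ {average_cost(th):.2f}")
```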
Complex Networks
ADP for Stochastic Networks
Policy translation
[Figure: 16-buffer network with two demand streams, demand 1 and demand 2]
Inventory model: controlled work-release, controlled routing, uncertain demand
[Figure: the workload plane (w1, w2) with regions R* and R_STO]
Complete policy synthesis:
1. Optimal control of the relaxation
2. Translation to the physical system:
   2a. Achieve the approximation c(Q(t)) ≈ c(W(t))
   2b. Address boundary constraints ignored in fluid approximations – achieved using safety stocks
17 / 26
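A sketch of step 2b on an invented two-buffer tandem line: the safety stock keeps the downstream buffer positive, so the relaxation's cost approximation is not destroyed by starvation. Buffer sizes, probabilities, and the safety level are placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)

def policy(q, safety=3):
    """Translation of a relaxation policy to a two-buffer tandem line.
    The relaxation tracks only workload, and is indifferent to how work is
    split between the buffers; the safety stock resolves the ambiguity by
    keeping buffer 2 at a small positive level so station 2 is not starved."""
    u1 = 1.0 if (q[0] > 0 and q[1] < safety) else 0.0   # replenish buffer 2
    u2 = 1.0 if q[1] > 0 else 0.0                       # serve demand
    return u1, u2

q = np.array([20.0, 0.0])
starved = 0
for t in range(1000):
    u1, u2 = policy(q)
    done1 = u1 if rng.random() < 0.9 else 0.0   # station 1 completes w.p. 0.9
    done2 = u2 if rng.random() < 0.8 else 0.0   # station 2 completes w.p. 0.8
    starved += (q[1] == 0)                      # station 2 idle for lack of work
    q[0] += (rng.random() < 0.5) - done1        # arrivals to buffer 1
    q[1] += done1 - done2
    q = np.maximum(q, 0.0)
print("fraction of time station 2 is starved:", starved / 1000)
```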
Architectures for Adaptation & Learning
Singular Perturbations
[Collage of figures from the preceding sections: workload relaxations (the (w1, w2) plane with regions R* and R_STO); the fluid model and its optimal policy; average cost vs. iteration for the standard value iteration algorithm (VIA), initialized with a quadratic vs. with the optimal fluid value function; a diffusion model; a mean-field game trajectory of individual state vs. ensemble state, with Agent 4 shown and Agent 5 barely controllable; and the 16-buffer, 5-station network with buffers q1–q16, demands d1, d2, and service rates µ10a, µ10b]
Adaptation & Learning
18 / 26
Architectures for Adaptation & Learning
Reinforcement Learning
Approximating a value function: Q-learning
ACOE: min_u { c(x, u) + D_u h*(x) } = η*
  h*: relative value function
  η*: minimal average cost
"Q-function": Q*(x, u) = c(x, u) + D_u h*(x)   [Watkins 1989 ... "Machine Intelligence Lab" @ece.ufl.edu]
Q-learning: given a parameterized family {Qθ : θ ∈ R^d}, Qθ is an approximation of the Q-function, or Hamiltonian [Mehta & M. 2009].
Compute θ* based on observations — without using a system model.
19 / 26
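For orientation, a minimal sketch of tabular Q-learning in the more familiar discounted-cost setting (Watkins' algorithm); the continuous-time ACOE/Hamiltonian version referred to on the slide differs in detail (see Mehta & M. 2009). The four-state MDP below is invented.

```python
import numpy as np

rng = np.random.default_rng(3)
n_x, n_u, gamma, alpha = 4, 2, 0.95, 0.1
P = rng.dirichlet(np.ones(n_x), size=(n_x, n_u))   # transition kernel P(y | x, u)
c = rng.uniform(size=(n_x, n_u))                   # one-step cost c(x, u)

Q = np.zeros((n_x, n_u))
x = 0
for t in range(50_000):
    # epsilon-greedy exploration around the current Q estimate
    u = rng.integers(n_u) if rng.random() < 0.1 else int(Q[x].argmin())
    x_next = rng.choice(n_x, p=P[x, u])
    # Watkins' update toward the one-step target (cost-minimizing convention)
    target = c[x, u] + gamma * Q[x_next].min()
    Q[x, u] += alpha * (target - Q[x, u])
    x = x_next

print("greedy policy:", Q.argmin(axis=1))   # learned without using P or c directly
```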
Architectures for Adaptation & Learning
Reinforcement Learning
Approximating a value function: TD-learning
Value functions: for a given policy U(t) = φ(X(t)),
  η = lim_{T→∞} (1/T) ∫₀ᵀ c(X(t), U(t)) dt
Poisson's equation: h is again called a relative value function,
  { c(x, u) + D_u h(x) } |_{u=φ(x)} = η
TD-learning: given a parameterized family {hθ : θ ∈ R^d}, solve
  min { ‖h − hθ‖ : θ ∈ R^d }   [Sutton 1988; Tsitsiklis & Van Roy 1997]
Compute θ* based on observations — without using a system model.
20 / 26
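A minimal sketch of TD(0) with the linear parameterization hθ(x) = Σ θi ψi(x), given here in the discounted setting for simplicity; the chain, the fixed policy, and the basis are all invented:

```python
import numpy as np

rng = np.random.default_rng(4)
n_x, gamma, step_size = 6, 0.95, 0.05
P = rng.dirichlet(np.ones(n_x), size=n_x)   # chain under the fixed policy phi
c = rng.uniform(size=n_x)                   # cost observed along the trajectory

def psi(x):
    # Hypothetical basis: constant, linear, and quadratic in the (scaled) state.
    xs = x / n_x
    return np.array([1.0, xs, xs * xs])

theta = np.zeros(3)
x = 0
for t in range(100_000):
    x_next = rng.choice(n_x, p=P[x])
    # Temporal-difference error: d = c(x) + gamma h_theta(x') - h_theta(x)
    d = c[x] + gamma * theta @ psi(x_next) - theta @ psi(x)
    theta += step_size * d * psi(x)          # TD(0) stochastic-approximation update
    x = x_next

print("fitted theta:", theta)
```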
Architectures for Adaptation & Learning
Reinforcement Learning
Approximating a value function: how do we choose a basis?
Basis selection: hθ(x) = Σ θi ψi(x)
  ψ1: Linearize
  ψ2: Fluid model with relaxation
  ψ3: Diffusion model with relaxation
  ψ4: Mean-field game
Examples: decentralized control, nonlinear control, processor speed-scaling
[Figures: an optimal policy for a diffusion model; a mean-field game trajectory (Agent 4); and the approximate relative value function hθ plotted against the fluid value function J* and the relative value function h*, in panels labeled Mean-Field Game, Linearization, and Fluid Model]
21 / 26
Next Steps
[Figure: nodal power prices in NZ ($/MWh) at Stratford and Otahuhu, 4am–7pm, on March 25 and March 26; prices of roughly 0–100 $/MWh on one day spike to the 10,000–20,000 $/MWh range on the other. Source: http://www.electricityinfo.co.nz/]
Next Steps
22 / 26
Next Steps
Complex Systems
Mainly energy
Entropic Grid: advances in systems theory...
• Complex systems: model reduction specialized to tomorrow's grid – short-term operations and long-term planning
• Resource allocation: controlling supply, storage, and demand – resource allocation with shared constraints
• Statistics and learning: for planning and forecasting – both rare and common events
• Economics for an Entropic Grid: incorporate dynamics and uncertainty in a strategic setting
How to create policies to protect participants on both sides of the market, while creating incentives for R&D on renewable energy?
23 / 26
Next Steps
Complex Systems
Mainly energy
How to create policies to protect participants on both sides of the market, while creating incentives for R&D on renewable energy?
Our community must consider long-term planning and policy, along with traditional systems operations.
Planning and Policy includes Markets & Competition.
Evolution? Too slow! What we need is Intelligent Design.
24 / 26
Next Steps
Conclusions
The control community has created many techniques for understanding complex systems, and a valuable philosophy for thinking about control design.
In particular, stylized models can have great value:
• Insight into the formulation of control policies
• Analysis of closed-loop behavior, such as stability via ODE methods
• Architectures for learning algorithms
• Building bridges between the OR, CS, and control disciplines
The ideas surveyed here arose from partnerships with researchers in mathematics, economics, computer science, and operations research.
Besides the many technical open questions, my hope is to extend the application of these ideas to long-range planning, especially in applications to sustainable energy.
25 / 26
Next Steps
References
S. P. Meyn. Control Techniques for Complex Networks. Cambridge University Press, Cambridge, 2007.
S. P. Meyn and R. L. Tweedie. Markov Chains and Stochastic Stability. Second edition, Cambridge University Press – Cambridge Mathematical Library, 2009.
S. Meyn. Stability and asymptotic optimality of generalized MaxWeight policies. SIAM J. Control Optim., 47(6):3259–3294, 2009.
V. S. Borkar and S. P. Meyn. The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM J. Control Optim., 38(2):447–469, 2000.
S. P. Meyn. Sequencing and routing in multiclass queueing networks. Part II: Workload relaxations. SIAM J. Control Optim., 42(1):178–217, 2003.
P. G. Mehta and S. P. Meyn. Q-learning and Pontryagin's minimum principle. In Proc. of the 48th IEEE Conf. on Dec. and Control, pp. 3598–3605, Dec. 2009.
W. Chen, D. Huang, A. A. Kulkarni, J. Unnikrishnan, Q. Zhu, P. Mehta, S. Meyn, and A. Wierman. Approximate dynamic programming using fluid and diffusion approximations with applications to power management. In Proc. of the 48th IEEE Conf. on Dec. and Control, pp. 3575–3580, Dec. 2009.
26 / 26