Limitation of Markov Models and Event-Based Learning


Page 1: Limitation of Markov Models and Event-Based Learning

Plenary Presentation at
2008 Chinese Control and Decision Conference, July 2, 2008, Yantai, China

Limitation of Markov Models and Event-Based Learning & Optimization

Xi-Ren Cao

Hong Kong University of Science and Technology

Page 2: Limitation of Markov Models and Event-Based Learning


Table of Contents

0. Review: Optimization Problems (state-based policies)

1. Event-Based Formulation
   - Limitation of the state-based formulation
   - Events and event-based policies
   - Event-based formulation

2. Sensitivity-Based Approach to Optimization
   - A unified framework for optimization
   - Extensions to event-based optimization

3. Summary

Structure of the Presentation

Overview of State-Based Optimization
Introduction to Event-Based Formulation
Sensitivity-Based Approach to State-Based Optimization
Solution to Event-Based Optimization

Page 3: Limitation of Markov Models and Event-Based Learning

A Typical Formulation of a Control Problem (Continuous-Time, Continuous-State Model)

System dynamics: $\frac{dx}{dt} = Ax + Bu + w$, with feedback control $u = -Cx$
x: state;  u: control variable;  w: random noise

Performance measure: $\eta = \frac{1}{T} E\left\{\int_0^T f[x(t), u(t)]\, dt\right\}$

LQG problem: $\eta = \frac{1}{T} E\left\{\int_0^T \left(x^{\tau} A x + u^{\tau} B u\right) dt\right\}$

A policy is a mapping $u(x): x \to u$; the control u depends on the state x.

Page 4: Limitation of Markov Models and Event-Based Learning

Discrete-Time, Discrete-State Model (I) - an example

A random walk of a robot

[Figure: states 0-4; from state 0 the robot moves up with probability p or down with probability q, p + q = 1, and then reaches state 1 or 2 (up branch) or state 3 or 4 (down branch) with probabilities α and 1 - α]

Reward function: f(0) = 0, f(1) = f(4) = 100, f(2) = f(3) = -100

Performance measure: $\eta = \lim_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1} f(X_t)$
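To make the performance measure concrete, the chain can be simulated. Below is a minimal sketch (not from the talk) that estimates η for given p and α; it assumes, consistent with the sample path shown on slide 6, that the robot returns to state 0 after visiting any of states 1-4.

```python
import random

def simulate_eta(p=0.5, alpha=0.5, T=200_000, seed=0):
    """Estimate eta = lim (1/T) sum f(X_t) for the robot random walk.

    From state 0 the robot goes up with prob. p, down with prob. q = 1 - p;
    on the upper branch it reaches state 1 w.p. alpha, state 2 w.p. 1 - alpha;
    on the lower branch it reaches state 3 w.p. alpha, state 4 w.p. 1 - alpha.
    From states 1-4 it returns to state 0 (an assumption consistent with the
    sample path on the next slide).
    """
    f = {0: 0, 1: 100, 2: -100, 3: -100, 4: 100}   # reward function
    rng = random.Random(seed)
    x, total = 0, 0
    for _ in range(T):
        total += f[x]
        if x == 0:
            up = rng.random() < p
            first = rng.random() < alpha           # pick state 1 or 3 w.p. alpha
            x = (1 if first else 2) if up else (3 if first else 4)
        else:
            x = 0
    return total / T

if __name__ == "__main__":
    # With p = q = 1/2 the estimate is near 0 for any alpha (cf. slide 13).
    print(simulate_eta(p=0.5, alpha=0.9))
```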

Page 5: Limitation of Markov Models and Event-Based Learning


Shannon Mouse (Theseus)

Page 6: Limitation of Markov Models and Event-Based Learning

Discrete Model (II) - the dynamics

A Sample Path (system dynamics): a random walk of a robot

[Figure: a sample path X_t plotted against t, alternating between state 0 and the outer states (e.g. 4, 0, 3, 0, 1, 0, 2, 0, ...), together with the random-walk diagram with rewards (0), (100), (-100), (-100), (100) and probabilities p, q, α, 1 - α]

Page 7: Limitation of Markov Models and Event-Based Learning

Discrete Model (III) - the Markov model

System dynamics:
- X = {X_n, n = 1, 2, ...}, X_n in S = {1, 2, ..., M}
- Transition probability matrix P = [p(i,j)], i, j = 1, ..., M

System performance:
- Reward function: f = (f(1), ..., f(M))^T
- Performance measure: $\eta = \lim_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1} f(X_t) = \sum_{i \in S} \pi(i) f(i) = \pi f$

Steady-state probability: π = (π(1), π(2), ..., π(M)), satisfying π(I - P) = 0 and πe = 1, where I is the identity matrix and e = (1, ..., 1)^T.

[Figure: a three-state transition diagram labeled with p(1,2), p(2,3), p(3,1), etc., and the random-walker diagram]
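For a concrete check of these definitions, the steady-state probability can be computed directly from P by solving π(I - P) = 0 with πe = 1. Below is a minimal numpy sketch (not from the talk) applied to the random walker with p = q = 1/2 and an illustrative α = 0.7; it then evaluates η = πf.

```python
import numpy as np

def steady_state(P):
    """Solve pi (I - P) = 0 together with the normalization pi e = 1."""
    M = P.shape[0]
    A = np.vstack([(np.eye(M) - P).T, np.ones(M)])  # append normalization row
    b = np.zeros(M + 1)
    b[-1] = 1.0
    return np.linalg.lstsq(A, b, rcond=None)[0]

# Random walker: states 0..4; from 0 the robot reaches 1, 2, 3, 4 with
# probabilities p*alpha, p*(1-alpha), q*alpha, q*(1-alpha); from 1-4 it
# returns to 0 (an assumption consistent with the sample path on slide 6).
p = q = 0.5
alpha = 0.7                                        # illustrative value
P = np.zeros((5, 5))
P[0, 1], P[0, 2] = p * alpha, p * (1 - alpha)
P[0, 3], P[0, 4] = q * alpha, q * (1 - alpha)
P[1:, 0] = 1.0
f = np.array([0, 100, -100, -100, 100], dtype=float)

pi = steady_state(P)
print(pi, pi @ f)                                  # eta = pi f
```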

Page 8: Limitation of Markov Models and Event-Based Learning

Control of Transition Probabilities

[Figure: the random-walk diagram]

Turn on red with probability α: the robot moves to the left.
Turn on green with probability 1 - α: the robot moves to the right.

Page 9: Limitation of Markov Models and Event-Based Learning

Discrete Model (IV) - Markov Decision Processes (MDPs): the control model

α: an action; it controls the transition probabilities
p^α(i,j): governs the system dynamics
α = d(x): a policy (state based)

[Figure: a state-transition diagram with transition probabilities p^α(i,j) and α = d(x)]

System dynamics: Markov model. Performance depends on the policy: π^d, η^d, etc., with
$\eta^d = \lim_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1} f(X_t)$

Goal of optimization: find a policy d that maximizes η^d over the policy space.

(The continuous-time analogue: $\frac{dx}{dt} = Ax + Bu + w$, $u = -Cx$.)

Page 10: Limitation of Markov Models and Event-Based Learning

0. Review: Optimization Problems (state-based policies)

1. Event-Based Optimization
   - Limitation of the state-based formulation
   - Events and event-based policies
   - Event-based optimization

2. Sensitivity-Based Approach to Optimization
   - A unified framework for optimization
   - Extensions to event-based optimization

3. Summary

Overview of State-Based Optimization
Introduction to Event-Based Optimization
Sensitivity-Based Approach to State-Based Optimization
Solution to Event-Based Optimization

Page 11: Limitation of Markov Models and Event-Based Learning

The policy space is too large:
M = 100 states, N = 2 actions,
N^M = 2^100 ≈ 10^30 policies (at 10 GHz, about 3 * 10^12 years just to count them!)

Special structures not utilized

Limitation of State-Based Formulation (I)

May not perform well
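A quick sanity check of these numbers (assuming one policy counted per clock cycle at 10 GHz):

```python
# 2^100 policies counted at 10^10 per second, converted to years.
policies = 2 ** 100                          # about 1.27e30
years = policies / 1e10 / (365.25 * 24 * 3600)
print(f"{policies:.2e} policies, ~{years:.1e} years to count them")
# -> roughly 4e12 years, the same order of magnitude as the slide's estimate
```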

Page 12: Limitation of Markov Models and Event-Based Learning

Limitation of State-Based Formulation (II)

Example: Random walk of a robot

Choose α to maximize the average performance.

[Figure: the random-walk diagram - states 0-4 with rewards (0), (100), (-100), (-100), (100), branch probabilities p, q and choice probabilities α, 1 - α]

Page 13: Limitation of Markov Models and Event-Based Learning

Limitation of State-Based Formulation (III)

Transition probabilities from state 0: pα to state 1, p(1 - α) to state 2, qα to state 3, q(1 - α) to state 4.

At state 0, if the robot moves up, α needs to be as large as possible; if it moves down, α needs to be as small as possible. A large α leads to a large reward at state 1 (100) but a small reward at state 3 (-100); a small α leads to a large reward at state 4 (100) but a small reward at state 2 (-100).

Let p = q = 1/2. The average performance in the next step is 0, no matter what α you choose (the best you can do with a state-based model).
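The claim that the expected next-step reward is 0 for p = q = 1/2 can be written out in one line using the transition probabilities above:

```latex
\[
E\big[f(X_{t+1}) \mid X_t = 0\big]
  = p\alpha(100) + p(1-\alpha)(-100) + q\alpha(-100) + q(1-\alpha)(100)
  = 100\,(p - q)(2\alpha - 1) = 0 \quad \text{when } p = q = \tfrac{1}{2}.
\]
```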

Page 14: Limitation of Markov Models and Event-Based Learning

We can do better!

Group the two up transitions together as an event "a" and the two down transitions as an event "b". When "a" happens, choose the largest α; when "b" happens, choose the smallest α. Average performance = 100, if α = 1.

[Figure: the random-walk diagram with the up transitions (probability 1/2) labeled event a, where α is chosen large, and the down transitions (probability 1/2) labeled event b, where α is chosen small]
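A small numeric check of the two claims (expected next-step reward 0 under any state-based α, 100 under the event-based rule); this is a sketch with illustrative names, not code from the talk:

```python
REWARD = {1: 100, 2: -100, 3: -100, 4: 100}

def next_step_reward(alpha_up, alpha_down, p=0.5):
    """Expected reward of the state entered from state 0 when alpha may
    depend on the event (a = moving up, b = moving down)."""
    q = 1 - p
    up = alpha_up * REWARD[1] + (1 - alpha_up) * REWARD[2]
    down = alpha_down * REWARD[3] + (1 - alpha_down) * REWARD[4]
    return p * up + q * down

# State-based policy: the same alpha for both events -> 0 for any alpha.
print(next_step_reward(0.3, 0.3), next_step_reward(0.9, 0.9))  # ~0.0  ~0.0
# Event-based policy: alpha = 1 on event a, alpha = 0 on event b -> 100.
print(next_step_reward(1.0, 0.0))                              # 100.0
```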

Page 15: Limitation of Markov Models and Event-Based Learning

Events and Event-Based Policies

An event is defined as a set of state transitions.

Event-based optimization:
• May lead to better performance than the state-based formulation
• The MDP model may not fit:
  - only a part of the transitions is controlled
  - an event may consist of transitions from many states
• May reflect and utilize special structures

Questions:
• Why may it be better?
• How general is the formulation?
• How to solve event-based optimization problems?

[Figure: transitions out of state 0 with probabilities pα and p(1 - α) grouped as Event a, and qα and q(1 - α) grouped as Event b; the random-walk diagram with α large on event a and α small on event b]

Page 16: Limitation of Markov Models and Event-Based Learning

Notation:
- A single transition: <i,j>, with i, j in S = {1, 2, ..., M}
- An event: a set of transitions ($2^{M \times M}$ possible sets)
- a = {<0,1>, <0,2>},  b = {<0,3>, <0,4>}

Why is it better? An event contains information about the future (compared with state-based policies).

Physical interpretation:

[Figure: the random-walk diagram, with (a) the up transitions and (b) the down transitions, each split with probabilities α and 1 - α]

Page 17: Limitation of Markov Models and Event-Based Learning

How general is the formulation? Admission control.

[Figure: an open queueing network with arrival rate λ, admission probability α(n) (rejection probability 1 - α(n)), and routing probabilities q_0i, q_ij]

n: population, the number of customers in the network
n_i: number of customers at server i;  n = (n_1, ..., n_M): state;  N: network capacity

Event: a customer arrival finding population n
Action: accept or reject; the action applies only when the event occurs

MDP does not apply: the same action is applied to different states with the same population.

Page 18: Limitation of Markov Models and Event-Based Learning

Riemann Sampling vs. Lebesgue Sampling

Lebesgue sampling: sample the system whenever the signal reaches a certain prespecified level, and apply control then.

[Figure: Riemann sampling (RS) samples at fixed time instants t_1, t_2, ..., t_k, ...; Lebesgue sampling (LS) samples when the signal crosses prespecified levels d_1, ..., d_5]

Page 19: Limitation of Markov Models and Event-Based Learning

A Model for Stock Prices or Financial Assets

[Figure: a sample path X(t) hitting a threshold level x* at time τ*_1]

$dX(t) = b(t, X(t))\,dt + \sigma(t, X(t))\,dw(t) + \int \gamma(t, X(t^-), z)\, N(dt, dz)$

w(t): Brownian motion;  N(dt, dz): Poisson random measure;  X(t): Ito-Levy process

Page 20: Limitation of Markov Models and Event-Based Learning

How to solve event-based optimization problems?

0. Review: Optimization Problems (state-based policies)

1. Event-Based Optimization
   - Limitation of the state-based formulation
   - Events and event-based policies
   - Event-based optimization

2. Sensitivity-Based Approach to Optimization
   - A unified framework for optimization
   - Extensions to event-based optimization

3. Summary

Overview of State-Based Optimization
Introduction to Event-Based Optimization
Sensitivity-Based Approach to State-Based Optimization
Solution to Event-Based Optimization

Page 21: Limitation of Markov Models and Event-Based Learning


An overview of the paths to the top of a hill

Page 22: Limitation of Markov Models and Event-Based Learning

A Sensitivity-Based View of Optimization

Continuous parameters (perturbation analysis), θ → θ + Δθ:
$\frac{d\eta}{d\delta} = \pi Q g$

Discrete policy space (policy iteration):
$\eta' - \eta = \pi' Q g$

η: performance;  π: steady-state probability;  g: performance potentials;  Q = P' - P

Page 23: Limitation of Markov Models and Event-Based Learning

Poisson Equation

g(i) = potential contribution of state i (potential, or bias)
     = contribution of the current state, f(i) - η, plus the expected long-term contribution after a transition:

$g(i) = f(i) - \eta + \sum_{j=1}^{M} p(i,j)\, g(j)$

In matrix form (the Poisson equation): $(I - P)g + \eta e = f$

Potential is relative: if g(i), i = 1, ..., M, is a solution, so is g(i) + c for any constant c.

Physical interpretation:

$g(i) = E\left\{\sum_{l=0}^{\infty}\big[f(X_l) - \eta\big] \,\Big|\, X_0 = i\right\}$

g(4) ≈ the average of $\sum_l f(X_l)$ over sample-path segments starting from X_0 = 4.

[Figure: a sample path X_t with segments starting from state 4 highlighted]
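Numerically, one common way to pin down the additive constant is to solve the normalized equation (I - P + eπ)g = f, whose solution satisfies the Poisson equation above with πg = η. A minimal numpy sketch (not from the talk), again using the random-walker chain with illustrative numbers:

```python
import numpy as np

def steady_state(P):
    M = P.shape[0]
    A = np.vstack([(np.eye(M) - P).T, np.ones(M)])
    b = np.zeros(M + 1); b[-1] = 1.0
    return np.linalg.lstsq(A, b, rcond=None)[0]

def potentials(P, f):
    """Solve (I - P + e pi) g = f; the solution satisfies the Poisson
    equation (I - P) g + eta e = f with the normalization pi g = eta."""
    pi = steady_state(P)
    M = P.shape[0]
    g = np.linalg.solve(np.eye(M) - P + np.outer(np.ones(M), pi), f)
    return g, float(pi @ f)

# Random walker, p = q = 1/2, alpha = 0.7 (illustrative numbers).
p = q = 0.5; alpha = 0.7
P = np.zeros((5, 5))
P[0, 1], P[0, 2], P[0, 3], P[0, 4] = p*alpha, p*(1-alpha), q*alpha, q*(1-alpha)
P[1:, 0] = 1.0
f = np.array([0, 100, -100, -100, 100], dtype=float)

g, eta = potentials(P, f)
print(g, eta)
# Check the slide's form of the Poisson equation: (I - P) g + eta e = f
print(np.allclose((np.eye(5) - P) @ g + eta, f))
```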

Page 24: Limitation of Markov Models and Event-Based Learning

Two Sensitivity Formulas

For two Markov chains P, η, π and P', η', π' (sharing the same reward f), let Q = P' - P.

Performance difference:
$\eta' - \eta = \pi' Q g = \pi'(P' - P) g$

One-line derivation: multiply the Poisson equation $(I - P)g + \eta e = f$ on the left by π'.

Performance derivative: if P is a function of θ, P(θ), then
$\frac{d\eta(\theta)}{d\theta} = \pi(\theta)\,\frac{dP(\theta)}{d\theta}\, g(\theta)$

Derivative = average change in the expected potential at the next step.

Perturbation analysis: choose the direction with the largest average change in expected potential at the next step.
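Spelling out the one-line derivation mentioned above: left-multiplying the Poisson equation by π' and using π'e = 1, π'f = η', and π'(I - P') = 0 gives

```latex
\[
\pi'\big[(I-P)g + \eta e\big] = \pi' f
\;\Rightarrow\;
\pi'(I-P)g + \eta = \eta'
\;\Rightarrow\;
\eta' - \eta = \pi'(I-P)g = \pi'(P'-P)g = \pi' Q g .
\]
```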

Page 25: Limitation of Markov Models and Event-Based Learning

Policy Iteration

$\eta' - \eta = \pi' Q g = \pi'(P' - P) g$

1. η' > η if P'g > Pg (fact: π' > 0)

2. Policy iteration: at every state, find a policy P' with P'g > Pg

3. Reinforcement learning (stochastic approximation algorithms)

Policy iteration: choose the action with the largest change in expected potential at the next step.
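A minimal sketch of the resulting policy-iteration loop for the average-reward case: compute the current policy's potentials g, then at every state pick the action whose transition row gives the largest expected potential at the next step (the slide's P'g > Pg). The helper names and the toy action set are illustrative, not from the talk.

```python
import numpy as np

def steady_state(P):
    M = P.shape[0]
    A = np.vstack([(np.eye(M) - P).T, np.ones(M)])
    b = np.zeros(M + 1); b[-1] = 1.0
    return np.linalg.lstsq(A, b, rcond=None)[0]

def potentials(P, f):
    pi = steady_state(P)
    g = np.linalg.solve(np.eye(len(f)) - P + np.outer(np.ones(len(f)), pi), f)
    return g, float(pi @ f)

def policy_iteration(rows, f, d0):
    """rows[i][a] = transition row P(i, .) under action a at state i;
    improve by maximizing the expected next-step potential at every state."""
    d = list(d0)
    while True:
        P = np.array([rows[i][d[i]] for i in range(len(d))])
        g, eta = potentials(P, f)
        d_new = []
        for i in range(len(d)):
            vals = [row @ g for row in rows[i]]
            best = max(vals)
            # keep the current action on ties so the loop terminates
            d_new.append(d[i] if vals[d[i]] >= best - 1e-9 else int(np.argmax(vals)))
        if d_new == d:
            return d, eta
        d = d_new

# Random-walker usage; p = 0.7 here so that the choice of alpha matters.
p, q, alphas = 0.7, 0.3, [0.0, 0.5, 1.0]
row0 = [np.array([0.0, p*a, p*(1-a), q*a, q*(1-a)]) for a in alphas]
back = [np.array([1.0, 0.0, 0.0, 0.0, 0.0])]          # states 1-4: return to 0
rows = [row0, back, back, back, back]
f = np.array([0, 100, -100, -100, 100], dtype=float)
print(policy_iteration(rows, f, d0=[0, 0, 0, 0, 0]))  # chooses alpha = 1.0 at state 0
```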

Page 26: Limitation of Markov Models and Event-Based Learning

Multi-Chain MDPs: Performance / Bias / Blackwell Optimization

D: policy space;  D0: performance-optimal policies
D1: (1st) bias-optimal policies;  D2: 2nd-bias-optimal policies
......  DM: Blackwell-optimal policies

[Figure: nested sets D ⊃ D0 ⊃ D1 ⊃ D2 ⊃ D3 ⊃ ... ⊃ DM]

Bias measures transient behavior.

With the performance-difference formulas, we can derive a simple, intuitive approach without discounting.

Page 27: Limitation of Markov Models and Event-Based Learning

A Map of the L&O World

[Figure: a map of the learning and optimization world, built around the potentials g and the two sensitivity formulas dη/dδ = πQg and η' - η = π'Qg. The map connects PA, MDP (policy iteration), the policy gradient, gradient-based PI, online gradient-based optimization, online policy iteration, RL (TD(λ), Q-learning, Neuro-DP, ..., providing online estimates), stochastic approximation, and SAC.]

Setting: two policies P, P' with Q = P' - P; steady-state probabilities π, π'; long-run average performance η, η'; Poisson equation (I - P + eπ)g = f.

RL: reinforcement learning;  PA: perturbation analysis;  MDP: Markov decision processes;  SAC: stochastic adaptive control.

Page 28: Limitation of Markov Models and Event-Based Learning

Overview of State-Based Optimization
Introduction to Event-Based Optimization
Sensitivity-Based Approach to State-Based Optimization
Solution to Event-Based Optimization

Extension of the sensitivity-based approach to event-based optimization

Page 29: Limitation of Markov Models and Event-Based Learning

Two sensitivity formulas:
• Performance derivatives
• Performance differences

PA & PI:
• PA: choose the direction with the largest average change in expected potential at the next step
• PI: choose the action with the largest change in expected potential at the next step

Potentials are aggregated according to the event structure.

Page 30: Limitation of Markov Models and Event-Based Learning

Solution to the Random Walker Problem

[Figure: the random-walk diagram with event a (up, probability p) and event b (down, probability q)]

Two policies: α_a = d(a), α_b = d(b) and α'_a = d'(a), α'_b = d'(b).

1. Performance difference:
$\eta' - \eta = \pi'(a)\,[\alpha'_a - \alpha_a]\, g(a) + \pi'(b)\,[\alpha'_b - \alpha_b]\, g(b)$
with aggregated potentials g(a) = g(1) - g(2) and g(b) = g(3) - g(4), and π'(a), π'(b) the perturbed steady-state probabilities of events a and b. Apparently, g(a) > 0 and g(b) < 0 for any policy.

Policy iteration: at any iteration choose α'_a > α_a and α'_b < α_b; the optimal policy takes α*_a as large as possible and α*_b as small as possible. This is again "choose the action with the largest change in expected potential at the next step", with g(a), g(b) as the aggregated potentials.

2. Performance derivative, with α_a(θ) and α_b(θ) continuous in θ:
$\frac{d\eta(\theta)}{d\theta} = \pi(a)\,\frac{d\alpha_a(\theta)}{d\theta}\,[g_\theta(1) - g_\theta(2)] + \pi(b)\,\frac{d\alpha_b(\theta)}{d\theta}\,[g_\theta(3) - g_\theta(4)]$
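A quick numeric check of the performance-difference formula above, using the return-to-0 structure assumed in the earlier sketches; here π'(a) and π'(b) are taken to be the steady-state probabilities that an up or a down transition occurs under the new policy (illustrative code, not from the talk).

```python
import numpy as np

def walker(p, alpha_a, alpha_b):
    """Transition matrix when alpha may depend on the event (a = up, b = down)."""
    q = 1 - p
    P = np.zeros((5, 5))
    P[0, 1], P[0, 2] = p * alpha_a, p * (1 - alpha_a)
    P[0, 3], P[0, 4] = q * alpha_b, q * (1 - alpha_b)
    P[1:, 0] = 1.0
    return P

def stat(P, f):
    """Steady-state probability, potentials, and average reward."""
    M = len(f)
    A = np.vstack([(np.eye(M) - P).T, np.ones(M)])
    b = np.zeros(M + 1); b[-1] = 1.0
    pi = np.linalg.lstsq(A, b, rcond=None)[0]
    g = np.linalg.solve(np.eye(M) - P + np.outer(np.ones(M), pi), f)
    return pi, g, float(pi @ f)

f = np.array([0, 100, -100, -100, 100], dtype=float)
p = 0.5
old, new = (0.3, 0.3), (0.9, 0.1)            # (alpha_a, alpha_b), illustrative
pi, g, eta = stat(walker(p, *old), f)
pi2, _, eta2 = stat(walker(p, *new), f)

g_a, g_b = g[1] - g[2], g[3] - g[4]          # aggregated potentials g(a), g(b)
pi2_a, pi2_b = pi2[0] * p, pi2[0] * (1 - p)  # event probabilities, new policy
diff = pi2_a * (new[0] - old[0]) * g_a + pi2_b * (new[1] - old[1]) * g_b
print(eta2 - eta, diff)                      # the two numbers agree (~40 here)
```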

Page 31: Limitation of Markov Models and Event-Based Learning

Solution to the Admission Control Problem

Two policies: α(n) and α'(n), the admission probabilities when an arrival finds population n.

Potential aggregation, with p(n) the probability of an arrival finding n customers and $\mathbf{n}_{+i}$ the state $\mathbf{n}$ with one more customer at server i:
$d(n) = \sum_{\mathbf{n}:\,|\mathbf{n}|=n} \frac{p(\mathbf{n})}{p(n)}\Big[\sum_{i=1}^{M} q_{0i}\, g(\mathbf{n}_{+i}) - g(\mathbf{n})\Big]$

d(n) = change in expected potential between accepting and rejecting a customer; it is an aggregated potential and can be estimated on a sample path.

1. Performance difference:
$\eta' - \eta = \sum_{n=0}^{N-1} p'(n)\,[\alpha'(n) - \alpha(n)]\, d(n)$

2. Performance derivative:
$\frac{d\eta}{d\delta} = \sum_{n=0}^{N-1} p(n)\,[\alpha'(n) - \alpha(n)]\, d(n)$

Policy iteration: choose α'(n) such that [α'(n) - α(n)] d(n) > 0, i.e., choose the action with the largest change in expected potential at the next step.
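The improvement step above only needs the sign of the aggregated potential d(n). A schematic sketch of the update, assuming d_hat holds sample-path estimates of d(n) (the estimation itself is not shown); this is illustrative code, not from the talk.

```python
def improve_admission_policy(alpha, d_hat):
    """One event-based policy-improvement step: pick alpha'(n) so that
    [alpha'(n) - alpha(n)] * d(n) > 0, i.e. admit more aggressively when the
    aggregated potential of accepting is positive (cf. the slide).

    alpha: alpha[n] = current admission prob. when an arrival finds population n
    d_hat: sample-path estimates of d(n)."""
    new_alpha = list(alpha)
    for n, d in enumerate(d_hat):
        if d > 0:
            new_alpha[n] = 1.0       # accepting improves the expected potential
        elif d < 0:
            new_alpha[n] = 0.0       # rejecting is better
        # d == 0 (or statistically indistinguishable): keep alpha[n]
    return new_alpha

# Illustrative numbers only:
print(improve_admission_policy([0.5, 0.5, 0.5], [12.3, -4.1, 0.0]))
```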

Page 32: Limitation of Markov Models and Event-Based Learning

Sensitivity-Based Approaches to Event-Based Optimization

Constructing new sensitivity equations!

[Figure: the map of the learning and optimization world (potentials g, dη/dδ = πQg, η' - η = π'Qg, PA, MDP / policy iteration, policy gradient, gradient-based PI, online gradient-based optimization, online policy iteration, RL with TD(λ), Q-learning, Neuro-DP as online estimates, stochastic approximation, SAC), now extended with event-based analogues of the two sensitivity equations, written in terms of event probabilities π(e) and aggregated potentials]

RL: reinforcement learning;  PA: perturbation analysis;  MDP: Markov decision processes;  SAC: stochastic adaptive control.

Page 33: Limitation of Markov Models and Event-Based Learning


Summary

Page 34: Limitation of Markov Models and Event-Based Learning

Advantages of the Event-Based Approach

1. May achieve better performance.

2. The number of aggregated potentials d(n) is N, which may be linear in the system size.

3. Actions at different states are correlated, so standard MDPs do not apply.

4. Special features are captured by events; the action can depend on future information.

5. Opens up a new direction for many engineering problems:
   - POMDPs: the observation y as an event
   - hierarchical control: a mode change as an event
   - networks of networks: transitions among subnets as events
   - Lebesgue sampling

Page 35: Limitation of Markov Models and Event-Based Learning

Sensitivity-Based View of Optimization

1. A map of the learning and optimization world: different approaches can be obtained from the two sensitivity equations.

2. Extension to event-based optimization: policy iteration, perturbation analysis, reinforcement learning, time aggregation, stochastic approximation, Lebesgue sampling, ...

3. Simpler and complete derivation for MDPs: multi-chains and different performance criteria; average performance with no discounting; N-bias optimality and Blackwell optimality.

Page 36: Limitation of Markov Models and Event-Based Learning

Pictures to Remember (I)

[Figure: the random-walk diagram - states 0-4 with rewards (0), (100), (-100), (-100), (100), branch probabilities 1/2, 1/2 and choice probabilities α, 1 - α, with event a (choose α large) and event b (choose α small)]

Page 37: Limitation of Markov Models and Event-Based Learning

Pictures to Remember (II)

[Figure: the map of the learning and optimization world - potentials g, the two sensitivity formulas dη/dδ = πQg and η' - η = π'Qg, PA, MDP (policy iteration), policy gradient, gradient-based PI, online gradient-based optimization, online policy iteration, RL (TD(λ), Q-learning, Neuro-DP) as online estimates, stochastic approximation, SAC - extended by constructing the new event-based sensitivity equations]

Page 38: Limitation of Markov Models and Event-Based Learning

Limitation of State-Based Formulation (I)

[Figure: a three-state diagram marked with question marks - 0: Yantai, 1: Alaska, 2: Hawaii]

Page 39: Limitation of Markov Models and Event-Based Learning


Thank You!

Page 40: Limitation of Markov Models and Event-Based Learning

Xi-Ren Cao: Stochastic Learning and Optimization - A Sensitivity-Based Approach

9 chapters, 566 pages, 119 figures, 27 tables, 212 homework problems

Springer, October 2007