Learning First Order Markov Models for Control
Pieter Abbeel and Andrew Y. Ng (Stanford University), Poster 48, Tuesday
Consider modeling an autonomous RC-car’s dynamics from a sequence of states and actions collected at 100Hz.
We have training data: (s1, a1, s2, a2, …).
We’d like to build a model of the MDP’s transition probabilities P(s_{t+1} | s_t, a_t).
• If we use maximum likelihood (ML) to fit the parameters of the MDP, then we are constrained to fit only the 1-step transitions:

max_θ Σ_t log P_θ(s_{t+1} | s_t, a_t)
• But in RL, our goal is to maximize the long-term rewards, so we aren’t really interested in the 1/100th-second dynamics.
• The dynamics on longer time-scales are often only poorly approximated (unless the system really is first-order).
• This work: algorithms for building models that better capture dynamics on longer time-scales.
• Experiments on autonomous RC-car driving.
Autonomous RC Car
Motivation
• Consider modeling an RC-car’s dynamics from a sequence of states and actions collected at 100Hz.
• Maximum likelihood fitting of a first order Markov model constrains the model to fit only the 1-step transitions. However, for control applications we care not only about the dynamics on the time-scale of 1/100 of a second, but also about longer time-scales.
Motivation
• If we use maximum likelihood (ML) to fit the parameters of a first-order Markov model, then we are constrained to fit only the 1-step transitions.
• The dynamics on longer time-scales are often only poorly approximated [unless the system dynamics are really first-order].
• However, for control we are interested in maximizing the long-term expected rewards.
Random Walk Example

• Random walk: S_T = ε_1 + ε_2 + … + ε_T, with zero-mean, unit-variance increments ε_i.
• Consider two cases:
  - Increments ε_i perfectly correlated: Var(S_T) = T².
  - Increments ε_i independent: Var(S_T) = T.
• Regardless of which is the true model, ML will return the same first-order model, since the two cases have identical 1-step statistics.
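The two regimes can be checked with a short Monte-Carlo simulation (an editor's sketch, not from the poster; unit-variance ±1 increments are assumed):

```python
import random

def walk_variance(T, trials, correlated):
    """Monte-Carlo estimate of Var(S_T) for S_T = sum of T unit increments."""
    finals = []
    for _ in range(trials):
        if correlated:
            # one coin flip shared by all T increments
            eps = [random.choice((-1, 1))] * T
        else:
            eps = [random.choice((-1, 1)) for _ in range(T)]
        finals.append(sum(eps))
    mean = sum(finals) / trials
    return sum((s - mean) ** 2 for s in finals) / trials

random.seed(0)
T = 50
print(walk_variance(T, 20000, correlated=True))   # close to T^2 = 2500
print(walk_variance(T, 20000, correlated=False))  # close to T = 50
```

In both regimes the one-step increment is ±1 with equal probability, so a first-order ML fit cannot distinguish them; only the multi-step spread differs.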
Examples of physical systems
• Influence of wind disturbances on a helicopter: very small over one time step, but strong correlations lead to a substantial effect over time.
• A first order ML model may overestimate our ability to control the helicopter or car [thinking the variance is O(T) rather than O(T²)]. This leads to the danger of, e.g., flying too close to a building, or driving on too narrow a road.
• Systematic model errors can show up as correlated noise, e.g., oversteering or understeering of the car.
Problem statement
• The learning problem. Given: state/action sequence data from a system. Goal: model the system for purposes of control (such as to use with an RL algorithm).
• Even when dynamics are not governed by an MDP, we often would still like to model it as such (rather than as a POMDP), since MDPs are much easier to solve.
• How do we learn an accurate first order Markov model from data for control?
[Our ideas are also applicable to higher order, and/or more structured models such as dynamic Bayesian networks and mixed memory Markov models.]
Preliminaries and Notation

• Finite-state decision process (DP): S: set of states; A: set of actions; P: set of state transition probabilities [not Markov!]; γ: discount factor; D: initial state distribution; R: reward function, with ∀s |R(s)| ≤ Rmax.
• We will fit a model M̂, with estimates P̂ of the transition probabilities.
• Value of state s0 in M̂ under policy π: V̂_π(s0) = E[ Σ_{t≥0} γ^t R(s_t) | s_0 = s0, π, M̂ ].
Parameter estimation when no actions

• Consider minimizing the error of the model's multi-step predictions:

min over P̂ of Σ_t Σ_k d_var( P(S_{t+k} | s_t), P̂(S_{t+k} | s_t) ),

where d_var(p, q) = Σ_x |p(x) − q(x)| is the variational distance.

• d_var is hard to optimize from samples, but can be upper-bounded by a function of KL-divergence.
• Minimizing KL-divergence is, in turn, identical to minimizing log-loss.
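The two steps can be made explicit (an editor's sketch; the slide only names the bounds). Taking d_var as the L1 distance, Pinsker's inequality gives the first step, and expanding the KL-divergence gives the second:

```latex
d_{\mathrm{var}}(p, q)^2 \;\le\; 2\,\mathrm{KL}(p \,\|\, q),
\qquad
\mathrm{KL}\big(P(\cdot \mid s_t) \,\|\, \hat{P}(\cdot \mid s_t)\big)
\;=\;
\underbrace{-\sum_x P(x \mid s_t)\,\log \hat{P}(x \mid s_t)}_{\text{log-loss of } \hat{P}}
\;-\; H\big(P(\cdot \mid s_t)\big).
```

Since the entropy H(P(· | s_t)) does not depend on P̂, minimizing the KL-divergence over P̂ is identical to minimizing the log-loss, which can be estimated from samples.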
d_var → KL-divergence → log-likelihood

[The last step reflects that we are equally interested in every state as a possible starting state s0.]
The resulting lagged objective

• Given a training sequence s_0:T, we propose to use:

max over θ of Σ_{t=0}^{T−1} Σ_{k=1}^{T−t} log P̂_θ(s_{t+k} | s_t)

• Compare this to the maximum likelihood objective:

max over θ of Σ_{t=0}^{T−1} log P̂_θ(s_{t+1} | s_t)
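As a concrete illustration (an editor's sketch, assuming a finite-state chain whose k-step predictive distribution is the k-th power of the estimated transition matrix):

```python
import numpy as np

def lagged_loglik(seq, T_hat, H):
    """Sum over t and lags k <= H of log Phat(s_{t+k} | s_t), where the
    k-step predictive is the k-th matrix power of the 1-step model T_hat.
    Returns -inf if the model assigns probability 0 to an observed pair."""
    powers = [np.linalg.matrix_power(T_hat, k) for k in range(H + 1)]
    total = 0.0
    for t in range(len(seq) - 1):
        for k in range(1, min(H, len(seq) - 1 - t) + 1):
            total += np.log(powers[k][seq[t], seq[t + k]])
    return total

def ml_loglik(seq, T_hat):
    # ordinary ML log-likelihood is the special case H = 1
    return lagged_loglik(seq, T_hat, H=1)
```

A model that explains the 1-step transitions but mispredicts multi-step behavior scores well under `ml_loglik` and poorly under `lagged_loglik`.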
Lagged objective vs. ML

• Consider a length-four training sequence s_0, s_1, s_2, s_3, which could have various dependencies.
• ML takes into account only the 1-step transitions: (s_0 → s_1), (s_1 → s_2), (s_2 → s_3).
• Our lagged objective also takes into account the multi-step transitions: (s_0 → s_2), (s_1 → s_3), (s_0 → s_3).

[In the accompanying graphical-model figure, yellow nodes are observed, white nodes are unobserved.]
EM-algorithm to optimize lagged objective

• E-step: compute the expected 1-step transition counts along the unobserved paths between every observed pair (s_t, s_{t+k}), and store them in stats. I.e., ∀ t, k, l, i, j:

stats(i, j) += P( S_{t+l} = i, S_{t+l+1} = j | S_t = s_t, S_{t+k} = s_{t+k} )

• M-step: update P̂ such that P̂(j | i) ∝ stats(i, j).
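A direct, unoptimized sketch of one EM iteration for the no-actions case (an editor's illustration; endpoint-conditioned pair marginals are computed with matrix powers rather than the message passing the poster uses):

```python
import numpy as np

def em_step(seq, P, H):
    """One EM iteration on the lagged objective for a finite-state chain.
    E-step: expected 1-step transition counts on the unobserved path
    between every observed pair (s_t, s_{t+k}), k <= H.
    M-step: row-normalize the counts."""
    n = P.shape[0]
    Pk = [np.linalg.matrix_power(P, k) for k in range(H + 1)]
    stats = np.zeros((n, n))
    for t in range(len(seq) - 1):
        for k in range(1, min(H, len(seq) - 1 - t) + 1):
            a, b = seq[t], seq[t + k]
            denom = Pk[k][a, b]
            if denom == 0:
                continue  # model assigns probability 0 to this pair
            for l in range(k):
                # P(S_{t+l}=i, S_{t+l+1}=j | S_t=a, S_{t+k}=b)
                #   = Pk[l][a,i] * P[i,j] * Pk[k-l-1][j,b] / Pk[k][a,b]
                stats += np.outer(Pk[l][a, :], Pk[k - l - 1][:, b]) * P / denom
    row = stats.sum(axis=1, keepdims=True)
    return stats / np.where(row == 0, 1.0, row)  # zero rows: state never seen
```

Starting from a uniform model, iterating `em_step` on alternating data pushes probability mass onto the alternating transitions, as the lagged objective demands.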
Computational Savings for E-step
• Inference for E-step can be done using standard forward and backward message passing. For every pair (t, t+k), the forward messages at position t+i depend on t only, not on k. So, computation of different terms in the inner-summation can share messages. Similarly for backward messages. This reduces the number of message computations by a factor T.
• Often we are only interested in dynamics up to some maximum horizon H, i.e., in the inner summation of the objective we only consider k = 1, …, H. This reduces the cost from O(T³) to O(T H²).
• More substantial savings: (S_t = i, S_{t+k} = j) and (S_{t′} = i, S_{t′+k} = j) contribute the same amount to stats(·, ·), so the contribution for all such pairs is computed only once. Further reduction to O(|S|² H²).
Incorporating actions
• If actions are incorporated, our objective becomes:

max over θ of Σ_t Σ_k log P̂_θ(s_{t+k} | s_t, a_{t:t+k−1})

• The EM-algorithm is trivially extended by conditioning on the actions during the E-step.
• As before, forward messages need to be computed only once for every t, backward messages once for every t+k.
• The number of possibilities for a_{t:t+k−1} is O(|A|^k). Using only a few deterministic exploration policies, we can still obtain the same computational savings as before.
Experiment 1: shortest vs. safest path
• Actions are 4 compass directions.
• Move in intended direction with probability 0.7, and a random direction with probability 0.3.
• The directions of the “random transitions” are dependent, and correlated over time. A parameter q controls the correlation between the directions of the random transitions on different time steps (uncorrelated if q=0, perfectly correlated if q=1).
• We will fit a first order Markov model to these dynamics (with each grid position being a state).
[Details: Noise process governed by a Markov process (not directly observable by the agent) with each of the 4 directions as states, with Prob(staying in same state) = q.]
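The hidden noise process in the bracketed details can be sketched directly (an editor's illustration; on leaving its state, the chain is assumed to jump uniformly to one of the 3 other directions so that Prob(staying) is exactly q):

```python
import random

def noise_directions(T, q, rng):
    """Hidden noise process of the grid-world experiment: a Markov chain
    over the 4 compass directions with Prob(staying in same state) = q;
    otherwise it jumps uniformly to one of the 3 other directions."""
    dirs = ['N', 'E', 'S', 'W']
    d = rng.choice(dirs)
    out = []
    for _ in range(T):
        out.append(d)
        if rng.random() >= q:
            d = rng.choice([x for x in dirs if x != d])
    return out

rng = random.Random(0)
path = noise_directions(20000, q=0.9, rng=rng)
stay_freq = sum(a == b for a, b in zip(path, path[1:])) / (len(path) - 1)
# stay_freq is close to 0.9; q=0 gives uncorrelated noise, q=1 perfectly correlated
```

Each "random transition" of the agent moves in the current state of this chain, so large q produces long runs of pushes in the same direction.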
Experiment 1: shortest vs. safest path
[Details: Learning was done using a 200,000 length state-action sequence. Reported results are averages over 5 independent trials. The exploration policy used independent random actions at each time step.]
If the noise is strongly correlated across time (large q), our model estimates the dynamics to have a higher “effective noise level.” As a consequence the more cautious policy (path B) is used.
Experiment 2: Queue

• Customers arrive over time to be served; at every time step, the arrival probability equals p.
• Service rate = probability that the customer first in the queue gets serviced successfully in the current time step.
• Actions: 3 service rates, with faster service rates being more expensive:
  - q0 = 0, reward 0
  - q1 = p, reward −1
  - q2 = 0.75, reward −10
• Queue buffer length = 20; buffer overflow results in reward −1000.
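A minimal simulator of this MDP (an editor's sketch: arrivals here are i.i.d., whereas the experiment draws them from the hidden two-mode process described below, and the within-step ordering of arrival and service is an assumption):

```python
import random

def simulate_queue(policy, p, steps, rng, buf=20, overflow_penalty=-1000):
    """Simulate the queue MDP: an arrival occurs with probability p each
    step; the chosen service rate succeeds with that probability on the
    customer at the head of the queue. Per-step rewards: (0, -1, -10)."""
    rates = (0.0, p, 0.75)
    rewards = (0, -1, -10)
    size, total = 0, 0
    for _ in range(steps):
        a = policy(size)                  # action index: 0, 1, or 2
        total += rewards[a]
        if rng.random() < p:              # customer arrival
            size += 1
        if size > 0 and rng.random() < rates[a]:
            size -= 1                     # successful service
        if size > buf:                    # buffer overflow
            total += overflow_penalty
            size = buf
    return total

rng = random.Random(0)
# example threshold policy: pay for fast service only when the queue is long
threshold = lambda s: 2 if s > 15 else (1 if s > 0 else 0)
total = simulate_queue(threshold, p=0.5, steps=1000, rng=rng)
```

The trade-off is between paying for fast service and risking the −1000 overflow; correlated arrival bursts make the overflow risk much larger than an i.i.d. model predicts.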
Experiment 2: Queue

• The underlying (unobserved!) arrival process has 2 different modes (fast arrivals and slow arrivals):
  - P( arrival | slow mode ) = 0.01
  - P( arrival | fast mode ) = 0.99
  - Steady state: P(slow mode) = 0.8, P(fast mode) = 0.2
• An additional parameter determines how rapidly the system switches between the fast and slow modes (the plots compare fast vs. slow switching between modes).
Experiment 2: Queue

• We estimate/learn a first order Markov model with: state = size of the queue; actions = 3 service rates; exploration policy = repeatedly use the same service rate for 25 time-steps. We used 8000 such trials.
• Results: 15% better performance at high correlation levels; same performance at low correlation levels.
Experiment 3: RC-car
Consider the situation where the RC-car can choose between 2 paths:
• A curvy path, with high reward if it successfully reaches the goal.
• An easier path, with lower reward if it successfully reaches the goal.
We build a dynamics model of the car, and find a policy/controller in simulation for following each of the paths. The decision about which path to follow is then made based upon this simulation.
RC-car model

• θ: angular direction the RC-car is headed
• θ̇: angular velocity
• V: velocity of the RC-car (kept constant)
• u_t: steering input to the car (∈ [−1, 1])
• C1, C2, C3: parameters of the model, estimated using linear regression
• w_t: noise term, zero-mean Gaussian with variance σ²

Using the lagged objective, we re-estimate the variance σ², and compare its performance to the first-order estimate of σ².
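For intuition on the variance re-estimation, consider a simplified drift-free random-walk model (an editor's illustration, not the poster's full car model, which re-estimates σ² inside the regression via EM). Here the lagged Gaussian log-likelihood has a closed-form maximizer:

```python
def lagged_sigma2(seq, H):
    """Lagged estimate of the one-step noise variance for a drift-free
    random walk s_{t+1} = s_t + w_t, w_t ~ N(0, sigma^2): a first-order
    model predicts s_{t+k} ~ N(s_t, k*sigma^2), and maximizing the lagged
    Gaussian log-likelihood over sigma^2 gives the average of
    (s_{t+k} - s_t)^2 / k over all pairs (t, k) with k <= H."""
    num, cnt = 0.0, 0
    for t in range(len(seq) - 1):
        for k in range(1, min(H, len(seq) - 1 - t) + 1):
            num += (seq[t + k] - seq[t]) ** 2 / k
            cnt += 1
    return num / cnt

def ml_sigma2(seq):
    # ordinary ML uses only the 1-step residuals
    return lagged_sigma2(seq, H=1)

# perfectly correlated "noise": every increment is +1
seq = [float(t) for t in range(100)]
```

On this sequence `ml_sigma2(seq)` is 1.0 while `lagged_sigma2(seq, H=10)` is several times larger: the lagged fit inflates σ² to match the multi-step spread, which is exactly the more cautious behavior seen in the experiments.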
Controller

• We use the following controller:

  desired steering angle = p1·(y − y_des) + p2·(θ − θ_des)
  u = f(desired steering angle)

• We optimize over the parameters p1, p2 to follow the straight line y = 0, for which we set y_des = 0, θ_des = 0. For the two specific trajectories, y_des(x) and θ_des(x) are optimized as a function of the current x position.
• For localization, we use an overhead camera.
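The control law above is a few lines of code (an editor's sketch; the mapping f into the actuator range is not specified on the slide, so a simple clip to [−1, 1] stands in for it):

```python
def steering_command(y, theta, y_des, theta_des, p1, p2):
    """Proportional controller from the slide: the desired steering angle
    combines the lateral error (y - y_des) and the heading error
    (theta - theta_des), then is mapped into the steering range [-1, 1]
    (here by clipping, standing in for the unspecified mapping f)."""
    desired = p1 * (y - y_des) + p2 * (theta - theta_des)
    return max(-1.0, min(1.0, desired))
```

The gains p1, p2 are what gets optimized against the learned dynamics model, so an overconfident model yields overconfident gains.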
Simulated performance on curvy trajectory
Plot shows 100 sample runs in simulation under the ML-model.
The ML-model predicts the RC-car can follow the curvy road >95% of the time.
Plot shows 10 sample runs in simulation under the lag-learned model.
The lag-learned model predicts the RC-car can follow the curvy road < 10% of the time.
Green lines: simulated trajectories, Black lines: road boundaries.
Simulated performance on easier trajectory
Plot shows 100 sample runs in simulation under the ML-model.
The ML-model predicts the RC-car can follow the easier road >99% of the time.
Plot shows 100 sample runs in simulation under the lag-learned model.
The lag-learned model predicts the RC-car can follow the easier road > 70% of the time.
Green lines: simulated trajectories, Black lines: road boundaries.
ML would therefore choose the curvy road whenever the reward along it is high.
Actual performance on easier trajectory
[Movies available.]
The real RC-car succeeded on the easier road 20/20 times.
The real RC-car failed on the curvy road 19/20 times.
RC-car movie
Conclusions
• Maximum likelihood with a first order Markov model only tries to model the 1-step transition dynamics.
• For many control applications, we desire an accurate model of the dynamics on longer time-scales.
• We showed that, by using an objective that takes into account the longer time scales, in many cases a better dynamical model (and a better controller) is obtained.
Special thanks to Mark Woodward, Dave Dostal, Vikash Gilja and Sebastian Thrun.