TRANSCRIPT
Spectral learning algorithms for dynamical systems
Geoff Gordon (http://www.cs.cmu.edu/~ggordon/)
Machine Learning Department, Carnegie Mellon University
joint work with Byron Boots, Sajid Siddiqi, Le Song, Alex Smola
Geoff Gordon—UT Austin—Feb, 2012
What’s out there?
[figure: video frames o_{t-2}, o_{t-1}, o_t, o_{t+1}, o_{t+2}, as vectors of pixel intensities]
A dynamical system
Given past observations from a partially observable system, predict future observations.
[figure: timeline o_{t-2} … o_{t+2}, with past to the left, future to the right, and a latent state between them]
This talk
• General class of models for dynamical systems
• Fast, statistically consistent learning algorithm
‣ no local optima
• Includes many well-known models & algorithms as special cases
• Also includes new models & algorithms that give state-of-the-art performance on interesting tasks
Includes models
• hidden Markov model
‣ n-grams, regexes, k-order HMMs
• PSR, OOM, multiplicity automaton, RR-HMM
• LTI system
‣ Kalman filter, AR, ARMA
• Kernel versions and manifold versions of above
Includes algorithms
• Subspace identification for LTI systems
‣ recent extensions to HMMs, RR-HMMs
• Tomasi-Kanade structure from motion
• Principal components analysis (e.g., eigenfaces); Laplacian eigenmaps
• New algorithms for learning PSRs, OOMs, etc.
• A new way to reduce noise in manifold learning
• A new range-only SLAM algorithm
Interesting applications
• Structure from motion
• Simultaneous localization and mapping
‣ Range-only SLAM
‣ “SLAM” from inertial sensors
‣ very simple vision-based SLAM (so far)
• Video textures
• Opponent modeling, option pricing, audio event classification, …
Bayes filter (HMM, Kalman, etc.)
• Goal: given o_t, update P_{t-1}(s_{t-1}) to P_t(s_t)
• Extend: P_{t-1}(s_{t-1}, o_t, s_t) = P_{t-1}(s_{t-1}) P(s_t | s_{t-1}) P(o_t | s_t)
• Marginalize: P_{t-1}(o_t, s_t) = ∫ P_{t-1}(s_{t-1}, o_t, s_t) ds_{t-1}
• Condition: P_t(s_t) = P_{t-1}(o_t, s_t) / P_{t-1}(o_t)
[figure: observations o_{t-2} … o_{t+2} over hidden states s_{t-2} … s_{t+2}; belief P_t(s_t) = P(s_t | o_{1:t})]
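In the discrete case, the extend/marginalize/condition steps above can be sketched directly in code. The HMM below (its sizes, parameters, and observation sequence) is made up for illustration; it is not from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

# A small made-up discrete HMM:
# T[i, k] = P(s_t = i | s_{t-1} = k), O[j, i] = P(o_t = j | s_t = i)
nS, nO = 3, 4
T = rng.random((nS, nS)); T /= T.sum(axis=0)
O = rng.random((nO, nS)); O /= O.sum(axis=0)

def bayes_filter_step(b_prev, o_t):
    """One update from P_{t-1}(s_{t-1}) to P_t(s_t), following the slide's steps."""
    # Extend: P(s_{t-1}=k, o_t, s_t=i) = P(s_{t-1}=k) P(s_t=i | s_{t-1}=k) P(o_t | s_t=i)
    joint = b_prev[None, :] * T * O[o_t, :][:, None]
    # Marginalize: sum over s_{t-1} (the sum plays the role of the integral)
    p_os = joint.sum(axis=1)            # P(o_t, s_t)
    # Condition: divide by P(o_t)
    return p_os / p_os.sum()

b = np.full(nS, 1.0 / nS)               # uniform initial belief
for o_t in [2, 0, 1, 3]:
    b = bayes_filter_step(b, o_t)
print(b)                                # a normalized belief over the 3 states
```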
A common form for Bayes filters
• Find covariances as (linear) fns of previous state
‣ Σ_o(s_{t-1}) = E(φ(o_t) φ(o_t)^T | s_{t-1})
‣ Σ_so(s_{t-1}) = E(s_t φ(o_t)^T | s_{t-1})
‣ nb: uncentered covars; s_t & φ(o_t) include constant
• Linear regression to get current state
‣ s_t = Σ_so Σ_o^{-1} φ(o_t)
• Exact if discrete (HMM, PSR), Gaussian (Kalman, AR), RKHS w/ characteristic kernel [Fukumizu et al.]
• Approximates many more
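In the discrete case, with indicator features φ(o_t) = e_{o_t} and the state represented by the indicator vector of the hidden state, the regression update s_t = Σ_so Σ_o^{-1} φ(o_t) reproduces exact Bayes filtering. A small numerical check, on an HMM made up for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up discrete HMM: T[i,k] = P(s_t=i | s_{t-1}=k), O[j,i] = P(o_t=j | s_t=i)
nS, nO = 3, 4
T = rng.random((nS, nS)); T /= T.sum(axis=0)
O = rng.random((nO, nS)); O /= O.sum(axis=0)
b = rng.random(nS); b /= b.sum()          # belief P(s_{t-1} | o_{1:t-1})

# Uncentered covariances, conditioned on the current belief, for indicator features
pred = T @ b                               # P(s_t | belief)
Sigma_so = O.T * pred[:, None]             # E(e_{s_t} e_{o_t}^T) = P(s_t=i, o_t=j)
Sigma_o = np.diag(O @ pred)                # E(e_{o_t} e_{o_t}^T) = diag(P(o_t))

o_t = 2                                    # the observation we actually see
phi = np.eye(nO)[o_t]
s_new = Sigma_so @ np.linalg.inv(Sigma_o) @ phi   # the slide's regression update

# Standard Bayes filter for comparison
b_bayes = O[o_t] * pred
b_bayes /= b_bayes.sum()
print(np.allclose(s_new, b_bayes))         # -> True
```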
Why this form?
• If Bayes filter takes (approximately) the above form, can design a simple spectral algorithm to identify the system
• Intuitions:
‣ predictive state
‣ rank bottleneck
‣ observable representation
Predictive state
• If Bayes filter takes above form, we can use a vector of predictions of observables as our state
‣ E(φ(o_{t+k}) | s_t) = linear fn of s_t
‣ for big enough k, E([φ(o_{t+1}) … φ(o_{t+k})] | s_t) = invertible linear fn of s_t
‣ so, take E([φ(o_{t+1}) … φ(o_{t+k})] | s_t) as our state
Predictive state: example
Predictive state: minimal example
• For big enough k, E([φ(o_{t+1}) … φ(o_{t+k})] | s_t) = W s_t, an invertible linear fn (if system observable)

P(s_t | s_{t-1}) = [ 2/3  0    1/4
                     1/3  2/3  0
                     0    1/3  3/4 ] = T

P(o_t | s_{t-1}) = [ 1/2  1  0
                     1/2  0  1 ] = O

O T = [ 2/3  2/3  1/8
        1/3  1/3  7/8 ] = W
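Numerically, this example becomes invertible once predictions for two future steps are stacked. The stacking below is my reading of the slide's construction (with the convention P(o_t | s_{t-1}), the next observation's distribution is O s_t and the one after is O T s_t):

```python
import numpy as np

# The slide's 3-state, 2-observation example
T = np.array([[2/3, 0,   1/4],
              [1/3, 2/3, 0  ],
              [0,   1/3, 3/4]])
O = np.array([[1/2, 1, 0],
              [1/2, 0, 1]])

W1 = O                       # k = 1: rank 2, cannot invert a 3-dim state
W2 = np.vstack([O, O @ T])   # k = 2: stacked predictions of the next 2 observations
print(np.linalg.matrix_rank(W1), np.linalg.matrix_rank(W2))  # -> 2 3
```

So with k = 2 the predictive state E([φ(o_{t+1}), φ(o_{t+2})] | s_t) = W2 s_t determines s_t uniquely.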
Predictive state: summary
• E([φ(o_{t+1}) … φ(o_{t+k})] | s_t) is our state rep’n
‣ interpretable
‣ observable—a natural target for learning
• To find s_t, predict φ(o_{t+1}) … φ(o_{t+k}) from o_{1:t}
Rank bottleneck
[diagram: data about past (many samples) → compress → state (bottleneck) → expand → predict → data about future (many samples)]
Can find best rank-k bottleneck via matrix factorization ⇒ spectral method
Finding a compact predictive state
[diagram: Y ≈ W X, a linear function predicting future features Y_t from past features X_t]
rank(W) ≤ k
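The rank-constrained bottleneck can be found with a single SVD. Below is a sketch of reduced-rank regression of future features on past features; all data, dimensions, and variable names are synthetic, invented for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: past features X (d_x x n) and future features Y (d_y x n), where the
# map from past to future is (nearly) rank k -- here k = 3 by construction.
n, d_x, d_y, k = 2000, 10, 8, 3
X = rng.standard_normal((d_x, n))
W_true = rng.standard_normal((d_y, k)) @ rng.standard_normal((k, d_x))
Y = W_true @ X + 0.01 * rng.standard_normal((d_y, n))

# Ordinary least squares, then project onto the top-k left singular vectors of
# the fitted values: this is the rank-k bottleneck from the slide.
W_ols = Y @ X.T @ np.linalg.pinv(X @ X.T)
U, s, Vt = np.linalg.svd(W_ols @ X, full_matrices=False)
U_k = U[:, :k]
W_rr = U_k @ (U_k.T @ W_ols)          # rank-k prediction weights
state = U_k.T @ W_ols @ X             # k-dim predictive state at each time step
print(np.linalg.matrix_rank(W_rr), state.shape)
```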
Observable representation
• Now that we have a compact predictive state, need to estimate fns Σ_o(s_t) and Σ_so(s_t)
• Insight: parameters are now observable: the problem is just to estimate some covariances from data
‣ to get Σ_o, regress (φ(o) ⊗ φ(o)) ← state
‣ to get Σ_so, regress (φ(o) ⊗ future+) ← state
[figure: timeline o_{t-2} … o_{t+2} with past, future, and an extended future+ window; the state sits between past and future]
Algorithm
• Find compact predictive state: regress future ← past
‣ use any features of past/future, or a kernel, or even a learned kernel (manifold case)
‣ constrain rank of prediction weight matrix
• Extract model: regress
‣ (o ⊗ future+) ← [predictive state]
‣ (o ⊗ o) ← [predictive state]
‣ future+: use same rank-constrained basis as for predictive state
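As a concrete special case, the HMM instance of this recipe is the spectral algorithm of Hsu, Kakade, and Zhang. The sketch below uses exact population moments (in practice these would be empirical covariances estimated from data) and a random HMM made up for the demo; it checks that the extracted observable operators reproduce the true joint probabilities:

```python
import numpy as np

rng = np.random.default_rng(1)

# A made-up HMM with nS states and nO > nS observations (a rank condition the
# spectral method needs). pi is the initial state distribution.
nS, nO = 3, 4
T = rng.random((nS, nS)); T /= T.sum(axis=0)        # P(s'=i | s=k)
Obs = rng.random((nO, nS)); Obs /= Obs.sum(axis=0)  # P(o=j | s=i)
pi = rng.random(nS); pi /= pi.sum()

# Low-order observable moments (exact here; estimated from data in practice)
P1 = Obs @ pi                                       # P(o_1)
P21 = Obs @ T @ np.diag(pi) @ Obs.T                 # P(o_2, o_1)
P3x1 = [Obs @ T @ np.diag(Obs[x]) @ T @ np.diag(pi) @ Obs.T for x in range(nO)]

# Spectral extraction: the SVD of P21 supplies the rank bottleneck
U = np.linalg.svd(P21)[0][:, :nS]
b1 = U.T @ P1
binf = np.linalg.pinv(P21.T @ U) @ P1
B = [U.T @ P3x1[x] @ np.linalg.pinv(U.T @ P21) for x in range(nO)]

def spectral_prob(seq):
    b = b1
    for x in seq:
        b = B[x] @ b
    return binf @ b

def forward_prob(seq):   # ground-truth forward algorithm
    a = Obs[seq[0]] * pi
    for x in seq[1:]:
        a = Obs[x] * (T @ a)
    return a.sum()

seq = [2, 0, 3, 1]
print(spectral_prob(seq), forward_prob(seq))   # the two agree
```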
Does it work?
• Simple HMM: 5 states, 7 observations
• N = 300 thru N = 900k
[figure: the HMM’s transition and observation matrices, and the learned bases X^T, Y^T]
Does it work?
[plot: mean abs. error in P(o_1, o_2, o_3), model vs. true (100 = predict uniform), against training data from 10^3 to 10^5; two runs, with a c/sqrt(n) reference curve]
Discussion
• Impossibility—learning DFA in poly time = breaking crypto primitives [Kearns & Valiant 89]
‣ so, clearly, we can’t always be statistically efficient
‣ but, see McDonald [11], HKZ [09], us [09]: convergence depends on mixing rate
• Nonlinearity—Bayes filter update is highly nonlinear in state (matrix inverse), even though we use a linear regression to identify the model
‣ nonlinearity is essential for expressiveness
Range-only SLAM
• Robot measures distance from current position to fixed beacons (e.g., time of flight or signal strength)
‣ may also have odometry
• Goal: recover robot path, landmark locations
Typical solution: EKF
• Problem: linear/Gaussian approximation very bad
[figure: Bayes filter vs. EKF: the prior, and the posterior after 1 range measurement to a known landmark]
Spectral solution (simple version)

Matrix of squared ranges, entry d²_{nt} for landmark n at time step t:

D² = [ d²_11  d²_12  …  d²_1T
       d²_21  d²_22  …  d²_2T
       …
       d²_N1  d²_N2  …  d²_NT ]

This factors as

(1/2) D² = [ 1  x_1  y_1  (x_1² + y_1²)/2     [ (u_1² + v_1²)/2  …  (u_T² + v_T²)/2
             1  x_2  y_2  (x_2² + y_2²)/2   ×   −u_1             …  −u_T
             …                                  −v_1             …  −v_T
             1  x_N  y_N  (x_N² + y_N²)/2 ]     1                …  1              ]

with landmark positions (x_n, y_n) and robot positions (u_t, v_t). So the squared-range matrix has rank at most 4, and a rank-4 factorization recovers landmarks and path up to a linear transform.
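A numerical sketch of the rank argument; the landmarks and path below are random, invented for the demo:

```python
import numpy as np

rng = np.random.default_rng(2)

# Random landmarks (x_n, y_n) and robot path (u_t, v_t); stand-ins for a real run
N, Tsteps = 6, 100
lm = rng.uniform(-10, 10, (N, 2))
path = rng.uniform(-10, 10, (Tsteps, 2))

# Squared-range matrix D2[n, t] = ||lm_n - path_t||^2
D2 = ((lm[:, None, :] - path[None, :, :]) ** 2).sum(axis=-1)

# Per the factorization on the slide, D2 has rank at most 4
print(np.linalg.matrix_rank(D2))  # -> 4

# A rank-4 SVD recovers landmark-side and path-side factors up to an invertible
# 4x4 transform; a few known landmarks (or odometry) pin that transform down.
U, s, Vt = np.linalg.svd(D2 / 2, full_matrices=False)
L = U[:, :4] * s[:4]
R = Vt[:4]
assert np.allclose(L @ R, D2 / 2)
```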
Experiments (full version)
[figure (Fig. 4): the autonomous lawn mower platform, with true/estimated landmark positions and dead-reckoning/true/estimated paths for the Plaza 1 and Plaza 2 datasets]
Fig. 4. The autonomous lawn mower and spectral SLAM. A.) The robotic lawn mower platform. B.) In the first experiment, the robot traveled 1.9 km, receiving 3,529 range measurements. This path minimizes the effect of heading error by balancing the number of left turns with an equal number of right turns (a commonly used pattern for lawn mowing). Spectral SLAM recovers the robot’s path and landmark positions accurately. C.) In the second experiment, the robot traveled 1.3 km, receiving 1,816 range measurements. This path highlights the effect of heading error on dead reckoning performance by turning in the same direction repeatedly. Again, spectral SLAM is able to recover the path and landmarks accurately.
VI. CONCLUSION
We proposed a novel formulation and solution for the range-only SLAM problem that differs substantially from previous approaches. The essence of this new approach is to formulate SLAM as a factorization problem, which allows us to derive a local-minimum-free spectral learning method that is closely related to SfM and spectral approaches to system identification. We provide theoretical guarantees for our algorithm, discuss how to derive an online algorithm, and show how to generalize to a full robot system identification algorithm. Finally, we demonstrate that our spectral approach to SLAM improves on other state-of-the-art SLAM approaches in real-world range-only SLAM problems.
REFERENCES
[1] Michael Biggs, Ali Ghodsi, Dana Wilkinson, and Michael Bowling. Action respecting embedding. In Proceedings of the Twenty-Second International Conference on Machine Learning, pages 65–72, 2005.
[2] Byron Boots and Geoff Gordon. Predictive state temporal difference learning. In Advances in Neural Information Processing Systems 23, pages 271–279, 2010.
[3] Byron Boots, Sajid M. Siddiqi, and Geoffrey J. Gordon. Closing the learning-planning loop with predictive state representations. In Proceedings of Robotics: Science and Systems VI, 2010.
[4] Byron Boots, Sajid Siddiqi, and Geoffrey Gordon. An online spectral learning algorithm for partially observable nonlinear dynamical systems. In Proceedings of the 25th National Conference on Artificial Intelligence (AAAI-2011), 2011.
[5] Matthew Brand. Fast low-rank modifications of the thin singular value decomposition. Linear Algebra and its Applications, 415(1):20–30, 2006.
[6] Joseph Djugash. Geolocation with range: Robustness, efficiency and scalability. PhD thesis, Carnegie Mellon University, 2010.
[7] Joseph Djugash and Sanjiv Singh. A robust method of localization and mapping using only range. In International Symposium on Experimental Robotics, July 2008.
[8] Joseph Djugash, Sanjiv Singh, and Peter Ian Corke. Further results with localization and mapping using range from radio. In International Conference on Field and Service Robotics (FSR ’05), July 2005.
[9] Brian Ferris, Dieter Fox, and Neil Lawrence. WiFi-SLAM using Gaussian process latent variable models. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI ’07), pages 2480–2485, 2007.
[10] Daniel Hsu, Sham Kakade, and Tong Zhang. A spectral algorithm for learning hidden Markov models. In COLT, 2009.
[11] George A. Kantor and Sanjiv Singh. Preliminary results in range-only localization and mapping. In Proceedings of the IEEE Conference on Robotics and Automation (ICRA ’02), volume 2, pages 1818–1823, May 2002.
[12] Athanasios Kehagias, Joseph Djugash, and Sanjiv Singh. Range-only SLAM with interpolated range data. Technical Report CMU-RI-TR-06-26, Robotics Institute, May 2006.
[13] Derek Kurth, George A. Kantor, and Sanjiv Singh. Experimental results in range-only localization with radio. In 2003 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS ’03), volume 1, pages 974–979, October 2003.
[14] Michael Montemerlo, Sebastian Thrun, Daphne Koller, and Ben Wegbreit. FastSLAM: A factored solution to the simultaneous localization and mapping problem. In Proceedings of the AAAI National Conference on Artificial Intelligence, pages 593–598, 2002.
[15] Matthew Rosencrantz, Geoffrey J. Gordon, and Sebastian Thrun. Learning low dimensional predictive representations. In Proc. ICML, 2004.
[16] Sajid Siddiqi, Byron Boots, and Geoffrey J. Gordon. Reduced-rank hidden Markov models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS-2010), 2010.
[17] G. W. Stewart and Ji-Guang Sun. Matrix Perturbation Theory. Academic Press, 1990.
[18] Sebastian Thrun, Wolfram Burgard, and Dieter Fox. Probabilistic Robotics. The MIT Press, 2005.
[19] Michael E. Tipping and Chris M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B, 61:611–622, 1999.
[20] Carlo Tomasi. Shape and motion from image streams under orthography: a factorization method. International Journal of Computer Vision, 9:137–154, 1992.
[21] P. Van Overschee and B. De Moor. Subspace Identification for Linear Systems: Theory, Implementation, Applications. Kluwer, 1996.
[22] Takehisa Yairi. Map building without localization by dimensionality reduction techniques. In Proceedings of the 24th International Conference on Machine Learning (ICML ’07), pages 1071–1078, 2007.
[Djugash & Singh, 2008]
of the expected state at time t+1 around the current MAP state s̄_t, given the current action a_t:

s_{t+1} − s̄_t ≈ N [φ(s̄_t, a_t) + (dφ/ds)|_{s̄_t} (s_t − s̄_t)]   (23)

(dφ/ds)|_{s̄} = (s̄ ⊗ I + I ⊗ s̄) ⊗ a_t ⊗ a_t   (24)
We simply plug this Taylor approximation into the standardKalman filter motion update (e.g., [18]).
V. EXPERIMENTAL RESULTS
We perform several SLAM and robot navigation experi-ments to illustrate and test the ideas proposed in this paper.First we show how our methods work in theory with syntheticexperiments where complete observations are received at eachpoint in time and i.i.d. noise is sampled from a multivariateGaussian distribution. Next we illustrate our algorithm on datacollected from a real-world robotic system with substantialamounts of missing data.
Experiments were performed in Matlab, on a 2.66 GHz Intel Core i7 computer with 8 GB of RAM. The spectral learning methods described in this paper are very fast, usually taking only a few seconds to run. Code and data for these experiments are available at [link will become available when paper is published].
A. Synthetic Experiments

First we evaluate our spectral SLAM algorithm on simulated range-only data. Our simulator randomly places 6 landmarks in a 2-D environment. A robot then randomly moves through the environment for 500 time steps and receives a range reading to each one of the landmarks at each time step. The range readings are perturbed by noise sampled from a Gaussian distribution with variance equal to 1% of the range. Given this data, we apply the algorithm from Section III-C to solve the SLAM problem. We use the coordinates of 4 landmarks to learn the linear transform S and recover the true state space. Results are shown in Figure 2A. The results indicate that we can accurately recover both the landmark locations and the robot path.
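Why range-only SLAM admits a spectral solution can be seen in a generic sketch (this is not the paper's exact Section III-C algorithm, which uses a rank-7 factorization for a richer measurement model): in the planar, noiseless case, the matrix of squared ranges factors as a rank-4 product of a robot-path factor and a landmark factor, so a truncated SVD recovers both up to a linear transform S.

```python
import numpy as np

rng = np.random.default_rng(0)
T, L = 500, 6                                   # time steps and landmarks, as in the text
path = np.cumsum(0.1 * rng.standard_normal((T, 2)), axis=0)  # random 2-D robot path
lm = rng.uniform(-5.0, 5.0, (L, 2))             # random landmark positions

# Squared ranges factor exactly:
#   ||p_t - l_j||^2 = [1, ||p_t||^2, p_tx, p_ty] . [||l_j||^2, 1, -2 l_jx, -2 l_jy]
# so the T x L matrix D has rank at most 4.
D = ((path[:, None, :] - lm[None, :, :]) ** 2).sum(-1)

U, s, Vt = np.linalg.svd(D, full_matrices=False)
D_hat = U[:, :4] * s[:4] @ Vt[:4]               # rank-4 truncation reconstructs D
```

With noisy ranges the singular values past the fourth become small but nonzero, and the truncation acts as a denoiser; the retained factors give path and map up to the linear transform S that the paper fixes using known landmark coordinates.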
We also investigated the empirical convergence rate of our observation model (and therefore the map) as the number of range readings increased. To do so, we generated 1000 different random pairs of environments and robot paths. For each pair, we repeatedly performed our spectral SLAM algorithm on increasingly large numbers of range readings and looked at the difference between our estimated measurement model and the true measurement model, $\|\hat{C} - C\|_F$. The results are shown in Figure 2B, and show that our estimates steadily converge to the true model.
B. Autonomous Lawn Mower

We next evaluate the performance of our spectral SLAM algorithm on two freely available range-only SLAM data sets collected from an autonomous lawn mowing robot [6]. (The lawn mowing robot is shown in Figure 4A.)2 These “Plaza”
2http://www.frc.ri.cmu.edu/projects/emergencyresponse/RangeData/index.html
Method                          Plaza 1   Plaza 2
Dead Reckoning (full path)      15.92m    27.28m
Cartesian EKF (last 10%)         0.94m     0.92m
Batch Optimization (last 10%)    0.68m     0.96m
FastSLAM (last 10%)              0.73m     1.14m
ROP EKF (last 10%)               0.65m     0.87m
Spectral SLAM (worst 10%)        1.17m     0.51m
Spectral SLAM (best 10%)         0.28m     0.22m
Spectral SLAM (full path)        0.39m     0.35m

Fig. 3. Comparison of Range-Only SLAM Algorithms (Localization RMSE). Boldface entries are best in column.
datasets were collected via radio nodes from Multispectral Solutions that use time-of-flight of ultra-wide-band signals to provide inter-node ranging measurements. This system produces a time-stamped range estimate between the mobile robot and stationary nodes (landmarks) in the environment. The landmark radio nodes are placed atop traffic cones approximately 138cm above the ground throughout the environment, and one node was placed on top of the center of the robot’s coordinate frame (also 138cm above the ground). The robot odometry (dead reckoning) comes from an onboard fiber optic gyro and wheel encoders. The two environmental setups, including the location of the landmarks, the dead reckoning path, and the ground truth path (calculated within 2cm accuracy via GPS), are shown in Figure 4B-C. For details see [6].
The two data sets were very sparse, with approximately 11 time steps (and up to 500 steps) between range readings for the worst-case landmark. We first interpolated the missing range readings with the method of Section III-C4. Then we applied the rank-7 spectral SLAM algorithm of Section III-C; the results are depicted in Figure 4B-C. Qualitatively, we see that the robot’s localization path conforms to the true path.
In addition to the qualitative results, we quantitatively compared spectral SLAM to a number of different competing range-only SLAM algorithms (the results we compared to are summarized in [6]). These algorithms included a Cartesian EKF, a batch optimization technique [12] run for 10,000 iterations, FastSLAM [14], and a Relative Over-Parameterized EKF (ROP-EKF) [7], a state-of-the-art range-only SLAM algorithm. The localization root mean squared error (RMSE) in meters for each SLAM algorithm is shown in Figure 3. Previous results only reported the RMSE for the last 10% of the path, which is generally the best 10% of the path (since it gives the most time to recover from initialization problems); this is a very favorable statistic for competing SLAM methods. The full-path localization error can be considerably worse, particularly for the initial portion of the path; see Fig. 5 (right) of [7].
To compare to these results, we report the RMSE for our spectral algorithm when restricting the data set to the best or worst 10% of the path, as well as the RMSE for the full path. The results show that our spectral SLAM algorithm beats, often by a significant margin, the best previous methods on the Plaza datasets.
Geoff Gordon—UT Austin—Feb, 2012
Structure from motion
26
[Tomasi & Kanade, 1992]
Video textures
• Kalman filter works for some video textures
‣ steam grate example above
‣ fountain:
27
observation = raw pixels (vector of reals over time)
Video textures, redux
Kalman Filter PSR
28
both models: 10 latent dimensions
Original
Learning in the loop: option pricing
• Price a financial derivative: “psychic call”
‣ holder gets to say “I bought call 100 days ago”
‣ underlying stock follows Black Scholes (unknown parameters)
29
payoff: pt / pt–100
Option pricing
• Solution [Van Roy et al.]: use policy iteration
‣ 16 hand-picked features (e.g., poly ⋅ history)
‣ initialize policy arbitrarily
‣ least-squares temporal differences (LSTD) to estimate value function
‣ policy := greedy; repeat
30
payoff: pt / pt–100
Option pricing
• Better solution: spectral SSID inside policy iteration
‣ 16 original features from Van Roy et al.
‣ 204 additional “low-originality” features
‣ e.g., linear fns of price history of underlying
‣ SSID picks best 16-d dynamics to explain feature evolution
‣ solve for value function in closed form
31
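A minimal sketch of the LSTD step named above, on a toy two-state chain rather than the option-pricing setup (the features and rewards here are invented for illustration): LSTD solves the linear system $A w = b$ with $A = \Phi^T(\Phi - \gamma \Phi')$ and $b = \Phi^T R$.

```python
import numpy as np

def lstd(Phi, Phi_next, R, gamma, ridge=1e-6):
    """Least-squares temporal differences: find w such that
    Phi @ w approximately satisfies the Bellman equation R + gamma * Phi_next @ w."""
    A = Phi.T @ (Phi - gamma * Phi_next)
    b = Phi.T @ R
    return np.linalg.solve(A + ridge * np.eye(A.shape[1]), b)

# sanity check: two-state chain where both states transition to state 1,
# reward 1 in state 0 and 0 in state 1; true values V(0) = 1, V(1) = 0
gamma = 0.9
Phi      = np.array([[1.0, 0.0], [0.0, 1.0]] * 50)   # one-hot state features
Phi_next = np.array([[0.0, 1.0], [0.0, 1.0]] * 50)   # successor features
R        = np.array([1.0, 0.0] * 50)
w = lstd(Phi, Phi_next, R, gamma)                    # w approximates [1, 0]
```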
Policy iteration w/ spectral learning
32
[Figure 1, panels A–D: panels A–C plot estimated value vs. state for LSTD, PSTD, and LARS-TD; panel D plots expected payoff per $ of underlying vs. iterations of policy iteration for LSTD (16), LSTD, LARS-TD, PSTD, and the threshold strategy.]
Figure 1: Experimental Results. Error bars indicate standard error. (A.) Estimating the value function with a small number of informative features. All three approaches do well. (B.) Estimating the value function with a small set of informative features and a large set of random features. LARS-TD is designed for this scenario and dramatically outperforms PSTD and LSTD. (C.) Estimating the value function with a large set of semi-informative features. PSTD is able to determine a small set of compressed features that retain the maximal amount of information about the value function, outperforming LSTD and LARS-TD. (D.) Pricing a high-dimensional derivative via policy iteration. At each iteration the current policy was executed on 10,000 stock price trajectories sampled from the stationary distribution. The y-axis is expected reward for the current policy executed on the test set at each iteration. The optimal threshold strategy (sell if price is above a threshold [23]) is in black, LSTD (16 canonical features) is in blue, LSTD (on the full 240 features) is in cyan, LARS-TD (feature selection from the set of 240) is in green, and PSTD (16 dimensions, compressing 240 features) is in red. PSTD outperforms the competing approaches.
holding the contract or exercise. We consider the financial derivative introduced by Tsitsiklis and Van Roy [23]. The derivative generates payoffs that are contingent on the prices of a single stock. At the end of a given day, the holder may opt to exercise. At the time of exercise the contract is terminated, and a payoff is received in an amount equal to the current price of the stock divided by the price 100 days beforehand. The stock price is modeled as a geometric Brownian motion with volatility $\sigma = 0.02$, and there is a continuously compounded short-term interest rate $\rho = 0.0004$ (corresponding to an annual interest rate of $\sim 10\%$). In more detail, if $w_t$ is a standard Brownian motion, then the stock price $p_t$ evolves as $dp_t = \rho p_t\,dt + \sigma p_t\,dw_t$, and we can summarize the relevant state at the end of each day as a discrete-time process $\{x_t \mid t = 0, 1, \ldots\} \subset \mathbb{R}^{100}$ with

$x_t = \left[ \frac{p_{t-99}}{p_{t-100}}, \frac{p_{t-98}}{p_{t-100}}, \ldots, \frac{p_t}{p_{t-100}} \right]^T.$

The $i$th dimension $x_t(i)$ represents the amount a \$1 investment in the stock at time $t-100$ would grow to at time $t-100+i$. This process is Markov and ergodic [23, 24]: $x_t$ and $x_{t+100}$ are independent and identically distributed. The immediate reward for exercising the stock is $G(x) = x(100)$, the immediate reward for continuing to hold the stock is 0, and the discount factor $\gamma = e^{-\rho}$ is determined by the interest rate. The value of the derivative security is given by $\sup_t \mathbb{E}[\gamma^t G(x_t)]$. Our goal is to calculate an approximate value function, and then use this value function to generate a stopping time $t^* = \min\{t \mid G(x_t) \geq V^*(x_t)\}$. To do so, we sample a sequence of 1,000,000 states $x_t \in \mathbb{R}^{100}$ and calculate features $\phi^H$ of each state. We then perform policy iteration on this sample, alternately estimating the value function under a given policy and then using this value function to define a new greedy policy “stop if $G(x_t) \geq w^T \phi^H(x_t)$.”
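The price process and state vector just defined can be simulated directly; here is a sketch under the stated parameters (σ = 0.02, ρ = 0.0004, one step per day), using a simple Euler discretization of the geometric Brownian motion.

```python
import numpy as np

rho, sigma = 0.0004, 0.02        # daily short rate and volatility from the text
rng = np.random.default_rng(0)

T = 1000
p = np.empty(T)
p[0] = 1.0
for t in range(1, T):            # Euler step of dp = rho*p*dt + sigma*p*dw, dt = 1 day
    p[t] = p[t - 1] * (1.0 + rho + sigma * rng.standard_normal())

def state(p, t):
    """x_t in R^100 with x_t(i) = p_{t-100+i} / p_{t-100}: the growth
    of a $1 investment made at time t-100, tracked day by day."""
    return p[t - 99 : t + 1] / p[t - 100]

x = state(p, 500)
G = x[-1]                        # exercise payoff G(x) = x(100) = p_t / p_{t-100}
```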
Within the above strategy, we have two main choices: which features to use, and how to estimate the value function in terms of these features. For value function estimation, we used LSTD, LARS-TD, or PSTD. In each case we re-used our 1,000,000-state sample trajectory for all iterations: we start at the beginning and follow the trajectory as long as the policy chooses the “continue” action, with reward 0 at each step. When the policy executes the “stop” action, the reward is G(x) and the next state’s features are all 0; we then restart the policy 100 steps in the future, after the process has fully mixed. For feature selection, we are fortunate: previous researchers have hand-selected a “good” set of 16 features for this data set through repeated trial and error (see Appendix, Section D, and [23, 24]). We greatly expand this set of features, then use PSTD to synthesize a small set of high-quality combined features and calculate the value function more accurately. Specifically, we add the entire 100-step sample trajectory, the squares of the sample trajectory, and several additional nonlinear features, increasing the total number of features from 16 to 240. We use histories of length 1, tests of length 3, and (for comparison’s sake) we choose a linear dimension of 16.
Iterations of PI
0.82¢/$ better than best competitor; 1.33¢/$ better than best previously published
data: 1,000,000 time steps
Hilbert Space Embeddings of Hidden Markov Models
[Figure 4 panels: (A) the slot car, IMU, and racetrack; (B) average prediction error vs. prediction horizon (0–100) for Embedded, RR-HMM, LDS, HMM, Mean, and Last.]
Figure 4. Slot car inertial measurement data. (A) The slot car platform and the IMU (top) and the racetrack (bottom). (B) Squared error for prediction with different estimated models and baselines.
this data while the slot car circled the track controlled by a constant policy. The goal of this experiment was to learn a model of the noisy IMU data, and, after filtering, to predict future IMU readings.
We trained a 20-dimensional embedded HMM using Algorithm 1 with sequences of 150 consecutive observations (Section 3.8). The bandwidth parameter of the Gaussian RBF kernels is set with the ‘median trick’. The regularization parameter λ is set to 10⁻⁴. For comparison, a 20-dimensional RR-HMM with Parzen windows is also learned with sequences of 150 observations; a 20-dimensional LDS is learned using Subspace ID with Hankel matrices of 150 time steps; and finally, a 20-state discrete HMM (with 400 levels of discretization for observations) is learned using the EM algorithm run until convergence.
For each model, we performed filtering for different extents t1 = 100, 101, . . . , 250, then predicted an image which was a further t2 steps in the future, for t2 = 1, 2, . . . , 100. The squared error of this prediction in the IMU’s measurement space was recorded, and averaged over all the different filtering extents t1 to obtain the means plotted in Figure 4(B). Again the embedded HMM learned by the kernel spectral algorithm yields lower prediction error than each of the alternatives, consistently over the duration of the prediction horizon.
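The evaluation protocol described above can be sketched generically. `filter_predict` below is a hypothetical stand-in for any model's filter-then-forecast routine; here we plug in a trivial last-observation baseline just to exercise the loop.

```python
import numpy as np

def eval_prediction_error(obs, filter_predict, t1_range, t2_range):
    """Filter on obs[:t1], predict the observation t2 steps further on,
    record the squared error, and average over all filtering extents t1."""
    err = np.zeros(len(t2_range))
    for t1 in t1_range:
        for k, t2 in enumerate(t2_range):
            pred = filter_predict(obs[:t1], t2)        # t2-step-ahead forecast
            err[k] += np.sum((pred - obs[t1 + t2 - 1]) ** 2)
    return err / len(t1_range)

# e.g. a sine-wave signal with a "predict the last observation" baseline,
# using the t1 and t2 ranges from the text
obs = np.sin(np.arange(400))[:, None]
err = eval_prediction_error(obs, lambda history, t2: history[-1],
                            range(100, 251), range(1, 101))
```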
4.3. Audio Event Classification

Our final experiment concerns an audio classification task. The data, recently presented in (Ramos et al., 2010), consisted of sequences of 13-dimensional Mel-Frequency Cepstral Coefficients (MFCC) obtained from short clips of raw audio data recorded using a portable sensor device. Six classes of labeled audio clips were present in the data, one being Human Speech. For this experiment we grouped the latter five classes into a single class of Non-human sounds to formulate a binary Human vs. Non-human classification task. Since the original data had a disproportionately large amount of Human Speech samples, this grouping resulted in a more balanced dataset with 40 minutes
[Figure 5: classification accuracy (%) vs. latent space dimensionality (10–50) for HMM, LDS, RR-HMM, and Embedded.]
Figure 5. Accuracies and 95% confidence intervals for Human vs. Non-human audio event classification, comparing embedded HMMs to other common sequential models at different latent state space sizes.
11 seconds of Human and 28 minutes 43 seconds of Non-human audio data. To reduce noise and training time we averaged the data every 100 timesteps (corresponding to 1 second) and downsampled.
For each of the two classes, we trained embedded HMMs with 10, 20, . . . , 50 latent dimensions using spectral learning and Gaussian RBF kernels with bandwidth set with the ‘median trick’. The regularization parameter λ is set to 10⁻¹. For efficiency we used random features for approximating the kernel (Rahimi & Recht, 2008). For comparison, regular HMMs with axis-aligned Gaussian observation models, LDSs, and RR-HMMs were trained using multi-restart EM (to avoid local minima), stable Subspace ID, and the spectral algorithm of (Siddiqi et al., 2009) respectively, also with 10, . . . , 50 latent dimensions or states.
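The ‘median trick’ mentioned above is commonly implemented as setting the Gaussian RBF bandwidth to the median pairwise distance in the training data; a minimal sketch:

```python
import numpy as np

def median_trick_bandwidth(X):
    """RBF bandwidth = median of the pairwise Euclidean distances in X."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    d = np.sqrt(d2[np.triu_indices_from(d2, k=1)])   # distinct pairs only
    return np.median(d)

X = np.random.default_rng(0).standard_normal((200, 13))  # e.g. MFCC-sized vectors
sigma = median_trick_bandwidth(X)
K = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1) / (2 * sigma ** 2))
```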
For RR-HMMs, regular HMMs, and LDSs, the class-conditional data sequence likelihood is the scoring function for classification. For embedded HMMs, the scoring function for a test sequence $x_{1:t}$ is the log of the product of the compatibility scores for each observation, i.e. $\sum_{\tau=1}^{t} \log \left\langle \varphi(x_\tau), \mu_{X_\tau \mid x_{1:\tau-1}} \right\rangle_{\mathcal{F}}$.
For each model size, we performed 50 random 2:1 partitions of data from each class and used the resulting datasets for training and testing respectively. The mean accuracy and 95% confidence intervals over these 50 randomizations are reported in Figure 5. The graph indicates that embedded HMMs have higher accuracy and lower variance than other standard alternatives at every model size. Though other learning algorithms for HMMs and LDSs exist, our experiment shows this to be a non-trivial sequence classification problem where embedded HMMs significantly outperform commonly used sequential models trained using typical learning and model selection methods.
5. Conclusion
We proposed a Hilbert space embedding of HMMs that extends traditional HMMs to structured and non-Gaussian continuous observation distributions. The
Slot-car IMU data
[Ko & Fox, 2009; Boots et al., 2010]
Kalman < HSE-HMM < manifold
34
Manuscript under review by AISTATS 2012
[Figure 3 panels: (A) the slot car with IMU and racetrack; (B) car locations predicted by different state spaces (Kalman filter / linear kernel, HSE-HMM / Gaussian RBF kernel, manifold HSE-HMM / Laplacian eigenmap kernel); (C) RMS error vs. prediction horizon (0–100) for Manifold-HMM, HSE-HMM, Kalman filter, HMM (EM), Mean Obs., and Previous Obs.]
Figure 3: Slot car with inertial measurement unit (IMU). (A) The slot car platform: the car and IMU (top) and racetrack (bottom). (B) A comparison of training data embedded into the state space of three different learned models. The red line indicates the true 2-d position of the car over time; blue lines indicate the prediction from state space. The top graph shows the Kalman filter state space (linear kernel), the middle graph shows the HSE-HMM state space (Gaussian RBF kernel), and the bottom graph shows the manifold HSE-HMM state space (LE kernel). The LE kernel finds the best representation of the true manifold. (C) Root mean squared error for prediction (averaged over 250 trials) with different estimated models. The HSE-HMM significantly outperforms the other learned models by taking advantage of the fact that the data we want to predict lies on a manifold.
structing a manifold, none of these papers compares the predictive accuracy of its model to state-of-the-art dynamical system identification algorithms.

7 Conclusion

In this paper we propose a class of problems called two-manifold problems, where two sets of corresponding data points, generated by a single latent manifold, lie on or near two different higher dimensional manifolds. We design algorithms by relating two-manifold problems to cross-covariance operators in RKHS, and show that these algorithms result in a significant improvement over standard manifold learning approaches in the presence of noise. This is an appealing result: manifold learning algorithms typically assume that observations are (close to) noiseless, an assumption that is rarely satisfied in practice.

Furthermore, we demonstrate the utility of two-manifold problems by extending a recent dynamical system identification algorithm to learn a system with a state space that lies on a manifold. The resulting algorithm learns a model that outperforms the current state-of-the-art in terms of predictive accuracy. To our knowledge this is the first combination of system identification and manifold learning that accurately identifies a latent time series manifold and is competitive with the best system identification algorithms at learning accurate models.
RMS prediction error (XY position) vs. horizon; comparison of kernels
Geoff Gordon—MURI review—Nov, 2011
Algorithm intuition
• Use separate one-manifold learners to estimate structure in past, future
‣ result: noisy Gram matrices GX, GY
‣ “signal” is correlated between GX, GY
‣ “noise” is independent
• Look at eigensystem of GXGY
‣ suppresses noise, leaves signal
35
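A toy sketch of this intuition (the features and noise model below are invented for illustration): because the noise in GX and GY is independent while the signal is shared, the top eigenvector of GX GY stays aligned with the noise-free signal direction.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
z = np.sort(rng.uniform(0, 1, n))            # shared latent coordinate

def noisy_gram(noise_scale):
    """Gram matrix of simple features of z, corrupted by independent noise."""
    f = np.stack([z, z ** 2], axis=1) + noise_scale * rng.standard_normal((n, 2))
    return f @ f.T

GX, GY = noisy_gram(0.05), noisy_gram(0.05)  # two independent noise draws

# eigensystem of GX @ GY: the correlated signal survives the product,
# while the independent noise terms tend to cancel
vals, vecs = np.linalg.eig(GX @ GY)
top = np.real(vecs[:, np.argmax(np.abs(vals))])

# compare against the top eigenvector of the noise-free Gram matrix
f0 = np.stack([z, z ** 2], axis=1)
_, v0 = np.linalg.eigh(f0 @ f0.T)
alignment = abs(top @ v0[:, -1])             # both unit vectors; close to 1
```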
Summary
• General class of models for dynamical systems
• Fast, statistically consistent learning method
• Includes many well-known models & algorithms as special cases
‣ HMM, Kalman filter, n-gram, PSR, kernel versions
‣ SfM, subspace ID, Kalman video textures
• Also includes new models & algorithms that give state-of-the-art performance on interesting tasks
‣ range-only SLAM, PSR video textures, HSE-HMM
36
Papers
• B. Boots and G. Gordon. An Online Spectral Learning Algorithm for Partially Observable Nonlinear Dynamical Systems. AAAI, 2011.
• B. Boots and G. Gordon. Predictive state temporal difference learning. NIPS, 2010.
• B. Boots, S. M. Siddiqi, and G. Gordon. Closing the learning-planning loop with predictive state representations. RSS, 2010.
• L. Song, B. Boots, S. M. Siddiqi, G. Gordon, and A. J. Smola. Hilbert space embeddings of hidden Markov models. ICML, 2010. (Best paper)
37