

Behavioral Indoor Navigation With Natural Language Directions

X. Zang¹, M. Vázquez¹, J.C. Niebles¹, A. Soto², S. Savarese¹

¹ Stanford University, ² P. Universidad Católica de Chile

Acknowledgements
The Toyota Research Institute (TRI) provided funds to assist with this research, but this paper solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity. We are also thankful to G. Sepúlveda for his assistance with data collection.
Contact: [email protected]

References
[1] G. Sepulveda, J. C. Niebles, and A. Soto. A Deep Learning Based Behavioral Approach to Indoor Autonomous Navigation. ICRA'18.
[2] R. Brooks. A Robust Layered Control System for a Mobile Robot. IEEE J. Robot. Autom., 2(1):14–23, 1986.
[3] B. Kuipers. Modeling Spatial Knowledge. Cogn. Sci., vol. 2, 1978.
[4] C. Matuszek, D. Fox, and K. Koscher. Following Directions Using Statistical Machine Translation. HRI'10.
[5] H. Mei, M. Bansal, and M. R. Walter. Listen, Attend, and Walk: Neural Mapping of Navigational Instructions to Action Sequences. AAAI'16.
[6] O. Vinyals, L. Kaiser, T. Koo, S. Petrov, I. Sutskever, and G. E. Hinton. Grammar as a Foreign Language. CoRR abs/1412.7449, 2014.

Introduction
Man-made environments are composed of navigational structures, such as corridors or stairs, that in turn connect meaningful neighboring places, such as rooms or halls. We hypothesize that by providing robots with suitable abilities to understand the world at this semantic level, it is possible to build robust navigation systems.

Behavioral Navigation Approach
The proposed robot navigation approach [1] revisits the ideas of behavioral robot control [2] and early topological map representations [3]. In our approach, complex navigation routes are composed of simple, parameterized visuo-motor behaviors that leverage the semantic structure of indoor environments. These routes are generated by planning over a behavioral graph that encodes relations among navigational structures.
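The planning step above can be sketched as a graph search. The following is a minimal illustration, not the actual planner from [1]: the toy graph, place names, and behavior labels (or, cf, tl, er) are assumptions loosely modeled on the poster's example route, and breadth-first search stands in for whatever planner the real system uses.

```python
from collections import deque

# Hypothetical toy behavioral graph: each edge is a (next_place, behavior)
# pair. Place and behavior names mimic the poster's example route.
GRAPH = {
    "office_O1":    [("corridor_C1", "or")],   # out of the office, turn right
    "corridor_C1":  [("hall_H1", "cf")],       # follow the corridor
    "hall_H1":      [("corridor_C2", "tl")],   # turn left across the hall
    "corridor_C2":  [("entrance_EK1", "cf")],  # follow the corridor
    "entrance_EK1": [("kitchen_K1", "er")],    # enter on the right
}

def plan(graph, start, goal):
    """Breadth-first search returning an alternating place/behavior route."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        route = queue.popleft()
        place = route[-1]
        if place == goal:
            return route
        for nxt, behavior in graph.get(place, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(route + [behavior, nxt])
    return None  # goal unreachable

print(plan(GRAPH, "office_O1", "kitchen_K1"))
```

The returned route alternates places and behaviors, matching the sequence structure the model is trained to produce.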

[Figure: Behavioral Graph Representation. Nodes denote places in an example indoor environment (offices O1–O10, kitchen K1, R1, hall H1, corridors C1–C3) and their entrances (e.g., EO4, EK1); O1* marks the start location. Edges are labeled with behaviors such as or, cf, el, er, tl, and cs, and example routes from the start to places such as office O4, kitchen K1, and offices O7 and O8 are highlighted.]

“Go out of the office and turn right, pass the lockers on your left, cross the hall and turn left, pass a door on your right, and enter the kitchen on your right”

SEQ-TO-SEQ MODEL
Input: navigation commands in natural language
Output: a route over the behavioral graph, linearized as a token sequence:

office or ( ) corridor cf ( lockers pol ) hall tl ( ) corridor cf ( door por ) entrance er ( ) kitchen

Translating Natural Language
We estimate a function f that maps navigation commands to a sequence

S = < p1, b1, R1, p2, b2, R2, ..., pn >

of places p, behaviors b, and reference sequences R. Each reference sequence

R = < "(", l1, r1, l2, r2, ..., ")" >

is an ordered collection of reference actions r grounded on landmarks l. These references guide the execution of the preceding behavior b in S. As an initial proof of concept, we implemented the function f as a differentiable sequence-to-sequence deep learning model with attention [6]. We trained the model using an adaptation of the DeepMind Lab learning environment.
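The structure of S can be recovered from the model's flat output tokens with a small decoder. This is a sketch under assumptions: the helper `parse_route` is hypothetical, and it assumes every behavior is immediately followed by a parenthesized reference block, as in the example sequence shown with the model figure.

```python
def parse_route(tokens):
    """Decode a flat token sequence into (place, behavior, references) triples.

    References are (landmark, action) pairs taken from the tokens between
    "(" and ")"; an empty pair "( )" means the behavior needs no landmarks.
    """
    triples, i = [], 0
    while i < len(tokens):
        place = tokens[i]
        if i + 1 >= len(tokens):          # final place p_n carries no behavior
            triples.append((place, None, []))
            break
        behavior = tokens[i + 1]
        j = tokens.index(")", i + 2)      # end of this reference block
        flat = tokens[i + 3 : j]          # tokens between "(" and ")"
        refs = list(zip(flat[0::2], flat[1::2]))  # (landmark, action) pairs
        triples.append((place, behavior, refs))
        i = j + 1
    return triples

# The example target sequence from the poster:
seq = ("office or ( ) corridor cf ( lockers pol ) hall tl ( ) "
       "corridor cf ( door por ) entrance er ( ) kitchen").split()
for triple in parse_route(seq):
    print(triple)
```

Each triple pairs a place with the behavior that leaves it and the landmarks grounding that behavior, e.g. ("corridor", "cf", [("lockers", "pol")]).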

Experiments
We evaluated our approach on a dataset of 16,370 unique examples using 5-fold cross-validation. The target sequences in the dataset contained 40.18 elements on average (SD = 9.75).
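The evaluation setup can be sketched as follows. Both helpers are assumptions for illustration: the poster does not spell out its accuracy metric, so `exact_match_accuracy` (whole-sequence exact match) and the contiguous fold split in `k_fold_indices` stand in for the actual protocol.

```python
def exact_match_accuracy(predictions, targets):
    """Fraction of examples whose predicted sequence equals the target exactly."""
    hits = sum(p == t for p, t in zip(predictions, targets))
    return hits / len(targets)

def k_fold_indices(n, k=5):
    """Split example indices 0..n-1 into k contiguous folds (no shuffling here)."""
    size = n // k
    return [list(range(f * size, (f + 1) * size)) for f in range(k)]

# Tiny made-up illustration: one of two predicted routes matches its target.
targets = [["office", "or", "corridor"], ["hall", "tl", "corridor"]]
predictions = [["office", "or", "corridor"], ["hall", "tr", "corridor"]]
print(exact_match_accuracy(predictions, targets))
```

In the real experiments, accuracy would be computed on the held-out fold and averaged across the 5 folds.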

Following Navigation Instructions
Inspired by prior work [4, 5], we cast the problem of following natural-language navigation instructions as a machine translation task. We perform end-to-end learning to translate commands into navigational routes.

# GRU Units   Avg. Accuracy   Std. Dev.
64            99.53%          0.01
128           99.99%          0.0002
256           99.99%          0.0002

With fewer than 32 GRU units, performance dropped significantly and learning was unstable.