learning to perform described actions in a virtualhometingwuwang/2541.pdf · 2017. 3. 8. · xavier...

Learning to Perform Described Actions in a VirtualHome

TINGWU WANG, ML GROUP, U OF T
JOINT WORK WITH XAVIER PUIG, KEVIN RA, MARKO BOBEN, ANTONIO TORRALBA, AND SANJA FIDLER

Contents

1. Introduction

2. Dataset and Platform

3. Script Generation from Described Actions

4. Results

5. Concluding Remarks

Introduction

1. Background
   1. Target: autonomous agents (domestic robots)
   2. Understand human instructions
   3. Able to execute them correctly

Say we have this cute red-eyed robot in our house.

Introduction

1. How do robots "do" things?
   1. Robots use executable pseudo-code
   2. Stupid robots act according to
      1. predefined "Atomic Action Triplets"
      2. if smarter, download new sequences
   3. Clever robots learn and predict new sequences
      1. understand natural language
         "find a book and start to read"
         "give me a beer"
         "tell the salesman I am not here in the house!"
      2. understand teaching videos
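As a rough illustration of the "Atomic Action Triplets" idea, a script can be sketched as a list of (action, object, object) steps. The class name `AtomicAction`, the bracketed rendering, and the example steps for "find a book and start to read" are assumptions made for this sketch; they are not taken from the dataset.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AtomicAction:
    """One script step: an action plus up to two objects (hypothetical layout)."""
    action: str
    obj1: Optional[str] = None  # assumption: unused object slots are left empty
    obj2: Optional[str] = None

    def __str__(self) -> str:
        # Render as "[action] <obj1> <obj2>", skipping empty object slots.
        parts = [f"[{self.action}]"]
        parts += [f"<{o}>" for o in (self.obj1, self.obj2) if o is not None]
        return " ".join(parts)


# Hypothetical script for the instruction "find a book and start to read".
read_book: List[AtomicAction] = [
    AtomicAction("walk", "bookshelf"),
    AtomicAction("find", "book"),
    AtomicAction("grab", "book"),
    AtomicAction("read", "book"),
]


def render(script: List[AtomicAction]) -> str:
    """Join the steps of a script into one line per atomic action."""
    return "\n".join(str(step) for step in script)


if __name__ == "__main__":
    print(render(read_book))
```

A robot limited to predefined triplets would only replay lists like `read_book`; the learning problem in this talk is to predict such a list from a description.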

Dataset and Platform

1. Crowd-sourcing the Scripts for Tasks
   1. We crowd-source the scripts on AMT, and have them rechecked by a high-quality annotator via Upwork
2. Creating the Virtual Environment
   1. We exploit the Unity3D game engine to create our VirtualHome
      1. provide video ground truth data
      2. independent of the real robot platform
      3. execute the predicted actions

Dataset and Platform

1. Data statistics
   1. five rooms, three 'robots'
   2. more than 70 actions and 260 objects to interact with
2. Robots are able to act according to the predicted atomic actions

Script Generation from Described Actions

1. Sequence-to-sequence baseline
   1. each atomic triplet is a token
2. Attention decoder with a minimum number of parameters
   1. treat the transition from human natural language to atomic action sequences as language translation
   2. directly use w2v embeddings as attention?
3. w2v pretrained embedding
   1. note that an atomic triplet consists of one action and two objects
   2. limited data
4. Video model?
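The two representation choices on this slide can be sketched side by side: treating each whole triplet as one token (the sequence-to-sequence baseline), and building a triplet vector from the word vectors of its action and two objects. The second part is only one plausible reading of why pretrained w2v embeddings help with limited data; the helper names, the `"none"` padding word, and the 2-dimensional toy vectors are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# (action, object1, object2); the word "none" pads a missing object (assumption).
Triplet = Tuple[str, str, str]


def build_vocab(scripts: List[List[Triplet]]) -> Dict[Triplet, int]:
    """Give each distinct triplet its own integer id, so one triplet is one token."""
    vocab: Dict[Triplet, int] = {}
    for script in scripts:
        for triplet in script:
            vocab.setdefault(triplet, len(vocab))
    return vocab


def encode(script: List[Triplet], vocab: Dict[Triplet, int]) -> List[int]:
    """Turn a script into the id sequence a seq2seq decoder would be trained on."""
    return [vocab[triplet] for triplet in script]


def compose_embedding(triplet: Triplet,
                      word_vectors: Dict[str, List[float]]) -> List[float]:
    """Concatenate the word vectors of the action and both objects."""
    vector: List[float] = []
    for word in triplet:
        vector.extend(word_vectors[word])
    return vector


# Toy data: made-up 2-d vectors standing in for pretrained w2v embeddings.
word_vectors = {
    "walk": [1.0, 0.0],
    "grab": [0.0, 1.0],
    "book": [0.5, 0.5],
    "none": [0.0, 0.0],
}
scripts = [
    [("walk", "book", "none"), ("grab", "book", "none")],
    [("grab", "book", "none")],
]
vocab = build_vocab(scripts)

if __name__ == "__main__":
    print(encode(scripts[0], vocab))
    print(compose_embedding(("grab", "book", "none"), word_vectors))
```

With whole-triplet tokens, a combination never seen in training has no id at all; composing from word vectors lets triplets that share an action or object share parameters, which matters when data is limited.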

Script Generation from Described Actions

1. Proposed Model

Results

1. Text model

Results

1. Generating videos according to action prediction (limited time)

Concluding Remarks

1. Work submitted to CVPR '17
2. Future work
   1. Reinforcement Learning?
   2. Video Teaching?
   3. Zero-shot Learning?

Concluding Remarks

1. Q & A
