learning to perform described actions in a virtualhometingwuwang/2541.pdf · 2017. 3. 8. · xavier...

12
Learning to Perform Described Actions in a VirtualHome TINGWU WANG, ML GROUP, U OF T JOINT WORK WITH XAVIER PUIG, KEVIN RA, MARKO BOBEN, ANTONIO TORRALBA, AND SANJA FIDLER

Upload: others

Post on 25-Jan-2021

0 views

Category:

Documents


0 download

TRANSCRIPT

  • Learning to Perform Described Actions in a VirtualHome

    TINGWU WANG, ML GROUP, U OF TJO INT WORK WITH X A V I E R P U I G , K E V I N R A , M A R K O B O B E N , A N T O N I O T O R R A L B A , A N D S A N J A F I D L E R

  • Contents1. Introduction

    2. Dataset and Platform

    3. Script Generation from Described Actions

    4. Results

    5. Concluding Remarks

  • Introduction1. Background

    1. Target: autonomous agents (domestic robots)2. Understand human instructions 3. Able to execute them correctly

    say we have this cute red-eyed robot in our house.

  • Introduction1. How robots "do" things?

    1. Robot uses executable pseudo-code2. Stupid robots acts according to

    1. predefined "Atomic Action Triplets"2. if smarter, download new sequences

    3. Clever robots learn and predict new sequences1. understand natural language

    "find a book and start to read""give me a beer""tell the salesman I am not here in the house!"

    2. understand teaching videos

  • Dataset and Platform1. Crowd-sourcing the Scripts for Tasks

    1. We crowd-source the scripts on AMT, and have them rechecked with a high-quality annotator via Upwork

    2. Creating the Virtual Environment1. We exploit the Unity3D game engine to create our

    VirtualHome1. provide video ground truth data2. independent of the real robot platform3. excecute the predicted actions

  • Dataset and Platform1. Data statistics

    1. five rooms, three 'robots'2. more than 70 actions and 260 objects to interact

    2. Robots able to act according to the predicted atomic actions

  • Script Generation from Described Actions

    1. Sequence to sequence baseline1. each atomic triplet is a token.

    2. Attention decoder with minimum number of parameters1. treat the transition from human natural language to atomic

    action sequences as language translation2. directly using w2v embedding as attentions?

    3. w2v pretrained embedding1. note that atomic triplet consists of one action and two objects2. limited data

    4. Video model?

  • Script Generation from Described Actions

    1. Proposed Model

  • Results1. Text model

  • Results1. Generating videos according to action prediction (limited

    time)

  • Concluding Remarks1. Work submitted to CVPR 17'

    2. Future work1. Reinforcement Learning?2. Video Teaching?3. Zero-shot Learning?

  • Concluding Remarks1. Q & A