IBM Research
See, Hear, Do: Language and Robots
Jonathan Connell, Exploratory Computer Vision Group
Etienne Marcheret, Speech Algorithms & Engines Group
Sharath Pankanti (ECVG)
Josef Vopicka (Speech)
Far Reaching Research (FRR) Project
Challenge = Multi-modal instructional dialogs
Use speech, language, and vision to learn objects & actions
Innate perception abilities (objects / properties)
Innate action capabilities (navigation / grasping)
Easily acquire terms not knowable a priori
Example dialog:
“Round up my mug.”
“I don’t know how to ‘round up’ your mug.”
“Walk around the house and look for it. When you find it, bring it back to me.”
“I don’t know what your ‘mug’ looks like.”
“It is like this <shows another mug> but sort of orange-ish.”
“OK … I could not find your mug.”
“Try looking on the table in the living room.”
“OK … Here it is!”
This dialog exercises verb learning, command following, noun learning, and advice taking.
Language Learning & Understanding is an AAAI Grand Challenge:
http://www.aaai.org/aitopics/pmwiki/pmwiki.php/AITopics/GrandChallenges#language
Eldercare as an application
Example tasks:
Pick up a dropped phone
Get a blanket from another room
Bring me the book I was reading yesterday
Large potential market:
Many affluent societies have a demographic imbalance (Japan, EU, US)
Institutional care can be very expensive (to person, insurance, state)
A little help can go a long way:
Can be supplied immediately (no waiting list for admission)
Allows a person to stay at home longer (generally easier & less expensive)
Boosts independence and feeling of control (psychological advantage)
Note: we are not attempting to address the whole problem:
X Aggressive production cost containment
X Robust self-recharging and stair traversal
X Bathing and bathroom care, patient transfer, cooking
X OSHA, ADA, FDA, FCC, UL or CE certification
Novel approach: Linguistically-guided robots
Use language as the core of the operating system, not something tacked on after the fact
Interaction
Simple progress / error reporting (“I am entering the kitchen”)
Easy to request missing information (“Please tell me where X is located.”)
Clarification dialogs possible (“Which box did you want, red or blue?”)
Learning
Can direct attention to specific objects or areas (e.g. “this object”)
Can focus learning on relevant properties (e.g. color, location)
Less trial and error since richer feedback (i.e. faster acquisition)
Interface
Much easier than programming (textual or graphical)
More natural for unskilled users
Less effort for “one-off” activities
ELI the robot
Power supply:
528 Wh sealed lead-acid batteries
28 lbs for balancing counterweight
Estimated 4-5 hr run-time
Drive system:
Two-wheel differential steer
Two 4 in rear casters (blue)
47 in/sec (2.7 mph) top speed
Handles 10 deg slope, ½ in bumps
Motorized lift:
For arm & sensors (offset 27 in up)
Floor to 36 in (counter) range
16 in/sec = 2.3 sec bottom to top
Computation:
Platform for quad-core GPU laptop
Single USB cable for interface
Overall:
About 65 lbs total weight
Stable +/- 10 degs in any direction
15 in wide, 24 in long, 45-66 in tall
Joystick control video
Picking up a dropped object
eli_kitchen.wmv
Speech interaction video
Far-field speech interpretation
eli_voice.wmv
Detached Arm (for dialog development)
[Photo: color camera mounted above the arm, with OTC medications (Advil & Gaviscon) in the workspace]
Software:
Serial control code optimized
Joint control via manual gamepad
Inverse kinematic solver
Hardware:
Single color camera 25 in above surface
Arm = 3 positional DOF, wrist = 3 angular DOF
Gripper augmented with compliant closure
Workspace = 2 ft wide, 1 ft deep, +8/-2 in high
Speech manipulation video
Selecting and disambiguating objects
eli_table.wmv
Dialog phenomena handled:
“Grab it.” (1 object)
<grabs object> (no confusion, since there is only one choice for “it”)
“Grab it.” (4 objects)
“I'm confused. Which of the 4 things do you mean?” (knows a unique target is required)
“What color is the object on the left?” (4 objects)
“It’s blue.” (understands positions & colors)
“Grab it.” (4 objects)
<grabs blue object> (uses “it” from the previous interaction; sketched below)
“Grab that object.” (human points)
<grabs object> (understands human gesture)
“Grab the white thing.” (2 white objects)
“Do you mean this one?” <robot points> (uses gesture to suggest an alternative)
“No, the other one.”
<grabs other object> (uses “other” from the previous interaction)
“Grab the green thing.”
“Sorry, that’s too big for me.” (sensitive to physical constraints)
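A minimal sketch of the discourse state behind the “it” / “the other one” behavior above; the class and method names are hypothetical, not the project's actual code:

```python
class ReferenceResolver:
    """Tiny discourse state for resolving "it" and "the other one"."""

    def __init__(self):
        self.last_referent = None               # object chosen in the previous exchange

    def resolve(self, phrase, visible):
        if phrase == "it":
            if self.last_referent is not None:
                return self.last_referent       # "it" from the previous interaction
            if len(visible) == 1:
                return visible[0]               # only one choice, no confusion
            return None                         # ambiguous: ask "Which of the N things...?"
        if phrase == "the other one":
            others = [o for o in visible if o != self.last_referent]
            return others[0] if len(others) == 1 else None
        return None

    def commit(self, obj):
        self.last_referent = obj                # remember for the next "it"
```

Committing the chosen object after each successful selection is what lets the later bare “Grab it” succeed without a clarification question.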
Noun Learning Scenario
eli_noun_sub.wmv
Features:
Automatically finds objects
Selects by position, size, color
Understands user pointing
Robot points for emphasis
Grabs selected object
Passes object to/from user
Adds new nouns to grammar (see the sketch after this list)
Builds visual models
Identifies objects from models
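A hypothetical sketch (names are illustrative, not the project's code) of the “adds new nouns to grammar” step: learning a noun extends the recognizer's word list and the visual-model store together, so they stay in sync.

```python
class NounLexicon:
    def __init__(self):
        self.grammar_words = set()   # nouns the ASR grammar will accept
        self.models = {}             # noun -> visual model (feature vector)

    def learn(self, noun, feature_vec):
        if noun not in self.grammar_words:
            self.grammar_words.add(noun)       # stand-in for extending the ASR grammar
        self.models[noun] = list(feature_vec)  # (re)build the visual model for this noun

lex = NounLexicon()
lex.learn("aspirin", [0.30, 1.2, 0.08, 0.7])   # "Eli, that is aspirin."
```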
Multi-modal Dialog Script
1. “Eli, what is the object on the left?”
No existing visual model matches object
“I don’t know.”
2. “Eli, that is aspirin.”
New word added to grammar
New visual model for object
“Okay. This <points> is aspirin.”
3. “Eli, this object <points> is Advil.”
Word already known
New visual model for object
“Okay. That is Advil.”
4. “Eli, how many Advil do you see?”
Uses existing visual model to find item(s)
“I see two.”
Model = size + shape + colors
Matching = nearest neighbor: dist = Σ w[i] · |v[i] − m[i]|
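Given the model above (size + shape + colors as a feature vector) and the weighted L1 distance from the slide, a minimal nearest-neighbor matcher might look like the following; the feature layout, weights, and values are illustrative assumptions, not the project's actual features:

```python
# dist = sum_i w[i] * |v[i] - m[i]|, smallest distance wins
def match(v, models, w):
    """Return (best_noun, dist) for feature vector v against stored models."""
    best, best_d = None, float("inf")
    for noun, m in models.items():
        d = sum(wi * abs(vi - mi) for wi, vi, mi in zip(w, v, m))
        if d < best_d:
            best, best_d = noun, d
    return best, best_d

# Example: v = [size, elongation, hue, saturation] extracted from a detected object
models = {"aspirin": [0.30, 1.2, 0.08, 0.7], "Advil": [0.28, 1.1, 0.95, 0.8]}
print(match([0.29, 1.15, 0.90, 0.75], models, w=[1.0, 0.5, 2.0, 1.0]))  # -> ('Advil', ...)
```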
Multi-modal Dialog Script (continued)
5. “Eli, give me the Tylenol.”
Uses existing visual model to find item(s)
<gets bottle> “Here you go.”
Waits for user hand motion
<releases>
Waits for user hand motion
<regrabs bottle> “Thanks.”
<replaces bottle>
6. “Eli, where is the aspirin?”
Uses existing visual model to find item(s)
“Here.” <points>
Eli Robot at Watson + Brainy Response System at Tokyo
[Diagram: on the Watson side, the Eli robot's ASR, parser, talk, vision, kinematics, sequencer, and reasoning modules, backed by action models, visual models, semantic memory, and vocabulary; on the Tokyo side, the Brainy Response System's lifelog with archive and retrieval. The robot sends context updates over the network and receives vetoes and recommendations back.]
Collaboration with Tokyo Research Lab
Principal researchers: Michiharu Kudoh, Risa Nishiyama
“BRAINS” project goal: make the robot respond appropriately, as if it understands social rules
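The diagram implies a simple request/reply cycle over the network. A hedged sketch of what one exchange could look like; the JSON field names and data are assumptions for illustration, not the project's actual protocol:

```python
import json

def handle_context_update(msg, contraindications, alternatives):
    """BRAINS side: map a context update to a veto and/or recommendation."""
    event = json.loads(msg)
    reply = {"veto": False, "recommendation": None}
    user, drug = event["user"], event["object"]
    if drug in contraindications.get(user, ()):
        reply["veto"] = True
        reply["recommendation"] = alternatives.get(drug)   # may be None
    return json.dumps(reply)

# Example round trip:
update = json.dumps({"user": "Alice", "intent": "give", "object": "aspirin"})
print(handle_context_update(update, {"Alice": {"aspirin"}}, {"aspirin": "Tylenol"}))
```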
Combined Demo
eli_bottles_sub.wmv
Features:
Learns object names
Learns object appearances
Grabs and passes objects
Vetoes actions based on DB
Picks alternates using ontology
Checks for valid dose interval
Real-time cloud connection
Combined Demo Script
1. “Eli, this <points> object is aspirin.”
New word added to grammar
New visual model for object
“Okay. That is aspirin.”
2. “Eli, the object on the right is called Tums.”
Word already known
New visual model for object
“Okay. This <points> is Tums.”
3. “Eli, give me some aspirin.”
Uses existing visual model to find item(s)
Check against personal database (sketched below)
“But that will hurt your stomach.”
[Diagram: the personal database for “Alice” returns NO for aspirin]
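A minimal sketch of the step-3 veto: a per-user table of contraindications with a spoken reason. The table contents and names are illustrative, not the project's actual data:

```python
PERSONAL_DB = {"Alice": {"aspirin": "will hurt your stomach"}}

def veto_reason(user, drug):
    """Return the spoken objection, or None if the request is fine."""
    reason = PERSONAL_DB.get(user, {}).get(drug)
    return f"But that {reason}." if reason else None

print(veto_reason("Alice", "aspirin"))  # -> But that will hurt your stomach.
print(veto_reason("Alice", "Tums"))     # -> None (no objection; proceed to fetch)
```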
Combined Demo Script (continued)
4. “Eli, give me some Tylenol instead.”
Uses existing visual model to find item(s)
<gets bottle> “Here you go.”
Waits for user hand motion
<releases>
Waits for user hand motion
<regrabs bottle> “Thanks.”
<replaces bottle>
Records dose in lifelog
5. “Eli, give me some Rolaids.”
No visual model for item
“I don’t know what Rolaids looks like.”
Ontology used to find available alternative(s) (sketched after this script)
“Do you want another antacid, Tums?”
6. “Eli, just give me some Tylenol.”
Uses existing visual model to find item(s)
Lifelog consulted for last dose
“You just had Tylenol.”
[Diagram: the ontology groups Rolaids (requested) and Tums (present) under antacid; the lifelog history ends with “9:01 AM took Tylenol”]
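Steps 5 and 6 combine an ontology lookup with a lifelog check. A minimal sketch, assuming a flat category ontology and a fixed minimum dosing interval (both illustrative assumptions):

```python
import datetime as dt

ONTOLOGY = {"Rolaids": "antacid", "Tums": "antacid", "Tylenol": "analgesic"}
MIN_INTERVAL = dt.timedelta(hours=4)   # assumed dosing interval

def alternative(requested, visible):
    """Step 5: offer a visible item from the same ontology class."""
    cls = ONTOLOGY.get(requested)
    return next((v for v in visible if v != requested and ONTOLOGY.get(v) == cls), None)

def dose_ok(drug, lifelog, now):
    """Step 6: check the lifelog for the last recorded dose of this drug."""
    last = max((t for t, d in lifelog if d == drug), default=None)
    return last is None or now - last >= MIN_INTERVAL

log = [(dt.datetime(2012, 6, 1, 9, 1), "Tylenol")]
print(alternative("Rolaids", ["Tums", "aspirin"]))               # -> Tums
print(dose_ok("Tylenol", log, dt.datetime(2012, 6, 1, 9, 30)))   # -> False ("You just had Tylenol.")
```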
Verb Learning Scenario
Features:
Handles relative motion commands
Responds to incremental positioning
Learns action sequences
Applies new actions to other objects
eli_verb_sub.wmv
Verb Learning Script
1. “Eli, poke the thing in the middle.”
Resolves visual target based on position
No existing action sequence to link
New action sequence “poke” opened for input
“I don’t know how to poke something.”
2. “Eli, point at it.”
Resolves pronoun from previous selection
Moves relative to visual target
<points> (recorded as: point 1.0)
3. “Eli, extend your hand.”
Low-level incremental move
<advances> (recorded as: out 1.0)
4. “Eli, retract your hand.”
Low-level incremental move
<retreats> (recorded as: out -1.0)
Verb Learning Script (continued)
5. “Eli, that is how you poke something.”
Recognizes closing of action block
Links action sequence to word
“Okay. Now I know how to poke something.”
6. “Eli, poke the red object.”
Resolves visual target based on color
Retrieves action sequence for verb and executes
<pokes>
7. “Eli, poke the object on the left.”
Resolves visual target based on position
Retrieves action sequence for verb and executes
<pokes>
8. “Eli, poke the Tylenol.”
Resolves visual target based on known object model
Retrieves action sequence for verb and executes
<pokes>
[Diagram: stored sequence for “poke” = point 1.0, out 1.0, out -1.0; a sketch of this record-and-replay mechanism follows]
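A hedged sketch of verb acquisition as record-and-replay of motion primitives, matching the point/out notation above; the class, primitive signatures, and example calls are assumptions, not the project's implementation:

```python
class VerbLearner:
    def __init__(self, primitives):
        self.primitives = primitives   # name -> callable(target, amount)
        self.verbs = {}                # learned verb -> list of (primitive, amount)
        self.recording = None          # (verb, steps) while an action block is open

    def begin(self, verb):             # "I don't know how to poke something."
        self.recording = (verb, [])

    def step(self, prim, amount, target):
        self.primitives[prim](target, amount)         # execute the taught move...
        if self.recording:
            self.recording[1].append((prim, amount))  # ...and log it

    def end(self):                     # "that is how you poke something"
        verb, steps = self.recording
        self.verbs[verb] = steps
        self.recording = None

    def run(self, verb, target):       # replay on any resolvable target
        for prim, amount in self.verbs[verb]:
            self.primitives[prim](target, amount)

prims = {"point": lambda t, a: print("point at", t),
         "out":   lambda t, a: print("hand out", a)}
vl = VerbLearner(prims)
vl.begin("poke")
vl.step("point", 1.0, "middle thing")   # point 1.0
vl.step("out", 1.0, "middle thing")     # out 1.0
vl.step("out", -1.0, "middle thing")    # out -1.0
vl.end()
vl.run("poke", "red object")            # replays the stored sequence on a new target
```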
Project Milestones
Year 1: Establishing the Language Framework (2011)
Table-top environment with off-the-shelf arm / cameras / mics
Visual detection & identification of objects
Visual servoing of arm to grasp objects
Speech-based naming of objects
Speech-based learning of motion routines
Year 2: Extension to Application Scenario (2012)
Port to mobile platform with on-board power & processing
Vision-based obstacle avoidance
Visual grounding for rooms / doors / furniture
Speech adaptation for different users & rooms
Speech-based place naming & fetch routines
Overcoming obstacles to widespread robotics
Perception: robots do not conceptualize the world as people do (e.g. what is an object?)
Solution: focus on nouns using partial scene segmentation; separate objects using depth boundaries and homogeneous regions; recognize them with interest points and bulk properties (see the segmentation sketch below)
Programming: it is hard to tell robots what to do short of C++ programming
Solution: use speech and (constrained) natural language; learn word associations to objects and places; simply remember spatial paths and action procedures
Cost: robots are too expensive for generic activities or personal use
Solution: substitute sensing and computation for precise mechanicals; use cameras only, not (low-volume) special-purpose sensors; use graphics processors (GPUs) instead of CPUs when possible
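A minimal sketch of segmentation from depth boundaries in the spirit of the perception solution; the jump threshold, 4-connectivity, and numpy-based flood fill are illustrative assumptions, not the project's implementation:

```python
import numpy as np

def object_candidates(depth, jump=0.05):
    """Label smooth depth regions separated by depth discontinuities."""
    h, w = depth.shape
    edge = np.zeros((h, w), bool)
    edge[:, 1:] |= np.abs(np.diff(depth, axis=1)) > jump   # horizontal depth jumps
    edge[1:, :] |= np.abs(np.diff(depth, axis=0)) > jump   # vertical depth jumps
    labels, cur = np.zeros((h, w), int), 0
    for sy in range(h):
        for sx in range(w):
            if labels[sy, sx] or edge[sy, sx]:
                continue
            cur += 1
            stack = [(sy, sx)]
            while stack:                                   # flood-fill one homogeneous region
                y, x = stack.pop()
                if 0 <= y < h and 0 <= x < w and not labels[y, x] and not edge[y, x]:
                    labels[y, x] = cur
                    stack += [(y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)]
    return labels   # 0 = boundary pixel, 1..cur = candidate object/background regions
```

Each labeled region then becomes a noun candidate whose bulk properties (size, shape, colors) feed the nearest-neighbor matching sketched earlier.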