IBM Research
See, Hear, Do: Language and Robots
Jonathan Connell, Exploratory Computer Vision Group
Etienne Marcheret, Speech Algorithms & Engines Group
Sharath Pankanti (ECVG)
Josef Vopicka (Speech)
Far Reaching Research (FRR) Project
Challenge = Multi-modal instructional dialogs
Use speech, language, and vision to learn objects & actions
Innate perception abilities (objects / properties)
Innate action capabilities (navigation / grasping)
Easily acquire terms not knowable a priori
Example dialog:
“Round up my mug.”
“I don’t know how to ‘round up’ your mug.”
“Walk around the house and look for it. When you find it, bring it back to me.”
“I don’t know what your ‘mug’ looks like.”
“It is like this <shows another mug> but sort of orange-ish.”
“OK … I could not find your mug.”
“Try looking on the table in the living room.”
“OK … Here it is!”
This dialog exercises verb learning, command following, noun learning, and advice taking.
Language Learning & Understanding is an AAAI Grand Challenge:
http://www.aaai.org/aitopics/pmwiki/pmwiki.php/AITopics/GrandChallenges#language
Eldercare as an application
Example tasks:
Pick up a dropped phone
Get a blanket from another room
Bring me the book I was reading yesterday
Large potential market:
Many affluent societies have a demographic imbalance (Japan, EU, US)
Institutional care can be very expensive (to person, insurance, state)
A little help can go a long way:
Can be supplied immediately (no waiting list for admission)
Allows a person to stay at home longer (generally easier & less expensive)
Boosts independence and feeling of control (psychological advantage)
Note: we are not attempting to address the whole problem:
X Aggressive production cost containment
X Robust self-recharging and stair traversal
X Bathing and bathroom care, patient transfer, cooking
X OSHA, ADA, FDA, FCC, UL or CE certification
Novel approach: Linguistically-guided robots
Use language as the core of the operating system, not something tacked on after the fact
Interaction
Simple progress / error reporting (“I am entering the kitchen”)
Easy to request missing information (“Please tell me where X is located.”)
Clarification dialogs possible (“Which box did you want, red or blue?”)
Learning
Can direct attention to specific objects or areas (e.g. “this object”)
Can focus learning on relevant properties (e.g. color, location)
Less trial and error since richer feedback (i.e. faster acquisition)
Interface
Much easier than programming (textual or graphical)
More natural for unskilled users
Less effort for “one-off” activities
ELI the robot
Power supply:
528 Wh sealed lead-acid batteries
28 lbs for balancing counterweight
Estimated 4-5 hr run-time
Drive system:
Two-wheel differential steer
Two 4 in rear casters (blue)
47 in/sec (2.7 mph) top speed
Handles 10 deg slope, ½ in bumps
Motorized lift:
For arm & sensors (offset 27 in up)
Floor to 36 in (counter) range
16 in/sec = 2.3 sec bottom to top
Computation:
Platform for quad-core GPU laptop
Single USB cable for interface
Overall:
About 65 lbs total weight
Stable +/- 10 degs in any direction
15 in wide, 24 in long, 45-66 in tall
Joystick control video
Picking up a dropped object
eli_kitchen.wmv
Speech interaction video
Far-field speech interpretation
eli_voice.wmv
Detached Arm (for dialog development)
[Photo: color camera mounted above the arm, with OTC medications (Advil & Gaviscon) in the workspace]
Software:
Serial control code optimized
Joint control via manual gamepad
Inverse kinematic solver
Hardware:
Single color camera 25 in above surface
Arm = 3 positional DOF, wrist = 3 angular DOF
Gripper augmented with compliant closure
Workspace = 2 ft wide, 1 ft deep, +8/-2 in high
Speech manipulation video
Selecting and disambiguating objects
eli_table.wmv
Dialog phenomena handled:
“Grab it.” (1 object)
<grabs object> (no confusion, since there is only one choice for “it”)
“Grab it.” (4 objects)
“I'm confused. Which of the 4 things do you mean?” (knows a unique target is required)
“What color is the object on the left?” (4 objects)
“It’s blue.” (understands positions & colors)
“Grab it.” (4 objects)
<grabs blue object> (uses “it” from the previous interaction; sketched below)
“Grab that object.” (human points)
<grabs object> (understands human gesture)
“Grab the white thing.” (2 white objects)
“Do you mean this one?” <robot points> (uses gesture to suggest an alternative)
“No, the other one.”
<grabs other object> (uses “other” from the previous interaction)
“Grab the green thing.”
“Sorry, that’s too big for me.” (sensitive to physical constraints)
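A minimal sketch of the discourse state behind the “it” / “the other one” behavior above; the class and method names are hypothetical, not the project's actual code:

```python
class ReferenceResolver:
    """Tiny discourse state for resolving "it" and "the other one"."""

    def __init__(self):
        self.last_referent = None               # object chosen in the previous exchange

    def resolve(self, phrase, visible):
        if phrase == "it":
            if self.last_referent is not None:
                return self.last_referent       # "it" from the previous interaction
            if len(visible) == 1:
                return visible[0]               # only one choice, no confusion
            return None                         # ambiguous: ask "Which of the N things...?"
        if phrase == "the other one":
            others = [o for o in visible if o != self.last_referent]
            return others[0] if len(others) == 1 else None
        return None

    def commit(self, obj):
        self.last_referent = obj                # remember for the next "it"
```

Committing the chosen object after each successful selection is what lets the later bare “Grab it” succeed without a clarification question.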
Noun Learning Scenario
eli_noun_sub.wmv
Features:
Automatically finds objects
Selects by position, size, color
Understands user pointing
Robot points for emphasis
Grabs selected object
Passes object to/from user
Adds new nouns to grammar (see the sketch after this list)
Builds visual models
Identifies objects from models
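A hypothetical sketch (names are illustrative, not the project's code) of the “adds new nouns to grammar” step: learning a noun extends the recognizer's word list and the visual-model store together, so they stay in sync.

```python
class NounLexicon:
    def __init__(self):
        self.grammar_words = set()   # nouns the ASR grammar will accept
        self.models = {}             # noun -> visual model (feature vector)

    def learn(self, noun, feature_vec):
        if noun not in self.grammar_words:
            self.grammar_words.add(noun)       # stand-in for extending the ASR grammar
        self.models[noun] = list(feature_vec)  # (re)build the visual model for this noun

lex = NounLexicon()
lex.learn("aspirin", [0.30, 1.2, 0.08, 0.7])   # "Eli, that is aspirin."
```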
Multi-modal Dialog Script
1. “Eli, what is the object on the left?”
No existing visual model matches object
“I don’t know.”
2. “Eli, that is aspirin.”
New word added to grammar
New visual model for object
“Okay. This <points> is aspirin.”
3. “Eli, this object <points> is Advil.”
Word already known
New visual model for object
“Okay. That is Advil.”
4. “Eli, how many Advil do you see?”
Uses existing visual model to find item(s)
“I see two.”
Model = size + shape + colors
Matching = nearest neighbor: dist = Σ w[i] · |v[i] − m[i]|
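Given the model above (size + shape + colors as a feature vector) and the weighted L1 distance from the slide, a minimal nearest-neighbor matcher might look like the following; the feature layout, weights, and values are illustrative assumptions, not the project's actual features:

```python
# dist = sum_i w[i] * |v[i] - m[i]|, smallest distance wins
def match(v, models, w):
    """Return (best_noun, dist) for feature vector v against stored models."""
    best, best_d = None, float("inf")
    for noun, m in models.items():
        d = sum(wi * abs(vi - mi) for wi, vi, mi in zip(w, v, m))
        if d < best_d:
            best, best_d = noun, d
    return best, best_d

# Example: v = [size, elongation, hue, saturation] extracted from a detected object
models = {"aspirin": [0.30, 1.2, 0.08, 0.7], "Advil": [0.28, 1.1, 0.95, 0.8]}
print(match([0.29, 1.15, 0.90, 0.75], models, w=[1.0, 0.5, 2.0, 1.0]))  # -> ('Advil', ...)
```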
Multi-modal Dialog Script (continued)
5. “Eli, give me the Tylenol.”
Uses existing visual model to find item(s)
<gets bottle> “Here you go.”
Waits for user hand motion
<releases>
Waits for user hand motion
<regrabs bottle> “Thanks.”
<replaces bottle>
6. “Eli, where is the aspirin?”
Uses existing visual model to find item(s)
“Here.” <points>
Eli Robot at Watson + Brainy Response System at Tokyo
[Diagram: on the Watson side, the Eli robot's ASR, parser, talk, vision, kinematics, sequencer, and reasoning modules, backed by action models, visual models, semantic memory, and vocabulary; on the Tokyo side, the Brainy Response System's lifelog with archive and retrieval. The robot sends context updates over the network and receives vetoes and recommendations back.]
Collaboration with Tokyo Research Lab
Principal researchers: Michiharu Kudoh, Risa Nishiyama
“BRAINS” project goal: make the robot respond appropriately, as if it understands social rules
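The diagram implies a simple request/reply cycle over the network. A hedged sketch of what one exchange could look like; the JSON field names and data are assumptions for illustration, not the project's actual protocol:

```python
import json

def handle_context_update(msg, contraindications, alternatives):
    """BRAINS side: map a context update to a veto and/or recommendation."""
    event = json.loads(msg)
    reply = {"veto": False, "recommendation": None}
    user, drug = event["user"], event["object"]
    if drug in contraindications.get(user, ()):
        reply["veto"] = True
        reply["recommendation"] = alternatives.get(drug)   # may be None
    return json.dumps(reply)

# Example round trip:
update = json.dumps({"user": "Alice", "intent": "give", "object": "aspirin"})
print(handle_context_update(update, {"Alice": {"aspirin"}}, {"aspirin": "Tylenol"}))
```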
Combined Demo
eli_bottles_sub.wmv
Features:
Learns object names
Learns object appearances
Grabs and passes objects
Vetoes actions based on DB
Picks alternates using ontology
Checks for valid dose interval
Real-time cloud connection
Combined Demo Script
1. “Eli, this <points> object is aspirin.”
New word added to grammar
New visual model for object
“Okay. That is aspirin.”
2. “Eli, the object on the right is called Tums.”
Word already known
New visual model for object
“Okay. This <points> is Tums.”
3. “Eli, give me some aspirin.”
Uses existing visual model to find item(s)
Check against personal database (sketched below)
“But that will hurt your stomach.”
[Diagram: the personal database for “Alice” returns NO for aspirin]
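A minimal sketch of the step-3 veto: a per-user table of contraindications with a spoken reason. The table contents and names are illustrative, not the project's actual data:

```python
PERSONAL_DB = {"Alice": {"aspirin": "will hurt your stomach"}}

def veto_reason(user, drug):
    """Return the spoken objection, or None if the request is fine."""
    reason = PERSONAL_DB.get(user, {}).get(drug)
    return f"But that {reason}." if reason else None

print(veto_reason("Alice", "aspirin"))  # -> But that will hurt your stomach.
print(veto_reason("Alice", "Tums"))     # -> None (no objection; proceed to fetch)
```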
Combined Demo Script (continued)
4. “Eli, give me some Tylenol instead.”
Uses existing visual model to find item(s)
<gets bottle> “Here you go.”
Waits for user hand motion
<releases>
Waits for user hand motion
<regrabs bottle> “Thanks.”
<replaces bottle>
Records dose in lifelog
5. “Eli, give me some Rolaids.”
No visual model for item
“I don’t know what Rolaids looks like.”
Ontology used to find available alternative(s) (sketched after this script)
“Do you want another antacid, Tums?”
6. “Eli, just give me some Tylenol.”
Uses existing visual model to find item(s)
Lifelog consulted for last dose
“You just had Tylenol.”
[Diagram: the ontology groups Rolaids (requested) and Tums (present) under antacid; the lifelog history ends with “9:01 AM took Tylenol”]
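Steps 5 and 6 combine an ontology lookup with a lifelog check. A minimal sketch, assuming a flat category ontology and a fixed minimum dosing interval (both illustrative assumptions):

```python
import datetime as dt

ONTOLOGY = {"Rolaids": "antacid", "Tums": "antacid", "Tylenol": "analgesic"}
MIN_INTERVAL = dt.timedelta(hours=4)   # assumed dosing interval

def alternative(requested, visible):
    """Step 5: offer a visible item from the same ontology class."""
    cls = ONTOLOGY.get(requested)
    return next((v for v in visible if v != requested and ONTOLOGY.get(v) == cls), None)

def dose_ok(drug, lifelog, now):
    """Step 6: check the lifelog for the last recorded dose of this drug."""
    last = max((t for t, d in lifelog if d == drug), default=None)
    return last is None or now - last >= MIN_INTERVAL

log = [(dt.datetime(2012, 6, 1, 9, 1), "Tylenol")]
print(alternative("Rolaids", ["Tums", "aspirin"]))               # -> Tums
print(dose_ok("Tylenol", log, dt.datetime(2012, 6, 1, 9, 30)))   # -> False ("You just had Tylenol.")
```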
Verb Learning Scenario
Features:
Handles relative motion commands
Responds to incremental positioning
Learns action sequences
Applies new actions to other objects
eli_verb_sub.wmv
Verb Learning Script
1. “Eli, poke the thing in the middle.”
Resolves visual target based on position
No existing action sequence to link
New action sequence “poke” opened for input
“I don’t know how to poke something.”
2. “Eli, point at it.”
Resolves pronoun from previous selection
Moves relative to visual target
<points> (recorded as: point 1.0)
3. “Eli, extend your hand.”
Low-level incremental move
<advances> (recorded as: out 1.0)
4. “Eli, retract your hand.”
Low-level incremental move
<retreats> (recorded as: out -1.0)
Verb Learning Script (continued)
5. “Eli, that is how you poke something.”
Recognizes closing of action block
Links action sequence to word
“Okay. Now I know how to poke something.”
6. “Eli, poke the red object.”
Resolves visual target based on color
Retrieves action sequence for verb and executes
<pokes>
7. “Eli, poke the object on the left.”
Resolves visual target based on position
Retrieves action sequence for verb and executes
<pokes>
8. “Eli, poke the Tylenol.”
Resolves visual target based on known object model
Retrieves action sequence for verb and executes
<pokes>
[Diagram: stored sequence for “poke” = point 1.0, out 1.0, out -1.0; a sketch of this record-and-replay mechanism follows]
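A hedged sketch of verb acquisition as record-and-replay of motion primitives, matching the point/out notation above; the class, primitive signatures, and example calls are assumptions, not the project's implementation:

```python
class VerbLearner:
    def __init__(self, primitives):
        self.primitives = primitives   # name -> callable(target, amount)
        self.verbs = {}                # learned verb -> list of (primitive, amount)
        self.recording = None          # (verb, steps) while an action block is open

    def begin(self, verb):             # "I don't know how to poke something."
        self.recording = (verb, [])

    def step(self, prim, amount, target):
        self.primitives[prim](target, amount)         # execute the taught move...
        if self.recording:
            self.recording[1].append((prim, amount))  # ...and log it

    def end(self):                     # "that is how you poke something"
        verb, steps = self.recording
        self.verbs[verb] = steps
        self.recording = None

    def run(self, verb, target):       # replay on any resolvable target
        for prim, amount in self.verbs[verb]:
            self.primitives[prim](target, amount)

prims = {"point": lambda t, a: print("point at", t),
         "out":   lambda t, a: print("hand out", a)}
vl = VerbLearner(prims)
vl.begin("poke")
vl.step("point", 1.0, "middle thing")   # point 1.0
vl.step("out", 1.0, "middle thing")     # out 1.0
vl.step("out", -1.0, "middle thing")    # out -1.0
vl.end()
vl.run("poke", "red object")            # replays the stored sequence on a new target
```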
Project Milestones
Year 1: Establishing the Language Framework (2011)
Table-top environment with off-the-shelf arm / cameras / mics
Visual detection & identification of objects
Visual servoing of arm to grasp objects
Speech-based naming of objects
Speech-based learning of motion routines
Year 2: Extension to Application Scenario (2012)
Port to mobile platform with on-board power & processing
Vision-based obstacle avoidance
Visual grounding for rooms / doors / furniture
Speech adaptation for different users & rooms
Speech-based place naming & fetch routines
Overcoming obstacles to widespread robotics
Perception: robots do not conceptualize the world as people do (e.g. what is an object?)
Solution: focus on nouns using partial scene segmentation; separate objects using depth boundaries and homogeneous regions; recognize them with interest points and bulk properties (see the segmentation sketch below)
Programming: it is hard to tell robots what to do short of C++ programming
Solution: use speech and (constrained) natural language; learn word associations to objects and places; simply remember spatial paths and action procedures
Cost: robots are too expensive for generic activities or personal use
Solution: substitute sensing and computation for precise mechanicals; use cameras only, not (low-volume) special-purpose sensors; use graphics processors (GPUs) instead of CPUs when possible
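A minimal sketch of segmentation from depth boundaries in the spirit of the perception solution; the jump threshold, 4-connectivity, and numpy-based flood fill are illustrative assumptions, not the project's implementation:

```python
import numpy as np

def object_candidates(depth, jump=0.05):
    """Label smooth depth regions separated by depth discontinuities."""
    h, w = depth.shape
    edge = np.zeros((h, w), bool)
    edge[:, 1:] |= np.abs(np.diff(depth, axis=1)) > jump   # horizontal depth jumps
    edge[1:, :] |= np.abs(np.diff(depth, axis=0)) > jump   # vertical depth jumps
    labels, cur = np.zeros((h, w), int), 0
    for sy in range(h):
        for sx in range(w):
            if labels[sy, sx] or edge[sy, sx]:
                continue
            cur += 1
            stack = [(sy, sx)]
            while stack:                                   # flood-fill one homogeneous region
                y, x = stack.pop()
                if 0 <= y < h and 0 <= x < w and not labels[y, x] and not edge[y, x]:
                    labels[y, x] = cur
                    stack += [(y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)]
    return labels   # 0 = boundary pixel, 1..cur = candidate object/background regions
```

Each labeled region then becomes a noun candidate whose bulk properties (size, shape, colors) feed the nearest-neighbor matching sketched earlier.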