TRANSCRIPT
Conversational Human–Robot Interaction
Speech and Language Technology Challenges
Gabriel Skantze
Professor in Speech Technology at KTH
Co-founder & Chief Scientist, Furhat Robotics
• Engaging
• Feeling of presence
• Situated
Output
• Gaze (attention)
• Facial expressions
• Lip movements
Input
• Speaker detection
• Speaker recognition
• Facial expressions (emotions)
• Attention (gaze/head pose)
What the face adds to the conversation
Modelling the situation
Where should Furhat be looking?
When should Furhat speak?
What should Furhat say?
What facial expressions should Furhat use?
Where are the users located?
Where are the users attending?
Where are objects located?
Who is speaking?
Which object is being talked about?
Current limitations and Challenges ahead
• Fluent turn-taking (SSF/VR)
• Learning symbol grounding through dialog (SSF/WASP)
• Scaling beyond limited domains (Amazon)
• Learning from interaction (VR)
• Learning from humans/Wizard-of-Oz (SSF)
  – Transfer learning
• Managing shared spaces (ICT-TNG)
  – Using Mixed/Augmented Reality for simulation
How is turn-taking coordinated?
The more cues, the stronger the signal! (Duncan, 1972)
Signal              | Turn-yielding cue    | Turn-holding cue
Syntax              | Complete             | Incomplete
Prosody - Pitch     | Rising or falling    | Flat
Prosody - Intensity | Lower                | Higher
Prosody - Duration  | Shorter              | Longer
Breathing           | Breathe out          | Breathe in
Gaze                | Looking at addressee | Looking away
Gesture             | Terminated           | Non-terminated
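Duncan's observation that co-occurring cues add up can be illustrated with a toy scorer. This is a minimal sketch for illustration only: the cue names and the simple counting rule are my own, not part of any published model.

```python
# Toy illustration of Duncan's (1972) observation: the more
# turn-yielding cues are displayed together, the stronger the
# signal that the speaker is yielding the turn.
# Cue names and the counting rule are illustrative placeholders.

TURN_YIELDING_CUES = {
    "syntax_complete",          # syntactically complete utterance
    "pitch_rising_or_falling",  # non-flat final pitch
    "intensity_lower",
    "duration_shorter",
    "breathe_out",
    "gaze_at_addressee",
    "gesture_terminated",
}

def turn_yield_strength(observed_cues):
    """Fraction of known turn-yielding cues observed at a pause."""
    observed = set(observed_cues) & TURN_YIELDING_CUES
    return len(observed) / len(TURN_YIELDING_CUES)

# A pause with three co-occurring cues signals a turn shift more
# strongly than a pause with only one.
weak = turn_yield_strength({"syntax_complete"})
strong = turn_yield_strength(
    {"syntax_complete", "gaze_at_addressee", "breathe_out"}
)
print(weak, strong)
```

Any real system would of course learn cue weights from data rather than count cues uniformly, which is what the machine-learning work below does.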
Machine learning of turn-taking
[Figure: Turn-taking decisions in a card-game dialog ("what do you think?", "like this?", "yeah, ehm"). At each point in time, the system classifies the upcoming turn-taking decision as 1. Don't (take the turn), 2. Opportunity, or 3. Obligation. Input features (card movements, words, prosody, head pose, dialog history) feed a classifier trained with supervised learning on annotated data. Reported F-scores: 0.709, 0.851, 0.789.]
Johansson & Skantze (2015)
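The decision interface described above can be sketched as a three-way classifier over multimodal features. This is a hypothetical sketch of the interface only: the feature encoding and the random weights are placeholders, whereas the actual system in Johansson & Skantze (2015) was trained on annotated data.

```python
import numpy as np

# Sketch of the decision interface: multimodal features at a
# pause -> one of three turn-taking labels. Weights are random
# placeholders, not a trained model.

LABELS = ["dont", "opportunity", "obligation"]

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))  # 3 classes x 5 toy features
b = np.zeros(3)

def classify(features):
    """features: [card_movement, words, prosody, head_pose, history]"""
    logits = W @ np.asarray(features, dtype=float) + b
    probs = np.exp(logits - logits.max())  # stable softmax
    probs /= probs.sum()
    return LABELS[int(np.argmax(probs))], probs

label, probs = classify([0.0, 1.0, 0.3, 0.8, 0.5])
print(label, probs.round(3))
```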
Towards a more general model
• Annotation of data is expensive and does not generalize
  – Can we learn from human-human interaction?
• Predictions are only made after IPUs (pauses)
  – Humans make continuous predictions
• Other turn-taking decisions are important as well
  – Overlap
  – Detecting backchannels
→ Can we create a more general model?
Skantze (2017)

[Figure: Model architecture. Per-frame features are extracted from both speakers (S0 and S1) and fed as input to an LSTM layer; the output is a continuous prediction of whether S0 is speaking. Training labels are binary per-frame voice activity for S0 (1 = speaking, 0 = silent).]
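The forward pass of such a continuous model can be sketched in a few lines. This is a toy NumPy re-implementation for illustration, assuming random untrained weights and invented layer sizes (matching the 12-input prosody model mentioned below); the real model in Skantze (2017) was trained on corpus data.

```python
import numpy as np

# Toy forward pass of a continuous turn-taking LSTM:
# per-frame features from both speakers -> P(S0 speaking) per frame.
# Weights are random placeholders, not a trained model.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyLSTM:
    def __init__(self, n_in, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        # One stacked weight matrix for the i, f, g, o gates.
        self.W = rng.normal(0, 0.1, (4 * n_hidden, n_in + n_hidden))
        self.b = np.zeros(4 * n_hidden)
        self.w_out = rng.normal(0, 0.1, n_hidden)
        self.n_hidden = n_hidden

    def predict(self, frames):
        """frames: (T, n_in) feature matrix; returns (T,) probabilities."""
        h = np.zeros(self.n_hidden)
        c = np.zeros(self.n_hidden)
        out = []
        for x in frames:
            z = self.W @ np.concatenate([x, h]) + self.b
            i, f, g, o = np.split(z, 4)
            c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # cell update
            h = sigmoid(o) * np.tanh(c)                   # hidden state
            out.append(sigmoid(self.w_out @ h))           # P(S0 speaking)
        return np.array(out)

# 12 prosody-style features per frame, 10 hidden units
# (cf. the prosody model described below).
model = TinyLSTM(n_in=12, n_hidden=10)
probs = model.predict(np.zeros((50, 12)))  # 50 frames of features
print(probs.shape)  # (50,)
```

Because the output is produced frame by frame, predictions are available continuously rather than only after a pause, which is the point of this model.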
Data & Features
• HCRC Map Task Corpus
• Features: voice activity, pitch, power, spectral stability, part-of-speech (POS)
• Full model (130 inputs, 40 hidden units) vs. prosody model (12 inputs, 10 hidden units)
Prediction at pauses: comparison

Model                                        | F-score
Majority-class baseline                      | 0.421
Human performance (SigDial, 2017)            | 0.709
RNN, prosody only (SigDial, 2017)            | 0.724
RNN, all features (SigDial, 2017)            | 0.762
Tuning speech parameters (Interspeech, 2018) | 0.813
Multi-scale RNN (ICMI, 2018)                 | 0.855

F-score for prediction of turn-shifts at 500 ms pauses.
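For reference, the F-scores above are the harmonic mean of precision and recall on the turn-shift class. A minimal sketch, with invented counts (not taken from the papers):

```python
# F-score = harmonic mean of precision and recall.
# The counts below are invented for illustration only.

def f_score(tp, fp, fn):
    precision = tp / (tp + fp)  # fraction of predicted shifts that were real
    recall = tp / (tp + fn)     # fraction of real shifts that were predicted
    return 2 * precision * recall / (precision + recall)

print(round(f_score(tp=80, fp=20, fn=25), 3))
```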
Current work: Application to HRI
• Applying to human-machine interaction
  – Human and synthetic voice
  – Various datasets of non-perfect interactions

[Figure: Timeline of an example exchange: "Tell me about yourself", "Okay", "I have had many interesting jobs", with a backchannel "mhm".]
Current limitations and Challenges ahead
• Fluent turn-taking (SSF/VR)
• Learning symbol grounding through dialog (SSF/WASP)
• Scaling beyond limited domains (Amazon)
• Learning from interaction (VR)
• Learning from humans/Wizard-of-Oz (SSF)
  – Transfer learning
• Managing shared spaces (ICT-TNG)
  – Using Mixed/Augmented Reality for simulation
Learning referential semantics from dialog
"it looks like a blue crab sticking up his claws"
"you mean the left one?"
"no, the small one to the right"

Shore & Skantze (2017). Enhancing reference resolution in dialogue using participant feedback. In Proceedings of Grounding Language Understanding (GLU2017), Stockholm.

Shore & Skantze (2018)
Data collection

• Referring-to-tangrams game
• Repeated rounds
  – Switching roles
  – Pieces move
• 42 dialogs, > 40 rounds each
  – Coreferences
  – Mean duration 15:25 minutes
• Manual transcriptions
• Train and evaluate using 42-fold cross-validation

[Figure: Instructor and Manipulator at a tangram board. Instructor: "It's the blue crab" / "Now it's the pink chicken".]
Training a semantic model for words
Example expression: "the small blue crab"

Baseline: "words-as-classifiers" (Kennington & Schlangen, 2015)
• Train a logistic regression model for each word
• Negative sampling
• Use features of objects: shape, color, size, position
• Application: rank the referents based on a simple average of the word classifier scores
• Evaluate using MRR (mean reciprocal rank, 1/rank)
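The words-as-classifiers idea can be sketched as follows. This is an illustrative toy: the object features, word weights, and object names are hand-set by me rather than learned from the tangram data, but the ranking rule (average per-word classifier score) follows the baseline described above.

```python
import numpy as np

# Sketch of the "words-as-classifiers" baseline
# (Kennington & Schlangen, 2015): one binary classifier per word,
# applied to object features; referents are ranked by the average
# classifier score over the words of the expression.
# All features and weights below are hand-set for illustration.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Object features: [is_blue, is_small, is_crab_shaped]
objects = {
    "small_blue_crab": np.array([1.0, 1.0, 1.0]),
    "big_blue_crab":   np.array([1.0, 0.0, 1.0]),
    "small_red_bird":  np.array([0.0, 1.0, 0.0]),
}

# Per-word logistic-regression weights (bias term last).
word_weights = {
    "small": np.array([0.0, 4.0, 0.0, -2.0]),
    "blue":  np.array([4.0, 0.0, 0.0, -2.0]),
    "crab":  np.array([0.0, 0.0, 4.0, -2.0]),
}

def rank_referents(expression):
    """Rank candidate objects by average word-classifier score."""
    words = [w for w in expression.split() if w in word_weights]
    scores = {}
    for name, feats in objects.items():
        x = np.append(feats, 1.0)  # append bias feature
        scores[name] = np.mean(
            [sigmoid(word_weights[w] @ x) for w in words]
        )
    return sorted(scores, key=scores.get, reverse=True)

ranking = rank_referents("the small blue crab")
mrr = 1.0 / (ranking.index("small_blue_crab") + 1)  # reciprocal rank
print(ranking[0], mrr)
```

Unknown words ("the") are simply skipped here; the next slides show why treating every word as equally referential breaks down in dialog.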
Comparison to the previous results
• Applying the baseline model to our dataset gives MRR 0.682
• Previous results using a similar model and task: MRR 0.759 (Kennington & Schlangen, 2015)
• Complicating factors:
  – Dialog, not monolog
  – Using all language in the dialogs, not just manually annotated referring expressions
  – Non-trivial referring language, sparsity of data
    • Weird shapes ("asteroid", "a man with a hat"), atypical colors ("pinkish")
Referring as a collaborative process (Clark)

Complicating factors:
• Not all referring language is NPs
• Both speakers contribute
• Not all language is referential
• Language semantics gets established between the speakers
Referring ability
• Not all language is referential
• From referring expressions (a binary notion) to referring ability (a probabilistic notion)
• Model the referring ability (RA) of each word as the discriminative power of its classifier
• For a given word, RA is the difference between
  – the average classification score for the true referent
  – the average classification score for all other referents
• When ranking referents, weight each word classifier score by its RA
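The RA computation described above is a simple difference of averages, sketched here with invented scores (in the actual model they would come from the trained word classifiers):

```python
# Sketch of "referring ability" (RA): for each word, RA is the
# difference between its classifier's average score on true
# referents and its average score on all other referents.
# The scores below are invented for illustration.

def referring_ability(true_scores, other_scores):
    """Discriminative power of one word's classifier."""
    return (sum(true_scores) / len(true_scores)
            - sum(other_scores) / len(other_scores))

# A content word discriminates; a filler does not.
ra_crab = referring_ability([0.9, 0.8], [0.2, 0.1, 0.3])
ra_uh   = referring_ability([0.5, 0.5], [0.5, 0.5, 0.5])

def weighted_score(word_scores, ras):
    """RA-weighted combination of word scores for one referent."""
    total = sum(max(ra, 0.0) for ra in ras)
    if total == 0.0:
        return 0.0
    return sum(s * max(ra, 0.0)
               for s, ra in zip(word_scores, ras)) / total

# For "uh crab", the filler "uh" gets RA 0 and drops out of the
# ranking, leaving only the discriminative word's score.
score = weighted_score([0.5, 0.9], [ra_uh, ra_crab])
print(ra_crab, ra_uh, score)
```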
Conclusions
• Semantics of referring language can be learned from observing situated dialog, even with relatively little data
• With non-typical properties of objects, data sparsity is an issue
  – We can leverage language entrainment between speakers to adapt the model
• Referring in dialog is a collaborative process
→ We cannot derive referential semantics from "referring expressions" alone
→ A more probabilistic notion of the "referring ability" of language is useful, and can be derived from data