Perception Systems for Naturally Interacting Humanoid Robots

Norbert Schmitz and Karsten Berns

Abstract— This paper presents a design concept and an exemplary realization of a perception system for naturally interacting humanoid robots. Relevant non-verbal interaction capabilities are selected based on psychological research. These interaction capabilities are transferred into a multi-functional user model which is exemplarily implemented for interactive game scenarios. Benefits of this perception system design in comparison to existing realizations are a modular perception concept and the possibility to transfer psychological inter-personal research directly into human-robot interaction scenarios.

I. INTRODUCTION

The interaction capabilities of current humanoid robots are still far behind those required for natural human interaction scenarios. Since inter-personal interaction is not limited to verbal exchange, it is also worth considering non-verbal interaction capabilities.

Many realizations of interactive humanoid robots are designed for a specific interaction scenario. This kind of limitation is still required since inter-personal interaction is a complex task. Targeting the goal of complex and versatile human-robot interaction, a feasible approach is to transfer knowledge from inter-personal to human-robot interaction. This transfer can be supported by the design of the robot control architecture and the perception system in particular. This paper focuses on the perception, although various other aspects are considered to be of interest. The design and realization of the proposed system is based on the idea to analyze and categorize knowledge from psychological inter-personal interaction and to transfer suitable and relevant aspects into a user model and a technical realization of such a perception system for interactive humanoid robots. The whole system is embedded into an emotion-based control architecture which is described in [9].

The following sections introduce this concept and present an exemplary realization with the focus on an interactive game play situation.

II. ROBOT PERCEPTION

The analysis of currently realized perception systems for naturally interacting robots reveals dramatic differences concerning the design concepts. Many implementations are not related to human interaction and extract information which is considered relevant from the robotic point of view. The following sections introduce important contributions which are based on inter-personal communication.

Robotics Research Laboratory, Faculty of Computer Science, University of Kaiserslautern, Germany. {nschmitz|berns}@agrosy.cs.uni-kl.de

A. Gandalf and Ymir

The development of naturally interacting robots is not limited to embodied agents. Artificial animated heads or even full body models are often used to overcome the limitations of the mechanical construction. The avatar Gandalf and the cognitive architecture Ymir are used to realize real-time capable interaction [16].

Thorisson developed a real-time capable system and a control architecture based on data from the psychological and linguistic literature. He realized that existing knowledge from inter-personal communication can be transferred to human-robot interaction [17].

The control architecture uses a layered combination of reactive and reflective behaviors which are triggered by the dialog management system. The modular distributed system is divided into three module groups: perception, action and decision. The decision modules of each layer use more or less abstract knowledge generated by the perception system to select suitable actions. The action modules are responsible for executing one action in one of nine action categories.

The perception system, as part of the whole architecture, uses virtual sensors for unimodal information extraction and multimodal descriptors as fusion modules. The virtual sensors operate on a single stream of input data and are separated into the categories Prosodic, Speech, Positional and Directional. These sensors are further subdivided into static sensors, which report the current state of a property, and dynamic sensors, which react to changes of an observed stimulus. Gandalf uses 16 basic virtual sensors. Since the virtual sensors operate on a single modality, a fusion system is required. Thorisson introduces ten multimodal descriptors, which again can react either to static stimuli or to dynamic changes.

The presented approach of natural interaction using Ymir and Gandalf is a very promising example of modeling a perception system for humanoid robots. Major aspects are the transfer of psychological knowledge into robotic applications and the realization of real-time capable systems. The perception module, with its split into virtual sensors and multimodal descriptors, allows multiple views on the same dataset and the reuse of extracted information. This approach allows the separation of the perception and decision process, which is crucial for the adaptation and extension of the proposed architecture.
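The following minimal sketch illustrates this separation of unimodal virtual sensors and a multimodal descriptor, assuming simple dictionary-based percepts. All class names, fields and thresholds are hypothetical and do not reproduce Ymir's actual implementation.

```python
# Illustrative sketch: unimodal "virtual sensors" separated from a
# "multimodal descriptor" that fuses their outputs into a higher-level percept.

class StaticVirtualSensor:
    """Reports the current state of one property of a single input stream."""
    def __init__(self, extract):
        self.extract = extract          # function: raw data -> value

    def read(self, raw):
        return self.extract(raw)

class DynamicVirtualSensor:
    """Fires only when the observed property changes between two readings."""
    def __init__(self, extract):
        self.extract = extract
        self.last = None

    def read(self, raw):
        value = self.extract(raw)
        changed = value != self.last
        self.last = value
        return value if changed else None

class MultimodalDescriptor:
    """Fuses several unimodal sensor outputs into one higher-level percept."""
    def __init__(self, fuse):
        self.fuse = fuse                # function: dict of values -> percept

    def read(self, sensor_values):
        return self.fuse(sensor_values)

# Usage sketch: "user is addressing the robot" from speech activity and gaze.
speech_active = StaticVirtualSensor(lambda audio: audio["energy"] > 0.2)
gaze_at_robot = StaticVirtualSensor(lambda video: video["gaze_target"] == "robot")
addressing = MultimodalDescriptor(lambda v: v["speech"] and v["gaze"])

audio, video = {"energy": 0.4}, {"gaze_target": "robot"}
print(addressing.read({"speech": speech_active.read(audio),
                       "gaze": gaze_at_robot.read(video)}))   # -> True
```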

B. Asimo

Recent developments concerning Asimo's interaction capabilities target interactive game situations. Okita [13] analyzes the reaction of children to different behaviors of the robot in interactive learning situations. Comparisons of different learning styles show that the behavior of the robot has an influence on the interacting children.

Focusing again on the perception system, the concept of a cognitive map has been introduced by Ng-Thow-Hing [12]. Another demonstration using a card tabletop game shows the principles of the architecture. The central component of the Cognitive Map Architecture is the cognitive map server, which accepts perception information and provides an interface for information retrieval. In the card game scenario, table, card and speech information are extracted and stored in the server. Each stored item consists of a variable set of parameters like position and orientation.

To access the stored data, three mechanisms are provided: indexer, detector and decider. Indexers are used to access the stored data with a provided sorting strategy, which allows sorting by time or position. Detectors, the second mechanism, are boolean operators which limit the amount of messages: a message is only forwarded if the detector evaluates to true. In the card game example, detectors are used to suppress new objects located beside the table. Deciders finally use the information from detectors or other deciders to derive new information which can be stored back into the cognitive map. The output of the deciders is used to trigger a finite state machine representing the game flow by modeling both robot and game state.
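A minimal sketch of this indexer/detector/decider access pattern on a cognitive-map-like store, assuming simple dictionary entries. The names and the card-game payload are hypothetical and do not reflect the actual API of the Cognitive Map Architecture [12].

```python
# Illustrative sketch of the indexer / detector / decider access pattern.

class CognitiveMap:
    def __init__(self):
        self.entries = []               # each entry: dict with at least "time"

    def store(self, entry):
        self.entries.append(entry)

def indexer(cmap, key):
    """Access stored data with a provided sorting strategy (e.g. by time)."""
    return sorted(cmap.entries, key=lambda e: e[key])

def detector(entries, predicate):
    """Boolean filter: forward an entry only if the predicate evaluates to true."""
    return [e for e in entries if predicate(e)]

def decider(entries, cmap):
    """Derive new information from detector output and store it back."""
    if entries:
        newest = entries[-1]
        event = {"type": "card_on_table", "time": newest["time"], "card": newest["id"]}
        cmap.store(event)               # could also trigger the game state machine
        return event
    return None

cmap = CognitiveMap()
cmap.store({"type": "card", "id": "red_7", "time": 1.0, "on_table": True})
cmap.store({"type": "card", "id": "blue_2", "time": 2.0, "on_table": False})

by_time = indexer(cmap, "time")
on_table = detector(by_time, lambda e: e.get("on_table", False))
print(decider(on_table, cmap))
```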

The Cognitive Map Architecture provides a uniform and flexible way to access stored perception data. The hierarchy of indexer, detector and decider allows the reuse of already existing information and reduces the computational effort. Nevertheless, it does not provide rules stating which perceptions are required and should be perceived. Here again, the focus is the information required for the specific interactive game situation.

III. COMMUNICATION MODELS

Focusing on the channels of sight and hearing, a new question arises: Which information is transferred and of interest for a communication process? In [6] detailed descriptions of face-to-face interaction processes are given. There, seven major categories of action that can be performed or perceived are described.

• Para-language: This term covers all information transmitted on top of speech signals but not directly related to the content of the spoken words. A further subdivision distinguishes perspective, organic, expressive and linguistic aspects.
• Kinesics: The term kinesics describes any type of body motion including head, facial, trunk and hand movements.
• Proxemics: The perception of distance zones, spatial arrangements and sensory capabilities is a key factor of any conversation.
• Artifacts: Artifacts in the environment, and even the environment itself, influence a communication process. Depending on the situation, artifacts may even be necessary.
• Olfactory: The scent of interaction partners also has an influence, although it is limited to a narrow area close to the interaction partners [4]. Since human-robot communication mainly covers public areas, olfactory communication is of minor interest and therefore not discussed further.
• Haptics: Haptic interaction signals are related to specific situations such as hand shaking, beating or grabbing. These communicative actions require an embodied agent and direct feedback of the robot [1]. Since the focus of this work is the modeling of the user, and haptic interactions are seldom in public environments, this topic is not considered further. Nevertheless, haptics may be of interest for extensions of this work.
• Language: Although non-verbal aspects are considered important, most of the conversational content is transferred via speech [10]. Verbal interaction is mentioned due to its immense importance, but it is not directly included in the model since the user modeling focuses on non-verbal aspects.

A. Aspects of Paralanguage

The analysis of speech-related communication information is divided into different aspects. Traunmüller [18] distinguishes

• Perspective aspects: Perspective aspects of speech are concerned with spatial relationships. These include, among others, place, distance, orientation and transmission channel. These aspects have various impacts on the communication; examples are attention attraction and turn-taking.
• Organic aspects: Organic aspects are modifications of speech caused by differences of the vocal tract. Each human being has a unique vocal tract with a unique modification of speech. Based on these modifications, information like age, sex and pathology can be extracted. Information like age and sex in particular is commonly used in communication situations.
• Expressive aspects: Expressive aspects are transmitted by variations in speaking rate, pitch dynamics and voice quality. Transmitted are, for example, emotion, attitude and environment. Expressive aspects strongly influence the communication process.
• Linguistic aspects: Linguistic aspects transmit primarily the message itself. Besides the spoken words, dialect, accent and speech style are also transmitted. The influence on the communication process is immense, which is mainly based on the transmitted content.

The amount of information derived from paralinguistic phenomena is huge, and large variations in the usage and reliability of this information can be observed [2]. A crucial part towards a model of non-verbal communication is the selection of relevant aspects. Unfortunately, an importance-based ranking of paralinguistic aspects is not available due to the strong correlation with the situation and personal preferences. Nevertheless, several aspects do exist which seem to be of crucial importance.

B. Situation Models

The situation model gives quite precise information about the knowledge required in a communication situation. For more specific situations, detailed models do exist. In [5] various types of conversation situations are mentioned. As an example of a complex situation, a specific model of play has been developed¹. Here four continuous cycles are mentioned.

• Continuous play: In a concrete play situation an observe-plan-act cycle is proposed. In this cycle all participants are focused on the game with a mainly serial turn-taking scheme.
• Fun: A second, independent cycle is the pursuit of fun, which is a basis for many games. Based on the conversation, a shared situation and the engagement, a situation of amusement can be created.
• Learning: Additionally, the authors claim a cycle of learning. Each participant has the goal to gain experience and improve the game play.
• Repeated play: The fourth cycle claims that game situations may be interrupted and continued at any time.

¹See http://www.dubberly.com/concept-maps/a-model-of-play.html

C. Communication Partner Model

Any interactive communication situation is based on the perception of the interaction partner. To provide this information as a basis for any complex interaction, three basic requirements have to be fulfilled:

• Timing requirements: The trisection of the communication process into opening, main part and closing is realized. During the opening a conversation partner is not yet present, so the detection of humans is important. The main part strongly depends on the conversation situation and requires detailed information about the interaction partner. The third part, the closing, concludes the conversation.
• Memory requirements: It is obvious that previous knowledge as well as a runtime memory is required in any communication situation. In this user model two types of knowledge are used: the a priori knowledge provided by the application programmer, containing information like face dimensions and speech identification data, and the a posteriori knowledge, which consists of data captured and stored during an interaction.
• Sensor requirements: The perception system generates a broad range of information which can be subdivided into basic information units called percepts. Each percept provides a specific piece of information which can be of interest for the robot. The percepts are grouped into the categories paralanguage, kinesics, proxemics, language and situation-specific percepts. The user model assigns a set of relevant percepts to each perception task in the timing model. Only these percepts are considered to be relevant in the specific time step.

All these requirements are described in the general user model depicted in figure 1. This model consists of the three main stages of interaction and the relevant percepts for each category.

[Figure 1 lists the percepts assigned to the three interaction states: Opening State (speech activity, sound direction, head position, person movement); Main State (speech activity, sound direction, sound distance, speaker identification, hand gesture, head orientation, head position, emotional state, person movement, game state, player state); Closing State (head position).]

Fig. 1: Schematic view of the realized communication partner model with active perception capabilities.
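A minimal sketch of how the assignment in figure 1 could be represented: a mapping from interaction stage to the set of relevant percepts, used to filter the currently available percepts. The percept names follow the figure; the dictionary layout and function name are illustrative assumptions, not the robot's actual implementation.

```python
# Sketch of the communication partner model: stage -> relevant percepts.

RELEVANT_PERCEPTS = {
    "opening": {"speech_activity", "sound_direction", "head_position",
                "person_movement"},
    "main":    {"speech_activity", "sound_direction", "sound_distance",
                "speaker_identification", "hand_gesture", "head_orientation",
                "head_position", "emotional_state", "person_movement",
                "game_state", "player_state"},
    "closing": {"head_position"},
}

def select_percepts(stage, available):
    """Keep only the percepts relevant for the current interaction stage."""
    wanted = RELEVANT_PERCEPTS[stage]
    return {name: value for name, value in available.items() if name in wanted}

available = {"head_position": (1.2, -0.3, 1.0), "emotional_state": "joy",
             "game_state": "waiting"}
print(select_percepts("opening", available))   # only head_position survives
```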

IV. AUDITORY PERCEPTION SYSTEM

[Figure 2 shows the module arrangement of the audio branch: JackCapture inputs feed PreprocessorUSTM and BeamformerCUDA (localization) as well as Preprocessor, VoiceActivityDetection and VoiceClassifierLTM (identification); both branches meet in AudioFusionSTM before the results reach the perception fusion.]

Fig. 2: Auditory perception system with two major branches, localization and identification.

The auditory perception system as depicted in figure 2 contains two major branches, identification and localization, which are fused in the auditory fusion step. The following sections briefly introduce these branches.

A. Sound Source Localization

The sound source localization has to provide proxemic information about the spatial arrangement of human interaction partners. This includes the position of a sound source relative to the position of the robot or, more precisely, the position of the microphones. The contributions of the sound source localization to the model are the paralinguistic aspects of Sound Direction and Sound Distance. Additionally, loudness information is of interest to allow a priority-based distinction of sound sources. Sound localization approaches as required here have been widely used in humanoid robotics with varying focus. Basic requirements for interactive robots are real-time data processing and noise robustness.

B. Speaker Identification

For any inter-personal communication situation the identity of the interaction partner is of crucial importance. The introduction of all participants is often one of the first conversation steps if the identity is not previously known. The behavior of all participants heavily depends on the relationship between them. It is therefore an important competence for a naturally interacting humanoid robot to estimate the identity of an interaction partner and to adapt the robot behavior according to the relationship. The speaker identification contributes to the paralinguistic aspects of Speech Activity and Speaker Identification.

The realized speaker classification is subdivided into three modules: preprocessing, voice activity detection and voice classification. The preprocessing is responsible for basic transformations into the frequency domain and is not described further here. The voice activity detection (VAD) generates a probability value for each microphone channel and frame of the audio stream; it is based on the work of Cohen [3]. The classification module generates the current feature vectors and compares them to the set of previously trained codewords [7].

The extraction of speaker-related information from sound sources is related to the extraction process described by Sigurdsson [15]. The extraction process is divided into five steps: framing, frequency transform, log amplitude transform, Mel scaling and discrete cosine transform. The following sections briefly introduce this process and describe variations in the realized approach.

The discrete auditory input signal s(n) of each microphone m ∈ {0, ..., M} is divided by the voice activity detection into chunks of speech. Each chunk is further subdivided into a set of frames f ∈ {0, ..., F} with length L. Each input frame overlaps the previous one by 50% and is transformed into the frequency domain using a Hamming window and the Fast Fourier Transform.
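A minimal numpy sketch of this framing step: frames of length L with 50% overlap, a Hamming window per frame, and the magnitude spectrum via the FFT. Frame length and the stand-in test signal are assumptions made for illustration only.

```python
import numpy as np

def frame_spectra(chunk, frame_len=512):
    """Split a speech chunk into 50%-overlapping windowed frames and return
    the one-sided magnitude spectrum of each frame."""
    hop = frame_len // 2                       # 50% overlap
    window = np.hamming(frame_len)
    frames = [chunk[start:start + frame_len] * window
              for start in range(0, len(chunk) - frame_len + 1, hop)]
    return np.abs(np.fft.rfft(np.asarray(frames), axis=1))

chunk = np.random.randn(8000)                  # stand-in for one VAD speech chunk
spectra = frame_spectra(chunk)
print(spectra.shape)                           # (number of frames, frame_len//2 + 1)
```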

The selection of a suitable set of features influences the detection rate and the calculation time. The used set of features is a modification of the MFCC FB-24 filter bank. The frequency range between $f_{low} = 0\,\mathrm{Hz}$ and $f_{high} = 3000\,\mathrm{Hz}$ is divided into $M = 22$ triangular filters. The centers $C_i$ of the triangular filters and their distances $\Delta f_{mel}$ are calculated as

$$\Delta f_{mel} = \frac{1127 \cdot \left( \ln\left(1 + \frac{f_{high}}{700}\right) - \ln\left(1 + \frac{f_{low}}{700}\right) \right)}{M}$$

$$C_i = f_{mel}^{-1}\left( 1127 \cdot \ln\left(1 + \frac{f_{low}}{700}\right) + i \cdot \Delta f_{mel} \right), \qquad 0 \le i \le M+1$$

$$f_{mel}^{-1}(m) = 700 \cdot \left( e^{m/1127} - 1 \right)$$
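For illustration, a direct numpy transcription of the two formulas above; the default values of f_low, f_high and M follow the text.

```python
import numpy as np

def mel_centers(f_low=0.0, f_high=3000.0, M=22):
    """Filter centers C_0 .. C_{M+1} in Hz from the Mel-spacing formulas above."""
    inv_mel = lambda m: 700.0 * (np.exp(m / 1127.0) - 1.0)
    delta_mel = 1127.0 * (np.log(1 + f_high / 700.0) - np.log(1 + f_low / 700.0)) / M
    return np.array([inv_mel(1127.0 * np.log(1 + f_low / 700.0) + i * delta_mel)
                     for i in range(M + 2)])

print(mel_centers()[:5])   # first few filter centers in Hz
```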

Each filter $F_i$ overlaps half of both neighboring filters and is defined as

$$F_i(k) = \begin{cases} 0 & k < C_{i-1} \\ \dfrac{k - C_{i-1}}{C_i - C_{i-1}} & C_{i-1} \le k < C_i \\ \dfrac{C_{i+1} - k}{C_{i+1} - C_i} & C_i \le k < C_{i+1} \\ 0 & k \ge C_{i+1} \end{cases} \qquad 1 \le i \le M$$

The corresponding transformed Mel coefficients $S_{mel,i}$ for each filter $i$ and the Fourier-transformed input signal $S(n)$ are then calculated as

$$S_{mel,i} = \sum_{n=0}^{N-1} F_i(n) \cdot |S(n)|^2, \qquad 0 \le i \le M-1$$

The final Mel Frequency Cepstral Coefficients (MFCC) are calculated using the Discrete Cosine Transform II (DCT-II) as shown in equation 1. The output of this procedure is a set of MFCC feature vectors describing a chunk of speech.

$$X_k = \sum_{n=0}^{N-1} x(n) \cdot \cos\left[ \frac{\pi}{N} \left( n + \frac{1}{2} \right) k \right], \qquad k = 0, \ldots, N-1 \qquad (1)$$
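A sketch of the remaining steps: triangular Mel filter weights built from the centers, the Mel energies $S_{mel,i}$, a log compression (the log amplitude step listed above), and the DCT-II of equation 1. The mapping of the centers to FFT bins, the sampling rate and the FFT size are assumptions for illustration; the actual implementation may differ.

```python
import numpy as np

def mfcc(spectrum, centers, sample_rate=16000, n_fft=512):
    """Compute MFCCs for one frame's magnitude spectrum, given filter centers in Hz."""
    bins = np.round(centers / (sample_rate / n_fft)).astype(int)  # centers -> FFT bins
    M = len(centers) - 2
    energies = np.zeros(M)
    for i in range(1, M + 1):                   # triangular filter i with centers C_{i-1..i+1}
        left, mid, right = bins[i - 1], bins[i], bins[i + 1]
        up = (np.arange(left, mid) - left) / max(mid - left, 1)
        down = (right - np.arange(mid, right)) / max(right - mid, 1)
        weights = np.concatenate([up, down])
        energies[i - 1] = np.sum(weights * np.abs(spectrum[left:right]) ** 2)
    log_e = np.log(energies + 1e-10)            # log amplitude step
    n = np.arange(M)
    # DCT-II as in equation 1, with N equal to the number of filters
    return np.array([np.sum(log_e * np.cos(np.pi / M * (n + 0.5) * k))
                     for k in range(M)])

# usage sketch, reusing the earlier framing and filter-center sketches:
#   coeffs = mfcc(spectra[0], mel_centers())
#   feature = coeffs[2:16]      # keep 14 coefficients starting at C2, as in the text
```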

A suitable selection of the contributing MFCCs is not trivial. Zhen [20] provides a relative comparison of the importance of the Mel coefficients. The coefficients C0 and C1 degrade recognition results since they are correlated with the signal energy. Zhen further describes that all coefficients above C12 have only minor contributions since they are heavily influenced by noise. The contribution of delta and delta-delta features can also be neglected, as described by Radova [14].

Based on this information, the final set of MFCC features consists solely of the coefficients C2 to C16. One of these 14-dimensional vectors is extracted for each frame of a chunk. Each of them has to be compared to a reference set of vectors approximating the possible speakers. To compare two feature vectors, a distance measure is required. The mean squared error (MSE) is used as distance measure here since, in contrast to the Euclidean distance, it does not require a square root operation.

$$d(\vec{x}, \vec{y}) = \|\vec{x} - \vec{y}\|^2 = \sum_{i=0}^{l} (x_i - y_i)^2$$
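For illustration, a sketch of the comparison step: a chunk of MFCC frame vectors is scored against each speaker codebook by the average squared distance to the nearest codeword, and the speaker with the lowest score is selected. This plain nearest-codeword decision rule is an assumption; the exact scoring used in [7] may differ.

```python
import numpy as np

def chunk_score(frames, codebook):
    """Average squared distance of each frame vector to its nearest codeword."""
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d.min(axis=1).mean()

def identify(frames, codebooks):
    """Return the speaker whose codebook explains the chunk best."""
    scores = {speaker: chunk_score(frames, cb) for speaker, cb in codebooks.items()}
    return min(scores, key=scores.get), scores

frames = np.random.randn(40, 14)            # 14-dimensional MFCC vectors of one chunk
codebooks = {"speaker_0": np.random.randn(128, 14),
             "speaker_1": np.random.randn(128, 14)}
print(identify(frames, codebooks)[0])
```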

The following paragraphs introduce the realized vector quantization technique based on the feature vectors and the distance measure described before.

The modeling of a speaker can now be reduced to the description of a large amount of example MFCC vectors generated during a training phase. Assume that a set of descriptive vectors $x_i$ has been extracted; such a set $X = \{x_0, \ldots, x_n\}$ exists for each possible communication partner. The goal of the vector quantization is the selection of a suitable, preferably small set of representatives $M = \{\mu_1, \mu_2, \ldots, \mu_k\}$ which optimally describes the large amount of input vectors $X$. This process can be seen as a lossy compression of the data.

Several approaches like Gaussian Mixture Models and neural networks have been used for speaker representation with similar results. Another approach is the use of the K-Means algorithm, described for example by Kanungo [11], with a fixed number of representatives. The algorithm assigns the closest centroid to each input vector according to a distance measure $d(x, \mu)$. The centroid vectors are then recalculated as the mean of all assigned input vectors. The new centroid positions lead to an update of the centroid assignments. This procedure is repeated until the assignment is unchanged.
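A minimal sketch of this K-Means codebook training: assign each vector to its closest centroid, recompute the centroids as the mean of their assignments, and repeat until the assignment no longer changes. The initialization strategy and the small demo value of k are assumptions for illustration.

```python
import numpy as np

def train_codebook(X, k=128, rng=np.random.default_rng(0)):
    """Train a codebook of k centroids for one speaker's MFCC vectors X."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    assignment = np.full(len(X), -1)
    while True:
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_assignment = d.argmin(axis=1)
        if np.array_equal(new_assignment, assignment):
            return centroids
        assignment = new_assignment
        for j in range(k):
            members = X[assignment == j]
            if len(members):                   # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)

X = np.random.randn(2000, 14)                  # stand-in training vectors of one speaker
codebook = train_codebook(X, k=16)             # small k just for the demo
print(codebook.shape)
```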

The selection of an optimal number of representatives $k$ is a complex task. Extensions of K-Means like the X-Means algorithm do exist, but the increased complexity of estimating the required number of centroids also leads to a high computational load. Several experiments showed that a sufficiently large fixed number of centroids represents the set of input vectors well.

The current implementation selects $k = 128$ centroid vectors as the description of each speaker. The approach of generating an independent codebook for each speaker did not provide sufficient results: the distances between the speakers are too low. The problem with this procedure is the omnipresent background noise which is trained into each codebook. The following paragraphs introduce the universal background model which helps to distinguish noise from speaker.

Speaker recognition systems often use voice recorded in silent environments. The influence of noise is reduced by using microphones near the speaker, like headsets or telephones. In the area of mobile robotics this is in general not the case: speakers are spread in the environment, and noise from motors and computers is omnipresent.

A crucial point in the training process of speaker codebooks is to avoid training noise as speaker features. A possible solution is noise reduction, which is complex and hard to realize in frequently changing environments. Another approach is the introduction of a background model. The principal idea is that all speaker vectors are used to train a single codebook. This codebook should contain features which are present in all sets of feature vectors, especially the background noise. After the training of the background model, each speaker codebook is trained with the additional constraint to keep away from the background model.

The Universal Background Model (UBM) is trained using the standard K-Means algorithm. The number of features used to train the model is huge, since it contains all speaker vectors, which leads to long training times. The training of the individual speaker models is realized using a modified K-Means algorithm. A similar approach for background modeling using Gaussian Mixture Models has been described by Hautamäki [8].
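The paper does not detail the modified K-Means, so the following sketch shows only one plausible way to "keep away from the background model": feature vectors that lie closer to a UBM codeword than to any current speaker centroid are treated as background noise and excluded from the centroid update. This variant is purely illustrative and is not claimed to be the authors' actual modification.

```python
import numpy as np

def nearest_sq_dist(X, centroids):
    """Squared distance of each vector in X to its nearest centroid."""
    return ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2).min(axis=1)

def train_speaker_codebook(X, ubm, k=128, iterations=20, rng=np.random.default_rng(0)):
    """K-Means variant that ignores vectors better explained by the UBM."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iterations):
        d_speaker = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        keep = d_speaker.min(axis=1) < nearest_sq_dist(X, ubm)   # not background-like
        assignment = d_speaker.argmin(axis=1)
        for j in range(k):
            members = X[keep & (assignment == j)]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids

ubm = np.random.randn(64, 14)            # stand-in background codebook (all speakers)
X = np.random.randn(2000, 14) + 0.5      # stand-in feature vectors of one speaker
print(train_speaker_codebook(X, ubm, k=16).shape)
```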

V. VISUAL PERCEPTION

From the technical point of view, the visual perception system as depicted in figure 3 is a sensor-input-driven hierarchy of perception modules. Each module provides new information based on input images and, optionally, additional elementary percepts. The arrangement of perception modules has been developed and refactored several times to optimize the flow of information and reduce the computational complexity. The resulting module hierarchy in figure 3 also includes the relation of the camera perception module to the other groups relevant for the perception system. Each module and its functionality is briefly introduced before details on relevant modules are given in the following sections.

[Figure 3 shows the module arrangement of the video branch: captured images are saved and scaled and feed single and stereo processing; SkinColorDetector, FaceDetector, GestureDetector, FaceDetector2, EmotionDetector, TangramDetector, SkinColorDetector2 and HeadPoseEstimation extract information (several modules are CUDA-accelerated), which is combined in the VideoFusion module.]

Fig. 3: Visual perception system with several information extraction branches.

• Face Detector: The face detector uses the Haar cascade classifier introduced by [19] to find frontal faces (see the sketch after this list). A list of face candidates is generated and provided as an elementary percept.
• Tangram Detector: The Tangram detector uses color information to extract interaction-relevant artifacts. The number of artifacts relevant for communication purposes is endless, and therefore a small but useful set has been selected. The detection focuses on the analysis of game play situations during an instruction puzzle game called Tangram.
• Skin Color Detector: The skin color detector uses a skin color model based on a combination of Gaussian probabilities. The classifier assigns a skin probability to each image pixel using a look-up table. Additionally, an average skin color value is added to all face candidates provided by the face detector.
• Emotion Detector: The emotion detector uses a close-up high-resolution image of a frontal face to extract facial features. According to the relative location of these features, an emotional state is estimated using FACS.
• Gesture Detector: The gesture detector realizes the detection of emblematic gestures commonly produced in interaction situations. To this end, skin-colored regions are detected and compared with a previously trained set of hand gestures.
• Head Pose Estimation: The head pose estimator uses the camera with a narrow field of view and estimates the head pose by matching a rigid 3D head model with a set of 2D image features.
• Video Fusion: The video fusion module uses all elementary perceptions like face candidates, skin color estimation and the depth image to track the heads of possible interaction partners. The tracking is realized using a particle filter for each person in the environment. The output of this fusion module is a list of public percepts.
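As referenced in the face detector item above, a minimal OpenCV-based sketch of Haar cascade face detection in the spirit of [19]. The cascade file, parameters and helper name are assumptions for illustration; the detector module in the actual system is more elaborate.

```python
import cv2

# Load OpenCV's pretrained frontal-face Haar cascade.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def face_candidates(bgr_image):
    """Return (x, y, width, height) rectangles of frontal face candidates."""
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
    return cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5,
                                    minSize=(40, 40))

# usage sketch: image = cv2.imread("frame.png"); print(face_candidates(image))
```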

[Figure 4 shows the perception fusion group: KinematicAnnotation, DynamicAnnotation and AudioVideoFusion combine the audio, video and kinematic branches.]

Fig. 4: Perception fusion system as late fusion of auditory and visual perception.

The multimodal fusion module, as part of the fusion hierarchy depicted in figure 4, provides a consistent view of the extracted percepts by combining all relevant information into a single description for each person in the environment. Additionally, all percepts mentioned here can be accessed directly by the control architecture if desired.

The perception fusion represents the last instance of the perception system. All information extracted and fused up to this point is forwarded to the control architecture described in [9].

VI. EXPERIMENTS

In group conversations with multiple interaction partners it is not possible that all participants communicate simultaneously. Participants who are currently not communicating observe the interaction, but the roles of speaker, addressee and observer change frequently. The following experiment shows the observation of an interaction between two participants. One of them is explaining the rules of the Tangram tabletop game. The robot is observing the situation without active participation.

The goal of the experiment is to demonstrate both auditory and visual perception as well as the fusion mechanisms. Figure 5 depicts the extracted percepts during the time interval from 30 to 90 seconds; the duration of the whole experiment was about 100 seconds. In the diagram four time steps are marked with a to d and the corresponding percepts are displayed. Figure 6 depicts images of the overhead reference camera, the left body camera with an overlay of the person tracking system, and the left eye camera at the marked time steps.

The output of the percept P'_audiofusion is displayed in the second diagram of figure 5. The voice activity detection separates the talking phases of both participants. The voice classification uses these intervals to generate an identity estimation at the marked time steps a, b and d. During the experiment the identification zero is assigned to the person on the left side (from the robot's view) and one is assigned to the person on the right.

The plot below the voice activity shows the root position of the localized sound source, which corresponds to the positions of the two talking persons. The large variation of the y_r variable depicts the alternating speech activity. The peak output energy increases only slightly during speech phases: the noise generated by the movement of the motors and the computer below the robot significantly reduces the signal-to-noise ratio. Nevertheless, both identification and speaker localization are possible.

The last two plots show the output of the visual tracking system, divided into the tracking of the person on the left and the one on the right hand side. A close relationship between the positions of the visual and the auditory tracking can be seen. This geometrical correlation is used to assign the estimated speaker identification to the closest tracked person.

The behavior of the robot during the experiment is solely influenced by the sound source position. The robot steadily focuses the sound source with the left eye camera, so the left eye is always directed at the current speaker. The visual tracking sometimes loses the interaction partner, which forces the robot to rely on the auditory perception until the tracking is reinitialized.
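For illustration, a sketch of the geometric correlation described above: the estimated speaker identity is assigned to the visually tracked person whose position is closest to the localized sound source. The data layout, distance threshold and function name are assumptions, not the system's actual fusion code.

```python
import numpy as np

def assign_speaker(sound_pos, speaker_id, tracked_persons, max_dist=0.8):
    """tracked_persons: dict mapping track id -> (x, y, z) position in the robot frame."""
    if not tracked_persons:
        return None
    track, pos = min(tracked_persons.items(),
                     key=lambda item: np.linalg.norm(np.subtract(item[1], sound_pos)))
    if np.linalg.norm(np.subtract(pos, sound_pos)) > max_dist:
        return None                     # sound source too far from any tracked person
    return track, speaker_id

tracks = {2: (1.2, -0.3, 1.0), 3: (1.3, 0.6, 0.9)}
print(assign_speaker((1.3, 0.7, 0.9), speaker_id=1, tracked_persons=tracks))  # -> (3, 1)
```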

VII. CONCLUSIONS AND FUTURE WORK

The complex interaction of humans and robots requires a model of the human interaction partner. The design as well as the realization of such a model should be based on psychological research to allow the transfer of inter-personal to human-robot communication scenarios.

This paper presents psychological models of time, memory and interaction signals and combines them into a common user model. One representative aspect of the model, the speaker identification, is discussed in detail. With the focus on an interactive tabletop game, the implementation and usage of the model are presented using the control architecture of the humanoid robot Roman.

The experiments showed that the proposed design is well suited for complex interactive scenarios. Future developments will add additional information to the model and support a wider range of communication scenarios.

REFERENCES

[1] Michael Argyle. Bodily Communication. Methuen & Co. Ltd., 1975 (9th edition, 2005).
[2] S. E. Avons, R. G. Leiser, and D. J. Carr. Paralanguage and human-computer interaction. Part 1: Identification of recorded vocal segregates. Behaviour & Information Technology, 8(1):13–21, 1989.
[3] I. Cohen. Noise spectrum estimation in adverse environments: Improved minima controlled recursive averaging. IEEE Transactions on Speech and Audio Processing, 11(5):466–475, September 2003.

[Figure 5 plots the percepts P'_person and P'_audiofusion, the voice activity, and the sound source and person positions (x_r, y_r, z_r) over the interval 30–90 s; the percept vectors at the marked time steps a–d are annotated in the top row.]

Fig. 5: Experimental evaluation of an interactive situation with two interaction partners talking in front of the robot. The first diagram shows the extracted percepts. The second diagram shows the position and activity of the sound localization and speaker identification. The last two rows show the corresponding perceived positions of the visual perception system.

[4] Richard L. Doty. Olfactory communication in humans. Chemical Senses, 6, 1981.
[5] Hugh Dubberly and Paul Pangaro. On modeling what is conversation, and how can we design for it? Interactions, 16(4):22–28, 2009.
[6] Starkey Duncan and Donald W. Fiske. Face-to-Face Interaction. Erlbaum, Hillsdale, NJ, 1977.
[7] Florian Faust. Multi-channel voice based speaker recognition for humanoid robots. Master's thesis, University of Kaiserslautern, 2009.
[8] Ville Hautamäki, Tomi Kinnunen, Ismo Kärkkäinen, Juhani Saastamoinen, Marko Tuononen, and Pasi Fränti. Maximum a posteriori adaptation of the centroid model for speaker verification. IEEE Signal Processing Letters, pages 162–165, 2008.
[9] Jochen Hirth, Norbert Schmitz, and Karsten Berns. Towards social robots: Designing an emotion-based architecture. International Journal of Social Robotics, 2011. DOI: 10.1007/s12369-010-0087-2.
[10] James G. Hollandsworth, Richard Kazelskis, Joanne Stevens, and Mary Edith Dressel. Relative contributions of verbal, articulative, and nonverbal communication to employment decisions in the job interview setting. Personnel Psychology, 32(2):359–367, 1979.
[11] Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. An efficient k-means clustering algorithm: Analysis and implementation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24:881–892, July 2002.
[12] V. Ng-Thow-Hing, K. Thorisson, R. Sarvadevabhatla, J. Wormer, and T. List. Cognitive map architecture. IEEE Robotics & Automation Magazine, 16(1):55–66, March 2009.

[Figure 6 samples: (a) 31 s: identification of the person on the right side is assigned; (b) 38 s: visual tracking is lost but the robot still focuses the sound source; (c) 52 s: visual and auditory tracking run independently; (d) 58 s: the identification can be reassigned.]

Fig. 6: Samples of the captured images during the dialog experiment. The left eye image shown in the rightmost column continuously focuses the observed sound source and therefore shows the currently talking person.

[13] S. Y. Okita, V. Ng-Thow-Hing, and R. Sarvadevabhatla. Learning together: ASIMO developing an interactive learning partnership with children. In Robot and Human Interactive Communication (RO-MAN 2009), The 18th IEEE International Symposium on, pages 1125–1130, 2009.
[14] Vlasta Radová and Zdeněk Švenda. Speaker identification based on vector quantization. In Václav Matoušek, Pavel Mautner, Jana Ocelíková, and Petr Sojka, editors, Text, Speech and Dialogue, volume 1692 of Lecture Notes in Computer Science, page 83. Springer, Berlin/Heidelberg, 1999.
[15] Sigurdur Sigurdsson, Kaare Brandt Petersen, and Tue Lehn-Schiøler. Mel frequency cepstral coefficients: An evaluation of robustness of MP3 encoded music. In Proceedings of the International Symposium on Music Information Retrieval, 2006.
[16] Kristinn R. Thórisson. Real-time decision making in multimodal face-to-face communication. In Proceedings of the Second International Conference on Autonomous Agents (AGENTS '98), pages 16–23, New York, NY, USA, 1998. ACM.
[17] Kristinn R. Thórisson. A mind model for multimodal communicative creatures & humanoids. International Journal of Applied Artificial Intelligence, 13(4):519–538, 1999.
[18] Hartmut Traunmüller. Paralinguistic phenomena. In Handbooks of Linguistics and Communication Science, pages 653–665. Walter de Gruyter, Berlin/New York, December 2004.
[19] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), volume 1, pages I-511–I-518, 2001.
[20] Bin Zhen, Xihong Wu, Zhimin Liu, and Huisheng Chi. On the importance of components of the MFCC in speech and speaker recognition. In ICSLP-2000, volume 2, pages 487–490, October 2000.