
PhD General Examinations:

    Publishable Quality Paper for Main Area (Prof. Deb Roy)

    Nikolaos Mavridis, MIT Media Lab, June 2005

    I think the most productive topic for a paper will be to design the MINIMAL mental

    model architecture that will support sensory-motor grounded interpretation of all spoken

    commands in the final part of the Token test. The paper should present this design, justify

    each element of the design, and relate the architecture as best as you can to the readings

    in your reading list.

    Deb

    Dear Deb,

    I followed the decomposition of the paper into three parts that I had proposed to you by email and to which you agreed. The key themes of this paper are:

    1) THE MINIMALITY CRITERION

    What could a meaningful "minimality" criterion be for a mental model or other systems

    of comparable complexity and scope? What are viable approximations for such criteria,

    or alternatively stepwise design methodologies that can guarantee some notion of strict or

    relaxed "minimality"?

    2) THE FUNCTIONAL SPECIFICATION

    What does it really mean to "pass the Token Test"? I.e., at what level of generality would we expect the system to perform? (I.e., what family of abilities exemplified by the Token Test are we targeting?)

    3) STEPPING THROUGH THE DESIGN PROCESS OF THE PROPOSED SOLUTION

    Based on specific proposals for 1) and 2), what is the suggested design presented in a

    stepwise/incremental manner? (Here, a lattice of "minimal" designs of increasing

    abilities/complexity might need to be presented, and the corresponding design space

    searched to find the minimal satisfactory solution)

    Nicholas

A near-minimal grounded situation model design

    for a conversational robot that can pass the “Token Test”

    Nikolaos Mavridis, MIT Media Lab, June 2005

    Abstract

    The purpose of this paper is to present a design

    of a minimal grounded situation model (GSM)

    [Mavridis and Roy 2005, Mavridis 2005] which

    will enable a conversational robot to pass the

    “Token Test”. The “Token Test” is a standard

    test designed to assess early situated language

    skills of children. This specific design is given as

    an example of a wider methodology. According

    to this methodology, agent model designs are

    derived from specifications, which are in turn

    derived by human behavioral tests. In this paper,

    after a brief introduction to the problem, optimal

    design and reasonable approaches towards

    practical near-optimality in such complex

    systems are discussed. Then, an answer to the

    question of what it should really mean for a

    robot to pass a human behavioral test is

    attempted. In the main section of this paper that

    follows, the proposed design is derived step-by-

    step. A conclusion section ends this paper. This

    paper constitutes a step towards the creation of

    conversational robots that can cooperate fluidly

    with humans. The paper’s main contributions

    include a proposal for an incremental design

    methodology for grounded situation models,

    guidelines for translating human behavioral tests

    into design specifications for robots, and an

    example of the use of the above methodology

    and guidelines: an easily extensible step-by-step

    design for a conversational robot that can pass

    the “Token Test”.

    0. Introduction

    [vision - previous work]

    The ultimate purpose of this work is the creation

    of conversational robots that can cooperate

    fluidly with humans in an intuitive manner using

    natural language. Our current work is part of a

    larger effort to develop conversational interfaces

    for interactive robots [Crangle and Suppes 1994,

    McGuire et al. 2002, Sofge et al. 2003]. Robots

    that understand and use natural language may

    find application in entertainment, assistive, and

    educational domains. Towards this ultimate

    purpose, a mental model for the conversational

    robot Ripley was originally designed, which in

    its initial form enabled the robot to achieve

    object permanence and viewpoint switching

    [Roy, Hsiao and Mavridis 2004]. Later, the

    design was extended, and incorporated into the

    general theoretical framework of “Grounded

    Situation Models” (GSM) having triple-layer

    properties [Mavridis and Roy 2005]. Using such

    models, Ripley became the first conversational

    robot that was able to fluidly mix sensory-

    derived information with spoken descriptions in

    the creation of the contents of his situation

    model, and was also able to resolve temporal

    referents and refer to past events. At that stage,

    Ripley was able to pass the first two parts of the

    “Token Test” [DiSimoni 1978], a standard test

    designed to assess early situated language skills

    of children. In [Mavridis 2005], a wider

    framework for a hierarchy of agent models of

    increasing complexity was proposed. The

    situation model was positioned as a part of the

    wider agent model, which apart from the GSM

    also includes other machinery, such as the

    processes required for action selection.

    [news - steps]

    The immediate aim of this paper will be the

    design of a GSM that will enable Ripley to pass

    the whole of the “Token Test”. The terminology

    developed in the relevant previous papers

    [Mavridis and Roy 2005, Mavridis 2005] will be

    adopted. However, the design proposal given

    here will be given as an example of a wider

    design methodology that will be suggested.

    Through this wider methodology, a list of

    specifications drives an incremental stepwise

    design. This incremental method can guarantee

    local near-optimality of the design, and also

    naturally enables future expandability. The

    specifications driving the method are in turn

    derived by a human behavioral test.

    Two prerequisites are needed for such an

    approach. First, we need a position on what minimality

    for grounded situation models might mean. What

    is optimality? What are we optimizing here?

    What methods or claims might be practically

    applicable? Second, we need a position on what

    we should really mean when we claim that a

    robot has passed a certain behavioral test. Should

    the robot be allowed to know the questions

    before the test? What level of generality should

    we aim for? What was the real purpose of the

    test? We will thus try to propose positions and thereby fulfill these two prerequisites.

    [contributions]

    It is our hope that the methodology proposed,

    even though it is currently in an initial stage, will

    prove to be useful towards the creation of

    conversational partners that cooperate with

    humans in a multitude of environments. Most

    existing systems have been designed through

    non-systematic ad-hoc methods. Although

    existing systems do exhibit satisfactory

    performance in highly constrained environments

    and with limited vocabularies, their ability to

    scale has yet to be proven. Also, most of the

    existing systems are more similar to tools than to

    partners, as they exhibit little or no recognition of the human partner’s intentions. Thus,

    they can only respond to immediate commands,

    and cannot participate in longer term cooperative

    and flexible planning. We believe that the

    approach presented here will prove to be capable

    of both scalability and true cooperation. We

    envision that in the future the systematicity of

    the design procedure will enable its partial

    automation and thus aid towards scalability.

    Also, the inherent ease with which the proposed

    design can accommodate embedded agent

    models of others might well prove to be

    invaluable towards equipping the robot with a

    form of Theory of Mind [Premack and Woodruff

    1978], which will act as a catalyst towards true

    human-robot cooperation. Furthermore, we

    believe that our proposals on grounded situation

    model near-optimality as well as on the

    adaptation of human behavioral tests for robot

    design will prove valuable to other researchers in

    the field.

    [overview of paper]

    This paper will start by discussing our positions

    on design minimality for GSM’s, as well as on

    how to use human behavioral tests as

    specifications for robot design. Then, the

    proposed design will be given in an incremental

    step-wise manner. We will start by considering

    the first four parts of the token test, and derive

    the heart of our proposal through their

    requirements. Then, we will consider the final

    part, which exhibits much less homogeneity than

    the first four, and propose successive

    augmentations to our initial design. A section

    discussing the local optimality of the proposed

    design will follow. Finally, a concluding section

    will end this paper.

    1. Optimal design

    [generalities about optimality]

    Optimality is, and has always been, one of the

    principal obsessions of engineers. Apart from the

    purpose of achieving maximal “economy” of

    some sort, optimality also fulfills another very

    important function: it helps narrow down an

    often exponentially huge and intractable space of

    possible designs to a single representative, which

    is baptized as “best”. This design is not “best”

    under any absolute sense; it is only under the

    arbitrariness of a designer-chosen criterion that

    quantifies its “economy” that it can claim to be

    so. Given such a criterion, engineers can

    compare their designs more objectively, and

    claim superiority for a particular design on

    rational grounds. Of course, absolute

    “optimality” is only a naive engineer’s illusion.

    This is so, as it can only exist within a well

    constrained and predefined universe of designs.

    Furthermore, it is optimal under the biased

    judgement of an ad-hoc optimality criterion,

    which will also in most cases take into account

    only partial aspects of the designs. Under the

    above two degrees of freedom (the universe of

    designs and the optimality criterion), and given

    inevitable preexisting constraints, finding optimal designs nevertheless proves easy only in some highly idealized cases.

    [a highly idealized example problem]

    Consider, for example a variant on a classic high

    school optimization problem:

    “John has a rope which is four meters long. He

    wants to use this rope in order to separate the

    territory of his small robot from the rest of the

    world. This territory can be any rectangle which

    is made from the rope. His robot would be

    happiest if the area of its territory was as big as

    possible. How can John make his robot as happy

    as possible?”. Let us use this idealized problem

    as a small parable illustrating some of the critical

    issues in optimal design.

    First, the space of possible designs is given and

    pretty well defined – it consists of all rectangles

    with perimeter of four meters. Second, the

    “optimality” criterion is given and well defined –

    make his robot as happy as possible by

    maximizing the area of the rectangle. Third,

    among the preexisting constraints are the total

    length of the rope, as well as rope physics and

    the geometry of Euclidean space.

    How does John proceed to discover the optimal

  • design? First, he idealizes the natural

    components into mathematical abstractions: the

    rope becomes a one-dimensional curve of

    infinitesimal width and without bending

    constraints. Second, he parametrizes the space of

    all possible designs, yielding in essence a

    generative model. The space is isomorphic to the

    real numbers R; indeed, any rectangle can be

    described by two parameters (width, height)

    which collapse to one under the perimeter

    constraint (the perimeter should be equal to the

    length of the rope which is 4m, i.e. 2*width +

    2*height = 4, i.e. width = 2 – height). Third, he

    has to select the “best” design; by an arbitrary

    choice of method (calculus, inequality

    arguments, consulting an oracle) he declares that

    the particular design with width = 1 (and

    correspondingly equal height) is “best”.
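    For concreteness, the calculus route is the standard single-variable argument, with w the width in meters and the height fixed by the perimeter constraint:

```latex
% Area as a function of width w (in meters) under the 4 m perimeter
% constraint 2w + 2h = 4, i.e. h = 2 - w.
\[
\begin{aligned}
A(w) &= w\,(2 - w), \qquad 0 < w < 2,\\
A'(w) &= 2 - 2w = 0 \;\Rightarrow\; w = 1, \qquad A''(1) = -2 < 0,\\
A(1) &= 1\ \mathrm{m}^2 \quad\text{versus}\quad A(1.5) = 0.75\ \mathrm{m}^2 \text{ (the critic's proposal below).}
\end{aligned}
\]
```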

    He thus takes the rope, implements the design,

    and his robot seems to be smiling and jumping

    around. Alas, in a scientific community, any

    claim should be accepted with a lot of suspicion;

    and indeed we would expect many critics to

    appear and declare that John’s robot might have

    been happier. John must answer their claims, or

    else he will fall from his throne of Reason to the

    muds of unwarranted speculation and blindsight.

    Indeed, he takes his ruler, measures the design,

    and declares that the area is one square meter.

    The critics take their turns:

    “If you had just made one side a little longer – say 1.5 meters – your robot would be happier.” “That’s an easy one”, John replies without hesitation.

    “But then”, he says, “the total area would have

    been less – let’s see how much: .75 square

    meters – 25% less”. The theorists were

    convinced, but some hardliners were still waiting

    for empirical evidence. Indeed, he reshapes the

    territory, and the robot’s face is not glowing with

    happiness anymore. Before it starts to cry, he

    reshapes it back to the original. Then, another

    critic voices his thoughts: “I have a better

    solution: why not try a pentagon?”. Another one

    stops him mid-sentence: “A circle! – why care

    about pentagons?”. “But “, John answers, “The

    problem statement was clear: it said ‘rectangle’. Is the pentagon a

    rectangle? Or the circle? All I can do is follow

    the statement. And, after all, my robot is still

    smiling!”.

    [key results of above story]

    A few key points should have been illustrated by

    the above story. First, for convenience, we can

    conjecture a real world of designs, as well as an

    ideal world of models. Models can be tested in

    theory and designs in practice, but in the end,

    practice is what we aim for, and it is always

    more convincing. Second, the universe of

    designs and the optimality criterion are degrees

    of freedom that have to be explicated and

    justified. The constraints of the real world, in the

    form that they are transferred to the ideal, should

    be explicated too. Third, the generative model (if

    any) that will span the universe of designs, as

    well as the optimization method, present even

    more degrees of freedom.1

    [moving on to more complicated cases]

    But what happens when we leave such simple

    examples and move on to more complicated

    cases? First of all, the uniqueness as well as the

    achievability of the optimal solution might

    become questionable. Second, approximations of

    global or even local optima might be all we can

    achieve. Also, the optimality criterion itself often

    has to become partial: we have to select a

    specific subset of the set of possible important

    aspects of each design, in order to make the

    problem tractable. Third, absolute optimality

    might be meaningless. For example, our data

    might be noisy, or the computational procedure

    to achieve an absolute optimum might be too

    expensive. Also, in the case of real-time

    optimization, the environment might be changing

    so quickly that the optimality criterion changes along with it.

    solution might be as good if not even better than

    a hypothetically optimal one [Simon 1969].

    Fourth, we might only have partial data; and we

    might be required to devise something that will

    still be useful in the future, when exposed to new

    data that we have not encountered yet. I.e., we

    might not only need a good if not optimal

    “fit” to the current requirements, but also an easy

    upgrade to further requirements; i.e.

    generalization ability given an inexpensive

    expansion. Of course, the ability to generalize is

    most often antagonistic to fitting the current

    requirements optimally; but, tradeoffs can be

    found, and many of the techniques developed in

    pattern recognition (k-fold cross-validation, for

    example) and complexity theory [Cherkassky et

    al. 1999] address similar problems. In the design

    case, what we really need is slightly different

    1 In the above example we have assumed a non-

    enumerably infinite and continuous space of

    designs – very often, discrete or hybrid spaces

    also enter the picture.

  • than the standard problem setting. In the standard

    setting, we need to be able to generalize to

    further requirements with the present design.

    Here, we instead need to be able to generalize to

    further requirements through an inexpensive

    augmentation of our existing design. Things are

    much better if a simple “patch” on top of the

    existing design can satisfy the new needs; if instead we need to throw away everything and redesign ex nihilo, we would have to “spend” much more.

    Let us now recapitulate briefly, before moving

    on from general comments to the special case of

    grounded situation models. When we move from

    small-scale highly idealized problems to more

    realistic large scale cases, an absolute global

    optimum might not be achievable or even

    desirable. Also, our criteria for optimality cannot

    be all-inclusive anymore. Finally, we often have

    to take into account the adequacy of our

    proposed design under not only current but also

    future specifications, and the existence of an

    inexpensive route for expansion.

    [minimal grounded situation models – overview

    of our proposal]

    But what import do all of the above comments

    have on minimal grounded situation model

    designs? Here we will propose a methodology

    for an incremental design, satisfying a set of

    specifications, and claiming local optimality. The

    specification will be broken down to parts, in

    rough order of foreseen complexity. An initial

    design satisfying the first specification will be

    used as a starting point, and successive

    augmentations that satisfy the remaining

    specifications will be applied to it. The final

    design will be tested for local optimality, and

    pruned down to a locally optimal solution if this

    is required. We will base our designs on a

    standard underlying organizational framework

    for grounded situation models, that is briefly

    described in [Mavridis 2005]. The use of this

    framework in conjunction with the proposed

    design methodology naturally provides an easy

    pathway for further expansion in the future.

    [a natural partial ordering of complexity]

    First of all, we need to devise a criterion for

    comparison of two grounded situation models.

    Here I will propose a rough criterion that will

    nevertheless be adequate for our purpose.

    Ideally, given a GSM design, we would like to

    be able to quantify its complexity by some

    metric value. Then, when given two GSM

    designs, we just accept the design with the

    smallest complexity metric as “minimal”.

    However, absolute metric values are not

    necessary for such a comparison; any ordering

    relation will do.

    Now, let’s suppose that any GSM design can be

    decomposed into parts (representations and

    processes), and that we are given two specific

    designs, the first of which includes all of the

    parts of the second plus some extra parts.

    Then, any reasonable definition of “complexity”

    will order the first design as being more complex

    than the second. Of course, this principle only

    provides a partial ordering – it does not supply

    us with any directive on how to deal with cases

    where one of the designs is not composed from

    parts which are a superset of the parts of the

    other. However, this partial ordering will be the

    criterion for comparing GSM designs that I will

    adopt here. It will prove adequate, as we will

    restrict our comparisons to augmentations of

    existing designs. The question that follows is:

    can we devise an incremental design

    methodology that will be able to claim some

    form of optimality, while only utilizing this

    natural partial ordering?
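    Stated a little more formally, the ordering in question is simply inclusion of part-sets (a notational restatement, not taken from the cited framework papers):

```latex
% Partial complexity ordering over GSM designs, each viewed as its set of
% parts (representations and processes); designs whose part-sets are not
% nested in either direction remain incomparable.
\[
\mathrm{parts}(D_2) \subsetneq \mathrm{parts}(D_1) \;\Longrightarrow\; D_2 \prec D_1 .
\]
```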

    [incremental design using the natural partial

    ordering]

    Consider that we are given a list of design

    specifications. We start by producing an optimal

    design for the first specification. Then, we

    consider all possible augmentations of our initial

    design that would enable it to cover not only the

    first but also the second specification, and we

    choose the optimal augmentation. Under what

    conditions will the resulting new augmented

    design be optimal? The answer should not be

    difficult to spot: the resulting new design is

    optimal as long as:

    C1) We assume additive costs: when the parts of A and B are disjoint (A ∩ B = ∅), then cost(A+B) = cost(A) + cost(B), where A denotes the initial design and B the chosen augmentation.
    C2) There is no design that would include only part of the initial design A and that would suffice for both the requirements of specification 1 and specification 2.

    Proof: Let’s assume that there exists another design C, which is not equal to A+B, and which is better than A+B. Then, by C2, C should include A; and thus, C should be decomposable as C = A+B’, where B’ is not equal to B, because we supposed that C is not equal to A+B. But then, as C is known to be better than A+B, cost(A+B’) < cost(A+B), and so, by the additivity assumed in C1, cost(B’) < cost(B). This contradicts the fact that B was chosen as the optimal (cheapest) augmentation satisfying the second specification; hence no such C exists, and A+B is optimal. Repeating the same augmentation step for each remaining specification in turn yields locally optimal,

    satisfactory GSM designs, which are furthermore

    naturally expandable.
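    The incremental procedure can be summarized in a few lines of Python; the sketch below is purely illustrative (all callables are supplied by the caller, and nothing in it is specific to grounded situation models or to the Ripley implementation):

```python
from typing import Callable, FrozenSet, Iterable, List

Design = FrozenSet[str]          # a design is modeled as a set of named parts

def incremental_design(
    specifications: List,
    initial_design: Design,
    candidate_augmentations: Callable[[Design, object], Iterable[Design]],
    satisfies: Callable[[Design, object], bool],
    cost: Callable[[Design], float],
) -> Design:
    """Greedy, locally optimal construction of a design over an ordered list
    of specifications, mirroring the procedure argued for in the text."""
    design = initial_design                       # assumed optimal for specifications[0]
    for spec in specifications[1:]:
        # All augmentations that let the current design also cover this
        # specification (assumed non-empty, i.e. the spec is satisfiable).
        candidates = [aug for aug in candidate_augmentations(design, spec)
                      if satisfies(design | aug, spec)]
        # Keep the cheapest one; conditions C1 and C2 are what let this
        # purely local choice claim local optimality of the overall design.
        best = min(candidates, key=cost)
        design = design | best
    return design
```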

    2. Behavioral tests for humans as design

    specifications for robots

    [what constitutes cheating in a test and what does

    not?]

    When trying to use a behavioral test designed for

    humans as a design-driving specification for a

    robot, one main difficulty arises, which we will

    discuss here. The human to be tested is supposed

    not to know exactly the contents of the test

    before it is administered. However, we assume

    that throughout his experience so far in his life,

    he has received adequate explicit or implicit

    training stimuli in order to perform well. When

    administering a behavioral test to a robot, we

    certainly should not allow the designer of the

    robot or the trainer to have given explicit

    response specifications for each of the specific

    tasks that comprise the test. Allowing such a

    state of affairs would be equivalent to having

    trained the human for the specific tasks/questions

    of the test: having shown him the correct

    responses and having made sure that he performs

    them well. This would certainly count as

    cheating in most human testing cases. So the

    question arises: if having supplied the

    “questions” and the “correct answers” to the

    specific test, and having made sure the human

    performs them adequately counts as “cheating”,

    what would be considered allowable?

    [the wider area versus the specific questions]

    Let’s examine the human case again. A high

    school student preparing for his SAT tests is

    exposed to intense training. This training takes

    the form of explicit tuition on the related

    subjects, practice tests etc. Then, one day, he has

    to sit his actual exam. Why doesn’t the training

    count as cheating? Because it is general enough to cover the wider area to which the questions of the administered test belong, but not so specific as to cover only those questions. It would have been highly

    unlikely for the training to cover only the

    specific questions of the administered test,

    without any pre-knowledge or “leak” of the

    exam questions.

    So what is the moral for the robot case? Any

    proposed design solution that passes a human

    behavioral test should be able to pass any test

    that belongs to the wider area to which the test

    belongs, and not only the specific test in

    question. Of course, the delineation of the

    “width” of the area to which the specific tests

    belongs should be carefully justified. Notice that

    an analogous situation holds in the pattern

    recognition evaluation literature. In supervised

    training problems, we are given a training set and

    are asked to design a classifier that performs well

    on a (supposedly unknown) testing set. The

    training set should be general enough to cover and densely represent the area to which the testing set might belong. However, it must again not be so specific that it suspiciously includes only the testing set.

    [the test design problem]

    After all, one must ask a wider question: what is

    the purpose of testing? What was the test

    designer trying to achieve? Usually, the test

    design procedure is posed as follows:

    a) decide on the limits of the wider area of the

    material to be tested

    b) select specific questions out of the wider area,

    such that:

    b1) they can be practically tested in limited

    time (few in number, short answers)

    b2) the answers given to the selected

    questions can have high predictive value

    towards the answers that the examinee

    would give to all the questions comprising

    the wider area of the material to be tested.

    In short, the test designer must choose a few

    easily testable questions that are however highly

    indicative of the examinee’s mastery of a wider

    area of knowledge that is being tested.

    [the three commandments]

    The above discussion has clear implications

    towards using behavioral tests for humans as

    design specifications (or alternatively as training

    material) for robots. First, the test must be

    reverse-engineered: given a specific test we must

    try to delineate the wider area to which the tasks of the test belong, and for which the specific test acts as a representative sample. In case we

    have explicit knowledge of the test designer’s

    coverage intentions, we can adopt them readily.

    Else, we have to “grow” the wider area by using

    the specific questions as a “seed”. The extent and

    generality of the wider area must be chosen.

    Second, we must use this wider area as a design

    specification, or sample it randomly for the

    generation of a training set. We are not allowed

    to use the specific questions comprising the test

    as the only target domain. Third, we must clearly

    and explicitly justify all of our above choices.

    [minimality at odds with generality requirement

    prescribed by “width”]

    Finally, it is worth noting that the designs that

    can satisfy the requirements of the wider

    “grown” areas will often prove to be more costly

    than those that would only satisfy the seeds.

    Nevertheless, only a design that can pass the

    “wider” requirements is of any real value, and is

    furthermore more general and easily expandable

    in the future. Of course, the right level of

    generality must be decided and explicated on a

    case-by-case basis, as we will do here.

    3. The token test and the constraints it imposes on mental model representations and processes

    Below I will adopt an agent model which

    decomposes into an enumerated set of

    components (representations and processes),

    along the lines of the decomposition described

    in: [Mavridis 2005].3 Also, I will adopt the three-

    layer grounded situation model terminology of

    [Mavridis and Roy 2005]. I will examine each

    part of the test in turn and proceed in the manner

    outlined above: I will use the specific questions

    of the test as a seed used to grow a wider area to

    be tested, and then use a model of this wider area

    and the expected performance as a design

    specification.4

    [sensory input modalities]

    Let us start with the sensory input of the agent.

    The agent under test should be able to sense a set

    of utterances (signs) as well as an arrangement of

    physical objects (non-signs). The physical

    objects are either colored squares or circles,

    which have been cut out of thin plastic, and

    which have been placed on a flat uniform-

    colored surface.5

    3 This choice constrains the universe of designs (as termed in the first section) to those that can be produced within the decompositional framework of [Mavridis 2005].
    4 Translating to the terms introduced earlier in the first section: the optimality criterion follows naturally from the partial ordering that lies at the heart of the proposed incremental design methodology. The optimization method is also contained in the proposed design methodology.
    5 Also, for the fifth part of the test, it has to sense the physical body of the person administering the test, as we shall see much later in this paper.

    In terms of the four-stage reality-to-situation

    model processing pipeline, the first stage

    (reality-to-sensory stream projection) concerns

    us in this paragraph. From the form of the

    required signs and non-signs to be sensed, two

    requirements arise. First, the sensory organs

    should cover the sound wave spectrum with

    enough range and fidelity in order to be able to

    receive the utterances and successfully

    discriminate among them, and thus some form of

    a hearing organ is required. Second, the sensory

    organs should be able to locate and discriminate

    among the various objects and their properties,

    and thus at this stage we assume that some form

    of a visual organ with coverage of the human

    visible spectrum having adequate range and

    fidelity is also required.6

    [utterances heard are all requests]

    Now let us briefly consider the utterances heard.

    After a brief glance through the “Token Test”,

    one can easily notice that only request speech

    acts are used, taking their direct syntactical form

    (Imperative). Furthermore, all these are requests

    for overt actions, and in particular, motor

    manipulation actions (there are no requests for

    speech, and arguably none for “internal” covert

    actions). Also, notice that the recipient of the test

    is assumed to be cooperative, and should

    therefore try to fulfill the test-giver’s requests as

    best as he can. Furthermore, he is assumed to

    have been well-fed and sheltered, so that his

    other vital goals are already satisfied (hunger,

    thirst, danger avoidance etc.), and so during the

    test he will only perform actions towards the

    satisfaction of the requests given by the tester.

    Thus, as the utterances contain requests for

    motor actions, we suppose that the expected

    result of the recipient hearing the utterance will

    be the production of one or more motor actions,

    which will be selected on the basis of the

    utterance as well as the situational context.7 This

    dictates that the action selection mechanism of

    the agent need not involve complicated partial

    planning or even rational action selection. For a

    pre-programmed agent, we are at the level that

    6 Alternatively, touch or ultrasound sonar might

    have substituted for vision, although color would

    not be covered. Of course, the Token Test was

    not originally designed for blind children, and

    vision is silently assumed to be the modality

    used.
    7 Later in this paper, we will also introduce covert internal actions in addition to overt motor actions.

    was termed level-1 in [Mavridis 2005] (reflexive

    organism). In the case of a conditionable-

    response agent, we might move to level-2.

    Nevertheless, once conditioning is finished,

    actions will effectively be selected by a fixed

    (utterance, Situation Model Contents)-to-

    ActionSequence table.8 Below I will elaborate on

    the stages that intervene between utterance

    reception and action sequence production, and on

    the way that the situation model contents (i.e. the

    situational context) contribute to this process.

    [the 3 stages from utterance to action sequence]

    I will assume that each heard utterance goes

    through three stages of translation, before the

    necessary motor actions are produced. First, the

    surface form of the utterance is translated into a

    set of zero, one or more ordered statements

    which have the general form:

    (arguments in curly brackets {} are optional)
    <Action, Description> or <Action, Manner(Description {, Description …})>

    At this first level (Description-level), the

    situational context has not yet been taken into

    account.

    At the second level (Object-level), the situational

    context (categorical layer of situation model) is

    used in order to resolve the definite descriptions

    of object sets into their actual members (pointers

    to objects in the situation model with unique

    identifiers):

    <Action, ObjectID> or <Action, Manner(ObjectID {, ObjectID …})>

    At the third level (Action-level), the situational

    context (this time the continuous layer of the

    situation model) is fed to the manner-to-

    actionargument functions, in order to provide

    continuous-valued arguments for the action

    routines. I.e.:

    <Action, ContinuousArgument {, ContinuousArgument …}>

    [an illustrative example of the three stages]

    A simple example will help illustrate the three

    stages. Let us assume a situation model

    consisting of objects that have three properties

    (position, size, color), which furthermore are

    8 Later, we will comment on the parts of the test where a more complicated action selection mechanism, such as a planner, might have been better suited, even if not necessitated.

    quantized to the categories {left, center, right},

    {small, large}, {red, green, blue}. Let us further

    assume that the current contents of the situation

    model are:

    Obj#33

    Continuous layer:

    Pos = (34, 12), Size = (2), Color = (.5, .1, .2)

    Categorical layer:

    Pos = “right”, Size = “small”, Color = “red”

    Obj#37

    Continuous layer:

    Pos = (14, 2), Size = (5), Color = (.05, .6, .2)

    Categorical layer:

    Pos = “left”, Size = “large”, Color = “green”

    Obj#38

    Continuous layer:

    Pos = (54, 8), Size = (5.3), Color = (.5, .1, .1)

    Categorical layer:

    Pos = “right”, Size = “large”, Color = “red”

    Let us also assume that the received utterance is:

    “Put the small red one in front of the green one”

    At the description-level, this utterance will

    translate to an ordered sequence of two

    statements:

    <Get, “the small red one”>, <Put, Infrontof(“the green one”)>

    Notice that the situational context has not been

    utilized yet. At the next level, the categorical

    layer of the situation model will help resolve the

    descriptions into unique object ID’s.

    At the object-level, we will have:

    <Get, Obj#33>, <Put, Infrontof(Obj#37)>

    Now, manner will compose with the continuous

    layer contents of the relevant objects, to produce

    action parameters. I.e.:

    <Get, (34,12)>, <Put, Infrontof((14,2))>9

    Finally, this will translate to the following two

    actions to be executed sequentially:

    9 Here we assume for simplicity that Infrontof

    requires as an argument only the landmark object

    and does not depend on the rest of the situational

    context

    Get(34,12) and Put(14,0).10
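    To make the three translation stages concrete, the example above can be replayed in a few lines of Python. The data structures and names below are ad hoc for this illustration rather than taken from the Ripley implementation, and the offset of 2 in infrontof is chosen only so that the numbers match footnote 10:

```python
# Situation model contents of the example (continuous + categorical layers).
objects = {
    33: {"pos": (34, 12), "cats": {"size": "small", "color": "red"}},
    37: {"pos": (14, 2),  "cats": {"size": "large", "color": "green"}},
    38: {"pos": (54, 8),  "cats": {"size": "large", "color": "red"}},
}

def resolve(description, objs):
    """Object-level: match a categorical description against the categorical
    layer and return the unique object id it picks out."""
    hits = [oid for oid, o in objs.items()
            if all(o["cats"].get(k) == v for k, v in description.items())]
    assert len(hits) == 1, "a definite description should pick out exactly one object"
    return hits[0]

def infrontof(oid, objs, offset=2):
    """Action-level 'manner' function: a target position computed from the
    landmark object's continuous layer (very rough, as in footnote 10)."""
    x, y = objs[oid]["pos"]
    return (x, y - offset)

# Description-level statements for "Put the small red one in front of the green one".
statements = [("Get", {"size": "small", "color": "red"}),
              ("Put", ("infrontof", {"color": "green"}))]

actions = []
for verb, arg in statements:
    if verb == "Get":                 # argument is a plain definite description
        actions.append((verb, objects[resolve(arg, objects)]["pos"]))
    else:                             # argument is a manner applied to a landmark description
        _manner, landmark = arg
        actions.append((verb, infrontof(resolve(landmark, objects), objects)))

print(actions)   # [('Get', (34, 12)), ('Put', (14, 0))]
```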

    After this brief introduction, we are now ready to

    continue with the discussion of the specific

    demands that the questions of the token test

    impose on the grounded situation model of the

    agent. We will start by considering the first four

    parts of the test, and then proceed to the fifth,

    which is much less homogeneous and more

    demanding.

    4. Passing the first four parts of the token test

    Here the general form of the questions is:

    Touch the <size1> <color1> <shape1> AND the <size2> <color2> <shape2>
    Where:
    <shape1>, <shape2> belong to: {circle, square}
    <color1>, <color2> belong to: {red, green, yellow, blue, white}
    <size1>, <size2> belong to: {small, large}

    Some of these arguments are allowed to be

    optional.

    Of course, this wider general form only

    corresponds to part 4; in earlier parts, some of

    the above slots are missing, and thus some

    arguments should be allowed to be optional.

    Thus, with the above general form, we have used

    the specific questions of the first four parts of the

    token test as a seed for a wider set of questions

    to be targeted. Arguably, at this stage, one might

    further widen this set; for example, we could

    extend to the more general forms:

    <verb> the <dd1> AND the <dd2> AND … AND the <ddN>
    <verb1> the <dd1> … AND <verb2> the <dd2> …
    <verb1> the <dd1> <manner1> … AND <verb2> the <dd2> <manner2> …

    One could also extend towards more property

    categories: more colors, shapes etc.

    However, I will assume that these wider general

    forms and extension would constitute an

    overgeneralization given the specific questions

    comprising parts 1-4. Thus, we will stop at the

    initial level of generality that we proposed.

    10 In this example, let us suppose Infrontof(14,2)

    = (14,0). Notice that here we silently assumed

    that Infrontof returns a number of object

    properties and not a virtual object as its output.
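    As a concrete illustration of the level of generality just adopted (a toy grammar for this discussion, not the parser actually used with Ripley), the chosen general form can be matched with a single regular expression:

```python
import re

# One optional-size, optional-color, mandatory-shape noun phrase.
NP = r"(?:(small|large)\s+)?(?:(red|green|yellow|blue|white)\s+)?(circle|square)"
PART4 = re.compile(rf"touch the {NP}(?: and the {NP})?$", re.IGNORECASE)

def parse_part4(utterance: str):
    """Return the description-level statements for a parts 1-4 command,
    or None if the utterance falls outside the chosen general form."""
    m = PART4.match(utterance.strip())
    if not m:
        return None
    groups = m.groups()
    statements = []
    for size, color, shape in (groups[0:3], groups[3:6]):
        if shape:                     # the second conjunct may be absent
            statements.append(("Touch", {"size": size, "color": color, "shape": shape}))
    return statements

print(parse_part4("Touch the small red circle and the large green square"))
# [('Touch', {'size': 'small', 'color': 'red', 'shape': 'circle'}),
#  ('Touch', {'size': 'large', 'color': 'green', 'shape': 'square'})]
```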

    Now, let us step through the requirements the

    chosen general form of questions and their

    expected resulting actions impose on the

    representations and processes of the situation

    model. The representation – process

    enumerations given in [Mavridis 2005] will be

    used here, and are repeated below for easy

    reference:

    Representations contained in the agent model:

    R1) Situation model: represents external reality

    (not only physical necessarily)

    R2) Internal state (goals, affect)

    R3) Past states/situations (includes events)

    R4) Future predictions

    R5) Action list

    Processes contained in the agent model:

    P1) Sensor-to-SituationModel process

    P2) SituationModel-to-InternalState process

    P3) Storage/Maintenance of Past process

    (includes event recognition)

    P4) Future Prediction Generation process

    P5) Action Selection Process

    (SituationModelState-to-action table or planner)

    Representations:

    The questions impose a need for three property

    dimensions in the description of each object

    (R1), namely <shape>, <color> and <size>. At

    the categorical-layer level (R1), these property

    vectors should be quantized to the following

    categories: {circle, square}, {red, green, yellow,

    blue, white}, {small, large}. Notice that no

    requirement for a position property dimension

    has yet arisen, but nevertheless will soon arise.

    The action required by the questions necessitates

    a new entry at the agent’s empty action list (R5).

    This entry will correspond to the “Touch” action,

    which requires at this stage only a position as its

    argument. Due to the general form of the

    questions, this position must be derived from an

    object11. This necessitates the addition of a

    further fourth property dimension to the

    description of each object (R1), namely

    <position>. Notice that no categorical-layer

    requirements yet exist for this property; positions

    needn’t ever be categorized or verbalized yet.

    For the purpose of providing arguments to the

    “Touch” action, the position property dimension

    only needs to have a continuous-layer

    11 I.e. we don’t currently have touch <position description>, such as “touch at the left”, but touch <object description>: “touch the green circle”.

    representation (i.e. a single vector value).

    [humanly-understandable property dimensions]

    Here it is worth clarifying a crucial point before

    proceeding. What do we mean by saying that a

    requirement for three property dimensions (for

    example <size> and <color>) has

    arisen? Couldn’t we for example get a single

    composite multi-dimensional property, break it

    up using any possible local coordinate system, or

    even collapse all three to a single-dimensional

    one given a suitable transformation? Of course

    we could, as long as the new representation was

    informationally equivalent to the original one

    which consisted of the three humanly-intelligible

    property dimensions. But such a new

    representation would complicate our

    understanding of the actual processes that would

    comprise the sensory-to-situation model

    processing stages. For example, instead of

    deriving “size” from the object-individuated

    parts of the incoming image, some other feature

    vector with no direct human interpretation will

    have to be derived. Later, the “categorizer” at the

    fourth stage of the pipeline, would have to

    operate on this new feature vector in order to

    derive verbal size categories.

    This is why we decide to directly reflect in the

    situation model designs the property dimensions

    of objects that exist in natural language – such as

    <color> and <size>. Of course, such

    properties will not always be orthogonal or

    independent. However, if we needn’t impose a

    very strict definition of minimality in our

    representations, we are better off by sometimes

    possibly including an extra property dimension

    (even if somewhat redundant) than by dealing

    with the extra complications described above.

    [the test-giver object]

    Finally, an extra “dummy” object is inserted in

    the situation model, corresponding to the speaker

    of the heard utterances. The utterances are

    represented as property contents in the situation

    model, which are attributed to the dummy object

    (speaker), of which no physical visibility is

    assumed (a simplifying assumption). Thus, the

    speaker object is instantiated initially given no

    visual evidence, and contains only dummy

    property information (position, size, shape,

    color), plus a new property dimension: utterance

    (R1). Of course, in real-world implementations

    such as the robot Ripley, the speaker is not a

    simple object but an agent, and the properties do

    not only hold dummy contents [Mavridis and

    Roy 2005]. For example, the “position” property

    might be estimated through a face detector.

    However, given the purpose of our exposition,

    this is not required. In conclusion, here we need

    to augment our object property dimensions with

    the “utterance” dimension, and we need to insert

    a dummy “speaker” object in the situation

    model. We will soon show how specific

    processes update the contents of the “utterance”

    property so that its continuous layer contains a

    feature representation of the last speech segment

    heard, and its categorical layer contains the

    recognized text string.
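    Collecting the representational requirements so far in one place, a minimal object record might look roughly as follows; this is only an illustrative sketch of the structure described in the text, with field names chosen here for readability rather than taken from the actual system:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class GSMObject:
    """One object in the situation model, with a continuous and a categorical
    layer for each property dimension required by parts 1-4 of the test."""
    # Continuous layer (vector values).
    position: Optional[Tuple[float, float, float]] = None
    size: Optional[float] = None
    color: Optional[Tuple[float, float, float]] = None
    shape_features: Optional[Tuple[float, ...]] = None
    utterance_features: Optional[Tuple[float, ...]] = None  # filled only for the speaker
    # Categorical layer (verbalizable categories; position needs none yet).
    size_category: Optional[str] = None    # {"small", "large"}
    color_category: Optional[str] = None   # {"red", "green", "yellow", "blue", "white"}
    shape_category: Optional[str] = None   # {"circle", "square"}
    utterance_text: Optional[str] = None   # recognized text string
    is_speaker: bool = False               # marks the dummy "speaker" object

# The dummy speaker object is instantiated with no visual evidence;
# only its utterance property dimension will ever hold real contents.
speaker = GSMObject(is_speaker=True)
```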

    Processes:

    Here, we will first consider the sensory stream of

    vision (non-sign), and the four stages of

    processing that mediate between external reality

    and the situation model. Then, we will turn our

    attention towards the sign-stream (hearing

    utterances), and the three stages of processing

    that mediate between the utterances (belonging

    to the situation model) and the resulting actions.

    a) The non-sign stream (vision) (P1a):

    After having a look at the two object

    arrangements given in the Token Test, we can

    propose that the general set to which the physical

    objects causing the sensory stimuli belong is:

    Rectilinear equidistant arrangements of circles or

    squares of small or large size, with uniform Blue

    or Green or Yellow or White or Red color. The

    circles / squares are cut out of plastic and placed

    on a uniform background.

    Arguably, we could extend this set to a wider

    one, allowing more complicated arrangements,

    more sizes, colors, shapes, more complicated

    background, or even three-dimensional objects.

    However, we will choose not to impose such a wide requirement here, because we believe it would be outside the spirit of the designer of the Token Test.

    The overall purpose of the external reality – to –

    situation model stream for non-signs is to create

    a categorical-layer and continuous-layer

    representation that is of adequate fidelity in order

    to:

    a) drive referent resolution (through the

    categorical-layer size, shape and color

    representations),

    b) provide parameters for motor actions on

    objects (through the continuous-layer position

    representation).

    These requirements (successful referent

    resolution and motor manipulation) provide the

    criterion for the adequacy of the 4-stage

    pipeline.12 These requirements can also be

    broken down into specific requirements for each of

    the processing stages:

    The first of the four stages is the sensory

    projection stage. Here, any device that provides a sensory stream of adequate fidelity to support the resulting categorical representation at the end of

    the pipeline will be sufficient. Then, the object

    individuation stage follows. Here, the sensory

    stream must be segmented into parts that

    correspond to the physical objects that had

    caused them. Later, the property values have to

    be extracted. Here, we need four processes,

    which given the segmented parts corresponding

    to each physical object, will calculate its

    property values. Thus, we need processes for

    position estimation, size estimation, color

    estimation, and shape estimation. These

    estimators must provide estimates of adequate

    quality in order to ultimately drive referent

    resolution and motor action parameters, as we

    said above. The fourth and final stage is that of

    categorical quantization. Here, we need

    categorical classifiers for size, color and shape

    (categorization of position not required by Token

    Test, as explained above). These have to be

    tuned so that they correspond with the tester’s

    categories.

    In conclusion, the need for the following

    processes has arisen:

    P1.a1) Vision sensor

    P1.a2) Vision stream segmenter

    P1.a3a) Shape estimator

    P1.a3b) Color estimator

    P1.a3c) Size estimator

    P1.a3d) Position estimator

    P1.a4a) Shape category classifier

    P1.a4b) Color category classifier

    P1.a4c) Size category classifier

    12 We can judge success in two ways: either by

    comparing to some known objective external

    reality (i.e. which physical objects we have

    placed where), or by comparing to the tester’s

    subjective representation of reality. In the first

    case we essentially judge fidelity, and in the

    second, alignment between two agents.
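    The chain of P1.a processes amounts to a four-stage composition. The sketch below only fixes the interfaces between the stages; the actual segmentation, estimation, and classification functions are left to the caller, and none of the names come from the Ripley system:

```python
from typing import Callable, Iterable, List, Dict

def vision_to_situation_model(
    frame,                                   # P1.a1: output of the vision sensor
    segment: Callable[[object], Iterable],   # P1.a2: one region per physical object
    estimate_shape: Callable,                # P1.a3a
    estimate_color: Callable,                # P1.a3b
    estimate_size: Callable,                 # P1.a3c
    estimate_position: Callable,             # P1.a3d
    classify_shape: Callable,                # P1.a4a -> {"circle", "square"}
    classify_color: Callable,                # P1.a4b -> {"red", ..., "white"}
    classify_size: Callable,                 # P1.a4c -> {"small", "large"}
) -> List[Dict]:
    """Four-stage non-sign pipeline: sensing, object individuation, property
    estimation, categorical quantization (position stays continuous-only)."""
    objects = []
    for region in segment(frame):
        shape, color = estimate_shape(region), estimate_color(region)
        size, position = estimate_size(region), estimate_position(region)
        objects.append({
            "continuous":  {"shape": shape, "color": color,
                            "size": size, "position": position},
            "categorical": {"shape": classify_shape(shape),
                            "color": classify_color(color),
                            "size": classify_size(size)},
        })
    return objects
```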

    b) The sign stream (speech) (P1b):

    Now, let’s consider the hearing-to-

    utterance-property stream. We will use the

    simplifying assumption that the visual and aural

    streams do not fuse after the second stage of their

    pipelines, but are separate. First of all, we need a

    sensor with adequate coverage and quality in the

    human speech spectrum. Then, we need an

    utterance segmenter stage. In place of a property-

    extractor, we will most probably have some

    feature-extraction for speech recognition.

    Finally, the categorical classifier output should

    consist of the text strings produced by the rest of

    the speech recognizer.

    In summary: the air pressure waves are translated

    into an audio stream by the hearing sensor, and

    the stream is segmented into utterances. Then,

    the utterances are analyzed into features which will fill

    the continuous layer of the “utterance” property

    dimension of the dummy speaker object. Finally,

    the speech classifier is fed with the features and

    produces a text string, which fills the categorical

    layer of the “utterance” property dimension. In

    most real-world implementations, the hearing

    stream segmenter, speech feature estimator, as

    well as speech classifier are all parts of the

    speech recognition software. Once again, the

    overall requirement for the four stages of the

    pipeline is that the incoming speech stream is

    correctly segmented and translated into text.

    In conclusion, the need for the following

    processes has arisen:

    P1.b1) Hearing sensor

    P1.b2) Hearing stream segmenter

    P1.b3) Speech feature estimator
    P1.b4) Speech classifier (language-dependent)

    c) The action selection process (P5)

    Here, the instantaneous “utterance” property

    dimension contents of the dummy speaker object

    have to go through the three stages of translation

    to sequences of motor actions, which were

    described in the third section of this paper. First,

    the utterance contents are broken into actions by

    a motor-action selection process. However, the

    output is still at the Description-Level. Thus, an

    object referent resolution process translates it to

    the Object-Level, and a motor-parameter extractor process finally gets us to the Action-

    Level. The resulting actions are executed in their

    order of production, by the motor execution

    process.

    To recap, we need the following processes:

    P5.1) Action selector process

    P5.2) Object Referent Resolution process

    P5.3) Motor parameter extractor process

    P5.4) Motor executor process

    Summary:

    A summary of the requirements needed so far: In

    terms of representations, a situation model

    structure containing objects described along five

    property dimensions (shape, color, size, position,

    utterance), including a dummy object

    representing the speaker of the utterance. In

    terms of processes, all of the P1.a processes

    required for the visual (non-sign) – to –

    situationmodel stream, all of the P1.b processes

    required for the aural (sign) – to – situationmodel

    stream, and processes P5.1-P5.4 required for the

    situationmodel-to-action stream.

    5. The fifth and final part of the Token Test

    Here, I have attempted to collect the questions of

    the fifth part in groups, according to the

    requirements they impose on the situation model.

    Then, the groups have been ordered into a rough

    and subjective scale of increasing difficulty /

    complexity, which also follows any prerequisite

    relations among the groups. The resulting groups

    are:

    1. Novel Actions and Manners:

    (Q41-44, Q47-48, Q52, Q54, Q58, Q60)

    2. Set Referents: (Q50, Q53, Q55)

    3. Conjunctions/Disjunctions: (Q45-46)

    4. Explicit Temporal sequencing: (Q59, Q61)

    5. Conditionals: (present/future): (Q49, Q51)

    6. Corrections: (Q56, Q57)

    Now let’s consider each group of questions in

    turn, use it as a seed for a wider requirement, and

    see how the existing representations and

    processes must be augmented in order to account

    for it.

    Part 5.1: Novel Actions and Manners:

    Q41.Put [the red circle] on [the green square]

    Q42.Put [the white square] behind [the yellow

    circle]

    Q43.Touch [the blue circle] with [the red square]

    Q44.Touch -with [the blue circle]- [the red

    square]

    Q47.Put [the green square] away from [the

    yellow square]

    Q48.Put [the white circle] in front of [the blue

    square]

    Q52.Put [the green square] beside [the red circle]

    Q54.Put [the red circle] between [the yellow

    square] and [the green square]

    Q58.Together with [the yellow circle], take [the

    blue circle]

    Q60.Put [the blue circle] underneath [the white

    square]

    [the general forms]

    Two new verbs are introduced (“put” and

    “take”). These, together with “touch” that we had

    encountered before, are now also followed by

    various prepositional phrases (starting with “on”, “behind”, “with”, “away from”, etc.). By

    using the above set of questions as a seed, we

    will assume that the general set that they belong

    consists of the following surface forms:

    Put <dd1> <prep> <dd2>
    Put <dd1> underneath <dd2>
    Put <dd1> between <dd2> and <dd3>
    Touch <dd1> with <dd2>
    Touch -with <dd2>- <dd1>
    Together with <dd1>, take <dd2>
    Where:
    <ddN> = <definite description>
    <definite description> = the <size> <color> <shape>
    (possibly with <size> and/or <color> missing)
    <prep> belongs to {on, behind, away from, in front of, beside}

    [necessary augmentations]

    First of all, we will introduce two new motor

    actions. The action list already contained

    “touch”, and will now be augmented by “get”

    and “put”. We will assume that these motor

    actions only take one argument, namely a target

    position13. The verbs appearing at the utterance

    level do not necessarily correspond to a single

    action. For example, the surface form “put the

    circle beside the square” will translate to:

    <Get, “the circle”>, <Put, Beside(“the square”)>

    Note here that the general form of Q60 creates

    special problems, and will be dealt with

    separately. Now, let us define rules for

    translation of the surface forms of the utterance

    to the Description-level, the Object-Level and

    the Action-Level consecutively. Starting with the

    translation to the Description-level, process P5.1

    (action selector) should realize rules for the

    13 More advanced object grasping motor routines

    might necessitate more arguments, for example

    the size or shape of the object to be grasped.

    following mappings:
    Put <dd1> <prep> <dd2> →
    <Get, dd1>, <Put, prep(dd2)>
    Put <dd1> between <dd2> and <dd3> →
    <Get, dd1>, <Put, between(dd2, dd3)>
    Touch <dd1> with <dd2> →
    <Get, dd2>, <Touch, dd1>
    Touch -with <dd2>- <dd1> →
    <Get, dd2>, <Touch, dd1>
    Together with <dd1>, take <dd2> →
    <Get, dd1>, <Get, dd2>14
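    Read as a rule table, the description-level mappings above (as reconstructed here) would amount to something like the following sketch, in which dd1, dd2, dd3 stand for the not-yet-resolved definite descriptions:

```python
def part51_rules(verb, prep, dd1, dd2, dd3=None):
    """Return the ordered description-level statements for one utterance.
    A simplified sketch of process P5.1 as a fixed rule table."""
    if verb == "put" and prep == "between":
        return [("Get", dd1), ("Put", ("between", dd2, dd3))]
    if verb == "put":                 # on, behind, away from, in front of, beside
        return [("Get", dd1), ("Put", (prep, dd2))]
    if verb == "touch":               # "touch dd1 with dd2" (either word order)
        return [("Get", dd2), ("Touch", dd1)]
    if verb == "take":                # "together with dd1, take dd2"
        return [("Get", dd1), ("Get", dd2)]
    raise ValueError(f"surface form not covered: {verb}")

print(part51_rules("put", "beside", "the green square", "the red circle"))
# [('Get', 'the green square'), ('Put', ('beside', 'the red circle'))]
```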

    Continuing to the Object-level, the referent

    resolution process (P5.2) remains as is. Finally,

    in order to reach the Action-level, the motor

    parameter extractor process (P5.3) will have to include a few trivial rules, replacing each resolved object ID by the relevant continuous-layer properties (referent resolution itself being already part of P5.2), together with the “manner” functions defined below.

14 Here we assume that the robot’s hand has the ability to hold two or more objects at the same time, or else that the robot has two hands and the second execution of the “get” motor routine will divert to the second hand when the first is full. A deep treatment would necessitate the introduction of a self-object in the situation model. The self-object should have some form of a “gripper” property dimension, holding the number of the objects that are currently held. As the objects that are being held have to be referable too, so that they can serve in subsequent “put” statements, some way of attaching objects to the gripper is required. A fuller treatment would require the introduction of relations among objects, with a special “possession” relation linking the self-object with the objects held.

    [the “manner” functions]

Finally, we have to define the “manner” functions, realizing the prepositions. Their output should be a target position, and their input might consist of any subset of the properties of the landmark object. For example, let us consider . The target position of “put” can be defined as:

    (pos_x, pos_y, pos_z) = On(obj #n) where:

    pos_x = pos_x(obj #n)

    pos_y = pos_y(obj #n)

    pos_z = pos_z(obj #n) + size_z(obj #n) / 2

Let us also consider “infrontof”. Assuming a y-axis moving away from the tester, we might define “infrontof” as:

    (pos_x, pos_y, pos_z) = infrontof(obj #n) where:

    pos_x = pos_x(obj #n)

    pos_y = pos_y(obj #n) - size_y(obj #n) + offset

    pos_z = pos_z(obj #n)

These definitions are only a very rough first approximation, though. More complex functions for spatial prepositions can be defined, taking as arguments not only the landmark object but also all of the rest of the objects (the situational context). However, we will not get into such detail here. Some relevant ideas can be found in [Regier and Carlson 2001, Gorniak and Roy 2004]. All of the other prepositional “manners” are defined along similar lines (except for “between”, which accepts two landmark objects as arguments). Their rules are realized by appropriately augmenting process P5.3.
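To make the rough geometric definitions above concrete, here is a minimal Python sketch of the On and infrontof manner functions. The object fields (pos, size) are hypothetical names, and the midpoint used for “between” is a plausible choice that the text does not specify:

```python
# Sketch of the "manner" functions realizing spatial prepositions, following the
# rough first-approximation definitions in the text. Field names (pos, size) are
# illustrative; "between" uses the midpoint of the two landmarks as an assumption.

from dataclasses import dataclass


@dataclass
class Obj:
    pos: tuple    # (x, y, z) position of the object
    size: tuple   # (x, y, z) extents of the object


def on(landmark: Obj) -> tuple:
    """Target position for putting something on top of the landmark."""
    x, y, z = landmark.pos
    return (x, y, z + landmark.size[2] / 2)


def infrontof(landmark: Obj, offset: float = 0.05) -> tuple:
    """Target position in front of the landmark (y-axis pointing away from the tester)."""
    x, y, z = landmark.pos
    return (x, y - landmark.size[1] + offset, z)


def between(landmark1: Obj, landmark2: Obj) -> tuple:
    """Target position between two landmarks: here simply the midpoint of their positions."""
    return tuple((a + b) / 2 for a, b in zip(landmark1.pos, landmark2.pos))


if __name__ == "__main__":
    square = Obj(pos=(0.2, 0.3, 0.02), size=(0.05, 0.05, 0.04))
    circle = Obj(pos=(0.4, 0.3, 0.02), size=(0.05, 0.05, 0.04))
    print("on:", on(square))
    print("infrontof:", infrontof(square))
    print("between:", between(square, circle))
```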

    [the problematic case of “underneath” -

    conclusion]

    Thus, we have successfully augmented the action

    list R5 and the processes P5.1-P5.3, in order to

    cope with the requirements set by all the

questions considered, except Q60. The problematic general form of Q60 is considered in detail in Appendix B, in order not to interrupt the flow of the text.

    Part 5.2: Set Referents: (Q50, Q53, Q55)

    Q50.Pick up [the squares, except the yellow one]

Q53.Touch [the squares] slowly and [the circles] quickly

    Q55.Except for [the green one], touch [the

    circles]

    [general form]

    The general surface form of the statements is:

    (at a first level of generality)

    Touch slowly and quickly

Where: dd1 might return not only a single referent but also a set of referents, and where the description of the set might include a plural plus an exception. In the case of Q55, the description might be broken down into two parts (exception and plural), with the verb positioned between them.

    [necessary augmentations]

    Here, the object referent resolution process

    (P5.2) has to be augmented to account for the

    possibility of set referents. Also, two new

    manners have to be introduced for the motor

    action touch – namely the manners slowly and

    quickly. Furthermore, in terms of surface form

    difficulty, one more problem arises: the set

    referent descriptions might be broken into two

    disconnected parts, separated by the action verb.

    Let us see how we will deal with these

    complications. First, we introduce the following

    utterance-to-description level rules:

    (where action = get for “pickup” or “take”,

    action = touch for “touch”)

Touch →

    (where manner = slowly or quickly)

    (where =

    and where action = get for “pickup” or “take”,

    action = touch for “touch”)15

    Thus, we have successfully reached the

15 We do not attempt to extend plural referents to “put down” here. We have chosen not to do so because of the complications that might arise when the second object set consists of more than one object, as for example in: “put the circles on the squares”.

    Description-level. The augmented referent

    resolution process (P5.2) will then be applied on

    the Description-level representations, and we

    will get results of the form:

    We see that the single argument of the action

    now consists of multiple objects, instead of one.

    However, this is not allowable given the action

    forms that we have introduced in the action list

    (R5). We thus have to introduce one further

    process, which is applied after the object referent

    resolution process (P5.2) but before the motor

    parameter extractor process (P5.3). This process

    will be called “repeated action expansion

    process” (P5.2a), and will realize the following

    rule:

    ,

    ,

    For example:

    will be expanded to:

    ,

    ,

    .
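A minimal sketch of the repeated action expansion process (P5.2a) is shown below, under the illustrative assumption that resolved actions are simple (verb, objects) pairs:

```python
# Sketch of the repeated action expansion process (P5.2a): an action whose single
# argument resolved to a set of objects is expanded into one action per object.
# The (verb, objects) tuple representation is an illustrative assumption.

def expand_repeated_actions(actions):
    """Expand actions over set referents into sequences of single-object actions."""
    expanded = []
    for verb, objects in actions:
        for obj in objects:          # one copy of the action per resolved referent
            expanded.append((verb, obj))
    return expanded


if __name__ == "__main__":
    # e.g. "Pick up the squares, except the yellow one" -> get over objects #2 and #5
    resolved = [("get", ["obj#2", "obj#5"])]
    print(expand_repeated_actions(resolved))
    # -> [('get', 'obj#2'), ('get', 'obj#5')]
```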

    Finally, after the application of P5.2a, P5.3 is

    applied. P5.3 has to be augmented with new

    rules that account for the new manners

    associated with “touch”. Touch, as represented in

    the action list (R5), accepts only a single

    argument so far: the target position to be

    touched. “Slowly” as a manner does not modify

    the position argument; instead, it necessitates

    either a new motor routine for touch_slowly, or

    the addition of an extra argument to the touch

action, namely a speed argument. Here we will adopt the latter method. Thus, the action list entry

    for touch (R5) will be augmented with a second

    argument, namely speed, which will have three

    discrete values (slow, normal, quick), and which

    will default to normal speed when no argument

    is given through the manner. I.e., the following

    rules will be implemented in the Motor

    Parameter Extractor Process (P5.3):

Touch → Touch(pos=pos(obj #n), speed=normal)
Touch slowly → Touch(pos=pos(obj #n), speed=slow)
Touch quickly → Touch(pos=pos(obj #n), speed=quick)
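A minimal sketch of these P5.3 rules, assuming a simple dictionary representation for Action-level entries (the field names are hypothetical):

```python
# Sketch of the motor parameter extractor rules (P5.3) for touch with an optional
# speed manner. The dictionary representation of Action-level entries is an
# illustrative assumption; the speed defaults to "normal" when no manner is given.

def extract_touch_parameters(target_position, manner=None):
    """Map a Description-level touch (plus optional manner) to an Action-level entry."""
    speed = {"slowly": "slow", "quickly": "quick"}.get(manner, "normal")
    return {"action": "touch", "pos": target_position, "speed": speed}


if __name__ == "__main__":
    print(extract_touch_parameters((0.2, 0.3, 0.02)))               # default speed
    print(extract_touch_parameters((0.2, 0.3, 0.02), "slowly"))     # speed = slow
    print(extract_touch_parameters((0.4, 0.3, 0.02), "quickly"))    # speed = quick
```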

    Thus, we have finally successfully reached the

    Action-level. Only the motor executor process

    requires augmentation now; indeed, motor

    routines for ,

    have to be created, and the routine for touch

    should be augmented with a speed argument:

    .16

    Part 5.3: Conjunctions/Disjunctions: (Q45-46)

    Q45.Touch [the blue circle] and [the red square]

    Q46.Pick up [the blue circle] or [the red square]

    [the general forms]

    and

    or

    Where:

    belongs to {touch, pickup, take}

    and are definite descriptions of the

    form: , possibly with

    missing and / or

    [necessary augmentations]

    In our previous discussion on parts 1-4 of the

test, it was assumed that the action selector process

    (P5.1) was able to deal with the utterance form:

    and , and that it

    realized a rule of the form:

and →

    ,

    It is trivial to modify this rule superficially to

    account for disjunctions. For example, we might

    decide to just perform the action on the first

    referent, and never on the second:

or →

16 As an extension, if we wanted to be able to use the manners “quickly” and “slowly” not only

    with touch but also with get and put, then we

    would need a further augmentation in P5.3, to

    account for the possibility of multiple manners.

    For example: “put the red square quickly on the

    yellow circle” (which includes the manners

    and ). This would further

    necessitate compatibility checks among manners,

    or more complicated compositional rules.

    Consider for example: “put the red square

    underneath the green circle on the green circle”

    which would be illegal.

    This is the augmentation to P5.1 that we will

    propose here. Notice, however, that such an

    interpretation would oppose the Gricean maxim

    of quantity [Grice 1975]: why ever mention the

    second referent if it is never going to be useful?

    A deeper interpretation of disjunctions would

    require the introduction of concepts such as

    “feasibility” of action execution, or

    “preferability” of one action over another given

    some execution cost or ordering. Thus, a deeper

treatment of disjunctions would give them the

    following procedural meaning:

    or means:

    Out of the two actions ( ,

) select the one that is preferable, or simply the one that is feasible if the other is not.
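A sketch of such a deeper disjunction rule follows, with hypothetical feasibility and cost predicates; the trivial interpretation adopted in this paper corresponds to always returning the first action:

```python
# Sketch of a deeper treatment of disjunctions: out of the two candidate actions,
# select the preferable one, or simply the feasible one if the other is not feasible.
# The feasible() and cost() predicates are hypothetical placeholders.

def select_disjunct(action_a, action_b, feasible, cost):
    """Return the action to execute for '<action_a> or <action_b>'."""
    a_ok, b_ok = feasible(action_a), feasible(action_b)
    if a_ok and b_ok:
        return action_a if cost(action_a) <= cost(action_b) else action_b
    if a_ok:
        return action_a
    if b_ok:
        return action_b
    return None  # neither action is feasible


if __name__ == "__main__":
    # Toy example: feasibility depends on whether the referent is currently on the table
    on_table = {"blue circle"}
    feasible = lambda act: act[1] in on_table
    cost = lambda act: 1.0
    print(select_disjunct(("get", "blue circle"), ("get", "red square"), feasible, cost))
```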

    Regarding feasibility of actions, it is worth

    clarifying one further point. In our proposal so

    far, we have silently assumed that we expect the

    tester to provide us with singular definite

    descriptions which resolve to a single existing

    referent17. Requested actions are silently

    assumed to be always feasible. The case of an

    unresolvable or an ambiguous definite

    description has not been explicitly dealt with.

    How do we augment our system to deal with

    such cases? A simple but effective proposal

    follows:18

    a) Unresolvable definite description:

Motor action is not executed. Instead, the robot produces canned speech, informing the tester that “no such object exists”. Note that this lies outside the spirit of our proposal so far; none of the token test questions demand speech production from the child.

    b) Ambiguous definite description:

Motor action is not executed. Instead, the robot produces canned speech, requesting an inform

    speech act from the user: “Do you mean this

    one?” (touching one possible referent) “That

    one?” (touching another) “Or this one?”

    (touching the third). Again, this is just a simple

    canned behavior. A deeper treatment would

    require a more complicated tester model.

Treating the tester as a physical object with utterance properties, as we do here, is not enough. Also, a rational action selection mechanism covering not only motor actions but also speech actions would be required.

17 Plural definite descriptions that resolve to set referents were discussed above in Part 5.2.

18 Implemented in [Mavridis and Roy 2005].
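A sketch of this simple proposal, assuming a referent resolution routine that has already returned the set of matching objects (all routine names are illustrative placeholders):

```python
# Sketch of the proposed handling of unresolvable and ambiguous definite descriptions.
# execute_action(), say() and touch() stand in for the real motor and canned-speech
# routines; the list of matching objects is assumed to come from referent resolution.

def respond_to_definite_description(matches, execute_action, say, touch):
    """Dispatch on the number of objects matching a definite description."""
    if len(matches) == 0:
        say("No such object exists.")           # a) unresolvable description
    elif len(matches) == 1:
        execute_action(matches[0])              # normal case: exactly one referent
    else:
        for i, obj in enumerate(matches):       # b) ambiguous description
            say("Do you mean this one?" if i == 0 else "Or this one?")
            touch(obj)


if __name__ == "__main__":
    log = []
    respond_to_definite_description(
        matches=["obj#3", "obj#7"],
        execute_action=lambda o: log.append(("get", o)),
        say=lambda s: log.append(("say", s)),
        touch=lambda o: log.append(("touch", o)),
    )
    print(log)
```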

    In conclusion, we have chosen here to follow the

    trivial interpretation of disjunctions, arbitrarily

    selecting only the first action for execution.

    However, we have also discussed what would be

    necessary for a deeper treatment.

Part 5.4: Explicit Temporal Sequencing: (Q59, Q61)

    Q59.After picking up [the green square], touch

    [the white circle]

    Q61.Before touching [the yellow circle], pick up

    [the red square]

    [general form]

    After -ing ,

    Before -ing ,

    where:

    , belong to

    {touch, pick up, take}

    and are definite descriptions of the

    form: , possibly with

    missing and / or

    [necessary augmentations]

    Again, a slightly superficial but nevertheless

    adequate implementation exists. We just have to

    augment the process P5.1 (action selector

    process), which transforms the utterances to the

    Description-level, with the following rules:

After -ing , →

    ,

Before -ing , →

    ,

A deeper interpretation would remove

    immediacy of the request, and would involve

    planning. However, we feel such an

    interpretation is not what is really required by the

    token test.
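A minimal sketch of these two P5.1 rules, treating the utterance as already split into its two clauses (the tuple representation of actions is illustrative):

```python
# Sketch of the P5.1 rules for explicit temporal sequencing: "after" keeps the
# surface order of the two requested actions, "before" swaps them. Actions are
# represented as simple (verb, referent) tuples for illustration.

def sequence_actions(connective, mentioned_first, mentioned_second):
    """Return the two actions in the order in which they should be executed."""
    if connective == "after":      # "After doing A, do B"  -> A then B
        return [mentioned_first, mentioned_second]
    if connective == "before":     # "Before doing A, do B" -> B then A
        return [mentioned_second, mentioned_first]
    raise ValueError("unknown temporal connective: %s" % connective)


if __name__ == "__main__":
    # Q59: After picking up the green square, touch the white circle
    print(sequence_actions("after", ("get", "green square"), ("touch", "white circle")))
    # Q61: Before touching the yellow circle, pick up the red square
    print(sequence_actions("before", ("touch", "yellow circle"), ("get", "red square")))
```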

    Part 5.5: Conditionals: (present/future)

    Q49.If [there is [a black circle]], pick up [the red

    square]

    Q51.When [I touch [the green circle]], you take

    [the white square]

    Here we have a huge leap of complexity

    compared to the previous parts. First of all,

    conditions are introduced, which are intricately

    bound to the notion of events. Then, an indefinite

    description appears: “a black circle”. Last, but

    most importantly, the loaded personal pronouns

    “I” and “you” appear. We will now have to

    decide on an appropriate generality and depth of

    treatment.

    [general form]

    if , {}

    (includes present, but not future)

    when , {you}

    {}

    (normally doesn’t include present, only future)

    Where:

    belongs to the form:

    there is

    belongs to the form:

    I 19

    And where:

    is definite description

    is indefinite description etc.

    [suggested augmentations]

    An introduction of a notion of events, although

    not absolutely necessary at this stage, would

    provide the right amount of generality and

    expandability. Thus, we have to start devising

    representations in R3. Notice that although

    conditional tests for existence (“there is a circle”)

    do not necessarily involve the past, action

    conditions do: “when I put the circle on the

square” cannot be tested through the experience of a single instant alone. Although “when I put the circle on the square” refers to the time instant when an action performed by the tester has been completed, testing whether the action has been completed requires some account of the whole duration of

    the action. Thus, we need a history of temporal

    “snapshots” justifying that indeed the tester has

    performed the put action. Just seeing the circle

    on the square, and maybe the user’s hand

    touching the circle, is not enough justification

that the condition holds. Thus, the introduction of events and some form of representations of the past (R3) becomes inevitable.

19 We won’t extend to manners etc., although we could easily do so. We will also not extend to observing the agent’s own actions; only actions performed by the tester can initially be observed.

    Let us now decide on an appropriate form of

    event representation. I propose the form:20

    21

    Where:

    belongs to {exists, touch, get, put}

    If =exists,

    then participant 1 is obj #n that exists.

    If =touch, get, put,

    then participant 1 is the object corresponding to

    the tester’s body, while participant 2 is the obj #n

    that is being manipulated.

    , : time landmarks in

some referable subjective timescale (ticks etc.)

    Exists events are continuously updated so that

    the end time is equal to the current time, as long

    as the participant still exists.

    Manipulation events are continuously updated so

    that the end time is equal to the current time, as

    long as the action has not finished yet.

    flag becomes true when the

    event has finished, and the end time is less than

    the current time
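A sketch of this event representation as a small data structure follows; the field names are illustrative, and the time landmarks are taken to be integer ticks:

```python
# Sketch of the proposed event representation: type, participants, start/end time
# landmarks on a subjective timescale, and a "finished" flag. Field names are
# illustrative assumptions.

from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    event_type: str          # one of: "exists", "touch", "get", "put"
    participants: List[str]  # participant 1 (and participant 2 for manipulation events)
    start_time: int          # tick at which the event started
    end_time: int            # tick of the latest observation of the event
    finished: bool = False   # set when the event has been completed


def update_ongoing(event: Event, current_time: int, still_holds: bool) -> None:
    """Keep an ongoing event's end time equal to the current time; close it otherwise."""
    if still_holds:
        event.end_time = current_time
    else:
        event.finished = True    # the end time now lags behind the current time


if __name__ == "__main__":
    e = Event("exists", ["obj#4"], start_time=0, end_time=0)
    for t in range(1, 4):
        update_ongoing(e, t, still_holds=True)
    update_ongoing(e, 4, still_holds=False)
    print(e)
```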

    Actions refer to sequences of GSM contents, and

    connect to the senses only indirectly through the

    GSM. Of course, the sensory streams should

    provide enough information to the GSM so that

    the action can be logged. In particular, for the

    second type of condition, i.e. “I

    ”, we need to be able to observe the actions

    of the “tester” object, which so far was only a

    dummy object associated with incoming

    utterances. Thus, we first need to be able to

    localize the tester in space, and then to derive a

    rich enough observation of his body from the

    senses so that we can successfully recognize

    when he performs the action verbs: get, put,

    touch. Also, we need to be able to recognize

    which physical objects are being implicated in

    the above actions – but hopefully our vision

    system would already provide enough clues for

    that.

20 Explicit representation of agents in a different form than physical objects has not been carried out so far. Thus, by “object” here we are referring to either the robot, the tester, or the physical objects.

21 For simplicity, we prefer the neutrality of the term “participants” instead of agent/patient roles.

    Thus, we need the following augmentations in

    the vision stream:

    P1.a1) Vision sensor: should be able to get the

    test-giver within the field of view, and get

sensory evidence of him that is of adequate quality for the subsequent action recognition

    P1.a2) Vision stream segmenter: should be able

    to segment the test-giver’s body

    P1.a3e) Action feature extractor: enough features

    should be extracted so that the performed actions

    can later be recognized by the event classifier

    through the resulting timeseries.

    There is no explicit need for quantized action

    features, as the language specification does not

    demand verbalization of instantaneous features

    of actions. Thus, no augmentation to P1.a4 is

    needed. With the augmentations made so far, we

    have a time series of rich features enabling

    action recognition in the GSM. Now, we need to

    augment suitably the (previously empty)

    components R3 and P3.

    In detail:

    R3) Past states/situations (includes events)

    The current GSM is augmented with a log of

    events, including events in progress as well as

    past events. No explicit detailed history of the

    past needs to be kept, except for this log which is

    embedded in the current GSM.22

P3) Storage/Maintenance of Past process

    (includes event recognition)

A set of four event classifier processes is needed, one specialized for each of the four event types (exists, get, put, touch). These

    processes examine the current state of the GSM,

    keep track of whatever information they need

    from the past internally, and produce ongoing

    event descriptions in the event log which is part

    of R3. When an event is finished, the flag is set true, and the event is

    updated in the log for one last time. For

    simplicity, we assume that no memory

    compression/forgetting process operates on the

log, although in systems that operate round-the-clock such a process would be necessary.
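The following sketch shows one of the four classifier processes, the one for “exists” events, maintaining the event log from successive GSM snapshots. The GSM is reduced here to a set of visible object identifiers, and all names are illustrative:

```python
# Sketch of an "exists" event classifier (part of P3): it watches the set of objects
# present in the current GSM state and maintains the corresponding events in the log.
# The GSM is reduced to a set of object ids for illustration.

class ExistsEventClassifier:
    def __init__(self, log):
        self.log = log    # shared event log (part of R3)
        self.open = {}    # object id -> index of its ongoing "exists" event

    def step(self, visible_objects, current_time):
        # open a new event for each newly visible object
        for obj in visible_objects:
            if obj not in self.open:
                self.open[obj] = len(self.log)
                self.log.append({"type": "exists", "participants": [obj],
                                 "start": current_time, "end": current_time,
                                 "finished": False})
        # update or close events for previously visible objects
        for obj, idx in list(self.open.items()):
            if obj in visible_objects:
                self.log[idx]["end"] = current_time
            else:
                self.log[idx]["finished"] = True
                del self.open[obj]


if __name__ == "__main__":
    log = []
    classifier = ExistsEventClassifier(log)
    classifier.step({"obj#1", "obj#2"}, 0)
    classifier.step({"obj#1"}, 1)    # obj#2 disappears: its event is closed
    print(log)
```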

    [from events to condition testing]

    22 If we need to answer questions about the state

    of the GSM during specific past moments etc.,

    then we also need some detailed history of past

    moments. However, this is not necessary in the

    token test. It is necessary for example in the

    specs of [Mavridis and Roy 2005].

So far, we have augmented the sensory stream

    (P1a) so that the test-giver’s actions can be

    observed, and augmented our representations and

    processes dealing with the past (R3, P3) so that a

    log of ongoing and past events is always

    accessible in the GSM. Now, we need to connect

    the events that are part of the event log with the

    conditions that we have posited in the general

    form of the requirements for this part of the test.

    The first form of the conditions is: “there is

    ”. This will correspond to the existence of

    an event with type “exists”, with a participant

    which satisfies , and with an ongoing end

    time equal to the current time.

    But what augmentations should we make to the

    processes of P5? I propose that process P5.1

(action selection) should be augmented with a

    conditional filter process (P5.1a). The

    conditional filter process will call the object

    referent resolution process (P5.2) in order to get

    a set of possible event participants that satisfy

    the indefinite description . Then, it will

    search in the event log, and conclude whether the

    condition is TRUE or not. The following

    translation rule should be realized in P5.1:

    if , {}

    IF condition is TRUE:

    {} , …

    IF condition is not TRUE:

    (Empty action list)
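A sketch of the conditional filter (P5.1a) test for this first condition form follows, assuming the event-log entries of the earlier sketch and a referent resolution step that has already produced the candidate participants:

```python
# Sketch of the conditional filter process (P5.1a) for conditions of the form
# "there is <id>": the condition is TRUE if some "exists" event in the log has a
# candidate participant and an ongoing end time equal to the current time.
# The event-log dictionary format is an illustrative assumption.

def existence_condition_holds(event_log, candidate_objects, current_time):
    """Return True if any candidate object currently exists according to the log."""
    for event in event_log:
        if (event["type"] == "exists"
                and event["participants"][0] in candidate_objects
                and event["end"] == current_time):
            return True
    return False


def conditional_filter(event_log, candidates, current_time, then_actions):
    """Realize: 'if <condition>, <actions>' -> actions if TRUE, empty list otherwise."""
    if existence_condition_holds(event_log, candidates, current_time):
        return then_actions
    return []


if __name__ == "__main__":
    log = [{"type": "exists", "participants": ["obj#9"], "start": 0, "end": 5,
            "finished": False}]
    # Q49-style: If there is a black circle, pick up the red square
    print(conditional_filter(log, candidates={"obj#9"}, current_time=5,
                             then_actions=[("get", "red square")]))
    print(conditional_filter(log, candidates={"obj#4"}, current_time=5,
                             then_actions=[("get", "red square")]))
```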

    The second form of the conditions is: “I ”. This will correspond to the

    existence of an event with type analogous to the

    action verb (get, put, touch), with the object

    corresponding to the test-giver as the first

    participant, with an object which satisfies

    as a second participant, and with an ongoing end

    time equal to the current time minus one tick (i.e.

    the flag should have just been

    raised). Note that this second form of the

    conditions is paired with the following utterance

    form:

    when , {you}

    {}

    An interesting aspect of this request speech act is

    that it cannot be instantaneously fulfilled, but

    often needs to wait for to hold in

the near future before the agent can fulfill the

    request. Thus, we need some way of getting the

    agent under test to repeatedly hear the request,

    until favorable conditions arise for its

    satisfaction, or until it has to be forgotten. It

    might have to be forgotten after a time-out

    period, or after one or more new requests have

    arisen.

    Here, we will propose the following way of

    handling these demands: we will assume that

    such an utterance is persistently hallucinated23 in

    the agent’s head until its conditions are satisfied

    or until a new utterance has arrived. Thus, what

    we need is a “hallucinatory” circuit in the senses.

    An internal action will be produced by the action

    selection process, which will “echo” the master’s

    voice repeatedly in the agent’s head, until the

    master has been served or he has demanded

    something new. The action list R5 will be

    augmented with an action of the form:

    Hallucinate_utterance

    The source object will be initially fixed to the

    object representing the test-giver.

    The effect of this internal action will be the

    writing of the utterance in the appropriate

    utterance property representation in the GSM, as

    if it had been freshly heard. In case new sensory

    information has come in, it will automatically

    overwrite the hallucinated utterance – i.e. real

    voices suppress persistent hallucinated voices.

    Furthermore, the translation rule that we need in

    P5.1 is:

    when , {you}

{} →

    IF condition is TRUE:

    {} , …

    IF condition is not TRUE:

    Hallucinate_utterance
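A sketch of this hallucination circuit is given below, under the simplifying assumption that a freshly heard utterance always overwrites the hallucinated one; the class and method names are illustrative:

```python
# Sketch of the "hallucinatory utterance" circuit: when a "when ..." request cannot
# yet be satisfied, the utterance is written back into the test-giver's utterance
# property as if freshly heard, so that it is re-interpreted on the next cycle.
# Real (sensory-derived) utterances overwrite the hallucinated one.

class UtteranceChannel:
    def __init__(self):
        self.current = None    # utterance attributed to the test-giver object

    def hear(self, utterance):
        """A real voice always suppresses a persistent hallucinated voice."""
        self.current = utterance

    def hallucinate(self, utterance):
        """Re-echo the request only if no new sensory utterance has come in."""
        if self.current is None:
            self.current = utterance

    def consume(self):
        """Read and clear the utterance for one interpretation cycle."""
        utterance, self.current = self.current, None
        return utterance


if __name__ == "__main__":
    channel = UtteranceChannel()
    channel.hear("when I touch the green circle, you take the white square")
    request = channel.consume()
    condition_true = False          # the tester has not yet touched the green circle
    if not condition_true:
        channel.hallucinate(request)    # keep the request alive for the next cycle
    print(channel.consume())
```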

    In this way, we are able to handle conditional

    action execution in the future, through repeated

    hallucination of the request of the test-giver. The

    hallucination scheme works well even when we

abandon the simplifying assumption that the

    hallucination stops being repeated when a new

    request comes in. But then, we need to devise

    some form of buffer of simultaneous “directives”

    that are being heard at each instant together with

    the sensory-derived utterances, and which guide

the agent’s behavior. We would then also need to devise ways to deal with incompatible requests etc., and we would slowly end up getting to all the known forms of internal conflict. Nevertheless, this simple idea of hallucinated requests can really achieve a lot. We can have agents that are verbally taught by their masters to respond appropriately to appropriate circumstances, we can have incoming sensory signs translated to hallucinated utterances (for example, in reading written texts), and much more.

23 This idea is reminiscent of the general theme of [Jaynes 1976].

    [conclusion]

    We have shown how present/future conditionals

    can be handled. This was achieved through the

    introduction of an event representation and an

    event log, through augmentation of the sensory

    machinery that enables the observation and

    recognition of the actions of the test-giver, and

    through a “hallucinatory utterance” circuit that

    persistently repeats a conditional action

    execution request until it is carried out or it is

    forgotten. More importantly, this primitive

    hallucinatory circuit provides an interesting

    pathway for expansion in the future that enables

    the implementation of many capabilities such as

    teaching, reading etc. Such capabilities are

    invaluable towards the creation of autonomous

    cooperative agents.

Part 5.6: Corrections: (Q56, Q57)

    Q56.Pick up [the red circle] -No!- [the white

    square]

Q57.Instead of [the white square], take [the yellow circle] (inverted syntax)

    [General form]

    -No!-

    Instead of ,

    Where:

    belongs to {touch, pick up, take}

    And , are definite descriptions, as

    defined previously.

    [necessary augmentations]

Once again, as was the case in Part 5.3 (conjunctions/disjunctions), there is a deeper phenomenon

    going on here. However, we will treat it slightly

    superficially, albeit quite effectively for the

    purpose of the test. A deep treatment of

    corrections requires an understanding of

    utterance production in the tester (covering the –

    No!- interruption), or a longer-term

    understanding of discourse (covering the contrast

    marked by “Instead”). However, this deeper

    understanding is not essential for responding

    appropriately to these utterances.

    We propose the following augmentations to

    process P5.1 (action selector), performing the

    utterance-to-description level translation:

-No!- →

Instead of , →

    I.e. we totally ignore the mistaken / contrasted

    part of the utterance, and proceed with executing

    the requested action with the correct argument.
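A minimal sketch of these two correction rules, treating the utterance as already split around the “-No!-” / “Instead of” markers (the names and tuple representation are illustrative):

```python
# Sketch of the P5.1 rules for corrections: the mistaken / contrasted referent is
# discarded and the action is executed with the corrected argument only.
# Utterances are assumed to be pre-split around the correction markers.

def correct_no(verb, wrong_referent, corrected_referent):
    """'<verb> <wrong> -No!- <corrected>' -> act on the corrected referent only."""
    return (verb, corrected_referent)            # wrong_referent is deliberately ignored


def correct_instead(ignored_referent, verb, chosen_referent):
    """'Instead of <ignored>, <verb> <chosen>' -> act on the chosen referent only."""
    return (verb, chosen_referent)               # ignored_referent is deliberately ignored


if __name__ == "__main__":
    # Q56: Pick up the red circle -No!- the white square
    print(correct_no("get", "red circle", "white square"))
    # Q57: Instead of the white square, take the yellow circle
    print(correct_instead("white square", "get", "yellow circle"))
```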

    6. Local minimality of the proposed design

    Here, a list of the main parts and blocks of the

    proposed design is given. We consider each

    block in turn, and observe the effects of its

    removal from the design, and show which part of

    the design specification fails to hold in each case.

We finally conclude that through this “Jenga”

    procedure, we can claim local minimality of the

    proposed design in the universe of designs

    proposed by the general framework we have

    adopted. Furthermore, we have already shown

    that the design provides easy and systematic

hooks for further expandability. Let us now

    proceed with the test:

    Representations contained in the agent model:

    R1) Situation model: represents external reality

    (not only physical necessarily)

    If we remove any of the proposed property

    dimensions, or the speaker-object, the test fails.

    R2) Internal state (goals, affect)

    Not used.

    R3) Past states/situations (includes events)

    If we remove the event representation or event

    types, then Q49/Q51 will fail

    R4) Future predictions

    Not used. Useful for planning etc. (level-3/4)

    R5) Action list

    If we remove any of the action entries that we

    made, specific questions of the test will fail.

    Processes contained in the agent model:

    P1) Sensor-to-SituationModel process

    If we remove any of the processes of P1a or P1b,

    different parts of the test will fail. For example,

    if we remove the size estimation process,

    referents involving size will not be resolved

    correctly etc. If we remove action recognition of

    the test-giver, Q51 will fail etc.

P2) SituationModel-to-InternalState process

    Not used.

P3) Storage/Maintenance of Past process

    (includes event recognition)

    If the log creation / event recognition processes

    did not exist, we would not be able to correctly

    respond to Q51 for example.

    P4) Future Prediction Generation process

    Not used.

    P5) Action Selection Process

    (SituationModelState-to-action table or planner)

    Any removal of P5.1-P5.4 or their parts will

    inhibit successful responses to several questions.

    Thus, without getting into a detailed

examination, it should be clear that any removal of a proposed representation or process, or of their parts, will make the system incapable of passing the whole token test. Thus, by the proposed “Jenga” criterion, the design is locally minimal in the space created by the proposed framework of designs.

    7. Conclusion

    In this paper, I presented a design of a minimal

    grounded situation model which will enable a

    conversational robot to pass the “Token Test”. I

    started by considering what it might mean for a

    GSM to be minimal, and proposed an

    incremental design methodology. Then, I

    considered how human behavioral tests should

    be translated to specifications for design. Later, I

    derived the proposed design step-by-step by

    considering the questions of the “Token Test”.

    Finally, I pro