PhD General Examinations:
Publishable Quality Paper for Main Area (Prof. Deb Roy)
Nikolaos Mavridis, MIT Media Lab, June 2005
I think the most productive topic for a paper will be to design the MINIMAL mental
model architecture that will support sensory-motor grounded interpretation of all spoken
commands in the final part of the Token test. The paper should present this design, justify
each element of the design, and relate the architecture as best as you can to the readings
in your reading list.
Deb
Dear Deb,
I followed the decomposition of the paper into three parts that I had proposed to you by
email and to which you agreed. The key themes of this paper are:
1) THE MINIMALITY CRITERION
What could a meaningful "minimality" criterion be for a mental model or other systems
of comparable complexity and scope? What are viable approximations for such criteria,
or alternatively stepwise design methodologies that can guarantee some notion of strict or
relaxed "minimality"?
2) THE FUNCTIONAL SPECIFICATION
What does it really mean to "pass the Token Test"? That is, at what level of generality
would we expect the system to perform? (What family of abilities exemplified by the
Token Test are we targeting?)
3) STEPPING THROUGH THE DESIGN PROCESS OF THE PROPOSED SOLUTION
Based on specific proposals for 1) and 2), what is the suggested design presented in a
stepwise/incremental manner? (Here, a lattice of "minimal" designs of increasing
abilities/complexity might need to be presented, and the corresponding design space
searched to find the minimal satisfactory solution)
Nicholas
-
A near-minimal grounded situation model design
for a conversational robot that can pass the “Token Test”
Nikolaos Mavridis, MIT Media Lab, June 2005
Abstract
The purpose of this paper is to present a design
of a minimal grounded situation model (GSM)
[Mavridis and Roy 2005, Mavridis 2005] which
will enable a conversational robot to pass the
“Token Test”. The “Token Test” is a standard
test designed to assess early situated language
skills of children. This specific design is given as
an example of a wider methodology. According
to this methodology, agent model designs are
derived from specifications, which are in turn
derived by human behavioral tests. In this paper,
after a brief introduction to the problem, optimal
design and reasonable approaches towards
practical near-optimality in such complex
systems are discussed. Then, an answer to the
question of what it should really mean for a
robot to pass a human behavioral test is
attempted. In the main section of this paper that
follows, the proposed design is derived step-by-
step. A conclusion section ends this paper. This
paper constitutes a step towards the creation of
conversational robots that can cooperate fluidly
with humans. The paper’s main contributions
include a proposal for an incremental design
methodology for grounded situation models,
guidelines for translating human behavioral tests
into design specifications for robots, and an
example of the use of the above methodology
and guidelines: an easily extensible step-by-step
design for a conversational robot that can pass
the “Token Test”.
0. Introduction
[vision - previous work]
The ultimate purpose of this work is the creation
of conversational robots that can cooperate
fluidly with humans in an intuitive manner using
natural language. Our current work is part of a
larger effort to develop conversational interfaces
for interactive robots [Crangle and Suppes 1994,
McGuire et al. 2002, Sofge et al. 2003]. Robots
that understand and use natural language may
find application in entertainment, assistive, and
educational domains. Towards this ultimate
purpose, a mental model for the conversational
robot Ripley was originally designed, which in
its initial form enabled the robot to achieve
object permanence and viewpoint switching
[Roy, Hsiao and Mavridis 2004]. Later, the
design was extended, and incorporated into the
general theoretical framework of “Grounded
Situation Models” (GSM) having triple-layer
properties [Mavridis and Roy 2005]. Using such
models, Ripley became the first conversational
robot that was able to fluidly mix sensory-
derived information with spoken descriptions in
the creation of the contents of his situation
model, and was also able to resolve temporal
referents and refer to past events. At that stage,
Ripley was able to pass the first two parts of the
“Token Test” [DiSimoni 1978], a standard test
designed to assess early situated language skills
of children. In [Mavridis 2005], a wider
framework for a hierarchy of agent models of
increasing complexity was proposed. The
situation model was positioned as a part of the
wider agent model, which apart from the GSM
also includes other machinery, such as the
processes required for action selection.
[news - steps]
The immediate aim of this paper will be the
design of a GSM that will enable Ripley to pass
the whole of the “Token Test”. The terminology
developed in the relevant previous papers
[Mavridis and Roy 2005, Mavridis 2005] will be
adopted. However, the design proposal here will serve as an example of a wider
design methodology that will be suggested.
Through this wider methodology, a list of
specifications drives an incremental stepwise
design. This incremental method can guarantee
local near-optimality of the design, and also
naturally enables future expandability. The
specifications driving the method are in turn
derived by a human behavioral test.
Two prerequisites are needed for such an
approach. First, a position on what minimality
for grounded situation models might mean. What
is optimality? What are we optimizing here?
What methods or claims might be practically
applicable? Second, we need a position on what
we should really mean when we claim that a
robot has passed a certain behavioral test. Should
the robot be allowed to know the questions
before the test? What level of generality should
we aim for? What was the real purpose of the
-
test? We will thus try to propose positions and
thus fulfill these two prerequisites.
[contributions]
It is our hope that the methodology proposed,
even though it is currently in an initial stage, will
prove to be useful towards the creation of
conversational partners that cooperate with
humans in a multitude of environments. Most
existing systems have been designed through
non-systematic ad-hoc methods. Although
existing systems do exhibit satisfactory
performance in highly constrained environments
and with limited vocabularies, their ability to
scale has yet to be proven. Also, most of the
existing systems are more similar to tools than to
partners, as they exhibit little or no intention recognition of the human partner. Thus,
they can only respond to immediate commands,
and cannot participate in longer term cooperative
and flexible planning. We believe that the
approach presented here will prove to be capable
of both scalability and true cooperation. We
envision that in the future the systematicity of
the design procedure will enable its partial
automation and thus aid towards scalability.
Also, the inherent ease with which the proposed
design can accommodate embedded agent
models of others might well prove to be
invaluable towards equipping the robot with a
form of Theory of Mind [Premack and Woodruff
1978], which will act as a catalyst towards true
human-robot cooperation. Furthermore, we
believe that our proposals on grounded situation
model near-optimality as well as on the
adaptation of human behavioral tests for robot
design will prove valuable to other researchers in
the field.
[overview of paper]
This paper will start by discussing our positions
on design minimality for GSM’s, as well as on
how to use human behavioral tests as
specifications for robot design. Then, the
proposed design will be given in an incremental
step-wise manner. We will start by considering
the first four parts of the token test, and derive
the heart of our proposal through their
requirements. Then, we will consider the final
part, which exhibits much less homogeneity than
the first four, and propose successive
augmentations to our initial design. A section
discussing the local optimality of the proposed
design will follow. Finally, a concluding section
will end this paper.
1. Optimal design
[generalities about optimality]
Optimality is, and has always been, one of the principal obsessions of engineers. Apart from the
purpose of achieving maximal “economy” of
some sort, optimality also fulfills another very
important function: it helps narrow down an
often exponentially huge and intractable space of
possible designs to a single representative, which
is baptized as “best”. This design is not “best” in any absolute sense; it is only under the
arbitrariness of a designer-chosen criterion that
quantifies its “economy” that it can claim to be
so. Given such a criterion, engineers can
compare their designs more objectively, and
claim superiority for a particular design on
rational grounds. Of course, absolute
“optimality” is only a naive engineer’s illusion,
as it can only exist within a well-constrained and
predefined universe of designs. Furthermore, it is
optimal only under the biased judgement of an
ad-hoc optimality criterion, which will in most
cases also take into account only partial aspects
of the designs. Under the
above two degrees of freedom (the universe of
designs and the optimality criterion), and given
inevitable preexisting constraints, finding optimal
designs nevertheless proves easy only in some
highly idealized cases.
[a highly idealized example problem]
Consider, for example a variant on a classic high
school optimization problem:
“John has a rope which is four meters long. He
wants to use this rope in order to separate the
territory of his small robot from the rest of the
world. This territory can be any rectangle which
is made from the rope. His robot would be
happiest if the area of its territory was as big as
possible. How can John make his robot as happy
as possible?”. Let us use this idealized problem
as a small parable illustrating some of the critical
issues in optimal design.
First, the space of possible designs is given and
pretty well defined – it consists of all rectangles
with perimeter of four meters. Second, the
“optimality” criterion is given and well defined –
make his robot as happy as possible by
maximizing the area of the rectangle. Third,
among the preexisting constraints are the total
length of the rope, as well as rope physics and
the geometry of euclidean space.
How does John proceed to discover the optimal
design? First, he idealizes the natural
components into mathematical abstractions: the
rope becomes a one-dimensional curve of
infinitesimal width and without bending
constraints. Second, he parametrizes the space of
all possible designs, yielding in essence a
generative model. The space is isomorphic to the
real numbers R; indeed, any rectangle can be
described by two parameters (width, height)
which collapse to one under the perimeter
constraint (the perimeter should be equal to the
length of the rope which is 4m, i.e. 2*width +
2*height = 4, i.e. width = 2 – height). Third, he
has to select the “best” design; by an arbitrary
choice of method (calculus, inequality
arguments, consulting an oracle) he declares that
the particular design with width = 1 (and
correspondingly equal height) is “best”.
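John’s selection step can also be sketched as a brute-force search in Python (a hypothetical stand-in for the calculus, inequality arguments, or oracle mentioned above):

```python
# Brute-force sketch of John's optimization: parametrize rectangles of
# perimeter 4 by their width w, so that height = 2 - w, and maximize area.
def area(width):
    return width * (2 - width)

candidates = [i / 1000 for i in range(1, 2000)]  # widths in (0, 2)
best_width = max(candidates, key=area)
# best_width -> 1.0 and area(best_width) -> 1.0: the square wins
```

The grid search stands in for any of the arbitrary optimization methods; all of them single out the same square.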
He thus takes the rope, implements the design,
and his robot seems to be smiling and jumping
around. Alas, in a scientific community, any
claim should be accepted with a lot of suspicion;
and indeed we would expect many critics to
appear and declare that John’s robot might have
been happier. John must answer their claims, or
else he will fall from his throne of Reason into the
mud of unwarranted speculation and blind guesswork.
Indeed, he takes his ruler, measures the design,
and declares that the area is one square meter.
The critics take their turns: “If you had just made
one side a little longer, say 1.5 meters, your robot
would be happier.” That’s an easy one, John replies
without hesitation. “But then”, he says, “the total
area would have been less; let’s see how much: 0.75
square meters, 25% less”. The theorists were
convinced, but some hardliners were still waiting
for empirical evidence. Indeed, he reshapes the
territory, and the robot’s face is not glowing with
happiness anymore. Before it starts to cry, he
reshapes it back to the original. Then, another
critic voices his thoughts: “I have a better
solution: why not try a pentagon?”. Another one
stops him mid-sentence: “A circle! – why care
about pentagons?”. “But”, John answers, “the
problem statement was clear: it said rectangle. Is the pentagon a
rectangle? Or the circle? All I can do is follow
the statement. And, after all, my robot is still
smiling!”.
[key results of above story]
A few key points should have been illustrated by
the above story. First, for convenience, we can
conjecture a real world of designs, as well as an
ideal world of models. Models can be tested in
theory and designs in practice, but in the end,
practice is what we aim for, and it is always
more convincing. Second, the universe of
designs and the optimality criterion are degrees
of freedom that have to be explicated and
justified. The constraints of the real world, in the
form that they are transferred to the ideal, should
be explicated too. Third, the generative model (if
any) that will span the universe of designs, as
well as the optimization method, present even
more degrees of freedom.1
[moving on to more complicated cases]
But what happens when we leave such simple
examples and move on to more complicated
cases? First of all, the uniqueness as well as the
achievability of the optimal solution might
become questionable. Second, approximations of
global or even local optima might be all we can
achieve. Also, the optimality criterion itself often
has to become partial: we have to select a
specific subset of the set of possible important
aspects of each design, in order to make the
problem tractable. Third, absolute optimality
might be meaningless. For example, our data
might be noisy, or the computational procedure
to achieve an absolute optimum might be too
expensive. Also, in the case of real-time
optimization, the environment might be changing
quickly, and the optimality criterion changing
with it. In all of the above cases, a satisfactory
solution might be as good if not even better than
a hypothetically optimal one [Simon 1969].
Fourth, we might only have partial data; and we
might be required to devise something that will
still be useful in the future, when exposed to new
data that we have not encountered yet. That is, we
might need not only a good if not optimal
“fit” to the current requirements, but also an easy
upgrade path to further requirements, i.e.
generalization ability given an inexpensive
expansion. Of course, the ability to generalize is
most often antagonistic to fitting the current
requirements optimally; but, tradeoffs can be
found, and many of the techniques developed in
pattern recognition (k-fold validation for
example) and complexity theory [Cherkassky et
al. 1999] address similar problems. In the design
case, what we really need is slightly different
than the standard problem setting. In the standard
setting, we need to be able to generalize to
further requirements with the present design.
Here, we instead need to be able to generalize to
further requirements through an inexpensive
augmentation of our existing design. Things are
much better if a simple “patch” on top of the
existing design can satisfy the new needs; if,
instead, we need to throw everything away and
redesign ex nihilo, we would have to “spend”
much.

1 In the above example we have assumed a non-enumerably
infinite and continuous space of designs; very often,
discrete or hybrid spaces also enter the picture.
Let us now recapitulate briefly, before moving
on from general comments to the special case of
grounded situation models. When we move from
small-scale highly idealized problems to more
realistic large scale cases, an absolute global
optimum might not be achievable or even
desirable. Also, our criteria for optimality cannot
be all-inclusive anymore. Finally, we often have
to take into account the adequacy of our
proposed design under not only current but also
future specifications, and the existence of an
inexpensive route for expansion.
[minimal grounded situation models – overview
of our proposal]
But what import do all of the above comments
have on minimal grounded situation model
designs? Here we will propose a methodology
for an incremental design, satisfying a set of
specifications, and claiming local optimality. The
specification will be broken down into parts, in
rough order of foreseen complexity. An initial
design satisfying the first specification will be
used as a starting point, and successive
augmentations that satisfy the remaining
specifications will be applied to it. The final
design will be tested for local optimality, and
pruned down to a locally optimal solution if this
is required. We will base our designs on a
standard underlying organizational framework
for grounded situation models, which is briefly
described in [Mavridis 2005]. The use of this
framework in conjunction with the proposed
design methodology naturally provides an easy
pathway for further expansion in the future.
[a natural partial ordering of complexity]
First of all, we need to devise a criterion for
comparison of two grounded situation models.
Here I will propose a rough criterion that will
nevertheless be adequate for our purpose.
Ideally, given a GSM design, we would like to
be able to quantify its complexity by some
metric value. Then, when given two GSM
designs, we just accept the design with the
smallest complexity metric as “minimal”.
However, absolute metric values are not
necessary for such a comparison; any ordering
relation will do.
Now, let’s suppose that any GSM design can be
decomposed into parts (representations and
processes), and that we are given two specific
designs, the first of which includes all of the
parts of the second plus some extra parts.
Then, any reasonable definition of “complexity”
will order the first design as being more complex
than the second. Of course, this principle only
provides a partial ordering – it does not supply
us with any directive on how to deal with cases
where one of the designs is not composed from
parts which are a superset of the parts of the
other. However, this partial ordering will be the
criterion for comparing GSM designs that I will
adopt here. It will prove adequate, as we will
restrict our comparisons to augmentations of
existing designs. The question that follows is:
can we devise an incremental design
methodology that will be able to claim some
form of optimality, while only utilizing this
natural partial ordering?
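As a sketch (the part names here are hypothetical), the subset-based criterion can be written directly over part sets:

```python
# The natural partial ordering: design d1 is at most as complex as d2
# exactly when d1's parts form a subset of d2's parts.
def at_most_as_complex(d1, d2):
    return set(d1) <= set(d2)

base = {"hearing", "categorical_layer"}
augmented = base | {"vision"}

comparable = at_most_as_complex(base, augmented)       # True: base is simpler
incomparable = (not at_most_as_complex(base, {"vision"})
                and not at_most_as_complex({"vision"}, base))
```

Designs whose part sets are not nested, like `base` and `{"vision"}`, remain incomparable; this is exactly why the ordering is only partial.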
[incremental design using the natural partial
ordering]
Consider that we are given a list of design
specifications. We start by producing an optimal
design for the first specification. Then, we
consider all possible augmentations of our initial
design that would enable it to cover not only the
first but also the second specification, and we
choose the optimal augmentation. Under what
conditions will the resulting new augmented
design be optimal? The answer should not be
difficult to spot: the resulting new design is
optimal as long as:
C1) We assume additive costs: when A ∩ B is
empty, then cost(A+B) = cost(A) + cost(B).
C2) There is no design that would include only
part of the initial design, and which would
suffice for both the requirements of specification
1 and specification 2.
Proof: Let’s assume that there exists another
design C, which is not equal to A+B, and which
is better than A+B. Then, by C2, C should
include A; and thus, C should be decomposable
as: C = A+B’, where B’ is not equal to B, because
we supposed that C is not equal to A+B. But
then, as C is known to be better than A+B,
cost(A+B’) < cost(A+B); by C1, this implies
cost(B’) < cost(B), which contradicts the choice
of B as the optimal augmentation. Hence no such
C exists. Under C1 and C2, the incremental
procedure thus yields satisfactory GSM designs,
which are furthermore naturally expandable.
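The role of conditions C1 and C2 can be illustrated with a toy sketch; the part names, costs, and specification alternatives below are hypothetical:

```python
# Designs are part sets with additive costs (condition C1); each
# specification lists alternative part sets that would satisfy it.
PART_COST = {"hearing": 1, "vision": 3, "categorical_layer": 2, "memory": 2}

def cost(design):
    return sum(PART_COST[part] for part in design)

def best_augmentation(base, alternatives):
    """Cheapest design extending `base` so as to satisfy the next spec."""
    return min((base | alt for alt in alternatives), key=cost)

spec1 = [{"hearing", "vision"}]                        # ways to satisfy spec 1
spec2 = [{"categorical_layer"}, {"categorical_layer", "memory"}]

d1 = best_augmentation(set(), spec1)   # optimal design for spec 1
d2 = best_augmentation(d1, spec2)      # optimal augmentation for spec 2
# d2 = {"hearing", "vision", "categorical_layer"}, cost 6: under C1 and C2
# this equals the globally cheapest design satisfying both specifications.
```

If the cheapest way to satisfy the second specification did not contain d1 (violating C2), the greedy result could be suboptimal, which is why both conditions are needed.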
2. Behavioral tests for humans as design
specifications for robots
[what constitutes cheating in a test and what does
not?]
When trying to use a behavioral test designed for
humans as a design-driving specification for a
robot, one main difficulty arises, which we will
discuss here. The human to be tested is supposed
not to know exactly the contents of the test
before it is administered. However, we assume
that throughout his life so far, he has received
adequate explicit or implicit training stimuli
to perform well. When
administering a behavioral test to a robot, we
certainly should not allow the designer of the
robot or the trainer to have given explicit
response specifications for each of the specific
tasks that comprise the test. Allowing such a
state of affairs would be equivalent to having
trained the human for the specific tasks/questions
of the test: having shown him the correct
responses and having made sure that he performs
them well. This would certainly count as
cheating in most human testing cases. So the
question arises: if having supplied the
“questions” and the “correct answers” to the
specific test, and having made sure that the human
performs them adequately, counts as “cheating”,
what would be considered allowable?
[the wider area versus the specific questions]
Let’s examine the human case again. A high
school student preparing for his SAT tests is
exposed to intense training. This training takes
the form of explicit tuition on the related
subjects, practice tests etc. Then, one day, he has
to sit his actual exam. Why doesn’t the training
count as cheating? Because it is general enough
to cover the wider area to which the questions of
the administered test belong, but not so specific
as to cover only those questions. It would have
been highly unlikely for the training to cover
only the specific questions of the administered
test without some pre-knowledge or “leak” of the
exam questions.
So what is the moral for the robot case? Any
proposed design solution that passes a human
behavioral test should be able to pass any test
that belongs to the wider area to which the test
belongs, and not only the specific test in
question. Of course, the delineation of the
“width” of the area to which the specific test
belongs should be carefully justified. Notice that
an analogous situation holds in the pattern
recognition evaluation literature. In supervised
training problems, we are given a training set and
are asked to design a classifier that performs well
on a (supposedly unknown) testing set. The
training set should be general enough to cover
and densely represent the area to which the
testing set might belong. However, it must again
not be so specific that it suspiciously includes
only the testing set.
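As a sketch, the analogy amounts to drawing training and testing material independently from the same wider area, rather than handing the robot the test itself; the object vocabulary below is the Token Test’s:

```python
# Train and test questions are drawn independently from the same wider area;
# the training sample is not chosen to match the test sample.
import random

WIDER_AREA = [(size, color, shape)
              for size in ("small", "large")
              for color in ("red", "green", "yellow", "blue", "white")
              for shape in ("circle", "square")]

rng = random.Random(0)
training_set = rng.sample(WIDER_AREA, 12)
testing_set = rng.sample(WIDER_AREA, 5)
# Chance overlap between the two is fine; deliberately training on exactly
# the testing_set would be the robot-design analogue of cheating.
```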
[the test design problem]
After all, one must ask a wider question: what is
the purpose of testing? What was the test
designer trying to achieve? Usually, the test
design procedure is posed as follows:
a) decide on the limits of the wider area of the
material to be tested
b) select specific questions out of the wider area,
such that:
b1) they can be practically tested in limited
time (few in number, short answers)
b2) the answers given to the selected
questions can have high predictive value
towards the answers that the examinee
would give to all the questions comprising
the wider area of the material to be tested.
In short, the test designer must choose a few
easily testable questions that are nevertheless
highly indicative of the examinee’s mastery of
the wider area of knowledge being tested.
[the three commandments]
The above discussion has clear implications
towards using behavioral tests for humans as
design specifications (or alternatively as training
material) for robots. First, the test must be
reverse-engineered: given a specific test, we must
try to delineate the wider area to which the tasks
of the test belong, and for which the specific test
acts as a representative sample. In case we
have explicit knowledge of the test designer’s
coverage intentions, we can adopt them readily.
Otherwise, we have to “grow” the wider area by using
the specific questions as a “seed”. The extent and
generality of the wider area must be chosen carefully.
Second, we must use this wider area as a design
specification, or sample it randomly for the
generation of a training set. We are not allowed
to use the specific questions comprising the test
as the only target domain. Third, we must clearly
and explicitly justify all of the above choices.
[minimality at odds with generality requirement
prescribed by “width”]
Finally, it is worth noting that the designs that
can satisfy the requirements of the wider
“grown” areas will often prove to be more costly
than those that would only satisfy the seeds.
Nevertheless, only a design that can pass the
“wider” requirements is of any real value, and is
furthermore more general and easily expandable
in the future. Of course, the right level of
generality must be decided and explicated on a
case-by-case basis, as we will do here.
3. The Token Test and the constraints it imposes
on mental model representations and processes
Below I will adopt an agent model which
decomposes into an enumerated set of
components (representations and processes),
along the lines of the decomposition described
in: [Mavridis 2005].3 Also, I will adopt the three-
layer grounded situation model terminology of
[Mavridis and Roy 2005]. I will examine each
part of the test in turn and proceed in the manner
outlined above: I will use the specific questions
of the test as a seed used to grow a wider area to
be tested, and then use a model of this wider area
and the expected performance as a design
specification.4
[sensory input modalities]
Let us start with the sensory input of the agent.
The agent under test should be able to sense a set
of utterances (signs) as well as an arrangement of
physical objects (non-signs). The physical
objects are either colored squares or circles,
which have been cut out of thin plastic, and
which have been placed on a flat uniform-
colored surface.5
3 This choice constrains the universe of designs
(as termed in the first section) to those that can
be produced within the decompositional
framework of [Mavridis 2005].

4 Translating to the terms introduced earlier in
the first section: the optimality criterion follows
naturally from the partial ordering that lies at the
heart of the proposed incremental design
methodology. The optimization method is also
contained in the proposed design methodology.

5 Also, for the fifth part of the test, it has to sense
the physical body of the person administering the
test, as we shall see much later in this paper.
In terms of the four-stage reality-to-situation
model processing pipeline, the first stage
(reality-to-sensory stream projection) concerns
us in this paragraph. From the form of the
required signs and non-signs to be sensed, two
requirements arise. First, the sensory organs
should cover the sound wave spectrum with
enough range and fidelity in order to be able to
receive the utterances and successfully
discriminate among them, and thus some form of
a hearing organ is required. Second, the sensory
organs should be able to locate and discriminate
among the various objects and their properties,
and thus at this stage we assume that some form
of a visual organ with coverage of the human
visible spectrum having adequate range and
fidelity is also required.6
[utterances heard are all requests]
Now let us briefly consider the utterances heard.
After a brief glance through the “Token Test”,
one can easily notice that only request speech
acts are used, taking their direct syntactical form
(Imperative). Furthermore, all these are requests
for overt actions, and in particular, motor
manipulation actions (there are no requests for
speech, and arguably none for “internal” covert
actions). Also, notice that the recipient of the test
is assumed to be cooperative, and should
therefore try to fulfill the test-giver’s requests as
best as he can. Furthermore, he is assumed to
have been well-fed and sheltered, so that his
other vital goals are already satisfied (hunger,
thirst, danger avoidance etc.), and so during the
test he will only perform actions towards the
satisfaction of the requests given by the tester.
Thus, as the utterances contain requests for
motor actions, we suppose that the expected
result of the recipient hearing the utterance will
be the production of one or more motor actions,
which will be selected on the basis of the
utterance as well as the situational context.7 This
dictates that the action selection mechanism of
the agent need not involve complicated partial
planning or even rational action selection. For a
pre-programmed agent, we are at the level that
6 Alternatively, touch or ultrasound sonar might
have substituted for vision, although color would
not be covered. Of course, the Token Test was
not originally designed for blind children, and
vision is silently assumed to be the modality
used.

7 Later in this paper, we will also introduce covert
internal actions in addition to overt motor actions.
was termed level-1 in [Mavridis 2005] (reflexive
organism). In the case of a conditionable-
response agent, we might move to level-2.
Nevertheless, once conditioning is finished,
actions will effectively be selected by a fixed
(utterance, Situation Model Contents)-to-
ActionSequence table.8 Below I will explicate
the stages that intervene between utterance
reception and action-sequence production, and
the way that the situation model contents (i.e. the
situational context) contribute to this process.
[the 3 stages from utterance to action sequence]
I will assume that each heard utterance goes
through three stages of translation, before the
necessary motor actions are produced. First, the
surface form of the utterance is translated into a
set of zero, one or more ordered statements
which have the general form:
(arguments in curly brackets {} are optional)
&lt;ActionType&gt; {&lt;ObjectSetDescription&gt;} {&lt;Manner&gt; …}
At this first level (Description-level), the
situational context has not yet been taken into
account.
At the second level (Object-level), the situational
context (categorical layer of situation model) is
used in order to resolve the definite descriptions
of object sets into their actual members (pointers
to objects in the situation model with unique
identifiers):
&lt;ActionType&gt; {&lt;ObjectIDSet&gt;} {&lt;Manner&gt; …}
At the third level (Action-level), the situational
context (this time the continuous layer of the
situation model) is fed to the manner-to-action-argument
functions, in order to provide
continuous-valued arguments for the action
routines, i.e.:
&lt;ActionType&gt; {&lt;ContinuousActionArgument&gt; …}
[an illustrative example of the three stages]
A simple example will help illustrate the three
stages. Let us assume a situation model
consisting of objects that have three properties
(position, size, color), which furthermore are
8 Later, we will comment on the parts of test
where a more complicated action selection
mechanism, such as a planner, might have been
better suited, even if not necessitated
quantized to the categories {left, center, right},
{small, large}, {red, green, blue}. Let us further
assume that the current contents of the situation
model are:
Obj#33
Continuous layer:
Pos = (34, 12), Size = (2), Color = (.5, .1, .2)
Categorical layer:
Pos = “right”, Size = “small”, Color = “red”
Obj#37
Continuous layer:
Pos = (14, 2), Size = (5), Color = (.05, .6, .2)
Categorical layer:
Pos = “left”, Size = “large”, Color = “green”
Obj#38
Continuous layer:
Pos = (54, 8), Size = (5.3), Color = (.5, .1, .1)
Categorical layer:
Pos = “right”, Size = “large”, Color = “red”
Let us also assume that the received utterance is:
“Put the small red one in front of the green one”
At the description-level, this utterance will
translate to an ordered sequence of two
statements:
<get, dd1>, <put, dd1, infrontof(dd2)>
(where dd1 = “the small red one”,
dd2 = “the green one”)
Notice that the situational context has not been
utilized yet. At the next level, the categorical
layer of the situation model will help resolve the
descriptions into unique object ID’s.
At the object-level, we will have:
<get, obj#33>, <put, obj#33, infrontof(obj#37)>
Now, manner will compose with the continuous
layer contents of the relevant objects, to produce
action parameters. I.e.:
<Get(pos(obj#33))>, <Put(Infrontof(obj#37))>
Finally, this will translate to the following two
actions, to be executed sequentially (Footnote 9:
here we assume for simplicity that Infrontof
requires as an argument only the landmark
object, and does not depend on the rest of the
situational context):
Get(34, 12) and Put(14, 0) (see footnote 10)
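The three translation stages of this worked example can be sketched in code. The situation model contents are those given above; the dictionary layout and the helper names `resolve` and `infrontof` are illustrative assumptions, and `infrontof` follows footnote 10's simplification.

```python
# Sketch of the three translation stages (Description -> Object -> Action),
# using the example situation model above. The data layout is an assumption.
SITUATION_MODEL = {
    33: {"pos": (34, 12), "size": 2.0, "color": (0.5, 0.1, 0.2),
         "cat": {"pos": "right", "size": "small", "color": "red"}},
    37: {"pos": (14, 2), "size": 5.0, "color": (0.05, 0.6, 0.2),
         "cat": {"pos": "left", "size": "large", "color": "green"}},
    38: {"pos": (54, 8), "size": 5.3, "color": (0.5, 0.1, 0.1),
         "cat": {"pos": "right", "size": "large", "color": "red"}},
}

def resolve(description):
    """Object-level: resolve a definite description via the categorical layer."""
    matches = [oid for oid, obj in SITUATION_MODEL.items()
               if all(obj["cat"].get(k) == v for k, v in description.items())]
    assert len(matches) == 1, "description must resolve to a unique referent"
    return matches[0]

def infrontof(oid):
    """Action-level 'manner' function: a point just in front of the landmark
    (footnote 10's simplification: Infrontof(14, 2) = (14, 0))."""
    x, _y = SITUATION_MODEL[oid]["pos"]
    return (x, 0)

# Description-level: <get, dd1>, <put, dd1, infrontof(dd2)>
dd1 = {"size": "small", "color": "red"}   # "the small red one"
dd2 = {"color": "green"}                  # "the green one"
# Object-level: descriptions resolved to unique object IDs
obj1, obj2 = resolve(dd1), resolve(dd2)
# Action-level: the continuous layer supplies the motor arguments
actions = [("Get", SITUATION_MODEL[obj1]["pos"]), ("Put", infrontof(obj2))]
print(actions)   # [('Get', (34, 12)), ('Put', (14, 0))]
```

Note how the categorical layer is consulted only at the object-level stage, and the continuous layer only at the action-level stage, mirroring the text above.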
After this brief introduction, we are now ready to
continue with the discussion of the specific
demands that the questions of the token test
impose on the grounded situation model of the
agent. We will start by considering the first four
parts of the test, and then proceed to the fifth,
which is much less homogeneous and more
demanding.
4. Passing the first four parts of the token test
Here the general form of the questions is:
Touch the <size1> <color1> <shape1> AND the
<size2> <color2> <shape2>
Where:
<shape1>, <shape2> belong to: {circle, square}
<color1>, <color2> belong to: {red, green,
yellow, blue, white}
<size1>, <size2> belong to: {small, large}
Some of these arguments are allowed to be
optional.
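This general form, with its optional slots, can be captured by a simple pattern matcher; the sketch below is an illustrative assumption (the `parse` function and the regular-expression approach are not part of the design, only the slot vocabularies come from the test).

```python
import re

# Sketch of a surface-form parser for parts 1-4 of the Token Test:
# "Touch the {size} {color} shape and the {size} {color} shape",
# where the size and color slots are optional.
DD = (r"the(?: (?P<size{i}>small|large))?"
      r"(?: (?P<color{i}>red|green|yellow|blue|white))?"
      r" (?P<shape{i}>circle|square)")
PATTERN = re.compile(
    "Touch " + DD.format(i=1) + " and " + DD.format(i=2) + r"$",
    re.IGNORECASE)

def parse(utterance):
    """Return two description dicts (None for missing optional slots),
    or None if the utterance is not of the general form."""
    m = PATTERN.match(utterance.strip())
    if m is None:
        return None
    g = m.groupdict()
    return [{k: g[f"{k}{i}"] for k in ("size", "color", "shape")}
            for i in (1, 2)]

print(parse("Touch the small red circle and the large green square"))
print(parse("Touch the circle and the square"))  # part 1: only shapes given
```

The earlier parts of the test then correspond to utterances where the optional groups simply fail to fire, so one grammar covers parts 1 through 4.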
Of course, this wider general form only
corresponds to part 4; in earlier parts, some of
the above slots are missing, and thus some
arguments should be allowed to be optional.
Thus, with the above general form, we have used
the specific questions of the first four parts of the
token test as a seed for a wider set of questions
to be targeted. Arguably, at this stage, one might
further widen this set; for example, we could
extend to the more general forms:
Touch the <dd1> AND the <dd2> AND the
<dd3> AND …
<action1> the <dd1> AND <action2> the <dd2>
AND …
One could also extend towards more property
categories: more colors, shapes etc.
However, I will assume that these wider general
forms and extensions would constitute an
overgeneralization given the specific questions
comprising parts 1-4. Thus, we will stop at the
initial level of generality that we proposed.
(Footnote 10: In this example, let us suppose
Infrontof(14, 2) = (14, 0). Notice that here we
have silently assumed that Infrontof returns a
number of object properties, and not a virtual
object, as its output.)
Now, let us step through the requirements the
chosen general form of questions and their
expected resulting actions impose on the
representations and processes of the situation
model. The representation – process
enumerations given in [Mavridis 2005] will be
used here, and are repeated below for easy
reference:
Representations contained in the agent model:
R1) Situation model: represents external reality
(not only physical necessarily)
R2) Internal state (goals, affect)
R3) Past states/situations (includes events)
R4) Future predictions
R5) Action list
Processes contained in the agent model:
P1) Sensor-to-SituationModel process
P2) SituationModel-to-InternalState process
P3) Storage/Maintenance of Past process
(includes event recognition)
P4) Future Prediction Generation process
P5) Action Selection Process
(SituationModelState-to-action table or planner)
Representations:
The questions impose a need for three property
dimensions in the description of each object
(R1), namely <shape>, <color> and <size>. At
the categorical-layer level (R1), these property
vectors should be quantized to the following
categories: {circle, square}, {red, green, yellow,
blue, white}, {small, large}. Notice that no
requirement for a position property dimension
has yet arisen; one nevertheless soon will.
The action required by the questions necessitates
a new entry at the agent’s empty action list (R5).
This entry will correspond to the “Touch” action,
which requires at this stage only a position as its
argument. Due to the general form of the
questions, this position must be derived from an
object (see footnote 11). This necessitates the
addition of a further, fourth property dimension
to the description of each object (R1), namely
<position>. Notice that no categorical-layer
requirements yet exist for this property; positions
need not yet be categorized or verbalized.
For the purpose of providing arguments to the
“Touch” action, the position property dimension
only needs to have a continuous-layer
representation (i.e. a single vector value).
(Footnote 11: i.e. we don’t currently have touch
<position>, such as “touch at the left”, but touch
<dd>: “touch the green circle”.)
[humanly-understandable property dimensions]
Here it is worth clarifying a crucial point before
proceeding. What do we mean by saying that a
requirement for three property dimensions (for
example <size> and <color>) has
arisen? Couldn’t we for example get a single
composite multi-dimensional property, break it
up using any possible local coordinate system, or
even collapse all three to a single-dimensional
one given a suitable transformation? Of course
we could, as long as the new representation was
informationally equivalent to the original one
which consisted of the three humanly-intelligible
property dimensions. But such a new
representation would complicate our
understanding of the actual processes that would
comprise the sensory-to-situation model
processing stages. For example, instead of
deriving “size” from the object-individuated
parts of the incoming image, some other feature
vector with no direct human interpretation will
have to be derived. Later, the “categorizer” at the
fourth stage of the pipeline would have to
operate on this new feature vector in order to
derive verbal size categories.
This is why we decide to directly reflect in the
situation model designs the property dimensions
of objects that exist in natural language – such as
<size> and <color>. Of course, such
properties will not always be orthogonal or
independent. However, if we need not impose a
very strict definition of minimality on our
representations, we are sometimes better off
including an extra property dimension (even if
somewhat redundant) than dealing with the
extra complications described above.
[the test-giver object]
Finally, an extra “dummy” object is inserted in
the situation model, corresponding to the speaker
of the heard utterances. The utterances are
represented as property contents in the situation
model, which are attributed to the dummy object
(speaker), of which no physical visibility is
assumed (a simplifying assumption). Thus, the
speaker object is instantiated initially given no
visual evidence, and contains only dummy
property information (position, size, shape,
color), plus a new property dimension: utterance
(R1). Of course, in real-world implementations
such as the robot Ripley, the speaker is not a
simple object but an agent, and the properties do
not only hold dummy contents [Mavridis and
Roy 2005]. For example, the “position” property
might be estimated through a face detector.
However, given the purpose of our exposition,
this is not required. In conclusion, here we need
to augment our object property dimensions with
the “utterance” dimension, and we need to insert
a dummy “speaker” object in the situation
model. We will soon show how specific
processes update the contents of the “utterance”
property so that its continuous layer contains a
feature representation of the last speech segment
heard, and its categorical layer contains the
recognized text string.
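The representational requirement derived so far can be sketched as a data structure: objects with five property dimensions, each dimension carrying a continuous and (where required) a categorical layer, plus the dummy speaker object. The class and field names below are illustrative assumptions, not part of the design itself.

```python
from dataclasses import dataclass, field
from typing import Optional

# Sketch of the representational requirement: each object carries five
# property dimensions, each with a continuous layer and (where needed)
# a categorical layer. Names are assumptions for illustration.
@dataclass
class Property:
    continuous: object = None          # e.g. a feature vector or position
    categorical: Optional[str] = None  # verbalizable category, if any

@dataclass
class GSMObject:
    obj_id: int
    shape: Property = field(default_factory=Property)
    color: Property = field(default_factory=Property)
    size: Property = field(default_factory=Property)
    position: Property = field(default_factory=Property)   # continuous only
    utterance: Property = field(default_factory=Property)  # speaker only

def make_speaker(obj_id=0):
    """The dummy 'speaker' object: instantiated with no visual evidence;
    only the utterance dimension is ever filled (continuous layer = speech
    features, categorical layer = recognized text string)."""
    return GSMObject(obj_id=obj_id)

speaker = make_speaker()
speaker.utterance.categorical = "touch the red circle"
```

Keeping the speaker as just another `GSMObject` matches the simplifying assumption above; a fuller agent model would replace this class for the speaker.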
Processes:
Here, we will first consider the sensory stream of
vision (non-sign), and the four stages of
processing that mediate between external reality
and the situation model. Then, we will turn our
attention towards the sign-stream (hearing
utterances), and the three stages of processing
that mediate between the utterances (belonging
to the situation model) and the resulting actions.
a) The non-sign stream (vision) (P1a):
After having a look at the two object
arrangements given in the Token Test, we can
propose that the general set to which the physical
objects causing the sensory stimuli belong is:
Rectilinear equidistant arrangements of circles or
squares of small or large size, with uniform Blue
or Green or Yellow or White or Red color. The
circles / squares are cut out of plastic and placed
on a uniform background.
Arguably, we could extend this set to a wider
one, allowing more complicated arrangements,
more sizes, colors, shapes, more complicated
background, or even three-dimensional objects.
However, we will choose not to impose such a
wide requirement here, because we believe it
would be out of the spirit of the designer of the
Token Test.
The overall purpose of the external reality – to –
situation model stream for non-signs is to create
a categorical-layer and continuous-layer
representation that is of adequate fidelity in order
to:
a) drive referent resolution (through the
categorical-layer size, shape and color
representations),
b) provide parameters for motor actions on
objects (through the continuous-layer position
representation).
These requirements (successful referent
resolution and motor manipulation) provide the
criterion for the adequacy of the 4-stage
pipeline (see footnote 12). These requirements
can also be broken down into specific
requirements for each of the processing stages:
The first of the four stages is the sensory
projection stage. Here, any device that will
provide us with a sensory stream that will be
able to provide us with adequate fidelity in the
resulting categorical representation at the end of
the pipeline will be sufficient. Then, the object
individuation stage follows. Here, the sensory
stream must be segmented into parts that
correspond to the physical objects that had
caused them. Later, the property values have to
be extracted. Here, we need four processes,
which given the segmented parts corresponding
to each physical object, will calculate its
property values. Thus, we need estimators for
position, size, color, and shape. These
estimators must provide estimates of adequate
quality in order to ultimately drive referent
resolution and motor action parameters, as we
said above. The fourth and final stage is that of
categorical quantization. Here, we need
categorical classifiers for size, color and shape
(categorization of position not required by Token
Test, as explained above). These have to be
tuned so that they correspond with the tester’s
categories.
In conclusion, the need for the following
processes has arisen:
P1.a1) Vision sensor
P1.a2) Vision stream segmenter
P1.a3a) Shape estimator
P1.a3b) Color estimator
P1.a3c) Size estimator
P1.a3d) Position estimator
P1.a4a) Shape category classifier
P1.a4b) Color category classifier
P1.a4c) Size category classifier
(Footnote 12: We can judge success in two
ways: either by comparing to some known
objective external reality (i.e. which physical
objects we have placed where), or by comparing
to the tester’s subjective representation of
reality. In the first case we essentially judge
fidelity, and in the second, alignment between
two agents.)
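The four-stage vision pipeline P1.a1-P1.a4 can be sketched end to end. The estimators below are trivial stand-ins operating on a synthetic "frame" (the field names, thresholds, and category boundaries are all assumptions); a real implementation would operate on image data.

```python
# Sketch of the four-stage non-sign (vision) pipeline P1.a1-P1.a4.
# The input "frame" stands in for the sensory projection (P1.a1).

def segment(frame):                     # P1.a2: vision stream segmenter
    """Split a synthetic frame into per-object segments."""
    return frame["objects"]

def estimate(seg):                      # P1.a3a-d: property estimators
    return {"shape": seg["shape_feat"], "color": seg["rgb"],
            "size": seg["area"], "position": seg["centroid"]}

def categorize(props):                  # P1.a4a-c: category classifiers
    r, g, b = props["color"]
    color = "red" if r > max(g, b) else ("green" if g > b else "blue")
    # note: no position classifier -- positions stay continuous (see text)
    return {"shape": "circle" if props["shape"] > 0.8 else "square",
            "color": color,
            "size": "large" if props["size"] > 4.0 else "small"}

def vision_to_gsm(frame):
    """Full pipeline: sensor frame -> continuous + categorical layers."""
    out = {}
    for oid, seg in segment(frame).items():
        cont = estimate(seg)
        out[oid] = {"continuous": cont, "categorical": categorize(cont)}
    return out

frame = {"objects": {33: {"shape_feat": 0.95, "rgb": (0.5, 0.1, 0.2),
                          "area": 2.0, "centroid": (34, 12)}}}
gsm = vision_to_gsm(frame)
print(gsm[33]["categorical"])
```

The adequacy criterion from the text applies here: the classifiers' thresholds must be tuned so that the resulting categories align with the tester's.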
b) The sign stream (speech) (P1b):
Now, let’s consider the hearing-to-utterance-
property stream. We will use the
simplifying assumption that the visual and aural
streams do not fuse after the second stage of their
pipelines, but are separate. First of all, we need a
sensor with adequate coverage and quality in the
human speech spectrum. Then, we need an
utterance segmenter stage. In place of a property-
extractor, we will most probably have some
feature-extraction for speech recognition.
Finally, the categorical classifier output should
consist of the text strings produced by the rest of
the speech recognizer.
In summary: the air pressure waves are translated
into an audio stream by the hearing sensor, and
the stream is segmented into utterances. Then,
the utterances are analyzed into features which
will fill the continuous layer of the “utterance”
dimension of the dummy speaker object. Finally,
the speech classifier is fed with the features and
produces a text string, which fills the categorical
layer of the “utterance” property dimension. In
most real-world implementations, the hearing
stream segmenter, speech feature estimator, as
well as speech classifier are all parts of the
speech recognition software. Once again, the
overall requirement for the four stages of the
pipeline is that the incoming speech stream is
correctly segmented and translated into text.
In conclusion, the need for the following
processes has arisen:
P1.b1) Hearing sensor
P1.b2) Hearing stream segmenter
P1.b3) Speech feature estimator
P1.b4) Speech classifier (language-dependent)
c) The action selection process (P5)
Here, the instantaneous “utterance” property
dimension contents of the dummy speaker object
have to go through the three stages of translation
to sequences of motor actions, which were
described in the third section of this paper. First,
the utterance contents are broken into actions by
a motor-action selection process. However, the
output is still at the Description-Level. Thus, an
object referent resolution process translates it to
the Object-Level, and a motor-parameter
extractor process finally gets us to the Action-
Level. The resulting actions are executed in their
order of production, by the motor execution
process.
To recap, we need the following processes:
P5.1) Action selector process
P5.2) Object Referent Resolution process
P5.3) Motor parameter extractor process
P5.4) Motor executor process
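The chain P5.1-P5.4 can be sketched as four composed functions. The toy rule in `action_selector` and all of the names below are illustrative assumptions covering only the simplest part-1-4 surface form.

```python
# Sketch of the situation-model-to-action stream (P5.1-P5.4), chaining the
# three translation stages described earlier. Rule tables are assumptions.

def action_selector(text):              # P5.1: utterance -> Description-level
    # toy rule: "touch the <desc> and the <desc>" -> two touch statements
    body = text.removeprefix("touch ").split(" and ")
    return [("touch", dd.removeprefix("the ").strip()) for dd in body]

def referent_resolver(statements, categorical_index):  # P5.2 -> Object-level
    return [(act, categorical_index[dd]) for act, dd in statements]

def motor_params(statements, positions):               # P5.3 -> Action-level
    return [("Touch", positions[oid]) for _act, oid in statements]

def motor_executor(actions):                           # P5.4: run in order
    for name, pos in actions:
        print(f"{name}{pos}")

index = {"red circle": 33, "green square": 37}
positions = {33: (34, 12), 37: (14, 2)}
stmts = action_selector("touch the red circle and the green square")
acts = motor_params(referent_resolver(stmts, index), positions)
motor_executor(acts)
```

The point of the sketch is the staging: the categorical index is consulted only in P5.2, and the continuous positions only in P5.3, exactly as the three-stage scheme of section 3 prescribes.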
Summary:
A summary of the requirements needed so far: In
terms of representations, a situation model
structure containing objects described along five
property dimensions (shape, color, size, position,
utterance), including a dummy object
representing the speaker of the utterance. In
terms of processes, all of the P1.a processes
required for the visual (non-sign)-to-situation-
model stream, all of the P1.b processes required
for the aural (sign)-to-situation-model stream,
and processes P5.1-P5.4 required for the
situation-model-to-action stream.
5. The fifth and final part of the Token Test
Here, I have attempted to collect the questions of
the fifth part in groups, according to the
requirements they impose on the situation model.
Then, the groups have been ordered into a rough
and subjective scale of increasing difficulty /
complexity, which also follows any prerequisite
relations among the groups. The resulting groups
are:
1. Novel Actions and Manners:
(Q41-44, Q47-48, Q52, Q54, Q58, Q60)
2. Set Referents: (Q50, Q53, Q55)
3. Conjunctions/Disjunctions: (Q45-46)
4. Explicit Temporal sequencing: (Q59, Q61)
5. Conditionals: (present/future): (Q49, Q51)
6. Corrections: (Q56, Q57)
Now let’s consider each group of questions in
turn, use it as a seed for a wider requirement, and
see how the existing representations and
processes must be augmented in order to account
for it.
Part 5.1: Novel Actions and Manners:
Q41.Put [the red circle] on [the green square]
Q42.Put [the white square] behind [the yellow
circle]
Q43.Touch [the blue circle] with [the red square]
Q44.Touch -with [the blue circle]- [the red
square]
Q47.Put [the green square] away from [the
yellow square]
Q48.Put [the white circle] in front of [the blue
square]
Q52.Put [the green square] beside [the red circle]
Q54.Put [the red circle] between [the yellow
square] and [the green square]
Q58.Together with [the yellow circle], take [the
blue circle]
Q60.Put [the blue circle] underneath [the white
square]
[the general forms]
Two new verbs are introduced (“put” and
“take”). These, together with “touch”, which we
had encountered before, are now also followed
by various prepositional phrases (starting with
“on”, “behind”, “with”, “away from” etc.). By
using the above set of questions as a seed, we
will assume that the general set they belong to
consists of the following surface forms:
Put <dd1> <prep> <dd2>
Put <dd1> underneath <dd2>
Put <dd1> between <dd2> and <dd3>
Touch <dd1> with <dd2>
Touch -with <dd2>- <dd1>
Together with <dd1>, take <dd2>
Where:
<dd1>, <dd2>, <dd3> are definite descriptions
of the form: the <size> <color> <shape>
(possibly with <size> and/or <color> missing)
<prep> belongs to {on, behind, away from, in
front of, beside}
[necessary augmentations]
First of all, we will introduce two new motor
actions. The action list already contained
“touch”, and will now be augmented by “get”
and “put”. We will assume that these motor
actions only take one argument, namely a target
position (Footnote 13: more advanced object-
grasping motor routines might necessitate more
arguments, for example the size or shape of the
object to be grasped). The verbs appearing at the
utterance level do not necessarily correspond to a
single action. For example, the surface form “put
the circle beside the square” will translate to:
<get, dd(the circle)>,
<put, dd(the circle), beside(dd(the square))>
Note here that the general form of Q60 creates
special problems, and will be dealt with
separately. Now, let us define rules for the
translation of the surface forms of the utterance
to the Description-level, the Object-level and
the Action-level consecutively. Starting with the
translation to the Description-level, process P5.1
(action selector) should realize rules for the
following mappings:
Put <dd1> <prep> <dd2> →
<get, dd1>, <put, dd1, prep(dd2)>
Put <dd1> between <dd2> and <dd3> →
<get, dd1>, <put, dd1, between(dd2, dd3)>
Touch <dd1> with <dd2> →
<get, dd2>, <touch, dd1>
Touch -with <dd2>- <dd1> →
<get, dd2>, <touch, dd1>
Together with <dd1>, take <dd2> →
<get, dd1>, <get, dd2> (see footnote 14)
Continuing to the Object-level, the referent
resolution process (P5.2) remains as is. Finally,
in order to reach the Action-level, the motor
parameter extractor process will have to include
the following trivial rules:
<touch, obj#n> → <Touch(pos(obj#n))>
(already part of P5.2)
<get, obj#n> → <Get(pos(obj#n))>
<put, obj#n, prep(obj#m)> →
<Put(prep(obj#m))>
(Footnote 14: Here we assume that the robot’s
hand has the ability of holding two or more
objects at the same time, or else that the robot
has two hands and the second execution of the
“get” motor routine will divert to the second
hand as the first is full. A deeper treatment would
necessitate the introduction of a self-object in the
situation model. The self-object should have
some form of a “gripper” property dimension,
holding the number of the objects that are
currently held. As the objects that are being held
have to be referable too, so that they are able to
serve in subsequent “put” statements, some way
of attaching objects to the gripper is required. A
fuller treatment would require the introduction of
relations among objects, with a special
“possession” relation linking the self-object with
the objects held.)
[the “manner” functions]
Finally, we have to define the “manner”
functions, realizing the propositions. Their
output should be a target position, and their input
might consist of any subset of the properties of
the landmark object. For example, let us consider
“on”. The target position of “put … on <obj #n>”
can be defined as:
(pos_x, pos_y, pos_z) = On(obj #n) where:
pos_x = pos_x(obj #n)
pos_y = pos_y(obj #n)
pos_z = pos_z(obj #n) + size_z(obj #n) / 2
Let us also consider “infrontof”. Assuming a y-
axis moving away from the tester, we might
define “infrontof” as:
(pos_x, pos_y, pos_z) = infrontof(obj #n) where:
pos_x = pos_x(obj #n)
pos_y = pos_y(obj #n) - size_y(obj #n) + offset
pos_z = pos_z(obj #n)
These definitions are only a very rough first
approximation, though. More complex functions
for spatial prepositions can be defined, taking as
arguments not only the landmark object but also
all of the rest of the objects (the situational
context). However, we will not get into such
detail here. Some relevant ideas can be found in
[Regier and Carlson 2001, Gorniak and Roy
2004]. All of the other prepositional “manners”
are defined along similar lines (except for
“between”, which accepts two objects as
arguments). Their rules are realized by
appropriately augmenting process P5.3.
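The two manner-function definitions above translate directly into code. The y-axis convention follows the text; the `OFFSET` clearance value and the per-axis `size` tuple are assumptions for illustration.

```python
# The two "manner" functions defined above, in code. The y-axis points
# away from the tester; OFFSET is an assumed clearance value, and sizes
# are assumed to be stored per-axis as (size_x, size_y, size_z).
OFFSET = 1.0

def on(obj):
    """Target position for 'put ... on <obj>': centered on top of it."""
    x, y, z = obj["pos"]
    return (x, y, z + obj["size"][2] / 2)

def infrontof(obj):
    """Target position for 'put ... in front of <obj>', a rough first
    cut following the definition in the text."""
    x, y, z = obj["pos"]
    return (x, y - obj["size"][1] + OFFSET, z)

landmark = {"pos": (10.0, 5.0, 0.5), "size": (2.0, 2.0, 1.0)}
print(on(landmark))         # (10.0, 5.0, 1.0)
print(infrontof(landmark))  # (10.0, 4.0, 0.5)
```

A context-sensitive version would take the whole object list as a further argument, as the text suggests, but the signatures above match the simplification adopted here.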
[the problematic case of “underneath” -
conclusion]
Thus, we have successfully augmented the action
list R5 and the processes P5.1-P5.3, in order to
cope with the requirements set by all the
questions considered, except Q60. The
problematic general form of Q60 is considered in
detail in Appendix B, in order not to interrupt the
flow of the text.
Part 5.2: Set Referents: (Q50, Q53, Q55)
Q50.Pick up [the squares, except the yellow one]
Q53.Touch [the squares] slowly and [the circles],
quickly
Q55.Except for [the green one], touch [the
circles]
[general form]
The general surface form of the statements is
(at a first level of generality):
<action> <dd1>
Touch <dd1> slowly and <dd2> quickly
Except for <dd1>, <action> <dd2>
Where: <dd1> might resolve not only to a single
referent but also to a set of referents, and where
the description of the set might include a plural
plus an exception. In the third case, the
description might be broken down into two parts
(exception and plural), with the verb positioned
between them.
[necessary augmentations]
Here, the object referent resolution process
(P5.2) has to be augmented to account for the
possibility of set referents. Also, two new
manners have to be introduced for the motor
action touch – namely the manners slowly and
quickly. Furthermore, in terms of surface form
difficulty, one more problem arises: the set
referent descriptions might be broken into two
disconnected parts, separated by the action verb.
Let us see how we will deal with these
complications. First, we introduce the following
utterance-to-description level rules:
<action verb> <dd1> →
<action, dd1>
(where action = get for “pick up” or “take”,
action = touch for “touch”)
Touch <dd1> slowly and <dd2> quickly →
<touch, dd1, slowly>, <touch, dd2, quickly>
(where manner = slowly or quickly)
Except for <dd1>, <action verb> <dd2> →
<action, dd2 except dd1>
(where <dd2 except dd1> = the members of dd2
minus dd1, and where action = get for “pick up”
or “take”, action = touch for “touch”) (see
footnote 15)
(Footnote 15: We do not attempt to extend plural
referents to “put down” here. We have chosen
not to do so because of the complications that
might arise when the second object set consists
of more than one object, as for example in: “put
the circles on the squares”.)
Thus, we have successfully reached the
Description-level. The augmented referent
resolution process (P5.2) will then be applied on
the Description-level representations, and we
will get results of the form:
<action, {obj#n1, obj#n2, … obj#nk}>
We see that the single argument of the action
now consists of multiple objects, instead of one.
However, this is not allowable given the action
forms that we have introduced in the action list
(R5). We thus have to introduce one further
process, which is applied after the object referent
resolution process (P5.2) but before the motor
parameter extractor process (P5.3). This process
will be called “repeated action expansion
process” (P5.2a), and will realize the following
rule:
<action, {obj#n1, obj#n2, … obj#nk}> →
<action, obj#n1>,
<action, obj#n2>,
…
<action, obj#nk>
For example:
<touch, {obj#n1, obj#n2, obj#n3}>
will be expanded to:
<touch, obj#n1>,
<touch, obj#n2>,
<touch, obj#n3>.
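The repeated-action-expansion process P5.2a amounts to one small function; the sketch below is an assumption about the statement layout (action, argument, optional manner), not the paper's implementation.

```python
# Sketch of the repeated-action-expansion process P5.2a: a statement whose
# single argument is a set of object IDs becomes one statement per member,
# with any trailing manner arguments carried along unchanged.
def expand_repeated_actions(statements):
    expanded = []
    for action, arg, *rest in statements:
        members = arg if isinstance(arg, (set, list, tuple)) else [arg]
        expanded.extend((action, oid, *rest) for oid in members)
    return expanded

# e.g. <touch, {obj#n1, obj#n2}, slowly> -> <touch, obj#n1, slowly>, ...
print(expand_repeated_actions([("touch", [33, 37], "slowly")]))
# [('touch', 33, 'slowly'), ('touch', 37, 'slowly')]
```

Singleton arguments pass through untouched, so the process can sit unconditionally between P5.2 and P5.3.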
Finally, after the application of P5.2a, P5.3 is
applied. P5.3 has to be augmented with new
rules that account for the new manners
associated with “touch”. Touch, as represented in
the action list (R5), accepts only a single
argument so far: the target position to be
touched. “Slowly” as a manner does not modify
the position argument; instead, it necessitates
either a new motor routine for touch_slowly, or
the addition of an extra argument to the touch
action, namely a speed argument. Here we will
adopt the latter method. Thus, the action list entry
for touch (R5) will be augmented with a second
argument, namely speed, which will have three
discrete values (slow, normal, quick), and which
will default to normal speed when no argument
is given through the manner. I.e., the following
rules will be implemented in the Motor
Parameter Extractor Process (P5.3):
Touch <obj#n> →
Touch(pos=pos(obj #n), speed=normal)
Touch <obj#n> slowly →
Touch(pos=pos(obj #n), speed=slow)
Touch <obj#n> quickly →
Touch(pos=pos(obj #n), speed=quick)
Thus, we have finally successfully reached the
Action-level. Only the motor executor process
requires augmentation now; indeed, motor
routines for <get(pos)> and <put(pos)>
have to be created, and the routine for touch
should be augmented with a speed argument:
<touch(pos, speed)> (see footnote 16).
Part 5.3: Conjunctions/Disjunctions: (Q45-46)
Q45.Touch [the blue circle] and [the red square]
Q46.Pick up [the blue circle] or [the red square]
[the general forms]
<action> <dd1> and <dd2>
<action> <dd1> or <dd2>
Where:
<action> belongs to {touch, pick up, take}
<dd1> and <dd2> are definite descriptions of the
form: the <size> <color> <shape>, possibly with
<size> and / or <color> missing
[necessary augmentations]
In our previous discussion on parts 1-4 of the
test, it was assumed that the action selector
process (P5.1) was able to deal with the
utterance form: <action> <dd1> and <dd2>,
and that it realized a rule of the form:
<action> <dd1> and <dd2> →
<action, dd1>, <action, dd2>
It is trivial to modify this rule superficially to
account for disjunctions. For example, we might
decide to just perform the action on the first
referent, and never on the second:
<action> <dd1> or <dd2> →
<action, dd1>
(Footnote 16: As an extension, if we decided to
be able to use the manners “quickly” and
“slowly” not only with touch but also with get
and put, then we would need a further
augmentation in P5.3, to account for the
possibility of multiple manners. For example:
“put the red square quickly on the yellow circle”
(which includes the manners <quickly> and
<on>). This would further necessitate
compatibility checks among manners, or more
complicated compositional rules. Consider for
example: “put the red square underneath the
green circle on the green circle”, which would be
illegal.)
This is the augmentation to P5.1 that we will
propose here. Notice, however, that such an
interpretation would oppose the Gricean maxim
of quantity [Grice 1975]: why ever mention the
second referent if it is never going to be useful?
A deeper interpretation of disjunctions would
require the introduction of concepts such as
“feasibility” of action execution, or
“preferability” of one action over another given
some execution cost or ordering. Thus, a deeper
treatment of disjunctions would give them the
following procedural meaning:
<action> <dd1> or <dd2> means:
out of the two actions (<action, dd1>,
<action, dd2>), select the one that is either
preferable, or anyway just feasible if the other
isn’t.
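This deeper procedural meaning can be sketched as a selection function; the `feasible` predicate and `cost` ("preferability") function are assumed to be supplied by the rest of the agent, and are illustrative assumptions here.

```python
# Sketch of the deeper disjunction treatment discussed above: of the two
# candidate actions, pick the preferable feasible one, else the other
# feasible one, else nothing.
def interpret_disjunction(action1, action2, feasible, cost):
    candidates = [a for a in (action1, action2) if feasible(a)]
    if not candidates:
        return None                   # neither action can be executed
    return min(candidates, key=cost)  # "preferability" = lower cost

# toy example: touching obj#38 is infeasible (out of reach), so the
# first action wins even under a flat cost function
feasible = lambda a: a[1] != 38
cost = lambda a: 1.0
print(interpret_disjunction(("touch", 33), ("touch", 38), feasible, cost))
```

Under this reading the second referent is no longer useless, so the Gricean objection raised above dissolves.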
Regarding feasibility of actions, it is worth
clarifying one further point. In our proposal so
far, we have silently assumed that we expect the
tester to provide us with singular definite
descriptions which resolve to a single existing
referent (see footnote 17). Requested actions are silently
assumed to be always feasible. The case of an
unresolvable or an ambiguous definite
description has not been explicitly dealt with.
How do we augment our system to deal with
such cases? A simple but effective proposal
follows (see footnote 18):
a) Unresolvable definite description:
Motor action is not executed. Instead, robot
produces canned speech, informing the tester that
“no such object exists”. Note that this is totally
out of the climate of our proposal so far; none of
the token test questions demand speech
production from the child.
b) Ambiguous definite description:
Motor action is not executed. Instead, robot
produces canned speech, requesting an inform
speech act from the user: “Do you mean this
one?” (touching one possible referent) “That
one?” (touching another) “Or this one?”
(touching the third). Again, this is just a simple
canned behavior. A deeper treatment would
require a more complicated tester model.
Treating the tester as a physical object with
utterance properties, as we do here, is not
enough. Also, a rational action selection
mechanism covering not only motor actions but
also speech actions would be required.
(Footnote 17: Plural definite descriptions that
resolve to set referents were discussed above in
Part 5.2. Footnote 18: Implemented in [Mavridis
and Roy 2005].)
In conclusion, we have chosen here to follow the
trivial interpretation of disjunctions, arbitrarily
selecting only the first action for execution.
However, we have also discussed what would be
necessary for a deeper treatment.
Part 5.4. Explicit Temporal sequencing:
Q59.After picking up [the green square], touch
[the white circle]
Q61.Before touching [the yellow circle], pick up
[the red square]
[general form]
After <action1>-ing <dd1>, <action2> <dd2>
Before <action1>-ing <dd1>, <action2> <dd2>
where:
<action1>, <action2> belong to
{touch, pick up, take}
<dd1> and <dd2> are definite descriptions of the
form: the <size> <color> <shape>, possibly with
<size> and / or <color> missing
[necessary augmentations]
Again, a slightly superficial but nevertheless
adequate implementation exists. We just have to
augment the process P5.1 (action selector
process), which transforms the utterances to the
Description-level, with the following rules:
After <action1>-ing <dd1>, <action2> <dd2> →
<action1, dd1>, <action2, dd2>
Before <action1>-ing <dd1>, <action2> <dd2> →
<action2, dd2>, <action1, dd1>
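The two temporal-sequencing rules reduce to an ordering decision, sketched below (function and argument names are assumptions):

```python
# Sketch of the P5.1 augmentation for explicit temporal sequencing:
# "after X, Y" keeps the utterance order; "before X, Y" swaps it.
def sequence_statements(connective, first_clause, second_clause):
    if connective == "after":
        return [first_clause, second_clause]
    if connective == "before":
        return [second_clause, first_clause]
    raise ValueError(connective)

# Q61: "Before touching the yellow circle, pick up the red square"
print(sequence_statements("before",
                          ("touch", "yellow circle"),
                          ("get", "red square")))
# [('get', 'red square'), ('touch', 'yellow circle')]
```

This captures the superficial interpretation only: the request stays immediate, with no planning involved, as argued in the text.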
A deeper interpretation would remove the
immediacy of the request, and would involve
planning. However, we feel such an
interpretation is not what is really required by the
token test.
Part 5.5: Conditionals: (present/future)
Q49.If [there is [a black circle]], pick up [the red
square]
Q51.When [I touch [the green circle]], you take
[the white square]
Here we have a huge leap of complexity
compared to the previous parts. First of all,
conditions are introduced, which are intricately
bound to the notion of events. Then, an indefinite
description appears: “a black circle”. Last, but
most importantly, the loaded personal pronouns
“I” and “you” appear. We will now have to
decide on an appropriate generality and depth of
treatment.
[general form]
if <condition>, {you} <action> <dd2>
(includes present, but not future)
when <condition>, {you} <action> <dd2>
(normally doesn’t include present, only future)
Where:
<condition> belongs either to the form:
there is <id1>
or to the form:
I <action> <dd1> (see footnote 19)
And where:
<dd1>, <dd2> are definite descriptions,
<id1> is an indefinite description etc.
[suggested augmentations]
An introduction of a notion of events, although
not absolutely necessary at this stage, would
provide the right amount of generality and
expandability. Thus, we have to start devising
representations in R3. Notice that although
conditional tests for existence (“there is a circle”)
do not necessarily involve the past, action
conditions do: “when I put the circle on the
square” cannot be tested through the experience
of a single instant. Although “when I put the
circle on the square” refers to the time instant
when an action performed by the tester has been
completed, testing whether the action has been
completed requires some account of the whole
duration of the action. Thus, we need a history of
temporal
“snapshots” justifying that indeed the tester has
performed the put action. Just seeing the circle
on the square, and maybe the user’s hand
touching the circle, is not enough justification
that the condition holds. Thus, the introduction
of events and some form of representation of
the past (R3) becomes inevitable.
(Footnote 19: We won’t extend to manners etc.,
although we could easily do so. We will also not
extend to observing your own actions; only
actions performed by the tester can be initially
observed.)
Let us now decide on an appropriate form of
event representation. I propose the form
(see footnotes 20, 21):
<event type, participant 1, {participant 2},
start time, end time, completed>
Where:
<event type> belongs to {exists, touch, get, put}
If <event type> = exists,
then participant 1 is the obj #n that exists.
If <event type> = touch, get, or put,
then participant 1 is the object corresponding to
the tester’s body, while participant 2 is the obj #n
that is being manipulated.
<start time>, <end time>: time landmarks in
some referable subjective timescale (clicks etc.)
Exists events are continuously updated so that
the end time is equal to the current time, as long
as the participant still exists.
Manipulation events are continuously updated so
that the end time is equal to the current time, as
long as the action has not finished yet.
The <completed> flag becomes true when the
event has finished, and the end time is then less
than the current time.
Actions refer to sequences of GSM contents, and
connect to the senses only indirectly through the
GSM. Of course, the sensory streams should
provide enough information to the GSM so that
the action can be logged. In particular, for the
second type of condition, i.e. “I <action>
<dd1>”, we need to be able to observe the actions
of the “tester” object, which so far was only a
dummy object associated with incoming
utterances. Thus, we first need to be able to
localize the tester in space, and then to derive a
rich enough observation of his body from the
senses so that we can successfully recognize
when he performs the action verbs: get, put,
touch. Also, we need to be able to recognize
which physical objects are being implicated in
the above actions – but hopefully our vision
system would already provide enough clues for
that.
(Footnote 20: Explicit representation of agents
in a different form than physical objects has not
been carried out so far. Thus, by “object” here
we are referring to either the robot, the tester, or
the physical objects. Footnote 21: For simplicity,
we prefer the neutrality of the term
“participants” instead of agent/patient roles.)
Thus, we need the following augmentations in
the vision stream:
P1.a1) Vision sensor: should be able to get the
test-giver within the field of view, and get
sensory evidence of him of quality adequate for
the subsequent action recognition
P1.a2) Vision stream segmenter: should be able
to segment the test-giver’s body
P1.a3e) Action feature extractor: enough features
should be extracted so that the performed actions
can later be recognized by the event classifier
through the resulting timeseries.
There is no explicit need for quantized action
features, as the language specification does not
demand verbalization of instantaneous features
of actions. Thus, no augmentation to P1.a4 is
needed. With the augmentations made so far, we
have a time series of rich features enabling
action recognition in the GSM. Now, we need to
augment suitably the (previously empty)
components R3 and P3.
In detail:
R3) Past states/situations (includes events)
The current GSM is augmented with a log of
events, including events in progress as well as
past events. No explicit detailed history of the
past needs to be kept, except for this log which is
embedded in the current GSM.22
P3) Storage/Maintenance of Past process
(includes event recognition)
A set of four event classifier processes is needed,
each process specialized for each of the four
event types (exists, get, put, touch). These
processes examine the current state of the GSM,
keep track of whatever information they need
from the past internally, and produce ongoing
event descriptions in the event log which is part
of R3. When an event is finished, the <finished_flag> is set to true, and the event is
updated in the log for one last time. For
simplicity, we assume that no memory
compression/forgetting process operates on the
log, although in systems that operate round-the-clock such a process would be necessary.
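As one concrete illustration of these classifier processes, the “exists” classifier might operate as follows (a sketch; the dict-based log format and all names are assumptions):

```python
def exists_classifier(gsm_objects, log, now):
    """P3 sketch: maintain one ongoing 'exists' event per object in the GSM.
    gsm_objects: set of obj #s currently present; log: list of event dicts."""
    ongoing = {e["p1"]: e for e in log
               if e["type"] == "exists" and not e["finished"]}
    for obj in gsm_objects:
        if obj in ongoing:
            ongoing[obj]["end"] = now          # still exists: extend end time
        else:                                  # newly appeared: open an event
            log.append({"type": "exists", "p1": obj, "p2": None,
                        "start": now, "end": now, "finished": False})
    for obj, ev in ongoing.items():
        if obj not in gsm_objects:             # vanished: update one last time
            ev["finished"] = True
```

The three manipulation classifiers (get, put, touch) would follow the same pattern, but would open and close events based on recognized actions of the tester rather than on object presence.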
[from events to condition testing]
22 If we need to answer questions about the state
of the GSM during specific past moments etc.,
then we also need some detailed history of past
moments. However, this is not necessary in the
token test. It is necessary for example in the
specs of [Mavridis and Roy 2005].
So far, we have augmented the sensory stream
(P1a) so that the test-giver’s actions can be
observed, and augmented our representations and
processes dealing with the past (R3, P3) so that a
log of ongoing and past events is always
accessible in the GSM. Now, we need to connect
the events that are part of the event log with the
conditions that we have posited in the general
form of the requirements for this part of the test.
The first form of the conditions is: “there is <desc>”. This will correspond to the existence of an event with type “exists”, with a participant which satisfies <desc>, and with an ongoing end time equal to the current time.
But what augmentations should we make to the
processes of P5? I propose that process P5.1
(action selection), should be augmented with a
conditional filter process (P5.1a). The
conditional filter process will call the object
referent resolution process (P5.2) in order to get
a set of possible event participants that satisfy
the indefinite description <desc>. Then, it will
search in the event log, and conclude whether the
condition is TRUE or not. The following
translation rule should be realized in P5.1:
if <condition>, <action_verb> <desc2>
→
IF condition is TRUE:
<action_verb> <desc2>, …
IF condition is not TRUE:
(Empty action list)
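The conditional filter and this translation rule can be sketched as follows (assuming a simple dict-based event log; `satisfies_desc` stands in for the object referent resolution of P5.2):

```python
def condition_there_is(satisfies_desc, log, now):
    """P5.1a sketch: 'there is <desc>' is TRUE iff some 'exists' event whose
    participant satisfies the indefinite description has an ongoing end
    time equal to the current time."""
    return any(e["type"] == "exists" and e["end"] == now
               and not e["finished"] and satisfies_desc(e["p1"])
               for e in log)

def translate_if(condition_true, action_list):
    """Translation rule in P5.1: emit the requested actions only when the
    condition holds; otherwise emit the empty action list."""
    return list(action_list) if condition_true else []
```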
The second form of the conditions is: “I <action_verb> <desc>”. This will correspond to the existence of an event with type analogous to the action verb (get, put, touch), with the object corresponding to the test-giver as the first participant, with an object which satisfies <desc> as a second participant, and with an ongoing end time equal to the current time minus one tick (i.e. the <finished_flag> should have just been
raised). Note that this second form of the
conditions is paired with the following utterance
form:
when <condition>, {you} <action_verb> <desc2>
An interesting aspect of this request speech act is
that it cannot be instantaneously fulfilled, but
often needs to wait for to hold in
the near future before the tester can fulfill the
request. Thus, we need some way of getting the
agent under test to repeatedly hear the request,
until favorable conditions arise for its
satisfaction, or until it has to be forgotten. It
might have to be forgotten after a time-out
period, or after one or more new requests have
arisen.
Here, we will propose the following way of
handling these demands: we will assume that
such an utterance is persistently hallucinated23 in
the agent’s head until its conditions are satisfied
or until a new utterance has arrived. Thus, what
we need is a “hallucinatory” circuit in the senses.
An internal action will be produced by the action
selection process, which will “echo” the master’s
voice repeatedly in the agent’s head, until the
master has been served or he has demanded
something new. The action list R5 will be
augmented with an action of the form:
Hallucinate_utterance <utterance, source_object>
The source object will be initially fixed to the
object representing the test-giver.
The effect of this internal action will be the
writing of the utterance in the appropriate
utterance property representation in the GSM, as
if it had been freshly heard. In case new sensory
information has come in, it will automatically
overwrite the hallucinated utterance – i.e. real
voices suppress persistent hallucinated voices.
Furthermore, the translation rule that we need in
P5.1 is:
when <condition>, {you} <action_verb> <desc2>
→
IF condition is TRUE:
{} , …
IF condition is not TRUE:
Hallucinate_utterance
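One tick of such a circuit might be sketched as follows (the slot structure and names are assumptions rather than the implemented interface):

```python
def hallucination_tick(slot, heard, condition_true, requested_action):
    """One tick of the hallucinatory circuit. 'slot' is the utterance
    property of the test-giver object in the GSM; 'heard' is fresh
    sensory speech, or None if nothing new was heard."""
    if heard is not None:                 # real voices suppress hallucinations
        slot["utterance"], slot["source"] = heard, "sense"
        return []
    if condition_true:                    # favorable conditions have arisen
        slot["utterance"] = None          # master served: stop echoing
        return [requested_action]
    # Hallucinate_utterance: write the request back, as if freshly heard
    slot["source"] = "hallucination"
    return []
```

Calling this once per tick reproduces the intended behavior: the request echoes in the agent’s head until it can be fulfilled, or until a new utterance arrives and overwrites it.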
In this way, we are able to handle conditional
action execution in the future, through repeated
hallucination of the request of the test-giver. The
hallucination scheme works well even when we
drop the simplifying assumption that the
hallucination stops being repeated when a new
request comes in. But then, we need to devise
some form of buffer of simultaneous “directives”
that are being heard at each instant together with
the sensory-derived utterances, and which guide
the agent’s behavior. We would then also need to
23 This idea is reminiscent of the general theme of [Jaynes 1976].
devise ways to deal with incompatible requests
etc., and we slowly end up getting to all the
known forms of internal conflict. Nevertheless,
this simple idea of hallucinated requests can
really achieve a lot. We can have agents that are
verbally taught to respond appropriately to
appropriate circumstances by their masters, we
can have incoming sensory signs to translate to
hallucinated utterances (for example, in reading
written texts), and much more.
[conclusion]
We have shown how present/future conditionals
can be handled. This was achieved through the
introduction of an event representation and an
event log, through augmentation of the sensory
machinery that enables the observation and
recognition of the actions of the test-giver, and
through a “hallucinatory utterance” circuit that
persistently repeats a conditional action
execution request until it is carried out or it is
forgotten. More importantly, this primitive
hallucinatory circuit provides an interesting
pathway for expansion in the future that enables
the implementation of many capabilities such as
teaching, reading etc. Such capabilities are
invaluable towards the creation of autonomous
cooperative agents.
P5.6: Corrections: (Q56, Q57)
Q56. Pick up [the red circle] -No!- [the white square]
Q57. Instead of [the white square], take [the yellow circle] (inverted syntax)
[General form]
<action_verb> <desc1> -No!- <desc2>
Instead of <desc1>, <action_verb> <desc2>
Where:
<action_verb> belongs to {touch, pick up, take}
And <desc1>, <desc2> are definite descriptions, as defined previously.
[necessary augmentations]
Once again, as was the case in part 5.3
(conjunctions), there is a deeper phenomenon
going on here. However, we will treat it slightly
superficially, albeit quite effectively for the
purpose of the test. A deep treatment of
corrections requires an understanding of
utterance production in the tester (covering the –
No!- interruption), or a longer-term
understanding of discourse (covering the contrast
marked by “Instead”). However, this deeper
understanding is not essential for responding
appropriately to these utterances.
We propose the following augmentations to
process P5.1 (action selector), performing the
utterance-to-description level translation:
<action_verb> <desc1> -No!- <desc2> → <action_verb> <desc2>
Instead of <desc1>, <action_verb> <desc2> → <action_verb> <desc2>
I.e. we totally ignore the mistaken / contrasted part of the utterance, and proceed with executing the requested action with the correct argument.
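A toy string-level sketch of these two rewrite rules follows (the actual P5.1 translation operates at the description level; names are assumptions):

```python
VERBS = ("pick up", "take", "touch")   # the <action_verb> set of the general form

def translate_correction(utterance):
    """Keep only the corrected description, discarding the mistaken or
    contrasted one, for both correction syntaxes."""
    if "-No!-" in utterance:
        # '<action_verb> <desc1> -No!- <desc2>'  ->  '<action_verb> <desc2>'
        head, desc2 = utterance.split("-No!-")
        verb = next(v for v in VERBS if head.strip().lower().startswith(v))
        return f"{verb} {desc2.strip()}"
    if utterance.lower().startswith("instead of"):
        # 'Instead of <desc1>, <action_verb> <desc2>'  ->  '<action_verb> <desc2>'
        _, rest = utterance.split(",", 1)
        return rest.strip()
    return utterance
```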
6. Local minimality of the proposed design
Here, a list of the main parts and blocks of the
proposed design is given. We consider each
block in turn, and observe the effects of its
removal from the design, and show which part of
the design specification fails to hold in each case.
We finally conclude, that through this “Jenga”
procedure, we can claim local minimality of the
proposed design in the universe of designs
proposed by the general framework we have
adopted. Furthermore, we have already shown
that the design provides easy and systematic
hooks for further expandability. Let us now
proceed with the test:
Representations contained in the agent model:
R1) Situation model: represents external reality
(not only physical necessarily)
If we remove any of the proposed property
dimensions, or the speaker-object, the test fails.
R2) Internal state (goals, affect)
Not used.
R3) Past states/situations (includes events)
If we remove the event representation or event
types, then Q49/Q51 will fail
R4) Future predictions
Not used. Useful for planning etc. (level-3/4)
R5) Action list
If we remove any of the action entries that we
made, specific questions of the test will fail.
Processes contained in the agent model:
P1) Sensor-to-SituationModel process
If we remove any of the processes of P1a or P1b,
different parts of the test will fail. For example,
if we remove the size estimation process,
referents involving size will not be resolved
correctly etc. If we remove action recognition of
the test-giver, Q51 will fail etc.
P2) SituationModel-to-InternalState process
Not used.
P3) Storage/Maintenance of Past process
(includes event recognition)
If the log creation / event recognition processes
did not exist, we would not be able to correctly
respond to Q51 for example.
P4) Future Prediction Generation process
Not used.
P5) Action Selection Process
(SituationModelState-to-action table or planner)
Any removal of P5.1-P5.4 or their parts will
inhibit successful responses to several questions.
Thus, without getting into a detailed
examination, it should be clear that any removal
of a proposed representation, process or their
parts will make the system incapable of passing
the whole token test. Thus, by the proposed
“Jenga” criterion, the design is locally optimal in
the space created by the proposed framework of
designs.
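The “Jenga” procedure itself is mechanical enough to sketch (the block list and test harness here are placeholders, not the actual system):

```python
def jenga_minimality(blocks, passes_token_test):
    """Return True iff the design is locally minimal: removing any single
    block makes the token test fail for at least one question."""
    for i in range(len(blocks)):
        ablated = blocks[:i] + blocks[i + 1:]   # design with block i removed
        if passes_token_test(ablated):          # still passes without it?
            return False                        # then block i was redundant
    return True
```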
7. Conclusion
In this paper, I presented a design of a minimal
grounded situation model which will enable a
conversational robot to pass the “Token Test”. I
started by considering what it might mean for a
GSM to be minimal, and proposed an
incremental design methodology. Then, I
considered how human behavioral tests should
be translated to specifications for design. Later, I
derived the proposed design step-by-step by
considering the questions of the “Token Test”.
Finally, I pro