
PhD General Examinations:

    Publishable Quality Paper for Main Area (Prof. Deb Roy)

    Nikolaos Mavridis, MIT Media Lab, June 2005

    I think the most productive topic for a paper will be to design the MINIMAL mental

    model architecture that will support sensory-motor grounded interpretation of all spoken

    commands in the final part of the Token test. The paper should present this design, justify

    each element of the design, and relate the architecture as best as you can to the readings

    in your reading list.

    Deb

    Dear Deb,

    I followed the decomposition of the paper into three parts that I had proposed to you by email and to which you agreed. The key themes of this paper are:

    1) THE MINIMALITY CRITERION

    What could a meaningful "minimality" criterion be for a mental model or other systems

    of comparable complexity and scope? What are viable approximations for such criteria,

    or alternatively stepwise design methodologies that can guarantee some notion of strict or

    relaxed "minimality"?

    2) THE FUNCTIONAL SPECIFICATION

    What does it really mean to "pass the Token Test"? I.e., at what level of generality would we expect the system to perform? (I.e., what family of abilities exemplified by the Token Test are we targeting?)

    3) STEPPING THROUGH THE DESIGN PROCESS OF THE PROPOSED SOLUTION

    Based on specific proposals for 1) and 2), what is the suggested design presented in a

    stepwise/incremental manner? (Here, a lattice of "minimal" designs of increasing

    abilities/complexity might need to be presented, and the corresponding design space

    searched to find the minimal satisfactory solution)

    Nicholas

A near-minimal grounded situation model design

    for a conversational robot that can pass the “Token Test”

    Nikolaos Mavridis, MIT Media Lab, June 2005

    Abstract

    The purpose of this paper is to present a design

    of a minimal grounded situation model (GSM)

    [Mavridis and Roy 2005, Mavridis 2005] which

    will enable a conversational robot to pass the

    “Token Test”. The “Token Test” is a standard

    test designed to assess early situated language

    skills of children. This specific design is given as

    an example of a wider methodology. According

    to this methodology, agent model designs are

    derived from specifications, which are in turn

    derived by human behavioral tests. In this paper,

    after a brief introduction to the problem, optimal

    design and reasonable approaches towards

    practical near-optimality in such complex

    systems are discussed. Then, an answer to the

    question of what it should really mean for a

    robot to pass a human behavioral test is

    attempted. In the main section of this paper that

    follows, the proposed design is derived step-by-

    step. A conclusion section ends this paper. This

    paper constitutes a step towards the creation of

    conversational robots that can cooperate fluidly

    with humans. The paper’s main contributions

    include a proposal for an incremental design

    methodology for grounded situation models,

    guidelines for translating human behavioral tests

    into design specifications for robots, and an

    example of the use of the above methodology

    and guidelines: an easily extensible step-by-step

    design for a conversational robot that can pass

    the “Token Test”.

    0. Introduction

    [vision - previous work]

    The ultimate purpose of this work is the creation

    of conversational robots that can cooperate

    fluidly with humans in an intuitive manner using

    natural language. Our current work is part of a

    larger effort to develop conversational interfaces

    for interactive robots [Crangle and Suppes 1994,

    McGuire et al. 2002, Sofge et al. 2003]. Robots

    that understand and use natural language may

    find application in entertainment, assistive, and

    educational domains. Towards this ultimate

    purpose, a mental model for the conversational

    robot Ripley was originally designed, which in

    its initial form enabled the robot to achieve

    object permanence and viewpoint switching

    [Roy, Hsiao and Mavridis 2004]. Later, the

    design was extended, and incorporated into the

    general theoretical framework of “Grounded

    Situation Models” (GSM) having triple-layer

    properties [Mavridis and Roy 2005]. Using such

    models, Ripley became the first conversational

    robot that was able to fluidly mix sensory-

    derived information with spoken descriptions in

    the creation of the contents of his situation

    model, and was also able to resolve temporal

    referents and refer to past events. At that stage,

    Ripley was able to pass the first two parts of the

    “Token Test” [DiSimoni 1978], a standard test

    designed to assess early situated language skills

    of children. In [Mavridis 2005], a wider

    framework for a hierarchy of agent models of

    increasing complexity was proposed. The

    situation model was positioned as a part of the

    wider agent model, which apart from the GSM

    also includes other machinery, such as the

    processes required for action selection.

    [news - steps]

    The immediate aim of this paper will be the

    design of a GSM that will enable Ripley to pass

    the whole of the “Token Test”. The terminology

    developed in the relevant previous papers

    [Mavridis and Roy 2005, Mavridis 2005] will be

    adopted. However, the design proposal given

    here will be given as an example of a wider

    design methodology that will be suggested.

    Through this wider methodology, a list of

    specifications drives an incremental stepwise

    design. This incremental method can guarantee

    local near-optimality of the design, and also

    naturally enables future expandability. The

    specifications driving the method are in turn

    derived by a human behavioral test.

    Two prerequisites are needed for such an

    approach. First, we need a position on what minimality

    for grounded situation models might mean. What

    is optimality? What are we optimizing here?

    What methods or claims might be practically

    applicable? Second, we need a position on what

    we should really mean when we claim that a

    robot has passed a certain behavioral test. Should

    the robot be allowed to know the questions

    before the test? What level of generality should

    we aim for? What was the real purpose of the

    test? We will thus try to propose positions and thereby fulfill these two prerequisites.

    [contributions]

    It is our hope that the methodology proposed,

    even though it is currently in an initial stage, will

    prove to be useful towards the creation of

    conversational partners that cooperate with

    humans in a multitude of environments. Most

    existing systems have been designed through

    non-systematic ad-hoc methods. Although

    existing systems do exhibit satisfactory

    performance in highly constrained environments

    and with limited vocabularies, their ability to

    scale has yet to be proven. Also, most of the

    existing systems are more similar to tools than to

    partners, as they exhibit little or no recognition of the human partner’s intentions. Thus,

    they can only respond to immediate commands,

    and cannot participate in longer term cooperative

    and flexible planning. We believe that the

    approach presented here will prove to be capable

    of both scalability and true cooperation. We

    envision that in the future the systematicity of

    the design procedure will enable its partial

    automation and thus aid towards scalability.

    Also, the inherent ease with which the proposed

    design can accommodate embedded agent

    models of others might well prove to be

    invaluable towards equipping the robot with a

    form of Theory of Mind [Premack and Woodruff

    1978], which will act as a catalyst towards true

    human-robot cooperation. Furthermore, we

    believe that our proposals on grounded situation

    model near-optimality as well as on the

    adaptation of human behavioral tests for robot

    design will prove valuable to other researchers in

    the field.

    [overview of paper]

    This paper will start by discussing our positions

    on design minimality for GSM’s, as well as on

    how to use human behavioral tests as

    specifications for robot design. Then, the

    proposed design will be given in an incremental

    step-wise manner. We will start by considering

    the first four parts of the token test, and derive

    the heart of our proposal through their

    requirements. Then, we will consider the final

    part, which exhibits much less homogeneity than

    the first four, and propose successive

    augmentations to our initial design. A section

    discussing the local optimality of the proposed

    design will follow. Finally, a concluding section

    will end this paper.

    1. Optimal design

    [generalities about optimality]

    Optimality is, and has always been, one of the

    principal obsessions of engineers. Apart from the

    purpose of achieving maximal “economy” of

    some sort, optimality also fulfills another very

    important function: it helps narrow down an

    often exponentially huge and intractable space of

    possible designs to a single representative, which

    is baptized as “best”. This design is not “best”

    under any absolute sense; it is only under the

    arbitrariness of a designer-chosen criterion that

    quantifies its “economy” that it can claim to be

    so. Given such a criterion, engineers can

    compare their designs more objectively, and

    claim superiority for a particular design on

    rational grounds. Of course, absolute

    “optimality” is only a naive engineer’s illusion.

    This is so, as it can only exist within a well

    constrained and predefined universe of designs.

    Furthermore, it is optimal under the biased

    judgement of an ad-hoc optimality criterion,

    which will also in most cases take into account

    only partial aspects of the designs. Under the

    above two degrees of freedom (the universe of

    designs and the optimality criterion), and given

    inevitable preexisting constraints, finding optimal designs nevertheless proves easy only in some highly idealized cases.

    [a highly idealized example problem]

    Consider, for example a variant on a classic high

    school optimization problem:

    “John has a rope which is four meters long. He

    wants to use this rope in order to separate the

    territory of his small robot from the rest of the

    world. This territory can be any rectangle which

    is made from the rope. His robot would be

    happiest if the area of its territory was as big as

    possible. How can John make his robot as happy

    as possible?”. Let us use this idealized problem

    as a small parable illustrating some of the critical

    issues in optimal design.

    First, the space of possible designs is given and

    pretty well defined – it consists of all rectangles

    with perimeter of four meters. Second, the

    “optimality” criterion is given and well defined –

    make his robot as happy as possible by

    maximizing the area of the rectangle. Third,

    among the preexisting constraints are the total

    length of the rope, as well as rope physics and

    the geometry of Euclidean space.

    How does John proceed to discover the optimal

  • design? First, he idealizes the natural

    components into mathematical abstractions: the

    rope becomes a one-dimensional curve of

    infinitesimal width and without bending

    constraints. Second, he parametrizes the space of

    all possible designs, yielding in essence a

    generative model. The space is isomorphic to the

    real numbers R; indeed, any rectangle can be

    described by two parameters (width, height)

    which collapse to one under the perimeter

    constraint (the perimeter should be equal to the

    length of the rope which is 4m, i.e. 2*width +

    2*height = 4, i.e. width = 2 – height). Third, he

    has to select the “best” design; by an arbitrary

    choice of method (calculus, inequality

    arguments, consulting an oracle) he declares that

    the particular design with width = 1 (and

    correspondingly equal height) is “best”.
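    For concreteness, the calculus route is the standard single-variable argument, with w the width in meters and the height fixed by the perimeter constraint:

```latex
% Area as a function of width w (in meters) under the 4 m perimeter
% constraint 2w + 2h = 4, i.e. h = 2 - w.
\[
\begin{aligned}
A(w) &= w\,(2 - w), \qquad 0 < w < 2,\\
A'(w) &= 2 - 2w = 0 \;\Rightarrow\; w = 1, \qquad A''(1) = -2 < 0,\\
A(1) &= 1\ \mathrm{m}^2 \quad\text{versus}\quad A(1.5) = 0.75\ \mathrm{m}^2 \text{ (the critic's proposal below).}
\end{aligned}
\]
```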

    He thus takes the rope, implements the design,

    and his robot seems to be smiling and jumping

    around. Alas, in a scientific community, any

    claim should be accepted with a lot of suspicion;

    and indeed we would expect many critics to

    appear and declare that John’s robot might have

    been happier. John must answer their claims, or

    else he will fall from his throne of Reason to the

    muds of unwarranted speculation and blindsight.

    Indeed, he takes his ruler, measures the design,

    and declares that the area is one square meter.

    The critics take their turns:

    “If you had just made one side a little longer – say 1.5 meters – your robot would be happier.” “That’s an easy one”, John replies without hesitation.

    “But then”, he says, “the total area would have

    been less – let’s see how much: .75 square

    meters – 25% less”. The theorists were

    convinced, but some hardliners were still waiting

    for empirical evidence. Indeed, he reshapes the

    territory, and the robot’s face is not glowing with

    happiness anymore. Before it starts to cry, he

    reshapes it back to the original. Then, another

    critic voices his thoughts: “I have a better

    solution: why not try a pentagon?”. Another one

    stops him mid-sentence: “A circle! – why care

    about pentagons?”. “But “, John answers, “The

    problem statement was clear: it said ‘rectangle’. Is the pentagon a

    rectangle? Or the circle? All I can do is follow

    the statement. And, after all, my robot is still

    smiling!”.

    [key results of above story]

    A few key points should have been illustrated by

    the above story. First, for convenience, we can

    conjecture a real world of designs, as well as an

    ideal world of models. Models can be tested in

    theory and designs in practice, but in the end,

    practice is what we aim for, and it is always

    more convincing. Second, the universe of

    designs and the optimality criterion are degrees

    of freedom that have to be explicated and

    justified. The constraints of the real world, in the

    form that they are transferred to the ideal, should

    be explicated too. Third, the generative model (if

    any) that will span the universe of designs, as

    well as the optimization method, present even

    more degrees of freedom.1

    [moving on to more complicated cases]

    But what happens when we leave such simple

    examples and move on to more complicated

    cases? First of all, the uniqueness as well as the

    achievability of the optimal solution might

    become questionable. Second, approximations of

    global or even local optima might be all we can

    achieve. Also, the optimality criterion itself often

    has to become partial: we have to select a

    specific subset of the set of possible important

    aspects of each design, in order to make the

    problem tractable. Third, absolute optimality

    might be meaningless. For example, our data

    might be noisy, or the computational procedure

    to achieve an absolute optimum might be too

    expensive. Also, in the case of real-time

    optimization, the environment might be changing

    so quickly that the optimality criterion changes along with it.

    solution might be as good if not even better than

    a hypothetically optimal one [Simon 1969].

    Fourth, we might only have partial data; and we

    might be required to devise something that will

    still be useful in the future, when exposed to new

    data that we have not encountered yet. I.e., we

    might not only need a good if not optimal

    “fit” to the current requirements, but also an easy

    upgrade to further requirements; i.e.

    generalization ability given an inexpensive

    expansion. Of course, the ability to generalize is

    most often antagonistic to fitting the current

    requirements optimally; but, tradeoffs can be

    found, and many of the techniques developed in

    pattern recognition (k-fold cross-validation, for

    example) and complexity theory [Cherkassky et

    al. 1999] address similar problems. In the design

    case, what we really need is slightly different

    1 In the above example we have assumed a non-

    enumerably infinite and continuous space of

    designs – very often, discrete or hybrid spaces

    also enter the picture.

  • than the standard problem setting. In the standard

    setting, we need to be able to generalize to

    further requirements with the present design.

    Here, we instead need to be able to generalize to

    further requirements through an inexpensive

    augmentation of our existing design. Things are

    much better if a simple “patch” on top of the

    existing design can satisfy the new needs; if instead we need to throw away everything and redesign ex nihilo, we would have to “spend” much more.

    Let us now recapitulate briefly, before moving

    on from general comments to the special case of

    grounded situation models. When we move from

    small-scale highly idealized problems to more

    realistic large scale cases, an absolute global

    optimum might not be achievable or even

    desirable. Also, our criteria for optimality cannot

    be all-inclusive anymore. Finally, we often have

    to take into account the adequacy of our

    proposed design under not only current but also

    future specifications, and the existence of an

    inexpensive route for expansion.

    [minimal grounded situation models – overview

    of our proposal]

    But what import do all of the above comments

    have on minimal grounded situation model

    designs? Here we will propose a methodology

    for an incremental design, satisfying a set of

    specifications, and claiming local optimality. The

    specification will be broken down to parts, in

    rough order of foreseen complexity. An initial

    design satisfying the first specification will be

    used as a starting point, and successive

    augmentations that satisfy the remaining

    specifications will be applied to it. The final

    design will be tested for local optimality, and

    pruned down to a locally optimal solution if this

    is required. We will base our designs on a

    standard underlying organizational framework

    for grounded situation models, that is briefly

    described in [Mavridis 2005]. The use of this

    framework in conjunction with the proposed

    design methodology naturally provides an easy

    pathway for further expansion in the future.

    [a natural partial ordering of complexity]

    First of all, we need to devise a criterion for

    comparison of two grounded situation models.

    Here I will propose a rough criterion that will

    nevertheless be adequate for our purpose.

    Ideally, given a GSM design, we would like to

    be able to quantify its complexity by some

    metric value. Then, when given two GSM

    designs, we just accept the design with the

    smallest complexity metric as “minimal”.

    However, absolute metric values are not

    necessary for such a comparison; any ordering

    relation will do.

    Now, let’s suppose that any GSM design can be

    decomposed into parts (representations and

    processes), and that we are given two specific

    designs, the first of which includes all of the

    parts of the second plus some extra parts.

    Then, any reasonable definition of “complexity”

    will order the first design as being more complex

    than the second. Of course, this principle only

    provides a partial ordering – it does not supply

    us with any directive on how to deal with cases

    where one of the designs is not composed from

    parts which are a superset of the parts of the

    other. However, this partial ordering will be the

    criterion for comparing GSM designs that I will

    adopt here. It will prove adequate, as we will

    restrict our comparisons to augmentations of

    existing designs. The question that follows is:

    can we devise an incremental design

    methodology that will be able to claim some

    form of optimality, while only utilizing this

    natural partial ordering?
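    Stated a little more formally, the ordering in question is simply inclusion of part-sets (a notational restatement, not taken from the cited framework papers):

```latex
% Partial complexity ordering over GSM designs, each viewed as its set of
% parts (representations and processes); designs whose part-sets are not
% nested in either direction remain incomparable.
\[
\mathrm{parts}(D_2) \subsetneq \mathrm{parts}(D_1) \;\Longrightarrow\; D_2 \prec D_1 .
\]
```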

    [incremental design using the natural partial

    ordering]

    Consider that we are given a list of design

    specifications. We start by producing an optimal

    design for the first specification. Then, we

    consider all possible augmentations of our initial

    design that would enable it to cover not only the

    first but also the second specification, and we

    choose the optimal augmentation. Under what

    conditions will the resulting new augmented

    design be optimal? The answer should not be

    difficult to spot: the resulting new design is

    optimal as long as:

    C1) We assume additive costs: when the parts of A and B are disjoint (A ∩ B = ∅), then cost(A+B) = cost(A) + cost(B), where A denotes the initial design and B the chosen augmentation.
    C2) There is no design that would include only part of the initial design A and that would suffice for both the requirements of specification 1 and specification 2.

    Proof: Let’s assume that there exists another design C, which is not equal to A+B, and which is better than A+B. Then, by C2, C should include A; and thus, C should be decomposable as C = A+B’, where B’ is not equal to B, because we supposed that C is not equal to A+B. But then, as C is known to be better than A+B, cost(A+B’) < cost(A+B), and so, by the additivity assumed in C1, cost(B’) < cost(B). This contradicts the fact that B was chosen as the optimal (cheapest) augmentation satisfying the second specification; hence no such C exists, and A+B is optimal. Repeating the same augmentation step for each remaining specification in turn yields locally optimal,

    satisfactory GSM designs, which are furthermore

    naturally expandable.
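    The incremental procedure can be summarized in a few lines of Python; the sketch below is purely illustrative (all callables are supplied by the caller, and nothing in it is specific to grounded situation models or to the Ripley implementation):

```python
from typing import Callable, FrozenSet, Iterable, List

Design = FrozenSet[str]          # a design is modeled as a set of named parts

def incremental_design(
    specifications: List,
    initial_design: Design,
    candidate_augmentations: Callable[[Design, object], Iterable[Design]],
    satisfies: Callable[[Design, object], bool],
    cost: Callable[[Design], float],
) -> Design:
    """Greedy, locally optimal construction of a design over an ordered list
    of specifications, mirroring the procedure argued for in the text."""
    design = initial_design                       # assumed optimal for specifications[0]
    for spec in specifications[1:]:
        # All augmentations that let the current design also cover this
        # specification (assumed non-empty, i.e. the spec is satisfiable).
        candidates = [aug for aug in candidate_augmentations(design, spec)
                      if satisfies(design | aug, spec)]
        # Keep the cheapest one; conditions C1 and C2 are what let this
        # purely local choice claim local optimality of the overall design.
        best = min(candidates, key=cost)
        design = design | best
    return design
```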

    2. Behavioral tests for humans as design

    specifications for robots

    [what constitutes cheating in a test and what does

    not?]

    When trying to use a behavioral test designed for

    humans as a design-driving specification for a

    robot, one main difficulty arises, which we will

    discuss here. The human to be tested is supposed

    not to know exactly the contents of the test

    before it is administered. However, we assume

    that throughout his experience so far in his life,

    he has received adequate explicit or implicit

    training stimuli in order to perform well. When

    administering a behavioral test to a robot, we

    certainly should not allow the designer of the

    robot or the trainer to have given explicit

    response specifications for each of the specific

    tasks that comprise the test. Allowing such a

    state of affairs would be equivalent to having

    trained the human for the specific tasks/questions

    of the test: having shown him the correct

    responses and having made sure that he performs

    them well. This would certainly count as

    cheating in most human testing cases. So the

    question arises: if having supplied the

    “questions” and the “correct answers” to the

    specific test, and having made sure the human

    performs them adequately counts as “cheating”,

    what would be considered allowable?

    [the wider area versus the specific questions]

    Let’s examine the human case again. A high

    school student preparing for his SAT tests is

    exposed to intense training. This training takes

    the form of explicit tuition on the related

    subjects, practice tests etc. Then, one day, he has

    to sit his actual exam. Why doesn’t the training

    count as cheating? Because it is general enough to cover the wider area to which the questions of the administered test belong, but not so specific as to cover only those questions. It would have been highly

    unlikely for the training to cover only the

    specific questions of the administered test,

    without any pre-knowledge or “leak” of the

    exam questions.

    So what is the moral for the robot case? Any

    proposed design solution that passes a human

    behavioral test should be able to pass any test

    that belongs to the wider area to which the test

    belongs, and not only the specific test in

    question. Of course, the delineation of the

    “width” of the area to which the specific tests

    belongs should be carefully justified. Notice that

    an analogous situation holds in the pattern

    recognition evaluation literature. In supervised

    training problems, we are given a training set and

    are asked to design a classifier that performs well

    on a (supposedly unknown) testing set. The

    training set should be general enough to cover and densely represent the area to which the testing set might belong. However, it must again not be so specific that it suspiciously includes only the testing set.

    [the test design problem]

    After all, one must ask a wider question: what is

    the purpose of testing? What was the test

    designer trying to achieve? Usually, the test

    design procedure is posed as follows:

    a) decide on the limits of the wider area of the

    material to be tested

    b) select specific questions out of the wider area,

    such that:

    b1) they can be practically tested in limited

    time (few in number, short answers)

    b2) the answers given to the selected

    questions can have high predictive value

    towards the answers that the examinee

    would give to all the questions comprising

    the wider area of the material to be tested.

    In short, the test designer must choose a few

    easily testable questions that are however highly

    indicative of the examinee’s mastery of a wider

    area of knowledge that is being tested.

    [the three commandments]

    The above discussion has clear implications

    towards using behavioral tests for humans as

    design specifications (or alternatively as training

    material) for robots. First, the test must be

    reverse-engineered: given a specific test we must

    try to delineate the wider area to which the tasks of the test belong, and for which the specific test acts as a representative sample. In case we

    have explicit knowledge of the test designer’s

    coverage intentions, we can adopt them readily.

    Else, we have to “grow” the wider area by using

    the specific questions as a “seed”. The extent and

    generality of the wider area must be chosen.

    Second, we must use this wider area as a design

    specification, or sample it randomly for the

    generation of a training set. We are not allowed

    to use the specific questions comprising the test

    as the only target domain. Third, we must clearly

    and explicitly justify all of our above choices.

    [minimality at odds with generality requirement

    prescribed by “width”]

    Finally, it is worth noting that the designs that

    can satisfy the requirements of the wider

    “grown” areas will often prove to be more costly

    than those that would only satisfy the seeds.

    Nevertheless, only a design that can pass the

    “wider” requirements is of any real value, and is

    furthermore more general and easily expandable

    in the future. Of course, the right level of

    generality must be decided and explicated on a

    case-by-case basis, as we will do here.

    3. The token test and the constraints it imposes on mental model representations and processes

    Below I will adopt an agent model which

    decomposes into an enumerated set of

    components (representations and processes),

    along the lines of the decomposition described

    in: [Mavridis 2005].3 Also, I will adopt the three-

    layer grounded situation model terminology of

    [Mavridis and Roy 2005]. I will examine each

    part of the test in turn and proceed in the manner

    outlined above: I will use the specific questions

    of the test as a seed used to grow a wider area to

    be tested, and then use a model of this wider area

    and the expected performance as a design

    specification.4

    [sensory input modalities]

    Let us start with the sensory input of the agent.

    The agent under test should be able to sense a set

    of utterances (signs) as well as an arrangement of

    physical objects (non-signs). The physical

    objects are either colored squares or circles,

    which have been cut out of thin plastic, and

    which have been placed on a flat uniform-

    colored surface.5

    3 This choice constrains the universe of designs (as termed in the first section) to those that can be produced within the decompositional framework of [Mavridis 2005].
    4 Translating to the terms introduced earlier in the first section: the optimality criterion follows naturally from the partial ordering that lies at the heart of the proposed incremental design methodology. The optimization method is also contained in the proposed design methodology.
    5 Also, for the fifth part of the test, it has to sense the physical body of the person administering the test, as we shall see much later in this paper.

    In terms of the four-stage reality-to-situation

    model processing pipeline, the first stage

    (reality-to-sensory stream projection) concerns

    us in this paragraph. From the form of the

    required signs and non-signs to be sensed, two

    requirements arise. First, the sensory organs

    should cover the sound wave spectrum with

    enough range and fidelity in order to be able to

    receive the utterances and successfully

    discriminate among them, and thus some form of

    a hearing organ is required. Second, the sensory

    organs should be able to locate and discriminate

    among the various objects and their properties,

    and thus at this stage we assume that some form

    of a visual organ with coverage of the human

    visible spectrum having adequate range and

    fidelity is also required.6

    [utterances heard are all requests]

    Now let us briefly consider the utterances heard.

    After a brief glance through the “Token Test”,

    one can easily notice that only request speech

    acts are used, taking their direct syntactical form

    (Imperative). Furthermore, all these are requests

    for overt actions, and in particular, motor

    manipulation actions (there are no requests for

    speech, and arguably none for “internal” covert

    actions). Also, notice that the recipient of the test

    is assumed to be cooperative, and should

    therefore try to fulfill the test-giver’s requests as

    best as he can. Furthermore, he is assumed to

    have been well-fed and sheltered, so that his

    other vital goals are already satisfied (hunger,

    thirst, danger avoidance etc.), and so during the

    test he will only perform actions towards the

    satisfaction of the requests given by the tester.

    Thus, as the utterances contain requests for

    motor actions, we suppose that the expected

    result of the recipient hearing the utterance will

    be the production of one or more motor actions,

    which will be selected on the basis of the

    utterance as well as the situational context.7 This

    dictates that the action selection mechanism of

    the agent need not involve complicated partial

    planning or even rational action selection. For a

    pre-programmed agent, we are at the level that

    6 Alternatively, touch or ultrasound sonar might

    have substituted for vision, although color would

    not be covered. Of course, the Token Test was

    not originally designed for blind children, and

    vision is silently assumed to be the modality

    used.
    7 Later in this paper, we will also introduce covert internal actions in addition to overt motor actions.

    was termed level-1 in [Mavridis 2005] (reflexive

    organism). In the case of a conditionable-

    response agent, we might move to level-2.

    Nevertheless, once conditioning is finished,

    actions will effectively be selected by a fixed

    (utterance, Situation Model Contents)-to-

    ActionSequence table.8 Below I will elaborate on

    the stages that intervene between utterance

    reception and action sequence production, and on

    the way that the situation model contents (i.e. the

    situational context) contribute to this process.

    [the 3 stages from utterance to action sequence]

    I will assume that each heard utterance goes

    through three stages of translation, before the

    necessary motor actions are produced. First, the

    surface form of the utterance is translated into a

    set of zero, one or more ordered statements

    which have the general form:

    (arguments in curly brackets {} are optional)
    <Action, Description> or <Action, Manner(Description {, Description …})>

    At this first level (Description-level), the

    situational context has not yet been taken into

    account.

    At the second level (Object-level), the situational

    context (categorical layer of situation model) is

    used in order to resolve the definite descriptions

    of object sets into their actual members (pointers

    to objects in the situation model with unique

    identifiers):

    <Action, ObjectID> or <Action, Manner(ObjectID {, ObjectID …})>

    At the third level (Action-level), the situational

    context (this time the continuous layer of the

    situation model) is fed to the manner-to-

    actionargument functions, in order to provide

    continuous-valued arguments for the action

    routines. I.e.:

    <Action, ContinuousArgument {, ContinuousArgument …}>

    [an illustrative example of the three stages]

    A simple example will help illustrate the three

    stages. Let us assume a situation model

    consisting of objects that have three properties

    (position, size, color), which furthermore are

    8 Later, we will comment on the parts of the test where a more complicated action selection mechanism, such as a planner, might have been better suited, even if not necessitated.

    quantized to the categories {left, center, right},

    {small, large}, {red, green, blue}. Let us further

    assume that the current contents of the situation

    model are:

    Obj#33

    Continuous layer:

    Pos = (34, 12), Size = (2), Color = (.5, .1, .2)

    Categorical layer:

    Pos = “right”, Size = “small”, Color = “red”

    Obj#37

    Continuous layer:

    Pos = (14, 2), Size = (5), Color = (.05, .6, .2)

    Categorical layer:

    Pos = “left”, Size = “large”, Color = “green”

    Obj#38

    Continuous layer:

    Pos = (54, 8), Size = (5.3), Color = (.5, .1, .1)

    Categorical layer:

    Pos = “right”, Size = “large”, Color = “red”

    Let us also assume that the received utterance is:

    “Put the small red one in front of the green one”

    At the description-level, this utterance will

    translate to an ordered sequence of two

    statements:

    <Get, “the small red one”>, <Put, Infrontof(“the green one”)>

    Notice that the situational context has not been

    utilized yet. At the next level, the categorical

    layer of the situation model will help resolve the

    descriptions into unique object ID’s.

    At the object-level, we will have:

    <Get, Obj#33>, <Put, Infrontof(Obj#37)>

    Now, manner will compose with the continuous

    layer contents of the relevant objects, to produce

    action parameters. I.e.:

    <Get, (34,12)>, <Put, Infrontof((14,2))>9

    Finally, this will translate to the following two

    actions to be executed sequentially:

    9 Here we assume for simplicity that Infrontof

    requires as an argument only the landmark object

    and does not depend on the rest of the situational

    context

    Get(34,12) and Put(14,0).10
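    To make the three translation stages concrete, the example above can be replayed in a few lines of Python. The data structures and names below are ad hoc for this illustration rather than taken from the Ripley implementation, and the offset of 2 in infrontof is chosen only so that the numbers match footnote 10:

```python
# Situation model contents of the example (continuous + categorical layers).
objects = {
    33: {"pos": (34, 12), "cats": {"size": "small", "color": "red"}},
    37: {"pos": (14, 2),  "cats": {"size": "large", "color": "green"}},
    38: {"pos": (54, 8),  "cats": {"size": "large", "color": "red"}},
}

def resolve(description, objs):
    """Object-level: match a categorical description against the categorical
    layer and return the unique object id it picks out."""
    hits = [oid for oid, o in objs.items()
            if all(o["cats"].get(k) == v for k, v in description.items())]
    assert len(hits) == 1, "a definite description should pick out exactly one object"
    return hits[0]

def infrontof(oid, objs, offset=2):
    """Action-level 'manner' function: a target position computed from the
    landmark object's continuous layer (very rough, as in footnote 10)."""
    x, y = objs[oid]["pos"]
    return (x, y - offset)

# Description-level statements for "Put the small red one in front of the green one".
statements = [("Get", {"size": "small", "color": "red"}),
              ("Put", ("infrontof", {"color": "green"}))]

actions = []
for verb, arg in statements:
    if verb == "Get":                 # argument is a plain definite description
        actions.append((verb, objects[resolve(arg, objects)]["pos"]))
    else:                             # argument is a manner applied to a landmark description
        _manner, landmark = arg
        actions.append((verb, infrontof(resolve(landmark, objects), objects)))

print(actions)   # [('Get', (34, 12)), ('Put', (14, 0))]
```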

    After this brief introduction, we are now ready to

    continue with the discussion of the specific

    demands that the questions of the token test

    impose on the grounded situation model of the

    agent. We will start by considering the first four

    parts of the test, and then proceed to the fifth,

    which is much less homogeneous and more

    demanding.

    4. Passing the first four parts of the token test

    Here the general form of the questions is:

    Touch the <size1> <color1> <shape1> AND the <size2> <color2> <shape2>
    Where:
    <shape1>, <shape2> belong to: {circle, square}
    <color1>, <color2> belong to: {red, green, yellow, blue, white}
    <size1>, <size2> belong to: {small, large}

    Some of these arguments are allowed to be

    optional.

    Of course, this wider general form only

    corresponds to part 4; in earlier parts, some of

    the above slots are missing, and thus some

    arguments should be allowed to be optional.

    Thus, with the above general form, we have used

    the specific questions of the first four parts of the

    token test as a seed for a wider set of questions

    to be targeted. Arguably, at this stage, one might

    further widen this set; for example, we could

    extend to the more general forms:

    <verb> the <dd1> AND the <dd2> AND … AND the <ddN>
    <verb1> the <dd1> … AND <verb2> the <dd2> …
    <verb1> the <dd1> <manner1> … AND <verb2> the <dd2> <manner2> …

    One could also extend towards more property

    categories: more colors, shapes etc.

    However, I will assume that these wider general

    forms and extension would constitute an

    overgeneralization given the specific questions

    comprising parts 1-4. Thus, we will stop at the

    initial level of generality that we proposed.

    10 In this example, let us suppose Infrontof(14,2)

    = (14,0). Notice that here we silently assumed

    that Infrontof returns a number of object

    properties and not a virtual object as its output.
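    As a concrete illustration of the level of generality just adopted (a toy grammar for this discussion, not the parser actually used with Ripley), the chosen general form can be matched with a single regular expression:

```python
import re

# One optional-size, optional-color, mandatory-shape noun phrase.
NP = r"(?:(small|large)\s+)?(?:(red|green|yellow|blue|white)\s+)?(circle|square)"
PART4 = re.compile(rf"touch the {NP}(?: and the {NP})?$", re.IGNORECASE)

def parse_part4(utterance: str):
    """Return the description-level statements for a parts 1-4 command,
    or None if the utterance falls outside the chosen general form."""
    m = PART4.match(utterance.strip())
    if not m:
        return None
    groups = m.groups()
    statements = []
    for size, color, shape in (groups[0:3], groups[3:6]):
        if shape:                     # the second conjunct may be absent
            statements.append(("Touch", {"size": size, "color": color, "shape": shape}))
    return statements

print(parse_part4("Touch the small red circle and the large green square"))
# [('Touch', {'size': 'small', 'color': 'red', 'shape': 'circle'}),
#  ('Touch', {'size': 'large', 'color': 'green', 'shape': 'square'})]
```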

    Now, let us step through the requirements the

    chosen general form of questions and their

    expected resulting actions impose on the

    representations and processes of the situation

    model. The representation – process

    enumerations given in [Mavridis 2005] will be

    used here, and are repeated below for easy

    reference:

    Representations contained in the agent model:

    R1) Situation model: represents external reality

    (not only physical necessarily)

    R2) Internal state (goals, affect)

    R3) Past states/situations (includes events)

    R4) Future predictions

    R5) Action list

    Processes contained in the agent model:

    P1) Sensor-to-SituationModel process

    P2) SituationModel-to-InternalState process

    P3) Storage/Maintenance of Past process

    (includes event recognition)

    P4) Future Prediction Generation process

    P5) Action Selection Process

    (SituationModelState-to-action table or planner)

    Representations:

    The questions impose a need for three property

    dimensions in the description of each object

    (R1), namely <shape>, <color> and <size>. At

    the categorical-layer level (R1), these property

    vectors should be quantized to the following

    categories: {circle, square}, {red, green, yellow,

    blue, white}, {small, large}. Notice that no

    requirement for a position property dimension

    has yet arisen, but nevertheless will soon arise.

    The action required by the questions necessitates

    a new entry at the agent’s empty action list (R5).

    This entry will correspond to the “Touch” action,

    which requires at this stage only a position as its

    argument. Due to the general form of the

    questions, this position must be derived from an

    object11. This necessitates the addition of a

    further fourth property dimension to the

    description of each object (R1), namely

    <position>. Notice that no categorical-layer

    requirements yet exist for this property; positions

    needn’t ever be categorized or verbalized yet.

    For the purpose of providing arguments to the

    “Touch” action, the position property dimension

    only needs to have a continuous-layer

    11 I.e. we don’t currently have touch <position description>, such as “touch at the left”, but touch <object description>: “touch the green circle”.

    representation (i.e. a single vector value).

    [humanly-understandable property dimensions]

    Here it is worth clarifying a crucial point before

    proceeding. What do we mean by saying that a

    requirement for three property dimensions (for

    example <size> and <color>) has

    arisen? Couldn’t we for example get a single

    composite multi-dimensional property, break it

    up using any possible local coordinate system, or

    even collapse all three to a single-dimensional

    one given a suitable transformation? Of course

    we could, as long as the new representation was

    informationally equivalent to the original one

    which consisted of the three humanly-intelligible

    property dimensions. But such a new

    representation would complicate our

    understanding of the actual processes that would

    comprise the sensory-to-situation model

    processing stages. For example, instead of

    deriving “size” from the object-individuated

    parts of the incoming image, some other feature

    vector with no direct human interpretation will

    have to be derived. Later, the “categorizer” at the

    fourth stage of the pipeline, would have to

    operate on this new feature vector in order to

    derive verbal size categories.

    This is why we decide to directly reflect in the

    situation model designs the property dimensions

    of objects that exist in natural language – such as

    <color> and <size>. Of course, such

    properties will not always be orthogonal or

    independent. However, if we needn’t impose a

    very strict definition of minimality in our

    representations, we are better off by sometimes

    possibly including an extra property dimension

    (even if somewhat redundant) than by dealing

    with the extra complications described above.

    [the test-giver object]

    Finally, an extra “dummy” object is inserted in

    the situation model, corresponding to the speaker

    of the heard utterances. The utterances are

    represented as property contents in the situation

    model, which are attributed to the dummy object

    (speaker), of which no physical visibility is

    assumed (a simplifying assumption). Thus, the

    speaker object is instantiated initially given no

    visual evidence, and contains only dummy

    property information (position, size, shape,

    color), plus a new property dimension: utterance

    (R1). Of course, in real-world implementations

    such as the robot Ripley, the speaker is not a

    simple object but an agent, and the properties do

    not only hold dummy contents [Mavridis and

    Roy 2005]. For example, the “position” property

    might be estimated through a face detector.

    However, given the purpose of our exposition,

    this is not required. In conclusion, here we need

    to augment our object property dimensions with

    the “utterance” dimension, and we need to insert

    a dummy “speaker” object in the situation

    model. We will soon show how specific

    processes update the contents of the “utterance”

    property so that its continuous layer contains a

    feature representation of the last speech segment

    heard, and its categorical layer contains the

    recognized text string.
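    Collecting the representational requirements so far in one place, a minimal object record might look roughly as follows; this is only an illustrative sketch of the structure described in the text, with field names chosen here for readability rather than taken from the actual system:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class GSMObject:
    """One object in the situation model, with a continuous and a categorical
    layer for each property dimension required by parts 1-4 of the test."""
    # Continuous layer (vector values).
    position: Optional[Tuple[float, float, float]] = None
    size: Optional[float] = None
    color: Optional[Tuple[float, float, float]] = None
    shape_features: Optional[Tuple[float, ...]] = None
    utterance_features: Optional[Tuple[float, ...]] = None  # filled only for the speaker
    # Categorical layer (verbalizable categories; position needs none yet).
    size_category: Optional[str] = None    # {"small", "large"}
    color_category: Optional[str] = None   # {"red", "green", "yellow", "blue", "white"}
    shape_category: Optional[str] = None   # {"circle", "square"}
    utterance_text: Optional[str] = None   # recognized text string
    is_speaker: bool = False               # marks the dummy "speaker" object

# The dummy speaker object is instantiated with no visual evidence;
# only its utterance property dimension will ever hold real contents.
speaker = GSMObject(is_speaker=True)
```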

    Processes:

    Here, we will first consider the sensory stream of

    vision (non-sign), and the four stages of

    processing that mediate between external reality

    and the situation model. Then, we will turn our

    attention towards the sign-stream (hearing

    utterances), and the three stages of processing

    that mediate between the utterances (belonging

    to the situation model) and the resulting actions.

    a) The non-sign stream (vision) (P1a):

    After having a look at the two object

    arrangements given in the Token Test, we can

    propose that the general set to which the physical

    objects causing the sensory stimuli belong is:

    Rectilinear equidistant arrangements of circles or

    squares of small or large size, with uniform Blue

    or Green or Yellow or White or Red color. The

    circles / squares are cut out of plastic and placed

    on a uniform background.

    Arguably, we could extend this set to a wider

    one, allowing more complicated arrangements,

    more sizes, colors, shapes, more complicated

    background, or even three-dimensional objects.

    However, we will choose not to impose such a wide requirement here, because we believe it would be outside the spirit of the designer of the Token Test.

    The overall purpose of the external reality – to –

    situation model stream for non-signs is to create

    a categorical-layer and continuous-layer

    representation that is of adequate fidelity in order

    to:

    a) drive referent resolution (through the

    categorical-layer size, shape and color

    representations),

    b) provide parameters for motor actions on

    objects (through the continuous-layer position

    representation).

    These requirements (successful referent

    resolution and motor manipulation) provide the

    criterion for the adequacy of the 4-stage

    pipeline.12 These requirements can also be

    broken down into specific requirements for each of

    the processing stages:

    The first of the four stages is the sensory

    projection stage. Here, any device that provides a sensory stream of adequate fidelity to support the resulting categorical representation at the end of

    the pipeline will be sufficient. Then, the object

    individuation stage follows. Here, the sensory

    stream must be segmented into parts that

    correspond to the physical objects that had

    caused them. Later, the property values have to

    be extracted. Here, we need four processes,

    which given the segmented parts corresponding

    to each physical object, will calculate its

    property values. Thus, we need processes for

    position estimation, size estimation, color

    estimation, and shape estimation. These

    estimators must provide estimates of adequate

    quality in order to ultimately drive referent

    resolution and motor action parameters, as we

    said above. The fourth and final stage is that of

    categorical quantization. Here, we need

    categorical classifiers for size, color and shape

    (categorization of position not required by Token

    Test, as explained above). These have to be

    tuned so that they correspond with the tester’s

    categories.

    In conclusion, the need for the following

    processes has arisen:

    P1.a1) Vision sensor

    P1.a2) Vision stream segmenter

    P1.a3a) Shape estimator

    P1.a3b) Color estimator

    P1.a3c) Size estimator

    P1.a3d) Position estimator

    P1.a4a) Shape category classifier

    P1.a4b) Color category classifier

    P1.a4c) Size category classifier

    12 We can judge success in two ways: either by

    comparing to some known objective external

    reality (i.e. which physical objects we have

    placed where), or by comparing to the tester’s

    subjective representation of reality. In the first

    case we essentially judge fidelity, and in the

    second, alignment between two agents.
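    The chain of P1.a processes amounts to a four-stage composition. The sketch below only fixes the interfaces between the stages; the actual segmentation, estimation, and classification functions are left to the caller, and none of the names come from the Ripley system:

```python
from typing import Callable, Iterable, List, Dict

def vision_to_situation_model(
    frame,                                   # P1.a1: output of the vision sensor
    segment: Callable[[object], Iterable],   # P1.a2: one region per physical object
    estimate_shape: Callable,                # P1.a3a
    estimate_color: Callable,                # P1.a3b
    estimate_size: Callable,                 # P1.a3c
    estimate_position: Callable,             # P1.a3d
    classify_shape: Callable,                # P1.a4a -> {"circle", "square"}
    classify_color: Callable,                # P1.a4b -> {"red", ..., "white"}
    classify_size: Callable,                 # P1.a4c -> {"small", "large"}
) -> List[Dict]:
    """Four-stage non-sign pipeline: sensing, object individuation, property
    estimation, categorical quantization (position stays continuous-only)."""
    objects = []
    for region in segment(frame):
        shape, color = estimate_shape(region), estimate_color(region)
        size, position = estimate_size(region), estimate_position(region)
        objects.append({
            "continuous":  {"shape": shape, "color": color,
                            "size": size, "position": position},
            "categorical": {"shape": classify_shape(shape),
                            "color": classify_color(color),
                            "size": classify_size(size)},
        })
    return objects
```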

    b) The sign stream (speech) (P1b):

    Now, let’s consider the hearing-to-

    utterance-property stream. We will use the

    simplifying assumption that the visual and aural

    streams do not fuse after the second stage of their

    pipelines, but are separate. First of all, we need a

    sensor with adequate coverage and quality in the

    human speech spectrum. Then, we need an

    utterance segmenter stage. In place of a property-

    extractor, we will most probably have some

    feature-extraction for speech recognition.

    Finally, the categorical classifier output should

    consist of the text strings produced by the rest of

    the speech recognizer.

    In summary: the air pressure waves are translated

    into an audio stream by the hearing sensor, and

    the stream is segmented into utterances. Then,

    the utterances are analyzed into features which will fill

    the continuous layer of the “utterance” property

    dimension of the dummy speaker object. Finally,

    the speech classifier is fed with the features and

    produces a text string, which fills the categorical

    layer of the “utterance” property dimension. In

    most real-world implementations, the hearing

    stream segmenter, speech feature estimator, as

    well as speech classifier are all parts of the

    speech recognition software. Once again, the

    overall requirement for the four stages of the

    pipeline is that the incoming speech stream is

    correctly segmented and translated into text.

    In conclusion, the need for the following

    processes has arisen:

    P1.b1) Hearing sensor

    P1.b2) Hearing stream segmenter

    P1.b3) Speech feature estimator
    P1.b4) Speech classifier (language-dependent)

    c) The action selection process (P5)

    Here, the instantaneous “utterance” property

    dimension contents of the dummy speaker object

    have to go through the three stages of translation

    to sequences of motor actions, which were

    described in the third section of this paper. First,

    the utterance contents are broken into actions by

    a motor-action selection process. However, the

    output is still at the Description-Level. Thus, an

    object referent resolution process translates it to

    the Object-Level, and a motor-parameter extractor process finally gets us to the Action-

    Level. The resulting actions are executed in their

    order of production, by the motor execution

    process.

    To recap, we need the following processes:

    P5.1) Action selector process

    P5.2) Object Referent Resolution process

    P5.3) Motor parameter extractor process

    P5.4) Motor executor process

    Summary:

    A summary of the requirements needed so far: In

    terms of representations, a situation model

    structure containing objects described along five

    property dimensions (shape, color, size, position,

    utterance), including a dummy object

    representing the speaker of the utterance. In

    terms of processes, all of the P1.a processes

    required for the visual (non-sign) – to –

    situationmodel stream, all of the P1.b processes

    required for the aural (sign) – to – situationmodel

    stream, and processes P5.1-P5.4 required for the

    situationmodel-to-action stream.

    5. The fifth and final part of the Token Test

    Here, I have attempted to collect the questions of

    the fifth part in groups, according to the

    requirements they impose on the situation model.

    Then, the groups have been ordered into a rough

    and subjective scale of increasing difficulty /

    complexity, which also follows any prerequisite

    relations among the groups. The resulting groups

    are:

    1. Novel Actions and Manners:

    (Q41-44, Q47-48, Q52, Q54, Q58, Q60)

    2. Set Referents: (Q50, Q53, Q55)

    3. Conjunctions/Disjunctions: (Q45-46)

    4. Explicit Temporal sequencing: (Q59, Q61)

    5. Conditionals: (present/future): (Q49, Q51)

    6. Corrections: (Q56, Q57)

    Now let’s consider each group of questions in

    turn, use it as a seed for a wider requirement, and

    see how the existing representations and

    processes must be augmented in order to account

    for it.

    Part 5.1: Novel Actions and Manners:

    Q41.Put [the red circle] on [the green square]

    Q42.Put [the white square] behind [the yellow

    circle]

    Q43.Touch [the blue circle] with [the red square]

    Q44.Touch -with [the blue circle]- [the red

    square]

    Q47.Put [the green square] away from [the

    yellow square]

    Q48.Put [the white circle] in front of [the blue

    square]

    Q52.Put [the green square] beside [the red circle]

    Q54.Put [the red circle] between [the yellow

    square] and [the green square]

    Q58.Together with [the yellow circle], take [the

    blue circle]

    Q60.Put [the blue circle] underneath [the white

    square]

    [the general forms]

    Two new verbs are introduced (“put” and

    “take”). These, together with “touch” that we had

    encountered before, are now also followed by

    various prepositional phrases (starting with “on”, “behind”, “with”, “away from”, etc.). By

    using the above set of questions as a seed, we

    will assume that the general set that they belong

    consists of the following surface forms:

    Put <dd1> <prep> <dd2>
    Put <dd1> underneath <dd2>
    Put <dd1> between <dd2> and <dd3>
    Touch <dd1> with <dd2>
    Touch -with <dd2>- <dd1>
    Together with <dd1>, take <dd2>
    Where:
    <ddN> = <definite description>
    <definite description> = the <size> <color> <shape>
    (possibly with <size> and/or <color> missing)
    <prep> belongs to {on, behind, away from, in front of, beside}

    [necessary augmentations]

    First of all, we will introduce two new motor

    actions. The action list already contained

    “touch”, and will now be augmented by “get”

    and “put”. We will assume that these motor

    actions only take one argument, namely a target

    position13. The verbs appearing at the utterance

    level do not necessarily correspond to a single

    action. For example, the surface form “put the

    circle beside the square” will translate to:

    <Get, “the circle”>, <Put, Beside(“the square”)>

    Note here that the general form of Q60 creates

    special problems, and will be dealt with

    separately. Now, let us define rules for

    translation of the surface forms of the utterance

    to the Description-level, the Object-Level and

    the Action-Level consecutively. Starting with the

    translation to the Description-level, process P5.1

    (action selector) should realize rules for the

    13 More advanced object grasping motor routines

    might necessitate more arguments, for example

    the size or shape of the object to be grasped.

    following mappings:
    Put <dd1> <prep> <dd2> →
    <Get, dd1>, <Put, prep(dd2)>
    Put <dd1> between <dd2> and <dd3> →
    <Get, dd1>, <Put, between(dd2, dd3)>
    Touch <dd1> with <dd2> →
    <Get, dd2>, <Touch, dd1>
    Touch -with <dd2>- <dd1> →
    <Get, dd2>, <Touch, dd1>
    Together with <dd1>, take <dd2> →
    <Get, dd1>, <Get, dd2>14
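    Read as a rule table, the description-level mappings above (as reconstructed here) would amount to something like the following sketch, in which dd1, dd2, dd3 stand for the not-yet-resolved definite descriptions:

```python
def part51_rules(verb, prep, dd1, dd2, dd3=None):
    """Return the ordered description-level statements for one utterance.
    A simplified sketch of process P5.1 as a fixed rule table."""
    if verb == "put" and prep == "between":
        return [("Get", dd1), ("Put", ("between", dd2, dd3))]
    if verb == "put":                 # on, behind, away from, in front of, beside
        return [("Get", dd1), ("Put", (prep, dd2))]
    if verb == "touch":               # "touch dd1 with dd2" (either word order)
        return [("Get", dd2), ("Touch", dd1)]
    if verb == "take":                # "together with dd1, take dd2"
        return [("Get", dd1), ("Get", dd2)]
    raise ValueError(f"surface form not covered: {verb}")

print(part51_rules("put", "beside", "the green square", "the red circle"))
# [('Get', 'the green square'), ('Put', ('beside', 'the red circle'))]
```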

    Continuing to the Object-level, the referent

    resolution process (P5.2) remains as is. Finally,

    in order to reach the Action-level, the motor

    parameter extractor process (P5.3) will have to include a few trivial rules, replacing each resolved object ID by the relevant continuous-layer properties (referent resolution itself being already part of P5.2), together with the “manner” functions defined below.

14 Here we assume that the robot’s hand has the ability to hold two or more objects at the same time, or else that the robot has two hands and the second execution of the “get” motor routine will divert to the second hand when the first is full. A deep treatment would necessitate the introduction of a self-object in the situation model. The self-object should have some form of a “gripper” property dimension, holding the number of the objects that are currently held. As the objects that are being held have to be referable too, so that they can serve in subsequent “put” statements, some way of attaching objects to the gripper is required. A fuller treatment would require the introduction of relations among objects, with a special “possession” relation linking the self-object with the objects held.

    [the “manner” functions]

Finally, we have to define the “manner” functions, realizing the prepositions. Their output should be a target position, and their input might consist of any subset of the properties of the landmark object. For example, let us consider . The target position of “put” can be defined as:

    (pos_x, pos_y, pos_z) = On(obj #n) where:

    pos_x = pos_x(obj #n)

    pos_y = pos_y(obj #n)

    pos_z = pos_z(obj #n) + size_z(obj #n) / 2

Let us also consider “infrontof”. Assuming a y-axis moving away from the tester, we might define “infrontof” as:

    (pos_x, pos_y, pos_z) = infrontof(obj #n) where:

    pos_x = pos_x(obj #n)

    pos_y = pos_y(obj #n) - size_y(obj #n) + offset

    pos_z = pos_z(obj #n)

These definitions are only a very rough first approximation, though. More complex functions for spatial prepositions can be defined, taking as arguments not only the landmark object but also all of the rest of the objects (the situational context). However, we will not get into such detail here. Some relevant ideas can be found in [Regier and Carlson 2001, Gorniak and Roy 2004]. All of the other prepositional “manners” are defined along similar lines (except for “between”, which accepts two landmark objects as arguments). Their rules are realized by appropriately augmenting process P5.3.
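To make the rough geometric definitions above concrete, here is a minimal Python sketch of the On and infrontof manner functions. The object fields (pos, size) are hypothetical names, and the midpoint used for “between” is a plausible choice that the text does not specify:

```python
# Sketch of the "manner" functions realizing spatial prepositions, following the
# rough first-approximation definitions in the text. Field names (pos, size) are
# illustrative; "between" uses the midpoint of the two landmarks as an assumption.

from dataclasses import dataclass


@dataclass
class Obj:
    pos: tuple    # (x, y, z) position of the object
    size: tuple   # (x, y, z) extents of the object


def on(landmark: Obj) -> tuple:
    """Target position for putting something on top of the landmark."""
    x, y, z = landmark.pos
    return (x, y, z + landmark.size[2] / 2)


def infrontof(landmark: Obj, offset: float = 0.05) -> tuple:
    """Target position in front of the landmark (y-axis pointing away from the tester)."""
    x, y, z = landmark.pos
    return (x, y - landmark.size[1] + offset, z)


def between(landmark1: Obj, landmark2: Obj) -> tuple:
    """Target position between two landmarks: here simply the midpoint of their positions."""
    return tuple((a + b) / 2 for a, b in zip(landmark1.pos, landmark2.pos))


if __name__ == "__main__":
    square = Obj(pos=(0.2, 0.3, 0.02), size=(0.05, 0.05, 0.04))
    circle = Obj(pos=(0.4, 0.3, 0.02), size=(0.05, 0.05, 0.04))
    print("on:", on(square))
    print("infrontof:", infrontof(square))
    print("between:", between(square, circle))
```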

    [the problematic case of “underneath” -

    conclusion]

    Thus, we have successfully augmented the action

    list R5 and the processes P5.1-P5.3, in order to

    cope with the requirements set by all the

questions considered, except Q60. The problematic general form of Q60 is considered in detail in Appendix B, in order not to interrupt the flow of the text.

    Part 5.2: Set Referents: (Q50, Q53, Q55)

    Q50.Pick up [the squares, except the yellow one]

Q53.Touch [the squares] slowly and [the circles] quickly

    Q55.Except for [the green one], touch [the

    circles]

    [general form]

    The general surface form of the statements is:

    (at a first level of generality)

    Touch slowly and quickly

Where: dd1 might return not only a single referent but also a set of referents, and where the description of the set might include a plural plus an exception. In the case of Q55, the description might be broken down into two parts (exception and plural), with the verb positioned between them.

    [necessary augmentations]

    Here, the object referent resolution process

    (P5.2) has to be augmented to account for the

    possibility of set referents. Also, two new

    manners have to be introduced for the motor

    action touch – namely the manners slowly and

    quickly. Furthermore, in terms of surface form

    difficulty, one more problem arises: the set

    referent descriptions might be broken into two

    disconnected parts, separated by the action verb.

    Let us see how we will deal with these

    complications. First, we introduce the following

    utterance-to-description level rules:

    (where action = get for “pickup” or “take”,

    action = touch for “touch”)

Touch →

    (where manner = slowly or quickly)

    (where =

    and where action = get for “pickup” or “take”,

    action = touch for “touch”)15

    Thus, we have successfully reached the

15 We do not attempt to extend plural referents to “put down” here. We have chosen not to do so because of the complications that might arise when the second object set consists of more than one object, as for example in: “put the circles on the squares”.

    Description-level. The augmented referent

    resolution process (P5.2) will then be applied on

    the Description-level representations, and we

    will get results of the form:

    We see that the single argument of the action

    now consists of multiple objects, instead of one.

    However, this is not allowable given the action

    forms that we have introduced in the action list

    (R5). We thus have to introduce one further

    process, which is applied after the object referent

    resolution process (P5.2) but before the motor

    parameter extractor process (P5.3). This process

    will be called “repeated action expansion

    process” (P5.2a), and will realize the following

    rule:

    ,

    ,

    For example:

    will be expanded to:

    ,

    ,

    .
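A minimal sketch of the repeated action expansion process (P5.2a) is shown below, under the illustrative assumption that resolved actions are simple (verb, objects) pairs:

```python
# Sketch of the repeated action expansion process (P5.2a): an action whose single
# argument resolved to a set of objects is expanded into one action per object.
# The (verb, objects) tuple representation is an illustrative assumption.

def expand_repeated_actions(actions):
    """Expand actions over set referents into sequences of single-object actions."""
    expanded = []
    for verb, objects in actions:
        for obj in objects:          # one copy of the action per resolved referent
            expanded.append((verb, obj))
    return expanded


if __name__ == "__main__":
    # e.g. "Pick up the squares, except the yellow one" -> get over objects #2 and #5
    resolved = [("get", ["obj#2", "obj#5"])]
    print(expand_repeated_actions(resolved))
    # -> [('get', 'obj#2'), ('get', 'obj#5')]
```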

    Finally, after the application of P5.2a, P5.3 is

    applied. P5.3 has to be augmented with new

    rules that account for the new manners

    associated with “touch”. Touch, as represented in

    the action list (R5), accepts only a single

    argument so far: the target position to be

    touched. “Slowly” as a manner does not modify

    the position argument; instead, it necessitates

    either a new motor routine for touch_slowly, or

    the addition of an extra argument to the touch

action, namely a speed argument. Here we will adopt the latter method. Thus, the action list entry

    for touch (R5) will be augmented with a second

    argument, namely speed, which will have three

    discrete values (slow, normal, quick), and which

    will default to normal speed when no argument

    is given through the manner. I.e., the following

    rules will be implemented in the Motor

    Parameter Extractor Process (P5.3):

Touch → Touch(pos=pos(obj #n), speed=normal)
Touch slowly → Touch(pos=pos(obj #n), speed=slow)
Touch quickly → Touch(pos=pos(obj #n), speed=quick)
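A minimal sketch of these P5.3 rules, assuming a simple dictionary representation for Action-level entries (the field names are hypothetical):

```python
# Sketch of the motor parameter extractor rules (P5.3) for touch with an optional
# speed manner. The dictionary representation of Action-level entries is an
# illustrative assumption; the speed defaults to "normal" when no manner is given.

def extract_touch_parameters(target_position, manner=None):
    """Map a Description-level touch (plus optional manner) to an Action-level entry."""
    speed = {"slowly": "slow", "quickly": "quick"}.get(manner, "normal")
    return {"action": "touch", "pos": target_position, "speed": speed}


if __name__ == "__main__":
    print(extract_touch_parameters((0.2, 0.3, 0.02)))               # default speed
    print(extract_touch_parameters((0.2, 0.3, 0.02), "slowly"))     # speed = slow
    print(extract_touch_parameters((0.4, 0.3, 0.02), "quickly"))    # speed = quick
```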

    Thus, we have finally successfully reached the

    Action-level. Only the motor executor process

    requires augmentation now; indeed, motor

    routines for ,

    have to be created, and the routine for touch

    should be augmented with a speed argument:

    .16

    Part 5.3: Conjunctions/Disjunctions: (Q45-46)

    Q45.Touch [the blue circle] and [the red square]

    Q46.Pick up [the blue circle] or [the red square]

    [the general forms]

    and

    or

    Where:

    belongs to {touch, pickup, take}

    and are definite descriptions of the

    form: , possibly with

    missing and / or

    [necessary augmentations]

    In our previous discussion on parts 1-4 of the

test, it was assumed that the action selector process

    (P5.1) was able to deal with the utterance form:

    and , and that it

    realized a rule of the form:

and →

    ,

    It is trivial to modify this rule superficially to

    account for disjunctions. For example, we might

    decide to just perform the action on the first

    referent, and never on the second:

or →

16 As an extension, if we wanted to be able to use the manners “quickly” and “slowly” not only

    with touch but also with get and put, then we

    would need a further augmentation in P5.3, to

    account for the possibility of multiple manners.

    For example: “put the red square quickly on the

    yellow circle” (which includes the manners

    and ). This would further

    necessitate compatibility checks among manners,

    or more complicated compositional rules.

    Consider for example: “put the red square

    underneath the green circle on the green circle”

    which would be illegal.

    This is the augmentation to P5.1 that we will

    propose here. Notice, however, that such an

    interpretation would oppose the Gricean maxim

    of quantity [Grice 1975]: why ever mention the

    second referent if it is never going to be useful?

    A deeper interpretation of disjunctions would

    require the introduction of concepts such as

    “feasibility” of action execution, or

    “preferability” of one action over another given

    some execution cost or ordering. Thus, a deeper

treatment of disjunctions would give them the

    following procedural meaning:

    or means:

    Out of the two actions ( ,

) select the one that is preferable, or simply the one that is feasible if the other is not.
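A sketch of such a deeper disjunction rule follows, with hypothetical feasibility and cost predicates; the trivial interpretation adopted in this paper corresponds to always returning the first action:

```python
# Sketch of a deeper treatment of disjunctions: out of the two candidate actions,
# select the preferable one, or simply the feasible one if the other is not feasible.
# The feasible() and cost() predicates are hypothetical placeholders.

def select_disjunct(action_a, action_b, feasible, cost):
    """Return the action to execute for '<action_a> or <action_b>'."""
    a_ok, b_ok = feasible(action_a), feasible(action_b)
    if a_ok and b_ok:
        return action_a if cost(action_a) <= cost(action_b) else action_b
    if a_ok:
        return action_a
    if b_ok:
        return action_b
    return None  # neither action is feasible


if __name__ == "__main__":
    # Toy example: feasibility depends on whether the referent is currently on the table
    on_table = {"blue circle"}
    feasible = lambda act: act[1] in on_table
    cost = lambda act: 1.0
    print(select_disjunct(("get", "blue circle"), ("get", "red square"), feasible, cost))
```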

    Regarding feasibility of actions, it is worth

    clarifying one further point. In our proposal so

    far, we have silently assumed that we expect the

    tester to provide us with singular definite

    descriptions which resolve to a single existing

    referent17. Requested actions are silently

    assumed to be always feasible. The case of an

    unresolvable or an ambiguous definite

    description has not been explicitly dealt with.

    How do we augment our system to deal with

    such cases? A simple but effective proposal

    follows:18

    a) Unresolvable definite description:

Motor action is not executed. Instead, the robot produces canned speech, informing the tester that “no such object exists”. Note that this lies outside the spirit of our proposal so far; none of the token test questions demand speech production from the child.

    b) Ambiguous definite description:

Motor action is not executed. Instead, the robot produces canned speech, requesting an inform

    speech act from the user: “Do you mean this

    one?” (touching one possible referent) “That

    one?” (touching another) “Or this one?”

    (touching the third). Again, this is just a simple

    canned behavior. A deeper treatment would

    require a more complicated tester model.

Treating the tester as a physical object with utterance properties, as we do here, is not enough. Also, a rational action selection mechanism covering not only motor actions but also speech actions would be required.

17 Plural definite descriptions that resolve to set referents were discussed above in Part 5.2.

18 Implemented in [Mavridis and Roy 2005].
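A sketch of this simple proposal, assuming a referent resolution routine that has already returned the set of matching objects (all routine names are illustrative placeholders):

```python
# Sketch of the proposed handling of unresolvable and ambiguous definite descriptions.
# execute_action(), say() and touch() stand in for the real motor and canned-speech
# routines; the list of matching objects is assumed to come from referent resolution.

def respond_to_definite_description(matches, execute_action, say, touch):
    """Dispatch on the number of objects matching a definite description."""
    if len(matches) == 0:
        say("No such object exists.")           # a) unresolvable description
    elif len(matches) == 1:
        execute_action(matches[0])              # normal case: exactly one referent
    else:
        for i, obj in enumerate(matches):       # b) ambiguous description
            say("Do you mean this one?" if i == 0 else "Or this one?")
            touch(obj)


if __name__ == "__main__":
    log = []
    respond_to_definite_description(
        matches=["obj#3", "obj#7"],
        execute_action=lambda o: log.append(("get", o)),
        say=lambda s: log.append(("say", s)),
        touch=lambda o: log.append(("touch", o)),
    )
    print(log)
```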

    In conclusion, we have chosen here to follow the

    trivial interpretation of disjunctions, arbitrarily

    selecting only the first action for execution.

    However, we have also discussed what would be

    necessary for a deeper treatment.

Part 5.4: Explicit Temporal Sequencing: (Q59, Q61)

    Q59.After picking up [the green square], touch

    [the white circle]

    Q61.Before touching [the yellow circle], pick up

    [the red square]

    [general form]

    After -ing ,

    Before -ing ,

    where:

    , belong to

    {touch, pick up, take}

    and are definite descriptions of the

    form: , possibly with

    missing and / or

    [necessary augmentations]

    Again, a slightly superficial but nevertheless

    adequate implementation exists. We just have to

    augment the process P5.1 (action selector

    process), which transforms the utterances to the

    Description-level, with the following rules:

After -ing , →

    ,

Before -ing , →

    ,

A deeper interpretation would remove

    immediacy of the request, and would involve

    planning. However, we feel such an

    interpretation is not what is really required by the

    token test.
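A minimal sketch of these two P5.1 rules, treating the utterance as already split into its two clauses (the tuple representation of actions is illustrative):

```python
# Sketch of the P5.1 rules for explicit temporal sequencing: "after" keeps the
# surface order of the two requested actions, "before" swaps them. Actions are
# represented as simple (verb, referent) tuples for illustration.

def sequence_actions(connective, mentioned_first, mentioned_second):
    """Return the two actions in the order in which they should be executed."""
    if connective == "after":      # "After doing A, do B"  -> A then B
        return [mentioned_first, mentioned_second]
    if connective == "before":     # "Before doing A, do B" -> B then A
        return [mentioned_second, mentioned_first]
    raise ValueError("unknown temporal connective: %s" % connective)


if __name__ == "__main__":
    # Q59: After picking up the green square, touch the white circle
    print(sequence_actions("after", ("get", "green square"), ("touch", "white circle")))
    # Q61: Before touching the yellow circle, pick up the red square
    print(sequence_actions("before", ("touch", "yellow circle"), ("get", "red square")))
```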

    Part 5.5: Conditionals: (present/future)

    Q49.If [there is [a black circle]], pick up [the red

    square]

    Q51.When [I touch [the green circle]], you take

    [the white square]

    Here we have a huge leap of complexity

    compared to the previous parts. First of all,

    conditions are introduced, which are intricately

    bound to the notion of events. Then, an indefinite

    description appears: “a black circle”. Last, but

    most importantly, the loaded personal pronouns

    “I” and “you” appear. We will now have to

    decide on an appropriate generality and depth of

    treatment.

    [general form]

    if , {}

    (includes present, but not future)

    when , {you}

    {}

    (normally doesn’t include present, only future)

    Where:

    belongs to the form:

    there is

    belongs to the form:

    I 19

    And where:

    is definite description

    is indefinite description etc.

    [suggested augmentations]

    An introduction of a notion of events, although

    not absolutely necessary at this stage, would

    provide the right amount of generality and

    expandability. Thus, we have to start devising

    representations in R3. Notice that although

    conditional tests for existence (“there is a circle”)

    do not necessarily involve the past, action

    conditions do: “when I put the circle on the

square” cannot be tested through the experience of a single instant alone. Although “when I put the circle on the square” refers to the time instant when an action performed by the tester has been completed, testing whether the action has been completed requires some account of the whole duration of

    the action. Thus, we need a history of temporal

    “snapshots” justifying that indeed the tester has

    performed the put action. Just seeing the circle

    on the square, and maybe the user’s hand

    touching the circle, is not enough justification

that the condition holds. Thus, the introduction of events and some form of representations of the past (R3) becomes inevitable.

19 We won’t extend to manners etc., although we could easily do so. We will also not extend to observing the agent’s own actions; only actions performed by the tester can initially be observed.

    Let us now decide on an appropriate form of

    event representation. I propose the form:20

    21

    Where:

    belongs to {exists, touch, get, put}

    If =exists,

    then participant 1 is obj #n that exists.

    If =touch, get, put,

    then participant 1 is the object corresponding to

    the tester’s body, while participant 2 is the obj #n

    that is being manipulated.

    , : time landmarks in

some referable subjective timescale (ticks etc.)

    Exists events are continuously updated so that

    the end time is equal to the current time, as long

    as the participant still exists.

    Manipulation events are continuously updated so

    that the end time is equal to the current time, as

    long as the action has not finished yet.

    flag becomes true when the

    event has finished, and the end time is less than

    the current time
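A sketch of this event representation as a small data structure follows; the field names are illustrative, and the time landmarks are taken to be integer ticks:

```python
# Sketch of the proposed event representation: type, participants, start/end time
# landmarks on a subjective timescale, and a "finished" flag. Field names are
# illustrative assumptions.

from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    event_type: str          # one of: "exists", "touch", "get", "put"
    participants: List[str]  # participant 1 (and participant 2 for manipulation events)
    start_time: int          # tick at which the event started
    end_time: int            # tick of the latest observation of the event
    finished: bool = False   # set when the event has been completed


def update_ongoing(event: Event, current_time: int, still_holds: bool) -> None:
    """Keep an ongoing event's end time equal to the current time; close it otherwise."""
    if still_holds:
        event.end_time = current_time
    else:
        event.finished = True    # the end time now lags behind the current time


if __name__ == "__main__":
    e = Event("exists", ["obj#4"], start_time=0, end_time=0)
    for t in range(1, 4):
        update_ongoing(e, t, still_holds=True)
    update_ongoing(e, 4, still_holds=False)
    print(e)
```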

    Actions refer to sequences of GSM contents, and

    connect to the senses only indirectly through the

    GSM. Of course, the sensory streams should

    provide enough information to the GSM so that

    the action can be logged. In particular, for the

    second type of condition, i.e. “I

    ”, we need to be able to observe the actions

    of the “tester” object, which so far was only a

    dummy object associated with incoming

    utterances. Thus, we first need to be able to

    localize the tester in space, and then to derive a

    rich enough observation of his body from the

    senses so that we can successfully recognize

    when he performs the action verbs: get, put,

    touch. Also, we need to be able to recognize

    which physical objects are being implicated in

    the above actions – but hopefully our vision

    system would already provide enough clues for

    that.

20 Explicit representation of agents in a different form than physical objects has not been carried out so far. Thus, by “object” here we are referring to either the robot, the tester, or the physical objects.

21 For simplicity, we prefer the neutrality of the term “participants” instead of agent/patient roles.

    Thus, we need the following augmentations in

    the vision stream:

    P1.a1) Vision sensor: should be able to get the

    test-giver within the field of view, and get

sensory evidence of him that is of adequate quality for the subsequent action recognition

    P1.a2) Vision stream segmenter: should be able

    to segment the test-giver’s body

    P1.a3e) Action feature extractor: enough features

    should be extracted so that the performed actions

    can later be recognized by the event classifier

    through the resulting timeseries.

    There is no explicit need for quantized action

    features, as the language specification does not

    demand verbalization of instantaneous features

    of actions. Thus, no augmentation to P1.a4 is

    needed. With the augmentations made so far, we

    have a time series of rich features enabling

    action recognition in the GSM. Now, we need to

    augment suitably the (previously empty)

    components R3 and P3.

    In detail:

    R3) Past states/situations (includes events)

    The current GSM is augmented with a log of

    events, including events in progress as well as

    past events. No explicit detailed history of the

    past needs to be kept, except for this log which is

    embedded in the current GSM.22

P3) Storage/Maintenance of Past process

    (includes event recognition)

A set of four event classifier processes is needed, one specialized for each of the four event types (exists, get, put, touch). These

    processes examine the current state of the GSM,

    keep track of whatever information they need

    from the past internally, and produce ongoing

    event descriptions in the event log which is part

    of R3. When an event is finished, the flag is set true, and the event is

    updated in the log for one last time. For

    simplicity, we assume that no memory

    compression/forgetting process operates on the

log, although in systems that operate round-the-clock such a process would be necessary.
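The following sketch shows one of the four classifier processes, the one for “exists” events, maintaining the event log from successive GSM snapshots. The GSM is reduced here to a set of visible object identifiers, and all names are illustrative:

```python
# Sketch of an "exists" event classifier (part of P3): it watches the set of objects
# present in the current GSM state and maintains the corresponding events in the log.
# The GSM is reduced to a set of object ids for illustration.

class ExistsEventClassifier:
    def __init__(self, log):
        self.log = log    # shared event log (part of R3)
        self.open = {}    # object id -> index of its ongoing "exists" event

    def step(self, visible_objects, current_time):
        # open a new event for each newly visible object
        for obj in visible_objects:
            if obj not in self.open:
                self.open[obj] = len(self.log)
                self.log.append({"type": "exists", "participants": [obj],
                                 "start": current_time, "end": current_time,
                                 "finished": False})
        # update or close events for previously visible objects
        for obj, idx in list(self.open.items()):
            if obj in visible_objects:
                self.log[idx]["end"] = current_time
            else:
                self.log[idx]["finished"] = True
                del self.open[obj]


if __name__ == "__main__":
    log = []
    classifier = ExistsEventClassifier(log)
    classifier.step({"obj#1", "obj#2"}, 0)
    classifier.step({"obj#1"}, 1)    # obj#2 disappears: its event is closed
    print(log)
```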

    [from events to condition testing]

    22 If we need to answer questions about the state

    of the GSM during specific past moments etc.,

    then we also need some detailed history of past

    moments. However, this is not necessary in the

    token test. It is necessary for example in the

    specs of [Mavridis and Roy 2005].

So far, we have augmented the sensory stream

    (P1a) so that the test-giver’s actions can be

    observed, and augmented our representations and

    processes dealing with the past (R3, P3) so that a

    log of ongoing and past events is always

    accessible in the GSM. Now, we need to connect

    the events that are part of the event log with the

    conditions that we have posited in the general

    form of the requirements for this part of the test.

    The first form of the conditions is: “there is

    ”. This will correspond to the existence of

    an event with type “exists”, with a participant

    which satisfies , and with an ongoing end

    time equal to the current time.

    But what augmentations should we make to the

    processes of P5? I propose that process P5.1

(action selection) should be augmented with a

    conditional filter process (P5.1a). The

    conditional filter process will call the object

    referent resolution process (P5.2) in order to get

    a set of possible event participants that satisfy

    the indefinite description . Then, it will

    search in the event log, and conclude whether the

    condition is TRUE or not. The following

    translation rule should be realized in P5.1:

    if , {}

    IF condition is TRUE:

    {} , …

    IF condition is not TRUE:

    (Empty action list)
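A sketch of the conditional filter (P5.1a) test for this first condition form follows, assuming the event-log entries of the earlier sketch and a referent resolution step that has already produced the candidate participants:

```python
# Sketch of the conditional filter process (P5.1a) for conditions of the form
# "there is <id>": the condition is TRUE if some "exists" event in the log has a
# candidate participant and an ongoing end time equal to the current time.
# The event-log dictionary format is an illustrative assumption.

def existence_condition_holds(event_log, candidate_objects, current_time):
    """Return True if any candidate object currently exists according to the log."""
    for event in event_log:
        if (event["type"] == "exists"
                and event["participants"][0] in candidate_objects
                and event["end"] == current_time):
            return True
    return False


def conditional_filter(event_log, candidates, current_time, then_actions):
    """Realize: 'if <condition>, <actions>' -> actions if TRUE, empty list otherwise."""
    if existence_condition_holds(event_log, candidates, current_time):
        return then_actions
    return []


if __name__ == "__main__":
    log = [{"type": "exists", "participants": ["obj#9"], "start": 0, "end": 5,
            "finished": False}]
    # Q49-style: If there is a black circle, pick up the red square
    print(conditional_filter(log, candidates={"obj#9"}, current_time=5,
                             then_actions=[("get", "red square")]))
    print(conditional_filter(log, candidates={"obj#4"}, current_time=5,
                             then_actions=[("get", "red square")]))
```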

    The second form of the conditions is: “I ”. This will correspond to the

    existence of an event with type analogous to the

    action verb (get, put, touch), with the object

    corresponding to the test-giver as the first

    participant, with an object which satisfies

    as a second participant, and with an ongoing end

    time equal to the current time minus one tick (i.e.

    the flag should have just been

    raised). Note that this second form of the

    conditions is paired with the following utterance

    form:

    when , {you}

    {}

    An interesting aspect of this request speech act is

    that it cannot be instantaneously fulfilled, but

    often needs to wait for to hold in

the near future before the agent can fulfill the

    request. Thus, we need some way of getting the

    agent under test to repeatedly hear the request,

    until favorable conditions arise for its

    satisfaction, or until it has to be forgotten. It

    might have to be forgotten after a time-out

    period, or after one or more new requests have

    arisen.

    Here, we will propose the following way of

    handling these demands: we will assume that

    such an utterance is persistently hallucinated23 in

    the agent’s head until its conditions are satisfied

    or until a new utterance has arrived. Thus, what

    we need is a “hallucinatory” circuit in the senses.

    An internal action will be produced by the action

    selection process, which will “echo” the master’s

    voice repeatedly in the agent’s head, until the

    master has been served or he has demanded

    something new. The action list R5 will be

    augmented with an action of the form:

    Hallucinate_utterance

    The source object will be initially fixed to the

    object representing the test-giver.

    The effect of this internal action will be the

    writing of the utterance in the appropriate

    utterance property representation in the GSM, as

    if it had been freshly heard. In case new sensory

    information has come in, it will automatically

    overwrite the hallucinated utterance – i.e. real

    voices suppress persistent hallucinated voices.

    Furthermore, the translation rule that we need in

    P5.1 is:

    when , {you}

{} →

    IF condition is TRUE:

    {} , …

    IF condition is not TRUE:

    Hallucinate_utterance
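A sketch of this hallucination circuit is given below, under the simplifying assumption that a freshly heard utterance always overwrites the hallucinated one; the class and method names are illustrative:

```python
# Sketch of the "hallucinatory utterance" circuit: when a "when ..." request cannot
# yet be satisfied, the utterance is written back into the test-giver's utterance
# property as if freshly heard, so that it is re-interpreted on the next cycle.
# Real (sensory-derived) utterances overwrite the hallucinated one.

class UtteranceChannel:
    def __init__(self):
        self.current = None    # utterance attributed to the test-giver object

    def hear(self, utterance):
        """A real voice always suppresses a persistent hallucinated voice."""
        self.current = utterance

    def hallucinate(self, utterance):
        """Re-echo the request only if no new sensory utterance has come in."""
        if self.current is None:
            self.current = utterance

    def consume(self):
        """Read and clear the utterance for one interpretation cycle."""
        utterance, self.current = self.current, None
        return utterance


if __name__ == "__main__":
    channel = UtteranceChannel()
    channel.hear("when I touch the green circle, you take the white square")
    request = channel.consume()
    condition_true = False          # the tester has not yet touched the green circle
    if not condition_true:
        channel.hallucinate(request)    # keep the request alive for the next cycle
    print(channel.consume())
```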

    In this way, we are able to handle conditional

    action execution in the future, through repeated

    hallucination of the request of the test-giver. The

    hallucination scheme works well even when we

abandon the simplifying assumption that the

    hallucination stops being repeated when a new

    request comes in. But then, we need to devise

    some form of buffer of simultaneous “directives”

    that are being heard at each instant together with

    the sensory-derived utterances, and which guide

the agent’s behavior. We would then also need to devise ways to deal with incompatible requests etc., and we would slowly end up getting to all the known forms of internal conflict. Nevertheless, this simple idea of hallucinated requests can really achieve a lot. We can have agents that are verbally taught by their masters to respond appropriately to appropriate circumstances, we can have incoming sensory signs translated to hallucinated utterances (for example, in reading written texts), and much more.

23 This idea is reminiscent of the general theme of [Jaynes 1976].

    [conclusion]

    We have shown how present/future conditionals

    can be handled. This was achieved through the

    introduction of an event representation and an

    event log, through augmentation of the sensory

    machinery that enables the observation and

    recognition of the actions of the test-giver, and

    through a “hallucinatory utterance” circuit that

    persistently repeats a conditional action

    execution request until it is carried out or it is

    forgotten. More importantly, this primitive

    hallucinatory circuit provides an interesting

    pathway for expansion in the future that enables

    the implementation of many capabilities such as

    teaching, reading etc. Such capabilities are

    invaluable towards the creation of autonomous

    cooperative agents.

Part 5.6: Corrections: (Q56, Q57)

    Q56.Pick up [the red circle] -No!- [the white

    square]

Q57.Instead of [the white square], take [the yellow circle] (inverted syntax)

    [General form]

    -No!-

    Instead of ,

    Where:

    belongs to {touch, pick up, take}

    And , are definite descriptions, as

    defined previously.

    [necessary augmentations]

Once again, as was the case in Part 5.3 (conjunctions/disjunctions), there is a deeper phenomenon

    going on here. However, we will treat it slightly

    superficially, albeit quite effectively for the

    purpose of the test. A deep treatment of

    corrections requires an understanding of

    utterance production in the tester (covering the –

    No!- interruption), or a longer-term

    understanding of discourse (covering the contrast

    marked by “Instead”). However, this deeper

    understanding is not essential for responding

    appropriately to these utterances.

    We propose the following augmentations to

    process P5.1 (action selector), performing the

    utterance-to-description level translation:

-No!- →

Instead of , →

    I.e. we totally ignore the mistaken / contrasted

    part of the utterance, and proceed with executing

    the requested action with the correct argument.
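A minimal sketch of these two correction rules, treating the utterance as already split around the “-No!-” / “Instead of” markers (the names and tuple representation are illustrative):

```python
# Sketch of the P5.1 rules for corrections: the mistaken / contrasted referent is
# discarded and the action is executed with the corrected argument only.
# Utterances are assumed to be pre-split around the correction markers.

def correct_no(verb, wrong_referent, corrected_referent):
    """'<verb> <wrong> -No!- <corrected>' -> act on the corrected referent only."""
    return (verb, corrected_referent)            # wrong_referent is deliberately ignored


def correct_instead(ignored_referent, verb, chosen_referent):
    """'Instead of <ignored>, <verb> <chosen>' -> act on the chosen referent only."""
    return (verb, chosen_referent)               # ignored_referent is deliberately ignored


if __name__ == "__main__":
    # Q56: Pick up the red circle -No!- the white square
    print(correct_no("get", "red circle", "white square"))
    # Q57: Instead of the white square, take the yellow circle
    print(correct_instead("white square", "get", "yellow circle"))
```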

    6. Local minimality of the proposed design

    Here, a list of the main parts and blocks of the

    proposed design is given. We consider each

    block in turn, and observe the effects of its

    removal from the design, and show which part of

    the design specification fails to hold in each case.

We finally conclude that through this “Jenga”

    procedure, we can claim local minimality of the

    proposed design in the universe of designs

    proposed by the general framework we have

    adopted. Furthermore, we have already shown

    that the design provides easy and systematic

hooks for further expandability. Let us now

    proceed with the test:

    Representations contained in the agent model:

    R1) Situation model: represents external reality

    (not only physical necessarily)

    If we remove any of the proposed property

    dimensions, or the speaker-object, the test fails.

    R2) Internal state (goals, affect)

    Not used.

    R3) Past states/situations (includes events)

    If we remove the event representation or event

    types, then Q49/Q51 will fail

    R4) Future predictions

    Not used. Useful for planning etc. (level-3/4)

    R5) Action list

    If we remove any of the action entries that we

    made, specific questions of the test will fail.

    Processes contained in the agent model:

    P1) Sensor-to-SituationModel process

    If we remove any of the processes of P1a or P1b,

    different parts of the test will fail. For example,

    if we remove the size estimation process,

    referents involving size will not be resolved

    correctly etc. If we remove action recognition of

    the test-giver, Q51 will fail etc.

P2) SituationModel-to-InternalState process

    Not used.

P3) Storage/Maintenance of Past process

    (includes event recognition)

    If the log creation / event recognition processes

    did not exist, we would not be able to correctly

    respond to Q51 for example.

    P4) Future Prediction Generation process

    Not used.

    P5) Action Selection Process

    (SituationModelState-to-action table or planner)

    Any removal of P5.1-P5.4 or their parts will

    inhibit successful responses to several questions.

    Thus, without getting into a detailed

examination, it should be clear that any removal of a proposed representation or process, or of their parts, will make the system incapable of passing the whole token test. Thus, by the proposed “Jenga” criterion, the design is locally minimal in the space created by the proposed framework of designs.

    7. Conclusion

    In this paper, I presented a design of a minimal

    grounded situation model which will enable a

    conversational robot to pass the “Token Test”. I

    started by considering what it might mean for a

    GSM to be minimal, and proposed an

    incremental design methodology. Then, I

    considered how human behavioral tests should

    be translated to specifications for design. Later, I

    derived the proposed design step-by-step by

    considering the questions of the “Token Test”.

    Finally, I pro