Learning the semantics of object-action relations by observation

Eren Erdal Aksoy1, Alexey Abramov1, Johannes Dörr1, Kejun Ning1, Babette Dellen1,2 and Florentin Wörgötter1

1Bernstein Center for Computational Neuroscience, University of Göttingen, III. Physikalisches Institut, Friedrich-Hund-Platz 1, 37077 Göttingen, Germany
Emails: eaksoye,abramov,jdoerr,[email protected]

2Institut de Robòtica i Informàtica Industrial (CSIC-UPC), Llorens i Artigas 4-6, 08028 Barcelona, Spain
Email: [email protected]

Abstract

Recognizing manipulations performed by a human, and transferring and executing them on a robot, is a difficult problem. We address this in the current study by introducing a novel representation of the relations between objects at decisive time points during a manipulation. Thereby, we encode the essential changes in a visual scene in a condensed way such that a robot can recognize and learn a manipulation without prior object knowledge. To achieve this, we continuously track image segments in the video and construct a dynamic graph sequence. Topological transitions of those graphs occur whenever a spatial relation between some segments has changed in a discontinuous way, and these moments are stored in a transition matrix called the semantic event chain (SEC). We demonstrate that these time points are highly descriptive for distinguishing between different manipulations. Employing simple sub-string search algorithms, semantic event chains can be compared and type-similar manipulations can be recognized with high confidence. As the approach is generic, statistical learning can be used to find the archetypal SEC of a given manipulation class. The performance of the algorithm is demonstrated on a set of real videos showing hands manipulating various objects and performing different actions. In experiments with a robotic arm, we show that the SEC can be learned by observing human manipulations, transferred to a new scenario, and then reproduced by the machine.

Keywords: Semantic scene graphs, unsupervised learning, action recognition, object categorization, affordances, Object-Action Complexes (OACs)

1 Introduction

It has long been known that raw observation and naive copying are insufficient for a robot to execute an action. Execution requires capturing the action's essence, a problem that is often discussed in conjunction with imitation learning (Breazeal and Scassellati, 2002).

Humans are able, without difficulty, to capture "the essence" of an action and to recognize the consequences of their own actions as well as those performed by others. While the mirror-neuron system is suspected to be involved in this feat (Rizzolatti and Craighero, 2004), it is until now largely unknown how newborns gradually learn to recognize and imitate actions and thereby develop the advanced motor skills that aid their cognitive development.

Part of the problem lies in the fact that there are usually many ways to perform a certain manipulation. Movement trajectories may be different, and even the order in which a manipulation is performed may change to some degree. On the other hand, certain moments during a manipulation will be similar or even identical. For example, during a manual assembly process certain object combinations must occur without which mounting would not make sense.

This points to the fact that, at certain pivotal time points, one needs to comprehend specifically the momentarily existing relation between the manipulator (hand) and the manipulated object, as well as the resulting relations (and their changes) between objects and object parts.

This defines the requirements for a potentially useful manipulation representation for artificial agents: it needs to be (1) based on sensory signals and (2) learnable by observation. At the same time it should encode the (3) relations between objects in an invariant way, which it should do only at certain decisive (4) moments during the manipulation. Furthermore, it should preferably also be (5) human-comprehensible and (6) compatible with model-based knowledge. Aspects (1) and (2) assure grounding, as the process is bootstrapped in a generic way. Aspect (3) would lead to the required categorization property (invariance against irrelevant object specifics) and aspect (4) to a dramatic data compression, as only a few moments need to be stored. The last two aspects (5, 6) would allow human access – very practically – for debugging and improving the algorithm(s), but also for being able to better understand and possibly interact with the artificial system and for entering model-based knowledge.

To arrive at such a representation is a very difficult problem, and commonly one uses models of objects (and hands) and trajectories to encode a manipulation (see the next section for a discussion of the literature). These approaches, however, can easily lack grounding because the models are almost always given by the designer and not learned by the agent itself. Furthermore, it is so far unknown how to solve the variability problem of manipulations in a model-based way. Or, more plainly: what is the correct model (model class) for bringing objects and actions together in all those vastly differing manipulations?

In this study it is our goal to introduce the so-called "Semantic Event Chain" (SEC) as a novel, generic encoding scheme for manipulations which, to a large degree, fulfills the above-introduced requirements (grounded, learnable, invariant, compressed, and human-comprehensible). We will show that these SECs can be used to allow an agent to learn, by observation, to distinguish between different manipulations and to classify parts of the observed scene. Furthermore, we will demonstrate that an agent can decide by self-observation whether or not a manipulation sequence was correct. Thus, our algorithms give the machine a basic tool by which it can assess the consequence of a manipulation step, directly linking the symbolic planning domain to the signal (image) domain and addressing the difficult cause-effect problem of how to observe and verify self-induced changes.

To show this, we will use computer vision methods for image segmentation and tracking at the sensor front-end. On the motor side we will employ methods for dynamic trajectory generation and control. Neither aspect is at the core of this paper, which focuses – as discussed above – on the semantics of manipulations, their representation, and learning.

Parts of this study have been published at a conference (Aksoy et al., 2010).

2 Related Work

To date, there exists no common framework for manipulation recognition. Different approaches have been presented for vision-based recognition of manipulations (which is the focus of the work presented in this paper), vision-based recognition of human-motion patterns, and non-visual recognition of other types of activities (Modayil et al., 2008; Liao et al., 2005; Hongeng, 2004). The latter will not be discussed any further, because our work focuses on vision. In the following we give short summaries of previous achievements in these areas. We also review previous work on vision-based object recognition, since it is related to our work.

2.1 Recognition of manipulations

The visual analysis of manipulations, e.g., a hand manipulating an object, represents an important subproblem in vision-based manipulation recognition and is relevant for many vision-based applications such as learning from demonstration, work-flow optimization, and automatic surveillance. However, manipulations are far less understood than, for example, human motion patterns, and only a few solutions have been proposed so far (Vicente et al., 2007; Sridhar et al., 2008; Kjellström et al., 2008).

Sridhar et al. (2008) analyzed manipulations in the context of a breakfast scenario, where a hand manipulates several objects (cups, knives, bread) in a certain order. The whole image sequence is represented by an activity graph which holds spatiotemporal object interactions. By using statistical generalization, event classes are extracted from the activity graphs. Here, each event class encodes a similar pattern of spatiotemporal relations between corresponding objects, and object categories can be learned by calculating the similarity between object roles in each event class. They demonstrated that objects can be categorized by considering their common roles in manipulations. However, large activity graphs and the difficulty of finding exact graph isomorphisms make this framework expensive and sensitive to noise. Furthermore, object knowledge was provided beforehand. This way the vision problem was (artificially) separated from the manipulation-recognition problem.

Kjellström et al. (2008) segmented hand and objects from the video and then defined hand/object features (shape-based) and manipulation features, providing a sequence of interrelated manipulation and object features. Semantic manipulation-object dependencies, e.g. drink/glass, are then extracted using conditional random fields (CRFs) and connected hierarchical CRFs. Hand/manipulator and the manipulated object together define the manipulation, and for this reason the recognition process simultaneously involves both hand/manipulator and objects (Vicente et al., 2007; Kjellström et al., 2008). In Vicente et al. (2007), manipulations are represented as sequences of motion primitives. Here, five different manipulations of different levels of complexity were investigated. The process is modeled using a combination of discriminative support vector machines and generative Hidden Markov Models (HMMs). HMMs have also been used by Ogawara et al. (2002) to extract manipulation primitives by learning several HMMs and then clustering these HMMs such that each cluster represents one primitive. Raamana et al. (2007) recognized simple object manipulations such as pointing, rotating and grasping in a table-top scenario using HMMs and selected the best features for recognition automatically. These works demonstrate that HMMs are a useful tool if the manipulation primitives are hidden in the sensory feature set provided to solve the recognition task. Usually this is the case if low-level features are used instead of higher-level "object"-like entities. However, in our case, manipulations are represented by chained relations between image segments (see Section 3), which directly represent manipulation primitives, and as such they can be compared, grouped, and superimposed without having to assume a hidden model. This holds at least for the manipulation examples considered in this paper.

2.2 Recognition of human motion patterns

Recognition of human motion has received much attention in recent years and many contributions exist, but they are often unrelated to manipulation recognition (Laptev and Pérez, 2007; Niebles et al., 2008; Dee et al., 2009; Hakeem and Shah, 2005; Calinon and Billard, 2004, 2005, 2007; Maurer et al., 2005; Gilbert et al., 2009). Much work has been done by the group of Aude Billard (Calinon and Billard, 2004, 2005, 2007; Maurer et al., 2005) addressing the aspect of gesture recognition. Naturally, a strong focus lies here on finding a way to describe complete trajectories, and different methods (including PCA, ICA, HMMs and Hopfield nets) have been used in different combinations to address this problem and also to deal with the question of sequence learning (Maurer et al., 2005). In Laptev and Pérez (2007) spatiotemporal volumes of optical flow are used to classify human motion patterns. In Niebles et al. (2008) human actions are learned in an unsupervised way by using spatiotemporal words that represent space-time interest points. Dee et al. (2009) segment images into regions of similar motion structure and learn pairwise spatial relations between motion regions, roughly corresponding to semantic relations such as "above", "below", and "overlapping". By combining these learned spatial relations with the segmentations learned from data, a compact representation can be provided for each video, representing a motion-based model of the scene, which allows classifying videos containing different kinds of motion patterns, e.g. indoor scenarios with moving people, roads, squares or plazas. In Hakeem and Shah (2005) events involving multiple agents are detected and learned by considering temporally correlated sub-events. In Gilbert et al. (2009) simple 2D corners are grouped in both the spatial and the temporal domain using a hierarchical process at each stage, and the most descriptive features are then learned by data mining. This way, fast and accurate action recognition in video sequences is achieved in real time.

2.3 Object recognition and the role of context

Despite the progress that has been made in the past decades, the recognition of objects using visual cues remains a highly challenging task, and there still exists no vision system reaching human object-recognition capabilities. This is mainly due to the fact that objects take vastly different appearances in images because of the following factors: (i) the relative pose of an object to the camera, (ii) lighting variations, and (iii) variance in the appearance of objects (size, color, shape) belonging to the same class. Object recognition systems extract certain object-relevant characteristics in images and match them against stored object representations or models, which can be either 2D or 3D. We roughly distinguish between geometry-based, appearance-based, and feature-based approaches. Geometry-based approaches use a geometric description of a 3D object and match its projected shape against the image of the object (Mundy, 2006; Mundy and Zisserman, 1992). This approach, however, requires that the object can be initially segmented from the image. Appearance-based algorithms use global image patterns to perform recognition (Turk and Pentland, 1991; Murase and Nayar, 1995; Belhumeur and Kriegman, 1996). For example, Turk and Pentland (1991) projected face images onto a face-specific feature space and used the distance of a projected image to the eigenvectors of the face space for classification.

These methods show invariance to changes in viewpoint and lighting conditions, but are sensitive to occlusions. Feature-based algorithms find local interest points in the image, e.g., SIFT (Lowe, 2004), that have invariant properties with respect to pose, lighting, and affine transformations (Fergus et al., 2003; Nistér and Stewénius, 2006; Sivic and Zisserman, 2003). Local feature histograms are then matched against model representations for object recognition. Feature-based methods depend on the quality and number of features that can be extracted from the image, and thus perform best for images containing rich texture.

In the above-described "classical" approaches to object recognition, the context in which the object is embedded is usually considered to be of minor importance or even harmful to the recognition procedure, and the problem is sometimes eased by segmenting the object from the background prior to recognition. On the other hand, evidence from visual cognition as well as computer vision suggests that objects appearing in a consistent or familiar background can be more accurately detected and recognized than objects appearing in an inconsistent scenario (Torralba, 2003; Helbig et al., 2010; Hoiem et al., 2008; Oliva and Torralba, 2009). Recently it has been shown in psychophysical experiments that action context, too, can facilitate human object recognition (Helbig et al., 2010).

This observation is to some extent in agreement with our study, where objects, which can be associated with certain manipulations, are obtained indirectly by classifying and recognizing actions and without using prior object knowledge.

Figure 1: Processing example and semantic event chain representation. (a) Frames from a movie recorded during a manipulation. All frames (b) are segmented (c) by superparamagnetic clustering in a spin-lattice model (Dellen et al., 2009), which also allows for consistent marker-less tracking (e) of the individual segments. From the image segments, graphs are constructed (d), where graph nodes represent the segments' centers and graph edges encode whether or not two segments touch each other. Then we encode a manipulation by storing only main graphs between which a topological change has taken place (f). Such a change happens whenever an edge or a node has been newly formed or has been deleted. This type of representation is then given by the semantic event chain (g), which is a sequence-table, where each entry encodes the spatial relations between each segment pair ρ_{i,j} by counting graph edges (2 means that segments touch (denoted by red edges), 1 means that they overlap (denoted by blue edges), 0 means that there is no edge between two segments, and absence of a previously existing segment yields 9).

Figure 2: Block diagram of the algorithm

3 Methods

3.1 Overview of the Algorithm

Before discussing details we provide an overview of the different algorithmic steps (see Fig. 1 and 2).

Fig. 1 shows a processing example of a manipulation resulting in its semantic event chain representation. We first extract all frames from the manipulation movie (Fig. 1 (a)). Frames (Fig. 1 (b)) are then segmented (Fig. 1 (c)) by superparamagnetic clustering in a spin-lattice model (Dellen et al., 2009; Abramov et al., 2010), which allows for consistent marker-less tracking (Fig. 1 (e)) of the individual segments due to spin-linking across images using optic-flow information. The scene is then represented by undirected and unweighted graphs (Fig. 1 (d)), the nodes and edges of which represent segments and their neighborhood relations, respectively. Graphs can change by continuous distortions (lengthening or shortening of edges) or, more importantly, through discontinuous changes (nodes or edges can appear or disappear). Such a discontinuous change represents a natural breaking point: all graphs before are topologically identical, and so are those after the breaking point. Hence, we can apply an exact graph-matching method at each breaking point and extract the corresponding topological main graph. The sequence of these main graphs (Fig. 1 (f)) thus represents all structural changes in the scene. This type of representation is then encoded by the semantic event chain (Fig. 1 (g)), which is a sequence-table, where 0 means that there is no edge between two segments, corresponding to two spatially separated segments, 1 means that one segment overlaps with the other completely, and 2 represents segments that touch each other. A special case exists when a segment has disappeared, which is denoted by 9. Note that the complete image sequence, which here has roughly 100 frames, is represented by an event chain with a size of only 7 × 8. The above-described steps (1-4) are also presented in Fig. 2, showing the block diagram of the complete algorithm. The following steps (5-7) utilize the SEC to compute similarity values between videos showing manipulations (step 5), to perform action classification (step 6A) and the conjointly performed segment categorization (step 6B), and, finally, to learn archetypal SECs (step 7). In the following, we describe the different algorithmic steps in detail.

3.2 Recording, Preprocessing, Segmentation & Tracking (Step 1)

Manipulation movies are recorded in indoor environments with limited context. All movies used in this study can be found at http://www.nld.ds.mpg.de/~eren/Movies.html (see Extension 1-2). Typical examples are shown in Fig. 4. We use a stereoscopic camera setup with AVT Marlin F080C CCD FireWire cameras and lenses with a variable focal length of 2.7-13.5 mm (see Fig. 3(a)). The distance to the manipulation scene is about 1.0-1.5 m. Images are rectified (see Fig. 3(b-c)), and stereo and optic-flow information is extracted by different standard algorithms (Pauwels and Van Hulle, 2008; Sabatini et al., 2007). An example of a resulting sparse phase-based disparity map is shown in Fig. 3(d). For step 1 (Fig. 2), we use an image-segmentation method, developed by us earlier, in which segments are obtained and tracked by a 3D linking process (see Fig. 3(e-f)). The method has been described in detail elsewhere (Shylo et al., 2009; Dellen et al., 2009; Dellen and Wörgötter, 2009; Abramov et al., 2010). It is mainly implemented on GPUs and operates close to real time at about 23 fps at a resolution of 256 × 320 pixels. For reasons of brevity, details are omitted here. The main result of these steps is that we obtain consistently tracked image segments, the fate of which can be used to encode a manipulation as described next.

Figure 3: Schematic of recording and visual preprocessing. (a) Stereo camera setup. (b,c) Original example frames from the left and right image sequences. (d) Sparse phase-based disparity map. (e,f) Extracted segments for the left and right images.

3.3 Relational Scene Graphs (Step 2)

Following the extraction of segments (Step 1), we analyze the spatial relations between each segment pair. We denote spatial relations by ρ_{i,j}, in which i and j are the segment numbers. Note that spatial relations are symmetric, i.e. ρ_{i,j} = ρ_{j,i}.

As mentioned in the algorithmic overview above, we define four relations between segments: Touching = 2, Overlapping = 1, Non-touching = 0, and Absent = 9, which refers to an image segment that is not observed in the scene. We redefined standard concepts used in the field of topology (e.g. hole, neighbor, etc.) on purpose to make the terminology more appropriate in the context of manipulation recognition. Terms such as overlapping and touching refer directly to primitive manipulations. Whenever necessary, we use 3D information from our stereo setup to disambiguate perspective effects, which would lead to false relations when using only 2D. Note that 3D information could also be used to define additional relations (like "inside", "above", etc.). This is currently not done, as the classification results are already highly satisfactory even without it.

Furthermore, we note that ultimately additional information must be stored if one wants to use such graphs (or event chains) also for the execution of a manipulation. At least one needs to additionally define the required movement trajectories as well as the desired poses of the objects which are brought into contact with each other, since the relation of just "touching" will not be sufficient for, for example, a mounting procedure. The current paper does not address these issues, though, as we focus here on manipulation recognition and not on execution.

Given that image segments often have strangely shaped as well as noisy borders, the correct assignment of these relations is non-trivial, and we had to design a fast and efficient special algorithm for this purpose. As this is not at the core of this study, we provide details only in Appendix 1. This algorithm gives us the required relations.

Figure 4: Four different real action types. (a),(d),(g),(j) Original images, (b),(e),(h),(k) corresponding image segments, and (c),(f),(i),(l) scene graphs from the following manipulations: Moving Object, Making Sandwich, Filling Liquid, and Opening Book. Overlapping and Touching relations are indicated in blue and red, respectively.

Once the image sequence has been segmented and spatial relations have been extracted, we represent the scene by undirected and unweighted labeled graphs. The graph nodes are the segment labels and are plotted at the center of each segment. Nodes are then connected by an edge if the segment relation is either Touching or Overlapping.
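To make the graph construction concrete, the following minimal Python sketch (illustrative only, not the implementation used in this work; the segment labels and relation codes of the example are made up) builds such an undirected graph from a table of pairwise relations:

    # Minimal sketch: build an undirected scene graph from pairwise segment
    # relations. Relation codes as in the text: 0 = non-touching,
    # 1 = overlapping, 2 = touching, 9 = absent.
    def build_scene_graph(relations):
        """relations: dict mapping a segment pair (i, j) to a relation code."""
        nodes, edges = set(), set()
        for (i, j), code in relations.items():
            if code != 9:                  # pairs with an absent segment add no nodes here
                nodes.update((i, j))
            if code in (1, 2):             # edge only for overlapping or touching
                edges.add(frozenset((i, j)))
        return nodes, edges

    # Illustrative frame: background 1, two vessels 2 and 3, moving object 4
    frame = {(2, 1): 2, (3, 1): 2, (4, 1): 0, (3, 2): 0, (4, 2): 1, (4, 3): 0}
    print(build_scene_graph(frame))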

Fig. 4 shows original frames and the corresponding segments and their scene graphs from four different real action types: Moving Object, Making Sandwich, Filling Liquid, and Opening Book. In the Moving Object action a hand is putting an orange on a plate while moving the plate together with the orange (Fig. 4 (a-c)). The Making Sandwich action represents a scenario in which two hands are putting pieces of bread, salami, and cheese on top of each other (Fig. 4 (d-f)). In the Filling Liquid action a cup is being filled with liquid from another cup (Fig. 4 (g-i)). The Opening Book action describes a scenario in which a hand is opening a book (Fig. 4 (j-l)).

3.4 Main Graphs (Step 3)

To understand the remainder of the algorithm better, we will now use simple, artificial scenes to describe steps 3 to 6 of Fig. 2. Fig. 5 (a-b) depicts original frames and their corresponding segments of an artificial Moving Object action (sample action 1) in which a black round object is moving from a yellow vessel into a red vessel. We will come back to real scenes later.

Figure 5: Simple example of the Moving Object action (sample action 1). (a) Original images. (b) Corresponding image segments. (c) Semantic scene graphs. Overlapping and Touching relations are indicated in blue and red, respectively. (d) Original semantic event chain (ξ_o^1). (e) Derivative of the semantic event chain (ξ_d^1). (f) Compressed semantic event chain (ξ_c^1).

Scene graphs, such as those depicted in Fig. 4, represent spatial relations between nodes. Unless spatial relations change, the scene graphs remain topologically the same. The only changes in the graph structure are the node positions or the edge lengths, depending on the object trajectory and speed. Consequently, any change in the spatial relation between nodes corresponds to a change in the main structure of the scene graphs. Therefore, those changes in the graphs can be employed to define manipulation primitives. Considering this fact, we apply an exact graph-matching method in order to extract the main graphs by computing the eigenvalues and eigenvectors of the adjacency matrices of the graphs (Sumsi, 2008). A change in the eigenvalues or eigenvectors then corresponds to a structural change of the graph. The whole image sequence of the artificial Moving Object action has 92 frames; however, after extracting the main graphs, only 5 frames are left, each defining a single manipulation primitive (Fig. 5 (c)).
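A simplified sketch of this step is given below (Python, illustrative only). The paper applies exact graph matching using both eigenvalues and eigenvectors of the adjacency matrices (Sumsi, 2008); for brevity, the sketch compares only the sorted eigenvalue spectrum of consecutive frames, which already detects topological transitions in simple cases. The argument labels is assumed to be the fixed list of segment labels of the sequence.

    # Simplified sketch of main-graph extraction: keep only frames whose graph
    # spectrum differs from the previously kept frame.
    import numpy as np

    def adjacency(relations, labels):
        """Adjacency matrix over a fixed label ordering; edge if touching/overlapping."""
        idx = {l: k for k, l in enumerate(labels)}
        A = np.zeros((len(labels), len(labels)))
        for (i, j), code in relations.items():
            if code in (1, 2):
                A[idx[i], idx[j]] = A[idx[j], idx[i]] = 1.0
        return A

    def extract_main_graphs(frames, labels, tol=1e-6):
        """frames: list of relation dictionaries, one per video frame."""
        kept, last_spec = [], None
        for t, rel in enumerate(frames):
            spec = np.linalg.eigvalsh(adjacency(rel, labels))   # sorted eigenvalues
            if last_spec is None or not np.allclose(spec, last_spec, atol=tol):
                kept.append(t)              # topological change -> new main graph
                last_spec = spec
        return kept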

3.5 Event Chains (Step 4)

All existing spatial relations in the main graphs are saved in the form of a table where the rows represent the spatial relations between each pair of nodes. The maximum total number of spatial relations, hence the maximum total number of rows, is defined as

ρ_total = n(n − 1)/2 ,    (1)

where n is the total number of segments. For the sample Moving Object action we have n = 4 (yellow and red vessels, a black moving object, and a green background) and therefore ρ_total = 6. Those relations are ρ_{2,1}, ρ_{3,1}, ρ_{4,1}, ρ_{3,2}, ρ_{4,2}, and ρ_{4,3}.

Since any change in the spatial relations represents an event that defines an action, we refer to this table as the original semantic event chain (ξ_o). Fig. 5 (d) shows it for the artificial action explained above.

It is now important to understand that these tables contain spatial-relational information (rows) as well as temporal information in the form of a sequence of time points (sequence of columns) at which a certain change has happened. To compare two manipulations with each other, the spatial and temporal aspects are analyzed in two steps by different sub-string search algorithms.

To achieve this, we first perform two data-compression steps. In general, it suffices to only encode the transitions from one state (one column) in the original chain (ξ_o) to another (the next column). Therefore, we can perform a derivative-like operation on ξ_o and represent the result by ξ_d to simplify the chains.

For this we scan each row of ξ_o from left to right and substitute "changes" by combining their numerical values into a two-digit number. For example, a change from Overlapping to Touching, hence from 1 to 2, will now be encoded as 12. When nothing has changed, a double digit, like 11, occurs. Rows where nothing ever happens (e.g. row ρ_{3,2} in Fig. 5 (d)) are immediately removed. The resulting representation (ξ_d) is, thus, a mild, lossless compression of the original one. It is required for the temporal analysis. Fig. 5 (e) shows ξ_d^1 for the sample Moving Object action.

Then, in a second compression step, all double digits (00, 11, 22, and 99) are removed, leading to ξ_c. This representation has lost all temporal information and will be used for the spatial-relational analysis. ξ_c^1 of the artificial action is given in Fig. 5 (f).
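The two compression steps can be written down directly. The sketch below (Python, illustrative only, with made-up data) turns a single row of ξ_o into its ξ_d and ξ_c forms:

    # Sketch of the two compression steps for a single SEC row.
    def derive_row(row):
        """xi_o -> xi_d: encode every column-to-column transition as a two-digit pair."""
        return [f"{a}{b}" for a, b in zip(row[:-1], row[1:])]

    def compress_row(drow):
        """xi_d -> xi_c: drop the 'nothing changed' pairs 00, 11, 22 and 99."""
        return [e for e in drow if e[0] != e[1]]

    row = [1, 1, 2, 0, 0]          # e.g. overlap, overlap, touch, apart, apart
    drow = derive_row(row)         # ['11', '12', '20', '00']
    crow = compress_row(drow)      # ['12', '20']
    print(drow, crow)

Rows whose derivative never changes (such as ρ_{3,2} in Fig. 5 (d)) would additionally be removed from ξ_d, as described above.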

3.6 Similarity Measure (Step 5)

Next we will discuss how to calculate the similarity of two actions. Essentially, this comes down to sub-string search algorithms in the spatial as well as the temporal domain. In the spatial domain we search for correspondences between the rows of two compressed event chains to reduce the combinatorics (see Section 3.6.1). Then, in the temporal domain, the order of the columns is examined to obtain the final recognition result (see Section 3.6.2).

To explain this we created one more sample of the artificial Moving Object action. Fig. 6 depicts the main graphs with the respective image segments of sample action 2, in which a red rectangular object is moving from a blue vessel into a yellow vessel, following a different trajectory with a different speed as compared to the first sample. Moreover, the scene contains two more objects which are either stationary (red round object) or moving randomly (black round object). Following the same procedure, the event chain ξ_o^2 and its compressed versions (ξ_d^2 and ξ_c^2) for the second sample are calculated and given in Fig. 6 (d-f). Note that even though the second sample contains more objects, the dimensions of the different chains are accidentally the same. This is of no importance, as the sub-string search described next does not rely on dimensions, allowing the comparison of arbitrarily long action sequences.

Figure 6: Different version of the simple Moving Object action (sample action 2). (a) Original images. (b) Respective image segments. (c) Semantic scene graphs. Overlapping and Touching relations are indicated in blue and red, respectively. (d) Original semantic event chain (ξ_o^2). (e) Derivative of the semantic event chain (ξ_d^2). (f) Compressed semantic event chain (ξ_c^2).

3.6.1 Spatial Similarity Analysis

The goal of this subsection is to provide the first of two subsequent analysis steps required to obtain a final measure of similarity between two event chains. The first step is based on a spatial analysis comparing the rows of the compressed event chains (ξ_c^1 and ξ_c^2), accounting for a possible shuffling of rows in different versions of the same manipulation. This way the number of possible relations is reduced before we can finally, in the second step, find the true similarity measures.

Let ξ_c^1 and ξ_c^2 be the sets of rows of the two manipulations, written as matrices (e.g. Fig. 5 (f) and 6 (f)):

\xi_c^1 = \begin{pmatrix}
r_{1,1}^1 & r_{1,2}^1 & \cdots & r_{1,\gamma_1^1}^1 \\
r_{2,1}^1 & r_{2,2}^1 & \cdots & r_{2,\gamma_2^1}^1 \\
\vdots    & \vdots    & \ddots & \vdots \\
r_{m,1}^1 & r_{m,2}^1 & \cdots & r_{m,\gamma_m^1}^1
\end{pmatrix} , \quad
\xi_c^2 = \begin{pmatrix}
r_{1,1}^2 & r_{1,2}^2 & \cdots & r_{1,\gamma_1^2}^2 \\
r_{2,1}^2 & r_{2,2}^2 & \cdots & r_{2,\gamma_2^2}^2 \\
\vdots    & \vdots    & \ddots & \vdots \\
r_{k,1}^2 & r_{k,2}^2 & \cdots & r_{k,\gamma_k^2}^2
\end{pmatrix} ,

where r_{i,j} represents a relational change between a segment pair,

r_{i,j} ∈ {01, 02, 09, 10, 12, 19, 20, 21, 29, 90, 91, 92} .

The lengths of the rows are usually different and are given by the indices γ.

Now each row of ξ_c^1 is compared with each row of ξ_c^2 in order to find the highest similarity. The comparison process searches for equal entries of one row against the other using a standard sub-string search, briefly described next. Assume that we compare the a-th row of ξ_c^1 with the b-th row of ξ_c^2. If row a is shorter than or of equal length to row b (γ_a^1 ≤ γ_b^2), the a-th row of ξ_c^1 is shifted γ_b^2 − γ_a^1 + 1 times to the right. At each shift its entries are compared with those of the b-th row of ξ_c^2 and we obtain as a result the set F_{a,b}, defined as

F_{a,b} = \{ f_t : t \in [1, \gamma_b^2 - \gamma_a^1 + 1] \} ,

f_t = \frac{100}{\gamma_b^2} \sum_{i=1}^{\gamma_a^1} \delta_i ,    (2)

where γ_b^2 is the normalization factor, i indexes the entries of the compared rows, and

\delta_i = \begin{cases} 1 & \text{if } r_{a,i}^1 = r_{b,i+t-1}^2 \\ 0 & \text{else} \end{cases} ,    (3)

where the set F_{a,b} represents all possible similarities for every shift t, given by f_t, which holds the normalized percentage of the similarity calculated between the shifted rows.

As usual for sub-string searches, we are only interested in the maximum similarity of every comparison, hence we define

M_{a,b} = \max(F_{a,b}) .

For the case γ_a^1 > γ_b^2, a symmetrical procedure is performed by interchanging all indices in Eqs. (2) and (3) above.

Spatial similarity values between all rows of ξ_c^1 and ξ_c^2 are stored in a matrix ζ_spatial of size m × k:

\zeta_{spatial} = \begin{pmatrix}
M_{1,1} & M_{1,2} & \cdots & M_{1,k} \\
M_{2,1} & M_{2,2} & \cdots & M_{2,k} \\
\vdots  & \vdots  & \ddots & \vdots  \\
M_{m,1} & M_{m,2} & \cdots & M_{m,k}
\end{pmatrix} .

The final similarity value (ψ_spatial) between the rows of two compressed event chains is calculated by taking the mean value of the highest similarities across the rows of ζ_spatial:

\psi_{spatial} = \frac{1}{m} \sum_{i=1}^{m} \max_j (M_{i,j}) , \quad j \in [1, \cdots, k] .    (4)

The complete similarity matrix (ζ_spatial) between the artificial Moving Object samples (ξ_c^1 and ξ_c^2) is given in Table 1. Visual inspection of ξ_c^1 (Fig. 5 (f)) and ξ_c^2 (Fig. 6 (f)) immediately confirms these similarity values. We find 100% similarity (ψ_spatial) between both artificial "manipulations". One can see that Eq. (4) can still lead to multiple assignments of permutations with the same maximal similarity. This will be resolved by the temporal similarity measurement stage described below. In more realistic scenes, 100% is, of course, often not reached (see Fig. 7 below) and one needs to define a threshold above which one would consider two similarity values as equal.
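A compact sketch of Eqs. (2)-(4) is given below (Python, illustrative only; the function names are not from the original implementation). It computes the matrix ζ_spatial and the value ψ_spatial for two compressed event chains given as lists of rows of two-digit codes:

    # Sketch of the spatial similarity of Eqs. (2)-(4).
    def row_similarity(row_a, row_b):
        """M_{a,b}: best match over all shifts of the shorter row along the longer one."""
        if len(row_a) > len(row_b):
            row_a, row_b = row_b, row_a                  # symmetric case: swap roles
        best = 0.0
        for t in range(len(row_b) - len(row_a) + 1):
            hits = sum(1 for i, e in enumerate(row_a) if e == row_b[i + t])
            best = max(best, 100.0 * hits / len(row_b))  # normalized by the longer row
        return best

    def spatial_similarity(xi1_c, xi2_c):
        """Return psi_spatial and the full similarity matrix zeta_spatial."""
        zeta = [[row_similarity(ra, rb) for rb in xi2_c] for ra in xi1_c]
        psi = sum(max(r) for r in zeta) / len(zeta)
        return psi, zeta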

However, we also observe that there are several 100% matches between rows in these examples. As a consequence, for the second example two permutations, ξ_c^{2,1} and ξ_c^{2,2}, exist with equal row-matching probability, as given in Table 2. Note that for real scenes, after thresholding, even more permutations might exist. Hence, the analysis is not yet complete.

3.6.2 Temporal Similarity Analysis

In the now following second step we can use the time sequence, encoded in the order of events in the event chains, to find the best matching permutation and thereby arrive at the final result. To this end we will now use temporal information, hence the derivative-like term ξ_d, to find the truly matching permutation.

Table 1: Similarity table (ζ_spatial). Similarity values between the rows of ξ_c^1 and ξ_c^2 for the artificial Moving Object samples.

ξ_c^1 \ ξ_c^2    ρ_{2,1}   ρ_{3,1}   ρ_{6,1}   ρ_{3,2}   ρ_{3,6}
ρ_{2,1}           100%      25%      100%       50%       50%
ρ_{3,1}           100%      25%      100%       50%       50%
ρ_{4,1}            25%     100%       25%       50%       50%
ρ_{4,2}            50%      50%       50%        0%      100%
ρ_{4,3}            50%      50%       50%      100%        0%

Table 2: Permutations ξ_d^{2,p}.

(a) Permutation ξ_d^{2,1}

ξ_c^1                     ξ_c^{2,1}        ξ_d^{2,1}
ρ_{2,1}  ⇐ 100% ⇒  ρ_{2,1}          11 11 12 21
ρ_{3,1}  ⇐ 100% ⇒  ρ_{6,1}          12 21 11 11
ρ_{4,1}  ⇐ 100% ⇒  ρ_{3,1}          02 21 12 20
ρ_{4,2}  ⇐ 100% ⇒  ρ_{3,6}          12 20 00 00
ρ_{4,3}  ⇐ 100% ⇒  ρ_{3,2}          00 00 02 21

(b) Permutation ξ_d^{2,2}

ξ_c^1                     ξ_c^{2,2}        ξ_d^{2,2}
ρ_{2,1}  ⇐ 100% ⇒  ρ_{6,1}          12 21 11 11
ρ_{3,1}  ⇐ 100% ⇒  ρ_{2,1}          11 11 12 21
ρ_{4,1}  ⇐ 100% ⇒  ρ_{3,1}          02 21 12 20
ρ_{4,2}  ⇐ 100% ⇒  ρ_{3,6}          12 20 00 00
ρ_{4,3}  ⇐ 100% ⇒  ρ_{3,2}          00 00 02 21

Thus, we will compare both permutations ξ_d^{2,p}, p = 1, 2, of the second event chain, shown in Table 2 (a-b), with the first one, ξ_d^1. In cases where ξ_d^1 has more rows than ξ_d^{2,p}, hence rows which have no correspondences, ξ_d^{2,p} is filled with dummy rows that have no possible similarities.

Let ξ_d^1 and ξ_d^{2,p} be matrices of sizes q × u and q × v, respectively, and assume that u ≤ v:

\xi_d^1 = \begin{pmatrix}
e_{1,1}^1 & e_{1,2}^1 & \cdots & e_{1,u}^1 \\
e_{2,1}^1 & e_{2,2}^1 & \cdots & e_{2,u}^1 \\
\vdots    & \vdots    & \ddots & \vdots \\
e_{q,1}^1 & e_{q,2}^1 & \cdots & e_{q,u}^1
\end{pmatrix} ,

and

\xi_d^{2,p} = \begin{pmatrix}
e_{1,1}^{2,p} & e_{1,2}^{2,p} & \cdots & e_{1,v}^{2,p} \\
e_{2,1}^{2,p} & e_{2,2}^{2,p} & \cdots & e_{2,v}^{2,p} \\
\vdots        & \vdots        & \ddots & \vdots \\
e_{q,1}^{2,p} & e_{q,2}^{2,p} & \cdots & e_{q,v}^{2,p}
\end{pmatrix} ,

where ξ_d^{2,p} is a permutation of ξ_d^2 and e_{i,j}^1 and e_{i,j}^{2,p} represent the possible relational changes between any segment pair,

e_{i,j} ∈ {00, 01, 02, 09, 10, 11, 12, 19, 20, 21, 22, 29, 90, 91, 92, 99} .    (5)

Following this, the columns of ξ_d^1 and ξ_d^{2,p} are compared. Note that, by contrast to the rows, the columns of event chains are never shuffled unless they represent different types of actions. Therefore, the column orders of type-similar event chains have to be the same. Assume that we compare the a-th and b-th columns of ξ_d^1 and ξ_d^{2,p}, respectively. The procedure is very similar to the one for the spatial analysis. Since the lengths of the columns q are the same, no shift operator is required and columns are directly compared index-wise as

\theta_{a,b} = \frac{100}{q} \sum_{j=1}^{q} \delta_j ,    (6)

where j indexes the entries of the compared columns and

\delta_j = \begin{cases} 1 & \text{if } e_{j,a}^1 = e_{j,b}^{2,p} \\ 0 & \text{else} \end{cases} ,    (7)

where θ_{a,b} holds the normalized percentage of the similarity value calculated between the columns.

Similarity values between all columns of ξ_d^1 and ξ_d^{2,p} are stored in a matrix ζ_temporal^p of size u × v:

\zeta_{temporal}^p = \begin{pmatrix}
\theta_{1,1} & \theta_{1,2} & \cdots & \theta_{1,v} \\
\theta_{2,1} & \theta_{2,2} & \cdots & \theta_{2,v} \\
\vdots       & \vdots       & \ddots & \vdots \\
\theta_{u,1} & \theta_{u,2} & \cdots & \theta_{u,v}
\end{pmatrix} .

The final similarity value ψ_temporal^p between the columns of two event chains is then calculated by taking the mean value of the highest similarities across the rows:

\psi_{temporal}^p = \frac{1}{u} \sum_{i=1}^{u} \max_j (\theta_{i,j}) , \quad j \in [1, \cdots, v] .    (8)

For each permutation ξ_d^{2,p} given in Table 2, a ψ_temporal^p value is calculated by using Eqs. (6), (7), and (8), yielding ψ_temporal^1 = 60% and ψ_temporal^2 = 100%. Note that we use the Longest Common Subsequence (LCS) in order to guarantee that the order of the columns is the same. LCS is generally used to find the longest sequence observed in both input sample sequences; here, the columns of the event chains are used as sequences. Since the number of sequences is constant, the problem is solvable in polynomial time by dynamic programming. Consequently, our two sample actions have 100% similarity, and permutation ξ_d^{2,2} represents the final row correspondences to the first action. The best matching permutation is further used for categorizing objects as described in Section 3.8.
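For one candidate permutation, the column comparison of Eqs. (6)-(8) can be sketched as follows (Python, illustrative only). The sketch simply takes the best-matching column for each column of the first chain, whereas the paper additionally enforces the column order via the longest common subsequence; missing rows are assumed to have been padded with dummy entries beforehand:

    # Sketch of the temporal similarity of Eqs. (6)-(8) for one row permutation.
    def temporal_similarity(xi1_d, xi2p_d):
        """psi^p_temporal between the columns of xi1_d and the permutation xi2p_d."""
        q = len(xi1_d)                          # number of corresponding rows
        cols1 = list(zip(*xi1_d))               # u columns of the first chain
        cols2 = list(zip(*xi2p_d))              # v columns of the permuted chain
        best = []
        for ca in cols1:
            theta = [100.0 * sum(x == y for x, y in zip(ca, cb)) / q for cb in cols2]
            best.append(max(theta))             # max_b of theta_{a,b}
        return sum(best) / len(best)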

As mentioned above, in real scenes 100% is often not reached, and we will call two actions "type-similar" as soon as their final similarity value exceeds a certain threshold. We would also like to point out that the final spatial and temporal similarity values are not necessarily identical. Action classification should use the final temporal similarity values, as this measure is more restrictive and therefore provides the final means for classification.

The question arises why we use a two-step process, as the second step might suffice on its own. In that case, however, all possible permutations would have to be analyzed, which can be very costly (up to O(n!)) for big event chains. In particular, decisive "no-match decisions", which occur for all non-type-similar manipulations – hence, quite often – could only be obtained at the end of the complete permutation analysis. By contrast, the spatial analysis, which has a time complexity of O(n^3), yields this result faster. This leads to a substantial algorithmic speed-up and makes the choice of a two-step algorithm useful.

3.7 Action Classification (Step 6A)

We applied our framework to four different real action types: Moving Object, Making Sandwich, Filling Liquid, and Opening Book (see Fig. 4). For each of these actions, we recorded four movies with highly different trajectories, speeds, hand positions, and object shapes. These examples were introduced to show that very different instantiations of a manipulation will still be recognized as belonging to the same type.

The event chains of all real test sequences are compared with each other by using the similarity measurement algorithm explained in Step 5. The resulting final maximal similarity values are given in Fig. 7. Each test sequence has a high similarity with the other versions of its type-similar action (see close to the diagonal) and, almost always, the similarity between type-similar actions is much bigger than the similarity between non-type-similar actions. The minimum similarity value for type-similar actions is measured as 62%, between the fourth and second versions of Filling Liquid, but the similarity between Filling Liquid and non-type-similar actions is far less. Setting a threshold at 60% would, across all examples, lead to zero false negatives and to two false positives (opening book version II and moving object version I), which would be confused with other manipulations. This is interesting, as opening a book can indeed look very similar to moving an object (picking up a book cover and moving it – say – up and towards you will indeed look very similar to picking up an object and moving it up and towards you). The same is true for the comparison of moving-object with making-sandwich: in the making-sandwich action many subcomponents exist in which objects are being moved. Thus, in general the manipulation analysis shown in Fig. 7 corresponds very well to our human understanding of action (sub-)components!

Figure 7: Similarity values between event chains of the real test data set.

Note that as soon as the complete table has been measured, it is also accessible to unsupervised classification (X-means (Dan Pelleg, 2000)). We have done this by using the correlation values between columns as features, and we obtain four classes with no outliers. Hence, with clustering one gets a completely correct classification of all individual manipulations.
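A minimal sketch of this classification step is shown below (Python, illustrative only). It uses k-means from scikit-learn with a fixed number of clusters as a stand-in for the X-means algorithm used in the paper:

    # Sketch of the unsupervised classification of the similarity matrix.
    import numpy as np
    from sklearn.cluster import KMeans

    def classify_actions(similarity, n_clusters=4):
        """similarity: (N x N) matrix of pairwise SEC similarities in percent."""
        features = np.corrcoef(similarity, rowvar=False)   # correlations between columns
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(features)
        return labels                                       # one class label per movie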

3.8 Segment Categorization (Step 6B)

The row correspondence, determined by finding the best matching permutation, also implicitly encodes the similarity of the graph nodes between the two different examples.

We will explain this again using the two artificial examples (Fig. 5 and Fig. 6). The row similarity values of the second permutation given in Table 2 (b) represent the correspondences between manipulated nodes in ξ^1 and ξ^2 (we can now drop all other indices, as the descriptions will not suffer).

The question which we would like to answer is: which nodes represent segments that play the same role in type-similar actions?

This can be achieved in a fully unsupervised way by a simple counting procedure. We first analyze which node number in ξ^1 is repeating in conjunction with which node number in ξ^2 in Table 2. We start with node number 1 in ξ^1, which occurs in relations ρ_{2,1}, ρ_{3,1}, and ρ_{4,1}. Its corresponding best matching relations are given by:

ρ_{2,1} ⇐ 100% ⇒ ρ_{6,1}
ρ_{3,1} ⇐ 100% ⇒ ρ_{2,1}    ⇒ 1 ≈ 1 .
ρ_{4,1} ⇐ 100% ⇒ ρ_{3,1}

While node 1 is repeating three times in ξ^1 (left side), the same node number 1 in ξ^2 (right side) is also repeating three times. However, node numbers 2, 3, and 6 in ξ^2 occur only once. Therefore, we conclude that graph nodes 1 in both examples, ξ^1 and ξ^2, had the same roles. In fact, both graph nodes represent the green background in both actions.

We continue the node relation analysis with node number 2 in ξ^1 and obtain

ρ_{2,1} ⇐ 100% ⇒ ρ_{6,1}    ⇒ 2 ≈ 6 .
ρ_{4,2} ⇐ 100% ⇒ ρ_{3,6}

Node number 2 in ξ^1 is repeating twice with node number 6 in ξ^2. Those graph nodes represent the yellow and blue vessels within which the moving objects are initially placed and from which they then move away.

For the case of node number 3 in ξ^1 we obtain

ρ_{3,1} ⇐ 100% ⇒ ρ_{2,1}    ⇒ 3 ≈ 2 .
ρ_{4,3} ⇐ 100% ⇒ ρ_{3,2}

Node number 3 in ξ^1 corresponds to node number 2 in ξ^2 because both of them are repeating twice. Those graph nodes define the destination vessels for the moving objects.

The last node, number 4 in ξ^1, is obtained as

ρ_{4,1} ⇐ 100% ⇒ ρ_{3,1}
ρ_{4,2} ⇐ 100% ⇒ ρ_{3,6}    ⇒ 4 ≈ 3 .
ρ_{4,3} ⇐ 100% ⇒ ρ_{3,2}

Node number 4 in ξ^1 and node number 3 in ξ^2 are both repeating three times. In fact, both graph nodes represent the moving objects, which are the round black object in ξ^1 and the rectangular red object in ξ^2.

In the case of having the same highest value more than once, e.g. having two 100% similarity values in the same row of the similarity matrix, segment categorization might lead to ambiguous results, i.e. one segment would correspond to two different segments in the other manipulation. This sort of conflict can be solved by taking the second highest values in the similarity matrix and calculating the node similarity again. This way we always achieve unique segment categorization results.
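The counting procedure itself is small. The following sketch (Python, illustrative only) applies it to the row correspondences of Table 2 (b) and recovers the node assignments 1 ≈ 1, 2 ≈ 6, 3 ≈ 2, and 4 ≈ 3 derived above:

    # Sketch of the node-correspondence counting; 'matches' lists the best-matching
    # row pairs, each row being a segment pair (i, j) from xi^1 and xi^2, respectively.
    from collections import Counter

    def categorize_nodes(matches):
        """Map every node of xi^1 to the xi^2 node it co-occurs with most often."""
        counts = Counter()
        for pair1, pair2 in matches:
            for n1 in pair1:
                for n2 in pair2:
                    counts[(n1, n2)] += 1
        mapping = {}
        for (n1, n2), c in sorted(counts.items(), key=lambda kv: -kv[1]):
            if n1 not in mapping and n2 not in mapping.values():
                mapping[n1] = n2               # greedy one-to-one assignment
        return mapping

    # Row correspondences of Table 2 (b):
    matches = [((2, 1), (6, 1)), ((3, 1), (2, 1)), ((4, 1), (3, 1)),
               ((4, 2), (3, 6)), ((4, 3), (3, 2))]
    print(categorize_nodes(matches))           # -> {1: 1, 4: 3, 2: 6, 3: 2}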

We applied this categorization algorithm to our four different real manipulation scenes (Fig. 4). The results showed that the manipulated segments in each action type can be categorized according to their roles in the actions. Fig. 8 illustrates the categorization results; e.g., the Moving Object actions include three different segment groups, here named by their object names for simplicity (apples or oranges, plates, and hands), each of which performed different roles. In the Filling Liquid action the hands are grouped correctly despite having different poses. Note that for the sake of simplicity the backgrounds are ignored in Fig. 8, although they are also detected and grouped correctly.

Figure 8: Segment categorization results. In each action type, the manipulated segments can be classified and grouped based on their action roles. Note that classification happens at the level of segments or segment groups. A link to the object domain (indicated by the dashed arrows) could be introduced by including explicit object models, but this is not part of this study.

Figure 9: Sample frames from 10 different versions of a "Putting an object on a plate" action. In this action type a hand appears in the scene, puts different kinds of objects (e.g. an apple, an orange, a piece of bread) on a plate following different trajectories with different speeds, and then leaves the scene.

While we are here strictly at the level of segments, it is evident (albeit non-trivial!) that this unsupervised categorization process could be coupled to object models, thus providing access to object categorization, too. It is interesting to remark that in this case any object-like entity will be classified strictly in the context of the observed manipulation. Thus, steps 6A and 6B are tightly linked, as depicted by the gray box in Fig. 2. For example, a "cup-being-filled" would be grouped with other objects-being-filled. The same cup, when occurring in an action of "cup-used-as-pedestal" (where the cup is first turned upside down and then something is put on top), would be classified together with other objects-used-as-pedestals. This relates to the cognitive concept of affordances (Gibson, 1977) and will be discussed later in more detail.

3.9 Learning (Step 7)

In the next step we will show that the SECs of different instantiations of type-similar manipulations can be combined by statistical learning to render a model SEC for this manipulation type. This, too, is done in an unsupervised way.

We know that the rows of the event chain encode the main relational changes between segments. To arrive at a model, the learning procedure just needs to search for all common relational changes observed across repeated type-similar manipulations. A simple averaging algorithm suffices for this.

We describe an on-line version of the learning, but the same procedure could also be employed in batch mode. Learning is initiated by assigning small weights ω_i^r to all rows and ω_i^c to all columns of the first observed chain. When observing the next manipulation, we use Step 6A (action classification) to find out whether it is type-similar. If this is the case, the weights of each row and column are incremented by a small amount Δω_i if the row and column have a correspondence in the new event chain. If the new chain has additional, so far unobserved rows, the model is extended by these rows, which start with the initial small weight value. This is repeated as long as desired, but usually 10 instantiations suffice for a stable model. After this, the weights are thresholded, deleting all rows and columns which are subthreshold and returning the resulting model event chain for this manipulation type.

In addition to this, for each manipulation instance, the action-relevant segments (segment groups) are extracted and labeled according to their roles within the observed action, as explained in Step 6B (segment categorization).

Figure 10: (a) Similarity values between event chains of "Putting an object on a plate" and "Taking an object from a plate" actions. (b) The learned SEC model for the action type "Putting an object on a plate" with corresponding row (ω_i^r) and column (ω_i^c) weight values. These weight vectors are just for illustration, since different weight values might be observed for different action types due to the degree of noise in the event chains. (c) Same for "Taking an object from a plate".

Note that online learning could suffer from bad first examples, against which all subsequent manipulations would then be classified. There are obvious work-arounds, for example cross-comparing the manipulations with each other. Ultimately, batch-mode learning is more useful. For this, one would first record scenes from many manipulations and then perform clustering of the similarity matrix (e.g. Fig. 10), after which learning can be done for each cluster in the same way as described above.
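A minimal sketch of the weight-based model learning is given below (Python, illustrative only; the initial weight, increment, and threshold values are made up). Column weights, which are handled analogously, are omitted for brevity, and the row correspondence across chains established in Steps 5 and 6 is abstracted as a shared row identifier:

    # Sketch of the weight-based SEC model learning of Step 7.
    def learn_sec_model(chains, w0=0.1, dw=0.1, threshold=0.5):
        """chains: list of type-similar SECs, each a dict mapping a row id to its xi_d row."""
        model, weights = {}, {}
        for chain in chains:
            for row_id, row in chain.items():
                if row_id in model:
                    weights[row_id] += dw      # row observed again: strengthen it
                else:
                    model[row_id] = row        # unseen row: add it with a small weight
                    weights[row_id] = w0
        # keep only the rows whose weight survived thresholding
        return {rid: row for rid, row in model.items() if weights[rid] >= threshold}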

We applied the learning framework in batch mode to two different manipulation types: "Putting an object on a plate" and "Taking an object from a plate", each of which has 10 different versions with strongly different trajectories, speeds, hand positions, and objects (see Fig. 9 to get an impression of the level of difference).

Unsupervised classification of the similarity matrix (see Fig. 10) is used to classify those 20 versions. Note that a high similarity value (around 50%) is often observed between non-type-similar actions. The reason is that, except for the sequencing, which is inverted for "putting" versus "taking", the primitives of both action types necessarily look similar. The differences are big enough, though, that unsupervised classification still leads to a completely correct classification.

Next, a SEC model is learned for each manipulation class by searching for the similar common rows and columns observed in all 10 different versions, as explained above. Fig. 10 (b-c) shows the learned SEC models for both action types with the corresponding row (ω_i^r) and column (ω_i^c) weight values. To prove the accuracy of the learned SEC models we prepared 5 test movies which all contain both action types – putting and taking – but performed in different temporal orders (or sometimes with two hands at the same time!). Fig. 11 shows some sample frames from each of the test movies.

Figure 11: Sample frames from 5 different mixed actions in which both manipulation types "Putting an object on a plate" and "Taking an object from a plate" are performed in different orders. (a) A hand is first taking a piece of bread from a plate and then putting it on a different plate. (b) Another piece of bread is moved from one plate to another with a different trajectory. (c) A hand is replacing an orange. (d) A hand is first putting an orange on a plate and then taking a piece of bread from another plate. (e) A hand is putting an orange on a plate and in the meantime the other hand is simultaneously taking an apple from the second plate.

Fig. 12 depicts the similarity results between the two learned models and all 25 movies, 20 of which are the training data while the remaining 5 are unknown test data. Similarity is measured as described in Step 5. In red and blue are indicated the similarities of a given movie with the "Putting an object on a plate" and "Taking an object from a plate" models, respectively, obtained using the leave-one-out method. For the first 10 training data the learned model of "Putting an object on a plate" has a higher similarity, whereas the model of "Taking an object from a plate" has a lower one (Fig. 12, green area). It is the other way around for the next 10 training data (Fig. 12, yellow area). However, for the last 5 test data, in which both manipulation types are performed in different orders, both learned models have high similarity (Fig. 12, blue area). When doing time-slicing (data not shown), one sees that the similarity in the last 5 data sets for either manipulation increases together with the completion of the respective manipulation – one after the other in the first 4 movies, and simultaneously in the last one, where both actions are performed at the same time.

Figure 12: Similarity results between the two learned models and all 25 movies. In red and blue are indicated the similarities of a given movie with the "Putting an object on a plate" and "Taking an object from a plate" models, respectively. The first 20 data sets are the training data and represent different versions of the "Putting an object on a plate" and "Taking an object from a plate" actions, respectively. The last 5 data sets represent the mixed actions used for testing the learned models.

4 Case Study: Learning and replaying an action sequence

Artificial intelligence (AI) systems almost always follow logic rules structured as pre-condition, action, post-condition. Assessing the success of a rule execution requires measuring the post-condition. Hence, such systems rely on Thorndike's law of cause and effect (Thorndike, 1911), and traditionally these rules were defined by their programmers. Thus, it is difficult to find ways for an agent to learn cause-effect rules by itself, without the explicit interference of a supervisor (see the "grounding problem", Harnad, 1990). Furthermore, especially in complex situations, agents face the problem of how to assess the "effect", as many aspects of a situation might change following an action (see the "frame problem", McCarthy and Hayes, 1969).
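For illustration, the classical rule format referred to above can be written down as follows; the names and the toy relation encoding are our own and are not part of the SEC framework.

    # Illustrative sketch of the classical rule format discussed above:
    # a rule fires when its pre-condition holds, and success is assessed by
    # checking the post-condition on the observed outcome. All names here
    # are illustrative.

    from dataclasses import dataclass
    from typing import Callable, Dict

    State = Dict[str, int]   # e.g. pairwise relations: 2=touch, 1=overlap, 0=none

    @dataclass
    class Rule:
        name: str
        pre: Callable[[State], bool]
        action: Callable[[State], State]
        post: Callable[[State], bool]

        def execute(self, state: State) -> bool:
            """Run the action if applicable and report whether the observed
            post-condition matches the expected effect."""
            if not self.pre(state):
                return False
            new_state = self.action(state)
            return self.post(new_state)

    # Toy "put object on plate": pre: hand touches object; post: object touches plate.
    rule = Rule(
        name="put_on_plate",
        pre=lambda s: s["hand,object"] == 2,
        action=lambda s: {**s, "object,plate": 2, "hand,object": 0},
        post=lambda s: s["object,plate"] == 2,
    )
    print(rule.execute({"hand,object": 2, "object,plate": 0}))   # True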

Figure 12: Similarity results between the two learned models and all 25 movies. Red and blue indicate the similarities of a given movie with the "Putting an object on a plate" and "Taking an object from a plate" models, respectively. The first 20 data sets are the training data and represent different versions of the "Putting an object on a plate" and "Taking an object from a plate" actions, respectively. The last 5 data sets represent the mixed actions used for testing the learned models.

In the following we show first results of a system that learns the rules of an action sequence without explicit supervision and then executes actions in a scenario while self-assessing the "action-effects". Both processes rely on the event chains, and the agent can, without any pre-defined rule set, learn the sequence and then assess the (in-)correctness of its actions just by comparing the resulting chains. Condensation into event chains thus helps solving the grounding problem as well as the frame problem.

Our robot system is quite simple, consisting of a 3-DOF arm with a magnetic gripper (Neurorobotics, Sussex). Thus, we used "pushing" as well as "pick-and-place" as the action repertoire. To generate trajectories we used predefined dynamic movement primitives (Ijspeert et al., 2002; Ning et al., 2010), and trajectory start- and end-points (for touching) were visually pre-defined and transferred onto the robot via a standard inverse kinematics procedure (no servoing). Motion generation and control are not the focus of this study, therefore we kept them simple here (for an advanced treatment of these aspects see Ning et al. (2010)). Objects for pick-and-place were magnetic.

The desired action sequence was first demonstrated by a human. Fig. 13 (a-b) (blue frame) shows sample frames of the action sequence in which a hand is "pushing" a lid off a container and then "picking-and-placing" a ball inside. The event chain of this action sequence is learned by our system. It can be broken into two sub-chains, and the final result is shown in Fig. 14 (a,b).

In the next step we confront the robot with a scene, provide it with a possible set of motion-trajectory start points, and let the robot randomly try out pushing and pick-and-place actions. Fig. 13 (c-f) (red frame) shows a subset of the different types of actions the robot has tried out (many more were performed but cannot be shown here). The blue tip of the robot arm is visible in the images. Note that the objects are usually different from the ones used by the human. In Fig. 13 (c) the robot is only pushing a lid but does not continue with pick-and-place. In (d) a black ball is pushed. Fig. 13 (e) shows how the robot picks up a ball and then drops it on the table. Panel (f) shows an action where the robot takes the ball from a container and places it on the table. All these examples do not (or, in (c), only incompletely) reproduce the observed action sequence. Fig. 13 (g-h) (green frame) shows the correct action sequence, which at some point was also executed by the robot.

The corresponding event chains of all those action sequences are given in Fig. 14. Due to different noise sources (in tracking, segmentation, or depth information) the sizes of the individual event chains can vary considerably. Still, as discussed in the section on learning, the individual chains contain the relevant information, which is not harmed by noise-induced rows and columns. As a consequence, even very different-looking event chains can be robustly compared to the learned models (a,b) using the similarity algorithm described above; a sketch of this self-assessment loop is given below. Figure labels (a-h) in Fig. 13 correspond to those in Fig. 14. Colored boxes in Fig. 14 mark rows with high similarities. This occurs for panels (c) and (g), which are similar to (a), as well as for (h), which is similar to (b). A similarity table is shown in Fig. 14 (i). It shows that manipulation (c) is similar to the learned pushing model (a); the same is true for manipulation (g), and both are above 60% similarity. Only manipulation (h) is similar to the pick-and-place-inside model (b), with 87% similarity.
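The following sketch shows the self-assessment loop in schematic form: every tried action yields an event chain that is compared with the learned model, and only trials above a similarity threshold count as reproducing the manipulation. The similarity function used here (fraction of model rows re-found in the trial) is a crude stand-in for the sub-string comparison of Step 5, and the 60% threshold is only loosely motivated by the values in Fig. 14(i).

    # Sketch of the self-assessment loop used in the case study: each tried
    # action yields an event chain that is compared against the learned model;
    # trials below a similarity threshold are rejected. The similarity measure
    # here is a simplified stand-in, not the paper's sub-string algorithm.

    def crude_similarity(model_rows, trial_rows):
        """Fraction (in percent) of model rows that also appear in the trial chain."""
        trial = set(trial_rows)
        hits = sum(1 for row in model_rows if row in trial)
        return 100.0 * hits / len(model_rows)

    def assess_trials(model_rows, trials, threshold=60.0):
        """Return the indices of trials that reproduce the learned model."""
        return [k for k, rows in enumerate(trials)
                if crude_similarity(model_rows, rows) >= threshold]

    # Toy data: the model has three characteristic rows; trial 0 is incomplete,
    # trial 1 reproduces the manipulation (plus one noise-induced row).
    model = [(0, 2, 0), (2, 2, 1), (2, 0, 1)]
    trials = [
        [(0, 2, 0)],
        [(0, 2, 0), (2, 2, 1), (2, 0, 1), (9, 0, 0)],
    ]
    print(assess_trials(model, trials))   # -> [1]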



Figure 13: Action sequence of (a) pushing a lid off a container and then (b) putting a ball inside, demonstrated by a human (blue frame). Different types of robot actions (red frame): (c) pushing a lid, (d) pushing a ball, (e) lifting a ball and dropping it on the table, (f) taking the ball from a container and putting it on the table. The green frame shows a robot action sequence similar to the one performed by the human, in which (g) a lid is first pushed off and then (h) a ball is placed inside a container.



Figure 14: Corresponding event chains of the human-demonstrated actions and the different types of robot actions. Labels (a-h) correspond to the manipulations shown in Fig. 13 (a-h). (a,b) Event chain models extracted from the human demonstration of (a) "pushing" and (b) "pick-and-place-inside". (c-f) Event chains corresponding to the wrong or incomplete actions in Fig. 13 (c-f, red frame). (g,h) Event chains corresponding to the correct sequence in Fig. 13 (g,h, green frame). (i) Similarity table between all actions the robot has tried (c-h) and the learned models (a,b) demonstrated by the human.



The sequence (g-h), in which both manipulations follow each other, is thus correctly recognized as the one that reproduces the complete learned model (a-b).

Thus, this (still rather simple) set of examples demonstrates that, by using SECs, learning and recognition of manipulations is possible for a robot. The main achievement, we believe, lies in the very high level of abstraction, which allows the machine to recognize the (in-)correctness of its actions even when the objects and their arrangements differ strongly between scenes.

5 Discussion

In this paper we have introduced a novel representation for manipulations, called the semantic event chain, which focuses on the relations between objects (including hands) in a scene. The representation generates column vectors in a matrix where every transition between neighboring vectors can be interpreted as an action rule, which defines which object relations have changed in the scene. Hence event chains reach a rather high level of abstraction, but at the same time they remain tightly linked to the images from which they originate, because they rely on continuously tracked segments. We have devised simple algorithms based on sub-string comparisons and counting procedures by which event chains can be compared and actions as well as segments (and segment groups) can be classified in an unsupervised way. No prior object models are required in this approach, and the learning of archetypal event chains ("models") relies only on the weight increase of repeating columns (repeating "rules"). Thus, learning also operates in an unsupervised way. The method was shown to be successful in classifying and recognizing videos showing different manipulations, and also in learning the archetypal SEC for a given action class. We further demonstrated the feasibility of the approach through experiments with a robotic arm. By observation, the machine extracted the SEC of a human manipulation and then reproduced the associated manipulation type in a new scenario via repeated experimentation.

To our knowledge this is the first approach towards manipulation recognition which reaches an almost symbolic level of abstraction¹ while being fully grounded in the signal domain. In the following we will discuss related approaches as well as problems and possible extensions of our algorithm.

5.1 Related approaches

Our framework introduces the semantic event chain, a novel representation that seems to hold some promise for extracting action semantics. It directly encodes the observed manipulations without hidden states. Event chains remain tightly linked to perception but are of semantic character. Thus, they are more invariant with respect to viewpoint changes and object features than most other approaches. This, however, only holds if the visual entities used, here image segments, carry sufficient meaning, i.e., they represent parts of objects. As a consequence, the semantic event chain is composed of action primitives, and no hidden model (e.g., a Hidden Markov Model) needs to be assumed, as required in other works to bridge the gap between signal and symbol (Ogawara et al., 2002; Raamana et al., 2007).

Similarities exist between our approach and the work by Sridhar et al. (2008), who analyzed manipulations in the context of a breakfast scenario and represented the whole image sequence by an activity graph which holds spatiotemporal object interactions. By using statistical generalization, event classes are extracted from the activity graphs. Object categories are learned by calculating the similarity between object roles at each event class. However, large activity graphs and the difficulty of finding exact graph isomorphisms are a major drawback of this method. Furthermore, unlike in our work, object knowledge was provided beforehand, thus lacking grounding in the signal domain.

Our approach is different from the one by Kjellstrom et al. (2008). In our method, event chains are already highly invariant with respect to viewpoint changes and object features because the relations between segments are of qualitative character (e.g., touching is already a semantic description), while in the work of Kjellstrom et al. (2008) semantics only emerge at the last stage of the modeling process.

¹ Note, the symbolic domain can be directly coupled to the event chains as these chains allow the parsing of meaningful rule-like commands such as: "In the next transition perform an action that assures that segment 17 touches segment 12 and that segment 3 overlaps with segment 5."
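As a small illustration of the footnote's point, a transition between two adjacent SEC columns can be verbalized as a rule-like command. The relation codes follow the paper, while the generated wording and all function names are our own.

    # Sketch of the footnote's idea: a transition between two adjacent SEC
    # columns can be verbalized as a rule-like command. Relation codes:
    # 0=no connection, 1=overlapping, 2=touching, 9=absent.

    RELATION_PHRASE = {
        2: "touches",
        1: "overlaps with",
        0: "is no longer connected to",
        9: "is absent relative to",
    }

    def column_transition_to_command(pairs, col_now, col_next):
        """pairs: list of (segment_a, segment_b) giving the row order;
        col_now/col_next: relation codes of two adjacent columns."""
        clauses = []
        for (a, b), r_now, r_next in zip(pairs, col_now, col_next):
            if r_now != r_next:
                clauses.append(f"segment {a} {RELATION_PHRASE[r_next]} segment {b}")
        if not clauses:
            return "No relational change is required in the next transition."
        return "In the next transition perform an action that assures that " \
               + " and that ".join(clauses) + "."

    pairs = [(17, 12), (3, 5), (4, 8)]
    print(column_transition_to_command(pairs, [0, 0, 2], [2, 1, 2]))
    # -> "... segment 17 touches segment 12 and that segment 3 overlaps with segment 5."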

In the present paper, we do not perform any object recognition in the classical sense (see Section 2.3). The image segments used for finding the SECs are below the object level, which means that a "true" object may be composed of several segments. Nevertheless, during the manipulation recognition procedure, image segments emerge naturally in conjunction with their associated action, providing a means to extract "action-relevant" objects from the scene by recognizing the respective actions in the SECs. This result is congruent with psychophysical evidence that humans recognize objects more easily if they are embedded in a consistent action context (Helbig et al., 2010). The approach has the advantage that it is highly invariant to the object's appearance and only takes into account the functionality of the object with respect to a given set of actions (see Fig. 8). However, the rich information provided by the object's appearance in the image is ignored, and thus the algorithm does not allow recognizing objects without providing any action context.

5.2 Features and Problems of the Algorithm

The realm of object and action recognition is exceedingly rich, carrying many facets, and so far no existing algorithm reaches the level of human proficiency. Adults can robustly classify objects and actions with a very high degree of invariance and generalization. To reach classification robustness in artificial systems, the application scenarios are usually restricted and models of objects and/or actions are introduced, which however limits the generalization properties of such systems. To improve on this, preferentially life-long model learning would be required, leading to the extension of existing models and to the acquisition of new ones across all scenarios. Little is known about how to do this.

In the introduction we had discussed that at least six properties define the requirements for a useful manipulation representation for artificial agents: (1) sensory signal based, (2) learnable, (3) relational, (4) time-sliced, (5) human-comprehensible, and (6) compatible with models.

We believe that the representation introduced here carries these aspects at least to some degree and that the approach also generalizes across scenarios, but we are aware that we are still far away from the final goal, which would be to reach human proficiency.

What are the drawbacks we are still facing, and which extensions need to be made in future work to improve on this?

We rely heavily on advanced computer vision methods and are aware that failures of the computer vision can harm our approach. Measures have been undertaken to reduce such failures (Dellen et al., 2009), but this was not the focus of this study.

It is evident that image segments do not correspond to objects and that our approach remains "beneath" the object level. While the approach is in this way strongly grounded in the image domain, we do not address complex aspects such as "permanence" and "feature binding", which are both required to assure the continuous existence of an object in the memory of a system.

In the case of heavily textured objects, feature binding based on color alone, as employed in the segmentation framework, will lead to a large number of segments, i.e., objects will be highly fragmented. This may cause problems for the tracking procedures (matching complexity) and thus will affect the quality of the SECs in an undesirable way. For textured images, additional mechanisms should be used at the segmentation stage to avoid over-segmentation, and/or fragmented segments could be merged using segment-based grouping criteria. We are currently investigating potential solutions to this problem.

However, probably the most important feature of the algorithms presented here is that they do not depend on image segments as their input. Any (!) continuously trackable entity, as long as it is sufficiently close to the semantic level (hence also object-model-based tracking), can be used to design an event chain. Thus, while we still think that image segments are in many ways useful, the event chain representation and its core algorithms are transferable to other inputs, too.



Furthermore, as mentioned above, our approach and the resulting event chains cannot directly be used to execute an action. For this, additional information about the movement trajectories of the hands as well as information on the required poses of the objects needs to be stored and used. It will be a topic of future research to introduce these aspects into the framework.

5.3 Affordances and Object-Action Complexes

The rule character of the event chains was used in this study to let an agent assess the success of its own actions. The experiments shown in Figs. 13 and 14 have demonstrated that an agent can first learn event chains and then connect them to (predefined) motion patterns by trying out how to reproduce an action sequence. In doing so, correlations between actions and outcomes (the resulting states) emerge, and the agent can produce triplets of [starting state, action, resulting state], where starting and resulting state are given by adjacent columns of the event chain; a small sketch of such triplet bookkeeping follows below. If the resulting state matches the one remembered from the learning phase, then the agent can store the complete triplet. This way, actions can be attached to starting states, taking the process forward from recognition to purposeful execution. This process also works in an unsupervised way, where repetition would allow the agent to consolidate a triplet, but it is also accessible to supervision, where a teacher would tell the agent which action to perform when.
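A minimal sketch of this triplet bookkeeping is given below: adjacent SEC columns supply the starting and resulting states, the motion pattern used in between is attached to them, and repetition consolidates the triplet. The class and its simple counter-based consolidation are illustrative assumptions, not the authors' implementation.

    # Sketch of the triplet bookkeeping described above: adjacent SEC columns
    # give a starting and a resulting state, and the motion pattern used in
    # between is attached to them. Repetition consolidates a triplet by
    # raising a simple counter; all names here are illustrative.

    from collections import defaultdict

    class TripletMemory:
        def __init__(self):
            # (starting_state, resulting_state) -> {action_name: confidence}
            self._store = defaultdict(lambda: defaultdict(int))

        def record(self, start, result, action):
            """Store/consolidate one [starting state, action, resulting state]."""
            self._store[(start, result)][action] += 1

        def suggest(self, start, desired_result):
            """Return the most consolidated action for a desired transition."""
            actions = self._store.get((start, desired_result))
            if not actions:
                return None
            return max(actions, key=actions.get)

    memory = TripletMemory()
    # States are SEC columns, written here as tuples of relation codes.
    memory.record(start=(0, 2), result=(2, 0), action="push")
    memory.record(start=(0, 2), result=(2, 0), action="push")      # repetition consolidates
    memory.record(start=(0, 2), result=(2, 0), action="pick_place")
    print(memory.suggest((0, 2), (2, 0)))   # -> "push"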

For ease of exposition, we will use the terms "thing" and "object" in this discussion, avoiding the clumsy terms segment-group, entity, etc.

As mentioned above, in our representation objects are always classified in the context of the performed manipulation. A thing, when being filled with liquid, could be grouped with fillable objects into a category which a human might call "containers". Equally well, the same thing (when turned around) would be grouped together with other platforms into a category "pedestals" as soon as someone puts something on top of it. This aspect strongly relates to Gibson's concept of affordances (Gibson, 1977), where objects suggest actions. This concept has been extended by a group of European researchers from the PACO-PLUS project (http://www.paco-plus.org/), stating that objects and actions are combined by humans into the cognitive concept of Object-Action Complexes (OACs) (Worgotter et al., 2009; Kruger et al., 2010). A physical "thing" gets its meaning through the actions performed with it and will consequently be interpreted, like the container/pedestal above, always in an action context.

Kruger et al. (2010) have given a formal definition of the Object-Action Complex. In an abbreviated way, it states that an OAC is a function that compares the obtained against the expected outcome when an action is executed on an object. The OAC thereby also measures the success of the action by assessing the sensor states of the agent. An OAC thus represents a model of the transformations that take place at an object following an action (Worgotter et al., 2009). OACs are learnable by the agent.

Adjacent columns of an event chain, together with the processes of action learning described above, are closely related to this concept. Thus, a complete event chain (together with its actions and objects) represents a chain of OACs and can be understood as a category which groups objects and actions into the cognitive concept of a manipulation. The set of algorithms suggested here therefore provides, as far as we know, the first entry point to a grounded, agent-learnable cognitive categorization process of rather high complexity. In addition, it provides a link to the symbolic, language domain because of its rule-like character.

As discussed above, many problems still exist, but we would hope that this research might open some new avenues into the difficult cognitive field of action understanding.

Acknowledgment

The research leading to these results has received funding from the European Community's Seventh Framework Programme FP7/2007-2013 - Challenge 2 - Cognitive Systems, Interaction, Robotics - under grant agreement No 247947 - GARNICS. We thank Tomas Kulvicius for valuable discussion. B.D. acknowledges support from the Spanish Ministry for Science and Innovation via a Ramon y Cajal Fellowship.

6 Appendix

6.1 Appendix 1 - Defining segment relations in a fast and efficient way

As defined in the main text, the possible spatial relations of each segment pair are Touching = 2, Overlapping = 1, No Connection = 0, and Absence = 9. The process of calculating those relations has two main steps. In the first step, the segmented image is scanned horizontally (from left to right) and vertically (from top to bottom) to extract the occurring segment sequences. Following the scanning process, for each observed segment sequence the number of lines (vertical and horizontal) in which it occurs is counted and stored in a list

$$L : (i_1, i_2, i_3, \ldots) \mapsto n_S,$$

where $n_S$ is the number of all vertical and horizontal lines showing the segment sequence $(i_1, i_2, i_3, \ldots)$.

Fig. 15 illustrates how the sequences between 5 different segments are calculated; e.g., the sequences (1) and (1, 2, 1, 3, 1) are observed 225 and 40 times, respectively. A small sketch of this scanning and counting step is given below.
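The following is a compact sketch of this scanning and counting step, assuming the segmented image is available as a 2D array of segment labels; the toy image and the function names are our own.

    # Sketch of the first step of Appendix 1: scan a segmented (label) image
    # horizontally and vertically, collapse runs of identical labels into a
    # segment sequence, and count how often each sequence occurs.

    from collections import Counter

    def collapse_runs(line):
        """Turn e.g. [1,1,5,5,5,1] into the segment sequence (1, 5, 1)."""
        seq = []
        for label in line:
            if not seq or seq[-1] != label:
                seq.append(label)
        return tuple(seq)

    def count_sequences(label_image):
        """Return L: sequence -> number of horizontal+vertical lines showing it."""
        counts = Counter()
        for row in label_image:                       # horizontal scan
            counts[collapse_runs(row)] += 1
        for col in zip(*label_image):                 # vertical scan
            counts[collapse_runs(col)] += 1
        return counts

    # Toy label image: segment 5 lies inside segment 1.
    image = [
        [1, 1, 1, 1, 1],
        [1, 5, 5, 5, 1],
        [1, 5, 5, 5, 1],
        [1, 1, 1, 1, 1],
    ]
    for seq, n in count_sequences(image).items():
        print(seq, n)   # (1,) and (1, 5, 1) with their line counts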

The second main step analyzes the extracted sequences to calculate the spatial relations between segment pairs. For this purpose, each sequence is iterated while considering the following rules:

• "Touching": Segments that follow one right after the other in a sequence are touching; e.g., segments 5 and 3 touch each other in sequences such as (..., 5, 3, ...) or (..., 3, 5, ...).

• "Overlapping": (i) If a segment is observed twice in a sequence, all segments in between are overlapped by it; e.g., in (..., 1, 5, 3, 1, ...) segments 5 and 3 are both overlapped (surrounded) by 1. (ii) However, the inner segments do not overlap with each other; e.g., in (..., 1, 5, 3, 1, ...) segment 5 does not overlap with 3 because 5 is not observed twice.

To each rule corresponds a counter of hints (either $C^t_{i,j}$ or $C^o_{i,j}$). For each segment pair, the counters store the number of hints indicating that the respective rule is fulfilled:

• $C^t_{i,j} \mapsto n_t$: number of hints that segments $i$ and $j$ are touching.

• $C^o_{i,j} \mapsto n_o$: number of hints that segment $i$ is overlapping with segment $j$.

Note that $C^t_{i,j} \equiv C^t_{j,i}$, since the Touching relation is undirected, whereas $C^o_{i,j}$ is not symmetric.

Each sequence $S$ is processed separately. Its elements are stored in a stack one after another. When the next element $i_n$ is stored, the first rule indicates that $i_n$ and the previous element $i_{n-1}$ have the Touching relation. Since the current sequence has been found multiple times in the image (given by $L(S)$), the touching entry $(i_{n-1}, i_n)$ is incremented by $L(S)$:

$$C^t_{i_{n-1},\, i_n} \mathrel{+}= L(S).$$

Example: The sequence $S := (1, 5, 3, 1)$ is analyzed by storing the first element $i_1 = 1$ in the stack. Since at least two elements are required before any rule can fire, the algorithm immediately proceeds to adding the next element $i_2 = 5$. The first rule indicates that the pair $(1, 5)$ has the Touching relation; as a result, $C^t_{1,5}$ is increased by $L(S) = 40$. The same operations are applied to the pair $(5, 3)$ in the next step.

To apply the second rule, the newly stored element is checked for whether it already occurs in the stack. If it does, all elements between its first occurrence $i_s$ and the current element $i_n$ are recognized as being overlapped by $i_n$. Therefore, the corresponding counters are updated as follows:

$$C^o_{i_n, j} \mathrel{+}= L(S), \quad \forall j \in \{i_{s+1}, \ldots, i_{n-1}\}.$$

Example: In the same sequence as in the previous example, the next element $i_4 = 1$ is added to the stack and $C^t_{1,3}$ is incremented by 40. Since $i_4$ occurred earlier ($i_s = i_1$), all elements in between, hence $i_2 = 5$ and $i_3 = 3$, are overlapped by 1, and $C^o_{1,5}$ and $C^o_{1,3}$ are increased by $L(S) = 40$.



Figure 15: Calculating the spatial segment relations between the background, two vessels, and two contents, which are represented by the segment numbers 1, 2, 3, 4, and 5, respectively.

The second rule also indicates that those inner elements $j$ do not overlap with each other; thus:

$$C^o_{j_n, j_m} \mathrel{-}= L(S), \quad \forall j_n, j_m \in \{i_{s+1}, \ldots, i_{n-1}\},\; n \neq m.$$

Example: Due to this rule, $C^o_{3,5}$ and $C^o_{5,3}$ are decreased by 40.

Next, the inner elements are removed from the stack. This is important in cases of recursive overlapping, so that Overlapping relations are obtained only between neighboring segments. In Fig. 15, the segment pairs (1, 2) and (2, 4) have the Overlapping relation, whereas (1, 4) has No Connection.

Example: For the sequence $S := (1, 2, 4, 2, 1)$, $i_4 = 2$ is added to the stack in the fourth step. Following the description given above, we compute $C^t_{2,4} \mathrel{+}= L(S)$ and $C^o_{2,4} \mathrel{+}= L(S)$. The elements $i_3$ and $i_4$ are then removed from the stack, which leads to $(1, 2)$. The algorithm continues by adding $i_5 = 1$ to the stack and by computing $C^t_{1,2} \mathrel{+}= L(S)$ and $C^o_{1,2} \mathrel{+}= L(S)$ as described above. In the end it is observed that the segment pairs $(2, 4)$ and $(1, 2)$ have the Overlapping relation, whereas $(1, 4)$ has No Connection.
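The stack-based procedure described above can be summarized in a short sketch. It follows our reading of the two rules and of the examples, and it is not the authors' implementation; its input is the sequence-count list L produced in the first step.

    # Sketch of the stack-based analysis described above: for every counted
    # sequence, consecutive labels hint at Touching, a repeated label hints at
    # Overlapping for the enclosed labels (which do not overlap each other and
    # are then removed from the stack).

    from collections import defaultdict

    def analyse_sequences(sequence_counts):
        """sequence_counts: dict mapping segment sequences to line counts L(S).
        Returns (touch, overlap) hint counters keyed by segment pairs."""
        touch = defaultdict(int)      # undirected: keyed by sorted pair
        overlap = defaultdict(int)    # directed: (outer, inner)
        for seq, weight in sequence_counts.items():
            stack = []
            for label in seq:
                if stack:                                   # rule 1: touching
                    touch[tuple(sorted((stack[-1], label)))] += weight
                if label in stack:                          # rule 2: overlapping
                    first = stack.index(label)
                    inner = stack[first + 1:]
                    for j in inner:
                        overlap[(label, j)] += weight
                    for a in inner:                         # inner pairs do not overlap
                        for b in inner:
                            if a != b:
                                overlap[(a, b)] -= weight
                    del stack[first + 1:]                   # drop the inner elements
                else:
                    stack.append(label)
        return touch, overlap

    touch, overlap = analyse_sequences({(1, 5, 3, 1): 40, (1, 2, 4, 2, 1): 30})
    print(dict(touch))     # touching hints, e.g. (1, 5): 40, (3, 5): 40, ...
    print(dict(overlap))   # overlapping hints, e.g. (1, 5): 40, (1, 3): 40, (2, 4): 30, ...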

Once all sequences have been iterated, the values in $C^t_{i,j}$ and $C^o_{i,j}$ are used to compute the final spatial relations of the segments. Note that some counter values might be wrong due to noisy segments. Instead of defining a minimum value as a static threshold, each entry is first normalized by the size of the corresponding segments:

$$C^t_{i,j} := \frac{C^t_{i,j}}{\min(N_i, N_j)},$$

where $N_i$ stores the pixel size of segment $i$. The normalization considers only the smaller segment, which makes the algorithm robust against noise and accurate for small segments. Note that $C^o_{i,j}$ is normalized in the same way. Each normalized entry $C^t_{i,j}$ and $C^o_{i,j}$ is then thresholded; unless $C^t_{i,j}$ or $C^o_{i,j}$ exceeds its threshold, the relation is set to No Connection. A small sketch of this final step is given below.

The main advantage of the proposed algorithm is that each step explained above can be calculated separately and hence can be parallelized.
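The following sketch illustrates this final normalization and thresholding step; the threshold values and the precedence given to Overlapping when both hints fire are our own illustrative choices, not values from the paper.

    # Sketch of the final step of Appendix 1: each hint counter is normalized
    # by the pixel size of the smaller segment of the pair and then compared
    # to a threshold; otherwise the pair is set to No Connection. The numeric
    # relation codes follow the paper; thresholds are illustrative.

    TOUCHING, OVERLAPPING, NO_CONNECTION = 2, 1, 0

    def final_relations(touch, overlap, sizes, t_touch=0.1, t_overlap=0.1):
        """touch/overlap: hint counters from the sequence analysis;
        sizes: pixel count per segment label."""
        relations = {}
        pairs = {tuple(sorted(p)) for p in list(touch) + list(overlap)}
        for a, b in pairs:
            norm = min(sizes[a], sizes[b])
            if overlap.get((a, b), 0) / norm > t_overlap or \
               overlap.get((b, a), 0) / norm > t_overlap:
                relations[(a, b)] = OVERLAPPING
            elif touch.get((a, b), 0) / norm > t_touch:
                relations[(a, b)] = TOUCHING
            else:
                relations[(a, b)] = NO_CONNECTION
        return relations

    # Toy example continuing the sketches above (sizes in pixels).
    touch = {(1, 5): 40, (3, 5): 40, (1, 3): 40}
    overlap = {(1, 5): 40, (1, 3): 40, (3, 5): -40, (5, 3): -40}
    sizes = {1: 2000, 3: 300, 5: 250}
    print(final_relations(touch, overlap, sizes))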

Extension  Type   Description
1          Video  Four Different Action Types
2          Video  Case Study

Table 3: Multimedia Extensions

Note that more complex 3D spatial segment relations (e.g., inside, above, under, etc.) relate directly to the overlapping and touching relations, as only a third dimension needs to be added. The following example makes this clear. Consider two 2D-Overlapping cases: "lying on top" (e.g., two flat objects) or "being inside" (one smaller object inside a container). Both are identical in 2D in the sense of being an Overlapping relation, but by adding 3D one could define the new relations "on-top" and "inside", as sketched below.
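As a toy illustration of this extension, a depth cue could split a 2D Overlapping relation into the two 3D relations mentioned above; the specific depth criterion used here is our own assumption.

    # Tiny sketch of the 3D extension mentioned above: a 2D Overlapping
    # relation can be split into "on-top" and "inside" once depth is available.
    # The criterion (inner segment deeper than the outer rim vs. resting on
    # top of it) is an illustrative choice.

    def refine_overlap(outer_rim_depth, inner_mean_depth, tolerance=0.01):
        """Depths in metres from the camera; larger means farther away."""
        if inner_mean_depth > outer_rim_depth + tolerance:
            return "inside"      # the surrounded segment sits below the rim
        return "on-top"          # the surrounded segment rests on the surface

    print(refine_overlap(outer_rim_depth=0.80, inner_mean_depth=0.86))  # inside
    print(refine_overlap(outer_rim_depth=0.80, inner_mean_depth=0.79))  # on-top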

6.2 Appendix 2 - Index of Multimedia Extensions

The multimedia extensions (see Table 3) to this article are at: http://www.ijrr.org.

References

Abramov, A., Aksoy, E. E., Dorr, J., Pauwels, K., Worgotter, F., and Dellen, B. (2010). 3D semantic representation of actions from efficient stereo-image-sequence segmentation on GPUs. In 3DPVT.

Aksoy, E. E., Abramov, A., Worgotter, F., and Dellen, B. (2010). Categorizing object-action relations from semantic scene graphs. In IEEE International Conference on Robotics and Automation (ICRA 2010), Alaska, USA.

Belhumeur, P. N. and Kriegman, D. J. (1996). What is the set of images of an object under all possible lighting conditions. IEEE CVPR, pages 270–277.

Breazeal, C. and Scassellati, B. (2002). Robots that imitate humans. Trends Cogn. Sci. (Regul. Ed.), 6:481–487.

Calinon, S. and Billard, A. (2004). Stochastic gesture production and recognition model for a humanoid robot. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), volume 3, pages 2769–2774.

Calinon, S. and Billard, A. (2005). Recognition and reproduction of gestures using a probabilistic framework combining PCA, ICA and HMM. In Proceedings of the International Conference on Machine Learning (ICML), pages 105–112.

Calinon, S. and Billard, A. (2007). Incremental learning of gestures by imitation in a humanoid robot. In HRI '07: Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction, pages 255–262, New York, NY, USA. ACM.

Pelleg, D. and Moore, A. (2000). X-means: Extending k-means with efficient estimation of the number of clusters. In Proceedings of the Seventeenth International Conference on Machine Learning, pages 727–734, San Francisco. Morgan Kaufmann.

Dee, H., Hogg, D., and Cohn, A. (2009). Scene modelling and classification using learned spatial relations. In Proc. Spatial Information Theory, volume 5756, pages 295–311. Springer N.Y.

Dellen, B., Aksoy, E. E., and Worgotter, F. (2009). Segment tracking via a spatiotemporal linking process in an n-d lattice model. Sensors, 9(11):9355–9379.

Dellen, B. and Worgotter, F. (2009). Disparity from stereo-segment silhouettes of weakly textured images. In Proceedings of the British Machine Vision Conference.

Fergus, R., Perona, P., and Zisserman, A. (2003). Object class recognition by unsupervised scale-invariant learning. In CVPR, pages 264–271.

Gibson, J. (1977). The theory of affordances. In Perceiving, Acting, and Knowing, Eds. Robert Shaw and John Bransford.

Gilbert, A., Illingworth, J., and Bowden, R. (2009). Action recognition using mined hierarchical compound features. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Hakeem, A. and Shah, M. (2005). Multiple agent event detection and representation in videos. In AAAI.

Harnad, S. (1990). The symbol grounding problem. Physica D, 42:335–346.



Helbig, H. B., Steinwender, J., Graf, M., and Kiefer, M. (2010). Action observation can prime visual object recognition. Experimental Brain Research, 200(3-4):251–258.

Hoiem, D., Efros, A. A., and Hebert, M. (2008). Putting objects in perspective. Int. J. Comput. Vision, 80(1):3–15.

Hongeng, S. (2004). Unsupervised learning of multi-object event classes. In Proc. 15th British Machine Vision Conference, pages 487–496.

Ijspeert, A. J., Nakanishi, J., and Schaal, S. (2002). Movement imitation with nonlinear dynamical systems in humanoid robots. In Proc. IEEE Int. Conf. Robotics and Automation, pages 1398–1403.

Kjellstrom, H., Romero, J., and Kragic, D. (2008). Simultaneous visual recognition of manipulation actions and manipulated objects. In European Conference on Computer Vision.

Kruger, N., Piater, J., Geib, C., Petrick, R., Steedman, M., Worgotter, F., Ude, A., Asfour, T., Kraft, D., Omrcen, D., Agostini, A., and Dillmann, R. (2010). Object-action complexes: Grounded abstractions of sensorimotor processes (in revision). Robotics and Autonomous Systems.

Laptev, I. and Perez, P. (2007). Retrieving actions in movies. In ICCV.

Liao, L., Fox, D., and Kautz, H. (2005). Location-based activity recognition using relational Markov networks. In Proceedings of the 19th International Joint Conference on Artificial Intelligence, pages 773–778.

Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60:91–110.

Maurer, A., Hersch, M., and Billard, A. (2005). Extended Hopfield network for sequence learning: Application to gesture recognition. In Proceedings of ICANN'05.

McCarthy, J. and Hayes, P. (1969). Some philosophical problems from the standpoint of artificial intelligence. Machine Intelligence, pages 195–204.

Modayil, J., Bai, T., and Kautz, H. (2008). Improving the recognition of interleaved activities. In Proceedings of the 10th International Conference on Ubiquitous Computing, pages 40–43.

Mundy, J. and Zisserman, A. (1992). Geometric Invariance in Computer Vision. MIT Press.

Mundy, J. L. (2006). Object recognition in the geometric era: A retrospective. In Toward Category-Level Object Recognition, volume 4170 of Lecture Notes in Computer Science, pages 3–29. Springer.

Murase, H. and Nayar, S. K. (1995). Visual learning and recognition of 3-d objects from appearance. Int. J. Comput. Vision, 14(1):5–24.

Niebles, J., Wang, H., and Fei-Fei, L. (2008). Unsupervised learning of human action categories using spatial-temporal words. International Journal of Computer Vision, 79(3):299–318.

Ning, K., Kulvicius, T., Tamosiunaite, M., and Worgotter, F. (2010). A novel trajectory generator based on dynamic motor primitives. IEEE Transactions on Robotics (submitted).

Nister, D. and Stewenius, H. (2006). Scalable recognition with a vocabulary tree. In CVPR, pages 2161–2168.

Ogawara, K., Takamatsu, J., Kimura, H., and Katsushi, I. (2002). Modeling manipulation interactions by hidden Markov models. In IEEE/RSJ International Conference on Intelligent Robots and Systems.

Oliva, A. and Torralba, A. (2009). The role of context in object recognition. Trends in Cognitive Sciences, 11(12):520–526.

Pauwels, K. and Van Hulle, M. (2008). Real-time phase-based optical flow on the GPU. In IEEE Conference on Computer Vision and Pattern Recognition, Workshop on Computer Vision on the GPU, Anchorage, Alaska.



Raamana, P. R., Grest, D., and Krueger, V. (2007). Human action recognition in table-top scenarios: an HMM-based analysis to optimize the performance. In CAIP'07: Proceedings of the 12th International Conference on Computer Analysis of Images and Patterns, pages 101–108, Berlin, Heidelberg. Springer-Verlag.

Rizzolatti, G. and Craighero, L. (2004). The mirror-neuron system. Annual Review of Neuroscience, 27:169–192.

Sabatini, S., Gastaldi, G., Solari, F., Diaz, J., Ros, E., Pauwels, K., Van Hulle, M., Pugeault, N., and Kruger, N. (2007). Compact and accurate early vision processing in the harmonic space. In International Conference on Computer Vision Theory and Applications, pages 213–220, Barcelona.

Shylo, N., Worgotter, F., and Dellen, B. (2009). Ascertaining relevant changes in visual data by interfacing AI reasoning and low-level visual information via temporally stable image segments. In Proceedings of the International Conference on Cognitive Systems (CogSys 2008).

Sivic, J. and Zisserman, A. (2003). Video Google: A text retrieval approach to object matching in videos. In ICCV '03: Proceedings of the Ninth IEEE International Conference on Computer Vision, page 1470, Washington, DC, USA. IEEE Computer Society.

Sridhar, M., Cohn, A. G., and Hogg, D. (2008). Learning functional object-categories from a relational spatio-temporal representation. In Proc. 18th European Conference on Artificial Intelligence, pages 606–610.

Sumsi, M. F. (2008). Theory and Algorithms on the Median Graph. Application to Graph-based Classification and Clustering. PhD thesis, Universitat Autonoma de Barcelona.

Thorndike, E. (1911). Animal Intelligence. Macmillan, New York.

Torralba, A. (2003). Modeling global scene factors in attention. JOSA A, 20:1407–1418.

Turk, M. and Pentland, A. (1991). Eigenfaces for recognition. J. Cognitive Neuroscience, 3(1):71–86.

Vicente, I., Kyrki, V., and Kragic, D. (2007). Action recognition and understanding through motor primitives. Advanced Robotics, 21(15):1687–1707.

Worgotter, F., Agostini, A., Kruger, N., Shylo, N., and Porr, B. (2009). Cognitive agents - a procedural perspective relying on predictability of object-action complexes (OACs). Robotics and Autonomous Systems, 57(4):420–432.
